Sun has just signed a global reseller agreement with Diligent for its virtual tape library and de-duplication software products. Hitachi Data Systems also has a reselling agreement with the company. What are the unique aspects of Diligent's technology that have led leading storage system vendors to choose it over others?

Techworld asked Diligent's chief technical officer, Neville Yates, some questions in order to find out more about Diligent's de-duplication product and strategy.

TW: What are the strengths and weaknesses of different approaches to de-duplication such as hash-based and content-aware?

Neville Yates: Hash-based storage is useful for storing archival, immutable data, as it is difficult to change the data without the corresponding hash value changing. Hash-based storage has been extended from its original intent to also be used for data reduction, since it stores copies of the same data only once.

Products that use the hash-based approach are limited in performance and scalability. Beyond those limitations, hash-based de-duplication entails a significant risk to data integrity. De-duplication using a hashing scheme requires that chunks of data be represented by a hash (sometimes called a key or a signature). The data is stored on the storage medium and the hash is recorded in some location.

When new data is received, the process of creating the hashes is repeated and the resulting hashes are compared with the existing hashes to look for a match. If a match is found, the data the original hash was created from is assumed to be the same as the data the new hash was created from, allowing the scheme to discard the new data in favor of referencing the data that already exists.
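To make the mechanism Yates describes concrete, here is a minimal Python sketch of hash-based chunk de-duplication with an in-memory index; the fixed chunk size, SHA-256 digest and dictionary index are illustrative assumptions rather than a description of any particular vendor's product.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # illustrative fixed chunk size; real products vary

index = {}   # hash -> position of the stored chunk; at scale this spills to disk
store = []   # stands in for the disk repository

def ingest(stream: bytes) -> list:
    """De-duplicate a byte stream chunk by chunk, returning chunk references."""
    refs = []
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in index:
            store.append(chunk)            # new data: store it once
            index[digest] = len(store) - 1
        # A matching hash is assumed to mean matching data, so the new copy is discarded.
        refs.append(index[digest])
    return refs
```

The step that treats a matching hash as matching data is where the data-integrity concern comes from: a hash collision would silently map different data onto the same stored chunk. And at scale the hash index itself outgrows memory, which is the performance and scalability limit described above.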

The bottom line is that this approach does not scale. While hash-based data coalescence does enable more efficient use of storage, it does so at the cost of throughput, performance and data integrity. Because of these limitations, such technologies do not address the demands of the enterprise data center.

TW: And the content-aware approach?

Neville Yates: This approach suffers from many of the same performance issues as the hash-based approach. The content-aware approach needs to remember where previous versions of files are located within the disk repository. While it does not need to remember a hash for a particular chunk of data, tracking the locations of hundreds of millions of file names quickly rules out keeping this inventory in a memory-resident index. As a result, this scheme also falls back on a disk-based index, and performance is hurt by the same factor as with hash schemes. Furthermore, unlike hash schemes, a byte-by-byte comparison of the data/files is required, placing significantly more burden on the disk I/O payload.
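As a rough illustration of the extra disk I/O Yates points to, this hypothetical sketch compares an incoming file byte by byte against the previously stored version it has been matched to; the location index and read size are assumptions made for the example.

```python
BLOCK = 64 * 1024  # illustrative read size

# file name -> path of the prior version in the disk repository; with hundreds
# of millions of files this index itself has to live on disk, not in memory
location_index = {}

def is_duplicate(name: str, new_path: str) -> bool:
    """Byte-by-byte comparison of a new file against its stored prior version."""
    old_path = location_index.get(name)
    if old_path is None:
        return False          # no prior version, so nothing to de-duplicate against
    with open(old_path, "rb") as old, open(new_path, "rb") as new:
        while True:
            a, b = old.read(BLOCK), new.read(BLOCK)
            if a != b:
                return False  # contents diverge; the new data must be kept
            if not a:
                return True   # both files exhausted with no difference found
```

Every candidate file requires reading both copies in full, which is the additional burden on the disk I/O payload over and above maintaining the disk-based index.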

The content-aware approach needs to use a 'post-process' approach. To mitigate the performance impact, it separates the ingestion of data into the disk repository from a later pass that de-duplicates the data. The limitations this places on an enterprise customer's data protection processes are obvious. A staging area must be provisioned on the disk repository so that the complete backup job can be ingested before the de-duplication process is initiated. Note that the use of a staging area adds yet more load to the I/O subsystem: roughly an additional two times the I/O of the original backup stream. An approach that tries to use 'off hours' as the period for de-duplication must consider that 'off hours' does not start when the primary backup workload finishes; it starts only after the business is protected, which includes on-site and off-site vaulting.
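As a back-of-envelope illustration of the 'additional two times the I/O' figure, the hypothetical accounting below compares a staged post-process run with an inline run; the backup size and de-duplication ratio are assumptions, not numbers from the interview.

```python
# Illustrative I/O accounting for post-process (staged) versus inline de-duplication.
backup_tb = 10.0    # size of the nightly backup stream (assumed)
dedup_ratio = 10.0  # reduction achieved by de-duplication (assumed)

# Inline: the repository only ever sees the reduced data.
inline_io = backup_tb / dedup_ratio

# Post-process: the full stream is written to the staging area, read back for
# de-duplication, and then the reduced data is written to the repository.
post_io = backup_tb + backup_tb + backup_tb / dedup_ratio

extra = post_io - inline_io
print(f"Post-process adds {extra:.0f} TB of I/O, roughly {extra / backup_tb:.0f}x the backup stream")
```

Under these assumptions, the staging write and the subsequent read-back account for the two extra passes over the data, and that work has to finish before off-site vaulting can complete.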

In summary, a content-aware approach to data de-duplication is plagued by trying to strike a balance between three dimensions:

1. performance

2. capacity scalability

3. de-duplication ratio

It also faces a continual challenge resulting from the requirement to understand the logical content of the data, which is a moving target with existing applications and an unknown with new applications.