Sun has just signed a global reseller agreement with Diligent for its virtual tape library and de-duplication software products. Hitachi Data Systems also has a reselling agreement with the company. What are the unique aspects of Diligent's technology that have led leading storage system vendors to choose it over others?

Techworld asked Diligent's chief technical officer, Neville Yates, some questions in order to find out more about Diligent's de-duplication product and strategy.

TW: What are the strengths and weaknesses of different approaches to de-duplication such as hash-based and content-aware?

Neville Yates: Hash-based storage is useful for storing archival data, immutable data, as it is difficult to change the data without the corresponding hash value changing. Hash-based storage has been extended from its original intent to also be used for data reduction, as it will store copies of the same data only once.

Products that utilize the hash-based approach are limited in performance and scalability. Beyond those limitations, hash-based de-duplication entails a significant risk to data integrity. The process for de-duplication using a hashing scheme requires that chunks of data be represented by a hash (sometimes called a key or a signature). Data is stored on a storage medium and the hash is remembered in some location.

When new data is received, the process of creating the hashes is repeated and the resulting hashes are compared with existing hashes in order to look for a match. If a match is found, it is assumed that the data the original hash was created from is the same as the data the new hash was created from, which allows the scheme to discard the new data in favor of referencing the already existing data.
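The store-and-compare cycle Yates describes can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the fixed 4-byte chunk size, the choice of SHA-256, and the `HashDedupStore` class with its method names are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4  # assumption: tiny fixed-size chunks, chosen only to keep the example readable


class HashDedupStore:
    """Hypothetical hash-based de-duplicating store (illustrative only)."""

    def __init__(self) -> None:
        self.index: Dict[bytes, int] = {}  # hash -> position of the stored chunk
        self.chunks: List[bytes] = []      # unique chunk payloads ("the storage medium")

    def write(self, data: bytes) -> List[int]:
        """Store data and return one chunk reference per chunk of input."""
        refs = []
        for off in range(0, len(data), CHUNK_SIZE):
            chunk = data[off:off + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).digest()
            pos = self.index.get(digest)
            if pos is None:
                # No matching hash: keep the new chunk and remember its hash.
                pos = len(self.chunks)
                self.chunks.append(chunk)
                self.index[digest] = pos
            # On a match the new chunk is discarded without looking at the
            # stored bytes: equal hashes are assumed to mean equal data.
            refs.append(pos)
        return refs

    def read(self, refs: List[int]) -> bytes:
        """Rebuild the original data from its chunk references."""
        return b"".join(self.chunks[r] for r in refs)
```

Writing `b"AAAABBBBAAAA"` to an empty store returns the references `[0, 1, 0]` and keeps only two chunks, since the third chunk hashes to the same value as the first. The integrity concern raised in the interview corresponds to the branch where a match is found: the stored bytes are never compared with the incoming ones.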

The bottom line is that this approach does not scale. While hash-based data coalescence does enable more efficient use of storage, it does so at the cost of throughput, performance and data integrity. Because of these limitations, such technologies do not address the demands of the enterprise data center.

TW: And the content-aware approach?

Neville Yates: This approach suffers from many of the same performance issues as the hash-based approach. The content-aware approach needs to remember where previous versions of files are located within the disk repository. While it does not need to remember a hash for a particular chunk of data, keeping track of the location of hundreds of millions of file names quickly eliminates the opportunity to keep this inventory in a memory-resident index. As a result, this scheme also defaults to a disk-based index, and performance is negatively impacted by the same factor as with hash schemes. Furthermore, unlike hash schemes, a byte-by-byte comparison of the data or files is required, placing significantly more burden on the disk I/O payload.

The content-aware approach needs to use a 'post-process approach'. To attempt to handle the performance impact, this approach separates the process of ingesting data into the disk repository from a post-process of de-duplicating the data. The limitations of this approach on the enterprise customer's data-protection processes are obvious. It requires a staging area to be provisioned on the disk repository, allowing the complete backup job to be ingested before the de-duplication process is initiated. Note that the use of a staging area adds yet more load to the I/O subsystem, adding an additional two times the I/O of the original backup stream. An approach that attempts to use 'off hours' as a period of time to de-duplicate must consider that 'off hours' does not start after the primary backup workload has finished. 'Off hours' starts only after the business is protected, which includes on-site and off-site vaulting.
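The staging-area overhead quoted above can be put into numbers with a short calculation. This is a rough sketch that takes the "additional two times" figure at face value; the function name, the terabyte units, and the decision to ignore any I/O for writing the de-duplicated result are assumptions made for the example.

```python
def post_process_io_tb(backup_tb: int) -> dict:
    """Estimate I/O for a staged, post-process de-duplication run.

    Assumption: the only figures used are those quoted in the interview,
    i.e. the original backup stream plus an additional two times that
    stream caused by the staging area.
    """
    ingest = backup_tb             # the original backup stream
    staging_extra = 2 * backup_tb  # "an additional two times the I/O"
    return {
        "ingest_tb": ingest,
        "staging_extra_tb": staging_extra,
        "total_tb": ingest + staging_extra,
    }
```

Under these assumptions, a 10TB nightly backup results in 30TB of I/O on the repository: `post_process_io_tb(10)["total_tb"]` evaluates to `30`.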

In summary, a content-aware-based approach for data de-duplication is plagued with trying to find a balance between three dimensions:

1. performance

2. capacity scalability

3. de-duplication ratio

It also has additional continual challenges resulting from the requirement to understand the logical content of the data. This is a fluid entity with existing applications and an unknown with new applications.

TW: How does Diligent's HyperFactor approach compare and contrast to hash-based and content-aware-based approaches?

Neville Yates: Unlike hash-based and content-aware-based approaches, HyperFactor can scale without a negative impact on performance. Hash-based and content-aware-based approaches to data de-duplication are plagued with inherent, inescapable trade-offs among the following four critical criteria that enterprise customers demand:

1. inline performance

2. capacity scalability

3. de-duplication ratio

4. 100 percent data integrity

Preferring one of the above dimensions over the others has a dramatic negative impact. HyperFactor has overcome these challenges by architecting a new approach that revolves around a small, memory-resident index combined with byte-level comparison, thereby achieving performance, scalability, high de-duplication rates and 100 percent data integrity.
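The interview does not describe how HyperFactor works internally, so the following is not Diligent's algorithm. It is only a generic sketch of the idea named above: an in-memory index is used to find candidate matches, and a candidate is accepted only after its bytes have been compared with the incoming data. The `VerifiedDedupStore` class, the use of CRC32 as a cheap index hint, and the 4-byte chunk size are all assumptions made for the example.

```python
import zlib
from typing import Callable, Dict, List, Optional

CHUNK_SIZE = 4  # assumption: tiny fixed-size chunks for readability


class VerifiedDedupStore:
    """Hypothetical store: in-memory index lookup plus byte-level verification."""

    def __init__(self, hint: Callable[[bytes], int] = zlib.crc32) -> None:
        self.hint = hint                       # cheap function that only suggests candidates
        self.index: Dict[int, List[int]] = {}  # hint value -> candidate chunk positions
        self.chunks: List[bytes] = []          # unique chunk payloads

    def _find(self, chunk: bytes) -> Optional[int]:
        """Return the position of a stored chunk with identical bytes, if any."""
        for pos in self.index.get(self.hint(chunk), []):
            if self.chunks[pos] == chunk:  # byte-level comparison decides, not the hint
                return pos
        return None

    def write(self, data: bytes) -> List[int]:
        """Store data and return one chunk reference per chunk of input."""
        refs = []
        for off in range(0, len(data), CHUNK_SIZE):
            chunk = data[off:off + CHUNK_SIZE]
            pos = self._find(chunk)
            if pos is None:
                pos = len(self.chunks)
                self.chunks.append(chunk)
                self.index.setdefault(self.hint(chunk), []).append(pos)
            refs.append(pos)
        return refs

    def read(self, refs: List[int]) -> bytes:
        """Rebuild the original data from its chunk references."""
        return b"".join(self.chunks[r] for r in refs)
```

Because the final decision rests on the byte comparison, a collision in the index costs an extra comparison but cannot cause two different chunks to be treated as the same. Even with a deliberately useless hint such as `VerifiedDedupStore(hint=lambda chunk: 0)`, in which every chunk collides, data written to the store reads back unchanged.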

TW: What are the operational issues involved in inline de-duplication and post-processing de-duplication?

Neville Yates: With in-line de-duplication, once the data hits disk the job is complete. All that follows is maintenance, such as repository maintenance, which can be scheduled and designed to be pre-emptive, resulting in no interference with other business work that might be required.

Any other processing on the data, such as indexing of the logical content, may involve more overhead than the dual-hop/no-hop approach for two reasons: 1) in the dual-hop approach, the data to be indexed is most likely available at the highest of sequential read speeds (as a side note, this is also true as a post-process for the no-hop approach with forward referencing, an approach not reasonable in an in-line implementation); 2) it is not reasonable to accomplish de-dupe and content indexing in-line without a performance penalty.

In general, post-processing the de-dupe function can facilitate a level of parallelism; doing other work at the same time, such as indexing the logical content of the data, may improve efficiency. This will elongate the time to complete the post-process; in some cases this is tolerable and in some cases it is not. One case where it is tolerable is in small environments where the maintenance window is sufficiently small, allowing for an elongated post-process.

However, in many cases the elongation of time to complete factoring is not acceptable. Post-processing fails when other processes need to occur within the domain of the backup application, such as vaulting, which demands resources of the VTL (virtual tape library) that conflict with the post-process de-duplication.

The bottom line is that the VTL is a slave to the backup app and does not have the luxury of scheduling events at its convenience. For example, a typical customer runs the following sequence of events: step 1, primary backups; step 2, off-site copies; step 3, resource management in preparation for the next cycle (TSM defrag-type activity or the old-fashioned pull-list scheme supporting scratch management).

Steps 2 and 3 in this case will contend with the de-dupe operation in a post-process approach. Contention for resources will exist if the VTL does not control when events happen, and this negatively impacts the whole backup window.

TW: Could you segment the market into areas where different approaches fit best please? Where are the sweet spots for individual suppliers' products and technologies?

In answer to this, Yates used a PowerPoint slide. Summarised, it divided the potential market for de-dupe products into two: a mid-range one for branch offices with 25-75 MB/sec throughput and 3-10TB of data; and an enterprise or high-end one featuring data centres with 200-400 MB/sec throughput and 10-200TB or more stored per node.

In-line de-dupe for the mid-range is supplied by Data Domain; for the high-end it is Diligent. Post-processing de-dupe for the mid-range comes from EMC, NetApp, FalconStor, Sepaton and Quantum. There is no post-processing de-dupe supplier for the high-end.

Yates identifies two suppliers of in-line de-duplication for data on mid-range production servers: Avamar (EMC); and Symantec with PureDisk. There are no suppliers of a high-end product.

The development of de-duplication technology and the implementation of products are both still in the early phase. The potential results in terms of effectively increasing disk space data capacity by factors of twenty or more are so attractive that developers of the technology and suppliers embodying it in their product lines face a period of frenetic development, claim and counter-claim about the benefits of individual approaches.

Diligent has two high-profile backers in the form of HDS and Sun. HP has a Sepaton arrangement. IBM has a reselling arrangement with FalconStor. NetApp has its ASIS technology and EMC has bought Avamar. As customer implementations develop we will be able to get real-life data on the benefits of the different approaches and see how they compare in practice.