Techworld had the opportunity to discuss de-duplication technology and some other issues with Miki Sandorfi, SEPATON's chief technology officer. He explained how SEPATON's content-aware approach differed from the hash-based approaches used by Avamar and FalconStor.

TW: Could you compare and contrast the sub-file de-duplication technologies of Avamar, FalconStor and SEPATON, please?

MIKI SANDORFI: The hash-based data de-duplication approach (used by Avamar and FalconStor) is typically used with in-band de-duplication solutions. This model runs incoming data through a hashing algorithm (typically MD5), which produces an identifier that is assumed to be unique to that piece of data. It then compares that hash to previous hashes stored in a lookup table. If a match is found, the data is discarded and a pointer to the existing copy is stored. If it is not found, the hash is added to the lookup table and the data is written to disk.
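
(For illustration, here is a minimal Python sketch of the lookup-table mechanics Sandorfi describes. The chunk size, the in-memory dict standing in for the lookup table, and the function names are illustrative assumptions, not any vendor's implementation.)

```python
import hashlib

CHUNK_SIZE = 8 * 1024   # assumed fixed chunk size; real products vary

hash_table = {}     # digest -> index of the stored chunk
stored_chunks = []  # stands in for the backing disk store

def ingest(data: bytes) -> list:
    """Return pointers (indices into stored_chunks) that reconstruct data."""
    pointers = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.md5(chunk).hexdigest()  # MD5, as in the interview
        if digest not in hash_table:
            # New chunk: record its hash and store the data.
            hash_table[digest] = len(stored_chunks)
            stored_chunks.append(chunk)
        # Duplicate or new, the backup image keeps only a pointer.
        pointers.append(hash_table[digest])
    return pointers
```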

(FalconStor disputes much of Sandorfi's views above, and below, on its products. See its statement here.)

(EMC disputes much of Sandorfi's views above, and below, on its products. See its statement here.)

The idea is that the lookup table will be populated with many hashes from the backup data, making it more efficient over time. Therefore, the best de-duplication ratios will not be achieved until the hash table is populated. There are other challenges with this approach that are summarized below.

Hash-based data de-duplication requires substantial CPU performance because all of the hashes are generated by the CPU in real time. The more granular the hash (i.e., the smaller each piece of data being hashed), the more CPU-intensive and slow the process becomes.

Another drawback relates to the size of the hash table and where it resides. Storing the hash table on disk degrades performance further, while storing it in memory increases performance but constrains the table size (and thus the amount of data protected) by the amount of memory in the system. Hash collisions, and the data integrity issues that follow from them, are possible. Maximum backup set size per appliance is also limited, and technologies using this method cannot de-duplicate data across appliances.
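
(A back-of-the-envelope calculation, with assumed rather than vendor-published numbers, shows why keeping the table in memory constrains capacity.)

```python
# Rough sizing arithmetic; all numbers are illustrative assumptions.
chunk_size = 8 * 1024    # assumed average chunk size: 8 KB
entry_size = 16 + 8      # 16-byte MD5 digest + 8-byte disk pointer
unique_data = 2 ** 40    # 1 TiB of unique chunks

entries = unique_data // chunk_size
table_bytes = entries * entry_size
print(f"{entries:,} entries -> {table_bytes / 2**30:.1f} GiB of RAM per TiB")
# ~3.0 GiB of table per TiB of unique data; halving the chunk size
# doubles the table, so finer-grained hashing tightens the memory limit.
```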

FalconStor has taken a mixed approach with its SIR technology. It uses a content-aware-style approach to identify common objects and then uses hashing to find the redundancies. Because hashing is fundamental to finding redundancies, the algorithm is classified as hash-based.

The content-aware approach (used by SEPATON) is entirely different. It performs de-duplication out of band: data is backed up to the VTL first, and once a backup set has completed, the de-duplication process begins. This allows for unimpeded backup performance, since de-duplication is not being performed on incoming data.
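
(Schematically, the two-phase flow he describes could look like the Python sketch below; the queue, the worker thread and the placeholder routine are stand-ins, not SEPATON's actual architecture.)

```python
import queue
import threading

backup_queue = queue.Queue()  # completed backup sets awaiting de-duplication

def deduplicate(backup_set: bytes) -> None:
    pass  # placeholder: the content-aware comparison would run here

def backup(stream: bytes) -> None:
    # Phase 1, in the data path: land the backup set on the VTL's disk
    # at full ingest speed; no de-duplication happens here.
    completed_set = stream  # stands in for the finished virtual tape image
    backup_queue.put(completed_set)

def dedup_worker() -> None:
    # Phase 2, out of band: process completed sets in the background,
    # after the backup window, so ingest throughput is never slowed.
    while True:
        backup_set = backup_queue.get()
        deduplicate(backup_set)
        backup_queue.task_done()

threading.Thread(target=dedup_worker, daemon=True).start()
```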

The other key element of the content-aware approach is that it uses a higher level of abstraction when analyzing backup data. Unlike the previous two approaches, content-aware de-duplication looks at data as objects. Where hashing or byte-level comparisons try to find redundancies in raw byte streams, content-aware de-duplication compares objects to other objects of the same kind (e.g., Word document to Word document, or Oracle database to Oracle database).
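
(A toy sketch of that object-to-object comparison, assuming the backup stream has already been parsed into typed objects; the parsing itself, which is the hard part of a content-aware system, is out of scope here.)

```python
from collections import defaultdict

def content_aware_dedup(objects):
    """objects: iterable of (object_type, payload) pairs, e.g.
    ("word_doc", b"...") -- the typed parsing is assumed already done."""
    by_type = defaultdict(list)  # indices of unique objects, per type
    unique, pointers = [], []

    for obj_type, payload in objects:
        match = None
        # Compare like with like: Word document to Word document, etc.
        for idx in by_type[obj_type]:
            # A direct byte-level comparison rather than a hash: a match
            # here is a guaranteed duplicate, with no collision risk.
            if unique[idx] == payload:
                match = idx
                break
        if match is None:
            match = len(unique)
            unique.append(payload)
            by_type[obj_type].append(match)
        pointers.append(match)
    return unique, pointers
```

(A real system would use object metadata to narrow the candidates rather than scanning linearly; the point here is only that the final redundancy test is a direct comparison, not a hash match.)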

This approach results in a modest increase in disk space requirements, but provides better performance and de-duplication ratios than the other solutions. The incremental disk it requires is minimal and negligible in the overall solution cost.

TW: You've mentioned several differentiating topics. Could you discuss data integrity in more detail, please?

MIKI SANDORFI: Since data de-duplication is modifying data stored on the backup system, it is vital that data integrity be guaranteed at all times. Given the many pointers involved, a data integrity issue can potentially have a cascading, negative impact on many backups.

The problem with hash-based algorithms is that they depend on the hash being a unique identifier for each given chunk of data. If that assumption fails, the system will silently corrupt data: the de-duplication algorithm mistakenly discards non-redundant data. This error will not be found during the backup; it becomes apparent only when a restore is attempted on data that includes, or has pointers to, the incorrectly discarded data. All modern hashing algorithms are susceptible to collisions and, consequently, any hash-based data de-duplication approach is susceptible to this problem.
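
(For scale, the standard birthday-bound approximation shows how the odds of an accidental collision are estimated; the numbers below are illustrative, and deliberately constructed collisions are a separate question.)

```python
# Birthday bound: P(collision) ~= n^2 / 2^(b+1) for n unique chunks
# under a b-bit hash. All numbers below are illustrative assumptions.
def collision_probability(n_chunks: int, hash_bits: int) -> float:
    return n_chunks ** 2 / 2 ** (hash_bits + 1)

n = (2 ** 50) // (8 * 1024)   # ~1 PiB of unique data in 8 KB chunks
print(f"{n:,} chunks, MD5 (128-bit): {collision_probability(n, 128):.1e}")
# -> about 2.8e-17: astronomically small, but not zero -- which is
# the distinction Sandorfi's argument turns on.
```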

Our content-aware algorithm is not susceptible to hash-based data integrity concerns since byte-level comparisons are performed.

TW: Could you discuss the scalability issue a little more?

MIKI SANDORFI: Data de-duplication allows customers to store dramatically more data online. As customers need to store more data and the system grows, capacity scalability becomes a substantial challenge with other solutions. However, the content-aware method enables them to minimize footprint and management overhead by reducing the number of systems that need to be managed.

As mentioned previously, hash-based algorithms rely on a lookup table that contains all previously seen unique hashes. In most implementations, this hash table is stored in memory to improve performance. As a result, the scalability of many of these systems is limited by the amount of memory and the size limitations of the supported hash table. Although vendors promote higher scalability numbers, they typically require multiple separate units to achieve them because of the limitations of the hash lookup table. Multiple separate units are inherently less efficient because each unit is a separate de-duplication space with its own lookup table. As a result, you gain no efficiency from shared de-duplication between systems.
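
(The cross-appliance inefficiency is easy to see in miniature; in this sketch each dict stands in for one appliance's private lookup table.)

```python
import hashlib

def store(unit: dict, chunk: bytes) -> None:
    # Each appliance de-duplicates only against its own lookup table.
    unit.setdefault(hashlib.md5(chunk).hexdigest(), chunk)

unit_a, unit_b = {}, {}
chunk = b"identical payload backed up through two appliances"
store(unit_a, chunk)
store(unit_b, chunk)
print(len(unit_a) + len(unit_b), "copies kept")  # -> 2 copies, not 1
```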

FalconStor indicates that it supports clustering, but it is not clear what size data set can be supported with a dual-node cluster. Its technology is designed as an add-on requiring entirely separate hardware and storage, and is not an integrated part of its VTL solution.

SEPATON's content-aware technology creates a content-aware database that incorporates the metadata associated with backups. The database is dynamically scalable and can support more than 50 PB of corporate data and backups of any size.

Part 2 continues here.