In a recent article Sepaton's chief technology officer, Miki Sandorfi, described how he sees Sepaton's de-duplication advantages over Avamar (owned by EMC) and FalconStor. Both of these companies vigorously dispute many of Sandorfi's assertions. (FalconStor's comments can be seen here.)

EMC has sent me its comments as inserts placed in its copy of the original article text. I'm reproducing below exactly what it sent me.

- - - - - - - - - -

Techworld had the opportunity to discuss de-duplication technology and some other issues with Miki Sandorfi, SEPATON's chief technology officer. He explained how Sepaton's content-aware approach differed from hash-based approaches used by Avamar and FalconStor.

TW: Could you compare and contrast the sub-file de-duplication technologies of Avamar, FalconStor and Sepaton please?
MIKI SANDORFI: The hash-based data de-duplication approach (used by Avamar and FalconStor) is typically used with in-band de-duplication solutions. This model runs incoming data through a hashing algorithm (typically MD5), which produces an identifier that is assumed to be unique to that piece of data. It then compares that hash to previous hashes stored in a lookup table. If a match is found, the data is discarded and a pointer to the existing data is added. If no match is found, the data is stored and its hash is added to the lookup table.

The idea is that the lookup table will be populated with many hashes from the backup data, making it more efficient over time. Therefore, the best de-duplication ratios will not be achieved until the hash table is populated. There are other challenges with this approach that are summarized below.
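As an illustration of the flow Sandorfi describes, here is a minimal sketch in Python; the fixed 8 KB chunk size and the use of SHA-1 are assumptions for the example, not a description of any vendor's implementation.

import hashlib, io

CHUNK_SIZE = 8 * 1024  # assumed fixed chunk size, purely for illustration

def dedupe(stream, chunk_store, pointers):
    """Hash each chunk; keep only chunks whose hash has not been seen before."""
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        digest = hashlib.sha1(chunk).hexdigest()
        if digest not in chunk_store:      # new data: store the chunk itself
            chunk_store[digest] = chunk
        pointers.append(digest)            # duplicate or not, record a pointer

# Two identical "backup jobs" against one lookup table / chunk store:
store, job1, job2 = {}, [], []
dedupe(io.BytesIO(b"A" * 32768), store, job1)
dedupe(io.BytesIO(b"A" * 32768), store, job2)
print(len(store), "unique chunk(s) backing", len(job1) + len(job2), "pointers")

The second job stores no new data at all, which is the effect the lookup table is meant to achieve as it fills.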

- - - - - - - - - -

EMC RESPONSE - Any de-dupe approach relies upon pre-existing data. In fact, we can de-dupe within the first job whereas they must have a second job to perform de-duplication. Also, the EMC Avamar software uses SHA-1. Avamar operates globally and there is no single lookup table. We can de-duplicate at both source and target, looking at local caches or across Avamar storage nodes that comprise an Avamar grid.

- - - - - - - - - -

Hash-based data de-duplication requires substantial CPU performance because all of the hashes are generated by the CPU in real time. The more granular the hash (i.e. the smaller each piece of data being hashed), the more CPU-intensive and slower the process becomes.
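One rough way to get a feel for the hashing cost at different granularities is a timing sketch like the one below; the 64 MB buffer and the chunk sizes are arbitrary assumptions, it measures raw hashing only (no table lookups or I/O), and Python's per-call overhead exaggerates the small-chunk penalty.

import hashlib, time

data = b"x" * (64 * 1024 * 1024)            # 64 MB of test data (assumption)
for chunk in (512, 4096, 65536):
    start = time.perf_counter()
    for i in range(0, len(data), chunk):
        hashlib.sha1(data[i:i + chunk]).digest()
    print(f"{chunk:>6}-byte chunks: {time.perf_counter() - start:.2f} s")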

- - - - - - - - - -

EMC RESPONSE - This is an incorrect statement. The CPU usage of the hashing function is independent of object size; it is a function of the total data set size. There is more lookup with smaller objects, but this is not CPU intensive. In fact, our software reduces weekly CPU load by dramatically reducing the amount of work required on client systems for backup and recovery (by up to 20x). By performing de-duplication at the client, we eliminate the "bottle-necking" of inline approaches.

- - - - - - - - - -

Another drawback relates to the size of the hash table and where it resides. Storing the hash table on disk degrades performance further. Storing it in memory increases performance, but it means the table size (and thus the amount of data protected) is constrained by the amount of memory in the system. Hash collisions and subsequent data integrity issues are possible. Maximum backup set size per appliance is also limited. Technologies using this method cannot de-duplicate data across appliances.
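A back-of-the-envelope estimate shows why the table's location matters; the chunk size and per-entry overhead below are assumptions, not any vendor's figures.

CHUNK = 8 * 1024               # assumed average chunk size (bytes)
ENTRY = 20 + 8 + 12            # SHA-1 digest + location pointer + overhead (assumed)
for protected_tb in (10, 100, 1000):
    chunks = protected_tb * 2**40 // CHUNK
    print(f"{protected_tb:>5} TB protected -> ~{chunks * ENTRY / 2**30:,.0f} GB of hash index")

At hundreds of gigabytes of index, the table either spills to disk (slowing lookups) or caps the capacity a single appliance can protect.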

- - - - - - - - - -

EMC RESPONSE - The constraint "cannot de-duplicate data across appliances" does not apply to us. Same for "Maximum backup set size." We use a very small memory footprint across all the clients - only during backup operations, which are fast - in order to de-duplicate quickly. Memory available in the Avamar server scales as nodes are added to the Avamar storage grid. We eliminate the avalanche of data at the top of the mountain, rather than de-duplicating inline at the target or the base of the mountain, after the avalanche has formed.

- - - - - - - - - -

FalconStor has taken a mixed approach with its SIR technology. It uses a content-aware-style approach to identify common objects and then uses hashing to find the redundancies. Because hashing is fundamental to finding the redundancies, the algorithm is classified as hash-based.

The content-aware approach (used by SEPATON) is entirely different. This approach focuses on out-of-band data de-duplication. Data is backed up to the VTL first. When a backup set has completed, the data de-duplication process begins. This approach allows for unimpeded backup performance, since de-duplication is not being performed on incoming data.
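In outline, the out-of-band model amounts to a two-phase pipeline. The sketch below is only a structural illustration under that assumption, with the actual comparison method left abstract; it is not SEPATON's code.

import io

def backup(stream, vtl):
    # Phase 1: the backup job writes to the virtual tape library at full speed;
    # nothing in this path computes hashes or comparisons
    vtl.append(stream.read())

def post_process(vtl, dedupe_fn):
    # Phase 2: once the backup set is complete, a separate pass removes
    # redundancy; dedupe_fn stands in for whichever comparison method is used
    for i, backup_set in enumerate(vtl):
        vtl[i] = dedupe_fn(backup_set)

vtl = []
backup(io.BytesIO(b"full backup of fileserver01"), vtl)   # phase 1 completes first
post_process(vtl, lambda data: data)                      # identity stand-in for the real pass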

- - - - - - - - - -

EMC RESPONSE - This totally ignores LAN/WAN based efficiencies of client side de-duplication. While it allows for unimpeded backup performance, it does not reduce backup times like EMC's approach, which dramatically reduces the amount of work required for backup. They are still forced to move full backups on a recurring basis, so - to reiterate our previous comment - they are de-duplicating at the base of the mountain, after the avalanche has formed.

- - - - - - - - - -

The other key element of the content-aware approach is that it uses a higher level of abstraction when analyzing backup data. Unlike the previous two approaches, content-aware de-duplication looks at data as objects. Where hashing or byte-level comparisons try to find redundancies in byte streams, content-aware de-duplication compares objects to other objects of the same kind (e.g. a Word document to a Word document, or an Oracle database to an Oracle database).
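A simplified way to picture object-level comparison is to group backup objects by type and compare a new object only against earlier objects of the same type, using a byte-level delta rather than a hash lookup. The sketch below is illustrative only; the similarity threshold and the use of Python's difflib are assumptions, not SEPATON's algorithm.

import difflib
from collections import defaultdict

catalog = defaultdict(list)       # object type -> previously stored objects

def store(obj_type, payload):
    """Compare a new object only against prior objects of the same type."""
    for prior in catalog[obj_type]:
        m = difflib.SequenceMatcher(None, prior, payload, autojunk=False)
        if m.quick_ratio() > 0.9:                  # assumed similarity threshold
            changes = [op for op in m.get_opcodes() if op[0] != "equal"]
            catalog[obj_type].append(payload)      # a real system would keep only the delta
            return f"stored as {len(changes)} change(s) against an earlier version"
    catalog[obj_type].append(payload)
    return "no similar object of this type; stored in full"

print(store("word", b"quarterly report v1 " * 50))
print(store("word", b"quarterly report v1 " * 49 + b"quarterly report v2 "))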

- - - - - - - - - -

EMC RESPONSE - A lower level of abstraction allows our software to look across all files and systems for any duplicate data that can be eliminated. We are also content aware, looking at the bytes that make up a file to determine optimal segment boundaries, which maximizes the likelihood of finding and eliminating duplicates.

- - - - - - - - - -

This approach results in an increase in disk space requirements, but provides better performance and de-duplication ratios than the other solutions. The incremental disk it requires is minimal, and negligible in the overall solution cost.

- - - - - - - - - -

EMC RESPONSE - They are performing byte-level comparisons, which can often be more CPU intensive than hashing when accounting for insertions or deletions. In addition, they require all the data to be moved before de-duplication can be performed, which delivers far fewer benefits than EMC Avamar customers realize from our de-duplication at the source.

- - - - - - - - - -

TW: You've mentioned several differentiating topics. Could you discuss data integrity please in more detail?
MIKI SANDORFI: Since data de-duplication modifies data stored on the backup system, it is vital that data integrity be guaranteed at all times. Given the many pointers involved, a data integrity issue can potentially have a cascading, negative impact on many backups.

The problem with hash-based algorithms is that they require a unique hash to be generated for each piece of data. The hash must provide a unique identifier for each given chunk of data. If this is not true, the system will silently corrupt data. This corruption occurs when the de-duplication algorithm mistakenly discards non-redundant data. The error will not be found during the backup; it becomes apparent only if a restore is attempted on data that includes, or has pointers to, the incorrectly discarded data. All modern hashing algorithms are susceptible to collisions and, consequently, any hash-based data de-duplication approach is susceptible to this problem.
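The failure mode can be made concrete with a toy example that deduplicates using a deliberately weakened 8-bit hash, so that collisions are easy to produce; with a real 128- or 160-bit hash the logic is identical, only the probability changes. This is an illustration of the mechanism, not a claim about any particular product.

import hashlib

def weak_hash(chunk):
    # SHA-1 truncated to 8 bits so that collisions are guaranteed in this demo
    return hashlib.sha1(chunk).digest()[:1]

store, pointers, originals = {}, [], []
for i in range(300):                      # 300 distinct chunks, only 256 possible hashes
    chunk = b"chunk-%d" % i
    originals.append(chunk)
    h = weak_hash(chunk)
    if h not in store:
        store[h] = chunk                  # first chunk with this hash wins
    pointers.append(h)                    # colliding chunks silently point at the wrong data

restored = [store[h] for h in pointers]
bad = sum(1 for a, b in zip(originals, restored) if a != b)
print(f"{bad} of {len(originals)} chunks restore incorrectly")   # visible only at restore time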

- - - - - - - - - -

EMC RESPONSE - All storage systems and file systems are susceptible to corruption and errors on reads and writes. The likelihood of a hash collision with SHA-1 is extremely low. In the extremely remote event that one does occur, it does not have a cascading effect; it will affect only files that share the specific segment. File corruption from primary file systems during tape backups is orders of magnitude more likely to take place.
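For a sense of scale on both claims, the birthday bound puts the probability of any two distinct chunks sharing a SHA-1 digest at roughly n(n-1)/2^161 for n unique chunks; the chunk counts below are arbitrary.

for n in (10**9, 10**12, 10**15):         # a billion to a quadrillion unique chunks
    p = n * (n - 1) / 2**161
    print(f"{n:.0e} chunks -> collision probability ~ {p:.1e}")

The result is astronomically small but non-zero, which is consistent with both statements above.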

- - - - - - - - - -

Our content-aware algorithm is not susceptible to hash-based data integrity concerns since byte-level comparisons are performed.

TW: Could you discuss the scalability issue a little more?
MIKI SANDORFI: Data de-duplication allows customers to store dramatically more data online. As customers need to store more data and the system grows, capacity scalability becomes a substantial challenge with other solutions. However, the content-aware method enables them to minimize footprint and management overhead by reducing the number of systems that need to be managed.

As mentioned previously, hash-based algorithms rely on a lookup table that contains all previously seen unique hashes. In most implementations, this hash table is stored in memory to improve performance. As a result, the scalability of many of these systems is limited by the amount of memory and the size limitations of the supported hash table. Although vendors promote higher scalability numbers, they typically require multiple separate units to achieve them because of the limitations of the hash lookup table. Multiple separate units are inherently less efficient because each unit is a separate de-duplication space with its own lookup table. As a result, you gain no efficiencies from shared data de-duplication between systems.
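A small worked example shows the cost of separate de-duplication domains; the data volume, appliance count and overlap fraction are illustrative assumptions only.

unique_tb = 10               # assumed unique data in the environment (TB)
appliances = 4               # assumed number of independent de-dup domains
overlap = 0.75               # assumed fraction of that data common to every appliance

single_domain = unique_tb
separate_domains = appliances * unique_tb * overlap + unique_tb * (1 - overlap)
print(f"one shared domain: {single_domain} TB stored")
print(f"{appliances} separate domains: {separate_domains:.1f} TB stored")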

- - - - - - - - - -

EMC RESPONSE - To be clear, EMC Avamar does not have this issue. Our system is a distributed grid that scales with additional nodes and has a single de-duplication domain.

- - - - - - - - - -

FalconStor indicates that it supports clustering, but it is not clear what size of data set can be supported with a dual-node cluster. The technology is designed as an add-on requiring entirely separate hardware and storage, and is not an integrated part of the company's VTL solution.

SEPATON's content-aware technology creates a content-aware database that incorporates the metadata associated with backups. The database is dynamically scalable and can support more than 50 PB of corporate data and backups of any size.

(Part 2 continued here.)