Doron Kempel of Diligent is a confident CEO, even a coolly certain one. In a nutshell, the Diligent pitch is that inline de-dupe is the way to go, that HyperFactor is better than hash-based de-dupe, and that Diligent can de-dupe at speeds up to five times faster than other suppliers, such as Data Domain.

Making a hash of de-dupe

Kempel dismisses hash-based de-dupe products on two counts. First, there is the risk of false positives: two different chunks of data could produce the same hash, so that new data is wrongly treated as a duplicate and the stored reference points at the wrong data. Data Domain's marketing VP, Beth White, said that the sun would have to go around the moon before this could happen. Kempel points to the theoretical possibility and says that, because so much real data rests on the hash element - like an upside-down pyramid on its tip - any risk, however small, is too much. So much raw data could be corrupted by a duplicated hash that it doesn't bear thinking about.
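To make the hazard concrete, here is a minimal sketch - mine, not any vendor's actual design - of how a hash-indexed chunk store works and why a collision would go unnoticed:

    # Minimal sketch of a hash-indexed chunk store (illustrative only).
    # Chunks are stored once, keyed by their hash; a later chunk with the
    # same hash is treated as a duplicate and its bytes are never compared,
    # so a hash collision would silently substitute the wrong data on restore.
    import hashlib

    chunk_store = {}   # hash digest -> stored chunk bytes
    backup_refs = []   # the "backup" is just a list of hash references

    def write_chunk(chunk: bytes) -> None:
        digest = hashlib.sha1(chunk).digest()
        if digest not in chunk_store:   # new unique data: store it
            chunk_store[digest] = chunk
        backup_refs.append(digest)      # duplicate or not, keep only the reference

    def restore() -> bytes:
        # If two different chunks ever shared a digest (a false positive),
        # this would return the first chunk's bytes in place of the second's.
        return b"".join(chunk_store[d] for d in backup_refs)

The store never compares the raw bytes; the hash is trusted absolutely, which is exactly the upside-down pyramid Kempel is pointing at.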

Beth White says let's not get into a death-by-maths argument but, really, we should keep our feet on the ground: the theoretical risk is vanishingly small. There are hundreds of de-duping Data Domain products out in the real world doing real jobs, so - I'm paraphrasing her here - get real. She says anti-hash FUD is corrected by ESG analysis reports and by other analysts.
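To put a number on 'vanishingly small': the standard birthday-bound estimate for n chunks hashed to k bits puts the collision probability at roughly n squared over 2 to the power k+1. With some illustrative assumptions of my own - 160-bit fingerprints and 8KB chunks, not figures from either vendor - a 1PB repository looks like this:

    # Back-of-envelope birthday bound: P(collision) ~ n^2 / 2^(k+1) for n
    # chunks hashed to k bits. Chunk size and hash width are assumptions.
    chunk_size = 8 * 1024          # 8KB chunks (assumed)
    hash_bits = 160                # SHA-1-sized fingerprint (assumed)
    repository = 10**15            # 1PB of unique data

    n = repository // chunk_size                # ~1.2e11 chunks
    p_collision = n**2 / 2**(hash_bits + 1)     # ~5e-27
    print(f"{n:.2e} chunks, collision probability ~ {p_collision:.1e}")

That is White's 'vanishingly small' expressed as a number; Kempel's counter is simply that any non-zero figure is too much when so much data hangs off it.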

In-memory indices

Secondly, Kempel says, HyperFactor produces very compact indices. These are the byte strings that represent unique data elements - the hashes, in hash-addressing schemes. Each byte string represents a certain amount of unique raw data stored on the de-dupe product's hard drives, so the larger the amount of unique raw data stored on disk, the larger the index. This does not matter much - as long as the index can be held in the de-dupe server's memory.

If the raw data grows to such a size that the index overflows the server's memory, then some of it has to be moved out to disk, de-dupe performance abruptly worsens, and it continues to worsen as the proportion of disk-based index lookups increases. Diligent's Kempel says his company's HyperFactor index is so compact that this boundary is a lot further off than with other de-dupe suppliers' products.

He says there is a ratio of unique raw data store size to index size. In Diligent's case it is 250,000:1, meaning a 1PB repository can be represented by a 4GB index. Using assumptions about chunk size and chunk hash address size, he reckons that Data Domain needs a 3GB index for a 1TB repository: "A 900:1 difference from Diligent."
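The arithmetic is easy enough to reproduce. The 250,000:1 ratio is Kempel's figure; the chunk size and per-entry cost on the hash-index side are my own illustrative assumptions, not published Data Domain numbers:

    # Reproducing the index-size arithmetic. The 250,000:1 ratio is Kempel's
    # claim; chunk size and per-entry cost below are illustrative assumptions.
    TB, GB, PB = 10**12, 10**9, 10**15

    hyperfactor_ratio = 250_000                 # Diligent's claimed data-to-index ratio
    print(1 * PB / hyperfactor_ratio / GB)      # -> 4.0 GB index for a 1PB repository

    chunk_size = 8 * 1024    # 8KB chunks (assumed)
    entry_size = 24          # bytes per entry: ~20-byte hash plus overhead (assumed)
    index = (1 * TB / chunk_size) * entry_size
    print(index / GB)                           # -> ~2.9 GB index for a 1TB repository
    print((1 * TB) / index)                     # -> data-to-index ratio of ~340:1

On these assumptions the gap works out to a few hundredfold rather than a clean 900:1 - the exact figure depends on the assumptions - but the shape of the argument is the same: one index stays comfortably in server memory at petabyte scale, the other spills to disk far sooner.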

Avoiding the de-dupe ratio trap

Kempel avoids the de-dupe ratio trap - the 'my ratio is better than yours' one - by saying 'your mileage may vary': the ratio varies with the type of data and the amount of it that you have already de-duped. The ratio with ten days' worth of backup data will be a lot lower than with fifty days of backup data. De-duping full backups will be more productive, in a de-dupe ratio sense, than de-duping incremental backups. This is all good, solid stuff and other de-dupe suppliers acknowledge it too.
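A toy model shows why retention length matters. Assume nightly full backups of a dataset in which only a small fraction changes each day - the figures are mine and purely illustrative:

    # Toy model: nightly full backups of a 10TB dataset with 1% daily change
    # (all figures illustrative). Stored unique data grows slowly, so the
    # de-dupe ratio improves the longer the retention period.
    dataset = 10.0        # TB
    daily_change = 0.01   # 1% of the dataset is new each day (assumed)

    def full_backup_ratio(days: int) -> float:
        ingested = days * dataset                               # what was backed up
        stored = dataset + (days - 1) * dataset * daily_change  # what is actually kept
        return ingested / stored

    print(round(full_backup_ratio(10), 1))   # ~9.2:1 after ten days
    print(round(full_backup_ratio(50), 1))   # ~33.6:1 after fifty days

Incrementals, by contrast, already consist mostly of changed data, so there is far less redundancy for the de-dupe engine to squeeze out.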

But then, like any financial fund seeking fresh investors, he points to the returns gained by previous investors. There is a large US client data reference company with 28 Diligent products. Each runs on a Sun V40Z server - four dual-core CPUs - with 80TB of Sun 9990 disk, aka HDS TagmaStore USP arrays. The firm manages to keep 30 to 45 days of data with de-dupe ratios in the 12:1 to 15:1 area. Kempel says: "Effectively it's 1PB per server. They want to go to 40 systems by the end of the year."
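The 'effectively 1PB per server' claim is consistent with the figures quoted:

    # Sanity check: 80TB of physical disk behind de-dupe ratios of 12:1 to 15:1.
    physical_tb = 80
    for ratio in (12, 15):
        print(f"{ratio}:1 -> {physical_tb * ratio / 1000:.2f} PB of logical backup data")
    # 12:1 -> 0.96 PB, 15:1 -> 1.20 PB per server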

Other customers have achieved higher ratios and some lower. His point, though - not that he came out and said it - is that customers using Diligent systems will achieve de-dupe ratios as good as, if not better than, those of other products, will de-dupe faster, and will have more reliable data recovery from the de-duped repository.

Customer bake-off

Kempel mentioned a customer bake-off in which various de-dupe systems were compared using a standard data set. Diligent, using a four-CPU x86 server, achieved 580MB/sec de-duping throughput. He thought that a Data Domain system achieved 100-110MB/sec. Data Domain's White suggested the Diligent speed might have been achieved by using an unrealistically large number of disks.

Where does this leave us? We have to make conjectures based on unverifiable claims until public, believable and verifiable data about de-dupe product performance emerges. When it does, we can make more realistic assessments of the differing products - their performance, their price/performance, their scalability - and, in other words, make the kind of informed product comparisons that are commonplace elsewhere in storage.

As to arguments about my de-dupe algorithm being better than yours, we're firmly in what Data Domain nicely described as 'death by maths' territory. We can't understand the maths involved and it comes down to judgements based on uncertain information. That's called gambling.