The tape tribe of hardware and software vendors could be facing extinction. De-duplication could prompt the long-awaited re-assessment of tape backup and send it into the forgotten past. Why is this? An examination of what de-duplication is, how it works and what the vendors' product plans are reveals the extent of the threat facing tape.

De-duplication is a technology for reducing the amount of data stored on disk. It has a couple of problems associated with it. One is the processing burden. The other is understanding exactly what de-duplication means. Is it fancier compression, single instance storage or the use of fiendishly clever algorithms working with variable block sizes, hash addressing and the like?

So first the term de-duplication itself needs de-duping.

De-duping 'de-duplication'

There are basically three kinds of de-duplication activity possible with a file of data:

1. Character level
2. File level
3. Block level

We'll look at the first two and then turn to block level, which is more complex and has larger claims associated with it.

Character level

Here every character in a file is compared to existing characters and, if a repetition is detected, it is replaced by a location pointer and an identifier. Obviously the space taken up by the location pointer and identifier has to be less than that taken up by the string of repeated characters it replaces.

This technique is quite old in computing terms and is called compression.

The effectiveness of compression depends upon the type of data being compressed. Word documents and PowerPoint decks can be compressed quite well because of runs of repeated characters. Compressing other data types can be unrewarding though.

Compression can generate 2:1 to 5:1 size reduction ratios. Zip archives are a typical use of compression technology.
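The effect is easy to demonstrate with any general-purpose compressor. A minimal sketch using Python's standard zlib module (the sample data is made up for illustration):

```python
import os
import zlib

# A highly repetitive document compresses very well...
doc = b"quarterly report " * 1000          # 17,000 bytes of repeated text
packed = zlib.compress(doc)
print(f"repetitive: {len(doc)} -> {len(packed)} bytes, "
      f"about {len(doc) // len(packed)}:1")

# ...while random (or already-compressed) data barely shrinks at all
noise = os.urandom(len(doc))
print(f"random:     {len(noise)} -> {len(zlib.compress(noise))} bytes")
```

This is why the achievable ratio depends so heavily on the data type: the compressor can only exploit repetition that is actually there.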

Over the last couple of years, as de-duplication technology has been developed by several vendors, they and the compression technology vendors have generally insisted that de-duplication is not compression and vice versa. They can be quite indignant about it.

At a dictionary level they are simply wrong. Removing repeated characters from a file is de-duplication at a character level. Arguing it's not is like saying that water is not wet.

File level

You may often hear the phrase 'single instance storage' (SIS), and this term is equally often used as a synonym for de-duplication, particularly by vendors who don't offer block-level de-dupe.

The idea is conceptually simple. Any file to be stored is compared to existing stored files. If a match is found the incoming file is discarded and replaced with a stub, a pointer to the existing stored file.
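A minimal sketch of the idea in Python, assuming SHA-256 content hashing to detect identical files, with an in-memory dict standing in for the disk store:

```python
import hashlib

class SingleInstanceStore:
    """Store each unique file body once; duplicates become pointers (stubs)."""

    def __init__(self):
        self.blobs = {}   # digest -> file contents, stored once
        self.stubs = {}   # filename -> digest, one entry per logical file

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:      # first instance: keep the data
            self.blobs[digest] = data
        self.stubs[name] = digest         # repeat instance: keep only a stub

    def get(self, name):
        return self.blobs[self.stubs[name]]

store = SingleInstanceStore()
attachment = b"x" * 10_000
for i in range(100):                      # same attachment mailed to 100 people
    store.put(f"mail_{i}.dat", attachment)
print(len(store.stubs), "logical files,", len(store.blobs), "stored copy")
```

One hundred logical copies, one physical copy - which is the e-mail attachment case described above.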

This is effective when storing e-mails with common attachments. Network file management vendors may claim they have de-duplication - Njini for one. What they mean is single instance storage, and the company says it can release 20 to 30 per cent of the disk space in a company with 1 - 3TB of disk.

This is obviously worthwhile, but the actual de-dupe ratio is a relatively small 3:1 or 5:1, little more than effective character-level compression. This is small potatoes compared to the much higher claims of the block-level de-dupers.

Block level

Looking for repeated patterns at block level promises much greater de-dupe ratios than are possible through either compression or file-level de-duping. NexSan is working on de-duping 32 or 64KB data blocks. Avamar and Rocksoft technologies use variable block-level de-duping: they don't rely on fixed block boundaries, but instead look for repeated patterns at the byte-group level. These might coincide with a company logo in a PowerPoint deck or standard paragraphs in Word documents.

Once a repeating pattern is found then metadata is constructed. The first instance is stored and subsequent instances replaced by a file position pointer and a reference to the stored pattern.

As the library of stored patterns grows into the thousands and millions, a very good indexing or access scheme has to be built and held in RAM if the de-dupe software's pattern recognition is to be fast. Any need for disk accesses will slow down the ingest speed and effectively rule out in-band de-duplication (see below).

The block-level de-dupe software uses mathematical algorithms that are not application-aware. They don't know about such things as PowerPoint slide boundaries or Word page and section breaks. All they see are incoming streams of bytes, and they do their pattern recognition at that level.
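The mechanics can be sketched with fixed-size blocks; the variable-block schemes described above add content-defined boundary detection on top of this. The 4KB block size and the use of SHA-256 digests as block identifiers are assumptions for illustration:

```python
import hashlib

BLOCK_SIZE = 4096  # fixed blocks; real products use 32/64KB or variable blocks

def dedupe(data, index):
    """Split data into blocks; store each unseen block once in `index`
    and return the stream as a list of block digests (the metadata)."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        index.setdefault(digest, block)   # only the first instance is stored
        recipe.append(digest)
    return recipe

def reconstruct(recipe, index):
    """Rebuild the original byte stream from the metadata and block store."""
    return b"".join(index[d] for d in recipe)

index = {}
backup = b"A" * 16384 + b"B" * 16384 + b"A" * 16384  # repeated patterns
recipe = dedupe(backup, index)
stored = sum(len(b) for b in index.values())
print(f"{len(backup)} bytes in, {stored} bytes stored "
      f"({len(backup) // stored}:1)")
assert reconstruct(recipe, index) == backup
```

Note that a single byte inserted at the front of the stream would shift every fixed block boundary and defeat the matching, which is exactly the problem the variable, content-defined approaches solve.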

De-dupe ratios

Vendors make large claims about the effectiveness of de-duplication. These centre on the amount of reduction in data size possible, that is, the de-dupe ratio. EMC, with its Avamar acquisition, makes the most grandiose claim, stating that a 300:1 de-dupe ratio is possible. (In fact this is achievable with its two-round de-dupe scheme of de-duping the source data and then de-duping a second time at the hub.)

Others, like Diligent, OEM'd by Hitachi Data Systems, talk about a 30:1 de-dupe ratio, ten times less. The idea of being able to cut down the stored data amount to a thirtieth of the size it would have been without de-duping is seemingly fantastic.

These claims are realistic when applied to block level de-dupe runs on the same disk volume of data when it is backed up repeatedly over time. The incoming backup data is compared to the already backed up data and only the new items are actually stored on disk. The repeated or duplicated items are replaced with identifiers and pointers. Over time, ten, twenty, thirty or more backup runs, the de-dupe ratio heads towards 30:1 or more.

Quantum tell us that a user doing a weekly full backup and daily incrementals could expect to achieve a 25:1 de-dupe ratio. If the user is doing daily full backups then the de-dupe ratio could head towards 50:1.
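The arithmetic behind such ratios can be seen in a toy model of daily full backups, assuming a 1TB data set of which 1 per cent changes each day (both figures are illustrative, not vendor numbers):

```python
DATASET_TB = 1.0
DAILY_CHANGE = 0.01   # fraction of the data that is new each day (assumed)

logical = stored = 0.0
for day in range(1, 31):            # a month of daily full backups
    logical += DATASET_TB           # what tape would have to hold
    new_data = DATASET_TB if day == 1 else DATASET_TB * DAILY_CHANGE
    stored += new_data              # only unseen blocks hit the disk
    if day in (7, 30):
        print(f"day {day:2}: {logical:.0f}TB logical, "
              f"{stored:.2f}TB stored, {logical / stored:.0f}:1")
```

The ratio climbs with every run because each backup adds a full terabyte to the logical count but only the changed fraction to the physical store.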

Such de-duping is not a one-shot exercise and doesn't mean you can reduce a multi-gigabyte PowerPoint presentation to a thirtieth of its raw size in a single de-dupe run.

Also the achieved de-dupe ratio depends upon what kind of data is being backed up and de-duped. The numbers above apply to typical Windows or Unix servers running a general mix of business applications.

De-dupe processing burden

De-duping is typically recommended for use in disk-to-disk backups and virtual tape library scenarios. The amount of data to be backed up increases all the time and disk space is limited.

The de-duping can be done as the raw data is first ingested - in-band - or after the data is first backed up to disk - out-of-band. Such post-backup processing means that the server running the backup application does not have to run the de-dupe code. This can be done off-line, as it were, outside the backup window, on the destination device or appliance, typically a VTL (virtual tape library).

De-duping at the character or file levels is not that processor-intensive. It does require reading a whole file but doesn't involve much processing beyond that. The de-duping takes time but not as much time as a block level de-dupe of the same file would take.

De-duping at the variable block level needs a lot of CPU cycles: every byte in a file has to be read, and candidate patterns compared against every stored byte pattern via a table lookup keyed on hash-style identifiers calculated from a block's contents.

The processing needs of block-level de-duping are much higher, but the pay-off in terms of a higher de-dupe ratio is much higher also.

Think of block-level de-duping as being like a much more granular incremental backup.

De-dupe and tape

A de-duped file cannot be restored to a user without first being reconstructed. The de-dupe software typically has the reconstruction capability included. This effectively rules out backing up de-duped data to tape.

It's not that it can't be written to tape; it can. But backup software writes to tape and backup software doesn't include the functionality to reconstruct de-duped files. So restoring files from a de-duped backup set written to tape would be a two-stage process. First restore the de-duped data from tape to the de-dupe device. Then reconstruct the file and send it on to the user.

In practice no de-dupe supplier is doing this. Instead they say that, if customers want to store data on tape, they should use backup software and write the raw data to tape in a standard full or incremental backup.

Alternatively they say, as EMC does, that the benefits from de-duping are so significant as to make it worthwhile to re-think the data protection process from first principles and consider not using tape or traditional backup software at all.

If de-duping is being done by the destination machine, such as a VTL, then it has to store the ingested backup data. That means its disk capacity has to be used both for de-duped data and for non-de-duped data. In turn it means there is less space for de-duped data. The effect of this could be that, instead of being able to store six months of backup data on the VTL, you can only store three months, because of the need for a, say, 6TB buffer to hold the ingested raw backup data.

De-duping will prompt drive array and VTL developments

If you de-dupe to a VTL and then de-dupe again at the VTL across multiple incoming de-duped target (edge) device data streams, then you have a very efficient way of storing data. You can afford to replicate it across a LAN or WAN for disaster recovery.

This fact can prompt the question: 'why do we need to deploy tape at all?' If your answer to that is 'we don't', then it prompts a second question: 'why do we need a VTL at all?' Why not call it a de-duping drive array and have done with it?

Data Domain does this already with its de-duping drive array. Of course its disk write and read speeds, in terms of receiving and delivering data to/from requesting servers, are not as fast as those of an array dealing with raw data.

But a combination of flash memory caching, 2.5 inch drives, and a controller with an embedded server with high-speed processor and lots of RAM could dramatically increase the effective I/O bandwidth to that of a 3.5 inch disk drive array dealing with raw data.

It would also use much less energy than the equivalent drive array storing raw data, because it would need far fewer spindles and much less cooling. In terms of watts used per raw TB stored, a de-duping array wins hands down over a non-de-duping array. A 30:1 de-dupe ratio means that where a raw data array would need 180 spindles, a de-duping array would need six.

Such a theoretical de-duping drive array could deal with general business data and not just backup data. It is likely, in this writer's view, that VTLs will decrease in popularity and general-purpose de-duping drive array products will appear.

De-duping and tape hardware and backup software vendors

Let's face it, de-duping could represent apocalypse now for tape hardware and backup software vendors. It is a potentially extremely disruptive technology for Symantec's Veritas, EMC's Legato and Dantz, Backbone, CA and other backup software vendors. It could also kill the business model of tape hardware vendors.

They may say tape is not dead, that there will always be a need for tape as the remove-to-a-vault, archive-of-record medium. But a Quantum developing its own de-dupe technology (aka we'll eat our own tape children if we have to) and an EMC buying the claimed game-changing Avamar (aka we'll eat our own backup software children if we have to) mean one thing: vendors with a heavy dependence on tape-based legacy revenue streams are thinking the previously unthinkable.

Tape could not just be in decline; it could be facing technology extinction. Tape and tape software are the equivalent of punched cards. They. Are. Simply. Doomed.

That's an extreme view at present. But it's one that EMC, with all its market weight, is going to be pushing. Other important tape hardware vendors are going to gradually join in. The slow tidal change from tape to disk that several vendors are beginning to discern could become a tape-devouring tsunami.

A straw in the wind you could watch out for would be backup software products incorporating de-dupe agent software sending de-duped data to a target drive array.

Vendor de-duping technology

Data Domain - de-duping drive array.

EMC (Avamar) - in-band, sub-file level, variable block size, de-dupe at target edge clients and second round at data centre hub across multiple client de-duped data.

FalconStor - file level, single instance store. Maybe developing sub-file level de-dupe capability with Sun Microsystems.

Hitachi Data Systems - sub-file level de-dupe using Diligent technology.

H-P - OEMs Sepaton VTL software, which has byte-level (block-level, I think) de-duplication - not that HP staff seem to know about it. For them, HP's DAT-72 and LTO-3 and -4 tape formats have a steady-as-we-go future.

Network Appliance - no de-duping products but it is working on developing one.

NexSan - Has a de-duping VTL.

Njini - file-level, single instance storage de-duping.

Quantum - in-band, sub-file level de-dupe product expected using Rocksoft technology obtained via ADIC acquisition.