So what is deduplication?
Deduplication is a means by which data is examined and compared to existing data. If it is the same, it is filtered out and the existing data is referenced. Deduplication is very prominent in applications such as backup that cause a lot of duplication as a byproduct of how they work. These applications are prime targets for deduplication technology.
What forms of deduplication are there?
There are three approaches to deduplication that are talked about in the market today. One of them is the offering from Diligent called HyperFactor, which looks at data in an agnostic form and searches the datastream for similarity. Once similarity is found, a computational difference is performed, guaranteeing that what is filtered out is exactly the same as what is referenced. Only new data is stored.
Another one uses hash technology or hash algorithms, whereby data is sliced into digestible pieces, perhaps 8 KB in size, and a hash is computed for each piece and stored along with the data. If the same signature or hash is computed again on a new datastream, that suggests the data already exists and can be referenced. It doesn't need to be stored a second time, which reduces the amount of storage consumed.
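The hash-based approach described above can be sketched roughly as follows. This is only an illustration, not any vendor's implementation: the fixed 8 KB chunk size comes from the example in the answer, while the choice of SHA-256, the `ChunkStore` class and its method names are assumptions made for the sketch.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # 8 KB slices, as in the example above


class ChunkStore:
    """Toy hash-indexed store (hypothetical): each unique chunk is kept once."""

    def __init__(self):
        self.chunks = {}  # hex digest -> chunk bytes

    def write(self, stream):
        """Store a datastream and return its 'recipe', a list of chunk digests."""
        recipe = []
        for offset in range(0, len(stream), CHUNK_SIZE):
            chunk = stream[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # New data: store it. A digest already in the index is only referenced.
                self.chunks[digest] = chunk
            recipe.append(digest)
        return recipe

    def read(self, recipe):
        """Rebuild the original datastream from its recipe."""
        return b"".join(self.chunks[d] for d in recipe)

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = ChunkStore()
monday = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
tuesday = b"A" * CHUNK_SIZE + b"C" * CHUNK_SIZE  # first chunk unchanged
monday_recipe = store.write(monday)
tuesday_recipe = store.write(tuesday)
```

In this toy run the two backups total four chunks, but only three are stored, because the unchanged first chunk is referenced rather than written again. Note that the sketch trusts the hash alone to decide that two chunks are identical.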
The third looks inside the datastream at its logical content. It assumes that a file of a particular name is most likely a good candidate for comparison with the contents of a file of exactly the same fully qualified name, meaning directory, directory tree, etc., and then a computational difference is done between the two files.
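As a rough illustration of this third, name-matched approach, the sketch below looks up the previous version of a file by its fully qualified path and stores only a delta against it. The helper names `make_delta` and `apply_delta`, the sample path and the use of Python's `difflib` are assumptions for the sketch; a real product would use a purpose-built differencing engine.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Describe `new` as copies from `old` plus literal new bytes."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reference existing bytes
        elif j2 > j1:
            ops.append(("data", new[j1:j2]))    # only genuinely new bytes
    return ops


def apply_delta(old, ops):
    """Rebuild the new file from the old file and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


# Previous backup, indexed by fully qualified file name (hypothetical data).
previous = {"/home/alice/report.txt": b"quarterly numbers: 10, 20, 30\n"}

path = "/home/alice/report.txt"
new_contents = b"quarterly numbers: 10, 25, 30\n"

delta = make_delta(previous[path], new_contents)
new_bytes_stored = sum(len(op[1]) for op in delta if op[0] == "data")
restored = apply_delta(previous[path], delta)
```

The file name is only used to pick the comparison candidate; the byte-level difference is what determines how much new data has to be kept.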
What are the different ways deduplication has been implemented?
One of the implementation differences in those approaches is whether you receive all of the data, lay it down on disk and then at some point in the future read it back in to deduplicate it, or whether you process the data inline and in real time as it is received to achieve the deduplication.
Those are called inline and post-processing?
That is correct.
You say that Diligent uses the HyperFactor approach. Who are some of the vendors that use hash algorithms?
What are the advantages and disadvantages of inline deduplication and post-processing?
Inline deduplication, first of all, is difficult to achieve in terms of performance. But if you do achieve it, it is advantageous because once you have finished the job, the job is done -- there is no heavy lifting left, and you don't have to worry about capacity planning for any background tasks or what resources might be available to support them. With post-processing, by contrast, none of the heavy lifting is done while the data is being received from the backup application, so end users need to concern themselves with the amount of effort needed to do the post-processing afterward.
It is quite easy to understand when you look under the covers that the activity on the disk subsystem is greatly increased as a byproduct of post-processing, simply because you have to write everything and read it back. Then there's all the database and indexing overhead that is painful and can slow the process down. It is quite reasonable to assert that if you are able to de-dupe inline at 300 to 400 MB per second, you wouldn't even consider doing post-processing, because it drives toward a higher I/O profile and slows you down.
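The "write everything and read it back" point can be put into a back-of-the-envelope model. This is a deliberately simplified sketch: the function name `disk_io_tb` and the 10:1 deduplication ratio are assumptions chosen for illustration, and index and database traffic is ignored for both modes.

```python
def disk_io_tb(backup_tb, dedupe_ratio, mode):
    """Approximate disk traffic in TB for one backup, ignoring index overhead."""
    unique_tb = backup_tb / dedupe_ratio
    if mode == "inline":
        # Only the new, unique data ever reaches the disk.
        return unique_tb
    if mode == "post":
        # Write everything, read it all back, then write the unique data.
        return backup_tb + backup_tb + unique_tb
    raise ValueError("mode must be 'inline' or 'post'")


# Hypothetical workload: 10 TB a night at an assumed 10:1 ratio.
inline_io = disk_io_tb(10, 10, "inline")
post_io = disk_io_tb(10, 10, "post")
```

Under these assumptions the post-processing path moves roughly twenty times more data through the disk subsystem than the inline path for the same backup, before any indexing overhead is counted.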
Who are the vendors that do inline deduplication besides Diligent?
I believe Data Domain is the only other vendor doing inline processing. What's very interesting about that is that the results from early beta tests support the claim we make that post-processing is slow when you have a large repository of data to deal with. A large repository, especially when it's hash-based, causes the knowledge base, the index and the catalog to be incredibly active. When I say large, I mean anything the size of 20, 30 or 40TB.
If you are using disk as your endpoint instead of tape, is it better to choose a system that does post-processing or inline processing, or does it make a difference?
The decision point is going to be based on the magnitude of the workload. If you only have a small workload and you are only backing up 1TB a night, then there are many different offerings that might suffice. There are other attributes that have to do with scalability, flexibility in configuration and expansion. When you are looking at large quantities of data, then you really need to be concerned about the configurations necessary to support the payload when it gets to 10 to 20TB a night. If you are dealing with those large payloads you are likely to find yourself buying more hardware to support a post-processing deployment.
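To get a feel for these payload sizes, the short calculation below estimates how long a nightly backup takes to ingest at a sustained rate. The function name `hours_to_ingest` is hypothetical, decimal units (1 TB = 1,000,000 MB) are assumed, and the 300 to 400 MB per second figures are the inline rates quoted earlier in this interview, not measurements.

```python
def hours_to_ingest(backup_tb, rate_mb_per_s):
    """Hours needed to receive `backup_tb` at a sustained rate (decimal units)."""
    seconds = backup_tb * 1_000_000 / rate_mb_per_s
    return seconds / 3600


for tb in (1, 10, 20):
    for rate in (300, 400):
        print(f"{tb:>2} TB at {rate} MB/s: {hours_to_ingest(tb, rate):.1f} hours")
```

At 400 MB per second, 10TB takes a little under seven hours, and 20TB at 300 MB per second takes more than eighteen. Any post-processing pass would have to fit in on top of that ingest time.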
If my goal is to send data off to tape from its staging area on disk, do I need to un-de-duplicate that data before sending it off to tape?
Yes, you should, because the reason for putting it on tape is most likely to send it off-site, and your use profile in all probability dictates that you need native access to that data, meaning that NetBackup, TSM or Legato can use those tapes directly. If you de-dupe the data and then put it on tape, it's in a privately owned proprietary format on the tape that needs to be un-de-duplicated in order for the data to be of use to any application.
It seems that there would be opportunities for deduplication in areas other than in virtual tape libraries?
Deduplication works with any target. Diligent will be introducing file system deduplication with a Network File System interface, extending our deduplication engine to the network-attached storage topology. We are also developing an image interface in support of NetBackup. The technology is not bound to a VTL.