A wave of sub-file-level de-duplication is washing over secondary storage and virtual tape library (VTL) manufacturers. Both Fujitsu Siemens Computers and Overland Storage have added de-dupe capabilities to their product lines.

Fujitsu Siemens Computers

FSC has turned to EMC, with whom it has a solid partnership, for its Avamar de-duplication product. It is supplying (OEM'ing) Avamar as a software product with a recommended and certified PRIMERGY TX300 or RX300 server configuration.

The aim, according to Helmut Muhleis, an FSC principal consultant, is to provide an effective and capacious disk-based backup facility for virtual servers and for remote and branch offices. It's necessary to carefully balance the host server's memory, CPU and other characteristics to provide effective de-duping performance.

Backup data is retrieved and de-duplicated by Avamar agents installed on client systems - application and file servers - and then sent over a network to a central Avamar system. The backup data is minimised in size at the source before being sent over the LAN or WAN.
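To make the source-side idea concrete, here is a minimal sketch of how a client agent might send only the data a central store has not already seen. It is not Avamar's actual method: the fixed 4KB chunk size, the SHA-256 fingerprints and the names `CentralStore`, `backup` and `restore` are all assumptions made for illustration.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; purely illustrative


class CentralStore:
    """Stands in for the central backup system: keeps one copy of each chunk."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes

    def missing(self, digests):
        """Return the fingerprints the store has not seen yet."""
        return {d for d in digests if d not in self.chunks}

    def put(self, digest, chunk):
        self.chunks[digest] = chunk


def backup(data: bytes, store: CentralStore):
    """Client-side agent: chunk, fingerprint, and send only unknown chunks.

    Returns (recipe, bytes_sent); the recipe is the ordered list of
    fingerprints needed to rebuild the original data.
    """
    pieces = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    recipe = [hashlib.sha256(p).hexdigest() for p in pieces]
    needed = store.missing(recipe)
    sent = 0
    for digest, piece in zip(recipe, pieces):
        if digest in needed:
            store.put(digest, piece)
            needed.discard(digest)  # do not resend a repeat within this backup
            sent += len(piece)
    return recipe, sent


def restore(recipe, store: CentralStore) -> bytes:
    """Rebuild the data by following the recipe's pointers into the store."""
    return b"".join(store.chunks[d] for d in recipe)
```

In this sketch a second backup of largely unchanged data sends only the new chunks across the network, which is the reason for doing the work at the client rather than at the central system.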

For backing up VMware virtual servers the software can be used to de-dupe within and across virtual machines.

Why choose Avamar? Muhleis said: "We carried out in-depth testing of the de-dupe players. We need scalability as good as CentricStor's. The de-duplication algorithm needs to slice up incoming data to avoid overflowing the cache and so needing disk accesses which slows things down. We couldn't compromise by introducing a limited de-duplication capability."

An aspect of de-duping that FSC has identified is that certain files must not be de-duped, said Muhleis: "Documents that are contracts with signatures on them may need to be kept inviolate; they cannot be altered. For these files de-duplication must be capable of being turned off."

In other words there must be a policy or sysadmin capability to switch de-duping on and off depending upon file type and/or contents.
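Such a policy could be as simple as a rule check applied to each file before it is handed to the de-dupe engine. The sketch below is hypothetical: the exempt suffixes, the directory names and the `should_dedupe` function are illustrative assumptions, not settings from any FSC or EMC product.

```python
from pathlib import PurePosixPath

# Hypothetical policy values, chosen only for illustration.
EXEMPT_SUFFIXES = {".p7m", ".sig"}          # e.g. signed or signature files
EXEMPT_DIR_NAMES = {"contracts", "signed"}  # e.g. folders of signed contracts


def should_dedupe(path: str) -> bool:
    """Return False for files that the policy says must be stored untouched."""
    p = PurePosixPath(path)
    if p.suffix.lower() in EXEMPT_SUFFIXES:
        return False
    # Check every directory component, but not the file name itself.
    if any(part.lower() in EXEMPT_DIR_NAMES for part in p.parts[:-1]):
        return False
    return True
```

A content-based rule, such as detecting an embedded signature, would slot into the same function; it is left out here because it depends on the document formats in use.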

FSC is not announcing de-duplication for its CentricStor VTL.

The CentricStor VTL is the only VTL written from the ground up for both mainframe and open systems use. More than 500 have been bought - it's the leading VTL product in Europe - and they look after more than 260PB of customer data. A major new version, 4.0, will be announced in early December.

Techworld's view is that 'if' FSC adds de-dupe to CentricStor this will be the only de-duping mainframe VTL in the industry.

Overland Storage

The REO 9500d is a de-duplicating VTL appliance that uses Diligent's de-duplication product. Previously Diligent has concentrated on supplying enterprise-class products but signalled a step into the mid-market fairly recently with its ProtecTIER line.

Overland's Chris James, EMEA marketing director, said the 9500d is a REO chassis running Diligent's de-dupe software under Linux. It is a new product for existing customers and has a 'powered by Diligent' tag. Overland has been working on it in its lab for 'almost a year.' The scheme is to provide enterprise VTL de-dupe capabilities at a mid-range price.

James said: "Absolutely. It's for the SMB market with a $65,400 (about £33,000 at ordinary conversion rates) price for a 3.75TB capacity product. That's potentially a lot more in effective stored data terms after the data is de-duplicated."

Maximum capacity is 168TB and the product uses two 4Gbit/s Fibre Channel ports. There can be 12 library partitions, 64 virtual tape drives and 3,000 virtual cartridges.

De-duping is like applying a multiplier to your VTL's disks. A 10TB VTL could actually hold the equivalent of 80 to 100TB of raw data once it is de-duped. James cautioned that: "Your de-dupe mileage may vary. The longer you keep de-duped data for, the better your de-dupe ratio will be. Each extra backup increases the de-dupe ratio."
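The multiplier arithmetic is easy to work through. The sketch below applies the 8:1 to 10:1 range implied by the 10TB example to the 3.75TB, $65,400 configuration quoted above; the function names are ours, and the ratios are inputs, not promises.

```python
def effective_capacity_tb(raw_tb: float, ratio: float) -> float:
    """Raw backup data a repository can represent at a given de-dupe ratio."""
    return raw_tb * ratio


def cost_per_effective_tb(price: float, raw_tb: float, ratio: float) -> float:
    """Purchase price divided by the effective (pre-de-dupe) capacity."""
    return price / effective_capacity_tb(raw_tb, ratio)


for ratio in (8, 10):
    tb = effective_capacity_tb(3.75, ratio)
    cost = cost_per_effective_tb(65_400, 3.75, ratio)
    print(f"{ratio}:1 -> {tb:g}TB effective, ${cost:,.0f} per effective TB")
```

On those assumptions the 3.75TB unit represents 30TB to 37.5TB of backup data, or $2,180 down to $1,744 per effective terabyte, against $17,440 per physical terabyte.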

This is because the more unique data there is in your VTL repository, the greater the chance that incoming backups will contain data strings already stored there, which can be removed and replaced by pointers.
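A toy model shows the effect. It assumes a nightly full backup of 1,000 equal-sized chunks, of which 50 change each day; those figures and the function name `dedupe_ratio_over_time` are invented for illustration and say nothing about how Diligent's software works internally.

```python
def dedupe_ratio_over_time(chunks_per_backup=1000, changed_per_day=50, days=10):
    """Return the cumulative de-dupe ratio after each nightly full backup.

    Ratio = chunks ingested so far / unique chunks actually stored.
    """
    stored = set()   # unique chunks held in the repository
    ingested = 0     # chunks presented by all backups so far
    ratios = []
    # Each chunk is identified by (position, day it was last modified).
    dataset = [(i, 0) for i in range(chunks_per_backup)]
    for day in range(1, days + 1):
        if day > 1:
            # A fixed number of chunks change before each later backup.
            for i in range(changed_per_day):
                idx = ((day - 2) * changed_per_day + i) % chunks_per_backup
                dataset[idx] = (idx, day)
        ingested += len(dataset)
        stored.update(dataset)  # chunks already stored add nothing new
        ratios.append(ingested / len(stored))
    return ratios


for day, ratio in enumerate(dedupe_ratio_over_time(), start=1):
    print(f"day {day:2d}: {ratio:.2f}:1")
```

With these assumed numbers the ratio starts at 1:1 after the first backup and climbs to roughly 6.9:1 after ten, because each extra backup adds a full set of ingested chunks but only 50 new stored ones.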

"The de-dupe ratio might be 8:1 or 12:1 if data is kept on the VTL for 8 to 10 days. The ratio will probably increase if data is kept on the VTL for longer." Overland claims that the product can retain typically up to 25 times more data on disk in this circumstance.

The 9500d de-dupes data inline, as it is ingested, and has a throughput of 80-100MB/sec. This means that the 9500d doesn't have to maintain a portion of its physical capacity to store an incoming backup before it is de-duped. More of the product's disk capacity is available for de-duped data storage and so the array is more effective in terms of de-duped data storage utilisation.
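Two quick calculations put numbers on this. The sketch assumes decimal units (1TB = 1,000,000MB) and, for the post-processing case, a staging area as large as the biggest incoming backup; both functions and the 1TB example are illustrative assumptions.

```python
def ingest_hours(backup_tb: float, mb_per_sec: float) -> float:
    """Hours needed to ingest a backup at a sustained rate (1TB = 1,000,000MB)."""
    return backup_tb * 1_000_000 / mb_per_sec / 3600


def repository_tb(physical_tb: float, staging_tb: float = 0.0) -> float:
    """Disk left for de-duped data after reserving any staging area."""
    return physical_tb - staging_tb


# A 1TB backup at the quoted 80-100MB/sec range.
print(f"{ingest_hours(1, 100):.1f} to {ingest_hours(1, 80):.1f} hours")

# Inline needs no staging; post-processing is assumed to stage the whole backup.
print(repository_tb(3.75), "TB inline vs", repository_tb(3.75, 1.0), "TB post-process")
```

On those assumptions a 1TB backup takes roughly 2.8 to 3.5 hours to ingest, and a 3.75TB unit that had to stage it first would have 2.75TB left for the de-duped repository.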

To strengthen its SMB appeal, Overland says the 9500d can be integrated seamlessly into existing backup environments without requiring any modifications to application or backup servers, and that an easy-to-use configuration wizard streamlines deployment so that it needs less than an hour.

Commentary

De-duplication is becoming a near-universal adjunct to the storage of enterprise persistent data: backups, retained emails, retained unstructured data, etc, but it is not becoming a commodity. That is because the de-dupe algorithms are highly complex and vary between products, as does the timing of de-duplication: inline with data ingest, or after ingest, known as post-processing.

The point about data characteristics affecting achieved de-dupe performance is vital and customers are encouraged to run pilot de-dupe projects pumping known data sets through different de-dupe products to find out which one best suits their own individual situation.

We can expect it to be a long time before the SPC comes up with an SPC de-dupe benchmark and, even then, its data sets may not reflect customers' real-world file stores. Ironically enough, every customer might be unique in its de-duplication needs - they simply cannot be duplicated.