TW: What about performance?
MIKI SANDORFI: As the amount of data protected increases, customers need performance to increase concurrently to ensure that there is enough bandwidth available for backing up and restoring. Since data de-duplication is complex and often compute-intensive, it can have a major impact on backup and restore performance.
Since hash-based algorithms are generating hashes in real time, the process is fundamentally compute-intensive. The hash generation and lookup processes create substantial overhead and slow performance significantly.
The only way to solve these problems in a hash-based environment is to add more systems. As mentioned above, this creates complexity and reduces the overall de-duplication ratio. This inline approach relies on the hash table to maintain pointers to existing data. The hash table is initially populated with the first backups. Subsequent backups add new data to the table and include pointers to existing data.
The data stored from a given backup includes pointers to data from previous backups. It is important to note that these pointers can point to backups all over the disk (e.g., a backup typically includes pointers to the first backup, to the backup from the previous night and to everything in between). When a restore is requested, the de-duplication technology must recreate the data by following the various pointers.
However, the algorithm has fragmented the backup data all over the disk, so to restore data it has to rebuild that data and, in the process, performance slows significantly. This impact is also seen when performing tape copy operations. Tape copy can be a major challenge with real-time data de-duplication algorithms because restore performance is at least as important as backup performance.
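As an illustration of the pointer mechanism described above, here is a minimal Python sketch of a hash-based store. It is not any vendor's implementation: the class name `HashDedupStore`, the fixed 4-byte `CHUNK_SIZE` and the use of SHA-256 are all assumptions chosen to keep the example small.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical; unrealistically small so the example is easy to follow


class HashDedupStore:
    """Toy hash-based de-duplication store (illustrative only)."""

    def __init__(self):
        self.chunks = []    # unique chunk payloads, in the order they were first seen
        self.index = {}     # chunk hash -> position in self.chunks (the "hash table")
        self.backups = {}   # backup name -> list of positions (the "pointers")

    def backup(self, name, data):
        pointers = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.index:
                # New data: store it and add it to the hash table.
                self.index[digest] = len(self.chunks)
                self.chunks.append(chunk)
            # Existing data: only a pointer is recorded.
            pointers.append(self.index[digest])
        self.backups[name] = pointers

    def restore(self, name):
        # A restore follows every pointer, wherever the chunk was first written.
        return b"".join(self.chunks[p] for p in self.backups[name])


store = HashDedupStore()
store.backup("monday", b"AAAABBBBCCCC")
store.backup("tuesday", b"AAAAXXXXCCCC")
restored = store.restore("tuesday")
tuesday_pointers = store.backups["tuesday"]
```

In this sketch the second backup ends up pointing at chunks written by the first backup as well as at its own new chunk, so reading it back touches non-adjacent positions in the store. That is the fragmentation effect described in the answer.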
FalconStor highlights an ability to cluster its SIR engines. Note that its de-duplication occurs inside an entirely separate environment than its VTL, requiring separate servers and storage that must be managed separately.
FalconStor highlights its ability to add up to two SIR engines to provide additional de-duplication performance. Of course, since each is a separately managed entity, this adds substantially to system management overhead.
The ContentAware algorithm as implemented by SEPATON is designed for scalability. Unlike the other approaches, the algorithm is designed to natively support clustering. Each time you add a node to the cluster, the SEPATON system adds processing and disk I/O resources to enable increased de-duplication performance. The SEPATON solution also performs de-duplication out-of-band so that the native backup performance is not affected. As a result, the SEPATON approach can deliver wire-speed Fibre Channel performance while delivering industry-leading de-duplication performance and scalability.
SEPATON's content-aware algorithm is not susceptible to the restore performance degradation problems of hashing algorithms. Hashing algorithms use the first backup as a key source of data for populating the lookup data. This causes them to fragment data all over the disks. They point backward to earlier backups.
In contrast, the SEPATON solution uses the most recent backup as the primary source of data. Its pointers point forward to the newest backup. As a result, the SEPATON solution always provides the fastest restores for the most recently backed-up data (which is typically the most business-critical data for backup).
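The forward-pointing idea can be sketched as follows. This is a simplified illustration and not SEPATON's actual algorithm: the class `ForwardReferenceStore`, the chunk-level comparison and the 4-byte `CHUNK_SIZE` are assumptions made for the example. The newest backup is always kept whole, and each older backup is rewritten as references into the next newer one.

```python
CHUNK_SIZE = 4  # hypothetical; tiny so the example is easy to trace


def split(data):
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


class ForwardReferenceStore:
    """Toy store in which older backups point forward to newer ones."""

    def __init__(self):
        self.latest = None  # newest backup, kept whole
        self.older = []     # older backups as deltas, oldest first

    def backup(self, data):
        new_chunks = split(data)
        if self.latest is not None:
            position = {chunk: i for i, chunk in enumerate(new_chunks)}
            # Rewrite the previous backup: a chunk that also exists in the new
            # backup becomes a forward reference, anything else is kept raw.
            delta = [("ref", position[c]) if c in position else ("raw", c)
                     for c in self.latest]
            self.older.append(delta)
        self.latest = new_chunks

    def restore(self, age=0):
        """age 0 is the newest backup, 1 the one before it, and so on."""
        if self.latest is None or age > len(self.older):
            raise ValueError("no such backup")
        chunks = self.latest  # the newest backup needs no pointer chasing
        # Each step back in time resolves one delta against the newer backup.
        for delta in reversed(self.older[len(self.older) - age:]):
            chunks = [chunks[v] if kind == "ref" else v for kind, v in delta]
        return b"".join(chunks)


store = ForwardReferenceStore()
store.backup(b"AAAABBBBCCCC")
store.backup(b"AAAAXXXXCCCC")
store.backup(b"AAAAXXXXDDDD")
newest = store.restore(0)
oldest = store.restore(2)
```

In this sketch, restoring the newest backup reads one whole copy, while older backups cost one extra resolution step per generation. That matches the trade-off described in the answer: the most recent data restores fastest.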
TW: There is a wide range of de-duplication ratios put out by suppliers. What is your take on this topic?
MIKI SANDORFI: A variety of factors influence the capacity reduction or data de-duplication ratios. These factors include backup schemas, data types, data change rates and many others. While these are inevitable and will influence all data de-duplication methodologies, there are inherent differences in the various approaches that will result in differing ratios. The primary issue is how efficient a de-duplication methodology is or how efficiently the algorithm finds and removes redundancies. In all cases, there is a trade-off between de-duplication granularity and performance, and different algorithms take different approaches to balance these two requirements.
Hash-based de-duplication products do not balance de-duplication ratio and performance well. The issue is that the smaller and more granular the hashed blocks, the better the de-duplication ratio, but the greater the performance impact and the larger the hash lookup table.
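The granularity trade-off can be seen in a small experiment. The sketch below assumes fixed-size chunks and SHA-256, and the function name `dedup_stats` and the test data are hypothetical. It reports the de-duplication ratio and the number of hash table entries for two chunk sizes.

```python
import hashlib


def dedup_stats(backups, chunk_size):
    """Return (de-duplication ratio, hash table entries) for fixed-size chunks."""
    seen = set()
    logical = stored = 0
    for data in backups:
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            logical += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return logical / stored, len(seen)


# Hypothetical data: 4096 bytes of non-repeating content, then a second
# backup that differs from the first in a single byte.
first = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
second = bytearray(first)
second[100] ^= 0xFF
second = bytes(second)

coarse_ratio, coarse_entries = dedup_stats([first, second], chunk_size=1024)
fine_ratio, fine_entries = dedup_stats([first, second], chunk_size=64)
```

With the smaller chunk size, the one-byte change forces only 64 bytes to be stored again instead of 1024, so the ratio improves. The price is a hash table with many more entries to generate and look up.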
Another factor that affects all inline de-duplication ratios is the difficulty in understanding where files and other data objects begin and end. This is simple for technologies that look at data on a file system level, but very complex for technologies that look at data on a byte stream level. In the latter case, the application is seeing a string of streaming bytes and must deduce breakpoints. If the algorithm is incorrect, it will negatively affect de-duplication ratios.
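One common way for a byte-stream technology to deduce breakpoints is to derive them from the content itself. The sketch below is a generic illustration of that idea, not a description of any product mentioned in this interview. `WINDOW`, `DIVISOR` and the function names are assumptions, and a production system would use a rolling hash such as a Rabin fingerprint rather than rehashing every window.

```python
import hashlib

WINDOW = 8    # hypothetical: bytes examined when testing for a breakpoint
DIVISOR = 16  # hypothetical: roughly one position in 16 becomes a breakpoint


def find_breakpoints(data):
    """Return chunk end offsets, chosen purely from local content."""
    cuts = []
    for i in range(WINDOW, len(data) + 1):
        window = data[i - WINDOW:i]
        if hashlib.sha256(window).digest()[0] % DIVISOR == 0:
            cuts.append(i)
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))  # the final chunk always ends at end of stream
    return cuts


def chunks(data):
    out, start = [], 0
    for end in find_breakpoints(data):
        out.append(data[start:end])
        start = end
    return out


# Hypothetical stream, and the same stream with one byte inserted at the front.
stream = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))
shifted = b"!" + stream

original_chunks = chunks(stream)
shifted_chunks = chunks(shifted)
```

Because each breakpoint depends only on the bytes just before it, the inserted byte disturbs the first chunk and every later chunk is found again unchanged. With fixed-offset breakpoints, the same one-byte insertion would shift every chunk boundary and hide the redundancy.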
Unlike full backups where the data is always in a consistent order and much of it is unchanged, identifying breakpoints is particularly complex for incremental backups where data is in an inconsistent order. In practice, these de-duplication solutions will always have a dramatically better de-duplication ratio when performing full backups versus incremental backups. This factor can be a major point of differentiation in incremental-only environments such as TSM where full backups happen only once.
FalconStor's use of a hybrid of content-aware and hash approaches may improve on the limitations of traditional pure hash-based approaches. However, it still has the challenge of how small the block of data being hashed is, and this will have a dramatic impact on both de-dupe performance and de-dupe ratios.
Unlike the hash-based and byte-level algorithms, which effectively guess at the data breakpoints, the content-aware approach natively includes and understands both files and breakpoints, eliminating the guesswork of where a file ends and/or begins or the redundancy (or lack thereof) of a given data set. All of this information is contained within the content-aware database.
The content-aware model intelligently analyzes the relationships between files and understands what files are likely to be redundant before it compares the two files. This additional intelligence enables the content-aware approach to balance performance and granularity with unparalleled efficiency. Since the algorithm knows which files are likely to be redundant it can de-duplicate those files more aggressively while avoiding unnecessarily processing new files.
The other key point is that the content-aware approach is equally efficient for incremental and full backups. Since the technology looks at the object, the order of the backups or finding breakpoints is irrelevant. A given file has the same metadata and path regardless of whether it is backed up in a full or incremental backup.
Since the content-aware technology has more intelligence around the content of backup data, it can support sophisticated configuration rules that enable the customer to tune the algorithm to their environment. For example, customers can configure the solution to handle designated data types, data from certain servers, and data from certain policies or exclude specific data formats.
TW: Do you think the EMC/Avamar approach of claiming an up to 50:1 de-dupe ratio through running a second de-dupe round at the datacentre on remote/branch office de-duped backups consolidated to the datacentre is worthwhile or not?
MIKI SANDORFI: It would depend on the customer's data. If the customer has duplicate data spread across many sites, it would be of value. If not, it adds little value. Higher ratios would be achieved where operating system files (Windows, Linux) are routinely backed up, since they are the same across all sites. Of course, the second round of de-dupe requires additional resources (compute, network, etc.).
TW: How realistic is it to compare de-dupe ratios between different vendors' products? Is an apples-to-apples comparison even remotely possible?
MIKI SANDORFI: Absolutely. From the customer's perspective, they can retain more data online in a given amount of underlying physical disk. How much more is readily measurable and is expressed as the de-dupe ratio. SEPATON's DeltaStor graphical user interface supplies this information for the system overall as well as for each and every backup job and is available after the second backup.
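As a worked illustration of that measurement, the ratio is the amount of backup data retained divided by the physical disk it occupies. The job names and sizes below are made up for the example, and the function `dedup_ratio` is hypothetical. It is not part of DeltaStor or any other product.

```python
def dedup_ratio(logical_gb, physical_gb):
    """Backup data retained divided by the physical disk it occupies."""
    if physical_gb <= 0:
        raise ValueError("physical capacity must be positive")
    return logical_gb / physical_gb


# Hypothetical figures: (data retained online in GB, physical disk used in GB)
jobs = {
    "mail": (500, 50),
    "database": (1000, 40),
}

per_job = {name: dedup_ratio(logical, physical)
           for name, (logical, physical) in jobs.items()}

overall = dedup_ratio(sum(logical for logical, _ in jobs.values()),
                      sum(physical for _, physical in jobs.values()))
```

Note that the overall ratio is computed from the summed totals and is not the average of the per-job ratios. This is one reason figures from different vendors are only comparable when they are measured on the same data and over the same set of backups.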
TW: Should customers ask vendors to run a sample customer-supplied file set to be de-duped through their products to arrive at a valid de-dupe ratio for the customer's specific requirements?
MIKI SANDORFI: Since the achievable de-dupe ratio is highly dependent on the customer's backup policies and data sets, the customer should absolutely ask for a representative sample set to be run by the vendor. Unfortunately, this is not usually practical for vendors using hash-based approaches since the optimal de-dupe ratio is not achieved for weeks or months. An advantage to the ContentAware approach is that de-dupe ratios are demonstrable after the second backup is processed.
TW: Could you provide an overview of Sepaton's OEM relationships please?
MIKI SANDORFI: HP OEMs SEPATON's VTL software, which it integrates with its choice of servers and storage for its 6XXX and VLS 300 families of virtual tape libraries. The OEM relationship was announced in May of 2005.
TW: Could you provide your view of the role tape media should play in data storage please?
MIKI SANDORFI: Tape has a place for "deep archiving" where the data is not expected to be accessed again but must be retained for legal or regulatory purposes. With the cost of disk retention with de-duplication approaching that of tape, it is now feasible to create an "active archive" for fast recovery of data that is likely to be needed in the future. The line between what should be in the deep archive or the active archive varies by industry and customer.