Document originality and integrity can be vital in legal proceedings and compliance situations. Cast doubt on a document's authenticity and millions of pounds could be lost.

There is a concern in some quarters of the storage industry that sub-file-level de-duplication, because it necessarily alters the original representation of a file, compromises a stored document's authenticity and so renders it unusable, or less usable, in a court of law.

De-duplication equivalent to electronic tampering

Say you need to prove that an electronically-stored file or object is the original document. If it was generated electronically in the first place, say as a Word file or an Outlook Express email, then storing it in its native format is straightforward. Store it on write-once, read-many (WORM) media and you have a pretty rock-solid case that your stored file is the same as the original file.

Now say you de-dupe the stored file. It is altered as sections of it are replaced by pointers to byte strings stored elsewhere. Yes, it can be reconstructed but you can no longer say that it is the original document or email. It isn't. It is a representation of it in a different format.
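To make the mechanism concrete, here is a minimal Python sketch of sub-file de-duplication. It is a toy, not any vendor's implementation: the fixed four-byte chunk size, the use of SHA-256 digests as pointers and the names `dedupe` and `reconstruct` are all assumptions made for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # illustrative only; real products use far larger (often variable) chunks

def dedupe(data: bytes, store: dict) -> list:
    """Split data into fixed-size chunks, keep each unique chunk once, return pointers."""
    pointers = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        key = hashlib.sha256(chunk).hexdigest()  # the "pointer" is the chunk's digest
        store.setdefault(key, chunk)             # store the chunk only if not yet seen
        pointers.append(key)
    return pointers

def reconstruct(pointers: list, store: dict) -> bytes:
    """Follow the pointers in order and join the chunks back together."""
    return b"".join(store[key] for key in pointers)

store = {}
doc = b"ABCDABCDABCDWXYZ"
pointers = dedupe(doc, store)
print(len(pointers), "pointers,", len(store), "unique chunks stored")
print(reconstruct(pointers, store) == doc)
```

What sits on disk after `dedupe` is a list of pointers plus a pool of shared chunks, not the file's original byte sequence, which is exactly the point the doubters make.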

Here is Gary Watson, Nexsan's CTO, talking about the subject: "Assureon stores files in a very straightforward XML format which could be easily understood in a court proceeding (e.g. during forensic cross-examination), whereas as far as I know the sub-file systems physically store files as recursive lists of pointers to blocks (or something even more complex) which would be challenging to explain to a judge or jury. It’s a layer of potential risk we want to avoid."

Nexsan isn't offering sub-file-level de-dupe with its Assureon product.

Andy Hale, the technical manager at storage integrator B2net, thinks differently: “There is no reason why a sub-file-level de-duplicated document or mail file cannot be presented to a court of law for compliance as long as the contents can be proven to be unaltered. All disk arrays store things in different ways using different block sizes and file systems; the fact it is de-duped should not alter the validity of this evidence in court.”

"There are products out there that can offer this type of compliance, enabling organisations to demonstrate that files have not been tampered with - by showing file and access history, which can then be used in a court of law. Software products such as Symantec Enterprise Vault will allow administrators to track versions of documents and allow legal searches to be done."

"There are storage products on the market that achieve this level of protection, such as EMC Centera and NetApp A-SIS (de-duplication technology) and SnapLock (compliance/WORM technology) that work at the storage level and meet current US legislation requirements. These latter two provide a disk-based WORM device that guarantees that data written to it cannot be tampered with."

David Ebsworth, Technology Director of oncore IT, an IT support company that provides Asigra's software as part of its managed service solutions, takes the same view: "The simple answer is yes, a de-duplicated file can be presented as an unaltered original document or mail in a court of law in a compliance situation."

"The whole reason for de-duplicating files is so that they take up less space in storage. The reason isn't to change the file and the technology used in de-duplication doesn't and can't change the file content. When a file is taken out of storage - whether it is primary or secondary (as in the case of Asigra's backup software) storage - it is automatically reconstructed back to the original document structure, with the same date on it as when the original document was last modified and the same digital signature."

"You wouldn't deploy data de-duplication if the files you restore are different to the original file. Storing and recovering de-duplicated files does not change the original file."

Others argue that long-established storage techniques, such as RAID, mean sub-file de-dupe should not be regarded as a special case.

The RAID precedent

Kevin Platz, Data Domain's EMEA sales MD, says that altering the electronic representation on disk is nothing new, and points to RAID: "As long as hard disk is a legal media in the case in question, all data stored on today's RAID systems is 'altered.' RAID systems split up data, and then re-assemble it whenever the file is requested. Generally speaking, compliance allows vendors to RAID, stripe (or de-duplicate) data as they see fit. As long as the RAID system, for example, presents the document back in its unaltered form, then that's fine from a legal point of view."

Mimecast's CTO, Neil Murray, has a similar view: “I would argue that de-duping in its many forms is somewhat irrelevant to compliance. It is just a form of intelligent storage abstraction that enables you to use less disk space as your real data volumes increase."

"You could go all the way down and ask whether converting a document into a long string of zeros and ones and storing it electronically as such and then later reconstructing and presenting it using 'software' would render it as altered for compliance purposes."

"Or perhaps if we assert that manipulation within the boundaries of a file define the degree to which it has been altered we should look at other common storage technologies for reference points. Take RAID for example. RAID 5 maintains parity data that is not local to any one file, and this parity data may be used to reconstruct the original data of a file or parts thereof in the event of physical disk failure."

"Does this render the file altered? What if the RAID 5 system also used data compression that spanned multiple files? Then you have essentially created an even greater 'data fruit salad' than sub-file-level de-duplicating - yet do we question the compliance readiness of such data?”

Obviously not, is his answer.
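The RAID 5 reconstruction Murray describes can be sketched in a few lines of Python. This is a simplified illustration that assumes three equal-sized data blocks and a single XOR parity block; real arrays rotate parity across disks and work on much larger stripes, and the helper name `xor_blocks` is invented for the example.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Three data blocks striped across three disks, with parity held on a fourth.
d0, d1, d2 = b"cont", b"ract", b" v1."
parity = xor_blocks(d0, d1, d2)

# The disk holding d1 fails: rebuild its contents from the survivors plus parity.
rebuilt = xor_blocks(d0, d2, parity)
print(rebuilt == d1)
```

The rebuilt block was never read from the disk it was written to. It was computed from other data, yet it is bit-for-bit what was originally stored.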

Proving originality

The position taken by Martin Baldock, operations manager at electronic discovery firm Kroll Ontrack, is relevant here.

He thinks that the actual format in which electronic data is stored is not the key thing. In effect, all electronic representations reconstruct a file for viewing or printing. What matters is that the content is original, not that the representation is in WORM format.

He said: "We look at the hash value of the file's contents compared to what we know was the original value. We are told, for example, that file A on disk is the original file and we compute its hash value and compare it to other copies of the file to see if it has changed." He couldn't necessarily say what had changed, only that something had.

The hash value is the determinant, and even so small a change as adding an extra space between words will alter it.
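A short Python sketch shows how sensitive a hash is. Baldock does not name the algorithm Kroll Ontrack uses, so SHA-256 here is an assumption, as are the sample sentences.

```python
import hashlib

original = b"The parties agree to the terms."
altered = b"The parties agree  to the terms."  # one extra space after "agree"

h_original = hashlib.sha256(original).hexdigest()
h_altered = hashlib.sha256(altered).hexdigest()

print(h_original)
print(h_altered)
print(h_original == h_altered)  # False
```

The two digests differ, but nothing in them says where or how the text was changed, which matches Baldock's point that the hash reveals only that something is different.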

His concern with sub-file-level de-duplication is with the reconstruction of the file when it is needed. "If you are recomputing the file from the components how confident are you that a bit pattern is exactly the same and so will compute exactly the same hash value? It would be a huge burden of concern to me."

Nexsan's Gary Watson is also a strong proponent of hashing as well as other measures to ensure file integrity: "Assureon is highly obsessed with data integrity – files are serialised, stored at least twice on separate RAIDs, and possibly stored on two RAIDs at a DR site, and in all cases are protected with two different hash algorithms which are checked every time the file is touched (plus a dozen other integrity features I won’t bore you with here)."

Referring to de-dupe, he said: "In contrast, a given sub-block (say, of zeros) might be referenced by a million files, and the corruption of this single sub-block could have wide-ranging impact through a wide swath of files. It’s like a failure 'amplifier'. I’m not saying this is an impossible challenge to overcome, but an enterprise-class solution to the problem is non-trivial."

The legal holy grail

This seems to be the key thing here. Whatever form the electronic document is stored in, whether RAIDed and striped or de-duplicated, as long as it can be provably reconstructed in an unaltered form then it should be acceptable in a court of law.

One way to do that is by computing the file's hash value before electronically altering its representation and then re-computing the hash value when the file is to be used for compliance or legal purposes.

If they are the same then the file is good. If they are not then it isn't.
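As a rough Python sketch of that before-and-after check, assuming SHA-256 as the hash and using the hypothetical helper names `fingerprint` and `is_unaltered`:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Digest recorded before the storage system touches the file."""
    return hashlib.sha256(data).hexdigest()

def is_unaltered(recorded_digest: str, retrieved: bytes) -> bool:
    """Re-hash whatever the storage system hands back and compare."""
    return fingerprint(retrieved) == recorded_digest

original = b"Dear Sir, please find the contract attached."
recorded = fingerprint(original)

faithful = bytes(original)                              # stand-in for a correct reconstruction
corrupted = original.replace(b"contract", b"contrast")  # stand-in for a faulty one

print(is_unaltered(recorded, faithful))   # True
print(is_unaltered(recorded, corrupted))  # False
```

The check is indifferent to how the bytes were stored in between. It says nothing about whether a particular product reconstructs files faithfully; it only detects the cases where one does not.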

Will they be the same after the file has gone through a de-duplication process?

No-one knows for sure and, until it can be proved that they are the same, de-dupe doubters have a point.