EMC has its Symmetrix, Celerra (NAS) and Centera (fixed content) disk array offerings. The company has also recently bought Legato (backup and e-mail archiving SW amongst others) and Documentum (document content management SW). We talked to Roy Sanford, VP markets and alliances for EMC in its Centera division, about the concepts EMC has that drive its product strategies.
One of the most important things that EMC sees is a real ramp up in the need to store semi-structured and unstructured information. Sanford asks: "Will the amount of unstructured and semi-structured data grow and grow and surpass structured data? Yes, unquestionably. Up to three quarters of the data being created today is unstructured."
What has recently changed is that organisations need to retain their non-structured data for longer than before so as to be able to demonstrate compliance with legislative and regulatory requirements, such as the Sarbanes-Oxley Act in the USA and the rules of the Financial Services Authority in the UK.
Sanford refers to a February 2004 Butler Group White Paper on Compliance and Record Management. It states that 'various pieces of legislation ... will demand that content is maintained as proof of operation.' Further, 'content has to have a single provable instance based on a given point in time.'
It is not good sense to hold all this data on fast disk. Equally, it is not a good idea to back it all up to tape. How will you find it, if required?
A compliance officer may be asked to 'find all information held electronically dealing with customer X'. Sanford says this requires SW which is aware of the data's content. For example, database SW must be able to retrieve all records dealing with X. E-mail SW must be able to retrieve all mails referring to X and so on.
He says that there is meta-data for e-mails, Word documents and other content that can define, amongst other things, what the content is and how it should be dealt with. Fixed content data requires meta-data to be stored with it so that application software can deal with it. Thus, a mortgage record might be stored with a policy that says 'move it to archive storage upon completion'.
What this means is that information lifecycle management (ILM) "is not just HSM." The old hierarchical storage management idea moved data between tiers of storage, generally from expensive fast disk to progressively cheaper, slower and more capacious media as it aged. There was little intelligence involved.
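To make the contrast concrete, the age-based behaviour of classic HSM can be sketched in a few lines of Python. The tier names and age thresholds below are illustrative assumptions, not figures from EMC or Sanford; the point is only that the decision uses age and nothing about the content.

```python
from datetime import date

# Hypothetical tiers, fastest and most expensive first, each with the
# minimum age in days at which data is demoted to it (assumed values).
TIERS = [
    ("fast_disk", 0),
    ("cheap_disk", 90),
    ("tape", 365),
]


def hsm_tier(last_modified: date, today: date) -> str:
    """Pick a storage tier from age alone; the content is never inspected."""
    age_days = (today - last_modified).days
    chosen = TIERS[0][0]
    for name, min_age in TIERS:
        # Keep the last tier whose age threshold has been reached.
        if age_days >= min_age:
            chosen = name
    return chosen
```

A mortgage record and a holiday photo of the same age would land on the same tier here, which is the lack of intelligence Sanford is pointing to.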
With today's compliance requirements ILM software has to be content-aware. To know that data is structured or unstructured is not enough. Sanford says that "structured data is something that is actively changing." (A transaction in progress in a TP system or an individual current account in a bank customer database.)
"Unstructured data is fixed. It's an object that doesn't change. You read it, write it, verify it's authentic and (eventually) delete it. Semi-structured data means collaborative documents, spreadsheets or something like an individual working on a tax return, an active e-mail environment or shared CAD/CAM files."
He describes three main phases of use: structured and transactional; semi-structured and collaborative; unstructured, fixed and unchanging. "Information moves through these phases based on its source and use. A stock transaction in an Oracle database (structured) could move to a fixed content store (unstructured) when it's closed."
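The three phases can be pictured as a small state machine. The Python sketch below is only an illustration: the event names are hypothetical, the "closed" transition mirrors Sanford's stock transaction example, and the "finalised" transition for collaborative documents is an assumption added for symmetry.

```python
from enum import Enum


class Phase(Enum):
    STRUCTURED = "structured, transactional"
    SEMI_STRUCTURED = "semi-structured, collaborative"
    UNSTRUCTURED = "unstructured, fixed"


# Hypothetical events and the phase each one moves a record into.
TRANSITIONS = {
    # From the article: a closed stock transaction becomes fixed content.
    (Phase.STRUCTURED, "closed"): Phase.UNSTRUCTURED,
    # Assumption: a collaborative document becomes fixed once finalised.
    (Phase.SEMI_STRUCTURED, "finalised"): Phase.UNSTRUCTURED,
}


def next_phase(current: Phase, event: str) -> Phase:
    """Return the phase after an event; unknown events leave it unchanged."""
    return TRANSITIONS.get((current, event), current)
```

Fixed content has no outgoing transitions in this sketch, matching the description of an object that is written, read, verified and eventually deleted.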
Sanford believes that, "There is no enterprise-wide ILM today. It's very, very complex." He also says that ILM in this era of compliance needs to be able to show that data is intact and has not changed. This relates to the Butler idea of a 'single provable instance' mentioned above. Centera has this kind of functionality.
Thinking again of the compliance officer and the request to find all information dealing with X, the officer needs to be able to query the three types of data containers: structured; unstructured; and semi-structured. Currently that generally means three different software interfaces.
Alternatively we could think of using only one container and thus one interface. Sanford asks, "Should databases keep unstructured data?" and answers the question by pointing out that keeping, for example, digital photos in databases will dramatically increase their size. The larger a database the poorer its performance, generally.
Better to have a pointer in the database which points to the unstructured object. "We think a content address is the right method. We store data and meta data about the data with an abstraction layer between them."
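A minimal Python sketch of the content address idea follows. It assumes the address is a SHA-256 digest of the object's bytes; the article does not say how Centera derives its addresses, and the `ContentStore` class and its methods are hypothetical names, not an EMC API.

```python
import hashlib


class ContentStore:
    """Toy fixed content store: objects are addressed by a hash of their bytes."""

    def __init__(self):
        self._objects = {}   # address -> content bytes
        self._metadata = {}  # address -> meta-data, held apart from the content

    def put(self, content: bytes, metadata: dict) -> str:
        # Assumption: the content address is the SHA-256 digest of the bytes.
        address = hashlib.sha256(content).hexdigest()
        self._objects[address] = content
        self._metadata[address] = dict(metadata)
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]

    def describe(self, address: str) -> dict:
        return dict(self._metadata[address])

    def verify(self, address: str) -> bool:
        # Re-hash the stored bytes; a mismatch means the content has changed.
        return hashlib.sha256(self._objects[address]).hexdigest() == address
```

In this arrangement the database row holds only the short address string returned by `put`, not the photo itself, and the same address lets an application check later that the object is unchanged.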
Back to three interfaces. There is, in a way, a missing link here. Our compliance officer really needs one application portal through which to execute a request. Thus an application, for example EMC's Documentum, could be developed to fire off queries to a database to find all records relating to X, and similar queries to the e-mail and fixed content stores.
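The single portal can be sketched as a function that fans one request out to each container and gathers the answers. This Python sketch is purely illustrative: the container names, `make_searcher` and `portal_search` are hypothetical, and the in-memory lists stand in for real database, e-mail and fixed content interfaces.

```python
from typing import Callable, Dict, List

# A searcher takes a search term and returns the matching items.
Searcher = Callable[[str], List[str]]


def make_searcher(items: List[str]) -> Searcher:
    """Build a stand-in searcher over an in-memory list (assumption)."""
    def search(term: str) -> List[str]:
        return [item for item in items if term.lower() in item.lower()]
    return search


def portal_search(term: str, containers: Dict[str, Searcher]) -> Dict[str, List[str]]:
    """Send one request to every container and collect results by container."""
    return {name: search(term) for name, search in containers.items()}


containers = {
    "database": make_searcher(["account 17: customer X", "account 18: customer Y"]),
    "e-mail": make_searcher(["Re: customer X complaint", "Lunch on Friday?"]),
    "fixed content": make_searcher(["scanned mortgage deed, customer X"]),
}
results = portal_search("customer X", containers)
```

The compliance officer issues one request; the three different interfaces stay hidden behind the portal.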
The ILM area is developing fast and a storage infrastructure is being developed to store and safeguard data, move it to appropriate platforms, verify that it is intact, and locate it, wherever it is stored.
NB. A downloadable EMC White Paper discusses the UK compliance issues further.