It's all very well having a multi-tiered Information Lifecycle Management (ILM) system with data-moving facilities but... there is a but. How does an ILM system know which files to move? It has to be told. Who tells it? The default is a storage administrator doing it file type by file type, file owner by file owner, by file inactivity period, or some combination of these.
This might conceivably be manageable in a small company. But start thinking of hundreds, thousands, even tens of thousands of files and it's obvious we need help, we need some form of automation. In fact, we need another layer of software functionality above the ILM storage tiering and data moving.
Techworld has been briefed recently by a Mr X involved in defining such a new layer, and also by Tony Cotterill, CEO of Bridgehead Software, which is announcing its HT FileStore product to automate such things.
The new software layer is a data abstraction layer. To make policy decisions for whatever ILM product we might have, we need data about our data. We need metadata: file attributes such as owner, file type, access profile, content hash, and current storage medium. That way, Mr X said: "We could identify the value of the (information) asset to the individual."
This is most needed with unstructured data: e-mails, documents, spreadsheets, Acrobat files, OLE data, images and so on. Databases generally have metadata-based infrastructures already. Unstructured data has metadata, but there is no organising infrastructure to find it and then act upon it.
When we have the metadata, then: "Once an event happens, do something. I.e. if there has been no access to a file in N days, then move it down the storage hierarchy;" meaning to a cheaper and slower storage medium.
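The "no access in N days, move it down" rule can be sketched as a simple policy function. This is a minimal illustration, not any real ILM product's logic; the tier names and the 90-day default are assumptions.

```python
from datetime import datetime, timedelta

# Illustrative tier list, fastest and most expensive first.
TIERS = ["disk", "nearline", "tape"]

def next_tier(current: str) -> str:
    """Return the next (cheaper, slower) tier, or the same tier if already at the bottom."""
    i = TIERS.index(current)
    return TIERS[min(i + 1, len(TIERS) - 1)]

def apply_inactivity_policy(last_access: datetime, tier: str,
                            now: datetime, n_days: int = 90) -> str:
    """Demote a file one tier if it has not been accessed in n_days."""
    if now - last_access > timedelta(days=n_days):
        return next_tier(tier)
    return tier
```

The event ("no access for N days") drives the action ("demote one tier"); a real implementation would run such checks continuously over the file population.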
Were such an ILM data abstraction layer product to exist, then there would clearly be a performance hit but, "I don't see it as being any great shakes for the individual users."
Such a software layer could also provide security by authenticating users against Active Directory. "Profiles could be built against the AD structure. Which corporate policies affect which users? How should a user's data be treated? E.g. make a copy of anything created and send it to a WORM device. The number of copies can be determined. The exact storage location (tier 1, tier 2, etc.) can be specified."
"Data movement can be carried out by invoking whatever existing software a user might have: Veritas; CommVault; Legato; whatever."
Other enterprise-class suppliers in the ILM space, such as Tivoli, HP with OpenView, EMC with ControlCenter, and StorageTek: "don't have this. No one else is using the data abstraction layer category at the moment. Acopia, Rainfinity and others do bits of it but that's not at the enterprise level."
There needs to be a facility, a software layer, that "lives in the data path and sees every unstructured data bit that passes through."
"What you need to be able to do is set up policies such that MP3 files are only stored on a PC's hard drive, not on shared storage, whereas JPG files may be stored anywhere. This would provide a better use of both information and storage media assets in the future. The standard curve of storage growth would fall to a lower level."
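The file-type placement rule quoted above (MP3 only on a local drive, JPG anywhere) amounts to a policy table keyed by file extension. The table entries and location names below are illustrative assumptions, not from any product.

```python
# Hypothetical placement policy: which storage locations each file type may use.
PLACEMENT_POLICY = {
    ".mp3": {"local"},                          # never on shared storage
    ".jpg": {"local", "shared", "archive"},     # may be stored anywhere
}
DEFAULT_LOCATIONS = {"local", "shared"}         # fallback for unlisted types

def allowed_locations(filename: str) -> set:
    """Return the set of storage locations permitted for this file type."""
    ext = filename[filename.rfind("."):].lower() if "." in filename else ""
    return PLACEMENT_POLICY.get(ext, DEFAULT_LOCATIONS)

def placement_ok(filename: str, location: str) -> bool:
    """Check a proposed placement against the policy."""
    return location in allowed_locations(filename)
```

A policy engine would evaluate such rules as files are created or moved, refusing (or relocating) anything that lands in a disallowed location.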
Data could be de-duped, using the content hashing, "with pointers left behind such that users of the same data items, like e-mail attachments, still think they have their own copy."
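Hash-based de-duplication with pointers can be shown in a few lines: identical payloads are stored once, and each owner keeps a pointer so users still appear to have their own copy. This is a toy sketch under that description, not Bridgehead's or anyone else's implementation.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: one copy per unique payload, per-user pointers."""

    def __init__(self):
        self.blocks = {}    # content hash -> payload, stored exactly once
        self.pointers = {}  # (user, filename) -> content hash

    def put(self, user: str, name: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(digest, data)   # store the payload only once
        self.pointers[(user, name)] = digest   # the user keeps a pointer
        return digest

    def get(self, user: str, name: str) -> bytes:
        """Each user resolves their pointer to the shared copy transparently."""
        return self.blocks[self.pointers[(user, name)]]
```

Two users saving the same e-mail attachment end up with two pointers to one stored block, which is where the storage saving comes from.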
Data could also be identified that was needed for compliance reasons and have special policies applied to it.
"Data is distributed across storage tiers based on things such as its access profile. There has to be a period of non-access before data gets moved to a lower tier. ILM should be about the activity of data and its relevance over time. It's not static HSM; it has to be bi-directional."
Bi-directional? Yes. "Any re-access of data that has been moved down a tier or tiers of storage since it was created means it's moved up a tier and left on the new access tier for a defined time period. For example, you move it up from tape to disk." The access period clock starts ticking and it's only moved back to tape if not accessed within a defined period. If it is accessed again then the clock is reset once more.
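The bi-directional behaviour described above (demote after non-access, promote on re-access, reset the clock on every touch) can be sketched as follows. Tier names, the two-tier simplification, and the 90-day window are illustrative assumptions; times are plain day numbers to keep it minimal.

```python
class ManagedFile:
    """Toy model of bi-directional tiering with an access-period clock."""

    def __init__(self, created_day: int):
        self.tier = "disk"              # new data starts on the top tier
        self.last_access = created_day  # the clock starts ticking at creation

    def touch(self, day: int):
        """Re-access: promote back to disk if demoted, and reset the clock."""
        self.tier = "disk"
        self.last_access = day

    def age_check(self, day: int, window: int = 90):
        """Demote to tape if untouched for the whole access window."""
        if day - self.last_access > window:
            self.tier = "tape"
```

Because `touch` resets `last_access`, a re-accessed file stays on disk for a full window before it is eligible to move back down, which is exactly the clock-reset behaviour described.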
What's the benefit?
A theoretical example would be an organisation with 1.9PB of unstructured data, around 67 per cent of it duplicated. A data abstraction layer-based product with the attributes above could identify around a petabyte of wasted storage. If that was managed storage at a cost of $60/GB, then a saving of some $60 million has just been identified.
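The arithmetic in that example can be checked directly. Treating a petabyte as a round million gigabytes:

```python
# Worked figures from the example above: ~1PB of duplicate data identified,
# managed storage at $60/GB. 1PB is taken as 1,000,000GB for round numbers.
DUPLICATE_PB = 1.0
GB_PER_PB = 1_000_000
COST_PER_GB = 60  # dollars per GB of managed storage

saving = DUPLICATE_PB * GB_PER_PB * COST_PER_GB  # dollars
```

A petabyte at $60/GB comes to around $60 million, which gives a sense of the scale of saving being claimed for de-duplication at this size of estate.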
Also the growth rate of storage capacity could be decisively lowered and kept lower than before.
Bridgehead Software's HT FileStore
Coincidentally and contemporaneously Bridgehead Software's Tony Cotterill says that companies need tools to beat out-of-control data growth. His firm's new software reports and deletes irrelevant data, and performs continuous, automated migration of data for effective ILM.
Companies can regain control over their data growth by removing, at an early stage, data that does not need to be retained, and by archiving non-changing data automatically. They can perform continuous, automated migration of data to appropriate media. The product is aimed at unstructured data but does not include e-mail in its ambit; partnerships are being formed with e-mail archiving suppliers to provide this functionality. Cotterill says: "We provide them with valet parking for data."
Unwanted files can be deleted, removed to an archive, or removed to an archive with a stub left behind. Users (and the file system) think the file is still there. A Bridgehead filter sits in front of the file system and fetches archived files back if the stub is accessed.
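The stub-and-recall mechanism can be sketched in a few lines: archiving replaces the file with a stub, and a filter intercepting reads fetches the real file back when the stub is touched. This is a toy model of the behaviour described, not Bridgehead's filter; all names are illustrative.

```python
class ArchivingFilter:
    """Toy file-system front-end that archives files, leaves stubs, and recalls on access."""

    STUB = object()  # marker left in primary storage in place of the real file

    def __init__(self):
        self.primary = {}   # path -> file contents, or STUB
        self.archive = {}   # path -> file contents on secondary storage
        self.recalls = 0    # how many archive fetches we have triggered

    def archive_file(self, path: str):
        """Move a file to the archive, leaving a stub behind."""
        self.archive[path] = self.primary[path]
        self.primary[path] = self.STUB

    def read(self, path: str) -> bytes:
        """Intercept reads: recall the file from the archive if only a stub remains."""
        if self.primary[path] is self.STUB:
            self.primary[path] = self.archive[path]
            self.recalls += 1
        return self.primary[path]
```

The `recalls` counter makes Cotterill's later objection concrete: any process that blindly reads every file, such as a virus scanner, triggers one archive fetch per stub it touches.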
Cotterill said: “Up to 80 per cent of an organisation’s data has not been accessed in the previous 90 days, and at least 60 per cent of it will never be required in the future, even for compliance purposes. By adopting ‘keep everything online’ strategies instead of ‘delete irrelevant data and archive non-changing data’ strategies, most IT managers have, in effect, contributed to the problems they face today, such as lengthening backup windows, increasing capacity requirements, and spiralling management costs."
Bridgehead's HT FileStore can physically and totally remove files from primary storage, leaving no residual file stubs or directory entries. Instead, those files are moved to secondary storage, where they are fully indexed and secured for future use, removing the ongoing need to back up and manage the online files.
Cotterill doesn't like stubs. If a virus scan is run then the virus checker accesses the stub. The archived file has to be brought back into memory for checking. Do this with hundreds of stubs, or more, and virus scan time shoots up.
The archive is organised with a Windows Explorer type interface. The software can be driven manually but is usually used in full automatic mode. It is available now and charged for on a per GB basis. Think £10,000-12,000 for a terabyte of data.
It's also part of Bridgehead's integrated storage management suite of products, not a stand-alone product.
Adding to the ILM layer cake
Hamish Macarthur, CEO of Macarthur Stroud International, said of the Bridgehead product, in an echo of Mr X's sentiments: "Setting policies to automatically manage storage resources is a key benefit for users. When this is achieved by a tool that also addresses archiving and compliance policies, Storage Resource Management is now addressing more than lower storage costs, it is delivering real benefits to the business. It is not just identifying a problem, it is doing something about it."
Bridgehead is ahead of the pack here. Its software embodies the approach that Mr X is advocating. It would not be surprising if other suppliers recognised the same need and responded with product.