XAM (eXtensible Access Method) is potentially a huge revolution in the way data is accessed on storage devices. Through the XAM initiative, the SNIA is developing a standard way to access and present metadata, that is, information about stored data. Its current focus is fixed content data, but it should be extensible to any stored data and could provide a common way to store data objects of all sorts (text, spreadsheets, mail, sound recordings, images, whatever) along with their metadata. It should also provide a way for any stored information to be searched in response to a compliance or legal discovery request.

This view has been developed after a conversation with Paul Talbut, chairman of SNIA in Europe. Let's start with a little history. In December last year Techworld ran a story - New storage standard proposed. It introduced SNIA's XAM initiative and pointed out "The (XAM) interface would reside between applications and storage systems and would co-ordinate metadata for long-term archiving, interoperability and automation."

That's not the half of it.

In January this year we noted that Hitachi (might) do CAS. The story said that Hitachi Data Systems (HDS) had persuaded SNIA's XAM group "that file system interfaces such as NFS and CIFS should be included" as well as an API-led approach.

What's it all about?
Originally EMC and IBM got together and developed some IP to describe how chunks of data in a fixed content data store, one using a content-addressing scheme (CAS) such as EMC's ground-breaking Centera, could be coupled with metadata presented in a standard way and given a unique and persistent name/number that lasted as long as the data was stored.

As our first Techworld XAM article stated, "The X-Access Method initiative began October 2004 as a collaborative project between IBM and EMC, subsequently joined by HP, Hitachi Data Systems and Sun. In September 2005, the XAM presented the proposal to SNIA, which after reviewing it passed it to a SNIA working group developing standards for fixed content data."

Once CAS data chunks are named and described in a standard way then they can be accessed by applications other than the one that originated them. Furthermore they can, in theory, be moved to another storage device.

"You mean", I asked Paul Talbut, "that EMC was working out with IBM how it could let other suppliers store data held on Centera?" "Yes", he replied, "Isn't that interesting!" Indeed it is.

Proprietary CAS
The approach to CAS, to storing fixed, static or reference data, is to split it up into hash-addressed chunks. Each chunk is unique and has metadata associated with it, such as originating file and position in that file, length, type, date created, access rights, and so forth. Any new files being stored on the CAS device are also chunked up and a check is made for duplicates, i.e., for already existing chunks. If any are found then a new file reference is added to the existing chunk and the newly created but duplicate chunk is deleted. CAS stores therefore de-dupe data automatically. This is the EMC Centera approach, one also used by Archivas' ArC.
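The chunk-and-de-dupe behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Centera or ArC actually work: the fixed 4-byte chunk size, the use of SHA-256 as the content address, and the names `ToyCasStore`, `put_file` and `get_file` are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # tiny fixed-size chunks, purely for illustration


@dataclass
class Chunk:
    data: bytes
    # Each reference records (originating file, offset in that file).
    references: list = field(default_factory=list)


class ToyCasStore:
    """Hypothetical in-memory content-addressed store (not any vendor's design)."""

    def __init__(self):
        self.chunks = {}  # content address (SHA-256 hex digest) -> Chunk

    def put_file(self, name: str, payload: bytes) -> list:
        """Split payload into chunks, store each once, return the addresses."""
        addresses = []
        for offset in range(0, len(payload), CHUNK_SIZE):
            piece = payload[offset:offset + CHUNK_SIZE]
            address = hashlib.sha256(piece).hexdigest()
            # A duplicate chunk is not stored again; it only gains a reference.
            chunk = self.chunks.setdefault(address, Chunk(data=piece))
            chunk.references.append((name, offset))
            addresses.append(address)
        return addresses

    def get_file(self, addresses: list) -> bytes:
        """Reassemble a file from its ordered list of chunk addresses."""
        return b"".join(self.chunks[a].data for a in addresses)
```

Storing `b"abcdabcdwxyz"` with this sketch keeps only two chunks, because the repeated `abcd` piece hashes to an address that already exists and simply picks up a second file reference.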

Storage is efficient but proprietary. An application accessing the ArC store could not access the Centera store and vice versa.

The same is true for data transfer between or access across other fixed content CAS products:-

- Sun's Data Management Group (StorageTek) and IntelliStore,
- Permabit's Permeon,
- Nexsan's Assureon,
- Avamar's Axion.

XAM should alter this. Talbut says: "SNIA with XAM is developing a standardised way for applications to access metadata via standard API." What an application could do is to use the metadata to identify classes of stored data which can be placed on different tiers of an information (or data) lifecycle management scheme. It could identify retention periods. It could also deal with what Talbut calls the 'immutability' problem.

Data written to a WORM optical disc (write once: read many) is inherently trustworthy. It can't be altered. It's in its original state and intact. Data written to rewriteable disk or tape is theoretically changeable, even with some kind of WORM lock on it.

With XAM each chunk of data is unique and has a persistent name and metadata stored with it. We're talking here about billions of data objects potentially. The fact that a data object exists in this scheme can be used to indicate that it has been unchanged in its data content since it was created.

And this immutability holds even though the data chunk could have been moved from one storage device to another in a technology refresh, and even from one application's fief to another.
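One way to picture why immutability can survive a move is the check below. It is a sketch under a loud assumption: that an object's persistent name is derived from a hash of its content, as in a content-addressed store. XAM may define names quite differently, and `content_address`, `migrate` and `is_unchanged` are hypothetical helpers, with plain dictionaries standing in for storage devices.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Assumed naming rule: the persistent name is the SHA-256 of the content."""
    return hashlib.sha256(data).hexdigest()


def migrate(source: dict, target: dict) -> None:
    """Copy every object to another store, keeping its persistent name."""
    for name, data in source.items():
        target[name] = data


def is_unchanged(store: dict, name: str) -> bool:
    """True if the stored bytes still hash to the name they are filed under."""
    return name in store and content_address(store[name]) == name
```

Under that assumption the check gives the same answer on the old device and the new one, and any alteration of the bytes after migration makes the name and the content disagree.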

There is more.

An XAM data object could aid searching because the metadata could be used by a search engine. Talbut explains that, of course, (in my words) comparatively trivial metadata could be used: author; e-mail subject, etc. But far richer metadata could be used. Suppose the metadata contained keywords?

A search engine looking at a CAS device currently sees nothing. Centera is opaque to Google Desktop, for example. Talbut says: "We want to find a consistent way of searching metadata as a longer term aim. It's good for compliance and legal discovery."

We could conceive of search engines being developed that are XAM-compliant.

The metadata for a data chunk could contain keywords from it. Then a search engine could look for 'Starbucks' as a keyword and find all data chunks whose metadata contained it. These chunks could be part of documents, e-mails, presentation slides and spreadsheets. In effect, because of XAM and metadata containing keywords, the whole fixed content data store could become a single search space. An XAM-compliant Google Desktop could search through XAM-compliant versions of Centera, Archivas ArC, Sun IntelliStore or Permabit's Permeon.
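The single search space idea can be sketched as follows. Everything here is hypothetical: XAM had not defined a search interface at the time of writing, so `StoredObject`, its fields and the `search` function are invented purely to show one query running across several stores that expose metadata in the same shape.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    name: str          # persistent object name
    source_type: str   # e.g. "e-mail", "spreadsheet"
    keywords: set = field(default_factory=set)


def search(stores: dict, keyword: str) -> list:
    """Return (store label, object name) pairs whose metadata has the keyword."""
    wanted = keyword.lower()
    hits = []
    for label, objects in stores.items():
        for obj in objects:
            # Case-insensitive match against the object's keyword metadata.
            if wanted in {k.lower() for k in obj.keywords}:
                hits.append((label, obj.name))
    return hits
```

The search never touches the data itself, only the metadata, which is why a common metadata format matters more here than a common storage format.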

There is an obvious fit here with NetApp and Kazeon, where there is a similar intent to open up stored backup data to search.

The point about compliance and legal discovery is that all of a business's electronically stored data needs to be searchable: unstructured data, structured data, content-addressed data and backup and archive data. Time waits for no man, particularly not a storage sysadmin struggling to respond within the 19-day time limit of a SarBox request.

If there were a way to open up currently proprietary and opaque data stores to an automated search through a single pane of glass then that would be a goal worth pursuing. Talbut says that XAM is about: "How we can create storage objects so that metadata is more meaningful and how can we increase our chances of being compliant? XAM is just the tip of the iceberg. The industry has to improve its overall ability to search on keyword and metadata."

ODF - the Open Document Format
ODF is a similar idea to XAM. It is attempting to bring a standardised way to store and access unstructured information, non-static information, in word processing documents, e-mails, spreadsheets, web pages, etc. Currently these are stored in the proprietary format that the editing software used to create and store them uses: dot PPT or dot DOC for example. It would aid openness and searchability, hence compliance and legal discovery needs, if common formats for each type could be used such that any word processor could access any document created by any other word processor.

ODF is trying to do at a filesystem level what XAM is trying to do at a data chunk level. The XAM people see obvious extensibility of their ideas to the world of unstructured information. If word processing documents, for example, were stored in an XAM way then any word processor capable of using XAM access could retrieve, edit and store any other XAM-compliant word processor's documents. The same goes for spreadsheets, browsers, e-mail clients, presentation editors, media players and any other application that accesses and manipulates data.

Massive ramifications
What XAM and ODF are both trying to do is to introduce a formal abstraction layer between applications and the data they create and the storage the data resides on. Then data can be freely moved between storage devices, accessed by different applications and searched in a consistent way across different data types.

XAM could be extended to cover unstructured data. Talbut offers this thought: "If we succeed in this (in fixed content XAM) who's to say we can't move into unstructured data as well?"

It is a massive task. It will take years. Talbut says: "It's very early days. Development is under way but this is a statement of intent."

He adds: "Think about the ongoing ability to read a storage object in 50 years' time. That (XAM) lessens the impact of technology refresh. We need to improve business' response to regulatory compliance and court requests. We need to be able to produce 100 percent guaranteed data pursuant to such requests."

For Talbut XAM is the key to this: "Without standards this won't happen. XAM is important. What's beyond it is huge."

XAM for fixed content has a lot of energy and committed suppliers behind it. It will take time. The more end users join in through SNIA the better. It may be a long journey to extend it to unstructured data but it is worth the effort.