Here is the second part of the feature on Object Storage Devices. Part one can be found here.

In reference to the limitations of SAN and NAS approaches...

These problems are architectural in nature. Steps have been taken to alleviate them, but these approaches ultimately fall short.

Some have tried to couple multiple machines into a single logical NAS server. This can scale the performance of a NAS system somewhat. However, the number of communication links between members of the server cluster grows quadratically with the number of servers: a full mesh of N servers requires N(N-1)/2 links. Ultimately, the bandwidth necessary to maintain cache coherency outpaces the capacity of the communications mechanism.

Other companies have created file system components meant to be installed on every client of a SAN. These components coordinate each client with all of the other clients so that they share access to a common data set. However, these efforts are proprietary in nature, and are therefore limited in application.

In contrast, OSA (Object Storage Architecture) manages to combine the best attributes of NAS and SAN, while overcoming their aforementioned limitations, by means of a novel architecture.

The fundamental innovation that enables this architecture is, as you noted in your article, the reassignment of responsibility for mapping streams to sectors from the client or server to the storage device itself (the OSD).

Once the OSD can perform the stream-to-sector mapping, the clients no longer need to agree on what algorithm to employ for this mapping. This is one of the aspects that makes OSD so suitable for data sharing in a cross-platform environment. (I will discuss the other major aspect of data sharing below.)
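To make the distinction concrete, here is a minimal sketch of who performs the mapping in each world. The class and field names are mine, purely for illustration; nothing here is taken from the SCSI OSD spec.

```python
class BlockClient:
    """SAN-style client: it must know the file system's on-disk allocation scheme."""
    def __init__(self, allocation_table, device):
        self.allocation_table = allocation_table   # filename -> list of sector numbers
        self.device = device                        # the raw block device as one byte string

    def read(self, filename, offset, length):
        # The client itself maps the byte stream to sectors; every client sharing this
        # device must therefore agree on exactly the same allocation scheme.
        sector_size = 512
        stream = b"".join(
            self.device[s * sector_size:(s + 1) * sector_size]
            for s in self.allocation_table[filename]
        )
        return stream[offset:offset + length]


class ObjectStore:
    """OSD-style device: the mapping of streams to sectors lives inside the device."""
    def __init__(self):
        self.objects = {}                           # OID -> byte stream; internal layout is hidden

    def read(self, oid, offset, length):
        # Clients name only the object and a byte range; how that lands on sectors
        # is the OSD's private business.
        return self.objects[oid][offset:offset + length]
```

With the object interface, heterogeneous clients need only agree on OIDs and byte ranges, not on any on-disk layout.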

This cross-platform data sharing delivers the major benefit of NAS over SAN, and is inherent in OSA.

OSA delivers scalable performance by virtue of the fact that data flows directly between the client(s) and the OSD(s), without any need to pass through a server. This allows aggregate system performance to rise to the inherent capacity of the underlying fabric. This direct pipe from clients to storage delivers the major benefit of SAN over NAS, and is inherent in OSA.

I have so far described several benefits of OSA, without having yet described the structure thereof. It is important to note that these benefits require a rethinking of the parties involved in any data transaction. OSA is a tripartite architecture. In addition to the clients and storage units (OSDs) that we are familiar with, we need to introduce a third actor - the MetaData Server (MDS).

While it is an over-simplification, it is helpful to think of the MDS as having responsibility for maintaining all of the file system other than the mapping of streams to sectors. The MDS is where any hierarchical directory structure would be maintained, along with permissions, file-scope locking, etc.

Without getting into the security aspects, a typical scenario would have a client walking a directory tree to find a file by means of communications solely with the MDS. Once the file is located in the file system tree, the MDS returns the name of the SCSI target and LUN of the OSD that houses the relevant object, along with the Object ID (OID) of the object within that OSD.

The client then builds a SCSI CDB specifying the OID and the byte range of interest within that OID, along with an op code (READ, WRITE, etc.), and sends it to the relevant OSD. The OSD then responds with the appropriate data transfer operation (after checking the credential sent as part of the CDB).
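Expressed as pseudocode, the round trip looks roughly like this. The mds and fabric objects, the field names, and the dictionary standing in for a CDB are all hypothetical stand-ins for illustration; the actual CDB layout is defined by the T10 SCSI OSD spec.

```python
from dataclasses import dataclass

@dataclass
class ObjectLocation:
    target: str        # SCSI target housing the OSD
    lun: int           # logical unit number of that OSD
    oid: int           # Object ID of the object within the OSD
    credential: bytes  # capability granted by the MDS

def read_file(mds, fabric, path, offset, length):
    # Step 1: metadata traffic only - walk the tree and locate the object via the MDS.
    loc = mds.lookup(path)                    # returns an ObjectLocation

    # Step 2: data traffic goes straight to the OSD; the MDS is no longer in the path.
    osd = fabric.connect(loc.target, loc.lun)
    cdb = {
        "opcode": "READ",
        "oid": loc.oid,
        "offset": offset,
        "length": length,
        "credential": loc.credential,
    }
    return osd.submit(cdb)                    # the OSD validates the credential, then transfers data
```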

So we have identified three primary classes in the architecture - client, OSD, and MDS. In any system, there are any number of clients, any number of OSDs, and one logical MDS. So is it not true that, due to the presence of a *single* logical MDS in any system, OSA suffers from the same scalability problem as NAS does? While it is true that all communications must hit the MDS at some point, this occurs at a radically different scale than the situation embodied in NAS.

First and foremost, the data itself never passes through the MDS. All data flow is directly between clients and OSDs. Additionally, once granted access to an object, the client may use the MDS-supplied credential (permission) across multiple accesses. (The MDS grants this capability for a certain time interval, and may revoke it before the stated expiration time.)
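A client-side sketch of that reuse might look like the following. The grant_access call and the lifetime handling are my own invention for illustration; the real capability format belongs to the MDS/OSD security model.

```python
import time

class CredentialCache:
    """Client-side reuse of an MDS-granted credential until it expires."""
    def __init__(self, mds):
        self.mds = mds
        self.cache = {}                 # oid -> (credential, expiry timestamp)

    def credential_for(self, oid, path):
        cred, expiry = self.cache.get(oid, (None, 0.0))
        if cred is None or time.time() >= expiry:
            # Only now does the client return to the MDS; until expiry,
            # subsequent accesses reuse the existing grant.
            cred, lifetime = self.mds.grant_access(path)
            self.cache[oid] = (cred, time.time() + lifetime)
        return cred
```

Should the MDS revoke a credential early, the OSD will reject it, and the client simply falls back to the MDS for a fresh grant.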

Research has shown that a single MDS saturates at a load several orders of magnitude greater than a NAS server does. In other words, a single MDS is capable of serving on the order of 100x to 1,000x more clients and 100x to 1,000x more back-end storage than a NAS server can before running out of bandwidth. Additionally, for performance or reliability reasons, the logical MDS may be implemented as a cluster of machines, much as multiple machines are clustered into a single logical NAS head.

I have glossed over a lot of detail above, but hope that I have shed some light on other aspects of OSA for you.

Next Steps
With the above in mind, where is OSA today, and what are the next steps? There are products shipping now that embody fundamental aspects of this architecture. Most so-called SAN file systems incorporate some aspects of OSA, but are proprietary. Approximately half of the world's ten fastest supercomputers employ the Lustre file system.

The Lustre file system is an implementation of OSA, created as an open source project (though not now, and perhaps never, compliant with SCSI-OSD).

Panasas is shipping an OSA product, and is also active in the standardization effort for the various pieces of the architecture. It demonstrated 11GB/s of performance to a single directory a while back.

You mentioned the Emulex and Seagate demo that you witnessed at SNW. There are many more players in this effort, including my organization, lingua data, which is developing an initiative labeled obstor.

As we go forward from here, much work remains to be done to deliver on the promise of this superior architecture. As I mentioned, the SCSI OSD spec is ratified.

However, it defines only the characteristics of OSDs and the communications between clients and OSDs, saying only as much about the overall architecture, the implementation of MDSs, and the communications between client and MDS as is necessary to frame the responsibility of the OSD.

This limited scope of the OSD spec is largely due to the legacy of T10 specifications, which focus on the target, while saying as little about the initiator as possible.

There is an effort afoot in the IETF to define the communications between the clients and the MDS. So far, it appears that this work will be incorporated into a minor versioning of the NFSv4 spec - perhaps NFSv4.2. This has been dubbed within the NFSv4 working group as pNFS (for parallel NFS), and several Internet Drafts are available on the topic.

As far as I know, there has not yet been any standardization effort on the private channel between the MDS and the OSDs. This channel is used merely for the maintenance of a shared master secret key used in the security mechanisms. The MDS and OSD collaborate on lower-level (working) secret key maintenance over the same interface that the clients invoke upon the OSD - SCSI OSD.
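To illustrate the layering, and only the layering - the real algorithms and message formats belong to the OSD security model, not to this sketch - the relationship between the master secret and the working keys is roughly:

```python
import hashlib
import hmac
import os

def new_master_secret():
    # Exchanged between the MDS and an OSD over their private channel,
    # which (as noted) is not yet covered by any standard.
    return os.urandom(32)

def derive_working_key(master_secret, generation):
    # Lower-level working keys are refreshed periodically; that refresh travels over
    # the same SCSI OSD interface the clients use, keyed off the shared master secret.
    return hmac.new(master_secret, b"working-key-%d" % generation, hashlib.sha256).digest()

def mac_capability(working_key, capability_bytes):
    # The MDS integrity-protects a client's capability with a working key; the OSD,
    # holding the same key, recomputes the MAC to validate the credential in the CDB.
    return hmac.new(working_key, capability_bytes, hashlib.sha256).digest()
```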

Some time will pass before all of the necessary parts of the architecture are standardized. However, many of us who are aware of OSD feel that its long-term prospects include relegating both NAS and SAN to legacy environments. This will of course take time - more than a decade - but I feel it is destined to happen.

If there is one point I would like you to take from this discussion, it would be this: Yes, it is true that OSA brings the benefits of device-managed replication and the other capabilities you mention. However, I believe the big hitter is combining the cross-platform data sharing of NAS with the scalable performance of SAN, abandoning the limitations of both, and wrapping it all in strong, fine-grained security. *That's* the promise of OSA.