Within the next five months EMC will announce its cloud computing/web 2.0 storage products, code-named Hulk and Maui. What do we know so far about them?

For the purposes of this feature the terms cloud computing and web 2.0 refer to storage applications characterised by potentially extremely rapid and high growth in capacity to multi-petabyte levels, unstructured and semi-structured content, and possible remote delivery of storage services in a utility style.

We should be thinking of Google-type storage scale and not Isilon or NetApp ONTAP GX. IBM's XIV purchase appears to be aimed towards this area. EMC's recent MozyEnterprise announcement is also a cloud computing/web 2.0 application and EMC has indicated that the platform behind it will be used for backup and recovery and archiving services.

EMC CEO Joe Tucci first revealed them at a November 2007 analysts' conference. Hulk is hardware and Maui is software. Together they form a clusterable storage system, built on commodity software and servers, for multi-petabyte capacity storage applications.

Of Maui we know that it includes part of what a clustered file system, such as Isilon's, does but goes beyond it. It provides more of a global storage repository than a clustered file system and is orders of magnitude beyond what is currently available on the market, according to EMC.

Of Hulk we know that it will involve clusterable storage units. We can deduce more from what an EMC staffer has written.

EMC clues

The staffer is Chuck Hollis, EMC's VP for technical alliances, who has written about his (EMC's) view of cloud storage needs: "Presenting storage as blocks (e.g. LUNs) won't scale. Presenting storage as files won't scale. You'll need an object-oriented approach with rich semantics - nothing else will work at this uber-massive scale."

"It goes without saying that costs matter, but in a very different way. Take any small cost (hardware, software, energy, administration, etc) multiply it by a very large number, and you have a very large cost."

He thinks cloud storage must be autonomous: "If you can imagine many petabytes with billions of objects in hundreds of locations and millions of users, this means that management is an entirely unique proposition."

"The environment must be self-tuning, and automatically react to surges in demand. It must be self-healing and self-correcting at a massive scale -- like the internet, no single scenario of failures can bring it down."

"The idea of a bunch of administrators sitting glued to multiple consoles, watching indicators and firing off commands -- well, that just won't work here. Not only is it hard for people to react fast enough, no one can afford that much human capital to keep things running smoothly."

He thinks it is universal in terms of access: "We might keep thinking 'browser access,' but that's only one of many potential models for global information ingestion and access. What about my set-top box? A mobile iTunes device? Ingestion of sensor data? RFID? VoIP phones? Security cameras? Or maybe satellites?"

"Thinking browser-oriented stuff is way too limiting, I think. Give yourself some time and fully ponder all the different ways information could be gathered and distributed on a massive, global scale, and you'll start to realise the enormity of the appeal."

He says that cloud storage is infrastructure: "Infrastructure is a platform to build other, more useful things. It takes care of all the hard stuff, so that its users can focus on the interesting and useful stuff. It's dependable. It's available. It's got to be delivered as a commercially-available, carrier-class product supported by a vendor. Just like power, and phones, ..."

Deductions

The environment is one in which the software recognises that disks will fail and that nodes can crash and both things can be handled automatically.

The Hulk HW and Maui SW must provide an autonomous, self-healing, and self-optimising storage environment with the ability to react to demand surges. That suggests it involves servers, running Linux for low O/S cost, with system software that groups Hulk/Maui nodes together into a logical entity offering a virtual pool of object-based storage.

A node has lots of direct-attached storage (DAS), say 1TB SATA drives, and each node's DAS is combined into a single virtual pool.
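To get a feel for the node counts this implies, here is a rough sizing sketch. The `Node` class, the 12-drives-per-node figure and the function names are my own assumptions for illustration; EMC has said nothing about Hulk's actual configuration.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    drives: int            # number of DAS drives in this node (assumed)
    drive_tb: float = 1.0  # the 1TB SATA drives suggested above

    @property
    def raw_tb(self) -> float:
        return self.drives * self.drive_tb


def pool_raw_tb(nodes) -> float:
    """Raw capacity of the virtual pool: the sum of every node's DAS."""
    return sum(node.raw_tb for node in nodes)


def nodes_needed(target_tb: float, drives_per_node: int, drive_tb: float = 1.0) -> int:
    """Nodes required to hit a raw capacity target, ignoring protection overhead."""
    return math.ceil(target_tb / (drives_per_node * drive_tb))


cluster = [Node(f"node-{i}", drives=12) for i in range(3)]
print(pool_raw_tb(cluster))    # raw terabytes across three nodes
print(nodes_needed(5000, 12))  # nodes for 5PB raw at 12 drives each
```

On those assumed figures a 5PB pool needs 417 nodes before any allowance is made for data protection, which is why per-node cost and automatic management matter so much.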

Objects are written across multiple nodes for data protection and there is either a RAID scheme or a RAID-like scheme to protect against drive and node failures. At a node level a redundant array of independent nodes (RAIN) scheme operates like RAID. EMC has RAIN technology courtesy of its Rainfinity acquisition.
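One way to spread copies of an object across nodes can be sketched as follows. This uses rendezvous (highest-random-weight) hashing, a generic technique I have picked for illustration; it is not a description of EMC's or Rainfinity's scheme, and the node names and the `place` function are hypothetical.

```python
import hashlib


def _score(node: str, object_id: str) -> int:
    """Deterministic pseudo-random weight for a (node, object) pair."""
    digest = hashlib.sha256(f"{node}:{object_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_id: str, nodes, copies: int = 2):
    """Pick `copies` distinct nodes for an object by ranking nodes on their score."""
    if copies > len(nodes):
        raise ValueError("not enough nodes for the requested number of copies")
    ranked = sorted(nodes, key=lambda n: _score(n, object_id), reverse=True)
    return ranked[:copies]


nodes = ["node-a", "node-b", "node-c", "node-d"]
holders = place("object-1", nodes)

# Simulate losing the first holder: the second copy is still where it was,
# and the ranking supplies a replacement node for the lost copy.
survivors = [n for n in nodes if n != holders[0]]
new_holders = place("object-1", survivors)
print(holders, new_holders)
```

The point of the sketch is that a node failure leaves the surviving copy in place and only the lost copy has to be rebuilt elsewhere, which is the RAIN-style behaviour described above.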

EMC has not announced an XIV-like acquisition and, even if it had, the lead time to productisation would go well past a July 2008 deadline. Therefore EMC is building the Hulk/Maui infrastructure in-house using existing technology resources.

A self-optimising storage environment is one in which stored objects are moved from node to node to cope with demand variations and optimise data access times. This kind of migration is a Rainfinity capability.
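A minimal sketch of that kind of demand-driven migration, assuming each node tracks a request rate per object, might look like this. It is a simple greedy heuristic of my own, not Rainfinity's algorithm, and it ignores where other copies of an object live.

```python
def node_load(objects: dict) -> float:
    """Total request rate served by one node."""
    return sum(objects.values())


def rebalance_step(cluster: dict):
    """Move one object from the busiest node to the idlest one.

    `cluster` maps node name -> {object_id: requests per second}.
    Returns (object_id, source, destination), or None if no move helps.
    """
    busiest = max(cluster, key=lambda n: node_load(cluster[n]))
    idlest = min(cluster, key=lambda n: node_load(cluster[n]))
    gap = node_load(cluster[busiest]) - node_load(cluster[idlest])

    # Moving an object with rate r changes the gap to |gap - 2r|,
    # so only objects with 0 < r < gap actually narrow it.
    candidates = {o: r for o, r in cluster[busiest].items() if 0 < r < gap}
    if not candidates:
        return None

    # Pick the object that leaves the two nodes most evenly loaded.
    obj = min(candidates, key=lambda o: abs(gap - 2 * candidates[o]))
    cluster[idlest][obj] = cluster[busiest].pop(obj)
    return obj, busiest, idlest


cluster = {
    "node-1": {"a": 50, "b": 30, "c": 20},
    "node-2": {"d": 10},
    "node-3": {"e": 40},
}
move = rebalance_step(cluster)
print(move, {n: node_load(o) for n, o in cluster.items()})
```

Run repeatedly, a step like this would keep shifting hot objects away from overloaded nodes until no single move narrows the gap any further.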

The cluster interconnect must be fast and broad. The only commodity-like one available is gigabit Ethernet. My assumption is that InfiniBand is too expensive and too niche.

How will accessing devices (browser-using PCs, mobile phones, intelligent hand-held devices, etc) access a Hulk/Maui installation? Will there be one hugely-capable front-end device like an enormous SAN director in terms of capability? There has to be some function which receives data read/write requests and routes them to the right places in the Hulk/Maui infrastructure.

That means the infrastructure has to be indexed and the indices maintained and referred to constantly.
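The routing-plus-index idea can be sketched in a few lines. The `Router` class and its methods are hypothetical names of mine, and a real system at this scale would need to distribute the index itself rather than hold it in one dictionary.

```python
class Router:
    """Toy front end: looks up which nodes hold an object and picks a live one."""

    def __init__(self, nodes):
        self.live = set(nodes)
        self.index = {}  # object_id -> list of nodes holding a copy

    def record_write(self, object_id, holders):
        """Update the index after an object has been written to `holders`."""
        self.index[object_id] = list(holders)

    def mark_failed(self, node):
        """Stop routing requests to a node that has crashed."""
        self.live.discard(node)

    def route_read(self, object_id):
        """Return the first live node that holds a copy of the object."""
        for node in self.index.get(object_id, []):
            if node in self.live:
                return node
        raise LookupError(f"no live copy of {object_id}")


router = Router(["node-1", "node-2", "node-3"])
router.record_write("photo-42", ["node-2", "node-3"])
first_choice = router.route_read("photo-42")

router.mark_failed("node-2")
after_failure = router.route_read("photo-42")
print(first_choice, after_failure)
```

Every read and write touches the index, which is why it has to be maintained constantly and why its own availability becomes as important as that of the data.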

EMC will presumably be running beta tests now to try out its new cloud storage product. These will involve multiple petabytes of data and two scenarios: in-house use and storage service provider use. Perhaps the Mozy online backup service is being used for in-house beta testing. In fact it looks like a natural testbed for Hulk/Maui: it already has 5PB of data to look after.

Conclusion

My take is that we could expect Linux x86 servers running system SW which includes Rainfinity and Mozy-type technology and which have a relatively large amount of DAS. These nodes are clustered together over Ethernet and present as a single virtual storage system, based on object storage technology, with built-in recovery for both node and disk failure. There is also a built-in performance monitoring and optimising feature.

Oh, and the new product will have a new name to rank alongside Symmetrix, Celerra, Centera and Clariion.

When will it be announced? EMC World might be a suitable time and place. It starts on May 19th and takes place in Las Vegas at the Mandalay Bay.