Imagine a storage array with the capacity of a stack of iPods three times the height of the Empire State Building, yet one that can be managed with common Ethernet networking tools, and you'll have a picture of what a group of MIT scientists and four storage vendors are building.

The storage array will support an MIT Media Lab project called the Human Speechome Project that is studying how babies develop the ability to talk. The project began three months ago when MIT associate professor Deb Roy began recording his baby boy's everyday life through the use of 14 microphones and 11 fish-eye lens cameras set up throughout his house, giving researchers a bird's-eye view of every room. The baby is now nine months old.

A 5TB disk cache in the basement stores data temporarily until it is physically carted back to the Media Lab for analysis.

To store and then process the video and audio data, a massive storage area network (SAN) was needed to archive and search what is expected to be 1.4 petabytes (1,400TB) of data over the span of the three-year project.
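
As a rough sanity check on those figures, the sketch below works out the average ingest rate implied by 1.4 petabytes over three years, and how long the 5TB basement cache would last at that rate. It assumes 365-day years and a constant rate, and it treats the whole 1.4PB as incoming data even though part of the total may be derived analysis output. The function names are illustrative only.

```python
# Back-of-envelope ingest estimate.
# Assumptions (not from the article): a steady recording rate,
# 365-day years, and that all 1,400TB counts as incoming data.

def average_tb_per_day(total_tb: float, years: float) -> float:
    """Average terabytes arriving per day over the whole project."""
    return total_tb / (years * 365)


def days_to_fill(cache_tb: float, tb_per_day: float) -> float:
    """Days until a cache of the given size is full at a steady rate."""
    return cache_tb / tb_per_day


TOTAL_TB = 1400      # 1.4 petabytes expected over the project
PROJECT_YEARS = 3    # project duration
CACHE_TB = 5         # temporary disk cache in the basement

per_day = average_tb_per_day(TOTAL_TB, PROJECT_YEARS)
per_week = per_day * 7
cache_days = days_to_fill(CACHE_TB, per_day)

print(f"{per_day:.2f} TB/day, {per_week:.2f} TB/week")
print(f"5TB cache fills in about {cache_days:.1f} days")
```

Under these assumptions the average comes to a little over 1TB a day, or roughly 9TB a week, so the cache would need to be emptied about every four days.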

The SAN is being built from commodity hardware and uses a 10GbE IP network for data transfer between the backend SAN and hundreds of servers.

"I think here what we're seeing is what the future of storage is going to be like. This is a great marriage between industry and the academic world," said Frank Moss, director of the Media Lab and a former CEO of Tivoli Systems, a maker of storage management software now owned by IBM.

Moss spoke at a press conference held yesterday at MIT's Media Lab in Cambridge, Mass.

The Human Speechome Project computing infrastructure is expected to be composed of more than 300 Hammer Z-Rack storage enclosures from Bell Microproducts, about 3,000 SATA (Serial Advanced Technology Attachment) hard disk drives from Seagate Technology LLC, and more than 100 10GbE switches and 400 blade processors from Marvell Technology Group Ltd.

The high-throughput switches are needed for the storage I/O anticipated by researchers, who expect to process 700TB of data during every 12-hour analytical run. To meet the performance requirements, 150-drive stripes (aggregated virtual volumes) will be created using the native virtualisation capabilities of Z-SAN, the Zetera technology that Bell offers. Protection against data loss will come from RAID 10 mirrors (duplicate copies) of the raw video data, transform data, and metadata files.
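
To put the 700TB-per-run figure in perspective, the sketch below converts it into a sustained transfer rate. It assumes decimal terabytes, a perfectly even load, and no protocol overhead, none of which the article states, and it uses the approximate count of 3,000 drives. The names are hypothetical.

```python
import math

# Sustained throughput implied by one analytical run.
# Assumptions (not from the article): decimal units (1TB = 10**12 bytes),
# an evenly spread load, and no protocol or RAID overhead.

TB = 10**12


def sustained_bytes_per_second(total_bytes: float, hours: float) -> float:
    """Average bytes per second needed to move total_bytes in the given time."""
    return total_bytes / (hours * 3600)


RUN_BYTES = 700 * TB   # data processed per analytical run
RUN_HOURS = 12         # length of one run
LINK_GBIT = 10         # nominal speed of one 10GbE link
TOTAL_DRIVES = 3000    # "about 3,000" SATA drives

rate = sustained_bytes_per_second(RUN_BYTES, RUN_HOURS)
gbit_per_s = rate * 8 / 1e9
min_links = math.ceil(gbit_per_s / LINK_GBIT)   # fully saturated links
per_drive_mb = rate / TOTAL_DRIVES / 1e6        # if spread over every drive

print(f"{rate / 1e9:.1f} GB/s sustained ({gbit_per_s:.0f} Gbit/s)")
print(f"at least {min_links} saturated 10GbE links")
print(f"about {per_drive_mb:.1f} MB/s per drive across {TOTAL_DRIVES} drives")
```

On these assumptions a run needs roughly 16GB per second, more than a dozen 10GbE links could carry even when fully saturated, which suggests why the load is spread across wide stripes and many switches.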

"Our approach allows us to eliminate a lot of cost by using high volume, commonly available systems," said Jeff Greenberg, senior director of product marketing at Zetera, the vendor designing the SAN.

The project has been amassing several terabytes of audio and video per week, capturing early childhood learning and socialization in order to model human language acquisition.

"If you take all parallel tracks of data over three years you'll have 400,000 hours of video and audio data," Roy said.
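
Roy's figure can be related to the other numbers in the story with some simple arithmetic. The sketch below assumes the 400,000 hours are split evenly across all 25 recording devices (14 microphones and 11 cameras) and that the full 1.4PB corresponds to those hours. Both are simplifications, since audio and video differ greatly in size and the total may include derived data.

```python
# Rough cross-check of the recording figures.
# Assumptions (not from the article): hours are spread evenly over all
# tracks, years have 365 days, and the 1.4PB maps entirely onto the
# 400,000 recorded hours.

MICROPHONES = 14
CAMERAS = 11
TOTAL_HOURS = 400_000     # hours of audio and video across all tracks
PROJECT_DAYS = 3 * 365    # three-year project
TOTAL_BYTES = 1.4e15      # 1.4 petabytes, decimal units

tracks = MICROPHONES + CAMERAS
hours_per_track = TOTAL_HOURS / tracks
hours_per_day = hours_per_track / PROJECT_DAYS

# Average bitrate per recorded track-hour, audio and video blended together.
avg_mbit_per_s = TOTAL_BYTES * 8 / (TOTAL_HOURS * 3600) / 1e6

print(f"{tracks} tracks, {hours_per_track:,.0f} hours each")
print(f"about {hours_per_day:.1f} recording hours per day")
print(f"average of {avg_mbit_per_s:.2f} Mbit/s per track")
```

Read this way, the figure implies that each device records for roughly 14 to 15 hours a day, at a blended average of just under 8 Mbit/s per track.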

Roy said an application the university built allows researchers to quickly home in on video and audio streams that involve his child's development while avoiding video playback of empty rooms or footage of mundane tasks, such as getting a drink of water or making coffee.