With 6 billion web pages to index and millions of Google searches run daily you would think, wouldn't you, that Google has an almighty impressive storage setup. It does, but not the way you think. The world's largest search company does use networked storage but in the form of networked clusters of Linux servers, cheap rack'em high, buy'em cheap x86 servers with one or two internal drives.

A cluster will consist of several hundred, even thousands of machines, each with their internal disk. At the last public count, in April 2003, there were 15,000 plus such machines with 80GB drives. As an exercise let's assume 16,000 machines with 1.5 disk drives, 120MB, per machine. That totals up to 1.84TB. In fact Google probably has between two and five petabytes altogether, if we add in duplicated systems, test systems and news systems and Froogle systems and so forth. Why does Google use such a massively distributed system?

It's the application
Crudely speaking, Google's storage has to do two production jobs. First it has to assimilate the results of the web crawlers which discover and index new pages. In file system terms the bulk of this activity is appending data to existing files rather than overwriting them.

The second task is to respond to the millions of online search requests, query the stored data, and come up with results. These searches can be extensively parallelised.

Google has its own GFS - Google File System - and it is described here. It has implemented this on several very large clusters of Linux machines spread across the globe in data centres.

Google's application is unique and not comparable to a general enterprise application which typically involves file data being overwritten and a much lower degree of parallelism. Google also requires that its services be up and running 7 x 24, every day of the year, no matter what. Single or even double points of failure, or network bottlenecks are simply not acceptable - ever.

Overall system configuration
Google has devised its own cluster architecture, which has evolved from the first Google system set up at Stanford by the founders in 1998 (so recent!) Sergey Brin and Larry Page.

The nature of a Google query, such as search for 'EMC', requires the scanning of hundreds of magabytes of data and billions of cpu cycles. But each web page that might contain the term 'EMC' can be read independently of the others. Thus it is inherently parallel. Brin and Page reasoned it was better to have many cheap Linux machines do the search in parallel rather than running an SMP Unix server. The Unix server would cost 5-10 times as much and represent a point of failure.

Run the search in clustered Linux PC servers (cheap, very cheap), each with their own internal disk rather than a networked storage device (expensive; network link is a bottleneck) and combine the results. Even better, store the index data for the web pages separately from the web pages themselves. Run the search across the web page index, then aggregate the positive hits and search the web pages to extract the little snippets of text surrounding the search term. Aggregate these and serve them to the user.

Linux was chosen because it was inexpensive and more reliable than either Windows NT or any proprietary Unix version.

There is no concept of state as there would be with a commercial web transaction. Each search request is atomic, can be dealt with and forgotten.

In scaling terms this is a classic scale out or horizontal scaling scenario and not a scale up, as in adding CPUs to a server, requirement.

The index is separated into what Google calls shards and these are stored on separate index servers.

The hard drives
Given this why not have a large disk server used by the clustered Linux machines? It's cost and reliability that drives this. A disk server is expensive and, as a single box, is vulnerable. Getting the hard drives with the PC servers means that the data is stored across hundreds if not thousands of drives. Google replicates data three times for redundancy. It can afford to be cavalier about hardware failures. So a drive fails. Log it, switch queries on that data to a replica and move on. It's all pretty instant.

There isn't even RAID protection. In a way the Google cluster architecture is similar to the RAIN storage idea, a redundant array of inexpensive nodes. (Techworld mentioned RAIN here. Exagrid is a supplier with RAIN storage product ideas which Techworld discussed recently here.)

The drives are IDE drives and not SCSI, which would be more expensive. Google spends more time reading files than waiting for them to be read. Latency is not that great an issue so having lightning fast 15,000pm SCSI drives is not a requirement. In 2001, 5400rpm 80GB maxtor IDE drives were mentioned as being used by Google.

Google's architecture is home-grown. Its PC servers are supplied by two specialist server builders. There is no great case study material here for Sun or IBM or HP, none whatsoever. The only well-known supplier is Red Hat for Linux, and much of its distribution is discarded as not needed.

Google gets its system reliability from software and hardware duplication. It uses commodity PCs to build a high-end computing cluster.

File System
The Google file system basics are that each GFS cluster has a single GFS master node and many chunk servers. These are accessed by many, many clients. Files are divided into fixed-size chunks of 64MB. The master maintains all file system metadata. The chunk servers store chunks on their local disks as Linux files. They need not cache file data because the local systems' Linux buffer cache keeps frequently accessed data in RAM.

To understand more about this read the GFS paper referenced above. The assumptions behind the file system includes one that component failures are normal. So system component health is watched rigorously and constantly and automatic recovery is integral to Google's operations.

Google has been growing at a phenomenal rate. In June 2000 it had three data centres and 4,000 Linux servers. Six months earlier it had 2,000. By April 2001 it had 8,000 servers and was moving to four datacentres from its then total of five. At that point it had 1 petabyte of storage. The number of servers had passed 15,000 in April,2003, probably well past.

By the end of this year Google could have around 18,000 servers and more than 5PB of storage. It is a fascinating exercise in commodity computing economics, performance and reliability but, unless your applications are inherently parallel, not a general role model, alas.