It is a truism that business data centres are full to bursting with racks of networked storage that is hard to manage and growing like a cancer on steroids. Yet Google is offering a suite of desktop productivity applications as a service, with all user data stored on Google systems. How is Google taking on a data storage problem that would cripple the average business data centre in very short order?

How is Google coping with storing petabytes of data, petabytes that are growing and growing, against a background of inadequate storage management products, disk crashes and both SAN and NAS products with capacity limitations? Surely Google can't be using standard products for this incredible data storage enterprise it has embarked upon.

A couple of papers published by Google engineers give an insight into how Google has set up its enormous-capacity storage infrastructure. This is what we've gleaned from them.


In a paper entitled 'Bigtable: A Distributed Storage System for Structured Data', available as a PDF download from Google, the authors write: 'Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, Writely and Google Finance.' In fact, more than 60 Google products and projects use it.

(We're going to summarise and quote a lot of text from this paper, and won't bother indicating quotes with quote marks any more in this feature.)

Bigtable is a distributed storage system for managing structured data designed to reliably scale to petabytes of data and thousands of machines. It has achieved several goals: wide applicability, scalability, high performance, and high availability. Bigtable is used for a variety of demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users.

It is implemented in clusters that span a wide range of configurations, from a handful of servers to thousands, and store up to several hundred terabytes of data. As of August 2006, there are 388 non-test Bigtable clusters running in various Google machine clusters.

Bigtable resembles a database, and it shares many implementation strategies with databases, but it provides a different interface from such systems. It does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage.

Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.
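To make that data model concrete, here is a minimal sketch in Python. It is purely illustrative (the class and method names are ours, not Google's code): cells are addressed by arbitrary row and column strings, values are uninterpreted byte strings, and keeping rows in sorted order is what lets a careful choice of row key, such as the reversed hostnames used as an example in the Bigtable paper, keep related data close together.

class ToyTable:
    def __init__(self):
        self._cells = {}  # (row, column) -> uninterpreted byte string

    def put(self, row, column, value):
        # Clients serialise their own structured or semi-structured data
        # into the value; the table never interprets it.
        self._cells[(row, column)] = value

    def get(self, row, column):
        return self._cells.get((row, column))

    def scan_rows(self, start_row, end_row):
        # Yield cells with start_row <= row < end_row, in sorted row order.
        for row, column in sorted(self._cells):
            if start_row <= row < end_row:
                yield row, column, self._cells[(row, column)]

t = ToyTable()
# Reversed hostnames keep pages from the same domain adjacent in row order.
t.put("com.example.www/index.html", "contents:", b"<html>...</html>")
t.put("com.example.www/about.html", "contents:", b"<html>...</html>")
# '/' sorts immediately after '.', so this range covers every row that
# begins with "com.example."
print(list(t.scan_rows("com.example.", "com.example/")))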

We're not going to examine the Bigtable data model in detail; instead we'll tease out details of the underlying storage infrastructure.

Bigtable storage foundation

Bigtable is built on several other pieces of Google infrastructure. It uses the distributed Google File System (GFS) to store log and data files, and itself takes care of all the data layout, compression, and access chores associated with a large data store.

A Bigtable cluster, which will support one or several Google projects, typically operates in a shared pool of machines that run a wide variety of other distributed applications, and Bigtable processes often share the same machines with processes from other applications.

A large cluster will provide hundreds of terabytes of storage across thousands of disks on over a thousand machines, and is concurrently accessed by hundreds of clients.

Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.

The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable).
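As a rough illustration of that interface (again with names of our own, not Google's code), here is an immutable, ordered map of byte strings in Python supporting the two operations just described: looking up the value for a single key, and iterating over all key/value pairs in a key range.

import bisect

class ImmutableSortedMap:
    def __init__(self, entries):
        # The contents are fixed at construction time and never mutated.
        items = sorted(entries)
        self._keys = [k for k, _ in items]
        self._values = [v for _, v in items]

    def lookup(self, key):
        # Return the value stored under key, or None if it is absent.
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start, end):
        # Yield (key, value) pairs with start <= key < end, in key order.
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            yield self._keys[i], self._values[i]
            i += 1

m = ImmutableSortedMap([(b"row2", b"v2"), (b"row1", b"v1"), (b"row3", b"v3")])
print(m.lookup(b"row2"))               # b'v2'
print(list(m.scan(b"row1", b"row3")))  # rows 1 and 2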

A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then read that block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk.
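To show how that single-seek lookup works, here is a much-simplified, hypothetical sketch of the layout in Python. The writer appends the data blocks, then a block index of (last key, offset, length) entries, then a small footer recording where the index starts; the reader loads only the index into memory, and each lookup is a binary search over the index followed by one read of one block. The names and the JSON block encoding are our own simplifications, not the real SSTable format, which would use a compact binary encoding and compression.

import bisect
import json
import struct

def write_sstable_like(path, sorted_items, block_size=64 * 1024):
    # sorted_items: (key, value) string pairs, already sorted by key.
    index = []  # one (last_key_in_block, offset, length) entry per block
    with open(path, "wb") as f:
        block, block_bytes = [], 0

        def flush():
            nonlocal block, block_bytes
            if not block:
                return
            offset = f.tell()
            payload = json.dumps(block).encode()  # toy block encoding
            f.write(payload)
            index.append((block[-1][0], offset, len(payload)))
            block, block_bytes = [], 0

        for key, value in sorted_items:
            block.append((key, value))
            block_bytes += len(key) + len(value)
            if block_bytes >= block_size:
                flush()
        flush()
        index_offset = f.tell()
        f.write(json.dumps(index).encode())
        f.write(struct.pack("<Q", index_offset))  # fixed-size footer

class SSTableLikeReader:
    def __init__(self, path):
        self._f = open(path, "rb")
        # Read the footer, then load the block index into memory.
        self._f.seek(-8, 2)
        (index_offset,) = struct.unpack("<Q", self._f.read(8))
        index_end = self._f.seek(0, 2) - 8
        self._f.seek(index_offset)
        self._index = json.loads(self._f.read(index_end - index_offset))
        self._last_keys = [entry[0] for entry in self._index]

    def lookup(self, key):
        # One binary search in memory, then at most one block read from disk.
        i = bisect.bisect_left(self._last_keys, key)
        if i == len(self._index):
            return None
        _, offset, length = self._index[i]
        self._f.seek(offset)
        block = json.loads(self._f.read(length))
        for k, v in block:
            if k == key:
                return v
        return None

write_sstable_like("demo.sst", [("row1", "v1"), ("row2", "v2"), ("row3", "v3")])
print(SSTableLikeReader("demo.sst").lookup("row2"))  # "v2"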

(Part 2 looks at the Bigtable storage foundation.)