In part one of this feature we learnt about Bigtable. Next we look at the file system infrastructure underneath it and aim to see if this is something enterprise datacentres could use.
No standard Windows or Unix/Linux product
Obviously Google isn't using a standard operating system and file system here. Its Linux O/S runs Google's own distributed file system, the Google File System (GFS), and we look at how efficient it is at locating and reading block-level data from disk.
Another Google paper states: 'the Google File System (is) a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.'
Google file system
(We quote extensively from this paper in this section.) GFS was conceived as the backend file system for Google's production systems. GFS provides a location independent namespace which enables data to be moved transparently for load balance or fault tolerance.
It has been designed from a point of view that component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines.
The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. Google has seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.
Files are huge by traditional standards. Multi-GB files are common. Each file typically contains many application objects such as web documents. Google is regularly working with fast-growing data sets of many TBs comprising billions of objects, so it is unwieldy to manage billions of approximately KB-sized files even when the file system could support it. As a result, design assumptions and parameters such as I/O operation and block sizes were revisited. (GFS uses a chunk size of 64MB, much larger than typical file system block sizes.)
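To make the 64MB chunk size concrete, here is a minimal Python sketch of how a byte offset in a file might map onto fixed-size chunks. The function names are our own, for illustration only, and are not taken from Google's paper.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB, the GFS chunk size quoted above


def chunk_index(byte_offset: int) -> int:
    """Return the index of the chunk that holds the given byte offset."""
    if byte_offset < 0:
        raise ValueError("byte_offset must be non-negative")
    return byte_offset // CHUNK_SIZE


def chunks_for_range(offset: int, length: int) -> list[int]:
    """Return the indices of every chunk touched by a read of `length` bytes."""
    if length <= 0:
        return []
    first = chunk_index(offset)
    last = chunk_index(offset + length - 1)
    return list(range(first, last + 1))


def chunk_count(file_size: int) -> int:
    """Return how many chunks a file of `file_size` bytes occupies (rounded up)."""
    return -(-file_size // CHUNK_SIZE)
```

On these assumptions a 3GB file occupies 48 chunks, and a small read that straddles the first chunk boundary touches chunks 0 and 1.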
Most files are mutated by appending new data rather than overwriting existing data. (This characteristic will radically increase Google's disk capacity needs on its own.) Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. A variety of data share these characteristics.
Some may constitute large repositories that data analysis programs scan through. Some may be data streams continuously generated by running applications. Some may be archival data. Some may be intermediate results produced on one machine and processed on another, whether simultaneously or later in time.
Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.
The Google file system has been designed for a specific enterprise environment. Google runs, in effect, an absolutely massive but highly specialised set of applications, and parallelism is characteristic of them.
GFS data is stored in chunks. A GFS cluster is highly distributed and typically has hundreds of chunkservers spread across many machine racks. These chunkservers in turn may be accessed from hundreds of clients from the same or different racks. Communication between two machines on different racks may cross one or more network switches. Additionally, bandwidth into or out of a rack may be less than the aggregate bandwidth of all the machines within the rack. Multi-level distribution presents a unique challenge to distribute data for scalability, reliability, and availability.
The chunk replica placement policy serves two purposes: maximize data reliability and availability, and maximize network bandwidth utilization. For both, it is not enough to spread replicas across machines, which only guards against disk or machine failures and fully utilizes each machine’s network bandwidth. GFS must also spread chunk replicas across racks. This ensures that some replicas of a chunk will survive and remain available even if an entire rack is damaged or offline (for example, due to failure of a shared resource like a network switch or power circuit).
It also means that traffic, especially reads, for a chunk can exploit the aggregate bandwidth of multiple racks. On the other hand, write traffic has to flow through multiple racks, a trade-off Google makes willingly.
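The rack-spreading idea can be sketched in a few lines of Python. This is a simplified illustration, not Google's actual placement algorithm: the `place_replicas` function and the rack and server names are hypothetical, and a real policy would presumably also weigh factors such as load and free disk space.

```python
def place_replicas(servers_by_rack: dict[str, list[str]],
                   replicas: int = 3) -> list[tuple[str, str]]:
    """Pick (rack, server) pairs for a chunk, spreading across racks first.

    Hypothetical round-robin policy: take one server from each rack in turn,
    and only reuse a rack once every rack already holds a replica.
    """
    total_servers = sum(len(servers) for servers in servers_by_rack.values())
    if replicas > total_servers:
        raise ValueError("not enough chunkservers for the requested replicas")

    racks = sorted(servers_by_rack)
    placement: list[tuple[str, str]] = []
    depth = 0  # how many servers have already been taken from each rack
    while len(placement) < replicas:
        for rack in racks:
            servers = servers_by_rack[rack]
            if depth < len(servers):
                placement.append((rack, servers[depth]))
                if len(placement) == replicas:
                    break
        depth += 1
    return placement


cluster = {
    "rack-a": ["a1", "a2"],
    "rack-b": ["b1"],
    "rack-c": ["c1", "c2"],
}
print(place_replicas(cluster))
```

With three racks and the default of three replicas, each replica lands on a different rack, so losing a whole rack still leaves two copies reachable.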
Users can specify different replication levels for different parts of the file namespace. The default is three. The master clones existing replicas as needed to keep each chunk fully replicated as chunkservers go offline or detect corrupted replicas through checksum verification.
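As a rough illustration of that bookkeeping, the Python sketch below works out which chunks have fallen below their replication target once dead chunkservers and corrupted replicas are discounted. The data structures and the `chunks_needing_clones` name are our assumptions; the paper excerpts quoted here do not describe how the master actually tracks this.

```python
DEFAULT_REPLICATION = 3  # the default replication level mentioned above


def chunks_needing_clones(chunk_locations: dict[str, set[str]],
                          live_servers: set[str],
                          corrupted: frozenset = frozenset(),
                          target: int = DEFAULT_REPLICATION) -> dict[str, int]:
    """Return {chunk_id: number of extra replicas needed}.

    chunk_locations maps a chunk id to the servers holding a replica.
    corrupted holds (chunk_id, server) pairs whose checksum check failed.
    A replica counts as healthy only if its server is alive and the
    replica is not flagged as corrupted.
    """
    shortfall = {}
    for chunk_id, servers in chunk_locations.items():
        healthy = [
            server for server in servers
            if server in live_servers and (chunk_id, server) not in corrupted
        ]
        missing = target - len(healthy)
        if missing > 0:
            shortfall[chunk_id] = missing
    return shortfall
```

A chunk that loses one server needs one clone; a chunk that loses one server and has a second replica fail its checksum needs two.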
As disks are relatively cheap and replication is simpler than more sophisticated RAID approaches, GFS currently uses only replication for redundancy and so consumes more raw storage than other approaches.
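The storage cost of that choice is easy to put numbers on. The sketch below compares usable capacity under plain replication with a single-parity layout of the RAID 5 kind; the 300TB figure and the six-disk group size are made-up examples, not Google numbers.

```python
def usable_with_replication(raw_tb: float, replicas: int = 3) -> float:
    """Usable capacity when every chunk is stored `replicas` times."""
    return raw_tb / replicas


def usable_with_single_parity(raw_tb: float, disks_per_group: int) -> float:
    """Usable capacity when one disk's worth per group is given over to parity."""
    return raw_tb * (disks_per_group - 1) / disks_per_group


raw = 300.0  # hypothetical raw capacity in TB
print(usable_with_replication(raw))       # three-way replication
print(usable_with_single_parity(raw, 6))  # five data disks plus one parity
```

On these example figures, three-way replication leaves a third of the raw capacity usable, against five sixths for the parity layout, which is the trade-off Google accepts in exchange for simplicity.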
The disk infrastructure Google uses will have been developed in conjunction with the file system and cluster-based processing concepts, with Bigtable then developed on this foundation. We'll look no further into the file system unless it's necessary to understand the disk infrastructure.
Google's paper on drive failures stated: 'More than one hundred thousand disk drives were used for all the results presented here. The disks are a combination of serial and parallel ATA consumer-grade hard disk drives, ranging in speed from 5400 to 7200 rpm, and in size from 80 to 400 GB. All units in this study were put into production in or after 2001. The population contains several models from many of the largest disk drive manufacturers and from at least nine different models.'
They are deployed in rack-mounted servers and housed in professionally-managed datacenter facilities. Google runs its own burn-in process: 'Before being put into production, all disk drives go through a short burn-in process, which consists of a combination of read/write stress tests designed to catch many of the most common assembly, configuration, or component-level problems.'
Google is building a data handling infrastructure that is probably the largest the world has ever seen, and one that is greatly different in scale and use from business data centres, even the largest ones.
Everything is layered with each layer dependent upon features of the one beneath it, and tuned to help the layers above it. In other words, you can't take out a layer of this infrastructure and use it on its own.
One reading of this is that the Google storage infrastructure is irrelevant as a model for business to use. A second is that Google could realistically provide software as a service. It will already have accumulated much experience from its initial Gmail offering, and rolling out its desktop productivity applications seems quite practical, even on this simplified review of its activities.