One of the biggest challenges for corporate IT departments today is managing the spread of data throughout their organisation. Two decades ago a departmental server held just a few MBs; today most hold up to a TB. Desktop computers now ship with at least 30GB of disk space, and few IT departments can claim success in making users store data on the server rather than locally. Aggregate all the data in a very large corporate environment and you might begin to break into the PB range, but, with the exception of a handful of organisations, it is unlikely that all of that storage is online.

CERN, the European Organization for Nuclear Research, is one of a very select number of high-energy particle physics sites. Over the next few years, work on the new Large Hadron Collider will generate one of the most ambitious storage projects of this decade. At 27 km in circumference, the LHC will be the largest scientific instrument in the world. Scientists will use it to accelerate nuclear particles to massive speeds before crashing them into each other, then examine the remnants of the collisions for new particles or behaviour. Finding a new particle is not just like looking for a needle in a haystack; it is like looking for a needle in 20 million haystacks. This process is expected to generate between 10 and 15 petabytes of data per year. Because it is critical that no data is lost, that data has to be written to disk and tape simultaneously, and then made available to large numbers of researchers across the world. The data will also have to be moved across the Internet at high speed to ensure that researchers have access to local copies.

CERN's current estimate is that the total amount of data stored in the world today is 5 exabytes. On that basis, CERN is going to add more data to the human knowledge pool than any single country, for every year of the 10 years this project will be live. As multiple copies of that data will be held around the world, we are talking about significant percentage points of worldwide data being generated and stored per year. This raises some huge questions over the storage medium, and CERN is only able to talk about its own plans to store the data. What makes this very interesting is that CERN, for budgetary reasons, has decided to build the solution using grid computing and commodity hardware. Mainframe and high-density storage systems are out; file servers using Serial ATA drives are in. Only as the systems are built will we know whether their speed and reliability can handle this amount of data.

The data gathered will need to be written as close to real time as possible, to reduce the need for massive memory caching and the risk of data loss or corruption in the event of power problems. CERN has been testing tape storage solutions capable of writing data to tape at 1GB per second. Using tape libraries, that rate can be sustained for over an hour, allowing the entire data set for a single experiment to be stored in one operation. CERN will be the nexus, or tier 0, and all initial disk storage will be done locally. That data will be replicated, in real time, to a set of tier 1 storage centres, some of whose locations are already known: HP recently announced that four of its sites, in Bristol, Palo Alto, Brazil and Puerto Rico, will also be tier 1 sites.
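As a rough sanity check on the figures quoted above, the short Python sketch below works through the arithmetic: what an hour of sustained 1GB-per-second tape writing amounts to, and how the yearly output compares with CERN's 5 exabyte estimate of all stored data. The number of replica copies held across the tier 0 and tier 1 sites is an assumption chosen for illustration, not a figure published by CERN.

```python
# Back-of-envelope arithmetic for the figures quoted in the article.
# The replica count is an assumption for illustration only.

GB, TB, PB, EB = 10**9, 10**12, 10**15, 10**18  # decimal units, in bytes

# Tape writing: 1 GB per second sustained for over an hour.
tape_rate = 1 * GB                      # bytes per second
one_hour = 3600                         # seconds
single_run = tape_rate * one_hour       # data stored in one sustained operation
print(f"One hour at 1 GB/s: {single_run / TB:.1f} TB")

# Yearly output versus CERN's 5 exabyte estimate of all stored data today.
yearly_output = 15 * PB                 # upper end of the 10-15 PB/year estimate
replicas = 4                            # assumed copies across tier 0/1 sites
world_data = 5 * EB

yearly_share = yearly_output * replicas / world_data
print(f"Added per year with {replicas} copies: {yearly_share:.1%} of today's 5 EB")
print(f"Over the 10-year project: {yearly_share * 10:.1%}")
```

On these assumptions, a single hour-long tape operation captures around 3.6 TB, and a decade of replicated LHC output adds data equivalent to roughly a tenth of everything estimated to be stored in the world today.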
CERN’s implementation plans are detailed here.