In 2007, scientists will begin smashing protons and ions together in a massive, multi-national experiment to understand what the universe looked like tiny fractions of a second after the Big Bang. The particle accelerator used in this test will release a vast flood of data on a scale unlike anything seen before, and for that scientists will need a computing grid of equally great capability.

The Large Hadron Collider (LHC), which is being built near Geneva, will be a circular structure 17 miles in circumference. It will produce data at a rate in the neighbourhood of 1.5GB/sec, or as much as 10 petabytes of data annually, about 1,000 times the size of the Library of Congress' print collection. The data flows will likely begin in earnest in 2008.
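To see how the two figures relate, here is a rough back-of-the-envelope calculation in Python. It assumes decimal units (1GB = 10^9 bytes, 1 petabyte = 10^15 bytes) and a 365-day year; the constant and function names are illustrative only and do not come from the project.

```python
# Back-of-the-envelope check of the quoted data volumes.
# Assumptions (not from the article): decimal units and a 365-day year.
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

RATE_BYTES_PER_SEC = 1.5e9   # "in the neighbourhood of 1.5GB/sec"
ANNUAL_BYTES = 10e15         # "as much as 10 petabytes of data annually"


def continuous_petabytes_per_year(rate_bytes_per_sec):
    """Petabytes produced if the quoted rate were sustained all year."""
    return rate_bytes_per_sec * SECONDS_PER_YEAR / 1e15


def days_to_reach(total_bytes, rate_bytes_per_sec):
    """Days of running at the quoted rate needed to produce total_bytes."""
    return total_bytes / rate_bytes_per_sec / SECONDS_PER_DAY


if __name__ == "__main__":
    print(round(continuous_petabytes_per_year(RATE_BYTES_PER_SEC), 1))
    print(round(days_to_reach(ANNUAL_BYTES, RATE_BYTES_PER_SEC), 1))
```

Under these assumptions, the full rate sustained all year would produce roughly 47 petabytes, so the 10-petabyte annual figure corresponds to about 77 days of output at that rate.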

As part of this effort, which is costing about €5 billion, scientists are building a grid using 100,000 CPUs, mostly PCs and workstations, available at universities and research labs in the US, Europe, Japan, Taiwan and other locations. Scientists need to harness raw computing power to meet computational demands and to give researchers a single view of this dispersed data.

This latter goal -- creating a centralised view of data that may be located in Europe, the US or somewhere else -- is the key research problem.

Centralising the data virtually, or creating what is called a data grid, means extending the capability of existing databases, such as Oracle 10g and MySQL, to scale to these extraordinary data volumes. And it requires new tools for coordinating data requests across the grid in order to synchronise multiple, disparate databases.

"It's all about pushing the envelope in terms of scale of robustness," says Tony Doyle, project leader of the Grid Particle Physics (GridPP) project, a UK-based scientific grid initiative that's also part of the international effort to develop the grid middleware tools.

Researchers believe that improving the ability of a grid to handle petabyte-scale data, split up among multiple sites, will benefit not only the scientific community but also mainstream commercial enterprises. They expect that corporations -- especially those involved in fields such as life sciences -- will one day need a similar ability to harness computing resources globally as their data requirements grow.

"If this works, it will spawn companies that will just set up clusters to provide grid computing to other people," says Steve Lloyd, who chairs the GridPP Collaboration Board, based at the Rutherford Appleton Laboratory in Oxfordshire, UK. GridPP is working with the international team to develop the grid the LHC will use.

CERN, the European laboratory for particle physics, is leading the LHC and its grid effort. From CERN's facility near Geneva, the data produced by the particle accelerator will be distributed to nine other major computing centres, including the Brookhaven National Laboratory and the Fermi National Accelerator Laboratory in the US, says Fabio Hernandez, grid technical leader at one of the major project sites in France.

As a backup measure, each of the 10 centres will hold two-tenths of the total data: its own 10 per cent plus a duplicate of the 10 per cent held by another centre, says Hernandez. In total, some 150 universities and research labs worldwide will be connected to this system, all providing some degree of processing capability. The operation will run on versions of the Linux operating system, on clusters of machines with Intel and AMD processors.
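The backup arrangement Hernandez describes can be illustrated with a minimal Python sketch. The article does not say which centre duplicates which, so the ring pairing below (each centre also holds its neighbour's share) is an assumption, and the centre names and function names are hypothetical.

```python
# Sketch of a "own 10 per cent plus a duplicate of another's" layout.
# Assumption: centres are paired in a ring; the real pairing is not
# given in the article.

def assign_replicas(centres):
    """Map each centre to (own_slice, backup_slice) of the data."""
    n = len(centres)
    if n < 2:
        raise ValueError("need at least two centres to hold a duplicate")
    # Centre i owns slice i and also keeps a copy of slice i + 1,
    # wrapping around at the end of the list.
    return {centre: (i, (i + 1) % n) for i, centre in enumerate(centres)}


def surviving_slices(placement, failed):
    """Return the set of slices still held by centres that have not failed."""
    alive = set()
    for centre, slices in placement.items():
        if centre not in failed:
            alive.update(slices)
    return alive


if __name__ == "__main__":
    centres = [f"centre-{i}" for i in range(10)]  # hypothetical names
    placement = assign_replicas(centres)
    print(len(surviving_slices(placement, {"centre-3"})))
```

With this pairing, every slice exists at exactly two centres, so any single centre can go offline without data becoming unavailable. Two neighbouring centres failing together would still lose one slice.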

Developing the grid involves a combination of efforts. In April, the LHC team conducted a test, distributing the data to 10 major sites at a total rate of 600Mbit/sec. Much of the work was "low level," says Hernandez, such as adjusting network card parameters and firewall configurations.

"It was important to prove that we can maintain the processes for an extended period... almost without human attendance," says Hernandez. This means ensuring that network interconnects are tuned and synchronised and that there's sufficient security and monitoring, as well as staffing and automation, at the respective data gathering sites, he says.

The more difficult aspect is providing simultaneous access to the data by as many as 1,000 physicists working around the world. "You cannot... predict what the users will want in any given moment," says Hernandez.

One limiting factor that's getting a lot of attention from the approximately 100 developers working on the grid worldwide is the capability of resource brokers -- the middleware that submits the jobs and distributes the work. If the processing isn't effectively routed, databases can crash under heavy loads, says Doyle. There's also a need to ensure that the system has no single point of failure.

This involves keeping track of the data. The data could be in one place while the CPU resource capable of processing it is in another. Metadata, which describes what the data is about, will play a critical role. "These are some of the big challenges," says Doyle.
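The routing problem described here, matching a job to a site that holds the data and has spare CPUs, can be sketched in a few lines of Python. This is an illustration only, not the LHC grid's actual middleware: the catalogue structure, the Site class and the choose_site function are all hypothetical.

```python
# Toy data-aware resource broker (illustrative, not the real middleware).
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    free_cpus: int


def choose_site(dataset, catalogue, sites):
    """Pick a site for a job that reads `dataset`.

    `catalogue` stands in for the metadata: it maps a dataset name to
    the set of site names holding a copy. Returns (site_name,
    needs_transfer), or None if no site has a free CPU.
    """
    available = [s for s in sites if s.free_cpus > 0]
    if not available:
        return None
    holders = catalogue.get(dataset, set())
    # Prefer sites that already hold the data, to avoid moving it.
    local = [s for s in available if s.name in holders]
    if local:
        best = max(local, key=lambda s: s.free_cpus)
        return best.name, False
    # Otherwise use the least-loaded site and flag that data must move.
    best = max(available, key=lambda s: s.free_cpus)
    return best.name, True


if __name__ == "__main__":
    catalogue = {"run-001": {"site-a", "site-b"}}  # hypothetical entries
    sites = [Site("site-a", 10), Site("site-b", 50), Site("site-c", 200)]
    print(choose_site("run-001", catalogue, sites))
```

The sketch favours sending the job to the data over sending the data to the job. A real broker would also have to weigh network cost, queue lengths and failures, which are left out here.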

"The most important thing is to show that the grid model can be used to process real data in a scientific context -- and data distributed all over the world," says Hernandez. "I think it's the most important lesson we are going to learn."