CERN’s coming experiments will generate huge data recording requirements. The speed and reliability of replication will be essential to distributing the data. Recent speed tests across portions of the Internet have shown that it is possible to transmit data at more than 4TB per hour, or roughly one CD per second. As the data is replicated to the tier 1 sites, it will be streamed rather than transmitted to each in turn. This will allow the tier 2 sites and those research facilities that need immediate access to the data to take what they need without waiting for bandwidth to become free.
Given the relatively low budget that CERN has for this project, it has decided to stay with IPv4 rather than move to IPv6. This is an interesting choice, not least because IPv6 would provide better handling of large packets and more efficient Quality of Service, and would allow use of the newer government backbones, which are all likely to be IPv6. Over the last 18 months, the EU, the USA and several Asia-Pacific countries have mandated IPv6 for their new backbones.
Local servers will be clustered together and access to the drives will be through virtualization. In effect, this will create the world’s largest Storage Area Network (SAN), running over iSCSI at 1Gbit/s. This use of massively virtualized storage will place a huge emphasis on the middleware. At the moment, CERN is carefully loading the system to create a large enough store of data to test the middleware. The middleware uses the Globus project as its base and is expected to push the boundaries of that project very hard.
One of the big problems for the middleware will be dealing with conflicts throughout the SAN. To understand those conflicts, the project is using simulation programmes, along with some data, to try to predict the behaviour. Over the next few years the data loading will be ramped up until the amount of test data is similar to the expected volume of real data.
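The transfer figure quoted above (4TB per hour, or about one CD per second) can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes decimal units (1TB = 10^12 bytes) and a 700MB CD, neither of which is stated in the article, and the function names are made up for this example.

```python
# Back-of-envelope check of the "4TB per hour, about one CD per second" figure.
# Assumptions (not from the article): decimal terabytes and a 700MB CD.

TB_BYTES = 10**12          # assumed: 1TB = 10^12 bytes
CD_BYTES = 700 * 10**6     # assumed: one CD holds 700MB
SECONDS_PER_HOUR = 3600


def bytes_per_second(tb_per_hour: float) -> float:
    """Convert a rate in TB per hour to bytes per second."""
    return tb_per_hour * TB_BYTES / SECONDS_PER_HOUR


def cds_per_second(tb_per_hour: float, cd_bytes: int = CD_BYTES) -> float:
    """Express a rate in TB per hour as CDs transferred per second."""
    return bytes_per_second(tb_per_hour) / cd_bytes


def gbit_per_second(tb_per_hour: float) -> float:
    """Express a rate in TB per hour as a line rate in Gbit/s."""
    return bytes_per_second(tb_per_hour) * 8 / 10**9


if __name__ == "__main__":
    rate = 4.0  # TB per hour, the figure quoted in the article
    print(f"{bytes_per_second(rate) / 1e9:.2f} GB/s")
    print(f"{cds_per_second(rate):.2f} CDs per second")
    print(f"{gbit_per_second(rate):.2f} Gbit/s")
```

Under these assumptions, 4TB per hour works out to a little over 1GB per second, which is somewhat more than one 700MB CD each second, so the article's comparison holds as a rough approximation.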
To make the usage realistic, a number of physicists have been signed up to work with the system and act as guinea pigs. What they cannot do at present is mine the gathered data.
The internal network is being constructed around 10GbE switches. This should mean that the network isn’t swamped as the live data is gathered, and it will provide enough bandwidth to build private circuits for management and other teams. The structure is intended to keep the computer centre and data separate, and security will be built from a community perspective. The approach is to rely on certificate management, with each research establishment responsible for issuing and maintaining its own access certificates.
One thing that hasn’t yet been considered is disaster recovery (DR). Given the number of unknowns, that isn’t surprising. What will be interesting, however, is how CERN looks to use the tier 1 sites in this role. If the replication of live data works and those sites are holding identical copies, there should be sufficient redundancy to recover any lost data. This may prove to be a serious model for other large organisations that want a DR policy that is more than a simple backup and restore.
Much is left to be done between now and 2007, when this project will go live, but the lessons will have a huge impact on governments and businesses alike. Can commodity hardware cut it? Is GRID computing a reality or just an overhyped rebirth of distributed processing? Is data mining petabytes of data achievable? CERN may be a research establishment, but life there is never dull, and if you want to be at the cutting edge of storage, you might want to apply for one of the places on its new IT team.
CERN to drive storage speed advances
CERN needs to store massive amounts of data every hour. It’s aiming to create the world’s largest SAN to do it.