High-performance computing expert Jason Stowe recently asked two of his engineers a simple question: Can you build a 10,000-core cluster in the cloud?
"It's a really nice round number," says Stowe, the CEO and founder of Cycle Computing, a vendor that helps customers gain fast and efficient access to the kind of supercomputing power usually reserved for universities and large research organisations.
Cycle Computing had already built a few clusters on Amazon's Elastic Compute Cloud that scaled up to several thousand cores. But Stowe wanted to take it to the next level. Provisioning 10,000 cores on Amazon has probably been done numerous times, but Stowe says he's not aware of anyone else achieving that number in an HPC cluster, meaning one that uses a batch scheduling technology and runs an HPC-optimised application.
"We haven't found references to anything larger," Stowe says. Had it been tested for speed, the Linux-based cluster Stowe ran on Amazon might have been big enough to make the Top 500 list of the world's fastest supercomputers.
One of the first steps was finding a customer that would benefit from such a large cluster. There's no sense in spinning up such a large environment unless it's devoted to some real work.
The customer that opted for the 10,000-core cloud cluster was biotech company Genentech in San Francisco, where scientist Jacob Corn needed computing power to examine how proteins bind to each other, in research that might eventually lead to medical treatments. Compared to the 10,000-core cluster, "we're a tenth the size internally," Corn says.
Cycle Computing and Genentech spun up the cluster on 1 March a little after midnight, based on Amazon's advice regarding the optimal time to request 10,000 cores. While Amazon offers virtual machine instances optimised for high-performance computing, Cycle and Genentech instead opted for a "standard vanilla CentOS" Linux cluster to save money, according to Stowe. CentOS is a Linux distribution based on Red Hat Enterprise Linux.
The cluster comprised 1,250 instances with eight cores each, for a total of 10,000 cores, along with 8.75TB of RAM and 2PB of disk space. Scaling up a couple of thousand cores at a time, it took 45 minutes to provision the whole cluster. There were no problems. "When we requested the 10,000th core, we got it," Stowe says.
The cluster ran for eight hours at a cost of $8,500, including all the fees to Amazon and Cycle Computing.
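The figures quoted above can be cross-checked with a little arithmetic. The following is a minimal sketch that uses only the numbers reported in this article; the variable names are illustrative, and it assumes decimal units (1TB = 1,000GB) and that all 10,000 cores were billed for the full eight hours.

```python
# Figures as reported in the article.
INSTANCES = 1_250
CORES_PER_INSTANCE = 8
TOTAL_RAM_TB = 8.75
RUN_HOURS = 8
TOTAL_COST_USD = 8_500  # includes fees to Amazon and Cycle Computing

# Total core count across all instances.
total_cores = INSTANCES * CORES_PER_INSTANCE

# Core-hours consumed, assuming every core ran for the full eight hours.
core_hours = total_cores * RUN_HOURS

# Blended cost per core-hour for the whole run.
cost_per_core_hour = TOTAL_COST_USD / core_hours

# Average RAM per instance, assuming 1TB = 1,000GB.
ram_per_instance_gb = TOTAL_RAM_TB * 1000 / INSTANCES

print(f"{total_cores} cores, {core_hours} core-hours")
print(f"${cost_per_core_hour:.4f} per core-hour")
print(f"{ram_per_instance_gb:.1f}GB of RAM per instance")
```

Under those assumptions the run works out to 80,000 core-hours at a little under 11 cents per core-hour, with about 7GB of RAM per instance.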
For Genentech, this was cheap and easy compared to the alternative of buying 10,000 cores for its own data centre and having them sit idle for most of their lives, Corn says. Using Genentech's existing resources to perform the simulations would take weeks or months instead of the eight hours it took on Amazon, he says. Genentech benefited from the high number of cores because its calculations were "embarrassingly parallel," with no communication between nodes, so performance stats "scaled linearly with the number of cores," Corn says.
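To illustrate what "embarrassingly parallel" means in practice, here is a minimal sketch, not Genentech's actual code. The function `score_pair` is a hypothetical stand-in for one independent simulation, and the thread pool is only a small-scale stand-in for cluster nodes. Because no task needs another task's result, the work can be split across any number of workers, and under ideal conditions the wall-clock time is simply the total work divided by the number of cores.

```python
from concurrent.futures import ThreadPoolExecutor


def score_pair(pair):
    """Hypothetical stand-in for one independent simulation."""
    a, b = pair
    return (a * 31 + b * 17) % 101


def run_independent(tasks, workers):
    """Run every task on a pool of workers; tasks never communicate."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input tasks.
        return list(pool.map(score_pair, tasks))


def ideal_hours(core_hours, cores):
    """Idealised wall-clock time when work scales linearly with cores."""
    return core_hours / cores


tasks = [(a, b) for a in range(10) for b in range(10)]
results = run_independent(tasks, workers=8)

# The parallel run gives the same answers as running tasks one by one.
assert results == [score_pair(t) for t in tasks]

# 80,000 core-hours of work on 10,000 cores takes eight hours.
print(ideal_hours(80_000, 10_000))
```

The linear model is an idealisation: it ignores provisioning time, scheduling overhead and contention from other jobs sharing the same hardware.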
Cycle also used some of its own software to detect errors and restart nodes when necessary, along with a shared file system and a few extra nodes on top of the 10,000 to handle some of the legwork. To ensure security, the cluster was engineered with secure HTTP (HTTPS) and 128/256-bit Advanced Encryption Standard (AES) encryption, according to Cycle.
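The article does not describe how Cycle's error-handling software works, so the following is only a generic sketch of the detect-and-restart idea, not Cycle's implementation. The names `run_with_restarts` and `FlakyNode` are hypothetical, and a node failure is simulated by raising `RuntimeError`.

```python
def run_with_restarts(job, max_restarts=3):
    """Run job(); if it fails, restart it up to max_restarts times.

    Returns the job's result and the number of attempts it took.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return job(), attempts
        except RuntimeError:
            # Give up once the restart budget is exhausted.
            if attempts > max_restarts:
                raise


class FlakyNode:
    """Hypothetical node that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success

    def __call__(self):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("simulated node failure")
        return "done"


result, attempts = run_with_restarts(FlakyNode(2))
print(result, attempts)
```

Restarting a failed job this way is safe when the tasks are independent, since re-running one task does not affect any other.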
Cycle Computing boasted that the cluster was roughly equivalent to the 114th fastest supercomputer in the world on the Top 500 list, which hit about 66 teraflops. In reality, Cycle didn't run the Linpack speed benchmark required to submit a cluster to the Top 500 list, but nearly all of the systems ranked below 114th contain fewer than 10,000 cores.
Genentech is still waiting to see whether the simulations lead to anything useful in the real world, but Corn says the data "looks fantastic". He says Genentech is "very open" to building out more Amazon clusters and Cycle Computing is looking ahead as well.
"We're already working on scaling up larger," Stowe says. All Cycle needs is a customer with "a use case to take advantage of it".