Relief is on the way for users of the open source Apache Hadoop distributed computing platform who have wrestled with the complexity of the technology.
A planned upgrade to Hadoop, which has become popular for analysing large volumes of data, is intended to make the platform more user-friendly, said Eric Baldeschwieler, CEO of HortonWorks. The company was unveiled as a Yahoo spinoff last month with the intent of building a support and training business around Hadoop. The upgrade also will feature improvements for high availability, installation and data management.
The upgrade, which will probably be called Hadoop 0.23, is due in beta releases later this year, with a general availability release eyed for the second quarter of 2012.
"A big focus for us is going to be adding tools for monitoring and distributing and management, [making it] much easier for organisations to use Hadoop. The problem now is it takes a pretty sophisticated operations staff to install and use it," Baldeschwieler said. He formerly was vice president of Hadoop engineering at Yahoo, which has been instrumental in Hadoop development.
Version 0.23 also is set for improvements in availability, performance and scalability. "That's a big one for very large customers," such as Yahoo and Facebook, Baldeschwieler said. One goal is to address the single points of failure in Hadoop's master nodes.
Also, the new HCatalog data management software layer planned for Hadoop 0.23 will let users store data in a more traditional table style, so that they can transparently move data between tools. HCatalog also yields benefits for the MapReduce programming model used with Hadoop.
Currently, users can work with two higher-level languages, Pig and Hive, on top of Hadoop, said Baldeschwieler. Pig and Hive each have their own specialty data stores. "What HCatalog's going to allow is for Pig and Hive and MapReduce itself to operate on one set of tables," he said.
An Apache representative concurred that goals for Hadoop include improvements for high availability, data management and user friendliness, but Apache would not confirm what will be in the next release or what the version number will be. Because of Hadoop's culture of continuous beta releases, there has yet to be a formal 1.0 release, Baldeschwieler said. "There will come a point where we will want to call it 1.0 or 2.0."