With a planned upgrade to its Hadoop distributed data processing technology, the Apache Software Foundation intends for the platform to run across much larger clusters and take on larger workloads, an Apache official said.

A key goal for the upcoming 0.23 release of Hadoop, which could eventually be called version two or three, is to have it run across 6,000-node clusters. To date, it has run on 4,000-node clusters, said Arun Murthy, vice president of Apache Hadoop and a founder of Hortonworks, which offers Hadoop technologies and services. Release 0.23 is currently alpha quality and is due for a more formal release later this year.

Hadoop has become popular for mining large data sets. Plans call for Hadoop 0.23 to run across clusters of 6,000 machines, each machine with 16 or more cores, and to process 10,000 concurrent jobs. Users will get more work done, Murthy said in a presentation at the O'Reilly Strata conference. Performance, he stressed, is something users "can never have enough of."

Other capabilities eyed for the upgrade include HDFS (Hadoop Distributed File System) federation as well as high availability for HDFS. MapReduce, which is the programming model and software framework in Hadoop, will be improved as well. Called Yarn, the MapReduce upgrade "is the first to take Hadoop and make it a much more general data processing system," Murthy said.

Yarn is "a high performance rewrite of MapReduce," with twice the throughput on large clusters, said Eric Baldeschwieler, Hortonworks CTO. Also, wire protocol compatibility planned for the 0.23 release will enable server and client upgrades to be done independently.

Also at Strata on Thursday, MarkLogic and Hortonworks announced integration between Hortonworks Data Platform and MarkLogic's operational database platform.

The integration will allow users to combine MapReduce with MarkLogic's real-time interactive analysis and indexing on a single, unified platform, MarkLogic said. MarkLogic will certify its Connector for Hadoop against Hortonworks Data Platform.