Oracle unveiled the Big Data Appliance, the newest addition to its line of products that combine software and hardware, during the OpenWorld conference in San Francisco yesterday.

"Big data" is an industry buzzword that refers generally to the massive amounts of information generated by websites, sensors and other sources apart from traditional enterprise applications.

The new appliance includes a distribution of the open source Hadoop programming framework, Oracle Data Integrator Application Adapter for Hadoop, Oracle Loader for Hadoop, a distribution of the R open-source statistical analysis software, and the Oracle NoSQL database, according to a statement.

"There's a lot of data, and a lot of it has very low business value. There's only a few nuggets that people want to find," Andy Mendelsohn, senior vice president of database server technologies, told press and analysts. Hadoop and other tools can distill that data down to something useful, and it can then be loaded into a data warehouse, particularly one powered by Oracle's Exadata appliance, for further analysis, he said.

NoSQL refers to a growing set of database technologies that can be defined by what they omit, such as "SQL, joins, strong analytic alternatives to those, and some forms of database integrity," analyst Curt Monash said recently. "If you leave all four out, and you have a strong scale-out story, you're in the NoSQL mainstream."

Based on Berkeley DB

The Oracle NoSQL database is a "distributed, highly scalable, key-value database" that is "easy to install, configure and manage, supports a broad set of workloads and delivers enterprise-class reliability backed by enterprise-class Oracle support," according to an Oracle statement.

It is based on Oracle's Berkeley DB product. "Berkeley DB is probably the most popular key-value store out there on the web," but it uses a single index, Mendelsohn said. For the NoSQL database, Oracle "turned it from a single index to a distributed implementation, where you could have maybe 100 indexes," he said.
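
A rough sketch of what moving from a single index to a distributed implementation can look like follows. It is written in plain Python and does not use Oracle's or Berkeley DB's API. The class name, the use of in-memory dictionaries as stand-ins for individual indexes, and the hash-based routing are all assumptions for illustration; Oracle did not describe how its product assigns keys.

```python
import hashlib

class ShardedKeyValueStore:
    """Toy key-value store that spreads keys across several independent shards."""

    def __init__(self, num_shards=4):
        # Each shard is a plain dict standing in for one single-index store.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # Use a stable hash so the same key always routes to the same shard.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.shards)
        return self.shards[index]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)

# Hypothetical usage: 100 shards, echoing the "maybe 100 indexes" remark.
store = ShardedKeyValueStore(num_shards=100)
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))
```

Because each key lands on exactly one shard, reads and writes for different keys can in principle be served by different machines, which is where the scale-out comes from.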

Mendelsohn said that like Berkeley DB, the NoSQL database will be available in both open-source and commercial versions. The latter will probably gain premium features over time.

Meanwhile, Oracle recognises that administrators and developers may not be familiar with programming models like Hadoop, Mendelsohn said.

"Hadoop as it currently stands is a very niche technology," according to Mendelson. "Everybody's talking about it, but who in our enterprise installed base can use something like this?"

That's why tools like the data-integrator adapter and loader for Hadoop are so important, since they help bridge that skills gap, he said.

"Have we done enough with Hadoop tooling? I don't think we're there yet, but we've made some good steps," Mendelsohn added.

Proprietary packages for R distribution

Oracle's R distribution is integrated with its 11g database, allowing R applications to tap data within those systems, Oracle said. A standard distribution of R will be used, but Oracle also plans to release some proprietary packages for it, Mendelsohn said.

Oracle also plans to offer all of the software products in stand-alone form as well as with the appliance, according to a statement.

Pricing and a release date for the Big Data Appliance weren't available. It will compete with products such as Aster Data, Netezza and Greenplum.

Forrester analyst James Kobielus said it's not Oracle's first "big data" appliance, if big data is defined as "the three Vs": "Volume (petabytes of stored analytic data), velocity (real-time data capture, transformation, loading, analysis, and query), and variety (handle diverse structured, semi-structured data)."

Massively parallel processing

"Exadata is all of that, and Exadata is already optimised for mixed workloads of in-database analytics and massively parallel processing (MPP) with a rich library of advanced analytics algorithms and models," he said.

One important consideration is how many of Hadoop's many sub-projects will be part of Oracle's distribution, Kobielus said.

"MapReduce and Pig are core of Hadoop modeling and development, with Mahout libraries increasingly being adopted for machine learning," he said. "HDFS and HBase are at the core of Hadoop batch and real-time data storage and management, with some uptake of Cassandra for distributed real-time analytics and transactional computing. If Oracle's Hadoop appliance doesn't incorporate most of these, plus Zookeeper and Hadoop Common tools, then it cannot be regarded as a full enterprise-ready Hadoop platform."

Mendelsohn declined to enumerate every Hadoop component Oracle plans to include in the distribution.

However, "what the people in the Hadoop community expect is going to be there," he said. "We're not going to pull out something because it competes with Oracle. It will be a complete distribution."

It's likely that Oracle will end up acquiring specialised Hadoop vendors to beef up its array of tools, Kobielus said.