Greenplum has released new technology that it says can speed the loading of data into large-scale databases without compromising overall performance.

San Mateo, California-based Greenplum provides a high-performance database management system (DBMS) typically used in data warehousing and large-scale analytical processing (or business intelligence) applications. It powers the Sun Data Warehouse Appliance, and customers include the likes of LinkedIn, Nasdaq, NYSE Euronext, Fox Interactive Media, and MySpace.

Data loading is rapidly becoming an issue for companies facing exponential data growth. "For many companies data loading is a bottleneck," said Ben Werther, director of product marketing at Greenplum. "Data loading is traditionally done at night, but more data and longer loading cycles sometimes mean this extends into the working day."

"The amount of data is growing on a daily or weekly basis," said Paul Salazar, VP of corporate marketing. "Companies are seeking to gain competitive advantage from analysing the data they capture and they are also choosing to store more data about specific events."

Salazar said that if customers can gain field intelligence quickly, by shortening data loading times to a couple of hours instead of overnight or longer, then there is a definite competitive advantage to be had.

To this end, Greenplum has introduced technology it is calling 'MPP Scatter/Gather Streaming' (or SG Streaming for short). SG Streaming technology is available immediately with the Greenplum Database. It is included at no extra charge to Greenplum customers, and the company says it eliminates the bottlenecks associated with other approaches to data loading.

Indeed, Greenplum cites customers that are achieving production loading speeds of over 4TB per hour. "The loading capabilities of this database are remarkable," said Brian Dolan, director of research analytics at Fox Interactive Media. "We're loading at rates of four terabytes an hour, consistently."

"This is definitely the fastest in the industry," said Greenplum's Werther. "Netezza for example quotes 500GB an hour, and we have not seen anyone doing more than 1TB an hour."

According to Werther, Greenplum utilises a "parallel-everywhere" approach to loading, in which data flows from one or more source systems to every node of the database without any sequential choke points. This differs from traditional "bulk loading" technologies, used by most mainstream database and MPP appliance vendors, which push data from a single source, often over one channel or a small number of parallel channels, resulting in fundamental bottlenecks and ever-increasing load times. Greenplum's approach also avoids the need for a "loader" tier of servers, as required by some other MPP database vendors.

The SG Streaming technology ensures parallelism by "scattering" data from all source systems across hundreds or thousands of parallel streams that simultaneously flow to all nodes of the Greenplum Database. Performance scales with the number of Greenplum Database nodes, and the technology supports both large batch and continuous near-real-time loading patterns with negligible impact on concurrent database operations.
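Greenplum has not published the mechanism at the code level, but the "scatter" step can be pictured as hash-partitioning source records across many parallel streams, one per database node, so that no single channel becomes a sequential choke point. The following Python sketch is purely illustrative: the node count, the queue-based streams, and routing by key hash are all assumptions for the sake of the example, not Greenplum's implementation.

```python
# Illustrative sketch of a "scatter" load: source records are
# hash-partitioned across parallel streams, one per database node,
# so no single channel becomes a choke point. Hypothetical example,
# not Greenplum's actual implementation.
import queue
import threading

NUM_NODES = 4  # a real MPP cluster would have many more nodes/streams

# One stream (queue) per node; a real system would use network channels.
streams = [queue.Queue() for _ in range(NUM_NODES)]

def node_worker(node_id, stream):
    """Each 'node' drains its own stream independently, in parallel."""
    loaded = 0
    while True:
        record = stream.get()
        if record is None:  # sentinel: end of load
            break
        loaded += 1
    print(f"node {node_id}: loaded {loaded} records")

workers = [
    threading.Thread(target=node_worker, args=(i, s))
    for i, s in enumerate(streams)
]
for w in workers:
    w.start()

# "Scatter": every source record is routed by a hash of its key, so all
# nodes receive data simultaneously rather than through one bulk channel.
source = ((i, f"row-{i}") for i in range(100_000))
for key, row in source:
    streams[hash(key) % NUM_NODES].put(row)

for s in streams:
    s.put(None)  # signal end of stream to each node
for w in workers:
    w.join()
```

In this toy version throughput grows with the number of streams because each "node" consumes its own queue concurrently, which mirrors the article's claim that performance scales with the number of database nodes.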

Another useful feature is that the data can be transformed and processed in-flight, utilising all nodes of the database in parallel, for extremely high-performance ELT (extract-load-transform) and ETLT (extract-transform-load-transform) loading pipelines.
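As a further hypothetical extension of the sketch above, an in-flight transform would simply be a function applied to each record inside the scatter loop, so the transformation work is spread across all parallel streams rather than performed as a separate serial pass before loading. The transform logic shown is invented for illustration.

```python
# Hypothetical in-flight transform step for the sketch above: records
# are cleaned and enriched as they stream to the nodes, rather than in
# a separate serial pre-load pass (assumed logic, for illustration).
def transform(row: str) -> str:
    # e.g. normalise case and append a derived field
    return row.upper() + ",derived"

for key, row in source:
    streams[hash(key) % NUM_NODES].put(transform(row))
```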

Of course, this means that Greenplum competes against hardware-based players such as NCR's Teradata and Netezza, as well as mainstream players such as Oracle. But Greenplum says that its ability to utilise off-the-shelf servers, storage, and networking means that customers are not tied into any particular hardware configuration, and are instead offered cost-effective scaling on commodity hardware.

Greenplum launched version 3.2 of its database software back in September last year. Greenplum Database 3.2 was the first database to include MapReduce, a parallel computing technique pioneered by Google for analysing the web, which boosted the DBMS's data analytics capabilities.