Hadoop is about to get a lot less complicated. A number of the largest big data vendors, including IBM, Hortonworks and Pivotal, have joined forces to standardise the base platform for the open source software.

It is hoped that the Open Data Platform, announced in February, will reduce the work enterprises must do to build and maintain complex Hadoop-based data analysis systems.

Large firms may begin to see Hadoop creep in next to their mission-critical apps in the datacentre. Credit: iStock/Alvarez

Establishing a common library for Hadoop will take the pain out of understanding how the framework and its distribution packages integrate with a firm’s existing infrastructure. If the platform is standardised, organisations could use off-the-shelf software in their Hadoop systems, mixing and matching Hadoop components from different vendors.

Techworld analyses the business case, and the skills you need in your workforce, to get you going on this technology favoured by startups.

Hadoop history

Hadoop’s roots lie with the unequivocal big data behemoth of our age: Google. The search giant wasn’t the first to enter the market; it was the nineteenth, behind billion-dollar firm Yahoo. Within two years, Google’s proprietary technology, its MapReduce framework, helped it become the dominant player.

When Google published a paper on its architecture, it inspired the open source project now known as Hadoop.

Google now invests in Hadoop distribution packages, like MapR, which make it easier for firms to organise and provision their data on the framework.

MapR’s chief technology officer, M.C. Srivas, is a former Google employee and SAP architect at Spinnaker. His vision was to bring the power of an open source framework together with the breadth and type of features that help it operate like high-end storage, including high availability, disaster recovery as a service and snapshots.

Why Hadoop?

The majority of firms turn to Hadoop for cost reduction. It uses commodity servers that eliminate the need for mainframe applications, enterprise data warehouses or high-end storage. However, some are beginning to consider it as a way to boost revenue, for example in recommendation engines that drive customer interactions online, as well as in marketing and churn analysis.

Additionally, some credit card companies in the US are using it to deliver offers to customers based on their buying preferences.

The third motivation for using Hadoop is risk mitigation. Hadoop can detect anomalies very quickly, and by placing an automated package on top of that data layer, firms can predict and mitigate fraud before it occurs.

While enterprises have addressed these issues with data analytics practices for decades, business intelligence arrives in hindsight and uses only a small sample. Hadoop can crunch all your data and look at the long tail, giving your business the bigger picture.

As Jack Norris, chief marketing officer at MapR, explains: “The approach with Hadoop is brute force…I’m looking for the actual anomalies, rather than a model version of reality from a sample set. I’m looking at the real spikes and if they are revenue spikes – how do I find more of those?”
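
The difference between scanning everything and working from a sample can be illustrated with a minimal Python sketch. This is not Hadoop code: it imitates the map-then-reduce pattern on a single machine, and the revenue records, the function names and the "ten times the median" threshold are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical revenue records: (store_id, revenue). Illustrative data only.
records = [("store-%d" % (i % 5), 100.0 + (i % 7)) for i in range(1000)]
records[437] = ("store-2", 5000.0)  # one rare spike hidden in the data


def map_phase(rows):
    """Emit (key, value) pairs, as a mapper would."""
    for store, revenue in rows:
        yield store, revenue


def reduce_phase(pairs, factor=10.0):
    """Group values by key, then flag values far above that key's median."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    spikes = []
    for key, values in grouped.items():
        typical = median(values)
        spikes.extend((key, v) for v in values if v > factor * typical)
    return spikes


# Brute force: every record is examined, so the spike is found.
full_scan = reduce_phase(map_phase(records))

# Sampling: only every tenth record is examined, so the spike is missed.
sampled = reduce_phase(map_phase(records[::10]))

print("full scan:", full_scan)
print("sample:   ", sampled)
```

In this toy data set the full scan reports the single spike, while the one-in-ten sample never sees it. On a real cluster the same two phases would run in parallel across many machines.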

Startups like Netflix have an advantage over firms with legacy infrastructure when it comes to implementing Hadoop, often using the technology and big data analysis as their core competitive differentiator or product.

But if an enterprise wants to prepare for disruption and make use of the large sets of data it already collects, it should be considering the tool, Norris says.

Enterprise storage and data warehouse employees who are not versed in Hadoop may struggle with new interfaces, and extra training may not be an option for you. By hooking up new connectors and APIs and developing around unfamiliar UIs, you risk adding complexity, so a distribution package layered on top of your data may be the simplest option.

However, if you are looking to learn Hadoop-specific skills or build up a Hadoop team within your firm, here are the three main roles you need to be successful.

1. Developers

A developer will be creating applications in Hadoop. The role could be considered an incremental approach to data warehousing.

That said, many firms are using their data warehouse alongside Hadoop and augmenting it to fit both. But real game-changing firms, Norris says, are using or developing automated applications, so that Hadoop moves from being a reporting function to making judgements as actions occur. For example, skills in HBase could assist with adjusting online adverts to optimise supply and demand; deciding whether there is fraudulent activity before a credit card transaction is complete; or even predicting equipment failure from sensor data on the factory floor.

“It is a very different process,” says Norris. “You still need to augment, offload and optimise the data but it doesn’t stop there.”

2. Data analysts

Analysts will need to access and understand the data contained in Hadoop. They will need to run SQL queries on Hadoop and learn how to compare large datasets. Many data analysts will already be familiar with these processes and will be looking to use those skills in the Hadoop environment.
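
As a rough illustration of comparing two datasets with SQL, the sketch below uses Python’s built-in sqlite3 module as a small stand-in for a SQL-on-Hadoop engine such as Apache Hive. The table names, columns and figures are hypothetical, and a real Hadoop deployment would use its own connection details and SQL dialect.

```python
import sqlite3

# In-memory database standing in for a SQL-on-Hadoop engine (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_2014 (customer_id INTEGER, total REAL);
    CREATE TABLE orders_2015 (customer_id INTEGER, total REAL);
    INSERT INTO orders_2014 VALUES (1, 120.0), (2, 80.0), (3, 45.0);
    INSERT INTO orders_2015 VALUES (1, 150.0), (3, 45.0), (4, 60.0);
""")

# Customers present in the first dataset but missing from the second.
churned = conn.execute("""
    SELECT a.customer_id
    FROM orders_2014 AS a
    LEFT JOIN orders_2015 AS b ON a.customer_id = b.customer_id
    WHERE b.customer_id IS NULL
    ORDER BY a.customer_id
""").fetchall()

# Customers present in both datasets whose total spend changed.
changed = conn.execute("""
    SELECT a.customer_id, a.total, b.total
    FROM orders_2014 AS a
    JOIN orders_2015 AS b ON a.customer_id = b.customer_id
    WHERE a.total <> b.total
    ORDER BY a.customer_id
""").fetchall()

print("possible churn:", churned)
print("changed spend: ", changed)
```

The joins are the same ones an analyst would write against a conventional data warehouse, which is why existing SQL skills carry over.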

3. Administrators

The traditional role of the IT professional remains with Hadoop. Skills include maintaining and deploying provisions that manage the Hadoop cluster. Administrators need experience in meeting service level agreements (SLAs) and in integrating Hadoop with other environments, depending on your infrastructure.

“We believe Hadoop is so powerful it is going to sit alongside mission critical processes in the datacentre, and that integration has to be front and centre,” Norris adds.

If you are interested in learning more about Hadoop as a developer, administrator or analyst, there is a host of free online guides that teach you the ropes in several hours.

MapR is also offering free courses and certification for Hadoop.