Hadoop is all the rage, it seems. With more than 150 enterprises of various sizes using it, including major companies such as JP Morgan Chase, Google and Yahoo, it may seem inevitable that the open source Big Data management system will land in your shop, too.
But before rushing in, make sure you know what you're signing up for. Using Hadoop requires training and a level of analytics expertise that not all companies have yet, customers and industry analysts say. And it's still a very young market. A number of Hadoop vendors are duking it out with various implementations, including cloud-based ones.
Most importantly perhaps, don't buy into the hype. Forrester Research analyst James Kobielus points out that only 1% of US enterprises are using Hadoop in production environments. "That will double or triple in the coming year," he expects, but caution is still called for, as with any up-and-coming technology.
To be sure, Hadoop has advantages over traditional database management systems, especially the ability to handle both structured data, like that found in relational databases, and unstructured information such as video, and lots of it. The system can also scale up with a minimum of fuss and bother.
eBay, the online global marketplace, has 9 petabytes of data, with structured data on clusters from Teradata and unstructured data on Hadoop-based clusters running on "thousands" of nodes, according to Hugh Williams, vice president of experience, search and platforms for the company.
"Hadoop has really changed the landscape for us," he says. "You can run lots of different jobs of different types on the same hardware. The world pre-Hadoop was fairly inflexible that way."
"You can make full use of a cluster in a way that's different from the way the last user used it," Williams explains. "It allows you to create innovation with very little barrier to entry. That's pretty powerful."
Scaling up, and up
One early Hadoop adopter, Concurrent, sells video streaming systems. It also stores and analyses huge quantities of video data for its customers. To better cope with the ever-rising amount of data it processes, Concurrent started using Hadoop CDH from Cloudera two years ago.
"Hadoop is the iron hammer we use for taking down big data problems," says William Lazzaro, Concurrent's director of engineering. "It allows us to take in and process large amounts of data in a short amount of time."
One Concurrent division collects and stores consumer statistics about video. That's where Hadoop comes to the rescue, Lazzaro says. "We have one customer now that is generating and storing three billion records a month. We expect at full rollout in the next three months that it will be 10 billion records a month."
Two key limitations for Concurrent in the past were that traditional relational databases can't handle unstructured data such as video and that the amount of data to be processed and stored was growing exponentially. "My customers want to keep their data for four to five years," Lazzaro explains. "And when they're generating one petabyte a day, that can be a big data problem."
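To put Lazzaro's figures in perspective, here is a rough back-of-the-envelope calculation in Python. It simply multiplies the quoted one petabyte a day by the quoted retention period. The function name and the optional `replication` parameter are illustrative assumptions, not anything Concurrent has described.

```python
# Rough retention estimate based on the figures quoted in the article.
PB_PER_DAY = 1        # "one petabyte a day" (quoted figure)
DAYS_PER_YEAR = 365   # leap days ignored for simplicity


def retained_petabytes(years, pb_per_day=PB_PER_DAY, replication=1):
    """Petabytes held after `years` of retention.

    `replication` is a hypothetical multiplier for stored copies;
    the default of 1 counts raw data only.
    """
    return years * DAYS_PER_YEAR * pb_per_day * replication


for years in (4, 5):
    print(f"{years} years: {retained_petabytes(years):,} PB of raw data")
```

At that rate, four to five years of retention works out to roughly 1,460 to 1,825 petabytes before any copies are counted.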
With Hadoop, Concurrent engineers found that they could handle the growing needs of their clients, he says. "During testing they tried processing two billion records a day for the customer, and by adding another server to the node we found we could complete what they needed and that it scaled immediately," Lazzaro says.
The company ran the same tests using traditional databases for comparison and found that one of the key benefits of Hadoop was that additional hardware could easily and quickly be added on as needed without requiring extra licensing fees because it is open source, he says. "That became a differentiator," Lazzaro says.
Another Hadoop user, life sciences and genomics company NextBio, works on projects involving huge data sets for human gene sequencing and related scientific research.
"We bring in all kinds of genomics data, then curate it, enrich it and compare it with other data sets" using Hadoop, says Satnam Alag, vice president of engineering for NextBio. "It allows mass analytics on huge amounts of public data" for their customers, which range from pharmaceutical companies to academic researchers. NextBio uses a Hadoop distribution from MapR.
A typical full genome sequence can contain 120GB to 150GB of compressed data, requiring about half a terabyte of storage for processing, he says. In the past, it would take three days to analyse it, but with 30 to 40 machines running Hadoop, NextBio's staff can do it now in three to four hours. "For any application that has to make use of this data, this makes a big difference," Alag says.
Another big advantage is that he can keep scaling the system up as needed by simply adding more nodes. "Without Hadoop, scaling would be challenging and costly," he says. This so-called horizontal scaling, adding more nodes of commodity hardware to the Hadoop cluster, is a "very cost effective way of scaling our system," Alag explains. The Hadoop framework automatically takes care of nodes failing in the cluster.
That's dramatically changed the way the company can expand its computing power to meet its needs, he says. "We don't want to spend millions of dollars on infrastructure. We don't have that kind of money available."
Allows for new types of applications
One huge benefit of Hadoop is its ability to analyse huge data sets to quickly spot trends, Lazzaro says. For a major retailer, that could mean scouring Facebook or Twitter user data to learn what scarf colours were in fashion last season, then comparing that information with today's hot colour trends to help determine what will sell this season.
"It gives you the ability to look back in time to look for opportunities for new sales," Lazzaro says. This plays out at Concurrent when the firm analyses a commercial or ad for a car dealership. "We can look at the data to see who's watched the commercials; then you might have a targeted sales lead you can leverage to make a sale. You don't always know what you are looking for."
Traditional databases can work for many sorting and analysis needs, but with ultra-large data sets, Hadoop can be a much more efficient way to find things, Lazzaro says. "It's really built for handling that."
For their part, eBay's engineers "like being able to work with unstructured data... and build new products for eBay quickly," Williams says. Because eBay engineers can access the firm's 300 million listings, historical information and vast amounts of related information, Williams says, "this allows us to understand customers and build experiences they want."
It's not really about the structured versus unstructured issue; rather, "it's about our engineers being able to roll up their sleeves and work with our data like never before," he says.
In the last year, eBay has done "some really amazing things with Hadoop, including improvements in merchandising, buyer experience and how customers use the site," Williams says.
During the year, for instance, eBay staffers can see when customers start typing in Halloween queries and Christmas queries. "With that I can tell you the kinds of things people are looking for. We didn't comprehend this use of the data five years ago, not at all."
Be careful out there
As good as Hadoop is, there are some cautions. First, don't commit to or standardise on one vendor quite yet, because it's such a turbulent space right now, Forrester's Kobielus suggests. "The vendors are all continuing to rapidly evolve." On the other hand, that does create a "vibrant ecosystem," he says.
Marcus Collins, an analyst at Gartner, says it's up to the enterprise to get the expertise needed to get the most out of Hadoop. "It's asking for a level of analytics capabilities that many companies don't have today," he says. "You need to train your staff and invest in analytics, and that will put you in the best position to exploit this technology."
Another key consideration: Most shops will need to hire Hadoop specialists, who are in short supply, or will need to train in-house staffers. "It's not trivial to use," eBay's Williams says. "So we've put a lot of training in place so our engineers know how to use Hadoop and can write code. You're going to have to invest in your developers and programme manager so they can become proficient users. Don't underestimate that."
Also be prepared for an organisational learning curve in terms of relying on an open-source system for a mission-critical application. Using it for a few under-the-radar projects is one thing, but it's another entirely to develop a massive system for all the world to see. Best be prepared to educate your management about the benefits of open source.
Another tip from Collins is to stay "intimately involved" with the project to make sure it goes as planned. "Don't just give your problems to your Hadoop vendor," he says. At the end of the day, "you're going to be running it."
Also, Kobielus explains, best practices with Hadoop are still evolving, so it's best to identify a short-term benefit you might get from the system and avoid anything too long-term to start. As you build up expertise, you can figure out more things to do with the software. In the meantime, the range of approaches that early adopters are using to build out and scale their clusters "are all over the board," he says.
Adds to, doesn't replace, other databases
Most customers are using Hadoop in addition to, not instead of, other types of software. At eBay, for instance, the company still uses relational databases and also does "a lot of custom [database] work," Williams explains. "At eBay, we see value in using multiple technologies to work with our data. Hadoop is a terrific choice for certain uses, while other technologies work alongside it for other purposes."
For example, when it comes to transactions, "it makes total sense to use a relational database system," he says. But overall the idea is to remain "flexible in what technologies we use at eBay; we don't see a world where there will be one unifying technology."
Learn to manage Hadoop efficiently by understanding how your organisation uses it. "If you have large numbers of people using a Hadoop cluster, they'll likely be trying to do some of the same things at once," Williams says. "That means they'll probably be generating the same intermediate data sets to analyse, and that's a waste."
Instead, he suggests, run common data queries once a morning and save the results in one place where anyone who needs them can use them, saving large amounts of processing time and related resources. "Think very hard about what data sets are useful for your users and create those data sets."
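The idea of running a common query once and sharing the result can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not eBay's implementation: the `shared_result` helper, the file naming scheme and the sample counts are all hypothetical, and a local temporary directory stands in for whatever shared storage a real cluster would use.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def shared_result(name, compute, shared_dir, day=None):
    """Return the saved result for `name` on `day`, computing it only if absent."""
    day = day or date.today()
    path = Path(shared_dir) / f"{name}-{day.isoformat()}.json"
    if path.exists():
        # Someone already ran the job today: reuse the saved output.
        return json.loads(path.read_text())
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result


# Demo: the "expensive job" runs once; the second caller reads the saved file.
runs = []


def top_queries():
    runs.append(1)  # stands in for an expensive cluster job
    return {"halloween": 3, "christmas": 2}  # made-up counts for illustration


today = date.today()
with tempfile.TemporaryDirectory() as shared:
    first = shared_result("top-queries", top_queries, shared, day=today)
    second = shared_result("top-queries", top_queries, shared, day=today)

print(len(runs), first == second)
```

Keying the saved file by name and date means the first request of the day pays the processing cost and every later request is just a file read.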
Cleaning up your Hadoop cluster is a key maintenance item. "This is really important," Williams says. "You'll probably run a lot of Hadoop jobs and you'll create a lot of data. Often, though, the people doing the work with the files will just walk away. That's pretty typical for users. If you do that, though, you'll end up with lots of extra Hadoop files."
"So you really have to create a strategy to keep your Hadoop cluster neat so you don't run out of disk space. Have people clean up what they don't need. Those kinds of things turn out to be pretty important if you've got a large Hadoop cluster."
The same is true at Concurrent. Hadoop hasn't replaced the company's use of traditional relational databases, including MySQL, PostgreSQL and Oracle. "It is a combined solution," Lazzaro says. "We use Hadoop to do the heavy lifting, such as large-scale data processing. We then use Map/Reduce within Hadoop to create summary data that is easily accessible through a traditional RDBMS."
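The combined approach Lazzaro describes, in which a large record set is reduced to a summary that is then loaded into a relational database, can be sketched roughly as follows. This is a toy, single-process illustration of the map, group and reduce steps rather than actual Hadoop code. The record fields, the `daily_views` table and the use of SQLite as the RDBMS are assumptions made for the example.

```python
import sqlite3
from collections import defaultdict


def map_record(record):
    # Map step: emit one (key, value) pair per viewing record.
    yield (record["customer"], record["day"]), 1


def reduce_counts(key, values):
    # Reduce step: collapse all values for a key into a single total.
    return key, sum(values)


def summarise(records):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)  # shuffle: group values by key
    return [reduce_counts(key, values) for key, values in sorted(groups.items())]


def load_summary(conn, summary):
    # Load the small summary into a relational table for quick queries.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_views "
        "(customer TEXT, day TEXT, views INTEGER)"
    )
    conn.executemany(
        "INSERT INTO daily_views VALUES (?, ?, ?)",
        [(customer, day, views) for (customer, day), views in summary],
    )
    conn.commit()


# Hypothetical sample records; a real job would read billions from the cluster.
records = [
    {"customer": "a", "day": "day-1"},
    {"customer": "a", "day": "day-1"},
    {"customer": "a", "day": "day-2"},
    {"customer": "b", "day": "day-1"},
]

conn = sqlite3.connect(":memory:")
load_summary(conn, summarise(records))
rows = conn.execute(
    "SELECT customer, day, views FROM daily_views ORDER BY customer, day"
).fetchall()
print(rows)
```

The heavy work happens over the full record set, while the database only ever holds one row per customer per day, which is what keeps ordinary SQL queries fast.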
What tends to happen in relational databases, he explains, is that when the system gets too large, say 250 million records a day, the database becomes "non-responsive to data queries".
"However," he says, "Hadoop at that scale is not even breaking a sweat. Hadoop therefore can store, say, five billion records and with Map/Reduce we can create a summary of that data and insert it into a standard RDBMS for quick access."
In general, Williams says, "I don't think too much" about Hadoop's limitations. "I think about the opportunities. You can find solutions to any problems pretty quickly" through the open source community. "Some people do gripe about different aspects of Hadoop, but it's a reasonably new thing. It's like Linux was back in 1993 or 1994."
"We do see unique technology challenges at our scale and with our extreme data," Williams explains, among them architecting data centres, designing a network to support Hadoop and choosing the right hardware.
Overall, Hadoop has been a good strategy for eBay, Williams says. "For us it's been an absolute game changer. It's what our engineers want to use and it's really helped us become a really data-driven company."