Big Data is all the rage these days, and more than a few organisations are at least wondering what sort of business intelligence they could derive from all the information at their disposal.
But while awareness of Big Data is growing, only a few organisations - such as Google and Facebook - are really in a position to capitalise on it now. The time is coming, though, and organisations that expect to leverage Big Data will not only have to understand the intricacies of foundational technologies like Apache Hadoop; they'll also need the infrastructure to help them make sense of the data and secure it.
In the next three to five years, we will see a widening gap between companies that understand and exploit Big Data and companies that are aware of it but don't know what to do about it, says Kaylan Viswanathan, global head of information management with Tata Consultancy Services' (TCS) global consulting group. The companies that succeed in turning Big Data into actionable information will have a clear competitive advantage, Viswanathan says.
"Today, most companies are aware of Big Data," he says. "There's a lot written about it. There are conferences about it. Awareness has become quite pervasive. But if you look at actually exploiting Big Data, I would say we're at the very beginning stages of it."
Viswanathan says he believes that Silicon Valley Internet-based businesses like Facebook and Google - where the entire business is based upon the management and exploitation of data - are leading the charge when it comes to Big Data. Industries like financial services won't be far behind, he says, and neither will the intelligence or military communities. Other verticals like retail, telecom, healthcare and manufacturing will follow.
"In terms of readiness to exploit Big Data relatively soon, I would say the companies have to be market leaders in their industry segments," he says. "They will be the ones that tend not to wait until others have exploited new technology. They would rather forge ahead and set the standard for their industry vertical."
The role of Big Data
What role would Big Data play? Well, for instance, a pharmaceutical company might want to identify the top 100 opinion-makers in the pharmaceutical world. To do so, it could crawl the web and go to millions of pages related to the industry, ingesting the data while weeding out anything that's not related to the objective. Or an automobile manufacturer could collect instrumentation data live from its cars in real-time as they're driven on the road.
In many cases, says Larry Warnock, CEO of Big Data encryption and key management specialist Gazzang, we have not yet imagined the ways in which we will leverage Big Data.
"It's like a giant fishing net dragging the bottom," Warnock says. "There's big fat tuna and swordfish in there, but also mussels and lobsters and flounder. They're just scraping data and they don't know yet what they're going to do with it. The correlations that could be drawn from that data haven't even been determined yet."
The semantic data model in Big Data
One of the keys to taking unstructured data - audio, video, images, unstructured text, events, tweets, wikis, forums and blogs - and extracting useful information from it is to create a semantic data model: a layer that sits on top of your data stores and helps you make sense of everything.
"We have to put data together from disparate sources and make sense of it," says David Saul, chief scientist at State Street, a financial services provider that serves global institutional investors. "Traditionally, the way in which we've done that and the way in which the industry has done that is we'll take extractions of that data from however many different places and build a repository and produce reports off that repository. That's a time-consuming process and not an extremely flexible one. Every time you make a change, you have to go back and change the data repository."
To make that process more efficient, State Street set out to establish a semantic layer that allows data to stay where it is, but provides additional descriptive information about it.
"We have to deal with a lot of reference information," Saul says. "Reference information can come from different sources. Our customers may call the same thing by two different names. Semantic technology has the ability to indicate those things are in fact the same thing. For instance, someone might call IBM 'IBM' or 'International Business Machines' or 'IBM Corporation' or some other variation. They really are the same thing. By showing that equivalence within the semantic layer, you can indicate they're the same thing."
Another example involves State Street's risk management business.
"If we're trying to pull together a risk profile for all of the exposures we have to a particular entity or geography or whatever, that information is kept in lots of different places. Numerical information in databases, unstructured information in documents or spreadsheets. We see that providing a semantic description for these various sources of risk information means we can quickly pull together a consolidated risk profile or an ad hoc request. One of the other benefits that we see is that semantic technology, unlike a lot of other things, doesn't mean we have to go back and redo all of our legacy systems and database definitions. It lays on top of that, so it's much less disruptive than another type of technology that would require us to go to a clean slate. We can do it incrementally. Once we've provided a semantic definition for one of these sources, we can add on other definitions from other sources without having to go back and redo the first one."
State Street has approached the semantic data model by building a set of tools that help end users - generally business people rather than programmers or DBAs - do the description themselves.
"The tools are much more designed for the actual owner of the data," Saul says. "In most cases that's not a programmer or DBA, that's a business person. The business person, in describing the data, knows what that data is. They know what this reference information is supposed to connote. Using the tool, they can translate that into a semantic definition and in turn use that and combine it with some other definitions to produce, say, a risk report or the onboarding of a new customer. For years we've talked about being able to blur the line that exists between IT and the business and having business be able to have tools where they can more clearly express requirements. This is a step in that direction. It's not full business process management, but it's certainly a step in getting there."
Securing Big Data
But collecting all this data and making it more accessible also means organisations need to be serious about securing it. And that requires thinking about security architecture from the beginning, Saul says.
"I believe the biggest mistake that most people make with security is they leave thinking about it until the very end, until they've done everything else: architecture, design and, in some cases, development," Saul says. "That is always a mistake."
Saul says that State Street has implemented an enterprise security framework in which every piece of data in its stores includes with it the kind of credentials required to access that data.
"By doing that, we get better security," he says. "We get much finer control. We have the ability to do reporting to satisfy audit requirements. Every piece of data is considered an asset. Part of that asset is who's entitled to look at it, who's entitled to change it, who's entitled to delete it, etc. Combine that with encryption, and if someone does break in and has free reign throughout the organisation, once they get to the data, there's still another protection that keeps them from getting access to the data and the context."
Gazzang's Warnock agrees, noting that companies that collect and leverage Big Data very quickly find that they have what Gartner calls 'toxic data' on their hands. For instance, imagine a wireless company that is collecting machine data - who's logged onto which towers, how long they're online, how much data they're using, whether they're moving or staying still - that can be used to provide insight into user behaviour. That same wireless company may have lots of user-generated data as well - credit card numbers, social security numbers, data on buying habits and patterns of usage - any information that a human has volunteered about their experience.
The capability to correlate that data and draw inferences from it could be valuable, but it is also toxic because if that correlated data were to go outside the organisation and wind up in someone else's hands, it could be devastating both to the individual and the organisation.
Warnock says the risk is often worth it. "Downstream analytics is the reason you gather all this data in the first place," he says. But organisations should then follow best practices by encrypting it.
"Over time, just as it's best practice to protect the perimeter with firewalls, it will be best practice to encrypt data at rest," he says.
When it comes to Big Data, Warnock says the key to encryption is transparent data encryption: essentially encrypting everything on the fly as it is captured and written to disk. That way, every piece of data ingested by the organisation is protected. In the past, companies have resisted such measures because of the monetary and performance costs. But Warnock notes that many tools are now open source, driving down their dollar cost, and the performance hit has dropped substantially, to just 3-5% at the application layer.
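The transparent part of transparent data encryption is that application code keeps reading and writing plaintext while everything at rest is ciphertext. The sketch below illustrates only that write-path/read-path idea; the toy keystream cipher (SHA-256 in counter mode) is a stand-in for a vetted cipher such as AES and must not be used for real data.

```python
# Illustrative only: data is encrypted as it is written and decrypted as
# it is read, so callers never handle ciphertext themselves.
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key + counter. Not production crypto."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # XOR with the keystream; applying it twice restores the plaintext.
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

class TransparentStore:
    """In-memory stand-in for a disk: everything at rest is ciphertext."""
    def __init__(self, key: bytes):
        self._key = key
        self._disk = {}

    def write(self, name: str, plaintext: bytes) -> None:
        self._disk[name] = xor_cipher(plaintext, self._key)  # encrypt on write

    def read(self, name: str) -> bytes:
        return xor_cipher(self._disk[name], self._key)       # decrypt on read

store = TransparentStore(key=b"demo-data-key")
store.write("record.txt", b"card=4111...")
assert store._disk["record.txt"] != b"card=4111..."  # at rest: ciphertext
print(store.read("record.txt"))  # b'card=4111...'
```

Real deployments get this effect from the filesystem or database layer (e.g. encrypted volumes or a database's built-in TDE), which is why the application-layer overhead Warnock cites can stay small.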
The other step to really making that encryption secure is an automated key management solution. "The secret for Big Data security, and quite frankly any kind of security, is key management," Warnock says. "Key management is the weak link in this whole encryption process."