The burgeoning tech industry movement around big data is churning up a variety of new applications, but the field is still evolving and faces lingering challenges, judging from an event held at a Microsoft research facility.
Big data refers to the ever-growing quantity and variety of data, particularly in unstructured form, being generated by websites, sensors, social media and other sources, as well as a growing array of technologies aimed at deriving insights from it.
Startup Recorded Future seeks to perform "temporal analysis" of information found in the public web, said Christopher Ahlberg, CEO and co-founder, during a panel discussion at the Massachusetts Technology Leadership Council's Big Data Disruption event.
Recorded Future's system taps into some 70,000 sources, including news sites, trade publications, blogs and financial databases, sifting through the information and identifying references to individual entities and events, Ahlberg said. "We're ingesting 100,000 to 300,000 documents every hour."
Publication dates and other time-related bits of information are associated with the references, allowing them to be organised in a historical manner. Then they are analysed for sentiment and tone.
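To make the idea of organising references "in a historical manner" concrete, here is a minimal Python sketch. It is not Recorded Future's system; the entity names, dates and sentiment scores are invented for illustration, and `build_timelines` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import date

# Hypothetical extracted references: (entity, publication date, sentiment score).
# All values are made up for illustration.
references = [
    ("Acme Corp", date(2011, 11, 3), -0.4),
    ("Acme Corp", date(2011, 10, 12), 0.2),
    ("Globex", date(2011, 11, 1), 0.6),
    ("Acme Corp", date(2011, 11, 20), -0.7),
]


def build_timelines(refs):
    """Group references by entity and order each group by publication date."""
    timelines = defaultdict(list)
    for entity, published, sentiment in refs:
        timelines[entity].append((published, sentiment))
    for events in timelines.values():
        events.sort(key=lambda event: event[0])  # oldest reference first
    return dict(timelines)


timelines = build_timelines(references)
for entity, events in timelines.items():
    print(entity, [(d.isoformat(), s) for d, s in events])
```

Once references sit on a per-entity timeline, a shift in sentiment over time can be read off directly, which is roughly the kind of signal the article describes.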
Recorded Future's capabilities are used by defence agencies, financial services firms and competitive intelligence experts. The system can be used to pinpoint "broad signals," such as indications that a stock may rise or fall, or for "fine-grained alerting" on a specific type of news event, Ahlberg said.
Startup DataXu offers analytics meant to help digital marketing executives. Its software analyses data derived from tracking pixels embedded in online ads and builds predictive models showing which types of ad impressions are most likely to lead to sales, said CTO Bill Simmons during another panel talk. DataXu's customers "want to change the minds of consumers and build a brand," he said. To do so, they may need to show an advertising message 100 times, but "where do you show it?" he added.
DataXu is applying machine learning "to a very imbalanced problem," given that today, thousands of ad impressions may lead to only one person buying anything, Simmons added. His company also has to make its service more cost-effective than simply buying and running ads at saturation levels, he said.
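One common way to handle such an imbalance, shown here only as an illustrative sketch and not as DataXu's method, is to weight the rare class more heavily during training. The 1-in-1,000 conversion rate below is an assumed figure, and `balanced_class_weights` is a hypothetical helper that uses the widely used "balanced" heuristic of total samples divided by (number of classes times class count).

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    weight = n_samples / (n_classes * count_of_class)
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Assumed example: one purchase (label 1) among 1,000 ad impressions.
labels = [1] + [0] * 999
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, a single misclassified purchase costs the model about as much as misclassifying all the non-purchases combined, so it cannot score well simply by predicting "no sale" every time.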
Many speakers on Wednesday referred to their companies' use of Hadoop, one of the technologies most closely associated with big data. Hadoop is an open source programming framework that allows users to split up large processing jobs and run them in parallel across clusters of servers.
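The split-and-run-in-parallel pattern can be sketched in a few lines of Python. This is not Hadoop's own Java API: local threads stand in for servers in a cluster, and the function names and sample text are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: count words in one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, n_chunks=4):
    # Split the input into roughly equal chunks, one per worker.
    chunks = [lines[i::n_chunks] for i in range(n_chunks)]
    # Threads stand in for the separate servers a real cluster would use.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


documents = [
    "big data is big",
    "data in place",
    "move the compute not the data",
]
totals = word_count(documents)
print(totals.most_common(3))
```

Each chunk is processed independently, so the work scales by adding workers; only the final merge needs to see all the partial results.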
But Hadoop in its current form has serious limitations, said Michael Stonebraker, a Massachusetts Institute of Technology professor and founder of a number of database vendors. He was also the primary architect for the Ingres and Postgres database systems and is currently CTO of VoltDB.
For one, it "has terrible performance on data management," he said. In addition, Hadoop is a low-level interface that requires people to program in Java, Stonebraker said. "Forty years of research says high-level languages are good."
The problems Stonebraker cited could be mitigated over time, however, given that an array of vendors have been rolling out various tools meant to make Hadoop easier to use.
Meanwhile, EMC's Greenplum division is "building a platform for the future of big data," said George Radford, field CTO. That includes both row-based and columnar stores, integrated Hadoop storage, and integration with the Gemfire in-memory data grid for in-memory analytics, he said. This integration is crucial, according to Radford. "One of the problems with point solutions is with big data, the last thing you want to do is move it. You want to ingest it and analyse it in place."
But a new problem for big data is emerging even as companies like EMC Greenplum make these technological strides, Radford added. "Like everyone else here, we're looking for data scientists. As we solve the platform issues, people are going to be transformed from bit-tweakers and tuners to active partners with the business."
At another point, talk turned to big data's relationship with cloud computing, particularly public infrastructure offerings like Amazon Web Services, which offer raw compute power for developers.
Such systems present "an extremely challenging environment" for big data processing given the limited control users ultimately have over factors like the underlying network and storage, said Fritz Knabe, distinguished engineer at IBM's Netezza division.
But the public cloud does make sense for large processing jobs in some cases, Stonebraker said. "If you are doing month-end reporting and you need 1,000 processors for three hours, go ahead and do that on the [public] cloud. There's some low-hanging fruit."