House of Cards, starring Kevin Spacey, is the first major TV show to completely bypass the usual television ecosystem of networks and cable operators and premiere on the streaming service Netflix.
It may seem like Netflix took a big risk buying in unproven content rather than licensing content that was already successful. In reality, however, Netflix knew that the series would be a hit, based on data about the viewing habits of its 33 million users.
Using the NoSQL database Apache Cassandra, Netflix was able to gather real-time data about the programmes its customers were watching, their demographics and viewing patterns, and build up an authoritative picture of the kind of content that would be well received.
Matt Pfeil, co-founder and VP of Customer Solutions at big data software company DataStax, which worked with Netflix to implement Cassandra, explained that this is the first time that programming has been developed with the aid of big data algorithms.
“Netflix has all these data points about movies getting watched, and they can look for things like, do people like actions or drama who are our highest returning customers? Who is the lead in most of those? What type of characteristics of films provide the most engaged watching experience? And then they can use that to go figure out which series they should potentially buy,” he said.
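The kind of question Pfeil describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event records, field layout and function names below are hypothetical, and it is not Netflix's actual pipeline or data. It finds the most frequently returning users, then averages how much of each title they watched, grouped by genre and by lead actor.

```python
from collections import Counter, defaultdict

# Hypothetical viewing events (not real Netflix data):
# (user_id, title, genre, lead_actor, fraction_watched)
EVENTS = [
    ("u1", "Title A", "drama", "Actor X", 0.95),
    ("u1", "Title B", "drama", "Actor X", 0.90),
    ("u1", "Title C", "action", "Actor Y", 0.40),
    ("u2", "Title A", "drama", "Actor X", 0.85),
    ("u2", "Title D", "action", "Actor Y", 0.30),
    ("u3", "Title C", "action", "Actor Y", 0.20),
]

GENRE, LEAD, WATCHED = 2, 3, 4  # tuple positions of the fields used below


def top_returning_users(events, n):
    """Return the n users with the most viewing events."""
    counts = Counter(event[0] for event in events)
    return {user for user, _ in counts.most_common(n)}


def mean_engagement_by(events, users, field):
    """Average fraction watched per value of `field`, restricted to `users`."""
    totals = defaultdict(lambda: [0.0, 0])  # value -> [sum, count]
    for event in events:
        if event[0] in users:
            bucket = totals[event[field]]
            bucket[0] += event[WATCHED]
            bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


loyal = top_returning_users(EVENTS, n=2)
by_genre = mean_engagement_by(EVENTS, loyal, GENRE)
by_lead = mean_engagement_by(EVENTS, loyal, LEAD)
best_genre = max(by_genre, key=by_genre.get)
```

With this made-up sample, the two most frequent viewers watch far more of each drama than of each action title, which is the sort of signal that could feed a buying decision.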
Netflix began moving its data to Amazon Web Services (AWS) in 2010 and replaced its Oracle SQL database with Apache Cassandra the following year. According to Netflix, Oracle's SQL database inhibited the exchange of data around the world and required regular downtime for schema changes.
“From a practical computer science-level, traditional relational database technologies were not built to accommodate large volumes of data, especially in any way shape or form from a cost-effective perspective,” said Pfeil.
Netflix chose Cassandra because it offered a globally distributed data model, along with the flexibility to create and manage data clusters quickly.
By mid-2011, Netflix was running six major applications on Cassandra, including its subscriber system, A/B testing, and viewing history service (including the positions at which members stopped watching a streaming programme).
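As a rough illustration of what a viewing history service has to track, the Python sketch below keeps the latest stop position for each member and title. It is an in-memory stand-in, not Netflix's schema or Cassandra code: the class, method and identifier names are all hypothetical. It resolves conflicting writes by timestamp, which is also how Cassandra settles competing writes to the same cell.

```python
from dataclasses import dataclass, field


@dataclass
class ViewingHistory:
    """In-memory stand-in for a store keyed by member, then by title (hypothetical)."""

    _positions: dict = field(default_factory=dict)

    def record_stop(self, member_id, title_id, position_seconds, timestamp):
        """Keep the stop position with the newest timestamp for (member, title)."""
        by_title = self._positions.setdefault(member_id, {})
        current = by_title.get(title_id)
        # Last write wins: an update that arrives late but carries an older
        # timestamp must not overwrite a newer position.
        if current is None or timestamp >= current[1]:
            by_title[title_id] = (position_seconds, timestamp)

    def resume_position(self, member_id, title_id):
        """Return the position to resume from, or 0 if nothing is recorded."""
        entry = self._positions.get(member_id, {}).get(title_id)
        return entry[0] if entry is not None else 0


history = ViewingHistory()
history.record_stop("m1", "t1", position_seconds=600, timestamp=100)
history.record_stop("m1", "t1", position_seconds=1260, timestamp=200)
history.record_stop("m1", "t1", position_seconds=30, timestamp=150)  # stale, ignored
```

Here the member resumes at 1,260 seconds, because the update stamped 150 is older than the one stamped 200 even though it arrived last.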
The number of nodes in each cluster is a multiple of 12. In addition to the six production clusters, one for each application, Netflix has a shared 12-node Cassandra cluster for smaller applications that don’t need their own cluster.
According to Adrian Cockcroft, cloud architect at Netflix, the regular downtime that was needed for schema changes to the Oracle SQL database is no longer necessary, and a Cassandra cluster can be created in any region of the world in 10 minutes.
“We don’t have to plan capacity in advance, we don’t need to ask permission of other people to build things for us, and we don’t worry about running out of space or power,” he said.
Netflix is by no means the only big company using Cassandra to process big data in real time. DataStax has more than 250 customers worldwide, including 20 of the Fortune 100 companies, and the company claims that there is demand for the technology in almost every vertical.
For example, Rackspace is using Cassandra to monitor the metrics on all of its servers to determine which ones are under heavy load and might fall over.
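A minimal Python sketch of that kind of check, assuming load samples have already been read back from the metrics store, might look like the following. The server names, sample values and threshold are invented for illustration and do not describe Rackspace's system.

```python
from statistics import mean

# Hypothetical recent load samples per server (fraction of capacity in use).
SAMPLES = {
    "web-01": [0.35, 0.40, 0.38],
    "web-02": [0.92, 0.95, 0.97],
    "db-01": [0.70, 0.88, 0.93],
}


def servers_under_heavy_load(samples, threshold=0.85):
    """Return, sorted by name, servers whose mean recent load is at or above threshold."""
    return sorted(
        name
        for name, values in samples.items()
        if values and mean(values) >= threshold
    )


at_risk = servers_under_heavy_load(SAMPLES)
```

Averaging over a short window rather than reacting to a single reading keeps one brief spike from flagging an otherwise healthy server; the 0.85 threshold is an arbitrary choice for the example.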
“Everything from tech, healthcare, education, financials, retail. We're true platform players, so we're across the board,” said Pfeil.
DataStax already integrates Apache Hadoop and Apache Solr into its NoSQL big data platform, and Pfeil expects the technology to continue evolving over the next ten years.
“If you talk about this age as the data age, we're still in the teenage years, and as it matures there's going to be orders of magnitude of different types of technologies that just encompass big data,” he said. “The more data you have and the more you can do with it, the smarter the business decision.”