Pinterest has seen an increase in users since it began refining its big data analysis with its in-house analytics platform, Pinalytics.
“We deployed the system in October last year and have seen an average daily increase thanks to internal tools,” Krishna Gade, Pinterest's engineering manager for the data team, told the Hadoop Summit in Brussels yesterday.
The social network, which describes itself as a ‘personalised discovery machine’, lets users ‘pin’ an interest, such as a street-style picture or a recipe, to the site. With over 30 billion pins since it went live in 2010, the company is heavily invested in keeping on top of its data.
It has seen an average boost of 400 unique users, 800 pageviews and 1,500 custom charts created and updated daily since October last year. It is also reported to have raised $367 million (£246 million) in new equity from new and existing investors, taking its total funding to $1.1 billion (£740 million).
The internal Pinalytics system uses Thrift services as an interface to Apache HBase databases, along with a web application user interface (UI) for dashboards. The firm needed infrastructure that could scale to help it understand the context and user intent of every pin, so it looked at incorporating the Hadoop platform.
Discussing the project at the European conference, Gade explained how the data modelling tool uses HBase to store reporting data and Thrift services to scan and aggregate it by tags such as the user's country or gender.
“We chose HBase instead of MySQL because there was no application-level sharding,” he added.
This meant the scrapbooking site's developers could avoid sharding (the partitioning of a database) at the application level and instead rely on simplified filtering with HBase’s FuzzyRowFilter.
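To illustrate the kind of filtering a fuzzy row filter makes possible, here is a minimal pure-Python sketch. It does not call HBase and is not Pinterest's code. The row key layout (metric, country, gender, date), the sample rows and the function names are all assumptions made for the example. The mask follows the convention used by HBase's FuzzyRowFilter, where 0 marks a byte that must match and 1 marks a byte that may be anything; requiring the key to be at least as long as the pattern is a simplification of this sketch.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical fixed-width row key: metric (4) + country (2) + gender (1) + date (8).
# Mask convention assumed here: 0 = byte must match the pattern, 1 = any byte is accepted.


def fuzzy_match(row_key: bytes, pattern: bytes, mask: bytes) -> bool:
    """Return True if row_key equals pattern at every position the mask fixes."""
    if len(pattern) != len(mask) or len(row_key) < len(pattern):
        return False
    return all(m == 1 or k == p for k, p, m in zip(row_key, pattern, mask))


def scan(rows: Iterable[Tuple[bytes, int]], pattern: bytes, mask: bytes) -> Iterator[Tuple[bytes, int]]:
    """Yield only the (key, value) pairs whose key passes the fuzzy match."""
    for key, value in rows:
        if fuzzy_match(key, pattern, mask):
            yield key, value


# Illustrative data only: weekly active users by country and gender.
rows = [
    (b"wau_USF20150401", 120),
    (b"wau_USM20150401", 100),
    (b"wau_GBF20150401", 40),
    (b"reg_USF20150401", 7),
]

# Fix metric, country and date; leave the gender byte open.
pattern = b"wau_US?20150401"
mask = bytes([0] * 4 + [0] * 2 + [1] + [0] * 8)

# Aggregate across the open dimension (gender) for US users.
total = sum(value for _, value in scan(rows, pattern, mask))
```

Because every dimension sits at a fixed offset in the key, one table can answer queries on any combination of tags without the application deciding which shard holds which rows.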
Pinalytics’ reporting mechanism is a client library written in Python, Gade said. It lets employees create reports such as a weekly active user report and specify query options, such as the date and whether to filter out spam. This helps the team monitor user registrations and retention, which Gade said had contributed to the site’s impressive traffic growth.
The front-end also includes an anomalous metric tracker, which spots any inconsistencies and sends emails to employees who use the dashboard for analysis. It also has formatted dashboards for reporting, so the data can be presented to senior management at Pinterest and its partners.
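The Pinalytics client library itself was not shown, so the following is only a rough sketch of what a weekly active user calculation with a spam filter might look like in Python. The `Event` record, the `is_spam` flag and the `weekly_active_users` function are hypothetical names invented for this example, and the dates are illustrative.

```python
from datetime import date, timedelta
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    """One user action on a given day (hypothetical record shape)."""
    user_id: str
    day: date
    is_spam: bool  # assumed flag set by an upstream spam classifier


def weekly_active_users(events: Iterable[Event], week_end: date, exclude_spam: bool = True) -> int:
    """Count distinct users active in the seven days ending on week_end."""
    week_start = week_end - timedelta(days=6)
    active = {
        e.user_id
        for e in events
        if week_start <= e.day <= week_end and not (exclude_spam and e.is_spam)
    }
    return len(active)


# Illustrative data only.
events = [
    Event("a", date(2015, 4, 13), False),
    Event("a", date(2015, 4, 14), False),  # same user twice, counted once
    Event("b", date(2015, 4, 10), False),
    Event("c", date(2015, 4, 15), True),   # spam, dropped by default
    Event("d", date(2015, 4, 1), False),   # outside the week
]

wau = weekly_active_users(events, date(2015, 4, 15))
wau_with_spam = weekly_active_users(events, date(2015, 4, 15), exclude_spam=False)
```

Counting distinct user IDs rather than events is what makes this an active-user metric, and the spam switch shows how a single query option can change the reported figure.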
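Gade did not describe how the tracker decides that a metric is anomalous. One common approach, shown here purely as an assumption, is to compare the latest value with the mean and standard deviation of recent history. The function name, the threshold of three standard deviations and the sample figures are all invented for the sketch.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag latest if it lies more than `threshold` standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation
    if sigma == 0:
        return latest != mu  # flat history: any change is unusual
    return abs(latest - mu) / sigma > threshold


# Illustrative daily values for one dashboard metric.
history = [100, 102, 98, 101, 99]

spike = is_anomalous(history, 150)
normal = is_anomalous(history, 101)
```

A production tracker would probably also account for weekly seasonality and send the email alerts described above; this sketch covers only the detection step.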
Pinterest couldn’t find a good open source workflow management system, Gade told the summit, so it built its own on Hadoop and recently open sourced the code, which was put on GitHub last month.
Again, Pinterest used Hadoop to build the job tool for its developers.
Already working entirely in the cloud, the firm chose Hadoop to “take advantage of the elasticity”.
“We wanted to be able to spin as many clusters as possible as we change and grow,” Gade said.
Pinterest's site is hosted on AWS, where it stores over eight billion objects and 400 terabytes of data. It uses Amazon Simple Storage Service (S3) and Amazon Elastic Compute Cloud (EC2).
Gade, a former Twitter employee, and his data team build core data infrastructure to support Pinterest's data tools and services. They also work on cutting-edge big data technologies such as Apache Kafka, Hadoop and Storm, and Amazon Web Services' Redshift.