Big is not a precise word; it depends on your perspective, and you can hide an army of vagueness behind it. With the amount of attention the term “big data” has been getting lately, the challenge is to add something of value to the subject while staying away from the ambiguous, fluffy marketing speak that often dominates conversations on the topic.

Big data is not new, but it is in the spotlight because we are generating more data than ever before on our tablets and smartphones and in all the apps that run on them. We are also generating data in new ways, in a variety of structured and unstructured formats. Highly transactional GPS feeds, stock market information and social media posts might be stored in the same database as large image and video files for instant analysis.

As we learn to extract information from these data sets, big data will soon become the new normal. Until then, processing these data sets fast enough for the information to stay relevant remains one of the biggest challenges around big data, and one of its biggest drivers.

What constitutes ‘big’ could be measured by data set size, rate of change or growth, or complexity of relationship and structure, to name but a few metrics. Examining rate of change, or speed, in particular gives insight into some of the challenges around big data. While CPU performance follows Moore's law, the transaction performance of data storage systems has stagnated in comparison, which leaves the CPU's data processing potential sadly underutilised.

If you can’t get data to and from the CPU at a fast enough rate, the data to be analysed backs up, which can in turn lead to system failure. This is a common form of big data problem, and it can occur at many scales depending on the size of an organisation’s infrastructure.
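To make the imbalance concrete, here is a minimal sketch in Python (the rates are hypothetical, not figures from this article) of what happens when data arrives faster than the storage tier can absorb it:

```python
# Minimal sketch with hypothetical rates: when ingest outpaces storage,
# the backlog grows without bound until buffers fill and something fails.
INGEST_RATE_MB_S = 800    # rate at which data arrives for analysis (assumed)
STORAGE_RATE_MB_S = 200   # rate the storage tier can actually absorb (assumed)

backlog_mb = 0.0
for second in range(1, 11):
    backlog_mb += INGEST_RATE_MB_S - STORAGE_RATE_MB_S
    print(f"t={second:2d}s  backlog={backlog_mb:,.0f} MB")

# After 10 seconds the backlog is already 6,000 MB and still growing by
# 600 MB every second; with finite buffers the pipeline stalls or drops data.
```

Faster CPUs do not help here; the bottleneck is the rate at which data can be moved to and from them.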

The main approach to solving this challenge has been to scale out hardware and software. Scale out has been very effective at solving the problem at a relatively affordable price point. Software has grown and developed around the assumption that the scale out building block is a server of some form with CPU, RAM and some storage, and that the storage is slow compared to the CPU. Because servers are relatively inexpensive, it has long been cheaper to buy another server than to increase CPU efficiency by solving the data supply problem.

However, by using NAND flash as a memory tier rather than restricting it behind disk protocols, CPU workloads can often increase by more than 10 times. With flash as a memory tier, companies can host terabytes more data in high-performance memory in each server than would be possible (or affordable) with DRAM alone. Whilst some scale out may still be necessary, enterprises won’t need to scale out on such a grand scale to accommodate the processing demands of big data.
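As a rough illustration of the ‘flash as a memory tier’ idea (a generic operating-system technique, not Fusion-io's specific interface; the path and size below are hypothetical), an application can memory-map a file on a flash device and address it like RAM, while the operating system pages data to and from the flash behind the scenes:

```python
# Sketch: treating a flash-backed file as a memory tier via mmap.
# The path and size are hypothetical; this is not a specific vendor API.
import mmap
import os

PATH = "/mnt/flash/tier.bin"     # hypothetical file on a flash device
SIZE = 64 * 1024 * 1024          # 64 MB for the example

# Create (or reuse) the backing file at the desired size.
with open(PATH, "wb") as f:
    f.truncate(SIZE)

# Map it into the process address space: the application reads and writes
# it like ordinary memory, while the OS moves pages to and from flash.
fd = os.open(PATH, os.O_RDWR)
tier = mmap.mmap(fd, SIZE)

tier[0:5] = b"hello"             # byte-addressable access, no block I/O in the application code
print(bytes(tier[0:5]))          # b'hello'

tier.flush()                     # persist dirty pages to the flash device
tier.close()
os.close(fd)
```

The point of the sketch is simply that the application addresses data directly rather than issuing block reads and writes itself.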

This application acceleration approach allows many companies to scale current architectures considerably beyond their originally intended workloads, but eventually there is a limit. The next step in making the CPU-to-data-store transaction more efficient in supporting big data is to use NAND flash directly, bypassing redundant elements of the architecture through software-defined interfaces.

A number of very innovative companies are now entering this brave new world, as they see the impressive benefits in both performance and efficiency. For example, block protocols are, in simplistic terms, ‘read a block, write a block’, with very little or often no application awareness. By using NAND flash natively, application programmers can read and persist data as application-level transactions (a series of blocks) rather than as individual blocks, and do it faster and with much less code. NAND flash has only just started to deliver on its true potential after being wasted behind disk-based protocols for so long.
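To illustrate the difference (a purely hypothetical sketch, not Fusion-io's actual API), compare block-style access, where the application tracks block numbers and pads data to block boundaries, with a flash-native, application-aware interface that persists a whole record in one call:

```python
import io

BLOCK_SIZE = 4096  # typical block size; the devices and APIs here are hypothetical

# --- Block-protocol style: the application juggles block numbers and padding ---
def write_record_block_style(dev, record: bytes, lba: int) -> None:
    """Write a record onto a block device: pad to block boundaries, seek, write."""
    n_blocks = (len(record) + BLOCK_SIZE - 1) // BLOCK_SIZE
    padded = record.ljust(n_blocks * BLOCK_SIZE, b"\x00")
    dev.seek(lba * BLOCK_SIZE)
    for i in range(n_blocks):
        # One fixed-size block at a time; the device has no notion of a "record".
        dev.write(padded[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])
    dev.flush()

# --- Flash-native style: a stand-in store with application-level puts ---
class NativeFlashStore:
    """Hypothetical stand-in for a flash-native store; here it just uses a dict."""
    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # The store decides how the data is laid out on flash, so a single call
        # persists a complete application-level unit of data.
        self._data[key] = value

record = b'{"item": "widget", "qty": 3}'

dev = io.BytesIO()                 # stands in for a raw block device
write_record_block_style(dev, record, lba=0)

store = NativeFlashStore()
store.put("order:1001", record)    # one call, no blocks, offsets or padding
```

The block-style path is the one applications inherit when flash is hidden behind disk protocols; the second path is the kind of application-aware access the paragraph above describes.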

The benefit of these efficiency improvements is that companies can now perform up to 10 times the big data workload on the same system, which means the need to scale out is dramatically reduced and performance is improved. As companies’ data sets continue to grow, the benefits will grow accordingly.

Big data is really about big questions. Asking a big question, and getting an answer you can have a high degree of confidence in within a usable timeframe, usually takes a very large data set and a complex model to run that set through.

This is very evident in the academic, financial and retail fields, where big questions are asked every day. Answering them requires striking a balance between the power and efficiency of the CPU and the performance of persistent storage memory. Keeping up with big data, and going beyond what we have today, will require us to rethink what is possible. Luckily, many innovative enterprises are already making great strides with data centre solutions architected to address the big data needs of business today.

Posted by Mat Young, senior director at Fusion-io. Follow Mat on Twitter @ispider and Fusion-io via @fusionioUK
