There seems to be a common misconception around the reliability of emerging big data platforms: "NoSQL—no worries, the backup is built-in".

The built-in redundancy of distributed NoSQL architectures leaves many with a false sense of security. But out-of-the-box redundancy alone is not enough to satisfy backup requirements.

Consider the two reasons for backup: availability and persistence. Just like traditional RDBMS, without special consideration, out-of-the-box NoSQL paradigms fail on both counts.

  1. Availability guarantees data access in the event of equipment failure. Indeed, NoSQL replicates data across nodes, which protects against server failure. But what about site failure? What if the data centre hosting your NoSQL cluster experiences a service outage? You could find yourself in serious trouble.

    Okay, so lesson 1, you want to deploy your system with disaster recovery (DR) enabled to distribute your data across different data centres. Luckily many emerging big data platforms have this option built-in, and for some it was a salient design consideration (unlike most relational platforms).
  2. Persistence protects data from loss in the event of data corruption, user error or malicious users. Data replicas enable restoration of the data set to a particular point in time. By themselves, most NoSQL platforms maintain only the current state of the data; they do not track previous versions. And even for schemes with versioning, as with any live system, the platform itself is not the right place for persisting backups: you simply cannot afford the storage space.

    For lesson 2, you need to include a retention mechanism. You can apply traditional snapshot schemes that copy the entire cluster. But the resulting copies include the built-in replication factor that protects against server failure. Remember, we’re talking big data: petabytes of data. Do you really want to include the replication factor in all backups? Maybe… but the good news is that there are ways to back up just one copy.
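To see why carrying the replication factor into backups matters, here is a rough sketch of the storage arithmetic. The figures (1 PB of logical data, replication factor 3, seven retained snapshots) are illustrative assumptions, not numbers from any real deployment:

```python
# Illustrative sketch: raw storage consumed by naive full-cluster snapshots
# that carry the built-in replication factor, versus keeping one copy.
# All figures are hypothetical.

def snapshot_size_tb(logical_data_tb, replication_factor, copies_retained):
    """Raw terabytes consumed when each snapshot includes every replica."""
    return logical_data_tb * replication_factor * copies_retained

def single_copy_size_tb(logical_data_tb, copies_retained):
    """Raw terabytes consumed when each snapshot keeps just one replica."""
    return logical_data_tb * copies_retained

# 1 PB (1000 TB) of logical data, replication factor 3, 7 daily snapshots.
naive = snapshot_size_tb(1000, 3, 7)
single = single_copy_size_tb(1000, 7)
print(naive, single)  # 21000 7000
```

At these (made-up) numbers, stripping the replication factor cuts backup storage threefold, from 21 PB to 7 PB, which is the motivation for single-copy backup schemes.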

Let’s consider lessons learned from a certain subscription-based, on-demand streaming media provider. This media provider stores its users’ favourite-movies lists in Cassandra on Amazon. Its DR scheme tunes Cassandra so that one replica of the data is hosted in a DR site in a different Amazon region.
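In Cassandra, a per-data-centre replica count like this is typically expressed through the NetworkTopologyStrategy replication settings on the keyspace. A minimal sketch that builds such a statement, where the keyspace name, data-centre names and replica counts are hypothetical, not the provider’s actual configuration:

```python
# Sketch: a Cassandra keyspace that keeps most replicas in the primary
# region and a single replica in a DR region, via NetworkTopologyStrategy.
# Keyspace and data-centre names and counts here are hypothetical.

def dr_keyspace_ddl(keyspace, primary_dc, primary_rf, dr_dc, dr_rf=1):
    """Build the CQL statement for a multi-data-centre keyspace."""
    return (
        f"CREATE KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', "
        f"'{primary_dc}': {primary_rf}, '{dr_dc}': {dr_rf}}};"
    )

print(dr_keyspace_ddl("favourites", "us_east", 3, "eu_west"))
```

With dr_rf=1, exactly one replica of every row lives in the DR data centre, which is the property the backup scheme below relies on.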

Protected from site failure? Check! You can then back up just this DR cluster. The media provider backs it up to Amazon’s S3. Data is persisted? Replication is minimised? Check! Check! No extra replicas are carried in the backup, and the backup process does not impact the performance of the primary cluster. You can also use this DR cluster for analytics. Bonus!
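One common way to implement the S3 step is to snapshot each DR node with nodetool and copy the snapshot files to a bucket. The article does not say which tooling the provider uses, so the following is only a sketch: paths, tags and bucket names are invented, and the commands are composed as strings rather than executed.

```python
# Sketch: compose the shell commands to back up a Cassandra keyspace on a
# DR node to S3 via a nodetool snapshot. Paths, snapshot tags and bucket
# names are hypothetical; commands are built as strings, not executed.

def backup_commands(keyspace, tag, data_dir, bucket):
    """Return the commands to snapshot a keyspace and copy it to S3."""
    return [
        # Flush memtables and take a hard-link snapshot on the local node.
        f"nodetool snapshot -t {tag} {keyspace}",
        # Copy snapshot files to S3; snapshots live under each table's
        # data directory in a snapshots/<tag> subfolder.
        f"aws s3 sync {data_dir}/{keyspace} s3://{bucket}/{tag} "
        f"--exclude '*' --include '*/snapshots/{tag}/*'",
        # Free the local disk space once the copy completes.
        f"nodetool clearsnapshot -t {tag} {keyspace}",
    ]

for cmd in backup_commands("favourites", "daily-backup",
                           "/var/lib/cassandra/data", "my-dr-backups"):
    print(cmd)
```

Because the snapshot runs only on DR nodes that hold a single replica, the copy landing in S3 is one replication-free copy of the data, and the primary cluster never serves the backup I/O.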

This DR scheme gets us part of the way there, but there are still some open issues worth considering.

  • Understanding the consistency trade-off: The DR site in a different data centre is updated after the primary. When dealing with any DR scheme, you need to understand just how out of sync the DR data is with the primary data. For movie-related data, it may not be a big deal, but for user-related or transaction-related data it could be.
  • Backup and restore apply to the entire data set: Did we fail to mention that the above DR scheme saves and restores the entire data set? That makes sense for the media provider when it needs to restore the favourites lists for all of its users from last week. But what about just a subset of the data? What if my 3-year-old son scores a bunch of movies on my account? Recovering just my information would require a restore of the entire data set. Again, it’s big data: do you think it’s worthwhile to stage a replica of the entire backed-up cluster just for one user? Probably not. It looks like I’ll get lots of recommendations for Thomas the Tank Engine.
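The subset-restore problem can be made concrete with a toy sketch. With a whole-data-set backup scheme, the only lever you have is restoring everything somewhere and then filtering out the rows you actually wanted; the record shapes and names below are invented for illustration:

```python
# Toy sketch of subset recovery from a whole-data-set backup: the entire
# snapshot must be staged before one user's rows can be picked out.
# Record shapes and names are hypothetical.

restored_snapshot = [  # a fully restored backup: every user's favourites
    {"user": "alice", "movie": "Inception", "rating": 5},
    {"user": "alice", "movie": "Thomas the Tank Engine", "rating": 5},
    {"user": "bob", "movie": "Heat", "rating": 4},
]

def recover_user(restored_rows, user):
    """Extract one user's rows from a fully restored data set."""
    return [row for row in restored_rows if row["user"] == user]

print(recover_user(restored_snapshot, "bob"))
# [{'user': 'bob', 'movie': 'Heat', 'rating': 4}]
```

The filter itself is trivial; the cost is that restored_snapshot had to be materialised in full first, which at petabyte scale is rarely worth doing for a single user.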

The good news is that addressing these issues is an area of focus for development teams working on these platforms. Stay tuned for some recent advancements.

So while out-of-the-box NoSQL is risky, there are precautions available to add backup functionality. And in many ways, these platforms are better prepared for backup than traditional relational models. In the meantime, I hope I’ve convinced you that if you don’t take additional measures, you’re operating big data without a safety net!

Posted by Teresa Tung