I suspect many of us have experienced bewilderment when attempting to make sense of the NoSQL industry. A dizzying array of products, together with all kinds of newfangled concepts, often add up to confusion in the mind of the architect, who only wants to use the right technology to get a specific set of jobs done. I directly empathize, of course, since I myself am well along my own journey down this path, and have learned some important lessons along the way.
In this article, I hope to share some of these lessons with you and give you the practical advice I wish someone had given me before I started. So here goes.
Why Use NoSQL At All?
If you’re considering it because it seems to be the new trend, or because you know someone who knows someone who uses it, or because it is the shiny new toy, then stop right now. In fact, if you don’t have a specific problem you’re trying to solve that a relational database is incapable of solving, then you have already gone too far.
That ancient technology we call the relational database is your friend, truly. It is proven, it doesn’t lose your data, it doesn’t crash, it performs amazingly well considering what it does, and many find it the most straightforward persistence foundation for a business application. Like the family SUV, it’s not glamorous, but it gets you where you want to go with a minimum of fuss, and allows you to put all kinds of stuff in it.
So my first bit of advice:
You should only use NoSQL if you really need it.
The Four Key Problems
Given that, how do you know when you really need it? In my experience, the typical compelling needs for NoSQL revolve around questions like these:
- My database server is overwhelmed and I’ve already purchased the fastest machine on the planet – what now?
- How can I stand up a new data center on another continent and somehow keep data in sync across the ocean?
- I need to create a huge pile of read-only semi-structured data for analytic purposes, and the relational model is an awkward fit. Where should I store it?
- I have a really exotic type of data that just doesn’t fit well into a relational model. What should I do?
One can imagine a great many questions along these lines. But even with this small sampling, one can already identify the key problems that various NoSQL products try to solve:
The Local Scale Problem. “My transaction volume is too large for a single server, and I need something that scales out.”
The Big Data Problem. “My data volume is too large for a single server, and I need something that spans servers.”
The Global Data Problem. “I have to expand beyond a single data center, and I need something that supports the global cloud.”
The Data Diversity Problem. “My data needs have diverged and a one-size-fits-all approach no longer works.”
In my experience, no single NoSQL product really solves all of these problems perfectly, although many will claim they do. Therefore, you should choose the NoSQL product that solves the actual problems you have, and don’t waste time worrying about the others. And if you think about it, the business issue that is forcing you to consider NoSQL probably emerges from one of these basic problems.
Note that not one of these problems has anything to do with cost. I doubt that you will save money by moving to NoSQL, and in fact it might cost somewhat more, in sum. Certainly in the short run you’ll have increased costs related to retraining of developers, and so on. I purposefully have omitted any discussion of the financial aspect of NoSQL, since I think the reasons for using it are grounded in technical necessity, not any sort of cost savings or revenue acceleration.
Also note that, with the exception of the Data Diversity problem, all of these problems directly imply the need for a distributed persistence system of some kind. And that leads us to …
The CAP Theorem
We can’t go further without understanding the basics of the CAP theorem. I’m sure you’ve heard of this foundational theorem put forth by Eric Brewer. Certainly a great deal of fear, uncertainty and doubt exists surrounding it. So let’s try to make it as simple as we possibly can.
The basic idea is that for any distributed data system, there are three fundamental guarantees the system could make, and you can only have two of them at any moment in time:
Consistency – the data system guarantees that every read across all nodes sees the result of completed updates. Think isolation levels, transactions, etc.
Availability – the data system guarantees it will operate continuously without interruption or outage. Think no single points of failure, etc.
Partition Tolerance – the data system guarantees it can tolerate a network partition without error or data loss. Think surviving network outages, etc.
Every single data system out there will implement at least one, and hopefully two, of the three guarantees. But it is physically impossible to guarantee all three, at least until someone invents a time machine (and I certainly want to talk to that person if they do). Consequently, the CAP theorem defines the key trade-offs any persistence system makes, and understanding how a persistence system relates to the CAP theorem should be your first objective when evaluating any persistence system.
Consistency v. Availability
The CAP model, however, is incomplete. If you think about it, what is a CA system really guaranteeing you? It claims it gives you Consistency and Availability, sacrificing the ability to survive a network Partition. But when a network outage does occur, what happens? Either consistency must be sacrificed (meaning you allow data discrepancies to occur), or availability must be sacrificed (meaning the database is allowed to go down). Given that, I think a better definition of the CAP theorem is simply that during a network Partition, a distributed system must choose either Consistency or Availability.
Given that in the real world networks do go down, I see no other conclusion:
There are only two kinds of practical distributed persistence systems: Consistent systems and Available systems.
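To make that choice concrete, here is a toy sketch (invented for illustration; not any real product’s API) of two replicas facing a network partition. A Consistent system refuses writes it cannot replicate everywhere, sacrificing availability; an Available system accepts the write on whatever it can reach, accepting divergence.

```python
# Toy model of the C-vs-A choice during a partition (illustrative only).

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

def write(replicas, reachable, key, value, require_all=True):
    """Write to every replica we can currently reach.
    require_all=True models a Consistent system: refuse the write unless
    every replica acknowledges. require_all=False models an Available
    system: accept the write on whatever subset is reachable."""
    targets = [r for r in replicas if r.name in reachable]
    if require_all and len(targets) < len(replicas):
        return False  # the consistent system sacrifices availability
    for r in targets:
        r.data[key] = value
    return True  # the available system may now hold divergent data

a, b = Replica("a"), Replica("b")

# Healthy network: both replicas reachable, the write succeeds either way.
assert write([a, b], {"a", "b"}, "k", 1, require_all=True)

# Partition: only replica "a" is reachable.
assert not write([a, b], {"a"}, "k", 2, require_all=True)   # C: rejected
assert write([a, b], {"a"}, "k", 2, require_all=False)      # A: accepted
assert a.data["k"] != b.data["k"]  # ...at the cost of divergence
```

The divergence in the last line is exactly what “eventual consistency” machinery (read repair, hinted handoff, vector clocks) exists to clean up afterward.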
Now consider what sacrificing Availability really means. If your distributed persistence system can be located all within a single data center atop a truly reliable and high speed network that almost never goes down, then you might choose Consistency over Availability. The bet you are placing is that the consistency benefits outweigh the remote possibility of the network failing. And in most data centers, this may be a very reasonable bet to place.
However, if your distributed persistence system cannot be on a single reliable network, or if you want to plan for a future where your system will span data centers, then you really have little choice but to favor Availability and give up on Consistency.
This is the essential trade-off in distributed databases. If your business allows you a single data center to serve all customers around the globe, and will never ask you to change that topology, then lucky you. You have the choice of Consistency v. Availability. The rest of us really have no choice but to give up on Consistency so that we have a reasonable amount of Availability. And honestly:
In the long run, Consistency is doomed.
As the internet gets bigger and the world gets smaller, global scalability will become increasingly the norm. And folks, there can be no consistency in a globally distributed database. But for now, if your application doesn’t demand that kind of scale, by all means use a Consistent database.
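It is worth noting that Available systems such as Cassandra and Riak soften this trade-off with tunable, per-request quorums. The well-known rule of thumb: with N replicas, a read of R copies is guaranteed to overlap a write of W copies when R + W > N. A sketch of that arithmetic:

```python
def is_strongly_consistent(n, w, r):
    """Dynamo-style quorum rule of thumb: a read sees at least one
    up-to-date copy whenever the read set must overlap the write set,
    i.e. when R + W > N."""
    return r + w > n

# N=3 with QUORUM writes and QUORUM reads: consistent per-request.
assert is_strongly_consistent(3, 2, 2)

# N=3 with ONE/ONE: fast and highly available, but only eventually consistent.
assert not is_strongly_consistent(3, 1, 1)
```

So even in an Available system, individual requests can buy back consistency when they need it, at the price of latency and reduced tolerance for node failures.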
Available Databases That Scale
Now I will take a position on products. The only NoSQL databases that really interest me are those that scale out linearly. Perhaps you have other needs (see the next section) but for me the issue of scale defines NoSQL. So let’s examine how various NoSQL databases perform, along this scalability dimension.
Firstly, I am unimpressed by any NoSQL database that uses any form of master-slave replication, or that has any notion of one node in the cluster that is more important than the others. Anything along these lines constitutes a scalability anti-pattern. I am sure there are plenty of people who will tell me I’m wrong here, given the popularity of MongoDB, but the fact remains that these kinds of systems do not scale linearly. They just don’t.
Secondly, I think it is critical that a NoSQL database has some notion of the difference between a local network and a wide-area network. Local networks tend to have amazing reliability, and tend to provide quite astonishing bandwidth. Gigabit Ethernet is now the standard in every data center, and even faster networks become more and more common every day. Wide-area networks, however, typically provide much less bandwidth and are often quite unreliable. By far the most cost-effective wide-area network is a VPN tunneled through the public net itself, but such connections are as unreliable as the public net and tend to have a lot of transient hiccups. Private lines can be much more reliable than VPNs but the bandwidth is quite costly. So, if your scalability equation includes the need for a wide-area network, you need to consider a persistence system that understands the differences.
This leads to the following analysis, presented in order by product popularity.
| Product | Clustering | Eventual Consistency | Data Center Awareness |
| --- | --- | --- | --- |
| MongoDB | Master-slave | Yes | Some capabilities, but hindered by the clustering paradigm |
| Cassandra | Shared-nothing | Yes | Full data center (and even rack) awareness |
| HBase | Shared-nothing | Yes | Some capabilities, but a master-slave anti-pattern emerges between data centers |
| Couchbase | Shared-nothing | Yes | Full data center awareness |
| Riak | Shared-nothing | Yes | Full data center awareness (in the commercial version) |
(I stopped after I got to DBase. I cannot believe people are still using that!)
In any event, to me the conclusion is pretty simple. At present, there are only three NoSQL databases that meet my definition of scalability, in that they scale out linearly and are able to handle the differences between local and wide-area networking: Cassandra, Couchbase, and Riak. MongoDB is notably not on that list.
The Data Diversity Problem
Note that thus far, I have not mentioned a thing about graph databases v. document databases v. columnar databases v. key-value stores, etc. This is deliberate, as I do not consider the differences between those sorts of databases to be the defining qualities of a NoSQL database. The issues of scale and Consistency v. Availability are the foundational issues that drive the reason for NoSQL’s existence, in my view. With that said, sometimes you really do have an exotic type of data that doesn’t fit into a general-purpose scale-out distributed database.
The most common need along these lines is for analytical data, which tends to be data that is appended to but is never modified. Gigantic piles of data can be generated in such a scenario, and certainly there are a host of NoSQL systems that have emerged to meet this challenge. I’m not a data scientist, so I’m not going to weigh in on the relative merits of the various products out there in this space, other than to say that many choices exist and you should investigate them if you have this need.
Or, perhaps you’re storing transient user session information that has a simple schema but would destroy a relational database server because of the volume of writes. In such a case, perhaps you should investigate a key-value store or a distributed cache/data-grid solution.
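The shape of such a store is simple. Here is a toy sketch (illustrative only; in practice you would reach for Redis, Memcached, or a data grid) of the one feature a session store really needs beyond get/put, namely per-entry time-to-live:

```python
# Toy TTL key-value store, the shape of a session cache (not production code).
import time

class TTLStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value, ttl_seconds):
        # Store the value along with its absolute expiry time.
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires = item
        if time.monotonic() >= expires:
            del self._data[key]  # lazy expiry on read
            return None
        return value

s = TTLStore()
s.put("session:42", {"user": "ann"}, ttl_seconds=0.05)
assert s.get("session:42") == {"user": "ann"}
time.sleep(0.06)
assert s.get("session:42") is None  # expired and evicted
```

Real session stores add eviction under memory pressure and replication, but the write path is this simple, which is why they handle write volumes that would bury a relational server.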
Maybe you need the ability to search large quantities of data that do not change often. Maybe you need to be able to perform heuristic/fuzzy searches that are linguistic-aware. In these scenarios, an inverted-index search engine might be the ticket.
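The core data structure behind such engines is the inverted index: a map from each term to the set of documents containing it. A greatly simplified sketch (real engines add tokenization, stemming, ranking, and so on):

```python
# Minimal inverted index: term -> set of document ids (illustrative only).
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query."""
    sets = [index.get(t.lower(), set()) for t in query.split()]
    return set.intersection(*sets) if sets else set()

docs = {
    1: "NoSQL scales out",
    2: "relational databases scale up",
    3: "NoSQL is eventually consistent",
}
idx = build_index(docs)
assert search(idx, "NoSQL") == {1, 3}
assert search(idx, "nosql consistent") == {3}
```

Because the index is built once and queried many times, it fits the read-mostly, rarely-updated workload described above.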
Sometimes you may have a more transactional type of data that still defies storage in a relational database. Perhaps you’re Facebook and you have these insane graphs of relationships between people that relational databases can model but cannot support at high transaction volumes. In such a scenario, perhaps you should investigate a graph database. Perhaps you have some kind of binary data that would fill up that precious relational database’s expensive solid-state disk too quickly. Maybe you should consider a key-value store optimized for large binary objects.
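To see why relational engines struggle here, consider the canonical friends-of-friends query: in SQL it becomes one self-join per hop, while a graph store treats it as a plain traversal over adjacency lists. A toy sketch of that traversal (names invented for illustration):

```python
# Friends-of-friends as a breadth-first traversal of an adjacency map.
from collections import deque

def within_hops(graph, start, max_hops):
    """Return everyone reachable from start within max_hops edges."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(node, ()):
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, depth + 1))
    return seen - {start}

friends = {"ann": ["bob"], "bob": ["cara"], "cara": ["dan"]}
assert within_hops(friends, "ann", 2) == {"bob", "cara"}
```

The depth is a runtime parameter here; in SQL, each additional hop is another join, which is exactly what falls over at high volumes.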
So yes, there are certainly scenarios where the type of data you have points you in the direction of NoSQL. My feeling is that these tend to be exceptions. I read so much buzz online about how document databases are the one thing you should care about, and all other database models are passé, and all I can do is roll my eyes. I hear about how important the differences between a key-value database and a big table database are, and honestly I just don’t see it. Yes, there are relative advantages and disadvantages, but those seem minor in comparison to the really important aspects of scale and distribution. As with all things in technology, I think we should concentrate on the truly important attributes of a new technology and ignore the shiny parts that inevitably become buzzwords that we quickly forget.
So my final bit of advice is:
Don’t use NoSQL only because of the alleged benefits of one data model over another.
Use it because you need scalability, or you honestly have a data usage scenario that isn’t appropriate for relational.
A lot of people have a lot of opinions about NoSQL these days, and I am certainly no different. I think that this article is a little different from the many others I’ve read recently, which tend to focus on data model and other features of NoSQL products that I find interesting but ultimately irrelevant to the big picture. Hopefully you agree.
I titled the article “The NoSQL advice I wish someone had given me” sincerely. When I first began investigating NoSQL products several years ago, and discovered how many choices there were and how few standards, I could have used the advice in this article. It certainly would have helped me avoid a few early mistakes. I hope you find it similarly valuable.