The Pluses and Minuses of Data Grids

My recent post titled The NoSQL Advice I Wish Someone Had Given Me generated some good discussion online at DZone. One particular question, asking me about my opinion of GemFire as a NoSQL database, caused me to write a long reply. After reading said reply, I realized it would serve better as a post, so here we are.

Now, to be clear, I have not used GemFire itself, but the question got me thinking about data grid solutions, something I know well based on years of experience with Infinispan. In this article, I’m going to try to generalize about caches and data grids, from the perspective of NoSQL, and provide what I hope is useful advice to architects considering a move towards one.

What Is A Cache?

Let’s start with the basics. A cache, in modern enterprise software stack terms, can be viewed as a map. You can put values into it by providing a key, and then you can get values back out of it again by providing the key. Couldn’t be simpler.

What makes a cache a cache and not merely a map is primarily how it is used. The general idea is that you use a cache to reduce the load on some back-end data store, typically a relational database, by storing frequently-referenced data in it and not bothering the database server every time you need such a piece of data. Generally the pseudo-code goes something like this:

Object readSomething(Object key) {
   Object value = cache.get(key)
   if (value == null) {
      Object value = database.readUsingKey(key)
      cache.put(key, value)
   return value

Or, perhaps you’re using an ORM like JPA, which itself uses a cache to speed up operations. But that really just hides the same general kind of logic inside the ORM. The obvious objective is to read the database only when you don’t already know the value. For this purpose, this sort of logic works great.

The problem, of course, is what happens when the data changes? Then you end up with something like this, which keeps the cache up to date with data changes:

void writeSomething(Object key, Object value) {
   database.writeUsingKey(key, value)
   cache.remove(key, value)

Or alternatively:

void writeSomething(Object key, Object value) {
   database.writeUsingKey(key, value)
   cache.put(key, value)

But there is still a problem. What if you have multiple application servers? Each one of them has their own cache in memory. If a piece of data is written on one, how do the others know about it? Generally, one of these techniques are used:

  • Time-based caching – each node only remembers a given entry for a fixed period of time. After the entry expires, the database is again consulted for the latest data.
  • In-memory clustering – make the cache clusterable across a network, so that changes on one node are communicated to the other nodes.
  • Network caching – Remove the cache from application server memory, instead hosting it on a cluster of shared network servers.
  • Hybrid – some combination of the above.

Time-based caching is the simplest, since the cache does not need to know or care about any other nodes. And it can be extremely effective for data that changes rarely, or for data that does not need to be very consistent across your nodes. But what do you do when the data is continuously changing, and nodes really need to stay as up to date as possible?

You end up with either the in-memory or network caching models, which are simply collection of caches hosted across a set of nodes in a cluster:

Multi-node cache system.

Now we are no longer talking about simple caches but distributed caches. They could be in-memory caches, or external caches, or a combination of both. The key issue now is how changes are updated across all of the caches, which really defines the following cache types:

  • Invalidation caches – a given entry can exist on multiple nodes as long as it is unchanging. When one node changes an entry in the cache, all nodes are notified that their copies are invalid (and thus they should throw their copy of that entry away).
  • Replication caches – all nodes have all entries. When one node changes an entry, the change is replicated to all other nodes so they can update their copy with the fresh data.
  • Distribution caches – the entries are distributed among a subset of the nodes using some kind of consistent hashing scheme. When one node changes an entry, the change is replicated only to the other nodes that care about that entry.
  • Hybrid caches – some combination of the previous three.

Invalidation caches tend to be the most efficient at lightening the load on a database server, since only invalidation messages need to be sent around the network, and these are typically much smaller than the full entries in the cache. But the latter two, replication and distribution caches, are interesting when combined with the notion of cache persistence. What if the cache is capable of persisting its contents in some form, so that when it is bounced it does not start out cold and empty? This is a nice way to accomplish what is called cache warming, which resolves the thundering herd problem that often occurs when a system that relies on cache first starts up.

Once you have a distributed cache that is capable of remembering its contents, the obvious question is why do you need the database behind the cache at all? Which leads us to …

Data Grids

A data grid is simply a distributed caching system that also acts as the database itself. You get and put data into the cache by key, and the data grid takes care of distributing or replicating it. Each cache keeps a copy of its entries on local disk, so that it can re-read the contents at start up. You end up with something that looks like this:

Data Grid
Data grid, with local persistence for each cache.

It’s a simple concept and it can be extremely effective at managing large amounts of data at scale.  But some scalability challenges remain.

Firstly, data grids require all caches to be warmed at start up, since the cache cannot be distinguished from the database. But cache warming is deceptively hard to implement well, since the cache has to communicate with other nodes at start up to ensure its contents are still fresh, which often involves a lot of data transfer between nodes. You don’t want the cache to come online until it is done catching up, which tends to cause a long delay in start up. There is also an unavoidable race condition, since the node starting up is trying to warm a cache that is itself constantly changing on other nodes in the cluster. Think about that problem for a minute and you’ll see why it is really hard to solve.

Replication caches offer the best read performance, since every node in the cluster has every entry. The data is literally waiting there in memory for you to use it; what could be faster? However, write performance suffers since every change must be pushed to every node. Replication caches also have an n-squared scalability problem: the number of network connections between the nodes is the square of the number of nodes. This severely limits the size of your cluster.

Distribution caches partially mitigate the n-squared problem, since only a consistent subset of the cluster contains any particular entry. You still need a n-squared number of node-to-node network connections, but at least a given change only has to be pushed to a constant number of nodes. But how does the data grid react when nodes are added or removed from the cluster? Does the whole cluster have to be rebalanced? This can cause serious performance issues, as massive amounts of data must be transferred from node to node.

Also, the n-squared point-to-point connection problem still limits the number of nodes in your cluster, so modern data grids usually offer some means of using multicast communications. Multicast truly is the bomb! It is astonishingly parsimonious with network bandwidth, and can transmit huge amounts of data quickly. But it is a local network technology only. Routers do not like multicast since they don’t know who wants to receive the packets, so most routers will not allow multicast traffic to pass through. In other words: forget multicast if you want to scale beyond one data center, or if you want to host your cluster in the cloud, or even if you want to have more than one network in your private data center. Notably, Amazon AWS does not support multicast in their cloud; I would be surprised if other cloud vendors do not have the same limitation.

Distribution caches also do not perform as well on reads. Cache reads are most efficient when the entry you’re trying to read is already in memory. This is why replication caching is so fast at reads. But distribution caches do not guarantee you this. In fact, the larger your cluster is, the less likely it is that the entry you need is already available in local memory, and so a network hop has to occur to look up any piece of data. This tends to drive people towards standing up a cache in front of their cache, which starts to become ridiculous.

In my experience, the most realistic model for data grids is the distribution cache approach, using multicast network communications, with a sophisticated consistent hashing scheme that minimizes cluster rebalancing when the topology changes, and with a sophisticated cache warming scheme that minimizes network communication at start up and handles the inevitable race between cache warming and cache changes that are ongoing throughout the cluster.

But because multicast is involved, this limits your data grid to a single data center only. For many use cases, this is sufficient. But if you need to scale beyond the single data center, then you are going to have to think about an advanced data grid topology, some kind of multi-level “data grid of data grids”. For this to work, your data grid software will need to support asynchronous communications between individual data grids, something along the lines of the following:

Multi-Level Data Grid
Data grid of data grids.

Note the complexity involved. Each cache has to be able to replicate or distribute across all levels of the data-grid-of-data-grids. And for it to be of practical value, those replications across data grids must be able to happen over wide area networks, so they must be asynchronous. Our old friend the CAP Theorem raises its ugly head, forcing us into a choice between consistency and availability – which really forces us to give up immediate consistency in favor of availability (AP). While this model possesses the virtue of flexibility – all aspects of the multi-level data grid are visible to and tunable by the architect – it also results in much more of a maintenance headache operationally.

Indexing and Searching

Now let’s consider the developer’s view of the data. Any persistence system worth its salt will allow for the possibility of secondary indexes. After all, we often need to look data up by something other than the primary key. The cache model of the world – just a glorified map – does not lend itself well to secondary keys on cache entries, unless an additional layer is added on top of the cache interface to give the developer a higher-level view of the data.

But this leads to another level of complexity, where each cache is really a virtual construct composed of multiple physical caches – one primary and zero or more secondary:

Cache With Secondary Indexes
Virtual cache with primary and secondary indexes.

Put this together with the multi-level data grid model, and you now have a real bowl of spaghetti. It’s hierarchical, so it’s not impossible to visualize, but neither is it trivial. Furthermore, the larger the data model becomes, the more developers want a query language to help out. Add yet another layer atop of the cache API, to translate queries into cache calls and vice versa.

At this point, given we’re starting to approach the functionality of a full NoSQL database, I recommend you should start to look at software packages that will handle all this nastiness automatically for you. For example, products like Cassandra or Couchbase encapsulate all this complexity behind a nice familiar database query language and driver. You end up with about the same level of capabilities, just a different model for how the architect, developer, and administrator deal with the system. The data grid exposes more of its guts, which may provide more flexibility, but it also means one often has to think at a lower level about how the data grid all hangs together. It’s a matter of preference, I suppose, but I prefer a persistence system that looks and smells like a database, not a cache. Others may disagree. To each their own.


This whole post started with a question about what I thought about GemFire, a popular data grid solution. Again, I haven’t used GemFire specifically, so I cannot comment on how well it performs as a data grid. But I think this article provides a fairly simple general-purpose framework for considering data grids, and it shouldn’t be too hard to apply this framework when considering GemFire, or any other caching solution that holds itself forth as a data grid.

I hope you find it useful, if you are considering going the data grid route.