
Golang’s superior cache solution to Memcached and Redis

GroupCache has been a fantastic addition to our distributed services toolset at Mailgun. Here's how you can implement it in yours.



Clients for distributed caching systems like Redis and Memcached typically work in the following manner:

  1. App asks the Client for the cached data via a key.

  2. Client performs a consistent hash on the key to determine which Node has the data.

  3. Client makes a network request to the Node.

  4. Node returns the data if found.

  5. App checks whether data was returned; if not, it renders the data or fetches it from the database.

  6. App tells Client to store data for this key.

  7. Client performs a consistent hash on the key to determine which Node should own the data.

  8. Client stores the data on the Node.
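
The steps above can be sketched as follows. This is a minimal illustration, not code from any real client: `RemoteCache`, `mapCache`, and `lookup` are hypothetical names, and the "remote" cache is faked in memory so that the round trips can be counted.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMiss is returned when a key is not in the cache.
var ErrMiss = errors.New("cache miss")

// RemoteCache is a hypothetical stand-in for a Redis or Memcached client.
// A real client would hash the key to a Node and make a network request.
type RemoteCache interface {
	Get(key string) (string, error)
	Set(key, value string) error
}

// mapCache fakes the remote cache in memory and counts round trips.
type mapCache struct {
	data       map[string]string
	roundTrips int
}

func (m *mapCache) Get(key string) (string, error) {
	m.roundTrips++ // steps 2-4: hash the key, ask the Node
	v, ok := m.data[key]
	if !ok {
		return "", ErrMiss
	}
	return v, nil
}

func (m *mapCache) Set(key, value string) error {
	m.roundTrips++ // steps 7-8: hash the key, store on the Node
	m.data[key] = value
	return nil
}

// lookup tries the cache, falls back to fetch on a miss, then writes
// the fetched value back to the cache.
func lookup(c RemoteCache, key string, fetch func(string) (string, error)) (string, error) {
	v, err := c.Get(key) // step 1
	if err == nil {
		return v, nil
	}
	if !errors.Is(err, ErrMiss) {
		return "", err
	}
	v, err = fetch(key) // step 5: render or fetch from the database
	if err != nil {
		return "", err
	}
	if err := c.Set(key, v); err != nil { // step 6
		return "", err
	}
	return v, nil
}

func main() {
	cache := &mapCache{data: map[string]string{}}
	fetchFromDB := func(key string) (string, error) { return "value for " + key, nil }

	lookup(cache, "user:1", fetchFromDB) // miss: one Get plus one Set
	lookup(cache, "user:1", fetchFromDB) // hit: one Get
	fmt.Println("round trips:", cache.roundTrips) // prints "round trips: 3"
}
```

Even the cache hit costs a round trip, and nothing here stops many callers from running `fetch` for the same key at the same time.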

Looking at this flow, there are a few implications that stand out:

  1. Every cache request results in a round trip to a Node regardless of cache hit or miss.

  2. You can’t avoid the round trip to a Node by caching the value locally, as the remote Node could invalidate the data at any time without the App’s knowledge.

While neither of these implications is particularly troublesome for the majority of applications, the extra round trips to a Node can be significant for high-performance, low-latency applications. There is, however, one more implication that may not be immediately obvious: the thundering herd!

Thundering herd

The thundering herd problem is sometimes called a cache stampede, dog-piling or the Slashdot effect. Essentially, it is what it sounds like: a stampede of requests that overwhelms the system.

For this discussion, our thundering herd is in response to a cache miss. In normal operation, an application remains responsive under heavy load as long as the data remains cached. When the cache doesn’t have the data, concurrent instances of the application all attempt to render or access the data simultaneously. Depending on your application and the number of concurrent instances/threads in your server farm, this thundering herd of concurrent work could congest the system and possibly cause it to collapse.

To combat this concurrent work, you would need a system to synchronize fetching or rendering of the data. Fortunately, there is a golang library called groupcache which can be used to resolve the thundering herd issue and improve on the remote cache implications mentioned above.


GroupCache differs from Redis and Memcached in that it integrates directly with your code as an In Code Distributed Cache (ICDC). This means that every instance of the App is a Node in the distributed cache. As a full member of the distributed cache, each application instance not only knows how to store data for the node, but also how to fetch or render the data if it’s missing.

To understand why this is superior to Redis or Memcached, let’s run through the distributed caching flow when using groupcache. When reading through the flow, keep in mind that GroupCache is a library used by the application, and that it also listens for incoming requests from other instances of the application that are using GroupCache.

  1. App asks the GroupCache for the data via a key.

  2. GroupCache checks the in-memory hot cache for the data; if the data isn’t there, it continues.

  3. GroupCache performs a consistent hash on the key to determine which GroupCache instance has the data.

  4. GroupCache makes a network request to the GroupCache instance that has the data.

  5. The owning GroupCache instance returns the data if it exists in memory; if not, it asks the App to render or fetch the data.

  6. GroupCache returns data to GroupCache instance that initiated the request.

Step 5 is significant in the context of a thundering herd event as only one of the GroupCache instances will perform the render or fetch of the data requested. All other instances of the application that are also requesting the data from the GroupCache instance will block until the owning instance of the application successfully renders or fetches the data. This creates a natural synchronization point for data access in the distributed system and negates the thundering herd issue.
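
The blocking behavior described here can be illustrated inside a single process. The sketch below is an assumption-laden illustration, not GroupCache’s implementation: `Group`, `Do`, `Waiters`, and `simulateHerd` are hypothetical names, and `Waiters` exists only so the demo can hold the fetch open until every other caller is queued behind it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// call is one in-flight fetch for a key.
type call struct {
	wg      sync.WaitGroup
	val     string
	err     error
	waiters int // callers that joined this call instead of fetching
}

// Group coalesces concurrent fetches for the same key into one.
type Group struct {
	mu    sync.Mutex
	calls map[string]*call
}

// Do runs fn at most once per key at a time; callers that arrive while
// a fetch is in flight block and share its result.
func (g *Group) Do(key string, fn func() (string, error)) (string, error) {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = make(map[string]*call)
	}
	if c, ok := g.calls[key]; ok {
		c.waiters++
		g.mu.Unlock()
		c.wg.Wait() // block until the first caller finishes
		return c.val, c.err
	}
	c := &call{}
	c.wg.Add(1)
	g.calls[key] = c
	g.mu.Unlock()

	c.val, c.err = fn()

	g.mu.Lock()
	delete(g.calls, key)
	g.mu.Unlock()
	c.wg.Done()
	return c.val, c.err
}

// Waiters reports how many callers are blocked on key's in-flight fetch.
func (g *Group) Waiters(key string) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	if c, ok := g.calls[key]; ok {
		return c.waiters
	}
	return 0
}

// simulateHerd sends n concurrent requests for one key and returns how
// many times the slow fetch actually ran.
func simulateHerd(n int) int {
	var g Group
	var mu sync.Mutex
	fetches := 0
	release := make(chan struct{})

	fetch := func() (string, error) {
		mu.Lock()
		fetches++
		mu.Unlock()
		<-release // stand-in for a slow database query or render
		return "rendered page", nil
	}

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.Do("page:home", fetch)
		}()
	}
	// Hold the fetch open until every other request has piled up behind it.
	for g.Waiters("page:home") < n-1 {
		time.Sleep(time.Millisecond)
	}
	close(release)
	wg.Wait()
	return fetches
}

func main() {
	// prints "fetches for 100 concurrent requests: 1"
	fmt.Println("fetches for 100 concurrent requests:", simulateHerd(100))
}
```

GroupCache extends the same idea across the cluster: because every instance routes a key to the same owner, the owner becomes the single place where concurrent requests for that key queue up.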

Step 2 is also significant: the ability to cache the data locally in memory avoids the cost of a network round trip, which provides a huge performance benefit and reduces network pressure. Since GroupCache is a part of the application, we avoid the possibility of GroupCache deleting the data without the application’s knowledge, as any such delete event will be shared by all instances of the application using GroupCache.

An admittedly minor benefit, but one that those of us who enjoy simplicity can appreciate, is that of deployment. Although it is not overly difficult to deploy and secure redis and memcached as separate entities from the application, having a single application to deploy means one less thing for an operator to deal with and keep up to date and secure.

It’s worth mentioning again as it’s easy to overlook. The ability for the cache implementation to render or fetch the data from a database during a cache miss and the ability to rely on a local in memory hot cache is what makes GroupCache a superior choice among distributed caches. No distributed cache external to your application can provide these benefits.

Groupcache as a synchronization tool

Because groupcache provides great synchronization semantics, we have found Groupcache to be a superior alternative to distributed or database level locks when creating and managing unique resources.

As an example, our internal analytics engine reads thousands of events and dynamically adds tags with assigned stats. Since we have many instances of the engine running, every new tag seen must be treated as a possible new tag. Normally this would generate a constant stream of upsert requests to our database. By using groupcache, each instance can query the cache with an account:tag key. If the tag already exists, it’s returned with the latest data on the tag. However, if the tag doesn’t exist, groupcache relays the request to the owning instance, which creates the tag. In this manner, only a single upsert is sent to the database when the system encounters a new tag.

Similarly, we use groupcache to track unique counters, where the system should only record a single instance of a counter. Because we are using Groupcache, we avoid distributed locks and their deadlock issues completely. This is especially useful when using a NoSQL database with little to no locking or synchronization semantics of its own.


Mailgun runs a modified version of Brad Fitzpatrick’s original groupcache library.

Notable changes to the library are:

  • Support for explicit key removal from a group. Remove()

  • Support for expired values. SetBytes(), SetProto() and SetString() now accept an optional time.Time{} which represents a time in the future when the value will expire

  • Support for golang standard context.Context

  • Always populates the hotcache

To use GroupCache, you create a Pool of instances that each GroupCache instance will talk to. Then you create multiple independent cache Groups which use the same Pool of instances.

// Keep track of peers in our cluster and add our instance to the pool `http://localhost:8080`
pool := groupcache.NewHTTPPoolOpts("http://localhost:8080", &groupcache.HTTPPoolOptions{})

// Add more peers
pool.Set("http://peer1:8080", "http://peer2:8080")

// Create a new group cache with a max cache size of 3MB
group := groupcache.NewGroup("users", 3000000, groupcache.GetterFunc(
    func(ctx context.Context, id string, dest groupcache.Sink) error {
        // Returns a protobuf struct `User`
        user, err := fetchUserFromMongo(ctx, id)
        if err != nil {
            return err
        }

        // Set the user in the groupcache to expire after 5 minutes
        if err := dest.SetProto(&user, time.Now().Add(time.Minute*5)); err != nil {
            return err
        }
        return nil
    },
))

var user User

// Fetch the user from the group cache, waiting at most 10 seconds
ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
defer cancel() // release the context on every return path
if err := group.Get(ctx, "key", groupcache.ProtoSink(&user)); err != nil {
    return nil, err
}

HTTP/2 and TLS

GroupCache uses HTTP to communicate between instances in the cluster. If your application also uses HTTP, GroupCache can use the same HTTP port as your application. Simply add the pool as a handler with its own path.

// Our application
http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
    fmt.Fprint(w, "Hi there")
})
// Handle GroupCache requests
http.Handle("/_groupcache/", pool)
log.Fatal(http.ListenAndServe(":8080", nil))

If your application has TLS configured, GroupCache benefits from the same TLS config your application uses. You also have the option of enabling HTTP/2, which further improves the performance of GroupCache requests. You can also use HTTP/2 without TLS via H2C.


At Mailgun, our first production use of GroupCache has been in our ratelimits service. Since this service must operate at very high performance and low latency, traditional caching systems were a concern: the more round trips we introduce to the request pipeline, the more opportunities we have to introduce additional latency. The following graphs show the total number of cache hits and, of those hits, the number that resulted in a round trip to another GroupCache instance. This demonstrates exactly how much we benefit from the local in-memory hot cache as opposed to making a round-trip call on every request.

You can see in the next graph exactly how much MongoDB and our application benefit from the thundering herd avoidance as new keys are retrieved from the system.

The actual calls to MongoDB are a small fraction of the total requests actually made to the service. This, in addition to the speed of gubernator, is what allows our ratelimit service to perform at low latency even during high load.

Here you can see the rate limits service response times. Keep in mind that for each request we make a gubernator call and a groupcache request, which may or may not result in a groupcache HTTP request or MongoDB request. (This graph shows the slowest of the responses, not the average.)

Key removal

A notable change to the Mailgun version of GroupCache is the ability to explicitly delete keys from the cache. When an instance wants to remove a key, it first deletes the key from the owning instance and then sends the delete requests to all other instances in the pool. This ensures any future requests from non-owner instances in the pool to the owner will result in a new fetch or render of the data (or an error if the data is no longer available).

As with any distributed system, there is the possibility that an instance is not available or connectivity has been lost when the remove request was made. In this scenario group.Remove() will return an error indicating that some instances were not contacted and the nature of the error. Depending on the use case, the user then has the option of retrying the group.Remove() call or ignoring the error. For some systems, ignoring the error might be acceptable especially if you are using the expired values feature.

The expired values feature allows you to provide an optional time.Time which specifies a future time when the data should expire.

// Data expires in 2 minutes
if err := dest.SetProto(data, time.Now().Add(time.Minute*2)); err != nil {
    return err
}

In the scenario above where we temporarily lose connectivity to an instance, we say that the pool of instances is in an inconsistent state. However, when used in conjunction with the data expiration feature we know the system will eventually become consistent again when the data on the disconnected instances expires. There are other much more complex solutions to solving this problem but we have found in practice eventually consistent solutions are the simplest and least error prone way to deal with network disruptions.

Instance discovery

A quick note about discovery: while GroupCache doesn’t provide a method of discovering instances, there are several widely available systems that make discovery simple. I’ll list a few below with some examples to get you started.


GroupCache has been a fantastic addition to our distributed services toolset at Mailgun. It makes distributed caching and synchronization simple and easy to deploy. My hope is that others will discover the same benefits that we are enjoying and inspire other ICDC implementations in other languages. 
