Redis Tips: Persistence-Only Slaves and the Slowlog

At Paperless Post we rely on a lot of different Open Source software, from the Nginx web server that serves our traffic to the PostgreSQL RDBMS that we use to keep our data safe and secure. One of the most interesting is Redis: the advanced key/value store that has gone from an obscure project by an old-school Linux hacker to a gigantic Open Source project backed by VMware and used in production environments all over the world.

Redis is so interesting because it is very compact and flexible. By reading the source code and keeping in touch with the community around the software, we’ve picked up some helpful tricks, and I thought the Paperless Post Dev Blog would be a great place to share some. Today I’ll be talking about using Redis in a persistence-only slave setup, and a relatively new feature for tracking slow queries called the slowlog.

Persistence-Only Slaves

Paperless Post is a seasonal business – much of our traffic arrives around the winter holidays, and we spend a good part of the year preparing for the spikes that come along with the cold weather, fattening food, and adorable reindeer-based greeting cards. Redis is a key part of our infrastructure and often sees the effects of heightened traffic before other pieces do, so when traffic came a little earlier than we expected this year, we were forced to do some live tuning of our production instances to handle the increased number of operations our servers were processing.

For some background, we use Redis in a variety of ways at Paperless Post:

Resque and Other Queues

The queueing and background-processing library started by Chris Wanstrath at GitHub has become a mainstay of larger Rails applications because its simple abstractions make it so easy to defer a piece of work to be done later. Resque has played a large role in getting Redis into production in the Ruby and Rails world, easily demonstrating the efficient algorithms and the general reliability and stability of Redis as a server for this kind of “ephemeral” data. Almost all of the major components of our user experience run through Resque at some point or another, and we have lots and lots of background worker instances processing these jobs. The “payload” data from the application gets passed through Redis, as does all of the metadata about the process.
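
If you haven’t used Resque before, the basic shape of a job is roughly the following (the job class and its arguments here are made up for illustration, not code from our app):

require 'resque'

# A hypothetical job class: Resque expects a class with a queue name and a
# class-level perform method.
class RenderImageJob
  @queue = :images

  def self.perform(card_id)
    # do the actual rendering work here
  end
end

# Enqueuing serializes the class name and arguments and pushes them onto a
# Redis list named after the queue; workers pop and run them later.
Resque.enqueue(RenderImageJob, 42)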

Once we saw how effective Redis’s built-in blocking-pop commands such as BLPOP were for building small, reliable queues, we employed a similar technique in some other small applications that we use for queueing and sending emails, cards, and more. The idea behind BLPOP is that the client blocks while it watches a given key or group of keys, and using the Redis Ruby client, we are able to execute a block of code as soon as an item arrives. That way, several clients can watch a single queue and reliably pop jobs off for execution.
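
A minimal sketch of that consumer loop with the Ruby client might look something like this (the queue name and handler are hypothetical; the producer side simply pushes jobs onto the same list with RPUSH):

require 'redis'

redis = Redis.new(host: "127.0.0.1", port: 6379)

loop do
  # BLPOP blocks until an item appears on the list or the timeout expires,
  # and returns a [key, value] pair (nil if the timeout was hit).
  _queue, payload = redis.blpop("emails:outbound", timeout: 30)
  next if payload.nil?

  deliver_email(payload) # hypothetical handler for the popped job
end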

Dashboards and Admin Tools

Because it is so fast to insert into and query most of Redis’s built-in data types, we saw it as an ideal place to store a wide range of trending data that we collect throughout our application. Our admin dashboards keep us up to date on trends in various site and business metrics throughout the day, and Redis makes this a very simple and efficient thing to do. Through the magic of Ruby, we have exposed a simple interface for adding information to this dashboard:

Paperless::Dashboard.increment_for_now(:rendered_images)

This takes the recorded stat and handles the bucketing so that we can display views for the last minute, the last 60 minutes, the last 24 hours, and so on. We then use these stats to produce pretty graphs that let us know what’s going on inside our system. (Note: we’ve since ‘graduated’ beyond these simple trending graphs to using more robust tools for information gathering and display - read all about it here).
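
Under the hood, a helper like this doesn’t need much more than an INCR against a time-bucketed key. Here is a rough sketch of the idea rather than our actual implementation; the key scheme and retention window are made up for illustration:

require 'redis'

redis = Redis.new

# One counter key per stat per minute; expiring the buckets keeps the keyspace small.
def increment_for_now(redis, stat, at = Time.now)
  key = "dashboard:#{stat}:#{at.strftime('%Y%m%d%H%M')}"
  redis.incr(key)
  redis.expire(key, 25 * 60 * 60) # keep a little more than 24 hours of buckets
end

# Summing the most recent n minute-buckets gives a trailing window for the graphs.
def count_for_last_minutes(redis, stat, minutes)
  keys = (0...minutes).map do |i|
    "dashboard:#{stat}:#{(Time.now - i * 60).strftime('%Y%m%d%H%M')}"
  end
  redis.mget(*keys).map(&:to_i).reduce(0, :+)
end

increment_for_now(redis, :rendered_images)
count_for_last_minutes(redis, :rendered_images, 60) # views in the last hour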

And A Lot of Other Places

Redis provides a fast, easily accessible way to deal with information that you want to persist temporarily without the overhead of dealing with an RDBMS. I’ve found that beyond performing well, the kinds of data structures and operations that Redis provides allow you to think about traditional problems in new and interesting ways. If Redis is new to you or if you only use it with Resque, take a look at using it more deeply. It might fit as nicely in your stack as it has in ours.

Back to the Slave

Now that I’ve given you an idea of how extensively we use Redis, you can probably guess the problem we ran into – we over-relied a bit on Redis as a catch-all for our data, and didn’t think carefully enough up front about the costs we could incur if the server got overloaded. Because we use the snapshotting functionality provided by Redis for our on-disk persistence, as we relied on Redis for more and more data, more keys were changing and the snapshotting became more and more operationally expensive. Eventually, we started to notice mounting EAGAIN and TIMEOUT errors from the Ruby Redis clients. Our master-of-correlation and Ops person extraordinaire Johnny Tan noticed that this happened approximately 15 seconds before the BGSAVE snapshots terminated, and I quickly dug into the problem. It turns out that the combination of increased load from higher traffic, more frequent BGSAVE calls, and the I/O operations that come along with them was blocking client reconnections.

After doing some research and having a chat with Pieter Noordhuis, one of the Redis maintainers, we decided to try something a bit radical, which ended up solving our problems and buying us enough time to re-think how we structure our Redis data. Since we already had Redis configured in a master-slave setup and were not reading from or writing to the slave at all, we decided to remove the job of persistence from the master and give it to the slave instead. This way, the master was free to accept the millions of reads and writes that we throw at it every hour, and we had a reasonable assurance that our data was safe via the syncing, streaming, and dumping that was now the slave’s responsibility. As with most other things in Redis land, this is a very straightforward thing to accomplish:

redis 127.0.0.1:6379> config set save ""
OK

Assuming you have snapshotting configured to run on your slave instances, and you’re happy with how frequently those snapshots happen, the above configuration change is all you have to do to get the master to stop persisting. We immediately saw a gain in throughput on the master, and we were willing to accept the slightly less durable means of persistence, which would only hurt us if the master crashed. In that case we would lose the delta between what had last been saved by the slave and the new data that had since been written to the master, a sacrifice we were willing to make. Making these kinds of operational decisions is all about accepting the tradeoffs.
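
If you’d rather drive the change from Ruby than from redis-cli, the same operation (with hypothetical hostnames) is just as short:

require 'redis'

master = Redis.new(host: "redis-master.internal", port: 6379) # hypothetical hosts
slave  = Redis.new(host: "redis-slave.internal",  port: 6379)

# Disable snapshotting on the master at runtime; equivalent to `config set save ""`.
master.config(:set, "save", "")

# Sanity checks: the master should now report no save rules, while the slave
# keeps whatever schedule is in its redis.conf.
master.config(:get, "save") # => {"save"=>""}
slave.config(:get, "save")  # => e.g. {"save"=>"900 1 300 10 60 10000"}

# LASTSAVE reports the UNIX timestamp of the last successful snapshot, which is
# a handy thing to watch on the slave once it owns persistence.
slave.lastsave

Keep in mind that a runtime CONFIG SET is not written back to redis.conf, so you’ll also want to remove or comment out the save lines in the master’s config file for the change to survive a restart.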

For the record, Pieter assured me that this was a relatively common design pattern for sites like ours, and that there is nothing inherently unsafe about running Redis in this kind of persistence-only slave configuration.

The Slowlog

During the process of attempting to uncover the source of inconsistent connection and timeout errors, I realized that a feature had snuck into the stable Redis distributions without much fanfare – the SLOWLOG. The slowlog is similar to the slow query mechanisms available in traditional relational databases: it lets you set a time threshold beyond which queries get logged, and a cap on how many of these queries the server should keep around. It requires two entries in redis.conf:

slowlog-log-slower-than 10000
slowlog-max-len 1024

and interacting with it in redis-cli looks like this:

redis 127.0.0.1:6379> slowlog len
(integer) 2
redis 127.0.0.1:6379> slowlog get 1
1) 1) (integer) 14
   2) (integer) 1309448221
   3) (integer) 15
   4) 1) "ping"

Here I am checking the length of the slowlog list, and then retrieving the most recent entry. Each entry contains:

  • A unique ID provided by the server
  • The unix timestamp at which the logged command was processed
  • The amount of time needed for its execution, in microseconds
  • An array of the arguments that made up the command

The Slowlog data is kept in memory by the Redis server, so it will not persist at all, and accessing the data should not interrupt your server in any way. See the full documentation here.

I thought that this feature was cool and noticed it wasn’t supported by the Ruby client because of a limitation in the hiredis-rb C extension library, so in the spirit of Open Source, I added it, and the command should now be available in recent releases of the official Ruby Redis client. I suggest you check it out; you might be surprised at what you find. While slow queries didn’t end up being the cause of the problem I was chasing during the high traffic episode I outlined above, the slowlog did give me a very clear glimpse into how disorganized some of our Redis data had become. For example, I saw a pattern of massive MGET calls, which led us down the road toward redesigning how we store some of our trending data.
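
With that support in place, pulling the log from Ruby is straightforward. A small sketch of what that can look like (the entries come back as the raw nested arrays described above):

require 'redis'

redis = Redis.new

# Fetch up to the ten most recent slow entries; each one is an array of
# [id, unix timestamp, duration in microseconds, command arguments].
redis.slowlog("get", 10).each do |id, timestamp, micros, command, *|
  puts "##{id} #{Time.at(timestamp)} #{micros}us: #{command.join(' ')}"
end

# SLOWLOG LEN and SLOWLOG RESET are available through the same method.
redis.slowlog("len")
redis.slowlog("reset")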

Conclusion

Redis is a relatively new tool that is deployed in a variety of production environments and used in a multitude of ways, but the tooling around it is still maturing. Knowing how flexible and live-configurable it can be can save your hide if you get into trouble like we did, and so can being aware of the built-in tools for debugging production servers and hunting down data and architectural bottlenecks.

If you’re interested in working on cool projects like this, or with our awesome team, we’re hiring in NYC and SF.
