Paperless Post does business year round, but we are a company whose traffic and usage patterns are driven by holidays and seasons. This past week included Valentine’s Day, which historically is one of our busiest days of the year. What makes Valentine’s Day interesting for us, compared to other holidays, is that its traffic pattern is unlike any other. Our other busiest days are the ones leading up to Christmas, and those are characterized by very high overall traffic and a large number of cards sent to an average number of people (~60 guests per event). Valentine’s, however, is a single day with a very high number of senders and only a slightly higher number of receivers – Valentine’s cards are predominantly 1-to-1. So while most of our infrastructure is dedicated to serving receivers during the rest of the year (they make up > 99% of our daily traffic), Valentine’s puts a huge amount of pressure on our sending infrastructure, specifically on the creation and rendering of Paperless Post cards.
This Valentine’s Day, we saw unpredicted growth in sender traffic and that, combined with some issues I’ll describe below, led to a cascading series of problems that affected many of our users on a very important day. For most of the day, users were unable to get through the full process of creating and sending a card and, as a company (and a Dev, Ops, and Support Team), we couldn’t be more sorry for how this affected our users.
The Problem
By 7:25am EST, when our team began to come online, we immediately noticed that cards were taking much longer than average to render, preventing users from sending. Users see this as the “eternal spinny,” as they often call it, on the page you land on after you’ve finished designing your card. We poll the server to see if the images for your card have completed rendering. If they have, we show you a sample of the e-mail that will be sent (usually an envelope with the recipient’s name) with a link to a preview of the card the way a recipient will see it. If the images are not ready, you see a loading GIF that refreshes once the card is ready. On a normal day, most users never see the “spinny” at all, since card rendering takes less than 4 seconds on average. That morning, however, cards seemed never to appear. What users perceived as an “eternal spinny” was actually a failure at a number of levels, one that continued through the late afternoon and resulted in most cards not going out until much later in the day.
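For the curious, the check behind that poll is conceptually simple. Here is a minimal sketch of it in Ruby; the model names and fields are illustrative, not our actual code:

```ruby
# Minimal sketch of the render-status check behind the client's polling
# (illustrative names and shapes, not our production models).
Card      = Struct.new(:id, :media_files)
MediaFile = Struct.new(:kind, :rendered)

def render_status(card)
  if card.media_files.any? && card.media_files.all?(&:rendered)
    { status: 'ready', preview_url: "/cards/#{card.id}/preview" }
  else
    { status: 'pending' } # the client keeps showing the loading GIF and polls again
  end
end

card = Card.new(42, [MediaFile.new('front', true), MediaFile.new('envelope', false)])
puts render_status(card) # => {:status=>"pending"}
```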
A little background on “Rendering”
When a user saves a card in our design tool, it goes through a multi-step process before the recipient receives it in their inbox and views it. The basic workflow is as follows (a rough sketch of the Resque side of this pipeline follows the list):
- User saves card in design tool
- JSON is sent to the server and is serialized into different PostgreSQL tables and relationships
- A JSON representation is stored in Redis and enqueued in Redis for “rendering”
- A server-side “Renderer” process picks jobs off the queue and turns the JSON into a visual representation of the card (an image)
- The image is put back into a Resque/Redis queue as a Base64-encoded PNG
- Resque picks up the image, writes it to shared NFS storage, and writes a pointer to it (called a MediaFile record) to Postgres – this enqueues another job called “MediaFileConversion”
- MediaFileConversion takes the rendered PNG and, using ImageMagick, turns the image into a series of differently-sized and -shaped thumbnails and JPGs used across the site and on mobile web
- Eventually (about 1 hour later) a “DiskCleaner” process comes around and uploads the on-disk files to Amazon S3, marks them as uploaded in Postgres, and then clears the local files from NFS
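To make the hand-offs between those steps concrete, here is a heavily simplified sketch of the Resque side of the pipeline. The class names, paths, and thumbnail sizes are illustrative, not our production code:

```ruby
require 'base64'
require 'resque'

# Picks the Base64-encoded PNG off the queue, writes it to shared NFS storage,
# and enqueues the conversion step. (In production, a MediaFile row pointing at
# the file is also written to Postgres here.)
class MediaFileSave
  @queue = :media_file_save

  def self.perform(card_id, encoded_png)
    path = "/mnt/nfs/renders/#{card_id}.png" # illustrative NFS path
    File.open(path, 'wb') { |f| f.write(Base64.decode64(encoded_png)) }
    Resque.enqueue(MediaFileConversion, card_id, path)
  end
end

# Uses ImageMagick to turn the rendered PNG into the differently-sized
# thumbnails and JPGs used across the site and on mobile web.
class MediaFileConversion
  @queue = :media_file_conversion

  SIZES = %w[92x128 184x256 368x512] # illustrative sizes

  def self.perform(card_id, png_path)
    SIZES.each do |size|
      system('convert', png_path, '-resize', size, png_path.sub('.png', "_#{size}.jpg"))
    end
  end
end
```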
We arrived at this multi-step process over the years and, though it has some faults and slow areas, under normal usage it is more than fast enough. By giving different parts of the workflow different jobs and responsibilities, we’ve been able to scale each as needed.
The Initial Failure
When we came online on 2/14, the Renderer queue was already backed up with over 1,000 jobs. Worse, it was growing much faster than it was draining. Initially, we worked on bringing up more Renderer processes/boxes (we run a number of these within our datacenter and a number in EC2), but even this was not helping. Something was wrong with the Renderer processes themselves – even though they were running, they appeared “red” in our dashboard. On examination, the processes were indeed still running but, at some point, had stopped picking up jobs. A restart turned them all green again, but one by one they quickly turned back to red and the queue continued to grow. This had almost certainly been happening throughout the week and the day before, but we never noticed because 1) our daily number of renders is comparatively small and 2) we alert when the queue grows large and when the processes die, but not when they “disconnect” in this way.
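In hindsight, the missing alert is easy to describe: we needed to watch not just queue depth and process liveness, but how long each Renderer had gone without completing a job. A rough sketch of that kind of check, with made-up key names and thresholds:

```ruby
require 'redis'

# Each Renderer touches a heartbeat key after every job it completes...
def record_heartbeat(redis, renderer_id)
  redis.set("renderer:#{renderer_id}:last_job_at", Time.now.to_i)
end

# ...and a monitor flags Renderers whose process is up but whose last completed
# job is suspiciously old, i.e. the "running but red" state we failed to alert on.
def stalled_renderers(redis, renderer_ids, max_idle_seconds = 120)
  renderer_ids.select do |id|
    last_job_at = redis.get("renderer:#{id}:last_job_at").to_i
    Time.now.to_i - last_job_at > max_idle_seconds
  end
end

redis = Redis.new
record_heartbeat(redis, 'render01')
puts stalled_renderers(redis, %w[render01 render02]).inspect # => ["render02"]
```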
We began messaging our users through a site-wide alert that creating cards was taking longer than normal. Part of the team split off and started working feverishly to fix the Renderer code that seemed to be causing the disconnections. Unfortunately, before we were able to deploy any changes, and because of the way we had recently started storing the JSON for the cards in Redis, the queue grew so large that Redis itself ran out of memory. At 9:48am, it was OOM-killed by the Linux out-of-memory killer, bringing the entire site down for a few minutes. Because of an issue with our Redis saving/backup strategy, when the process auto-restarted it came up with an older backup, losing our existing queue and a bunch of other realtime data. We deployed a change to our queueing strategy to take the JSON data out of Redis and have the Renderers fetch it through our HTTP API instead, but before the deploy finished, the queue filled up again and Redis was OOM-killed a second time. This time, before we allowed it to restart, we doubled its memory.
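The queueing change boils down to enqueueing a reference instead of the payload, so a long backlog no longer translates into Redis memory pressure. Roughly, with illustrative key names and an illustrative API URL:

```ruby
require 'json'
require 'net/http'
require 'redis'

redis   = Redis.new
card_id = 1234

# Before: the full card JSON was pushed onto the queue, so a large backlog
# could exhaust Redis's memory.
# redis.rpush('render_queue', full_card_json)

# After: only the card id is enqueued...
redis.rpush('render_queue', card_id)

# ...and the Renderer fetches the JSON through the HTTP API when it picks the
# job up, keeping queue entries tiny no matter how long the queue gets.
if (id = redis.lpop('render_queue'))
  body = Net::HTTP.get(URI("https://api.example.com/internal/cards/#{id}.json")) # illustrative URL
  card = JSON.parse(body)
end
```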
At this point, the Renderers were still not staying connected, but we continued to bring up more Resque workers, going from our day-to-day scale of 28 processes to 178. We deployed a number of fixes to the Renderer that put layers of rescuing in place so that, no matter what part of the process failed, it would always continue to request the next job. When the last of these was deployed at 3:26pm, the Renderers finally started to pull jobs off the queue quickly and without disconnecting.
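The shape of those fixes is worth showing. Below is a simplified, illustrative version of the Renderer’s main loop after the changes, where `queue`, `render`, and `logger` stand in for the real collaborators:

```ruby
# Simplified shape of the Renderer loop after the fixes: whatever fails while
# fetching, rendering, or reporting a single job, the loop always comes back
# around and asks for the next one.
def render_loop(queue, logger)
  loop do
    begin
      job = queue.pop_job # may raise on Redis/network hiccups
      if job.nil?
        sleep 1
        next
      end
      image = render(job) # may raise on malformed card data
      queue.push_result(job.id, image)
    rescue StandardError => e
      logger.error("render failed: #{e.class}: #{e.message}")
      sleep 1 # back off briefly, then keep going
    end
  end
end
```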
Up until then, we had also been deploying a number of changes to our main app to reduce the amount of work the Renderers and other workers were doing. We stopped creating thumbnails that are only used by certain applications and turned down the amount of polling that clients were doing to check whether their cards had rendered. We also noticed that API response times were high and increased our Postgres max_connections to account for the higher number of worker connections.
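The max_connections bump was back-of-the-envelope arithmetic: every additional worker holds at least one Postgres connection, and a roughly 6x jump in worker count blows well past the default limit. Illustrative numbers only:

```ruby
# Illustrative numbers, not our exact figures: every Resque worker holds at
# least one Postgres connection, so scaling from 28 to 178 workers means
# max_connections has to grow with it.
web_and_api_connections = 100 # assumed pooled connections from app servers
resque_workers          = 178 # up from our day-to-day 28
maintenance_headroom    = 20  # superuser, cron, and console connections

puts web_and_api_connections + resque_workers + maintenance_headroom
# => 298, well over Postgres's default max_connections of 100
```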
Second Wave
Now that the Renderers were staying alive and quickly pushing through the backlog, the next process to get bogged down was the Resque job/queue responsible for saving the images to disk. We brought up more and more workers to try to deal with the backlog, but saw diminishing returns because they were all operating on shared storage. The process of bringing up and reassigning new workers, though it involved very little hands-on work from the Ops team (thanks to Chef), unexpectedly consumed a lot of time, and we had to pull resources from other places (our staging clusters). Also, changing the queue distribution required a HUP/restart of the workers, which meant the queues weren’t draining as fast as we hoped.
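Part of that slowness is structural: a Resque worker learns its queue list once, at boot, so redistributing queues means restarting workers. A stripped-down version of how a worker process comes up (queue names are illustrative):

```ruby
require 'resque'

# A Resque worker is handed its queue list when the process starts, typically
# via a QUEUES environment variable, which is why redistributing queues meant
# HUPing/restarting workers rather than flipping a switch.
queues = (ENV['QUEUES'] || 'media_file_save,media_file_conversion').split(',')

worker = Resque::Worker.new(*queues) # queue list is fixed for the life of the process
worker.work(5)                       # poll for jobs every 5 seconds
```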
We continued to make changes to get cards to show up faster (changing which images/steps were required for viewing/sending/previewing) – however, the queue continued to grow. Because, again, we were storing large payloads as part of the queue itself (this time the rendered images), Redis kept creeping toward capacity. To stem the tide, we stopped the Renderers themselves so that the queue could not grow past Redis’s capacity.
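Stopping the Renderers was, in effect, manual backpressure. The automated version we would rather have looks roughly like this (the memory budget and the check itself are illustrative):

```ruby
require 'redis'

MEMORY_BUDGET_BYTES = 6 * 1024 * 1024 * 1024 # assumed 6 GB budget for Redis

# Crude backpressure check: anything pushing large payloads pauses when Redis
# memory use crosses a safety line, instead of growing the queue until the
# OOM killer decides for us.
def redis_under_pressure?(redis)
  used_bytes = redis.info('memory')['used_memory'].to_i
  used_bytes > MEMORY_BUDGET_BYTES * 0.8
end

redis = Redis.new
sleep 5 while redis_under_pressure?(redis) # wait for headroom before enqueueing more work
```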
Third Wave
By 7pm, we were back to showing cards within a couple of minutes, but the third part of the process, the resizing/thumbnailing queue, had over 30K jobs in it. This meant that the number of files on disk was growing, because we had diverted resources from our uploading/cleanup queue to the resizing queue. We started to receive alerts for our shared storage as it approached 100% capacity.
We worked quickly to modify our disk cleaner script so it could work in parallel across multiple boxes. Unfortunately, at 7:50pm, we ran out of disk space on our shared storage, and the jobs to save/render images began to fail. We had to temporarily stop all workers. Once we got the parallelized disk cleaner running, we were able to reclaim enough space to start the workers again. The key was that the (50) disk cleaners were reclaiming space faster than the media jobs were filling it.
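The parallelization itself is simple: each cleaner takes a deterministic slice of the files, so 50 of them can sweep the same NFS mount without fighting over the same paths. A sketch, with the upload step left as a placeholder:

```ruby
require 'zlib'

# Each of the N cleaner processes owns the files whose hashed path falls into
# its slot, so many cleaners can run against the same shared mount without
# duplicating (or skipping) work. Paths are illustrative.
def files_for_cleaner(glob, cleaner_index, total_cleaners)
  Dir.glob(glob).select { |path| Zlib.crc32(path) % total_cleaners == cleaner_index }
end

files_for_cleaner('/mnt/nfs/renders/*.png', 3, 50).each do |path|
  # upload_to_s3(path) # placeholder: upload, mark as uploaded in Postgres, then...
  # File.delete(path)  # ...reclaim the local NFS space
end
```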
Finally
By 10:00pm, we were back to “normal” and started to re-enqueue the cards/emails/etc. that had not yet been processed or had been lost when Redis went down. This took a little over 2 hours and, at 11:30pm, we were able to send out an apology email to our affected users. After a very rough day (to say the least), we were gratified by the general response from our users:
“That is the nicest apology I’ve ever received. Happy Valentine’s Day!”
“Thank you so much for your kind response. I am very impressed with your customer service. You offer the best in email stationary and I will continue to be a big fan and customer!!”
“Thank you. Admirable response. I already knew you provided a good, user friendly service. And now it’s clear you follow good business practices. How refreshing!!”
“Thank you for the coins. I am a loyal customer and I appreciate the indication that you prize excellent customer experience.”
Root Cause
We like to think of ourselves as an ego-less team, and this day really proved it. Wherever you want to point the blame based on the details above, every member of the team clearly felt responsible. We worked together, constantly communicating and trying to keep emotion out of it. Our Support team took the brunt of the user complaints and issues, but at no point did that reflect on how the Dev or Ops teams were treated. Similarly, the Dev/Ops teams were under a huge amount of pressure to “fix it” but this pressure was self-inflicted and we did our best to communicate the current status throughout the day.
It’s hard to point at a real root cause with something like this. We clearly had a number of early failures that took the rest of the day to recover from. The biggest failure, though, was really in planning. We knew this day was coming, but were so focused on shipping all the things that we didn’t step back to adequately test or prepare. Lesson learned.
Post-Mortems
We had two separate post-mortems the day after, one going over the accounts of the day in detail with the Ops and Dev teams, and the other with the management team going over the events at a high level. Both were positive and left us with a large number of next-steps and ways to improve every part of our process.
Next Steps
We have a lot to do to ensure something like this never happens again. We’ve organized the main work under three categories:
- Load and configuration testing
  - We need automated load tests; right now, the only place we really test production-level load is in production.
  - We also had issues with broken Chef recipes that could have been caught earlier by Chef integration testing.
- Faster deploys/configuration changes
  - Bootstrapping should take only a couple of minutes, and nodes should “come up swinging” without needing an additional deploy.
  - Current deploys are O(N), meaning that the more nodes we add, the slower deploys get.
  - We should be able to reassign/reorganize queues without redeploying Chef or restarting the workers (a rough sketch of one approach follows this list).
- Optimizing the Renderer pipeline
  - Getting off of shared storage.
  - Moving other classes of workers (besides Renderers) to EC2.
  - Giving the Renderer more/less responsibility in a live, configurable way.
  - Moving to a more durable queueing system.
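As an example of the queue-reassignment item above, here is one shape it could take; this is entirely a sketch under assumed key names, not something we run today. The worker re-reads its queue list from Redis on every pass, so capacity can be repointed with a single redis-cli command instead of a Chef run and a worker restart:

```ruby
require 'redis'
require 'resque'

redis    = Redis.new
hostname = `hostname`.strip

loop do
  # Re-read this box's queue assignment on every pass (key name is assumed).
  queues = redis.lrange("worker_queues:#{hostname}", 0, -1)
  queues = ['media_file_save'] if queues.empty? # fallback assignment

  job = nil
  queues.find { |q| job = Resque.reserve(q) } # take one job from the first non-empty queue
  job ? job.perform : sleep(2)
end
```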
On days like this, we’re thankful not only for our team, but for our users. Scaling problems like these are generally good problems to have, and it’s up to the entire company to improve and rise to the occasion. We knew about some of the problematic areas in our architecture before Valentine’s, but a crush like this exposed them (as well as new issues) much more clearly. It’s our job to catch and fix these before our users do, and we continue to take steps toward that goal.