
VDAY14: A Different Story

Every year, February rolls around and things get a little crazy at PP HQ. Though the holiday season brings more sustained traffic and usage, Valentine’s Day is our single most popular day for sending and creating cards. Last year, things did not go so well. The day itself was a hard one; it shook our entire team and woke us up to the kind of scale we were dealing with. We immediately turned around, held very productive postmortems, and planned out a long list of things we could do to make the next year as different as possible. We didn’t get every single item fully completed, but thanks to a lot of changes, VDAY14 was a very different day in a very positive way: we survived with zero application downtime and only small-scale interruptions to service.

What was different

A huge number of large and small changes went into making this year’s Valentine’s Day a success, but a few major overarching ones deserve a special shout-out for having the biggest impact.

EC2

After moving to our new datacenter in August ‘12, we were able to set up a ‘low latency’ Direct Connect to EC2. This opened up the possibility of keeping our core services in our datacenter while bringing up the services that scale horizontally (our workers and renderers) as needed in EC2. Last VDAY this story wasn’t really complete, but by Holidays ‘13 we were using a simple script to bring up an EC2 worker cluster every morning and shut it down every night. This gave us a huge amount of extra capacity for the background work that makes up much of sending a card.
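
For flavor, here’s a minimal sketch of what the morning “scale up” half of such a script can look like, assuming the aws-sdk (v1) Ruby gem; the AMI, instance type, and count are placeholders rather than our real values:

```ruby
# Sketch of the morning "scale up" half of the script.
# Assumes the aws-sdk (v1) gem; the AMI, instance type, and count
# below are placeholders, not our real values.
require 'aws-sdk'

ec2 = AWS::EC2.new(region: 'us-east-1')

# Launch a batch of pre-baked worker instances.
workers = ec2.instances.create(
  image_id:        'ami-xxxxxxxx',  # worker AMI (placeholder)
  instance_type:   'c3.xlarge',
  count:           20,
  key_name:        'workers',
  security_groups: ['workers']
)

# Wait for them to leave the pending state before Chef takes over.
sleep 5 while workers.any? { |i| i.status == :pending }
puts "#{workers.count} worker instances up"
```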

Coming Up Swinging

One of our major pain points from VDAY13 was that even though we were able to bring up new worker nodes quickly with Chef (in vSphere and EC2), these boxes “came up” without the latest application code. Getting code onto them required a deploy to the full cluster, which in turn deepened our queues because it forced app restarts. Our solution (which we’ll cover in more depth in a later post) changed our deploys into a two-step process: build and standup. build happens in a central location, followed by standup, which happens on each individual node. By splitting them out, the standup script can be run as part of the Chef converge when a node is initially provisioned. This means new nodes can start doing work immediately, without a deploy, and it made bringing up new worker nodes throughout the holidays as easy as running a single shell command.
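
As a rough illustration (the attribute, paths, and script name here are hypothetical, not our actual cookbook), the node side of this boils down to something like:

```ruby
# Sketch of the worker cookbook's standup step. The attribute, paths,
# and script name are illustrative, not our actual cookbook.
remote_file '/opt/pp/builds/current.tar.gz' do
  source node['pp']['build_artifact_url'] # the artifact produced by the central `build` step
end

execute 'standup' do
  command '/opt/pp/bin/standup /opt/pp/builds/current.tar.gz'
  # Runs during the node's first converge, so a freshly provisioned
  # worker comes up with the latest application code and can start
  # pulling jobs without waiting on a cluster-wide deploy.
end
```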

iOS

A big difference in our usage from last year is that in October ‘13 we launched a whole new create tool on iPad (and in Jan ‘14 on iPhone). Though the creation process happens in-app, cards are still rendered and delivered through our API and backend infrastructure. This had an interesting effect on traffic: it reduced the number of requests for certain resources while greatly increasing the load on certain API endpoints. Overall it meant more concurrent senders, because it enabled a whole new class of sender.

Scheduled Sending

In addition to the changes on iOS, this year we released a frequently requested feature: Scheduled Sending. This allows lovebirds (and other hosts) to create their cards ahead of time and schedule them to send at a specific date and time (e.g., on Valentine’s Day). This is a boon for us: it not only lets us better predict how many cards we’ll be sending, but also spreads out the resource-intensive act of rendering the card, which happens when the card is created and scheduled instead of when it’s sent. Leading up to VDAY we kept a very close eye on the number of scheduled cards, and our marketing team worked hard to make users aware of the new capability.
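
Under the hood this is a “render now, deliver later” split. A simplified sketch, assuming a Resque-style queue with resque-scheduler (the class names and the scheduled_for attribute are made up for illustration, not necessarily our exact setup):

```ruby
# Sketch only: assumes a Resque-style queue with the resque-scheduler gem.
# CardRenderer, CardDeliveryJob, and `scheduled_for` are illustrative names.
require 'resque'
require 'resque-scheduler' # provides Resque.enqueue_at

def schedule_card(card)
  # The expensive part (rendering) happens now, at creation time...
  CardRenderer.render(card)

  # ...while delivery is queued for the date and time the host picked.
  Resque.enqueue_at(card.scheduled_for, CardDeliveryJob, card.id)
end
```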

Agency

One of our major failures last year was in our image processing pipeline: the collapse of our NFS server (a single point of failure), which holds our images temporarily before shipping them off to S3. At this point last year, we already had a plan for a system (codenamed agency) that would replace this core part of our image pipeline.

Its design and implementation also deserve a full blog post, but the general idea was a distributed, concurrent service that could handle the storage and transformation of images, backed by a distributed data store (Riak) instead of flat files on NFS. v0.1.0 shipped internally during the holidays, and we were able to slowly roll it out to more users and eventually enable it for 100% of card images in the weeks leading up to VDAY. In this model, a small cluster of three agency servers with a 6-node Riak cluster was able to handle the work that previously took 30 or more ‘worker’ boxes.
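
To give a flavor of the storage side, here’s a sketch against the Ruby riak-client gem; the bucket, key, and host names are illustrative, and agency’s real interface is more involved:

```ruby
# Sketch of the storage idea using the riak-client gem; bucket, key, and
# host names are illustrative, and agency's real interface is more involved.
require 'riak'

client = Riak::Client.new(nodes: [
  { host: 'riak1.internal' },
  { host: 'riak2.internal' },
  { host: 'riak3.internal' }
])

bucket = client.bucket('card-images')

# Store a rendered image in Riak rather than on NFS.
obj = bucket.new('card-123/preview.png')
obj.content_type = 'image/png'
obj.raw_data = File.binread('preview.png')
obj.store

# Any worker in the cluster can now fetch it for further transformation.
image_data = bucket.get('card-123/preview.png').raw_data
```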

CDN

In the last year we went from having some of our images in our CDN to having ALL of our static content served by the CDN. This includes images, CSS, JS, web fonts, and SWFs. Not only does this benefit our end users, it also relieves our cluster from serving those (high-frequency/low-impact) requests.

Performance

As we’ve talked about before, we’re serious about the performance of our core applications. More than in any other year, though, we worked very hard before and after the holidays to keep our p90s (90th percentile response times) down. We used a variety of tools and techniques to accomplish this; suffice it to say that we did it and it helped tremendously.

Better Monitoring

One of the core failures we had last year was being unaware of a recurring crash in our renderer service until the morning of VDAY. Since then we’ve doubled down on our use of Sensu and created and refined a number of alerts, to the point that we’re confident the alerts we get are real and need to be addressed. We still occasionally experience alert fatigue, but we have a much better sense of our cluster’s health than we did before.

Load Testing

Leading up to Valentine’s Day, we wanted to get a sense of how our site would respond to the flood of traffic we expected to receive. So we fired up JMeter and crafted a script that would log into our site, create a card, and then send it to a guest. Unlike a simple benchmark that just navigates through our website making GET requests, this routine exercised the backend services we expected to be hit hardest: rendering card images from a JSON representation of a card and converting those images into the many formats used on our site and in emails. Although this was our first foray into load testing, we found a few issues to fix immediately. We also learned that it isn’t hard to take out our site with a few thousand synthetic users located near our datacenter.

Code Freeze

Since we’re very aware of which times of year are the most crucial for operations, we can put stakes in the ground around weeks where activity on the site should come from end users, not developers. As much as we truly believe in continuous deployment and strive toward it most of the year, we’ve been through difficult situations where deploys immediately before a large traffic event caused major problems. It’s often hard to measure the effect or scope of “small” changes, so instead of blocking only “big deploys” we’ve started declaring periods where production deploys are off limits. The intent is not to put a roadblock in front of developer progress (deploys to staging and local dev continue as normal), but to minimize the impact that “hot” fixes can have on our most crucial days.

Last Minute Fixes

pgbouncer pooling

One problem we discovered through load testing was that despite our use of pgbouncer to maintain a pool of connections to our PostgreSQL server, a spike in activity would cause many new connections to be opened on the backend server, which bogged down the server and slowed down query response times.

To address this, we set pgbouncer’s min_pool_size parameter to keep a minimum number of connections open on the server. This way, when activity spikes, the connections are already open and ready to handle queries without paying the penalty of spawning many heavy connection processes.
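
The relevant piece of pgbouncer.ini ends up looking something like the following; the numbers are illustrative, not our production values:

```ini
; pgbouncer.ini (excerpt) -- the numbers are illustrative
[pgbouncer]
pool_mode = transaction
default_pool_size = 40
; Keep at least this many server connections open even when idle, so a
; traffic spike doesn't pay the cost of forking new PostgreSQL backends.
min_pool_size = 20
```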

Redis timeouts and nutcracker

We brought up our full cluster of EC2 workers a week ahead of VDAY and immediately started to see some strange issues with the workers communicating with our production Redis. These issues manifested themselves as Redis::Timeout errors in our Rails application. We had seen some of these before (during the holidays, also when EC2 was in play) and had created a StatsD counter for their frequency. This graph sums up the situation pretty well:

In our normal cluster there are rarely any timeouts, but during our holiday periods (and especially the week leading up to VDAY) this spiked tremendously. These timeouts are critical as they cause jobs to fail unpredictably and affect the speed at which our cluster can work.
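
The counter itself is about as simple as it sounds. Roughly, using the statsd-ruby gem (the metric name and wrapper are illustrative):

```ruby
# Sketch of the timeout counter; the metric name and wrapper are illustrative.
require 'redis'
require 'statsd' # statsd-ruby gem

STATSD = Statsd.new('localhost', 8125)

def with_redis_timeout_tracking
  yield
rescue Redis::TimeoutError
  # Bump a counter so spikes show up on our dashboards, then re-raise
  # and let the caller decide how to fail.
  STATSD.increment('redis.timeouts')
  raise
end
```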

During load testing we noticed a clear correlation between the number of workers working and the number of timeouts we would see. When timeouts were frequent, they affected not only our EC2 workers but also workers in our datacenter, so we inferred it couldn’t be a pure latency issue. After going back and forth about the root cause, we settled on the theory that the high number of active concurrent connections, combined with the added latency across the EC2 tunnel, was causing requests to back up in Redis and time out for clients. We split this into two separate issues.

For latency, we dusted off tcpdump and tested out changes to our TCP window sizes mainly guided by mailing list posts and this great post from Chartbeat. Unfortunately, after additional load tests, this didn’t have the big impact we wanted.

At the same time, we started looking into reducing the number of concurrent connections. We had been following twemproxy (aka nutcracker) for a while as a possible way to distribute our memcached data. We remembered that it also speaks the Redis protocol and, in a simple setup, can be used as a connection pool much like pgbouncer. Thanks to its simplicity, we were able to set it up very quickly in a per-node topology: each worker and renderer talks to a nutcracker process on localhost, which pools requests to our main Redis server. Each node ran n (8-12) workers, so each node we rolled out to reduced the number of connections by n - 1. We rolled this out to a single worker node at first (two nights before VDAY), ran the load tests, and saw a drop-off in timeouts. The next night we rolled it out to all the workers and saw an even steeper drop in timeouts, along with a drop in connections.
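
The per-node configuration is tiny. Here’s a sketch of what a nutcracker.yml for this topology can look like; the addresses, port, and timeout are illustrative rather than our production values:

```yaml
# nutcracker.yml (sketch) -- one of these runs on every worker/renderer node.
# Addresses, port, and timeout are illustrative, not our production values.
redis_pool:
  listen: 127.0.0.1:6380        # local workers point their Redis client here
  redis: true                   # speak the Redis protocol, not memcached
  hash: fnv1a_64
  distribution: ketama
  preconnect: true
  auto_eject_hosts: false
  timeout: 400
  server_connections: 1         # the node's workers share one upstream connection
  servers:
    - 10.0.0.10:6379:1          # our main Redis server (placeholder address)
```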

http://ppgraphiti.s3.amazonaws.com/snapshots/2e861e21f47/1392390670799.png

In total, over 3 days we went from more than 900 connections to our Redis server to fewer than 200, without a significant increase in Redis response time. This didn’t fix ALL of our Redis timeout problems, but a big thank you to Twitter Eng for a great piece of software.

The Big Day

All of the above is to say that more than any previous year, we felt very prepared for what was ahead.

Early arrival/shifts

On a personal note, our issues were exacerbated last year because the problems started very early, and even though some of us were up early, there wasn’t a full team online. Worse, because a bunch of us woke up to alerts, we were stuck working from our kitchen tables at home for the rest of the day.

This year we established shifts: two of us were at the office by 6AM, and others started later in the day and stayed later. Getting to the office on a snowy, dark morning was strange, but we’re glad we did it.

Quick climb

When we first got in, things were looking good, but by 7AM the climb began in earnest.

http://ppgraphiti.s3.amazonaws.com/snapshots/4a69d9adba4/1392387059283.png

The first things to show problems were our front-end load balancers (haproxy and nginx). Our Ruby processes seemed to be keeping up (though queueing requests), but the edge balancers were queueing an even higher number of requests, which eventually led to users seeing timeouts. This didn’t amount to full downtime, but certain users were definitely having trouble getting through. To alleviate it, we brought up additional web apps, which had an immediate effect.

Periods of queue backup

At this point, queues started to back up a bit. Thanks to the fixes we managed to get in, this didn’t result in completely lost render jobs or cards, but at one point in the day it was taking minutes for a job to get through the pipeline (when it usually takes seconds).

This was due not only to the Redis-based queues backing up, but also to the in-process queues in our agency service. We alleviated it by bringing up additional agency processes and redistributing workers to handle the queues.

Redis timeouts

Though nutcracker undoubtedly helped, we still saw a lot of Redis timeouts throughout the day, which resulted in failed jobs and some user cards not going out immediately. We had to go back and address some of these failures manually, since they happened at points in the pipeline where retries are not automatic. In some cases the failure was in reporting the job’s status, which left the job “stuck” and impossible to retry without manual intervention.

Post hump

After the morning hump of traffic, things calmed down, though we still saw larger queue depths than normal. Despite this, the site was operating at full capacity without rejecting requests or imposing unreasonable wait times.

By not failing tremendously we managed to grow our “Cards Sent on VDAY” year over year by >150%.

Moving forward

In hindsight there were some near-critical problems that could have been a bigger deal if we hadn’t jumped on them so quickly. Some of these, like disk space on our Sphinx search node and Riak cluster, or bringing web apps and other nodes up to maximum capacity, should have been addressed well before VDAY.

PostgreSQL and other services performed admirably, but with a strong caveat: it’s clear we’re at the end of the line for our relatively simple architecture built on shared resources. The future is distributed, redundant, and highly available. Services that don’t fit that mold are our biggest hurdle and bottleneck.

Obviously, we weren’t perfect and we again have a ton of TODOs to address in the next couple of months. We do feel somewhat triumphant in that we know that some of the plans we put into motion over a year ago had a great impact on our bottom line. We hope to continue to be able to grow and meet next year’s challenges head on.

If these types of challenges sound interesting to you, join our team and help us do even better for VDAY15.
