Stampy Day: Gif TV

This is the second in a series of posts on our most recent Stampy Day, a company-wide hack day here at Paperless Post. We hope you enjoy reading about some of the projects that came out of that day.

We love gifs. We love them so much that, at the time of writing, we have 8,574 gifs saved in our chat bot. Naturally, it would be great to have a way to see all of the gifs we love so much. Out of this idea, Gif TV was born.

I’m a big fan of Giphy TV, and it was the main inspiration for building an internal version of this.

Gif TV is a small Go application that reads in a list of URLs from a text file and displays them sequentially in a web page. Go’s net/http and html/template packages make this straightforward: the URL and index are injected directly into the page’s source.

The application currently shows each gif for two seconds at a time. In the future it would be fun to figure out how to show each gif for its actual duration before moving on to the next. All in all, this was a fun little hack to work on, and watching the full set of gifs was entertaining.

gif tv source | gif tv demo

Starting Android: An iOS Developer’s Perspective

This is the first of a series of posts on our most recent Stampy Day, a company-wide hack day here at Paperless Post. We hope you enjoy reading about some of the projects that came out of that day.

This Stampy Day, Ivan, Teresa, and I decided to play around with Android to see if we could get a very minimal Paperless Post app working. Our goals for the app were to create a home page with package promotions, to get a simple tab bar working to switch back and forth between views, and to load and display some basic packages from the PP API.

The parallels between the general structure of iOS and Android apps were immediately apparent. Each page usually consists of a controller that contains a root view. There are callbacks for every change in the view/controller lifecycle, from its creation, movement to the foreground and background, and deallocation, among other things. There is a navigation stack that is automatically maintained by the system. There are even similar UI elements like table views and grid views, with similar delegation patterns for supplying these UI elements with the data they need to configure themselves.

Despite feeling at home with Android after some basic tutorials, I found a number of things I really didn’t like. I wish to discuss only one of them here in the interest of brevity.

The thing that stood out most to me was the concept of an Intent. Android allows you to send an Intent, which is basically a data structure that holds arbitrary information, to certain objects. The object that receives the Intent is then supposed to use this information as it sees fit. This pattern is often used when creating a new Activity (Android’s poorly named version of the UIViewController: basically a controller containing a view hierarchy). One Activity will pass an Intent to a new Activity so that the new one can use the information provided in the Intent to set itself up.

In order for one object to send a useful Intent to another object, it must know what the other object needs, yet I found no clear way for this second object to enforce or signal to other objects what exactly those needs are. This is in contrast to an interface in iOS, where an object will provide you with a method for initialization, specifying exactly what parameters it needs when initializing itself.

The object sending the Intent therefore has to make assumptions about what the receiving object needs, and the only way for the programmer to figure that out is to look at the receiving object’s internal implementation. This is very prone to breaking: if the receiver’s internal implementation ever changes, its needs might change too, yet outside objects have no idea, and worse, the receiver itself has no way to tell them! In iOS, we would simply change the signature of the initialization method to include a new parameter, and any other object calling the old initializer would no longer be calling a valid method. The app would simply fail to compile.

This seems like a blatant violation of a basic design principle we all know: decoupling. In the case of one Activity presenting another, the presenting Activity is coupled to the presented Activity because it relies on the presented Activity’s internal implementation in order to set it up.

Overall, playing with Android was a fun and worthwhile experiment. It was cool to see another perspective on native mobile development. For the record, we did get a solid homepage (though with weird image drawing issues) working, with a tab bar and some paper packages loaded in a list view.

The Data You Need

There are lots of people at Paperless Post who need data, and over time we have been consistently improving how we get it to them. This is a post about a simple gem we built, ReplicaConnect, that makes it easy to access our data from plain Ruby scripts.

Since you’re reading our dev blog, you probably know that to get data from a database, we need to run SQL queries. Data analysis queries can take a long time, and if we ran them on our actual production db, they would take down the site. So, we have a replica of our production database that we use to run big queries, without worrying about affecting users.

Traditionally, when we had a long query to run, a developer would load up pgAdmin, a GUI tool for querying a PostgreSQL database. Then the query would be run, the results saved to a CSV, and the file emailed to whoever had requested the data.

For one-off queries this worked fine, but it was painful for common queries and reports. Often, multiple queries were run to produce a report. The mindless process of copying a query from a file to pgAdmin, updating the date, running the query, pasting the result into a spreadsheet cell, and repeating 20 times, was tedious, time consuming and prone to errors. Additionally, while SQL is amazing at getting data from a database, if you want to take that data and do a bunch of stuff to it, it’s probably going to be a lot easier with another programming language.

Transforming these reports from multiple SQL files into a ruby script that ran the SQL and formatted the results was clearly necessary. Every single script would need a way to connect to our replica db to run queries, so Richard Boyle and I built ReplicaConnect to make that functionality simple and seamless.

In order to connect to a database using ActiveRecord, you need an adapter, hostname, port, database name, username, and password. The first time you use ReplicaConnect, it prompts you for that information, then saves it in a file, and any future connections from the same folder will just use the information in that file.
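There isn’t much magic behind that. Roughly speaking, a connect call boils down to handing the saved settings to ActiveRecord. The sketch below is illustrative only; the config file name and keys are made up for the example and are not ReplicaConnect’s actual internals:

  # Illustrative sketch, not ReplicaConnect's real source. The file name
  # and keys below are assumptions for the example.
  require 'yaml'
  require 'active_record'

  settings = YAML.load_file('.replica_connection.yml')
  # e.g. { 'adapter' => 'postgresql', 'host' => 'replica.internal',
  #        'port' => 5432, 'database' => 'pp_production',
  #        'username' => 'readonly', 'password' => 'secret' }

  ActiveRecord::Base.establish_connection(settings)
  connection = ActiveRecord::Base.connection
  connection.execute("SELECT 1")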

Using ReplicaConnect is simple.

  require 'replica_connect'
  connection = ReplicaConnect::Connection.new().connect
  result = connection.execute("SELECT * FROM users LIMIT 1")

Done! That result is an Enumerable that contains the results of your query.

If you want to save that to a CSV, use PgToCsv, another gem we created.

PgToCsv.to_csv(result, "filename.csv")

See how easy that was?

One example of where this has had a huge impact is in creating marketing email lists. Our marketing team wants lists of unique email addresses for various segments of our users on a semi-regular basis. Generating the 5 to 10 separate lists in SQL alone, while making sure each email address appeared only once, was incredibly complicated: you had to run a different query for each segment, and the queries grew unwieldy as we de-duplicated addresses across lists. The process used to take a few hours.

Now, this process takes about a minute. The customer segment queries are defined as strings earlier in the file, and then this is run:

  connection = ReplicaConnect::Connection.new().connect
  dedupe = []
  # Each string names a SQL query defined earlier in the file;
  # eval turns the name into the query text before executing it.
  ['segment_1', 'segment_2', 'segment_3', 'segment_4'].each do |query|
    query_result = connection.execute(eval(query))
    # Drop any rows already written out for an earlier segment.
    result_to_save = query_result.to_a - dedupe.flatten
    PgToCsv.to_csv(result_to_save, "#{query}_#{Time.now.strftime("%m_%d_%y")}.csv")
    dedupe << result_to_save
  end

We quickly make a connection, execute our query, and then have all the power and ease of ruby to manipulate the results.

Check out the source code for Replica Connect and PgToCsv, and feel free to fork and contribute!

VDAY14: A Different Story

Every year, February rolls around and things get a little crazy at PP HQ. Though the holiday season has more sustained traffic and usage, Valentine’s Day is our single most popular day for sending and creating cards. Last year, things did not go so well. The day itself was a hard one, and it shook our entire team and woke us up to the kind of scale we were dealing with. We immediately turned around, held very productive postmortems, and planned out a lot of things we could do to make the next year as different as possible. We didn’t get every single thing we planned completed, but thanks to a lot of changes, VDAY14 was a very different day in a very positive way: we survived with zero application downtime and only small-scale interruptions to service.

What was different

There were a huge number of large and small changes that led to making this year’s Valentine’s Day a success, but a few major overarching changes deserve a special shout-out for having the biggest impact.

EC2

After moving to our new datacenter in August ‘12, we were able to set up a ‘low latency’ Direct Connect to EC2. This opened up the possibility of keeping our core services in our datacenter while bringing up the services that scale horizontally (our workers and renderers) in EC2 as needed. Last VDAY this story wasn’t really complete, but by Holidays ‘13 we were using a simple script to bring up an EC2 worker cluster every morning and shut it down every night. This gave us a huge amount of capacity for the background work that makes up much of sending a card.

Coming Up Swinging

One of our major pain points from VDAY13 was that even though we were able to bring up new worker nodes quickly with Chef (in vSphere and EC2), these boxes “came up” without the latest application code. This required a deploy to our full cluster, which in turn let queues back up while the apps restarted. Our solution (which we’ll cover in more depth in a later post) split our deploys into a two-step process: build and standup. build happens in a central location, followed by standup, which happens on each individual node. With the steps split out, the standup script can be run as part of the Chef converge when a node is initially provisioned. This means that when we bring new nodes online, they can immediately do work without a deploy, which made bringing up new worker nodes throughout the holidays as easy as running a single command in a shell.

iOS

A big difference in our usage from last year is that in October ‘13 we launched a whole new create tool on iPad (and in January ‘14 on iPhone). Though the creation process happens in-app, cards are still rendered and delivered through our API with our backend infrastructure. This had an interesting effect on traffic: it reduced the number of requests for some resources while greatly increasing the load on certain API endpoints. Overall it meant an increase in the number of concurrent senders by enabling a whole new class of sender.

Scheduled Sending

In addition to the changes in iOS, this year we released a frequently requested feature: Scheduled Sending. This allows lovebirds (and other hosts) to create their cards ahead of time and schedule them to send at a specific date and time (e.g., on Valentine’s Day). This is a boon for us, as it lets us not only better predict how many cards we’ll be sending, but also spread out the resource-intensive work of rendering cards, which now happens when a card is created and scheduled rather than when it’s sent. Leading up to VDAY we kept a very close eye on the number of scheduled sends, and our marketing team worked hard to make our users aware of the new capability.

Agency

One of our major failures last year was around our image processing pipeline and a collapse of our NFS server (a single point of failure) which holds our images temporarily before shipping them off to S3. At this point last year, we already had a plan for a system we wanted to build (codenamed agency) that would replace this core part of our image pipeline.

Its design and implementation also deserve a full blog post, but the general idea was a distributed, concurrent service that could handle the storage and transformation of images, backed by a distributed data store (Riak) instead of flat files on NFS. v0.1.0 shipped internally during the holidays, and we were able to slowly roll it out to more users and eventually enable it for 100% of card images in the weeks leading up to VDAY. In this model, a small cluster of three agency servers with a 6-node Riak cluster was able to handle work that previously took 30 or more ‘worker’ boxes.

CDN

In the last year we went from having some of our images in our CDN to having ALL of our static content served by the CDN. This includes images, CSS, JS, web fonts, and SWFs. Not only does this benefit our end users, it also relieves our cluster of serving those (high-frequency, low-impact) requests.

Performance

As we’ve talked about before, we’re serious about the performance of our core applications. More than any other year, though, we worked very hard before and after the holidays to keep our p90s (90th percentile response times) down. We used a variety of tools and techniques to accomplish this; suffice it to say that we did it and it helped tremendously.

Better Monitoring

One of the core failures we had last year was being unaware of a recurring crash in our renderer service until the morning of VDAY. Since then we’ve doubled down on our use of Sensu and created and refined a number of alerts to the point that we’re very certain that the alerts we’re getting are real and need to be addressed. We still occasionally experience alert fatigue, but we have a much better sense of our cluster’s health than we had before.

Load Testing

Leading up to Valentine’s Day, we wanted to get a sense of how our site would respond to the flood of traffic we expected to receive. So we fired up JMeter and crafted a script that would log into our site, create a card, and then send it to a guest. Unlike a simple benchmark that just navigates through our website making GET requests, this routine exercised the backend services we expected to be hit hardest by the traffic: rendering card images from a JSON representation of a card, and converting those images to the many formats used on our site and in emails. Although this was our first foray into load testing, we were able to find a few issues to fix immediately. We also learned that it isn’t hard to take out our site with a few thousand synthetic users located near our datacenter.

Code Freeze

Since we’re very aware of the times of year that are the most crucial for operations, we can put stakes in the ground around weeks where activity on the site should come from end users, not developers. As much as we truly believe in continuous deployment and strive toward it most of the year, we’ve been through difficult situations where deploys immediately before large amounts of traffic caused major problems. It’s often hard to measure the effect or scope of “small” changes, so instead of blocking only “big deploys” we’ve started to designate periods of time where production deploys are off limits. The intent is not to put a roadblock in front of developer progress (deploys to staging and local dev continue as normal), but to minimize the impact that “hot” fixes have on our most crucial days.

Last Minute Fixes

pgbouncer pooling

One problem we discovered through load testing was that despite our use of pgbouncer to maintain a pool of connections to our PostgreSQL server, a spike in activity would cause many new connections to be opened on the backend server, bogging down the server and slowing query response times.

To address this, we set pgbouncer’s min_pool_size parameter to keep a minimum number of connections open on the server. This way, when activity spikes, the connections are already open and ready to handle queries without paying the penalty of spawning many heavy connection processes.
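In pgbouncer.ini this amounts to a single extra setting. The excerpt below is a sketch; the pool sizes are illustrative, not our production values:

  ; illustrative pgbouncer.ini excerpt; pool sizes are examples,
  ; not our production values
  [pgbouncer]
  default_pool_size = 40
  ; keep at least this many server connections open so a spike
  ; doesn't pay the cost of spawning new backend processes
  min_pool_size = 20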

Redis timeouts and nutcracker

We brought up our full cluster of EC2 workers a week ahead of VDAY and immediately started to see some strange issues with the workers communicating with our production Redis. These issues manifested themselves as Redis::Timeout errors in our Rails application. We had seen some of these before (during the holidays, also when EC2 was in play) and had created a StatsD counter for their frequency. This graph sums up the situation pretty well:

In our normal cluster there are rarely any timeouts, but during our holiday periods (and especially the week leading up to VDAY) they spiked tremendously. These timeouts are critical because they cause jobs to fail unpredictably and affect the speed at which our cluster can work.

During load testing we noticed a clear correlation between the number of workers working and the number of timeouts we would see. When the timeouts were frequent, they affected not only our EC2 workers but also workers in our datacenter, so we inferred that it couldn’t be a pure latency issue. We went back and forth about the root cause and settled on the theory that the high number of active concurrent connections, combined with the added latency across the EC2 tunnel, was causing requests to block in Redis and time out for clients. We split these into two separate issues.

For latency, we dusted off tcpdump and tested out changes to our TCP window sizes mainly guided by mailing list posts and this great post from Chartbeat. Unfortunately, after additional load tests, this didn’t have the big impact we wanted.

At the same time, we also started looking into reducing the number of concurrent connections. We had been following twemproxy (aka nutcracker) for a while and were interested in using it to distribute our memcached data. We remembered that it also speaks the Redis protocol and, in a minimal setup, can be used as a simple connection pool similar to pgbouncer. Thanks to its simplicity, we were able to set it up very quickly in a per-node topology: each worker and renderer talks to a nutcracker process on localhost, which pools requests to our main Redis server. Each node had n (8-12) workers, so each node we rolled out to would reduce the number of connections by n - 1. We rolled this out to a single worker at first (two nights before VDAY), ran the load tests, and saw a drop-off in timeouts. The next night we rolled it out to all the workers and saw an even steeper drop in timeouts, along with a drop in connections.

http://ppgraphiti.s3.amazonaws.com/snapshots/2e861e21f47/1392390670799.png

In total, in 3 days we went from > 900 connections to our Redis server to < 200. Redis response time also did not increase significantly. This didn’t fix ALL of our Redis timeout problems, but a big thank you to Twitter Eng for a great piece of software.
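For anyone curious, a per-node nutcracker configuration for this kind of Redis pooling looks roughly like the following. The listen port, upstream address, and tuning values are illustrative, not our production settings:

  # illustrative nutcracker.yml; addresses and values are examples only
  redis_pool:
    listen: 127.0.0.1:6380        # workers on this node connect here instead of Redis directly
    redis: true                   # speak the Redis protocol
    hash: fnv1a_64
    distribution: ketama
    auto_eject_hosts: false
    timeout: 400
    server_connections: 1         # a single pooled connection upstream per nutcracker
    servers:
      - 10.0.0.10:6379:1          # the main Redis server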

The Big Day

All of the above is to say that more than any previous year, we felt very prepared for what was ahead.

Early arrival/shifts

On a personal note, our issues were exacerbated last year because the problems started very early in the morning, and even though some of us were awake, there wasn’t a full team online. Worse, because a bunch of us woke up to alerts, we were trapped working from our home kitchen tables for the rest of the day.

This year we established shifts where two of us started and were at the office at 6AM and others started later in the day and stayed later. Getting to the office on a snowy dark morning was strange, but we’re glad we did it.

Quick climb

When we first got in, things looked good, but by 7AM the climb began in earnest.

http://ppgraphiti.s3.amazonaws.com/snapshots/4a69d9adba4/1392387059283.png

The first things to show problems were our front-end load balancers (haproxy and nginx). Though our Ruby processes seemed to be keeping up (while queueing requests), the edge balancers were queueing an even higher number of requests, which eventually led to users seeing timeouts. This did not amount to full downtime, but certain users were definitely having trouble getting through. To alleviate the pressure, we brought up additional web apps, which had an immediate effect.

Periods of queue backup

At this point, queues started to back up a bit. Because of a lot of the fixes we managed to get in, these didn’t result in completely lost render jobs or cards, but at one point in the day it was taking minutes for a job to get through the pipeline (when it usually takes seconds).

This was due not only to the Redis-based queues backing up, but also to the in-process queues in our agency service. We alleviated this by bringing up additional agency processes and redistributing workers to handle the queues.

Redis timeouts

Though nutcracker undoubtedly helped, we still saw a lot of Redis timeouts throughout the day, which resulted in failed jobs and some user cards not going out immediately. We had to go back and address some of these failures manually because they happened in places in the pipeline where retries were not automatic. In some cases the failure happened while reporting the status of the job, which left some jobs “stuck” and impossible to retry without manual intervention.

Post hump

After the morning hump of traffic, things calmed down, but queue depths stayed somewhat larger than normal. Despite this, the site remained fully operational, without rejecting requests or imposing unreasonable wait times.

By not failing tremendously we managed to grow our “Cards Sent on VDAY” year over year by >150%.

Moving forward

In hindsight there were some near-critical problems that could have been a bigger deal if we hadn’t jumped on them so quickly. Some of these, like disk space on our Sphinx search node and Riak cluster, as well as bringing up maximum capacity for webapps and other nodes, should have been addressed well before VDAY.

PostgreSQL and other services performed admirably, but with a strong caveat: it’s clear we’re at the end of the line for our relatively simple architecture built on shared resources. The future is distributed, redundant, and highly available. Services that don’t fit that mold are our biggest hurdle and bottleneck.

Obviously, we weren’t perfect and we again have a ton of TODOs to address in the next couple of months. We do feel somewhat triumphant in that we know that some of the plans we put into motion over a year ago had a great impact on our bottom line. We hope to continue to be able to grow and meet next year’s challenges head on.

If these types of challenges sound interesting to you, join our team and help us do even better for VDAY15.

Oktoperfest: Ruby Performance Tooling

There’s no doubt that the Paperless Dev team is a group of polyglots, but our core application is still a Ruby on Rails app. There’s also no question that we’re obsessed with performance. Not only do our users expect us to be fast, but our uptime during our peaks is highly dependent on how quickly we can handle requests. For the past three years, we’ve had a tradition of spending the month of October focused on improving our speed and paying down our technical debt. We call it Oktoperfest (misspelling intentional), and every year it gets better.

Over the past year especially, our own tools and the tools from the community for diagnosing and fixing the performance of Rails apps have improved dramatically. Here I’ll share a brief overview of some of the tools we use daily.

Graphite

We’ve talked extensively about Graphite before, and we continue to use and improve our Graphiti frontend. Our primary method for consuming the graphs is our daily reports: every morning I get a batch of emails full of statically snapshotted graphs.

I’ll look through them on my way to work and see if there’s anything out of the ordinary. Besides our one-off counters and timers, we use a patched version of Rails 2.3 to send general controller timings through ActiveSupport::Notifications. We catch these notifications in what we call the RequestSubscriber, an ActiveSupport::LogSubscriber that collects all of these notifications and collates them into information about a specific request. This then gets delivered to Graphite through StatsD and is also written to our Rails logs in a unified format similar to lograge.
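The shape of that subscriber is easier to show than describe. Here’s a heavily stripped-down sketch of the idea, assuming a modern Rails/ActiveSupport rather than our patched 2.3 and leaving out the StatsD delivery; the field names mirror the log line in the next section:

  # Stripped-down sketch of the RequestSubscriber idea, not our actual
  # implementation: subscribe to controller notifications and emit one
  # key=value line per request.
  ActiveSupport::Notifications.subscribe('process_action.action_controller') do |_name, start, finish, _id, payload|
    fields = {
      method:       payload[:method],
      path:         payload[:path],
      controller:   payload[:controller],
      action:       payload[:action],
      status:       payload[:status],
      db_runtime:   payload[:db_runtime].to_i,
      view_runtime: payload[:view_runtime].to_i,
      duration:     ((finish - start) * 1000).round
    }
    Rails.logger.info(fields.map { |key, value| "#{key}=#{value}" }.join(' '))
  end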

Logs

An example log line looks like:

uuid=b60c0303f347c11719e4ddfad97893bd type=http method=GET path=/paper controller=HomeController 
action=show format=html status=200 db_runtime=44 view_runtime=375 length=88395 duration=434 
params={"version"=>"paper"} cache_hits=17 cache_misses=0 cache_reads=25 cache_writes=2 
cache_read_runtime=74 cache_write_runtime=1 cache_runtime=75 query_count=27 query_cache_count=7 
redis_count=2 redis_runtime=3 redis_data_size=82 partial_renders=30 json_runtime=0 snappy_runtime=0 
snappy_in_count=0 snappy_out_count=5 account_id=1566743 gc_count=0 hostname=production-webapp09 total_runtime=434

As you can see, there’s a LOT of info here. Since we have a large number of servers at this point, surfing through log files (even when collected on a central log server with rsyslog) can be almost impossible. Additionally, we rarely care about a single request; rather, when trying to diagnose a slow controller action or a jump in response time, we want to look at a sample of requests to a single endpoint. This is where logstash comes in. Using this unified format and a simple filter, we’re able to search for a single endpoint or even a single node and see live and historical data. Right now we keep logs in logstash for 2 days; everything else is backed up and shipped off site.

This data is so valuable because, unlike Graphite/StatsD, it is not averaged or summarized. Fairly often we’ll find an endpoint whose mean or even 90th-percentile response time is low, but that belies the fact that while many requests are fast, some are so slow that they have an adverse effect on total performance. By looking at a sample of the actual requests we can see these anomalies and dig in to find out why.
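A nice side effect of the key=value format is that it’s trivial to pull apart anywhere, not just in logstash. A toy example in Ruby (not part of our tooling):

  # Toy example: split a unified log line back into a hash for ad hoc analysis.
  line   = 'method=GET path=/paper status=200 duration=434 query_count=27'
  fields = line.split(' ').map { |pair| pair.split('=', 2) }.to_h
  fields['duration'].to_i  # => 434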

Performance Bar

When browsing the production site it’s really useful to have information about the current request in plain view. We call it the performance bar, and it’s a little red bar at the top of every page. Tools like Peek can do this for you simply, but we already have all the data we need in our RequestSubscriber. All the performance bar does is subscribe to the notification in a middleware and inject the data into the page as JSON; then some simple JS renders it on the screen all pretty like. We’d also suggest looking at Sam Saffron’s MiniProfiler project, which gave us a ton of useful data but ended up not fitting our needs exactly.
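In spirit, the middleware is just a thin Rack wrapper around that data. A rough sketch of the approach; the stats accessor and the JS hook are stand-ins, not our actual names:

  # Sketch of the performance bar middleware, not our production code.
  # RequestStats.current is a stand-in for whatever the RequestSubscriber
  # collected for the current request.
  require 'json'

  class PerformanceBar
    def initialize(app)
      @app = app
    end

    def call(env)
      status, headers, body = @app.call(env)
      return [status, headers, body] unless headers['Content-Type'].to_s.include?('text/html')

      html = +''
      body.each { |chunk| html << chunk }
      payload = RequestStats.current.to_json # hypothetical accessor
      html.sub!('</body>', "<script>renderPerfBar(#{payload});</script></body>")

      headers['Content-Length'] = html.bytesize.to_s
      [status, headers, [html]]
    end
  end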

In development and staging, some of our favorite MCs represent different useful tools:

  • Daddy Mac of Kriss Kross is “Jump”, a set of quick links to different parts of the site.
  • Ludacris lets you “Rollout” different features to yourself.
  • The RZA lets you toggle caching in development.
  • Large Professor lets you re-run the current request with rblineprof.

rblineprof

rblineprof is definitely one of the most useful tools to come into the Ruby ecosystem in a long time. Thanks to Aman Gupta (@tmm1), we can now see which individual lines and blocks of our code take the most wall time (or CPU time) in a request. Though it’s not always perfect, it can be extremely useful for quickly narrowing in on the most problematic parts of a code path. We use a slightly modified version of the code in peek-rblineprof to render syntax-highlighted code with the timings on the left. This can be invoked in staging or development by just clicking on the Large Professor.
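If you want to play with it outside of a request, the gem’s block API is all you need. A minimal, self-contained example, separate from our Large Professor integration (the per-line tuple layout described in the comments is a rough guide):

  # Minimal standalone rblineprof example; profiles only code in this file.
  require 'rblineprof'

  profile = lineprof(/#{Regexp.escape(__FILE__)}/) do
    50_000.times { (1..20).map(&:to_s).join(',') }
  end

  profile.each do |file, lines|
    # lines[0] holds per-file totals; lines[n] holds data for line n,
    # roughly (wall usec, cpu usec, calls, allocations).
    lines.each_with_index do |(wall, _cpu, calls), number|
      next if number.zero? || calls.to_i.zero?
      printf("%8.2fms %6d calls  %s:%d\n", wall / 1000.0, calls, file, number)
    end
  end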

ppprofiler

Once we felt the power of rblineprof, we wanted more. Specifically, once we identified a slow action or section of code, we wanted a way to run it repeatedly, before and after making changes, to track our progress. We also wanted a way to report these improvements to the rest of the team. For this, we created the “Paperless Profiler”, a script that can be invoked with:

./script/ppprofiler 'CODE TO EXECUTE' | tee -a mylog.md | less

This does a number of things that are pretty clearly summed up in the class’s run method:

def run
  output_header
  # prime
  toggle_caching
  run_subscriber
  toggle_caching
  run_subscriber

  toggle_caching
  run_benchmark
  toggle_caching
  run_benchmark

  toggle_caching
  run_lineprof
  toggle_caching
  run_lineprof

  output_to_console
end

It runs raw benchmarks (using Benchmark.measure), it collects SQL, cache, and Redis counts, and it runs rblineprof. It also does all of this with the cache enabled and with it disabled. It then takes all of the data and formats it with a simple Markdown template, which has the advantage of being easy to pipe into gist or paste into a GitHub issue. You can see some sample output here. I’ve shared the full code as a gist, and though it can’t be run out of the box (it relies on some PP variables), it should be easy enough to modify for your needs.

Find and destroy

Together, we use all of these tools to identify our problems, try to catch them quickly, and pin down the actual source. Actually solving the issues often comes down to just “doing less”: fewer things uncached, fewer things in the request/response cycle (moving work to background jobs or threads), or simply eliminating unused code.

Though Ruby still has its issues, there are some great tools available and some even cooler things on the horizon. One of the things that’s often most problematic, and that we’re most interested in, is tracking down Ruby memory usage problems. Sam’s MemoryProfiler, though only functional on Ruby HEAD, is extremely promising. Charlie Somerville and Aman have also been pushing some amazing work into the Ruby language itself, and we’re watching closely. Stay tuned as we continue to write more about the tools we’re using at PP.