The Data You Need - Paperless Post Dev Blog

There are lots of people at Paperless Post who need data, and over time we have been consistently improving how we get it to them. This is an article about a simple gem we built, ReplicaConnect, that allows easy access to our data inside simple ruby scripts.

Since you’re reading our dev blog, you probably know that to get data from a database, we need to run SQL queries. Data analysis queries can take a long time, and if we ran them on our actual production db, they would take down the site. So, we have a replica of our production database that we use to run big queries, without worrying about affecting users.

Traditionally, when we had a long query to run, a developer would load up pgAdmin, which is a GUI tool for querying a PostgreSQL database. Then the query would be run, results saved to a csv and emailed to whomever had requested the data.

For one-off queries this worked fine, but it was painful for common queries and reports. Often, multiple queries were run to produce a report. The mindless process of copying a query from a file to pgAdmin, updating the date, running the query, pasting the result into a spreadsheet cell, and repeating 20 times, was tedious, time consuming and prone to errors. Additionally, while SQL is amazing at getting data from a database, if you want to take that data and do a bunch of stuff to it, it’s probably going to be a lot easier with another programming language.

Transforming these reports from multiple SQL files into a ruby script that ran the SQL and formatted the results was clearly necessary. Every single script would need a way to connect to our replica db to run queries, so Richard Boyle and I built ReplicaConnect to make that functionality simple and seamless.

In order to connect to a database using ActiveRecord, you need an adapter, hostname, port, database name, username, and password. The first time you use ReplicaConnect, it prompts you for that information, then saves it in a file, and any future connections from the same folder will just use the information in that file.

Using ReplicaConnect is simple.

  require 'replica_connect'
  connection = ReplicaConnect::Connection.new().connect
  result = connection.execute("SELECT * FROM users LIMIT 1")

Done! That result is an Enumarable that contains the results of your query.

If you want to save that to a csv, use PgToCsv, another gem we created.

PgToCsv.to_csv(result, "filename.csv")

See how easy that was?

One example of where this has had a huge impact is in creating marketing email lists. Our marketing team wants lists of unique email addresses for various segments of our users on a semi-regular basis. The process for generating the 5 - 10 separate lists, and making sure each email is only included once in SQL was incredibly complicated. You had to run a different query for each of the segments, and the queries became very complicated as we had to de-duplicate names from multiple lists. This process used to take a few hours.

Now, this process takes about a minute. The customer segment queries are defined as strings earlier in the file, and then this is run:

  connection = ReplicaConnect::Connection.new().connect
  dedupe = []
  ['segment_1', 'segment_2', 'segment3', 'segment4'].each do |query|
    query_result = connection.execute(eval(query))
    result_to_save = query_result.to_a - dedupe.flatten
    PgToCsv.to_csv(result_to_save, "#{query}_#{Time.now.strftime("%m_%d_%y")}.csv")
    dedupe << result_to_save
  end

We quickly make a connection, execute our query, and then have all the power and ease of ruby to manipulate the results.

Check out the source code for Replica Connect and PgToCsv, and feel free to fork and contribute!

Comments