SQL Is Your Friend, Part 2

Yesterday I began to make the case that it’s OK to skirt your ORM and write some raw SQL every once in a while.

I’ll be honest with you. I have no idea how well the ActiveRecord postgres adapter handles the JSON datatype. I mentioned in that post that I always store entire JSON payloads. I also never select the entire column. I always pick what I want out, like I showed in the post.

Today, we have a brand spanking new version of Postgres! Version 9.4 shipped today, and it has an update that I had already planned to talk about. But we aren’t quite ready to talk about it yet.

I’m working through this blog series the same way I work through a refactoring process:

  • I have some data, and I want to query it.
  • Huh. That’s a giant pile of SQL. How can I clean that up?
  • Maybe there is a more performant way to access this gnarly data?

So today we will be in the first 2 bullets.

Windowing Functions

You have a lot of data. You can take that data and fold it on itself to see how one record is related to or changes from another.

Let’s say you want to look and see when someone has moved from one place to another. You can take a row and fold in data from the next row to make that comparison by using the LEAD windowing function. Like so:

SELECT major,
       LEAD(minor) OVER (ORDER BY reading_timestamp) AS next_minor
FROM readings
ORDER BY reading_timestamp;

See the query in action with this SQL Fiddle

So what’s happening here?

We’ve asked for the next record — the LEAD. We use the OVER keyword to tell Postgres how to construct the window. This is how we fold the data. So I’ve said that I want the data sorted by the reading timestamps, and given that same sort I want to pull in the value for the next record’s minor column.

Repetition Is Lame

Now imagine that query is the basis for a couple of different queries. You can store that query in the database in a function. You saw how to create a function in the previous post. This is something that I actually do every once in a while.

I will, at this point, take a step back and say that I am not advocating that you stuff all of your business logic in the database. I have fought in that stored procedure holy war on the side of keeping business logic in the application that really needs to know about it.

I see this in a different light, though. This is a code optimization. I’m in control of the code and the logic, and I’ve decided where it makes the most sense for what code to run. Yes, the SQL in the function is compiled and optimized and whatever in Postgres, but I don’t honestly know how important that is. This is just me DRYing up redundant queries and making joins less hairy.

CREATE OR REPLACE FUNCTION readings_with_lead()
  id INT, phone TEXT, major INT, minor INT,
  reading_timestamp TIMESTAMP,
  prev_beacon_minor INT, next_beacon_minor INT
AS $$
  lag(minor) OVER (ORDER BY phone, reading_timestamp) AS prev_beacon_minor,
  lead(minor) OVER (ORDER BY phone, reading_timestamp) AS next_beacon_minor
FROM readings
ORDER BY phone, reading_timestamp;$$

This gives us a nice way to call on an enriched version of our data. It quacks like a table, too, so you can join on it and do anything else you would do with a normal table.

FROM readings_with_lead()
WHERE next_beacon_minor IS NOT NULL;

See the function in action with this SQL Fiddle

Leveling Up Your Windowing

Now I want to take the data and start to piece together sequences of events. What I really want to do is put together specific events in sequence. And I can totally do that with a windowing function:

SELECT phone,
       FIRST_VALUE(reading_timestamp) OVER (PARTITION BY phone, major ORDER BY phone, reading_timestamp)
FROM readings
ORDER BY 1, 4;

I’ve asked Postgres to give me the first reading_timestamp value across every row in a window. That window is defined as a grouping of like phone numbers and major values, and those partitions are also sorted.

See it in action with this SQL Fiddle

Once you have this information you can start looking at all sorts of interesting things. How long was an event? What happened in that event?

We will look at some more techniques that make this even more interesting soon. Next up: subqueries.


Filed under programming, ruby, SQL