SQL Is Your Friend, Part 3

Yo dawg, I heard you like queries. So I put a query in your query so you can query while you query.

Now we are starting to get into some really cool and powerful SQL. Subqueries! Let’s jump right in, shall we?


Subqueries For Change


I want to count up all the times something changed across some set of data. You could pull all the data down and iterate over it in your code, and maybe that runs just fine. You can also let the database do the heavy lifting and pull down just the results that you need.

Using the LEAD windowing function we can look forward one record to see what happens next, and include that in the “current” record. We saw that in yesterday’s post. That’s not quite what we need, though. Fortunately we can query against that query, or use that as a subquery for the query we actually want to run.

SELECT *
FROM (
  SELECT id,
         major,
         minor,
         LEAD(minor) OVER (ORDER BY reading_timestamp) AS next_minor
  FROM readings
) AS amazing_subquery
WHERE minor != next_minor;

Right, so what we have there is a nested subquery. The outer query runs against the output of the inner query. If you were to run an EXPLAIN plan against this you would see how many rows the inner query returns and at what cost. I do tend to run the query plans against any of these and then keep an eye on them as the data scales up.

See the nested subquery for yourself in this SQL Fiddle.


Subqueries For A Field


We can also pull in a specific data point with a subquery and stuff it into a row. Let’s say I just want to know the last reading ID for an event.

SELECT id,
         phone,
         major,
         minor,
         reading_timestamp,
         ( SELECT MAX(id)
           FROM readings r2
           WHERE r2.phone = r1.phone AND r2.id > r1.id
         ) AS last_id
FROM readings r1
ORDER BY 2, 5;

Sorcery! You can see that I refer to the main query in the subquery. That’s pretty handy, but you don’t have to. You could just ask Postgres to give you the max value from this or some other table.

The reading ID may not be the most useful thing, so maybe instead you pull a timestamp. If you did that then you could do math and calculate the duration, which Postgres can also do for you. You can also wrap that up in another outer query. That’s when I start looking to DRY up the base query into a function.

You can play with this some more in this SQL Fiddle


Common Table Expression (CTE)


You can see where all this subquerying can maybe get out of hand. I’ll confess that I have stopped myself when I’ve nested a few queries deep and asked if maybe there is a better way. I try to avoid Russian Dolls when I program. You can also JOIN on a subquery, which is powerful when you need it and also pretty ugly.

Enter the Common Table Expression, or CTE. This is also known as a WITH query. Let’s take a look at that first query and see how that would look in CTE.

WITH amazing_subquery AS (
  SELECT id,
         major,
         minor,
         LEAD(minor) OVER (ORDER BY reading_timestamp) AS next_minor
  FROM readings
)
SELECT *
FROM amazing_subquery
WHERE minor != next_minor;

Here is the SQL Fiddle. This is a pretty simplified example, so maybe that doesn’t so much knock your socks off.

Let’s take this up to 11, shall we?

WITH interval_query AS (
  -- Start by making a series of 12 time intervals
  SELECT (ts ||' hour')::INTERVAL AS hour_interval
  FROM generate_series(0,11) AS ts
), time_series AS (
  -- Then take that interval and convert it into a TIMESTAMP
  SELECT date_trunc('hour', now()) + INTERVAL '60 min' * ROUND(date_part('minute', now()) / 60.0) - interval_query.hour_interval AS start_time
  FROM interval_query
), time_intervals AS (
  -- Then take that TIMESTAMP and add an hour to it to make an hour timebox
  SELECT start_time, start_time + '1 hour'::INTERVAL AS end_time
  FROM time_series
  ORDER BY start_time
)
-- RIGHT JOIN on those times s.t. we get all the times even when data is missing
SELECT f.start_time, f.end_time, r1.minor, count(DISTINCT r1.phone) readings
FROM readings r1
RIGHT JOIN time_intervals f
  ON r1.reading_timestamp >= f.start_time AND r1.reading_timestamp < f.end_time
GROUP BY f.start_time, f.end_time, r1.minor
ORDER BY f.start_time, r1.minor

BOOM. Time series data with counts. Here is that SQL Fiddle.


55502354


Now it is entirely possible that there are better ways to do this. I am no DBA. I’m just a simple caveman programmer who likes to take data out for a spin and see what it can do for me.

When I had projects in Sybase and SQL Server I would build these complex queries up over a series of temp tables. You can do that with Postgres, too. I just tend to prefer CTE and subqueries. However, there is another technique that Postgres brings to the table that I do use from time to time. Tomorrow.

1 Comment

Filed under programming, ruby, SQL