Category Archives: SQL

SQL Is Your Friend, Part 4

So far we’ve talked about the JSON datatype in part 1. Postgres 9.4 brings some new features that enhance its JSON capabilities even more, but I haven’t had a chance to play with the new shiny yet.

In part 2 we talked about using window functions. In both parts we also made our own functions to help extend functionality or DRY up repetitive queries.

I went over a couple of subquery techniques in part 3. These are exciting to me because they give me even more capabilities to make sense of data at the data level. With windowing and subquerying I can give context to the data in place, without having to pull it down and write piles of app server code. I like that.

Today we are going to talk about a performance optimization. What do you do when the data is indexed well, but the query still doesn’t respond quickly enough? As a measuring stick, I give a critical eye to anything that takes more than 500 milliseconds. Once it nears a full second I start looking for options.


Materialized Views


Postgres gives you the ability to save (materialize) the results of a query. It’s not a temp table, or even a table, but you can give the data indexes and query against it just as you would a table.

-- Store off the query results
CREATE MATERIALIZED VIEW readings_mv AS
SELECT *, LEAD(minor) OVER (ORDER BY reading_timestamp) AS next_minor
FROM readings
WHERE reading_timestamp >= NOW() - '1 day'::INTERVAL;

-- Add some indexes
CREATE INDEX index_readings_mv_on_major ON readings_mv USING btree(major ASC NULLS LAST);
CREATE INDEX index_readings_mv_on_phone ON readings_mv USING btree(phone COLLATE "default" ASC NULLS LAST);
CREATE INDEX index_readings_mv_on_reading_timestamp ON readings_mv USING btree(reading_timestamp ASC NULLS LAST);

Note the name: readings_mv. It’s a materialized view that consists of readings table data. That’s a convention that I picked up along the way to help communicate what database objects are being used. This is helping out your team and also future you.

In this particular example the query needs to run against the past 24 hours of data and do some calculations (windowing). Now the heavy lifting and calculating are done in advance, out of band from the request/response cycle. We can also write more efficient subqueries that rely on this same data. You don’t have to index the materialized view, but I tend to mirror the table’s indexes (or add new ones where it makes sense).

Check out the SQL Fiddle. Scroll all the way down in the left pane to see the creation of the materialized view.


Refresh The Data


That’s great, but I said that I wanted to query against the past 24 hours and time is a moving target.

No worries, you can refresh the data with a simple query:

REFRESH MATERIALIZED VIEW readings_mv;

Easy peasy. It blocks, though, so while the materialized view is being updated you have to wait. Or you had to wait. Postgres 9.4, which just dropped this week, added the ability to refresh a view CONCURRENTLY.

REFRESH MATERIALIZED VIEW CONCURRENTLY readings_mv;
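One caveat: CONCURRENTLY only works when the materialized view has at least one unique index on plain columns. Assuming the `id` column carried over from `readings` is unique, something like this satisfies the requirement:

```sql
-- CONCURRENTLY requires a unique index on the materialized view;
-- assuming id is unique in readings, this one does the job
CREATE UNIQUE INDEX index_readings_mv_on_id ON readings_mv (id);
```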

Here is the Postgres 9.4 changelog (of sorts) and a little more explanation on the REFRESH CONCURRENTLY functionality as it was being included in trunk.

You can put this into a rake task as well:

namespace :db do
  namespace :heroku do
    desc "Load the PROD database from Heroku to the local dev database"
    task :load => :environment do
      if Rails.env == 'development'
        Bundler.with_clean_env do
          config = Rails.configuration.database_configuration[Rails.env]
          system <<-CMD
            heroku pgbackups:capture --expire
            curl -o latest.dump `heroku pgbackups:url`
            pg_restore --verbose --clean --no-acl --no-owner -h localhost \
              -U #{config["username"]} -d #{config["database"]} latest.dump
            rm -rf latest.dump
          CMD
        end
      end
    end

    # https://devcenter.heroku.com/articles/scheduler
    desc 'Update the materialized view(s)'
    task :update_materialized_view => :environment do
      sql = 'REFRESH MATERIALIZED VIEW beacon_readings_lead_lag_mv;'
      ActiveRecord::Base.connection.execute(sql)
    end
  end
end

If you use Heroku you can set this up with Scheduler to run every 10 minutes. I included an additional task that I find useful for populating my local database with live data.


kthxbai


These are techniques that I find myself using when an app goes from simple CRUD to having to make sense of complex data. I hope that you found this series interesting and useful. I also give you permission to occasionally use raw SQL to take advantage of the features that your database brings to bear. You chose it for a reason, so use it.

Now go make big, interesting things.

1 Comment

Filed under programming, ruby, SQL

SQL Is Your Friend, Part 3

Yo dawg, I heard you like queries. So I put a query in your query so you can query while you query.

Now we are starting to get into some really cool and powerful SQL. Subqueries! Let’s jump right in, shall we?


Subqueries For Change


I want to count up all the times something changed across some set of data. You could pull all the data down and iterate over it in your code, and maybe that runs just fine. You can also let the database do the heavy lifting and pull down just the results that you need.

Using the LEAD windowing function we can look forward one record to see what happens next, and include that in the “current” record. We saw that in yesterday’s post. That’s not quite what we need, though. Fortunately we can query against that query, or use that as a subquery for the query we actually want to run.

SELECT *
FROM (
  SELECT id,
         major,
         minor,
         LEAD(minor) OVER (ORDER BY reading_timestamp) AS next_minor
  FROM readings
) AS amazing_subquery
WHERE minor != next_minor;

Right, so what we have there is a nested subquery. The outer query runs against the output of the inner query. If you were to run an EXPLAIN plan against this you would see how many rows the inner query returns and at what cost. I do tend to run the query plans against any of these and then keep an eye on them as the data scales up.
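If you want to see that plan for yourself, prefix the query with EXPLAIN, or EXPLAIN ANALYZE to actually execute it and get real row counts and timings:

```sql
-- EXPLAIN shows the estimated plan; adding ANALYZE runs the
-- query and reports actual times and row counts per node
EXPLAIN ANALYZE
SELECT *
FROM (
  SELECT id,
         major,
         minor,
         LEAD(minor) OVER (ORDER BY reading_timestamp) AS next_minor
  FROM readings
) AS amazing_subquery
WHERE minor != next_minor;
```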

See the nested subquery for yourself in this SQL Fiddle.


Subqueries For A Field


We can also pull in a specific data point with a subquery and stuff it into a row. Let’s say I just want to know the last reading ID for an event.

SELECT id,
       phone,
       major,
       minor,
       reading_timestamp,
       ( SELECT MAX(id)
         FROM readings r2
         WHERE r2.phone = r1.phone AND r2.id > r1.id
       ) AS last_id
FROM readings r1
ORDER BY 2, 5;

Sorcery! You can see that I refer to the main query in the subquery. That’s pretty handy, but you don’t have to. You could just ask Postgres to give you the max value from this or some other table.

The reading ID may not be the most useful thing, so maybe instead you pull a timestamp. If you did that then you could do math and calculate the duration, which Postgres can also do for you. You can also wrap that up in another outer query. That’s when I start looking to DRY up the base query into a function.
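As a sketch of that idea, pulling the next reading’s timestamp instead of its ID lets Postgres subtract the timestamps and hand you an INTERVAL directly (assuming, as in the earlier examples, that IDs increase with time):

```sql
-- Hedged sketch: fetch the next reading's timestamp per phone and
-- let Postgres compute the gap as an INTERVAL
SELECT id,
       phone,
       reading_timestamp,
       ( SELECT MIN(r2.reading_timestamp)
         FROM readings r2
         WHERE r2.phone = r1.phone AND r2.id > r1.id
       ) - reading_timestamp AS gap_to_next_reading
FROM readings r1
ORDER BY phone, reading_timestamp;
```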

You can play with this some more in this SQL Fiddle.


Common Table Expression (CTE)


You can see where all this subquerying can maybe get out of hand. I’ll confess that I have stopped myself when I’ve nested a few queries deep and asked if maybe there is a better way. I try to avoid Russian Dolls when I program. You can also JOIN on a subquery, which is powerful when you need it and also pretty ugly.

Enter the Common Table Expression, or CTE. This is also known as a WITH query. Let’s take a look at that first query and see how it would look as a CTE.

WITH amazing_subquery AS (
  SELECT id,
         major,
         minor,
         LEAD(minor) OVER (ORDER BY reading_timestamp) AS next_minor
  FROM readings
)
SELECT *
FROM amazing_subquery
WHERE minor != next_minor;

Here is the SQL Fiddle. This is a pretty simplified example, so maybe that doesn’t so much knock your socks off.

Let’s take this up to 11, shall we?

WITH interval_query AS (
  -- Start by making a series of 12 time intervals
  SELECT (ts ||' hour')::INTERVAL AS hour_interval
  FROM generate_series(0,11) AS ts
), time_series AS (
  -- Then take that interval and convert it into a TIMESTAMP
  SELECT date_trunc('hour', now()) + INTERVAL '60 min' * ROUND(date_part('minute', now()) / 60.0) - interval_query.hour_interval AS start_time
  FROM interval_query
), time_intervals AS (
  -- Then take that TIMESTAMP and add an hour to it to make an hour timebox
  SELECT start_time, start_time + '1 hour'::INTERVAL AS end_time
  FROM time_series
  ORDER BY start_time
)
-- RIGHT JOIN on those times s.t. we get all the times even when data is missing
SELECT f.start_time, f.end_time, r1.minor, count(DISTINCT r1.phone) readings
FROM readings r1
RIGHT JOIN time_intervals f
  ON r1.reading_timestamp >= f.start_time AND r1.reading_timestamp < f.end_time
GROUP BY f.start_time, f.end_time, r1.minor
ORDER BY f.start_time, r1.minor;

BOOM. Time series data with counts. Here is that SQL Fiddle.




Now it is entirely possible that there are better ways to do this. I am no DBA. I’m just a simple caveman programmer who likes to take data out for a spin and see what it can do for me.

When I had projects in Sybase and SQL Server I would build these complex queries up over a series of temp tables. You can do that with Postgres, too. I just tend to prefer CTE and subqueries. However, there is another technique that Postgres brings to the table that I do use from time to time. Tomorrow.
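For comparison, here is a sketch of the temp-table style using the first subquery example from above. The temp table only lives for the session, and you can index it just like a real table:

```sql
-- Temp-table version of the earlier nested subquery; the table
-- disappears when the session ends
CREATE TEMP TABLE readings_with_next AS
SELECT id,
       major,
       minor,
       LEAD(minor) OVER (ORDER BY reading_timestamp) AS next_minor
FROM readings;

SELECT *
FROM readings_with_next
WHERE minor != next_minor;
```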

1 Comment

Filed under programming, ruby, SQL

SQL Is Your Friend, Part 2

Yesterday I began to make the case that it’s OK to skirt your ORM and write some raw SQL every once in a while.

I’ll be honest with you. I have no idea how well the ActiveRecord postgres adapter handles the JSON datatype. I mentioned in that post that I always store entire JSON payloads. I also never select the entire column. I always pick what I want out, like I showed in the post.

Today, we have a brand spanking new version of Postgres! Version 9.4 shipped today, and it has an update that I had already planned to talk about. But we aren’t quite ready to talk about it yet.

I’m working through this blog series the same way I work through a refactoring process:

  • I have some data, and I want to query it.
  • Huh. That’s a giant pile of SQL. How can I clean that up?
  • Maybe there is a more performant way to access this gnarly data?

So today we will be in the first 2 bullets.


Windowing Functions


You have a lot of data. You can take that data and fold it on itself to see how one record is related to or changes from another.

Let’s say you want to look and see when someone has moved from one place to another. You can take a row and fold in data from the next row to make that comparison by using the LEAD windowing function. Like so:

SELECT major,
       minor,
       LEAD(minor) OVER (ORDER BY reading_timestamp) AS next_minor
FROM readings
ORDER BY reading_timestamp;

See the query in action with this SQL Fiddle

So what’s happening here?

We’ve asked for the next record — the LEAD. We use the OVER keyword to tell Postgres how to construct the window. This is how we fold the data. So I’ve said that I want the data sorted by the reading timestamps, and given that same sort I want to pull in the value for the next record’s minor column.


Repetition Is Lame


Now imagine that query is the basis for a couple of different queries. You can store that query in the database in a function. You saw how to create a function in the previous post. This is something that I actually do every once in a while.

I will, at this point, take a step back and say that I am not advocating that you stuff all of your business logic in the database. I have fought in that stored procedure holy war on the side of keeping business logic in the application that really needs to know about it.

I see this in a different light, though. This is a code optimization. I’m in control of the code and the logic, and I’ve decided where it makes the most sense for what code to run. Yes, the SQL in the function is compiled and optimized and whatever in Postgres, but I don’t honestly know how important that is. This is just me DRYing up redundant queries and making joins less hairy.

CREATE OR REPLACE FUNCTION readings_with_lead()
RETURNS TABLE (
  id INT, phone TEXT, major INT, minor INT,
  reading_timestamp TIMESTAMP,
  prev_beacon_minor INT, next_beacon_minor INT
)
LANGUAGE sql
AS $$
SELECT
  id,
  phone,
  major,
  minor,
  reading_timestamp,
  lag(minor) OVER (ORDER BY phone, reading_timestamp) AS prev_beacon_minor,
  lead(minor) OVER (ORDER BY phone, reading_timestamp) AS next_beacon_minor
FROM readings
ORDER BY phone, reading_timestamp;
$$;

This gives us a nice way to call on an enriched version of our data. It quacks like a table, too, so you can join on it and do anything else you would do with a normal table.

SELECT *
FROM readings_with_lead()
WHERE next_beacon_minor IS NOT NULL;

See the function in action with this SQL Fiddle


Leveling Up Your Windowing


Now I want to take the data and start to piece together sequences of events. What I really want to do is put together specific events in sequence. And I can totally do that with a windowing function:

SELECT phone,
       major,
       minor,
       reading_timestamp,
       FIRST_VALUE(reading_timestamp) OVER (PARTITION BY phone, major ORDER BY phone, reading_timestamp)
FROM readings
ORDER BY 1, 4;

I’ve asked Postgres to give me the first reading_timestamp value across every row in a window. That window is defined as a grouping of like phone numbers and major values, and those partitions are also sorted.

See it in action with this SQL Fiddle

Once you have this information you can start looking at all sorts of interesting things. How long was an event? What happened in that event?
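For example, subtracting that first value from each row’s timestamp tells you how far into its event each reading falls, and the maximum of that per window is the event’s duration. A hedged sketch against the same table:

```sql
-- How far into its event (phone + major grouping) each reading falls
SELECT phone,
       major,
       minor,
       reading_timestamp,
       reading_timestamp - FIRST_VALUE(reading_timestamp)
         OVER (PARTITION BY phone, major ORDER BY reading_timestamp) AS elapsed
FROM readings
ORDER BY 1, 4;
```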

We will look at some more techniques that make this even more interesting soon. Next up: subqueries.

8 Comments

Filed under programming, ruby, SQL

SQL Is Your Friend, Part 1

This is the kickoff of a series of posts on using Postgres and SQL to do some interesting things that you may not have known were possible. If you typically use an ORM, like ActiveRecord, you probably do not write a lot of raw SQL. I like Rails, and ActiveRecord has its place. This is not a rant where I shake my fist in the air and tell ActiveRecord to get off my lawn.

Instead I want to highlight a handful of techniques that I find really useful, and also perhaps suggest that it’s OK to venture off the golden path and do a little raw SQL where it makes sense. The fact of the matter is that you cannot simply switch databases and have the ORM just handle it for you. At least not once your application starts to get interesting. So use the right tool for the job, and I hope to highlight a few tools that Postgres brings to the table.

TL;DR: SQL is your friend.


JSON Datatype


You may already be familiar with the HSTORE datatype. It’s nice, but it does not handle nesting. In Postgres 9.2 we got something even awesomer: the JSON datatype. It’s pretty spectacular, and I now use it to store entire raw payloads any time I consume a resource. You can also use this datatype in ActiveRecord migrations without requiring any additional gems.

What makes the JSON datatype especially nice is that you can reach into the data and pull out specific elements. You can also create custom functions and even operators to do even more powerful things with the data in the JSON column.

Here is an example that takes a column called “beacons” which contains a list of bluetooth beacons that a phone has ranged. We want to look into that list and pull out the minor number from the closest beacon, and the accuracy from the next closest beacon. We also cast values from the JSON datatype to a more meaningful datatype. Yes, the cast to TEXT and then to the actual datatype is a little awkward, but it works.

SELECT beacons->0,
  json_array_length(beacons),
  json_extract_path(beacons->0, 'minor')::TEXT::INT AS minor2,
  json_extract_path(beacons->1, 'accuracy')::TEXT::FLOAT AS accuracy2
FROM readings;

See this code in action with this SQL Fiddle


Custom Function and Operator


There are a lot of good functions in Postgres to handle the JSON datatype, but let’s say that I want to do something that’s not built in. I want to look at ALL of the values for a given key.

Take a look at this discussion. You can create a function in Postgres, and even create a new operator for that function:

CREATE OR REPLACE FUNCTION json_pluck(j json, field TEXT)
RETURNS TEXT[]
LANGUAGE sql
AS
$$ SELECT array_agg(j->>$2) FROM json_array_elements($1) j $$;

DROP OPERATOR IF EXISTS | (json, text);
CREATE OPERATOR | (procedure = json_pluck, leftarg = 'json', rightarg = 'text');

Wait. What even is that?! You create a function called json_pluck. Then you create a custom operator | that lets you invoke json_pluck with the pipe character, telling it which key’s values to return.

Like so:

SELECT beacons->0,
  json_array_length(beacons),
  json_extract_path(beacons->0, 'minor')::TEXT::INT AS minor2,
  json_extract_path(beacons->1, 'accuracy')::TEXT::FLOAT AS accuracy2,
  beacons | 'minor' AS minors
FROM readings;

You’ll note that I use CREATE OR REPLACE wherever possible. You can’t do that with an operator, though, so I preface that with a guard to drop it if it already exists, and then create it.

See this code in action with this SQL Fiddle

Yeah. That’s pretty cool, no? You can drop all of those into ActiveRecord migrations, too:

class CreateFunctionJsonPluck < ActiveRecord::Migration
  def up
    sql = <<-SQL
      CREATE OR REPLACE FUNCTION json_pluck(j json, field TEXT)
      RETURNS TEXT[]
      LANGUAGE sql
      AS
      $$ SELECT array_agg(j->>$2) FROM json_array_elements($1) j $$;
    SQL
    ActiveRecord::Base.connection.execute(sql)

    sql = <<-SQL
      DROP OPERATOR IF EXISTS | (json, text);
      CREATE OPERATOR | (procedure = json_pluck, leftarg = 'json', rightarg = 'text');
    SQL
    ActiveRecord::Base.connection.execute(sql)
  end

  def down
    sql = "DROP OPERATOR IF EXISTS | (json, text);"
    ActiveRecord::Base.connection.execute(sql)
    sql = "DROP FUNCTION IF EXISTS json_pluck(j json, field TEXT);"
    ActiveRecord::Base.connection.execute(sql)
  end
end

So that’s the JSON datatype. In this series I plan to discuss window functions, subqueries, WITH queries (Common Table Expressions), and materialized views.

8 Comments

Filed under programming, ruby, SQL