Tuesday, October 24

PG Phriday: Uptown Func

One of the coolest things about Postgres functions is that they can return rows as if they were a table. Not only is it possible, but creating a set-returning function in Postgres is almost frighteningly trivial when compared to other database engines. In fact, some Oracle users are probably rolling their eyes and muttering “Just use a cursor!” already. Just hear us out!

Postgres functions can return cursors too, but using them afterwards isn’t exactly a friendly experience. Cursors are passed by reference, so the function must either accept a parameter to name the cursor, or it generates something wacky like “<unnamed portal 1>”. After that, cursors can only be used with FETCH instead of SELECT, greatly limiting their utility.

What about views, then? Obviously they can return rows like a table, so why not just write everything as a query and turn it into a view? It’s true that practically any function can be emulated with a sufficiently advanced query. This is especially true since Postgres 9 added CTEs, allowing it to generate step-by-step transformations. Some use cases aren’t well covered by that approach, however.

Stop, Wait a Minute

Consider the Fibonacci Sequence. Here’s a function that iteratively generates the whole sequence up to the Nth value:

CREATE OR REPLACE FUNCTION fibonacci(nLoops INT)
RETURNS SETOF NUMERIC AS
$$
DECLARE
  i INT;
  x NUMERIC := 0;
  y NUMERIC := 1;
  z NUMERIC := 1;
BEGIN
  FOR i IN 1..nLoops LOOP
    RETURN NEXT x;
    x = y;
    y = z;
    z = x + y;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
 
SELECT MAX(fibonacci) FROM fibonacci(100);
 
          MAX          
-----------------------
 218922995834555169026

Amusingly enough, this particular function works until the NUMERIC datatype itself is exhausted, provided sufficient hardware. Our test VM was able to survive until the 400,000th result before subsequent attempts exhausted our meager 8GB of temporary disk space. We’d print it here, but sadly a number with 83,595 digits won’t fit in the article.

Take a Sip, Sign a Check

Now consider if we had implemented this as a CTE instead. That by itself isn’t difficult, we just need to rewrite things a bit:

WITH RECURSIVE fib(i, a, b) AS (
  SELECT 1, 0::NUMERIC, 1::NUMERIC
  UNION ALL
  SELECT i + 1, b, a + b
    FROM fib
   WHERE i <= 100
)
SELECT MAX(a)
  FROM fib;

But how can this be turned into a general-purpose view? We would have to raise the limit of the recursive portion of the CTE to some comfortable (or uncomfortable) upper bound so the caller could limit the results with either a LIMIT or WHERE clause.

This has two problems. The first is that the maximum is arbitrary and may not correspond with desired use patterns. The second is that CTEs produce all results before filters are applied, meaning a lot of wasted effort. CTEs don’t really work well in views since the inner components can’t be tweaked for specific conditions, yet without them, SQL is stateless. There’s also the matter of exception handling. So any time we need a state machine that returns rows, functions are really our only answer.

Don’t Brag About it, Come Show Me

The primary limitation to Postgres functions is that the return type must exist before the function is declared. There are currently three ways to do this. Let’s go back to our trusty sensor_log table:

CREATE TABLE sensor_log (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       VARCHAR NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);
 
INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, round(random() * 100),
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10)::TEXT || 's')::INTERVAL
  FROM generate_series(1, 10000000) s(id);
 
CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

As usual, the table consists of 10-million rows, spread across 1000 locations with 100 possible reading values. In Postgres, every table also exists as a type which represents the table structure. We can see this by executing a completely non-recommended bit of SQL:

DROP TYPE sensor_log;
 
ERROR:  cannot DROP TYPE sensor_log because TABLE sensor_log requires it
HINT:  You can DROP TABLE sensor_log instead.

Given the type already exists, we can write a function which returns rows that masquerade as the sensor_log table. Here’s a very basic function that merely grabs the top ten readings for a particular location:

CREATE OR REPLACE FUNCTION sensor_location_top_ten(nLocation VARCHAR)
RETURNS SETOF sensor_log AS
$$
DECLARE
  rReturn sensor_log;
BEGIN
  FOR rReturn IN
    SELECT *
      FROM sensor_log
     WHERE location = nLocation
     ORDER BY reading DESC
     LIMIT 10
  LOOP
    RETURN NEXT rReturn;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
 
SELECT * FROM sensor_location_top_ten('15');
 
 sensor_log_id | location | reading |    reading_date     
---------------+----------+---------+---------------------
       1068015 | 15       |     100 | 2017-02-22 09:17:30
       1255015 | 15       |     100 | 2017-01-31 17:50:50
        497015 | 15       |     100 | 2017-04-29 11:24:10
        894015 | 15       |     100 | 2017-03-14 12:37:30
        248015 | 15       |     100 | 2017-05-28 07:04:10
       1092015 | 15       |     100 | 2017-02-19 14:37:30
        183015 | 15       |     100 | 2017-06-04 19:37:30
         35015 | 15       |     100 | 2017-06-21 22:44:10
       2245015 | 15       |     100 | 2016-10-09 03:50:50
       1767015 | 15       |     100 | 2016-12-03 11:37:30
(10 ROWS)

Break it Down

But what if we want to permanently remove the location column from the results? We can’t do that because there’s no type with that definition. Of course, we can make one and then use it instead:

CREATE TYPE sensor_log_noloc AS (
  sensor_log_id  INT,
  reading        BIGINT,
  reading_date   TIMESTAMP
);
 
CREATE OR REPLACE FUNCTION sensor_location_top_ten(nLocation VARCHAR)
RETURNS SETOF sensor_log_noloc AS
$$
DECLARE
  rReturn sensor_log_noloc;
BEGIN
  FOR rReturn IN
    SELECT sensor_log_id, reading, reading_date
      FROM sensor_log
     WHERE location = nLocation
     ORDER BY reading DESC
     LIMIT 10
  LOOP
    RETURN NEXT rReturn;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

Julio, Get the Stretch

Alternatively, Postgres functions can define specific OUT variables. Declared in order, we can use them as a virtual row-set. The major caveat here is that parameter names must not conflict with table column names due to ambiguity concerns. This means we should either rename our parameters, or be particularly meticulous when using table column aliases.

Here’s how the function would look if we took the latter approach:

CREATE OR REPLACE FUNCTION sensor_location_top_ten(
  nLocation          VARCHAR,
  sensor_log_id  OUT INT,
  reading        OUT BIGINT,
  reading_date   OUT TIMESTAMP
)
RETURNS SETOF RECORD AS
$$
BEGIN
  FOR sensor_log_id, reading, reading_date IN
    SELECT s.sensor_log_id, s.reading, s.reading_date
      FROM sensor_log s
     WHERE s.location = nLocation
     ORDER BY s.reading DESC
     LIMIT 10
  LOOP
    RETURN NEXT;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

It’s a bit more work up front, but it removes the necessity of creating a corresponding type for every function that returns row sets.

Saturday Night, and We in the Spot

Of course, saving the best for last is a pretty classic move, and we’re suckers for tradition. Beginning in Postgres 8.4, functions gained one final syntax that did two things that effectively deprecated the previous two approaches. First, we no longer have to declare the type before the function. Second, we can avoid polluting our argument list with return values.

Here’s how it looks:

CREATE OR REPLACE FUNCTION sensor_location_top_ten(nLocation VARCHAR)
RETURNS TABLE(
  sensor_log_id INT,
  reading BIGINT,
  reading_date TIMESTAMP
) AS
$$
BEGIN
  FOR sensor_log_id, reading, reading_date IN
    SELECT s.sensor_log_id, s.reading, s.reading_date
      FROM sensor_log s
     WHERE s.location = nLocation
     ORDER BY s.reading DESC
     LIMIT 10
  LOOP
    RETURN NEXT;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

What’s interesting about this approach is that it’s otherwise identical to using OUT parameters. The principal differences are the previously mentioned reduction in parameter clutter, and that a table-returning function implicitly returns a set of records. In essence, it’s basically just OUT-parameter shorthand someone hacked into the Postgres engine.

Don’t Believe Me, Just Watch

Postgres functions have a plethora of such conveniences, and like most maturing platforms, it accumulates more with every passing generation. Since Postgres 10 is shaping up to be a rather comprehensive amalgam of varied enhancements, it’s exciting to see how functions might evolve. Even if they remain static until some mysterious future version, there’s still a lot more for us to explore.

And we most definitely will.

5 Comments

  • The stuff you’re doing with PL/pgsql really doesn’t require that language. SQL will do, as in:

    CREATE OR REPLACE FUNCTION sensor_location_top_ten(nLocation VARCHAR)
    RETURNS TABLE(
    sensor_log_id INT,
    reading BIGINT,
    reading_date TIMESTAMP
    )
    LANGUAGE sql
    AS $$
    SELECT s.sensor_log_id, s.reading, s.reading_date
    FROM sensor_log s
    WHERE s.location = nLocation
    ORDER BY s.reading DESC
    LIMIT 10
    $$;

    • Shaun Thomas

      Great example! Using SQL is a good way to simplify in some cases.

      I still prefer to use PLPGSQL in most cases. Keep in mind that these examples are general purpose to illustrate the concept. For every function that is compatible with straight SQL, there is another that needs an exception block, or contains multiple phases that don’t really map well to a single query.

    • Shaun Thomas

      Thanks for the catch, Nikolay! I swear I read that particular sentence ten times and somehow my eyes skipped right over that. Doh!

Leave a Reply to Shaun Thomas Cancel reply

Your email address will not be published. Required fields are marked *