Wednesday, October 24

Partitioning Improvements in PostgreSQL 11

A partitioning system in PostgreSQL was first added in PostgreSQL 8.1 by 2ndQuadrant founder Simon Riggs. It was based on relation inheritance and used a novel technique, called “constraint exclusion”, to skip scanning tables that could not contain rows matching a query. While it was a huge step forward at the time, it is nowadays seen as cumbersome to use as well as slow, and thus in need of replacement.

In version 10, it was replaced, thanks to heroic efforts by Amit Langote, with modern-style “declarative partitioning”. With this new approach, you no longer need to write code manually to route tuples to their correct partitions, nor manually declare correct constraints for each partition: the system does those things automatically for you.

Sadly, in PostgreSQL 10 that’s pretty much all it did. Because of the sheer complexity and the time constraints, many things were lacking in the PostgreSQL 10 implementation. Robert Haas gave a talk about it at PGConf.EU in Warsaw.

Many people worked on improving the situation for PostgreSQL 11; here’s my attempt at a recap. I split these into three areas:

  1. New partitioning features
  2. Better DDL support for partitioned tables
  3. Performance optimizations

New Partitioning Features

In PostgreSQL 10, partitioned tables could only use the RANGE and LIST partitioning methods. These are powerful tools on which to base many real-world databases, but for many other designs you need the new method added in PostgreSQL 11: HASH partitioning. Many customers need this, and Amul Sul worked hard to make it possible. Here’s a simple example:

CREATE TABLE clients (
    client_id INTEGER, name TEXT
) PARTITION BY HASH (client_id);
 
CREATE TABLE clients_0 PARTITION OF clients
    FOR VALUES WITH (MODULUS 3, REMAINDER 0);
CREATE TABLE clients_1 PARTITION OF clients
    FOR VALUES WITH (MODULUS 3, REMAINDER 1);
CREATE TABLE clients_2 PARTITION OF clients
    FOR VALUES WITH (MODULUS 3, REMAINDER 2);

It is not mandatory to use the same modulus value for all partitions; this lets you create more partitions later and redistribute the rows one partition at a time, if necessary.
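For instance, one way to split a single hash partition in two, without touching the others, is to replace it with two partitions of twice the modulus. A rough sketch (the new partition names are made up, and you may want a gentler locking strategy in production):

```sql
-- Hash values matching (MODULUS 3, REMAINDER 2) are exactly those
-- matching (MODULUS 6, REMAINDER 2) or (MODULUS 6, REMAINDER 5).
BEGIN;
ALTER TABLE clients DETACH PARTITION clients_2;
CREATE TABLE clients_2a PARTITION OF clients
    FOR VALUES WITH (MODULUS 6, REMAINDER 2);
CREATE TABLE clients_2b PARTITION OF clients
    FOR VALUES WITH (MODULUS 6, REMAINDER 5);
INSERT INTO clients SELECT * FROM clients_2;  -- rows are re-routed
DROP TABLE clients_2;
COMMIT;
```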

Another very useful feature, written by Amit Khandekar, is the ability for UPDATE to move rows from one partition to another — that is, if there’s a change in the values of the partitioning column, the row is automatically moved to the correct partition. Previously, that operation would have thrown an error.
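With the clients table from above, this now just works:

```sql
-- In PostgreSQL 10 this would fail whenever the new client_id hashed
-- to a different partition; in PostgreSQL 11 the row is deleted from
-- the old partition and inserted into the correct one automatically.
UPDATE clients SET client_id = 42 WHERE client_id = 1;
```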

Another new feature, written by Amit Langote and yours truly, is that INSERT ON CONFLICT DO UPDATE can be applied to partitioned tables. Previously this command would fail if it targeted a partitioned table. You could make it work by knowing exactly which partition the row would end up in and targeting that partition directly, but that’s not very convenient. I won’t go over the details of that command, but if you’ve ever wished you had UPSERT in Postgres, this is it. One caveat is that the UPDATE action may not move the row to another partition.
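A quick sketch, assuming clients has a unique index on its partition key, client_id:

```sql
-- Insert a new client, or update the name if the client already exists.
INSERT INTO clients (client_id, name)
VALUES (1, 'Alice')
ON CONFLICT (client_id) DO UPDATE SET name = EXCLUDED.name;
```

Note that the UPDATE action here does not change client_id, so the row stays in its partition, satisfying the caveat above.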

Finally, another cute new feature in PostgreSQL 11, this time by Jeevan Ladhe, Beena Emerson, Ashutosh Bapat, Rahila Syed, and Robert Haas, is support for a default partition in a partitioned table, that is, a partition which receives all rows that don’t fit in any of the regular partitions. However, while nice on paper, this feature is not very convenient in production settings, because some operations require heavier locking when a default partition exists. Example: creating a new partition requires scanning the default partition in order to determine that no existing rows fall within the new partition’s boundaries. Maybe in the future these lock requirements will be lowered, but in the meantime my suggestion is not to use it.
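For completeness, here’s what the syntax looks like, on a hypothetical range-partitioned table (hash-partitioned tables cannot have a default partition, since every hash value maps to some regular partition):

```sql
CREATE TABLE measurement (
    logdate date, value numeric
) PARTITION BY RANGE (logdate);

CREATE TABLE measurement_2018 PARTITION OF measurement
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

-- Receives every row whose logdate falls outside all other partitions:
CREATE TABLE measurement_default PARTITION OF measurement DEFAULT;
```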

Better DDL support

In PostgreSQL 10, certain DDL would refuse to work when applied to a partitioned table, requiring you to process each partition individually. In PostgreSQL 11 we have fixed a few of these limitations, as previously announced by Simon Riggs. First, you can now use CREATE INDEX on a partitioned table, a feature written by yours truly. This one can be seen as just a matter of reducing tedium: instead of repeating the command for each partition (and making sure never to forget it for each new partition), you can run it only once on the parent partitioned table, and it automatically applies to all partitions, existing and future.

One cool thing to keep in mind is the matching of existing indexes in partitions. As you know, creating an index blocks writes to the table, so the less time it takes, the better. I wrote this feature so that existing indexes in each partition are compared to the index being created; where one matches, it is not necessary to scan the partition to build a new index: the existing index is used instead.
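For example, with the clients table from above (index names made up):

```sql
-- Suppose one partition already has a suitable index:
CREATE INDEX clients_0_name_idx ON clients_0 (name);

-- Creating the index on the parent attaches the matching index on
-- clients_0 without scanning that partition, and builds new indexes
-- only for clients_1 and clients_2:
CREATE INDEX clients_name_idx ON clients (name);
```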

Together with this (also by yours truly), you can now create UNIQUE constraints as well as PRIMARY KEY constraints. Two caveats: first, the partition key must be part of the primary key. This allows the uniqueness checks to be done locally in each partition, avoiding global indexes. Second, it’s not yet possible to have foreign keys that reference these primary keys. I’m working on that for PostgreSQL 12.
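On the clients table, which is hash-partitioned on client_id, that means:

```sql
-- Works: the partition key is part of the primary key.
ALTER TABLE clients ADD PRIMARY KEY (client_id);

-- Would fail: enforcing uniqueness of "name" across partitions
-- would require a global index, which PostgreSQL does not have.
-- ALTER TABLE clients ADD CONSTRAINT clients_name_key UNIQUE (name);
```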

Another thing you can do (thanks to the same person) is create FOR EACH ROW triggers on a partitioned table, and have them apply to all partitions (existing and future). As a side effect, you can have deferred unique constraints on partitioned tables. One caveat: only AFTER triggers are allowed, until we figure out how to deal with BEFORE triggers that move rows to a different partition.
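A minimal sketch (the trigger function is made up for illustration):

```sql
CREATE FUNCTION log_client_insert() RETURNS trigger
    LANGUAGE plpgsql AS $$
BEGIN
    RAISE NOTICE 'inserted client %', NEW.client_id;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$;

-- Declared once on the parent, fires for rows in every partition:
CREATE TRIGGER clients_audit
    AFTER INSERT ON clients
    FOR EACH ROW EXECUTE PROCEDURE log_client_insert();
```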

Lastly, a partitioned table can have FOREIGN KEY constraints. This is very handy to partition large fact tables while avoiding dangling references, which everybody loathes. My colleague Gabriele Bartolini grabbed me by the lapels when he found out I had written and committed this, yelling that this was a game-changer and how could I be so insensitive as not to inform him of it. Me, I just continue to hack the code for fun.
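Note the direction: the partitioned table can point at a regular table; foreign keys pointing at a partitioned table are the part that’s still missing. A sketch with made-up tables:

```sql
CREATE TABLE countries (
    country_code text PRIMARY KEY
);

-- The partitioned fact table itself carries the foreign key:
CREATE TABLE sales (
    sale_id bigint,
    country_code text REFERENCES countries,
    amount numeric
) PARTITION BY HASH (sale_id);
```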

Performance Work

Previously, pre-processing queries to find out which partitions need not be scanned (constraint exclusion) was rather simplistic and slow. This has been improved by admirable teamwork from Amit Langote, David Rowley, Beena Emerson, and Dilip Kumar, who introduced “faster pruning” first and “runtime pruning” on top of it. The result is much more powerful as well as faster (David Rowley already described this in a previous article). After all this effort, partition pruning is applied at three points in the life of a query:

  1. At query plan time,
  2. When the query parameters are received,
  3. At each point where one query node passes values as parameters to another node.

This is a remarkable improvement from the original system which could only be applied at query plan time, and I believe it will please many.

You can see this feature in action by comparing EXPLAIN output for a query with the enable_partition_pruning option turned on and off. As a very simplistic example, compare this plan with pruning disabled:

SET enable_partition_pruning TO off;
EXPLAIN (ANALYZE, COSTS off)
SELECT * FROM clientes
WHERE cliente_id = 1234;
                                QUERY PLAN                                
-------------------------------------------------------------------------
 Append (actual time=6.658..10.549 rows=1 loops=1)
   ->  Seq Scan on clientes_1 (actual time=4.724..4.724 rows=0 loops=1)
         Filter: (cliente_id = 1234)
         Rows Removed by Filter: 24978
   ->  Seq Scan on clientes_00 (actual time=1.914..1.914 rows=0 loops=1)
         Filter: (cliente_id = 1234)
         Rows Removed by Filter: 12644
   ->  Seq Scan on clientes_2 (actual time=0.017..1.021 rows=1 loops=1)
         Filter: (cliente_id = 1234)
         Rows Removed by Filter: 12570
   ->  Seq Scan on clientes_3 (actual time=0.746..0.746 rows=0 loops=1)
         Filter: (cliente_id = 1234)
         Rows Removed by Filter: 12448
   ->  Seq Scan on clientes_01 (actual time=0.648..0.648 rows=0 loops=1)
         Filter: (cliente_id = 1234)
         Rows Removed by Filter: 12482
   ->  Seq Scan on clientes_4 (actual time=0.774..0.774 rows=0 loops=1)
         Filter: (cliente_id = 1234)
         Rows Removed by Filter: 12400
   ->  Seq Scan on clientes_5 (actual time=0.717..0.717 rows=0 loops=1)
         Filter: (cliente_id = 1234)
         Rows Removed by Filter: 12477
 Planning Time: 0.375 ms
 Execution Time: 10.603 ms

with the one produced with pruning enabled:

RESET enable_partition_pruning;
EXPLAIN (ANALYZE, COSTS off)
SELECT * FROM clientes
WHERE cliente_id = 1234;
                                QUERY PLAN                               
----------------------------------------------------------------------
 Append (actual time=0.054..2.787 rows=1 loops=1)
   ->  Seq Scan on clientes_2 (actual time=0.052..2.785 rows=1 loops=1)
         Filter: (cliente_id = 1234)
         Rows Removed by Filter: 12570
 Planning Time: 0.292 ms
 Execution Time: 2.822 ms

I’m sure you’ll find that compelling. You can see a ton of more sophisticated examples by perusing the expected files of the regression tests.

Another item was the introduction of partitionwise joins, by Ashutosh Bapat. The idea here is that if you have two tables partitioned in identical ways, then when they are joined, each partition on one side can be joined to its matching partition on the other side; this is much better than joining each partition on one side to every partition on the other side. The fact that the partition schemes need to match exactly may make this seem unlikely to have much real-world use, but in reality there are many situations where it applies. Example: an orders table and its corresponding orders_items table. Thankfully, there’s already plenty of work on relaxing this restriction.
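To see it, enable the planner option (off by default in PostgreSQL 11) and compare plans. With made-up tables partitioned identically on the join column:

```sql
SET enable_partitionwise_join = on;

CREATE TABLE orders (order_id int) PARTITION BY HASH (order_id);
CREATE TABLE orders_0 PARTITION OF orders
    FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE orders_1 PARTITION OF orders
    FOR VALUES WITH (MODULUS 2, REMAINDER 1);

CREATE TABLE orders_items (order_id int, item text)
    PARTITION BY HASH (order_id);
CREATE TABLE orders_items_0 PARTITION OF orders_items
    FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE orders_items_1 PARTITION OF orders_items
    FOR VALUES WITH (MODULUS 2, REMAINDER 1);

-- The plan becomes an Append of two small per-partition joins
-- (orders_0 with orders_items_0, orders_1 with orders_items_1)
-- instead of one join of the whole tables:
EXPLAIN (COSTS off)
SELECT * FROM orders JOIN orders_items USING (order_id);
```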

The last item I want to mention is partitionwise aggregates, by Jeevan Chalke, Ashutosh Bapat, and Robert Haas. This optimization means that an aggregation that includes the partition keys in the GROUP BY clause can be executed by aggregating each partition’s rows separately, which is much faster.
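Again, this is controlled by an option that is off by default in PostgreSQL 11. With the clients table, which is partitioned on client_id:

```sql
SET enable_partitionwise_aggregate = on;

-- Since client_id is the partition key, each group lives entirely in
-- one partition, so each partition can be aggregated independently:
EXPLAIN (COSTS off)
SELECT client_id, count(*) FROM clients GROUP BY client_id;
```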

Closing Thoughts

After the significant developments in this cycle, PostgreSQL has a much more compelling partitioning story. While there are still many improvements to be made, particularly to improve the performance and concurrency of various operations involving partitioned tables, we’re now at a point where declarative partitioning has become a very valuable tool to serve many use cases.  At 2ndQuadrant we’ll continue to contribute code to improve PostgreSQL in this area and others, like we’ve done for every single release since 8.0.

4 Comments

  • adm

    For the default partition, if I add a check constraint directly onto the table for the default partition, when I add additional partitions I get a message “INFO: updated partition constraint for default partition “measurement_default” is implied by existing constraints”.

    In that scenario, does that avoid scanning the default partition, as it knows that no rows in the default partition can possibly be rows which belong in the new partition ? Or does it still scan the default partition ?

    • alvherre

      Yes: scanning the default partition is not necessary in that case.

So what do you do with the rows that are already in the default partition? You cannot move them out of the way (any queries accessing them would get bogus results, i.e. missing rows); but you cannot leave them there either, because then you wouldn’t be able to add the constraint.

      If you don’t have any, then why do you *have* a default partition in the first place?

  • adm

    Some questions on partitioning impact on Insert performance:

Does routing rows to the correct partition add much performance overhead? For example, if you have 100 partitions, say.

    If I know which partition rows will belong to, would inserting directly into the underlying table for that partition provide any performance gain, by avoiding the need for PostgreSQL to route the rows?

    Would it make any performance difference if rows go to the default partition vs. a specific partition for a date-range?

    • alvherre

      Yes, routing tuples is slower than not routing tuples. However, routing tuples in the server is a lot faster than writing the correct code to route the tuples in your application — particularly when, months later, you want to change the partitioning scheme and you can avoid rewriting tons of application code.

      I don’t know about the performance of the default partition — I would never have a default partition in the first place, since it’s mostly a trap for the unwary. I expect it performs about the same as any other partition, though.
