MAX Partitioning with MIN Pain in PostgreSQL

One of the things the PostgreSQL community is rightly proud of is the quality of the database’s query optimizer.  People like to squabble about whether hints or necessary, the user interface to query tuning is a bit strange, and there are some things that the database just doesn’t know how to do yet that limit its ability to compete against commercial offerings with some types queries.  The lack of index-only scans is a good example.  But the fact that a fully open-source optimizer has managed to get in the right neighborhood, where for many applications it does a perfectly good job of optimizing even complicated queries, is pretty remarkable.  For most of computing history, getting a database with a optimizer as good as the free one in PostgreSQL cost more than a little money.

One area where there are still some serious problems relative to more popular commercial products is partitioning.  Table partitioning is an essential component for building large databases, both due to how indexes work and to how some maintenance tasks must happen.  There is support for partitioning in PostgreSQL, but it’s harder to setup and use than everyone would like it to be.  There are plans around improving that, but they haven’t converged to anything useful yet.

And the optimizer hasn’t traditionally been too smart about what it can do with partitioned queries.  The technique it implements is referred to as “constraint exclusion”.  The idea is that if you specifically compare a column against a fixed value, and there are constraints on the sub-partitions, the optimizer knows how to prove that some partitions cannot contain that value.  They are then removed from being considered by the query.  So if you set your partitions up right, and you do your comparison against a value such that optimizer can prove a partition is excluded, it will be.  That’s it though, and there are some common situations where the limitations of that are not obvious.

One of the nastiest of them is fixed in PostgreSQL 9.1, but lately I’ve been seeing a whole lot more people run into this in earlier versions and get blind-sided by it.  The issue is with how aggregates like MIN() and MAX() are implemented.  When you use those against a normal table that has an index on it, you expect that the index will be used as a way to get a sorted copy of the data, and then the first/last row will be popped off of there.

Here’s the problem:  that doesn’t map directly into a constraint exclusion context.  I first ran into this a few years ago, and my co-worker at the time Alan Li submitted a test case and patch to work around the problem.  Here’s what a test case based on that looks like:

CREATE TABLE partition_test (a int, b timestamp);
CREATE INDEX partition_test_idx on partition_test(b);
CREATE TABLE partition_test_2007_01_01 (CONSTRAINT partition_test_2007_01_01_b_check CHECK (((b >= ’2007-01-01′::date) AND (b < ’2007-01-02′::date)))) INHERITS (partition_test);
CREATE INDEX partition_test_2007_01_01_idx ON partition_test_2007_01_01 USING btree (b);
CREATE TABLE partition_test_2007_01_02 (CONSTRAINT partition_test_2007_01_02_b_check CHECK (((b >= ’2007-01-02′::date) AND (b <’2007-01-03′::date)))) INHERITS (partition_test);
CREATE INDEX partition_test_2007_01_02_idx ON partition_test_2007_01_02 USING btree (b);
CREATE TABLE partition_test_2007_01_03 (CONSTRAINT
partition_test_2007_01_03_b_check CHECK (((b >= ’2007-01-03′::date) AND (b < ’2007-01-04′::date)))) INHERITS (partition_test);
CREATE INDEX partition_test_2007_01_03_idx ON partition_test_2007_01_03 USING btree (b);
EXPLAIN SELECT max(b) from partition_test;

The query plan you get out of this in PostgreSQL 9.0 and earlier is not so nice:

 Aggregate  (cost=137.00..137.01 rows=1 width=8)
   ->  Append  (cost=0.00..117.60 rows=7760 width=8)
         ->  Seq Scan on partition_test  (cost=0.00..29.40 rows=1940 width=8)
         ->  Seq Scan on partition_test_2007_01_01 partition_test  (cost=0.00..29.40 rows=1940 width=8)
         ->  Seq Scan on partition_test_2007_01_02 partition_test  (cost=0.00..29.40 rows=1940 width=8)
         ->  Seq Scan on partition_test_2007_01_03 partition_test  (cost=0.00..29.40 rows=1940 width=8)

The optimizer neither excludes any partitions, nor does it use an index even when scanning each of them!  This means that a MIN() or MAX() against a giant partitioned table is guaranteed to scan every record in the table.  And if you went to the trouble of partitioning, that’s probably a ton of data.  This sort of thing is very popular when people want to write queries like “when was the last time data was inserted into this table?”  You might put SELECT now() – max(ts) FROM t into a monitoring system, then trigger an alarm if it gets too high suggesting there’s a problem.  But you just can’t do it on a large partitioned table here.  There are plenty of other use cases that need to pull the first or last value from a partitioned table too.

Unfortunately, Alan’s fix for this was a bit too specialized to really solve all the problems like this that are around, so the problem has been sitting around for a while unresolved.  It was finally fixed after a whole development arc ending with work from Tom Lane during the now beta PostgreSQL 9.1; the commit talks a bit about why the final approach used is the right one to have settled on.  With 9.1, you get the query plan you probably expected in the first place:

 Result  (cost=0.10..0.11 rows=1 width=0)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.04..0.10 rows=1 width=8)
           ->  Merge Append  (cost=0.04..485.84 rows=7720 width=8)
                 Sort Key: public.partition_test.b
                 ->  Index Scan Backward using partition_test_idx on partition_test  (cost=0.00..78.03 rows=1930 width=8)
                       Index Cond: (b IS NOT NULL)
                 ->  Index Scan Backward using partition_test_2007_01_01_idx on partition_test_2007_01_01 partition_test  (cost=0.00..78.03 rows=1930 width=8)
                       Index Cond: (b IS NOT NULL)
                 ->  Index Scan Backward using partition_test_2007_01_02_idx on partition_test_2007_01_02 partition_test  (cost=0.00..78.03 rows=1930 width=8)
                       Index Cond: (b IS NOT NULL)
                 ->  Index Scan Backward using partition_test_2007_01_03_idx on partition_test_2007_01_03 partition_test  (cost=0.00..78.03 rows=1930 width=8)
bsp;                    Index Cond: (b IS NOT NULL)

Where the index is used to get the individual MIN/MAX values, then they are combined and the true first or last value returned.

So what do you do if you need to deploy PostgreSQL with partitioning, and you need MIN/MAX to work?  Well, if you deployed a new version of PostgreSQL while it was still in beta, you wouldn’t be the first.  At this point, you should be able to do that now, and then upgrade to the final binaries once released without changing the database itself.  A major problem would have to occur before binary compatibility from beta to release would be broken at this late stage, it’s avoided if at all possible.

The other possibility is to follow what I was talking about here last time, and create a backport of just this feature to your own custom PostgreSQL 9.0.  It’s possible to build “9.0 plus ‘Reimplement planner’s handling of MIN/MAX aggregate optimization’”, and off you go.  In theory at least.  I mentioned this was the end of a development arc, it’s not a single commit that adds the capability.  This feature builds on top of the addition of MergeAppend plans, which is another major 9.1 change.  Turns out there are at least six commits that touched this area of the code during 9.1 you need to grapple with, either taking their changes or resolving the merge conflicts in later ones.  It’s not really a feature backport for the squeamish.

But it’s possible.  I’ve done it, and so has at least one other person for different purposes.  When my customer ran into this issue, it was months into a PostgreSQL 9.0 migration with partitioning, and the use of MAX() was not optional in their application.  Without the source code and the ability to backport a fix, this migration would have failed; with it and some sweat, it succeeded.  The PostgreSQL project itself doesn’t do backports for new features, only bugs.  But sometimes a feature backport is the least risky way to improve something in an otherwise stable product, one that prefers to avoid change whenever possible.  Kernel backporting has been a major component to the success of RedHat Linux in corporate environments.  A good PostgreSQL support company should be able to do the same for you, and there are several places you can purchases that quality level of work from.

Once a business has seen this sort of thing in action, I’ve found it makes them a whole lot more comfortable with using open-source software.  When you’re having a problem, being able to have some control over the process the whole way helps.  And it can be hard to get that out of the large proprietary database vendors.  For now I’m maintaining a backport, my customer is happy that they got a quick fix here.  And everyone is happy that once 9.1 is released, this long standing issue with partitioning and these extremely common aggregates is resolved in completely standard PostgreSQL.

This Post Has 0 Comments

Leave A Reply