Friday, November 24

On the usefulness of expression indexes

When teaching PostgreSQL trainings, both on basics and advanced topics, I often find that attendees have very little idea how powerful expression indexes can be (if they are aware of them at all). So let me give you a brief overview.

So, let’s say we have a table with a column of timestamps (yes, the generate_series function can generate timestamps, not just integers):

CREATE TABLE t AS
SELECT d, repeat(md5(d::text), 10) AS padding
  FROM generate_series(timestamp '1900-01-01',
                       timestamp '2100-01-01',
                       interval '1 day') s(d);
VACUUM ANALYZE t;

The table also includes a padding column, to make it a bit larger. Now, let’s run a simple range query, selecting just one month out of the ~200 years included in the table. If you run EXPLAIN on the query, you’ll see something like this:

EXPLAIN SELECT * FROM t WHERE d BETWEEN '2001-01-01' AND '2001-02-01';

                               QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..4416.75 rows=32 width=332)
   Filter: ((d >= '2001-01-01 00:00:00'::timestamp without time zone)
        AND (d <= '2001-02-01 00:00:00'::timestamp without time zone))
(2 rows)

and on my laptop, this runs in ~20ms. Not bad, considering this has to walk through the whole table with ~75k rows.

But let’s create an index on the timestamp column (all indexes here are the default type, i.e. btree, unless mentioned explicitly):

CREATE INDEX idx_t_d ON t (d);

And now let’s run the same EXPLAIN again:

EXPLAIN SELECT * FROM t WHERE d BETWEEN '2001-01-01' AND '2001-02-01';

                               QUERY PLAN
------------------------------------------------------------------------
 Index Scan using idx_t_d on t  (cost=0.29..9.97 rows=34 width=332)
   Index Cond: ((d >= '2001-01-01 00:00:00'::timestamp without time zone)
            AND (d <= '2001-02-01 00:00:00'::timestamp without time zone))
(2 rows)

and this runs in ~0.5ms, so roughly 40x faster. But that was of course a simple index, created directly on the column, not an expression index. So let’s assume we instead need to select data from the 1st day of each month, using a query like this:

SELECT * FROM t WHERE EXTRACT(day FROM d) = 1;

which, however, can’t use the index, because the condition evaluates an expression on the column while the index is built on the raw column values, as shown by EXPLAIN ANALYZE:

                               QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..4416.75 rows=365 width=332)
                (actual time=0.045..40.601 rows=2401 loops=1)
   Filter: (date_part('day'::text, d) = '1'::double precision)
   Rows Removed by Filter: 70649
 Planning time: 0.209 ms
 Execution time: 43.018 ms
(5 rows)

So not only does this have to do a sequential scan, it also has to evaluate the expression for every row, increasing the query duration to 43ms.

The database is unable to use the index for multiple reasons. Indexes (at least btree indexes) rely on querying sorted data, provided by the tree-like structure, and while the range query can benefit from that, the second query (with the extract call) can’t.

Note: Another issue is that the set of operators supported by indexes (i.e. that can be evaluated on indexes directly) is very limited. And the “extract” function is not supported, so the query can’t work around the ordering issue by using a Bitmap Index Scan.

In theory the database might try to transform the condition into range conditions, but that is extremely difficult and specific to each expression. In this case we’d have to generate an effectively infinite number of such “per-day” ranges, because the planner does not know the min/max timestamps in the table. So the database does not even try.

But while the database does not know how to transform the conditions, developers often do. For example with conditions like

(column + 1) >= 1000

it’s not difficult to rewrite it like this

column >= (1000 - 1)

which works just fine with the indexes.
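The same trick applies to the table from this post. The following pair of queries is hypothetical, just to illustrate the rewrite on a timestamp column: the first wraps the indexed column in an expression and cannot use idx_t_d, while the second leaves the column alone on one side of the comparison and can.

-- can’t use idx_t_d: the indexed column is wrapped in an expression
SELECT * FROM t WHERE d + interval '1 day' >= timestamp '2001-01-01';

-- equivalent rewrite, leaving the column alone – can use idx_t_d
SELECT * FROM t WHERE d >= timestamp '2001-01-01' - interval '1 day';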

But what if such a transformation is not possible, as with our example query

SELECT * FROM t WHERE EXTRACT(day FROM d) = 1;

In this case the developer would face the same issue of unknown min/max values for the d column, and even with that knowledge the rewrite would have to generate a large number of ranges.

Well, this blog post is about expression indexes, and so far we have only used regular indexes, built on the column directly. So, let’s create the first expression index:

CREATE INDEX idx_t_expr ON t ((extract(day FROM d)));
ANALYZE t;

which then gives us this explain plan

                               QUERY PLAN
------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=47.35..3305.25 rows=2459 width=332)
                        (actual time=2.400..12.539 rows=2401 loops=1)
   Recheck Cond: (date_part('day'::text, d) = '1'::double precision)
   Heap Blocks: exact=2401
   ->  Bitmap Index Scan on idx_t_expr  (cost=0.00..46.73 rows=2459 width=0)
                                (actual time=1.243..1.243 rows=2401 loops=1)
         Index Cond: (date_part('day'::text, d) = '1'::double precision)
 Planning time: 0.374 ms
 Execution time: 17.136 ms
(7 rows)

So while this does not give us the same 40x speedup as the index in the first example, that’s kinda expected as this query returns far more tuples (2401 vs. 32). Moreover those are spread through the whole table and not as localized as in the first example. So it’s a nice 2x speedup, and in many real-world cases you’ll see much larger improvements.

But the ability to use indexes for conditions with complex expressions is not the most interesting thing here – after all, that’s the reason people create expression indexes in the first place. It is, however, not the only benefit.

If you look at the two explain plans presented above (without and with the expression index), you might notice this:

                               QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..4416.75 rows=365 width=332)
                (actual time=0.045..40.601 rows=2401 loops=1)
 ...
                               QUERY PLAN
------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=47.35..3305.25 rows=2459 width=332)
                        (actual time=2.400..12.539 rows=2401 loops=1)
 ...

Right – creating the expression index significantly improved estimates. Without the index we only have statistics (MCV + histogram) for raw table columns, so the database does not know how to estimate the expression

EXTRACT(day FROM d) = 1

So it instead applies a default estimate for equality conditions, which is 0.5% of all rows – as the table has 73050 rows, we end up with an estimate of just 365 rows. It’s common to see much worse estimation errors in real-world applications.

With the index, however, the database also collects statistics on the columns of the index, and in this case the column contains the results of the expression. While planning, the optimizer notices this and produces a much better estimate.
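You can inspect those extra statistics yourself – PostgreSQL exposes per-column statistics for expression indexes in the pg_stats view, under the index name (the column name for an expression column is generated by the system):

SELECT tablename, attname, n_distinct, most_common_vals
  FROM pg_stats
 WHERE tablename = 'idx_t_expr';

After the ANALYZE above, this should show MCV and histogram data for the extracted day-of-month values, which is exactly what the planner used for the improved estimate.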

This is a huge benefit, and may help with fixing some cases of poor query plans caused by inaccurate estimates. Yet most people are unaware of this handy tool.

And the usefulness of this tool only increased with the introduction of the JSONB data type in 9.4, because it’s about the only way to collect statistics about the contents of JSONB documents.

When indexing JSONB documents, two basic indexing strategies exist. You can either create a GIN/GiST index on the whole document, e.g. like this

CREATE INDEX ON t USING GIN (jsonb_column);

which allows you to query arbitrary paths in the JSONB column, use the containment operator to match sub-documents, etc. That’s great, but you still have only the basic per-column statistics, which are not very useful as the documents are treated as scalar values (and no one matches whole documents or uses ranges of documents).

Expression indexes, for example created like this:

CREATE INDEX ON t ((jsonb_column ->> 'id'));

will only be useful for that particular expression, i.e. the newly created index will be useful for

SELECT * FROM t WHERE jsonb_column ->> 'id' = '123';

but not for queries accessing other JSON keys, like 'value':

SELECT * FROM t WHERE jsonb_column ->> 'value' = 'xxxx';

This is not to say that GIN/GiST indexes on the whole document are useless, but you have to choose. Either you create a focused expression index, useful when querying a particular key and with the added benefit of statistics on the expression. Or you create a GIN/GiST index on the whole document, able to handle queries on arbitrary keys, but without the statistics.

However, in this case you can have your cake and eat it too, because you can create both indexes at the same time, and the database will choose which of them to use for individual queries. And you’ll have accurate statistics, thanks to the expression index.
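A sketch of that combined approach, reusing the jsonb_column name from the examples above (adjust for your schema):

-- GIN index for arbitrary containment queries on the whole document
CREATE INDEX ON t USING GIN (jsonb_column);

-- expression index for the frequently queried key,
-- which also gives the planner statistics on that expression
CREATE INDEX ON t ((jsonb_column ->> 'id'));
ANALYZE t;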

Sadly, you can’t eat the whole cake, because expression indexes and GIN/GiST indexes require different query conditions

-- expression (btree)
SELECT * FROM t WHERE jsonb_column ->> 'id' = '123';

-- GIN/GiST
SELECT * FROM t WHERE jsonb_column @> '{"id" : 123}';

so the planner can’t use them at the same time – the expression index for estimation and the GIN/GiST index for execution of the same condition.
