Parallel Aggregate – Getting the most out of your CPUs

A small peek into the future of what should be arriving for PostgreSQL 9.6.

Today PostgreSQL took a big step ahead in the data warehouse world and we now are able to perform aggregation in parallel using multiple worker processes! This is great news for those of you who are running large aggregate queries over 10’s of millions or even billions of records, as the workload can now be divided up and shared between worker processes seamlessly.

We performed some tests on a 4 CPU 64 core server with 256GB of RAM using TPC-H @ 100 GB scale on query 1. This query performs some complex aggregation on just over 600 million records and produces 4 output rows.

The base time for this query without parallel aggregates (max_parallel_degree = 0) is 1375 seconds.  If we add a single worker (max_parallel_degree = 1) then the time comes down to 693 seconds, which is just 6 seconds off being twice as fast! So quite close to linear scaling. If we take the worker count up to 10 (max_parallel_degree=10), then the time comes down to 131 seconds, which is once again just 6 seconds off perfect linear scaling!

The chart below helps to paint a picture of this. The blue line is the time in seconds. Remember that the time axis is on a logarithmic scale, (the performance increase is a little too much to see the detail at higher worker counts otherwise)

You can see that even with 30 worker processes we’re still just 20% off of the linear scale. Here the query runs in 56 seconds, which is almost 25 times faster than the non-parallel run.


This really makes the existing parallel seqscan and parallel join changes really shine. Without parallel seqscan, parallel aggregation wouldn’t have been possible, and without parallel join parallel aggregate would only be possible on queries that didn’t contain any joins.

More work is still to be done to parallel-enable some aggregate functions, but today’s commit is a great step forward.

This Post Has 6 Comments

  1. paul says:

    Awesome test. Can you say what actual hardware (4 cpu 64 core) you used? Thanks.

  2. Dmitry says:

    Hi, David

    First big thank’s for implementing parallel aggr for PG.
    I try to test Q1 query from TPC-H (50GB scale factor, 300M rows of LINEITEM table) on PG 9.6beta1 and get next results :

    302171.230 ms for max_parallel_degree = 0
    175506.106 ms for max_parallel_degree = 4
    144931.588 ms for max_parallel_degree = 8

    Changes made to postgresql.conf :

    shared_buffers = 8000MB
    work_mem = 2000MB
    max_parallel_degree = 16
    enable_seqscan = on

    Test performed on SLES11 SP3 virtual machine with 16VCPU

    Can you please advise some tuning for parallel work in this case ?


    • David Rowley says:

      Hi Dmitry,

      Thanks for the comment.

      Remember the max_parallel_degree is “maximum”, the size of the relation will control how many workers the planner requests the executor to use. Also remember that the number of workers that the planner request the executor might not have available at the time the query executes. If you run EXPLAIN ANALYZE on the query it should show you how many workers are requested and how many are used.

      However, I’d say most likely is that I/O requests cannot be served fast enough to feed the CPUs with the lineitem heap pages they need. When I ran my test I ran it on a machine with 256GB of RAM, and all the buffers were either in kernel page cache or in PostgreSQL’s buffer cache, so it wouldn’t have needed to read any pages from disk.

      I’d say if you have less than 64GB of memory on this machine then that might be the reason. How does it perform on Q1 @ 10GB scale?

      If the planner is not requesting as many workers as you’d like, then you can tweak this manually with;

      alter table lineitem set (parallel_degree=16);

      I’m interested in hearing your updated results on this.


  3. Dmitry says:

    Hi, David

    Thank you for explanation.

    You rigth, it is some disk I/O bottleneck’s on my VM

    Unfortunatelly for some reason’s can’t cache full LINEITEM table (44GB) on 60GB RAM VM
    (each Q1 run is reading from disk, but filesystem cache is 45GB populated)

    Ok. I test Q1 (no filter predicates, scan of full table) for rows subset of 45M rows from LINEITEM table (data for 1994 year, 6.5GB in size) and after it is cached get next results :

    86235.123 ms for max_parallel_degree = 0
    22301.170 ms for max_parallel_degree = 4
    12711.368 ms for max_parallel_degree = 8

    Next I see what increasing max_parallel_degree more than 8 is not increasing number of real parallel workers.

    As you advise I change table default parallelism as :

    alter table lineitem1994 set (parallel_degree=16);

    But EXPLAIN ANALYZE for Q1 show what :

    Workers Planned: 16
    Workers Launched: 8

    So,no more than 8 workers again.

    Best regards,

    • David Rowley says:

      Remember that the executor will only start at the most max_worker_processes. Setting the relations parallel_degree higher than this does not make sense.

      It looks like it’s scaling better with a smaller table. Seems that 4 workers is about 77% efficient compared to no parallelism, and 8 workers 75%. This not really very close to what I got, so perhaps it’s worth watching top to see if the workers are fully utilising the CPU. If they’re not then you have a bottleneck somewhere else.

Leave A Reply