On Rocks and Sand

When working with database capacity planning, there are a lot of variables to consider, and Postgres is no different in this regard. One of the elements which requires management is storage. However, there’s an aspect of storage that escapes inspection almost without exception, and it’s hiding in the shadows between the columns themselves.

Alignment Basics

In most low-level computer languages like C, in which the venerable Postgres is written, data types are addressed by their maximum size, regardless of how large they actually are. Thus a standard 32-bit integer which can store a value of just over 2-billion, must be read as a whole unit. This means even the value of 0 requires 4 bytes of storage.

Further, Postgres is designed such that its own internal natural alignment is 8 bytes, meaning consecutive fixed-length columns of differing size must be padded with empty bytes in some cases. We can see that with this example:

SELECT pg_column_size(row()) AS empty,
       pg_column_size(row(0::SMALLINT)) AS byte2,
       pg_column_size(row(0::BIGINT)) AS byte8,
       pg_column_size(row(0::SMALLINT, 0::BIGINT)) AS byte16;

 empty | byte2 | byte8 | byte16 
-------+-------+-------+--------
    24 |    26 |    32 |     40

This suggests that an empty Postgres row requires 24 bytes of various header elements, a SMALLINT is 2 bytes, a BIGINT is 8 bytes, and combining them is… 16 bytes? That’s no mistake; Postgres is padding the smaller column to match the size of the following column for alignment purposes. Instead of 2 + 8 = 10, our math becomes 8 + 8 = 16.

Intensity Intervals

By itself, this may not necessarily be a problem. But consider a contrived ordering system with this table:

CREATE TABLE user_order (
  is_shipped    BOOLEAN NOT NULL DEFAULT false,
  user_id       BIGINT NOT NULL,
  order_total   NUMERIC NOT NULL,
  order_dt      TIMESTAMPTZ NOT NULL,
  order_type    SMALLINT NOT NULL,
  ship_dt       TIMESTAMPTZ,
  item_ct       INT NOT NULL,
  ship_cost     NUMERIC,
  receive_dt    TIMESTAMPTZ,
  tracking_cd   TEXT,
  id            BIGSERIAL PRIMARY KEY NOT NULL
);

If it looks a little weird, that’s because it is. It’s not uncommon for column orders to be dictated by a hurried dev simply jotting down attributes, or an ORM that generates output from an arbitrary hashed key position. Seeing the id column at the end is a good indicator that column order wasn’t part of the architecture or planning roadmap.

Given that, this is what Postgres sees:

SELECT a.attname, t.typname, t.typalign, t.typlen
  FROM pg_class c
  JOIN pg_attribute a ON (a.attrelid = c.oid)
  JOIN pg_type t ON (t.oid = a.atttypid)
 WHERE c.relname = 'user_order'
   AND a.attnum >= 0
 ORDER BY a.attnum;

   attname   |   typname   | typalign | typlen 
-------------+-------------+----------+--------
 is_shipped  | bool        | c        |      1
 user_id     | int8        | d        |      8
 order_total | numeric     | i        |     -1
 order_dt    | timestamptz | d        |      8
 order_type  | int2        | s        |      2
 ship_dt     | timestamptz | d        |      8
 item_ct     | int4        | i        |      4
 ship_cost   | numeric     | i        |     -1
 receive_dt  | timestamptz | d        |      8
 tracking_cd | text        | i        |     -1
 id          | int8        | d        |      8

The typalign column dictates the intended alignment type according to the values listed in the pg_type manual page.

From earlier discussion, we can see that the INT types trivially map to their respective byte sizes. For NUMERIC and TEXT, things are a bit trickier, but we’ll need to address that later. For now, consider the constant size transitions and the effect that may have on alignment.

A Little Padded Room

To avoid this insanity, Postgres pads each smaller column to match the size of the next consecutive column. As a result from this particular column arrangement, there’s almost a slight amount of buffering between every single column pair.

Let’s fill the table with one million rows and check the resulting size:

INSERT INTO user_order (
    is_shipped, user_id, order_total, order_dt, order_type,
    ship_dt, item_ct, ship_cost, receive_dt, tracking_cd
)
SELECT true, 1000, 500.00, now() - INTERVAL '7 days',
       3, now() - INTERVAL '5 days', 10, 4.99,
       now() - INTERVAL '3 days', 'X5901324123479RROIENSTBKCV4'
  FROM generate_series(1, 1000000);

SELECT pg_relation_size('user_order') AS size_bytes,
       pg_size_pretty(pg_relation_size('user_order')) AS size_pretty;

 size_bytes | size_pretty 
------------+-------------
  141246464 | 135 MB

Now we can use that as our benchmark for a value we want to beat. The next question is: how much of that is padding? Well, here’s an estimate of what minimum we’re wasting based on the declared alignment:

7 bytes between is_shipped and user_id
4 bytes between order_total and order_dt
6 bytes between order_type and ship_dt
4 bytes between receive_dt and id

So we’re probably losing around 21 bytes per row, with the declared types accounting for 59 bytes of actual space, for a total row length of 80 bytes without row overhead. Again, this is based only on the alignment storage. It turns out the NUMERIC and TEXT columns screw with the totals a bit more than the alignments suggest, but this will work fine for an estimate.

If we’re even close, that means we may be able to shrink the table by 26%, or about 37MB.

Some Ground Rules

The trick to performing this kind of surgery is obtaining the ideal column alignment that results in the fewest added bytes. To do that, we really need to consider the NUMERIC and TEXT columns. Because these are variable length types, they get special treatment. As an example, consider this regarding NUMERIC:

SELECT pg_column_size(row()) AS empty_row,
       pg_column_size(row(0::NUMERIC)) AS no_val,
       pg_column_size(row(1::NUMERIC)) AS no_dec,
       pg_column_size(row(9.9::NUMERIC)) AS with_dec,
       pg_column_size(row(1::INT2, 1::NUMERIC)) AS col2,
       pg_column_size(row(1::INT4, 1::NUMERIC)) AS col4,
       pg_column_size(row(1::NUMERIC, 1::INT4)) AS round8;

 empty_row | no_val | no_dec | with_dec | col2 | col4 | round8 
-----------+--------+--------+----------+------+------+--------
        24 |     27 |     29 |       31 |   31 |   33 |     36

These results suggest we can treat NUMERIC as unaligned, with some caveats. Even a single digit in a NUMERIC requires 5 bytes, but it also doesn’t impact the previous column the way an INT8 might.

Here’s the same concept with TEXT:

SELECT pg_column_size(row()) AS empty_row,
       pg_column_size(row(''::TEXT)) AS no_text,
       pg_column_size(row('a'::TEXT)) AS min_text,
       pg_column_size(row(1::INT4, 'a'::TEXT)) AS two_col,
       pg_column_size(row('a'::TEXT, 1::INT4)) AS round4;

 empty_row | no_text | min_text | two_col | round4 
-----------+---------+----------+---------+--------
        24 |      25 |       26 |      30 |     32

From this, we can see that variable types round to the nearest 4-bytes based on the type of the next column. This means we can chain variable length columns all day long without introducing padding except at the right boundary. Consequently, we can deduce that variable length columns introduce no bloat so long as they’re at the end of a column listing.

If the constant length column types adjust based on the next column, that would imply the largest types should come first. Barring that, we can “pack” columns such that consecutive columns consume 8 bytes cumulatively.

Again, we can see that in action:

SELECT pg_column_size(row()) AS empty_row,
       pg_column_size(row(1::SMALLINT)) AS int2,
       pg_column_size(row(1::INT)) AS int4,
       pg_column_size(row(1::BIGINT)) AS int8,
       pg_column_size(row(1::SMALLINT, 1::BIGINT)) AS padded,
       pg_column_size(row(1::INT, 1::INT, 1::BIGINT)) AS not_padded;

 empty_row | int2 | int4 | int8 | padded | not_padded 
-----------+------+------+------+--------+------------
        24 |   26 |   28 |   32 |     40 |         40

Neat, eh?

Column Tetris

The previous sections are a fancy way of saying, “Sort the columns by their type length as defined in pg_type.” Luckily enough, we can get that information by slightly adjusting the query we used to output the column types:

SELECT a.attname, t.typname, t.typalign, t.typlen
  FROM pg_class c
  JOIN pg_attribute a ON (a.attrelid = c.oid)
  JOIN pg_type t ON (t.oid = a.atttypid)
 WHERE c.relname = 'user_order'
   AND a.attnum >= 0
 ORDER BY t.typlen DESC;

   attname   |   typname   | typalign | typlen 
-------------+-------------+----------+--------
 id          | int8        | d        |      8
 user_id     | int8        | d        |      8
 order_dt    | timestamptz | d        |      8
 ship_dt     | timestamptz | d        |      8
 receive_dt  | timestamptz | d        |      8
 item_ct     | int4        | i        |      4
 order_type  | int2        | s        |      2
 is_shipped  | bool        | c        |      1
 tracking_cd | text        | i        |     -1
 ship_cost   | numeric     | i        |     -1
 order_total | numeric     | i        |     -1

All things being equal, we can mix and match columns with matching type lengths, and if we’re brave or desire a more pretty column ordering, can combine types of shorter lengths where necessary.

Let’s see what happens if we go with this table design:

DROP TABLE user_order;

CREATE TABLE user_order (
  id            BIGSERIAL PRIMARY KEY NOT NULL,
  user_id       BIGINT NOT NULL,
  order_dt      TIMESTAMPTZ NOT NULL,
  ship_dt       TIMESTAMPTZ,
  receive_dt    TIMESTAMPTZ,
  item_ct       INT NOT NULL,
  order_type    SMALLINT NOT NULL,
  is_shipped    BOOLEAN NOT NULL DEFAULT false,
  order_total   NUMERIC NOT NULL,
  ship_cost     NUMERIC,
  tracking_cd   TEXT
);

If we repeat the previous insert of 1-million rows, the new table size is 117,030,912 bytes, or roughly 112MB. By simply reorganizing the table columns, we’ve saved 21% of the total space.

This may not mean much for a single table, but repeated on every table in a database instance, it could result in a sizeable reduction in storage consumption. In a data warehouse context where data is commonly loaded once and never modified again, a 10-20% storage reduction would be worth considering simply due to the scales involved. I’ve seen 60TB Postgres databases; imagine reducing that by 6-12TB without actually removing any data.

Settled Differences

Much like filling a jar with rocks, pebbles, and sand, the most efficient way to declare a Postgres table is by the column alignment type. Bigger columns first, medium columns next, small columns last, and weird exceptions like NUMERIC and TEXT tacked to the end as if they were dust in our analogy. That’s what we get for playing with pointers.

As it stands, a more “natural” declaration of the table columns is probably along these lines:

CREATE TABLE user_order (
  id            BIGSERIAL PRIMARY KEY NOT NULL,
  user_id       BIGINT NOT NULL,
  order_type    SMALLINT NOT NULL,
  order_total   NUMERIC NOT NULL,
  order_dt      TIMESTAMPTZ NOT NULL,
  item_ct       INT NOT NULL,
  ship_dt       TIMESTAMPTZ,
  is_shipped    BOOLEAN NOT NULL DEFAULT false,
  ship_cost     NUMERIC,
  tracking_cd   TEXT,
  receive_dt    TIMESTAMPTZ
);

In this case, it just so happens we are 8MB away from the ideal, or about 7.7% larger. For the sake of demonstration, we intentionally botched the column ordering to make a point. A real table will probably sit somewhere between the best and worst case scenarios, and the only way to ensure the least amount of leakage is to manually reorder columns.

Some might ask why this isn’t built into Postgres. Surely it knows the ideal column ordering and has the power to decouple a user’s visible mapping from what actually hits the disk. That’s a legitimate question, but it’s a lot more difficult to answer and involves a lot of bike-shedding.

One major benefit from decoupling physical from logical representation is that Postgres would finally allow column reordering, or adding a column in a specific position. If a user wants their column listing to look pretty after multiple modifications, why not let them?

It’s all about priorities. There’s been a TODO item to address this going back to at least 2006. Patches have gone back and forth since then, and every time, the conversation eventually ends without a definitive conclusion. It’s clearly a difficult problem to address, and there are, as they say, bigger fish to fry.

Given sufficient demand, someone will sponsor a patch to completion, even if it requires multiple Postgres versions for the necessary under-the-hood changes to manifest. Until then, a simple query can magically reveal the ideal column ordering if the impact is pressing enough for a particular use case.

To those who love that kind of low-level tweaking, let us know if it was worth the effort. We’d love to hear your stories!

Craig Lewis says:

September 21, 2018 at 11:31 pm

It seems like a simpler and easier fix would be to add an argument to pg_dump to do optimal reordering. Then the column re-ordering can be tested as part of a dump and restore upgrade.

Michal says:

September 26, 2018 at 10:12 pm

Thanks, this is a very informative article that explains in an easy to understand way how Postgres stores data and fills the storage it with unused bytes. The ability to reorder columns is an important missing feature – after some time of application lifetime and various database modifications the crazy column order you presented in your example may not be far off from the truth. Then working with such table structures in database administration tools becomes clumsy as you end up with related columns far away from one another instead of being grouped logically in close proximity.

Given the padding and space issues it would be logical to disconnect the internal column order from presentational order. And we should have the ability to:

1. change presentational column order with ALTER TABLE (without table rewrite)
2. run a table rewrite that would automatically determine the most efficient internal column order and apply it to storage while preserving the presentational order – something like OPTIMIZE TABLE COLUMN ORDER – or maybe simply do it as part of VACUUM FULL.

I hope the long standing TODO item will find some love among the developers!