Monday, December 10

Managing Freezing in PostgreSQL

Postgres contains a moving event horizon, which is in effect about 2 billion transactions ahead of or behind the current transaction id. Transactions more than 2 billion ahead of or more than 2 billion behind the current transaction id are considered to be in the future, and will thus be invisible to current transactions.

Postgres avoids this catastrophic data loss by specially marking old rows so that no matter where they are in relation to the current transaction id they will be visible.

Freezing is this process of marking old live tuples (i.e. database rows) so that they don’t get run over by the moving event horizon that would otherwise make them appear to be in the future. This is in contrast to vacuuming, which is the freeing up of space consumed by old dead tuples that are no longer visible to any transaction.

Both processes are managed by vacuum.

There are a number of settings that govern how freezing is done.

First, vacuum_freeze_min_age governs whether or not a tuple will be frozen while vacuum is already looking at a page to see if it has dead tuples that can be cleaned up. Tuples older than vacuum_freeze_min_age will be frozen in this case. Setting this low means that there will be less work to do later on, but at the possible cost of extra effort both in CPU and IO or WAL activity. Generally you probably want this set to at least a few hours worth of transactions. Let’s say you’re expecting to do up to 2000 transactions per second as a sustained rate. 2000 TPS is 7.2m transactions per hour. Thus a fairly aggressive setting for this case might be say 20m. The default setting is 50m. Similarly for vacuum_multixact_freeze_min_age. Note that the transaction_id and multixid counters are independent – you need to keep track of both of them.

Second, there are vacuum_freeze_table_age and vacuum_multixact_freeze_table_age. These settings govern when autovacuum will not just look at pages that might have dead rows, but any page that might have unfrozen rows. The defaults for these settings are 150m. If you have reduced vacuum_freeze_min_age enough, in many cases this more aggressive vacuum will have little or no work to do. In any case, this process is not as busy as it used to be, since modern versions of Postgres (9.6 and up) keep a map of pages where all the tuples are frozen, and only visit those pages that are not all frozen. That means this is no longer a full table scan.

Last there is autovacuum_max_freeze_age. If the last time the table was scanned completely for unfrozen rows was more than this many transactions ago, autovacuum will start an anti-wraparound vacuum on the table. The default is 200m. Similarly for autovacuum_multixact_freeze_max_age for which the default is 400m. This is something you really want to avoid. There are two things that can be done. First, it is very common to increase these settings to something like 1 billion, to give yourself more headroom, especially on systems that are heavy consumers of transactions. You could make it more but you want to have plenty of transaction space between your oldest tuple and the event horizon. Second, it is important to monitor your systems and take remedial action before any databases run into this. This remedial action often includes manual vacuuming.

One problem that can occur is where you have DDL that causes normal (i.e. not anti-wraparound) autovacuum to cancel itself. If you do this enough eventually you will get an anti-wraparound vacuum forced, and any DDL then queues up behind the vacuum process, and that in turn blocks any further DML. At this stage your table is effectively unreadable until the vacuum finishes. This depends on the usage pattern of your database, but this is not just a theoretical possibility and Postgres deployments and DBAs do need to take it into account.

Monitoring your database cluster is critical to managing this. In particular you need to monitor the datfrozenxid and datminmxid of each database in the cluster, and if these get too old take remedial action before an anti-wraparound vacuum is required. Often the problem is with one or a few tables in the database. Which ones are the problem can be discovered by examining the relfrozenxid and relminmxid of the tables in the database. The age() and mxid_age() functions are useful to discover the age of transaction id and multixid counters respectively.

Freezing is not something you can avoid, it is an essential maintenance activity in Postgres that needs to be actively managed.

Leave a Reply

Your email address will not be published. Required fields are marked *