A few weeks ago I explained the basics of autovacuum tuning. At the end of that post I promised to look into problems with vacuuming soon. Well, it took a bit longer than I planned, but here we go.
To quickly recap: autovacuum is a background process cleaning up dead rows, e.g. old deleted row versions. You can also perform the cleanup manually by running VACUUM, but autovacuum does that automatically at the right moment, depending on the number of dead rows in the table – not too often, but frequently enough to keep the amount of “garbage” under control.
To be clear, autovacuum cannot run too often – the cleanup is only triggered once a certain number of dead rows accumulates in the table. But it may be delayed for various reasons, resulting in tables and indexes getting larger than desirable. And that’s exactly the topic of this post. So what are the common culprits, and how can you identify them?
As explained in the tuning basics, autovacuum workers are throttled to only perform a certain amount of work per time interval. The default limits are fairly low – roughly 4MB/s of writes and 8MB/s of reads. That is suitable for tiny machines like a Raspberry Pi or small servers from 10 years ago, but current machines are way more powerful (both in terms of CPU and I/O) and handle much more data.
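As a sketch, relaxing the throttling might look like this – the values are purely illustrative assumptions, not recommendations; derive yours from what your storage can actually sustain:

```sql
-- illustrative values: raise the cost limit shared by all workers, and
-- shorten the sleep between rounds, so autovacuum can do more work per second
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;  -- default -1, i.e. use vacuum_cost_limit (200)
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms'; -- default 20ms on older releases
SELECT pg_reload_conf();
```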
Imagine you have a few large tables and some small ones. If all three autovacuum workers (autovacuum_max_workers defaults to 3) start cleaning up the large tables, none of the small tables will get vacuumed, regardless of how many dead rows they accumulate. Identifying this is not particularly difficult, assuming you have sufficient monitoring: look for periods when all autovacuum workers are busy while some tables are not vacuumed despite accumulating many dead rows.
All the necessary information is in pg_stat_activity (the number of running autovacuum worker processes) and pg_stat_all_tables (n_dead_tup, last_autovacuum and so on).
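For example, queries like these show what the workers are doing and which tables are waiting – a sketch; on pre-10 releases the workers are easiest to spot by their query text:

```sql
-- autovacuum workers currently running, and what they are working on
SELECT datname, query, xact_start
  FROM pg_stat_activity
 WHERE query LIKE 'autovacuum:%';

-- tables with the most (estimated) dead rows, and their last cleanup
SELECT relname, n_dead_tup, last_autovacuum
  FROM pg_stat_all_tables
 ORDER BY n_dead_tup DESC
 LIMIT 10;
```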
Increasing the number of autovacuum workers is not a solution, as the total amount of work remains the same – the cost limit is shared by all the workers. You can specify per-table throttling limits, which excludes the worker processing that table from the shared limit, but even that does not guarantee there will be available workers when needed.
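Per-table limits are set as storage parameters – a minimal sketch, assuming a hypothetical table name:

```sql
-- give cleanup of this (hypothetical) table its own budget, independent
-- of the cost limit shared by the other autovacuum workers
ALTER TABLE big_events SET (
    autovacuum_vacuum_cost_limit = 1000,
    autovacuum_vacuum_cost_delay = 5
);
```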
The right solution is tuning the throttling, using limits reasonable with respect to the hardware configuration and workload patterns. Some basic throttling recommendations are mentioned in the previous post. (Obviously, if you can reduce the amount of dead rows generated in the database, that would be an ideal solution.)
From this point we’ll assume the throttling is not the issue, i.e. that the
autovacuum workers are not saturated for long periods of time, and that the cleanup is triggered on all tables without unreasonable delays.
So, if the table is vacuumed regularly, surely it can’t accumulate a lot of dead rows, right? Unfortunately, no. The rows are not actually “removable” immediately after getting deleted, but only when there are no transactions that might possibly still see them. The exact behavior depends on what the other transactions are (were) doing and on the isolation level, but in general:

- in READ COMMITTED, running queries block the cleanup
- in SERIALIZABLE (and REPEATABLE READ), whole transactions block the cleanup
In practice it’s more nuanced of course, but explaining all the various bits would require first explaining how XIDs and snapshots work, and that’s not the goal of this post. What you really should take away from this is that long transactions are a bad idea, particularly if those transactions might have done writes.
Of course, there are perfectly valid reasons why you may need to keep transactions open for long periods of time (e.g. if you need to ensure ACID for all the changes). But make sure that does not happen unnecessarily, e.g. due to poor application design.
A somewhat unexpected consequence of this is high CPU and I/O usage, due to autovacuum running over and over without actually removing any dead rows (or removing only a few of them). Because of that, the tables remain eligible for cleanup on the next round, causing more harm than good.
How to detect this? Firstly, you need to monitor long-running transactions, particularly idle ones. All you need to do is read data from pg_stat_activity. The view definition changes a bit between PostgreSQL versions, so you may need to tweak these queries a bit:
```sql
SELECT xact_start, state FROM pg_stat_activity;

-- count 'idle' transactions longer than 15 minutes (since BEGIN)
SELECT COUNT(*) FROM pg_stat_activity
 WHERE state = 'idle in transaction'
   AND (now() - xact_start) > interval '15 minutes';

-- count transactions 'idle' for more than 5 minutes
SELECT COUNT(*) FROM pg_stat_activity
 WHERE state = 'idle in transaction'
   AND (now() - state_change) > interval '5 minutes';
```
You may also simply use some existing monitoring plugin, e.g. check_postgres.pl – those already include this type of sanity check. You’ll have to decide what a reasonable transaction/query duration is, as that is application-specific.
Since PostgreSQL 9.6 you can also use idle_in_transaction_session_timeout, so that transactions idle for too long are terminated automatically. Similarly, for long queries there’s statement_timeout.
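Both timeouts can be set globally or per role/session – the values below are assumptions; pick ones that match your application:

```sql
-- terminate sessions sitting idle in a transaction for over 15 minutes
ALTER SYSTEM SET idle_in_transaction_session_timeout = '15min';

-- cancel any single statement running longer than 5 minutes
ALTER SYSTEM SET statement_timeout = '5min';

SELECT pg_reload_conf();
```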
Another useful thing is VACUUM VERBOSE, which will tell you how many dead rows could not be removed yet:
```
db=# VACUUM VERBOSE z;
INFO:  vacuuming "public.z"
INFO:  "z": found 0 removable, 66797 nonremovable row versions in 443 out of 443 pages
DETAIL:  12308 dead row versions cannot be removed yet.
...
```
It will not tell you which backend is preventing the cleanup, but it’s a pretty clear sign of what’s happening.
Note: You can’t easily get this information from autovacuum itself, because the details are only logged with DEBUG2 by default (and you surely don’t want to run with that log level in production).
Let’s assume tables are being vacuumed in a timely manner, but the cleanup is not removing dead tuples, resulting in table and index bloat. You are monitoring
pg_stat_activity and there are no long-running transactions. What could be the issue?
If you have a streaming replica, chances are the issue is there. If the replica uses hot_standby_feedback=on, queries on the replica act pretty much like transactions on the primary, including blocking the cleanup. Of course, hot_standby_feedback=on is typically enabled exactly when running long queries (e.g. analytics and BI workloads) on replicas, to prevent cancellations due to replication conflicts.
Unfortunately, you’ll have to choose – either keep hot_standby_feedback=on and accept delays in cleanup, or deal with canceled queries. You might also use max_standby_streaming_delay to limit the impact, although that does not prevent the cancellations entirely (so you still need to retry the queries).
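The trade-off on the replica, sketched as postgresql.conf fragments (the delay value is an assumption):

```
# option 1: queries on the replica hold back cleanup on the primary
hot_standby_feedback = on

# option 2: no feedback, but let queries delay replay for a bounded
# time before they get canceled (so cleanup on the primary is unaffected)
#hot_standby_feedback = off
#max_standby_streaming_delay = 30s
```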
Actually, there’s a third option now – logical replication. Instead of using physical streaming replication for the BI replica, you can copy the changes using the new logical replication, available in PostgreSQL 10. Logical replication relaxes the coupling between the primary and the replica, and makes the clusters mostly independent (e.g. they are cleaned up independently).
This solves the two issues associated with physical streaming replication – delayed cleanup on the primary or canceled queries on the BI replica. For replicas serving DR purposes, streaming replication remains the right choice, though. But those replicas are not (or should not be) running long queries.
Note: While I mentioned that logical replication will be available in PostgreSQL 10, a significant portion of the infrastructure was available in previous releases (particularly PostgreSQL 9.6). So you might be able to do this even on older releases (we did that for some of our customers), but PostgreSQL 10 will make it much more convenient and comfortable.
A detail you might miss is that each autovacuum worker actually performs two different tasks: firstly the cleanup (as if running VACUUM), and secondly collecting statistics (as if running ANALYZE). Both parts are throttled using the same cost limit (autovacuum_vacuum_cost_limit).
But there’s a big difference in how the two handle transactions. Whenever the VACUUM part reaches the cost limit, the worker releases its snapshot and sleeps for a while. The ANALYZE part, however, has to run within a single snapshot/transaction, and that snapshot does block cleanup.
This is an elegant way to shoot yourself in the foot, particularly if you also do some of this:

- increase default_statistics_target to build more accurate statistics from larger samples
- lower autovacuum_analyze_scale_factor to collect statistics more frequently
The unintended consequence of course is that ANALYZE will happen more frequently, will take much longer, and will (unlike the VACUUM part) prevent cleanup. The solution is usually fairly simple – don’t lower autovacuum_analyze_scale_factor too much. Running ANALYZE every time 10% of the table changes should be more than enough in most cases.
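If one table genuinely needs fresher statistics, it’s safer to lower the factor only there rather than cluster-wide – a sketch with a hypothetical table name:

```sql
-- analyze this one (hypothetical) table after 2% of its rows change,
-- instead of lowering autovacuum_analyze_scale_factor globally
ALTER TABLE hot_queue SET (autovacuum_analyze_scale_factor = 0.02);
```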
One last thing I’d like to mention is about changes in
pg_stat_all_tables.n_dead_tup values. You might think that the value is a simple counter, incremented whenever a new dead tuple is created and decremented whenever it’s cleaned up. But it’s actually only an estimate of the number of dead tuples, updated by
ANALYZE. For small tables (less than 240MB) it’s not really a big difference, because
ANALYZE reads the whole table and so it’s pretty exact. For large tables it may however change quite a bit depending on what subset of table gets sampled. And lowering
autovacuum_vacuum_scale_factor makes it more random.
So be careful when looking at
n_dead_tup in a monitoring system. Sudden drops or increases in the value may be simply due to
ANALYZE recomputing a different estimate, and not due to actual cleanup and/or new dead tuples appearing in the table.
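To tell the two cases apart, it helps to compare the estimate with the time of the last statistics refresh – a simple sketch:

```sql
-- n_dead_tup is refreshed by (auto)analyze, so sudden jumps in the
-- value often coincide with last_autoanalyze changing
SELECT relname, n_live_tup, n_dead_tup, last_autoanalyze
  FROM pg_stat_all_tables
 ORDER BY n_dead_tup DESC
 LIMIT 10;
```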
To summarize this into a few simple points:
- autovacuum can only do its work if there are no transactions that might still need the dead tuples.
- watch out for long-running (particularly idle) transactions, and consider idle_in_transaction_session_timeout and statement_timeout to limit the damage.
- replicas with hot_standby_feedback=on may also block cleanup.
- autoanalyze is throttled too, but unlike the VACUUM part it keeps a single snapshot (and thus blocks cleanup).
- n_dead_tup is just an estimate maintained by ANALYZE, so expect some fluctuation (especially on large tables).