Wednesday, August 16

Basics of Tuning Checkpoints

On systems doing non-trivial number of writes, tuning checkpoints is crucial for getting good performance. Yet checkpoints are one of the areas where we often identify confusion and configuration issues, both on the community mailing lists and during performance tuning reviews for our customers. (The other one being autovacuum, discussed a few days ago by Joe Nelson from Citus.) So let me walk you through the checkpoints – what they do and how to tune them in PostgreSQL.

What is the point of checkpoints?

PostgreSQL is one of the databases relying on write-ahead log (WAL) – all changes are written to a log (a stream of changes) first, and only then to the data files. That provides durability, because in case of a crash the database may use WAL to perform recovery – read the changes from WAL and re-apply them on data files.

While this may double the amount of writes, it may actually improve performance. Users only have to wait for the WAL (to be flushed to disk), while the data files are modified only in memory and then flushed later in the background. That is is good, because while WAL writes are sequential in nature, while writes to data files are often random.

Let’s say the system crashed and the database needs to perform recovery. The simplest approach would be to take start from scratch and replay the whole WAL from the very beginning. In the end we should get a complete (and correct) database. The obvious disadvantage is of course the need to keep and replay the whole WAL. We often deal with databases that are not extremely big (say, few hundred GBs), but produce several TBs of WAL per day. So imagine how much disk space you would need to keep all WAL when running the database for a year, and how much time it would take to replay during recovery.

But what if the database could guarantee that all changes for a given WAL position (offset in the log), all data file changes up to that position are flushed to disk. Then it could determine this position during recovery, and replay only the remaining part of the WAL, significantly reducing the recovery time. And it could also remove WAL before that “known good” position.

That is exactly what checkpoints are for – guarantee that WAL before some point in time are no longer needed for recovery, reducing the disk space requirements and recovery time.

Note: If you happen to be a gamer, you are probably familiar with the concept of checkpoints – your character passes a certain point in the game, and if you fail to beat the next boss or fall into a lake hot lava, you start from that last point instead of starting from the very beginning. Let’s see how to achieve that in PostgreSQL ;-)

Let’s also discuss the other extreme – doing very frequent checkpoints (say, every second or so). That would allow keeping only tiny amount of WAL and the recovery would be very fast (having to replay only the tiny WAL amount). But it would also turn the asynchronous writes to data files into synchronous ones, seriously impacting the users (e.g. increasing COMMIT latency, reducing throughput).

So in practice you want the checkpoints to happen infrequently not to affect the users, but frequently enough to reasonably limit time for recovery and disk space requirements.

Triggering checkpoints

There are about three or four reasons why a checkpoint can be triggered:

  • executing the CHECKPOINT command directly
  • executing a command that requires a checkpoint (e.g. pg_start_backup, CREATE DATABASE, or pg_ctl stop|restart and a few others)
  • reaching a configured amount of time since the last checkpoint
  • generating a configured amount of WAL since the previous checkpoint (aka “running out of WAL” or “filling WAL”)

The first two points are rather irrelevant here – those are rare, manually triggered, events. This blog post is about how to configure the two other events, affecting regular periodic checkpoints.

Those time/size limits are set using two configuration options:

  • checkpoint_timeout = 5min
  • max_wal_size = 1GB (before PostgreSQL 9.5 this was checkpoint_segments)

With these (default) values, PostgreSQL will trigger a CHECKPOINT every 5 minutes, or after the WAL grows to about 1GB on disk.

Note: The max_wal_size is a soft limit for total WAL size, which has two consequences. Firstly, the database will try not to exceed it but is allowed to, so keep enough free space on the partition and monitor it. Secondly, it’s not a “per checkpoint” limit – due to spread checkpoints (explained later) the WAL quota is divided between 2 – 3 checkpoints. So with max_wal_size the database will start a CHECKPOINT after writing 300 – 500 MB of WAL, depending on checkpoint_completion_target.

The default values are rather low, like most other default values in the sample configuration file, sized to work even on small systems like Raspberry Pi.

But how to determine good values for your system? The goal is not to do checkpoints too often or too infrequently, and our tuning “best practice” consists of two steps:

  1. choose a “reasonable” checkpoint_timeout value
  2. set max_wal_size high enough to be rarely reached

It’s hard to say what a “reasonable” value for checkpoint_timeout is, as it depends on recovery time objective (RTO), i.e. what is the acceptable maximum recovery duration.

It’s a bit tricky, though, because checkpoint_timeout is a limit on how long it took to generate the WAL, not directly on recovery time. And sadly it’s not possible to say exactly how long a recovery will take for a number of reasons. WAL is usually generated by multiple processes (running DML), while recovery is performed by a single process (and this limitation is mostly inherent and unlikely to go away anytime soon). This affects not only local recovery, but also failover to a standby with streaming replication. And local recovery of course often happens right after a reboot, when file system caches are cold.

But in general, the default value (5 minutes) is rather low and values between 30 minutes and 1 hour are fairly common. PostgreSQL 9.6 even increases the maximum to 1 day (so there are hackers that think it’s a good idea for some use cases). Low values may also result write amplification due to full-page writes (which I’m not going to discuss here).

So let’s say we’ve decided to use 30 minutes.

checkpoint_timeout = 30min

Now we need to estimate how much WAL will the database produce over 30 minutes, so that we can use that for max_wal_size. There are several ways to determine how much WAL is generated:

  • Look at the actual WAL position (essentially offset in a file) using pg_current_xlog_insert_location(), and compute difference between positions measured every 30 minutes.
  • Enable log_checkpoints=on and then extract the information from the server log (there will be detailed statistics for each completed checkpoint, including the amount of WAL).
  • Using data from pg_stat_bgwriter, which also includes information about number of checkpoints (which you can combine with the knowledge of current max_wal_size value).

Let’s use the first approach, for example. On my test machine running pgbench, I do see this:

postgres=# SELECT pg_current_xlog_insert_location();
 pg_current_xlog_insert_location 
---------------------------------
 3D/B4020A58
(1 row)

... after 5 minutes ...

postgres=# SELECT pg_current_xlog_insert_location();
 pg_current_xlog_insert_location 
---------------------------------
 3E/2203E0F8
(1 row)

postgres=# SELECT pg_xlog_location_diff('3E/2203E0F8', '3D/B4020A58');
 pg_xlog_location_diff 
-----------------------
            1845614240
(1 row)

This shows that over the 5 minutes, the database generated ~1.8GB of WAL, so for checkpoint_timeout = 30min that would be about 10GB of WAL. However as mentioned before, max_wal_size is a quota for 2 – 3 checkpoints combined, so max_wal_size = 30GB (3 x 10GB) seems about right.

The other approaches use different sources of data, but the idea is the same.

Spread checkpoints

I suggested that you need to tune checkpoint_timeout and max_wal_size, I wasn’t telling the whole truth. There’s another parameter, called checkpoint_completion_target. But to tune it, you need to understand what a “spread checkpoint” means.

But during a CHECKPOINT, the database needs to perform these three basic steps:

  1. identify all dirty (modified) blocks in shared buffers
  2. write all those buffers to disk (or rather to filesystem cache)
  3. fsync() all the modified files to disk

Only when all those steps finish, the checkpoint can be considered complete. You could do these steps “as fast as possible” i.e. write all the dirty buffers in one go and then call fsync on the files, and in fact that’s what PostgreSQL up to 8.2 did. But that leads to I/O stalls due to filling filesystem caches and saturating the devices, and affecting user sessions.

To address this, PostgreSQL 8.3 introduced the concept of “spread checkpoints” – instead of writing all data at once, the writes are spread over a long period of time. That gives the OS time to flush the dirty data in the background, making the final fsync() much cheaper.

The writes are throttled based on progress towards the next checkpoint – the database knows how much time / WAL do we have left until another checkpoint will be needed, and it can compute how many buffers should be already written out. The database however must not issue writes until the very end – that would mean the last batch of writes would still be in filesystem cache, making the final fsync() calls (issued right before starting the next checkpoint) expensive again.

So the database needs to leave the kernel enough time so that the dirty data get flushed to disk in the background – and the expiration from page cache (Linux filesystem cache) is generally driven by time, particularly by this kernel parameter:

vm.dirty_expire_centisecs = 3000

which says that data expire after 30 seconds (by default).

Note: When it comes to kernel parameters, it’s important to tune vm.dirty_background_bytes. On systems with a lot of memory the default value is far too high, allowing the kernel to accumulate a lot of dirty data in filesystem cache. Kernel often decides to flush them all at once, reducing the benefit of spread checkpoints.

Now, back to checkpoint_completion_target = 0.5. This configuration parameter say how far towards the next checkpoint should all the writes be complete. For example assuming checkpoints triggered only by checkpoint_timeout = 5min, the database will throttle the writes so that the last write is done after 2.5 minutes. The OS then has another 2.5 minutes to flush the data to disk, so that the fsync calls issued after 5 minutes are cheap and fast.

Sure, leaving the system 2.5 minutes may seem excessive, considering the expiration timeout is only 30 seconds. You might increase checkpoint_completion_target e.g. to 0.85 which would leave the system about 45 second, a bit more than the 30 seconds it needs. That’s not recommended though, because in case of intensive writes the checkpoint may be triggered by max_wal_size much sooner than after the 5 minutes, leaving OS less than the 30 seconds.

Systems that handle write-intensive workloads are however unlikely to run with much higher checkpoint_timeouts values, making the default completion_target value definitely too low. For example with the timeout set to 30 minutes, it forces the database to do all writes in the first 15 minutes (at double the write rate), and then sit idly for the remaining 15 minutes.

Instead, you might try setting checkpoint_completion_target roughly using this formula

(checkpoint_timeout - 2min) / checkpoint_timeout

which for 30 minutes is about 0.93. Sometimes it recommended not to go above 0.9 – that’s probably OK, and you’re unlikely to observe any significant difference between those two values. (This may change when using very high checkpoint_timeout values, which now can be up to 1 day on PostgreSQL 9.6).

Summary

So now you should know what’s the purpose of checkpoints, and also basics of tuning them. To sum this up:

  • most checkpoints should be time-based, i.e. triggered by checkpoint_timeout
    • compromise between performance (infrequent checkpoints) and time needed for recovery (frequent checkpoints)
    • values between 15-30 minutes are the most common, but going up to 1h is not somehow bad
  • after deciding on the timeout, choose max_wal_size by estimating the amount of WAL
  • set checkpoint_completion_target so that kernel has enough (but not unnecessarily much) time to flush data to disk
  • also tune vm.dirty_background_bytes to prevent kernel from accumulating a lot of dirty data in page cache
  • Or you can simply use our performance tuning service, which covers not only checkpoints but all the other important bits of configuration.

3 Comments

  • cook

    Thanks so much for sharing this in the public. It really helps new community members to be able to tackle a non-trivial part of database tuning in a way that works for our business. Much appreciated!

  • Thank you for this great post. I have some questions related to this:

    Checking the checkpoints theme and the configuration of memory buffers for writing on to the operating system (vm.dirty_background_ratio and
    Vm.dirty_ratio = 2), I wanted to know what the relationship exists between these parameters and checkpoint_completion_target.

    If synchronous_commit = on, it means that each write action will be persisted in the redo logs before the final commit of the transaction, and after a checkpoint all the generated Wal will be synchronized with the data files, then in this scenario:

    The bgwriter plays some role during bgwriter_delay by writing the changes of the shared buffers directly to the data files or the write cache of the operating system or simply does nothing in synchronous commit mode.

    Is Wal_buffers used in synchronous mode?

    If after each commit an fsync is executed to write to disk the data generated by the transaction, what information is stored in the operating system’s write cache, and how much can this value grow if there is write after each transaction?

    Any documentation related to these issues will be well appreciated.

    Regards and many thanks to all for your time.

    • That’s a lot of questions, so let me get through that one by one:

      > Checking the checkpoints theme and the configuration of memory buffers for writing on to the operating system (vm.dirty_background_ratio and
      > Vm.dirty_ratio = 2), I wanted to know what the relationship exists between these parameters and checkpoint_completion_target.

      There is no direct connection (i.e. no explicit connection in the code or anything like that), but of course those parameters are related as both deal with writing data out. To get the smoother behavior, you’d want low vm.dirty_background_bytes (so that the kernel does not accumulate a lot of dirty data in page cache) and high checkpoint_completion_target (so that the writes are spread over a long time interval). Let’s assume you have larget shared buffers, and there’s ~16GB of dirty data to write out. With checkpoint_timeout=15min and checkpoint_completion_target=0.9, that means ~1.2GB/min, or 20MB/s. So high completion_target gives you smooth writeout from shared buffers to page cache. What about page cache? By default, the kernel will allow accumulating dirty data for up to 30 seconds (dirty_expire_centisecs=3000), which means up to 600 MB of dirty data. And then it tends to flush everything as fast as possible, which leads to latency spikes. vm.dirty_background_bytes takes care of that by starting the writeout based on amount of dirty data, not just time. So those parameters affect different parts of the stack, but serve the same purpose.

      >
      > If synchronous_commit = on, it means that each write action will be persisted in the redo logs before the final commit of the transaction, and after a checkpoint all the generated Wal will be synchronized with the data files, then in this scenario:
      >
      > The bgwriter plays some role during bgwriter_delay by writing the changes of the shared buffers directly to the data files or the write cache of the operating system or simply does nothing in synchronous commit mode.
      >
      > Is Wal_buffers used in synchronous mode?
      >
      > If after each commit an fsync is executed to write to disk the data generated by the transaction, what information is stored in the operating system’s write cache, and how much can this value grow if there is write after each transaction?
      >

      Not sure I follow. synchronous_commit only affects writes to WAL, it has nothing to do with data files. bgwriter OTOH only deals with data files, it has nothing to do with WAL. When you do an UPDATE, it writes the change into WAL and then modifies the pages (from data files) in shared buffers. When you then do a COMMIT, it will wait for fsync on WAL to happen – you can say synchronous_commit=off and this wait won’t happen, which will preserve the atomicity of the transaction but you won’t get a confirmation whether it succeeded or not. The COMMIT however does nothing with the dirty pages (from data files) in shared buffers, those remain in memory and are evicted either by checkpoint or bgwriter. The only purpose of bgwriter is to ensure there’s free memory for queries (otherwise the eviction would be performed by them).

      > Any documentation related to these issues will be well appreciated.

      I’m sure there are sections in the official documentation explaining what bgwriter does, what synchronous_commit is about etc. But if you want technical details, you’ll probably have to look at the source code – there are various READMEs explaining how things work.

      > Regards and many thanks to all for your time.

Leave a Reply

Your email address will not be published. Required fields are marked *