Friday, November 24

How monitoring of WAL archiving improves with PostgreSQL 9.4 and pg_stat_archiver

PostgreSQL 9.4 introduces a new statistic in the catalogue, called pg_stat_archiver.
Thanks to the SQL language it is now possible, in an instant, to check the state of the archiving process of transactional logs (WALs), crucial component of a PostgreSQL disaster recovery system.

PostgreSQL and Barman

Introduction and reasons

The need for the pg_stat_archiver view comes from the last few years of experience with Barman as a solution for disaster recovery of PostgreSQL databases in business continuity environments.

In particular, some of the needs and recurring questions that DBAs, system administrators, CTOs and CIOs repeatedly – and legimitately – share with us, are:

  • how much disk space will I need?
  • how can I keep the whole backup process, including continuous archiving, under control?

The starting point can only be a business requirement and it is defined through the concept of retention policy.
Usually, a company defines a disaster recovery plan within a business continuity plan, where it is clearly defined the period of retention of backup data. In the same documents we find both the recovery point objective (RPO) and the recovery time objective (RTO) definitions, the two key metrics that respectively measure the amount of data that a business can afford to lose and the maximum allowed time to recover from a disaster.
Monitoring and studying the past behaviour of a database are important elements for correctly sizing the storage requirements of a PostgreSQL backup solution.

For example, a company may decide to retain data of a PostgreSQL database for a month, in order to reconstruct the state of the database in a consistent manner at any point in time from the first available backup to the last available WAL file (through the rock solid Point-in-Time-Recovery technology of PostgreSQL, introduced 10 years ago).

The required size is given not only by the number of periodic full backups (for example, one a week), but also the number of WAL files stored in the Barman archive, each containing all the transactions that have been properly executed in the database, in a differential manner.

One of the needed metrics is therefore the number of archived WAL files per second, through which it is possible to estimate the number of WALs generated in a week and to predict disk usage requirements.

Before PostgreSQL 9.4, unless you:

  1. used tools for sampling and trending (e.g. Munin), or
  2. examined the timestamp of each WAL file, taking advantage of a high value of wal_keep_segments, or
  3. delegated this information to a custom script invoked by archive_command,

it was not possible to get any of the following pieces of information from the database server: number of archived WAL files, timestamp and name of the last archived WAL file, number of failures, timestamp and WAL name of the last failure, … and so on.

For this reason, last year I decided to write a small patch for Postgres which then became part of the core for version 9.4. This patch adds a real-time statistic in the PostgreSQL catalogue, called pg_stat_archiver.

Statistics overview

The pg_stat_archiver catalogue view in PostgreSQL 9.4 makes available to Barman users – and from a more general point of view to everyone using classic continuous backup with WAL file shipping – the following fields:

  • archived_count: number of WAL files successfully archived;
  • last_archived_wal: name of the last successfully archived WAL file;
  • last_archived_time: timestamp of the last successfully archived WAL file;
  • failed_count: number of failed WAL archiving attempts;
  • last_failed_wal: WAL name of the last archiving failure;
  • last_failed_time: timestamp of the last archiving failure;
  • stats_reset : timestamp of the last reset of statistics.

Here is an example taken from a local database server (with very low workload):

postgres=# SELECT * FROM pg_stat_archiver;
-[ RECORD 1 ]------+-----------------------------------------
archived_count     | 17
last_archived_wal  | 00000001000000000000000B.00000028.backup
last_archived_time | 2014-12-23 08:40:17.858291+01
failed_count       | 13
last_failed_wal    | 000000010000000000000001
last_failed_time   | 2014-12-04 13:09:07.348307+01
stats_reset        | 2014-12-03 16:52:21.755025+01

Basically, once continuous archiving is activated through archive_mode, PostgreSQL becomes responsible that, for each WAL file that is generated, the archive_command program is successfully executed, re-trying indefinitely in case of error (disk space permitting).

Unlike previous versions, PostgreSQL 9.4 is now able to collect some information regarding the two main events of the archiving process: success and failure. In particular, for both operations, the catalogue view reports:

  • count (since cluster initialisation or the last time statistics were reset);
  • WAL name of the last operation;
  • time of the last operation.

Moreover, you can reset the statistics of the archiver with the following SQL statement:

-- Requires superuser privileges
SELECT pg_stat_reset_shared('archiver');

Integration with Barman

From version 1.4 onwards, Barman automatically takes advantage of the pg_stat_archiver view for PostgreSQL 9.4 databases, transparently reporting information regarding the archiver process in some common informative commands such as status and show-server.

Furthermore, using the power of the SQL language, reliability of the check command has been improved so that an archiver problem is detected directly from the source.

Let’s walk through the query that has been introduced in Barman:

SELECT *,
    current_setting('archive_mode')::BOOLEAN
        AND (last_failed_wal IS NULL
            OR last_failed_wal <= last_archived_wal)
        AS is_archiving,
    CAST (archived_count AS NUMERIC)
        / EXTRACT (EPOCH FROM age(now(), stats_reset))
        AS current_archived_wals_per_second
FROM pg_stat_archiver

In addition to retrieving all columns of the pg_stat_archiver view, the query calculates two fields, directly from the source:

  • is_archiving: is the process of WAL archiving in progress or not?
  • current_archived_wals_per_second: frequency of archived WALs per second.

The is_archiving field must be TRUE, as the process of archiving is necessary for Barman to work correctly. Therefore, archive_mode must be active and the value of the last failed WAL must be either undefined (NULL) or not higher than the last properly archived WAL. This control is now part of the barman check command for Postgres 9.4 servers (and subsequent versions).

The second field, on the other hand, returns a very interesting statistic on the workload produced by the Postgres server. This metric allows operation managers to estimate the disk space that is required to store days, weeks and months of WAL files (even compressed), thus responding to one of the major initial questions.

How to check the operating status of archiving

Thanks to pg_stat_archiver, checking the status of WAL continuous archiving comes down to the execution of a single SQL query.

As a result, you can use the above query, already used by Barman, to verify that archiving is correctly working and integrate it in the probes and plugins used by your company’s alerting system.

Important

Those using Barman with a PostgreSQL 9.4 server that have already integrated barman check within Nagios or Icinga will transparently enjoy this feature.

Conclusions

In its simplicity, the pg_stat_archiver view is a very important tool for those who consider disaster recovery a critical component – not a peripheral one – of a PostgreSQL database in a business continuity system.

Although PostgreSQL allows you to perform backups using streaming replication, continuous archiving of WAL files through shipping is a very safe and reliable fallback method (and still the only one so far supported by Barman). Therefore, proper monitoring of this component significantly increases the robustness of the entire Postgres database solution.

Finally, being able to access some statistics about the number of archived WAL files by a PostgreSQL server in a given period through a simple SQL query is an important step towards the analysis of transactional workload and the prediction of disk usage for a backup solution.

2 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *