Monday, November 12

PG Phriday: Adventures in BAR Management

Backups are a critical component to a fully covered Postgres database infrastructure. In some ways, it’s fair to say a database without a backup is no database at all—sometimes literally. 2ndQuadrant’s Barman tool is aptly named as a Backup And Recovery Manager for Postgres, and it exists primarily for encouraging a stable and robust backup process.

Backing up a database and then restoring from that backup often has an army of associated scripts and utilities. Some of these components are probably the Postgres default tools of pg_dump, pg_restore, and pg_basebackup. The rest are often shell scripts, homegrown or otherwise copied from blogs or git repositories. Let’s assume the best-case scenario, where all of these work just fine; I used to think the same thing myself.

But a fully qualified backup and restore management suite is more than a set of scripts. Once I became fully familiar with Barman, I even discarded a backup tool I wrote several years ago and was very proud of. Let’s explore the stack and see why.

Stand in the place where you live

Before creating a bunch of backups, we need a database. A big one. The pgbench utility is often used to generate data this way, and even has a parameter for setting the database scale. At a scale of 10,000, we should get a 150GB database. That’s not ponderously large in the grand scheme of things, but it’s enough for a few tests.

createdb pgbench
pgbench -i -s 10000 pgbench

After waiting for about an hour and a half, there should be a good chunk of data that will represent our production system. Now let’s back it up.
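For a rough idea of what a given scale will produce before committing to a long load, a little shell arithmetic goes a long way. This is only a sketch: the 15MB-per-scale-unit figure is an assumed rule of thumb that happens to match the 150GB quoted above, and the function names are made up for illustration. The 100,000 rows per scale unit is how pgbench populates pgbench_accounts.

```shell
# Rough pgbench sizing sketch. MB_PER_SCALE is an assumed rule of thumb
# (about 15 MB of data per scale unit), not an exact figure.
MB_PER_SCALE=15
ROWS_PER_SCALE=100000   # pgbench_accounts rows created per scale unit

estimate_gb() {
    # $1 = pgbench scale factor; prints the approximate size in GB
    echo $(( $1 * MB_PER_SCALE / 1000 ))
}

estimate_rows() {
    # $1 = pgbench scale factor; prints the pgbench_accounts row count
    echo $(( $1 * ROWS_PER_SCALE ))
}

estimate_gb 10000     # prints 150
estimate_rows 10000   # prints 1000000000
```

Real sizes will drift with fillfactor, indexes, and Postgres version, so treat the result as a ballpark.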

Now face north

Barman is in the official Postgres package repositories. The installation instructions essentially amount to:

  1. Install PGDG
  2. Install Barman

Barman is a backup management suite, though, so there are a few more configuration steps we need to take for it to work properly.

Think about direction

As a management suite, Barman can be configured to back up any number of Postgres instances, local or remote. As such, while Barman doesn’t require dedicated hardware, it should have sufficient disk and network bandwidth to accommodate one or more Postgres instances and the associated traffic.

In our case, we just have one instance. Still, for this to be a real-world test, let’s get everything running in a separate VM. A bonus with this approach is that it ensures the backup is at least on a separate system in case the original has an unrecoverable crash.

Now we have a Postgres server, and a Barman server. A default Barman installation associates everything with the barman OS user. One of the configurations Barman uses for extremely large databases is using rsync for data copy. That means we need to set up SSH keys such that the barman OS user on the Barman server can log into the Postgres server as the postgres OS user.

So on the Barman server:

ssh-keygen
ssh-copy-id POSTGRES_HOST

In a real production system, we’d hope for some kind of configuration management process to handle this step.

Now stand in the place where you work

The base configuration of Barman is in the /etc/barman.conf file. This sets defaults for the installation itself, and for all backup procedures Barman controls. Since we don’t have one of those yet, we should make one. Such instance-specific files go in /etc/barman.d.

Our database is 150GB, and we want to use rsync to back up the actual data files, but what about transaction logs? The "old way" to manage these is to set archive_command in the postgresql.conf configuration to transmit WAL files to some tertiary directory or target server. Barman can be configured to stream these using pg_receivexlog or pg_receivewal. This means every backup should always have a full series of associated transaction logs so we can use Point In Time Recovery (PITR) at any time between backups.

This means we’ll be using something of a hybrid config: rsync to copy the datafiles (instead of pg_basebackup) and pg_receivewal for transaction logs. The default site-configuration templates tend to use one or the other. So knowing what we know now, our /etc/barman.d/pgbench.conf might look like this:

[pgbench]
 
description =  "Benchmark Barman Test"
 
ssh_command = ssh postgres@postgres.host
 
conninfo = host=postgres.host user=barman dbname=postgres
 
backup_method = rsync
reuse_backup = link
 
streaming_archiver = on
slot_name = barman
 
path_prefix = "/usr/lib/postgresql/10/bin"

Now face west

Astute readers may have noticed we used the barman user instead of postgres. It’s generally not a great idea to use the same user everywhere. Having a user we can tie directly to backup procedures is extremely useful. We just need to set it up on the Postgres server.

We start by creating the barman database user:

CREATE USER barman WITH SUPERUSER REPLICATION PASSWORD 'redacted';

Then we need the following lines in pg_hba.conf, assuming 192.168.1.120 is the Barman server:

host  postgres     barman  192.168.1.120/32  md5
host  replication  barman  192.168.1.120/32  md5

Once Postgres reloads its configuration files, we’re almost done.
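Before moving on, it can be worth confirming from the Barman server that both pg_hba.conf lines actually work. The sketch below wraps two psql calls in a hypothetical helper named check_barman_access; it assumes psql is installed on the Barman server and that the barman password is available non-interactively (for example through a ~/.pgpass file). The first call exercises the regular connection, and the second opens a replication connection and issues IDENTIFY_SYSTEM.

```shell
# Hypothetical helper: verify both connection types from the Barman server.
check_barman_access() {
    # $1 = hostname or address of the Postgres server
    # Regular connection, matching the "host postgres barman" line.
    psql -h "$1" -U barman -d postgres -c 'SELECT version()' &&
    # Replication connection, matching the "host replication barman" line.
    psql -h "$1" -U barman -c 'IDENTIFY_SYSTEM' replication=1
}
```

Calling check_barman_access postgres.host should print the server version followed by a single row of system identification data. If either call fails, the matching pg_hba.conf line is the first place to look.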

Your feet are going to be on the ground

At this point, we can’t produce a backup just yet. Barman is safety-first, so until it can establish it has received at least one transaction log, it will refuse to attempt a backup. Since we’ve configured streaming WAL, we need to get it running.

Assuming the barman user exists on Postgres, and we’re logged in as the barman user on the Barman server, we just need to do this:

barman receive-wal pgbench --create-slot
barman cron

As an interesting bit of trivia, the receive-wal command is essentially a wrapper for pg_receivewal or pg_receivexlog. As such, it doesn’t have a daemon mode, and it assumes it’s being launched in a script or another background process.

By supplying --create-slot, only the slot is created and nothing more. If we removed that parameter, Barman would start receiving WAL files, but would do so on the local TTY where we launched the command. The cron command tells Barman to handle any necessary maintenance, and that also means launching any missing WAL streaming backends. Handy, right?

If we then check on the Postgres server, we should see the slot is up and transmitting:

SELECT slot_name, active
  FROM pg_replication_slots;
 
 slot_name | active 
-----------+--------
 barman    | t

If you are confused, check with the sun

That’s a lot of prerequisites! Luckily, Barman should be ready after all of our hard work. To ensure this is the case, Barman also has status output we can check to see the health of our installation.

barman check pgbench
 
Server pgbench:
    PostgreSQL: OK
    is_superuser: OK
    PostgreSQL streaming: OK
    wal_level: OK
    replication slot: OK
    directories: OK
    retention policy settings: OK
    backup maximum age: OK (no last_backup_maximum_age provided)
    compression settings: OK
    failed backups: OK (there are 0 failed backups)
    minimum redundancy requirements: OK (have 0 backups, expected at least 0)
    ssh: OK (PostgreSQL server)
    not in recovery: OK
    pg_receivexlog: OK
    pg_receivexlog compatible: OK
    receive-wal running: OK
    archiver errors: OK

Once we see OK on all status items, we can safely invoke a backup.
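If backups are going to be launched from a script, the same check can act as a gate. The following is a minimal sketch that assumes barman check exits with a non-zero status when any item fails; backup_if_healthy is a made-up helper name, not part of Barman.

```shell
# Hypothetical wrapper: only start a backup when the health check passes.
backup_if_healthy() {
    # $1 = server name as defined in /etc/barman.d
    if barman check "$1"; then
        barman backup "$1"
    else
        echo "barman check failed for $1; skipping backup" >&2
        return 1
    fi
}
```

Running backup_if_healthy pgbench as the barman OS user would then either produce a backup or leave a clear message about why it did not.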

Wonder why you haven’t before

Actually creating the backup is a single command. With it, Barman will execute pg_start_backup() on the database, launch rsync to obtain all of the Postgres data files, call pg_stop_backup(), and ensure all transaction logs between the function calls are present. If any of these steps do not complete as expected, the backup is marked as FAILED.

For now, let’s just try to back up our 150GB behemoth.

barman backup pgbench
 
Starting backup using rsync-exclusive method for server pgbench in /barman/pgbench/base/20180115T203123
Backup start at LSN: 1E/BC000028 (000000010000001E000000BC, 00000028)
This is the first backup for server pgbench
WAL segments preceding the current backup have been found:
    000000010000001E000000BA from server pgbench has been removed
Starting backup copy via rsync/SSH for 20180115T203123
Copy done (time: 26 minutes, 8 seconds)
This is the first backup for server pgbench
WAL segments preceding the current backup have been found:
    000000010000001E000000BB from server pgbench has been removed
Asking PostgreSQL server to finalize the backup.
Backup size: 146.1 GiB. Actual size on disk: 146.1 GiB (-0.00% deduplication ratio).
Backup end at LSN: 1E/BC000130 (000000010000001E000000BC, 00000130)
Backup completed (start time: 2018-01-15 20:31:23.504082, elapsed time: 26 minutes, 10 seconds)

26 minutes for a 146GB database is pretty good, but the fun is just getting started. We should also note here that it’s possible to set the parallel_jobs configuration parameter to launch several simultaneous rsync threads. In well-equipped hardware, this can result in even further reduced backup times.

If wishes were trees, the trees would be falling

One of the major benefits to using Barman over, say, pg_basebackup or an equivalent tool, is that Barman is context aware. The rsync tool has the capability of making hard links if provided a directory where previous versions of files may exist. Since we have at least one backup, we already have one of those!

What does this actually mean? Even though Postgres doesn’t support it directly, we can actually produce incremental backups. Let’s see how this works by generating a bit of database activity on the Postgres system. We can’t use pgbench itself for this because it generates random data in such a way that it will probably touch every data file. A simple UPDATE should suffice:

UPDATE pgbench_accounts SET abalance=5
 WHERE aid BETWEEN 100000 AND 200000;

Then if we run another backup, it should be much faster and use comparatively less space.

barman backup pgbench
 
Starting backup using rsync-exclusive method for server pgbench in /barman/pgbench/base/20180115T205910
Backup start at LSN: 1E/BF000028 (000000010000001E000000BF, 00000028)
Starting backup copy via rsync/SSH for 20180115T205910
Copy done (time: 34 seconds)
Asking PostgreSQL server to finalize the backup.
Backup size: 146.1 GiB. Actual size on disk: 3.1 GiB (-97.91% deduplication ratio).
Backup end at LSN: 1E/BF000130 (000000010000001E000000BF, 00000130)
Backup completed (start time: 2018-01-15 20:59:10.891769, elapsed time: 36 seconds)
Processing xlog segments from streaming for pgbench
    000000010000001E000000BE

How’s that for an improvement? 34 seconds to produce a 3GB incremental backup. We can even check with the operating system to see what the real sizes of the backups are, respectively:

du -sh --apparent-size /barman/pgbench/base/*
 
147G    /barman/pgbench/base/20180115T203123
3.1G    /barman/pgbench/base/20180115T205910
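The mechanism behind those numbers is easy to see in isolation. This throwaway demo does not involve Barman or rsync at all; it simply hard links one file between two pretend backup directories, with every path and file name invented for the example.

```shell
# Hard link demo in a temporary directory; nothing here touches Barman.
demo=$(mktemp -d)
mkdir "$demo/backup1" "$demo/backup2"

# "First backup": one data file.
echo "unchanged table data" > "$demo/backup1/datafile"

# "Second backup": the file did not change, so link it instead of copying.
ln "$demo/backup1/datafile" "$demo/backup2/datafile"

# Both names now point at the same inode, so the link count is 2.
links_before=$(ls -l "$demo/backup2/datafile" | awk '{print $2}')

# Deleting the older backup removes one name, not the data itself.
rm -r "$demo/backup1"
links_after=$(ls -l "$demo/backup2/datafile" | awk '{print $2}')
contents=$(cat "$demo/backup2/datafile")

echo "links before: $links_before, after: $links_after"
echo "contents: $contents"
rm -r "$demo"
```

Since du counts each inode only once per invocation, the second directory in a listing appears to contain only the files that changed, even though every backup directory is a complete copy on its own.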

Listen to reason, reason is calling

As a management suite, Barman also has an inventory system. Say we logged into a new system and knew nothing about it, other than the fact it ran Barman. Let’s start by seeing which systems it’s set to back up:

barman list-server
 
pgbench - Benchmark Barman Test

Now we know pgbench is configured with Barman. We can follow up by listing any backups:

barman list-backup pgbench
 
pgbench 20180115T205910 - Mon Jan 15 20:59:47 2018 - Size: 146.1 GiB - WAL Size: 0 B
pgbench 20180115T203123 - Mon Jan 15 20:57:33 2018 - Size: 146.1 GiB - WAL Size: 48.0 MiB

We can pass these backup identifiers to any Barman command that operates on a backup, and the ‘latest’ label will always point to the most recent one. Since we have two backups in rapid succession, let’s delete the oldest one:

barman delete pgbench 20180115T203123
 
Deleting backup 20180115T203123 for server pgbench
Delete associated WAL segments:
    000000010000001E000000BC
    000000010000001E000000BD
    000000010000001E000000BE
Deleted backup 20180115T203123 (start time: Mon Jan 15 21:02:37 2018, elapsed time: less than one second)

And if we check the filesystem, we can see that the hard linked files are gone, so the backup size of the incremental returns to the full size of all inventoried files:

du -sh --apparent-size /barman/pgbench/base/*
 
147G    /barman/pgbench/base/20180115T205910

Your head is there to move you around

These are backups, so presumably data restores are also involved. Let’s assume there’s a third server in another environment where we want to restore the most recent backup. It already has SSH keys set up, so now we just need to perform the actual restoration:

barman recover pgbench latest /data/main \
       --remote-ssh-command='ssh [email protected]'
 
Starting remote restore for server pgbench using backup 20180115T205910
Destination directory: /data/main
Copying the base backup.
Copying required WAL segments.
Generating archive status files
Identify dangerous settings in destination directory.
 
IMPORTANT
These settings have been modified to prevent data losses
 
postgresql.conf line 218: archive_command = false
 
Your PostgreSQL server has been successfully prepared for recovery!

As with any restore of this type, completion depends on the full capabilities of the hardware. The faster the network and underlying storage, the more aggressive Barman can be in transmitting files and writing to disk. There are even options to launch multiple rsync worker threads for extremely well equipped systems.

Note that Barman will not start a restored system. There are various PITR options so it can properly apply transaction log files, and it will produce the necessary recovery.conf file for everything to proceed. However, the assumption is that a recovery is critical and requires administrator intervention to verify settings and other assumptions. Once the restored backup is started, it can’t be rolled backwards; it pays to double-check!

Carry a compass to help you along

But what is a backup management system without a retention period and enforced backup policy? Let’s add two lines to the end of our /etc/barman.d/pgbench.conf file:

retention_policy = RECOVERY WINDOW OF 7 DAYS 
minimum_redundancy = 7

Now every time barman cron runs, it will check the backup inventory to match these requirements. If we have too many past WAL files or backups, it will automatically prune them for us. It also just so happens that the packaged version of Barman installs a cron entry in /etc/cron.d that runs every minute and ensures all of this happens.

It’s up to us to schedule the actual backup. We just need to choose a time of day and set barman backup pgbench to run. After a day or two of daily traffic, we will know roughly how large the incremental backups are, and how long they take to produce, which may influence when we set the backup to run.
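As an illustration of what that schedule could look like, this small helper prints an entry in the /etc/cron.d format, where the sixth field is the OS user. The helper name and the /usr/bin/barman path are assumptions; adjust the path to wherever the package placed the binary.

```shell
# Hypothetical helper: print a cron.d-style entry for a daily backup.
backup_cron_line() {
    # $1 = hour (0-23), $2 = minute (0-59), $3 = Barman server name
    printf '%s %s * * * barman /usr/bin/barman -q backup %s\n' "$2" "$1" "$3"
}

backup_cron_line 2 30 pgbench
# prints: 30 2 * * * barman /usr/bin/barman -q backup pgbench
```

The printed line could be saved to a file under /etc/cron.d to run the backup at 02:30 every day as the barman user.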

Once it’s running, there’s little else for us to do. It would serve us well to occasionally inventory the backup list, and restore to ensure the backups are working properly. Otherwise, it should be self-maintaining.

Think about the place where you live

Finally, Barman strongly encourages following the 3-2-1 backup rule:

  1. Have at least three backups.
  2. Use at least two different storage media.
  3. Keep at least one copy off-site.

To that end, Barman allows configuration of several hook scripts, one of which is post_backup_script. Assuming Barman completes a backup normally, this script can transmit the backup to another data center, or off-site location such as Glacier, Carbonite, or some other long-term storage. Subsequent backup steps such as encryption can also be incorporated before off-site transmission.
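To make that a little more concrete, here is a sketch of what such a hook might contain. It relies on the BARMAN_STATUS, BARMAN_SERVER, BARMAN_BACKUP_ID, and BARMAN_BACKUP_DIR environment variables that Barman exports to hook scripts, and it assumes a successful backup reports a status of DONE. OFFSITE_TARGET and the ship_backup function are invented for this example, and plain rsync stands in for whatever transfer tool the off-site location really needs.

```shell
#!/bin/sh
# Hypothetical post_backup_script sketch.
# OFFSITE_TARGET is a made-up rsync destination; override it as needed.
OFFSITE_TARGET="${OFFSITE_TARGET:-backup@offsite.example:/srv/barman}"

ship_backup() {
    # Only ship backups that Barman marked as complete (assumed: DONE).
    if [ "$BARMAN_STATUS" != "DONE" ]; then
        echo "skip: backup $BARMAN_BACKUP_ID has status $BARMAN_STATUS"
        return 0
    fi
    # Copy the finished backup directory to the off-site target.
    rsync -a "$BARMAN_BACKUP_DIR/" \
        "$OFFSITE_TARGET/$BARMAN_SERVER/$BARMAN_BACKUP_ID/"
}

# A real hook script would end by calling: ship_backup
```

Note that a copy made this way is a full-size backup at the destination, since the hard links only save space within the Barman server’s own filesystem.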

2ndQuadrant maintains several such scripts for support customers, but the amount of effort involved here is much less than Barman itself saves. A VM snapshot, scheduled transmission to a tape archive target, or any number of other solutions exist as well. Barman doesn’t have to integrate directly into a tiered backup stack, but the capability is there.

Stand in the place where you are

The reasons for Barman are the same reasons I wrote my own backup system several years ago. Dumping doesn’t scale well, and restoring takes much longer than dumping thanks to indexes and keys. Even the binary backup tools only scale until physics steps in. If I have a 10TB database and a 10Gbit network link, I’d be lucky to back up in less than three hours fully saturating the link; a 1Gbit network link is hardly worth considering unless I have a spare 30 hours.
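The physics is simple enough to check in the shell. This sketch computes the theoretical minimum transfer time for a given size and link speed, using decimal units (1TB is 8,000 gigabits) and ignoring protocol overhead, disk speed, and everything else that makes real transfers slower. The function name is just for illustration.

```shell
# Theoretical minimum time to push a database across a saturated link.
transfer_seconds() {
    # $1 = database size in TB (decimal), $2 = link speed in Gbit/s
    # 1 TB = 8000 Gbit, so seconds = TB * 8000 / (Gbit per second)
    echo $(( $1 * 8000 / $2 ))
}

transfer_seconds 10 10   # prints 8000  (about 2.2 hours)
transfer_seconds 10 1    # prints 80000 (about 22 hours)
```

Those are best-case figures, which is why the practical estimates above are padded to three hours and 30 hours respectively.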

In these days when database sizes are steadily increasing, incremental backups are a requirement. The provided tools will work until a certain point, and after that, DBAs are forced to provide their own solution. Well, 2ndQuadrant has already done a lot of that legwork, and is not charging for the end result.

If Barman had existed way back then, I could have saved weeks of coding and testing an ultimately inferior solution. What I produced worked very well, backing up a 60TB system in 20 minutes thanks to 200GB incrementals. But there were more failure cases, fewer robust tests, and less documentation.

Even if Barman isn’t ultimately the right choice, I urge any sysadmin or DBA to choose a well-tested and maintained piece of software over home-grown scripts whenever possible. Backing up Postgres is easy until it isn’t, and it’s critically important to have a robust solution before the worst happens.

Barman does take a bit of effort to properly deploy, but an ounce of prevention is worth a pound of cure.
