Wednesday, October 24

PG Phriday: Getting RAD with Docker [Part 4]

PgBouncer is a popular proxy and pooling layer for Postgres. It’s extremely common to reconfigure PgBouncer with repmgr so it always directs connections to the current primary node. It just so happens our emerging Docker stack could use such a component.

In our last article, we combined Postgres with repmgr to build a Docker container that could initialize and maintain a Postgres cluster with automated failover capabilities. Yet there was the lingering issue of connecting to the cluster. It’s great that Postgres is always online, but how do we connect to whichever node is the primary?

While we could write a layer into our application stack to call repmgr cluster show to find the primary before connecting, that’s extremely cumbersome. Besides that, there’s a better way. Let’s alter our stack to incorporate PgBouncer and open a whole new world of RAD opportunities.

Tough Decisions

One major shortcoming regarding Docker containers is that they’re not exactly designed to communicate with each other. Yet simultaneously, Docker encourages single-purpose containers. We already had to circumvent this design to run both repmgr and Postgres in the same container, since repmgr needs access to pg_ctl or other similar utilities.

In that same vein, integrating PgBouncer could mean directly injecting it into the container, or running a separate dedicated PgBouncer container. If we install PgBouncer in every node, each node can only reconfigure its local PgBouncer instance. As a separate container, no nodes can alter the PgBouncer configuration file.

So we need to cheat a bit. We already know from the first part in this series that Docker allows mounting user-defined volumes within a container. If we mount that same volume onto every container, reconfiguring PgBouncer suddenly becomes trivial. With that in mind, let’s do things The Right Way ™, and give PgBouncer its own container.

Hiring a Bouncer

Fortunately a minimal PgBouncer setup only needs to know the hostname of the primary node. We could get far more involved with users, authentication configuration, and so on, but this is a RAD stack where none of that matters.

So what does PgBouncer need to do in this context?

  1. Wait for a Postgres node to declare itself the primary.
  2. Reconfigure itself to redirect traffic to that system.
  3. Start and wait for connections.

Here’s what such a script might look like:

#!/bin/bash
 
BOUNCE_CONF=/etc/pgbouncer/pgbouncer.ini
 
if [ ! -f /etc/pgbouncer/hostname ]; then
    echo ${HOSTNAME} > /etc/pgbouncer/hostname
 
    until [ -f /etc/pgbouncer/primary ]; do
        echo "Waiting for primary to be listed in /etc/pgbouncer/primary"
        sleep 1
    done
 
    PRIMARY=$(cat /etc/pgbouncer/primary)
 
    sed -i "s/;*\* = .*/* = host=${PRIMARY}/" ${BOUNCE_CONF}
    sed -i "s/listen_addr =.*/listen_addr = 0.0.0.0/" ${BOUNCE_CONF}
    sed -i "s/listen_port =.*/listen_port = 5432/" ${BOUNCE_CONF}
    sed -i "s/;admin_users =.*/admin_users = postgres/" ${BOUNCE_CONF}
 
    echo '"postgres" "postgres"' > /etc/pgbouncer/userlist.txt
    chmod 600 /etc/pgbouncer/userlist.txt
fi
 
gosu postgres pgbouncer /etc/pgbouncer/pgbouncer.ini

We only have to bootstrap the container once. Until some other container defines the location of the primary, there’s nothing for us to do but wait.

Once the waiting is over, we redirect any traffic at this container to the declared primary. After that change, we want to listen on port 5432 on all interfaces as if this were a basic Postgres node. That permits anyone connecting to this stack to treat it as if it were a Postgres instance instead of a bunch of loosely-coupled Docker containers.

Finally, we allow the postgres user to connect to the special pgbouncer database which allows controlling PgBouncer itself. This means any of the Postgres containers can pause the connection pool, change the shared configuration file, and resume connections without restarting the pgbouncer daemon itself.

Then we can just launch the pgbouncer daemon as the postgres OS user, and we’re done.

Note that the gosu postgres preamble is a neat way of doing sudo the way most people probably want sudo to work. We’ll be using it a lot in these scripts, and we thank the official Postgres image maintainers for the trick.

The PgBouncer Image

As before, we need a Dockerfile to construct our actual image. And like then, we’re going to inherit from the official Postgres image to save a lot of setup on our part. This means our Dockerfile is extremely simple:

FROM postgres:latest
 
RUN apt-get update -y; \
    apt-get install -y pgbouncer
 
VOLUME /etc/pgbouncer
 
COPY scripts/docker-entrypoint.sh /usr/local/bin/

Aside from installing the pgbouncer packages and defining our configuration volume mount, we also overwrite the Postgres entrypoint script with the one we wrote. This means our script is the only thing that runs when the container boots. Let’s build the image and set it aside for later:

$> find pgbouncer -type f
 
pgbouncer/scripts/docker-entrypoint.sh
pgbouncer/Dockerfile
 
$> docker build --tag pgbouncer pgbouncer
 
...
Successfully built 9bac90fcbfda
Successfully tagged pgbouncer:latest

Spring Cleaning

At this point we have a PgBouncer image that will create containers that wait for something to come along and set /etc/pgbouncer/primary to a hostname. Now we just need such a container.

To do that, it’s time to rethink our previous postgres-repmgr image, where we allowed the official Postgres entrypoint call our scripts that built the cluster. Now that we’re physically interacting with other containers, having that extra layer of indirection is something of a liability. We’ve already clobbered the official entrypoint once, so we may as well do it again.

That decision made, what should the new coordination script do?

  1. If no primary has ever existed, bootstrap a new Postgres + repmgr cluster.
  2. Wait until PgBouncer starts itself with the newly created primary.
  3. If this is a new replica, clone the existing primary node.
  4. If this is a rebooted replica, try to rejoin the cluster, or rebuild the node.
  5. Start repmgrd.

Given these steps, we need several helper scripts to accomplish this goal:

  1. One to initialize a Postgres node, configure it the way we want, start, and register it as a repmgr primary.
  2. One to configure repmgr, since clones will need this too.
  3. One to initialize new replicas by cloning the primary.
  4. One to reintegrate or rebuild restarted nodes.
  5. One to reconfigure PgBouncer during standby promotions.
  6. The entrypoint script itself.

That’s a lot of work! We can reclaim some of the work from our previous forays into Docker, but let’s take this one step at a time.

By our Bootstraps

The first script is meant to build and configure a Postgres instance and accompanying repmgr database. Let’s name it init_stack.sh for truth in advertising.

#!/bin/bash
 
PG_CONF=${PGDATA}/postgresql.conf
 
chown postgres:postgres ${PGDATA}
chmod 700 ${PGDATA}
 
gosu postgres initdb -D ${PGDATA}
 
sed -i "s/#*\(shared_preload_libraries\).*/\1='repmgr'/;" ${PG_CONF}
 
cat<<EOF>>${PGDATA}/pg_hba.conf
host all all all trust
host replication all all trust
EOF
 
gosu postgres pg_ctl -D ${PGDATA} start
 
createuser -U postgres -s --replication repmgr
createdb -U postgres -O repmgr repmgr
 
psql -U postgres -d repmgr -c "CREATE SEQUENCE public.repmgr_node"
 
conf_repmgr.sh
gosu postgres repmgr primary register
 
echo -n ${HOSTNAME} > /etc/pgbouncer/primary

It looks like there’s a lot going on here, but it’s actually fairly simple. Here’s what it does:

  1. Creating a new Postgres instance with initdb.
  2. Ensure all nodes can connect to each other by making pg_hba.conf really permissive. (Don’t do this in production…)
  3. Adding the repmgr library to the configuration file.
  4. Creating a repmgr user and repmgr database.
  5. Defining a sequence for assigning node IDs. This is far safer than using max(node_id) like we were before, since it avoids race conditions.
  6. Calling the script to configure repmgr itself.
  7. Register the node with repmgr as the primary since we know it’s the first node.
  8. Trigger PgBouncer to reconfigure itself by putting our hostname in the /etc/pgbouncer/primary file its entrypoint is watching.

Since this script itself depends on configuring repmgr, that should be our next step.

Setting up repmgr

Configuring repmgr is also eerily similar to previous efforts. The only real difference is how we determine the initial node ID. If this is the primary node, we connect to ourselves to get the next sequence value from our node sequence. Otherwise, we must connect to PgBouncer to obtain the value from the primary node.

This next script is named conf_repmgr.sh:

#!/bin/bash
 
if [ -s /etc/repmgr.conf ]; then
    return
fi
 
if [ "$(psql -qAt -U postgres -c 'SELECT 1')" = '1' ]; then
    USEHOST=${HOSTNAME}
else
    USEHOST=$(cat /etc/pgbouncer/hostname)
fi
 
my_node=$(psql -h ${USEHOST} -U postgres -qAt -d repmgr \
               -c "SELECT nextval('public.repmgr_node')")
 
cat<<EOF>/etc/repmgr.conf
node_id=${my_node}
node_name=${HOSTNAME}
conninfo=host=${HOSTNAME} user=repmgr dbname=repmgr
data_directory=${PGDATA}
 
pg_bindir=/usr/lib/postgresql/10/bin
use_replication_slots=1
 
failover=automatic
reconnect_interval=1
promote_command=promote_current.sh
follow_command=repmgr standby follow --wait
 
service_start_command=pg_ctl -D ${PGDATA} start
service_stop_command=pg_ctl -D ${PGDATA} stop -m fast
service_restart_command=pg_ctl -D ${PGDATA} restart -m fast
service_reload_command=pg_ctl -D ${PGDATA} reload
EOF

Besides how we get the node ID itself, everything else is mostly the same. Take note however, that we set promote_command to promote_current.sh. The repmgrd daemon will launch that script on a node if it gets promoted to be the new cluster primary. As such, it doesn’t need any parameters as we can derive everything we need from the environment of the container itself.

I Think I’m a Clone Now

In order to build a clone with repmgr, we only need to know the location of the primary node. Conveniently, that’s already done for us in the /etc/pgbouncer/primary file. Beyond that, we merely need to start and register the node itself.

This script is named init_replica.sh:

#!/bin/bash
 
PRIMARY=$(cat /etc/pgbouncer/primary)
 
repmgr -h ${PRIMARY} -U repmgr -d repmgr standby clone --fast-checkpoint
 
pg_ctl -D ${PGDATA} start
repmgr standby register

See? Nothing to it. The script designated for rebuilding a returning node is slightly more involved, but not by much.

We Can Rebuild Him!

Reintegrating a node can have a few variants. First, we can attempt to rejoin the cluster immediately. In that case, repmgr will determine whether or not to rewind if this node was previously a primary instead of a standby.

If the node wasn’t shut down cleanly, the rewind doesn’t work, or some other error shows up, it’s too much for our script’s tiny little mind. At that point, it’s best to simply erase the old node and start over.

And of course, this script is called rebuild_node.sh:

#!/bin/bash
 
PRIMARY=$(cat /etc/pgbouncer/primary)
 
repmgr -h ${PRIMARY} -U repmgr -d repmgr node rejoin
 
if [ $? -ne 0 ]; then
    rm -Rf ${PGDATA}/*
    repmgr -h ${PRIMARY} -U repmgr -d repmgr standby clone --fast-checkpoint
    pg_ctl -D ${PGDATA} start
    repmgr standby register --force
fi

Note that by using the --force argument, the newly rebuilt replica retains its node ID and other previously established attributes. Waste not, want not!

Shameless Self Promotion

Now we’re at the exciting part. If a primary node fails or is shut down, we still need a writable primary. After a suitable timeout, the repmgrd daemon will call this next script only on the standby which will become the new cluster leader.

As such, there are always three phases:

  1. Pause the PgBouncer pool.
  2. Reconfigure.
  3. Reload configuration files, and resume the pool.

This script goes by promote_current.sh:

#!/bin/bash
 
BOUNCER=$(cat /etc/pgbouncer/hostname)
 
psql -h ${BOUNCER} -q -c 'PAUSE' pgbouncer
 
repmgr standby promote
 
sed -i "s/\* = .*/* = host=${HOSTNAME}/" /etc/pgbouncer/pgbouncer.ini
echo -n ${HOSTNAME} > /etc/pgbouncer/primary
 
psql -h ${BOUNCER} -q -c 'RELOAD' pgbouncer
psql -h ${BOUNCER} -q -c 'RESUME' pgbouncer

We already know how to change the entry that redirects all traffic in the pgbouncer.ini file, but to be good parents, we should also update the /etc/pgbouncer/primary tracking file we’ve been using. Isn’t sharing fun?

Building a Better Doorway

Now it’s time for our coordinator entrypoint script to get to work. It contains a bit more logic than the others, so let’s take it in a few parts.

#!/bin/bash
 
chown -R postgres:postgres /etc/pgbouncer
 
if [ ! -f /etc/pgbouncer/primary ]; then
    flock -w 0 -E 0 /etc/pgbouncer/primary.tmp init_stack.sh
else
    if [ ${HOSTNAME} == "$(cat /etc/pgbouncer/primary)" ]; then
        gosu postgres pg_ctl -D ${PGDATA} start
    fi
fi

The beginning of the script needs to either initialize a new cluster, or if the last known primary is restarting, continue from where the cluster left off.

Since we’re sharing a directory with multiple containers, we can leverage flock to ensure only one container is ever able to initialize the cluster, even if hundreds start simultaneously. All other containers will wait until the winner is done setting up, and will then become new replicas.

until [ -f /etc/pgbouncer/primary ] && [ -f /etc/pgbouncer/hostname ]; do
    echo "Waiting for Primary to initialize"
    sleep 1
done
 
BOUNCER=$(cat /etc/pgbouncer/hostname)
 
until [ "$(psql -h ${BOUNCER} -qAt -U postgres -c 'SELECT 1')" == '1' ]; do
    sleep 1
done

Here the losers of the lock race sit around and wait for the primary node to initialize, and for PgBouncer to follow suit. Idling in a loop waiting for PgBouncer to connect to the primary ensures all subsequent steps have a working upstream target.

if [ ! -f ${PGDATA}/PG_VERSION ]; then
    conf_repmgr.sh
    gosu postgres init_replica.sh
fi
 
if [ "$(psql -qAt -U postgres -c 'SELECT 1')" != '1' ]; then
    gosu postgres rebuild_node.sh
fi
 
gosu postgres repmgrd -v

Finally, if no node exists, configure repmgr and initialize the replica. If postgres still isn’t running on this node, it doesn’t fit any of the above criteria, so it’s either an old node that’s rejoining, or some other case we don’t handle. Try to rejoin the cluster, or failing that, rebuild the node.

Then we start the repmgrd daemon to watch over the local Postgres node and trigger a promotion if necessary, or follow a promoted standby otherwise.

Spitting Image

Which brings us to the Dockerfile that defines our new image. Have a gander:

FROM postgres:latest
 
RUN echo "deb http://apt.postgresql.org/pub/repos/apt/ stretch-pgdg-testing main 10" \
          >> /etc/apt/sources.list.d/pgdg.list; \
    apt-get update -y; \
    apt-get install -y postgresql-10-repmgr repmgr-common=4.0\*
 
VOLUME /etc/pgbouncer
 
RUN touch /etc/repmgr.conf
 
COPY scripts/*.sh /usr/local/bin/

The only real differences between this and the Dockerfile we used for our postgres-repmgr image, is that we’ve defined the /etc/pgbouncer shared volume, and installed all scripts into /usr/local/bin. This both overwrites the official Postgres entrypoint script, and makes all the scripts easily available in the container.

Now to build:

$> find postgres-stack -type f
 
postgres-stack/scripts/init_replica.sh
postgres-stack/scripts/rebuild_node.sh
postgres-stack/scripts/conf_repmgr.sh
postgres-stack/scripts/promote_current.sh
postgres-stack/scripts/init_stack.sh
postgres-stack/scripts/docker-entrypoint.sh
postgres-stack/Dockerfile
 
$> docker build --tag postgres-stack postgres-stack
 
...
Successfully built 7ad62dae5f71
Successfully tagged postgres-stack:latest

Putting Everything in Motion

Consider what happens if we want to spin up a cluster of three Postgres nodes and the new PgBouncer image. We also need the shared volume, a network bridge, and of course, a mapped port so we can communicate with our stack.

That’s a lot of moving parts, so for our last trick, we’re going to make use of docker-compose. Once installed, it lets is create a file named docker-compose.yml which will define the organization of our cluster stack once. Then we can start or stop the entire cluster as a single entity.

This is the docker-compose.yml file for a basic 3-node cluster with this stack:

version: "3.0"
services:
  pg1:
    image: postgres-stack:latest
    volumes:
     - pgbouncer:/etc/pgbouncer
  pg2:
    image: postgres-stack:latest
    volumes:
     - pgbouncer:/etc/pgbouncer
  pg3:
    image: postgres-stack:latest
    volumes:
     - pgbouncer:/etc/pgbouncer
  pgbouncer:
    image: pgbouncer:latest
    ports:
     - "6543:5432"
    volumes:
     - pgbouncer:/etc/pgbouncer
 
volumes:
  pgbouncer:

Now we can start the stack, check the state of repmgr, and connect to Postgres from the host machine to prove everything is working as advertised:

$> docker-compose up -d
 
Creating volume "stackcompose_pgbouncer" with default driver
Creating stackcompose_pg1_1 ... 
Creating stackcompose_pg3_1 ... 
Creating stackcompose_pgbouncer_1 ... 
Creating stackcompose_pg2_1 ... 
Creating stackcompose_pg1_1
Creating stackcompose_pgbouncer_1
Creating stackcompose_pg2_1
Creating stackcompose_pg1_1 ... done
 
$> docker exec -it stackcompose_pg1_1 gosu postgres repmgr cluster show
 
 ID | Name         | Role    | Status    | Upstream     | Location | Connection string                          
----+--------------+---------+-----------+--------------+----------+---------------------------------------------
 1  | adafb9033a63 | primary | * running |              | default  | host=adafb9033a63 user=repmgr dbname=repmgr
 2  | 3eeaa6e16f1a | standby |   running | adafb9033a63 | default  | host=3eeaa6e16f1a user=repmgr dbname=repmgr
 3  | ddb7a3a5af37 | standby |   running | adafb9033a63 | default  | host=ddb7a3a5af37 user=repmgr dbname=repmgr
 
$> psql -p 6543 -U postgres -h localhost -c "CREATE TABLE hello_world(id INT);"
 
CREATE TABLE

Not only were we able to connect, but creating a table proves we definitely reached the primary node.

A Useful Tool

The last entry in the series was more of a cute toy. Having an immortal cluster is fine and dandy, but it wasn’t especially useful since we could only interact with it using other Docker tools. Now we have the same immortal cluster, but we can actually connect to it from our host machine.

That drastically opens up the amount of useful work we can extract from this stack. Build-up and tear-down are fast enough we can even use it for testing cluster deployment scenarios. Online migrations, code deployment, unit, load, or integration testing, and more. There are endless scenarios we can imagine that could benefit from an ephemeral yet stable database target.

It doesn’t matter where the application source lives: inside Docker, the local host system, or even outside on the local network. All of them can utilize this stack as presented.

Now go forth and invoke your new Postgres RAD stack, and it will never let you down. Just don’t use it in production! We may have a later series focused on improving Postgres Docker security, but the current stack is meant for RAD use only, unless you want anyone on the WAN able to connect to your database as a superuser.

Otherwise, happy hunting.

5 Comments

    • Shaun Thomas

      That’s correct. However if you specified an NFS mount or something similar, it would probably work as well. Though I’m not sure how well flock works in those kinds of situation.

  • Steve

    What is the name of the first script? When I invoke the first “build”, the “COPY scripts/docker-entrypoint.sh /usr/local/bin/” causes all sorts of hang and errors.

    • Shaun Thomas

      The script name is: scripts/docker-entrypoint.sh

      As the previous find command illustrated, there are only two files defined for the PgBouncer image: the Dockerfile itself, and the entrypoint replacement script.

Leave a Reply

Your email address will not be published. Required fields are marked *