Let me discuss a topic that is not inherently PostgreSQL specific, but that I regularly run into while investigating issues on customer systems, evaluating “supportability” of those systems, etc. It’s the importance of having a monitoring solution for system metrics, configuring it reasonably, and why
sar is still by far my favorite tool (at least on Linux).
Firstly, monitoring of basic system metrics (CPU, I/O, memory) is extremely important. It’s a bit strange having to point this out in discussions with other engineers, but I’d say 1 in 10 engineers thinks they don’t really need monitoring. The reasoning usually goes along these lines:
It’s just another source of useless overhead. You don’t really need monitoring unless there’s an issue, and issues should be rare. And if there’s an issue, we can enable the monitoring temporarily.
It’s true monitoring adds overhead, no doubt about it. But it’s likely negligible compared to what the application is doing. Actually,
sar is not really adding any extra instrumentation, it’s merely reading counters from the kernel, computing deltas and writing that to disk. It may need some disk space and I/O (depending on the number of CPUs and disks) but that’s about it.
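To make the counters-and-deltas idea concrete, here is a minimal sketch in plain shell (Linux-only; an illustration of the principle, not how sadc is actually implemented):

```shell
# Read the cumulative CPU tick counters the kernel exposes in /proc/stat
# (first line: "cpu  user nice system idle iowait ..."), wait a second,
# read them again, and report the deltas. No extra instrumentation is
# involved - the kernel maintains these counters anyway.
read -r _ u1 n1 s1 i1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 _ < /proc/stat
busy=$(( (u2 + n2 + s2) - (u1 + n1 + s1) ))
idle=$(( i2 - i1 ))
echo "busy ticks: $busy, idle ticks: $idle (over ~1 second)"
```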
For example, collecting per-second statistics on a machine with 32 cores and multiple disks will produce ~5GB of raw data per day, but it compresses extremely well, often to ~5-10%. And it’s barely visible in
top. Per-second resolution is a bit extreme, and using 5 or 10 seconds will further reduce the overhead.
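The disk space is also easy to manage, because sysstat can rotate and compress its own data files. The settings below live in the sysstat configuration file (typically /etc/sysconfig/sysstat on RHEL-family systems, /etc/sysstat/sysstat on Debian); treat the exact path and defaults as distribution-specific:

```shell
# Illustrative sysstat retention settings - check your distribution's file
HISTORY=28          # keep daily data files for 28 days
COMPRESSAFTER=7     # compress data files older than 7 days
```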
So no, it turns out the overhead is not a valid reason not to enable monitoring.
More importantly though, “How much overhead do I eliminate by not enabling monitoring?” is the wrong question to ask. Instead you should be asking “What benefits do I get from the monitoring? Do the benefits outweigh the costs?”
We already know the costs (overhead) are fairly small or entirely negligible. What are the benefits? In my experience, having monitoring data is effectively invaluable.
Firstly, it allows you to investigate issues – looking at a bunch of charts and searching for sudden changes is surprisingly effective, and often points you directly at the root cause. Similarly, comparing the current data (collected during the issue) to a baseline (collected when everything is fine) is very useful, and impossible if you only enable monitoring when things break.
Secondly, it allows you to evaluate trends and identify potential issues before they actually hit you. How much CPU are you using? Is the CPU usage growing over time? Are there some suspicious patterns in memory usage? You can only answer those questions if you have the monitoring in place.
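Both use cases come down to querying the daily data files sar already keeps. A couple of illustrative commands (the file name sa15 and the /var/log/sa path are assumptions; Debian-based systems use /var/log/sysstat):

```
# CPU usage during a particular incident window on the 15th
sar -u -f /var/log/sa/sa15 -s 09:00:00 -e 10:00:00

# the same data in CSV-like form, e.g. to compare against a baseline day
sadf -d /var/log/sa/sa15 -- -u
```

The -s/-e switches restrict the output to a time window, which is exactly what you want when comparing an incident against a quiet baseline period.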
sar is my favorite tool
Let’s assume I’ve convinced you monitoring is important and you should definitely do it. But why is
sar my favorite tool, when there are various fancy alternatives, both on-premise and cloud-based?
I do admit some of this comes from the fact that I work for a company providing PostgreSQL services to other companies (be it 24×7 support or Remote DBA). So we usually get only very limited access to customer systems (mostly just the database servers and nothing more). That means having all the important data on the database server itself, accessible over plain SSH, is extremely convenient, and eliminates unnecessary round-trips just to request another piece of data from some other system. Which saves both time and sanity on both sides.
If you have many systems to manage, you’ll probably prefer a monitoring solution that collects data from many machines to a single place. But for me,
sar still wins.
I mentioned installing and enabling
sar (or rather
sysstat, which is the package including
sar) is very simple. Unfortunately, the default configuration is somewhat bad. After installing
sysstat, you’ll find something like this in
/etc/cron.d/sysstat (or wherever your distribution stores cron jobs):
*/10 * * * * root /usr/lib64/sa/sa1 1 1
This effectively says the
sa1 command will be executed every 10 minutes, and it will collect a single sample over 1 second. There are two issues here. Firstly, 10 minutes is fairly low resolution. Secondly, the sample only covers 1 second out of 600, so the remaining 9:59 are not really included in it. This is somewhat OK for long-term trending, where low-resolution random sampling is sufficient. For other purposes you probably need to do something like this instead:
* * * * * root /usr/lib64/sa/sa1 -S XALL 60 1
This collects one sample per minute, and every sample covers the whole minute. The
-S XALL means all statistics should be collected, including interrupts, individual block devices and partitions, etc. See
man sadc for more details.
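With -S XALL in place, those extra statistics can then be read back from the same daily files, for example (file name again illustrative):

```
sar -d -f /var/log/sa/sa15       # per-device block I/O statistics
sar -I SUM -f /var/log/sa/sa15   # interrupt counts
```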
So, to sum this post into a few simple points:
sar is convenient and very efficient. Maybe you’ll use something else in the future, but it’s a good first step.
One thing I haven’t mentioned is that
sar only deals with system metrics – CPU, disks, memory, processes – not with PostgreSQL statistics. You should definitely monitor that part of the stack too, of course.