<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>2ndQuadrant, Professional PostgreSQL - Greg's PlanetPostgreSQL</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/" />
    <link rel="self" type="application/atom+xml" href="http://blog.2ndquadrant.com/en/atom-greg-planetpostgresql.xml" />
    <id>tag:blog.2ndquadrant.com,2010-01-19:/en//3</id>
    <updated>2010-02-24T18:46:44Z</updated>
    <subtitle>2ndQuadrant Ltd official blog</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Open Source 4.12</generator>

<entry>
    <title>Trade-offs in Hot Standby Deployments</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2010/02/tradeoffs-in-hot-standby-deplo.html" />
    <id>tag:blog.2ndquadrant.com,2010:/en//3.79</id>

    <published>2010-02-24T07:25:39Z</published>
    <updated>2010-02-24T18:46:44Z</updated>

    <summary>The new Hot Standby feature in the upcoming PostgreSQL 9.0 allows running queries against standby nodes that previously did nothing but execute a recovery process. Two common expectations I&apos;ve heard from users anticipating this feature are that it will allow...</summary>
    <author>
        <name>Greg Smith</name>
        <uri>http://www.2ndQuadrant.us/</uri>
    </author>
    
        <category term="Greg&apos;s PlanetPostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="PostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="hotstandbyusergroup" label="Hot Standby User Group" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[<p>The new Hot Standby feature in the upcoming PostgreSQL 9.0 allows running queries against standby nodes that previously did nothing but execute a recovery process.  Two common expectations I've heard from users anticipating this feature are that it will allow either distributing short queries across both nodes or running long reports against the standby without using resources on the master.  Both are possible right now, but unless you understand the trade-offs involved in how Hot Standby works, you may run into some unanticipated behavior.</p>

<h2>Standard Long-running Queries</h2>

<p>One of the traditional problems in a database using MVCC, like PostgreSQL, is that a long-running query has to keep open a resource--referred to as a <a href="http://www.postgresql.org/docs/current/static/transaction-iso.html">snapshot</a> in the current Postgres implementation--to prevent the database from removing data the query needs to operate.  For example, even after another client has deleted a row and committed, you can't actually wipe out the physical disk blocks related to that row if an already running query still needs it to complete.  You have to wait until no open queries that expect that row to be visible are still around.</p>

<h2>Hot Standby Limitations</h2>

<p>If you have a long-running query you want Hot Standby to execute, there are a couple of types of bad things that can happen when the recovery process is applying updates.  These are described in detail in the <a href="http://developer.postgresql.org/pgdocs/postgres/hot-standby.html">Hot Standby Documentation</a>.  Some of these bad things will cause queries running on the standby to be canceled for reasons that might not be intuitively obvious:</p>

<ul>
<li>A HOT update or VACUUM related update arrives to delete something that query expects to be visible</li>
<li>A B-tree deletion appears</li>
<li>There is a locking issue between the query you're running and what locks are required for the update to be processed.</li>
</ul>

<p>The lock situation is difficult to deal with, but it's not very likely to last long in practice if you're just running read-only queries on the standby, because those will be isolated via MVCC.  The other two are not hard to run into.  The basic thing to understand is that <em>any</em> UPDATE or DELETE on the master can lead to interrupting any query on the standby; it doesn't matter whether the changes even relate to what the query is doing.</p>

<h2>Good, fast, cheap:  pick two</h2>

<p>Essentially, there are three things people might want to prioritize:</p>

<ol>
<li>Avoid master limiting:  Allow xids and associated snapshots to advance unbounded on the master, so that VACUUM and similar cleanup isn't held back by what the standby is doing</li>
<li>Unlimited queries:  Run queries on the standby for any arbitrary period of time</li>
<li>Current recovery:  Keep the recovery process on the standby up to date with what's happening on the master, allowing fast fail-over for HA</li>
</ol>

<p>In any situation with Hot Standby, it's literally impossible to have all three at once.  You can only pick your trade-off.  The tunable parameters available already let you optimize a couple of ways:</p>

<ul>
<li>Disabling all these delay/defer settings optimizes for always current recovery, but then you'll discover queries are more likely to be canceled than you might expect.</li>
<li><em>max_standby_delay</em> optimizes for longer queries, at the expense of keeping recovery current.  This delays applying updates to the standby once one that will cause a problem (HOT, VACUUM, B-tree delete, etc.) appears.</li>
<li><em>vacuum_defer_cleanup_age</em> and some snapshot hacks can introduce some master limiting to improve on the other two issues, but with a weak UI for doing so.  vacuum_defer_cleanup_age is in units of transaction IDs.  You need some idea of the average amount of xid churn on your system per unit of time to turn the way people think about this problem ("defer by at least 1 hour so my reports will run") into a setting for this value.  xid consumption rate just isn't a common or even reasonable thing to measure or predict.  Alternately, you can open a snapshot on the primary before starting a long-running query on the standby.  dblink is suggested in the Hot Standby documentation as a way to accomplish that.  Theoretically, a user-land daemon living on the primary could be written to work around this problem on the standby's behalf too (Simon has a basic design for one).  Basically, you start a series of processes that each acquire a snapshot and then sleep for a period before releasing it.  By staggering how long each one sleeps, you could ensure xid snapshots never advance too quickly on the master.  It should already be obvious how much of a terrible hack this would be.</li>
</ul>
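
<p>To make the vacuum_defer_cleanup_age conversion concrete, here is a minimal sketch; it is not something taken from the Hot Standby documentation.  It assumes you have sampled the transaction ID counter on the master twice, a known number of seconds apart, for example with psql -At -c "select txid_current()".  The function name and its arguments are made up for illustration, and the estimate is only as good as the assumption that xid consumption stays steady:</p>

<pre><code>
# Hypothetical helper: estimate vacuum_defer_cleanup_age (in xids) from
# two xid samples taken sample_secs apart on the master.
# Arguments: xid_start xid_end sample_secs defer_secs
defer_cleanup_age() {
  xid_start=$1
  xid_end=$2
  sample_secs=$3
  defer_secs=$4
  # A zero or negative sample window can't produce a rate
  if [ "$sample_secs" -le 0 ]; then
    return 1
  fi
  # xids consumed during the sample, scaled up to the deferral window
  echo $(( ($xid_end - $xid_start) * $defer_secs / $sample_secs ))
}

# 3600 xids consumed in 60 seconds, defer cleanup by one hour
defer_cleanup_age 1000 4600 60 3600
</code></pre>

<p>That example prints 216000, which is the sort of number you'd then have to keep re-deriving as your workload changes.  It illustrates why this is a weak UI compared to just saying "one hour".</p>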

<h2>Potential Improvements</h2>

<p>The only one of these you can really do something about cleanly is tightening up and improving the UI for the master limiting.  That turns this into the traditional problem already present in the database:  a long-running query holds open a snapshot (or at least limits the advance of visibility related transaction IDs) on the master, preventing the master from removing things needed for that query to complete.  You might alternately think of this as an auto-tuning vacuum_defer_cleanup_age.</p>

<p>The question is how to make the <em>primary</em> respect the needs of long running queries on the <em>standby</em>.  This might be possible if more information about the transaction visibility requirements of the standby were shared with the master.  Doing that sort of exchange would really be something more appropriate for the new Streaming Replication implementation to share.  The way a simple Hot Standby server is provisioned does not provide any feedback toward the master suitable for this data to be exchanged, besides approaches like the already mentioned dblink hack.</p>

<p>With PostgreSQL 9.0 just reaching a fourth alpha release, there may still be time to see some improvements in this area before the 9.0 release.  It would be nice to see Hot Standby and Streaming Replication really integrated together in a way that accomplishes things that neither is fully capable of doing on its own before coding on this release completely freezes.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Measuring PostgreSQL Checkpoint Statistics</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2010/01/measuring-postgresql-checkpoin.html" />
    <id>tag:blog.2ndquadrant.com,2010:/en//3.75</id>

    <published>2010-01-29T07:25:26Z</published>
    <updated>2010-01-29T18:29:41Z</updated>

    <summary>Checkpoints can be a major drag on write-heavy PostgreSQL installations. The first step toward identifying issues in this area is to monitor how often they happen, which recently got an easier-to-use interface in the database....</summary>
    <author>
        <name>Greg Smith</name>
        <uri>http://www.2ndQuadrant.us/</uri>
    </author>
    
        <category term="Greg&apos;s PlanetPostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="PostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="postgresqlperformance" label="postgresql performance" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[<p>Checkpoints can be a major drag on write-heavy PostgreSQL installations.  The first step toward identifying issues in this area is to monitor how often they happen, which recently got an easier-to-use interface in the database.</p>]]>
        <![CDATA[<p>Checkpoints are periodic maintenance operations the database performs to make sure that everything it's been caching in memory has been synchronized with the disk.  The idea is that once you've finished one, you can eliminate needing to worry about older entries placed into the write-ahead log of the database.  That means less time to recover after a crash.</p>

<p>The problem with checkpoints is that they can be very intensive, because to complete one requires writing every single bit of changed data in the database's buffer cache out to disk.  There were a number of features added to PostgreSQL 8.3 that allow you to better monitor the checkpoint overhead, and to lower it by spreading the activity over a longer period of time.  I wrote a long article about those changes called  <a href="http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm">Checkpoints and the Background Writer</a> that goes over what changed, but it's pretty dry reading.</p>

<p>What you probably want to know is how to monitor checkpoints on your production system, and how to tell if they're happening too often.  Even though things have improved, "checkpoint spikes" where disk I/O becomes really heavy are still possible even in current PostgreSQL versions.  And it doesn't help that the default configuration is tuned for very low disk space and fast crash recovery rather than performance.  The checkpoint_segments parameter that's one input on how often a checkpoint happens defaults to 3, which forces a checkpoint after only 48MB of writes.</p>
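
<p>The arithmetic behind that 48MB figure is simple enough to sketch.  The function name here is hypothetical, and it assumes the default 16MB WAL segment size; a server built with a different segment size would need the second argument changed:</p>

<pre><code>
# Hypothetical helper: megabytes of WAL written before checkpoint_segments
# forces a checkpoint, assuming 16MB WAL segments unless told otherwise.
# Arguments: segments [segment_mb]
checkpoint_trigger_mb() {
  segments=$1
  segment_mb=${2:-16}
  echo $(( $segments * $segment_mb ))
}

checkpoint_trigger_mb 3    # the default setting
checkpoint_trigger_mb 16   # an illustrative higher setting, not a recommendation
</code></pre>

<p>The first call prints 48 and the second prints 256, which gives a feel for how much more write volume a higher setting lets through before a checkpoint is forced.</p>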

<p>You can find out checkpoint frequency two ways.  You can turn on log_checkpoints and watch what happens in the logs.  You can also use the pg_stat_bgwriter view, which gives a count of each of the two sources for checkpoints (time passing and writes occurring) as well as statistics about how much work they did.</p>

<p>The main problem with making that easier to do is that until recently, it's been impossible to reset the counters inside of pg_stat_bgwriter.  That means you have to take a snapshot with a timestamp on it, wait a while, take another snapshot, then subtract all the values to derive any useful statistics from the data.  That's a pain.</p>

<p>Enough of a pain that I  <a href="http://archives.postgresql.org/message-id/4B4F8A96.5080004@2ndquadrant.com">wrote a patch</a> to make it easier.  With the current development version of the database, you can now call pg_stat_reset_shared('bgwriter') and pop all these values back to 0 again.  This allows following a practice that used to be common on PostgreSQL.  Before 8.3, there was a parameter named stats_reset_on_server_start you could turn on.  That reset all of the server's internal statistics each time you started it.  That meant that you could call the handy pg_postmaster_start_time() function, compare with the current time, and always have an accurate count in terms of operations/second of any statistic available on the system.</p>

<p>It's still not automatic, but now that resetting these shared pieces is possible you can do it yourself.  The first key is to integrate statistics clearing into your server startup sequence.  A script like this will work:</p>

<pre><code>
pg_ctl start -l "$PGLOG" -w
psql -c "select pg_stat_reset();"
psql -c "select pg_stat_reset_shared('bgwriter');"
</code></pre>

<p>Note the "-w" on the start command there--that will make pg_ctl wait until the server is finished starting before it returns, which is vital if you want to immediately execute a statement against it.</p>

<p>If you've done that, and your server start time is essentially the same as when the background writer stats started collection, you can now use this fun query:</p>

<pre><code>
SELECT
  total_checkpoints,
  -- NULLIF avoids a division-by-zero error before the first checkpoint
  seconds_since_start / NULLIF(total_checkpoints, 0) / 60 AS minutes_between_checkpoints
FROM
  (SELECT
      EXTRACT(EPOCH FROM (now() - pg_postmaster_start_time())) AS seconds_since_start,
      (checkpoints_timed+checkpoints_req) AS total_checkpoints
    FROM pg_stat_bgwriter
  ) AS sub;
</code></pre>

<p>And get a simple report of exactly how often checkpoints are happening on your system.  The output looks like this:</p>

<pre><code>
total_checkpoints           | 9
minutes_between_checkpoints | 3.82999310740741
</code></pre>

<p>What you do with this information is stare at the average time interval and see if it seems too short.  Normally, you'd want a checkpoint to happen no more often than every five minutes, and on a busy system you might need to push it to ten minutes or more to have a hope of keeping up.  With this example, every 3.8 minutes is probably too frequent--this is a system that needs checkpoint_segments to be higher.</p>

<p>Using this technique to measure the checkpoint interval lets you know if you need to increase the checkpoint_segments and checkpoint_timeout parameters in order to achieve that goal.  You can compute the numbers manually right now, and once 9.0 ships it's something you can consider making completely automatic--so long as you don't mind your stats going away each time the server restarts.</p>
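
<p>If you're on a version where the counters can't be reset, the manual snapshot-and-subtract computation can be sketched like this.  The function name is hypothetical.  It assumes you have captured the sum of checkpoints_timed and checkpoints_req from pg_stat_bgwriter at two points in time, each paired with an epoch timestamp such as the one returned by select extract(epoch from now()):</p>

<pre><code>
# Hypothetical helper: average minutes between checkpoints, given two
# snapshots of (epoch seconds, checkpoints_timed + checkpoints_req).
# Arguments: epoch1 total1 epoch2 total2
minutes_between_checkpoints() {
  awk -v t1="$1" -v c1="$2" -v t2="$3" -v c2="$4" 'BEGIN {
    checkpoints = c2 - c1
    # No checkpoints between the snapshots means there is no interval to report
    if (checkpoints == 0) { print "no checkpoints in interval"; exit }
    printf "%.2f\n", (t2 - t1) / 60 / checkpoints
  }'
}

# Snapshots taken one hour apart, with 12 checkpoints in between
minutes_between_checkpoints 1264750000 10 1264753600 22
</code></pre>

<p>That call prints 5.00.  The same subtraction works for any of the other counters in the view, as long as both snapshots come from the same server run.</p>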

<p>There are some other interesting ways to analyze the data the background writer provides for you in pg_stat_bgwriter, but I'm not going to give away all of my tricks today.</p>]]>
    </content>
</entry>

</feed>
