Monday, June 26

What is SKIP LOCKED for in PostgreSQL 9.5?

PostgreSQL 9.5 introduces a new SKIP LOCKED option to SELECT ... FOR [KEY] UPDATE|SHARE. It’s used in the same place as NOWAIT and, like NOWAIT, affects behaviour when the tuple is locked by another transaction.

The main utility of SKIP LOCKED is for building simple, reliable and efficient concurrent work queues.

Most work queue implementations in SQL are wrong

The majority of PostgreSQL-based implementations of work queues I’ve seen in applications, on Stack Overflow etc have been buggy in one of a few ways:

  • It fails to consider that statements don’t execute atomically, tries to use subqueries and/or writeable CTEs as if the whole statement is a single atomic unit, and as a result hands out the same work queue entry to multiple workers when run concurrently;
  • It marks a job as done as soon as it hands it out and no recovery mechanism for if a worker takes a job then fails, so the job is just lost; or
  • It thinks it’s handing jobs out concurrently, but in practice all but one worker are blocked on a row lock so all workers take turns getting jobs.

The few exceptions I’ve seen generally use PostgreSQL’s advisory locking features or use various time-and-expiry based methods of queue cleanup and recovery. Including the popular off-the-shelf ones that are known to work well. They do, just at a performance cost.

“Find me the next unclaimed row” shouldn’t be hard, though, surely?

The hard part of the problem boils down to:

“How do I find the first row (by some given ordering) in a queue table that nobody else has claimed and claim it for myself? It needs to automatically revert to being unclaimed again if I crash or exit for any reason. Many other workers will be doing the same thing at the same time. It is vital that each item get processed exactly once; none may be skipped and none may be processed more than once.”

This is harder than you’d think because SQL statements do not execute atomically. A subquery might run before the outer query, depending on how the planner/optimizer does things. Many of the race conditions that can affect series of statements can also affect single statements with CTEs and subqueries, but the window in which they occur is narrower because the statement-parts usually run closer together. So lots of code that looks right proves not to be, it’s just right 99.95% of the time, or it’s always right until that day your business gets a big surge of business and concurrency goes up. Sometimes that’s good enough. Often it isn’t.

Additionally, PostgreSQL uses MVCC with snapshots to control row visibility. The main consequence is that you can’t “see” a row another transaction inserted until it commits. Even then PostgreSQL only lets that row “appear” to existing transactions at certain limited points, including the start of a new statement. PostgreSQL makes a few exceptions to this rule, most notably with UNIQUE indexes, but in general you can’t see uncommitted rows. PostgreSQL does not support the DIRTY READ isolation level that would permit this.

That means that tricks like:

UPDATE queue
SET is_done = 't'
WHERE itemno = (
  SELECT itemno
  FROM queue
  WHERE NOT is_done
  ORDER BY itemno
  FOR UPDATE
  LIMIT 1
)
RETURNING itemno

look good, but don’t work.

If two are started at exactly the same moment, the subSELECTs for each will be processed first. (It’s not that simple, but we can pretend for this purpose. Don’t rely on subqueries executing in any particular order for correctness). Each one scans for a row with is_done = 'f'. Both find the same row and attempt to lock it. One succeeds, one waits on the other one’s lock. Whoops, your “concurrent” queue is serialized. If the first xact to get the lock rolls back the second gets the row and tries the same row.

If the first xact commits, the second actually gets zero rows and UPDATE and returns nothing. PostgreSQL doesn’t re-execute the SELECT part, it just notices that the row was modified by another transaction while locked and re-evaluates the WHERE clause for the row. Since is_done = 't' now, the row no longer matches the condition and is excluded. The subquery returns zero rows, which is null, and no itemid is = NULL because nothing is equal to null, so the UPDATE does nothing.

If we remove the FOR UPDATE in the subquery we get a different bug. Now both subqueries can find the same itemid, but one won’t wait on a lock. Both will return it. Both UPDATEs will race to lock the row. One will succeed, the other will wait for the first to commit or roll back. If the first rolls back we’re fine, the second can continue. But if the first commits, the second will also mark the already-completed row as is_done = 't' and return it!. The outer UPDATEs don’t have any test for is_done since the inner subquery does that… so they’ll both happily update the same row.

If you protect against that by adding a second WHERE is_done = 'f' to the UPDATE its self, you get straight back to the first situation where one waits for the other and then returns zero rows.

You can create variants on this ad-nauseum, but they won’t work. Some of them will merrily return duplicate rows during concurrent execution. Some will not return duplicate rows but will not be usefully concurrent. Some will return duplicate rows and still wait for each other.

Things that don’t work

Solutions that rely on updating the row to set a ‘claimed’ flag or active worker ID or similar need another update to mark the row as completed after it’s claimed, which has a performance cost. They need a way to determine that a worker is never going to finish updating the row, be certain that the worker isn’t going to still be trying, and then assign it to another worker. Additionally, they still can’t actually assign new work items concurrently, they just try to make the assignment step very fast. These approaches don’t scale to very high queue throughput.

Solutions that rely on locking the row while the worker proceeds need an open transaction usually look concurrent, but aren’t. Everything except the worker that’s currently processing will just be waiting on the row lock held by the active worker. You might have 100 workers, but 99 of them are waiting for work.

Solutions that use SERIALIZABLE isolation can be effective, but they fall down badly as concurrency goes up. The serializable conflict rate increases and they start generating a lot of database bloat and doing a lot of wasted work. They don’t tend to scale, but work well for cases where there’s not much concurrency.

Solutions that use advisory locking can work well within limits. Instead of using tuple locks they use pg_try_advisory_xact_lock(...) in a loop or using a LIMIT clause to attempt to grab the first unlocked row. It works, but it requires that users go way outside the normal SQL programming model. They can’t use their queue table’s normal keys, they have to map them to either a 64-bit integer or two 32-bit integers. That namespace is shared across the whole database, it’s not per-table and it can’t be isolated per-application. Multiple apps that all want to use advisory locking on the same DB will tend to upset each other.

How SKIP LOCKED helps

SKIP LOCKED tries to make this easier by letting you use normal SQL to write efficient, safe queue systems. You don’t need to import a large and complex 3rd party app or library to implement a queue, and you don’t need to deal with the key mapping and namespace issues with advisory locking.

Given a trivial queue:

CREATE TABLE queue(
  itemid INTEGER PRIMARY KEY,
  is_done BOOLEAN NOT NULL DEFAULT 'f'
);

INSERT INTO queue(itemid)
SELECT x FROM generate_series(1,20) x;

an application can grab a single queue item safely while holding an open transaction with:

DELETE FROM queue
WHERE itemid = (
  SELECT itemid
  FROM queue
  ORDER BY itemid
  FOR UPDATE SKIP LOCKED
  LIMIT 1
)
RETURNING *;

This:

  • Scans the queue table in itemid order
  • Tries to acquire a lock on each row. If it fails to acquire the lock, it ignores the row as if it wasn’t in the table at all and carries on.
  • Stops scanning once it’s locked one item
  • Returns the itemid of the locked item
  • Looks up the found itemid in the index to get its physical location
  • Marks the tuple as deleted (but this doesn’t take effect until commit)

The open transaction holds a lock on the row now. Other transactions will skip it so long as this transaction keeps on running. If this transaction aborts, the row becomes unlocked and the deletion is undone by the abort so another transaction can grab the row. If this transaction commits, the deletion is committed along with the other changes in the xact, so other xacts won’t see the queue item anymore.

This is ideal whenever your work queue processing makes changes to the same database since it’s 100% atomic.

Example

A somewhat contrived usage example might be processing of a queue of pending balance exchanges:

BEGIN;

DELETE FROM pending_balance_transfers
WHERE itemid = (
  SELECT itemid FROM pending_balance_transfers 
  ORDER BY itemid 
  FOR UPDATE SKIP LOCKED 
  LIMIT 1
)
RETURNING itemid, from_customer, to_customer, amount;

-- Do its processing and database updates in the same transaction
-- now, e.g.: if the queue contained "transfer balance of $100 from customer A to customer B"
-- we might:

UPDATE bank_balances
SET balance = balance + 100
WHERE customerid = 'customer_a';

UPDATE bank_balances
SET balance = balance - 100
WHERE customerid = 'customer_b';

-- and when it commits, the queue entry is marked completed atomically with
-- the work that was queued being committed.

COMMIT;

Downsides?

There are downsides to using SKIP LOCKED to implement a queue.

Each transaction scans the table and skips over locked rows, so with high numbers of active workers it can land up doing a bit of work to acquire a new item. It’s not just popping items off a stack. The query will probably have to walk an index with an index scan, fetching each candidate item from the heap and checking the lock status. With any reasonable queue this will all be in memory but it’s still a fair bit of churn.

(You can slightly reduce this by abusing ctid lookups instead of doing a second index lookup to find the row to update or delete, but that requires caution and I’m not going to go into details on it here).

A queue implemented in the RDBMS will never match the performance of a fast dedicated queueing system, even one that makes the same atomicity and durability guarantees as PostgreSQL. Using SKIP LOCKED is better than existing in-database approaches, but you’ll still go faster using a dedicated and highly optimised external queueing engine.

Using SKIP LOCKED of course takes you away from ANSI SQL. ORM users will need to use native queries and developers will in general have to be aware that they can’t expect to just run the same SQL on other platforms. (As if you ever can in reality). That said, Oracle has the same syntax with what looks like similar or the same semantics. Microsoft SQL Server has the READPAST locking hint. DB2 has SKIP LOCKED DATA. As far as I can tell MySQL has no equivalent yet. Still, three out of four ain’t bad.

What about external systems and 2PC?

When you want to do something outside the database the queue resides in as part of your queue processing you need two-phase commit to do it reliably. SKIP LOCKED can’t help you with the atomicity aspect there.

It still can help you acquire new work items efficiently, safely and concurrently. Items will stay locked while an xact is prepared after PREPARE TRANSACTION has run, will become available again if a prepared xact does a ROLLBACK PREPARED and will be removed when COMMIT PREPARED runs.

See also

13 Comments

  • Kai

    Thanks for this post pointing out how to use this feature!

    It would be great if you could give some details regarding the following two points:

    Which isolation level is required for the ‘delete from queue’ statement in the ‘How SKIP LOCKED helps’ section?

    The sample queue in this statement features a boolean is_done flag. Does the example also work with an update on this flag instead of a delete, combined with a matching select for not is_done?

    • Which isolation level is required for the ‘delete from queue’ statement in the ‘How SKIP LOCKED helps’ section?

      That depends on your application’s other requirements. For the queue its self, READ COMMITTED is sufficient. If you use SERIALIZABLE you may face high rates of serialization failures.

      The sample queue in this statement features a boolean is_done flag. Does the example also work with an update on this flag instead of a delete, combined with a matching select for not is_done?

      Yes.

  • Here is another alternative – this is written out quickly so for illustration instead of optimal code:

    Create a sequence used for itemid PK (“pk”)
    Create a sequenced used for handing out queue records (“getpk”)

    CREATE TABLE queue(
    itemid INTEGER PRIMARY KEY,
    initialitemid INTEGER,
    is_done BOOLEAN NOT NULL DEFAULT ‘f’
    );

    Insert queue items using nextval(‘pk’);

    SELECT nextval(‘getpk’) to get queue record, and flag a status
    flag is_done when job complete

    When a job is detected to have died (eg taken too long), select a new value into itemid – with nextval(‘pk’)

    Jobs either in progress or dead will have an itemid < currentval('getpk')

    Keep requesting jobs will currentval('getpk') = currentval('pk')

    • Using sequences to bypass MVCC visbility and locking when implementing a queue is an interesting concept. It’s very fragile though. If a transaction that’s acquiring a job rolls back you’ll lose that job and the subsequent queue workers will skip past it. You’ll still need something that walks behind the queue cleaning up jobs that got lost or whose workers failed, they won’t pop automatically back on the queue for immediate re-processing like they do if you’re using row locks.

      Still, this is a new one I haven’t seen before and it’s a very clever idea.

      • Yes you do need that job to check for dead jobs. But then you always do in my experience – because sometimes jobs do crash entirely (eg because the server that was running the job lost power) and can’t tell the queue that the job died.

        That is what I meant to do this – “When a job is detected to have died (eg taken too long), select a new value into itemid – with nextval(‘pk’)”

        – which could have been clearer.

        It was just an idea that I thought of on the spur of the moment, and could possibly be tweaked a bit.

        Great read – it did point out some finer points of the MVC implementation that I was not aware of.

        • because sometimes jobs do crash entirely (eg because the server that was running the job lost power) and can’t tell the queue that the job died

          That’s why using an open transaction is good. With keepalives, you quickly notice lost jobs. It’s very robust, especially with the help of PostgreSQL’s automatic deadlock detection too.

          It has downsides, of course, and is mostly only really useful when the work you’re doing from the queue is also reading from and writing to the DB.

  • johto

    “Since is_done = ‘t’ now, the row no longer matches the condition and is excluded. The subquery returns zero rows, which is null, and no itemid is = NULL because nothing is equal to null, so the UPDATE does nothing.”

    This has not actually been true since PostgreSQL 9.0. If the row was concurrently updated not to match the WHERE clause, FOR UPDATE moves on to the “next one” according to the snapshot. In other words, FOR UPDATE applies before LIMIT.

    • Huh, so EvalPlanQual got smarter. I managed to miss that.

      That’s useful, though it doesn’t change the fact that the queue fetches are serialized. They just won’t get zero-row fetches and retry anymore.

  • Over the past couple months I’ve been working on an open source job queue taking advantage of the approach you mention here (https://www.npmjs.com/package/pg-boss). I agree that a relational database cannot match the extreme requirements of an external message queue product, but I’m also convinced that the majority of job queue consumers out there would find performance along the lines of 1000 jobs per second reasonable enough to provide an open source alternative to purchasing said external product.

    The difference between having SKIP LOCKED and not having it was compelling enough for me to create this, which frankly is not that much code at all, since the bulk of the coding effort is handled by Postgres (so awesome btw).

Leave a Reply

Your email address will not be published. Required fields are marked *