Friday, November 24

CTE and the Birthday Paradox

An interesting query has been twitted by Will Leinweber from Postgres Open:

-- this returns a different result each time it is ran
with recursive s as (
  select random()
union
  select random() from s
) select count(*) from s;

I like this example: a surprising result, which can be explained by (and indeed helps to explain) CTE behaviour.

Unexpected truths are denoted with the word “paradox”, and in fact this is a manifestation (an “instance”, in programmers’ jargon) of what is known as the Birthday Paradox.

Its simplest formulation is probably this: if you randomly choose 23 persons, the probability that two of them share the same birthday is greater than 50%.

The result is unexpected, because there are 366 different birthdays, and the number 23 seems very small compared to 366.

However it is correct, as it can be shown with a direct computation. In PostgreSQL we can run another recursive CTE:

WITH RECURSIVE r(i,acc) AS (
  SELECT 1, 1 :: double precision
UNION
  SELECT i + 1,
    acc * (((366 - i) :: double precision) / 366)
  FROM r WHERE acc > 0.5
) SELECT count(1) FROM r;

producing 23 as the result.

A recursive CTE stops when the recursive step does not add any new rows. In the last query, acc represents the probability that the first i birthdays are distinct, so recursion stops when that number is not above 50%.

In the query mentioned at the beginning, which we’ll call the “random query” for short, the recursive CTE terminates when random() does not add a new row. That is, when the randomly-computed value has already been computed in a previous iteration; that’s because the recursive CTE is using UNION instead of UNION ALL.

This is indeed the Birthday paradox, with 366 replaced by the maximum possible number of distinct values that random() can produce. What exactly is that number?

The random() function returns a double precision value, whose exact definition depends on the system. Not all the double precision values can be produced, though; the underlying C function can produce 2^31 different results, regardless of the bit size of a double precision value. This is good enough in practice, and at the same time compatibility with all the various architectures and library implementations is ensured.

So we can replace 366 with 2^31 in our query, and instead of 23 we get 54563 as the answer.

Does it come near to the actual output of the random query? Let us run it a few times, collect the result, and compute the average:

gianni=# create table t(n int);
CREATE TABLE

gianni=# with recursive s as (
  select random()
union
  select random() from s
) insert into t select count(1) from s;
INSERT 0 1

/* repeat the last query 49 times */

gianni=# select count(1), avg(n) from t;

 count | avg
-------+--------------------
    50 | 54712.060000000000
 (1 row)

The average of the actual results is quite close to the expected threshold of 54563; the difference is less than 0.3%, quite orthodoxically, we might say!

Leave a Reply

Your email address will not be published. Required fields are marked *