Friday, September 22

Tablesample and Other Methods for Getting Random Tuples

PostgreSQL’s TABLESAMPLE brings a few more advantages compared to other traditional ways for getting random tuples.

TABLESAMPLE is a SQL SELECT clause and it provides two sampling methods which are SYSTEM and BERNOULLI. With the help of TABLESAMPLE we can easily retrieve random rows from a table. For further reading about TABLESAMPLE you can check the previous blog post.

In this blog post we’ll talk about alternative ways of getting random rows. How people were selecting random rows before TABLESAMPLE, what are the pros and cons of the other methods and what we gained with TABLESAMPLE?

There are awesome blog posts about selecting random rows, you can start reading the following blog posts to gain a deep understanding of this topic.

My Thoughts of Getting Random Row

Getting Random Rows from a Database Table

random_agg()

Let’s compare the traditional ways of getting random rows from a table with the new ways provided by TABLESAMPLE.

Before the TABLESAMPLE clause, there were 3 commonly used methods for randomly selecting rows from a table.

1- Order by random()

For testing purposes we need to create a table and put some data inside of it.

Let’s create ts_test table and insert 1M rows into it:

CREATE TABLE ts_test (
  id SERIAL PRIMARY KEY,
  title TEXT
);

INSERT INTO ts_test (title)
SELECT
    'Record #' || i
FROM
    generate_series(1, 1000000) i;

Considering the following SQL statement for selecting 10 random rows:

SELECT * FROM ts_test ORDER BY random() LIMIT 10;

Causes PostgreSQL to perform a full table scan and also ordering. Therefore this method is not preferred for tables with large number of rows because of performance reasons.

Let’s look into EXPLAIN ANALYZE output of this query above:

random=# EXPLAIN ANALYZE SELECT * FROM ts_test ORDER BY random() LIMIT 10;
 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Limit (cost=33959.03..33959.05 rows=10 width=36) (actual time=1956.786..1956.807 rows=10 loops=1)
 -> Sort (cost=33959.03..35981.18 rows=808863 width=36) (actual time=1956.780..1956.789 rows=10 loops=1)
 Sort Key: (random())
 Sort Method: top-N heapsort Memory: 25kB
 -> Seq Scan on ts_test (cost=0.00..16479.79 rows=808863 width=36) (actual time=0.276..1001.109 rows=1000000 loops=1)
 Planning time: 1.434 ms
 Execution time: 1956.900 ms
(7 rows)

As EXPLAIN ANALYZE points out, selecting 10 out of 1M rows took nearly 2 seconds. The query also used the output of random() as the sort key to order results. Sorting seems to be the most time consuming task here. Let’s execute this with scenario using TABLESAMPLE.

In PostgreSQL 9.5, for getting exact number of rows randomly, we can use the SYSTEM_ROWS sampling method. Provided by the tsm_system_rows contrib module, it allows us to specify how many rows should be returned by sampling. Normally only the requested percentage could be specified with TABLESAMPLE SYSTEM and BERNOULLI methods.

First, we should create tsm_system_rows extension for using this method since both TABLESAMPLE SYSTEM and TABLESAMPLE BERNOULLI methods do not guarantee that the provided percentage will result in the expected number of rows. Please check the previous TABLESAMPLE post to recall why they work like that.

Let’s start by creating tsm_system_rows extension:

random=# CREATE EXTENSION tsm_system_rows;
CREATE EXTENSION

Now let’s compare “ORDER BY random()EXPLAIN ANALYZE output with the EXPLAIN ANALYZE output of tsm_system_rows query which returns 10 random rows out of a 1M row table.

random=# EXPLAIN ANALYZE SELECT * FROM ts_test TABLESAMPLE SYSTEM_ROWS(10);
 QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Sample Scan on ts_test (cost=0.00..4.10 rows=10 width=18) (actual time=0.069..0.083 rows=10 loops=1)
 Sampling: system_rows ('10'::bigint)
 Planning time: 0.646 ms
 Execution time: 0.159 ms
(4 rows)

The whole query took 0.159 ms. This sampling method is extremely fast comparing with the “ORDER BY random()” method which took 1956.9 ms.

2- Compare with random()

The following SQL allows us to retrieve random rows with 10% probability

SELECT * FROM ts_test WHERE random() <= 0.1;

This method is faster than ordering by random because it doesn’t sort selected rows. It will return approximate percentage of rows from the table just like BERNOULLI or SYSTEM TABLESAMPLE methods.

Let’s check the EXPLAIN ANALYZE output of random() query above:

random=# EXPLAIN ANALYZE SELECT * FROM ts_test WHERE random() <= 0.1;
 QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Seq Scan on ts_test (cost=0.00..21369.00 rows=333333 width=18) (actual time=0.089..280.483 rows=100014 loops=1)
 Filter: (random() <= '0.1'::double precision)
 Rows Removed by Filter: 899986
 Planning time: 0.704 ms
 Execution time: 367.527 ms
(5 rows)

The query took 367.5 ms. Much better than ORDER BY random(). But it’s harder to control the exact row count. Let’s compare “random()EXPLAIN ANALYZE output with the EXPLAIN ANALYZE output of TABLESAMPLE BERNOULLI query which return approximately 10% of random rows out of the 1M rows table.

random=# EXPLAIN ANALYZE SELECT * FROM ts_test TABLESAMPLE BERNOULLI (10);
 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Sample Scan on ts_test  (cost=0.00..7369.00 rows=100000 width=18) (actual time=0.015..147.002 rows=100155 loops=1)
   Sampling: bernoulli ('10'::real)
 Planning time: 0.076 ms
 Execution time: 239.289 ms
(4 rows)

We gave 10 as the parameter to BERNOULLI because we need 10% of all records.

Here we can see that the BERNOULLI method took 239.289 ms to execute. These two methods are quite similar in how they work, BERNOULLI is slightly faster as the random selection is all done on lower level. One advantage of using BERNOULLI compared to WHERE random() <= 0.1 is the REPEATABLE clause which we described in previous TABLESAMPLE post.

3- Extra column with a random value

This method suggests adding a new column with randomly assigned values, adding an index to it, performing sorting by that column, and optionally updating their values periodically to randomize the distribution.

This strategy allows mostly repeatable random sampling. It works much faster than the first method but it takes an effort to setup for the first time, and results in a performance cost in insert, update, and delete operations.

Let’s apply this method on our ts_test table.

First, we will add a new column called randomcolumn with the double precision type because PostgreSQL’s random() function returns a number in double precision.

random=# ALTER TABLE ts_test ADD COLUMN randomcolumn DOUBLE PRECISION;
ALTER TABLE

Then we’ll update the new column using random() function.

random=# \timing
Timing is on.
random=# BEGIN;
BEGIN
Time: 2.071 ms
random=# UPDATE ts_test SET randomcolumn = RANDOM();
UPDATE 1000000
Time: 8483.741 ms
random=# COMMIT;
COMMIT
Time: 2.615 ms

This method has an initial cost for creating a new column and populating that new column with random (0.0-1.0) values, but it’s a one-time cost. In this example, for 1M rows, it took almost 8.5 seconds.

Let’s try to observe if our results are reproducible by querying 100 rows with our new method:

random=# SELECT * FROM ts_test ORDER BY randomcolumn LIMIT 100;
 -------+---------------+----------------------
 13522  | Record #13522  | 6.4261257648468e-08
 671584 | Record #671584 | 6.4261257648468e-07
 714012 | Record #714012 | 1.95764005184174e-06
 162016 | Record #162016 | 3.44449654221535e-06
 106867 | Record #106867 | 3.66196036338806e-06
 865669 | Record #865669 | 3.96883115172386e-06
 927    | Record #927    | 4.65428456664085e-06
 526017 | Record #526017 | 4.65987250208855e-06
 98338  | Record #98338  | 4.91179525852203e-06
 769625 | Record #769625 | 4.91319224238396e-06
 ...
 462484 | Record #462484 | 9.83504578471184e-05
 (100 rows)

When we execute the query above, we mostly get the same result set but this is not guaranteed because we used random() function for populating randomcolumn values and in this case more than one column might have the same value. To be sure that we get the same results for each time it runs we should improve our query by adding the ID column to ORDER BY clause, in this way we ensure that ORDER BY clause specifies a unique set of rows, because id column has primary key index on it.

Now let’s run the improved query below:

random=# SELECT * FROM ts_test ORDER BY randomcolumn, id LIMIT 100;
 id | title | randomcolumn
--------+----------------+----------------------
 13522  | Record #13522  | 6.4261257648468e-08
 671584 | Record #671584 | 6.4261257648468e-07
 714012 | Record #714012 | 1.95764005184174e-06
 162016 | Record #162016 | 3.44449654221535e-06
 106867 | Record #106867 | 3.66196036338806e-06
 865669 | Record #865669 | 3.96883115172386e-06
 927    | Record #927    | 4.65428456664085e-06
 526017 | Record #526017 | 4.65987250208855e-06
 98338  | Record #98338  | 4.91179525852203e-06
 769625 | Record #769625 | 4.91319224238396e-06 
 ...
 462484 | Record #462484 | 9.83504578471184e-05
 (100 rows)

Now we’re sure that we can get reproducible random sample by using this method.

Tip: If you want to use this method for getting random rows, it’s better to create an index for randomcolumn. In this blog post we’re focusing to make a comparison with TABLESAMPLE methods and the other methods. We’re not creating index for this blog post and we’ll compare EXPLAIN ANALYZE outputs instead of improving all the other methods.

It’s time to see EXPLAIN ANALYZE results of our final query:

random=# EXPLAIN ANALYZE SELECT * FROM ts_test ORDER BY randomcolumn, id LIMIT 100;
 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Limit (cost=55571.28..55571.53 rows=100 width=26) (actual time=1951.811..1952.039 rows=100 loops=1)
 -> Sort (cost=55571.28..58071.28 rows=1000000 width=26) (actual time=1951.804..1951.891 rows=100 loops=1)
 Sort Key: randomcolumn, id
 Sort Method: top-N heapsort Memory: 32kB
 -> Seq Scan on ts_test (cost=0.00..17352.00 rows=1000000 width=26) (actual time=0.058..995.011 rows=1000000 loops=1)
 Planning time: 0.481 ms
 Execution time: 1952.215 ms
(7 rows)

For comparing this method with TABLESAMPLE methods, let’s pick SYSTEM method with REPEATABLE option, since this method gives us reproducible results.

TABLESAMPLE SYSTEM method basically does block/page level sampling; it reads and returns random pages from disk. Thus it provides randomness at the page level instead of the tuple level. That’s why it’s hard to control retrieved rows count exactly. If there are no pages picked we might not get any result. The page level sampling is also prone to clustering effect.

TABLESAMPLE SYNTAX is same for BERNOULLI and SYSTEM methods, we’ll specify the percentage like we did in BERNOULLI method.

Quick Reminder: SYNTAX

SELECT select_expression
FROM table_name
TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]
...

In order to retrieve approximately 100 rows, we need to request 100 * 100 / 1 000 000 = 0.01% of the table’s rows. So our percentage will be 0.01.

Before starting to query, let’s remember how REPEATABLE works. Basically we’ll choose a seed parameter and we’ll get the same results for each time when we use the same seed with the previous query. If we provide a different seed the query will produce a quite different result set.

Let’s try to see the results with querying.

random=# Select * from ts_test tablesample system (0.01) repeatable (60);
 id | title | randomcolumn
--------+----------------+---------------------
 659598 | Record #659598 | 0.964113113470376
 659599 | Record #659599 | 0.531714483164251
 659600 | Record #659600 | 0.477636905387044
 659601 | Record #659601 | 0.861940925940871
 659602 | Record #659602 | 0.545977566856891
 659603 | Record #659603 | 0.326795583125204
 659604 | Record #659604 | 0.469275736715645
 659605 | Record #659605 | 0.734953186474741
 659606 | Record #659606 | 0.41613544523716
 ...
 659732 | Record #659732 | 0.893704727292061
 659733 | Record #659733 | 0.847225237637758
 (136 rows)

We get 136 rows, as you can consider this number depends on how many rows are stored in a single page.

Now let’s try to run the same query with the same seed number:

random=# Select * from ts_test tablesample system (0.01) repeatable (60);
 id | title | randomcolumn
--------+----------------+---------------------
 659598 | Record #659598 | 0.964113113470376
 659599 | Record #659599 | 0.531714483164251
 659600 | Record #659600 | 0.477636905387044
 659601 | Record #659601 | 0.861940925940871
 659602 | Record #659602 | 0.545977566856891
 659603 | Record #659603 | 0.326795583125204
 659604 | Record #659604 | 0.469275736715645
 659605 | Record #659605 | 0.734953186474741
 659606 | Record #659606 | 0.41613544523716 
 ...
 659732 | Record #659732 | 0.893704727292061
 659733 | Record #659733 | 0.847225237637758
 (136 rows)

We get the same result set thanks to REPEATABLE option. Now we can compare TABLESAMPLE SYSTEM method with random column method by looking the EXPLAIN ANALYZE output.

random=# EXPLAIN ANALYZE SELECT * FROM ts_test TABLESAMPLE SYSTEM (0.01) REPEATABLE (60);
 QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Sample Scan on ts_test (cost=0.00..5.00 rows=100 width=26) (actual time=0.091..0.230 rows=136 loops=1)
 Sampling: system ('0.01'::real) REPEATABLE ('60'::double precision)
 Planning time: 0.323 ms
 Execution time: 0.398 ms
(4 rows)

SYSTEM method uses sample scan and it’s extremely fast. The only backward of this method is that we cannot guarantee that the provided percentage will result in the expected number of rows.

Conclusion

In this blog post we compared standard TABLESAMPLE (SYSTEM, BERNOULLI) and the tsm_system_rows module with the older random methods.

Here you can review the findings quickly:

  • ORDER BY random() is slow because of sorting
  • random() <= 0.1 is similar to BERNOULLI method
  • Adding column with random value needs initial work and might lead performance problems
  • SYSTEM is fast but it’s hard to control number of rows
  • tsm_system_rows is fast and can control number of the rows

As a result tsm_system_rows beats any other method for selecting just few random rows.

But the real winner is definitely TABLESAMPLE itself. Thanks to its extensibility, custom extensions (i.e. tsm_system_rows, tsm_system_time) can be developed using TABLESAMPLE method functions.

Developer note: More information about how to write custom sampling methods can be found in Writing A Table Sampling Method chapter of PostgreSQL documentation.

Note to the future: We’ll discuss AXLE Project and tsm_system_time extension in our next TABLESAMPLE blog post.

Leave a Reply

Your email address will not be published. Required fields are marked *