Wednesday, December 12

JSONB type performance in PostgreSQL 9.4


The 9.4 version of PostgreSQL introduces the JSONB data type, a specialised representation of the JSON data, allowing PostgreSQL to be competitive in managing the “lingua franca” of the moment for the exchange of data via web services. It is useful to perform a number of tests to verify its actual performance.

Test dataarticolo-json-marco

We will run our tests using the customer reviews data from Amazon for the year 1998 in JSON format. The file customer_reviews_nested_1998.json.gz can be downloaded from the website of Citus Data.

The file, once unzipped, takes up 209MB and contains approximately 600k of records in JSON format with a structure similar to the following:

{

    "customer_id": "ATVPDKIKX0DER",
    "product": {
        "category": "Arts & Photography",
        "group": "Book",
        "id": "1854103040",
        "sales_rank": 72019,
        "similar_ids": [
            "1854102664",
            "0893815381",
            "0893816493",
            "3037664959",
            "089381296X"
        ],
        "subcategory": "Art",
        "title": "The Age of Innocence"
    },
    "review": {
        "date": "1995-08-10",
        "helpful_votes": 5,
        "rating": 5,
        "votes": 12
    }
}

Dimensions

The data can be loaded into a PostgreSQL database using the JSONB data type with the following commands:

CREATE TABLE reviews(review jsonb);
\copy reviews FROM 'customer_reviews_nested_1998.json'
VACUUM ANALYZE reviews;

The resulting table will take up approximately 268MB, with an additional cost of disk storage of around 28%. If we try to load the same data using the JSON type, which stores it as text, the result will be a table of 233MB, with an increase in space of roughly 11%. The reason for this difference is that the internal structures of JSONB, which are used to access the data without analysing the entire document each time, have a cost in terms of space.

Data access

Once the data is stored in the database, it is necessary to create an index in order to access it efficiently. Before the 9.4 version of PostgreSQL, the only method of indexing the contents of a JSON field was to use a B-tree index on a specific search expression. For example, if we want to perform a search by product category we will use:

CREATE INDEX on reviews ((review #>> '{product,category}'));

The newly created index takes up 21MB, or approximately 10% of the original data, and will be used for queries that contain within the WHERE clause the exact search term “review #>> {product,category}”, such as:

SELECT
    review #>> '{product,title}' AS title,
    avg((review #>> '{review,rating}')::int)
FROM reviews
WHERE review #>> '{product,category}' = 'Fitness & Yoga'
GROUP BY 1 ORDER BY 2;
                       title                       |        avg
---------------------------------------------------+--------------------
 Kathy Smith - New Yoga Challenge                  | 1.6666666666666667
 Pumping Iron 2                                    | 2.0000000000000000
 Kathy Smith - New Yoga Basics                     | 3.0000000000000000
 Men Are from Mars, Women Are from Venus           | 4.0000000000000000
 Kathy Smith - Functionally Fit - Peak Fat Burning | 4.5000000000000000
 Kathy Smith - Pregnancy Workout                   | 5.0000000000000000
(6 rows)

The query takes approximately 0.180ms to be performed on the test machine, but the index that has been created is highly specific and cannot be used for different searches.

Starting with version 9.4, the JSONB data type supports the use of inverted indexes (GIN, or General inverted Indexes), which allow indexing of the components of a complex object.

Let’s create a GIN index on our table reviews with the following command:

CREATE INDEX on reviews USING GIN (review);

The resulting index takes up 64MB of disk, which is approximately 30% of the size of the original table. This index can be used to speed up searches using the following operators:

  • JSON @> JSON is a subset
  • JSON ? TEXT contains a value
  • JSON ?& TEXT[] contains all the values
  • JSON ?| TEXT[] contains at least one value

The above query shall thus be rewritten using the operator @> to search for rows that contain '{"product": {"category": "Fitness & Yoga"}}':

SELECT
    review #>> '{product,title}' AS title,
    avg((review #>> '{review,rating}')::int)
FROM reviews
WHERE review @> '{"product": {"category": "Fitness & Yoga"}}'
GROUP BY 1 ORDER BY 2;

The query takes approximately 1.100ms to be performed on the test machine, and the index that has been created is flexible and can be used for any search within the JSON data.

In fact, it is often the case that the only operation used in the applications is the search for a subset. In this case it is possible to use a different GIN index which only supports the @> operator and is therefore considerably smaller. The syntax for creating this type of “optimised” index is as follows:

 

CREATE INDEX on reviews USING GIN (review jsonb_path_ops);

The resulting index occupies 46MB – only 22% of the size of the original data – and, thanks to its smaller size, it is used more efficiently by PostgreSQL.

This allows the above query to be run in just 0.167ms, with a performance increase of 650% compared to the original GIN index and 8% compared to the specific B-tree index initially used: all this without loss of generality with regards to the possible search operation.

Conclusions

With the introduction of the JSONB type and the GIN indexes
using jsonb_path_ops operator class, PostgreSQL combines the elasticity of the JSON format at an amazing data access speed.

Today it is thus possible to store and process data in JSON format with high performance while enjoying the robustness and flexibility that PostgreSQL has habitually provided us with over the years.

15 Comments

  • janko

    This was a very interesting article. I have a couple of questions:

    1. What does “\copy” exactly do? How did it know that you want to store the data in the “review” column? Was Postgres able to detect that it was the only column in the table?

    2. What does “VACUUM ANALYZE” do in your case? I read the Postgres documentation, and I understood from there that it cleans the deleted rows, but you didn’t delete anything. Does the “ANALYZE” part make future queries faster?

    3. If the “json_path_ops” index is focused on the @> operator only, does it mean that your SELECT statement is slower, since it still uses the #>> operator? Or it’s that fast because it’s actually doing the SELECT only after the WHERE filtering?

    • Marco Nenciarini

      1) \copy is a psql meta command which invokes the SQL commands “COPY … FROM STDIN” or “COPY .. TO STDOUT”, and then fetches/stores the data in a file accessible to the psql client.

      If you don’t specify any column list, PostgreSQL will load/dump all the columns of the table.

      2) It’s always a good idea to “VACUUM ANALYZE” a table after a bulk load of data. It ensures that all the internal structures used by PostgreSQL are up to date (hint bits, visibility map, etc…) and collects statistics about the content of the table, so the query planner can use these statistics to help determine the most efficient execution plans for queries.

      3) The “json_path_ops” GIN index is used to quickly find the data during the WHERE filtering, then the #>> operator is applied in the output phase to display only the required portion of data. It makes the access very fast.

      • kustodian

        Just a small correction. After bulk inserting of data, VACUUM is not needed, only ANALYZE, so that it updates the query planner for that table/database. Running VACUUM ANALYZE won’t hurt, but it will just take longer, which makes a big difference on big tables/databases.

        Thanks for the article. I didn’t know that you can make a specialized GIN index which will only work for some operators.

  • Marco,
    Thanks for putting this information together. I’ve used the json/jsonb functionality a bit and it’s really nice. Accessing the `similar_ids` array is a little more challenging. I started out doing this:

    SELECT review #>> ‘{product,title}’ AS title
    FROM reviews
    WHERE exists(select 1 from jsonb_array_elements(“reviews”.review#>'{product,similar_ids}’) where value = ‘”0910627312″‘);

    which works for both json / jsonb by changing the name of the function to `json_array_elements`

    I’m going to adopt your syntax (and better use of the index) with this as soon as I can get my production db upgraded to 9.4.x:

    SELECT review #>> ‘{product,title}’ AS title
    FROM reviews
    WHERE “reviews”.review#>'{product,similar_ids}’ @> ‘”0910627312″‘;

    Thanks again!
    dwilkins

    • Marco Nenciarini

      Hi Dwilkins,

      the query you pasted will not use the jsonb_path_ops index. To take advantage of it you need to write the query as follows:

      SELECT review #>> ‘{product,title}’ AS title
      FROM reviews
      WHERE review @> ‘{“product”: {“similar_ids”: [“0910627312”]}}’;

      • Thanks Marco – I’m gradually getting more familiar with Postgres – It’s obvious now that I explain each query that mine wasn’t using the index.

    • Marco Nenciarini

      It has the same limitations of other TOASTable fields, so it should work with maximum one GB of data. However, in my experience, it’s better to not exceed few hundred megabytes of data.

  • vincvan

    Great Article thanks! Have you any suggestions how I could speed up a jsonb query, when I would like to compare floats which are saved as values in a jsonb column?
    Such as:

    SELECT “products”.* FROM “products”
    WHERE (COALESCE(CAST( nullif(technical_details#>>’{weight,value}’,”) AS FLOAT),0) BETWEEN 500 AND 1500)

    This query takes 30ms in my db. I added a GIN index on the column technical_details, but this did not really help.

    Thanks for your help!

Leave a Reply

Your email address will not be published. Required fields are marked *