PostgreSQL anti-patterns: Unnecessary json/hstore dynamic columns

PostgreSQL has json support – but you shouldn’t use it for the great majority of what you’re doing. This goes for hstore too, and the new jsonb type. These types are useful tools where they’re needed, but should not be your first choice when modelling your data in PostgreSQL, as it’ll make querying and manipulating it harder.

Some readers will already be familiar with the (anti-)pattern. EAV (entity-attribute-value) has been a necessary evil in database schemas for a long time. It’s something we all advise against and look down on, then resort to when reality and business requirements mean that not everything can be fitted into rigid fields in properly modelled relations. For those who don’t know it, it’s a schema design that turns a relational database into a poor man’s key/value store, using a table with object_id (“entity”), key (“attribute”) and value columns. Querying it usually involves lots and lots of self-joins.
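
To make the pattern concrete, here's a minimal sketch of what an EAV table and a typical self-joining query over it might look like (the table and column names are hypothetical, not from any particular application):

-- Hypothetical EAV layout: one row per attribute of each entity
CREATE TABLE entity_attributes (
    object_id integer NOT NULL,   -- the "entity"
    key       text    NOT NULL,   -- the "attribute"
    value     text    NOT NULL,   -- everything becomes text
    PRIMARY KEY (object_id, key)
);

-- "Which entities are in Thornlie and have a mobile number?" needs one
-- self-join per attribute of interest:
SELECT town.object_id
FROM entity_attributes AS town
JOIN entity_attributes AS phone
  ON phone.object_id = town.object_id
 AND phone.key = 'mobile_phone'
WHERE town.key = 'town'
  AND town.value = 'Thornlie';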

json is the new EAV – a great tool when you need it, but not something you should use as a first choice.

(Most of this also applies to PostgreSQL arrays, which are great tools for building queries, but not usually an appropriate way to store authoritative data.)

JSON blob tables

Since the release of PostgreSQL 9.3, and even more since jsonb in 9.4, I’m seeing more and more Stack Overflow questions where people ask how to do joins on json objects, how to query for a nested key inside any array element, etc. All reasonable things to do, but when asked for schema and example data they’ll often post something like this:

CREATE TABLE people(
    id serial primary key,
    data jsonb not null
);

INSERT INTO people(data) VALUES ($$
{
    "name": "Bob",
    "addresses": [
        {
            "street": "Centre",
            "streetnumber": 24,
            "town": "Thornlie",
            "state": "WesternAustralia",
            "country": "Australia"
        },
        {
            "street": "Example",
            "streetnumber": "4/311-313",
            "town": "Auckland",
            "country": "NewZealand"
        }
    ],
    "phonenumbers": [
        {
            "type": "mobile",
            "number": "12345678"
        }
    ]
}
$$);

INSERT INTO people(data) VALUES ($$
{
  "name": "Fred",
  "phonenumbers": [
    { "type": "mobile", "number": "12345678" }
  ]
}
$$);

INSERT INTO people(data) VALUES ($$
{
  "name": "John Doe"
}
$$);

and ask “how do I find people who have the same phone number?”.

Experienced relational database users will already be wincing… but remember, not everybody is used to inverting their thinking to follow foreign keys backwards from child to parent and decomposing composite entities into normal form. When you think about it, relational databases are weird.

Most of PostgreSQL’s power comes from that relational model, though. When you store your data as json blobs you deny the query planner the ability to make sensible decisions based on table and column statistics, you lose most of the indexing features and scan types, and are generally restricted to quite primitive operations. Doing anything interesting tends to involve lots of self-joins and filters.

Baroque queries

For example, the request to find people with matching phone numbers could be turned into a query like:

select 
  p1.id AS person1,
  p2.id AS person2,
  p1.data ->> 'name' AS "p1 name",
  p2.data ->> 'name' AS "p2 name",
  pns1 ->> 'type' AS "type", 
  pns1 ->> 'number' AS "number" 
from people p1 
  inner join people p2 
    on (p1.id > p2.id)
  cross join lateral jsonb_array_elements(p1.data -> 'phonenumbers') pns1
  inner join lateral jsonb_array_elements(p2.data -> 'phonenumbers') pns2 
    on (pns1 -> 'type' = pns2 -> 'type' AND pns1 -> 'number' = pns2 -> 'number');

Isn’t that just a thing of readable beauty?

There’s nothing particularly wrong with the json support functions and operators, it’s just awkward taking nested object-like data and querying it using relational joins and predicates.

At least PostgreSQL supports LATERAL; without that, this would be a nightmare of a query to write.

As it happens an equivalent query on a relational schema for this particular question isn’t that much nicer:

SELECT
  p1.id AS "person1",
  p2.id AS "person2",
  p1.name AS "p1 name",
  p2.name AS "p2 name",
  pns1.phoneno AS "p1 phone",
  pns2.phoneno AS "p2 phone"
FROM
  person p1
  INNER JOIN person p2 ON (p1.id < p2.id)
  INNER JOIN phone pns1 ON (p1.id = pns1.person_id)
  INNER JOIN phone pns2 ON (p2.id = pns2.person_id)
WHERE
  pns1."type" = pns2."type"
  AND pns1.phoneno = pns2.phoneno;

… but it’s likely to go a *lot* faster with an index on (phone."type", phone.phoneno) than the json based query would, and read a lot less data off disk in the process.
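
For reference, the relational version assumes a normalised schema roughly along these lines; the original question doesn't spell it out, so this is just a plausible sketch, including the index just mentioned:

-- Plausible normalised schema for the relational query above
CREATE TABLE person (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE phone (
    person_id integer NOT NULL REFERENCES person(id),
    "type"    text    NOT NULL,
    phoneno   text    NOT NULL,
    PRIMARY KEY (person_id, "type", phoneno)
);

-- The index that makes the duplicate-number search cheap
CREATE INDEX phone_type_number_idx ON phone ("type", phoneno);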

No fixed data typing

The json-based query given above is also buggy, because jsonb comparisons are sensitive to data type, so the jsonb values "12345678" and 12345678 are unequal:

regress=> SELECT '12345678'::jsonb, '"12345678"'::jsonb, '12345678'::jsonb = '"12345678"'::jsonb AS "isequal";
  jsonb   |   jsonb    | isequal 
----------+------------+---------
 12345678 | "12345678" | f

so this:

insert into people (data) values ('{"phonenumbers": [{"type":"mobile","number":12345678}]}');

won’t be identified as a duplicate even though it should be.

Note that this is arguably a PostgreSQL limitation, since a JavaScript interpreter’s loose typing means they compare as equal:

> 12345678 == "12345678"
true

To get the numbers to match using a PostgreSQL query, we have to change our query so that this:

    on (pns1 -> 'type' = pns2 -> 'type' AND pns1 -> 'number' = pns2 -> 'number');

uses the ->> json-value-as-text operator instead:

    on (pns1 -> 'type' = pns2 -> 'type' AND pns1 ->> 'number' = pns2 ->> 'number');

since the text form of the numbers is the same.

No constraints

Of course, someone might insert:

insert into people (data) values ('{"phonenumbers": [{"type":"Mobile","number":"1234 5678"}]}');

… which won’t match because of the capital M and the space in the number.

Because everything is in a json blob, you can’t easily add CHECK constraints on the values, use ENUM types or DOMAINs, create trigger-maintained shadow normalised columns, or apply any of the other usual approaches to handling data normalisation issues like this. Instead it’d be necessary to write a json validator/transform trigger in a procedural language like pl/v8, or do it all in the application.
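
By contrast, on a normalised schema rules like these are one-liners. As a rough sketch (the specific rules here are illustrative, not from the original example):

-- Constraints that are trivial on normalised columns but awkward on a blob
CREATE TYPE phone_type AS ENUM ('mobile', 'home', 'work');

CREATE DOMAIN phone_number AS text
    CHECK (VALUE ~ '^[0-9]+$');   -- digits only, no spaces

-- The phone table sketched earlier could then declare its columns as
-- "type" phone_type and phoneno phone_number instead of plain text,
-- so inputs like 'Mobile' or '1234 5678' would be rejected outright.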

So when should json be used?

From the above you might be thinking that I’m against using json types in the database. That’s not the case at all. They’re often better than EAV when you have dynamic-column data that just won’t fit into a traditional relational model, and better than trying to co-ordinate with an external MongoDB for the json storage, as I’ve seen people attempt in the past.

For example, sometimes application users want to be able to add arbitrary attributes to an entity. It’s a business requirement. The client doesn’t like hearing that the database doesn’t do that so they can’t have notes in their application, and they’re muttering things about “just doing it in the application” or “we didn’t have these problems with MongoDB”.

So you might model the typical stuff relationally (albeit with indiscriminate use of surrogate keys), but add a json field in person just for those extras so the app can fetch them when displaying a person.
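
A minimal sketch of that hybrid shape, extending the person table sketched earlier (the extra_attrs column name is just an illustration):

-- Mostly-relational person, with a jsonb column only for the freeform extras
ALTER TABLE person ADD COLUMN extra_attrs jsonb NOT NULL DEFAULT '{}';

-- The app can fetch the extras alongside the properly modelled columns:
SELECT name, extra_attrs
FROM person
WHERE id = 42;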

How to decide when to use json

Use json if your data won’t fit in the database using normal relational modelling. If you’re choosing between using EAV, serializing a Java/Ruby/Python object into a bytea field, or storing a key to look up an external structured object somewhere else … that’s when you should be reaching for json fields.

At this point it’s probably worth replacing hstore use with jsonb in all new applications.

Note that plain json fields are also useful – if you’re not going to be indexing the json and querying within it, they’re usually more compact and faster to send and receive.

This Post Has 35 Comments

  1. Anders says:

    > The json-based query given above is also buggy, because jsonb comparisons are sensitive to data type so the jsonb values “12345678” and 12345678 are unequal: […] so this: […12345678 as a number…] won’t be identified as a duplicate even though it should be.

    No it shouldn’t. Even though the two values might look the same the way you write them, they are different. Had you chosen to write the numbers in hex, the two values would not even look similar.

    > Note that this is arguably a PostgreSQL limitation, since a JavaScript intepreter’s loose typing means they compare as equal: [12345678 == “12345678” => true]

    Note that in the JavaScript community that is considered a bug and the use of the non-type-coercing equals-operator “===” is encouraged.

    • craig.ringer says:

      Anders, interesting point re comparison. I’m actually happier knowing that it’s frowned upon in the JS world, because I’ve never been a big fan of Perl and JavaScript’s habit of implicitly converting strings and numbers for comparisons. My issue was that the behaviour was inconsistent with Pg, but it seems all we really need to do there is document that jsonb implements equality tests to match the JavaScript non-coercing equals operator “===”, or something along those lines.

      As for “should be” … I was speaking narrowly of the use case in the example, where the user would clearly want “12345678” and 12345678 to compare as equal. In general it’s clearly best to do just what jsonb does by default for comparisons – but to try to stick to relational, typed fields most of the time.

  2. […] This post of mine is a reaction to Craig Ringer’s article PostgreSQL anti-patterns: Unnecessary json/hstore dynamic columns. […]

  3. wolever says:

    Can you explain the join condition `(p1.id ‘phonenumbers’)` in the first example query? It’s not a syntax I’ve seen before.

    > from people p1
    > inner join people p2
    > on (p1.id ‘phonenumbers’) pns1
    > inner join lateral jsonb_array_elements(p2.data -> ‘phonenumbers’) pns2
    > on (pns1 -> ‘type’ = pns2 -> ‘type’ AND pns1 -> ‘number’ = pns2 -> ‘number’);

  4. Interesting read. I didn’t realize so many people were trying to use the JSON/HSTORE datatype like that. I advocate for its use when hitting an external API. Pull out the stuff that you know you need, and store it in the table. But also just store the entire payload in a JSON field. This probably won’t scale throughout the life of an app, but I don’t see the harm in it while you’re feeling things out.

  5. mauritslamers says:

    The reason I would like to use JSON in this way is that I want to be able to create files, in the sense that I create a folder for a certain person where I gather all the different information on this person as files in one spot. With this kind of table setup (an id and a jsonb field) that becomes possible.

    The problem I experience with the traditional relational approach is that for every type of information I need to create a separate table, and in some cases a join table, in order to get the information into the database. Whenever you need to add a new type, or change an existing one, not only do you need to change the API and all applications, but the database as well.

    Having the information spread out in so many spots makes it harder to retrieve in case of problems, or you lose the coherence of the information because one join table decides to go corrupt. If that happens, you don’t lose the coherence of the information for only one person, you lose it for every person in the database.

    Do you think this approach is an anti-pattern? How would you implement something like this without using the jsonb type this way?

    • craig.ringer says:

      It sounds like you’re not a fan of the relational approach or it doesn’t fit your application. What you describe is the object store or document store model. If you prefer this model or it’s a better fit for your application that’s fine – but rather than trying to force PostgreSQL into working this way, consider using a dedicated document database like MongoDB.

      That said, for many apps changes to structure are relatively infrequent and easily managed, and much of the data benefits from the flexible querying that relational storage provides via aggregation, windowing, etc. PostgreSQL’s json support really comes in handy when you have mostly relational data, but you also need to store some dynamic attributes or freeform data.

      I’m a bit concerned by your comment about a table “deciding to go corrupt”. Have you been using MyISAM recently? In PostgreSQL if you’ve got corruption anywhere you should be very concerned, and there’s no reason to assume that just one table will be damaged by whatever happened (PostgreSQL bug, filesystem bug, hardware fault, etc). It’s more likely in my experience that all tables recently written to will be damaged in the same way; see e.g. the multixact issues. Isolating your data into little pieces to protect against corruption is not the solution to data corruption issues, it’s at best an incomplete and clumsy workaround to an issue that shouldn’t be present in the first place.

      • mauritslamers says:

        It is not so much about being a fan of the relational model or not, but rather the idea that, because PostgreSQL is at heart a relational DB which now supports json as a field type, this kind of approach to using it is by default an anti-pattern.

        The relational model has its strengths and weaknesses, as I pointed out with the remark about the table going corrupt. I didn’t necessarily mean that the cause of that corruption would be Postgres or the file system, but any kind of corruption, including that introduced by SQL injection and/or improper use. The normalization process spreads data of different “types” across tables, which makes the coherence between those types both implicit and generic, whereas JSON allows the coherence to be flexible and explicit.

        I know that I described a document store; I have been using them for a few years now. But I would like a document-like store with SQL-like search capabilities. With the jsonb type this kind of query becomes possible, as well as combining it with traditional relational approaches.

  6. Sakuraba says:

    It is funny how people call it an anti-pattern, when it is absolutely necessary and vital to certain applications. The best example is ecommerce. EAV cannot be avoided when you want to create a reusable webshop, because you don’t know upfront what products the users will store. Or do you recommend putting columns like t-shirt size, height of an aquarium, or ingredients of cosmetics into the relational model for each different user? Ridiculous!

    Stop calling it an anti-pattern. It is a valid design technique.

  7. leninmhs says:

    NoSQL solutions are increasing day by day, and Postgres allows a combination of the relational and non-relational worlds; in the end it is your decision whether to implement hstore, json or jsonb. Craig is obviously not a fan of NoSQL, but many other Postgres specialists are using these “anti-patterns” for complex and sophisticated designs and implementations on top of Postgres, and the tendency is for demand and use of this to keep growing.

    • craig.ringer says:

      Actually I think json and jsonb support are great, when used appropriately.

      I advise people against fully schemaless tables in PostgreSQL because of the efficiency issues with indexing, querying, etc. If you need this, you should generally consider a product that’s optimised for schemaless storage, such as a document database.

      If you have a mostly-relational schema and need some free form entries, that’s when things like json and jsonb come in really handy.

      • TJ says:

        Craig, what would you say about a schema where all the relationships are columns and the rest is JSONB?

        AFAIK indexes can be built on jsonb fields, which should make it possible to speed up the queries that need it.

  8. Brian says:

    Craig, THANK YOU for speaking truth on data modeling. We have used EAV for more than a decade in a SaaS product where users define their own products and questions to collect data from their customers. There is no choice other than EAV or some similar pattern. We recently implemented a user-definable data set using the JSONB functionality in 9.4 and it’s working great as it replaces two of the three tables in a typical EAV setup with a single column sitting next to your entity data.

    • craig.ringer says:

      I think the bigger thing is that it gets rid of most of the dynamic query generation with the ‘n’ joins when you want ‘n’ attributes, etc. When you have to write that “fun” query that does a self-join on multiple attributes of two EAV-afflicted entities, you land up with 2 * n joins on the EAV table for n attributes joined on. Ew.

  9. Chris Cogdon says:

    Actually, we got a significant performance boost by switching from text to jsonb, mostly because the data stores and retrieves much more compactly. I.e., the cost of increased CPU was more than offset by the saving in IO time.

    As always, YMMV

  10. NWaithaka says:

    Hi,

    Had a look at this.

    As regards the use of indexes on JSON/B, isn’t that a moot point with 9.5, as it’s fully supported?

    • craig.ringer says:

      Indexing with jsonb is certainly helpful, but they’re fairly large indexes (GIN, GiST), and they are more likely to have to pay the cost of out-of-line TOAST storage too. I haven’t done comparative performance studies on jsonb vs EAV models and would be curious about results from those who’ve done so.

      Fetching values from an indexed jsonb column will definitely not perform as well as fetching individual values from a regular table.

      If your choice is EAV vs jsonb, you should probably use jsonb. But you should prefer to model it relationally, not rely on EAV or jsonb or hstore or xml.
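
      For reference, a jsonb GIN index and the kind of containment query it can accelerate look roughly like this, using the people table from the post (the index name is arbitrary):

      -- GIN index over the whole jsonb document
      CREATE INDEX people_data_gin ON people USING gin (data jsonb_path_ops);

      -- Containment queries like this one can use it:
      SELECT id, data ->> 'name'
      FROM people
      WHERE data @> '{"phonenumbers": [{"number": "12345678"}]}';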

  11. Bitdivine says:

    I have been keeping a close eye on our query performance – as ever – and recently watched the speed of two sets of queries: one queries the join of a handful of tables, in the other all the data is together in one JSON blob. Surprise surprise, the JSON was outperforming the joins by a factor of six. On further investigation, what did it come down to? Pulling one JSON blob out of the database required one cache miss. Pulling together data from five tables in a join + group by meant many more cache misses. I am now very seriously considering having JSON blobs plus a few columns for IDs. The IDs are there for the relational part of SQL, and are to be used as little as possible. Whilst the lack of an enforced structure may make database designers cringe, performance-wise it makes a lot of sense. Just because you have relations doesn’t mean that they are cheap.

    • craig.ringer says:

      Interesting input. Denormalization for performance definitely does have its place. If you have a hierarchy of owner/owned entities that form logical units, are infrequently updated, and are always or usually fetched together, then that would make a fair bit of sense.

      It depends a lot on your query patterns and what else you do with the data.

  12. muhannad says:

    Thanks for your post. What do you mean by arrays not usually being an appropriate way to store authoritative data? Can you elaborate with examples? How about single-dimensional arrays?

    • craig.ringer says:

      Arrays – single or multi-dimensional – suffer the same issues as json/xml/jsonb/hstore columns. If you modify one part, the whole thing must be rewritten to disk. If you read any part, the whole thing must be read.

      They’re handy for denormalization when you know you have to read/write the whole thing anyway, such as in materialized views. But they shouldn’t be the primary way you store the data in most cases.
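
      As a rough illustration of that narrower use, something along these lines (assuming normalised person and phone tables like those sketched in the post) is a reasonable place for an array, since the whole row is always rebuilt and read together:

      -- Denormalised read model; the array is rebuilt whenever the view refreshes
      CREATE MATERIALIZED VIEW person_summary AS
      SELECT p.id,
             p.name,
             array_agg(ph.phoneno) AS phone_numbers
      FROM person p
      LEFT JOIN phone ph ON ph.person_id = p.id
      GROUP BY p.id, p.name;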

  13. kevinhamon says:

    This is a great article, and this is about the 4th time I’ve come across it looking up postgres’ json capability for various reasons. JSON inside of a relational database is not ideal, and I’m not sure why anyone would ever start there, which is what this seems to be warning against. The only time I’ve ever had to deal with json fields in a relational DB has been in enterprise situations where the technology decisions were driven by the existing platform. Maybe it’s more common than I think.

    I thought it might be useful for future visitors to have a concrete example of what I consider a good use case for json in postgres, one that I think falls in line with the author’s views as well. Feel free to correct me on that.

    The one case where I’ve really found success in using json inside of a relational db has been when I needed to denormalize tables into a complex structure. Typically this is a lot of heavy lifting, so I keep them in a materialized view, but I could see cases where one might want to use a plain old view.

    In my case, I inherited a postgres-backed app that needed search bolted on. The search requires specific filters which are hierarchical. Using a materialized view and jsonb, I can wrap up everything I need for the JavaScript in one upfront call. The only calls back to the controller from that point on are to perform the search. The view is indexed by the object at the top of the hierarchy and includes all the data that will ever be needed to form the search query, wrapped up in json fields. This is a Rails app, so having the structure already in json does marginally help rendering performance. I tried this from all angles (denormalizing in app code, storing denormalized objects in Elasticsearch, multiple smaller views). In the end, allowing postgres to maintain its mat view in the background and automatically update when the underlying data changes was the best way I found to accomplish this.

    This is a bit of a digression from the topic, but if you’re a Rails developer who is hesitant to use database views because AR doesn’t have native support for view migrations, check out the Scenic gem (I’m not affiliated). It just makes views really manageable and familiar for Rails development.

  14. Hello, great article. I have one question. I am new to Postgres NoSQL. How could I select, in your example, all people that live in Thornlie? There can be more than one address in Thornlie for one person, and also many people could live in Thornlie.

    Thanks.
