On Fri, Dec 31, 2021 at 2:07 PM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
> Hi,
>
> One of the complaints I sometimes hear from users and customers using Postgres to store JSON documents (as JSONB type, of course) is that the selectivity estimates are often pretty poor.
>
> Currently we only really have MCV lists and histograms built on whole documents, and we can deduce some stats from that. But that is somewhat bogus, because there are only ~100 documents in either the MCV list or the histogram (with the default statistics target). Moreover, we discard all "oversized" values (over 1kB) before even calculating those stats, which makes them even less representative.
>
> A couple of weeks ago I started playing with this, and I experimented with improving extended statistics in this direction. After a while I noticed a forgotten development branch from 2016 which tried to do this by adding a custom typanalyze function, which seemed like a more natural idea (because it's really statistics for a single column).
>
> But then I went to pgconf NYC in early December and spoke to Oleg about various JSON-related things, and he mentioned they had been working on this topic some time ago too, but did not have time to pursue it. So he pointed me to a branch [1] developed by Nikita Glukhov.
>
> I like Nikita's branch because it solves a couple of architectural issues that I had been aware of but had only solved in a rather hackish way myself.
>
> I had a discussion with Nikita about his approach and what we can do to move it forward. He's focusing on other JSON stuff, but he's OK with me taking over and moving it forward. So here we go ...
>
> Nikita rebased his branch recently, and I've kept improving it in various ways (mostly a lot of comments and docs, plus some minor fixes and tweaks). I've pushed my version with a couple of extra commits in [2], but you can ignore that unless you want to see what I added/changed.
>
> Attached are a couple of patches adding the main part of the feature. There are a couple more commits in the GitHub repositories adding more advanced features - I'll briefly explain those later, but I'm not including them here because they are optional and it'd be distracting.
>
> There are 6 patches in the series, but the magic mostly happens in parts 0001 and 0006. The other parts mostly just add infrastructure, which may be a sizeable amount of code, but the changes are fairly simple and obvious. So let's focus on 0001 and 0006.
>
> To add JSON statistics we need to do two basic things - we need to build the statistics, and then we need to allow using them while estimating conditions.
>
>
> 1) building stats
>
> Let's talk about building the stats first. The patch does one of the things I experimented with - 0006 adds a jsonb_typanalyze function and associates it with the data type. The function extracts paths and values from the JSONB documents, builds the statistics, and then stores the result in pg_statistic as a new stakind.
>
> I'd been planning to store the stats in pg_statistic too, but I was considering using a custom data type. The patch does something far more elegant - it simply uses stavalues to store an array of JSONB documents, each describing stats for one path extracted from the sampled documents.
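As a quick aside (this is only a sketch of mine, not something from the patch): one way to eyeball what ANALYZE ends up storing is to look at the raw per-column slots in pg_statistic. The table/column names below are made up, and which stakindN/stavaluesN slot the new JSON stakind lands in is an assumption of the example:

  -- the type-level analyze hook the patch wires up for jsonb
  SELECT typname, typanalyze FROM pg_type WHERE typname = 'jsonb';

  -- raw per-column slots; the new stakind should show up in one of them,
  -- with the corresponding stavaluesN holding the array of per-path JSONB docs
  SELECT a.attname,
         s.stakind1, s.stakind2, s.stakind3, s.stakind4, s.stakind5,
         s.stavalues1
    FROM pg_statistic s
    JOIN pg_class c     ON c.oid = s.starelid
    JOIN pg_attribute a ON a.attrelid = s.starelid AND a.attnum = s.staattnum
   WHERE c.relname = 'documents'   -- hypothetical table with a jsonb column
     AND a.attname = 'doc';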
> One (very simple) element of the array might look like this:
>
>   {"freq": 1,
>    "json": {
>      "mcv": {
>        "values": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
>        "numbers": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},
>      "width": 19,
>      "distinct": 10,
>      "nullfrac": 0,
>      "correlation": 0.10449},
>    "path": "$.\"a\"",
>    "freq_null": 0, "freq_array": 0, "freq_object": 0,
>    "freq_string": 0, "freq_boolean": 0, "freq_numeric": 0}
>
> In this case there's only an MCV list (represented by two arrays, just like in pg_statistic), but there might be another part with a histogram. There are also the other columns we'd expect to see in pg_statistic.
>
> In principle, we need a pg_statistic entry for each path we extract from the JSON documents and decide is interesting enough for estimation. There are probably other ways to serialize/represent this, but I find JSONB pretty convenient here, because we need to deal with a mix of data types (for the same path) and other JSON-specific stuff. Storing that in Postgres arrays would be problematic.
>
> I'm sure there are plenty of open questions - for example, I think we'll need some logic to decide which paths to keep, otherwise the statistics can get quite big if we're dealing with large / variable documents. We're already doing something similar for MCV lists.
>
> One of Nikita's patches not included in this thread allows "selective" statistics, where you can define in advance a "filter" restricting which parts are considered interesting by ANALYZE. That's interesting, but I think we need some simple MCV-like heuristics first anyway.
>
> Another open question is how deep the stats should be. Imagine documents like this:
>
>   {"a" : {"b" : {"c" : {"d" : ...}}}}
>
> The current patch builds stats for all possible paths:
>
>   "a"
>   "a.b"
>   "a.b.c"
>   "a.b.c.d"
>
> and of course many of the paths will often have JSONB documents as values, not simple scalar values. I wonder if we should limit the depth somehow, and maybe build stats only for scalar values.
>
>
> 2) applying the statistics
>
> One of the problems is how to actually use the statistics. For the @> operator it's simple enough, because it's (jsonb @> jsonb), so we have direct access to the stats. But often the conditions look like this:
>
>   jsonb_column ->> 'key' = 'value'
>
> so the condition is actually on an expression, not on the JSONB column directly. My solutions were pretty ugly hacks, but Nikita had a neat idea - we can define a custom procedure for each operator, which is responsible for "calculating" the statistics for the expression.
>
> So in this case "->>" will have such an "oprstat" procedure, which fetches the stats for the JSONB column and extracts the stats for the "key" path. And then we can use those for estimating the (text = text) condition.
>
> This is what 0001 does, pretty much. We simply look for expression stats provided by an index or extended statistics, and then - if oprstat is defined for the operator - we try to derive the stats.
>
> This opens other interesting opportunities for the future - one of the parts adds oprstat for basic arithmetic operators, which allows deducing statistics for expressions like (a+10) from statistics on column (a).
>
> That seems like a neat feature on its own, but it also interacts with the extended statistics in somewhat non-obvious ways (especially when estimating GROUP BY cardinalities).
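Again just as an illustration from my side (the table/column names are invented, and the estimates you get obviously depend on the patch and the data), these are the kinds of conditions that should benefit once oprstat can derive expression stats:

  CREATE TABLE documents (id serial PRIMARY KEY, doc jsonb);
  -- ... load some documents, then:
  ANALYZE documents;

  -- with per-path stats on "doc" and an oprstat procedure for ->>, the
  -- planner can estimate this much like a plain (text = text) condition
  EXPLAIN SELECT * FROM documents WHERE doc ->> 'key' = 'value';

  -- similarly, oprstat on arithmetic operators lets stats for (a + 10)
  -- be derived from stats on column a (t is another made-up table)
  CREATE TABLE t (a int);
  EXPLAIN SELECT a + 10, count(*) FROM t GROUP BY a + 10;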
> Of course, there's a limit to what we can reasonably estimate - for example, there may be statistical dependencies between paths, and this patch does not even attempt to deal with that. In a way, this is similar to correlation between columns, except that here we have a dynamic set of columns, which makes it much, much harder. We'd need something like extended stats on steroids, pretty much.
>
>
> I'm sure I've forgotten various important bits - many of them are mentioned or explained in comments, but I'm sure others are not. And I'd bet there are things I forgot about entirely or got wrong. So feel free to ask.
>
>
> In any case, I think this seems like a good first step toward improving our estimates for JSONB columns.
>
> regards
>
>
> [1] https://github.com/postgrespro/postgres/tree/jsonb_stats
>
> [2] https://github.com/tvondra/postgres/tree/jsonb_stats_rework
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Hi,

For patch 0001:

+    List    *statisticsName = NIL;    /* optional stats estimat. procedure */

I think if the variable were named estimatorName (or something similar), it would be easier for people to grasp its purpose.

+    /* XXX perhaps full "statistics" wording would be better */
+    else if (strcmp(defel->defname, "stats") == 0)

I would recommend the following ("stats" sounds too general):

+    else if (strcmp(defel->defname, "statsestimator") == 0)

+    statisticsOid = ValidateStatisticsEstimator(statisticsName);

statisticsOid -> statsEstimatorOid

For get_oprstat():

+    }
+    else
+        return (RegProcedure) InvalidOid;

The else keyword is not needed, given the return statement in the if block.

For patch 0006:

+    /* FIXME Could be before the memset, I guess? Checking vardata->statsTuple. */
+    if (!data->statsTuple)
+        return false;

I agree the check can be lifted above the memset call.

+ * XXX This does not really extract any stats, it merely allocates the struct?
+ */
+static JsonPathStats
+jsonPathStatsGetSpecialStats(JsonPathStats pstats, JsonPathStatsType type)

As the comment says, I think allocJsonPathStats() would be a better name for the function.

+ * XXX Why doesn't this do jsonPathStatsGetTypeFreq check similar to what
+ * jsonPathStatsGetLengthStats does?

I think a `jsonPathStatsGetTypeFreq(pstats, jbvArray, 0.0) <= 0.0` check should be added to jsonPathStatsGetArrayLengthStats().

To be continued ...
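P.S. To make that last point a bit more concrete, I mean something along these lines near the top of jsonPathStatsGetArrayLengthStats(). This is only a sketch - I'm guessing at the exact placement and at the function returning NULL when no stats are available, so adjust to the actual code:

+    /* no array values on this path at all, so no array-length stats */
+    if (jsonPathStatsGetTypeFreq(pstats, jbvArray, 0.0) <= 0.0)
+        return NULL;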