On Sat, Jan 1, 2022 at 7:33 AM Zhihong Yu <z...@yugabyte.com> wrote:
>
> On Fri, Dec 31, 2021 at 2:07 PM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
>
>> Hi,
>>
>> One of the complaints I sometimes hear from users and customers using
>> Postgres to store JSON documents (as JSONB type, of course) is that the
>> selectivity estimates are often pretty poor.
>>
>> Currently we only really have MCV and histograms with whole documents,
>> and we can deduce some stats from that. But that is somewhat bogus
>> because there are only ~100 documents in either MCV/histogram (with the
>> default statistics target). And moreover we discard all "oversized"
>> values (over 1kB) before even calculating those stats, which makes it
>> even less representative.
>>
>> A couple weeks ago I started playing with this, and I experimented with
>> improving extended statistics in this direction. After a while I noticed
>> a forgotten development branch from 2016 which tried to do this by
>> adding a custom typanalyze function, which seemed like a more natural
>> idea (because it's really statistics for a single column).
>>
>> But then I went to pgconf NYC in early December, and I spoke to Oleg
>> about various JSON-related things, and he mentioned they'd been working
>> on this topic some time ago too, but did not have time to pursue it. So
>> he pointed me to a branch [1] developed by Nikita Glukhov.
>>
>> I like Nikita's branch because it solved a couple architectural issues
>> that I'd been aware of, but had only solved in a rather hackish way
>> myself.
>>
>> I had a discussion with Nikita about his approach and what we can do to
>> move it forward. He's focusing on other JSON stuff, but he's OK with me
>> taking over and moving it forward. So here we go ...
>>
>> Nikita rebased his branch recently, and I've kept improving it in
>> various ways (mostly a lot of comments and docs, and some minor fixes
>> and tweaks).
>> I've pushed my version with a couple extra commits in [2], but you can
>> ignore that unless you want to see what I added/changed.
>>
>> Attached is a couple of patches adding the main part of the feature.
>> There are a couple more commits in the github repositories, adding more
>> advanced features - I'll briefly explain those later, but I'm not
>> including them here because they are optional and it'd be distracting.
>>
>> There are 6 patches in the series, but the magic mostly happens in parts
>> 0001 and 0006. The other parts are mostly just adding infrastructure,
>> which may be a sizeable amount of code, but the changes are fairly
>> simple and obvious. So let's focus on 0001 and 0006.
>>
>> To add JSON statistics we need to do two basic things - we need to build
>> the statistics and then we need to allow using them while estimating
>> conditions.
>>
>>
>> 1) building stats
>>
>> Let's talk about building the stats first. The patch does one of the
>> things I experimented with - 0006 adds a jsonb_typanalyze function and
>> associates it with the data type. The function extracts paths and
>> values from the JSONB documents, builds the statistics, and then stores
>> the result in pg_statistic as a new stakind.
>>
>> I'd been planning to store the stats in pg_statistic too, but I'd been
>> considering using a custom data type. The patch does something far more
>> elegant - it simply uses stavalues to store an array of JSONB documents,
>> each describing stats for one path extracted from the sampled documents.
>>
>> One (very simple) element of the array might look like this:
>>
>>   {"freq": 1,
>>    "json": {
>>      "mcv": {
>>        "values": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
>>        "numbers": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},
>>      "width": 19,
>>      "distinct": 10,
>>      "nullfrac": 0,
>>      "correlation": 0.10449},
>>    "path": "$.\"a\"",
>>    "freq_null": 0, "freq_array": 0, "freq_object": 0,
>>    "freq_string": 0, "freq_boolean": 0, "freq_numeric": 0}
>>
>> In this case there's only an MCV list (represented by two arrays, just
>> like in pg_statistic), but there might be another part with a histogram.
>> There are also the other columns we'd expect to see in pg_statistic.
>>
>> In principle, we need pg_statistic for each path we extract from the
>> JSON documents and decide is interesting enough for estimation. There
>> are probably other ways to serialize/represent this, but I find using
>> JSONB for this pretty convenient because we need to deal with a mix of
>> data types (for the same path), and other JSON-specific stuff. Storing
>> that in Postgres arrays would be problematic.
>>
>> I'm sure there are plenty of open questions - for example I think we'll
>> need some logic to decide which paths to keep, otherwise the statistics
>> can get quite big if we're dealing with large / variable documents.
>> We're already doing something similar for MCV lists.
>>
>> One of Nikita's patches not included in this thread allows "selective"
>> statistics, where you can define in advance a "filter" restricting which
>> parts are considered interesting by ANALYZE. That's interesting, but I
>> think we need some simple MCV-like heuristics first anyway.
>>
>> Another open question is how deep the stats should be. Imagine documents
>> like this:
>>
>>   {"a" : {"b" : {"c" : {"d" : ...}}}}
>>
>> The current patch builds stats for all possible paths:
>>
>>   "a"
>>   "a.b"
>>   "a.b.c"
>>   "a.b.c.d"
>>
>> and of course many of the paths will often have JSONB documents as
>> values, not simple scalar values.
>> I wonder if we should limit the depth somehow, and maybe build stats
>> only for scalar values.
>>
>>
>> 2) applying the statistics
>>
>> One of the problems is how to actually use the statistics. For the @>
>> operator it's simple enough, because it's (jsonb @> jsonb) so we have
>> direct access to the stats. But often the conditions look like this:
>>
>>   jsonb_column ->> 'key' = 'value'
>>
>> so the condition is actually on an expression, not on the JSONB column
>> directly. My solutions were pretty ugly hacks, but Nikita had a neat
>> idea - we can define a custom procedure for each operator, which is
>> responsible for "calculating" the statistics for the expression.
>>
>> So in this case "->>" will have such an "oprstat" procedure, which
>> fetches stats for the JSONB column and extracts stats for the "key"
>> path. And then we can use that for estimation of the (text = text)
>> condition.
>>
>> This is what 0001 does, pretty much. We simply look for expression stats
>> provided by an index, extended statistics, and then - if oprstat is
>> defined for the operator - we try to derive the stats.
>>
>> This opens other interesting opportunities for the future - one of the
>> parts adds oprstat for basic arithmetic operators, which allows deducing
>> statistics for expressions like (a+10) from statistics on column (a).
>>
>> Which seems like a neat feature on its own, but it also interacts with
>> the extended statistics in somewhat non-obvious ways (especially when
>> estimating GROUP BY cardinalities).
>>
>> Of course, there's a limit to what we can reasonably estimate - for
>> example, there may be statistical dependencies between paths, and this
>> patch does not even attempt to deal with that. In a way, this is similar
>> to correlation between columns, except that here we have a dynamic set
>> of columns, which makes it much, much harder. We'd need something like
>> extended stats on steroids, pretty much.
>>
>>
>> I'm sure I've forgotten various important bits - many of them are
>> mentioned or explained in comments, but I'm sure others are not. And I'd
>> bet there are things I forgot about entirely or got wrong. So feel free
>> to ask.
>>
>>
>> In any case, I think this seems like a good first step to improve our
>> estimates for JSONB columns.
>>
>> regards
>>
>>
>> [1] https://github.com/postgrespro/postgres/tree/jsonb_stats
>>
>> [2] https://github.com/tvondra/postgres/tree/jsonb_stats_rework
>>
>> --
>> Tomas Vondra
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>
>
> Hi,
> For patch 1:
>
> +    List    *statisticsName = NIL;    /* optional stats estimat. procedure */
>
> I think if the variable is named estimatorName (or something similar), it
> would be easier for people to grasp its purpose.
>
> +    /* XXX perhaps full "statistics" wording would be better */
> +    else if (strcmp(defel->defname, "stats") == 0)
>
> I would recommend (stats sounds too general):
>
> +    else if (strcmp(defel->defname, "statsestimator") == 0)
>
> +    statisticsOid = ValidateStatisticsEstimator(statisticsName);
>
> statisticsOid -> statsEstimatorOid
>
> For get_oprstat():
>
> +    }
> +    else
> +        return (RegProcedure) InvalidOid;
>
> The else keyword is not needed (considering the return statement in the
> if block).
>
> For patch 0006:
>
> +    /* FIXME Could be before the memset, I guess? Checking vardata->statsTuple. */
> +    if (!data->statsTuple)
> +        return false;
>
> I would agree the check can be lifted above the memset call.
>
> + * XXX This does not really extract any stats, it merely allocates the
> + * struct?
> + */
> +static JsonPathStats
> +jsonPathStatsGetSpecialStats(JsonPathStats pstats, JsonPathStatsType type)
>
> As the comment says, I think allocJsonPathStats() would be a better name
> for the func.
>
> + * XXX Why doesn't this do jsonPathStatsGetTypeFreq check similar to what
> + * jsonPathStatsGetLengthStats does?
>
> I think a `jsonPathStatsGetTypeFreq(pstats, jbvArray, 0.0) <= 0.0` check
> should be added to jsonPathStatsGetArrayLengthStats().
>
> To be continued ...

Hi,
+static JsonPathStats
+jsonStatsFindPathStats(JsonStats jsdata, char *path, int pathlen)

"Stats" appears twice in the method name. I think findJsonPathStats()
should suffice.

It should check `if (jsdata->nullfrac >= 1.0)` as jsonStatsGetPathStatsStr
does.

+JsonPathStats
+jsonStatsGetPathStatsStr(JsonStats jsdata, const char *subpath, int subpathlen)

This func can be static, right?
I think findJsonPathStatsWithPrefix() would be a better name for the func.

+ * XXX Doesn't this need escape_json too?
+ */
+static void
+jsonPathAppendEntryWithLen(StringInfo path, const char *entry, int len)
+{
+    char *tmpentry = pnstrdup(entry, len);
+    jsonPathAppendEntry(path, tmpentry);

escape_json() is called within jsonPathAppendEntry(), so the XXX comment
can be dropped.

+jsonPathStatsGetArrayIndexSelectivity(JsonPathStats pstats, int index)

It seems getJsonSelectivityWithArrayIndex() would be a better name.

+        sel = scalarineqsel(NULL, operator,
+                            operator == JsonbGtOperator ||
+                            operator == JsonbGeOperator,
+                            operator == JsonbLeOperator ||
+                            operator == JsonbGeOperator,

Looking at the comment for scalarineqsel():

 * scalarineqsel  - Selectivity of "<", "<=", ">", ">=" for scalars.
 *
 * This is the guts of scalarltsel/scalarlesel/scalargtsel/scalargesel.
 * The isgt and iseq flags distinguish which of the four cases apply.

It seems JsonbLtOperator doesn't appear in the call - can I ask why?

Cheers
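
PS: while double-checking my reading of the quoted call, I wrote down the
truth table that the two boolean arguments encode (illustrative Python with
hypothetical names, not patch code). If I read it correctly, "<" is the case
where both isgt and iseq come out false, which would explain why
JsonbLtOperator needn't be named explicitly - but confirmation would help:

```python
# Truth table for the isgt/iseq arguments in the quoted scalarineqsel()
# call, spelled out per operator. Illustrative sketch, not patch code.

def scalarineq_flags(op):
    isgt = op in ('>', '>=')   # operator == JsonbGtOperator || JsonbGeOperator
    iseq = op in ('<=', '>=')  # operator == JsonbLeOperator || JsonbGeOperator
    return isgt, iseq

for op in ('<', '<=', '>', '>='):
    print(op, scalarineq_flags(op))
# '<'  -> (False, False)   covered without naming JsonbLtOperator
# '<=' -> (False, True)
# '>'  -> (True, False)
# '>=' -> (True, True)
```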