On Tue, 25 Jan 2022 at 03:50, Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
>
> On 1/23/22 01:24, Nikita Glukhov wrote:
> > Hi!
> >
> > I am glad that you found my very old patch interesting and started to
> > work on it. We failed to post it in 2016 mostly because we were not
> > satisfied with JSONB storage. Also we decided to wait for completion
> > of work on extended statistics as we thought that it could help us.
> > But in early 2017 we switched to SQL/JSON and forgot about this patch.
> >
>
> Understood. Let's see how feasible this idea is and if we can move this
> forward.
>
> >
> > I think a custom datatype is necessary for better performance. With a
> > plain JSONB we need to do a lot of work to extract path stats:
> > - iterate through MCV/Histogram JSONB arrays
> > - cast numeric values to float, string to text etc.
> > - build PG arrays from extracted datums
> > - form the pg_statistic tuple.
> >
> > With a custom data type we could store the pg_statistic tuple unmodified
> > and use it without any overhead. But then we need to modify
> > VariableStatData and several functions a bit to pass additional nullfrac
> > corrections.
> >
>
> I'm not against evaluating/exploring alternative storage formats, but my
> feeling is the real impact on performance will be fairly low. At least I
> haven't seen this as very expensive while profiling the patch so far. Of
> course, I may be wrong, and it may be more significant in some cases.
>
> > Maybe a simple record type (path text, data pg_statistic, ext_data jsonb)
> > would be enough.
>
> Maybe, but then you still need to store a bunch of those, right? So
> either an array (likely toasted) or a 1:M table. I'm not sure it's going
> to be much cheaper than JSONB.
>
> I'd suggest we focus on what we need to store first, which seems like
> the primary question, and worry about the exact storage format then.
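[To make the per-path extraction overhead Nikita describes concrete, here is a rough Python sketch of those steps. This is purely illustrative, not the patch's C code; the field names (`stakind`, `stavalues`, `stanumbers`) only mimic the pg_statistic layout.]

```python
import json

def extract_path_stats(stats_jsonb: str) -> dict:
    """Rebuild typed stats arrays for one path from a JSONB-like blob.

    Each step below (parsing/iterating the MCV and histogram structures,
    casting numerics to float, building arrays, forming the final tuple)
    is overhead that a custom datatype holding the pg_statistic tuple
    directly would avoid.
    """
    doc = json.loads(stats_jsonb)                        # iterate JSONB structure
    mcv_vals = [float(v) for v in doc["mcv"]["values"]]  # numeric -> float casts
    mcv_freqs = [float(f) for f in doc["mcv"]["freqs"]]
    hist = [float(b) for b in doc["histogram"]]
    # form a pg_statistic-like tuple from the extracted datums
    return {"stakind": "mcv+hist", "stavalues": mcv_vals,
            "stanumbers": mcv_freqs, "histogram_bounds": hist}

blob = '{"mcv": {"values": [1.5, 2], "freqs": [0.6, 0.3]}, "histogram": [0, 5, 10]}'
print(extract_path_stats(blob))
```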
> >
> > Also there is an idea to store per-path stats separately in
> > pg_statistic_ext rows, using expressions like (jb #> '{foo,bar}') as
> > stxexprs. This could also help the user select which paths to analyze,
> > simply by using some sort of CREATE STATISTICS. But it is really unclear
> > how to:
> > * store pg_statistic_ext rows from typanalyze
> > * form path expressions for array elements (maybe add a new jsonpath
> >   operator)
> > * match various -> operator chains to stxexprs
> > * give jsonpath-like languages a simple API for searching by stxexprs
> >
>
> Sure, you can do statistics on expressions, right? Of course, if that
> expression produces a JSONB value, that's not very useful at the moment.
> Maybe we could have two typanalyze functions - one for regular analyze,
> one for extended statistics?
>
> That being said, I'm not sure extended stats are a good match for this.
> My feeling was we should collect these stats for all JSONB columns,
> which is why I argued for putting that in pg_statistic.
>
> >
> > Per-path statistics should only be collected for scalars. This can be
> > enabled by the flag JsonAnalyzeContext.scalarsOnly. But there is a
> > problem: the computed MCVs and histograms will be broken and we will
> > not be able to use them for queries like (jb > const) in the general
> > case. Also we will not be able to use them internally in
> > scalarineqsel() and var_eq_const() (see jsonSelectivity()). Instead, we
> > will have to implement our own estimator functions for the JSONB
> > comparison operators that correctly use our hacked MCVs and histograms
> > (and maybe not all cases will be supported; for example, comparison to
> > scalars only).
> >
>
> Yeah, but maybe that's an acceptable trade-off? I mean, if we can
> improve estimates for most clauses, and there's some number of clauses
> that are estimated just like without stats, that's still an improvement,
> right?
>
> > It's possible to replace objects and arrays with empty ones when
> > scalarsOnly is set, to keep correct frequencies of non-scalars.
> > But there is an old bug in JSONB comparison: empty arrays are placed
> > before other values in the JSONB sort order, although they should go
> > after all scalars. So we also need to distinguish empty and non-empty
> > arrays here.
> >
>
> Hmmm ...
>
> >
> > I tried to fix a major part of the places marked XXX and FIXME; the new
> > version of the patches is attached. There are a lot of changes; you
> > can see them in step-by-step form in the corresponding branch
> > jsonb_stats_20220122 in our GitHub repo [1].
> >
>
> Thanks! I'll go through the changes soon.
>
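[For illustration, here is a toy Python model of the ordering quirk Nikita describes. It is not the real compareJsonbContainers() logic; it only encodes the documented JSONB btree type order (Object > Array > Boolean > Numeric > String > Null) plus the empty-array anomaly from the thread.]

```python
# Documented JSONB type order, lowest to highest.
TYPE_RANK = {"null": 0, "string": 1, "number": 2, "bool": 3, "array": 4, "object": 5}

def jsonb_sort_key(v):
    """Simplified sort key mimicking the JSONB btree ordering."""
    if isinstance(v, dict):
        return (TYPE_RANK["object"], len(v))
    if isinstance(v, list):
        if not v:
            # the anomaly described above: [] sorts before everything,
            # even though arrays should sort after all scalars
            return (-1, 0)
        return (TYPE_RANK["array"], len(v))
    if isinstance(v, bool):           # must precede the int/float check
        return (TYPE_RANK["bool"], int(v))
    if v is None:
        return (TYPE_RANK["null"], 0)
    if isinstance(v, (int, float)):
        return (TYPE_RANK["number"], v)
    return (TYPE_RANK["string"], v)

vals = [[1, 2], [], 5, "abc", None, True]
# [] lands first, so replacing a non-empty array with [] (the scalarsOnly
# trick) moves it across every scalar value, skewing MCV/histogram order.
print(sorted(vals, key=jsonb_sort_key))
```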
Thanks, Nikita and Tomas, for these patches. For the last few days, I have
been trying to understand them, and based on Tomas's suggestion, I ran some
performance tests. With the attached .sql file, I can see that ANALYZE
takes much more time with these patches.

*Setup:* autovacuum=off; all other settings are defaults. Run the attached
file with and without the patch to compare the time taken by ANALYZE.

*With json patches:*
postgres=# analyze test ;
ANALYZE
Time: 28464.062 ms (00:28.464)
postgres=# SELECT pg_size_pretty(pg_total_relation_size('pg_catalog.pg_statistic'));
 pg_size_pretty
----------------
 328 kB
(1 row)

*Without json patches:*
postgres=# analyze test ;
ANALYZE
Time: 120.864 ms
postgres=# SELECT pg_size_pretty(pg_total_relation_size('pg_catalog.pg_statistic'));
 pg_size_pretty
----------------
 272 kB
(1 row)

I haven't found the root cause of this yet, but I suspect the time is spent
looping over all the paths. In my test data, there are 951 distinct paths.
During analysis, we first scan all the sample rows (30000) and collect all
the distinct paths (here 951), and after that, for every single path, we
loop over all the sample rows again to collect stats for that particular
path. I feel these loops might be taking the time. I will run perf and try
to find the root cause.

Apart from this performance issue, I haven't found any crashes or problems.

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
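P.S. To illustrate why the per-path rescan could dominate ANALYZE time, here is a hypothetical Python sketch (not the patch's actual C code) contrasting the two-phase loop described above with a single-pass accumulation. All names here are made up for the illustration.

```python
def flatten(doc, prefix=()):
    """Yield (path, value) pairs for every scalar reachable in a JSON document."""
    if isinstance(doc, dict):
        for k, v in doc.items():
            yield from flatten(v, prefix + (k,))
    elif isinstance(doc, list):
        for v in doc:
            yield from flatten(v, prefix + ("#",))
    else:
        yield prefix, doc

def per_path_rescan(rows):
    """Two-phase approach: collect paths, then rescan every row once per path.

    Row visits grow as O(paths * rows) - with 951 paths and 30000 sample
    rows that is ~28.5 million row visits.
    """
    paths = {p for row in rows for p, _ in flatten(row)}   # pass 1
    stats, visits = {}, 0
    for p in paths:                                        # pass 2
        vals = []
        for row in rows:
            visits += 1
            vals.extend(v for q, v in flatten(row) if q == p)
        stats[p] = vals
    return stats, visits

def single_pass(rows):
    """Accumulate values for every path while scanning the sample once: O(rows)."""
    stats, visits = {}, 0
    for row in rows:
        visits += 1
        for p, v in flatten(row):
            stats.setdefault(p, []).append(v)
    return stats, visits

rows = [{"a": 1, "b": {"c": i}} for i in range(100)]       # 2 paths, 100 rows
s1, n1 = per_path_rescan(rows)
s2, n2 = single_pass(rows)
print(n1, n2)   # 200 row visits vs 100; same per-path stats either way
```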
test_jsonb_1.sql