On 1/23/22 01:24, Nikita Glukhov wrote:
Hi!

I am glad that you found my very old patch interesting and started to
work on it.  We failed to post it in 2016 mostly because we were not
satisfied with JSONB storage.  Also we decided to wait for completion
of work on extended statistics as we thought that it could help us.
But in early 2017 we switched to SQL/JSON and forgot about this patch.


Understood. Let's see how feasible this idea is and if we can move this forward.


I think a custom datatype is necessary for better performance. With
plain JSONB we need to do a lot of work to extract path stats:
  - iterate through MCV/Histogram JSONB arrays
  - cast numeric values to float, strings to text, etc.
  - build PG arrays from the extracted datums
  - form the pg_statistic tuple.

With a custom data type we could store the pg_statistic tuple unmodified
and use it without any overhead.  But then we would need to modify
VariableStatData and several functions a bit to pass additional nullfrac
corrections.
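To illustrate the overhead: if path stats live in plain JSONB, each use has to unpack something like the following (the layout below is purely hypothetical, not the patch's actual format):

```sql
-- Hypothetical per-path stats stored as plain JSONB.  Before the planner
-- can use this, the numeric elements must be cast to the proper types,
-- the value lists rebuilt as PG arrays, and a pg_statistic tuple formed.
SELECT '{
  "path":     "$.foo.bar",
  "nullfrac": 0.1,
  "mcv":      {"values": [1, 2, 3], "freqs": [0.5, 0.3, 0.2]}
}'::jsonb AS path_stats;
```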


I'm not against evaluating/exploring alternative storage formats, but my feeling is the real impact on performance will be fairly low. At least I haven't seen this as very expensive while profiling the patch so far. Of course, I may be wrong, and it may be more significant in some cases.

Maybe simple record type (path text, data pg_statistic, ext_data jsonb)
would be enough.

Maybe, but then you still need to store a bunch of those, right? So either an array (likely TOASTed) or a 1:M table. I'm not sure it's going to be much cheaper than JSONB.

I'd suggest we focus on what we need to store first, which seems like the primary question, and worry about the exact storage format later.
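For concreteness, the record type suggested above might look roughly like this (a sketch only; the type name is made up):

```sql
-- Sketch of the suggested per-path record type (name is illustrative).
-- pg_statistic's row type would hold the standard scalar stats
-- unmodified; ext_data would carry any JSON-specific extras.
CREATE TYPE jsonb_path_stats AS (
    path     text,
    data     pg_statistic,
    ext_data jsonb
);
```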


Also there is an idea to store per-path stats separately in pg_statistic_ext
rows, using expressions like (jb #> '{foo,bar}') as stxexprs.  This could
also let users select which paths to analyze, simply by using some form
of CREATE STATISTICS.  But it is really unclear how to:
  * store pg_statistic_ext rows from typanalyze
  * form path expressions for array elements (maybe add a new jsonpath
    operator)
  * match various -> operator chains to stxexprs
  * give jsonpath-like languages a simple API for searching by stxexprs
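As a rough sketch of that idea (hypothetical; this relies on the expression statistics syntax added in PostgreSQL 14, and the statistics/table names are made up):

```sql
-- Hypothetical: per-path statistics via extended statistics on a
-- JSONB path expression.  How typanalyze would create such rows
-- automatically is exactly the open question above.
CREATE STATISTICS jb_foo_bar_stats
    ON (jb #> '{foo,bar}')
    FROM my_table;
```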


Sure, you can do statistics on expressions, right? Of course, if that expression produces JSONB value, that's not very useful at the moment. Maybe we could have two typanalyze functions - one for regular analyze, one for extended statistics?

That being said, I'm not sure extended stats are a good match for this. My feeling was we should collect these stats for all JSONB columns, which is why I argued for putting that in pg_statistic.



Per-path statistics should only be collected for scalars.  This can be
enabled by the flag JsonAnalyzeContext.scalarsOnly.  But there is a
problem: the computed MCVs and histograms will be broken, and we will
not be able to use them for queries like (jb > const) in the general
case.  Nor will we be able to use them internally in scalarineqsel()
and var_eq_const() (see jsonSelectivity()).  Instead, we would have to
implement our own estimator functions for JSONB comparison operators
that correctly use our hacked MCVs and histograms (and maybe not all
cases will be supported; for example, only comparison to scalars).


Yeah, but maybe that's an acceptable trade-off? I mean, if we can improve estimates for most clauses, and there's some number of clauses that are estimated just like without stats, that's still an improvement, right?

It's possible to replace objects and arrays with empty ones when
scalarsOnly is set, to keep correct frequencies of non-scalars.
But there is an old bug in JSONB comparison: empty arrays are placed
before other values in the JSONB sort order, although they should go
after all scalars.  So we also need to distinguish empty and non-empty
arrays here.
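If I read that right, the bug could be demonstrated like this (untested here; based on the description above and the documented jsonb ordering Object > Array > Boolean > Number > String > Null):

```sql
-- Arrays are documented to sort after all scalar types, so both of
-- these should return true -- but per the bug described above, the
-- empty array reportedly sorts before scalars:
SELECT '[1]'::jsonb > '"foo"'::jsonb;  -- true, as documented
SELECT '[]'::jsonb  > '"foo"'::jsonb;  -- reportedly false (the bug)
```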


Hmmm ...


I tried to fix a major part of places marked as XXX and FIXME, the new
version of the patches is attached.  There are a lot of changes, you
can see them in a step-by-step form in the corresponding branch
jsonb_stats_20220122 in our GitHub repo [1].


Thanks! I'll go through the changes soon.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

