On 5/17/22 13:44, Mahendra Singh Thalor wrote: > ... > > Hi Nikita, > I and Tomas discussed the design for disabling all-paths > collection(collect stats for only some paths). Below are some > thoughts/doubts/questions. > > *Point 1)* Please can you elaborate more that how are you going to > implement this(collect stats for only some paths).
I think Nikita mentioned he plans to only build stats only for most common paths, which seems generally straightforward: 1) first pass over the documents, collect distinct paths and track how many times we saw each one 2) in the second pass extract stats only for the most common paths (e.g. the top 100 most common ones, or whatever the statistics target says) I guess we might store at least the frequencing for uncommon paths, which seems somewhat useful for selectivity estimation. I wonder if we might further optimize this for less common paths. AFAICS one of the reasons why we process the paths one by one (in the second pass) is to limit memory consumption. By processing a single path, we only need to accumulate values for that path. But if we know the path is uncommon, we know there'll be few values. For example the path may be only in 100 documents, not the whole sample. So maybe we might process multiple paths at once (which would mean we don't need to detoast the JSON documents that often, etc.). OTOH that may be pointless, because if the paths are uncommon, chances are the subsets of documents will be different, in which case it's probably cheaper to just process the paths one by one. > *Point 2) *As JSON stats are taking time so should we add an on/off > switch to collect JSON stats? IMHO we should think about doing that. I think it's not really possible to eliminate (significant) regressions for all corner cases, and in many cases people don't even need this statistics (e.g. when just storing and retrieving JSON docs, without querying contents of the docs). I don't know how exactly to enable/disable this - it very much depends on how we store the stats. If we store that in pg_statistic, then ALTER TABLE ... ALTER COLUMN seems like the right way to enable/disable these path stats. We might also have a new "json" stats and do this through CREATE STATISTICS. Or something else, not sure. > *Point 3)* We thought of one more design: we can give an explicit path > to collect stats for a particular path only or we can pass a subset of > the JSON values but this may require a lot of code changes like syntax > and all so we are thinking that it will be good if we can collect stats > only for some common paths(by limit or any other way) > I'm not sure I understand what this is saying, particularly the part about subset of JSON values. Can you elaborate? I can imagine specifying a list of interesting paths, and we'd only collect stats for the matching subset of the JSON documents. So if you have huge JSON documents with complex schema, but you only query a very limited subset of paths, we could restrict ANALYZE to this subset. In fact, that's what the 'selective analyze' commit [1] in Nikita's original patch set does in principle. We'd probably need to improve this in some ways (e.g. to allow defining the column filter not just in ANALYZE itself). I left it out of this patch to keep the patch as simple as possible. But why/how exactly would we limit the "JSON values"? Can you give some example demonstrating that in practice? regards [1] https://github.com/postgrespro/postgres/commit/7ab7397450df153e5a8563c978728cb731a0df33 -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company