Hi All, I apologize for not being able to continue development due to various circumstances. The attached file is the rebased patch. I will now catch up on the discussion and try to revise the patch.
Regards, Tatsuro Yamada On Fri, Jul 19, 2024 at 7:17 PM <masahiro.ik...@nttdata.com> wrote: > > On 7/18/24 12:37, masahiro.ik...@nttdata.com wrote: > > >> Let me share my opinion on those questions ... > > > ...> > > >> For ndistinct, I think we don't show this because it doesn't go > > >> through clauselist_selectivity, which is the only thing I modified in > the PoC. > > >> But I guess we might improve estimate_num_groups() to track the stats > > >> in a similar way, I guess. > > > > > > Thanks. IIUC, the reason is that it doesn't go through > > > statext_clauselist_selectivity() because the number of clauses is one > though it goes > > through clauselist_selectivity(). > > > > > > > Ah, I see I misunderstood the original report. The query used was > > > > EXPLAIN (STATS, ANALYZE) SELECT * FROM t3 > > WHERE date_trunc('month', a) = '2020-01-01'::timestamp; > > > > And it has nothing to do with the number of clauses being one neither. > > > > The problem is this estimate is handled by examine_variable() matching > the expression > > to the "expression" stats, and injecting it into the variable, so that > the > > clauselist_selectivity() sees these stats. > > > > This would happen even if you build just expression statistics on each > of the > > date_trunc() calls, and then tried a query with two clauses: > > > > CREATE STATISTICS s4 ON date_trunc('day', a) FROM t3; > > CREATE STATISTICS s3 ON date_trunc('month', a) FROM t3; > > > > EXPLAIN SELECT * FROM t3 > > WHERE date_trunc('month', a) = '2020-01-01'::timestamp > > AND date_trunc('day', 'a') = '2020-01-01'::timestamp; > > > > Not sure how to handle this - we could remember when explain_variable() > injects > > statistics like this, I guess. But do we know that each call to > > examine_variable() is for estimation? And do we know for which clause? > > I see. The issue is related to extended statistics for single expression. > As a > first step, it's ok for me that we don't support it. > > The below is just an idea to know clauses... > Although I'm missing something, can callers of examine_variable() > for estimation to rebuild the clauses from partial information of "OpExpr"? > > Only clause_selectivity_ext() knows the information of actual full clauses. > But we don't need full information. It's enough to know the information > to show "OpExpr" for EXPLAIN. > > get_oper_expr() deparse "OpExpr" using only the operator oid and arguments > in get_oper_expr(). > > If so, the caller to estimate, for example eqsel_internal(), > scalarineqsel_wrapper() > and so on, seems to be able to know the "OpExpr" information, which are > operator > oid and arguments, and used extended statistics easily to show for EXPLAIN. > > # Memo: the call path of the estimation function > caller to estimate selectivity > (eqsel_internal()/scalargtjoinsel_wrappter()/...) > -> get_restriction_variable()/get_join_valiables() > -> examine_variable() > > > > >>> ERROR: unrecognized node type: 268 > > > > > > Regarding the above error, do "applied_stats" need have the list of > "StatisticExtInfo" > > > because it's enough to have the list of Oid(stat->statOid) for EXPLAIN > output in the > > current patch? > > > change_to_applied_stats_has_list_of_oids.diff is the change I assumed. > > > Do you have any plan to show extra information for example "kind" of > > "StatisticExtInfo"? > > > > > > The above is just one idea came up with while I read the following > > > comments of header of pathnodes.h, and to support copy > "StatisticExtInfo" will leads > > many other nodes to support copy. > > > * We don't support copying RelOptInfo, IndexOptInfo, or Path nodes. > > > * There are some subsidiary structs that are useful to copy, though. > > > > > > > I do think tracking just the OID would work, because we already know how > to copy List > > objects. But if we want to also track the clauses, we'd have to keep > multiple lists, right? > > That seems a bit inconvenient. > > Understood. In future, we might show not only the applied_clauses but also > the clauses of > its extended statistics (StatisticExtInfo->exprs). > > > > > By the way, I found curios result while I tested with the above patch. > It shows same > > "Ext Stats" twice. > > > I think it's expected behavior because the stat is used when estimate > the cost of > > "Partial HashAggregate" and "Group". > > > I've shared the result because I could not understand soon when I saw > > > it first time. I think it's better to let users understand when the > stats are used, but I > > don't have any idea now. > > > > > > -- I tested with the example of CREATE STATISTICS documentation. > > > psql=# EXPLAIN (STATS, ANALYZE) SELECT date_trunc('month', a), > date_trunc('day', > > a) FROM t3 GROUP BY 1, 2; > > > QUERY > PLAN > > > ---------------------------------------------------------------------- > > > ---------------------------------------------------------------------- > > > - Group (cost=9530.56..9576.18 rows=365 width=16) (actual > > > time=286.908..287.909 rows=366 loops=1) > > > Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, > a)) > > > -> Gather Merge (cost=9530.56..9572.53 rows=365 width=16) (actual > > time=286.904..287.822 rows=498 loops=1) > > > Workers Planned: 1 > > > Workers Launched: 1 > > > -> Sort (cost=8530.55..8531.46 rows=365 width=16) (actual > > time=282.905..282.919 rows=249 loops=2) > > > Sort Key: (date_trunc('month'::text, a)), > (date_trunc('day'::text, a)) > > > Sort Method: quicksort Memory: 32kB > > > Worker 0: Sort Method: quicksort Memory: 32kB > > > -> Partial HashAggregate (cost=8509.54..8515.02 > rows=365 > > width=16) (actual time=282.716..282.768 rows=249 loops=2) > > > Group Key: date_trunc('month'::text, a), > date_trunc('day'::text, > > a) > > > Batches: 1 Memory Usage: 45kB > > > Worker 0: Batches: 1 Memory Usage: 45kB > > > -> Parallel Seq Scan on t3 (cost=0.00..6963.66 > rows=309177 > > width=16) (actual time=0.021..171.214 rows=262800 loops=2) > > > Ext Stats: public.s3 Clauses: > date_trunc('month'::text, > > a), date_trunc('day'::text, a) -- here > > > Ext Stats: public.s3 Clauses: > > > date_trunc('month'::text, a), date_trunc('day'::text, a) -- here > > > Planning Time: 114327.206 ms Execution Time: 288.007 ms > > > (18 rows) > > > > > > > I haven't looked into this, but my guess would be this is somehow > related to the > > parallelism - there's one parallel worker, which means we have 2 > processes to report > > stats for (leader + worker). And you get two copies of the "Ext Stats" > line. Could be a > > coincidence, ofc, but maybe there's a loop to print some worker info, > and you print the > > statistics info in it? > > I think yes and no. In the above case, it relates to parallelism, but it > doesn't print the > information per each worker. > > -- Make the number of workers is 5 and EXPLAIN without ANALYZE option. > -- But "Ext Stats" is printed only twice. > =# EXPLAIN (STATS) SELECT date_trunc('month', a), date_trunc('day', a) > FROM t3 GROUP BY 1, 2; > QUERY PLAN > > > ------------------------------------------------------------------------------------------------------------------- > Group (cost=4449.49..4489.50 rows=365 width=16) > Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a)) > -> Gather Merge (cost=4449.49..4478.55 rows=1825 width=16) > Workers Planned: 5 > -> Sort (cost=4449.41..4450.32 rows=365 width=16) > Sort Key: (date_trunc('month'::text, a)), > (date_trunc('day'::text, a)) > -> Partial HashAggregate (cost=4428.40..4433.88 rows=365 > width=16) > Group Key: date_trunc('month'::text, a), > date_trunc('day'::text, a) > -> Parallel Seq Scan on t3 (cost=0.00..3902.80 > rows=105120 width=16) > Ext Stats: public.s3 Clauses: > date_trunc('month'::text, a), date_trunc('day'::text, a) > Ext Stats: public.s3 Clauses: > date_trunc('month'::text, a), date_trunc('day'::text, a) > (11 rows) > > When creating a group path, it creates partial grouping paths if possible, > and then > creates the final grouping path. At this time, both the partial grouping > path and > the final grouping path use the same RelOptInfo to repeatedly use the > extended > statistics to know how many groups there will be. That's why it outputs > only twice. > There may be other similar calculation for partial paths. > > # The call path of the above query > create_grouping_paths > create_ordinary_grouping_paths > create_partial_grouping_paths > get_number_of_groups > estimate_num_groups > estimate_multivariate_ndistinct -- first time to estimate the > number of groups for partial grouping path > get_number_of_groups > estimate_num_groups > estimate_multivariate_ndistinct -- second time to estimate the > number of groups for final grouping path > > > Regards, > -- > Masahiro Ikeda > NTT DATA CORPORATION >
0001-Add-a-new-option-STATS-to-Explain-command_r2.patch
Description: Binary data