Re: Showing applied extended statistics in explain Part 2

Tatsuro Yamada Thu, 31 Oct 2024 22:23:23 -0700

Hi All,

I apologize for not being able to continue development due to various
circumstances.
The attached file is the rebased patch.
I will now catch up on the discussion and try to revise the patch.


Regards,
Tatsuro Yamada


On Fri, Jul 19, 2024 at 7:17 PM <masahiro.ik...@nttdata.com> wrote:

> > On 7/18/24 12:37, masahiro.ik...@nttdata.com wrote:
> > >> Let me share my opinion on those questions ...
> > > ...>
> > >> For ndistinct, I think we don't show this because it doesn't go
> > >> through clauselist_selectivity, which is the only thing I modified in
> the PoC.
> > >> But I guess we might improve estimate_num_groups() to track the stats
> > >> in a similar way, I guess.
> > >
> > > Thanks. IIUC, the reason is that it doesn't go through
> > > statext_clauselist_selectivity() because the number of clauses is one
> though it goes
> > through clauselist_selectivity().
> > >
> >
> > Ah, I see I misunderstood the original report. The query used was
> >
> >   EXPLAIN (STATS, ANALYZE) SELECT * FROM t3
> >     WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
> >
> > And it has nothing to do with the number of clauses being one neither.
> >
> > The problem is this estimate is handled by examine_variable() matching
> the expression
> > to the "expression" stats, and injecting it into the variable, so that
> the
> > clauselist_selectivity() sees these stats.
> >
> > This would happen even if you build just expression statistics on each
> of the
> > date_trunc() calls, and then tried a query with two clauses:
> >
> >   CREATE STATISTICS s4 ON date_trunc('day', a) FROM t3;
> >   CREATE STATISTICS s3 ON date_trunc('month', a) FROM t3;
> >
> >   EXPLAIN SELECT * FROM t3
> >     WHERE date_trunc('month', a) = '2020-01-01'::timestamp
> >       AND date_trunc('day', 'a') = '2020-01-01'::timestamp;
> >
> > Not sure how to handle this - we could remember when explain_variable()
> injects
> > statistics like this, I guess. But do we know that each call to
> > examine_variable() is for estimation? And do we know for which clause?
>
> I see. The issue is related to extended statistics for single expression.
> As a
> first step, it's ok for me that we don't support it.
>
> The below is just an idea to know clauses...
> Although I'm missing something, can callers of examine_variable()
> for estimation to rebuild the clauses from partial information of "OpExpr"?
>
> Only clause_selectivity_ext() knows the information of actual full clauses.
> But we don't need full information. It's enough to know the information
> to show "OpExpr" for EXPLAIN.
>
> get_oper_expr() deparse "OpExpr" using only the operator oid and arguments
> in get_oper_expr().
>
> If so, the caller to estimate, for example eqsel_internal(),
> scalarineqsel_wrapper()
> and so on, seems to be able to know the "OpExpr" information, which are
> operator
> oid and arguments, and used extended statistics easily to show for EXPLAIN.
>
> # Memo: the call path of the estimation function
>  caller to estimate selectivity
> (eqsel_internal()/scalargtjoinsel_wrappter()/...)
>   -> get_restriction_variable()/get_join_valiables()
>    -> examine_variable()
>
>
> > >>> ERROR:  unrecognized node type: 268
> > >
> > > Regarding the above error, do "applied_stats" need have the list of
> "StatisticExtInfo"
> > > because it's enough to have the list of Oid(stat->statOid) for EXPLAIN
> output in the
> > current patch?
> > > change_to_applied_stats_has_list_of_oids.diff is the change I assumed.
> > > Do you have any plan to show extra information for example "kind" of
> > "StatisticExtInfo"?
> > >
> > > The above is just one idea came up with while I read the following
> > > comments of header of pathnodes.h, and to support copy
> "StatisticExtInfo" will leads
> > many other nodes to support copy.
> > >   * We don't support copying RelOptInfo, IndexOptInfo, or Path nodes.
> > >   * There are some subsidiary structs that are useful to copy, though.
> > >
> >
> > I do think tracking just the OID would work, because we already know how
> to copy List
> > objects. But if we want to also track the clauses, we'd have to keep
> multiple lists, right?
> > That seems a bit inconvenient.
>
> Understood. In future, we might show not only the applied_clauses but also
> the clauses of
> its extended statistics (StatisticExtInfo->exprs).
>
>
> > > By the way, I found curios result while I tested with the above patch.
> It shows same
> > "Ext Stats" twice.
> > > I think it's expected behavior because the stat is used when estimate
> the cost of
> > "Partial HashAggregate" and "Group".
> > > I've shared the result because I could not understand soon when I saw
> > > it first time. I think it's better to let users understand when the
> stats are used, but I
> > don't have any idea now.
> > >
> > > -- I tested with the example of CREATE STATISTICS documentation.
> > > psql=# EXPLAIN (STATS, ANALYZE) SELECT date_trunc('month', a),
> date_trunc('day',
> > a) FROM t3 GROUP BY 1, 2;
> > >                                                                  QUERY
> PLAN
> > > ----------------------------------------------------------------------
> > > ----------------------------------------------------------------------
> > > -  Group  (cost=9530.56..9576.18 rows=365 width=16) (actual
> > > time=286.908..287.909 rows=366 loops=1)
> > >    Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text,
> a))
> > >    ->  Gather Merge  (cost=9530.56..9572.53 rows=365 width=16) (actual
> > time=286.904..287.822 rows=498 loops=1)
> > >          Workers Planned: 1
> > >          Workers Launched: 1
> > >          ->  Sort  (cost=8530.55..8531.46 rows=365 width=16) (actual
> > time=282.905..282.919 rows=249 loops=2)
> > >                Sort Key: (date_trunc('month'::text, a)),
> (date_trunc('day'::text, a))
> > >                Sort Method: quicksort  Memory: 32kB
> > >                Worker 0:  Sort Method: quicksort  Memory: 32kB
> > >                ->  Partial HashAggregate  (cost=8509.54..8515.02
> rows=365
> > width=16) (actual time=282.716..282.768 rows=249 loops=2)
> > >                      Group Key: date_trunc('month'::text, a),
> date_trunc('day'::text,
> > a)
> > >                      Batches: 1  Memory Usage: 45kB
> > >                      Worker 0:  Batches: 1  Memory Usage: 45kB
> > >                      ->  Parallel Seq Scan on t3  (cost=0.00..6963.66
> rows=309177
> > width=16) (actual time=0.021..171.214 rows=262800 loops=2)
> > >                            Ext Stats: public.s3  Clauses:
> date_trunc('month'::text,
> > a), date_trunc('day'::text, a)  -- here
> > >                            Ext Stats: public.s3  Clauses:
> > > date_trunc('month'::text, a), date_trunc('day'::text, a)  -- here
> > > Planning Time: 114327.206 ms  Execution Time: 288.007 ms
> > > (18 rows)
> > >
> >
> > I haven't looked into this, but my guess would be this is somehow
> related to the
> > parallelism - there's one parallel worker, which means we have 2
> processes to report
> > stats for (leader + worker). And you get two copies of the "Ext Stats"
> line. Could be a
> > coincidence, ofc, but maybe there's a loop to print some worker info,
> and you print the
> > statistics info in it?
>
> I think yes and no. In the above case, it relates to parallelism, but it
> doesn't print the
> information per each worker.
>
> -- Make the number of workers is 5 and EXPLAIN without ANALYZE option.
> -- But "Ext Stats" is printed only twice.
> =# EXPLAIN (STATS) SELECT date_trunc('month', a), date_trunc('day', a)
> FROM t3 GROUP BY 1, 2;
>                                                     QUERY PLAN
>
>
> -------------------------------------------------------------------------------------------------------------------
>  Group  (cost=4449.49..4489.50 rows=365 width=16)
>    Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
>    ->  Gather Merge  (cost=4449.49..4478.55 rows=1825 width=16)
>          Workers Planned: 5
>          ->  Sort  (cost=4449.41..4450.32 rows=365 width=16)
>                Sort Key: (date_trunc('month'::text, a)),
> (date_trunc('day'::text, a))
>                ->  Partial HashAggregate  (cost=4428.40..4433.88 rows=365
> width=16)
>                      Group Key: date_trunc('month'::text, a),
> date_trunc('day'::text, a)
>                      ->  Parallel Seq Scan on t3  (cost=0.00..3902.80
> rows=105120 width=16)
>                            Ext Stats: public.s3  Clauses:
> date_trunc('month'::text, a), date_trunc('day'::text, a)
>                            Ext Stats: public.s3  Clauses:
> date_trunc('month'::text, a), date_trunc('day'::text, a)
> (11 rows)
>
> When creating a group path, it creates partial grouping paths if possible,
> and then
> creates the final grouping path. At this time, both the partial grouping
> path and
> the final grouping path use the same RelOptInfo to repeatedly use the
> extended
> statistics to know how many groups there will be. That's why it outputs
> only twice.
> There may be other similar calculation for partial paths.
>
> # The call path of the above query
>  create_grouping_paths
>   create_ordinary_grouping_paths
>    create_partial_grouping_paths
>     get_number_of_groups
>      estimate_num_groups
>       estimate_multivariate_ndistinct  -- first time to estimate the
> number of groups for partial grouping path
>    get_number_of_groups
>     estimate_num_groups
>      estimate_multivariate_ndistinct  -- second time to estimate the
> number of groups for final grouping path
>
>
> Regards,
> --
> Masahiro Ikeda
> NTT DATA CORPORATION
>

0001-Add-a-new-option-STATS-to-Explain-command_r2.patch
Description: Binary data

Re: Showing applied extended statistics in explain Part 2

Reply via email to