On Thu, Aug 3, 2023 at 7:29 PM David Rowley <dgrowle...@gmail.com> wrote:

> Thanks for having a look at this.
>
> On Thu, 3 Aug 2023 at 18:49, Andy Fan <zhihui.fan1...@gmail.com> wrote:
> > 1. ORDER BY or PARTITION BY
> >
> > select *, count(two) over (order by unique1) from tenk1 limit 1;
> > DEBUG:  startup_tuples = 1
> > DEBUG:  startup_tuples = 1
> >
> > select *, count(two) over (partition by unique1) from tenk1 limit 1;
> > DEBUG:  startup_tuples = 1
> > DEBUG:  startup_tuples = 1
> >
> > Due to the Executor of nodeWindowAgg, we have to fetch the next tuple
> > until it mismatches with the current one, then we can calculate the
> > WindowAgg function. In the current patch, we didn't count the
> > mismatched tuple. I verified my thought with 'break at IndexNext'
> > function and see IndexNext is called twice, so in the above case the
> > startup_tuples should be 2?
>
> You're probably right here. I'd considered that it wasn't that
> critical and aimed to attempt to keep the estimate close to the number
> of rows that'll be aggregated. I think that's probably not the best
> thing to do as if you consider the EXCLUDE options, those just exclude
> tuples from aggregation, it does not mean we read fewer tuples from
> the subnode. I've updated the patch accordingly.
>

Thanks.


>
> > 2. ORDER BY and PARTITION BY
> >
> > select two, hundred,
> > count(two) over (partition by ten order by hundred)
> > from tenk1 limit 1;
> >
> > DEBUG:  startup_tuples = 10
> >  two | hundred | count
> > -----+---------+-------
> >    0 |       0 |   100
> >
> > If we consider the mismatched tuples, it should be 101?
>
> I don't really see how we could do better with the current level of
> statistics.  The stats don't know that there are only 10 distinct
> "hundred" values for rows which have ten=1. All we have is n_distinct
> on tenk1.hundred, which is 100.


Yes,  actually I didn't figure it out before / after my posting.

>


> > 3. As we can see the log for startup_tuples is logged twice sometimes,
> > the reason is because it is used in cost_windowagg, so it is calculated
> > for every create_one_window_path. I think the startup_tuples should be
> > independent with the physical path, maybe we can cache it somewhere to
> > save some planning cycles?
>
> I wondered about that too but I ended up writing off the idea of
> caching because the input_tuple count comes from the Path and the
> extra calls are coming from other Paths, which could well have some
> completely different value for input_tuples.
>
>
Looks reasonable.

I have checked the updated patch and LGTM.

-- 
Best Regards
Andy Fan

Reply via email to