On Thu, Aug 3, 2023 at 7:29 PM David Rowley <dgrowle...@gmail.com> wrote:
> Thanks for having a look at this. > > On Thu, 3 Aug 2023 at 18:49, Andy Fan <zhihui.fan1...@gmail.com> wrote: > > 1. ORDER BY or PARTITION BY > > > > select *, count(two) over (order by unique1) from tenk1 limit 1; > > DEBUG: startup_tuples = 1 > > DEBUG: startup_tuples = 1 > > > > select *, count(two) over (partition by unique1) from tenk1 limit 1; > > DEBUG: startup_tuples = 1 > > DEBUG: startup_tuples = 1 > > > > Due to the Executor of nodeWindowAgg, we have to fetch the next tuple > > until it mismatches with the current one, then we can calculate the > > WindowAgg function. In the current patch, we didn't count the > > mismatched tuple. I verified my thought with 'break at IndexNext' > > function and see IndexNext is called twice, so in the above case the > > startup_tuples should be 2? > > You're probably right here. I'd considered that it wasn't that > critical and aimed to attempt to keep the estimate close to the number > of rows that'll be aggregated. I think that's probably not the best > thing to do as if you consider the EXCLUDE options, those just exclude > tuples from aggregation, it does not mean we read fewer tuples from > the subnode. I've updated the patch accordingly. > Thanks. > > > 2. ORDER BY and PARTITION BY > > > > select two, hundred, > > count(two) over (partition by ten order by hundred) > > from tenk1 limit 1; > > > > DEBUG: startup_tuples = 10 > > two | hundred | count > > -----+---------+------- > > 0 | 0 | 100 > > > > If we consider the mismatched tuples, it should be 101? > > I don't really see how we could do better with the current level of > statistics. The stats don't know that there are only 10 distinct > "hundred" values for rows which have ten=1. All we have is n_distinct > on tenk1.hundred, which is 100. Yes, actually I didn't figure it out before / after my posting. > > > 3. As we can see the log for startup_tuples is logged twice sometimes, > > the reason is because it is used in cost_windowagg, so it is calculated > > for every create_one_window_path. I think the startup_tuples should be > > independent with the physical path, maybe we can cache it somewhere to > > save some planning cycles? > > I wondered about that too but I ended up writing off the idea of > caching because the input_tuple count comes from the Path and the > extra calls are coming from other Paths, which could well have some > completely different value for input_tuples. > > Looks reasonable. I have checked the updated patch and LGTM. -- Best Regards Andy Fan