RE: RE: Re: [DISCUSS] Implementing a table-level statistics file to store column statistics

2024-09-04 Thread Guy Khazma
Hi Everyone, Apologies if you get this email twice; for some reason I cannot see the email I sent earlier on the mailing list, so I am sending it again. I wanted to revisit the discussion about using partition stats for min/max and null counts. It seems we might need to compute the null count at

RE: Re: [DISCUSS] Implementing a table-level statistics file to store column statistics

2024-08-22 Thread Guy Khazma
Hi All, Thanks for the discussion. Now that @karuppayya’s PR has been merged and collects the NDV stats at the table level, I would like to revisit the partition stats vs. table stats discussion and raise a few points: 1. The current action collects the NDV stats on a table l

Re: [DISCUSS] Implementing a table-level statistics file to store column statistics

2024-08-07 Thread Steven Wu
I also like the middle ground of partition-level stats, which also make incremental refresh easier (at the partition level). If the roll-up of partition-level stats turns out to be slow, I don't mind adding table-level stats aggregated from partition-level stats. Having partition level stats

Re: [DISCUSS] Implementing a table-level statistics file to store column statistics

2024-08-07 Thread Manish Malhotra
First of all, thanks a lot, Huaxin, for starting an important proposal and thread! A lot of important points have already been discussed. My thoughts were also tilting towards partition-level stats, which Piotr, Alex, Anton and a few others have mentioned as well. IMO, partition level stats mi

Re: [DISCUSS] Implementing a table-level statistics file to store column statistics

2024-08-07 Thread Piotr Findeisen
Hi All, Thank you for the interesting discussion so far, and the many viewpoints shared! > Not all tables have partition definition and table-level stats would benefit these tables Agreed that tables do not always have partitions. Current partition stats are appropriate for partitioned tables only mainly

Re: [DISCUSS] Implementing a table-level statistics file to store column statistics

2024-08-06 Thread huaxin gao
Thanks Alexander, Xianjin, Gang and Anton for your valuable insights! Regarding deriving min/max values on the fly, I currently don't have a good algorithm. I rely on iterating through FileScanTask objects in memory to aggregate results, which leads me to favor precalculating min/max values. I h

Re: [DISCUSS] Implementing a table-level statistics file to store column statistics

2024-08-06 Thread Anton Okolnychyi
I'd like to entertain the idea of deriving min/max values on the fly to understand our baseline. What will the algorithm for that look like? I assume the naive approach will be to keep min/max stats for selected columns while planning (as opposed to discarding them upon filtering) and then iterate
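
A minimal sketch of the naive aggregation described above, assuming each planned file exposes per-column lower bounds, upper bounds, and null counts. The `FileStats` class and `aggregate_column_stats` function are hypothetical names used only for illustration and are not Iceberg APIs; real bounds are stored in serialized form and would need decoding first.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class FileStats:
    """Hypothetical per-file column stats (not an Iceberg class)."""
    lower: Dict[str, int]
    upper: Dict[str, int]
    null_counts: Dict[str, int]


def aggregate_column_stats(
    files: Iterable[FileStats], column: str
) -> Tuple[Optional[int], Optional[int], int]:
    """Fold per-file bounds into one (min, max, null_count) for a column."""
    lo: Optional[int] = None
    hi: Optional[int] = None
    nulls = 0
    for f in files:
        # A file may lack bounds for a column (for example, all values null).
        if column in f.lower:
            lo = f.lower[column] if lo is None else min(lo, f.lower[column])
        if column in f.upper:
            hi = f.upper[column] if hi is None else max(hi, f.upper[column])
        nulls += f.null_counts.get(column, 0)
    return lo, hi, nulls


files = [
    FileStats({"id": 5}, {"id": 9}, {"id": 1}),
    FileStats({"id": 2}, {"id": 7}, {"id": 0}),
    FileStats({}, {}, {"id": 3}),
]
print(aggregate_column_stats(files, "id"))
```

The cost of this approach grows with the number of files visited during planning, which is the baseline the question above is trying to establish.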

Re: [DISCUSS] Implementing a table-level statistics file to store column statistics

2024-08-06 Thread Gang Wu
Just giving my two cents. Not all tables have partition definition and table-level stats would benefit these tables. In addition, NDV might not be easily populated from partition-level statistics. Thanks, Gang On Tue, Aug 6, 2024 at 9:48 PM Xianjin YE wrote: > Thanks for raising the discussion Hu
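
To illustrate the NDV point, here is a small sketch with made-up partition values. Min, max, and null counts combine cleanly across partitions, but distinct counts do not, because the same value can appear in more than one partition. The data and the function name `rollup_ndv` are assumptions for illustration only.

```python
from typing import List, Optional, Tuple


def rollup_ndv(partitions: List[List[Optional[int]]]) -> Tuple[int, int]:
    """Return (sum of per-partition NDVs, true table-level NDV)."""
    # Nulls are excluded from distinct-value counts.
    non_null = [[v for v in p if v is not None] for p in partitions]
    summed_ndv = sum(len(set(p)) for p in non_null)
    true_ndv = len(set().union(*non_null))
    return summed_ndv, true_ndv


# Hypothetical column values, one list per partition.
partitions = [[1, 2, 2, None], [2, 3], [3, 3, 4]]
summed, true = rollup_ndv(partitions)
print(summed, true)  # summed NDV overcounts values shared across partitions
```

Mergeable sketches (for example, theta sketches) are a common way around this, since sketches built per partition can be unioned into a table-level estimate.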

Re: [DISCUSS] Implementing a table-level statistics file to store column statistics

2024-08-06 Thread Xianjin YE
Thanks for raising the discussion, Huaxin. I also think partition-level statistics file(s) are more useful and have an advantage over table-level stats. For instance: 1. It would be straightforward to support incremental stats computing for large tables: by recalculating new or updated partitions on

Re: [DISCUSS] Implementing a table-level statistics file to store column statistics

2024-08-05 Thread Alexander Jo
Thanks for starting this thread, Huaxin. The existing statistics, on a per-data-file basis, are definitely too granular for use in planning/analysis-time query optimizations. This has worked so far, as tables have been relatively small, but from what I've seen in the Trino community it is starting to b

Re: [DISCUSS] Implementing a table-level statistics file to store column statistics

2024-08-02 Thread huaxin gao
Thanks, Samrose and Piotr, for the discussion! This issue is not addressed by the partition statistics feature. What we need are table level stats. Given that our primary goal in collecting statistics is for performance optimization, I believe it's not a good approach to derive these statistics at

Re: [DISCUSS] Implementing a table-level statistics file to store column statistics

2024-08-02 Thread Piotr Findeisen
Hi, First of all, thank you Huaxin for raising this topic. It's important for Spark, but also for Trino. Min, max, and null counts can be derived from manifests. I am not saying that a query engine should derive them from manifests at query time, but it definitely can. If we want to pull min, max

Re: [DISCUSS] Implementing a table-level statistics file to store column statistics

2024-08-02 Thread Samrose Ahmed
Isn't this addressed by the partition statistics feature, or do you want to have one row for the entire table? On Fri, Aug 2, 2024, 10:47 AM huaxin gao wrote: > I would like to initiate a discussion on implementing a table-level > statistics file to store column statistics, specifically min, max