Hi Everyone,
Apologies if you receive this email twice; for some reason I am unable to
see the email I sent earlier in the mailing list archive, so I am sending
it again.
I wanted to revisit the discussion about using partition stats for min/max
and null counts. It seems we might need to compute the null count at…
Hi All,
Thanks for the discussion.
Since @karuppayya's PR recently got merged, and it collects the NDV stats
on a table level, I would like to revisit the partition stats vs. table
stats discussion and raise a few points:
1. The current action collects the NDV stats on a table level…
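For readers following along, the merged action can be invoked roughly like
this. This is a sketch, assuming the ComputeTableStats action API in recent
Iceberg Spark runtimes; the column names are made up:

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.ComputeTableStats;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

class ComputeStatsExample {
  // Sketch: invoke the table-level NDV stats action discussed above.
  // "id" and "category" are placeholder column names.
  static void computeStats(SparkSession spark, Table table) {
    ComputeTableStats.Result result =
        SparkActions.get(spark)
            .computeTableStats(table)
            .columns("id", "category")
            .execute();
    System.out.println("Wrote stats file: " + result.statisticsFile().path());
  }
}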
I also like the middle ground of partition-level stats, which also makes
incremental refresh easier (at the partition level). If the roll-up of
partition-level stats turns out to be slow, I don't mind adding table-level
stats aggregated from partition-level stats. Having partition-level stats…
First of all, thanks a lot Huaxin for starting an important proposal and
thread!
A lot of important points have already been discussed.
My thoughts were also tilting towards partition-level stats, which Piotr,
Alex, Anton, and a few others have mentioned as well.
IMO, partition-level stats…
Hi All,
Thank you for the interesting discussion so far, and the many viewpoints
shared!
> Not all tables have a partition definition and table-level stats would
> benefit these tables
Agreed that tables do not always have partitions.
Current partition stats are appropriate for partitioned tables only, mainly…
Thanks Alexander, Xianjin, Gang and Anton for your valuable insights!
Regarding deriving min/max values on the fly, I currently don't have a good
algorithm. I rely on iterating through FileScanTask objects in memory to
aggregate results, which leads me to favor pre-calculating min/max values…
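For concreteness, the in-memory aggregation described above looks roughly
like this. A sketch, not the actual PR code; it glosses over missing
metrics, and note that string bounds may be truncated, so the derived max
is a valid upper bound rather than an exact value:

import java.nio.ByteBuffer;
import java.util.Comparator;
import java.util.Map;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.types.Comparators;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Types;

class MinMaxAggregation {
  // Sketch: derive a table-wide min/max and null count for one column by
  // iterating FileScanTasks and folding in each file's column metrics.
  static void aggregate(Table table, String columnName) throws java.io.IOException {
    Types.NestedField field = table.schema().findField(columnName);
    int id = field.fieldId();
    Comparator<Object> cmp = Comparators.forType(field.type().asPrimitiveType());

    Object min = null;
    Object max = null;
    long nulls = 0;

    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        DataFile file = task.file();
        Map<Integer, ByteBuffer> lowers = file.lowerBounds();
        Map<Integer, ByteBuffer> uppers = file.upperBounds();
        if (lowers != null && lowers.get(id) != null) {
          Object v = Conversions.fromByteBuffer(field.type(), lowers.get(id));
          min = (min == null || cmp.compare(v, min) < 0) ? v : min;
        }
        if (uppers != null && uppers.get(id) != null) {
          Object v = Conversions.fromByteBuffer(field.type(), uppers.get(id));
          max = (max == null || cmp.compare(v, max) > 0) ? v : max;
        }
        Map<Integer, Long> nullCounts = file.nullValueCounts();
        if (nullCounts != null && nullCounts.get(id) != null) {
          nulls += nullCounts.get(id);
        }
      }
    }
    System.out.printf("%s: min=%s max=%s nulls=%d%n", columnName, min, max, nulls);
  }
}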
I'd like to entertain the idea of deriving min/max values on the fly to
understand our baseline. What would the algorithm for that look like? I
assume the naive approach would be to keep min/max stats for selected
columns while planning (as opposed to discarding them upon filtering) and
then iterate…
Just giving my two cents: not all tables have a partition definition, and
table-level stats would benefit these tables. In addition, NDV might not be
easily populated from partition-level statistics (a toy illustration of why
follows below).
Thanks,
Gang
On Tue, Aug 6, 2024 at 9:48 PM Xianjin YE wrote:
> Thanks for raising the discussion Huaxin…
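To make Gang's NDV point concrete, a toy example, not Iceberg code; it
assumes the datasketches-java 3.x API, the library behind the theta
sketches that Iceberg's Puffin files store for NDV. Per-partition NDV
counts don't add up, but the underlying sketches do merge:

import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Union;
import org.apache.datasketches.theta.UpdateSketch;

public class NdvRollup {
  public static void main(String[] args) {
    // Two partitions with 100 distinct keys each, 50 of them shared.
    UpdateSketch p1 = UpdateSketch.builder().build();
    UpdateSketch p2 = UpdateSketch.builder().build();
    for (int i = 0; i < 100; i++) p1.update("key-" + i);   // keys 0..99
    for (int i = 50; i < 150; i++) p2.update("key-" + i);  // keys 50..149

    // Summing per-partition NDVs overcounts: ~200 instead of the true 150.
    double naive = p1.getEstimate() + p2.getEstimate();

    // Merging the sketches gives the right answer (~150).
    Union union = SetOperation.builder().buildUnion();
    union.union(p1);
    union.union(p2);
    double merged = union.getResult().getEstimate();

    System.out.printf("naive sum = %.0f, merged = %.0f%n", naive, merged);
  }
}

So if partition-level stats stored the sketches themselves rather than the
final NDV numbers, a table-level NDV could still be rolled up from them.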
Thanks for raising the discussion Huaxin.
I also think partition-level statistics file(s) are more useful and have
advantages over table-level stats. For instance:
1. It would be straightforward to support incremental stats computation for
large tables, by recalculating only new or updated partitions on…
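A rough sketch of that incremental idea, assuming a recent Iceberg Java
API: walk the snapshots committed since the last stats refresh, collect the
partitions they touched, and recompute only those. The
recomputePartitionStats helper is hypothetical, not an Iceberg API:

import org.apache.iceberg.DataFile;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.Table;
import org.apache.iceberg.util.SnapshotUtil;
import org.apache.iceberg.util.StructLikeSet;

class IncrementalStatsRefresh {
  static void refresh(Table table, long lastStatsSnapshotId) {
    // Partition tuples need struct-aware equality, hence StructLikeSet.
    StructLikeSet touched = StructLikeSet.create(table.spec().partitionType());
    for (Snapshot snap : SnapshotUtil.currentAncestors(table)) {
      if (snap.snapshotId() == lastStatsSnapshotId) {
        break; // everything older is already covered by the last refresh
      }
      for (DataFile file : snap.addedDataFiles(table.io())) {
        touched.add(file.partition());
      }
      for (DataFile file : snap.removedDataFiles(table.io())) {
        touched.add(file.partition());
      }
    }
    for (StructLike partition : touched) {
      recomputePartitionStats(table, partition); // hypothetical helper
    }
  }

  // Hypothetical: re-scan one partition's files and rewrite its stats entry.
  static void recomputePartitionStats(Table table, StructLike partition) {
    // see the min/max aggregation sketch earlier in the thread
  }
}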
Thanks for starting this thread Huaxin,
The existing statistics, kept on a per-data-file basis, are definitely too
granular for use in planning/analysis-time query optimizations.
It's worked so far, as tables have been relatively small, but from what
I've seen in the Trino community it is starting to…
Thanks, Samrose and Piotr, for the discussion! This issue is not addressed
by the partition statistics feature. What we need are table-level stats.
Given that our primary goal in collecting statistics is performance
optimization, I believe it's not a good approach to derive these statistics
at…
Hi,
First of all, thank you Huaxin for raising this topic. It's important for
Spark, but also for Trino.
Min, max, and null counts can be derived from manifests.
I am not saying that a query engine should derive them from manifests at
query time, but it definitely can.
If we want to pull min, max…
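To make the "it definitely can" point concrete, a sketch of reading
per-file bounds straight out of the manifests, with no scan planning
involved (recent Iceberg Java API assumed):

import java.io.IOException;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.ManifestFiles;
import org.apache.iceberg.ManifestReader;
import org.apache.iceberg.Table;

class ManifestBounds {
  // Sketch: dump each data file's lower/upper bounds and null counts,
  // read directly from the current snapshot's data manifests.
  static void dump(Table table) throws IOException {
    for (ManifestFile manifest : table.currentSnapshot().dataManifests(table.io())) {
      try (ManifestReader<DataFile> reader = ManifestFiles.read(manifest, table.io())) {
        for (DataFile file : reader) {
          System.out.printf("%s lower=%s upper=%s nulls=%s%n",
              file.path(), file.lowerBounds(), file.upperBounds(),
              file.nullValueCounts());
        }
      }
    }
  }
}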
Isn't this addressed by the partition statistics feature, or do you want to
have one row for the entire table?
On Fri, Aug 2, 2024, 10:47 AM huaxin gao wrote:
> I would like to initiate a discussion on implementing a table-level
> statistics file to store column statistics, specifically min, max…