Re: [PROPOSAL] Add manifest-level statistics for CBO estimation

2024-10-24 Thread Szehon Ho
Hi, I'm just wondering: would one solution be to put these stats in Puffin files? There's already ComputeTableStatsSparkAction (and probably similar actions in other engines), and I can imagine a quick metadata aggregation job to compute min/max/null_values, etc. Also, how accurate would the stats need to be? T

Re: [PROPOSAL] Add manifest-level statistics for CBO estimation

2024-10-18 Thread Xingyuan Lin
Thanks, Anton, for the review and feedback. I've shared more context below. Good to learn about the potential manifest structure change in V4. I guess this proposal is more helpful in terms of stating the problem of large-scale manifest processing. I think we can think of ways to improve that

Re: [PROPOSAL] Add manifest-level statistics for CBO estimation

2024-10-16 Thread Anton Okolnychyi
Does the doc suggest it is too expensive to aggregate min/max stats after planning files (i.e. after loading matching files in memory)? Do we have any benchmarks to refer to? We will have to read manifests for planning anyway, right? Also, the doc proposes to add column level stats to the manifest
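
For readers following the thread, the aggregation being discussed (folding per-file min/max/null counts into table-level column stats once the matching files are in memory) might look roughly like the sketch below. This is only an illustration, not Iceberg's API: `ColumnStats`, `aggregate_column_stats`, and the dict-per-file layout are hypothetical names, and bounds are assumed to be plain comparable values, with `None` meaning the file has no non-null values for that column.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass
class ColumnStats:
    """Hypothetical per-column stats, as one data file might report them."""
    lower: Optional[int]   # None is assumed to mean "no non-null values"
    upper: Optional[int]
    null_count: int


def _merge_bound(a: Optional[int], b: Optional[int],
                 pick: Callable[[int, int], int]) -> Optional[int]:
    # A missing bound on one side does not constrain the result.
    if a is None:
        return b
    if b is None:
        return a
    return pick(a, b)


def aggregate_column_stats(
    files: Iterable[Dict[str, ColumnStats]],
) -> Dict[str, ColumnStats]:
    """Fold per-file column stats into one ColumnStats per column."""
    merged: Dict[str, ColumnStats] = {}
    for file_stats in files:
        for column, stats in file_stats.items():
            current = merged.get(column)
            if current is None:
                # Copy so the caller's objects are never mutated.
                merged[column] = ColumnStats(
                    stats.lower, stats.upper, stats.null_count)
                continue
            current.lower = _merge_bound(current.lower, stats.lower, min)
            current.upper = _merge_bound(current.upper, stats.upper, max)
            current.null_count += stats.null_count
    return merged


if __name__ == "__main__":
    planned_files = [
        {"id": ColumnStats(1, 10, 0)},
        {"id": ColumnStats(5, 42, 3), "ts": ColumnStats(None, None, 7)},
    ]
    for name, stats in aggregate_column_stats(planned_files).items():
        print(name, stats)
```

The fold itself is a single pass over the planned files, so under these assumptions the cost is driven mainly by reading the manifests, which is the part the benchmark question in this thread is about.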