Excellent proposal! We've internally augmented both table-level and partition-level ColumnStatistics and observed a performance gain of more than 30% in Spark and Trino query execution, largely because Cost-Based Optimization (CBO) became more effective. However, building this on the v3 format presented numerous challenges (such as column-type evolution and how to store min/max values). We believe adopting the v4 format would be a more robust solution.
I've researched this extensively and applied it in production. I'd be glad to collaborate on implementing this feature if needed.

Best wishes.

On Thu, Aug 28, 2025 at 21:23, Gábor Kaszab <gaborkas...@apache.org> wrote:
>
> Hey Iceberg Community,
>
> I've been working on a proposal to extend the currently standardized
> statistics in Iceberg, by looking into what statistics are used by some query
> engines and trying to fill the gaps (credit also goes to Denys K to lay
> groundwork). The motivation is to use Iceberg for the source of truth when it
> comes to statistics across all the engines.
> Meanwhile, there have been movements on other proposals (Restructuring
> col-stats, Restructuring metadata) that might overlap with mine. Let's see
> how much of my proposal still holds up in light of these developments.
>
> Any feedback is appreciated!
> Gabor