Hi, I'm just wondering: would a solution be to put these stats in Puffin
files? There's already ComputeTableStatsSparkAction (and probably similar
actions in other engines), and I can imagine a quick metadata aggregation
job to compute min/max, null_values, etc. Also, how accurate would we need
the stats to be? Computing them might be more involved when delete files
are present?

Thanks,
Szehon
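For reference, here is a minimal sketch of what that could look like from
Spark. The table identifier and column list are placeholders, and note
that the existing action computes NDV sketches into a Puffin file; the
min/max/null-count aggregation mentioned above would be an additional,
hypothetical job:

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.ComputeTableStats;
import org.apache.iceberg.spark.Spark3Util;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class ComputePuffinStats {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Placeholder identifier; any catalog-qualified Iceberg table works.
    Table table = Spark3Util.loadIcebergTable(spark, "catalog.db.fact_table");

    // Computes per-column NDV sketches for the current snapshot and
    // registers the resulting Puffin file in the table metadata.
    ComputeTableStats.Result result =
        SparkActions.get(spark)
            .computeTableStats(table)
            .columns("user_id", "event_time") // placeholder column names
            .execute();

    System.out.println("Puffin stats file: " + result.statisticsFile().path());
  }
}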
On Fri, Oct 18, 2024 at 12:04 PM Xingyuan Lin <linxingyuan1...@gmail.com>
wrote:

> Thanks Anton for the review and feedback. I've shared more context below.
> Good to learn about the potential manifest structure change in V4. I guess
> this proposal is most helpful in stating the problem of large-scale
> manifest processing; we can think of ways to improve that when designing
> the V4 spec.
>
> > Does the doc suggest it is too expensive to aggregate min/max stats
> > after planning files?
>
> Yes.
>
> > Do we have any benchmarks to refer to?
>
> Not really. But as mentioned in the doc, suppose we have a petabyte-scale
> time-series fact table whose daily data volume is ~2.5 TB, which
> translates to 5,000 data files if each file is 500 MB. If we join its past
> 3 months' data with a dimension table, the number of matching files comes
> to 5,000 * 90 = 450,000, which is a fairly large number (see the
> back-of-envelope sketch at the end of this message). Things can get even
> worse with a table of high dimensionality and a system under high
> concurrency.
>
> In fact, in production we have seen the Trino coordinator's memory largely
> occupied by Iceberg data files' stats (example heap dump analysis
> <https://drive.google.com/file/d/1GFCm6KWPSLoBGTwDfZDH1KzHMqTVV5dI/view?usp=sharing>),
> and query planning times becoming excessively long because of the
> cost-based optimization process. (Disclaimer: the heap dump analysis was
> done with an old Trino version, 372. Many improvements have been made
> since then; however, data file stats processing remains expensive.)
>
> > We will have to read manifests for planning anyway, right?
>
> In the case of Trino, it differentiates between query planning and query
> scheduling. During query planning, it decides how the query will be
> executed, including cost-based optimizations. During query scheduling, it
> reads Iceberg manifests and assigns data files to workers in an
> asynchronous, streaming manner.
>
> > Also, the doc proposes to add column level stats to the manifest list.
> > I remember Dan mentioned the idea to get rid of the manifest list in V4
> > and allow manifests to point to other manifests.
>
> Got it. We can consider adding top-level stats.
>
> > What if we have a selective predicate that drastically narrows down the
> > scope of the operation? In that case, the file stats will give us much
> > more precise information.
>
> It's fine if the filter operation's stats estimation is properly
> implemented.
>
> On Wed, Oct 16, 2024 at 4:08 PM Anton Okolnychyi <aokolnyc...@gmail.com>
> wrote:
>
>> Does the doc suggest it is too expensive to aggregate min/max stats
>> after planning files (i.e., after loading matching files in memory)? Do
>> we have any benchmarks to refer to? We will have to read manifests for
>> planning anyway, right?
>>
>> Also, the doc proposes to add column level stats to the manifest list. I
>> remember Dan mentioned the idea to get rid of the manifest list in V4 and
>> allow manifests to point to other manifests. While having lower and upper
>> bounds at the top level sounds promising, I am not sure we would want to
>> use that for CBO. What if we have a selective predicate that drastically
>> narrows down the scope of the operation? In that case, the file stats
>> will give us much more precise information.
>>
>> - Anton
>>
>> On Tue, Oct 15, 2024 at 5:01 PM Xingyuan Lin <linxingyuan1...@gmail.com>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> Here's a doc for [Proposal] Add manifest-level statistics for CBO
>>> estimation
>>> <https://docs.google.com/document/d/1NMsS4dg_AXh_abVfzx24VBOmLPIPaHzZRa2g5uUUHDI/edit?usp=sharing>.
>>> It proposes a more efficient derivation of stats for the CBO process.
>>> Original discussion thread
>>> <https://lists.apache.org/thread/jkt1g4vzjgjbtd5b5dwqs9dzo1ndrwkt>.
>>>
>>> Please feel free to take a look and comment.
>>>
>>> Thanks,
>>> Xingyuan
>>>
>>
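To make the scale argument in Xingyuan's reply above concrete, here is the
back-of-envelope sketch referenced there; every constant in it beyond the
figures quoted in the thread (column count, bytes of stats per column) is
an illustrative assumption, not a measured number:

// Back-of-envelope sketch of the planner-memory argument quoted above.
// All constants beyond the thread's own figures are assumptions.
public class StatsFootprint {
  public static void main(String[] args) {
    long dailyBytes = 2_500_000_000_000L;      // ~2.5 TB ingested per day
    long fileBytes = 500_000_000L;             // ~500 MB per data file
    long filesPerDay = dailyBytes / fileBytes; // = 5,000 files per day
    long matchingFiles = filesPerDay * 90;     // 3 months -> 450,000 files

    // Assume ~50 tracked columns and ~100 bytes of lower/upper-bound and
    // null-count metadata per column per file (both numbers are guesses).
    long columns = 50;
    long bytesPerColumnStat = 100;
    long statsBytes = matchingFiles * columns * bytesPerColumnStat;

    System.out.printf("matching files: %,d%n", matchingFiles);
    System.out.printf("per-file stats held during planning: ~%.2f GB%n",
        statsBytes / 1e9);
  }
}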