Thanks Anton for the review and feedback. I've shared more context below. Good to learn about the potential manifest structure change in V4. In that case, this proposal may be most helpful for stating the problem of large-scale manifest processing; we can think of ways to improve that when designing the V4 spec.
> Does the doc suggest it is too expensive to aggregate min/max stats after planning files?

Yes.

> Do we have any benchmarks to refer to?

Not really. But as mentioned in the doc, suppose we have a petabyte-scale time-series fact table whose daily data volume is ~2.5 TB, which translates to 5,000 data files if each file is 500 MB. If we join its past 3 months' data with a dimension table, the number of matching files comes to 5,000 * 90 = 450,000, which is a fairly large number. The table's high dimensionality and the system's high concurrency make things worse. In fact, in production we have seen the Trino coordinator's memory largely occupied by Iceberg data file stats (example heap dump analysis <https://drive.google.com/file/d/1GFCm6KWPSLoBGTwDfZDH1KzHMqTVV5dI/view?usp=sharing>), and query planning times becoming excessively long because of the cost-based optimization process. (Disclaimer: the heap dump analysis was done with an old Trino version, 372. Many improvements have been made since then, but processing data file stats is still expensive.) A quick back-of-the-envelope sketch of these numbers is included at the end of this message.

> We will have to read manifests for planning anyway, right?

In the case of Trino, it distinguishes query planning from query scheduling. During query planning, it decides how the query will be executed, including cost-based optimizations. During query scheduling, it reads Iceberg manifests and assigns data files to workers in an asynchronous, streaming manner.

> Also, the doc proposes to add column level stats to the manifest list. I remember Dan mentioned the idea to get rid of the manifest list in V4 and allow manifests to point to other manifests.

Got it. We can consider adding top-level stats.

> What if we have a selective predicate that drastically narrows down the scope of the operation? In that case, the file stats will give us much more precise information.

That is fine as long as the filter operation's stats estimation is properly implemented.

On Wed, Oct 16, 2024 at 4:08 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:

> Does the doc suggest it is too expensive to aggregate min/max stats after
> planning files (i.e. after loading matching files in memory)? Do we have
> any benchmarks to refer to? We will have to read manifests for planning
> anyway, right?
>
> Also, the doc proposes to add column level stats to the manifest list. I
> remember Dan mentioned the idea to get rid of the manifest list in V4 and
> allow manifests to point to other manifests. While having lower and upper
> bounds at the top-level sounds promising, I am not sure we would want to
> use that for CBO. What if we have a selective predicate that drastically
> narrows down the scope of the operation? In that case, the file stats will
> give us much more precise information.
>
> - Anton
>
> On Tue, Oct 15, 2024 at 17:01 Xingyuan Lin <linxingyuan1...@gmail.com>
> wrote:
>
>> Hi everyone,
>>
>> Here's a doc for [Proposal] Add manifest-level statistics for CBO
>> estimation
>> <https://docs.google.com/document/d/1NMsS4dg_AXh_abVfzx24VBOmLPIPaHzZRa2g5uUUHDI/edit?usp=sharing>.
>> It's for more efficient derivation of stats for the CBO process. Original
>> discussion thread
>> <https://lists.apache.org/thread/jkt1g4vzjgjbtd5b5dwqs9dzo1ndrwkt>.
>>
>> Please feel free to take a look and comment.
>>
>> Thanks,
>> Xingyuan
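P.S. Here is the back-of-the-envelope sketch referenced above (Python). The file counts come directly from the example in the doc; the per-file stats size of ~1 KB is an assumed illustrative figure, not a measured one.

    # Back-of-the-envelope estimate for the fact-table example above.
    daily_volume_tb = 2.5        # ~2.5 TB of new data per day (from the example)
    file_size_mb = 500           # target data file size
    days = 90                    # past 3 months of data joined

    files_per_day = daily_volume_tb * 1024 * 1024 / file_size_mb   # ~5,000 files/day
    matching_files = files_per_day * days                          # ~450,000 files

    # Assumption (illustrative only): each matching file contributes on the
    # order of 1 KB of per-column lower/upper bounds and counts held in
    # coordinator memory while CBO runs.
    stats_bytes_per_file = 1024
    total_stats_mb = matching_files * stats_bytes_per_file / (1024 * 1024)

    print(f"files per day:  {files_per_day:,.0f}")
    print(f"matching files: {matching_files:,.0f}")
    print(f"stats in memory (MB, single query): {total_stats_mb:,.0f}")

With those assumptions a single query would hold on the order of a few hundred MB of file stats on the coordinator, before accounting for concurrent queries.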