Thanks, Anton, for the review and feedback. I've shared more context below.
Good to learn about the potential manifest structure change in V4. In that
light, this proposal is perhaps most useful for stating the problem of
large-scale manifest processing; we can work out ways to address it when
designing the V4 spec.

> Does the doc suggest it is too expensive to aggregate min/max stats after
planning files?
Yes.

> Do we have any benchmarks to refer to?
Not really. But as mentioned in the doc, consider a petabyte-scale
time-series fact table that ingests ~2.5 TB per day, which translates to
about 5,000 data files at ~500 MB per file. If we join its past 3 months of
data with a dimension table, the number of matching files comes to
5,000 * 90 = 450,000, which is fairly large. High table dimensionality
(more columns means more per-file stats) and high query concurrency make
things worse.
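
For a rough sense of the scale, here is a back-of-envelope sketch in Java;
the per-file stats size is an illustrative assumption, not a measurement:

public class StatsFootprintEstimate {
    public static void main(String[] args) {
        long filesPerDay = 5_000;                 // ~2.5 TB/day at ~500 MB per file
        long days = 90;                           // 3 months of partitions scanned
        long matchingFiles = filesPerDay * days;  // 450,000 files

        // Hypothetical per-file overhead for lower/upper bounds, null counts,
        // value counts, etc. on a wide table -- an assumption, not a measurement.
        long assumedStatsBytesPerFile = 2 * 1024;
        long totalBytes = matchingFiles * assumedStatsBytesPerFile;

        System.out.printf("matching files: %,d%n", matchingFiles);
        System.out.printf("stats held during planning: ~%,d MB%n",
                totalBytes / (1024 * 1024));
    }
}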

In fact, in production we have seen the Trino coordinator's memory largely
occupied by Iceberg data file stats (example heap dump analysis
<https://drive.google.com/file/d/1GFCm6KWPSLoBGTwDfZDH1KzHMqTVV5dI/view?usp=sharing>),
and query planning times becoming excessively long because of the
cost-based optimization process. (Disclaimer: the heap dump analysis was
done with an old Trino version, 372. Many improvements have landed since
then; however, data file stats processing remains expensive.)

> We will have to read manifests for planning anyway, right?
In the case of Trino, query planning and query scheduling are separate
phases. During planning, Trino decides how the query will be executed,
including cost-based optimizations. During scheduling, it reads Iceberg
manifests and assigns data files to workers in an asynchronous, streaming
manner, so planning itself does not require reading all manifests up front.
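
A minimal sketch of that separation (the interfaces and names below are
hypothetical, not Trino's actual connector SPI): planning only consumes
aggregated estimates, while scheduling streams file assignments lazily.

import java.util.Iterator;

// Hypothetical interfaces illustrating the two phases; not real Trino SPI.
interface TableStatsProvider {
    // Planning: the CBO wants row counts / sizes / bounds up front,
    // ideally without touching every data file entry.
    TableStats estimateStats(String table, String filter);
}

interface SplitSource {
    // Scheduling: data files are read from manifests and handed to
    // workers incrementally, never fully materialized in memory.
    Iterator<DataFileSplit> streamSplits(String table, String filter);
}

record TableStats(long rowCount, long dataSizeBytes) {}

record DataFileSplit(String path, long start, long length) {}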

> Also, the doc proposes to add column level stats to the manifest list. I
remember Dan mentioned the idea to get rid of the manifest list in V4 and
allow manifests to point to other manifests.
Got it. We can consider adding top-level stats.
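
As a sketch of what such top-level aggregates could carry (the shape and
field names below are hypothetical, not part of any spec draft):

import java.nio.ByteBuffer;
import java.util.Map;

// Hypothetical per-column stats rolled up above the data-file level,
// wherever that ends up living in the V4 layout.
record AggregatedColumnStats(
        long totalRecordCount,
        Map<Integer, ByteBuffer> lowerBounds,   // field id -> serialized lower bound
        Map<Integer, ByteBuffer> upperBounds,   // field id -> serialized upper bound
        Map<Integer, Long> nullValueCounts) {}  // field id -> total null count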

> What if we have a selective predicate that drastically narrows down the
scope of the operation? In that case, the file stats will give us much more
precise information.
That should be fine as long as the filter operation's stats estimation is
properly implemented.
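
That is, the CBO can apply the predicate's estimated selectivity on top of
the coarse aggregated stats; a toy illustration where both numbers are
assumptions:

public class SelectivitySketch {
    public static void main(String[] args) {
        double tableRowCount = 4.5e10;    // from aggregated top-level stats (assumed)
        double filterSelectivity = 0.01;  // from the filter's stats estimation (assumed)
        System.out.printf("estimated output rows: %.0f%n",
                tableRowCount * filterSelectivity);
    }
}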

On Wed, Oct 16, 2024 at 4:08 PM Anton Okolnychyi <aokolnyc...@gmail.com>
wrote:

> Does the doc suggest it is too expensive to aggregate min/max stats after
> planning files (i.e. after loading matching files in memory)? Do we have
> any benchmarks to refer to? We will have to read manifests for planning
> anyway, right?
>
> Also, the doc proposes to add column level stats to the manifest list. I
> remember Dan mentioned the idea to get rid of the manifest list in V4 and
> allow manifests to point to other manifests. While having lower and upper
> bounds at the top-level sounds promising, I am not sure we would want to
> use that for CBO. What if we have a selective predicate that drastically
> narrows down the scope of the operation? In that case, the file stats will
> give us much more precise information.
>
> - Anton
>
> Tue, Oct 15, 2024 at 5:01 PM Xingyuan Lin <linxingyuan1...@gmail.com>
> wrote:
>
>> Hi everyone,
>>
>> Here's a doc for [Proposal] Add manifest-level statistics for CBO
>> estimation
>> <https://docs.google.com/document/d/1NMsS4dg_AXh_abVfzx24VBOmLPIPaHzZRa2g5uUUHDI/edit?usp=sharing>.
>> It's for more efficient derivation of stats for the CBO process. Original
>> discussion thread
>> <https://lists.apache.org/thread/jkt1g4vzjgjbtd5b5dwqs9dzo1ndrwkt>.
>>
>> Please feel free to take a look and comment.
>>
>> Thanks,
>> Xingyuan
>>
>
