Hi

I'm just wondering: would a solution be to put these stats in Puffin files?
There's already ComputeTableStatsSparkAction (and probably similar actions
in other engines), and I can imagine a quick metadata aggregation job to
compute min/max/null_values, etc. Also, how accurate would the stats need
to be? Computing them might be more involved when delete files are present.
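
For illustration, here's a minimal sketch of invoking the existing action
(assuming Spark with Iceberg 1.5+; the table name and columns are
placeholders, and note the action today writes NDV sketches to the Puffin
file, so min/max/null-count aggregation would be an extension of it):

    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.ComputeTableStats;
    import org.apache.iceberg.spark.Spark3Util;
    import org.apache.iceberg.spark.actions.SparkActions;
    import org.apache.spark.sql.SparkSession;

    public class ComputeStatsExample {
      public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().getOrCreate();

        // "db.events" and the column names below are placeholders
        Table table = Spark3Util.loadIcebergTable(spark, "db.events");

        ComputeTableStats.Result result = SparkActions.get(spark)
            .computeTableStats(table)
            .columns("event_date", "user_id")
            .execute();

        // The action writes a Puffin file and registers it as a
        // StatisticsFile in the table metadata.
        System.out.println("Stats file: " + result.statisticsFile().path());
      }
    }
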
Thanks
Szehon

On Fri, Oct 18, 2024 at 12:04 PM Xingyuan Lin <linxingyuan1...@gmail.com>
wrote:

> Thanks Anton for the review and feedback. I've shared more context below.
> Good to learn about the potential manifest structure change in V4. I guess
> this proposal is most helpful as a statement of the problem of large-scale
> manifest processing; we can look for ways to improve that when designing
> the V4 spec.
>
> > Does the doc suggest it is too expensive to aggregate min/max stats
> after planning files?
> Yes.
>
> > Do we have any benchmarks to refer to?
> Not really. But as mentioned in the doc, suppose we have a
> petabyte-scale time-series fact table whose daily data volume is ~2.5 TB,
> which translates to 5,000 data files at 500 MB each. If we join its past
> 3 months' data with a dimension table, the number of matching files comes
> to 5,000 * 90 = 450,000, which is fairly large. What makes things worse
> are the table's high dimensionality and the system's high concurrency.
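>
> To put rough numbers on the coordinator-side cost (every constant below
> is an illustrative assumption; the snippet runs as-is in jshell):
>
>     long matchingFiles = 5_000L * 90;  // 450,000 files, from the estimate above
>     int columns = 50;                  // assumed high-dimensional schema
>     int bytesPerColumn = 100;          // assumed bounds + null/value counts per file
>     long total = matchingFiles * columns * bytesPerColumn;
>     System.out.printf("~%.2f GB of file stats%n", total / 1e9);  // ~2.25 GB
>
> That's gigabytes of per-file stats held on the coordinator before any
> other planning state is accounted for.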
>
> In fact, in production we could see the Trino coordinator's memory being
> largely occupied by Iceberg data files' stats (example heap dump analysis
> <https://drive.google.com/file/d/1GFCm6KWPSLoBGTwDfZDH1KzHMqTVV5dI/view?usp=sharing>),
> and we could see queries' planning times becoming excessively long because
> of the cost-based optimization process. (Disclaimer: the heap dump
> analysis was done with an old Trino version, 372. Many improvements have
> landed since then; however, processing data file stats remains expensive.)
>
> > We will have to read manifests for planning anyway, right?
> In the case of Trino, it separates query planning from query scheduling.
> During query planning, it decides how the query will be executed,
> including cost-based optimizations. During query scheduling, it reads
> Iceberg manifests and assigns data files to workers in an asynchronous,
> streaming manner.
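>
> A toy sketch of that two-phase flow (hypothetical types, not Trino's
> actual interfaces): the optimizer consumes stats up front, while manifests
> are read lazily and splits are streamed to workers only afterwards:
>
>     import java.util.Iterator;
>     import java.util.stream.IntStream;
>
>     public class TwoPhaseSketch {
>       record TableStats(long rowCount) {}  // hypothetical
>       record Split(String dataFile) {}     // hypothetical
>
>       public static void main(String[] args) {
>         // Planning: CBO decisions need stats before any manifest is read.
>         TableStats stats = new TableStats(450_000L * 1_000_000);
>         boolean broadcastDim = stats.rowCount() > 1_000_000;
>
>         // Scheduling: manifests are read lazily; splits stream to workers.
>         Iterator<Split> splits = IntStream.range(0, 3)
>             .mapToObj(i -> new Split("s3://warehouse/data/f-" + i + ".parquet"))
>             .iterator();
>         while (splits.hasNext()) {
>           System.out.println("assign: " + splits.next().dataFile()
>               + (broadcastDim ? " (broadcast dim)" : ""));
>         }
>       }
>     }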
>
> > Also, the doc proposes to add column level stats to the manifest list. I
> remember Dan mentioned the idea to get rid of the manifest list in V4 and
> allow manifests to point to other manifests.
> Got it. We can consider adding top-level stats.
>
> > What if we have a selective predicate that drastically narrows down the
> scope of the operation? In that case, the file stats will give us much more
> precise information.
> It's fine as long as the filter operation's selectivity estimation is
> properly implemented, i.e., the optimizer scales the top-level stats down
> by the predicate's estimated selectivity.
>
> On Wed, Oct 16, 2024 at 4:08 PM Anton Okolnychyi <aokolnyc...@gmail.com>
> wrote:
>
>> Does the doc suggest it is too expensive to aggregate min/max stats after
>> planning files (i.e. after loading matching files in memory)? Do we have
>> any benchmarks to refer to? We will have to read manifests for planning
>> anyway, right?
>>
>> Also, the doc proposes to add column level stats to the manifest list. I
>> remember Dan mentioned the idea to get rid of the manifest list in V4 and
>> allow manifests to point to other manifests. While having lower and upper
>> bounds at the top level sounds promising, I am not sure we would want to
>> use that for CBO. What if we have a selective predicate that drastically
>> narrows down the scope of the operation? In that case, the file stats will
>> give us much more precise information.
>>
>> - Anton
>>
>> On Tue, Oct 15, 2024 at 5:01 PM Xingyuan Lin <linxingyuan1...@gmail.com>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> Here's a doc for [Proposal] Add manifest-level statistics for CBO
>>> estimation
>>> <https://docs.google.com/document/d/1NMsS4dg_AXh_abVfzx24VBOmLPIPaHzZRa2g5uUUHDI/edit?usp=sharing>.
>>> It proposes a more efficient way to derive stats for the CBO process.
>>> Original discussion thread
>>> <https://lists.apache.org/thread/jkt1g4vzjgjbtd5b5dwqs9dzo1ndrwkt>.
>>>
>>> Please feel free to take a look and comment.
>>>
>>> Thanks,
>>> Xingyuan
>>>
>>
