Based on our discussion here, I have created a PR for the feature:
https://github.com/apache/iceberg/pull/8803

I think this is not a big change, and the flexibility/reduced memory
consumption would be worth the additional complexity.

Please review the PR to see for yourselves :)

Thanks,
Peter

Manish Malhotra <manish.malhotra.w...@gmail.com> wrote (on Tue, Oct 10, 2023
at 17:02):

> Thanks Ryan,
>
> Good point, that makes sense.
>
> Though +1 for the feature.
>
> We can avoid collecting the stats during ingestion as well, but we might
> need them some time later, so having options at read time will help.
>
> Thanks,
> Manish
>
> On Mon, Oct 9, 2023 at 4:38 PM Ryan Blue <b...@tabular.io> wrote:
>
>> For that use case, it sounds like you'd be much better off not storing
>> all the stats rather than skipping them at read time. I understand the user
>> wants to keep them, but it may still not be a great choice. I'm just
>> worried that this is going to be a lot of effort for you that doesn't
>> really generalize.
>>
>> That said, if you're convinced that this is the right path I think it
>> would be _nice_ to have it in. We can always reduce memory consumption!
>>
>> On Mon, Oct 9, 2023 at 5:21 AM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> The owner of the table wanted to keep the column stats for all of the
>>> columns, claiming that other users might be, or already are, using the
>>> statistics of those columns. Even if I am not sure their case was
>>> defensible, I think the reader of a table is often not in a position to
>>> optimize the table for their own usage. Even in this case we should
>>> provide tools for them to achieve as much as possible.
>>>
>>> Since the statistics values are stored in the BaseFile in a map, every
>>> column (~400 in our case) added an entry (~100 bytes) to each of the stats
>>> maps
>>> (columnSizes/valueCounts/nullValueCounts/nanValueCounts/lowerBounds/upperBounds
>>> - 6x100 = 600 bytes per column). As a result, a single GenericDataFile
>>> cost us ~120k (some columns did not have all of the stats present).
>>> Almost all of this was statistics related, and most of it was totally
>>> unnecessary for our reader. Additionally, we had 67k splits, so this
>>> added up to ~8G of unnecessary memory consumption.
>>>
>>> I think if we implement the filtering in the BaseFile constructor, where
>>> we already do the column stats handling, then we can create a reasonably
>>> fast copy of the map which contains only the required column stats. This
>>> way we would trade a bit of CPU for the reduced memory consumption.
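>>>
>>> Roughly something like this (just a sketch - the real BaseFile internals
>>> differ, and the requestedIds parameter is hypothetical):
>>>
>>> import java.util.HashMap;
>>> import java.util.Map;
>>> import java.util.Set;
>>>
>>> // Copy only the stats entries for the requested column ids, so the
>>> // original (large) map does not have to be retained by the scan task.
>>> private static <V> Map<Integer, V> filteredCopy(
>>>     Map<Integer, V> stats, Set<Integer> requestedIds) {
>>>   if (stats == null) {
>>>     return null;
>>>   }
>>>   Map<Integer, V> copy = new HashMap<>(requestedIds.size());
>>>   for (Integer id : requestedIds) {
>>>     V value = stats.get(id);
>>>     if (value != null) {
>>>       copy.put(id, value);
>>>     }
>>>   }
>>>   return copy;
>>> }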
>>>
>>> WDYT?
>>>
>>> Thanks,
>>> Peter
>>>
>>> Steven Wu <stevenz...@gmail.com> wrote (on Sat, Oct 7, 2023 at
>>> 17:04):
>>>
>>>> I am sure dropping column stats can be helpful. Just that it has some
>>>> challenges in practice. It requires table owners to know the query pattern
>>>> and decide what column stats to keep and what to drop. While automation can
>>>> help ease the decisions based on the query history, it can't predict future
>>>> usages. When new column stats become useful later, we would need to
>>>> backfill/rebuild the new column stats for existing data files.
>>>>
>>>> Engines (like Trino) know which columns are used in filter and join
>>>> expressions, so query engines know precisely which column stats are needed.
>>>>
>>>> On Fri, Oct 6, 2023 at 8:59 AM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I understand wanting to keep more in general, that's why we have the
>>>>> 100 column threshold set fairly high. But in the case you're describing
>>>>> those column stats are causing a problem. I'd expect you to be able to
>>>>> drop some of them on such a large table to solve the problem, rather
>>>>> than filter them out (which is actually not very quick anyway). Have you
>>>>> tried that? Why doesn't it work?
>>>>>
>>>>> On Thu, Oct 5, 2023 at 7:57 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>
>>>>>> It is definitely good to only track column stats that are used.
>>>>>> Otherwise, we are just creating wasteful metadata that can increase
>>>>>> manifest file size and slow down scan planning. If a table has 100 
>>>>>> columns,
>>>>>> it is very unlikely we need stats for all columns.
>>>>>>
>>>>>> But in practice, it is a bit difficult to predict which column stats
>>>>>> will be useful for future queries. We can analyze query patterns and
>>>>>> autotune the decision of selectively enabling column stats. However,
>>>>>> if we need to enable new column stats, backfilling them for existing
>>>>>> data files would require expensive rewrites. Note that Iceberg's
>>>>>> default behavior is to track column stats for up to the first 100
>>>>>> columns.
>>>>>>
>>>>>> It would be interesting to learn from the community regarding column
>>>>>> stats tuning.
>>>>>>
>>>>>> On Thu, Oct 5, 2023 at 5:38 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> I can think of situations in which you may want something like this,
>>>>>>> but I'm curious what other options you've used to solve the problem. 
>>>>>>> This
>>>>>>> seems like exactly what `write.metadata.metrics.*` was intended to solve
>>>>>>> and I'm a bit surprised that you need metrics for so many columns in the
>>>>>>> table. The metrics are used for scan planning (filtering) and for
>>>>>>> operations like this, right? Are you keeping metrics around for metadata
>>>>>>> agg pushdown or something?
>>>>>>>
>>>>>>> I'm not opposed, but I do want to make sure there's not a simple way
>>>>>>> to solve the problem.
>>>>>>>
>>>>>>> On Thu, Oct 5, 2023 at 3:23 PM Steven Wu <stevenz...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1 for this feature of column stats projection.
>>>>>>>>
>>>>>>>> I will add some additional inputs.
>>>>>>>>
>>>>>>>> 1) In the previous discussion, there were comments on enabling only
>>>>>>>> the column stats that are needed. That is definitely a recommended best
>>>>>>>> practice, but there are some practical challenges. By default, Iceberg
>>>>>>>> enables column stats for up to 100 columns. It may also be hard to know
>>>>>>>> all the queries and what column stats are needed up front. And it is
>>>>>>>> expensive to retrofit/backfill column stats after the fact, as it
>>>>>>>> requires a data file scan and rewrite.
>>>>>>>>
>>>>>>>> 2) We have also seen the Trino coordinator run into OOMs due to the
>>>>>>>> column stats taking up a lot of memory. If the Trino engine can
>>>>>>>> select the column stats based on join and filter expressions, we can
>>>>>>>> greatly reduce the memory footprint per split.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Steven
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Oct 5, 2023 at 3:32 AM Péter Váry <
>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Team,
>>>>>>>>>
>>>>>>>>> TL;DR: I would like to introduce the possibility to parameterize
>>>>>>>>> Iceberg table scans so that they include the metadata metrics only
>>>>>>>>> for specific columns.
>>>>>>>>>
>>>>>>>>> We discussed this previously on the mailing list [1], but we did
>>>>>>>>> not finalize the direction there.
>>>>>>>>>
>>>>>>>>> *To recap*
>>>>>>>>> Currently there are two ways to affect which column metrics are
>>>>>>>>> returned from the metadata by the scan tasks:
>>>>>>>>>
>>>>>>>>>    - Decide which column stats are collected/stored - the
>>>>>>>>>    write.metadata.metrics.* write configuration can be used (see the
>>>>>>>>>    sketch after this list)
>>>>>>>>>    - Decide if the scan should include the column stats - the
>>>>>>>>>    includeColumnStats scan builder configuration
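>>>>>>>>>
>>>>>>>>> For the first option, the write-side tuning could look something like
>>>>>>>>> this (a sketch; `table` is an org.apache.iceberg.Table and the column
>>>>>>>>> name is made up):
>>>>>>>>>
>>>>>>>>> // Collect no column stats by default, but keep full metrics for the
>>>>>>>>> // (hypothetical) event_time column used for filtering
>>>>>>>>> table.updateProperties()
>>>>>>>>>     .set("write.metadata.metrics.default", "none")
>>>>>>>>>     .set("write.metadata.metrics.column.event_time", "full")
>>>>>>>>>     .commit();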
>>>>>>>>>
>>>>>>>>> In our experience this granularity is not enough if the table has
>>>>>>>>> many columns and multiple readers. Some readers might want statistics
>>>>>>>>> for a specific column only, while other readers might need metrics for
>>>>>>>>> another column.
>>>>>>>>>
>>>>>>>>> *Current issue*
>>>>>>>>> As for the concrete case, I have a PR [2] under review to emit
>>>>>>>>> watermarks when reading an Iceberg table with the Flink Source. It
>>>>>>>>> would be good to provide a simple "watermarkFieldName" for the users,
>>>>>>>>> which would be used for generating the watermark.
>>>>>>>>> The issue is that for the simple implementation we would need the
>>>>>>>>> column metadata metrics to extract the watermark for the
>>>>>>>>> *FileScanTasks*. If we include all of the column statistics, then
>>>>>>>>> it could easily cause memory issues on the JobManager side. This is
>>>>>>>>> why we are reluctant to make this a simple switch on the user side,
>>>>>>>>> and might opt for a heavier extractor interface where the user has
>>>>>>>>> the responsibility to extract the watermark themselves.
>>>>>>>>> OTOH, if it were possible to return column metrics only for the given
>>>>>>>>> column, the memory pressure would be more predictable, and we could
>>>>>>>>> have a more user-friendly solution.
>>>>>>>>>
>>>>>>>>> *Suggested solution*
>>>>>>>>> I would like to introduce a new method to the Scan class, like:
>>>>>>>>>
>>>>>>>>> *ThisT includeColumnStats(Collection<String> columns);*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Using this configuration, the users of the Scan could decide which
>>>>>>>>> column metrics are retained during the planning process, and which
>>>>>>>>> column metrics are dropped from the results.
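>>>>>>>>>
>>>>>>>>> A minimal usage sketch (assuming the proposed overload exists; the
>>>>>>>>> event_time column name is illustrative):
>>>>>>>>>
>>>>>>>>> import java.io.IOException;
>>>>>>>>> import java.util.List;
>>>>>>>>> import org.apache.iceberg.FileScanTask;
>>>>>>>>> import org.apache.iceberg.Table;
>>>>>>>>> import org.apache.iceberg.TableScan;
>>>>>>>>> import org.apache.iceberg.io.CloseableIterable;
>>>>>>>>>
>>>>>>>>> void planWithSelectedStats(Table table) throws IOException {
>>>>>>>>>   // Retain only the event_time column metrics on the planned tasks
>>>>>>>>>   TableScan scan = table.newScan()
>>>>>>>>>       .includeColumnStats(List.of("event_time"));
>>>>>>>>>   try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
>>>>>>>>>     for (FileScanTask task : tasks) {
>>>>>>>>>       // task.file() would only carry the stats map entries for
>>>>>>>>>       // event_time; everything else is dropped during planning
>>>>>>>>>     }
>>>>>>>>>   }
>>>>>>>>> }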
>>>>>>>>>
>>>>>>>>> Would the community consider this as a valuable addition to the
>>>>>>>>> Scan API?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> - [1] https://lists.apache.org/thread/9rl8yq8ps3mfg91g1qvzvgd0tnkjvxgg
>>>>>>>>> - Scan statistics thread
>>>>>>>>> - [2] https://github.com/apache/iceberg/pull/8553 - Flink: Emit
>>>>>>>>> watermarks from the IcebergSource
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
