Thanks Peter!

On Wed, Oct 11, 2023 at 5:36 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> Based on our discussion here, I have created a PR for the feature:
> https://github.com/apache/iceberg/pull/8803
>
> I think this is not a big change, and the flexibility/reduced memory
> consumption would be worth the additional complexity.
>
> Please review the PR to see for yourselves :)
>
> Thanks,
> Peter
>
> Manish Malhotra <manish.malhotra.w...@gmail.com> wrote (on Tue, Oct 10,
> 2023 at 17:02):
>
>> Thanks Ryan,
>>
>> Good point, that makes sense.
>>
>> Though +1 for the feature.
>>
>> We can avoid it during ingestion as well, though we might need the stats
>> some time later, so having options during reading will help.
>>
>> Thanks,
>> Manish
>>
>> On Mon, Oct 9, 2023 at 4:38 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> For that use case, it sounds like you'd be much better off not storing
>>> all the stats rather than skipping them at read time. I understand the
>>> user wants to keep them, but it may still not be a great choice. I'm
>>> just worried that this is going to be a lot of effort for you that
>>> doesn't really generalize.
>>>
>>> That said, if you're convinced that this is the right path, I think it
>>> would be _nice_ to have it in. We can always reduce memory consumption!
>>>
>>> On Mon, Oct 9, 2023 at 5:21 AM Péter Váry <peter.vary.apa...@gmail.com>
>>> wrote:
>>>
>>>> The owner of the table wanted to keep the column stats for all of the
>>>> columns, claiming that other users might be (or already are) using the
>>>> statistics of those columns. Even if I am not sure their case was
>>>> defensible, I think the reader of a table is often not in a position
>>>> to optimize the table for their own usage. Even in this case we should
>>>> provide tools for them to achieve as much as possible.
>>>>
>>>> Since the statistics values are stored in the BaseFile in a map, every
>>>> column (~400 in our case) added an entry (~100 bytes) to each of the
>>>> stat fields
>>>> (columnSizes/valueCounts/nullValueCounts/nanValueCounts/lowerBounds/upperBounds,
>>>> i.e. 6 x 100 = 600 bytes per column). As a result we ended up with
>>>> every GenericDataFile costing us ~120 KB (some columns did not have
>>>> all of the stats present). Almost all of this was statistics related,
>>>> and most of it is totally unnecessary for our reader. Additionally, we
>>>> had 67k splits, so the result was ~8 GB of unnecessary memory
>>>> consumption.
>>>>
>>>> I think if we implement the filtering in the BaseFile constructor,
>>>> where we do the column stats removal, then we can create a reasonably
>>>> fast copy of the map which contains only the required column stats.
>>>> Also, in this case we could trade CPU for memory consumption.
>>>>
>>>> WDYT?
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> Steven Wu <stevenz...@gmail.com> wrote (on Sat, Oct 7, 2023 at 17:04):
>>>>
>>>>> I am sure dropping column stats can be helpful. It just has some
>>>>> challenges in practice. It requires table owners to know the query
>>>>> pattern and decide which column stats to keep and which to drop.
>>>>> While automation can help ease these decisions based on the query
>>>>> history, it can't predict future usage. When new column stats become
>>>>> useful later, we would need to backfill/rebuild them for existing
>>>>> data files.
>>>>>
>>>>> Engines (like Trino) know which columns are used in filter and join
>>>>> expressions, so query engines know precisely which column stats are
>>>>> needed.
>>>>>
>>>>> On Fri, Oct 6, 2023 at 8:59 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> I understand wanting to keep more in general; that's why we have the
>>>>>> 100-column threshold set fairly high. But in the case you're
>>>>>> describing, those column stats are causing a problem.
>>>>>> I'd expect you to be able to drop some of them on such a large table
>>>>>> to solve the problem, rather than filter them out (which is actually
>>>>>> not very quick anyway). Have you tried that? Why doesn't it work?
>>>>>>
>>>>>> On Thu, Oct 5, 2023 at 7:57 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>
>>>>>>> It is definitely good to only track column stats that are used.
>>>>>>> Otherwise, we are just creating wasteful metadata that can increase
>>>>>>> manifest file size and slow down scan planning. If a table has 100
>>>>>>> columns, it is very unlikely we need stats for all of them.
>>>>>>>
>>>>>>> But in practice, it is a bit difficult to predict which column
>>>>>>> stats will be useful for future queries. We can analyze query
>>>>>>> patterns and auto-tune the decision to selectively enable column
>>>>>>> stats. However, if we need to enable new column stats, backfilling
>>>>>>> them for existing data files would require expensive rewrites. Note
>>>>>>> that Iceberg's default behavior is to track column stats for up to
>>>>>>> the first 100 columns.
>>>>>>>
>>>>>>> It would be interesting to learn from the community about column
>>>>>>> stats tuning.
>>>>>>>
>>>>>>> On Thu, Oct 5, 2023 at 5:38 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> I can think of situations in which you may want something like
>>>>>>>> this, but I'm curious what other options you've used to solve the
>>>>>>>> problem. This seems like exactly what `write.metadata.metrics.*`
>>>>>>>> was intended to solve, and I'm a bit surprised that you need
>>>>>>>> metrics for so many columns in the table. The metrics are used for
>>>>>>>> scan planning (filtering) and for operations like this, right? Are
>>>>>>>> you keeping metrics around for metadata agg pushdown or something?
>>>>>>>>
>>>>>>>> I'm not opposed, but I do want to make sure there's not a simple
>>>>>>>> way to solve the problem.
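For reference, the existing mechanism Ryan points to is the `write.metadata.metrics.*` family of table properties, which controls which column stats are written in the first place. A minimal sketch of that configuration (the column name `event_ts` is illustrative, not from the thread):

```properties
# Stop collecting stats by default; keep full stats only for the columns
# known to be used in filters (column name is illustrative).
write.metadata.metrics.default=none
write.metadata.metrics.column.event_ts=full
```

Per-column values can be `none`, `counts`, `truncate(16)`, or `full`, and `write.metadata.metrics.max-inferred-column-defaults` (default 100) is the 100-column threshold mentioned in the thread. Note that these properties only affect newly written data files, which is exactly the backfill problem Steven raises: existing files keep their stats until rewritten.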
>>>>>>>>
>>>>>>>> On Thu, Oct 5, 2023 at 3:23 PM Steven Wu <stevenz...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1 for this feature of column stats projection.
>>>>>>>>>
>>>>>>>>> I will add some additional input.
>>>>>>>>>
>>>>>>>>> 1) In the previous discussion, there were comments about only
>>>>>>>>> enabling the column stats that are needed. That is definitely a
>>>>>>>>> recommended best practice, but there are some practical
>>>>>>>>> challenges. By default, Iceberg enables column stats for up to
>>>>>>>>> 100 columns. It also may be hard to know all the queries and
>>>>>>>>> which column stats are needed up front. And it is expensive to
>>>>>>>>> retrofit/backfill column stats after the fact, as it requires a
>>>>>>>>> data file scan and rewrite.
>>>>>>>>>
>>>>>>>>> 2) We have also had the Trino coordinator run into OOMs because
>>>>>>>>> the column stats take up a lot of memory. If the Trino engine
>>>>>>>>> could select the column stats based on join and filter
>>>>>>>>> expressions, we could greatly reduce the memory footprint per
>>>>>>>>> split.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Steven
>>>>>>>>>
>>>>>>>>> On Thu, Oct 5, 2023 at 3:32 AM Péter Váry
>>>>>>>>> <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Team,
>>>>>>>>>>
>>>>>>>>>> TL;DR: I would like to introduce the possibility to parametrize
>>>>>>>>>> Iceberg table scans to include the metadata metrics only for
>>>>>>>>>> specific columns.
>>>>>>>>>>
>>>>>>>>>> We discussed this previously on the mailing list [1], but we
>>>>>>>>>> did not finalize the direction there.
>>>>>>>>>>
>>>>>>>>>> *To recap*
>>>>>>>>>> Currently there are two ways to affect which column metrics are
>>>>>>>>>> returned from the metadata by the scan tasks:
>>>>>>>>>>
>>>>>>>>>> - Decide which column stats are collected/stored - the
>>>>>>>>>> write.metadata.metrics.* write configuration can be used
>>>>>>>>>> - Decide whether the scan should include the column stats at
>>>>>>>>>> all - the includeColumnStats scan builder configuration
>>>>>>>>>>
>>>>>>>>>> In our experience this granularity is not enough if the table
>>>>>>>>>> has many columns and multiple readers. Some readers might want
>>>>>>>>>> statistics for one specific column only, while other readers
>>>>>>>>>> might need metrics for a different column.
>>>>>>>>>>
>>>>>>>>>> *Current issue*
>>>>>>>>>> As for the concrete case, I have a PR [2] under review to emit
>>>>>>>>>> watermarks when reading an Iceberg table with the Flink source.
>>>>>>>>>> It would be good to provide a simple "watermarkFieldName" option
>>>>>>>>>> for users, which would be used for generating the watermark.
>>>>>>>>>> The issue is that for the simple implementation we would need
>>>>>>>>>> the column metadata metrics of the *FileScanTasks* to extract
>>>>>>>>>> the watermark. If we include all of the column statistics, it
>>>>>>>>>> could easily cause memory issues on the JobManager side. This is
>>>>>>>>>> why we are reluctant to make this a simple switch on the user
>>>>>>>>>> side, and might opt for a heavier extractor interface where the
>>>>>>>>>> user has the responsibility to extract the watermark themselves.
>>>>>>>>>> OTOH, if it were possible to return column metrics only for the
>>>>>>>>>> given column, the memory pressure would be more predictable, and
>>>>>>>>>> we could offer a more user-friendly solution.
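The per-split watermark extraction Peter describes could, in principle, look like the sketch below. This is a self-contained illustration under simplified assumptions: the field id, the helper name, and the long-valued bounds map are all stand-ins (Iceberg's DataFile.lowerBounds() really returns a Map<Integer, ByteBuffer> keyed by field id, which would need decoding).

```java
import java.util.List;
import java.util.Map;

public class WatermarkSketch {
    // Derive a split watermark as the minimum lower bound of the timestamp
    // column across the split's files. The Map<Integer, Long> stats are a
    // simplification of Iceberg's Map<Integer, ByteBuffer> lowerBounds().
    static long watermarkFor(List<Map<Integer, Long>> fileLowerBounds, int tsFieldId) {
        long watermark = Long.MAX_VALUE;
        for (Map<Integer, Long> bounds : fileLowerBounds) {
            Long lower = bounds.get(tsFieldId);
            if (lower != null) { // files without stats for the column are skipped
                watermark = Math.min(watermark, lower);
            }
        }
        return watermark;
    }

    public static void main(String[] args) {
        int tsFieldId = 3; // illustrative field id of the watermark column
        List<Map<Integer, Long>> files = List.of(
            Map.of(3, 1_000L, 7, 42L),
            Map.of(3, 500L),
            Map.of(7, 99L));
        System.out.println(watermarkFor(files, tsFieldId)); // prints 500
    }
}
```

Only the bounds of the single watermark column are consulted here, which is why returning column metrics for just that one column would keep the JobManager's memory footprint bounded.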
>>>>>>>>>>
>>>>>>>>>> *Suggested solution*
>>>>>>>>>> I would like to introduce a new method on the Scan class, like:
>>>>>>>>>>
>>>>>>>>>> *ThisT includeColumnStats(Collection<String> columns);*
>>>>>>>>>>
>>>>>>>>>> Using this configuration, the users of the scan could decide
>>>>>>>>>> which column metrics are retained during the planning process,
>>>>>>>>>> and which column metrics are dropped from the results.
>>>>>>>>>>
>>>>>>>>>> Would the community consider this a valuable addition to the
>>>>>>>>>> Scan API?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>> - [1]
>>>>>>>>>> https://lists.apache.org/thread/9rl8yq8ps3mfg91g1qvzvgd0tnkjvxgg
>>>>>>>>>> - Scan statistics thread
>>>>>>>>>> - [2] https://github.com/apache/iceberg/pull/8553 - Flink: Emit
>>>>>>>>>> watermarks from the IcebergSource
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
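The stats filtering Peter proposes for the BaseFile constructor amounts to projecting each of the six stats maps (columnSizes, valueCounts, nullValueCounts, nanValueCounts, lowerBounds, upperBounds) onto the requested field ids. A minimal self-contained sketch of that copy, using assumed simplified types rather than Iceberg's actual internals:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class StatsProjection {
    // Copy a stats map, keeping only the requested field ids. With ~400
    // columns per file and 67k splits (~120 KB of stats per GenericDataFile,
    // roughly 8 GB total, per the numbers in the thread), projecting down to
    // a few columns shrinks the footprint roughly proportionally.
    static <V> Map<Integer, V> project(Map<Integer, V> stats, Set<Integer> requestedIds) {
        if (stats == null) {
            return null; // a missing stats map stays missing
        }
        Map<Integer, V> projected = new HashMap<>();
        for (Map.Entry<Integer, V> e : stats.entrySet()) {
            if (requestedIds.contains(e.getKey())) {
                projected.put(e.getKey(), e.getValue());
            }
        }
        return Collections.unmodifiableMap(projected);
    }

    public static void main(String[] args) {
        Map<Integer, Long> valueCounts = Map.of(1, 10L, 2, 20L, 3, 30L);
        System.out.println(project(valueCounts, Set.of(2))); // prints {2=20}
    }
}
```

The single pass over each map is the CPU-for-memory trade-off mentioned in the thread: after planning, the full maps become unreachable and can be garbage collected, leaving only the projected entries on the heap.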