For that use case, it sounds like you'd be much better off not storing all
the stats rather than skipping them at read time. I understand the user
wants to keep them, but it may still not be a great choice. I'm just
worried that this is going to be a lot of effort for you that doesn't
really generalize.

That said, if you're convinced that this is the right path, I think it
would be _nice_ to have it in. We can always reduce memory consumption!

On Mon, Oct 9, 2023 at 5:21 AM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> The owner of the table wanted to keep the column stats for all of the
> columns, claiming that other users might be (or already are) using the
> statistics of those columns. Even if I am not sure that their case was
> defensible, I think the reader of the table is often not in a position to
> optimize the table for their own usage. Even in this case we should
> provide tools for them to achieve as much as possible.
>
> Since the statistics values are stored in the BaseFile in maps, every
> column (~400 in our case) added an entry (~100 bytes) to each of the stat
> fields
> (columnSizes/valueCounts/nullValueCounts/nanValueCounts/lowerBounds/upperBounds,
> so 6 x 100 = 600 bytes per column). As a result, we ended up with each
> GenericDataFile costing us ~120k (some columns did not have all of the
> stats present). Almost all of this was statistics related, and most of it
> is totally unnecessary for our reader. Additionally, we had 67k splits, so
> the result was ~8G of unnecessary memory consumption (67,000 x 120 KB ≈
> 8 GB).
>
> I think if we implement the filtering in the BaseFile copy constructor,
> where the column stats removal is done, then we can create a reasonably
> fast copy of the maps which contains only the required column stats. In
> this case we would trade some CPU for a large reduction in memory
> consumption.
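>
> Something like this minimal sketch (the filterStats helper and the
> columnsToKeep parameter are illustrative names, not existing BaseFile
> code):
>
> import java.util.HashMap;
> import java.util.Map;
> import java.util.Set;
>
> // Copy only the entries whose column (field) id was requested; the
> // remaining entries are dropped, so they can be garbage collected.
> private static <T> Map<Integer, T> filterStats(
>     Map<Integer, T> stats, Set<Integer> columnsToKeep) {
>   if (stats == null) {
>     return null;
>   }
>   Map<Integer, T> filtered = new HashMap<>();
>   for (Map.Entry<Integer, T> entry : stats.entrySet()) {
>     if (columnsToKeep.contains(entry.getKey())) {
>       filtered.put(entry.getKey(), entry.getValue());
>     }
>   }
>   return filtered;
> }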
>
> WDYT?
>
> Thanks,
> Peter
>
> Steven Wu <stevenz...@gmail.com> wrote (on Sat, Oct 7, 2023, at 17:04):
>
>> I am sure dropping column stats can be helpful, but it has some
>> challenges in practice. It requires table owners to know the query pattern
>> and decide what column stats to keep and what to drop. While automation can
>> help ease the decisions based on the query history, it can't predict future
>> usages. When new column stats become useful later, we would need to
>> backfill/rebuild the new column stats for existing data files.
>>
>> Engines (like Trino) know which columns are used in filter and join
>> expressions, so query engines know precisely which column stats are needed.
>>
>> On Fri, Oct 6, 2023 at 8:59 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> I understand wanting to keep more in general; that's why we have the 100
>>> column threshold set fairly high. But in the case you're describing, those
>>> column stats are causing a problem. I'd expect you to be able to drop some
>>> of them on such a large table to solve the problem, rather than filter them
>>> out (which is actually not very quick anyway). Have you tried that? Why
>>> doesn't it work?
>>>
>>> On Thu, Oct 5, 2023 at 7:57 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> It is definitely good to only track column stats that are used.
>>>> Otherwise, we are just creating wasteful metadata that can increase
>>>> manifest file size and slow down scan planning. If a table has 100 columns,
>>>> it is very unlikely we need stats for all columns.
>>>>
>>>> But in practice, it is a bit difficult to predict which column stats
>>>> will be useful for future queries. We can analyze query patterns and
>>>> autotune the decision of selectively enabling column stats. However, if
>>>> we need to enable new column stats, backfilling column stats for existing
>>>> data files would require expensive rewrites. Note that Iceberg's default
>>>> behavior is to track column stats up to the first 100 columns.
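>>>>
>>>> If I am not mistaken, that default is controlled by a table property as
>>>> well, something like:
>>>>
>>>> // Limit how many leading columns get metrics by default (100 today)
>>>> table.updateProperties()
>>>>     .set("write.metadata.metrics.max-inferred-column-defaults", "50")
>>>>     .commit();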
>>>>
>>>> It would be interesting to learn from the community about column stats
>>>> tuning.
>>>>
>>>> On Thu, Oct 5, 2023 at 5:38 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I can think of situations in which you may want something like this,
>>>>> but I'm curious what other options you've used to solve the problem. This
>>>>> seems like exactly what `write.metadata.metrics.*` was intended to solve
>>>>> and I'm a bit surprised that you need metrics for so many columns in the
>>>>> table. The metrics are used for scan planning (filtering) and for
>>>>> operations like this, right? Are you keeping metrics around for metadata
>>>>> agg pushdown or something?
>>>>>
>>>>> I'm not opposed, but I do want to make sure there's not a simple way
>>>>> to solve the problem.
>>>>>
>>>>> On Thu, Oct 5, 2023 at 3:23 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>
>>>>>> +1 for this column stats projection feature.
>>>>>>
>>>>>> I will add some additional input.
>>>>>>
>>>>>> 1) In the previous discussion, there are comments on only enabling
>>>>>> column stats that are needed. That is definitely a recommended best
>>>>>> practice. But there are some practical challenges. By default, Iceberg
>>>>>> enables column stats for up to 100 columns. It also may be hard to know
>>>>>> all the queries and what column stats are needed up front. And it is
>>>>>> expensive to retrofit/backfill column stats after the fact, as it
>>>>>> requires a data file scan and rewrite.
>>>>>>
>>>>>> 2) We also had the Trino coordinator run into OOMs because the column
>>>>>> stats took up a lot of memory. If the Trino engine could select the
>>>>>> column stats based on join and filter expressions, we could greatly
>>>>>> reduce the memory footprint per split.
>>>>>>
>>>>>> Thanks,
>>>>>> Steven
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 5, 2023 at 3:32 AM Péter Váry <
>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Team,
>>>>>>>
>>>>>>> TL;DR: I would like to introduce the possibility to parametrize
>>>>>>> Iceberg table scans to include the metadata metrics only for specific
>>>>>>> columns.
>>>>>>>
>>>>>>> We discussed this previously on the mailing list [1], but we did not
>>>>>>> finalize the direction there.
>>>>>>>
>>>>>>> *To recap*
>>>>>>> Currently there are two ways to affect which column metrics are
>>>>>>> returned from the metadata by the scan tasks (see the sketch below):
>>>>>>>
>>>>>>>    - Decide which column stats are collected/stored: the
>>>>>>>    write.metadata.metrics.* write configuration can be used
>>>>>>>    - Decide whether the scan should include the column stats: the
>>>>>>>    includeColumnStats scan builder configuration
>>>>>>>
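>>>>>>> In code, roughly (these are the existing knobs; the table handle is
>>>>>>> assumed to be an already-loaded org.apache.iceberg.Table):
>>>>>>>
>>>>>>> // Write side: controls which stats get persisted for new data files
>>>>>>> table.updateProperties()
>>>>>>>     .set("write.metadata.metrics.default", "counts")
>>>>>>>     .commit();
>>>>>>>
>>>>>>> // Read side: an all-or-nothing switch today - either every stored
>>>>>>> // column stat is attached to the FileScanTasks, or none of them
>>>>>>> TableScan scan = table.newScan().includeColumnStats();
>>>>>>>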
>>>>>>> In our experience this granularity is not enough if the table has
>>>>>>> many columns and multiple readers. Some readers might want statistics
>>>>>>> for a specific column only, while other readers might need metrics for
>>>>>>> another column.
>>>>>>>
>>>>>>> *Current issue*
>>>>>>> As for the concrete case, I have a PR [2] under review to emit
>>>>>>> watermarks when reading an Iceberg table by the Flink Source. It would
>>>>>>> be good to provide a simple "watermarkFieldName" for the users, which
>>>>>>> would be used for generating the watermark.
>>>>>>> The issue is that for the simple implementation we would need the
>>>>>>> column metadata metrics to extract the watermark from the
>>>>>>> *FileScanTasks*. If we include all of the column statistics, then it
>>>>>>> could easily cause memory issues on the JobManager side. This is why
>>>>>>> we are reluctant to make this a simple switch on the user side, and
>>>>>>> might opt for a heavier extractor interface where the user has the
>>>>>>> responsibility to extract the watermark themselves.
>>>>>>> OTOH, if it were possible to return column metrics only for the given
>>>>>>> column, the memory pressure would be more predictable, and we could
>>>>>>> have a more user-friendly solution.
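>>>>>>>
>>>>>>> For illustration, the simple implementation would do something like
>>>>>>> this rough sketch (the method name and the field-id plumbing are made
>>>>>>> up; it assumes the stats for the watermark column were kept on the
>>>>>>> task):
>>>>>>>
>>>>>>> import java.nio.ByteBuffer;
>>>>>>> import org.apache.iceberg.FileScanTask;
>>>>>>> import org.apache.iceberg.types.Conversions;
>>>>>>> import org.apache.iceberg.types.Types;
>>>>>>>
>>>>>>> // Derive a watermark from the lower bound of a timestamp column
>>>>>>> static long watermarkMillis(FileScanTask task, int watermarkFieldId) {
>>>>>>>   ByteBuffer bound = task.file().lowerBounds().get(watermarkFieldId);
>>>>>>>   // Iceberg stores timestamps as microseconds since the epoch
>>>>>>>   Long micros =
>>>>>>>       Conversions.fromByteBuffer(Types.TimestampType.withoutZone(), bound);
>>>>>>>   return micros / 1000L;
>>>>>>> }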
>>>>>>>
>>>>>>> *Suggested solution*
>>>>>>> I would like to introduce a new method to the Scan class, like:
>>>>>>>
>>>>>>> *ThisT includeColumnStats(Collection<String> columns);*
>>>>>>>
>>>>>>>
>>>>>>> Using this configuration, the users of the Scan could decide which
>>>>>>> column metrics are retained during the planning process, and which
>>>>>>> column metrics are dropped from the results.
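>>>>>>>
>>>>>>> Usage would look something like this (a sketch of the proposed API,
>>>>>>> not existing code; the column name is made up):
>>>>>>>
>>>>>>> // Keep only the stats of "event_ts" on the returned scan tasks;
>>>>>>> // stats for every other column would be dropped during planning
>>>>>>> TableScan scan = table.newScan()
>>>>>>>     .includeColumnStats(Collections.singletonList("event_ts"));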
>>>>>>>
>>>>>>> Would the community consider this a valuable addition to the Scan
>>>>>>> API?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Peter
>>>>>>>
>>>>>>> - [1]
>>>>>>> https://lists.apache.org/thread/9rl8yq8ps3mfg91g1qvzvgd0tnkjvxgg -
>>>>>>> Scan statistics thread
>>>>>>> - [2] https://github.com/apache/iceberg/pull/8553 - Flink: Emit
>>>>>>> watermarks from the IcebergSource
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>

-- 
Ryan Blue
Tabular
