The owner of the table wanted to keep the column stats for all of the columns, claiming that other users might be (or already are) using them. Even if I am not sure their case is defensible, the reader of a table is often not in a position to optimize the table for their own usage. Even in that case we should provide tools for them to achieve as much as possible.
Since the statistics values are stored in the BaseFile in a map, every column (~400 in our case) adds an entry (~100 bytes) to each of the stat fields (columnSizes/valueCounts/nullValueCounts/nanValueCounts/lowerBounds/upperBounds - 6 x 100 = 600 bytes per column). As a result, a single GenericDataFile ended up costing us ~120k (some columns did not have all of the stats present). Almost all of this was statistics related, and most of it is completely unnecessary for our reader. With 67k splits, this added up to ~8G of unnecessary memory consumption.

I think if we implement the filtering in the BaseFile constructor, where the column stats removal happens, then we can create a reasonably fast copy of the map which contains only the required column stats. This way we could trade CPU for memory consumption. WDYT?

Thanks,
Peter

On Sat, Oct 7, 2023 at 17:04, Steven Wu <stevenz...@gmail.com> wrote:

> I am sure dropping column stats can be helpful. Just that it has some challenges in practice. It requires table owners to know the query patterns and decide what column stats to keep and what to drop. While automation can help ease the decisions based on the query history, it can't predict future usages. When new column stats become useful later, we would need to backfill/rebuild the new column stats for existing data files.
>
> Engines (like Trino) know which columns are used in filter and join expressions. Query engines know precisely what column stats are needed.
>
> On Fri, Oct 6, 2023 at 8:59 AM Ryan Blue <b...@tabular.io> wrote:
>
>> I understand wanting to keep more in general; that's why we have the 100-column threshold set fairly high. But in the case you're describing, those column stats are causing a problem. I'd expect you to be able to drop some of them on such a large table to solve the problem, rather than filter them out (which is actually not very quick anyway). Have you tried that? Why doesn't it work?
>>
>> On Thu, Oct 5, 2023 at 7:57 PM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> It is definitely good to only track column stats that are used. Otherwise, we are just creating wasteful metadata that can increase manifest file size and slow down scan planning. If a table has 100 columns, it is very unlikely we need stats for all columns.
>>>
>>> But in practice, it is a bit difficult to predict which column stats will be useful for future queries. We can analyze query patterns and autotune the decision of selectively enabling column stats. However, if we need to enable new column stats, backfilling them for existing data files would require expensive rewrites. Note that Iceberg's default behavior is to track column stats for up to the first 100 columns.
>>>
>>> It would be interesting to learn from the community regarding column stats tuning.
>>>
>>> On Thu, Oct 5, 2023 at 5:38 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> I can think of situations in which you may want something like this, but I'm curious what other options you've used to solve the problem. This seems like exactly what `write.metadata.metrics.*` was intended to solve, and I'm a bit surprised that you need metrics for so many columns in the table. The metrics are used for scan planning (filtering) and for operations like this, right? Are you keeping metrics around for metadata agg pushdown or something?
>>>>
>>>> I'm not opposed, but I do want to make sure there's not a simple way to solve the problem.
>>>> On Thu, Oct 5, 2023 at 3:23 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>
>>>>> +1 for this feature of column stats projection.
>>>>>
>>>>> I will add some additional input.
>>>>>
>>>>> 1) In the previous discussion, there were comments on only enabling the column stats that are needed. That is definitely a recommended best practice, but there are some practical challenges. By default, Iceberg enables column stats for up to 100 columns. It may also be hard to know all the queries and what column stats are needed up front. And it is expensive to retrofit/backfill column stats after the fact, as it requires scanning and rewriting the data files.
>>>>>
>>>>> 2) We have also had the Trino coordinator run into OOM due to the column stats taking up a lot of memory. If the Trino engine can select the column stats based on join and filter expressions, we can greatly reduce the memory footprint per split.
>>>>>
>>>>> Thanks,
>>>>> Steven
>>>>>
>>>>> On Thu, Oct 5, 2023 at 3:32 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>>> Hi Team,
>>>>>>
>>>>>> TL;DR: I would like to introduce the possibility of parametrizing Iceberg table scans to include the metadata metrics only for specific columns.
>>>>>>
>>>>>> We discussed this previously on the mailing list [1], but we did not finalize the direction there.
>>>>>>
>>>>>> *To recap*
>>>>>> Currently there are two ways to affect which column metrics are returned from the metadata by the scan tasks:
>>>>>>
>>>>>> - Decide which column stats are collected/stored - the write.metadata.metrics.* write configuration can be used
>>>>>> - Decide whether the scan should include the column stats at all - the includeColumnStats scan builder configuration
>>>>>>
>>>>>> In our experience this granularity is not enough if the table has many columns and multiple readers. Some readers might need statistics for one specific column only, while other readers might need metrics for another column.
>>>>>>
>>>>>> *Current issue*
>>>>>> As for the concrete case, I have a PR [2] under review to emit watermarks when reading an Iceberg table with the Flink Source. It would be good to provide a simple "watermarkFieldName" option for the users, which would be used for generating the watermark.
>>>>>> The issue is that for the simple implementation we would need the column metadata metrics to extract the watermark for the *FileScanTasks*. If we include all of the column statistics, then it could easily cause memory issues on the JobManager side. This is why we are reluctant to make this a simple switch on the user side, and might opt for a heavier extractor interface where the user has the responsibility to extract the watermark themselves.
>>>>>> OTOH, if it were possible to return column metrics only for the given column, the memory pressure would be more predictable, and we could have a more user-friendly solution.
>>>>>>
>>>>>> *Suggested solution*
>>>>>> I would like to introduce a new method on the Scan class, like:
>>>>>>
>>>>>> *ThisT includeColumnStats(Collection<String> columns);*
>>>>>>
>>>>>> Using this configuration, the users of the scan could decide which column metrics are retained during the planning process and which column metrics are dropped from the results.
>>>>>> Would the community consider this a valuable addition to the Scan API?
>>>>>>
>>>>>> Thanks,
>>>>>> Peter
>>>>>>
>>>>>> - [1] https://lists.apache.org/thread/9rl8yq8ps3mfg91g1qvzvgd0tnkjvxgg - Scan statistics thread
>>>>>> - [2] https://github.com/apache/iceberg/pull/8553 - Flink: Emit watermarks from the IcebergSource
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>
>> --
>> Ryan Blue
>> Tabular
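The core of the proposal is a read-side projection of column stats. A hypothetical sketch of how a caller might use it; the `includeColumnStats(Collection<String>)` overload does not exist at the time of the thread (only the no-argument variant does), and the `event_ts` column name is assumed for illustration:

```java
import java.util.Collections;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.io.CloseableIterable;

public class StatsProjectionExample {
  public static void planWithProjectedStats(Table table) throws Exception {
    // Proposed overload: retain stats only for the watermark column during planning.
    TableScan scan = table.newScan()
        .includeColumnStats(Collections.singleton("event_ts"));

    try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
      for (FileScanTask task : tasks) {
        // With the proposal, the stats maps on task.file() (lowerBounds(),
        // upperBounds(), etc.) would carry only the entry for "event_ts",
        // keeping each task small enough to hold tens of thousands of
        // splits on a Flink JobManager or Trino coordinator.
        task.file().lowerBounds();
      }
    }
  }
}
```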
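Peter's implementation idea, filtering the stats when the data-file metadata is copied in the BaseFile constructor, could look roughly like the helper below. This is a sketch only: the class and method names are made up, and it assumes each stats field is a plain field-id-keyed map (as exposed by DataFile); the actual BaseFile internals may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class ColumnStatsFilter {
  // Copies a per-column stats map (field id -> value), keeping only the
  // entries for the requested field ids; null in, null out so that absent
  // stats stay absent. Applied to columnSizes, valueCounts, nullValueCounts,
  // nanValueCounts, lowerBounds and upperBounds, this trades one pass over
  // each map at copy time for retaining just a handful of entries in memory.
  static <V> Map<Integer, V> retainOnly(Map<Integer, V> stats, Set<Integer> requestedFieldIds) {
    if (stats == null) {
      return null;
    }

    Map<Integer, V> filtered = new HashMap<>();
    for (Map.Entry<Integer, V> entry : stats.entrySet()) {
      if (requestedFieldIds.contains(entry.getKey())) {
        filtered.put(entry.getKey(), entry.getValue());
      }
    }

    return filtered;
  }
}
```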
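For comparison, the existing write-side knob that Ryan points to, `write.metadata.metrics.*`, is configured through table properties. A minimal sketch using the Java API; the column names `event_ts` and `account_id` are made up for illustration:

```java
import org.apache.iceberg.Table;

public class MetricsConfigExample {
  // Sets per-column metrics modes via table properties.
  public static void tuneMetrics(Table table) {
    table.updateProperties()
        // collect no column metrics by default ...
        .set("write.metadata.metrics.default", "none")
        // ... but keep full min/max bounds for the filter/watermark column
        .set("write.metadata.metrics.column.event_ts", "full")
        // and truncated bounds for a join key
        .set("write.metadata.metrics.column.account_id", "truncate(16)")
        .commit();
  }
}
```

As Steven notes in the thread, these modes only apply to newly written data files, so enabling stats for an additional column later still requires rewriting existing files to backfill them.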