For that use case, it sounds like you'd be much better off not storing all the stats rather than skipping them at read time. I understand the user wants to keep them, but it may still not be a great choice. I'm just worried that this is going to be a lot of effort for you that doesn't really generalize.
That said, if you're convinced that this is the right path, I think it would be _nice_ to have it in. We can always reduce memory consumption!

On Mon, Oct 9, 2023 at 5:21 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

> The owner of the table wanted to keep the column stats for all of the columns, claiming that other users might be (or already are) using the statistics of those columns. Even if I am not sure that their case was defensible, I think the reader of a table is often not in a position to optimize the table for their own usage. Even in this case we should provide tools for them to achieve as much as possible.
>
> Since the statistics values are stored in the BaseFile in a map, every column (~400 in our case) added an entry (100 bytes) to each of the stat fields (columnSizes/valueCounts/nullValueCounts/nanValueCounts/lowerBounds/upperBounds), so 6x100 = 600 bytes per column. As a result, each GenericDataFile ended up costing us 120k (some columns did not have all of the stats present). Almost all of this was statistics related, and most of it is totally unnecessary for our reader. Additionally, we had 67k splits, so the result was 8G of unnecessary memory consumption.
>
> I think if we implement the filtering in the BaseFile constructor, where we do the column stats removal, then we can create a reasonably fast copy of the map which contains only the required column stats. In this case we would trade CPU for lower memory consumption.
>
> WDYT?
>
> Thanks,
> Peter
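[A minimal sketch of the map-copying approach Péter describes, assuming plain java.util maps keyed by column id; the standalone helper below is illustrative, not the actual BaseFile internals:]

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    public class StatsFilter {

      // Copy only the entries for the requested column ids, so the retained
      // map is sized to the reader's needs instead of to all ~400 columns.
      // This is the CPU-for-memory trade: one pass over each map at planning
      // time in exchange for a much smaller long-lived footprint per split.
      static <V> Map<Integer, V> retainColumns(Map<Integer, V> stats, Set<Integer> columnIds) {
        if (stats == null) {
          return null;
        }

        Map<Integer, V> filtered = new HashMap<>();
        for (Map.Entry<Integer, V> entry : stats.entrySet()) {
          if (columnIds.contains(entry.getKey())) {
            filtered.put(entry.getKey(), entry.getValue());
          }
        }

        return filtered;
      }

      public static void main(String[] args) {
        Map<Integer, Long> valueCounts = new HashMap<>();
        valueCounts.put(1, 100L);
        valueCounts.put(2, 90L);
        valueCounts.put(3, 80L);

        // Keep stats only for column id 2, e.g. a watermark column.
        System.out.println(retainColumns(valueCounts, Set.of(2))); // prints {2=90}
      }
    }

[The same copy would be applied to each of the six stats maps before the GenericDataFile is handed to the scan tasks.]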
> On Sat, Oct 7, 2023 at 17:04, Steven Wu <stevenz...@gmail.com> wrote:
>
>> I am sure dropping column stats can be helpful. It just has some challenges in practice. It requires table owners to know the query pattern and decide which column stats to keep and which to drop. While automation can help ease those decisions based on the query history, it can't predict future usage. When new column stats become useful later, we would need to backfill/rebuild them for existing data files.
>>
>> Engines (like Trino) know which columns are used in filter and join expressions, so query engines know precisely which column stats are needed.
>>
>> On Fri, Oct 6, 2023 at 8:59 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> I understand wanting to keep more in general; that's why we have the 100-column threshold set fairly high. But in the case you're describing, those column stats are causing a problem. I'd expect you to be able to drop some of them on such a large table to solve the problem, rather than filter them out (which is actually not very quick anyway). Have you tried that? Why doesn't it work?
>>>
>>> On Thu, Oct 5, 2023 at 7:57 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> It is definitely good to only track column stats that are used. Otherwise, we are just creating wasteful metadata that can increase manifest file size and slow down scan planning. If a table has 100 columns, it is very unlikely that we need stats for all of them.
>>>>
>>>> But in practice, it is a bit difficult to predict which column stats will be useful for future queries. We can analyze query patterns and autotune the decision of which column stats to enable. However, if we need to enable new column stats, backfilling them for existing data files requires expensive rewrites. Note that Iceberg's default behavior is to track column stats for up to the first 100 columns.
>>>>
>>>> It would be interesting to learn from the community about column stats tuning.
>>>>
>>>> On Thu, Oct 5, 2023 at 5:38 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I can think of situations in which you may want something like this, but I'm curious what other options you've used to solve the problem. This seems like exactly what write.metadata.metrics.* was intended to solve, and I'm a bit surprised that you need metrics for so many columns in the table. The metrics are used for scan planning (filtering) and for operations like this, right? Are you keeping metrics around for metadata agg pushdown or something?
>>>>>
>>>>> I'm not opposed, but I do want to make sure there's not a simple way to solve the problem.
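[For reference, a sketch of the write-side tuning Ryan mentions, through the Java API. write.metadata.metrics.default, write.metadata.metrics.column.<name>, and the none/truncate(n)/full levels are existing Iceberg table properties; the table and column names are made up for illustration:]

    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;

    public class MetricsProperties {

      // Stop collecting column stats by default and opt specific columns
      // back in. This only affects newly written data files; existing files
      // keep the stats they were written with, which is the backfill
      // problem discussed above.
      static void restrictColumnStats(Catalog catalog) {
        Table table = catalog.loadTable(TableIdentifier.of("db", "events"));

        table.updateProperties()
            .set("write.metadata.metrics.default", "none")
            .set("write.metadata.metrics.column.event_ts", "full")
            .set("write.metadata.metrics.column.user_id", "truncate(16)")
            .commit();
      }
    }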
>>>>> On Thu, Oct 5, 2023 at 3:23 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>
>>>>>> +1 for this feature of column stats projection.
>>>>>>
>>>>>> I will add some additional input.
>>>>>>
>>>>>> 1) In the previous discussion, there were comments on only enabling the column stats that are needed. That is definitely a recommended best practice, but there are some practical challenges. By default, Iceberg enables column stats for up to 100 columns. It also may be hard to know all the queries, and the needed column stats, up front. And it is expensive to retrofit/backfill column stats after the fact, as that requires a data file scan and rewrite.
>>>>>>
>>>>>> 2) We have also had the Trino coordinator run into OOMs due to the column stats taking up a lot of memory. If the Trino engine can select the column stats based on join and filter expressions, we can greatly reduce the memory footprint per split.
>>>>>>
>>>>>> Thanks,
>>>>>> Steven
>>>>>>
>>>>>> On Thu, Oct 5, 2023 at 3:32 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Team,
>>>>>>>
>>>>>>> TL;DR: I would like to introduce the possibility of parametrizing Iceberg table scans to include the metadata metrics only for specific columns.
>>>>>>>
>>>>>>> We discussed this previously on the mailing list [1], but we did not finalize the direction there.
>>>>>>>
>>>>>>> *To recap*
>>>>>>> Currently there are two ways to affect which column metrics are returned from the metadata by the scan tasks:
>>>>>>>
>>>>>>> - Decide which column stats are collected/stored: the write.metadata.metrics.* write configuration can be used
>>>>>>> - Decide whether the scan should include any column stats at all: the includeColumnStats scan builder configuration
>>>>>>>
>>>>>>> In our experience this granularity is not enough if the table has many columns and multiple readers. Some readers might want statistics for specific columns only, while other readers might need metrics for different columns.
>>>>>>>
>>>>>>> *Current issue*
>>>>>>> As for the concrete case, I have a PR [2] under review to emit watermarks when reading an Iceberg table with the Flink source. It would be good to provide a simple "watermarkFieldName" option for the users, which would be used for generating the watermark.
>>>>>>> The issue is that for the simple implementation we would need the column metadata metrics to extract the watermark from the *FileScanTasks*. If we include all of the column statistics, it could easily cause memory issues on the JobManager side. This is why we are reluctant to make this a simple switch on the user side, and might opt for a heavier extractor interface where the user has the responsibility of extracting the watermark themselves.
>>>>>>> OTOH, if it were possible to return column metrics only for the given column, the memory pressure would be more predictable, and we could have a more user-friendly solution.
>>>>>>>
>>>>>>> *Suggested solution*
>>>>>>> I would like to introduce a new method on the Scan class, like:
>>>>>>>
>>>>>>> ThisT includeColumnStats(Collection<String> columns);
>>>>>>>
>>>>>>> Using this configuration, the users of the scan tasks could decide which column metrics are retained during the planning process and which column metrics are thrown away from the results.
>>>>>>>
>>>>>>> Would the community consider this a valuable addition to the Scan API?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Peter
>>>>>>>
>>>>>>> - [1] https://lists.apache.org/thread/9rl8yq8ps3mfg91g1qvzvgd0tnkjvxgg - the earlier scan statistics thread
>>>>>>> - [2] https://github.com/apache/iceberg/pull/8553 - Flink: Emit watermarks from the IcebergSource

--
Ryan Blue
Tabular
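[To make the proposal concrete, planning with projected stats could look roughly like this. The includeColumnStats(Collection<String>) overload is the proposed method and does not exist in Iceberg yet; the event_ts column name is illustrative:]

    import java.util.Set;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.TableScan;
    import org.apache.iceberg.io.CloseableIterable;

    public class ProjectedStatsScan {

      // Plan a scan that keeps stats only for the watermark column, so each
      // FileScanTask carries a single entry per stats map instead of ~400.
      static void planWithProjectedStats(Table table) throws Exception {
        TableScan scan = table.newScan()
            .includeColumnStats(Set.of("event_ts")); // proposed API, not yet in Iceberg

        try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
          for (FileScanTask task : tasks) {
            // Only the event_ts entry survives; a Flink source could derive
            // its watermark from this lower bound.
            System.out.println(task.file().lowerBounds());
          }
        }
      }
    }

[Compared to the existing no-arg includeColumnStats(), the collection overload would keep the user-facing watermark configuration simple while bounding JobManager memory to the columns actually requested.]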