Based on our discussion here, I have created a PR for the feature: https://github.com/apache/iceberg/pull/8803
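
To give a feel for it, reader-side usage would look roughly like this (a sketch based on the PR's proposed method shape; "event_ts" is a hypothetical column name):

  import java.io.IOException;
  import java.util.Collections;
  import org.apache.iceberg.FileScanTask;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.TableScan;
  import org.apache.iceberg.io.CloseableIterable;

  static void planWithStatsForOneColumn(Table table) throws IOException {
    // request stats for a single column instead of the current all-or-nothing
    TableScan scan = table.newScan()
        .includeColumnStats(Collections.singleton("event_ts"));

    try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
      for (FileScanTask task : tasks) {
        // task.file() now carries stats map entries (lowerBounds(),
        // upperBounds(), valueCounts(), ...) for "event_ts" only; stats
        // for every other column are dropped during planning.
      }
    }
  }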
I think this is not a big change, and the flexibility/reduced memory consumption would be worth the additional complexity. Please review the PR to see for yourselves :)

Thanks,
Peter

Manish Malhotra <manish.malhotra.w...@gmail.com> wrote (on Tue, 10 Oct 2023, 17:02):

> Thanks Ryan,
>
> Good point, that makes sense.
>
> Though +1 for the feature.
>
> We can avoid it during ingestion as well, though we might need the stats
> some time later, so having options during reading will help.
>
> Thanks,
> Manish
>
> On Mon, Oct 9, 2023 at 4:38 PM Ryan Blue <b...@tabular.io> wrote:
>
>> For that use case, it sounds like you'd be much better off not storing
>> all the stats rather than skipping them at read time. I understand the
>> user wants to keep them, but it may still not be a great choice. I'm
>> just worried that this is going to be a lot of effort for you that
>> doesn't really generalize.
>>
>> That said, if you're convinced that this is the right path I think it
>> would be _nice_ to have it in. We can always reduce memory consumption!
>>
>> On Mon, Oct 9, 2023 at 5:21 AM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> The owner of the table wanted to keep the column stats for all of the
>>> columns, claiming that other users might be (or are) using the
>>> statistics of those columns. Even if I am not sure their case was
>>> defensible, the reader of a table is often not in a position to
>>> optimize the table for their own usage. Even in this case we should
>>> provide tools for them to achieve as much as possible.
>>>
>>> Since the statistics values are stored in the BaseFile in a map, every
>>> column (~400 in our case) added an entry (~100 bytes) to each of the
>>> stats maps (columnSizes/valueCounts/nullValueCounts/nanValueCounts/
>>> lowerBounds/upperBounds), i.e. 6 x 100 = 600 bytes per column. As a
>>> result, each GenericDataFile cost us ~120 KB (some columns did not
>>> have all of the stats present). Almost all of this was statistics
>>> related, and most of it was totally unnecessary for our reader. With
>>> 67k splits, that added up to ~8 GB of unnecessary memory consumption.
>>>
>>> I think if we implement the filtering in the BaseFile copy
>>> constructor, where the column stats removal already happens, then we
>>> can create a reasonably fast copy of each map which contains only the
>>> required column stats. This way we trade a little CPU for the memory
>>> savings.
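>>>
>>> A rough sketch of the idea (not the actual BaseFile code, just the
>>> shape of the copy; names are illustrative, uses only java.util):
>>>
>>>   // keep only the entries of a stats map that belong to requested columns
>>>   private static <T> Map<Integer, T> filtered(
>>>       Map<Integer, T> stats, Set<Integer> requestedFieldIds) {
>>>     if (stats == null) {
>>>       return null;
>>>     }
>>>     Map<Integer, T> copy = new HashMap<>();
>>>     for (Map.Entry<Integer, T> entry : stats.entrySet()) {
>>>       if (requestedFieldIds.contains(entry.getKey())) {
>>>         copy.put(entry.getKey(), entry.getValue());
>>>       }
>>>     }
>>>     return Collections.unmodifiableMap(copy);
>>>   }
>>>
>>>   // applied to each stats map in the copy constructor:
>>>   // this.valueCounts = filtered(toCopy.valueCounts, requestedFieldIds);
>>>   // this.nullValueCounts = filtered(toCopy.nullValueCounts, requestedFieldIds);
>>>   // this.lowerBounds = filtered(toCopy.lowerBounds, requestedFieldIds);
>>>   // ...and the same for the other three maps.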
>>>
>>> WDYT?
>>>
>>> Thanks,
>>> Peter
>>>
>>> Steven Wu <stevenz...@gmail.com> wrote (on Sat, 7 Oct 2023, 17:04):
>>>
>>>> I am sure dropping column stats can be helpful. It just has some
>>>> challenges in practice. It requires table owners to know the query
>>>> pattern and decide which column stats to keep and which to drop.
>>>> While automation can help ease those decisions based on the query
>>>> history, it can't predict future usage. When new column stats become
>>>> useful later, we would need to backfill/rebuild them for existing
>>>> data files.
>>>>
>>>> Engines (like Trino) know which columns are used in filter and join
>>>> expressions, so query engines know precisely which column stats are
>>>> needed.
>>>>
>>>> On Fri, Oct 6, 2023 at 8:59 AM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I understand wanting to keep more in general; that's why we have the
>>>>> 100-column threshold set fairly high. But in the case you're
>>>>> describing, those column stats are causing a problem. I'd expect you
>>>>> to be able to drop some of them on such a large table to solve the
>>>>> problem, rather than filter them out (which is actually not very
>>>>> quick anyway). Have you tried that? Why doesn't it work?
>>>>>
>>>>> On Thu, Oct 5, 2023 at 7:57 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>
>>>>>> It is definitely good to only track column stats that are used.
>>>>>> Otherwise, we are just creating wasteful metadata that can increase
>>>>>> manifest file size and slow down scan planning. If a table has 100
>>>>>> columns, it is very unlikely we need stats for all of them.
>>>>>>
>>>>>> But in practice, it is a bit difficult to predict which column
>>>>>> stats will be useful for future queries. We can analyze query
>>>>>> patterns and autotune the decision of selectively enabling column
>>>>>> stats. However, if we need to enable new column stats, backfilling
>>>>>> column stats for existing data files would require expensive
>>>>>> rewrites. Note that Iceberg's default behavior is to track column
>>>>>> stats for up to the first 100 columns.
>>>>>>
>>>>>> Interested to learn from the community regarding column stats
>>>>>> tuning.
>>>>>>
>>>>>> On Thu, Oct 5, 2023 at 5:38 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> I can think of situations in which you may want something like
>>>>>>> this, but I'm curious what other options you've used to solve the
>>>>>>> problem. This seems like exactly what `write.metadata.metrics.*`
>>>>>>> was intended to solve, and I'm a bit surprised that you need
>>>>>>> metrics for so many columns in the table. The metrics are used for
>>>>>>> scan planning (filtering) and for operations like this, right? Are
>>>>>>> you keeping metrics around for metadata agg pushdown or something?
>>>>>>>
>>>>>>> I'm not opposed, but I do want to make sure there's not a simple
>>>>>>> way to solve the problem.
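>>>>>>>
>>>>>>> For reference, those write-side knobs are table properties like
>>>>>>> the following ("event_ts" stands in for a real column name):
>>>>>>>
>>>>>>>   # collect no stats by default, full stats for one column
>>>>>>>   write.metadata.metrics.default=none
>>>>>>>   write.metadata.metrics.column.event_ts=full
>>>>>>>   # other per-column values: counts, truncate(16), full
>>>>>>>
>>>>>>>   # the first-100-columns default mentioned above is controlled by
>>>>>>>   write.metadata.metrics.max-inferred-column-defaults=100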
>>>>>>>
>>>>>>> On Thu, Oct 5, 2023 at 3:23 PM Steven Wu <stevenz...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1 for this feature of column stats projection.
>>>>>>>>
>>>>>>>> I will add some additional input.
>>>>>>>>
>>>>>>>> 1) In the previous discussion, there were comments about only
>>>>>>>> enabling the column stats that are needed. That is definitely a
>>>>>>>> recommended best practice, but there are some practical
>>>>>>>> challenges. By default, Iceberg enables column stats for up to
>>>>>>>> 100 columns. It also may be hard to know all the queries, and
>>>>>>>> which column stats are needed, up front. And it is expensive to
>>>>>>>> retrofit/backfill column stats after the fact, as it requires a
>>>>>>>> data file scan and rewrite.
>>>>>>>>
>>>>>>>> 2) We have also had the Trino coordinator run into OOMs due to
>>>>>>>> the column stats taking up a lot of memory. If the Trino engine
>>>>>>>> can select the column stats based on join and filter expressions,
>>>>>>>> we can greatly reduce the memory footprint per split.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Steven
>>>>>>>>
>>>>>>>> On Thu, Oct 5, 2023 at 3:32 AM Péter Váry
>>>>>>>> <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Team,
>>>>>>>>>
>>>>>>>>> TL;DR: I would like to introduce the possibility to parametrize
>>>>>>>>> Iceberg table scans to include the metadata metrics only for
>>>>>>>>> specific columns.
>>>>>>>>>
>>>>>>>>> We discussed this previously on the mailing list [1], but we did
>>>>>>>>> not finalize the direction there.
>>>>>>>>>
>>>>>>>>> *To recap*
>>>>>>>>> Currently there are two ways to affect which column metrics are
>>>>>>>>> returned from the metadata by the scan tasks:
>>>>>>>>>
>>>>>>>>> - Decide which column stats are collected/stored - the
>>>>>>>>> write.metadata.metrics.* write configuration can be used
>>>>>>>>> - Decide whether the scan should include the column stats at
>>>>>>>>> all - includeColumnStats - scan builder configuration
>>>>>>>>>
>>>>>>>>> In our experience this granularity is not enough if the table
>>>>>>>>> has many columns and multiple readers. Some readers might want
>>>>>>>>> statistics for one specific column only, while other readers
>>>>>>>>> might need metrics for another column.
>>>>>>>>>
>>>>>>>>> *Current issue*
>>>>>>>>> As for the concrete case, I have a PR [2] under review to emit
>>>>>>>>> watermarks when reading an Iceberg table with the Flink Source.
>>>>>>>>> It would be good to provide a simple "watermarkFieldName" option
>>>>>>>>> for the users, which would be used for generating the watermark.
>>>>>>>>> The issue is that for the simple implementation we would need
>>>>>>>>> the column metadata metrics to extract the watermark for the
>>>>>>>>> *FileScanTasks* (see the extraction sketch at the very end of
>>>>>>>>> this thread). If we include all of the column statistics, it
>>>>>>>>> could easily cause memory issues on the JobManager side. This is
>>>>>>>>> why we are reluctant to make this a simple switch on the user
>>>>>>>>> side, and might opt for a heavier extractor interface where the
>>>>>>>>> user has the responsibility to extract the watermark themselves.
>>>>>>>>> OTOH, if it were possible to return column metrics only for the
>>>>>>>>> given column, the memory pressure would be more predictable, and
>>>>>>>>> we could have a more user-friendly solution.
>>>>>>>>>
>>>>>>>>> *Suggested solution*
>>>>>>>>> I would like to introduce a new method on the Scan class, like:
>>>>>>>>>
>>>>>>>>> *ThisT includeColumnStats(Collection<String> columns);*
>>>>>>>>>
>>>>>>>>> Using this configuration, the users of the scan could decide
>>>>>>>>> which column metrics are retained during the planning process,
>>>>>>>>> and which column metrics are thrown away from the results.
>>>>>>>>>
>>>>>>>>> Would the community consider this a valuable addition to the
>>>>>>>>> Scan API?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> - [1] https://lists.apache.org/thread/9rl8yq8ps3mfg91g1qvzvgd0tnkjvxgg
>>>>>>>>> - Scan statistics thread
>>>>>>>>> - [2] https://github.com/apache/iceberg/pull/8553 - Flink: Emit
>>>>>>>>> watermarks from the IcebergSource
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>
>> --
>> Ryan Blue
>> Tabular
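
PS, for the watermark case in my original mail above: with stats kept for the
timestamp column only, the extraction on the planning side stays trivial. A
rough sketch, assuming a microsecond-precision timestamp column tracked in
the lower bounds; the field id plumbing is illustrative:

  import java.nio.ByteBuffer;
  import java.util.Map;
  import org.apache.iceberg.FileScanTask;
  import org.apache.iceberg.types.Conversions;
  import org.apache.iceberg.types.Types;

  // derive a per-split watermark from the column stats of a FileScanTask
  static long watermarkMillis(FileScanTask task, int tsFieldId) {
    Map<Integer, ByteBuffer> lowerBounds = task.file().lowerBounds();
    ByteBuffer bound = lowerBounds.get(tsFieldId);
    if (bound == null) {
      return Long.MIN_VALUE; // no stats for this column in this file
    }
    // Iceberg stores timestamp bounds in microseconds
    Long micros = Conversions.fromByteBuffer(Types.TimestampType.withoutZone(), bound);
    return micros / 1000L;
  }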