Thanks Peter!

On Wed, Oct 11, 2023 at 5:36 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> Based on our discussion here, I have created a PR for the feature:
> https://github.com/apache/iceberg/pull/8803
>
> I think this is not a big change, and the flexibility/reduced memory
> consumption would be worth the additional complexity.
>
> Please review the PR to see for yourselves :)
>
> Thanks,
> Peter
>
> Manish Malhotra <manish.malhotra.w...@gmail.com> wrote (on Tue, Oct 10,
> 2023 at 17:02):
>
>> Thanks Ryan,
>>
>> Good point, that makes sense.
>>
>> Though +1 for the feature.
>>
>> We can avoid it during ingestion as well, though we might need the stats
>> some time later, so having options during reading will help.
>>
>> Thanks,
>> Manish
>>
>> On Mon, Oct 9, 2023 at 4:38 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> For that use case, it sounds like you'd be much better off not storing
>>> all the stats rather than skipping them at read time. I understand the
>>> user wants to keep them, but it may still not be a great choice. I'm
>>> just worried that this is going to be a lot of effort for you that
>>> doesn't really generalize.
>>>
>>> That said, if you're convinced that this is the right path, I think it
>>> would be _nice_ to have it in. We can always reduce memory consumption!
>>>
>>> On Mon, Oct 9, 2023 at 5:21 AM Péter Váry <peter.vary.apa...@gmail.com>
>>> wrote:
>>>
>>>> The owner of the table wanted to keep the column stats for all of the
>>>> columns, claiming that other users might be (or already are) using the
>>>> statistics of those columns. Even if I am not sure their case was
>>>> defensible, I think the reader of a table is often not in a position
>>>> to optimize the table for their own usage. Even in this case we should
>>>> provide tools for them to achieve as much as possible.
>>>>
>>>> Since the statistics values are stored in the BaseFile in a map, every
>>>> column (~400 in our case) added an entry (~100 bytes) to each of the
>>>> stat fields
>>>> (columnSizes/valueCounts/nullValueCounts/nanValueCounts/lowerBounds/upperBounds,
>>>> i.e. 6 x 100 = 600 bytes per column). As a result we ended up with
>>>> every GenericDataFile costing us ~120 KB (some columns did not have
>>>> all of the stats present). Almost all of this was statistics related,
>>>> and most of it is totally unnecessary for our reader. Additionally, we
>>>> had 67k splits, so the result was ~8 GB of unnecessary memory
>>>> consumption.
>>>>
>>>> I think if we implement the filtering in the BaseFile constructor,
>>>> where we do the column stats removal, then we can create a reasonably
>>>> fast copy of the map which contains only the required column stats.
>>>> Also, in this case we could trade CPU for memory consumption.
>>>>
>>>> WDYT?
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> Steven Wu <stevenz...@gmail.com> wrote (on Sat, Oct 7, 2023 at 17:04):
>>>>
>>>>> I am sure dropping column stats can be helpful. It just has some
>>>>> challenges in practice. It requires table owners to know the query
>>>>> pattern and decide which column stats to keep and which to drop.
>>>>> While automation can help ease these decisions based on the query
>>>>> history, it can't predict future usage. When new column stats become
>>>>> useful later, we would need to backfill/rebuild them for existing
>>>>> data files.
>>>>>
>>>>> Engines (like Trino) know which columns are used in filter and join
>>>>> expressions, so query engines know precisely which column stats are
>>>>> needed.
>>>>>
>>>>> On Fri, Oct 6, 2023 at 8:59 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> I understand wanting to keep more in general; that's why we have the
>>>>>> 100-column threshold set fairly high. But in the case you're
>>>>>> describing, those column stats are causing a problem.
>>>>>> I'd expect you to be able to drop some of them on such a large table
>>>>>> to solve the problem, rather than filter them out (which is actually
>>>>>> not very quick anyway). Have you tried that? Why doesn't it work?
>>>>>>
>>>>>> On Thu, Oct 5, 2023 at 7:57 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>
>>>>>>> It is definitely good to only track column stats that are used.
>>>>>>> Otherwise, we are just creating wasteful metadata that can increase
>>>>>>> manifest file size and slow down scan planning. If a table has 100
>>>>>>> columns, it is very unlikely we need stats for all of them.
>>>>>>>
>>>>>>> But in practice, it is a bit difficult to predict which column
>>>>>>> stats will be useful for future queries. We can analyze query
>>>>>>> patterns and auto-tune the decision to selectively enable column
>>>>>>> stats. However, if we need to enable new column stats, backfilling
>>>>>>> them for existing data files would require expensive rewrites. Note
>>>>>>> that Iceberg's default behavior is to track column stats for up to
>>>>>>> the first 100 columns.
>>>>>>>
>>>>>>> It would be interesting to learn from the community about column
>>>>>>> stats tuning.
>>>>>>>
>>>>>>> On Thu, Oct 5, 2023 at 5:38 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> I can think of situations in which you may want something like
>>>>>>>> this, but I'm curious what other options you've used to solve the
>>>>>>>> problem. This seems like exactly what `write.metadata.metrics.*`
>>>>>>>> was intended to solve, and I'm a bit surprised that you need
>>>>>>>> metrics for so many columns in the table. The metrics are used for
>>>>>>>> scan planning (filtering) and for operations like this, right? Are
>>>>>>>> you keeping metrics around for metadata agg pushdown or something?
>>>>>>>>
>>>>>>>> I'm not opposed, but I do want to make sure there's not a simple
>>>>>>>> way to solve the problem.
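For reference, the existing mechanism Ryan points to is the `write.metadata.metrics.*` family of table properties, which controls which column stats are written in the first place. A minimal sketch of that configuration (the column name `event_ts` is illustrative, not from the thread):

```properties
# Stop collecting stats by default; keep full stats only for the columns
# known to be used in filters (column name is illustrative).
write.metadata.metrics.default=none
write.metadata.metrics.column.event_ts=full
```

Per-column values can be `none`, `counts`, `truncate(16)`, or `full`, and `write.metadata.metrics.max-inferred-column-defaults` (default 100) is the 100-column threshold mentioned in the thread. Note that these properties only affect newly written data files, which is exactly the backfill problem Steven raises: existing files keep their stats until rewritten.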
>>>>>>>>
>>>>>>>> On Thu, Oct 5, 2023 at 3:23 PM Steven Wu <stevenz...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1 for this feature of column stats projection.
>>>>>>>>>
>>>>>>>>> I will add some additional input.
>>>>>>>>>
>>>>>>>>> 1) In the previous discussion, there were comments about only
>>>>>>>>> enabling the column stats that are needed. That is definitely a
>>>>>>>>> recommended best practice, but there are some practical
>>>>>>>>> challenges. By default, Iceberg enables column stats for up to
>>>>>>>>> 100 columns. It also may be hard to know all the queries and
>>>>>>>>> which column stats are needed up front. And it is expensive to
>>>>>>>>> retrofit/backfill column stats after the fact, as it requires a
>>>>>>>>> data file scan and rewrite.
>>>>>>>>>
>>>>>>>>> 2) We have also had the Trino coordinator run into OOMs because
>>>>>>>>> the column stats take up a lot of memory. If the Trino engine
>>>>>>>>> could select the column stats based on join and filter
>>>>>>>>> expressions, we could greatly reduce the memory footprint per
>>>>>>>>> split.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Steven
>>>>>>>>>
>>>>>>>>> On Thu, Oct 5, 2023 at 3:32 AM Péter Váry
>>>>>>>>> <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Team,
>>>>>>>>>>
>>>>>>>>>> TL;DR: I would like to introduce the possibility to parametrize
>>>>>>>>>> Iceberg table scans to include the metadata metrics only for
>>>>>>>>>> specific columns.
>>>>>>>>>>
>>>>>>>>>> We discussed this previously on the mailing list [1], but we
>>>>>>>>>> did not finalize the direction there.
>>>>>>>>>>
>>>>>>>>>> *To recap*
>>>>>>>>>> Currently there are two ways to affect which column metrics are
>>>>>>>>>> returned from the metadata by the scan tasks:
>>>>>>>>>>
>>>>>>>>>> - Decide which column stats are collected/stored - the
>>>>>>>>>> write.metadata.metrics.* write configuration can be used
>>>>>>>>>> - Decide whether the scan should include the column stats at
>>>>>>>>>> all - the includeColumnStats scan builder configuration
>>>>>>>>>>
>>>>>>>>>> In our experience this granularity is not enough if the table
>>>>>>>>>> has many columns and multiple readers. Some readers might want
>>>>>>>>>> statistics for one specific column only, while other readers
>>>>>>>>>> might need metrics for a different column.
>>>>>>>>>>
>>>>>>>>>> *Current issue*
>>>>>>>>>> As for the concrete case, I have a PR [2] under review to emit
>>>>>>>>>> watermarks when reading an Iceberg table with the Flink source.
>>>>>>>>>> It would be good to provide a simple "watermarkFieldName" option
>>>>>>>>>> for users, which would be used for generating the watermark.
>>>>>>>>>> The issue is that for the simple implementation we would need
>>>>>>>>>> the column metadata metrics of the *FileScanTasks* to extract
>>>>>>>>>> the watermark. If we include all of the column statistics, it
>>>>>>>>>> could easily cause memory issues on the JobManager side. This is
>>>>>>>>>> why we are reluctant to make this a simple switch on the user
>>>>>>>>>> side, and might opt for a heavier extractor interface where the
>>>>>>>>>> user has the responsibility to extract the watermark themselves.
>>>>>>>>>> OTOH, if it were possible to return column metrics only for the
>>>>>>>>>> given column, the memory pressure would be more predictable, and
>>>>>>>>>> we could offer a more user-friendly solution.
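The per-split watermark extraction Peter describes could, in principle, look like the sketch below. This is a self-contained illustration under simplified assumptions: the field id, the helper name, and the long-valued bounds map are all stand-ins (Iceberg's DataFile.lowerBounds() really returns a Map<Integer, ByteBuffer> keyed by field id, which would need decoding).

```java
import java.util.List;
import java.util.Map;

public class WatermarkSketch {
    // Derive a split watermark as the minimum lower bound of the timestamp
    // column across the split's files. The Map<Integer, Long> stats are a
    // simplification of Iceberg's Map<Integer, ByteBuffer> lowerBounds().
    static long watermarkFor(List<Map<Integer, Long>> fileLowerBounds, int tsFieldId) {
        long watermark = Long.MAX_VALUE;
        for (Map<Integer, Long> bounds : fileLowerBounds) {
            Long lower = bounds.get(tsFieldId);
            if (lower != null) { // files without stats for the column are skipped
                watermark = Math.min(watermark, lower);
            }
        }
        return watermark;
    }

    public static void main(String[] args) {
        int tsFieldId = 3; // illustrative field id of the watermark column
        List<Map<Integer, Long>> files = List.of(
            Map.of(3, 1_000L, 7, 42L),
            Map.of(3, 500L),
            Map.of(7, 99L));
        System.out.println(watermarkFor(files, tsFieldId)); // prints 500
    }
}
```

Only the bounds of the single watermark column are consulted here, which is why returning column metrics for just that one column would keep the JobManager's memory footprint bounded.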
>>>>>>>>>>
>>>>>>>>>> *Suggested solution*
>>>>>>>>>> I would like to introduce a new method on the Scan class, like:
>>>>>>>>>>
>>>>>>>>>> *ThisT includeColumnStats(Collection<String> columns);*
>>>>>>>>>>
>>>>>>>>>> Using this configuration, the users of the scan could decide
>>>>>>>>>> which column metrics are retained during the planning process,
>>>>>>>>>> and which column metrics are dropped from the results.
>>>>>>>>>>
>>>>>>>>>> Would the community consider this a valuable addition to the
>>>>>>>>>> Scan API?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>> - [1]
>>>>>>>>>> https://lists.apache.org/thread/9rl8yq8ps3mfg91g1qvzvgd0tnkjvxgg
>>>>>>>>>> - Scan statistics thread
>>>>>>>>>> - [2] https://github.com/apache/iceberg/pull/8553 - Flink: Emit
>>>>>>>>>> watermarks from the IcebergSource
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
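The stats filtering Peter proposes for the BaseFile constructor amounts to projecting each of the six stats maps (columnSizes, valueCounts, nullValueCounts, nanValueCounts, lowerBounds, upperBounds) onto the requested field ids. A minimal self-contained sketch of that copy, using assumed simplified types rather than Iceberg's actual internals:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class StatsProjection {
    // Copy a stats map, keeping only the requested field ids. With ~400
    // columns per file and 67k splits (~120 KB of stats per GenericDataFile,
    // roughly 8 GB total, per the numbers in the thread), projecting down to
    // a few columns shrinks the footprint roughly proportionally.
    static <V> Map<Integer, V> project(Map<Integer, V> stats, Set<Integer> requestedIds) {
        if (stats == null) {
            return null; // a missing stats map stays missing
        }
        Map<Integer, V> projected = new HashMap<>();
        for (Map.Entry<Integer, V> e : stats.entrySet()) {
            if (requestedIds.contains(e.getKey())) {
                projected.put(e.getKey(), e.getValue());
            }
        }
        return Collections.unmodifiableMap(projected);
    }

    public static void main(String[] args) {
        Map<Integer, Long> valueCounts = Map.of(1, 10L, 2, 20L, 3, 30L);
        System.out.println(project(valueCounts, Set.of(2))); // prints {2=20}
    }
}
```

The single pass over each map is the CPU-for-memory trade-off mentioned in the thread: after planning, the full maps become unreachable and can be garbage collected, leaving only the projected entries on the heap.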