Re: Scan statistics

2023-05-22 Thread Russell Spitzer
Yeah, it does seem like we may have more use cases for this. The more Peter and I discuss this, the more I think it makes sense to add it in. On Mon, May 22, 2023 at 8:24 AM Péter Váry wrote: > The feature could be useful for Spark as well. See: > https://github.com/apache/iceberg/pull/7636#pullrequestr

Re: Scan statistics

2023-05-22 Thread Péter Váry
The feature could be useful for Spark as well. See: https://github.com/apache/iceberg/pull/7636#pullrequestreview-1434981224 Maybe we should add this as a topic for the next Iceberg Community Sync. Also when trying out possible solutions, I have found that some of the statistics are modifiable. I

Re: Scan statistics

2023-05-19 Thread Steven Wu
The proposal here is essentially column stats projection pushdown. For some Flink jobs with watermark alignment, Flink source is only interested in the column stats (min-max) for one timestamp column. Hence the column stats projection can really help reduce memory footprint for wide tables (with hu
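
To illustrate the idea of column stats projection described above, here is a minimal sketch in Python. It is not Iceberg's API: `FileStats`, `project_stats`, the integer column ids, and the sample values are all hypothetical names and data chosen for illustration. It only shows how keeping the min-max bounds for a single timestamp column shrinks the per-file metadata of a wide table.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass(frozen=True)
class FileStats:
    """Hypothetical stand-in for per-file metadata (not Iceberg's DataFile)."""
    path: str
    # Column id -> lower/upper bound; plain ints keep the sketch simple.
    lower_bounds: Dict[int, int] = field(default_factory=dict)
    upper_bounds: Dict[int, int] = field(default_factory=dict)


def project_stats(stats: FileStats, column_ids: Iterable[int]) -> FileStats:
    """Return a copy that keeps bounds only for the requested column ids."""
    wanted = set(column_ids)
    return FileStats(
        path=stats.path,
        lower_bounds={c: v for c, v in stats.lower_bounds.items() if c in wanted},
        upper_bounds={c: v for c, v in stats.upper_bounds.items() if c in wanted},
    )


# Assumed example: a wide table with 300 columns, where only column 7
# (standing in for the timestamp column) is needed by the reader.
wide = FileStats(
    path="f1.parquet",
    lower_bounds={c: c * 10 for c in range(300)},
    upper_bounds={c: c * 10 + 9 for c in range(300)},
)
slim = project_stats(wide, [7])
```

In this sketch `slim` carries two bound entries instead of six hundred, which is the kind of memory saving the message refers to for wide tables.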

Re: Scan statistics

2023-05-16 Thread Péter Váry
Thanks Ryan, Russell, Let me explain the situation a bit further. We have time series data written to an Iceberg table, then there is a Flink job which uses this Iceberg table as a source to read the incoming data continuously. *Downstream job -> Iceberg table -> Flink job * The Flink job

Re: Scan statistics

2023-05-15 Thread Péter Váry
Thanks Ryan, Russell for the quick response! In our Flink job we have TumblingEventTimeWindow to filter out old data. There was a temporary issue with accessing the Catalog, and our Flink job was not able to read the data from the Iceberg table for a while. When the Flink job was able to access th

Re: Scan statistics

2023-05-15 Thread Ryan Blue
Yes, I agree with Russell. You'd want to push the filter into planning rather than returning stats. That's why we strip out stats when the file metadata is copied. It also would be expensive to copy some, but not all of the file stats. It's better not to store the stats you don't need. What about

Re: Scan statistics

2023-05-15 Thread Russell Spitzer
I think currently the recommendation would be to filter the iterator rather than pulling the whole object with stats into memory. Is there a requirement that all of the DataFiles be pulled into memory before filtering? On Mon, May 15, 2023 at 9:49 AM Péter Váry wrote: > Hi Team, > > We have a F
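
The "filter the iterator" recommendation above can be sketched as a lazy generator. This is an illustration only, not Iceberg code: `FileEntry`, `files_overlapping`, `scan`, and the timestamp values are hypothetical. The point is that each entry is examined as it is produced, so entries that do not match are never accumulated in memory.

```python
from typing import Iterable, Iterator, NamedTuple


class FileEntry(NamedTuple):
    """Hypothetical file entry carrying min/max of one timestamp column."""
    path: str
    min_ts: int
    max_ts: int


def files_overlapping(
    entries: Iterable[FileEntry], start: int, end: int
) -> Iterator[FileEntry]:
    """Lazily yield entries whose [min_ts, max_ts] range overlaps [start, end]."""
    for entry in entries:
        if entry.max_ts >= start and entry.min_ts <= end:
            yield entry


def scan() -> Iterator[FileEntry]:
    """Assumed stand-in for a planner that produces entries one at a time."""
    yield FileEntry("a", 0, 99)
    yield FileEntry("b", 100, 199)
    yield FileEntry("c", 200, 299)


# Only the matching entries are ever collected into a list.
matching_paths = [e.path for e in files_overlapping(scan(), 150, 250)]
```

Because both `scan` and `files_overlapping` are generators, at most one non-matching entry is alive at any time, which avoids materializing the full set of entries before filtering.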