Hi, Peter: Thanks for the effort on this.
*1. **File format filters* Do the filters include both filter expressions from both user query and delete filter? For filters from user query, I agree with you that we should keep the current behavior. For delete filters associated with data files, at first I thought file format readers should not care about this. But now I realized that maybe we need to also push it to file reader, this is useful when `IS_DELETED` metadata column is not necessary and we could use these filters (position deletes, etc) to further prune data. But anyway, I agree that we could postpone it in follow up pr. 2. *Batch size configuration* I'm leaning toward option 2. 3. *Spark configuration* I'm leaning towards using different configuration objects. On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <peter.vary.apa...@gmail.com> wrote: > Hi Team, > Thanks everyone for the reviews on > https://github.com/apache/iceberg/pull/12298! > I have addressed most of comments, but a few questions still remain which > might merit a bit wider audience: > > 1. We should decide on the expected filtering behavior when the > filters are pushed down to the readers. Currently the filters are applied > as best effort for the file format readers. Some readers (Avro) just skip > them altogether. There was a suggestion on the PR that we might enforce > more strict requirements and the readers either reject part of the filters, > or they could apply them fully. > 2. Batch sizes are currently parameters for the reader builders which > could be set for non-vectorized readers too which could be confusing. > 3. Currently the Spark batch reader uses different configuration > objects for ParquetBatchReadConf and OrcBatchReadConf as requested by the > reviewers of the Comet PR. There was a suggestion on the current PR to use > a common configuration instead. > > I would be interested in hearing your thoughts about these topics. > > My current take: > > 1. *File format filters*: I am leaning towards keeping the current > laninet behavior. Especially since Bloom filters are not able to do a full > filtering, and are often used as a way to filter out unwanted records. > Another option would be to implement a secondary filtering inside the file > formats themselves which I think would cause extra complexity, and possible > code duplication. Whatever the decision here, I would suggest moving this > out to a next PR as the current changeset is big enough as it is. > 2. *Batch size configuration*: Currently this is the only property > which is different in the batch readers and the non-vectorized readers. I > see 3 possible solutions: > 1. Create different builders for vectorized and non-vectorized > reads - I don't think the current solution is confusing enough to worth > the > extra class > 2. We could put this to the reader configuration property set - > This could work, but "hide" the possible configuration mode which is > valid > for both Parquet and ORC readers > 3. We could keep things as it is now - I would chose this one, but > I don't have a strong opinion here > 3. *Spark configuration*: TBH, I'm open to bot solution and happy to > move to the direction the community decides on > > Thanks, > Peter > > Jean-Baptiste Onofré <j...@nanthrax.net> ezt írta (időpont: 2025. márc. > 14., P, 16:31): > >> Hi Peter >> >> Thanks for the update. I will do a new pass on the PR. >> >> Regards >> JB >> >> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <peter.vary.apa...@gmail.com> >> wrote: >> > >> > Hi Team, >> > I have rebased the File Format API proposal ( >> https://github.com/apache/iceberg/pull/12298) to include the new changes >> needed for the Variant types. I would love to hear your feedback, >> especially Dan and Ryan, as you were the most active during our >> discussions. If I can help in any way to make the review easier, please let >> me know. >> > Thanks, >> > Peter >> > >> > Péter Váry <peter.vary.apa...@gmail.com> ezt írta (időpont: 2025. >> febr. 28., P, 17:50): >> >> >> >> Hi everyone, >> >> Thanks for all of the actionable, relevant feedback on the PR ( >> https://github.com/apache/iceberg/pull/12298). >> >> Updated the code to address most of them. Please check if you agree >> with the general approach. >> >> If there is a consensus about the general approach, I could. separate >> out the PR to smaller pieces so we can have an easier time to review and >> merge those step-by-step. >> >> Thanks, >> >> Peter >> >> >> >> Jean-Baptiste Onofré <j...@nanthrax.net> ezt írta (időpont: 2025. febr. >> 20., Cs, 14:14): >> >>> >> >>> Hi Peter >> >>> >> >>> sorry for the late reply on this. >> >>> >> >>> I did a pass on the proposal, it's very interesting and well written. >> >>> I like the DataFile API and definitely worth to discuss all together. >> >>> >> >>> Maybe we can schedule a specific meeting to discuss about DataFile >> API ? >> >>> >> >>> Thoughts ? >> >>> >> >>> Regards >> >>> JB >> >>> >> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry < >> peter.vary.apa...@gmail.com> wrote: >> >>> > >> >>> > Hi Team, >> >>> > >> >>> > As mentioned earlier on our Community Sync I am exploring the >> possibility to define a FileFormat API for accessing different file >> formats. I have put together a proposal based on my findings. >> >>> > >> >>> > ------------------- >> >>> > Iceberg currently supports 3 different file formats: Avro, Parquet, >> ORC. With the introduction of Iceberg V3 specification many new features >> are added to Iceberg. Some of these features like new column types, default >> values require changes at the file format level. The changes are added by >> individual developers with different focus on the different file formats. >> As a result not all of the features are available for every supported file >> format. >> >>> > Also there are emerging file formats like Vortex [1] or Lance [2] >> which either by specialization, or by applying newer research results could >> provide better alternatives for certain use-cases like random access for >> data, or storing ML models. >> >>> > ------------------- >> >>> > >> >>> > Please check the detailed proposal [3] and the google document [4], >> and comment there or reply on the dev list if you have any suggestions. >> >>> > >> >>> > Thanks, >> >>> > Peter >> >>> > >> >>> > [1] - https://github.com/spiraldb/vortex >> >>> > [2] - https://lancedb.github.io/lance/ >> >>> > [3] - https://github.com/apache/iceberg/issues/12225 >> >>> > [4] - >> https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds >> >>> > >> >