Hi Team,

Thanks everyone for the reviews on https://github.com/apache/iceberg/pull/12298! I have addressed most of the comments, but a few questions still remain which might merit a wider audience:
1. We should decide on the expected filtering behavior when the filters are pushed down to the readers. Currently the filters are applied on a best-effort basis by the file format readers; some readers (Avro) just skip them altogether. There was a suggestion on the PR that we might enforce stricter requirements: the readers would either reject part of the filters, or apply the accepted ones fully.
2. Batch sizes are currently parameters of the reader builders, so they can be set for non-vectorized readers too, which could be confusing.
3. Currently the Spark batch reader uses different configuration objects, ParquetBatchReadConf and OrcBatchReadConf, as requested by the reviewers of the Comet PR. There was a suggestion on the current PR to use a common configuration instead.

I would be interested in hearing your thoughts on these topics. My current take:

1. *File format filters*: I am leaning towards keeping the current lenient behavior, especially since Bloom filters are not able to do full filtering and are often used just as a way to filter out unwanted records. Another option would be to implement secondary filtering inside the file formats themselves, which I think would cause extra complexity and possible code duplication. Whatever the decision here, I would suggest moving it out to a follow-up PR, as the current changeset is big enough as it is.
2. *Batch size configuration*: Currently this is the only property which differs between the batch readers and the non-vectorized readers. I see 3 possible solutions:
   1. Create different builders for vectorized and non-vectorized reads - I don't think the current solution is confusing enough to be worth the extra class.
   2. We could put this into the reader configuration property set - This could work, but it would "hide" a configuration option which is valid for both the Parquet and ORC readers.
   3. We could keep things as they are now - I would choose this one, but I don't have a strong opinion here.
3. *Spark configuration*: TBH, I'm open to both solutions and happy to move in the direction the community decides on.

(I have added a few rough sketches illustrating these options at the very end of this mail, below the quoted thread.)

Thanks,
Peter

Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Fri, Mar 14, 2025, 16:31):
> Hi Peter
>
> Thanks for the update. I will do a new pass on the PR.
>
> Regards
> JB
>
> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >
> > Hi Team,
> > I have rebased the File Format API proposal (https://github.com/apache/iceberg/pull/12298) to include the new changes needed for the Variant types. I would love to hear your feedback, especially Dan and Ryan, as you were the most active during our discussions. If I can help in any way to make the review easier, please let me know.
> > Thanks,
> > Peter
> >
> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Feb 28, 2025, 17:50):
> >>
> >> Hi everyone,
> >> Thanks for all of the actionable, relevant feedback on the PR (https://github.com/apache/iceberg/pull/12298).
> >> Updated the code to address most of them. Please check if you agree with the general approach.
> >> If there is a consensus about the general approach, I could separate out the PR to smaller pieces so we can have an easier time to review and merge those step-by-step.
> >> Thanks,
> >> Peter
> >>
> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Feb 20, 2025, 14:14):
> >>>
> >>> Hi Peter
> >>>
> >>> sorry for the late reply on this.
> >>>
> >>> I did a pass on the proposal, it's very interesting and well written.
> >>> I like the DataFile API and definitely worth to discuss all together.
> >>>
> >>> Maybe we can schedule a specific meeting to discuss about DataFile API ?
> >>>
> >>> Thoughts ?
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >>> >
> >>> > Hi Team,
> >>> >
> >>> > As mentioned earlier on our Community Sync I am exploring the possibility to define a FileFormat API for accessing different file formats. I have put together a proposal based on my findings.
> >>> >
> >>> > -------------------
> >>> > Iceberg currently supports 3 different file formats: Avro, Parquet, ORC. With the introduction of Iceberg V3 specification many new features are added to Iceberg. Some of these features like new column types, default values require changes at the file format level. The changes are added by individual developers with different focus on the different file formats. As a result not all of the features are available for every supported file format.
> >>> > Also there are emerging file formats like Vortex [1] or Lance [2] which either by specialization, or by applying newer research results could provide better alternatives for certain use-cases like random access for data, or storing ML models.
> >>> > -------------------
> >>> >
> >>> > Please check the detailed proposal [3] and the google document [4], and comment there or reply on the dev list if you have any suggestions.
> >>> >
> >>> > Thanks,
> >>> > Peter
> >>> >
> >>> > [1] - https://github.com/spiraldb/vortex
> >>> > [2] - https://lancedb.github.io/lance/
> >>> > [3] - https://github.com/apache/iceberg/issues/12225
> >>> > [4] - https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds
> >>> >
> >
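
PS: A few rough sketches to make the options above more concrete. Unless noted otherwise, the class and method names below are illustrative only, not the actual API from the PR.

For point 1, keeping the lenient behavior means the engine (or a shared helper) re-applies the full expression on top of whatever the file format reader let through. Something along these lines, using the existing row-level Evaluator from org.apache.iceberg.expressions:

import org.apache.iceberg.Schema;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.expressions.Evaluator;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.io.CloseableIterable;

class ResidualFilterSketch {
  // Wraps the output of a best-effort reader: the pushed-down filter may have
  // been applied only partially (or skipped, as the Avro reader does today),
  // so every record is re-checked against the full expression here.
  static CloseableIterable<StructLike> applyResidual(
      Schema schema, Expression filter, CloseableIterable<StructLike> bestEffortRows) {
    Evaluator residual = new Evaluator(schema.asStruct(), filter);
    return CloseableIterable.filter(bestEffortRows, residual::eval);
  }
}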
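
For point 2, this is roughly the shape of a shared read builder if batchSize stays on it (option 3 above); the interface name and method set are made up for illustration:

import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.io.CloseableIterable;

interface ReadBuilderSketch<T> {
  ReadBuilderSketch<T> split(long start, long length);  // read a slice of the file
  ReadBuilderSketch<T> filter(Expression filter);       // best-effort pushdown (see point 1)
  ReadBuilderSketch<T> batchSize(int batchSize);        // only honored by vectorized readers
  CloseableIterable<T> build();
}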
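
For point 3, the two directions look roughly like this; the field names are placeholders, and the real ParquetBatchReadConf / OrcBatchReadConf in the PR may carry different settings:

// (a) Format-specific configs, as in the current PR: each format only sees
// the settings that apply to it.
record ParquetBatchReadConfSketch(int batchSize, String parquetReaderType) {}
record OrcBatchReadConfSketch(int batchSize) {}

// (b) One common config, as suggested on the PR: simpler plumbing in Spark,
// but format-specific settings (like the Parquet reader type) also travel to
// readers that ignore them.
record CommonBatchReadConfSketch(int batchSize, String parquetReaderType) {}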