+1. This will be a great progression for the Iceberg format, allowing efficient metadata pruning. Please count me in.

On Tue, Jun 17, 2025 at 3:45 AM Jacky Lee <qcsd2...@gmail.com> wrote:

> Count me in. This solution effectively addresses the small files issue
> caused by high-frequency writes in our scenario, and it also greatly
> benefits the generation of partition- and table-level statistics.
>
> <mlhsmode...@gmail.com> wrote on Sat, Jun 14, 2025 at 07:04:
> >
> > I'm interested in working on this change as well. I think it pairs
> > nicely with the proposal for per column structs for statistics.
> >
> > Thanks,
> > Harman
> >
> > On Thu, Jun 12, 2025 at 9:43 PM Russell Spitzer
> > <russell.spit...@gmail.com> wrote:
> >>
> >> It’s not required at compile time, only at test runtime.
> >>
> >> On Thu, Jun 12, 2025 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com>
> >> wrote:
> >>>
> >>> > All we have to do is add the parquet module as a test dependency,
> >>> > working on a poc now.
> >>>
> >>> This will be a circular dependency on the core module. That's why I
> >>> suggested abstracting out the test cases and executing them in a
> >>> parquet module. Partition stats writing (as parquet) from the core
> >>> module uses `InternalData` and does the same now. So, I guess it
> >>> will be similar work (but on a larger scale due to test case
> >>> refactoring).
> >>>
> >>> Let me know the results of your POC; happy to collaborate on this
> >>> work.
> >>>
> >>> - Ajantha
> >>>
> >>> On Fri, Jun 13, 2025 at 3:16 AM Russell Spitzer
> >>> <russell.spit...@gmail.com> wrote:
> >>>>
> >>>> All we have to do is add the parquet module as a test dependency,
> >>>> working on a poc now. I don't think we really need to block on any
> >>>> other projects, although I'll probably hold off on any work on
> >>>> manifest-list since I hope it won't be needed.
> >>>>
> >>>> On Thu, May 29, 2025 at 8:37 PM Ajantha Bhat
> >>>> <ajanthab...@gmail.com> wrote:
> >>>>>
> >>>>> I am interested in working on this proposal.
> >>>>> I would assume it is to use `InternalData` with the format as
> >>>>> `parquet`. But the challenge will be the test cases: the core
> >>>>> module cannot write the parquet metadata due to the circular
> >>>>> dependency. We need to abstract out the test cases in the core
> >>>>> module and run them from the parquet module, I guess.
> >>>>>
> >>>>> I can work on a design doc as well. So, add me as a collaborator
> >>>>> for the document.
> >>>>> But should this work be done after we complete the work on
> >>>>> "single file commit in v4", because the metadata structure can
> >>>>> change?
> >>>>>
> >>>>> - Ajantha
> >>>>>
> >>>>> On Thu, May 29, 2025 at 11:37 PM Russell Spitzer
> >>>>> <russell.spit...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Y'all
> >>>>>>
> >>>>>> As discussed in the last community sync, we are beginning to
> >>>>>> gather up folks who are interested in various efforts for
> >>>>>> Iceberg V4. To that end, I'd like to use this thread as a
> >>>>>> gathering point for folks interested in the metadata file format
> >>>>>> shift to Parquet. I wrote a quick abstract to describe the
> >>>>>> purpose of this group.
> >>>>>>
> >>>>>> Following this I'll be working on a full design document or if
> >>>>>> someone has one in prod please let us know and we can start
> >>>>>> discussing/working on it there.
> >>>>>>
> >>>>>> Abstract: Parquet as Metadata File Format
> >>>>>>
> >>>>>> Currently the Iceberg SDK and Spec use Avro files for all
> >>>>>> Manifest Lists and Manifests. The row-oriented format was
> >>>>>> selected because it was assumed that most metadata would be read
> >>>>>> in its entirety. This has turned out to seldom be the case, and
> >>>>>> the ability to read single elements of the metrics would be very
> >>>>>> useful for query planning. To address this we propose switching
> >>>>>> the underlying manifest format from Avro to Parquet. In V4, Avro
> >>>>>> files would still be readable but all new metadata files would
> >>>>>> be written in Parquet instead of Avro.