Re: [DISCUSS] V4 - Parquet as Metadata File Format

Sreeram Garlapati Wed, 06 Aug 2025 00:50:47 -0700

+1
This will be a great progression for the Iceberg format, allowing efficient
metadata pruning. Please count me in.


On Tue, Jun 17, 2025 at 3:45 AM Jacky Lee <[email protected]> wrote:

> Count me in. This solution effectively addresses the small files issue
> caused by high-frequency writes in our scenario, and it also greatly
> benefits the generation of partition- and table-level statistics.
>
> On Sat, Jun 14, 2025 at 7:04 AM <[email protected]> wrote:
> >
> > I'm interested in working on this change as well. I think it pairs
> > nicely with the proposal for per-column structs for statistics.
> >
> > Thanks,
> > Harman
> >
> > On Thu, Jun 12, 2025 at 9:43 PM Russell Spitzer <[email protected]> wrote:
> >>
> >> It’s not required at compile time, only at test runtime.
> >>
> >> On Thu, Jun 12, 2025 at 8:37 PM Ajantha Bhat <[email protected]> wrote:
> >>>
> >>> > All we have to do is add the parquet module as a test dependency,
> >>> > working on a poc now.
> >>>
> >>> This would create a circular dependency with the core module. That's
> >>> why I suggested abstracting out the test cases and executing them in
> >>> the parquet module. Partition stats writing (as Parquet) from the core
> >>> module uses `InternalData` and does the same now. So I guess it will be
> >>> similar work (but on a larger scale due to test case refactoring).
> >>>
> >>> Let me know the results of your POC. I'm happy to collaborate on this
> >>> work.
> >>>
> >>>
> >>> - Ajantha
> >>>
> >>> On Fri, Jun 13, 2025 at 3:16 AM Russell Spitzer <[email protected]> wrote:
> >>>>
> >>>> All we have to do is add the parquet module as a test dependency,
> >>>> working on a poc now. I don't think we really need to block on any
> >>>> other projects although I'll probably hold off on any work on
> >>>> manifest-list since I hope it won't be needed.
> >>>>
> >>>> On Thu, May 29, 2025 at 8:37 PM Ajantha Bhat <[email protected]> wrote:
> >>>>>
> >>>>> I am interested in working on this proposal.
> >>>>> I would assume the plan is to use `InternalData` with the format set
> >>>>> to `parquet`. The challenge will be the test cases: the core module
> >>>>> cannot write the Parquet metadata due to the circular dependency. I
> >>>>> guess we need to abstract out the test cases in the core module and
> >>>>> run them from the parquet module.
> >>>>>
> >>>>> I can work on a design doc as well, so please add me as a
> >>>>> collaborator on the document.
> >>>>> But should this work wait until we complete the work on "single file
> >>>>> commit in v4", since the metadata structure can change?
> >>>>>
> >>>>> - Ajantha
> >>>>>
> >>>>> On Thu, May 29, 2025 at 11:37 PM Russell Spitzer <[email protected]> wrote:
> >>>>>>
> >>>>>> Hi Y'all
> >>>>>>
> >>>>>> As discussed in the last community sync, we are beginning to gather
> >>>>>> up folks who are interested in various efforts for Iceberg V4. To
> >>>>>> that end, I'd like to use this thread as a gathering point for folks
> >>>>>> interested in the metadata file format shift to Parquet. I wrote a
> >>>>>> quick abstract to describe the purpose of this group.
> >>>>>>
> >>>>>> Following this I'll be working on a full design document or if
> >>>>>> someone has one in prod please let us know and we can start
> >>>>>> discussing/working on it there.
> >>>>>>
> >>>>>> Abstract: Parquet as Metadata File Format
> >>>>>>
> >>>>>> Currently the Iceberg SDK and Spec use Avro files for all Manifest
> >>>>>> Lists and Manifests. The row-oriented format was selected because
> >>>>>> it was assumed that most metadata would be read in its entirety.
> >>>>>> This has turned out to seldom be the case, and the ability to read
> >>>>>> single elements of the metrics would be very useful for query
> >>>>>> planning. To address this, we propose switching the underlying
> >>>>>> manifest format from Avro to Parquet. In V4, Avro files would still
> >>>>>> be readable, but all new metadata files would be written in Parquet
> >>>>>> instead of Avro.
>
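
To make the abstract's point concrete, here is a toy sketch of why a
column-oriented layout helps metadata pruning. It does not use the Iceberg,
Avro, or Parquet APIs: JSON from the Python standard library stands in for
both encodings, and the field names (file_path, record_count, lower_bound,
upper_bound) are hypothetical stand-ins for manifest entry fields. It only
compares how many serialized bytes must be decoded to test a value against
each entry's bounds.

```python
import json

# Hypothetical manifest entries; the field names are illustrative only.
entries = [
    {
        "file_path": f"s3://bucket/data/part-{i:05d}.parquet",
        "record_count": 1000 + i,
        "lower_bound": i * 10,
        "upper_bound": i * 10 + 9,
    }
    for i in range(100)
]


def to_row_layout(rows):
    """Row-oriented: each record is serialized whole, one after another."""
    return [json.dumps(r).encode() for r in rows]


def to_column_layout(rows):
    """Column-oriented: all values of a field are serialized together."""
    return {name: json.dumps([r[name] for r in rows]).encode() for name in rows[0]}


def prune_row_layout(blobs, value):
    """Decode every full record to check its bounds."""
    bytes_read = sum(len(b) for b in blobs)
    hits = [
        i
        for i, r in enumerate(map(json.loads, blobs))
        if r["lower_bound"] <= value <= r["upper_bound"]
    ]
    return hits, bytes_read


def prune_column_layout(chunks, value):
    """Decode only the two bound columns; other columns are never touched."""
    lows = json.loads(chunks["lower_bound"])
    highs = json.loads(chunks["upper_bound"])
    bytes_read = len(chunks["lower_bound"]) + len(chunks["upper_bound"])
    hits = [i for i, (lo, hi) in enumerate(zip(lows, highs)) if lo <= value <= hi]
    return hits, bytes_read


row_hits, row_bytes = prune_row_layout(to_row_layout(entries), 425)
col_hits, col_bytes = prune_column_layout(to_column_layout(entries), 425)
print(row_hits == col_hits, row_bytes, col_bytes)
```

Both layouts find the same matching entry, but the column layout decodes
only the two bound columns, while the row layout decodes every field of
every record. Real Parquet adds row groups, encodings, and footer
statistics on top of this, so treat the byte counts as illustrative only.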
