Re: [DISCUSS] Relocate Parquet to Iceberg Core

Daniel Weeks Fri, 06 Dec 2024 20:56:22 -0800

Hey Ajantha,

I understand it was discussed before, but I think the many recent
discussions around improvements to parquet metadata, stats, etc. are good
justification for revisiting the earlier discussion.

Parquet metadata has been brought up in relation to improving stats
handling (allowing tracking of more column stats without impacting planning
performance), improving stats representations, and other possible
benefits such as improved compression and scan performance.

The original decision was more narrowly focused on the stats case and there
were viable (though possibly not ideal) workarounds to keep the existing
separation of subprojects, but at this point I see this more as a barrier
to exploring some of these ideas as it's quite difficult to allow core to
work directly with parquet.

This is also a good time to consider adding a native parquet read/write
path for use in core as the generic path in 'iceberg-data' isn't ideal
(this might also be useful for projects like Kafka Connect).

I feel like ORC is a separate discussion and while we may want to include
it, I wouldn't say there's a definitive answer unless we know there is
adequate investment in it.

I wasn't aware you had a PR as part of the prior discussion, but I'm happy
to revisit that if we decide this is a reasonable path forward.

-Dan

On Fri, Dec 6, 2024 at 5:31 PM Ajantha Bhat <[email protected]> wrote:

> Hi Dan,
>
> I proposed the same last year while working on partition stats.
> I can revive this PR if required,
> https://github.com/apache/iceberg/pull/8500
>
> But we decided that `iceberg-data` can write these parquet stats files
> (metadata) and core can just register it.
> So, it is no longer needed for partition stats.
>
> a) Do we have any strong use case or feature that requires it now?
> b) Should we do the same for ORC as well? It looks odd to have a
> module just for that.
>
> - Ajantha
>
> On Sat, Dec 7, 2024 at 5:22 AM Daniel Weeks <[email protected]> wrote:
>
>> Everyone,
>>
>> I wanted to propose moving the parquet implementation from the
>> 'iceberg-parquet' project to the 'iceberg-core' project.
>>
>> The original motivation for keeping these subprojects separate was due to
>> Iceberg relying on avro (which is included in the core project) for
>> metadata and keeping other format implementations separate, but as we
>> consider adding support for partition stats and parquet metadata, we need
>> the ability to read and write parquet from the core library.
>>
>> I've created a draft PR <https://github.com/apache/iceberg/pull/11716>
>> of the proposed changes, which relocates relatively cleanly, but wanted to
>> discuss whether there are concerns or other considerations for keeping them
>> separate.
>>
>> -Dan
>>
>
