Re: [DISCUSS] Relocate Parquet to Iceberg Core

Ajantha Bhat Mon, 09 Dec 2024 03:21:55 -0800

Thanks Dan for the reply.

> This is also a good time to consider adding a native parquet read/write
> path for use in core as the generic path in 'iceberg-data' isn't ideal.
> Parquet metadata has been brought up in relation to improving stats
> handling (allowing tracking of more column stats without impacting planning
> performance), improving stats representations along with other possible
> benefits like improved compression and scan performance.



These two proposals are super interesting. I don't see any public
discussion on the same. Could you please provide more details or point
me to the discussions? +1 for moving the parquet module to core if it helps
these proposals.

> I feel like ORC is a separate discussion and while we may want to include
> it, I wouldn't say there's a definitive answer unless we know there is
> adequate investment in it.

Having a module for ORC but not Parquet looks odd. I don't think the effort
is huge for moving these files.
I can take it up as well.

> I wasn't aware you had a PR as part of the prior discussion, but I'm happy
> to revisit that if we decide this is a reasonable path forward.

Sure. I can revive my PR.

For those looking for last year's discussion on the same topic:
https://lists.apache.org/thread/8m6f3k7b425czktzf22q902vxgp2y10r

- Ajantha



On Sat, Dec 7, 2024 at 10:26 AM Daniel Weeks <dwe...@apache.org> wrote:

> Hey Ajantha,
>
> I understand it was discussed before, but I think a lot of recent
> discussions around improvements for parquet metadata/stats/etc are good
> justification for revisiting the earlier discussion.
>
> Parquet metadata has been brought up in relation to improving stats
> handling (allowing tracking of more column stats without impacting planning
> performance), improving stats representations along with other possible
> benefits like improved compression and scan performance.
>
> The original decision was more narrowly focused on the stats case and
> there were viable (though possibly not ideal) workarounds to keep the
> existing separation of subprojects, but at this point I see this more as a
> barrier to exploring some of these ideas as it's quite difficult to allow
> core to work directly with parquet.
>
> This is also a good time to consider adding a native parquet read/write
> path for use in core as the generic path in 'iceberg-data' isn't ideal
> (this might also be useful for projects like Kafka Connect).
>
> I feel like ORC is a separate discussion and while we may want to include
> it, I wouldn't say there's a definitive answer unless we know there is
> adequate investment in it.
>
> I wasn't aware you had a PR as part of the prior discussion, but I'm happy
> to revisit that if we decide this is a reasonable path forward.
>
> -Dan
>
>
>
> On Fri, Dec 6, 2024 at 5:31 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>
>> Hi Dan,
>>
>> I proposed the same last year while working on partition stats.
>> I can revive this PR if required,
>> https://github.com/apache/iceberg/pull/8500
>>
>> But we decided that `iceberg-data` can write these parquet stats files
>> (metadata) and core can just register them.
>> So, it is no longer needed for partition stats.
>>
>> a) Do we have any strong use case or feature that requires it now?
>> b) I hope we do the same for ORC as well as it looks odd to have a
>> module for that?
>>
>> - Ajantha
>>
>> On Sat, Dec 7, 2024 at 5:22 AM Daniel Weeks <dwe...@apache.org> wrote:
>>
>>> Everyone,
>>>
>>> I wanted to propose moving the parquet implementation from the
>>> 'iceberg-parquet' project to the 'iceberg-core' project.
>>>
>>> The original motivation for keeping these subprojects separate was due
>>> to Iceberg relying on avro (which is included in the core project) for
>>> metadata and keeping other format implementations separate, but as we
>>> consider adding support for partition stats and parquet metadata, we need
>>> the ability to read and write parquet from core library.
>>>
>>> I've created a draft PR <https://github.com/apache/iceberg/pull/11716>
>>> of the proposed changes, which relocates relatively cleanly, but wanted to
>>> discuss whether there are concerns or other considerations for keeping them
>>> separate.
>>>
>>> -Dan
>>>
>>
