Re: [DISCUSS] Relocate Parquet to Iceberg Core

Gang Wu Wed, 18 Dec 2024 17:56:37 -0800

IIUC, iceberg-parquet depends on iceberg-arrow for the vectored reader
implementation (though partially supported). Should we relocate
iceberg-arrow together?


Since I have mentioned that the vectored reader implementation is partially
supported, is it a direction that needs to be improved? There is a
long-awaited issue for the record:
https://github.com/apache/iceberg/issues/7162

On Thu, Dec 19, 2024 at 9:41 AM Renjie Liu <liurenjie2...@gmail.com> wrote:

> Hi:
>
>
>> Third, the use case has expanded from stats files to other metadata.
>
>
> I would +1 for this reason to move the parquet module to the core module.
>
> But I think the right direction is that we still keep interfaces in the
> core module so that data file format is pluggable. Moving parquet to core
> make parquet a kind of metadata format, but this should not stop other data
> file formats from being supported.
>
> On Thu, Dec 19, 2024 at 5:21 AM rdb...@gmail.com <rdb...@gmail.com> wrote:
>
>> I was the person that originally suggested that we not move
>> iceberg-parquet into core, so it would probably help if I gave some context
>> for my rationale as I remember it and what's changed since then.
>>
>> I pushed back on the original suggestion to move Parquet classes into
>> core because it wasn't clear that we needed to move them and couldn't just
>> introduce an interface or abstraction that could be implemented by the
>> iceberg-parquet module or other formats, like ORC. The idea was to keep the
>> core module free of unnecessary dependencies. Most of the pieces of this
>> strategy exist already because all of the core metadata classes are
>> interfaces, like `DataFile` and `ManifestFile`, and there are writer
>> interfaces that can be used with them, like `FileAppender`.
>>
>> I think it's a good thing to reconsider this because a few things have
>> become more clear in the meantime. First, the exploration work to plug in
>> formats through an interface never materialized, so we should reconsider
>> rather than blocking that work. Second, I think the Iceberg ecosystem has
>> moved from being mostly format-agnostic because Parquet is the format that
>> is widely supported and optimized (Comet, for example, supports Parquet,
>> not to mention vendor offerings). Third, the use case has expanded from
>> stats files to other metadata.
>>
>> Now that we are looking at introducing more columnar metadata and have
>> not added an abstraction to support multiple formats in the last year
>> or so, I'm reconsidering the value of being format-agnostic. The original
>> choice to use Avro for metadata was to maximize compatibility and I think
>> that concern is stronger and stronger, given how widely Parquet is
>> supported. If we are going to keep the surface area small by specifying
>> Parquet for columnar metadata, then we don't need to do the extra work to
>> add more abstractions.
>>
>> And now after having written that, I think I'm now convinced. +1 for
>> merging iceberg-parquet into core.
>>
>> On Mon, Dec 9, 2024 at 3:21 AM Ajantha Bhat <ajanthab...@gmail.com>
>> wrote:
>>
>>> Thanks Dan for the reply.
>>>
>>> This is also a good time to consider adding a native parquet read/write
>>>> path for use in core as the generic path in 'iceberg-data' isn't ideal.
>>>> Parquet metadata has been brought up in relation to improving stats
>>>> handling (allowing tracking of more column stats without impacting planning
>>>> performance), improving stats representations along with other possible
>>>> benefits like improved compression and scan performance.
>>>
>>>
>>> These two proposals are super interesting. I don't see any public
>>> discussion on the same. Could you please provide more details or point
>>> me to the discussions? +1 for moving the parquet module to core if it helps
>>> for these proposals.
>>>
>>> I feel like ORC is a separate discussion and while we may want to
>>>> include it, I wouldn't say there's a definitive answer unless we know there
>>>> is adequate investment in it.
>>>
>>> Having a module for ORC but not Parquet looks odd. I don't think the
>>> effort is huge for moving these files.
>>> I can take it up as well.
>>>
>>> I wasn't aware you had a PR as part of the prior discussion, but I'm
>>>> happy to revisit that if we decide this is a reasonable path forward.
>>>
>>> Sure. I can revive my PR.
>>>
>>> For those who are looking for previous discussion on the same topic last
>>> year.
>>> https://lists.apache.org/thread/8m6f3k7b425czktzf22q902vxgp2y10r
>>>
>>> - Ajantha
>>>
>>>
>>>
>>> On Sat, Dec 7, 2024 at 10:26 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>>> Hey Ajantha,
>>>>
>>>> I understand it was discussed before, but I think a lot of recent
>>>> discussions around improvements for parquet metadata/stats/etc is good
>>>> justification for revisiting the earlier discussion.
>>>>
>>>> Parquet metadata has been brought up in relation to improving stats
>>>> handling (allowing tracking of more column stats without impacting planning
>>>> performance), improving stats representations along with other possible
>>>> benefits like improved compression and scan performance.
>>>>
>>>> The original decision was more narrowly focused on the stats case and
>>>> there were viable (though possibly not ideal) workarounds to keep the
>>>> existing separation of subprojects, but at this point I see this more as a
>>>> barrier to exploring some of these ideas as it's quite difficult to allow
>>>> core to work directly with parquet.
>>>>
>>>> This is also a good time to consider adding a native parquet read/write
>>>> path for use in core as the generic path in 'iceberg-data' isn't ideal
>>>> (this might also be useful for projects like Kafka Connect).
>>>>
>>>> I feel like ORC is a separate discussion and while we may want to
>>>> include it, I wouldn't say there's a definitive answer unless we know there
>>>> is adequate investment in it.
>>>>
>>>> I wasn't aware you had a PR as part of the prior discussion, but I'm
>>>> happy to revisit that if we decide this is a reasonable path forward.
>>>>
>>>> -Dan
>>>>
>>>>
>>>>
>>>> On Fri, Dec 6, 2024 at 5:31 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Dan,
>>>>>
>>>>> I proposed the same last year while working on partition stats.
>>>>> I can revive this PR if required,
>>>>> https://github.com/apache/iceberg/pull/8500
>>>>>
>>>>> But we decided that `*iceberg-data`* can write these parquet stats
>>>>> files (metadata) and core can just register it.
>>>>> So, it is no longer needed for partition stats.
>>>>>
>>>>> a) Do we have any strong use case or feature that requires it now?
>>>>> b) I hope we do the same for ORC as well as it looks odd to have a
>>>>> module for that?
>>>>>
>>>>> - Ajantha
>>>>>
>>>>> On Sat, Dec 7, 2024 at 5:22 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>
>>>>>> Everyone,
>>>>>>
>>>>>> I wanted to propose moving the parquet implementation from the
>>>>>> 'iceberg-parquet' project to the 'iceberg-core' project.
>>>>>>
>>>>>> The original motivation for keeping these subprojects separate was
>>>>>> due to Iceberg relying on avro (which is included in the core project) 
>>>>>> for
>>>>>> metadata and keeping other format implementations separate, but as we
>>>>>> consider adding support for partition stats and parquet metadata, we need
>>>>>> the ability to read and write parquet from core library.
>>>>>>
>>>>>> I've created a draft PR
>>>>>> <https://github.com/apache/iceberg/pull/11716> of the proposed
>>>>>> changes, which relocates relatively cleanly, but wanted to discuss 
>>>>>> whether
>>>>>> there are concerns or other considerations for keeping them separate.
>>>>>>
>>>>>> -Dan
>>>>>>
>>>>>

Re: [DISCUSS] Relocate Parquet to Iceberg Core

Reply via email to