IIUC, iceberg-parquet depends on iceberg-arrow for the vectored reader implementation (though partially supported). Should we relocate iceberg-arrow together?
Since I have mentioned that the vectored reader implementation is partially supported, is it a direction that needs to be improved? There is a long-awaited issue for the record: https://github.com/apache/iceberg/issues/7162 On Thu, Dec 19, 2024 at 9:41 AM Renjie Liu <liurenjie2...@gmail.com> wrote: > Hi: > > >> Third, the use case has expanded from stats files to other metadata. > > > I would +1 for this reason to move the parquet module to the core module. > > But I think the right direction is that we still keep interfaces in the > core module so that data file format is pluggable. Moving parquet to core > make parquet a kind of metadata format, but this should not stop other data > file formats from being supported. > > On Thu, Dec 19, 2024 at 5:21 AM rdb...@gmail.com <rdb...@gmail.com> wrote: > >> I was the person that originally suggested that we not move >> iceberg-parquet into core, so it would probably help if I gave some context >> for my rationale as I remember it and what's changed since then. >> >> I pushed back on the original suggestion to move Parquet classes into >> core because it wasn't clear that we needed to move them and couldn't just >> introduce an interface or abstraction that could be implemented by the >> iceberg-parquet module or other formats, like ORC. The idea was to keep the >> core module free of unnecessary dependencies. Most of the pieces of this >> strategy exist already because all of the core metadata classes are >> interfaces, like `DataFile` and `ManifestFile`, and there are writer >> interfaces that can be used with them, like `FileAppender`. >> >> I think it's a good thing to reconsider this because a few things have >> become more clear in the meantime. First, the exploration work to plug in >> formats through an interface never materialized, so we should reconsider >> rather than blocking that work. Second, I think the Iceberg ecosystem has >> moved from being mostly format-agnostic because Parquet is the format that >> is widely supported and optimized (Comet, for example, supports Parquet, >> not to mention vendor offerings). Third, the use case has expanded from >> stats files to other metadata. >> >> Now that we are looking at introducing more columnar metadata and have >> not added an abstraction to support multiple formats in the last year >> or so, I'm reconsidering the value of being format-agnostic. The original >> choice to use Avro for metadata was to maximize compatibility and I think >> that concern is stronger and stronger, given how widely Parquet is >> supported. If we are going to keep the surface area small by specifying >> Parquet for columnar metadata, then we don't need to do the extra work to >> add more abstractions. >> >> And now after having written that, I think I'm now convinced. +1 for >> merging iceberg-parquet into core. >> >> On Mon, Dec 9, 2024 at 3:21 AM Ajantha Bhat <ajanthab...@gmail.com> >> wrote: >> >>> Thanks Dan for the reply. >>> >>> This is also a good time to consider adding a native parquet read/write >>>> path for use in core as the generic path in 'iceberg-data' isn't ideal. >>>> Parquet metadata has been brought up in relation to improving stats >>>> handling (allowing tracking of more column stats without impacting planning >>>> performance), improving stats representations along with other possible >>>> benefits like improved compression and scan performance. >>> >>> >>> These two proposals are super interesting. I don't see any public >>> discussion on the same. Could you please provide more details or point >>> me to the discussions? +1 for moving the parquet module to core if it helps >>> for these proposals. >>> >>> I feel like ORC is a separate discussion and while we may want to >>>> include it, I wouldn't say there's a definitive answer unless we know there >>>> is adequate investment in it. >>> >>> Having a module for ORC but not Parquet looks odd. I don't think the >>> effort is huge for moving these files. >>> I can take it up as well. >>> >>> I wasn't aware you had a PR as part of the prior discussion, but I'm >>>> happy to revisit that if we decide this is a reasonable path forward. >>> >>> Sure. I can revive my PR. >>> >>> For those who are looking for previous discussion on the same topic last >>> year. >>> https://lists.apache.org/thread/8m6f3k7b425czktzf22q902vxgp2y10r >>> >>> - Ajantha >>> >>> >>> >>> On Sat, Dec 7, 2024 at 10:26 AM Daniel Weeks <dwe...@apache.org> wrote: >>> >>>> Hey Ajantha, >>>> >>>> I understand it was discussed before, but I think a lot of recent >>>> discussions around improvements for parquet metadata/stats/etc is good >>>> justification for revisiting the earlier discussion. >>>> >>>> Parquet metadata has been brought up in relation to improving stats >>>> handling (allowing tracking of more column stats without impacting planning >>>> performance), improving stats representations along with other possible >>>> benefits like improved compression and scan performance. >>>> >>>> The original decision was more narrowly focused on the stats case and >>>> there were viable (though possibly not ideal) workarounds to keep the >>>> existing separation of subprojects, but at this point I see this more as a >>>> barrier to exploring some of these ideas as it's quite difficult to allow >>>> core to work directly with parquet. >>>> >>>> This is also a good time to consider adding a native parquet read/write >>>> path for use in core as the generic path in 'iceberg-data' isn't ideal >>>> (this might also be useful for projects like Kafka Connect). >>>> >>>> I feel like ORC is a separate discussion and while we may want to >>>> include it, I wouldn't say there's a definitive answer unless we know there >>>> is adequate investment in it. >>>> >>>> I wasn't aware you had a PR as part of the prior discussion, but I'm >>>> happy to revisit that if we decide this is a reasonable path forward. >>>> >>>> -Dan >>>> >>>> >>>> >>>> On Fri, Dec 6, 2024 at 5:31 PM Ajantha Bhat <ajanthab...@gmail.com> >>>> wrote: >>>> >>>>> Hi Dan, >>>>> >>>>> I proposed the same last year while working on partition stats. >>>>> I can revive this PR if required, >>>>> https://github.com/apache/iceberg/pull/8500 >>>>> >>>>> But we decided that `*iceberg-data`* can write these parquet stats >>>>> files (metadata) and core can just register it. >>>>> So, it is no longer needed for partition stats. >>>>> >>>>> a) Do we have any strong use case or feature that requires it now? >>>>> b) I hope we do the same for ORC as well as it looks odd to have a >>>>> module for that? >>>>> >>>>> - Ajantha >>>>> >>>>> On Sat, Dec 7, 2024 at 5:22 AM Daniel Weeks <dwe...@apache.org> wrote: >>>>> >>>>>> Everyone, >>>>>> >>>>>> I wanted to propose moving the parquet implementation from the >>>>>> 'iceberg-parquet' project to the 'iceberg-core' project. >>>>>> >>>>>> The original motivation for keeping these subprojects separate was >>>>>> due to Iceberg relying on avro (which is included in the core project) >>>>>> for >>>>>> metadata and keeping other format implementations separate, but as we >>>>>> consider adding support for partition stats and parquet metadata, we need >>>>>> the ability to read and write parquet from core library. >>>>>> >>>>>> I've created a draft PR >>>>>> <https://github.com/apache/iceberg/pull/11716> of the proposed >>>>>> changes, which relocates relatively cleanly, but wanted to discuss >>>>>> whether >>>>>> there are concerns or other considerations for keeping them separate. >>>>>> >>>>>> -Dan >>>>>> >>>>>