Hey Ajantha, I understand it was discussed before, but I think the many recent discussions around improvements to parquet metadata/stats/etc. are good justification for revisiting the earlier discussion.

Parquet metadata has been brought up in relation to improving stats handling (allowing tracking of more column stats without impacting planning performance) and improving stats representations, along with other possible benefits like improved compression and scan performance. The original decision was more narrowly focused on the stats case, and there were viable (though possibly not ideal) workarounds to keep the existing separation of subprojects. At this point, though, I see the separation more as a barrier to exploring some of these ideas, as it's quite difficult to allow core to work directly with parquet.

This is also a good time to consider adding a native parquet read/write path for use in core, as the generic path in 'iceberg-data' isn't ideal (this might also be useful for projects like Kafka Connect).

I feel like ORC is a separate discussion, and while we may want to include it, I wouldn't say there's a definitive answer unless we know there is adequate investment in it.

I wasn't aware you had a PR as part of the prior discussion, but I'm happy to revisit that if we decide this is a reasonable path forward.

-Dan

On Fri, Dec 6, 2024 at 5:31 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> Hi Dan,
>
> I proposed the same last year while working on partition stats.
> I can revive this PR if required,
> https://github.com/apache/iceberg/pull/8500
>
> But we decided that `iceberg-data` can write these parquet stats files
> (metadata) and core can just register it.
> So, it is no longer needed for partition stats.
>
> a) Do we have any strong use case or feature that requires it now?
> b) I hope we do the same for ORC as well as it looks odd to have a
> module for that?
>
> - Ajantha
>
> On Sat, Dec 7, 2024 at 5:22 AM Daniel Weeks <dwe...@apache.org> wrote:
>
>> Everyone,
>>
>> I wanted to propose moving the parquet implementation from the
>> 'iceberg-parquet' project to the 'iceberg-core' project.
>>
>> The original motivation for keeping these subprojects separate was due to
>> Iceberg relying on avro (which is included in the core project) for
>> metadata and keeping other format implementations separate, but as we
>> consider adding support for partition stats and parquet metadata, we need
>> the ability to read and write parquet from core library.
>>
>> I've created a draft PR <https://github.com/apache/iceberg/pull/11716>
>> of the proposed changes, which relocates relatively cleanly, but wanted to
>> discuss whether there are concerns or other considerations for keeping them
>> separate.
>>
>> -Dan