Thanks Dan for the reply.

> This is also a good time to consider adding a native parquet read/write
> path for use in core as the generic path in 'iceberg-data' isn't ideal.

> Parquet metadata has been brought up in relation to improving stats
> handling (allowing tracking of more column stats without impacting planning
> performance), improving stats representations along with other possible
> benefits like improved compression and scan performance.

These two proposals are super interesting. I don't see any public discussion of them. Could you please provide more details or point me to the discussions?
+1 for moving the parquet module to core if it helps with these proposals.

> I feel like ORC is a separate discussion and while we may want to include
> it, I wouldn't say there's a definitive answer unless we know there is
> adequate investment in it.

Having a module for ORC but not Parquet looks odd. I don't think moving these files is a huge effort. I can take it up as well.

> I wasn't aware you had a PR as part of the prior discussion, but I'm happy
> to revisit that if we decide this is a reasonable path forward.

Sure. I can revive my PR.

For those looking for the previous discussion on this topic from last year:
https://lists.apache.org/thread/8m6f3k7b425czktzf22q902vxgp2y10r

- Ajantha

On Sat, Dec 7, 2024 at 10:26 AM Daniel Weeks <dwe...@apache.org> wrote:

> Hey Ajantha,
>
> I understand it was discussed before, but I think a lot of recent
> discussions around improvements for parquet metadata/stats/etc is good
> justification for revisiting the earlier discussion.
>
> Parquet metadata has been brought up in relation to improving stats
> handling (allowing tracking of more column stats without impacting planning
> performance), improving stats representations along with other possible
> benefits like improved compression and scan performance.
>
> The original decision was more narrowly focused on the stats case and
> there were viable (though possibly not ideal) workarounds to keep the
> existing separation of subprojects, but at this point I see this more as a
> barrier to exploring some of these ideas as it's quite difficult to allow
> core to work directly with parquet.
>
> This is also a good time to consider adding a native parquet read/write
> path for use in core as the generic path in 'iceberg-data' isn't ideal
> (this might also be useful for projects like Kafka Connect).
>
> I feel like ORC is a separate discussion and while we may want to include
> it, I wouldn't say there's a definitive answer unless we know there is
> adequate investment in it.
>
> I wasn't aware you had a PR as part of the prior discussion, but I'm happy
> to revisit that if we decide this is a reasonable path forward.
>
> -Dan
>
> On Fri, Dec 6, 2024 at 5:31 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>
>> Hi Dan,
>>
>> I proposed the same last year while working on partition stats.
>> I can revive this PR if required,
>> https://github.com/apache/iceberg/pull/8500
>>
>> But we decided that `iceberg-data` can write these parquet stats files
>> (metadata) and core can just register it.
>> So, it is no longer needed for partition stats.
>>
>> a) Do we have any strong use case or feature that requires it now?
>> b) I hope we do the same for ORC as well as it looks odd to have a
>> module for that?
>>
>> - Ajantha
>>
>> On Sat, Dec 7, 2024 at 5:22 AM Daniel Weeks <dwe...@apache.org> wrote:
>>
>>> Everyone,
>>>
>>> I wanted to propose moving the parquet implementation from the
>>> 'iceberg-parquet' project to the 'iceberg-core' project.
>>>
>>> The original motivation for keeping these subprojects separate was due
>>> to Iceberg relying on avro (which is included in the core project) for
>>> metadata and keeping other format implementations separate, but as we
>>> consider adding support for partition stats and parquet metadata, we need
>>> the ability to read and write parquet from core library.
>>>
>>> I've created a draft PR <https://github.com/apache/iceberg/pull/11716>
>>> of the proposed changes, which relocates relatively cleanly, but wanted to
>>> discuss whether there are concerns or other considerations for keeping them
>>> separate.
>>>
>>> -Dan