> Is there an alternative where we do an implementation similar to how
> Position Deletes and Data Files are currently written? Like we have the
> more generic "writers" in core but the actual implementations still live in
> iceberg-parquet or iceberg-orc?

Hi Russell,

Let me explore this path and get back to you.

Thanks.

On Thu, Nov 2, 2023 at 8:09 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

> Is there an alternative where we do an implementation similar to how
> Position Deletes and Data Files are currently written? Like we have the
> more generic "writers" in core but the actual implementations still live in
> iceberg-parquet or iceberg-orc?
>
> On Nov 2, 2023, at 9:38 AM, Ajantha Bhat <ajanthab...@gmail.com> wrote:
>
> Hi Renjie,
>
> I have highlighted the use case from the above mail:
>
>> *However, with the addition of partition statistics
>> <https://github.com/apache/iceberg/blob/main/format/spec.md#partition-statistics-file>,
>> Iceberg's metadata (stats file) will be represented in Parquet or ORC
>> formats.*
>> To enable the `iceberg-core` module to write metadata in Parquet or ORC
>> format, it will make extensive use of the functions found in the
>> `iceberg-parquet` and `iceberg-orc` modules.
>> *However, due to a circular dependency issue*,
>> *`iceberg-core` cannot directly rely on `iceberg-parquet` and
>> `iceberg-orc`.*
>> Consequently, I suggest merging `iceberg-parquet` and `iceberg-orc` as
>> packages within the `iceberg-core` module.
>
> A utility for reading and writing partition statistics in Parquet format
> is expected to take the form outlined here
> <https://github.com/apache/iceberg/pull/8503/commits/2ba244540bf9fd574ece909f4cb178fdf12defa8>,
> leveraging the `iceberg-parquet` dependency.
>
> To facilitate on-demand partition statistics computation, this utility can
> find a home in either `iceberg-data` or a new module that relies on both
> `iceberg-parquet` and `iceberg-orc`. This approach would enable all engines
> to make use of it.
>
> However, for the synchronous calculation of statistics during insertion,
> similar to how Trino supports Puffin stats, the `iceberg-core` module's
> snapshot producer must have access to this utility.
> This presents a challenge due to the existing circular dependency, as
> `iceberg-parquet` and `iceberg-orc` already depend on `iceberg-core`.
>
> To resolve this circular dependency issue, my proposal is to integrate
> them as separate packages within the `iceberg-core` module.
> I believe it's best to put them in the appropriate place when they are
> first added, so that both synchronous and asynchronous writes are supported,
> instead of adding them to `iceberg-data` just for asynchronous writes and
> later deprecating and moving them to core during the synchronous write
> implementation.
>
> Moving them to `iceberg-core` can also open up the possibility of writing
> existing metadata (like manifests and manifest lists) in Parquet or ORC
> instead of Avro in the future.
>
> Thanks,
> Ajantha
>
> On Thu, Nov 2, 2023 at 5:07 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>
>> Hi:
>>
>> Could you provide concrete cases to elaborate on this change?
>>
>> On Thu, Nov 2, 2023 at 4:22 PM Gabor Kaszab <gaborkas...@apache.org>
>> wrote:
>>
>>> Hey Ajantha,
>>>
>>> Wouldn't this require a major version bump, considering this is a
>>> breaking change for users depending on iceberg-parquet or iceberg-orc now?
>>>
>>> Gabor
>>>
>>> On Thu, Nov 2, 2023 at 3:01 AM Ajantha Bhat <ajanthab...@gmail.com>
>>> wrote:
>>>
>>>> Hi Everyone,
>>>>
>>>> At present, Iceberg exclusively utilizes Avro, JSON, and Puffin formats
>>>> to handle metadata. A few discussions in the past have explored the
>>>> possibility of supporting this existing metadata in Parquet or ORC format.
>>>> However, with the addition of partition statistics
>>>> <https://github.com/apache/iceberg/blob/main/format/spec.md#partition-statistics-file>,
>>>> Iceberg's metadata (stats file) will be
>>>> represented in Parquet or ORC formats.
>>>>
>>>> To enable the `iceberg-core` module to write metadata in Parquet or ORC
>>>> format, it will make extensive use of the functions found in the
>>>> `iceberg-parquet` and `iceberg-orc` modules. However, due to a circular
>>>> dependency issue, `iceberg-core` cannot directly rely on
>>>> `iceberg-parquet` and `iceberg-orc`.
>>>> Consequently, I suggest merging `iceberg-parquet` and `iceberg-orc` as
>>>> packages within the `iceberg-core` module.
>>>>
>>>> For end users, the main change in the new release package will be the
>>>> absence of separate `iceberg-parquet` and `iceberg-orc` JAR files. Instead,
>>>> they can depend on `iceberg-core` (which they were likely doing already).
>>>> This change will also be clearly documented in the release notes.
>>>>
>>>> I would appreciate hearing your thoughts on this proposal.
>>>>
>>>> For a detailed look at the code changes required to implement the
>>>> integration of `iceberg-parquet` into `iceberg-core`,
>>>> please refer to the following PR:
>>>> https://github.com/apache/iceberg/pull/8500
>>>>
>>>> Thanks,
>>>> Ajantha
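
For anyone following the thread, the alternative Russell raises (generic "writers" in core, concrete implementations staying in iceberg-parquet or iceberg-orc) could look roughly like the sketch below. This is only an illustration of the dependency direction, not Iceberg's actual API: `StatsWriter`, `StatsWriters`, `FakeParquetStatsWriter`, and `writeStats` are all hypothetical names, the "Parquet" writer does no real encoding, and the explicit `register` call stands in for whatever wiring mechanism would really be used.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class WriterRegistrySketch {

  // Hypothetical: would live in the core module as a format-agnostic contract.
  interface StatsWriter {
    String write(List<String> rows);
  }

  // Hypothetical: would also live in core. Format modules add their factories here,
  // so core never needs a compile-time dependency on them.
  static final class StatsWriters {
    private static final Map<String, Supplier<StatsWriter>> REGISTRY = new ConcurrentHashMap<>();

    static void register(String format, Supplier<StatsWriter> factory) {
      REGISTRY.put(format.toLowerCase(Locale.ROOT), factory);
    }

    static StatsWriter forFormat(String format) {
      Supplier<StatsWriter> factory = REGISTRY.get(format.toLowerCase(Locale.ROOT));
      if (factory == null) {
        throw new IllegalArgumentException("No writer registered for format: " + format);
      }
      return factory.get();
    }
  }

  // Hypothetical: would live in the Parquet module, which depends on core and not the reverse.
  static final class FakeParquetStatsWriter implements StatsWriter {
    @Override
    public String write(List<String> rows) {
      // Placeholder for real Parquet encoding; returns a tag plus the row count.
      return "parquet:" + rows.size();
    }
  }

  // Core-side caller (standing in for the snapshot producer); it only sees the interface.
  static String writeStats(String format, List<String> rows) {
    return StatsWriters.forFormat(format).write(rows);
  }

  public static void main(String[] args) {
    StatsWriters.register("parquet", FakeParquetStatsWriter::new);
    System.out.println(writeStats("parquet", List.of("p1", "p2")));
  }
}
```

With this shape, the core-side caller compiles against the interface only, and the format modules keep their existing dependency on core, so no cycle is introduced. Whether this fits the partition statistics use case is exactly what remains to be explored.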