Re: [Discussion] Move `iceberg-parquet` and `iceberg-orc` modules into `iceberg-core`

Russell Spitzer Thu, 02 Nov 2023 07:43:39 -0700

Is there an alternative where we follow the approach currently used for writing 
Position Deletes and Data Files? That is, the more generic "writers" would live 
in core, while the actual implementations would stay in iceberg-parquet or 
iceberg-orc.
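
A minimal sketch of that split, under stated assumptions: every name below (`PartitionStatsWriter`, `PartitionStatsWriterFactory`, `WriterRegistry`, `FakeParquetStatsWriter`) is hypothetical and not part of Iceberg, and the in-memory "Parquet" writer only stands in for a real one. It illustrates how core could own the interface while the format module supplies the implementation, so core never needs a compile-time dependency on `iceberg-parquet` or `iceberg-orc`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatsWriterSketch {

  // ---- Would live in iceberg-core (hypothetical names) ----

  /** Format-agnostic contract; core code only ever sees this interface. */
  interface PartitionStatsWriter {
    void write(String partition, long rowCount);

    List<String> close();
  }

  /** Factory that each format module would implement. */
  interface PartitionStatsWriterFactory {
    PartitionStatsWriter newWriter(String location);
  }

  /** Simple registry in core, keyed by format name (an assumption, not Iceberg's mechanism). */
  static final class WriterRegistry {
    private static final Map<String, PartitionStatsWriterFactory> FACTORIES = new HashMap<>();

    static void register(String format, PartitionStatsWriterFactory factory) {
      FACTORIES.put(format, factory);
    }

    static PartitionStatsWriter writerFor(String format, String location) {
      PartitionStatsWriterFactory factory = FACTORIES.get(format);
      if (factory == null) {
        throw new IllegalArgumentException("No writer registered for format: " + format);
      }
      return factory.newWriter(location);
    }
  }

  // ---- Would live in iceberg-parquet (stand-in that records rows in memory) ----

  static final class FakeParquetStatsWriter implements PartitionStatsWriter {
    private final String location;
    private final List<String> rows = new ArrayList<>();

    FakeParquetStatsWriter(String location) {
      this.location = location;
    }

    @Override
    public void write(String partition, long rowCount) {
      rows.add(location + ": " + partition + " -> " + rowCount);
    }

    @Override
    public List<String> close() {
      return rows;
    }
  }

  public static void main(String[] args) {
    // The format module registers itself; core then asks only for "parquet".
    WriterRegistry.register("parquet", FakeParquetStatsWriter::new);
    PartitionStatsWriter writer = WriterRegistry.writerFor("parquet", "stats-0.parquet");
    writer.write("dt=2023-11-02", 42L);
    System.out.println(writer.close());
  }
}
```

With this shape the dependency arrow still points from the format module to core; the open question is only how the implementation gets registered or discovered at runtime.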


> On Nov 2, 2023, at 9:38 AM, Ajantha Bhat <ajanthab...@gmail.com> wrote:
> 
> Hi Renjie, 
> 
> I have highlighted the use case from the mail above:
>  
>> However, with the addition of partition statistics 
>> <https://github.com/apache/iceberg/blob/main/format/spec.md#partition-statistics-file>,
>>  Iceberg's metadata (stats file) will be
>> represented in Parquet or ORC formats.
>> To enable the `iceberg-core` module to write metadata in Parquet or ORC 
>> format, it will make extensive use of the functions found in the 
>> `iceberg-parquet`
>> and `iceberg-orc` modules. However, due to a circular dependency issue, 
>> `iceberg-core` cannot directly rely on `iceberg-parquet` and `iceberg-orc`.
>> Consequently, I suggest merging `iceberg-parquet` and `iceberg-orc` as 
>> packages within the `iceberg-core` module.
>  
> A utility for reading and writing partition statistics in Parquet format is 
> expected to take the form outlined here 
> <https://github.com/apache/iceberg/pull/8503/commits/2ba244540bf9fd574ece909f4cb178fdf12defa8>,
>  leveraging the `iceberg-parquet` dependency.
> 
> To facilitate on-demand partition statistics computation, this utility can 
> find a home in either `iceberg-data` or a new module that relies on both 
> `iceberg-parquet` and `iceberg-orc`. This approach would enable all engines 
> to make use of it.
> 
> However, for the synchronous calculation of statistics during insertion, 
> similar to how Trino supports Puffin stats, the `iceberg-core` module's 
> snapshot producer must have access to this utility. This presents a challenge 
> due to the existing circular dependency, as `iceberg-parquet` and 
> `iceberg-orc` already depend on `iceberg-core`.
> 
> To resolve this circular dependency issue, my proposal is to integrate them 
> as separate packages within the `iceberg-core` module. 
> I believe it's best to include them in the appropriate place during the 
> initial addition itself to support both synchronous and asynchronous writes,
> instead of adding to `iceberg-data` just for asynchronous writes and later 
> deprecating and moving them to core during synchronous write implementation.  
> 
> Moving them to `iceberg-core` can also open up the possibility of writing 
> existing metadata (like manifests and manifest lists) in Parquet or ORC instead 
> of Avro in the future.
> 
> Thanks, 
> Ajantha 
> 
> On Thu, Nov 2, 2023 at 5:07 PM Renjie Liu <liurenjie2...@gmail.com 
> <mailto:liurenjie2...@gmail.com>> wrote:
>> Hi:
>> 
>> Could you provide concrete cases to elaborate on this change?
>> 
>> On Thu, Nov 2, 2023 at 4:22 PM Gabor Kaszab <gaborkas...@apache.org 
>> <mailto:gaborkas...@apache.org>> wrote:
>>> Hey Ajantha,
>>> 
>>> Wouldn't this require a major version bump considering this is a breaking 
>>> change for users depending on iceberg-parquet or iceberg-orc now?
>>> 
>>> Gabor
>>> 
>>> On Thu, Nov 2, 2023 at 3:01 AM Ajantha Bhat <ajanthab...@gmail.com 
>>> <mailto:ajanthab...@gmail.com>> wrote:
>>>> Hi Everyone, 
>>>> 
>>>> At present, Iceberg exclusively utilizes Avro, JSON, and Puffin formats to 
>>>> handle metadata. A few discussions in the past have explored the possibility 
>>>> of supporting this existing metadata in Parquet or ORC format. However, 
>>>> with the addition of partition statistics 
>>>> <https://github.com/apache/iceberg/blob/main/format/spec.md#partition-statistics-file>,
>>>>  Iceberg's metadata (stats file) will be 
>>>> represented in Parquet or ORC formats. 
>>>> 
>>>> To enable the `iceberg-core` module to write metadata in Parquet or ORC 
>>>> format, it will make extensive use of the functions found in the 
>>>> `iceberg-parquet` 
>>>> and `iceberg-orc` modules. However, due to a circular dependency issue, 
>>>> `iceberg-core` cannot directly rely on `iceberg-parquet` and 
>>>> `iceberg-orc`. 
>>>> Consequently, I suggest merging `iceberg-parquet` and `iceberg-orc` as 
>>>> packages within the `iceberg-core` module.
>>>> 
>>>> For end users, the main change in the new release package will be the 
>>>> absence of separate `iceberg-parquet` and `iceberg-orc` JAR files. 
>>>> Instead, they can 
>>>> depend on `iceberg-core` (which they were likely doing already). This 
>>>> change will also be clearly documented in the release notes.
>>>> 
>>>> I would appreciate hearing your thoughts on this proposal.
>>>> 
>>>> For a detailed look at the code changes required to implement the 
>>>> integration of `iceberg-parquet` into `iceberg-core`, 
>>>> please refer to the following PR: 
>>>> https://github.com/apache/iceberg/pull/8500
>>>> 
>>>> Thanks, 
>>>> Ajantha
