Re: [DISCUSS] Spec update to cover compressed JSON metadata files

Micah Kornfield Tue, 29 Apr 2025 10:38:23 -0700

I wanted to clarify, as others have pointed out, that the PR documents
existing functionality, and making changes to it at this point risks
breaking clients.


I think any change to the naming convention would have to be done as part
of a new version of the spec (and file-system-based commits would have to
be completely removed as of that version).

I think ZSTD could be useful, but that again is a strict improvement that
is out of scope for this PR.
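
To make the naming question concrete, here is a minimal sketch of inferring
the codec from a metadata file name. It is illustrative Python, not the Java
client's code; the function name `detect_metadata_codec` and the returned
labels are hypothetical. It assumes only the two gzip orderings mentioned in
this thread and rejects everything else, including `.zst`.

```python
METADATA_SUFFIX = ".metadata.json"


def detect_metadata_codec(file_name: str) -> str:
    """Return "gzip" or "none" for a metadata file name (hypothetical helper)."""
    # Ordering suggested in the thread: v1.metadata.json.gz
    if file_name.endswith(METADATA_SUFFIX + ".gz"):
        return "gzip"
    # Anything else must end in .metadata.json; other suffixes (e.g. .zst)
    # are rejected because they are out of scope for this sketch.
    if not file_name.endswith(METADATA_SUFFIX):
        raise ValueError(f"{file_name} is not a recognized metadata file name")
    # Ordering discussed as existing behavior: v1.gz.metadata.json
    stem = file_name[: -len(METADATA_SUFFIX)]
    return "gzip" if stem.endswith(".gz") else "none"
```

For example, `detect_metadata_codec("v1.gz.metadata.json")` and
`detect_metadata_codec("v1.metadata.json.gz")` both return "gzip", while
`detect_metadata_codec("v1.metadata.json")` returns "none".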

Thanks,
Micah

On Monday, April 28, 2025, Ryan Blue <[email protected]> wrote:

> It would be great to mention how to determine the compression of the
> metadata JSON file in the spec. Thanks for bringing this up. It makes sense
> to me to use the file name and get a bit more strict about this.
>
> That said, we will need to make sure that the current default behavior is
> documented and required for anyone using the now-deprecated "hadoop tables"
> that used atomic rename to coordinate. The atomic rename commits only work
> when all clients are using the exact same path. It's a good thing that this
> is deprecated so we can move forward with catalog-based uses.
>
> Ryan
>
> On Mon, Apr 28, 2025 at 9:47 AM Kevin Liu <[email protected]> wrote:
>
>> Thanks for bringing this up Micah!
>>
>> I think it's better to treat `.json.gz` as the "default" file scheme and
>> `.gz.json` as the "legacy".
>>
>> I agree with the other points brought up here. Across the broader
>> ecosystem, I think `.json.gz` is used more often. DuckDB, for example, can
>> automatically detect compression from the `.json.gz` suffix, but not from
>> `.gz.json`.
>> See https://duckdb.org/docs/stable/data/json/loading_json#parameters
>>
>> Best,
>> Kevin Liu
>>
>>
>> On Sun, Apr 27, 2025 at 11:54 PM Fokko Driesprong <[email protected]>
>> wrote:
>>
>>> Hey Micah,
>>>
>>> For some reason, your email ended up in my spam box 😨
>>>
>>> There is a reason for everything!
>>>
>>>> .gz.metadata.json is quite uncommon and can't be read by most existing
>>>> tools. Would it be better to support .metadata.json.gz and treat
>>>> .gz.metadata.json as legacy for backward compatibility?
>>>
>>>
>>> The Java client supports both
>>> <https://github.com/apache/iceberg/blob/dc26b72ad016840b79d62bf8a84b7f2109e9b71b/core/src/test/java/org/apache/iceberg/TableMetadataParserCodecTest.java#L29-L40>.
>>> I looked into this years ago, and if I recall correctly, it was to
>>> bypass the decompressor of Hadoop
>>> <https://github.com/apache/iceberg/pull/258/>. Hadoop would detect the
>>> .gz and handle all the (de)compression, which we wanted to do ourselves.
>>>
>>>> gzip is becoming increasingly outdated due to its lack of support for
>>>> modern CPUs. New algorithms like zstd are gaining popularity, so
>>>> should we consider allowing users to use .metadata.json.zst as well?
>>>
>>>
>>> Yes, I think that would make a lot of sense.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>>
>>>
>>>
>>> On Mon, Apr 28, 2025 at 08:41, Xuanwo <[email protected]> wrote:
>>>
>>>> I've copied my comments from GitHub here for a broader discussion:
>>>>
>>>>
>>>>
>>>> Hi, I have two concerns about this change:
>>>>
>>>>    - .gz.metadata.json is quite uncommon and can't be read by most
>>>>    existing tools. Would it be better to support .metadata.json.gz and
>>>>    treat .gz.metadata.json as legacy for backward compatibility?
>>>>    - gzip is becoming increasingly outdated due to its lack of support
>>>>    for modern CPUs. New algorithms like zstd are gaining popularity,
>>>>    so should we consider allowing users to use .metadata.json.zst as
>>>>    well?
>>>>
>>>>
>>>> On Sun, Apr 27, 2025, at 07:36, Micah Kornfield wrote:
>>>>
>>>> I created https://github.com/apache/iceberg/pull/12598 to document
>>>> this feature.  Kevin Liu already took a look, but I would like to get more
>>>> eyes on it before starting a vote for merging.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> Xuanwo
>>>>
>>>> https://xuanwo.io/
>>>>
>>>>
