Thanks for bringing this up, Micah!

I think it's better to treat `.json.gz` as the "default" file naming scheme
and `.gz.json` as the "legacy" one.

I agree with the other points brought up here. Across the broader
ecosystem, I think `.json.gz` is used more often. DuckDB, for example, can
automatically detect compression when it appears as the file suffix
(`.json.gz`), but not when the order is reversed (`.gz.json`).
See https://duckdb.org/docs/stable/data/json/loading_json#parameters
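
To make the "default" vs. "legacy" idea concrete, here is a rough Python
sketch of a reader that accepts both naming schemes. The function names
(`detect_codec`, `read_metadata`) are hypothetical and this is not how the
Iceberg Java client is written; it only illustrates suffix-based detection
under the assumption that gzip is the only codec in play.

```python
import gzip
import json
import os


def detect_codec(file_name: str) -> str:
    """Return "gzip" or "none" based on the metadata file name.

    Hypothetical helper for illustration only.
    """
    # Proposed default: the compression suffix comes last.
    if file_name.endswith(".metadata.json.gz"):
        return "gzip"
    # Legacy scheme: the codec appears before ".metadata.json".
    # This check must come before the plain ".metadata.json" check,
    # because legacy names also end with ".metadata.json".
    if file_name.endswith(".gz.metadata.json"):
        return "gzip"
    if file_name.endswith(".metadata.json"):
        return "none"
    raise ValueError(f"not a metadata file name: {file_name}")


def read_metadata(path: str) -> dict:
    """Open a metadata file with the codec implied by its name and parse it."""
    codec = detect_codec(os.path.basename(path))
    opener = gzip.open if codec == "gzip" else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

A tool that only looks at the final suffix would handle the first branch on
its own; the second branch is what a reader needs to stay backward
compatible with existing tables.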

Best,
Kevin Liu


On Sun, Apr 27, 2025 at 11:54 PM Fokko Driesprong <fo...@apache.org> wrote:

> Hey Micah,
>
> For some reason, your email ended up in my spam box 😨
>
> There is a reason for everything!
>
>> .gz.metadata.json is quite uncommon and can't be read by most existing
>> tools. Would it be better to support .metadata.json.gz and treat
>> .gz.metadata.json as legacy for backward compatibility?
>
>
> The Java client supports both
> <https://github.com/apache/iceberg/blob/dc26b72ad016840b79d62bf8a84b7f2109e9b71b/core/src/test/java/org/apache/iceberg/TableMetadataParserCodecTest.java#L29-L40>.
> I looked into this years ago, and if I recall correctly, it was to bypass
> the decompressor of Hadoop <https://github.com/apache/iceberg/pull/258/>.
> Hadoop would detect the .gz and handle all the (de)compression, which we
> wanted to do ourselves.
>
>> gzip is becoming increasingly outdated due to its lack of support for
>> modern CPUs. New algorithms like zstd are gaining popularity, so should
>> we consider allowing users to use .metadata.json.zst as well?
>
>
> Yes, I think that would make a lot of sense.
>
> Kind regards,
> Fokko
>
>
>
>
> On Mon, Apr 28, 2025 at 08:41, Xuanwo <xua...@apache.org> wrote:
>
>> I've copied my comments from GitHub here for a broader discussion:
>>
>>
>>
>> Hi, I have two concerns about this change:
>>
>>    - .gz.metadata.json is quite uncommon and can't be read by most
>>    existing tools. Would it be better to support .metadata.json.gz and
>>    treat .gz.metadata.json as legacy for backward compatibility?
>>    - gzip is becoming increasingly outdated due to its lack of support
>>    for modern CPUs. New algorithms like zstd are gaining popularity, so
>>    should we consider allowing users to use .metadata.json.zst as well?
>>
>>
>> On Sun, Apr 27, 2025, at 07:36, Micah Kornfield wrote:
>>
>> I created https://github.com/apache/iceberg/pull/12598 to document this
>> feature.  Kevin Liu already took a look, but I would like to get more eyes
>> on it before starting a vote for merging.
>>
>> Thanks,
>> Micah
>>
>> Xuanwo
>>
>> https://xuanwo.io/
>>
>>
