It would be great to mention how to determine the compression of the
metadata JSON file in the spec. Thanks for bringing this up. It makes sense
to me to use the file name and get a bit more strict about this.
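
As a rough sketch of what name-based detection could look like (Python only
for illustration; `metadata_codec` and the suffix checks are hypothetical and
not taken from the spec or from any Iceberg client):

```python
def metadata_codec(path: str) -> str:
    """Guess the compression codec of a metadata file from its name alone.

    Hypothetical helper: accepts both the suffix form discussed in this
    thread (.metadata.json.gz) and the older .gz.metadata.json form.
    """
    name = path.rsplit("/", 1)[-1]
    # Check the gzip forms first, because ".gz.metadata.json" also ends
    # with ".metadata.json" and would otherwise look uncompressed.
    if name.endswith(".metadata.json.gz") or name.endswith(".gz.metadata.json"):
        return "gzip"
    if name.endswith(".metadata.json"):
        return "none"
    raise ValueError(f"not a recognized metadata file name: {name}")
```

Being strict here means anything that does not match a known pattern is
rejected rather than guessed at.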

That said, we need to make sure that the current default behavior stays
documented, and stays required, for anyone using the now-deprecated "hadoop
tables", which use atomic rename to coordinate commits. Atomic rename commits
only work when all clients use the exact same path. It's a good thing that
this is deprecated so we can move forward with catalog-based uses.
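
To illustrate why the path matters, here is a small sketch. It is not the
hadoop tables implementation: `commit_by_rename` is a hypothetical helper,
`os.link` stands in for an atomic no-overwrite rename on a local filesystem,
and the `v2...` file names are only illustrative.

```python
import os
import tempfile


def commit_by_rename(table_dir: str, file_name: str, payload: bytes) -> bool:
    """Publish payload under file_name; return False if the name is taken."""
    fd, tmp_path = tempfile.mkstemp(dir=table_dir)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(payload)
    try:
        # os.link raises FileExistsError when the target already exists,
        # so only one writer can claim a given name.
        os.link(tmp_path, os.path.join(table_dir, file_name))
        return True
    except FileExistsError:
        return False
    finally:
        os.unlink(tmp_path)


def demo() -> dict:
    with tempfile.TemporaryDirectory() as table_dir:
        return {
            # The first writer claims version 2.
            "first": commit_by_rename(table_dir, "v2.metadata.json", b"{}"),
            # A second writer using the same name is rejected.
            "same_path": commit_by_rename(table_dir, "v2.metadata.json", b"{}"),
            # A writer using a different name for the same version is not.
            "other_path": commit_by_rename(table_dir, "v2.gz.metadata.json", b"{}"),
        }
```

In `demo()`, the writer that picks a different file name for the same version
is not stopped by the existing file, so the conflict goes undetected.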

Ryan

On Mon, Apr 28, 2025 at 9:47 AM Kevin Liu <kevinjq...@apache.org> wrote:

> Thanks for bringing this up, Micah!
>
> I think it's better to treat `.json.gz` as the "default" file scheme and
> `.gz.json` as the "legacy".
>
> I agree with the other points brought up here. Across the broader
> ecosystem, I think `.json.gz` is used more often. DuckDB, for example, can
> automatically detect compression from a `.json.gz` suffix, but not from
> the reversed `.gz.json` form.
> See https://duckdb.org/docs/stable/data/json/loading_json#parameters
>
> Best,
> Kevin Liu
>
>
> On Sun, Apr 27, 2025 at 11:54 PM Fokko Driesprong <fo...@apache.org>
> wrote:
>
>> Hey Micah,
>>
>> For some reason, your email ended up in my spam box 😨
>>
>> There is a reason for everything!
>>
>>> .gz.metadata.json is quite uncommon and can't be read by most existing
>>> tools. Would it be better to support .metadata.json.gz and treat
>>> .gz.metadata.json as legacy for backward compatibility?
>>
>>
>> The Java client supports both
>> <https://github.com/apache/iceberg/blob/dc26b72ad016840b79d62bf8a84b7f2109e9b71b/core/src/test/java/org/apache/iceberg/TableMetadataParserCodecTest.java#L29-L40>.
>> I looked into this years ago, and if I recall correctly, the
>> .gz.metadata.json naming was chosen to bypass Hadoop's decompressor
>> <https://github.com/apache/iceberg/pull/258/>. Hadoop would detect the
>> .gz and handle all the (de)compression, which we wanted to do ourselves.
>>
>>> gzip is becoming increasingly outdated due to its lack of support for
>>> modern CPUs. New algorithms like zstd are gaining popularity, so should
>>> we consider allowing users to use .metadata.json.zst as well?
>>
>>
>> Yes, I think that would make a lot of sense.
>>
>> Kind regards,
>> Fokko
>>
>>
>>
>>
>> On Mon, Apr 28, 2025 at 08:41, Xuanwo <xua...@apache.org> wrote:
>>
>>> I've copied my comments from GitHub here for a broader discussion:
>>>
>>>
>>>
>>> Hi, I have two concerns about this change:
>>>
>>>    - .gz.metadata.json is quite uncommon and can't be read by most
>>>    existing tools. Would it be better to support .metadata.json.gz and
>>>    treat .gz.metadata.json as legacy for backward compatibility?
>>>    - gzip is becoming increasingly outdated due to its lack of support
>>>    for modern CPUs. New algorithms like zstd are gaining popularity, so
>>>    should we consider allowing users to use .metadata.json.zst as well?
>>>
>>>
>>> On Sun, Apr 27, 2025, at 07:36, Micah Kornfield wrote:
>>>
>>> I created https://github.com/apache/iceberg/pull/12598 to document this
>>> feature.  Kevin Liu already took a look, but I would like to get more eyes
>>> on it before starting a vote for merging.
>>>
>>> Thanks,
>>> Micah
>>>
>>> Xuanwo
>>>
>>> https://xuanwo.io/
>>>
>>>
