> Perhaps this is a good default on the catalog side when creating new
> metadata json.

+1 for this because I think it's an easy performance win for tables with
large metadata. Is there any reason not to have
`write.metadata.compression-codec` default to `gzip`? I'm curious whether
there was a reason it currently defaults to `none`.

On Fri, Mar 21, 2025 at 1:43 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> This reminds me that GZipped metadata files are not covered in the spec.
> I opened https://github.com/apache/iceberg/pull/12598 to try to document
> them (feedback welcome).
>
> On Mon, Feb 17, 2025 at 2:35 PM Kevin Liu <kevinjq...@apache.org> wrote:
>
>> +1, json with no whitespace sounds like a reasonable default. But if
>> saving storage space and network is the main goal, then setting
>> `write.metadata.compression-codec` to `gzip` is way more impactful. Perhaps
>> this is a good default on the catalog side when creating new metadata json.
>>
>> Best,
>> Kevin Liu
>>
>> On Mon, Feb 17, 2025 at 12:19 PM Ian Streeter <i...@snowplow.io.invalid>
>> wrote:
>>
>>> The numbers I shared were for uncompressed files.
>>>
>>> I am embarrassed to say I had not noticed there is an option
>>> `write.metadata.compression-codec`.  I had it set to the default `none`,
>>> and I reckon many other Iceberg users will too.
>>>
>>> Here are some updated numbers for my example metadata file:
>>>
>>> - Uncompressed with whitespace: 53.6 MB
>>> - Uncompressed, no whitespace: 41.4 MB
>>> - Gzipped, with whitespace: 5.36 MB
>>> - Gzipped, no whitespace: 5.13 MB
>>>
>>> So there is a 4.3% improvement in dropping whitespace for a gzipped
>>> file. I admit this is less improvement than I originally thought.
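
For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the relative savings from the four sizes listed above. The dictionary keys and the helper name `reduction_pct` are made up for illustration; only the sizes come from the thread.

```python
# File sizes in MB, copied from the list above.
sizes_mb = {
    "pretty": 53.6,
    "compact": 41.4,
    "pretty_gzip": 5.36,
    "compact_gzip": 5.13,
}


def reduction_pct(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


# Effect of each change, measured against the relevant baseline.
whitespace_only = reduction_pct(sizes_mb["pretty"], sizes_mb["compact"])
gzip_only = reduction_pct(sizes_mb["pretty"], sizes_mb["pretty_gzip"])
whitespace_after_gzip = reduction_pct(
    sizes_mb["pretty_gzip"], sizes_mb["compact_gzip"]
)

print(f"drop whitespace, uncompressed: {whitespace_only:.1f}%")
print(f"gzip, whitespace kept:         {gzip_only:.1f}%")
print(f"drop whitespace, gzipped:      {whitespace_after_gzip:.1f}%")
```

On these numbers, dropping whitespace alone saves roughly 23%, gzip alone saves 90%, and dropping whitespace on top of gzip saves the 4.3% quoted above.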
>>>
>>> But even so... I still think this sounds like an easy win, especially if
>>> many users (like myself) didn't know to enable compression.
>>>
>>> On Mon, 17 Feb 2025 at 19:51, Steve Zhang
>>> <hongyue_zh...@apple.com.invalid> wrote:
>>>
>>>> +1. Setting the table property `write.metadata.compression-codec` to
>>>> `gzip` is usually suggested to reduce metadata size, but dropping
>>>> whitespace can still help here.
>>>>
>>>> Thanks,
>>>> Steve Zhang
>>>>
>>>>
>>>>
>>>> On Feb 17, 2025, at 8:32 AM, Fokko Driesprong <fo...@apache.org> wrote:
>>>>
>>>> Hey Ian,
>>>>
>>>> Thanks for raising this. Do you know whether the numbers you mention
>>>> were for compressed or uncompressed files?
>>>>
>>>>> I have read other issues in github which mention gigabyte-scale
>>>>> metadata files.
>>>>
>>>>
>>>> This sounds like a bad practice, and that table probably needs some
>>>> maintenance.
>>>>
>>>> I don't have the historical context of why we produce pretty JSON. I
>>>> think this would be an easy optimization, and I agree that making them
>>>> easily consumable by humans afterward is trivial. FWIW, PyIceberg also
>>>> produces unpretty JSON.
>>>>
>>>> Kind regards,
>>>> Fokko
>>>>
>>>>
>>>> On Mon, 17 Feb 2025 at 16:48, Ian Streeter <i...@snowplow.io.invalid>
>>>> wrote:
>>>>
>>>>> Currently, metadata files are pretty-printed, with lots of newlines
>>>>> and whitespace indentation. This is the relevant line of code, which
>>>>> uses the Jackson default pretty printer:
>>>>> https://github.com/apache/iceberg/blob/abb47830e7df7dc2ae93c74b0ad97f06cdd37aad/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L131
>>>>>
>>>>> If we could write metadata files without redundant whitespace, then it
>>>>> would save some storage space, and network overhead.
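
To make the comparison concrete, here is a rough Python sketch (not the Iceberg or Jackson code path) that serializes the same document with and without indentation and then gzips both. The `metadata` dictionary is a synthetic, heavily simplified stand-in for a real metadata file, so the absolute sizes mean nothing; only the pattern is of interest.

```python
import gzip
import json

# Synthetic stand-in for a table metadata document. The structure is
# simplified for illustration and is not a valid Iceberg metadata file.
metadata = {
    "format-version": 2,
    "snapshots": [
        {
            "snapshot-id": i,
            "timestamp-ms": 1739800000000 + i,
            "summary": {"operation": "append", "added-data-files": "1"},
        }
        for i in range(1000)
    ],
}

# Pretty-printed: newlines plus two-space indentation.
pretty = json.dumps(metadata, indent=2).encode("utf-8")
# Compact: no whitespace after separators, everything on one line.
compact = json.dumps(metadata, separators=(",", ":")).encode("utf-8")

sizes = {
    "pretty": len(pretty),
    "compact": len(compact),
    "pretty+gzip": len(gzip.compress(pretty)),
    "compact+gzip": len(gzip.compress(compact)),
}
for name, size in sizes.items():
    print(f"{name:>13}: {size} bytes")
```

Both encodings parse back to the same document, so readers are unaffected by the choice; only the byte count changes.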
>>>>>
>>>>> This will have the most impact for tables with large metadata files.
>>>>> For example, I have seen a metadata file which was 53.6 MB. After
>>>>> removing whitespace, this was reduced to 41.4 MB. I have read other
>>>>> issues in github which mention gigabyte-scale metadata files.
>>>>>
>>>>> I cannot think of any downside of this suggested change. Metadata
>>>>> files are mainly read by machines not humans. And if a human does want to
>>>>> inspect a metadata file, then it is fairly easy to prettify a JSON file
>>>>> when needed.
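
As a sketch of how little is needed to prettify a file on demand, the Python helper below re-indents a metadata file for reading and also copes with gzip-compressed input. The function name `prettify_metadata` is hypothetical and this is not part of any Iceberg tooling.

```python
import gzip
import json
from pathlib import Path


def prettify_metadata(path: str) -> str:
    """Return the JSON document at `path` re-indented for human reading.

    Gzip input is detected by its two-byte magic number, so the file
    extension does not matter.
    """
    raw = Path(path).read_bytes()
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return json.dumps(json.loads(raw), indent=2)
```

For an uncompressed file, `python -m json.tool <file>` from the standard library gives a similar result without any custom code.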
>>>>>
>>>>> I opened this as an issue in github, and then took advice to move the
>>>>> discussion to this dev list.  See
>>>>> https://github.com/apache/iceberg/issues/12281
>>>>>
>>>>> I would appreciate hearing your thoughts.
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>
>>>>
