The numbers I shared were for uncompressed files.

I am embarrassed to say I had not noticed there is an option
`write.metadata.compression-codec`.  I had it set to the default `none`,
and I reckon many other Iceberg users will too.

Here are some updated numbers for my example metadata file:

- Uncompressed with whitespace: 53.6 MB
- Uncompressed, no whitespace: 41.4 MB
- Gzipped, with whitespace: 5.36 MB
- Gzipped, no whitespace: 5.13 MB

So dropping whitespace gives a 4.3% size reduction for a gzipped file.
I admit this is less of an improvement than I originally thought.
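
As a rough check of the arithmetic, the percentages can be reproduced from the four sizes listed above. This is only a sketch; the helper name `saving_pct` and the dictionary layout are illustrative, not part of any Iceberg tooling.

```python
# Sizes in MB, copied from the list above.
sizes = {
    ("none", "pretty"): 53.6,
    ("none", "compact"): 41.4,
    ("gzip", "pretty"): 5.36,
    ("gzip", "compact"): 5.13,
}


def saving_pct(before: float, after: float) -> float:
    """Relative size reduction, as a percentage of the original size."""
    return (before - after) / before * 100


uncompressed_saving = saving_pct(sizes[("none", "pretty")], sizes[("none", "compact")])
gzipped_saving = saving_pct(sizes[("gzip", "pretty")], sizes[("gzip", "compact")])

print(f"Uncompressed: {uncompressed_saving:.1f}% smaller without whitespace")  # 22.8%
print(f"Gzipped:      {gzipped_saving:.1f}% smaller without whitespace")       # 4.3%
```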

But even so... I still think this sounds like an easy win, especially if
many users (like myself) didn't know to enable compression.
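
For anyone who wants to repeat this kind of measurement, here is a minimal sketch using only the Python standard library. The document it builds is a made-up stand-in with repeated snapshot-like entries, not a real metadata file, so the ratios it prints will differ from the numbers above.

```python
import gzip
import json

# Hypothetical stand-in for a metadata file: a made-up structure with many
# repeated entries. Replace this with json.load() on a real file to measure it.
doc = {
    "format-version": 2,
    "snapshots": [
        {
            "snapshot-id": i,
            "timestamp-ms": 1739800000000 + i,
            "summary": {"operation": "append"},
        }
        for i in range(1000)
    ],
}

# Pretty-printed (indented) versus compact (no redundant whitespace).
pretty = json.dumps(doc, indent=2).encode("utf-8")
compact = json.dumps(doc, separators=(",", ":")).encode("utf-8")

sizes = {
    "pretty": len(pretty),
    "compact": len(compact),
    "pretty+gzip": len(gzip.compress(pretty)),
    "compact+gzip": len(gzip.compress(compact)),
}

for name, size in sizes.items():
    print(f"{name:>13}: {size} bytes")
```

Both encodings parse back to the same document, so the comparison only measures the cost of the whitespace.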

On Mon, 17 Feb 2025 at 19:51, Steve Zhang <hongyue_zh...@apple.com.invalid>
wrote:

> +1. Setting the table property `write.metadata.compression-codec` to gzip is
> usually suggested to reduce metadata size, but dropping whitespace can still
> help here.
>
> Thanks,
> Steve Zhang
>
>
>
> On Feb 17, 2025, at 8:32 AM, Fokko Driesprong <fo...@apache.org> wrote:
>
> Hey Ian,
>
> Thanks for raising this. The numbers you mention, do you know if this was
> compressed or uncompressed?
>
>> I have read other issues in github which mention gigabyte-scale metadata
>> files.
>
>
> This sounds like a bad practice, and that table probably needs some
> maintenance.
>
> I don't have the historical context of why we produce pretty JSON. I think
> this would be an easy optimization, and I agree that making them easily
> consumable by humans afterward is trivial. FWIW, PyIceberg also produces
> unpretty JSON.
>
> Kind regards,
> Fokko
>
>
> On Mon, 17 Feb 2025 at 16:48, Ian Streeter <i...@snowplow.io.invalid> wrote:
>
>> Currently, metadata files are pretty-printed, with lots of new-lines and
>> whitespace indentations.   This is the relevant line of code, which uses
>> the Jackson default pretty printer:
>> https://github.com/apache/iceberg/blob/abb47830e7df7dc2ae93c74b0ad97f06cdd37aad/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L131
>>
>> If we could write metadata files without redundant whitespace, then it
>> would save some storage space, and network overhead.
>>
>> This will have the most impact for tables with large metadata files. For
>> example, I have seen a metadata file which was 53.6MB. After removing
>> whitespace, this was reduced to 41.4MB. I have read other issues in github
>> which mention gigabyte-scale metadata files.
>>
>> I cannot think of any downside of this suggested change. Metadata files
>> are mainly read by machines not humans. And if a human does want to inspect
>> a metadata file, then it is fairly easy to prettify a JSON file when needed.
>>
>> I opened this as an issue in github, and then took advice to move the
>> discussion to this dev list.  See
>> https://github.com/apache/iceberg/issues/12281
>>
>> I would appreciate hearing your thoughts.
>> Thanks,
>> Ian
>>
>>
>
