Hey Ian,

Thanks for raising this. For the numbers you mention, do you know whether
the file was compressed or uncompressed?
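
A rough way to check how much of the saving survives compression, sketched in Python. The document below is a made-up stand-in loosely modeled on a metadata file (the entries and sizes are illustrative assumptions, not taken from a real table), and gzip is only one possible codec:

```python
import gzip
import json

# Hypothetical stand-in for a metadata file: a list of snapshot-like entries.
# Illustrative only; this is not a complete or real Iceberg metadata document.
metadata = {
    "format-version": 2,
    "snapshots": [
        {
            "snapshot-id": i,
            "timestamp-ms": 1700000000000 + i,
            "summary": {"operation": "append"},
        }
        for i in range(1000)
    ],
}

# Same content, serialized with and without indentation/whitespace.
pretty = json.dumps(metadata, indent=2).encode("utf-8")
compact = json.dumps(metadata, separators=(",", ":")).encode("utf-8")

sizes = {
    "pretty": len(pretty),
    "compact": len(compact),
    "pretty_gz": len(gzip.compress(pretty)),
    "compact_gz": len(gzip.compress(compact)),
}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

Repeated whitespace tends to compress well, so I would expect the gap between the two gzipped sizes to be much smaller than the gap between the raw sizes, but that is worth measuring on a real file.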

> I have read other issues in github which mention gigabyte-scale metadata
> files.


This sounds like a bad practice, and that table probably needs some
maintenance.

I don't have the historical context of why we produce pretty JSON. I think
this would be an easy optimization, and I agree that making them easily
consumable by humans afterward is trivial. FWIW, PyIceberg also produces
unpretty JSON.
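
To illustrate the point about prettifying on demand, here is a minimal sketch in Python using only the standard `json` module. The helper names `to_compact` and `prettify` and the sample document are hypothetical, and this is not how Iceberg or PyIceberg implement their writers:

```python
import json


def to_compact(doc) -> str:
    """Serialize without any redundant whitespace."""
    return json.dumps(doc, separators=(",", ":"))


def prettify(text: str) -> str:
    """Re-indent an existing JSON string for human inspection."""
    return json.dumps(json.loads(text), indent=2)


# Hypothetical sample document, for illustration only.
doc = {"format-version": 2, "properties": {"owner": "example"}}

compact = to_compact(doc)   # what would be written to storage
pretty = prettify(compact)  # what a human would look at when debugging
print(compact)
print(pretty)
```

From a shell, `python -m json.tool <file>` does the same re-indentation for a file on disk.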

Kind regards,
Fokko


On Mon, 17 Feb 2025 at 16:48, Ian Streeter <i...@snowplow.io.invalid> wrote:

> Currently, metadata files are pretty-printed, with lots of newlines and
> whitespace indentation. This is the relevant line of code, which uses
> the Jackson default pretty printer:
> https://github.com/apache/iceberg/blob/abb47830e7df7dc2ae93c74b0ad97f06cdd37aad/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L131
>
> If we could write metadata files without redundant whitespace, then it
> would save some storage space, and network overhead.
>
> This will have the most impact for tables with large metadata files. For
> example, I have seen a metadata file which was 53.6MB. After removing
> whitespace, this was reduced to 41.4MB. I have read other issues in github
> which mention gigabyte-scale metadata files.
>
> I cannot think of any downside of this suggested change. Metadata files
> are mainly read by machines not humans. And if a human does want to inspect
> a metadata file, then it is fairly easy to prettify a JSON file when needed.
>
> I opened this as an issue in github, and then took advice to move the
> discussion to this dev list.  See
> https://github.com/apache/iceberg/issues/12281
>
> I would appreciate hearing your thoughts.
> Thanks,
> Ian
>
>
