+1. Setting the table property `write.metadata.compression-codec` to gzip is 
usually suggested to reduce metadata size, but dropping whitespace can still 
help here. 
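
As a rough illustration of how the two measures interact, the Python sketch 
below compares pretty-printed and compact JSON sizes, before and after gzip. 
It is not Iceberg code: the `metadata` document is a made-up stand-in, and 
real metadata files will show different ratios.

```python
import gzip
import json

# Hypothetical stand-in for a table metadata file. The field names and values
# are assumptions for illustration; real Iceberg metadata is richer.
metadata = {
    "format-version": 2,
    "snapshots": [
        {
            "snapshot-id": i,
            "timestamp-ms": 1739800000000 + i,
            "summary": {"operation": "append"},
        }
        for i in range(1000)
    ],
}

# Pretty-printed output: newlines plus two-space indentation.
pretty = json.dumps(metadata, indent=2).encode("utf-8")

# Compact output: no whitespace after separators.
compact = json.dumps(metadata, separators=(",", ":")).encode("utf-8")

sizes = {
    "pretty": len(pretty),
    "compact": len(compact),
    "pretty+gzip": len(gzip.compress(pretty)),
    "compact+gzip": len(gzip.compress(compact)),
}

for name, size in sizes.items():
    print(f"{name:>13}: {size} bytes")
```

Both encodings parse back to the same document, so nothing is lost by 
dropping the whitespace. A compact file can be made readable again with 
`python -m json.tool <file>`.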

Thanks,
Steve Zhang



> On Feb 17, 2025, at 8:32 AM, Fokko Driesprong <fo...@apache.org> wrote:
> 
> Hey Ian,
> 
> Thanks for raising this. The numbers you mention, do you know if this was 
> compressed or uncompressed?
> 
>> I have read other issues on GitHub which mention gigabyte-scale metadata 
>> files.
> 
> This sounds like a bad practice, and that table probably needs some 
> maintenance.
> 
> I don't have the historical context of why we produce pretty JSON. I think 
> this would be an easy optimization, and I agree that making them easily 
> consumable by humans afterward is trivial. FWIW, PyIceberg also produces 
> unpretty JSON.
> 
> Kind regards,
> Fokko
> 
> 
> On Mon, Feb 17, 2025 at 16:48, Ian Streeter <i...@snowplow.io.invalid> wrote:
>> Currently, metadata files are pretty-printed, with lots of newlines and 
>> whitespace indentation. This is the relevant line of code, which uses the 
>> Jackson default pretty printer: 
>> https://github.com/apache/iceberg/blob/abb47830e7df7dc2ae93c74b0ad97f06cdd37aad/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L131
>> 
>> If we could write metadata files without redundant whitespace, it would 
>> save some storage space and network overhead.
>> 
>> This will have the most impact for tables with large metadata files. For 
>> example, I have seen a metadata file which was 53.6MB. After removing 
>> whitespace, this was reduced to 41.4MB. I have read other issues on GitHub 
>> which mention gigabyte-scale metadata files.
>> 
>> I cannot think of any downside to this suggested change. Metadata files are 
>> mainly read by machines, not humans. And if a human does want to inspect a 
>> metadata file, then it is fairly easy to prettify a JSON file when needed.
>> 
>> I opened this as an issue on GitHub, and then took advice to move the 
>> discussion to this dev list. See 
>> https://github.com/apache/iceberg/issues/12281
>> 
>> I would appreciate hearing your thoughts.
>> Thanks,
>> Ian
>> 
