This reminds me that GZipped metadata files are not covered in the spec. I opened https://github.com/apache/iceberg/pull/12598 to try to document them (feedback welcome).
On Mon, Feb 17, 2025 at 2:35 PM Kevin Liu <kevinjq...@apache.org> wrote:

> +1, json with no whitespace sounds like a reasonable default. But if
> saving storage space and network is the main goal, then setting
> `write.metadata.compression-codec` to `gzip` is way more impactful. Perhaps
> this is a good default on the catalog side when creating new metadata json.
>
> Best,
> Kevin Liu
>
> On Mon, Feb 17, 2025 at 12:19 PM Ian Streeter <i...@snowplow.io.invalid>
> wrote:
>
>> The numbers I shared were for uncompressed files.
>>
>> I am embarrassed to say I had not noticed there is an option
>> `write.metadata.compression-codec`. I had it set to the default `none`,
>> and I reckon many other Iceberg users will too.
>>
>> Here are some updated numbers for my example metadata file:
>>
>> - Uncompressed, with whitespace: 53.6 MB
>> - Uncompressed, no whitespace: 41.4 MB
>> - Gzipped, with whitespace: 5.36 MB
>> - Gzipped, no whitespace: 5.13 MB
>>
>> So there is a 4.3% improvement in dropping whitespace for a gzipped
>> file. I admit this is less improvement than I originally thought.
>>
>> But even so... I still think this sounds like an easy win, especially if
>> many users (like myself) didn't know to enable compression.
>>
>> On Mon, 17 Feb 2025 at 19:51, Steve Zhang <hongyue_zh...@apple.com.invalid>
>> wrote:
>>
>>> +1. Configuring the table property `write.metadata.compression-codec`
>>> to gzip is usually suggested to reduce metadata size, but dropping
>>> whitespace can still help here.
>>>
>>> Thanks,
>>> Steve Zhang
>>>
>>> On Feb 17, 2025, at 8:32 AM, Fokko Driesprong <fo...@apache.org> wrote:
>>>
>>> Hey Ian,
>>>
>>> Thanks for raising this. The numbers you mention, do you know if this
>>> was compressed or uncompressed?
>>>
>>>> I have read other issues in github which mention gigabyte-scale
>>>> metadata files.
>>>
>>> This sounds like a bad practice, and that table probably needs some
>>> maintenance.
>>>
>>> I don't have the historical context of why we produce pretty JSON. I
>>> think this would be an easy optimization, and I agree that making them
>>> easily consumable by humans afterward is trivial. FWIW, PyIceberg also
>>> produces unpretty JSON.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> On Mon, 17 Feb 2025 at 16:48, Ian Streeter <i...@snowplow.io.invalid>
>>> wrote:
>>>
>>>> Currently, metadata files are pretty-printed, with lots of new-lines
>>>> and whitespace indentations. This is the relevant line of code, which
>>>> uses the Jackson default pretty printer:
>>>> https://github.com/apache/iceberg/blob/abb47830e7df7dc2ae93c74b0ad97f06cdd37aad/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L131
>>>>
>>>> If we could write metadata files without redundant whitespace, then it
>>>> would save some storage space and network overhead.
>>>>
>>>> This will have the most impact for tables with large metadata files.
>>>> For example, I have seen a metadata file which was 53.6 MB. After
>>>> removing whitespace, this was reduced to 41.4 MB. I have read other
>>>> issues in github which mention gigabyte-scale metadata files.
>>>>
>>>> I cannot think of any downside of this suggested change. Metadata files
>>>> are mainly read by machines, not humans. And if a human does want to
>>>> inspect a metadata file, then it is fairly easy to prettify a JSON file
>>>> when needed.
>>>>
>>>> I opened this as an issue in github, and then took advice to move the
>>>> discussion to this dev list. See
>>>> https://github.com/apache/iceberg/issues/12281
>>>>
>>>> I would appreciate hearing your thoughts.
>>>> Thanks,
>>>> Ian
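
For anyone who wants to try the four-way comparison from this thread (pretty vs. compact JSON, with and without gzip) on their own data, here is a minimal Python sketch. It is not the Iceberg or Jackson code path. The `metadata_sizes` helper and the `sample` document are hypothetical and only illustrate the measurement. Python's `indent=2` output is similar to, but not byte-identical with, the Jackson default pretty printer, so exact ratios will differ from the numbers quoted above.

```python
import gzip
import json


def metadata_sizes(metadata: dict) -> dict:
    """Return the byte size of one JSON document under four encodings.

    Hypothetical helper for illustration; not part of Iceberg or PyIceberg.
    """
    # Pretty-printed: newlines plus two-space indentation.
    pretty = json.dumps(metadata, indent=2).encode("utf-8")
    # Compact: no whitespace after separators.
    compact = json.dumps(metadata, separators=(",", ":")).encode("utf-8")
    return {
        "pretty": len(pretty),
        "compact": len(compact),
        "pretty_gzip": len(gzip.compress(pretty)),
        "compact_gzip": len(gzip.compress(compact)),
    }


# Synthetic stand-in for a metadata file (assumption: NOT real Iceberg metadata).
sample = {
    "format-version": 2,
    "snapshots": [
        {
            "snapshot-id": i,
            "timestamp-ms": 1739800000000 + i,
            "summary": {"operation": "append"},
        }
        for i in range(1000)
    ],
}

sizes = metadata_sizes(sample)
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

To measure a real file instead, load it with `json.load` and pass the resulting dict to `metadata_sizes`. The result should be read as a rough indication only, since whitespace savings depend on how deeply nested the document is.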