> Perhaps this is a good default on the catalog side when creating new metadata json.
+1 for this b/c I think it's an easy performance win for tables with large
metadata. Is there any reason not to have `write.metadata.compression-codec`
default to `gzip`? I'm curious if there was a reason it's currently set to
`none`.

On Fri, Mar 21, 2025 at 1:43 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> This reminds me that GZipped metadata files are not covered in the spec.
> I opened https://github.com/apache/iceberg/pull/12598 to try to document
> them (feedback welcome).
>
> On Mon, Feb 17, 2025 at 2:35 PM Kevin Liu <kevinjq...@apache.org> wrote:
>
>> +1, json with no whitespace sounds like a reasonable default. But if
>> saving storage space and network is the main goal, then setting
>> `write.metadata.compression-codec` to `gzip` is way more impactful. Perhaps
>> this is a good default on the catalog side when creating new metadata json.
>>
>> Best,
>> Kevin Liu
>>
>> On Mon, Feb 17, 2025 at 12:19 PM Ian Streeter <i...@snowplow.io.invalid>
>> wrote:
>>
>>> The numbers I shared were for uncompressed files.
>>>
>>> I am embarrassed to say I had not noticed there is an option
>>> `write.metadata.compression-codec`. I had it set to the default `none`,
>>> and I reckon many other Iceberg users will too.
>>>
>>> Here are some updated numbers for my example metadata file:
>>>
>>> - Uncompressed, with whitespace: 53.6 MB
>>> - Uncompressed, no whitespace: 41.4 MB
>>> - Gzipped, with whitespace: 5.36 MB
>>> - Gzipped, no whitespace: 5.13 MB
>>>
>>> So there is a 4.3% improvement in dropping whitespace for a gzipped
>>> file. I admit this is less improvement than I originally thought.
>>>
>>> But even so... I still think this sounds like an easy win, especially if
>>> many users (like myself) didn't know to enable compression.
>>>
>>> On Mon, 17 Feb 2025 at 19:51, Steve Zhang
>>> <hongyue_zh...@apple.com.invalid> wrote:
>>>
>>>> +1. Configuring the table property `write.metadata.compression-codec`
>>>> to `gzip` is usually suggested to reduce metadata size, but dropping
>>>> whitespace can still help here.
>>>>
>>>> Thanks,
>>>> Steve Zhang
>>>>
>>>> On Feb 17, 2025, at 8:32 AM, Fokko Driesprong <fo...@apache.org> wrote:
>>>>
>>>> Hey Ian,
>>>>
>>>> Thanks for raising this. The numbers you mention, do you know if this
>>>> was compressed or uncompressed?
>>>>
>>>>> I have read other issues in github which mention gigabyte-scale
>>>>> metadata files.
>>>>
>>>> This sounds like a bad practice, and that table probably needs some
>>>> maintenance.
>>>>
>>>> I don't have the historical context of why we produce pretty JSON. I
>>>> think this would be an easy optimization, and I agree that making them
>>>> easily consumable by humans afterward is trivial. FWIW, PyIceberg also
>>>> produces unpretty JSON.
>>>>
>>>> Kind regards,
>>>> Fokko
>>>>
>>>> On Mon, 17 Feb 2025 at 16:48, Ian Streeter <i...@snowplow.io.invalid>
>>>> wrote:
>>>>
>>>>> Currently, metadata files are pretty-printed, with lots of new-lines
>>>>> and whitespace indentations. This is the relevant line of code, which
>>>>> uses the Jackson default pretty printer:
>>>>> https://github.com/apache/iceberg/blob/abb47830e7df7dc2ae93c74b0ad97f06cdd37aad/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L131
>>>>>
>>>>> If we could write metadata files without redundant whitespace, then it
>>>>> would save some storage space and network overhead.
>>>>>
>>>>> This will have the most impact for tables with large metadata files.
>>>>> For example, I have seen a metadata file which was 53.6MB. After removing
>>>>> whitespace, this was reduced to 41.4MB. I have read other issues in github
>>>>> which mention gigabyte-scale metadata files.
>>>>>
>>>>> I cannot think of any downside of this suggested change. Metadata
>>>>> files are mainly read by machines, not humans. And if a human does want to
>>>>> inspect a metadata file, then it is fairly easy to prettify a JSON file
>>>>> when needed.
>>>>>
>>>>> I opened this as an issue in github, and then took advice to move the
>>>>> discussion to this dev list. See
>>>>> https://github.com/apache/iceberg/issues/12281
>>>>>
>>>>> I would appreciate hearing your thoughts.
>>>>> Thanks,
>>>>> Ian
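
For anyone who wants to reproduce the kind of comparison discussed in this thread, here is a minimal Python sketch. It is not Iceberg code: the `metadata` dict is a made-up stand-in for a table metadata document, the helper names are hypothetical, and Python's `json` and `gzip` modules stand in for Jackson and for the `write.metadata.compression-codec` setting, so the absolute sizes will not match the numbers quoted above. The last two lines only redo the arithmetic on the figures Ian reported.

```python
import gzip
import json


def encoded_sizes(doc):
    """Return the byte size of `doc` serialized four ways."""
    pretty = json.dumps(doc, indent=2).encode("utf-8")
    # separators=(",", ":") drops the spaces json.dumps inserts by default.
    compact = json.dumps(doc, separators=(",", ":")).encode("utf-8")
    return {
        "pretty": len(pretty),
        "compact": len(compact),
        "pretty_gzip": len(gzip.compress(pretty)),
        "compact_gzip": len(gzip.compress(compact)),
    }


def relative_saving(before, after):
    """Fractional size reduction going from `before` to `after`."""
    return (before - after) / before


# Hypothetical stand-in for a metadata document (assumption, not a real table).
metadata = {
    "format-version": 2,
    "snapshots": [
        {
            "snapshot-id": i,
            "timestamp-ms": 1700000000000 + i,
            "summary": {"operation": "append"},
        }
        for i in range(1000)
    ],
}

sizes = encoded_sizes(metadata)
for name, size in sizes.items():
    print(f"{name}: {size} bytes")

# Same arithmetic applied to the sizes (in MB) reported in the thread.
print(f"uncompressed: {relative_saving(53.6, 41.4):.1%}")  # prints "uncompressed: 22.8%"
print(f"gzipped: {relative_saving(5.36, 5.13):.1%}")  # prints "gzipped: 4.3%"
```

On the reported figures, dropping whitespace saves about 22.8% uncompressed but only about 4.3% once the file is gzipped, which matches the observation above that compression is the larger win.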