+1, JSON with no whitespace sounds like a reasonable default. But if saving storage space and network bandwidth is the main goal, then setting `write.metadata.compression-codec` to `gzip` has a much bigger impact. Perhaps this would be a good default on the catalog side when creating new metadata JSON.
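For anyone who wants to try the comparison on their own files, here is a rough Python sketch of the four cases discussed in this thread (with/without whitespace, with/without gzip). The `metadata` dict is a made-up stand-in, not a real Iceberg metadata file, so the exact ratios will differ from the numbers quoted below; only the standard-library `json` and `gzip` modules are used.

```python
import gzip
import json

# Hypothetical stand-in for a table metadata document. The field names are
# illustrative only and are not meant to match the Iceberg spec exactly.
metadata = {
    "format-version": 2,
    "snapshots": [
        {
            "snapshot-id": i,
            "timestamp-ms": 1739800000000 + i,
            "summary": {"operation": "append", "added-data-files": "1"},
            "manifest-list": f"s3://bucket/metadata/snap-{i}.avro",
        }
        for i in range(5000)
    ],
}

# Pretty-printed (indented) versus compact (no redundant whitespace) JSON.
pretty = json.dumps(metadata, indent=2).encode("utf-8")
compact = json.dumps(metadata, separators=(",", ":")).encode("utf-8")

sizes = {
    "uncompressed, with whitespace": len(pretty),
    "uncompressed, no whitespace": len(compact),
    "gzipped, with whitespace": len(gzip.compress(pretty)),
    "gzipped, no whitespace": len(gzip.compress(compact)),
}

for label, size in sizes.items():
    print(f"{label}: {size} bytes")

# A human can still recover a readable view from the compact form on demand.
readable = json.dumps(json.loads(compact), indent=2)
```

The last line is the "prettify when needed" step: the compact file carries the same data, so re-indenting it for inspection is a one-liner.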
Best,
Kevin Liu

On Mon, Feb 17, 2025 at 12:19 PM Ian Streeter <i...@snowplow.io.invalid> wrote:

> The numbers I shared were for uncompressed files.
>
> I am embarrassed to say I had not noticed there is an option
> `write.metadata.compression-codec`. I had it set to the default `none`,
> and I reckon many other Iceberg users will too.
>
> Here are some updated numbers for my example metadata file:
>
> - Uncompressed, with whitespace: 53.6 MB
> - Uncompressed, no whitespace: 41.4 MB
> - Gzipped, with whitespace: 5.36 MB
> - Gzipped, no whitespace: 5.13 MB
>
> So there is a 4.3% improvement in dropping whitespace for a gzipped file.
> I admit this is less improvement than I originally thought.
>
> But even so... I still think this sounds like an easy win, especially if
> many users (like myself) didn't know to enable compression.
>
> On Mon, 17 Feb 2025 at 19:51, Steve Zhang <hongyue_zh...@apple.com.invalid>
> wrote:
>
>> +1. Setting the table property `write.metadata.compression-codec` to gzip
>> is usually suggested to reduce metadata size, but dropping whitespace can
>> still help here.
>>
>> Thanks,
>> Steve Zhang
>>
>> On Feb 17, 2025, at 8:32 AM, Fokko Driesprong <fo...@apache.org> wrote:
>>
>> Hey Ian,
>>
>> Thanks for raising this. The numbers you mention, do you know if this was
>> compressed or uncompressed?
>>
>>> I have read other issues in GitHub which mention gigabyte-scale metadata
>>> files.
>>
>> This sounds like a bad practice, and that table probably needs some
>> maintenance.
>>
>> I don't have the historical context of why we produce pretty JSON. I
>> think this would be an easy optimization, and I agree that making them
>> easily consumable by humans afterward is trivial. FWIW, PyIceberg also
>> produces unpretty JSON.
>>
>> Kind regards,
>> Fokko
>>
>> On Mon, 17 Feb 2025 at 16:48, Ian Streeter <i...@snowplow.io.invalid>
>> wrote:
>>
>>> Currently, metadata files are pretty-printed, with lots of newlines and
>>> whitespace indentation. This is the relevant line of code, which uses
>>> the Jackson default pretty printer:
>>> https://github.com/apache/iceberg/blob/abb47830e7df7dc2ae93c74b0ad97f06cdd37aad/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L131
>>>
>>> If we could write metadata files without redundant whitespace, then it
>>> would save some storage space and network overhead.
>>>
>>> This will have the most impact for tables with large metadata files.
>>> For example, I have seen a metadata file which was 53.6 MB. After removing
>>> whitespace, this was reduced to 41.4 MB. I have read other issues in GitHub
>>> which mention gigabyte-scale metadata files.
>>>
>>> I cannot think of any downside to this suggested change. Metadata files
>>> are mainly read by machines, not humans. And if a human does want to inspect
>>> a metadata file, then it is fairly easy to prettify a JSON file when needed.
>>>
>>> I opened this as an issue in GitHub, and then took advice to move the
>>> discussion to this dev list. See
>>> https://github.com/apache/iceberg/issues/12281
>>>
>>> I would appreciate hearing your thoughts.
>>> Thanks,
>>> Ian