On Tue, 20 Sept 2022 at 17:29, Alvaro Herrera <alvhe...@alvh.no-ip.org> wrote: > > On 2022-Sep-20, Matthias van de Meent wrote: > > > Allow me to add: compressability > > > > In the thread surrounding [0] there were complaints about the size of > > catalogs, and specifically the template database. Significant parts of > > that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which > > consists mostly of serialized Nodes. If we're going to replace our > > current NodeToText infrastructure, we'd better know we can effectively > > compress this data. > > True. Currently, the largest ev_action values compress pretty well. I > think if we wanted this to be more succint, we would have to invent some > binary format -- perhaps something like Protocol Buffers: it'd be stored > in the binary format in catalogs, but for output it would be converted > into something easy to read (we already do this for > pg_statistic_ext_data for example). We'd probably lose compressibility, > but that'd be okay because the binary format would already remove most > of the redundancy by nature. > > Do we want to go there?
I don't think that a binary format would be much better for debugging/fixing than an optimization of the current textual format when combined with compression. As I mentioned in that thread, there is a lot of improvement possible with the existing format, and I think any debugging of serialized nodes would greatly benefit from using a textual format. Then again, I also agree that this argument doesn't hold it's weight when storage and output formats are going to be different. I trust that any new tooling introduced as a result of this thread will be better than what we have right now. As for best format: I don't know. The current format is usable, and a better format would not store any data for default values. JSON can do that, but I could think of many formats that could do the same (Smile, BSON, xml, etc.). I do not think that protobuf is the best choice for storage, though, because it has its own rules on what it considers a default value and what it does or does not serialize: zero is considered the only default for numbers, as is the empty string for text, etc. I think it is allright for general use, but with e.g. `location: -1` in just about every parse node we'd probably want to select our own values to ignore during (de)serialization of fields. Kind regards, Matthias van de Meent