Hi, I've spent a bit more time on this, mostly running tests to get a better idea of the practical benefits.

Firstly, I think there's a bug in ReorderBufferCompress() - it's legal for pglz_compress() to return -1. This can happen if the data is not compressible, and would not fit into the output buffer. The code can't just do elog(ERROR) in this case, it needs to handle that by storing the raw data. The attached fixup patch makes this work for me - I'm not claiming this is the best way to handle this, but it works.

FWIW I find it strange the tests included in the patch did not trigger this. That probably means the tests are not quite sufficient.

Now, to the testing. Attached are two scripts, testing different cases:

  test-columns.sh - Table with a variable number of 'float8' columns.

  test-toast.sh - Table with a single text column.

Each script sets up a publication/subscription on two instances, generates a certain amount of data (~1GB for columns, ~3.2GB for TOAST), waits for it to be replicated to the replica, and measures how much data was spilled to disk with the different compression methods (off, pglz and lz4). There are a couple more metrics, but those are irrelevant here.

For the "column" test, it looks like this (this is in MB):

     rows   columns   distribution    off   pglz    lz4
  ========================================================
   100000      1000   compressible    778     20      9
                      random          778    778     16
  --------------------------------------------------------
  1000000       100   compressible    916    116     62
                      random          916    916     67

It's very clear that for the "compressible" data (which just copies the same value into all columns), both pglz and lz4 can significantly reduce the amount of data. For 1000 columns it's 780MB -> 20MB/9MB, for 100 columns it's a bit less efficient, but still good.

For the "random" data (where every column gets a random value, but rows are copied), it's a very different story - pglz does not help at all, while lz4 still massively reduces the amount of spilled data.
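
To put numbers on the "column" table, here is a minimal sketch of the arithmetic. The function names are made up for illustration and are not part of the patch or the test scripts; the inputs are simply the MB figures from the table.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Reduction factor: how many times smaller the spilled data is
 * compared to the uncompressed ("off") run. Hypothetical helper.
 */
double
reduction_factor(double off_mb, double compressed_mb)
{
	assert(compressed_mb > 0.0);
	return off_mb / compressed_mb;
}

/* Percentage of the spill volume saved by compression. */
double
percent_saved(double off_mb, double compressed_mb)
{
	assert(off_mb > 0.0);
	return 100.0 * (1.0 - compressed_mb / off_mb);
}

/* Print the pglz and lz4 reduction factors for one table row. */
void
print_row(const char *label, double off_mb, double pglz_mb, double lz4_mb)
{
	printf("%-28s pglz %5.1fx  lz4 %5.1fx\n", label,
		   reduction_factor(off_mb, pglz_mb),
		   reduction_factor(off_mb, lz4_mb));
}
```

Calling print_row("1000 cols, compressible", 778, 20, 9) gives roughly 39x for pglz and 86x for lz4, while print_row("1000 cols, random", 778, 778, 16) gives 1x for pglz and about 49x for lz4.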

I think the explanation is very simple - for pglz, we compress each row on its own, there's no concept of streaming/context. If a row is compressible, it works fine, but when the row gets random, pglz can't compress it at all. For lz4, this does not matter, because with the streaming mode it still sees that rows are just repeated, and so can compress them efficiently.

For the TOAST test, the results look like this:

  distribution   repeats   toast    off   pglz    lz4
  =====================================================
  compressible     10000   lz4       14      2      1
                           pglz      40      4      3
                    1000   lz4       32     16      9
                           pglz      54     17     10
  -----------------------------------------------------
  random           10000   lz4     3305   3305   3157
                           pglz    3305   3305   3157
                    1000   lz4     3166   3162   1580
                           pglz    3334   3326   1745
  -----------------------------------------------------
  random2          10000   lz4     3305   3305   3157
                           pglz    3305   3305   3158
                    1000   lz4     3160   3156   3010
                           pglz    3334   3326   3172

The "repeats" value means how long the string is - it's the number of "md5" hashes added to the string. The number of rows is calculated to keep the total amount of data the same. The "toast" column tracks what compression was used for TOAST, I was wondering if it matters.

This time there are three data distributions - "compressible" means that each TOAST value is nicely compressible, "random" means each value is random (not compressible), but the rows are just copies of the same value (so on the whole there's a lot of redundancy). And "random2" means each row is random and unique (so not compressible at all).

The table shows that with compressible TOAST values, compressing the spill file is rather useless. The reason is that ReorderBufferCompress is handling raw TOAST data, which is already compressed. Yes, it may further reduce the amount of data, but it's negligible when compared to the original amount of data.

For the random cases, the spill compression is rather pointless. Yes, lz4 can reduce it to 1/2 for the shorter strings, but other than that it's not very useful.
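
The difference between compressing each row on its own and compressing with context from earlier rows can be illustrated with a toy cost model. This is only a sketch: it is a greedy LZ77-style size estimate, not pglz and not lz4, and MIN_MATCH, MATCH_COST and the "window" parameter are made-up values chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_MATCH	4	/* shortest back-reference worth emitting (assumed) */
#define MATCH_COST	3	/* assumed bytes per back-reference token */

/*
 * Estimate the encoded size of buf[start .. len) with a greedy LZ77-style
 * parse. Back-references may start at most "window" bytes before the
 * current position. Bytes in buf[0 .. start) act as prior context
 * (earlier rows) and are not charged.
 */
size_t
toy_encoded_size(const unsigned char *buf, size_t start, size_t len,
				 size_t window)
{
	size_t		pos = start;
	size_t		cost = 0;

	assert(start <= len);

	while (pos < len)
	{
		size_t		lo = (pos > window) ? pos - window : 0;
		size_t		best = 0;

		/* find the longest match that starts inside the window */
		for (size_t cand = lo; cand < pos; cand++)
		{
			size_t		n = 0;

			while (pos + n < len && buf[cand + n] == buf[pos + n])
				n++;
			if (n > best)
				best = n;
		}

		if (best >= MIN_MATCH)
		{
			cost += MATCH_COST;	/* one back-reference covers the match */
			pos += best;
		}
		else
		{
			cost += 1;			/* literal byte */
			pos += 1;
		}
	}
	return cost;
}

/* Fill buf with nrows copies of a row holding row_len distinct byte values. */
void
fill_repeated_rows(unsigned char *buf, size_t row_len, size_t nrows)
{
	assert(row_len <= 256);
	for (size_t i = 0; i < row_len * nrows; i++)
		buf[i] = (unsigned char) (i % row_len);
}
```

With two identical 64-byte rows (each row has no internal redundancy), the estimate for one row on its own is 64 bytes, all literals. With the previous row available as context it drops to 3 bytes, a single back-reference. If the window is cut to 32 bytes, shorter than the row, the previous copy is out of reach and the estimate is back at 64.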

For a while I was thinking this approach is flawed, because it only sees and compresses changes one by one, and that seeing a batch of changes would improve this (e.g. we'd see the copied rows). But I realized lz4 already does that (in the streaming mode at least), and yet it does not help very much. Presumably that depends on how large the context is. If the random string is long enough, it won't help.

So maybe this approach is fine, and doing the compression at a lower layer (for the whole file), would not really improve this. Even then we'd only see a limited amount of data.

Maybe the right answer to this is that compression does not help cases where most of the replicated data is TOAST, and that it can help cases with wide (and redundant) rows, or repeated rows. And that lz4 is a clearly superior choice. (This also raises the question of whether we want to support REORDER_BUFFER_STRAT_LZ4_REGULAR. I haven't looked into this, but doesn't that behave more like pglz, i.e. no context?)

FWIW when doing these tests, it made me realize how useful it would be to track both the "raw" and "spilled" amounts. That is before/after compression. It'd make calculating compression ratio much easier.

regards

-- 
Tomas Vondra
1727083455 1727083494 64kB off 100000 1000 random 39 816000000 2000000000 1926788895
1727083508 1727083555 64kB pglz 100000 1000 random 47 816000000 2000000000 912987792
1727083593 1727083630 64kB lz4 100000 1000 random 37 16555784 2000000000 14425752
1727083648 1727083685 64kB off 100000 1000 compressible 37 816000000 2000000000 1800588895
1727083699 1727083733 64kB pglz 100000 1000 compressible 34 20665177 2000000000 12547360
1727083745 1727083783 64kB lz4 100000 1000 compressible 38 9665751 2000000000 8318691
1727083801 1727083838 512kB off 100000 1000 random 37 816000000 2000000000 1926188895
1727083853 1727083897 512kB pglz 100000 1000 random 44 816000000 2000000000 331066026
1727083918 1727083955 512kB lz4 100000 1000 random 37 16548612 2000000000 14472439
1727083973 1727084005 512kB off 100000 1000 compressible 32 816000000 2000000000 1900588895
1727084019 1727084053 512kB pglz 100000 1000 compressible 34 20799851 2000000000 12544310
1727084064 1727084096 512kB lz4 100000 1000 compressible 32 9828681 2000000000 8318693
1727084114 1727084151 4MB off 100000 1000 random 37 816000000 2000000000 1927088895
1727084166 1727084211 4MB pglz 100000 1000 random 45 816000000 2000000000 332679062
1727084231 1727084268 4MB lz4 100000 1000 random 37 16720988 2000000000 14425897
1727084285 1727084318 4MB off 100000 1000 compressible 33 816000000 2000000000 1900588895
1727084332 1727084366 4MB pglz 100000 1000 compressible 34 20848682 2000000000 12540369
1727084377 1727084410 4MB lz4 100000 1000 compressible 33 9808910 2000000000 8318690
1727084427 1727084463 64kB off 1000000 100 random 36 960000000 2000000000 1939888896
1727084476 1727084519 64kB pglz 1000000 100 random 43 960000000 2000000000 12442367
1727084531 1727084566 64kB lz4 1000000 100 random 35 69792488 2000000000 12905812
1727084584 1727084617 64kB off 1000000 100 compressible 33 960000000 2000000000 1906888896
1727084629 1727084666 64kB pglz 1000000 100 compressible 37 124849430 2000000000 12385533
1727084679 1727084710 64kB lz4 1000000 100 compressible 31 65217839 2000000000 12336481
1727084727 1727084762 512kB off 1000000 100 random 35 960000000 2000000000 1934888896
1727084775 1727084818 512kB pglz 1000000 100 random 43 960000000 2000000000 12413439
1727084830 1727084865 512kB lz4 1000000 100 random 35 70166953 2000000000 12890598
1727084883 1727084914 512kB off 1000000 100 compressible 31 960000000 2000000000 1906888896
1727084926 1727084964 512kB pglz 1000000 100 compressible 38 124672339 2000000000 12261806
1727084975 1727085006 512kB lz4 1000000 100 compressible 31 65499058 2000000000 12337511
1727085023 1727085057 4MB off 1000000 100 random 34 960000000 2000000000 1935888896
1727085070 1727085113 4MB pglz 1000000 100 random 43 960000000 2000000000 12277513
1727085125 1727085159 4MB lz4 1000000 100 random 34 70063518 2000000000 12903008
1727085177 1727085210 4MB off 1000000 100 compressible 33 960000000 2000000000 1906888896
1727085222 1727085260 4MB pglz 1000000 100 compressible 38 121806968 2000000000 12385491
1727085273 1727085304 4MB lz4 1000000 100 compressible 31 65511687 2000000000 12336249
1727101381 1727101685 4MB pglz off 100000 1000 random2 304 3496202728 3200000000 3200688890
1727101698 1727101997 4MB pglz pglz 100000 1000 random2 299 3487810007 3200000000 1869238511
1727102065 1727102368 4MB pglz lz4 100000 1000 random2 303 3326043497 3200000000 3200691961
1727102385 1727102562 4MB pglz off 100000 1000 random 177 3496200000 3200000000 3200688895
1727102576 1727102755 4MB pglz pglz 100000 1000 random 179 3488055924 3200000000 583756849
1727102788 1727102965 4MB pglz lz4 100000 1000 random 177 1829510657 3200000000 40619616
1727102977 1727103025 4MB pglz off 100000 1000 compressible 48 56898772 3200000000 3200688895
1727103034 1727103082 4MB pglz pglz 100000 1000 compressible 48 17316243 3200000000 23016359
1727103097 1727103146 4MB pglz lz4 100000 1000 compressible 49 10356903 3200000000 16764329
1727103152 1727103394 4MB lz4 off 100000 1000 random2 242 3313932520 3200000000 3200688890
1727103408 1727103634 4MB lz4 pglz 100000 1000 random2 226 3309284822 3200000000 1869238511
1727103701 1727103930 4MB lz4 lz4 100000 1000 random2 229 3156562298 3200000000 3200691961
1727103949 1727104066 4MB lz4 off 100000 1000 random 117 3319600000 3200000000 3200688895
1727104079 1727104216 4MB lz4 pglz 100000 1000 random 137 3315267470 3200000000 583756849
1727104250 1727104363 4MB lz4 lz4 100000 1000 random 113 1656772933 3200000000 40619616
1727104375 1727104402 4MB lz4 off 100000 1000 compressible 27 33199647 3200000000 3200688895
1727104410 1727104439 4MB lz4 pglz 100000 1000 compressible 29 16596895 3200000000 23016359
1727104455 1727104483 4MB lz4 lz4 100000 1000 compressible 28 9704302 3200000000 16764329
1727104489 1727104790 4MB pglz off 10000 10000 random2 301 3465780000 3200000000 3200058890
1727104804 1727105098 4MB pglz pglz 10000 10000 random2 294 3465349151 3200000000 1868576351
1727105165 1727105472 4MB pglz lz4 10000 10000 random2 307 3310941656 3200000000 3200061957
1727105492 1727105684 4MB pglz off 10000 10000 random 192 3465780000 3200000000 3200058894
1727105698 1727105892 4MB pglz pglz 10000 10000 random 194 3465349035 3200000000 1869203336
1727105961 1727106151 4MB pglz lz4 10000 10000 random 190 3310342175 3200000000 3200061961
1727106170 1727106220 4MB pglz off 10000 10000 compressible 50 42069867 3200000000 3200058894
1727106231 1727106281 4MB pglz pglz 10000 10000 compressible 50 4676993 3200000000 20451166
1727106299 1727106349 4MB pglz lz4 10000 10000 compressible 50 2625791 3200000000 13004334
1727106358 1727106584 4MB lz4 off 10000 10000 random2 226 3465780000 3200000000 3200058890
1727106598 1727106818 4MB lz4 pglz 10000 10000 random2 220 3465339285 3200000000 1868576351
1727106887 1727107110 4MB lz4 lz4 10000 10000 random2 223 3310848431 3200000000 3200061957
1727107132 1727107256 4MB lz4 off 10000 10000 random 124 3465780000 3200000000 3200058894
1727107269 1727107438 4MB lz4 pglz 10000 10000 random 169 3465339057 3200000000 1869203336
1727107505 1727107619 4MB lz4 lz4 10000 10000 random 114 3310290984 3200000000 3200061961
1727107640 1727107674 4MB lz4 off 10000 10000 compressible 34 14609994 3200000000 3200058894
1727107685 1727107716 4MB lz4 pglz 10000 10000 compressible 31 1758403 3200000000 20451166
1727107734 1727107765 4MB lz4 lz4 10000 10000 compressible 31 1021255 3200000000 13004334
From c0633fa03e7eefdf4bc5ab6f6608fe51368272a6 Mon Sep 17 00:00:00 2001
From: tomas <tomas>
Date: Wed, 18 Sep 2024 19:39:52 +0200
Subject: [PATCH] compression fixup

---
 .../logical/reorderbuffer_compression.c       | 40 +++++++++++++------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer_compression.c b/src/backend/replication/logical/reorderbuffer_compression.c
index 6d19c60b60..6301ccc932 100644
--- a/src/backend/replication/logical/reorderbuffer_compression.c
+++ b/src/backend/replication/logical/reorderbuffer_compression.c
@@ -781,24 +781,38 @@ ReorderBufferCompress(ReorderBuffer *rb, ReorderBufferDiskHeader **header,
 				dst = (char *) palloc0(max_size);
 				dst_size = pglz_compress(src, src_size, dst, PGLZ_strategy_always);

-				if (dst_size < 0)
-					ereport(ERROR,
-							(errcode(ERRCODE_DATA_CORRUPTED),
-							 errmsg_internal("PGLZ compression failed")));
+				/*
+				 * If compression succeeded, build the proper compression header. If
+				 * compression fails, it means the data is not compressible. In that
+				 * case just build a no-compress item.
+				 */
+				if (dst_size > 0)	/* compressible */
+				{
+					ReorderBufferReserve(rb, (Size) (dst_size + sizeof(ReorderBufferDiskHeader)));

-				ReorderBufferReserve(rb, (Size) (dst_size + sizeof(ReorderBufferDiskHeader)));
+					hdr = (ReorderBufferDiskHeader *) rb->outbuf;
+					hdr->comp_strat = REORDER_BUFFER_STRAT_PGLZ;
+					hdr->size = (Size) dst_size + sizeof(ReorderBufferDiskHeader);
+					hdr->raw_size = (Size) src_size;

-				hdr = (ReorderBufferDiskHeader *) rb->outbuf;
-				hdr->comp_strat = REORDER_BUFFER_STRAT_PGLZ;
-				hdr->size = (Size) dst_size + sizeof(ReorderBufferDiskHeader);
-				hdr->raw_size = (Size) src_size;
+					/* Copy back compressed data into the ReorderBuffer */
+					memcpy((char *) rb->outbuf + sizeof(ReorderBufferDiskHeader), dst,
+						   dst_size);
+				}
+				else	/* not compressible */
+				{
+					hdr = (ReorderBufferDiskHeader *) rb->outbuf;
+					hdr->comp_strat = REORDER_BUFFER_STRAT_UNCOMPRESSED;
+					hdr->size = (Size) src_size + sizeof(ReorderBufferDiskHeader);
+					hdr->raw_size = (Size) src_size;
+
+					/* Copy back compressed data into the ReorderBuffer */
+					memcpy((char *) rb->outbuf + sizeof(ReorderBufferDiskHeader), src,
+						   src_size);
+				}

 				*header = hdr;

-				/* Copy back compressed data into the ReorderBuffer */
-				memcpy((char *) rb->outbuf + sizeof(ReorderBufferDiskHeader), dst,
-					   dst_size);
-
 				pfree(dst);
 				break;

-- 
2.39.2
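
Outside of PostgreSQL, the pattern the fixup follows (try to compress, and store the raw bytes with an "uncompressed" marker when the compressor reports failure) can be sketched in a standalone form. Everything here is hypothetical: ToyHeader, ToyStrategy, toy_rle_compress() and toy_store_item() are illustration-only names, and a trivial run-length encoder stands in for pglz_compress() purely so that the -1 return path can be exercised.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical item header, loosely modeled on the patch's disk header. */
typedef enum { STRAT_UNCOMPRESSED = 0, STRAT_RLE = 1 } ToyStrategy;

typedef struct
{
	ToyStrategy strat;		/* how the payload is stored */
	size_t		size;		/* payload bytes following the header */
	size_t		raw_size;	/* original, uncompressed length */
} ToyHeader;

/*
 * Toy run-length encoder standing in for pglz_compress(): emits
 * (count, byte) pairs and returns the output length, or -1 if the
 * result would not fit into dst_cap bytes.
 */
int
toy_rle_compress(const unsigned char *src, size_t src_len,
				 unsigned char *dst, size_t dst_cap)
{
	size_t		in = 0;
	size_t		out = 0;

	while (in < src_len)
	{
		unsigned char byte = src[in];
		size_t		run = 1;

		while (in + run < src_len && src[in + run] == byte && run < 255)
			run++;
		if (out + 2 > dst_cap)
			return -1;			/* does not fit: treat as incompressible */
		dst[out++] = (unsigned char) run;
		dst[out++] = byte;
		in += run;
	}
	return (int) out;
}

/*
 * Store src into out as header + payload. Compression is tried first; on
 * failure the raw bytes are kept instead of raising an error. "out" must
 * have room for sizeof(ToyHeader) + src_len bytes. Returns bytes written.
 */
size_t
toy_store_item(const unsigned char *src, size_t src_len, unsigned char *out)
{
	ToyHeader	hdr;
	unsigned char *payload = out + sizeof(ToyHeader);
	int			dst_size;

	assert(src != NULL && out != NULL);
	memset(&hdr, 0, sizeof(hdr));

	/* Cap the output at src_len: anything larger is not worth keeping. */
	dst_size = toy_rle_compress(src, src_len, payload, src_len);

	if (dst_size > 0)
	{
		hdr.strat = STRAT_RLE;
		hdr.size = (size_t) dst_size;
	}
	else
	{
		/* not compressible (or empty input): keep the raw bytes */
		memcpy(payload, src, src_len);
		hdr.strat = STRAT_UNCOMPRESSED;
		hdr.size = src_len;
	}
	hdr.raw_size = src_len;
	memcpy(out, &hdr, sizeof(hdr));
	return sizeof(ToyHeader) + hdr.size;
}
```

The output buffer in this sketch is sized for the raw case up front, so both branches can write into it without a separate reservation step. A reader would check hdr.strat first and either decode hdr.size bytes or copy them as-is, with hdr.raw_size giving the expected result length in both cases.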