On Fri, Mar 03, 2023 at 01:38:05PM -0800, Jacob Champion wrote:
> > > With this particular dataset, I don't see much improvement with
> > > zstd:long.
> >
> > Yeah.  I think this could be because either 1) you already got very good
> > compression without looking at more data; and/or 2) the neighboring data
> > is already very similar, maybe equally or more similar, than the further
> > data, from which there's nothing to gain.
>
> What kinds of improvements do you see with your setup? I'm wondering
> when we would suggest that people use it.

On customer data, I see small improvements - below 10%.

But on my first two tries, I made synthetic data sets where it's a lot:

$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fp -Z zstd:long |wc -c
286107
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fp -Z zstd:long=0 |wc -c
1709695

That's just 6 identical tables like:
pryzbyj=# CREATE TABLE t1 AS SELECT generate_series(1,999999);

In this case, "custom" format doesn't see that benefit, because the
greatest similarity is across tables, which don't share compressor
state.  But I think the note that I wrote in the docs about that should
be removed - custom format could see a big benefit, as long as the table
is big enough, and there's more similarity/repetition at longer
distances.

Here's one where custom format *does* benefit, due to long-distance
repetition within a single table.  The data is contrived, but the schema
of ID => data is not.  What's notable isn't how compressible the data
is, but how much *more* compressible it is with long-distance matching.

pryzbyj=# CREATE TABLE t1 AS SELECT i,array_agg(j) FROM generate_series(1,444)i,generate_series(1,99999)j GROUP BY 1;
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=1 |wc -c
82023
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=0 |wc -c
1048267

-- 
Justin