Hi Tomas,
Thanks for the feedback.
> What is COMPRESS_BLCKSZ? I don't see that in the patch anywhere.
> What am I missing?
It is a #define I introduced in the latest revision of the patch,
in src/backend/storage/file/buffile.c:

    #define COMPRESS_BLCKSZ (4 * BLCKSZ)    /* 32 KB */

The version you benchmarked (from January) used BLCKSZ (8 KB) directly
as the compression unit -- each 8 KB buffer was compressed and written
separately. After your benchmark, I experimented with larger blocks
and found that compressing 32 KB at a time works better for both speed
and compression ratio. With 8 KB blocks the algorithm sees too little
data per call to exploit redundancy effectively, especially for wider
rows where repetitive patterns span more than one page. With 32 KB it
has a longer match history per call, per-block framing overhead is
amortized, and we make 4x fewer compress/decompress calls. The ratio
gain is measurable rather than dramatic (lz4, d=1000, w=8 on HDD:
7.47 GB vs 7.22 GB of compressed data), and the memory cost is small:
one extra 32 KB buffer per open compressed BufFile.
For testing I recompiled with three values -- BLCKSZ (8 KB),
4*BLCKSZ (32 KB), and 8*BLCKSZ (64 KB). It is a compile-time
constant, not a GUC.
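
To make the "4x fewer calls" point concrete, here is a minimal sketch
of the arithmetic. It is not code from the patch: the fallback BLCKSZ
value assumes a default 8 KB build, and compress_calls() is a
hypothetical helper named only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* In PostgreSQL, BLCKSZ comes from pg_config.h; 8192 is assumed here. */
#ifndef BLCKSZ
#define BLCKSZ 8192
#endif

#define COMPRESS_BLCKSZ (4 * BLCKSZ)    /* 32 KB, as in the patch */

/*
 * Hypothetical helper, not part of the patch: number of compress calls
 * needed to write nbytes when data is compressed in units of blocksz.
 * A partial final block still costs one call, hence the round-up.
 */
static uint64_t
compress_calls(uint64_t nbytes, uint64_t blocksz)
{
    assert(blocksz > 0);
    return (nbytes + blocksz - 1) / blocksz;
}
```

For 1 GB of temp data this gives 131072 calls with BLCKSZ, 32768 with
4*BLCKSZ, and 16384 with 8*BLCKSZ.
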
> I'm still not quite sure what "compression block size" means here,
> and how did you change it.
Same answer -- sorry for not being clear. Your benchmark used the
original 8 KB block size from the January patch. My main results
used the updated patch with 32 KB blocks. The comparison in my first
email was not entirely apples-to-apples -- I should have noted that
more clearly.
That said, the block size accounts for only a modest part of the
difference (e.g. lz4 d=1000 w=8 on HDD: 58% with 8 KB vs 52% with
32 KB). Most of the gap comes from the storage and memory pressure
differences between our machines.
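
For reference, here is the arithmetic behind "modest", using only the
lz4 d=1000 w=8 HDD numbers from my earlier table. This is just a
sketch; pct_reduction() is a hypothetical helper name, not anything in
the patch or the benchmark script.

```c
#include <assert.h>

/*
 * Hypothetical helper: relative reduction, in percent, when a metric
 * goes from 'before' (8 KB blocks) to 'after' (32 KB blocks).
 */
static double
pct_reduction(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

pct_reduction(58.0, 52.0) comes out at roughly 10% (relative run time)
and pct_reduction(7.47, 7.22) at roughly 3% (compressed bytes).
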
> I wonder how much this depends on the particular data set (e.g. if
> we generate data differently, how much would it affect the results).
Good question. The d parameter already covers a range of data
redundancy (d=1 is least compressible, d=1000 is most), so the tables
show best and worst cases for the same schema. Real-world workloads
with wider rows, more NULLs, or variable-length fields would likely
compress differently -- I'd expect better ratios in many cases, since
the benchmark data is relatively compact (bigint + md5 text).
> What bothers me a little bit is that systems generally are not under
> such pressure 24/7, but only for some part of a day. But people will
> mostly set the GUC in the config file.
That is a fair point. temp_file_compression can be set at the session
level (SET temp_file_compression = 'lz4'), so an application could
enable it only for known-heavy queries. On our I/O-constrained
machines the worst case for lz4 was ~94% (within noise). Your
results on fast NVMe showed higher overhead -- up to ~144% for lz4
with w=1 (148% with w=4), where CPU cost dominates and there's no
I/O to save. So
for systems with plenty of RAM and fast storage, per-session or
per-query activation may indeed be more appropriate than a global
setting.
No rush on further work -- happy to run more tests in the meantime
if anything comes to mind.
regards
-Filip-
On Tue, May 12, 2026 at 16:14, Tomas Vondra <[email protected]> wrote:
> On 5/11/26 09:09, Filip Janus wrote:
> >
> >
> > Hi Tomas,
> >
> > Thanks for the thorough benchmark and the script -- it was very helpful
> > as a starting point for my testing. I understand the results on
> > your machine were discouraging, and I appreciate the honest assessment.
> >
> > I ran a similar benchmark on different x86_64 hardware to see how the
> > results change under more I/O pressure. The short version: lz4 and
> > zstd show significant speedups once storage or page cache becomes a
> > bottleneck.
> >
>
> I'm glad you didn't just give up and decided to run some more tests.
>
> > Setup
> > -----
> >
> > I used your run-hashjoins.sh as a base, with the same parameters:
> > 100M rows, d in {1, 10, 100, 1000}, w in {1, 4, 8}, drop-caches
> > between runs. I also added zstd to the compression methods tested,
> > and tested with a larger compression block size (32 KB instead of
> > the default 8 KB BLCKSZ).
> >
> > Two x86_64 machines:
> >
> > (A) HPE BL460c Gen10, 2x Xeon Gold 6148, 64 GB RAM,
> > rotational HDD (5 disks), io_uring, Fedora 43
> >
> > (B) Dell MX840c, Xeon Gold 6148, SATA SSD (~224 GB),
> > RAM capped to 16 GB via systemd MemoryMax
> >
> > Both use 32 KB compression blocks (COMPRESS_BLCKSZ = 4*BLCKSZ).
> >
>
> What is COMPRESS_BLCKSZ? I don't see that in the patch anywhere. What am
> I missing?
>
> > Results
> > -------
> >
> > Below are the relative timings (% of uncompressed baseline), directly
> > comparable to your table. Values below 100% mean compression is faster.
> >
> > Your results (Xeon, 64 GB, SSD/NVMe, 8 KB blocks):
> >
> >                       pglz              lz4
> >   rows    rep      1    4    8      1    4    8
> > -------------------------------------------------
> >     10      1    661  688  300    144  148   86
> >     10   1000    460  472  234    119  119   58
> >    100      1    471  303  204    132  135  102
> >    100   1000    378  262  164    107   91   81
> >
> > Our results, machine A -- x86 HDD, 64 GB, 32 KB blocks:
> >
> >                       pglz              lz4             zstd
> >   rows    rep      1    4    8      1    4    8      1    4    8
> > ----------------------------------------------------------------
> >    100      1    200  119   69     91   82   67     80   50   35
> >    100     10    204  101   70     91   64   66     83   44   39
> >    100    100    220  104   72     94   75   69     85   50   34
> >    100   1000    170   92   54     79   58   52     74   42   28
> >
> > Our results, machine B -- x86 SATA SSD, 16 GB cap, 32 KB blocks:
> >
> >                       pglz              lz4             zstd
> >   rows    rep      1    4    8      1    4    8      1    4    8
> > ----------------------------------------------------------------
> >    100      1    284  103   79     92   81   82     98   59   53
> >    100     10    262   99   77     92   80   85     96   57   50
> >    100    100    221   89   67     80   70   64     85   49   44
> >    100   1000    155   51   42     72   39   39     77   27   29
> >
> > Analysis
> > --------
> >
> > I think the key difference is page cache pressure. Your machine has
> > 64 GB RAM with 8 GB shared_buffers, leaving ~56 GB for the OS page
> > cache. Even with 8 connections x ~10 GB temp files = ~80 GB, a large
> > portion stays cached and synchronous I/O to storage is limited.
> >
> > On our machines, I/O is a real bottleneck:
> > - Machine A: rotational HDD with 8 concurrent streams
> > - Machine B: SATA SSD but only 16 GB RAM, so the page cache
> > cannot absorb 8 x 12 GB of temp data
> >
> > Under these conditions, reducing the bytes written translates
> > directly into wall-clock savings.
> >
>
> Seems like that. It's not a huge surprise that this matters more on
> systems with memory pressure and slower storage. I should have tested
> that on my machines too.
>
> I was going to question how common such systems are nowadays, when
> people can just spin a VM with plenty of RAM and SSDs. But given the
> current RAM shortage / costs, and relatively slow network storage (even
> if temporary files can use ephemeral disks), maybe it's not all that
> uncommon ...
>
> > Both your results and ours confirm that pglz is simply too slow for
> > this use case. Your benchmark shows 164-688% of baseline; ours shows
> > 155-284% with w=1. Even under heavy I/O contention (w=8 on HDD)
> > where pglz eventually wins, it never outperforms lz4 or zstd. I
> > would recommend against offering pglz for temp file compression
> > altogether -- it creates a trap for users who might try it expecting
> > reasonable performance.
> >
>
> Right.
>
> > lz4 looks safe: the worst case in our data is 94% (w=1, d=100 on
> > HDD) -- barely distinguishable from noise. Under I/O pressure it
> > delivers 39-52% of baseline time (2-2.5x speedup).
> >
> > zstd is the most compelling option: it achieves the best compression
> > ratios (down to 22% of original size on the SATA SSD) and the best
> > speedups (27-28% of baseline = 3.5x faster), with no regression
> > exceeding 98% on x86_64. I would recommend zstd as the primary
> > option to document, with lz4 as a lighter-weight alternative.
> >
>
> Agreed. lz4 seems safe, zstd is good too. I wonder how much this depends
> on the particular data set (e.g. if we generate data differently, how
> much would it affect the results).
>
> > Compression block size
> > ----------------------
> >
> > I also tested 8 KB, 32 KB, and 64 KB compression block sizes.
> > 32 KB appears to be the sweet spot. Example for lz4, d=1000, w=8
> > on HDD:
> >
> > COMPRESS_BLCKSZ      time (% of no compression)   compressed bytes
> > --------------------------------------------------------------------
> > 8 KB  (BLCKSZ)                  58%                   7.47 GB
> > 32 KB (4*BLCKSZ)                52%                   7.22 GB
> > 64 KB (8*BLCKSZ)                56%                   7.14 GB
> >
> > The 8K-to-32K improvement comes from fewer compress/decompress calls
> > (4x fewer), less per-block header overhead, and better compression
> > ratios. Going to 64K shows diminishing returns and slightly worse
> > timings, possibly due to increased cache pressure.
> >
>
> I'm still not quite sure what "compression block size" means here, and
> how did you change it.
>
> > Conclusion
> > ----------
> >
> > I think the data shows that the benefit of temporary file compression
> > depends heavily on the I/O characteristics of the system. On machines
> > with fast storage and ample page cache, compression is neutral -- it
> > means negligible overhead, which is a good outcome on its own. On
> > systems with real I/O pressure -- slower storage, limited RAM, or
> > concurrent workloads competing for page cache -- compression delivers
> > substantial speedups.
> >
>
> True.
>
> > The feature does not need to be enabled by default. Compression is
> > controlled by the temp_file_compression GUC, which defaults to "none".
> > That means there is no risk of regression for existing users. But for
> > administrators who know their systems are I/O-constrained -- spinning
> > disks, limited memory, heavy concurrent spilling -- having the option
> > to enable lz4 or zstd can make a real difference. The data above shows
> > up to 3.5x speedup in those scenarios, with no
> > downside when the setting is left at its default.
> >
> Yes, having it as opt-in for systems where it matters helps.
>
> What bothers me a little bit is that systems generally are not under
> such pressure 24/7, but only for some part of a day. But people will
> mostly set the GUC in the config file. I don't have a better solution to
> this, though.
>
>
> FYI I won't be able to do much work on this until ~June.
>
>
> regards
>
> --
> Tomas Vondra
>
>