Re: Proposal: Adding compression of temporary files

Filip Janus Mon, 25 May 2026 00:23:49 -0700

Hi Tomas,

Thanks for the feedback.


> What is COMPRESS_BLCKSZ? I don't see that in the patch anywhere.
> What am I missing?

It is a #define I introduced in the latest revision of the patch,
in src/backend/storage/file/buffile.c:

  #define COMPRESS_BLCKSZ   (4 * BLCKSZ)   /* 32KB */

The version you benchmarked (from January) used BLCKSZ (8 KB) directly
as the compression unit -- each 8 KB buffer was compressed and written
separately. After your benchmark, I experimented with larger blocks
and found that compressing 32 KB at a time works better: the
compressor sees more context per call (more history to match against,
and for zstd better entropy-coding statistics), per-block framing
overhead is amortized, and we make 4x fewer compress/decompress calls.

With 8 KB blocks, the algorithm sees too little data per call to
exploit redundancy effectively, especially for wider rows where
repetitive patterns span more than one page. In the lz4 test from my
previous mail (d=1000, w=8 on HDD), going from 8 KB to 32 KB reduced
the compressed size from 7.47 GB to 7.22 GB. The memory cost is small:
one extra 32 KB buffer per open compressed BufFile.

For testing I recompiled with three values -- BLCKSZ (8 KB),
4*BLCKSZ (32 KB), and 8*BLCKSZ (64 KB). It is a compile-time
constant, not a GUC.
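
To make the "4x fewer calls" arithmetic concrete, here is a minimal
standalone sketch. It is not code from the patch: BLCKSZ is assumed to
be the default 8192, and compress_calls() is a hypothetical helper that
only counts how many compression units a temp file of a given size is
split into.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: default PostgreSQL page size; the real value is set at build time. */
#define BLCKSZ           8192
/* Compression unit as defined in the patch revision discussed above. */
#define COMPRESS_BLCKSZ  (4 * BLCKSZ)

/*
 * Hypothetical helper, not part of the patch: number of compress calls
 * needed to write file_bytes when each call handles unit_bytes of input.
 * A trailing partial unit still costs one call, hence the round-up.
 */
static uint64_t
compress_calls(uint64_t file_bytes, uint64_t unit_bytes)
{
    assert(unit_bytes > 0);
    return (file_bytes + unit_bytes - 1) / unit_bytes;
}
```

For a 1 GB temp file this gives 131072 calls with BLCKSZ and 32768 with
COMPRESS_BLCKSZ. Any fixed per-block header is paid once per call, so
it shrinks by the same factor.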

> I'm still not quite sure what "compression block size" means here,
> and how did you change it.

Same answer -- sorry for not being clear. Your benchmark used the
original 8 KB block size from the January patch. My main results
used the updated patch with 32 KB blocks. The comparison in my first
email was not entirely apples-to-apples -- I should have noted that
more clearly.

That said, the block size accounts for only a modest part of the
difference (e.g. lz4 d=1000 w=8 on HDD: 58% with 8 KB vs 52% with
32 KB). The larger gains come from the storage and memory pressure
differences between our machines.
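
For reading these numbers as speedups, a small sketch of the
conversion (speedup_from_pct() is a hypothetical helper written for
this mail, not something in the benchmark script):

```c
#include <assert.h>

/*
 * Hypothetical helper: convert a relative timing, expressed as a
 * percentage of the uncompressed baseline, into a speedup factor.
 * 100% maps to 1.0x; values below 100% give factors above 1.
 */
static double
speedup_from_pct(double pct_of_baseline)
{
    assert(pct_of_baseline > 0.0);
    return 100.0 / pct_of_baseline;
}
```

With the figures above, 58% corresponds to about 1.72x and 52% to
about 1.92x, so the block size change alone moves lz4 from roughly
1.7x to 1.9x in that test.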

> I wonder how much this depends on the particular data set (e.g. if
> we generate data differently, how much would it affect the results).

Good question. The d parameter already covers a range of data
redundancy (d=1 is least compressible, d=1000 is most), so the tables
show best and worst cases for the same schema. Real-world workloads
with wider rows, more NULLs, or variable-length fields would likely
compress differently -- I'd expect better ratios in many cases, since
the benchmark data is relatively compact (bigint + md5 text).

> What bothers me a little bit is that systems generally are not under
> such pressure 24/7, but only for some part of a day. But people will
> mostly set the GUC in the config file.

That is a fair point. temp_file_compression can be set at the session
level (SET temp_file_compression = 'lz4'), so an application could
enable it only for known-heavy queries. On our I/O-constrained
machines the worst case for lz4 was ~94% of baseline (within noise).
Your results on fast NVMe showed a real slowdown -- up to ~148% of
baseline for lz4 at w=1 and w=4, where CPU cost dominates and there
is little I/O to save. So for systems with plenty of RAM and fast
storage, per-session or per-query activation may indeed be more
appropriate than a global setting.

No rush on further work -- happy to run more tests in the meantime
if anything comes to mind.

regards

    -Filip-


On Tue, May 12, 2026 at 16:14, Tomas Vondra <[email protected]> wrote:

> On 5/11/26 09:09, Filip Janus wrote:
> >
> >
> > Hi Tomas,
> >
> > Thanks for the thorough benchmark and the script -- it was very helpful
> > as a starting point for my testing. I understand the results on
> > your machine were discouraging, and I appreciate the honest assessment.
> >
> > I ran a similar benchmark on different x86_64 hardware to see how the
> > results change under more I/O pressure. The short version: lz4 and
> > zstd show significant speedups once storage or page cache becomes a
> > bottleneck.
> >
>
> I'm glad you didn't just give up and instead ran some more tests.
>
> > Setup
> > -----
> >
> > I used your run-hashjoins.sh as a base, with the same parameters:
> > 100M rows, d in {1, 10, 100, 1000}, w in {1, 4, 8}, drop-caches
> > between runs. I also added zstd to the compression methods tested,
> > and tested with a larger compression block size (32 KB instead of
> > the default 8 KB BLCKSZ).
> >
> > Two x86_64 machines:
> >
> >   (A) HPE BL460c Gen10, 2x Xeon Gold 6148, 64 GB RAM,
> >       rotational HDD (5 disks), io_uring, Fedora 43
> >
> >   (B) Dell MX840c, Xeon Gold 6148, SATA SSD (~224 GB),
> >       RAM capped to 16 GB via systemd MemoryMax
> >
> > Both use 32 KB compression blocks (COMPRESS_BLCKSZ = 4*BLCKSZ).
> >
>
> What is COMPRESS_BLCKSZ? I don't see that in the patch anywhere. What am
> I missing?
>
> > Results
> > -------
> >
> > Below are the relative timings (% of uncompressed baseline), directly
> > comparable to your table. Values below 100% mean compression is faster.
> > In each table, rep is d and the 1/4/8 columns are w.
> >
> > Your results (Xeon, 64 GB, SSD/NVMe, 8 KB blocks):
> >
> >                      pglz              lz4
> >   rows  rep    1    4    8       1    4    8
> >   -------------------------------------------------
> >    10     1  661  688  300     144  148   86
> >    10  1000  460  472  234     119  119   58
> >   100     1  471  303  204     132  135  102
> >   100  1000  378  262  164     107   91   81
> >
> > Our results, machine A -- x86 HDD, 64 GB, 32 KB blocks:
> >
> >                      pglz              lz4              zstd
> >   rows  rep    1    4    8       1    4    8       1    4    8
> >   ----------------------------------------------------------------
> >   100     1  200  119   69      91   82   67      80   50   35
> >   100    10  204  101   70      91   64   66      83   44   39
> >   100   100  220  104   72      94   75   69      85   50   34
> >   100  1000  170   92   54      79   58   52      74   42   28
> >
> > Our results, machine B -- x86 SATA SSD, 16 GB cap, 32 KB blocks:
> >
> >                      pglz              lz4              zstd
> >   rows  rep    1    4    8       1    4    8       1    4    8
> >   ----------------------------------------------------------------
> >   100     1  284  103   79      92   81   82      98   59   53
> >   100    10  262   99   77      92   80   85      96   57   50
> >   100   100  221   89   67      80   70   64      85   49   44
> >   100  1000  155   51   42      72   39   39      77   27   29
> >
> > Analysis
> > --------
> >
> > I think the key difference is page cache pressure. Your machine has
> > 64 GB RAM with 8 GB shared_buffers, leaving ~56 GB for the OS page
> > cache. Even with 8 connections x ~10 GB temp files = ~80 GB, a large
> > portion stays cached and synchronous I/O to storage is limited.
> >
> > On our machines, I/O is a real bottleneck:
> >   - Machine A: rotational HDD with 8 concurrent streams
> >   - Machine B: SATA SSD but only 16 GB RAM, so the page cache
> >     cannot absorb 8 x 12 GB of temp data
> >
> > Under these conditions, reducing the bytes written translates
> > directly into wall-clock savings.
> >
>
> Seems like that. It's not a huge surprise that this matters more on
> systems with memory pressure and slower storage. I should have tested
> that on my machines too.
>
> I was going to question how common such systems are nowadays, when
> people can just spin up a VM with plenty of RAM and SSDs. But given the
> current RAM shortage / costs, and relatively slow network storage (even
> if temporary files can use ephemeral disks), maybe it's not all that
> uncommon ...
>
> > Both your results and ours confirm that pglz is simply too slow for
> > this use case. Your benchmark shows 164-688% of baseline time; ours
> > shows 155-284% with w=1. Even under heavy I/O contention (w=8 on HDD)
> > where pglz eventually wins, it never outperforms lz4 or zstd. I
> > would recommend against offering pglz for temp file compression
> > altogether -- it creates a trap for users who might try it expecting
> > reasonable performance.
> >
>
> Right.
>
> > lz4 looks safe: the worst case in our data is 94% (w=1, d=100 on
> > HDD) -- barely distinguishable from noise. Under I/O pressure it
> > delivers 39-52% of baseline time (roughly 1.9-2.6x speedup).
> >
> > zstd is the most compelling option: it achieves the best compression
> > ratios (down to 22% of original size on the SATA SSD) and the best
> > speedups (27-28% of baseline = 3.5x faster), with a worst case of
> > 98% of baseline on x86_64. I would recommend zstd as the primary
> > option to document, with lz4 as a lighter-weight alternative.
> >
>
> Agreed. lz4 seems safe, zstd is good too. I wonder how much this depends
> on the particular data set (e.g. if we generate data differently, how
> much would it affect the results).
>
> > Compression block size
> > ----------------------
> >
> > I also tested 8 KB, 32 KB, and 64 KB compression block sizes.
> > 32 KB appears to be the sweet spot. Example for lz4, d=1000, w=8
> > on HDD:
> >
> >    COMPRESS_BLCKSZ    time (% of no)    compressed bytes
> >    --------------------------------------------------------
> >     8 KB (BLCKSZ)         58%             7.47 GB
> >    32 KB (4*BLCKSZ)       52%             7.22 GB
> >    64 KB (8*BLCKSZ)       56%             7.14 GB
> >
> > The 8K-to-32K improvement comes from fewer compress/decompress calls
> > (4x fewer), less per-block header overhead, and better compression
> > ratios. Going to 64K shows diminishing returns and slightly worse
> > timings, possibly due to increased cache pressure.
> >
>
> I'm still not quite sure what "compression block size" means here, and
> how did you change it.
>
> > Conclusion
> > ----------
> >
> > I think the data shows that the benefit of temporary file compression
> > depends heavily on the I/O characteristics of the system. On machines
> > with fast storage and ample page cache, compression is roughly neutral --
> > it adds negligible overhead, which is a good outcome on its own. On
> > systems with real I/O pressure -- slower storage, limited RAM, or
> > concurrent workloads competing for page cache -- compression delivers
> > substantial speedups.
> >
>
> True.
>
> > The feature does not need to be enabled by default. Compression is
> > controlled by the temp_file_compression GUC, which defaults to "none".
> > That means there is no risk of regression for existing users. But for
> > administrators who know their systems are I/O-constrained -- spinning
> > disks, limited memory, heavy concurrent spilling -- having the option
> > to enable lz4 or zstd can make a real difference. The data above shows
> > up to 3.5x speedup in those scenarios, with no downside when the
> > setting is left at its default.
> >
> Yes, having it as opt-in for systems where it matters helps.
>
> What bothers me a little bit is that systems generally are not under
> such pressure 24/7, but only for some part of a day. But people will
> mostly set the GUC in the config file. I don't have a better solution to
> this, though.
>
>
> FYI I won't be able to do much work on this until ~June.
>
>
> regards
>
> --
> Tomas Vondra
>
>
