On Wed, Dec 3, 2025 at 3:48 AM Colin 't Hart <[email protected]> wrote:
> One of my clients has Microsoft Defender for Endpoint on Linux installed on 
> their Postgres servers.
>
> I was testing a database restore from pgBackRest. The restore itself seemed 
> to complete in a reasonable amount of time, but then the Postgres recovery 
> started and it was extremely slow to retrieve and apply the WAL files.
>
> I noticed wdavdaemon taking most of the CPU, and Postgres getting very little.

These days, tools like that work by monitoring every read, write,
etc. via kernel event queues (fanotify on Linux, ESF on macOS; I'm
not sure about Windows, it might still be using something more
efficient but less isolated, with tentacles inside the kernel).
Those queues usually have
a fixed size and when they overflow because the event consumer isn't
keeping up, the monitored process can be blocked.  That's probably
true even if running in a mode where it doesn't have to reply to allow
the operation to proceed.  Presumably the consumer is running some
kind of rolling fingerprint check over the data looking for things
from its database of malware, which you'd hope would be very well
optimised...
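
Just to illustrate the mechanism from the consumer's side (this is
not how Defender is implemented, and the event mask here is a guess
on my part), a bare-bones notification-mode fanotify listener looks
roughly like the sketch below; FAN_Q_OVERFLOW is what shows up when
the consumer can't drain the fixed-size queue fast enough:

/*
 * Minimal notification-mode fanotify listener, for illustration only
 * (needs root).  Watches opens and modifications on a whole mount and
 * complains when the kernel's fixed-size event queue overflows.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/fanotify.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int     fd = fanotify_init(FAN_CLASS_NOTIF, O_RDONLY);

    if (fd < 0 ||
        fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_MOUNT,
                      FAN_OPEN | FAN_MODIFY, AT_FDCWD,
                      argc > 1 ? argv[1] : "/") < 0)
    {
        perror("fanotify");
        exit(1);
    }

    for (;;)
    {
        struct fanotify_event_metadata buf[200], *ev;
        ssize_t len = read(fd, buf, sizeof(buf));

        if (len <= 0)
            break;
        for (ev = buf; FAN_EVENT_OK(ev, len); ev = FAN_EVENT_NEXT(ev, len))
        {
            if (ev->mask & FAN_Q_OVERFLOW)
                fprintf(stderr, "event queue overflowed, events lost\n");
            if (ev->fd >= 0)
                close(ev->fd);  /* each event carries an open fd */
        }
    }
    return 0;
}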

My pet theory is that PostgreSQL suffers from these systems more than
anything else not because of the total bandwidth but because of the
per-operation overheads and our historical 8KB-at-a-time disk and
network I/O.  Your report about pgBackRest supports that idea: it
probably copies a larger total size in big chunks, while recovery
reads the WAL 8KB at a time (and evicts data 8KB at a time if your
buffer pool is small), and then finally the checkpointer writes back
8KB at a time.  Another factor is that it might be using only one
fanotify queue for each process, or worse, but I don't know if that
matters; it sounds like the CPU might be saturated anyway.  Future
releases should improve all of that with bigger I/Os for WAL
(currently read through an 8KB drinking straw; I don't know if it's
spying on reads too) and data (I/O
combining, various strategies, various prototypes[1][2], watch this
space).  It's also been proposed a few times that we should have an
option to skip the end-of-recovery checkpoint, so then you'd get a
regular "spread" checkpoint that the spyware could keep up with
(assuming that it normally keeps up, just not in crash recovery).
Another thing that probably makes this worse in this strange
environment, assuming the problem is small writes and reads are not
affected, is that crash recovery currently dirties every page the WAL
touches, forgetting progress that already made it to disk: it
restores the FPW over whatever is on disk, ignoring the page's LSN,
and then replays all changes on top.  Whenever checksums are on (as
they are now by default), it could instead read the page in and skip
a lot of that work if the page's LSN is already high enough, thereby
often avoiding dirtying and re-writing the page.  The checksum could
be used as proof that the page wasn't torn by a non-atomic write
interrupted by a power outage.
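
To spell that idea out in pseudocode (this is not how the redo path
works today, and can_skip_fpw_restore and checksum_is_valid are names
I just made up; only PageGetLSN and the types are real):

/* Hypothetical sketch of the proposed optimisation, not current code. */
static bool
can_skip_fpw_restore(Page page, XLogRecPtr record_lsn)
{
    /*
     * With data checksums enabled, a valid checksum proves that the
     * on-disk page wasn't torn, so its LSN can be trusted.  If that
     * LSN already covers this record, the change has reached disk and
     * there is no need to restore the image or dirty the buffer.
     */
    return checksum_is_valid(page) && PageGetLSN(page) >= record_lsn;
}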

I doubt anyone is really that interested in optimising for such
setups per se, when anyone will tell you to just turn it off.  The
reason I've thought about it enough to take a guess is that my
corporate-managed Mac was running the PostgreSQL test suite so slowly
that it would time out, and I was sufficiently nerd-sniped to figure
out that it could keep up with bursts of I/O pretty well, but
everything turned to custard under sustained workloads, notably in
the recovery tests, which deliberately run with a tiny buffer pool.
As someone
working on bits of our I/O plumbing, I couldn't help speculating that
something that is objectively terrible about PostgreSQL is really just
being magnified by strange new overheads that mess with the economics.
It may not be a goal but I will still be happy if it copes with this
stuff as a by-product of general improvements like generalised I/O
combining.  (Funnily enough I've actually got a bunch of unpublished
tooling to simulate, detect and manage invisible I/O queuing.)

> I wonder if anyone here has any experience with configuring exclusions so 
> that the WAL files can be processed faster?

Yep, it entirely fixed the cliff and vastly reduced the CPU usage on
my corporate Mac.  There is still a small measurable slowdown with
the exclusions in place, but without them the recovery test suite
couldn't even complete before timing out.  I expect exactly the same
on Linux but haven't tried it.
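
If they're open to adding exclusions, I believe the Linux knob is the
mdatp CLI (going from Microsoft's documentation, untested by me, so
please double-check the exact syntax), something like:

    mdatp exclusion folder add --path /path/to/pgdata
    mdatp exclusion list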

> Any advice on what to communicate with their IT department about using this 
> on their database servers? I've never encountered it on Linux before...

There is lots of writing on the internet about excluding pgdata from
these types of tools.  Much of it is concerned with Windows-specific
problems: opening files and directories or mapping files at bad times
can cause various PostgreSQL file operations to fail on that OS.  I
don't know of any reason why periodic scans of pgdata should interfere
with PostgreSQL on Linux other than consuming I/O bandwidth; it
seems to be just the per-syscall stuff that is unworkable.

You might be able to show "meson test" failing as some kind of
evidence that PostgreSQL is allergic to it.  Or if you want to try to
find a one-liner demonstration independent of PostgreSQL, you could
test the can't-keep-up-with-stream-of-tiny-writes theory by
experimenting with "dd" at different block sizes.  I expect you'll
find a size below which the fanotify queue quickly overflows and
performance falls off a cliff.  Current versions of PostgreSQL
assume fast and consistent buffered writes and pretend the system
calls are free.  These monitoring tools make them expensive and also
non-linear, by sending messages around with carrier pigeons.
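
For example, something like this should show it if the theory is
right (untested numbers; same 1GiB total, only the write size
changes, and dd reports the throughput itself):

    dd if=/dev/zero of=/tmp/ddtest bs=8k count=131072 conv=fdatasync
    dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 conv=fdatasync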

[1] 
https://www.postgresql.org/message-id/flat/CAAKRu_bcWRvRwZUop_d9vzF9nHAiT%2B-uPzkJ%3DS3ShZ1GqeAYOw%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGK1in4FiWtisXZ%2BJo-cNSbWjmBcPww3w3DBM%2BwhJTABXA%40mail.gmail.com

