Hi,

While thinking about the fsync mess, I started looking at the fsync request queue. I was primarily wondering whether we can keep FDs open long enough (by forwarding them to the checkpointer) to guarantee that we see the error. But that's mostly irrelevant to what I'm wondering about here.
The fsync request queue is often fairly large: 20 bytes per shared_buffers entry is not a negligible overhead. One reason it needs to be that large is that we do not deduplicate while inserting; we just add an entry on every single write. ISTM that using a hashtable would be saner, because we'd deduplicate on insert. While that would require locking, we can reduce the locking overhead fairly easily by keeping track of something like mdsync_cycle_ctr in MdfdVec, and only inserting again if the cycle counter has been incremented since. (A rough sketch of both pieces is at the end of this mail.)

Right now, if the queue is full and can't be compacted, we end up fsync()ing on every single write rather than once per checkpoint, afaict. That's fairly horrible. For the case that there's no space in the map, I'd suggest just doing 10% or so of the pending fsyncs in the poor sod of a process that finds no space. That's surely better than constantly fsyncing on every single write. We could also make bgwriter check the size of the hashtable on a regular basis and perform some of the fsyncs if it gets too full.

I also think the hashtable has some advantages for the future; I've introduced something very similar in my radix tree based buffer mapping.

Greetings,

Andres Freund
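
PS: To make the deduplicate-on-insert idea a bit more concrete, here's a very rough sketch. It's entirely hypothetical, not taken from md.c; the names (FsyncTag, FsyncRequestTable, fsync_request_insert) are made up, locking is omitted, and a real table would of course live in shared memory:

/*
 * Very rough sketch, not actual PostgreSQL code: a fixed-size hashtable
 * of fsync requests that deduplicates on insert.  Linear probing, no
 * locking, no tombstones.
 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct FsyncTag
{
    uint32_t    rel_oid;        /* which relation */
    uint32_t    segno;          /* which 1GB segment of it */
} FsyncTag;

typedef struct FsyncSlot
{
    bool        used;
    FsyncTag    tag;
} FsyncSlot;

#define FSYNC_TABLE_SIZE 4096   /* power of two, for cheap masking */

typedef struct FsyncRequestTable
{
    uint64_t    cycle_ctr;      /* bumped once per sync cycle */
    int         nused;
    FsyncSlot   slots[FSYNC_TABLE_SIZE];
} FsyncRequestTable;

static uint32_t
fsync_tag_hash(const FsyncTag *tag)
{
    /* trivial mixing; a real implementation would use a proper hash */
    return (tag->rel_oid * 0x9e3779b9u) ^ tag->segno;
}

/*
 * Returns true if the request is (now) present in the table, false if the
 * table is completely full.  Because we look for an existing entry while
 * probing, each (relation, segment) pair occupies at most one slot.
 */
static bool
fsync_request_insert(FsyncRequestTable *tab, FsyncTag tag)
{
    uint32_t    pos = fsync_tag_hash(&tag) & (FSYNC_TABLE_SIZE - 1);
    int         probe;

    for (probe = 0; probe < FSYNC_TABLE_SIZE; probe++)
    {
        FsyncSlot  *slot = &tab->slots[(pos + probe) & (FSYNC_TABLE_SIZE - 1)];

        if (!slot->used)
        {
            slot->used = true;
            slot->tag = tag;
            tab->nused++;
            return true;
        }
        if (memcmp(&slot->tag, &tag, sizeof(FsyncTag)) == 0)
            return true;        /* already queued: deduplicated */
    }
    return false;               /* table full, caller must cope */
}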
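
And, continuing the sketch, the per-segment fast path (loosely what I mean by tracking a cycle counter in MdfdVec) plus the overflow fallback; fsync_one() is just a stand-in for opening the segment and fsync()ing it:

/*
 * Hypothetical per-open-segment state, loosely modeled on MdfdVec.
 * Remembering the cycle in which we last queued a request lets writers
 * skip the (locked) hashtable entirely if they already queued one in
 * the current cycle.
 */
typedef struct SegState
{
    FsyncTag    tag;                /* identifies this segment */
    uint64_t    last_queued_cycle;  /* 0 = never queued */
} SegState;

/* Stub standing in for opening the segment file and fsync()ing it. */
static void
fsync_one(const FsyncTag *tag)
{
    (void) tag;
}

/*
 * Overflow path: the process that found no free slot performs a fraction
 * of the pending fsyncs itself, freeing slots for everyone.  Clearing
 * 'used' without tombstones can break probe chains, which here merely
 * allows an occasional duplicate entry (i.e. a redundant fsync), not
 * incorrectness.
 */
static void
fsync_absorb_fraction(FsyncRequestTable *tab, double fraction)
{
    int         target = (int) (tab->nused * fraction) + 1;
    int         i;

    for (i = 0; i < FSYNC_TABLE_SIZE && target > 0; i++)
    {
        if (!tab->slots[i].used)
            continue;
        fsync_one(&tab->slots[i].tag);
        tab->slots[i].used = false;
        tab->nused--;
        target--;
    }
}

/*
 * What mdwrite() would call for every buffer written out.  cycle_ctr is
 * assumed to start at 1 and be incremented by the checkpointer at the
 * beginning of each sync cycle.
 */
static void
register_dirty_segment(FsyncRequestTable *tab, SegState *seg)
{
    if (seg->last_queued_cycle == tab->cycle_ctr)
        return;                 /* already queued this cycle: fast path */

    if (fsync_request_insert(tab, seg->tag))
        seg->last_queued_cycle = tab->cycle_ctr;
    else
        fsync_absorb_fraction(tab, 0.10);   /* full: do ~10% ourselves */
}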