On Thu, Nov 20, 2025 at 11:45 AM Samuel Thibault
<[email protected]> wrote:
> Thomas Munro, on Tue, 18 Nov 2025 18:32:38 +1300, wrote:
> > Does that also imply that preadv() has to loop over the vectors
> > sending tons of messages and waiting for replies?
>
> Currently glibc's preadv performs copies.
Even without O_DIRECT (= potential scatter/gather DMA to/from user
space), the kernel/server still seems like a better place to put a
scatter/gather-with-memcpy() loop: it has to copy the data in/out
anyway, so with the user space implementation the data is copied
twice, and user space additionally has to allocate/free or
allocate/hold a temporary contiguous buffer. In PostgreSQL, the
maximum transfer size is 1MB and the default is 128kB, so that's not
peanuts when many processes/threads each need their own double-buffer.
I thought about that when we started using preadv/pwritev in
PostgreSQL and had to decide how to handle the few systems that don't
have them, and I chose to loop over pread()/pwrite() in user
space[1].
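For the archives, the fallback is roughly this shape (a simplified
sketch in the spirit of [1], not the exact code, and without the
pwrite() twin):

    #include <sys/uio.h>
    #include <unistd.h>

    /* Simplified sketch of a user space preadv() fallback: one pread()
     * per vector, stopping on a short read, and reporting partial
     * progress instead of an error once some bytes have transferred. */
    static ssize_t
    fallback_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
    {
        ssize_t sum = 0;

        for (int i = 0; i < iovcnt; ++i)
        {
            ssize_t part = pread(fd, iov[i].iov_base, iov[i].iov_len, offset);

            if (part < 0)
                return sum > 0 ? sum : -1;
            sum += part;
            offset += part;
            if ((size_t) part < iov[i].iov_len)
                break;      /* short read: stop, like real preadv() */
        }
        return sum;
    }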
(At the time, only Solaris and Windows lacked them, out of around a
dozen systems we test. In another nice example of ecosystem
interaction, the Solaris kernel team saw this, added the syscalls, and
proposed them for POSIX-next[2], so now it's down to Windows. Windows
can avoid this too and perform DMA for O_DIRECT, if we make some
architectural changes...)
About that 128kB/1MB number: that's another bit of unfinished business
for Hurd systems. We currently assume that if you didn't define
IOV_MAX, then it must be 16, which gives 16 * 8kB = 128kB. I know
that the comment about the Hurd in the relevant file[1] is wrong:
POSIX does not require IOV_MAX to be defined, and I know the GNU
philosophy is to avoid arbitrary limits. I doubt it matters much in
practice, especially without direct I/O (where larger scatter/gather
might scale non-linearly, creating a sweet spot higher than that), but
it'd be nice to improve that...
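Concretely, the assumption is just this (the real definitions live in
[1]; this is simplified):

    #include <limits.h>     /* defines IOV_MAX on most systems */

    /* The Hurd doesn't define IOV_MAX, which POSIX permits, so fall
     * back to _XOPEN_IOV_MAX (16), the smallest value POSIX lets an
     * implementation enforce. 16 vectors * 8kB blocks = 128kB. */
    #ifndef IOV_MAX
    #define IOV_MAX 16
    #endif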
I see that fs.defs can do fsync and fdatasync (= omit_metadata), which
is good; we'd make use of those too (v18 only does asynchronous reads,
but v19 will hopefully add writes). More pie-in-the-sky ideas include
an O_DSYNC that (1) is converted to FUA, (2) doesn't block concurrent
non-overlapping writes (PostgreSQL currently serialises its WAL
(transaction log) writes, but hopefully in future will learn not to),
and (3) falls back to flushing the drive write cache if the storage
doesn't support FUA, unless it somehow knows that isn't necessary
because of powered caches. That's what Linux does, anyway.
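For reference, I imagine the mapping would be something like this (a
sketch assuming the file_sync signature in fs.defs; real code needs
the usual descriptor-to-port lookup and error handling):

    #include <hurd/fs.h>    /* MIG stub for the file_sync RPC */

    /* Sketch: fsync()/fdatasync() both map onto file_sync, where
     * omit_metadata selects fdatasync semantics and wait=1 asks the
     * server not to reply until the data is durable. */
    static error_t
    sync_file(file_t port, int data_only)
    {
        return file_sync(port, 1 /* wait */, data_only /* omit_metadata */);
    }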
> > (And then to get more and more pie-in-the-sky: (1) O_DIRECT is highly
> > desirable for zero-copy DMA to/from a user space buffer pool,
>
> We don't currently have that defined.
. o O { Is anyone trying to put ext4 or xfs into a Hurd server? }
> > (2) starting more than one I/O with a single context switch and likewise
> > for consuming replies,
>
> That would be possible by introducing in gnumach a multi-message variant
> of the mach_msg() system call.
. o O { If I were designing a new mach_msgs() I'd also be tempted to
try to make it so that messages don't have to be copied in during the
system call, but can instead be accessed directly by the receiver.
That probably means registering VM pages with the port, preventing
faulting, and mapping them into both sender and receiver until the
port is closed, which in turn probably means you want a circular queue
to deal with the fixed space. I'm basically describing io_uring's
submission and completion queues, reimagined as user space port
buffers. }
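To make that daydream concrete, here's a purely hypothetical sketch of
such a shared "port buffer": a single-producer/single-consumer ring
over memory registered with the port and mapped into both tasks (every
name is invented; nothing like this exists in gnumach today):

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SLOTS  256             /* fixed at registration time */
    #define SLOT_SIZE   128             /* max inline message size */

    struct port_ring
    {
        _Atomic uint32_t head;          /* advanced by the consumer */
        _Atomic uint32_t tail;          /* advanced by the producer */
        uint8_t     slots[RING_SLOTS][SLOT_SIZE];
    };

    /* Producer: reserve a slot to fill, or NULL if the ring is full. */
    static inline void *
    ring_reserve(struct port_ring *r)
    {
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

        return tail - head == RING_SLOTS ? NULL : r->slots[tail % RING_SLOTS];
    }

    /* Producer: publish the reserved slot to the consumer. */
    static inline void
    ring_commit(struct port_ring *r)
    {
        atomic_fetch_add_explicit(&r->tail, 1, memory_order_release);
    }

    /* Consumer: peek at the next filled slot, or NULL if empty. */
    static inline void *
    ring_peek(struct port_ring *r)
    {
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        return head == tail ? NULL : r->slots[head % RING_SLOTS];
    }

    /* Consumer: retire the slot returned by ring_peek(). */
    static inline void
    ring_pop(struct port_ring *r)
    {
        atomic_fetch_add_explicit(&r->head, 1, memory_order_release);
    }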
> > (3) registering/locking memory pages and descriptors with a port so
> > they don't have to be pinned/unpinned by the I/O subsystem all the
> > time.
>
> That could be introduced too indeed.
If that is something the Hurd project is ever looking into, there is
an interesting special case for registered socket buffers: if you have
10,000 mostly idle sockets, and you have a recv() in progress on all
of them, then you don't really want to have to supply 10,000 user
space buffers to receive into, so it'd be nice to be able to register
a user space socket buffer pool of some smaller size and let the I/O
subsystem pick a free buffer when a packet arrives and tell you which
one in the reply message. This is a problem people meet when they
move from readiness-based to asynchronous networking at high
connection counts. PostgreSQL can't do asynchronous socket
I/O yet, but a couple of us have had semi-working prototypes... (That
architecture would prepare for more zero-copy/DMA-based networking and
hopefully even offloading TLS to kernel threads or fancy network
cards, while I'm doing a tour of vapourware...)
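In API terms the wish is roughly this (every identifier below is
invented for illustration, not a proposal for concrete Hurd
interfaces):

    #include <stddef.h>

    /* Completion record: instead of one buffer per in-flight recv(),
     * the application registers a shared pool once, and each reply
     * names the pool buffer the I/O subsystem chose to fill. */
    struct recv_completion
    {
        int     sock;           /* which of the 10,000 sockets */
        int     buf_index;      /* which pool buffer was filled */
        size_t  len;            /* bytes received */
    };

    /* Register `count` buffers of `size` bytes with the I/O subsystem. */
    int iopool_register(void *base, size_t size, int count);

    /* Start a recv() whose buffer is drawn from the pool on completion. */
    int iopool_recv_start(int sock);

    /* Return a filled buffer to the pool once its data is consumed. */
    int iopool_buf_release(int buf_index);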
> > And then, if Hurd works the way I think it might, (4) to avoid chains
> > of pipe-like scheduling overheads when starting a direct I/O and
> > maybe also some already-cached buffered I/O, you'd ideally want ports
> > to have a "fast" send path that behaves like the old Spring/Solaris
> > doors, where the caller's thread would yield directly to a thread in
> > the receiving server,
>
> That was proposed/experimented, called migrating threads:
>
> https://www.gnu.org/software/hurd/open_issues/mach_migrating_threads
Interesting paper. I wonder if Apple's XPC is somehow
short-circuiting Mach RPC like this too; for example, I think XPC is
used for talking to services like DNS lookup (a driving motivation for
doors on Solaris, IIRC), but all their newer OS stuff is closed
source, so... *shrug* :-/
Anyway, this sounds like quite a fun OS research project. I suspect
there wouldn't be too many other complex programs that could take
advantage of Mach's asynchrony to the degree PostgreSQL could, even
today but certainly later as its new AIO system spreads to more
parts... If the existing stability problems were resolved and out of
the way first, and if readv/writev operations were added, then I would
be willing to prototype a PostgreSQL patch to try it. Famous last
words perhaps, but it doesn't sound very hard: a minimal POC for a new
PostgreSQL I/O method weighs in at 200-400 lines of code, mostly setup
and mapping of our abstractions onto system calls, *if* the target has
the right basic operations and semantics with no hidden traps. All
the rest of the vapourware we've discussed might just be independent
optimisation work on the Hurd side after that, to make it perform?
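To give a sense of the scale: an I/O method is mostly a table of
callbacks, something like the following sketch (simplified and from
memory, not the exact definition; see src/backend/storage/aio/ in the
PostgreSQL tree for the real thing). A Mach-based method would
implement submit by sending a message, or a batch, per staged I/O, and
the wait callback by receiving replies.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct PgAioHandle PgAioHandle;     /* opaque per-I/O handle */

    /* Approximate shape of the callback table a new PostgreSQL I/O
     * method fills in; field names are from memory, not gospel. */
    typedef struct IoMethodOps
    {
        size_t  (*shmem_size) (void);               /* shared state sizing */
        void    (*shmem_init) (bool first_time);    /* shared state setup */
        void    (*init_backend) (void);             /* per-process setup */
        int     (*submit) (uint16_t num_staged_ios,
                           PgAioHandle **staged_ios);   /* start I/Os */
        void    (*wait_one) (PgAioHandle *ioh,
                             uint64_t ref_generation);  /* await completion */
    } IoMethodOps;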
[1] https://github.com/postgres/postgres/blob/master/src/include/port/pg_iovec.h
[2] https://austingroupbugs.net/view.php?id=1832