On Fri, Apr 14, 2023 at 12:47 AM Samuel Thibault <samuel.thiba...@gnu.org> wrote: > #10 0x0106bc20 in __libc_assert_fail (assertion=0x1222762 "port == arg", > file=0x121c2a0 "../sysdeps/mach/hurd/mig-reply.c", line=92, > function=0x121c2c4 <__PRETTY_FUNCTION__.0> "__mig_dealloc_reply_port") > at __libc_assert_fail.c:31 > #11 0x010283d2 in __GI___mig_dealloc_reply_port (arg=<optimized out>) > at ../sysdeps/mach/hurd/mig-reply.c:92 > #12 0x012eb23e in __msg_sig_post (process=27, signal=15, sigcode=0, > refport=35) > at /usr/src/glibc-upstream/build/hurd/RPC_msg_sig_post.c:160 > #13 0x01074dbc in kill_port (refport=<optimized out>, msgport=<optimized out>) > at ../sysdeps/mach/hurd/kill.c:67
Interesting... So first of all we're wrong, and the signal code does not destroy the reply port. In fact it takes care to restore it: /* Destroy the MiG reply port used by the signal handler, and restore the reply port in use by the thread when interrupted. */ reply_port = THREAD_GETMEM (THREAD_SELF, reply_port); THREAD_SETMEM (THREAD_SELF, reply_port, scp->sc_reply_port); if (MACH_PORT_VALID (reply_port)) __mach_port_destroy (__mach_task_self (), reply_port); (Yes, that's the very same code we've altered recently, but it also restored the reply port before my changes.) See, it destroys the current reply port, and restores scp->sc_reply_port. So then when we jump back to the user's code (which should be in intr-msg), it gets MACH_RCV_TIMED_OUT (if that's what happened), and proceeds to destroy the port itself. One notable thing here: the sigreturn code (above) uses sc_reply_port to destroy the current reply port. So: * if the rcv timed out, we're going to destroy the reply port, but we haven't yet, so the server could fake a reply to mach_port_destroy * if the mach_port_destroy RPC itself fails, we'll dealloc the user's port right here, and then it'll mismatch what the user expects it to be The latter is in fact what's tripping the assertion in your test case. mach_port_destroy (well, your test case contains my mach_port_mod_refs) expects to receive a matching reply, but instead we get a msg_sig_post_reply. We'd have caught this earlier if we didn't neglect the result of mach_port_destroy. Q: But why do we get msg_sig_post_reply here, why not in rpc_wait_trampoline, *before* running the signal handler? A: Because msg_sig_post is uninterruptible. (and this is stupid! and so frustrating! we should just make it interruptible) So: we should not use the user's reply port for this RPC. This is my fault (747812349d42427c835aeac987aa67641d84f1ad). In my defense, there was no comment explaining why this would be a bad idea; we should write one. I'll prepare a patch. But secondly, if we are not destroying the user's reply port, but restoring it, then I don't think the "port = MACH_PORT_NULL, arg = stale name" case can happen? So we don't need to handle it? We should really just assert (arg == port), and this will help us find more broken code. Thoughts? Sergey