On 08/09/2025 21:58, Samuel Thibault wrote:
> Michael Kelly, le lun. 08 sept. 2025 07:05:39 +0100, a ecrit:
>> The changes I suggest do not access the list in this way after the mutex has
>> been released. The next iteration restarts the scan from the (possibly new)
>> head of the list.
> Ah, it restarts over on each cancellation. That's really bad asymptotic
> quadratic complexity. That will hit us sooner or later.
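The quadratic cost can be illustrated with a toy sketch (the list and node types below are invented stand-ins, not the actual libports rpc_info structures): restarting the scan from the (possibly new) head after each cancellation visits n + (n-1) + ... + 1 = n(n+1)/2 nodes for a list of length n.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for the RPC list; not the real libports types. */
struct rpc
{
  struct rpc *next;
};

/* Cancel every RPC, restarting the scan from the (possibly new) head
   after each cancellation, as in the approach under discussion.
   Returns the number of nodes visited, to make the quadratic growth
   visible.  */
static unsigned long
cancel_all_restarting (struct rpc **head)
{
  unsigned long visited = 0;
  while (*head != NULL)
    {
      /* Scan from the head to the node we will cancel (here: the
         tail), mimicking a scan that cannot trust any saved position
         once the lock has been dropped.  */
      struct rpc *r = *head;
      while (r->next != NULL)
        {
          visited++;
          r = r->next;
        }
      visited++;
      /* "Cancel" the node: unlink and free it.  */
      if (r == *head)
        *head = NULL;
      else
        {
          struct rpc *p = *head;
          while (p->next != r)
            p = p->next;
          p->next = NULL;
        }
      free (r);
    }
  return visited;
}

static struct rpc *
make_list (unsigned n)
{
  struct rpc *head = NULL;
  while (n--)
    {
      struct rpc *r = malloc (sizeof *r);
      r->next = head;
      head = r;
    }
  return head;
}
```

For a 100-element list this visits 5050 nodes, and the total keeps growing with the square of the list length.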
Well, yes, I must agree with you. I had assumed a very small list length,
of the order of 10 or so, but clearly the length is potentially limited
only by memory resources, so the total number of iterations grows
quadratically as the length increases. Fair point. I did actually address
this issue by managing additional list pointers within rpc_info to
maintain a separate 'cancellation list'. I then concluded, however, that
the whole approach was flawed. For example, suppose two separate threads
initiate an interrupt_operation on the same remote port. On the remote
side, one thread would capture the RPCs to cancel, whilst a second thread
could soon commence (once _ports_lock is released), find no RPCs to
cancel, and complete before the first group had actually been cancelled.
This doesn't seem right, and in any case it is a change in behaviour.
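The interleaving can be sketched sequentially (all of the names below, struct port, capture_rpcs and cancel_captured, are invented for illustration and are not the libports API):

```c
#include <assert.h>

/* Invented type standing in for port_info/rpc_info; none of this is
   the actual libports API. */
struct port
{
  int pending_rpcs;   /* RPCs currently attached to the port */
  int cancelled;      /* RPCs actually cancelled so far      */
};

/* Step 1 of a hypothetical interrupt handler: while holding the global
   lock, detach all pending RPCs onto a private cancellation list and
   return how many were captured. */
static int
capture_rpcs (struct port *p)
{
  int n = p->pending_rpcs;
  p->pending_rpcs = 0;
  return n;
}

/* Step 2, after the global lock has been dropped: actually cancel the
   RPCs captured earlier. */
static void
cancel_captured (struct port *p, int n)
{
  p->cancelled += n;
}
```

Running the two steps of thread A with thread B's scan in between shows B finding nothing to cancel, and so able to complete, before any of A's captured RPCs have actually been cancelled.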
I think what is needed is a feature that guarantees the current
sequential behaviour per port_info but also permits genuine concurrency
across different port_info structures. That isn't really the case at the
moment because of the use of the global _ports_lock. This could possibly
be a port_info-specific mutex, or perhaps an extension of the 'blocking'
flags. I'm currently investigating those options, but it seems difficult
to ensure a locking order that prevents deadlock.
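For the locking-order problem, one conventional approach is to acquire per-object locks in a fixed global order, e.g. by address. This is only a generic sketch of that discipline (struct obj and lock_pair are invented names, and the toy flags stand in for real mutex operations), not a claim about how libports should structure it:

```c
#include <stddef.h>

/* Toy lock standing in for a per-port_info mutex; lock_pair records
   the acquisition order so the ordering discipline can be checked.
   With real locks these assignments would be pthread_mutex_lock
   calls. */
struct obj
{
  int locked;
};

/* Acquire both locks in a fixed global order (here: by address), so
   two threads locking the same pair with the arguments swapped still
   take the locks in the same sequence and cannot deadlock on each
   other. */
static void
lock_pair (struct obj *a, struct obj *b, struct obj *order[2])
{
  struct obj *first = a < b ? a : b;
  struct obj *second = a < b ? b : a;
  first->locked = 1;
  order[0] = first;
  second->locked = 1;
  order[1] = second;
}

static void
unlock_pair (struct obj *a, struct obj *b)
{
  a->locked = 0;
  b->locked = 0;
}
```

Calling lock_pair with the arguments in either order always takes the two locks in the same sequence, which is what rules out the classic two-lock deadlock.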
>>>> I don't understand the suggestion about not re-cancelling a thread already
>>>> in cancellation due to a signal.
>>> I'm not saying only about signals, but also about interruption:
>>> our issue is that ports_interrupt_rpcs calls hurd_thread_cancel
>>> which cancels the thread, and for that calls _hurdsig_abort_rpcs
>>> which might get stuck inside the __interrupt_operation() call. If
>>> in hurd_thread_cancel we check ss->cancel and avoid calling
>>> _hurdsig_abort_rpcs again, we won't call __interrupt_operation() again
>>> and get stuck there.
>> That occurs within the originating client but isn't the storm of
>> interruptions being generated on the server side?
> On the server side there can be a cascade of interruptions too, yes, but
> at least it wouldn't pile cancellations.
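The suggested guard can be sketched as follows. hurd_thread_cancel and _hurdsig_abort_rpcs are real glibc/Hurd functions, but everything below is a simplified toy model of them, with an invented counter to show that the abort path runs only once:

```c
/* Simplified stand-in for the per-thread sigstate used by
   hurd_thread_cancel; only the 'cancel' flag is modelled here. */
struct sigstate
{
  int cancel;   /* nonzero once cancellation is in progress */
};

static int abort_rpcs_calls;   /* counts entries into the abort path */

/* Stub for _hurdsig_abort_rpcs, the path that can end up stuck inside
   a nested __interrupt_operation() call. */
static void
abort_rpcs_stub (struct sigstate *ss)
{
  (void) ss;
  abort_rpcs_calls++;
}

/* Cancel a thread, but skip the abort path when the thread is already
   in cancellation, so __interrupt_operation() is never re-entered. */
static void
thread_cancel_sketch (struct sigstate *ss)
{
  if (ss->cancel != 1)
    {
      ss->cancel = 1;
      abort_rpcs_stub (ss);
    }
}
```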
I tested your idea of not re-cancelling a thread by guarding the code
from where the 'cancel' state is set to 1 through to after the call to
the cancellation hook with if (ss->cancel != 1). To capture a record of
these 'needless cancellations' I dumped some debug output using
mach_print (ext2fs doesn't seem to report on stderr). It took 3.5 hours
of running my test case before the system locked, with the scenario
involving a call to interrupt_operation() on ext2fs which then calls
interrupt_operation() on another task (as reported in earlier messages).
This was using the released Hurd source code without my alterations.

During the test run there were 1825 needless cancellations: 2 in term,
362 in ext2fs, 1439 in proc, and 22 in storeio.

I think this is an above-average 'time to failure', but I'd have to make
a statistically relevant number of runs for that to mean much. It does,
however, seem to show that there is still a need to address the issue of
holding _ports_lock whilst calling hurd_thread_cancel().
Best regards,
Mike.