On 08/09/2025 21:58, Samuel Thibault wrote:
> Michael Kelly, le lun. 08 sept. 2025 07:05:39 +0100, a ecrit:
>> The changes I suggest do not access the list in this way after the mutex has
>> been released. The next iteration restarts the scan from the (possibly new)
>> head of the list.
> Ah, it restarts over on each cancellation. That's really bad asymptotic
> quadratic complexity. That will hit us sooner or later.
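The quadratic cost can be illustrated with a toy sketch (the list and node types below are invented stand-ins, not the actual libports rpc_info structures): restarting the scan from the (possibly new) head after each cancellation visits n + (n-1) + ... + 1 = n(n+1)/2 nodes for a list of length n.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for the RPC list; not the real libports types. */
struct rpc
{
  struct rpc *next;
};

/* Cancel every RPC, restarting the scan from the (possibly new) head
   after each cancellation, as in the approach under discussion.
   Returns the number of nodes visited, to make the quadratic growth
   visible.  */
static unsigned long
cancel_all_restarting (struct rpc **head)
{
  unsigned long visited = 0;
  while (*head != NULL)
    {
      /* Scan from the head to the node we will cancel (here: the
         tail), mimicking a scan that cannot trust any saved position
         once the lock has been dropped.  */
      struct rpc *r = *head;
      while (r->next != NULL)
        {
          visited++;
          r = r->next;
        }
      visited++;
      /* "Cancel" the node: unlink and free it.  */
      if (r == *head)
        *head = NULL;
      else
        {
          struct rpc *p = *head;
          while (p->next != r)
            p = p->next;
          p->next = NULL;
        }
      free (r);
    }
  return visited;
}

static struct rpc *
make_list (unsigned n)
{
  struct rpc *head = NULL;
  while (n--)
    {
      struct rpc *r = malloc (sizeof *r);
      r->next = head;
      head = r;
    }
  return head;
}
```

For a 100-element list this visits 5050 nodes, and the total keeps growing with the square of the list length.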
Well, yes, I must agree with you. I had assumed a very small list length,
of the order of 10 or so, but clearly the length is potentially limited
only by memory resources, so the total number of iterations grows
quadratically as the length increases. Fair point. I did actually address
this issue by managing additional list pointers within rpc_info to
maintain a separate 'cancellation list'. I then concluded, however, that
the whole approach was flawed. For example, suppose two separate threads
initiate an interrupt_operation on the same remote port. On the remote
side, one thread would capture the RPCs to cancel, whilst a second thread
could soon commence (once _ports_lock is released), find no RPCs to
cancel, and complete before the first group had actually been cancelled.
This doesn't seem right, and in any case it is a change in behaviour.
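The interleaving can be sketched sequentially (all of the names below, struct port, capture_rpcs and cancel_captured, are invented for illustration and are not the libports API):

```c
#include <assert.h>

/* Invented type standing in for port_info/rpc_info; none of this is
   the actual libports API. */
struct port
{
  int pending_rpcs;   /* RPCs currently attached to the port */
  int cancelled;      /* RPCs actually cancelled so far      */
};

/* Step 1 of a hypothetical interrupt handler: while holding the global
   lock, detach all pending RPCs onto a private cancellation list and
   return how many were captured. */
static int
capture_rpcs (struct port *p)
{
  int n = p->pending_rpcs;
  p->pending_rpcs = 0;
  return n;
}

/* Step 2, after the global lock has been dropped: actually cancel the
   RPCs captured earlier. */
static void
cancel_captured (struct port *p, int n)
{
  p->cancelled += n;
}
```

Running the two steps of thread A with thread B's scan in between shows B finding nothing to cancel, and so able to complete, before any of A's captured RPCs have actually been cancelled.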
I think what is needed is a feature that guarantees the current
sequential behaviour per port_info but also permits genuine concurrency
across different port_info structures. That isn't really the case at the
moment because of the use of the global _ports_lock. This could possibly
be a port_info-specific mutex, or perhaps an extension of the 'blocking'
flags. I'm currently investigating those options, but it seems difficult
to ensure a locking order that prevents deadlock.
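For the locking-order problem, one conventional approach is to acquire per-object locks in a fixed global order, e.g. by address. This is only a generic sketch of that discipline (struct obj and lock_pair are invented names, and the toy flags stand in for real mutex operations), not a claim about how libports should structure it:

```c
#include <stddef.h>

/* Toy lock standing in for a per-port_info mutex; lock_pair records
   the acquisition order so the ordering discipline can be checked.
   With real locks these assignments would be pthread_mutex_lock
   calls. */
struct obj
{
  int locked;
};

/* Acquire both locks in a fixed global order (here: by address), so
   two threads locking the same pair with the arguments swapped still
   take the locks in the same sequence and cannot deadlock on each
   other. */
static void
lock_pair (struct obj *a, struct obj *b, struct obj *order[2])
{
  struct obj *first = a < b ? a : b;
  struct obj *second = a < b ? b : a;
  first->locked = 1;
  order[0] = first;
  second->locked = 1;
  order[1] = second;
}

static void
unlock_pair (struct obj *a, struct obj *b)
{
  a->locked = 0;
  b->locked = 0;
}
```

Calling lock_pair with the arguments in either order always takes the two locks in the same sequence, which is what rules out the classic two-lock deadlock.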
>>>> I don't understand the suggestion about not re-cancelling a thread already
>>>> in cancellation due to a signal.
>>> I'm not saying only about signals, but also about interruption:
>>> our issue is that ports_interrupt_rpcs calls hurd_thread_cancel
>>> which cancels the thread, and for that calls _hurdsig_abort_rpcs
>>> which might get stuck inside the __interrupt_operation() call. If
>>> in hurd_thread_cancel we check ss->cancel and avoid calling
>>> _hurdsig_abort_rpcs again, we won't call __interrupt_operation() again
>>> and get stuck there.
>> That occurs within the originating client but isn't the storm of
>> interruptions being generated on the server side?
> On the server side there can be a cascade of interruptions too, yes, but
> at least it wouldn't pile cancellations.
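The suggested guard can be sketched as follows. hurd_thread_cancel and _hurdsig_abort_rpcs are real glibc/Hurd functions, but everything below is a simplified toy model of them, with an invented counter to show that the abort path runs only once:

```c
/* Simplified stand-in for the per-thread sigstate used by
   hurd_thread_cancel; only the 'cancel' flag is modelled here. */
struct sigstate
{
  int cancel;   /* nonzero once cancellation is in progress */
};

static int abort_rpcs_calls;   /* counts entries into the abort path */

/* Stub for _hurdsig_abort_rpcs, the path that can end up stuck inside
   a nested __interrupt_operation() call. */
static void
abort_rpcs_stub (struct sigstate *ss)
{
  (void) ss;
  abort_rpcs_calls++;
}

/* Cancel a thread, but skip the abort path when the thread is already
   in cancellation, so __interrupt_operation() is never re-entered. */
static void
thread_cancel_sketch (struct sigstate *ss)
{
  if (ss->cancel != 1)
    {
      ss->cancel = 1;
      abort_rpcs_stub (ss);
    }
}
```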
I tested your idea of not re-cancelling a thread by guarding the code
from where the 'cancel' state is set to 1 through to after the call to
the cancellation hook with if (ss->cancel != 1). To capture a record of
these 'needless cancellations' I dumped some debug output using
mach_print (ext2fs doesn't seem to report on stderr). It took 3.5 hours
of running my test case before the system locked, with the scenario
involving a call to interrupt_operation() on ext2fs which then calls
interrupt_operation() on another task (as reported in earlier messages).
This was using the released Hurd source code without my alterations.

During the test run there were 1825 needless cancellations: 2 in term,
362 in ext2fs, 1439 in proc, and 22 in storeio.

I think this is an above-average 'time to failure', but I'd have to make
a statistically relevant number of runs for that to mean much. It does,
however, seem to show that there is still a need to address the issue of
holding _ports_lock whilst calling hurd_thread_cancel().
Best regards,
Mike.