On Wed, Jun 28, 2023 at 01:16:42PM +0200, Martin Pieuchot wrote:
> On 28/06/23(Wed) 08:58, Claudio Jeker wrote:
> > On Tue, Jun 27, 2023 at 08:18:15PM -0400, Kurt Miller wrote:
> > > On Jun 27, 2023, at 1:52 PM, Kurt Miller <k...@intricatesoftware.com>
> > > wrote:
> > > >
> > > > On Jun 14, 2023, at 12:51 PM, Vitaliy Makkoveev <m...@openbsd.org>
> > > > wrote:
> > > >>
> > > >> On Tue, May 30, 2023 at 01:31:08PM +0200, Martin Pieuchot wrote:
> > > >>> So it seems the java process is holding the `sysctl_lock' for too
> > > >>> long and blocks all other sysctl(2) calls.  This seems wrong to me.
> > > >>> We should come up with a clever way to prevent vslocking too much
> > > >>> memory.  A single lock obviously doesn't fly with that many CPUs.
> > > >>>
> > > >> We vslock memory to prevent a context switch while doing copyin()
> > > >> and copyout(), right?  This is required to avoid context switches
> > > >> within the foreach loops over kernel-lock-protected lists.  But it
> > > >> does not seem to be required for simple sysctl_int() calls or
> > > >> rwlock-protected data.  So the sysctl_lock acquisition and the
> > > >> uvm_vslock() calls could be avoided for a significant number of mibs
> > > >> and pushed deeper down for the rest.
> > > >
> > > > I'm back on -current testing and have some additional findings that
> > > > may help a bit.  The memory leak fix had no effect on this issue.
> > > > -current behavior is as I previously described.  When java trips the
> > > > issue, it goes into a state where many threads are all running at
> > > > 100% cpu but the process does not make forward progress.  I'm going
> > > > to call this state a run-away java process.  Java is calling
> > > > sched_yield(2) when in this state.
> > > >
> > > > When java is in the run-away state, a different process can trip the
> > > > next stage, where processes block waiting on sysctllk indefinitely.
> > > > top with process arguments is one way to trip it; pgrep and ps -axl
> > > > also trip this.  In my last test on -current, java was stuck in the
> > > > run-away state for 7 hours 45 minutes before the daily cron run
> > > > caused the lockups.
> > > >
> > > > I did a test with -current + locking sched_yield() back up with the
> > > > kernel lock.  The behavior changed slightly.  Java still enters the
> > > > run-away state occasionally but eventually does make forward progress
> > > > and complete.  When java is in the run-away state the sysctllk issue
> > > > can still be tripped, but if it is not tripped java eventually
> > > > completes.  For about 200 invocations of a java command that usually
> > > > takes 50 seconds to complete, java entered the run-away state 4 times
> > > > but eventually completed:
> > > >
> > > > Typically it runs like this:
> > > >  0m51.16s real  5m09.37s user  0m49.96s system
> > > >
> > > > The exceptions look like this:
> > > >  1m11.15s real  5m35.88s user  13m20.47s system
> > > > 27m18.93s real 31m13.19s user 754m48.41s system
> > > > 13m44.44s real 19m56.11s user 501m39.73s system
> > > > 19m23.72s real 24m40.97s user 629m08.16s system
> > > >
> > > > Testing -current with dumbsched.3 behaves the same as -current
> > > > described above.
> > > >
> > > > One other thing I observed so far is what happens when egdb is
> > > > attached to the run-away java process.  egdb stops the process using
> > > > ptrace(2) PT_ATTACH.  Now if I issue a command that would typically
> > > > lock up the system, like top displaying command line arguments, the
> > > > system does not lock up.  I think this rules out the "kernel memory
> > > > is fragmented" theory.
> > > >
> > > > Switching CPUs in ddb tends to lock up ddb, so I have limited info,
> > > > but here is what I have from the -current lockup and the -current
> > > > with dumbsched.3 lockup.
> > >
> > > Another data point to support the idea of a missing wakeup: when java
> > > is in the run-away state, if I send SIGSTOP followed by SIGCONT it
> > > dislodges it from the run-away state and it returns to normal
> > > operation.
> >
> > I doubt this is a missing wakeup.  It is more that the system is
> > thrashing and not making progress.  The SIGSTOP causes all threads to
> > park, which means that the thread not busy in its sched_yield() loop
> > will finish its operation, and then on SIGCONT progress is possible.
> >
> > I need to recheck your ps output from ddb but I guess one of the
> > threads is stuck in a different place.  That is where we need to look.
> > It may well be a bad interaction between SCHED_LOCK() and whatever else
> > is going on.  Or simply poor userland scheduling based on
> > sched_yield()...
> 
> To me it seems there are two bugs in your report:
> 
> 1/ a deadlock due to a single rwlock in sysctl(2)
> 
> 2/ something unknown in java not making progress and calling
>    sched_yield() and triggering 1/
> 
> While 1/ is well understood, 2/ isn't.  Why java is not making progress
> is what should be understood.  Knowing where the sched_yield() is coming
> from can help.
For 1/ the main issue is that sysctl_proc_args(), the call used to grab the
arguments of a process, needs to access that process's uvm map (the
environment and argument strings are only stored in userland).  So 2/
causing no progress in a uvm operation causes 1/ to lock up, and from there
on any sysctl(2) call blocks (and almost all binaries do call sysctl).  A
rough sketch of the userland side of that call is in the PS below.

For 2/ it could well be that there is no progress because SCHED_LOCK is so
contended that nothing is able to move forward.  One major issue is that
rwlocks use SCHED_LOCK because of sleep_setup()/sleep_finish().  So doing
constant sched_yield() calls with 40+ threads will slow down these
operations.

-- 
:wq Claudio
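
PS: For anyone who wants to poke at 1/, here is a minimal, untested sketch
(my own illustration, not code taken from ps(1), top(1) or libkvm) of the
userland side of the call: fetching another process's argument vector via
the KERN_PROC_ARGS sysctl.  This is the request that ends up in
sysctl_proc_args() in the kernel, so any tool doing it hangs once the
sysctl lock is stuck.  The program name (procargs) and the grow-on-ENOMEM
buffer sizing are just assumptions for the example.

/*
 * procargs.c -- hypothetical example: print the argv of a pid, the same
 * way ps/pgrep/top end up asking the kernel for it.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
	int mib[4] = { CTL_KERN, KERN_PROC_ARGS, 0, KERN_PROC_ARGV };
	size_t bufsiz = ARG_MAX, len;
	char **args, *buf = NULL, *nbuf;

	if (argc != 2)
		errx(1, "usage: procargs pid");
	mib[2] = atoi(argv[1]);

	/*
	 * Ask the kernel to copy the target's argument strings into our
	 * buffer.  This is where sysctl_proc_args() has to look at the
	 * other process's uvm map, and where we block if the sysctl lock
	 * is held.  Grow the buffer and retry if it is too small.
	 */
	for (;;) {
		len = bufsiz;
		if ((nbuf = realloc(buf, bufsiz)) == NULL)
			err(1, "realloc");
		buf = nbuf;
		if (sysctl(mib, 4, buf, &len, NULL, 0) != -1)
			break;
		if (errno != ENOMEM)
			err(1, "sysctl");
		bufsiz *= 2;
	}

	/* The returned buffer starts with a NULL-terminated pointer array. */
	for (args = (char **)buf; *args != NULL; args++)
		printf("%s ", *args);
	printf("\n");

	free(buf);
	return 0;
}

Point it at the run-away java pid while the lockup is in progress and it
should hang in the same place top and pgrep do.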