On Mon, Apr 07, 2025 at 06:13:21AM -0400, Bill deWindt wrote:
> Thanks for the additional info on this issue. Here's the output from both of
> the machines I have here. One interesting thing I've been seeing from the
> beginning is that the 450Mhz machine always has the hung process at PID 19,
> and here's its output:
>
> # cat /proc/19/stack
> [<0>] __remove_hrtimer+0x5c/0xd8
> [<0>] msleep+0x30/0x4c
> [<0>] tau_work_func+0x24/0x68
> [<0>] process_one_work+0x1b8/0x3d8
> [<0>] worker_thread+0x288/0x3cc
> [<0>] kthread+0xe0/0xe4
> [<0>] start_kernel_thread+0x10/0x14
...
> I am guessing (perhaps incorrectly?) that since all of the output from each
> trace above matches, with the exception of the first line, this gives an
> idea of where the tickle lies. Is there further digging I can do that would
> be useful?

It looks like this is the thermal monitoring for the CPU. The code is
found in arch/powerpc/kernel/tau_6xx.c, and tau_work_func does call
msleep, which sleeps uninterruptibly.

	static void tau_work_func(struct work_struct *work)
	{
		msleep(shrink_timer);
		on_each_cpu(tau_timeout, NULL, 0);
		/* schedule ourselves to be run again */
		queue_work(tau_workq, work);
	}

The function at the top of each stack presumably ran in a hardware
interrupt handler, since msleep itself would just put the task to sleep.

Since this worker thread would be created very early in the boot
process, it's not surprising if it gets a fairly consistent PID.

This function is called from a worker thread running items from a work
queue. It does an uninterruptible sleep, runs tau_timeout on each CPU,
and then puts itself back on the workqueue. Since this is a dedicated
worker thread for this queue, that one thread will basically just sit
in this function all the time. If tau_timeout takes next to no time to
run on this hardware, the thread will spend most of its time in msleep,
which shows as state 'D' in ps and thus affects the load average.

I'm not an expert in this particular driver or how it needs to behave,
but perhaps it shouldn't be using msleep for something like this. In
some of the code I do maintain, we replaced some uninterruptible sleeps
with interruptible ones specifically so the threads would show in state
'S' instead of 'D' and stop affecting the load average. Signal handling
for kernel threads is different from the handling in a user thread in a
system call, so there are some tricks that work without causing major
issues.
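
To make the load average point concrete, here is a small userspace model of
how a task that never leaves state 'D' is counted. It is only a sketch under
stated assumptions: the fixed-point constants and the rounding are modelled on
the kernel's load-average calculation as I understand it, while enum
task_state, count_active and simulate_loadavg are hypothetical names made up
for this example and do not exist in the kernel.

```c
#include <assert.h>

/* Fixed-point format modelled on the kernel's load-average code. */
#define FSHIFT   11                 /* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)    /* 1.0 in fixed point */
#define EXP_1    1884UL             /* approx. exp(-5s/60s) in fixed point */

/* Hypothetical, simplified task states for this model only. */
enum task_state { STATE_RUNNING, STATE_SLEEP_S, STATE_SLEEP_D };

/* A task counts as active if it is runnable or sleeping uninterruptibly. */
static unsigned long count_active(const enum task_state *tasks, int n)
{
    unsigned long active = 0;

    for (int i = 0; i < n; i++)
        if (tasks[i] != STATE_SLEEP_S)
            active++;
    return active;
}

/* One 5-second update of the 1-minute exponentially decayed average. */
static unsigned long calc_load(unsigned long load, unsigned long active_fixed)
{
    unsigned long newload = load * EXP_1 + active_fixed * (FIXED_1 - EXP_1);

    if (active_fixed >= load)
        newload += FIXED_1 - 1;     /* round up while the load is rising */
    return newload / FIXED_1;
}

/*
 * Apply `samples` updates with an unchanging set of tasks, starting from
 * a load of zero. Returns the 1-minute load in hundredths (100 == 1.00).
 */
static unsigned long simulate_loadavg(const enum task_state *tasks, int n,
                                      int samples)
{
    assert(n >= 0 && samples >= 0);

    unsigned long load = 0;
    unsigned long active = count_active(tasks, n) * FIXED_1;

    for (int i = 0; i < samples; i++)
        load = calc_load(load, active);
    return load * 100 / FIXED_1;
}
```

Calling simulate_loadavg with a single STATE_SLEEP_D task and 720 samples
(one hour of 5-second ticks) settles at 100, i.e. a load of 1.00 on an
otherwise idle machine, while the same call with a STATE_SLEEP_S task stays
at 0. That is the difference a switch from 'D' to 'S' would make in this
model.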

Someone who knows the core powerpc code better than I do will need to
comment on this driver and whether it makes sense to change it.

	Brad Boyer
	f...@allandria.com