Dear Mr. Piggin,

first of all, thanks for your response :-)

On 13 Sep 2007 at 2:30, Nick Piggin wrote:
>
> Can you see if it is looping in userspace or kernel? Can you kill -9
> the process?
>
I can't run any command. Any command hangs or coredumps.

> Are you able to test with the latest 2.6.23-rc kernel? If not (or if it
> still has the same problem), then can you get the output of sysrq+T
> and three sysrq+P calls, please? (this might help work out where in
> kernel it is spinning).
>
I've compiled 2.6.23-rc6, enabled the serial console and captured the
output of sysrq+P (on the affected virtual VGA console) and sysrq+T.
http://www.fccps.cz/download/adv/frr/bonnie/2.6.23-rc6.txt
The interesting bit of information, related to the erratic "bash"
processes, is always a single line, such as:

bash R running 0 2358 1

I've also taken a photo of `top` running on another virtual console.
I can't get any data out of the affected box, as I can't run any
shell commands...
http://www.fccps.cz/download/adv/frr/bonnie/top.jpg
Note that there are rather few processes running in user space.
I can't say if that makes any difference compared to a full-blown distro.
Maybe I could set up the bootable CD for download somewhere
(a gzipped ISO of maybe 50 Megs).

In this scenario, Linux 2.6.16.18 once reported a soft lockup.
http://www.fccps.cz/download/adv/frr/bonnie/soft-lockup1.txt
Never again since.

I also managed to catch the misbehavior in strace once. I didn't get
a capture, but essentially it was stuck at a single open syscall;
I believe it was "waitpid(1, ". (I never managed that again. I always
got segfaults instead of the loopy bash when trying to watch bash
with strace -p.)

Exactly where does the context switch from user to kernel take place?
I know that I can call ioctl() from user space, and I can write
ioctl() handlers in kernel space as part of device drivers (the
handlers run entirely in kernel space). The waitpid() thing is
a syscall, entered only once from user space, and the bash process
seems to keep looping inside it.
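
On a box where a shell still works (not the case on the affected machine), one way to tell user-space spinning from kernel-space spinning would be to sample the utime and stime counters in /proc/<pid>/stat twice and compare the deltas. A minimal sketch in Python, assuming the usual procfs layout where utime is field 14 and stime is field 15, both in clock ticks. The function names are made up for illustration, and the "mostly user/kernel" rule is just a crude heuristic:

```python
import time


def parse_cpu_ticks(stat_line):
    """Return (utime, stime) in clock ticks from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ')', so split after the LAST closing parenthesis.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so utime (field 14) is rest[11]
    # and stime (field 15) is rest[12].
    return int(rest[11]), int(rest[12])


def classify(before, after):
    """Compare two (utime, stime) samples and say where the time went."""
    user_delta = after[0] - before[0]
    kernel_delta = after[1] - before[1]
    if user_delta == 0 and kernel_delta == 0:
        return "idle"
    if user_delta >= kernel_delta:
        return "mostly user space"
    return "mostly kernel space"


def sample(pid, interval=1.0):
    """Read /proc/<pid>/stat twice, `interval` seconds apart, and classify."""
    def read():
        with open("/proc/%d/stat" % pid) as f:
            return parse_cpu_ticks(f.read())

    before = read()
    time.sleep(interval)
    return classify(before, read())
```

Calling sample(2358) for the PID from the sysrq+T line above would then report where that bash spent its CPU time during the interval. A process spinning inside a syscall should show stime growing while utime stays flat.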

Does the single "running" line in Alt+SysRq+T mean that the process
is looping in user space? Take a look at the CPU consumption %
numbers though...

Note that there's no OOM killer. (Seen that one before, under
different circumstances - when OCFS2 didn't like machines
with less than 1 GB RAM.)

My impression is that the erratic behavior could be a secondary
symptom of a kernel-space memory leak taking place somewhere else
than in the loopy code itself. Can't say if the leak takes place
in memory management or EXT3 for instance...

Frank Rysanek