On Mon, Jul 29, 2019 at 01:20:58PM +0000, Stuart Henderson wrote:
> On 2019-07-29, Raimo Niskanen <raimo+open...@erix.ericsson.se> wrote:
> > A new hang, I tried to investigate:
> >
> > At July 19 the last log entry from my 'ps' log was from 14:55, which is
> > also the time on the 'systat vmstat' screen when it froze. Then the machine
> > hums along but just after midnight at 00:42:01 the first "/bsd: process:
> > table is full" entry appears. That message repeats until I rebooted it
> > today at July 29 10:48.
> >
> > I had a terminal with top running. It was still updating. It showed about
> > 98% sys and 2% spin on one of 4 CPUs, the others 100% idle. Then (after
> > the process table had gotten full) it had 1282 idle processes and 1 on
> > processor, which was 'top' itself.
> > Memory: Real: 456M/1819M act/tot Free: 14G Cache: 676M Swap: 0K/16G.
> >
> > I had 8 shells under tmux ready for debugging. 'ls' worked.
> > 'systat' on one hung. 'top' on another failed with "cannot fork".
> > 'exec ps ajxww' printed two lines with /sbin/init and /sbin/slaac
> > and then hung. 'exec reboot' did not succeed. Neither did a short power
> > button, that at least caused a printout "stopping daemon nginx(failed)",
> > but got no further. I had to do a hard power off.
> >
> > My theory now is that our daily tests right before 14:55 started a process
> > (this process is the top 'top' process with 10:14 execution time) that
> > triggers a lock or other contention problem in the kernel which causes
> > one CPU to spin in the system, and blocks processes from dying.
> > About 10 hours later the process table gets full.
> >
> > Any, ANY ideas of how to proceed would be appreciated!
> >
> > Best Regards
>
> Did you notice any odd waitchan's (WAIT in top output)?

I do not think so: select (for the possibly triggering process), - (for 'top'),
kqread, netlock, bpf, wait, piperd.

>
> Maybe set ddb.console=1 in sysctl.conf and reboot (if not already
> set), then try to break into DDB during a hang and see how things look
> in ps there. (Test breaking into DDB before a hang first so you know
> that you can do it .. you can just "c" to continue).
>
> There might also be clues in things like "sh malloc" or "sh all pools".

Sounds like fun - will try that!

>
> Perhaps you could also get clues from running a kernel built with
> 'option WITNESS', you may get some messages in dmesg, or it adds commands
> to ddb like "show locks", "show all locks", "show witness" (see ddb(4) for
> details).

Maybe later. I have gotten used to not compiling my kernel...

>
> Can you provoke a hang by running this process manually?

Might be worth a try to repeat the suspected test case many times.
I will try.

Thanks for the hints!
-- 
/ Raimo Niskanen, Erlang/OTP, Ericsson AB