Re: How to debug hanging machines / proc: table is full

Raimo Niskanen Mon, 09 Sep 2019 08:43:17 -0700

On Wed, Jul 31, 2019 at 05:46:08PM +0200, Raimo Niskanen wrote:
> On Mon, Jul 29, 2019 at 01:20:58PM +0000, Stuart Henderson wrote:
> > On 2019-07-29, Raimo Niskanen <raimo+open...@erix.ericsson.se> wrote:
> > > A new hang, I tried to invstigate:
> > >
> > > At July 19 the last log entry from my 'ps' log was from 14:55, which is
> > > also the time on the 'systat vmstat' screen when it froze.  Then the 
> > > machine
> > > hums along but just after midnight at 00:42:01 the first "/bsd: process:
> > > table is full" entry appears.  That message repeats until I rebooted it
> > > today at July 29 10:48.
> > >
> > > I had a terminal with top running.  It was still updating.  It showed 
> > > about
> > > 98% sys and 2% spin on one of 4 CPUs, the others 100% idle.  Then (after
> > > the process table had gotten full) it had 1282 idle processes and 1 on
> > > processor, which was 'top' itself.
> > > Memory: Real: 456M/1819M act/tot Free: 14G Cache: 676M Swap: 0K/16G.
> > >
> > > I had 8 shells under tmux ready for debugging.  'ls worked.
> > > 'systat' on one hung.  'top' on another failed with "cannot fork".
> > > 'exec ps ajxww" printed two lines with /sbin/init and /sbin/slaac
> > > and then hung.  'exec reboot' did not succeed.  Neither did a short power
> > > button, that at least caused a printout "stopping daemon nginx(failed)",
> > > but got no further.  I had to do a hard power off. 
> > >
> > > My theory now is that our daily tests right before 14:55 started a process
> > > (this process is the top 'top' process with 10:14 execution time) that
> > > triggers a lock or other contention problem in the kernel which causes
> > > one CPU to spin in the system, and blocks processes from dying.
> > > About 10 hours later the process table gets full.
> > >
> > > Any, ANY ideas of how to proceed would be appreciated!
> > >
> > > Best Regards
> > 
> > Did you notice any odd waitchan's (WAIT in top output)?
> > 
> > Maybe set ddb.console=1 in sysctl.conf and reboot (if not already
> > set), then try to break into DDB during a hang and see how things look
> > in ps there. (Test breaking into DDB before a hang first so you know
> > that you can do it .. you can just "c" to continue).
> > 
> > There might also be clues in things like "sh malloc" or "sh all pools".
> > 
> > Perhaps you could also get clues from running a kernel built with
> > 'option WITNESS', you may get some messages in dmesg, or it adds commands
> > to ddb like "show locks", "show all locks", "show witness" (see ddb(4) for
> > details).
> 
> I have enabled Witness, it went so-so.  We'll see what it catches.
> 
> I downloaded 6.5 amd64 src.tar.gz and sys.tar.gz, unpacked them,
> applied all patches for stable 001-006 and built a kernel with:
>   include "arch/amd64/conf/GENERIC"
>   option      MULTIPROCESSOR
>   option      MP_LOCKDEBUG
>   option      WITNESS
> 
> Then I activated in /etc/sysctl.conf:
>   ddb.console=1
>   kern.witness.locktrace=1
>   kern.witness.watch=3
> 
> For fun, I pressed Ctrl+Alt+Esc at the console, got a ddb> prompt and typed
> "show witness".  It printed lots of info, I scrolled down to the end, but
> during the printout there was an UVM fault:
> 
>   Spin locks:
>   /usr/src/sys/....
>   :
>   bla bla bla
>   :
>   uvm_fault(0xffffffff81e03b50, 0xffff800022368360, 0, 1) -> e
>   kernel: page fault trap, code=0
>   Faulted in DDB: continuing...
> 
> Then I typed "cont" and it panicked.
> If anybody want details I took a picture.
> 
> Have I combined too many debugging options, or is this sh*t that happens?
> 
> Nevertheless, now the machine is running again, with Witness...
> 
> I'll be back.


I have encountered some kind of stop, oddly enought not a panic - it
just sat in ddb and I missed it for a week (or more).  Then I did not
remember what I had planned to do so I "improvised" X-| , but anyway:

ddb{0}> ps
shows about 350 processes from cron, half of them in state netlock, half
in state piperd.  Then I have my test processes beam.smp: 6 in netlock, 6
in piperd, about 70 in fsleep, 3 in poll, 3 in select, 4 in kqread.
Then about 100 more ordinary looking processes...

ddb{0}> trace
db_enter()...
softclock(0)...
softintr_dispatch(0)...
Xsoftclock(0,0,1388,....)...
acpicpu_idle()...
shed_idle(ffffffff81ceff0)...
end trace frame: 0x0, count: -6

ddb{0}> show locks
exclusive kernel_lock &kernel_lock r = 0 (0xffffffff81e37b10) locked @
/usr/src/sys/arch/amd64/amd64/softintr.c:87
#0  witness_lock+0x41f
#1  softintr_dispatc+0x56
#2  Xsoftclock+0x1f
#3  acpicpu_idle+0x271
#4  sched_idle+0x235
#5  proc_trampoline+0x1c

ddb{0}> show nfsnode
size 5476515891122113356 flag 0 vnode 0xd080c7d8 accstamp 1099511681152

(I think the size looks strange)

Then I tried show map and got a protection fault trap, gave up and
rebooted.

That was it!  Next time I will try:
  trace
  ps
  show malloc
  show all pools
  show locks
  show all locks
unless anyone has got more or better suggestions...

Best Regards
-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Re: How to debug hanging machines / proc: table is full

Reply via email to