Hi,
Recently I've started to see a lot of cases where the log fills up with
"listen queue overflow" messages and the process behind the network
socket becomes unresponsive.
When I open a TCP connection to it, the connection is established but
nothing happens (for example, I get no SMTP banner from postfix, nor do
I get a log entry about the new connection).
I've seen this with Java programs, postfix and redis; basically
anything that opens a TCP socket and listens on the machine.
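For what it's worth, the fill level of every listen queue can also be
checked with netstat's -L flag (the columns are qlen/incqlen/maxqlen per
listening socket); this is only the command, I haven't captured its
output from an affected box:
# netstat -Lan
A queue pinned at its maxqlen while the daemon still answers TCP
handshakes would match what I describe above.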
For example, I have a redis process which listens on 6381. When I
telnet into it, the TCP connection opens, but the program doesn't respond.
When I kill it, nothing happens. Even kill -9 yields only this state:
  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
  776 redis       2  20    0 24112K  2256K STOP    3  16:56  0.00% redis-server
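The per-thread state can be dumped with procstat too; I'm only showing
the command here, not output from the box:
# procstat -t 776
It lists each thread's TID, state and wait channel, which should show
what the two threads of the stuck process are blocked on.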
When I tcpdrop the connections of the process, tcpdrop reports success
the first time and failure ("No such process") the second time, but the
connections remain:
# sockstat -4 | grep 776
redis redis-serv 776 6 tcp4 *:6381 *:*
redis redis-serv 776 9 tcp4 *:16381 *:*
redis redis-serv 776 10 tcp4 127.0.0.1:16381 127.0.0.1:10460
redis redis-serv 776 11 tcp4 127.0.0.1:16381 127.0.0.1:35795
redis redis-serv 776 13 tcp4 127.0.0.1:30027 127.0.0.1:16379
redis redis-serv 776 14 tcp4 127.0.0.1:58802 127.0.0.1:16384
redis redis-serv 776 17 tcp4 127.0.0.1:16381 127.0.0.1:24354
redis redis-serv 776 18 tcp4 127.0.0.1:16381 127.0.0.1:56999
redis redis-serv 776 19 tcp4 127.0.0.1:16381 127.0.0.1:39488
redis redis-serv 776 20 tcp4 127.0.0.1:6381 127.0.0.1:39491
# sockstat -4 | grep 776 | awk '{print "tcpdrop "$6" "$7}' | /bin/sh
tcpdrop: getaddrinfo: * port 6381: hostname nor servname provided, or not known
tcpdrop: getaddrinfo: * port 16381: hostname nor servname provided, or not known
tcpdrop: 127.0.0.1 16381 127.0.0.1 10460: No such process
tcpdrop: 127.0.0.1 16381 127.0.0.1 35795: No such process
tcpdrop: 127.0.0.1 30027 127.0.0.1 16379: No such process
tcpdrop: 127.0.0.1 58802 127.0.0.1 16384: No such process
tcpdrop: 127.0.0.1 16381 127.0.0.1 24354: No such process
tcpdrop: 127.0.0.1 16381 127.0.0.1 56999: No such process
tcpdrop: 127.0.0.1 16381 127.0.0.1 39488: No such process
tcpdrop: 127.0.0.1 6381 127.0.0.1 39491: No such process
# sockstat -4 | grep 776
redis redis-serv 776 6 tcp4 *:6381 *:*
redis redis-serv 776 9 tcp4 *:16381 *:*
redis redis-serv 776 10 tcp4 127.0.0.1:16381 127.0.0.1:10460
redis redis-serv 776 11 tcp4 127.0.0.1:16381 127.0.0.1:35795
redis redis-serv 776 13 tcp4 127.0.0.1:30027 127.0.0.1:16379
redis redis-serv 776 14 tcp4 127.0.0.1:58802 127.0.0.1:16384
redis redis-serv 776 17 tcp4 127.0.0.1:16381 127.0.0.1:24354
redis redis-serv 776 18 tcp4 127.0.0.1:16381 127.0.0.1:56999
redis redis-serv 776 19 tcp4 127.0.0.1:16381 127.0.0.1:39488
redis redis-serv 776 20 tcp4 127.0.0.1:6381 127.0.0.1:39491
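(The two getaddrinfo errors above are just tcpdrop choking on the
wildcard listening sockets that the awk one-liner feeds it; an untested
variant that skips the listeners and splits the address:port pairs into
tcpdrop's four arguments would be:
# sockstat -4 | awk '$3 == 776 && $6 !~ /\*/ && $7 !~ /\*/ { gsub(":", " ", $6); gsub(":", " ", $7); print "tcpdrop " $6 " " $7 }' | /bin/sh
It doesn't help with the "No such process" failures, it only avoids the
noise.)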
$ procstat -k 776
PID TID COMM TDNAME KSTACK
  776 100725 redis-server     -                mi_switch sleepq_timedwait_sig _sleep kern_kevent sys_kevent amd64_syscall Xfast_syscall
  776 100744 redis-server     -                mi_switch thread_suspend_switch thread_single exit1 sigexit postsig ast doreti_ast
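If I read the stacks right, thread 100744 is handling the SIGKILL
(postsig -> sigexit -> exit1) and waits in thread_single for the other
thread to stop, while thread 100725 keeps sleeping in kern_kevent, so
the exit never completes; I may be misreading it, though. Fuller stacks
with function offsets can be obtained by repeating the -k flag:
$ procstat -kk 776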
Nothing I do gets it out of this state; only a reboot helps.
The OS is stable/10@r289313, but I have observed this behaviour with
earlier releases too.
dmesg is full of lines like these:
sonewconn: pcb 0xfffff8004dc54498: Listen queue overflow: 193 already in queue awaiting acceptance (3142 occurrences)
sonewconn: pcb 0xfffff8004d9ed188: Listen queue overflow: 193 already in queue awaiting acceptance (3068 occurrences)
sonewconn: pcb 0xfffff8004d9ed188: Listen queue overflow: 193 already in queue awaiting acceptance (3057 occurrences)
sonewconn: pcb 0xfffff8004d9ed188: Listen queue overflow: 193 already in queue awaiting acceptance (3037 occurrences)
sonewconn: pcb 0xfffff8004d9ed188: Listen queue overflow: 193 already in queue awaiting acceptance (3015 occurrences)
sonewconn: pcb 0xfffff8004d9ed188: Listen queue overflow: 193 already in queue awaiting acceptance (3035 occurrences)
I guess this is an effect of the process freeze, not the cause (the
listen queue fills up because the application no longer accepts the
incoming connections).
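If I understand sonewconn correctly, the number 193 itself is nothing
special: the overflow message fires once the queue exceeds 1.5 times the
socket's listen() backlog, and the backlog is capped by
kern.ipc.somaxconn (128 by default, and 1.5 * 128 gives 192), so any
daemon that simply stops calling accept() ends up here. The cap can be
checked with:
# sysctl kern.ipc.somaxconn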
I'm not sure it matters, but some of the machines (including the one
above) run on an ESX hypervisor. As far as I can remember I have seen
this on physical machines too, but I'm not sure about that.
Also, so far I have only seen this on machines where some "exotic" stuff
runs, like a Java- or Erlang-based server (OpenDJ, Elasticsearch and
RabbitMQ).
I'm also not sure what triggers it. I've never seen it after just a few
hours of uptime; at least several days or a week must pass before things
get stuck like this.
Any ideas about this?
Thanks,