Re: sshd with zombie process on FreeBSD 10.0-STABLE

Marcelo Gondim Fri, 28 Mar 2014 13:00:01 -0700

Em 28/03/14 15:01, Marcelo Gondim escreveu:

Hi all,

I have noticed zombie processes on the system after a few lostconnections on ssh.


# ps afx
[...]
 8045  -  Is       0:00.01 sshd: unknown [priv] (sshd)
 8046  -  Z        0:00.01 <defunct>
 8054  -  IW       0:00.00 sshd: unknown [pam] (sshd)
28146  -  Is       0:00.01 sshd: unknown [priv] (sshd)
28147  -  Z        0:00.01 <defunct>
28155  -  IW       0:00.00 sshd: unknown [pam] (sshd)
43320  -  Is       0:00.01 sshd: unknown [priv] (sshd)
43321  -  Z        0:00.01 <defunct>
43322  -  IW       0:00.00 sshd: unknown [pam] (sshd)
73413  -  Is       0:00.01 sshd: unknown [priv] (sshd)
73414  -  Z        0:00.01 <defunct>
73430  -  IW       0:00.00 sshd: unknown [pam] (sshd)
[...]

More information discussed in freebsd-stable@list:

On Tue, Mar 25, 2014 at 3:52 AM, Karl Pielorz <kpielorz_...@tdx.co.uk<mailto:kpielorz_...@tdx.co.uk>> wrote:




   --On 21 March 2014 22:02 -0700 Kevin Oberman <rkober...@gmail.com
   <mailto:rkober...@gmail.com>> wrote:

       Ideally I'd look to try and capture the packets st the end of
       the session.
       Can you do something to trigger this reliably? if so "standard"
       "tcpdump

-pw file.bpf host HOST". I seem to recall that theseconnections are

       scheduled. If so, you can put the packet capture in a crontab to
       run at
       the same time. If you feed this to a tool like wireshark, you
       should get

a good idea of what is happening, if not why. I understand thatthe

       timing of this might be very tricky.


   Ok, fwiw as I have this issue as well - I've done a packet capture
   (it's below). This box has 59 'CLOSED' sockets from ssh on it, and
   60 sshd stuck.

   Also - initially I thought this was a Xen issue - so there's a
   couple of posts on that list from a couple of weeks ago, in brief -
   the sshd processes I have are stuck in 'urdlck' - one of the Xen
   guys commented "It seems like the process is stuck while trying to
   acquire a rw mutex in read mode."

   I did a backtrace of a stuck process - I can post that if you want
   (or check the FreeBSD-Xen list for 'stuck sshd in urdlck'.

   Also, if I ssh into this host, 90% of the time (seems to get worse
   the longer the box is up) I get:

   "
   ssh_exchange_identification: Connection closed by remote host
   "

   That does not leave a lingering CLOSED socket. In fact, a successful
   ssh login - and logout, does not in testing appear to leave a
   lingering CLOSED socket, nor sshd stuck in urdlck - so I'm not

entirely sure where they're coming from, or how often they arecreated.


   tcpdump from the start of a successful ssh connect:

   10:47:04.777765 IP (tos 0x0, ttl 61, id 4058, offset 0, flags [DF],
   proto TCP (6), length 60)
       192.168.0.37.31139 > 192.168.0.138.22: Flags [S], cksum 0x57ce
   (correct), seq 634709832, win 65535, options [mss 1368,nop,wscale
   3,sackOK,TS val 1019391535 ecr 0], length 0

   10:47:04.777776 IP (tos 0x0, ttl 64, id 10060, offset 0, flags [DF],
   proto TCP (6), length 60)
       192.168.0.138.22 > 192.168.0.37.31139: Flags [S.], cksum 0x7ef8
   (incorrect -> 0x69d8), seq 2316386788, ack 634709833, win 65535,
   options [mss 1368,nop,wscale 6,sackOK,TS val 805368299 ecr
   1019391535], length 0

   10:47:04.804218 IP (tos 0x0, ttl 61, id 4059, offset 0, flags [DF],
   proto TCP (6), length 52)
       192.168.0.37.31139 > 192.168.0.138.22: Flags [.], cksum 0x77d3
   (correct), ack 1, win 8305, options [nop,nop,TS val 1019391538 ecr
   805368299], length 0

   10:47:04.809692 IP (tos 0x0, ttl 64, id 10061, offset 0, flags [DF],
   proto TCP (6), length 99)
       192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x7f1f
   (incorrect -> 0x5799), seq 1:48, ack 1, win 1038, options
   [nop,nop,TS val 805368328 ecr 1019391538], length 47

   10:47:04.836110 IP (tos 0x0, ttl 61, id 4060, offset 0, flags [DF],
   proto TCP (6), length 92)
       192.168.0.37.31139 > 192.168.0.138.22: Flags [P.], cksum 0x8afa
   (correct), seq 1:41, ack 48, win 8305, options [nop,nop,TS val
   1019391541 ecr 805368328], length 40

   10:47:04.836669 IP (tos 0x0, ttl 64, id 10062, offset 0, flags [DF],
   proto TCP (6), length 1596)
       192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x84f8
   (incorrect -> 0x4d42), seq 48:1592, ack 41, win 1038, options
   [nop,nop,TS val 805368358 ecr 1019391541], length 1544


   And the end of the session:

   10:47:15.243540 IP (tos 0x10, ttl 61, id 4132, offset 0, flags [DF],
   proto TCP (6), length 100)
       192.168.0.37.31139 > 192.168.0.138.22: Flags [P.], cksum 0x6364
   (correct), seq 2321:2369, ack 3520, win 8305, options [nop,nop,TS
   val 1019392582 ecr 805375068], length 48

   10:47:15.243736 IP (tos 0x10, ttl 64, id 10125, offset 0, flags
   [DF], proto TCP (6), length 100)
       192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x7f20
   (incorrect -> 0x3ea8), seq 3520:3568, ack 2369, win 1038, options
   [nop,nop,TS val 805378765 ecr 1019392582], length 48

   10:47:15.243796 IP (tos 0x10, ttl 64, id 10126, offset 0, flags
   [DF], proto TCP (6), length 100)
       192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x7f20
   (incorrect -> 0xdd22), seq 3568:3616, ack 2369, win 1038, options
   [nop,nop,TS val 805378765 ecr 1019392582], length 48

   10:47:15.244627 IP (tos 0x10, ttl 64, id 10127, offset 0, flags
   [DF], proto TCP (6), length 84)
       192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x7f10
   (incorrect -> 0x86ed), seq 3616:3648, ack 2369, win 1038, options
   [nop,nop,TS val 805378765 ecr 1019392582], length 32

   10:47:15.244812 IP (tos 0x10, ttl 64, id 10128, offset 0, flags
   [DF], proto TCP (6), length 212)
       192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x7f90
   (incorrect -> 0x431c), seq 3648:3808, ack 2369, win 1038, options
   [nop,nop,TS val 805378765 ecr 1019392582], length 160

   10:47:15.271439 IP (tos 0x10, ttl 61, id 4134, offset 0, flags [DF],
   proto TCP (6), length 52)
       192.168.0.37.31139 > 192.168.0.138.22: Flags [.], cksum 0x3381
   (correct), ack 3616, win 8299, options [nop,nop,TS val 1019392585
   ecr 805378765], length 0

   10:47:15.272238 IP (tos 0x10, ttl 61, id 4135, offset 0, flags [DF],
   proto TCP (6), length 52)
       192.168.0.37.31139 > 192.168.0.138.22: Flags [.], cksum 0x32d9
   (correct), ack 3808, win 8275, options [nop,nop,TS val 1019392585
   ecr 805378765], length 0

   10:47:15.273515 IP (tos 0x10, ttl 61, id 4137, offset 0, flags [DF],
   proto TCP (6), length 84)
       192.168.0.37.31139 > 192.168.0.138.22: Flags [P.], cksum 0x43cc
   (correct), seq 2369:2401, ack 3808, win 8305, options [nop,nop,TS
   val 1019392585 ecr 805378765], length 32

   10:47:15.276199 IP (tos 0x10, ttl 61, id 4138, offset 0, flags [DF],
   proto TCP (6), length 116)
       192.168.0.37.31139 > 192.168.0.138.22: Flags [P.], cksum 0xb169
   (correct), seq 2401:2465, ack 3808, win 8305, options [nop,nop,TS
   val 1019392585 ecr 805378765], length 64

   10:47:15.276220 IP (tos 0x10, ttl 64, id 10129, offset 0, flags
   [DF], proto TCP (6), length 52)
       192.168.0.138.22 > 192.168.0.37.31139: Flags [.], cksum 0x7ef0
   (incorrect -> 0x4ea3), ack 2465, win 1037, options [nop,nop,TS val
   805378793 ecr 1019392585], length 0

   10:47:15.276970 IP (tos 0x10, ttl 61, id 4140, offset 0, flags [DF],
   proto TCP (6), length 52)
       192.168.0.37.31139 > 192.168.0.138.22: Flags [F.], cksum 0x325a
   (correct), seq 2465, ack 3808, win 8305, options [nop,nop,TS val
   1019392585 ecr 805378765], length 0

   10:47:15.276978 IP (tos 0x10, ttl 64, id 10130, offset 0, flags
   [DF], proto TCP (6), length 52)
       192.168.0.138.22 > 192.168.0.37.31139: Flags [.], cksum 0x7ef0
   (incorrect -> 0x4e9c), ack 2466, win 1038, options [nop,nop,TS val
   805378798 ecr 1019392585], length 0

   10:47:15.277212 IP (tos 0x10, ttl 64, id 10131, offset 0, flags
   [DF], proto TCP (6), length 52)
       192.168.0.138.22 > 192.168.0.37.31139: Flags [F.], cksum 0x7ef0
   (incorrect -> 0x4e9b), seq 3808, ack 2466, win 1038, options
   [nop,nop,TS val 805378798 ecr 1019392585], length 0

   10:47:15.303993 IP (tos 0x10, ttl 61, id 4142, offset 0, flags [DF],
   proto TCP (6), length 52)
       192.168.0.37.31139 > 192.168.0.138.22: Flags [.], cksum 0x3235
   (correct), ack 3809, win 8305, options [nop,nop,TS val 1019392588
   ecr 805378798], length 0


   This box has no services running on it at present, and just sits
   there idle. I'll periodically check on it and see if the CLOSED
   socket count, or hung sshd count goes up.

   It's runing as a PVHVM domU under XenServer 6.2

Since I am retired, I no longer have access to my beloved "TCP/IPIllustrated Vol. 1", but I believe CLOSED status indicates that thesocket is no longer in use.As the name says, it is CLOSED. I don'tknow why sshd should be waiting on a CLOSED socket, nor do Iunderstand why a CLOSED server socket should live on for extended times.

That said, I have been monitoring my FreeBSD 10 system and it seems topick up an occasional case where a socket gets "stuck" in the CLOSEDstate for a very long time. I can't say forever, but at least manyminutes. I don't see this on 9.2 systems. Right now I have two CLOSEDsockets, both the the same address on port 80. It appears that theonly thing that removes the socket is to kill the owning process (inthis case it's Firefox). The owning firefox process does exist justfine when asked and I think sshd should do so as well, but it looks tome like the root problem is that CLOSED socket seem to live on in 10.0and they don't on older versions.

I think it's time to move this over to net@, as I suspect a changemade in 10 is triggering this. I'm not sure the changes is wrong orthat sshd is wrong in not exiting cleanly when sockets are hangingaround in CLOSED state. (Nor am I sure sshd is wrong.)

If I am confused about this, please let me know. I'm going ot read the10.0 release notes to see if they say anything about this.


Another my server with 12 zombie process related with CLOSED connections:

last pid: 84965;  load averages:  2.74,  2.65, 2.07 up 14+23:43:16  16:10:00
133 processes: 1 running, 120 sleeping, 12 zombie
CPU:  8.5% user,  0.0% nice, 39.3% system, 11.8% interrupt, 40.4% idle
Mem: 194M Active, 2575M Inact, 828M Wired, 77M Cache, 391M Buf, 3056K Free
Swap: 16G Total, 377M Used, 16G Free, 2% Inuse, 8K In

# netstat -n |grep CLOSED
tcp4       0      0 192.168.254.1.22       10.254.0.12.26792 CLOSED
tcp4       0      0 186.xxx.xx.2.22        186.xxx.xx.8.60591 CLOSED
tcp4       0      0 192.168.254.1.22       10.254.0.12.26974 CLOSED
tcp4       0      0 186.xxx.xx.2.22        186.xxx.xx.8.54034 CLOSED
tcp4       0      0 186.xxx.xx.2.22        186.xxx.xx.8.52999 CLOSED
tcp4       0      0 186.xxx.xx.2.22        186.xxx.xx.8.65345 CLOSED
tcp4       0      0 186.xxx.xx.2.22        186.xxx.xx.8.64099 CLOSED
tcp4       0      0 192.168.254.1.22       10.254.0.12.26457 CLOSED
tcp4       0      0 186.xxx.xx.2.22        186.xxx.xx.8.54181 CLOSED
tcp4       0      0 186.xxx.xx.2.22        186.xxx.xx.8.53717 CLOSED
tcp4       0      0 186.xxx.xx.2.22        186.xxx.xx.8.57035 CLOSED
tcp4       0      0 192.168.254.1.22       10.254.0.12.26832 CLOSED


_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: sshd with zombie process on FreeBSD 10.0-STABLE

Reply via email to