Hi,

This error was originally reported and discussed in detail with Frostypants, LTEO and Mischief on #openbsd@Freenode on 2013-05-03, 18:41 - 21:19 GMT.
This error was considered so exotic by the IRC channel that Frostypants wrote "/me shoots himself" and then followed up by saying that I "win a prize" for triggering it.

Request: Network stack / OS kernel introspection tools (new custom ones or just shell commands) for nailing down the error source on the next occurrence:
1) For instance, an mbuf prettyprint-dumper and an mbuf resetter would be of value.
2) A general TCP network stack state prettyprint-dumper and resetter would also be of value - the dumper to track the source of the problem, and the resetter to fix the problem without needing a reboot.

I will probably upgrade to 5.3 shortly, so the next occurrence will be on 5.3. However, I expect nothing to have been fixed since 5.2, as there is no mention of any network stack issue in http://openbsd.org/errata52.html, so all code related to this error is presumably unchanged.

Report:
OBSD version: 5.2 AMD64 GENERIC multicore
Occurred: 2013-05-03 and today, 2013-07-07.
Reproducible: By me, beyond these two occurrences, no.
Machine: Dual Xeon with a BGE NIC

Environment details:
- On the second occurrence there was one defunct process that had a bound TCP port.
- Permanently ~50 incoming TCP connections per second as some kind of undirected spamming/flooding/micro-semi-DDoS; no clue from whom, or whether with any particular flags.
- On both occurrences, an EGDB session run as a user had previously crashed such that kill -9 was needed.
- At some points, user processes had encountered malloc failure due to insufficient RAM.
- Other than this, absolutely nothing exotic.

Brief error description:
The machine suddenly, without any pre-notice or other factor that could be anticipated by the admin/user, stops accepting incoming TCP connections on all interfaces (external & lo0). The error is limited to this - outbound TCP connections (lynx etc.), PING (i.e. ICMP) and DNS lookups are not affected but work fine.
So for example, "ssh localhost" - while sshd is running and netstat clearly shows it listening on all interfaces for inbound connections - responds with "Connection reset by peer".

A thorough discussion on #openbsd@Freenode revealed that "netstat -m" said "14308 Kbytes allocated (93% in use)" on the 2013-05-03 occurrence and "14308 Kbytes allocated (93% in use)" on the 2013-07-07 occurrence; other than this, every explored system parameter looked normal.

Analysis and details of the error, per the IRC log from 2013-05-03:
- /etc/rc.d/sshd stop; /etc/rc.d/sshd start; ssh localhost => Connection reset by peer
  ssh localhost => Connection refused
  telnet localhost 22 => Connection refused
  ssh -vvv 127.0.0.1 => OpenSSH 6.1, OpenSSL \n Reading config \n ssh_connect: needpriv 0 \n Connecting to 127.0.0.1 \n connect to address 127.0.0.1 port 22: Connection reset by peer
  even while "netstat -an | grep LISTEN | grep 22" shows "tcp 0 0 *.22 *.* LISTEN" and an equivalent row for tcp6.
- Equal effect when doing the same with OpenSMTPD (i.e. /etc/rc.d/smtpd restart; telnet localhost 25 etc.), an HTTP server, etc. Equally, a minimalistic TCP server written in C is unable to pick up new connections, so the problem really reduces to inbound TCP connections not being received anymore. This C program was indeed reported correctly by "netstat -an | grep LISTEN" as: "tcp 0 0 127.0.0.1.1055 *.* LISTEN"
- Does not change anything about the error:
  - sudo sh /etc/netstart
  - Reloading pf.conf (pfctl -f /etc/pf.conf, something like this)
  - pfctl -d (i.e. disabling PF altogether)
  - ifconfig lo0/bge0 down
- netstat shows ~100 connections from localhost to localhost, and ~50 connections from external hosts in FIN_WAIT.
- Frostypants points out: the host is returning TCP RST then, not just dropping it. That's... special.
- netstat -an | grep LISTEN shows that SSH, SMTP, HTTP etc. are indeed being listened for.
- 'tcpdump -nni lo0 port 22' and then doing "telnet localhost 22" shows that SYN, SYN-ACK, ACK, RST are returned (!); see the attached image for the full log. Note that this was done after "pfctl -d". The image is also downloadable at http://s000.tinyupload.com/download.php?file_id=28266514139859296465&t=2826651413985929646525648.
- fstat -n | wc -l gives 1017
- fstat -n | grep sshd | wc -l gives 7
- pfctl -si | grep entries gives "current entries 12"
- Nothing funky in 'dmesg'; it shows sd1 is up and that is all.
- "netstat -m" reports:
  6863 in use
  13804 Kbytes allocated to network (97% in use)
  0 requests for memory denied
  0 requests for memory delayed
- "netstat -p tcp -ss" shows "17023 discarded for bad checksums" and that is all under the topic "discarded".
- 'netstat -an | wc -l' gives 508
- Doing this:
  netstat -p tcp -s >1.txt
  ssh localhost
  netstat -p tcp -s >2.txt
  diff -u 1.txt 2.txt
  showed differences in these regards: packets sent, control packets, connection requests, connections closed, segments updated RTT, retransmit timeouts, keepalive timeouts, CWR by timeout.
- vmstat -m shows: 5842/5938/6144 mbuf 2048 byte clusters in use (current/peak/max); in use 126920K, total allocated 158580K; utilization 80%; and all is 0 in the fail column.
- systat -b mbufs says 3 users, Load 0.1 0.12 0.13 Fri ..... and the values under SIZE say:
  IFACE LIVELOCKS SIZE ALIVE LWM HWM CWM
  System 0 256 6837 545
  size: 2k alive: 5819 hwm: 2969
  lo0 bge0 bge1 enc0 pflog0
- Then, after 2 hours and performing all the checks above:
  - "ssh localhost" changed behavior: now it started *delaying* (as in, blocking); tested for 2 minutes, then cancelled the test with CTRL+C.
  - netstat -m reported: 964 Kbytes allocated to network (27% in use)

[demime 1.01d removed an attachment of type image/png which had a name of tcpdump -nni lo0 port 22 output for telnet localhost 22.png]
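A side note on telling the two observed failure modes apart: "Connection refused" means an RST answered the SYN directly (no listener, or the stack refusing outright), while "Connection reset by peer" during connect means an RST arrived around or after handshake completion - matching the SYN, SYN-ACK, ACK, RST sequence in the tcpdump capture. A small C sketch (my own illustration, not a program from the original session) that reports which errno connect() fails with; here it deliberately targets a loopback port known to be closed, so it prints the ECONNREFUSED case:

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    /* Find a port that is almost certainly closed: bind to an ephemeral
       port, record it, then close the socket before connecting to it. */
    int tmp = socket(AF_INET, SOCK_STREAM, 0);
    if (tmp < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;
    if (bind(tmp, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); return 1; }
    socklen_t len = sizeof(addr);
    if (getsockname(tmp, (struct sockaddr *)&addr, &len) < 0) { perror("getsockname"); return 1; }
    close(tmp);                      /* nothing is listening on this port now */

    int cli = socket(AF_INET, SOCK_STREAM, 0);
    if (cli < 0) { perror("socket"); return 1; }
    if (connect(cli, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        switch (errno) {
        case ECONNREFUSED:
            puts("ECONNREFUSED: RST answered our SYN (no listener)");
            break;
        case ECONNRESET:
            puts("ECONNRESET: handshake progressed, then an RST arrived");
            break;
        default:
            printf("connect failed: %s\n", strerror(errno));
        }
    } else {
        puts("connected");
    }
    close(cli);
    return 0;
}
```

On the broken machine the symptom was the ECONNRESET-style path against ports that netstat showed in LISTEN state, which is what made the behavior so unusual.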
