Hi,

This error was originally reported and discussed in detail with Frostypants, LTEO and Mischief on #openbsd@Freenode on 2013-05-03, 18:41 - 21:19 GMT.
This error was considered so exotic by the IRC channel that Frostypants wrote "/me shoots himself" and then followed up by saying that I "win a prize" for triggering it.

Request: Network stack / OS kernel introspection tools (new custom ones or just shell commands) for nailing down the error source on the next occurrence:
1) For instance, an mbuf prettyprint-dumper and an mbuf resetter would be of value.
2) A general TCP network stack state prettyprint-dumper and resetter would also be of value - the dumper to track the source of the problem, and the resetter to fix the problem without needing a reboot.

I will probably upgrade to 5.3 shortly, so the next occurrence will be on 5.3. However, I expect nothing to have been fixed since 5.2, as there is no mention of any network stack issue in http://openbsd.org/errata52.html, so all code related to this error is presumably unchanged.

Report:
OBSD version: 5.2 AMD64 GENERIC multicore
Occurred: 2013-05-03 and today, 2013-07-07.
Reproducible: By me, beyond these two occurrences, no.
Machine: Dual Xeon with a BGE NIC

Environment details:
- On the second occurrence there was one defunct process that had a bound TCP port.
- Permanently ~50 incoming TCP connections per second as some kind of undirected spamming/flooding/micro-semi-DDoS; no clue from whom, or whether with any particular flags.
- On both occurrences, an EGDB session run as a user had previously crashed such that kill -9 was needed.
- At some points, user processes had encountered malloc failure due to insufficient RAM.
- Other than this, absolutely nothing exotic.

Brief error description:
The machine suddenly, without any pre-notice or other factor that could be anticipated by the admin/user, stops accepting incoming TCP connections on all interfaces (external & lo0). The error is limited to this - outbound TCP connections (lynx etc.), PING (i.e. ICMP) and DNS lookups are not affected but work fine.
So for example, "ssh localhost" - while sshd is running and netstat clearly shows it listening on all interfaces for inbound connections - responds with "Connection reset by peer".

A thorough discussion on #openbsd@Freenode revealed that "netstat -m" said "14308 Kbytes allocated (93% in use)" on the 2013-05-03 occurrence and "14308 Kbytes allocated (93% in use)" on the 2013-07-07 occurrence; other than this, every explored system parameter looked normal.

Analysis and details of the error, per the IRC log from 2013-05-03:
- /etc/rc.d/sshd stop; /etc/rc.d/sshd start; ssh localhost => Connection reset by peer
  ssh localhost => Connection refused
  telnet localhost 22 => Connection refused
  ssh -vvv 127.0.0.1 => OpenSSH 6.1, OpenSSL \n Reading config \n ssh_connect: needpriv 0 \n Connecting to 127.0.0.1 \n connect to address 127.0.0.1 port 22: Connection reset by peer
  even while "netstat -an | grep LISTEN | grep 22" shows "tcp 0 0 *.22 *.* LISTEN" and an equivalent row for tcp6.
- Equal effect when doing the same with OpenSMTPD (i.e. /etc/rc.d/smtpd restart; telnet localhost 25 etc.), an HTTP server, etc. Equally, a minimalistic TCP server written in C is unable to pick up new connections, so the problem really reduces to inbound TCP connections not being received anymore. This C program was indeed reported correctly by "netstat -an | grep LISTEN" as: "tcp 0 0 127.0.0.1.1055 *.* LISTEN"
- Does not change anything about the error:
  - sudo sh /etc/netstart
  - Reloading pf.conf (pfctl -f /etc/pf.conf, something like this)
  - pfctl -d (i.e. disabling PF altogether)
  - ifconfig lo0/bge0 down
- netstat shows ~100 connections from localhost to localhost, and ~50 connections from external hosts in FIN_WAIT.
- Frostypants points out: the host is returning TCP RST then, not just dropping it. That's... special.
- netstat -an | grep LISTEN shows that SSH, SMTP, HTTP etc. are indeed being listened for.
- 'tcpdump -nni lo0 port 22' and then doing "telnet localhost 22" shows that SYN, SYN-ACK, ACK, RST are returned (!); see the attached image for the full log. Note that this was done after "pfctl -d". The image is also downloadable at http://s000.tinyupload.com/download.php?file_id=28266514139859296465&t=2826651413985929646525648.
- fstat -n | wc -l gives 1017
- fstat -n | grep sshd | wc -l gives 7
- pfctl -si | grep entries gives "current entries 12"
- Nothing funky in 'dmesg'; it shows sd1 is up and that is all.
- "netstat -m" reports:
  6863 in use
  13804 Kbytes allocated to network (97% in use)
  0 requests for memory denied
  0 requests for memory delayed
- "netstat -p tcp -ss" shows "17023 discarded for bad checksums" and that is all under the topic "discarded".
- 'netstat -an | wc -l' gives 508
- Doing this:
  netstat -p tcp -s >1.txt
  ssh localhost
  netstat -p tcp -s >2.txt
  diff -u 1.txt 2.txt
  showed differences in these regards: packets sent, control packets, connection requests, connections closed, segments updated RTT, retransmit timeouts, keepalive timeouts, CWR by timeout.
- vmstat -m shows: 5842/5938/6144 mbuf 2048 byte clusters in use (current/peak/max); in use 126920K, total allocated 158580K; utilization 80%; and all is 0 in the fail column.
- systat -b mbufs says 3 users, Load 0.1 0.12 0.13 Fri ..... and the values under SIZE say:
  IFACE LIVELOCKS SIZE ALIVE LWM HWM CWM
  System 0 256 6837 545
  size: 2k alive: 5819 hwm: 2969
  lo0 bge0 bge1 enc0 pflog0
- Then, after 2 hours and performing all the checks above:
  - "ssh localhost" changed behavior: now it started *delaying* (as in, blocking); tested for 2 minutes, then cancelled the test with CTRL+C.
  - netstat -m reported: 964 Kbytes allocated to network (27% in use)

[demime 1.01d removed an attachment of type image/png which had a name of tcpdump -nni lo0 port 22 output for telnet localhost 22.png]
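A side note on telling the two observed failure modes apart: "Connection refused" means an RST answered the SYN directly (no listener, or the stack refusing outright), while "Connection reset by peer" during connect means an RST arrived around or after handshake completion - matching the SYN, SYN-ACK, ACK, RST sequence in the tcpdump capture. A small C sketch (my own illustration, not a program from the original session) that reports which errno connect() fails with; here it deliberately targets a loopback port known to be closed, so it prints the ECONNREFUSED case:

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    /* Find a port that is almost certainly closed: bind to an ephemeral
       port, record it, then close the socket before connecting to it. */
    int tmp = socket(AF_INET, SOCK_STREAM, 0);
    if (tmp < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;
    if (bind(tmp, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); return 1; }
    socklen_t len = sizeof(addr);
    if (getsockname(tmp, (struct sockaddr *)&addr, &len) < 0) { perror("getsockname"); return 1; }
    close(tmp);                      /* nothing is listening on this port now */

    int cli = socket(AF_INET, SOCK_STREAM, 0);
    if (cli < 0) { perror("socket"); return 1; }
    if (connect(cli, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        switch (errno) {
        case ECONNREFUSED:
            puts("ECONNREFUSED: RST answered our SYN (no listener)");
            break;
        case ECONNRESET:
            puts("ECONNRESET: handshake progressed, then an RST arrived");
            break;
        default:
            printf("connect failed: %s\n", strerror(errno));
        }
    } else {
        puts("connected");
    }
    close(cli);
    return 0;
}
```

On the broken machine the symptom was the ECONNRESET-style path against ports that netstat showed in LISTEN state, which is what made the behavior so unusual.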
