To sum this one up from my end: my suspicion is that the TCP stack buffer
exhaustion that led to the limited TCP stack zombie state happened because
I had a TCP server process that accept()ed every incoming TCP connection,
without any internal connection overload mechanism, at the same time as a
third party was flooding it.

Based on this, my working conclusion is that this accept()ing led the TCP
stack to run out of mbufs (without any warning left in the dmesg about it),
and that, through some more or less exotic logic in the TCP stack, the
'hangup state' explored by Karlis and myself in the previous ML posts was
entered.


Since this 'hangup state' was experienced, the TCP server process has been
given an overload mechanism, and I have not seen any 'hangup state' since,
which supports the hypothesis above.
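
For illustration, an overload mechanism of this kind could look roughly
like the Python sketch below. It is not the server's actual code, and the
thread does not say which mechanism was used. The names ConnectionGate and
serve_forever, the fixed cap, and the accept-then-close shedding policy
are all assumptions made for the example.

```python
import threading


class ConnectionGate:
    """Counts live connections and refuses new ones above a fixed cap."""

    def __init__(self, limit):
        self._limit = limit
        self._active = 0
        self._lock = threading.Lock()

    def try_enter(self):
        """Reserve a slot; return False when the cap is already reached."""
        with self._lock:
            if self._active >= self._limit:
                return False
            self._active += 1
            return True

    def leave(self):
        """Give a slot back once a connection has been closed."""
        with self._lock:
            self._active -= 1

    @property
    def active(self):
        with self._lock:
            return self._active


def _run(conn, gate, handle):
    # Always close the socket and free the slot, even if the handler raises.
    try:
        handle(conn)
    finally:
        conn.close()
        gate.leave()


def serve_forever(listener, gate, handle):
    """Accept loop that sheds load instead of serving every connection."""
    while True:
        conn, _addr = listener.accept()
        if not gate.try_enter():
            # Over the cap: close at once so the socket is released quickly.
            conn.close()
            continue
        threading.Thread(
            target=_run, args=(conn, gate, handle), daemon=True
        ).start()
```

With a listening socket from socket.create_server(("", 8080)), a call such
as serve_forever(listener, ConnectionGate(256), handler) would serve at
most 256 connections at a time. The value 256 is arbitrary here, and
whether this kind of cap is enough to keep the kernel from running out of
mbufs under a flood is untested in this sketch.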

So that's about it from my end.



On a related question: I heard someone suggest that the default
kern.maxclusters setting "as a rule of thumb should be doubled in OpenBSD".

What is there to say on this topic? How do you know what your
kern.maxclusters should be, and is there any suitable guidance?



2013/9/19 Mikael <[email protected]>

> Karlis, is there any kind of flooding of TCP connections on your machine?
>
> (As in, you have a TCP server running that does not limit the number of
> concurrent connections and then you're flooded, or you are making huge
> amounts of TCP connections yourself?)
>
>
> 2013/9/18 Kārlis Miķelsons <[email protected]>
>
>> Hello again,
>>
>>
>>  Should I just increase kern.maxclusters and see if the problem goes
>>> away or are developers interested in me doing some other tests? How much
>>> should I increase maxclusters?
>>>
>> Increasing kern.maxclusters to 18432 didn't fix the problem; the system
>> hung again after 2 weeks of uptime.
>>
>>  "none of TCP services respond" - please expand on this: if you try and
>>> connect to a listening port, does it totally fail to respond, i.e.:
>>>
>>> $ telnet $somehost 25
>>> Trying $somehost...
>>> << big pause >>
>>> telnet: connect to address $somehost: Connection timed out
>>>
>>> Or, does it connect but you get no connection banner / response, i.e.
>>>
>>> $ telnet $somehost 25
>>> Trying $somehost...
>>> Connected to $somehost.
>>> Escape character is '^]'.
>>> << just sits there >>
>>>
>>> (Look at a couple of different ports and see if there's any difference -
>>> some daemons fork a new process to answer a request, some don't).
>>>
>> Scanning hostname.domain.lv (XX.YY.ZZ.157) [1000 ports]
>> Discovered open port 587/tcp on XX.YY.ZZ.157
>> Discovered open port 53/tcp on XX.YY.ZZ.157
>> Discovered open port 143/tcp on XX.YY.ZZ.157
>> Discovered open port 995/tcp on XX.YY.ZZ.157
>> Discovered open port 993/tcp on XX.YY.ZZ.157
>> Discovered open port 22/tcp on XX.YY.ZZ.157
>> Discovered open port 443/tcp on XX.YY.ZZ.157
>> Discovered open port 465/tcp on XX.YY.ZZ.157
>> Completed Connect Scan at 09:57, 4.73s elapsed (1000 total ports)
>> Nmap scan report for hostname.domain.lv (XX.YY.ZZ.157)
>> Host is up (0.0058s latency).
>> Scanned at 2013-08-31 09:56:54 EEST for 6s
>> Not shown: 992 filtered ports
>> PORT    STATE SERVICE
>> 22/tcp  open  ssh
>> 53/tcp  open  domain
>> 143/tcp open  imap
>> 443/tcp open  https
>> 465/tcp open  smtps
>> 587/tcp open  submission
>> 993/tcp open  imaps
>> 995/tcp open  pop3s
>>
>> Read data files from: /usr/local/share/nmap
>> Nmap done: 1 IP address (1 host up) scanned in 6.27 seconds
>>
>> $ date; telnet XX.YY.ZZ.157 143; date
>> Sat Aug 31 09:59:57 EEST 2013
>> Trying XX.YY.ZZ.157...
>> Connected to XX.YY.ZZ.157.
>>
>> Escape character is '^]'.
>>
>> $ date; ssh -v hostname; date
>> Sat Aug 31 09:57:53 EEST 2013
>> OpenSSH_6.2, OpenSSL 1.0.1c 10 May 2012
>> debug1: Reading configuration data /home/username/.ssh/config
>> debug1: /home/username/.ssh/config line 72: Applying options for hostname
>> debug1: Reading configuration data /etc/ssh/ssh_config
>> debug1: Connecting to hostname.domain.lv [XX.YY.ZZ.157] port 22.
>> debug1: Connection established.
>> debug1: identity file /home/username/.ssh/t1 type 1
>> debug1: identity file /home/username/.ssh/t1-cert type -1
>> debug1: Enabling compatibility mode for protocol 2.0
>> debug1: Local version string SSH-2.0-OpenSSH_6.2
>> ssh_exchange_identification: read: Connection timed out
>> Sat Aug 31 14:09:09 EEST 2013
>>
>> $ host www.domain.lv XX.YY.ZZ.157
>> ;; connection timed out; no servers could be reached
>>
>> $ date; telnet XX.YY.ZZ.157 143; date
>> Sat Aug 31 09:59:57 EEST 2013
>> Trying XX.YY.ZZ.157...
>> Connected to XX.YY.ZZ.157.
>>
>> Escape character is '^]'.
>> ^C^]
>> telnet> Connection closed.
>> Sat Aug 31 14:34:18 EEST 2013
>>
>> --- hostname.domain.lv ping statistics ---
>> 62900 packets transmitted, 62897 packets received, 0.0% packet loss
>> round-trip min/avg/max/std-dev = 0.487/0.830/110.194/1.332 ms
>>
>>
>> --
>> Karlis
