Hi Karlis,

Your symptom description matches exactly what I experienced.

How frequent is this?

Yes, this is an incredible error indeed.

The best hypothesis I have heard so far from other users regarding its
source is an mbuf leak in the device driver. However, my machine runs a BGE,
which is completely common, so there should be no error there. Also, you
report experiencing the same thing and you're running an Intel PRO/1000MT
(82541GI), so this should indicate that it's not the device driver.


My spontaneous reflection is that the logical place to start tracking this
down would be to understand *which OS-global queues or buffers exist in the
code path between the TCP stack receiving an incoming connection and the
target process's select()/accept() call picking it up, and which of them,
when full, would stop the pickup from happening*.

With this knowledge, the contents of those structures could be inspected
while in the "hung up" state, and the place causing the error tracked down
from there.

Of course it could be something other than a queue or buffer too, though I'd
guess the code is so well audited by now that that would be highly
implausible.



What does tcpdump show for you when doing telnet localhost 22 etc. when in
the "hung up" mode?

For if/when I experience the "hangup" again, I've prepared these commands
to run:

dmesg
netstat
netstat -an
netstat -m
netstat -p tcp -ss
fstat -n
pfctl -si

netstat -p tcp -s >1.txt
ssh localhost
netstat -p tcp -s >2.txt
diff -u 1.txt 2.txt

vmstat -m
systat -b mbufs
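
The before/after comparison of 'netstat -p tcp -s' could also be reduced to
just the counters that moved. Below is a minimal sketch, assuming both
snapshots list the same counters in the same order, one per line (this
should hold for a single -s, since only a repeated -s suppresses zero
counters). The function name counter_delta is made up for illustration.

```sh
# Print the counters whose leading number changed between two
# `netstat -s` snapshots, as "<delta><TAB><description>".
# Assumption: both files have the same lines in the same order.
counter_delta() {
    awk '
        # First file: remember the leading field of every line.
        NR == FNR { before[FNR] = $1; next }
        # Second file: report lines whose leading integer differs.
        $1 ~ /^[0-9]+$/ && before[FNR] ~ /^[0-9]+$/ && $1 != before[FNR] {
            delta = $1 - before[FNR]
            $1 = ""                 # drop the count, keep the description
            sub(/^ +/, "")
            printf "%+d\t%s\n", delta, $0
        }
    ' "$1" "$2"
}
```

It would be called as 'counter_delta 1.txt 2.txt' after the ssh attempt. If
nothing at all is printed while the connection attempt hangs, that in itself
would be worth noting.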


and also doing 'tcpdump -nni lo0 port 22' and then doing "telnet localhost
22" .

Feel free to do the same, and we can compare outputs, looking for any
common denominator.
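
To make the outputs easier to compare between our machines, the commands
above could be wrapped so that each one lands in its own file. This is only
a sketch: the function names snap and capture_state and the output file
names are invented here, and the command list is simply the one from this
mail.

```sh
# Run one diagnostic command and save its stdout and stderr in
# $outdir/$name.txt. A failing or missing command is recorded in the
# file instead of aborting the whole capture.
snap() {
    outdir=$1; name=$2; shift 2
    {
        echo "# $*"
        "$@" 2>&1 || echo "# exit status: $?"
    } > "$outdir/$name.txt"
}

# Capture everything into one directory (hypothetical file names).
capture_state() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    snap "$outdir" dmesg        dmesg
    snap "$outdir" netstat      netstat
    snap "$outdir" netstat-an   netstat -an
    snap "$outdir" netstat-m    netstat -m
    snap "$outdir" netstat-tcp  netstat -p tcp -ss
    snap "$outdir" fstat        fstat -n
    snap "$outdir" pfctl-si     pfctl -si
    snap "$outdir" vmstat-m     vmstat -m
    snap "$outdir" systat-mbufs systat -b mbufs
}
```

Run as root from the console during a "hangup", for example
'capture_state /root/hang-$(date +%Y%m%d-%H%M%S)', and once more when the
machine is healthy, so that 'diff -ru' on the two directories shows what
changed.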

Best regards,
Mikael



2013/8/13 Kārlis Miķelsons <[email protected]>

> Hello,
>
> I'm experiencing something similar to this bug on one of our servers:
>
> http://marc.info/?t=137321093800002
>
> Server seems to "hang up" from time to time after updating to 5.2 (currently
> running 5.3-stable), it was working just fine until upgrade.
>
> Now let me explain "hanging up". When it "hangs up", it responds to ICMP
> echo requests but none of TCP services respond (it is running sshd, Apache
> httpd, OpenBSD spamd, Postfix). Unfortunately I don't have any UDP services
> to try, but I've set up Bind so that I can check it next time it hangs up.
>
> I've got symon on that server too, and it reports sensor, memory, mbuf,
> pf, cpu, if, io and load statistics to remote location (by sending outgoing
> UDP packets to monitoring server). That also works fine.
>
> I usually ask datacenter guys to restart server whenever it "hangs up",
> but this time it "hung up" at night, and it was running fine after 5
> hours of downtime. It didn't restart, everything just started to work again
> by itself. You can see symon graphs here:
>   http://bayimg.com/fAOaHaAef
> (don't pay attention to short downtime around 17:00, it was scheduled
> downtime due to hard drive change)
>
> From 4:00 till almost 9:00 it was unreachable, at the same time there is
> big increase in mt_data mbufs (from around 100 to 800). At the same time
> none of TCP services would respond, server answered to echo ping requests
> just fine (verified from other monitoring servers running smokeping, nagios
> and third party web monitoring services).
>
> Nothing in /var/log/messages, /var/log/daemon or dmesg.
>
> I've tried increasing PF state limit to 25000 before, but that didn't help.
>
> What could cause sudden mt_data mbuf increase?
>
> netstat -m output (when everything is working fine):
> 306 mbufs in use:
>         26 mbufs allocated to data
>         219 mbufs allocated to packet headers
>         61 mbufs allocated to socket names and addresses
> 22/532/6144 mbuf 2048 byte clusters in use (current/peak/max)
> 0/8/6144 mbuf 4096 byte clusters in use (current/peak/max)
> 0/8/6144 mbuf 8192 byte clusters in use (current/peak/max)
> 0/8/6144 mbuf 9216 byte clusters in use (current/peak/max)
> 0/8/6144 mbuf 12288 byte clusters in use (current/peak/max)
> 0/8/6144 mbuf 16384 byte clusters in use (current/peak/max)
> 0/8/6144 mbuf 65536 byte clusters in use (current/peak/max)
> 1716 Kbytes allocated to network (7% in use)
>
> 0 requests for memory denied
> 0 requests for memory delayed
> 0 calls to protocol drain routines
>
> dmesg (standard kernel, same problem with 5.3 and 5.2):
>
> OpenBSD 5.3-stable (GENERIC.MP) #0: Thu Jun 27 18:40:25 EEST 2013
>
> [email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> RTC BIOS diagnostic error b<fixed_disk>
> real mem = 4289646592 (4090MB)
> avail mem = 4152963072 (3960MB)
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 2.5 @ 0xdfcfa000 (64 entries)
> bios0: vendor Intel Corporation version
> "S3200X38.86B.00.00.0052.112920101508" date 11/29/2010
> bios0: Intel Corporation S3210SH
> acpi0 at bios0: rev 2
> acpi0: sleep states S0 S1 S4 S5
> acpi0: tables DSDT SLIC FACP APIC WDDT MCFG HPET SPCR SSDT SSDT SSDT SSDT
> SSDT HEST BERT ERST EINJ DMAR
> acpi0: wakeup devices SLPB(S5) NPE1(S5) NPE6(S5) P32_(S5) PS2M(S1)
> PS2K(S1) ILAN(S5) PEX0(S5) PEX1(S5) PEX2(S5) PEX3(S5) PEX4(S5) PEX5(S5)
> UHC1(S1) UHC2(S1) UHC3(S1) UHC4(S1) EHCI(S1) EHC2(S1) UH42(S1) UHC5(S1)
> UHC6(S1) AZAL(S4)
> acpitimer0 at acpi0: 3579545 Hz, 24 bits
> acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat
> cpu0 at mainbus0: apid 0 (boot processor)
> cpu0: Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz, 3159.18 MHz
> cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,
> CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,
> PBE,SSE3,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,
> xTPR,PDCM,SSE4.1,XSAVE,NXE,LONG,LAHF,PERF
> cpu0: 6MB 64b/line 16-way L2 cache
> cpu0: apic clock running at 332MHz
> cpu1 at mainbus0: apid 1 (application processor)
> cpu1: Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz, 3158.75 MHz
> cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,
> CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,
> PBE,SSE3,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,
> xTPR,PDCM,SSE4.1,XSAVE,NXE,LONG,LAHF,PERF
> cpu1: 6MB 64b/line 16-way L2 cache
> ioapic0 at mainbus0: apid 5 pa 0xfec00000, version 20, 24 pins
> ioapic0: misconfigured as apic 0, remapped to apid 5
> acpimcfg0 at acpi0 addr 0xf0000000, bus 0-63
> acpihpet0 at acpi0: 14318179 Hz
> acpiprt0 at acpi0: bus 0 (PCI0)
> acpiprt1 at acpi0: bus 1 (NPE1)
> acpiprt2 at acpi0: bus -1 (NPE6)
> acpiprt3 at acpi0: bus 4 (P32_)
> acpiprt4 at acpi0: bus 2 (PEX0)
> acpiprt5 at acpi0: bus -1 (PEX1)
> acpiprt6 at acpi0: bus -1 (PEX2)
> acpiprt7 at acpi0: bus -1 (PEX3)
> acpiprt8 at acpi0: bus 3 (PEX4)
> acpiprt9 at acpi0: bus -1 (PEX5)
> acpicpu0 at acpi0: PSS
> acpicpu1 at acpi0: PSS
> acpibtn0 at acpi0: SLPB
> acpibtn1 at acpi0: PWRB
> ipmi at mainbus0 not configured
> cpu0: Enhanced SpeedStep 3159 MHz: speeds: 3166, 2000 MHz
> pci0 at mainbus0 bus 0
> pchb0 at pci0 dev 0 function 0 "Intel 3200/3210 Host" rev 0x00
> ppb0 at pci0 dev 1 function 0 "Intel 3200/3210 PCIE" rev 0x00: msi
> pci1 at ppb0 bus 1
> mfi0 at pci1 dev 0 function 0 "Symbios Logic SAS1078" rev 0x04: apic 5 int
> 16
> mfi0: "Intel(R) RAID Controller SRCSATAWB", firmware 8.0.1-0036, 128MB
> cache
> scsibus0 at mfi0: 64 targets
> sd0 at scsibus0 targ 0 lun 0: <INTEL, SRCSATAWB, 1.12> SCSI3 0/direct
> fixed naa.600605b000ccf0b016fef71c29d83493
> sd0: 304222MB, 512 bytes/sector, 623046656 sectors
> em0 at pci0 dev 25 function 0 "Intel ICH9 IGP AMT" rev 0x02: msi, address
> 00:15:17:27:ea:98
> uhci0 at pci0 dev 26 function 0 "Intel 82801I USB" rev 0x02: apic 5 int 18
> uhci1 at pci0 dev 26 function 1 "Intel 82801I USB" rev 0x02: apic 5 int 21
> ehci0 at pci0 dev 26 function 7 "Intel 82801I USB" rev 0x02: apic 5 int 17
> usb0 at ehci0: USB revision 2.0
> uhub0 at usb0 "Intel EHCI root hub" rev 2.00/1.00 addr 1
> ppb1 at pci0 dev 28 function 0 "Intel 82801I PCIE" rev 0x02: msi
> pci2 at ppb1 bus 2
> ppb2 at pci0 dev 28 function 4 "Intel 82801I PCIE" rev 0x02: msi
> pci3 at ppb2 bus 3
> vga1 at pci3 dev 0 function 0 "Matrox MGA G200e (ServerEngines)" rev 0x02
> wsdisplay0 at vga1 mux 1: console (80x25, vt100 emulation)
> wsdisplay0: screen 1-5 added (80x25, vt100 emulation)
> uhci2 at pci0 dev 29 function 0 "Intel 82801I USB" rev 0x02: apic 5 int 23
> uhci3 at pci0 dev 29 function 1 "Intel 82801I USB" rev 0x02: apic 5 int 19
> uhci4 at pci0 dev 29 function 2 "Intel 82801I USB" rev 0x02: apic 5 int 18
> ehci1 at pci0 dev 29 function 7 "Intel 82801I USB" rev 0x02: apic 5 int 23
> usb1 at ehci1: USB revision 2.0
> uhub1 at usb1 "Intel EHCI root hub" rev 2.00/1.00 addr 1
> ppb3 at pci0 dev 30 function 0 "Intel 82801BA Hub-to-PCI" rev 0x92
> pci4 at ppb3 bus 4
> em1 at pci4 dev 2 function 0 "Intel PRO/1000MT (82541GI)" rev 0x05: apic 5
> int 18, address 00:15:17:27:ea:96
> pcib0 at pci0 dev 31 function 0 "Intel 82801IR LPC" rev 0x02
> ahci0 at pci0 dev 31 function 2 "Intel 82801I AHCI" rev 0x02: msi, AHCI 1.2
> scsibus1 at ahci0: 32 targets
> sd1 at scsibus1 targ 0 lun 0: <ATA, WDC WD3202ABYS-0, 02.0> SCSI3 0/direct
> fixed naa.50014ee1013a29dc
> sd1: 305245MB, 512 bytes/sector, 625142448 sectors
> ichiic0 at pci0 dev 31 function 3 "Intel 82801I SMBus" rev 0x02: apic 5
> int 18
> iic0 at ichiic0
> spdmem0 at iic0 addr 0x50: 1GB DDR2 SDRAM ECC PC2-6400CL5
> spdmem1 at iic0 addr 0x51: 1GB DDR2 SDRAM ECC PC2-6400CL5
> spdmem2 at iic0 addr 0x52: 1GB DDR2 SDRAM ECC PC2-6400CL5
> spdmem3 at iic0 addr 0x53: 1GB DDR2 SDRAM ECC PC2-6400CL5
> usb2 at uhci0: USB revision 1.0
> uhub2 at usb2 "Intel UHCI root hub" rev 1.00/1.00 addr 1
> usb3 at uhci1: USB revision 1.0
> uhub3 at usb3 "Intel UHCI root hub" rev 1.00/1.00 addr 1
> usb4 at uhci2: USB revision 1.0
> uhub4 at usb4 "Intel UHCI root hub" rev 1.00/1.00 addr 1
> usb5 at uhci3: USB revision 1.0
> uhub5 at usb5 "Intel UHCI root hub" rev 1.00/1.00 addr 1
> usb6 at uhci4: USB revision 1.0
> uhub6 at usb6 "Intel UHCI root hub" rev 1.00/1.00 addr 1
> isa0 at pcib0
> isadma0 at isa0
> com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
> com1 at isa0 port 0x2f8/8 irq 3: ns16550a, 16 byte fifo
> pckbc0 at isa0 port 0x60/5
> pckbd0 at pckbc0 (kbd slot)
> pckbc0: using irq 1 for kbd slot
> wskbd0 at pckbd0: console keyboard, using wsdisplay0
> pcppi0 at isa0 port 0x61
> spkr0 at pcppi0
> fdc0 at isa0 port 0x3f0/6 irq 6 drq 2
> mtrr: Pentium Pro MTRR support
> vscsi0 at root
> scsibus2 at vscsi0: 256 targets
> softraid0 at root
> scsibus3 at softraid0: 256 targets
> root on sd0a (fdbb84f0ba31516d.a) swap on sd0b dump on sd0b
>
> Thanks,
> Karlis
