OK, here is some new data that I think rules out any issues with the applications. Following Alfred's suggestion, I wrote a script that runs every second and logs some system statistics:

date
netstat -m
vmstat -i
ps -axl
pstat -T
vmstat -z
sysctl -a
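
For anyone who wants to reproduce this, a minimal sketch of such a logger could look like the following. The command list is the one above; the function names, the STAT_CMDS variable and the log path are only illustrative and are not taken from the actual script.

```sh
#!/bin/sh
# Commands to run for each snapshot, one per line (the list from this post).
# STAT_CMDS is an illustrative name; it can be overridden from the environment.
STAT_CMDS="${STAT_CMDS:-date
netstat -m
vmstat -i
ps -axl
pstat -T
vmstat -z
sysctl -a}"

# Print one snapshot: a header line per command followed by its output.
# A failing or missing command is noted in the output and does not stop the run.
snapshot() {
    echo "=== snapshot begin"
    echo "$STAT_CMDS" | while IFS= read -r cmd; do
        [ -n "$cmd" ] || continue
        echo "--- $cmd"
        # stdin is redirected so a command cannot swallow the rest of the list
        $cmd </dev/null 2>&1 || echo "(failed: $cmd)"
    done
    echo "=== snapshot end"
}

# Append a snapshot to the given log file, then sleep one second, forever.
# The real period is one second plus the time the commands take to run.
run_logger() {
    log="$1"
    while :; do
        snapshot >> "$log"
        sleep 1
    done
}
```

Calling something like "run_logger /var/tmp/sysstat.log" (hypothetical path) from a root shell would then start collecting data until the process is killed.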

The problem hit us again several times today, and on investigating the log I found that the increase in mbuf usage happened in a single step, going from the normal 10% to 100% between two script runs. What is more interesting is that those two consecutive runs were about 2 minutes apart (instead of 1 second, as they should be), and when inspecting the cron logs I noticed the same time gap there. I ruled out VM starvation as the cause of the delay because the system has plenty of free memory. The incoming network traffic was not sufficient to starve the VM that quickly either: it was about 7MB/sec at the time, so even if all receivers had stopped draining their buffers, it should have taken at least 1-2 seconds to fill up the mbuf cache and create demand for additional kernel memory. The failure would likely have been more gradual, and I should have seen it build up in the debug log.
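
As a rough illustration of how such gaps can be spotted in a long log, here is a small sketch. It assumes the snapshot timestamps have already been extracted as epoch seconds, one per line; my script used plain date, so that would need date +%s or a conversion step first. The find_gaps name and the threshold argument are made up for this example.

```sh
#!/bin/sh
# Read epoch timestamps (one per line) on stdin and report every pair of
# consecutive timestamps that are more than $1 seconds apart.
find_gaps() {
    awk -v max="$1" '
        NR > 1 && $1 - prev > max {
            printf "gap of %d s between %d and %d\n", $1 - prev, prev, $1
        }
        { prev = $1 }
    '
}
```

For example, feeding it the timestamps 100, 101, 102, 222, 223 with a threshold of 5 seconds reports a single gap of 120 s between 102 and 222, which is the kind of jump I saw between two script runs.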

So it looks like a kernel issue of some sort, which causes all userland activity to cease for 2 minutes when the system reaches a certain load. The mbuf build-up is only a by-product of this, not really the cause. igb(4) is the primary suspect now, since we have other machines under more load that do not have this problem, and no other machine of ours uses this driver. The chip is the following:

i...@pci0:5:0:0: class=0x020000 card=0x323f103c chip=0x10c98086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    class      = network
    subclass   = ethernet
i...@pci0:5:0:1: class=0x020000 card=0x323f103c chip=0x10c98086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    class      = network
    subclass   = ethernet

The hardware in question is a new HP DL160G6. I have also checked the IPMI logs and sensors and have not found any issues there either. No sensors reported out-of-range values, and the chassis temperature is within normal limits.

I am not sure how to debug this problem further. We are now looking into installing an external non-igb card in the server to see if that solves the issue.

I have the whole log if anyone wants to take a closer peek.

Regards,
--
Maksym Sobolyev
Sippy Software, Inc.
Internet Telephony (VoIP) Experts
T/F: +1-646-651-1110
Web: http://www.sippysoft.com
MSN: sa...@sippysoft.com
Skype: SippySoft
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
