----- Original Message ----- From: "Jeremy Chadwick" <free...@jdc.parodius.com>

On Thu, Aug 11, 2011 at 09:59:36AM +0100, Steven Hartland wrote:
That's not the issue, as it's happening across the board on over 130 machines :(

Agreed, bad hardware sounds unlikely here.  I could believe some strange
incompatibility (e.g. BIOS quirk or the like[1]) that might cause problems
en masse across many servers, but hardware issues are unlikely in this
situation.

It's affecting a range of hardware, from Supermicro blades / 2Us to
Dell blades. So it seems more like a software bug.

[1]: I mention this because we had something similar happen at my
workplace.  For months we used a specific model of system from our
vendor which worked reliably, zero issues.  Then we got a new shipment
of boxes (same model as prior) which started acting very odd (often AHCI
timeout issues or MCEs which when decoded would usually turn out to be
nonsensical).  It took weeks to determine the cause given how slow the
vendor was to respond: root cause turned out to be that the vendor
decided, on a whim, to start shipping a newer BIOS version which wasn't
"as compatible" with Solaris as previous BIOSes.  Downgrading all the
systems to the older BIOS fixed the problem.

The machines have been working fine for months; the panics only started
last week.

We've been looking at the changes made last week to see if we can identify
the cause. The only change made in that time frame was the rollout
of the change to kern.ipc.nmbclusters to work around the TCP reassembly
issue.

In this case we raised the value from the default of 25600 to 262144.

We've used this value for a long time on our core webservers, which are
also running 8.2, so I'd be very surprised if this was the cause. That said,
we're looking to roll out kern.ipc.nmbclusters=51200 to try and rule it
out.
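
As a rough sanity check on what these values mean for memory, the sketch
below converts each kern.ipc.nmbclusters setting mentioned above into the
largest cluster pool it would permit. It assumes the standard 2048-byte
mbuf cluster (MCLBYTES); that figure is not stated in this thread, and the
function name is just illustrative. The sysctl is a ceiling, so these are
upper bounds rather than memory actually in use.

```python
# Upper bound on memory the mbuf cluster pool could reach for a given
# kern.ipc.nmbclusters value.
MCLBYTES = 2048  # assumption: standard cluster size, not taken from the thread


def cluster_pool_mib(nmbclusters, cluster_bytes=MCLBYTES):
    """Return the maximum size of the cluster pool in MiB."""
    return nmbclusters * cluster_bytes / (1024 * 1024)


if __name__ == "__main__":
    # The default, the planned test value, and the rolled-out value.
    for value in (25600, 51200, 262144):
        print(f"kern.ipc.nmbclusters={value}: "
              f"up to {cluster_pool_mib(value):.0f} MiB")
```

Under that assumption the default allows up to 50 MiB of clusters, 51200
allows 100 MiB, and 262144 allows 512 MiB.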

Prior to this, 1-2 weeks previously, we rolled out a significant update which
included:-
1. Adding IPv6 to the kernel (although no machines are configured with it yet)
2. Adding the ipmi module to the kernel, although it is not loaded.
3. Rebuilding ALL ports to the latest version
4. Restructuring the server layout to be one jail per java server (~60
servers per machine)
5. Restructuring the filesystem to be a base nullfs mount + devfs +
zfs volume per server

This update had been in testing for 2 weeks prior to that, so in total 3-4
weeks passed before any panics were seen, but that doesn't mean the issue
didn't exist at that time.

Currently we're seeing 1-4 panics a day across all machines.

So currently the most likely suspects are:-
1. kern.ipc.nmbclusters
2. nullfs
3. ipv6
4. a package update, most likely being openjdk6-b23
5. jail
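
With only 1-4 panics a day spread over the whole fleet, a quiet period
after reverting one suspect is weak evidence unless it lasts long enough.
The sketch below estimates how many panic-free days a trial group needs
before "no panics" is unlikely to be luck. It assumes panics are
independent and evenly spread across machines (a Poisson model), takes the
fleet as 130 machines, and uses made-up function names and a 5% threshold;
none of that comes from the thread.

```python
import math


def p_zero_panics(fleet_panics_per_day, fleet_size, cohort_size, days):
    """Chance a cohort sees no panics if the change made no difference.

    Assumes panics are independent and evenly spread over machines.
    """
    expected = fleet_panics_per_day / fleet_size * cohort_size * days
    return math.exp(-expected)


def days_needed(fleet_panics_per_day, fleet_size, cohort_size, max_p=0.05):
    """Whole days of silence needed before luck is below max_p."""
    expected_per_day = fleet_panics_per_day / fleet_size * cohort_size
    return math.ceil(-math.log(max_p) / expected_per_day)


if __name__ == "__main__":
    fleet = 130  # assumption: "over 130 machines" rounded down
    for rate in (1, 4):            # panics per day across the fleet
        for cohort in (26, fleet):  # a fifth of the fleet, or all of it
            print(f"{rate}/day, {cohort} machines: "
                  f"{days_needed(rate, fleet, cohort)} quiet days")
```

Under these assumptions, at the low end of 1 panic a day a fleet-wide
change needs about 3 quiet days, while a trial on a fifth of the machines
needs about 15, which argues for testing one suspect at a time on as many
machines as can be spared.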

In Steve's case this is unlikely to be the situation, but I thought I'd
share the story anyway.  "SKU ABCXYZ-1" from August 2009 is not
necessarily the same thing as "SKU ABCXYZ-1" from May 2010.  ;-)  This
is also why I prefer to buy/build my own systems, since I cannot trust
vendors to not mess about with settings w/out changing SKUs, P/Ns, or
revision numbers.

This caused us much scratching of heads when looking for that TCP issue
the other day, as it seemed to be affecting the newer machines more than
the old. We even found two machines with the same "version" of the BIOS
that were clearly different builds, as the date and available options
were different. Quite frustrating!

   Regards
   Steve


_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
