Matthew Hagerty wrote:
John Baldwin wrote:
On Thursday 13 April 2006 14:17, Matthew Hagerty wrote:
Greetings,
I'm running 6.0-RELEASE-p5 on a Toshiba-built server: a dual-Xeon
Intel motherboard with an LSI Logic MegaRAID (amr0) controller. This
machine has been running for about 2 years now, and was very stable
until I updated from 5.3 to 5.4, and now 6.0. The crashing seems to
be totally random and I have had it crash in as little as 12 hours
and as long as 143 days.
When the box goes down it does so in a strange way. First, it still
responds to network probes like ping (usually), however, all console
access is ignored. Also, some network ports still respond, like a
telnet to port 22 to test SSH will yield an SSH banner, but trying
to connect with SSH just hangs. Sometimes this is also true of the
SMTP server, but not always. This also makes it impossible for me
to use CARP to swap to the recently purchased spare machine: the
network interface is generally still responding, so CARP does not
detect a problem.
My biggest problem with this is that there are *never* any console
messages or log entries in any logs, no warnings about disk failure,
buffer exhaustion, system failures, etc. The machine simply seems
to stop responding and the only way to correct the problem is a hard
reboot.
A strange thing did happen yesterday, though: I believe I caught the
box on the verge of failure. I was SSH'd in and did a ps to check
things out. There were about 100 of these entries:
55050 ?? D 0:00.00 postmaster: ipa ipa ::1(63061) startup (postgres)
The box runs a web-based app and connects to a local Postgres DB
which seemed to be unable to start new connections being requested
by the PHP scripts. At any rate, I stopped Apache and then tried to
stop Postgres which resulted in (or just happened to coincide with)
the box locking up and no longer responding to my SSH commands or
attempts to reconnect with SSH. I hardly think this is a Postgres
problem, but even if it was, a userland app should *not* be able to
bring down a box...
Can anyone shed some light on this, or give me some options to try?
What happened to kernel panics and such when there were serious
errors going on? The only glimmer of information I have is that
*one* time there was an error on the console about there not being
any RAID controller available. I did purchase a spare controller
and I'm about to swap it out and see if it helps, but for some
reason I doubt it. If a controller like that was failing, I would
certainly hope to see some serious error messages or panics going on.
I have been running FreeBSD since version 1.01 and have never had a
box so unstable in the last 12 or so years, especially one that is
supposed to be "server" quality instead of the make-shift ones I put
together with desktop hardware. And last, I'm getting sick of my
Linux admin friends telling me "told you so! should have run
Linux...", please give me something to stick in their pie holes!
It sounds like a livelock (or deadlock) more than a crash. Can you add
'DDB' in your kernel config and break into the debugger when it hangs
and grab the output of 'ps'?
I can probably figure out how to compile in DDB (I've never done it
before though), but just two questions:
Add
options DDB
to your kernel config file.
1. How do I break into DDB and grab the ps output?
On the console, hit the <CTRL><ALT><ESC> keys (at once).
That should put you into the debugger.
Then 'ps' will give you some output.
It's a lot to write down but I've found a camera phone makes good enough
snapshots :-)
Alternatively, you can use a serial console, but getting into the
debugger is harder:
you have to have compiled ALT_BREAK_TO_DEBUGGER
into your kernel by adding
# Solaris implements a new BREAK which is initiated by a character
# sequence CR ~ ^b which is similar to a familiar pattern used on
# Sun servers by the Remote Console.
options ALT_BREAK_TO_DEBUGGER
to the kernel config file you are using.
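Putting the pieces together, the debugger-related part of the kernel
config might look like the sketch below. The KDB line is my addition and
an assumption about your setup: on 5.x and 6.x, DDB is a backend of the
KDB framework, so it normally needs to be enabled alongside DDB. Check it
against the NOTES file shipped with your source tree before relying on it.

```
# Kernel debugger framework (assumed to be required for DDB on 5.x/6.x)
options 	KDB
# Interactive in-kernel debugger
options 	DDB
# Allow the serial-console sequence CR ~ ^b to enter the debugger
options 	ALT_BREAK_TO_DEBUGGER
```

After editing the config, the kernel has to be rebuilt and installed and
the machine rebooted before any of these options take effect.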
At the boot prompt (where the 10-second delay is),
type
set console="comconsole"
(from memory)
to make the serial port the console.
Then you can do console work from another window or machine and capture
the output easily.
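Since the hang is unpredictable, you probably do not want to type that
at every boot. The same loader variable can be set persistently in
/boot/loader.conf. This is a minimal sketch, assuming the first serial
port at the default 9600 baud, 8N1:

```
# /boot/loader.conf
# Use the serial port as the system console on every boot
console="comconsole"
```

On the other end of the cable, any terminal program that can log its
session will do for capturing the 'ps' output. The device name to open
depends on the capturing machine, so I have left it out.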
2. How can I login if the box is not responding to SSH or the
console? It was only by sheer luck that I caught it yesterday just
before the lockup, I have never been able to do that before.
Thanks,
Matthew
_______________________________________________
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to
"[EMAIL PROTECTED]"