On 3/24/2011 11:03 AM, Windsor Dave L. (AdP/TEF7.1) wrote:
> Hello Everyone,
>
> I recently installed CentOS 5.5 x86_64 on a brand new ProLiant DL380 G7. I
> have identical OS software running rock-solid on two other DL380 ProLiant
> servers, but they are G6 models, not G7. On the G7, the installation went
> perfectly and the machine ran great for about 2 weeks, when it just seemed to
> "stop". The system stopped responding on the network, and there was no video
> on the console (or remote console via iLO). It would not reboot or cold boot
> through iLO; I actually had to hold the power button to turn it off and then
> press it again to power up.
>
> This happened several times within a few days of each other. Each time,
> there was no evidence in any logs of a problem - the system just seemed to
> stop or lock up. We did have a CPU problem light appear on the front, so HP
> came in and replaced the one 4-core CPU. Since then, it has run as long as
> two weeks, but still crashes randomly. After the last reboot, I left the
> console in text mode on vt1, and when it crashed again this morning this was
> displayed on the screen:
>
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffff8100dc435cf0 CR3: 000000008a6ca000 CR4: 00000000000006e0
> Process smbd (pid: 18970, threadinfo ffff81001529e000, task ffff81011f5347a0)
<snipped>
> <0>Kernel panic - not syncing: Fatal exception
OK everyone, here is an update:
The server crashed again overnight. This time, the following error
messages were on the console:
HARDWARE ERROR
CPU 3: Machine Check Exception: 4 Bank 5:
ba00000000400405
TSC 5172b45d44f0a MISC 80
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
HARDWARE ERROR
CPU 7: Machine Check Exception: 4 Bank 5:
ba00000000400405
TSC 5172b45d45bba MISC 80
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
HARDWARE ERROR
CPU 5: Machine Check Exception: 4 Bank 8:
0000000000000000
TSC 0
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Uncorrected machine check
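For anyone who wants to tally these records rather than read them by eye, here is a minimal Python sketch that pulls the CPU, bank, and status out of console text laid out like the above. The names MCE_RE, SAMPLE, and parse_mce_records are made up for illustration. Treating the number after "Machine Check Exception:" as the MCG status is an assumption on my part, based on mcelog reporting "MCGSTATUS 4" for the same records.

```python
import re

# Matches records laid out like the console output above; \s+ also spans
# the line break between "Bank N:" and the 16-digit status value.
MCE_RE = re.compile(
    r"CPU\s+(\d+):\s+Machine Check Exception:\s+([0-9a-fA-F]+)"
    r"\s+Bank\s+(\d+):\s+([0-9a-fA-F]{16})"
)

# Sample text copied from the console messages in this post.
SAMPLE = """\
CPU 3: Machine Check Exception: 4 Bank 5:
ba00000000400405
TSC 5172b45d44f0a MISC 80
CPU 7: Machine Check Exception: 4 Bank 5:
ba00000000400405
TSC 5172b45d45bba MISC 80
CPU 5: Machine Check Exception: 4 Bank 8:
0000000000000000
TSC 0
"""


def parse_mce_records(text):
    """Return one dict per machine check record found in the text."""
    return [
        {
            "cpu": int(cpu),
            "mcg": int(mcg, 16),  # assumed to be the MCG status value
            "bank": int(bank),
            "status": int(status, 16),
        }
        for cpu, mcg, bank, status in MCE_RE.findall(text)
    ]


records = parse_mce_records(SAMPLE)
for rec in records:
    print(f"CPU {rec['cpu']} bank {rec['bank']} status {rec['status']:016x}")
```

This only extracts the fields; it does not replace mcelog for decoding them.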
After reboot, running the first error through mcelog --ascii gives:
CPU 3: Machine Check Exception: 4 Bank 5:
ba00000000400405
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
mcelog: Unknown Intel CPU type family 6 model 2c
CPU 3 BANK 5 MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
Processor context corrupt
MCA: Internal unclassified error: 405
STATUS ba00000000400405 MCGSTATUS 4
The second error gives:
CPU 7: Machine Check Exception: 4 Bank 5: ba00000000400405
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
mcelog: Unknown Intel CPU type family 6 model 2c
CPU 7 BANK 5 MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
Processor context corrupt
MCA: Internal unclassified error: 405
STATUS ba00000000400405 MCGSTATUS 4
And the third gives:
CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
mcelog: Unknown Intel CPU type family 6 model 2c
CPU 3 BANK 5 MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
Processor context corrupt
MCA: Internal unclassified error: 405
STATUS ba00000000400405 MCGSTATUS 4
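The flags mcelog prints can also be checked by hand against the STATUS value. Below is a minimal Python sketch, assuming the architectural IA32_MCi_STATUS and IA32_MCG_STATUS bit layouts from the Intel Software Developer's Manual. The function and table names are made up for illustration, and this is not how mcelog itself is implemented.

```python
# Bit positions assumed from the architectural IA32_MCi_STATUS layout
# described in the Intel SDM; only the flags relevant here are listed.
STATUS_FLAGS = [
    (63, "VAL"),    # register contents valid
    (62, "OVER"),   # error overflow
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

# Low bits of IA32_MCG_STATUS (same assumption as above).
MCG_FLAGS = [(0, "RIPV"), (1, "EIPV"), (2, "MCIP")]


def decode_mci_status(status):
    """Split a 64-bit bank status value into flags and error codes."""
    return {
        "flags": [name for bit, name in STATUS_FLAGS if (status >> bit) & 1],
        "mca_code": status & 0xFFFF,                     # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


def decode_mcg_status(mcg):
    """Return the names of the flags set in the global status value."""
    return [name for bit, name in MCG_FLAGS if (mcg >> bit) & 1]


result = decode_mci_status(0xBA00000000400405)
print(result["flags"])          # flag names set in the status value
print(hex(result["mca_code"]))  # low 16 bits, the MCA error code
print(decode_mcg_status(4))     # flags set in MCGSTATUS
```

For the value above this yields VAL, UC, EN, MISCV and PCC with MCA code 0x405, which lines up with the "Uncorrected error", "Error enabled", "MCi_MISC register valid", "Processor context corrupt" and "error: 405" lines in the mcelog output. MCGSTATUS 4 decodes to MCIP alone.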
I have been able to move all workloads onto other servers. As at least
two people suggested, I booted from the HP SmartStart CD and ran 100
loops of system diagnostics and tests, especially for the memory and
CPU. No problems were found. I think I will run memtest86 over the
weekend.
We have placed a hardware support call with HP.
Best Regards,
Dave Windsor
Robert Bosch LLC
Team Leader, MES Database Infrastructure Group (AdP/TEF7.1)
4421 Highway 81 North
Anderson, SC 29621 USA
www.bosch.us
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos