On 3/24/2011 11:03 AM, Windsor Dave L. (AdP/TEF7.1) wrote:
> Hello Everyone,
>
> I recently installed CentOS 5.5 x86_64 on a brand new ProLiant DL380 G7.  I 
> have identical OS software running rock-solid on two other DL380 ProLiant 
> servers, but they are G6 models, not G7.  On the G7, the installation went 
> perfectly and the machine ran great for about 2 weeks, when it just seemed to 
> "stop".  The system stopped responding on the network, and there was no video 
> on the console (or remote console via iLO).  It would not reboot or cold boot 
> through iLO; I actually had to hold the power button to turn it off and then 
> press it again to power up.
>
> This happened several times within a few days of each other.  Each time, 
> there was no evidence in any logs of a problem - the system just seemed to 
> stop or lock up.   We did have a CPU problem light appear on the front, so HP 
> came in and replaced the one 4-core CPU.  Since then, it has run as long as 
> two weeks, but still crashes randomly.  After the last reboot, I left the 
> console in text mode on vt1, and when it crashed again this morning this was 
> displayed on the screen:
>
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffff8100dc435cf0  CR3: 000000008a6ca000 CR4: 00000000000006e0
> Process smbd (pid: 18970, threadinfo ffff81001529e000, task ffff81011f5347a0)
<snipped>
>   <0>Kernel panic - not syncing: Fatal exception

OK everyone, here is an update:

The server crashed again overnight. This time, the following error 
messages were on the console:

     HARDWARE ERROR
     CPU 3: Machine Check Exception:                4 Bank 5: ba00000000400405
     TSC 5172b45d44f0a MISC 80
     This is not a software problem!
     Run through mcelog --ascii to decode and contact your hardware vendor

     HARDWARE ERROR
     CPU 7: Machine Check Exception:                4 Bank 5: ba00000000400405
     TSC 5172b45d45bba MISC 80
     This is not a software problem!
     Run through mcelog --ascii to decode and contact your hardware vendor

     HARDWARE ERROR
     CPU 5: Machine Check Exception:                4 Bank 8: 0000000000000000
     TSC 0
     This is not a software problem!
     Run through mcelog --ascii to decode and contact your hardware vendor
     Kernel panic - not syncing: Uncorrected machine check
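
For anyone who wants to pull the records out of a captured console dump before decoding them, here is a rough Python sketch. The helper name parse_mce_console and the sample text are made up for illustration, and I am assuming the number printed before "Bank" is the MCG status (it matches the "MCGSTATUS 4" that mcelog prints). The pattern tolerates the line wrap between "Bank N:" and the status value.

```python
import re

# Hypothetical helper, not part of mcelog: extracts the fields from the
# "Machine Check Exception" lines the kernel prints on the console.
# \s+ also matches newlines, so a status value wrapped onto the next
# line is still picked up.
MCE_RE = re.compile(
    r"CPU\s+(\d+):\s+Machine Check Exception:\s+([0-9a-fA-F]+)"
    r"\s+Bank\s+(\d+):\s+([0-9a-fA-F]{16})"
)


def parse_mce_console(text):
    """Return one dict per Machine Check Exception record found in text."""
    records = []
    for m in MCE_RE.finditer(text):
        records.append({
            "cpu": int(m.group(1)),
            # Assumption: the value before "Bank" is the MCG status register.
            "mcgstatus": int(m.group(2), 16),
            "bank": int(m.group(3)),
            "status": int(m.group(4), 16),
        })
    return records


# Sample input copied from the console output above, wrapped as it was captured.
sample = """\
CPU 3: Machine Check Exception:                4 Bank 5: 
ba00000000400405
CPU 5: Machine Check Exception:                4 Bank 8: 
0000000000000000
"""

for rec in parse_mce_console(sample):
    print("CPU %(cpu)d bank %(bank)d status %(status)016x" % rec)
```

This only collects the raw values; interpreting them is still a job for mcelog or the vendor.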

After reboot, running the first error through mcelog --ascii gives

     CPU 3: Machine Check Exception:                4 Bank 5: ba00000000400405
     HARDWARE ERROR. This is *NOT* a software problem!
     Please contact your hardware vendor
     mcelog: Unknown Intel CPU type family 6 model 2c

     CPU 3 BANK 5 MCG status:MCIP
     MCi status:
     Uncorrected error
     Error enabled
     MCi_MISC register valid
     Processor context corrupt
     MCA: Internal unclassified error: 405
     STATUS ba00000000400405 MCGSTATUS 4

The second error gives

     CPU 7: Machine Check Exception: 4 Bank 5: ba00000000400405
     HARDWARE ERROR. This is *NOT* a software problem!
     Please contact your hardware vendor
     mcelog: Unknown Intel CPU type family 6 model 2c

     CPU 7 BANK 5 MCG status:MCIP
     MCi status:
     Uncorrected error
     Error enabled
     MCi_MISC register valid
     Processor context corrupt
     MCA: Internal unclassified error: 405
     STATUS ba00000000400405 MCGSTATUS 4

And the third gives

     CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
     HARDWARE ERROR. This is *NOT* a software problem!
     Please contact your hardware vendor
     mcelog: Unknown Intel CPU type family 6 model 2c

     CPU 3 BANK 5 MCG status:MCIP
     MCi status:
     Uncorrected error
     Error enabled
     MCi_MISC register valid
     Processor context corrupt
     MCA: Internal unclassified error: 405
     STATUS ba00000000400405 MCGSTATUS 4
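
Since mcelog does not recognize this CPU model, the flags it reports can be cross-checked by decoding the STATUS value by hand. The sketch below is my own and assumes the architectural IA32_MCi_STATUS and IA32_MCG_STATUS bit layout as I understand it from the Intel SDM; the function and table names are invented for illustration, and the model-specific bits are left undecoded.

```python
# Assumed layout of the upper IA32_MCi_STATUS bits (architectural flags only).
MCI_STATUS_FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Assumed layout of the low IA32_MCG_STATUS bits.
MCG_STATUS_FLAGS = {
    0: "RIPV",
    1: "EIPV",
    2: "MCIP",
}


def set_flags(value, table):
    """List the names of the flags in table whose bit is set, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    return {
        "flags": set_flags(status, MCI_STATUS_FLAGS),
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


decoded = decode_mci_status(0xBA00000000400405)
for flag in decoded["flags"]:
    print(flag)
print("MCA error code: %x" % decoded["mca_error_code"])
print("MCG status flags:", set_flags(0x4, MCG_STATUS_FLAGS))
```

For ba00000000400405 this yields VAL, UC, EN, MISCV and PCC with MCA error code 405, and MCG status 4 decodes to MCIP, which agrees with what mcelog printed.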

I have been able to move all workloads onto other servers.  As at least 
two people suggested, I booted from the HP SmartStart CD and ran 100 
loops of the system diagnostics, especially the memory and CPU tests.  
No problems were found.  I think I will run memtest86 over the 
weekend.

We have placed a hardware support call in to HP.

Best Regards,

Dave Windsor

Robert Bosch LLC
Team Leader, MES Database Infrastructure Group (AdP/TEF7.1)
4421 Highway 81 North
Anderson, SC 29621 USA
www.bosch.us
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
