Great write-up, Jens.

The chance of two motherboards both being broken is probably low, but
overheating is a very good point. Setting up IPMI was already on my
to-do list, and now seems like the best time to do it.

Thanks

On Wed, Mar 20, 2013 at 1:08 PM, Jens Elkner <jel+...@cs.uni-magdeburg.de> wrote:

> On Wed, Mar 20, 2013 at 08:50:40AM -0700, Peter Wood wrote:
> >    I'm sorry, I should have mentioned that I can't find any errors in
> >    the logs. The last entry in /var/adm/messages is that I removed the
> >    keyboard after the last reboot; after that it shows the new boot
> >    messages from when I booted the system after the crash. The BIOS
> >    log is empty. I'm not sure how to check IPMI; it is not configured
> >    and I'm not using it.
>
> You definitely should! Plug a cable into the dedicated network port
> and configure it (the easiest way for you is probably to go into the
> BIOS and assign the appropriate IP address etc.). Then, for a quick
> look, point your browser at that IP address on port 80 (default login
> is ADMIN/ADMIN). You can then also configure some other details
> (accounts/passwords/roles).
>
> To track the problem, either write a script that periodically polls
> the parameters in question, or just install the latest IPMIView and
> use it to monitor your sensors ad hoc.
> See ftp://ftp.supermicro.com/utility/IPMIView/
>
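
For the polling-script option, a minimal sketch in Python might look like
the following. It assumes ipmitool is installed on a machine that can
reach the BMC, that the BMC answers on the placeholder address 192.0.2.10
with the default ADMIN/ADMIN login, and that "ipmitool sdr type
Temperature" prints pipe-separated lines ending in a reading such as
"45 degrees C". None of that comes from this thread, so adjust it to
whatever your board actually reports.

```python
import re
import subprocess
import time

# Placeholder BMC address and credentials (assumptions, not from the thread).
# Note: passing the password with -P exposes it in the process list.
CMD = ["ipmitool", "-I", "lanplus", "-H", "192.0.2.10",
       "-U", "ADMIN", "-P", "ADMIN", "sdr", "type", "Temperature"]

# Matches a final field such as "45 degrees C".
READING = re.compile(r"^(-?\d+(?:\.\d+)?)\s+degrees C$")


def parse_temps(text):
    """Return {sensor name: degrees C} for lines that carry a reading."""
    temps = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2:
            continue  # not a sensor line
        match = READING.match(fields[-1])
        if match:
            temps[fields[0]] = float(match.group(1))
    return temps


def poll(interval=10):
    """Print a timestamped snapshot of all temperatures every `interval` s."""
    while True:
        result = subprocess.run(CMD, capture_output=True, text=True, check=False)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, parse_temps(result.stdout), flush=True)
        time.sleep(interval)
```

Calling poll() from another machine and redirecting the output to a file
should leave a record of the last readings before the box goes away,
which the empty logs on the server itself cannot provide.
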
> >    Just another observation - the crashes are more intense the more
> >    data the system serves (NFS).
> >    I'm looking into firmware upgrades for the LSI now.
>
> Latest LSI FW should be P15, for this MB type 217 (2.17), MB-BIOS C28
> (1.0b).
> However, I doubt that your problem has anything to do with the
> SAS controller, OI or ZFS.
>
> My guess is that either your MB is broken (we had an X9DRH-iF, which
> instantly "disappeared" as soon as it got some real load) or you have
> a heat problem (watch your CPU temp, e.g. via IPMIView). With 2GHz
> that's not very likely, but worth a try (socket placement on this board
> is not really smart IMHO).
>
> To test quickly:
>         - disable all additional, unneeded services in OI which may put
>           some load on the machine (like the NFS service, http and so on)
>           and perhaps even export unneeded pools (just to be sure)
>         - fire up IPMIView and look at the sensors (set update to
>           10s) or refresh manually often
>         - start 'openssl speed -multi 32' and keep watching your CPU temp
>           sensors (with 2GHz I guess it takes ~ 12min)
>
> I guess your machine "disappears" before the CPUs get really hot
> (broken MB). If the CPUs switch off (usually CPU2 first and CPU1 a
> little later), you have a cooling problem. If nothing happens, well,
> then it could be an OI or ZFS problem ;-)
>
> Have fun,
> jel.
> --
> Otto-von-Guericke University     http://www.cs.uni-magdeburg.de/
> Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
> 39106 Magdeburg, Germany         Tel: +49 391 67 52768
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
