Great write-up, Jens. The chance of two motherboards both being broken is probably low, but overheating is a very good point. Setting up IPMI was already on my to-do list, and it seems that now is the best time to do it.
Thanks

On Wed, Mar 20, 2013 at 1:08 PM, Jens Elkner <jel+...@cs.uni-magdeburg.de> wrote:

> On Wed, Mar 20, 2013 at 08:50:40AM -0700, Peter Wood wrote:
> > I'm sorry. I should have mentioned that I can't find any errors in the
> > logs. The last entry in /var/adm/messages is that I removed the keyboard
> > after the last reboot and then it shows the new boot up messages when I boot
> > up the system after the crash. The BIOS log is empty. I'm not sure how to
> > check the IPMI but IPMI is not configured and I'm not using it.
>
> You definitely should! Plug a cable into the dedicated network port
> and configure it (easiest way for you is probably to jump into the BIOS
> and assign the appropriate IP address etc.). Then, for a quick look,
> point your browser to the given IP, port 80 (default login is
> ADMIN/ADMIN). You may now also configure some other details
> (accounts/passwords/roles).
>
> To track the problem, either write a script which polls the parameters
> in question periodically, or just install the latest ipmiViewer and use
> it to monitor your sensors ad hoc.
> See ftp://ftp.supermicro.com/utility/IPMIView/
>
> > Just another observation - the crashes are more intense the more data the
> > system serves (NFS).
> > I'm looking into FRMW upgrades for the LSI now.
>
> Latest LSI FW should be P15, for this MB type 217 (2.17), MB-BIOS C28
> (1.0b).
> However, I doubt that your problem has anything to do with the
> SAS-ctrl or OI or ZFS.
>
> My guess is that either your MB is broken (we had an X9DRH-iF, which
> instantly "disappeared" as soon as it got some real load) or you have
> a heat problem (watch your CPU temp, e.g. via ipmiviewer). With 2GHz
> that's not very likely, but worth a try (socket placement on this board
> is not really smart IMHO).
>
> To test quickly
> - disable all additional, unneeded services in OI which may put some
>   load on the machine (like NFS service, http and bla) and perhaps
>   even export unneeded pools (just to be sure)
> - fire up your ipmiviewer and look at the sensors (set update to
>   10s) or refresh manually often
> - start 'openssl speed -multi 32' and keep watching your CPU temp
>   sensors (with 2GHz I guess it takes ~ 12min)
>
> I guess your machine "disappears" before the CPUs get really hot
> (broken MB). If CPUs switch off (usually first CPU2 and a little bit
> later CPU1) you have a cooling problem. If nothing happens, well, then
> it could be an OI or ZFS problem ;-)
>
> Have fun,
> jel.
> --
> Otto-von-Guericke University     http://www.cs.uni-magdeburg.de/
> Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
> 39106 Magdeburg, Germany         Tel: +49 391 67 52768
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
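For the "write a script which polls the parameters" option, here is a minimal sketch of what I have in mind. It is an assumption on my part that the BMC can be queried with ipmitool over the network (the tool named in the thread is ipmiViewer), and the BMC address, the account, and the exact column layout of the `sdr type Temperature` output are placeholders that may differ on this board. The idea is to run it from a different machine, so the log survives when the server "disappears" during the openssl run.

```python
#!/usr/bin/env python3
"""Poll BMC temperature sensors with ipmitool and print timestamped readings."""
import subprocess
import sys
import time

# Hypothetical values: replace with the address and account configured on the BMC.
BMC_HOST = "192.0.2.10"
BMC_USER = "ADMIN"
BMC_PASS = "ADMIN"


def parse_sdr_line(line):
    """Parse one line of 'ipmitool sdr type Temperature' output.

    Assumed layout: 'CPU1 Temp | 01h | ok | 3.1 | 45 degrees C'.
    Returns (name, status, reading), or None if the line does not match.
    """
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 5 or not fields[0]:
        return None
    return fields[0], fields[2], fields[4]


def read_temperatures():
    """Query the BMC once and return a list of (name, status, reading)."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", BMC_HOST,
           "-U", BMC_USER, "-P", BMC_PASS, "sdr", "type", "Temperature"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    parsed = (parse_sdr_line(line) for line in out.splitlines())
    return [p for p in parsed if p is not None]


def poll(interval_s=10, iterations=None):
    """Log readings every interval_s seconds; loop forever if iterations is None."""
    done = 0
    while iterations is None or done < iterations:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        try:
            for name, status, reading in read_temperatures():
                # flush so the last reading before a crash is not lost in a buffer
                print(f"{stamp} {name}: {reading} [{status}]", flush=True)
        except (subprocess.CalledProcessError, OSError) as exc:
            print(f"{stamp} BMC query failed: {exc}", file=sys.stderr, flush=True)
        done += 1
        time.sleep(interval_s)
```

Calling `poll()` with the default 10 s interval matches the refresh rate suggested for ipmiviewer. Redirecting the output to a file should show whether the CPU temperatures climb before the box goes away or whether it drops out while still cool.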