Hi all,

Unfortunately for me, there does seem to be a hardware component to my problem. Although my rsync copied almost 4TB of data with no iostat errors after going back to OpenSolaris 2009.06, I/O on one of my mpt cards did eventually hang (6 disk lights on, 2 off) until I rebooted. A few hardware changes have been made since the last time I did a full backup, so it's possible that whatever problem was introduced didn't occur often enough under low I/O load for me to detect it until now, when I was reinstalling and copying massive amounts of data back.
The changes I had made since originally installing osol 2009.06 several months ago are:

- stopped using the onboard Marvell Yukon2 ethernet (which required a 3rd-party driver) in favor of an Intel 1000 PT dual port card, which necessitated an extra PCI-e slot, prompting the following item:
- swapped motherboards between 2 machines (they were similar, though, with similar onboard hardware, and it shouldn't have been a major change). Originally an Asus P5Q Deluxe w/3 PCI-e slots, now a slightly older Asus P5W64 w/4 PCI-e slots.
- the Intel 1000 PT dual port card has been aggregated as aggr0 since it was installed (the older Yukon2 was a basic interface)

The above changes were made a while ago, before upgrading OpenSolaris to snv 127, and things seemed to be working fine for at least 2-3 months with rsync updating (it never hung, had a fatal ZFS error, or lost access to data requiring a reboot).

New changes since troubleshooting the snv 127 mpt issues:

- upgraded the LSI 3081 firmware from 1.28.2 (or was it .02?) to 1.29, the latest. If this turns out to be an issue, I do have the previous IT firmware I was using before, which I can flash back.

Another, albeit unlikely, factor: when I originally copied all my data to my first OpenSolaris raidz2 pool, I didn't use rsync at all; I used netcat & tar, and only set up rsync later for updates. Perhaps the huge initial single rsync of the large tree does something strange that the original netcat & tar copy did not (I know, unlikely, but I'm grasping at straws here to determine what has happened).

I'll work on ruling out the potential sources of hardware problems before I report any more on the mpt issues, since my test case would probably confound things at this point. I am affected by the mpt bugs, since I would get the timeouts almost constantly in snv 127+, but since I'm apparently also affected by some other, unknown hardware issue, my data on the mpt problems might lead people in the wrong direction at this point.
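For reference, that original bulk copy was a tar stream piped over netcat rather than rsync. A rough sketch of the technique — the hostname, port, and paths below are placeholders, not the ones actually used, and netcat's listen flags vary between implementations:

```shell
# Receiving box: listen on a placeholder port and unpack into the pool
# (traditional-netcat syntax; some variants use "nc -l 9000"):
#   nc -l -p 9000 | tar xf - -C /tank/backup
#
# Sending box: stream the whole tree into the listener:
#   (cd /data && tar cf - .) | nc receiver-host 9000
#
# The same stream run locally through a plain pipe, so the shape of the
# pipeline is visible without a network:
mkdir -p /tmp/copy_src /tmp/copy_dst
echo "sample" > /tmp/copy_src/file.txt
(cd /tmp/copy_src && tar cf - .) | (cd /tmp/copy_dst && tar xf -)
```

Once the bulk copy is done, incremental updates only need something like `rsync -a /data/ receiver-host:/tank/backup/`.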
I will first try going back to the non-aggregated Yukon ethernet and remove the Intel dual port PCI-e network adapter; then, if the problem persists, I'll try half of my drives on each LSI controller individually to confirm whether one controller has a problem the other does not, or whether one drive in one set is causing a new problem on a particular controller. I hope to have some kind of answer at that point and not have to resort to motherboard swapping again.

Chad

On Thu, Dec 03, 2009 at 10:44:53PM -0800, Chad Cantwell wrote:
> I eventually performed a few more tests, adjusting some zfs tuning options which had no effect, and
> trying the itmpt driver which someone had said would work, and regardless my system would always
> freeze quite rapidly in snv 127 and 128a. Just to double check my hardware, I went back to the
> opensolaris 2009.06 release version, and everything is working fine. The system has been running a
> few hours and copied a lot of data and not had any trouble, mpt syslog events, or iostat errors.
>
> One thing I found interesting, and I don't know if it's significant or not, is that under the recent
> builds and under 2009.06, I had run "echo '::interrupts' | mdb -k" to check the interrupts used.
> (I don't have the printout handy for snv 127+, though.)
>
> I have a dual port gigabit Intel 1000 P PCI-e card, which shows up as e1000g0 and e1000g1. In
> snv 127+, each of my e1000g devices shares an IRQ with my mpt devices (mpt0, mpt1) on the IRQ
> listing, whereas in opensolaris 2009.06, all 4 devices are on different IRQs. I don't know if this
> is significant, but most of my testing when I encountered errors was data transfer via the network,
> so it could have potentially been interfering with the mpt drivers when it was on the same IRQ.
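The interrupt check quoted above can be reproduced on a Solaris/OpenSolaris host with mdb; the exact `::interrupts` output format varies by build, so the sample line and grep pattern below are only illustrative:

```shell
# On the Solaris host itself (mdb -k needs root):
#   echo '::interrupts' | mdb -k
#
# Narrowing the table to the devices in question:
#   echo '::interrupts' | mdb -k | egrep 'mpt|e1000g'
#
# If an mpt instance and an e1000g instance appear on the same vector/IRQ
# line, they are sharing that interrupt. Demonstration of the filter on a
# made-up two-line sample (not real ::interrupts output):
printf 'IRQ24 MSI  mpt#0, e1000g#0\nIRQ26 Fixed uhci#1\n' | egrep 'mpt|e1000g'
```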
> The errors did seem to be less frequent when the server I was copying from was linked at 100 instead
> of 1000 (one of my tests), but that is as likely to be a result of the slower zpool throughput as it
> is to be related to the network traffic.
>
> I'll probably stay with 2009.06 for now since it works fine for me, but I can try a newer build again
> once some more progress is made in this area and people want to see if it's fixed (this machine is
> mainly to back up another array, so it's not too big a deal to test later when the mpt drivers are
> looking better and wipe again in the event of problems).
>
> Chad
>
> On Tue, Dec 01, 2009 at 03:06:31PM -0800, Chad Cantwell wrote:
> > To update everyone, I did a complete zfs scrub, and it generated no errors in iostat, and I have
> > 4.8T of data on the filesystem, so it was a fairly lengthy test. The machine also has exhibited no
> > evidence of instability. If I were to start copying a lot of data to the filesystem again, though,
> > I'm sure it would generate errors and crash again.
> >
> > Chad
> >
> > On Tue, Dec 01, 2009 at 12:29:16AM -0800, Chad Cantwell wrote:
> > > Well, ok, the msi=0 thing didn't help after all. A few minutes after my last message a few
> > > errors showed up in iostat, and then in a few minutes more the machine was locked up hard...
> > > Maybe I will try just doing a scrub instead of my rsync process and see how that does.
> > >
> > > Chad
> > >
> > > On Tue, Dec 01, 2009 at 12:13:36AM -0800, Chad Cantwell wrote:
> > > > I don't think the hardware has any problems; it only started having errors when I upgraded
> > > > OpenSolaris. It's still working fine again now after a reboot.
> > > > Actually, I reread one of your earlier messages, and I didn't realize at first when you said
> > > > "non-Sun JBOD" that this didn't apply to me (in regards to the msi=0 fix), because I didn't
> > > > realize JBOD was shorthand for an external expander device. Since I'm just using bare metal
> > > > and passive backplanes, I think the msi=0 fix should apply to me based on what you wrote
> > > > earlier. Anyway, I've put
> > > >
> > > > set mpt:mpt_enable_msi = 0
> > > >
> > > > in /etc/system and rebooted, as it was suggested earlier. I've resumed my rsync, and so far
> > > > there have been no errors, but it's only been 20 minutes or so. I should have a good idea by
> > > > tomorrow if this definitely fixed the problem (since even when the machine was not crashing,
> > > > it was tallying up iostat errors fairly rapidly).
> > > >
> > > > Thanks again for your help. Sorry for wasting your time if the previously posted workaround
> > > > fixes things. I'll let you know tomorrow either way.
> > > >
> > > > Chad
> > > >
> > > > On Tue, Dec 01, 2009 at 05:57:28PM +1000, James C. McPherson wrote:
> > > > > Chad Cantwell wrote:
> > > > > > After another crash I checked the syslog and there were some different errors than the
> > > > > > ones I saw previously during operation:
> > > > > ...
> > > > > > Nov 30 20:59:13 the-vault LSI PCI device (1000,ffff) not supported.
> > > > > ...
> > > > > > Nov 30 20:59:13 the-vault mpt_config_space_init failed
> > > > > ...
> > > > > > Nov 30 20:59:15 the-vault mpt_restart_ioc failed
> > > > > ....
> > > > > > Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major
> > > > > > Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009
> > > > > > Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: the-vault
> > > > > > Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16
> > > > > > Nov 30 21:33:02 the-vault EVENT-ID: 7886cc0d-4760-60b2-e06a-8158c3334f63
> > > > > > Nov 30 21:33:02 the-vault DESC: The transmitting device sent an invalid request.
> > > > > > Nov 30 21:33:02 the-vault Refer to http://sun.com/msg/PCIEX-8000-8R for more information.
> > > > > > Nov 30 21:33:02 the-vault AUTO-RESPONSE: One or more device instances may be disabled
> > > > > > Nov 30 21:33:02 the-vault IMPACT: Loss of services provided by the device instances associated with this fault
> > > > > > Nov 30 21:33:02 the-vault REC-ACTION: Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected device(s). Use fmadm faulty to identify the devices or contact Sun for support.
> > > > >
> > > > > Sorry to have to tell you, but that HBA is dead. Or at least dying horribly. If you can't
> > > > > init the config space (that's the PCI bus config space), then you've got about 1/2 the nails
> > > > > in the coffin hammered in. Then the failure to restart the IOC (I/O controller unit) == the
> > > > > rest of the lid hammered down.
> > > > >
> > > > > best regards,
> > > > > James C.
> > > > > McPherson
> > > > > --
> > > > > Senior Kernel Software Engineer, Solaris
> > > > > Sun Microsystems
> > > > > http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss