Hi all,

Unfortunately for me, there does seem to be a hardware component to my problem. Although my rsync copied almost 4TB of data with no iostat errors after going back to OpenSolaris 2009.06, I/O on one of my mpt cards did eventually hang (6 disk lights on, 2 off) and stayed that way until I rebooted. A few hardware changes have been made since the last time I did a full backup, so it's possible that whatever problem was introduced occurred too infrequently under low I/O load for me to notice until now, when I was reinstalling and copying massive amounts of data back.

The changes I had made since originally installing OpenSolaris 2009.06 several months ago are:

- stopped using the onboard Marvell Yukon 2 ethernet (which required a 3rd-party driver) in favor of an Intel 1000 PT dual port card, which necessitated an extra PCI-e slot, prompting the following item:
- swapped motherboards between 2 machines (they were similar, though, with similar onboard hardware, so this shouldn't have been a major change). Originally an Asus P5Q Deluxe w/3 PCI-e slots, now a slightly older Asus P5W64 w/4 PCI-e slots.
- the Intel 1000 PT dual port card has been aggregated as aggr0 since it was installed (the older Yukon 2 was a basic, non-aggregated interface); the aggregation setup is sketched just below this list.
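
For reference, I created the aggregation with dladm roughly like this (quoting from memory, so the exact syntax may be slightly off, and the address below is a placeholder rather than my real config):

    # dladm create-aggr -l e1000g0 -l e1000g1 aggr0
    # ifconfig aggr0 plumb
    # ifconfig aggr0 192.168.1.10 netmask 255.255.255.0 up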

The above changes were made a while ago, before upgrading OpenSolaris to snv 127, and things seemed to work fine for at least 2-3 months of rsync updates (it never hung, had a fatal zfs error, or lost access to data in a way that required a reboot).

New changes since troubleshooting the snv 127 mpt issues:
- upgraded the LSI 3081 firmware from 1.28.2 (or was it .02?) to 1.29, the latest. If this turns out to be an issue, I still have the previous IT firmware I was using before, which I can flash back.

Another, albeit unlikely, factor: when I originally copied all my data to my first OpenSolaris raidz2 pool, I didn't use rsync at all; I used netcat & tar, and only set up rsync later for updates. Perhaps the huge initial single rsync of the large tree does something strange that the original netcat & tar copy did not (I know, unlikely, but I'm grasping at straws here to determine what has happened). Roughly, the two copy methods looked like the sketch below.
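
Paraphrased from memory (the port number and paths are placeholders, and my exact flags may have differed):

    # on the receiving OpenSolaris box: listen and unpack the stream
    nc -l 9000 | tar xf -

    # on the sending box: stream the tree across
    tar cf - /data | nc the-vault 9000

    # versus the later incremental updates
    rsync -a /data/ the-vault:/tank/data/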

I'll work on ruling out the potential sources of hardware problems before I report any more on the mpt issues, since my test case would probably confound things at this point. I am affected by the mpt bugs, since I would get the timeouts almost constantly in snv 127+, but because I'm also apparently affected by some other, unknown hardware issue, my data on the mpt problems might lead people in the wrong direction.

I will first try going back to the non-aggregated Yukon ethernet and removing the Intel dual port PCI-e network adapter. If the problem persists, I'll then try half of my drives on each LSI controller individually, to confirm whether one controller has a problem the other does not, or whether one drive in one set is causing a new problem for a particular controller. While testing, I'll keep an eye on the per-device error counters with something like the commands below. I hope to have some kind of answer at that point and not have to resort to motherboard swapping again.
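
The checks themselves are just the standard ones:

    # per-device soft/hard/transport error totals
    iostat -En

    # recent fault-management error telemetry, newest entries
    fmdump -e | tail

    # quick check that no pool is reporting problems
    zpool status -x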

Chad

On Thu, Dec 03, 2009 at 10:44:53PM -0800, Chad Cantwell wrote:
> I eventually performed a few more tests: adjusting some zfs tuning options, which had no effect, and trying the itmpt driver, which someone had said would work; regardless, my system would always freeze quite rapidly in snv 127 and 128a. Just to double-check my hardware, I went back to the OpenSolaris 2009.06 release version, and everything is working fine. The system has been running a few hours and has copied a lot of data without any trouble, mpt syslog events, or iostat errors.
> 
> One thing I found interesting, and I don't know if it's significant or not, is that under both the recent builds and 2009.06 I had run "echo '::interrupts' | mdb -k" to check the interrupts used (I don't have the printout handy for snv 127+, though).
> 
> I have a dual port gigabit Intel 1000 P PCI-e card, which shows up as e1000g0 and e1000g1. In snv 127+, each of my e1000g devices shares an IRQ with one of my mpt devices (mpt0, mpt1) in the IRQ listing, whereas in OpenSolaris 2009.06 all 4 devices are on different IRQs. I don't know if this is significant, but most of my testing when I encountered errors involved data transfer over the network, so the network traffic could potentially have been interfering with the mpt drivers when they shared an IRQ. The errors did seem to be less frequent when the server I was copying from was linked at 100 instead of 1000 (one of my tests), but that is as likely to be a result of the slower zpool throughput as it is to be related to the network traffic.
> 
> I'll probably stay with 2009.06 for now since it works fine for me, but I can try a newer build again once some more progress is made in this area and people want to see if it's fixed. (This machine mainly backs up another array, so it's not too big a deal to test a later build once the mpt drivers are looking better, and to wipe and start over in the event of problems.)
> 
> Chad
> 
> On Tue, Dec 01, 2009 at 03:06:31PM -0800, Chad Cantwell wrote:
> > To update everyone: I did a complete zfs scrub, and it generated no errors in iostat; with 4.8T of data on the filesystem, it was a fairly lengthy test. The machine has also exhibited no evidence of instability. If I were to start copying a lot of data to the filesystem again, though, I'm sure it would generate errors and crash again.
> > 
> > Chad
> > 
> > 
> > On Tue, Dec 01, 2009 at 12:29:16AM -0800, Chad Cantwell wrote:
> > > Well, ok, the msi=0 thing didn't help after all. A few minutes after my last message, a few errors showed up in iostat, and a few minutes later the machine was locked up hard... Maybe I will try just doing a scrub instead of my rsync process and see how that does.
> > > 
> > > Chad
> > > 
> > > 
> > > On Tue, Dec 01, 2009 at 12:13:36AM -0800, Chad Cantwell wrote:
> > > > I don't think the hardware has any problems; it only started having errors when I upgraded OpenSolaris, and it's still working fine again now after a reboot. Actually, I reread one of your earlier messages, and at first I didn't realize that when you said "non-Sun JBOD" it didn't apply to me (in regard to the msi=0 fix), because I didn't realize JBOD was shorthand for an external expander device. Since I'm just using bare metal and passive backplanes, I think the msi=0 fix should apply to me based on what you wrote earlier. Anyway, I've now put
> > > >         set mpt:mpt_enable_msi = 0
> > > > in /etc/system and rebooted, as was suggested earlier. I've resumed my rsync, and so far there have been no errors, but it's only been 20 minutes or so. I should have a good idea by tomorrow whether this definitely fixed the problem (since even when the machine was not crashing, it was tallying up iostat errors fairly rapidly).
> > > > 
> > > > Thanks again for your help.  Sorry for wasting your time if the 
> > > > previously posted workaround fixes things.
> > > > I'll let you know tomorrow either way.
> > > > 
> > > > Chad
> > > > 
> > > > On Tue, Dec 01, 2009 at 05:57:28PM +1000, James C. McPherson wrote:
> > > > > Chad Cantwell wrote:
> > > > > >After another crash I checked the syslog and there were some 
> > > > > >different errors than the ones
> > > > > >I saw previously during operation:
> > > > > ...
> > > > > 
> > > > > >Nov 30 20:59:13 the-vault       LSI PCI device (1000,ffff) not 
> > > > > >supported.
> > > > > ...
> > > > > >Nov 30 20:59:13 the-vault       mpt_config_space_init failed
> > > > > ...
> > > > > >Nov 30 20:59:15 the-vault       mpt_restart_ioc failed
> > > > > ....
> > > > > 
> > > > > >Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: 
> > > > > >PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major
> > > > > >Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009
> > > > > >Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: 
> > > > > >System-Serial-Number, HOSTNAME: the-vault
> > > > > >Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16
> > > > > >Nov 30 21:33:02 the-vault EVENT-ID: 
> > > > > >7886cc0d-4760-60b2-e06a-8158c3334f63
> > > > > >Nov 30 21:33:02 the-vault DESC: The transmitting device sent an 
> > > > > >invalid request.
> > > > > >Nov 30 21:33:02 the-vault   Refer to 
> > > > > >http://sun.com/msg/PCIEX-8000-8R for more information.
> > > > > >Nov 30 21:33:02 the-vault AUTO-RESPONSE: One or more device 
> > > > > >instances may be disabled
> > > > > >Nov 30 21:33:02 the-vault IMPACT: Loss of services provided by the 
> > > > > >device instances associated with this fault
> > > > > >Nov 30 21:33:02 the-vault REC-ACTION: Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected device(s). Use fmadm faulty to identify the devices or contact Sun for support.
> > > > > 
> > > > > 
> > > > > Sorry to have to tell you, but that HBA is dead. Or at
> > > > > least dying horribly. If you can't init the config space
> > > > > (that's the pci bus config space), then you've got about
> > > > > 1/2 the nails in the coffin hammered in. Then the failure
> > > > > to restart the IOC (io controller unit) == the rest of
> > > > > the lid hammered down.
> > > > > 
> > > > > 
> > > > > best regards,
> > > > > James C. McPherson
> > > > > --
> > > > > Senior Kernel Software Engineer, Solaris
> > > > > Sun Microsystems
> > > > > http://blogs.sun.com/jmcp     http://www.jmcp.homeunix.com/blog
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
