Anton B. Rang wrote:
>Peter Eriksson wrote:
>> And to panic? How can that in any sane way be a good way to "protect" the
>> application? *BANG* - no chance at all for the application to handle
>> the problem...

> I agree -- a disk error should never be fatal to the system; at worst,
> the file system should appear to have been forcibly unmounted (and
> "worst" really means that critical metadata, like the
> superblock/uberblock, can't be updated on any of the disks in the pool).
> That at least gives other applications which aren't using the file system
> the chance to keep going.
>
> But it's still not the application's problem to handle the underlying
> device failure.

...

> That said, it also appears that the device drivers (either the
> FibreChannel or SCSI disk drivers in this case) are misbehaving. The FC
> driver appears to be reporting back an error which is interpreted as
> fatal by the SCSI disk driver when one or the other should be retrying
> the I/O. (It also appears that either the FC driver, SCSI disk driver, or
> ZFS is misbehaving in the observed hang.)

In this case the qla2x00 driver is most likely at fault. The Leadville
drivers do the appropriate retries, as do the sd driver and ZFS.

> So ZFS should be more resilient against write errors, and the SCSI disk
> or FC drivers should be more resilient against LIPs (the most likely
> cause of your problem) or other transient errors. (Alternatively, the ifp
> driver should be updated to support the maximum number of targets on a
> loop, which might also solve your second problem.)

Your alternative option isn't going to happen. The ifp driver and
the card it supports have both long since been EOLed.



James C. McPherson
--
Solaris kernel software engineer
Sun Microsystems
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
