Paul Kraus wrote:
> On 9/4/07, Gino <[EMAIL PROTECTED]> wrote:
>
>> Yesterday we had a drive failure on a FC-AL JBOD with 14 drives.
>> Suddenly the zpool using that JBOD stopped responding to I/O requests
>> and we got tons of the following messages in /var/adm/messages:
>
> <snip>
>
>> "cfgadm -al" or "devfsadm -C" didn't solve the problem.
>> After a reboot ZFS recognized the drive as failed and all worked well.
>>
>> Do we need to restart Solaris after a drive failure??
It depends...

> I would hope not but ... prior to putting some ZFS volumes
> into production we did some failure testing. The hardware I was
> testing with was a couple of SF-V245s with 4 x 72 GB disks each. Two
> disks were set up with SVM/UFS as the mirrored OS, the other two were
> handed to ZFS as a mirrored zpool. I did some large file copies to
> generate I/O. While a large copy was going on (lots of disk I/O) I
> pulled one of the drives.

... on which version of Solaris you are running.

ZFS FMA phase 2 was integrated into SXCE build 68. Prior to that
release, ZFS had a limited view of the (many) disk failure modes -- it
would only declare a disk failed if the disk could not be opened. In
phase 2, the ZFS diagnosis engine was enhanced with per-vdev soft error
rate discriminator (SERD) engines (a quick way to see what the
diagnosis engine thinks of a disk is sketched at the end of this
message). More details can be found in the ARC case materials:
http://www.opensolaris.org/os/community/arc/caselog/2007/283/materials/portfolio-txt/

In SXCE build 72 we gained a new FMA I/O retire agent. This is more
general purpose and allows a process to set a contract against a device
in use.
http://www.opensolaris.org/os/community/on/flag-days/pages/2007080901/
http://www.opensolaris.org/os/community/arc/caselog/2007/290/

> If the I/O was to the zpool the system would hang (just like
> it was hung waiting on an I/O operation). I let it sit this way for
> over an hour with no recovery. After rebooting it found the existing
> half of the ZFS mirror just fine. Just to be clear, once I pulled the
> disk, over about a 5 minute period *all* activity on the box hung.
> Even a shell just running prstat.

It may depend on what shell you are using. Some shells, such as ksh,
write to $HISTFILE before exec'ing the command. If your $HISTFILE is
located in an affected file system, then the shell will appear to hang
(a workaround is sketched at the end of this message).

> If the I/O was to one of the SVM/UFS disks there would be a
> 60-90 second pause in all activity (just like the ZFS case), but then
> operation would resume. This is what I am used to seeing for a disk
> failure.

The default retry timeout for most disks is 60 seconds (last time I
checked). There are several layers involved here, so you can expect
something to happen at 60-second intervals, even if it is just another
retry (the relevant sd timeout tunable is sketched at the end of this
message).

> In the ZFS case I could replace the disk and the zpool would
> resilver automatically. I could also take the removed disk and put it
> into the second system and have it recognize the zpool (and that it
> was missing half of a mirror), and the data was all there.
>
> In no case did I see any data loss or corruption. I had
> attributed the system hanging to an interaction between the SAS and
> ZFS layers, but the previous post makes me question that assumption.
>
> As another data point, I have an old Intel box at home running
> Solaris x86 with ZFS. I have a pair of 120 GB PATA disks. The OS is
> on SVM/UFS mirrored partitions and /export/home is on a pair of
> partitions in a zpool (mirror). I had a bad power connector and
> sometime after booting lost one of the drives. The server kept
> running fine. Once I got the drive powered back up (while the server
> was shut down), the SVM mirrors resynced and the zpool resilvered.
> The zpool finished substantially before the SVM.
>
> In all cases the OS was Solaris 10 U3 (11/06) with no
> additional patches.

The behaviour you describe is what I would expect for that release of
Solaris + ZFS.
 -- richard
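Re the FMA phase 2 diagnosis: on a build with the per-vdev SERD
engines, something like the following should show whether error
telemetry is arriving and whether a fault has actually been diagnosed,
without needing a reboot. This is only a sketch; the pool name "tank"
and device "c3t12d0" are placeholders for whatever your configuration
uses.

    # zpool status -x               (any pools in a non-healthy state?)
    # fmdump -e                     (ereports, i.e. error telemetry, seen by FMA)
    # fmadm faulty                  (faults actually diagnosed)
    # zpool replace tank c3t12d0    (after physically swapping the drive)
    # zpool status tank             (watch the resilver)

If the hang described above happens again, the interesting question is
whether "fmdump -e" shows ereports accumulating while "zpool status"
still claims the device is ONLINE.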
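Re the $HISTFILE observation: if you want to rule out the shell's
history write as the reason an interactive ksh looks hung, one
workaround is to keep the history file on tmpfs rather than on a file
system that may be blocked behind the failed device. A sketch only --
the path is an arbitrary example, and it needs to be set before the
shell is started (e.g. in ~/.profile):

    HISTFILE=/tmp/.sh_history.$LOGNAME
    export HISTFILE

If the shell still hangs with the history file relocated, then the
problem is more than just the history write.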
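Re the 60-second retries: that interval is the sd(7D) per-command
timeout. If you want to experiment with how long the stack waits before
deciding a command has failed, the usual knob is sd_io_time in
/etc/system; I believe FC-AL disks attached through ssd(7D) have an
ssd:ssd_io_time equivalent, but check the tunable names for your
release. The value below is only an example, not a recommendation, and
a reboot is needed for /etc/system changes to take effect.

    * /etc/system fragment -- example value only
    * per-command timeout for sd(7D), in seconds (default 60)
    set sd:sd_io_time = 30

Shortening it too far can turn retryable glitches into spurious
failures, so be conservative, especially on FC-AL loops where transient
errors are not unusual.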