In the past week I've lost a few drives on a Thumper I look after, and
I've noticed a few issues with the resilver process that could be
improved (or perhaps already have been; the system is running
Solaris 10 update 8).
1) While the pool has been resilvering, I have been copying a large
(2 TB) filesystem from another box. Performance was OK for the initial
send (45 MB/s), but has been pretty terrible for incrementals. The
problem looks like latency: sending an effectively empty snapshot
normally gets a response in under a second, but during a resilver it
can take 30-40 seconds. I do a daily incremental send of a set of
filesystems with about 4000 small snapshots (1000 users, snapshots
every six hours), which normally takes about three hours; while
resilvering, it barely gets 10% of the way through in a day.
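For reference, the daily job boils down to a loop like the one below,
and timing a send of an effectively empty snapshot is how I measured
the latency. Pool, filesystem, host, and snapshot names here are made
up; the real script has more error handling:

    # Daily job: send the latest 6-hourly snapshot of each user
    # filesystem incrementally to the backup box. 'tank/home',
    # 'backuphost', '@prev' and '@latest' are placeholders.
    for fs in $(zfs list -H -o name -t filesystem -r tank/home); do
        zfs send -i "$fs@prev" "$fs@latest" | \
            ssh backuphost zfs receive -F "backup/${fs#tank/}"
    done

    # Latency check: an incremental between two snapshots with no
    # changes between them normally completes in under a second;
    # during a resilver it takes 30-40 seconds.
    time zfs send -i tank/home/u1@prev tank/home/u1@latest | \
        ssh backuphost zfs receive backup/home/u1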
2) If a drive fails in one vdev while a drive in another is
resilvering, both resilvers start over. I have yet to complete the
resilver of the first drive that failed, because other drives keep
failing when the current resilver is 90% done! Is it really necessary
to restart when a drive fails in another vdev? As more drives fail,
the resilvers get progressively longer.
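For what it's worth, I've been watching the restarts with a simple
poll ('tank' stands in for the real pool name):

    # Print the resilver progress line every ten minutes;
    # 'tank' is a placeholder for the real pool name.
    while true; do
        date
        zpool status tank | grep 'scrub:'
        sleep 600
    done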
3) Detaching a failed device while a spare is resilvering causes the
resilver to restart. Is that necessary?
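Concretely: with a hot spare resilvering in place of a failed disk, a
detach like the following (pool and device names made up) kicks the
resilver back to 0%:

    # Detach the failed disk while its hot spare is still
    # resilvering; this restarts the spare's resilver from
    # the beginning.
    zpool detach tank c0t3d0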
Thanks,
--
Ian.