In the past week I've lost a few drives on a thumper I look after, and I've noticed a few issues with the resilver process that could be improved (or maybe already have been; the system is running Solaris 10 update 8).

1) While the pool has been resilvering, I have been copying a large (2TB) filesystem from another box. Performance was OK for the initial send (45MB/s), but is pretty terrible for incrementals. The problem looks like latency: sending an empty snapshot usually gets a response in under a second, but during a resilver it can take 30-40 seconds (the rough test I'm using is sketched below). I do a daily incremental send of a filesystem set with about 4000 small snapshots (1000 users, 6-hourly snaps), which normally takes about 3 hours; while resilvering, it barely gets 10% through in a day.
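For reference, the latency check is just timing an empty incremental stream; something along these lines (pool and filesystem names are made up):

    # two snapshots back to back, so the incremental between them is empty
    zfs snapshot tank/test@a
    zfs snapshot tank/test@b
    # time how long the empty send takes to complete
    time zfs send -i tank/test@a tank/test@b > /dev/null

On an idle pool that completes in well under a second; during a resilver it can take 30-40 seconds.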

2) If a drive fails in one vdev while a drive in another vdev is resilvering, both resilvers start over (roughly the sequence sketched below). I have yet to complete a resilver on the first drive to fail, because other drives keep failing when the current resilver is 90% done! Is it really necessary to restart when a drive fails in a different vdev? As more drives fail, the resilvers get progressively longer.
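To make that concrete, the sequence is roughly this (device names made up):

    # drive in vdev A fails and is replaced; resilver starts
    zpool replace tank c1t0d0
    # ...resilver gets to ~90%...
    # drive in vdev B then fails and is replaced
    zpool replace tank c5t3d0
    # zpool status now shows the resilver starting again from the beginning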

3) Detaching a failed device while a spare is resilvering causes the resilver to restart (sketch below). Is that necessary?
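That is, with a hot spare already resilvering in (names again made up):

    # spare c4t7d0 is resilvering in place of failed c0t2d0
    zpool detach tank c0t2d0
    # the spare's resilver restarts from scratch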

Thanks,

--
Ian.
