Some of that is very worrying, Miles. Do you have bug IDs for any of those 
problems?

I'm guessing the problem of the device being reported as OK after the reboot 
could be this one:
http://bugs.opensolaris.org/view_bug.do?bug_id=6582549

And could the errors after the reboot be one of these?
http://bugs.opensolaris.org/view_bug.do?bug_id=6558852
http://bugs.opensolaris.org/view_bug.do?bug_id=6675685

I don't have the same concerns myself that you guys have over massive pools, 
since we're working at a much smaller scale.  But even so, it's no good ZFS 
having "only resilvers the missing data" as one of its main selling points 
if it can't be relied upon to do that every time in real-world situations.

Incidentally, even with those resilver bugs, a few back-of-the-envelope 
calculations make me think this might not be too bad with 10Gb Ethernet:

Server size:  28TB
Interconnect speed:  10Gb/s   (call it 8Gb/s of actual bandwidth)
Usage:  70%   (worst-case scenario: the pool dies while under heavy load)

That gives us an available resilver bandwidth of 3Gb/s, which I'll divide by 
two since it has to cover both reads and writes.

28TB @ 1.5Gb/s gives a resilver time of around 42 hours, and changing the 
assumptions by dropping the usage figure to 20% brings that down to 16 hours.  
It's still a long time, but for a rare disaster-recovery scenario on a large 
pool, I think I could live with it.
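
For anyone who wants to check my figures, here's a quick Python sketch of 
that back-of-the-envelope arithmetic.  One caveat: the 42 and 16 hour numbers 
only fall out if you work from the nominal 10Gb/s link speed rather than the 
8Gb/s "actual" figure, so that's the assumption baked in below:

    # Rough resilver time estimate - a sketch of the assumptions above,
    # not a benchmark.  Works from the nominal 10Gb/s link speed, which
    # is what reproduces the 42-hour and 16-hour figures quoted above.

    def resilver_hours(pool_tb, link_gbps, usage):
        """Hours to push an entire pool's worth of data over the link."""
        free_gbps = link_gbps * (1.0 - usage)  # bandwidth left for resilvering
        each_way_gbps = free_gbps / 2.0        # halved: reads and writes share it
        pool_bits = pool_tb * 1e12 * 8         # pool size in bits (decimal TB)
        return pool_bits / (each_way_gbps * 1e9) / 3600.0

    print(resilver_hours(28, 10, 0.70))  # ~41.5 -> "around 42 hours"
    print(resilver_hours(28, 10, 0.20))  # ~15.6 -> "16 hours"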