Some of that is very worrying, Miles. Do you have bug IDs for any of those problems?
I'm guessing the problem of the device being reported ok after the reboot could be this one:
http://bugs.opensolaris.org/view_bug.do?bug_id=6582549

And could the errors after the reboot be one of these?
http://bugs.opensolaris.org/view_bug.do?bug_id=6558852
http://bugs.opensolaris.org/view_bug.do?bug_id=6675685

I don't have the same concerns myself that you guys have over massive pools, since we're working at a much smaller scale. But even so, it's no good ZFS having "only resilvers the missing data" as one of its main selling features if it can't be relied upon to do that every time in real-world situations.

Incidentally, even with those resilver bugs, a few back-of-the-envelope calculations make me think this might not be too bad with 10Gb ethernet:

Server size: 28TB
Interconnect speed: 10Gb/s (call it 8Gb/s of actual bandwidth)
Usage: 70% (worst case scenario - pool dies while under heavy load)

That gives us an available resilver bandwidth of 3Gb/s, which I'll divide by two since it has to carry both reads and writes. 28TB at 1.5Gb/s gives a resilver time of around 42 hours, and changing some of the assumptions by dropping pool usage to 20% brings that down to about 16 hours.

It's still a long time, but for a rare disaster recovery scenario on a large pool, I think I could live with it.
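In case anyone wants to play with the assumptions, here's the same arithmetic as a small Python sketch. All the numbers are just my guesses from above, not measurements, and I'm using the raw 10Gb/s link figure since that's what reproduces the estimates I quoted:

    # Back-of-envelope resilver time over a network interconnect.
    # Every input here is an assumption - tweak to match your own pool.

    def resilver_hours(pool_tb, link_gbps, pool_usage):
        """Rough resilver time in hours, limited by spare link bandwidth."""
        spare_gbps = link_gbps * (1 - pool_usage)  # bandwidth not used by clients
        resilver_gbps = spare_gbps / 2             # shared between reads and writes
        pool_gbits = pool_tb * 8 * 1000            # TB -> gigabits (decimal units)
        return pool_gbits / resilver_gbps / 3600   # seconds -> hours

    print(resilver_hours(28, 10, 0.70))  # worst case: ~41.5 hours
    print(resilver_hours(28, 10, 0.20))  # lighter load: ~15.6 hours

Obviously this ignores disk throughput, CPU and protocol overhead, so treat it as a lower bound on effort rather than a prediction.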