On Thu, Apr 08, 2010 at 03:48:54PM -0700, Erik Trimble wrote:
> Well....

To be clear, I don't disagree with you; in fact for a specific part of
the market (at least) and a large part of your commentary, I agree.  I
just think you're overstating the case for the rest.
 
> The problem is (and this isn't just a ZFS issue) that resilver and scrub
> times /are/ very bad for >1TB disks.  This goes directly to the problem
> of redundancy - if you don't really care about resilver/scrub issues,
> then you really shouldn't bother to use Raidz or mirroring.  It's pretty
> much in the same ballpark.

Sure, and that's why you have raidz3 now; also why multi-way mirrors
are getting more attention, as the drives are getting large enough
that capacities and redundancies previously only available via raidz
constructions can now be had with mirrors and a reasonable number of
spindles. 

Large drives (with the constraints you describe) certainly change the
deployment scenarios.  I don't agree that they shouldn't be deployed
at all, ever - which seems to be what you're saying.

Take 6x1TB in raidz2, and replace with 6x2TB arranged as two
three-way mirrors.  Chances are, you've just improved performance for
the same usable space.  I'm just trying to show it's really not all
that black and white.
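To make that arithmetic concrete, here's a quick sketch (my own
illustrative figures, raw TB only, ignoring metadata and slop
overhead):

```python
# Usable capacity: 6x1TB raidz2 vs 6x2TB as two three-way mirrors.
# Raw per-drive sizes only; real pools lose a little more to metadata.

def raidz_usable(ndrives, parity, size_tb):
    # raidz(p) yields (ndrives - parity) drives' worth of data space
    return (ndrives - parity) * size_tb

def mirror_usable(nvdevs, ways, size_tb):
    # each n-way mirror vdev contributes one drive's worth of space,
    # regardless of how many ways it is mirrored
    return nvdevs * size_tb

old = raidz_usable(6, 2, 1.0)   # 6x1TB raidz2  -> 4.0 TB usable
new = mirror_usable(2, 3, 2.0)  # two 3-way mirrors of 2TB -> 4.0 TB usable

print(old, new)  # prints 4.0 4.0
```

Same usable space, but a mirror resilver only has to copy one drive's
worth of data sequentially, and each three-way vdev still survives two
failures.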

As for error rates, this is something zfs should not be afraid
of. Indeed, many of us would be happy to get drives with less internal
ECC overhead and complexity for greater capacity, and tolerate the
resultant higher error rates, specifically for use with zfs (sector
errors, not overall drive failure, of course).  Even if it means I
need raidz4, and wind up with the same overall usable space, I may
prefer the redundancy across drives rather than within.
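Back-of-envelope, the break-even point for that trade is easy to
compute (my own hypothetical numbers - "raidz4" doesn't exist today,
raidz3 is the maximum):

```python
# How much extra per-drive capacity (from lighter internal ECC) would
# make a hypothetical raidz4 match raidz2's usable space on N drives?
# Illustrative arithmetic only.

def breakeven_gain(ndrives, old_parity=2, new_parity=4):
    # usable = (N - parity) * size; equate the two and solve for the
    # per-drive capacity multiplier
    return (ndrives - old_parity) / (ndrives - new_parity)

for n in (8, 12, 24):
    print(n, round(breakeven_gain(n), 3))
# 8 -> 1.5x, 12 -> 1.25x, 24 -> 1.1x: the wider the vdev, the smaller
# the capacity gain needed to justify the extra parity
```

So the trade gets cheaper as vdevs get wider, which is exactly where
the extra cross-drive redundancy is most welcome.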

> That is, >1TB 3.5" drives have such long resilver/scrub times that with
> ZFS, it's a good bet you can kill a second (or third) drive before you
> can scrub or resilver in time to compensate for the already-failed one.
> Put it another way, you get more errors before you have time to fix the
> old ones, which effectively means you now can't fix errors before they
> become permanent. Permanent errors = data loss.

Again, potential zfs improvements could help here:
 - resilver in parallel for multiply redundant vdevs with multiple
   failures/replacements (currently, I think resilver restarts in this
   case?)
 - scrub a (top level) vdev at a time, rather than a whole pool. If I
   know I'm about to replace a drive, perhaps for capacity upgrade,
   I'll scrub first to minimise the chances of tripping over a latent
   error, especially on the previous drive I just replaced. No need to
   scrub other vdevs right now. 
 - scrub/resilver selectively by dataset, to allow higher priority
   data to be given better protection.

> For example, the 2TB 5900RPM 3.5" drives are (on average) over 2x as
> slow as the 1TB 7200RPM 3.5" drives for most operations. Access time is
> slower by 40%, and throughput is slower on by 30-50%.

Please, be fair and compare like with like - say, replacing 5400rpm
1TB drives.  Your same problem would apply if replacing 1TB 7200's
with 1TB 5400's; it has little to do with the capacity.  Indeed, at
the same rpm, the higher density has the potential to be faster.
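A naive model shows why: resilver time is roughly capacity over
sustained throughput, and at the same rpm, doubling areal density
raises linear (and thus sequential) throughput by roughly sqrt(2).
The numbers below are illustrative, not specs for any real drive:

```python
from math import sqrt

# Naive sequential-resilver model: time = capacity / sustained rate.
# Assumed baseline: a 1TB drive streaming at 100 MB/s.

def resilver_hours(capacity_tb, throughput_mbs):
    return capacity_tb * 1e6 / throughput_mbs / 3600

base   = resilver_hours(1.0, 100.0)              # 1TB baseline
denser = resilver_hours(2.0, 100.0 * sqrt(2))    # 2TB, same rpm, 2x density

print(round(base, 2), round(denser, 2), round(denser / base, 2))
# the 2TB drive takes ~1.41x as long, not 2x - capacity outgrows
# throughput, but not linearly
```

So capacity does hurt resilver times, but less than the raw capacity
ratio suggests, and rpm is doing much of the damage in the comparison
above.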

> In any case, resilver/scrub times are becoming the dominant factor in
> reliability of these large drives.

Agreed; I'd argue they have been for some time (i.e., even at the 1TB
size).

> As a practical matter, small setups are for the most part not
> expandable/upgradable much, if at all. Buy what you need now, and plan
> on rebuying something new in 5-10 years, but don't think that what you
> put together now can be continuously upgraded for a decade. 

On this, I agree completely, even on a shorter time-scale (say 3-5
years). On each generation, repurpose the previous generation for
backup or something else as appropriate.  This applies to drives, and
to the boxes that house them.  Even so, leave yourself wiggle room
for upgrades and other unanticipated developments in the meantime where
you can.

--
Dan.
