On 2011-05-16 9:14, Richard Elling wrote:
On May 15, 2011, at 10:18 AM, Jim Klimov <jimkli...@cos.ru> wrote:
Hi! Very interesting suggestions, as I'm contemplating a Supermicro-based server
for my work as well, though probably on a lower budget, as a backup store for an
aging Thumper (not as its superior replacement).
Still, I have a couple of questions regarding your raidz layout recommendation.
On one hand, I've read that as current drives get larger (while their random
IOPS/MBps don't grow nearly as fast with new generations), it is becoming more
and more reasonable to use RAIDZ3 with three redundancy drives, at least for vdevs
made of many disks - a dozen or so. When a drive fails, you still have two
redundant parities, and with a resilver window expected to be in the range of
hours if not days, I would want that airbag, to say the least. You know, failures
rarely come one by one ;)
Not to worry. If you add another level of redundancy, the data protection
is improved by orders of magnitude. If the resilver time increases, the effect
on data protection is reduced by a relatively small divisor. To get some sense
of this, the MTBF is often 1,000,000,000 hours and there are only 24 hours in
a day.
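To put rough numbers on that "orders of magnitude" claim, here is a simplistic sketch of my own (not from the thread): it assumes independent, constant-rate failures and an illustrative 1.2-million-hour MTBF, which is a typical datasheet-style figure rather than anything quoted above.

```python
# Simplistic model: an extra parity level buys orders of magnitude of
# protection, while a longer resilver only costs a roughly linear factor.
# Assumes independent failures at a constant (exponential) rate.
import math
from math import comb

def p_fail_within(hours, mtbf_hours):
    # Probability that one drive fails within `hours`.
    return 1.0 - math.exp(-hours / mtbf_hours)

def p_vdev_loss(n_remaining, parities_left, resilver_hours, mtbf_hours):
    # Probability that more than `parities_left` of the surviving drives
    # fail before the resilver completes (binomial tail).
    p = p_fail_within(resilver_hours, mtbf_hours)
    return sum(comb(n_remaining, k) * p**k * (1 - p)**(n_remaining - k)
               for k in range(parities_left + 1, n_remaining + 1))

# 12-disk vdev with one drive already failed; 1.2M-hour MTBF assumed.
for resilver_hours in (24, 72):
    z2 = p_vdev_loss(11, 1, resilver_hours, 1_200_000)  # raidz2: 1 parity left
    z3 = p_vdev_loss(11, 2, resilver_hours, 1_200_000)  # raidz3: 2 parities left
    print(f"{resilver_hours}h resilver: raidz2 loss ~{z2:.1e}, raidz3 loss ~{z3:.1e}")
```

Tripling the resilver window roughly triples the per-drive failure probability, so the raidz2 figure grows by about 9x and the raidz3 one by about 27x - both changes dwarfed by the gap between the two parity levels.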
If MTBFs were real, we'd never see disks failing within a year ;)
The problem is, these values seem to be determined in an ivory-tower
lab. An expensive-vendor edition of a drive running in a cooled
data center with shock absorbers and other nice features does
often live a lot longer than a similar OEM enterprise or consumer
drive running in an apartment with changing weather around it,
often overheating, and randomly vibrating alongside a dozen other
disks rotating in the same box.
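For comparison, turning a datasheet MTBF into an expected annualized failure rate (AFR) is simple arithmetic. A quick sketch - the 1.2-million-hour input is a typical datasheet value I am assuming here, and published field surveys tend to report noticeably higher real-world AFRs:

```python
# Convert a datasheet MTBF to an annualized failure rate (AFR),
# assuming a constant (exponential) failure rate.
import math

HOURS_PER_YEAR = 24 * 365  # 8760

def afr_from_mtbf(mtbf_hours):
    # AFR = 1 - exp(-t/MTBF) for one year of continuous operation.
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

print(f"{afr_from_mtbf(1_200_000):.2%}")      # typical datasheet MTBF
print(f"{afr_from_mtbf(1_000_000_000):.5%}")  # the quoted 10^9-hour figure
```

Even the modest datasheet figure predicts well under one failure per hundred drive-years, which is hard to square with disks dying inside their warranty period.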
The ramble about expensive-vendor drive editions comes from
my memory of some forum or blog discussion which I can't point
to now either, which suggested that vendors like Sun do not
charge 5x-10x the price of the same label of OEM drive just
for a nice corporate logo stamped onto the disk. Vendors were
said to burn in the drives in their labs for half a year or a
year before putting the survivors on the market. This implies
that some of the drives did not survive the burn-in period, and
indeed the MTBF for the remaining ones is higher, because
"infancy death" due to manufacturing problems soon after
arrival at the end customer is unlikely for these particular
tested devices. The long burn-in times were also said to be
part of the reason why vendors never sell the biggest disks
available on the market (does any vendor sell 3TB under
their own brand yet? Sun-Oracle? IBM? HP?). This may be
disguised as a "certification process" which occasionally
takes about as long - to see whether the newest and greatest
disks die within a year or so.
Another idea implied in that discussion was that the vendors
can influence OEMs in the choice of components - an example
in the thread being different grades of steel for the ball
bearings. Such choices can drive the price up for a reason -
disks like that are more expensive to produce - but they also
increase reliability.
In fact, I've had very few Sun disks break in the boxes
I've managed over 10 years; all I can remember now are
two or three 2.5" 72GB Fujitsus with a Sun brand. Meanwhile,
we have another dozen of those that have been running for
several years. So yes, I can believe that Big Vendor Brand
disks can boast huge MTBFs and back that up with a track
record, and such drives are often replaced not because of
a breakdown, but rather as a precaution, or because of
"moral aging" - low speed and small capacity.
But for the rest of us (like home ZFS users) such MTBF
numbers are as fantastic as the Big Vendor prices, and
unachievable for any number of reasons, starting with the
use of cheaper and potentially worse hardware from the
beginning, and the non-"greenhouse" conditions the
machines run in...
I do have some 5-year-old disks running in computers
daily and still alive, but I have about as many which died
young, sometimes even within the warranty period ;)
On the other hand, I've recently seen many recommendations that in a RAIDZ* drive set,
the number of data disks should be a power of two - so that ZFS blocks/stripes, and those
of its users (like databases) which are inclined to use 2^N-sized blocks, can often be
accessed in a single IO burst across all drives, and not in "one and one-quarter
IOs" on average, which might delay IOs to other stripes while some of the disks
in a vdev are busy processing leftovers of a previous request and others are waiting for
their peers.
I've never heard of this and it doesn't pass the sniff test. Can you cite a
source?
I was trying to find an "authoritative" link today but failed.
I know I've read this many times over the past couple
of months, but it may still be an "urban legend" or even
FUD, retold many times...
In fact, today I came across old posts from Jeff Bonwick
where he explains disk usage and "ZFS striping", which
is not like usual RAID striping. If the architecture remains
similar after half a decade, his post (the "Device Selection"
part) actually disproves my statement above, which I had
repeated after somebody else, recursively:
http://blogs.oracle.com/bonwick/entry/zfs_block_allocation
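For what it's worth, the arithmetic behind the (possibly apocryphal) power-of-two rule is just divisibility of a 2^N-sized record by the data-disk count. A quick sketch - the 128 KiB default recordsize is real, but whether the rule matters at all is exactly what the Bonwick post calls into question:

```python
# A 2^N-sized record splits evenly across the data disks only when
# their count divides it cleanly - the whole basis of the folklore rule.
RECORDSIZE = 128 * 1024  # default ZFS recordsize, in bytes

for data_disks in (4, 5, 6, 8, 9, 10):
    even = RECORDSIZE % data_disks == 0
    print(f"{data_disks:2d} data disks: {RECORDSIZE / data_disks:9.1f} B "
          f"per disk, even split: {even}")
```

Only the power-of-two counts divide evenly; but since ZFS allocates variable-width stripes rather than fixed RAID stripes, this divisibility argument may simply not apply in practice.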
So, protect your data and if the performance doesn't meet your expectation, then
you can make adjustments.
That would be too late - the pool would have to be torn down and
remade with another layout, which would require lots of downtime
and extra space for backups ;)
--
+============================================================+
| |
| Климов Евгений, Jim Klimov |
| технический директор CTO |
| ЗАО "ЦОС и ВТ" JSC COS&HT |
| |
| +7-903-7705859 (cellular) mailto:jimkli...@cos.ru |
| CC:ad...@cos.ru,jimkli...@mail.ru |
+============================================================+
| () ascii ribbon campaign - against html mail |
| /\ - against microsoft attachments |
+============================================================+
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss