On Tue, Mar 20, 2012 at 11:16 PM, MLR <maillistread...@gmail.com> wrote:

>  1. Cache device for L2ARC
>     Say we get a decent ssd, ~500MB/s read/write. If we have a 20 HDD zpool
> setup shouldn't we be reading at least at the 500MB/s read/write range? Why
> would we want a ~500MB/s cache?

    Without knowing the I/O pattern, saying 500 MB/sec. is
meaningless. Achieving 500 MB/sec. with 8 KB files and lots of random
accesses is really hard, even with 20 HDDs. Achieving 500 MB/sec. of
sequential streaming of 100 MB+ files is much easier. An SSD is about
as fast on random I/O as on sequential I/O (unlike an HDD), and about
as fast on small I/O as on large I/O. Due to its COW design, once a
file is _changed_, ZFS no longer accesses it strictly sequentially. If
the files are written once and never changed, then they _may_ be laid
out sequentially on disk.
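
    As a rough back of the envelope example (assuming ~150 random
I/Ops per 7200 RPM drive, which is generous, and ~100 MB/sec. of
streaming per drive):

    20 drives x 150 I/Ops x 8 KB = ~24 MB/sec. of random reads, at best
    20 drives x 100 MB/sec.      = ~2 GB/sec. of raw sequential reads

Same 20 spindles, nearly two orders of magnitude apart depending on
the access pattern. That gap is where a ~500 MB/sec. SSD cache earns
its keep.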

    An important point to remember about the ARC / L2ARC is that they
are ADAPTIVE. The amount of space used by the ARC grows as ZFS reads
data and shrinks as other processes need memory. I also suspect that
data eventually ages out of the ARC. The L2ARC is (mostly) just an
extension of the ARC, except that it does not have to give up capacity
as other processes need more memory.
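
    If you do add a cache device, you can watch how much of it the
L2ARC is actually using with zpool iostat (the pool name "tank" here
is just an example):

    # zpool iostat -v tank 60

The cache device shows up in its own section at the bottom of the
output, with how much of it is allocated and how much read traffic it
is absorbing.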

>  2. ZFS dynamically strips along the top-most vdev's and that "performance 
> for 1
> vdev is equivalent to performance of one drive in that group". Am I correct in
> thinking this means, for example, I have a single 14 disk raidz2 vdev zpool, 
> the
> disks will go ~100MB/s each ,

   Assuming the disks will do 100MB/sec. for your data :-)

> this zpool would theoretically read/write at
> ~100MB/s max (how about real world average?)?

    Yes. In a RAIDz<n> vdev, when a write is dispatched, _all_ the
drives must complete their portion before the write is complete. All
the drives in the vdev are written to in parallel. This is (or should
be) the case for _any_ RAID scheme, including RAID1 (mirroring). If a
zpool has more than one vdev, then writes are distributed among the
vdevs based on a number of factors (which others are _much_ more
qualified to discuss).
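
    As a sketch, a pool built as two 6-disk raidz2 vdevs (device names
are made up, substitute your own):

    # zpool create tank \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
        raidz2 c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0

Writes get spread across the two vdevs, so the pool sees roughly twice
the per-vdev throughput, while any single write still has to touch
every drive in whichever vdev it lands on.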

    For ZFS, performance is proportional to the number of vdevs NOT
the number of drives or the number of drives per vdev. See
https://docs.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc
for some testing I did a while back. I did not test sequential read as
that is not part of our workload.

> If this was RAID6 I think this
> would go theoretically ~1.4GB/s, but in real life I am thinking ~1GB/s (aka 
> 10x-
> 14x faster than zfs, and both provide the same amount of redundancy)? Is my
> thinking off in the RAID6 or RAIDZ2 numbers? Why doesn't ZFS try to 
> dynamically
> strip inside vdevs (and if it is, is there an easy to understand explanation 
> why
> a vdev doesn't read from multiple drives at once when requesting data, or why 
> a
> zpool wouldn't make N number of requests to a vdev with N being the number of
> disks in that vdev)?

    I understand why the read performance scales with the number of
vdevs, but I have never really understood _why_ it does not also scale
with the number of drives in each vdev. When I did my testing with 40
drives, I expected similar READ performance regardless of the layout,
but that was NOT the case.

> Since "performance for 1 vdev is equivalent to performance of one drive in 
> that
> group" it seems like the higher raidzN are not very useful. If your using 
> raidzN
> your probably looking for a lower than mirroring parity (aka 10%-33%), but if
> you try to use raidz3 with 15% parity your putting 20 HDDs in 1 vdev which is
> terrible (almost unimaginable) if your running at 1/20 the "ideal" 
> performance.

    The recommendation is to not go over 8 or so drives per vdev, but
that is a performance issue NOT a reliability one. I have also not
been able to duplicate others' observations that 2^N drives per vdev is
a magic number (4, 8, 16, etc.). As you can see from the above, even a
40 drive vdev works and is reliable, just (relatively) slow :-)

> Main Question:
>  3. I am updating my old RAID5 and want to reuse my old drives. I have 8 1.5TB
> drives and buying new 3TB drives to fill up the rest of a 20 disk enclosure
> (Norco RPC-4220); there is also 1 spare, plus the bootdrive so 22 total. I 
> want
> around 20%-25% parity. My system is like so:

    Is the enclosure just a JBOD? If it is not, can it present drives
directly? If you cannot get at the drives individually, then the rest
of the discussion is largely moot.

    You are buying 3TB drives, so by definition you are NOT looking for
performance or reliability but capacity. What is the uncorrectable
error rate on these 3TB drives? What is the real random I/Ops
capability of these 3TB drives? I am not trying to be mean here, but I
would hate to see you put a ton of effort into this and then be
disappointed with the result due to a poor choice of hardware.

> Main Application: Home NAS
> * Like to optimize max space with 20%(ideal) or 25% parity - would like 
> 'decent'
> reading performance
>  - 'decent' being max of 10GigE Ethernet, right now it is only 1 gigabit 
> Ethernet but hope to leave room to update in future if 10GigE becomes cheaper.

    1,250 MB/sec. of random I/O (assuming small files) is far from
trivial to achieve and is way more than "decent"... On my home network
I see 30MB/sec of large file traffic per client, and I rarely have
more than one client doing lots of I/O at a time.
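
    For scale: 10 GigE is roughly 1,250 MB/sec. of wire rate. At the
~30 MB/sec. per client I see for large files, that is on the order of
40 clients all streaming at once before the network, not the pool,
becomes the limit.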

    How much space do you _need_, including reasonable growth?

> My RAID5 runs at ~500MB/s so was hoping to get at least above that with the 20
> disk raid.

    How did you measure this?
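
    If it came from a simple sequential test, something like this
(file larger than RAM so caching does not skew the number, the path is
just an example):

    # dd if=/dev/zero of=/raid5/testfile bs=1024k count=32768

then it only tells you about large sequential writes, not the random
small-file case discussed above.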

> * 16GB RAM

    What OS? I have a 16 CPU Solaris 10 SPARC server with 16 GB of RAM
serving up 20TB of random small files. The ARC uses between 8 and
10 GB with between 1 and 2 GB free. But our users are generally
accessing less than 3 TB of data at a time.
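
    On Solaris you can see what the ARC is actually using via kstat
(values are in bytes):

    # kstat -p zfs:0:arcstats:size
    # kstat -p zfs:0:arcstats:c_max

"size" is the current ARC size and "c_max" is the ceiling it is
allowed to grow to.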

> * Open to using ZIL/L2ARC, but, left out for now: writing doesn't occur much
> (~7GB a week, maybe a big burst every couple months), and don't really read 
> same
> data multiple times.

    A dedicated ZIL device (slog) helps sync write performance (e.g. NFS).
    An L2ARC device gives you more ARC space, which helps all reads.
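
    Both can be added to an existing pool later (and a cache device
can be removed again), so you do not have to decide up front. A sketch
with made-up device names:

    # zpool add tank log c2t0d0       (dedicated log device / slog)
    # zpool add tank cache c2t1d0     (L2ARC cache device)

Given the ~7GB/week of writes and little re-reading you describe,
leaving them out for now seems reasonable.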

> What would be the best setup? I'm thinking one of the following:
>    a. 1vdev of 8 1.5TB disks (raidz2). 1vdev of 12 3TB disks (raidz3)?
> (~200MB/s reading, best reliability)
>    b. 1vdev of 8 1.5TB disks (raidz2). 3vdev of 4 3TB disks (raidz)? (~400MB/s
> reading, evens out size across vdevs)
>    c. 2vdev of 4 1.5TB disks (raidz). 3vdev of 4 3TB disks (raidz)? (~500MB/s
> reading, maximize vdevs for performance)

    With the eight 1.5TB drives you can:
1 x 8 (raidz<n>) == worst performance
2 x 4 (raidz<n>) == better performance
    if raidz2, then capacity is the same as mirror but has better reliability
4 x 2 (mirror) == best performance
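
    As a concrete sketch of the 2 x 4 (raidz2) layout for the eight
1.5TB drives (pool and device names are made up):

    # zpool create old15 \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
        raidz2 c1t4d0 c1t5d0 c1t6d0 c1t7d0

Two vdevs worth of performance, and any two drives in each vdev can
fail.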

    With the twelve 3TB drives you can:
1 x 12 (raidz<n>) == worst performance
2 x 6 (raidz<n>) == better performance
3 x 4 (raidz<n>) == better performance
4 x 3 (mirror) == best performance
6 x 2 (mirror) == almost best performance
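
    And a sketch of the 3 x 4 (raidz) layout for the twelve 3TB
drives; if the spare you mentioned is a 3TB drive it can go in as a
hot spare (again, names are made up):

    # zpool create big3t \
        raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 \
        raidz c2t4d0 c2t5d0 c2t6d0 c2t7d0 \
        raidz c2t8d0 c2t9d0 c2t10d0 c2t11d0 \
        spare c2t12d0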

    I agree with Jim that you should keep the 1.5TB and the 3TB drives
in separate zpools, although you _can_ partition each 3TB drive to
look like two 1.5TB drives: group the first partition on each 3TB
drive with the 1.5TB drives, and use the second partitions as a second
zpool. There are caveats with doing that, but it may fit your needs...

    With 20 logical 1.5TB drives you can:
1 x 20 (raidz<n>) == bad performance, don't do this :-)
2 x 10 (raidz<n>) == better
3 x 6 + 2 hot spare (raidz<n>)
4 x 5 (raidz<n>)
6 x 3 + 2 hot spare (mirror)
9 x 2 + 2 hot spare (mirror)

    Plus another 12 logical 1.5TB drives:
1 x 12 (raidz<n>) == worst performance
2 x 6 (raidz<n>) == better performance
3 x 4 (raidz<n>) == better performance
4 x 3 (mirror) == best performance
6 x 2 (mirror) == almost best performance

    If you have the time, set up each configuration and _measure_ the
performance. If you can, load up a bunch of data (at least 33% full)
and then trigger a scrub to get a feel for how long a resilver would
take. Remember here that you are looking for _relative_ measures
(unless you have a performance goal you need to hit).
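
    Something like this, with "tank" again standing in for whichever
pool you are testing:

    # zpool scrub tank
    # zpool status tank

zpool status reports the scrub's progress and, once it completes, how
long it took, which gives you a rough feel for how long a resilver of
that layout would take.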

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players