On Thu, 11 Oct 2012, Freddie Cash wrote:

On Thu, Oct 11, 2012 at 2:47 PM, andy thomas <a...@time-domain.co.uk> wrote:
According to a Sun document called something like 'ZFS best practice' I read
some time ago, best practice was to use the entire disk for ZFS and not to
partition or slice it in any way. Does this advice hold good for FreeBSD as
well?

Solaris only enables the disk's write cache when ZFS is given the whole
disk; with slices/partitions it is left disabled, hence the
recommendation to always use the entire disk with ZFS.

FreeBSD's GEOM architecture allows the disk cache to be enabled
whether you use the full disk or partition it.

Personally, I find it nicer to use GPT partitions on the disk.  That
way, you can start the partition at 1 MB ("gpart add -b 2048" on
512-byte-sector disks, or "gpart add -b 256" on 4K-sector disks),
leave a little wiggle-room at the end of the disk, and use GPT labels
to identify the disk (using gpt/label-name for the device when adding
it to the pool).
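A minimal sketch of that layout, assuming a 512-byte-sector disk; the
device (da0), the label names and the pool name 'tank' are only
placeholders here:

        # create a GPT scheme on the disk
        gpart create -s gpt da0

        # swap first, starting at 1 MB (2048 x 512-byte sectors), then a
        # ZFS partition sized to leave a little free space at the end
        gpart add -b 2048 -s 4G -t freebsd-swap -l swap0 da0
        gpart add -s 929G -t freebsd-zfs -l disk0 da0

        # refer to the GPT label, not the raw device, when building the pool
        zpool create tank gpt/disk0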

This is apparently what had been done in this case:

        gpart add -b 34 -s 6000000 -t freebsd-swap da0
        gpart add -b 6000034 -s 1947525101 -t freebsd-zfs da1
        gpart show

(stuff relating to a compact flash/SATA boot disk deleted)

=>        34  1953525101  da0  GPT  (932G)
          34     6000000    1  freebsd-swap  (2.9G)
     6000034  1947525101    2  freebsd-zfs  (929G)

=>        34  1953525101  da2  GPT  (932G)
          34     6000000    1  freebsd-swap  (2.9G)
     6000034  1947525101    2  freebsd-zfs  (929G)

=>        34  1953525101  da1  GPT  (932G)
          34     6000000    1  freebsd-swap  (2.9G)
     6000034  1947525101    2  freebsd-zfs  (929G)


Is this a good scheme? The server has 12 GB of memory (upped from 4 GB last
year after it kept crashing with out-of-memory reports on the console), so I
doubt the swap is actually used very often. Running Bonnie++ on this pool
comes up with some very good sequential throughput figures, but the latency
of over 43 seconds for sequential block writes is terrible and is obviously
hurting its performance as a mail server, as shown here:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hsl-main.hsl.of 24G    63  67 80584  20 70568  17   314  98 554226  60 410.1  13
Latency             77140us   43145ms   28872ms     171ms     212ms     232ms
Version  1.96       ------Sequential Create------ --------Random Create--------
hsl-main.hsl.office -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 19261  93 +++++ +++ 18491  97 21542  92 +++++ +++ 20691  94
Latency             15399us     488us     226us   27733us     103us     138us
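For reference, output like the above typically comes from a run along these
lines; the directory is just a placeholder, and the 24 GB file size is twice
the machine's 12 GB of RAM, which Bonnie++ wants so that the ARC can't mask
the disks:

        # run as root, writing the test files on the pool being measured
        bonnie++ -d /pool/tmp -s 24g -u root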


The other issue with this server is that it needs to be rebooted every 8-10
weeks: disk I/O slows to a crawl over time until the server becomes unusable,
and after a reboot it's fine again. I'm told ZFS v13 on FreeBSD 8.0 has a lot
of problems, so I was planning to rebuild the server with FreeBSD 9.0 and
ZFS v28, but I didn't want to make any basic design mistakes in doing so.
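If you do go to 9.0, an existing pool can be brought up to the newer on-disk
version in place after it has been imported; roughly, with 'tank' standing in
for the real pool name:

        # show the pool versions this kernel supports, and which pools
        # are still on an older version
        zpool upgrade -v
        zpool upgrade

        # upgrade a pool to the newest version the running kernel supports
        # (one-way: an upgraded pool can't go back to FreeBSD 8.0)
        zpool upgrade tank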

Another point about the Sun ZFS paper: it mentioned that optimum performance
would be obtained with RAIDZ pools if the number of disks was between 3 and
9. So I've always limited my pools to a maximum of 9 active disks plus
spares, but the other day someone here was talking about seeing hundreds of
disks in a single pool! So what is the current advice for ZFS on Solaris and
FreeBSD?

You can have multiple disks in a vdev, and you can have multiple vdevs
in a pool.  Thus, you can have hundreds of disks in a pool.  :)  Just
split the disks up into multiple vdevs, with each vdev kept under 9
disks.  :)  For example, we have 25 disks in the following pool, but
only 6 disks in each vdev (plus log/cache):


[root@alphadrive ~]# zpool list -v
NAME                        SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
storage                    24.5T  20.7T  3.76T    84%  3.88x  DEGRADED  -
 raidz2                   8.12T  6.78T  1.34T         -
   gpt/disk-a1                -      -      -         -
   gpt/disk-a2                -      -      -         -
   gpt/disk-a3                -      -      -         -
   gpt/disk-a4                -      -      -         -
   gpt/disk-a5                -      -      -         -
   gpt/disk-a6                -      -      -         -
 raidz2                   5.44T  4.57T   888G         -
   gpt/disk-b1                -      -      -         -
   gpt/disk-b2                -      -      -         -
   gpt/disk-b3                -      -      -         -
   gpt/disk-b4                -      -      -         -
   gpt/disk-b5                -      -      -         -
   gpt/disk-b6                -      -      -         -
 raidz2                   5.44T  4.60T   863G         -
   gpt/disk-c1                -      -      -         -
   replacing                  -      -      -      932G
     6255083481182904200      -      -      -         -
     gpt/disk-c2              -      -      -         -
   gpt/disk-c3                -      -      -         -
   gpt/disk-c4                -      -      -         -
   gpt/disk-c5                -      -      -         -
   gpt/disk-c6                -      -      -         -
 raidz2                   5.45T  4.75T   720G         -
   gpt/disk-d1                -      -      -         -
   gpt/disk-d2                -      -      -         -
   gpt/disk-d3                -      -      -         -
   gpt/disk-d4                -      -      -         -
   gpt/disk-d5                -      -      -         -
   gpt/disk-d6                -      -      -         -
 gpt/log                  1.98G   460K  1.98G         -
cache                          -      -      -      -      -      -
 gpt/cache1               32.0G  32.0G     8M         -

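A layout like that is built up one vdev at a time; a rough sketch using the
same kind of GPT label names as above (the names are only placeholders):

        # create the pool with the first 6-disk raidz2 vdev
        zpool create storage raidz2 gpt/disk-a1 gpt/disk-a2 gpt/disk-a3 \
            gpt/disk-a4 gpt/disk-a5 gpt/disk-a6

        # grow it by adding further raidz2 vdevs; ZFS stripes writes
        # across all the vdevs in the pool
        zpool add storage raidz2 gpt/disk-b1 gpt/disk-b2 gpt/disk-b3 \
            gpt/disk-b4 gpt/disk-b5 gpt/disk-b6

        # separate log and cache devices
        zpool add storage log gpt/log
        zpool add storage cache gpt/cache1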

Great, thanks for the explanation! I didn't realise you could have a sort of 'stacked pyramid' vdev/pool structure.

Andy