On Thu, 11 Oct 2012, Freddie Cash wrote:
On Thu, Oct 11, 2012 at 2:47 PM, andy thomas <a...@time-domain.co.uk> wrote:
According to a Sun document I read some time ago, called something like 'ZFS Best Practices', the recommendation was to give ZFS the entire disk and not to partition or slice it in any way. Does this advice still hold for FreeBSD as well?
On Solaris, ZFS only enables the disk's write cache when it is given the whole disk; hand it a slice and the write cache stays off, hence the recommendation to always use the entire disk with ZFS.
FreeBSD's GEOM architecture leaves the disk cache enabled whether you use the full disk or partition it.
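For what it's worth, on FreeBSD you can check the write-cache state of a SCSI/SAS disk via its caching mode page; a rough example (da0 is only a placeholder device, and a SATA disk behind ahci(4) would want "camcontrol identify ada0" instead):

  # Dump the SCSI caching control mode page; WCE=1 means the write
  # cache is enabled (device name is just a placeholder).
  camcontrol modepage da0 -m 0x08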
Personally, I find it nicer to use GPT partitions on the disk. That
way, you can start the partition at 1 MB ("gpart add -b 2048" on
512-byte-sector disks, or "gpart add -b 256" on native 4K-sector
disks), leave a little wiggle-room at the end of the disk, and use GPT
labels to identify the disk (using gpt/label-name for the device when
adding it to the pool).
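As a rough sketch of that kind of layout (the disk name, sizes and labels here are all made up; adjust to your own disks and sector size):

  # Hypothetical single-disk layout; da0, the sizes and the labels are placeholders.
  gpart create -s gpt da0
  # 1 MB-aligned swap partition with a GPT label
  gpart add -t freebsd-swap -a 1m -s 4g -l swap0 da0
  # ZFS partition sized slightly under the remaining space, leaving
  # a little wiggle-room at the end of the disk
  gpart add -t freebsd-zfs -a 1m -s 925g -l disk0 da0
  # Refer to the partition by its label when creating or adding to a pool
  zpool create tank gpt/disk0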
This is apparently what had been done in this case:
gpart add -b 34 -s 6000000 -t freebsd-swap da0
gpart add -b 6000034 -s 1947525101 -t freebsd-zfs da0
(the same two commands were evidently repeated for da1 and da2)
gpart show
(stuff relating to a compact flash/SATA boot disk deleted)
=>      34  1953525101  da0  GPT  (932G)
        34     6000000    1  freebsd-swap  (2.9G)
   6000034  1947525101    2  freebsd-zfs  (929G)

=>      34  1953525101  da2  GPT  (932G)
        34     6000000    1  freebsd-swap  (2.9G)
   6000034  1947525101    2  freebsd-zfs  (929G)

=>      34  1953525101  da1  GPT  (932G)
        34     6000000    1  freebsd-swap  (2.9G)
   6000034  1947525101    2  freebsd-zfs  (929G)
Is this a good scheme? The server has 12 GB of memory (upped from 4 GB last year after it kept crashing with out-of-memory reports on the console screen), so I doubt the swap would actually be used very often. Running Bonnie++ on this pool comes up with some very good sequential throughput figures, but the latency of over 43 seconds for sequential block writes is terrible and is obviously hurting its performance as a mail server, as shown here:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hsl-main.hsl.of 24G    63  67 80584  20 70568  17   314  98 554226 60 410.1  13
Latency             77140us   43145ms   28872ms     171ms     212ms     232ms
Version  1.96       ------Sequential Create------ --------Random Create--------
hsl-main.hsl.office -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 19261  93 +++++ +++ 18491  97 21542  92 +++++ +++ 20691  94
Latency             15399us     488us     226us   27733us     103us     138us
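(For reference, figures like these typically come from an invocation along the following lines; the directory is hypothetical, and the -s size should be at least twice RAM, which matches the 24G shown above.)

  # Hypothetical run: -d picks a directory on the pool, -s the test size
  # (here 2x the 12 GB of RAM), -u the user to run as when started as root.
  bonnie++ -d /pool/tmp -s 24g -u root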
The other issue with this server is that it needs to be rebooted every 8-10 weeks: disk I/O slows to a crawl over time until the server becomes unusable, and after a reboot it's fine again. I'm told ZFS v13 on FreeBSD 8.0 has a lot of problems, so I was planning to rebuild the server with FreeBSD 9.0 and ZFS v28, but I didn't want to make any basic design mistakes in doing so.
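(As an aside, a common workaround for gradual slowdowns on FreeBSD 8.x with ZFS v13 was to cap the ARC in /boot/loader.conf; the value below is only an illustration for a 12 GB machine, not a recommendation.)

  # /boot/loader.conf -- illustrative value only
  vfs.zfs.arc_max="8G"   # cap the ARC so a few GB of the 12 GB stay free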
Another point about the Sun ZFS paper: it mentioned that optimum performance would be obtained with RAIDZ pools if the number of disks was between 3 and 9. So I've always limited my pools to a maximum of 9 active disks plus spares, but the other day someone here was talking about seeing hundreds of disks in a single pool! So what is the current advice for ZFS on Solaris and FreeBSD?
You can have multiple disks in a vdev, and you can have multiple vdevs in
a pool. Thus, you can have hundreds of disks in a pool. :) Just
split the disks up into multiple vdevs, where each vdev is under 9
disks. :) For example, we have 25 disks in the following pool,
but only 6 disks in each vdev (plus log/cache):
[root@alphadrive ~]# zpool list -v
NAME                       SIZE  ALLOC   FREE    CAP  DEDUP    HEALTH  ALTROOT
storage                   24.5T  20.7T  3.76T    84%  3.88x  DEGRADED  -
  raidz2                  8.12T  6.78T  1.34T      -
    gpt/disk-a1               -      -      -      -
    gpt/disk-a2               -      -      -      -
    gpt/disk-a3               -      -      -      -
    gpt/disk-a4               -      -      -      -
    gpt/disk-a5               -      -      -      -
    gpt/disk-a6               -      -      -      -
  raidz2                  5.44T  4.57T   888G      -
    gpt/disk-b1               -      -      -      -
    gpt/disk-b2               -      -      -      -
    gpt/disk-b3               -      -      -      -
    gpt/disk-b4               -      -      -      -
    gpt/disk-b5               -      -      -      -
    gpt/disk-b6               -      -      -      -
  raidz2                  5.44T  4.60T   863G      -
    gpt/disk-c1               -      -      -      -
    replacing                 -      -      -   932G
      6255083481182904200    -      -      -      -
      gpt/disk-c2             -      -      -      -
    gpt/disk-c3               -      -      -      -
    gpt/disk-c4               -      -      -      -
    gpt/disk-c5               -      -      -      -
    gpt/disk-c6               -      -      -      -
  raidz2                  5.45T  4.75T   720G      -
    gpt/disk-d1               -      -      -      -
    gpt/disk-d2               -      -      -      -
    gpt/disk-d3               -      -      -      -
    gpt/disk-d4               -      -      -      -
    gpt/disk-d5               -      -      -      -
    gpt/disk-d6               -      -      -      -
  gpt/log                 1.98G   460K  1.98G      -
cache                         -      -      -      -      -  -
  gpt/cache1              32.0G  32.0G     8M      -
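A minimal sketch of how a pool like that might be put together, reusing the label names from the listing (illustrative only, not the commands actually used):

  # Start with two 6-disk raidz2 vdevs in one pool...
  zpool create storage \
      raidz2 gpt/disk-a1 gpt/disk-a2 gpt/disk-a3 gpt/disk-a4 gpt/disk-a5 gpt/disk-a6 \
      raidz2 gpt/disk-b1 gpt/disk-b2 gpt/disk-b3 gpt/disk-b4 gpt/disk-b5 gpt/disk-b6
  # ...then grow the pool later by adding more vdevs, plus log and cache devices
  zpool add storage raidz2 gpt/disk-c1 gpt/disk-c2 gpt/disk-c3 gpt/disk-c4 gpt/disk-c5 gpt/disk-c6
  zpool add storage raidz2 gpt/disk-d1 gpt/disk-d2 gpt/disk-d3 gpt/disk-d4 gpt/disk-d5 gpt/disk-d6
  zpool add storage log gpt/log
  zpool add storage cache gpt/cache1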
Great, thanks for the explanation! I didn't realise you could have a sort
of 'stacked pyramid' vdev/pool structure.
Andy
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss