On 29/11/2018 13:44, Olaf Meeuwissen wrote:
> Q: Isn't there some filesystem type that supports settings at a more
>    granular level than the device?  Like directory or per file?
> A: Eh ... Don't know.  Haven't checked ...
> Solution: Go fish!
>
> # I haven't gone fishing yet but a vague recollection of Roger's post
> # where he mentioned ZFS seemed promising ...

With ZFS you could set them at the directory level if you wanted, by creating a filesystem per directory. But that might get a bit unmanageable, so doing it in a coarser-grained way is usually sufficient. Let me show some examples.
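
For instance, something like this (just a sketch; the pool name "tank" and the paths are invented for illustration):

    % zfs create -o mountpoint=/srv/www tank/www              # one dataset per directory
    % zfs create -o exec=off -o setuid=off tank/www/uploads   # children inherit the mountpoint
    % zfs create -o compression=gzip-9 tank/www/logs          # different options per directory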

Firstly, this is an example of a FreeBSD 11.2 ZFS NAS with a few terabytes of HDD storage. There's a dedicated system pool, plus this data pool:

    % zpool status red
      pool: red
     state: ONLINE
      scan: scrub repaired 0 in 4h12m with 0 errors on Wed Nov 28 19:51:31 2018
    config:

            NAME                         STATE     READ WRITE CKSUM
            red                          ONLINE       0     0     0
              mirror-0                   ONLINE       0     0     0
                gpt/zfs-b1-WCC4M0EYCTAZ  ONLINE       0     0     0
                gpt/zfs-b2-WCC4M5PV83PY  ONLINE       0     0     0
              mirror-1                   ONLINE       0     0     0
                gpt/zfs-b3-WCC4N7FLJD34  ONLINE       0     0     0
                gpt/zfs-b4-WCC4N4UDKN8F  ONLINE       0     0     0
            logs
              mirror-2                   ONLINE       0     0     0
                gpt/zfs-a1-B492480446    ONLINE       0     0     0
                gpt/zfs-a2-B492480406    ONLINE       0     0     0

    errors: No known data errors

This is arranged as a pair of mirror vdevs, referenced by GPT labels. Each mirror vdev is effectively RAID1, and writes are striped across them. It's not /actually/ striping; it's cleverer than that, biasing the writes to balance throughput and free space across all the vdevs, but the effect is similar. The last vdev is a "log" device made of a pair of mirrored SSDs. This "ZIL" is basically a fast write cache. I could add an SSD as an L2ARC read cache as well, but it's not necessary for this system's workload. No errors have occurred on any of the devices in the pool.
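
For reference, a pool of roughly this shape could be created with something like the following (a sketch only; the device labels here are placeholders, not the real GPT labels above):

    % zpool create red \
          mirror gpt/zfs-b1 gpt/zfs-b2 \
          mirror gpt/zfs-b3 gpt/zfs-b4 \
          log mirror gpt/zfs-a1 gpt/zfs-a2
    % zpool add red cache gpt/zfs-a3    # optional L2ARC, if the workload warranted it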

All filesystems allocate storage from this pool.  Here are a few:

    % zfs list -r red | head -n6
    NAME                 USED  AVAIL  REFER  MOUNTPOINT
    red                 2.57T  1.82T   104K  /red
    red/bhyve            872K  1.82T    88K  /red/bhyve
    red/data             718G  1.82T   712G  /export/data
    red/distfiles       11.9G  1.82T  11.8G  /red/distfiles
    red/home             285G  1.82T    96K  /export/home

There are actually 74 filesystems in this pool! Because the storage is shared, there's no limit on how many you can have. So unlike traditional partitioning, or even LVM (without a thin pool), it's massively more flexible. You can create, snapshot and destroy datasets on a whim, and even delegate administration of parts of the tree to different users and groups. So users could create datasets under their home directory, snapshot them and send them to and from other systems. You can organise your data into whatever filesystem structure makes sense.
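
To give a flavour of how cheap this is (a sketch; the dataset names, the user "alice", the remote host "nas" and its pool "tank" are all made up):

    % zfs create red/scratch                        # a new filesystem, instantly
    % zfs snapshot red/scratch@before-experiment    # cheap point-in-time snapshot
    % zfs destroy -r red/scratch                    # and gone again
    % zfs allow alice create,mount,snapshot red/home/alice    # delegate a subtree
    % zfs send red/home/alice@backup | ssh nas zfs recv tank/alice   # replicate it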

Let's look at the pool used on the Linux system I'm writing this email on:

    % zpool status rpool
      pool: rpool
     state: ONLINE
      scan: scrub repaired 0B in 0h5m with 0 errors on Mon Nov 26 17:15:55 2018
    config:

            NAME        STATE     READ WRITE CKSUM
            rpool       ONLINE       0     0     0
              sda2      ONLINE       0     0     0

    errors: No known data errors

This is a single-SSD "root pool" for the operating system; the data lives in other pools not shown here.

    % zfs list -r rpool
    NAME                 USED  AVAIL  REFER  MOUNTPOINT
    rpool               45.4G  62.2G    96K  none
    rpool/ROOT          14.7G  62.2G    96K  none
    rpool/ROOT/default  14.7G  62.2G  12.0G  /
    rpool/home          3.18M  62.2G    96K  none
    rpool/home/root     3.08M  62.2G  3.08M  /root
    rpool/swap          8.50G  63.2G  7.51G  -
    rpool/var           5.72G  62.2G    96K  none
    rpool/var/cache     5.32G  62.2G  5.18G  /var/cache
    rpool/var/log        398M  62.2G   394M  /var/log
    rpool/var/spool     8.14M  62.2G  8.02M  /var/spool
    rpool/var/tmp        312K  62.2G   152K  /var/tmp

These datasets comprise the entire operating system (I've omitted some third-party software package datasets from rpool/opt).

    % zfs list -t snapshot -r rpool
    NAME                             USED  AVAIL  REFER  MOUNTPOINT
    rpool@cosmic-post                  0B      -    96K  -
    rpool/ROOT@cosmic-post             0B      -    96K  -
    rpool/ROOT/default@cosmic-post  2.66G      -  12.3G  -
    rpool/home@cosmic-post             0B      -    96K  -
    rpool/home/root@cosmic-post        0B      -  3.08M  -
    rpool/var@cosmic-post              0B      -    96K  -
    rpool/var/cache@cosmic-post      148M      -  5.06G  -
    rpool/var/log@cosmic-post       3.99M      -   337M  -
    rpool/var/spool@cosmic-post      128K      -  7.79M  -
    rpool/var/tmp@cosmic-post        160K      -   192K  -

These are snapshots which would permit rollback after a recent upgrade [as you can see, this particular system is Ubuntu; I've not yet tried out ZFS on Devuan].
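
If the upgrade had gone badly, recovery would be something along these lines (a sketch, not a transcript; in practice you'd roll the root dataset back from a rescue shell or the initramfs rather than while it's the live root):

    % zfs rollback rpool/ROOT/default@cosmic-post   # put / back as it was pre-upgrade
    % zfs rollback rpool/var/cache@cosmic-post      # and any other datasets you care about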

Each dataset has particular properties associated with it. These are the properties for the root filesystem:

    % zfs get all rpool/ROOT/default
    NAME                PROPERTY              VALUE                  SOURCE
    rpool/ROOT/default  type                  filesystem             -
    rpool/ROOT/default  creation              Sun Jun 12 10:46 2016  -
    rpool/ROOT/default  used                  14.7G                  -
    rpool/ROOT/default  available             62.2G                  -
    rpool/ROOT/default  referenced            12.0G                  -
    rpool/ROOT/default  compressratio         1.63x                  -
    rpool/ROOT/default  mounted               yes                    -
    rpool/ROOT/default  quota                 none                   default
    rpool/ROOT/default  reservation           none                   default
    rpool/ROOT/default  recordsize            128K                   default
    rpool/ROOT/default  mountpoint            /                      local
    rpool/ROOT/default  sharenfs              off                    default
    rpool/ROOT/default  checksum              on                     default
    rpool/ROOT/default  compression           lz4                    inherited from rpool
    rpool/ROOT/default  atime                 off                    inherited from rpool
    rpool/ROOT/default  devices               off                    inherited from rpool
    rpool/ROOT/default  exec                  on                     default
    rpool/ROOT/default  setuid                on                     default
    rpool/ROOT/default  readonly              off                    default
    rpool/ROOT/default  zoned                 off                    default
    rpool/ROOT/default  snapdir               hidden                 default
    rpool/ROOT/default  aclinherit            restricted             default
    rpool/ROOT/default  createtxg             15                     -
    rpool/ROOT/default  canmount              on                     default
    rpool/ROOT/default  xattr                 on                     default
    rpool/ROOT/default  copies                1                      default
    rpool/ROOT/default  version               5                      -
    rpool/ROOT/default  utf8only              on                     -
    rpool/ROOT/default  normalization         formD                  -
    rpool/ROOT/default  casesensitivity       sensitive              -
    rpool/ROOT/default  vscan                 off                    default
    rpool/ROOT/default  nbmand                off                    default
    rpool/ROOT/default  sharesmb              off                    default
    rpool/ROOT/default  refquota              none                   default
    rpool/ROOT/default  refreservation        none                   default
    rpool/ROOT/default  guid                  3867409876204186651    -
    rpool/ROOT/default  primarycache          all                    default
    rpool/ROOT/default  secondarycache        all                    default
    rpool/ROOT/default  usedbysnapshots       2.66G                  -
    rpool/ROOT/default  usedbydataset         12.0G                  -
    rpool/ROOT/default  usedbychildren        0B                     -
    rpool/ROOT/default  usedbyrefreservation  0B                     -
    rpool/ROOT/default  logbias               latency                default
    rpool/ROOT/default  dedup                 off                    default
    rpool/ROOT/default  mlslabel              none                   default
    rpool/ROOT/default  sync                  standard               default
    rpool/ROOT/default  dnodesize             legacy                 default
    rpool/ROOT/default  refcompressratio      1.62x                  -
    rpool/ROOT/default  written               2.39G                  -
    rpool/ROOT/default  logicalused           22.0G                  -
    rpool/ROOT/default  logicalreferenced     18.0G                  -
    rpool/ROOT/default  volmode               default                default
    rpool/ROOT/default  filesystem_limit      none                   default
    rpool/ROOT/default  snapshot_limit        none                   default
    rpool/ROOT/default  filesystem_count      none                   default
    rpool/ROOT/default  snapshot_count        none                   default
    rpool/ROOT/default  snapdev               hidden                 default
    rpool/ROOT/default  acltype               off                    default
    rpool/ROOT/default  context               none                   default
    rpool/ROOT/default  fscontext             none                   default
    rpool/ROOT/default  defcontext            none                   default
    rpool/ROOT/default  rootcontext           none                   default
    rpool/ROOT/default  relatime              on                     temporary
    rpool/ROOT/default  redundant_metadata    all                    default
    rpool/ROOT/default  overlay               off                    default

Some are inherited from the parent dataset, some are general defaults, others have been set explicitly, and some are read-only information. Note the atime/devices/exec/setuid/readonly properties, which set the mount options, as well as the mountpoint itself. Other properties control quotas and pre-allocated reservations of blocks for this filesystem, while others are for performance tuning, such as the cache, logbias and sync options. Transparent compression with lz4 is enabled. Still others control dataset integrity, such as the checksum, copies and redundant_metadata options (the latter two store multiple copies of blocks /in addition to/ the effective RAID redundancy provided by the storage pool).
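
Each of these is a single command to change (the values here are purely illustrative, not what this system uses):

    % zfs set quota=20G rpool/var/cache      # cap how much space this dataset may use
    % zfs set reservation=1G rpool/var/log   # or guarantee it a minimum instead
    % zfs set compression=lz4 rpool          # children inherit unless they override it
    % zfs set copies=2 rpool/home/root       # extra block copies on top of pool redundancy
    % zfs get compression,atime,quota rpool/ROOT/default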

So the mount options can be set on a per-filesystem basis, and you can have as many filesystems as you like within the directory hierarchy. Ultimate flexibility!

This is the fstab:

    % cat /etc/fstab
    PARTUUID=7542d544-adc9-40b3-b6ef-5aa3ac5afbfb /boot/efi vfat defaults 0 1
    /dev/zvol/rpool/swap none swap defaults 0 0

Just the swap volume (which is a ZFS volume), and the EFI thing (plus some NFS mounts I omitted). All the ZFS filesystems get mounted using the dataset properties, as shown above. There's a single place to administer the options, within zfs itself. Mountpoint and other property changes take immediate effect. It's simple, easy and powerful to administer.

    % mount | grep rpool
    rpool/ROOT/default on / type zfs (rw,relatime,xattr,noacl)
    rpool/home/root on /root type zfs (rw,nodev,noatime,xattr,noacl)
    rpool/var/cache on /var/cache type zfs (rw,nosuid,nodev,noexec,noatime,xattr,noacl)
    rpool/var/log on /var/log type zfs (rw,nosuid,nodev,noexec,noatime,xattr,noacl)
    rpool/var/spool on /var/spool type zfs (rw,nosuid,nodev,noexec,noatime,xattr,noacl)
    rpool/var/tmp on /var/tmp type zfs (rw,nosuid,nodev,noatime,xattr,noacl)
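
Changing an option is just another property set, and the live mount picks it up (again a sketch, with illustrative values):

    % zfs set exec=off rpool/var/tmp              # /var/tmp gains noexec straight away
    % zfs set mountpoint=/srv/log rpool/var/log   # dataset is remounted at the new path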

So that's a brief look at ZFS on Linux as a root filesystem. Hope it was interesting.


Why should anyone care? I'd like to begin by contrasting this with Rick Moen's points regarding placement of partitions on the disc for maximum performance. While such a strategy is technically correct, it suffers from inflexible partition arrangements, it's extremely time-consuming to profile enough variants to ensure the layout is optimal for the workload, and it assumes the workload will never change once you've adopted a particular layout. Is placing /usr in the middle better than the rootfs, or /var, or particular user data? Who knows? And who has time for that? Certainly not any of the admins who looked after my systems. I did this twenty years ago, when I had far too much time to micro-optimise this stuff, but I haven't done so for a long time now. For the small performance gain you might obtain, it's hugely costly.

The ZFS approach is to place all the storage in a huge pool, and then to allow performance tuning of the pool and individual datasets within it. The physical placement of the data becomes largely irrelevant; ZFS handles this for you, and it probably does a better job of placing the data efficiently. That is, after all, its entire purpose.

The other point is that ZFS scales well. Take the 4-disc NAS. Streaming reads can pull data off all 4 discs in parallel via a dedicated HBA. Streaming writes are balanced across all the discs. And I can tune each dataset for throughput or latency, as well as adjusting caching options and default blocksize to match the workload, and I can add fast SSD storage for read and/or write caching to further improve performance. There are dozens of knobs to tweak and plenty of different strategies to employ in the pool layout. Plus a load of kernel parameters to tune its behaviour there as well. There are several books written on this stuff (by Michael Lucas and Allan Jude).
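
A few of those knobs, as a sketch (the values are examples, not recommendations, and red/db and red/backups are hypothetical datasets):

    % zfs set recordsize=1M red/data             # large blocks suit streaming media
    % zfs set recordsize=16K red/db              # small blocks suit a database
    % zfs set logbias=throughput red/data        # favour bandwidth over latency here
    % zfs set primarycache=metadata red/backups  # keep backup churn out of the ARC
    % zpool add red cache gpt/zfs-l2arc          # bolt on an SSD read cache (placeholder label)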

When it comes to filesystems, we still have the option of using plain partitions, or md, and/or LVM. However, we've made a lot of progress in storage technology over the last two decades, both in hardware and software. Storage managers and filesystems like ZFS provide a lot of value the older systems do not. Personally, I'm sold on it. It beats the pants off LVM, and it doesn't eat your data or unbalance itself to unusability like Btrfs.

If you're a ZFS user and you look at the issue of mounting /usr as a separate filesystem, it's really a non-issue:

- it's just another dataset in the pool
- being mounted directly when the pool is activated means there are zero issues mounting from initramfs vs directly; /all/ the filesystems in the pool are mounted automatically
- if you do have a separate /usr, you can control the mount options just like any other dataset, but /usr is nothing special; you can divide the filesystem hierarchy as finely as you like with no problems (see the sketch below)
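
If you did want /usr, or pieces of it, as separate datasets, it's just more of the same (hypothetical dataset names; you'd populate them at install time or after copying the data across):

    % zfs create -o mountpoint=/usr rpool/usr
    % zfs create -o setuid=off rpool/usr/share   # and finer-grained still, if you like
    % zfs create -o atime=off rpool/usr/src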


Regards,
Roger