On 29/11/2018 13:44, Olaf Meeuwissen wrote:
> Q: Isn't there some filesystem type that supports settings at a more
>    granular level than the device?  Like directory or per file?
> A: Eh ... Don't know.  Haven't checked ...
> Solution: Go fish!
>
> # I haven't gone fishing yet but a vague recollection of Roger's post
> # where he mentioned ZFS seemed promising ...

With ZFS you could set them at the directory level if you wanted, by creating a filesystem per directory. But that might get a bit unmanageable, so doing it in a coarser-grained way is usually sufficient. Let me show some examples.
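
For instance, something like this (just a sketch; the pool name "tank" and the paths are invented for illustration):

    % zfs create -o mountpoint=/srv/www tank/www              # one dataset per directory
    % zfs create -o exec=off -o setuid=off tank/www/uploads   # children inherit the mountpoint
    % zfs create -o compression=gzip-9 tank/www/logs          # different options per directory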

Firstly, this is an example of a FreeBSD 11.2 ZFS NAS with a few terabytes of HDD storage. There's a dedicated system pool, plus this data pool:

    % zpool status red
      pool: red
     state: ONLINE
      scan: scrub repaired 0 in 4h12m with 0 errors on Wed Nov 28 19:51:31 2018
    config:

            NAME                         STATE     READ WRITE CKSUM
            red                          ONLINE       0     0     0
              mirror-0                   ONLINE       0     0     0
                gpt/zfs-b1-WCC4M0EYCTAZ  ONLINE       0     0     0
                gpt/zfs-b2-WCC4M5PV83PY  ONLINE       0     0     0
              mirror-1                   ONLINE       0     0     0
                gpt/zfs-b3-WCC4N7FLJD34  ONLINE       0     0     0
                gpt/zfs-b4-WCC4N4UDKN8F  ONLINE       0     0     0
            logs
              mirror-2                   ONLINE       0     0     0
                gpt/zfs-a1-B492480446    ONLINE       0     0     0
                gpt/zfs-a2-B492480406    ONLINE       0     0     0

    errors: No known data errors

This is arranged as a pair of mirror vdevs, referenced by GPT labels. Each mirror vdev is effectively RAID1, and writes are striped across them. It's not /actually/ striping; it's cleverer than that, biasing the writes to balance throughput and free space across all the vdevs, but the effect is similar. The last vdev is a "log" device made of a pair of mirrored SSDs. This "ZIL" is basically a fast write cache. I could add an SSD as an L2ARC read cache as well, but it's not necessary for this system's workload. No errors have occurred on any of the devices in the pool.
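
For reference, a pool of roughly this shape could be created with something like the following (a sketch only; the device labels here are placeholders, not the real GPT labels above):

    % zpool create red \
          mirror gpt/zfs-b1 gpt/zfs-b2 \
          mirror gpt/zfs-b3 gpt/zfs-b4 \
          log mirror gpt/zfs-a1 gpt/zfs-a2
    % zpool add red cache gpt/zfs-a3    # optional L2ARC, if the workload warranted it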

All filesystems allocate storage from this pool.  Here are a few:

    % zfs list -r red | head -n6
    NAME                 USED  AVAIL  REFER  MOUNTPOINT
    red                 2.57T  1.82T   104K  /red
    red/bhyve            872K  1.82T    88K  /red/bhyve
    red/data             718G  1.82T   712G  /export/data
    red/distfiles       11.9G  1.82T  11.8G  /red/distfiles
    red/home             285G  1.82T    96K  /export/home

There are actually 74 filesystems in this pool! Because the storage is shared, there's no limit on how many you can have. So unlike traditional partitioning, or even LVM (without a thin pool), it's massively more flexible. You can create, snapshot and destroy datasets on a whim, and even delegate administration of parts of the tree to different users and groups. So users could create datasets under their home directory, snapshot them and send them to and from other systems. You can organise your data into whatever filesystem structure makes sense.
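
To give a flavour of how cheap this is (a sketch; the dataset names, the user "alice", the remote host "nas" and its pool "tank" are all made up):

    % zfs create red/scratch                        # a new filesystem, instantly
    % zfs snapshot red/scratch@before-experiment    # cheap point-in-time snapshot
    % zfs destroy -r red/scratch                    # and gone again
    % zfs allow alice create,mount,snapshot red/home/alice    # delegate a subtree
    % zfs send red/home/alice@backup | ssh nas zfs recv tank/alice   # replicate it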

Let's look at the pool used on the Linux system I'm writing this email on:

    % zpool status rpool
      pool: rpool
     state: ONLINE
      scan: scrub repaired 0B in 0h5m with 0 errors on Mon Nov 26 17:15:55 2018
    config:

            NAME        STATE     READ WRITE CKSUM
            rpool       ONLINE       0     0     0
              sda2      ONLINE       0     0     0

    errors: No known data errors

This is a single-SSD "root pool" for the operating system; the data lives in other pools not shown here.

    % zfs list -r rpool
    NAME                 USED  AVAIL  REFER  MOUNTPOINT
    rpool               45.4G  62.2G    96K  none
    rpool/ROOT          14.7G  62.2G    96K  none
    rpool/ROOT/default  14.7G  62.2G  12.0G  /
    rpool/home          3.18M  62.2G    96K  none
    rpool/home/root     3.08M  62.2G  3.08M  /root
    rpool/swap          8.50G  63.2G  7.51G  -
    rpool/var           5.72G  62.2G    96K  none
    rpool/var/cache     5.32G  62.2G  5.18G  /var/cache
    rpool/var/log        398M  62.2G   394M  /var/log
    rpool/var/spool     8.14M  62.2G  8.02M  /var/spool
    rpool/var/tmp        312K  62.2G   152K  /var/tmp

These datasets comprise the entire operating system (I've omitted some third-party software package datasets from rpool/opt).

    % zfs list -t snapshot -r rpool
    NAME                             USED  AVAIL  REFER  MOUNTPOINT
    rpool@cosmic-post                  0B      -    96K  -
    rpool/ROOT@cosmic-post             0B      -    96K  -
    rpool/ROOT/default@cosmic-post  2.66G      -  12.3G  -
    rpool/home@cosmic-post             0B      -    96K  -
    rpool/home/root@cosmic-post        0B      -  3.08M  -
    rpool/var@cosmic-post              0B      -    96K  -
    rpool/var/cache@cosmic-post      148M      -  5.06G  -
    rpool/var/log@cosmic-post       3.99M      -   337M  -
    rpool/var/spool@cosmic-post      128K      -  7.79M  -
    rpool/var/tmp@cosmic-post        160K      -   192K  -

These are snapshots which would permit rollback after a recent upgrade [as you can see, this particular system is Ubuntu; I've not yet tried out ZFS on Devuan].
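
If the upgrade had gone badly, recovery would be something along these lines (a sketch, not a transcript; in practice you'd roll the root dataset back from a rescue shell or the initramfs rather than while it's the live root):

    % zfs rollback rpool/ROOT/default@cosmic-post   # put / back as it was pre-upgrade
    % zfs rollback rpool/var/cache@cosmic-post      # and any other datasets you care about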

Each dataset has particular properties associated with it. These are the properties for the root filesystem:

    % zfs get all rpool/ROOT/default
    NAME                PROPERTY              VALUE                  SOURCE
    rpool/ROOT/default  type                  filesystem             -
    rpool/ROOT/default  creation              Sun Jun 12 10:46 2016  -
    rpool/ROOT/default  used                  14.7G                  -
    rpool/ROOT/default  available             62.2G                  -
    rpool/ROOT/default  referenced            12.0G                  -
    rpool/ROOT/default  compressratio         1.63x                  -
    rpool/ROOT/default  mounted               yes                    -
    rpool/ROOT/default  quota                 none                   default
    rpool/ROOT/default  reservation           none                   default
    rpool/ROOT/default  recordsize            128K                   default
    rpool/ROOT/default  mountpoint            /                      local
    rpool/ROOT/default  sharenfs              off                    default
    rpool/ROOT/default  checksum              on                     default
    rpool/ROOT/default  compression           lz4                    inherited from rpool
    rpool/ROOT/default  atime                 off                    inherited from rpool
    rpool/ROOT/default  devices               off                    inherited from rpool
    rpool/ROOT/default  exec                  on                     default
    rpool/ROOT/default  setuid                on                     default
    rpool/ROOT/default  readonly              off                    default
    rpool/ROOT/default  zoned                 off                    default
    rpool/ROOT/default  snapdir               hidden                 default
    rpool/ROOT/default  aclinherit            restricted             default
    rpool/ROOT/default  createtxg             15                     -
    rpool/ROOT/default  canmount              on                     default
    rpool/ROOT/default  xattr                 on                     default
    rpool/ROOT/default  copies                1                      default
    rpool/ROOT/default  version               5                      -
    rpool/ROOT/default  utf8only              on                     -
    rpool/ROOT/default  normalization         formD                  -
    rpool/ROOT/default  casesensitivity       sensitive              -
    rpool/ROOT/default  vscan                 off                    default
    rpool/ROOT/default  nbmand                off                    default
    rpool/ROOT/default  sharesmb              off                    default
    rpool/ROOT/default  refquota              none                   default
    rpool/ROOT/default  refreservation        none                   default
    rpool/ROOT/default  guid                  3867409876204186651    -
    rpool/ROOT/default  primarycache          all                    default
    rpool/ROOT/default  secondarycache        all                    default
    rpool/ROOT/default  usedbysnapshots       2.66G                  -
    rpool/ROOT/default  usedbydataset         12.0G                  -
    rpool/ROOT/default  usedbychildren        0B                     -
    rpool/ROOT/default  usedbyrefreservation  0B                     -
    rpool/ROOT/default  logbias               latency                default
    rpool/ROOT/default  dedup                 off                    default
    rpool/ROOT/default  mlslabel              none                   default
    rpool/ROOT/default  sync                  standard               default
    rpool/ROOT/default  dnodesize             legacy                 default
    rpool/ROOT/default  refcompressratio      1.62x                  -
    rpool/ROOT/default  written               2.39G                  -
    rpool/ROOT/default  logicalused           22.0G                  -
    rpool/ROOT/default  logicalreferenced     18.0G                  -
    rpool/ROOT/default  volmode               default                default
    rpool/ROOT/default  filesystem_limit      none                   default
    rpool/ROOT/default  snapshot_limit        none                   default
    rpool/ROOT/default  filesystem_count      none                   default
    rpool/ROOT/default  snapshot_count        none                   default
    rpool/ROOT/default  snapdev               hidden                 default
    rpool/ROOT/default  acltype               off                    default
    rpool/ROOT/default  context               none                   default
    rpool/ROOT/default  fscontext             none                   default
    rpool/ROOT/default  defcontext            none                   default
    rpool/ROOT/default  rootcontext           none                   default
    rpool/ROOT/default  relatime              on                     temporary
    rpool/ROOT/default  redundant_metadata    all                    default
    rpool/ROOT/default  overlay               off                    default

Some are inherited from the parent dataset, some are general defaults, others have been set explicitly, and some are read-only information. Note the atime/devices/exec/setuid/readonly properties, which set the mount options, as well as the mountpoint itself. Other properties control quotas and pre-allocated reservations of blocks for this filesystem, while others are for performance tuning, such as the cache, logbias and sync options. Transparent compression with lz4 is enabled. Still others control dataset integrity, such as the checksum, copies and redundant_metadata options (the latter two store multiple copies of blocks /in addition to/ the effective RAID redundancy provided by the storage pool).
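
Each of these is a single command to change (the values here are purely illustrative, not what this system uses):

    % zfs set quota=20G rpool/var/cache      # cap how much space this dataset may use
    % zfs set reservation=1G rpool/var/log   # or guarantee it a minimum instead
    % zfs set compression=lz4 rpool          # children inherit unless they override it
    % zfs set copies=2 rpool/home/root       # extra block copies on top of pool redundancy
    % zfs get compression,atime,quota rpool/ROOT/default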

So the mount options can be set on a per-filesystem basis, and you can have as many filesystems as you like within the directory hierarchy. Ultimate flexibility!

This is the fstab:

    % cat /etc/fstab
    PARTUUID=7542d544-adc9-40b3-b6ef-5aa3ac5afbfb /boot/efi vfat defaults 0 1
    /dev/zvol/rpool/swap none swap defaults 0 0

Just the swap volume (which is a ZFS volume), and the EFI thing (plus some NFS mounts I omitted). All the ZFS filesystems get mounted using the dataset properties, as shown above. There's a single place to administer the options, within zfs itself. Mountpoint and other property changes take immediate effect. It's simple, easy and powerful to administer.

    % mount | grep rpool
    rpool/ROOT/default on / type zfs (rw,relatime,xattr,noacl)
    rpool/home/root on /root type zfs (rw,nodev,noatime,xattr,noacl)
    rpool/var/cache on /var/cache type zfs (rw,nosuid,nodev,noexec,noatime,xattr,noacl)
    rpool/var/log on /var/log type zfs (rw,nosuid,nodev,noexec,noatime,xattr,noacl)
    rpool/var/spool on /var/spool type zfs (rw,nosuid,nodev,noexec,noatime,xattr,noacl)
    rpool/var/tmp on /var/tmp type zfs (rw,nosuid,nodev,noatime,xattr,noacl)
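
Changing an option is just another property set, and the live mount picks it up (again a sketch, with illustrative values):

    % zfs set exec=off rpool/var/tmp              # /var/tmp gains noexec straight away
    % zfs set mountpoint=/srv/log rpool/var/log   # dataset is remounted at the new path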

So that's a brief look at ZFS on Linux as a root filesystem. Hope it was interesting.


Why should anyone care? I'd like to begin by contrasting this with Rick Moen's points regarding placement of partitions on the disc for maximum performance. While such a strategy is technically correct, it suffers from inflexible partition arrangements, it's extremely time-consuming to profile enough variants to ensure the layout is optimal for the workload, and it assumes the workload will never change once you've adopted a particular layout. Is placing /usr in the middle better than the rootfs, or /var, or particular user data? Who knows? And who has time for that? Certainly not any of the admins who looked after my systems. I did this twenty years ago, when I had far too much time to micro-optimise this stuff, but I haven't done so for a long time now. For the small performance gain you might obtain, it's hugely costly.

The ZFS approach is to place all the storage in a huge pool, and then to allow performance tuning of the pool and individual datasets within it. The physical placement of the data becomes largely irrelevant; ZFS handles this for you, and it probably does a better job of placing the data efficiently. That is, after all, its entire purpose.

The other point is that ZFS scales well. Take the 4-disc NAS. Streaming reads can pull data off all 4 discs in parallel via a dedicated HBA. Streaming writes are balanced across all the discs. And I can tune each dataset for throughput or latency, as well as adjusting caching options and default blocksize to match the workload, and I can add fast SSD storage for read and/or write caching to further improve performance. There are dozens of knobs to tweak and plenty of different strategies to employ in the pool layout. Plus a load of kernel parameters to tune its behaviour there as well. There are several books written on this stuff (by Michael Lucas and Allan Jude).
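
A few of those knobs, as a sketch (the values are examples, not recommendations, and red/db and red/backups are hypothetical datasets):

    % zfs set recordsize=1M red/data             # large blocks suit streaming media
    % zfs set recordsize=16K red/db              # small blocks suit a database
    % zfs set logbias=throughput red/data        # favour bandwidth over latency here
    % zfs set primarycache=metadata red/backups  # keep backup churn out of the ARC
    % zpool add red cache gpt/zfs-l2arc          # bolt on an SSD read cache (placeholder label)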

When it comes to filesystems, we still have the option of using plain partitions, or md, and/or LVM. However, we've made a lot of progress in storage technology over the last two decades, both in hardware and software. Storage managers and filesystems like ZFS provide a lot of value the older systems do not. Personally, I'm sold on it. It beats the pants off LVM, and it doesn't eat your data or unbalance itself to unusability like Btrfs.

If you're a ZFS user and you look at the issue of mounting /usr as a separate filesystem, it's really a non-issue:

- it's just another dataset in the pool
- being mounted directly when the pool is activated means there are zero issues mounting from initramfs vs directly; /all/ the filesystems in the pool are mounted automatically
- if you do have a separate /usr, you can control the mount options just like any other dataset, but /usr is nothing special; you can divide the filesystem hierarchy as finely as you like with no problems (see the sketch below)
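
If you did want /usr, or pieces of it, as separate datasets, it's just more of the same (hypothetical dataset names; you'd populate them at install time or after copying the data across):

    % zfs create -o mountpoint=/usr rpool/usr
    % zfs create -o setuid=off rpool/usr/share   # and finer-grained still, if you like
    % zfs create -o atime=off rpool/usr/src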


Regards,
Roger