Darren Soothill (darren.soothill) writes:
> Hi Fabien,
> 
> ZFS on top of RBD really makes me shudder. ZFS expects to have individual disk 
> devices that it can manage. It thinks it has them with this config but CEPH 
> is masking the real data behind it.
> 
> As has been said before why not just use Samba directly from CephFS and 
> remove that layer of complexity in the middle.

        As a user of ZFS on ceph, I can explain some of our motivation.

        As was pointed out earlier in this thread, CephFS will give you
        snapshots but not diffs between them. I don't know what the intent was
        with using diffs, but in ZFS' case snapshots provide a basis for
        checkpointing/recovery and instant dataset cloning, but also for
        replication/offsite mirroring (although not synchronous) - so you can
        easily back up/replicate the ZFS datasets to another location that
        doesn't necessarily have a CEPH installation (say, a big, cheap JBOD
        box with SMR drives running native ZFS). And you can diff between
        snapshots to see instantly which files were modified. That comes on
        top of the other benefits of running ZFS, such as lz4
        compression (per dataset), deduplication, etc.
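
        Roughly what that looks like on the ZFS side (the dataset and host
        names here - tank/data, backuphost, backup/data - are just made-up
        examples):

            # take a snapshot now, and another one later
            zfs snapshot tank/data@monday
            zfs snapshot tank/data@tuesday

            # see instantly which files changed between the two
            zfs diff tank/data@monday tank/data@tuesday

            # replicate the increment to a box that knows nothing about ceph
            zfs send -i tank/data@monday tank/data@tuesday | \
                ssh backuphost zfs receive -u backup/data

            # per-dataset lz4 compression
            zfs set compression=lz4 tank/data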

        While it's true that ZFS on top of RBD is not optimal, it's not
        particularly dangerous or unreliable. You provide it with multiple RBDs
        and create a pool out of those (a ZFS pool, not a ceph pool :). It sees
        each RBD as an individual disk, and can issue I/O to those independently.
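
        A rough sketch of that setup (image names and sizes are invented, the
        /dev/rbdN numbering is whatever rbd map actually reports, and on some
        ceph releases --size wants the value in megabytes rather than "1T"):

            # create and map a handful of RBD images to use as ZFS vdevs
            for i in 0 1 2 3; do
                rbd create --size 1T rbd/zdisk$i
                rbd map rbd/zdisk$i
            done

            # stripe a ZFS pool across them - no ZFS-level redundancy here,
            # ceph already replicates underneath
            zpool create tank /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3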

        If anything, you lose some of the benefits of ZFS - automatic error
        correction, since ZFS has no redundant copy of its own to repair from -
        but everything is still checksummed, so you still detect corruption.

        I already run ZFS within a VM (all our customers are hosted like this,
        using LXD or FreeBSD jails), whether the backing store is NFS, local
        disk or RBD doesn't really matter.

        So why NOT run ZFS on top of RBD ? Complexity mostly, and some measure
        of lost performance... But CephFS isn't exactly simple to run reliably
        yet either (MDS performance and possible deadlocks are an issue).

        If you're planning on serving files, you're still going to need an NFS
        or SMB layer. If you're on CephFS, you can serve via Ganesha or Samba
        without adding the extra ZFS layering, which will add latency; but
        either way you're still going to drag the data out of ceph to the
        client mounting the FS and export that via Samba/NFS. If instead you
        attach, say, 10 x 1 TB RBD images on a host, assemble those into a ZFS
        pool, and run NFS or Samba on top of that, you'll have more or less
        the same data path, but in addition you'll be going through ZFS, which
        introduces latency.
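
        For the CephFS path, that boils down to something like this (monitor
        address, mount point and credentials are placeholders):

            # mount CephFS with the kernel client...
            mount -t ceph mon1:6789:/ /mnt/cephfs \
                -o name=samba,secretfile=/etc/ceph/samba.secret
            # ...and point an ordinary Samba or Ganesha share at /mnt/cephfs

        whereas the RBD variant is the zpool from the sketch above with the
        same Samba/NFS export sitting on top of it.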

        Now, if you're daring, you create a ceph pool with size=1, min_size=1
        (will ceph let you do that ? :), you map RBDs out of that, hand them
        over to ZFS in a striped mirror (or raidz2) config - and let ZFS deal
        with failing VDEVs by giving it new RBDs to replace them. Sounds crazy ?
        Well, you lose the benefit of CEPH's self-healing, but you still get
        a super scalable ZFS running on a near limitless supply of JBOD :) And
        you can quickly set up different (zfs) pools with different levels of
        redundancy, quotas, compression, metadata options, etc...
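
        Sketching that out (pool and image names are invented, and recent ceph
        releases will only accept size=1 after you confirm it, e.g. via
        mon_allow_pool_size_one and --yes-i-really-mean-it):

            # a backing pool with no ceph-level redundancy
            ceph osd pool create zfsbacking 128 128
            ceph osd pool set zfsbacking size 1
            ceph osd pool set zfsbacking min_size 1

            # carve RBDs out of it and map them (repeat for vdev1..vdev5)
            rbd create --size 1T zfsbacking/vdev0
            rbd map zfsbacking/vdev0

            # let ZFS provide the redundancy instead
            zpool create tank raidz2 /dev/rbd0 /dev/rbd1 /dev/rbd2 \
                /dev/rbd3 /dev/rbd4 /dev/rbd5

            # when a vdev goes bad, hand ZFS a fresh RBD to resilver onto
            rbd create --size 1T zfsbacking/vdev6
            rbd map zfsbacking/vdev6
            zpool replace tank /dev/rbd2 /dev/rbd6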

        Who says you can't do both (CephFS and ZFS) anyway ? CEPH is flexible
        enough...
