On Friday, April 17, 2015, Michal Kozanecki <mkozane...@evertz.com> wrote:

> Performance of Ceph on ZFS on Linux (ZoL) seems to be fine, as long as you
> use the generic Ceph filesystem implementation (writeahead journaling) and
> not the ZFS-specific implementation; the CoW snapshotting that Ceph does
> when ZFS support is compiled in absolutely kills performance. I suspect the
> same would apply to Ceph on Illumos on ZFS. Otherwise, in my own testing,
> it is comparable to XFS once tweaked.
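>
> As far as I can tell, forcing writeahead journal mode in ceph.conf is what
> gets you the generic filesystem behaviour; a minimal sketch (assuming a
> FileStore OSD - double-check the option against your release, this is just
> how I'd approach it):
>
>     [osd]
>         # use the generic writeahead journal mode rather than the
>         # snapshot-based path enabled when ZFS support is compiled in
>         filestore journal writeahead = true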
>
> There are a few oddities/quirks in ZFS performance that need to be tweaked
> when using it with Ceph, and yes, enabling SA xattrs is one of them.
>
> 1. ZFS recordsize - The ZFS "sector size", known within ZFS as the
> recordsize, is technically dynamic; it only enforces a maximum size.
> However, the way Ceph writes to and reads from objects with default
> settings (when working with smaller blocks, say 4k or 8k via rbd) seems to
> be affected by the recordsize. With the default 128K I've found lower IOPS
> and higher latency. Setting the recordsize too low will inflate various ZFS
> metadata, so it needs to be balanced against how your Ceph pool will be
> used.
>
> For rbd pools (where small-block performance may be important) a recordsize
> of 32K seems to be a good balance. For pure large-object use (rados, etc.)
> the 128K default is fine; throughput is high and small-block performance
> isn't important there. See the following links for more info about
> recordsize:
> https://blogs.oracle.com/roch/entry/tuning_zfs_recordsize and
> https://www.joyent.com/blog/bruning-questions-zfs-record-size
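>
> As a concrete sketch, the tuning is a single property per dataset; the
> pool/dataset names below are made up for illustration:
>
>     # 32K recordsize for an OSD dataset serving mostly small-block rbd I/O
>     # (only affects data written after the change)
>     zfs set recordsize=32K tank/ceph/osd0
>     # leave the 128K default for object/rados-heavy pools, and verify with:
>     zfs get recordsize tank/ceph/osd0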
>
> 2. XATTR - I didn't do much testing here, but I've read that if you do not
> set xattr=sa on ZFS you will get poor performance. There were also
> stability issues with xattr=sa on ZFS in the past, though those seem to be
> all resolved now and I have not encountered any issues myself. I'm unsure
> what the default setting is here; I always enable it.
>
> Make sure you set xattr=sa on ZFS.
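>
> Something like this, again with a made-up dataset name:
>
>     # store xattrs in the dnode (SA) instead of a hidden xattr directory
>     zfs set xattr=sa tank/ceph/osd0
>     # verify that the property took
>     zfs get xattr tank/ceph/osd0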
>
> 3. A ZIL/SLOG (the ZFS Intent Log on a dedicated log device, often called
> the SLOG) is a MUST, even with a separate ceph journal - It appears that
> while the ceph journal offloads/absorbs writes nicely and boosts
> performance, it does not consolidate writes enough for ZFS. Without a
> ZIL/SLOG your performance will be very sawtooth-like (jumpy, stuttering:
> fast then slow, fast then slow, over a period of 10-15 seconds).
>
> In theory tweaking the various ZFS TXG sync settings might work, but that
> is overly complicated to maintain and would likely only apply to the
> specific underlying disk model. Disabling sync also resolves this, though
> you'll lose the last TXG on a power failure - this might be okay with Ceph,
> but since I'm unsure I'll just assume it is not. IMHO, avoid too much evil
> tuning and just add a ZIL/SLOG.
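>
> Adding a SLOG is a one-liner; the device path below is purely illustrative
> (use an SSD with power-loss protection):
>
>     # attach a dedicated log (SLOG) device to the pool
>     zpool add tank log /dev/disk/by-id/ata-EXAMPLE-SSD-part1
>     # the "evil tuning" alternative I would avoid, for the reason above:
>     # zfs set sync=disabled tank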
>
> 4. ZIL/SLOG + on-device ceph journal vs ZIL/SLOG + separate ceph journal -
> Performance is very similar; if you have a ZIL/SLOG you could easily get
> away without a separate ceph journal and leave it on the device/ZFS
> dataset. HOWEVER, this causes HUGE amounts of fragmentation due to the CoW
> nature of ZFS. After only a few days' usage, performance tanked with the
> ceph journal on the same device.
>
> I did find that if you partition a device/SSD and share it between the
> ZIL/SLOG and a separate ceph journal, the resulting performance is about
> the same in pure throughput/IOPS, though latency is slightly higher. This
> is what I do in my test cluster.
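>
> Roughly how I carve up the shared SSD (device name and sizes are
> placeholders, adjust to taste):
>
>     # partition 1 for the ZFS SLOG, partition 2 for the ceph journal
>     sgdisk -n 1:0:+16G -c 1:"zfs-slog"     /dev/sdX
>     sgdisk -n 2:0:+20G -c 2:"ceph-journal" /dev/sdX
>     zpool add tank log /dev/sdX1
>
> and then point the OSD at the second partition in ceph.conf:
>
>     [osd.0]
>         osd journal = /dev/sdX2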
>
> 5. Fragmentation - once you hit around 80-90% disk usage your performance
> will start to slow down due to fragmentation. This isn't due to Ceph; it's
> a known ZFS quirk stemming from its CoW nature. Unfortunately there is no
> defrag in ZFS, and there likely never will be (the mythical block pointer
> rewrite unicorn you'll find people talking about).
>
> There is one way to delay and possibly avoid it, however: enable
> metaslab_debug. This keeps the ZFS spacemaps in memory, allowing ZFS to
> make better placement decisions during CoW operations, but it does use more
> memory. See the following links for more detail about spacemaps and
> fragmentation:
> http://blog.delphix.com/uday/2013/02/19/78/ and
> http://serverfault.com/a/556892 and
> http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg45408.html
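>
> On recent ZoL builds the single metaslab_debug tunable appears to have been
> split into load/unload knobs, so check what your module actually exposes;
> a sketch of how I'd set it:
>
>     # see which metaslab tunables your zfs module has
>     ls /sys/module/zfs/parameters/ | grep metaslab_debug
>     # keep spacemaps loaded in memory (costs RAM, helps allocation)
>     echo 1 > /sys/module/zfs/parameters/metaslab_debug_load
>     echo 1 > /sys/module/zfs/parameters/metaslab_debug_unload
>     # persist across reboots
>     echo "options zfs metaslab_debug_load=1 metaslab_debug_unload=1" \
>         >> /etc/modprobe.d/zfs.conf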
>
> There's a lot more to ZFS and "things to know" than that (L2ARC uses ARC
> metadata space, dedupe uses ARC metadata space, etc.), but as far as Ceph
> is concerned the above is a good place to start. ZFS IMHO is a great
> solution, but it requires some time and effort to do it right.
>
> Cheers,
>
> Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com
>
>
Thank you for taking the time to share that, Michal!

Jake


>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
> On Behalf Of Mark Nelson
> Sent: April-15-15 12:22 PM
> To: Jake Young
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph on Solaris / Illumos
>
> On 04/15/2015 10:36 AM, Jake Young wrote:
> >
> >
> > On Wednesday, April 15, 2015, Mark Nelson <mnel...@redhat.com> wrote:
> >
> >
> >
> >     On 04/15/2015 08:16 AM, Jake Young wrote:
> >
> >         Has anyone compiled ceph (either osd or client) on a Solaris
> >         based OS?
> >
> >         The thread on ZFS support for osd got me thinking about using
> >         solaris as
> >         an osd server. It would have much better ZFS performance and I
> >         wonder if
> >         the osd performance without a journal would be 2x better.
> >
> >
> >     Doubt it.  You may be able to do a little better, but you have to
> >     pay the piper somehow.  If you clone from the journal you will
> >     introduce fragmentation.  If you throw the journal away you'll
> >     suffer for everything but very large writes unless you throw safety
> >     away.  I think if we are going to generally beat filestore (not just
> >     for optimal benchmarking tests!) it's going to take some very
> >     careful cleverness. Thankfully Sage is very clever and is working on
> >     it in newstore. Even there, filestore has been proving difficult to
> >     beat for writes.
> >
> >
> > That's interesting. I've been under the impression that the ideal osd
> > config was using a stable and fast BTRFS (which doesn't exist
> > yet) with no journal.
>
> This is sort of unrelated to the journal specifically, but BTRFS with RBD
> will start fragmenting terribly due to how COW works (and how it relates to
> snapshots too).  More related to the journal: at one point we were thinking
> about cloning from the journal on BTRFS, but that also potentially leads to
> nasty fragmentation, even if the initial behavior would look very good.  I
> can't remember doing any testing of BTRFS with no journal.  I'm not sure if
> it even still works...
>
> >
> > In my specific case, I don't want to use an external journal. I've
> > gone down the path of using RAID controllers with write-back cache and
> > BBUs with each disk in its own RAID0 group, instead of SSD journals.
> > (Thanks for your performance articles BTW, they were very helpful!)
> >
> > My take on your results is that IO throughput performance on XFS with a
> > same-disk journal and WB cache on the RAID card was basically the same
> > as or better than BTRFS with no journal.  In addition, BTRFS typically
> > used much more CPU.
> >
> > Has BTRFS performance gotten any better since you wrote the
> > performance articles?
>
> So the trick with those articles is that the systems are fresh, and most of
> the initial articles were using rados bench, which is always writing out
> new objects, vs something like RBD where you are (usually) doing writes to
> existing objects that represent the blocks.  If you were to do a bunch of
> random 4k writes and then later try to do sequential reads, you'd see BTRFS
> sequential read performance tank.  We actually did tests like that with
> emperor during the firefly development cycle.  I've included the results.
> Basically the first iteration of the test cycle looks great on BTRFS, then
> you see read performance drop way down.  Eventually write performance is
> also likely to drop as the disks become extremely fragmented (we may even
> see a little of that in those tests).
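>
> If you want to reproduce that pattern, the test shape is roughly "age the
> image with small random writes, then measure sequential reads"; a rough
> sketch with fio against a mapped rbd device (device name and sizes are
> placeholders, not what we actually ran):
>
>     # step 1: fragment the image with random 4k writes
>     fio --name=age --filename=/dev/rbd0 --rw=randwrite --bs=4k \
>         --ioengine=libaio --direct=1 --iodepth=32 --size=20G
>     # step 2: measure sequential read throughput on the aged image
>     fio --name=seqread --filename=/dev/rbd0 --rw=read --bs=4M \
>         --ioengine=libaio --direct=1 --iodepth=16 --size=20G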
>
> >
> > Have you compared ZFS (ZoL) performance to BTRFS?
>
> I did, way back in 2013, when we were working with Brian Behlendorf to fix
> xattr bugs in ZoL.  It was quite a bit slower if you didn't enable SA
> xattrs.  With SA xattrs it was much closer, but not as fast as btrfs or
> xfs.  I didn't do a lot of tuning though, and Ceph wasn't making good use
> of ZFS features, so it's very possible things have changed.
>
> >
> >
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
