On Friday, April 17, 2015, Michal Kozanecki <mkozane...@evertz.com> wrote:
> Performance on ZFS on Linux (ZoL) seems to be fine, as long as you use
> the CEPH generic filesystem implementation (writeahead) and not the
> specific CEPH ZFS implementation; the CoW snapshotting that CEPH does
> with ZFS support compiled in absolutely kills performance. I suspect the
> same would go for CEPH on Illumos on ZFS. Otherwise it is comparable to
> XFS in my own testing once tweaked.
>
> There are a few oddities/quirks with ZFS performance that need to be
> tweaked when using it with CEPH, and yeah, enabling SA on xattr is one
> of them.
>
> 1. ZFS recordsize - The ZFS "sector size", known within ZFS as the
> recordsize, is technically dynamic; it only enforces the maximum size.
> However, the way CEPH writes to and reads from objects (when working
> with smaller blocks, let's say 4k or 8k via rbd) with default settings
> seems to be affected by the recordsize. With the default 128K I've found
> lower IOPS and higher latency. Setting the recordsize too low will
> inflate various ZFS metadata, so it needs to be balanced against how
> your CEPH pool will be used.
>
> For rbd pools (where small-block performance may be important) a
> recordsize of 32K seems to be a good balance. For pure large-object use
> (rados, etc.) the 128K default is fine and throughput is high
> (small-block performance isn't important there). See the following links
> for more info about recordsize:
> https://blogs.oracle.com/roch/entry/tuning_zfs_recordsize and
> https://www.joyent.com/blog/bruning-questions-zfs-record-size
>
> 2. XATTR - I didn't do much testing here; I've read that if you do not
> set xattr=sa on ZFS you will get poor performance. There were also
> stability issues in the past with xattr=sa on ZFS, though they all seem
> resolved now and I have not encountered any issues myself. I'm unsure
> what the default setting is here; I always enable it.
>
> Make sure you enable and set xattr=sa on ZFS.
>
> 3. ZIL (ZFS Intent Log, also known as the SLOG) is a MUST (even with a
> separate ceph journal) - It appears that while the ceph journal
> offloads/absorbs writes nicely and boosts performance, it does not
> consolidate writes enough for ZFS. Without a ZIL/SLOG your performance
> will be very sawtooth-like (jumpy and stuttering: fast then slow, fast
> then slow, over a period of 10-15 seconds).
>
> In theory tweaking the various ZFS TXG sync settings might work, but
> that is overly complicated to maintain and would likely only apply to
> the specific underlying disk model. Disabling sync also resolves this,
> though you'll lose the last TXG on a power failure - this might be okay
> with CEPH, but since I'm unsure I'll just assume it is not. IMHO avoid
> too much evil tuning; just add a ZIL/SLOG.
>
> 4. ZIL/SLOG + on-device ceph journal vs ZIL/SLOG + separate ceph journal
> - Performance is very similar; if you have a ZIL/SLOG you could easily
> get away without a separate ceph journal and leave it on the device/ZFS
> dataset. HOWEVER, this causes HUGE amounts of fragmentation due to the
> CoW nature. After only a few days of usage, performance tanked with the
> ceph journal on the same device.
>
> I did find that if you partition and share a device/SSD between both the
> ZIL/SLOG and a separate ceph journal, the resulting performance is about
> the same in pure throughput/IOPS, though latency is slightly higher.
> This is what I do in my test cluster.
>
> 5. Fragmentation - Once you hit around 80-90% disk usage your
> performance will start to slow down due to fragmentation. This isn't due
> to CEPH; it's a known ZFS quirk due to its CoW nature. Unfortunately
> there is no defrag in ZFS, and there likely never will be (the mythical
> block pointer rewrite unicorn you'll find people talking about).
>
> There is one way to delay it and possibly avoid it, however: enable
> metaslab_debug. This keeps the ZFS spacemaps in memory, allowing ZFS to
> make better placements during CoW operations, but it does use more
> memory. See the following links for more detail about spacemaps and
> fragmentation:
> http://blog.delphix.com/uday/2013/02/19/78/ and
> http://serverfault.com/a/556892 and
> http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg45408.html
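A rough sketch of how the tunings above map to commands, for reference.
The pool and dataset names ("tank", "tank/ceph-osd0") and the SSD
partition path are hypothetical, and the spacemap knob is a ZoL module
parameter whose exact name depends on the release (metaslab_debug_load
on newer versions, metaslab_debug on older ones):

    # assumed layout: pool "tank", OSD dataset "tank/ceph-osd0", spare SSD partition for the SLOG
    zfs set recordsize=32K tank/ceph-osd0    # for rbd-heavy pools; leave at 128K for pure object use
    zfs set xattr=sa tank/ceph-osd0          # store xattrs in the dnode instead of a hidden directory
    zpool add tank log /dev/disk/by-id/ata-EXAMPLESSD-part1   # dedicated ZIL/SLOG device
    echo 1 > /sys/module/zfs/parameters/metaslab_debug_load   # keep spacemaps in RAM
    zpool list -o name,capacity,fragmentation tank            # watch usage and fragmentation creep

Note that recordsize and xattr=sa only affect data and xattrs written
after the properties are set, so they are best applied before the OSD
dataset is populated.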
> There's a lot more to ZFS and its "things to know" than that (L2ARC uses
> ARC metadata space, dedupe uses ARC metadata space, etc.), but as far as
> CEPH is concerned the above is a good place to start. ZFS IMHO is a
> great solution, but it requires some time and effort to do it right.
>
> Cheers,
>
> Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com

Thank you for taking the time to share that, Michal!

Jake

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mark Nelson
> Sent: April-15-15 12:22 PM
> To: Jake Young
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph on Solaris / Illumos
>
> On 04/15/2015 10:36 AM, Jake Young wrote:
> >
> > On Wednesday, April 15, 2015, Mark Nelson <mnel...@redhat.com> wrote:
> >
> > On 04/15/2015 08:16 AM, Jake Young wrote:
> >
> > Has anyone compiled ceph (either osd or client) on a Solaris based OS?
> >
> > The thread on ZFS support for osd got me thinking about using Solaris
> > as an osd server. It would have much better ZFS performance and I
> > wonder if the osd performance without a journal would be 2x better.
> >
> > Doubt it. You may be able to do a little better, but you have to pay
> > the piper somehow. If you clone from journal you will introduce
> > fragmentation. If you throw the journal away you'll suffer for
> > everything but very large writes unless you throw safety away. I think
> > if we are going to generally beat filestore (not just for optimal
> > benchmarking tests!) it's going to take some very careful cleverness.
> > Thankfully Sage is very clever and is working on it in newstore. Even
> > there, filestore has been proving difficult to beat for writes.
> >
> > That's interesting. I've been under the impression that the ideal osd
> > config was using a stable and fast BTRFS (which doesn't exist yet)
> > with no journal.
>
> This is sort of unrelated to the journal specifically, but BTRFS with
> RBD will start fragmenting terribly due to how COW works (and how it
> relates to snapshots too). More related to the journal: at one point we
> were thinking about cloning from the journal on BTRFS, but that also
> potentially leads to nasty fragmentation even if the initial behavior
> would look very good. I haven't done any testing that I can remember of
> BTRFS with no journal. I'm not sure if it even still works...
>
> > In my specific case, I don't want to use an external journal. I've
> > gone down the path of using RAID controllers with write-back cache and
> > BBUs, with each disk in its own RAID0 group, instead of SSD journals.
> > (Thanks for your performance articles BTW, they were very helpful!)
> >
> > My take on your results indicates that IO throughput performance on
> > XFS with a same-disk journal and WB cache on the RAID card was
> > basically the same or better than BTRFS with no journal. In addition,
> > BTRFS typically used much more CPU.
> >
> > Has BTRFS performance gotten any better since you wrote the
> > performance articles?
>
> So the trick with those articles is that the systems are fresh, and most
> of the initial articles were using rados bench, which is always writing
> out new objects, vs something like RBD where you are (usually) doing
> writes to existing objects that represent the blocks. If you were to do
> a bunch of random 4k writes and then later try to do sequential reads,
> you'd see BTRFS sequential read performance tank. We actually did tests
> like that with emperor during the firefly development cycle. I've
> included the results. Basically the first iteration of the test cycle
> looks great on BTRFS, then you see read performance drop way down.
> Eventually write performance is also likely to drop as the disks become
> extremely fragmented (we may even see a little of that in those tests).
>
> > Have you compared ZFS (ZoL) performance to BTRFS?
>
> I did way back in 2013 when we were working with Brian Behlendorf to fix
> xattr bugs in ZoL. It was quite a bit slower if you didn't enable SA
> xattrs. With SA xattrs, it was much closer, but not as fast as btrfs or
> xfs. I didn't do a lot of tuning though and Ceph wasn't making good use
> of ZFS features, so it's very possible things have changed.
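For reference, the "fresh vs. aged" effect Mark describes (random small
writes followed by sequential reads against the same RBD image) can be
reproduced with something like the sketch below. It assumes a fio build
with the rbd engine and an existing test image; the pool, image and
client names are placeholders:

    # age an existing RBD image with random 4k writes...
    fio --name=age --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg \
        --rw=randwrite --bs=4k --iodepth=32 --direct=1 --time_based --runtime=600
    # ...then measure sequential read throughput on the now-aged image
    fio --name=seqread --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg \
        --rw=read --bs=4M --iodepth=16 --direct=1 --time_based --runtime=120

rados bench alone tends to hide this effect, since it keeps writing
brand-new objects rather than rewriting existing ones.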
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com