Performance on ZFS on Linux (ZoL) seems to be fine, as long as you use the CEPH 
generic filesystem implementation (writeahead journaling) and not the 
ZFS-specific CEPH implementation; the CoW snapshotting that CEPH does when ZFS 
support is compiled in absolutely kills performance. I suspect the same would 
apply to CEPH on Illumos with ZFS. Otherwise, in my own testing it is 
comparable to XFS once tweaked.
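
As a rough sketch of what that looks like on the CEPH side (the option names 
below are from my own filestore setup, so double-check them against your CEPH 
version), I force the generic writeahead journal mode and keep the ZFS 
snapshot path off:

    [osd]
        # use the generic writeahead journaling path
        filestore journal writeahead = true
        # only relevant when ZFS support is compiled in; keep the
        # CoW snapshot behaviour disabled
        filestore zfs snap = false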

There are a few oddities/quirks with ZFS performance that need to be tweaked 
when using it with CEPH, and yes, enabling SA-based xattrs is one of them.

1. ZFS recordsize - The ZFS "sector size", known within ZFS as the recordsize, 
is technically dynamic: it only enforces a maximum block size. However, the 
way CEPH writes to and reads from objects (when working with smaller blocks, 
say 4k or 8k via rbd) with default settings seems to be affected by the 
recordsize. With the default 128K I've found lower IOPS and higher latency. 
Setting the recordsize too low will inflate various ZFS metadata, so it needs 
to be balanced against how your CEPH pool will be used.

For rbd pools (where small block performance may be important) a recordsize of 
32K seems to be a good balance (example below). For pure large-object use 
(rados, etc.) the 128K default is fine; throughput stays high and small block 
performance isn't important there. See the following links for more info about 
recordsize: https://blogs.oracle.com/roch/entry/tuning_zfs_recordsize and 
https://www.joyent.com/blog/bruning-questions-zfs-record-size
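
A minimal example (the dataset name is just a placeholder for your OSD 
dataset; set this before filling the OSD, since recordsize only applies to 
newly written blocks):

    # 32K record size for rbd-heavy pools
    zfs set recordsize=32K tank/ceph/osd.0

    # verify
    zfs get recordsize tank/ceph/osd.0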

2. XATTR - I didn't do much testing here, but I've read that if you do not set 
xattr=sa on ZFS you will get poor performance. There were also stability 
issues with xattr=sa on ZFS in the past, though they all seem to be resolved 
now and I have not encountered any issues myself. I'm unsure what the default 
setting is; I always enable it.

Make sure you set xattr=sa on ZFS (see the example below).
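
A minimal sketch, again with a placeholder dataset name (ideally set it before 
creating the OSD, since existing files keep their old xattr layout):

    zfs set xattr=sa tank/ceph/osd.0
    zfs get xattr tank/ceph/osd.0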

3. ZIL (the ZFS Intent Log on a dedicated device, also known as the SLOG) is a 
MUST (even with a separate ceph journal) - It appears that while the ceph 
journal offloads/absorbs writes nicely and boosts performance, it does not 
consolidate writes enough for ZFS. Without a dedicated ZIL/SLOG your 
performance will be very sawtooth-like (jumpy/stuttery, i.e. fast then slow, 
fast then slow, over a period of 10-15 seconds).

In theory tweaking the various ZFS TXG sync settings might work, but that is 
overly complicated to maintain and would likely only apply to the specific 
underlying disk model. Disabling sync also resolves this, though you'll lose 
the last TXG on a power failure - this might be okay with CEPH, but since I'm 
unsure I'll just assume it is not. IMHO avoid too much evil tuning; just add a 
ZIL/SLOG (example below).
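
Adding one is a one-liner; the device paths below are placeholders, and the 
mirrored variant is optional but means a dead log device plus a power failure 
won't cost you the last TXGs:

    # single log device
    zpool add tank log /dev/disk/by-id/ata-FAST_SSD-part1

    # or mirrored
    zpool add tank log mirror /dev/disk/by-id/ata-SSD_A-part1 \
        /dev/disk/by-id/ata-SSD_B-part1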

4. ZIL/SLOG + on-device ceph journal vs ZIL/SLOG + separate ceph journal - 
Performance is very similar; if you have a ZIL/SLOG you could easily get away 
without a separate ceph journal and leave it on the device/ZFS dataset. 
HOWEVER, this causes HUGE amounts of fragmentation due to ZFS's CoW nature. 
After only a few days of usage, performance tanked with the ceph journal on 
the same device.

I did find that if you partition a device/SSD and share it between the 
ZIL/SLOG and a separate ceph journal, the resulting performance is about the 
same in pure throughput/IOPS, though latency is slightly higher. This is what 
I do in my test cluster (rough sketch below).
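
Roughly what that looks like (device names, sizes and labels are placeholders; 
a few GB is plenty for the SLOG, the rest can go to the ceph journal):

    # carve the SSD into a small SLOG partition and a ceph journal partition
    sgdisk -n 1:0:+8G  -c 1:zfs-slog     /dev/sdx
    sgdisk -n 2:0:+10G -c 2:ceph-journal /dev/sdx

    # give partition 1 to ZFS as the SLOG
    zpool add tank log /dev/disk/by-partlabel/zfs-slog

    # and point the OSD's journal at partition 2 in ceph.conf:
    # [osd.0]
    #     osd journal = /dev/disk/by-partlabel/ceph-journal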

5. Fragmentation - Once you hit around 80-90% disk usage your performance will 
start to degrade due to fragmentation. This isn't due to CEPH; it's a known 
ZFS quirk stemming from its CoW nature. Unfortunately there is no defrag in 
ZFS, and there likely never will be (the mythical block pointer rewrite 
unicorn you'll find people talking about).

There is one way to delay it and possibly avoid it, however: enable 
metaslab_debug. This keeps the ZFS spacemaps in memory, allowing ZFS to make 
better placements during CoW operations, at the cost of more memory (example 
below). See the following links for more detail about spacemaps and 
fragmentation: http://blog.delphix.com/uday/2013/02/19/78/ and 
http://serverfault.com/a/556892 and 
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg45408.html
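
On ZoL this is a module parameter; note that newer ZoL versions split the old 
metaslab_debug into metaslab_debug_load/metaslab_debug_unload, so the exact 
name depends on your version:

    # load spacemaps at import and keep them from being unloaded
    echo 1 > /sys/module/zfs/parameters/metaslab_debug_load
    echo 1 > /sys/module/zfs/parameters/metaslab_debug_unload

    # make it persistent across reboots
    echo "options zfs metaslab_debug_load=1 metaslab_debug_unload=1" \
        >> /etc/modprobe.d/zfs.conf

    # newer ZoL can also report per-pool fragmentation
    zpool list -o name,capacity,fragmentation tank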

There's a lot more to ZFS and its "things-to-know" than that (L2ARC uses ARC 
metadata space, dedupe uses ARC metadata space, etc.), but as far as CEPH is 
concerned the above is a good place to start. ZFS IMHO is a great solution, 
but it requires some time and effort to do it right.

Cheers,

Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com


-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark 
Nelson
Sent: April-15-15 12:22 PM
To: Jake Young
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph on Solaris / Illumos

On 04/15/2015 10:36 AM, Jake Young wrote:
>
>
> On Wednesday, April 15, 2015, Mark Nelson <mnel...@redhat.com 
> <mailto:mnel...@redhat.com>> wrote:
>
>
>
>     On 04/15/2015 08:16 AM, Jake Young wrote:
>
>         Has anyone compiled ceph (either osd or client) on a Solaris
>         based OS?
>
>         The thread on ZFS support for osd got me thinking about using
>         solaris as
>         an osd server. It would have much better ZFS performance and I
>         wonder if
>         the osd performance without a journal would be 2x better.
>
>
>     Doubt it.  You may be able to do a little better, but you have to
>     pay the piper some how.  If you clone from journal you will
>     introduce fragmentation.  If you throw the journal away you'll
>     suffer for everything but very large writes unless you throw safety
>     away.  I think if we are going to generally beat filestore (not just
>     for optimal benchmarking tests!) it's going to take some very
>     careful cleverness. Thankfully Sage is very clever and is working on
>     it in newstore. Even there, filestore has been proving difficult to
>     beat for writes.
>
>
> That's interesting. I've been under the impression that the ideal osd 
> config was using a stable and fast BTRFS (which doesn't exist
> yet) with no journal.

This is sort of unrelated to the journal specifically, but BTRFS with RBD will 
start fragmenting terribly due to how COW works (and how it relates to 
snapshots too).  More related to the journal:  At one point we were thinking 
about cloning from the journal on BTRFS, but that also potentially leads to 
nasty fragmentation even if the initial behavior would look very good.  I 
haven't done any testing that I can remember of BTRFS with no journal.  I'm not 
sure if it even still works...

>
> In my specific case, I don't want to use an external journal. I've 
> gone down the path of using RAID controllers with write-back cache and 
> BBUs with each disk in its own RAID0 group, instead of SSD journals. 
> (Thanks for your performance articles BTW, they were very helpful!)
>
> My take on your results indicates that IO throughput performance on 
> XFS with same disk journal and WB cache on the RAID card was basically 
> the same or better than BTRFS with no journal.  In addition, BTRFS 
> typically used much more CPU.
>
> Has BTRFS performance gotten any better since you wrote the 
> performance articles?

So the trick with those articles is that the systems are fresh, and most of the 
initial articles were using rados bench which is always writing out new objects 
vs something like RBD where you are (usually) doing writes to existing objects 
that represent the blocks.  If you were to do a bunch of random 4k writes and 
then later try to do sequential reads, you'd see BTRFS sequential read 
performance tank.  We actually did tests like that with emperor during the 
firefly development cycle.  I've included the results. Basically the first 
iteration of the test cycle looks great on BTRFS, then you see read performance 
drop way down. 
Eventually write performance is also likely to drop as the disks become 
extremely fragmented (we may even see a little of that in those tests).

>
> Have you compared ZFS (ZoL) performance to BTRFS?

I did way back in 2013 when we were working with Brian Behlendorf to fix xattr 
bugs in ZOL.  It was quite a bit slower if you didn't enable SA xattrs.  With 
SA xattrs, it was much closer, but not as fast as btrfs or xfs.  I didn't do a 
lot of tuning though and Ceph wasn't making good use of ZFS features, so it's 
very possible things have changed.

>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
