> I'm personally interested in running Ceph on some RAID-Z2 volumes with
> ZILs.  XFS feels really dated after using ZFS.  I need to check the
> progress, but I'm thinking of reformatting one node once Giant comes out.
I'm looking forward to the results of this.

Personally I found ext4 to be faster than XFS in nearly all use cases and
the lack of full, real kernel integration of ZFS is something that doesn't
appeal to me either. 
Especially when Ceph usually lusts for the latest kernel, which of course
isn't supported yet by ZOL. ^o^
> > > Presently we use Solaris ZFS Boxes as NFS Storage for VMs.
> > >
> > That sounds slower than I would expect Ceph RBD to be in nearly all
> > cases.
> >
> > Also, how do you replicate the filesystems to cover for node failures?
> >
> I have used zfs snapshots and zfs send/receive in a cron.  It's not live
> replication, but it's fast enough that I could run it every 5 minutes,
> and maybe every minute.
Yeah, a coworker does that on a FreeBSD pair here. Since the data in
question are infrequently, manually edited configuration files, a low
replication rate is not a big issue; the changes could easily be re-done
on the other node if need be.
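
For reference, one replication cycle of that kind boils down to a snapshot
plus an incremental send piped into a receive on the other box. A minimal
sketch that only builds the command strings (it does not run them); the
dataset, snapshot and host names are made up for illustration, and a real
cron job would also need locking and pruning of old snapshots:

```python
import shlex


def replication_commands(dataset, prev_snap, new_snap, remote_host, remote_dataset):
    """Build the shell commands for one incremental replication cycle.

    All names passed in are hypothetical examples; nothing is executed here.
    """
    # Step 1: take a new snapshot of the local dataset.
    snapshot = ["zfs", "snapshot", f"{dataset}@{new_snap}"]
    # Step 2: send only the delta between the previous and the new snapshot...
    send = ["zfs", "send", "-i", f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]
    # ...and apply it on the remote node (-F rolls the target back to its
    # most recent snapshot first, discarding any local changes there).
    receive = ["ssh", remote_host, "zfs", "receive", "-F", remote_dataset]
    return shlex.join(snapshot), shlex.join(send) + " | " + shlex.join(receive)


if __name__ == "__main__":
    snap_cmd, pipe_cmd = replication_commands(
        "tank/vm", "rep-0001", "rep-0002", "backup1", "tank/vm"
    )
    print(snap_cmd)
    print(pipe_cmd)
```

Anything written after the last completed cycle is gone if the primary
dies, which is the whole point of the next remark.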

However it won't cut the mustard where real HA is required. 

> ZFS will lose only the data that was in the ZIL, but not on disk.  It
> requires admin intervention to tell ZFS to forget about the lost data.
> ZFS will allow you to read/write any data that was already on the disks.

From what I gathered, losing an SSD-based ZIL doesn't mean you lose data
immediately, as the data is also in RAM (ARC) and will be flushed to disk.
However, performance will be massively impacted and a crash or memory error
at that point would likely result in data loss.
Thus people doing RAID1 ZIL.

> The main issue is performance during recovery.  You really don't want
> recovery to affect more than a few percent of your OSDs, otherwise you'll
> start having latency problems.  If losing a single SSD will lose 20% of
> your OSDs, recovery is going to hurt.  If losing a single SSD only loses
> 1% of your OSDs, don't worry about it.
> The Anti-Cephalopod discussion here was pretty lively, with a lot of good
> info.
Yeah, it is also about balance; of course, if your cluster is a
fire-breathing monster it might even take a big hit w/o melting down.

My anti-cephalopod designed cluster has a journal SSD per OSD (as there are
very few OSDs ^o^) and the "classic" one I'm currently setting up has 1
journal SSD ("cheap" 100GB DC 3700S) per 2 plain HDD OSDs. 
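
To put rough numbers on the "how many OSDs does one dead SSD take out"
question, here is a back-of-the-envelope sketch. It assumes every journal
SSD carries the same number of OSDs; the cluster sizes in the example are
invented, not my actual node counts:

```python
def osd_loss_fraction(total_osds, osds_per_journal_ssd):
    """Fraction of all OSDs that go down when a single journal SSD fails.

    Assumes journals are spread evenly, so one SSD failure takes out
    exactly osds_per_journal_ssd OSDs (capped at the cluster size).
    """
    if total_osds <= 0 or osds_per_journal_ssd <= 0:
        raise ValueError("both counts must be positive")
    return min(osds_per_journal_ssd, total_osds) / total_osds


if __name__ == "__main__":
    # Hypothetical clusters, all with 2 OSDs sharing one journal SSD.
    for total in (10, 40, 200):
        frac = osd_loss_fraction(total, 2)
        print(f"{total} OSDs, 2 per SSD: {frac:.0%} of OSDs lost per SSD failure")
```

With 2 OSDs per SSD you need on the order of 200 OSDs before a single SSD
failure drops to the 1% "don't worry about it" level mentioned above.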

> The snapshots themselves aren't particularly slow, but the cleanup when
> using XFS is pretty painful.
> Ceph on XFS has to emulate snapshots, and XFS isn't copy-on-write.  Ceph
> on BtrFS uses BtrFS's snapshots, so they're much faster.
> I believe Ceph on ZFS just got ZFS snapshot support.  That's what I'm
> waiting for before I start testing.
Interesting data point.

> My personal opinion is that BtrFS is dead, but nobody is willing to say
> it out loud.  Oracle was driving the BtrFS development, and now they own
> ZFS. Why bother?
Yeah, that's my take on it as well.
What confuses me at this point is that they have OCFS2, ZFS and BTRFS but
nothing I've seen suggests them working on combining features of those.

> COW fragmentation is a problem.  ZFS gets around it by telling you to never
> fill the disks up more than 80%.  The writes slow down once you hit 80%,
> and they get progressively slower the closer you get to 100%.  I can live
> with that; I'll just set the nearfull ratio to 75%.
Yup, though I've seen examples of this happening a lot earlier (but that
was with ZIL on disk AFAICT), so definitely looking forward to your tests
in the future.
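
For the tests, a trivial helper along these lines could flag where a given
OSD sits relative to those two thresholds. The 75% and 80% defaults are
simply the figures from the quoted paragraph; the function itself is a
hypothetical sketch and not part of Ceph or ZFS:

```python
def fill_status(used_bytes, total_bytes, nearfull_ratio=0.75, slowdown_ratio=0.80):
    """Classify a volume's utilisation against the two thresholds.

    nearfull_ratio: where the Ceph nearfull warning would be set (assumed 75%).
    slowdown_ratio: where ZFS writes reportedly start to slow down (assumed 80%).
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    if ratio >= slowdown_ratio:
        return "slow"       # past the point where COW writes degrade
    if ratio >= nearfull_ratio:
        return "nearfull"   # warning zone, still ahead of the slowdown
    return "ok"


if __name__ == "__main__":
    # Made-up utilisation figures for a 100-unit volume.
    for used in (70, 76, 85):
        print(used, fill_status(used, 100))
```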


> > I venture that Key/Value store systems will be both faster and more
> > reliable than BTRFS within a year or so.
> >
> Also very interesting.

