Hello,

On Fri, 19 Sep 2014 18:29:02 -0700 Craig Lewis wrote:

> I'm personally interested in running Ceph on some RAID-Z2 volumes with
> ZILs.  XFS feels really dated after using ZFS.  I need to check the
> progress, but I'm thinking of reformatting one node once Giant comes out.
> 
I'm looking forward to the results of this.

Personally I've found ext4 to be faster than XFS in nearly all use cases,
and the lack of full, in-kernel integration for ZFS doesn't appeal to me
either.
Especially since Ceph usually lusts after the latest kernel, which of
course isn't supported by ZoL yet. ^o^
 
> 
> On Thu, Sep 18, 2014 at 6:36 AM, Christian Balzer <ch...@gol.com> wrote:
> 
> >
> > Hello,
> >
> > On Thu, 18 Sep 2014 13:07:35 +0200 Christoph Adomeit wrote:
> >
> >
> > > Presently we use Solaris ZFS boxes as NFS storage for VMs.
> > >
> > That sounds slower than I would expect Ceph RBD to be in nearly all
> > cases.
> >
> > Also, how do you replicate the filesystems to cover for node failures?
> >
> 
> I have used zfs snapshots and zfs send/receive in a cron job.  It's not
> live replication, but it's fast enough that I could run it every 5
> minutes, maybe even every minute.
> 
Yeah, a coworker does that on a FreeBSD pair here. Since the data in
question are infrequently, manually edited configuration files, a low
replication rate is not a big issue; the changes could easily be re-done
on the other node if need be.

However, it won't cut the mustard where real HA is required.
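
For anyone curious, the whole operation boils down to roughly the sketch
below (Python here rather than shell, purely for readability; the dataset
name, remote host and snapshot naming are made up, and a real cron job
would also want locking, error handling and pruning of old snapshots):

  #!/usr/bin/env python3
  # Rough sketch of snapshot-based replication, meant to be run from cron.
  import subprocess
  import time

  DATASET = "tank/vmstore"   # hypothetical source dataset
  REMOTE = "standby-node"    # hypothetical receiving host

  def latest_snapshot(dataset):
      """Return the newest snapshot of the dataset, or None if none exist."""
      out = subprocess.check_output(
          ["zfs", "list", "-t", "snapshot", "-o", "name", "-s", "creation",
           "-H", "-d", "1", dataset]).decode().split()
      return out[-1] if out else None

  def replicate():
      prev = latest_snapshot(DATASET)
      snap = "%s@repl-%d" % (DATASET, int(time.time()))
      subprocess.check_call(["zfs", "snapshot", snap])
      # Incremental send if an earlier snapshot exists, full send otherwise.
      send = ["zfs", "send"] + (["-i", prev] if prev else []) + [snap]
      recv = ["ssh", REMOTE, "zfs", "receive", "-F", DATASET]
      sender = subprocess.Popen(send, stdout=subprocess.PIPE)
      subprocess.check_call(recv, stdin=sender.stdout)
      sender.wait()

  if __name__ == "__main__":
      replicate()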

> 
> > > Next question: I read that in Ceph an OSD is marked invalid as
> > > soon as its journaling disk is invalid. So what should I do? I don't
> > > want to use 1 journal disk for each OSD. I also don't want to use
> > > a journal disk per 4 OSDs because then I will lose 4 OSDs if an SSD
> > > fails. Using journals on OSD disks will, I am afraid, be slow.
> > > Again I am afraid of slow Ceph performance compared to ZFS because
> > > ZFS supports ZIL write cache disks.
> > >
> > I don't do ZFS, but it is my understanding that losing the ZIL cache
> > (presumably on an SSD for speed reasons) will also potentially lose you
> > the latest writes. So not really all that different from Ceph.
> >
> 
> ZFS will lose only the data that was in the ZIL, but not on disk.  It
> requires admin intervention to tell ZFS to forget about the lost data.
> ZFS will allow you to read/write any data that was already on the disks.
> 

From what I gathered, losing an SSD-based ZIL doesn't mean you lose data
immediately, as the data is also in RAM (ARC) and will be flushed to disk
eventually.
However, performance will be massively impacted, and a crash or memory
error at that point would likely result in data loss.
Hence people running their ZIL on a RAID1 pair.

> 
> 
> > In that scenario losing even 4 OSDs due to a journal SSD failure would
> > not be the end of the world by a long shot. Never mind that if you're
> > using the right SSDs (Intel DC S3700 for example) you're unlikely to
> > ever experience such a failure.
> > And even if so, there are again plenty of discussions in this ML about
> > how to mitigate the effects of such a failure (in terms of replication
> > traffic and its impact on cluster performance; data redundancy should
> > really never be the issue).
> >
> 
> The main issue is performance during recovery.  You really don't want
> recovery to affect more than a few percent of your OSDs, otherwise you'll
> start having latency problems.  If losing a single SSD will lose 20% of
> your OSDs, recovery is going to hurt.  If losing a single SSD only loses
> 1% of your OSDs, don't worry about it.
> 
> The Anti-Cephalopod discussion here was pretty lively, with a lot of good
> info.
> 
Yeah, it is also about balance; of course, if your cluster is a
fire-breathing monster it might even take a big hit w/o melting down.

My anti-cephalopod-designed cluster has a journal SSD per OSD (as there are
very few OSDs ^o^) and the "classic" one I'm currently setting up has 1
journal SSD ("cheap" 100GB DC S3700) per 2 plain HDD OSDs.

> 
> >
> > > Last question: Someone told me Ceph snapshots are slow. Is this
> > > true? I always thought making a snapshot is just moving around some
> > > pointers to data.
> > >
> > No idea, I don't use them.
> > But from what I gather the DELETION of them (like RBD images) is a
> > rather resource-intensive process, not the creation.
> >
> 
> The snapshots themselves aren't particularly slow, but the cleanup when
> using XFS is pretty painful.
> 
> Ceph on XFS has to emulate snapshots, and XFS isn't copy-on-write.  Ceph
> on BtrFS uses BtrFS's snapshots, so they're much faster.
> 
> I believe Ceph on ZFS just got ZFS snapshot support.  That's what I'm
> waiting for before I start testing.
> 
Interesting data point.

> 
> >
> > > And very last question: What about btrfs, still not recommended ?
> > >
> > Definitely not from where I'm standing.
> > Between the inherent disadvantage of using BTRFS (CoW, thus
> > fragmentation galore) for VM storage and actual bugs people run into,
> > I don't think it ever will be.
> >
> 
> My personal opinion is that BtrFS is dead, but nobody is willing to say
> it out loud.  Oracle was driving the BtrFS development, and now they own
> ZFS. Why bother?
> 
Yeah, that's my take on it as well.
What confuses me at this point is that they have OCFS2, ZFS and BTRFS, but
nothing I've seen suggests they are working on combining the features of
those.

> COW fragmentation is a problem.  ZFS gets around it by telling you to never
> fill the disks up more than 80%.  The writes slow down once you hit 80%,
> and they get progressively slower the closer you get to 100%.  I can live
> with that; I'll just set the nearfull ratio to 75%.
>
Yup, though I've seen examples of this happening a lot earlier (but that
was with the ZIL on disk AFAICT), so I'm definitely looking forward to your
tests in the future.
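
For reference, that nearfull knob lives in ceph.conf; something along these
lines (option name from memory, so treat it as an assumption and double-check
the docs before relying on it):

  [global]
      mon osd nearfull ratio = .75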
 

Christian

> 
> 
> > I venture that Key/Value store systems will be both faster and more
> > reliable than BTRFS within a year or so.
> >
> 
> Also very interesting.


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
