Hello,

On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote:

> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster
> <daniel.vanders...@cern.ch> wrote:
> 
> >
> >
> > 1) How often are DC S3700's failing in your deployments?
> >
> 
> None of mine have failed yet.  I am planning to monitor the wear level
> indicator, and preemptively replace any SSDs that go below 10%.  Manually
> flushing the journal, replacing the SSD, and building a new journal is
> much faster than backfilling all the dependent OSDs.
>
What Craig said.

Hell, not even one of the consumer Intels (3xx, 520s) I have has ever
failed, though they are of course aging faster.
I've still got some ancient X-25s that haven't dropped below 96% on the
wear indicator.

I expect my DC 3700s to outlive 2 HDD generations. ^o^ 

Monitor and replace them accordingly and I doubt you'll ever lose one in
operation.
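
For completeness, the monitoring and preemptive swap look roughly like
this on my nodes. The OSD id (12), device name and init commands below
are of course just examples, adapt them to your own layout:

  # Wear indicator, attribute 233 (Media_Wearout_Indicator) on Intels:
  smartctl -A /dev/sdb | grep -i wear

  # Preemptive journal swap, assuming the cluster is otherwise healthy:
  ceph osd set noout
  service ceph stop osd.12
  ceph-osd -i 12 --flush-journal
  # ... replace the SSD, recreate the journal partition/symlink ...
  ceph-osd -i 12 --mkjournal
  service ceph start osd.12
  ceph osd unset noout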

> 
> 
> > 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is
> > the backfilling which results from an SSD failure? Have you considered
> > tricks like increasing the down out interval so backfilling doesn’t
> > happen in this case (leaving time for the SSD to be replaced)?
> >
> 
> Replacing a failed SSD won't help your backfill.  I haven't actually
> tested it, but I'm pretty sure that losing the journal effectively
> corrupts your OSDs.  I don't know what steps are required to complete
> this operation, but it wouldn't surprise me if you need to re-format the
> OSD.
>
This.
All the threads I've read about this indicate that journal loss during
operation means OSD loss. Total OSD loss, no recovery.
From what I've gathered, the developers are aware of this and it might be
addressed in the future.
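
In other words, if you do lose a journal mid-flight, plan on removing
and re-creating the affected OSDs from scratch. Roughly (osd.12 again
being just a placeholder):

  ceph osd out 12
  ceph osd crush remove osd.12
  ceph auth del osd.12
  ceph osd rm 12
  # then re-create it (e.g. ceph-disk prepare/activate) and let it
  # backfill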
 
Now, 200GB DC 3700s can write close to 400MB/s, so a 1:4 or even 1:5
ratio is sensible. However, these will be the ones limiting your maximum
sequential write speed, if that is of importance to you. In nearly all
use cases you run out of IOPS (on your HDDs) long before that becomes an
issue, though.
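
To put rough numbers on that (assuming ~150MB/s sequential per HDD,
which is about right for current 7.2K drives):

  5 HDDs x ~150MB/s    = ~750MB/s combined platter bandwidth
  1x 200GB DC S3700    = ~365MB/s sustained journal writes
  => sequential writes for that group top out near 365MB/s, not 750MB/s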

Raiding the journal SSDs seems wasteful given the cost and quality of the
DC 3700s. 
Configure your cluster in a way that re-balancing doesn't happen unless
you want it to (i.e. when the load is low), by doing the following (a
rough ceph.conf sketch follows after the list):
a) Setting "mon osd downout subtree limit" so that a host going down
doesn't result in a full re-balancing and the resulting IO shit storm.
In nearly all cases nodes are recoverable, and if a node isn't, its OSDs
may be. And even if that fails, you get to pick the time for the
recovery.
b) As you mentioned and others have before, set the out interval so you
can react to things. 
c) Configure the various backfill options to have only a small impact.
Journal SSDs will improve things compared to your current situation. And
if I recall correctly, you're using a replica size of 3 to 4, so you can
afford a more sedate recovery.
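
A rough ceph.conf sketch of the knobs mentioned above; the values are
purely illustrative, so test them on your staging cluster first:

  [mon]
    # a) don't automatically mark a whole host's OSDs out
    mon osd downout subtree limit = host
    # b) give yourself time to react before backfilling kicks in
    mon osd down out interval = 1800

  [osd]
    # c) keep the impact of recovery/backfill low
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1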

Journals on a filesystem go against KISS.
Not only do you add one more layer of complexity that can fail (and
filesystems do have bugs, as people were reminded when Firefly came
out), you're also wasting CPU cycles that might be needed over in the
less than optimal OSD code. ^o^
And you gain nothing from putting journals on a filesystem.
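
For reference, with the journal on a raw partition the OSD data
directory just contains a symlink to the device, no extra filesystem
involved. The OSD id and partuuid below are made up:

  $ ls -l /var/lib/ceph/osd/ceph-12/journal
  lrwxrwxrwx 1 root root 58 ... journal -> /dev/disk/by-partuuid/3a8f...

as opposed to pointing "osd journal" at a file on some filesystem.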


You might want to look into cache pools (and dedicated SSD servers with
fast controllers and CPUs) in your test cluster and for the future.
Right now my impression is that there is quite a bit more polishing to be
done (retention of hot objects, etc) and there have been stability concerns
raised here.
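
If you do want to play with cache tiering on the test cluster, the basic
plumbing is along these lines (pool names and pg count are placeholders):

  ceph osd pool create cache 1024
  ceph osd tier add rbd cache
  ceph osd tier cache-mode cache writeback
  ceph osd tier set-overlay rbd cache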

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/