On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster <daniel.vanders...@cern.ch>
wrote:

>
>
> 1) How often are DC S3700's failing in your deployments?
>

None of mine have failed yet.  I am planning to monitor the wear level
indicator and preemptively replace any SSDs that drop below 10%.  Manually
flushing the journal, replacing the SSD, and building a new journal is much
faster than backfilling all the dependent OSDs.
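
For what it's worth, the manual swap I have in mind looks roughly like this
(untested sketch; the osd id N, /dev/sdX, and the sysvinit-style service
commands are placeholders for whatever your setup uses):

    # check the wear level (attribute 233, Media_Wearout_Indicator, on Intel drives)
    smartctl -A /dev/sdX

    # keep the cluster from backfilling while the journal SSD is out
    ceph osd set noout

    # for each OSD journaling on that SSD:
    service ceph stop osd.N
    ceph-osd -i N --flush-journal

    # ...physically replace the SSD and re-create the journal partitions...

    ceph-osd -i N --mkjournal
    service ceph start osd.N

    ceph osd unset noout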



> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is the
> backfilling which results from an SSD failure? Have you considered tricks
> like increasing the down out interval so backfilling doesn’t happen in this
> case (leaving time for the SSD to be replaced)?
>

Replacing a failed SSD won't help you avoid the backfill.  I haven't
actually tested it, but I'm pretty sure that losing the journal effectively
corrupts the OSDs behind it.  I don't know exactly what steps are required
to recover from that, but it wouldn't surprise me if you need to re-format
the OSDs.
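
If so, I'd expect recovery to amount to removing and re-creating each
affected OSD, roughly like this (again untested, just the standard
remove/re-add procedure as I understand it; N and the device paths are
placeholders):

    # remove the dead OSD from the cluster
    ceph osd out N
    service ceph stop osd.N
    ceph osd crush remove osd.N
    ceph auth del osd.N
    ceph osd rm N

    # wipe the data disk and prepare it again, pointing at a new journal partition
    ceph-disk zap /dev/sdY
    ceph-disk prepare /dev/sdY /dev/sdZ1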



> Next, I wonder how people with puppet/chef/… are handling the
> creation/re-creation of the SSD devices. Are you just wiping and rebuilding
> all the dependent OSDs completely when the journal dev fails? I’m not keen
> on puppetizing the re-creation of journals for OSDs...
>

So far, I'm doing my disk zapping manually.  Automatically zapping disks
makes me nervous.  :-)

I'm of the opinion that you shouldn't automate something until automating
it will save you time over doing it by hand.  My cluster is small enough
that it's faster to do it manually.



>
> We also have this crazy idea of failing over to a local journal file in
> case an SSD fails. In this model, when an SSD fails we’d quickly create a
> new journal either on another SSD or on the local OSD filesystem, then
> restart the OSDs before backfilling started. Thoughts?
>

See #2.