Hi Martin,

On September 4 2014 at 10:07 PM, "Martin B Nielsen" <mar...@unity3d.com> wrote:

> Hi Dan,
>
> We took a different approach (and our cluster is tiny compared to many
> others) - we have two pools: normal and ssd.
>
> We use 14 disks in each osd-server: 8 platter and 4 ssd for ceph, plus 2 ssd
> for OS/journals. We partitioned the two OS ssds as raid1, using about half
> the space for the OS and leaving the rest on each for 2x journals plus
> unprovisioned space - so each OS ssd holds 2x platter journals. On top of
> that, our ssd-pool disks also hold 2x journals: their own + an additional
> one from a platter disk. We have 8 osd-nodes.
>
> So whenever an ssd fails we lose 2 osds (but never more).

Interesting ... you have quite a few SSDs per box there. I suppose my closest
config would be 5 platter journals per SSD, plus a FileStore+journal on the
SSD itself to squeeze every last IOP out of it. That would take out 6 OSDs
with a single failure.
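(To make the "increase the down out interval" trick I ask about further down
a bit more concrete: the idea would be to give ourselves a few hours to swap
a dead journal SSD before its OSDs get marked out, and to keep backfill
gentle if it does start. Roughly, in ceph.conf - just a sketch, so please
check the option names and defaults against the release you run:

    [mon]
    # default is 300s; raise it to leave time to replace the journal SSD
    mon osd down out interval = 10800

    [osd]
    # throttle backfill/recovery if it does kick in
    osd max backfills = 1
    osd recovery max active = 1

or simply "ceph osd set noout" for the duration of the swap.)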
> We've had this system in production for ~1½ years now and so far we've had
> 1 ssd and 2 platter disks fail. We run a couple of hundred vm-guests on it
> and use ~60TB.

Which SSD was that?

> On a daily basis we avg. 30MB/sec r/w and ~600 iops, so not very high usage.
> The times we lost disks we hardly noticed. All SSDs (OS included) have a
> general utilization of <5%, platter disks near 10%.

We have peaks up to 7000 iops, but mostly sit between 4-5000. When we hit
7000 iops the small-write latency inches up to around 70ms :(

> We did a lot of initial testing of putting journals on the OS ssds as well
> as extra journals on the ssd-osds, but we didn't find much difference or the
> high latencies that others have experienced. When/if we notice otherwise
> we'll prob. switch to pure ssd as journal holders.
>
> We originally deployed using saltstack, and even though we have automated
> disk replacement we still do it manually 'just to be sure'. It takes
> 5-10 min to replace an old disk and get it backfilling, so I don't expect us
> to spend any time automating this.
>
> Recovering 2 disks at once takes a long time for us, but we've intentionally
> set backfilling low and it is not noticeable on the cluster when it happens.

Yeah, 2 wouldn't be noticeable in our cluster even now. 24 _was_ noticeable,
so maybe 5 is doable.

Thanks for the input,
Dan

> Anyways, we have pretty low cluster usage, but in our experience the ssds
> seem to handle the constant load very well.
>
> Cheers,
> Martin
>
> On Thu, Sep 4, 2014 at 6:21 PM, Dan Van Der Ster <daniel.vanders...@cern.ch> wrote:
>
>> Dear Cephalopods,
>>
>> In a few weeks we will receive a batch of 200GB Intel DC S3700's to
>> augment our cluster, and I'd like to hear your practical experience and
>> discuss options for how best to deploy these.
>>
>> We'll be able to equip each of our 24-disk OSD servers with 4 SSDs, so
>> they will become 20 OSDs + 4 SSDs per server. Until recently I've been
>> planning to use the traditional deployment: 5 journal partitions per SSD.
>> But as SSD-day approaches, I'm growing less comfortable with the idea of
>> 5 OSDs going down every time an SSD fails, so perhaps there are better
>> options out there.
>>
>> Before getting into options, I'm curious about the real-world reliability
>> of these drives:
>>
>> 1) How often are DC S3700's failing in your deployments?
>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is the
>> backfilling which results from an SSD failure? Have you considered tricks
>> like increasing the down out interval so backfilling doesn't happen in
>> this case (leaving time for the SSD to be replaced)?
>>
>> Beyond the usual 5-partition deployment, is anyone running a RAID1 or
>> RAID10 for the journals? If so, are you using the raw block devices, or
>> formatting them and storing the journals as files on the SSD array(s)?
>> Recent discussions seem to indicate that XFS is just as fast as the block
>> dev, since these drives are so fast.
>>
>> Next, I wonder how people with puppet/chef/… are handling the
>> creation/re-creation of the SSD devices. Are you just wiping and
>> rebuilding all the dependent OSDs completely when the journal dev fails?
>> I'm not keen on puppetizing the re-creation of journals for OSDs...
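(To spell out what re-creating a journal by hand looks like, this is roughly
the per-OSD sequence I have in mind - a sketch only, with example IDs and
paths, and of course --flush-journal only helps if the old journal is still
readable; with a truly dead SSD we'd have to skip the flush and hope the
filestore is still consistent:

    service ceph stop osd.20
    ceph-osd -i 20 --flush-journal
    # repoint the journal symlink at the new partition or file
    ln -sf /dev/disk/by-partuuid/<new-journal-part> /var/lib/ceph/osd/ceph-20/journal
    ceph-osd -i 20 --mkjournal
    service ceph start osd.20

Doing that for 5 OSDs per failed SSD is exactly the part I'd rather not have
to encode in puppet.)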
>>
>> We also have this crazy idea of failing over to a local journal file in
>> case an SSD fails. In this model, when an SSD fails we'd quickly create a
>> new journal, either on another SSD or on the local OSD filesystem, then
>> restart the OSDs before backfilling starts. Thoughts?
>>
>> Lastly, I would also consider using 2 of the SSDs in a data pool (with the
>> other 2 SSDs holding 20 journals - probably in a RAID1, to avoid
>> backfilling 10 OSDs when an SSD fails). If that 10-to-1 journal-to-SSD
>> ratio would perform adequately, it'd give us quite a few SSDs to build a
>> dedicated high-IOPS pool.
>>
>> I'd also appreciate any other suggestions/experiences which might be
>> relevant.
>>
>> Thanks!
>> Dan
>>
>> -- Dan van der Ster || Data & Storage Services || CERN IT Department --
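PS: to make the RAID1-journal option above a bit more concrete, the layout
I'm picturing is roughly the following (a sketch with made-up device names
and paths, not something we've tried yet):

    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdu1 /dev/sdv1
    mkfs.xfs /dev/md0
    mount /dev/md0 /srv/ceph/journals
    # then in ceph.conf (with one $cluster-$id subdirectory created per OSD):
    [osd]
    osd journal = /srv/ceph/journals/$cluster-$id/journal
    osd journal size = 10240

i.e. 20 journal files sharing one mirrored SSD pair, with the open question
being whether the two S3700s can keep up with journal writes for 20 spinners.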