Hi Martin,

On September 4 2014 at 10:07 PM, "Martin B Nielsen" <mar...@unity3d.com> wrote:

> Hi Dan,
>
> We took a different approach (and our cluster is tiny compared to many
> others) - we have two pools: normal and ssd.
>
> We use 14 disks in each osd-server: 8 platter and 4 ssd for ceph, plus 2 ssd
> for OS/journals. We partitioned the two OS ssds as raid1, using about half
> the space for the OS and leaving the rest on each for 2x journals plus
> unprovisioned space - so each OS ssd holds 2x platter journals. On top of
> that, our ssd-pool disks also hold 2x journals: their own + an additional
> one from a platter disk. We have 8 osd-nodes.
>
> So whenever an ssd fails we lose 2 osds (but never more).

Interesting ... you have quite a few SSDs per box there. I suppose my closest
config would be 5 platter journals per SSD, plus a FileStore+journal on the
SSD itself to squeeze every last IOP out of it. That would take out 6 OSDs
with a single failure.
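(To make the "increase the down out interval" trick I ask about further down
a bit more concrete: the idea would be to give ourselves a few hours to swap
a dead journal SSD before its OSDs get marked out, and to keep backfill
gentle if it does start. Roughly, in ceph.conf - just a sketch, so please
check the option names and defaults against the release you run:

    [mon]
    # default is 300s; raise it to leave time to replace the journal SSD
    mon osd down out interval = 10800

    [osd]
    # throttle backfill/recovery if it does kick in
    osd max backfills = 1
    osd recovery max active = 1

or simply "ceph osd set noout" for the duration of the swap.)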
> We've had this system in production for ~1½ years now and so far we've had
> 1 ssd and 2 platter disks fail. We run a couple of hundred vm-guests on it
> and use ~60TB.

Which SSD was that?

> On a daily basis we avg. 30MB/sec r/w and ~600 iops, so not very high usage.
> The times we lost disks we hardly noticed. All SSDs (OS included) have a
> general utilization of <5%, platter disks near 10%.

We have peaks up to 7000 iops, but mostly sit between 4-5000. When we hit
7000 iops the small-write latency inches up to around 70ms :(

> We did a lot of initial testing of putting journals on the OS ssds as well
> as extra journals on the ssd-osds, but we didn't find much difference or the
> high latencies that others have experienced. When/if we notice otherwise
> we'll prob. switch to pure ssd as journal holders.
>
> We originally deployed using saltstack, and even though we have automated
> disk replacement we still do it manually 'just to be sure'. It takes
> 5-10 min to replace an old disk and get it backfilling, so I don't expect us
> to spend any time automating this.
>
> Recovering 2 disks at once takes a long time for us, but we've intentionally
> set backfilling low and it is not noticeable on the cluster when it happens.

Yeah, 2 wouldn't be noticeable in our cluster even now. 24 _was_ noticeable,
so maybe 5 is doable.

Thanks for the input,
Dan

> Anyways, we have pretty low cluster usage, but in our experience the ssds
> seem to handle the constant load very well.
>
> Cheers,
> Martin
>
> On Thu, Sep 4, 2014 at 6:21 PM, Dan Van Der Ster <daniel.vanders...@cern.ch> wrote:
>
>> Dear Cephalopods,
>>
>> In a few weeks we will receive a batch of 200GB Intel DC S3700's to
>> augment our cluster, and I'd like to hear your practical experience and
>> discuss options for how best to deploy these.
>>
>> We'll be able to equip each of our 24-disk OSD servers with 4 SSDs, so
>> they will become 20 OSDs + 4 SSDs per server. Until recently I've been
>> planning to use the traditional deployment: 5 journal partitions per SSD.
>> But as SSD-day approaches, I'm growing less comfortable with the idea of
>> 5 OSDs going down every time an SSD fails, so perhaps there are better
>> options out there.
>>
>> Before getting into options, I'm curious about the real-world reliability
>> of these drives:
>>
>> 1) How often are DC S3700's failing in your deployments?
>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is the
>> backfilling which results from an SSD failure? Have you considered tricks
>> like increasing the down out interval so backfilling doesn't happen in
>> this case (leaving time for the SSD to be replaced)?
>>
>> Beyond the usual 5-partition deployment, is anyone running a RAID1 or
>> RAID10 for the journals? If so, are you using the raw block devices, or
>> formatting them and storing the journals as files on the SSD array(s)?
>> Recent discussions seem to indicate that XFS is just as fast as the block
>> dev, since these drives are so fast.
>>
>> Next, I wonder how people with puppet/chef/… are handling the
>> creation/re-creation of the SSD devices. Are you just wiping and
>> rebuilding all the dependent OSDs completely when the journal dev fails?
>> I'm not keen on puppetizing the re-creation of journals for OSDs...
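(To spell out what re-creating a journal by hand looks like, this is roughly
the per-OSD sequence I have in mind - a sketch only, with example IDs and
paths, and of course --flush-journal only helps if the old journal is still
readable; with a truly dead SSD we'd have to skip the flush and hope the
filestore is still consistent:

    service ceph stop osd.20
    ceph-osd -i 20 --flush-journal
    # repoint the journal symlink at the new partition or file
    ln -sf /dev/disk/by-partuuid/<new-journal-part> /var/lib/ceph/osd/ceph-20/journal
    ceph-osd -i 20 --mkjournal
    service ceph start osd.20

Doing that for 5 OSDs per failed SSD is exactly the part I'd rather not have
to encode in puppet.)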
>>
>> We also have this crazy idea of failing over to a local journal file in
>> case an SSD fails. In this model, when an SSD fails we'd quickly create a
>> new journal, either on another SSD or on the local OSD filesystem, then
>> restart the OSDs before backfilling starts. Thoughts?
>>
>> Lastly, I would also consider using 2 of the SSDs in a data pool (with the
>> other 2 SSDs holding 20 journals - probably in a RAID1, to avoid
>> backfilling 10 OSDs when an SSD fails). If that 10-to-1 journal-to-SSD
>> ratio would perform adequately, it'd give us quite a few SSDs to build a
>> dedicated high-IOPS pool.
>>
>> I'd also appreciate any other suggestions/experiences which might be
>> relevant.
>>
>> Thanks!
>> Dan
>>
>> -- Dan van der Ster || Data & Storage Services || CERN IT Department --
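PS: to make the RAID1-journal option above a bit more concrete, the layout
I'm picturing is roughly the following (a sketch with made-up device names
and paths, not something we've tried yet):

    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdu1 /dev/sdv1
    mkfs.xfs /dev/md0
    mount /dev/md0 /srv/ceph/journals
    # then in ceph.conf (with one $cluster-$id subdirectory created per OSD):
    [osd]
    osd journal = /srv/ceph/journals/$cluster-$id/journal
    osd journal size = 10240

i.e. 20 journal files sharing one mirrored SSD pair, with the open question
being whether the two S3700s can keep up with journal writes for 20 spinners.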