FWIW: based on the excellent research by Mark Nelson (http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/) we have dropped SSD journals altogether and instead went for a battery-protected controller write-back cache.

Benefits:
- No negative force multiplier: one SSD failure no longer takes down multiple OSDs
- OSD portability: move OSD drives across nodes
- OSD recovery: stick the drives into a surviving OSD node and they keep working

I agree on size=3; it seems to be the safest in all situations.

Regards,
Alex
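For anyone who wants to reproduce the size=3 / size=2 split Quentin describes below, here is a minimal sketch that drives the ceph CLI from Python. The pool names and PG counts are made up for illustration, so adjust them for your cluster:

#!/usr/bin/env python
# Sketch only: one pool with 3 copies for important VMs, one with 2 copies
# for disposable ones. Pool names and PG counts are assumptions.
import subprocess

def ceph(*args):
    subprocess.check_call(["ceph"] + list(args))

# "important" VMs: 3 copies, and block I/O rather than run on a single copy
ceph("osd", "pool", "create", "vms-important", "512")
ceph("osd", "pool", "set", "vms-important", "size", "3")
ceph("osd", "pool", "set", "vms-important", "min_size", "2")

# disposable VMs: 2 copies, cheaper but more exposed to a double failure
ceph("osd", "pool", "create", "vms-scratch", "256")
ceph("osd", "pool", "set", "vms-scratch", "size", "2")

Setting min_size=2 on the size=3 pool means Ceph blocks I/O instead of accepting writes to a single surviving copy.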
On Thu, Jul 9, 2015 at 6:38 PM, Quentin Hartman <qhart...@direwolfdigital.com> wrote:
> Sooooo, I was running with size=2, until we had a network interface on an OSD node go faulty and start corrupting data. Because ceph couldn't tell which copy was right, it caused all sorts of trouble. I might have been able to recover more gracefully had I caught the problem sooner and been able to identify the root cause right away, but as it was, we ended up labeling every VM in the cluster suspect, destroying the whole thing, and restoring from backups. I didn't manage to find the root of the problem until I was rebuilding the cluster and noticed one node "felt weird" when I was ssh'd into it. It was painful.
>
> We are currently running "important" VMs from a ceph pool with size=3, and more disposable ones from a size=2 pool, and that seems to be a reasonable tradeoff so far, giving us a bit more IO overhead than we would have running 3 for everything, but still having safety where we need it.
>
> QH
>
> On Thu, Jul 9, 2015 at 3:46 PM, Götz Reinicke <goetz.reini...@filmakademie.de> wrote:
>> Hi Warren,
>>
>> thanks for that feedback. Regarding the 2 or 3 copies, we had a lot of internal discussions and lots of pros and cons on 2 and 3 :) ... and finally decided to give 2 copies in the first - now called evaluation cluster - a chance to prove themselves.
>>
>> I bet in 2016 we will see if that was a good or a bad decision, and data loss in that scenario is OK. We evaluate. :)
>>
>> Regarding one P3700 for 12 SATA disks: do I get it right that if that P3700 fails, all 12 OSDs are lost? So that looks like a bigger risk to me from my current knowledge. Or are the P3700 so much more reliable than e.g. the S3500 or S3700?
>>
>> Or is the suggestion to add the P3700 only if we go in the direction of 20+ nodes, and until then stay without SSDs for journaling?
>>
>> I really appreciate your thoughts and feedback, and I'm aware of the fact that building a ceph cluster is some sort of knowing the specs, configuration options, math, experience, modification and feedback from best-practice real-world clusters. Finally, all clusters are unique in some way, and what works for one will not work for another.
>>
>> Thanks for the feedback, 100 kowtows. Götz
>>
>> > On 09.07.2015 at 16:58, Wang, Warren <warren_w...@cable.comcast.com> wrote:
>> >
>> > You'll take a noticeable hit on write latency. Whether or not it's tolerable will be up to you and the workload you have to capture. Large file operations are throughput-efficient without an SSD journal, as long as you have enough spindles.
>> >
>> > About the Intel P3700, you will only need 1 to keep up with 12 SATA drives. The 400 GB is probably okay if you keep the journal sizes small, but the 800 is probably safer if you plan on leaving these in production for a few years. Depends on the turnover of data on the servers.
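As a quick sanity check on journal capacity, here is a back-of-the-envelope sketch based on the rule of thumb from the Ceph docs (journal size = 2 x expected throughput x filestore max sync interval). The throughput and interval below are assumptions for 7.2k SATA drives, not measurements:

# Rule of thumb from the Ceph docs:
#   osd journal size = 2 * (expected throughput * filestore max sync interval)
# The numbers below are assumptions, for illustration only.

osd_throughput_mb_s = 150   # assumed sustained write speed of one SATA OSD
sync_interval_s     = 5     # filestore max sync interval (default 5 s)
osds_per_node       = 12

journal_mb_per_osd = 2 * osd_throughput_mb_s * sync_interval_s
total_gb = osds_per_node * journal_mb_per_osd / 1024.0

print("journal per OSD : %d MB" % journal_mb_per_osd)
print("all %d journals : %.1f GB" % (osds_per_node, total_gb))

That comes out to roughly 1.5 GB per OSD and under 20 GB per node, so the 400 GB vs 800 GB question is really about endurance and sustained write speed, not about fitting the journals.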
>> > The dual disk failure comment is pointing out that you are more exposed to data loss with 2 copies. You do need to understand that there is a possibility of 2 drives failing either simultaneously, or of a second one failing before the cluster has repaired the first. As usual, this is a decision you need to make: is that acceptable or not? We have many clusters; some are 2 and others are 3. If your data resides nowhere else, then 3 copies is the safe thing to do. That's getting harder and harder to justify, though, when the price of other storage solutions using erasure coding continues to plummet.
>> >
>> > Warren
>> >
>> > -----Original Message-----
>> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Götz Reinicke - IT Koordinator
>> > Sent: Thursday, July 09, 2015 4:47 AM
>> > To: ceph-users@lists.ceph.com
>> > Subject: Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster
>> >
>> > Hi Christian,
>> > On 09.07.15 at 09:36, Christian Balzer wrote:
>> >>
>> >> Hello,
>> >>
>> >> On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote:
>> >>
>> >>> Hi again,
>> >>>
>> >>> time is passing, and so is my budget :-/ so I have to recheck the options for a "starter" cluster. An expansion next year, maybe for an OpenStack installation or for more performance if the demands rise, is possible. The "starter" could always be used as a test or slow dark archive.
>> >>>
>> >>> At the beginning I was at 16 SATA OSDs with 4 SSDs for journals per node, but now I'm looking at 12 SATA OSDs without SSD journals. Less performance, less capacity, I know. But that's OK!
>> >>>
>> >> Leave the space to upgrade these nodes with SSDs in the future.
>> >> If your cluster grows large enough (more than 20 nodes) even a single P3700 might do the trick and will need only a PCIe slot.
>> >
>> > If I get you right, the 12-disk setup is not a bad idea, and if there turns out to be a need for SSD journals I can add the PCIe P3700.
>> >
>> > In the 12-OSD setup I should get 2 P3700s, one per 6 OSDs.
>> >
>> > Good or bad idea?
>> >
>> >>
>> >>> There should be 6, maybe 8 nodes with 12 OSDs each, with a repl. of 2.
>> >>>
>> >> Danger, Will Robinson.
>> >> This is essentially a RAID5 and you're plain asking for a double disk failure to happen.
>> >
>> > Maybe I do not understand that. size = 2 I think is more a sort of RAID1 ... ? And why am I asking for a double disk failure?
>> >
>> > Too few nodes, too few OSDs, or because of the size = 2?
>> >
>> >>
>> >> See this recent thread:
>> >> "calculating maximum number of disk and node failure that can be handled by cluster with out data loss"
>> >> for some discussion and a python script which you will need to modify for 2-disk replication.
>> >>
>> >> With a RAID5 failure calculator you're at 1 data loss event per 3.5 years...
>> >>
>> >
>> > Thanks for that thread, but I don't get the point out of it for me.
>> >
>> > I see that calculating the reliability is some sort of complex math ...
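The math does not have to be exact to be useful, though. Below is a very crude back-of-the-envelope sketch, not the script from the thread Christian mentions, of how often a size=2 cluster might hit a second, overlapping disk failure. Every input (OSD count, failure rate, recovery time, number of peer OSDs) is an assumption, and the result is extremely sensitive to them:

#!/usr/bin/env python
# Crude estimate of data-loss frequency with 2 replicas. All inputs are
# assumptions for illustration; the script from the referenced thread
# models this more carefully.

n_osds       = 72     # e.g. 6 nodes x 12 OSDs
afr          = 0.05   # assumed annualized failure rate per disk (5%)
recovery_hrs = 12.0   # assumed time to re-replicate a failed OSD's data
peers        = 60     # OSDs on other hosts that share PGs with the failed one

hours_per_year = 24 * 365.0

# Expected number of first-disk failures per year across the cluster.
first_failures_per_year = n_osds * afr

# Probability that at least one peer OSD also fails inside the recovery window.
p_peer_fails_in_window = afr * recovery_hrs / hours_per_year
p_second = 1 - (1 - p_peer_fails_in_window) ** peers

loss_events_per_year = first_failures_per_year * p_second
print("estimated data-loss events per year: %.4f" % loss_events_per_year)
print("i.e. roughly one event every %.1f years" % (1.0 / loss_events_per_year))

Longer recovery windows (large, full drives, few nodes, a busy cluster network) and higher failure rates quickly push the result towards "one event every few years", which is why size=2 on a small cluster is the worrying case.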
>> >>> The workload I expect is more like writes of some GB of office files per day and some TB of larger video files from a few users per week.
>> >>>
>> >>> At the end of this year we calculate to have +/- 60 to 80 TB of larger video files in that cluster, which are accessed from time to time.
>> >>>
>> >>> Any suggestion on the drop of SSD journals?
>> >>>
>> >> You will miss them when the cluster does write, be it from clients or when re-balancing a lost OSD.
>> >
>> > I can imagine that I might miss the SSD journals, but if I can add the P3700 later I feel comfy with it for now. Budget and evaluation related.
>> >
>> > Thanks for your helpful input and feedback.
>> >
>> > /Götz
>> >
>> > --
>> > Götz Reinicke
>> > IT-Koordinator
>> >
>> > Tel. +49 7141 969 82420
>> > E-Mail goetz.reini...@filmakademie.de
>> >
>> > Filmakademie Baden-Württemberg GmbH
>> > Akademiehof 10
>> > 71638 Ludwigsburg
>> > www.filmakademie.de
>> >
>> > Eintragung Amtsgericht Stuttgart HRB 205016
>> >
>> > Vorsitzender des Aufsichtsrats: Jürgen Walter MdL, Staatssekretär im Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg
>> >
>> > Geschäftsführer: Prof. Thomas Schadt
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com