FWIW: based on the excellent research by Mark Nelson (http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/) we have dropped SSD journals altogether and instead went for a battery-protected controller write-back cache.

Benefits:
- No negative force multiplier: one SSD failure no longer takes down multiple OSDs
- OSD portability: move OSD drives across nodes
- OSD recovery: stick the drives into a surviving OSD node and they keep working

I agree on size=3; it seems to be the safest in all situations.

Regards,
Alex
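For anyone who wants to reproduce the size=3 / size=2 split Quentin describes below, here is a minimal sketch that drives the ceph CLI from Python. The pool names and PG counts are made up for illustration, so adjust them for your cluster:

#!/usr/bin/env python
# Sketch only: one pool with 3 copies for important VMs, one with 2 copies
# for disposable ones. Pool names and PG counts are assumptions.
import subprocess

def ceph(*args):
    subprocess.check_call(["ceph"] + list(args))

# "important" VMs: 3 copies, and block I/O rather than run on a single copy
ceph("osd", "pool", "create", "vms-important", "512")
ceph("osd", "pool", "set", "vms-important", "size", "3")
ceph("osd", "pool", "set", "vms-important", "min_size", "2")

# disposable VMs: 2 copies, cheaper but more exposed to a double failure
ceph("osd", "pool", "create", "vms-scratch", "256")
ceph("osd", "pool", "set", "vms-scratch", "size", "2")

Setting min_size=2 on the size=3 pool means Ceph blocks I/O instead of accepting writes to a single surviving copy.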
On Thu, Jul 9, 2015 at 6:38 PM, Quentin Hartman <qhart...@direwolfdigital.com> wrote:
> Sooooo, I was running with size=2, until we had a network interface on an OSD node go faulty and start corrupting data. Because ceph couldn't tell which copy was right, it caused all sorts of trouble. I might have been able to recover more gracefully had I caught the problem sooner and been able to identify the root cause right away, but as it was, we ended up labeling every VM in the cluster suspect, destroying the whole thing, and restoring from backups. I didn't manage to find the root of the problem until I was rebuilding the cluster and noticed one node "felt weird" when I was ssh'd into it. It was painful.
>
> We are currently running "important" VMs from a ceph pool with size=3, and more disposable ones from a size=2 pool, and that seems to be a reasonable tradeoff so far, giving us a bit more IO overhead than we would have running 3 for everything, but still having safety where we need it.
>
> QH
>
> On Thu, Jul 9, 2015 at 3:46 PM, Götz Reinicke <goetz.reini...@filmakademie.de> wrote:
>> Hi Warren,
>>
>> thanks for that feedback. Regarding the 2 or 3 copies, we had a lot of internal discussions and lots of pros and cons on 2 and 3 :) ... and finally decided to give 2 copies in the first - now called evaluation cluster - a chance to prove themselves.
>>
>> I bet in 2016 we will see if that was a good or a bad decision, and data loss in that scenario is OK. We evaluate. :)
>>
>> Regarding one P3700 for 12 SATA disks: do I get it right that if that P3700 fails, all 12 OSDs are lost? So that looks like a bigger risk to me from my current knowledge. Or are the P3700 so much more reliable than e.g. the S3500 or S3700?
>>
>> Or is the suggestion to add the P3700 only if we go in the direction of 20+ nodes, and until then stay without SSDs for journaling?
>>
>> I really appreciate your thoughts and feedback, and I'm aware of the fact that building a ceph cluster is some sort of knowing the specs, configuration options, math, experience, modification and feedback from best-practice real-world clusters. Finally, all clusters are unique in some way, and what works for one will not work for another.
>>
>> Thanks for the feedback, 100 kowtows. Götz
>>
>> > On 09.07.2015 at 16:58, Wang, Warren <warren_w...@cable.comcast.com> wrote:
>> >
>> > You'll take a noticeable hit on write latency. Whether or not it's tolerable will be up to you and the workload you have to capture. Large file operations are throughput-efficient without an SSD journal, as long as you have enough spindles.
>> >
>> > About the Intel P3700, you will only need 1 to keep up with 12 SATA drives. The 400 GB is probably okay if you keep the journal sizes small, but the 800 is probably safer if you plan on leaving these in production for a few years. Depends on the turnover of data on the servers.
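As a quick sanity check on journal capacity, here is a back-of-the-envelope sketch based on the rule of thumb from the Ceph docs (journal size = 2 x expected throughput x filestore max sync interval). The throughput and interval below are assumptions for 7.2k SATA drives, not measurements:

# Rule of thumb from the Ceph docs:
#   osd journal size = 2 * (expected throughput * filestore max sync interval)
# The numbers below are assumptions, for illustration only.

osd_throughput_mb_s = 150   # assumed sustained write speed of one SATA OSD
sync_interval_s     = 5     # filestore max sync interval (default 5 s)
osds_per_node       = 12

journal_mb_per_osd = 2 * osd_throughput_mb_s * sync_interval_s
total_gb = osds_per_node * journal_mb_per_osd / 1024.0

print("journal per OSD : %d MB" % journal_mb_per_osd)
print("all %d journals : %.1f GB" % (osds_per_node, total_gb))

That comes out to roughly 1.5 GB per OSD and under 20 GB per node, so the 400 GB vs 800 GB question is really about endurance and sustained write speed, not about fitting the journals.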
>> > The dual disk failure comment is pointing out that you are more exposed to data loss with 2 copies. You do need to understand that there is a possibility of 2 drives failing either simultaneously, or of a second one failing before the cluster has repaired the first. As usual, this is a decision you need to make: is that acceptable or not? We have many clusters; some are 2 and others are 3. If your data resides nowhere else, then 3 copies is the safe thing to do. That's getting harder and harder to justify, though, when the price of other storage solutions using erasure coding continues to plummet.
>> >
>> > Warren
>> >
>> > -----Original Message-----
>> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Götz Reinicke - IT Koordinator
>> > Sent: Thursday, July 09, 2015 4:47 AM
>> > To: ceph-users@lists.ceph.com
>> > Subject: Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster
>> >
>> > Hi Christian,
>> > On 09.07.15 at 09:36, Christian Balzer wrote:
>> >>
>> >> Hello,
>> >>
>> >> On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote:
>> >>
>> >>> Hi again,
>> >>>
>> >>> time is passing, and so is my budget :-/ so I have to recheck the options for a "starter" cluster. An expansion next year, maybe for an OpenStack installation or for more performance if the demands rise, is possible. The "starter" could always be used as a test or slow dark archive.
>> >>>
>> >>> At the beginning I was at 16 SATA OSDs with 4 SSDs for journals per node, but now I'm looking at 12 SATA OSDs without SSD journals. Less performance, less capacity, I know. But that's OK!
>> >>>
>> >> Leave the space to upgrade these nodes with SSDs in the future.
>> >> If your cluster grows large enough (more than 20 nodes) even a single P3700 might do the trick and will need only a PCIe slot.
>> >
>> > If I get you right, the 12-disk setup is not a bad idea, and if there turns out to be a need for SSD journals I can add the PCIe P3700.
>> >
>> > In the 12-OSD setup I should get 2 P3700s, one per 6 OSDs.
>> >
>> > Good or bad idea?
>> >
>> >>
>> >>> There should be 6, maybe 8 nodes with 12 OSDs each, with a repl. of 2.
>> >>>
>> >> Danger, Will Robinson.
>> >> This is essentially a RAID5 and you're plain asking for a double disk failure to happen.
>> >
>> > Maybe I do not understand that. size = 2 I think is more a sort of RAID1 ... ? And why am I asking for a double disk failure?
>> >
>> > Too few nodes, too few OSDs, or because of the size = 2?
>> >
>> >>
>> >> See this recent thread:
>> >> "calculating maximum number of disk and node failure that can be handled by cluster with out data loss"
>> >> for some discussion and a python script which you will need to modify for 2-disk replication.
>> >>
>> >> With a RAID5 failure calculator you're at 1 data loss event per 3.5 years...
>> >>
>> >
>> > Thanks for that thread, but I don't get the point out of it for me.
>> >
>> > I see that calculating the reliability is some sort of complex math ...
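The math does not have to be exact to be useful, though. Below is a very crude back-of-the-envelope sketch, not the script from the thread Christian mentions, of how often a size=2 cluster might hit a second, overlapping disk failure. Every input (OSD count, failure rate, recovery time, number of peer OSDs) is an assumption, and the result is extremely sensitive to them:

#!/usr/bin/env python
# Crude estimate of data-loss frequency with 2 replicas. All inputs are
# assumptions for illustration; the script from the referenced thread
# models this more carefully.

n_osds       = 72     # e.g. 6 nodes x 12 OSDs
afr          = 0.05   # assumed annualized failure rate per disk (5%)
recovery_hrs = 12.0   # assumed time to re-replicate a failed OSD's data
peers        = 60     # OSDs on other hosts that share PGs with the failed one

hours_per_year = 24 * 365.0

# Expected number of first-disk failures per year across the cluster.
first_failures_per_year = n_osds * afr

# Probability that at least one peer OSD also fails inside the recovery window.
p_peer_fails_in_window = afr * recovery_hrs / hours_per_year
p_second = 1 - (1 - p_peer_fails_in_window) ** peers

loss_events_per_year = first_failures_per_year * p_second
print("estimated data-loss events per year: %.4f" % loss_events_per_year)
print("i.e. roughly one event every %.1f years" % (1.0 / loss_events_per_year))

Longer recovery windows (large, full drives, few nodes, a busy cluster network) and higher failure rates quickly push the result towards "one event every few years", which is why size=2 on a small cluster is the worrying case.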
>> >>> The workload I expect is more like writes of some GB of office files per day and some TB of larger video files from a few users per week.
>> >>>
>> >>> At the end of this year we calculate to have +/- 60 to 80 TB of larger video files in that cluster, which are accessed from time to time.
>> >>>
>> >>> Any suggestion on the drop of SSD journals?
>> >>>
>> >> You will miss them when the cluster does write, be it from clients or when re-balancing a lost OSD.
>> >
>> > I can imagine that I might miss the SSD journals, but if I can add the P3700 later I feel comfy with it for now. Budget and evaluation related.
>> >
>> > Thanks for your helpful input and feedback.
>> >
>> > /Götz
>> >
>> > --
>> > Götz Reinicke
>> > IT-Koordinator
>> >
>> > Tel. +49 7141 969 82420
>> > E-Mail goetz.reini...@filmakademie.de
>> >
>> > Filmakademie Baden-Württemberg GmbH
>> > Akademiehof 10
>> > 71638 Ludwigsburg
>> > www.filmakademie.de
>> >
>> > Eintragung Amtsgericht Stuttgart HRB 205016
>> >
>> > Vorsitzender des Aufsichtsrats: Jürgen Walter MdL, Staatssekretär im Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg
>> >
>> > Geschäftsführer: Prof. Thomas Schadt
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com