I agree that running with min_size of 1 is worse than running with only 3 failure domains. Even if it's just for a short time and you're monitoring it closely, it takes mere seconds before you could have corrupt data with min_size of 1 (depending on your use case). That right there is the key: what is your use case? What are you willing to do to maintain your system? With only 10 hosts, I would personally lean towards leaving the crush map at the host failure domain. But in my mind, the assumptions are that you have multiple legs of power, redundant power supplies in each server, bonded NICs feeding 2 switches that are powered by different power circuits, etc. That moves the onus of uptime stability from the crush map to the physical configuration of the hardware. If you're missing any or all of those redundant hardware components, then the crush map would need to step up to alleviate potential downtime.
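For reference on the min_size point: it is a per-pool setting, so it can be inspected and changed with the standard CLI. A minimal sketch, assuming a replicated pool named `rbd` (the pool name is illustrative; these commands need a live cluster):

```shell
# Show the pool's replica count and the minimum replicas required to serve IO.
ceph osd pool get rbd size
ceph osd pool get rbd min_size

# Keep min_size at 2 so a single surviving replica never accepts writes --
# accepting writes with one copy is exactly what makes min_size 1 risky.
ceph osd pool set rbd min_size 2
```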
On Fri, Jun 2, 2017 at 10:32 AM Laszlo Budai <las...@componentsoft.eu> wrote:

> What you're saying, that if we only have 3 failure domains then ceph can do nothing to maintain 3 copies in case an entire failure domain is lost, is correct.
> BUT if you're losing 2 replicas out of 3 of your data, and your min_size is set to 2 (the recommended minimum), then you have an even bigger problem: the cluster will not serve IO operations anymore. I know you can set min_size to 1 for recovery, but in my opinion that is still worse than having the cluster run without human intervention. This could be a matter of taste :).
>
> Kind regards,
> Laszlo
>
> On 02.06.2017 16:48, David Turner wrote:
> > You wouldn't be able to guarantee that the cluster will not use 2 servers from the same rack. The problem with 3 failure domains, however, is that if you lose a full failure domain, ceph can do nothing to maintain 3 copies of your data. It leaves you in a position where you need to rush to the datacenter to fix the hardware problems ASAP.
> >
> > On Fri, Jun 2, 2017, 5:14 AM Laszlo Budai <las...@componentsoft.eu> wrote:
> >
> > Hi David,
> >
> > If I understand correctly, your suggestion is the following: if we have, for instance, 12 servers grouped into 3 racks (4/rack), then you would build a crush map saying that you have 6 (virtual) racks with 2 servers in each of them, right?
> >
> > In this case, if we set the failure domain to rack and the size of a pool to 3, how do you make sure that the crush map will not use 2 servers from the same physical rack for a PG? Could you provide an example distribution of servers to virtual racks?
> >
> > Thank you,
> > Laszlo
> >
> > On 01.06.2017 22:23, David Turner wrote:
> > > The way to do this is to download your crush map and modify it manually after decompiling it to text format, or modify it with the crushtool.
> > > Once you have your crush map with the rules you want in place, you upload it to the cluster. When you change your failure domain from host to rack (or make any other failure domain change), all of your PGs will peer at the same time, so make sure you have enough memory to handle that. After that point, your cluster will just backfill the PGs from where they currently are to their new locations and then clean up after itself. It is recommended to monitor your cluster usage and adjust osd_max_backfills during this process, to optimize how fast the backfilling finishes while keeping the cluster usable by the clients.
> > >
> > > I generally recommend starting a cluster with at least n+2 failure domains, so I would recommend against going to a rack failure domain with only 3 racks. As an alternative that I've done: I've set up 6 "racks" when I only had 3 physical racks, with planned growth to a full 6 racks. When I added servers and expanded into more racks, I moved the servers to where they are represented in the crush map. So if a server is physically in rack 1 but set as rack 4 in the crush map, I would move it to the physical rack 4, then start filling out racks 1 and 4 to complete their capacity, and do the same for racks 2/5 when I start into the 5th rack.
> > >
> > > Another option to full racks in your crush map is half racks. I've also done this for clusters that wouldn't grow larger than 3 racks: have 6 failure domains of half a rack each. It lowers the chance of random drives failing in different failure domains at the same time, and gives you more servers that you can take down for maintenance at once than a host failure domain does. It doesn't resolve the issue of a single cross-link for the entire rack, or a full power failure of the rack, but it's closer.
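The download/edit/upload cycle described above looks roughly like this (file names are illustrative; verify the flags against your ceph version before running):

```shell
# Grab the compiled crush map from the cluster and decompile it to text.
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# Edit crushmap.txt: e.g. declare 6 "rack" buckets with 2 hosts each and
# make the replicated rule's chooseleaf step select across type rack.

# Recompile and upload. All PGs peer, then backfill to their new locations.
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new

# Tune backfill speed at runtime while the data moves.
ceph tell osd.* injectargs '--osd-max-backfills 2'
```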
> > > The problem with having 3 failure domains and replica 3 is that if you lose a complete failure domain, the 3rd replica has nowhere to go. If you have 4 failure domains with replica 3 and you lose an entire failure domain, you over-fill the remaining 3 failure domains, so you can only really use ~55% of your cluster capacity. With 5 failure domains things start normalizing, and losing a failure domain doesn't hit as severely. The more failure domains you have, the less losing one affects you.
> > >
> > > Let's do another scenario with 3 failure domains and replica size 3. Every OSD you lose inside a failure domain gets backfilled directly onto the remaining OSDs in that failure domain. There comes a point where a switch failure in a rack, or losing a node in the rack, could over-fill the remaining OSDs in that rack. If you have enough servers and OSDs per rack, this becomes moot... but with a smaller cluster of only 3 nodes and 4 drives in each, if you lose a drive in one of your nodes, all of its data gets distributed to the other 3 drives in that node. That means you either have to replace storage ASAP when it fails, or never fill your cluster more than ~55% if you want to be able to recover automatically from a drive failure.
> > >
> > > tl;dr: Make sure you calculate what your failure domain, replica size, drive size, etc. mean for how fast you have to replace failed storage, and how full you can run your cluster while still affording a hardware loss.
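A back-of-envelope sketch of the arithmetic behind those capacity figures (my own working, not from the original mail; the 0.75 full threshold is an assumed operating limit): when one of n failure domains is lost, the survivors must absorb its data, so the pre-failure fill level times n/(n-1) has to stay under whatever full threshold you operate at. The same formula applies whether the "domains" are 4 racks or the 4 drives in a node.

```python
def safe_fill(n_domains: int, full_threshold: float = 0.75) -> float:
    """Max fraction of raw capacity you can use and still absorb the loss
    of one failure domain: after the loss, each survivor holds
    n/(n-1) times its previous share, which must stay below full_threshold."""
    return full_threshold * (n_domains - 1) / n_domains

# With 4 failure domains you can only run at about 56% full -- in the
# neighborhood of the ~55% quoted above; more domains relax the limit.
print(round(safe_fill(4), 4))   # 0.5625
print(round(safe_fill(5), 2))   # 0.6
print(round(safe_fill(10), 3))  # 0.675
```

The takeaway matches the thread: going from 4 to 5 or more failure domains is what moves the usable-capacity ceiling meaningfully upward.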
> > > On Thu, Jun 1, 2017 at 12:40 PM Deepak Naidu <dna...@nvidia.com> wrote:
> > >
> > > Greetings Folks.
> > >
> > > Wanted to understand how ceph works when we start with rack awareness (rack-level replicas), for example 3 racks and 3 replicas in the crush map, which in the future is replaced by node awareness (node-level replicas), i.e. 3 replicas spread across nodes.
> > >
> > > This can also be vice versa. If this happens, how does ceph rearrange the "old" data? Do I need to trigger any command to ensure the data placement follows the latest crush map, or does ceph take care of it automatically?
> > >
> > > Thanks for your time.
> > >
> > > --
> > > Deepak
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com