Hi David,

If I understand correctly, your suggestion is the following:
If we have for instance 12 servers grouped into 3 racks (4/rack) then you would 
build a crush map saying that you have 6 racks (virtual ones), and 2 servers in 
each of them, right?

In this case if we are setting the failure domain to rack and the size of a 
pool to 3, how do you make sure that the crush map will not use 2 servers from 
the same physical rack for a PG? Could you provide an example of distribution 
of servers to virtual racks?
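
To make the concern concrete, here is a rough sketch that counts how often three distinct virtual racks land in the same physical rack. It assumes the rack1/rack4, rack2/rack5 pairing from the quoted reply (rack3/rack6 is my assumption), and it treats every combination of three distinct virtual racks as a possible placement. The names `VIRTUAL_TO_PHYSICAL` and `count_placements` are made up for illustration, and real CRUSH selection is weighted and pseudo-random, so this only shows which combinations a plain rack-level rule would allow.

```python
from itertools import combinations

# Hypothetical layout: virtual rack -> physical rack.
# rack1/rack4 and rack2/rack5 follow the quoted reply; rack3/rack6 is assumed.
VIRTUAL_TO_PHYSICAL = {
    "rack1": "phys1", "rack4": "phys1",
    "rack2": "phys2", "rack5": "phys2",
    "rack3": "phys3", "rack6": "phys3",
}


def count_placements(mapping, size=3):
    """Return (total, colliding): the number of ways to pick `size` distinct
    virtual racks, and how many of those picks put at least two replicas
    in the same physical rack."""
    total = 0
    colliding = 0
    for racks in combinations(sorted(mapping), size):
        total += 1
        physical = {mapping[rack] for rack in racks}
        if len(physical) < size:
            colliding += 1
    return total, colliding


total, colliding = count_placements(VIRTUAL_TO_PHYSICAL)
print(f"{colliding} of {total} rack combinations share a physical rack")
# prints "12 of 20 rack combinations share a physical rack"
```

So with a rule that only requires distinct virtual racks, most of the allowed combinations would still put two copies in one physical rack, which is why I am asking how the distribution is done.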

Thank you,
Laszlo


On 01.06.2017 22:23, David Turner wrote:
The way to do this is to download your crush map and either decompile it to 
text format and edit it manually, or modify it using crushtool.  Once you have 
your crush map with the rules in place that you want, you will upload the crush 
map to the cluster.  When you change your failure domain from host to rack, or 
any other change to failure domain, it will cause all of your PGs to peer at 
the same time.  You want to make sure that you have enough memory to handle 
this scenario.  After that point, your cluster will just backfill the PGs from 
where they currently are to their new location and then clean up after itself.  
It is recommended to monitor your cluster usage and modify osd_max_backfills 
during this process to optimize how fast you can finish your backfilling while 
keeping your cluster usable by the clients.
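
The steps above can be sketched roughly as follows. This is only an outline: the working directory, file names, and the helper `crush_edit_commands` are hypothetical, and the script just prints the commands rather than running them, so they can be reviewed first. The `ceph osd getcrushmap`, `crushtool -d`/`-c`, `ceph osd setcrushmap`, and `ceph tell osd.* injectargs` commands are the standard CLI calls; the value chosen for `osd_max_backfills` is a placeholder to tune for your cluster.

```python
import shlex


def crush_edit_commands(workdir="/tmp/crush", max_backfills=1):
    """Return the CLI steps for a manual CRUSH map edit as argv lists.

    workdir and the file names are placeholders, not anything Ceph requires.
    """
    binary = f"{workdir}/crush.bin"   # compiled map as downloaded
    text = f"{workdir}/crush.txt"     # decompiled map, edited by hand
    new = f"{workdir}/crush.new"      # recompiled map to upload
    return [
        # 1. Download the current (compiled) CRUSH map.
        ["ceph", "osd", "getcrushmap", "-o", binary],
        # 2. Decompile it to text; edit the buckets and rules in `text`.
        ["crushtool", "-d", binary, "-o", text],
        # 3. Recompile the edited text.
        ["crushtool", "-c", text, "-o", new],
        # 4. Throttle backfill before the data starts moving.
        ["ceph", "tell", "osd.*", "injectargs",
         f"--osd-max-backfills {max_backfills}"],
        # 5. Upload the new map; PGs re-peer and backfill after this.
        ["ceph", "osd", "setcrushmap", "-i", new],
    ]


for argv in crush_edit_commands():
    print(shlex.join(argv))
```

Printing instead of executing is deliberate here: the upload in the last step triggers peering for every affected PG, so it is worth checking the edited map before running it.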

I generally recommend starting a cluster with at least n+2 failure domains (n 
being the replica size), so I would recommend against going to a rack failure 
domain with only 3 racks.  As an alternative 
that I've done, I've set up 6 "racks" when I only have 3 racks with planned 
growth to a full 6 racks.  When I added servers and expanded to fill more racks, I moved 
the servers to where they are represented in the crush map.  So if it's physically in 
rack1 but it's set as rack4 in the crush map, then I would move those servers to the 
physical rack 4 and start filling out rack 1 and rack 4 to complete their capacity, then 
do the same for rack 2/5 when I start into the 5th rack.

An alternative to having full racks in your crush map is having half racks.  
I've also done this for clusters that wouldn't grow larger than 3 racks.  Have 
6 failure domains at half racks.  It lowers your chance of having random drives 
fail in different failure domains at the same time and, compared with a host 
failure domain, gives you more servers that you can run maintenance on at a 
time.  It doesn't resolve the issue of using a single cross-link for the entire 
rack or a full power failure of the rack, but it's closer.

The problem with having 3 failure domains with replica 3 is that if you lose a 
complete failure domain, then you have nowhere for the 3rd replica to go.  If 
you have 4 failure domains with replica 3 and you lose an entire failure 
domain, then its data has to fit onto the remaining 3 failure domains, so to 
avoid overfilling them you can only really use 55% of your cluster capacity.  
If you have 5 failure domains, then you start normalizing and losing a failure 
domain doesn't impact you as severely.  The more failure domains you get to, 
the less it affects you when you lose one.
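
As a rough illustration of that trend, the sketch below estimates how full a cluster can be and still absorb the loss of one whole failure domain. It assumes equally sized failure domains, perfectly even data distribution, and a fill ceiling of 0.85 on the surviving OSDs. All three are assumptions for the example, as is the helper name `max_safe_fill`. Under these idealized assumptions 4 domains come out at roughly 64%, so a lower working figure such as the 55% above leaves extra margin for uneven distribution.

```python
def max_safe_fill(domains, replicas=3, ceiling=0.85):
    """Largest cluster-wide used fraction that still fits after losing one
    failure domain, or None if the survivors cannot hold all replicas.

    Assumes equal-sized domains and even data distribution (idealized).
    """
    survivors = domains - 1
    if survivors < replicas:
        # e.g. 3 domains with replica 3: the third copy has nowhere to go.
        return None
    # The same data must fit on the survivors without exceeding the ceiling.
    return ceiling * survivors / domains


for n in (3, 4, 5, 6):
    fill = max_safe_fill(n)
    if fill is None:
        print(f"{n} failure domains: cannot re-replicate after losing one")
    else:
        print(f"{n} failure domains: keep usage under about {fill:.0%}")
```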

Let's do another scenario with 3 failure domains and replica size 3.  Every OSD 
you lose inside of a failure domain gets backfilled directly onto the remaining 
OSDs in that failure domain.  There comes a point where a switch failure in a 
rack or losing a node in the rack could overfill the remaining OSDs in that rack.  
If you have enough servers and OSDs in the rack, then this becomes moot, but 
if you have a smaller cluster with only 3 nodes and <4 drives in each, then when 
you lose a drive in one of your nodes, all of its data gets distributed to the 
remaining drives in that node.  That means you either have to replace your storage 
ASAP when it fails or never fill your cluster up more than 55% if you want to be 
able to automatically recover from a drive failure.
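
The drive-loss case can be checked with a few lines of arithmetic. This is a minimal sketch assuming equal-sized drives, even data distribution within the node, and a 0.85 fill ceiling on each surviving drive; the function names and the ceiling value are assumptions for illustration.

```python
def utilization_after_drive_loss(used_fraction, drives_per_node):
    """Used fraction of each surviving drive after one drive in the node
    fails and its data is backfilled onto the others (even spread assumed)."""
    if drives_per_node < 2:
        raise ValueError("need at least 2 drives to backfill within the node")
    return used_fraction * drives_per_node / (drives_per_node - 1)


def survives_drive_loss(used_fraction, drives_per_node, ceiling=0.85):
    """True if the surviving drives stay at or below the assumed ceiling."""
    after = utilization_after_drive_loss(used_fraction, drives_per_node)
    return after <= ceiling


for drives in (3, 4):
    for used in (0.55, 0.70):
        after = utilization_after_drive_loss(used, drives)
        ok = survives_drive_loss(used, drives)
        print(f"{drives} drives at {used:.0%}: {after:.0%} after loss, ok={ok}")
```

With 3 drives per node and this ceiling, the break-even point is about 57% (0.85 × 2/3), which is in line with the 55% figure above.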

tl;dr: Make sure you calculate what your failure domain, replica size, drive 
size, etc. mean for how fast you have to replace storage when it fails and how 
full you can fill your cluster and still afford a hardware loss.

On Thu, Jun 1, 2017 at 12:40 PM Deepak Naidu <[email protected]> wrote:

    Greetings Folks.

    I wanted to understand how Ceph behaves when we start out rack aware
    (rack-level replicas, for example 3 racks and 3 replicas in the crushmap)
    and the crushmap is later replaced by a node-aware one (node-level
    replicas), i.e. 3 replicas spread across nodes.

    This can be vice versa. If this happens, how does Ceph rearrange the "old"
    data? Do I need to trigger any command to ensure the data placement is
    based on the latest crushmap, or does Ceph take care of it automatically?

    Thanks for your time.

    --
    Deepak

    _______________________________________________
    ceph-users mailing list
    [email protected]
    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


