Hi,

As part of our testing over a period of time, we had accumulated a lot of parameters in ceph.conf. With that configuration, we observed the issues mentioned earlier when we pulled down 2 sites.

In the last couple of days, we cleaned up a lot of those parameters and configured only a couple of mandatory ones, and we are not seeing any issues now when we bring down 2 sites. FYI.
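In case it is useful to others: the trimmed file ended up close to a stock ceph.conf. A minimal sketch of what "only the mandatory parameters" can look like (the fsid, monitor names and addresses below are placeholders for illustration, not our actual values):

--------------
[global]
fsid = <cluster fsid>
mon_initial_members = mon-site-a, mon-site-b, mon-site-c
mon_host = 10.10.1.10,10.10.2.10,10.10.3.10
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = 10.10.0.0/16
cluster_network = 192.168.0.0/16
--------------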
Thanks & Regards,
Manoj

On Sat, Aug 6, 2016 at 8:23 PM, Venkata Manojawa Paritala
<manojaw...@vedams.com> wrote:

> Hi,
>
> We have configured a single Ceph cluster in a lab with the below
> specification.
>
> 1. Divided the cluster into 3 logical sites (SiteA, SiteB & SiteC). This
> is to simulate nodes that are part of different data centers, with network
> connectivity between them for DR.
> 2. Each site operates in a different subnet, and each subnet is part of
> one VLAN. We have configured routing so that OSD nodes in one site can
> communicate with OSD nodes in the other 2 sites.
> 3. Each site has one monitor node, 2 OSD nodes (to which we have disks
> attached) and IO-generating clients.
> 4. We have configured 2 networks.
> 4.1. Public network - to which all the clients, monitors and OSD nodes
> are connected.
> 4.2. Cluster network - to which only the OSD nodes are connected, for
> replication/recovery/heartbeat traffic.
>
> 5. We have 2 issues here.
> 5.1. We are unable to sustain IO for clients from individual sites when
> we isolate the OSD nodes by bringing down ONLY the cluster network between
> sites. Logically this puts the individual sites in isolation with respect
> to the cluster network. Please note that the public network is still
> connected between the sites.
> 5.2. In a fully functional cluster, when we bring down 2 sites (shut down
> the OSD services of 2 sites - say Site A OSDs and Site B OSDs), the OSDs
> in the third site (Site C) also go down (OSD flapping).
>
> We need workarounds/solutions to fix the above 2 issues.
>
> Below are some of the parameters we have already set in ceph.conf to
> sustain the cluster for a longer time when we cut off the links between
> sites. But they were not successful.
>
> --------------
> [global]
> public_network = 10.10.0.0/16
> cluster_network = 192.168.100.0/16,192.168.150.0/16,192.168.200.0/16
> osd heartbeat addr = 172.16.0.0/16
>
> [mon]
> mon osd report timeout = 1800
>
> [osd]
> osd heartbeat interval = 12
> osd heartbeat grace = 60
> osd mon heartbeat interval = 60
> osd mon report interval max = 300
> osd mon report interval min = 10
> osd mon ack timeout = 60
> .
> .
> ----------------
>
> We also configured the parameter "osd_heartbeat_addr" and tried it with
> two values: 1) the Ceph public network (assuming that when we bring down
> the cluster network, heartbeats would go via the public network); 2) a
> different network range altogether, with physical connections in place.
> But neither option worked.
>
> We have a total of 49 OSDs (14 in Site A, 14 in Site B, 21 in Site C) in
> the cluster, and one monitor in each site.
>
> We need to try the below two options.
>
> A) Increase the "mon osd min down reporters" value. The question is by
> how much. Say, if I set this value to 49, will client IO be sustained when
> we cut off the cluster network links between sites? One issue in this case
> would be that if an OSD is really down, we wouldn't know.
>
> B) Add 2 monitors to each site. This would give each site 3 monitors,
> and the overall cluster would have 9 monitors. The reason we want to try
> this is that we think the OSDs are going down because the quorum is unable
> to find the minimum number of nodes (maybe monitors) to sustain it.
>
> Thanks & Regards,
> Manoj
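P.S. A few concrete follow-ups on the quoted mail, for anyone reproducing this setup.

The heartbeat tunables we listed can also be changed on a running cluster without restarting daemons, which makes experiments like these much faster. Illustrative commands only (the values are just the ones from our conf, not a recommendation):

--------------
ceph tell osd.* injectargs '--osd_heartbeat_grace 60 --osd_heartbeat_interval 12'

# verify on a node that has the admin socket for osd.0:
ceph daemon osd.0 config show | grep heartbeat
--------------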
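On option A: "mon osd min down reporters" probably does not need to be anywhere near 49. The reporters only have to outnumber the largest group of OSDs that can end up on one side of a partition, which in our layout is the 21 OSDs of Site C. So a value just above that might be enough - for example 22 below, which is an untested guess on our part, not a verified setting:

--------------
[mon]
mon osd min down reporters = 22
--------------

or injected at runtime:

--------------
ceph tell mon.* injectargs '--mon_osd_min_down_reporters 22'
--------------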
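On option B: one thing to keep in mind is that monitor quorum is a strict majority, so adding monitors does not change the fraction of sites that can be lost. With 3 monitors, quorum needs 2; with 9 monitors, quorum needs 5, and losing two full sites would take out 6 of the 9. Extra monitors per site only protect against monitor failures within a site. Quorum state can be checked at any point with:

--------------
ceph quorum_status
ceph mon stat
--------------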
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com