Dear Marc,

thank you for your endurance. I had another, slightly different "meltdown", this 
time throwing out the MGRs, and I adjusted yet another beacon grace time. 
Fortunately, after your communication, I didn't need to look very long.

To harden our cluster a bit further, I would like to adjust a number of 
advanced parameters I found after your hints. I would be most grateful if you 
(or anyone else receiving this) still have enough endurance left and could 
check whether what I want to do makes sense and if the choices I suggest will 
achieve what I want.

Parameters are listed with the relevant section of the documentation, the default 
in "{}", the current value plain, and the new value prefixed with "*". There is an 
error in the documentation; please let me know if my interpretation is correct.


MON-MGR beacon adjustments
--------------------------
https://docs.ceph.com/docs/mimic/mgr/administrator/

mon mgr beacon grace {30}              300

This helped mitigate the second type of meltdown. I took twice the longest 
observed "mon slow op" time to be safe (MGR beacon handling was the slow op). Our 
MGRs are no longer thrown out during the incident (see the very end for more 
info).
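
For completeness, roughly how this can be applied (a sketch only; I assume the 
Mimic config database here, the same could of course go into ceph.conf under 
[mon], and the mon ID matching the short hostname is an assumption on my side):

  # raise the MGR beacon grace on the MONs (seconds)
  ceph config set mon mon_mgr_beacon_grace 300

  # check the value a running MON actually uses
  ceph daemon mon.$(hostname -s) config get mon_mgr_beacon_grace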


MON-OSD communication adjustments
---------------------------------
https://docs.ceph.com/docs/mimic/rados/configuration/mon-osd-interaction/

osd beacon report interval {300}        300
mon osd report timeout {900}            3600
mon osd min down reporters {2}         *3
mon osd reporter subtree level {host}  *datacenter
mon osd down out subtree limit {rack}  *host

"mon osd report timeout" is increased after your recommendation. It is set to a 
really high value as I don't see this critical for fail-over (the default 
time-out suggests that this is merely for clean-up and not essential for 
healthy I/O). OSDs are no longer thrown out in case of the incident (see very 
end for more info).
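
Roughly how I would apply it (a sketch, same assumptions as above; the beacon 
interval stays at its default):

  # OSD beacons stay at the default of 300 seconds,
  # only the mon-side timeout is raised
  ceph config set mon mon_osd_report_timeout 3600
  ceph daemon mon.$(hostname -s) config get mon_osd_report_timeout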

"down reporter options": We have 3 sites (sub-clusters) under region in our 
crush map (see below). Each of these regions can be considered "equally laggy" 
as described in the documentation. I do not want a laggy site to mark down OSDs 
from another (healthy) site without a single OSD of the other site confirming 
an issue. I would like to require that at least 1 OSD from each site needs to 
report an OSD down before something happens. Does "3" and "datacenter" achieve 
what I want? Is this a reasonable choice with our crush map?

Note that, as a special case, DC2 currently links to some of the hosts of DC3 
(this will change in the future).
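
If my reading is right, the change would be something like (sketch, same 
assumptions as above):

  # require down reports from at least 3 OSDs and count
  # reporters per datacenter instead of per host
  ceph config set mon mon_osd_min_down_reporters 3
  ceph config set mon mon_osd_reporter_subtree_level datacenter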

"mon osd down out subtree limit": A host in our cluster is currently the atomic 
unit which, if it goes down, should not trigger rebalancing on the cluster as 
this indicates a server and not a disk fail. In addition, if I understand it 
correctly, this will also act as an automatic "noout" on host level if, for 
example, a host gets rebooted.
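
Again, the intended change as a sketch:

  # never automatically mark OSDs "out" when an entire host is down
  ceph config set mon mon_osd_down_out_subtree_limit host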


mon osd laggy *

I saw tuning parameters for laggy OSDs. However, our incidents happen very 
sporadically and are extremely drastic. I do not think that any reasonable 
estimator will be able to handle that, so my working hypothesis is that I 
should not touch these.
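
To be sure they really are still at their defaults, one can simply look at a 
running MON (sketch, run on a MON host):

  # inspect the laggy-related settings without changing anything
  ceph daemon mon.$(hostname -s) config show | grep laggy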


Error in documentation
----------------------

https://docs.ceph.com/docs/mimic/rados/configuration/mon-osd-interaction/#osds-report-their-status

osd_mon_report_interval_max {Error ENOENT:}
osd beacon report interval

The documentation mentions "osd mon report interval max", which doesn't exist. 
However, "osd beacon report interval" exists but is not mentioned. I assume the 
second replaced the first?
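
For reference, this is how one can check which options a running daemon actually 
knows about (a sketch; it has to be run on the host where osd.0 lives):

  # list the report/beacon interval options a running OSD knows about
  ceph daemon osd.0 config show | grep -E 'report_interval|beacon'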


Condensed crush tree
--------------------

region R1
    datacenter DC1
        room DC1-R1
            host ceph-08    host ceph-09    host ceph-10    host ceph-11
            host ceph-12    host ceph-13    host ceph-14    host ceph-15
            host ceph-16    host ceph-17
    datacenter DC2
        host ceph-04    host ceph-05    host ceph-06    host ceph-07
        host ceph-18    host ceph-19    host ceph-20
    datacenter DC3
        room DC3-R1
            host ceph-04    host ceph-05    host ceph-06    host ceph-07
            host ceph-18    host ceph-19    host ceph-20    host ceph-21
            host ceph-22

Additional info about our meltdowns:

With "mon mgr beacon grace" and "mon osd report timeout" set to really high 
values, I finally managed to isolate a signal in our recordings that is 
connected with these strange incidents. It looks like a package storm is 
hitting exactly two MON+MGR nodes, leading to beacon time-outs with default 
settings. I will not continue this here, but rather prepare another thread 
"Cluster outage due to client IO" after checking network hardware. It looks as 
if two MON+MGR nodes are desperately trying to talk to each other but fail.

And this after only 1.5 years of relationship :)

Thanks for making it all the way down a second time!

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Marc Roos <m.r...@f1-outsourcing.eu>
Sent: 06 May 2020 19:19
To: ag; brad.swanson; dan; Frank Schilder
Cc: ceph-users
Subject: RE: [ceph-users] Re: Ceph meltdown, need help

Made it all the way down ;) Thank you very much for the detailed info.