Dear fellow cephers,

We have a nasty and at the same time mysterious problem with our mimic 13.2.8 
cluster. The most prominent observations are:

- benign operations like adding disks sometimes lead to timeouts and OSDs 
being marked down for no apparent reason, while at other times everything is 
normal (see 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/VO7FGPDKRDUCENTWR3YDF5Y6KSIQXMLB/)
- out of the blue, OSDs or other daemons get marked down for a period of time, 
after which everything suddenly goes back to normal (see 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QVQTAL6ZK6SII2PSFXUNWPSOSHL5C6WO/)

All this happens while everything is physically healthy: no network problems, 
no disk problems, nothing. No daemon actually crashes and the logs do not 
contain any trace of a malfunction.

At first I thought that some clients had overloaded the cluster, but now I'm 
leaning towards a very rarely hit bug in the MON daemons. My current hypothesis 
is that a MON daemon can enter an ill state without notice: there is nothing in 
the logs and its peers don't recognize anything either. An affected MON daemon 
keeps running and continues to communicate with everything else, yet every now 
and then it goes bonkers for a while, taking the entire cluster out of service. 
A single ill MON is enough for this to happen, making this a rare but 
high-impact problem.

Please find below a collection of observations and attempts to address the 
periodic service outages we experienced. Unfortunately, the observations are 
somewhat inconclusive and I hope someone here recognizes a smoking gun. The 
information was collected over a period of 6 weeks.

I tried to organize the information below by relevance/significance, but also 
aim to provide a complete account just in case. The structure is as 
follows:

  1. Description of incident
  2. Actions taken to get the cluster under control
  3. Potential (temporary) resolution
  4. Observations collected during various occurrences

==========================
1. Description of incident

After setting mon_osd_report_timeout and mon_mgr_beacon_grace to 24h, I managed 
to isolate a signal in our monitoring data that clearly marks the window of an 
incident; see https://imgur.com/a/cg9Yf4E . The images show the network packet 
reports collected with ganglia on the server and client side. The blue curve 
is packets out and the green curve is packets in. The incident started at ca. 
10:15 and is clearly visible as a plateau. Note that we had another incident 
shortly before that one, ending around 9:30; its end is also visible.
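
For completeness, this is roughly how such timeouts can be raised with the 
centralized config available in mimic (values are in seconds, 24h = 86400; a 
sketch, not necessarily the exact commands I typed):

  ceph config set mon mon_osd_report_timeout 86400
  ceph config set mon mon_mgr_beacon_grace 86400

The same values can also be injected at runtime via "ceph tell mon.* 
injectargs ..." if needed.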

During an incident, typically two MONs start sending an enormous number of 
packets out, up to 40000 pkts/s. For comparison, normal operation is only about 
200 pkts/s. During an incident, the leader stops processing OSD and MGR 
beacons, but happily marks everything down. I don't remember any other 
operation getting stuck; in particular, MDS beacons were not affected. The CPU 
load of the MONs sending packets is slightly above 100%. Memory consumption is 
normal.
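
In case someone wants to look for the same pattern: I identified the leader 
and watched per-interface packet rates on the MON hosts roughly like this 
(assuming sysstat is installed):

  ceph quorum_status -f json-pretty | grep quorum_leader_name
  sar -n DEV 1        # per-interface rxpck/s and txpck/s, once per second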

It looks like the packets are sent to a subset of ceph fs clients. I'm not 
sure if this is connected to actual I/O the clients are doing; I have seen 
incidents in both cases, with heavy and with absolutely no I/O load on the file 
system. The images show an incident with no correlated I/O going on. However, 
it is remarkable that all affected clients were running a specific piece of 
software of a specific user, so clients seem to play a role, although it looks 
unlikely that ordinary file operations are causing this.

During an incident window, one has very little time to react. During the first 
4-5 minutes the cluster still responds to admin commands like setting OSD 
flags. As soon as the MONs start reporting slow ops, everything gets stuck and 
one can only wait for it to be over. It's a complete loss of control. 
Typically, OSDs and MGRs get marked down en masse and service is lost without a 
chance to do anything. The cluster is physically healthy the entire time; I did 
not observe any actual crashes or spontaneous restarts.
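
If one catches it within those first minutes, setting the flags the usual way 
still goes through:

  ceph osd set nodown
  ceph osd set noout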

The incident is not (easily?) reproducible. I observed two software packages 
running while incidents happened. However, the same software was also running 
without any issues at other times. The frequency of incident windows was 1-2 
per day during the worst periods. It looks like certain client software 
(operations) increases the chance of this happening, but doesn't always trigger 
it.

I collected a lot of information during a one-week period with several 
incidents, such as

- MON and OSD logs
- full aggregated system log
- dumps of ops and historic ops during incidents
- tcpdumps in both promiscuous and non-promiscuous mode
- a full tcpdump capture of 10000 packets (ca. 0.25s) on the leader during an 
incident window
- more graphs with other data
- results of experiments with different actions

So far I couldn't find anything, probably because I don't know what to look 
for. I'm happy to make any information available if it helps.
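
For reference, the ops and historic ops dumps came from the admin sockets, 
roughly like this (daemon IDs are just examples):

  ceph daemon mon.ceph-01 ops
  ceph daemon osd.0 dump_ops_in_flight
  ceph daemon osd.0 dump_historic_ops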

==========================
2. Actions taken to get the cluster under control

The first incident of long ping times 
(https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/VO7FGPDKRDUCENTWR3YDF5Y6KSIQXMLB/),
which I believe is related, happened while I was adding a small number of 
OSDs. The problem went away by itself and I didn't observe it again when adding 
more OSDs later.

I discovered the second incident, a total cluster outage 
(https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QVQTAL6ZK6SII2PSFXUNWPSOSHL5C6WO/),
very shortly after it started. By coincidence I was looking at the dashboard at 
the right moment. However, I got to the command line too late. About 2/3 of 
all OSDs were marked down. The cluster still responded to admin commands and I 
managed to set nodown and noout, which helped a little. I got a very important 
hint in the ceph-users thread, and after setting mon_osd_report_timeout to 1h 
the cluster immediately became healthy. I'm not entirely sure though if this 
was due to the timeout change or the incident stopping by itself, with the 
command only returning after the incident was over - it might have been 
coincidence.

Thereafter, things gradually got worse and worse. The incident windows became 
longer and occurred more and more often. Instead of OSDs being marked down, it 
now hit the MGRs. I therefore also increased mon_mgr_beacon_grace to larger and 
larger values to keep the cluster stable and to have a chance to debug and 
collect information.

At some point, the incident windows exceeded 1h and OSDs got affected again. We 
had a couple more total outages, and at some point I set the timeouts to 24h, 
which would allow me to check just twice a day. The longest incident window I 
observed was something like 1h 15m.

After this, I tried to catch incidents and experiment with different actions 
to see what would happen. I may have discovered a way to resolve the problem, 
at least temporarily; see the next section. Results of other experiments are 
listed thereafter.

==========================
3. Potential (temporary) resolution

Restarting the "ill" monitor resolves the issue. I don't know if this issue 
arises at MON startup or develops over time under certain conditions. 
Therefore, I cannot say if a clean restart resolves it permanently or if the 
monitor can fall ill again. I still have the 24h beacon timeouts in place just 
to be sure.

In my case, the leader was probably the culprit. After restarting the leader 
about 3 weeks ago I have not seen a single incident again.
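
For the record, this was just an ordinary service restart on the MON host 
(assuming a systemd deployment; the MON ID here is an example):

  systemctl restart ceph-mon@ceph-01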

A possible clue for whether or not a cluster is affected could be the 
distribution of client connections the monitors hold. I observed an incredibly 
uneven distribution like ceph-01 (leader) 1500, ceph-02 70, ceph-03 300. After 
restarting ceph-02, nothing happened. After restarting ceph-01, the MONs 
immediately started re-balancing client connections and ended up with something 
like ceph-01 650, ceph-02 850, ceph-03 650. This converged over the last 3 
weeks to around ceph-01 725, ceph-02 815, ceph-03 625 and varies only a little.

It seems that a highly imbalanced distribution of client connections is an 
early indicator. I have been watching the distribution since then.
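
The connection counts above came from the MON admin sockets, roughly like this 
(run on each MON host; assuming the MON IDs match the short host names):

  ceph daemon mon.$(hostname -s) sessions | grep -c client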

Another early sign might be that the active MGR does not like to run on a 
specific host any more. Our MGR was always running on ceph-01 (together with 
the MON leader). At some point, however, I observed that it started failing 
over to ceph-02 for, again, no apparent reason. There was no actual MGR 
restart. Even if I forced the MGR on ceph-01 to be active, after a while a 
different MGR would become active again.

After restarting the "ill" MON, this also stopped.
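
By "forced" I simply mean failing the currently active MGR so that a standby 
takes over, roughly like this (MGR names are examples):

  ceph mgr dump | grep active_name    # which MGR is currently active
  ceph mgr fail ceph-02               # fail it so a standby takes over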

==========================
4. Observations collected during various occurrences

- setting nodown and noout in time will prevent service outage
- if OSDs remain up, client I/O is unaffected (throughput-wise); the incident 
does not add critical load
- restarting a MON with slow ops does not help, probably unless it is the right 
one; I never dared to restart the leader during an incident though
- restarting MGRs makes things worse
- disabling the dashboard does not help (I suspected a bug in the dashboard for 
a while)
- during an incident, the general network load on the entire cluster 
increases; with tcpdump in promiscuous mode (see the example below) I could see 
a very large increase of packets between all (?, many?) servers; network 
hardware was never challenged though
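
The captures mentioned above were taken along these lines (interface and file 
names are examples; -p disables promiscuous mode):

  tcpdump -p -n -i eth0 -c 10000 -w mon_local.pcap       # non-promiscuous
  tcpdump -n -i eth0 -c 10000 -w mon_promiscuous.pcap    # promiscuous (default)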


I hope some of this makes sense to someone and helps to isolate the root 
cause.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14