We had some heartbeat ping problems on the back/front interfaces after upgrading from filestore to bluestore. It turned out to be related to insufficient memory/swap usage.
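In case it helps, a rough sketch of the kind of checks that can confirm memory pressure on an OSD host, plus one possible knob (the OSD id and the 4 GiB value are example values only, not a recommendation):

    # check overall memory use and swap activity on the OSD host
    free -h
    vmstat 1 5

    # inspect per-OSD memory pool usage via the admin socket
    ceph daemon osd.0 dump_mempools

    # on versions that support osd_memory_target, lower the per-OSD
    # memory target, e.g. to 4 GiB
    ceph config set osd osd_memory_target 4294967296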
> On 6 May 2020, at 22:08, Frank Schilder <fr...@dtu.dk> wrote:
>
> To answer some of my own questions:
>
> 1) Setting
>
> ceph osd set noout
> ceph osd set nodown
> ceph osd set norebalance
>
> before restart/re-deployment did not harm. I don't know if it helped, because I didn't retry the procedure that led to OSDs going down. See also point 3 below.
>
> 2) A peculiarity of this specific deployment of 2 OSDs was that it was a mix of OSD deployment and restart after a reboot. I'm working on getting this sorted, but that is a different story. For anyone who might find him-/herself in a situation where some OSDs are temporarily down/out with PGs remapped and objects degraded for whatever reason while new OSDs come up, the way to have ceph rescan the down/out OSDs after they come up is to:
>
> - "ceph osd crush move" the new OSDs temporarily to a location outside the crush subtree covering any pools (I have such a parking space in the crush hierarchy for easy draining and parking of disks)
> - bring up the down/out OSDs
> - at this point, the cluster will fall back to the original crush map that was in place when the OSDs went down/out
> - the cluster will now find all shards that went orphan, and health will be restored very quickly
> - once the cluster is healthy, "ceph osd crush move" the new OSDs back to their desired location
> - now you will see remapped PGs/misplaced objects, but no degraded objects
>
> 3) I still don't have an answer as to why long heartbeat ping times were observed. There seems to be a more serious issue, and this will continue in its own thread, "Cluster outage due to client IO", to be opened soon.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <fr...@dtu.dk>
> Sent: 25 April 2020 15:34:25
> To: ceph-users
> Subject: [ceph-users] Data loss by adding 2OSD causing Long heartbeat ping times
>
> Dear all,
>
> Two days ago I added a few disks to a ceph cluster and ran into a problem I have never seen before when doing that. The entire cluster was deployed with mimic 13.2.2 and recently upgraded to 13.2.8. This is the first time I added OSDs under 13.2.8.
>
> I had a few hosts that I needed to add 1 or 2 OSDs to, and I started with one that needed 1. The procedure was as usual:
>
> ceph osd set norebalance
> deploy additional OSD
>
> The OSD came up and PGs started peering, so far so good. To my surprise, however, I started seeing health warnings about slow ping times:
>
> Long heartbeat ping times on back interface seen, longest is 1171.910 msec
> Long heartbeat ping times on front interface seen, longest is 1180.764 msec
>
> After peering it looked like it got better, and I waited it out until the messages were gone. This took a really long time, at least 5-10 minutes.
>
> I went on to the next host and deployed 2 new OSDs this time. Same as above, but with much worse consequences. Apparently, the ping times exceeded a timeout for a very short moment and an OSD was marked out for ca. 2 seconds. Now all hell broke loose. I got health errors with the dreaded "backfill_toofull", undersized PGs and a large number of degraded objects. I don't know what is causing what, but I ended up with data loss by just adding 2 disks.
>
> We have dedicated network hardware, and each of the OSD hosts has 20 GBit front and 40 GBit back network capacity (LACP trunking). There are currently no more than 16 disks per server.
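For anyone who wants to reproduce the "parking" procedure described under point 2 above, here is a minimal sketch of what the crush moves could look like (the bucket name "parking", the OSD ids and the destination host name are made up; adapt them to your own crush tree):

    # one-time: create a parking bucket that is not attached to any pool's crush root
    ceph osd crush add-bucket parking host

    # temporarily park the freshly deployed OSDs there, so the cluster falls back
    # to the crush mapping that was in place before the old OSDs went down/out
    ceph osd crush move osd.120 host=parking
    ceph osd crush move osd.121 host=parking

    # bring the down/out OSDs back up, e.g.
    systemctl start ceph-osd@17

    # once the cluster is healthy again, move the new OSDs to their real host;
    # expect remapped PGs/misplaced objects afterwards, but no degraded objects
    ceph osd crush move osd.120 host=ceph-osd-07
    ceph osd crush move osd.121 host=ceph-osd-07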
> The disks were added to an SSD pool. There was no traffic nor any other exceptional load on the system. I have ganglia resource monitoring on all nodes and cannot see a single curve going up. Network, CPU utilisation, load: everything below measurement accuracy. The hosts and network are quite overpowered and dimensioned to host many more OSDs (in future expansions).
>
> I have three questions, ordered by how urgently I need an answer:
>
> 1) I need to add more disks next week and need a workaround. Will something like this help avoid the heartbeat time-out:
>
> ceph osd set noout
> ceph osd set nodown
> ceph osd set norebalance
>
> 2) The "lost" shards of the degraded objects were obviously still on the cluster somewhere. Is there any way to force the cluster to rescan OSDs for the shards that went orphan during the incident?
>
> 3) This smells a bit like a bug that requires attention. I was probably just lucky that I only lost 1 shard per PG. Has something similar been reported before? Is this fixed in 13.2.10? Is it something new? Any settings that need to be looked at? If logs need to be collected, I can do so during my next attempt. However, I cannot risk the data integrity of a production cluster and will, therefore, probably not run the original procedure again.
>
> Many thanks for your help and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io