We had some ping back/front problems after upgrading from filestore to
bluestore. It turned out to be related to insufficient memory and swap usage.
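
In case it is useful to others, a minimal sketch of one way to cap bluestore
memory per OSD (assuming a release that has osd_memory_target and the central
config database; the 4 GiB value is only an example, size it to the RAM
actually available per OSD on your hosts):

# cap the bluestore cache autotuner per OSD (example value: 4 GiB)
ceph config set osd osd_memory_target 4294967296
# verify what a running OSD picked up (run on that OSD's host)
ceph daemon osd.0 config get osd_memory_target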

> On 6 May 2020, at 22:08, Frank Schilder <fr...@dtu.dk> wrote:
> 
> To answer some of my own questions:
> 
> 1) Setting
> 
> ceph osd set noout
> ceph osd set nodown
> ceph osd set norebalance
> 
> before the restart/re-deployment did no harm. I don't know if it helped, because
> I didn't retry the procedure that led to OSDs going down. See also point 3 
> below.
> 
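> For completeness, the flags need to be removed again once the maintenance is
> done:
> 
> ceph osd unset noout
> ceph osd unset nodown
> ceph osd unset norebalance
> 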
> 2) A peculiarity of this specific deployment of 2 OSDs was that it was a mix
> of OSD deployment and restart after a reboot. I'm working on getting this
> sorted, but that is a different story. For anyone who finds themselves in a
> situation where some OSDs are temporarily down/out with PGs remapped and
> objects degraded for whatever reason while new OSDs come up, the way to have
> ceph rescan the down/out OSDs after they come up is the following (see also
> the command sketch after the list):
> 
> - "ceph osd crush move" the new OSDs temporarily to a location outside the 
> crush sub tree covering any pools (I have such a parking space in the crush 
> hierarchy for easy draining and parking disks)
> - bring up the down/out OSDs
> - at this point, the cluster will fall back to the original crush map that 
> was in place when the OSDs went down/out
> - the cluster will now find all shards that were orphaned and health will be
> restored very quickly
> - once the cluster is healthy, "ceph osd crush move" the new OSDs back to 
> their desired location
> - now you will see remapped PGs/misplaced objects, but no degraded objects
> 
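> As a sketch, with hypothetical names (osd.50/osd.51 are the newly deployed
> OSDs, "parking" is a root bucket that no crush rule references, osd.42 is one
> of the down/out OSDs, and ceph-01 is the intended host of the new OSDs):
> 
> ceph osd crush move osd.50 root=parking
> ceph osd crush move osd.51 root=parking
> # on the host of the down/out OSDs (assuming a systemd deployment):
> systemctl start ceph-osd@42
> # wait until health is restored, then place the new OSDs where they belong:
> ceph osd crush move osd.50 root=default host=ceph-01
> ceph osd crush move osd.51 root=default host=ceph-01
> 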
> 3) I still don't have an answer as to why long heartbeat ping times were
> observed. There seems to be a more serious issue; this will be continued in
> its own thread, "Cluster outage due to client IO", to be opened soon.
> 
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> ________________________________________
> From: Frank Schilder <fr...@dtu.dk>
> Sent: 25 April 2020 15:34:25
> To: ceph-users
> Subject: [ceph-users] Data loss by adding 2OSD causing Long heartbeat ping 
> times
> 
> Dear all,
> 
> Two days ago I added a few disks to a ceph cluster and ran into a problem I
> had never seen before when doing so. The entire cluster was deployed with
> mimic 13.2.2 and recently upgraded to 13.2.8. This is the first time I have
> added OSDs under 13.2.8.
> 
> I had a few hosts that I needed to add 1 or 2 OSDs to, and I started with one
> that needed 1. The procedure was as usual:
> 
> ceph osd set norebalance
> deploy additional OSD
> 
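> where "deploy additional OSD" could, for example, be a ceph-volume call (the
> device path below is just a placeholder):
> 
> ceph-volume lvm create --data /dev/sdX
> 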
> The OSD came up and PGs started peering, so far so good. To my surprise, 
> however, I started seeing health-warnings about slow ping times:
> 
> Long heartbeat ping times on back interface seen, longest is 1171.910 msec
> Long heartbeat ping times on front interface seen, longest is 1180.764 msec
> 
> After peering it looked like things got better, and I waited until the
> messages were gone. This took a really long time, at least 5-10 minutes.
> 
> I went on to the next host and deployed 2 new OSDs this time. Same as above, 
> but with much worse consequences. Apparently, the ping times exceeded a 
> timeout for a very short moment and an OSD was marked out for ca. 2 seconds. 
> Now all hell broke loose. I got health errors with the dreaded 
> "backfill_toofull", undersized PGs and a large amount of degraded objects. I 
> don't know what is causing what, but I ended up with data loss by just adding 
> 2 disks.
> 
> We have dedicated network hardware, and each of the OSD hosts has 20 GBit
> front and 40 GBit back network capacity (LACP trunking). There are currently
> no more than 16 disks per server. The disks were added to an SSD pool. There
> was neither client traffic nor any other exceptional load on the system. I
> have ganglia resource monitoring on all nodes and cannot see a single curve
> going up; network, CPU utilisation, load, everything stayed below measurement
> accuracy. The hosts and network are quite overpowered and dimensioned to host
> many more OSDs (in future expansions).
> 
> I have three questions, ordered by how urgently I need an answer:
> 
> 1) I need to add more disks next week and need a workaround. Will something
> like this help avoid the heartbeat time-out:
> 
> ceph osd set noout
> ceph osd set nodown
> ceph osd set norebalance
> 
> 2) The "lost" shards of the degraded objects were obviously still on the 
> cluster somewhere. Is there any way to force the cluster to rescan OSDs for 
> the shards that were orphaned during the incident?
> 
> 3) This smells a bit like a bug that requires attention. I was probably just
> lucky that I only lost 1 shard per PG. Has something similar been reported
> before? Is this fixed in 13.2.10? Is it something new? Are there any settings
> that need to be looked at? If logs need to be collected, I can do so during my
> next attempt. However, I cannot risk the data integrity of a production
> cluster and will therefore probably not run the original procedure again.
> 
> Many thanks for your help and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
