I'm seeing this again on two OSDs after adding another 20 disks to the cluster. Is there some way I can determine which snapshots the recovery process is looking for? Or, failing that, find and remove the objects it's trying to recover, since there's apparently a problem with them? Thanks!
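For reference, this is roughly the direction I was thinking of. It's only a sketch; the OSD id, PG id, pool, and object names below are placeholders for whatever the logs and pg query actually point at.

  # Turn up logging on the crashing OSD so the recovery push that trips the
  # assert also shows which object it was handling (osd.126 is a placeholder).
  ceph tell osd.126 injectargs '--debug-osd 10'

  # Find the PGs that are still degraded/recovering and inspect one of them
  # (3.1ab is a placeholder PG id).
  ceph health detail
  ceph pg 3.1ab query | less

  # Once an object name turns up in the log, ask RADOS which clones/snaps it
  # thinks that object has (pool and object name are placeholders).
  rados -p rbd listsnaps rbd_data.<prefix>.<object-number>

  # If removing the bad object is the answer, I assume it has to happen with
  # the OSD stopped, via something like ceph-objectstore-tool (paths and PG id
  # are placeholders; the object spec comes from the list output):
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-126 \
      --journal-path /var/lib/ceph/osd/ceph-126/journal \
      --op list --pgid 3.1ab
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-126 \
      --journal-path /var/lib/ceph/osd/ceph-126/journal \
      '<object-spec-from-list-output>' remove

If there's a more direct way to map the assert back to the snapshot or object it's complaining about, that would be even better.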
-Steve

On 05/18/2017 01:06 PM, Steve Anthony wrote:
>
> Hmmm, after crashing every 30 seconds for a few days it's apparently
> running normally again. Weird. I was thinking since it's looking for a
> snapshot object, maybe re-enabling snaptrimming and removing all the
> snapshots in the pool would remove that object (and the problem)?
> Never got to that point this time, but I'm going to need to cycle more
> OSDs in and out of the cluster, so if it happens again I might try
> that and update.
>
> Thanks!
>
> -Steve
>
> On 05/17/2017 03:17 PM, Gregory Farnum wrote:
>>
>> On Wed, May 17, 2017 at 10:51 AM Steve Anthony <sma...@lehigh.edu> wrote:
>>
>> Hello,
>>
>> After starting a backup (create snap, export and import into a second
>> cluster - one RBD image still exporting/importing as of this message)
>> the other day while recovery operations on the primary cluster were
>> ongoing, I noticed an OSD (osd.126) start to crash; I reweighted it to 0
>> to prepare to remove it. Shortly thereafter I noticed the problem seemed
>> to move to another OSD (osd.223). After looking at the logs, I noticed
>> they appeared to have the same problem. I'm running Ceph version 9.2.1
>> (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8.
>>
>> Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe
>>
>> Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA
>>
>> May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15 10:39:51.561342
>> 7f225c385900 -1 osd.126 616621 log_to_monitors {default=true}
>> May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897
>> 7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void
>> ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&,
>> const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)'
>> thread 7f2236be3700 time 2017-05-15 10:39:55.322306
>> May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192: FAILED
>> assert(recovery_info.oi.snaps.size())
>>
>> May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391
>> 7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true}
>> May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In function
>> 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const
>> object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
>> ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15 16:45:30.799839
>> May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192: FAILED
>> assert(recovery_info.oi.snaps.size())
>>
>> I did some searching and thought it might be related to
>> http://tracker.ceph.com/issues/13837 aka
>> https://bugzilla.redhat.com/show_bug.cgi?id=1351320 so I disabled
>> scrubbing and deep-scrubbing, and set osd_pg_max_concurrent_snap_trims
>> to 0 for all OSDs. No luck. I had changed the systemd service file to
>> automatically restart osd.223 while recovery was happening, but it
>> appears to have stalled; I suppose it needs to be up for the remaining
>> objects.
>>
>> Yeah, these aren't really related that I can see — though I haven't
>> spent much time in this code that I can recall. The OSD is receiving
>> a "push" as part of log recovery and finds that the object it's
>> receiving is a snapshot object without having any information about
>> the snap IDs that exist, which is weird. I don't know of any way a
>> client could break it either, but maybe David or Jason know something
>> more.
>>
>> -Greg
>>
>> I didn't see anything else online, so I thought I'd see if anyone has
>> seen this before or has any other ideas. Thanks for taking the time.
>>
>> -Steve
>>
>> --
>> Steve Anthony
>> LTS HPC Senior Analyst
>> Lehigh University
>> sma...@lehigh.edu
>
> --
> Steve Anthony
> LTS HPC Senior Analyst
> Lehigh University
> sma...@lehigh.edu

--
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma...@lehigh.edu