I'm seeing this again on two OSDs after adding another 20 disks to the cluster. Is there some way I can determine which snapshots the recovery process is looking for? Or, failing that, find and remove the objects it's trying to recover, since there's apparently a problem with them? Thanks!
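For reference, this is roughly the direction I was thinking of. It's only a sketch; the OSD id, PG id, pool, and object names below are placeholders for whatever the logs and pg query actually point at.

  # Turn up logging on the crashing OSD so the recovery push that trips the
  # assert also shows which object it was handling (osd.126 is a placeholder).
  ceph tell osd.126 injectargs '--debug-osd 10'

  # Find the PGs that are still degraded/recovering and inspect one of them
  # (3.1ab is a placeholder PG id).
  ceph health detail
  ceph pg 3.1ab query | less

  # Once an object name turns up in the log, ask RADOS which clones/snaps it
  # thinks that object has (pool and object name are placeholders).
  rados -p rbd listsnaps rbd_data.<prefix>.<object-number>

  # If removing the bad object is the answer, I assume it has to happen with
  # the OSD stopped, via something like ceph-objectstore-tool (paths and PG id
  # are placeholders; the object spec comes from the list output):
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-126 \
      --journal-path /var/lib/ceph/osd/ceph-126/journal \
      --op list --pgid 3.1ab
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-126 \
      --journal-path /var/lib/ceph/osd/ceph-126/journal \
      '<object-spec-from-list-output>' remove

If there's a more direct way to map the assert back to the snapshot or object it's complaining about, that would be even better.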
-Steve

On 05/18/2017 01:06 PM, Steve Anthony wrote:
>
> Hmmm, after crashing every 30 seconds for a few days it's apparently
> running normally again. Weird. I was thinking since it's looking for a
> snapshot object, maybe re-enabling snaptrimming and removing all the
> snapshots in the pool would remove that object (and the problem)?
> Never got to that point this time, but I'm going to need to cycle more
> OSDs in and out of the cluster, so if it happens again I might try
> that and update.
>
> Thanks!
>
> -Steve
>
> On 05/17/2017 03:17 PM, Gregory Farnum wrote:
>>
>> On Wed, May 17, 2017 at 10:51 AM Steve Anthony <sma...@lehigh.edu> wrote:
>>
>> Hello,
>>
>> After starting a backup (create snap, export and import into a second
>> cluster - one RBD image still exporting/importing as of this message)
>> the other day while recovery operations on the primary cluster were
>> ongoing, I noticed an OSD (osd.126) start to crash; I reweighted it to 0
>> to prepare to remove it. Shortly thereafter I noticed the problem seemed
>> to move to another OSD (osd.223). After looking at the logs, I noticed
>> they appeared to have the same problem. I'm running Ceph version 9.2.1
>> (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8.
>>
>> Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe
>>
>> Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA
>>
>> May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15 10:39:51.561342
>> 7f225c385900 -1 osd.126 616621 log_to_monitors {default=true}
>> May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897
>> 7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void
>> ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&,
>> const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)'
>> thread 7f2236be3700 time 2017-05-15 10:39:55.322306
>> May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192: FAILED
>> assert(recovery_info.oi.snaps.size())
>>
>> May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391
>> 7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true}
>> May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In function
>> 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const
>> object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
>> ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15 16:45:30.799839
>> May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192: FAILED
>> assert(recovery_info.oi.snaps.size())
>>
>> I did some searching and thought it might be related to
>> http://tracker.ceph.com/issues/13837 aka
>> https://bugzilla.redhat.com/show_bug.cgi?id=1351320 so I disabled
>> scrubbing and deep-scrubbing, and set osd_pg_max_concurrent_snap_trims
>> to 0 for all OSDs. No luck. I had changed the systemd service file to
>> automatically restart osd.223 while recovery was happening, but it
>> appears to have stalled; I suppose it needs to be up for the remaining
>> objects.
>>
>> Yeah, these aren't really related that I can see — though I haven't
>> spent much time in this code that I can recall. The OSD is receiving
>> a "push" as part of log recovery and finds that the object it's
>> receiving is a snapshot object without having any information about
>> the snap IDs that exist, which is weird. I don't know of any way a
>> client could break it either, but maybe David or Jason know something
>> more.
>>
>> -Greg
>>
>> I didn't see anything else online, so I thought I'd see if anyone has
>> seen this before or has any other ideas. Thanks for taking the time.
>>
>> -Steve
>>
>> --
>> Steve Anthony
>> LTS HPC Senior Analyst
>> Lehigh University
>> sma...@lehigh.edu
>
> --
> Steve Anthony
> LTS HPC Senior Analyst
> Lehigh University
> sma...@lehigh.edu

--
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma...@lehigh.edu