Dear Michael,

> I have other tasks I need to perform on the filesystem (removing OSDs,
> adding new OSDs, increasing PG count), but I feel like I need to address
> these degraded/lost objects before risking any more damage.
I would probably not attempt any such maintenance before there was a period of at least 1 day with HEALTH_OK. The reason is that certain historical information is not trimmed unless the cluster is in HEALTH_OK. The more such information accumulates, the greater the risk that the cluster becomes unstable.

Can you post the output of ceph status, ceph health detail, ceph osd pool stats and ceph osd df tree (on pastebin.com)?

If I remember correctly, you removed OSDs/PGs following a troubleshooting guide? I suspect that the removal has left something in an inconsistent state that requires manual cleanup for recovery to proceed.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michael Thomas <w...@caltech.edu>
Sent: 09 October 2020 22:33:46
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

Hi Frank,

That was a good tip. I was able to move the broken files out of the way and restore them for users.

However, after 2 weeks I'm still left with unfound objects. Even more annoying, I now have 82k objects degraded (up from 74), which hasn't changed in over a week. I'm ready to claim that the auto-repair capabilities of ceph are not able to fix my particular issues, and will have to continue to investigate alternate ways to clean this up, including a pg export/import (as you suggested) and perhaps an mds backward scrub (after testing in a junk pool first).

I have other tasks I need to perform on the filesystem (removing OSDs, adding new OSDs, increasing PG count), but I feel like I need to address these degraded/lost objects before risking any more damage.

One particular PG is in a curious state:

7.39d  82163  82165  246734  1  344060777807  0  0  2139  active+recovery_unfound+undersized+degraded+remapped  23m  50755'112549  50766:960500  [116,72,122,48,45,131,73,81]p116  [71,109,99,48,45,90,73,NONE]p71  2020-08-13T23:02:34.325887-0500  2020-08-07T11:01:45.657036-0500

Note the 'NONE' in the acting set. I do not know which OSD this may have been, nor how to find out. I suspect (without evidence) that this is part of the cause of no action on the degraded and misplaced objects.

--Mike

On 9/18/20 11:26 AM, Frank Schilder wrote:
> Dear Michael,
>
> maybe there is a way to restore access for users and solve the issues later. Someone else with a lost/unfound object was able to move the affected file (or directory containing the file) to a separate location and restore the now missing data from backup. This will "park" the problem of cluster health for later fixing.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <fr...@dtu.dk>
> Sent: 18 September 2020 15:38:51
> To: Michael Thomas; ceph-users@ceph.io
> Subject: [ceph-users] Re: multiple OSD crash, unfound objects
>
> Dear Michael,
>
>> I disagree with the statement that trying to recover health by deleting data is a contradiction. In some cases (such as mine), the data in ceph is backed up in another location (eg tape library). Restoring a few files from tape is a simple and cheap operation that takes a minute, at most.
>
> I would agree with that if the data was deleted using the appropriate high-level operation. Deleting an unfound object is like marking a sector on a disk as bad with smartctl. How should the file system react to that? Purging an OSD is like removing a disk from a raid set.
> Such operations increase inconsistencies/degradation rather than resolving them. Cleaning this up also requires executing other operations to remove all references to the object and, finally, the file inode itself.
>
> The ls on a dir with corrupted file(s) hangs if ls calls stat on every file. For example, when coloring is enabled, ls will stat every file in the dir to be able to choose the color according to permissions. If one then disables coloring, a plain "ls" will return all names while an "ls -l" will hang due to stat calls.
>
> An "rm" or "rm -f" should succeed if the folder permissions allow that. It should not stat the file itself, so it sounds a bit odd that it's hanging. I guess in some situations it does, like "rm -i", which will ask before removing read-only files. How does "unlink FILE" behave?
>
> Most admin commands on ceph are asynchronous. A command like "pg repair" or "osd scrub" only schedules an operation. The command "ceph pg 7.1fb mark_unfound_lost delete" probably does just the same. Unfortunately, I don't know how to check that a scheduled operation has started/completed/succeeded/failed. I asked this in an earlier thread (about PG repair) and didn't get an answer. On our cluster, the actual repair happened ca. 6-12 hours after scheduling (on a healthy cluster!). I would conclude that (some of) these operations have very low priority and will not start at least as long as there is recovery going on. One might want to consider the possibility that some of the scheduled commands have not been executed yet.
>
> The output of "pg query" contains the IDs of the missing objects (in mimic), and each of these objects is on one of the peer OSDs of the PG (I think object here refers to shard or copy). It should be possible to find the corresponding OSD (or at least obtain confirmation that the object is really gone) and move the object to a place where it is expected to be found. This can probably be achieved with "PG export" and "PG import". I don't know of any other way(s).
>
> I guess, in the current situation, sitting it out a bit longer might be a good strategy. I don't know how many asynchronous commands you executed, and giving the cluster time to complete these jobs might improve the situation.
>
> Sorry that I can't be of more help here. However, if you figure out a solution (ideally non-destructive), please post it here.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Michael Thomas <w...@caltech.edu>
> Sent: 18 September 2020 14:15:53
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] multiple OSD crash, unfound objects
>
> Hi Frank,
>
> On 9/18/20 2:50 AM, Frank Schilder wrote:
>> Dear Michael,
>>
>> firstly, I'm a bit confused why you started deleting data. The objects were unfound, but still there. That's a small issue. Now the data might be gone and that's a real issue.
>>
>> ----------------------------
>> Interval:
>>
>> Anyone reading this: I have seen many threads where ceph admins started deleting objects or PGs or even purging OSDs way too early from a cluster. Trying to recover health by deleting data is a contradiction. Ceph has bugs and sometimes it needs some help finding everything again. As far as I know, for most of these bugs there are workarounds that allow full recovery with a bit of work.
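>> One such workaround is to export the PG from an OSD that still holds a copy and import it into an OSD where ceph expects to find it, using ceph-objectstore-tool. A rough sketch from memory (the OSD ids and pgid below are placeholders, the OSD must be stopped while the tool runs, and the options should be double-checked against the man page before trying this on real data):
>>
>>    systemctl stop ceph-osd@12
>>    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
>>        --pgid 7.1fb --op export --file /root/pg-7.1fb.export
>>    systemctl start ceph-osd@12
>>    # import on another (stopped) OSD that does not currently hold the PG:
>>    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 \
>>        --pgid 7.1fb --op import --file /root/pg-7.1fb.export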
>
> I disagree with the statement that trying to recover health by deleting data is a contradiction. In some cases (such as mine), the data in ceph is backed up in another location (e.g. a tape library). Restoring a few files from tape is a simple and cheap operation that takes a minute, at most. For the sake of expediency, sometimes it's quicker and easier to simply delete the affected files and restore them from the backup system.
>
> This procedure has worked fine with our previous distributed filesystem (hdfs), so I (naively?) thought that it could be used with ceph as well. I was a bit surprised that ceph's behavior was to indefinitely block the 'rm' operation so that the affected file could not even be removed.
>
> Since I have 25 unfound objects spread across 9 PGs, I used a PG with a single unfound object to test this alternate recovery procedure.
>
>> First question is, did you delete the entire object or just a shard on one disk? Are there OSDs that might still have a copy?
>
> Per the troubleshooting guide
> (https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/),
> I ran:
>
>    ceph pg 7.1fb mark_unfound_lost delete
>
> So I presume that the entire object has been deleted.
>
>> If the object is gone for good, the file references something that doesn't exist - it's like a bad sector. You probably need to delete the file. Bit strange that the operation does not err out with a read error. Maybe it doesn't because it waits for the unfound objects state to be resolved?
>
> Even before the object was removed, all read operations on the file would hang. Even worse, attempts to stat() the file with commands such as 'ls' or 'rm' would hang. Even worse, attempts to 'ls' in the directory itself would hang. This hasn't changed after removing the object.
>
> *Update*: The stat() operations may not be hanging indefinitely. They seem to hang for somewhere between 10 minutes and 8 hours.
>
>> For all the other unfound objects, they are there somewhere - you didn't lose a disk or something. Try pushing ceph to scan the correct OSDs, for example, by restarting the newly added OSDs one by one or something similar. Sometimes exporting and importing a PG from one OSD to another forces a re-scan and subsequent discovery of unfound objects. It is also possible that ceph will find these objects along the way of recovery or when OSDs scrub or check for objects that can be deleted.
>
> I have restarted the new OSDs countless times. I've used three different methods to restart the OSD:
>
> * systemctl restart ceph-osd@120
>
> * init 6
>
> * ceph osd out 120
>   ...wait for repeering to finish...
>   systemctl restart ceph-osd@120
>   ceph osd in 120
>
> I've done this for all OSDs that a PG has listed in the 'not queried' state in 'ceph pg $pgid detail'. But even when all OSDs in the PG are back to the 'already probed' state, the missing objects remain. (A sketch of how I check these per-peer states is at the end of this message.)
>
> Over 90% of my PGs have not been deep scrubbed recently, due to the amount of backfilling and importing of data into the ceph cluster. I plan to leave the cluster mostly idle over the weekend so that hopefully the deep scrubs can catch up and possibly locate any missing objects.
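> For reference, here is roughly how I check which peers have been probed for the unfound objects. This is only a sketch: it assumes jq is installed and that 'pg query' reports the peers under 'might_have_unfound' the way the troubleshooting guide shows.
>
>    # list each peer OSD of the PG and its probe status
>    ceph pg 7.1fb query | \
>      jq -r '.recovery_state[] | .might_have_unfound? // empty | .[] | "\(.osd)\t\(.status)"'
>
> Anything still reported as 'not queried' or 'osd is down' is what I restart next.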
>
> --Mike
>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Michael Thomas <w...@caltech.edu>
>> Sent: 17 September 2020 22:27:47
>> To: Frank Schilder; ceph-users@ceph.io
>> Subject: Re: [ceph-users] multiple OSD crash, unfound objects
>>
>> Hi Frank,
>>
>> Yes, it does sound similar to your ticket.
>>
>> I've tried a few things to restore the failed files:
>>
>> * Locate a missing object with 'ceph pg $pgid list_unfound'
>>
>> * Convert the hex oid to a decimal inode number
>>
>> * Identify the affected file with 'find /ceph -inum $inode'
>>
>> At this point, I know which file is affected by the missing object. As expected, attempts to read the file simply hang. Unexpectedly, attempts to 'ls' the file or its containing directory also hang. I presume from this that the stat() system call needs some information that is contained in the missing object, and is waiting for the object to become available.
>>
>> Next I tried to remove the affected object with:
>>
>> * ceph pg $pgid mark_unfound_lost delete
>>
>> Now 'ceph status' shows one fewer missing object, but attempts to 'ls' or 'rm' the affected file continue to hang.
>>
>> Finally, I ran a scrub over the part of the filesystem containing the affected file:
>>
>>    ceph tell mds.ceph4 scrub start /frames/postO3/hoft recursive
>>
>> Nothing seemed to come up during the scrub:
>>
>> 2020-09-17T14:56:15.208-0500 7f39bca24700  1 mds.ceph4 asok_command: scrub status {prefix=scrub status} (starting...)
>> 2020-09-17T14:58:58.013-0500 7f39bca24700  1 mds.ceph4 asok_command: scrub start {path=/frames/postO3/hoft,prefix=scrub start,scrubops=[recursive]} (starting...)
>> 2020-09-17T14:58:58.013-0500 7f39b5215700  0 log_channel(cluster) log [INF] : scrub summary: active
>> 2020-09-17T14:58:58.014-0500 7f39b5215700  0 log_channel(cluster) log [INF] : scrub queued for path: /frames/postO3/hoft
>> 2020-09-17T14:58:58.014-0500 7f39b5215700  0 log_channel(cluster) log [INF] : scrub summary: active [paths:/frames/postO3/hoft]
>> 2020-09-17T14:59:02.535-0500 7f39bca24700  1 mds.ceph4 asok_command: scrub status {prefix=scrub status} (starting...)
>> 2020-09-17T15:00:12.520-0500 7f39bca24700  1 mds.ceph4 asok_command: scrub status {prefix=scrub status} (starting...)
>> 2020-09-17T15:02:32.944-0500 7f39b5215700  0 log_channel(cluster) log [INF] : scrub summary: idle
>> 2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log [INF] : scrub complete with tag '1405e5c7-3ecf-4754-918e-129e9d101f7a'
>> 2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log [INF] : scrub completed for path: /frames/postO3/hoft
>> 2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log [INF] : scrub summary: idle
>>
>> After the scrub completed, access to the file (ls or rm) continues to hang. The MDS reports slow reads:
>>
>> 2020-09-17T15:11:05.654-0500 7f39b9a1e700  0 log_channel(cluster) log [WRN] : slow request 481.867381 seconds old, received at 2020-09-17T15:03:03.788058-0500: client_request(client.451432:11309 getattr pAsLsXsFs #0x1000005b1c0 2020-09-17T15:03:03.787602-0500 caller_uid=0, caller_gid=0{}) currently dispatched
>>
>> Does anyone have any suggestions on how else to clean up from a permanently lost object?
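>> For reference, the oid-to-inode conversion I used was roughly the following. Treat it as a sketch: it assumes the data object names have the form '<hex inode>.<block number>', which is what the list_unfound output looked like here, and the example oid is just the inode from the slow getattr above with a block suffix added.
>>
>>    oid=1000005b1c0.00000000            # example object name from 'ceph pg $pgid list_unfound'
>>    ino=$(printf '%d\n' "0x${oid%%.*}") # strip the block suffix, convert the hex inode to decimal
>>    find /ceph -xdev -inum "$ino"       # locate the affected file in the mounted filesystem
>>
>> Going the other way, 'printf %x' on a decimal inode gives the hex prefix to look for in the list_unfound output.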
>>
>> --Mike
>>
>> On 9/16/20 2:03 AM, Frank Schilder wrote:
>>> Sounds similar to this one: https://tracker.ceph.com/issues/46847
>>>
>>> If you have or can reconstruct the crush map from before adding the OSDs, you might be able to discover everything with the temporary reversal of the crush map method.
>>>
>>> Not sure if there is another method, I never got a reply to my question in the tracker.
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Michael Thomas <w...@caltech.edu>
>>> Sent: 16 September 2020 01:27:19
>>> To: ceph-users@ceph.io
>>> Subject: [ceph-users] multiple OSD crash, unfound objects
>>>
>>> Over the weekend I had multiple OSD servers in my Octopus cluster (15.2.4) crash and reboot at nearly the same time. The OSDs are part of an erasure coded pool. At the time the cluster had been busy with a long-running (~week) remapping of a large number of PGs after I incrementally added more OSDs to the cluster. After bringing all of the OSDs back up, I have 25 unfound objects and 75 degraded objects. There are other problems reported, but I'm primarily concerned with these unfound/degraded objects.
>>>
>>> The pool with the missing objects is a cephfs pool. The files stored in the pool are backed up on tape, so I can easily restore individual files as needed (though I would not want to restore the entire filesystem).
>>>
>>> I tried following the guide at https://docs.ceph.com/docs/octopus/rados/troubleshooting/troubleshooting-pg/#unfound-objects. I found a number of OSDs that are still 'not queried'. Restarting a sampling of these OSDs changed the state from 'not queried' to 'already probed', but that did not recover any of the unfound or degraded objects.
>>>
>>> I have also tried 'ceph pg deep-scrub' on the affected PGs, but never saw them get scrubbed. I also tried doing a 'ceph pg force-recovery' on the affected PGs, but only one seems to have been tagged accordingly (see ceph -s output below).
>>>
>>> The guide also says "Sometimes it simply takes some time for the cluster to query possible locations." I'm not sure how long "some time" might take, but it hasn't changed after several hours.
>>>
>>> My questions are:
>>>
>>> * Is there a way to force the cluster to query the possible locations sooner?
>>>
>>> * Is it possible to identify the files in cephfs that are affected, so that I could delete only the affected files and restore them from backup tapes?
>>>
>>> --Mike
>>>
>>> ceph -s:
>>>
>>>    cluster:
>>>      id:     066f558c-6789-4a93-aaf1-5af1ba01a3ad
>>>      health: HEALTH_ERR
>>>              1 clients failing to respond to capability release
>>>              1 MDSs report slow requests
>>>              25/78520351 objects unfound (0.000%)
>>>              2 nearfull osd(s)
>>>              Reduced data availability: 1 pg inactive
>>>              Possible data damage: 9 pgs recovery_unfound
>>>              Degraded data redundancy: 75/626645098 objects degraded (0.000%), 9 pgs degraded
>>>              1013 pgs not deep-scrubbed in time
>>>              1013 pgs not scrubbed in time
>>>              2 pool(s) nearfull
>>>              1 daemons have recently crashed
>>>              4 slow ops, oldest one blocked for 77939 sec, daemons [osd.0,osd.41] have slow ops.
>>>
>>>    services:
>>>      mon: 4 daemons, quorum ceph1,ceph2,ceph3,ceph4 (age 9d)
>>>      mgr: ceph3(active, since 11d), standbys: ceph2, ceph4, ceph1
>>>      mds: archive:1 {0=ceph4=up:active} 3 up:standby
>>>      osd: 121 osds: 121 up (since 6m), 121 in (since 101m); 4 remapped pgs
>>>
>>>    task status:
>>>      scrub status:
>>>          mds.ceph4: idle
>>>
>>>    data:
>>>      pools:   9 pools, 2433 pgs
>>>      objects: 78.52M objects, 298 TiB
>>>      usage:   412 TiB used, 545 TiB / 956 TiB avail
>>>      pgs:     0.041% pgs unknown
>>>               75/626645098 objects degraded (0.000%)
>>>               135224/626645098 objects misplaced (0.022%)
>>>               25/78520351 objects unfound (0.000%)
>>>               2421 active+clean
>>>               5    active+recovery_unfound+degraded
>>>               3    active+recovery_unfound+degraded+remapped
>>>               2    active+clean+scrubbing+deep
>>>               1    unknown
>>>               1    active+forced_recovery+recovery_unfound+degraded
>>>
>>>    progress:
>>>      PG autoscaler decreasing pool 7 PGs from 1024 to 512 (5d)
>>>        [............................]

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io