Dear Michael,

> I have other tasks I need to perform on the filesystem (removing OSDs,
> adding new OSDs, increasing PG count), but I feel like I need to address
> these degraded/lost objects before risking any more damage.
I would probably not attempt any such maintenance before there was a period of at least 1 day with HEALTH_OK. The reason is that certain historical information is not trimmed unless the cluster is in HEALTH_OK. The more such information accumulates, the greater the risk that the cluster becomes unstable.

Can you post the output of ceph status, ceph health detail, ceph osd pool stats and ceph osd df tree (on pastebin.com)?

If I remember correctly, you removed OSDs/PGs following a troubleshooting guide? I suspect that the removal has left something in an inconsistent state that requires manual cleanup for recovery to proceed.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michael Thomas <w...@caltech.edu>
Sent: 09 October 2020 22:33:46
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

Hi Frank,

That was a good tip. I was able to move the broken files out of the way and restore them for users.

However, after 2 weeks I'm still left with unfound objects. Even more annoying, I now have 82k objects degraded (up from 74), which hasn't changed in over a week. I'm ready to claim that the auto-repair capabilities of ceph are not able to fix my particular issues, and will have to continue to investigate alternate ways to clean this up, including a pg export/import (as you suggested) and perhaps an mds backward scrub (after testing in a junk pool first).

I have other tasks I need to perform on the filesystem (removing OSDs, adding new OSDs, increasing PG count), but I feel like I need to address these degraded/lost objects before risking any more damage.

One particular PG is in a curious state:

7.39d  82163  82165  246734  1  344060777807  0  0  2139  active+recovery_unfound+undersized+degraded+remapped  23m  50755'112549  50766:960500  [116,72,122,48,45,131,73,81]p116  [71,109,99,48,45,90,73,NONE]p71  2020-08-13T23:02:34.325887-0500  2020-08-07T11:01:45.657036-0500

Note the 'NONE' in the acting set. I do not know which OSD this may have been, nor how to find out. I suspect (without evidence) that this is part of the cause of no action on the degraded and misplaced objects.

--Mike

On 9/18/20 11:26 AM, Frank Schilder wrote:
> Dear Michael,
>
> maybe there is a way to restore access for users and solve the issues later. Someone else with a lost/unfound object was able to move the affected file (or directory containing the file) to a separate location and restore the now missing data from backup. This will "park" the problem of cluster health for later fixing.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <fr...@dtu.dk>
> Sent: 18 September 2020 15:38:51
> To: Michael Thomas; ceph-users@ceph.io
> Subject: [ceph-users] Re: multiple OSD crash, unfound objects
>
> Dear Michael,
>
>> I disagree with the statement that trying to recover health by deleting data is a contradiction. In some cases (such as mine), the data in ceph is backed up in another location (eg tape library). Restoring a few files from tape is a simple and cheap operation that takes a minute, at most.
>
> I would agree with that if the data was deleted using the appropriate high-level operation. Deleting an unfound object is like marking a sector on a disk as bad with smartctl. How should the file system react to that? Purging an OSD is like removing a disk from a raid set.
> Such operations increase inconsistencies/degradation rather than resolving them. Cleaning this up also requires executing other operations to remove all references to the object and, finally, the file inode itself.
>
> The ls on a dir with corrupted file(s) hangs if ls calls stat on every file. For example, when coloring is enabled, ls will stat every file in the dir to be able to choose the color according to permissions. If one then disables coloring, a plain "ls" will return all names while an "ls -l" will hang due to stat calls.
>
> An "rm" or "rm -f" should succeed if the folder permissions allow that. It should not stat the file itself, so it sounds a bit odd that it's hanging. I guess in some situations it does, like "rm -i", which will ask before removing read-only files. How does "unlink FILE" behave?
>
> Most admin commands on ceph are asynchronous. A command like "pg repair" or "osd scrub" only schedules an operation. The command "ceph pg 7.1fb mark_unfound_lost delete" probably does just the same. Unfortunately, I don't know how to check that a scheduled operation has started/completed/succeeded/failed. I asked this in an earlier thread (about PG repair) and didn't get an answer. On our cluster, the actual repair happened ca. 6-12 hours after scheduling (on a healthy cluster!). I would conclude that (some of) these operations have very low priority and will not start at least as long as there is recovery going on. One might want to consider the possibility that some of the scheduled commands have not been executed yet.
>
> The output of "pg query" contains the IDs of the missing objects (in mimic), and each of these objects is on one of the peer OSDs of the PG (I think object here refers to shard or copy). It should be possible to find the corresponding OSD (or at least obtain confirmation that the object is really gone) and move the object to a place where it is expected to be found. This can probably be achieved with "PG export" and "PG import". I don't know of any other way(s).
>
> I guess, in the current situation, sitting it out a bit longer might be a good strategy. I don't know how many asynchronous commands you executed, and giving the cluster time to complete these jobs might improve the situation.
>
> Sorry that I can't be of more help here. However, if you figure out a solution (ideally non-destructive), please post it here.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Michael Thomas <w...@caltech.edu>
> Sent: 18 September 2020 14:15:53
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] multiple OSD crash, unfound objects
>
> Hi Frank,
>
> On 9/18/20 2:50 AM, Frank Schilder wrote:
>> Dear Michael,
>>
>> firstly, I'm a bit confused why you started deleting data. The objects were unfound, but still there. That's a small issue. Now the data might be gone and that's a real issue.
>>
>> ----------------------------
>> Interval:
>>
>> Anyone reading this: I have seen many threads where ceph admins started deleting objects or PGs or even purging OSDs way too early from a cluster. Trying to recover health by deleting data is a contradiction. Ceph has bugs and sometimes it needs some help finding everything again. As far as I know, for most of these bugs there are workarounds that allow full recovery with a bit of work.
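>> One such workaround is to export the PG from an OSD that still holds a copy and import it into an OSD where ceph expects to find it, using ceph-objectstore-tool. A rough sketch from memory (the OSD ids and pgid below are placeholders, the OSD must be stopped while the tool runs, and the options should be double-checked against the man page before trying this on real data):
>>
>>    systemctl stop ceph-osd@12
>>    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
>>        --pgid 7.1fb --op export --file /root/pg-7.1fb.export
>>    systemctl start ceph-osd@12
>>    # import on another (stopped) OSD that does not currently hold the PG:
>>    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 \
>>        --pgid 7.1fb --op import --file /root/pg-7.1fb.export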
>
> I disagree with the statement that trying to recover health by deleting data is a contradiction. In some cases (such as mine), the data in ceph is backed up in another location (e.g. a tape library). Restoring a few files from tape is a simple and cheap operation that takes a minute, at most. For the sake of expediency, sometimes it's quicker and easier to simply delete the affected files and restore them from the backup system.
>
> This procedure has worked fine with our previous distributed filesystem (hdfs), so I (naively?) thought that it could be used with ceph as well. I was a bit surprised that ceph's behavior was to indefinitely block the 'rm' operation so that the affected file could not even be removed.
>
> Since I have 25 unfound objects spread across 9 PGs, I used a PG with a single unfound object to test this alternate recovery procedure.
>
>> First question is, did you delete the entire object or just a shard on one disk? Are there OSDs that might still have a copy?
>
> Per the troubleshooting guide
> (https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/),
> I ran:
>
>    ceph pg 7.1fb mark_unfound_lost delete
>
> So I presume that the entire object has been deleted.
>
>> If the object is gone for good, the file references something that doesn't exist - it's like a bad sector. You probably need to delete the file. Bit strange that the operation does not err out with a read error. Maybe it doesn't because it waits for the unfound objects state to be resolved?
>
> Even before the object was removed, all read operations on the file would hang. Even worse, attempts to stat() the file with commands such as 'ls' or 'rm' would hang. Even worse, attempts to 'ls' in the directory itself would hang. This hasn't changed after removing the object.
>
> *Update*: The stat() operations may not be hanging indefinitely. They seem to hang for somewhere between 10 minutes and 8 hours.
>
>> For all the other unfound objects, they are there somewhere - you didn't lose a disk or something. Try pushing ceph to scan the correct OSDs, for example, by restarting the newly added OSDs one by one or something similar. Sometimes exporting and importing a PG from one OSD to another forces a re-scan and subsequent discovery of unfound objects. It is also possible that ceph will find these objects along the way of recovery or when OSDs scrub or check for objects that can be deleted.
>
> I have restarted the new OSDs countless times. I've used three different methods to restart the OSD:
>
> * systemctl restart ceph-osd@120
>
> * init 6
>
> * ceph osd out 120
>   ...wait for repeering to finish...
>   systemctl restart ceph-osd@120
>   ceph osd in 120
>
> I've done this for all OSDs that a PG has listed in the 'not queried' state in 'ceph pg $pgid detail'. But even when all OSDs in the PG are back to the 'already probed' state, the missing objects remain. (A sketch of how I check these per-peer states is at the end of this message.)
>
> Over 90% of my PGs have not been deep scrubbed recently, due to the amount of backfilling and importing of data into the ceph cluster. I plan to leave the cluster mostly idle over the weekend so that hopefully the deep scrubs can catch up and possibly locate any missing objects.
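> For reference, here is roughly how I check which peers have been probed for the unfound objects. This is only a sketch: it assumes jq is installed and that 'pg query' reports the peers under 'might_have_unfound' the way the troubleshooting guide shows.
>
>    # list each peer OSD of the PG and its probe status
>    ceph pg 7.1fb query | \
>      jq -r '.recovery_state[] | .might_have_unfound? // empty | .[] | "\(.osd)\t\(.status)"'
>
> Anything still reported as 'not queried' or 'osd is down' is what I restart next.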
>
> --Mike
>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Michael Thomas <w...@caltech.edu>
>> Sent: 17 September 2020 22:27:47
>> To: Frank Schilder; ceph-users@ceph.io
>> Subject: Re: [ceph-users] multiple OSD crash, unfound objects
>>
>> Hi Frank,
>>
>> Yes, it does sound similar to your ticket.
>>
>> I've tried a few things to restore the failed files:
>>
>> * Locate a missing object with 'ceph pg $pgid list_unfound'
>>
>> * Convert the hex oid to a decimal inode number
>>
>> * Identify the affected file with 'find /ceph -inum $inode'
>>
>> At this point, I know which file is affected by the missing object. As expected, attempts to read the file simply hang. Unexpectedly, attempts to 'ls' the file or its containing directory also hang. I presume from this that the stat() system call needs some information that is contained in the missing object, and is waiting for the object to become available.
>>
>> Next I tried to remove the affected object with:
>>
>> * ceph pg $pgid mark_unfound_lost delete
>>
>> Now 'ceph status' shows one fewer missing object, but attempts to 'ls' or 'rm' the affected file continue to hang.
>>
>> Finally, I ran a scrub over the part of the filesystem containing the affected file:
>>
>>    ceph tell mds.ceph4 scrub start /frames/postO3/hoft recursive
>>
>> Nothing seemed to come up during the scrub:
>>
>> 2020-09-17T14:56:15.208-0500 7f39bca24700  1 mds.ceph4 asok_command: scrub status {prefix=scrub status} (starting...)
>> 2020-09-17T14:58:58.013-0500 7f39bca24700  1 mds.ceph4 asok_command: scrub start {path=/frames/postO3/hoft,prefix=scrub start,scrubops=[recursive]} (starting...)
>> 2020-09-17T14:58:58.013-0500 7f39b5215700  0 log_channel(cluster) log [INF] : scrub summary: active
>> 2020-09-17T14:58:58.014-0500 7f39b5215700  0 log_channel(cluster) log [INF] : scrub queued for path: /frames/postO3/hoft
>> 2020-09-17T14:58:58.014-0500 7f39b5215700  0 log_channel(cluster) log [INF] : scrub summary: active [paths:/frames/postO3/hoft]
>> 2020-09-17T14:59:02.535-0500 7f39bca24700  1 mds.ceph4 asok_command: scrub status {prefix=scrub status} (starting...)
>> 2020-09-17T15:00:12.520-0500 7f39bca24700  1 mds.ceph4 asok_command: scrub status {prefix=scrub status} (starting...)
>> 2020-09-17T15:02:32.944-0500 7f39b5215700  0 log_channel(cluster) log [INF] : scrub summary: idle
>> 2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log [INF] : scrub complete with tag '1405e5c7-3ecf-4754-918e-129e9d101f7a'
>> 2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log [INF] : scrub completed for path: /frames/postO3/hoft
>> 2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log [INF] : scrub summary: idle
>>
>> After the scrub completed, access to the file (ls or rm) continues to hang. The MDS reports slow reads:
>>
>> 2020-09-17T15:11:05.654-0500 7f39b9a1e700  0 log_channel(cluster) log [WRN] : slow request 481.867381 seconds old, received at 2020-09-17T15:03:03.788058-0500: client_request(client.451432:11309 getattr pAsLsXsFs #0x1000005b1c0 2020-09-17T15:03:03.787602-0500 caller_uid=0, caller_gid=0{}) currently dispatched
>>
>> Does anyone have any suggestions on how else to clean up from a permanently lost object?
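>> For reference, the oid-to-inode conversion I used was roughly the following. Treat it as a sketch: it assumes the data object names have the form '<hex inode>.<block number>', which is what the list_unfound output looked like here, and the example oid is just the inode from the slow getattr above with a block suffix added.
>>
>>    oid=1000005b1c0.00000000            # example object name from 'ceph pg $pgid list_unfound'
>>    ino=$(printf '%d\n' "0x${oid%%.*}") # strip the block suffix, convert the hex inode to decimal
>>    find /ceph -xdev -inum "$ino"       # locate the affected file in the mounted filesystem
>>
>> Going the other way, 'printf %x' on a decimal inode gives the hex prefix to look for in the list_unfound output.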
>>
>> --Mike
>>
>> On 9/16/20 2:03 AM, Frank Schilder wrote:
>>> Sounds similar to this one: https://tracker.ceph.com/issues/46847
>>>
>>> If you have or can reconstruct the crush map from before adding the OSDs, you might be able to discover everything with the temporary reversal of the crush map method.
>>>
>>> Not sure if there is another method, I never got a reply to my question in the tracker.
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Michael Thomas <w...@caltech.edu>
>>> Sent: 16 September 2020 01:27:19
>>> To: ceph-users@ceph.io
>>> Subject: [ceph-users] multiple OSD crash, unfound objects
>>>
>>> Over the weekend I had multiple OSD servers in my Octopus cluster (15.2.4) crash and reboot at nearly the same time. The OSDs are part of an erasure coded pool. At the time the cluster had been busy with a long-running (~week) remapping of a large number of PGs after I incrementally added more OSDs to the cluster. After bringing all of the OSDs back up, I have 25 unfound objects and 75 degraded objects. There are other problems reported, but I'm primarily concerned with these unfound/degraded objects.
>>>
>>> The pool with the missing objects is a cephfs pool. The files stored in the pool are backed up on tape, so I can easily restore individual files as needed (though I would not want to restore the entire filesystem).
>>>
>>> I tried following the guide at https://docs.ceph.com/docs/octopus/rados/troubleshooting/troubleshooting-pg/#unfound-objects. I found a number of OSDs that are still 'not queried'. Restarting a sampling of these OSDs changed the state from 'not queried' to 'already probed', but that did not recover any of the unfound or degraded objects.
>>>
>>> I have also tried 'ceph pg deep-scrub' on the affected PGs, but never saw them get scrubbed. I also tried doing a 'ceph pg force-recovery' on the affected PGs, but only one seems to have been tagged accordingly (see ceph -s output below).
>>>
>>> The guide also says "Sometimes it simply takes some time for the cluster to query possible locations." I'm not sure how long "some time" might take, but it hasn't changed after several hours.
>>>
>>> My questions are:
>>>
>>> * Is there a way to force the cluster to query the possible locations sooner?
>>>
>>> * Is it possible to identify the files in cephfs that are affected, so that I could delete only the affected files and restore them from backup tapes?
>>>
>>> --Mike
>>>
>>> ceph -s:
>>>
>>>    cluster:
>>>      id:     066f558c-6789-4a93-aaf1-5af1ba01a3ad
>>>      health: HEALTH_ERR
>>>              1 clients failing to respond to capability release
>>>              1 MDSs report slow requests
>>>              25/78520351 objects unfound (0.000%)
>>>              2 nearfull osd(s)
>>>              Reduced data availability: 1 pg inactive
>>>              Possible data damage: 9 pgs recovery_unfound
>>>              Degraded data redundancy: 75/626645098 objects degraded (0.000%), 9 pgs degraded
>>>              1013 pgs not deep-scrubbed in time
>>>              1013 pgs not scrubbed in time
>>>              2 pool(s) nearfull
>>>              1 daemons have recently crashed
>>>              4 slow ops, oldest one blocked for 77939 sec, daemons [osd.0,osd.41] have slow ops.
>>>
>>>    services:
>>>      mon: 4 daemons, quorum ceph1,ceph2,ceph3,ceph4 (age 9d)
>>>      mgr: ceph3(active, since 11d), standbys: ceph2, ceph4, ceph1
>>>      mds: archive:1 {0=ceph4=up:active} 3 up:standby
>>>      osd: 121 osds: 121 up (since 6m), 121 in (since 101m); 4 remapped pgs
>>>
>>>    task status:
>>>      scrub status:
>>>          mds.ceph4: idle
>>>
>>>    data:
>>>      pools:   9 pools, 2433 pgs
>>>      objects: 78.52M objects, 298 TiB
>>>      usage:   412 TiB used, 545 TiB / 956 TiB avail
>>>      pgs:     0.041% pgs unknown
>>>               75/626645098 objects degraded (0.000%)
>>>               135224/626645098 objects misplaced (0.022%)
>>>               25/78520351 objects unfound (0.000%)
>>>               2421 active+clean
>>>               5    active+recovery_unfound+degraded
>>>               3    active+recovery_unfound+degraded+remapped
>>>               2    active+clean+scrubbing+deep
>>>               1    unknown
>>>               1    active+forced_recovery+recovery_unfound+degraded
>>>
>>>    progress:
>>>      PG autoscaler decreasing pool 7 PGs from 1024 to 512 (5d)
>>>        [............................]

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io