Hi Michael,

I also think it would be safe to delete the pool. The object count might be a
stale reference count from the lost objects that never got decremented. Running
a deep scrub over all PGs in that pool might clear it up.
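
If you want to give that a try, something along these lines should kick off a
deep scrub of every PG in the pool (untested sketch; the pool name is just the
one from your paste):

     # deep-scrub each PG of the pool, one at a time
     for pg in $(ceph pg ls-by-pool fs.data.archive.frames | awk '/^[0-9]+\./ {print $1}'); do
         ceph pg deep-scrub $pg
     done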

I don't know rados well enough to say where such an object count comes from.
However, ceph df is known to be imperfect, so maybe it's just an accounting bug
there. I think there have been a couple of cases where people deleted all
objects in a pool and ceph df still reported non-zero usage.
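
A quick cross-check would be to compare the per-pool object counts from both
sides; if rados also thinks the pool is empty, I would trust that over ceph df:

     rados df | grep fs.data.archive.frames
     ceph df detail | grep fs.data.archive.frames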

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michael Thomas <w...@caltech.edu>
Sent: 12 February 2021 22:35:25
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Removing secondary data pool from mds

Hi Frank,

We're not using snapshots.

I was able to run:
     ceph daemon mds.ceph1 dump cache /tmp/cache.txt

...and scan for the stray object to find the cap id of the client that was
accessing the object.  I matched this with the entity name in:
     ceph daemon mds.ceph1 session ls

...to determine the client host.  The strays went away after I rebooted
the offending client.
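
Roughly like this (the inode number below is just one example from the cache
dump; the numeric client id in the caps= field is what I looked up):

     grep 10000020fa1 /tmp/cache.txt      # note the client id in the caps= field
     ceph daemon mds.ceph1 session ls     # match that id to an entity name / host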

With all access to the objects now cleared, I ran:

     ceph pg X.Y mark_unfound_lost delete

...on any remaining rados objects.
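
('ceph pg X.Y list_unfound' is a handy sanity check beforehand to see exactly
which objects are about to be discarded.)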

At this point (at long last) the pool returned to a healthy state.
However, there is one remaining bit that I don't
understand.  'ceph df' returns 355 objects for the pool
(fs.data.archive.frames):

https://pastebin.com/vbZLhQmC

...but 'rados -p fs.data.archive.frames ls --all' returns no objects.
So I'm not sure what these 355 objects are.  Because of that, I haven't
removed the pool from cephfs quite yet, even though I think it would be
safe to do so.

--Mike


On 2/10/21 4:20 PM, Frank Schilder wrote:
> Hi Michael,
>
> out of curiosity, did the pool go away or did it put up a fight?
>
> I don't remember exactly, it's a long time ago, but I believe stray objects on
> fs pools come from files that are still referenced by snapshots but were
> deleted on the fs level. Such files are moved to special stray directories
> until the snapshot containing them is deleted as well. Not sure if this applies
> here though; there might be other occasions when objects go to stray.
>
> I updated the case concerning the underlying problem, but there hasn't been
> much progress there either: https://tracker.ceph.com/issues/46847#change-184710 .
> I had PG degradation even when using the recovery technique with before and
> after crush maps. I was just lucky that I lost only 1 shard per object, so
> ordinary recovery could fix it.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Michael Thomas <w...@caltech.edu>
> Sent: 21 December 2020 23:12:09
> To: ceph-users@ceph.io
> Subject: [ceph-users] Removing secondary data pool from mds
>
> I have a cephfs secondary (non-root) data pool with unfound and degraded
> objects that I have not been able to recover[1].  I created an
> additional data pool and used 'setfattr -n ceph.dir.layout.pool' and a
> very long rsync to move the files off of the degraded pool and onto the
> new pool.  This has completed, and using find + 'getfattr -n
> ceph.file.layout.pool', I verified that no files are using the old pool
> anymore.  No ceph.dir.layout.pool attributes point to the old pool either.
>
> However, the old pool still reports that there are objects in the old
> pool, likely the same ones that were unfound/degraded from before:
> https://pastebin.com/qzVA7eZr
>
> Based on an old message from the mailing list[2], I checked the MDS for
> stray objects (ceph daemon mds.ceph4 dump cache file.txt ; grep -i stray
> file.txt) and found 36 stray entries in the cache:
> https://pastebin.com/MHkpw3DV.  However, I'm not certain how to map
> these stray cache objects to clients that may be accessing them.
>
> 'rados -p fs.data.archive.frames ls' shows 145 objects.  Looking at the
> parent of each object shows 2 strays:
>
> for obj in $(cat rados.ls.txt) ; do echo $obj ; rados -p
> fs.data.archive.frames getxattr $obj parent | strings ; done
>
>
> [...]
> 10000020fa1.00000000
> 10000020fa1
> stray6
> 10000020fbc.00000000
> 10000020fbc
> stray6
> [...]
>
> ...before getting stuck on one object for over 5 minutes (then I gave up):
>
> 1000005b1af.00000083
>
> What can I do to make sure this pool is ready to be safely deleted from
> cephfs (ceph fs rm_data_pool archive fs.data.archive.frames)?
>
> --Mike
>
> [1]https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QHFOGEKXK7VDNNSKR74BA6IIMGGIXBXA/#7YQ6SSTESM5LTFVLQK3FSYFW5FDXJ5CF
>
> [2]http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005233.html
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
