You could try manually deleting the dentries for those files from the
directory fragment objects, using `rados` commands. Make sure to flush
your MDS journal first and take the fs offline (`ceph fs fail`).
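A rough sketch of that procedure, with placeholders throughout: the file
system name (`cephfs`), metadata pool (`cephfs_metadata`), directory
fragment object (`10000000000.00000000`), and dentry key (`badfile_head`)
are all assumptions you would replace with values from your cluster.

```shell
# Flush the MDS journal so all dentries are persisted to the metadata pool
ceph tell mds.cephfs:0 flush journal

# Take the file system offline before touching metadata objects directly
ceph fs fail cephfs

# List the dentry omap keys stored in the directory fragment object
rados -p cephfs_metadata listomapkeys 10000000000.00000000

# Remove the omap key for the damaged dentry (head dentries use the
# "<name>_head" key form)
rados -p cephfs_metadata rmomapkey 10000000000.00000000 badfile_head

# Bring the file system back online and let the MDS rejoin
ceph fs set cephfs joinable true
```

Afterwards you may still need to clear the stale damage entries, e.g. with
`ceph tell mds.cephfs:0 damage rm <damage_id>` or a scrub. Double-check the
object and key names against `damage ls` output before removing anything.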
On Tue, Jun 4, 2024 at 8:50 AM Stolte, Felix <f.sto...@fz-juelich.de> wrote:
>
> Hi Patrick,
>
> it has been a year now and we have not had a single crash since upgrading to 
> 16.2.13. We still have the 19 corrupted files which are reported by 'damage 
> ls'. Is it now possible to delete the corrupted files without taking the 
> filesystem offline?
>
> On 22.05.2023 at 20:23, Patrick Donnelly <pdonn...@redhat.com> wrote:
>
> Hi Felix,
>
> On Sat, May 13, 2023 at 9:18 AM Stolte, Felix <f.sto...@fz-juelich.de> wrote:
>
> Hi Patrick,
>
> we have been running one daily snapshot since December, and our cephfs has 
> crashed 3 times because of this: https://tracker.ceph.com/issues/38452
>
> We currently have 19 files with corrupt metadata found by your 
> first-damage.py script. We have isolated these files from user access and 
> are waiting for a fix before we remove them with your script (or maybe a new 
> way?)
>
> No other fix is anticipated at this time. Probably one will be
> developed after the cause is understood.
>
> Today we upgraded our cluster from 16.2.11 to 16.2.13. After upgrading the 
> MDS servers, cluster health went to ERROR MDS_DAMAGE. 'ceph tell mds.0 
> damage ls' is showing me the same files as your script (initially only a 
> subset; after a cephfs scrub, all of them).
>
> This is expected. Once the dentries are marked damaged, the MDS won't
> allow operations on those files (like those triggering tracker
> #38452).
>
> I noticed "mds: catch damage to CDentry's first member before persisting 
> (issue#58482, pr#50781, Patrick Donnelly)" in the change log for 16.2.13 
> and would like to ask you the following questions:
>
> a) can we repair the damaged files online now instead of bringing down the 
> whole fs and using the python script?
>
> Not yet.
>
> b) should we set one of the new mds options in our specific case to avoid our 
> file servers crashing because of the wrong snap ids?
>
> Have your MDS crashed or just marked the dentries damaged? If you can
> reproduce a crash with detailed logs (debug_mds=20), that would be
> incredibly helpful.
>
> c) will your patch prevent wrong snap ids in the future?
>
> It will prevent persisting the damage.
>
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Red Hat Partner Engineer
> IBM, Inc.
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>
>
> Kind regards
> Felix Stolte
>
> IT-Services
> mailto: f.sto...@fz-juelich.de
> Tel: 02461-619243
>
> ---------------------------------------------------------------------------------------------
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> Registered office: Juelich
> Registered in the commercial register of the Dueren district court, no. HR B 3498
> Chairman of the supervisory board: MinDir Stefan Müller
> Management: Prof. Dr. Astrid Lambrecht (chair),
> Karsten Beneke (deputy chair), Dr. Ir. Pieter Jansens
> ---------------------------------------------------------------------------------------------
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
