Hello Frédéric,

Here are the requested outputs (I kept some of them during the repair of the fs):

1) 'rados df' from when the 'rados ls' command was failing (but after I had already removed the fs cfs_irods_test and the objects from the metadata pool cfs_irods_md_test):

[mon-01]# ceph fs rm cfs_irods_test --yes-i-really-mean-it
[mon-01]# rados -p cfs_irods_md_test ls
601.00000000
...
100.00000000.inode
1.00000000

[mon-01]# for i in `rados -p cfs_irods_md_test ls`; do rados -p cfs_irods_md_test rm $i; done
==> This was unnecessary, because later I simply deleted all the cephfs pools and recreated them.

[mon-01 ~]# rados df
POOL_NAME                     USED  OBJECTS  CLONES   COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED    RD_OPS       RD    WR_OPS       WR  USED COMPR  UNDER COMPR
.mgr                       1.4 GiB      123       0      369                   0        0         0    477873  721 MiB    885240   19 GiB         0 B          0 B
cfs_irods_data_test            0 B        0       0        0                   0        0         0         0      0 B         0      0 B         0 B          0 B
cfs_irods_def_test             0 B        0       0        0                   0        0         0         1      0 B     80200  157 GiB         0 B          0 B
cfs_irods_md_test           97 KiB        0       0        0                   0        0         0       224  212 KiB      2463  4.5 MiB         0 B          0 B
metadata_4hddrbd_rep3_ssd  1.9 MiB        5       0       15                   0        0         0   1339579  1.7 GiB    226136  553 MiB         0 B          0 B
metadata_4ssdrbd_rep3_ssd  239 KiB        5       0       15                   0        0         0   1212124  1.5 GiB    203438  497 MiB         0 B          0 B
pool_rbd_rep3_hdd          3.8 TiB   838156       0  2514468                   0        0         0  17599175  545 GiB  34068128  2.5 TiB         0 B          0 B
pool_rbd_rep3_ssd          2.3 TiB   197298       0   591894                   0        0         0  15948416  460 GiB  32329097  1.8 TiB         0 B          0 B
rbd_ec_k6m2_hdd            589 GiB   113057       0   904456                   0        0         0   5021553  232 GiB  14520983  916 GiB         0 B          0 B
rbd_ec_k6m2_ssd            529 GiB   101552       0   812416                   0        0         0   4787661  206 GiB  14524780  908 GiB         0 B          0 B

total_objects    1250196
total_used       417 TiB
total_avail      6.4 PiB
total_space      6.8 PiB

2) 'rados df' when the 'rados ls' command is working fine for all the pools:

[mon-01 ~]# rados df
POOL_NAME                     USED  OBJECTS  CLONES   COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED    RD_OPS       RD    WR_OPS       WR  USED COMPR  UNDER COMPR
.mgr                       1.4 GiB      123       0      369                   0        0         0    480493  723 MiB    890072   19 GiB         0 B          0 B
cfs_irods_data_test            0 B        0       0        0                   0        0         0         0      0 B         0      0 B         0 B          0 B
cfs_irods_def_test             0 B        0       0        0                   0        0         0         1      0 B     80200  157 GiB         0 B          0 B
cfs_irods_md_test           97 KiB        0       0        0                   0        0         0       224  212 KiB      2463  4.5 MiB         0 B          0 B
metadata_4hddrbd_rep3_ssd  1.9 MiB        5       0       15                   0        0         0   1339536  1.7 GiB    226136  553 MiB         0 B          0 B
metadata_4ssdrbd_rep3_ssd  239 KiB        5       0       15                   0        0         0   1212120  1.5 GiB    203438  497 MiB         0 B          0 B
pool_rbd_rep3_hdd          3.8 TiB   838156       0  2514468                   0        0         0  17594633  540 GiB  34068128  2.5 TiB         0 B          0 B
pool_rbd_rep3_ssd          2.3 TiB   197298       0   591894                   0        0         0  15946345  458 GiB  32329097  1.8 TiB         0 B          0 B
rbd_ec_k6m2_hdd            589 GiB   113057       0   904456                   0        0         0   5021360  232 GiB  14520983  916 GiB         0 B          0 B
rbd_ec_k6m2_ssd            529 GiB   101552       0   812416                   0        0         0   4785748  204 GiB  14524780  908 GiB         0 B          0 B

total_objects    1250196
total_used       417 TiB
total_avail      6.4 PiB
total_space      6.8 PiB
[mon-01 ~]#

From there I deleted the cephfs pools, recreated them, and recreated the fs without any problem.
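
For completeness, the delete-and-recreate sequence was along these lines (just a sketch: the pg counts and crush rules shown are placeholders, not necessarily what we used, and pool deletion requires mon_allow_pool_delete=true):

ceph osd pool delete cfs_irods_md_test cfs_irods_md_test --yes-i-really-really-mean-it
ceph osd pool delete cfs_irods_def_test cfs_irods_def_test --yes-i-really-really-mean-it
ceph osd pool delete cfs_irods_data_test cfs_irods_data_test --yes-i-really-really-mean-it
# recreate the pools (pg counts / device-class crush rules as before -- placeholders here)
ceph osd pool create cfs_irods_md_test 32
ceph osd pool create cfs_irods_def_test 32
ceph osd pool create cfs_irods_data_test 512
# recreate the filesystem on the new pools, then add the second data pool
ceph fs new cfs_irods_test cfs_irods_md_test cfs_irods_def_test
ceph fs add_data_pool cfs_irods_test cfs_irods_data_test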

Best regards,

Christophe

On 24/04/2025 08:52, Frédéric Nass wrote:
Hi Christophe,

Do you have a 'rados df' or 'ceph df' output (of all pools, not just the one from the test filesystem) from the time when the 'rados ls' command was failing?


I'm trying to determine if the pools were incorrectly reported as empty due to stuck PGs or OSDs. This is important as we based our decision to delete the filesystem on this (possibly wrong) information.


Regards,
Frédéric.

----- On 23 Apr 25, at 18:19, Christophe DIARRA <christophe.dia...@idris.fr> wrote:

    Hello Frédéric,

    I have a new working fs now after deleting the fs + pools and
    recreating them. I will mount the fs on the test client, create
    some files and do some tests:

    1. shut down and restart the cluster to see what happens to the
    metadata (rough flag sequence sketched below)

    2. redo test 1, but removing power from the rack for hours after
    the cluster is shut down

    I will let you know when the tests are finished.
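
    For the shutdown itself I would set the usual cluster flags before
    stopping the daemons and clear them after power-up, along these lines
    (just a sketch of the standard sequence, not necessarily exactly what
    we will run):

    ceph osd set noout; ceph osd set norecover; ceph osd set norebalance
    ceph osd set nobackfill; ceph osd set nodown; ceph osd set pause
    # ... power off the OSD nodes, then the MON/MGR nodes; power on in reverse order ...
    ceph osd unset pause; ceph osd unset nodown; ceph osd unset nobackfill
    ceph osd unset norebalance; ceph osd unset norecover; ceph osd unset noout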

    Following is the current status:

    [mon-01 ~]# ceph fs status
    cfs_irods_test - 0 clients
    ==============
    RANK  STATE                 MDS                   ACTIVITY     DNS    INOS   DIRS   CAPS
     0    active  cfs_irods_test.mon-01.hitdem  Reqs:    0 /s    10     13     12      0
           POOL           TYPE     USED  AVAIL
    cfs_irods_md_test   metadata  96.0k  34.4T
    cfs_irods_def_test    data       0   2018T
    cfs_irods_data_test    data       0   4542T
              STANDBY MDS
    cfs_irods_test.mon-03.vlmeuz
    cfs_irods_test.mon-02.awuygq
    MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)

    Many thanks to you Frédéric and also to David, Anthony and Michel
    for the advice and remarks.

    Best regards,

    Christophe


    On 23/04/2025 14:30, Christophe DIARRA wrote:

        Hello Frédéric, Michel,

        Rebooting the OSDs one by one solved the 'rados ls' problem.
        Now it works fine for all the pools.
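
        (For reference, an equivalent rolling restart with cephadm would
        look roughly like the sketch below; <host> is a placeholder -- in
        practice we rebooted the OSD nodes themselves:)

        ceph osd set noout
        # for each OSD host in turn, restart its OSD daemons:
        for id in $(ceph osd ls-tree <host>); do ceph orch daemon restart osd.$id; done
        # wait until all PGs are active+clean again before the next host:
        ceph pg stat
        ceph osd unset noout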

        The next step is to recreate the cephfs fs I deleted yesterday
        because of the damaged metadata problem.

        I will let you know.

        Thanks,

        Christophe

        On 23/04/2025 12:47, Christophe DIARRA wrote:

            Hello Frédéric,

            Thank you for the answer.

            osd_mclock_max_capacity_iops_hdd is not defined; only
            osd_mclock_max_capacity_iops_ssd is defined. I suppose these
            are default values. I didn't know anything about them until now.

            [mon-01 ~]# ceph config dump | grep osd_mclock_max_capacity_iops_hdd
            [cmon-01 ~]#

            [mon-01 ~]# ceph config dump | grep osd_mclock_max_capacity_iops
            osd.352    basic    osd_mclock_max_capacity_iops_ssd    47136.994042
            osd.353    basic    osd_mclock_max_capacity_iops_ssd    45567.566829
            osd.354    basic    osd_mclock_max_capacity_iops_ssd    44979.777767
            osd.355    basic    osd_mclock_max_capacity_iops_ssd    44494.118337
            osd.356    basic    osd_mclock_max_capacity_iops_ssd    48002.559112
            osd.357    basic    osd_mclock_max_capacity_iops_ssd    54686.144097
            osd.358    basic    osd_mclock_max_capacity_iops_ssd    42349.183758
            osd.359    basic    osd_mclock_max_capacity_iops_ssd    58134.190143
            osd.360    basic    osd_mclock_max_capacity_iops_ssd    46867.824097
            osd.361    basic    osd_mclock_max_capacity_iops_ssd    54869.366372
            osd.362    basic    osd_mclock_max_capacity_iops_ssd    55875.432057
            osd.363    basic    osd_mclock_max_capacity_iops_ssd    58346.849381
            osd.364    basic    osd_mclock_max_capacity_iops_ssd    52520.181799
            osd.365    basic    osd_mclock_max_capacity_iops_ssd    46632.056458
            osd.366    basic    osd_mclock_max_capacity_iops_ssd    45746.055260
            osd.367    basic    osd_mclock_max_capacity_iops_ssd    47884.575954
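
            (Noting for later: if one of these auto-measured values ever
            looks wrong, my understanding is that it can be inspected and
            cleared per OSD so it falls back to the default, e.g. for
            osd.352 -- just a sketch:)

            ceph config show osd.352 osd_mclock_max_capacity_iops_ssd
            ceph config rm osd.352 osd_mclock_max_capacity_iops_ssd
            ceph tell osd.352 bench    # re-measure what the OSD can actually deliver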

            I will restart the OSD nodes one by one and let you know
            if 'rados ls' works again.

            Thanks,

            Christophe

            On 23/04/2025 12:23, Frédéric Nass wrote:

                Hi Christophe,

                Response inline

                ----- On 23 Apr 25, at 11:42, Christophe DIARRA <christophe.dia...@idris.fr> wrote:

                    Hello Frédéric,
                    I removed the fs but haven't recreated it yet
                    because I have a doubt about the
                    health of the cluster even though it seems healthy:
                    [mon-01 ~]# ceph -s
                    cluster:
                    id: b87276e0-1d92-11ef-a9d6-507c6f66ae2e
                    health: HEALTH_OK
                    services:
                    mon: 3 daemons, quorum mon-01,mon-03,mon-02 (age 6d)
                    mgr: mon-02.mqaubn(active, since 6d), standbys:
                    mon-03.gvywio, mon-01.xhxqdi
                    osd: 368 osds: 368 up (since 16h), 368 in (since 3w)
                    data:
                    pools: 10 pools, 4353 pgs
                    objects: 1.25M objects, 3.9 TiB
                    usage: 417 TiB used, 6.4 PiB / 6.8 PiB avail
                    pgs: 4353 active+clean
                    I observed that listing the objects in any hdd pool
                    hangs: at the beginning for an empty hdd pool, or
                    after displaying the list of objects otherwise. I
                    need to do a Ctrl-C to interrupt the hung 'rados ls'
                    command. I don't have this problem with the pools on
                    ssd.
                    [mon-01 ~]# rados lspools
                    .mgr
                    pool_rbd_rep3_hdd <------ hdd pool
                    pool_rbd_rep3_ssd
                    rbd_ec_k6m2_hdd <------ hdd pool
                    rbd_ec_k6m2_ssd
                    metadata_4hddrbd_rep3_ssd
                    metadata_4ssdrbd_rep3_ssd
                    cfs_irods_md_test
                    cfs_irods_def_test
                    cfs_irods_data_test <------ hdd pool
                    [mon-01 ~]#

                    1) Testing 'rados ls' on hdd pools:
                    [mon-01 ~]# rados -p cfs_irods_data_test ls
                    (hangs forever) ==> Ctrl-C
                    [mon-01 ~]# rados -p pool_rbd_rep3_hdd ls|head -2
                    rbd_data.565ed6699dd8.0000000000097ff6
                    rbd_data.565ed6699dd8.00000000001041fb
                    (then hangs forever here) ==> Ctrl-C
                    [mon-01 ~]# rados -p pool_rbd_rep3_hdd ls
                    rbd_data.565ed6699dd8.0000000000097ff6
                    rbd_data.565ed6699dd8.00000000001041fb
                    rbd_data.565ed6699dd8.000000000004f1a3
                    ...
                    (list truncated by me)
                    ...
                    rbd_data.565ed6699dd8.000000000016809e
                    rbd_data.565ed6699dd8.000000000007bc05
                    (then hangs forever here) ==> Ctrl-C
                    2) With the pools on ssd everything works well (the
                    'rados ls' command doesn't hang):
                    [mon-01 ~]# for i in $(rados lspools|egrep
                    'ssd|md|def'); do echo -n "Pool $i
                    :"; rados -p $i ls |wc -l; done
                    Pool pool_rbd_rep3_ssd :197298
                    Pool rbd_ec_k6m2_ssd :101552
                    Pool metadata_4hddrbd_rep3_ssd :5
                    Pool metadata_4ssdrbd_rep3_ssd :5
                    Pool cfs_irods_md_test :0
                    Pool cfs_irods_def_test :0
                    Below is the configuration of the cluster:
                    - 3 MONs (HPE DL360) + 8 OSD servers (HPE Apollo
                    4510 Gen10)
                    - each OSD server has 44x20TB HDD + 10x7.6TB SSD

                This is dense. :-/

                    - On each OSD server, 8 SSDs are partitioned and used
                    for the wal/db of the HDD OSDs
                    - On each OSD server, 2 SSDs are used for the cephfs
                    metadata and default data pools.
                    Do you see any configuration problem here which
                    could lead to our metadata
                    problem ?
                    Do you know what could cause the hang of the
                    'rados ls' command on the HDD pools
                    ? I would like to understand this problem before
                    recreating a new cephfs fs.

                Inaccessible PGs, misbehaving OSDs, mClock scheduler
                in use with osd_mclock_max_capacity_iops_hdd (auto)set
                way too low (check 'ceph config dump | grep
                osd_mclock_max_capacity_iops_hdd').

                Since this is consecutive to an electrical maintenance
                (power outage?), if osd_mclock_max_capacity_iops_hdd
                is not the issue, I would restart all HDD OSDs one by
                one or node by node to have all PGs re-peered. Then try
                the 'rados ls' command again.
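
                A quick way to rule out inaccessible PGs beforehand (a
                sketch, using pool names from your list; the ok-to-stop
                check is optional):

                ceph health detail
                ceph pg ls-by-pool pool_rbd_rep3_hdd | grep -v 'active+clean'
                ceph pg ls-by-pool cfs_irods_data_test | grep -v 'active+clean'
                ceph osd ok-to-stop <osd_id>    # before restarting a given OSD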

                Regards,
                Frédéric.

                    The cluster is still in testing state, so we can do
                    any tests you could recommend.
                    Thanks,
                    Christophe
                    On 22/04/2025 16:46, Christophe DIARRA wrote:

                        Hello Frédéric,
                        15 of the 16 parallel scanning workers terminated
                        almost immediately, but one worker has now been
                        running for 1+ hour:
                        [mon-01 log]# ps -ef|grep scan
                        root 1977927 1925004 0 15:18 pts/0 00:00:00 cephfs-data-scan scan_extents --filesystem cfs_irods_test --worker_n 11 --worker_m 16
                        [mon-01 log]# date;lsof -p 1977927|grep osd
                        Tue Apr 22 04:37:05 PM CEST 2025
                        cephfs-da 1977927 root 15u IPv4 7105122 0t0
                        TCP mon-01:34736->osd-06:6912
                        (ESTABLISHED)
                        cephfs-da 1977927 root 18u IPv4 7110774 0t0
                        TCP mon-01:45122->osd-03:ethoscan
                        (ESTABLISHED)
                        cephfs-da 1977927 root 19u IPv4 7105123 0t0
                        TCP mon-01:58556->osd-07:spg
                        (ESTABLISHED)
                        cephfs-da 1977927 root 20u IPv4 7049672 0t0
                        TCP mon-01:55064->osd-01:7112
                        (ESTABLISHED)
                        cephfs-da 1977927 root 21u IPv4 7082598 0t0
                        TCP mon-01:42120->osd-03-data:6896
                        (SYN_SENT)
                        [mon-01 log]#
                        The filesystem is empty. So I will follow your
                        advice and remove it. After that
                        I will recreate it.
                        I will redo some proper shutdown and restart
                        of the cluster to check if the
                        problem reappears with the newly recreated fs.
                        I will let you know.
                        Thank you for your help,
                        Christophe
                        On 22/04/2025 15:56, Frédéric Nass wrote:

                             That is weird for 2 reasons.
                            The first reason is that the
                            cephfs-data-scan should not run for a
                            couple of
                            hours on empty data pools. I just tried to
                            run it on an empty pool and it
                            doesn't run for more than maybe 10 seconds.
                             The second reason is that the data pool
                             cfs_irods_def_test should not be empty,
                             even if the filesystem tree is. It
                            should at least have a few rados
                            objects named after
                            {100,200,400,60x}.00000000 and the root
                            inode 1.00000000 /
                            1.00000000.inode unless you removed the
                            filesystem by running the 'ceph fs rm
                            <filesystem_name> --yes-i-really-mean-it'
                            command which does remove rados
                            objects in the associated pools.
                            If it's clear for you that this filesystem
                            should be empty, I'd advise you to
                            remove it (using the 'ceph fs rm'
                            command), delete any rados objects in the
                            metadata and data pools, and then recreate
                            the filesystem.
                            Regards,
                            Frédéric.
                             ----- On 22 Apr 25, at 15:13, Christophe DIARRA <christophe.dia...@idris.fr> wrote:
                            Hello Frédéric,
                            I have:
                             [mon-01 ~]# rados df | grep -E 'OBJECTS|cfs_irods_def_test|cfs_irods_data_test'
                             POOL_NAME            USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS   RD   WR_OPS       WR  USED COMPR  UNDER COMPR
                             cfs_irods_data_test   0 B        0       0       0                   0        0         0       0  0 B        0      0 B         0 B          0 B
                             cfs_irods_def_test    0 B        0       0       0                   0        0         0       1  0 B    80200  157 GiB         0 B          0 B
                            [mon-01 ~]#
                            I will interrupt the current scanning
                            process and rerun it with
                            more workers.
                            Thanks,
                            Christophe
                            On 22/04/2025 15:05, Frédéric Nass wrote:
                             Hum... Obviously this 'empty' filesystem has
                             way more rados objects in the 2 data pools
                             than expected. You should see how many
                             objects there are with:
                             rados df | grep -E 'OBJECTS|cfs_irods_def_test|cfs_irods_data_test'

                            If waiting is not an option, you can break
                            the scan_extents
                            command, re-run it with multiple workers,
                            and then proceed
                            with the next scan (scan_links). Just make
                            sure you run the
                            next scan with multiple workers as well.
                            Regards,
                            Frédéric.
                             ----- On 22 Apr 25, at 14:54, Christophe DIARRA <christophe.dia...@idris.fr> wrote:
                            Hello Frédéric,
                            I ran the commands (see below) but the
                            command
                            'cephfs-data-scan scan_extents --filesystem
                            cfs_irods_test' is not finished yet. It
                            has been running
                             for 2+ hours. I didn't run it in parallel
                             because the fs contains empty directories only.
                             According to [1]: "scan_extents and scan_inodes
                             commands may take a very long time if the data
                             pool contains many files or very large files."
                             Now I think I should have run the command in
                            parallel. I don't know if it is safe to
                            interrupt it and
                            then rerun it with 16 workers.
                            On 22/04/2025 12:13, Frédéric Nass wrote:
                            Hi Christophe,
                            You could but it won't be of any help
                            since the
                            journal is empty. What you can do to fix
                            the fs
                            metadata is to run the below commands from
                            the
                            disaster-recovery-experts documentation
                            [1] in this
                            particular order:
                            #Prevent access to the fs and set it down.
                            ceph fs set cfs_irods_test
                            refuse_client_session true
                            ceph fs set cfs_irods_test joinable false
                            ceph fs set cfs_irods_test down true
                            [mon-01 ~]# ceph fs set cfs_irods_test
                            refuse_client_session true
                            client(s) blocked from establishing new
                            session(s)
                            [mon-01 ~]# ceph fs set cfs_irods_test
                            joinable false
                            cfs_irods_test marked not joinable; MDS
                            cannot join as
                            newly active.
                            [mon-01 ~]# ceph fs set cfs_irods_test
                            down true
                            cfs_irods_test marked down.
                            # Reset maps and journal
                            cephfs-table-tool cfs_irods_test:0 reset
                            session
                            cephfs-table-tool cfs_irods_test:0 reset snap
                            cephfs-table-tool cfs_irods_test:0 reset
                            inode
                            [mon-01 ~]# cephfs-table-tool
                            cfs_irods_test:0 reset session
                            {
                            "0": {
                            "data": {},
                            "result": 0
                            }
                            }
                             [mon-01 ~]# cephfs-table-tool cfs_irods_test:0 reset snap
                             Error ((2) No such file or directory)
                             2025-04-22T12:29:09.550+0200 7f1d4c03e100 -1 main: Bad rank selection: cfs_irods_test:0'
                             [mon-01 ~]# cephfs-table-tool cfs_irods_test:0 reset inode
                             Error ((2) No such file or directory)
                             2025-04-22T12:29:43.880+0200 7f0878a3a100 -1 main: Bad rank selection: cfs_irods_test:0'
                            cephfs-journal-tool --rank
                            cfs_irods_test:0 journal
                            reset --force
                            cephfs-data-scan init --force-init
                            --filesystem
                            cfs_irods_test
                            [mon-01 ~]# cephfs-journal-tool --rank
                            cfs_irods_test:0
                            journal reset --force
                            Error ((2) No such file or directory)
                            2025-04-22T12:34:42.474+0200 7fe8b3a36100
                            -1 main:
                            Couldn't determine MDS rank.
                            [mon-01 ~]# cephfs-data-scan init
                            --force-init
                            --filesystem cfs_irods_test
                            [mon-01 ~]#
                             # Rescan data and fix metadata (leaving the
                             below commands commented, for information on
                             how to parallelize these scan tasks)
                            #for i in {0..15} ; do cephfs-data-scan
                            scan_frags
                            --filesystem cfs_irods_test
                            --force-corrupt --worker_n
                            $i --worker_m 16 & done
                            #for i in {0..15} ; do cephfs-data-scan
                            scan_extents
                            --filesystem cfs_irods_test --worker_n $i
                            --worker_m
                            16 & done
                            #for i in {0..15} ; do cephfs-data-scan
                            scan_inodes
                            --filesystem cfs_irods_test
                            --force-corrupt --worker_n
                            $i --worker_m 16 & done
                            #for i in {0..15} ; do cephfs-data-scan
                            scan_links
                            --filesystem cfs_irods_test --worker_n $i
                            --worker_m
                            16 & done
                            cephfs-data-scan scan_frags --filesystem
                            cfs_irods_test --force-corrupt
                            cephfs-data-scan scan_extents --filesystem
                            cfs_irods_test
                            [mon-01 ~]# cephfs-data-scan scan_frags
                            --filesystem
                            cfs_irods_test --force-corrupt
                             [mon-01 ~]# cephfs-data-scan scan_extents --filesystem cfs_irods_test    ------> still running
                             I don't know how long it will take. Once it
                             is completed I will run the remaining commands.
                            Thanks,
                            Christophe
                            cephfs-data-scan scan_inodes --filesystem
                            cfs_irods_test --force-corrupt
                            cephfs-data-scan scan_links --filesystem
                            cfs_irods_test
                            cephfs-data-scan cleanup --filesystem
                            cfs_irods_test
                            #ceph mds repaired 0 <---- should not be
                            necessary
                            # Set the fs back online and accessible
                            ceph fs set cfs_irods_test down false
                            ceph fs set cfs_irods_test joinable true
                            ceph fs set cfs_irods_test
                            refuse_client_session false
                             An MDS should now start; if not, use 'ceph orch
                             daemon restart mds.xxxxx' to start an MDS.
                            After
                            remounting the fs you should be able to
                            access
                            /testdir1 and /testdir2 in the fs root.
                             # Scrub the fs again to check that everything is OK.
                            ceph tell mds.cfs_irods_test:0 scrub start /
                            recursive,repair,force
                            Regards,
                            Frédéric.
                             [1] https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/
                             ----- On 22 Apr 25, at 10:21, Christophe DIARRA <christophe.dia...@idris.fr> wrote:
                            Hello Frédéric,
                             Thank you for your help.
                             Following is the output you asked for:
                            [mon-01 ~]# date
                            Tue Apr 22 10:09:10 AM CEST 2025
                            [root@fidrcmon-01 ~]# ceph tell
                            mds.cfs_irods_test:0 scrub start /
                            recursive,repair,force
                            2025-04-22T10:09:12.796+0200 7f43f6ffd640 0
                            client.86553 ms_handle_reset on
                            v2:130.84.80.10:6800/3218663047
                            2025-04-22T10:09:12.818+0200 7f43f6ffd640 0
                            client.86559 ms_handle_reset on
                            v2:130.84.80.10:6800/3218663047
                            {
                            "return_code": 0,
                            "scrub_tag":
                            "12e537bb-bb39-4f3b-ae09-e0a1ae6ce906",
                            "mode": "asynchronous"
                            }
                            [root@fidrcmon-01 ~]# ceph tell
                            mds.cfs_irods_test:0 scrub status
                            2025-04-22T10:09:31.760+0200 7f3f0f7fe640 0
                            client.86571 ms_handle_reset on
                            v2:130.84.80.10:6800/3218663047
                            2025-04-22T10:09:31.781+0200 7f3f0f7fe640 0
                            client.86577 ms_handle_reset on
                            v2:130.84.80.10:6800/3218663047
                            {
                            "status": "no active scrubs running",
                            "scrubs": {}
                            }
                            [root@fidrcmon-01 ~]# cephfs-journal-tool
                            --rank
                            cfs_irods_test:0 event recover_dentries list
                            2025-04-16T18:24:56.802960+0200 0x7c334a
                            SUBTREEMAP: ()
                            [root@fidrcmon-01 ~]#
                            Based on this output, can I run the other
                            three
                            commands provided in your message :
                            ceph tell mds.0 flush journal
                            ceph mds fail 0
                            ceph tell mds.cfs_irods_test:0 scrub start
                            / recursive
                            Thanks,
                            Christophe
                            On 19/04/2025 12:55, Frédéric Nass wrote:
                            Hi Christophe, Hi David,
                             Could you share the output of the below
                             command after running the scrubbing with
                             recursive,repair,force?
                            cephfs-journal-tool --rank
                            cfs_irods_test:0 event recover_dentries list
                            Could be that the MDS recovered these 2
                            dentries in its journal already but the
                            status of the filesystem was not updated
                            yet. I've seen this happening before.
                             If that's the case, you could try a flush,
                            fail and re-scrub:
                            ceph tell mds.0 flush journal
                            ceph mds fail 0
                            ceph tell mds.cfs_irods_test:0 scrub start
                            / recursive
                            This might clear the HEALTH_ERR. If not,
                            then it will be easy to fix by
                            rebuilding / fixing the metadata from the
                            data pools since this fs is empty.
                            Let us know,
                            Regards,
                            Frédéric.
                             ----- On 18 Apr 25, at 9:51, david <david.cas...@aevoo.fr> wrote:
                            I also tend to think that the disk has
                            nothing to do with the problem.
                            My reading is that the inode associated
                            with the dentry is missing.
                            Can anyone correct me?
                            Christophe informed me that the
                            directories were emptied before the
                            incident.
                             I don't understand why scrubbing doesn't
                             repair the metadata.
                             Perhaps because the directory is empty?
                             On Thu, 17 Apr 2025 at 19:06, Anthony D'Atri <anthony.da...@gmail.com> wrote:
                            HPE rebadges drives from manufacturers. A
                            quick search supports the idea
                            that this SKU is fulfilled at least partly
                            by Kioxia, so not likely a PLP
                            issue.
                             On Apr 17, 2025, at 11:39 AM, Christophe DIARRA <christophe.dia...@idris.fr> wrote:
                            Hello David,
                            The SSD model is VO007680JWZJL.
                            I will delay the 'ceph tell
                            mds.cfs_irods_test:0 damage rm 241447932'
                             for the moment. If no other solution is
                             found, I will be obliged to use
                             this command.
                            I found 'dentry' in the logs when the
                            cephfs cluster started:
                            Apr 16 17:29:53 mon-02 ceph-mds[2367]:
                            mds.cfs_irods_test.mon-02.awuygq
                            Updating MDS map to version 15613 from mon.2
                            Apr 16 17:29:53 mon-02 ceph-mds[2367]:
                            mds.0.15612 handle_mds_map i am
                            now mds.0.15612
                            Apr 16 17:29:53 mon-02 ceph-mds[2367]:
                            mds.0.15612 handle_mds_map state
                            change up:starting --> up:active
                            Apr 16 17:29:53 mon-02 ceph-mds[2367]:
                            mds.0.15612 active_start
                             Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.cache.den(0x1 testdir2) loaded already corrupt dentry: [dentry #0x1/testdir2 [2,head] rep@0.0 NULL (dversion lock) pv=0 v=4442 ino=(nil) state=0 0x5617e18c8280]
                             Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.cache.den(0x1 testdir1) loaded already corrupt dentry: [dentry #0x1/testdir1 [2,head] rep@0.0 NULL (dversion lock) pv=0 v=4442 ino=(nil) state=0 0x5617e18c8500]
                            Apr 16 17:29:53 mon-02 ceph-mon[2288]:
                            Health check failed: 1
                            filesystem is offline (MDS_ALL_DOWN)
                            Apr 16 17:29:53 mon-02 ceph-mon[2288]:
                            Health check failed: 1
                            filesystem is online with fewer MDS than
                            max_mds (MDS_UP_LESS_THAN_MAX)
                            Apr 16 17:29:53 mon-02 ceph-mon[2288]:
                            from='client.?
                            xx.xx.xx.8:0/3820885518'
                            entity='client.admin' cmd='[{"prefix": "fs
                            set",
                            "fs_name": "cfs_irods_test", "var":
                            "down", "val":
                            "false"}]': finished
                            Apr 16 17:29:53 mon-02 ceph-mon[2288]: daemon
                            mds.cfs_irods_test.mon-02.awuygq assigned
                            to filesystem cfs_irods_test as
                            rank 0 (now has 1 ranks)
                            Apr 16 17:29:53 mon-02 ceph-mon[2288]:
                            Health check cleared:
                            MDS_ALL_DOWN (was: 1 filesystem is offline)
                            Apr 16 17:29:53 mon-02 ceph-mon[2288]:
                            Health check cleared:
                            MDS_UP_LESS_THAN_MAX (was: 1 filesystem is
                            online with fewer MDS than
                            max_mds)
                            Apr 16 17:29:53 mon-02 ceph-mon[2288]: daemon
                            mds.cfs_irods_test.mon-02.awuygq is now
                            active in filesystem cfs_irods_test
                            as rank 0
                            Apr 16 17:29:54 mon-02 ceph-mgr[2444]:
                            log_channel(cluster) log [DBG] :
                            pgmap v1721: 4353 pgs: 4346 active+clean,
                            7 active+clean+scrubbing+deep;
                            3.9 TiB data, 417 TiB used, 6.4 P
                            iB / 6.8 PiB avail; 1.4 KiB/s rd, 1 op/s
                            If you need more extract from the log file
                            please let me know.
                            Thanks for your help,
                            Christophe
                            On 17/04/2025 13:39, David C. wrote:
                            If I'm not mistaken, this is a fairly rare
                            situation.
                            The fact that it's the result of a power
                            outage makes me think of a bad
                            SSD (like "S... Pro").
                            Does a grep of the dentry id in the MDS
                            logs return anything?
                            Maybe some interesting information around
                            this grep
                            In the heat of the moment, I have no other
                            idea than to delete the
                            dentry.
                            ceph tell mds.cfs_irods_test:0 damage rm
                            241447932
                            However, in production, this results in
                            the content (of dir
                            /testdir[12]) being abandoned.
                            Le jeu. 17 avr. 2025 à 12:44, Christophe
                            DIARRA <
                            [mailto:christophe.dia...@idris.fr
                            <mailto:christophe.dia...@idris.fr>
                            |christophe.dia...@idris.fr ] > a écrit :
                            Hello David,
                            Thank you for the tip about the scrubbing.
                            I have tried the
                             commands found in the documentation but they
                             seem to have no effect:
                             [root@mon-01 ~]# ceph tell mds.cfs_irods_test:0 scrub start / recursive,repair,force
                             2025-04-17T12:07:20.958+0200 7fd4157fa640  0 client.86301 ms_handle_reset on v2:130.84.80.10:6800/3218663047
                             2025-04-17T12:07:20.979+0200 7fd4157fa640  0 client.86307 ms_handle_reset on v2:130.84.80.10:6800/3218663047
                            {
                            "return_code": 0,
                            "scrub_tag":
                            "733b1c6d-a418-4c83-bc8e-b28b556e970c",
                            "mode": "asynchronous"
                            }
                             [root@mon-01 ~]# ceph tell mds.cfs_irods_test:0 scrub status
                             2025-04-17T12:07:30.734+0200 7f26cdffb640  0 client.86319 ms_handle_reset on v2:130.84.80.10:6800/3218663047
                             2025-04-17T12:07:30.753+0200 7f26cdffb640  0 client.86325 ms_handle_reset on v2:130.84.80.10:6800/3218663047
                            {
                            "status": "no active scrubs running",
                            "scrubs": {}
                            }
                            [root@mon-01 ~]# ceph -s
                            cluster:
                            id: b87276e0-1d92-11ef-a9d6-507c6f66ae2e
                             health: HEALTH_ERR 1 MDSs report damaged metadata
                            services:
                            mon: 3 daemons, quorum
                            mon-01,mon-03,mon-02 (age 19h)
                            mgr: mon-02.mqaubn(active, since 19h),
                            standbys: mon-03.gvywio,
                            mon-01.xhxqdi
                            mds: 1/1 daemons up, 2 standby
                            osd: 368 osds: 368 up (since 18h), 368 in
                            (since 3w)
                            data:
                            volumes: 1/1 healthy
                            pools: 10 pools, 4353 pgs
                            objects: 1.25M objects, 3.9 TiB
                            usage: 417 TiB used, 6.4 PiB / 6.8 PiB avail
                            pgs: 4353 active+clean
                             Did I miss something?
                             The server didn't crash. I don't understand
                             what you mean by "there may be a design flaw
                             in the infrastructure (insecure cache, for
                             example)".
                             How do we know if we have a design problem?
                             What should we check?
                            Best regards,
                            Christophe
                            On 17/04/2025 11:07, David C. wrote:
                            Hello Christophe,
                             Check the file system scrubbing procedure =>
                             https://docs.ceph.com/en/latest/cephfs/scrub/
                             But this doesn't guarantee data recovery.
                             Did the cluster crash?
                            Ceph should be able to handle it; there
                            may be a design flaw in
                            the infrastructure (insecure cache, for
                            example).
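
                             (One way to check for a volatile write cache on the
                             drives, since that is what an "insecure cache" usually
                             means here -- a sketch, /dev/sdX being a placeholder:)

                             smartctl -g wcache /dev/sdX
                             cat /sys/block/sdX/queue/write_cache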
                            David
                             On Thu, 17 Apr 2025 at 10:44, Christophe DIARRA <christophe.dia...@idris.fr> wrote:
                            Hello,
                            After an electrical maintenance I
                            restarted our ceph cluster
                            but it
                            remains in an unhealthy state: HEALTH_ERR
                            1 MDSs report
                            damaged metadata.
                            How to repair this damaged metadata ?
                            To bring down the cephfs cluster I
                            unmounted the fs from the
                            client
                            first and then did: ceph fs set
                            cfs_irods_test down true
                            To bring up the cephfs cluster I did: ceph
                            fs set
                            cfs_irods_test down false
                             Fortunately the cfs_irods_test fs is almost
                             empty and is a fs for tests. The ceph cluster
                             is not in production yet.
                            Following is the current status:
                            [root@mon-01 ~]# ceph health detail
                            HEALTH_ERR 1 MDSs report damaged metadata
                             [ERR] MDS_DAMAGE: 1 MDSs report damaged metadata
                                 mds.cfs_irods_test.mon-03.vlmeuz(mds.0): Metadata damage detected
                            [root@mon-01 ~]# ceph -s
                            cluster:
                            id: b87276e0-1d92-11ef-a9d6-507c6f66ae2e
                            health: HEALTH_ERR
                            1 MDSs report damaged metadata
                            services:
                            mon: 3 daemons, quorum
                            mon-01,mon-03,mon-02 (age 17h)
                            mgr: mon-02.mqaubn(active, since 17h),
                            standbys:
                            mon-03.gvywio,
                            mon-01.xhxqdi
                            mds: 1/1 daemons up, 2 standby
                            osd: 368 osds: 368 up (since 17h), 368 in
                            (since 3w)
                            data:
                            volumes: 1/1 healthy
                            pools: 10 pools, 4353 pgs
                            objects: 1.25M objects, 3.9 TiB
                            usage: 417 TiB used, 6.4 PiB / 6.8 PiB avail
                            pgs: 4353 active+clean
                            [root@mon-01 ~]# ceph fs ls
                            name: cfs_irods_test, metadata pool:
                            cfs_irods_md_test, data
                            pools:
                            [cfs_irods_def_test cfs_irods_data_test ]
                            [root@mon-01 ~]# ceph mds stat
                            cfs_irods_test:1
                            {0=cfs_irods_test.mon-03.vlmeuz=up:active} 2
                            up:standby
                            [root@mon-01 ~]# ceph fs status
                            cfs_irods_test - 0 clients
                            ==============
                             RANK  STATE   MDS                           ACTIVITY      DNS  INOS  DIRS  CAPS
                              0    active  cfs_irods_test.mon-03.vlmeuz  Reqs: 0 /s     12    15    14     0
                            POOL TYPE USED AVAIL
                            cfs_irods_md_test metadata 11.4M 34.4T
                            cfs_irods_def_test data 0 34.4T
                            cfs_irods_data_test data 0 4542T
                            STANDBY MDS
                            cfs_irods_test.mon-01.hitdem
                            cfs_irods_test.mon-02.awuygq
                            MDS version: ceph version 18.2.2
                            (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
                            reef (stable)
                            [root@mon-01 ~]#
                             [root@mon-01 ~]# ceph tell mds.cfs_irods_test:0 damage ls
                             2025-04-17T10:23:31.849+0200 7f4b87fff640  0 client.86181 ms_handle_reset on v2:130.84.80.10:6800/3218663047
                             2025-04-17T10:23:31.866+0200 7f4b87fff640  0 client.86187 ms_handle_reset on v2:130.84.80.10:6800/3218663047
                            [
                            {
                            *"damage_type": "dentry",*
                            "id": 241447932,
                            "ino": 1,
                            "frag": "*",
                            "dname": "testdir2",
                            "snap_id": "head",
                            "path": "/testdir2"
                            },
                            {
                            *"damage_type": "dentry"*,
                            "id": 2273238993,
                            "ino": 1,
                            "frag": "*",
                            "dname": "testdir1",
                            "snap_id": "head",
                            "path": "/testdir1"
                            }
                            ]
                            [root@mon-01 ~]#
                            Any help will be appreciated,
                            Thanks,
                            Christophe

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
