Can anybody comment on my questions below? Thanks so much in advance....
On 26 June 2024 at 08:08:39 CEST, Dietmar Rieder <dietmar.rie...@i-med.ac.at> wrote:
>...sending also to the list and Xiubo (they were accidentally removed from
>the recipients)...
>
>On 6/25/24 21:28, Dietmar Rieder wrote:
>> Hi Patrick, Xiubo and List,
>>
>> we finally managed to get the filesystem repaired and running again! YEAH,
>> I'm so happy!!
>>
>> Big thanks for your support, Patrick and Xiubo! (I'd love to invite you for
>> a beer!)
>>
>>
>> Please see some comments and (important?) questions below:
>>
>> On 6/25/24 03:14, Patrick Donnelly wrote:
>>> On Mon, Jun 24, 2024 at 5:22 PM Dietmar Rieder
>>> <dietmar.rie...@i-med.ac.at> wrote:
>>>>
>>>> (resending this; the original message seems not to have made it through
>>>> amid all the SPAM recently sent to the list, my apologies if it shows up
>>>> twice at some point)
>>>>
>>>> Hi List,
>>>>
>>>> we are still struggling to get our cephfs back online again. This is an
>>>> update to inform you about what we did so far, and we kindly ask for any
>>>> input on how to proceed:
>>>>
>>>> After resetting the journals, Xiubo suggested (in a PM) to continue with
>>>> the disaster recovery procedure:
>>>>
>>>> cephfs-data-scan init skipped creating the inodes 0x0x1 and 0x0x100:
>>>>
>>>> [root@ceph01-b ~]# cephfs-data-scan init
>>>> Inode 0x0x1 already exists, skipping create.  Use --force-init to
>>>> overwrite the existing object.
>>>> Inode 0x0x100 already exists, skipping create.  Use --force-init to
>>>> overwrite the existing object.
>>>>
>>>> We did not use --force-init and proceeded with scan_extents using a
>>>> single worker, which was indeed very slow.
>>>>
>>>> After ~24h we interrupted scan_extents and restarted it with 32 workers,
>>>> which went through in about 2h15min without any issue.
>>>>
>>>> Then I started scan_inodes with 32 workers; this also finished after
>>>> ~50min, with no output on stderr or stdout.
>>>>
>>>> I went on with scan_links, which after ~45 minutes threw the following
>>>> error:
>>>>
>>>> # cephfs-data-scan scan_links
>>>> Error ((2) No such file or directory)
>>>
>>> Not sure what this indicates necessarily. You can try to get more
>>> debug information using:
>>>
>>> [client]
>>>   debug mds = 20
>>>   debug ms = 1
>>>   debug client = 20
>>>
>>> in the local ceph.conf for the node running cephfs-data-scan.
>>
>> I did that and restarted "cephfs-data-scan scan_links".
>>
>> It didn't produce any additional debug output, however this time it just
>> went through without error (~50 min).
>>
>> We then reran "cephfs-data-scan cleanup" and it also finished without
>> error after about 10h.
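
For anyone following along: running scan_extents/scan_inodes with multiple workers is done via the worker_n/worker_m options of cephfs-data-scan, one process per worker started in parallel shells. Roughly like this (a sketch with 32 workers as in our run; if I remember the flags correctly, --worker_n is the zero-based worker index and --worker_m the total number of workers, and any data pool arguments used in the single-worker run have to be repeated for each worker):

  # all workers must have finished scan_extents before scan_inodes is started
  cephfs-data-scan scan_extents --worker_n 0 --worker_m 32
  cephfs-data-scan scan_extents --worker_n 1 --worker_m 32
  ...
  cephfs-data-scan scan_extents --worker_n 31 --worker_m 32

  # then the same pattern again for scan_inodes
  cephfs-data-scan scan_inodes --worker_n 0 --worker_m 32
  ...
  cephfs-data-scan scan_inodes --worker_n 31 --worker_m 32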
>>
>> We then set the fs as repaired and all seems to work fine again:
>>
>> [root@ceph01-b ~]# ceph mds repaired 0
>> repaired: restoring rank 1:0
>>
>> [root@ceph01-b ~]# ceph -s
>>   cluster:
>>     id:     aae23c5c-a98b-11ee-b44d-00620b05cac4
>>     health: HEALTH_OK
>>
>>   services:
>>     mon: 3 daemons, quorum cephmon-01,cephmon-03,cephmon-02 (age 6d)
>>     mgr: cephmon-01.dsxcho(active, since 6d), standbys: cephmon-02.nssigg, cephmon-03.rgefle
>>     mds: 1/1 daemons up, 5 standby
>>     osd: 336 osds: 336 up (since 2M), 336 in (since 4M)
>>
>>   data:
>>     volumes: 1/1 healthy
>>     pools:   4 pools, 6401 pgs
>>     objects: 284.68M objects, 623 TiB
>>     usage:   890 TiB used, 3.1 PiB / 3.9 PiB avail
>>     pgs:     6206 active+clean
>>              140  active+clean+scrubbing
>>              55   active+clean+scrubbing+deep
>>
>>   io:
>>     client:   3.9 MiB/s rd, 84 B/s wr, 482 op/s rd, 1.11k op/s wr
>>
>>
>> [root@ceph01-b ~]# ceph fs status
>> cephfs - 0 clients
>> ======
>> RANK  STATE            MDS                ACTIVITY     DNS    INOS   DIRS   CAPS
>>  0    active  default.cephmon-03.xcujhz   Reqs: 0 /s   124k   60.3k  1993    0
>>          POOL            TYPE     USED  AVAIL
>> ssd-rep-metadata-pool  metadata   298G  63.4T
>>   sdd-rep-data-pool      data    10.2T  84.5T
>>   hdd-ec-data-pool       data     808T  1929T
>>        STANDBY MDS
>> default.cephmon-01.cepqjp
>> default.cephmon-01.pvnqad
>> default.cephmon-02.duujba
>> default.cephmon-02.nyfook
>> default.cephmon-03.chjusj
>> MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
>>
>>
>> The mds log however shows some "bad backtrace on directory inode" messages:
>>
>> 2024-06-25T18:45:36.575+0000 7f8594659700  1 mds.default.cephmon-03.xcujhz Updating MDS map to version 8082 from mon.1
>> 2024-06-25T18:45:36.575+0000 7f8594659700  1 mds.0.8082 handle_mds_map i am now mds.0.8082
>> 2024-06-25T18:45:36.575+0000 7f8594659700  1 mds.0.8082 handle_mds_map state change up:standby --> up:replay
>> 2024-06-25T18:45:36.575+0000 7f8594659700  1 mds.0.8082 replay_start
>> 2024-06-25T18:45:36.575+0000 7f8594659700  1 mds.0.8082 waiting for osdmap 34331 (which blocklists prior instance)
>> 2024-06-25T18:45:36.581+0000 7f858de4c700  0 mds.0.cache creating system inode with ino:0x100
>> 2024-06-25T18:45:36.581+0000 7f858de4c700  0 mds.0.cache creating system inode with ino:0x1
>> 2024-06-25T18:45:36.589+0000 7f858ce4a700  1 mds.0.journal EResetJournal
>> 2024-06-25T18:45:36.589+0000 7f858ce4a700  1 mds.0.sessionmap wipe start
>> 2024-06-25T18:45:36.589+0000 7f858ce4a700  1 mds.0.sessionmap wipe result
>> 2024-06-25T18:45:36.589+0000 7f858ce4a700  1 mds.0.sessionmap wipe done
>> 2024-06-25T18:45:36.589+0000 7f858ce4a700  1 mds.0.8082 Finished replaying journal
>> 2024-06-25T18:45:36.589+0000 7f858ce4a700  1 mds.0.8082 making mds journal writeable
>> 2024-06-25T18:45:37.578+0000 7f8594659700  1 mds.default.cephmon-03.xcujhz Updating MDS map to version 8083 from mon.1
>> 2024-06-25T18:45:37.578+0000 7f8594659700  1 mds.0.8082 handle_mds_map i am now mds.0.8082
>> 2024-06-25T18:45:37.578+0000 7f8594659700  1 mds.0.8082 handle_mds_map state change up:replay --> up:reconnect
>> 2024-06-25T18:45:37.578+0000 7f8594659700  1 mds.0.8082 reconnect_start
>> 2024-06-25T18:45:37.578+0000 7f8594659700  1 mds.0.8082 reopen_log
>> 2024-06-25T18:45:37.578+0000 7f8594659700  1 mds.0.8082 reconnect_done
>> 2024-06-25T18:45:38.579+0000 7f8594659700  1 mds.default.cephmon-03.xcujhz Updating MDS map to version 8084 from mon.1
>> 2024-06-25T18:45:38.579+0000 7f8594659700  1 mds.0.8082 handle_mds_map i am now mds.0.8082
>> 2024-06-25T18:45:38.579+0000 7f8594659700  1 mds.0.8082 handle_mds_map state change up:reconnect --> up:rejoin
>> 2024-06-25T18:45:38.579+0000 7f8594659700  1 mds.0.8082 rejoin_start
>> 2024-06-25T18:45:38.583+0000 7f8594659700  1 mds.0.8082 rejoin_joint_start
>> 2024-06-25T18:45:38.592+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e42340
>> 2024-06-25T18:45:38.680+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e45d8b
>> 2024-06-25T18:45:38.754+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e45d90
>> 2024-06-25T18:45:38.754+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e45d9f
>> 2024-06-25T18:45:38.785+0000 7f858fe50700  1 mds.0.8082 rejoin_done
>> 2024-06-25T18:45:39.582+0000 7f8594659700  1 mds.default.cephmon-03.xcujhz Updating MDS map to version 8085 from mon.1
>> 2024-06-25T18:45:39.582+0000 7f8594659700  1 mds.0.8082 handle_mds_map i am now mds.0.8082
>> 2024-06-25T18:45:39.582+0000 7f8594659700  1 mds.0.8082 handle_mds_map state change up:rejoin --> up:active
>> 2024-06-25T18:45:39.582+0000 7f8594659700  1 mds.0.8082 recovery_done -- successful recovery!
>> 2024-06-25T18:45:39.584+0000 7f8594659700  1 mds.0.8082 active_start
>> 2024-06-25T18:45:39.585+0000 7f8594659700  1 mds.0.8082 cluster recovered.
>> 2024-06-25T18:45:42.409+0000 7f8591e54700 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
>> 2024-06-25T18:57:28.213+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x4
>>
>>
>> Is there anything that we can do about this, to get rid of the "bad
>> backtrace on directory inode" errors?
>>
>>
>> Some more questions:
>>
>> 1.
>> As Xiubo suggested, we now tried to mount the filesystem with the "nowsync"
>> option <https://tracker.ceph.com/issues/61009#note-26>:
>>
>> [root@ceph01-b ~]# mount -t ceph cephfs_user@.cephfs=/ /mnt/cephfs -o secretfile=/etc/ceph/ceph.client.cephfs_user.secret,nowsync
>>
>> however, the option does not seem to show up in /proc/mounts:
>>
>> [root@ceph01-b ~]# grep ceph /proc/mounts
>> cephfs_user@aae23c5c-a98b-11ee-b44d-00620b05cac4.cephfs=/ /mnt/cephfs ceph rw,relatime,name=cephfs_user,secret=<hidden>,ms_mode=prefer-crc,acl,mon_addr=10.1.3.21:3300/10.1.3.22:3300/10.1.3.23:3300 0 0
>>
>> The kernel version is 5.14.0 (from Rocky 9.3):
>>
>> [root@ceph01-b ~]# uname -a
>> Linux ceph01-b 5.14.0-362.24.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 13 17:33:16 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
>>
>> Is this expected? How can we make sure that the filesystem uses 'nowsync',
>> so that we do not hit the bug <https://tracker.ceph.com/issues/61009> again?
>>
>
>Oh, I think I misunderstood the suggested workaround. I guess we need to
>disable "nowsync", which is set by default, right?
>
>So: -o wsync
>
>should be the workaround, right?
>
>> 2.
>> There are two empty files in lost+found now. Is it safe to remove them?
>>
>> [root@ceph01-b lost+found]# ls -la
>> total 0
>> drwxr-xr-x 2 root root 1 Jan  1  1970 .
>> drwxr-xr-x 4 root root 2 Mar 13 21:22 ..
>> -r-x------ 1 root root 0 Jun 20 23:50 100037a50e2
>> -r-x------ 1 root root 0 Jun 20 19:05 200049612e5
>>
>> 3.
>> Are there any specific steps that we should perform now (scrub or similar
>> things) before we put the filesystem into production again?
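
Regarding question 3 and the "bad backtrace on directory inode" errors: my current plan would be to run a recursive forward scrub with repair and then check the damage table before letting users back on. Roughly like this (a sketch, assuming our filesystem name "cephfs" and rank 0); please correct me if a different procedure is recommended:

  ceph tell mds.cephfs:0 scrub start / recursive,repair
  ceph tell mds.cephfs:0 scrub status   # poll until the scrub has finished
  ceph tell mds.cephfs:0 damage ls      # check whether any damage entries remain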
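
And regarding question 1: if disabling async dirops is indeed the right workaround for https://tracker.ceph.com/issues/61009, I assume we would simply remount with "wsync" instead of "nowsync" and then check /proc/mounts again, something like the sketch below (same secretfile as above). Does the kernel client only list options that differ from its built-in defaults, which would explain why "nowsync" did not show up?

  umount /mnt/cephfs
  mount -t ceph cephfs_user@.cephfs=/ /mnt/cephfs -o secretfile=/etc/ceph/ceph.client.cephfs_user.secret,wsync
  grep ceph /proc/mounts   # should "wsync" appear here once it is in effect?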
>>
>
>
>Dietmar