[ceph-users] Re: ceph fs (meta) data inconsistent
Hi Frank, Sure, just take your time. Thanks - Xiubo On 12/8/23 19:54, Frank Schilder wrote: Hi Xiubo, I will update the case. I'm afraid this will have to wait a little bit though. I'm too occupied for a while and also don't have a test cluster that would help speed things up. I will update you, please keep the tracker open. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Tuesday, December 5, 2023 1:58 AM To: Frank Schilder; Gregory Farnum Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent Frank, By using your script I still couldn't reproduce it. Locally my python version is 3.9.16, and I didn't have other VMs to test python other versions. Could you check the tracker to provide the debug logs ? Thanks - Xiubo On 12/1/23 21:08, Frank Schilder wrote: Hi Xiubo, I uploaded a test script with session output showing the issue. When I look at your scripts, I can't see the stat-check on the second host anywhere. Hence, I don't really know what you are trying to compare. If you want me to run your test scripts on our system for comparison, please include the part executed on the second host explicitly in an ssh-command. Running your scripts alone in their current form will not reproduce the issue. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Monday, November 27, 2023 3:59 AM To: Frank Schilder; Gregory Farnum Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent On 11/24/23 21:37, Frank Schilder wrote: Hi Xiubo, thanks for the update. I will test your scripts in our system next week. Something important: running both scripts on a single client will not produce a difference. You need 2 clients. The inconsistency is between clients, not on the same client. For example: Frank, Yeah, I did this with 2 different kclients. Thanks Setup: host1 and host2 with a kclient mount to a cephfs under /mnt/kcephfs Test 1 - on host1: execute shutil.copy2 - execute ls -l /mnt/kcephfs/ on host1 and host2: same result Test 2 - on host1: shutil.copy - execute ls -l /mnt/kcephfs/ on host1 and host2: file size=0 on host 2 while correct on host 1 Your scripts only show output of one host, but the inconsistency requires two hosts for observation. The stat information is updated on host1, but not synchronized to host2 in the second test. In case you can't reproduce that, I will append results from our system to the case. Also it would be important to know the python and libc versions. We observe this only for newer versions of both. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Thursday, November 23, 2023 3:47 AM To: Frank Schilder; Gregory Farnum Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent I just raised one tracker to follow this: https://tracker.ceph.com/issues/63510 Thanks - Xiubo ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
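For reference, a minimal two-client reproduction of the test Frank describes could look like the sketch below. Hostnames, paths and the mount point are placeholders, and this is not the actual script attached to the tracker; it only illustrates the copy-on-host1 / stat-on-host2 sequence with a plain shutil.copy (swap in shutil.copy2 to compare the two cases):

# on host1, which has the same CephFS kernel-mounted as host2 under /mnt/kcephfs
dd if=/dev/urandom of=/tmp/testfile bs=1M count=10
# copy without metadata (shutil.copy); repeat with shutil.copy2 to compare
python3 -c "import shutil; shutil.copy('/tmp/testfile', '/mnt/kcephfs/testfile')"
# check the size locally and on the second client
stat -c '%n %s' /mnt/kcephfs/testfile
ssh host2 stat -c '%n %s' /mnt/kcephfs/testfile
# per the report, host2 may show size 0 while host1 shows the correct size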
[ceph-users] Osd full
Hello the team, We initially had a cluster of 3 machines with 4 osd on each machine, we added 4 machines in the cluster (each machine with 4 osd) We launched the balancing but it never finished, still in progress. But the big issue: we have an osd full and all the pools on this osd are read only.

*ceph osd df *:

ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
20  hdd    9.09569  1.0       9.1 TiB  580 GiB  576 GiB  1.2 GiB  3.1 GiB  8.5 TiB   6.23  0.32  169  up
21  hdd    9.09569  1.0       9.1 TiB  1.5 TiB  1.5 TiB  252 MiB  7.7 GiB  7.6 TiB  16.08  0.82  247  up
22  hdd    9.09569  1.0       9.1 TiB  671 GiB  667 GiB  204 MiB  4.1 GiB  8.4 TiB   7.21  0.37  136  up
23  hdd    9.09569  1.0       9.1 TiB  665 GiB  660 GiB  270 MiB  4.5 GiB  8.4 TiB   7.14  0.37  124  up
 0  hdd    9.09569  1.0       9.1 TiB  1.2 TiB  1.2 TiB   87 MiB  6.0 GiB  7.9 TiB  13.30  0.68  230  up
 1  hdd    9.09569  1.0       9.1 TiB  1.3 TiB  1.3 TiB  347 MiB  6.6 GiB  7.8 TiB  14.01  0.72  153  up
 2  hdd    9.09569  0.65009   9.1 TiB  1.8 TiB  1.8 TiB  443 MiB  7.3 GiB  7.3 TiB  20.00  1.03  147  up
 3  hdd    9.09569  1.0       9.1 TiB  617 GiB  611 GiB  220 MiB  5.8 GiB  8.5 TiB   6.62  0.34  101  up
 4  hdd    9.09569  0.80005   9.1 TiB  2.0 TiB  2.0 TiB  293 MiB  8.2 GiB  7.1 TiB  22.12  1.13  137  up
 5  hdd    9.09569  1.0       9.1 TiB  857 GiB  852 GiB  157 MiB  4.9 GiB  8.3 TiB   9.20  0.47  155  up
 6  hdd    9.09569  1.0       9.1 TiB  580 GiB  575 GiB  678 MiB  4.4 GiB  8.5 TiB   6.23  0.32  114  up
 7  hdd    9.09569  0.5       9.1 TiB  7.7 TiB  7.7 TiB  103 MiB   16 GiB  1.4 TiB  85.03  4.36  201  up
24  hdd    9.09569  1.0       9.1 TiB  1.2 TiB  1.2 TiB  133 MiB  6.2 GiB  7.9 TiB  13.11  0.67  225  up
25  hdd    9.09569  0.34999   9.1 TiB  8.3 TiB  8.2 TiB  101 MiB   17 GiB  860 GiB  90.77  4.66  159  up
26  hdd    9.09569  1.0       9.1 TiB  665 GiB  661 GiB  292 MiB  3.8 GiB  8.4 TiB   7.14  0.37  107  up
27  hdd    9.09569  1.0       9.1 TiB  427 GiB  423 GiB  241 MiB  3.4 GiB  8.7 TiB   4.58  0.24  103  up
 8  hdd    9.09569  1.0       9.1 TiB  845 GiB  839 GiB  831 MiB  5.9 GiB  8.3 TiB   9.07  0.47  163  up
 9  hdd    9.09569  1.0       9.1 TiB  727 GiB  722 GiB  162 MiB  4.8 GiB  8.4 TiB   7.80  0.40  169  up
10  hdd    9.09569  0.80005   9.1 TiB  1.9 TiB  1.9 TiB  742 MiB  7.5 GiB  7.2 TiB  21.01  1.08  136  up
11  hdd    9.09569  1.0       9.1 TiB  733 GiB  727 GiB  498 MiB  5.2 GiB  8.4 TiB   7.87  0.40  163  up
12  hdd    9.09569  1.0       9.1 TiB  892 GiB  886 GiB  318 MiB  5.6 GiB  8.2 TiB   9.58  0.49  254  up
13  hdd    9.09569  1.0       9.1 TiB  759 GiB  755 GiB   37 MiB  4.0 GiB  8.4 TiB   8.15  0.42  134  up
14  hdd    9.09569  0.85004   9.1 TiB  2.3 TiB  2.3 TiB  245 MiB  7.7 GiB  6.8 TiB  24.96  1.28  142  up
15  hdd    9.09569  1.0       9.1 TiB  7.3 TiB  7.3 TiB  435 MiB   16 GiB  1.8 TiB  80.17  4.11  213  up
16  hdd    9.09569  1.0       9.1 TiB  784 GiB  781 GiB  104 MiB  3.6 GiB  8.3 TiB   8.42  0.43  247  up
17  hdd    9.09569  1.0       9.1 TiB  861 GiB  856 GiB  269 MiB  5.1 GiB  8.3 TiB   9.25  0.47  102  up
18  hdd    9.09569  1.0       9.1 TiB  1.9 TiB  1.9 TiB  962 MiB  8.2 GiB  7.2 TiB  21.15  1.09  283  up
19  hdd    9.09569  1.0       9.1 TiB  893 GiB  888 GiB  291 MiB  4.6 GiB  8.2 TiB   9.59  0.49  148  up
TOTAL  255 TiB  50 TiB  49 TiB  9.7 GiB  187 GiB  205 TiB  19.49
MIN/MAX VAR: 0.24/4.66  STDDEV: 19.63

*ceph health detail |grep -i wrn*

[WRN] OSDMAP_FLAGS: nodeep-scrub flag(s) set
[WRN] OSD_NEARFULL: 2 nearfull osd(s)
[WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this doesn't resolve itself): 16 pgs backfill_toofull
[WRN] PG_NOT_DEEP_SCRUBBED: 1360 pgs not deep-scrubbed in time
[WRN] PG_NOT_SCRUBBED: 53 pgs not scrubbed in time
[WRN] POOL_NEARFULL: 36 pool(s) nearfull

Thanks the team ;)
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
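For context: when an OSD actually crosses the full_ratio, writes to the pools mapped to it are blocked, and a commonly used emergency measure is to temporarily raise the thresholds so that client I/O and backfill can resume while data drains off the OSD. The sketch below is generic (the ratio values are examples, the defaults are nearfull 0.85 / backfillfull 0.90 / full 0.95), not advice specific to this cluster:

# current thresholds
ceph osd dump | grep -E 'full_ratio|backfillfull_ratio|nearfull_ratio'
# temporarily raise them so I/O and backfill can proceed
ceph osd set-backfillfull-ratio 0.92
ceph osd set-full-ratio 0.97
# watch data move off the most-utilized OSDs, then restore the defaults
ceph osd df tree
ceph -s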
[ceph-users] Re: OSD CPU and write latency increase after upgrade from 15.2.16 to 17.2.6
After analysis: our cluster uses compression, but ISAL was not actually being used for it. Compiling the ISAL compression support in the current code depends on the macro HAVE_NASM_X64_AVX2; that macro has been removed, so compression does not use ISAL even when the compressor_zlib_isal parameter is set to true. At 2023-12-06 09:51:42, "Tony Yao" wrote: >Hi, > > >Recently, I upgraded Ceph from 15.2.16 to 17.2.6, but I found that OSD CPU >usage increased from 30% to 90% or more, and OSD subop_w_latency increased >from 600us to 5ms. This is incredible. > >My hardware environment: > >12 nodes x 12 NVMe (Intel P4510 4T) > >I tried to set the OSD configuration to the default value and saw no >improvement either. > > > >Have you ever encountered this problem? >What might have gone wrong? > > >Best regards, >Tony Yao >___ >ceph-users mailing list -- ceph-users@ceph.io >To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
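As a side note, the compression settings referenced above can be inspected roughly as follows (a sketch; the pool name is a placeholder, and whether ISAL was actually compiled in is a build-time question these commands cannot answer):

# value of the option referenced above
ceph config get osd compressor_zlib_isal
# per-pool compression settings, if any
ceph osd pool get <pool-name> compression_algorithm
ceph osd pool get <pool-name> compression_mode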
[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed
Hi, Another update: after 2 more weeks the mgr process grew to ~1.5 GB, which again was expected: mgr.ceph01.vankui ceph01 *:8443,9283 running (2w)102s ago 2y 1519M- 16.2.14 fc0182d6cda5 3451f8c6c07e mgr.ceph02.shsinf ceph02 *:8443,9283 running (2w)102s ago 7M 112M- 16.2.14 fc0182d6cda5 1c3d2d83b6df The cluster is healthy and operating normally, the mgr process is growing slowly. It's still unclear what caused the ballooning and OOM issue under very similar conditions. /Z On Sat, 25 Nov 2023 at 08:31, Zakhar Kirpichenko wrote: > Hi, > > A small update: after disabling 'progress' module the active mgr (on > ceph01) used up ~1.3 GB of memory in 3 days, which was expected: > > mgr.ceph01.vankui ceph01 *:8443,9283 running (3d) 9m ago 2y > 1284M- 16.2.14 fc0182d6cda5 3451f8c6c07e > mgr.ceph02.shsinf ceph02 *:8443,9283 running (3d) 9m ago 7M > 374M- 16.2.14 fc0182d6cda5 1c3d2d83b6df > > The cluster is healthy and operating normally. The mgr process is growing > slowly, at roughly about 1-2 MB per 10 minutes give or take, which is not > quick enough to balloon to over 100 GB RSS over several days, which likely > means that whatever triggers the issue happens randomly and quite suddenly. > I'll continue monitoring the mgr and get back with more observations. > > /Z > > On Wed, 22 Nov 2023 at 16:33, Zakhar Kirpichenko wrote: > >> Thanks for this. This looks similar to what we're observing. Although we >> don't use the API apart from the usage by Ceph deployment itself - which I >> guess still counts. >> >> /Z >> >> On Wed, 22 Nov 2023, 15:22 Adrien Georget, >> wrote: >> >>> Hi, >>> >>> This memory leak with ceph-mgr seems to be due to a change in Ceph >>> 16.2.12. >>> Check this issue : https://tracker.ceph.com/issues/59580 >>> We are also affected by this, with or without containerized services. >>> >>> Cheers, >>> Adrien >>> >>> Le 22/11/2023 à 14:14, Eugen Block a écrit : >>> > One other difference is you use docker, right? We use podman, could it >>> > be some docker restriction? >>> > >>> > Zitat von Zakhar Kirpichenko : >>> > >>> >> It's a 6-node cluster with 96 OSDs, not much I/O, mgr . Each node has >>> >> 384 >>> >> GB of RAM, each OSD has a memory target of 16 GB, about 100 GB of >>> >> memory, >>> >> give or take, is available (mostly used by page cache) on each node >>> >> during >>> >> normal operation. Nothing unusual there, tbh. >>> >> >>> >> No unusual mgr modules or settings either, except for disabled >>> progress: >>> >> >>> >> { >>> >> "always_on_modules": [ >>> >> "balancer", >>> >> "crash", >>> >> "devicehealth", >>> >> "orchestrator", >>> >> "pg_autoscaler", >>> >> "progress", >>> >> "rbd_support", >>> >> "status", >>> >> "telemetry", >>> >> "volumes" >>> >> ], >>> >> "enabled_modules": [ >>> >> "cephadm", >>> >> "dashboard", >>> >> "iostat", >>> >> "prometheus", >>> >> "restful" >>> >> ], >>> >> >>> >> /Z >>> >> >>> >> On Wed, 22 Nov 2023, 14:52 Eugen Block, wrote: >>> >> >>> >>> What does your hardware look like memory-wise? Just for comparison, >>> >>> one customer cluster has 4,5 GB in use (middle-sized cluster for >>> >>> openstack, 280 OSDs): >>> >>> >>> >>> PID USER PR NIVIRTRESSHR S %CPU %MEM >>> TIME+ >>> >>> COMMAND >>> >>> 6077 ceph 20 0 6357560 4,522g 22316 S 12,00 1,797 >>> >>> 57022:54 ceph-mgr >>> >>> >>> >>> In our own cluster (smaller than that and not really heavily used) >>> the >>> >>> mgr uses almost 2 GB. So those numbers you have seem relatively >>> small. 
>>> >>> >>> >>> Zitat von Zakhar Kirpichenko : >>> >>> >>> >>> > I've disabled the progress module entirely and will see how it >>> goes. >>> >>> > Otherwise, mgr memory usage keeps increasing slowly, from past >>> >>> experience >>> >>> > it will stabilize at around 1.5-1.6 GB. Other than this event >>> >>> warning, >>> >>> it's >>> >>> > unclear what could have caused random memory ballooning. >>> >>> > >>> >>> > /Z >>> >>> > >>> >>> > On Wed, 22 Nov 2023 at 13:07, Eugen Block wrote: >>> >>> > >>> >>> >> I see these progress messages all the time, I don't think they >>> cause >>> >>> >> it, but I might be wrong. You can disable it just to rule that >>> out. >>> >>> >> >>> >>> >> Zitat von Zakhar Kirpichenko : >>> >>> >> >>> >>> >> > Unfortunately, I don't have a full stack trace because there's >>> no >>> >>> crash >>> >>> >> > when the mgr gets oom-killed. There's just the mgr log, which >>> >>> looks >>> >>> >> > completely normal until about 2-3 minutes before the oom-kill, >>> >>> when >>> >>> >> > tmalloc warnings show up. >>> >>> >> > >>> >>> >> > I'm not sure that it's the same issue that is described in the >>> >>> tracker. >>> >>> >> We >>> >>> >> > seem to have some stale "events" in the progress module though: >>> >>> >> > >>>
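For readers following this thread, the checks discussed above roughly correspond to the commands below (a sketch; the progress on/off and clear subcommands may not be available on older releases):

# memory use of the mgr daemons as reported by the orchestrator
ceph orch ps --daemon-type mgr
# the progress module is always-on, so it is silenced with its own switch
# rather than `ceph mgr module disable`
ceph progress off
# drop any stale progress events
ceph progress clear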
[ceph-users] Re: Osd full
Hi Mohamed, Changing weights is no longer a good practice. The balancer is supposed to do the job. The number of pg per osd is really tight on your infrastructure. Can you display the ceph osd tree command? Cordialement, *David CASIER* *Ligne directe: +33(0) 9 72 61 98 29* Le lun. 11 déc. 2023 à 11:06, Mohamed LAMDAOUAR a écrit : > Hello the team, > > We initially had a cluster of 3 machines with 4 osd on each machine, we > added 4 machines in the cluster (each machine with 4 osd) > We launched the balancing but it never finished, still in progress. But the > big issue: we have an osd full and all the pools on this osd are read only. > > *ceph osd df *: > > ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META > AVAIL%USE VAR PGS STATUS > 20hdd 9.09569 1.0 9.1 TiB 580 GiB 576 GiB 1.2 GiB 3.1 GiB > 8.5 TiB 6.23 0.32 169 up > 21hdd 9.09569 1.0 9.1 TiB 1.5 TiB 1.5 TiB 252 MiB 7.7 GiB > 7.6 TiB 16.08 0.82 247 up > 22hdd 9.09569 1.0 9.1 TiB 671 GiB 667 GiB 204 MiB 4.1 GiB > 8.4 TiB 7.21 0.37 136 up > 23hdd 9.09569 1.0 9.1 TiB 665 GiB 660 GiB 270 MiB 4.5 GiB > 8.4 TiB 7.14 0.37 124 up > 0hdd 9.09569 1.0 9.1 TiB 1.2 TiB 1.2 TiB 87 MiB 6.0 GiB > 7.9 TiB 13.30 0.68 230 up > 1hdd 9.09569 1.0 9.1 TiB 1.3 TiB 1.3 TiB 347 MiB 6.6 GiB > 7.8 TiB 14.01 0.72 153 up > 2hdd 9.09569 0.65009 9.1 TiB 1.8 TiB 1.8 TiB 443 MiB 7.3 GiB > 7.3 TiB 20.00 1.03 147 up > 3hdd 9.09569 1.0 9.1 TiB 617 GiB 611 GiB 220 MiB 5.8 GiB > 8.5 TiB 6.62 0.34 101 up > 4hdd 9.09569 0.80005 9.1 TiB 2.0 TiB 2.0 TiB 293 MiB 8.2 GiB > 7.1 TiB 22.12 1.13 137 up > 5hdd 9.09569 1.0 9.1 TiB 857 GiB 852 GiB 157 MiB 4.9 GiB > 8.3 TiB 9.20 0.47 155 up > 6hdd 9.09569 1.0 9.1 TiB 580 GiB 575 GiB 678 MiB 4.4 GiB > 8.5 TiB 6.23 0.32 114 up > 7hdd 9.09569 0.5 9.1 TiB 7.7 TiB 7.7 TiB 103 MiB 16 GiB > 1.4 TiB 85.03 4.36 201 up > 24hdd 9.09569 1.0 9.1 TiB 1.2 TiB 1.2 TiB 133 MiB 6.2 GiB > 7.9 TiB 13.11 0.67 225 up > 25hdd 9.09569 0.34999 9.1 TiB 8.3 TiB 8.2 TiB 101 MiB 17 GiB > 860 GiB 90.77 4.66 159 up > 26hdd 9.09569 1.0 9.1 TiB 665 GiB 661 GiB 292 MiB 3.8 GiB > 8.4 TiB 7.14 0.37 107 up > 27hdd 9.09569 1.0 9.1 TiB 427 GiB 423 GiB 241 MiB 3.4 GiB > 8.7 TiB 4.58 0.24 103 up > 8hdd 9.09569 1.0 9.1 TiB 845 GiB 839 GiB 831 MiB 5.9 GiB > 8.3 TiB 9.07 0.47 163 up > 9hdd 9.09569 1.0 9.1 TiB 727 GiB 722 GiB 162 MiB 4.8 GiB > 8.4 TiB 7.80 0.40 169 up > 10hdd 9.09569 0.80005 9.1 TiB 1.9 TiB 1.9 TiB 742 MiB 7.5 GiB > 7.2 TiB 21.01 1.08 136 up > 11hdd 9.09569 1.0 9.1 TiB 733 GiB 727 GiB 498 MiB 5.2 GiB > 8.4 TiB 7.87 0.40 163 up > 12hdd 9.09569 1.0 9.1 TiB 892 GiB 886 GiB 318 MiB 5.6 GiB > 8.2 TiB 9.58 0.49 254 up > 13hdd 9.09569 1.0 9.1 TiB 759 GiB 755 GiB 37 MiB 4.0 GiB > 8.4 TiB 8.15 0.42 134 up > 14hdd 9.09569 0.85004 9.1 TiB 2.3 TiB 2.3 TiB 245 MiB 7.7 GiB > 6.8 TiB 24.96 1.28 142 up > 15hdd 9.09569 1.0 9.1 TiB 7.3 TiB 7.3 TiB 435 MiB 16 GiB > 1.8 TiB 80.17 4.11 213 up > 16hdd 9.09569 1.0 9.1 TiB 784 GiB 781 GiB 104 MiB 3.6 GiB > 8.3 TiB 8.42 0.43 247 up > 17hdd 9.09569 1.0 9.1 TiB 861 GiB 856 GiB 269 MiB 5.1 GiB > 8.3 TiB 9.25 0.47 102 up > 18hdd 9.09569 1.0 9.1 TiB 1.9 TiB 1.9 TiB 962 MiB 8.2 GiB > 7.2 TiB 21.15 1.09 283 up > 19hdd 9.09569 1.0 9.1 TiB 893 GiB 888 GiB 291 MiB 4.6 GiB > 8.2 TiB 9.59 0.49 148 up >TOTAL 255 TiB 50 TiB 49 TiB 9.7 GiB 187 GiB > 205 TiB 19.49 > MIN/MAX VAR: 0.24/4.66 STDDEV: 19.63 > > > > > *ceph health detail |grep -i wrn* > [WRN] OSDMAP_FLAGS: nodeep-scrub flag(s) set > [WRN] OSD_NEARFULL: 2 nearfull osd(s) > [WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this > doesn't resolve itself): 16 pgs 
backfill_toofull > [WRN] PG_NOT_DEEP_SCRUBBED: 1360 pgs not deep-scrubbed in time > [WRN] PG_NOT_SCRUBBED: 53 pgs not scrubbed in time > [WRN] POOL_NEARFULL: 36 pool(s) nearfull > > > Thanks the team ;) > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
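For reference, the balancer state David refers to can be checked and switched to upmap mode along these lines (a generic sketch; it assumes the mgr balancer module is available and all clients are luminous or newer):

ceph osd tree
ceph balancer status
ceph balancer mode upmap
ceph balancer on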
[ceph-users] Re: MDS recovery with existing pools
So we did walk through the advanced recovery page but didn't really succeed. The CephFS is still going to readonly because of the purge_queue error. Is there any chance to recover from that or should we try to recover with an empty metadata pool next? I'd still appreciate any comments. ;-) Zitat von Eugen Block : Some more information on the damaged CephFS, apparently the journal is damaged: ---snip--- # cephfs-journal-tool --rank=storage:0 --journal=mdlog journal inspect 2023-12-08T15:35:22.922+0200 7f834d0320c0 -1 Missing object 200.000527c4 2023-12-08T15:35:22.938+0200 7f834d0320c0 -1 Bad entry start ptr (0x149f140067f) at 0x149f1174595 2023-12-08T15:35:22.942+0200 7f834d0320c0 -1 Bad entry start ptr (0x149f1400e66) at 0x149f1174d7c 2023-12-08T15:35:22.954+0200 7f834d0320c0 -1 Bad entry start ptr (0x149f1401642) at 0x149f1175558 2023-12-08T15:35:22.970+0200 7f834d0320c0 -1 Bad entry start ptr (0x149f1401e29) at 0x149f1175d3f 2023-12-08T15:35:22.974+0200 7f834d0320c0 -1 Bad entry start ptr (0x149f1402610) at 0x149f1176526 2023-12-08T15:35:22.978+0200 7f834d0320c0 -1 Missing object 200.000527ca 2023-12-08T15:35:22.978+0200 7f834d0320c0 -1 Missing object 200.000527cb 2023-12-08T15:35:22.994+0200 7f834d0320c0 -1 Bad entry start ptr (0x149f30008f4) at 0x149f2d7480a 2023-12-08T15:35:22.998+0200 7f834d0320c0 -1 Bad entry start ptr (0x149f3000ced) at 0x149f2d74c03 Overall journal integrity: DAMAGED Objects missing: 0x527c4 0x527ca 0x527cb Corrupt regions: 0x149f0d73f16-149f1174595 0x149f1174595-149f1174d7c 0x149f1174d7c-149f1175558 0x149f1175558-149f1175d3f 0x149f1175d3f-149f1176526 0x149f1176526-149f2d7480a 0x149f2d7480a-149f2d74c03 0x149f2d74c03- # cephfs-journal-tool --rank=storage:0 --journal=purge_queue journal inspect 2023-12-08T15:35:57.691+0200 7f331621e0c0 -1 Missing object 500.0dc6 Overall journal integrity: DAMAGED Objects missing: 0xdc6 Corrupt regions: 0x3718522e9- ---snip--- A backup isn't possible: ---snip--- # cephfs-journal-tool --rank=storage:0 journal export backup.bin 2023-12-08T15:42:07.643+0200 7fde6a24f0c0 -1 Missing object 200.000527c4 2023-12-08T15:42:07.659+0200 7fde6a24f0c0 -1 Bad entry start ptr (0x149f140067f) at 0x149f1174595 2023-12-08T15:42:07.667+0200 7fde6a24f0c0 -1 Bad entry start ptr (0x149f1400e66) at 0x149f1174d7c 2023-12-08T15:42:07.675+0200 7fde6a24f0c0 -1 Bad entry start ptr (0x149f1401642) at 0x149f1175558 2023-12-08T15:42:07.687+0200 7fde6a24f0c0 -1 Bad entry start ptr (0x149f1401e29) at 0x149f1175d3f 2023-12-08T15:42:07.699+0200 7fde6a24f0c0 -1 Bad entry start ptr (0x149f1402610) at 0x149f1176526 2023-12-08T15:42:07.699+0200 7fde6a24f0c0 -1 Missing object 200.000527ca 2023-12-08T15:42:07.699+0200 7fde6a24f0c0 -1 Missing object 200.000527cb 2023-12-08T15:42:07.707+0200 7fde6a24f0c0 -1 Bad entry start ptr (0x149f30008f4) at 0x149f2d7480a 2023-12-08T15:42:07.707+0200 7fde6a24f0c0 -1 Bad entry start ptr (0x149f3000ced) at 0x149f2d74c03 2023-12-08T15:42:07.707+0200 7fde6a24f0c0 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados` Error ((5) Input/output error) ---snip--- Does it make sense to continue with the advanced disaster recovery [3] bei running (all of) these steps: cephfs-journal-tool event recover_dentries summary cephfs-journal-tool [--rank=N] journal reset cephfs-table-tool all reset session ceph fs reset --yes-i-really-mean-it cephfs-table-tool 0 reset session cephfs-table-tool 0 reset snap cephfs-table-tool 0 reset inode cephfs-journal-tool --rank=0 journal reset cephfs-data-scan init Fortunately, I didn't have 
to run through this procedure too often, so I'd appreciate any comments what the best approach would be here. Thanks! Eugen [3] https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts Zitat von Eugen Block : I was able to (almost) reproduce the issue in a (Pacific) test cluster. I rebuilt the monmap from the OSDs, brought everything back up, started the mds recovery like described in [1]: ceph fs new--force --recover Then I added two mds daemons which went into standby: ---snip--- Started Ceph mds.cephfs.pacific.uexvvq for 1b0afda4-2221-11ee-87be-fa163eed040c. Dez 08 12:51:53 pacific conmon[100493]: debug 2023-12-08T11:51:53.086+ 7ff5f589b900 0 set uid:gid to 167:167 (ceph:ceph) Dez 08 12:51:53 pacific conmon[100493]: debug 2023-12-08T11:51:53.086+ 7ff5f589b900 0 ceph version 16.2.14 (238ba602515df21ea7ffc75c88db29f9e5ef12c9) pacific (stable), process ceph-md> Dez 08 12:51:53 pacific conmon[100493]: debug 2023-12-08T11:51:53.086+ 7ff5f589b900 1 main not setting numa affinity Dez 08 12:51:53 pacific conmon[100493]: debug 2023-12-08T11:51:53.086+ 7ff5f589b900 0 pidfile_write: ignore empty --pid-file Dez 08 12:51:53 pacific conmon[100493]: starting mds.cephfs.pacific.uexvvq
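For readers hitting similar damage, the inspection and backup attempts shown above boil down to the following (a sketch with a placeholder file system name; the destructive recovery steps themselves should only be run as described in the disaster-recovery-experts documentation):

# the MDS keeps two journals per rank: the MD log and the purge queue
cephfs-journal-tool --rank=<fs_name>:0 --journal=mdlog journal inspect
cephfs-journal-tool --rank=<fs_name>:0 --journal=purge_queue journal inspect
# try to export a backup of both before any destructive action
cephfs-journal-tool --rank=<fs_name>:0 --journal=mdlog journal export backup.mdlog.bin
cephfs-journal-tool --rank=<fs_name>:0 --journal=purge_queue journal export backup.pq.bin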
[ceph-users] mds.0.journaler.pq(ro) _finish_read got error -2
Hi, I'm trying to help someone with a broken CephFS. We managed to recover basic ceph functionality but the CephFS is still inaccessible (currently read-only). We went through the disaster recovery steps but to no avail. Here's a snippet from the startup logs: ---snip--- mds.0.41 Booting: 2: waiting for purge queue recovered mds.0.journaler.pq(ro) _finish_probe_end write_pos = 14797504512 (header had 14789452521). recovered. mds.0.purge_queue operator(): open complete mds.0.purge_queue operator(): recovering write_pos monclient: get_auth_request con 0x55c280bc5c00 auth_method 0 monclient: get_auth_request con 0x55c280ee0c00 auth_method 0 mds.0.journaler.pq(ro) _finish_read got error -2 mds.0.purge_queue _recover: Error -2 recovering write_pos mds.0.purge_queue _go_readonly: going readonly because internal IO failed: No such file or directory mds.0.journaler.pq(ro) set_readonly mds.0.41 unhandled write error (2) No such file or directory, force readonly... mds.0.cache force file system read-only force file system read-only ---snip--- I've added the dev mailing list, maybe someone can give some advice how to continue from here (we could try to recover with an empty metadata pool). Or is this FS lost? Thanks! Eugen ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: reef 18.2.1 QE Validation status
Hi Yuri, Any chance we can include [1] ? This patch fixes mpath devices deployments, the PR has missed a merge and was backported onto reef this morning only. Thanks, [1] https://github.com/ceph/ceph/pull/53539/commits/1e7223281fa044c9653633e305c0b344e4c9b3a4 -- Guillaume Abrioux Software Engineer From: Venky Shankar Date: Tuesday, 28 November 2023 at 05:09 To: Yuri Weinstein Cc: dev , ceph-users Subject: [EXTERNAL] [ceph-users] Re: reef 18.2.1 QE Validation status On Tue, Nov 21, 2023 at 10:35 PM Venky Shankar wrote: > > Hi Yuri, > > On Fri, Nov 10, 2023 at 1:22 PM Venky Shankar wrote: > > > > Hi Yuri, > > > > On Fri, Nov 10, 2023 at 4:55 AM Yuri Weinstein wrote: > > > > > > I've updated all approvals and merged PRs in the tracker and it looks > > > like we are ready for gibba, LRC upgrades pending approval/update from > > > Venky. > > > > The smoke test failure is caused by missing (kclient) patches in > > Ubuntu 20.04 that certain parts of the fs suite (via smoke tests) rely > > on. More details here > > > > https://tracker.ceph.com/issues/63488#note-8 > > > > The kclient tests in smoke pass with other distro's and the fs suite > > tests have been reviewed and look good. Run details are here > > > > https://tracker.ceph.com/projects/cephfs/wiki/Reef#07-Nov-2023 > > > > The smoke failure is noted as a known issue for now. Consider this run > > as "fs approved". > > We need an additional change to be tested for inclusion into this reef > release. This one > > https://github.com/ceph/ceph/pull/54407 > > The issue showed up when upgrading the LRC. This was discussed in the CLT meeting last week and the overall agreement was to be able to extend our tests to validate the fix which showed up in quincy run (smoke suite) but not in reef. I've sent a change regarding the same: https://github.com/ceph/ceph/pull/54677 I'll update when it's ready to be included for testing. > > > > > > > > > On Thu, Nov 9, 2023 at 1:31 PM Radoslaw Zarzynski > > > wrote: > > > > > > > > rados approved! > > > > > > > > Details are here: > > > > https://tracker.ceph.com/projects/rados/wiki/REEF#1821-Review . > > > > > > > > On Mon, Nov 6, 2023 at 10:33 PM Yuri Weinstein > > > > wrote: > > > > > > > > > > Details of this release are summarized here: > > > > > > > > > > https://tracker.ceph.com/issues/63443#note-1 > > > > > > > > > > Seeking approvals/reviews for: > > > > > > > > > > smoke - Laura, Radek, Prashant, Venky (POOL_APP_NOT_ENABLE failures) > > > > > rados - Neha, Radek, Travis, Ernesto, Adam King > > > > > rgw - Casey > > > > > fs - Venky > > > > > orch - Adam King > > > > > rbd - Ilya > > > > > krbd - Ilya > > > > > upgrade/quincy-x (reef) - Laura PTL > > > > > powercycle - Brad > > > > > perf-basic - Laura, Prashant (POOL_APP_NOT_ENABLE failures) > > > > > > > > > > Please reply to this email with approval and/or trackers of known > > > > > issues/PRs to address them. 
> > > > > > > > > > TIA > > > > > YuriW > > > > > ___ > > > > > Dev mailing list -- d...@ceph.io > > > > > To unsubscribe send an email to dev-le...@ceph.io > > > > > > > > > > > > ___ > > > ceph-users mailing list -- ceph-users@ceph.io > > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > > > > > > -- > > Cheers, > > Venky > > > > -- > Cheers, > Venky -- Cheers, Venky ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io Unless otherwise stated above: Compagnie IBM France Siège Social : 17, avenue de l'Europe, 92275 Bois-Colombes Cedex RCS Nanterre 552 118 465 Forme Sociale : S.A.S. Capital Social : 664 069 390,60 € SIRET : 552 118 465 03644 - Code NAF 6203Z ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: reef 18.2.1 QE Validation status
Per Guillaume it was tested and does not have any impact on other areas. So I will cherry-pick it for the release Thx On Mon, Dec 11, 2023 at 7:53 AM Guillaume Abrioux wrote: > > Hi Yuri, > > > > Any chance we can include [1] ? This patch fixes mpath devices deployments, > the PR has missed a merge and was backported onto reef this morning only. > > > > Thanks, > > > > [1] > https://github.com/ceph/ceph/pull/53539/commits/1e7223281fa044c9653633e305c0b344e4c9b3a4 > > > > -- > > Guillaume Abrioux > > Software Engineer > > > > From: Venky Shankar > Date: Tuesday, 28 November 2023 at 05:09 > To: Yuri Weinstein > Cc: dev , ceph-users > Subject: [EXTERNAL] [ceph-users] Re: reef 18.2.1 QE Validation status > > On Tue, Nov 21, 2023 at 10:35 PM Venky Shankar wrote: > > > > Hi Yuri, > > > > On Fri, Nov 10, 2023 at 1:22 PM Venky Shankar wrote: > > > > > > Hi Yuri, > > > > > > On Fri, Nov 10, 2023 at 4:55 AM Yuri Weinstein > > > wrote: > > > > > > > > I've updated all approvals and merged PRs in the tracker and it looks > > > > like we are ready for gibba, LRC upgrades pending approval/update from > > > > Venky. > > > > > > The smoke test failure is caused by missing (kclient) patches in > > > Ubuntu 20.04 that certain parts of the fs suite (via smoke tests) rely > > > on. More details here > > > > > > https://tracker.ceph.com/issues/63488#note-8 > > > > > > The kclient tests in smoke pass with other distro's and the fs suite > > > tests have been reviewed and look good. Run details are here > > > > > > https://tracker.ceph.com/projects/cephfs/wiki/Reef#07-Nov-2023 > > > > > > The smoke failure is noted as a known issue for now. Consider this run > > > as "fs approved". > > > > We need an additional change to be tested for inclusion into this reef > > release. This one > > > > https://github.com/ceph/ceph/pull/54407 > > > > The issue showed up when upgrading the LRC. > > This was discussed in the CLT meeting last week and the overall > agreement was to be able to extend our tests to validate the fix which > showed up in quincy run (smoke suite) but not in reef. I've sent a > change regarding the same: > > https://github.com/ceph/ceph/pull/54677 > > I'll update when it's ready to be included for testing. > > > > > > > > > > > > > > On Thu, Nov 9, 2023 at 1:31 PM Radoslaw Zarzynski > > > > wrote: > > > > > > > > > > rados approved! > > > > > > > > > > Details are here: > > > > > https://tracker.ceph.com/projects/rados/wiki/REEF#1821-Review . > > > > > > > > > > On Mon, Nov 6, 2023 at 10:33 PM Yuri Weinstein > > > > > wrote: > > > > > > > > > > > > Details of this release are summarized here: > > > > > > > > > > > > https://tracker.ceph.com/issues/63443#note-1 > > > > > > > > > > > > Seeking approvals/reviews for: > > > > > > > > > > > > smoke - Laura, Radek, Prashant, Venky (POOL_APP_NOT_ENABLE failures) > > > > > > rados - Neha, Radek, Travis, Ernesto, Adam King > > > > > > rgw - Casey > > > > > > fs - Venky > > > > > > orch - Adam King > > > > > > rbd - Ilya > > > > > > krbd - Ilya > > > > > > upgrade/quincy-x (reef) - Laura PTL > > > > > > powercycle - Brad > > > > > > perf-basic - Laura, Prashant (POOL_APP_NOT_ENABLE failures) > > > > > > > > > > > > Please reply to this email with approval and/or trackers of known > > > > > > issues/PRs to address them. 
> > > > > > > > > > > > TIA > > > > > > YuriW > > > > > > ___ > > > > > > Dev mailing list -- d...@ceph.io > > > > > > To unsubscribe send an email to dev-le...@ceph.io > > > > > > > > > > > > > > > ___ > > > > ceph-users mailing list -- ceph-users@ceph.io > > > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > > > > > > > > > > -- > > > Cheers, > > > Venky > > > > > > > > -- > > Cheers, > > Venky > > > > -- > Cheers, > Venky > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > > Unless otherwise stated above: > > Compagnie IBM France > Siège Social : 17, avenue de l'Europe, 92275 Bois-Colombes Cedex > RCS Nanterre 552 118 465 > Forme Sociale : S.A.S. > Capital Social : 664 069 390,60 € > SIRET : 552 118 465 03644 - Code NAF 6203Z ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
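For reference, the cherry-pick mentioned here would look roughly like this (a sketch; branch and remote names are placeholders, only the commit hash is taken from the PR link above):

git fetch origin
git checkout <reef-release-branch>
git cherry-pick -x 1e7223281fa044c9653633e305c0b344e4c9b3a4
git push <remote> <reef-release-branch>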
[ceph-users] Re: Ceph 17.2.7 to 18.2.0 issues
Hi, Thank you very much for the reply. I evicted all my clients and still had no luck. Checking for blocked ops returns 0 from each MDS service. Each MDS service serves a different pool, and all of them suffer the same issue. If I write new files I can both stat and read them, so I have zero issues writing into the pool or pulling those newer files. Everything written before the reboot, however, just hangs when you attempt an actual copy. I've tried both FUSE and kernel mounts, and may try an NFS server to see if that makes a difference. The whole cluster still reports a healthy state and healthy volumes, with no PGs stuck in deep scrub. I'm tempted to reboot the system. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
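For completeness, the checks mentioned (blocked ops, sessions, and what a hanging client is waiting on) can be run roughly as follows (a sketch; the MDS name is a placeholder, and the debugfs paths apply to kernel clients only):

# ops currently blocked or in flight on an MDS
ceph tell mds.<name> dump_blocked_ops
ceph tell mds.<name> dump_ops_in_flight
# client sessions and the caps they hold
ceph tell mds.<name> session ls
# on a hanging kernel client, outstanding MDS/OSD requests
cat /sys/kernel/debug/ceph/*/mdsc
cat /sys/kernel/debug/ceph/*/osdc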
[ceph-users] Re: Ceph 17.2.7 to 18.2.0 issues
Thanks for this, I've replied above but sadly a client eviction and remount didn't help. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Deleting files from lost+found in 18.2.0
Hi, I saw in the changelogs that it should now finally be possible to delete files from lost+found with 18.x. I upgraded recently, but I still can't delete or move files from there. I tried changing permissions, but every time I try I get "read-only filesystem", and only when acting on files in lost+found. I searched the documentation but can't find anything related. Is there a special trick or a flag I have to set? Cheers, Thomas -- http://www.widhalm.or.at GnuPG : 6265BAE6 , A84CB603 Threema: H7AV7D33 Telegram, Signal: widha...@widhalm.or.at ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS recovery with existing pools
Update: apparently, we did it! We walked through the disaster recovery steps where one of the steps was to reset the journal. I was under the impression that the specified command 'cephfs-journal-tool [--rank=N] journal reset' would simply reset all the journals (mdlog and purge_queue), but it seems like it doesn't. After Mykola (once again, thank you so much for your input) pointed towards running the command for the purge_queue specifically, the filesystem got out of the read-only mode and was mountable again. the exact command was: cephfs-journal-tool --rank=cephfs:0 --journal=purge_queue journal reset We didn't have to walk through the recovery with an empty pool, which is nice. I have a suggestion to include the "journal inspect" command to the docs for both mdlog and purge_queue to understand that both journals might need a reset. Thanks again, Mykola! Eugen Zitat von Eugen Block : So we did walk through the advanced recovery page but didn't really succeed. The CephFS is still going to readonly because of the purge_queue error. Is there any chance to recover from that or should we try to recover with an empty metadata pool next? I'd still appreciate any comments. ;-) Zitat von Eugen Block : Some more information on the damaged CephFS, apparently the journal is damaged: ---snip--- # cephfs-journal-tool --rank=storage:0 --journal=mdlog journal inspect 2023-12-08T15:35:22.922+0200 7f834d0320c0 -1 Missing object 200.000527c4 2023-12-08T15:35:22.938+0200 7f834d0320c0 -1 Bad entry start ptr (0x149f140067f) at 0x149f1174595 2023-12-08T15:35:22.942+0200 7f834d0320c0 -1 Bad entry start ptr (0x149f1400e66) at 0x149f1174d7c 2023-12-08T15:35:22.954+0200 7f834d0320c0 -1 Bad entry start ptr (0x149f1401642) at 0x149f1175558 2023-12-08T15:35:22.970+0200 7f834d0320c0 -1 Bad entry start ptr (0x149f1401e29) at 0x149f1175d3f 2023-12-08T15:35:22.974+0200 7f834d0320c0 -1 Bad entry start ptr (0x149f1402610) at 0x149f1176526 2023-12-08T15:35:22.978+0200 7f834d0320c0 -1 Missing object 200.000527ca 2023-12-08T15:35:22.978+0200 7f834d0320c0 -1 Missing object 200.000527cb 2023-12-08T15:35:22.994+0200 7f834d0320c0 -1 Bad entry start ptr (0x149f30008f4) at 0x149f2d7480a 2023-12-08T15:35:22.998+0200 7f834d0320c0 -1 Bad entry start ptr (0x149f3000ced) at 0x149f2d74c03 Overall journal integrity: DAMAGED Objects missing: 0x527c4 0x527ca 0x527cb Corrupt regions: 0x149f0d73f16-149f1174595 0x149f1174595-149f1174d7c 0x149f1174d7c-149f1175558 0x149f1175558-149f1175d3f 0x149f1175d3f-149f1176526 0x149f1176526-149f2d7480a 0x149f2d7480a-149f2d74c03 0x149f2d74c03- # cephfs-journal-tool --rank=storage:0 --journal=purge_queue journal inspect 2023-12-08T15:35:57.691+0200 7f331621e0c0 -1 Missing object 500.0dc6 Overall journal integrity: DAMAGED Objects missing: 0xdc6 Corrupt regions: 0x3718522e9- ---snip--- A backup isn't possible: ---snip--- # cephfs-journal-tool --rank=storage:0 journal export backup.bin 2023-12-08T15:42:07.643+0200 7fde6a24f0c0 -1 Missing object 200.000527c4 2023-12-08T15:42:07.659+0200 7fde6a24f0c0 -1 Bad entry start ptr (0x149f140067f) at 0x149f1174595 2023-12-08T15:42:07.667+0200 7fde6a24f0c0 -1 Bad entry start ptr (0x149f1400e66) at 0x149f1174d7c 2023-12-08T15:42:07.675+0200 7fde6a24f0c0 -1 Bad entry start ptr (0x149f1401642) at 0x149f1175558 2023-12-08T15:42:07.687+0200 7fde6a24f0c0 -1 Bad entry start ptr (0x149f1401e29) at 0x149f1175d3f 2023-12-08T15:42:07.699+0200 7fde6a24f0c0 -1 Bad entry start ptr (0x149f1402610) at 0x149f1176526 2023-12-08T15:42:07.699+0200 7fde6a24f0c0 -1 Missing object 200.000527ca 
2023-12-08T15:42:07.699+0200 7fde6a24f0c0 -1 Missing object 200.000527cb 2023-12-08T15:42:07.707+0200 7fde6a24f0c0 -1 Bad entry start ptr (0x149f30008f4) at 0x149f2d7480a 2023-12-08T15:42:07.707+0200 7fde6a24f0c0 -1 Bad entry start ptr (0x149f3000ced) at 0x149f2d74c03 2023-12-08T15:42:07.707+0200 7fde6a24f0c0 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados` Error ((5) Input/output error) ---snip--- Does it make sense to continue with the advanced disaster recovery [3] bei running (all of) these steps: cephfs-journal-tool event recover_dentries summary cephfs-journal-tool [--rank=N] journal reset cephfs-table-tool all reset session ceph fs reset --yes-i-really-mean-it cephfs-table-tool 0 reset session cephfs-table-tool 0 reset snap cephfs-table-tool 0 reset inode cephfs-journal-tool --rank=0 journal reset cephfs-data-scan init Fortunately, I didn't have to run through this procedure too often, so I'd appreciate any comments what the best approach would be here. Thanks! Eugen [3] https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts Zitat von Eugen Block : I was able to (almost) reproduce the issue in a (Pacific) test cluster. I rebuilt the monmap from the OSDs, brought ev
[ceph-users] Is there any way to merge an rbd image's full backup and a diff?
Hi, I'm developing a backup system for RBD images. In my case, backup data must be kept for at least two weeks. To meet this requirement, I'd like to take backups as follows: 1. Take a full backup with rbd export first. 2. Take a differential backup every day. 3. Merge the full backup and the oldest diff (taken two weeks ago). As a result of my evaluation, I confirmed there is no problem with steps 1 and 2. However, I found that step 3 can't be accomplished with `rbd merge-diff` because `rbd merge-diff` only accepts a diff as its first parameter. Is there any way to merge a full backup and a diff? Thanks, Satoru ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
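Not an authoritative answer, but one commonly used approach works around this: `rbd export` produces a raw image dump rather than a diff, which is why `rbd merge-diff` rejects it. If the initial "full" backup is instead taken with `rbd export-diff` without `--from-snap`, it is itself a diff against an empty image and can be merged with later incrementals. A sketch with made-up pool, image and snapshot names:

# day 0: snapshot and export the "full" backup as a diff from empty
rbd snap create pool/img@snap0
rbd export-diff pool/img@snap0 full-snap0.diff
# following days: incremental diffs between consecutive snapshots
rbd snap create pool/img@snap1
rbd export-diff --from-snap snap0 pool/img@snap1 incr-snap0-snap1.diff
# when the oldest increment expires, fold it into the base
rbd merge-diff full-snap0.diff incr-snap0-snap1.diff full-snap1.diff
# restore: create an empty image and replay the merged diff onto it
rbd create --size <original-size> pool/img-restore
rbd import-diff full-snap1.diff pool/img-restore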
[ceph-users] Re: MDS recovery with existing pools
Good to hear that, Eugen! CC'ed Zac for your docs mention. k > On Dec 11, 2023, at 23:28, Eugen Block wrote: > > Update: apparently, we did it! > We walked through the disaster recovery steps where one of the steps was to > reset the journal. I was under the impression that the specified command > 'cephfs-journal-tool [--rank=N] journal reset' would simply reset all the > journals (mdlog and purge_queue), but it seems like it doesn't. After Mykola > (once again, thank you so much for your input) pointed towards running the > command for the purge_queue specifically, the filesystem got out of the > read-only mode and was mountable again. the exact command was: > > cephfs-journal-tool --rank=cephfs:0 --journal=purge_queue journal reset > > We didn't have to walk through the recovery with an empty pool, which is > nice. I have a suggestion to include the "journal inspect" command to the > docs for both mdlog and purge_queue to understand that both journals might > need a reset. > > Thanks again, Mykola! > Eugen ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: mds.0.journaler.pq(ro) _finish_read got error -2 [solved]
Just for posterity, we made the CephFS available again. We walked through the disaster recovery steps where one of the steps was to reset the journal. I was under the impression that the specified command 'cephfs-journal-tool [--rank=N] journal reset' would simply reset all the journals (mdlog and purge_queue), but it seems like it doesn't. The cephfs-journal-tool help page mentions mdlog as default: --journal= Journal type (purge_queue means this journal is used to queue for purge operation, default is mdlog, and only mdlog support event mode) And after Mykola (once again, thank you so much for your input) pointed towards running the command for the purge_queue specifically, the filesystem then got out of the read-only mode and was mountable again. The exact command was: cephfs-journal-tool --rank=cephfs:0 --journal=purge_queue journal reset Zitat von Eugen Block : Hi, I'm trying to help someone with a broken CephFS. We managed to recover basic ceph functionality but the CephFS is still inaccessible (currently read-only). We went through the disaster recovery steps but to no avail. Here's a snippet from the startup logs: ---snip--- mds.0.41 Booting: 2: waiting for purge queue recovered mds.0.journaler.pq(ro) _finish_probe_end write_pos = 14797504512 (header had 14789452521). recovered. mds.0.purge_queue operator(): open complete mds.0.purge_queue operator(): recovering write_pos monclient: get_auth_request con 0x55c280bc5c00 auth_method 0 monclient: get_auth_request con 0x55c280ee0c00 auth_method 0 mds.0.journaler.pq(ro) _finish_read got error -2 mds.0.purge_queue _recover: Error -2 recovering write_pos mds.0.purge_queue _go_readonly: going readonly because internal IO failed: No such file or directory mds.0.journaler.pq(ro) set_readonly mds.0.41 unhandled write error (2) No such file or directory, force readonly... mds.0.cache force file system read-only force file system read-only ---snip--- I've added the dev mailing list, maybe someone can give some advice how to continue from here (we could try to recover with an empty metadata pool). Or is this FS lost? Thanks! Eugen ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io