[ceph-users] MDS stuck in rejoin
Hi all, we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to advance oldest client/flush tid". I looked at the client and there was nothing going on, so I rebooted it. After the client was back, the message was still there. To clean this up I failed the MDS. Unfortunately, the MDS that took over remained stuck in rejoin without doing anything. All that happened in the log was:

[root@ceph-10 ceph]# tail -f ceph-mds.ceph-10.log
2023-07-20T15:54:29.147+0200 7fedb9c9f700 1 mds.2.896604 rejoin_start
2023-07-20T15:54:29.161+0200 7fedb9c9f700 1 mds.2.896604 rejoin_joint_start
2023-07-20T15:55:28.005+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896614 from mon.4
2023-07-20T15:56:00.278+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896615 from mon.4
[...]
2023-07-20T16:02:54.935+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896653 from mon.4
2023-07-20T16:03:07.276+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896654 from mon.4

After some time I decided to give another fail a try and, this time, the replacement daemon went to active state really fast. If I hit a message like the above again, what is the clean way of getting the client back to a clean state (version: 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable))?

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
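A sketch of commands that can help narrow down which client session is holding back the oldest tid before resorting to an MDS fail; the MDS name, rank, and session id below are placeholders:

```
ceph health detail                        # shows which MDS rank reports MDS_CLIENT_OLDEST_TID
ceph tell mds.<name> session ls           # list client sessions; identify the one not advancing its tid
ceph tell mds.<name> dump_ops_in_flight   # check for requests stuck on that session
# If the session is clearly dead, evicting it is usually less disruptive than failing the whole rank:
ceph tell mds.<name> client evict id=<session-id>
# Failing the rank (as done above) is the heavier hammer and forces a standby through replay/rejoin:
ceph mds fail <rank>
```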
[ceph-users] Re: ceph-mgr ssh connections left open
On Tuesday, July 18, 2023 10:56:12 AM EDT Wyll Ingersoll wrote:
> Every night at midnight, our ceph-mgr daemons open up ssh connections to the
> other nodes and then leave them open. Eventually they become zombies. I
> cannot figure out what module is causing this or how to turn it off. If
> left unchecked over days/weeks, the zombie ssh connections just keep
> growing; the only way to clear them is to restart the ceph-mgr services.
>
> Any idea what is causing this or how it can be disabled?
>
> Example:
>
> ceph 1350387 1350373 7 Jul17 ? 01:19:39 /usr/bin/ceph-mgr -n
> mgr.mon03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
> --default-log-to-stderr=true --default-log-stderr-prefix
>
> ceph 1350548 1350387 0 Jul17 ? 00:00:01 ssh -C -F
> /tmp/cephadm-conf-d0khggdz -i /tmp/cephadm-identity-onf2msju -o
> ServerAliveInterval=7 -o ServerAliveCountMax=3 xxx@10.4.1.11 sudo python
>
> [...snip...]

Is this cluster on pacific? The module in question is likely to be `cephadm`, but the cephadm ssh backend has since been changed and the team believes problems like this no longer occur.

Hope that helps!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph-mgr ssh connections left open
Yes, it is ceph pacific 16.2.11. Is this a known issue that is fixed in a more recent pacific update? We're not ready to move to quincy yet.

thanks,
Wyllys

From: John Mulligan
Sent: Thursday, July 20, 2023 10:30 AM
To: ceph-users@ceph.io
Cc: Wyll Ingersoll
Subject: Re: [ceph-users] ceph-mgr ssh connections left open

On Tuesday, July 18, 2023 10:56:12 AM EDT Wyll Ingersoll wrote:
> Every night at midnight, our ceph-mgr daemons open up ssh connections to the
> other nodes and then leave them open. Eventually they become zombies. I
> cannot figure out what module is causing this or how to turn it off. If
> left unchecked over days/weeks, the zombie ssh connections just keep
> growing; the only way to clear them is to restart the ceph-mgr services.
>
> Any idea what is causing this or how it can be disabled?
>
> Example:
>
> ceph 1350387 1350373 7 Jul17 ? 01:19:39 /usr/bin/ceph-mgr -n
> mgr.mon03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
> --default-log-to-stderr=true --default-log-stderr-prefix
>
> ceph 1350548 1350387 0 Jul17 ? 00:00:01 ssh -C -F
> /tmp/cephadm-conf-d0khggdz -i /tmp/cephadm-identity-onf2msju -o
> ServerAliveInterval=7 -o ServerAliveCountMax=3 xxx@10.4.1.11 sudo python
>
> [...snip...]

Is this cluster on pacific? The module in question is likely to be `cephadm`, but the cephadm ssh backend has since been changed and the team believes problems like this no longer occur.

Hope that helps!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph-mgr ssh connections left open
On Thursday, July 20, 2023 10:36:02 AM EDT Wyll Ingersoll wrote:
> Yes, it is ceph pacific 16.2.11.
>
> Is this a known issue that is fixed in a more recent pacific update? We're
> not ready to move to quincy yet.
>
> thanks,
> Wyllys
>

To the best of my knowledge there's no fix in pacific, I'm sorry to say. It was resolved by using a completely different library to make the ssh connections.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
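Until an upgrade is possible, a rough interim workaround is to periodically fail over or restart the active mgr so the leftover ssh children get reaped; a small sketch, with `mon03` and the fsid as placeholders taken from the ps output above:

```
# count leftover cephadm ssh sessions spawned by ceph-mgr
ps -ef | grep '[s]sh -C -F /tmp/cephadm-conf' | wc -l
# fail over to a standby mgr; the stale children disappear when the old daemon restarts
ceph mgr fail mon03
# or restart the mgr unit directly on its host (unit name depends on your deployment)
systemctl restart ceph-<fsid>@mgr.mon03.service
```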
[ceph-users] Re: Workload that delete 100 M object daily via lifecycle
Enabling the LC debug interval will make lifecycle processing run more often, but please mind that it might then not respect the expiration times you set. By design it treats the interval you configure as one "day". So if it runs more often, you may end up removing objects sooner than, say, 365 days if the rule is set that way.

Please test using rgw_lifecycle_work_time 00:00-23:59 so LC is allowed to run all day, together with rgw_lc_debug_interval: LC then treats that many seconds as a full day, so 86400 is one real day, while a smaller value such as 14400 would make it cycle roughly every four hours.

Paul

On Wed, Jul 19, 2023 at 5:04 AM Anthony D'Atri wrote:
> Indeed that's very useful. I improved the documentation for that not long
> ago, took a while to sort out exactly what it was about.
>
> Normally LC only runs once a day as I understand it, there's a debug
> option that compresses time so that it'll run more frequently, as having to
> wait for a day to see the effect of changes harks back to the uucp days ;)
>
> > On Jul 18, 2023, at 21:37, Hoan Nguyen Van wrote:
> >
> > You can enable debug lc to test and tune the rgw lc parameters.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
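A minimal sketch of how one might set and verify these options on a test cluster before relying on them; the instance name is a placeholder, and 14400 is only an example value for a compressed "4-hour day":

```
# in ceph.conf on the RGW host, then restart that RGW:
[client.rgw.<instance>]
    rgw_lifecycle_work_time = 00:00-23:59
    rgw_lc_debug_interval = 14400

# watch lifecycle status per bucket:
radosgw-admin lc list
# or kick off a processing run by hand:
radosgw-admin lc process
```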
[ceph-users] what is the point of listing "auth: unable to find a keyring on /etc/ceph/ceph.client nfs-ganesha
I need some help understanding this. I have configured nfs-ganesha for cephfs using something like this in ganesha.conf:

FSAL {
    Name = CEPH;
    User_Id = "testing.nfs";
    Secret_Access_Key = "AAA==";
}

But I constantly have these messages in the ganesha logs, 6x per user_id:

auth: unable to find a keyring on /etc/ceph/ceph.client.testing

I thought this was a ganesha authentication order issue, but they[1] say it has to do with ceph. I am still on Nautilus, so maybe this has been fixed in newer releases. I still have a hard time understanding why this is an issue of the ceph libraries.

[1] https://github.com/nfs-ganesha/nfs-ganesha/issues/974
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
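The message usually just means libcephfs walked its default keyring search path and found nothing there before falling back to the key supplied by ganesha. A hedged sketch of one way to satisfy that search, assuming the CephX user is client.testing.nfs as in the FSAL block above:

```
# export the user's key into the default keyring location so the search succeeds
ceph auth get client.testing.nfs -o /etc/ceph/ceph.client.testing.nfs.keyring
chmod 600 /etc/ceph/ceph.client.testing.nfs.keyring

# optionally point the client section at it explicitly in /etc/ceph/ceph.conf:
# [client.testing.nfs]
#     keyring = /etc/ceph/ceph.client.testing.nfs.keyring
```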
[ceph-users] Re: index object in shard begins with hex 80
Ok, I think I figured this out.

First, as I think I wrote earlier, these objects in the ugly namespace begin with "<80>0_", and as such are a "bucket log index" file according to the bucket_index_prefixes[] in cls_rgw.cc. These objects were multiplying, and caused the 'Large omap object' warnings. Our users were creating *a lot* of small objects.

We have a multi-site environment, with replication between the two sites for all buckets. Recently, we had some inadvertent downtime on the slave zone side. Checking the bucket in question, the large omap warning ONLY showed up on the slave side. Turns out the bucket in question has expiration set for all objects after a few days. Since the date of the downtime, NO objects have been deleted on the slave side!

Deleting the 'extra' objects on the slave side by hand, and then running 'bucket sync init' on the bucket on both sides, seems to have resolved the situation. But this may be a bug in data sync when the slave side is not available for a time.

-Chris

On Tuesday, July 18, 2023 at 12:14:18 PM MDT, Dan van der Ster wrote:

Hi Chris,

Those objects are in the so called "ugly namespace" of the rgw, used to prefix special bucket index entries.

// No UTF-8 character can begin with 0x80, so this is a safe indicator
// of a special bucket-index entry for the first byte. Note: although
// it has no impact, the 2nd, 3rd, or 4th byte of a UTF-8 character
// may be 0x80.
#define BI_PREFIX_CHAR 0x80

You can use --omap-key-file and some sed magic to interact with those keys, e.g. like this example from my archives [1]. (In my example I needed to remove orphaned olh entries -- in your case you can generate uglykeys.txt in whichever way is meaningful for your situation.)

BTW, to be clear, I'm not suggesting you blindly delete those keys. You would need to confirm that they are not needed by a current bucket instance before deleting, lest some index get corrupted.

Cheers, Dan

__
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com

[1]
# radosgw-admin bi list --bucket=xxx --shard-id=0 > xxx.bilist.0
# cat xxx.bilist.0 | jq -r '.[]|select(.type=="olh" and .entry.key.name=="") | .idx' > uglykeys.txt
# head -n2 uglykeys.txt
�1001_00/2a/002a985cc73a01ce738da460b990e9b2fa849eb4411efb0a4598876c2859d444/2018_12_11/2893439/3390300/metadata.gz
�1001_02/5f/025f8e0fc8234530d6ae7302adf682509f0f7fb68666391122e16d00bd7107e3/2018_11_14/2625203/3034777/metadata.gz
# cat do_remove.sh
# usage: "bash do_remove.sh | sh -x"
while read f; do echo -n $f | sed 's/^.1001_/echo -n -e \x801001_/'; echo ' > mykey && rados rmomapkey -p default.rgw.buckets.index .dir.zone.bucketid.xx.indexshardnumber --omap-key-file mykey'; done < uglykeys.txt

On Tue, Jul 18, 2023 at 9:27 AM Christopher Durham wrote:

Hi,

I am using ceph 17.2.6 on rocky linux 8. I got a large omap object warning today. Ok, so I tracked it down to a shard for a bucket in the index pool of an s3 pool. However, when listing the omapkeys with:

# rados -p pool.index listomapkeys .dir.zone.bucketid.xx.indexshardnumber

it is clear that the problem is caused by many omapkeys with the following name format:

<80>0_4771163.3444695458.6

A hex dump of the output of the listomapkeys command above indicates that the first 'character' is indeed hex 80, but as there is no equivalent ascii for hex 80, I am not sure how to 'get at' those keys to see the values, delete them, etc. The index keys not of the format above appear to be fine, indicating s3 object names as expected.

The rest of the index shards for the bucket are reasonable and have fewer than osd_deep_scrub_large_omap_object_key_threshold index objects, and the overall total of objects in the bucket is way less than osd_deep_scrub_large_omap_object_key_threshold*num_shards. These weird objects seem to be created occasionally? Yes, the bucket is used heavily.

Any advice here?

-Chris
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
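For anyone hitting the same warning, a small sketch (pool and object names are placeholders) to count how many of these special 0x80-prefixed entries a given index shard holds, e.g. to compare against osd_deep_scrub_large_omap_object_key_threshold before and after cleanup:

```
rados -p default.rgw.buckets.index listomapkeys .dir.<zone>.<bucketid>.<shard> \
  | LC_ALL=C grep -ac $'^\x80'
```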
[ceph-users] Quincy 17.2.6 - Rados gateway crash -
Hi, we have a service that is still crashing when the S3 client (veeam backup) starts to write data.

Main log from the rgw service:

req 13170422438428971730 0.00886s s3:get_obj WARNING: couldn't find acl header for object, generating default
2023-07-20T14:36:45.331+ 7fa5adb4c700 -1 *** Caught signal (Aborted) **

And:

2023-07-19T22:04:15.968+ 7ff07305b700 1 beast: 0x7fefc7178710: 172.16.199.11 - veeam90 [19/Jul/2023:22:04:15.948 +] "PUT /veeam90/Veeam/Backup/veeam90/Clients/%7Bd14cd688-57b4-4809-a1d9-14cafd191b11%7D/34387bbd-bec9-4a40-a04d-6a890d5d6407/CloudStg/Data/%7Bf687ee0f-fb50-4ded-b3a8-3f67ca7f244b%7D/%7B6f31c277-734c-46fd-98d5-c560aa6dc776%7D/144113_f3fd31c9ee2a45aeeadda0de3cbc9064_ HTTP/1.1" 200 63422 - "APN/1.0 Veeam/1.0 Backup/12.0" - latency=0.02216s
2023-07-19T22:04:15.972+ 7ff08307b700 1 == starting new request req=0x7fefc7682710 =
2023-07-19T22:04:15.972+ 7ff087083700 1 == starting new request req=0x7fefc737c710 =
2023-07-19T22:04:15.972+ 7ff071057700 1 == starting new request req=0x7fefc72fb710 =
2023-07-19T22:04:15.972+ 7ff0998a8700 1 == starting new request req=0x7fefc71f9710 =
2023-07-19T22:04:15.972+ 7fefe473e700 -1 *** Caught signal (Aborted) **
in thread 7fefe473e700 thread_name:radosgw

ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
1: /lib64/libpthread.so.0(+0x12cf0) [0x7ff102d62cf0]
2: gsignal()
3: abort()
4: /lib64/libstdc++.so.6(+0x9009b) [0x7ff101d5209b]
5: /lib64/libstdc++.so.6(+0x9653c) [0x7ff101d5853c]
6: /lib64/libstdc++.so.6(+0x95559) [0x7ff101d57559]
7: __gxx_personality_v0()
8: /lib64/libgcc_s.so.1(+0x10b03) [0x7ff101736b03]
9: _Unwind_Resume()
10: /lib64/libradosgw.so.2(+0x538c5b) [0x7ff105246c5b]
--
--
-10> 2023-07-19T22:04:15.972+ 7ff071057700 2 req 8167590275148061076 0.0s s3:put_obj pre-executing
-9> 2023-07-19T22:04:15.972+ 7ff071057700 2 req 8167590275148061076 0.0s s3:put_obj check rate limiting
-8> 2023-07-19T22:04:15.972+ 7ff071057700 2 req 8167590275148061076 0.0s s3:put_obj executing
-7> 2023-07-19T22:04:15.972+ 7ff0998a8700 1 == starting new request req=0x7fefc71f9710 =
-6> 2023-07-19T22:04:15.972+ 7ff0998a8700 2 req 15658207768827051601 0.0s initializing for trans_id = tx0d94d21014832be51-0064b85ddf-3dfe-backup
-5> 2023-07-19T22:04:15.972+ 7ff0998a8700 2 req 15658207768827051601 0.0s getting op 1
-4> 2023-07-19T22:04:15.972+ 7ff0998a8700 2 req 15658207768827051601 0.0s s3:put_obj verifying requester
-3> 2023-07-19T22:04:15.972+ 7ff0998a8700 2 req 15658207768827051601 0.0s s3:put_obj normalizing buckets and tenants
-2> 2023-07-19T22:04:15.972+ 7ff0998a8700 2 req 15658207768827051601 0.0s s3:put_obj init permissions
-1> 2023-07-19T22:04:15.972+ 7ff011798700 2 req 15261257039771290446 0.024000257s s3:put_obj completing
0> 2023-07-19T22:04:15.972+ 7fefe473e700 -1 *** Caught signal (Aborted) **
in thread 7fefe473e700 thread_name:radosgw

Anyone have this issue?

Thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
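The backtrace above is truncated, which makes it hard to tell which code path is aborting; a short sketch of how one could collect more context, where the crash id and RGW instance name are placeholders:

```
ceph crash ls                                        # crashes recorded by the mgr crash module
ceph crash info <crash-id>                           # full backtrace and version metadata for one crash
ceph config set client.rgw.<instance> debug_rgw 20   # temporarily raise rgw logging before reproducing
```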
[ceph-users] Re: librbd hangs during large backfill
We did have a peering storm; we're past that portion of the backfill and still experiencing new instances of rbd volumes hanging. It is for sure not just the peering storm. We've got 22.184% objects misplaced yet, with a bunch of pgs left to backfill (like 75k). Our rbd pool is using about 1.7 PiB of storage, so we're looking at like 370 TiB yet to backfill, rough estimate. This specific pool is replicated, with size=3.

RAW STORAGE:
    CLASS     SIZE       AVAIL      USED       RAW USED     %RAW USED
    hdd       21 PiB     11 PiB     10 PiB     10 PiB       48.73
    TOTAL     21 PiB     11 PiB     10 PiB     10 PiB       48.73

POOLS:
    POOL     ID     PGS       STORED      OBJECTS     USED        %USED     MAX AVAIL
    pool     14     32768     574 TiB     147.16M     1.7 PiB     68.87     260 TiB

We did see a lot of rbd volumes that hung, often giving the buffer i/o errors previously sent - whether that was the peering storm or backfills is uncertain. As suggested, we've already been detaching/reattaching the rbd volumes, pushing the primary active osd for pgs to another, and sometimes rebooting the kernel on the vm to clear the io queue. A combination of those brings the rbd volume block device back for a while.

We're no longer in a peering storm and we're seeing the rbd volumes going into an unresponsive state again - including osds where they were unresponsive, we did things and got them responsive, and then they went unresponsive again. All pgs are in an active state, some active+remapped+backfilling, some active+undersized+remapped+backfilling, etc.

We also run the object gateway off the same cluster with the same backfill; the object gateway is not experiencing issues. Also the osds participating in the backfill are not saturated with i/o, or seeing abnormal load for our usual backfill operations.

But with the continuing backfill, we're seeing rbd volumes on active pgs going back into a blocked state. We can do about the same with detaching the volume / bouncing the pg to a new primary acting osd, but we'd rather have these stop going unresponsive in the first place. Any suggestions towards that direction are greatly appreciated.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
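A few things that may be worth checking while the backfill continues; this is a hedged sketch, and the pool/image names and values are placeholders rather than recommendations:

```
ceph health detail                               # lists OSDs currently reporting slow/blocked ops
ceph config set osd osd_max_backfills 1          # keep concurrent backfills per OSD low
ceph config set osd osd_recovery_sleep_hdd 0.2   # inject per-op sleep to leave more headroom for client i/o
rbd status <pool>/<image>                        # see which clients still hold a watch on a hung image
```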
[ceph-users] Re: mds terminated
I think the rook-ceph mds is not responding to the liveness probe (confirmed by k8s describe on the mds pod). I don't think it's the memory, as I don't limit it, and I have the cpu set to 500m per mds. What direction should I go from here?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time
Hello Eugen,

Requested details are as below.

PG ID: 15.28f0
Pool ID: 15
Pool: default.rgw.buckets.data
Pool EC Ratio: 8:3
Number of Hosts: 12

## crush dump for rule ##
#ceph osd crush rule dump data_ec_rule
{
    "rule_id": 1,
    "rule_name": "data_ec_rule",
    "ruleset": 1,
    "type": 3,
    "min_size": 3,
    "max_size": 11,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -50,
            "item_name": "root_data~hdd"
        },
        {
            "op": "chooseleaf_indep",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

## From Crushmap dump ##
rule data_ec_rule {
    id 1
    type erasure
    min_size 3
    max_size 11
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take root_data class hdd
    step chooseleaf indep 0 type host
    step emit
}

## EC Profile ##
ceph osd erasure-code-profile get data
crush-device-class=hdd
crush-failure-domain=host
crush-root=root_data
jerasure-per-chunk-alignment=false
k=8
m=3
plugin=jerasure
technique=reed_sol_van
w=8

OSD Tree: https://pastebin.com/raw/q6u7aSeu
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
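A small diagnostic sketch that may help pinpoint why the PG stays undersized (the pg id is taken from above; jq is optional):

```
ceph pg map 15.28f0                              # current up/acting sets; shows which shard has no OSD assigned
ceph pg 15.28f0 query | jq '.recovery_state'     # peering/backfill state and what it is waiting for
ceph osd df tree | less                          # confirm no candidate host/OSD is too full to accept the missing shard
```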
[ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time
What would be the appropriate way to restart the primary OSD in this case (343)?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: mds terminated
This issue has been closed. If any rook-ceph users see this: when mds replay takes a long time, look at the logs in the mds pod. If it's going well and then abruptly terminates, try describing the mds pod, and if the liveness probe terminated it, try increasing the threshold of the liveness probe.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: mds terminated
If any rook-ceph users see the situation that mds is stuck in replay, then look at the logs of the mds pod.

When it runs and then terminates repeatedly, check if there is a "liveness probe terminated" error message by typing "kubectl describe pod -n (namespace) (mds pod name)".

If there is such an error message, it's helpful to increase the threshold of the liveness probe.

In my case, it resolved the issue.

Thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
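For reference, a quick sketch of how one might inspect the current probe settings on the mds pod before changing them; the namespace and pod name are placeholders:

```
kubectl describe pod -n <namespace> <mds-pod>            # probe failures show up in the Events section
kubectl get pod -n <namespace> <mds-pod> \
  -o jsonpath='{.spec.containers[0].livenessProbe}'      # current timeoutSeconds/failureThreshold etc.
```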
[ceph-users] Re: rgw multisite sync not syncing data, error: RGW-SYNC:data:init_data_sync_status: ERROR: failed to read remote data log shards
Hey Christian, What does sync look like on the first site? And does restarting the RGW instances on the first site fix up your issues? We saw issues in the past that sound a lot like yours. We've adopted the practice of restarting the RGW instances in the first cluster after deploying a second cluster, and that's got sync working in both directions. Dave ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
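For comparison, a brief sketch of how sync state is usually checked on each side, and of the restart step Dave describes; the zone and service names are placeholders and depend on how the RGWs are deployed:

```
radosgw-admin sync status                        # run in each zone; shows metadata and data sync state
radosgw-admin sync status --rgw-zone=<zone>      # or query a specific zone explicitly
# restart the RGWs in the first cluster, e.g. with cephadm:
ceph orch restart rgw.<service-name>
```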
[ceph-users] RGWs offline after upgrade to Nautilus
Hello,

We have an RGW cluster that was recently upgraded from 12.2.11 to 14.2.22. The upgrade went mostly fine, though now several of our RGWs will not start. One RGW is working fine, the rest will not initialize. They are on a crash loop. This is part of a multisite configuration, and is currently not the master zone. Current master zone is running 14.2.22. These are the only two zones in the zonegroup. After turning debug up to 20, these are the log snippets between each crash:

```
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got periods.1b6e1a93-98ba-4378-bc5c-d36cd5542f11.52
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got periods.1b6e1a93-98ba-4378-bc5c-d36cd5542f11.54
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got realms_names.
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got
2023-07-20 14:29:56.371 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.371 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.371 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=114
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.374 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=686
2023-07-20 14:29:56.374 7fd8dec40900 20 period zonegroup init ret 0
2023-07-20 14:29:56.374 7fd8dec40900 20 period zonegroup name
2023-07-20 14:29:56.374 7fd8dec40900 20 using current period zonegroup
2023-07-20 14:29:56.374 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.374 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.374 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.375 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=903
2023-07-20 14:29:56.375 7fd8dec40900 10 Cannot find current period zone using local zone
2023-07-20 14:29:56.375 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.375 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=903
2023-07-20 14:29:56.375 7fd8dec40900 20 zone
2023-07-20 14:29:56.375 7fd8dec40900 20 generating connection object for zone id f10b465f-bf18-47d0-a51c-ca4f17118ee1
2023-07-20 14:34:56.198 7fd8cafe8700 -1 Initialization timeout, failed to initialize
```

I’ve checked all file permissions, filesystem free space, disabled selinux and firewalld, tried turning up the initialization timeout to 600, and tried removing all non-essential config from ceph.conf. All produce the same results. I would greatly appreciate any other ideas or insight.

Thanks,
Ben
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
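The "Cannot find current period zone using local zone" line suggests comparing the zone the realm's current period expects with what the local zone objects say; a hedged sketch of the usual checks, with zone/zonegroup names and credentials as placeholders:

```
radosgw-admin period get-current          # realm's current period id
radosgw-admin period get                  # full period, including the zonegroups/zones it knows about
radosgw-admin zonegroup get --rgw-zonegroup=<zonegroup>
radosgw-admin zone get --rgw-zone=<zone>  # compare the local zone id with the ids listed in the period
radosgw-admin period pull --url=http://<master-rgw>:<port> \
    --access-key=<key> --secret=<secret>  # re-pull the period from the master zone if they disagree
```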
[ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time
Assuming you're running systemctl OSDs, you can run the following command on the host that OSD 343 resides on:

systemctl restart ceph-osd@343

From: siddhit.ren...@nxtgen.com At: 07/20/23 13:44:36 UTC-4:00 To: ceph-users@ceph.io
Subject: [ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time

What should be appropriate way to restart primary OSD in this case (343) ?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Adding datacenter level to CRUSH tree causes rebalancing
Thank you both Michel and Christian.

Looks like I will have to do the rebalancing eventually. From past experience with Ceph 16 the rebalance will likely take at least a month with my 500 M objects.

It seems like a good idea to upgrade to Ceph 17 first as Michel suggests. Unless:

I was hoping that Ceph might have a way to reduce the rebalancing, given that all constraints about failure domains are already fulfilled.

In particular, I was wondering whether I could play with the names of the "datacenter"s, to bring them into the same (alphabetical?) order as the hosts were so far. I suspect that this is what avoided the reshuffling on my mini test cluster. I think it would be in alignment with Table 1 from the CRUSH paper: https://ceph.com/assets/pdfs/weil-crush-sc06.pdf

E.g. perhaps

take(root)
select(1, row)
select(3, cabinet)
emit

yields the same result as

take(root)
select(3, row)
select(1, cabinet)
emit

?

Niklas
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
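One way to answer this empirically without touching live data is to test a modified CRUSH map offline against the current osdmap and see how the PGs would map; a sketch, with file names and the pool id as placeholders:

```
ceph osd getmap -o osdmap.bin                    # current osdmap (includes the current CRUSH map)
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt              # decompile, edit to add the datacenter buckets, then recompile:
crushtool -c crush.txt -o crush-new.bin
osdmaptool osdmap.bin --import-crush crush-new.bin \
    --test-map-pgs --pool <pool-id>              # reports how PGs map with the new CRUSH map
```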
[ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time
Sometimes one can even get away with "ceph osd down 343" which doesn't affect the process. I have had occasions when this goosed peering in a less-intrusive way. I believe it just marks the OSD down in the mons' map, and when that makes it to the OSD, the OSD responds with "I'm not dead yet" and gets marked up again. > On Jul 20, 2023, at 13:50, Matthew Leonard (BLOOMBERG/ 120 PARK) > wrote: > > Assuming you're running systemctl OSDs you can run the following command on > the host that OSD 343 resides on. > > systemctl restart ceph-osd@343 > > From: siddhit.ren...@nxtgen.com At: 07/20/23 13:44:36 UTC-4:00To: > ceph-users@ceph.io > Subject: [ceph-users] Re: 1 PG stucked in "active+undersized+degraded for > long time > > What should be appropriate way to restart primary OSD in this case (343) ? > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Adding datacenter level to CRUSH tree causes rebalancing
Hi Niklas,

As I said, ceph placement is based on more than fulfilling the failure domain constraint. This is a core feature of ceph's design. There is no reason for a rebalancing on a cluster with a few hundred OSDs to last a month. Before 17 you have to adjust the max backfills parameter, whose default is 1, a very conservative value. Using 2 should already reduce the rebalancing to a few days. But my experience shows that, if it is an option, upgrading to quincy first may be a better choice due to the autotuning of the number of backfills based on the real load of the cluster.

If your cluster is using cephadm, upgrading to quincy is very straightforward and should be complete in a couple of hours for the cluster size I mentioned.

Cheers,

Michel
Sent from my mobile

On 20 July 2023 20:15:54 Niklas Hambüchen wrote:

Thank you both Michel and Christian.

Looks like I will have to do the rebalancing eventually. From past experience with Ceph 16 the rebalance will likely take at least a month with my 500 M objects.

It seems like a good idea to upgrade to Ceph 17 first as Michel suggests. Unless:

I was hoping that Ceph might have a way to reduce the rebalancing, given that all constraints about failure domains are already fulfilled.

In particular, I was wondering whether I could play with the names of the "datacenter"s, to bring them into the same (alphabetical?) order as the hosts were so far. I suspect that this is what avoided the reshuffling on my mini test cluster. I think it would be in alignment with Table 1 from the CRUSH paper: https://ceph.com/assets/pdfs/weil-crush-sc06.pdf

E.g. perhaps

take(root)
select(1, row)
select(3, cabinet)
emit

yields the same result as

take(root)
select(3, row)
select(1, cabinet)
emit

?

Niklas
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
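A compact sketch of the two options Michel mentions; the version number and values are examples, not recommendations:

```
# on octopus/pacific: allow two concurrent backfills per OSD instead of one
ceph config set osd osd_max_backfills 2

# or, with a cephadm-managed cluster, upgrade to quincy first and let mclock tune recovery:
ceph orch upgrade start --ceph-version 17.2.6
ceph orch upgrade status        # follow progress
```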
[ceph-users] Re: Adding datacenter level to CRUSH tree causes rebalancing
I can believe the month timeframe for a cluster with multiple large spinners behind each HBA. I’ve witnessed such personally. > On Jul 20, 2023, at 4:16 PM, Michel Jouvin > wrote: > > Hi Niklas, > > As I said, ceph placement is based on more than fulfilling the failure domain > constraint. This is a core feature in ceph design. There is no reason for a > rebalancing on a cluster with a few hundreds OSDs to last a month. Just > before 17 you have to adjust the max backfills parameter whose default is 1, > a very conservative value. Using 2 should already reduce to rebalancing to a > few days. But my experience shows that if it an option, upgrading to quincy > first may be a better option due to to the autotuning of the number of > backfills based on the real load of the cluster. > > If your cluster is using cephadm, upgrading to quincy is very straightforward > and should be complete I. A couple of hours for the cluster size I mentioned. > > Cheers, > > Michel > Sent from my mobile > Le 20 juillet 2023 20:15:54 Niklas Hambüchen a écrit : > >> Thank you both Michel and Christian. >> >> Looks like I will have to do the rebalancing eventually. >> From past experience with Ceph 16 the rebalance will likely take at least a >> month with my 500 M objects. >> >> It seems like a good idea to upgrade to Ceph 17 first as Michel suggests. >> >> Unless: >> >> I was hoping that Ceph might have a way to reduce the rebalancing, given >> that all constraints about failure domains are already fulfilled. >> >> In particular, I was wondering whether I could play with the names of the >> "datacenter"s, to bring them in the same (alphabetical?) order as the hosts >> were so far. >> I suspect that this is what avoided the reshuffling on my my mini test >> cluster. >> I think it would be in alignment with Table 1 from the CRUSH paper: >> https://ceph.com/assets/pdfs/weil-crush-sc06.pdf >> >> E.g. perhaps >> >> take(root) >> select(1, row) >> select(3, cabinet) >> emit >> >> yields the same result as >> >> take(root) >> select(3, row) >> select(1, cabinet) >> emit >> >> ? >> >> >> Niklas >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: mds terminated
On Thu, Jul 20, 2023 at 11:19 PM wrote:
>
> If any rook-ceph users see the situation that mds is stuck in replay, then
> look at the logs of the mds pod.
>
> When it runs and then terminates repeatedly, check if there is a "liveness
> probe terminated" error message by typing "kubectl describe pod -n
> (namespace) (mds pod name)"
>
> If there is such an error message, it's helpful to increase the threshold of
> the liveness probe.
>
> In my case, it resolved the issue.

Would you mind sharing what version of ceph (mds) was used? In a particular (pacific) release, the mds would abort when it received a metric update message (from a client) that it did not understand.

> Thanks
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

--
Cheers,
Venky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS stuck in rejoin
On 7/20/23 22:09, Frank Schilder wrote:
> Hi all, we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to advance oldest client/flush tid". I looked at the client and there was nothing going on, so I rebooted it. After the client was back, the message was still there. To clean this up I failed the MDS. Unfortunately, the MDS that took over remained stuck in rejoin without doing anything. All that happened in the log was:

BTW, are you using the kclient or the user space client? How long was the MDS stuck in the rejoin state?

This means that on the client side the oldest client request has been stuck too long; maybe under heavy load too many requests were generated in a short time and the oldest request was stuck too long in the MDS.

> [root@ceph-10 ceph]# tail -f ceph-mds.ceph-10.log
> 2023-07-20T15:54:29.147+0200 7fedb9c9f700 1 mds.2.896604 rejoin_start
> 2023-07-20T15:54:29.161+0200 7fedb9c9f700 1 mds.2.896604 rejoin_joint_start
> 2023-07-20T15:55:28.005+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896614 from mon.4
> 2023-07-20T15:56:00.278+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896615 from mon.4
> [...]
> 2023-07-20T16:02:54.935+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896653 from mon.4
> 2023-07-20T16:03:07.276+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896654 from mon.4

Did you see any slow request logs in the mds log files? And any other suspect logs from dmesg, if it's the kclient?

> After some time I decided to give another fail a try and, this time, the replacement daemon went to active state really fast. If I have a message like the above, what is the clean way of getting the client clean again (version: 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable))?

I think your steps are correct.

Thanks
- Xiubo

> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RGWs offline after upgrade to Nautilus
Hi,

a couple of threads with similar error messages all lead back to some sort of pool or OSD issue. What is your current cluster status (ceph -s)? Do you have some full OSDs? Those can cause this initialization timeout, as can hitting the max_pg_per_osd limit. So a few more cluster details could help here.

Thanks,
Eugen

Quoting "Ben.Zieglmeier":

Hello,

We have an RGW cluster that was recently upgraded from 12.2.11 to 14.2.22. The upgrade went mostly fine, though now several of our RGWs will not start. One RGW is working fine, the rest will not initialize. They are on a crash loop. This is part of a multisite configuration, and is currently not the master zone. Current master zone is running 14.2.22. These are the only two zones in the zonegroup. After turning debug up to 20, these are the log snippets between each crash:

```
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got periods.1b6e1a93-98ba-4378-bc5c-d36cd5542f11.52
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got periods.1b6e1a93-98ba-4378-bc5c-d36cd5542f11.54
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got realms_names.
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got
2023-07-20 14:29:56.371 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.371 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.371 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=114
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.374 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=686
2023-07-20 14:29:56.374 7fd8dec40900 20 period zonegroup init ret 0
2023-07-20 14:29:56.374 7fd8dec40900 20 period zonegroup name
2023-07-20 14:29:56.374 7fd8dec40900 20 using current period zonegroup
2023-07-20 14:29:56.374 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.374 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.374 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.375 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=903
2023-07-20 14:29:56.375 7fd8dec40900 10 Cannot find current period zone using local zone
2023-07-20 14:29:56.375 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.375 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=903
2023-07-20 14:29:56.375 7fd8dec40900 20 zone
2023-07-20 14:29:56.375 7fd8dec40900 20 generating connection object for zone id f10b465f-bf18-47d0-a51c-ca4f17118ee1
2023-07-20 14:34:56.198 7fd8cafe8700 -1 Initialization timeout, failed to initialize
```

I’ve checked all file permissions, filesystem free space, disabled selinux and firewalld, tried turning up the initialization timeout to 600, and tried removing all non-essential config from ceph.conf. All produce the same results. I would greatly appreciate any other ideas or insight.

Thanks,
Ben
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: OSD tries (and fails) to scrub the same PGs over and over
Hi,

what's the cluster status? Is there recovery or backfilling going on?

Quoting Vladimir Brik:

I have a PG that hasn't been scrubbed in over a month and not deep-scrubbed in over two months. I tried forcing with `ceph pg (deep-)scrub` but with no success.

Looking at the logs of that PG's primary OSD it looks like every once in a while it attempts (and apparently fails) to scrub that PG, along with two others, over and over. For example:

2023-07-19T16:26:07.082 ... 24.3ea scrub starts
2023-07-19T16:26:10.284 ... 27.aae scrub starts
2023-07-19T16:26:11.169 ... 24.aa scrub starts
2023-07-19T16:26:12.153 ... 24.3ea scrub starts
2023-07-19T16:26:13.346 ... 27.aae scrub starts
2023-07-19T16:26:16.239 ... 24.aa scrub starts
...

Lines like that are repeated throughout the log file.

Has anyone seen something similar? How can I debug this?

I am running 17.2.5

Vlad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
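A few commands that may help see why the scrub keeps restarting; this is a sketch only, with the PG id taken from the log above and jq optional:

```
ceph -s                                    # any recovery/backfill that could be starving scrub reservations
ceph config get osd osd_max_scrubs         # concurrent scrubs allowed per OSD
ceph pg 24.3ea query | jq '.info.stats.last_scrub_stamp, .info.stats.last_deep_scrub_stamp'
```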