Hi all,
we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients
failing to advance oldest client/flush tid". I looked at the client and there
was nothing going on, so I rebooted it. After the client was back, the message
was still there. To clean this up I failed the MDS. Unfor
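A generic, less disruptive way to see which session the warning points at, instead of rebooting (the MDS name below is a placeholder):

  # the health detail names the client id behind the warning
  ceph health detail
  # list sessions on the MDS and look at the flagged client's entry
  ceph tell mds.<name> session ls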
On Tuesday, July 18, 2023 10:56:12 AM EDT Wyll Ingersoll wrote:
> Every night at midnight, our ceph-mgr daemons open up ssh connections to the
> other nodes and then leave them open. Eventually they become zombies. I
> cannot figure out what module is causing this or how to turn it off. If
> left
Yes, it is ceph pacific 16.2.11.
Is this a known issue that is fixed in a more recent pacific update? We're not
ready to move to quincy yet.
thanks,
Wyllys
From: John Mulligan
Sent: Thursday, July 20, 2023 10:30 AM
To: ceph-users@ceph.io
Cc: Wyll Ingerso
On Thursday, July 20, 2023 10:36:02 AM EDT Wyll Ingersoll wrote:
> Yes, it is ceph pacific 16.2.11.
>
> Is this a known issue that is fixed in a more recent pacific update? We're
> not ready to move to quincy yet.
>
> thanks,
> Wyllys
>
To the best of my knowledge there's no fix in pacific
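If it does turn out to be the cephadm orchestrator module holding those ssh connections (an assumption, not confirmed here), a rough sketch of how to check and pause it:

  # see which mgr modules are enabled
  ceph mgr module ls
  # if cephadm is the suspect and pausing orchestration is acceptable:
  ceph orch pause
  # or, more drastically, disable the module entirely (you lose orchestration)
  ceph mgr module disable cephadm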
Enabling debug LC will make the LC run more often, but please keep in mind that it
might not respect the expiration time that was set. By design, it treats the
configured interval as one day.
So, if it runs more often, you will end up removing objects sooner than
365 days (as an example) if they are set to expire then.
Please test u
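For reference, the option usually meant by "debug LC" is rgw_lc_debug_interval; a sketch with placeholder values, for test clusters only:

  # treat every 600 seconds as one "day" for lifecycle purposes
  ceph config set client.rgw rgw_lc_debug_interval 600
  # inspect and manually kick lifecycle processing
  radosgw-admin lc list
  radosgw-admin lc process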
I need some help understanding this. I have configured nfs-ganesha for cephfs
using something like this in ganesha.conf
FSAL {
    Name = CEPH;
    User_Id = "testing.nfs";
    Secret_Access_Key = "AAA==";
}
But I constantly have these messages in the ganesha logs, 6x per user_id:
auth: unabl
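One thing worth double-checking (a guess at the cause, not a confirmed diagnosis): as far as I understand, User_Id must be the cephx ID without the "client." prefix, and the matching key must exist on the Ceph side:

  # verify the cephx user, key and caps that ganesha is trying to use
  ceph auth get client.testing.nfs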
Ok,
I think I figured this out. First, as I think I wrote earlier, these objects in
the ugly namespace begin with "<80>0_", and as such are a "bucket log
index" file according to the bucket_index_prefixes[] in cls_rgw.cc.
These objects were multiplying, and caused the 'Large omap object' w
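For anyone hitting the same warning, a rough sketch of inspecting and trimming those bucket index log entries (pool, object and bucket names are placeholders):

  # count omap keys on the offending index object
  rados -p <index-pool> listomapkeys <object-name> | wc -l
  # trim the bucket index log for the affected bucket
  radosgw-admin bilog trim --bucket=<bucket-name>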
Hi, we have a service that is still crashing when the S3 client (Veeam Backup) starts
to write data.
Main log from the RGW service:
req 13170422438428971730 0.00886s s3:get_obj WARNING: couldn't find acl
header for object, generating
default
2023-07-20T14:36:45.331+ 7fa5adb4c700 -1 *** Caught signal
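A generic way to get more detail out of a crash like this (not specific to this setup) is to pull the recorded crash report and raise RGW debug logging while reproducing:

  # list and inspect recorded crashes
  ceph crash ls
  ceph crash info <crash-id>
  # temporarily raise RGW verbosity while reproducing the Veeam workload
  ceph config set client.rgw debug_rgw 20
  ceph config set client.rgw debug_ms 1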
We did have a peering storm, we're past that portion of the backfill and still
experiencing new instances of rbd volumes hanging. It is for sure not just the
peering storm.
We've got 22.184% objects misplaced yet, with a bunch of pgs left to backfill
(like 75k). Our rbd pool is using about 1.7P
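A couple of generic checks that may help narrow down the hanging volumes (image name is a placeholder):

  # look for slow/blocked ops and laggy PGs
  ceph health detail
  # check watchers on one of the hung images
  rbd status <pool>/<image>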
I think the rook-ceph MDS is not responding to the liveness probe (confirmed by
kubectl describe on the mds pod). I don't think it's the memory, as I don't limit it,
and I have the CPU set to 500m per mds, but what direction should I go from here?
Hello Eugen,
Requested details are as below.
PG ID: 15.28f0
Pool ID: 15
Pool: default.rgw.buckets.data
Pool EC Ratio: 8:3
Number of Hosts: 12
## crush dump for rule ##
#ceph osd crush rule dump data_ec_rule
{
"rule_id": 1,
"rule_name": "data_ec_rule",
"ruleset": 1,
"type": 3
What would be the appropriate way to restart the primary OSD (343) in this case?
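It may also help to capture what the PG itself reports before restarting anything, e.g.:

  # show up/acting sets and the primary for the PG
  ceph pg map 15.28f0
  # full peering/recovery state of the PG
  ceph pg 15.28f0 query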
This issue has been closed.
If any rook-ceph users see this, when mds replay takes a long time, look at the
logs in mds pod.
If it's going well and then abruptly terminates, try describing the mds pod,
and if the liveness probe was terminated, try increasing the threshold of the
liveness probe.
If any rook-ceph users see the situation where the mds is stuck in replay, then look
at the logs of the mds pod.
If it runs and then terminates repeatedly, check whether there is a "liveness probe
terminated" error message by typing "kubectl describe pod -n (namespace) (mds'
pod name)".
If there is the
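A rough sketch of those checks, assuming the usual rook-ceph namespace and pod labels:

  # find the MDS pods and inspect probe-related events
  kubectl -n rook-ceph get pods -l app=rook-ceph-mds
  kubectl -n rook-ceph describe pod <mds-pod-name>
  # watch the mds log during replay (add -c <container> if the pod has sidecars)
  kubectl -n rook-ceph logs -f <mds-pod-name>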
Hey Christian,
What does sync look like on the first site? And does restarting the RGW
instances on the first site fix up your issues?
We saw issues in the past that sound a lot like yours. We've adopted the
practice of restarting the RGW instances in the first cluster after deploying a
seco
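The checks we run look roughly like this (the systemd unit name is an assumption and depends on how the RGWs were deployed):

  # compare metadata/data sync state on each site
  radosgw-admin sync status
  # restart the RGW instances on the first site if sync is wedged
  systemctl restart ceph-radosgw@rgw.<instance-name>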
Hello,
We have an RGW cluster that was recently upgraded from 12.2.11 to 14.2.22. The
upgrade went mostly fine, though now several of our RGWs will not start. One
RGW is working fine, the rest will not initialize. They are on a crash loop.
This is part of a multisite configuration, and is curre
Assuming you're running systemd-managed OSDs, you can run the following command on the
host that OSD 343 resides on.
systemctl restart ceph-osd@343
From: siddhit.ren...@nxtgen.com At: 07/20/23 13:44:36 UTC-4:00 To:
ceph-users@ceph.io
Subject: [ceph-users] Re: 1 PG stucked in "active+undersized+degrad
Thank you both Michel and Christian.
Looks like I will have to do the rebalancing eventually.
From past experience with Ceph 16 the rebalance will likely take at least a
month with my 500 M objects.
It seems like a good idea to upgrade to Ceph 17 first as Michel suggests.
Unless:
I was hoping
Sometimes one can even get away with "ceph osd down 343" which doesn't affect
the process. I have had occasions when this goosed peering in a less-intrusive
way. I believe it just marks the OSD down in the mons' map, and when that
makes it to the OSD, the OSD responds with "I'm not dead yet" a
Hi Niklas,
As I said, ceph placement is based on more than fulfilling the failure
domain constraint. This is a core feature of ceph's design. There is no
reason for a rebalance on a cluster with a few hundred OSDs to last a
month. It is just that, before 17, you have to adjust the max backfills parameter
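Concretely, the knobs meant here are roughly the following; values are examples only, and from 17 on the mClock scheduler handles them differently:

  # allow more concurrent backfills / recovery ops per OSD (pre-Quincy tuning)
  ceph config set osd osd_max_backfills 4
  ceph config set osd osd_recovery_max_active 4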
I can believe the month timeframe for a cluster with multiple large spinners
behind each HBA. I’ve witnessed such personally.
> On Jul 20, 2023, at 4:16 PM, Michel Jouvin
> wrote:
>
> Hi Niklas,
>
> As I said, ceph placement is based on more than fulfilling the failure domain
> constraint.
On Thu, Jul 20, 2023 at 11:19 PM wrote:
>
> If any rook-ceph users see the situation that mds is stuck in replay, then
> look at the logs of the mds pod.
>
> When it runs and then terminates repeatedly, check if there is "liveness
> probe terminated" error message by typing "kubectl describe p
On 7/20/23 22:09, Frank Schilder wrote:
Hi all,
we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to
advance oldest client/flush tid". I looked at the client and there was nothing going
on, so I rebooted it. After the client was back, the message was still there
Hi,
a couple of threads with similar error messages all lead back to some
sort of pool or osd issue. What is your current cluster status (ceph
-s)? Do you have some full OSDs? Those can cause this initialization
timeout, as can hitting the max_pg_per_osd limit. So a few more cluster
detail
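The kind of detail that helps here, as a generic starting point:

  ceph -s                                   # overall health, recovery, full flags
  ceph osd df                               # per-OSD fullness
  ceph config get mon mon_max_pg_per_osd    # the PG-per-OSD limit mentioned above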
Hi,
what's the cluster status? Is there recovery or backfilling going on?
Quoting Vladimir Brik:
I have a PG that hasn't been scrubbed in over a month and not
deep-scrubbed in over two months.
I tried forcing with `ceph pg (deep-)scrub` but with no success.
Looking at the logs of that
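A generic way to see why a PG isn't being scrubbed and to nudge it (pg id is a placeholder):

  # when was the PG last (deep-)scrubbed, and is scrubbing blocked by recovery?
  ceph pg <pgid> query | grep -i scrub
  # re-issue the scrub request
  ceph pg deep-scrub <pgid>
  # scrubs compete with recovery; osd_max_scrubs limits concurrency per OSD
  ceph config get osd osd_max_scrubs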