Hi all,
we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients
failing to advance oldest client/flush tid". I looked at the client and there
was nothing going on, so I rebooted it. After the client was back, the message
was still there. To clean this up I failed the MDS. Unfor
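A generic, less disruptive way to see which session the warning points at, instead of rebooting (the MDS name below is a placeholder):

  # the health detail names the client id behind the warning
  ceph health detail
  # list sessions on the MDS and look at the flagged client's entry
  ceph tell mds.<name> session ls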
On Tuesday, July 18, 2023 10:56:12 AM EDT Wyll Ingersoll wrote:
> Every night at midnight, our ceph-mgr daemons open up ssh connections to the
> other nodes and then leave them open. Eventually they become zombies. I
> cannot figure out what module is causing this or how to turn it off. If
> left
Yes, it is ceph pacific 16.2.11.
Is this a known issue that is fixed in a more recent pacific update? We're not
ready to move to quincy yet.
thanks,
Wyllys
From: John Mulligan
Sent: Thursday, July 20, 2023 10:30 AM
To: ceph-users@ceph.io
Cc: Wyll Ingerso
On Thursday, July 20, 2023 10:36:02 AM EDT Wyll Ingersoll wrote:
> Yes, it is ceph pacific 16.2.11.
>
> Is this a known issue that is fixed in a more recent pacific update? We're
> not ready to move to quincy yet.
>
> thanks,
> Wyllys
>
To the best of my knowledge there's no fix in pacific
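If it does turn out to be the cephadm orchestrator module holding those ssh connections (an assumption, not confirmed here), a rough sketch of how to check and pause it:

  # see which mgr modules are enabled
  ceph mgr module ls
  # if cephadm is the suspect and pausing orchestration is acceptable:
  ceph orch pause
  # or, more drastically, disable the module entirely (you lose orchestration)
  ceph mgr module disable cephadm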
Enabling debug LC will make the LC run more often, but please keep in mind that it
might not respect the expiration time that was set. By design, it treats the
configured interval as one day.
So, if it runs more often, you will end up removing objects sooner than
365 days (as an example) if they are set to expire then.
Please test u
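For reference, the option usually meant by "debug LC" is rgw_lc_debug_interval; a sketch with placeholder values, for test clusters only:

  # treat every 600 seconds as one "day" for lifecycle purposes
  ceph config set client.rgw rgw_lc_debug_interval 600
  # inspect and manually kick lifecycle processing
  radosgw-admin lc list
  radosgw-admin lc process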
I need some help understanding this. I have configured nfs-ganesha for cephfs
using something like this in ganesha.conf
FSAL {
    Name = CEPH;
    User_Id = "testing.nfs";
    Secret_Access_Key = "AAA==";
}
But I constantly have these messages in the ganesha logs, 6x per user_id:
auth: unabl
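One thing worth double-checking (a guess at the cause, not a confirmed diagnosis): as far as I understand, User_Id must be the cephx ID without the "client." prefix, and the matching key must exist on the Ceph side:

  # verify the cephx user, key and caps that ganesha is trying to use
  ceph auth get client.testing.nfs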
Ok,
I think I figured this out. First, as I think I wrote earlier, these objects in
the ugly namespace begin with "<80>0_", and as such are a "bucket log
index" file according to the bucket_index_prefixes[] in cls_rgw.cc.
These objects were multiplying, and caused the 'Large omap object' w
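For anyone hitting the same warning, a rough sketch of inspecting and trimming those bucket index log entries (pool, object and bucket names are placeholders):

  # count omap keys on the offending index object
  rados -p <index-pool> listomapkeys <object-name> | wc -l
  # trim the bucket index log for the affected bucket
  radosgw-admin bilog trim --bucket=<bucket-name>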
Hi, we have a service that is still crashing when the S3 client (Veeam Backup) starts
to write data.
Main log from the RGW service:
req 13170422438428971730 0.00886s s3:get_obj WARNING: couldn't find acl
header for object, generating
default
2023-07-20T14:36:45.331+ 7fa5adb4c700 -1 *** Caught signal
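A generic way to get more detail out of a crash like this (not specific to this setup) is to pull the recorded crash report and raise RGW debug logging while reproducing:

  # list and inspect recorded crashes
  ceph crash ls
  ceph crash info <crash-id>
  # temporarily raise RGW verbosity while reproducing the Veeam workload
  ceph config set client.rgw debug_rgw 20
  ceph config set client.rgw debug_ms 1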
We did have a peering storm, we're past that portion of the backfill and still
experiencing new instances of rbd volumes hanging. It is for sure not just the
peering storm.
We've got 22.184% objects misplaced yet, with a bunch of pgs left to backfill
(like 75k). Our rbd pool is using about 1.7P
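A couple of generic checks that may help narrow down the hanging volumes (image name is a placeholder):

  # look for slow/blocked ops and laggy PGs
  ceph health detail
  # check watchers on one of the hung images
  rbd status <pool>/<image>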
I think the rook-ceph MDS is not responding to the liveness probe (confirmed by
kubectl describe on the mds pod). I don't think it's the memory, as I don't limit it,
and I have the CPU set to 500m per mds, but what direction should I go from here?
Hello Eugen,
Requested details are as below.
PG ID: 15.28f0
Pool ID: 15
Pool: default.rgw.buckets.data
Pool EC Ratio: 8:3
Number of Hosts: 12
## crush dump for rule ##
#ceph osd crush rule dump data_ec_rule
{
"rule_id": 1,
"rule_name": "data_ec_rule",
"ruleset": 1,
"type": 3
What would be the appropriate way to restart the primary OSD (343) in this case?
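It may also help to capture what the PG itself reports before restarting anything, e.g.:

  # show up/acting sets and the primary for the PG
  ceph pg map 15.28f0
  # full peering/recovery state of the PG
  ceph pg 15.28f0 query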
This issue has been closed.
If any rook-ceph users see this, when mds replay takes a long time, look at the
logs in mds pod.
If it's going well and then abruptly terminates, try describing the mds pod,
and if the liveness probe was terminated, try increasing the threshold of the
liveness probe.
If any rook-ceph users see the situation where the mds is stuck in replay, then look
at the logs of the mds pod.
If it runs and then terminates repeatedly, check whether there is a "liveness probe
terminated" error message by typing "kubectl describe pod -n (namespace) (mds'
pod name)".
If there is the
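A rough sketch of those checks, assuming the usual rook-ceph namespace and pod labels:

  # find the MDS pods and inspect probe-related events
  kubectl -n rook-ceph get pods -l app=rook-ceph-mds
  kubectl -n rook-ceph describe pod <mds-pod-name>
  # watch the mds log during replay (add -c <container> if the pod has sidecars)
  kubectl -n rook-ceph logs -f <mds-pod-name>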
Hey Christian,
What does sync look like on the first site? And does restarting the RGW
instances on the first site fix up your issues?
We saw issues in the past that sound a lot like yours. We've adopted the
practice of restarting the RGW instances in the first cluster after deploying a
seco
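The checks we run look roughly like this (the systemd unit name is an assumption and depends on how the RGWs were deployed):

  # compare metadata/data sync state on each site
  radosgw-admin sync status
  # restart the RGW instances on the first site if sync is wedged
  systemctl restart ceph-radosgw@rgw.<instance-name>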
Hello,
We have an RGW cluster that was recently upgraded from 12.2.11 to 14.2.22. The
upgrade went mostly fine, though now several of our RGWs will not start. One
RGW is working fine, the rest will not initialize. They are on a crash loop.
This is part of a multisite configuration, and is curre
Assuming you're running systemd-managed OSDs, you can run the following command on the
host that OSD 343 resides on.
systemctl restart ceph-osd@343
From: siddhit.ren...@nxtgen.com At: 07/20/23 13:44:36 UTC-4:00 To:
ceph-users@ceph.io
Subject: [ceph-users] Re: 1 PG stucked in "active+undersized+degrad
Thank you both Michel and Christian.
Looks like I will have to do the rebalancing eventually.
From past experience with Ceph 16 the rebalance will likely take at least a
month with my 500 M objects.
It seems like a good idea to upgrade to Ceph 17 first as Michel suggests.
Unless:
I was hoping
Sometimes one can even get away with "ceph osd down 343" which doesn't affect
the process. I have had occasions when this goosed peering in a less-intrusive
way. I believe it just marks the OSD down in the mons' map, and when that
makes it to the OSD, the OSD responds with "I'm not dead yet" a
Hi Niklas,
As I said, ceph placement is based on more than fulfilling the failure
domain constraint. This is a core feature of ceph's design. There is no
reason for a rebalance on a cluster with a few hundred OSDs to last a
month. It is just that, before 17, you have to adjust the max backfills parameter
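Concretely, the knobs meant here are roughly the following; values are examples only, and from 17 on the mClock scheduler handles them differently:

  # allow more concurrent backfills / recovery ops per OSD (pre-Quincy tuning)
  ceph config set osd osd_max_backfills 4
  ceph config set osd osd_recovery_max_active 4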
I can believe the month timeframe for a cluster with multiple large spinners
behind each HBA. I’ve witnessed such personally.
> On Jul 20, 2023, at 4:16 PM, Michel Jouvin
> wrote:
>
> Hi Niklas,
>
> As I said, ceph placement is based on more than fulfilling the failure domain
> constraint.
On Thu, Jul 20, 2023 at 11:19 PM wrote:
>
> If any rook-ceph users see the situation that mds is stuck in replay, then
> look at the logs of the mds pod.
>
> When it runs and then terminates repeatedly, check if there is "liveness
> probe terminated" error message by typing "kubectl describe p
On 7/20/23 22:09, Frank Schilder wrote:
Hi all,
we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to
advance oldest client/flush tid". I looked at the client and there was nothing going
on, so I rebooted it. After the client was back, the message was still there
Hi,
a couple of threads with similar error messages all lead back to some
sort of pool or osd issue. What is your current cluster status (ceph
-s)? Do you have some full OSDs? Those can cause this initialization
timeout, as can hitting the max_pg_per_osd limit. So a few more cluster
detail
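The kind of detail that helps here, as a generic starting point:

  ceph -s                                   # overall health, recovery, full flags
  ceph osd df                               # per-OSD fullness
  ceph config get mon mon_max_pg_per_osd    # the PG-per-OSD limit mentioned above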
Hi,
what's the cluster status? Is there recovery or backfilling going on?
Quoting Vladimir Brik:
I have a PG that hasn't been scrubbed in over a month and not
deep-scrubbed in over two months.
I tried forcing with `ceph pg (deep-)scrub` but with no success.
Looking at the logs of that
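A generic way to see why a PG isn't being scrubbed and to nudge it (pg id is a placeholder):

  # when was the PG last (deep-)scrubbed, and is scrubbing blocked by recovery?
  ceph pg <pgid> query | grep -i scrub
  # re-issue the scrub request
  ceph pg deep-scrub <pgid>
  # scrubs compete with recovery; osd_max_scrubs limits concurrency per OSD
  ceph config get osd osd_max_scrubs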