[ceph-users] Re: radosgw bucket usage metrics gone after created in a loop 64K buckets

2023-09-18 Thread Szabo, Istvan (Agoda)
I think this is related to my radosgw-exporter, not to Ceph. I'll report it in
git, sorry for the noise.



From: Szabo, Istvan (Agoda) 
Sent: Monday, September 18, 2023 1:58 PM
To: Ceph Users 
Subject: [ceph-users] radosgw bucket usage metrics gone after created in a loop 
64K buckets

Hi,

Last week we created 64K buckets for a user so they could properly shard their 
huge number of objects, and I can see that the "radosgw_usage_bucket" metrics 
disappeared from 10 pm that day, when the mass creation happened, in our 
Octopus 15.2.17 cluster.

In the logs I don't really see anything useful.

Is there any limitation that I might have hit?
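
For anyone debugging something similar: a quick way to check whether the usage
data itself is still there (as opposed to the exporter no longer scraping it) is
to ask RGW directly. The uid and bucket name below are placeholders:

  # list the user's buckets and dump per-bucket stats
  radosgw-admin bucket list --uid=<uid>
  radosgw-admin bucket stats --bucket=<bucket-name>
  # show the usage log (requires rgw_enable_usage_log)
  radosgw-admin usage show --uid=<uid>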

Thank you


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Status of IPv4 / IPv6 dual stack?

2023-09-18 Thread Stefan Kooman

On 15-09-2023 09:25, Robert Sander wrote:

Hi,

as the documentation sends mixed signals in

https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/#ipv4-ipv6-dual-stack-mode

"Note

Binding to IPv4 is enabled by default, so if you just add the option to 
bind to IPv6 you’ll actually put yourself into dual stack mode."


and

https://docs.ceph.com/en/latest/rados/configuration/msgr2/#address-formats

"Note

The ability to bind to multiple ports has paved the way for dual-stack 
IPv4 and IPv6 support. That said, dual-stack operation is not yet 
supported as of Quincy v17.2.0."


just the quick questions:

Is dual-stacked networking with IPv4 and IPv6 now supported or not?
From which version on is it considered stable?


IIRC, the "enable dual stack" PRs were more or less "accidentally" 
merged; at least that's what Radoslaw Zarzynski (added to CC) told me 
during the developer summit at Cephalocon in Amsterdam. There was a 
discussion about dual-stack support after that. I voted in favor of not 
supporting dual stack. Currently there are no IPv6-only tests being 
performed, it's IPv4 only, let alone dual-stack testing setups. It gets 
complicated quickly if you want to test all sorts of combinations (some 
daemons with dual stack, some IPv4 only, some IPv6 only, etc.).



Are OSDs now able to register themselves with two IP addresses in the 
cluster map? MONs too?


At least the OSDs and MDSs can, and this caused trouble for kernels with 
messenger v2 support. We had to disable IPv4 explicitly to get rid of 
the IPv4 "0.0.0.0" addresses in the MDS map. See this thread [1].


Gr. Stefan

[1]: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GLNS2S6BK7Q5ECUT3G53EP5CCXNFENXQ/


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Status of IPv4 / IPv6 dual stack?

2023-09-18 Thread Nico Schottelius

Hello,

a note: we have been running IPv6-only clusters since 2017, in case anyone has
questions. In earlier releases no tuning was necessary; later releases
need the bind parameters.

BR,

Nico

Stefan Kooman  writes:

> On 15-09-2023 09:25, Robert Sander wrote:
>> Hi,
>> as the documentation sends mixed signals in
>> https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/#ipv4-ipv6-dual-stack-mode
>> "Note
>> Binding to IPv4 is enabled by default, so if you just add the option
>> to bind to IPv6 you’ll actually put yourself into dual stack mode."
>> and
>> https://docs.ceph.com/en/latest/rados/configuration/msgr2/#address-formats
>> "Note
>> The ability to bind to multiple ports has paved the way for
>> dual-stack IPv4 and IPv6 support. That said, dual-stack operation is
>> not yet supported as of Quincy v17.2.0."
>> just the quick questions:
>> Is a dual stacked networking with IPv4 and IPv6 now supported or
>> not?
>>  From which version on is it considered stable?
>
> IIRC, the "enable dual stack" PRs were more or less "accidentally"
> merged, at least that's what Radoslaw Zarzynski (added to CC) told me
> during the developer summit at Cephalocon in Amsterdam. There was a
> discussion about dual stack support after that. I voted in favor of
> not supporting dual stack. Currently there are no IPv6 (only) tests
> that are performed, it's IPv4 only. Let alone dual stack testing
> setups. It gets complicated quickly if you want to test all sorts of
> combinations (some daemons with dual stack, some IPv4 only, some IPv6
> only, etc.).
>
>
>> Are OSDs now able to register themselves with two IP addresses in
>> the cluster map? MONs too?
>
> At least the OSDs and MDSs can, and caused trouble for kernels with
> messenger v2 support. We had to disable IPv4 explicitly to get rid of
> the IPv4 "0.0.0.0" addresses in the MDS map. See this thread [1].
>
> Gr. Stefan
>
> [1]:
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GLNS2S6BK7Q5ECUT3G53EP5CCXNFENXQ/
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


--
Sustainable and modern Infrastructures by ungleich.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Error EPERM: error setting 'osd_op_queue' to 'wpq': (1) Operation not permitted

2023-09-18 Thread Nikolaos Dandoulakis
Hi,

After upgrading our cluster to 17.2.6, all OSDs appear to have "osd_op_queue": 
"mclock_scheduler" (it used to be wpq). As we see several OSDs reporting 
unjustifiably heavy load, we would like to revert this back to "wpq", but any 
attempt yields the following error:

root@store14:~# ceph tell osd.71 config set osd_op_queue wpq
Error EPERM: error setting 'osd_op_queue' to 'wpq': (1) Operation not permitted

I cannot find any explanation for this anywhere; I am guessing another setting 
needs to be changed as well. Has anybody resolved this?

Best,
Nick
The University of Edinburgh is a charitable body, registered in Scotland, with 
registration number SC005336. Is e buidheann carthannais a th' ann an Oilthigh 
Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Quincy 17.2.6 - Rados gateway crash -

2023-09-18 Thread Berger Wolfgang
Hi Michal,
I don't see any errors on versions 17.2.6 and 18.2.0 using Veeam 12.0.0.1420.
In my setup, I am using a dedicated nginx proxy (not managed by Ceph) to reach 
the upstream rgw instances.
BR
Wolfgang

-----Original Message-----
From: Michal Strnad  
Sent: Sunday, 17 September 2023 19:43
To: Berger Wolfgang ; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Quincy 17.2.6 - Rados gateway crash -

Hi.

Has anyone encountered the same error in the Reef version? At least 
Quincy (17.2.5) is still suffering from this even when using the latest 
version of Veeam.

Michal Strnad



On 8/17/23 15:12, Wolfgang Berger wrote:
> Hi,
> I've just checked Veeam backup (build 12.0.0.1420) to reef 18.2.0.
> Works great so far.
> BR
> Wolfgang
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Make ceph orch daemons reboot safe

2023-09-18 Thread Boris Behrens
Found it. The target was not enabled:
root@0cc47a6df14e:~# systemctl status
ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853.target
● ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853.target - Ceph cluster
03977a23-f00f-4bb0-b9a7-de57f40ba853
 Loaded: loaded
(/etc/systemd/system/ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853.target;
enabled; vendor preset: enabled)
 Active: inactive (dead)
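
For the record, enabling the per-cluster target (the fsid is the one shown
above) should make the daemons come back after a reboot:

  systemctl enable ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853.target
  # and, if needed, the umbrella target as well
  systemctl enable ceph.target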

On Sat, 16 Sept 2023 at 13:29, Boris  wrote:

> The other hosts are still online and the cluster only lost 1/3 of its
> services.
>
>
>
> > On 16.09.2023 at 12:53, Eugen Block  wrote:
> >
> > I don’t have time to look into all the details, but I’m wondering how
> you seem to be able to start mgr services with the orchestrator if all mgr
> daemons are down. The orchestrator is a mgr module, so that’s a bit weird,
> isn’t it?
> >
> > Zitat von Boris Behrens :
> >
> >> Hi Eugen,
> >> the test clusters where we started with plain Ceph and where the adoption
> >> went straightforward are working fine.
> >>
> >> But this test cluster was all over the place.
> >> We had an old running update via orchestrator which was still in the
> >> pipeline, the adoption process was stopped a year ago and now got
> picked up
> >> again, and so on and so forth.
> >>
> >> But now we have it clean, at least we think it's clean.
> >>
> >> After a reboot, the services are not available. I have to start them via
> >> ceph orch.
> >> root@0cc47a6df14e:~# systemctl list-units | grep ceph
> >>  ceph-crash.service
> >>loaded active running   Ceph crash dump collector
> >>  ceph-fuse.target
> >>loaded active activeceph target allowing to
> start/stop
> >> all ceph-fuse@.service instances at once
> >>  ceph-mds.target
> >>   loaded active activeceph target allowing to start/stop
> >> all ceph-mds@.service instances at once
> >>  ceph-mgr.target
> >>   loaded active activeceph target allowing to start/stop
> >> all ceph-mgr@.service instances at once
> >>  ceph-mon.target
> >>   loaded active activeceph target allowing to start/stop
> >> all ceph-mon@.service instances at once
> >>  ceph-osd.target
> >>   loaded active activeceph target allowing to start/stop
> >> all ceph-osd@.service instances at once
> >>  ceph-radosgw.target
> >>   loaded active activeceph target allowing to start/stop
> >> all ceph-radosgw@.service instances at once
> >>  ceph.target
> >>   loaded active activeAll Ceph clusters and services
> >> root@0cc47a6df14e:~# ceph orch start mgr
> >> Scheduled to start mgr.0cc47a6df14e.nvjlcx on host '0cc47a6df14e'
> >> Scheduled to start mgr.0cc47a6df330.aznjao on host '0cc47a6df330'
> >> Scheduled to start mgr.0cc47aad8ce8.ifiydp on host '0cc47aad8ce8'
> >> root@0cc47a6df14e:~# ceph orch start mon
> >> Scheduled to start mon.0cc47a6df14e on host '0cc47a6df14e'
> >> Scheduled to start mon.0cc47a6df330 on host '0cc47a6df330'
> >> Scheduled to start mon.0cc47aad8ce8 on host '0cc47aad8ce8'
> >> root@0cc47a6df14e:~# ceph orch start osd.all-flash-over-1tb
> >> Scheduled to start osd.2 on host '0cc47a6df14e'
> >> Scheduled to start osd.5 on host '0cc47a6df14e'
> >> Scheduled to start osd.3 on host '0cc47a6df330'
> >> Scheduled to start osd.0 on host '0cc47a6df330'
> >> Scheduled to start osd.4 on host '0cc47aad8ce8'
> >> Scheduled to start osd.1 on host '0cc47aad8ce8'
> >> root@0cc47a6df14e:~# systemctl list-units | grep ceph
> >>
> ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853@mgr.0cc47a6df14e.nvjlcx.service
> >>   loaded active running   Ceph
> >> mgr.0cc47a6df14e.nvjlcx for 03977a23-f00f-4bb0-b9a7-de57f40ba853
> >>  ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853@mon.0cc47a6df14e.service
> >>loaded active running   Ceph
> >> mon.0cc47a6df14e for 03977a23-f00f-4bb0-b9a7-de57f40ba853
> >>  ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853@osd.2.service
> >>   loaded active running   Ceph osd.2
> >> for 03977a23-f00f-4bb0-b9a7-de57f40ba853
> >>  ceph-crash.service
> >>loaded active running   Ceph
> crash
> >> dump collector
> >>  system-ceph\x2d03977a23\x2df00f\x2d4bb0\x2db9a7\x2dde57f40ba853.slice
> >>   loaded active active
> >> system-ceph\x2d03977a23\x2df00f\x2d4bb0\x2db9a7\x2dde57f40ba853.slice
> >>  ceph-fuse.target
> >>loaded active activeceph
> target
> >> allowing to start/stop all ceph-fuse@.service instances at once
> >>  ceph-mds.target
> >>   loaded active activeceph
> target
> >> allowing to start/stop all ceph-mds@.service instances at once
> >>  ceph-mgr.target
> >>   loaded active activeceph
> target
> >> allowing to start/stop all ceph-mgr@.service instances at once
> >>  ceph-mon.target
> >>  

[ceph-users] Re: 6.5 CephFS client - ceph_cap_reclaim_work [ceph] / ceph_con_workfn [libceph] hogged CPU

2023-09-18 Thread Stefan Kooman

On 13-09-2023 16:49, Stefan Kooman wrote:

On 13-09-2023 14:58, Ilya Dryomov wrote:

On Wed, Sep 13, 2023 at 9:20 AM Stefan Kooman  wrote:


Hi,

Since the 6.5 kernel addressed the regression in the readahead handling
code... we went ahead and installed this kernel
for a couple of mail / web clusters (Ubuntu 6.5.1-060501-generic
#202309020842 SMP PREEMPT_DYNAMIC Sat Sep  2 08:48:34 UTC 2023 x86_64
x86_64 x86_64 GNU/Linux). Since then we occasionally see the following
being logged by the kernel:

[Sun Sep 10 07:19:00 2023] workqueue: delayed_work [ceph] hogged CPU for
   >1us 4 times, consider switching to WQ_UNBOUND
[Sun Sep 10 08:41:24 2023] workqueue: ceph_con_workfn [libceph] hogged
CPU for >1us 4 times, consider switching to WQ_UNBOUND
[Sun Sep 10 11:05:55 2023] workqueue: delayed_work [ceph] hogged CPU for
   >1us 8 times, consider switching to WQ_UNBOUND
[Sun Sep 10 12:54:38 2023] workqueue: ceph_con_workfn [libceph] hogged
CPU for >1us 8 times, consider switching to WQ_UNBOUND
[Sun Sep 10 19:06:37 2023] workqueue: ceph_con_workfn [libceph] hogged
CPU for >1us 16 times, consider switching to WQ_UNBOUND
[Mon Sep 11 10:53:33 2023] workqueue: ceph_con_workfn [libceph] hogged
CPU for >1us 32 times, consider switching to WQ_UNBOUND
[Tue Sep 12 10:14:03 2023] workqueue: ceph_con_workfn [libceph] hogged
CPU for >1us 64 times, consider switching to WQ_UNBOUND
[Tue Sep 12 11:14:33 2023] workqueue: ceph_cap_reclaim_work [ceph]
hogged CPU for >1us 4 times, consider switching to WQ_UNBOUND

We wonder if this is a new phenomenon, or whether it is simply logged by the
new kernel and was not logged before.


Hi Stefan,

This is something that wasn't logged in older kernels.  The kernel
workqueue infrastructure is considering Ceph work items CPU intensive
and reports that in dmesg.  This is new in 6.5 kernel, the threshold
can be tweaked with workqueue.cpu_intensive_thresh_us parameter.


Thanks. I was just looking into it (WQ_UNBOUND), alloc_workqueue(), etc. 
The patch by Tejun Heo on workqueue also mentions this:


* Concurrency-managed per-cpu work items that hog CPUs and delay the
    execution of other work items are now automatically detected and 
excluded from concurrency management. Reporting on such work items can 
also be enabled through a config option.


This does imply that the Ceph work items are "excluded from concurrency 
management", is that correct? And if so, what does that mean in 
practice? Might this make the process of returning / claiming caps to 
the MDS slower?


In 6.6-rc1 more workqueue work is done and more fine tuning seems 
possible. If there are any recommendations from a cephfs kernel client 
perspective on what a good policy would be, we would love to hear about 
that.


For now we will just disable the detection (cpu_intensive_thresh_us=0) 
and see how it goes.
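
(For anyone wanting to do the same: cpu_intensive_thresh_us is a parameter of
the kernel workqueue subsystem, so on Ubuntu one way to set it is via the
kernel command line; a value of 0 disables the detection entirely.)

  # in /etc/default/grub, append to GRUB_CMDLINE_LINUX_DEFAULT:
  #   workqueue.cpu_intensive_thresh_us=0
  # then regenerate the bootloader config and reboot
  update-grub
  reboot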


Well, the positive thing to mention is that we don't see this happening 
anymore. However, in the same time window as last week, an event 
happened. This time it was the MDS that got OOM killed. At the time it 
was killed it was consuming 249.9 GiB (03:51:33). Just before that it 
consumed ~200 GiB, i.e. it acquired ~50 GiB of RAM in 35 seconds (according 
to metrics). What MDS / client behaviour can trigger such a large increase 
in memory usage?


Details: Single MDS (16.2.11), 1 active-standby, no snapshots. MDS 
server has 256 GiB of RAM. Dedicated node (bare metal).


Adding Patrick in CC.

Gr. Stefan


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] rbd-mirror and DR test

2023-09-18 Thread Kamil Madac
One of our customers is currently facing a challenge in testing our
disaster recovery (DR) procedures on a pair of Ceph clusters (Quincy
version 17.2.5).

Our issue revolves around the need to resynchronize data after
conducting a DR procedure test. In small-scale scenarios, this may not
be a significant problem. However, when dealing with terabytes of
data, it becomes a considerable challenge.

In a typical DR procedure, there are two sites, Site A and Site B. The
process involves demoting Site A and promoting Site B, followed by the
reverse operation to ensure data resynchronization. However, our
specific challenge lies in the fact that, in our case:

- Site A is running and serving production traffic, Site B is just for
DR purposes.
- Network connectivity between Site A and Site B is deliberately disrupted.
- A "promote" operation is enforced (--force) on Site B, creating a
split-brain situation.
- Data access and modifications are performed on Site B during this state.
- To revert to the original configuration, we must demote Site B, but
the only way to re-establish RBD mirroring is by forcing a full
resynchronization, essentially recopying the entire dataset.

Given these circumstances, we are interested in how to address this
challenge efficiently, especially when dealing with large datasets
(TBs of data). Are there alternative approaches, best practices, or
recommendations such that we won't need to fully resync site A to site
B in order to reestablish rbd-mirror?
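
For context, the rbd commands involved look roughly like this (pool/image names
are placeholders); in the planned flow no resync is needed, while after a
forced promote the resync currently means recopying the image, which is exactly
the pain point above:

  # planned failover, no split-brain:
  rbd mirror image demote mypool/myimage    # on site A
  rbd mirror image promote mypool/myimage   # on site B
  # returning from a forced promote on site B:
  rbd mirror image demote mypool/myimage    # on site B
  rbd mirror image resync mypool/myimage    # on site B, discards B's changes via full resync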

Thank you very much for any advice.

Kamil Madac
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: openstack rgw swift -- reef vs quincy

2023-09-18 Thread Casey Bodley
thanks Shashi, this regression is tracked in
https://tracker.ceph.com/issues/62771. we're testing a fix
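
until then, the CLI workaround mentioned below looks roughly like this (the
container name is a placeholder):

  # create the container via the openstack CLI instead of Horizon
  openstack container create my-container
  # confirm it and check the applied storage policy
  openstack container show my-container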

On Sat, Sep 16, 2023 at 7:32 PM Shashi Dahal  wrote:
>
> Hi All,
>
> We have 3 openstack clusters, each with their own Ceph. The openstack
> versions are identical (using openstack-ansible) and all rgw-keystone
> related configs are also the same.
>
> The only difference is the Ceph version: one is Pacific, one is Quincy, while
> the other (new) one is Reef.
>
> The issue with reef is:
>
> Horizon >> Object Storage >> Containers >> Create New Container
> In the storage-policy field there is nothing in Reef, vs. default-placement in
> Quincy and Pacific.
> Without any policy selected (because the form is blank), the "submit"
> button to create the container is disabled.
>
> Via the openstack CLI, we are able to create the container, and once created,
> we can use Horizon to upload/download images etc. When doing container
> show (in the CLI/Horizon) it shows that the policy is default-placement.
>
> Can someone guide us on how to troubleshoot and correct this?
>
> --
> Cheers,
> Shashi
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Error EPERM: error setting 'osd_op_queue' to 'wpq': (1) Operation not permitted

2023-09-18 Thread Josh Baergen
My guess is that this is because this setting can't be changed at
runtime, though if so that's a new enforcement behaviour in Quincy
that didn't exist in prior versions.

I think what you want to do is 'config set osd osd_op_queue wpq'
(assuming you want this set for all OSDs) and then restart your OSDs
in a safe manner.
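
Roughly something like this (the restart method depends on how the cluster is
deployed; osd.71 is just the example from above):

  # set it cluster-wide in the mon config store
  ceph config set osd osd_op_queue wpq
  # check what a given OSD will pick up on restart
  ceph config get osd.71 osd_op_queue
  # then restart OSDs in a controlled fashion, e.g.
  ceph orch daemon restart osd.71        # cephadm-managed clusters
  systemctl restart ceph-osd@71          # package-based installs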

Josh

On Mon, Sep 18, 2023 at 4:43 AM Nikolaos Dandoulakis  wrote:
>
> Hi,
>
> After upgrading our cluster to 17.2.6, all OSDs appear to have "osd_op_queue": 
> "mclock_scheduler" (it used to be wpq). As we see several OSDs reporting 
> unjustifiably heavy load, we would like to revert this back to "wpq", but any 
> attempt yields the following error:
>
> root@store14:~# ceph tell osd.71 config set osd_op_queue wpq
> Error EPERM: error setting 'osd_op_queue' to 'wpq': (1) Operation not 
> permitted
>
> I cannot find anywhere why this is happening, I am guessing another setting 
> needs to be changed as well. Has anybody resolved this?
>
> Best,
> Nick
> The University of Edinburgh is a charitable body, registered in Scotland, 
> with registration number SC005336. Is e buidheann carthannais a th' ann an 
> Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS warning: clients laggy due to laggy OSDs

2023-09-18 Thread Janek Bevendorff

Hey all,

Since the upgrade to Ceph 16.2.14, I keep seeing the following warning:

10 client(s) laggy due to laggy OSDs

ceph health detail shows it as:

[WRN] MDS_CLIENTS_LAGGY: 10 client(s) laggy due to laggy OSDs
    mds.***(mds.3): Client *** is laggy; not evicted because some 
OSD(s) is/are laggy

    more of this...

When I restart the client(s) or the affected MDS daemons, the message 
goes away and then comes back after a while. ceph osd perf does not list 
any laggy OSDs (a few with 10-60 ms ping, but overwhelmingly < 1 ms), so 
I'm at a total loss as to what this even means.


I have never seen this message before nor was I able to find anything 
about it. Do you have any idea what this message actually means and how 
I can get rid of it?


Thanks
Janek



smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-18 Thread Laura Flores
Hi Janek,

There was some documentation added about it here:
https://docs.ceph.com/en/pacific/cephfs/health-messages/

There is a description of what it means, and it's tied to an mds
configurable.
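
If I remember correctly, the configurable in question is
defer_client_eviction_on_laggy_osds (please double-check the exact name for
your release); something like this shows and, if desired, disables the
behaviour:

  ceph config get mds defer_client_eviction_on_laggy_osds
  # false restores the old behaviour of evicting laggy clients regardless of OSD state
  ceph config set mds defer_client_eviction_on_laggy_osds false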

On Mon, Sep 18, 2023 at 10:51 AM Janek Bevendorff <
janek.bevendo...@uni-weimar.de> wrote:

> Hey all,
>
> Since the upgrade to Ceph 16.2.14, I keep seeing the following warning:
>
> 10 client(s) laggy due to laggy OSDs
>
> ceph health detail shows it as:
>
> [WRN] MDS_CLIENTS_LAGGY: 10 client(s) laggy due to laggy OSDs
>  mds.***(mds.3): Client *** is laggy; not evicted because some
> OSD(s) is/are laggy
>  more of this...
>
> When I restart the client(s) or the affected MDS daemons, the message
> goes away and then comes back after a while. ceph osd perf does not list
> any laggy OSDs (a few with 10-60ms ping, but overwhelmingly < 1ms), so
> I'm on a total loss what this even means.
>
> I have never seen this message before nor was I able to find anything
> about it. Do you have any idea what this message actually means and how
> I can get rid of it?
>
> Thanks
> Janek
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage 

Chicago, IL

lflo...@ibm.com | lflo...@redhat.com 
M: +17087388804
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-18 Thread Janek Bevendorff

Thanks! However, I still don't really understand why I am seeing this.

The first time I had this, one of the clients was a remote user dialling 
in via VPN, which could indeed be laggy. But I am also seeing it from 
neighbouring hosts that are on the same physical network with reliable 
ping times way below 1ms. How is that considered laggy?



On 18/09/2023 18:07, Laura Flores wrote:

Hi Janek,

There was some documentation added about it here: 
https://docs.ceph.com/en/pacific/cephfs/health-messages/


There is a description of what it means, and it's tied to an mds 
configurable.


On Mon, Sep 18, 2023 at 10:51 AM Janek Bevendorff 
 wrote:


Hey all,

Since the upgrade to Ceph 16.2.14, I keep seeing the following
warning:

10 client(s) laggy due to laggy OSDs

ceph health detail shows it as:

[WRN] MDS_CLIENTS_LAGGY: 10 client(s) laggy due to laggy OSDs
 mds.***(mds.3): Client *** is laggy; not evicted because some
OSD(s) is/are laggy
 more of this...

When I restart the client(s) or the affected MDS daemons, the message
goes away and then comes back after a while. ceph osd perf does
not list
any laggy OSDs (a few with 10-60ms ping, but overwhelmingly <
1ms), so
I'm on a total loss what this even means.

I have never seen this message before nor was I able to find anything
about it. Do you have any idea what this message actually means
and how
I can get rid of it?

Thanks
Janek

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



--

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage 

Chicago, IL

lflo...@ibm.com | lflo...@redhat.com 
M: +17087388804 




--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de


smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS_CACHE_OVERSIZED, what is this a symptom of?

2023-09-18 Thread Pedro Lopes
So I'm getting this warning (although there are no noticeable problems in the 
cluster):

$ ceph health detail
HEALTH_WARN 1 MDSs report oversized cache
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
mds.storefs-b(mds.0): MDS cache is too large (7GB/4GB); 0 inodes in use by 
clients, 0 stray files

Ceph FS status:

$ ceph fs status
storefs - 20 clients
=============
RANK      STATE          MDS        ACTIVITY     DNS    INOS   DIRS   CAPS
 0        active       storefs-a  Reqs:    0 /s  1385k  1385k   113k   193k
 0-s   standby-replay  storefs-b  Evts:    0 /s  3123k  3123k  33.5k      0
       POOL          TYPE     USED  AVAIL
storefs-metadata   metadata  19.4G  12.6T
 storefs-pool4x      data    4201M  9708G
 storefs-pool2x      data    2338G  18.9T
MDS version: ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) 
quincy (stable)

What is this telling me? Is it just a case of the cache size needing to be 
bigger? Is it a problem with the clients holding onto some kind of reference 
(the documentation says this can be a cause, but not how to check for it)?
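
A few things that might help narrow it down (storefs-b is the daemon named in
the warning above; the 8 GiB limit is only an example):

  # what the MDS itself reports about its cache
  ceph tell mds.storefs-b cache status
  # client sessions and the number of caps each holds
  ceph tell mds.storefs-b session ls
  # current cache limit, and raising it if the working set really is bigger
  ceph config get mds mds_cache_memory_limit
  ceph config set mds mds_cache_memory_limit 8589934592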

Thanks in advance,
Pedro Lopes
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] python error when adding subvolume permission in cli

2023-09-18 Thread metrax
Hi,

on one of my clusters, I'm having a problem with authorizing new clients to an 
existing share.
The problem exists on all 3 nodes, with the same error.

Since a similar command works on another cluster, something seems to be 
wrong here. I also tried to delete and recreate the fs and the MDS, but it 
didn't help.

The servers are running on Debian 12.1 bookworm with Proxmox. Ceph version is 
17.2.6 from the proxmox repositories.

Please help me to find the cause of this problem.

Thank you very much.

Robert

~# ceph fs subvolume authorize shared cweb01lab glcweb01a.lab.xyz.host 
--access_level=rwp --verbose
parsed_args: Namespace(completion=False, help=False, cephconf=None, 
input_file=None, output_file=None, setuser=None, setgroup=None, client_id=None, 
client_name=None, cluster=None, admin_socket=None, status=False, watch=False, 
watch_debug=False, watch_info=False, watch_sec=False, watch_warn=False, 
watch_error=False, watch_channel=None, version=False, verbose=True, 
output_format=None, cluster_timeout=None, block=False, period=1), childargs: 
['fs', 'subvolume', 'authorize', 'shared', 'cweb01lab', 
'glcweb01a.lab.xyz.host', '--access_level=rwp']
cmd000: pg stat
cmd001: pg getmap
cmd002: pg dump 
[...]
cmd003: pg dump_json [...]
cmd004: pg dump_pools_json
cmd005: pg ls-by-pool  [...]
cmd006: pg ls-by-primary  [] [...]
cmd007: pg ls-by-osd  [] [...]
cmd008: pg ls [] [...]
cmd009: pg dump_stuck 
[...] []
cmd010: pg debug 
cmd011: pg scrub 
cmd012: pg deep-scrub 
cmd013: pg repair 
cmd014: pg force-recovery ...
cmd015: pg force-backfill ...
cmd016: pg cancel-force-recovery ...
cmd017: pg cancel-force-backfill ...
cmd018: osd perf
cmd019: osd df [] [] []
cmd020: osd blocked-by
cmd021: osd pool stats []
cmd022: osd pool scrub ...
cmd023: osd pool deep-scrub ...
cmd024: osd pool repair ...
cmd025: osd pool force-recovery ...
cmd026: osd pool force-backfill ...
cmd027: osd pool cancel-force-recovery ...
cmd028: osd pool cancel-force-backfill ...
cmd029: osd reweight-by-utilization [] [] 
[] [--no-increasing]
cmd030: osd test-reweight-by-utilization [] [] 
[] [--no-increasing]
cmd031: osd reweight-by-pg [] [] [] 
[...]
cmd032: osd test-reweight-by-pg [] [] 
[] [...]
cmd033: osd destroy  [--force] [--yes-i-really-mean-it]
cmd034: osd purge  [--force] [--yes-i-really-mean-it]
cmd035: osd safe-to-destroy ...
cmd036: osd ok-to-stop ... []
cmd037: osd scrub 
cmd038: osd deep-scrub 
cmd039: osd repair 
cmd040: service dump
cmd041: service status
cmd042: config show  []
cmd043: config show-with-defaults 
cmd044: device ls
cmd045: device info 
cmd046: device ls-by-daemon 
cmd047: device ls-by-host 
cmd048: device set-life-expectancy   []
cmd049: device rm-life-expectancy 
cmd050: alerts send
cmd051: balancer status
cmd052: balancer mode 
cmd053: balancer on
cmd054: balancer off
cmd055: balancer pool ls
cmd056: balancer pool add ...
cmd057: balancer pool rm ...
cmd058: balancer eval-verbose []
cmd059: balancer eval []
cmd060: balancer optimize  [...]
cmd061: balancer show 
cmd062: balancer rm 
cmd063: balancer reset
cmd064: balancer dump 
cmd065: balancer ls
cmd066: balancer execute 
cmd067: crash info 
cmd068: crash post
cmd069: crash ls [--format ]
cmd070: crash ls-new [--format ]
cmd071: crash rm 
cmd072: crash prune 
cmd073: crash archive 
cmd074: crash archive-all
cmd075: crash stat
cmd076: crash json_report 
cmd077: device query-daemon-health-metrics 
cmd078: device scrape-daemon-health-metrics 
cmd079: device scrape-health-metrics []
cmd080: device get-health-metrics  []
cmd081: device check-health
cmd082: device monitoring on
cmd083: device monitoring off
cmd084: device predict-life-expectancy 
cmd085: influx config-set  
cmd086: influx config-show
cmd087: influx send
cmd088: influx config-show
cmd089: influx config-set  
cmd090: influx send
cmd091: insights
cmd092: insights prune-health []
cmd093: iostat [] [--print-header]
cmd094: fs snapshot mirror enable []
cmd095: fs snapshot mirror disable []
cmd096: fs snapshot mirror peer_add  [] 
[] [] []
cmd097: fs snapshot mirror peer_list []
cmd098: fs snapshot mirror peer_remove  []
cmd099: fs snapshot mirror peer_bootstrap create   
[]
cmd100: fs snapshot mirror peer_bootstrap import  []
cmd101: fs snapshot mirror add  []
cmd102: fs snapshot mirror remove  []
cmd103: fs snapshot mirror dirmap  []
cmd104: fs snapshot mirror show distribution []
cmd105: fs snapshot mirror daemon status
cmd106: nfs export create cephfs[] 
[--readonly] [--client_addr ...] [--squash ] [--sectype 
...]
cmd107: nfs export create rgw   [] [] 
[--readonly] [--client_addr ...] [--squash ] [--sectype 
...]
cmd108: nfs export rm  
cmd109: nfs export delete  
cmd110: nfs export ls  [--detailed]
cmd111: nfs export info  
cmd112: nfs export get  
cmd113: nfs export apply 
cmd114: nfs cluster create  [] [--ingress] [--virtual_ip 
] [--port ]
cmd115: nfs cluster rm 
cmd116: nfs cluster delete 
cmd117: nfs cluster ls
cmd118: nfs cluster info []
cmd119: nfs cluster config get 
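
Since the actual traceback is what matters here, a few things that may help
narrow it down (the names are the ones from the command above; the log path
assumes a packaged, non-cephadm install like Proxmox uses):

  # confirm the subvolume and any existing auth entity
  ceph fs subvolume ls shared
  ceph fs subvolume info shared cweb01lab
  ceph auth get client.glcweb01a.lab.xyz.host
  # the full python traceback from the volumes module usually lands in the active mgr's log
  less /var/log/ceph/ceph-mgr.*.log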

[ceph-users] Re: libceph: mds1 IP+PORT wrong peer at address

2023-09-18 Thread ultrasagenexus
We are facing a similar issue where we are seeing "libceph: wrong peer, want 
, got " in our dmesg as well.

Servers are running Ubuntu 20.04.6, kernel version: 5.15.0-79-generic
K8s: 1.27.4
containerd: 1.6.22
rook: 1.12.1
Ceph: 18.2.0

The rook and ceph versions were recently upgraded from 1.11.9 and 17.2.6 
respectively - these messages were not seen before.

Here are some related dmesg logs from one of our servers where we are seeing OSD 
restarts, for your reference:

[Sun Sep 17 10:21:31 2023] libceph: wrong peer, want (1)[::nn]:6801/3310605789, 
got (1)[::nn]:6801/3848687189
[Sun Sep 17 10:21:31 2023] libceph: osd2 (1)[::nn]:6801 wrong peer at address
[Sun Sep 17 10:21:31 2023] libceph: wrong peer, want (1)[::mm]:6801/480442735, 
got (1)[::mm]:6801/261725973
[Sun Sep 17 10:21:31 2023] libceph: osd0 (1)[::mm]:6801 wrong peer at address
[Sun Sep 17 10:21:31 2023] libceph: wrong peer, want (1)[::yy]:6841/3558675245, 
got (1)[::yy]:6841/392097708
[Sun Sep 17 10:21:31 2023] libceph: osd1 (1)[::yy]:6841 wrong peer at address
[Sun Sep 17 10:21:31 2023] libceph: wrong peer, want (1)[::mm]:6801/3886522490, 
got (1)[::mm]:6801/261725973
[Sun Sep 17 10:21:31 2023] libceph: osd0 (1)[::mm]:6801 wrong peer at address
[Sun Sep 17 10:21:31 2023] libceph: wrong peer, want (1)[::nn]:6801/1808088144, 
got (1)[::nn]:6801/3848687189
[Sun Sep 17 10:21:31 2023] libceph: osd2 (1)[::nn]:6801 wrong peer at address
[Sun Sep 17 10:21:31 2023] libceph: wrong peer, want (1)[::mm]:6801/2444743718, 
got (1)[::mm]:6801/261725973
[Sun Sep 17 10:21:31 2023] libceph: osd0 (1)[::mm]:6801 wrong peer at address
[Sun Sep 17 10:21:31 2023] libceph: wrong peer, want (1)[::yy]:6841/3558675245, 
got (1)[::yy]:6841/392097708
[Sun Sep 17 10:21:31 2023] libceph: osd1 (1)[::yy]:6841 wrong peer at address
[Sun Sep 17 10:21:31 2023] libceph: wrong peer, want (1)[::nn]:6801/927670669, 
got (1)[::nn]:6801/3848687189
[Sun Sep 17 10:21:31 2023] libceph: osd2 (1)[::nn]:6801 wrong peer at address
[Sun Sep 17 10:21:31 2023] libceph: wrong peer, want (1)[::mm]:6801/799469619, 
got (1)[::mm]:6801/261725973
[Sun Sep 17 10:21:31 2023] libceph: osd0 (1)[::mm]:6801 wrong peer at address
[Sun Sep 17 10:21:32 2023] libceph: wrong peer, want (1)[::yy]:6841/3558675245, 
got (1)[::yy]:6841/392097708
[Sun Sep 17 10:21:32 2023] libceph: osd1 (1)[::yy]:6841 wrong peer at address
[Sun Sep 17 10:21:32 2023] libceph: wrong peer, want (1)[::nn]:6801/927670669, 
got (1)[::nn]:6801/3848687189
[Sun Sep 17 10:21:32 2023] libceph: osd2 (1)[::nn]:6801 wrong peer at address
[Sun Sep 17 10:21:32 2023] libceph: wrong peer, want (1)[::mm]:6801/799469619, 
got (1)[::mm]:6801/261725973
[Sun Sep 17 10:21:32 2023] libceph: osd0 (1)[::mm]:6801 wrong peer at address
[Sun Sep 17 10:24:01 2023] libceph: wrong peer, want (1)[::yy]:6841/3558675245, 
got (1)[::yy]:6841/392097708
[Sun Sep 17 10:24:01 2023] libceph: osd1 (1)[::yy]:6841 wrong peer at address
[Sun Sep 17 10:24:01 2023] libceph: wrong peer, want (1)[::yy]:6841/3558675245, 
got (1)[::yy]:6841/392097708
[Sun Sep 17 10:24:01 2023] libceph: osd1 (1)[::yy]:6841 wrong peer at address
[Sun Sep 17 10:24:01 2023] libceph: wrong peer, want (1)[::yy]:6841/3558675245, 
got (1)[::yy]:6841/392097708
[Sun Sep 17 10:24:01 2023] libceph: osd1 (1)[::yy]:6841 wrong peer at address
[Sun Sep 17 10:24:01 2023] libceph: wrong peer, want (1)[::mm]:6801/799469619, 
got (1)[::mm]:6801/261725973
[Sun Sep 17 10:24:01 2023] libceph: osd0 (1)[::mm]:6801 wrong peer at address

Would appreciate some help or insights in resolving the issue. Please let us 
know if you need any further information. Thanks.
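
A couple of data points that usually help with this class of message (the
debugfs path assumes debugfs is mounted, which it normally is by default):

  # what the cluster currently advertises for those OSDs
  ceph osd dump | grep "^osd"
  # what the kernel client still has cached in its osdmap
  cat /sys/kernel/debug/ceph/*/osdmap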
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Error EPERM: error setting 'osd_op_queue' to 'wpq': (1) Operation not permitted

2023-09-18 Thread ndandoul
Hi Josh, 

Thanks a million, your proposed solution worked. 

Best,
Nick
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Quincy 17.2.6 - Rados gateway crash -

2023-09-18 Thread Michal Strnad

Hi.

Thank you! We will try to upgrade to 17.2.6.

Michal

On 9/18/23 12:51, Berger Wolfgang wrote:

Hi Michal,
I don't see any errors on versions 17.2.6 and 18.2.0 using Veeam 12.0.0.1420.
In my setup, I am using a dedicated nginx proxy (not managed by Ceph) to reach 
the upstream rgw instances.
BR
Wolfgang

-----Original Message-----
From: Michal Strnad 
Sent: Sunday, 17 September 2023 19:43
To: Berger Wolfgang ; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Quincy 17.2.6 - Rados gateway crash -

Hi.

Has anyone encountered the same error in the Reef version? At least
Quincy (17.2.5) is still suffering from this even when using the latest
version of Veeam.

Michal Strnad



On 8/17/23 15:12, Wolfgang Berger wrote:

Hi,
I've just checked Veeam backup (build 12.0.0.1420) to reef 18.2.0.
Works great so far.
BR
Wolfgang
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph 16.2.x excessive logging, how to reduce?

2023-09-18 Thread Zakhar Kirpichenko
Hi,

Our Ceph 16.2.x cluster managed by cephadm is logging a lot of very
detailed messages; the Ceph logs alone on hosts with monitors and several OSDs
have already eaten through 50% of the endurance of the flash system drives
over a couple of years.

Cluster logging settings are default, and it seems that all daemons are
writing lots and lots of debug information to the logs, such as for
example: https://pastebin.com/ebZq8KZk (it's just a snippet, but there's
lots and lots of various information).

Is there a way to reduce the amount of logging and, for example, limit the
logging to warnings or important messages so that it doesn't include every
successful authentication attempt, compaction etc, etc, when the cluster is
healthy and operating normally?
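
In case it is useful, the knobs usually involved look like this (whether they
help depends on which files are actually growing on the system drives):

  # stop daemons from writing their own log files (cephadm logs to journald anyway)
  ceph config set global log_to_file false
  # stop the cluster log from being written to a file on the monitors
  ceph config set global mon_cluster_log_to_file false
  # or raise the level threshold for the cluster log file
  ceph config set global mon_cluster_log_file_level info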

I would very much appreciate your advice on this.

Best regards,
Zakhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io