[ceph-users] log_latency slow operation observed for submit_transact, latency = 22.644258499s

2024-03-22 Thread Torkil Svensgaard

Good morning,

Cephadm Reef 18.2.1. We recently added 4 hosts and changed a failure 
domain from host to datacenter which is the reason for the large 
misplaced percentage.


We were seeing some pretty crazy spikes in "OSD Read Latencies" and "OSD 
Write Latencies" on the dashboard. Most of the time everything is fine, 
but then for periods of 1-4 hours latencies will go to 10+ seconds for 
one or more OSDs. This also happens outside scrub hours and it is not 
the same OSDs every time. The OSDs affected are HDDs with DB/WAL on NVMe.


Log snippet:

"
...
2024-03-22T06:48:22.859+ 7fb184b52700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:22.859+ 7fb185b54700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:22.864+ 7fb169898700  1 heartbeat_map clear_timeout 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:22.864+ 7fb169898700  0 
bluestore(/var/lib/ceph/osd/ceph-112) log_latency slow operation 
observed for submit_transact, latency = 17.716707230s
2024-03-22T06:48:22.880+ 7fb1748ae700  0 
bluestore(/var/lib/ceph/osd/ceph-112) log_latency_fn slow operation 
observed for _txc_committed_kv, latency = 17.732601166s, txc = 
0x55a5bcda0f00
2024-03-22T06:48:38.077+ 7fb184b52700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:38.077+ 7fb184b52700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s

...
"

"
[root@dopey ~]# ceph -s
  cluster:
    id:     8ee2d228-ed21-4580-8bbf-0649f229e21d
    health: HEALTH_WARN
            1 failed cephadm daemon(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 1 pg backfill_toofull

  services:
    mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 3d)
    mgr: jolly.tpgixt(active, since 10d), standbys: dopey.lxajvk, lazy.xuhetq
    mds: 1/1 daemons up, 2 standby
    osd: 540 osds: 539 up (since 6m), 539 in (since 15h); 6250 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   15 pools, 10849 pgs
    objects: 546.35M objects, 1.1 PiB
    usage:   1.9 PiB used, 2.3 PiB / 4.2 PiB avail
    pgs:     1425479651/3163081036 objects misplaced (45.066%)
             6224 active+remapped+backfill_wait
             4516 active+clean
             67   active+clean+scrubbing
             25   active+remapped+backfilling
             16   active+clean+scrubbing+deep
             1    active+remapped+backfill_wait+backfill_toofull

  io:
    client:   117 MiB/s rd, 68 MiB/s wr, 274 op/s rd, 183 op/s wr
    recovery: 438 MiB/s, 192 objects/s
"

Anyone know what the issue might be? Given that it happens on and off, 
with long periods of normal low latency in between, I think it unlikely 
that it is just because the cluster is busy.


Also, how come there are only a small number of PGs doing backfill when 
we have such a large misplaced percentage? Can this be just from 
backfill reservation logjam?


Mvh.

Torkil

--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Are we logging IRC channels?

2024-03-22 Thread Alvaro Soto
Should we bring this to life again?

On Tue, Mar 19, 2024, 8:14 PM Mark Nelson  wrote:

> A long time ago Wido used to have a bot logging IRC afaik, but I think
> that's been gone for some time.
>
>
> Mark
>
>
> On 3/19/24 19:36, Alvaro Soto wrote:
> > Hi Community!!!
> > Are we logging IRC channels? I ask this because a lot of people only use
> > Slack, and the Slack we use doesn't have a subscription, so messages are
> > lost after 90 days (I believe)
> >
> > I believe it's important to keep track of the technical knowledge we see
> > each day over IRC+Slack
> > Cheers!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Bandelow, Gunnar
Hi Michael,

I think yesterday I found the culprit in my case.

After inspecting "ceph pg dump", especially the column
"last_scrub_duration", I found that every PG without proper scrubbing
was located on one of three OSDs (and all these OSDs share the same
SSD for their DB). I put them "out" and now, after backfill and
remapping, everything seems to be fine.
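
For anyone wanting to repeat that check, here is a rough sketch (it assumes
the JSON layout of "ceph pg dump pgs" in recent releases, with the PG list
under .pg_stats, and <osd-id> is a placeholder):

# show the PGs with the oldest deep-scrub timestamps and the OSDs they map to
ceph pg dump pgs --format json 2>/dev/null | \
  jq -r '.pg_stats[] | "\(.last_deep_scrub_stamp) \(.pgid) \(.acting)"' | sort | head -20

# then mark the OSD(s) those PGs have in common out, as described above
ceph osd out <osd-id>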


Only the log is still flooded with "scrub starts" and i have no clue
why these OSDs are causing the problems.
Will investigate further.


Best regards,
Gunnar

===


 Gunnar Bandelow
 Universitätsrechenzentrum (URZ)
 Universität Greifswald
 Felix-Hausdorff-Straße 18
 17489 Greifswald
 Germany


 Tel.: +49 3834 420 1450

--- Original message ---
Subject: [ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep
scrubbed for 1 month
From: "Michel Jouvin" 
To: ceph-users@ceph.io
Date: 21-03-2024 23:40






Hi,

Today we decided to upgrade from 18.2.0 to 18.2.2. No real hope of a 
direct impact (nothing in the changelog related to something similar), 
but at least all daemons were restarted, so we thought that maybe this 
would clear the problem, at least temporarily. Unfortunately that has 
not been the case. The same PGs are still stuck, despite continuous 
scrubbing/deep scrubbing activity in the cluster...

I'm happy to provide more information if somebody tells me what to
look 
at...

Cheers,

Michel

On 21/03/2024 at 14:40, Bernhard Krieger wrote:
> Hi,
>
> i have the same issues.
> Deep scrub havent finished the jobs on some PGs.
>
> Using ceph 18.2.2.
> Initial installed version was 18.0.0
>
>
> In the logs i see a lot of scrub/deep-scrub starts
>
> Mar 21 14:21:09 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 13.b deep-scrubstarts
> Mar 21 14:21:10 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 13.1a deep-scrubstarts
> Mar 21 14:21:17 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 13.1c deep-scrubstarts
> Mar 21 14:21:19 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 11.1 scrubstarts
> Mar 21 14:21:27 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 14.6 scrubstarts
> Mar 21 14:21:30 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 10.c deep-scrubstarts
> Mar 21 14:21:35 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 12.3 deep-scrubstarts
> Mar 21 14:21:41 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 6.0 scrubstarts
> Mar 21 14:21:44 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 8.5 deep-scrubstarts
> Mar 21 14:21:45 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 5.66 deep-scrubstarts
> Mar 21 14:21:49 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 5.30 deep-scrubstarts
> Mar 21 14:21:50 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 13.b deep-scrubstarts
> Mar 21 14:21:52 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 13.1a deep-scrubstarts
> Mar 21 14:21:54 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 13.1c deep-scrubstarts
> Mar 21 14:21:55 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 11.1 scrubstarts
> Mar 21 14:21:58 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 14.6 scrubstarts
> Mar 21 14:22:01 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 10.c deep-scrubstarts
> Mar 21 14:22:04 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 12.3 scrubstarts
> Mar 21 14:22:13 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 6.0 scrubstarts
> Mar 21 14:22:15 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 8.5 deep-scrubstarts
> Mar 21 14:22:20 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 5.66 deep-scrubstarts
> Mar 21 14:22:27 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 5.30 scrubstarts
> Mar 21 14:22:30 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 13.b deep-scrubstarts
> Mar 21 14:22:32 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 13.1a deep-scrubstarts
> Mar 21 14:22:33 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 13.1c deep-scrubstarts
> Mar 21 14:22:35 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 11.1 deep-scrubstarts
> Mar 21 14:22:37 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 14.6 scrubstarts
> Mar 21 14:22:38 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 10.c scrubstarts
> Mar 21 14:22:39 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 12.3 scrubstarts
> Mar 21 14:22:41 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 6.0 deep-scrubstarts
> Mar 21 14:22:43 ceph-node10 ceph-osd[3804193]: log_channel(cluster) 
> log [DBG] : 8.5 deep-scrubstarts
> Mar 21 14:22:46 ceph-node10 ceph-osd[3804193]: log_channel(c

[ceph-users] Re: log_latency slow operation observed for submit_transact, latency = 22.644258499s

2024-03-22 Thread Igor Fedotov

Hi Torkil,

Highly likely you're facing a well-known issue with RocksDB performance 
dropping after bulk data removal. The latter might occur at source OSDs 
after PG migration completes.


You might want to use DB compaction (preferably an offline one using 
ceph-kvstore-tool) to get an OSD out of this "degraded" state, or as a 
preventive measure. I'd recommend doing that for all the OSDs right now, 
and once again after rebalancing is completed. This should improve 
things, but unfortunately there is no 100% guarantee.
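
For reference, a minimal sketch of an offline compaction of a single OSD, 
assuming a cephadm deployment (osd.112 is just an example id taken from the 
log above; the path is the one cephadm mounts inside the daemon container):

# stop the OSD, compact its RocksDB offline, then start it again
ceph orch daemon stop osd.112
cephadm shell --name osd.112 -- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-112 compact
ceph orch daemon start osd.112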


Also, curious whether you have DB/WAL on fast (SSD or NVMe) drives? This 
might be crucial...



Thanks,

Igor

On 3/22/2024 9:59 AM, Torkil Svensgaard wrote:

Good morning,

Cephadm Reef 18.2.1. We recently added 4 hosts and changed a failure 
domain from host to datacenter which is the reason for the large 
misplaced percentage.


We were seeing some pretty crazy spikes in "OSD Read Latencies" and 
"OSD Write Latencies" on the dashboard. Most of the time everything is 
well but then for periods of time, 1-4 hours, latencies will go to 10+ 
seconds for one or more OSDs. This also happens outside scrub hours 
and it is not the same OSDs every time. The OSDs affected are HDD with 
DB/WAL on NVMe.


Log snippet:

"
...
2024-03-22T06:48:22.859+ 7fb184b52700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:22.859+ 7fb185b54700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:22.864+ 7fb169898700  1 heartbeat_map 
clear_timeout 'OSD::osd_op_tp thread 0x7fb169898700' had timed out 
after 15.00954s
2024-03-22T06:48:22.864+ 7fb169898700  0 
bluestore(/var/lib/ceph/osd/ceph-112) log_latency slow operation 
observed for submit_transact, latency = 17.716707230s
2024-03-22T06:48:22.880+ 7fb1748ae700  0 
bluestore(/var/lib/ceph/osd/ceph-112) log_latency_fn slow operation 
observed for _txc_committed_kv, latency = 17.732601166s, txc = 
0x55a5bcda0f00
2024-03-22T06:48:38.077+ 7fb184b52700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:38.077+ 7fb184b52700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s

...
"

"
[root@dopey ~]# ceph -s
  cluster:
    id: 8ee2d228-ed21-4580-8bbf-0649f229e21d
    health: HEALTH_WARN
    1 failed cephadm daemon(s)
    Low space hindering backfill (add storage if this doesn't 
resolve itself): 1 pg backfill_toofull


  services:
    mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 3d)
    mgr: jolly.tpgixt(active, since 10d), standbys: dopey.lxajvk, 
lazy.xuhetq

    mds: 1/1 daemons up, 2 standby
    osd: 540 osds: 539 up (since 6m), 539 in (since 15h); 6250 
remapped pgs


  data:
    volumes: 1/1 healthy
    pools:   15 pools, 10849 pgs
    objects: 546.35M objects, 1.1 PiB
    usage:   1.9 PiB used, 2.3 PiB / 4.2 PiB avail
    pgs: 1425479651/3163081036 objects misplaced (45.066%)
 6224 active+remapped+backfill_wait
 4516 active+clean
 67   active+clean+scrubbing
 25   active+remapped+backfilling
 16   active+clean+scrubbing+deep
 1    active+remapped+backfill_wait+backfill_toofull

  io:
    client:   117 MiB/s rd, 68 MiB/s wr, 274 op/s rd, 183 op/s wr
    recovery: 438 MiB/s, 192 objects/s
"

Anyone know what the issue might be? Given that is happens on and off 
with large periods of time in between with normal low latencies I 
think it unlikely that it is just because the cluster is busy.


Also, how come there's only a small amount of PGs doing backfill when 
we have such a large misplaced percentage? Can this be just from 
backfill reservation logjam?


Mvh.

Torkil


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: log_latency slow operation observed for submit_transact, latency = 22.644258499s

2024-03-22 Thread Torkil Svensgaard


On 22-03-2024 08:38, Igor Fedotov wrote:

Hi Torkil,


Hi Igor

highly likely you're facing a well known issue with RocksDB performance 
drop after bulk data removal. The latter might occur at source OSDs 
after PG migration completion.


Aha, thanks.

You might want to use DB compaction (preferably offline one using ceph- 
kvstore-tool) to get OSD out of this "degraded" state or as a preventive 
measure. I'd recommend to do that for all the OSDs right now. And once 
again after rebalancing is completed.  This should improve things but 
unfortunately no 100% guarantee.


Why is offline preferred? With offline the easiest way would be 
something like stop all OSDs one host at a time and run a loop over 
/var/lib/ceph/$id/osd.*?


Also curious if you have DB/WAL on fast (SSD or NVMe) drives? This might 
be crucial..


We do, 22 HDDs and 2 DB/WAL NVMes per host.

Thanks.

Mvh.

Torkil



Thanks,

Igor

On 3/22/2024 9:59 AM, Torkil Svensgaard wrote:

Good morning,

Cephadm Reef 18.2.1. We recently added 4 hosts and changed a failure 
domain from host to datacenter which is the reason for the large 
misplaced percentage.


We were seeing some pretty crazy spikes in "OSD Read Latencies" and 
"OSD Write Latencies" on the dashboard. Most of the time everything is 
well but then for periods of time, 1-4 hours, latencies will go to 10+ 
seconds for one or more OSDs. This also happens outside scrub hours 
and it is not the same OSDs every time. The OSDs affected are HDD with 
DB/WAL on NVMe.


Log snippet:

"
...
2024-03-22T06:48:22.859+ 7fb184b52700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:22.859+ 7fb185b54700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:22.864+ 7fb169898700  1 heartbeat_map 
clear_timeout 'OSD::osd_op_tp thread 0x7fb169898700' had timed out 
after 15.00954s
2024-03-22T06:48:22.864+ 7fb169898700  0 bluestore(/var/lib/ceph/ 
osd/ceph-112) log_latency slow operation observed for submit_transact, 
latency = 17.716707230s
2024-03-22T06:48:22.880+ 7fb1748ae700  0 bluestore(/var/lib/ceph/ 
osd/ceph-112) log_latency_fn slow operation observed for 
_txc_committed_kv, latency = 17.732601166s, txc = 0x55a5bcda0f00
2024-03-22T06:48:38.077+ 7fb184b52700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:38.077+ 7fb184b52700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s

...
"

"
[root@dopey ~]# ceph -s
  cluster:
    id: 8ee2d228-ed21-4580-8bbf-0649f229e21d
    health: HEALTH_WARN
    1 failed cephadm daemon(s)
    Low space hindering backfill (add storage if this doesn't 
resolve itself): 1 pg backfill_toofull


  services:
    mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 3d)
    mgr: jolly.tpgixt(active, since 10d), standbys: dopey.lxajvk, 
lazy.xuhetq

    mds: 1/1 daemons up, 2 standby
    osd: 540 osds: 539 up (since 6m), 539 in (since 15h); 6250 
remapped pgs


  data:
    volumes: 1/1 healthy
    pools:   15 pools, 10849 pgs
    objects: 546.35M objects, 1.1 PiB
    usage:   1.9 PiB used, 2.3 PiB / 4.2 PiB avail
    pgs: 1425479651/3163081036 objects misplaced (45.066%)
 6224 active+remapped+backfill_wait
 4516 active+clean
 67   active+clean+scrubbing
 25   active+remapped+backfilling
 16   active+clean+scrubbing+deep
 1    active+remapped+backfill_wait+backfill_toofull

  io:
    client:   117 MiB/s rd, 68 MiB/s wr, 274 op/s rd, 183 op/s wr
    recovery: 438 MiB/s, 192 objects/s
"

Anyone know what the issue might be? Given that is happens on and off 
with large periods of time in between with normal low latencies I 
think it unlikely that it is just because the cluster is busy.


Also, how come there's only a small amount of PGs doing backfill when 
we have such a large misplaced percentage? Can this be just from 
backfill reservation logjam?


Mvh.

Torkil



--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: node-exporter error

2024-03-22 Thread Eugen Block

Hi,

what does your node-exporter spec look like?

ceph orch ls node-exporter --export

If other node-exporter daemons are running in the cluster, what's the  
difference between them? Do they all have the same container image?


ceph config get mgr mgr/cephadm/container_image_node_exporter

and compare with 'docker|podman images' output.
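
Putting those checks together in one place, a sketch (substitute docker for 
podman if that is your container runtime):

ceph orch ls node-exporter --export
ceph orch ps | grep node-exporter
ceph config get mgr mgr/cephadm/container_image_node_exporter
podman images | grep -i node-exporter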

Regards,
Eugen

Quoting quag...@bol.com.br:


Hello,
 After some time, I'm adding some more disks on a new machine in  
the ceph cluster.
 However, there is one container that is not coming up: the  
"node-exporter".


 Below is an excerpt from the log that reports the error:

Mar 20 15:51:08 adafn02  
ceph-da43a27a-eee8-11eb-9c87-525400baa344-node-exporter-adafn02[736348]:  
ts=2024-03-20T18:51:08.606Z caller=node_exporter.go:117 level=info  
collector=xfs
Mar 20 15:51:08 adafn02  
ceph-da43a27a-eee8-11eb-9c87-525400baa344-node-exporter-adafn02[736348]:  
ts=2024-03-20T18:51:08.606Z caller=node_exporter.go:117 level=info  
collector=zfs
Mar 20 15:51:08 adafn02  
ceph-da43a27a-eee8-11eb-9c87-525400baa344-node-exporter-adafn02[736348]:  
ts=2024-03-20T18:51:08.606Z caller=tls_config.go:232 level=info  
msg="Listening on" address=[::]:9100
Mar 20 15:51:08 adafn02  
ceph-da43a27a-eee8-11eb-9c87-525400baa344-node-exporter-adafn02[736348]:  
ts=2024-03-20T18:51:08.606Z caller=tls_config.go:235 level=info  
msg="TLS is disabled." http2=false address=[::]:9100
Mar 20 15:51:09 adafn02 systemd[1]:  
var-lib-containers-storage-overlay-a80fe574f464677d2fc313cd0e92b12930370b64ec56477ced79e24293953e99-merged.mount:  
Succeeded.
Mar 20 15:51:09 adafn02 systemd[1]:  
ceph-da43a27a-eee8-11eb-9c87-525400baa344@node-exporter.adafn02.service:  
Main process exited, code=exited, status=137/n/a
Mar 20 15:51:10 adafn02 systemd[1]:  
ceph-da43a27a-eee8-11eb-9c87-525400baa344@node-exporter.adafn02.service:  
Failed with result 'exit-code'.


 Version is:
[root@adafn02 ~]# ceph orch ps | grep adafn
crash.adafn02adafn02 
running (26m)65s ago  38m7440k-  18.2.1  
5be31c24972a  839c3ba37349
node-exporter.adafn02adafn02  *:9100 
error65s ago   2m-- 
 
osd.62   adafn02 
running (26m)65s ago  29m54.7M 352G  18.2.1  
5be31c24972a  368d60d5ac3c
osd.83   adafn02 
running (26m)65s ago  28m56.3M 352G  18.2.1  
5be31c24972a  4f9052698265
osd.134  adafn02 
running (24m)65s ago  24m 105M 352G  18.2.1  
5be31c24972a  40fc99160112
osd.135  adafn02 
running (23m)65s ago  23m 103M 352G  18.2.1  
5be31c24972a  6f352c76f2e5



 Other containers in this machine are ok. Could anyone help me  
identify where the error is?


Thanks
Rafael.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Michel Jouvin

Hi,

As I said in my initial message, I had in mind to do exactly the same, as I 
identified in my initial analysis that all the PGs with this problem 
were sharing one OSD (but only 20 of the ~200 PGs hosted by that OSD had 
the problem). But as I don't feel I'm in an urgent situation, I was 
wondering whether collecting more information on the problem may have some 
value, and which information... If it helps, I add below the `pg dump` for 
the 17 PGs still with a stuck scrub.


I observed that the number of stuck scrubs is decreasing very slowly. In the 
last 12 hours, 1 more PG was successfully scrubbed/deep scrubbed. In case it 
was not clear in my initial message, the lists of PGs with a too old scrub 
and a too old deep scrub are the same.


Without an answer, next week I may consider doing what you did: remove 
the suspect OSD (instead of just restarting it) and see if it unblocks the 
stuck scrubs.


Best regards,

Michel

- "ceph pg dump pgs" for the 17 PGs with 
a too old scrub and deep scrub (same list) 



PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED UNFOUND  
BYTES    OMAP_BYTES*  OMAP_KEYS*  LOG    LOG_DUPS DISK_LOG  STATE 
STATE_STAMP  VERSION   REPORTED 
UP UP_PRIMARY  ACTING ACTING_PRIMARY 
LAST_SCRUB    SCRUB_STAMP  LAST_DEEP_SCRUB 
DEEP_SCRUB_STAMP SNAPTRIMQ_LEN LAST_SCRUB_DURATION 
SCRUB_SCHEDULING OBJECTS_SCRUBBED  OBJECTS_TRIMMED
29.7e3   260   0 0  0 0   
1090519040    0   0   1978   500 
1978 active+clean 2024-03-21T18:28:53.369789+    
39202'2478    83812:97136 [29,141,64,194]  29    
[29,141,64,194]  29 39202'2478  
2024-02-17T19:56:34.413412+   39202'2478 
2024-02-17T19:56:34.413412+  0 3  queued for deep scrub 
0    0
25.7cc 0   0 0  0 0    
0    0   0  0  1076 0 
active+clean 2024-03-21T18:09:48.104279+ 46253'548 
83812:89843    [29,50,173]  29 [29,50,173]  
29 39159'536 2024-02-17T18:14:54.950401+    39159'536 
2024-02-17T18:14:54.950401+  0 1  queued for deep scrub 
0    0
25.70c 0   0 0  0 0    
0    0   0  0   918 0 
active+clean 2024-03-21T18:00:57.942902+ 46253'514    
83812:95212 [29,195,185]  29   [29,195,185]  29 
39159'530  2024-02-18T03:56:17.559531+    39159'530 
2024-02-16T17:39:03.281785+  0 1  queued for deep scrub 
0    0
29.70c   249   0 0  0 0   
1044381696    0   0   1987   600 
1987 active+clean 2024-03-21T18:35:36.848189+    
39202'2587    83812:99628 [29,138,63,12]  29 
[29,138,63,12]  29 39202'2587  
2024-02-17T21:34:22.042560+   39202'2587 
2024-02-17T21:34:22.042560+  0 1  queued for deep scrub 
0    0
29.705   231   0 0  0 0    
968884224    0   0   1959   500 1959 
active+clean 2024-03-21T18:18:22.028551+    39202'2459    
83812:91258 [29,147,173,61]  29    [29,147,173,61]  
29 39202'2459  2024-02-17T16:41:40.421763+   39202'2459 
2024-02-17T16:41:40.421763+  0 1  queued for deep scrub 
0    0
29.6b9   236   0 0  0 0    
989855744    0   0   1956   500 1956 
active+clean 2024-03-21T18:11:29.912132+    39202'2456    
83812:95607 [29,199,74,16]  29 [29,199,74,16]  
29 39202'2456  2024-02-17T11:46:06.706625+   39202'2456 
2024-02-17T11:46:06.706625+  0 1  queued for deep scrub 
0    0
25.56e 0   0 0  0 0    
0    0   0  0  1158 0  
active+clean+scrubbing+deep 2024-03-22T08:09:38.840145+ 
46253'514   83812:637482 [111,29,128] 111   
[111,29,128] 111 39159'579  
2024-03-06T17:57:53.158936+    39159'579 
2024-03-06T17:57:53.158936+  0 1  queued for deep scrub 
0    0
25.56a 0   0 0  0 0    
0    0   0  0  1055 0 
active+clean 2024-03-21T18:00:57.940851+ 46253'545 
83812:93475    [29,19,211]  29 [29,19,211]  
29 46253'545 2024-03-07T11:12:45.881545+    46253'545 
2024-03-07T11:12:45.881545+  0 28  queued for deep scrub 
0    0
25.55a 0   0 

[ceph-users] Re: log_latency slow operation observed for submit_transact, latency = 22.644258499s

2024-03-22 Thread Alexander E. Patrakov
Hello Torkil,

The easiest way (in my opinion) to perform offline compaction is a bit
different than what Igor suggested. We had a prior off-list
conversation indicating that the results would be equivalent.

1. ceph config set osd osd_compact_on_start true
2. Restart the OSD that you want to compact (or the whole host at
once, if you want to compact the whole host and your failure domain
allows for that)
3. ceph config set osd osd_compact_on_start false

The OSD will restart, but will not show as "up" until the compaction
process completes. In your case, I would expect it to take up to 40
minutes.
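
Putting those steps into commands, a sketch assuming a cephadm deployment 
(osd.112 is just an example id, and 'ceph orch daemon restart' is only one 
way to do the restart in step 2):

ceph config set osd osd_compact_on_start true
ceph orch daemon restart osd.112    # or restart all OSDs on a host, per your failure domain
ceph config set osd osd_compact_on_start false    # once the compacted OSD is back up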

On Fri, Mar 22, 2024 at 3:46 PM Torkil Svensgaard  wrote:
>
>
> On 22-03-2024 08:38, Igor Fedotov wrote:
> > Hi Torkil,
>
> Hi Igor
>
> > highly likely you're facing a well known issue with RocksDB performance
> > drop after bulk data removal. The latter might occur at source OSDs
> > after PG migration completion.
>
> Aha, thanks.
>
> > You might want to use DB compaction (preferably offline one using ceph-
> > kvstore-tool) to get OSD out of this "degraded" state or as a preventive
> > measure. I'd recommend to do that for all the OSDs right now. And once
> > again after rebalancing is completed.  This should improve things but
> > unfortunately no 100% guarantee.
>
> Why is offline preferred? With offline the easiest way would be
> something like stop all OSDs one host at a time and run a loop over
> /var/lib/ceph/$id/osd.*?
>
> > Also curious if you have DB/WAL on fast (SSD or NVMe) drives? This might
> > be crucial..
>
> We do, 22 HDDs and 2 DB/WAL NVMes pr host.
>
> Thanks.
>
> Mvh.
>
> Torkil
>
> >
> > Thanks,
> >
> > Igor
> >
> > On 3/22/2024 9:59 AM, Torkil Svensgaard wrote:
> >> Good morning,
> >>
> >> Cephadm Reef 18.2.1. We recently added 4 hosts and changed a failure
> >> domain from host to datacenter which is the reason for the large
> >> misplaced percentage.
> >>
> >> We were seeing some pretty crazy spikes in "OSD Read Latencies" and
> >> "OSD Write Latencies" on the dashboard. Most of the time everything is
> >> well but then for periods of time, 1-4 hours, latencies will go to 10+
> >> seconds for one or more OSDs. This also happens outside scrub hours
> >> and it is not the same OSDs every time. The OSDs affected are HDD with
> >> DB/WAL on NVMe.
> >>
> >> Log snippet:
> >>
> >> "
> >> ...
> >> 2024-03-22T06:48:22.859+ 7fb184b52700  1 heartbeat_map is_healthy
> >> 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
> >> 2024-03-22T06:48:22.859+ 7fb185b54700  1 heartbeat_map is_healthy
> >> 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
> >> 2024-03-22T06:48:22.864+ 7fb169898700  1 heartbeat_map
> >> clear_timeout 'OSD::osd_op_tp thread 0x7fb169898700' had timed out
> >> after 15.00954s
> >> 2024-03-22T06:48:22.864+ 7fb169898700  0 bluestore(/var/lib/ceph/
> >> osd/ceph-112) log_latency slow operation observed for submit_transact,
> >> latency = 17.716707230s
> >> 2024-03-22T06:48:22.880+ 7fb1748ae700  0 bluestore(/var/lib/ceph/
> >> osd/ceph-112) log_latency_fn slow operation observed for
> >> _txc_committed_kv, latency = 17.732601166s, txc = 0x55a5bcda0f00
> >> 2024-03-22T06:48:38.077+ 7fb184b52700  1 heartbeat_map is_healthy
> >> 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
> >> 2024-03-22T06:48:38.077+ 7fb184b52700  1 heartbeat_map is_healthy
> >> 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
> >> ...
> >> "
> >>
> >> "
> >> [root@dopey ~]# ceph -s
> >>   cluster:
> >> id: 8ee2d228-ed21-4580-8bbf-0649f229e21d
> >> health: HEALTH_WARN
> >> 1 failed cephadm daemon(s)
> >> Low space hindering backfill (add storage if this doesn't
> >> resolve itself): 1 pg backfill_toofull
> >>
> >>   services:
> >> mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 3d)
> >> mgr: jolly.tpgixt(active, since 10d), standbys: dopey.lxajvk,
> >> lazy.xuhetq
> >> mds: 1/1 daemons up, 2 standby
> >> osd: 540 osds: 539 up (since 6m), 539 in (since 15h); 6250
> >> remapped pgs
> >>
> >>   data:
> >> volumes: 1/1 healthy
> >> pools:   15 pools, 10849 pgs
> >> objects: 546.35M objects, 1.1 PiB
> >> usage:   1.9 PiB used, 2.3 PiB / 4.2 PiB avail
> >> pgs: 1425479651/3163081036 objects misplaced (45.066%)
> >>  6224 active+remapped+backfill_wait
> >>  4516 active+clean
> >>  67   active+clean+scrubbing
> >>  25   active+remapped+backfilling
> >>  16   active+clean+scrubbing+deep
> >>  1active+remapped+backfill_wait+backfill_toofull
> >>
> >>   io:
> >> client:   117 MiB/s rd, 68 MiB/s wr, 274 op/s rd, 183 op/s wr
> >> recovery: 438 MiB/s, 192 objects/s
> >> "
> >>
> >> Anyone know what the issue might be? Given that is happens on and off
> >> with large periods of time in between with normal low latencies I
> >> think it unl

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Pierre Riteau
Hello Michel,

It might be worth mentioning that the next releases of Reef and Quincy
should increase the default value of osd_max_scrubs from 1 to 3. See the
Reef pull request: https://github.com/ceph/ceph/pull/55173
You could try increasing this configuration setting if you haven't already,
but note that it can impact client I/O performance.
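
For example, a minimal sketch of trying that (3 matches the upcoming default 
mentioned above; revert the same way if client I/O suffers):

ceph config get osd osd_max_scrubs      # current value, default is 1
ceph config set osd osd_max_scrubs 3    # allow more concurrent scrubs per OSD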

Also, if the delays appear to be related to a single OSD, have you checked
the health and performance of this device?

On Fri, 22 Mar 2024 at 09:29, Michel Jouvin 
wrote:

> Hi,
>
> As I said in my initial message, I'd in mind to do exactly the same as I
> identified in my initial analysis that all the PGs with this problem
> where sharing one OSD (but only 20 PGs had the problem over ~200 hosted
> by the OSD). But as I don't feel I'm in an urgent situation, I was
> wondering if collecting more information on the problem may have some
> value and which one... If it helps, I add below the `pg dump` for the 17
> PGs still with a "stucked scrub".
>
> I observed the "stucked scrubs" is lowering very slowly. In the last 12
> hours, 1 more PG was successfully scrubbed/deep scrubbed. In case it was
> not clear in my initial message, the lists of PGs with a too old scrub
> and too old deep scrub are the same.
>
> Without an answer, next week i may consider doing what you did: remove
> the suspect OSD (instead of just restarting it) and see it unblocks the
> stucked scrubs.
>
> Best regards,
>
> Michel
>
> - "ceph pg dump pgs" for the 17 PGs with
> a too old scrub and deep scrub (same list)
> 
>
> PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED UNFOUND
> BYTESOMAP_BYTES*  OMAP_KEYS*  LOGLOG_DUPS DISK_LOG  STATE
> STATE_STAMP  VERSION   REPORTED
> UP UP_PRIMARY  ACTING ACTING_PRIMARY
> LAST_SCRUBSCRUB_STAMP  LAST_DEEP_SCRUB
> DEEP_SCRUB_STAMP SNAPTRIMQ_LEN LAST_SCRUB_DURATION
> SCRUB_SCHEDULING OBJECTS_SCRUBBED  OBJECTS_TRIMMED
> 29.7e3   260   0 0  0 0
> 10905190400   0   1978   500
> 1978 active+clean 2024-03-21T18:28:53.369789+
> 39202'247883812:97136 [29,141,64,194]  29
> [29,141,64,194]  29 39202'2478
> 2024-02-17T19:56:34.413412+   39202'2478
> 2024-02-17T19:56:34.413412+  0 3  queued for deep scrub
> 00
> 25.7cc 0   0 0  0 0
> 00   0  0  1076 0
> active+clean 2024-03-21T18:09:48.104279+ 46253'548
> 83812:89843[29,50,173]  29 [29,50,173]
> 29 39159'536 2024-02-17T18:14:54.950401+39159'536
> 2024-02-17T18:14:54.950401+  0 1  queued for deep scrub
> 00
> 25.70c 0   0 0  0 0
> 00   0  0   918 0
> active+clean 2024-03-21T18:00:57.942902+ 46253'514
> 83812:95212 [29,195,185]  29   [29,195,185]  29
> 39159'530  2024-02-18T03:56:17.559531+39159'530
> 2024-02-16T17:39:03.281785+  0 1  queued for deep scrub
> 00
> 29.70c   249   0 0  0 0
> 10443816960   0   1987   600
> 1987 active+clean 2024-03-21T18:35:36.848189+
> 39202'258783812:99628 [29,138,63,12]  29
> [29,138,63,12]  29 39202'2587
> 2024-02-17T21:34:22.042560+   39202'2587
> 2024-02-17T21:34:22.042560+  0 1  queued for deep scrub
> 00
> 29.705   231   0 0  0 0
> 9688842240   0   1959   500 1959
> active+clean 2024-03-21T18:18:22.028551+39202'2459
> 83812:91258 [29,147,173,61]  29[29,147,173,61]
> 29 39202'2459  2024-02-17T16:41:40.421763+   39202'2459
> 2024-02-17T16:41:40.421763+  0 1  queued for deep scrub
> 00
> 29.6b9   236   0 0  0 0
> 9898557440   0   1956   500 1956
> active+clean 2024-03-21T18:11:29.912132+39202'2456
> 83812:95607 [29,199,74,16]  29 [29,199,74,16]
> 29 39202'2456  2024-02-17T11:46:06.706625+   39202'2456
> 2024-02-17T11:46:06.706625+  0 1  queued for deep scrub
> 00
> 25.56e 0   0 0  0 0
> 00   0  0  1158 0
> active+clean+scrubbing+deep 2024-03-22T08:09:38.840145+
> 46253'514   83812:637482 [111,29,128] 111
> [111,29,128] 111 39159'579
> 2024-03-06T17:57:53.158936+39159'579
> 2024-03-06T17:57:53.158936+  0 1  queued for deep scrub
> 00
> 25.56a 0   

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Michel Jouvin

Pierre,

Yes, as mentioned in my initial email, I checked the OSD state and found 
nothing wrong either in the OSD logs or in the system logs (SMART errors).


Thanks for the advice on increasing osd_max_scrubs, I may try it, but I 
doubt it is a contention problem because it really only affects a fixed 
set of PGs (no new PGs have a stuck scrub) and there is 
significant scrubbing activity going on continuously (~10K PGs in the 
cluster).


Again, it is not a problem for me to try kicking out the suspect OSD and 
see if it fixes the issue. But this cluster is pretty simple/low in terms 
of activity, and I see nothing that may explain why we have this 
situation on a pretty new cluster (9 months old, created in Quincy) and 
not on our 2 other production clusters, which are much more heavily used, 
one of them being the backend storage of a significant OpenStack cloud, a 
cluster created 10 years ago with Infernalis and upgraded since then, a 
better candidate for this kind of problem! So I'm happy to contribute to 
troubleshooting a potential issue in Reef if somebody finds it useful 
and can help. Else I'll try the approach that worked for Gunnar.


Best regards,

Michel

On 22/03/2024 at 09:59, Pierre Riteau wrote:

Hello Michel,

It might be worth mentioning that the next releases of Reef and Quincy 
should increase the default value of osd_max_scrubs from 1 to 3. See 
the Reef pull request: https://github.com/ceph/ceph/pull/55173
You could try increasing this configuration setting if you 
haven't already, but note that it can impact client I/O performance.


Also, if the delays appear to be related to a single OSD, have you 
checked the health and performance of this device?


On Fri, 22 Mar 2024 at 09:29, Michel Jouvin 
 wrote:


Hi,

As I said in my initial message, I'd in mind to do exactly the
same as I
identified in my initial analysis that all the PGs with this problem
where sharing one OSD (but only 20 PGs had the problem over ~200
hosted
by the OSD). But as I don't feel I'm in an urgent situation, I was
wondering if collecting more information on the problem may have some
value and which one... If it helps, I add below the `pg dump` for
the 17
PGs still with a "stucked scrub".

I observed the "stucked scrubs" is lowering very slowly. In the
last 12
hours, 1 more PG was successfully scrubbed/deep scrubbed. In case
it was
not clear in my initial message, the lists of PGs with a too old
scrub
and too old deep scrub are the same.

Without an answer, next week i may consider doing what you did:
remove
the suspect OSD (instead of just restarting it) and see it
unblocks the
stucked scrubs.

Best regards,

Michel

- "ceph pg dump pgs" for the 17
PGs with
a too old scrub and deep scrub (same list)


PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED UNFOUND
BYTES    OMAP_BYTES*  OMAP_KEYS*  LOG    LOG_DUPS DISK_LOG  STATE
STATE_STAMP  VERSION   REPORTED
UP UP_PRIMARY  ACTING ACTING_PRIMARY
LAST_SCRUB    SCRUB_STAMP LAST_DEEP_SCRUB
DEEP_SCRUB_STAMP SNAPTRIMQ_LEN LAST_SCRUB_DURATION
SCRUB_SCHEDULING OBJECTS_SCRUBBED  OBJECTS_TRIMMED
29.7e3   260   0 0  0 0
1090519040    0   0   1978   500
1978 active+clean 2024-03-21T18:28:53.369789+
39202'2478    83812:97136 [29,141,64,194]  29
[29,141,64,194]  29 39202'2478
2024-02-17T19:56:34.413412+   39202'2478
2024-02-17T19:56:34.413412+  0 3  queued for deep
scrub
0    0
25.7cc 0   0 0  0 0
0    0   0  0  1076 0
active+clean 2024-03-21T18:09:48.104279+ 46253'548
83812:89843    [29,50,173]  29 [29,50,173]
29 39159'536 2024-02-17T18:14:54.950401+ 39159'536
2024-02-17T18:14:54.950401+  0 1  queued for deep
scrub
0    0
25.70c 0   0 0  0 0
0    0   0  0   918 0
active+clean 2024-03-21T18:00:57.942902+ 46253'514
83812:95212 [29,195,185]  29 [29,195,185]  29
39159'530  2024-02-18T03:56:17.559531+    39159'530
2024-02-16T17:39:03.281785+  0 1  queued for deep
scrub
0    0
29.70c   249   0 0  0 0
1044381696    0   0   1987   600
1987 active+clean 2024-03-21T18:35:36.848189+
39202'2587    83812:99628 [29,138,63,12]  29
[29,138,63,12]  29 39202'2587
2024-02-17T21:34:22.042560+   39202'2587
202

[ceph-users] Ceph fs understand usage

2024-03-22 Thread Marcus



Hi all,
I have set up a test cluster with 3 servers.
Everything has default values, with a replication
of 3.

I have created one volume called gds-common,
and the data pool has been configured with compression lz4
and compression_mode aggressive.
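
For reference, a sketch of how that pool-level compression is typically set 
(assuming the data pool is named gds-common_data, as in the output further 
down):

ceph osd pool set gds-common_data compression_algorithm lz4
ceph osd pool set gds-common_data compression_mode aggressive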

I have copied 71TB of data to this volume but I cannot
get my head around the usage information on the cluster.
Most of this data consists of quite small files containing plain text,
so I expect the compression rate to be quite good.

With both the data storage where I copy from and the ceph fs
mounted a df -h gives:
urd-gds-031:/gds-common   163T   71T   92T  
44% /gds-common
10.10.100.0:6789,10.10.100.1:6789,10.10.100.2:6789:/   92T   68T   25T  
74% /ceph-gds-common


Looking at this, the compression rate does not seem to be that good,
or is the used column showing an uncompressed value?

Using ceph, the command "ceph df detail" shows:
--- RAW STORAGE ---
CLASS SIZE   AVAIL USED  RAW USED  %RAW USED
hdd262 TiB  94 TiB  168 TiB   168 TiB  64.10
TOTAL  262 TiB  94 TiB  168 TiB   168 TiB  64.10

--- POOLS ---
POOL ID   PGS   STORED   (DATA)   (OMAP)  OBJECTS 
USED   (DATA)  (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  
DIRTY  USED COMPR  UNDER COMPR
.mgr  1 1   24 MiB   24 MiB  0 B8   73 
MiB   73 MiB 0 B  0 25 TiBN/A  N/A
N/A 0 B  0 B
gds-common_data   2  1024   67 TiB   67 TiB  0 B   23.31M  167 
TiB  167 TiB 0 B  69.43 25 TiBN/A  N/A
N/A  35 TiB   70 TiB
gds-common_metadata   332  4.0 GiB  251 MiB  3.8 GiB  680.88k   12 
GiB  753 MiB  11 GiB   0.02 25 TiBN/A  N/A
N/A 0 B  0 B
.rgw.root 432  1.4 KiB  1.4 KiB  0 B4   48 
KiB   48 KiB 0 B  0 25 TiBN/A  N/A
N/A 0 B  0 B
default.rgw.log   532182 B182 B  0 B2   24 
KiB   24 KiB 0 B  0 25 TiBN/A  N/A
N/A 0 B  0 B
default.rgw.control   632  0 B  0 B  0 B7  
0 B  0 B 0 B  0 25 TiBN/A  N/A
N/A 0 B  0 B
default.rgw.meta  732  0 B  0 B  0 B0  
0 B  0 B 0 B  0 25 TiBN/A  N/A
N/A 0 B  0 B


From my understanding, the raw storage used contains all 3 copies,
so this means 56TB per copy and gives a compression of about 20%, if
this is a compressed value?
Looking at the pool gds-common_data, the STORED value of 67TB is an
uncompressed value and a value per copy, right?
The USED value from gds-common_data is the raw usage of all 3 copies,
right?
The %RAW USED value makes sense (64.10), but the gds-common_data %USED
differs (69.43) and I cannot figure out what this value relates to?
UNDER COMPR is the amount of data that ceph has recognized as eligible
for compression (70TB), so it is about all the data.
I did not understand the value USED COMPR (35TB); does this specify how
much it has been compressed, i.e. 70TB has been compressed to 35TB?
But which values are reported as compressed and which values show the
raw uncompressed values?
Are all values uncompressed values, and the only place I see compression
is "USED COMPR" and "UNDER COMPR"?
But when do I run out of storage in my cluster then, and what value
should I keep my eyes on if %USED is calculated on uncompressed data?
Does this mean that I have more storage available than shown by %USED?
Does df -h on a mount show the uncompressed used value?

Then we have mon_osd_full_ratio: does this mean that the first OSD
that reaches .95 full (the default) makes the system stop client writes
and so on?
But does mon_osd_full_ratio always reach its limit before
%RAW USED reaches 100% or pool %USED reaches 100%, or what happens
if one of the used values reaches 100% before mon_osd_full_ratio?
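
As an aside, a minimal sketch of checking the currently configured ratios 
(these are cluster-wide settings):

ceph osd dump | grep ratio    # shows full_ratio, backfillfull_ratio and nearfull_ratio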

I am sorry for all the questions but even after reading the documentation
I do not seem to be able to figure this out.

All help is appreciated.
Many thanks in advance!

Best regards
Marcus


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Frédéric Nass

Hello Michel,

Pierre also suggested checking the performance of this OSD's device(s) which 
can be done by running a ceph tell osd.x bench.

One thing I can think of is how the scrubbing speed of this very OSD could be 
influenced by mclock scheduling, for example if the max IOPS capacity calculated 
by this OSD during its initialization is significantly lower than that of the 
other OSDs.

What I would do is check (from this OSD's log) the calculated value for max 
IOPS capacity and compare it to the other OSDs'. If needed, force a recalculation 
by setting 'ceph config set osd.x osd_mclock_force_run_benchmark_on_init true' and 
restarting this OSD.
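
A rough sketch of those checks (osd.29 is just an example id taken from the pg 
dump earlier in this thread; adapt to the suspect OSD):

ceph config show osd.29 osd_mclock_max_capacity_iops_hdd   # capacity the mclock scheduler is using
ceph config dump | grep osd_mclock_max_capacity_iops       # per-OSD values stored by the init benchmark
ceph tell osd.29 bench                                     # re-measure raw device performance
ceph config set osd.29 osd_mclock_force_run_benchmark_on_init true   # then restart osd.29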

Also I would:

- compare the running OSD's mclock values (cephadm shell ceph daemon osd.x config 
show | grep mclock) to the other OSDs'.
- compare ceph tell osd.x bench to the other OSDs' benchmarks.
- compare the rotational status of this OSD's db and data devices to the other 
OSDs', to make sure things are in order.

Bests,
Frédéric.

PS: If mclock is the culprit here, then setting osd_op_queue back to wpq for 
this one OSD would probably reveal it. Not sure about the implications of 
having a single OSD running a different scheduler in the cluster though.


- On 22 Mar 24, at 10:11, Michel Jouvin michel.jou...@ijclab.in2p3.fr 
wrote:

> Pierre,
> 
> Yes, as mentioned in my initial email, I checked the OSD state and found
> nothing wrong either in the OSD logs or in the system logs (SMART errors).
> 
> Thanks for the advice of increasing osd_max_scrubs, I may try it, but I
> doubt it is a contention problem because it really only affects a fixed
> set of PGs (no new PGS have a "stucked scrub") and there is a
> significant scrubbing activity going on continuously (~10K PGs in the
> cluster).
> 
> Again, it is not a problem for me to try to kick out the suspect OSD and
> see it fixes the issue but as this cluster is pretty simple/low in terms
> of activity and I see nothing that may explain why we have this
> situation on a pretty new cluster (9 months, created in Quincy) and not
> on our 2 other production clusters, much more used, one of them being
> the backend storage of a significant OpenStack clouds, a cluster created
> 10 years ago with Infernetis and upgraded since then, a better candidate
> for this kind of problems! So, I'm happy to contribute to
> troubleshooting a potential issue in Reef if somebody finds it useful
> and can help. Else I'll try the approach that worked for Gunnar.
> 
> Best regards,
> 
> Michel
> 
> Le 22/03/2024 à 09:59, Pierre Riteau a écrit :
>> Hello Michel,
>>
>> It might be worth mentioning that the next releases of Reef and Quincy
>> should increase the default value of osd_max_scrubs from 1 to 3. See
>> the Reef pull request: https://github.com/ceph/ceph/pull/55173
>> You could try increasing this configuration setting if you
>> haven't already, but note that it can impact client I/O performance.
>>
>> Also, if the delays appear to be related to a single OSD, have you
>> checked the health and performance of this device?
>>
>> On Fri, 22 Mar 2024 at 09:29, Michel Jouvin
>>  wrote:
>>
>> Hi,
>>
>> As I said in my initial message, I'd in mind to do exactly the
>> same as I
>> identified in my initial analysis that all the PGs with this problem
>> where sharing one OSD (but only 20 PGs had the problem over ~200
>> hosted
>> by the OSD). But as I don't feel I'm in an urgent situation, I was
>> wondering if collecting more information on the problem may have some
>> value and which one... If it helps, I add below the `pg dump` for
>> the 17
>> PGs still with a "stucked scrub".
>>
>> I observed the "stucked scrubs" is lowering very slowly. In the
>> last 12
>> hours, 1 more PG was successfully scrubbed/deep scrubbed. In case
>> it was
>> not clear in my initial message, the lists of PGs with a too old
>> scrub
>> and too old deep scrub are the same.
>>
>> Without an answer, next week i may consider doing what you did:
>> remove
>> the suspect OSD (instead of just restarting it) and see it
>> unblocks the
>> stucked scrubs.
>>
>> Best regards,
>>
>> Michel
>>
>> - "ceph pg dump pgs" for the 17
>> PGs with
>> a too old scrub and deep scrub (same list)
>> 
>>
>> PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED UNFOUND
>> BYTES    OMAP_BYTES*  OMAP_KEYS*  LOG    LOG_DUPS DISK_LOG  STATE
>> STATE_STAMP  VERSION   REPORTED
>> UP UP_PRIMARY  ACTING ACTING_PRIMARY
>> LAST_SCRUB    SCRUB_STAMP LAST_DEEP_SCRUB
>> DEEP_SCRUB_STAMP SNAPTRIMQ_LEN LAST_SCRUB_DURATION
>> SCRUB_SCHEDULING OBJECTS_SCRUBBED  OBJECTS_TRIMMED
>> 29.7e3   260   0 0  0 0
>> 1090519040    0   0   1978   500
>> 1978 acti

[ceph-users] High OSD commit_latency after kernel upgrade

2024-03-22 Thread Özkan Göksu
Hello!

After upgrading "5.15.0-84-generic" to "5.15.0-100-generic" (Ubuntu 22.04.2
LTS) , commit latency started acting weird with "CT4000MX500SSD" drives.

osd  commit_latency(ms)  apply_latency(ms)
 36 867867
 373045   3045
 38  15 15
 39  18 18
 421409   1409
 431224   1224

I downgraded the kernel but the result did not change.
I have a similar build that didn't get upgraded, and it is just fine.
While I was digging I noticed a difference.

This is the high-latency cluster, and as you can see, "DISC-GRAN=0B",
"DISC-MAX=0B":
root@sd-01:~# lsblk -D
NAME   DISC-ALN DISC-GRAN DISC-MAX
DISC-ZERO
sdc   00B   0B
0
├─ceph--76b7d255--2a01--4bd4--8d3e--880190181183-osd--block--201d5050--db0c--41b4--85c4--6416ee989d6c
│ 00B   0B
0
└─ceph--76b7d255--2a01--4bd4--8d3e--880190181183-osd--block--5a376133--47de--4e29--9b75--2314665c2862

root@sd-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
/sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/scsi_disk/0:0:0:0/provisioning_mode:full

--

This is the low-latency cluster, and as you can see, "DISC-GRAN=4K",
"DISC-MAX=2G":
root@ud-01:~# lsblk -D
NAME  DISC-ALN
DISC-GRAN DISC-MAX DISC-ZERO
sdc  0
   4K   2G 0
├─ceph--7496095f--18c7--41fd--90f2--d9b3e382bc8e-osd--block--ec86a029--23f7--4328--9600--a24a290e3003
│0
   4K   2G 0
└─ceph--7496095f--18c7--41fd--90f2--d9b3e382bc8e-osd--block--5b69b748--d899--4f55--afc3--2ea3c8a05ca1

root@ud-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
/sys/devices/pci:00/:00:11.4/ata3/host2/target2:0:0/2:0:0:0/scsi_disk/2:0:0:0/provisioning_mode:writesame_16

I think the problem is related to provisioning_mode but I really don't
understand the reason.
I booted with a live ISO and the drive was still "provisioning_mode:full",
so this is not related to my OS at all.

Something changed with the upgrade, and I think during the boot sequence the
negotiation between the LSI controller, the drives and the kernel started to
assign "provisioning_mode:full", but I'm not sure.

What should I do ?

Best regards.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Michel Jouvin

Hi Frédéric,

I think you raise the right point, sorry if I misunderstood Pierre's 
suggestion to look at OSD performance. Just before reading your email, 
I was implementing Pierre's suggestion for osd_max_scrubs and I saw the 
osd_mclock_max_capacity_iops_hdd for a few OSDs (I guess those with a 
value different from the default). For the suspect OSD, the value is 
very low, 0.145327, and I suspect it is the cause of the problem. A few 
others have a value of ~5 which I also find very low (all OSDs are using 
the same recent HW/HDD).


Thanks for this information. I'll follow your suggestions to rerun the 
benchmark and report whether it improves the situation.


Best regards,

Michel

On 22/03/2024 at 12:18, Frédéric Nass wrote:

Hello Michel,

Pierre also suggested checking the performance of this OSD's device(s) which 
can be done by running a ceph tell osd.x bench.

One think I can think of is how the scrubbing speed of this very OSD could be 
influenced by mclock sheduling, would the max iops capacity calculated by this 
OSD during its initialization be significantly lower than other OSDs's.

What I would do is check (from this OSD's log) the calculated value for max 
iops capacity and compare it to other OSDs. Eventually force a recalculation by 
setting 'ceph config set osd.x osd_mclock_force_run_benchmark_on_init true' and 
restart this OSD.

Also I would:

- compare running OSD's mclock values (cephadm shell ceph daemon osd.x config 
show | grep mclock) to other OSDs's.
- compare ceph tell osd.x bench to other OSDs's benchmarks.
- compare the rotational status of this OSD's db and data devices to other 
OSDs, to make sure things are in order.

Bests,
Frédéric.

PS: If mclock is the culprit here, then setting osd_op_queue back to mpq for 
this only OSD would probably reveal it. Not sure about the implication of 
having a signel OSD running a different scheduler in the cluster though.


- Le 22 Mar 24, à 10:11, Michel Jouvin michel.jou...@ijclab.in2p3.fr a 
écrit :


Pierre,

Yes, as mentioned in my initial email, I checked the OSD state and found
nothing wrong either in the OSD logs or in the system logs (SMART errors).

Thanks for the advice of increasing osd_max_scrubs, I may try it, but I
doubt it is a contention problem because it really only affects a fixed
set of PGs (no new PGS have a "stucked scrub") and there is a
significant scrubbing activity going on continuously (~10K PGs in the
cluster).

Again, it is not a problem for me to try to kick out the suspect OSD and
see it fixes the issue but as this cluster is pretty simple/low in terms
of activity and I see nothing that may explain why we have this
situation on a pretty new cluster (9 months, created in Quincy) and not
on our 2 other production clusters, much more used, one of them being
the backend storage of a significant OpenStack clouds, a cluster created
10 years ago with Infernetis and upgraded since then, a better candidate
for this kind of problems! So, I'm happy to contribute to
troubleshooting a potential issue in Reef if somebody finds it useful
and can help. Else I'll try the approach that worked for Gunnar.

Best regards,

Michel

Le 22/03/2024 à 09:59, Pierre Riteau a écrit :

Hello Michel,

It might be worth mentioning that the next releases of Reef and Quincy
should increase the default value of osd_max_scrubs from 1 to 3. See
the Reef pull request: https://github.com/ceph/ceph/pull/55173
You could try increasing this configuration setting if you
haven't already, but note that it can impact client I/O performance.

Also, if the delays appear to be related to a single OSD, have you
checked the health and performance of this device?

On Fri, 22 Mar 2024 at 09:29, Michel Jouvin
 wrote:

 Hi,

 As I said in my initial message, I'd in mind to do exactly the
 same as I
 identified in my initial analysis that all the PGs with this problem
 where sharing one OSD (but only 20 PGs had the problem over ~200
 hosted
 by the OSD). But as I don't feel I'm in an urgent situation, I was
 wondering if collecting more information on the problem may have some
 value and which one... If it helps, I add below the `pg dump` for
 the 17
 PGs still with a "stucked scrub".

 I observed the "stucked scrubs" is lowering very slowly. In the
 last 12
 hours, 1 more PG was successfully scrubbed/deep scrubbed. In case
 it was
 not clear in my initial message, the lists of PGs with a too old
 scrub
 and too old deep scrub are the same.

 Without an answer, next week i may consider doing what you did:
 remove
 the suspect OSD (instead of just restarting it) and see it
 unblocks the
 stucked scrubs.

 Best regards,

 Michel

 - "ceph pg dump pgs" for the 17
 PGs with
 a too old scrub and deep scrub (same list)
 

 PG_STAT  OBJECTS  MISSING_ON_P

[ceph-users] Re: High OSD commit_latency after kernel upgrade

2024-03-22 Thread Anthony D'Atri
https://askubuntu.com/questions/1454997/how-to-stop-sys-from-changing-usb-ssd-provisioning-mode-from-unmap-to-full-in-ub
How to stop sys from changing USB SSD provisioning_mode from unmap to full in 
Ubuntu 22.04?
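
If the udev-rule approach from that page applies here, a rough sketch would be 
(the rule file name and match conditions are assumptions; test on one host 
before rolling it out):

# /etc/udev/rules.d/99-provisioning-unmap.rules
ACTION=="add|change", SUBSYSTEM=="scsi_disk", ATTR{provisioning_mode}=="full", ATTR{provisioning_mode}="unmap"

# reload and re-trigger udev, then re-check the sysfs attributes
udevadm control --reload-rules && udevadm trigger
find /sys/ -name provisioning_mode -exec grep -H . {} + | sort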


> On Mar 22, 2024, at 09:36, Özkan Göksu  wrote:
> 
> Hello!
> 
> After upgrading "5.15.0-84-generic" to "5.15.0-100-generic" (Ubuntu 22.04.2
> LTS) , commit latency started acting weird with "CT4000MX500SSD" drives.
> 
> osd  commit_latency(ms)  apply_latency(ms)
> 36 867867
> 373045   3045
> 38  15 15
> 39  18 18
> 421409   1409
> 431224   1224
> 
> I downgraded the kernel but the result did not change.
> I have a similar build and it didn't get upgraded and it is just fine.
> While I was digging I realised a difference.
> 
> This is high latency cluster and as you can see the "DISC-GRAN=0B",
> "DISC-MAX=0B"
> root@sd-01:~# lsblk -D
> NAME   DISC-ALN DISC-GRAN DISC-MAX
> DISC-ZERO
> sdc   00B   0B
>0
> ├─ceph--76b7d255--2a01--4bd4--8d3e--880190181183-osd--block--201d5050--db0c--41b4--85c4--6416ee989d6c
> │ 00B   0B
>0
> └─ceph--76b7d255--2a01--4bd4--8d3e--880190181183-osd--block--5a376133--47de--4e29--9b75--2314665c2862
> 
> root@sd-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
> /sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/scsi_disk/0:0:0:0/provisioning_mode:full
> 
> --
> 
> This is low latency cluster and as you can see the "DISC-GRAN=4K",
> "DISC-MAX=2G"
> root@ud-01:~# lsblk -D
> NAME  DISC-ALN
> DISC-GRAN DISC-MAX DISC-ZERO
> sdc  0
>   4K   2G 0
> ├─ceph--7496095f--18c7--41fd--90f2--d9b3e382bc8e-osd--block--ec86a029--23f7--4328--9600--a24a290e3003
> │0
>   4K   2G 0
> └─ceph--7496095f--18c7--41fd--90f2--d9b3e382bc8e-osd--block--5b69b748--d899--4f55--afc3--2ea3c8a05ca1
> 
> root@ud-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
> /sys/devices/pci:00/:00:11.4/ata3/host2/target2:0:0/2:0:0:0/scsi_disk/2:0:0:0/provisioning_mode:writesame_16
> 
> I think the problem is related to provisioning_mode but I really did not
> understand the reason.
> I boot with a live iso and still the drive was "provisioning_mode:full" so
> it means this is not related to my OS at all.
> 
> With the upgrade something changed and I think during boot sequence
> negotiation between LSI controller, drives and kernel started to assign
> "provisioning_mode:full" but I'm not sure.
> 
> What should I do ?
> 
> Best regards.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: High OSD commit_latency after kernel upgrade

2024-03-22 Thread Özkan Göksu
Hello Anthony, thank you for the answer.

While researching I also found this type of issue, but the thing I don't
understand is that in the same server the OS drives ("SAMSUNG MZ7WD480") are
all fine.

root@sd-01:~# lsblk -D
NAME        DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda                0      512B       2G         0
├─sda1             0      512B       2G         0
├─sda2             0      512B       2G         0
└─sda3             0      512B       2G         0
  └─md0            0      512B       2G         0
    └─md0p1        0      512B       2G         0
sdb                0      512B       2G         0
├─sdb1             0      512B       2G         0
├─sdb2             0      512B       2G         0
└─sdb3             0      512B       2G         0
  └─md0            0      512B       2G         0
    └─md0p1        0      512B       2G         0

root@sd-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
/sys/devices/pci:00/:00:11.4/ata1/host1/target1:0:0/1:0:0:0/scsi_disk/1:0:0:0/provisioning_mode:writesame_16
/sys/devices/pci:00/:00:11.4/ata2/host2/target2:0:0/2:0:0:0/scsi_disk/2:0:0:0/provisioning_mode:writesame_16
/sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/scsi_disk/0:0:0:0/provisioning_mode:full
/sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:1/end_device-0:1/target0:0:1/0:0:1:0/scsi_disk/0:0:1:0/provisioning_mode:full

root@sd-01:~# disklist
HCTL     NAME        SIZE   REV  TRAN  WWN                SERIAL       MODEL
1:0:0:0  /dev/sda  447.1G  203Q  sata  0x5002538500231d05 S1G1NYAF923  SAMSUNG MZ7WD4
2:0:0:0  /dev/sdb  447.1G  203Q  sata  0x5002538500231a41 S1G1NYAF922  SAMSUNG MZ7WD4
0:0:0:0  /dev/sdc    3.6T  046   sas   0x500a0751e6bd969b 2312E6BD969  CT4000MX500SSD
0:0:1:0  /dev/sdd    3.6T  046   sas   0x500a0751e6bd97ee 2312E6BD97E  CT4000MX500SSD
0:0:2:0  /dev/sde    3.6T  046   sas   0x500a0751e6bd9805 2312E6BD980  CT4000MX500SSD
0:0:3:0  /dev/sdf    3.6T  046   sas   0x500a0751e6bd9681 2312E6BD968  CT4000MX500SSD
0:0:4:0  /dev/sdg    3.6T  045   sas   0x500a0751e6b5d30a 2309E6B5D30  CT4000MX500SSD
0:0:5:0  /dev/sdh    3.6T  046   sas   0x500a0751e6bd967e 2312E6BD967  CT4000MX500SSD
0:0:6:0  /dev/sdi    3.6T  046   sas   0x500a0751e6bd97e4 2312E6BD97E  CT4000MX500SSD
0:0:7:0  /dev/sdj    3.6T  046   sas   0x500a0751e6bd96a0 2312E6BD96A  CT4000MX500SSD

So my question is: why does it only happen to the CT4000MX500SSD drives, why
did it just start now, and why don't I have it on other servers?
Maybe it is related to the firmware version (M3CR046 vs M3CR045).
I checked the Crucial website and "M3CR046" does not actually exist:
https://www.crucial.com/support/ssd-support/mx500-support
In this forum people recommend upgrading to "M3CR046":
https://forums.unraid.net/topic/134954-warning-crucial-mx500-ssds-world-of-pain-stay-away-from-these/
But in my ud cluster all the drives are on "M3CR045" and have lower
latency. I'm really confused.


Instead of writing udev rules only for the CT4000MX500SSD, is there any
recommended udev rule for Ceph and all types of SATA drives?
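
For testing, the mode can also be flipped at runtime before persisting anything
via udev. A minimal sketch, with the SCSI address and device name taken from the
listings above (the rescan step is an assumption, in case the queue limits are
not refreshed automatically):

grep -H . /sys/class/scsi_disk/*/provisioning_mode
echo unmap > /sys/class/scsi_disk/0:0:0:0/provisioning_mode   # 0:0:0:0 = sdc per the disklist output
echo 1 > /sys/class/scsi_device/0:0:0:0/device/rescan         # may be needed to refresh queue limits
cat /sys/block/sdc/queue/discard_granularity /sys/block/sdc/queue/discard_max_bytes
lsblk -D /dev/sdc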



Anthony D'Atri , 22 Mar 2024 Cum, 17:00 tarihinde şunu
yazdı:

> [image: apple-touch-i...@2.png]
>
> How to stop sys from changing USB SSD provisioning_mode from unmap to full
> in Ubuntu 22.04?
> 
> askubuntu.com
> 
>
> 
> ?
>
>
> On Mar 22, 2024, at 09:36, Özkan Göksu  wrote:
>
> Hello!
>
> After upgrading "5.15.0-84-generic" to "5.15.0-100-generic" (Ubuntu 22.04.2
> LTS) , commit latency started acting weird with "CT4000MX500SSD" drives.
>
> osd  commit_latency(ms)  apply_latency(ms)
> 36 867867
> 373045   3045
> 38  15 15
> 39  18 18
> 421409   1409
> 431224   1224
>
> I downgraded the kernel but the result did not change.
> I have a similar build and it didn't get upgraded and it is just fine.
> While I was digging I realised a difference.
>
> This is high latency cluster and as you can see the "DISC-GRAN=0B",
> "DISC-MAX=0B"
> root@sd-01:~# lsblk -D
> NAME  

[ceph-users] Re: High OSD commit_latency after kernel upgrade

2024-03-22 Thread Özkan Göksu
Hello again.

In the Ceph hardware recommendations I found this:

https://docs.ceph.com/en/quincy/start/hardware-recommendations/

WRITE CACHES
Enterprise SSDs and HDDs normally include power loss protection features
which ensure data durability when power is lost while operating, and use
multi-level caches to speed up direct or synchronous writes. These devices
can be toggled between two caching modes – a volatile cache flushed to
persistent media with fsync, or a non-volatile cache written synchronously.
These two modes are selected by either “enabling” or “disabling” the write
(volatile) cache. When the volatile cache is enabled, Linux uses a device
in “write back” mode, and when disabled, it uses “write through”.
The default configuration (usually: caching is enabled) may not be optimal,
and OSD performance may be dramatically increased in terms of increased
IOPS and decreased commit latency by disabling this write cache.
Users are therefore encouraged to benchmark their devices with fio as
described earlier and persist the optimal cache configuration for their
devices.
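
Before committing to udev rules, the cache can be checked and toggled at
runtime. A sketch, where the SCSI address and device names are only examples
from the listings above, and hdparm applies to the SATA devices only:

cat /sys/class/scsi_disk/*/cache_type
echo "write through" > /sys/class/scsi_disk/0:0:0:0/cache_type   # runtime only, reverts on reboot
hdparm -W /dev/sda     # query the drive's volatile write cache
hdparm -W 0 /dev/sda   # disable it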


root@sd-02:~# cat /sys/class/scsi_disk/*/cache*
write back
write back
write back
write back
write back
write back
write back
write back
write back
write back

What do you think about these new udev rules?

root@sd-02:~# cat /etc/udev/rules.d/98-ceph-provisioning-mode.rules
ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{provisioning_mode}:="unmap"

root@sd-02:~# cat /etc/udev/rules.d/99-ceph-write-through.rules
ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"


Özkan Göksu , 22 Mar 2024 Cum, 17:42 tarihinde şunu
yazdı:

> Hello Anthony, thank you for the answer.
>
> While researching I also found out this type of issues but the thing I did
> not understand is in the same server the OS drives "SAMSUNG MZ7WD480" is
> all good.
>
> root@sd-01:~# lsblk -D
> NAME   DISC-ALN DISC-GRAN DISC-MAX
> DISC-ZERO
> sda   0  512B   2G
> 0
> ├─sda10  512B   2G
> 0
> ├─sda20  512B   2G
> 0
> └─sda30  512B   2G
> 0
>   └─md0   0  512B   2G
> 0
> └─md0p1   0  512B   2G
> 0
> sdb   0  512B   2G
> 0
> ├─sdb10  512B   2G
> 0
> ├─sdb20  512B   2G
> 0
> └─sdb30  512B   2G
> 0
>   └─md0   0  512B   2G
> 0
> └─md0p1   0  512B   2G
> 0
>
> root@sd-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + |
> sort
>
> /sys/devices/pci:00/:00:11.4/ata1/host1/target1:0:0/1:0:0:0/scsi_disk/1:0:0:0/provisioning_mode:writesame_16
>
> /sys/devices/pci:00/:00:11.4/ata2/host2/target2:0:0/2:0:0:0/scsi_disk/2:0:0:0/provisioning_mode:writesame_16
>
> /sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/scsi_disk/0:0:0:0/provisioning_mode:full
>
> /sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:1/end_device-0:1/target0:0:1/0:0:1:0/scsi_disk/0:0:1:0/provisioning_mode:full
>
> root@sd-01:~# disklist
> HCTL   NAME   SIZE  REV TRAN   WWNSERIAL  MODEL
> 1:0:0:0/dev/sda 447.1G 203Q sata   0x5002538500231d05 S1G1NYAF923
> SAMSUNG MZ7WD4
> 2:0:0:0/dev/sdb 447.1G 203Q sata   0x5002538500231a41 S1G1NYAF922
> SAMSUNG MZ7WD4
> 0:0:0:0/dev/sdc   3.6T 046  sas0x500a0751e6bd969b 2312E6BD969
> CT4000MX500SSD
> 0:0:1:0/dev/sdd   3.6T 046  sas0x500a0751e6bd97ee 2312E6BD97E
> CT4000MX500SSD
> 0:0:2:0/dev/sde   3.6T 046  sas0x500a0751e6bd9805 2312E6BD980
> CT4000MX500SSD
> 0:0:3:0/dev/sdf   3.6T 046  sas0x500a0751e6bd9681 2312E6BD968
> CT4000MX500SSD
> 0:0:4:0/dev/sdg   3.6T 045  sas0x500a0751e6b5d30a 2309E6B5D30
> CT4000MX500SSD
> 0:0:5:0/dev/sdh   3.6T 046  sas0x500a0751e6bd967e 2312E6BD967
> CT4000MX500SSD
> 0:0:6:0/dev/sdi   3.6T 046  sas0x500a0751e6bd97e4 2312E6BD97E
> CT4000MX500SSD
> 0:0:7:0/dev/sdj   3.6T 046  sas0x500a0751e6bd96a0 2312E6BD96A
> CT4000MX500SSD
>
> So my question is why it only happens to CT4000MX500SSD drives and why it
> just started now and I don't have in other servers?
> Maybe it is related to firmware version "M3CR046 vs M3CR045"
> I check the crucial website and actually "M3CR046" is not exist:
> https://www.crucial.com/support/ssd-support/mx500-support
> In this forum people recommend upgrading "M3CR

[ceph-users] Re: High OSD commit_latency after kernel upgrade

2024-03-22 Thread Anthony D'Atri
Maybe because the Crucial units are detected as client drives?  But also look 
at the device paths and the output of whatever "disklist" is.  Your boot drives 
are SATA and the others are SAS which seems even more likely to be a factor.
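
A quick way to put those differences side by side (a sketch; assumes lsblk and
smartmontools are available on the host):

lsblk -o NAME,TRAN,ROTA,MODEL,DISC-GRAN,DISC-MAX
smartctl -i /dev/sdc   # shows how the drive is attached and its firmware revision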

> On Mar 22, 2024, at 10:42, Özkan Göksu  wrote:
> 
> Hello Anthony, thank you for the answer. 
> 
> While researching I also found out this type of issues but the thing I did 
> not understand is in the same server the OS drives "SAMSUNG MZ7WD480" is all 
> good.
> 
> root@sd-01:~# lsblk -D
> NAME   DISC-ALN DISC-GRAN DISC-MAX 
> DISC-ZERO
> sda   0  512B   2G
>  0
> ├─sda10  512B   2G
>  0
> ├─sda20  512B   2G
>  0
> └─sda30  512B   2G
>  0
>   └─md0   0  512B   2G
>  0
> └─md0p1   0  512B   2G
>  0
> sdb   0  512B   2G
>  0
> ├─sdb10  512B   2G
>  0
> ├─sdb20  512B   2G
>  0
> └─sdb30  512B   2G
>  0
>   └─md0   0  512B   2G
>  0
> └─md0p1   0  512B   2G
>  0
> 
> root@sd-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
> /sys/devices/pci:00/:00:11.4/ata1/host1/target1:0:0/1:0:0:0/scsi_disk/1:0:0:0/provisioning_mode:writesame_16
> /sys/devices/pci:00/:00:11.4/ata2/host2/target2:0:0/2:0:0:0/scsi_disk/2:0:0:0/provisioning_mode:writesame_16
> /sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/scsi_disk/0:0:0:0/provisioning_mode:full
> /sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:1/end_device-0:1/target0:0:1/0:0:1:0/scsi_disk/0:0:1:0/provisioning_mode:full
> 
> root@sd-01:~# disklist
> HCTL   NAME   SIZE  REV TRAN   WWNSERIAL  MODEL
> 1:0:0:0/dev/sda 447.1G 203Q sata   0x5002538500231d05 S1G1NYAF923 SAMSUNG 
> MZ7WD4
> 2:0:0:0/dev/sdb 447.1G 203Q sata   0x5002538500231a41 S1G1NYAF922 SAMSUNG 
> MZ7WD4
> 0:0:0:0/dev/sdc   3.6T 046  sas0x500a0751e6bd969b 2312E6BD969 
> CT4000MX500SSD
> 0:0:1:0/dev/sdd   3.6T 046  sas0x500a0751e6bd97ee 2312E6BD97E 
> CT4000MX500SSD
> 0:0:2:0/dev/sde   3.6T 046  sas0x500a0751e6bd9805 2312E6BD980 
> CT4000MX500SSD
> 0:0:3:0/dev/sdf   3.6T 046  sas0x500a0751e6bd9681 2312E6BD968 
> CT4000MX500SSD
> 0:0:4:0/dev/sdg   3.6T 045  sas0x500a0751e6b5d30a 2309E6B5D30 
> CT4000MX500SSD
> 0:0:5:0/dev/sdh   3.6T 046  sas0x500a0751e6bd967e 2312E6BD967 
> CT4000MX500SSD
> 0:0:6:0/dev/sdi   3.6T 046  sas0x500a0751e6bd97e4 2312E6BD97E 
> CT4000MX500SSD
> 0:0:7:0/dev/sdj   3.6T 046  sas0x500a0751e6bd96a0 2312E6BD96A 
> CT4000MX500SSD
> 
> So my question is why it only happens to CT4000MX500SSD drives and why it 
> just started now and I don't have in other servers? 
> Maybe it is related to firmware version "M3CR046 vs M3CR045" 
> I check the crucial website and actually "M3CR046" is not exist: 
> https://www.crucial.com/support/ssd-support/mx500-support
> In this forum people recommend upgrading "M3CR046" 
> https://forums.unraid.net/topic/134954-warning-crucial-mx500-ssds-world-of-pain-stay-away-from-these/
> But actually in my ud cluster all the drives are "M3CR045" and have lower 
> latency. I'm really confused.
> 
> 
> Instead of writing udev rules for only CT4000MX500SSD is there any 
> recommended udev rule for ceph and all type of sata drives? 
> 
> 
> 
> Anthony D'Atri mailto:a...@dreamsnake.net>>, 22 Mar 
> 2024 Cum, 17:00 tarihinde şunu yazdı:
>> 
>> How to stop sys from changing USB SSD provisioning_mode from unmap to full 
>> in Ubuntu 22.04?
>> askubuntu.com
>>  
>> How
>>  to stop sys from changing USB SSD provisioning_mode from unmap to full in 
>> Ubuntu 22.04? 
>> 
>> askubuntu.com 
>> ?
>> 
>> 
>>> On Mar 22, 2024, at 09:36, Özkan Göksu >> > wrote:
>>> 
>>> Hello!
>>> 
>>> After upgrading "5.15.0-84-generic" to "5.15.0-100-generic" (Ubuntu 22.04.2
>>> LTS) , commit latency started a

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Frédéric Nass
 
 
 
Michel, 
  
Glad to know that was it. 
  
I was wondering when a per-OSD osd_mclock_max_capacity_iops_hdd value would be 
set in the cluster's config database, since I don't have any set in my lab. 
It turns out the per-OSD osd_mclock_max_capacity_iops_hdd is only set when the 
calculated value is below osd_mclock_iops_capacity_threshold_hdd; otherwise the 
OSD uses the default value of 315. 
  
Probably to rule out any insanely high calculated values. Would have been nice 
to also rule out any insanely low measured values. :-) 
  
Now either: 
  
A/ these incredibly low values were calculated a while back with an immature 
version of the code or under some specific hardware conditions, and you can hope 
this won't happen again 
  
OR 
  
B/ you don't want to rely on hope too much and you'll prefer to disable 
automatic calculation (osd_mclock_skip_benchmark = true) and set 
osd_mclock_max_capacity_iops_[hdd,ssd] by yourself (globally or using a 
rack/host mask) after a precise evaluation of the performance of your OSDs. 
  
B/ would be more deterministic :-) 
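
Roughly, option B would look like this (a sketch; the IOPS value is a placeholder
to be replaced with your own fio/osd bench measurements, and the host name is
just an example):

ceph config set osd osd_mclock_skip_benchmark true
ceph config set osd/class:hdd osd_mclock_max_capacity_iops_hdd 700
# or with a host mask instead of a device-class mask:
ceph config set osd/host:idr-osd2 osd_mclock_max_capacity_iops_hdd 700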
  
Cheers, 
Frédéric.   
 
 
 
 
 

-Message original-

De: Michel 
à: Frédéric 
Cc: Pierre ; ceph-users 
Envoyé: vendredi 22 mars 2024 14:44 CET
Sujet : Re: [ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed 
for 1 month

Hi Frédéric, 

I think you raise the right point, sorry if I misunderstood Pierre's 
suggestion to look at OSD performances. Just before reading your email, 
I was implementing Pierre's suggestion for max_osd_scrubs and I saw the 
osd_mclock_max_capacity_iops_hdd for a few OSDs (I guess those with a 
value different from the default). For the suspect OSD, the value is 
very low, 0.145327, and I suspect it is the cause of the problem. A few 
others have a value ~5 which I find also very low (all OSDs are using 
the same recent HW/HDD). 

Thanks for these informations. I'll follow your suggestions to rerun the 
benchmark and report if it improved the situation. 

Best regards, 

Michel 

Le 22/03/2024 à 12:18, Frédéric Nass a écrit : 
> Hello Michel, 
> 
> Pierre also suggested checking the performance of this OSD's device(s) which 
> can be done by running a ceph tell osd.x bench. 
> 
> One think I can think of is how the scrubbing speed of this very OSD could be 
> influenced by mclock sheduling, would the max iops capacity calculated by 
> this OSD during its initialization be significantly lower than other OSDs's. 
> 
> What I would do is check (from this OSD's log) the calculated value for max 
> iops capacity and compare it to other OSDs. Eventually force a recalculation 
> by setting 'ceph config set osd.x osd_mclock_force_run_benchmark_on_init 
> true' and restart this OSD. 
> 
> Also I would: 
> 
> - compare running OSD's mclock values (cephadm shell ceph daemon osd.x config 
> show | grep mclock) to other OSDs's. 
> - compare ceph tell osd.x bench to other OSDs's benchmarks. 
> - compare the rotational status of this OSD's db and data devices to other 
> OSDs, to make sure things are in order. 
> 
> Bests, 
> Frédéric. 
> 
> PS: If mclock is the culprit here, then setting osd_op_queue back to mpq for 
> this only OSD would probably reveal it. Not sure about the implication of 
> having a signel OSD running a different scheduler in the cluster though. 
> 
> 
> - Le 22 Mar 24, à 10:11, Michel Jouvin michel.jou...@ijclab.in2p3.fr a 
> écrit : 
> 
>> Pierre, 
>> 
>> Yes, as mentioned in my initial email, I checked the OSD state and found 
>> nothing wrong either in the OSD logs or in the system logs (SMART errors). 
>> 
>> Thanks for the advice of increasing osd_max_scrubs, I may try it, but I 
>> doubt it is a contention problem because it really only affects a fixed 
>> set of PGs (no new PGS have a "stucked scrub") and there is a 
>> significant scrubbing activity going on continuously (~10K PGs in the 
>> cluster). 
>> 
>> Again, it is not a problem for me to try to kick out the suspect OSD and 
>> see it fixes the issue but as this cluster is pretty simple/low in terms 
>> of activity and I see nothing that may explain why we have this 
>> situation on a pretty new cluster (9 months, created in Quincy) and not 
>> on our 2 other production clusters, much more used, one of them being 
>> the backend storage of a significant OpenStack clouds, a cluster created 
>> 10 years ago with Infernetis and upgraded since then, a better candidate 
>> for this kind of problems! So, I'm happy to contribute to 
>> troubleshooting a potential issue in Reef if somebody finds it useful 
>> and can help. Else I'll try the approach that worked for Gunnar. 
>> 
>> Best regards, 
>> 
>> Michel 
>> 
>> Le 22/03/2024 à 09:59, Pierre Riteau a écrit : 
>>> Hello Michel, 
>>> 
>>> It might be worth mentioning that the next releases of Reef and Quincy 
>>> should increase the default value of osd_max_scrubs from 1 to 3. See 
>>> the Reef pull request: https://github.com/ceph/ceph/pull/55173 
>>> You could try increasing this config

[ceph-users] Re: log_latency slow operation observed for submit_transact, latency = 22.644258499s

2024-03-22 Thread Joshua Baergen
Personally, I don't think the compaction is actually required. Reef
has compact-on-iteration enabled, which should take care of this
automatically. We see this sort of delay pretty often during PG
cleaning, at the end of a PG being cleaned, when the PG has a high
count of objects, whether or not OSD compaction has been keeping up
with tombstones. It's unfortunately just something to ride through
these days until backfill completes.

https://github.com/ceph/ceph/pull/49438 is a recent attempt to improve
things in this area, but I'm not sure whether it would eliminate this
issue. We've considered going to higher PG counts (and thus fewer
objects per PG) as a possible mitigation as well.
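
For anyone who does want to try the compaction discussed further down the
thread anyway, it boils down to something like this (a sketch; osd.112 and the
data path are taken from the log snippet above, the systemd unit name is an
assumption, and on a cephadm deployment the ceph-kvstore-tool call has to run
inside a cephadm shell for that OSD):

ceph tell osd.112 compact        # online RocksDB compaction, OSD stays up

# offline compaction with ceph-kvstore-tool; the OSD must be stopped first
systemctl stop ceph-<fsid>@osd.112.service
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-112 compact
systemctl start ceph-<fsid>@osd.112.service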

Josh

On Fri, Mar 22, 2024 at 2:59 AM Alexander E. Patrakov
 wrote:
>
> Hello Torkil,
>
> The easiest way (in my opinion) to perform offline compaction is a bit
> different than what Igor suggested. We had a prior off-list
> conversation indicating that the results would be equivalent.
>
> 1. ceph config set osd osd_compact_on_start true
> 2. Restart the OSD that you want to compact (or the whole host at
> once, if you want to compact the whole host and your failure domain
> allows for that)
> 3. ceph config set osd osd_compact_on_start false
>
> The OSD will restart, but will not show as "up" until the compaction
> process completes. In your case, I would expect it to take up to 40
> minutes.
>
> On Fri, Mar 22, 2024 at 3:46 PM Torkil Svensgaard  wrote:
> >
> >
> > On 22-03-2024 08:38, Igor Fedotov wrote:
> > > Hi Torkil,
> >
> > Hi Igor
> >
> > > highly likely you're facing a well known issue with RocksDB performance
> > > drop after bulk data removal. The latter might occur at source OSDs
> > > after PG migration completion.
> >
> > Aha, thanks.
> >
> > > You might want to use DB compaction (preferably offline one using ceph-
> > > kvstore-tool) to get OSD out of this "degraded" state or as a preventive
> > > measure. I'd recommend to do that for all the OSDs right now. And once
> > > again after rebalancing is completed.  This should improve things but
> > > unfortunately no 100% guarantee.
> >
> > Why is offline preferred? With offline the easiest way would be
> > something like stop all OSDs one host at a time and run a loop over
> > /var/lib/ceph/$id/osd.*?
> >
> > > Also curious if you have DB/WAL on fast (SSD or NVMe) drives? This might
> > > be crucial..
> >
> > We do, 22 HDDs and 2 DB/WAL NVMes pr host.
> >
> > Thanks.
> >
> > Mvh.
> >
> > Torkil
> >
> > >
> > > Thanks,
> > >
> > > Igor
> > >
> > > On 3/22/2024 9:59 AM, Torkil Svensgaard wrote:
> > >> Good morning,
> > >>
> > >> Cephadm Reef 18.2.1. We recently added 4 hosts and changed a failure
> > >> domain from host to datacenter which is the reason for the large
> > >> misplaced percentage.
> > >>
> > >> We were seeing some pretty crazy spikes in "OSD Read Latencies" and
> > >> "OSD Write Latencies" on the dashboard. Most of the time everything is
> > >> well but then for periods of time, 1-4 hours, latencies will go to 10+
> > >> seconds for one or more OSDs. This also happens outside scrub hours
> > >> and it is not the same OSDs every time. The OSDs affected are HDD with
> > >> DB/WAL on NVMe.
> > >>
> > >> Log snippet:
> > >>
> > >> "
> > >> ...
> > >> 2024-03-22T06:48:22.859+ 7fb184b52700  1 heartbeat_map is_healthy
> > >> 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
> > >> 2024-03-22T06:48:22.859+ 7fb185b54700  1 heartbeat_map is_healthy
> > >> 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
> > >> 2024-03-22T06:48:22.864+ 7fb169898700  1 heartbeat_map
> > >> clear_timeout 'OSD::osd_op_tp thread 0x7fb169898700' had timed out
> > >> after 15.00954s
> > >> 2024-03-22T06:48:22.864+ 7fb169898700  0 bluestore(/var/lib/ceph/
> > >> osd/ceph-112) log_latency slow operation observed for submit_transact,
> > >> latency = 17.716707230s
> > >> 2024-03-22T06:48:22.880+ 7fb1748ae700  0 bluestore(/var/lib/ceph/
> > >> osd/ceph-112) log_latency_fn slow operation observed for
> > >> _txc_committed_kv, latency = 17.732601166s, txc = 0x55a5bcda0f00
> > >> 2024-03-22T06:48:38.077+ 7fb184b52700  1 heartbeat_map is_healthy
> > >> 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
> > >> 2024-03-22T06:48:38.077+ 7fb184b52700  1 heartbeat_map is_healthy
> > >> 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
> > >> ...
> > >> "
> > >>
> > >> "
> > >> [root@dopey ~]# ceph -s
> > >>   cluster:
> > >> id: 8ee2d228-ed21-4580-8bbf-0649f229e21d
> > >> health: HEALTH_WARN
> > >> 1 failed cephadm daemon(s)
> > >> Low space hindering backfill (add storage if this doesn't
> > >> resolve itself): 1 pg backfill_toofull
> > >>
> > >>   services:
> > >> mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 3d)
> > >> mgr: jolly.tpgixt(active, since 10d), standbys: dopey.lxajvk,
> > >> lazy.xuhetq
> > >> mds: 1

[ceph-users] Re: High OSD commit_latency after kernel upgrade

2024-03-22 Thread Özkan Göksu
After I set these 2 udev rules:

root@sd-02:~# cat /etc/udev/rules.d/98-ceph-provisioning-mode.rules
ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{provisioning_mode}:="unmap"

root@sd-02:~# cat /etc/udev/rules.d/99-ceph-write-through.rules
ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"

Only the drives themselves changed to "DISC-GRAN=4K" and "DISC-MAX=4G".

This is the status:

root@sd-02:~# lsblk -D
NAME        DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda                0      512B       2G         0
├─sda1             0      512B       2G         0
├─sda2             0      512B       2G         0
└─sda3             0      512B       2G         0
  └─md0            0      512B       2G         0
    └─md0p1        0      512B       2G         0
sdb                0      512B       2G         0
├─sdb1             0      512B       2G         0
├─sdb2             0      512B       2G         0
└─sdb3             0      512B       2G         0
  └─md0            0      512B       2G         0
    └─md0p1        0      512B       2G         0
sdc                0        4K       4G         0
├─ceph--35de126c--326d--45f0--85e6--ef651dd25506-osd--block--65a12345--788d--406c--b4aa--79c691662f3e
│                  0        0B       0B         0
└─ceph--35de126c--326d--45f0--85e6--ef651dd25506-osd--block--0fc29fdb--1345--465c--b830--8a217dd9034f
                   0        0B       0B         0

But in my other cluster, as you can see, the Ceph LVM partitions are also 4K + 2G:

root@ud-01:~# lsblk -D
NAME        DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda                0      512B       2G         0
├─sda1             0      512B       2G         0
└─sda2             0      512B       2G         0
  └─md0            0      512B       2G         0
    ├─md0p1        0      512B       2G         0
    └─md0p2        0      512B       2G         0
sdb                0      512B       2G         0
├─sdb1             0      512B       2G         0
└─sdb2             0      512B       2G         0
  └─md0            0      512B       2G         0
    ├─md0p1        0      512B       2G         0
    └─md0p2        0      512B       2G         0
sdc                0        4K       2G         0
├─ceph--7496095f--18c7--41fd--90f2--d9b3e382bc8e-osd--block--ec86a029--23f7--4328--9600--a24a290e3003
│                  0        4K       2G         0
└─ceph--7496095f--18c7--41fd--90f2--d9b3e382bc8e-osd--block--5b69b748--d899--4f55--afc3--2ea3c8a05ca1
                   0        4K       2G         0

I think I also need to write a udev rule for the LVM OSD partitions, right?
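
Not an authoritative answer, but a couple of checks that might help decide
(a sketch; device names taken from the listings above, and note that the
lvm.conf issue_discards setting only affects discards issued on
lvremove/lvreduce, not runtime discards passed through the LVs):

lsblk -D /dev/sdc                         # do the LVs inherit the disk's discard limits?
cat /sys/block/dm-*/queue/discard_granularity
grep issue_discards /etc/lvm/lvm.conf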

Anthony D'Atri , 22 Mar 2024 Cum, 18:11 tarihinde şunu
yazdı:

> Maybe because the Crucial units are detected as client drives?  But also
> look at the device paths and the output of whatever "disklist" is.  Your
> boot drives are SATA and the others are SAS which seems even more likely to
> be a factor.
>
> On Mar 22, 2024, at 10:42, Özkan Göksu  wrote:
>
> Hello Anthony, thank you for the answer.
>
> While researching I also found out this type of issues but the thing I did
> not understand is in the same server the OS drives "SAMSUNG MZ7WD480" is
> all good.
>
> root@sd-01:~# lsblk -D
> NAME   DISC-ALN DISC-GRAN DISC-MAX
> DISC-ZERO
> sda   0  512B   2G
> 0
> ├─sda10  512B   2G
> 0
> ├─sda20  512B   2G
> 0
> └─sda30  512B   2G
> 0
>   └─md0   0  512B   2G
> 0
> └─md0p1   0  512B   2G
> 0
> sdb   0  512B   2G
> 0
> ├─sdb10  512B   2G
> 0
> ├─sdb20  512B   2G
> 0
> └─sdb30  512B   2G
> 0
>   └─md0   0  512B   2G
> 0
> └─md0p1   0  512B   2G
> 0
>
> roo

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Michel Jouvin

Hi,

The attempt to rerun the bench was not really a success. I got the 
following messages:


-

Mar 22 14:48:36 idr-osd2 ceph-osd[326854]: osd.29 83873 
maybe_override_max_osd_capacity_for_qos osd bench result - bandwidth 
(MiB/sec): 10.910 iops: 2792.876 elapsed_sec: 1.074
Mar 22 14:48:36 idr-osd2 ceph-osd[326854]: log_channel(cluster) log 
[WRN] : OSD bench result of 2792.876456 IOPS exceeded the threshold 
limit of 500.00 IOPS for osd.29. IOPS capacity is unchanged at 
0.00 IOPS. The recommendation is to establish the osd's IOPS 
capacity using other benchmark tools (e.g. Fio) and then override 
osd_mclock_max_capacity_iops_[hdd|ssd].

-

I decided as a first step to raise osd_mclock_max_capacity_iops_hdd 
for the suspect OSD to 50. It was magic! I already managed to get 16 
out of 17 scrubs/deep scrubs to run, and the last one is in progress.
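
For reference, the change itself was just a per-OSD override plus a check
(a sketch; osd.29 being the suspect OSD from the log above):

ceph config set osd.29 osd_mclock_max_capacity_iops_hdd 50
ceph config get osd.29 osd_mclock_max_capacity_iops_hdd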


I now have to understand why this OSD had such bad performance that 
osd_mclock_max_capacity_iops_hdd was set to such a low value... I have 
12 OSDs with an entry for their osd_mclock_max_capacity_iops_hdd and 
they are mostly on one server (with 2 OSDs on another one). I suspect 
there was a problem on these servers at some point. It is unclear why 
it is not enough to just rerun the benchmark, and why such a crazy value 
is found for an HDD...


Best regards,

Michel

Le 22/03/2024 à 14:44, Michel Jouvin a écrit :

Hi Frédéric,

I think you raise the right point, sorry if I misunderstood Pierre's 
suggestion to look at OSD performances. Just before reading your 
email, I was implementing Pierre's suggestion for max_osd_scrubs and I 
saw the osd_mclock_max_capacity_iops_hdd for a few OSDs (I guess those 
with a value different from the default). For the suspect OSD, the 
value is very low, 0.145327, and I suspect it is the cause of the 
problem. A few others have a value ~5 which I find also very low (all 
OSDs are using the same recent HW/HDD).


Thanks for these informations. I'll follow your suggestions to rerun 
the benchmark and report if it improved the situation.


Best regards,

Michel

Le 22/03/2024 à 12:18, Frédéric Nass a écrit :

Hello Michel,

Pierre also suggested checking the performance of this OSD's 
device(s) which can be done by running a ceph tell osd.x bench.


One think I can think of is how the scrubbing speed of this very OSD 
could be influenced by mclock sheduling, would the max iops capacity 
calculated by this OSD during its initialization be significantly 
lower than other OSDs's.


What I would do is check (from this OSD's log) the calculated value 
for max iops capacity and compare it to other OSDs. Eventually force 
a recalculation by setting 'ceph config set osd.x 
osd_mclock_force_run_benchmark_on_init true' and restart this OSD.


Also I would:

- compare running OSD's mclock values (cephadm shell ceph daemon 
osd.x config show | grep mclock) to other OSDs's.

- compare ceph tell osd.x bench to other OSDs's benchmarks.
- compare the rotational status of this OSD's db and data devices to 
other OSDs, to make sure things are in order.


Bests,
Frédéric.

PS: If mclock is the culprit here, then setting osd_op_queue back to 
mpq for this only OSD would probably reveal it. Not sure about the 
implication of having a signel OSD running a different scheduler in 
the cluster though.



- Le 22 Mar 24, à 10:11, Michel Jouvin 
michel.jou...@ijclab.in2p3.fr a écrit :



Pierre,

Yes, as mentioned in my initial email, I checked the OSD state and 
found
nothing wrong either in the OSD logs or in the system logs (SMART 
errors).


Thanks for the advice of increasing osd_max_scrubs, I may try it, but I
doubt it is a contention problem because it really only affects a fixed
set of PGs (no new PGS have a "stucked scrub") and there is a
significant scrubbing activity going on continuously (~10K PGs in the
cluster).

Again, it is not a problem for me to try to kick out the suspect OSD 
and
see it fixes the issue but as this cluster is pretty simple/low in 
terms

of activity and I see nothing that may explain why we have this
situation on a pretty new cluster (9 months, created in Quincy) and not
on our 2 other production clusters, much more used, one of them being
the backend storage of a significant OpenStack clouds, a cluster 
created
10 years ago with Infernetis and upgraded since then, a better 
candidate

for this kind of problems! So, I'm happy to contribute to
troubleshooting a potential issue in Reef if somebody finds it useful
and can help. Else I'll try the approach that worked for Gunnar.

Best regards,

Michel

Le 22/03/2024 à 09:59, Pierre Riteau a écrit :

Hello Michel,

It might be worth mentioning that the next releases of Reef and Quincy
should increase the default value of osd_max_scrubs from 1 to 3. See
the Reef pull request: https://github.com/ceph/ceph/pull/55173
You could try increasing this configuration setting if you
haven't already, but note that it can impact client I/O performance.

Also, if the delay

[ceph-users] Re: [ext] Re: cephadm auto disk preparation and OSD installation incomplete

2024-03-22 Thread Kuhring, Mathias
Hey Eugen,

Thank you for the quick reply.

The 5 missing disks on the one host were completely installed after I fully 
cleaned them up as I described.
So it seems a smaller number of disks can make it.

Regarding the other host with 40 disks:
Failing the MGR didn't have any effect.
There are no errors in `/var/log/ceph/cephadm.log`.
But a bunch of repeating image listings like:
cephadm --image 
quay.io/ceph/ceph@sha256:1fb108217b110c01c480e32d0cfea0e19955733537af7bb8cbae165222496e09
 --timeout 895 ls

But `ceph log last 200 debug cephadm` gave me a bunch of interesting errors
(Excerpt below. Is there any preferred method to provide bigger logs?).

So, there are some timeouts, which might support the assumption that 
ceph-volume is a bit overwhelmed by the number of disks.
A tentative assumption, but maybe LV creation is taking way too long (is cephadm 
waiting for all of them in bulk?) and times out after the default 900 secs.
However, the LVs are created, so cephadm will not consider those disks in the 
next round ("has a filesystem").

I'm testing this theory right now by bumping up the limit to 2 hours (and the 
restart with "fresh" disks again):
ceph config set mgr mgr/cephadm/default_cephadm_command_timeout 7200
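
To confirm the value is picked up and to follow progress in the meantime,
something like this (a sketch):

ceph config get mgr mgr/cephadm/default_cephadm_command_timeout
ceph orch device ls --refresh
ceph log last 50 debug cephadm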

However, there are also mentions of the host not being reachable: "Unable to 
reach remote host ceph-3-11".
But this seems to be limited to cephadm / ceph orch, so basically the MGR but not 
the rest of the cluster
(i.e. MONs, OSDs, etc. are communicating happily, as far as I can tell).

During my fresh run, I do notice more hosts apparently being down:
0|0[root@ceph-3-10 ~]# ceph orch host ls | grep Offline
ceph-3-7  172.16.62.38  rgw,osd,_admin Offline
ceph-3-10 172.16.62.41  rgw,osd,_admin,prometheus  Offline
ceph-3-11 172.16.62.43  rgw,osd,_admin Offline
osd-mirror-2  172.16.62.23  rgw,osd,_admin Offline
osd-mirror-3  172.16.62.24  rgw,osd,_admin Offline

But I wonder if this is just a side effect of the MGR (cephadm/orch) being too 
busy/overwhelmed with e.g. deploying the new OSDs.

I will update you once the next round is done or failed.

Best Wishes,
Mathias


ceph log last 200 debug cephadm
...
2024-03-20T09:19:24.917834+ mgr.osd-mirror-4.dkzbkw (mgr.339518816) 82122 : 
cephadm [INF] Detected new or changed devices on ceph-3-11
2024-03-20T09:34:28.877718+ mgr.osd-mirror-4.dkzbkw (mgr.339518816) 83339 : 
cephadm [ERR] Failed to apply osd.all-available-devices spec 
DriveGroupSpec.from_json(yaml.safe_load('''service_type: osd
service_id: all-available-devices
service_name: osd.all-available-devices
placement:
  host_pattern: '*'
spec:
  data_devices:
all: true
  filter_logic: AND
  objectstore: bluestore
''')): Command timed out on host cephadm deploy (osd daemon) (default 900 
second timeout)
...

raise TimeoutError()
concurrent.futures._base.TimeoutError

During handling of the above exception, another exception occurred:

...
orchestrator._interface.OrchestratorError: Command timed out on host cephadm 
deploy (osd daemon) (default 900 second timeout)
2024-03-20T09:34:28.881472+ mgr.osd-mirror-4.dkzbkw (mgr.339518816) 83340 : 
cephadm [ERR] Task exception was never retrieved
future: .all_hosts() 
done, defined at /usr/share/ceph/mgr/cephadm/services/osd.py:72> 
exception=RuntimeError('cephadm exited with an error code: 1, stderr:Unable to 
reach remote host ceph-3-11. ',)>
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 75, in all_hosts
return await gather(*futures)
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 64, in 
create_from_spec_one
replace_osd_ids=osd_id_claims_for_host, env_vars=env_vars
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 96, in 
create_single_host
code, '\n'.join(err)))
RuntimeError: cephadm exited with an error code: 1, stderr:Unable to reach 
remote host ceph-3-11.
...

''')): cephadm exited with an error code: 1, stderr:Inferring config 
/var/lib/ceph/7efa00f9-182f-40f4-9136-d51895db1f0b/config/ceph.conf
Non-zero exit code 1 from /usr/bin/docker run --rm --ipc=host 
--stop-signal=SIGTERM --ulimit nofile=1048576 --net=host --entrypoint 
/usr/sbin/ceph-volume --privileged --group-add=disk --init -e 
CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:1fb108217b110c01c480e32d0cfea0e19955733537af7bb8cbae165222496e09
 -e NODE_NAME=ceph-3-11 -e CEPH_USE_RANDOM_NONCE=1 -e 
CEPH_VOLUME_OSDSPEC_AFFINITY=all-available-devices -e 
CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v 
/var/run/ceph/7efa00f9-182f-40f4-9136-d51895db1f0b:/var/run/ceph:z -v 
/var/log/ceph/7efa00f9-182f-40f4-9136-d51895db1f0b:/var/log/ceph:z -v 
/var/lib/ceph/7efa00f9-182f-40f4-9136-d51895db1f0b/crash:/var/lib/ceph/crash:z 
-v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v 
/run/lock/lvm:/run/lock/lvm -v /:/rootfs -v 
/tmp/ceph-tmpfoibulv3:/etc/ceph/ceph.conf:z -v 
/tmp/ceph-tmpjq5uxhj1:/var/lib/ceph/bootstrap-osd/ceph.keyring:z 
quay.io/ceph/ceph

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Michel Jouvin

Frédéric,

We arrived at the same conclusions! I agree that ruling out insanely low 
values would be a good addition: the idea would be that the benchmark emits a 
warning about the value but does not set a value lower than a defined minimum. 
I don't have a precise idea of the possible bad side effects of such an 
approach...
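
In the meantime, this is a quick way to spot the OSDs that got such a low value
recorded (a sketch):

ceph config dump | grep osd_mclock_max_capacity_iops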


Thanks for your help.

Michel

Le 22/03/2024 à 16:29, Frédéric Nass a écrit :

Michel,
Glad to know that was it.
I was wondering when would per OSD osd_mclock_max_capacity_iops_hdd 
value be set in cluster's config database since I don't have any set 
in my lab.
Turns out the per OSD osd_mclock_max_capacity_iops_hdd is only set 
when the calculated value is below 
osd_mclock_iops_capacity_threshold_hdd, otherwise the OSD uses the 
default value of 315.
Probably to rule out any insanely high calculated values. Would have 
been nice to also rule out any insanely low measured values. :-)

Now either:
A/ these incredibly low values were calculated a while back with an 
unmature version of the code or under some specific hardware 
conditions and you can hope this won't happen again

OR
B/ you don't want to rely on hope to much and you'll prefer to disable 
automatic calculation (osd_mclock_skip_benchmark = true) and set 
osd_mclock_max_capacity_iops_[hdd,ssd] by yourself (globally or using 
a rack/host mask) after a precise evaluation of the performance of 
your OSDs.

B/ would be more deterministic :-)
Cheers,
Frédéric.


*De: *Michel 
*à: *Frédéric 
*Cc: *Pierre ; ceph-users 
*Envoyé: *vendredi 22 mars 2024 14:44 CET
*Sujet : *Re: [ceph-users] Re: Reef (18.2): Some PG not
scrubbed/deep scrubbed for 1 month

Hi Frédéric,

I think you raise the right point, sorry if I misunderstood Pierre's
suggestion to look at OSD performances. Just before reading your
email,
I was implementing Pierre's suggestion for max_osd_scrubs and I
saw the
osd_mclock_max_capacity_iops_hdd for a few OSDs (I guess those with a
value different from the default). For the suspect OSD, the value is
very low, 0.145327, and I suspect it is the cause of the problem.
A few
others have a value ~5 which I find also very low (all OSDs are using
the same recent HW/HDD).

Thanks for these informations. I'll follow your suggestions to
rerun the
benchmark and report if it improved the situation.

Best regards,

Michel

Le 22/03/2024 à 12:18, Frédéric Nass a écrit :
> Hello Michel,
>
> Pierre also suggested checking the performance of this OSD's
device(s) which can be done by running a ceph tell osd.x bench.
>
> One think I can think of is how the scrubbing speed of this very
OSD could be influenced by mclock sheduling, would the max iops
capacity calculated by this OSD during its initialization be
significantly lower than other OSDs's.
>
> What I would do is check (from this OSD's log) the calculated
value for max iops capacity and compare it to other OSDs.
Eventually force a recalculation by setting 'ceph config set osd.x
osd_mclock_force_run_benchmark_on_init true' and restart this OSD.
>
> Also I would:
>
> - compare running OSD's mclock values (cephadm shell ceph daemon
osd.x config show | grep mclock) to other OSDs's.
> - compare ceph tell osd.x bench to other OSDs's benchmarks.
> - compare the rotational status of this OSD's db and data
devices to other OSDs, to make sure things are in order.
>
> Bests,
> Frédéric.
>
> PS: If mclock is the culprit here, then setting osd_op_queue
back to mpq for this only OSD would probably reveal it. Not sure
about the implication of having a signel OSD running a different
scheduler in the cluster though.
>
>
> - Le 22 Mar 24, à 10:11, Michel Jouvin
michel.jou...@ijclab.in2p3.fr a écrit :
>
>> Pierre,
>>
>> Yes, as mentioned in my initial email, I checked the OSD state
and found
>> nothing wrong either in the OSD logs or in the system logs
(SMART errors).
>>
>> Thanks for the advice of increasing osd_max_scrubs, I may try
it, but I
>> doubt it is a contention problem because it really only affects
a fixed
>> set of PGs (no new PGS have a "stucked scrub") and there is a
>> significant scrubbing activity going on continuously (~10K PGs
in the
>> cluster).
>>
>> Again, it is not a problem for me to try to kick out the
suspect OSD and
>> see it fixes the issue but as this cluster is pretty simple/low
in terms
>> of activity and I see nothing that may explain why we have this
>> situation on a pretty new cluster (9 months, created in Quincy)
and not
>> on our 2 other production clusters, much more used, one of them
being
>> the backend storage of a significant OpenStack clouds, a
  

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Anthony D'Atri
Perhaps emitting an extremely low value could be useful for identifying a 
compromised drive?

> On Mar 22, 2024, at 12:49, Michel Jouvin  
> wrote:
> 
> Frédéric,
> 
> We arrived at the same conclusions! I agree that an insane low value would be 
> a good addition: the idea would be that the benchmark emits a warning about 
> the value but the it will not put a value lower than the minimum defined. I 
> don't have a precise idea of the possible bad side effects of such an 
> approach...
> 
> Thanks for your help.
> 
> Michel
> 
> Le 22/03/2024 à 16:29, Frédéric Nass a écrit :
>> Michel,
>> Glad to know that was it.
>> I was wondering when would per OSD osd_mclock_max_capacity_iops_hdd value be 
>> set in cluster's config database since I don't have any set in my lab.
>> Turns out the per OSD osd_mclock_max_capacity_iops_hdd is only set when the 
>> calculated value is below osd_mclock_iops_capacity_threshold_hdd, otherwise 
>> the OSD uses the default value of 315.
>> Probably to rule out any insanely high calculated values. Would have been 
>> nice to also rule out any insanely low measured values. :-)
>> Now either:
>> A/ these incredibly low values were calculated a while back with an unmature 
>> version of the code or under some specific hardware conditions and you can 
>> hope this won't happen again
>> OR
>> B/ you don't want to rely on hope to much and you'll prefer to disable 
>> automatic calculation (osd_mclock_skip_benchmark = true) and set 
>> osd_mclock_max_capacity_iops_[hdd,ssd] by yourself (globally or using a 
>> rack/host mask) after a precise evaluation of the performance of your OSDs.
>> B/ would be more deterministic :-)
>> Cheers,
>> Frédéric.
>> 
>>
>>*De: *Michel 
>>*à: *Frédéric 
>>*Cc: *Pierre ; ceph-users 
>>*Envoyé: *vendredi 22 mars 2024 14:44 CET
>>*Sujet : *Re: [ceph-users] Re: Reef (18.2): Some PG not
>>scrubbed/deep scrubbed for 1 month
>> 
>>Hi Frédéric,
>> 
>>I think you raise the right point, sorry if I misunderstood Pierre's
>>suggestion to look at OSD performances. Just before reading your
>>email,
>>I was implementing Pierre's suggestion for max_osd_scrubs and I
>>saw the
>>osd_mclock_max_capacity_iops_hdd for a few OSDs (I guess those with a
>>value different from the default). For the suspect OSD, the value is
>>very low, 0.145327, and I suspect it is the cause of the problem.
>>A few
>>others have a value ~5 which I find also very low (all OSDs are using
>>the same recent HW/HDD).
>> 
>>Thanks for these informations. I'll follow your suggestions to
>>rerun the
>>benchmark and report if it improved the situation.
>> 
>>Best regards,
>> 
>>Michel
>> 
>>Le 22/03/2024 à 12:18, Frédéric Nass a écrit :
>>> Hello Michel,
>>>
>>> Pierre also suggested checking the performance of this OSD's
>>device(s) which can be done by running a ceph tell osd.x bench.
>>>
>>> One think I can think of is how the scrubbing speed of this very
>>OSD could be influenced by mclock sheduling, would the max iops
>>capacity calculated by this OSD during its initialization be
>>significantly lower than other OSDs's.
>>>
>>> What I would do is check (from this OSD's log) the calculated
>>value for max iops capacity and compare it to other OSDs.
>>Eventually force a recalculation by setting 'ceph config set osd.x
>>osd_mclock_force_run_benchmark_on_init true' and restart this OSD.
>>>
>>> Also I would:
>>>
>>> - compare running OSD's mclock values (cephadm shell ceph daemon
>>osd.x config show | grep mclock) to other OSDs's.
>>> - compare ceph tell osd.x bench to other OSDs's benchmarks.
>>> - compare the rotational status of this OSD's db and data
>>devices to other OSDs, to make sure things are in order.
>>>
>>> Bests,
>>> Frédéric.
>>>
>>> PS: If mclock is the culprit here, then setting osd_op_queue
>>back to mpq for this only OSD would probably reveal it. Not sure
>>about the implication of having a signel OSD running a different
>>scheduler in the cluster though.
>>>
>>>
>>> - Le 22 Mar 24, à 10:11, Michel Jouvin
>>michel.jou...@ijclab.in2p3.fr a écrit :
>>>
>>>> Pierre,
>>>>
>>>> Yes, as mentioned in my initial email, I checked the OSD state
>>and found
>>>> nothing wrong either in the OSD logs or in the system logs
>>(SMART errors).
>>>>
>>>> Thanks for the advice of increasing osd_max_scrubs, I may try
>>it, but I
>>>> doubt it is a contention problem because it really only affects
>>a fixed
>>>> set of PGs (no new PGS have a "stucked scrub") and there is a
>>>> significant scrubbing activity going on continuously (~10K PGs
>>in the
>>>> cluster).
>>>>
>>>> Again, it is not a problem for me to try to kick out t

[ceph-users] Re: 18.8.2: osd_mclock_iops_capacity_threshold_hdd untypical values

2024-03-22 Thread Michel Jouvin

Follow-up, changing the title to the real topic...

I did more tests on my OSDs using "ceph tell osd.x bench..." as advised 
by 
https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/#benchmarking-test-steps-using-osd-bench 
(the exact impact of "cache drop" is not clear/visible in my 
experience, but that is a detail). For most OSDs I get an IOPS value of ~700, 
which is consistent with what I have in mind for recent HDDs. I see that 
the default value for osd_mclock_max_capacity_iops_hdd is 315 (with 
osd_mclock_iops_capacity_threshold_hdd at 500), which looks very conservative 
for a cluster with only recent HW. Any risk in increasing it to 700? (again, 
all my OSDs are 9 months old).
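
For reference, the per-OSD sequence from that page boils down to the following
(a sketch; the bench arguments are the example values from the documentation,
and osd.29 is just an example):

ceph tell osd.29 cache drop
ceph tell osd.29 bench 12288000 4096 4194304 100
# and, if the measured value looks sane, persist it:
ceph config set osd.29 osd_mclock_max_capacity_iops_hdd 700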


The exceptions to IOPS = ~700 seem to be the OSDs which have an 
osd_mclock_max_capacity_iops_hdd entry in the central config DB, with a 
value < 315 (the exceptions). What is puzzling is that if I run "ceph 
bench" on them, I get very high values, > 1300, which looks suspect. But 
I have no clue why this happens, as most of the OSDs on the same server 
report a sensible value. Might it be something with the benchmark? Do 
I need more than the "cache drop" before running the bench?


Best regards,

Michel

Le 22/03/2024 à 17:49, Michel Jouvin a écrit :

Frédéric,

We arrived at the same conclusions! I agree that an insane low value 
would be a good addition: the idea would be that the benchmark emits a 
warning about the value but the it will not put a value lower than the 
minimum defined. I don't have a precise idea of the possible bad side 
effects of such an approach...


Thanks for your help.

Michel

Le 22/03/2024 à 16:29, Frédéric Nass a écrit :

Michel,
Glad to know that was it.
I was wondering when would per OSD osd_mclock_max_capacity_iops_hdd 
value be set in cluster's config database since I don't have any set 
in my lab.
Turns out the per OSD osd_mclock_max_capacity_iops_hdd is only set 
when the calculated value is below 
osd_mclock_iops_capacity_threshold_hdd, otherwise the OSD uses the 
default value of 315.
Probably to rule out any insanely high calculated values. Would have 
been nice to also rule out any insanely low measured values. :-)

Now either:
A/ these incredibly low values were calculated a while back with an 
unmature version of the code or under some specific hardware 
conditions and you can hope this won't happen again

OR
B/ you don't want to rely on hope to much and you'll prefer to 
disable automatic calculation (osd_mclock_skip_benchmark = true) and 
set osd_mclock_max_capacity_iops_[hdd,ssd] by yourself (globally or 
using a rack/host mask) after a precise evaluation of the performance 
of your OSDs.

B/ would be more deterministic :-)
Cheers,
Frédéric.


    *De: *Michel 
    *à: *Frédéric 
    *Cc: *Pierre ; ceph-users 
    *Envoyé: *vendredi 22 mars 2024 14:44 CET
    *Sujet : *Re: [ceph-users] Re: Reef (18.2): Some PG not
    scrubbed/deep scrubbed for 1 month

    Hi Frédéric,

    I think you raise the right point, sorry if I misunderstood Pierre's
    suggestion to look at OSD performances. Just before reading your
    email,
    I was implementing Pierre's suggestion for max_osd_scrubs and I
    saw the
    osd_mclock_max_capacity_iops_hdd for a few OSDs (I guess those 
with a

    value different from the default). For the suspect OSD, the value is
    very low, 0.145327, and I suspect it is the cause of the problem.
    A few
    others have a value ~5 which I find also very low (all OSDs are 
using

    the same recent HW/HDD).

    Thanks for these informations. I'll follow your suggestions to
    rerun the
    benchmark and report if it improved the situation.

    Best regards,

    Michel

    Le 22/03/2024 à 12:18, Frédéric Nass a écrit :
    > Hello Michel,
    >
    > Pierre also suggested checking the performance of this OSD's
    device(s) which can be done by running a ceph tell osd.x bench.
    >
    > One think I can think of is how the scrubbing speed of this very
    OSD could be influenced by mclock sheduling, would the max iops
    capacity calculated by this OSD during its initialization be
    significantly lower than other OSDs's.
    >
    > What I would do is check (from this OSD's log) the calculated
    value for max iops capacity and compare it to other OSDs.
    Eventually force a recalculation by setting 'ceph config set osd.x
    osd_mclock_force_run_benchmark_on_init true' and restart this OSD.
    >
    > Also I would:
    >
    > - compare running OSD's mclock values (cephadm shell ceph daemon
    osd.x config show | grep mclock) to other OSDs's.
    > - compare ceph tell osd.x bench to other OSDs's benchmarks.
    > - compare the rotational status of this OSD's db and data
    devices to other OSDs, to make sure things are in order.
    >
    > Bests,
    > Frédéric.
    >
    > PS: If mclock is the culprit here, then setting osd_op_queue
    back to mpq for this only OSD would probably reveal it. Not sure
    about t

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Kai Stian Olstad

On Fri, Mar 22, 2024 at 04:29:21PM +0100, Frédéric Nass wrote:

A/ these incredibly low values were calculated a while back with an unmature 
version of the code or under some specific hardware conditions and you can hope 
this won't happen again


The OSD runs the bench and updates osd_mclock_max_capacity_iops_{hdd,ssd} every time 
the OSD is started.
If you check the OSD log you'll see it doing the bench.
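
For example, something like this should show the result recorded at the last
start (a sketch; the unit name, fsid and log path are placeholders to adjust
for your deployment):

journalctl -u ceph-<fsid>@osd.29.service | grep -i "osd bench result"
grep maybe_override_max_osd_capacity_for_qos /var/log/ceph/ceph-osd.29.log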

--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [ext] Re: cephadm auto disk preparation and OSD installation incomplete

2024-03-22 Thread Kuhring, Mathias
I'm afraid the parameter mgr mgr/cephadm/default_cephadm_command_timeout is 
buggy.
Once it is no longer at the default, the MGR massages the parameter a bit (e.g. 
subtracting 5 secs), thereby turning it into a float, but cephadm won't accept 
that (not even if I set the default 900 myself):

[WRN] CEPHADM_APPLY_SPEC_FAIL: Failed to apply 1 service(s): 
osd.all-available-devices
osd.all-available-devices: cephadm exited with an error code: 2, 
stderr:usage: 
cephadm.8b92cafd937eb89681ee011f9e70f85937fd09c4bd61ed4a59981d275a1f255b
   [-h] [--image IMAGE] [--docker] [--data-dir DATA_DIR]
   [--log-dir LOG_DIR] [--logrotate-dir LOGROTATE_DIR]
   [--sysctl-dir SYSCTL_DIR] [--unit-dir UNIT_DIR] [--verbose]
   [--timeout TIMEOUT] [--retry RETRY] [--env ENV] [--no-container-init]
   [--no-cgroups-split]
   
{version,pull,inspect-image,ls,list-networks,adopt,rm-daemon,rm-cluster,run,shell,enter,ceph-volume,zap-osds,unit,logs,bootstrap,deploy,check-host,prepare-host,add-repo,rm-repo,install,registry-login,gather-facts,host-maintenance,agent,disk-rescan}
   ...
cephadm.8b92cafd937eb89681ee011f9e70f85937fd09c4bd61ed4a59981d275a1f255b: 
error: argument --timeout: invalid int value: '895.0'

This also led to a status panic spiral reporting plenty of hosts and services as 
missing or failing (I assume orch was failing because cephadm complained about 
the parameter).
I got it under control by removing the parameter from the config again (ceph 
config rm mgr mgr/cephadm/default_cephadm_command_timeout)
and then restarting all MGRs manually (systemctl restart ..., again since orch 
was kinda useless at this stage).

Anyhow, is there any other way I can adapt this parameter?
Or should I maybe look into speeding up LV creation (if that is the bottleneck)?

Thanks a lot,
Mathias

-Original Message-
From: Kuhring, Mathias  
Sent: Friday, March 22, 2024 5:38 PM
To: Eugen Block ; ceph-users@ceph.io
Subject: [ceph-users] Re: [ext] Re: cephadm auto disk preparation and OSD 
installation incomplete

Hey Eugen,

Thank you for the quick reply.

The 5 missing disks on the one host were completely installed after I fully 
cleaned them up as I described.
So, seems a smaller number of disks can make it.

Regarding the other host with 40 disks:
Failing the MGR didn't have any effect.
There are nor errors in `/var/log/ceph/cephadm.log`.
But a bunch of repeating image listings like:
cephadm --image 
quay.io/ceph/ceph@sha256:1fb108217b110c01c480e32d0cfea0e19955733537af7bb8cbae165222496e09
 --timeout 895 ls

But `ceph log last 200 debug cephadm` gave me a bunch of interesting errors 
(Excerpt below. Is there any preferred method to provide bigger logs?).

So, there are some timeouts, which might play into the assumption that 
ceph-volume is a bit overwhelmed by the number of disks.
Shy assumption, but maybe LV creation is taking way too long (is cephadm 
waiting for all of them in bulk?) and times out with the default 900 secs.
However, LVs are created and cephadm will not consider them next round ("has a 
filesystem").

I'm testing this theory right now by bumping up the limit to 2 hours (and the 
restart with "fresh" disks again):
ceph config set mgr mgr/cephadm/default_cephadm_command_timeout 7200

However, there are also mentions of the host being not reachable: "Unable to 
reach remote host ceph-3-11"
But this seems to be limited to cephadm / ceph orch, so basically MGR but not 
the rest of the cluster (i.e. MONs, OSDs, etc. are communicating happily, as 
far as I can tell).

During my fresh run, I do notice more hosts being apperently down:
0|0[root@ceph-3-10 ~]# ceph orch host ls | grep Offline
ceph-3-7  172.16.62.38  rgw,osd,_admin Offline
ceph-3-10 172.16.62.41  rgw,osd,_admin,prometheus  Offline
ceph-3-11 172.16.62.43  rgw,osd,_admin Offline
osd-mirror-2  172.16.62.23  rgw,osd,_admin Offline
osd-mirror-3  172.16.62.24  rgw,osd,_admin Offline

But I wonder if this just a side effect of the MGR (cephadm/orch) being too 
busy/overwhelmed with e.g. deploying the new OSDs.

I will update you once the next round is done or failed.

Best Wishes,
Mathias


ceph log last 200 debug cephadm
...
2024-03-20T09:19:24.917834+ mgr.osd-mirror-4.dkzbkw (mgr.339518816) 82122 : 
cephadm [INF] Detected new or changed devices on ceph-3-11
2024-03-20T09:34:28.877718+ mgr.osd-mirror-4.dkzbkw (mgr.339518816) 83339 : 
cephadm [ERR] Failed to apply osd.all-available-devices spec 
DriveGroupSpec.from_json(yaml.safe_load('''service_type: osd
service_id: all-available-devices
service_name: osd.all-available-devices
placement:
  host_pattern: '*'
spec:
  data_devices:
all: true
  filter_logic: AND
  objectstore: bluestore
''')): Command timed out on host cephadm deploy (osd daemon) (default 900 
second timeout) ...

raise TimeoutError()
concurrent.futures._base.TimeoutError

During handling of the above exception, another exception occurred:

...
orchestrator._interface.Orchestrator

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Frédéric Nass
 
 
 
 
 
Michel, 
  
Log says that osd.29 is providing 2792 '4k' iops at 10.910 MiB/s. These figures 
suggest that a controller write-back cache is in use along the IO path. Is that 
right? 
  
Since 2792 is above 500, osd_mclock_max_capacity_iops_hdd falls back to 315, and 
the OSD suggests running a benchmark and setting 
osd_mclock_max_capacity_iops_[hdd|ssd] accordingly. 
Removing any per-OSD osd_mclock_max_capacity_iops_hdd, restarting all 
concerned OSDs, and checking that no osd_mclock_max_capacity_iops_hdd is set 
anymore should be enough for the time being. 
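
Roughly (a sketch; osd.29 as the example, to be repeated for every OSD that has
an entry):

ceph config rm osd.29 osd_mclock_max_capacity_iops_hdd
ceph orch daemon restart osd.29
ceph config dump | grep osd_mclock_max_capacity_iops_hdd   # should eventually come back empty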
  
Not sure why these OSDs had such pretty bad performance in the past. Maybe a 
controller firmware issue at that time. 
  
Regarding the write-back cache, be careful not to set 
osd_mclock_max_capacity_iops_hdd too high, as OSDs may not always benefit from 
the controller's write-back cache, especially during large IO workloads that fill 
up the cache, or if the cache gets disabled because the controller's battery 
becomes defective. 
  
I'll be interested in what you decide for osd_mclock_max_capacity_iops_hdd in 
such configuration. 
  
Cheers, 
Frédéric.

 
 
 
 

-Message original-

De: Michel 
à: ceph-users 
Envoyé: vendredi 22 mars 2024 17:20 CET
Sujet : [ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 
month

Hi, 

The attempt to rerun the bench was not really a success. I got the 
following messages: 

- 

Mar 22 14:48:36 idr-osd2 ceph-osd[326854]: osd.29 83873 
maybe_override_max_osd_capacity_for_qos osd bench result - bandwidth 
(MiB/sec): 10.910 iops: 2792.876 elapsed_sec: 1.074 
Mar 22 14:48:36 idr-osd2 ceph-osd[326854]: log_channel(cluster) log 
[WRN] : OSD bench result of 2792.876456 IOPS exceeded the threshold 
limit of 500.00 IOPS for osd.29. IOPS capacity is unchanged at 
0.00 IOPS. The recommendation is to establish the osd's IOPS 
capacity using other benchmark tools (e.g. Fio) and then override 
osd_mclock_max_capacity_iops_[hdd|ssd]. 
- 

I decided as a first step to raise the osd_mclock_max_capacity_iops_hdd 
for the suspect OSD to 50. It was magic! I already managed to get 16 
over 17 scrubs/deep scrubs to be run and the last one is in progress. 

I now have to understand why this OSD had such bad perfs that 
osd_mclock_max_capacity_iops_hdd was set to such a low value... I have 
12 OSDs with an entry for their osd_mclock_max_capacity_iops_hdd and 
they are mostly on one server (with 2 OSDs on another one). I suspect 
there was a problem on these servers at some points. It is unclear why 
it is not enough to just rerun the benchmark and why a crazy value for 
an HDD is found... 

Best regards, 

Michel 

On 22/03/2024 at 14:44, Michel Jouvin wrote: 
> Hi Frédéric, 
> 
> I think you raise the right point, sorry if I misunderstood Pierre's 
> suggestion to look at OSD performances. Just before reading your 
> email, I was implementing Pierre's suggestion for max_osd_scrubs and I 
> saw the osd_mclock_max_capacity_iops_hdd for a few OSDs (I guess those 
> with a value different from the default). For the suspect OSD, the 
> value is very low, 0.145327, and I suspect it is the cause of the 
> problem. A few others have a value ~5 which I find also very low (all 
> OSDs are using the same recent HW/HDD). 
> 
> Thanks for these informations. I'll follow your suggestions to rerun 
> the benchmark and report if it improved the situation. 
> 
> Best regards, 
> 
> Michel 
> 
> On 22/03/2024 at 12:18, Frédéric Nass wrote: 
>> Hello Michel, 
>> 
>> Pierre also suggested checking the performance of this OSD's 
>> device(s) which can be done by running a ceph tell osd.x bench. 
>> 
>> One thing I can think of is that the scrubbing speed of this very OSD 
>> could be influenced by mclock scheduling, if the max IOPS capacity 
>> calculated by this OSD during its initialization is significantly 
>> lower than that of the other OSDs. 
>> 
>> What I would do is check (from this OSD's log) the calculated value 
>> for max IOPS capacity and compare it to the other OSDs'. If needed, force 
>> a recalculation by setting 'ceph config set osd.x 
>> osd_mclock_force_run_benchmark_on_init true' and restarting this OSD. 
>> 
>> Also I would: 
>> 
>> - compare the running OSD's mclock values (cephadm shell ceph daemon 
>> osd.x config show | grep mclock) to the other OSDs'. 
>> - compare ceph tell osd.x bench to the other OSDs' benchmarks. 
>> - compare the rotational status of this OSD's db and data devices to 
>> the other OSDs', to make sure things are in order. 
>> 
>> Bests, 
>> Frédéric. 
>> 
>> PS: If mclock is the culprit here, then setting osd_op_queue back to 
>> wpq for this one OSD would probably reveal it. Not sure about the 
>> implications of having a single OSD running a different scheduler in 
>> the cluster though. 
>> 
>> 
>> - On 22 Mar 24, at 10:11, Michel Jouvin 
>> michel.jou...@ijclab.in2p3.fr wrote: 
>> 
>>> Pierre, 
>>> 
>>> Yes, as mentioned in my initial email, I checked the OSD state and 
>>> fou

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Frédéric Nass
 
 
 
  
 
> The OSD runs a bench and updates osd_mclock_max_capacity_iops_{hdd,ssd} every 
> time the OSD is started. 
> If you check the OSD log you'll see it does the bench.  
  
Are you sure about the update on every start? Does the update happen only if 
the benchmark result is < 500 iops? 
  
Looks like the OSD does not remove any set configuration when the benchmark 
result is > 500 iops. Otherwise, the extremely low value that Michel reported 
earlier (less than 1 iops) would have been updated over time. 
I guess. 
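 
One way to check would be to look at what a given OSD reports right after a 
restart, e.g. (assuming a cephadm deployment, with osd.29 as the example): 
 
ceph config show osd.29 osd_mclock_max_capacity_iops_hdd 
cephadm logs --name osd.29 | grep maybe_override_max_osd_capacity_for_qos 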
  
 
 
Frédéric.  

 
 
 
 

-Original Message-

From: Kai 
To: Frédéric 
Cc: Michel ; Pierre ; 
ceph-users 
Sent: Friday, 22 March 2024 18:32 CET
Subject: Re: [ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed 
for 1 month

On Fri, Mar 22, 2024 at 04:29:21PM +0100, Frédéric Nass wrote: 
>A/ these incredibly low values were calculated a while back with an immature 
>version of the code or under some specific hardware conditions and you can 
>hope this won't happen again 

The OSD runs a bench and updates osd_mclock_max_capacity_iops_{hdd,ssd} every time 
the OSD is started. 
If you check the OSD log you'll see it does the bench. 

-- 
Kai Stian Olstad 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Kai Stian Olstad

On Fri, Mar 22, 2024 at 06:51:44PM +0100, Frédéric Nass wrote:



The OSD runs a bench and updates osd_mclock_max_capacity_iops_{hdd,ssd} every time 
the OSD is started.
If you check the OSD log you'll see it does the bench.

 
Are you sure about the update on every start? Does the update happen only if the 
benchmark result is < 500 iops?
 
Looks like the OSD does not remove any set configuration when the benchmark result 
is > 500 iops. Otherwise, the extremely low value that Michel reported earlier 
(less than 1 iops) would have been updated over time.
I guess.


I'm not completely sure, it's been a couple of months since I used mclock; I have
switched back to wpq because of a nasty bug in mclock that can freeze cluster I/O.

It could be because I was testing osd_mclock_force_run_benchmark_on_init.
The OSD had its DB on SSD and data on HDD, so the bench measured about 1700 IOPS
and the value was ignored because of the 500 limit.
So only the SSD got osd_mclock_max_capacity_iops_ssd set.
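
A quick way to confirm that kind of split (DB on flash, data on HDD) is the OSD
metadata, e.g. for osd.29 (key names may vary slightly between releases):

ceph osd metadata 29 | grep -E 'rotational|devices'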

--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Call for Interest: Managed SMB Protocol Support

2024-03-22 Thread Alexander E. Patrakov
Hi John,
> A few major features we have planned include:
> * Standalone servers (internally defined users/groups)

No concerns here

> * Active Directory Domain Member Servers

In the second case, what is the plan regarding UID mapping? Is NFS
coexistence planned, or a concurrent mount of the same directory using
CephFS directly?

In fact, I am quite skeptical, because, at least in my experience,
every customer's SAMBA configuration as a domain member is a unique
snowflake, and cephadm would need an ability to specify arbitrary UID
mapping configuration to match what the customer uses elsewhere - and
the match must be precise.

Here is what I have seen or was told about:

1. We don't care about interoperability with NFS or CephFS, so we just
let SAMBA invent whatever UIDs and GIDs it needs using the "tdb2"
idmap backend. It's completely OK that workstations get different UIDs
and GIDs, as only SIDs traverse the wire.
2. [not seen in the wild, the customer did not actually implement it,
it's a product of internal miscommunication, and I am not sure if it
is valid at all] We don't care about interoperability with CephFS,
and, while we have NFS, security guys would not allow running NFS
non-kerberized. Therefore, no UIDs or GIDs traverse the wire, only
SIDs and names. Therefore, all we need is to allow both SAMBA and NFS
to use a shared UID mapping allocated on an as-needed basis using the
"tdb2" idmap module, and it doesn't matter that these UIDs and GIDs
are inconsistent with what clients choose.
3. We don't care about ACLs at all, and don't care about CephFS
interoperability. We set ownership of all new files to root:root 0666
using whatever options are available [well, I would rather use a
dedicated nobody-style uid/gid here]. All we care about is that only
authorized workstations or authorized users can connect to each NFS or
SMB share, and we absolutely don't want them to be able to set custom
ownership or ACLs.
4. We care about NFS and CephFS file ownership being consistent with
what Windows clients see. We store all UIDs and GIDs in Active
Directory using the rfc2307 schema, and it's mandatory that all
servers (especially SAMBA - thanks to the "ad" idmap backend) respect
that and don't try to invent anything [well, they do - BUILTIN/Users
gets its GID through tdb2]. Oh, and by the way, we have this strangely
low-numbered group that everybody gets wrong unless they set "idmap
config CORP : range = 500-99".
5. We use a few static ranges for algorithmic ID translation using the
idmap rid backend. Everything works.
6. We use SSSD, which provides consistent IDs everywhere, and for a
few devices which can't use it, we configured compatible idmap rid
ranges for use with winbindd. The only problem is that we like
user-private groups, and only SSSD has support for them (although we
admit it's our fault that we enabled this non-default option).
7. We store ID mappings in non-AD LDAP and use winbindd with the
"ldap" idmap backend.

I am sure other weird but valid setups exist - please extend the list
if you can.
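
To make scenario 4 above concrete, a domain-member smb.conf in that case
typically carries something like the following (domain name and ranges are
invented for the example; the relevant parts are the backend, schema_mode and
range keys):

[global]
   security = ads
   realm = CORP.EXAMPLE.COM
   workgroup = CORP
   # catch-all for BUILTIN and other non-domain SIDs
   idmap config * : backend = tdb2
   idmap config * : range = 100000-199999
   # scenario 4: UIDs/GIDs come from the AD rfc2307 attributes, nothing is invented
   idmap config CORP : backend = ad
   idmap config CORP : schema_mode = rfc2307
   idmap config CORP : range = 500-999999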

Which of the above scenarios would be supportable without resorting to
the old way of installing SAMBA manually alongside the cluster?

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Reset health.

2024-03-22 Thread Albert Shih
Hi, 

Very basic question: 2 days ago I rebooted the whole cluster. Everything worked
fine. But I'm guessing that during the shutdown 4 OSDs were marked as crashed:

[WRN] RECENT_CRASH: 4 daemons have recently crashed
osd.381 crashed on host cthulhu5 at 2024-03-20T18:33:12.017102Z
osd.379 crashed on host cthulhu4 at 2024-03-20T18:47:13.838839Z
osd.376 crashed on host cthulhu3 at 2024-03-20T18:50:00.877536Z
osd.373 crashed on host cthulhu1 at 2024-03-20T18:56:46.887394Z

Is there any way to «clean» that? Because otherwise my icinga
complains.

I'd rather not add a downtime in icinga. 

Thanks.
-- 
Albert SHIH 🦫 🐸
France
Heure locale/Local time:
ven. 22 mars 2024 22:24:35 CET
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How you manage log

2024-03-22 Thread Albert Shih
Hi, 

With our small cluster (11 nodes) I notice that Ceph logs a lot. 

Besides keeping the logs somewhere «just in case», is there anything to check
regularly in them (to catch more serious problems early)? Or can we
trust «ceph health» and use the logs only for debugging? 

Regards
-- 
Albert SHIH 🦫 🐸
France
Heure locale/Local time:
ven. 22 mars 2024 22:28:42 CET
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reset health.

2024-03-22 Thread Murilo Morais
 You can use the `ceph crash` interface to view/archive recent crashes. [1]

To list recent crashes: ceph crash ls-new
To get information about a particular crash: ceph crash info <crash-id>
To silence a crash: ceph crash archive <crash-id>
To silence all active crashes: ceph crash archive-all

[1]
https://docs.ceph.com/en/latest/rados/operations/health-checks/#recent-crash
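
In Albert's case, archiving everything and then re-checking health should be
enough to clear the RECENT_CRASH warning:

ceph crash archive-all
ceph health detail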

On Fri, 22 Mar 2024 at 18:28, Albert Shih 
wrote:

> Hi,
>
> Very basic question: 2 days ago I rebooted the whole cluster. Everything worked
> fine. But I'm guessing that during the shutdown 4 OSDs were marked as crashed:
>
> [WRN] RECENT_CRASH: 4 daemons have recently crashed
> osd.381 crashed on host cthulhu5 at 2024-03-20T18:33:12.017102Z
> osd.379 crashed on host cthulhu4 at 2024-03-20T18:47:13.838839Z
> osd.376 crashed on host cthulhu3 at 2024-03-20T18:50:00.877536Z
> osd.373 crashed on host cthulhu1 at 2024-03-20T18:56:46.887394Z
>
> Is there any way to «clean» that? Because otherwise my icinga
> complains.
>
> I'd rather not add a downtime in icinga.
>
> Thanks.
> --
> Albert SHIH 🦫 🐸
> France
> Heure locale/Local time:
> ven. 22 mars 2024 22:24:35 CET
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph versus Zabbix: failure: no data sent

2024-03-22 Thread John Jasen
If the documentation is to be believed, it's just a matter of installing the
zabbix sender, then:

ceph mgr module enable zabbix

ceph zabbix config-set zabbix_host my-zabbix-server

(Optional) Set the identifier to the fsid.

And poof. I should now have a discovered entity on my zabbix server to add
templates to.

However, this has not worked yet on either of my ceph clusters (one RHEL,
one proxmox).
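
For completeness, the module's settings can be inspected, and the sender binary
path and the identifier can be set explicitly, like this (the values below are
placeholders, not necessarily what my setup uses):

ceph zabbix config-show
ceph zabbix config-set zabbix_sender /usr/bin/zabbix_sender
ceph zabbix config-set identifier <your-fsid>
ceph zabbix send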

Reference: https://docs.ceph.com/en/latest/mgr/zabbix/

On advice from Reddit, I installed the Ceph templates for Zabbix.
https://raw.githubusercontent.com/ceph/ceph/master/src/pybind/mgr/zabbix/zabbix_template.xml

Still no dice.  No traffic at all seems to be generated, as far as I can see
from packet traces.

... OK.

I su'ed to the ceph user on both clusters, and ran zabbix_sender:

zabbix_sender -v -z 10.0.0.1 -s "$my_fsid" -k ceph.osd_avg_pgs -o 1

Response from "10.0.0.1:10051": "processed: 1; failed: 0; total: 1; seconds
spent: 0.42"

sent: 1; skipped: 0; total: 1

As the ceph user, ceph zabbix send/discovery still fail.

I am officially stumped.

Any ideas as to which tree I should be barking up?

Thanks in advance!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Are we logging IRC channels?

2024-03-22 Thread Mark Nelson
Sure!  I think Wido just did it all unofficially, but afaik we've lost 
all of those records now.  I don't know if Wido still reads the mailing 
list but he might be able to chime in.  There was a ton of knowledge in 
the irc channel back in the day.  With slack, it feels like a lot of 
discussions have migrated into different channels, though #ceph still 
gets some community traffic (and a lot of hardware design discussion).


Mark

On 3/22/24 02:15, Alvaro Soto wrote:

Should we bring to life this again?

On Tue, Mar 19, 2024, 8:14 PM Mark Nelson wrote:


A long time ago Wido used to have a bot logging IRC afaik, but I think
that's been gone for some time.


Mark


On 3/19/24 19:36, Alvaro Soto wrote:
 > Hi Community!!!
 > Are we logging IRC channels? I ask this because a lot of people only use
 > Slack, and the Slack we use doesn't have a subscription, so messages are
 > lost after 90 days (I believe)
 >
 > I believe it's important to keep track of the technical knowledge we see
 > each day over IRC+Slack
 > Cheers!
___
ceph-users mailing list -- ceph-users@ceph.io

To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Are we logging IRC channels?

2024-03-22 Thread Gregory Farnum
I put it on the list for the next CLT. :) (though I imagine it will move to
the infrastructure meeting from there.)

On Fri, Mar 22, 2024 at 4:42 PM Mark Nelson  wrote:

> Sure!  I think Wido just did it all unofficially, but afaik we've lost
> all of those records now.  I don't know if Wido still reads the mailing
> list but he might be able to chime in.  There was a ton of knowledge in
> the irc channel back in the day.  With slack, it feels like a lot of
> discussions have migrated into different channels, though #ceph still
> gets some community traffic (and a lot of hardware design discussion).
>
> Mark
>
> On 3/22/24 02:15, Alvaro Soto wrote:
> > Should we bring to life this again?
> >
> > On Tue, Mar 19, 2024, 8:14 PM Mark Nelson wrote:
> >
> > A long time ago Wido used to have a bot logging IRC afaik, but I
> think
> > that's been gone for some time.
> >
> >
> > Mark
> >
> >
> > On 3/19/24 19:36, Alvaro Soto wrote:
> >  > Hi Community!!!
> >  > Are we logging IRC channels? I ask this because a lot of people only use
> >  > Slack, and the Slack we use doesn't have a subscription, so messages are
> >  > lost after 90 days (I believe)
> >  >
> >  > I believe it's important to keep track of the technical knowledge we see
> >  > each day over IRC+Slack
> >  > Cheers!
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > 
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> > 
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Laptop Losing Connectivity To CephFS On Sleep/Hibernation

2024-03-22 Thread duluxoz

Hi All,

I'm looking for some help/advice to solve the issue outlined in the heading.

I'm running CephFS (name: cephfs) on a Ceph Reef (v18.2.2 - latest 
update) cluster, connecting from a laptop running Rocky Linux v9.3 
(latest update) with KDE v5 (latest update).


I've set up the laptop to connect to a number of directories on CephFS 
via the `/etc/fstab` file; an example entry is: 
`ceph_user@.cephfs=/my_folder  /mnt/my_folder  ceph noatime,_netdev  0 0`.


Everything is working great; the required Ceph Key is on the laptop 
(with a chmod of 600), I can access the files on the Ceph Cluster, etc, 
etc, etc - all good.


However, whenever the laptop goes into sleep or hibernation (ie when I 
close the laptop's lid) and then comes out of sleep/hibernation (ie when 
I open the lid again), the CephFS mounts are lost. The only way to bring 
them back is to run `mount -a` as root (or via sudo). This is, as I'm 
sure you'll agree, not a long-term viable option - especially as this is 
running as a pilot project and the eventual end-users won't have access 
to root/sudo.


So I'm seeking the collective wisdom of the community in how to solve 
this issue.


I've taken a brief look at autofs, and even half-heartedly had a go at 
configuring it, but it didn't seem to work - honestly, it was late and I 
wanted to get home after a long day.  :-)


Is this the solution to my issue, or is there a better way to construct 
the fstab entries, or is there another solution I haven't found yet in 
the doco or via google-foo?
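
For example, would an fstab entry along these lines - adding the kernel 
client's session-recovery option plus systemd automount, neither of which I've 
tested yet - be the right direction?

`ceph_user@.cephfs=/my_folder  /mnt/my_folder  ceph 
noatime,_netdev,recover_session=clean,noauto,x-systemd.automount,x-systemd.idle-timeout=60  0 0`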


All help and advice greatly appreciated - thanks in advance

Cheers

Dulux-Oz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Mounting A RBD Via Kernal Modules

2024-03-22 Thread duluxoz

Hi All,

I'm trying to mount a Ceph Reef (v18.2.2 - latest version) RBD Image as 
a 2nd HDD on a Rocky Linux v9.3 (latest version) host.


The EC pool has been created and initialised and the image has been 
created.


The ceph-common package has been installed on the host.

The correct keyring has been added to the host (with a chmod of 600) and 
the host has been configured with an rbdmap file as follows: 
`my_pool.meta/my_image 
id=ceph_user,keyring=/etc/ceph/ceph.client.ceph_user.keyring`.


When running the rbdmap.service the image appears as both `/dev/rbd0` 
and `/dev/rbd/my_pool.meta/my_image`, exactly as the Ceph Doco says it 
should.


So everything *appears* AOK up to this point.

My question now is: Should I run `mkfs xfs` on `/dev/rbd0` *before* or 
*after* I try to mount the image (via fstab: 
`/dev/rbd/my_pool.meta/my_image  /mnt/my_image  xfs  noauto  0 0` - as 
per the Ceph doco)?
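
To be explicit about what I mean, the mkfs-first variant I tried was roughly 
(run once after rbdmap has mapped the image, then mount via the fstab entry):

mkfs.xfs /dev/rbd/my_pool.meta/my_image
mount /mnt/my_image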


The reason I ask is that I've tried this *both* ways and all I get is an 
error message (sorry, I can't remember the exact message and I'm not 
currently in front of the host to confirm it  :-) - but from memory it 
was something about not being able to recognise the first block - or 
something like that).


So, I'm obviously doing something wrong, but I can't work out what 
exactly (and the logs don't show any useful info).


Do I, for instance, have the process wrong / don't understand the exact 
process, or is there something else wrong?


All comments/suggestions/etc greatly appreciated - thanks in advance

Cheers

Dulux-Oz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io