[ceph-users] Re: Ceph orchestrator not refreshing device list

2024-10-29 Thread Eugen Block

Hi,

I haven't done this in production yet either, but in a test cluster I  
threw away that config-key and it just gets regenerated. So I suppose  
one could try that without any big risk.

Just a note, this should also work (get instead of dump):

ceph config-key get mgr/cephadm/host.ceph-osd31.devices.0 | jq  
.devices[].created
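
If you want to try dropping the cached entry, a minimal sketch (the key name
is taken from your output; 'ceph mgr fail' is just one way to trigger a fresh
inventory run):

ceph config-key rm mgr/cephadm/host.ceph-osd31.devices.0
ceph mgr fail

After the mgr failover the key should reappear with a current 'created'
timestamp once the next device refresh completes.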



Quoting Bob Gibson :

I enabled debug logging with `ceph config set mgr  
mgr/cephadm/log_to_cluster_level debug` and viewed the logs with  
`ceph -W cephadm --watch-debug`. I can see the orchestrator  
refreshing the device list, and this is reflected in the  
`ceph-volume.log` file on the target osd nodes. When I restart the  
mgr, `ceph orch device ls` reports each device with “5w ago” under  
the “REFRESHED” column. After the orchestrator attempts to refresh  
the device list, `ceph orch device ls` stops outputting any data at  
all until I restart the mgr again.


I discovered that I can query the cached device data using `ceph  
config-key dump`. On the problematic cluster, the `created`  
attribute is stale, e.g.


ceph config-key dump | jq -r  
.'"mgr/cephadm/host.ceph-osd31.devices.0"' | jq .devices[].created

"2024-09-23T17:56:44.914535Z"
"2024-09-23T17:56:44.914569Z"
"2024-09-23T17:56:44.914591Z"
"2024-09-23T17:56:44.914612Z"
"2024-09-23T17:56:44.914632Z"
"2024-09-23T17:56:44.914652Z"
"2024-09-23T17:56:44.914672Z"
"2024-09-23T17:56:44.914692Z"
"2024-09-23T17:56:44.914711Z"
"2024-09-23T17:56:44.914732Z"

whereas on working clusters the `created` attribute is set to the  
time the device information was last cached, e.g.


ceph config-key dump | jq -r  
.'"mgr/cephadm/host.ceph-osd1.devices.0"' | jq .devices[].created

"2024-10-28T21:49:29.510593Z"
"2024-10-28T21:49:29.510635Z"
"2024-10-28T21:49:29.510657Z"
"2024-10-28T21:49:29.510678Z"

It appears that the orchestrator is polling the devices but failing  
to update the cache for some reason. It would be interesting to see  
what happens if I removed one of these device entries from the  
cache, but the cluster is in production so I’m hesitant to poke at it.


We have a maintenance window scheduled in December which will  
provide an opportunity to perform a complete restart of the cluster.  
Hopefully that will clean things up. In the meantime, I’ve set all  
devices to be unmanaged, and the cluster is otherwise healthy, so  
unless anyone has any other ideas to offer I guess I’ll just leave  
things as-is until the maintenance window.
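
For reference, one way to set the OSD service to unmanaged (a sketch; the
exported service spec name may differ in other clusters):

ceph orch ls osd --export > osd-spec.yaml
# edit osd-spec.yaml and add "unmanaged: true" to the spec
ceph orch apply -i osd-spec.yaml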


Cheers,
/rjg

On Oct 25, 2024, at 10:31 AM, Bob Gibson  wrote:

[…]
My hunch is that some persistent state is corrupted, or there’s  
something else preventing the orchestrator from successfully  
refreshing its device status, but I don’t know how to troubleshoot  
this. Any ideas?


I don't think this is related to the 'osd' service. As suggested by  
Tobi, enabling cephadm debug will tell you more.


Agreed. I’ll dig through the logs some more today to see if I can  
spot any problems.


Cheers,
/rjg

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS and stretched clusters

2024-10-29 Thread Sake Ceph
Hi all
We successfully deployed a stretched cluster and everything is working fine. But is it 
possible to assign the active MDS services to one DC and the standby-replay ones to 
the other?

We're running 18.2.4, deployed via cephadm, using 4 MDS servers with 2 active 
MDS on pinned ranks and 2 in standby-replay mode.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Crash Module "RADOS permission denied"

2024-10-29 Thread Tim Holloway

This is a common error on my system (Pacific).


It appears that there is internal confusion as to where the crash 
support stuff lives - whether it's new-style (administered and under 
/var/lib/ceph/<fsid>) or legacy style (/var/lib/ceph). One way to fake it 
out was to manually create a minimal crash infrastructure (keyring) 
under /var/lib/ceph/crash and assign appropriate access rights.
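
A rough sketch of that kind of workaround (the client name, caps and paths
here are assumptions based on the crash docs; adjust them to whatever
ceph-crash on the node actually looks for):

ceph auth get-or-create client.crash.$(hostname) mon 'profile crash' mgr 'profile crash' \
  -o /etc/ceph/ceph.client.crash.$(hostname).keyring
chmod 600 /etc/ceph/ceph.client.crash.$(hostname).keyring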



  Tim

On 10/29/24 10:41, mailing-lists wrote:

Hey Cephers,

I was investigating some other issue when I stumbled across this. I 
am not sure if this is "as intended" or faulty. This is a cephadm 
cluster on reef 18.2.4, containerized with docker.


The ceph-crash module states that it can't find its key and that it 
can't access RADOS.


As a preface, this is what its key looks like and what its caps are (don't 
worry about leaking the key, it's a test cluster):

client.crash.bi-ubu-srv-ceph2-01
    key: AQBi5CBnedEKORAAgaTwkiqO7KJ+wzu8+EGXEQ==
    caps: [mgr] profile crash
    caps: [mon] profile crash

Looks like it should. As per documentation: 
https://docs.ceph.com/en/reef/mgr/crash/#automated-collection


The key on the node/directory of this ceph-crash container is the same.

These are the docker logs of the container running ceph-crash.

INFO:ceph-crash:pinging cluster to exercise our key
2024-10-29T13:34:28.261+ 7f7abb1af640 -1 auth: unable to find a 
keyring on 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: 
(2) No such file or directory
2024-10-29T13:34:28.261+ 7f7abb1af640 -1 
AuthRegistry(0x7f7ab4067810) no keyring found at 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, 
disabling cephx
2024-10-29T13:34:28.265+ 7f7abb1af640 -1 auth: unable to find a 
keyring on 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: 
(2) No such file or directory
2024-10-29T13:34:28.265+ 7f7abb1af640 -1 
AuthRegistry(0x7f7abb1ae0c0) no keyring found at 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, 
disabling cephx
2024-10-29T13:34:28.265+ 7f7ab99ac640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1]
2024-10-29T13:34:28.265+ 7f7aba1ad640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1]
2024-10-29T13:34:28.265+ 7f7aba9ae640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1]
2024-10-29T13:34:28.265+ 7f7abb1af640 -1 monclient: authenticate 
NOTE: no keyring found; disabled cephx authentication

[errno 13] RADOS permission denied (error connecting to the cluster)
INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s


As we can see here, it says that it couldn't find its key. After some 
online research, I found that some key names are not interpreted 
correctly. Well, just to test, I've changed the unit.run file of 
the ceph-crash container and restarted it.


This is the part I've changed:

From

/var/lib/ceph/a12b3ade-2849-11ed-9b46-c5b62beb178a/crash.bi-ubu-srv-ceph2-01/keyring:/etc/ceph/ceph.client.crash.bi-ubu-srv-ceph2-01.keyring 



To

/var/lib/ceph/a12b3ade-2849-11ed-9b46-c5b62beb178a/crash.bi-ubu-srv-ceph2-01/keyring:/etc/ceph/keyring 



Then the logs look different, but the permission denied error is 
still there.


INFO:ceph-crash:pinging cluster to exercise our key
2024-10-29T13:36:42.273+ 7f1d04b49640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support 
[2,1]
2024-10-29T13:36:42.277+ 7f1cff7fe640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support 
[2,1]
2024-10-29T13:36:42.281+ 7f1cf640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support 
[2,1]

[errno 13] RADOS permission denied (error connecting to the cluster)
INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s


Can anybody tell me if this is "normal"?

I think this is suspicious, because, for example, if I do ceph crash 
ls, it does not yield anything, while there is definitely something on 
a node inside the folder /var/lib/ceph/xyz/crash/



Thanks in advance and best wishes!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS and stretched clusters

2024-10-29 Thread Gregory Farnum
No, unfortunately this needs to be done at a higher level and is not
included in Ceph right now. Rook may be able to do this, but I don't think
cephadm does.
Adam, is there some way to finagle this with pod placement rules (ie,
tagging nodes as mds and mds-standby, and then assigning special mds config
info to corresponding pods)?
-Greg

On Tue, Oct 29, 2024 at 12:46 PM Sake Ceph  wrote:

> I hope someone of the development team can share some light on this. Will
> search the tracker if some else made a request about this.
>
> > Op 29-10-2024 16:02 CET schreef Frédéric Nass <
> frederic.n...@univ-lorraine.fr>:
> >
> >
> > Hi,
> >
> > I'm not aware of any service settings that would allow that.
> >
> > You'll have to monitor each MDS state and restart any non-local active
> MDSs to reverse roles.
> >
> > Regards,
> > Frédéric.
> >
> > - Le 29 Oct 24, à 14:06, Sake Ceph c...@paulusma.eu a écrit :
> >
> > > Hi all
> > > We deployed successfully a stretched cluster and all is working fine.
> But is it
> > > possible to assign the active MDS services in one DC and the
> standby-replay in
> > > the other?
> > >
> > > We're running 18.2.4, deployed via cephadm. Using 4 MDS servers with 2
> active
> > > MDS on pinnend ranks and 2 in standby-replay mode.
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS and stretched clusters

2024-10-29 Thread Travis Nielsen
Yes, with Rook this is possible by adding zone anti-affinity for the MDS
pods.

Travis

On Tue, Oct 29, 2024 at 3:35 PM Gregory Farnum  wrote:

> No, unfortunately this needs to be done at a higher level and is not
> included in Ceph right now. Rook may be able to do this, but I don't think
> cephadm does.
> Adam, is there some way to finagle this with pod placement rules (ie,
> tagging nodes as mds and mds-standby, and then assigning special mds config
> info to corresponding pods)?
> -Greg
>
> On Tue, Oct 29, 2024 at 12:46 PM Sake Ceph  wrote:
>
> > I hope someone of the development team can share some light on this. Will
> > search the tracker if some else made a request about this.
> >
> > > Op 29-10-2024 16:02 CET schreef Frédéric Nass <
> > frederic.n...@univ-lorraine.fr>:
> > >
> > >
> > > Hi,
> > >
> > > I'm not aware of any service settings that would allow that.
> > >
> > > You'll have to monitor each MDS state and restart any non-local active
> > MDSs to reverse roles.
> > >
> > > Regards,
> > > Frédéric.
> > >
> > > - Le 29 Oct 24, à 14:06, Sake Ceph c...@paulusma.eu a écrit :
> > >
> > > > Hi all
> > > > We deployed successfully a stretched cluster and all is working fine.
> > But is it
> > > > possible to assign the active MDS services in one DC and the
> > standby-replay in
> > > > the other?
> > > >
> > > > We're running 18.2.4, deployed via cephadm. Using 4 MDS servers with
> 2
> > active
> > > > MDS on pinnend ranks and 2 in standby-replay mode.
> > > > ___
> > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: no recovery running

2024-10-29 Thread David Turner
I was running into that as well. Setting
`osd_mclock_override_recovery_settings` [1] to true allowed me to manage
osd_max_backfills again and get recovery to start happening again. It's on
my todo list to understand mclock profiles, but resizing PGs was a
nightmare with it. Changing to override the recovery settings saved me.

[1]
https://docs.ceph.com/en/reef/rados/configuration/mclock-config-ref/#confval-osd_mclock_override_recovery_settings
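
For anyone hitting the same thing, a sketch of the kind of settings involved
(the numeric values here are only examples; tune them to your cluster):

ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_max_active 8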

On Fri, Oct 25, 2024 at 11:13 AM Joffrey  wrote:

> HI,
>
>
> This is my cluster:
>
>   cluster:
> id: c300532c-51fa-11ec-9a41-0050569c3b55
> health: HEALTH_WARN
> Degraded data redundancy: 2062374/1331064781 objects degraded
> (0.155%), 278 pgs degraded, 40 pgs undersized
> 2497 pgs not deep-scrubbed in time
> 2497 pgs not scrubbed in time
>
>   services:
> mon: 3 daemons, quorum hbgt-ceph1-mon1,hbgt-ceph1-mon2,hbgt-ceph1-mon3
> (age 9d)
> mgr: hbgt-ceph1-mon3.gmfzqm(active, since 10d), standbys:
> hbgt-ceph1-mon2.nteihj, hbgt-ceph1-mon1.thrnnu
> osd: 96 osds: 96 up (since 9d), 96 in (since 45h); 1588 remapped pgs
> rgw: 3 daemons active (3 hosts, 2 zones)
>
>   data:
> pools:   16 pools, 2497 pgs
> objects: 266.22M objects, 518 TiB
> usage:   976 TiB used, 808 TiB / 1.7 PiB avail
> pgs: 2062374/1331064781 objects degraded (0.155%)
>  349917519/1331064781 objects misplaced (26.289%)
>  1312 active+remapped+backfill_wait
>  864  active+clean
>  199  active+recovery_wait+degraded+remapped
>  38   active+recovery_wait+degraded
>  33   active+undersized+degraded+remapped+backfill_wait
>  33   active+recovery_wait+remapped
>  7active+recovery_wait
>  6active+undersized+degraded+remapped+backfilling
>  2active+recovering+remapped
>  1active+remapped+backfilling
>  1active+recovering+degraded+remapped
>  1active+recovery_wait+undersized+degraded+remapped
>
>   io:
> client:   683 KiB/s rd, 2.2 KiB/s wr, 51 op/s rd, 2 op/s wr
>
>
> No recovery is running and I don't understand why.
> I have free space:
>
> ID   CLASS  WEIGHT      REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
>  -1         1784.12231         -  1.7 PiB  976 TiB  895 TiB  298 GiB  4.1 TiB  808 TiB  54.72  1.00    -          root default
>  -5          208.09680         -  208 TiB  142 TiB  130 TiB   51 GiB  605 GiB   66 TiB  68.14  1.25    -          host hbgt-ceph1-osd01
>   1    hdd    17.34140   1.0      17 TiB   11 TiB   11 TiB   33 KiB   49 GiB  5.9 TiB  66.16  1.21  136      up  osd.1
>   3    hdd    17.34140   1.0      17 TiB   11 TiB   10 TiB   23 GiB   49 GiB  6.3 TiB  63.80  1.17  139      up  osd.3
>   5    hdd    17.34140   1.0      17 TiB   13 TiB   12 TiB  139 MiB   53 GiB  4.8 TiB  72.31  1.32  142      up  osd.5
>   7    hdd    17.34140   1.0      17 TiB   12 TiB   11 TiB   11 GiB   51 GiB  5.6 TiB  67.97  1.24  145      up  osd.7
>   9    hdd    17.34140   1.0      17 TiB   11 TiB   10 TiB  2.2 GiB   49 GiB  6.0 TiB  65.67  1.20  140      up  osd.9
>  11    hdd    17.34140   1.0      17 TiB   12 TiB   11 TiB  329 MiB   50 GiB  5.5 TiB  68.42  1.25  145      up  osd.11
>  13    hdd    17.34140   1.0      17 TiB   12 TiB   11 TiB  1.5 GiB   52 GiB  5.1 TiB  70.45  1.29  153      up  osd.13
>  15    hdd    17.34140   1.0      17 TiB   12 TiB   11 TiB   61 KiB   48 GiB  5.7 TiB  66.85  1.22  144      up  osd.15
>  17    hdd    17.34140   1.0      17 TiB   11 TiB  9.5 TiB  272 MiB   45 GiB  6.8 TiB  60.63  1.11  120      up  osd.17
>  19    hdd    17.34140   1.0      17 TiB   11 TiB   10 TiB   12 GiB   50 GiB  5.9 TiB  65.90  1.20  134      up  osd.19
>  21    hdd    17.34140   1.0      17 TiB   13 TiB   12 TiB  1.6 GiB   57 GiB  4.1 TiB  76.49  1.40  152      up  osd.21
>  23    hdd    17.34140   1.0      17 TiB   13 TiB   12 TiB   31 KiB   54 GiB  4.7 TiB  73.10  1.34  124      up  osd.23
>  -3          208.09680         -  208 TiB  146 TiB  134 TiB   64 GiB  629 GiB   62 TiB  70.05  1.28    -          host hbgt-ceph1-osd02
>   0    hdd    17.34140   1.0      17 TiB   11 TiB  9.8 TiB   22 GiB   49 GiB  6.6 TiB  62.07  1.13  124      up  osd.0
>   2    hdd    17.34140   1.0      17 TiB   12 TiB   11 TiB  1.7 GiB   52 GiB  5.2 TiB  70.14  1.28  150      up  osd.2
>   4    hdd    17.34140   1.0      17 TiB   12 TiB   11 TiB  1.8 GiB   48 GiB  5.8 TiB  66.83  1.22  152      up  osd.4
>   6    hdd    17.34140   0.85004   17 TiB   13 TiB   12 TiB   11 GiB   58 GiB  4.0 TiB  76.85  1.40  153      up  osd.6
>   8    hdd    17.34140   1.0      17 TiB   12 TiB

[ceph-users] Re: Destroyed OSD clinging to wrong disk

2024-10-29 Thread Tim Holloway
Take care when reading the output of "ceph osd metadata". When you are 
running the OSD as an administered service, it's running in a container, 
and a container is a miniature VM. So, for example, it may report your 
OS as "CentOS Stream 8" even if your actual machine is running Ubuntu.



The biggest pitfall is in paths, because in certain cases - definitely 
for OSDs - internally the path for the OSD metadata and data store will 
be /var/lib/ceph/osd, but the actual path in the machine's OS will be 
/var/lib/ceph/<fsid>/osd, where the container simply mounts that for its 
internal path.


In other words, "ceph osd metadata" formulates its reports by having the 
containers assemble the report data and the output is thus the OSD's 
internal view, not your server's view.
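
A quick way to cross-check from the host side (assuming jq is available there)
is to compare the persistent paths the OSD reports with what the host currently
maps them to, for example:

ceph osd metadata 12 | jq -r '.device_paths'
ls -l /dev/disk/by-path/ | grep sas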


   Tim


On 10/28/24 14:01, Dave Hall wrote:

Hello.

Thanks to Rober's reply to 'Influencing the osd.id', 
I've learned two new commands today. I can now see that 'ceph osd 
metadata' confirms that I have two OSDs pointing to the same physical 
disk name:


root@ceph09:/# ceph osd metadata 12 | grep sdi
    "bluestore_bdev_devices": "sdi",
    "device_ids":

"nvme0n1=SAMSUNG_MZPLL1T6HEHP-3_S3HBNA0KA03264,sdi=SEAGATE_ST12000NM0027_*ZJV5TX47*C9470ZWA",
    "device_paths":

"nvme0n1=/dev/disk/by-path/pci-:83:00.0-nvme-1,sdi=/dev/disk/by-path/pci-:41:00.0-sas-phy18-lun-0",
    "devices": "nvme0n1,sdi",
    "objectstore_numa_unknown_devices": "nvme0n1,sdi",
root@ceph09:/# ceph osd metadata 9 | grep sdi
    "bluestore_bdev_devices": "sdi",
    "device_ids":

"nvme1n1=Samsung_SSD_983_DCT_M.2_1.92TB_S48DNC0N701016D,sdi=SEAGATE_ST12000NM0027_*ZJV5SMTQ*C9128FE0",
    "device_paths":

"nvme1n1=/dev/disk/by-path/pci-:01:00.0-nvme-1,sdi=/dev/disk/by-path/pci-:41:00.0-sas-phy6-lun-0",
    "devices": "nvme1n1,sdi",
    "objectstore_numa_unknown_devices": "sdi",


Even though OSD 12 still says sdi, at least it is pointing to 
the serial number of the failed disk. However, the disk with that 
serial number is currently residing at /dev/sdc.


Is there a way to force the record for the destroyed OSD to point to 
/dev/sdc?


Thanks.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu

On Mon, Oct 28, 2024 at 11:47 AM Dave Hall  wrote:

Hello.

The following is on a Reef Podman installation:

In attempting to deal over the weekend with a failed OSD disk, I
have somehow managed to have two OSDs pointing to the same HDD, as
shown below.

image.png

To be sure, the failure occurred on OSD.12, which was pointing to
/dev/sdi.

I disabled the systemd unit for OSD.12 because it kept
restarting.  I then destroyed it.

When I physically removed the failed disk and rebooted the system,
the disk enumeration changed.  So, before the reboot, OSD.12 was
using /dev/sdi.  After the reboot, OSD.9 moved to /dev/sdi.

I didn't know that I had an issue until 'ceph-volume lvm prepare'
failed.  It was in the process of investigating this that I found
the above.  Right now I have reinserted the failed disk and
rebooted, hoping that OSD.12 would find its old disk by some other
means, but no joy.

My concern is that if I run 'ceph osd rm' I could take out OSD.9. 
I could take the precaution of marking OSD.9 out and let it drain,
but I'd rather not.  I am, perhaps, more inclined to manually
clear the lingering configuration associated with OSD.12 if
someone could send me the list of commands. Otherwise, I'm open to
suggestions.

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu


___
ceph-users mailing list --ceph-users@ceph.io
To unsubscribe send an email toceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph Crash Module "RADOS permission denied"

2024-10-29 Thread mailing-lists

Hey Cephers,

I was investigating some other issue when I stumbled across this. I am 
not sure if this is "as intended" or faulty. This is a cephadm cluster 
on reef 18.2.4, containerized with docker.


The ceph-crash module states that it can't find its key and that it can't 
access RADOS.


As a preface, this is what its key looks like and what its caps are (don't 
worry about leaking the key, it's a test cluster):

client.crash.bi-ubu-srv-ceph2-01
    key: AQBi5CBnedEKORAAgaTwkiqO7KJ+wzu8+EGXEQ==
    caps: [mgr] profile crash
    caps: [mon] profile crash

Looks like it should. As per documentation: 
https://docs.ceph.com/en/reef/mgr/crash/#automated-collection


The key on the node/directory of this ceph-crash container is the same.

These are the docker logs of the container running ceph-crash.

INFO:ceph-crash:pinging cluster to exercise our key
2024-10-29T13:34:28.261+ 7f7abb1af640 -1 auth: unable to find a 
keyring on 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: 
(2) No such file or directory
2024-10-29T13:34:28.261+ 7f7abb1af640 -1 
AuthRegistry(0x7f7ab4067810) no keyring found at 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, 
disabling cephx
2024-10-29T13:34:28.265+ 7f7abb1af640 -1 auth: unable to find a 
keyring on 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: 
(2) No such file or directory
2024-10-29T13:34:28.265+ 7f7abb1af640 -1 
AuthRegistry(0x7f7abb1ae0c0) no keyring found at 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, 
disabling cephx
2024-10-29T13:34:28.265+ 7f7ab99ac640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1]
2024-10-29T13:34:28.265+ 7f7aba1ad640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1]
2024-10-29T13:34:28.265+ 7f7aba9ae640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1]
2024-10-29T13:34:28.265+ 7f7abb1af640 -1 monclient: authenticate 
NOTE: no keyring found; disabled cephx authentication

[errno 13] RADOS permission denied (error connecting to the cluster)
INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s


As we can see here, it says that it couldn't find its key. After some 
online research, I found that some key names are not interpreted 
correctly. Well, just to test, I've changed the unit.run file of 
the ceph-crash container and restarted it.


This is the part I've changed:

From

/var/lib/ceph/a12b3ade-2849-11ed-9b46-c5b62beb178a/crash.bi-ubu-srv-ceph2-01/keyring:/etc/ceph/ceph.client.crash.bi-ubu-srv-ceph2-01.keyring

To

/var/lib/ceph/a12b3ade-2849-11ed-9b46-c5b62beb178a/crash.bi-ubu-srv-ceph2-01/keyring:/etc/ceph/keyring

Then the logs look different, but the permission denied error is still 
there.


INFO:ceph-crash:pinging cluster to exercise our key
2024-10-29T13:36:42.273+ 7f1d04b49640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2024-10-29T13:36:42.277+ 7f1cff7fe640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2024-10-29T13:36:42.281+ 7f1cf640 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [2,1]

[errno 13] RADOS permission denied (error connecting to the cluster)
INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s


Can anybody tell me if this is "normal"?

I think this is suspicious, because, for example, if I do ceph crash ls, 
it does not yield anything, while there is definitely something on a 
node inside the folder /var/lib/ceph/xyz/crash/
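
For context, my understanding is that ceph-crash essentially does the
equivalent of the following for each crash directory, so once the keyring
problem is solved these should show up (a sketch; the fsid and directory
name are placeholders):

ceph -n client.crash.$(hostname) --keyring /etc/ceph/ceph.client.crash.$(hostname).keyring \
  crash post -i /var/lib/ceph/<fsid>/crash/<timestamp_uuid>/meta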



Thanks in advance and best wishes!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS and stretched clusters

2024-10-29 Thread Frédéric Nass
Hi,

I'm not aware of any service settings that would allow that.

You'll have to monitor each MDS state and restart any non-local active MDSs to 
reverse roles.

Regards,
Frédéric.

- On 29 Oct 24, at 14:06, Sake Ceph c...@paulusma.eu wrote:

> Hi all
> We deployed successfully a stretched cluster and all is working fine. But is 
> it
> possible to assign the active MDS services in one DC and the standby-replay in
> the other?
> 
> We're running 18.2.4, deployed via cephadm. Using 4 MDS servers with 2 active
> MDS on pinnend ranks and 2 in standby-replay mode.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS and stretched clusters

2024-10-29 Thread Sake Ceph
I hope someone from the development team can shed some light on this. I will 
search the tracker to see if someone else has made a request about this. 

> Op 29-10-2024 16:02 CET schreef Frédéric Nass 
> :
> 
>  
> Hi,
> 
> I'm not aware of any service settings that would allow that.
> 
> You'll have to monitor each MDS state and restart any non-local active MDSs 
> to reverse roles.
> 
> Regards,
> Frédéric.
> 
> - Le 29 Oct 24, à 14:06, Sake Ceph c...@paulusma.eu a écrit :
> 
> > Hi all
> > We deployed successfully a stretched cluster and all is working fine. But 
> > is it
> > possible to assign the active MDS services in one DC and the standby-replay 
> > in
> > the other?
> > 
> > We're running 18.2.4, deployed via cephadm. Using 4 MDS servers with 2 
> > active
> > MDS on pinnend ranks and 2 in standby-replay mode.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Destroyed OSD clinging to wrong disk

2024-10-29 Thread Dave Hall
Tim,

Thank you for your guidance.  Your points are completely understood.  It
was more that I couldn't figure out why the Dashboard was telling me that
the destroyed OSD was still using /dev/sdi when the physical disk with that
serial number was at /dev/sdc, and when another OSD was also reporting
/dev/sdi.  I figured that there must be some information buried somewhere.
I don't know where this metadata comes from or how it gets updated when
things like 'drive letters' change, but the metadata matched what the
dashboard showed, so now I know something new.

Regarding the process for bringing the OSD back online with a new HDD, I am
still having some difficulties.  I used the steps in the Adding/Removing
OSDs document under Removing the OSD, and the OSD mostly appears to be
gone.  However, attempts to use 'ceph-volume lvm prepare' to build the
replacement OSD are failing, and the same thing happens with 'ceph orch daemon add
osd'.

I think the problem might be that the NVMe LV that was the WAL/DB for the
failed OSD did not get cleaned up, but on my systems 4 OSDs use the same
NVMe drive for WAL/DB, so I'm not sure how to proceed.

Any suggestions would be welcome.
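
One possible direction for the leftover DB LV would be roughly this (a sketch;
the VG/LV names are placeholders, and only the DB LV that belonged to osd.12
should be touched, since the other OSDs share the same NVMe drive):

ceph-volume lvm list            # find the db LV still tagged for osd.12
ceph-volume lvm zap --destroy /dev/<vg_name>/<db_lv_for_osd_12>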

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu


On Tue, Oct 29, 2024 at 3:13 PM Tim Holloway  wrote:

> Take care when reading the output of "ceph osd metadata". When you are
> running the OSD as an administered service, it's running in a container,
> and a container is a miniature VM. So, for example, it may report your
> OS as "CentOS Stream 8" even if your actual machine is running Ubuntu.
>
>
> The biggest pitfall is in paths, because in certain cases - definitely
> for OSDs - internally the path for the OSD metadata and data store will
> be /var/lib/ceph/osd, but the actual path in the machine's OS will be
> /var/lib/ceph//osd, where the container simply mounts that for its
> internal path.
>
> In other words, "ceph osd metadata" formulates its reports by having the
> containers assemble the report data and the output is thus the OSD's
> internal view, not your server's view.
>
> Tim
>
>
> On 10/28/24 14:01, Dave Hall wrote:
> > Hello.
> >
> > Thanks to Rober's reply to 'Influencing the osd.id '
> > I've learned two new commands today.  I can now see that 'ceph osd
> > metadata'  confirms that I have two OSDs pointing to the same physical
> > disk name:
> >
> > root@ceph09:/# ceph osd metadata 12 | grep sdi
> > "bluestore_bdev_devices": "sdi",
> > "device_ids":
> >
>  
> "nvme0n1=SAMSUNG_MZPLL1T6HEHP-3_S3HBNA0KA03264,sdi=SEAGATE_ST12000NM0027_*ZJV5TX47*C9470ZWA",
> > "device_paths":
> >
>  
> "nvme0n1=/dev/disk/by-path/pci-:83:00.0-nvme-1,sdi=/dev/disk/by-path/pci-:41:00.0-sas-phy18-lun-0",
> > "devices": "nvme0n1,sdi",
> > "objectstore_numa_unknown_devices": "nvme0n1,sdi",
> > root@ceph09:/# ceph osd metadata 9 | grep sdi
> > "bluestore_bdev_devices": "sdi",
> > "device_ids":
> >
>  
> "nvme1n1=Samsung_SSD_983_DCT_M.2_1.92TB_S48DNC0N701016D,sdi=SEAGATE_ST12000NM0027_*ZJV5SMTQ*C9128FE0",
> > "device_paths":
> >
>  
> "nvme1n1=/dev/disk/by-path/pci-:01:00.0-nvme-1,sdi=/dev/disk/by-path/pci-:41:00.0-sas-phy6-lun-0",
> > "devices": "nvme1n1,sdi",
> > "objectstore_numa_unknown_devices": "sdi",
> >
> >
> > However, even though OSD 12 is saying sdi, at least it is pointing to
> > the serial number of the failed disk.  However, the disk with that
> > serial number is currently residing at /dev/sdc.
> >
> > Is there a way to force the record for the destroyed OSD to point to
> > /dev/sdc?
> >
> > Thanks.
> >
> > -Dave
> >
> > --
> > Dave Hall
> > Binghamton University
> > kdh...@binghamton.edu
> >
> > On Mon, Oct 28, 2024 at 11:47 AM Dave Hall 
> wrote:
> >
> > Hello.
> >
> > The following is on a Reef Podman installation:
> >
> > In attempting to deal over the weekend with a failed OSD disk, I
> > have somehow managed to have two OSDs pointing to the same HDD, as
> > shown below.
> >
> > image.png
> >
> > To be sure, the failure occurred on OSD.12, which was pointing to
> > /dev/sdi.
> >
> > I disabled the systemd unit for OSD.12 because it kept
> > restarting.  I then destroyed it.
> >
> > When I physically removed the failed disk and rebooted the system,
> > the disk enumeration changed.  So, before the reboot, OSD.12 was
> > using /dev/sdi.  After the reboot, OSD.9 moved to /dev/sdi.
> >
> > I didn't know that I had an issue until 'ceph-volume lvm prepare'
> > failed.  It was in the process of investigating this that I found
> > the above.  Right now I have reinserted the failed disk and
> > rebooted, hoping that OSD.12 would find its old disk by some other
> > means, but no joy.
> >
> > My concern is that if I run 'ceph osd rm' I could take out OSD.9.
> > I could take the precaution of marking OSD.9 out and let it drai

[ceph-users] Re: MDS and stretched clusters

2024-10-29 Thread Frédéric Nass
But you don't get to choose which one is active and which one is standby, as 
these are states that permute over time, not configurations, or do you?
 
I mean there's no way to tell Rook 'I want this one to be preferably active' 
and have the Rook operator monitor the MDSs and restart the non-local one if it 
ever becomes active, so that the locally preferred one takes over. Or does Rook 
handle this any better than cephadm?
 
Frédéric.


From: Travis Nielsen 
Sent: Tuesday, 29 October 2024 23:56
To: Gregory Farnum
Cc: Sake Ceph; Adam King; ceph-users 
Subject: [ceph-users] Re: MDS and stretched clusters

Yes, with Rook this is possible by adding zone anti-affinity for the MDS
pods.

Travis

On Tue, Oct 29, 2024 at 3:35 PM Gregory Farnum  wrote:

> No, unfortunately this needs to be done at a higher level and is not
> included in Ceph right now. Rook may be able to do this, but I don't think
> cephadm does.
> Adam, is there some way to finagle this with pod placement rules (ie,
> tagging nodes as mds and mds-standby, and then assigning special mds config
> info to corresponding pods)?
> -Greg
>
> On Tue, Oct 29, 2024 at 12:46 PM Sake Ceph  wrote:
>
> > I hope someone of the development team can share some light on this. Will
> > search the tracker if some else made a request about this.
> >
> > > Op 29-10-2024 16:02 CET schreef Frédéric Nass <
> > frederic.n...@univ-lorraine.fr>:
> > >
> > >
> > > Hi,
> > >
> > > I'm not aware of any service settings that would allow that.
> > >
> > > You'll have to monitor each MDS state and restart any non-local active
> > MDSs to reverse roles.
> > >
> > > Regards,
> > > Frédéric.
> > >
> > > - Le 29 Oct 24, à 14:06, Sake Ceph c...@paulusma.eu a écrit :
> > >
> > > > Hi all
> > > > We deployed successfully a stretched cluster and all is working fine.
> > But is it
> > > > possible to assign the active MDS services in one DC and the
> > standby-replay in
> > > > the other?
> > > >
> > > > We're running 18.2.4, deployed via cephadm. Using 4 MDS servers with
> 2
> > active
> > > > MDS on pinnend ranks and 2 in standby-replay mode.
> > > > ___
> > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] why performance difference between 'rados bench seq' and 'rados bench rand' quite significant

2024-10-29 Thread Louisa
Hi all,
We used 'rados bench' to test 4k object read and write operations. 
Our cluster is Pacific, one node, 11 BlueStore OSDs; DB and WAL share the block 
device. The block device is an HDD.

1. testing 4k write with command 'rados bench 120 write -t 16 -b 4K -p 
rep3datapool --run-name 4kreadwrite --no-cleanup'

2. Before testing 4k reads, we restarted all OSD daemons. The performance of 
'rados bench 120 seq -t 16 -p rep3datapool --run-name 4kreadwrite' was very 
good, with an average IOPS of 17735. 
Using 'ceph daemon osd.1 perf dump rocksdb', we found rocksdb:get_latency 
avgcount: 15189, avgtime: 0.12947 (12.9us)

3. Before testing 4k rand reads, we restarted all OSD daemons. 'rados bench 60 
rand -t 16 -p rep3datapool --run-name 4kreadwrite' gave an average IOPS of 2071; 
rocksdb:get_latency avgcount: 8756, avgtime: 0.001761293 (1.7ms)

Q1: Why is the performance difference between 'rados bench seq' and 'rados bench rand' 
so significant? How can the rocksdb get_latency difference between 
these two scenarios be explained?

4. We wrote 40w (400k) 4k objects to the pool, restarted all OSD daemons, and ran 
'rados bench 120 seq -t 16 -p rep3datapool --run-name 4kreadwrite' again. 
Average IOPS ~= 2000. 
rocksdb:get_latency avgtime also reached the millisecond level.
Q2: Why does 'rados bench seq' performance decrease so extremely after writing some 
more 4k objects to the pool?

Q3: Are there any methods or suggestions to optimize the read performance of 
this scenario under this hardware configuration?


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: why performance difference between 'rados bench seq' and 'rados bench rand' quite significant

2024-10-29 Thread Anthony D'Atri
The good Mr. Nelson and others may have more to contribute, but a few thoughts:

* Running for 60 or 120 seconds isn’t quantitative:  rados bench typically 
exhibits a clear ramp-up; watch the per-second stats.
* Suggest running for 10 minutes, three times in a row and averaging the results
* How many PGs in rep3datapool?  Average number of PG replicas per OSD shown by 
`ceph osd df` ?  I would shoot for 150 - 200 in your case.
* Disable scrubs, the balancer, and the pg autoscaler during benching (example commands after this list).
* If you have OS swap configured, disable it and reboot.  How much RAM?
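
A quick sketch of the disable/re-enable commands referenced above (the pool name
is from your output; remember to turn scrubs and the balancer back on afterwards):

ceph osd set noscrub
ceph osd set nodeep-scrub
ceph balancer off
ceph osd pool set rep3datapool pg_autoscale_mode off
# after benchmarking:
ceph osd unset noscrub
ceph osd unset nodeep-scrub
ceph balancer on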

> On Oct 29, 2024, at 11:43 PM, Louisa  wrote:
> 
> Hi all,
> We used 'rados bench' to test 4k object read and write operations.  
> Our cluster is pacific, one node, 11 bluestore osd ,db and wal share the 
> block device.  Block device is HDD.
> 
> 1. testing 4k write with command 'rados bench 120 write -t 16 -b 4K -p 
> rep3datapool --run-name 4kreadwrite --no-cleanup'
> 
> 2. Before tesing 4k reads, we restarted all OSD daemons.  The perfomance of 
> 'rados bench 120 seq -t 16 -p rep3datapool --run-name 4kreadwrite' was very 
> good, which Average IOPS: 17735; 
> using 'ceph daemon osd.1 perf dump rocksdb' , we found the 
> rocksdb:get_latency avgcount: 15189, avgtime: 0.12947 (12.9us)
> 
> 3. Before tesing 4k rand reads, we restarted all OSD daemons.  'rados bench 
> 60 rand -t 16 -p rep3datapool --run-name 4kreadwrite' average IOPS: 2071
> rocksdb:get_latency avgcount: 8756, avgtime: 0.001761293 (1.7ms)
> 
> Q1: Why performance difference between 'rados bench seq' and 'rados bench 
> rand' quite significant? How to explain the rocksdb get_latency perfomance 
> between this two scenario?
> 
> 4. We write 40w 4k object to the pool, restarted all OSD daemons. running 
> 'rados bench 120 seq -t 16 -p rep3datapool --run-name 4kreadwrite' again. 
> Average IOPS~= 2000. 
> rocsdb:get_latency avgtime  also reached milliseconds level
> Q2: Why 'rados bench seq' performance decresing extremly after writing some 
> more 4k object to the pool?
> 
> Q3: Is there any methods and suggestions to optimized the read performance of 
> this scenario under this hardware configuration.
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: why performance difference between 'rados bench seq' and 'rados bench rand' quite significant

2024-10-29 Thread Louisa
The rep3datapool pg num is 512; the average number of PG replicas per OSD is 139.
Scrubs, the balancer, and the pg autoscaler were disabled.
RAM is 128 GB; swap is 0.
From: Anthony D'Atri
Date: 2024-10-30 12:03
To: Louisa
CC: ceph-users
Subject: Re: [ceph-users] why performance difference between 'rados bench seq' 
and 'rados bench rand' quite significant
The good Mr. Nelson and others may have more to contribute, but a few thoughts:
 
* Running for 60 or 120 seconds isn’t quantitative:  rados bench typically 
exhibits a clear ramp-up; watch the per-second stats.
* Suggest running for 10 minutes, three times in a row and averaging the results
* How many PGs in rep3datapool?  Average number of PG replicas per OSD shown by 
`ceph osd df` ?  I would shoot for 150 - 200 in your case.
* Disable scrubs, the balancer, and the pg autoscaler during benching.
* If you have OS swap configured, disable it and reboot.  How much RAM?
 
> On Oct 29, 2024, at 11:43 PM, Louisa  wrote:
> 
> Hi all,
> We used 'rados bench' to test 4k object read and write operations.  
> Our cluster is pacific, one node, 11 bluestore osd ,db and wal share the 
> block device.  Block device is HDD.
> 
> 1. testing 4k write with command 'rados bench 120 write -t 16 -b 4K -p 
> rep3datapool --run-name 4kreadwrite --no-cleanup'
> 
> 2. Before tesing 4k reads, we restarted all OSD daemons.  The perfomance of 
> 'rados bench 120 seq -t 16 -p rep3datapool --run-name 4kreadwrite' was very 
> good, which Average IOPS: 17735; 
> using 'ceph daemon osd.1 perf dump rocksdb' , we found the 
> rocksdb:get_latency avgcount: 15189, avgtime: 0.12947 (12.9us)
> 
> 3. Before tesing 4k rand reads, we restarted all OSD daemons.  'rados bench 
> 60 rand -t 16 -p rep3datapool --run-name 4kreadwrite' average IOPS: 2071
> rocksdb:get_latency avgcount: 8756, avgtime: 0.001761293 (1.7ms)
> 
> Q1: Why performance difference between 'rados bench seq' and 'rados bench 
> rand' quite significant? How to explain the rocksdb get_latency perfomance 
> between this two scenario?
> 
> 4. We write 40w 4k object to the pool, restarted all OSD daemons. running 
> 'rados bench 120 seq -t 16 -p rep3datapool --run-name 4kreadwrite' again. 
> Average IOPS~= 2000. 
> rocsdb:get_latency avgtime  also reached milliseconds level
> Q2: Why 'rados bench seq' performance decresing extremly after writing some 
> more 4k object to the pool?
> 
> Q3: Is there any methods and suggestions to optimized the read performance of 
> this scenario under this hardware configuration.
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io