[ceph-users] Re: Cannot recreate monitor in upgrade from pacific to quincy (leveldb -> rocksdb)
Hi, Cool, thanks! As for the global_id_reclaim settings: root@proxmox01:~# ceph config get mon auth_allow_insecure_global_id_reclaim false root@proxmox01:~# ceph config get mon auth_expose_insecure_global_id_reclaim true root@proxmox01:~# ceph config get mon mon_warn_on_insecure_global_id_reclaim true root@proxmox01:~# ceph config get mon mon_warn_on_insecure_global_id_reclaim_allowed true — Mark Schouten CTO, Tuxis B.V. +31 318 200208 / m...@tuxis.nl -- Original Message -- From "Eugen Block" To ceph-users@ceph.io Date 02/02/2024, 08:30:45 Subject [ceph-users] Re: Cannot recreate monitor in upgrade from pacific to quincy (leveldb -> rocksdb) I might have a reproducer, the second rebuilt mon is not joining the cluster as well, I'll look into it and let you know if I find anything. Zitat von Eugen Block : Hi, Can anyone confirm that ancient (2017) leveldb database mons should just accept ‘mon.$hostname’ names for mons, a well as ‘mon.$id’ ? at some point you had or have to remove one of the mons to recreate it with a rocksdb backend, so the mismatch should not be an issue here. I can confirm that when I tried to reproduce it in a small test cluster with leveldb. So now I have two leveldb MONs and one rocksdb MON: jewel:~ # cat /var/lib/ceph/b08424fa-8530-4080-876d-2821c916d26c/mon.jewel/kv_backend rocksdb jewel2:~ # cat /var/lib/ceph/b08424fa-8530-4080-876d-2821c916d26c/mon.jewel2/kv_backend leveldb jewel3:~ # cat /var/lib/ceph/b08424fa-8530-4080-876d-2821c916d26c/mon.jewel3/kv_backend leveldb And the cluster is healthy, although it took a minute or two for the rebuilt MON to sync (in a real cluster with some load etc. it might take longer): jewel:~ # ceph -s cluster: id: b08424fa-8530-4080-876d-2821c916d26c health: HEALTH_OK services: mon: 3 daemons, quorum jewel2,jewel3,jewel (age 3m) I'm wondering if this could have to do with the insecure_global_id things. Can you send the output of: ceph config get mon auth_allow_insecure_global_id_reclaim ceph config get mon auth_expose_insecure_global_id_reclaim ceph config get mon mon_warn_on_insecure_global_id_reclaim ceph config get mon mon_warn_on_insecure_global_id_reclaim_allowed Zitat von Mark Schouten : Hi, I don’t have a fourth machine available, so that’s not an option unfortunatly. I did enable a lot of debugging earlier, but that shows no information as to why stuff is not working as to be expected. Proxmox just deploys the mons, nothing fancy there, no special cases. Can anyone confirm that ancient (2017) leveldb database mons should just accept ‘mon.$hostname’ names for mons, a well as ‘mon.$id’ ? — Mark Schouten CTO, Tuxis B.V. +31 318 200208 / m...@tuxis.nl -- Original Message -- From "Eugen Block" To ceph-users@ceph.io Date 31/01/2024, 13:02:04 Subject [ceph-users] Re: Cannot recreate monitor in upgrade from pacific to quincy (leveldb -> rocksdb) Hi Mark, as I'm not familiar with proxmox I'm not sure what happens under the hood. There are a couple of things I would try, not necessarily in this order: - Check the troubleshooting guide [1], for example a clock skew could be one reason, have you verified ntp/chronyd functionality? 
- Inspect debug log output, maybe first on the probing mon and if those don't reveal the reason, enable debug logs for the other MONs as well: ceph config set mon.proxmox03 debug_mon 20 ceph config set mon.proxmox03 debug_paxos 20 or for all MONs: ceph config set mon debug_mon 20 ceph config set mon debug_paxos 20 - Try to deploy an additional MON on a different server (if you have more available) and see if that works. - Does proxmox log anything? - Maybe last resort, try to start a MON manually after adding it to the monmap with the monmaptool, but only if you know what you're doing. I wonder if the monmap doesn't get updated... Regards, Eugen [1] https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/ Zitat von Mark Schouten : Hi, During an upgrade from pacific to quincy, we needed to recreate the mons because the mons were pretty old and still using leveldb. So step one was to destroy one of the mons. After that we recreated the monitor, and although it starts, it remains in state ‘probing’, as you can see below. No matter what I tried, it won’t come up. I’ve seen quite some messages that the MTU might be an issue, but that seems to be ok: root@proxmox03:/var/log/ceph# fping -b 1472 10.10.10.{1..3} -M 10.10.10.1 is alive 10.10.10.2 is alive 10.10.10.3 is alive Does anyone have an idea how to fix this? I’ve tried destroying and recreating the mon a few times now. Could it be that the leveldb mons only support mon.$id notation for the monitors? root@proxmox03:/var/log/ceph# ceph daemon mon.proxmox03 mon_status { "name": “proxmox03”, "rank": 2, "state": “probing”, "election_epoch": 0, "quorum": [], "features": { "required_co
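A few extra checks that may help when a mon is stuck probing; the mon name proxmox03 is taken from this thread, the rest is a generic sketch:

ceph mon dump                          # the monmap as the quorum sees it
ceph daemon mon.proxmox03 mon_status   # state and monmap as the stuck mon sees them (run on its host)
ceph time-sync-status                  # rule out clock skew between the monitors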
[ceph-users] Re: Cannot recreate monitor in upgrade from pacific to quincy (leveldb -> rocksdb)
I decided to try to bring the mon back manually after looking at the logs without any findings. It's kind of ugly but it worked. The problem with that approach is that I had to take down a second MON to inject a new monmap (which then includes the failed MON), restart it and do the same for the third MON. This means no quorum until two MONs are back, so two short interruptions. Then I created a new MON (legacy style) and started it with the modified monmap. The cluster got back into quorum, then I converted the legacy MON to cephadm and resumed the orchestrator. Unfortunately, I still don't understand what the root cause is... I also played around with the insecure_global_id stuff to no avail, it was just an idea... Zitat von Mark Schouten : Hi, Cool, thanks! As for the global_id_reclaim settings: root@proxmox01:~# ceph config get mon auth_allow_insecure_global_id_reclaim false root@proxmox01:~# ceph config get mon auth_expose_insecure_global_id_reclaim true root@proxmox01:~# ceph config get mon mon_warn_on_insecure_global_id_reclaim true root@proxmox01:~# ceph config get mon mon_warn_on_insecure_global_id_reclaim_allowed true — Mark Schouten CTO, Tuxis B.V. +31 318 200208 / m...@tuxis.nl -- Original Message -- From "Eugen Block" To ceph-users@ceph.io Date 02/02/2024, 08:30:45 Subject [ceph-users] Re: Cannot recreate monitor in upgrade from pacific to quincy (leveldb -> rocksdb) I might have a reproducer, the second rebuilt mon is not joining the cluster as well, I'll look into it and let you know if I find anything. Zitat von Eugen Block : Hi, Can anyone confirm that ancient (2017) leveldb database mons should just accept ‘mon.$hostname’ names for mons, a well as ‘mon.$id’ ? at some point you had or have to remove one of the mons to recreate it with a rocksdb backend, so the mismatch should not be an issue here. I can confirm that when I tried to reproduce it in a small test cluster with leveldb. So now I have two leveldb MONs and one rocksdb MON: jewel:~ # cat /var/lib/ceph/b08424fa-8530-4080-876d-2821c916d26c/mon.jewel/kv_backend rocksdb jewel2:~ # cat /var/lib/ceph/b08424fa-8530-4080-876d-2821c916d26c/mon.jewel2/kv_backend leveldb jewel3:~ # cat /var/lib/ceph/b08424fa-8530-4080-876d-2821c916d26c/mon.jewel3/kv_backend leveldb And the cluster is healthy, although it took a minute or two for the rebuilt MON to sync (in a real cluster with some load etc. it might take longer): jewel:~ # ceph -s cluster: id: b08424fa-8530-4080-876d-2821c916d26c health: HEALTH_OK services: mon: 3 daemons, quorum jewel2,jewel3,jewel (age 3m) I'm wondering if this could have to do with the insecure_global_id things. Can you send the output of: ceph config get mon auth_allow_insecure_global_id_reclaim ceph config get mon auth_expose_insecure_global_id_reclaim ceph config get mon mon_warn_on_insecure_global_id_reclaim ceph config get mon mon_warn_on_insecure_global_id_reclaim_allowed Zitat von Mark Schouten : Hi, I don’t have a fourth machine available, so that’s not an option unfortunatly. I did enable a lot of debugging earlier, but that shows no information as to why stuff is not working as to be expected. Proxmox just deploys the mons, nothing fancy there, no special cases. Can anyone confirm that ancient (2017) leveldb database mons should just accept ‘mon.$hostname’ names for mons, a well as ‘mon.$id’ ? — Mark Schouten CTO, Tuxis B.V. 
+31 318 200208 / m...@tuxis.nl -- Original Message -- From "Eugen Block" To ceph-users@ceph.io Date 31/01/2024, 13:02:04 Subject [ceph-users] Re: Cannot recreate monitor in upgrade from pacific to quincy (leveldb -> rocksdb) Hi Mark, as I'm not familiar with proxmox I'm not sure what happens under the hood. There are a couple of things I would try, not necessarily in this order: - Check the troubleshooting guide [1], for example a clock skew could be one reason, have you verified ntp/chronyd functionality? - Inspect debug log output, maybe first on the probing mon and if those don't reveal the reason, enable debug logs for the other MONs as well: ceph config set mon.proxmox03 debug_mon 20 ceph config set mon.proxmox03 debug_paxos 20 or for all MONs: ceph config set mon debug_mon 20 ceph config set mon debug_paxos 20 - Try to deploy an additional MON on a different server (if you have more available) and see if that works. - Does proxmox log anything? - Maybe last resort, try to start a MON manually after adding it to the monmap with the monmaptool, but only if you know what you're doing. I wonder if the monmap doesn't get updated... Regards, Eugen [1] https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/ Zitat von Mark Schouten : Hi, During an upgrade from pacific to quincy, we needed to recreate the mons because the mons were pretty old and still using leveld
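A rough sketch of the manual monmap surgery described above, for reference only; the host names and the 10.10.10.x address come from this thread, the port is the default 6789 and the paths are placeholders. This bypasses the normal deployment tooling, so treat it as a last resort:

systemctl stop ceph-mon@proxmox01                       # stop a surviving mon
ceph-mon -i proxmox01 --extract-monmap /tmp/monmap      # grab its current monmap
monmaptool --print /tmp/monmap
monmaptool --add proxmox03 10.10.10.3:6789 /tmp/monmap  # add the rebuilt mon
ceph-mon -i proxmox01 --inject-monmap /tmp/monmap       # repeat for each stopped surviving mon
systemctl start ceph-mon@proxmox01
ceph-mon -i proxmox03 --inject-monmap /tmp/monmap       # give the rebuilt mon the same map before its first start
systemctl start ceph-mon@proxmox03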
[ceph-users] Ceph Dashboard failed to execute login
Hello team, I failed to log in to my Ceph dashboard, which is running Pacific and was deployed using ceph-ansible. I set the admin password using the following command: "ceph dashboard ac-user-set-password admin -i ceph-dash-pass", where ceph-dash-pass contains the real password. I am getting the following output: "{"username": "admin", "password": "$2b$12$Ge/2cpg0ZGjRPnBC2YREP.E5oVyNvV4SC9HU4PMsWWMBtC9UvL7mG", "roles": ["administrator"], "name": null, "email": null, "lastUpdate": 1706866328, "enabled": false, "pwdExpirationDate": null, "pwdUpdateRequired": false}" When I try to log in to the dashboard, I still get the same error message. I am guessing it is because the "enabled" field above is set to false. How do I set that field to true? Or, if there is another way to set it, please advise. Thank you ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph Dashboard failed to execute login
Have you tried to enable it? # ceph dashboard ac-user-enable admin Zitat von Michel Niyoyita : Hello team, I failed to login to my ceph dashboard which is running pacific as version and deployed using ceph-ansible . I have set admin password using the following command : "ceph dashboard ac-user-set-password admin -i ceph-dash-pass" where ceph-dash-pass possesses the real password. I am getting the following output : "{"username": "admin", "password": "$2b$12$Ge/2cpg0ZGjRPnBC2YREP.E5oVyNvV4SC9HU4PMsWWMBtC9UvL7mG", "roles": ["administrator"], "name": null, "email": null, "lastUpdate": 1706866328, "enabled": false, "pwdExpirationDate": null, "pwdUpdateRequired": false}" Once I login to the dashboard , still i get the same error message. I am guessing it is because the above "enabled" field is set to false . Ho w to set that field to true ? or if there is other alternative to set it you can advise. Thank you ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
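For completeness, the change can be verified afterwards with the same admin user:

ceph dashboard ac-user-enable admin
ceph dashboard ac-user-show admin    # the "enabled" field should now be true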
[ceph-users] Re: Ceph Dashboard failed to execute login
Thank you very much Sir , now it works. Michel On Fri, Feb 2, 2024 at 11:55 AM Eugen Block wrote: > Have you tried to enable it? > > # ceph dashboard ac-user-enable admin > > Zitat von Michel Niyoyita : > > > Hello team, > > > > I failed to login to my ceph dashboard which is running pacific as > version > > and deployed using ceph-ansible . I have set admin password using the > > following command : "ceph dashboard ac-user-set-password admin -i > > ceph-dash-pass" where ceph-dash-pass possesses the real password. I am > > getting the following output : "{"username": "admin", "password": > > "$2b$12$Ge/2cpg0ZGjRPnBC2YREP.E5oVyNvV4SC9HU4PMsWWMBtC9UvL7mG", "roles": > > ["administrator"], "name": null, "email": null, "lastUpdate": 1706866328, > > "enabled": false, "pwdExpirationDate": null, "pwdUpdateRequired": false}" > > > > Once I login to the dashboard , still i get the same error message. I am > > guessing it is because the above "enabled" field is set to false . Ho w > to > > set that field to true ? or if there is other alternative to set it you > can > > advise. > > > > Thank you > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] RBD mirroring to an EC pool
Hello, Ceph users, I would like to use my secondary Ceph cluster for backing up RBD OpenNebula volumes from my primary cluster using mirroring in image+snapshot mode. Because it is for backups only, not a cold-standby, I would like to use erasure coding on the secondary side to save a disk space. Is it supported at all? I tried to create a pool: secondary# ceph osd pool create one-mirror erasure k6m2 secondary# ceph osd pool set one-mirror allow_ec_overwrites true set pool 13 allow_ec_overwrites to true secondary# rbd mirror pool enable --site-name secondary one-mirror image 2024-02-02T11:00:34.123+0100 7f95070ad5c0 -1 librbd::api::Mirror: mode_set: failed to allocate mirroring uuid: (95) Operation not supported When I created a replicated pool instead, this step worked: secondary# ceph osd pool create one-mirror-repl replicated secondary# rbd mirror pool enable --site-name secondary one-mirror-repl image secondary# So, is RBD mirroring supported with erasure-coded pools at all? Thanks! -Yenya -- | Jan "Yenya" Kasprzak | | https://www.fi.muni.cz/~kas/GPG: 4096R/A45477D5 | We all agree on the necessity of compromise. We just can't agree on when it's necessary to compromise. --Larry Wall ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
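Not a direct answer to the mirroring question, but for context: RBD cannot store image metadata (headers, omap) in an EC pool, which is presumably why mode_set fails with "Operation not supported" above. The usual layout is a small replicated pool for metadata with the EC pool used only for data via --data-pool, roughly as in the sketch below (pool and image names are made up); whether rbd-mirror on the secondary can then be pointed at such a layout is exactly the open question here:

ceph osd pool create one-mirror-meta replicated
ceph osd pool application enable one-mirror-meta rbd
rbd create --size 100G --data-pool one-mirror one-mirror-meta/test-image   # data in EC, metadata in replicated
rbd mirror pool enable --site-name secondary one-mirror-meta image         # mirroring is enabled on the replicated pool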
[ceph-users] Re: OSD read latency grows over time
Chiming in here, just so that it's indexed in the archives. We've had a lot of issues with tombstones when running RGW usage logging: when we trim those, the Ceph OSD hosting that usage.X object basically has its performance killed because there are so many tombstones; restarting the OSD solves it. We are not yet on Quincy, but when we are we will look into tuning rocksdb_cf_compact_on_deletion_trigger so that we don't have to locate the objects, trim, and restart OSDs every time we want to clean them. Unfortunately the message on Ceph Slack is lost since it was a while back that I wrote more details on that investigation, but IIRC the issue is that "radosgw-admin usage trim" does SingleDelete() in the RocksDB layer when deleting objects that could have been bulk deleted (RangeDelete?) because they share the same prefix (name + date). Best regards > On 26 Jan 2024, at 23:18, Mark Nelson wrote: > > On 1/26/24 11:26, Roman Pashin wrote: > >>> Unfortunately they cannot. You'll want to set them in centralized conf >>> and then restart OSDs for them to take effect. >>> >> Got it. Thank you Josh! WIll put it to config of affected OSDs and restart >> them. >> >> Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger 16384 > >> 4096 hurt performance of HDD OSDs in any way? I have no growing latency on >> HDD OSD, where data is stored, but it would be easier to set it to [osd] >> section without cherry picking only SSD/NVME OSDs, but for all at once. > > > Potentially if you set the trigger too low, you could force constant > compactions. Say if you set it to trigger compaction every time a tombstone > is encountered. You really want to find the sweet spot where iterating over > tombstones (possibly multiple times) is more expensive than doing a > compaction. The defaults are basically just tuned to avoid the worst case > scenario where OSDs become laggy or even go into heartbeat timeout (and we're > not 100% sure we got those right). I believe we've got a couple of big users > that tune it more aggressively, though I'll let them speak up if they are > able. > > > Mark > > >> -- >> Thank you, >> Roman >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
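For reference, a hedged sketch of the two approaches discussed in this sub-thread (the OSD id and trigger value are examples; the config option takes effect after an OSD restart, and exact option names may vary by release):

ceph tell osd.123 compact                                          # manual RocksDB compaction instead of a restart
ceph config set osd rocksdb_cf_compact_on_deletion_trigger 4096    # more aggressive compact-on-deletion (default cited above: 16384)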
[ceph-users] Re: OSD read latency grows over time
I found the internal note I made about it, see below. When we trim thousands of OMAP keys in RocksDB this calls SingleDelete() in the RocksDBStore in Ceph, this causes tombstones in the RocksDB database. These thousands of tombstones that each needs to be iterated over when for example reading data from the database causes the latency to become super high. If the OSD is restarted the issue disappears, I assume this is because RocksDB or the RocksDBStore in Ceph creates a new iterator or does compaction internally upon startup. I don't see any straight forward solution without having to rebuild internal logic in the usage trim code. More specifically that would be investigating in the usage trim code to use `cls_cxx_map_remove_range()` which would call `RocksDBStore::RocksDBTransactionImpl::rm_range_keys()` internally instead when doing a usage trim for an epoch (—start-date and —end-date only, and no user or bucket). The problem there though is that the `rocksdb_delete_range_threshold` config option defaults to 1_M which is way more than the amount we are deleting and still causing issue, that function calls `DeleteRange()` instead of `SingleDelete()` in RocksDB which would cause one tombstone for all entries instead of one tombstone for every single OMAP key. Even better for above would be calling `rmkeys_by_prefix()` and not having to specify start and end but there is no OSD op in PrimaryLogPG for that which means even more work that might not be backportable. Our best bet right now without touching radosgw-admin is upgrading to >=16.2.14 which introduces https://github.com/ceph/ceph/pull/50894 that will do compaction if a threshold of tombstones is hit within a sliding window during iteration. Best regards > On 2 Feb 2024, at 11:29, Tobias Urdin wrote: > > Shiming in here, just so that it’s indexed in archives. > > We’ve have a lot of issues with tombstones when running RGW usage logging and > when we > trim those the Ceph OSD hosting that usage.X object will basically kill the > OSD performance > due to the tombstones being so many, restarting the OSD solves it. > > We are not yet on Quincy but when we are will look into optimizing > rocksdb_cf_compact_on_deletion_trigger > so that we don’t have to locate the objects, trim, restart OSDs everytime we > want to clean them. > > Unfortunately the message on Ceph Slack is lost since it was a while back I > wrote more details > on that investigation, but IIRC the issue is that the "radosgw-admin usage > trim” does SingleDelete() in the RocksDB layer > when deleting objects that could be bulk deleted (RangeDelete?) due to them > having the same prefix (name + date). > > Best regards > >> On 26 Jan 2024, at 23:18, Mark Nelson wrote: >> >> On 1/26/24 11:26, Roman Pashin wrote: >> Unfortunately they cannot. You'll want to set them in centralized conf and then restart OSDs for them to take effect. >>> Got it. Thank you Josh! WIll put it to config of affected OSDs and restart >>> them. >>> >>> Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger 16384 > >>> 4096 hurt performance of HDD OSDs in any way? I have no growing latency on >>> HDD OSD, where data is stored, but it would be easier to set it to [osd] >>> section without cherry picking only SSD/NVME OSDs, but for all at once. >> >> >> Potentially if you set the trigger too low, you could force constant >> compactions. Say if you set it to trigger compaction every time a tombstone >> is encountered. 
You really want to find the sweet spot where iterating over >> tombstones (possibly multiple times) is more expensive than doing a >> compaction. The defaults are basically just tuned to avoid the worst case >> scenario where OSDs become laggy or even go into heartbeat timeout (and >> we're not 100% sure we got those right). I believe we've got a couple of >> big users that tune it more aggressively, though I'll let them speak up if >> they are able. >> >> >> Mark >> >> >>> -- >>> Thank you, >>> Roman >>> ___ >>> ceph-users mailing list -- ceph-users@ceph.io >>> To unsubscribe send an email to ceph-users-le...@ceph.io >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
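A sketch of the workflow implied above, with made-up dates and OSD id; trimming by date range alone is exactly the path that generates one tombstone per OMAP key, so compacting the affected OSDs afterwards avoids having to restart them:

radosgw-admin usage trim --start-date=2024-01-01 --end-date=2024-01-31
ceph tell osd.42 compact    # repeat for each OSD hosting the usage objects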
[ceph-users] Re: Understanding subvolumes
On 01.02.24 00:20, Matthew Melendy wrote: In our department we're getting starting with Ceph 'reef', using Ceph FUSE client for our Ubuntu workstations. So far so good, except I can't quite figure out one aspect of subvolumes. AFAIK subvolumes were introduced to be used with Kubernetes and other cloud technologies. If you run a classical file service on top of CephFS you usually do not need subvolumes but can go with normal quotas on directories. Regards -- Robert Sander Heinlein Support GmbH Linux: Akademie - Support - Hosting http://www.heinlein-support.de Tel: 030-405051-43 Fax: 030-405051-19 Zwangsangaben lt. §35a GmbHG: HRB 93818 B / Amtsgericht Berlin-Charlottenburg, Geschäftsführer: Peer Heinlein -- Sitz: Berlin ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
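A minimal example of the plain-directory alternative Robert mentions, assuming a CephFS mount at /mnt/cephfs and a client whose MDS caps include the 'p' flag for setting quota attributes:

mkdir /mnt/cephfs/staff
setfattr -n ceph.quota.max_bytes -v 1000000000000 /mnt/cephfs/staff   # ~1 TB quota on a normal directory
getfattr -n ceph.quota.max_bytes /mnt/cephfs/staff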
[ceph-users] Problems adding a new host via orchestration.
Happy Friday all. I was hoping someone could point me in the right direction or clarify any limitations that could be impacting an issue I am having. I'm struggling to add a new set of hosts to my ceph cluster using cephadm and orchestration. When trying to add a host: "ceph orch host add 172.31.102.41 --labels _admin" returns: "Error EINVAL: Can't communicate with remote host `172.31.102.41`, possibly because python3 is not installed there: [Errno 12] Cannot allocate memory" I've verified that the ceph ssh key works to the remote host, host's name matches that returned from `hostname`, python3 is installed, and "/usr/sbin/cephadm prepare-host" on the new hosts returns "host is ok". In addition, the cluster ssh key works between hosts and the existing hosts are able to ssh in using the ceph key. The existing ceph cluster is Pacific release using docker based containerization on RockyLinux8 base OS. The new hosts are RockyLinux9 based, with the cephadm being installed from Quincy release: ./cephadm add-repo --release quincy ./cephadm install I did try installing cephadm from the Pacific release by changing the repo to el8, but that did not work either. Is there a limitation is mixing RL8 and RL9 container hosts under Pacific? Does this same limitation exist under Quincy? Is there a python version dependency? The reason for RL9 on the new hosts is to stage upgrading the OS's for the cluster. I did this under Octopus for moving from Centos7 to RL8. Thanks and I appreciate any feedback/pointers. Gary I've added the log trace here in case that helps (from `ceph log last cephadm`) 2024-02-02T14:22:32.610048+ mgr.storage01.oonvfl (mgr.441023307) 4957871 : cephadm [ERR] Can't communicate with remote host `172.31.102.41`, possibly because python3 is not installed there: [Errno 12] Cannot allocate memory Traceback (most recent call last): File "/usr/share/ceph/mgr/cephadm/serve.py", line 1524, in _remote_connection conn, connr = self.mgr._get_connection(addr) File "/usr/share/ceph/mgr/cephadm/module.py", line 1370, in _get_connection sudo=True if self.ssh_user != 'root' else False) File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 35, in __init__ self.gateway = self._make_gateway(hostname) File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 46, in _make_gateway self._make_connection_string(hostname) File "/lib/python3.6/site-packages/execnet/multi.py", line 133, in makegateway io = gateway_io.create_io(spec, execmodel=self.execmodel) File "/lib/python3.6/site-packages/execnet/gateway_io.py", line 121, in create_io io = Popen2IOMaster(args, execmodel) File "/lib/python3.6/site-packages/execnet/gateway_io.py", line 21, in __init__ self.popen = p = execmodel.PopenPiped(args) File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 184, in PopenPiped return self.subprocess.Popen(args, stdout=PIPE, stdin=PIPE) File "/lib64/python3.6/subprocess.py", line 729, in __init__ restore_signals, start_new_session) File "/lib64/python3.6/subprocess.py", line 1295, in _execute_child restore_signals, start_new_session, preexec_fn) OSError: [Errno 12] Cannot allocate memory During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/share/ceph/mgr/cephadm/serve.py", line 1528, in _remote_connection raise execnet.gateway_bootstrap.HostNotFound(msg) execnet.gateway_bootstrap.HostNotFound: Can't communicate with remote host `172.31.102.41`, possibly because python3 is not installed there: [Errno 12] Cannot allocate 
memory The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 125, in wrapper return OrchResult(f(*args, **kwargs)) File "/usr/share/ceph/mgr/cephadm/module.py", line 2709, in apply results.append(self._apply(spec)) File "/usr/share/ceph/mgr/cephadm/module.py", line 2574, in _apply return self._add_host(cast(HostSpec, spec)) File "/usr/share/ceph/mgr/cephadm/module.py", line 1517, in _add_host ip_addr = self._check_valid_addr(spec.hostname, spec.addr) File "/usr/share/ceph/mgr/cephadm/module.py", line 1498, in _check_valid_addr error_ok=True, no_fsid=True) File "/usr/share/ceph/mgr/cephadm/serve.py", line 1326, in _run_cephadm with self._remote_connection(host, addr) as tpl: File "/lib64/python3.6/contextlib.py", line 81, in __enter__ return next(self.gen) File "/usr/share/ceph/mgr/cephadm/serve.py", line 1558, in _remote_connection raise OrchestratorError(msg) from e orchestrator._interface.OrchestratorError: Can't communicate with remote host `172.31.102.41`, possibly because python3 is not installed there: [Errno 12] Cannot allocate memory -- Gary Molenkamp Science Technology
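One way to narrow this down is to reproduce the connection exactly as cephadm makes it, using the SSH config and key stored by the mgr (the scratch paths below are arbitrary; the host address is taken from the message above). Since the ENOMEM in the traceback is raised while the mgr forks its SSH connection, a mgr failover may also be worth a try:

ceph cephadm get-ssh-config > /tmp/cephadm_ssh_config
ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm_key
chmod 600 /tmp/cephadm_key
ssh -F /tmp/cephadm_ssh_config -i /tmp/cephadm_key root@172.31.102.41 'python3 --version'
ceph mgr fail    # then retry "ceph orch host add" once the standby mgr takes over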
[ceph-users] PG upmap corner cases that silently fail
Hi cephers, I've been looking into better balancing our clusters with upmaps lately, and ran into upmap cases that behave in a less than ideal way. If there is any cycle in the upmaps like ceph osd pg-upmap-items a b b a or ceph osd pg-upmap-items a b b c c a the upmap validation passes, the upmap gets added to the osdmap, but then gets silently ignored. Obviously this is for EC pools - irrelevant for replicated pools where the order of OSDs is not significant. The relevant code OSDMap::_apply_upmap even has a comment about this: if (q != pg_upmap_items.end()) { // NOTE: this approach does not allow a bidirectional swap, // e.g., [[1,2],[2,1]] applied to [0,1,2] -> [0,2,1]. for (auto& r : q->second) { // make sure the replacement value doesn't already appear ... I'm trying to understand the reasons for this limitation: is it the case that this is just a matter of convenience of coding (OSDMap::_apply_upmap could do this correctly with a bit more careful approach), or is there some inherent limitation somewhere else that prevents these cases from working? I did notice that just updating crush weights (without using upmaps) produces similar changes to the UP set (swaps OSDs in EC pools sometimes), so the OSDs seem to be perfectly capable of doing backfills for osdmap changes that shuffle the order of OSDs in the UP set. Some insight/history here would be appreciated. Either way, the behavior of validation passing on an upmap and then the upmap getting silently ignored is not ideal. I do realize that all clients would have to agree on this code, since clients independently execute it to find the OSDs to access (so rolling out a change to this is challenging). Andras ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
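For anyone investigating the same thing, a quick way to see whether a pg-upmap-items entry was honoured or silently dropped (the PG id and OSD ids below are placeholders):

ceph osd pg-upmap-items 7.1a 3 4 4 3    # a swap-style mapping like the ones described above
ceph osd dump | grep pg_upmap_items     # the entry is stored in the osdmap ...
ceph pg map 7.1a                        # ... but the up set shows whether it actually took effect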
[ceph-users] Re: Debian 12 (bookworm) / Reef 18.2.1 problems
Chris, Thanks for all the investigations you are doing here. We're on quincy/debian11. Is there any working path at this point to reef/debian12? Ideally I want to go in two steps. Upgrade ceph first or upgrade debian first, then do the upgrade to the other one. Most of our infra is already upgraded to debian 12, except ceph. On 2024-01-29 07:27, Chris Palmer wrote: I have logged this as https://tracker.ceph.com/issues/64213 On 16/01/2024 14:18, DERUMIER, Alexandre wrote: Hi, ImportError: PyO3 modules may only be initialized once per interpreter process and ceph -s reports "Module 'dashboard' has failed dependency: PyO3 modules may only be initialized once per interpreter process We have the same problem on proxmox8 (based on debian12) with ceph quincy or reef. It seem to be related to python version on debian12 (we have no fix for this currently) ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] XFS on top of RBD, overhead
Hi group, Today I conducted a small experiment to test an assumption of mine, namely that Ceph incurs a substantial network overhead when doing many small files. One RBD was created, and on top of that an XFS containing 1.6 M files, each with size 10 kiB: # rbd info libvirt/bobtest rbd image 'bobtest': size 20 GiB in 5120 objects order 22 (4 MiB objects) [...] # df -h /space Filesystem Size Used Avail Use% Mounted on /dev/rbd020G 20G 181M 100% /space # ls -lh /space |head total 19G -rw-r--r--. 1 root root 10K Feb 2 14:13 xaa -rw-r--r--. 1 root root 10K Feb 2 14:13 xab -rw-r--r--. 1 root root 10K Feb 2 14:13 xac -rw-r--r--. 1 root root 10K Feb 2 14:13 xad -rw-r--r--. 1 root root 10K Feb 2 14:13 xae -rw-r--r--. 1 root root 10K Feb 2 14:13 xaf -rw-r--r--. 1 root root 10K Feb 2 14:13 xag -rw-r--r--. 1 root root 10K Feb 2 14:13 xah -rw-r--r--. 1 root root 10K Feb 2 14:13 xai # ls /space |wc -l 1638400 All files contain pseudorandom (i.e. incompressible) junk. My assumption was, that as the backend RBD block size is 4 MiB, it would be necessary for the client machine to download at least that 4 MiB worth of data on any given request, even if the file in the XFS is only 10 kB. I.e. I cat(1) a small file, the RBD client grabs the relevant 4 MiB block from Ceph, from this the small amount of requested data is extracted and presented to userspace. That's not what I see, however. My testing procedure is as follows: I have a list of all the files on the RBD, order randomized, stored in root's home folder -- this to make sure that I can pick file names at random by going through the list from the top, and not causing network traffic by listing files directly in the target FS. I then reboot the node to ensure that all caches are empty and start an iftop(1) to monitor network usage. Mapping the RBD and mounting the XFS results in 5.29 MB worth of data read from the network. Reading one file at random from the XFS results in approx. 200 kB of network read. Reading 100 files at random results in approx. 3.83 MB of network read. Reading 1000 files at random results in approx. 36.2 MB of network read. Bottom line is that reading any 10 kiB of actual data results in approximately 37 kiB data being transferred over the network. Overhead, sure, but nowhere near what I expected, which was 4 MiB per block of data "hit" in the backend. Is the RBD client performing partial object reads? Is that even a thing? Cheers, Ruben Vestergaard ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: XFS on top of RBD, overhead
On Fri, Feb 2, 2024 at 7:44 AM Ruben Vestergaard wrote: > Is the RBD client performing partial object reads? Is that even a thing? Yup! The rados API has both length and offset parameters for reads (https://docs.ceph.com/en/latest/rados/api/librados/#c.rados_aio_read) and writes (https://docs.ceph.com/en/latest/rados/api/librados/#c.rados_aio_write). Josh ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: XFS on top of RBD, overhead
On Fri, Feb 02 2024 at 07:51:36 -0700, Josh Baergen wrote: On Fri, Feb 2, 2024 at 7:44 AM Ruben Vestergaard wrote: Is the RBD client performing partial object reads? Is that even a thing? Yup! The rados API has both length and offset parameters for reads (https://docs.ceph.com/en/latest/rados/api/librados/#c.rados_aio_read) and writes (https://docs.ceph.com/en/latest/rados/api/librados/#c.rados_aio_write). Ah! That was easy. And good to know. Thanks! -R Josh ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: XFS on top of RBD, overhead
On 02/02/2024 16:41, Ruben Vestergaard wrote: Hi group, Today I conducted a small experiment to test an assumption of mine, namely that Ceph incurs a substantial network overhead when doing many small files. One RBD was created, and on top of that an XFS containing 1.6 M files, each with size 10 kiB: # rbd info libvirt/bobtest rbd image 'bobtest': size 20 GiB in 5120 objects order 22 (4 MiB objects) [...] # df -h /space Filesystem Size Used Avail Use% Mounted on /dev/rbd0 20G 20G 181M 100% /space # ls -lh /space |head total 19G -rw-r--r--. 1 root root 10K Feb 2 14:13 xaa -rw-r--r--. 1 root root 10K Feb 2 14:13 xab -rw-r--r--. 1 root root 10K Feb 2 14:13 xac -rw-r--r--. 1 root root 10K Feb 2 14:13 xad -rw-r--r--. 1 root root 10K Feb 2 14:13 xae -rw-r--r--. 1 root root 10K Feb 2 14:13 xaf -rw-r--r--. 1 root root 10K Feb 2 14:13 xag -rw-r--r--. 1 root root 10K Feb 2 14:13 xah -rw-r--r--. 1 root root 10K Feb 2 14:13 xai # ls /space |wc -l 1638400 All files contain pseudorandom (i.e. incompressible) junk. My assumption was, that as the backend RBD block size is 4 MiB, it would be necessary for the client machine to download at least that 4 MiB worth of data on any given request, even if the file in the XFS is only 10 kB. I.e. I cat(1) a small file, the RBD client grabs the relevant 4 MiB block from Ceph, from this the small amount of requested data is extracted and presented to userspace. That's not what I see, however. My testing procedure is as follows: I have a list of all the files on the RBD, order randomized, stored in root's home folder -- this to make sure that I can pick file names at random by going through the list from the top, and not causing network traffic by listing files directly in the target FS. I then reboot the node to ensure that all caches are empty and start an iftop(1) to monitor network usage. Mapping the RBD and mounting the XFS results in 5.29 MB worth of data read from the network. Reading one file at random from the XFS results in approx. 200 kB of network read. Reading 100 files at random results in approx. 3.83 MB of network read. Reading 1000 files at random results in approx. 36.2 MB of network read. Bottom line is that reading any 10 kiB of actual data results in approximately 37 kiB data being transferred over the network. Overhead, sure, but nowhere near what I expected, which was 4 MiB per block of data "hit" in the backend. Is the RBD client performing partial object reads? Is that even a thing? Cheers, Ruben Vestergaard ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io The OSD/rados api llows you read partial data within an object, you specify the length and logical offset from with an object, no need to read entire object if you do not need. This is not specific to rbd. The small network overhead is i guess overhead in network protocol layers including Ceph messenger overhead. /maged ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Unable to mount ceph
Hi, A little basic question. I created a volume with ceph fs volume then a subvolume called «erasure» I can see that with root@cthulhu1:/etc/ceph# ceph fs subvolume info cephfs erasure { "atime": "2024-02-02 11:02:07", "bytes_pcent": "undefined", "bytes_quota": "infinite", "bytes_used": 0, "created_at": "2024-02-02 11:02:07", "ctime": "2024-02-02 14:12:30", "data_pool": "data_ec", "features": [ "snapshot-clone", "snapshot-autoprotect", "snapshot-retention" ], "gid": 0, "mode": 16877, "mon_addrs": [ "145.238.187.184:6789", "145.238.187.185:6789", "145.238.187.186:6789", "145.238.187.188:6789", "145.238.187.187:6789" ], "mtime": "2024-02-02 14:12:30", "path": "/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9", "pool_namespace": "", "state": "complete", "type": "subvolume", "uid": 0 } From the mon server I was able to mount the «partition» with mount -t ceph admin@fXXX-c0f2-11ee-9307-f7e3b9f03075.cephfs=/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9 /mnt but on my test client I'm unable to mount root@ceph-vo-m:/etc/ceph# mount -t ceph vo@fxxx-c0f2-11ee-9307-f7e3b9f03075.cephfs=/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9/ /vo --verbose parsing options: rw source mount path was not specified unable to parse mount source: -22 root@ceph-vo-m:/etc/ceph# So I copy the /etc/ceph/ceph.conf on my client Put the /etc/ceph/ceph.client.vo.keyring on my client No firewall between the client/cluster. Weird part is when I run a tcpdump on my client I didn't see any tcp activity. Anyway to debug this pb ? Thanks Regards -- Albert SHIH 🦫 🐸 France Heure locale/Local time: ven. 02 févr. 2024 16:21:01 CET ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] How can I clone data from a faulty bluestore disk?
Hi, I have a small cluster with some faulty disks within it and I want to clone the data from the faulty disks onto new ones. The cluster is currently down and I am unable to do things like ceph-bluestore-fsck but ceph-bluestore-tool bluefs-export does appear to be working. Any help would be appreciated Many thanks Carl ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Debian 12 (bookworm) / Reef 18.2.1 problems
Hi Matthew AFAIK the upgrade from quincy/deb11 to reef/deb12 is not possible: * The packaging problem you can work around, and a fix is pending * You have to upgrade both the OS and Ceph in one step * The MGR will not run under deb12 due to the PyO3 lack of support for subinterpreters. If you do attempt an upgrade, you will end up stuck with a partially upgraded cluster. The MONs will be on deb12/reef and cannot be downgraded, and the MGR will be stuck on deb11/quincy, We have a test cluster in that state with no way forward or back. I fear the MGR problem will spread as time goes on and PyO3 updates occur. And it's not good that it can silently corrupt in the existing apparently-working installations. No-one has picked up issue 64213 that I raised yet. I'm tempted to raise another issue for qa : the debian 12 package cannot have been tested as it just won't work either as an upgrade or a new install. Regards, Chris On 02/02/2024 14:40, Matthew Darwin wrote: Chris, Thanks for all the investigations you are doing here. We're on quincy/debian11. Is there any working path at this point to reef/debian12? Ideally I want to go in two steps. Upgrade ceph first or upgrade debian first, then do the upgrade to the other one. Most of our infra is already upgraded to debian 12, except ceph. On 2024-01-29 07:27, Chris Palmer wrote: I have logged this as https://tracker.ceph.com/issues/64213 On 16/01/2024 14:18, DERUMIER, Alexandre wrote: Hi, ImportError: PyO3 modules may only be initialized once per interpreter process and ceph -s reports "Module 'dashboard' has failed dependency: PyO3 modules may only be initialized once per interpreter process We have the same problem on proxmox8 (based on debian12) with ceph quincy or reef. It seem to be related to python version on debian12 (we have no fix for this currently) ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: How can I clone data from a faulty bluestore disk?
Hi Carl, you might want to use ceph-objectstore-tool to export PGs from faulty OSDs and import them back to healthy ones. The process could be quite tricky though. There is also a pending PR (https://github.com/ceph/ceph/pull/54991) to make the tool more tolerant of disk errors. The patch is worth trying in some cases, though it is not a silver bullet. And generally, whether recovery is doable depends greatly on the actual error(s). Thanks, Igor On 02/02/2024 19:03, Carl J Taylor wrote: Hi, I have a small cluster with some faulty disks within it and I want to clone the data from the faulty disks onto new ones. The cluster is currently down and I am unable to do things like ceph-bluestore-fsck but ceph-bluestore-tool bluefs-export does appear to be working. Any help would be appreciated Many thanks Carl ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
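A bare-bones sketch of the export/import cycle Igor refers to, with placeholder OSD ids, pgid and paths; both OSDs must be stopped while the tool runs, and keeping the export files around until the cluster is healthy again is strongly advisable:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 3.2a --op export --file /backup/3.2a.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-20 --op import --file /backup/3.2a.export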
[ceph-users] Re: Debian 12 (bookworm) / Reef 18.2.1 problems
On Fri, Feb 2, 2024 at 11:21 AM Chris Palmer wrote: > > Hi Matthew > > AFAIK the upgrade from quincy/deb11 to reef/deb12 is not possible: > > * The packaging problem you can work around, and a fix is pending > * You have to upgrade both the OS and Ceph in one step > * The MGR will not run under deb12 due to the PyO3 lack of support for > subinterpreters. > > If you do attempt an upgrade, you will end up stuck with a partially > upgraded cluster. The MONs will be on deb12/reef and cannot be > downgraded, and the MGR will be stuck on deb11/quincy, We have a test > cluster in that state with no way forward or back. > > I fear the MGR problem will spread as time goes on and PyO3 updates > occur. And it's not good that it can silently corrupt in the existing > apparently-working installations. > > No-one has picked up issue 64213 that I raised yet. > > I'm tempted to raise another issue for qa : the debian 12 package cannot > have been tested as it just won't work either as an upgrade or a new > install. you're right that the debian packages don't get tested: https://docs.ceph.com/en/reef/start/os-recommendations/#platforms > > Regards, Chris > > > On 02/02/2024 14:40, Matthew Darwin wrote: > > Chris, > > > > Thanks for all the investigations you are doing here. We're on > > quincy/debian11. Is there any working path at this point to > > reef/debian12? Ideally I want to go in two steps. Upgrade ceph first > > or upgrade debian first, then do the upgrade to the other one. Most of > > our infra is already upgraded to debian 12, except ceph. > > > > On 2024-01-29 07:27, Chris Palmer wrote: > >> I have logged this as https://tracker.ceph.com/issues/64213 > >> > >> On 16/01/2024 14:18, DERUMIER, Alexandre wrote: > >>> Hi, > >>> > > ImportError: PyO3 modules may only be initialized once per > > interpreter > > process > > > > and ceph -s reports "Module 'dashboard' has failed dependency: PyO3 > > modules may only be initialized once per interpreter process > >>> We have the same problem on proxmox8 (based on debian12) with ceph > >>> quincy or reef. > >>> > >>> It seem to be related to python version on debian12 > >>> > >>> (we have no fix for this currently) > >>> > >>> > >>> > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Debian 12 (bookworm) / Reef 18.2.1 problems
Would migrating to a cephadm orchestrated docker/podman cluster be an acceptable workaround? We are running that config with reef containers on Debian 12 hosts, with a couple of debian 12 clients successfully mounting cephfs mounts, using the reef client packages directly on Debian. On Fri, Feb 2, 2024, 8:21 AM Chris Palmer wrote: > Hi Matthew > > AFAIK the upgrade from quincy/deb11 to reef/deb12 is not possible: > > * The packaging problem you can work around, and a fix is pending > * You have to upgrade both the OS and Ceph in one step > * The MGR will not run under deb12 due to the PyO3 lack of support for > subinterpreters. > > If you do attempt an upgrade, you will end up stuck with a partially > upgraded cluster. The MONs will be on deb12/reef and cannot be > downgraded, and the MGR will be stuck on deb11/quincy, We have a test > cluster in that state with no way forward or back. > > I fear the MGR problem will spread as time goes on and PyO3 updates > occur. And it's not good that it can silently corrupt in the existing > apparently-working installations. > > No-one has picked up issue 64213 that I raised yet. > > I'm tempted to raise another issue for qa : the debian 12 package cannot > have been tested as it just won't work either as an upgrade or a new > install. > > Regards, Chris > > > On 02/02/2024 14:40, Matthew Darwin wrote: > > Chris, > > > > Thanks for all the investigations you are doing here. We're on > > quincy/debian11. Is there any working path at this point to > > reef/debian12? Ideally I want to go in two steps. Upgrade ceph first > > or upgrade debian first, then do the upgrade to the other one. Most of > > our infra is already upgraded to debian 12, except ceph. > > > > On 2024-01-29 07:27, Chris Palmer wrote: > >> I have logged this as https://tracker.ceph.com/issues/64213 > >> > >> On 16/01/2024 14:18, DERUMIER, Alexandre wrote: > >>> Hi, > >>> > > ImportError: PyO3 modules may only be initialized once per > > interpreter > > process > > > > and ceph -s reports "Module 'dashboard' has failed dependency: PyO3 > > modules may only be initialized once per interpreter process > >>> We have the same problem on proxmox8 (based on debian12) with ceph > >>> quincy or reef. > >>> > >>> It seem to be related to python version on debian12 > >>> > >>> (we have no fix for this currently) > >>> > >>> > >>> > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Debian 12 (bookworm) / Reef 18.2.1 problems
We have fundamental problems with the concept of cephadm and its direction of travel. But that's a different story. The nub of this problem is a design incompatibility with MGR and the PyO3 package that python-cryptography relies on. It's actually unsafe as it is, and the new package just stops you performing the unsafe operations. So that affects all distributions and containers and versions of ceph. Eventually the updated PyO3 will find its way into other distributions and containers bringing things to a head. On 02/02/2024 16:45, Brian Chow wrote: Would migrating to a cephadm orchestrated docker/podman cluster be an acceptable workaround? We are running that config with reef containers on Debian 12 hosts, with a couple of debian 12 clients successfully mounting cephfs mounts, using the reef client packages directly on Debian. On Fri, Feb 2, 2024, 8:21 AM Chris Palmer wrote: Hi Matthew AFAIK the upgrade from quincy/deb11 to reef/deb12 is not possible: * The packaging problem you can work around, and a fix is pending * You have to upgrade both the OS and Ceph in one step * The MGR will not run under deb12 due to the PyO3 lack of support for subinterpreters. If you do attempt an upgrade, you will end up stuck with a partially upgraded cluster. The MONs will be on deb12/reef and cannot be downgraded, and the MGR will be stuck on deb11/quincy, We have a test cluster in that state with no way forward or back. I fear the MGR problem will spread as time goes on and PyO3 updates occur. And it's not good that it can silently corrupt in the existing apparently-working installations. No-one has picked up issue 64213 that I raised yet. I'm tempted to raise another issue for qa : the debian 12 package cannot have been tested as it just won't work either as an upgrade or a new install. Regards, Chris On 02/02/2024 14:40, Matthew Darwin wrote: Chris, Thanks for all the investigations you are doing here. We're on quincy/debian11. Is there any working path at this point to reef/debian12? Ideally I want to go in two steps. Upgrade ceph first or upgrade debian first, then do the upgrade to the other one. Most of our infra is already upgraded to debian 12, except ceph. On 2024-01-29 07:27, Chris Palmer wrote: I have logged this as https://tracker.ceph.com/issues/64213 On 16/01/2024 14:18, DERUMIER, Alexandre wrote: Hi, ImportError: PyO3 modules may only be initialized once per interpreter process and ceph -s reports "Module 'dashboard' has failed dependency: PyO3 modules may only be initialized once per interpreter process We have the same problem on proxmox8 (based on debian12) with ceph quincy or reef. It seem to be related to python version on debian12 (we have no fix for this currently) ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Unable to mount ceph
On 02/02/2024 at 16:34:17+0100, Albert Shih wrote: > Hi, > > > A little basic question. > > I created a volume with > > ceph fs volume > > then a subvolume called «erasure» I can see that with > > root@cthulhu1:/etc/ceph# ceph fs subvolume info cephfs erasure > { > "atime": "2024-02-02 11:02:07", > "bytes_pcent": "undefined", > "bytes_quota": "infinite", > "bytes_used": 0, > "created_at": "2024-02-02 11:02:07", > "ctime": "2024-02-02 14:12:30", > "data_pool": "data_ec", > "features": [ > "snapshot-clone", > "snapshot-autoprotect", > "snapshot-retention" > ], > "gid": 0, > "mode": 16877, > "mon_addrs": [ > "145.238.187.184:6789", > "145.238.187.185:6789", > "145.238.187.186:6789", > "145.238.187.188:6789", > "145.238.187.187:6789" > ], > "mtime": "2024-02-02 14:12:30", > "path": "/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9", > "pool_namespace": "", > "state": "complete", > "type": "subvolume", > "uid": 0 > } > > From the mon server I was able to mount the «partition» with > > mount -t ceph > admin@fXXX-c0f2-11ee-9307-f7e3b9f03075.cephfs=/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9 > /mnt > > but on my test client I'm unable to mount OK, I found the problem. The version of ceph-common is not the same on the server and on the client. On the client it's the Debian package, so the mount.ceph syntax is not the same. With mount.ceph mon1,mon2,mon3,mon4,mon5:/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9/vo/ /vo -o name=vo everything works fine. Sorry Regards -- Albert SHIH 🦫 🐸 France Heure locale/Local time: ven. 02 févr. 2024 18:21:23 CET ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: OSD read latency grows over time
We've seen issues with high index OSD latencies in multiple scenarios over the past couple of years. The issues related to rocksdb tombstones could certainly be relevant, but compact on deletion has been very effective for us in that regard. Recently, we experienced a similar issue at a higher level with the RGW bucket index deletion markers on versioned buckets. Do you happen to have versioned buckets in your cluster? If you do and the clients of those buckets are doing a bunch of deletes that leave behind S3 delete markers, the CLS code may be doing a lot of work to filter relevant entries during bucket listing ops. Another thing that we've found is that rocksdb can become quite slow if it doesn't have enough memory for internal caches. As our cluster usage has grown, we've needed to increase OSD memory in accordance with bucket index pool usage. On one cluster, we found that increasing OSD memory improved rocksdb latencies by over 10x. Hope this helps! Cory Snyder From: Tobias Urdin Sent: Friday, February 2, 2024 5:41 AM To: ceph-users Subject: [ceph-users] Re: OSD read latency grows over time I found the internal note I made about it, see below. When we trim thousands of OMAP keys in RocksDB this calls SingleDelete() in the RocksDBStore in Ceph, this causes tombstones in the RocksDB database. These thousands of tombstones that each needs to be iterated over when for example reading data from the database causes the latency to become super high. If the OSD is restarted the issue disappears, I assume this is because RocksDB or the RocksDBStore in Ceph creates a new iterator or does compaction internally upon startup. I don't see any straight forward solution without having to rebuild internal logic in the usage trim code. More specifically that would be investigating in the usage trim code to use `cls_cxx_map_remove_range()` which would call `RocksDBStore::RocksDBTransactionImpl::rm_range_keys()` internally instead when doing a usage trim for an epoch (--start-date and --end-date only, and no user or bucket). The problem there though is that the `rocksdb_delete_range_threshold` config option defaults to 1_M which is way more than the amount we are deleting and still causing issue, that function calls `DeleteRange()` instead of `SingleDelete()` in RocksDB which would cause one tombstone for all entries instead of one tombstone for every single OMAP key. Even better for above would be calling `rmkeys_by_prefix()` and not having to specify start and end but there is no OSD op in PrimaryLogPG for that which means even more work that might not be backportable. Our best bet right now without touching radosgw-admin is upgrading to >=16.2.14 which introduces https://github.com/ceph/ceph/pull/50894 that will do compaction if a threshold of tombstones is hit within a sliding window during iteration. Best regards > On 2 Feb 2024, at 11:29, Tobias Urdin wrote: > > Shiming in here, just so that it's indexed in archives.
> We’ve had a lot of issues with tombstones when running RGW usage logging, and when we trim those, the Ceph OSD hosting that usage.X object basically has its performance killed by the sheer number of tombstones; restarting the OSD solves it.
> We are not yet on Quincy, but when we are we will look into optimizing rocksdb_cf_compact_on_deletion_trigger so that we don’t have to locate the objects, trim, and restart OSDs every time we want to clean them.
> Unfortunately the message on Ceph Slack is lost since it was a while back that I wrote more details on that investigation, but IIRC the issue is that "radosgw-admin usage trim" does SingleDelete() in the RocksDB layer when deleting objects that could be bulk deleted (RangeDelete?) due to them having the same prefix (name + date).
> Best regards
>> On 26 Jan 2024, at 23:18, Mark Nelson wrote:
>> On 1/26/24 11:26, Roman Pashin wrote:
>> Unfortunately they cannot. You'll want to set them in centralized conf and then restart OSDs for them to take effect.
>>> Got it. Thank you Josh! Will put it in the config of affected OSDs and restart them.
>>> Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger 16384 > 4096 hurt performance of HDD OSDs in any way? I have no growing laten
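As a minimal sketch of the compact-on-deletion tuning discussed above (the values are illustrative, not recommendations; the trigger option is named in the thread, while `rocksdb_cf_compact_on_deletion_sliding_window` is the companion option I would expect alongside it, so verify both names on your release with `ceph config help <option>`):
ceph config set osd rocksdb_cf_compact_on_deletion_trigger 4096         # compact once this many tombstones are seen
ceph config set osd rocksdb_cf_compact_on_deletion_sliding_window 32768 # ...within a sliding window of this many keys (assumed option name)
# per the quoted advice above, centralized settings take effect after an OSD restart;
# as a one-off workaround you can also compact a single OSD manually:
ceph tell osd.123 compact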
[ceph-users] Re: OSD read latency grows over time
You adjusted osd_memory_target? Higher than the default 4GB?
> Another thing that we've found is that rocksdb can become quite slow if it doesn't have enough memory for internal caches. As our cluster usage has grown, we've needed to increase OSD memory in accordance with bucket index pool usage. On one cluster, we found that increasing OSD memory improved rocksdb latencies by over 10x.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: OSD read latency grows over time
Yes, we changed osd_memory_target to 10 GB on just our index OSDs. These OSDs have over 300 GB of lz4 compressed bucket index omap data. Here is a graph showing the latencies before/after that single change: https://pasteboard.co/IMCUWa1t3Uau.png
Cory Snyder
From: Anthony D'Atri Sent: Friday, February 2, 2024 2:15 PM To: Cory Snyder Cc: ceph-users Subject: Re: [ceph-users] OSD read latency grows over time
You adjusted osd_memory_target? Higher than the default 4GB?
> Another thing that we've found is that rocksdb can become quite slow if it doesn't have enough memory for internal caches. As our cluster usage has grown, we've needed to increase OSD memory in accordance with bucket index pool usage. On one cluster, we found that increasing OSD memory improved rocksdb latencies by over 10x.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
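For reference, a minimal sketch of how such a targeted bump can be applied through the centralized config (the OSD id and the "nvme" device class are hypothetical; 10737418240 bytes = 10 GB):
ceph config set osd.12 osd_memory_target 10737418240          # per-OSD override for a single index OSD
ceph config set osd/class:nvme osd_memory_target 10737418240  # or by CRUSH device class, if the index OSDs share one
ceph config get osd.12 osd_memory_target                      # verify the effective value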
[ceph-users] Re: OSD read latency grows over time
1024 PGs on NVMe.
From: Anthony D'Atri Sent: Friday, February 2, 2024 2:37 PM To: Cory Snyder Subject: Re: [ceph-users] OSD read latency grows over time
Thanks. What type of media are your index OSDs? How many PGs?
> On Feb 2, 2024, at 2:32 PM, Cory Snyder wrote:
> Yes, we changed osd_memory_target to 10 GB on just our index OSDs. These OSDs have over 300 GB of lz4 compressed bucket index omap data. Here is a graph showing the latencies before/after that single change:
> https://pasteboard.co/IMCUWa1t3Uau.png
> Cory Snyder
> From: Anthony D'Atri Sent: Friday, February 2, 2024 2:15 PM To: Cory Snyder Cc: ceph-users Subject: Re: [ceph-users] OSD read latency grows over time
> You adjusted osd_memory_target? Higher than the default 4GB?
>> Another thing that we've found is that rocksdb can become quite slow if it doesn't have enough memory for internal caches. As our cluster usage has grown, we've needed to increase OSD memory in accordance with bucket index pool usage. On one cluster, we found that increasing OSD memory improved rocksdb latencies by over 10x.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: How can I clone data from a faulty bluestore disk?
Hi, if the OSDs are deployed as LVs (by ceph-volume) you could try to do a pvmove to a healthy disk. There was a thread here a couple of weeks ago explaining the steps. I don't have it at hand right now, but it should be easy to find. Of course, there's no guarantee that this will be successful. I also can't tell if Igor's approach is more promising.
Zitat von Igor Fedotov : Hi Carl, you might want to use ceph-objectstore-tool to export PGs from faulty OSDs and import them back to healthy ones. The process could be quite tricky though. There is also a pending PR (https://github.com/ceph/ceph/pull/54991) to make the tool more tolerant to disk errors. The patch is worth trying in some cases, though it's not a silver bullet. And generally, whether recovery is doable greatly depends on the actual error(s). Thanks, Igor
On 02/02/2024 19:03, Carl J Taylor wrote: Hi, I have a small cluster with some faulty disks within it and I want to clone the data from the faulty disks onto new ones. The cluster is currently down and I am unable to do things like ceph-bluestore-tool fsck, but ceph-bluestore-tool bluefs-export does appear to be working. Any help would be appreciated. Many thanks Carl
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
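A rough sketch of the ceph-objectstore-tool path Igor describes, with both OSDs stopped first (the OSD ids, PG id, and file paths below are hypothetical, and the usual caveat about working from copies of failing media applies):
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --op list-pgs                                          # see which PGs live on the faulty OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --pgid 2.1a --op export --file /backup/pg-2.1a.export  # export one PG
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op import --file /backup/pg-2.1a.export             # import on a healthy OSD, then start it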
[ceph-users] Re: OSD read latency grows over time
Hi Cory, Thanks for the excellent information here! I'm super curious how much the kv cache is using in this case. If you happen to have a dump from the perf counters that includes the prioritycache subsystem, that would be ideal. By default, onode (meta) and rocksdb (except for onodes stored in rocksdb) each get a first shot at 45% of the available cache memory at high priority, but how much they actually request depends on the relative ages of the items in each cache. The age bins are defined in seconds. By default:
kv: "1 2 6 24 120 720 0 0 0 0"
kv_onode: "0 0 0 0 0 0 0 0 0 720"
meta: "1 2 6 24 120 720 0 0 0 0"
data: "1 2 6 24 120 720 0 0 0 0"
and the ratios:
kv: 45%
kv_onode: 4%
meta: 45%
data: 6% (implicit)
This means that items from the kv cache, meta cache, and data cache that are less than 1 second old will all be competing with each other for memory during the first round. kv and meta can each get up to 45% of the available memory, and kv_onode and data get up to 4% and 6% respectively. Since kv_onode doesn't actually compete at the first priority level, though, it won't actually request any memory. Whatever memory is left after the first round (assuming there is any) will be divided up based on the ratios among the remaining caches that are still requesting memory, until either there are no requests or no memory left. After that, the PriorityCache proceeds to the next round and does the same thing, this time for cache items that are between 1 and 2 seconds old. Then between 2 and 6 seconds old, etc. This approach lets us have different caches compete at different intervals. For instance, we could have the first age bin be 0-1 seconds for onodes, but 0-5 seconds for kv. We could also make the ratios different. I.e., the first bin might be for onodes that are 0-1 seconds old, but we give them a first shot at 60% of the memory. kv entries that are 0-5 seconds old might all be put in the first priority bin with the 0-1 second onodes, but we could give them, say, only a 30% initial shot at available memory (they would still all be cached with higher priority than onodes that are 1-2 seconds old). Ultimately, we might find that there are better defaults for the bins and ratios when the index gets big; however, we typically really want to cache onodes, so if we are seeing that the kv cache is fully utilizing its default ratio, increasing the amount of memory may indeed be warranted. Mark
On 2/2/24 12:50, Cory Snyder wrote: We've seen issues with high index OSD latencies in multiple scenarios over the past couple of years. The issues related to rocksdb tombstones could certainly be relevant, but compact on deletion has been very effective for us in that regard. Recently, we experienced a similar issue at a higher level with the RGW bucket index deletion markers on versioned buckets. Do you happen to have versioned buckets in your cluster? If you do and the clients of those buckets are doing a bunch of deletes that leave behind S3 delete markers, the CLS code may be doing a lot of work to filter relevant entries during bucket listing ops. Another thing that we've found is that rocksdb can become quite slow if it doesn't have enough memory for internal caches. As our cluster usage has grown, we've needed to increase OSD memory in accordance with bucket index pool usage. On one cluster, we found that increasing OSD memory improved rocksdb latencies by over 10x. Hope this helps!
Cory Snyder
From: Tobias Urdin Sent: Friday, February 2, 2024 5:41 AM To: ceph-users Subject: [ceph-users] Re: OSD read latency grows over time
I found the internal note I made about it, see below. When we trim thousands of OMAP keys in RocksDB this calls SingleDelete() in the RocksDBStore in Ceph, which causes tombstones in the RocksDB database. These thousands of tombstones each need to be iterated over when, for example, reading data from the database, which causes the latency to become very high. If the OSD is restarted the issue disappears; I assume this is because RocksDB or the RocksDBStore in Ceph creates a new iterator or does compaction internally upon startup. I don't see any straightforward solution without rebuilding internal logic in the usage trim code. More specifically, that would mean changing the usage trim code to use `cls_cxx_map_remove_range()`, which would call `RocksDBStore::RocksDBTransactionImpl::rm_range_keys()` internally instead when doing a usag
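To gather the prioritycache perf-counter dump Mark asks for above, something along these lines should work (the OSD id is hypothetical, and the jq filter simply assumes the relevant counter sections are prefixed with "prioritycache"):
ceph tell osd.123 perf dump | jq 'with_entries(select(.key | startswith("prioritycache")))'
# or, on the OSD's host, via the admin socket:
ceph daemon osd.123 perf dump > /tmp/osd.123-perf.json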