[ceph-users] Re: Cannot recreate monitor in upgrade from pacific to quincy (leveldb -> rocksdb)

2024-02-02 Thread Mark Schouten

Hi,

Cool, thanks!

As for the global_id_reclaim settings:
root@proxmox01:~# ceph config get mon auth_allow_insecure_global_id_reclaim
false
root@proxmox01:~# ceph config get mon auth_expose_insecure_global_id_reclaim
true
root@proxmox01:~# ceph config get mon mon_warn_on_insecure_global_id_reclaim
true
root@proxmox01:~# ceph config get mon mon_warn_on_insecure_global_id_reclaim_allowed
true


—
Mark Schouten
CTO, Tuxis B.V.
+31 318 200208 / m...@tuxis.nl


-- Original Message --
From "Eugen Block" 
To ceph-users@ceph.io
Date 02/02/2024, 08:30:45
Subject [ceph-users] Re: Cannot recreate monitor in upgrade from pacific 
to quincy (leveldb -> rocksdb)



I might have a reproducer: the second rebuilt mon is not joining the cluster 
either. I'll look into it and let you know if I find anything.

Zitat von Eugen Block :


Hi,


Can anyone confirm that ancient (2017) leveldb database mons should just 
accept ‘mon.$hostname’ names for mons, as well as ‘mon.$id’?


at some point you had or have to remove one of the mons to recreate it with a 
rocksdb backend, so the mismatch should not be an issue here. I can confirm 
that from trying to reproduce it in a small test cluster with leveldb. So now 
I have two leveldb MONs and one rocksdb MON:

jewel:~ # cat /var/lib/ceph/b08424fa-8530-4080-876d-2821c916d26c/mon.jewel/kv_backend
rocksdb
jewel2:~ # cat /var/lib/ceph/b08424fa-8530-4080-876d-2821c916d26c/mon.jewel2/kv_backend
leveldb
jewel3:~ # cat /var/lib/ceph/b08424fa-8530-4080-876d-2821c916d26c/mon.jewel3/kv_backend
leveldb

And the cluster is healthy, although it took a minute or two for the rebuilt 
MON to sync (in a real cluster with some load etc. it might take longer):

jewel:~ # ceph -s
  cluster:
id: b08424fa-8530-4080-876d-2821c916d26c
health: HEALTH_OK

  services:
mon: 3 daemons, quorum jewel2,jewel3,jewel (age 3m)

I'm wondering if this could have to do with the insecure_global_id  things. Can 
you send the output of:

ceph config get mon auth_allow_insecure_global_id_reclaim
ceph config get mon auth_expose_insecure_global_id_reclaim
ceph config get mon mon_warn_on_insecure_global_id_reclaim
ceph config get mon mon_warn_on_insecure_global_id_reclaim_allowed



Zitat von Mark Schouten :


Hi,

I don’t have a fourth machine available, so that’s not an option, unfortunately.

I did enable a lot of debugging earlier, but that shows no information as to 
why things are not working as expected.

Proxmox just deploys the mons, nothing fancy there, no special cases.

Can anyone confirm that ancient (2017) leveldb database mons should just 
accept ‘mon.$hostname’ names for mons, as well as ‘mon.$id’?

—
Mark Schouten
CTO, Tuxis B.V.
+31 318 200208 / m...@tuxis.nl


-- Original Message --
From "Eugen Block" 
To ceph-users@ceph.io
Date 31/01/2024, 13:02:04
Subject [ceph-users] Re: Cannot recreate monitor in upgrade from  pacific to 
quincy (leveldb -> rocksdb)


Hi Mark,

as I'm not familiar with proxmox I'm not sure what happens under the hood. 
There are a couple of things I would try, not necessarily in this order:

- Check the troubleshooting guide [1]; for example, a clock skew could be one 
reason. Have you verified ntp/chronyd functionality?
- Inspect debug log output, maybe first on the probing mon, and if those don't 
reveal the reason, enable debug logs for the other MONs as well:
ceph config set mon.proxmox03 debug_mon 20
ceph config set mon.proxmox03 debug_paxos 20

or for all MONs:
ceph config set mon debug_mon 20
ceph config set mon debug_paxos 20

- Try to deploy an additional MON on a different server (if you have more 
available) and see if that works.
- Does proxmox log anything?
- Maybe as a last resort, try to start a MON manually after adding it to the 
monmap with monmaptool, but only if you know what you're doing. I wonder 
if the monmap doesn't get updated...

Regards,
Eugen

[1]  https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/

Zitat von Mark Schouten :


Hi,

During an upgrade from pacific to quincy, we needed to recreate the mons 
because the mons were pretty old and still using leveldb.

So step one was to destroy one of the mons. After that we recreated the 
monitor, and although it starts, it remains in state ‘probing’, as you can 
see below.

No matter what I tried, it won’t come up. I’ve seen quite a few messages that 
the MTU might be an issue, but that seems to be ok:
root@proxmox03:/var/log/ceph# fping -b 1472 10.10.10.{1..3} -M
10.10.10.1 is alive
10.10.10.2 is alive
10.10.10.3 is alive


Does anyone have an idea how to fix this? I’ve tried destroying and 
recreating the mon a few times now. Could it be that the leveldb mons only 
support mon.$id notation for the monitors?

root@proxmox03:/var/log/ceph# ceph daemon mon.proxmox03 mon_status
{
  "name": “proxmox03”,
  "rank": 2,
  "state": “probing”,
  "election_epoch": 0,
  "quorum": [],
  "features": {
  "required_co

[ceph-users] Re: Cannot recreate monitor in upgrade from pacific to quincy (leveldb -> rocksdb)

2024-02-02 Thread Eugen Block
I decided to try to bring the mon back manually after looking at the  
logs without any findings. It's kind of ugly but it worked. The  
problem with that approach is that I had to take down a second MON to  
inject a new monmap (which then includes the failed MON), restart it  
and do the same for the third MON. This means no quorum until two MONs  
are back, so two short interruptions. Then I created a new MON (legacy  
style) and started it with the modified monmap. The cluster got back  
into quorum, then I converted the legacy MON to cephadm and resumed  
the orchestrator.
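
For reference, the manual monmap route described above usually looks roughly 
like this (mon IDs, names and paths are placeholders, and the affected mons 
must be stopped while the map is extracted and injected):

  ceph-mon -i <surviving-mon-id> --extract-monmap /tmp/monmap
  monmaptool --print /tmp/monmap
  monmaptool --add <failed-mon-name> <ip:port> /tmp/monmap
  ceph-mon -i <mon-id> --inject-monmap /tmp/monmap   # repeat per mon, then restart them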

Unfortunately, I still don't understand what the root cause is...
I also played around with the insecure_global_id stuff to no avail, it  
was just an idea...


Zitat von Mark Schouten :


Hi,

Cool, thanks!

As for the global_id_reclaim settings:
root@proxmox01:~# ceph config get mon auth_allow_insecure_global_id_reclaim
false
root@proxmox01:~# ceph config get mon auth_expose_insecure_global_id_reclaim
true
root@proxmox01:~# ceph config get mon mon_warn_on_insecure_global_id_reclaim
true
root@proxmox01:~# ceph config get mon mon_warn_on_insecure_global_id_reclaim_allowed
true


—
Mark Schouten
CTO, Tuxis B.V.
+31 318 200208 / m...@tuxis.nl


-- Original Message --
From "Eugen Block" 
To ceph-users@ceph.io
Date 02/02/2024, 08:30:45
Subject [ceph-users] Re: Cannot recreate monitor in upgrade from  
pacific to quincy (leveldb -> rocksdb)


I might have a reproducer, the second rebuilt mon is not joining  
the  cluster as well, I'll look into it and let you know if I find  
anything.


Zitat von Eugen Block :


Hi,

Can anyone confirm that ancient (2017) leveldb database mons 
should just accept ‘mon.$hostname’ names for mons, as well as 
‘mon.$id’?


at some point you had or have to remove one of the mons to  
recreate  it with a rocksdb backend, so the mismatch should not be  
an issue  here. I can confirm that when I tried to reproduce it in  
a small  test cluster with leveldb. So now I have two leveldb MONs  
and one  rocksdb MON:


jewel:~ # cat   
/var/lib/ceph/b08424fa-8530-4080-876d-2821c916d26c/mon.jewel/kv_backend

rocksdb
jewel2:~ # cat   
/var/lib/ceph/b08424fa-8530-4080-876d-2821c916d26c/mon.jewel2/kv_backend

leveldb
jewel3:~ # cat   
/var/lib/ceph/b08424fa-8530-4080-876d-2821c916d26c/mon.jewel3/kv_backend

leveldb

And the cluster is healthy, although it took a minute or two for  
the  rebuilt MON to sync (in a real cluster with some load etc. it  
might  take longer):


jewel:~ # ceph -s
 cluster:
   id: b08424fa-8530-4080-876d-2821c916d26c
   health: HEALTH_OK

 services:
   mon: 3 daemons, quorum jewel2,jewel3,jewel (age 3m)

I'm wondering if this could have to do with the insecure_global_id  
 things. Can you send the output of:


ceph config get mon auth_allow_insecure_global_id_reclaim
ceph config get mon auth_expose_insecure_global_id_reclaim
ceph config get mon mon_warn_on_insecure_global_id_reclaim
ceph config get mon mon_warn_on_insecure_global_id_reclaim_allowed



Zitat von Mark Schouten :


Hi,

I don’t have a fourth machine available, so that’s not an option, 
unfortunately.


I did enable a lot of debugging earlier, but that shows no 
information as to why things are not working as expected.


Proxmox just deploys the mons, nothing fancy there, no special cases.

Can anyone confirm that ancient (2017) leveldb database mons 
should just accept ‘mon.$hostname’ names for mons, as well as 
‘mon.$id’?


—
Mark Schouten
CTO, Tuxis B.V.
+31 318 200208 / m...@tuxis.nl


-- Original Message --
From "Eugen Block" 
To ceph-users@ceph.io
Date 31/01/2024, 13:02:04
Subject [ceph-users] Re: Cannot recreate monitor in upgrade from   
pacific to quincy (leveldb -> rocksdb)



Hi Mark,

as I'm not familiar with proxmox I'm not sure what happens under  
 the  hood. There are a couple of things I would try, not   
necessarily in  this order:


- Check the troubleshooting guide [1], for example a clock skew   
could  be one reason, have you verified ntp/chronyd functionality?
- Inspect debug log output, maybe first on the probing mon and  
if   those don't reveal the reason, enable debug logs for the  
other  MONs as  well:

ceph config set mon.proxmox03 debug_mon 20
ceph config set mon.proxmox03 debug_paxos 20

or for all MONs:
ceph config set mon debug_mon 20
ceph config set mon debug_paxos 20

- Try to deploy an additional MON on a different server (if you   
have  more available) and see if that works.

- Does proxmox log anything?
- Maybe last resort, try to start a MON manually after adding it  
 to  the monmap with the monmaptool, but only if you know what   
you're  doing. I wonder if the monmap doesn't get updated...


Regards,
Eugen

[1]   
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/


Zitat von Mark Schouten :


Hi,

During an upgrade from pacific to quincy, we needed to recreate  
 the  mons because the mons were pretty old and still using  
leveld

[ceph-users] Ceph Dashboard failed to execute login

2024-02-02 Thread Michel Niyoyita
Hello team,

I failed to log in to my Ceph dashboard, which is running Pacific and was
deployed using ceph-ansible. I have set the admin password using the
following command: "ceph dashboard ac-user-set-password admin -i
ceph-dash-pass", where ceph-dash-pass contains the real password. I am
getting the following output: "{"username": "admin", "password":
"$2b$12$Ge/2cpg0ZGjRPnBC2YREP.E5oVyNvV4SC9HU4PMsWWMBtC9UvL7mG", "roles":
["administrator"], "name": null, "email": null, "lastUpdate": 1706866328,
"enabled": false, "pwdExpirationDate": null, "pwdUpdateRequired": false}"

Once I log in to the dashboard, I still get the same error message. I am
guessing it is because the above "enabled" field is set to false. How do I
set that field to true? Or if there is another way to set it, please
advise.

Thank you
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Dashboard failed to execute login

2024-02-02 Thread Eugen Block

Have you tried to enable it?

 # ceph dashboard ac-user-enable admin
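
 (If needed, the account state can be checked afterwards with
 "ceph dashboard ac-user-show admin", which should then report "enabled": true.)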

Zitat von Michel Niyoyita :


Hello team,

I failed to login to my ceph dashboard which is running pacific as version
and deployed using ceph-ansible . I have set admin password using the
following command : "ceph dashboard ac-user-set-password admin -i
ceph-dash-pass" where ceph-dash-pass possesses the real password. I am
getting the following output : "{"username": "admin", "password":
"$2b$12$Ge/2cpg0ZGjRPnBC2YREP.E5oVyNvV4SC9HU4PMsWWMBtC9UvL7mG", "roles":
["administrator"], "name": null, "email": null, "lastUpdate": 1706866328,
"enabled": false, "pwdExpirationDate": null, "pwdUpdateRequired": false}"

Once I login to the dashboard , still i get the same error message. I am
guessing it is because the above "enabled" field is set to false . Ho w to
set that field to true ? or if there is other alternative to set it you can
advise.

Thank you
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Dashboard failed to execute login

2024-02-02 Thread Michel Niyoyita
Thank you very much Sir , now it works.

Michel

On Fri, Feb 2, 2024 at 11:55 AM Eugen Block  wrote:

> Have you tried to enable it?
>
>   # ceph dashboard ac-user-enable admin
>
> Zitat von Michel Niyoyita :
>
> > Hello team,
> >
> > I failed to login to my ceph dashboard which is running pacific as
> version
> > and deployed using ceph-ansible . I have set admin password using the
> > following command : "ceph dashboard ac-user-set-password admin -i
> > ceph-dash-pass" where ceph-dash-pass possesses the real password. I am
> > getting the following output : "{"username": "admin", "password":
> > "$2b$12$Ge/2cpg0ZGjRPnBC2YREP.E5oVyNvV4SC9HU4PMsWWMBtC9UvL7mG", "roles":
> > ["administrator"], "name": null, "email": null, "lastUpdate": 1706866328,
> > "enabled": false, "pwdExpirationDate": null, "pwdUpdateRequired": false}"
> >
> > Once I login to the dashboard , still i get the same error message. I am
> > guessing it is because the above "enabled" field is set to false . Ho w
> to
> > set that field to true ? or if there is other alternative to set it you
> can
> > advise.
> >
> > Thank you
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RBD mirroring to an EC pool

2024-02-02 Thread Jan Kasprzak
Hello, Ceph users,

I would like to use my secondary Ceph cluster for backing up RBD OpenNebula
volumes from my primary cluster using mirroring in image+snapshot mode.
Because it is for backups only, not a cold-standby, I would like to use
erasure coding on the secondary side to save disk space.
Is it supported at all?

I tried to create a pool:

secondary# ceph osd pool create one-mirror erasure k6m2
secondary# ceph osd pool set one-mirror allow_ec_overwrites true
set pool 13 allow_ec_overwrites to true
secondary# rbd mirror pool enable --site-name secondary one-mirror image
2024-02-02T11:00:34.123+0100 7f95070ad5c0 -1 librbd::api::Mirror: mode_set: 
failed to allocate mirroring uuid: (95) Operation not supported

When I created a replicated pool instead, this step worked:

secondary# ceph osd pool create one-mirror-repl replicated
secondary# rbd mirror pool enable --site-name secondary one-mirror-repl image
secondary#

So, is RBD mirroring supported with erasure-coded pools at all? Thanks!
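
For what it's worth, a pattern that is often suggested (untested here) is to
keep a small replicated pool for the RBD image headers/metadata and point the
image data at the EC pool, since RBD cannot keep its omap metadata on an EC
pool. Roughly, with hypothetical pool names:

  secondary# ceph osd pool create one-mirror-meta replicated
  secondary# rbd pool init one-mirror-meta
  secondary# rbd mirror pool enable --site-name secondary one-mirror-meta image
  secondary# ceph config set client rbd_default_data_pool one-mirror

The last line is only meant to illustrate that images created on the secondary
(e.g. by rbd-mirror) can be steered to the EC data pool via rbd_default_data_pool.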

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| https://www.fi.muni.cz/~kas/GPG: 4096R/A45477D5 |
We all agree on the necessity of compromise. We just can't agree on
when it's necessary to compromise. --Larry Wall
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD read latency grows over time

2024-02-02 Thread Tobias Urdin
Chiming in here, just so that it’s indexed in the archives.

We’ve had a lot of issues with tombstones when running RGW usage logging: when 
we trim those, it basically kills the performance of the Ceph OSD hosting that 
usage.X object because the tombstones are so many. Restarting the OSD solves it.

We are not yet on Quincy, but when we are we will look into tuning 
rocksdb_cf_compact_on_deletion_trigger
so that we don’t have to locate the objects, trim, and restart OSDs every time we 
want to clean them.

Unfortunately the message on Ceph Slack is lost, since it was a while back that 
I wrote more details on that investigation, but IIRC the issue is that 
"radosgw-admin usage trim" does SingleDelete() in the RocksDB layer when 
deleting objects that could have been bulk deleted (DeleteRange?) because they 
share the same prefix (name + date).
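
For anyone wanting to try the compact-on-deletion route mentioned above, the
sketch would be something like the following (the value is only an example,
and per the thread the OSDs need a restart for it to take effect):

  ceph config set osd rocksdb_cf_compact_on_deletion_trigger 4096
  ceph tell osd.<id> compact   # a manual compaction can also clear existing tombstones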

Best regards

> On 26 Jan 2024, at 23:18, Mark Nelson  wrote:
> 
> On 1/26/24 11:26, Roman Pashin wrote:
> 
>>> Unfortunately they cannot. You'll want to set them in centralized conf
>>> and then restart OSDs for them to take effect.
>>> 
>> Got it. Thank you Josh! WIll put it to config of affected OSDs and restart
>> them.
>> 
>> Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger 16384 >
>> 4096 hurt performance of HDD OSDs in any way? I have no growing latency on
>> HDD OSD, where data is stored, but it would be easier to set it to [osd]
>> section without cherry picking only SSD/NVME OSDs, but for all at once.
> 
> 
> Potentially if you set the trigger too low, you could force constant 
> compactions.  Say if you set it to trigger compaction every time a tombstone 
> is encountered.  You really want to find the sweet spot where iterating over 
> tombstones (possibly multiple times) is more expensive than doing a 
> compaction.  The defaults are basically just tuned to avoid the worst case 
> scenario where OSDs become laggy or even go into heartbeat timeout (and we're 
> not 100% sure we got those right).  I believe we've got a couple of big users 
> that tune it more aggressively, though I'll let them speak up if they are 
> able.
> 
> 
> Mark
> 
> 
>> --
>> Thank you,
>> Roman
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD read latency grows over time

2024-02-02 Thread Tobias Urdin
I found the internal note I made about it, see below.

When we trim thousands of OMAP keys in RocksDB, this calls 
SingleDelete() in the RocksDBStore in Ceph, which creates tombstones in
the RocksDB database.

These thousands of tombstones, each of which needs to be iterated over when, 
for example, reading data from the database, cause the latency
to become super high. If the OSD is restarted the issue disappears; I 
assume this is because RocksDB or the RocksDBStore in Ceph creates
a new iterator or does compaction internally upon startup.

I don't see any straightforward solution without having to rebuild 
internal logic in the usage trim code. More specifically, that would mean 
investigating whether the usage trim code could use `cls_cxx_map_remove_range()`, 
which would call `RocksDBStore::RocksDBTransactionImpl::rm_range_keys()` internally
instead, when doing a usage trim for an epoch (—start-date and —end-date 
only, and no user or bucket).

The problem there, though, is that the `rocksdb_delete_range_threshold` 
config option defaults to 1_M, which is way more than the amount we are deleting
while still causing issues; it is only above that threshold that the function 
calls `DeleteRange()` instead of `SingleDelete()` in RocksDB, which would cause 
one tombstone for all entries instead of one tombstone for every single OMAP key.

Even better for the above would be calling `rmkeys_by_prefix()` and not 
having to specify start and end, but there is no OSD op in PrimaryLogPG for that,
which means even more work that might not be backportable.

Our best bet right now without touching radosgw-admin is upgrading to 
>=16.2.14 which introduces https://github.com/ceph/ceph/pull/50894 that will
do compaction if a threshold of tombstones is hit within a sliding 
window during iteration. 
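
(For reference, the epoch-style trim being discussed is the invocation with
only the date bounds, e.g. with example dates:
  radosgw-admin usage trim --start-date=2024-01-01 --end-date=2024-01-31)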

Best regards

> On 2 Feb 2024, at 11:29, Tobias Urdin  wrote:
> 
> Shiming in here, just so that it’s indexed in archives.
> 
> We’ve have a lot of issues with tombstones when running RGW usage logging and 
> when we
> trim those the Ceph OSD hosting that usage.X object will basically kill the 
> OSD performance
> due to the tombstones being so many, restarting the OSD solves it.
> 
> We are not yet on Quincy but when we are will look into optimizing 
> rocksdb_cf_compact_on_deletion_trigger
> so that we don’t have to locate the objects, trim, restart OSDs everytime we 
> want to clean them.
> 
> Unfortunately the message on Ceph Slack is lost since it was a while back I 
> wrote more details
> on that investigation, but IIRC the issue is that the "radosgw-admin usage 
> trim” does SingleDelete() in the RocksDB layer
> when deleting objects that could be bulk deleted (RangeDelete?) due to them 
> having the same prefix (name + date). 
> 
> Best regards
> 
>> On 26 Jan 2024, at 23:18, Mark Nelson  wrote:
>> 
>> On 1/26/24 11:26, Roman Pashin wrote:
>> 
 Unfortunately they cannot. You'll want to set them in centralized conf
 and then restart OSDs for them to take effect.
 
>>> Got it. Thank you Josh! WIll put it to config of affected OSDs and restart
>>> them.
>>> 
>>> Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger 16384 >
>>> 4096 hurt performance of HDD OSDs in any way? I have no growing latency on
>>> HDD OSD, where data is stored, but it would be easier to set it to [osd]
>>> section without cherry picking only SSD/NVME OSDs, but for all at once.
>> 
>> 
>> Potentially if you set the trigger too low, you could force constant 
>> compactions.  Say if you set it to trigger compaction every time a tombstone 
>> is encountered.  You really want to find the sweet spot where iterating over 
>> tombstones (possibly multiple times) is more expensive than doing a 
>> compaction.  The defaults are basically just tuned to avoid the worst case 
>> scenario where OSDs become laggy or even go into heartbeat timeout (and 
>> we're not 100% sure we got those right).  I believe we've got a couple of 
>> big users that tune it more aggressively, though I'll let them speak up if 
>> they are able.
>> 
>> 
>> Mark
>> 
>> 
>>> --
>>> Thank you,
>>> Roman
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Understanding subvolumes

2024-02-02 Thread Robert Sander

On 01.02.24 00:20, Matthew Melendy wrote:
In our department we're getting started with Ceph 'reef', using the Ceph 
FUSE client for our Ubuntu workstations.


So far so good, except I can't quite figure out one aspect of subvolumes.


AFAIK subvolumes were introduced to be used with Kubernetes and other 
cloud technologies.


If you run a classical file service on top of CephFS you usually do not 
need subvolumes but can go with normal quotas on directories.
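
A minimal sketch of such a directory quota, assuming a hypothetical mount 
point and share path:

  setfattr -n ceph.quota.max_bytes -v 107374182400 /mnt/cephfs/shares/groupA   # 100 GiB
  setfattr -n ceph.quota.max_files -v 100000 /mnt/cephfs/shares/groupA
  getfattr -n ceph.quota.max_bytes /mnt/cephfs/shares/groupA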


Regards
--
Robert Sander
Heinlein Support GmbH
Linux: Akademie - Support - Hosting
http://www.heinlein-support.de

Tel: 030-405051-43
Fax: 030-405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Problems adding a new host via orchestration.

2024-02-02 Thread Gary Molenkamp
Happy Friday all.  I was hoping someone could point me in the right 
direction or clarify any limitations that could be impacting an issue I 
am having.


I'm struggling to add a new set of hosts to my ceph cluster using 
cephadm and orchestration.  When trying to add a host:

    "ceph orch host add  172.31.102.41 --labels _admin"
returns:
    "Error EINVAL: Can't communicate with remote host `172.31.102.41`, 
possibly because python3 is not installed there: [Errno 12] Cannot 
allocate memory"


I've verified that the ceph ssh key works to the remote host, host's 
name matches that returned from `hostname`, python3 is installed, and 
"/usr/sbin/cephadm prepare-host" on the new hosts returns "host is ok".  
  In addition, the cluster ssh key works between hosts and the existing 
hosts are able to ssh in using the ceph key.


The existing ceph cluster is Pacific release using docker based 
containerization on RockyLinux8 base OS.  The new hosts are RockyLinux9 
based, with the cephadm being installed from Quincy release:

        ./cephadm add-repo --release quincy
        ./cephadm install
I did try installing cephadm from the Pacific release by changing the 
repo to el8,  but that did not work either.


Is there a limitation in mixing RL8 and RL9 container hosts under 
Pacific?  Does this same limitation exist under Quincy?  Is there a 
python version dependency?
The reason for RL9 on the new hosts is to stage upgrading the OS's for 
the cluster.  I did this under Octopus for moving from Centos7 to RL8.
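
(For what it's worth, the connectivity check that cephadm performs can also be 
run by hand with "ceph cephadm check-host <hostname> 172.31.102.41", 
substituting the real hostname; it may show the same error from the mgr side.)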


Thanks and I appreciate any feedback/pointers.
Gary


I've added the log trace here in case that helps (from `ceph log last 
cephadm`)




2024-02-02T14:22:32.610048+ mgr.storage01.oonvfl (mgr.441023307) 
4957871 : cephadm [ERR] Can't communicate with remote host 
`172.31.102.41`, possibly because python3 is not installed there: [Errno 
12] Cannot allocate memory

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1524, in 
_remote_connection

    conn, connr = self.mgr._get_connection(addr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1370, in 
_get_connection

    sudo=True if self.ssh_user != 'root' else False)
  File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 
35, in __init__

    self.gateway = self._make_gateway(hostname)
  File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 
46, in _make_gateway

    self._make_connection_string(hostname)
  File "/lib/python3.6/site-packages/execnet/multi.py", line 133, in 
makegateway

    io = gateway_io.create_io(spec, execmodel=self.execmodel)
  File "/lib/python3.6/site-packages/execnet/gateway_io.py", line 121, 
in create_io

    io = Popen2IOMaster(args, execmodel)
  File "/lib/python3.6/site-packages/execnet/gateway_io.py", line 21, 
in __init__

    self.popen = p = execmodel.PopenPiped(args)
  File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 
184, in PopenPiped

    return self.subprocess.Popen(args, stdout=PIPE, stdin=PIPE)
  File "/lib64/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/lib64/python3.6/subprocess.py", line 1295, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1528, in 
_remote_connection

    raise execnet.gateway_bootstrap.HostNotFound(msg)
execnet.gateway_bootstrap.HostNotFound: Can't communicate with remote 
host `172.31.102.41`, possibly because python3 is not installed there: 
[Errno 12] Cannot allocate memory


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 125, in 
wrapper

    return OrchResult(f(*args, **kwargs))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2709, in apply
    results.append(self._apply(spec))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2574, in _apply
    return self._add_host(cast(HostSpec, spec))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1517, in _add_host
    ip_addr = self._check_valid_addr(spec.hostname, spec.addr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1498, in 
_check_valid_addr

    error_ok=True, no_fsid=True)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1326, in _run_cephadm
    with self._remote_connection(host, addr) as tpl:
  File "/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1558, in 
_remote_connection

    raise OrchestratorError(msg) from e
orchestrator._interface.OrchestratorError: Can't communicate with remote 
host `172.31.102.41`, possibly because python3 is not installed there: 
[Errno 12] Cannot allocate memory





--
Gary Molenkamp  Science Technology

[ceph-users] PG upmap corner cases that silently fail

2024-02-02 Thread Andras Pataki

Hi cephers,

I've been looking into better balancing our clusters with upmaps lately, 
and ran into upmap cases that behave in a less than ideal way.  If there 
is any cycle in the upmaps like


ceph osd pg-upmap-items  a b b a
or
ceph osd pg-upmap-items  a b b c c a

the upmap validation passes, the upmap gets added to the osdmap, but 
then gets silently ignored.  Obviously this is for EC pools - irrelevant 
for replicated pools where the order of OSDs is not significant.

The relevant code OSDMap::_apply_upmap even has a comment about this:

  if (q != pg_upmap_items.end()) {
    // NOTE: this approach does not allow a bidirectional swap,
    // e.g., [[1,2],[2,1]] applied to [0,1,2] -> [0,2,1].
    for (auto& r : q->second) {
  // make sure the replacement value doesn't already appear
  ...

I'm trying to understand the reasons for this limitation: is it the case 
that this is just a matter of convenience of coding 
(OSDMap::_apply_upmap could do this correctly with a bit more careful 
approach), or is there some inherent limitation somewhere else that 
prevents these cases from working?  I did notice that just updating 
crush weights (without using upmaps) produces similar changes to the UP 
set (swaps OSDs in EC pools sometimes), so the OSDs seem to be perfectly 
capable of doing backfills for osdmap changes that shuffle the order of 
OSDs in the UP set.  Some insight/history here would be appreciated.


Either way, the behavior of validation passing on an upmap and then the 
upmap getting silently ignored is not ideal.  I do realize that all 
clients would have to agree on this code, since clients independently 
execute it to find the OSDs to access (so rolling out a change to this 
is challenging).
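
For completeness, existing entries can be audited and, if needed, removed 
with (pgid hypothetical):

  ceph osd dump | grep pg_upmap_items
  ceph osd rm-pg-upmap-items 2.7f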


Andras
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Debian 12 (bookworm) / Reef 18.2.1 problems

2024-02-02 Thread Matthew Darwin

Chris,

Thanks for all the investigations you are doing here. We're on 
quincy/debian11.  Is there any working path at this point to 
reef/debian12?  Ideally I want to go in two steps.  Upgrade ceph first 
or upgrade debian first, then do the upgrade to the other one. Most of 
our infra is already upgraded to debian 12, except ceph.


On 2024-01-29 07:27, Chris Palmer wrote:

I have logged this as https://tracker.ceph.com/issues/64213

On 16/01/2024 14:18, DERUMIER, Alexandre wrote:

Hi,


ImportError: PyO3 modules may only be initialized once per
interpreter
process

and ceph -s reports "Module 'dashboard' has failed dependency: PyO3
modules may only be initialized once per interpreter process

We have the same problem on proxmox8 (based on debian12) with ceph
quincy or reef.

It seem to be related to python version on debian12

(we have no fix for this currently)




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] XFS on top of RBD, overhead

2024-02-02 Thread Ruben Vestergaard

Hi group,

Today I conducted a small experiment to test an assumption of mine, 
namely that Ceph incurs a substantial network overhead when doing many 
small files.


One RBD was created, and on top of that an XFS containing 1.6 M files, 
each with size 10 kiB:


# rbd info libvirt/bobtest
rbd image 'bobtest':
size 20 GiB in 5120 objects
order 22 (4 MiB objects)
[...]

# df -h /space
Filesystem  Size  Used Avail Use% Mounted on
/dev/rbd020G   20G  181M 100% /space

# ls -lh /space |head
total 19G
-rw-r--r--. 1 root root 10K Feb  2 14:13 xaa
-rw-r--r--. 1 root root 10K Feb  2 14:13 xab
-rw-r--r--. 1 root root 10K Feb  2 14:13 xac
-rw-r--r--. 1 root root 10K Feb  2 14:13 xad
-rw-r--r--. 1 root root 10K Feb  2 14:13 xae
-rw-r--r--. 1 root root 10K Feb  2 14:13 xaf
-rw-r--r--. 1 root root 10K Feb  2 14:13 xag
-rw-r--r--. 1 root root 10K Feb  2 14:13 xah
-rw-r--r--. 1 root root 10K Feb  2 14:13 xai

# ls /space |wc -l
1638400

All files contain pseudorandom (i.e. incompressible) junk. 

My assumption was, that as the backend RBD block size is 4 MiB, it would 
be necessary for the client machine to download at least that 4 MiB 
worth of data on any given request, even if the file in the XFS is only 
10 kB.


I.e. I cat(1) a small file, the RBD client grabs the relevant 4 MiB 
block from Ceph, from this the small amount of requested data is 
extracted and presented to userspace.


That's not what I see, however. My testing procedure is as follows:

I have a list of all the files on the RBD, order randomized, stored in 
root's home folder -- this to make sure that I can pick file names at 
random by going through the list from the top, and not causing network 
traffic by listing files directly in the target FS. I then reboot the 
node to ensure that all caches are empty and start an iftop(1) to 
monitor network usage.


Mapping the RBD and mounting the XFS results in 5.29 MB worth of data 
read from the network.


Reading one file at random from the XFS results in approx. 200 kB of 
network read.


Reading 100 files at random results in approx. 3.83 MB of network read.

Reading 1000 files at random results in approx. 36.2 MB of network read.

Bottom line is that reading any 10 kiB of actual data results in 
approximately 37 kiB data being transferred over the network. Overhead, 
sure, but nowhere near what I expected, which was 4 MiB per block of 
data "hit" in the backend.


Is the RBD client performing partial object reads? Is that even a thing?

Cheers,
Ruben Vestergaard
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: XFS on top of RBD, overhead

2024-02-02 Thread Josh Baergen
On Fri, Feb 2, 2024 at 7:44 AM Ruben Vestergaard  wrote:
> Is the RBD client performing partial object reads? Is that even a thing?

Yup! The rados API has both length and offset parameters for reads
(https://docs.ceph.com/en/latest/rados/api/librados/#c.rados_aio_read)
and writes 
(https://docs.ceph.com/en/latest/rados/api/librados/#c.rados_aio_write).

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: XFS on top of RBD, overhead

2024-02-02 Thread Ruben Vestergaard

On Fri, Feb 02 2024 at 07:51:36 -0700, Josh Baergen wrote:

On Fri, Feb 2, 2024 at 7:44 AM Ruben Vestergaard  wrote:

Is the RBD client performing partial object reads? Is that even a thing?


Yup! The rados API has both length and offset parameters for reads
(https://docs.ceph.com/en/latest/rados/api/librados/#c.rados_aio_read)
and writes 
(https://docs.ceph.com/en/latest/rados/api/librados/#c.rados_aio_write).


Ah! That was easy. And good to know.

Thanks!
-R


Josh

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: XFS on top of RBD, overhead

2024-02-02 Thread Maged Mokhtar


On 02/02/2024 16:41, Ruben Vestergaard wrote:

Hi group,

Today I conducted a small experiment to test an assumption of mine, 
namely that Ceph incurs a substantial network overhead when doing many 
small files.


One RBD was created, and on top of that an XFS containing 1.6 M files, 
each with size 10 kiB:


    # rbd info libvirt/bobtest
    rbd image 'bobtest':
    size 20 GiB in 5120 objects
    order 22 (4 MiB objects)
    [...]

    # df -h /space
    Filesystem  Size  Used Avail Use% Mounted on
    /dev/rbd0    20G   20G  181M 100% /space

    # ls -lh /space |head
    total 19G
    -rw-r--r--. 1 root root 10K Feb  2 14:13 xaa
    -rw-r--r--. 1 root root 10K Feb  2 14:13 xab
    -rw-r--r--. 1 root root 10K Feb  2 14:13 xac
    -rw-r--r--. 1 root root 10K Feb  2 14:13 xad
    -rw-r--r--. 1 root root 10K Feb  2 14:13 xae
    -rw-r--r--. 1 root root 10K Feb  2 14:13 xaf
    -rw-r--r--. 1 root root 10K Feb  2 14:13 xag
    -rw-r--r--. 1 root root 10K Feb  2 14:13 xah
    -rw-r--r--. 1 root root 10K Feb  2 14:13 xai

    # ls /space |wc -l
    1638400

All files contain pseudorandom (i.e. incompressible) junk.
My assumption was, that as the backend RBD block size is 4 MiB, it 
would be necessary for the client machine to download at least that 4 
MiB worth of data on any given request, even if the file in the XFS is 
only 10 kB.


I.e. I cat(1) a small file, the RBD client grabs the relevant 4 MiB 
block from Ceph, from this the small amount of requested data is 
extracted and presented to userspace.


That's not what I see, however. My testing procedure is as follows:

I have a list of all the files on the RBD, order randomized, stored in 
root's home folder -- this to make sure that I can pick file names at 
random by going through the list from the top, and not causing network 
traffic by listing files directly in the target FS. I then reboot the 
node to ensure that all caches are empty and start an iftop(1) to 
monitor network usage.


Mapping the RBD and mounting the XFS results in 5.29 MB worth of data 
read from the network.


Reading one file at random from the XFS results in approx. 200 kB of 
network read.


Reading 100 files at random results in approx. 3.83 MB of network read.

Reading 1000 files at random results in approx. 36.2 MB of network read.

Bottom line is that reading any 10 kiB of actual data results in 
approximately 37 kiB data being transferred over the network. 
Overhead, sure, but nowhere near what I expected, which was 4 MiB per 
block of data "hit" in the backend.


Is the RBD client performing partial object reads? Is that even a thing?

Cheers,
Ruben Vestergaard
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


The OSD/rados API allows you to read partial data within an object; you 
specify the length and logical offset within the object, so there is no need 
to read the entire object if you do not need it. This is not specific to RBD. 
The small network overhead is, I guess, overhead in the network protocol 
layers, including the Ceph messenger overhead.


/maged

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Unable to mount ceph

2024-02-02 Thread Albert Shih
Hi, 


A little basic question.

I created a volume with 

  ceph fs volume

then a subvolume called «erasure» I can see that with 

root@cthulhu1:/etc/ceph# ceph fs subvolume info cephfs erasure
{
"atime": "2024-02-02 11:02:07",
"bytes_pcent": "undefined",
"bytes_quota": "infinite",
"bytes_used": 0,
"created_at": "2024-02-02 11:02:07",
"ctime": "2024-02-02 14:12:30",
"data_pool": "data_ec",
"features": [
"snapshot-clone",
"snapshot-autoprotect",
"snapshot-retention"
],
"gid": 0,
"mode": 16877,
"mon_addrs": [
"145.238.187.184:6789",
"145.238.187.185:6789",
"145.238.187.186:6789",
"145.238.187.188:6789",
"145.238.187.187:6789"
],
"mtime": "2024-02-02 14:12:30",
"path": "/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9",
"pool_namespace": "",
"state": "complete",
"type": "subvolume",
"uid": 0
}

From the mon server I was able to mount the «partition» with 

  mount -t ceph 
admin@fXXX-c0f2-11ee-9307-f7e3b9f03075.cephfs=/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9
 /mnt

but on my test client I'm unable to mount 

root@ceph-vo-m:/etc/ceph# mount -t ceph 
vo@fxxx-c0f2-11ee-9307-f7e3b9f03075.cephfs=/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9/
 /vo --verbose
parsing options: rw
source mount path was not specified
unable to parse mount source: -22
root@ceph-vo-m:/etc/ceph#

So I copy the /etc/ceph/ceph.conf on my client

Put the /etc/ceph/ceph.client.vo.keyring on my client

No firewall between the client/cluster.

The weird part is that when I run tcpdump on my client I don't see any TCP
activity.

Any way to debug this problem?

Thanks

Regards

-- 
Albert SHIH 🦫 🐸
France
Heure locale/Local time:
ven. 02 févr. 2024 16:21:01 CET
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How can I clone data from a faulty bluestore disk?

2024-02-02 Thread Carl J Taylor
Hi,
I have a small cluster with some faulty disks within it and I want to clone
the data from the faulty disks onto new ones.

The cluster is currently down and I am unable to do things like a
ceph-bluestore-tool fsck, but ceph-bluestore-tool bluefs-export does appear to
be working.

Any help would be appreciated

Many thanks
Carl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Debian 12 (bookworm) / Reef 18.2.1 problems

2024-02-02 Thread Chris Palmer

Hi Matthew

AFAIK the upgrade from quincy/deb11 to reef/deb12 is not possible:

 * The packaging problem you can work around, and a fix is pending
 * You have to upgrade both the OS and Ceph in one step
 * The MGR will not run under deb12 due to the PyO3 lack of support for
   subinterpreters.

If you do attempt an upgrade, you will end up stuck with a partially 
upgraded cluster. The MONs will be on deb12/reef and cannot be 
downgraded, and the MGR will be stuck on deb11/quincy. We have a test 
cluster in that state with no way forward or back.


I fear the MGR problem will spread as time goes on and PyO3 updates 
occur. And it's not good that it can silently corrupt in the existing 
apparently-working installations.


No-one has picked up issue 64213 that I raised yet.

I'm tempted to raise another issue for QA: the Debian 12 package cannot 
have been tested, as it just won't work either as an upgrade or as a new 
install.


Regards, Chris


On 02/02/2024 14:40, Matthew Darwin wrote:

Chris,

Thanks for all the investigations you are doing here. We're on 
quincy/debian11.  Is there any working path at this point to 
reef/debian12?  Ideally I want to go in two steps.  Upgrade ceph first 
or upgrade debian first, then do the upgrade to the other one. Most of 
our infra is already upgraded to debian 12, except ceph.


On 2024-01-29 07:27, Chris Palmer wrote:

I have logged this as https://tracker.ceph.com/issues/64213

On 16/01/2024 14:18, DERUMIER, Alexandre wrote:

Hi,


ImportError: PyO3 modules may only be initialized once per
interpreter
process

and ceph -s reports "Module 'dashboard' has failed dependency: PyO3
modules may only be initialized once per interpreter process

We have the same problem on proxmox8 (based on debian12) with ceph
quincy or reef.

It seem to be related to python version on debian12

(we have no fix for this currently)




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How can I clone data from a faulty bluestore disk?

2024-02-02 Thread Igor Fedotov

Hi Carl,

you might want to use ceph-objectstore-tool to export PGs from faulty 
OSDs and import them back to healthy ones.


The process could be quite tricky though.

There is also a pending PR (https://github.com/ceph/ceph/pull/54991) to 
make the tool more tolerant of disk errors.


The patch is worth trying in some cases, though it is not a silver bullet.

And generally, whether the recovery is doable greatly depends on the actual 
error(s).
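
A rough sketch of the export/import flow, with hypothetical OSD IDs, PG ID 
and paths (both OSDs stopped while the tool runs):

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 3.1a --op export --file /tmp/3.1a.export
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --op import --file /tmp/3.1a.export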



Thanks,

Igor

On 02/02/2024 19:03, Carl J Taylor wrote:

Hi,
I have a small cluster with some faulty disks within it and I want to clone
the data from the faulty disks onto new ones.

The cluster is currently down and I am unable to do things like
ceph-bluestore-fsck but ceph-bluestore-tool  bluefs-export does appear to
be working.

Any help would be appreciated

Many thanks
Carl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Debian 12 (bookworm) / Reef 18.2.1 problems

2024-02-02 Thread Casey Bodley
On Fri, Feb 2, 2024 at 11:21 AM Chris Palmer  wrote:
>
> Hi Matthew
>
> AFAIK the upgrade from quincy/deb11 to reef/deb12 is not possible:
>
>   * The packaging problem you can work around, and a fix is pending
>   * You have to upgrade both the OS and Ceph in one step
>   * The MGR will not run under deb12 due to the PyO3 lack of support for
> subinterpreters.
>
> If you do attempt an upgrade, you will end up stuck with a partially
> upgraded cluster. The MONs will be on deb12/reef and cannot be
> downgraded, and the MGR will be stuck on deb11/quincy, We have a test
> cluster in that state with no way forward or back.
>
> I fear the MGR problem will spread as time goes on and PyO3 updates
> occur. And it's not good that it can silently corrupt in the existing
> apparently-working installations.
>
> No-one has picked up issue 64213 that I raised yet.
>
> I'm tempted to raise another issue for qa : the debian 12 package cannot
> have been tested as it just won't work either as an upgrade or a new
> install.

you're right that the debian packages don't get tested:

https://docs.ceph.com/en/reef/start/os-recommendations/#platforms

>
> Regards, Chris
>
>
> On 02/02/2024 14:40, Matthew Darwin wrote:
> > Chris,
> >
> > Thanks for all the investigations you are doing here. We're on
> > quincy/debian11.  Is there any working path at this point to
> > reef/debian12?  Ideally I want to go in two steps.  Upgrade ceph first
> > or upgrade debian first, then do the upgrade to the other one. Most of
> > our infra is already upgraded to debian 12, except ceph.
> >
> > On 2024-01-29 07:27, Chris Palmer wrote:
> >> I have logged this as https://tracker.ceph.com/issues/64213
> >>
> >> On 16/01/2024 14:18, DERUMIER, Alexandre wrote:
> >>> Hi,
> >>>
> > ImportError: PyO3 modules may only be initialized once per
> > interpreter
> > process
> >
> > and ceph -s reports "Module 'dashboard' has failed dependency: PyO3
> > modules may only be initialized once per interpreter process
> >>> We have the same problem on proxmox8 (based on debian12) with ceph
> >>> quincy or reef.
> >>>
> >>> It seem to be related to python version on debian12
> >>>
> >>> (we have no fix for this currently)
> >>>
> >>>
> >>>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Debian 12 (bookworm) / Reef 18.2.1 problems

2024-02-02 Thread Brian Chow
Would migrating to a cephadm orchestrated docker/podman cluster be an
acceptable workaround?

We are running that config with reef containers on Debian 12 hosts, with a
couple of debian 12 clients successfully mounting cephfs mounts, using the
reef client packages directly on Debian.

On Fri, Feb 2, 2024, 8:21 AM Chris Palmer  wrote:

> Hi Matthew
>
> AFAIK the upgrade from quincy/deb11 to reef/deb12 is not possible:
>
>   * The packaging problem you can work around, and a fix is pending
>   * You have to upgrade both the OS and Ceph in one step
>   * The MGR will not run under deb12 due to the PyO3 lack of support for
> subinterpreters.
>
> If you do attempt an upgrade, you will end up stuck with a partially
> upgraded cluster. The MONs will be on deb12/reef and cannot be
> downgraded, and the MGR will be stuck on deb11/quincy, We have a test
> cluster in that state with no way forward or back.
>
> I fear the MGR problem will spread as time goes on and PyO3 updates
> occur. And it's not good that it can silently corrupt in the existing
> apparently-working installations.
>
> No-one has picked up issue 64213 that I raised yet.
>
> I'm tempted to raise another issue for qa : the debian 12 package cannot
> have been tested as it just won't work either as an upgrade or a new
> install.
>
> Regards, Chris
>
>
> On 02/02/2024 14:40, Matthew Darwin wrote:
> > Chris,
> >
> > Thanks for all the investigations you are doing here. We're on
> > quincy/debian11.  Is there any working path at this point to
> > reef/debian12?  Ideally I want to go in two steps.  Upgrade ceph first
> > or upgrade debian first, then do the upgrade to the other one. Most of
> > our infra is already upgraded to debian 12, except ceph.
> >
> > On 2024-01-29 07:27, Chris Palmer wrote:
> >> I have logged this as https://tracker.ceph.com/issues/64213
> >>
> >> On 16/01/2024 14:18, DERUMIER, Alexandre wrote:
> >>> Hi,
> >>>
> > ImportError: PyO3 modules may only be initialized once per
> > interpreter
> > process
> >
> > and ceph -s reports "Module 'dashboard' has failed dependency: PyO3
> > modules may only be initialized once per interpreter process
> >>> We have the same problem on proxmox8 (based on debian12) with ceph
> >>> quincy or reef.
> >>>
> >>> It seem to be related to python version on debian12
> >>>
> >>> (we have no fix for this currently)
> >>>
> >>>
> >>>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Debian 12 (bookworm) / Reef 18.2.1 problems

2024-02-02 Thread Chris Palmer
We have fundamental problems with the concept of cephadm and its 
direction of travel. But that's a different story.


The nub of this problem is a design incompatibility between the MGR and the 
PyO3 package that python-cryptography relies on. It's actually unsafe as 
it is, and the new package just stops you from performing the unsafe 
operations. So this affects all distributions, containers, and versions of 
Ceph. Eventually the updated PyO3 will find its way into other distributions 
and containers, bringing things to a head.


On 02/02/2024 16:45, Brian Chow wrote:

Would migrating to a cephadm orchestrated docker/podman cluster be an
acceptable workaround?

We are running that config with reef containers on Debian 12 hosts, with a
couple of debian 12 clients successfully mounting cephfs mounts, using the
reef client packages directly on Debian.

On Fri, Feb 2, 2024, 8:21 AM Chris Palmer  wrote:


Hi Matthew

AFAIK the upgrade from quincy/deb11 to reef/deb12 is not possible:

   * The packaging problem you can work around, and a fix is pending
   * You have to upgrade both the OS and Ceph in one step
   * The MGR will not run under deb12 due to the PyO3 lack of support for
 subinterpreters.

If you do attempt an upgrade, you will end up stuck with a partially
upgraded cluster. The MONs will be on deb12/reef and cannot be
downgraded, and the MGR will be stuck on deb11/quincy, We have a test
cluster in that state with no way forward or back.

I fear the MGR problem will spread as time goes on and PyO3 updates
occur. And it's not good that it can silently corrupt in the existing
apparently-working installations.

No-one has picked up issue 64213 that I raised yet.

I'm tempted to raise another issue for qa : the debian 12 package cannot
have been tested as it just won't work either as an upgrade or a new
install.

Regards, Chris


On 02/02/2024 14:40, Matthew Darwin wrote:

Chris,

Thanks for all the investigations you are doing here. We're on
quincy/debian11.  Is there any working path at this point to
reef/debian12?  Ideally I want to go in two steps.  Upgrade ceph first
or upgrade debian first, then do the upgrade to the other one. Most of
our infra is already upgraded to debian 12, except ceph.

On 2024-01-29 07:27, Chris Palmer wrote:

I have logged this as https://tracker.ceph.com/issues/64213

On 16/01/2024 14:18, DERUMIER, Alexandre wrote:

Hi,


ImportError: PyO3 modules may only be initialized once per
interpreter
process

and ceph -s reports "Module 'dashboard' has failed dependency: PyO3
modules may only be initialized once per interpreter process

We have the same problem on proxmox8 (based on debian12) with ceph
quincy or reef.

It seem to be related to python version on debian12

(we have no fix for this currently)




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unable to mount ceph

2024-02-02 Thread Albert Shih
Le 02/02/2024 à 16:34:17+0100, Albert Shih a écrit
> Hi, 
> 
> 
> A little basic question.
> 
> I created a volume with 
> 
>   ceph fs volume
> 
> then a subvolume called «erasure» I can see that with 
> 
> root@cthulhu1:/etc/ceph# ceph fs subvolume info cephfs erasure
> {
> "atime": "2024-02-02 11:02:07",
> "bytes_pcent": "undefined",
> "bytes_quota": "infinite",
> "bytes_used": 0,
> "created_at": "2024-02-02 11:02:07",
> "ctime": "2024-02-02 14:12:30",
> "data_pool": "data_ec",
> "features": [
> "snapshot-clone",
> "snapshot-autoprotect",
> "snapshot-retention"
> ],
> "gid": 0,
> "mode": 16877,
> "mon_addrs": [
> "145.238.187.184:6789",
> "145.238.187.185:6789",
> "145.238.187.186:6789",
> "145.238.187.188:6789",
> "145.238.187.187:6789"
> ],
> "mtime": "2024-02-02 14:12:30",
> "path": "/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9",
> "pool_namespace": "",
> "state": "complete",
> "type": "subvolume",
> "uid": 0
> }
> 
> From the mon server I was able to mount the «partition» with 
> 
>   mount -t ceph 
> admin@fXXX-c0f2-11ee-9307-f7e3b9f03075.cephfs=/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9
>  /mnt
> 
> but on my test client I'm unable to mount 

OK, found the problem. Not the same version of ceph-common on the server
and on the client; on the client it's the Debian package.

So not the same syntax for mounting. With

  mount.ceph 
mon1,mon2,mon3,mon4,mon5:/volumes/_nogroup/erasure/998e3bdf-f92b-4508-99ed-69f03a7303e9/vo/
 /vo -o name=vo

everything works fine.

Sorry

Regards
-- 
Albert SHIH 🦫 🐸
France
Heure locale/Local time:
ven. 02 févr. 2024 18:21:23 CET
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD read latency grows over time

2024-02-02 Thread Cory Snyder
We've seen issues with high index OSD latencies in multiple scenarios over the 
past couple of years. The issues related to rocksdb tombstones could certainly 
be relevant, but compact on deletion has been very effective for us in that 
regard. Recently, we experienced a similar issue at a higher level with the RGW 
bucket index deletion markers on versioned buckets. Do you happen to have 
versioned buckets in your cluster? If you do and the clients of those buckets 
are doing a bunch of deletes that leave behind S3 delete markers, the CLS code 
may be doing a lot of work to filter relevant entries during bucket listing ops.

Another thing that we've found is that rocksdb can become quite slow if it 
doesn't have enough memory for internal caches. As our cluster usage has grown, 
we've needed to increase OSD memory in accordance with bucket index pool usage. 
On one cluster, we found that increasing OSD memory improved rocksdb latencies 
by over 10x.
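
(In practice that usually just means raising osd_memory_target for the index 
OSDs, e.g. "ceph config set osd osd_memory_target 8589934592" for 8 GiB; the 
value is only an example, size it to your index pool.)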

Hope this helps!

Cory Snyder


From: Tobias Urdin 
Sent: Friday, February 2, 2024 5:41 AM
To: ceph-users 
Subject: [ceph-users] Re: OSD read latency grows over time 
 
I found the internal note I made about it, see below.

When we trim thousands of OMAP keys in RocksDB this calls 
SingleDelete() in the RocksDBStore in Ceph, which causes tombstones in
the RocksDB database.

These thousands of tombstones, each of which needs to be iterated over when, 
for example, reading data from the database, cause the latency
to become super high. If the OSD is restarted the issue disappears; I 
assume this is because RocksDB or the RocksDBStore in Ceph creates
a new iterator or does compaction internally upon startup.
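
A hedged aside, not something tested in this thread: if the slowdown really is
tombstone buildup that the restart clears via compaction, an explicit online
compaction of the affected OSD should achieve the same without a restart:

  # osd.<id> is the OSD holding the affected usage/omap objects
  ceph tell osd.<id> compact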

I don't see any straightforward solution without having to rebuild 
internal logic in the usage trim code. More specifically, that would mean 
investigating
using `cls_cxx_map_remove_range()` in the usage trim code, which would 
call `RocksDBStore::RocksDBTransactionImpl::rm_range_keys()` internally
instead when doing a usage trim for an epoch (--start-date and --end-date 
only, and no user or bucket).

The problem there, though, is that the `rocksdb_delete_range_threshold` 
config option defaults to 1_M, which is way more than the amount we are deleting
(yet that amount still causes the issue); that function calls `DeleteRange()` instead of 
`SingleDelete()` in RocksDB, which would create one tombstone for all entries
instead of one tombstone for every single OMAP key.

Even better than the above would be calling `rmkeys_by_prefix()` and not 
having to specify start and end, but there is no OSD op in PrimaryLogPG for that,
which means even more work that might not be backportable.

Our best bet right now, without touching radosgw-admin, is upgrading to 
>=16.2.14, which introduces 
https://github.com/ceph/ceph/pull/50894 that will
do compaction if a threshold of tombstones is hit within a sliding 
window during iteration. 
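
As a sketch of what that could look like once on a release with the backport;
the option names below are my assumption of their final form (they may carry a
different prefix in your release, so verify with `ceph config help` first), and
the 4096 trigger value is the one discussed later in this thread:

  ceph config set osd bluestore_rocksdb_cf_compact_on_deletion true
  ceph config set osd bluestore_rocksdb_cf_compact_on_deletion_trigger 4096
  # OSDs need a restart for new rocksdb options to take effect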

Best regards

> On 2 Feb 2024, at 11:29, Tobias Urdin  wrote:
> 
> Chiming in here, just so that it’s indexed in the archives.
> 
> We’ve had a lot of issues with tombstones when running RGW usage logging, and 
> when we
> trim those, the performance of the Ceph OSD hosting that usage.X object is 
> basically killed
> due to the tombstones being so many; restarting the OSD solves it.
> 
> We are not yet on Quincy, but when we are we will look into optimizing 
> rocksdb_cf_compact_on_deletion_trigger
> so that we don’t have to locate the objects, trim, and restart OSDs every time we 
> want to clean them.
> 
> Unfortunately the message on Ceph Slack is lost since it was a while back that I 
> wrote more details
> on that investigation, but IIRC the issue is that “radosgw-admin usage 
> trim” does SingleDelete() in the RocksDB layer
> when deleting objects that could be bulk deleted (RangeDelete?) due to them 
> having the same prefix (name + date). 
> 
> Best regards
> 
>> On 26 Jan 2024, at 23:18, Mark Nelson  wrote:
>> 
>> On 1/26/24 11:26, Roman Pashin wrote:
>> 
>>>> Unfortunately they cannot. You'll want to set them in centralized conf
>>>> and then restart OSDs for them to take effect.
>>>>
>>> Got it. Thank you Josh! WIll put it to config of affected OSDs and restart
>>> them.
>>> 
>>> Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger from 16384 to
>>> 4096 hurt performance of HDD OSDs in any way? I have no growing laten

[ceph-users] Re: OSD read latency grows over time

2024-02-02 Thread Anthony D'Atri
You adjusted osd_memory_target?  Higher than the default 4GB?

> 
> 
> Another thing that we've found is that rocksdb can become quite slow if it 
> doesn't have enough memory for internal caches. As our cluster usage has 
> grown, we've needed to increase OSD memory in accordance with bucket index 
> pool usage. On one cluster, we found that increasing OSD memory improved 
> rocksdb latencies by over 10x.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD read latency grows over time

2024-02-02 Thread Cory Snyder
Yes, we changed osd_memory_target to 10 GB on just our index OSDs. These OSDs 
have over 300 GB of lz4 compressed bucket index omap data. Here is a graph 
showing the latencies before/after that single change:

https://pasteboard.co/IMCUWa1t3Uau.png
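
For reference, a hedged sketch of applying a larger memory target to just the
index OSDs; the 10 GB figure is the one from this thread, the device-class mask
assumes the index OSDs are the only ones in that class, and osd.12 is only an
example id:

  # per device class
  ceph config set osd/class:nvme osd_memory_target 10737418240

  # or per OSD
  ceph config set osd.12 osd_memory_target 10737418240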

Cory Snyder


From: Anthony D'Atri 
Sent: Friday, February 2, 2024 2:15 PM
To: Cory Snyder 
Cc: ceph-users 
Subject: Re: [ceph-users] OSD read latency grows over time 
 
You adjusted osd_memory_target?  Higher than the default 4GB?



Another thing that we've found is that rocksdb can become quite slow if it 
doesn't have enough memory for internal caches. As our cluster usage has grown, 
we've needed to increase OSD memory in accordance with bucket index pool usage. 
On one cluster, we found that increasing OSD memory improved rocksdb latencies 
by over 10x.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD read latency grows over time

2024-02-02 Thread Cory Snyder
1024 PGs on NVMe.

From: Anthony D'Atri 
Sent: Friday, February 2, 2024 2:37 PM
To: Cory Snyder 
Subject: Re: [ceph-users] OSD read latency grows over time 
 
Thanks.  What type of media are your index OSDs? How many PGs?

> On Feb 2, 2024, at 2:32 PM, Cory Snyder  wrote:
> 
> Yes, we changed osd_memory_target to 10 GB on just our index OSDs. These OSDs 
> have over 300 GB of lz4 compressed bucket index omap data. Here is a graph 
> showing the latencies before/after that single change:
> 
> https://pasteboard.co/IMCUWa1t3Uau.png
> 
> Cory Snyder
> 
> 
> From: Anthony D'Atri 
> Sent: Friday, February 2, 2024 2:15 PM
> To: Cory Snyder 
> Cc: ceph-users 
> Subject: Re: [ceph-users] OSD read latency grows over time 
>  
> You adjusted osd_memory_target?  Higher than the default 4GB?
> 
> 
> 
> Another thing that we've found is that rocksdb can become quite slow if it 
> doesn't have enough memory for internal caches. As our cluster usage has 
> grown, we've needed to increase OSD memory in accordance with bucket index 
> pool usage. One one cluster, we found that increasing OSD memory improved 
> rocksdb latencies by over 10x.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How can I clone data from a faulty bluestore disk?

2024-02-02 Thread Eugen Block

Hi,

if the OSDs are deployed as LVs (by ceph-volume) you could try to do a  
pvmove to a healthy disk. There was a thread here a couple of weeks  
ago explaining the steps. I don’t have it at hand right now, but it  
should be easy to find.
Of course, there’s no guarantee that this will be successful. I also  
can’t tell if Igor’s approach is more promising.
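
For reference, a minimal sketch of the pvmove route, assuming the OSD's data LV
lives in a ceph-volume-created VG (ceph-<uuid> below) and that /dev/sdX is the
failing disk with /dev/sdY as its replacement; the earlier thread mentioned
above has the full procedure:

  pvcreate /dev/sdY                  # prepare the new disk as a physical volume
  vgextend ceph-<uuid> /dev/sdY      # add it to the OSD's volume group
  pvmove /dev/sdX /dev/sdY           # migrate all extents off the failing PV
  vgreduce ceph-<uuid> /dev/sdX      # drop the old PV from the VG once it is empty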


Zitat von Igor Fedotov :


Hi Carl,

you might want to use ceph-objectstore-tool to export PGs from  
faulty OSDs and import them back to healthy ones.


The process could be quite tricky though.
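
As a rough illustration only (OSD ids, paths and the PG id are placeholders, and
the OSDs involved must be stopped while the tool runs):

  # on the host of the failing OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<failing-id> --op list-pgs
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<failing-id> \
      --pgid <pgid> --op export --file /backup/<pgid>.export

  # on the host of a healthy OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<healthy-id> \
      --op import --file /backup/<pgid>.export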

There is also pending PR (https://github.com/ceph/ceph/pull/54991)  
to make the tool more tolerant to disk errors.


The patch is worth trying in some cases, not a silver bullet though.

And generally, whether the recovery is doable greatly depends on the 
actual error(s).



Thanks,

Igor

On 02/02/2024 19:03, Carl J Taylor wrote:

Hi,
I have a small cluster with some faulty disks within it and I want to clone
the data from the faulty disks onto new ones.

The cluster is currently down and I am unable to do things like
ceph-bluestore-fsck but ceph-bluestore-tool  bluefs-export does appear to
be working.

Any help would be appreciated

Many thanks
Carl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD read latency grows over time

2024-02-02 Thread Mark Nelson

Hi Cory,


Thanks for the excellent information here!  I'm super curious how much 
the kv cache is using in this case.  If you happen to have a dump from 
the perf counters that includes the prioritycache subsystem that would 
be ideal.  By default, onode (meta) and rocksdb (except for onodes 
stored in rocksdb) each get a first shot at 45% of the available cache 
memory at high priority, but how much they actually request depends on 
the relative ages of the items in each cache.  The age bins are defined 
in seconds.  By default:



kv: "1 2 6 24 120 720 0 0 0 0"

kv_onode: "0 0 0 0 0 0 0 0 0 720"

meta: "1 2 6 24 120 720 0 0 0 0"

data: "1 2 6 24 120 720 0 0 0 0"


and the ratios:

kv: 45%

kv_onode: 4%

meta: 45%

data: 6% (implicit)


This means that items from the kv cache, meta cache, and data caches 
that are less than 1 second old will all be competing with each other 
for memory during the first round.  kv and meta cache can each get up to 
45% of the available memory and kv_onode/data get up to 4% and 6% 
respectively.  Since kv_onode doesn't actually compete at the first 
priority level though, it won't actually request any memory.  Whatever 
memory is left after the first round (assuming there is any) will be 
divided up based on the ratios to the remaining caches that are still 
requesting memory until either there are no requests or no memory left.  
After that, the PriorityCache proceeds to the next round and does the 
same thing, this time for cache items that are between 1 and 2 seconds 
old. Then between 2 and 6 seconds old, etc.


This approach lets us have different caches compete at different 
intervals.  For instance we could have the first age-bin be 0-1 seconds 
for onodes, but 0-5 seconds for kv.  We could also make the ratios 
different.  I.e. the first bin might be for onodes that are 0-1 seconds, 
but we give them a first shot at 60% of the memory.  kv entries that are 
0-5 seconds old might all be put in the first priority bin with the 0-1 
second onodes, but we could give them say only a 30% initial shot at 
available memory (but they would still all be cached with higher 
priority than onodes that are 1-2 seconds old).


Ultimately, we might find that there are better defaults for the bins 
and ratios when the index gets big; however, typically we really want to 
cache onodes, so if we are seeing that the kv cache is fully utilizing 
its default ratio, increasing the amount of memory may indeed be warranted.
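
To see how much the kv cache is actually using, the prioritycache counters can
be pulled from a running OSD via the admin socket; the jq filter is just an
illustration and the exact counter layout may differ between releases (osd.0 is
an example id):

  ceph daemon osd.0 perf dump | jq 'with_entries(select(.key | startswith("prioritycache")))'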



Mark


On 2/2/24 12:50, Cory Snyder wrote:

We've seen issues with high index OSD latencies in multiple scenarios over the 
past couple of years. The issues related to rocksdb tombstones could certainly 
be relevant, but compact on deletion has been very effective for us in that 
regard. Recently, we experienced a similar issue at a higher level with the RGW 
bucket index deletion markers on versioned buckets. Do you happen to have 
versioned buckets in your cluster? If you do and the clients of those buckets 
are doing a bunch of deletes that leave behind S3 delete markers, the CLS code 
may be doing a lot of work to filter relevant entries during bucket listing ops.

Another thing that we've found is that rocksdb can become quite slow if it 
doesn't have enough memory for internal caches. As our cluster usage has grown, 
we've needed to increase OSD memory in accordance with bucket index pool usage. 
On one cluster, we found that increasing OSD memory improved rocksdb latencies 
by over 10x.

Hope this helps!

Cory Snyder


From: Tobias Urdin 
Sent: Friday, February 2, 2024 5:41 AM
To: ceph-users 
Subject: [ceph-users] Re: OSD read latency grows over time
  

I found the internal note I made about it, see below.

When we trim thousands of OMAP keys in RocksDB this calls 
SingleDelete() in the RocksDBStore in Ceph, which causes tombstones in
the RocksDB database.

These thousands of tombstones, each of which needs to be iterated over when, 
for example, reading data from the database, cause the latency
to become super high. If the OSD is restarted the issue disappears; I 
assume this is because RocksDB or the RocksDBStore in Ceph creates
a new iterator or does compaction internally upon startup.

I don't see any straightforward solution without having to rebuild 
internal logic in the usage trim code. More specifically, that would be 
investigating
in the usage trim code to use `cls_cxx_map_remove_range()` which would 
call `RocksDBStore::RocksDBTransactionImpl::rm_range_keys()` internally
instead when doing a usag