[ceph-users] Re: nodes with high density of OSDs

2025-04-11 Thread Alex from North
Hello Tim! First of all, thanks for the detailed answer!
Yes, probably in set up of 4 nodes by 116 OSD it looks a bit overloaded, but 
what if I have 10 nodes? Yes, nodes itself are still heavy but in a row it 
seems to be not that dramatic, no?

However, in a docu I see that it is quite common for systemd to fail on boot 
and even showed a way to escape.

```
It is common to have failures when a system is coming up online. The devices 
are sometimes not fully available and this unpredictable behavior may cause an 
OSD to not be ready to be used.

There are two configurable environment variables used to set the retry behavior:

CEPH_VOLUME_SYSTEMD_TRIES: Defaults to 30

CEPH_VOLUME_SYSTEMD_INTERVAL: Defaults to 5
```

But if where should I set these vars? If I set it as ENV vars in bashrc of root 
it doesnt seem to work as ceph starts at the boot time when root env vars are 
not active yet...
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: FS not mount after update to quincy

2025-04-11 Thread Janne Johansson
Can the client talk to the MDS on the port it listens on?

Den fre 11 apr. 2025 kl 08:59 skrev Iban Cabrillo :
>
>
>
> Hi guys Good morning,
>
>
> Since I performed the update to Quincy, I've noticed a problem that wasn't 
> present with Octopus. Currently, our Ceph cluster exports a filesystem to 
> certain nodes, which we use as a backup repository.
> The machines that mount this FS are currently running Ubuntu 24 with Ceph 
> Squid as the client version.
>
> zeus22:~ # ls -la /cephvmsfs/
> total 225986576
> drwxrwxrwx 13 root root 17 Apr  4 13:10 .
> drwxr-xr-x  1 root root   286 Mar 19 
> 13:27 ..
> -rw-r--r-- 1 root root 124998647808 Apr  4 13:18 
> arcceal9.img
> drwxrwxrwx  2 nobodynogroup2 Jul 12  2018 backup
> drwxr-xr-x 2 nobodynogroup1 Oct 18  2017 
> Default
> -rw-r--r--1 root root  214cat /etc74836480 Mar 26 
> 18:11 ns1.img
> drwxr-xr-x 2 root root   1 Aug 29  
> 2024 OnlyOffice
> Before the update, these nodes mounted the FS correctly (even cluster in 
> octopus and clients in squid), and the nodes that haven't been restarted are 
> still accessing it.
>
> One of these machines has been reinstalled, and using the same configuration 
> as the nodes that are still mounting this FS, it is unable to mount, giving 
> errors such as:
>
> `mount error: no mds (Metadata Server) is up. The cluster might be laggy, or 
> you may not be authorized`
> 10.10.3.1:3300,10.10.3.2:3300,10.10.3.3:3300:/ /cephvmsfs ceph 
> name=cephvmsfs,secretfile=/etc/ceph/cephvmsfs.secret,noatime,mds_namespace=cephvmsfs,_netdev
>  0 0
>
> If I change the port to use 6789 (v1)
>
>
> mount error 110 = Connection timed out
>
> ceph cluster is healty and msd are up
>
> cephmon01:~ # ceph -s
> cluster:
> id: 6f5a65a7-yyy---428608941dd1
> health: HEALTH_OK
>
> services:
> mon: 3 daemons, quorum cephmon01,cephmon03,cephmon02 (age 2d)
> mgr: cephmon02(active, since 7d), standbys: cephmon01, cephmon03
> mds: 1/1 daemons up, 1 standby
> osd: 231 osds: 231 up (since 7d), 231 in (since 9d)
> rgw: 2 daemons active (2 hosts, 1 zones)
>
>
>
> Cephmons are available from clients in both ports:
> zeus:~ # telnet cephmon02 6789
> Trying 10.10.3.2...
> Connected to cephmon02.
> Escape character is '^]'.
> ceph v027��
>
> Ҭ
>
> zeus01:~ # telnet cephmon02 3300
> Trying 10.10.3.2...
> Connected to cephmon02.
> Escape character is '^]'.
> ceph v2
>
>
> Any advise is welcomed, regards I
> --
>
> 
> Ibán Cabrillo Bartolomé
> Instituto de Física de Cantabria (IFCA-CSIC)
> Santander, Spain
> Tel: +34942200969/+34669930421
> Responsible for advanced computing service (RSC)
> =
> =
> All our suppliers must know and accept IFCA policy available at:
>
> https://confluence.ifca.es/display/IC/Information+Security+Policy+for+External+Suppliers
> ==
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph deployment best practice

2025-04-11 Thread gagan tiwari
Hi Anthony,
   Thanks for the reply!

We will be using  CephFS  to access  Ceph Storage from clients.  So, this
will need MDS daemon also.

So, based on your advice, I am thinking of having 4 Dell PowerEdge servers
. 3 of them will run 3 Monitor daemons and one of them  will run MDS
daemon.

These Dell Servers will have following hardware :-

1. 4 cores (  8 threads )  ( Can go for 8 core and 16 threads )

2.  64G RAM

3. 2x4T  Samsung SSD  with RA!D 1 to install OS and run monitor and
metadata services.

OSD nodes will be upgraded to have 32 cores ( 64 threads ).  Disk and RAM
will remain same ( 128G and 22X8T Samsung SSD )

Actually , I want to use OSD nodes to run OSD damons and not any
other demons and which is why I am thinking of having 4 additional Dell
servers as mentioned above.

Please advise if this plan will be better.

Thanks,
Gagan






On Wed, Apr 9, 2025 at 8:12 PM Anthony D'Atri 
wrote:

>
> >
> > We would start deploying Ceph with 4 hosts ( HP Proliant servers ) each
> > running RockyLinux 9.
> >
> > One of the hosts called ceph-adm will be smaller one and will have
> > following hardware :-
> >
> > 2x4T SSD  with raid 1 to install OS on.
> >
> > 8 Core with 3600MHz freq.
> >
> > 64G  RAM
> >
> > We are planning to run all Ceph daemons except OSD daemon like monitor ,
> > metadata ,etc on this host.
>
> 8 core == 16 threads? Are you provisioning this node because you have it
> laying around idle?
>
> Note that you will want *at least* 3 Monitor (monitors) daemons, which
> must be on different nodes.  5 is better, but at least 3. You’ll also have
> Grafana, Prometheus, MDS (if you’re going to CephFS vs using S3 object
> storage or RBD block)
>
> 8c is likely on the light side for all of that.  You would also benefit
> from not having that node be a single point of failure.  I would suggest if
> you can raising this node to the spec of the planned 3x OSD nodes so you
> have 4x equivalent nodes, and spread that non-OSD daemons across them.
>
> Note also that your OSD nodes will also have node_exporter, crash, and
> other boilerplate daemons.
>
>
> > We will have 3 hosts to run OSD which will store actual data.
> >
> > Each OSD host will have following hardware
> >
> > 2x4T SSD  with raid 1 to install OS on.
> >
> > 22X8T SSD  to store data ( OSDs ) ( without partition ). We will use
> entire
> > disk without partitions
>
> SAS, SATA, or NVMe SSDs?  Which specific model?  You really want to avoid
> client (desktop) models for Ceph, but you likely do not need to pay for
> higher endurance mixed-use SKUs.
>
> > Each OSD host will have 128G RAM  ( No swap space )
>
> Thank you for skipping swap.  Some people are really stuck in the past in
> that regard.
>
> > Each OSD host will have 16 cores.
>
> So 32 threads total?  That is very light for 22 OSDs + other daemons.  For
> HDD OSDs a common rule of thumb is at minimum 2x threads per, for SAS/SATA
> SSDs, 4, for NVMe SSDs 6.  Plus margin for the OS and other processes.
>
> > All 4 hosts will connect to each via 10G nic.
>
> Two ports with bonding? Redundant switches?
>
> > The 500T data
>
> The specs you list above include 528 TB of *raw* space.  Be advised that
> with three OSD nodes, you will necessarily be doing replication.  For
> safety replication with size=3.  Taking into consideration TB vs TiB and
> headroom, you’re looking at 133TiB of usable space.  You could go with
> size=2 to get 300TB of usable space, but at increased risk of data
> unavailability or loss when drives/hosts fail or reboot.
>
> With at least 4 OSD nodes - even if they aren’t fully populated with
> capacity drives — you could do EC for a more favorable raw:usable ratio, at
> the expense of slower writes and recovery.  With 4 nodes you could in
> theory do 2,2 EC for 200 TiB of usable space, with 5 you could do 3,2 for
> 240 TiB usable, etc.
>
> > will be accessed by the clients. We need to have
> > read performance as fast as possible.
>
> Hope your SSDs are enterprise NVMe.
>
> > We can't afford data loss and downtime.
>
> Then no size=2 for you.
>
> > So, we want to have a Ceph
> > deployment  which serves our purpose.
> >
> > So, please advise me if the plan that I have designed will serve our
> > purpose.
> > Or is there a better way , please advise that.
> >
> > Thanks,
> > Gagan
> >
> >
> >
> >
> >
> >
> > We have a HP storage server with 12 SDD of 5T each and have set-up
> hardware
> > RAID6 on these disks.
> >
> > HP storage server has 64G RAM and 18 cores.
> >
> > So, please advise how I should go about setting up Ceph on it to have
> best
> > read performance. We need fastest read performance.
> >
> >
> > Thanks,
> > Gagan
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: FS not mount after update to quincy

2025-04-11 Thread Iban Cabrillo
Hi Janne,
   yes both mds are rechable:

zeus01:~ # telnet cephmds01 6800
Trying 10.10.3.8...
Connected to cephmds01.
Escape character is '^]'.
ceph v2


zeus01:~ # telnet cephmds02 6800
Trying 10.10.3.9...
Connected to cephmds02.
Escape character is '^]'.
ceph v2

Regards, I


-- 

  Ibán Cabrillo Bartolomé
  Instituto de Física de Cantabria (IFCA-CSIC)
  Santander, Spain
  Tel: +34942200969/+34669930421
  Responsible for advanced computing service (RSC)
=
=
All our suppliers must know and accept IFCA policy available at:

https://confluence.ifca.es/display/IC/Information+Security+Policy+for+External+Suppliers
==
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: nodes with high density of OSDs

2025-04-11 Thread Janne Johansson
Den fre 11 apr. 2025 kl 09:59 skrev Anthony D'Atri :
>
> Filestore IIRC used partitions, with cute hex GPT types for various states 
> and roles.  Udev activation was sometimes problematic, and LVM tags are more 
> flexible and reliable than the prior approach.  There no doubt is more to it 
> but that’s what I recall.

Filestore used to have softlinks towards the journal device (if used)
which pointed to sdX where that X of course would jump around if you
changed the number of drives on the box, or the kernel disk detection
order changed, breaking the OSD.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: nodes with high density of OSDs

2025-04-11 Thread Tim Holloway

Hi Alex,

I think one of the scariest things about your setup is that there are 
only 4 nodes (I'm assuming that means Ceph hosts carrying OSDs). I've 
been bouncing around different configurations lately between some of my 
deployment issues and cranky old hardware and I presently am down to 4 
hosts with 1-2 OSDs per host. If even one of those hosts goes down, Ceph 
gets unhappy. If 2 are offline at once, Ceph goes into self-defense 
mode. I'd hate to think of 116 OSDs at risk on a single host.


I got curious about when LVM comes online, and I believe that the 
vgchange command that activates the LVs is actually in the initrd file 
before systemd comes up if a system was configured for LVM support. 
That's necessary, in fact, since the live root partition can be and 
often is an LV itself.


As for for systemd dependencies, that's something I've been doing a lot 
of tuning on myself, as things like my backup system won't work if 
certain volumes aren't mounted, so I've had to add "RequiresVolume" 
dependencies, plus some daemons require other daemons. So it's an 
interesting dance.


At this point I think that the best way to ensure that all LVs are 
online would be to add overrides under /etc/systemd/system/ceph.service 
(probably needs the fsid in the service name, too). Include a 
beforeStartup command that scans the proc ps list and loops until the 
vgscan process no longer show up (command completed).


But I really would reconsider both your host and OSD count. Larger OSDs 
and more hosts would give better reliability and performance.


   Tim

On 4/11/25 03:53, Alex from North wrote:

Hello Tim! First of all, thanks for the detailed answer!
Yes, probably in set up of 4 nodes by 116 OSD it looks a bit overloaded, but 
what if I have 10 nodes? Yes, nodes itself are still heavy but in a row it 
seems to be not that dramatic, no?

However, in a docu I see that it is quite common for systemd to fail on boot 
and even showed a way to escape.

```
It is common to have failures when a system is coming up online. The devices 
are sometimes not fully available and this unpredictable behavior may cause an 
OSD to not be ready to be used.

There are two configurable environment variables used to set the retry behavior:

CEPH_VOLUME_SYSTEMD_TRIES: Defaults to 30

CEPH_VOLUME_SYSTEMD_INTERVAL: Defaults to 5
```

But if where should I set these vars? If I set it as ENV vars in bashrc of root 
it doesnt seem to work as ceph starts at the boot time when root env vars are 
not active yet...
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: FS not mount after update to quincy

2025-04-11 Thread Konstantin Shalygin
Hi,

> On 11 Apr 2025, at 09:59, Iban Cabrillo  wrote:
> 
> 10.10.3.1:3300,10.10.3.2:3300,10.10.3.3:3300:/ /cephvmsfs ceph 
> name=cephvmsfs,secretfile=/etc/ceph/cephvmsfs.secret,noatime,mds_namespace=cephvmsfs,_netdev
>  0 0

Try add the ms_mode option, because you use msgr2 protocol. For example, like 
this:

noatime,ms_mode=prefer-crc,_netdev



Thanks,
k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: nodes with high density of OSDs

2025-04-11 Thread Konstantin Shalygin
Hi,

> On 11 Apr 2025, at 10:53, Alex from North  wrote:
> 
> Hello Tim! First of all, thanks for the detailed answer!
> Yes, probably in set up of 4 nodes by 116 OSD it looks a bit overloaded, but 
> what if I have 10 nodes? Yes, nodes itself are still heavy but in a row it 
> seems to be not that dramatic, no?
> 
> However, in a docu I see that it is quite common for systemd to fail on boot 
> and even showed a way to escape.

Currently we don't have 116 OSD chassis, but operate the 60 OSD chassis without 
any issues with LVM activation. We use Ceph Pacific (16.2.15)




Welcome to CentOS Stream 8 4.18.0-553.6.1.el8.x86_64
Platform:   Supermicro Super Server (X12DPi-N6)
BIOS:   1.9 (American Megatrends International, LLC. 02/07/2024)
CPU:Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz [2P/32C]
RAM:382Gi free / 628Gi total
LoadAvg:2.94 2.87 3.27
up 10 weeks, 2 hours, 36 minutes

[operator@host:/]$ scsis
Device /dev/sda [TOSHIBA MG10ACA2: ]: 0:0:0:0
Device /dev/sdb [TOSHIBA MG10ACA2: ]: 0:0:1:0
Device /dev/sdc [TOSHIBA MG10ACA2: ]: 0:0:2:0
Device /dev/sdd [TOSHIBA MG10ACA2: ]: 0:0:3:0
Device /dev/sde [TOSHIBA MG10ACA2: ]: 0:0:4:0
Device /dev/sdf [TOSHIBA MG10ACA2: ]: 0:0:5:0
Device /dev/sdg [TOSHIBA MG10ACA2: ]: 0:0:6:0
Device /dev/sdh [TOSHIBA MG10ACA2: ]: 0:0:7:0
Device /dev/sdi [TOSHIBA MG10ACA2: ]: 0:0:8:0
Device /dev/sdj [TOSHIBA MG10ACA2: ]: 0:0:9:0
Device /dev/sdk [TOSHIBA MG10ACA2: ]: 0:0:10:0
Device /dev/sdl [TOSHIBA MG10ACA2: ]: 0:0:11:0
Device /dev/sdm [TOSHIBA MG10ACA2: ]: 0:0:12:0
Device /dev/sdn [TOSHIBA MG10ACA2: ]: 0:0:13:0
Device /dev/sdo [TOSHIBA MG10ACA2: ]: 0:0:14:0
Device /dev/sdp [TOSHIBA MG10ACA2: ]: 0:0:15:0
Device /dev/sdq [TOSHIBA MG10ACA2: ]: 0:0:16:0
Device /dev/sdr [TOSHIBA MG10ACA2: ]: 0:0:17:0
Device /dev/sds [TOSHIBA MG10ACA2: ]: 0:0:18:0
Device /dev/sdt [TOSHIBA MG10ACA2: ]: 0:0:19:0
Device /dev/sdu [TOSHIBA MG10ACA2: ]: 0:0:21:0
Device /dev/sdv [TOSHIBA MG10ACA2: ]: 0:0:22:0
Device /dev/sdw [TOSHIBA MG10ACA2: ]: 0:0:23:0
Device /dev/sdx [TOSHIBA MG10ACA2: ]: 0:0:24:0
Device /dev/sdy [TOSHIBA MG10ACA2: ]: 0:0:25:0
Device /dev/sdz [TOSHIBA MG10ACA2: ]: 0:0:26:0
Device /dev/sdaa [TOSHIBA MG10ACA2: ]: 0:0:27:0
Device /dev/sdab [TOSHIBA MG10ACA2: ]: 0:0:28:0
Device /dev/sdac [TOSHIBA MG10ACA2: ]: 0:0:29:0
Device /dev/sdad [TOSHIBA MG10ACA2: ]: 0:0:30:0
Device /dev/sdae [TOSHIBA MG10ACA2: ]: 0:0:31:0
Device /dev/sdaf [TOSHIBA MG10ACA2: ]: 0:0:32:0
Device /dev/sdag [TOSHIBA MG10ACA2: ]: 0:0:33:0
Device /dev/sdah [TOSHIBA MG10ACA2: ]: 0:0:34:0
Device /dev/sdai [TOSHIBA MG10ACA2: ]: 0:0:35:0
Device /dev/sdaj [TOSHIBA MG10ACA2: ]: 0:0:36:0
Device /dev/sdak [TOSHIBA MG10ACA2: ]: 0:0:37:0
Device /dev/sdal [TOSHIBA MG10ACA2: ]: 0:0:38:0
Device /dev/sdam [TOSHIBA MG10ACA2: ]: 0:0:39:0
Device /dev/sdan [TOSHIBA MG10ACA2: ]: 0:0:40:0
Device /dev/sdao [TOSHIBA MG10ACA2: ]: 0:0:42:0
Device /dev/sdap [TOSHIBA MG10ACA2: ]: 0:0:43:0
Device /dev/sdaq [TOSHIBA MG10ACA2: ]: 0:0:44:0
Device /dev/sdar [TOSHIBA MG10ACA2: ]: 0:0:45:0
Device /dev/sdas [TOSHIBA MG10ACA2: ]: 0:0:46:0
Device /dev/sdat [TOSHIBA MG10ACA2: ]: 0:0:47:0
Device /dev/sdau [TOSHIBA MG10ACA2: ]: 0:0:48:0
Device /dev/sdav [TOSHIBA MG10ACA2: ]: 0:0:49:0
Device /dev/sdaw [TOSHIBA MG10ACA2: ]: 0:0:50:0
Device /dev/sdax [TOSHIBA MG10ACA2: ]: 0:0:51:0
Device /dev/sday [TOSHIBA MG10ACA2: ]: 0:0:52:0
Device /dev/sdaz [TOSHIBA MG10ACA2: ]: 0:0:53:0
Device /dev/sdba [TOSHIBA MG10ACA2: ]: 0:0:54:0
Device /dev/sdbb [TOSHIBA MG10ACA2: ]: 0:0:55:0
Device /dev/sdbc [TOSHIBA MG10ACA2: ]: 0:0:56:0
Device /dev/sdbd [TOSHIBA MG10ACA2: ]: 0:0:57:0
Device /dev/sdbe [TOSHIBA MG10ACA2: ]: 0:0:58:0
Device /dev/sdbf [TOSHIBA MG10ACA2: ]: 0:0:59:0
Device /dev/sdbg [TOSHIBA MG10ACA2: ]: 0:0:60:0
Device /dev/sdbh [TOSHIBA MG10ACA2: ]: 0:0:61:0

Inspected 60 SCSI devices


Good luck,
k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: v19.2.2 Squid released

2025-04-11 Thread Vladimir Sigunov
Hi All,

My upgrade 19.2.1 -> 19.2.2 was successful (8 nodes, 320 OSDs, HDD for
data, SSD for WAL/DB).
Could the issue be related to IP v6? I'm using IP v4, public network only.

Today I will test the upgrade 18.2.4 to 18.2.5 (same cluster
configuration). Will provide feedback, if needed.

SIncerely,
Vladimir.


On Thu, Apr 10, 2025 at 4:09 PM Yuri Weinstein  wrote:

> We're happy to announce the 2nd backport release in the Squid series.
>
> https://ceph.io/en/news/blog/2025/v19-2-2-squid-released/
>
> Notable Changes
> ---
> - This hotfix release resolves an RGW data loss bug when CopyObject is
> used to copy an object onto itself.
>   S3 clients typically do this when they want to change the metadata
> of an existing object.
>   Due to a regression caused by an earlier fix for
> https://tracker.ceph.com/issues/66286,
>   any tail objects associated with such objects are erroneously marked
> for garbage collection.
>   RGW deployments on Squid are encouraged to upgrade as soon as
> possible to minimize the damage.
>   The experimental rgw-gap-list tool can help to identify damaged objects.
>
> Getting Ceph
> 
> * Git at git://github.com/ceph/ceph.git
> * Tarball at https://download.ceph.com/tarballs/ceph-19.2.2.tar.gz
> * Containers at https://quay.io/repository/ceph/ceph
> * For packages, see https://docs.ceph.com/en/latest/install/get-packages/
> * Release git sha1: 0eceb0defba60152a8182f7bd87d164b639885b8
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: endless remapping after increasing number of PG in a pool

2025-04-11 Thread Michel Jouvin

Hi,

After 2 weeks, the increase of the number of PGs in an EC pool (9+6) 
from 256 PGs to 1024 completed successfully! I was still wondering if 
such a duration was expected or may be the sign of a problem...


After the previous exchanges, I restarted the increase by setting both 
pg_num and pgp_num to 1024 at the same time, after stopping the 
autoscaler for the pool. I then restarted the autoscale once 'ceph osd 
pool autoscale-status` seaid that PG_NUM was 1024. We are running 18.2.2 
and I was able to set both pg_num and pgp_num.


Best regards,

Michel

Le 01/04/2025 à 10:36, Burkhard Linke a écrit :

Hi,

On 4/1/25 10:03, Michel Jouvin wrote:

Hi Bukhard,

Thanks for your answer. Your explanation seems to match well our 
observations, in particular the fact that new misplaced objects are 
added when we fall under something like 0.5% of misplaced objects. 
What is not clear for me anyway is that 'ceph osd pool ls detail' for 
the pool modified is not reporting the new pg_num target (2048) but 
the old one (256):


pool 62 'ias-z1.rgw.buckets.data' erasure profile k9_m6_host size 15 
min_size 10 crush_rule 3 object_hash rjenkins pg_num 323 pgp_num 307 
pg_num_target 256 pgp_num_target 256 autoscale_mode off last_change 
439681 lfor 0/439680/439678 flags hashpspool,bulk max_bytes 
200 stripe_width 36864 application rgw


- Is this caused by the fact that autoscaler was still on when I 
increased the number of PG and that I disabled it on the pool ~12h 
after entering the command to extend it?


This seems to be the case. The pg(p)_target settings are the number of 
PG the pool _should_ have; pg(p)_num is the number of PGs the pool 
current has. So the cluster is not splitting PGs, but merging them. If 
you want to have 2048, you should increase it again.


There are also setting for the autoscaler, e.g. 'pg_num_min'. You can 
use them to prevent the autoscaler from switching back to 256 PGs again.




- Or was it a mistake of mine to enter only extend the pg_num and not 
the pgp_num. According to the doc that I just read again, both should 
be extended at the same time or it not causing the expected result? 
If it is the case, should I just reenter the command to extend pg_num 
and pgp_num? (and wait for the resulting remapping!)


In current ceph release only pg_num can be changed. pgp_num is 
automatically adopted.



Best regards,

Burkhard Linke

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cannot reinstate ceph fs mirror because i destroyed the ceph fs mirror peer/ target server

2025-04-11 Thread Eugen Block

Hi,

I would expect that you have a similar config-key entry:

ceph config-key ls |grep "peer/cephfs"
"cephfs/mirror/peer/cephfs/18c02021-8902-4e3f-bc17-eaf48331cc56",

Maybe removing that peer would already suffice?


Zitat von Jan Zeinstra :


Hi,
This is my first post to the forum and I don't know if it's appropriate,
but I'd like to express my gratitude to all people working hard on ceph
because I think it's a fantastic piece of software.

The problem I'm having is caused by me; we had a well working ceph fs
mirror solution; let's call it source cluster A, and target cluster B.
Source cluster A is a modest cluster consisting of 6 instances, 3 OSD
instances, and 3 mon instances. The OSD instances all have 3 disks (HDD's)
and 3 OSD demons, totalling 9 OSD daemons and 9 HDD's. Target cluster B is
a single node system having 3 OSD daemons and 3 HDD's. Both clusters run
ceph 18.2.4 reef. Both clusters use Ubuntu 22.04 as OS throughout. Both
systems are installed using cephadm.
I have destroyed cluster B, and have built it from the ground up (I made a
mistake in PG sizing in the original cluster)
Now i find i cannot create/ reinstate the mirroring between 2 ceph fs
filesystems, and i suspect there is a peer left behind in the filesystem of
the source, pointing to the now non-existent target cluster.
When i do 'ceph fs snapshot mirror peer_list prodfs', i get:
'{"f3ea4e15-6d77-4f28-aacb-9afbfe8cc1c5": {"client_name":
"client.mirror_remote", "site_name": "bk-site", "fs_name": "prodfs"}}'
When i try to delete it: 'ceph fs snapshot mirror peer_remove prodfs
f3ea4e15-6d77-4f28-aacb-9afbfe8cc1c5', i get: 'Error EACCES: failed to
remove peeraccess denied: does your client key have mgr caps? See
http://docs.ceph.com/en/latest/mgr/administrator/#client-authentication',
but the logging of the daemon points to the more likely reason of failure:

Apr 08 12:54:26 s1mon systemd[1]: Started Ceph cephfs-mirror.s1mon.lvlkwp
for d0ea284a-8a16-11ee-9232-5934f0f00ec2.
Apr 08 12:54:26 s1mon cephfs-mirror[310088]: set uid:gid to 167:167
(ceph:ceph)
Apr 08 12:54:26 s1mon cephfs-mirror[310088]: ceph version 18.2.4
(e7ad5345525c7aa95470c26863873b581076945d) reef (stable), process
cephfs-mirror, pid 2
Apr 08 12:54:26 s1mon cephfs-mirror[310088]: pidfile_write: ignore empty
--pid-file
Apr 08 12:54:26 s1mon cephfs-mirror[310088]: mgrc service_daemon_register
cephfs-mirror.22849497 metadata
{arch=x86_64,ceph_release=reef,ceph_version=ceph version 18.2.4
(e7ad5345525c7a>
Apr 08 12:54:30 s1mon cephfs-mirror[310088]:
cephfs::mirror::PeerReplayer(f3ea4e15-6d77-4f28-aacb-9afbfe8cc1c5) init:
remote monitor host=[v2:172.17.16.12:3300/0,v1:172.17.16.12:6789/0]
Apr 08 12:54:30 s1mon conmon[310082]: 2025-04-08T10:54:30.365+
7f57c51ba640 -1 monclient(hunting): handle_auth_bad_method server
allowed_methods [2] but i only support [2,1]
Apr 08 12:54:30 s1mon conmon[310082]: 2025-04-08T10:54:30.365+
7f57d81e0640 -1 cephfs::mirror::Utils connect: error connecting to bk-site:
(13) Permission denied
Apr 08 12:54:30 s1mon cephfs-mirror[310088]: cephfs::mirror::Utils connect:
error connecting to bk-site: (13) Permission denied
Apr 08 12:54:30 s1mon conmon[310082]: 2025-04-08T10:54:30.365+
7f57d81e0640 -1
cephfs::mirror::PeerReplayer(f3ea4e15-6d77-4f28-aacb-9afbfe8cc1c5) init:
error connecting to remote cl>
Apr 08 12:54:30 s1mon cephfs-mirror[310088]:
cephfs::mirror::PeerReplayer(f3ea4e15-6d77-4f28-aacb-9afbfe8cc1c5) init:
error connecting to remote cluster: (13) Permission denied
Apr 09 00:00:16 s1mon cephfs-mirror[310088]: received  signal: Hangup from
Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() )
UID: 0
Apr 09 00:00:16 s1mon conmon[310082]: 2025-04-08T22:00:16.362+
7f57d99e3640 -1 received  signal: Hangup from Kernel ( Could be generated
by pthread_kill(), raise(), abort(), alarm()>
Apr 09 00:00:16 s1mon conmon[310082]: 2025-04-08T22:00:16.386+
7f57d99e3640 -1 received  signal: Hangup from Kernel ( Could be generated
by pthread_kill(), raise(), abort(), alarm()>
Apr 09 00:00:16 s1mon cephfs-mirror[310088]: received  signal: Hangup from
Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() )
UID: 0
Apr 09 00:00:16 s1mon conmon[310082]: 2025-04-08T22:00:16.430+
7f57d99e3640 -1 received  signal: Hangup from Kernel ( Could be generated
by pthread_kill(), raise(), abort(), alarm()>
Apr 09 00:00:16 s1mon cephfs-mirror[310088]: received  signal: Hangup from
Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() )
UID: 0
Apr 09 00:00:16 s1mon conmon[310082]: 2025-04-08T22:00:16.466+
7f57d99e3640 -1 received  signal: Hangup from Kernel ( Could be generated
by pthread_kill(), raise(), abort(), alarm()>
Apr 09 00:00:16 s1mon cephfs-mirror[310088]: received  signal: Hangup from
Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() )
UID: 0
Apr 10 00:00:01 s1mon cephfs-mirror[310088]: received  signal: Hangup from
Kernel ( Could be generated by pt

[ceph-users] Re: nodes with high density of OSDs

2025-04-11 Thread Anthony D'Atri
I thought those links were to the by-uuid paths for that reason?

> On Apr 11, 2025, at 6:39 AM, Janne Johansson  wrote:
> 
> Den fre 11 apr. 2025 kl 09:59 skrev Anthony D'Atri :
>> 
>> Filestore IIRC used partitions, with cute hex GPT types for various states 
>> and roles.  Udev activation was sometimes problematic, and LVM tags are more 
>> flexible and reliable than the prior approach.  There no doubt is more to it 
>> but that’s what I recall.
> 
> Filestore used to have softlinks towards the journal device (if used)
> which pointed to sdX where that X of course would jump around if you
> changed the number of drives on the box, or the kernel disk detection
> order changed, breaking the OSD.
> 
> -- 
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: nodes with high density of OSDs

2025-04-11 Thread Anthony D'Atri


> I think one of the scariest things about your setup is that there are only 4 
> nodes (I'm assuming that means Ceph hosts carrying OSDs). I've been bouncing 
> around different configurations lately between some of my deployment issues 
> and cranky old hardware and I presently am down to 4 hosts with 1-2 OSDs per 
> host. If even one of those hosts goes down, Ceph gets unhappy. If 2 are 
> offline at once, Ceph goes into self-defense mode. I'd hate to think of 116 
> OSDs at risk on a single host.

My sense is that from a cluster perspective it’s not so much a function of the 
absolute number of OSDs that go down as the percentage of the cluster that a 
host represents.  If a cluster comprises 20x hosts each with 116 OSDs, one 
going down is only 5% of the whole.

One of the concerns is maintaining enough space to recover that many OSDs’ 
worth of data, if mon_osd_down_out_subtree_limit is not used to forestall most 
whole-host recovery.



> 
> I got curious about when LVM comes online, and I believe that the vgchange 
> command that activates the LVs is actually in the initrd file before systemd 
> comes up if a system was configured for LVM support. That's necessary, in 
> fact, since the live root partition can be and often is an LV itself.
> 
> As for for systemd dependencies, that's something I've been doing a lot of 
> tuning on myself, as things like my backup system won't work if certain 
> volumes aren't mounted, so I've had to add "RequiresVolume" dependencies, 
> plus some daemons require other daemons. So it's an interesting dance.
> 
> At this point I think that the best way to ensure that all LVs are online 
> would be to add overrides under /etc/systemd/system/ceph.service (probably 
> needs the fsid in the service name, too). Include a beforeStartup command 
> that scans the proc ps list and loops until the vgscan process no longer show 
> up (command completed).

I was thinking ExecStartPre=/bin/sleep 60 or so as an override to keep it 
simple, but feel free to get surgical.  With of course Ansible or other 
automation to persist the override for new/updated/changed hosts.

> But I really would reconsider both your host and OSD count. Larger OSDs and 
> more hosts would give better reliability and performance.

Indeed.  If such a chassis is picked due to perceived cost savings over all 
else, there is the cost of not doing the job, but moreover having only 4 
prevents the use of a reasonably wide EC profile, which probably costs more in 
Capex than having a larger number of more conventional chassis.

I think we haven’t seen the OP’s drive size, but I’ll bet that they’re already 
at least 20TB HDDs, with the usual SATA bottleneck.  Ultradense toploaders can 
also exhibit HBA and backplane saturation.

I think you may have meant “Fewer OSDs per host and more hosts”, Sir Enchanter.



> 
>Tim
> 
> On 4/11/25 03:53, Alex from North wrote:
>> Hello Tim! First of all, thanks for the detailed answer!
>> Yes, probably in set up of 4 nodes by 116 OSD it looks a bit overloaded, but 
>> what if I have 10 nodes? Yes, nodes itself are still heavy but in a row it 
>> seems to be not that dramatic, no?
>> 
>> However, in a docu I see that it is quite common for systemd to fail on boot 
>> and even showed a way to escape.
>> 
>> ```
>> It is common to have failures when a system is coming up online. The devices 
>> are sometimes not fully available and this unpredictable behavior may cause 
>> an OSD to not be ready to be used.
>> 
>> There are two configurable environment variables used to set the retry 
>> behavior:
>> 
>> CEPH_VOLUME_SYSTEMD_TRIES: Defaults to 30
>> 
>> CEPH_VOLUME_SYSTEMD_INTERVAL: Defaults to 5
>> ```
>> 
>> But if where should I set these vars? If I set it as ENV vars in bashrc of 
>> root it doesnt seem to work as ceph starts at the boot time when root env 
>> vars are not active yet...
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph deployment best practice

2025-04-11 Thread Anthony D'Atri


> On Apr 11, 2025, at 4:04 AM, gagan tiwari  
> wrote:
> 
> Hi Anthony,
>   Thanks for the reply!
> 
> We will be using  CephFS  to access  Ceph Storage from clients.  So, this
> will need MDS daemon also.

MDS is single-threaded, so unlike most Ceph daemons it benefits more from a 
high-frequency CPU than core count.

> So, based on your advice, I am thinking of having 4 Dell PowerEdge servers
> . 3 of them will run 3 Monitor daemons and one of them  will run MDS
> daemon.
> 
> These Dell Servers will have following hardware :-
> 
> 1. 4 cores (  8 threads )  ( Can go for 8 core and 16 threads )
> 
> 2.  64G RAM
> 
> 3. 2x4T  Samsung SSD  with RA!D 1 to install OS and run monitor and
> metadata services.

That probably suffices for a small cluster.  Are those Samsungs enterprise? 


> OSD nodes will be upgraded to have 32 cores ( 64 threads ).  Disk and RAM
> will remain same ( 128G and 22X8T Samsung SSD )

Which Samsung SSD?  Using client SKUs for OSDs has a way of leading to 
heartbreak.

64 threads would be better for a 22x OSD node, though still a bit light.  Are 
these SATA or NVMe?

> Actually , I want to use OSD nodes to run OSD damons and not any
> other demons and which is why I am thinking of having 4 additional Dell
> servers as mentioned above.

Colocation of daemons is common these days, especially with smaller clusters.  

> 
> Please advise if this plan will be better.

That’ll work, but unless you already have those quite-modest 4x non-OSD nodes 
sitting around idle you might consider just going with the OSD nodes and 
bumping the CPU again so you can colocate all the daemons.

> 
> Thanks,
> Gagan
> 
> 
> 
> 
> 
> 
> On Wed, Apr 9, 2025 at 8:12 PM Anthony D'Atri 
> wrote:
> 
>> 
>>> 
>>> We would start deploying Ceph with 4 hosts ( HP Proliant servers ) each
>>> running RockyLinux 9.
>>> 
>>> One of the hosts called ceph-adm will be smaller one and will have
>>> following hardware :-
>>> 
>>> 2x4T SSD  with raid 1 to install OS on.
>>> 
>>> 8 Core with 3600MHz freq.
>>> 
>>> 64G  RAM
>>> 
>>> We are planning to run all Ceph daemons except OSD daemon like monitor ,
>>> metadata ,etc on this host.
>> 
>> 8 core == 16 threads? Are you provisioning this node because you have it
>> laying around idle?
>> 
>> Note that you will want *at least* 3 Monitor (monitors) daemons, which
>> must be on different nodes.  5 is better, but at least 3. You’ll also have
>> Grafana, Prometheus, MDS (if you’re going to CephFS vs using S3 object
>> storage or RBD block)
>> 
>> 8c is likely on the light side for all of that.  You would also benefit
>> from not having that node be a single point of failure.  I would suggest if
>> you can raising this node to the spec of the planned 3x OSD nodes so you
>> have 4x equivalent nodes, and spread that non-OSD daemons across them.
>> 
>> Note also that your OSD nodes will also have node_exporter, crash, and
>> other boilerplate daemons.
>> 
>> 
>>> We will have 3 hosts to run OSD which will store actual data.
>>> 
>>> Each OSD host will have following hardware
>>> 
>>> 2x4T SSD  with raid 1 to install OS on.
>>> 
>>> 22X8T SSD  to store data ( OSDs ) ( without partition ). We will use
>> entire
>>> disk without partitions
>> 
>> SAS, SATA, or NVMe SSDs?  Which specific model?  You really want to avoid
>> client (desktop) models for Ceph, but you likely do not need to pay for
>> higher endurance mixed-use SKUs.
>> 
>>> Each OSD host will have 128G RAM  ( No swap space )
>> 
>> Thank you for skipping swap.  Some people are really stuck in the past in
>> that regard.
>> 
>>> Each OSD host will have 16 cores.
>> 
>> So 32 threads total?  That is very light for 22 OSDs + other daemons.  For
>> HDD OSDs a common rule of thumb is at minimum 2x threads per, for SAS/SATA
>> SSDs, 4, for NVMe SSDs 6.  Plus margin for the OS and other processes.
>> 
>>> All 4 hosts will connect to each via 10G nic.
>> 
>> Two ports with bonding? Redundant switches?
>> 
>>> The 500T data
>> 
>> The specs you list above include 528 TB of *raw* space.  Be advised that
>> with three OSD nodes, you will necessarily be doing replication.  For
>> safety replication with size=3.  Taking into consideration TB vs TiB and
>> headroom, you’re looking at 133TiB of usable space.  You could go with
>> size=2 to get 300TB of usable space, but at increased risk of data
>> unavailability or loss when drives/hosts fail or reboot.
>> 
>> With at least 4 OSD nodes - even if they aren’t fully populated with
>> capacity drives — you could do EC for a more favorable raw:usable ratio, at
>> the expense of slower writes and recovery.  With 4 nodes you could in
>> theory do 2,2 EC for 200 TiB of usable space, with 5 you could do 3,2 for
>> 240 TiB usable, etc.
>> 
>>> will be accessed by the clients. We need to have
>>> read performance as fast as possible.
>> 
>> Hope your SSDs are enterprise NVMe.
>> 
>>> We can't afford data loss and downtime.
>> 
>> Then no size=2 for you.
>> 
>>> So, we want to hav

[ceph-users] Re: v19.2.2 Squid released

2025-04-11 Thread Stephan Hohn
Looks like this "common/pick_address: check if address in subnet all public
address (pr#57590 , Nitzan
Mordechai)" is ipv4 only


Am Fr., 11. Apr. 2025 um 13:36 Uhr schrieb Vladimir Sigunov <
vladimir.sigu...@gmail.com>:

> Hi All,
>
> My upgrade 19.2.1 -> 19.2.2 was successful (8 nodes, 320 OSDs, HDD for
> data, SSD for WAL/DB).
> Could the issue be related to IP v6? I'm using IP v4, public network only.
>
> Today I will test the upgrade 18.2.4 to 18.2.5 (same cluster
> configuration). Will provide feedback, if needed.
>
> SIncerely,
> Vladimir.
>
>
> On Thu, Apr 10, 2025 at 4:09 PM Yuri Weinstein 
> wrote:
>
> > We're happy to announce the 2nd backport release in the Squid series.
> >
> > https://ceph.io/en/news/blog/2025/v19-2-2-squid-released/
> >
> > Notable Changes
> > ---
> > - This hotfix release resolves an RGW data loss bug when CopyObject is
> > used to copy an object onto itself.
> >   S3 clients typically do this when they want to change the metadata
> > of an existing object.
> >   Due to a regression caused by an earlier fix for
> > https://tracker.ceph.com/issues/66286,
> >   any tail objects associated with such objects are erroneously marked
> > for garbage collection.
> >   RGW deployments on Squid are encouraged to upgrade as soon as
> > possible to minimize the damage.
> >   The experimental rgw-gap-list tool can help to identify damaged
> objects.
> >
> > Getting Ceph
> > 
> > * Git at git://github.com/ceph/ceph.git
> > * Tarball at https://download.ceph.com/tarballs/ceph-19.2.2.tar.gz
> > * Containers at https://quay.io/repository/ceph/ceph
> > * For packages, see
> https://docs.ceph.com/en/latest/install/get-packages/
> > * Release git sha1: 0eceb0defba60152a8182f7bd87d164b639885b8
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph deployment best practice

2025-04-11 Thread Dominique Ramaekers
Hi Anthony,

Your statement about MDS is interesting... So it's possible depending on the 
CPU-type that read/write operations on RBD will show a better performance than 
similar read/write operations on a CephFS?

> 
> MDS is single-threaded, so unlike most Ceph daemons it benefits more from
> a high-frequency CPU than core count.
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph deployment best practice

2025-04-11 Thread Anthony D'Atri
There are a lot of variables there, including whether one uses KRBD or librbd 
for clients.

I suspect that one can’t make a blanket statement either way.

> 
> 
> Hi Anthony,
> 
> Your statement about MDS is interesting... So it's possible depending on the 
> CPU-type that read/write operations on RBD will show a better performance 
> than similar read/write operations on a CephFS?
> 
>> 
>> MDS is single-threaded, so unlike most Ceph daemons it benefits more from
>> a high-frequency CPU than core count.
>> 
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Ceph-announce] v18.2.5 Reef released

2025-04-11 Thread Stephan Hohn
Ok the two issues I see with reef release v18.2.5

- Subnet check seems to be ipv4 only which leads to e.g "public address is
not in 'fd01:1:f00f:443::/64' subnet" warnings on ipv6 only clusters.


   -

   common/pick_address: check if address in subnet all public address (
   pr#57590 , Nitzan Mordechai)
   -

   osd: Report health error if OSD public address is not within subnet (
   pr#55697 , Prashant D)

- cryptsetup version check isn't working at least in the container image of
v18.2.5 (
https://github.com/ceph/ceph/blob/reef/src/ceph-volume/ceph_volume/util/encryption.py)
which leads to encrypted osds not starting due to "'Error while checking
cryptsetup version.\n', '`cryptsetup --version` output:\n', 'cryptsetup
2.7.2 flags: UDEV BLKID KEYRING FIPS KERNEL_CAPI PWQUALITY '"

Happy to help with logs etc.

BR

Stephan



Am Fr., 11. Apr. 2025 um 09:11 Uhr schrieb Stephan Hohn <
step...@gridscale.io>:

> Hi all,
>
> started an update on our staging cluster from v18.2.4 --> v18.2.5
>
> ~# ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.5 Mons and Mgr
> went fine but osds not coming up with v18.2.5 Apr 11 06:59:56
> 0cc47a6df14e podman[263290]: 2025-04-11 06:59:56.697993041 + UTC
> m=+0.057869056 image pull  quay.io/ceph/ceph:v18.2.5
> Apr 11 06:59:56 0cc47a6df14e podman[263290]: 2025-04-11 06:59:56.778833855
> + UTC m=+0.138709869 container init
> 5db97f7e32705cc0e8fee1bc5741dfbd97ffa430b8fb5a1cfe19b768aed78b23 (image=
> quay.io/ceph/ceph:v18.2.5,
> name=ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853-osd-20, CEPH_GIT_REPO=
> https://github.com/ceph/ceph.git, OSD_FLAVOR=default,
> org.label-schema.schema-version=1.0, GANESHA_REPO_BASEURL=
> https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/,
> org.opencontainers.image.documentation=https://docs.ceph.com/,
> CEPH_REF=reef, org.label-schema.vendor=CentOS, ceph=True,
> org.label-schema.name=CentOS Stream 9 Base Image,
> io.buildah.version=1.39.3,
> CEPH_SHA1=a5b0e13f9c96f3b45f596a95ad098f51ca0ccce1, FROM_IMAGE=
> quay.io/centos/centos:stream9, org.opencontainers.image.authors=Ceph
> Release Team , org.label-schema.license=GPLv2,
> org.label-schema.build-date=20250325)
> Apr 11 06:59:56 0cc47a6df14e podman[263290]: 2025-04-11 06:59:56.790749299
> + UTC m=+0.150625308 container start
> 5db97f7e32705cc0e8fee1bc5741dfbd97ffa430b8fb5a1cfe19b768aed78b23 (image=
> quay.io/ceph/ceph:v18.2.5,
> name=ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853-osd-20,
> org.label-schema.build-date=20250325, org.opencontainers.image.authors=Ceph
> Release Team , org.label-schema.license=GPLv2,
> org.label-schema.schema-version=1.0, ceph=True, CEPH_REF=reef,
> CEPH_GIT_REPO=https://github.com/ceph/ceph.git, OSD_FLAVOR=default,
> org.label-schema.name=CentOS Stream 9 Base Image,
> org.opencontainers.image.documentation=https://docs.ceph.com/,
> GANESHA_REPO_BASEURL=
> https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/,
> io.buildah.version=1.39.3,
> CEPH_SHA1=a5b0e13f9c96f3b45f596a95ad098f51ca0ccce1, FROM_IMAGE=
> quay.io/centos/centos:stream9, org.label-schema.vendor=CentOS)
> Apr 11 06:59:56 0cc47a6df14e bash[263290]:
> 5db97f7e32705cc0e8fee1bc5741dfbd97ffa430b8fb5a1cfe19b768aed78b23
> Apr 11 06:59:56 0cc47a6df14e
> ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853-osd-20[263380]:
> 2025-04-11T06:59:56.817+ 7b72d0abc740 -1
> bluestore(/var/lib/ceph/osd/ceph-20/block) _read_bdev_label failed to open
> /var/lib/ceph/osd/ceph-20/block: (2) No such file or directory
> Apr 11 06:59:56 0cc47a6df14e
> ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853-osd-20[263380]:
> 2025-04-11T06:59:56.817+ 7b72d0abc740 -1  ** ERROR: unable to open OSD
> superblock on /var/lib/ceph/osd/ceph-20: (2) No such file or directory
> Apr 11 06:59:56 0cc47a6df14e systemd[1]: Started Ceph osd.20 for
> 03977a23-f00f-4bb0-b9a7-de57f40ba853.
> Apr 11 06:59:56 0cc47a6df14e podman[263399]: 2025-04-11 06:59:56.90105365
> + UTC m=+0.076310419 container died
> 5db97f7e32705cc0e8fee1bc5741dfbd97ffa430b8fb5a1cfe19b768aed78b23 (image=
> quay.io/ceph/ceph:v18.2.5,
> name=ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853-osd-20)
> Apr 11 06:59:56 0cc47a6df14e podman[263399]: 2025-04-11 06:59:56.948423169
> + UTC m=+0.123679914 container remove
> 5db97f7e32705cc0e8fee1bc5741dfbd97ffa430b8fb5a1cfe19b768aed78b23 (image=
> quay.io/ceph/ceph:v18.2.5,
> name=ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853-osd-20, ceph=True,
> io.buildah.version=1.39.3, org.label-schema.name=CentOS Stream 9 Base
> Image, org.opencontainers.image.authors=Ceph Release Team <
> ceph-maintain...@ceph.io>, CEPH_REF=reef, GANESHA_REPO_BASEURL=
> https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/,
> CEPH_SHA1=a5b0e13f9c96f3b45f596a95ad098f51ca0ccce1, FROM_IMAGE=
> quay.io/centos/centos:stream9, OSD_FLAVOR=default,
> org.label-schema.build-date=20250325, org.label-schema.v

[ceph-users] Re: ceph deployment best practice

2025-04-11 Thread gagan tiwari
Hi Anthony,
 We will be using Samsung SSD 870 QVO 8TB disks on
all OSD servers.

One more thing , I want to know is that CephFS supports  mounting with
FsCache on clients ?  500T data stored in the cluster will be accessed by
the jobs running on the clients nodes and we need super fast read
performance. For that we do have additional cache disk installed on all the
clients nodes. And the way NFS V4 supports mount NFS share with FsCache on
clients' hosts ,CephFS also supports that.

On those  4x non-OSD nodes, I will probably run ldap and HTCondor service.
But mds node will not be used for anything other than mds daemon.

Thanks,
Gagan



On Fri, Apr 11, 2025 at 8:45 PM Anthony D'Atri 
wrote:

>
>
> > On Apr 11, 2025, at 4:04 AM, gagan tiwari <
> gagan.tiw...@mathisys-india.com> wrote:
> >
> > Hi Anthony,
> >   Thanks for the reply!
> >
> > We will be using  CephFS  to access  Ceph Storage from clients.  So, this
> > will need MDS daemon also.
>
> MDS is single-threaded, so unlike most Ceph daemons it benefits more from
> a high-frequency CPU than core count.
>
> > So, based on your advice, I am thinking of having 4 Dell PowerEdge
> servers
> > . 3 of them will run 3 Monitor daemons and one of them  will run MDS
> > daemon.
> >
> > These Dell Servers will have following hardware :-
> >
> > 1. 4 cores (  8 threads )  ( Can go for 8 core and 16 threads )
> >
> > 2.  64G RAM
> >
> > 3. 2x4T  Samsung SSD  with RA!D 1 to install OS and run monitor and
> > metadata services.
>
> That probably suffices for a small cluster.  Are those Samsungs
> enterprise?
>
>
> > OSD nodes will be upgraded to have 32 cores ( 64 threads ).  Disk and RAM
> > will remain same ( 128G and 22X8T Samsung SSD )
>
> Which Samsung SSD?  Using client SKUs for OSDs has a way of leading to
> heartbreak.
>
> 64 threads would be better for a 22x OSD node, though still a bit light.
> Are these SATA or NVMe?
>
> > Actually , I want to use OSD nodes to run OSD damons and not any
> > other demons and which is why I am thinking of having 4 additional Dell
> > servers as mentioned above.
>
> Colocation of daemons is common these days, especially with smaller
> clusters.
>
> >
> > Please advise if this plan will be better.
>
> That’ll work, but unless you already have those quite-modest 4x non-OSD
> nodes sitting around idle you might consider just going with the OSD nodes
> and bumping the CPU again so you can colocate all the daemons.
>
> >
> > Thanks,
> > Gagan
> >
> >
> >
> >
> >
> >
> > On Wed, Apr 9, 2025 at 8:12 PM Anthony D'Atri 
> > wrote:
> >
> >>
> >>>
> >>> We would start deploying Ceph with 4 hosts ( HP Proliant servers ) each
> >>> running RockyLinux 9.
> >>>
> >>> One of the hosts called ceph-adm will be smaller one and will have
> >>> following hardware :-
> >>>
> >>> 2x4T SSD  with raid 1 to install OS on.
> >>>
> >>> 8 Core with 3600MHz freq.
> >>>
> >>> 64G  RAM
> >>>
> >>> We are planning to run all Ceph daemons except OSD daemon like monitor
> ,
> >>> metadata ,etc on this host.
> >>
> >> 8 core == 16 threads? Are you provisioning this node because you have it
> >> laying around idle?
> >>
> >> Note that you will want *at least* 3 Monitor (monitors) daemons, which
> >> must be on different nodes.  5 is better, but at least 3. You’ll also
> have
> >> Grafana, Prometheus, MDS (if you’re going to CephFS vs using S3 object
> >> storage or RBD block)
> >>
> >> 8c is likely on the light side for all of that.  You would also benefit
> >> from not having that node be a single point of failure.  I would
> suggest if
> >> you can raising this node to the spec of the planned 3x OSD nodes so you
> >> have 4x equivalent nodes, and spread that non-OSD daemons across them.
> >>
> >> Note also that your OSD nodes will also have node_exporter, crash, and
> >> other boilerplate daemons.
> >>
> >>
> >>> We will have 3 hosts to run OSD which will store actual data.
> >>>
> >>> Each OSD host will have following hardware
> >>>
> >>> 2x4T SSD  with raid 1 to install OS on.
> >>>
> >>> 22X8T SSD  to store data ( OSDs ) ( without partition ). We will use
> >> entire
> >>> disk without partitions
> >>
> >> SAS, SATA, or NVMe SSDs?  Which specific model?  You really want to
> avoid
> >> client (desktop) models for Ceph, but you likely do not need to pay for
> >> higher endurance mixed-use SKUs.
> >>
> >>> Each OSD host will have 128G RAM  ( No swap space )
> >>
> >> Thank you for skipping swap.  Some people are really stuck in the past
> in
> >> that regard.
> >>
> >>> Each OSD host will have 16 cores.
> >>
> >> So 32 threads total?  That is very light for 22 OSDs + other daemons.
> For
> >> HDD OSDs a common rule of thumb is at minimum 2x threads per, for
> SAS/SATA
> >> SSDs, 4, for NVMe SSDs 6.  Plus margin for the OS and other processes.
> >>
> >>> All 4 hosts will connect to each via 10G nic.
> >>
> >> Two ports with bonding? Redundant switches?
> >>
> >>> The 500T data
> >>
> >> The specs you list above incl

[ceph-users] Re: Experience with 100G Ceph in Proxmox

2025-04-11 Thread Giovanna Ratini

Hello Eneko,

I switched to KRDB, and I’m seeing slightly better performance now.

For Switching: 
https://forum.proxmox.com/threads/how-to-safely-enable-krbd-in-a-5-node-production-environment-running-7-4-19.159186/


NVMe performance remains disappointing, though...
They went from 35MB/s to 45MB/s.

I’m planning to apply the change that Anthony recommended:
setting mon_target_pg_per_osd to 250 and configuring 2 osds_per_device.
This will take a bit of time.
ceph config set global mon_target_pg_per_osd 250
ceph config set global osds_per_device 2

To split the drives into 2 OSDs each,
I’ll need to update the ceph orch ls --export OSD service spec,

then zap an existing OSD, allow it to be rebuilt as two, and repeat the 
process for the remaining ones.


We'll see if this change helps. I’ll write the results here once it's done.

Cheers,

Gio

root@gitlab:~# fio --name=registry-read --ioengine=libaio --rw=randread 
--bs=4k --numjobs=4 --iodepth=16 --size=1G --runtime=60


registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, 
(T) 4096B-4096B, ioengine=libaio, iodepth=16

...
fio-3.33
Starting 4 processes
Jobs: 4 (f=4): [r(4)][100.0%][r=91.7MiB/s][r=23.5k IOPS][eta 00m:00s]
registry-read: (groupid=0, jobs=1): err= 0: pid=2547: Fri Apr 11 
21:02:31 2025

  read: IOPS=2756, BW=10.8MiB/s (11.3MB/s)(646MiB/60001msec)
    slat (usec): min=50, max=8619, avg=360.14, stdev=217.84
    clat (usec): min=2, max=17259, avg=5441.99, stdev=1633.01
 lat (usec): min=108, max=17721, avg=5802.13, stdev=1728.71
    clat percentiles (usec):
 |  1.00th=[ 1909],  5.00th=[ 2507], 10.00th=[ 2966], 20.00th=[ 3818],
 | 30.00th=[ 4621], 40.00th=[ 5342], 50.00th=[ 5932], 60.00th=[ 6259],
 | 70.00th=[ 6456], 80.00th=[ 6718], 90.00th=[ 6980], 95.00th=[ 7308],
 | 99.00th=[ 9241], 99.50th=[10290], 99.90th=[13173], 99.95th=[13698],
 | 99.99th=[16450]
   bw (  KiB/s): min= 8456, max=22296, per=24.64%, avg=10937.08, 
stdev=3222.24, samples=119
   iops    : min= 2114, max= 5574, avg=2734.27, stdev=805.56, 
samples=119

  lat (usec)   : 4=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=1.33%, 4=20.70%, 10=77.32%, 20=0.65%
  cpu  : usr=0.78%, sys=6.75%, ctx=165432, majf=0, minf=27
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, 
>=64=0.0%
 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, 
>=64=0.0%

 issued rwts: total=165408,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=16
registry-read: (groupid=0, jobs=1): err= 0: pid=2548: Fri Apr 11 
21:02:31 2025

  read: IOPS=2807, BW=11.0MiB/s (11.5MB/s)(658MiB/60001msec)
    slat (usec): min=50, max=8950, avg=353.61, stdev=213.68
    clat (usec): min=2, max=17110, avg=5344.32, stdev=1642.90
 lat (usec): min=93, max=17575, avg=5697.93, stdev=1740.41
    clat percentiles (usec):
 |  1.00th=[ 1844],  5.00th=[ 2409], 10.00th=[ 2868], 20.00th=[ 3687],
 | 30.00th=[ 4490], 40.00th=[ 5276], 50.00th=[ 5866], 60.00th=[ 6194],
 | 70.00th=[ 6390], 80.00th=[ 6587], 90.00th=[ 6915], 95.00th=[ 7242],
 | 99.00th=[ 8979], 99.50th=[10159], 99.90th=[13042], 99.95th=[13829],
 | 99.99th=[15926]
   bw (  KiB/s): min= 8536, max=23624, per=25.10%, avg=11138.08, 
stdev=3441.69, samples=119
   iops    : min= 2134, max= 5906, avg=2784.52, stdev=860.42, 
samples=119

  lat (usec)   : 4=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=1.80%, 4=22.21%, 10=75.40%, 20=0.58%
  cpu  : usr=0.98%, sys=6.72%, ctx=168450, majf=0, minf=25
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, 
>=64=0.0%
 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, 
>=64=0.0%

 issued rwts: total=168432,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=16
registry-read: (groupid=0, jobs=1): err= 0: pid=2549: Fri Apr 11 
21:02:31 2025

  read: IOPS=2773, BW=10.8MiB/s (11.4MB/s)(650MiB/60001msec)
    slat (usec): min=46, max=8246, avg=357.89, stdev=213.33
    clat (usec): min=2, max=19652, avg=5408.19, stdev=1641.03
 lat (usec): min=411, max=20124, avg=5766.08, stdev=1738.36
    clat percentiles (usec):
 |  1.00th=[ 1909],  5.00th=[ 2474], 10.00th=[ 2933], 20.00th=[ 3752],
 | 30.00th=[ 4555], 40.00th=[ 5342], 50.00th=[ 5932], 60.00th=[ 6259],
 | 70.00th=[ 6456], 80.00th=[ 6652], 90.00th=[ 6980], 95.00th=[ 7242],
 | 99.00th=[ 9110], 99.50th=[10421], 99.90th=[12911], 99.95th=[14353],
 | 99.99th=[16909]
   bw (  KiB/s): min= 8432, max=22520, per=24.79%, avg=11004.77, 
stdev=3330.83, samples=119
   iops    : min= 2108, max= 5630, avg=2751.19, stdev=832.71, 
samples=119

  lat (usec)   : 4=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat 

[ceph-users] Re: nodes with high density of OSDs

2025-04-11 Thread Tim Holloway
I just checked an OSD and the "block" entry is indeed linked to storage 
using a /dev/mapper uuid LV, not a /dev/device. When ceph builds an 
LV-based OSD, it creates a VG whose name is "ceph-<uuid>", where "<uuid>" 
is a UUID, and an LV named "osd-block-<uuid>", where "<uuid>" is also a 
UUID. So although you'd map the OSD to something like /dev/vdb in a VM, 
the actual name ceph uses is UUID-based (and LVM-based) and thus not 
subject to change with alterations in the hardware, as the UUIDs are part 
of the metadata in the VGs and LVs created by ceph.
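
A quick way to see this on a node is with the standard LVM and ceph-volume 
tooling (exact names and tags will of course vary per cluster):

```
# VGs created by ceph-volume are named ceph-<uuid>
vgs
# the OSD LVs carry ceph's metadata as LVM tags (e.g. ceph.osd_id, ceph.osd_fsid)
lvs -o lv_name,vg_name,lv_tags
# ceph-volume prints the same mapping in a friendlier form;
# on a cephadm-managed host run it as: cephadm ceph-volume lvm list
ceph-volume lvm list
```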


Since I got that from a VM, I can't vouch for all cases, but I thought 
it especially interesting that ceph was creating LVM counterparts even 
for devices that were not themselves LVM-based.


And yeah, I understand that it's the amount of OSD replicate data that 
counts more than the number of hosts, but when an entire host goes down 
and there are few hosts, that can take a large bite out of the replicas.


   Tim

On 4/11/25 10:36, Anthony D'Atri wrote:

I thought those links were to the by-uuid paths for that reason?


On Apr 11, 2025, at 6:39 AM, Janne Johansson  wrote:

Den fre 11 apr. 2025 kl 09:59 skrev Anthony D'Atri :

Filestore IIRC used partitions, with cute hex GPT types for various states and 
roles.  Udev activation was sometimes problematic, and LVM tags are more 
flexible and reliable than the prior approach.  There no doubt is more to it 
but that’s what I recall.

Filestore used to have softlinks towards the journal device (if used)
which pointed to sdX where that X of course would jump around if you
changed the number of drives on the box, or the kernel disk detection
order changed, breaking the OSD.

--
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: nodes with high density of OSDs

2025-04-11 Thread Anthony D'Atri
Filestore, pre-ceph-volume may have been entirely different.  IIRC LVM is used 
these days to exploit persistent metadata tags.

> On Apr 11, 2025, at 4:03 PM, Tim Holloway  wrote:
> 
> I just checked an OSD and the "block" entry is indeed linked to storage using 
> a /dev/mapper uuid LV, not a /dev/device. When ceph builds an LV-based OSD, 
> it creates a VG whose name is "ceph-<uuid>", where "<uuid>" is a UUID, and an LV 
> named "osd-block-<uuid>", where "<uuid>" is also a UUID. So although you'd map 
> the osd to something like /dev/vdb in a VM, the actual name ceph uses is 
> uuid-based (and lvm-based) and thus not subject to change with alterations in 
> the hardware, as the uuids are part of the metadata in VGs and LVs created by 
> ceph.
> 
> Since I got that from a VM, I can't vouch for all cases, but I thought it 
> especially interesting that ceph was creating LVM counterparts even for 
> devices that were not themselves LVM-based.
> 
> And yeah, I understand that it's the amount of OSD replicate data that counts 
> more than the number of hosts, but when an entire host goes down and there 
> are few hosts, that can take a large bite out of the replicas.
> 
>Tim
> 
> On 4/11/25 10:36, Anthony D'Atri wrote:
>> I thought those links were to the by-uuid paths for that reason?
>> 
>>> On Apr 11, 2025, at 6:39 AM, Janne Johansson  wrote:
>>> 
>>> Den fre 11 apr. 2025 kl 09:59 skrev Anthony D'Atri 
>>> :
 Filestore IIRC used partitions, with cute hex GPT types for various states 
 and roles.  Udev activation was sometimes problematic, and LVM tags are 
 more flexible and reliable than the prior approach.  There no doubt is 
 more to it but that’s what I recall.
>>> Filestore used to have softlinks towards the journal device (if used)
>>> which pointed to sdX where that X of course would jump around if you
>>> changed the number of drives on the box, or the kernel disk detection
>>> order changed, breaking the OSD.
>>> 
>>> -- 
>>> May the most significant bit of your life be positive.
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Experience with 100G Ceph in Proxmox

2025-04-11 Thread Anthony D'Atri
Please do let me know if that strategy works out.  When you change an osd_spec, 
out of an abundance of caution it won’t be retroactively applied to existing 
OSDs, which can be exploited for migrations.
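
For reference, a skeleton of the kind of OSD service spec change being 
described might look roughly like this (the service_id and device filter 
below are made up for illustration; osds_per_device is the drive-group 
field that splits each device into multiple OSDs):

```
service_type: osd
service_id: nvme_split            # hypothetical name
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 0                 # only flash devices
  osds_per_device: 2              # create two OSDs per physical device
```

Applying the edited spec (ceph orch apply -i osd-spec.yaml) only affects 
OSDs created afterwards, so each existing OSD has to be removed and zapped 
(e.g. ceph orch osd rm <id> --zap) before the orchestrator redeploys it as two.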

> On Apr 11, 2025, at 3:29 PM, Giovanna Ratini 
>  wrote:
> 
> Hello Eneko,
> 
> I switched to KRBD, and I'm seeing slightly better performance now.
> 
> For Switching: 
> https://forum.proxmox.com/threads/how-to-safely-enable-krbd-in-a-5-node-production-environment-running-7-4-19.159186/
> 
> NVMe performance remains disappointing, though...
> They went from 35MB/s to 45MB/s.
> 
> I’m planning to apply the change that Anthony recommended:
> setting mon_target_pg_per_osd to 250 and configuring 2 osds_per_device.
> This will take a bit of time.
> ceph config set global mon_target_pg_per_osd 250
> ceph config set global osds_per_device 2
> 
> To split the drives into 2 OSDs each,
> I’ll need to update the ceph orch ls --export OSD service spec,
> 
> then zap an existing OSD, allow it to be rebuilt as two, and repeat the 
> process for the remaining ones.
> 
> We'll see if this change helps. I’ll write the results here once it's done.
> 
> Cheers,
> 
> Gio
> 
> root@gitlab:~# fio --name=registry-read --ioengine=libaio --rw=randread 
> --bs=4k --numjobs=4 --iodepth=16 --size=1G --runtime=60
> 
> registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
> 4096B-4096B, ioengine=libaio, iodepth=16
> ...
> fio-3.33
> Starting 4 processes
> Jobs: 4 (f=4): [r(4)][100.0%][r=91.7MiB/s][r=23.5k IOPS][eta 00m:00s]
> registry-read: (groupid=0, jobs=1): err= 0: pid=2547: Fri Apr 11 21:02:31 2025
>   read: IOPS=2756, BW=10.8MiB/s (11.3MB/s)(646MiB/60001msec)
> slat (usec): min=50, max=8619, avg=360.14, stdev=217.84
> clat (usec): min=2, max=17259, avg=5441.99, stdev=1633.01
>  lat (usec): min=108, max=17721, avg=5802.13, stdev=1728.71
> clat percentiles (usec):
>  |  1.00th=[ 1909],  5.00th=[ 2507], 10.00th=[ 2966], 20.00th=[ 3818],
>  | 30.00th=[ 4621], 40.00th=[ 5342], 50.00th=[ 5932], 60.00th=[ 6259],
>  | 70.00th=[ 6456], 80.00th=[ 6718], 90.00th=[ 6980], 95.00th=[ 7308],
>  | 99.00th=[ 9241], 99.50th=[10290], 99.90th=[13173], 99.95th=[13698],
>  | 99.99th=[16450]
>bw (  KiB/s): min= 8456, max=22296, per=24.64%, avg=10937.08, 
> stdev=3222.24, samples=119
>iops: min= 2114, max= 5574, avg=2734.27, stdev=805.56, samples=119
>   lat (usec)   : 4=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
>   lat (msec)   : 2=1.33%, 4=20.70%, 10=77.32%, 20=0.65%
>   cpu  : usr=0.78%, sys=6.75%, ctx=165432, majf=0, minf=27
>   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>  issued rwts: total=165408,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>  latency   : target=0, window=0, percentile=100.00%, depth=16
> registry-read: (groupid=0, jobs=1): err= 0: pid=2548: Fri Apr 11 21:02:31 2025
>   read: IOPS=2807, BW=11.0MiB/s (11.5MB/s)(658MiB/60001msec)
> slat (usec): min=50, max=8950, avg=353.61, stdev=213.68
> clat (usec): min=2, max=17110, avg=5344.32, stdev=1642.90
>  lat (usec): min=93, max=17575, avg=5697.93, stdev=1740.41
> clat percentiles (usec):
>  |  1.00th=[ 1844],  5.00th=[ 2409], 10.00th=[ 2868], 20.00th=[ 3687],
>  | 30.00th=[ 4490], 40.00th=[ 5276], 50.00th=[ 5866], 60.00th=[ 6194],
>  | 70.00th=[ 6390], 80.00th=[ 6587], 90.00th=[ 6915], 95.00th=[ 7242],
>  | 99.00th=[ 8979], 99.50th=[10159], 99.90th=[13042], 99.95th=[13829],
>  | 99.99th=[15926]
>bw (  KiB/s): min= 8536, max=23624, per=25.10%, avg=11138.08, 
> stdev=3441.69, samples=119
>iops: min= 2134, max= 5906, avg=2784.52, stdev=860.42, samples=119
>   lat (usec)   : 4=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
>   lat (usec)   : 1000=0.01%
>   lat (msec)   : 2=1.80%, 4=22.21%, 10=75.40%, 20=0.58%
>   cpu  : usr=0.98%, sys=6.72%, ctx=168450, majf=0, minf=25
>   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>  issued rwts: total=168432,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>  latency   : target=0, window=0, percentile=100.00%, depth=16
> registry-read: (groupid=0, jobs=1): err= 0: pid=2549: Fri Apr 11 21:02:31 2025
>   read: IOPS=2773, BW=10.8MiB/s (11.4MB/s)(650MiB/60001msec)
> slat (usec): min=46, max=8246, avg=357.89, stdev=213.33
> clat (usec): min=2, max=19652, avg=5408.19, stdev=1641.03
>  lat (usec): min=411, max=20124, avg=5766.08, stdev=1738.36
> clat percentiles (usec):
>  |  1.00th=[ 1909],  5.00th=[ 2474], 10.00th=[ 2933], 20.00th=[ 3752],
>  | 30.00th=[ 4555], 40.00th=[ 5342], 50.00th=[ 

[ceph-users] Re: ceph deployment best practice

2025-04-11 Thread Anthony D'Atri


> 
> Hi Anthony,
> We will be using Samsung SSD 870 QVO 8TB disks on
> all OSD servers.

Your choices are yours to make, but for what it’s worth, I would not use these.

* They are client-class, not designed for enterprise workloads or duty cycle
* Best I can tell this SKU lacks PLP (power loss protection), which can result 
in corrupted or lost data
* QLC can be just smurfy for object storage workloads that are read-mostly, but 
can be disappointing for RBD or small objects/files
* 3 year warranty instead of the 5 years typical for enterprise SKUs
* Slow writes once the SLC cache portion fills; this drive is designed for 
intermittent desktop workloads, not sustained enterprise workloads.
* Rated endurance for a 4KB random write workload is ~0.33 DWPD over the 3-year 
warranty period, which prorated over the 5-year warranty typical of enterprise 
SKUs works out to ~0.20 DWPD (see the quick arithmetic below).
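
Back of the envelope, assuming the commonly published 2,880 TBW rating for 
the 8 TB SKU:

```
2880 TB / (8 TB x 365 days x 3 years) ~= 0.33 DWPD   (vendor 3-year warranty window)
2880 TB / (8 TB x 365 days x 5 years) ~= 0.20 DWPD   (typical enterprise 5-year window)
```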

If you expect a low write workload and have VERY limited performance 
expectations, maybe they’d work for you, but especially don’t think you can 
safely do replication size=2 or EC 2/3+1.  A few months ago someone in the 
community unilaterally sent me money *begging* me to make their cluster of 
these faster.  Nothing I could do short of recommending that they be replaced 
with a more appropriate SKU.


> One more thing: I want to know whether CephFS supports mounting with
> FsCache on clients?

I find some references on the net to people doing this, but have zero 
experience with it.

> 500T of data stored in the cluster will be accessed by
> the jobs running on the client nodes, and we need super fast read
> performance.

Client-class media are incompatible with super fast anything.  I don’t recall 
you mentioning the network — bonded 10GE at least?

> For that we do have an additional cache disk installed on all the
> client nodes. And, the way NFS v4 supports mounting an NFS share with FsCache
> on the client hosts, does CephFS also support that?

You would do better to invest in enterprise cluster tech than in band-aids that 
may or may not work well.


{Good,Fast,Cheap} Pick Any Two.

Trite but so often true.

> 
> On those  4x non-OSD nodes, I will probably run ldap and HTCondor service.
> But mds node will not be used for anything other than mds daemon.
> 
> Thanks,
> Gagan
> 
> 
> 
> On Fri, Apr 11, 2025 at 8:45 PM Anthony D'Atri 
> wrote:
> 
>> 
>> 
>>> On Apr 11, 2025, at 4:04 AM, gagan tiwari <
>> gagan.tiw...@mathisys-india.com> wrote:
>>> 
>>> Hi Anthony,
>>>  Thanks for the reply!
>>> 
>>> We will be using  CephFS  to access  Ceph Storage from clients.  So, this
>>> will need MDS daemon also.
>> 
>> MDS is single-threaded, so unlike most Ceph daemons it benefits more from
>> a high-frequency CPU than core count.
>> 
>>> So, based on your advice, I am thinking of having 4 Dell PowerEdge
>> servers
>>> . 3 of them will run 3 Monitor daemons and one of them  will run MDS
>>> daemon.
>>> 
>>> These Dell Servers will have following hardware :-
>>> 
>>> 1. 4 cores (  8 threads )  ( Can go for 8 core and 16 threads )
>>> 
>>> 2.  64G RAM
>>> 
>>> 3. 2x4T Samsung SSD with RAID 1 to install OS and run monitor and
>>> metadata services.
>> 
>> That probably suffices for a small cluster.  Are those Samsungs
>> enterprise?
>> 
>> 
>>> OSD nodes will be upgraded to have 32 cores ( 64 threads ).  Disk and RAM
>>> will remain same ( 128G and 22X8T Samsung SSD )
>> 
>> Which Samsung SSD?  Using client SKUs for OSDs has a way of leading to
>> heartbreak.
>> 
>> 64 threads would be better for a 22x OSD node, though still a bit light.
>> Are these SATA or NVMe?
>> 
>>> Actually, I want to use the OSD nodes to run OSD daemons and not any
>>> other daemons, which is why I am thinking of having 4 additional Dell
>>> servers as mentioned above.
>> 
>> Colocation of daemons is common these days, especially with smaller
>> clusters.
>> 
>>> 
>>> Please advise if this plan will be better.
>> 
>> That’ll work, but unless you already have those quite-modest 4x non-OSD
>> nodes sitting around idle you might consider just going with the OSD nodes
>> and bumping the CPU again so you can colocate all the daemons.
>> 
>>> 
>>> Thanks,
>>> Gagan
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Wed, Apr 9, 2025 at 8:12 PM Anthony D'Atri 
>>> wrote:
>>> 
 
> 
> We would start deploying Ceph with 4 hosts ( HP Proliant servers ) each
> running RockyLinux 9.
> 
> One of the hosts called ceph-adm will be smaller one and will have
> following hardware :-
> 
> 2x4T SSD  with raid 1 to install OS on.
> 
> 8 Core with 3600MHz freq.
> 
> 64G  RAM
> 
> We are planning to run all Ceph daemons except OSD daemon like monitor
>> ,
> metadata ,etc on this host.
 
 8 core == 16 threads? Are you provisioning this node because you have it
 laying around idle?
 
 Note that you will want *at least* 3 Monitor (monitors) daemons, which
 must be on different nodes.  5 is better, but at least 3

[ceph-users] Re: Cephadm flooding /var/log/ceph/cephadm.log

2025-04-11 Thread John Mulligan
On Thursday, April 10, 2025 10:42:50 PM Eastern Daylight Time Alex wrote:
> I made a Pull Request for cephadm.log set DEBUG.
> Not sure if I should merge it.

Please, no. Even if github allows you to (I think it won't) you should not 
merge your own PRs unless you are a component lead and it is an emergency (IMO 
- but I think many would agree).

I have left a comment on what I assume is your PR: https://github.com/ceph/ceph/pull/62789

PS. If you meant creating merge commits in your own PR branch, that is 
unsightly but harmless. If you meant what I think you meant - merge the PR 
into ceph main - that's what I object to. You need to follow the contribution 
process and get reviews, and have the PR get tested, before the component lead 
would possibly merge the changes into the main branch.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephadm flooding /var/log/ceph/cephadm.log

2025-04-11 Thread Alex
Sounds good to me.
I responded to your comment in the PR.
Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Repo name bug?

2025-04-11 Thread John Mulligan
On Thursday, April 10, 2025 1:08:00 AM Eastern Daylight Time Alex wrote:
> Good morning everyone.
> 
> 
> Does the preflight playbook have a bug?
> 
> https://github.com/ceph/cephadm-ansible/blob/devel/cephadm-preflight.yml
> 
> Line 82:
> paths: "{{ ['noarch', '$basearch'] if ceph_origin == 'community' else
> ['$basearch'] }}"
> 
> The yum repo file then gets named
> ceph_stable_$basearch.
> 
> 
> Shouldn't it be basearch without the $ ?
> 

Hi Alex,
The `$basearch` there is likely trying to act as a yum (dnf) repo file 
variable. Take a look at https://developers.redhat.com/articles/2022/10/07/whats-inside-rpm-repo-file
and search in the page for `$basearch`. You will see it appear in various 
locations in the example file. This is a placeholder variable for dnf to 
replace with the actual base architecture of the system it is running on.

See also https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/deployment_guide/sec-using_yum_variables#sec-Using_Yum_Variables

It's likely correct for the content of the repo file, but might be incorrect 
for the file name. I don't think simply removing the dollar sign would fix 
things.
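
As a purely illustrative example of what dnf does with that variable, a 
stanza along these lines in a .repo file:

```
[ceph-stable-basearch]
name=Ceph stable $basearch packages
baseurl=https://download.ceph.com/rpm-reef/el9/$basearch
gpgcheck=1
```

has $basearch substituted at run time (e.g. x86_64), so the variable belongs 
in the baseurl; it just looks odd when it leaks into the file name and 
description.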

Perhaps a different suggestion would be to remove the variable from the file 
name parameter (the `ceph_stable_` name here: 
https://github.com/ceph/cephadm-ansible/blob/74520740e4f85ea001d7ca7ab6992ce66145bc3f/cephadm-preflight.yml#L99C24-L99C36); 
doing this should create multiple repo definitions in a single repo file, as per 
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/yum_repository_module.html 
(search for `Add multiple repositories into the same file`).



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Repo name bug?

2025-04-11 Thread Alex
Thanks for the response John.

We "spoke" on my PR for the log level set to DEBUG.
I also have a PR open https://github.com/ceph/cephadm-ansible/pull/339 .
I tested this one on my Ceph cluster.

The issue that caused me to open it was that when I ran the preflight
playbook it populated my /etc/yum.repos.d/ dir with a file called
ceph_stable_$basearch (notice the dollar sign in the file name). The
issue seems to be cosmetic since it still works, but IMHO it should
still be fixed. Looking into the file, it also adds the $ into the name
and description.

The bug is really simple.
It uses $basearch for the repo URL as it should, but it also uses the
same variable with the dollar sign for the file name, repo name and
description.

My fix is to simply add a "| trim('$')" to remove the $ from the
places where we don't want it.
To me it seemed like the simplest solution, although maybe not the
most elegant.
We can't simply add the dollar sign to the baseurl ("{{ _ceph_repo.baseurl
}}/{{ $item }}"), since item can also be noarch, which can't have a
'$'.
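
Roughly, the change described is the kind of thing sketched below (this is a 
paraphrase for illustration, not the actual playbook text or the PR diff):

```
# item loops over ['noarch', '$basearch']; trim('$') strips '$' from the ends,
# so 'noarch' is untouched and '$basearch' becomes 'basearch'
- name: configure ceph stable repository
  ansible.builtin.yum_repository:
    name: "ceph_stable_{{ item | trim('$') }}"
    description: "Ceph stable {{ item | trim('$') }} repository"
    baseurl: "{{ _ceph_repo.baseurl }}/{{ item }}"   # keep the '$' here for dnf
    gpgcheck: true
  loop: "{{ paths }}"
```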

Since I was already fixing this, I fixed another cosmetic issue of the
repo url having two forward slashes (//).
Those come from the trailing slash in the variable as well as a slash in
"{{ _ceph_repo.baseurl }}/{{ item }}"
My first attempt was to remove the "/" from the line above but then I
realized that if someone left it out from the variable it would break
so I removed it from the variable but kept it in the code.

If you agree that this is a bug and not intended to be that way, then
please take a look at my PR.
I'm fairly comfortable with Ansible, but this is my first time forking a
repo, so I'm sure I did it wrong; please let me know how to fix it. If I
can actually make this VERY small contribution to the Ceph codebase,
that would be amazing.

Thanks!
Alex
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Ceph-announce] v18.2.5 Reef released

2025-04-11 Thread Stephan Hohn
Hi all,

started an update on our staging cluster from v18.2.4 --> v18.2.5

~# ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.5

Mons and mgr went fine, but OSDs are not coming up with v18.2.5:

Apr 11 06:59:56 0cc47a6df14e podman[263290]: 2025-04-11 06:59:56.697993041 + UTC m=+0.057869056 image pull quay.io/ceph/ceph:v18.2.5
Apr 11 06:59:56 0cc47a6df14e podman[263290]: 2025-04-11 06:59:56.778833855
+ UTC m=+0.138709869 container init
5db97f7e32705cc0e8fee1bc5741dfbd97ffa430b8fb5a1cfe19b768aed78b23 (image=
quay.io/ceph/ceph:v18.2.5,
name=ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853-osd-20, CEPH_GIT_REPO=
https://github.com/ceph/ceph.git, OSD_FLAVOR=default,
org.label-schema.schema-version=1.0, GANESHA_REPO_BASEURL=
https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/,
org.opencontainers.image.documentation=https://docs.ceph.com/,
CEPH_REF=reef, org.label-schema.vendor=CentOS, ceph=True,
org.label-schema.name=CentOS Stream 9 Base Image,
io.buildah.version=1.39.3,
CEPH_SHA1=a5b0e13f9c96f3b45f596a95ad098f51ca0ccce1, FROM_IMAGE=
quay.io/centos/centos:stream9, org.opencontainers.image.authors=Ceph
Release Team , org.label-schema.license=GPLv2,
org.label-schema.build-date=20250325)
Apr 11 06:59:56 0cc47a6df14e podman[263290]: 2025-04-11 06:59:56.790749299
+ UTC m=+0.150625308 container start
5db97f7e32705cc0e8fee1bc5741dfbd97ffa430b8fb5a1cfe19b768aed78b23 (image=
quay.io/ceph/ceph:v18.2.5,
name=ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853-osd-20,
org.label-schema.build-date=20250325, org.opencontainers.image.authors=Ceph
Release Team , org.label-schema.license=GPLv2,
org.label-schema.schema-version=1.0, ceph=True, CEPH_REF=reef,
CEPH_GIT_REPO=https://github.com/ceph/ceph.git, OSD_FLAVOR=default,
org.label-schema.name=CentOS Stream 9 Base Image,
org.opencontainers.image.documentation=https://docs.ceph.com/,
GANESHA_REPO_BASEURL=
https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/,
io.buildah.version=1.39.3,
CEPH_SHA1=a5b0e13f9c96f3b45f596a95ad098f51ca0ccce1, FROM_IMAGE=
quay.io/centos/centos:stream9, org.label-schema.vendor=CentOS)
Apr 11 06:59:56 0cc47a6df14e bash[263290]:
5db97f7e32705cc0e8fee1bc5741dfbd97ffa430b8fb5a1cfe19b768aed78b23
Apr 11 06:59:56 0cc47a6df14e
ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853-osd-20[263380]:
2025-04-11T06:59:56.817+ 7b72d0abc740 -1
bluestore(/var/lib/ceph/osd/ceph-20/block) _read_bdev_label failed to open
/var/lib/ceph/osd/ceph-20/block: (2) No such file or directory
Apr 11 06:59:56 0cc47a6df14e
ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853-osd-20[263380]:
2025-04-11T06:59:56.817+ 7b72d0abc740 -1  ** ERROR: unable to open OSD
superblock on /var/lib/ceph/osd/ceph-20: (2) No such file or directory
Apr 11 06:59:56 0cc47a6df14e systemd[1]: Started Ceph osd.20 for
03977a23-f00f-4bb0-b9a7-de57f40ba853.
Apr 11 06:59:56 0cc47a6df14e podman[263399]: 2025-04-11 06:59:56.90105365
+ UTC m=+0.076310419 container died
5db97f7e32705cc0e8fee1bc5741dfbd97ffa430b8fb5a1cfe19b768aed78b23 (image=
quay.io/ceph/ceph:v18.2.5,
name=ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853-osd-20)
Apr 11 06:59:56 0cc47a6df14e podman[263399]: 2025-04-11 06:59:56.948423169
+ UTC m=+0.123679914 container remove
5db97f7e32705cc0e8fee1bc5741dfbd97ffa430b8fb5a1cfe19b768aed78b23 (image=
quay.io/ceph/ceph:v18.2.5,
name=ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853-osd-20, ceph=True,
io.buildah.version=1.39.3, org.label-schema.name=CentOS Stream 9 Base
Image, org.opencontainers.image.authors=Ceph Release Team <
ceph-maintain...@ceph.io>, CEPH_REF=reef, GANESHA_REPO_BASEURL=
https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/,
CEPH_SHA1=a5b0e13f9c96f3b45f596a95ad098f51ca0ccce1, FROM_IMAGE=
quay.io/centos/centos:stream9, OSD_FLAVOR=default,
org.label-schema.build-date=20250325, org.label-schema.vendor=CentOS,
org.label-schema.schema-version=1.0, org.opencontainers.image.documentation=
https://docs.ceph.com/, org.label-schema.license=GPLv2, CEPH_GIT_REPO=
https://github.com/ceph/ceph.git)
Apr 11 06:59:56 0cc47a6df14e systemd[1]:
ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853@osd.20.service: Main process
exited, code=exited, status=1/FAILURE
Apr 11 06:59:57 0cc47a6df14e podman[263966]:
Apr 11 06:59:57 0cc47a6df14e podman[263966]: 2025-04-11 06:59:57.495704469
+ UTC m=+0.105177519 container create
d96a2746c9b6ac37f42e1beaac9f572d22558c16d662dfaff994d1d90c611ad8 (image=
quay.io/ceph/ceph:v18.2.5,
name=ceph-03977a23-f00f-4bb0-b9a7-de57f40ba853-osd-20-deactivate,
CEPH_GIT_REPO=https://github.com/ceph/ceph.git,
CEPH_SHA1=a5b0e13f9c96f3b45f596a95ad098f51ca0ccce1, GANESHA_REPO_BASEURL=
https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/,
org.label-schema.vendor=CentOS, org.label-schema.name=CentOS Stream 9 Base
Image, FROM_IMAGE=quay.io/centos/centos:stream9, io.buildah.version=1.39.3,
org.opencontainers.image.documentation=https://docs.ceph.com/,
org.label-schema.schema-version=1.0, org.lab

[ceph-users] FS not mount after update to quincy

2025-04-11 Thread Iban Cabrillo


Hi guys Good morning, 


Since I performed the update to Quincy, I've noticed a problem that wasn't 
present with Octopus. Currently, our Ceph cluster exports a filesystem to 
certain nodes, which we use as a backup repository. 
The machines that mount this FS are currently running Ubuntu 24 with Ceph Squid 
as the client version. 

zeus22:~ # ls -la /cephvmsfs/ 
total 225986576 
drwxrwxrwx 13 root   root             17 Apr  4 13:10 . 
drwxr-xr-x  1 root   root            286 Mar 19 13:27 .. 
-rw-r--r--  1 root   root   124998647808 Apr  4 13:18 arcceal9.img 
drwxrwxrwx  2 nobody nogroup           2 Jul 12  2018 backup 
drwxr-xr-x  2 nobody nogroup           1 Oct 18  2017 Default 
-rw-r--r--  1 root   root    21474836480 Mar 26 18:11 ns1.img 
drwxr-xr-x  2 root   root              1 Aug 29  2024 OnlyOffice 

Before the update, these nodes mounted the FS correctly (even cluster in 
octopus and clients in squid), and the nodes that haven't been restarted are 
still accessing it.

One of these machines has been reinstalled, and using the same configuration as 
the nodes that are still mounting this FS, it is unable to mount, giving errors 
such as: 

`mount error: no mds (Metadata Server) is up. The cluster might be laggy, or 
you may not be authorized` 
10.10.3.1:3300,10.10.3.2:3300,10.10.3.3:3300:/ /cephvmsfs ceph 
name=cephvmsfs,secretfile=/etc/ceph/cephvmsfs.secret,noatime,mds_namespace=cephvmsfs,_netdev
 0 0 

If I change the port to use 6789 (v1) 


mount error 110 = Connection timed out 

ceph cluster is healthy and MDS are up 

cephmon01:~ # ceph -s 
cluster: 
id: 6f5a65a7-yyy---428608941dd1 
health: HEALTH_OK 

services: 
mon: 3 daemons, quorum cephmon01,cephmon03,cephmon02 (age 2d) 
mgr: cephmon02(active, since 7d), standbys: cephmon01, cephmon03 
mds: 1/1 daemons up, 1 standby 
osd: 231 osds: 231 up (since 7d), 231 in (since 9d) 
rgw: 2 daemons active (2 hosts, 1 zones) 



Cephmons are reachable from clients on both ports: 
zeus:~ # telnet cephmon02 6789 
Trying 10.10.3.2... 
Connected to cephmon02. 
Escape character is '^]'. 
ceph v027�� 

Ҭ 

zeus01:~ # telnet cephmon02 3300 
Trying 10.10.3.2... 
Connected to cephmon02. 
Escape character is '^]'. 
ceph v2 


Any advice is welcome. Regards, I 
-- 

 
Ibán Cabrillo Bartolomé 
Instituto de Física de Cantabria (IFCA-CSIC) 
Santander, Spain 
Tel: +34942200969/+34669930421 
Responsible for advanced computing service (RSC) 
============================================================= 
All our suppliers must know and accept IFCA policy available at: 
https://confluence.ifca.es/display/IC/Information+Security+Policy+for+External+Suppliers
============================================================= 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io