[ceph-users] Re: Testing CEPH scrubbing / self-healing capabilities

2024-06-10 Thread Petr Bena
Hello,

No, I don't have osd_scrub_auto_repair enabled. Interestingly, about a week
after I had forgotten about this, an error manifested:

[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 4.1d is active+clean+inconsistent, acting [4,2]

which could be repaired as expected, since I damaged only 1 OSD. It's
interesting that it took a whole week to find it. For some reason it seems
that running a deep-scrub on an entire OSD only runs it for PGs where the OSD is
considered "primary", so maybe that's why it wasn't detected when I ran it
manually?
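
For reference, the sequence I used to inspect and repair such an inconsistent
PG looks roughly like this (the pg id is the one from the health output above):

  # show which PG is inconsistent and why
  ceph health detail
  # list the objects whose checksums/sizes disagree between replicas
  rados list-inconsistent-obj 4.1d --format=json-pretty
  # ask the primary to repair the PG from the healthy replica
  ceph pg repair 4.1d
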
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Testing CEPH scrubbing / self-healing capabilities

2024-06-10 Thread Eugen Block
That would have been my next question, did you verify that the  
corrupted OSD was a primary? The default deep-scrub config scrubs all  
PGs within a week, so yeah, it can take a week until it's detected. It
could have been detected sooner if those objects had been in
use by clients and required to be updated (rewritten).
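
For reference, the interval in question can be checked (and tightened, if
desired) with something like:

  # default is 604800 seconds, i.e. one week
  ceph config get osd osd_deep_scrub_interval
  # example: shorten it to three days; pick a value your disks can sustain
  ceph config set osd osd_deep_scrub_interval 259200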


Zitat von Petr Bena :


Hello,

No I don't have osd_scrub_auto_repair, interestingly after about a  
week after forgetting about this, an error manifested:


[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 4.1d is active+clean+inconsistent, acting [4,2]

which could be repaired as expected, since I damaged only 1 OSD.  
It's interesting it took a whole week to find it. For some reason it  
seems to be that running deep-scrub on entire OSD only runs it for  
PGs where the OSD is considered "primary", so maybe that's why it  
wasn't detected when I ran it manually?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question regarding bluestore labels

2024-06-10 Thread Igor Fedotov

Hi Bailey,

yes, this should be doable using the following steps:

1. Copy the very first block 0~4096 from a different OSD to that 
non-working one.


2. Use ceph-bluestore-tool's set-label-key command to modify "osd_uuid" 
at target OSD


3. Adjust "size" field at target OSD if DB volume size at target OSD is 
different.
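
A rough sketch of those three steps on the command line; the device paths,
target OSD fsid and size value are placeholders you would substitute:

  # 1. copy the first 4 KiB label block from a healthy DB device
  dd if=/dev/vg_good/db of=/dev/vg_bad/db bs=4096 count=1 conv=notrunc
  # 2. fix the osd_uuid so it matches the target OSD's fsid
  ceph-bluestore-tool set-label-key --dev /dev/vg_bad/db -k osd_uuid -v <target-osd-fsid>
  # 3. fix the size field if the DB volumes differ in size (value in bytes)
  ceph-bluestore-tool set-label-key --dev /dev/vg_bad/db -k size -v <db-volume-size>
  # verify the result
  ceph-bluestore-tool show-label --dev /dev/vg_bad/db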



Hope this helps.

Thanks,

Igor

On 6/8/2024 3:38 AM, Bailey Allison wrote:

I have a question regarding bluestore labels, specifically for a block.db
partition.

  


To make a long story short, we are currently in a position where, on checking
the label of a block.db partition, it appears corrupted.

  


I have seen another thread on here suggesting to copy the label from a
working OSD to the non working OSD, then re-adding the correct value to the
labels with ceph-bluestore-tool.

  


Where this was mentioned this was with an OSD in mind, would the same logic
apply if we were working with a db device instead? This is assuming the only
issue with the db is the corrupted label, and there are no other issues.

  


Regards,

  


Bailey

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Performance issues RGW (S3)

2024-06-10 Thread sinan

Hi all,

My Ceph setup:
- 12 OSD nodes, 4 OSD nodes per rack. Replication of 3, 1 replica per 
rack.

- 20 spinning SAS disks per node.
- Some nodes have 256GB RAM, some nodes 128GB.
- CPU varies between Intel E5-2650 and Intel Gold 5317.
- Each node has 10Gbit/s network.

Using rados bench I am getting decent results (depending on block size):
- 1000 MB/s throughput, 1000 IOps with 1MB block size
- 30 MB/s throughput, 7500 IOps with 4K block size
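
(Those numbers came from invocations along these lines; the pool name, run
time and thread count here are just examples:)

  rados bench -p testpool 60 write -b 1048576 -t 16 --no-cleanup   # 1MB blocks
  rados bench -p testpool 60 write -b 4096 -t 16 --no-cleanup      # 4K blocks
  rados -p testpool cleanup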

Unfortunately not getting the same performance with Rados Gateway (S3).

- 1x HAProxy with 3 backend RGW's.

I am using MinIO Warp for benchmarking (PUT), with 1 Warp server and 5 
Warp clients, benchmarking against the HAProxy.


Results:
- Using 10MB object size, I am hitting the 10Gbit/s link of the HAProxy 
server. That's good.
- Using 500K object size, I am getting a throughput of 70 up to 150 MB/s 
with 140 up to 300 obj/s. It depends on the concurrency setting of Warp.
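
(The Warp runs were of roughly this shape; the endpoint, credentials, duration
and concurrency are placeholders:)

  warp put --host haproxy.example.com:443 --tls \
       --access-key <key> --secret-key <secret> \
       --obj.size 500KiB --duration 5m --concurrent 64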


It looks like the objects/s are the bottleneck, not the throughput.

Max memory usage is about 80-90GB per node. The CPUs are mostly idle.

Is it reasonable to expect more IOps / objects/s for RGW with my setup? 
At this moment I am not able to find the bottleneck that is causing the 
low obj/s.


Ceph version is 15.2.

Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question regarding bluestore labels

2024-06-10 Thread Bailey Allison
Hey Igor,

Thanks for the validation, I was also able to validate this in testing on
the weekend myself, though on a db I messed up myself, and it was able to be
restored.

If this ends up being the solution for the customer in this case, I will
follow up here if anyone is curious.

Thanks again Igor, it was your post on here mentioning this a few weeks ago
that actually let me know to even check this stuff.

Regards,

Bailey

> -Original Message-
> From: Igor Fedotov 
> Sent: June 10, 2024 7:08 AM
> To: Bailey Allison ; 'ceph-users'  us...@ceph.io>
> Subject: [ceph-users] Re: Question regarding bluestore labels
> 
> Hi Bailey,
> 
> yes, this should be doable using the following steps:
> 
> 1. Copy the very first block 0~4096 from a different OSD to that
non-working
> one.
> 
> 2. Use ceph-bluestore-tool's set-label-key command to modify "osd_uuid"
> at target OSD
> 
> 3. Adjust "size" field at target OSD if DB volume size at target OSD is
different.
> 
> 
> Hope this helps.
> 
> Thanks,
> 
> Igor
> 
> On 6/8/2024 3:38 AM, Bailey Allison wrote:
> > I have a question regarding bluestore labels, specifically for a
block.db
> > partition.
> >
> >
> >
> > To make a long story short, we are currently in a position where
checking
> > the label of a block.db partition and it appears corrupted.
> >
> >
> >
> > I have seen another thread on here suggesting to copy the label from a
> > working OSD to the non working OSD, then re-adding the correct value to
> the
> > labels with ceph-bluestore-tool.
> >
> >
> >
> > Where this was mentioned this was with an OSD in mind, would the same
> logic
> > apply if we were working with a db device instead? This is assuming the
> only
> > issue with the db is the corrupted label, and there is no other issues.
> >
> >
> >
> > Regards,
> >
> >
> >
> > Bailey
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> --
> Igor Fedotov
> Ceph Lead Developer
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-10 Thread Anthony D'Atri



> Hi all,
> 
> My Ceph setup:
> - 12 OSD nodes, 4 OSD nodes per rack. Replication of 3, 1 replica per rack.
> - 20 spinning SAS disks per node.

Don't use legacy HDDs if you care about performance.

> - Some nodes have 256GB RAM, some nodes 128GB.

128GB is on the low side for 20 OSDs.

> - CPU varies between Intel E5-2650 and Intel Gold 5317.

E5-2650 is underpowered for 20 OSDs.  5317 isn't the ideal fit, it'd make a 
decent MDS system but assuming a dual socket system, you have ~2 threads per 
OSD, which is maybe acceptable for HDDs, but I assume you have mon/mgr/rgw on 
some of them too.


> - Each node has 10Gbit/s network.
> 
> Using rados bench I am getting decent results (depending on block size):
> - 1000 MB/s throughput, 1000 IOps with 1MB block size
> - 30 MB/s throughput, 7500 IOps with 4K block size

rados bench is useful for smoke testing but not always a reflection of E2E 
experience.

> 
> Unfortunately not getting the same performance with Rados Gateway (S3).
> 
> - 1x HAProxy with 3 backend RGW's.

Run an RGW on every node.


> I am using Minio Warp for benchmarking (PUT). I am 1 Warp server and 5 Warp 
> clients. Benchmarking towards the HAProxy.
> 
> Results:
> - Using 10MB object size, I am hitting the 10Gbit/s link of the HAProxy 
> server. Thats good.
> - Using 500K object size, I am getting a throughput of 70 up to 150 MB/s with 
> 140 up to 300 obj/s.

Tiny objects are the devil of any object storage deployment.  The HDDs are 
killing you here, especially for the index pool.  You might get a bit better by 
upping pg_num from the party line.

You might also disable Nagle on the RGW nodes.
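
With the beast frontend that is a one-line change, something like the
following (the port is an example; adjust to your existing frontends line):

  ceph config set client.rgw rgw_frontends "beast port=8080 tcp_nodelay=1"
  # then restart the radosgw daemons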


> It depends on the concurrency setting of Warp.
> 
> It look likes the objects/s is the bottleneck, not the throughput.
> 
> Max memory usage is about 80-90GB per node. CPU's are quite idling.
> 
> Is it reasonable to expect more IOps / objects/s for RGW with my setup? At 
> this moment I am not able to find the bottleneck what is causing the low 
> obj/s.

HDDs are a false economy.

> 
> Ceph version is 15.2.
> 
> Thanks!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Stuck OSD down/out + workaround

2024-06-10 Thread Mazzystr
Hi ceph users,
I've seen this happen a couple times and been meaning to ask the group
about it.

Sometimes I get a failed block device and I have to replace it.  My normal
process is -
* stop the osd process
* remove the osd from crush map
* rm -rf /var/lib/ceph/osd/<cluster>-<id>/*
* run mkfs
* start osd process
* add osd to crush map

Sometimes the osd will get "stuck" and it doesn't want to flip to up/in.  I
have to actually rm the osd.id from ceph itself then recreate the id then
do all the above.

Does anyone know why this could be?  I'm on reef 18.2 but I've seen this a
lot over the years.

Thanks,
/Chris
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Testing CEPH scrubbing / self-healing capabilities

2024-06-10 Thread Petr Bena
Most likely it wasn't; the ceph help output and documentation are not very clear 
about this:

osd deep-scrub <who>
  initiate deep scrub on osd <who>, or use
<all|any> to deep scrub all

It doesn't say anything like "initiate deep scrub of primary PGs on osd"

I assumed it just ran a scrub of everything on the given OSD.
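
If the goal is to deep-scrub everything an OSD holds, primary or not, one
workaround is to iterate over its PGs instead; a rough sketch (the JSON field
names may vary slightly between releases):

  # deep-scrub every PG that has osd.4 in its acting set
  for pg in $(ceph pg ls-by-osd 4 -f json | jq -r '.pg_stats[].pgid'); do
      ceph pg deep-scrub "$pg"
  done
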
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-10 Thread sinan

On 2024-06-10 15:20, Anthony D'Atri wrote:

Hi all,

My Ceph setup:
- 12 OSD nodes, 4 OSD nodes per rack. Replication of 3, 1 replica per 
rack.

- 20 spinning SAS disks per node.


Don't use legacy HDDs if you care about performance.


You are right here, but we use Ceph mainly for RBD. It performs 'good 
enough' for our RBD load.



- Some nodes have 256GB RAM, some nodes 128GB.


128GB is on the low side for 20 OSDs.


Agreed, but with 20 OSDs x an osd_memory_target of 4GB (80GB) it is enough. 
We haven't had any server that OOM'ed yet.



- CPU varies between Intel E5-2650 and Intel Gold 5317.


E5-2650 is underpowered for 20 OSDs.  5317 isn't the ideal fit, it'd 
make a decent MDS system but assuming a dual socket system, you have ~2 
threads per OSD, which is maybe acceptable for HDDs, but I assume you 
have mon/mgr/rgw on some of them too.


The (CPU) load on the OSD nodes is quite low. Our MON/MGR/RGW aren't 
hosted on the OSD nodes and are running on modern hardware.



- Each node has 10Gbit/s network.

Using rados bench I am getting decent results (depending on block 
size):

- 1000 MB/s throughput, 1000 IOps with 1MB block size
- 30 MB/s throughput, 7500 IOps with 4K block size


rados bench is a useful for smoke testing but not always a reflection 
of E2E experience.




Unfortunately not getting the same performance with Rados Gateway 
(S3).


- 1x HAProxy with 3 backend RGW's.


Run an RGW on every node.


On every OSD node?



I am using Minio Warp for benchmarking (PUT). I am 1 Warp server and 5 
Warp clients. Benchmarking towards the HAProxy.


Results:
- Using 10MB object size, I am hitting the 10Gbit/s link of the 
HAProxy server. Thats good.
- Using 500K object size, I am getting a throughput of 70 up to 150 
MB/s with 140 up to 300 obj/s.


Tiny objects are the devil of any object storage deployment.  The HDDs 
are killing you here, especially for the index pool.  You might get a 
bit better by upping pg_num from the party line.


I would expect high write await times, but all OSD/disks have write 
await times of 1 ms up to 3 ms.




You might also disable Nagle on the RGW nodes.


I need to lookup what that exactly is and does.




It depends on the concurrency setting of Warp.

It look likes the objects/s is the bottleneck, not the throughput.

Max memory usage is about 80-90GB per node. CPU's are quite idling.

Is it reasonable to expect more IOps / objects/s for RGW with my 
setup? At this moment I am not able to find the bottleneck what is 
causing the low obj/s.


HDDs are a false economy.


Got it :)



Ceph version is 15.2.

Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-10 Thread Anthony D'Atri



>>> - 20 spinning SAS disks per node.
>> Don't use legacy HDDs if you care about performance.
> 
> You are right here, but we use Ceph mainly for RBD. It performs 'good enough' 
> for our RBD load.

You use RBD for archival?


>>> - Some nodes have 256GB RAM, some nodes 128GB.
>> 128GB is on the low side for 20 OSDs.
> 
> Agreed, but with 20 OSD's x osd_memory_target 4GB (80GB) it is enough. We 
> haven't had any server that OOM'ed yet.

Remember that's a *target* not a *limit*.  Say one or more of your failure 
domains goes offline or you have some other large topology change.  Your OSDs 
might want up to 2x osd_memory_target, then you OOM and it cascades.  I've been 
there, had to do an emergency upgrade of 24xOSD nodes from 128GB to 192GB.

>>> - CPU varies between Intel E5-2650 and Intel Gold 5317.
>> E5-2650 is underpowered for 20 OSDs.  5317 isn't the ideal fit, it'd make a 
>> decent MDS system but assuming a dual socket system, you have ~2 threads per 
>> OSD, which is maybe acceptable for HDDs, but I assume you have mon/mgr/rgw 
>> on some of them too.
> 
> The (CPU) load on the OSD nodes is quite low. Our MON/MGR/RGW aren't hosted 
> on the OSD nodes and are running on modern hardware.

You didn't list additional nodes so I assumed.  You might still do well to have 
a larger number of RGWs, wherever they run.  RGWs often scale better 
horizontally than vertically.

> 
>> rados bench is a useful for smoke testing but not always a reflection of E2E 
>> experience.
>>> Unfortunately not getting the same performance with Rados Gateway (S3).
>>> - 1x HAProxy with 3 backend RGW's.
>> Run an RGW on every node.
> 
> On every OSD node?

Yep, why not?


>>> I am using Minio Warp for benchmarking (PUT). I am 1 Warp server and 5 Warp 
>>> clients. Benchmarking towards the HAProxy.
>>> Results:
>>> - Using 10MB object size, I am hitting the 10Gbit/s link of the HAProxy 
>>> server. Thats good.
>>> - Using 500K object size, I am getting a throughput of 70 up to 150 MB/s 
>>> with 140 up to 300 obj/s.
>> Tiny objects are the devil of any object storage deployment.  The HDDs are 
>> killing you here, especially for the index pool.  You might get a bit better 
>> by upping pg_num from the party line.
> 
> I would expect high write await times, but all OSD/disks have write await 
> times of 1 ms up to 3 ms.

There are still serializations in the OSD and PG code.  You have 240 OSDs, does 
your index pool have *at least* 256 PGs?
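
(Easy to check; the pool name below assumes the default zone naming:)

  ceph osd pool get default.rgw.buckets.index pg_num
  # and, if needed, raise it
  ceph osd pool set default.rgw.buckets.index pg_num 256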


> 
>> You might also disable Nagle on the RGW nodes.
> 
> I need to lookup what that exactly is and does.
> 
>>> It depends on the concurrency setting of Warp.
>>> It look likes the objects/s is the bottleneck, not the throughput.
>>> Max memory usage is about 80-90GB per node. CPU's are quite idling.
>>> Is it reasonable to expect more IOps / objects/s for RGW with my setup? At 
>>> this moment I am not able to find the bottleneck what is causing the low 
>>> obj/s.
>> HDDs are a false economy.
> 
> Got it :)
> 
>>> Ceph version is 15.2.
>>> Thanks!
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Testing CEPH scrubbing / self-healing capabilities

2024-06-10 Thread Anthony D'Atri
Scrubs are of PGs, not OSDs; the lead OSD for a PG orchestrates subops to 
secondary OSDs.  If you can point me to where this is in docs/src I'll clarify 
it; ideally, put in a tracker ticket and send me a link.

Scrubbing all PGs on an OSD at once or even in sequence would be impactful.

> On Jun 10, 2024, at 10:51, Petr Bena  wrote:
> 
> Most likely it wasn't, the ceph help or documentation is not very clear about 
> this:
> 
> osd deep-scrub <who>
>   initiate deep scrub on osd <who>, or
> use <all|any> to deep scrub all
> 
> It doesn't say anything like "initiate deep scrub of primary PGs on osd"
> 
> I assumed it just runs a scrub of everything on given OSD.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph Leadership Team Weekly Minutes 2024-06-10

2024-06-10 Thread Casey Bodley
# quincy now past estimated 2024-06-01 end-of-life

will 17.2.8 be the last point release? maybe not, depending on timing

# centos 8 eol

* Casey tried to summarize the fallout in
https://lists.ceph.io/hyperkitty/list/d...@ceph.io/thread/H7I4Q4RAIT6UZQNPPZ5O3YB6AUXLLAFI/
* c8 builds were disabled with https://github.com/ceph/ceph-build/pull/2235
* Patrick points out that we'll no longer be able to test upgrades
from octopus/pacific to quincy/reef

## reef 18.2.3 validation delayed

* need to remove references to centos 8 in the qa suites
** reef backport started in https://github.com/ceph/ceph/pull/57932
** still blocked on fs upgrade suite (for main/squid also)
* the plan is to move upgrade suites to cephadm where possible
** concern about lack of package-based upgrade testing

## alternatives to centos for container base distro?

to be discussed in public on the mailing list

# Cephalocon program committee - volunteers?

* Josh Durgin
* Patrick Donnelly
* Joseph Mundackal (first time, so happy to help in any way I can)
* Matt Benjamin (talk review)

# Crimson Tech lead change

* congrats to Matan Breizman

# docs backports to Reef and to Quincy fail

* Zac would like to learn about the doc build infrastructure in order
to fix issues like this
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?

2024-06-10 Thread Mark Lehrer
> Not the most helpful response, but on a (admittedly well-tuned)

Actually this was the most helpful since you ran the same rados bench
command.  I'm trying to stay away from rbd & qemu issues and just test
rados bench on a non-virtualized client.

I have a test instance with newer drives, CPUs, and Ceph code; I'll see
what that looks like.

Maged's comments were quite useful as far as iops per thread.  It
seems like Ceph still hasn't adjusted to SSD performance.  This kind
of feels like MongoDB before the Wired Tiger engine... slow
performance but with all the system resources close to idle due to
threads being blocked.

Thanks,
Mark
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?

2024-06-10 Thread Anthony D'Atri
Eh?  cf. Mark and Dan's 1TB/s presentation.

> On Jun 10, 2024, at 13:58, Mark Lehrer  wrote:
> 
>  It
> seems like Ceph still hasn't adjusted to SSD performance.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-10 Thread sinan

On 2024-06-10 17:42, Anthony D'Atri wrote:

- 20 spinning SAS disks per node.

Don't use legacy HDDs if you care about performance.


You are right here, but we use Ceph mainly for RBD. It performs 'good 
enough' for our RBD load.


You use RBD for archival?


No, storage for (light-weight) virtual machines.




- Some nodes have 256GB RAM, some nodes 128GB.

128GB is on the low side for 20 OSDs.


Agreed, but with 20 OSD's x osd_memory_target 4GB (80GB) it is enough. 
We haven't had any server that OOM'ed yet.


Remember that's a *target* not a *limit*.  Say one or more of your 
failure domains goes offline or you have some other large topology 
change.  Your OSDs might want up to 2x osd_memory_target, then you OOM 
and it cascades.  I've been there, had to do an emergency upgrade of 
24xOSD nodes from 128GB to 192GB.


+1


- CPU varies between Intel E5-2650 and Intel Gold 5317.
E5-2650 is underpowered for 20 OSDs.  5317 isn't the ideal fit, it'd 
make a decent MDS system but assuming a dual socket system, you have 
~2 threads per OSD, which is maybe acceptable for HDDs, but I assume 
you have mon/mgr/rgw on some of them too.


The (CPU) load on the OSD nodes is quite low. Our MON/MGR/RGW aren't 
hosted on the OSD nodes and are running on modern hardware.


You didn't list additional nodes so I assumed.  You might still do well 
to have a larger number of RGWs, wherever they run.  RGWs often scale 
better horizontally than vertically.


Good to know. I'll check if adding more RGW nodes is possible.



rados bench is a useful for smoke testing but not always a reflection 
of E2E experience.
Unfortunately not getting the same performance with Rados Gateway 
(S3).

- 1x HAProxy with 3 backend RGW's.

Run an RGW on every node.


On every OSD node?


Yep, why not?


I am using Minio Warp for benchmarking (PUT). I am 1 Warp server and 
5 Warp clients. Benchmarking towards the HAProxy.

Results:
- Using 10MB object size, I am hitting the 10Gbit/s link of the 
HAProxy server. Thats good.
- Using 500K object size, I am getting a throughput of 70 up to 150 
MB/s with 140 up to 300 obj/s.
Tiny objects are the devil of any object storage deployment.  The 
HDDs are killing you here, especially for the index pool.  You might 
get a bit better by upping pg_num from the party line.


I would expect high write await times, but all OSD/disks have write 
await times of 1 ms up to 3 ms.


There are still serializations in the OSD and PG code.  You have 240 
OSDs, does your index pool have *at least* 256 PGs?


Index as the data pool has 256 PG's.






You might also disable Nagle on the RGW nodes.


I need to lookup what that exactly is and does.


It depends on the concurrency setting of Warp.
It look likes the objects/s is the bottleneck, not the throughput.
Max memory usage is about 80-90GB per node. CPU's are quite idling.
Is it reasonable to expect more IOps / objects/s for RGW with my 
setup? At this moment I am not able to find the bottleneck what is 
causing the low obj/s.

HDDs are a false economy.


Got it :)


Ceph version is 15.2.
Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Attention: Documentation - mon states and names

2024-06-10 Thread Joel Davidow
As this is my first submission to the Ceph docs, I want to start by saying
a big thank you to the Ceph team for all the efforts that have been put
into improving the docs. The improvements already made have been many and
have made it easier for me to operate Ceph.

In
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#the-cluster-has-quorum-but-at-least-one-monitor-is-down,
the section "What does it mean when a Monitor’s state is ``leader`` or
``peon``?" discusses those two mon states only in the context of an issue
that has a health detail entry.

However, the title of that section is not scoped to just that particular
case and so could lead to confusion because during normal Ceph operations,
there is a mon that has state leader and the other mons have state peon, as
can be seen by the values of state returned by ceph tell <mon-name>
mon_status.

To alleviate any such confusion, I recommend inserting the following before
the existing text in that section: "During normal Ceph operations when the
cluster is in the HEALTH_OK state, one monitor in the Ceph cluster will be
in the leader state and the rest of the monitors will be in the peon state.
The state of a given monitor can be determined by examining the value of
the state key returned by ceph tell <mon-name> mon_status."
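
As an illustration, the current leader and the state of an individual mon can
be checked with something like the following (the mon name is a placeholder,
and jq is only used here for brevity):

  ceph quorum_status -f json-pretty | jq -r .quorum_leader_name
  ceph tell mon.<name> mon_status | jq -r .state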

Note the difference of convention in ceph command presentation. In
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#understanding-mon-status,
mon.X uses X to represent the portion of the command to be replaced by the
operator with a specific value. However, that may not be clear to all
readers, some of whom may read that as a literal X. I recommend switching
convention to something that makes visually explicit any portion of a
command that an operator has to replace with a specific value. One such
convention is to use <> as delimiters marking the portion of a command that
an operator has to replace with a specific value, minus the delimiters
themselves. I'm sure there are other conventions that would accomplish the
same goal and provide the <> convention as an example only.

Also, the actual name of a mon is not clear due to the variety of mon name
formats. The value of the NAME column returned by ceph orch ps
--daemon-type mon and the return from ceph mon dump follow the format of
mon.<name> whereas the value of name returned by ceph tell <mon-name>
mon_status, the mon line returned by ceph -s, and the return from ceph mon
stat follow the format of <name>. Unifying the return for the mon name
value of all those commands could be helpful in establishing the format of
a mon name, though that is probably easier said than done.

In addition, in
https://docs.ceph.com/en/latest/rados/configuration/mon-config-ref/#configuring-monitors,
mon names are stated to use alpha notation by convention, but that
convention is not followed by cephadm in the clusters that I've deployed.
Cephadm also uses a minimal ceph.conf file with configs in the mon
database. I recommend this section be updated to mention those changes. If
there is a way to explain what a mon name is or how it is formatted,
perhaps adding that to that same section would be good.

Thanks again for the on-going work to improve the Ceph docs!
Joel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-10 Thread Anthony D'Atri



 
>>> You are right here, but we use Ceph mainly for RBD. It performs 'good 
>>> enough' for our RBD load.
>> You use RBD for archival?
> 
> No, storage for (light-weight) virtual machines.

I'm surprised that it's enough; I've seen HDDs fail miserably in that role.

> The (CPU) load on the OSD nodes is quite low. Our MON/MGR/RGW aren't hosted 
> on the OSD nodes and are running on modern hardware.
>> You didn't list additional nodes so I assumed.  You might still do well to 
>> have a larger number of RGWs, wherever they run.  RGWs often scale better 
>> horizontally than vertically.
> 
> Good to know. I'll check if adding more RGW nodes is possible.

To be clear, you don't need more nodes.  You can add RGWs to the ones you 
already have.  You have 12 OSD nodes - why not put an RGW on each?

> 
>> There are still serializations in the OSD and PG code.  You have 240 OSDs, 
>> does your index pool have *at least* 256 PGs?
> 
> Index as the data pool has 256 PG's.

To be clear, that means whatever.rgw.buckets.index ?

> 
 You might also disable Nagle on the RGW nodes.
>>> I need to lookup what that exactly is and does.
> It depends on the concurrency setting of Warp.
> It look likes the objects/s is the bottleneck, not the throughput.
> Max memory usage is about 80-90GB per node. CPU's are quite idling.
> Is it reasonable to expect more IOps / objects/s for RGW with my setup? 
> At this moment I am not able to find the bottleneck what is causing the 
> low obj/s.
 HDDs are a false economy.
>>> Got it :)
> Ceph version is 15.2.
> Thanks!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS crashes to damaged metadata

2024-06-10 Thread Patrick Donnelly
You could try manually deleting the files from the directory
fragments, using `rados` commands. Make sure to flush your MDS journal
first and take the fs offline (`ceph fs fail`).
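
Very roughly, and with purely illustrative object and key names (the dirfrag
object and dentry key have to be looked up for your damaged paths first):

  # flush the MDS journal, then take the filesystem offline
  ceph tell mds.<name> flush journal
  ceph fs fail <fsname>
  # a damaged dentry is an omap key on its directory-fragment object in the
  # metadata pool; remove that key to drop the file from the directory
  rados -p <cephfs-metadata-pool> listomapkeys 10000000abc.00000000
  rados -p <cephfs-metadata-pool> rmomapkey 10000000abc.00000000 badfile_head
  # bring the filesystem back afterwards
  ceph fs set <fsname> joinable true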

On Tue, Jun 4, 2024 at 8:50 AM Stolte, Felix  wrote:
>
> Hi Patrick,
>
> it has been a year now and we did not have a single crash since upgrading to 
> 16.2.13. We still have the 19 corrupted files which are reported by 'damage 
> ls‘. Is it now possible to delete the corrupted files without taking the 
> filesystem offline?
>
> Am 22.05.2023 um 20:23 schrieb Patrick Donnelly :
>
> Hi Felix,
>
> On Sat, May 13, 2023 at 9:18 AM Stolte, Felix  wrote:
>
> Hi Patrick,
>
> we have been running one daily snapshot since december and our cephfs crashed 
> 3 times because of this https://tracker.ceph.com/issues/38452
>
> We currentliy have 19 files with corrupt metadata found by your 
> first-damage.py script. We isolated the these files from access by users and 
> are waiting for a fix before we remove them with your script (or maybe a new 
> way?)
>
> No other fix is anticipated at this time. Probably one will be
> developed after the cause is understood.
>
> Today we upgraded our cluster from 16.2.11 to 16.2.13. After upgrading the 
> mds servers, cluster health went to ERROR MDS_DAMAGE. 'ceph tell mds.0 
> damage ls' is showing me the same files as your script (initially only a 
> part, after a cephfs scrub all of them).
>
> This is expected. Once the dentries are marked damaged, the MDS won't
> allow operations on those files (like those triggering tracker
> #38452).
>
> I noticed "mds: catch damage to CDentry’s first member before persisting 
> (issue#58482, pr#50781, Patrick Donnelly)“ in the change logs for 16.2.13  
> and like to ask you the following questions:
>
> a) can we repair the damaged files online now instead of bringing down the 
> whole fs and using the python script?
>
> Not yet.
>
> b) should we set one of the new mds options in our specific case to avoid our 
> fileserver crashing because of the wrong snap ids?
>
> Have your MDS crashed or just marked the dentries damaged? If you can
> reproduce a crash with detailed logs (debug_mds=20), that would be
> incredibly helpful.
>
> c) will your patch prevent wrong snap ids in the future?
>
> It will prevent persisting the damage.
>
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Red Hat Partner Engineer
> IBM, Inc.
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>
>
> mit freundlichem Gruß
> Felix Stolte
>
> IT-Services
> mailto: f.sto...@fz-juelich.de
> Tel: 02461-619243
>
> -
> -
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> Sitz der Gesellschaft: Juelich
> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
> Vorsitzender des Aufsichtsrats: MinDir Stefan Müller
> Geschäftsführung: Prof. Dr. Astrid Lambrecht (Vorsitzende),
> Karsten Beneke (stellv. Vorsitzender), Dr. Ir. Pieter Jansens
> -
> -
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-10 Thread sinan

On 2024-06-10 21:37, Anthony D'Atri wrote:


You are right here, but we use Ceph mainly for RBD. It performs 
'good enough' for our RBD load.

You use RBD for archival?


No, storage for (light-weight) virtual machines.


I'm surprised that it's enough, I've seen HDDs fail miserably in that 
role.


The (CPU) load on the OSD nodes is quite low. Our MON/MGR/RGW aren't 
hosted on the OSD nodes and are running on modern hardware.
You didn't list additional nodes so I assumed.  You might still do 
well to have a larger number of RGWs, wherever they run.  RGWs often 
scale better horizontally than vertically.


Good to know. I'll check if adding more RGW nodes is possible.


To be clear, you don't need more nodes.  You can add RGWs to the ones 
you already have.  You have 12 OSD nodes - why not put an RGW on each?


Might be an option; I just don't like the idea of hosting multiple components 
on the same nodes. But I'll consider it.




There are still serializations in the OSD and PG code.  You have 240 
OSDs, does your index pool have *at least* 256 PGs?


Index as the data pool has 256 PG's.


To be clear, that means whatever.rgw.buckets.index ?


No, sorry my bad. .index is 32 and .data is 256.




You might also disable Nagle on the RGW nodes.

I need to lookup what that exactly is and does.

It depends on the concurrency setting of Warp.
It look likes the objects/s is the bottleneck, not the throughput.
Max memory usage is about 80-90GB per node. CPU's are quite 
idling.
Is it reasonable to expect more IOps / objects/s for RGW with my 
setup? At this moment I am not able to find the bottleneck what is 
causing the low obj/s.

HDDs are a false economy.

Got it :)

Ceph version is 15.2.
Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] multipart uploads in reef 18.2.2

2024-06-10 Thread Christopher Durham

We have a reef 18.2.2 cluster with 6 radosgw servers on Rocky 8.9. The radosgw 
servers are not fronted by anything like HAProxy as the clients connect 
directly to a DNS name via a round-robin DNS. Each of the radosgw servers have 
a certificate using SAN entries for all 6 radosgw servers as well as the 
primary DNS name. This has worked wonderfully for four years in distributing 
our load. This is an RPM install.

Since our update to 18.2.2, we have had some issues with a specific set of 
clients (spark). They *always* create multipart uploads, both before and after 
the ceph update from 17.2.6 to 18.2.2, even when there is a single multipart 
part less than what would otherwise be the threshold. The single multipart part 
is the norm.

This works fine for a time after restarting the radosgws. This is what happens:

1. The single PUT of a single multipart part with a given uploadId
2. The single POST to otherwise complete the multipart, with the same uploadId 
used in the PUT

But when the problem occurs, the PUT works, but the POST fails with a "500 
302" error, and the client will continue to try this, eventually returning with 
a reported error of: "Status code; 500 Error-Code: Internal error, ... The 
multipart completion is already in progress." While this error makes sense when 
multiple POSTs happen for the completion of the multipart upload, the first one 
should not fail.
Sometimes the PUT and the POST happen from different clients or to different 
servers, even when things are working. But, when things begin to fail, just 
before the failed POST in the radosgw logs, I get:
s3:complete_multipart failed to acquire lock
Then the multiple "500 302" errors happen. Note that after the multiple "500 
302" errors, the spark client DELETEs the object, terminating the multipart 
upload (no 'leftover' multipart uploads).

There are about 50-60 multipart uploads happening when this occurs, and as time 
goes on, I get fewer and fewer successful multipart uploads, until eventually I 
have to restart the rados gateways. There are some other minor GETs and PUTs 
happening as well, unrelated to spark.

When this happens, we get multiple sockets on the radosgw servers that are in 
CLOSE-WAIT state. While I cannot prove it, these appear to be related to the 
issue at hand, as the CLOSE-WAIT states are only coming from IPs associated 
with the spark jobs. After I restart the radosgw servers, things are good for a 
time, and all CLOSE-WAITs disappear until the problem starts.

I have set:
rgw thread pool size = 1024
rgw max concurrent requests = 2048
and restarted both the mons and the radosgw servers, to no avail. The problem 
takes maybe 6 hours to start.
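
For reference, the same settings applied via the centralized config database
(assuming they should apply to every RGW daemon) would look roughly like:

  ceph config set client.rgw rgw_thread_pool_size 1024
  ceph config set client.rgw rgw_max_concurrent_requests 2048
  # restart the radosgw daemons afterwards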

No object versioning is in effect. Any ideas would be appreciated, thanks.
-Chris





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Attention: Documentation - mon states and names

2024-06-10 Thread Zac Dover
Joel,

Thank you for this message. This is a model of what in a perfect world user 
communication with upstream documentation could be.

I identify four things in your message that I can work on immediately:

1. leader/peon documentation improvement
2. Ceph command-presentation convention standardization
3. Standardizing the strings returned by various monitor-related commands
4. alpha notation -- not observed by Davidow in the field in cephadm clusters


Here are my thoughts, respectively:

1. I've already incorporated your suggestion into the documentation: 
https://github.com/ceph/ceph/pull/57957. Thank you for this.
2. For two years, I have kept notes about adding braces and brackets and angle 
brackets to the documentation to clear up these ambiguities. I will begin here.
3. This is far beyond my art, but a man's reach must exceed his grasp, or 
what's a heaven for? I'll ask if anyone upstream can do anything about this.
4. I will ask Radoslaw about this.

I have quoted your email in full below, and I have numbered the four parts of 
your email with the schema I've used here. I hope that it is formatted 
correctly.

Zac Dover
Upstream Documentation
Ceph Foundation




On Tuesday, June 11th, 2024 at 5:33 AM, Joel Davidow  wrote:

> 
> 
> As this is my first submission to the Ceph docs, I want to start by saying
> a big thank you to the Ceph team for all the efforts that have been put
> into improving the docs. The improvements already made have been many and
> have made it easier for me to operate Ceph.

I

> In
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#the-cluster-has-quorum-but-at-least-one-monitor-is-down,
> the section "What does it mean when a Monitor’s state is `leader` or
> `peon`?" discusses those two mon states only in the context of an issue
> that has a health detail entry.
> 
> However, the title of that section is not scoped to just that particular
> case and so could lead to confusion because during normal Ceph operations,
> there is a mon that has state leader and the other mons have state peon, as
> can be seen by the values of state returned by ceph tell <mon-name>
> mon_status.
> 
> To alleviate any such confusion, I recommend inserting the following before
> the existing text in that section: "During normal Ceph operations when the
> cluster is in the HEALTH_OK state, one monitor in the Ceph cluster will be
> in the leader state and the rest of the monitors will be in the peon state.
> The state of a given monitor can be determined by examining the value of
> the state key returned by ceph tell <mon-name> mon_status."
> 

II 

> Note the difference of convention in ceph command presentation. In
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#understanding-mon-status,
> mon.X uses X to represent the portion of the command to be replaced by the
> operator with a specific value. However, that may not be clear to all
> readers, some of whom may read that as a literal X. I recommend switching
> convention to something that makes visually explicit any portion of a
> command that an operator has to replace with a specific value. One such
> convention is to use <> as delimiters marking the portion of a command that
> 
> an operator has to replace with a specific value, minus the delimiters
> themselves. I'm sure there are other conventions that would accomplish the
> same goal and provide the <> convention as an example only.


III

> Also, the actual name of a mon is not clear due to the variety of mon name
> formats. The value of the NAME column returned by ceph orch ps
> --daemon-type mon and the return from ceph mon dump follow the format of
> mon.<name> whereas the value of name returned by ceph tell <mon-name>
> mon_status, the mon line returned by ceph -s, and the return from ceph mon
> stat follow the format of <name>. Unifying the return for the mon name
> 
> value of all those commands could be helpful in establishing the format of
> a mon name, though that is probably easier said than done.

IV

> In addition, in
> https://docs.ceph.com/en/latest/rados/configuration/mon-config-ref/#configuring-monitors,
> mon names are stated to use alpha notation by convention, but that
> convention is not followed by cephadm in the clusters that I've deployed.
> Cephadm also uses a minimal ceph.conf file with configs in the mon
> database. I recommend this section be updated to mention those changes. If
> there is a way to explain what a mon name is or how it is formatted,
> perhaps adding that to that same section would be good.
> 
> Thanks again for the on-going work to improve the Ceph docs!
> Joel
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-10 Thread Anthony D'Atri


>> To be clear, you don't need more nodes.  You can add RGWs to the ones you 
>> already have.  You have 12 OSD nodes - why not put an RGW on each?

> Might be an option, just don't like the idea to host multiple components on 
> nodes. But I'll consider it.

I really don't like mixing mon/mgr with other components because of coupled 
failure domains, and past experience with mon misbehavior, but many people do 
that.  YMMV.  With a bunch of RGWs, none of them needs to grow to consume 
significant resources, and it can be difficult to get a single RGW daemon to 
really use all of a dedicated node.

> 
 There are still serializations in the OSD and PG code.  You have 240 OSDs, 
 does your index pool have *at least* 256 PGs?
>>> Index as the data pool has 256 PG's.
>> To be clear, that means whatever.rgw.buckets.index ?
> 
> No, sorry my bad. .index is 32 and .data is 256.

Oh, yeah. Does `ceph osd df` show you at the far right like 4-5 PG replicas on 
each OSD?  You want (IMHO) to end up with 100-200, keeping each pool's pg_num 
to a power of 2 ideally.

Assuming all your pools span all OSDs, I suggest at a minimum 256 for .index 
and 8192 for .data, assuming you have only RGW pools, and would be inclined to 
try 512 / 8192.  Assuming your other minor pools are at 32, I'd bump .log and 
.non-ec to 128 or 256 as well.

If you have RBD or other pools colocated, those numbers would change.



^ the above assumes disabling the autoscaler
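
A sketch of what that would look like, assuming the default zone pool names:

  # stop the autoscaler from fighting the manual values
  ceph osd pool set default.rgw.buckets.index pg_autoscale_mode off
  ceph osd pool set default.rgw.buckets.data pg_autoscale_mode off
  # then raise pg_num per the suggestion above
  ceph osd pool set default.rgw.buckets.index pg_num 256
  ceph osd pool set default.rgw.buckets.data pg_num 8192
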
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?

2024-06-10 Thread Mark Lehrer
If they can do 1 TB/s with a single 16K write thread, that will be
quite impressive :D  Otherwise not really applicable.  Ceph scaling
has always been good.

More seriously, would you mind sending a link to this?


Thanks!

Mark

On Mon, Jun 10, 2024 at 12:01 PM Anthony D'Atri  wrote:
>
> Eh?  cf. Mark and Dan's 1TB/s presentation.
>
> On Jun 10, 2024, at 13:58, Mark Lehrer  wrote:
>
>  It
> seems like Ceph still hasn't adjusted to SSD performance.
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: About disk disk iops and ultil peak

2024-06-10 Thread Anthony D'Atri
What specifically are your OSD devices?
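
One way to answer that from the cluster itself (osd.268 is just the example
from your log):

  ceph osd metadata 268 | grep -E '"devices"|"device_ids"|"rotational"|"bluestore_bdev_type"'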

> On Jun 10, 2024, at 22:23, Phong Tran Thanh  wrote:
> 
> Hi ceph user!
> 
> I am encountering a problem with IOPS and disk utilization on my OSDs. Sometimes 
> my disk IOPS and utilization peak too high, which affects my 
> cluster and causes slow operations to appear in the logs.
> 
> 6/6/24 9:51:46 AM[WRN]Health check update: 0 slow ops, oldest one blocked for 
> 36 sec, osd.268 has slow ops (SLOW_OPS)
> 
> 6/6/24 9:51:37 AM[WRN]Health check update: 0 slow ops, oldest one blocked for 
> 31 sec, osd.268 has slow ops (SLOW_OPS)
> 
> 
> This is the config I used to reduce it, but it does not resolve my problem:
> 
> global  advanced  osd_mclock_profile                                 custom
> global  advanced  osd_mclock_scheduler_background_best_effort_lim    0.10
> global  advanced  osd_mclock_scheduler_background_best_effort_res    0.10
> global  advanced  osd_mclock_scheduler_background_best_effort_wgt    1
> global  advanced  osd_mclock_scheduler_background_recovery_lim       0.10
> global  advanced  osd_mclock_scheduler_background_recovery_res       0.10
> global  advanced  osd_mclock_scheduler_background_recovery_wgt       1
> global  advanced  osd_mclock_scheduler_client_lim                    0.40
> global  advanced  osd_mclock_scheduler_client_res                    0.40
> global  advanced  osd_mclock_scheduler_client_wgt                    4
> 
> Hope someone can help me
> 
> Thanks so much!
> --
> 
> Email: tranphong...@gmail.com 
> Skype: tranphong079
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io