[ceph-users] My new osd is not normally ?

2025-03-17 Thread Yunus Emre Sarıpınar
I have 6 SATA SSDs and 12 OSDs per server in a 24-server cluster. This
environment was created back when it was on the Nautilus version.

I switched this environment to the Octopus version 6 months ago. The
cluster is healthy.

I added 8 new servers with 6 SATA SSDs each and created 12 OSDs on them in
the same way.

I did not change the number of PGs in the environment; I have 8192 PGs.

The problem is that in my ceph -s output the remapped PGs and misplaced
objects are now gone, but there is a warning of 6 nearfull osd(s) and 4
pool(s) nearfull.

In the ceph df output I saw that my pools are also fuller than normal.

In the output of the ceph osd df tree command, I observed that the
occupancy percentages of the newly added OSDs are around 80%, while those
of the old OSDs are around 30%.

How do I equalize this situation?

Note: I am sharing the output of crushmap and osd df tree with you in the
attachment.
My new OSDs are numbered 288-384.
My new servers are ekuark13,14,15,16 and bkuark13,14,15,16.
 ceph osd df tree
ID   CLASS  WEIGHT     REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
 -1         670.70117         -  671 TiB  306 TiB  301 TiB  140 GiB  4.8 TiB  365 TiB  45.59  1.00    -          root default
-53         335.35001         -  335 TiB  154 TiB  151 TiB   72 GiB  2.4 TiB  182 TiB  45.82  1.00    -          datacenter B-datacenter
-25          20.95900         -   21 TiB  6.9 TiB  6.7 TiB  3.8 GiB  156 GiB   14 TiB  32.77  0.72    -          host kuarkb1
132    ssd    1.74699   1.00000  1.7 TiB  540 GiB  526 GiB  336 MiB   13 GiB  1.2 TiB  30.18  0.66   66      up  osd.132
133    ssd    1.74699   1.00000  1.7 TiB  603 GiB  589 GiB  316 MiB   13 GiB  1.2 TiB  33.71  0.74   64      up  osd.133
134    ssd    1.74699   1.00000  1.7 TiB  570 GiB  557 GiB  317 MiB   13 GiB  1.2 TiB  31.88  0.70   64      up  osd.134
135    ssd    1.74699   1.00000  1.7 TiB  611 GiB  599 GiB  288 MiB   12 GiB  1.2 TiB  34.15  0.75   61      up  osd.135
136    ssd    1.74699   1.00000  1.7 TiB  538 GiB  524 GiB  302 MiB   14 GiB  1.2 TiB  30.07  0.66   65      up  osd.136
137    ssd    1.74699   1.00000  1.7 TiB  552 GiB  539 GiB  369 MiB   13 GiB  1.2 TiB  30.89  0.68   61      up  osd.137
138    ssd    1.74699   1.00000  1.7 TiB  629 GiB  616 GiB  332 MiB   13 GiB  1.1 TiB  35.19  0.77   66      up  osd.138
139    ssd    1.74699   1.00000  1.7 TiB  597 GiB  584 GiB  332 MiB   13 GiB  1.2 TiB  33.40  0.73   65      up  osd.139
140    ssd    1.74699   1.00000  1.7 TiB  584 GiB  571 GiB  318 MiB   13 GiB  1.2 TiB  32.68  0.72   65      up  osd.140
141    ssd    1.74699   1.00000  1.7 TiB  697 GiB  685 GiB  364 MiB   12 GiB  1.1 TiB  39.00  0.86   63      up  osd.141
142    ssd    1.74699   1.00000  1.7 TiB  591 GiB  578 GiB  305 MiB   13 GiB  1.2 TiB  33.04  0.72   64      up  osd.142
143    ssd    1.74699   1.00000  1.7 TiB  520 GiB  507 GiB  363 MiB   13 GiB  1.2 TiB  29.07  0.64   64      up  osd.143
-43          20.95900         -   21 TiB  7.3 TiB  7.1 TiB  4.2 GiB  165 GiB   14 TiB  34.62  0.76    -          host kuarkb10
240    ssd    1.74699   1.00000  1.7 TiB  573 GiB  560 GiB  304 MiB   13 GiB  1.2 TiB  32.05  0.70   65      up  osd.240
241    ssd    1.74699   1.00000  1.7 TiB  640 GiB  627 GiB  351 MiB   13 GiB  1.1 TiB  35.78  0.78   62      up  osd.241
242    ssd    1.74699   1.00000  1.7 TiB  633 GiB  618 GiB  408 MiB   14 GiB  1.1 TiB  35.40  0.78   66      up  osd.242
243    ssd    1.74699   1.00000  1.7 TiB  655 GiB  640 GiB  411 MiB   15 GiB  1.1 TiB  36.63  0.80   68      up  osd.243
244    ssd    1.74699   1.00000  1.7 TiB  605 GiB  591 GiB  323 MiB   14 GiB  1.2 TiB  33.84  0.74   68      up  osd.244
245    ssd    1.74699   1.00000  1.7 TiB  599 GiB  585 GiB  357 MiB   14 GiB  1.2 TiB  33.51  0.74   71      up  osd.245
246    ssd    1.74699   1.00000  1.7 TiB  655 GiB  639 GiB  361 MiB   16 GiB  1.1 TiB  36.62  0.80   70      up  osd.246
247    ssd    1.74699   1.00000  1.7 TiB  659 GiB  646 GiB  335 MiB   12 GiB  1.1 TiB  36.84  0.81   62      up  osd.247
248    ssd    1.74699   1.00000  1.7 TiB  693 GiB  679 GiB  360 MiB   13 GiB  1.1 TiB  38.72  0.85   64      up  osd.248
249    ssd    1.74699   1.00000  1.7 TiB  555 GiB  541 GiB  341 MiB   13 GiB  1.2 TiB  31.01  0.68   62      up  osd.249
250    ssd    1.74699   1.00000  1.7 TiB  568 GiB  555 GiB  323 MiB   13 GiB  1.2 TiB  31.76  0.70   64      up  osd.250
251    ssd    1.74699   1.00000  1.7 TiB  594 GiB  580 GiB  409 MiB   14 GiB  1.2 TiB  33.23  0.73   70      up  osd.251
-45          20.95900         -   21 TiB  7.3 TiB  7

[ceph-users] Adding OSD nodes

2025-03-17 Thread Sinan Polat
Hello,

I am currently managing a Ceph cluster that consists of 3 racks, each with
4 OSD nodes. Each node contains 24 OSDs. I plan to add three new nodes, one
to each rack, to help alleviate the high OSD utilization.

The current highest OSD utilization is 85%. I am concerned about the
possibility of any OSD reaching the osd_full_ratio threshold during the
rebalancing process. This would cause the cluster to enter a read-only
state, which I want to avoid at all costs.

I am planning to execute the following commands:

ceph orch host add new-node-1
ceph orch host add new-node-2
ceph orch host add new-node-3

ceph osd crush move new-node-1 rack=rack-1
ceph osd crush move new-node-2 rack=rack-2
ceph osd crush move new-node-3 rack=rack-3

ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep 0.1

ceph orch apply osd --all-available-devices
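
As a sanity check around these steps, I was also thinking of something like
the following (just a sketch; whether to pause backfill with the flags below
is an open question for me):

ceph osd dump | grep -E 'full_ratio|backfillfull_ratio|nearfull_ratio'

ceph osd set norebalance
ceph osd set nobackfill
# add the hosts and deploy the OSDs, then:
ceph osd unset nobackfill
ceph osd unset norebalance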

Before proceeding, I would like to ask if the above steps are safe to
execute in a cluster with such high utilization. My main concern is whether
the rebalancing could cause any OSD to exceed the osd_full_ratio and result
in unexpected failures.

Any insights or advice on how to safely add these nodes without impacting
cluster stability would be greatly appreciated.

Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-ansible LARGE OMAP in RGW pool

2025-03-17 Thread Frédéric Nass
Hi Danish, 

Have you tried restarting all RGWs or at least the one running the sync thread? 
Unless the synchronization thread generates more errors than you have time to 
clean up, you should be able to trim shard 31's error log entirely by looping 
through all IDs (--marker). 
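
For instance, a rough loop like this might do it (only a sketch: check the
exact JSON that "radosgw-admin sync error list --shard-id=31" prints on your
version and adjust the jq path accordingly, and keep the last marker if you
want to preserve it):

for marker in $(radosgw-admin sync error list --shard-id=31 --max-entries=1000 | jq -r '.[].entries[].id'); do
    radosgw-admin sync error trim --shard-id=31 --marker="$marker"
done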

Also, have you tried GETing this object with an s3 client like s3cmd, aws cli 
or rclone? Just to make sure you can... You should be able to get it with a 
command like the one below: 

s3cmd --access_key=xxx --secret_key=xxx --no-encrypt --host=<rgw_host>:<port> \
get --bucket-location=':default-placement' \
s3://<bucket>/wp-content/plugins/plugins/yellow-pencil-visual-theme-customizer/images/cursor.png
 

Regards, 
Frédéric. 

- On 14 Mar 25, at 23:11, Danish Khan wrote: 

> Dear Frédéric,

> 1/ Identify the shards with the most sync errors log entries:

> I have identified that the shard causing the issue is shard 31, but almost
> all of the errors refer to a single object in one bucket. The object exists in the
> master zone, but I'm not sure why the replication site is unable to sync it.

> 2/ For each shard, list every sync error log entry along with their ids:

> radosgw-admin sync error list --shard-id=X

> The output of this command mostly shows the same shard and the same object (shard 31
> and object
> /plugins/plugins/yellow-pencil-visual-theme-customizer/images/cursor.png)

> 3/ Remove them **except the last one** with:

> radosgw-admin sync error trim --shard-id=X --marker=1_1682101321.201434_8669.1
> Trimming did remove a few entries from the error log, but there are still many
> error log entries for the same object which I am unable to trim.

> Now the trim command is executing successfully but not doing anything.

> I am still getting errors in the radosgw log about the object that is not syncing:

> 2025-03-15T03:05:48.060+0530 7fee2affd700 0
> RGW-SYNC:data:sync:shard[80]:entry[mbackup:70134e66-872072ee2d32.2205852207.1:48]:bucket_sync_sources[target=:[]):source_bucket=:[]):source_zone=872072ee2d32]:bucket[mbackup:70134e66-872072ee2d32.2205852207.1:48<-mod-backup:70134e66-872072ee2d32.2205852207.1:48]:full_sync[mod-backup:70134e66-872072ee2d32.2205852207.1:48]:entry[wp-content/plugins/plugins/yellow-pencil-visual-theme-customizer/images/cursor.png]:
> ERROR: failed to sync object:
> mbackup:70134e66-872072ee2d32.2205852207.1:48/wp-content/plugins/plugins/yellow-pencil-visual-theme-customizer/images/cursor.png

> I have been getting this error for approximately two months, and if I remember
> correctly, we have been getting the LARGE OMAP warning only since then.

> I will try to delete this object from the master zone on Monday and will see if
> this fixes the issue.

> Do you have any other suggestions on this, which I should consider?

> Regards,
> Danish

> On Thu, Mar 13, 2025 at 6:07 PM Frédéric Nass < [
> mailto:frederic.n...@univ-lorraine.fr | frederic.n...@univ-lorraine.fr ] >
> wrote:

>> Hi Danish,

>> Can you access this KB article [1]? A free developer account should allow you
>> to.

>> It pretty much describes what you're facing and suggests trimming the sync error
>> log of recovering shards. Actually, every log entry **except the last one**.

>> 1/ Identify the shards with the most sync errors log entries:

>> radosgw-admin sync error list --max-entries=100 | grep shard_id | sort 
>> -n |
>> uniq -c | sort -h

>> 2/ For each shard, list every sync error log entry along with their ids:

>> radosgw-admin sync error list --shard-id=X

>> 3/ Remove them **except the last one** with:

>> radosgw-admin sync error trim --shard-id=X 
>> --marker=1_1682101321.201434_8669.1

>> the --marker above being the log entry id.

>> Are the replication threads running on the same RGWs that S3 clients are 
>> using?

>> If so, using dedicated RGWs for the sync job might help you avoid 
>> non-recovering
>> shards in the future, as described in Matthew's post [2]

>> Regards,
>> Frédéric.

>> [1] [ https://access.redhat.com/solutions/7023912 |
>> https://access.redhat.com/solutions/7023912 ]
>> [2] [ https://www.spinics.net/lists/ceph-users/msg83988.html |
>> https://www.spinics.net/lists/ceph-users/msg83988.html ]

>> - On 12 Mar 25, at 11:15, Danish Khan [ mailto:danish52@gmail.com |
>> danish52@gmail.com ] wrote:

>> > Dear All,

>> > My Ceph cluster has been giving a large OMAP warning for approximately 2-3
>> > months. I tried a few things like:
>> > *Deep scrub of PGs*
>> > *Compact OSDs*
>> > *Trim log*
>> > But these didn't work.

>> > I guess the main issue is that 4 shards on the replication site have been
>> > stuck recovering for 2-3 months.

>> > Any suggestions are highly appreciated.

>> > Sync status:
>> > root@drhost1:~# radosgw-admin sync status
>> > realm e259e0a92 (object-storage)
>> > zonegroup 7a8606d2 (staas)
>> > zone c8022ad1 (repstaas)
>> > metadata sync syncing
>> > full sync: 0/64 shards
>> > incremental sync: 64/64 shards
>> > metadata is caught up with master
>> > data sync source: 2072ee2d32 (masterstaas)
>> > syncing
>> > full sync: 0/128

[ceph-users] Ceph Tentacle release - dev freeze timeline

2025-03-17 Thread Yaarit Hatuka
Hi everyone,

In previous discussions, the Ceph Steering Committee tentatively agreed on
a Tentacle dev freeze around the end of March or mid-April. We would like
to revisit this and check in with all the tech leads to assess the
readiness level and ensure we're aligned on the timeline.

Please provide your input on the current status and any potential concerns
you may have regarding the freeze.

Thanks, on behalf of the CSC.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Remove ... something

2025-03-17 Thread Albert Shih
Hi, 

I have two pools, one replica 3 and another with EC 4+2, for RBD. 

During a test accessing it from a KVM hypervisor, I created an image with the
metadata in the replica-3 pool and the data in the EC 4+2 pool. 

After I ended the test (where everything worked), I can't see anything in the
erasure42 pool. 

  rbd list --pool erasure42

returns an empty list. But 

  root@alopod:~# ceph df
  --- RAW STORAGE ---
  CLASS SIZEAVAIL USED  RAW USED  %RAW USED
  ssd147 TiB  146 TiB  564 GiB   564 GiB   0.38
  TOTAL  147 TiB  146 TiB  564 GiB   564 GiB   0.38
  
  --- POOLS ---
  POOL   ID  PGS   STORED  OBJECTS USED  %USED  MAX AVAIL
  .mgr11   23 MiB7   68 MiB  0 46 TiB
  erasure42  14   32  336 GiB   86.46k  504 GiB   0.36 92 TiB

it still uses 336 GiB.

How can I find where those 336 GiB are and delete the image (or whatever it is)? 

Regards
-- 
Albert SHIH 🦫 🐸
Observatoire de Paris
France
Heure locale/Local time:
lun. 17 mars 2025 08:24:32 CET
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Remove ... something

2025-03-17 Thread Eugen Block

Hi,


  rbd list --pool erasure42 returns an empty list.


that's because the rbd metadata is stored in a replicated pool. You  
need to look into your replicated pool to delete the test image. If  
you don't find it with:


rbd list --pool <replicated pool>

you can inspect the rbd_data prefix of the EC chunks and then find the  
corresponding rbd image:


# retrieve the rbd_data prefix:
rados -p openstack-ec ls | head -1
rbd_data.2.c142ac217af172.0120

# myrbdprefix="c142ac217af172"; for i in $(rbd -p images ls); do if [ $(rbd info --format json images/$i | jq -r '.block_name_prefix' | grep -c "$myrbdprefix") -eq 1 ]; then echo "your image is: " $(rbd info --format json images/$i | jq -r '.name'); break; fi; done

your image is:  volume-6cc34ae0-910f-475e-a097-7f374f4a8d57

To remove the image, you would need to issue that command for the  
replicated pool, not the EC pool:


rbd rm <replicated pool>/<image>

Then check your EC object list again.

Regards,
Eugen

Quoting Albert Shih:


Hi,

I have two pools, one replica 3 and another with EC 4+2, for RBD.

During a test accessing it from a KVM hypervisor, I created an image with the
metadata in the replica-3 pool and the data in the EC 4+2 pool.

After I ended the test (where everything worked), I can't see anything in the
erasure42 pool.

  rbd list --pool erasure42

returns an empty list. But

  root@alopod:~# ceph df
  --- RAW STORAGE ---
  CLASS SIZEAVAIL USED  RAW USED  %RAW USED
  ssd147 TiB  146 TiB  564 GiB   564 GiB   0.38
  TOTAL  147 TiB  146 TiB  564 GiB   564 GiB   0.38

  --- POOLS ---
  POOL   ID  PGS   STORED  OBJECTS USED  %USED  MAX AVAIL
  .mgr11   23 MiB7   68 MiB  0 46 TiB
  erasure42  14   32  336 GiB   86.46k  504 GiB   0.36 92 TiB

it still uses 336 GiB.

How can I find where those 336 GiB are and delete the image (or whatever it is)?

Regards
--
Albert SHIH 🦫 🐸
Observatoire de Paris
France
Heure locale/Local time:
lun. 17 mars 2025 08:24:32 CET
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: My new osd is not normally ?

2025-03-17 Thread Eugen Block

Hi,

is the balancer on? And which mode is enabled?

ceph balancer status

You should definitely split PGs; aim for 100-150 PGs per OSD at
first. I would inspect the PG sizes of the new OSDs:


ceph pg ls-by-osd 288 (column BYTES)

and compare them with older OSDs. If you have very large PG sizes,  
only a few of them could fill up an OSD quite quickly since your OSD  
sizes are "only" 1.7 TB.
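
For example, roughly along these lines (the pool name and target pg_num are
placeholders, not recommendations):

ceph balancer status
ceph balancer mode upmap
ceph balancer on

# compare a new OSD with an old one (column BYTES)
ceph pg ls-by-osd 288
ceph pg ls-by-osd 132

# then increase pg_num on the biggest pool, for example:
ceph osd pool set <pool> pg_num <target>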


Quoting Yunus Emre Sarıpınar:


I have 6 SATA SSDs and 12 OSDs per server in a 24-server cluster. This
environment was created back when it was on the Nautilus version.

I switched this environment to the Octopus version 6 months ago. The
cluster is healthy.

I added 8 new servers with 6 SATA SSDs each and created 12 OSDs on them in
the same way.

I did not change the number of PGs in the environment; I have 8192 PGs.

The problem is that in my ceph -s output the remapped PGs and misplaced
objects are now gone, but there is a warning of 6 nearfull osd(s) and 4
pool(s) nearfull.

In the ceph df output I saw that my pools are also fuller than normal.

In the output of the ceph osd df tree command, I observed that the
occupancy percentages of the newly added OSDs are around 80%, while those
of the old OSDs are around 30%.

How do I equalize this situation?

Note: I am sharing the output of crushmap and osd df tree with you in the
attachment.
My new OSDs are numbered 288-384.
My new servers are ekuark13,14,15,16 and bkuark13,14,15,16.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-osd/bluestore using page cache

2025-03-17 Thread Joshua Baergen
Hey Brian,

The setting you're looking for is bluefs_buffered_io. This is very
much a YMMV setting, so it's best to test with both modes, but I
usually recommend turning it off for all but omap-intensive workloads
(e.g. RGW index), because it tends to cause writes to be split up into
smaller pieces.

It's been a long time since I've thought about this setting, so even
though it might be toggleable live, I'm not sure how much I would
trust turning it off on a live OSD; we usually set it in the local conf
and then restart the OSDs to pick up the new setting.
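
For example, something like this (just a sketch; osd.12 is a made-up id, and
the restart command depends on whether you run cephadm or plain systemd units):

ceph config set osd bluefs_buffered_io false
# restart OSDs one at a time, letting the cluster settle in between:
ceph orch daemon restart osd.12      # cephadm deployments
# systemctl restart ceph-osd@12      # package/systemd deployments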

Josh

On Sun, Mar 16, 2025 at 2:38 PM Brian Marcotte  wrote:
>
> Some years ago when first switching to Bluestore, I could see that
> ceph-osd wasn't using the host page cache anymore. Some time later after
> a Ceph upgrade, I found that ceph-osd was now filling the page cache. I'm
> sorry I don't remember which upgrade that was. Currently I'm running
> pacific and reef clusters.
>
> Should ceph-osd (Bluestore) be going through the page cache? Can ceph-osd
> be configured to go direct?
>
> Thanks.
>
> --
> - Brian
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is it safe to set multiple OSD out across multiple failure domain?

2025-03-17 Thread Eugen Block
Before I replied, I wanted to renew my confidence and do a small test  
in a lab environment. I also created a k4m2 pool with host as the  
failure domain, started to write data chunks into it in a while loop,  
and then marked three of the OSDs "out" simultaneously. After a few  
seconds of repeering, backfill kicks in and I/O to the pool continues  
without interruption. So yeah, I also think it's safe to mark them out  
at the same time.
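
Roughly what the test looked like (profile/pool names, object size and OSD
ids are made up for this sketch):

dd if=/dev/urandom of=/tmp/4M.bin bs=4M count=1
ceph osd erasure-code-profile set k4m2profile k=4 m=2 crush-failure-domain=host
ceph osd pool create ectest 32 32 erasure k4m2profile

# write loop:
while true; do rados -p ectest put obj-$(date +%s%N) /tmp/4M.bin; done

# in another shell, mark three OSDs in different hosts out:
ceph osd out 3 17 42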


Quoting Anthony D'Atri:

It’s context-dependent; it matters where the OSDs are. If they’re all in the
same failure domain it’s safe, if you have capacity to recover into.
Across failure domains, usually not.



The reason I asked is that several months back I got an off-list reply from
a frequent poster on this list saying that setting 3 OSDs out at the same
time could give me incomplete PGs as a result.

But at least now I have 2 saying it's OK and 1 saying it's not, so thank you
Alexander and Janne.

--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Remove ... something

2025-03-17 Thread Albert Shih
On 17/03/2025 at 13:46:58+, Eugen Block wrote:
Hi, 

> 
> that's because the rbd metadata is stored in a replicated pool. You need to
> look into your replicated pool to delete the test image. If you don't find
> it with:

OK, I know that. But I didn't know I should remove the image metadata from
the replica pool, so I already deleted the replicated pool ;-)

> 
> rbd list --pool <replicated pool>
> 
> you can inspect the rbd_data prefix of the EC chunks and then find the
> corresponding rbd image:
> 
> # retrieve the rbd_data prefix:
> rados -p openstack-ec ls | head -1
> rbd_data.2.c142ac217af172.0120
> 
> # myrbdprefix="c142ac217af172"; for i in $(rbd -p images ls); do if [ $(rbd
> info --format json images/$i | jq -r '.block_name_prefix' | grep -c
> "$myrbdprefix") -eq 1 ]; then echo "your image is: " $(rbd info --format
> json images/$i | jq -r '.name'); break; fi; done
> your image is:  volume-6cc34ae0-910f-475e-a097-7f374f4a8d57
> 
> To remove the image, you would need to issue that command for the replicated
> pool, not the EC pool:
> 
> rbd rm <replicated pool>/<image>

OK, big thanks. I will (try to) keep that in my little brain.

Now I will just remove the erasure pool also ;-)

Regards.

JAS
-- 
Albert SHIH 🦫 🐸
Observatoire de Paris
France
Heure locale/Local time:
lun. 17 mars 2025 16:11:41 CET
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Experience with 100G Ceph in Proxmox

2025-03-17 Thread Martin Konold
600 MB/s is rather slow. With 10 GBit/s I regularly measure 1.28 GB/s of
bandwidth, even with a single connection.

The issue is latency, not bandwidth! The latency is bound by the CPU serving
the OSDs when decent NVMe storage is used. In an optimal world the network
latency would be the limiting factor, though.

The good news is that a CPU-bound OSD implementation is a solvable issue.

Regards
--
martin

On 12.03.2025 at 15:14, Alex Gorbachev wrote:

How about testing the actual network throughput with iperf? Even today
there are speed/duplex mismatches on switch ports. And what everyone else
said about saturation etc.

We get, at absolute worst, 600 MB/s on a 10G connection.
--
Alex Gorbachev
https://alextelescope.blogspot.com


On Tue, Mar 11, 2025 at 6:57 AM Giovanna Ratini <
giovanna.rat...@uni-konstanz.de> wrote:

> Hello everyone,
>
> We are running Ceph in Proxmox with a 10G network.
>
> Unfortunately, we are experiencing very low read rates. I will try to
> implement the solution recommended in the Proxmox forum. However, even
> 80 MB per second with an NVMe drive is quite disappointing.
> Forum link
> <https://forum.proxmox.com/threads/slow-performance-on-ceph-per-vm.151223/#post-685070>
>
> For this reason, we are considering purchasing a 100G switch for our
> servers.
>
> This raises some questions:
> Should I still use separate networks for VMs and Ceph with 100G?
> I have read that running Ceph on bridged connections is not recommended.
>
> Does anyone have experience with 100G Ceph in Proxmox?
>
> Is upgrading to 100G a good idea, or will I have 60G sitting idle?
>
> Thanks in advance!
>
> Gio
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-osd/bluestore using page cache

2025-03-17 Thread Janne Johansson
On Mon, 17 Mar 2025 at 14:48, Joshua Baergen wrote:
> Hey Brian,
>
> The setting you're looking for is bluefs_buffered_io. This is very
> much a YMMV setting, so it's best to test with both modes, but I
> usually recommend turning it off for all but omap-intensive workloads
> (e.g. RGW index) due to it causing writes to tend to be split up into
> smaller pieces.

On the other hand, having to set bluestore cache sizes for each OSD
individually is kind of weird in 2025. If I initially had 8 OSDs in a
box and then two drives died, I would want the computer to let the
remaining 6 OSDs use the extra available cache memory if it can, and
not have to edit configs for the remaining 6, then possibly once more
if I ever replace the two lost OSDs. At the same time, after losing x
OSDs, I would find it wasteful not to use memory I have bought in order
to have good caches for my OSDs just because it is a static setting.
Even a single "use 110G of RAM as you see fit, split between the current
OSDs" for a 128G machine would be better than a per-OSD
bluestore_cache_size = xyz setting.

> On Sun, Mar 16, 2025 at 2:38 PM Brian Marcotte  wrote:
> >
> > Some years ago when first switching to Bluestore, I could see that
> > ceph-osd wasn't using the host page cache anymore. Some time later after
> > a Ceph upgrade, I found that ceph-osd was now filling the page cache. I'm
> > sorry I don't remember which upgrade that was. Currently I'm running
> > pacific and reef clusters.
> >
> > Should ceph-osd (Bluestore) be going through the page cache? Can ceph-osd
> > be configured to go direct?



-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io