[ceph-users] Re: Snaptriming speed degrade with pg increase

2024-11-28 Thread Bandelow, Gunnar
Dear Istvan,


The first thing that stands out:

Ubuntu 20.04  (EOL in April 2025)
and
Ceph v15 Octopus (EOL since 2022)

Is there a possibility to upgrade these things?


Best regards
Gunnar


--- Original Message ---
Subject: [ceph-users] Snaptriming speed degrade with pg increase
From: "Szabo, Istvan (Agoda)" 
To: "Ceph Users" 
Date: 29-11-2024 3:30





Hi,

When we scale the placement groups on a pool located in an all-NVMe
cluster, the snap trimming speed degrades a lot.
Currently we are running with these values so as not to degrade client
ops while still making some progress on snap trimming, but it is
terrible. (Octopus 15.2.17 on Ubuntu 20.04)

-osd_max_trimming_pgs=2
--osd_snap_trim_sleep=0.1
--osd_pg_max_concurrent_snap_trims=2

We have a big pool which used to have 128 PGs, and the snap trimming
took around 45-60 minutes.
Because it is impossible to do maintenance on the cluster with 600 GB
PG sizes (they can easily max out the cluster, which happened to us),
we increased to 1024 PGs and the snap trimming duration grew to 3.5 hours.

Is there any good solution that we are missing to fix this?

On the hardware level I've changed the server profile to tune some NUMA
settings, but that doesn't seem to have helped either.

Thank you
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: down OSDs, Bluestore out of space, unable to restart

2024-11-28 Thread Frédéric Nass
Hi Igor, 

Thank you for taking the time to explain the fragmentation issue. I had 
figured out most of it by reading the tracker and the PR, but it's 
always clearer when you explain it. 

My question was more about why BlueFS would still fail to allocate 4k chunks 
after being allowed to do so by https://tracker.ceph.com/issues/53466 (John's 
case with v17.2.6, actually). 

Is BlueFS aware of the remaining space, and does it maybe use some sort of 
reserved blocks/chunks like other filesystems do to handle full/near-full 
situations? If so, then it should never crash, right? 
Just like other filesystems don't crash, drive firmware doesn't crash, etc. 

Thanks, 
Frédéric. 

- On 28 Nov 24, at 12:52, Igor Fedotov wrote: 

> Hi Frederic,

> here is an overview of the case where BlueFS is unable to allocate more space
> at the main/shared device albeit free space is available. Below I'm talking
> about the state of things before the fix for
> https://tracker.ceph.com/issues/53466 .

> First of all - BlueFS's minimal allocation unit for the shared device was
> bluefs_shared_alloc_size (=64K by default). Which means that it was unable to
> use e.g. 2x32K or 16x4K chunks when it needed an additional 64K bytes.

> Secondly - sometimes RocksDB performs recovery - and some other maintenance
> tasks that require space allocation - on startup. Which evidently triggers
> allocation of N*64K chunks from shared device.

> Thirdly - a while ago we switched to 4K chunk allocations for user data
> (please don't confuse this with BlueFS allocation). Which potentially could
> result in a specific free space fragmentation pattern where there is a limited
> (or even empty) set of long (>=64K) free chunks, while technically still
> having enough free space available.
> E.g. the free extent list could look like (off~len, both in hex):

> 0x0~1000, 0x2000~1000, 0x4000~2000, 0x1~4000, 0x2000~1000, etc...

> In that case the original BlueFS allocator implementation was unable to locate
> more free space, which in turn was effectively breaking both RocksDB and OSD
> boot up.

> One should realize that the above free space fragmentation depends on a bunch 
> of
> factors, none of which is absolutely dominating:

> 1. how user write/remove objects

> 2. how allocator seeks for free space

> 3. how much free space is available

> So we don't have full control on 1. and 3. and have limited opportunities in
> tuning 2.

> Small device sizes and high space utilization severely increase the
> probability of the issue happening, but theoretically even a large disk with
> mediocre utilization could reach a "bad" state over time if used (by both
> clients and the allocator) "improperly/inefficiently". Hence tuning thresholds
> can reduce the probability of the issue occurring (at the cost of additional
> spare space waste), but it isn't a silver bullet.

> https://tracker.ceph.com/issues/53466 fixes (or rather works around) the issue
> by allowing BlueFS to use 4K extents. Plus we're working on improving the
> resulting free space fragmentation on aged OSDs by improving allocation
> strategies, e.g. see:

> - https://github.com/ceph/ceph/pull/52489

> - https://github.com/ceph/ceph/pull/57789

> - https://github.com/ceph/ceph/pull/60870

> Hope this is helpful.

> Thanks,

> Igor

> On 27.11.2024 16:31, Frédéric Nass wrote:

>> - On 27 Nov 24, at 10:19, Igor Fedotov  wrote:

>>> Hi Istvan,

>>> first of all, let me make a remark that we don't know why BlueStore is out of
>>> space on John's cluster.

>>> It's just an unconfirmed hypothesis from Frederic that it's caused by high
>>> fragmentation and BlueFS's inability to use chunks smaller than 64K. In fact
>>> the fragmentation issue has been fixed since 17.2.6, so I doubt that's the
>>> problem.
>> Hi Igor,

>> I wasn't actually pointing to this as the root cause (since John's already
>> using 17.2.6), more to explain the context, but while we're at it...

>> Could you elaborate on the circumstances that could prevent BlueFS from being
>> able to allocate chunks in a collocated OSD scenario? Does this ability depend
>> on near/full thresholds being reached or not? If so, then increasing these
>> thresholds by 1-2% may help avoid the crash, no?

>> Also, if BlueFS is aware of these thresholds, shouldn't an OSD be able to
>> start and live without crashing even when it's full, and simply (maybe easier
>> said than done...) refuse any I/Os? Sorry for the noob questions. :-)

>> This topic is particularly important when using NVMe drives as 'collocated'
>> OSDs, especially since they often host critical metadata pools (cephfs, rgw
>> index).

>> Cheers,
>> Frédéric.

>>> Thanks,

>>> Igor
>>> On 27.11.2024 4:01, Szabo, Istvan (Agoda) wrote:

 Hi,

 This issue sho

[ceph-users] Re: classes crush rules new cluster

2024-11-28 Thread Eugen Block
You could decompile the crushmap, add a dummy OSD (with a non-existing  
ID) with your new device class and add a rule, then compile it and  
inject. Here's an excerpt from a lab cluster with 4 OSDs (0..3),  
adding a fifth non-existing:


device 4 osd.4 class test

rule testrule {
id 6
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class test
step chooseleaf indep 0 type host
step emit
}

Note that testing this rule with crushtool won't work here since the  
fake OSD isn't assigned to a host.
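
The decompile/compile round trip itself would look roughly like this (file  
names are just placeholders):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# add the dummy device and the rule above to crushmap.txt, then:
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new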


But what's the point in having a rule without the corresponding  
devices? You won't be able to create a pool with that rule anyway  
until the OSDs are present.


Quoting Marc:

It looks like it is not possible to create crush rules when you  
don't have hard drives active in this class.


I am testing with the new Squid and did not add SSDs yet, even though I  
added the class like this.


ceph osd crush class create ssd

I can't execute this
ceph osd crush rule create-replicated replicated_ssd default host ssd

Is there any way around this?



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Snaptriming speed degrade with pg increase

2024-11-28 Thread Szabo, Istvan (Agoda)
Let's say yes if that is the issue.



Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---





From: Bandelow, Gunnar
Sent: Friday, November 29, 2024 1:47 PM
To: Szabo, Istvan (Agoda); Ceph Users
Subject: Re: [ceph-users] Snaptriming speed degrade with pg increase

Dear Istvan,

The first thing that stands out:

Ubuntu 20.04  (EOL in April 2025)
and
Ceph v15 Octopus (EOL since 2022)

Is there a possibility to upgrade these things?

Best regards
Gunnar


--- Original Message ---
Subject: [ceph-users] Snaptriming speed degrade with pg increase
From: "Szabo, Istvan (Agoda)" <istvan.sz...@agoda.com>
To: "Ceph Users" <ceph-users@ceph.io>
Date: 29-11-2024 3:30



Hi,

When we scale the placement groups on a pool located in an all-NVMe cluster, the 
snap trimming speed degrades a lot.
Currently we are running with these values so as not to degrade client ops while 
still making some progress on snap trimming, but it is terrible. (Octopus 15.2.17 
on Ubuntu 20.04)

-osd_max_trimming_pgs=2
--osd_snap_trim_sleep=0.1
--osd_pg_max_concurrent_snap_trims=2

We have a big pool which used to have 128 PGs, and the snap trimming took around 
45-60 minutes.
Because it is impossible to do maintenance on the cluster with 600 GB PG sizes 
(they can easily max out the cluster, which happened to us), we increased to 1024 
PGs and the snap trimming duration grew to 3.5 hours.

Is there any good solution that we are missing to fix this?

On the hardware level I've changed the server profile to tune some NUMA settings, 
but that doesn't seem to have helped either.

Thank you
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] nfs-ganesha 5 changes

2024-11-28 Thread Marc
In my old environment I have a simple nfs-ganesha export like this, which is 
sufficient and mounts. 

EXPORT {
Export_Id = 200;
Path = /backup;
Pseudo = /backup;
FSAL { Name = CEPH; Filesystem = ""; User_Id = "cephfs..bakup"; 
Secret_Access_Key = "x=="; }
Disable_ACL = FALSE;
CLIENT { Clients = 192.168.11.200; access_type = "RW"; }
CLIENT { Clients = *; Access_Type = NONE; }
}

In the new ganesha 5 I am getting these errors. I don't really get why it wants 
to create a pool:

rados_kv_connect :CLIENT ID :EVENT :Failed to create pool: -34
rados_ng_init :CLIENT ID :EVENT :Failed to connect to cluster: -34
main :NFS STARTUP :CRIT :Recovery backend initialization failed!

cephfs kernel mount with this userid is ok. User only has access to this dir.

Anyone an idea what config I need to update?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: nfs-ganesha 5 changes

2024-11-28 Thread Marc
> 
> In my old environment I have a simple nfs-ganesha export like this, which
> is sufficient and mounts.
> 
> EXPORT {
> Export_Id = 200;
> Path = /backup;
> Pseudo = /backup;
> FSAL { Name = CEPH; Filesystem = ""; User_Id =
> "cephfs..bakup"; Secret_Access_Key = "x=="; }
> Disable_ACL = FALSE;
> CLIENT { Clients = 192.168.11.200; access_type = "RW"; }
> CLIENT { Clients = *; Access_Type = NONE; }
> }
> 
> In the new ganesha 5 I am getting these errors. I don't really get why it
> wants to create a pool:
> 
> rados_kv_connect :CLIENT ID :EVENT :Failed to create pool: -34
> rados_ng_init :CLIENT ID :EVENT :Failed to connect to cluster: -34
> main :NFS STARTUP :CRIT :Recovery backend initialization failed!
> 
> cephfs kernel mount with this userid is ok. User only has access to this
> dir.
> 
> Anyone an idea what config I need to update?

I missed this, will check later what it is.
#RecoveryBackend = rados_ng;
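
For reference, the knobs involved look roughly like this (block and option names
as documented for nfs-ganesha; the pool and namespace below are placeholders, and
the rados backend needs a pool the cephx user can actually write to):

NFSv4 {
    # keep client recovery state on the local filesystem instead of RADOS
    RecoveryBackend = fs;
}
# or keep rados_ng and point it at an existing, accessible pool:
RADOS_KV {
    UserId = "cephfs..bakup";
    pool = "nfs-ganesha";
    namespace = "grace";
}
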
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: down OSDs, Bluestore out of space, unable to restart

2024-11-28 Thread Igor Fedotov

Hi Frederic,

here is an overview of the case where BlueFS is unable to allocate more 
space at the main/shared device albeit free space is available. Below I'm 
talking about the state of things before the fix for 
https://tracker.ceph.com/issues/53466.


First of all - BlueFS's minimal allocation unit for the shared device was 
bluefs_shared_alloc_size (=64K by default). Which means that it was 
unable to use e.g. 2x32K or 16x4K chunks when it needed an additional 
64K bytes.


Secondly - sometimes RocksDB performs recovery - and some other 
maintenance tasks that require space allocation - on startup. Which 
evidently triggers allocation of N*64K chunks from shared device.


Thirdly - a while ago we switched to 4K chunk allocations for user data 
(please don't confuse this with BlueFS allocation). Which potentially could 
result in a specific free space fragmentation pattern where there is a 
limited (or even empty) set of long (>=64K) free chunks, while technically 
still having enough free space available. E.g. the free extent list 
could look like (off~len, both in hex):


0x0~1000, 0x2000~1000, 0x4000~2000, 0x1~4000, 0x2000~1000, etc...

In that case the original BlueFS allocator implementation was unable to 
locate more free space, which in turn was effectively breaking both 
RocksDB and OSD boot up.


One should realize that the above free space fragmentation depends on a 
bunch of factors, none of which is absolutely dominating:


1. how user write/remove objects

2. how allocator seeks for free space

3. how much free space is available

So we don't have full control on 1. and 3. and have limited 
opportunities in tuning 2.


Small device sizes and high space utilization severely increase the 
probability of the issue happening, but theoretically even a large disk 
with mediocre utilization could reach a "bad" state over time if used (by 
both clients and the allocator) "improperly/inefficiently". Hence tuning 
thresholds can reduce the probability of the issue occurring (at the cost 
of additional spare space waste), but it isn't a silver bullet.
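
For those who want to check their own OSDs, the free-space picture can be 
inspected via the OSD admin socket, roughly like this (a sketch; assuming 
these asok commands are available on your release):

ceph daemon osd.0 bluestore allocator score block   # fragmentation rating, 0 = none .. 1 = severe
ceph daemon osd.0 bluestore allocator dump block    # raw list of free extents
ceph daemon osd.0 bluefs stats                      # BlueFS space usage per device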


https://tracker.ceph.com/issues/53466 fixes (or rather works around) the 
issue by allowing BlueFS to use 4K extents. Plus we're working on improving 
the resulting free space fragmentation on aged OSDs by improving 
allocation strategies, e.g. see:


- https://github.com/ceph/ceph/pull/52489

- https://github.com/ceph/ceph/pull/57789

- https://github.com/ceph/ceph/pull/60870

Hope this is helpful.

Thanks,

Igor


On 27.11.2024 16:31, Frédéric Nass wrote:



- On 27 Nov 24, at 10:19, Igor Fedotov wrote:


Hi Istvan,

first of all, let me make a remark that we don't know why BlueStore
is out of space on John's cluster.

It's just an unconfirmed hypothesis from Frederic that it's caused
by high fragmentation and BlueFS's inability to use chunks
smaller than 64K. In fact the fragmentation issue has been fixed since
17.2.6, so I doubt that's the problem.

Hi Igor,

I wasn't actually pointing to this as the root cause (since John's 
already using 17.2.6), more to explain the context, but while we're 
at it...


Could you elaborate on the circumstances that could prevent BlueFS from 
being able to allocate chunks in a collocated OSD scenario? Does this 
ability depend on near/full thresholds being reached or not? If so, 
then increasing these thresholds by 1-2% may help avoid the crash, no?


Also, if BlueFS is aware of these thresholds, shouldn't an OSD be 
able to start and live without crashing even when it's full, and simply 
(maybe easier said than done...) refuse any I/Os? Sorry for the noob 
questions. :-)


This topic is particularly important when using NVMe drives as 
'collocated' OSDs, especially since they often host critical metadata 
pools (cephfs, rgw index).


Cheers,
Frédéric.

Thanks,

Igor

On 27.11.2024 4:01, Szabo, Istvan (Agoda) wrote:

Hi,

This issue should not happen anymore from 17.2.8 on, am I correct?
In this version all the fragmentation issues should be gone,
even with collocated wal+db+block.


*From:* Frédéric Nass 

*Sent:* Wednesday, November 27, 2024 6:12:46 AM
*To:* John Jasen  
*Cc:* Igor Fedotov 
; ceph-users
 
*Subject:* [ceph-users] Re: down OSDs, Bluestore out of space,
unable to restart



Hi John,

That's about right. Two potential solutions exist:
1. Adding a new drive to the server and sharing it for RocksDB
metadata, or
2. Repurposing one of the failed OSDs for the same purpose (if
adding more drives isn't feasible).

Igor's post #6 [1] expl

[ceph-users] Re: Migrated to cephadm, rgw logs to file even when rgw_ops_log_rados is true

2024-11-28 Thread Paul JURCO
Hi Eugen,
Yes, I have played around with some of them, the most obvious ones.
They are all false by default:
:~# ceph-conf -D | grep syslog
clog_to_syslog = false
clog_to_syslog_facility = default=daemon audit=local0
clog_to_syslog_level = info
err_to_syslog = false
log_to_syslog = false
mon_cluster_log_to_syslog = default=false
mon_cluster_log_to_syslog_facility = daemon
mon_cluster_log_to_syslog_level = info

Logically, since turning ops logs off applies instantly, I assume these
configs from above are respected as intended. Still, I have turned
log_to_syslog on and off to no avail.
Interestingly, the ops log file on disk is held open by the radosgw process in
the container.
So, inside a container, the radosgw service seems to ignore rgw_ops_log_rados
when it is set to true.
Other configs like rgw_max_concurrent_requests and rgw_enable_ops_log are
working as expected.
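
The settings I'm looking at are roughly these (a sketch; rgw_ops_log_file_path
may not exist on every release):

ceph config set client.rgw rgw_enable_ops_log true
ceph config set client.rgw rgw_ops_log_rados true
ceph config set client.rgw rgw_ops_log_file_path ""   # empty should disable the file sink, if the option exists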
-- 
Paul


On Thu, Nov 28, 2024 at 10:30 AM Eugen Block  wrote:

> I haven't played with rgw_ops yet, but have you looked at the various
> syslog config options?
>
> ceph config ls | grep syslog
> log_to_syslog
> err_to_syslog
> clog_to_syslog
> clog_to_syslog_level
> clog_to_syslog_facility
> mon_cluster_log_to_syslog
> mon_cluster_log_to_syslog_level
> mon_cluster_log_to_syslog_facility
>
> Maybe one of them is what you're looking for.
>
> Quoting Paul JURCO:
>
> > Hi!
> > Currently I have limited the output of the rgw log to syslog from rsyslog
> > (as suggested by Anthony), and limited docker logs from daemon.json.
> > I still get ops logs written to both logs pool and ops log file
> > (ops-log-ceph-client.rgw.hostname.log).
> >
> > How to stop logging ops log on rgw disk and keep logs on logs pool?
> > Current config:
> > global    advanced    debug_rgw             0/0
> > global    advanced    rgw_enable_ops_log    true
> > global    advanced    rgw_ops_log_rados     true
> >
> > Thank you!
> >
> > --
> > Paul Jurco
> >
> >
> > On Fri, Nov 22, 2024 at 6:11 PM Paul JURCO  wrote:
> >
> >> Hi,
> >> we recently migrated to cephadm from ceph-deploy a 18.2.2 ceph cluster
> >> (Ubuntu with docker).
> >> RGWs are separate vms.
> >> We noticed syslog increased a lot due to rgw's access logs sent to it.
> >> And because we use to log ops, a huge ops log file on
> >> /var/log/ceph/cluster-id/ops-log-ceph-client.rgw.hostname-here.log.
> >>
> >> While "rgw_ops_log_rados" is "true", ops logs go to both the file and the
> >> rados pool for logs.
> >> If false it doesn't log anything, as expected.
> >> How do I stop dockerized rgws from logging to syslog and to a file on disk,
> >> but keep the ops log in the logs pool?
> >>
> >> Config is:
> >> global    basic       log_to_journald       false
> >> global    advanced    rgw_enable_ops_log    false
> >> global    advanced    rgw_ops_log_rados     true
> >>
> >> A few hours after enabling it back, and after a massive cleanup, it
> >> does log ops, but only to files.
> >> How do I get ops logs in the rados pool and the access log in a file on
> >> disk, but not in syslog?
> >> I have added this to daemon.json to limit the access logs accumulating in the
> >> /var/log/docker/containers/rand/rand/json.log file:
> >>
> >> {
> >>   "log-driver": "local",
> >>   "log-opts": {
> >> "max-size": "512m",
> >> "max-file": "3"
> >>   }
> >> }
> >>
> >>
> >> Thank you!
> >> Paul
> >>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] new cluser ceph osd perf = 0

2024-11-28 Thread Marc



My ceph osd perf values are all 0. Do I need to enable a module for this 
(osd_perf_query)? Where should I find this in the manuals? Or do I just need to wait?


[@ target]# ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
 25   0  0
 24   0  0
 23   0  0
 22   0  0
 21   0  0
 20   0  0
 19   0  0
 18   0  0
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Squid: deep scrub issues

2024-11-28 Thread Nmz
Hello,
 
Can you try to set 'ceph config set osd osd_mclock_profile high_recovery_ops' 
and see how it will affect you?
 
For some PGs the deep scrub ran for about 20 hours for me. After I gave it more 
priority, 1-2 hours were enough to finish.
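
To revert afterwards (if I remember correctly, the default profile on recent 
releases is balanced):

ceph config set osd osd_mclock_profile balanced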
 
  


- Original Message -
From: Laimis Juzeliūnas 
To: ceph-users@ceph.io
Date: Wednesday, November 27, 2024, 12:36:41 AM
Subject: [ceph-users] Squid: deep scrub issues

> Hello Ceph community,

> Wanted to highlight one observation and hear from any Squid users having similar 
> experiences.
> Since upgrading to 19.2.0 (from 18.4.0) we have observed that pg deep 
> scrubbing times have drastically increased. Some pgs take 2-5 days to 
> complete deep scrubbing while others increase to 20+ days. This causes the 
> deep scrubbing queue to fill up and the cluster almost constantly has 'pgs 
> not deep-scrubbed in time' alerts.
> We have on average 67 PGs/OSD; running on 15 TB HDDs this results in 
> 200 GB-ish PGs. While fairly large, these PGs did not cause such an increase in 
> deep scrub times on Reef.

> "ceph pg dump | grep 'deep scrubbing for'" will always have a few entries of 
> quite morbid scrubs like the following:
> 7.3e    121289  0  0  0  0  225333247207  0  0  127  0  127  active+clean+scrubbing+deep  2024-11-13T09:37:42.549418+  490179'5220664  490179:23902923  [268,27,122]  268  [268,27,122]  268  483850'5203141  2024-11-02T11:33:57.835277+  472713'5197481  2024-10-11T04:30:00.639763+  0  21873  deep scrubbing for 1169147s
> 34.247  62618   0  0  0  0  179797964677  0  0  101  50  101  active+clean+scrubbing+deep  2024-11-05T06:27:52.288785+  490179'22729571  490179:80672442  [34,97,25]  34  [34,97,25]  34  481331'22436869  2024-10-23T16:06:50.092439+  471395'22289914  2024-10-07T19:29:26.115047+  0  204864  deep scrubbing for 1871733s

> Not pointing any fingers, but the Squid release had "better scrub scheduling" 
> announced. 
> This is not scheduling directly, but maybe that change had some impact 
> causing such behaviour?

> Scrubbing configurations:
> ceph config get osd | grep scrub
> global        advanced  osd_deep_scrub_interval                          2678400.00
> global        advanced  osd_deep_scrub_large_omap_object_key_threshold   50
> global        advanced  osd_max_scrubs                                   5
> global        advanced  osd_scrub_auto_repair                            true
> global        advanced  osd_scrub_max_interval                           2678400.00
> global        advanced  osd_scrub_min_interval                           172800.00


> Cluster details (backfilling expected and caused by some manual reweights):
>   cluster:
>     id:     96df99f6-fc1a-11ea-90a4-6cb3113cb732
>     health: HEALTH_WARN
>             24 pgs not deep-scrubbed in time

>   services:
>     mon:        5 daemons, quorum 
> ceph-node004,ceph-node003,ceph-node001,ceph-node005,ceph-node002 (age 4d)
>     mgr:        ceph-node001.hgythj(active, since 11d), standbys: 
> ceph-node002.jphtvg
>     mds:        20/20 daemons up, 12 standby
>     osd:        384 osds: 384 up (since 25h), 384 in (since 5d); 5 remapped 
> pgs
>     rbd-mirror: 2 daemons active (2 hosts)
>     rgw:        64 daemons active (32 hosts, 1 zones)

>   data:
>     volumes: 1/1 healthy
>     pools:   14 pools, 8681 pgs
>     objects: 758.42M objects, 1.5 PiB
>     usage:   4.6 PiB used, 1.1 PiB / 5.7 PiB avail
>     pgs:     275177/2275254543 objects misplaced (0.012%)
>              6807 active+clean
>              989  active+clean+scrubbing+deep
>              880  active+clean+scrubbing
>              5    active+remapped+backfilling

>   io:
>     client:   37 MiB/s rd, 59 MiB/s wr, 1.72k op/s rd, 439 op/s wr
>     recovery: 70 MiB/s, 38 objects/s


> One thread of other users experiencing same 19.2.0 prolonged deep scrub 
> issues: 
> https://www.reddit.com/r/ceph/comments/1guynak/strange_issue_where_scrubdeep_scrub_never_finishes/
>  
> Any hints or help would be greatly appreciated!


> Thanks in advance,
> Laimis J. 
> laimis.juzeliu...@oxylabs.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] classes crush rules new cluster

2024-11-28 Thread Marc
It looks like it is not possible to create crush rules when you don't have 
hard drives active in this class.

I am testing with the new Squid and did not add SSDs yet, even though I added 
the class like this.

ceph osd crush class create ssd

I can't execute this
ceph osd crush rule create-replicated replicated_ssd default host ssd

Is there any way around this?



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Snaptriming speed degrade with pg increase

2024-11-28 Thread Szabo, Istvan (Agoda)
Hi,

When we scale the placement groups on a pool located in an all-NVMe cluster, the 
snap trimming speed degrades a lot.
Currently we are running with these values so as not to degrade client ops while 
still making some progress on snap trimming, but it is terrible. (Octopus 15.2.17 
on Ubuntu 20.04)

-osd_max_trimming_pgs=2
--osd_snap_trim_sleep=0.1
--osd_pg_max_concurrent_snap_trims=2
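
(For reference, applied at runtime roughly like this; a sketch, assuming ceph 
config set on Octopus:)

ceph config set osd osd_max_trimming_pgs 2
ceph config set osd osd_snap_trim_sleep 0.1
ceph config set osd osd_pg_max_concurrent_snap_trims 2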

We have a big pool which used to have 128 PGs, and the snap trimming took around 
45-60 minutes.
Because it is impossible to do maintenance on the cluster with 600 GB PG sizes 
(they can easily max out the cluster, which happened to us), we increased to 1024 
PGs and the snap trimming duration grew to 3.5 hours.

Is there any good solution that we are missing to fix this?

On the hardware level I've changed the server profile to tune some NUMA settings, 
but that doesn't seem to have helped either.

Thank you
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Migrated to cephadm, rgw logs to file even when rgw_ops_log_rados is true

2024-11-28 Thread Eugen Block
I haven't played with rgw_ops yet, but have you looked at the various  
syslog config options?


ceph config ls | grep syslog
log_to_syslog
err_to_syslog
clog_to_syslog
clog_to_syslog_level
clog_to_syslog_facility
mon_cluster_log_to_syslog
mon_cluster_log_to_syslog_level
mon_cluster_log_to_syslog_facility

Maybe one of them is what you're looking for.
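
To toggle them for the RGW daemons specifically, something along these lines 
should work (just a sketch):

ceph config set client.rgw log_to_syslog false
ceph config set client.rgw log_to_file false
ceph config set client.rgw log_to_stderr false   # container stdout/stderr is what ends up in the docker logs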

Quoting Paul JURCO:


Hi!
Currently I have limited the output of the rgw log to syslog from rsyslog (as
suggested by Anthony), and limited docker logs from daemon.json.
I still get ops logs written to both logs pool and ops log file
(ops-log-ceph-client.rgw.hostname.log).

How to stop logging ops log on rgw disk and keep logs on logs pool?
Current config:
global    advanced    debug_rgw             0/0
global    advanced    rgw_enable_ops_log    true
global    advanced    rgw_ops_log_rados     true

Thank you!

--
Paul Jurco


On Fri, Nov 22, 2024 at 6:11 PM Paul JURCO  wrote:


Hi,
we recently migrated a 18.2.2 ceph cluster from ceph-deploy to cephadm
(Ubuntu with docker).
RGWs are separate vms.
We noticed syslog increased a lot due to rgw's access logs being sent to it.
And because we log ops, there is a huge ops log file at
/var/log/ceph/cluster-id/ops-log-ceph-client.rgw.hostname-here.log.

While "rgw_ops_log_rados" is "true", ops logs go to both the file and the
rados pool for logs.
If false it doesn't log anything, as expected.
How do I stop dockerized rgws from logging to syslog and to a file on disk,
but keep the ops log in the logs pool?

Config is:
global    basic       log_to_journald       false
global    advanced    rgw_enable_ops_log    false
global    advanced    rgw_ops_log_rados     true

A few hours after enabling it back, and after a massive cleanup, it
does log ops, but only to files.
How do I get ops logs in the rados pool and the access log in a file on disk,
but not in syslog?
I have added this to daemon.json to limit the access logs accumulating in the
/var/log/docker/containers/rand/rand/json.log file:

{
  "log-driver": "local",
  "log-opts": {
"max-size": "512m",
"max-file": "3"
  }
}


Thank you!
Paul


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] rgw multisite excessive data usage on secondary zone

2024-11-28 Thread Adam Prycki

Hi,

I've just configured secondary zones for 2 of our ceph s3 deployments and 
I've noticed that after the initial sync the secondary zone data pools are much 
bigger than the ones on the master zones.


My setup consists of a main zone, an archive zone and a sync policy which 
configures directional sync from the main zone to the archive.


here is an example from the zone I've configured today.

The master zone looks like this:
X.rgw.buckets.data            42    1024    361 GiB    254.26k    542 GiB    0.01    3.1 PiB


The secondary archive zone looks like this:
X-archive.rgw.buckets.data    32    32      755 GiB    716.68k    1007 GiB   0.03    2.4 PiB


This archive zone was created a few hours ago. Users didn't overwrite enough 
data to double the archive zone size (and the object count has almost tripled).


I've checked gc list --include-all on the archive zones and it's empty. I'm 
not sure why the zone is this big.


A few days ago I also configured an archive zone for a different deployment. 
I set the archive-zone lifecycle policy to 1 day and tried to clean up 
all the buckets on the archive zone. It didn't help.

My other archive zone is 150% of the size of its master zone.
I've tried to force a sync with `radosgw-admin data sync init`. The sync worked 
but didn't help with the excess data in the pool.


I suspect it's an error during the initial multisite synchronization. I 
restarted the RGW daemon on the archive zone during the initial synchronization 
in both cases.


What else could have caused this?
Are RGW daemons in multisite setups sensitive to restarts?
Could similar issues happen during normal rgw restart during multisite 
operations?
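
For reference, the per-zone checks can be run roughly like this (the zone name 
is a placeholder):

radosgw-admin sync status --rgw-zone=X-archive
radosgw-admin bucket stats --rgw-zone=X-archive
radosgw-admin gc list --include-all --rgw-zone=X-archive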



Best regards
Adam Prycki





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] 2024-11-28 Perf Meeting Cancelled

2024-11-28 Thread Matt Vandermeulen
Hi folks, the perf meeting for today will be cancelled for US 
thanksgiving!


As a heads up, next week will also be cancelled for Cephalocon.

Thanks,
Matt
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: EC pool only for hdd

2024-11-28 Thread Anthony D'Atri
Apologies for the empty reply to this I seem to have sent.  I blame my phone :o

This process can be somewhat automated with crushtool’s reclassification 
directives, which can help avoid omissions or typos (/me whistles innocently):

https://docs.ceph.com/en/latest/rados/operations/crush-map-edits/#migrating-from-a-legacy-ssd-rule-to-device-classes
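
From memory the invocation from that page looks roughly like this (check the page 
for the exact reclassify flags matching your layout):

ceph osd getcrushmap -o original
crushtool -i original --reclassify --reclassify-root default hdd -o adjusted
crushtool -i original --compare adjusted
ceph osd setcrushmap -i adjusted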




> On Nov 28, 2024, at 2:53 AM, Eugen Block  wrote:
> 
> Of course it's possible. You can either change this rule by extracting the 
> crushmap, decompiling it, editing the "take" section, compile it and inject 
> it back into the cluster. Or you simply create a new rule with the class hdd 
> specified and set this new rule for your pools. So the first approach would 
> be:
> 
> 1. ceph osd getcrushmap -o crushmap.bin
> 2. crushtool -d crushmap.bin -o crushmap.txt
> 3. open crushmap.txt with the editor of your choice, replace
> 
>step take default
> with:
>step take default class hdd
> 
> and save the file.
> 
> 4. crushtool -c crushmap.txt -o crushmap.new
> 5. test it with crushtool:
> 
> crushtool -i crushmap.new --test --rule 1 --num-rep 5 --show-mappings | less
> crushtool -i crushmap.new --test --rule 1 --num-rep 5 --show-bad-mappings | 
> less
> 
> You shouldn't have bad mappings if everything is okay. Inspect the result of 
> --show-mappings to see if the OSDs match your HDD OSDs.
> 
> 6. ceph osd setcrushmap -i crushmap.new
> 
> 
> 
> Alternatively, create a new rule if your EC profile(s) already have the 
> correct crush-device-class set. If not, you can create a new one, but keep in 
> mind that you can't change the k and m values for a given pool, so you need 
> to ensure that you use the same k and m values:
> 
> ceph osd erasure-code-profile set ec-profile-k3m2 k=3 m=2 
> crush-failure-domain=host crush-device-class=hdd
> 
> ceph osd crush rule create-erasure rule-ec-k3m2 ec-profile-k3m2
> 
> And here's the result:
> 
> ceph osd crush rule dump rule-ec-k3m2 | grep -A2 take
>"op": "take",
>"item": -2,
>"item_name": "default~hdd"
> 
> Regards,
> Eugen
> 
> Quoting Rok Jaklič:
> 
>> Hi,
>> 
>> is it possible to set/change following already used rule to only use hdd?
>> {
>>"rule_id": 1,
>>"rule_name": "ec32",
>>"type": 3,
>>"steps": [
>>{
>>"op": "set_chooseleaf_tries",
>>"num": 5
>>},
>>{
>>"op": "set_choose_tries",
>>"num": 100
>>},
>>{
>>"op": "take",
>>"item": -1,
>>"item_name": "default"
>>},
>>{
>>"op": "chooseleaf_indep",
>>"num": 0,
>>"type": "host"
>>},
>>{
>>"op": "emit"
>>}
>>]
>> }
>> 
>> Kind regards,
>> Rok
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: EC pool only for hdd

2024-11-28 Thread Eugen Block
Oh right, I always forget the reclassify command! It worked perfectly  
last time I used it. Thanks!


Quoting Anthony D'Atri:

Apologies for the empty reply to this I seem to have sent.  I blame  
my phone :o


This process can be somewhat automated with crushtool’s  
reclassification directives, which can help avoid omissions or typos  
(/me whistles innocently):


https://docs.ceph.com/en/latest/rados/operations/crush-map-edits/#migrating-from-a-legacy-ssd-rule-to-device-classes





On Nov 28, 2024, at 2:53 AM, Eugen Block  wrote:

Of course it's possible. You can either change this rule by  
extracting the crushmap, decompiling it, editing the "take"  
section, compile it and inject it back into the cluster. Or you  
simply create a new rule with the class hdd specified and set this  
new rule for your pools. So the first approach would be:


1. ceph osd getcrushmap -o crushmap.bin
2. crushtool -d crushmap.bin -o crushmap.txt
3. open crushmap.txt with the editor of your choice, replace

   step take default
with:
   step take default class hdd

and save the file.

4. crushtool -c crushmap.txt -o crushmap.new
5. test it with crushtool:

crushtool -i crushmap.new --test --rule 1 --num-rep 5 --show-mappings | less
crushtool -i crushmap.new --test --rule 1 --num-rep 5  
--show-bad-mappings | less


You shouldn't have bad mappings if everything is okay. Inspect the  
result of --show-mappings to see if the OSDs match your HDD OSDs.


6. ceph osd setcrushmap -i crushmap.new



Alternatively, create a new rule if your EC profile(s) already have  
the correct crush-device-class set. If not, you can create a new  
one, but keep in mind that you can't change the k and m values for  
a given pool, so you need to ensure that you use the same k and m  
values:


ceph osd erasure-code-profile set ec-profile-k3m2 k=3 m=2  
crush-failure-domain=host crush-device-class=hdd


ceph osd crush rule create-erasure rule-ec-k3m2 ec-profile-k3m2

And here's the result:

ceph osd crush rule dump rule-ec-k3m2 | grep -A2 take
   "op": "take",
   "item": -2,
   "item_name": "default~hdd"

Regards,
Eugen

Quoting Rok Jaklič:


Hi,

is it possible to set/change following already used rule to only use hdd?
{
   "rule_id": 1,
   "rule_name": "ec32",
   "type": 3,
   "steps": [
   {
   "op": "set_chooseleaf_tries",
   "num": 5
   },
   {
   "op": "set_choose_tries",
   "num": 100
   },
   {
   "op": "take",
   "item": -1,
   "item_name": "default"
   },
   {
   "op": "chooseleaf_indep",
   "num": 0,
   "type": "host"
   },
   {
   "op": "emit"
   }
   ]
}

Kind regards,
Rok
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io