[ceph-users] Re: Bug with Cephadm module osd service preventing orchestrator start

2024-08-19 Thread Eugen Block

Hi,

what is the output of this command?

ceph config-key get mgr/cephadm/osd_remove_queue

I just tried to cancel a draining on a small 18.2.4 test cluster, it  
went well, though. After scheduling the drain the mentioned key looks  
like this:


# ceph config-key get mgr/cephadm/osd_remove_queue
[{"osd_id": 1, "started": true, "draining": false, "stopped": false,  
"replace": false, "force": false, "zap": false, "hostname": "host5",  
"original_weight": 0.0233917236328125, "drain_started_at": null,  
"drain_stopped_at": null, "drain_done_at": null, "process_started_at":  
"2024-08-19T07:21:27.783527Z"}, {"osd_id": 13, "started": true,  
"draining": true, "stopped": false, "replace": false, "force": false,  
"zap": false, "hostname": "host5", "original_weight":  
0.0233917236328125, "drain_started_at": "2024-08-19T07:21:30.365237Z",  
"drain_stopped_at": null, "drain_done_at": null, "process_started_at":  
"2024-08-19T07:21:27.794688Z"}]


Here you see the original_weight which the orchestrator apparently failed to
read. (Note that these are only small 20 GB OSDs, hence the small weight.)
You probably didn't capture this output while the OSDs were scheduled for
draining, correct? I was able to break my cephadm module by injecting that
JSON again (the queue was already completed, hence empty), but maybe I did it
incorrectly, not sure yet.
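
For reference, a minimal sketch of how such an injection could be done (not
necessarily the exact steps I used; the file name is just a placeholder):

# dump the current queue and edit the JSON as needed
ceph config-key get mgr/cephadm/osd_remove_queue > osd_remove_queue.json
# write it back; a stale or malformed payload here is what trips up the module
ceph config-key set mgr/cephadm/osd_remove_queue -i osd_remove_queue.json
# the module only reads the key at startup, so fail over the mgr to reload it
ceph mgr fail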


Regards,
Eugen

Zitat von Benjamin Huth :


So about a week and a half ago, I started a drain on an incorrect host. I
fairly quickly realized that it was the wrong host, so I stopped the drain,
canceled the OSD deletions with "ceph orch osd rm stop OSD_ID", then
dumped and edited the crush map to properly reweight those OSDs and the host,
and applied the edited crush map. I then proceeded with a full drain of the
correct host and completed that before attempting to upgrade my cluster.
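
For context, the crush map edit was the usual dump/decompile/edit/recompile
cycle, roughly like this (file names are placeholders):

ceph osd getcrushmap -o crushmap.bin       # dump the binary crush map
crushtool -d crushmap.bin -o crushmap.txt  # decompile it to editable text
# edit crushmap.txt to restore the weights of the affected OSDs and host
crushtool -c crushmap.txt -o crushmap.new  # recompile
ceph osd setcrushmap -i crushmap.new       # apply the edited map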

I started the upgrade, and all 3 of my managers were upgraded from 18.2.2
to 18.2.4. At this point, my managers started back up, but with an
orchestrator that had failed to start, so the upgrade was unable to
continue. My cluster is in a state where only the 3 managers are upgraded
to 18.2.4 and every other daemon is still at 18.2.2.

Since my orchestrator is not able to start, I'm unfortunately not able to
run any ceph orch commands as I receive "Error ENOENT: Module not found"
because the cephadm module doesn't load.
Output of ceph versions:
{
"mon": {
"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 5
},
"mgr": {
"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 1
},
"osd": {
"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 119
},
"mds": {
"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 4
},
"overall": {
"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 129
}
}

I mentioned in my previous post that I tried manually downgrading the
managers to 18.2.2 because I thought there may be an issue with 18.2.4, but
18.2.2 also has the PR that I believe is causing this (
https://github.com/ceph/ceph/commit/ba7fac074fb5ad072fcad10862f75c0a26a7591d)
so no luck

Thanks!
(so sorry, I did not reply all so you may have received this twice)

On Sat, Aug 17, 2024 at 2:55 AM Eugen Block  wrote:


Just to get some background information, did you remove OSDs while
performing the upgrade? Or did you start OSD removal and then started
the upgrade? Upgrades should be started with a healthy cluster, but
one can’t guarantee that of course, OSDs and/or entire hosts can
obviously also fail during an upgrade.
Just trying to understand what could cause this (I haven’t upgraded
production clusters to Reef yet, only test clusters). Have you stopped
the upgrade to cancel the process entirely? Can you share this
information please:

ceph versions
ceph orch upgrade status

Zitat von Benjamin Huth :

> Just wanted to follow up on this, I am unfortunately still stuck with this
> and can't find where the json for this value is stored. I'm wondering if I
> should attempt to build a manager container with the code for this
> reverted to before the commit that introduced the original_weight argument.
> Please let me know if you guys have any thoughts
>
> Thank you!
>
> On Wed, Aug 14, 2024, 7:37 PM Benjamin Huth 
wrote:
>
>> Hey there, so I went to upgrade my ceph from 18.2.2 to 18.2.4 and have
>> encountered a problem with my managers. After they had been upgraded, my
>> ceph orch module broke because the cephadm module would not load. This
>> obviously halted the update because you can't really update without the
>> orchestrator. Here are the logs related to why the cephadm module fails
to
>> start:
>>
>> https://pastebin.com/SzHbEDVA
>>
>> and the relevant part here:
>>
>> "backtrace": [
>>
>> " File \\"/usr/share/ceph/mgr/cephadm/module.py\\", line 591, in
>> __init__\\n self.to_remove_osds.load_from_store(

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-19 Thread Giovanna Ratini

Hello Eugen,

*root@kube-master02:~# k ceph -s*

Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 13h)
    mgr: a(active, since 13h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 5h), 6 in (since 5d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.20M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 20 active+clean+snaptrim_wait
 12 active+clean+snaptrim

  io:
    client:   3.5 MiB/s rd, 3.6 MiB/s wr, 6 op/s rd, 12 op/s wr

If I understand the documentation correctly, I will never have a scrub 
unless the PGs (Placement Groups) are active and clean.


All 32 PGs of the CephFS pool have been in this status for several days:

 * 20 active+clean+snaptrim_wait
 * 12 active+clean+snaptrim

Today I restarted the MON, MGR, and MDS, but nothing changed; the queue keeps growing.

Am 18.08.2024 um 18:39 schrieb Eugen Block:
Can you share the current ceph status? Are the OSDs reporting anything 
suspicious? How is the disk utilization?


Zitat von Giovanna Ratini :


More information:

The snaptrim takes a lot of time, but objects_trimmed is "0":

 "objects_trimmed": 0,
"snaptrim_duration": 500.5807601752,

That could explain why the queue keeps growing.
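
For reference, this is roughly how these counters can be pulled out and
watched over time (grep is crude, but it avoids guessing exact JSON paths,
which may differ between versions):

# snap_trimq_len for a single PG
ceph pg 3.12 query | grep '"snap_trimq_len"'
# trimming counters across all PGs
ceph pg dump --format json-pretty | grep -E '"objects_trimmed"|"snaptrim_duration"'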


Am 17.08.2024 um 14:37 schrieb Giovanna Ratini:

Hello again,

I checked the PG dump. The snapshot trim queue keeps growing.

Query for PG 3.12
{
    "snap_trimq": 
"[5b974~3b,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1]",

*    "snap_trimq_len": 5421,*
    "state": "active+clean+snaptrim",
    "epoch": 734130,

Query for PG 3.12
{
    "snap_trimq": 
"[5b976~39,5ba53~1,5ba56~a0,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1,5f875~a1]",

*   "snap_trimq_len": 5741,*
    "state": "active+clean+snaptrim",
    "epoch": 734240,
    "up": [

Do you know of a way to see whether the snaptrim process is actually making progress?

Best regards,

Gio


Am 17.08.2024 um 12:59 schrieb Giovanna Ratini:

Hello Eugen,

thank you for your answer.

I restarted all the kube-ceph nodes one after the other. Nothing 
has changed.


OK, I deactivated the snapshot schedule: ceph fs snap-schedule deactivate /

Is there a way to see how many snapshots will be deleted per hour?

Regards,

Gio





Am 17.08.2024 um 10:12 schrieb Eugen Block:

Hi,

have you tried to fail the mgr? Sometimes the PG stats are not 
correct. You could also temporarily disable snapshots to see if 
things settle down.


Zitat von Giovanna Ratini :


Hello all,

We use Ceph (v18.2.2) and Rook (1.14.3) as the CSI for a 
Kubernetes environment. Last week, we had a problem with the MDS 
falling behind on trimming every 4-5 days (GitHub issue link). We 
resolved the issue using the steps outlined in the GitHub issue.


We have 3 hosts (I know, I need to increase this as soon as 
possible, and I will!) and 6 OSDs. After running the commands:


ceph config set mds mds_dir_max_commit_size 80,

ceph fs fail , and

ceph fs set  joinable true,

After that, the snaptrim queue for our PGs has stopped 
decreasing. All PGs of our CephFS are in either 
active+clean+snaptrim_wait or active+clean+snaptrim states. For 
example, the PG 3.12 is in the active+clean+snaptrim state, and 
its snap_trimq_len was 4077 yesterday but has increased to 4538 
today.


I increased the osd_snap_trim_priority to 10 (ceph config set osd 
osd_snap_trim_priority 10), but it didn't help. Only t

[ceph-users] Re: memory leak in mds?

2024-08-19 Thread Dario Graña
Thank you Frédéric and Venky for your answers.
I will try to do some tests before changing the production environment.

On Mon, Aug 19, 2024 at 8:53 AM Venky Shankar  wrote:

> [cc Xiubo]
>
> On Fri, Aug 16, 2024 at 8:10 PM Dario Graña  wrote:
> >
> > Hi all,
> > We’re experiencing an issue with CephFS. I think we are facing this issue
> > . The main symptom is that the
> MDS
> > starts using a lot of memory within a few minutes and finally it gets
> > killed by OS (Out Of Memory). Sometimes it happens once a week and
> > sometimes 2 times a day. We are running ceph quincy 17.2.7 on both the
> > cluster and clients. I have read through some emails on the mailing list
> > about it, but I didn't find a workaround. Does anyone have any
> suggestions?
>
> It is likely that you are running into the issue described in the
> mentioned tracker. The change is pending backport to quincy as of now,
> so the other alternative approach might be downgrading the client.
>
> > Thanks in advance.
> >
> > --
> > Dario Graña
> > PIC (Port d'Informació Científica)
> > Campus UAB, Edificio D
> > E-08193 Bellaterra, Barcelona
> > http://www.pic.es
> > Avis - Aviso - Legal Notice: http://legal.ifae.es
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> --
> Cheers,
> Venky
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: memory leak in mds?

2024-08-19 Thread Dario Graña
I was testing downgrading the *ceph-common* package for a client with
Alma 9, the same OS we use in production. I was trying to install the
*ceph-common-17.2.6* package, since the troubles began in 17.2.7, but I ran
into a dependency problem:
nothing provides libthrift-0.14.0.so()(64bit) needed by
ceph-common-2:17.2.6-0.el9.x86_64 from Ceph
When I try to install with the --nobest flag, dnf suggests installing the
*16.2.4-5.el9* version. I will test it in a non-production environment, but
I want to know whether you think it is safe to use the 16.2.4-5 client version
against a 17.2.7 cluster.
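
For reference, this is roughly how I am checking what the repos can offer and
trying to pin the exact build (version strings as dnf reports them here):

# list every ceph-common build the enabled repos provide
dnf --showduplicates list ceph-common
# try to pin the exact 17.2.6 build rather than letting --nobest fall back to 16.2.4-5
dnf install ceph-common-2:17.2.6-0.el9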
Thanks in advance.


On Mon, Aug 19, 2024 at 10:10 AM Dario Graña  wrote:

> Thank you Frédéric and Venky for your answers.
> I will try to do some tests before changing the production environment.
>
> On Mon, Aug 19, 2024 at 8:53 AM Venky Shankar  wrote:
>
>> [cc Xiubo]
>>
>> On Fri, Aug 16, 2024 at 8:10 PM Dario Graña  wrote:
>> >
>> > Hi all,
>> > We’re experiencing an issue with CephFS. I think we are facing this
>> issue
>> > . The main symptom is that the
>> MDS
>> > starts using a lot of memory within a few minutes and finally it gets
>> > killed by OS (Out Of Memory). Sometimes it happens once a week and
>> > sometimes 2 times a day. We are running ceph quincy 17.2.7 on both the
>> > cluster and clients. I have read through some emails on the mailing list
>> > about it, but I didn't find a workaround. Does anyone have any
>> suggestions?
>>
>> It is likely that you are running into the issue described in the
>> mentioned tracker. The change is pending backport to quincy as of now,
>> so the other alternative approach might be downgrading the client.
>>
>> > Thanks in advance.
>> >
>> > --
>> > Dario Graña
>> > PIC (Port d'Informació Científica)
>> > Campus UAB, Edificio D
>> > E-08193 Bellaterra, Barcelona
>> > http://www.pic.es
>> > Avis - Aviso - Legal Notice: http://legal.ifae.es
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
>>
>> --
>> Cheers,
>> Venky
>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: squid 19.1.1 RC QE validation status

2024-08-19 Thread Venky Shankar
Hi Brad,

On Fri, Aug 16, 2024 at 8:59 AM Brad Hubbard  wrote:
>
> On Thu, Aug 15, 2024 at 11:50 AM Brad Hubbard  wrote:
> >
> > On Tue, Aug 6, 2024 at 6:33 AM Yuri Weinstein  wrote:
> > >
> > > Details of this release are summarized here:
> > >
> > > https://tracker.ceph.com/issues/67340#note-1
> > >
> > > Release Notes - N/A
> > > LRC upgrade - N/A
> > > Gibba upgrade -TBD
> > >
> > > Seeking approvals/reviews for:
> > >
> > > rados - Radek, Laura (https://github.com/ceph/ceph/pull/59020 is being
> > > tested and will be cherry-picked when ready)
> > >
> > > rgw - Eric, Adam E
> > > fs - Venky
> > > orch - Adam King
> > > rbd, krbd - Ilya
> > >
> > > quincy-x, reef-x - Laura, Neha
> > >
> > > powercycle - Brad
> >
> > https://pulpito.ceph.com/yuriw-2024-08-02_15:42:13-powercycle-squid-release-distro-default-smithi/7833420/
> > is a problem with the cfuse_workunit_kernel_untar_build task where
> > it's failing to build the kernel, so a problem with the task itself I
> > believe at this point.
> >
> > https://pulpito.ceph.com/yuriw-2024-08-02_15:42:13-powercycle-squid-release-distro-default-smithi/7833422/
> > is a problem with the cfuse_workunit_suites_ffsb task where it's
> > reporting
> > 2024-08-03T06:51:35.402
> > INFO:tasks.workunit.client.0.smithi089.stdout:Probably out of disk
> > space
>
> I'm pretty sure the first of these, and possibly the second as well,
> are ceph-fuse issues and I've created
> https://tracker.ceph.com/issues/67565 and asked for input from the FS
> team.

I have pushed a fix for the crash. The bug exists in 19.1.0 RC too --
do we consider it as a blocker for this RC (Neha/Patrick)?

>
> >
> > I'll chase these down, but I don't think they are powercycle issues
> > per se at this stage. I will prioritise identifying the specific root
> > cause however.
> >
> > APPROVED.
> >
> > > crimson-rados - Matan, Samuel
> > >
> > > ceph-volume - Guillaume
> > >
> > > Pls let me know if any tests were missed from this list.
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> >
> >
> > --
> > Cheers,
> > Brad
>
>
>
> --
> Cheers,
> Brad
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Cheers,
Venky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Prometheus and "404" error on console

2024-08-19 Thread Tim Holloway
Although I'm seeing this in Pacific, it appears to be a perennial issue
with no well-documented solution. The dashboard home screen is flooded
with popups saying "404 - Not Found

Could not reach Prometheus's API on
http://ceph1234.mydomain.com:9095/api/v1
"

If I was a slack-jawed PHB casually wandering into the operations
center and saw that, I'd probably doubt that Ceph was a good product
decision.

The "404" indicates that the Prometheus server is running and accepting
requests, but all the "404" says is that whatever requests it's
receiving are meaningless to it. Presumably either some handler needs
to be jacked into Prometheus or the sender needs to be notified to
desist.

Unfortunately, "404" doesn't provide any clues. Ideally, ceph should
log something more specific by default, but failing that, if anyone
knows how to shut it up (cleanly), I'd appreciate knowing!
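
If it turns out the dashboard is simply pointed at the wrong Prometheus
endpoint, something along these lines ought to repoint it (hostname and port
here are examples, and I haven't verified that this is actually the cause):

# see where the dashboard currently expects Prometheus
ceph dashboard get-prometheus-api-host
# point it at the real Prometheus instance
ceph dashboard set-prometheus-api-host http://prometheus-host.mydomain.com:9095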

   Tim
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prometheus and "404" error on console

2024-08-19 Thread Daniel Brown


I’ve seen similar. 

Have been wondering if it would be possible to set up either a load balancer or 
something like "keepalived" to provide a "VIP" that could move between nodes 
to support the dashboard (and Prometheus, Grafana, etc.).

I do see notes about HAProxy in the docs, but haven't gotten to trying that 
setup: 

https://docs.ceph.com/en/quincy/mgr/dashboard/#haproxy-example-configuration



 
In my opinion, a VIP for the dashboard (etc.) could and maybe should be an 
out-of-the-box config. 
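
Roughly what I have in mind, as an untested sketch (interface, VIP and
priority are placeholders; each mgr host would get its own priority):

cat > /etc/keepalived/keepalived.conf <<'EOF'
vrrp_instance CEPH_DASHBOARD {
    state BACKUP           # all nodes start as BACKUP; highest priority wins the VIP
    interface eth0         # placeholder NIC
    virtual_router_id 51
    priority 100           # use a different priority on each host
    advert_int 1
    virtual_ipaddress {
        192.0.2.10/24      # the "VIP" the dashboard / Grafana would be reached on
    }
}
EOF
systemctl enable --now keepalived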





> On Aug 19, 2024, at 8:23 AM, Tim Holloway  wrote:
> 
> Although I'm seeing this in Pacific, it appears to be a perennial issue
> with no well-documented solution. The dashboard home screen is flooded
> with popups saying "404 - Not Found
> 
> Could not reach Prometheus's API on
> http://ceph1234.mydomain.com:9095/api/v1
> "
> 
> If I was a slack-jawed PHB casually wandering into the operations
> center and saw that, I'd probably doubt that Ceph was a good product
> decision.
> 
> The "404" indicates that the Prometheus server is running and accepting
> requests, but all the "404" says is that whatever requests it's
> receiving are meaningless to it. Presumably either some handler needs
> to be jacked into Prometheus or the sender needs to be notified to
> desist.
> 
> Unfortunately, "404" doesn't provide any clues. Ideally, ceph should
> log something more specific by default, but failing that, if anyone
> knows how to shut it up (cleanly), I'd appreciate knowing!
> 
>   Tim
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-19 Thread Eugen Block

What happens when you disable snaptrimming entirely?

ceph osd set nosnaptrim

So the load on your cluster seems low, but are the OSDs heavily  
utilized? Have you checked iostat?
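
Something along these lines on the OSD hosts would show it (standard tools,
device names will differ):

# per-device utilization and latency, refreshed every second
iostat -x 1
# Ceph's own view of per-OSD latencies
ceph osd perf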


Zitat von Giovanna Ratini :


Hello Eugen,

*root@kube-master02:~# k ceph -s*

Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 13h)
    mgr: a(active, since 13h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 5h), 6 in (since 5d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.20M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 20 active+clean+snaptrim_wait
 12 active+clean+snaptrim

  io:
    client:   3.5 MiB/s rd, 3.6 MiB/s wr, 6 op/s rd, 12 op/s wr

If I understand the documentation correctly, I will never have a  
scrub unless the PGs (Placement Groups) are active and clean.


All 32 PGs of the CephFS pool have been in this status for several days:

 * 20 active+clean+snaptrim_wait
 * 12 active+clean+snaptrim"

Today, I restarted the MON, MGR, and MDS, but no changes in the growing.

Am 18.08.2024 um 18:39 schrieb Eugen Block:
Can you share the current ceph status? Are the OSDs reporting  
anything suspicious? How is the disk utilization?


Zitat von Giovanna Ratini :


More information:

The snaptrim take a lot of time but the he objects_trimmed are "0"

 "objects_trimmed": 0,
"snaptrim_duration": 500.5807601752,

It could explain, why the queue are growing up..


Am 17.08.2024 um 14:37 schrieb Giovanna Ratini:

Hello again,

I checked the pgs dump. Snapshot grow up

Query für PG: 3.12
{
    "snap_trimq":  
"[5b974~3b,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1]",

*    "snap_trimq_len": 5421,*
    "state": "active+clean+snaptrim",
    "epoch": 734130,

Query für PG: 3.12
{
    "snap_trimq":  
"[5b976~39,5ba53~1,5ba56~a0,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1,5f875~a1]",

*   "snap_trimq_len": 5741,*
    "state": "active+clean+snaptrim",
    "epoch": 734240,
    "up": [

Do you know the way to see if the snaptim "process" works?

Best Regard

Gio


Am 17.08.2024 um 12:59 schrieb Giovanna Ratini:

Hello Eugen,

thank you for your answer.

I restarted all the kube-ceph nodes one after the other. Nothing  
has changed.


ok, I deactivate the snap ... : ceph fs snap-schedule deactivate /

Is there a way to see how many snapshots will be deleted per hour?

Regards,

Gio





Am 17.08.2024 um 10:12 schrieb Eugen Block:

Hi,

have you tried to fail the mgr? Sometimes the PG stats are not  
correct. You could also temporarily disable snapshots to see if  
things settle down.


Zitat von Giovanna Ratini :


Hello all,

We use Ceph (v18.2.2) and Rook (1.14.3) as the CSI for a  
Kubernetes environment. Last week, we had a problem with the  
MDS falling behind on trimming every 4-5 days (GitHub issue  
link). We resolved the issue using the steps outlined in the  
GitHub issue.


We have 3 hosts (I know, I need to increase this as soon as  
possible, and I will!) and 6 OSDs. After running the commands:


ceph config set mds mds_dir_max_commit_size 80,

ceph fs fail , and

ceph fs set  joinable true,

After that, the snaptrim queue for our PGs has stopped  
decreasing. All PGs of our CephFS are in either  
active+clean+snaptrim_wait or active+clean+snaptrim states.  
For example, the PG 3.12 is in the a

[ceph-users] cephadm module fails to load with "got an unexpected keyword argument"

2024-08-19 Thread Alex Sanderson

Hi everyone,

I recently upgraded from Quincy to Reef v18.2.4 and my dashboard and mgr 
systems have been broken since. Since the upgrade I was slowly removing 
and zapping OSDs that still had the 64k "bluestore_bdev_block_size" and 
decided to have a look at the dashboard problem. I restarted the mgrs 
one at a time and they showed in status that they were working, but actually 
the cephadm module was failing. The systems were all upgraded to 18 via 
orch from 17.2.7 and are running the official docker images.


This is the error message:

debug 2024-08-13T10:08:11.736+ 7fd30dbe0640 -1 mgr load Failed to 
construct class in 'cephadm'
debug 2024-08-13T10:08:11.736+ 7fd30dbe0640 -1 mgr load Traceback 
(most recent call last):

  File "/usr/share/ceph/mgr/cephadm/module.py", line 619, in __init__
    self.to_remove_osds.load_from_store()
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 924, in 
load_from_store

    osd_obj = OSD.from_json(osd, rm_util=self.rm_util)
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 789, in 
from_json

    return cls(**inp)
TypeError: __init__() got an unexpected keyword argument 'original_weight'

debug 2024-08-13T10:08:11.736+ 7fd30dbe0640 -1 mgr operator() Failed 
to run module in active mode ('cephadm')


The config-key responsible was mgr/cephadm/osd_remove_queue

This is what it looked like before. After removing the original_weight 
field and setting the key again, the cephadm module loads and orch 
works (a sketch of the edit follows the JSON below). It seems like a bug.


[{"osd_id": 89, "started": true, "draining": true, "stopped": false, 
"replace": false, "force": true, "zap": true, "hostname": "goanna", 
"original_weight": 0.930999755859375, "drain_started_at": 
"2024-08-12T13:21:04.458019Z", "drain_stopped_at": null, 
"drain_done_at": null, "process_started_at": 
"2024-08-12T13:20:40.021185Z"}, {"osd_id": 37, "started": true, 
"draining": true, "stopped": false, "replace": false, "force": true, 
"zap": true, "hostname": "gsceph1osd05", "original_weight": 4, 
"drain_started_at": "2024-08-10T06:30:37.569931Z", "drain_stopped_at": 
null, "drain_done_at": null, "process_started_at": 
"2024-08-10T06:30:19.729143Z"}, {"osd_id": 47, "started": true, 
"draining": true, "stopped": false, "replace": false, "force": true, 
"zap": true, "hostname": "gsceph1osd07", "original_weight": 4, 
"drain_started_at": "2024-08-10T09:54:49.132830Z", "drain_stopped_at": 
null, "drain_done_at": null, "process_started_at": 
"2024-08-10T09:54:34.367655Z"}]


I thought I should put this out there in case anyone else was having a 
weird issue with a keyword argument problem.  It did not fix the problem 
with the dashboard, still working on that.


Alex

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm module fails to load with "got an unexpected keyword argument"

2024-08-19 Thread Eugen Block

Hi,

there's a tracker issue [0] for that. I was assisting with the same  
issue in a different thread [1].


Thanks,
Eugen

[0] https://tracker.ceph.com/issues/67329
[1]  
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/SRJPC5ZYTPXF63AKGIIOA2LLLBBWCIT4/


Zitat von Alex Sanderson :


Hi everyone,

I recently upgraded from Quincy to Reef v18.2.4 and my dashboard and  
mgr systems have been broken since.  Since the upgrade I was slowly  
removing and zapping osd's that still had the 64k  
"bluestore_bdev_block_size" and decided to have a look at the  
dashboard problem.   I restarted the mgrs one at a time and they  
showed in status that they working but actually the cephadm module  
was failing.  The systems were all upgraded to 18 via orch from  
17.2.7 and are running the official docker images.


This is the error message:

debug 2024-08-13T10:08:11.736+ 7fd30dbe0640 -1 mgr load Failed  
to construct class in 'cephadm'
debug 2024-08-13T10:08:11.736+ 7fd30dbe0640 -1 mgr load  
Traceback (most recent call last):

  File "/usr/share/ceph/mgr/cephadm/module.py", line 619, in __init__
    self.to_remove_osds.load_from_store()
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 924, in  
load_from_store

    osd_obj = OSD.from_json(osd, rm_util=self.rm_util)
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 789, in from_json
    return cls(**inp)
TypeError: __init__() got an unexpected keyword argument 'original_weight'

debug 2024-08-13T10:08:11.736+ 7fd30dbe0640 -1 mgr operator()  
Failed to run module in active mode ('cephadm')


The config-key responsible was mgr/cephadm/osd_remove_queue

This is what it looked like before.  After removing the  
original_weight field and setting the variable again, the cephadm  
module loads and orch works.   It seems like a bug.


[{"osd_id": 89, "started": true, "draining": true, "stopped": false,  
"replace": false, "force": true, "zap": true, "hostname": "goanna",  
"original_weight": 0.930999755859375, "drain_started_at":  
"2024-08-12T13:21:04.458019Z", "drain_stopped_at": null,  
"drain_done_at": null, "process_started_at":  
"2024-08-12T13:20:40.021185Z"}, {"osd_id": 37, "started": true,  
"draining": true, "stopped": false, "replace": false, "force": true,  
"zap": true, "hostname": "gsceph1osd05", "original_weight": 4,  
"drain_started_at": "2024-08-10T06:30:37.569931Z",  
"drain_stopped_at": null, "drain_done_at": null,  
"process_started_at": "2024-08-10T06:30:19.729143Z"}, {"osd_id": 47,  
"started": true, "draining": true, "stopped": false, "replace":  
false, "force": true, "zap": true, "hostname": "gsceph1osd07",  
"original_weight": 4, "drain_started_at":  
"2024-08-10T09:54:49.132830Z", "drain_stopped_at": null,  
"drain_done_at": null, "process_started_at":  
"2024-08-10T09:54:34.367655Z"}]


I thought I should put this out there in case anyone else was having  
a weird issue with a keyword argument problem.  It did not fix the  
problem with the dashboard, still working on that.


Alex

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bug with Cephadm module osd service preventing orchestrator start

2024-08-19 Thread Eugen Block

There's a tracker issue for this:

https://tracker.ceph.com/issues/67329

Zitat von Eugen Block :


Hi,

what is the output of this command?

ceph config-key get mgr/cephadm/osd_remove_queue

I just tried to cancel a draining on a small 18.2.4 test cluster, it  
went well, though. After scheduling the drain the mentioned key  
looks like this:


# ceph config-key get mgr/cephadm/osd_remove_queue
[{"osd_id": 1, "started": true, "draining": false, "stopped": false,  
"replace": false, "force": false, "zap": false, "hostname": "host5",  
"original_weight": 0.0233917236328125, "drain_started_at": null,  
"drain_stopped_at": null, "drain_done_at": null,  
"process_started_at": "2024-08-19T07:21:27.783527Z"}, {"osd_id": 13,  
"started": true, "draining": true, "stopped": false, "replace":  
false, "force": false, "zap": false, "hostname": "host5",  
"original_weight": 0.0233917236328125, "drain_started_at":  
"2024-08-19T07:21:30.365237Z", "drain_stopped_at": null,  
"drain_done_at": null, "process_started_at":  
"2024-08-19T07:21:27.794688Z"}]


Here you see the original_weight which the orchestrator failed to  
read, apparently. (Note that there are only small 20 GB OSDs, hence  
the small weight). You probably didn't have the output while the  
OSDs were scheduled for draining, correct? I was able to break my  
cephadm module by injecting that json again (it was already  
completed, hence empty), but maybe I did it incorrectly, not sure yet.


Regards,
Eugen

Zitat von Benjamin Huth :


So about a week and a half ago, I started a drain on an incorrect host. I
fairly quickly realized that it was the wrong host, so I stopped the drain,
canceled the osd deletions with "ceph orch osd rm stop OSD_ID", then
dumped, edited the crush map to properly reweight those osds and host, and
applied the edited crush map. I then proceeded with a full drain of the
correct host and completed that before attempting to upgrade my cluster.

I started the upgrade, and all 3 of my managers were upgraded from 18.2.2
to 18.2.4. At this point, my managers started back up, but with an
orchestrator that had failed to start, so the upgrade was unable to
continue. My cluster is in a stage where only the 3 managers are upgraded
to 18.2.4 and every other part is at 18.2.2

Since my orchestrator is not able to start, I'm unfortunately not able to
run any ceph orch commands as I receive "Error ENOENT: Module not found"
because the cephadm module doesn't load.
Output of ceph versions:
{
   "mon": {
   "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 5
   },
   "mgr": {
   "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 1
   },
   "osd": {
   "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 119
   },
   "mds": {
   "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 4
   },
   "overall": {
   "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 129
   }
}

I mentioned in my previous post that I tried manually downgrading the
managers to 18.2.2 because I thought there may be an issue with 18.2.4, but
18.2.2 also has the PR that I believe is causing this (
https://github.com/ceph/ceph/commit/ba7fac074fb5ad072fcad10862f75c0a26a7591d)
so no luck

Thanks!
(so sorry, I did not reply all so you may have received this twice)

On Sat, Aug 17, 2024 at 2:55 AM Eugen Block  wrote:


Just to get some background information, did you remove OSDs while
performing the upgrade? Or did you start OSD removal and then started
the upgrade? Upgrades should be started with a healthy cluster, but
one can’t guarantee that of course, OSDs and/or entire hosts can
obviously also fail during an upgrade.
Just trying to understand what could cause this (I haven’t upgraded
production clusters to Reef yet, only test clusters). Have you stopped
the upgrade to cancel the process entirely? Can you share this
information please:

ceph versions
ceph orch upgrade status

Zitat von Benjamin Huth :


Just wanted to follow up on this, I am unfortunately still stuck with

this

and can't find where the json for this value is stored. I'm wondering if

I

should attempt to build a manager container  with the code for this
reverted to before the commit that introduced the original_weight

argument.

Please let me know if you guys have any thoughts

Thank you!

On Wed, Aug 14, 2024, 7:37 PM Benjamin Huth 

wrote:



Hey there, so I went to upgrade my ceph from 18.2.2 to 18.2.4 and have
encountered a problem with my managers. After they had been upgraded, my
ceph orch module broke because the cephadm module would not load. This
obviously halted the update because you can't really update without the
orchestrator. Here are the logs related to why the cephadm module fails

to

start:

https://pastebin.com/SzHbEDVA

and the relevent part here:

"backtrace": [

" File \\"/usr/share/ceph/mgr/cephadm/module.py\\", line 591, in
__init_

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-19 Thread Giovanna Ratini

Hello Eugen,

yes, the load is not too high for now.

I stopped snaptrim and this is the output now. No changes in the queue.

root@kube-master02:~# k ceph -s
Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    nosnaptrim flag(s) set
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 30h)
    mgr: a(active, since 29h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 21h), 6 in (since 6d)
 flags nosnaptrim

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.21M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 32 active+clean+snaptrim_wait

  io:
    client:   7.4 MiB/s rd, 7.9 MiB/s wr, 11 op/s rd, 35 op/s wr

Am 19.08.2024 um 14:54 schrieb Eugen Block:

What happens when you disable snaptrimming entirely?

ceph osd set nosnaptrim

So the load on your cluster seems low, but are the OSDs heavily 
utilized? Have you checked iostat?


Zitat von Giovanna Ratini :


Hello Eugen,

*root@kube-master02:~# k ceph -s*

Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 13h)
    mgr: a(active, since 13h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 5h), 6 in (since 5d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.20M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 20 active+clean+snaptrim_wait
 12 active+clean+snaptrim

  io:
    client:   3.5 MiB/s rd, 3.6 MiB/s wr, 6 op/s rd, 12 op/s wr

If I understand the documentation correctly, I will never have a 
scrub unless the PGs (Placement Groups) are active and clean.


All 32 PGs of the CephFS pool have been in this status for several days:

 * 20 active+clean+snaptrim_wait
 * 12 active+clean+snaptrim"

Today, I restarted the MON, MGR, and MDS, but no changes in the growing.

Am 18.08.2024 um 18:39 schrieb Eugen Block:
Can you share the current ceph status? Are the OSDs reporting 
anything suspicious? How is the disk utilization?


Zitat von Giovanna Ratini :


More information:

The snaptrim take a lot of time but the he objects_trimmed are "0"

 "objects_trimmed": 0,
"snaptrim_duration": 500.5807601752,

It could explain, why the queue are growing up..


Am 17.08.2024 um 14:37 schrieb Giovanna Ratini:

Hello again,

I checked the pgs dump. Snapshot grow up

Query für PG: 3.12
{
    "snap_trimq": 
"[5b974~3b,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1]",

*    "snap_trimq_len": 5421,*
    "state": "active+clean+snaptrim",
    "epoch": 734130,

Query für PG: 3.12
{
    "snap_trimq": 
"[5b976~39,5ba53~1,5ba56~a0,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1,5f875~a1]",

*   "snap_trimq_len": 5741,*
    "state": "active+clean+snaptrim",
    "epoch": 734240,
    "up": [

Do you know the way to see if the snaptim "process" works?

Best Regard

Gio


Am 17.08.2024 um 12:59 schrieb Giovanna Ratini:

Hello Eugen,

thank you for your answer.

I restarted all the kube-ceph nodes one after the other. Nothing 
has changed.


ok, I deactivate the snap ... : ceph fs snap-schedule deactivate /

Is there a way to see how many snapshots will be deleted per hour?

Re

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-19 Thread Eugen Block
There's a lengthy thread [0] where several approaches are proposed.  
The worst is an OSD recreation, but that's the last resort, of course.


What are the current values for these configs?

ceph config get osd osd_pg_max_concurrent_snap_trims
ceph config get osd osd_max_trimming_pgs

Maybe decrease them to 1 each while the nosnaptrim flag is set, then  
unset it. You could also try online and/or offline OSD compaction  
before unsetting the flag. Are the OSD processes utilizing an entire  
CPU?


[0] https://www.spinics.net/lists/ceph-users/msg75626.html
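
Roughly the sequence I mean, as a sketch:

# with nosnaptrim still set, lower the trim concurrency
ceph config set osd osd_pg_max_concurrent_snap_trims 1
ceph config set osd osd_max_trimming_pgs 1
# optionally compact the OSDs' RocksDB online first
ceph tell 'osd.*' compact
# then let trimming resume
ceph osd unset nosnaptrim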

Zitat von Giovanna Ratini :


Hallo Eugen,

yes, the load is for now not too much.

I stop the snap and now this is the output. No changes in the queue.

root@kube-master02:~# k ceph -s
Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    nosnaptrim flag(s) set
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 30h)
    mgr: a(active, since 29h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 21h), 6 in (since 6d)
 flags nosnaptrim

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.21M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 32 active+clean+snaptrim_wait

  io:
    client:   7.4 MiB/s rd, 7.9 MiB/s wr, 11 op/s rd, 35 op/s wr

Am 19.08.2024 um 14:54 schrieb Eugen Block:

What happens when you disable snaptrimming entirely?

ceph osd set nosnaptrim

So the load on your cluster seems low, but are the OSDs heavily  
utilized? Have you checked iostat?


Zitat von Giovanna Ratini :


Hello Eugen,

*root@kube-master02:~# k ceph -s*

Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 13h)
    mgr: a(active, since 13h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 5h), 6 in (since 5d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.20M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 20 active+clean+snaptrim_wait
 12 active+clean+snaptrim

  io:
    client:   3.5 MiB/s rd, 3.6 MiB/s wr, 6 op/s rd, 12 op/s wr

If I understand the documentation correctly, I will never have a  
scrub unless the PGs (Placement Groups) are active and clean.


All 32 PGs of the CephFS pool have been in this status for several days:

 * 20 active+clean+snaptrim_wait
 * 12 active+clean+snaptrim"

Today, I restarted the MON, MGR, and MDS, but no changes in the growing.

Am 18.08.2024 um 18:39 schrieb Eugen Block:
Can you share the current ceph status? Are the OSDs reporting  
anything suspicious? How is the disk utilization?


Zitat von Giovanna Ratini :


More information:

The snaptrim take a lot of time but the he objects_trimmed are "0"

 "objects_trimmed": 0,
"snaptrim_duration": 500.5807601752,

It could explain, why the queue are growing up..


Am 17.08.2024 um 14:37 schrieb Giovanna Ratini:

Hello again,

I checked the pgs dump. Snapshot grow up

Query für PG: 3.12
{
    "snap_trimq":  
"[5b974~3b,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1]",

*    "snap_trimq_len": 5421,*
    "state": "active+clean+snaptrim",
    "epoch": 734130,

Query für PG: 3.12
{
    "snap_trimq":  
"[5b976~39,5ba53~1,5ba56~a0,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5

[ceph-users] Re: squid release codename

2024-08-19 Thread Yehuda Sadeh-Weinraub
On Sat, Aug 17, 2024 at 9:12 AM Anthony D'Atri  wrote:
>
> > It's going to wreak havoc on search engines that can't tell when
> > someone's looking up Ceph versus the long-establish Squid Proxy.
>
> Search engines are way smarter than that, and I daresay that people are far 
> more likely to search for “Ceph” or “Ceph squid" than for “squid” alone 
> looking for Ceph.
>
>
> > I don’t know how many more (sub)species there are to start over from A (the 
> > first release was Argonaut)
>
> Ammonite is a natural, and two years later we *must* release Cthulhu.
>
> Cartoon names run some risk of trademark issues.
>
> > ...  that said, naming a *release* of a software with the name of
> > well known other open source software is pure crazyness.
>
> I haven’t seen the web cache used in years — maybe still in Antarctica?  
> These are vanity names for fun.  I’ve found that more people know the numeric 
> release they run than the codename anyway.
>
> > What's coming next? Ceph Redis? Ceph Apache? Or Apache Ceph?
>
> Since you mention Apache, their “Spark” is an overload.  And Apache itself is 
> cultural appropriation but that’s a tangent.
>
> When I worked for Advanced Micro Devices we used the Auto Mounter Daemon
>
> I’ve also used AMANDA for backups, which was not a Boston song.
>
> Let’s not forget Apple’s iOS and Cisco’s IOS.
>
> Ceph Octopus, and this cable 
> https://usb.brando.com/usb-octopus-4-port-hub-cable_p999c39d15.html and of 
> course this one https://www.ebay.com/itm/110473961774
>
> The first Ceph release named after Jason’s posse.
> Bobcat colliding with skid-loaders and Goldthwaite

Originally I remember also suggesting "banana" (after bananaslug) [1];
imagine how much worse it could have been.

[1] https://marc.info/?l=ceph-devel&m=133954522619841&w=2

> Dumpling and gyoza
> Firefly and the Uriah Heep album (though Demons & Wizards was better)
> Giant and the Liz Taylor movie (and grocery store)
> Hammer and Jan
> Jewel and the singer
> Moreover, Ceph Nautilus:
> Korg software
> Process engineering software
> CMS
> GNOME file manager
> Firefox and the Clint Eastwood movie
> Chrome and the bumper on a 1962 Karmann Ghia
> Slack and the Linux distribution
>
> When I worked for Cisco, people thought I was in food service.  Namespaces 
> are crowded.  Overlap happens.  Context resolves readily.
>
> Within the Cephapod scheme we’ve used Octopus and Nautilus, to not use Squid 
> would be odd.  And Shantungendoceras doesn’t roll off the tongue.
>
>
>
> “What’s in a name?”  - Shakespeare
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: squid release codename

2024-08-19 Thread Anthony D'Atri


> On Aug 19, 2024, at 9:45 AM, Yehuda Sadeh-Weinraub  wrote:
> 
> Originally I remember also suggesting "banana" (after bananaslug) [1] , 
> imagine how much worse it could have been.


Solidigm could have been Stodesic or Velostate ;)


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-19 Thread Giovanna Ratini

Hello Eugen,

root@kube-master02:~# k ceph config get osd osd_pg_max_concurrent_snap_trims
Info: running 'ceph' command with args: [config get osd 
osd_pg_max_concurrent_snap_trims]

2
root@kube-master02:~# k ceph config get osd osd_max_trimming_pgs
Info: running 'ceph' command with args: [config get osd 
osd_max_trimming_pgs]

2

CPU usage is not too high. The VM nodes use about 15%.

Am 19.08.2024 um 15:43 schrieb Eugen Block:
There's a lengthy thread [0] where several approaches are proposed. 
The worst is a OSD recreation, but that's the last resort, of course.


What's are the current values for these configs?

ceph config get osd osd_pg_max_concurrent_snap_trims
ceph config get osd osd_max_trimming_pgs

Maybe decrease them to 1 each while the nosnaptrim flag is set, then 
unset it. You could also try online (and/or offline osd compaction) 
before unsetting the flag. Are the OSD processes utilizing an entire CPU?


[0] https://www.spinics.net/lists/ceph-users/msg75626.html

Zitat von Giovanna Ratini :


Hallo Eugen,

yes, the load is for now not too much.

I stop the snap and now this is the output. No changes in the queue.

root@kube-master02:~# k ceph -s
Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    nosnaptrim flag(s) set
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 30h)
    mgr: a(active, since 29h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 21h), 6 in (since 6d)
 flags nosnaptrim

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.21M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 32 active+clean+snaptrim_wait

  io:
    client:   7.4 MiB/s rd, 7.9 MiB/s wr, 11 op/s rd, 35 op/s wr

Am 19.08.2024 um 14:54 schrieb Eugen Block:

What happens when you disable snaptrimming entirely?

ceph osd set nosnaptrim

So the load on your cluster seems low, but are the OSDs heavily 
utilized? Have you checked iostat?


Zitat von Giovanna Ratini :


Hello Eugen,

*root@kube-master02:~# k ceph -s*

Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 13h)
    mgr: a(active, since 13h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 5h), 6 in (since 5d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.20M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 20 active+clean+snaptrim_wait
 12 active+clean+snaptrim

  io:
    client:   3.5 MiB/s rd, 3.6 MiB/s wr, 6 op/s rd, 12 op/s wr

If I understand the documentation correctly, I will never have a 
scrub unless the PGs (Placement Groups) are active and clean.


All 32 PGs of the CephFS pool have been in this status for several 
days:


 * 20 active+clean+snaptrim_wait
 * 12 active+clean+snaptrim"

Today, I restarted the MON, MGR, and MDS, but no changes in the 
growing.


Am 18.08.2024 um 18:39 schrieb Eugen Block:
Can you share the current ceph status? Are the OSDs reporting 
anything suspicious? How is the disk utilization?


Zitat von Giovanna Ratini :


More information:

The snaptrim take a lot of time but the he objects_trimmed are "0"

 "objects_trimmed": 0,
"snaptrim_duration": 500.5807601752,

It could explain, why the queue are growing up..


Am 17.08.2024 um 14:37 schrieb Giovanna Ratini:

Hello again,

I checked the pgs dump. Snapshot grow up

Query für PG: 3.12
{
    "snap_trimq": 
"[5b974~3b,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1]",

*    "snap_trimq_len": 5421,*
    "state": "active+clean+snaptrim",
    "epoch": 734130,

Query für PG: 3.12
{
    "snap_trimq": 
"[5b976~39,5ba53~1,5ba56~a0,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4

[ceph-users] Re: squid 19.1.1 RC QE validation status

2024-08-19 Thread Adam King
https://tracker.ceph.com/issues/67583 didn't reproduce across 10 reruns
https://pulpito.ceph.com/lflores-2024-08-16_00:04:51-upgrade:quincy-x-squid-release-distro-default-smithi/.
Given the original failure was just "Unable to find image '
quay.io/ceph/grafana:9.4.12' locally" which doesn't look very serious
anyway, I don't think there's any reason for the failure to hold up the
release

On Thu, Aug 15, 2024 at 6:53 PM Laura Flores  wrote:

> The upgrade suites look mostly good to me, except for one tracker I think
> would be in @Adam King 's realm to look at. If the new
> grafana issue below is deemed okay, then we can proceed with approving the
> upgrade suite.
>
> *This issue stood out to me, where the cluster had trouble pulling the
> grafana image locally to redeploy it. *@Adam King * can
> you take a look?*
>
>- *https://tracker.ceph.com/issues/67583
> - upgrade:quincy-x/stress-split:
>Cluster fails to redeploy grafana daemon after image is unable to be found
>locally*
>
>
> Otherwise, tests failed from cluster log warnings that are expected during
> upgrade tests. Many of these warnings have already been fixed and are in
> the stages of getting backported.
> I checked for each test that the cluster had upgraded all daemons to
> 19.1.1, and that was the case.
>
>- https://tracker.ceph.com/issues/66602 - rados/upgrade: Health check
>failed: 1 pool(s) do not have an application enabled (POOL_APP_NOT_ENABLED)
>- https://tracker.ceph.com/issues/65422 - upgrade/quincy-x: "1 pg
>degraded (PG_DEGRADED)" in cluster log
>- https://tracker.ceph.com/issues/67584 - upgrade:quincy-x: cluster
>[WRN] Health check failed: 1 osds down (OSD_DOWN)" in cluster log
>- https://tracker.ceph.com/issues/64460 - rados/upgrade: "[WRN]
>MON_DOWN: 1/3 mons down, quorum a,b" in cluster log
>- https://tracker.ceph.com/issues/66809 - upgrade/quincy-x;
>upgrade/reef-x: Health check failed: Reduced data availability: 1 pg
>peering (PG_AVAILABILITY)" in cluster log
>
>
>
> On Thu, Aug 15, 2024 at 11:55 AM Laura Flores  wrote:
>
>> Rados approved. Failures tracked here:
>> https://tracker.ceph.com/projects/rados/wiki/SQUID#v1911-httpstrackercephcomissues67340
>>
>> On Thu, Aug 15, 2024 at 11:30 AM Yuri Weinstein 
>> wrote:
>>
>>> Laura,
>>>
>>> The PR was cherry-picked and the `squid-release` branch was built.
>>> Please review the run results in the tracker.
>>>
>>> On Wed, Aug 14, 2024 at 2:18 PM Laura Flores  wrote:
>>> >
>>> > Hey @Yuri Weinstein ,
>>> >
>>> > We've fixed a couple of issues and now need a few things rerun.
>>> >
>>> >
>>> >1. *Can you please rerun upgrade/reef-x and upgrade/quincy-x? *
>>> >   - Reasoning: Many jobs in those suites died due to
>>> >   https://tracker.ceph.com/issues/66883, which we deduced was a
>>> recent
>>> >   merge to teuthology. Now that the affecting commit was reverted,
>>> we are
>>> >   ready to have those rerun.
>>> >2. *Can you please cherry-pick
>>> https://github.com/ceph/ceph/pull/58607
>>> > to squid-release and
>>> reschedule
>>> >rados:thrash-old-clients?*
>>> >   - Reasoning: Since we stopped building focal for squid, we can no
>>> >   longer test squid against pacific clients.
>>> >   - For this second RC, we had to make the decision to drop pacific
>>> >   from the *rados:thrash-old-clients* tests, which will now use
>>> centos
>>> >   9 stream packages to test against only reef and quincy clients (
>>> >   https://github.com/ceph/ceph/pull/58607).
>>> >   - We have raised https://tracker.ceph.com/issues/67469 to track
>>> the
>>> >   implementation of a containerized solution for older clients
>>> that don't
>>> >   have centos 9 stream packages, so that we can reincorporate
>>> > pacific in the
>>> >   future.
>>> >
>>> > After these two things are rescheduled, we can proceed with a rados
>>> suite
>>> > approval and an upgrade suite approval.
>>> >
>>> > Thanks,
>>> > Laura
>>> >
>>> > On Wed, Aug 14, 2024 at 12:49 PM Adam Emerson 
>>> wrote:
>>> >
>>> > > On 14/08/2024, Yuri Weinstein wrote:
>>> > > > Still waiting to hear back:
>>> > > >
>>> > > > rgw - Eric, Adam E
>>> > >
>>> > > Approved.
>>> > >
>>> > > (Sorry, I thought we were supposed to reply on the tracker.)
>>> > > ___
>>> > > Dev mailing list -- d...@ceph.io
>>> > > To unsubscribe send an email to dev-le...@ceph.io
>>> > >
>>> > >
>>> >
>>> > --
>>> >
>>> > Laura Flores
>>> >
>>> > She/Her/Hers
>>> >
>>> > Software Engineer, Ceph Storage 
>>> >
>>> > Chicago, IL
>>> >
>>> > lflo...@ibm.com | lflo...@redhat.com 
>>> > M: +17087388804
>>> > ___
>>> > ceph-users mailing list -- ceph-users@ceph.io
>>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>>
>>>
>>
>> --
>>
>> Laura Flores
>>
>> She/

[ceph-users] Re: squid 19.1.1 RC QE validation status

2024-08-19 Thread Laura Flores
Thanks @Adam King !

@Yuri Weinstein  the upgrade suites are approved.

On Mon, Aug 19, 2024 at 9:28 AM Adam King  wrote:

> https://tracker.ceph.com/issues/67583 didn't reproduce across 10 reruns
> https://pulpito.ceph.com/lflores-2024-08-16_00:04:51-upgrade:quincy-x-squid-release-distro-default-smithi/.
> Given the original failure was just "Unable to find image '
> quay.io/ceph/grafana:9.4.12' locally" which doesn't look very serious
> anyway, I don't think there's any reason for the failure to hold up the
> release
>
> On Thu, Aug 15, 2024 at 6:53 PM Laura Flores  wrote:
>
>> The upgrade suites look mostly good to me, except for one tracker I think
>> would be in @Adam King 's realm to look at. If the
>> new grafana issue below is deemed okay, then we can proceed with approving
>> the upgrade suite.
>>
>> *This issue stood out to me, where the cluster had trouble pulling the
>> grafana image locally to redeploy it. *@Adam King * can
>> you take a look?*
>>
>>- *https://tracker.ceph.com/issues/67583
>> - upgrade:quincy-x/stress-split:
>>Cluster fails to redeploy grafana daemon after image is unable to be found
>>locally*
>>
>>
>> Otherwise, tests failed from cluster log warnings that are expected
>> during upgrade tests. Many of these warnings have already been fixed and
>> are in the stages of getting backported.
>> I checked for each test that the cluster had upgraded all daemons to
>> 19.1.1, and that was the case.
>>
>>- https://tracker.ceph.com/issues/66602 - rados/upgrade: Health check
>>failed: 1 pool(s) do not have an application enabled 
>> (POOL_APP_NOT_ENABLED)
>>- https://tracker.ceph.com/issues/65422 - upgrade/quincy-x: "1 pg
>>degraded (PG_DEGRADED)" in cluster log
>>- https://tracker.ceph.com/issues/67584 - upgrade:quincy-x: cluster
>>[WRN] Health check failed: 1 osds down (OSD_DOWN)" in cluster log
>>- https://tracker.ceph.com/issues/64460 - rados/upgrade: "[WRN]
>>MON_DOWN: 1/3 mons down, quorum a,b" in cluster log
>>- https://tracker.ceph.com/issues/66809 - upgrade/quincy-x;
>>upgrade/reef-x: Health check failed: Reduced data availability: 1 pg
>>peering (PG_AVAILABILITY)" in cluster log
>>
>>
>>
>> On Thu, Aug 15, 2024 at 11:55 AM Laura Flores  wrote:
>>
>>> Rados approved. Failures tracked here:
>>> https://tracker.ceph.com/projects/rados/wiki/SQUID#v1911-httpstrackercephcomissues67340
>>>
>>> On Thu, Aug 15, 2024 at 11:30 AM Yuri Weinstein 
>>> wrote:
>>>
 Laura,

 The PR was cherry-picked and the `squid-release` branch was built.
 Please review the run results in the tracker.

 On Wed, Aug 14, 2024 at 2:18 PM Laura Flores 
 wrote:
 >
 > Hey @Yuri Weinstein ,
 >
 > We've fixed a couple of issues and now need a few things rerun.
 >
 >
 >1. *Can you please rerun upgrade/reef-x and upgrade/quincy-x? *
 >   - Reasoning: Many jobs in those suites died due to
 >   https://tracker.ceph.com/issues/66883, which we deduced was a
 recent
 >   merge to teuthology. Now that the affecting commit was
 reverted, we are
 >   ready to have those rerun.
 >2. *Can you please cherry-pick
 https://github.com/ceph/ceph/pull/58607
 > to squid-release and
 reschedule
 >rados:thrash-old-clients?*
 >   - Reasoning: Since we stopped building focal for squid, we can
 no
 >   longer test squid against pacific clients.
 >   - For this second RC, we had to make the decision to drop
 pacific
 >   from the *rados:thrash-old-clients* tests, which will now use
 centos
 >   9 stream packages to test against only reef and quincy clients (
 >   https://github.com/ceph/ceph/pull/58607).
 >   - We have raised https://tracker.ceph.com/issues/67469 to
 track the
 >   implementation of a containerized solution for older clients
 that don't
 >   have centos 9 stream packages, so that we can reincorporate
 > pacific in the
 >   future.
 >
 > After these two things are rescheduled, we can proceed with a rados
 suite
 > approval and an upgrade suite approval.
 >
 > Thanks,
 > Laura
 >
 > On Wed, Aug 14, 2024 at 12:49 PM Adam Emerson 
 wrote:
 >
 > > On 14/08/2024, Yuri Weinstein wrote:
 > > > Still waiting to hear back:
 > > >
 > > > rgw - Eric, Adam E
 > >
 > > Approved.
 > >
 > > (Sorry, I thought we were supposed to reply on the tracker.)
 > > ___
 > > Dev mailing list -- d...@ceph.io
 > > To unsubscribe send an email to dev-le...@ceph.io
 > >
 > >
 >
 > --
 >
 > Laura Flores
 >
 > She/Her/Hers
 >
 > Software Engineer, Ceph Storage 
 >
 > Chicago, IL
 >
 > 

[ceph-users] Re: squid 19.1.1 RC QE validation status

2024-08-19 Thread Yuri Weinstein
We need approval from Guillaume.

Laura, we also need the gibba upgrade.

On Mon, Aug 19, 2024 at 7:31 AM Laura Flores  wrote:

> Thanks @Adam King !
>
> @Yuri Weinstein  the upgrade suites are approved.
>
> On Mon, Aug 19, 2024 at 9:28 AM Adam King  wrote:
>
>> https://tracker.ceph.com/issues/67583 didn't reproduce across 10 reruns
>> https://pulpito.ceph.com/lflores-2024-08-16_00:04:51-upgrade:quincy-x-squid-release-distro-default-smithi/.
>> Given the original failure was just "Unable to find image '
>> quay.io/ceph/grafana:9.4.12' locally" which doesn't look very serious
>> anyway, I don't think there's any reason for the failure to hold up the
>> release
>>
>> On Thu, Aug 15, 2024 at 6:53 PM Laura Flores  wrote:
>>
>>> The upgrade suites look mostly good to me, except for one tracker I
>>> think would be in @Adam King 's realm to look at. If
>>> the new grafana issue below is deemed okay, then we can proceed with
>>> approving the upgrade suite.
>>>
>>> *This issue stood out to me, where the cluster had trouble pulling the
>>> grafana image locally to redeploy it. *@Adam King * can
>>> you take a look?*
>>>
>>>- *https://tracker.ceph.com/issues/67583
>>> - upgrade:quincy-x/stress-split:
>>>Cluster fails to redeploy grafana daemon after image is unable to be 
>>> found
>>>locally*
>>>
>>>
>>> Otherwise, tests failed from cluster log warnings that are expected
>>> during upgrade tests. Many of these warnings have already been fixed and
>>> are in the stages of getting backported.
>>> I checked for each test that the cluster had upgraded all daemons to
>>> 19.1.1, and that was the case.
>>>
>>>- https://tracker.ceph.com/issues/66602 - rados/upgrade: Health
>>>check failed: 1 pool(s) do not have an application enabled
>>>(POOL_APP_NOT_ENABLED)
>>>- https://tracker.ceph.com/issues/65422 - upgrade/quincy-x: "1 pg
>>>degraded (PG_DEGRADED)" in cluster log
>>>- https://tracker.ceph.com/issues/67584 - upgrade:quincy-x: cluster
>>>[WRN] Health check failed: 1 osds down (OSD_DOWN)" in cluster log
>>>- https://tracker.ceph.com/issues/64460 - rados/upgrade: "[WRN]
>>>MON_DOWN: 1/3 mons down, quorum a,b" in cluster log
>>>- https://tracker.ceph.com/issues/66809 - upgrade/quincy-x;
>>>upgrade/reef-x: Health check failed: Reduced data availability: 1 pg
>>>peering (PG_AVAILABILITY)" in cluster log
>>>
>>>
>>>
>>> On Thu, Aug 15, 2024 at 11:55 AM Laura Flores 
>>> wrote:
>>>
 Rados approved. Failures tracked here:
 https://tracker.ceph.com/projects/rados/wiki/SQUID#v1911-httpstrackercephcomissues67340

 On Thu, Aug 15, 2024 at 11:30 AM Yuri Weinstein 
 wrote:

> Laura,
>
> The PR was cherry-picked and the `squid-release` branch was built.
> Please review the run results in the tracker.
>
> On Wed, Aug 14, 2024 at 2:18 PM Laura Flores 
> wrote:
> >
> > Hey @Yuri Weinstein ,
> >
> > We've fixed a couple of issues and now need a few things rerun.
> >
> >
> >1. *Can you please rerun upgrade/reef-x and upgrade/quincy-x? *
> >   - Reasoning: Many jobs in those suites died due to
> >   https://tracker.ceph.com/issues/66883, which we deduced was a
> recent
> >   merge to teuthology. Now that the affecting commit was
> reverted, we are
> >   ready to have those rerun.
> >2. *Can you please cherry-pick
> https://github.com/ceph/ceph/pull/58607
> > to squid-release and
> reschedule
> >rados:thrash-old-clients?*
> >   - Reasoning: Since we stopped building focal for squid, we can
> no
> >   longer test squid against pacific clients.
> >   - For this second RC, we had to make the decision to drop
> pacific
> >   from the *rados:thrash-old-clients* tests, which will now use
> centos
> >   9 stream packages to test against only reef and quincy clients
> (
> >   https://github.com/ceph/ceph/pull/58607).
> >   - We have raised https://tracker.ceph.com/issues/67469 to
> track the
> >   implementation of a containerized solution for older clients
> that don't
> >   have centos 9 stream packages, so that we can reincorporate
> > pacific in the
> >   future.
> >
> > After these two things are rescheduled, we can proceed with a rados
> suite
> > approval and an upgrade suite approval.
> >
> > Thanks,
> > Laura
> >
> > On Wed, Aug 14, 2024 at 12:49 PM Adam Emerson 
> wrote:
> >
> > > On 14/08/2024, Yuri Weinstein wrote:
> > > > Still waiting to hear back:
> > > >
> > > > rgw - Eric, Adam E
> > >
> > > Approved.
> > >
> > > (Sorry, I thought we were supposed to reply on the tracker.)
> > > ___
> > > Dev mailing list -- d...@ceph.

[ceph-users] Re: squid 19.1.1 RC QE validation status

2024-08-19 Thread Laura Flores
I can do the gibba upgrade after everything's approved.

On Mon, Aug 19, 2024 at 9:47 AM Yuri Weinstein  wrote:

> We need approval from Guillaume
>
> Laura, and gibba upgrade.
>
> On Mon, Aug 19, 2024 at 7:31 AM Laura Flores  wrote:
>
>> Thanks @Adam King !
>>
>> @Yuri Weinstein  the upgrade suites are approved.
>>
>> On Mon, Aug 19, 2024 at 9:28 AM Adam King  wrote:
>>
>>> https://tracker.ceph.com/issues/67583 didn't reproduce across 10 reruns
>>> https://pulpito.ceph.com/lflores-2024-08-16_00:04:51-upgrade:quincy-x-squid-release-distro-default-smithi/.
>>> Given the original failure was just "Unable to find image '
>>> quay.io/ceph/grafana:9.4.12' locally" which doesn't look very serious
>>> anyway, I don't think there's any reason for the failure to hold up the
>>> release
>>>
>>> On Thu, Aug 15, 2024 at 6:53 PM Laura Flores  wrote:
>>>
 The upgrade suites look mostly good to me, except for one tracker I
 think would be in @Adam King 's realm to look at.
 If the new grafana issue below is deemed okay, then we can proceed with
 approving the upgrade suite.

 *This issue stood out to me, where the cluster had trouble pulling the
 grafana image locally to redeploy it. *@Adam King * can
 you take a look?*

- *https://tracker.ceph.com/issues/67583
 - upgrade:quincy-x/stress-split:
Cluster fails to redeploy grafana daemon after image is unable to be 
 found
locally*


 Otherwise, tests failed from cluster log warnings that are expected
 during upgrade tests. Many of these warnings have already been fixed and
 are in the stages of getting backported.
 I checked for each test that the cluster had upgraded all daemons to
 19.1.1, and that was the case.

- https://tracker.ceph.com/issues/66602 - rados/upgrade: Health
check failed: 1 pool(s) do not have an application enabled
(POOL_APP_NOT_ENABLED)
- https://tracker.ceph.com/issues/65422 - upgrade/quincy-x: "1 pg
degraded (PG_DEGRADED)" in cluster log
- https://tracker.ceph.com/issues/67584 - upgrade:quincy-x: cluster
[WRN] Health check failed: 1 osds down (OSD_DOWN)" in cluster log
- https://tracker.ceph.com/issues/64460 - rados/upgrade: "[WRN]
MON_DOWN: 1/3 mons down, quorum a,b" in cluster log
- https://tracker.ceph.com/issues/66809 - upgrade/quincy-x;
upgrade/reef-x: Health check failed: Reduced data availability: 1 pg
peering (PG_AVAILABILITY)" in cluster log



 On Thu, Aug 15, 2024 at 11:55 AM Laura Flores 
 wrote:

> Rados approved. Failures tracked here:
> https://tracker.ceph.com/projects/rados/wiki/SQUID#v1911-httpstrackercephcomissues67340
>
> On Thu, Aug 15, 2024 at 11:30 AM Yuri Weinstein 
> wrote:
>
>> Laura,
>>
>> The PR was cherry-picked and the `squid-release` branch was built.
>> Please review the run results in the tracker.
>>
>> On Wed, Aug 14, 2024 at 2:18 PM Laura Flores 
>> wrote:
>> >
>> > Hey @Yuri Weinstein ,
>> >
>> > We've fixed a couple of issues and now need a few things rerun.
>> >
>> >
>> >1. *Can you please rerun upgrade/reef-x and upgrade/quincy-x? *
>> >   - Reasoning: Many jobs in those suites died due to
>> >   https://tracker.ceph.com/issues/66883, which we deduced was
>> a recent
>> >   merge to teuthology. Now that the affecting commit was
>> reverted, we are
>> >   ready to have those rerun.
>> >2. *Can you please cherry-pick
>> https://github.com/ceph/ceph/pull/58607
>> > to squid-release and
>> reschedule
>> >rados:thrash-old-clients?*
>> >   - Reasoning: Since we stopped building focal for squid, we
>> can no
>> >   longer test squid against pacific clients.
>> >   - For this second RC, we had to make the decision to drop
>> pacific
>> >   from the *rados:thrash-old-clients* tests, which will now use
>> centos
>> >   9 stream packages to test against only reef and quincy
>> clients (
>> >   https://github.com/ceph/ceph/pull/58607).
>> >   - We have raised https://tracker.ceph.com/issues/67469 to
>> track the
>> >   implementation of a containerized solution for older clients
>> that don't
>> >   have centos 9 stream packages, so that we can reincorporate
>> > pacific in the
>> >   future.
>> >
>> > After these two things are rescheduled, we can proceed with a rados
>> suite
>> > approval and an upgrade suite approval.
>> >
>> > Thanks,
>> > Laura
>> >
>> > On Wed, Aug 14, 2024 at 12:49 PM Adam Emerson 
>> wrote:
>> >
>> > > On 14/08/2024, Yuri Weinstein wrote:
>> > > > Still waiting to hear back:
>> > > >
>

[ceph-users] CLT meeting notes August 19th 2024

2024-08-19 Thread Adam King
- [travisn] Arm64 OSDs crashing on v18.2.4, need a fix in v18.2.5
  - https://tracker.ceph.com/issues/67213
  - tcmalloc issue, solved by rebuilding the gperftools package
  - Travis to reach out to Rongqi Sun about the issue
  - moving away from tcmalloc would probably cause performance regressions
    and is therefore undesirable
- [zdover]
  - ceph-users list moderation minute
  - the whole Python mess seems to be fixed. Release branches on
    docs.ceph.com are now updating.
- [venky]
  - CVE testing process/documentation
    - https://docs.ceph.com/en/reef/dev/release-process/?highlight=ceph%20release%20process#security-release-process-deviation
    - does not include the testing procedure
    - currently, even when pointed towards branch on private repo, tries to
      pull packages from ceph-ci
    - Venky and Zac Dover to coordinate on creating the docs
  - should this documentation be publicly available?
    - only once CI/build issues that require the builds being in public
      places are fixed
    - just in a google doc for now
- [Eric I]
  - How do people feel that Read the Docs is super-imposing ads over our
    docs? So far I've only seen one for www.ethicalads.io.
  - comes with Read the Docs, not enough of an issue to justify moving
    away from it
- [DanV]
  - Cephalocon CFP is closed: 175 submissions! (for around 50 sessions)
  - Program Committee reviewing until the end of August.
  - On-site Dev Summit being planned for Dec 3, will go live here:
    https://indico.cern.ch/e/ceph-developer-summit
  - 3-4 rooms at CERN IT for dev whiteboarding. power users session in the
    afternoon.
  - If you still want to submit a CFP, talk to Dan ASAP
  - still looking for additional sponsors
- 19.1.1
  - almost ready, need ceph-volume approval and Gibba upgrade
- 17.2.8
  - mark PRs if you need them in the release
  - Yuri to start testing soon
  - core team to potentially do scrub of backport PRs
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prometheus and "404" error on console

2024-08-19 Thread Tim Holloway
Since I use keepalived, I can affirm with virtual certainty that
keepalived could do stuff like that, although it may involve using a
special IP address that keepalived would aim at the preferred server
instance.

But that's not the problem here, as "404" means that the server is up,
but it sneers at what you are pushing at it.

If Prometheus could, for example, log WHAT is being pushed at it, it
would go a long way.
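
In case it helps anyone else poking at the same thing, these are the
knobs I intend to check first (a hedged sketch from memory, assuming a
fairly standard dashboard/Prometheus deployment; verify the sub-commands
against your release, and the hostname below is just a placeholder):

  # Which Prometheus endpoint does the dashboard think it should query?
  ceph dashboard get-prometheus-api-host

  # Point it at the node that actually runs Prometheus (placeholder URL)
  ceph dashboard set-prometheus-api-host http://prometheus-host.mydomain.com:9095

  # Raise the dashboard's log level so the failing requests show up in the mgr log
  ceph dashboard debug enable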

   Tim

On Mon, 2024-08-19 at 08:40 -0400, Daniel Brown wrote:
> 
> 
> I’ve seen similar. 
> 
> Have been wondering if it would be possible to either set up a
> LoadBalancer or something like “keepalived” to provide a “VIP” which
> could move between nodes to support the dashboard (and Prometheus,
> Grafana, etc).
> 
> I do see notes about HA Proxy in the docs, but haven’t gotten to
> trying that setup: 
> 
> https://docs.ceph.com/en/quincy/mgr/dashboard/#haproxy-example-configuration
> 
> 
> 
>  
> In my opinion, a VIP for the dashboard (etc.) could and maybe should
> be an out-of-the-box config.
> 
> 
> 
> 
> 
> > On Aug 19, 2024, at 8:23 AM, Tim Holloway 
> > wrote:
> > 
> > Although I'm seeing this in Pacific, it appears to be a perennial
> > issue
> > with no well-documented solution. The dashboard home screen is
> > flooded
> > with popups saying "404 - Not Found
> > 
> > Could not reach Prometheus's API on
> > http://ceph1234.mydomain.com:9095/api/v1
> > "
> > 
> > If I was a slack-jawed PHB casually wandering into the operations
> > center and saw that, I'd probably doubt that Ceph was a good
> > product
> > decision.
> > 
> > The "404" indicates that the Prometheus server is running and
> > accepting
> > requests, but all the "404" says is that whatever requests it's
> > receiving are meaningless to it. Presumably either some handler
> > needs
> > to be jacked into Prometheus or the sender needs to be notified to
> > desist.
> > 
> > Unfortunately, "404" doesn't provide any clues. Ideally, ceph
> > should
> > log something more specific by default, but failing that, if anyone
> > knows how to shut it up (cleanly), I'd appreciate knowing!
> > 
> >   Tim
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bug with Cephadm module osd service preventing orchestrator start

2024-08-19 Thread Benjamin Huth
Thank you so much for the help! Thanks to the issue you linked and the
other guy you replied to with the same issue, I was able to edit the
config-key and get my orchestrator back. Sorry for not checking the issues
as well as I should have, that's my bad there.
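
In case someone else lands here with the same symptom, this is roughly
what the fix looked like on my end (a sketch from memory, not an exact
transcript — back up the key and double-check the edited JSON before
writing it back, and only clear the queue if nothing should actually be
draining):

  # save the broken queue so it can be restored if needed
  ceph config-key get mgr/cephadm/osd_remove_queue > osd_remove_queue.json

  # edit osd_remove_queue.json: add the missing "original_weight" fields,
  # or reduce it to an empty list [] if no removals should be pending

  # write it back and restart the active mgr so cephadm reloads the queue
  ceph config-key set mgr/cephadm/osd_remove_queue -i osd_remove_queue.json
  ceph mgr fail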

On Mon, Aug 19, 2024 at 6:12 AM Eugen Block  wrote:

> There's a tracker issue for this:
>
> https://tracker.ceph.com/issues/67329
>
> Zitat von Eugen Block :
>
> > Hi,
> >
> > what is the output of this command?
> >
> > ceph config-key get mgr/cephadm/osd_remove_queue
> >
> > I just tried to cancel a draining on a small 18.2.4 test cluster, it
> > went well, though. After scheduling the drain the mentioned key
> > looks like this:
> >
> > # ceph config-key get mgr/cephadm/osd_remove_queue
> > [{"osd_id": 1, "started": true, "draining": false, "stopped": false,
> > "replace": false, "force": false, "zap": false, "hostname": "host5",
> > "original_weight": 0.0233917236328125, "drain_started_at": null,
> > "drain_stopped_at": null, "drain_done_at": null,
> > "process_started_at": "2024-08-19T07:21:27.783527Z"}, {"osd_id": 13,
> > "started": true, "draining": true, "stopped": false, "replace":
> > false, "force": false, "zap": false, "hostname": "host5",
> > "original_weight": 0.0233917236328125, "drain_started_at":
> > "2024-08-19T07:21:30.365237Z", "drain_stopped_at": null,
> > "drain_done_at": null, "process_started_at":
> > "2024-08-19T07:21:27.794688Z"}]
> >
> > Here you see the original_weight which the orchestrator failed to
> > read, apparently. (Note that there are only small 20 GB OSDs, hence
> > the small weight). You probably didn't have the output while the
> > OSDs were scheduled for draining, correct? I was able to break my
> > cephadm module by injecting that json again (it was already
> > completed, hence empty), but maybe I did it incorrectly, not sure yet.
> >
> > Regards,
> > Eugen
> >
> > Zitat von Benjamin Huth :
> >
> >> So about a week and a half ago, I started a drain on an incorrect host.
> I
> >> fairly quickly realized that it was the wrong host, so I stopped the
> drain,
> >> canceled the osd deletions with "ceph orch osd rm stop OSD_ID", then
> >> dumped, edited the crush map to properly reweight those osds and host,
> and
> >> applied the edited crush map. I then proceeded with a full drain of the
> >> correct host and completed that before attempting to upgrade my cluster.
> >>
> >> I started the upgrade, and all 3 of my managers were upgraded from
> 18.2.2
> >> to 18.2.4. At this point, my managers started back up, but with an
> >> orchestrator that had failed to start, so the upgrade was unable to
> >> continue. My cluster is in a stage where only the 3 managers are
> upgraded
> >> to 18.2.4 and every other part is at 18.2.2
> >>
> >> Since my orchestrator is not able to start, I'm unfortunately not able
> to
> >> run any ceph orch commands as I receive "Error ENOENT: Module not found"
> >> because the cephadm module doesn't load.
> >> Output of ceph versions:
> >> {
> >>"mon": {
> >>"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
> >> reef (stable)": 5
> >>},
> >>"mgr": {
> >>"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
> >> reef (stable)": 1
> >>},
> >>"osd": {
> >>"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
> >> reef (stable)": 119
> >>},
> >>"mds": {
> >>"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
> >> reef (stable)": 4
> >>},
> >>"overall": {
> >>"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
> >> reef (stable)": 129
> >>}
> >> }
> >>
> >> I mentioned in my previous post that I tried manually downgrading the
> >> managers to 18.2.2 because I thought there may be an issue with 18.2.4,
> but
> >> 18.2.2 also has the PR that I believe is causing this (
> >>
> https://github.com/ceph/ceph/commit/ba7fac074fb5ad072fcad10862f75c0a26a7591d
> )
> >> so no luck
> >>
> >> Thanks!
> >> (so sorry, I did not reply all so you may have received this twice)
> >>
> >> On Sat, Aug 17, 2024 at 2:55 AM Eugen Block  wrote:
> >>
> >>> Just to get some background information, did you remove OSDs while
> >>> performing the upgrade? Or did you start OSD removal and then started
> >>> the upgrade? Upgrades should be started with a healthy cluster, but
> >>> one can’t guarantee that of course, OSDs and/or entire hosts can
> >>> obviously also fail during an upgrade.
> >>> Just trying to understand what could cause this (I haven’t upgraded
> >>> production clusters to Reef yet, only test clusters). Have you stopped
> >>> the upgrade to cancel the process entirely? Can you share this
> >>> information please:
> >>>
> >>> ceph versions
> >>> ceph orch upgrade status
> >>>
> >>> Zitat von Benjamin Huth :
> >>>
>  Just wanted to follow up on this, I am unfortunately still stuck with
> >>> this
>  and can't find where the json for this value is stored. I'm wondering
> i

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-19 Thread Giovanna Ratini

Hello Eugen,

yesterday, after stopping and re-enabling snaptrim, the queue decreased a 
little and then got stuck.

It neither grew nor shrank.

Is that good or bad?


Am 19.08.2024 um 15:43 schrieb Eugen Block:
There's a lengthy thread [0] where several approaches are proposed. 
The worst is an OSD recreation, but that's the last resort, of course.


What are the current values for these configs?

ceph config get osd osd_pg_max_concurrent_snap_trims
ceph config get osd osd_max_trimming_pgs

Maybe decrease them to 1 each while the nosnaptrim flag is set, then 
unset it. You could also try online (and/or offline) OSD compaction 
before unsetting the flag. Are the OSD processes utilizing an entire CPU?
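
Roughly the sequence I have in mind (just a sketch, adjust values to 
your cluster):

  # current settings
  ceph config get osd osd_pg_max_concurrent_snap_trims
  ceph config get osd osd_max_trimming_pgs

  # pause trimming, throttle it down, compact, then let it resume slowly
  ceph osd set nosnaptrim
  ceph config set osd osd_pg_max_concurrent_snap_trims 1
  ceph config set osd osd_max_trimming_pgs 1
  ceph tell osd.* compact    # online compaction; per-OSD (ceph tell osd.<id> compact) is gentler
  ceph osd unset nosnaptrim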


[0] https://www.spinics.net/lists/ceph-users/msg75626.html

Zitat von Giovanna Ratini :


Hello Eugen,

yes, the load is not too high for now.

I stopped snaptrim and this is the output now. No changes in the queue.

root@kube-master02:~# k ceph -s
Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    nosnaptrim flag(s) set
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 30h)
    mgr: a(active, since 29h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 21h), 6 in (since 6d)
 flags nosnaptrim

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.21M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 32 active+clean+snaptrim_wait

  io:
    client:   7.4 MiB/s rd, 7.9 MiB/s wr, 11 op/s rd, 35 op/s wr

Am 19.08.2024 um 14:54 schrieb Eugen Block:

What happens when you disable snaptrimming entirely?

ceph osd set nosnaptrim

So the load on your cluster seems low, but are the OSDs heavily 
utilized? Have you checked iostat?


Zitat von Giovanna Ratini :


Hello Eugen,

*root@kube-master02:~# k ceph -s*

Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 13h)
    mgr: a(active, since 13h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 5h), 6 in (since 5d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.20M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 20 active+clean+snaptrim_wait
 12 active+clean+snaptrim

  io:
    client:   3.5 MiB/s rd, 3.6 MiB/s wr, 6 op/s rd, 12 op/s wr

If I understand the documentation correctly, I will never have a 
scrub unless the PGs (Placement Groups) are active and clean.


All 32 PGs of the CephFS pool have been in this status for several 
days:


 * 20 active+clean+snaptrim_wait
 * 12 active+clean+snaptrim"

Today, I restarted the MON, MGR, and MDS, but the queue keeps growing.


Am 18.08.2024 um 18:39 schrieb Eugen Block:
Can you share the current ceph status? Are the OSDs reporting 
anything suspicious? How is the disk utilization?


Zitat von Giovanna Ratini :


More information:

The snaptrim takes a lot of time, but objects_trimmed is "0":

 "objects_trimmed": 0,
"snaptrim_duration": 500.5807601752,

That could explain why the queue keeps growing.


Am 17.08.2024 um 14:37 schrieb Giovanna Ratini:

Hello again,

I checked the pg dump. The snap trim queue keeps growing:

Query for PG: 3.12
{
    "snap_trimq": 
"[5b974~3b,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1]",

*    "snap_trimq_len": 5421,*
    "state": "active+clean+snaptrim",
    "epoch": 734130,

Query for PG: 3.12
{
    "snap_trimq": 
"[5b976~39,5ba53~1,5ba56~a0,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1