[ceph-users] FS down
Hi all, I need your help! Our FS is degraded.

Health: mds.1 is damaged

Running "ceph tell mds.1 damage ls" only returns:
resolve_mds: gid 1 not in mds map

Best regards,
Sake
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
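(Not part of the original message -- a minimal troubleshooting sketch, assuming a filesystem named "cephfs"; "mds.<hostname>.<id>" is a placeholder for a real daemon name. "ceph tell" resolves MDS targets via the MDS map, which is likely why "mds.1" could not be resolved while the rank is damaged and has no active daemon.)

    # Show which rank is damaged and which daemons exist
    ceph health detail
    ceph fs status cephfs

    # Query the damage table of a running MDS daemon by its daemon name
    ceph tell mds.<hostname>.<id> damage ls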
[ceph-users] Re: mds generates slow request: peer_request, how to deal with it?
Hi David,

What does your filesystem look like? We have a few folders with a lot of subfolders, which are all randomly accessed, and I guess the balancer is moving a lot of folders between the MDS nodes.

We noticed that multiple active MDS daemons aren't working in this setup, with the same errors as you get. After restarting the problematic MDS, everything is fine for a few hours and then the errors show up again. So for now we reverted to 1 MDS (the load is low with the holidays). The load on the cluster was also very high (1000+ IOPS and 100+ MB traffic) with multiple MDS daemons, as if it kept load-balancing folders over the active MDS nodes. The load is currently around 500 IOPS and 50 MB traffic, or even lower. After the holidays I'm going to see what I can achieve with manually pinning directories to MDS ranks (see the sketch after this message).

Best regards,
Sake

On 31 Dec 2023 09:01, David Yang wrote:
I hope this message finds you well. I have a cephfs cluster with 3 active mds, and use 3-node samba to export through the kernel. Currently, there are 2 node mds experiencing slow requests. We have tried restarting the mds. After a few hours, the replay log status became active. But the slow request reappears. The slow request does not seem to come from the client, but from the request of the mds node. Looking forward to your prompt response.

HEALTH_WARN 2 MDSs report slow requests; 2 MDSs behind on trimming
[WRN] MDS_SLOW_REQUEST: 2 MDSs report slow requests
    mds.osd44(mds.0): 2 slow requests are blocked > 30 secs
    mds.osd43(mds.1): 2 slow requests are blocked > 30 secs
[WRN] MDS_TRIM: 2 MDSs behind on trimming
    mds.osd44(mds.0): Behind on trimming (18642/1024) max_segments: 1024, num_segments: 18642
    mds.osd43(mds.1): Behind on trimming (976612/1024) max_segments: 1024, num_segments: 976612

mds.0
{ "ops": [
  { "description": "peer_request:mds.1:1",
    "initiated_at": "2023-12-31T11:19:38.679925+0800",
    "age": 4358.8009461359998,
    "duration": 4358.8009636369998,
    "type_data": {
      "flag_point": "dispatched",
      "reqid": "mds.1:1",
      "op_type": "peer_request",
      "leader_info": { "leader": "1" },
      "events": [
        { "time": "2023-12-31T11:19:38.679925+0800", "event": "initiated" },
        { "time": "2023-12-31T11:19:38.679925+0800", "event": "throttled" },
        { "time": "2023-12-31T11:19:38.679925+0800", "event": "header_read" },
        { "time": "2023-12-31T11:19:38.679936+0800", "event": "all_read" },
        { "time": "2023-12-31T11:19:38.679940+0800", "event": "dispatched" } ] } },
  { "description": "peer_request:mds.1:2",
    "initiated_at": "2023-12-31T11:19:38.679938+0800",
    "age": 4358.8009326969996,
    "duration": 4358.800976354,
    "type_data": {
      "flag_point": "dispatched",
      "reqid": "mds.1:2",
      "op_type": "peer_request",
      "leader_info": { "leader": "1" },
      "events": [
        { "time": "2023-12-31T11:19:38.679938+0800", "event": "initiated" },
        { "time": "2023-12-31T11:19:38.679938+0800", "event": "throttled" },
        { "time": "2023-12-31T11:19:38.679938+0800", "event": "header_read" },
        { "time": "2023-12-31T11:19:38.679941+0800", "event":
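(Not from the original thread -- a minimal sketch of the manual directory pinning mentioned above, assuming a kernel mount at /mnt/cephfs and top-level directories named app2 and app4; adjust paths and ranks to your own layout.)

    # Pin a directory tree to a specific MDS rank so the balancer leaves it alone
    setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/app2   # rank 0 serves app2
    setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/app4   # rank 1 serves app4

    # Setting the pin to -1 removes it again
    setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/app2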
[ceph-users] Restful API and Cephfs quota usage
Hi! I would like to build a simple PowerShell script which monitors the quotas set on certain directories. Is this possible via the RESTful API?

Some extra information:
- Ceph version 17.2.6
- Deployed via cephadm, with mgr nodes exposing an accessible REST API.

Folder structure:
/
  Folder 1/
  Folder 2/
  Folder 3/

The (different) quotas are set on Folders 1-3. I would like to know, for example, if a certain folder will hit 90% usage of the set quota. I can find pool information etc. via the RESTful API, but there seems to be nothing on quota usage. Or does someone have another method to inform users/admins about thresholds like 90% usage being hit?

Best regards,
Sake
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
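(Not from the original thread -- a sketch of one alternative method, assuming the filesystem is mounted at /mnt/cephfs on a Linux host and that the quotas were set via the standard ceph.quota.* attributes; the 90% threshold is just the example from the question, and the sketch assumes a quota is actually set on the directory.)

    dir=/mnt/cephfs/Folder1
    quota=$(getfattr --only-values -n ceph.quota.max_bytes "$dir")   # configured quota in bytes
    used=$(getfattr --only-values -n ceph.dir.rbytes "$dir")         # recursive bytes used
    pct=$(( used * 100 / quota ))
    [ "$pct" -ge 90 ] && echo "WARNING: $dir is at ${pct}% of its quota"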
[ceph-users] Re: Restful API and Cephfs quota usage
Not sure why my message shows up as an HTML attachment..

Best regards,
Sake

On Jun 14, 2023 08:53, Sake wrote:
Hi! I would like to build a simple PowerShell script which monitors the quotas set on certain directories. Is this possible via the RESTful API?

Some extra information:
- Ceph version 17.2.6
- Deployed via cephadm, with mgr nodes exposing an accessible REST API.

Folder structure:
/
  Folder 1/
  Folder 2/
  Folder 3/

The (different) quotas are set on Folders 1-3. I would like to know, for example, if a certain folder will hit 90% usage of the set quota. I can find pool information etc. via the RESTful API, but there seems to be nothing on quota usage. Or does someone have another method to inform users/admins about thresholds like 90% usage being hit?

Best regards,
Sake
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Rebuilding data resiliency after adding new OSD's stuck for so long at 5%
Which version do you use? Quincy currently has incorrect values for its new IOPS scheduler (mClock); this will be fixed in the next release (hopefully soon). There are workarounds, please check the mailing list about this; I'm in a hurry, so I can't point directly to the correct post.

Best regards,
Sake

On 14 Sept 2023 07:55, sharathvuthp...@gmail.com wrote:
Hi, We have HDD disks. Today, after almost 36 hours, Rebuilding Data Resiliency is 58% and still going on. The good thing is it is not stuck at 5%. Does it take this long to complete the rebuilding resiliency process whenever there is maintenance in the cluster?
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Slow recovery and inaccurate recovery figures since Quincy upgrade
Hi,

Please take a look at the following thread: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/PWHG6QJ6N2TJEYD2U4AXJAJ23CRPJG4E/#7ZMBM23GXYFIGY52ZWJDY5NUSYSDSYL6

In short, the default value for "osd_mclock_cost_per_byte_usec_hdd" isn't correct. With the release of 17.2.7 this option will be gone, but the recovery speed will be fixed :)

Best regards,
Sake
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
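(Not from the original message -- a sketch of how the workaround from the linked thread is applied on pre-17.2.7 Quincy clusters; the actual value comes from that thread and depends on your HDDs, so treat <value> as a placeholder.)

    # Inspect the currently effective value on one OSD
    ceph config show osd.0 osd_mclock_cost_per_byte_usec_hdd

    # Override it for all OSDs until 17.2.7 removes the option
    ceph config set osd osd_mclock_cost_per_byte_usec_hdd <value>

    # Revert to the default later
    ceph config rm osd osd_mclock_cost_per_byte_usec_hdd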
[ceph-users] Re: Patching Ceph cluster
Yeah, we fully automated this with Ansible. In short we do the following:

1. Check if the cluster is healthy before continuing (via REST API); only HEALTH_OK is good
2. Disable scrub and deep-scrub
3. Update all applications on all the hosts in the cluster
4. For every host, one by one, do the following:
4a. Check if applications got updated
4b. Check via the reboot-hint if a reboot is necessary
4c. If applications got updated or a reboot is necessary, do the following:
4c1. Put the host in maintenance
4c2. Reboot the host if necessary
4c3. Check and wait via 'ceph orch host ls' until the status of the host is maintenance and nothing else
4c4. Take the host out of maintenance
4d. Check if the cluster is healthy before continuing (via REST API); only warnings about scrub and deep-scrub are allowed, and no PGs should be degraded
5. Enable scrub and deep-scrub when all hosts are done
6. Check if the cluster is healthy (via REST API); only HEALTH_OK is good
7. Done

(A condensed shell sketch of these steps follows after this message.)

For upgrading the OS we have something similar, but exiting maintenance mode is broken (with 17.2.7) :(
I need to check the tracker for similar issues and if I can't find anything, I will create a ticket.

Kind regards,
Sake

> Op 12-06-2024 19:02 CEST schreef Daniel Brown :
>
>
> I have two ansible roles, one for enter, one for exit. There's likely better ways to do this -- and I'll not be surprised if someone here lets me know. They're using orch commands via the cephadm shell. I'm using Ansible for other configuration management in my environment, as well, including setting up clients of the ceph cluster.
>
> Below excerpts from main.yml in the "tasks" for the enter/exit roles. The host I'm running ansible from is one of my CEPH servers - I've limited which processes run there though, so it's in the cluster but not equal to the others.
>
> --
> Enter
> --
>
> - name: Ceph Maintenance Mode Enter
>   shell:
>     cmd: 'cephadm shell ceph orch host maintenance enter {{ (ansible_ssh_host|default(ansible_host))|default(inventory_hostname) }} --force --yes-i-really-mean-it'
>   become: True
>
> --
> Exit
> --
>
> - name: Ceph Maintenance Mode Exit
>   shell:
>     cmd: 'cephadm shell ceph orch host maintenance exit {{ (ansible_ssh_host|default(ansible_host))|default(inventory_hostname) }}'
>   become: True
>   connection: local
>
> - name: Wait for Ceph to be available
>   ansible.builtin.wait_for:
>     delay: 60
>     host: '{{ (ansible_ssh_host|default(ansible_host))|default(inventory_hostname) }}'
>     port: 9100
>   connection: local
>
>
> > On Jun 12, 2024, at 11:28 AM, Michael Worsham wrote:
> >
> > Interesting. How do you set this "maintenance mode"? If you have a series of documented steps that you have to do and could provide as an example, that would be beneficial for my efforts.
> >
> > We are in the process of standing up both a dev-test environment consisting of 3 Ceph servers (strictly for testing purposes) and a new production environment consisting of 20+ Ceph servers.
> >
> > We are using Ubuntu 22.04.
> >
> > -- Michael
> > From: Daniel Brown
> > Sent: Wednesday, June 12, 2024 9:18 AM
> > To: Anthony D'Atri
> > Cc: Michael Worsham ; ceph-users@ceph.io
> > Subject: Re: [ceph-users] Patching Ceph cluster
> > This is an external email. Please take care when clicking links or opening attachments. When in doubt, check with the Help Desk or Security.
> > > > > > There’s also a Maintenance mode that you can set for each server, as you’re > > doing updates, so that the cluster doesn’t try to move data from affected > > OSD’s, while the server being updated is offline or down. I’ve worked some > > on automating this with Ansible, but have found my process (and/or my > > cluster) still requires some manual intervention while it’s running to get > > things done cleanly. > > > > > > > > > On Jun 12, 2024, at 8:49 AM, Anthony D'Atri > > > wrote: > > > > > > Do you mean patching the OS? > > > > > > If so, easy -- one node at a time, then after it comes back up, wait > > > until all PGs are active+clean and the mon quorum is complete before > > > proceeding. > > > > > > > > > > > >> On Jun 12, 2024, at 07:56, Michael Worsham > > >> wrote: > > >> > > >> What is the proper way to patch a Ceph cluster and reboot the servers in > > >> said cluster if a reboo
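(Not from the original thread -- a condensed shell sketch of the procedure described above, as plain ceph/cephadm commands rather than Ansible; "host1" is a placeholder and the health checks are deliberately simplistic.)

    ceph health | grep -q HEALTH_OK || exit 1           # step 1: only proceed when healthy
    ceph osd set noscrub && ceph osd set nodeep-scrub   # step 2: pause scrubbing

    # step 4c: per host, drain it while patching/rebooting
    ceph orch host maintenance enter host1 --force
    # ... patch packages / reboot host1 here ...
    ceph orch host ls | grep host1                      # wait until the status shows "maintenance"
    ceph orch host maintenance exit host1

    ceph osd unset noscrub && ceph osd unset nodeep-scrub   # step 5: resume scrubbing
    ceph health                                             # step 6: final check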
[ceph-users] Re: Patching Ceph cluster
I needed to do some cleaning before I could share this :) Maybe you or someone else can use it. Kind regards, Sake > Op 14-06-2024 03:53 CEST schreef Michael Worsham > : > > > I'd love to see what your playbook(s) looks like for doing this. > > -- Michael > > From: Sake Ceph > Sent: Thursday, June 13, 2024 4:05 PM > To: ceph-users@ceph.io > Subject: [ceph-users] Re: Patching Ceph cluster > > This is an external email. Please take care when clicking links or opening > attachments. When in doubt, check with the Help Desk or Security. > > > Yeah we fully automated this with Ansible. In short we do the following. > > 1. Check if cluster is healthy before continuing (via REST-API) only > health_ok is good > 2. Disable scrub and deep-scrub > 3. Update all applications on all the hosts in the cluster > 4. For every host, one by one, do the following: > 4a. Check if applications got updated > 4b. Check via reboot-hint if a reboot is necessary > 4c. If applications got updated or reboot is necessary, do the following : > 4c1. Put host in maintenance > 4c2. Reboot host if necessary > 4c3. Check and wait via 'ceph orch host ls' if status of the host is > maintance and nothing else > 4c4. Get host out of maintenance > 4d. Check if cluster is healthy before continuing (via Rest-API) only warning > about scrub and deep-scrub is allowed, but no pg's should be degraded > 5. Enable scrub and deep-scrub when all hosts are done > 6. Check if cluster is healthy (via Rest-API) only health_ok is good > 7. Done > > For upgrade the OS we have something similar, but exiting maintenance mode is > broken (with 17.2.7) :( > I need to check the tracker for similar issues and if I can't find anything, > I will create a ticket. > > Kind regards, > Sake > > > Op 12-06-2024 19:02 CEST schreef Daniel Brown > > : > > > > > > I have two ansible roles, one for enter, one for exit. There’s likely > > better ways to do this — and I’ll not be surprised if someone here lets me > > know. They’re using orch commands via the cephadm shell. I’m using Ansible > > for other configuration management in my environment, as well, including > > setting up clients of the ceph cluster. > > > > > > Below excerpts from main.yml in the “tasks” for the enter/exit roles. The > > host I’m running ansible from is one of my CEPH servers - I’ve limited > > which process run there though so it’s in the cluster but not equal to the > > others. > > > > > > — > > Enter > > — > > > > - name: Ceph Maintenance Mode Enter > > shell: > > > > cmd: ' cephadm shell ceph orch host maintenance enter {{ > > (ansible_ssh_host|default(ansible_host))|default(inventory_hostname) }} > > --force --yes-i-really-mean-it ‘ > > become: True > > > > > > > > — > > Exit > > — > > > > > > - name: Ceph Maintenance Mode Exit > > shell: > > cmd: 'cephadm shell ceph orch host maintenance exit {{ > > (ansible_ssh_host|default(ansible_host))|default(inventory_hostname) }} ‘ > > become: True > > connection: local > > > > > > - name: Wait for Ceph to be available > > ansible.builtin.wait_for: > > delay: 60 > > host: '{{ > > (ansible_ssh_host|default(ansible_host))|default(inventory_hostname) }}’ > > port: 9100 > > connection: local > > > > > > > > > > > > > > > On Jun 12, 2024, at 11:28 AM, Michael Worsham > > > wrote: > > > > > > Interesting. How do you set this "maintenance mode"? If you have a series > > > of documented steps that you have to do and could provide as an example, > > > that would be beneficial for my efforts. 
> > > > > > We are in the process of standing up both a dev-test environment > > > consisting of 3 Ceph servers (strictly for testing purposes) and a new > > > production environment consisting of 20+ Ceph servers. > > > > > > We are using Ubuntu 22.04. > > > > > > -- Michael > > > From: Daniel Brown > > > Sent: Wednesday, June 12, 2024 9:18 AM > > > To: Anthony D'Atri > > > Cc: Michael Worsham ; ceph-users@ceph.io > > > > > > Subject: Re: [ceph-users] Patching Ceph cluster > > > This is an external email. Please take care when clicking links or > > > opening attachments. When in doubt, check with the Help Desk or Security. >
[ceph-users] Re: Patching Ceph cluster
Edit: someone made some changes which broke some tasks when selecting the cephadm host to use. Just keep in mind it's an example.

> Op 14-06-2024 10:28 CEST schreef Sake Ceph :
>
>
> I needed to do some cleaning before I could share this :)
> Maybe you or someone else can use it.
>
> Kind regards,
> Sake
>
> > Op 14-06-2024 03:53 CEST schreef Michael Worsham :
> >
> >
> > I'd love to see what your playbook(s) looks like for doing this.
> >
> > -- Michael
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph 19 Squid released?
Not yet released. Every x.1.z release is a release candidate. Always wait for the x.2.z release (in this case 19.2.0) and the official release notes on docs.ceph.com :-)

> Op 21-07-2024 18:32 CEST schreef Nicola Mori :
>
>
> Dear Ceph users,
>
> on quay.io I see available images for 19.1.0. Yet I can't find any
> public release announcement, and on this page:
>
>https://docs.ceph.com/en/latest/releases/
>
> version 19 is still not mentioned at all. So what's going on?
>
> Nicola
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Release 18.2.4
What I read on the Slack channel is that the publication job got stuck late in the day and the restart finished late. I guess they'll announce the new version today.

Kind regards,
Sake

> Op 24-07-2024 13:05 CEST schreef Alfredo Rezinovsky :
>
>
> Ceph dashboard offers me to upgrade to v18.2.4.
>
> I can't find any information on 18.2.4.
> There is no 18.2.4 in https://docs.ceph.com/en/latest/releases/
> Not a tag in https://github.com/ceph/ceph
>
> I don't understand why there's a 18.2.4 image and what's in it.
>
> --
> Alfrenovsky
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] MDS cache always increasing
I hope someone can help us with an MDS caching problem. Ceph version 18.2.4 with cephadm container deployment.

Question 1: For me it's not clear how much cache/memory you should allocate for the MDS. Is this based on the number of open files, caps or something else?

Question 2/Problem: At the moment we have MDS nodes with 32 GB of memory and a configured cache limit of 20 GB. There are 4 MDS nodes: 2 active and 2 in standby-replay mode (with max_mds set at 2 of course). We pinned top directories to specific ranks, so the balancer isn't used. The memory usage is for the most part increasing, sometimes with a little dip of a couple hundred MB freed. After all the memory is consumed, swap gets used. This frees a couple hundred MB of memory, but not much. When eventually the swap runs out and the memory is full, the MDS service stops and the cluster logs show:
1. no beacon from mds
2. marking mds up:active laggy
3. replacing mds
4. MDS daemon is removed because it is dead or otherwise unavailable

For example: we have the top folders app2 and app4, which are pinned to rank 1. Folder app2 is always accessed by 4 clients (application servers); the same goes for folder app4. Folder app2 is 3 times larger than folder app4 (last time I checked, don't wanna do a du at the moment). After a couple of hours the memory usage of the MDS server stays around 18% (Grafana shows a flatline for 7 hours). At night the 9th client connects and first makes a backup with rsync of the latest snapshot folder of app2, and afterwards the same happens for folder app4, with a pause of 5 minutes in between. When the backup starts, the memory increases to 70% and stays at 70% after the backup of app2 is completed. 5 minutes later the memory starts increasing again with the start of the backup of folder app4. When the backup is done, it's at 78% and stays there for the rest of the day.

Why isn't the memory usage decreasing after the rsync is completed? Is there a memory leak in the MDS service?

PS. I have some small log files/Grafana screenshots, not sure how to share.

Kind regards,
Sake
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
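(Not from the original message -- a few commands that relate to question 1, as a sketch. Note that mds_cache_memory_limit is a target for the metadata cache, not a hard cap on the daemon's RSS, so the process is expected to use noticeably more than the configured value; "mds.<name>" is a placeholder for a daemon name as shown by "ceph fs status".)

    # Current cache limit (bytes) and how full the cache is
    ceph config get mds mds_cache_memory_limit
    ceph tell mds.<name> cache status

    # Allocator heap statistics, to compare cache size against actual memory use
    ceph tell mds.<name> heap stats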
[ceph-users] Re: MDS cache always increasing
It was worse with 1 MDS, therefore we moved to 2 active MDS daemons with directory pinning (so the balancer won't be an issue/make things extra complicated).

The number of caps stays for the most part the same, with some ups and downs. I would guess it has something to do with caching the accessed directories or files, because memory usage increases a lot the first time the rsync runs, while the second time there isn't really an increase, only for a little while when the rsync is running, and afterwards it drops again.

NFS isn't really an option because it adds another hop for the clients :( Second, it happens on our production environment and I won't be making any changes there for a test. I will try to replicate it in our staging environment, but that one has a lot less load on it.

Kind regards,
Sake

> Op 31-08-2024 09:15 CEST schreef Alexander Patrakov : > > > Got it. > > However, to narrow down the issue, I suggest that you test whether it > still exists after the following changes: > > 1. Reduce max_mds to 1. > 2. Do not reduce max_mds to 1, but migrate all clients from a direct > CephFS mount to NFS. > > On Sat, Aug 31, 2024 at 2:55 PM Sake Ceph wrote: > > > > I was talking about the hosts where the MDS containers are running on. The > > clients are all RHEL 9. > > > > Kind regards, > > Sake > > > > > Op 31-08-2024 08:34 CEST schreef Alexander Patrakov : > > > > > > > > > Hello Sake, > > > > > > The combination of two active MDSs and RHEL8 does ring a bell, and I > > > have seen this with Quincy, too. However, what's relevant is the > > > kernel version on the clients. If they run the default 4.18.x kernel > > > from RHEL8, please either upgrade to the mainline kernel or decrease > > > max_mds to 1. If they run a modern kernel, then it is something I do > > > not know about. > > > > > > On Sat, Aug 31, 2024 at 1:21 PM Sake Ceph wrote: > > > > > > > > @Anthony: it's a small virtualized cluster and indeed SWAP shouldn't be > > > > used, but this doesn't change the problem. > > > > > > > > @Alexander: the problem is in the active nodes, the standby replay > > > > don't have issues anymore. > > > > > > > > Last night's backup run increased the memory usage to 86% when rsync > > > > was running for app2. It dropped to 77,8% when it was done. When the > > > > rsync for app4 was running it increased to 84% and dropping to 80%. > > > > After a few hours it's now settled on 82%. > > > > It looks to me the MDS server is caching something forever while it > > > > isn't being used.. > > > > > > > > The underlying host is running on RHEL 8. Upgrade to RHEL 9 is planned, > > > > but hit some issues with automatically upgrading hosts. > > > > > > > > Kind regards, > > > > Sake > > > > ___ > > > > ceph-users mailing list -- ceph-users@ceph.io > > > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > > > > > > > > > > -- > > > Alexander Patrakov > > > > -- > Alexander Patrakov ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS cache always increasing
@Anthony: it's a small virtualized cluster and indeed swap shouldn't be used, but this doesn't change the problem.

@Alexander: the problem is in the active nodes; the standby-replay ones don't have issues anymore.

Last night's backup run increased the memory usage to 86% while rsync was running for app2. It dropped to 77.8% when it was done. When the rsync for app4 was running it increased to 84% and then dropped to 80%. After a few hours it has now settled at 82%. It looks to me like the MDS server is caching something forever while it isn't being used.

The underlying host is running on RHEL 8. An upgrade to RHEL 9 is planned, but we hit some issues with automatically upgrading hosts.

Kind regards,
Sake
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS cache always increasing
Ow it got worse after the upgrade to Reef (was running Quincy). With Quincy the memory usage was also a lot of times around 95% and some swap usage, but never exceeding both to the point of crashing. Kind regards, Sake > Op 31-08-2024 09:15 CEST schreef Alexander Patrakov : > > > Got it. > > However, to narrow down the issue, I suggest that you test whether it > still exists after the following changes: > > 1. Reduce max_mds to 1. > 2. Do not reduce max_mds to 1, but migrate all clients from a direct > CephFS mount to NFS. > > On Sat, Aug 31, 2024 at 2:55 PM Sake Ceph wrote: > > > > I was talking about the hosts where the MDS containers are running on. The > > clients are all RHEL 9. > > > > Kind regards, > > Sake > > > > > Op 31-08-2024 08:34 CEST schreef Alexander Patrakov : > > > > > > > > > Hello Sake, > > > > > > The combination of two active MDSs and RHEL8 does ring a bell, and I > > > have seen this with Quincy, too. However, what's relevant is the > > > kernel version on the clients. If they run the default 4.18.x kernel > > > from RHEL8, please either upgrade to the mainline kernel or decrease > > > max_mds to 1. If they run a modern kernel, then it is something I do > > > not know about. > > > > > > On Sat, Aug 31, 2024 at 1:21 PM Sake Ceph wrote: > > > > > > > > @Anthony: it's a small virtualized cluster and indeed SWAP shouldn't be > > > > used, but this doesn't change the problem. > > > > > > > > @Alexander: the problem is in the active nodes, the standby replay > > > > don't have issues anymore. > > > > > > > > Last night's backup run increased the memory usage to 86% when rsync > > > > was running for app2. It dropped to 77,8% when it was done. When the > > > > rsync for app4 was running it increased to 84% and dropping to 80%. > > > > After a few hours it's now settled on 82%. > > > > It looks to me the MDS server is caching something forever while it > > > > isn't being used.. > > > > > > > > The underlying host is running on RHEL 8. Upgrade to RHEL 9 is planned, > > > > but hit some issues with automatically upgrading hosts. > > > > > > > > Kind regards, > > > > Sake > > > > ___ > > > > ceph-users mailing list -- ceph-users@ceph.io > > > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > > > > > > > > > > -- > > > Alexander Patrakov > > > > -- > Alexander Patrakov > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS cache always increasing
I was talking about the hosts where the MDS containers are running on. The clients are all RHEL 9. Kind regards, Sake > Op 31-08-2024 08:34 CEST schreef Alexander Patrakov : > > > Hello Sake, > > The combination of two active MDSs and RHEL8 does ring a bell, and I > have seen this with Quincy, too. However, what's relevant is the > kernel version on the clients. If they run the default 4.18.x kernel > from RHEL8, please either upgrade to the mainline kernel or decrease > max_mds to 1. If they run a modern kernel, then it is something I do > not know about. > > On Sat, Aug 31, 2024 at 1:21 PM Sake Ceph wrote: > > > > @Anthony: it's a small virtualized cluster and indeed SWAP shouldn't be > > used, but this doesn't change the problem. > > > > @Alexander: the problem is in the active nodes, the standby replay don't > > have issues anymore. > > > > Last night's backup run increased the memory usage to 86% when rsync was > > running for app2. It dropped to 77,8% when it was done. When the rsync for > > app4 was running it increased to 84% and dropping to 80%. After a few hours > > it's now settled on 82%. > > It looks to me the MDS server is caching something forever while it isn't > > being used.. > > > > The underlying host is running on RHEL 8. Upgrade to RHEL 9 is planned, but > > hit some issues with automatically upgrading hosts. > > > > Kind regards, > > Sake > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > > -- > Alexander Patrakov ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS cache always increasing
The folders contain a couple of million files, but are really static. We have another folder with a lot of updates, and the MDS server for that folder indeed shows a continuous increase of memory usage. But I would focus on the app2 and app4 folders, because those have a lot fewer changes in them.

But why does the MDS keep all this information in its memory? If it isn't accessed for more than 20 hours, it should release it in my opinion (even a lot earlier, like after an hour).

Kind regards,
Sake

> Op 02-09-2024 09:33 CEST schreef Eugen Block : > > > Can you tell if the number of objects increases in your cephfs between > those bursts? I noticed something similar in a 16.2.15 cluster as > well. It's not that heavily used, but it contains home directories and > development working directories etc. And when one user checked out a > git project, the mds memory usage increased a lot, getting near its > configured limit. Before there were around 3,7 Million objects in the > cephfs, that user added more than a million more files with his > checkout. It wasn't a real issue (yet) because the usage isn't very > dynamical and the total number of files is relatively stable. > This doesn't really help resolve anything, but if your total number of > files grows, I'm not surprised that the mds requires more memory. > > Zitat von Alexander Patrakov : > > > As a workaround, to reduce the impact of the MDS slowed down by > > excessive memory consumption, I would suggest installing earlyoom, > > disabling swap, and configuring earlyoom as follows (usually through > > /etc/sysconfig/earlyoom, but could be in a different place on your > > distribution): > > > > EARLYOOM_ARGS="-p -r 600 -m 4,4 -s 1,1" > > > > On Sat, Aug 31, 2024 at 3:44 PM Sake Ceph wrote: > >> > >> Ow it got worse after the upgrade to Reef (was running Quincy). > >> With Quincy the memory usage was also a lot of times around 95% and > >> some swap usage, but never exceeding both to the point of crashing. > >> > >> Kind regards, > >> Sake > >> > Op 31-08-2024 09:15 CEST schreef Alexander Patrakov : > >> > > >> > > >> > Got it. > >> > > >> > However, to narrow down the issue, I suggest that you test whether it > >> > still exists after the following changes: > >> > > >> > 1. Reduce max_mds to 1. > >> > 2. Do not reduce max_mds to 1, but migrate all clients from a direct > >> > CephFS mount to NFS. > >> > > >> > On Sat, Aug 31, 2024 at 2:55 PM Sake Ceph wrote: > >> > > > >> > > I was talking about the hosts where the MDS containers are > >> running on. The clients are all RHEL 9. > >> > > > >> > > Kind regards, > >> > > Sake > >> > > > >> > > > Op 31-08-2024 08:34 CEST schreef Alexander Patrakov > >> : > >> > > > > >> > > > > >> > > > Hello Sake, > >> > > > > >> > > > The combination of two active MDSs and RHEL8 does ring a bell, and I > >> > > > have seen this with Quincy, too. However, what's relevant is the > >> > > > kernel version on the clients. If they run the default 4.18.x kernel > >> > > > from RHEL8, please either upgrade to the mainline kernel or decrease > >> > > > max_mds to 1. If they run a modern kernel, then it is something I do > >> > > > not know about. > >> > > > > >> > > > On Sat, Aug 31, 2024 at 1:21 PM Sake Ceph wrote: > >> > > > > > >> > > > > @Anthony: it's a small virtualized cluster and indeed SWAP > >> shouldn't be used, but this doesn't change the problem. > >> > > > > > >> > > > > @Alexander: the problem is in the active nodes, the standby > >> replay don't have issues anymore.
> >> > > > > > >> > > > > Last night's backup run increased the memory usage to 86% > >> when rsync was running for app2. It dropped to 77,8% when it was > >> done. When the rsync for app4 was running it increased to 84% and > >> dropping to 80%. After a few hours it's now settled on 82%. > >> > > > > It looks to me the MDS server is caching something forever > >> while it isn't being used.. > >> > > > > > >> > > > > The underlying h
[ceph-users] Re: MDS cache always increasing
But the client which is doing the rsync doesn't hold any caps after the rsync: cephfs-top shows 0 caps. Even a system reboot of the client doesn't make a difference.

Kind regards,
Sake

> Op 03-09-2024 04:01 CEST schreef Alexander Patrakov :
>
>
> MDS cannot release an inode if a client has cached it (and thus can
> have newer data than OSDs have). The MDS needs to know at least which
> client to ask if someone else requests the same file.
>
> MDS does ask clients to release caps, but sometimes this doesn't work,
> and there is no good troubleshooting guide except trying different
> kernel versions and switching between kernel client / fuse / nfs.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
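(Not from the original thread -- a way to cross-check this on the MDS side rather than via cephfs-top; "mds.<name>" is a placeholder for the active daemon serving the pinned rank.)

    # List client sessions with the cap counts as the MDS sees them
    ceph tell mds.<name> session ls

    # Ask the MDS to trim its cache (timeout in seconds), which may help show
    # whether the memory is held by cache objects or by something else
    ceph tell mds.<name> cache drop 300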
[ceph-users] Grafana dashboards is missing data
After the upgrade from 17.2.7 to 18.2.4 a lot of graphs are empty. For example, the OSD latency panel under "OSD device details" and the "OSD Overview" dashboard show a lot of "No data" messages.

I deployed ceph-exporter on all hosts, am I missing something? I even did a redeploy of Prometheus.

Kind regards,
Sake
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Grafana dashboards is missing data
Hi Frank,

That option is set to false (I didn't enable security for the monitoring stack); see the checks below.

Kind regards,
Sake

> Op 04-09-2024 20:17 CEST schreef Frank de Bot (lists) :
>
>
> Hi Sake,
>
> Do you have the config mgr/cephadm/secure_monitoring_stack to true? If
> so, this pull request will fix your problem:
> https://github.com/ceph/ceph/pull/58402
>
> Regards,
>
> Frank
>
> Sake Ceph wrote:
> > After the upgrade from 17.2.7 to 18.2.4 a lot of graphs are empty. For
> > example the Osd latency under OSD device details or the Osd Overview has a
> > lot of No data messages.
> >
> > I deployed ceph-exporter on all hosts, am I missing something? Did even a
> > redeploy of prometheus.
> >
> > Kind regards,
> > Sake
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
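(Not from the original message -- a couple of quick checks that fit this discussion, shown as a sketch: verify the setting Frank refers to and confirm the ceph-exporter daemons are actually running.)

    ceph config get mgr mgr/cephadm/secure_monitoring_stack
    ceph orch ps --daemon-type ceph-exporter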
[ceph-users] Re: Grafana dashboards is missing data
I would like to stay away from using the workaround. First I did a redeployment of prometheus and later ceph-exporter, but still no data. After the deployment of ceph-exporter I saw the following messages (2 times, for each host running prometheus): Reconfiguring daemon prometheus.. You mention missing a configuration, but I can't find any specific configuration options in the docs; https://docs.ceph.com/en/reef/mgr/prometheus/ and https://docs.ceph.com/en/reef/cephadm/services/monitoring/ Kind regards, Sake > Op 05-09-2024 10:00 CEST schreef Pierre Riteau : > > > As a workaround you can use: ceph config set mgr > mgr/prometheus/exclude_perf_counters false > > However I understand that deploying a ceph-exporter daemon on each host is > the proper fix. You may still be missing some configuration for it? > > On Thu, 5 Sept 2024 at 08:25, Sake Ceph wrote: > > > Hi Frank, > > > > That option is set to false (I didnt enabled security for the monitoring > > stack). > > > > Kind regards, > > Sake > > > Op 04-09-2024 20:17 CEST schreef Frank de Bot (lists) > >: > > > > > > > > > Hi Sake, > > > > > > Do you have the config mgr/cephadm/secure_monitoring_stack to true? If > > > so, this pull request will fix your problem: > > > https://github.com/ceph/ceph/pull/58402 > > > > > > Regards, > > > > > > Frank > > > > > > Sake Ceph wrote: > > > > After the upgrade from 17.2.7 to 18.2.4 a lot of graphs are empty. For > > example the Osd latency under OSD device details or the Osd Overview has a > > lot of No data messages. > > > > > > > > I deployed ceph-exporter on all hosts, am I missing something? Did > > even a redeploy of prometheus. > > > > > > > > Kind regards, > > > > Sake > > > > ___ > > > > ceph-users mailing list -- ceph-users@ceph.io > > > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > > > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Grafana dashboards is missing data
That is working, but I noticed the firewall isn't opened for that port. Shouldn't cephadm manage this, like it does for all the other ports? Kind regards, Sake > Op 06-09-2024 16:14 CEST schreef Björn Lässig : > > > Am Mittwoch, dem 04.09.2024 um 20:01 +0200 schrieb Sake Ceph: > > After the upgrade from 17.2.7 to 18.2.4 a lot of graphs are empty. For > > example the Osd latency under OSD device details or the Osd Overview > > has a lot of No data messages. > > > > is the ceph-exporter listening on port 9926 (on every host)? > > ss -tlpn sport 9926 > > Can you connect via browser? > > curl localhost:9926/metrics > > > I deployed ceph-exporter on all hosts, am I missing something? Did > > even a redeploy of prometheus. > > there is a bug, that this exporter does not listens for IPv6. > > greetings > Björn > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Grafana dashboards is missing data
After opening port 9926 manually, the Grafana dashboards show the data. So is this a bug? Kind regards, Sake > Op 06-09-2024 17:39 CEST schreef Sake Ceph : > > > That is working, but I noticed the firewall isn't opened for that port. > Shouldn't cephadm manage this, like it does for all the other ports? > > Kind regards, > Sake > > > Op 06-09-2024 16:14 CEST schreef Björn Lässig : > > > > > > Am Mittwoch, dem 04.09.2024 um 20:01 +0200 schrieb Sake Ceph: > > > After the upgrade from 17.2.7 to 18.2.4 a lot of graphs are empty. For > > > example the Osd latency under OSD device details or the Osd Overview > > > has a lot of No data messages. > > > > > > > is the ceph-exporter listening on port 9926 (on every host)? > > > > ss -tlpn sport 9926 > > > > Can you connect via browser? > > > > curl localhost:9926/metrics > > > > > I deployed ceph-exporter on all hosts, am I missing something? Did > > > even a redeploy of prometheus. > > > > there is a bug, that this exporter does not listens for IPv6. > > > > greetings > > Björn > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
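(Not from the original thread -- the manual workaround described above, spelled out for firewalld-based hosts such as RHEL; run it on every host that has a ceph-exporter daemon.)

    firewall-cmd --permanent --add-port=9926/tcp
    firewall-cmd --reload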
[ceph-users] Re: Grafana dashboards is missing data
Hello Eugen,

Well, nothing about enabling port 9926.

For example I see the following when deploying Grafana:
2024-09-05 14:27:30,969 7fd2e6583740 INFO firewalld ready
2024-09-05 14:27:31,334 7fd2e6583740 DEBUG /bin/firewall-cmd: stdout success
2024-09-05 14:27:31,350 7fd2e6583740 INFO firewalld ready
2024-09-05 14:27:31,593 7fd2e6583740 DEBUG Non-zero exit code 1 from /bin/firewall-cmd --permanent --query-port 3000/tcp
2024-09-05 14:27:31,594 7fd2e6583740 DEBUG /bin/firewall-cmd: stdout no
2024-09-05 14:27:31,594 7fd2e6583740 INFO Enabling firewalld port 3000/tcp in current zone...
2024-09-05 14:27:31,832 7fd2e6583740 DEBUG /bin/firewall-cmd: stdout success
2024-09-05 14:27:32,212 7fd2e6583740 DEBUG /bin/firewall-cmd: stdout success

But only the following when deploying ceph-exporter:
2024-09-05 12:17:48,897 7f3d7cc0e740 INFO firewalld ready
2024-09-05 12:17:49,269 7f3d7cc0e740 DEBUG /bin/firewall-cmd: stdout success

When looking at the deploy configuration, Grafana shows 'ports': [3000], but ceph-exporter shows 'ports': []

Kind regards,
Sake

> Op 09-09-2024 10:50 CEST schreef Eugen Block : > > > Sorry, clicked "send" too soon. In a test cluster, cephadm.log shows > that it would try to open a port if a firewall was enabled: > > 2024-09-09 10:48:59,686 7f3142b11740 DEBUG firewalld.service is not enabled > 2024-09-09 10:48:59,686 7f3142b11740 DEBUG Not possible to enable > service . firewalld.service is not available > > Zitat von Eugen Block : > > > Do you see anything in the cephadm.log related to the firewall? > > > > Zitat von Sake Ceph : > > > >> After opening port 9926 manually, the Grafana dashboards show the data. > >> So is this a bug? > >> > >> Kind regards, > >> Sake > >>> Op 06-09-2024 17:39 CEST schreef Sake Ceph : > >>> > >>> > >>> That is working, but I noticed the firewall isn't opened for that > >>> port. Shouldn't cephadm manage this, like it does for all the > >>> other ports? > >>> > >>> Kind regards, > >>> Sake > >>> > >>>> Op 06-09-2024 16:14 CEST schreef Björn Lässig : > >>>> > >>>> > >>>> Am Mittwoch, dem 04.09.2024 um 20:01 +0200 schrieb Sake Ceph: > >>>> > After the upgrade from 17.2.7 to 18.2.4 a lot of graphs are empty. For > >>>> > example the Osd latency under OSD device details or the Osd Overview > >>>> > has a lot of No data messages. > >>>> > > >>>> > >>>> is the ceph-exporter listening on port 9926 (on every host)? > >>>> > >>>> ss -tlpn sport 9926 > >>>> > >>>> Can you connect via browser? > >>>> > >>>> curl localhost:9926/metrics > >>>> > >>>> > I deployed ceph-exporter on all hosts, am I missing something? Did > >>>> > even a redeploy of prometheus. > >>>> > >>>> there is a bug, that this exporter does not listens for IPv6. > >>>> > >>>> greetings > >>>> Björn > >>>> ___ > >>>> ceph-users mailing list -- ceph-users@ceph.io > >>>> To unsubscribe send an email to ceph-users-le...@ceph.io > >>> ___ > >>> ceph-users mailing list -- ceph-users@ceph.io > >>> To unsubscribe send an email to ceph-users-le...@ceph.io > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Grafana dashboards is missing data
We're using default :) I'm talking about the deployment configuration which is shown in the log files when deploying grafana/ceph-exporter. I got the same configuration as you for ceph-exporter (the default) when exporting the service. Kind regards, Sake > Op 09-09-2024 12:04 CEST schreef Eugen Block : > > > Can you be more specific about "deploy configuration"? Do you have > your own spec files for grafana and ceph-exporter? > I just ran 'ceph orch apply ceph-exporter' and the resulting config is > this one: > > # ceph orch ls ceph-exporter --export > service_type: ceph-exporter > service_name: ceph-exporter > placement: >host_pattern: '*' > spec: >prio_limit: 5 >stats_period: 5 > > Zitat von Sake Ceph : > > > Hello Eugen, > > > > Well nothing about enabling port 9926. > > > > For example I see the following when deploying Grafanan: > > 2024-09-05 14:27:30,969 7fd2e6583740 INFO firewalld ready > > 2024-09-05 14:27:31,334 7fd2e6583740 DEBUG /bin/firewall-cmd: stdout success > > 2024-09-05 14:27:31,350 7fd2e6583740 INFO firewalld ready > > 2024-09-05 14:27:31,593 7fd2e6583740 DEBUG Non-zero exit code 1 from > > /bin/firewall-cmd --permanent --query-port 3000/tcp > > 2024-09-05 14:27:31,594 7fd2e6583740 DEBUG /bin/firewall-cmd: stdout no > > 2024-09-05 14:27:31,594 7fd2e6583740 INFO Enabling firewalld port > > 3000/tcp in current zone... > > 2024-09-05 14:27:31,832 7fd2e6583740 DEBUG /bin/firewall-cmd: stdout success > > 2024-09-05 14:27:32,212 7fd2e6583740 DEBUG /bin/firewall-cmd: stdout success > > > > But only the following when deploying ceph-exporter: > > 2024-09-05 12:17:48,897 7f3d7cc0e740 INFO firewalld ready > > 2024-09-05 12:17:49,269 7f3d7cc0e740 DEBUG /bin/firewall-cmd: stdout success > > > > When looking in the deploy configuration, Grafana shows 'ports': > > [3000], but ceph-exporter shows 'ports': [] > > > > Kind regards, > > Sake > > > >> Op 09-09-2024 10:50 CEST schreef Eugen Block : > >> > >> > >> Sorry, clicked "send" too soon. In a test cluster, cephadm.log shows > >> that it would try to open a port if a firewall was enabled: > >> > >> 2024-09-09 10:48:59,686 7f3142b11740 DEBUG firewalld.service is not enabled > >> 2024-09-09 10:48:59,686 7f3142b11740 DEBUG Not possible to enable > >> service . firewalld.service is not available > >> > >> Zitat von Eugen Block : > >> > >> > Do you see anything in the cephadm.log related to the firewall? > >> > > >> > Zitat von Sake Ceph : > >> > > >> >> After opening port 9926 manually, the Grafana dashboards show the data. > >> >> So is this a bug? > >> >> > >> >> Kind regards, > >> >> Sake > >> >>> Op 06-09-2024 17:39 CEST schreef Sake Ceph : > >> >>> > >> >>> > >> >>> That is working, but I noticed the firewall isn't opened for that > >> >>> port. Shouldn't cephadm manage this, like it does for all the > >> >>> other ports? > >> >>> > >> >>> Kind regards, > >> >>> Sake > >> >>> > >> >>>> Op 06-09-2024 16:14 CEST schreef Björn Lässig > >> : > >> >>>> > >> >>>> > >> >>>> Am Mittwoch, dem 04.09.2024 um 20:01 +0200 schrieb Sake Ceph: > >> >>>> > After the upgrade from 17.2.7 to 18.2.4 a lot of graphs are > >> empty. For > >> >>>> > example the Osd latency under OSD device details or the Osd Overview > >> >>>> > has a lot of No data messages. > >> >>>> > > >> >>>> > >> >>>> is the ceph-exporter listening on port 9926 (on every host)? > >> >>>> > >> >>>> ss -tlpn sport 9926 > >> >>>> > >> >>>> Can you connect via browser? 
> >> >>>> > >> >>>> curl localhost:9926/metrics > >> >>>> > >> >>>> > I deployed ceph-exporter on all hosts, am I missing something? Did > >> >>>> > even a redeploy of prometheus. > >> >>>> > >> >>>> there is a bug, that this exporter does not listens for IPv6. > >> >>>> > >> >>>> greetings > >> >>>> Björn > >> >>>> ___ > >> >>>> ceph-users mailing list -- ceph-users@ceph.io > >> >>>> To unsubscribe send an email to ceph-users-le...@ceph.io > >> >>> ___ > >> >>> ceph-users mailing list -- ceph-users@ceph.io > >> >>> To unsubscribe send an email to ceph-users-le...@ceph.io > >> >> ___ > >> >> ceph-users mailing list -- ceph-users@ceph.io > >> >> To unsubscribe send an email to ceph-users-le...@ceph.io > >> > >> > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Grafana dashboards is missing data
We're using RHEL 8 and 9 and on both the port was not open. It's just strange it isn't working for ceph-exporter but just fine for everything else. Kind regards, Sake > Op 09-09-2024 14:03 CEST schreef Eugen Block : > > > Those two daemons are handled differently by cephadm, they're > different classes (grafana is "class Monitoring(ContainerDaemonForm)" > while ceph-exporter is "class CephExporter(ContainerDaemonForm)"), > therefore they have different metadata etc., for example: > > soc9-ceph:~ # jq '.ports' > /var/lib/ceph/{FSID}/ceph-exporter.soc9-ceph/unit.meta > [] > > soc9-ceph:~ # jq '.ports' /var/lib/ceph/{FSID}/grafana.soc9-ceph/unit.meta > [ >3000 > ] > > But that's about all I can provide here. Maybe the host OS plays some > role here as well, not sure. > > Zitat von Sake Ceph : > > > We're using default :) I'm talking about the deployment > > configuration which is shown in the log files when deploying > > grafana/ceph-exporter. > > > > I got the same configuration as you for ceph-exporter (the default) > > when exporting the service. > > > > Kind regards, > > Sake > > > >> Op 09-09-2024 12:04 CEST schreef Eugen Block : > >> > >> > >> Can you be more specific about "deploy configuration"? Do you have > >> your own spec files for grafana and ceph-exporter? > >> I just ran 'ceph orch apply ceph-exporter' and the resulting config is > >> this one: > >> > >> # ceph orch ls ceph-exporter --export > >> service_type: ceph-exporter > >> service_name: ceph-exporter > >> placement: > >>host_pattern: '*' > >> spec: > >>prio_limit: 5 > >>stats_period: 5 > >> > >> Zitat von Sake Ceph : > >> > >> > Hello Eugen, > >> > > >> > Well nothing about enabling port 9926. > >> > > >> > For example I see the following when deploying Grafanan: > >> > 2024-09-05 14:27:30,969 7fd2e6583740 INFO firewalld ready > >> > 2024-09-05 14:27:31,334 7fd2e6583740 DEBUG /bin/firewall-cmd: > >> stdout success > >> > 2024-09-05 14:27:31,350 7fd2e6583740 INFO firewalld ready > >> > 2024-09-05 14:27:31,593 7fd2e6583740 DEBUG Non-zero exit code 1 from > >> > /bin/firewall-cmd --permanent --query-port 3000/tcp > >> > 2024-09-05 14:27:31,594 7fd2e6583740 DEBUG /bin/firewall-cmd: stdout no > >> > 2024-09-05 14:27:31,594 7fd2e6583740 INFO Enabling firewalld port > >> > 3000/tcp in current zone... > >> > 2024-09-05 14:27:31,832 7fd2e6583740 DEBUG /bin/firewall-cmd: > >> stdout success > >> > 2024-09-05 14:27:32,212 7fd2e6583740 DEBUG /bin/firewall-cmd: > >> stdout success > >> > > >> > But only the following when deploying ceph-exporter: > >> > 2024-09-05 12:17:48,897 7f3d7cc0e740 INFO firewalld ready > >> > 2024-09-05 12:17:49,269 7f3d7cc0e740 DEBUG /bin/firewall-cmd: > >> stdout success > >> > > >> > When looking in the deploy configuration, Grafana shows 'ports': > >> > [3000], but ceph-exporter shows 'ports': [] > >> > > >> > Kind regards, > >> > Sake > >> > > >> >> Op 09-09-2024 10:50 CEST schreef Eugen Block : > >> >> > >> >> > >> >> Sorry, clicked "send" too soon. In a test cluster, cephadm.log shows > >> >> that it would try to open a port if a firewall was enabled: > >> >> > >> >> 2024-09-09 10:48:59,686 7f3142b11740 DEBUG firewalld.service is > >> not enabled > >> >> 2024-09-09 10:48:59,686 7f3142b11740 DEBUG Not possible to enable > >> >> service . firewalld.service is not available > >> >> > >> >> Zitat von Eugen Block : > >> >> > >> >> > Do you see anything in the cephadm.log related to the firewall? 
> >> >> > > >> >> > Zitat von Sake Ceph : > >> >> > > >> >> >> After opening port 9926 manually, the Grafana dashboards show > >> the data. > >> >> >> So is this a bug? > >> >> >> > >> >> >> Kind regards, > >> >> >> Sake > >> >> >>> Op 06-09-2024 17:39 CEST schreef Sake Ceph : > >> >> >>
[ceph-users] Re: Grafana dashboards is missing data
Thank you! > Op 10-09-2024 09:39 CEST schreef Redouane Kachach : > > > Seems like a BUG in cephadm, the ceph-exporter when deployed doesn't specify > its port that's why it's not being opened automatically. You can see that in > the cephadm logs (ports list is empty): > > 2024-09-09 04:39:48,986 7fc2993d7740 DEBUG Loaded deploy configuration: > {'fsid': '250b9d7c-6e65-11ef-8e0e-525400ecf80a', 'name': > 'ceph-exporter.ceph-node-0', 'image': '', 'deploy_arguments': [], 'params': > {}, 'meta': {'service_name': 'ceph-exporter', 'ports': [], 'ip': None, > 'deployed_by': > ['quay.ceph.io/ceph-ci/ceph@sha256:02ce7c1aa356b524041713a3603da8445c4fe00ed30cb1c1f91532926db20d3c' > > (http://quay.ceph.io/ceph-ci/ceph@sha256:02ce7c1aa356b524041713a3603da8445c4fe00ed30cb1c1f91532926db20d3c')], > 'rank': None, 'rank_generation': None, > > > I opened the following tracker to fix the issue: > https://tracker.ceph.com/issues/67975 > > > > > On Mon, Sep 9, 2024 at 2:54 PM Sake Ceph wrote: > > We're using RHEL 8 and 9 and on both the port was not open. > > It's just strange it isn't working for ceph-exporter but just fine for > > everything else. > > > > Kind regards, > > Sake > > > > > Op 09-09-2024 14:03 CEST schreef Eugen Block : > > > > > > > > > Those two daemons are handled differently by cephadm, they're > > > different classes (grafana is "class Monitoring(ContainerDaemonForm)" > > > while ceph-exporter is "class CephExporter(ContainerDaemonForm)"), > > > therefore they have different metadata etc., for example: > > > > > > soc9-ceph:~ # jq '.ports' > > > /var/lib/ceph/{FSID}/ceph-exporter.soc9-ceph/unit.meta > > > [] > > > > > > soc9-ceph:~ # jq '.ports' > > /var/lib/ceph/{FSID}/grafana.soc9-ceph/unit.meta > > > [ > > > 3000 > > > ] > > > > > > But that's about all I can provide here. Maybe the host OS plays some > > > role here as well, not sure. > > > > > > Zitat von Sake Ceph : > > > > > > > We're using default :) I'm talking about the deployment > > > > configuration which is shown in the log files when deploying > > > > grafana/ceph-exporter. > > > > > > > > I got the same configuration as you for ceph-exporter (the default) > > > > when exporting the service. > > > > > > > > Kind regards, > > > > Sake > > > > > > > >> Op 09-09-2024 12:04 CEST schreef Eugen Block : > > > >> > > > >> > > > >> Can you be more specific about "deploy configuration"? Do you have > > > >> your own spec files for grafana and ceph-exporter? > > > >> I just ran 'ceph orch apply ceph-exporter' and the resulting config is > > > >> this one: > > > >> > > > >> # ceph orch ls ceph-exporter --export > > > >> service_type: ceph-exporter > > > >> service_name: ceph-exporter > > > >> placement: > > > >> host_pattern: '*' > > > >> spec: > > > >> prio_limit: 5 > > > >> stats_period: 5 > > > >> > > > >> Zitat von Sake Ceph : > > > >> > > > >> > Hello Eugen, > > > >> > > > > >> > Well nothing about enabling port 9926. 
> > > >> > > > > >> > For example I see the following when deploying Grafanan: > > > >> > 2024-09-05 14:27:30,969 7fd2e6583740 INFO firewalld ready > > > >> > 2024-09-05 14:27:31,334 7fd2e6583740 DEBUG /bin/firewall-cmd: > > > >> stdout success > > > >> > 2024-09-05 14:27:31,350 7fd2e6583740 INFO firewalld ready > > > >> > 2024-09-05 14:27:31,593 7fd2e6583740 DEBUG Non-zero exit code 1 from > > > >> > /bin/firewall-cmd --permanent --query-port 3000/tcp > > > >> > 2024-09-05 14:27:31,594 7fd2e6583740 DEBUG /bin/firewall-cmd: > > stdout no > > > >> > 2024-09-05 14:27:31,594 7fd2e6583740 INFO Enabling firewalld port > > > >> > 3000/tcp in current zone... > > > >
[ceph-users] Help needed with Grafana password
I configured a password for Grafana because I want to use Loki. I used the spec parameter initial_admin_password and this works fine for a staging environment, where I never tried to use Grafana with a password for Loki.

Using the username admin with the configured password gives a credentials error on the environment where I tried to use Grafana with Loki in the past (with 17.2.6 of Ceph/cephadm). I changed the password within Grafana back then, but how can I overwrite this now? Or is there a way to clean up all Grafana files?

Best regards,
Sake
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Help needed with Grafana password
Hi,

Well, to get promtail working with Loki, you need to set up a password in Grafana. But promtail wasn't working with the 17.2.6 release (the URL was set to containers.local), so I stopped using it, but forgot to click save in KeePass :(

I didn't configure anything special in Grafana, the default dashboards are great! So a wipe isn't a problem, it's what I want.

Best regards,
Sake

> Op 09-11-2023 08:19 CET schreef Eugen Block :
>
>
> Hi,
> you mean you forgot your password? You can remove the service with
> 'ceph orch rm grafana', then re-apply your grafana.yaml containing the
> initial password. Note that this would remove all of the grafana
> configs or custom dashboards etc., you would have to reconfigure them.
> So before doing that you should verify that this is actually what
> you're looking for. Not sure what this has to do with Loki though.
>
> Eugen
>
> Zitat von Sake Ceph :
>
> > I configured a password for Grafana because I want to use Loki. I
> > used the spec parameter initial_admin_password and this works fine for a
> > staging environment, where I never tried to used Grafana with a password
> > for Loki.
> >
> >Using the username admin with the configured password gives a
> > credentials error on environment where I tried to use Grafana with Loki in
> > the past (with 17.2.6 of Ceph/cephadm). I changed the password in the past
> > within Grafana, but how can I overwrite this now? Or is there a way to
> > cleanup all Grafana files?
> >
> >Best regards,
> >Sake
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
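(Not from the original thread -- a minimal sketch of the remove/re-apply procedure Eugen describes, assuming a simple grafana spec; the password is obviously a placeholder and, as noted above, any custom Grafana configuration and dashboards would be lost.)

    # grafana.yaml
    service_type: grafana
    placement:
      count: 1
    spec:
      initial_admin_password: changeme

    # remove the old service and re-apply it with the initial password
    ceph orch rm grafana
    ceph orch apply -i grafana.yaml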
[ceph-users] Re: Help needed with Grafana password
To bad, that doesn't work :( > Op 09-11-2023 09:07 CET schreef Sake Ceph : > > > Hi, > > Well to get promtail working with Loki, you need to setup a password in > Grafana. > But promtail wasn't working with the 17.2.6 release, the URL was set to > containers.local. So I stopped using it, but forgot to click on save in > KeePass :( > > I didn't configure anything special in Grafana, the default dashboards are > great! So a wipe isn't a problem, it's what I want. > > Best regards, > Sake > > Op 09-11-2023 08:19 CET schreef Eugen Block : > > > > > > Hi, > > you mean you forgot your password? You can remove the service with > > 'ceph orch rm grafana', then re-apply your grafana.yaml containing the > > initial password. Note that this would remove all of the grafana > > configs or custom dashboards etc., you would have to reconfigure them. > > So before doing that you should verify that this is actually what > > you're looking for. Not sure what this has to do with Loki though. > > > > Eugen > > > > Zitat von Sake Ceph : > > > > > I configured a password for Grafana because I want to use Loki. I > > > used the spec parameter initial_admin_password and this works fine for a > > > staging environment, where I never tried to used Grafana with a password > > > for Loki. > > > > > >Using the username admin with the configured password gives a > > > credentials error on environment where I tried to use Grafana with Loki in > > > the past (with 17.2.6 of Ceph/cephadm). I changed the password in the past > > > within Grafana, but how can I overwrite this now? Or is there a way to > > > cleanup all Grafana files? > > > > > >Best regards, > > >Sake > > > > > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Help needed with Grafana password
Using podman version 4.4.1 on RHEL 8.8, Ceph 17.2.7 I used 'podman system prune -a -f' and 'podman volume prune -f' to cleanup files, but this leaves a lot of files over in /var/lib/containers/storage/overlay and a empty folder /var/lib/ceph//custom_config_files/grafana.. Found those files with 'find / -name *grafana*'. > Op 09-11-2023 09:53 CET schreef Eugen Block : > > > What doesn't work exactly? For me it did... > > Zitat von Sake Ceph : > > > To bad, that doesn't work :( > >> Op 09-11-2023 09:07 CET schreef Sake Ceph : > >> > >> > >> Hi, > >> > >> Well to get promtail working with Loki, you need to setup a > >> password in Grafana. > >> But promtail wasn't working with the 17.2.6 release, the URL was > >> set to containers.local. So I stopped using it, but forgot to click > >> on save in KeePass :( > >> > >> I didn't configure anything special in Grafana, the default > >> dashboards are great! So a wipe isn't a problem, it's what I want. > >> > >> Best regards, > >> Sake > >> > Op 09-11-2023 08:19 CET schreef Eugen Block : > >> > > >> > > >> > Hi, > >> > you mean you forgot your password? You can remove the service with > >> > 'ceph orch rm grafana', then re-apply your grafana.yaml containing the > >> > initial password. Note that this would remove all of the grafana > >> > configs or custom dashboards etc., you would have to reconfigure them. > >> > So before doing that you should verify that this is actually what > >> > you're looking for. Not sure what this has to do with Loki though. > >> > > >> > Eugen > >> > > >> > Zitat von Sake Ceph : > >> > > >> > > I configured a password for Grafana because I want to use Loki. I > >> > > used the spec parameter initial_admin_password and this works fine for > >> > > a > >> > > staging environment, where I never tried to used Grafana with a > >> > > password > >> > > for Loki. > >> > > > >> > >Using the username admin with the configured password gives a > >> > > credentials error on environment where I tried to use Grafana > >> with Loki in > >> > > the past (with 17.2.6 of Ceph/cephadm). I changed the password > >> in the past > >> > > within Grafana, but how can I overwrite this now? Or is there a way to > >> > > cleanup all Grafana files? > >> > > > >> > >Best regards, > >> > >Sake > >> > > >> > > >> > ___ > >> > ceph-users mailing list -- ceph-users@ceph.io > >> > To unsubscribe send an email to ceph-users-le...@ceph.io > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Help needed with Grafana password
I tried everything at this point, even waited an hour, still no luck. I got it working once by accident, but with a placeholder for a password. Trying the correct password did nothing, and trying again with the placeholder didn't work anymore.

So I thought I'd switch the manager, maybe something is not right (shouldn't happen). But when applying the Grafana spec on the other mgr, I get the following error in the log files:

services/grafana/ceph-dashboard.yml.j2 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/template.py", line 40, in render
    template = self.env.get_template(name)
  File "/lib/python3.6/site-packages/jinja2/environment.py", line 830, in get_template
    return self._load_template(name, self.make_globals(globals))
  File "/lib/python3.6/site-packages/jinja2/environment.py", line 804, in _load_template
    template = self.loader.load(self, name, globals)
  File "/lib/python3.6/site-packages/jinja2/loaders.py", line 113, in load
    source, filename, uptodate = self.get_source(environment, name)
  File "/lib/python3.6/site-packages/jinja2/loaders.py", line 235, in get_source
    raise TemplateNotFound(template)
jinja2.exceptions.TemplateNotFound: services/grafana/ceph-dashboard.yml.j2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1002, in _check_daemons
    self.mgr._daemon_action(daemon_spec, action=action)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2131, in _daemon_action
    daemon_spec.daemon_type)].prepare_create(daemon_spec)
  File "/usr/share/ceph/mgr/cephadm/services/monitoring.py", line 27, in prepare_create
    daemon_spec.final_config, daemon_spec.deps = self.generate_config(daemon_spec)
  File "/usr/share/ceph/mgr/cephadm/services/monitoring.py", line 54, in generate_config
    'services/grafana/ceph-dashboard.yml.j2', {'hosts': prom_services, 'loki_host': loki_host})
  File "/usr/share/ceph/mgr/cephadm/template.py", line 109, in render
    return self.engine.render(name, ctx)
  File "/usr/share/ceph/mgr/cephadm/template.py", line 47, in render
    raise TemplateNotFoundError(e.message)
cephadm.template.TemplateNotFoundError: services/grafana/ceph-dashboard.yml.j2

I use the following config for Grafana, nothing special.

service_type: grafana
service_name: grafana
placement:
  count: 2
  label: grafana
extra_container_args:
- -v=/opt/ceph_cert/host.cert:/etc/grafana/certs/cert_file:ro
- -v=/opt/ceph_cert/host.key:/etc/grafana/certs/cert_key:ro
spec:
  anonymous_access: true
  initial_admin_password: aPassw0rdWithSpecialChars-#

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Stretch mode size
I believe they are working on it, or want to work on it, to allow reverting from a stretched cluster, for exactly the reason you mention: if the other datacenter has totally burned down, you may want to switch to a single-datacenter setup for the time being. Best regards, Sake > Op 09-11-2023 11:18 CET schreef Eugen Block : > > > Hi, > > I'd like to ask for confirmation how I understand the docs on stretch > mode [1]. It requires exact size 4 for the rule? Other sizes are not > supported/won't work, for example size 6? Are there clusters out there > which use this stretch mode? > Once stretch mode is enabled, it's not possible to get out of it. How > would one deal with a burnt down datacenter which can take months to > rebuild? In a "self-managed" stretch cluster (let's say size 6) I > could simply change the crush rule to not consider the failed > datacenter anymore, deploy an additional mon somewhere and maybe > reduce the size/min_size. Am I missing something? > > Thanks, > Eugen > > [1] https://docs.ceph.com/en/reef/rados/operations/stretch-mode/#id2 > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
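For context on the size-4 point: the stretch-mode CRUSH rule places two replicas in each of the two datacenters, so size 4 maps onto a rule along these lines (a sketch with placeholder bucket names, based on the documentation Eugen links):

rule stretch_rule {
    id 1
    type replicated
    step take default
    step choose firstn 0 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}

Stretch mode is then enabled against that rule with something like 'ceph mon enable_stretch_mode tiebreakermon stretch_rule datacenter', where tiebreakermon stands for the monitor that sits outside both datacenters.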
[ceph-users] Re: Help needed with Grafana password
Thank you Eugen! This worked :) > Op 09-11-2023 14:55 CET schreef Eugen Block : > > > It's the '#' character, everything after (including '#' itself) is cut > off. I tried with single and double quotes which also failed. But as I > already said, use a simple password and then change it within grafana. > That way you also don't have the actual password lying around in clear > text in a yaml file... > > Zitat von Eugen Block : > > > I just tried it on a 17.2.6 test cluster, although I don't have a > > stack trace the complicated password doesn't seem to be applied > > (don't know why yet). But since it's an "initial" password you can > > choose something simple like "admin", and during the first login you > > are asked to change it anyway. And then you can choose your more > > complicated password, I just verified that. > > > > Zitat von Sake Ceph : > > > >> I tried everything at this point, even waited a hour, still no > >> luck. Got it 1 time accidentally working, but with a placeholder > >> for a password. Tried with correct password, nothing and trying > >> again with the placeholder didn't work anymore. > >> > >> So I thought to switch the manager, maybe something is not right > >> (shouldn't happen). But applying the Grafana spec on the other mgr, > >> I get the following error in the log files: > >> > >> services/grafana/ceph-dashboard.yml.j2 Traceback (most recent call > >> last): File "/usr/share/ceph/mgr/cephadm/template.py", > >> line 40, in render template = self.env.get_template(name) File > >> "/lib/python3.6/site-packages/jinja2/environment.py", > >> ine 830, in get_template return self._load_template(name, > >> self.make_globals(globals)) File > >> "/lib/python3.6/site-packages/jinja2/environment.py", > >> line 804, in _load_template template = self.loader.load(self, name, > >> globals) File "/lib/python3.6/site-packages/jinja2/loaders.py", > >> line 113, in load source, filename, uptodate = > >> self.get_source(environment, name) File > >> "/lib/python3.6/site-packages/jinja2/loaders.py", > >> line 235, in get_source raise TemplateNotFound(template) > >> jinja2.exceptions.TemplateNotFound: > >> services/grafana/ceph-dashboard.yml.j2 > >> > >> During handling of the above exception, another exception occurred: > >> Traceback (most recent call last): File > >> "/usr/share/ceph/mgr/cephadm/serve.py", > >> line 1002, in _check_daemons self.mgr._daemon_action(daemon_spec, > >> action=action) File "/usr/share/ceph/mgr/cephadm/module.py", > >> line 2131, in _daemon_action > >> daemon_spec.daemon_type)].prepare_create(daemon_spec) File > >> "/usr/share/ceph/mgr/cephadm/services/monitoring.py", > >> line 27, in prepare_create daemon_spec.final_config, > >> daemon_spec.deps = self.generate_config(daemon_spec) File > >> "/usr/share/ceph/mgr/cephadm/services/monitoring.py", > >> line 54, in generate_config > >> 'services/grafana/ceph-dashboard.yml.j2', {'hosts': prom_services, > >> 'loki_host': loki_host}) File > >> "/usr/share/ceph/mgr/cephadm/template.py", > >> line 109, in render return self.engine.render(name, ctx) File > >> "/usr/share/ceph/mgr/cephadm/template.py", > >> line 47, in render raise TemplateNotFoundError(e.message) > >> cephadm.template.TemplateNotFoundError: > >> services/grafana/ceph-dashboard.yml.j2 > >> > >> I use the following config for Grafana, nothing special. 
> >> > >> service_type: grafana > >> service_name: grafana > >> placement: > >> count: 2 > >> label: grafana > >> extra_container_args: > >> - -v=/opt/ceph_cert/host.cert:/etc/grafana/certs/cert_file:ro > >> - -v=/opt/ceph_cert/host.key:/etc/grafana/certs/cert_key:ro > >> spec: > >> anonymous_access: true > >> initial_admin_password: aPassw0rdWithSpecialChars-# > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
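Putting Eugen's finding together with the spec quoted above, a version that avoids the '#' problem sets a simple initial password and relies on changing it at first login; a sketch, with the password value being just a placeholder:

service_type: grafana
service_name: grafana
placement:
  count: 2
  label: grafana
extra_container_args:
- -v=/opt/ceph_cert/host.cert:/etc/grafana/certs/cert_file:ro
- -v=/opt/ceph_cert/host.key:/etc/grafana/certs/cert_key:ro
spec:
  anonymous_access: true
  initial_admin_password: ChangeMeAtFirstLogin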
[ceph-users] Re: Stretch mode size
Don't forget that with stretch mode, OSDs only communicate with the mons in the same DC, and the tiebreaker mon only communicates with the other mons (to prevent split-brain scenarios). A little late as a response, but I wanted you to know this :) ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] FS down - mds degraded
Starting a new thread, I forgot the subject in the previous one. So our FS is down. We got the following error, what can I do?

# ceph health detail
HEALTH_ERR 1 filesystem is degraded; 1 mds daemon damaged
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs atlassian-prod is degraded
[ERR] MDS_DAMAGE: 1 mds daemon damaged
    fs atlassian-prod mds.1 is damaged

# ceph fs get atlassian-prod
Filesystem 'atlassian-prod' (2)
fs_name atlassian-prod
epoch   43440
flags   32 joinable allow_snaps allow_multimds_snaps allow_standby_replay
created 2023-05-10T08:45:46.911064+
modified        2023-12-21T06:47:19.291154+
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
required_client_features       {}
last_failure    0
last_failure_osd_epoch  29480
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 3
in      0,1,2
up      {0=1073573,2=1073583}
failed
damaged 1
stopped
data_pools      [5]
metadata_pool   4
inline_data     disabled
balancer
standby_count_wanted    1
[mds.atlassian-prod.pwsoel13142.egsdfl{0:1073573} state up:resolve seq 573 join_fscid=2 addr [v2:10.233.127.22:6800/61692284,v1:10.233.127.22:6801/61692284] compat {c=[1],r=[1],i=[7ff]}]
[mds.atlassian-prod.pwsoel13143.qlvypn{2:1073583} state up:resolve seq 571 join_fscid=2 addr [v2:10.233.127.18:6800/3627858294,v1:10.233.127.18:6801/3627858294] compat {c=[1],r=[1],i=[7ff]}]

Best regards, Sake ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: FS down - mds degraded
Hi David,

Reducing max_mds didn't work. So I executed a fs reset:

ceph fs set atlassian-prod allow_standby_replay false
ceph fs set atlassian-prod cluster_down true
ceph mds fail atlassian-prod.pwsoel13142.egsdfl
ceph mds fail atlassian-prod.pwsoel13143.qlvypn
ceph fs reset atlassian-prod
ceph fs reset atlassian-prod --yes-i-really-mean-it

This brought the fs back online and the servers/applications are working again. Question: can I increase max_mds and activate standby_replay again? I will collect logs, maybe we can pinpoint the cause.

Best regards, Sake ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
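For the question above: once the filesystem is healthy again, raising max_mds and re-enabling standby-replay are just the mirror image of the commands used before the reset, roughly:

ceph fs set atlassian-prod max_mds 3
ceph fs set atlassian-prod allow_standby_replay true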
[ceph-users] Re: FS down - mds degraded
That wasn't really clear in the docs :( > Op 21-12-2023 17:26 CET schreef Patrick Donnelly : > > > On Thu, Dec 21, 2023 at 3:05 AM Sake Ceph wrote: > > > > Hi David > > > > Reducing max_mds didn't work. So I executed a fs reset: > > ceph fs set atlassian-prod allow_standby_replay false > > ceph fs set atlassian-prod cluster_down true > > ceph mds fail atlassian-prod.pwsoel13142.egsdfl > > ceph mds fail atlassian-prod.pwsoel13143.qlvypn > > ceph fs reset atlassian-prod > > ceph fs reset atlassian-prod --yes-i-really-mean-it > > > > This brought the fs back online and the servers/applications are working > > again. > > This was not the right thing to do. You can mark the rank repaired. See end > of: > > https://docs.ceph.com/en/latest/cephfs/administration/#daemons > > (ceph mds repaired ) > > I admit that is not easy to find. I will add a ticket to improve the > documentation: > > https://tracker.ceph.com/issues/63885 > > -- > Patrick Donnelly, Ph.D. > He / Him / His > Red Hat Partner Engineer > IBM, Inc. > GPG: 19F28A586F808C2402351B93C3301A3E258DD79D ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
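Applied to this thread, where rank 1 of atlassian-prod was marked damaged, the command Patrick points to would have been roughly (an illustration, not something that was actually run here):

ceph mds repaired atlassian-prod:1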
[ceph-users] MDS subtree pinning
Hi! As I'm reading through the documentation about subtree pinning, I was wondering if the following is possible. We've got the following directory structure:

/
/app1
/app2
/app3
/app4

Can I pin /app1 to MDS rank 0 and 1, the directory /app2 to rank 2 and finally /app3 and /app4 to rank 3? I would like to load balance the subfolders of /app1 to 2 (or 3) MDS servers.

Best regards, Sake ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
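For what it's worth, a static pin (ceph.dir.pin) attaches a directory to exactly one rank, so /app1 cannot be pinned to ranks 0 and 1 at the same time; spreading its subfolders over several ranks is what the ephemeral distributed pin is for. A sketch using the documented xattrs, run on a client mount whose path (/mnt/cephfs) is just an assumption:

setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/app2            # pin whole tree to rank 2
setfattr -n ceph.dir.pin -v 3 /mnt/cephfs/app3            # pin to rank 3
setfattr -n ceph.dir.pin -v 3 /mnt/cephfs/app4            # pin to rank 3
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/app1   # spread /app1's subdirectories across the active ranks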
[ceph-users] Re: MDS subtree pinning
Hi all! I have a few follow-up questions about subtree pinning, because the information isn't provided in the docs and I couldn't find anything about it. Deployment is via cephadm, using v17.2.7.

1. Can I pin directories before having multiple MDS nodes? Like when I have only 1 active MDS and assign folder App2 to rank 1.
2. Is there already a feature request for pinning directories via the dashboard? Again, I couldn't find a request.
3. I believe in the past you needed to remove the manual pins before an upgrade, is this still the case?

Best regards, Sake > Op 22-12-2023 13:43 CET schreef Sake Ceph : > > > Hi! > > As I'm reading through the documentation about subtree pinning, I was > wondering if the following is possible. > > We've got the following directory structure. > / > /app1 > /app2 > /app3 > /app4 > > Can I pin /app1 to MDS rank 0 and 1, the directory /app2 to rank 2 and > finally /app3 and /app4 to rank 3? > > I would like to load balance the subfolders of /app1 to 2 (or 3) MDS servers. > > Best regards, > Sake > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Cephfs error state with one bad file
Hi again, hopefully for the last time with problems. We had an MDS crash earlier with the MDS staying in a failed state, and used a command to reset the filesystem (this was wrong, I know now, thanks Patrick Donnelly for pointing this out). I did a full scrub on the filesystem and two files were damaged. One of those got repaired, but the following file keeps giving errors and can't be removed. What can I do now? Below some information.

# ceph tell mds.atlassian-prod:0 damage ls
[
  {
    "damage_type": "backtrace",
    "id": 224901,
    "ino": 1099534008829,
    "path": "/app1/shared/data/repositories/11271/objects/41/8f82507a0737c611720ed224bcc8b7a24fda01"
  }
]

Trying to repair the error (online research shows this should work for a backtrace damage type):

# ceph tell mds.atlassian-prod:0 scrub start /app1/shared/data/repositories/11271 recursive,repair,force
{
  "return_code": 0,
  "scrub_tag": "d10ead42-5280-4224-971e-4f3022e79278",
  "mode": "asynchronous"
}

Cluster logs after this:

1/2/24 9:37:05 AM [INF] scrub summary: idle
1/2/24 9:37:02 AM [INF] scrub summary: idle+waiting paths [/app1/shared/data/repositories/11271]
1/2/24 9:37:01 AM [INF] scrub summary: active paths [/app1/shared/data/repositories/11271]
1/2/24 9:37:01 AM [INF] scrub summary: idle+waiting paths [/app1/shared/data/repositories/11271]
1/2/24 9:37:01 AM [INF] scrub queued for path: /app1/shared/data/repositories/11271

But the error doesn't disappear and I still can't remove the file. On the client trying to remove the file (we've got a backup):

$ rm -f /mnt/shared_disk-app1/shared/data/repositories/11271/objects/41/8f82507a0737c611720ed224bcc8b7a24fda01
rm: cannot remove '/mnt/shared_disk-app1/shared/data/repositories/11271/objects/41/8f82507a0737c611720ed224bcc8b7a24fda01': Input/output error

Best regards, Sake ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
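If the scrub does manage to repair the backtrace, the stale entry can afterwards be cleared by its id (the id from the damage ls output above); clearing it while the damage is still present would only mask the error:

ceph tell mds.atlassian-prod:0 damage rm 224901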
[ceph-users] TLS 1.2 for dashboard
After upgrading to 17.2.7 our load balancers can't check the status of the manager nodes for the dashboard. After some troubleshooting I noticed only TLS 1.3 is available for the dashboard. Looking at the source (quincy), the TLS config got changed from 1.2 to 1.3. Searching the tracker I found out that we are not the only ones with troubles and that an option will be added to the dashboard config. Tracker ID 62940 got backports and the ones for reef and pacific are already merged, but the pull request (63068) for Quincy is closed :( What to do? I hope this one can get merged for 17.2.8. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: TLS 1.2 for dashboard
Hi Nizamudeen, Thank you for your quick response! The load balancers support TLS 1.3, but the administrators need to reconfigure the healthchecks. The only problem, it's a global change for all load balancers... So not something they change overnight and need to plan/test for. Best regards, Sake > Op 25-01-2024 15:22 CET schreef Nizamudeen A : > > > Hi, > > I'll re-open the PR and will merge it to Quincy. Btw i want to know if the > load balancers will be supporting tls 1.3 in future. Because we were planning > to completely drop the tls1.2 support from dashboard because of security > reasons. (But so far we are planning to keep it as it is atleast for the > older releases) > > Regards, > Nizam > > > On Thu, Jan 25, 2024, 19:41 Sake Ceph wrote: > > After upgrading to 17.2.7 our load balancers can't check the status of the > > manager nodes for the dashboard. After some troubleshooting I noticed only > > TLS 1.3 is availalbe for the dashboard. > > > > Looking at the source (quincy), TLS config got changed from 1.2 to 1.3. > > Searching in the tracker I found out that we are not the only one with > > troubles and there will be added an option to the dashboard config. Tracker > > ID 62940 got backports and the ones for reef and pacific already merged. > > But the pull request (63068) for Quincy is closed :( > > > > What to do? I hope this one can get merged for 17.2.8. > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: TLS 1.2 for dashboard
I would say drop it for the Squid release. Or, if you keep it in Squid but plan to disable it in a minor release later, please make a note in the release notes when the option is being removed. Just my 2 cents :) Best regards, Sake ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Mysterious Space-Eating Monster
Hi Matthew,

Cephadm doesn't clean up old container images, at least with Quincy. After an upgrade we run the following commands:

sudo podman system prune -a -f
sudo podman volume prune -f

But if someone has better advice, please tell us.

Kind regards, Sake > Op 19-04-2024 10:24 CEST schreef duluxoz : > > > Hi All, > > *Something* is chewing up a lot of space on our `\var` partition to the > point where we're getting warnings about the Ceph monitor running out of > space (ie > 70% full). > > I've been looking, but I can't find anything significant (ie log files > aren't too big, etc) BUT there seem to be a hell of a lot (15) of > sub-directories (with GUIDs for names) under the > `/var/lib/containers/storage/overlay/` folder, all ending with `merged` > - ie `/var/lib/containers/storage/overlay/{{GUID}}/`merged`. > > Is this normal, or is something going wrong somewhere, or am I looking > in the wrong place? > > Also, if this is the issue, can I delete these folders? > > Sorry for asking such a noob Q, but the Cephadm/Podman stuff is > extremely new to me :-) > > Thanks in advance > > Cheers > > Dulux-Oz > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
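Before pruning it can help to see what is actually taking the space; assuming podman as above, something like:

sudo podman system df    # summary of space used by images, containers and volumes
sudo podman images       # lists images, including old ones left behind by upgrades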
[ceph-users] Re: Stuck in replay?
Just a question: is it possible to block or disable all clients? Just to prevent load on the system. Kind regards, Sake > Op 22-04-2024 20:33 CEST schreef Erich Weiler : > > > I also see this from 'ceph health detail': > > # ceph health detail > HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized cache; 1 > MDSs behind on trimming > [WRN] FS_DEGRADED: 1 filesystem is degraded > fs slugfs is degraded > [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache > mds.slugfs.pr-md-01.xdtppo(mds.0): MDS cache is too large > (19GB/8GB); 0 inodes in use by clients, 0 stray files > [WRN] MDS_TRIM: 1 MDSs behind on trimming > mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (127084/250) > max_segments: 250, num_segments: 127084 > > MDS cache too large? The mds process is taking up 22GB right now and > starting to swap my server, so maybe it somehow is too large > > On 4/22/24 11:17 AM, Erich Weiler wrote: > > Hi All, > > > > We have a somewhat serious situation where we have a cephfs filesystem > > (18.2.1), and 2 active MDSs (one standby). ThI tried to restart one of > > the active daemons to unstick a bunch of blocked requests, and the > > standby went into 'replay' for a very long time, then RAM on that MDS > > server filled up, and it just stayed there for a while then eventually > > appeared to give up and switched to the standby, but the cycle started > > again. So I restarted that MDS, and now I'm in a situation where I see > > this: > > > > # ceph fs status > > slugfs - 29 clients > > == > > RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS > > 0 replay slugfs.pr-md-01.xdtppo 3958k 57.1k 12.2k 0 > > 1 resolve slugfs.pr-md-02.sbblqq 0 3 1 0 > > POOL TYPE USED AVAIL > > cephfs_metadata metadata 997G 2948G > > cephfs_md_and_data data 0 87.6T > > cephfs_data data 773T 175T > > STANDBY MDS > > slugfs.pr-md-03.mclckv > > MDS version: ceph version 18.2.1 > > (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable) > > > > It just stays there indefinitely. All my clients are hung. I tried > > restarting all MDS daemons and they just went back to this state after > > coming back up. > > > > Is there any way I can somehow escape this state of indefinite > > replay/resolve? > > > > Thanks so much! I'm kinda nervous since none of my clients have > > filesystem access at the moment... > > > > cheers, > > erich > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
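To Sake's question: newer releases have a filesystem flag that refuses new client sessions, which keeps client load off the MDS while it recovers. Assuming the flag is available on this 18.2.x cluster, it would look roughly like this (fs name taken from the thread):

ceph fs set slugfs refuse_client_session true     # block new client sessions during recovery
ceph fs set slugfs refuse_client_session false    # allow clients back in afterwards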
[ceph-users] Re: Stuck in replay?
100 GB of Ram! Damn that's a lot for a filesystem in my opinion, or am I wrong? Kind regards, Sake > Op 22-04-2024 21:50 CEST schreef Erich Weiler : > > > I was able to start another MDS daemon on another node that had 512GB > RAM, and then the active MDS eventually migrated there, and went through > the replay (which consumed about 100 GB of RAM), and then things > recovered. Phew. I guess I need significantly more RAM in my MDS > servers... I had no idea the MDS daemon could require that much RAM. > > -erich > > On 4/22/24 11:41 AM, Erich Weiler wrote: > > possibly but it would be pretty time consuming and difficult... > > > > Is it maybe a RAM issue since my MDS RAM is filling up? Should maybe I > > bring up another MDS on another server with huge amount of RAM and move > > the MDS there in hopes it will have enough RAM to complete the replay? > > > > On 4/22/24 11:37 AM, Sake Ceph wrote: > >> Just a question: is it possible to block or disable all clients? Just > >> to prevent load on the system. > >> > >> Kind regards, > >> Sake > >>> Op 22-04-2024 20:33 CEST schreef Erich Weiler : > >>> > >>> I also see this from 'ceph health detail': > >>> > >>> # ceph health detail > >>> HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized cache; 1 > >>> MDSs behind on trimming > >>> [WRN] FS_DEGRADED: 1 filesystem is degraded > >>> fs slugfs is degraded > >>> [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache > >>> mds.slugfs.pr-md-01.xdtppo(mds.0): MDS cache is too large > >>> (19GB/8GB); 0 inodes in use by clients, 0 stray files > >>> [WRN] MDS_TRIM: 1 MDSs behind on trimming > >>> mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (127084/250) > >>> max_segments: 250, num_segments: 127084 > >>> > >>> MDS cache too large? The mds process is taking up 22GB right now and > >>> starting to swap my server, so maybe it somehow is too large > >>> > >>> On 4/22/24 11:17 AM, Erich Weiler wrote: > >>>> Hi All, > >>>> > >>>> We have a somewhat serious situation where we have a cephfs filesystem > >>>> (18.2.1), and 2 active MDSs (one standby). ThI tried to restart one of > >>>> the active daemons to unstick a bunch of blocked requests, and the > >>>> standby went into 'replay' for a very long time, then RAM on that MDS > >>>> server filled up, and it just stayed there for a while then eventually > >>>> appeared to give up and switched to the standby, but the cycle started > >>>> again. So I restarted that MDS, and now I'm in a situation where I see > >>>> this: > >>>> > >>>> # ceph fs status > >>>> slugfs - 29 clients > >>>> == > >>>> RANK STATE MDS ACTIVITY DNS INOS > >>>> DIRS CAPS > >>>> 0 replay slugfs.pr-md-01.xdtppo 3958k 57.1k > >>>> 12.2k 0 > >>>> 1 resolve slugfs.pr-md-02.sbblqq 0 3 > >>>> 1 0 > >>>> POOL TYPE USED AVAIL > >>>> cephfs_metadata metadata 997G 2948G > >>>> cephfs_md_and_data data 0 87.6T > >>>> cephfs_data data 773T 175T > >>>> STANDBY MDS > >>>> slugfs.pr-md-03.mclckv > >>>> MDS version: ceph version 18.2.1 > >>>> (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable) > >>>> > >>>> It just stays there indefinitely. All my clients are hung. I tried > >>>> restarting all MDS daemons and they just went back to this state after > >>>> coming back up. > >>>> > >>>> Is there any way I can somehow escape this state of indefinite > >>>> replay/resolve? > >>>> > >>>> Thanks so much! I'm kinda nervous since none of my clients have > >>>> filesystem access at the moment... 
> >>>> > >>>> cheers, > >>>> erich > >>> ___ > >>> ceph-users mailing list -- ceph-users@ceph.io > >>> To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
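For reference, the 8GB in the "MDS cache is too large (19GB/8GB)" warning is the configured mds_cache_memory_limit; on a host with enough memory it can be raised at runtime, for example (the value is in bytes and just an illustration, size it to the hardware):

ceph config set mds mds_cache_memory_limit 17179869184    # 16 GiB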
[ceph-users] Status of 18.2.3
I was wondering what happened to the release of 18.2.3? Validation started on April 13th and as far as I know there have been a couple of builds and some extra bug fixes. Is there a way to follow a release or what is holding it back? Normally I wouldn't ask about a release and just wait, but I really need some fixes of this release. Kind regards, Sake ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Status of 18.2.3
I don't have access to Slack, but thank you for all your work! Fingers crossed for a quick release. Kind regards, Sake > Op 23-05-2024 16:20 CEST schreef Yuri Weinstein : > > > We are still working on the last-minute fixes, see this for details > https://ceph-storage.slack.com/archives/C054Q1NUBQT/p1711041666180929 > > Regards > YuriW > > On Thu, May 23, 2024 at 6:22 AM Sake Ceph wrote: > > > > I was wondering what happened to the release of 18.2.3? Validation started > > on April 13th and as far as I know there have been a couple of builds and > > some extra bug fixes. Is there a way to follow a release or what is holding > > it back? > > > > Normally I wouldn't ask about a release and just wait, but I really need > > some fixes of this release. > > > > Kind regards, > > Sake > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Lousy recovery for mclock and reef
Hi Isn't this just the limit of one HDD or the other HDD's for providing the data? Don't forget, recovery will drop even more for the last few objects. At least I noticed this when replacing a drive in my (little) cluster. Kind regards, Sake > Op 26-05-2024 09:36 CEST schreef Mazzystr : > > > I can't explain the problem. I have to recover three discs that are hdds. > I figured on just replacing one to give the full recovery capacity of the > cluster to that one disc. I was never able to achieve a higher recovery > rate than about 22 MiB/sec so I just added the other two discs. Recovery > bounced up to 129 MiB/sec for a while. Then things settled at 50 MiB/sec. > I kept tinkering to try to get back to 120 and now things are back to 23 > MiB/Sec again. This is very irritating. > > Cpu usage is minimal in the single digit %. Mem is right on target per > target setting in ceph.conf. Disc's and network appear to be 20% > utilized. > > I'm not a normal Ceph user. I don't care about client access at all. The > mclock assumptions are wrong for me. I want my data to be replicated > correctly as fast as possible. > > How do I open up the floodgates for maximum recovery performance? > > > > > On Sat, May 25, 2024 at 8:13 PM Zakhar Kirpichenko wrote: > > > Hi! > > > > Could you please elaborate what you meant by "adding another disc to the > > recovery process"? > > > > /Z > > > > > > On Sat, 25 May 2024, 22:49 Mazzystr, wrote: > > > >> Well this was an interesting journey through the bowels of Ceph. I have > >> about 6 hours into tweaking every setting imaginable just to circle back > >> to > >> my basic configuration and 2G memory target per osd. I was never able to > >> exceed 22 Mib/Sec recovery time during that journey. > >> > >> I did end up fixing the issue and now I see the following - > >> > >> io: > >> recovery: 129 MiB/s, 33 objects/s > >> > >> This is normal for my measly cluster. I like micro ceph clusters. I have > >> a lot of them. :) > >> > >> What was the fix? Adding another disc to the recovery process! I was > >> recovering to one disc now I'm recovering to two. I have three total that > >> need to be recovered. Somehow that one disc was completely swamped. I > >> was > >> unable to see it in htop, atop, iostat. Disc business was 6% max. > >> > >> My config is back to mclock scheduler, profile high_recovery_ops, and > >> backfills of 256. > >> > >> Thank you everyone that took the time to review and contribute. Hopefully > >> this provides some modern information for the next person that has slow > >> recovery. > >> > >> /Chris C > >> > >> > >> > >> > >> > >> On Fri, May 24, 2024 at 1:43 PM Kai Stian Olstad > >> wrote: > >> > >> > On 24.05.2024 21:07, Mazzystr wrote: > >> > > I did the obnoxious task of updating ceph.conf and restarting all my > >> > > osds. > >> > > > >> > > ceph --admin-daemon /var/run/ceph/ceph-osd.*.asok config get > >> > > osd_op_queue > >> > > { > >> > > "osd_op_queue": "wpq" > >> > > } > >> > > > >> > > I have some spare memory on my target host/osd and increased the > >> target > >> > > memory of that OSD to 10 Gb and restarted. No effect observed. In > >> > > fact > >> > > mem usage on the host is stable so I don't think the change took > >> effect > >> > > even with updating ceph.conf, restart and a direct asok config set. > >> > > target > >> > > memory value is confirmed to be set via asok config get > >> > > > >> > > Nothing has helped. I still cannot break the 21 MiB/s barrier. > >> > > > >> > > Does anyone have any more ideas? 
> >> > > >> > For recovery you can adjust the following. > >> > > >> > osd_max_backfills default is 1, in my system I get the best performance > >> > with 3 and wpq. > >> > > >> > The following I have not adjusted myself, but you can try. > >> > osd_recovery_max_active is default to 3. > >> > osd_recovery_op_priority is default to 3, a lower number increases the > >> > priority for recovery. > >> > > >> > All of them can be runtime adjusted. > >> > > >> > > >> > -- > >> > Kai Stian Olstad > >> > > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > >> > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
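For reference, the knobs mentioned in the quoted thread can all be changed at runtime; a sketch (on releases where mclock is active, the backfill/recovery limits are only honoured if the override flag is set):

ceph config set osd osd_mclock_profile high_recovery_ops
ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 3
ceph config set osd osd_recovery_max_active 3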
[ceph-users] Re: Help needed please ! Filesystem became read-only !
Hi, A little break into this thread, but I have some questions: * How does this happen, that the filesystem gets into readonly modus * Is this avoidable? * How-to fix the issue, because I didn't see a workaround in the mentioned tracker (or I missed it) * With this bug around, should you use cephfs with reef? Kind regards, Sake > Op 04-06-2024 04:04 CEST schreef Xiubo Li : > > > Hi Nicolas, > > This is a known issue and Venky is working on it, please see > https://tracker.ceph.com/issues/63259. > > Thanks > - Xiubo > > On 6/3/24 20:04, nbarb...@deltaonline.net wrote: > > Hello, > > > > First of all, thanks for reading my message. I set up a Ceph version 18.2.2 > > cluster with 4 nodes, everything went fine for a while, but after copying > > some files, the storage showed a warning status and the following message : > > "HEALTH_WARN: 1 MDSs are read only mds.PVE-CZ235007SH(mds.0): MDS in > > read-only mode". > > > > The logs are showing : > > > > Jun 03 08:20:41 PVE-CZ235007SH ceph-mds[1329868]: -> > > 2024-06-03T07:57:17.589+0200 77250fc006c0 -1 log_channel(cluster) log [ERR] > > : failed to store backtrace on ino 0x100039c object, pool 5, errno -2 > > Jun 03 08:20:41 PVE-CZ235007SH ceph-mds[1329868]: -9998> > > 2024-06-03T07:57:17.589+0200 77250fc006c0 -1 mds.0.189541 unhandled write > > error (2) No such file or directory, force readonly... > > > > After googling for a while, I did not find a hint to understand more > > precisely the root cause. Any help would we greatly appreciated, or even a > > link to post this request elsewhere if this is not the place to. > > > > Please find below additional details if needed. Thanks a lot ! > > > > Nicolas > > > > --- > > > > # ceph osd dump > > [...] > > pool 5 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 > > object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 292 > > flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 > > recovery_priority 5 application cephfs read_balance_score 4.51 > > [...] 
> > > > # ceph osd lspools > > 1 .mgr > > 4 cephfs_data > > 5 cephfs_metadata > > 18 ec-pool-001-data > > 19 ec-pool-001-metadata > > > > > > # ceph df > > --- RAW STORAGE --- > > CLASS SIZEAVAILUSED RAW USED %RAW USED > > hdd633 TiB 633 TiB 61 GiB61 GiB 0 > > TOTAL 633 TiB 633 TiB 61 GiB61 GiB 0 > > > > --- POOLS --- > > POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL > > .mgr 11 119 MiB 31 357 MiB 0200 TiB > > cephfs_data4 32 71 KiB8.38k 240 KiB 0200 TiB > > cephfs_metadata5 32 329 MiB6.56k 987 MiB 0200 TiB > > ec-pool-001-data 18 32 42 GiB 15.99k 56 GiB 0451 TiB > > ec-pool-001-metadata 19 32 0 B0 0 B 0200 TiB > > > > > > > > # ceph status > >cluster: > > id: f16f53e1-7028-440f-bf48-f99912619c33 > > health: HEALTH_WARN > > 1 MDSs are read only > > > >services: > > mon: 4 daemons, quorum > > PVE-CZ235007SG,PVE-CZ2341016V,PVE-CZ235007SH,PVE-CZ2341016T (age 35h) > > mgr: PVE-CZ235007SG(active, since 2d), standbys: PVE-CZ235007SH, > > PVE-CZ2341016T, PVE-CZ2341016V > > mds: 1/1 daemons up, 3 standby > > osd: 48 osds: 48 up (since 2d), 48 in (since 3d) > > > >data: > > volumes: 1/1 healthy > > pools: 5 pools, 129 pgs > > objects: 30.97k objects, 42 GiB > > usage: 61 GiB used, 633 TiB / 633 TiB avail > > pgs: 129 active+clean > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Help needed please ! Filesystem became read-only !
Hi Xiubo Thank you for the explanation! This won't be a issue for us, but made me think twice :) Kind regards, Sake > Op 04-06-2024 12:30 CEST schreef Xiubo Li : > > > On 6/4/24 15:20, Sake Ceph wrote: > > Hi, > > > > A little break into this thread, but I have some questions: > > * How does this happen, that the filesystem gets into readonly modus > > The detail explanation you can refer to the ceph PR: > https://github.com/ceph/ceph/pull/55421. > > > * Is this avoidable? > > * How-to fix the issue, because I didn't see a workaround in the mentioned > > tracker (or I missed it) > Possibly avoid changing data pools or disable multiple data pools? > > * With this bug around, should you use cephfs with reef? > > This will happen in all the releases, so that doesn't matter. > > - Xiubo > > > > > Kind regards, > > Sake > > > >> Op 04-06-2024 04:04 CEST schreef Xiubo Li : > >> > >> > >> Hi Nicolas, > >> > >> This is a known issue and Venky is working on it, please see > >> https://tracker.ceph.com/issues/63259. > >> > >> Thanks > >> - Xiubo > >> > >> On 6/3/24 20:04, nbarb...@deltaonline.net wrote: > >>> Hello, > >>> > >>> First of all, thanks for reading my message. I set up a Ceph version > >>> 18.2.2 cluster with 4 nodes, everything went fine for a while, but after > >>> copying some files, the storage showed a warning status and the following > >>> message : "HEALTH_WARN: 1 MDSs are read only mds.PVE-CZ235007SH(mds.0): > >>> MDS in read-only mode". > >>> > >>> The logs are showing : > >>> > >>> Jun 03 08:20:41 PVE-CZ235007SH ceph-mds[1329868]: -> > >>> 2024-06-03T07:57:17.589+0200 77250fc006c0 -1 log_channel(cluster) log > >>> [ERR] : failed to store backtrace on ino 0x100039c object, pool 5, > >>> errno -2 > >>> Jun 03 08:20:41 PVE-CZ235007SH ceph-mds[1329868]: -9998> > >>> 2024-06-03T07:57:17.589+0200 77250fc006c0 -1 mds.0.189541 unhandled write > >>> error (2) No such file or directory, force readonly... > >>> > >>> After googling for a while, I did not find a hint to understand more > >>> precisely the root cause. Any help would we greatly appreciated, or even > >>> a link to post this request elsewhere if this is not the place to. > >>> > >>> Please find below additional details if needed. Thanks a lot ! > >>> > >>> Nicolas > >>> > >>> --- > >>> > >>> # ceph osd dump > >>> [...] > >>> pool 5 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 > >>> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change > >>> 292 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 > >>> recovery_priority 5 application cephfs read_balance_score 4.51 > >>> [...] 
> >>> > >>> # ceph osd lspools > >>> 1 .mgr > >>> 4 cephfs_data > >>> 5 cephfs_metadata > >>> 18 ec-pool-001-data > >>> 19 ec-pool-001-metadata > >>> > >>> > >>> # ceph df > >>> --- RAW STORAGE --- > >>> CLASS SIZEAVAILUSED RAW USED %RAW USED > >>> hdd633 TiB 633 TiB 61 GiB61 GiB 0 > >>> TOTAL 633 TiB 633 TiB 61 GiB61 GiB 0 > >>> > >>> --- POOLS --- > >>> POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL > >>> .mgr 11 119 MiB 31 357 MiB 0200 TiB > >>> cephfs_data4 32 71 KiB8.38k 240 KiB 0200 TiB > >>> cephfs_metadata5 32 329 MiB6.56k 987 MiB 0200 TiB > >>> ec-pool-001-data 18 32 42 GiB 15.99k 56 GiB 0451 TiB > >>> ec-pool-001-metadata 19 32 0 B0 0 B 0200 TiB > >>> > >>> > >>> > >>> # ceph status > >>> cluster: > >>> id: f16f53e1-7028-440f-bf48-f99912619c33 > >>> health: HEALTH_WARN > >>> 1 MDSs are read only > >>> > >>> services: > >>> mon: 4 daemons, quorum > >>> PVE-CZ235007SG,PVE-CZ2341016V,PVE-CZ235007SH,PVE-CZ2341016T (age 35h) > >>> mgr: PVE-CZ235007SG(active, since 2d), standbys: PVE-CZ235007SH, > >>> PVE-CZ2341016T, PVE-CZ2341016V > >>> mds: 1/1 daemons up, 3 standby > >>> osd: 48 osds: 48 up (since 2d), 48 in (since 3d) > >>> > >>> data: > >>> volumes: 1/1 healthy > >>> pools: 5 pools, 129 pgs > >>> objects: 30.97k objects, 42 GiB > >>> usage: 61 GiB used, 633 TiB / 633 TiB avail > >>> pgs: 129 active+clean > >>> ___ > >>> ceph-users mailing list -- ceph-users@ceph.io > >>> To unsubscribe send an email to ceph-users-le...@ceph.io > >>> > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Update OS with clean install
Hi all,

I'm working on a way to automate the OS upgrade of our hosts. This happens with a complete reinstall of the OS. What is the correct way to do this? At the moment I'm using the following (a rough command sketch follows below):

* Store the host labels (we use labels to deploy the services)
* Fail over MDS and MGR services if they are running on the host
* Set the host in maintenance mode
* Reinstall the host with the newer OS
* Remove the host from the cluster
* Configure the host with the correct settings (for example the cephadm user SSH key etc.)
* Add the host to the cluster again with the correct labels
* For OSD hosts run ceph cephadm osd activate

If somebody has some advice I would gladly hear about it!

Kind regards, Sake ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
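A rough sketch of the cephadm side of that list, with hostname, address and label as placeholders (the reinstall itself and recreating the cephadm user happen outside Ceph):

ceph orch host ls                                    # record the labels currently set on the host
ceph orch host maintenance enter host1.example.com
# ... reinstall the OS, recreate the cephadm user and push the orchestrator SSH key ...
ceph orch host rm host1.example.com --offline --force
ceph orch host add host1.example.com 192.0.2.10
ceph orch host label add host1.example.com osd       # repeat for every label recorded earlier
ceph cephadm osd activate host1.example.com          # only needed on OSD hosts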
[ceph-users] Re: Update OS with clean install
Hi Robert, I tried, but that doesn't work :( Using exit maintenance mode results in the error: "missing 2 required positional arguments: 'hostname' and 'addr'" But running the command a second time, it looks like it works, but then I get errors with starting the containers. The start up fails because it can't pull the container image because authentication is required (our instance is offline and we're using a local image registry with authentication). Kind regards, Sake > Op 04-06-2024 14:40 CEST schreef Robert Sander : > > > Hi, > > On 6/4/24 14:35, Sake Ceph wrote: > > > * Store host labels (we use labels to deploy the services) > > * Fail-over MDS and MGR services if running on the host > > * Remove host from cluster > > * Add host to cluster again with correct labels > > AFAIK the steps above are not necessary. It should be sufficient to do these: > > * Set host in maintenance mode > * Reinstall host with newer OS > * Configure host with correct settings (for example cephadm user SSH key etc.) > * Unset maintenance mode for the host > * For OSD hosts run ceph cephadm osd activate > > Regards > -- > Robert Sander > Heinlein Consulting GmbH > Schwedter Str. 8/9b, 10119 Berlin > > https://www.heinlein-support.de > > Tel: 030 / 405051-43 > Fax: 030 / 405051-19 > > Amtsgericht Berlin-Charlottenburg - HRB 220009 B > Geschäftsführer: Peer Heinlein - Sitz: Berlin > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Balancing MDS services on multiple hosts
Another shot, the company mail server did something special...

I deployed a small cluster for testing/deploying CephFS with cephadm. I was wondering if it's possible to balance the active and standby daemons over the hosts.

The service configuration:

service_type: mds
service_id: test-fs
service_name: mds.test-fs
placement:
  count: 4
  hosts:
  - host1.example.com
  - host2.example.com

Commands used to create the filesystem:

ceph fs volume create test-fs
ceph fs set test-fs max_mds 2

After the creation of the filesystem and setting the service configuration, an active and a standby service were deployed on each node. But after a reboot of host2, the active services were all hosted on host1 and all standby services were hosted on host2. This didn't change. It would be great if after a while the active services would be distributed evenly over the available hosts. Is it possible to achieve this at the moment (automatically or manually)?

Thanks, Sake ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
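As far as I know the orchestrator only places the daemons; which daemon becomes active for a rank is decided by the MDS map, so after a reboot the actives can all end up on one host. A manual way to move a rank back is to fail the active daemon so that a standby on the other host takes over; a sketch, with the generated daemon name being whatever ceph fs status shows:

ceph fs status test-fs     # shows which daemon currently holds which rank
ceph mds fail test-fs:1    # a standby (ideally on the other host) picks up rank 1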
[ceph-users] Failed to probe daemons or devices
Last friday I upgrade the Ceph cluster from 17.2.3 to 17.2.5 with "ceph orch upgrade start --image localcontainerregistry.local.com:5000/ceph/ceph:v17.2.5-20221017". After sometime, an hour?, I've got a health warning: CEPHADM_REFRESH_FAILED: failed to probe daemons or devices. I'm using only Cephfs on the cluster and it's still working correctly. Checking the running services, everything is up and running; mon, osd and mds. But on the hosts running mon and mds services I get errors in the cephadm.log, see the loglines below. I look likes cephadm tries to start a container for checking something? What could be wrong? On mon nodes I got the following: 2022-10-24 10:31:43,880 7f179e5bfb80 DEBUG cephadm ['gather-facts'] 2022-10-24 10:31:44,333 7fc2d52b6b80 DEBUG cephadm ['--image', 'localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0', 'ceph-volume', '--fsid', '8909ef90-22ea-11ed-8df1-0050569ee1b1', '--', 'inventory', '--format=json-pretty', '--filter-for-batch'] 2022-10-24 10:31:44,663 7fc2d52b6b80 INFO Inferring config /var/lib/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1/mon.oqsoel24332/config 2022-10-24 10:31:44,663 7fc2d52b6b80 DEBUG Using specified fsid: 8909ef90-22ea-11ed-8df1-0050569ee1b1 2022-10-24 10:31:45,574 7fc2d52b6b80 INFO Non-zero exit code 1 from /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0 -e NODE_NAME=monnode2.local.com -e CEPH_USE_RANDOM_NONCE=1 -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1:/var/run/ceph:z -v /var/log/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1:/var/log/ceph:z -v /var/lib/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1/crash:/var/lib/ceph/crash:z -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /var/lib/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1/selinux:/sys/fs/selinux:ro -v /:/rootfs -v /tmp/ceph-tmp31tx1iy2:/etc/ceph/ce ph.conf:z localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0 inventory --format=json-pretty --filter-for-batch 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr Traceback (most recent call last): 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File "/usr/sbin/ceph-volume", line 11, in 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')() 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__ 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr self.main(self.argv) 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr return f(*a, **kw) 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr terminal.dispatch(self.mapper, subcommand_args) 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO 
/bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr instance.main() 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/inventory/main.py", line 53, in main 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr with_lsm=self.args.with_lsm)) 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/util/device.py", line 39, in __init__ 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr all_devices_vgs = lvm.get_all_devices_vgs() 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/api/lvm.py", line 797, in get_all_devices_vgs 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr return [VolumeGroup(**vg) for vg in vgs] 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/api/lvm.py", line 797, in 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr return [VolumeGroup(**vg) for vg in vgs] 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /b
[ceph-users] Re: Failed to probe daemons or devices
I've created an issue: https://tracker.ceph.com/issues/57918

What more can I do to get this issue fixed?

And the output of the requested commands:

[cephadm@mdshost2 ~]$ sudo lvs -a
  LV               VG     Attr   LSize   Pool Origin Data% Meta% Move Log Cpy%Sync Convert
  lv_home          vg_sys -wi-ao 256.00m
  lv_opt           vg_sys -wi-ao 3.00g
  lv_root          vg_sys -wi-ao 5.00g
  lv_swap          vg_sys -wi-ao 7.56g
  lv_tmp           vg_sys -wi-ao 1.00g
  lv_var           vg_sys -wi-ao 15.00g
  lv_var_log       vg_sys -wi-ao 5.00g
  lv_var_log_audit vg_sys -wi-ao 512.00m

[cephadm@mdshost2 ~]$ sudo vgs -a
  VG     #PV #LV #SN Attr   VSize   VFree
  vg_sys   1   8   0 wz--n- <49.00g 11.68g

[cephadm@mdshost2 ~]$ sudo parted --list
Model: VMware Virtual disk (scsi)
Disk /dev/sda: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  1075MB  1074MB  primary  xfs          boot
 2      1075MB  53.7GB  52.6GB  primary               lvm

Error: /dev/sdb: unrecognised disk label
Model: VMware Virtual disk (scsi)
Disk /dev/sdb: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: unknown
Disk Flags:

From: Guillaume Abrioux
Sent: Monday, October 24, 2022 5:50:20 PM
To: Sake Paulusma
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Failed to probe daemons or devices

Hello Sake,

Could you share the output of vgs / lvs commands? Also, I would suggest you to open a tracker [1]

Thanks!

[1] https://tracker.ceph.com/projects/ceph-volume

On Mon, 24 Oct 2022 at 10:51, Sake Paulusma <sake1...@hotmail.com> wrote:
Last friday I upgrade the Ceph cluster from 17.2.3 to 17.2.5 with "ceph orch upgrade start --image localcontainerregistry.local.com:5000/ceph/ceph:v17.2.5-20221017". After sometime, an hour?, I've got a health warning: CEPHADM_REFRESH_FAILED: failed to probe daemons or devices. I'm using only Cephfs on the cluster and it's still working correctly. Checking the running services, everything is up and running; mon, osd and mds. But on the hosts running mon and mds services I get errors in the cephadm.log, see the loglines below. I look likes cephadm tries to start a container for checking something? What could be wrong?
On mon nodes I got the following: 2022-10-24 10:31:43,880 7f179e5bfb80 DEBUG cephadm ['gather-facts'] 2022-10-24 10:31:44,333 7fc2d52b6b80 DEBUG cephadm ['--image', 'localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0<https://nam12.safelinks.protection.outlook.com/?url=http%3A%2F%2Flocalcontainerregistry.local.com%3A5000%2Fceph%2Fceph%40sha256%3A122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0&data=05%7C01%7C%7C8cf7f4f7348f4560917308dab5d77e21%7C84df9e7fe9f640afb435%7C1%7C0%7C638022234378380450%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=uNhTCkSwBrcq%2FZNFF6TgKHy5nmC%2FDM5ly3QMIEl8Z8I%3D&reserved=0>', 'ceph-volume', '--fsid', '8909ef90-22ea-11ed-8df1-0050569ee1b1', '--', 'inventory', '--format=json-pretty', '--filter-for-batch'] 2022-10-24 10:31:44,663 7fc2d52b6b80 INFO Inferring config /var/lib/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1/mon.oqsoel24332/config 2022-10-24 10:31:44,663 7fc2d52b6b80 DEBUG Using specified fsid: 8909ef90-22ea-11ed-8df1-0050569ee1b1 2022-10-24 10:31:45,574 7fc2d52b6b80 INFO Non-zero exit code 1 from /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0<https://nam12.safelinks.protection.outlook.com/?u
[ceph-users] Re: Failed to probe daemons or devices
I fixed the issue by removing the blank/unlabeled disk. It is still a bug, so hopefully it gets fixed for the next person who can't easily remove a disk :) ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
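For anyone who can't simply pull the disk: a possible alternative (untested here; hostname and device path are placeholders) is to give the blank device a partition label or zap it so ceph-volume inventory no longer trips over it:

ceph orch device zap mdshost2 /dev/sdb --force   # wipes the device; note it may then be picked up for OSD deployment if a matching drivegroup spec exists
# or, outside of cephadm, just give it a partition label:
sudo parted /dev/sdb mklabel gpt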
[ceph-users] Re: How to ... alertmanager and prometheus
Hi

I noticed that cephadm would update the grafana-frontend-api-url with version 17.2.3, but this looks broken in version 17.2.5. It isn't a big deal to update the URL myself, but it's quite irritating to have to do so when it used to correct itself. (For reference, the dashboard commands I use to correct the URLs by hand are shown after the quoted thread below.)

Best regards, Sake

From: Eugen Block Sent: Wednesday, November 9, 2022 9:26:28 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: How to ... alertmanager and prometheus

The only thing I noticed was that I had to change the grafana-api-url for the dashboard when I stopped one of the two grafana instances. I wasn't able to test the dashboard before because I had to wait for new certificates so my browser wouldn't complain about the cephadm cert. So it seems as if the failover doesn't work entirely automatically, but it's not too much work to switch the api url. :-)

Zitat von Michael Lipp :
> Thank you both very much! I have understood things better now.
>
> I'm not sure, though, whether all URIs are adjusted properly when
> changing the placement of the services. Still testing...
>
> Am 08.11.22 um 17:13 schrieb Redouane Kachach Elhichou:
>> Welcome Eugen,
>>
>> There are some ongoing efforts to make the whole prometheus stack config
>> more dynamic by using the http sd configuration [1]. In fact part of the
>> changes are already in main but they will not be available till the next
>> Ceph official release.
>>
>> [1] https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config
>>
>> On Tue, Nov 8, 2022 at 4:47 PM Eugen Block wrote:
>>
>>> I somehow missed the HA part in [1], thanks for pointing that out.
>>>
>>> Zitat von Redouane Kachach Elhichou :
>>>
>>>> If you are running quincy and using cephadm then you can have more
>>>> instances of prometheus (and other monitoring daemons) running in HA mode
>>>> by increasing the number of daemons as in [1]:
>>>>
>>>> from a cephadm shell (to run 2 instances of prometheus and alertmanager):
>>>>> ceph orch apply prometheus --placement 'count:2'
>>>>> ceph orch apply alertmanager --placement 'count:2'
>>>> You can have as many instances as you need. You can choose on which nodes
>>>> to place them by using the daemon placement specification of cephadm [2]
>>>> by using a specific label for monitoring i.e. In case of mgr failover
>>>> cephadm should reconfigure the daemons accordingly.
>>>> [1] https://docs.ceph.com/en/quincy/cephadm/services/monitoring/#deploying-monitoring-with-cephadm
>>>> [2] https://docs.ceph.com/en/quincy/cephadm/services/#daemon-placement
>>>>
>>>> Hope it helps,
>>>> Redouane.
>>>>
>>>> On Tue, Nov 8, 2022 at 3:58 PM Eugen Block wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> the only information I found so far was this statement from the redhat
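For reference, a minimal sketch of correcting the endpoints by hand until cephadm sets them again. The hostnames below are placeholders and the ports are the cephadm defaults; adjust to your environment:

ceph dashboard set-grafana-api-url https://grafana-host.example.com:3000
ceph dashboard set-prometheus-api-host http://prometheus-host.example.com:9095
ceph dashboard set-alertmanager-api-host http://alertmanager-host.example.com:9093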
[ceph-users] How to replace or add a monitor in stretch cluster?
I successfully set up a stretched cluster, except that the CRUSH rule mentioned in the docs wasn't correct: the parameters "min_size" and "max_size" should be removed, or else the rule can't be imported. Second, there should be a mention that setting the monitor crush location takes some time and that no other ceph command can be used in the meantime.

But now I need to replace a few monitors (it's virtualized and the machines need to be replaced). I use cephadm and have a label "mon" which is assigned to the monitor services. With the command "ceph orch host add --labels=mon" I normally add a new monitor to the cluster. Only this results in the following error in the logs:

12/2/22 1:27:19 PM [INF] attempted to join from [v2::3300/0,v1::6789/0]; but lacks a crush_location for stretch mode

Next I tried to set the CRUSH location, like the docs describe for stretch mode, with the command "ceph mon set_location datacenter=". This only results in the following error:

Error ENOENT: mon.oqsoel11437 does not exist

So how can I add/replace a monitor in a stretched cluster?

Best regards, Sake ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: How to replace or add a monitor in stretch cluster?
That isn't a great solution indeed, but I'll give it a try. Would this also be necessary when replacing the tiebreaker?

From: Adam King Sent: Friday, December 2, 2022 2:48:19 PM To: Sake Paulusma Cc: ceph-users@ceph.io Subject: Re: [ceph-users] How to replace or add a monitor in stretch cluster?

This can't be done in a very nice way currently. There's actually an open PR against main to allow setting the crush location for mons in the service spec, specifically because others found this annoying as well. What I think should work as a workaround is: go to the host running the mon that failed to join the quorum due to the lack of a crush location, open the /var/lib/ceph//mon./unit.run file, then at the very end of the last line (the last line should be a long podman/docker run command) append "--set-crush-location ". Then, still on that host, run a "systemctl restart " where mon-service is the systemd unit listed for the mon in "cephadm ls --no-detail". That should allow the monitor to at least join the quorum as it now has a crush location, and then you should be able to make other alterations a bit easier.

On Fri, Dec 2, 2022 at 7:40 AM Sake Paulusma <sake1...@hotmail.com> wrote:
I succesfully setup a stretched cluster, except the CRUSH rule mentioned in the docs wasn't correct. The parameters for "min_size" and "max_size" should be removed, or else the rule can't be imported. Second there should be a mention about setting the monitor crush location takes sometime and know other ceph command can be used. But now I need to replace a few monitors (it's virtualized and machines need to be replaced). I use cephadm and have a label "mon" which is assigned to the monitor services. With the command "ceph orch host add --labels=mon" I add normally a new monitor to the cluster. Only this results in the following error in de logs: 12/2/22 1:27:19 PM [INF] attempted to join from [v2::3300/0,v1::6789/0]; but lacks a crush_location for stretch mode Next I tried to set the CRUSH location like used in the docs for stretch mode with the command "ceph mon set_location datacenter=". This only results in the following error: Error ENOENT: mon.oqsoel11437 does not exist So how can I add/replace a monitor in a stretched cluster? Best regards, Sake ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
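For the archive, the workaround spelled out as commands (fsid, hostname and datacenter name are placeholders; the exact systemd unit name can be taken from "cephadm ls --no-detail"):

# on the host running the stuck mon:
vi /var/lib/ceph/<fsid>/mon.<host>/unit.run
#   append to the end of the last (podman/docker run) line:
#   --set-crush-location datacenter=dc2
systemctl restart ceph-<fsid>@mon.<host>.service
# once the mon has joined quorum, the location can be verified/adjusted with:
ceph mon set_location <mon-name> datacenter=dc2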
[ceph-users] Re: How to replace or add a monitor in stretch cluster?
The instructions work great, the monitor is added in the monmap now. I asked about the Tiebreaker because there is a special command to replace the current one. But this manual intervention is probably still needed to first set the correct location. Will report back later when I replace the current Tiebreaker with one in another datacenter. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
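For completeness, the special command I meant for the tiebreaker is, as far as I can tell, the one below; the argument is the name of the new monitor (check the stretch mode docs for the exact argument form):

ceph mon set_new_tiebreaker <new-mon-name>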
[ceph-users] Health warning - POOL_TARGET_SIZE_BYTES_OVERCOMMITED
Hello,

I configured a stretched cluster across two datacenters. It's working fine, except this weekend the raw capacity exceeded 50% and the error POOL_TARGET_SIZE_BYTES_OVERCOMMITED showed up.

The command "ceph df" is showing the correct cluster size, but "ceph osd pool autoscale-status" is showing half of the total raw capacity. What could be wrong?

[ceph: root@aqsoel11445 /]# ceph status
  cluster:
    id: adbe7bb6-5h6d-11ed-8511-004449ede0c
    health: HEALTH_WARN
            1 MDSs report oversized cache
            1 subtrees have overcommitted pool target_size_bytes
  services:
    mon: 5 daemons, quorum host1,host2,host3,host4,host5 (age 4w)
    mgr: aqsoel11445.nqamuz (active, since 5w), standbys: host1.wujgas
    mds: 2/2 daemons up, 2 standby
    osd: 12 osds: 12 up (since 5w), 12 in (since 9w)
  data:
    volumes: 2/2 healthy
    pools: 5 pools, 193 pgs
    objects: 17.31M objects, 1.2 TiB
    usage: 5.0 TiB used, 3.8 TiB / 8.8 TiB avail
    pgs: 192 active+clean
         1 active+clean+scrubbing

[ceph: root@aqsoel11445 /]# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
ssd    8.8 TiB  3.8 TiB  5.0 TiB  5.0 TiB   56.83
TOTAL  8.8 TiB  3.8 TiB  5.0 TiB  5.0 TiB   56.83

--- POOLS ---
POOL                         ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr                          1    1  449 KiB        2  1.8 MiB      0    320 GiB
cephfs.application-tst.meta   2   16  540 MiB   18.79k  2.1 GiB   0.16    320 GiB
cephfs.application-tst.data   3   32  4.4 GiB    8.01k   17 GiB   1.33    320 GiB
cephfs.application-acc.meta   4   16   11 GiB    3.54M   45 GiB   3.37    320 GiB
cephfs.application-acc.data   5  128  1.2 TiB   13.74M  4.8 TiB  79.46    320 GiB

[ceph: root@aqsoel11445 /]# ceph osd pool autoscale-status
POOL                         SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
.mgr                         448.5k               4.0   4499G         0.                                     1.0   1                   on         False
cephfs.application-tst.meta  539.8M               4.0   4499G         0.0005                                 4.0   16                  on         False
cephfs.application-tst.data  4488M   51200M       4.0   4499G         0.0444                                 1.0   32                  on         False
cephfs.application-acc.meta  11430M               4.0   4499G         0.0099                                 4.0   16                  on         False
cephfs.application-acc.data  1244G                4.0   4499G         1.1062  1.            0.9556           1.0   128                 on         False

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Health warning - POOL_TARGET_SIZE_BYTES_OVERCOMMITED
The RATIO for cephfs.application-acc.data shouldn't be over 1.0; I believe this is what triggered the error. All weekend I was thinking about this issue, but couldn't find an option to correct it. But minutes after posting I found a blog post about the autoscaler (https://ceph.io/en/news/blog/2022/autoscaler_tuning) and it mentions the option to set the rate. Shouldn't this option be set to 2 when using a stretched cluster, and not 4? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
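In the meantime, as a possible workaround (pool names taken from the output above, the values are just examples), the explicit target can be cleared or replaced by a ratio so the autoscaler stops warning:

ceph osd pool set cephfs.application-tst.data target_size_bytes 0      # clear the explicit target size
ceph osd pool set cephfs.application-acc.data target_size_ratio 0.8    # or steer the autoscaler with a ratio instead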
[ceph-users] Re: Health warning - POOL_TARGET_SIZE_BYTES_OVERCOMMITED
Hey Greg, I'm just analyzing this issue and it isn't strange the total cluster size is half the total size (or the smallest of both clusters). Because you shouldn't write more data to the cluster than the smallest datacenter can handle. Second when in datacenter fail over modus, the cluster size From: Gregory Farnum Sent: Monday, February 13, 2023 5:32:18 PM To: Sake Paulusma Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Health warning - POOL_TARGET_SIZE_BYTES_OVERCOMMITED On Mon, Feb 13, 2023 at 4:16 AM Sake Paulusma wrote: > > Hello, > > I configured a stretched cluster on two datacenters. It's working fine, > except this weekend the Raw Capicity exceeded 50% and the error > POOL_TARGET_SIZE_BYTES_OVERCOMMITED showed up. > > The command "ceph df" is showing the correct cluster size, but "ceph osd pool > autoscale-status" is showing half of the total Raw Capacity. > > What could be wrong? There's a bug with the statistics handling of pools in stretch mode, and others like them. :( https://tracker.ceph.com/issues/56650 -Greg > > > > > [ceph: root@aqsoel11445 /]# ceph status > cluster: > id: adbe7bb6-5h6d-11ed-8511-004449ede0c > health: HEALTH_WARN > 1 MDSs report oversized cache > 1 subtrees have overcommitted pool target_size_bytes > > services: > mon: 5 daemons, quorum host1,host2,host3,host4,host5 (age 4w) > mgr: aqsoel11445.nqamuz(active, since 5w), standbys: host1.wujgas > mds: 2/2 daemons up, 2 standby > osd: 12 osds: 12 up (since 5w), 12 in (since 9w) > > data: > volumes: 2/2 healthy > pools: 5 pools, 193 pgs > objects: 17.31M objects, 1.2 TiB > usage: 5.0 TiB used, 3.8 TiB / 8.8 TiB avail > pgs: 192 active+clean > 1 active+clean+scrubbing > > > > [ceph: root@aqsoel11445 /]# ceph df > --- RAW STORAGE --- > CLASS SIZEAVAIL USED RAW USED %RAW USED > ssd8.8 TiB 3.8 TiB 5.0 TiB 5.0 TiB 56.83 > TOTAL 8.8 TiB 3.8 TiB 5.0 TiB 5.0 TiB 56.83 > > --- POOLS --- > POOL ID PGS STORED OBJECTS USED %USED MAX > AVAIL > .mgr11 449 KiB2 1.8 MiB 0320 > GiB > cephfs.application-tst.meta 2 16 540 MiB 18.79k 2.1 GiB 0.16320 > GiB > cephfs.application-tst.data 3 32 4.4 GiB8.01k 17 GiB 1.33320 > GiB > cephfs.application-acc.meta 4 16 11 GiB3.54M 45 GiB 3.37320 > GiB > cephfs.application-acc.data 5 128 1.2 TiB 13.74M 4.8 TiB 79.46320 > GiB > > > > [ceph: root@aqsoel11445 /]# ceph osd pool autoscale-status > POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO > TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE BULK > .mgr 448.5k4.0 4499G 0. > 1.0 1 on False > cephfs.application-tst.meta 539.8M4.0 4499G 0.0005 > 4.0 16 on False > cephfs.application-tst.data 4488M 51200M 4.0 4499G 0.0444 > 1.0 32 on False > cephfs.application-acc.meta 11430M4.0 4499G 0.0099 > 4.0 16 on False > cephfs.application-acc.data 1244G4.0 4499G 1.1062 > 1. 0.9556 1.0 128 on False > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Health warning - POOL_TARGET_SIZE_BYTES_OVERCOMMITED
OK, pushed a little soon on the send button.. But in datacenter fail over modus the replication size changes to 2. And that's why I believe the RATIO should be 2 instead of 4 or the Raw Capacity should be doubled. Am I wrong or should someone make a choice? From: Sake Paulusma Sent: Monday, February 13, 2023 6:52:45 PM To: Gregory Farnum Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Health warning - POOL_TARGET_SIZE_BYTES_OVERCOMMITED Hey Greg, I'm just analyzing this issue and it isn't strange the total cluster size is half the total size (or the smallest of both clusters). Because you shouldn't write more data to the cluster than the smallest datacenter can handle. Second when in datacenter fail over modus, the cluster size From: Gregory Farnum Sent: Monday, February 13, 2023 5:32:18 PM To: Sake Paulusma Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Health warning - POOL_TARGET_SIZE_BYTES_OVERCOMMITED On Mon, Feb 13, 2023 at 4:16 AM Sake Paulusma wrote: > > Hello, > > I configured a stretched cluster on two datacenters. It's working fine, > except this weekend the Raw Capicity exceeded 50% and the error > POOL_TARGET_SIZE_BYTES_OVERCOMMITED showed up. > > The command "ceph df" is showing the correct cluster size, but "ceph osd pool > autoscale-status" is showing half of the total Raw Capacity. > > What could be wrong? There's a bug with the statistics handling of pools in stretch mode, and others like them. :( https://tracker.ceph.com/issues/56650 -Greg > > > > > [ceph: root@aqsoel11445 /]# ceph status > cluster: > id: adbe7bb6-5h6d-11ed-8511-004449ede0c > health: HEALTH_WARN > 1 MDSs report oversized cache > 1 subtrees have overcommitted pool target_size_bytes > > services: > mon: 5 daemons, quorum host1,host2,host3,host4,host5 (age 4w) > mgr: aqsoel11445.nqamuz(active, since 5w), standbys: host1.wujgas > mds: 2/2 daemons up, 2 standby > osd: 12 osds: 12 up (since 5w), 12 in (since 9w) > > data: > volumes: 2/2 healthy > pools: 5 pools, 193 pgs > objects: 17.31M objects, 1.2 TiB > usage: 5.0 TiB used, 3.8 TiB / 8.8 TiB avail > pgs: 192 active+clean > 1 active+clean+scrubbing > > > > [ceph: root@aqsoel11445 /]# ceph df > --- RAW STORAGE --- > CLASS SIZEAVAIL USED RAW USED %RAW USED > ssd8.8 TiB 3.8 TiB 5.0 TiB 5.0 TiB 56.83 > TOTAL 8.8 TiB 3.8 TiB 5.0 TiB 5.0 TiB 56.83 > > --- POOLS --- > POOL ID PGS STORED OBJECTS USED %USED MAX > AVAIL > .mgr11 449 KiB2 1.8 MiB 0320 > GiB > cephfs.application-tst.meta 2 16 540 MiB 18.79k 2.1 GiB 0.16320 > GiB > cephfs.application-tst.data 3 32 4.4 GiB8.01k 17 GiB 1.33320 > GiB > cephfs.application-acc.meta 4 16 11 GiB3.54M 45 GiB 3.37320 > GiB > cephfs.application-acc.data 5 128 1.2 TiB 13.74M 4.8 TiB 79.46320 > GiB > > > > [ceph: root@aqsoel11445 /]# ceph osd pool autoscale-status > POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO > TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE BULK > .mgr 448.5k4.0 4499G 0. > 1.0 1 on False > cephfs.application-tst.meta 539.8M4.0 4499G 0.0005 > 4.0 16 on False > cephfs.application-tst.data 4488M 51200M 4.0 4499G 0.0444 > 1.0 32 on False > cephfs.application-acc.meta 11430M4.0 4499G 0.0099 > 4.0 16 on False > cephfs.application-acc.data 1244G4.0 4499G 1.1062 > 1. 0.9556 1.0 128 on False > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Slow recovery on Quincy
We noticed extremely slow performance when remapping is necessary. We didn't do anything special other than assigning the correct device_class (ssd). When checking ceph status, we notice the number of objects recovering is around 17-25 (with watch -n 1 -c ceph status). How can we speed up the recovery process? There isn't any client load, because we're going to migrate to this cluster in the future, so only an occasional rsync is being executed.

[ceph: root@pwsoel12998 /]# ceph status
  cluster:
    id: da3ca2e4-ee5b-11ed-8096-0050569e8c3b
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
  services:
    mon: 5 daemons, quorum pqsoel12997,pqsoel12996,pwsoel12994,pwsoel12998,prghygpl03 (age 3h)
    mgr: pwsoel12998.ylvjcb (active, since 3h), standbys: pqsoel12997.gagpbt
    mds: 4/4 daemons up, 2 standby
    osd: 32 osds: 32 up (since 73m), 32 in (since 6d); 10 remapped pgs
         flags noscrub,nodeep-scrub
  data:
    volumes: 2/2 healthy
    pools: 5 pools, 193 pgs
    objects: 13.97M objects, 853 GiB
    usage: 3.5 TiB used, 12 TiB / 16 TiB avail
    pgs: 755092/55882956 objects misplaced (1.351%)
         183 active+clean
         10 active+remapped+backfilling
  io:
    recovery: 2.3 MiB/s, 20 objects/s

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
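What we tried so far, for reference (assuming the mclock scheduler is active; option names per the Quincy documentation):

ceph config show osd.0 osd_op_queue                          # confirm which scheduler is in use
ceph config set osd osd_mclock_profile high_recovery_ops     # prioritize recovery over client I/O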
[ceph-users] Re: Slow recovery on Quincy
Hi, The config shows "mclock_scheduler" and I already switched to the high_recovery_ops, this does increase the recovery ops, but only a little. You mention there is a fix in 17.2.6+, but we're running on 17.2.6 (this cluster is created on this version). Any more ideas? Best regards ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Slow recovery on Quincy
Just to add:
high_client_ops: around 8-13 objects per second
high_recovery_ops: around 17-25 objects per second

Both observed with "watch -n 1 -c ceph status"

Best regards ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Slow recovery on Quincy
Thanks for the input! By changing this value we indeed increased the recovery speed from 20 objects per second to 500!

Now something strange:
1. We needed to change the class for our drives manually to ssd.
2. The setting "osd_mclock_max_capacity_iops_ssd" was set to 0. With the osd bench described in the documentation, we configured the value 1 for the ssd parameter. Only nothing changed.
3. But when setting "osd_mclock_max_capacity_iops_hdd" to 1, the recovery speed also increased dramatically!

I don't understand this anymore :( Is the mclock scheduler ignoring the device class override? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
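For reference, this is roughly how the capacity override can be measured, set and checked (the value below is just an example, not our real measurement):

ceph tell osd.0 bench                                              # re-measure the OSD's IOPS capacity
ceph config set osd.0 osd_mclock_max_capacity_iops_ssd 20000       # example value
ceph config show osd.0 | grep osd_mclock_max_capacity              # check which override the OSD actually reports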
[ceph-users] Re: Slow recovery on Quincy
Did an extra test: shutting down an OSD host and forcing a recovery. Using only the iops setting I got 500 objects a second, but also using the bytes_per_usec setting, I got 1200 objects a second! Maybe there should also be an investigation into this performance issue. Best regards ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
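To be explicit about what was compared (option names as they appear in our config dump later in this thread; the values here are illustrative, not tuned recommendations):

ceph config set osd osd_mclock_max_capacity_iops_hdd 10000        # the "iops" setting
ceph config set osd osd_mclock_cost_per_byte_usec_hdd 0.1         # the "bytes_per_usec" setting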
[ceph-users] Re: Slow recovery on Quincy
I'm on 17.2.6, but the option "osd_mclock_max_sequential_bandwidth_hdd" isn't available when I try to set it via "ceph config set osd.0 osd_mclock_max_sequential_bandwidth_hdd 500Mi". I need to use large numbers for hdd, because it looks like the mclock scheduler isn't using the device class override value. Best regards, Sake From: Sridhar Seshasayee Sent: Wednesday, May 24, 2023 11:34:02 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Slow recovery on Quincy As someone in this thread noted, the cost related config options are removed in the next version (ceph-17.2.6-45.el9cp). The cost parameters may not work in all cases due to the inherent differences in the underlying device types and other external factors. With the endeavor to achieve a more hands free operation, there are significant improvements in the next version that should help resolve issues described in this thread. I am listing the significant ones below: 1. The mClock profile QoS parameters (reservation and limit) are now simplified and are specified in terms of a fraction of the OSD's IOPS capacity rather than in terms of IOPS. 2. The default mclock profile is now changed to the 'balanced' profile which gives equal priority to client and background OSD operations. 3. The allocations for different types of OSD ops within mClock profiles are changed. 4. The earlier cost related config parameters are removed. The cost is now determined by the OSD based on the underlying device characteristics, i.e., its 4 KiB random IOPS capacity and the device's max sequential bandwidth. - The random IOPS capacity is determined using 'osd bench' as before, but now based on the result, unrealistic values are not considered and reasonable defaults are used if the measurement crosses a threshold governed by *osd_mclock_iops_capacity_threshold_[hdd**|ssd]. *The default IOPS capacity may be overridden by users if not accurate, The thresholds too are configurable. The max sequential bandwidth is defined by *osd_mclock_max_sequential_bandwidth_[hdd|ssd*], and are set to reasonable defaults. Again, these may be modified if not accurate. Therefore, these changes account for inaccuracies and provide good control to the user in terms of specifying accurate OSD characteristics. 5. Degraded recoveries are given more priority than backfills. Therefore, you may observe faster degraded recovery rates compared to backfills with the 'balanced' and 'high_client_ops' profile. But the backfills will still show healthy rates when compared to the slow backfill rates mentioned in this thread. For faster recovery and backfills, the 'high_recovery_ops' profile with modified QoS parameters would help. Please see the latest upstream documentation for more details: https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/ The recommendation is to upgrade when feasible and provide your feedback, questions and suggestions. -Sridhar ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Slow recovery on Quincy
If I glance at the commits to the quincy branch, shouldn't the mentioned configurations be included in 17.2.7? The requested command output:

[ceph: root@mgrhost1 /]# ceph version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

[ceph: root@mgrhost1 /]# ceph config show-with-defaults osd.0 | grep osd_mclock
osd_mclock_cost_per_byte_usec 0.00 default
osd_mclock_cost_per_byte_usec_hdd 0.10 mon
osd_mclock_cost_per_byte_usec_ssd 0.10 mon
osd_mclock_cost_per_io_usec 0.00 default
osd_mclock_cost_per_io_usec_hdd 11400.00 default
osd_mclock_cost_per_io_usec_ssd 50.00 default
osd_mclock_force_run_benchmark_on_init false default
osd_mclock_iops_capacity_threshold_hdd 500.00 default
osd_mclock_iops_capacity_threshold_ssd 8.00 default
osd_mclock_max_capacity_iops_hdd 1.00 mon
osd_mclock_max_capacity_iops_ssd 0.00 override mon[1.00]
osd_mclock_override_recovery_settings false mon
osd_mclock_profile high_client_ops mon
osd_mclock_scheduler_anticipation_timeout 0.00 default
osd_mclock_scheduler_background_best_effort_lim 99 default
osd_mclock_scheduler_background_best_effort_res 500 default
osd_mclock_scheduler_background_best_effort_wgt 2 default
osd_mclock_scheduler_background_recovery_lim 2000 default
osd_mclock_scheduler_background_recovery_res 500 default
osd_mclock_scheduler_background_recovery_wgt 1 default
osd_mclock_scheduler_client_lim 99 default
osd_mclock_scheduler_client_res 1000 default
osd_mclock_scheduler_client_wgt 2 default
osd_mclock_skip_benchmark false default

[ceph: root@mgrhost1 /]# ceph config show osd.0 | grep osd_mclock
osd_mclock_cost_per_byte_usec_hdd 0.10 mon
osd_mclock_cost_per_byte_usec_ssd 0.10 mon
osd_mclock_max_capacity_iops_hdd 1.00 mon
osd_mclock_max_capacity_iops_ssd 0.00 override mon[1.00]
osd_mclock_override_recovery_settings false mon
osd_mclock_profile high_client_ops mon
osd_mclock_scheduler_background_best_effort_lim 99 default
osd_mclock_scheduler_background_best_effort_res 500 default
osd_mclock_scheduler_background_best_effort_wgt 2 default
osd_mclock_scheduler_background_recovery_lim 2000 default
osd_mclock_scheduler_background_recovery_res 500 default
osd_mclock_scheduler_background_recovery_wgt 1 default
osd_mclock_scheduler_client_lim 99 default
osd_mclock_scheduler_client_res 1000 default
osd_mclock_scheduler_client_wgt 2 default

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
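As a side note: as far as I understand, on this version the classic recovery knobs are gated behind an override flag when mclock is active (the flag is visible in the dump above). A sketch, with illustrative values:

ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_max_active 8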
[ceph-users] Re: Slow recovery on Quincy
Thanks, will keep an eye out for this version. Will report back to this thread about these options and the recovery time/number of objects per second for recovery. Again, thank you'll for the information and answers! ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process
Just a user opinion, but maybe add the following to the options?

For option 1:
* Clear instructions on how to remove all traces of the failed installation (if you can automate it, you can write a manual), or provide instructions to start a cleanup script.
* Don't allow another deployment with cephadm if there's a failed deployment, only if everything is cleaned up.

For option 2:
* If an installation failed and got completely removed, don't allow another run unless the user sets an override (or removes whatever triggers the check for failed installations). This prevents a user from ending up in an endless loop of trying to deploy with cephadm. Inform the user about the last failed deployment, and show the available options for a retry as well as the option to keep the deployment files to troubleshoot the issue.
* If the deployment failed (or got interrupted) and the user wanted to keep the failed deployment, provide, just like option 1, clear instructions on how to clean it up.

With the above additions, I would prefer option 1, because there's almost always a reason a deployment fails and I would like to investigate directly why it happened.

Best regards, Sake ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
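To illustrate what I mean for option 1: today the cleanup after a failed bootstrap boils down to something like the command below (the fsid is a placeholder, and --zap-osds only if your cephadm version supports it):

cephadm rm-cluster --fsid <fsid> --force --zap-osds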
[ceph-users] Re: [Ceph | Quency ]The scheduled snapshots are not getting created till we create a manual backup.
Hi! I noticed the same that the snapshot scheduler seemed to do nothing , but after a manager fail over the creation of snapshots started to work (including the retention rules).. Best regards, Sake From: Lokendra Rathour Sent: Monday, May 29, 2023 10:11:54 AM To: ceph-users ; Ceph Users Subject: [ceph-users] [Ceph | Quency ]The scheduled snapshots are not getting created till we create a manual backup. Hi Team, *Problem:* Create scheduled snapshots of the ceph subvolume. *Expected Result:* The scheduled snapshots should be created at the given scheduled time. *Actual Result:* The scheduled snapshots are not getting created till we create a manual backup. *Description:* *Ceph version: 17(quincy)* OS: Centos/Almalinux The scheduled snapshot creation is not working and we were only able to see the following logs in the file "ceph-mgr.storagenode3.log": *2023-05-29T04:59:35.101+ 7f4cd3ad8700 0 [snap_schedule INFO mgr_util] scanning for idle connections..* *2023-05-29T04:59:35.101+ 7f4cd3ad8700 0 [snap_schedule DEBUG mgr_util] fs_name (cephfs) connections ([])* *2023-05-29T04:59:35.101+ 7f4cd3ad8700 0 [snap_schedule INFO mgr_util] cleaning up connections: [* The command which we were executing to add the snapshot schedule: *ceph fs snap-schedule add /volumes// * *eg.* *ceph fs snap-schedule add /volumes/xyz/test_restore_53 1h 2023-05-26T11:05:00* We can make sure that the schedule has been created using the following commands: *#ceph fs snap-schedule list / --recursive=true* *#ceph fs snap-schedule status /volumes/xyz/test_restore_53* Even though we created the snapshot schedule, snapshots were not getting created. We then tried creating a manual snapshot for one of the sub-volumes using the following command: *#ceph fs subvolume snapshot create cephfs --group_name * *eg. ceph fs subvolume snapshot create cephfs test_restore_53 snapshot-1 --group_name xyz* To check the snapshots created we can use the following command: *ceph fs subvolume snapshot ls cephfs * *eg. ceph fs subvolume snapshot ls cephfs test_restore_53 snapshot-1 xyz* To delete the manually created snapshot: *ceph fs subvolume snapshot rm cephfs * *eg. ceph fs subvolume snapshot rm cephfs test_restore_53 snapshot-1 xyz* To our surprise, the scheduled snapshots started working. We also applied the retention policy and seems to be working fine. We re-tested this understanding for another subvolume. And the scheduled snapshots only started once we triggered a manual snapshot. Could you please help us out with this? -- ~ Lokendra skype: lokendrarathour ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
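For anyone hitting the same: what got things going here was simply failing over the active manager and then checking the schedule again (the path is a placeholder for your own subvolume):

ceph mgr fail
ceph fs snap-schedule status /volumes/<group>/<subvolume>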
[ceph-users] Cephadm fails to deploy loki with promtail correctly
I'm not sure if it's a bug in cephadm, but it looks like it. I've got Loki deployed on one machine and Promtail deployed on all machines. After creating a login, I can only view the logs of the host on which Loki is running. When inspecting the Promtail configuration, the configured URL for Loki is set to http://host.containers.internal:3100. Shouldn't this be configured by cephadm to point at the Loki host? This looks a lot like the issues with incorrectly set Grafana or Prometheus URLs; bug 57018 was created for that. Should I create another bug report? And does someone know a workaround to set the correct URL for the time being? Best regards, Sake ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] MDS cache is too large and crashes
At 01:27 this morning I received the first email about MDS cache is too large (mailing happens every 15 minutes if something happens). Looking into it, it was again a standby-replay host which stops working. At 01:00 a few rsync processes start in parallel on a client machine. This copies data from a NFS share to Cephfs share to sync the latest changes. (we want to switch to Cephfs in the near future). This crashing of the standby-replay mds happend a couple times now, so I think it would be good to get some help. Where should I look next? Some cephfs information -- # ceph fs status atlassian-opl - 8 clients = RANK STATE MDSACTIVITY DNS INOS DIRS CAPS 0active atlassian-opl.mds5.zsxfep Reqs:0 /s 7830 7803 635 3706 0-s standby-replay atlassian-opl.mds6.svvuii Evts:0 /s 3139 1924 461 0 POOL TYPE USED AVAIL cephfs.atlassian-opl.meta metadata 2186M 1161G cephfs.atlassian-opl.datadata23.0G 1161G atlassian-prod - 12 clients == RANK STATE MDSACTIVITY DNS INOS DIRS CAPS 0active atlassian-prod.mds1.msydxf Reqs:0 /s 2703k 2703k 905k 1585 1active atlassian-prod.mds2.oappgu Reqs:0 /s 961k 961k 317k 622 2active atlassian-prod.mds3.yvkjsi Reqs:0 /s 2083k 2083k 670k 443 0-s standby-replay atlassian-prod.mds4.qlvypn Evts:0 /s 352k 352k 102k 0 1-s standby-replay atlassian-prod.mds5.egsdfl Evts:0 /s 873k 873k 277k 0 2-s standby-replay atlassian-prod.mds6.ghonso Evts:0 /s 2317k 2316k 679k 0 POOL TYPE USED AVAIL cephfs.atlassian-prod.meta metadata 58.8G 1161G cephfs.atlassian-prod.datadata5492G 1161G MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable) When looking at the log on the MDS server, I've got the following: 2023-07-21T01:21:01.942+ 7f668a5e0700 -1 received signal: Hangup from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0 2023-07-21T01:23:13.856+ 7f6688ddd700 1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5671 from mon.1 2023-07-21T01:23:18.369+ 7f6688ddd700 1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5672 from mon.1 2023-07-21T01:23:31.719+ 7f6688ddd700 1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5673 from mon.1 2023-07-21T01:23:35.769+ 7f6688ddd700 1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5674 from mon.1 2023-07-21T01:28:23.764+ 7f6688ddd700 1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5675 from mon.1 2023-07-21T01:29:13.657+ 7f6688ddd700 1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5676 from mon.1 2023-07-21T01:33:43.886+ 7f6688ddd700 1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5677 from mon.1 (and another 20 lines about updating MDS map) Alert mailings: Mail at 01:27 -- HEALTH_WARN --- New --- [WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (13GB/9GB); 0 inodes in use by clients, 0 stray files === Full health status === [WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (13GB/9GB); 0 inodes in use by clients, 0 stray files Mail at 03:27 -- HEALTH_OK --- Cleared --- [WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (14GB/9GB); 0 inodes in use by clients, 0 stray files === Full health status === Mail at 04:12 -- HEALTH_WARN --- New --- [WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large 
(15GB/9GB); 0 inodes in use by clients, 0 stray files === Full health status === [WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (15GB/9GB); 0 inodes in use by clients, 0 stray files Best regards, Sake ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
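As a stopgap we could raise the cache limit, though that only masks the growth on the standby-replay daemon. A sketch (the value is an example; our current limit appears to be around 9 GB, going by the warning above):

ceph config set mds mds_cache_memory_limit 17179869184   # 16 GiB, example value
ceph config get mds mds_cache_memory_limit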
[ceph-users] Re: MDS cache is too large and crashes
Thank you Patrick for responding and fixing the issue! Good to know the issue is known and being worked on :-) > Op 21-07-2023 15:59 CEST schreef Patrick Donnelly : > > > Hello Sake, > > On Fri, Jul 21, 2023 at 3:43 AM Sake Ceph wrote: > > > > At 01:27 this morning I received the first email about MDS cache is too > > large (mailing happens every 15 minutes if something happens). Looking into > > it, it was again a standby-replay host which stops working. > > > > At 01:00 a few rsync processes start in parallel on a client machine. This > > copies data from a NFS share to Cephfs share to sync the latest changes. > > (we want to switch to Cephfs in the near future). > > > > This crashing of the standby-replay mds happend a couple times now, so I > > think it would be good to get some help. Where should I look next? > > It's this issue: https://tracker.ceph.com/issues/48673 > > Sorry I'm still evaluating the fix for it before merging. Hope to be > done with it soon. > > -- > Patrick Donnelly, Ph.D. > He / Him / His > Red Hat Partner Engineer > IBM, Inc. > GPG: 19F28A586F808C2402351B93C3301A3E258DD79D > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] MDS and stretched clusters
Hi all

We successfully deployed a stretched cluster and everything is working fine. But is it possible to assign the active MDS services to one DC and the standby-replay ones to the other?

We're running 18.2.4, deployed via cephadm, using 4 MDS servers with 2 active MDS on pinned ranks and 2 in standby-replay mode. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS and stretched clusters
I hope someone from the development team can shed some light on this. I'll search the tracker to see if someone else has already made a request about this. > Op 29-10-2024 16:02 CET schreef Frédéric Nass > : > > > Hi, > > I'm not aware of any service settings that would allow that. > > You'll have to monitor each MDS state and restart any non-local active MDSs > to reverse roles. > > Regards, > Frédéric. > > - Le 29 Oct 24, à 14:06, Sake Ceph c...@paulusma.eu a écrit : > > > Hi all > > We deployed successfully a stretched cluster and all is working fine. But > > is it > > possible to assign the active MDS services in one DC and the standby-replay > > in > > the other? > > > > We're running 18.2.4, deployed via cephadm. Using 4 MDS servers with 2 > > active > > MDS on pinnend ranks and 2 in standby-replay mode. > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS and stretched clusters
We're looking for the multiple mds daemons to be active in zone A and standby(-replay) in zone B. This scenario would also benefit people who have more powerfull hardware in zone A than zone B. Kind regards, Sake > Op 31-10-2024 15:50 CET schreef Adam King : > > > Just noticed this thread. A couple questions. Is what we want to have MDS > daemons in say zone A and zone B, but the ones in zone A are prioritized to > be active and ones in zone B remain as standby unless absolutely necessary > (all the ones in zone A are down) or is it that we want to have some subset > of a pool of hosts in zone A and zone B have mds daemons? If it's the > former, cephadm doesn't do it. The followup question in that case would be > if there is some way to tell the mds daemons to prioritize certain ones to > be active over others? If there is, I didn't know about it, but I assume > we'd need that functionality to get that case to work. > > On Tue, Oct 29, 2024 at 5:34 PM Gregory Farnum wrote: > > > No, unfortunately this needs to be done at a higher level and is not > > included in Ceph right now. Rook may be able to do this, but I don't think > > cephadm does. > > Adam, is there some way to finagle this with pod placement rules (ie, > > tagging nodes as mds and mds-standby, and then assigning special mds config > > info to corresponding pods)? > > -Greg > > > > On Tue, Oct 29, 2024 at 12:46 PM Sake Ceph wrote: > > > >> I hope someone of the development team can share some light on this. Will > >> search the tracker if some else made a request about this. > >> > >> > Op 29-10-2024 16:02 CET schreef Frédéric Nass < > >> frederic.n...@univ-lorraine.fr>: > >> > > >> > > >> > Hi, > >> > > >> > I'm not aware of any service settings that would allow that. > >> > > >> > You'll have to monitor each MDS state and restart any non-local active > >> MDSs to reverse roles. > >> > > >> > Regards, > >> > Frédéric. > >> > > >> > - Le 29 Oct 24, à 14:06, Sake Ceph c...@paulusma.eu a écrit : > >> > > >> > > Hi all > >> > > We deployed successfully a stretched cluster and all is working fine. > >> But is it > >> > > possible to assign the active MDS services in one DC and the > >> standby-replay in > >> > > the other? > >> > > > >> > > We're running 18.2.4, deployed via cephadm. Using 4 MDS servers with > >> 2 active > >> > > MDS on pinnend ranks and 2 in standby-replay mode. > >> > > ___ > >> > > ceph-users mailing list -- ceph-users@ceph.io > >> > > To unsubscribe send an email to ceph-users-le...@ceph.io > >> > ___ > >> > ceph-users mailing list -- ceph-users@ceph.io > >> > To unsubscribe send an email to ceph-users-le...@ceph.io > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > >> > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph OSD perf metrics missing
I stumbled on this problem earlier, port 9926 isn't being opened. See also thread "Grafana dashboards is missing data". A tracker is already opened to fix the issue: https://tracker.ceph.com/issues/67975 > Op 25-11-2024 13:44 CET schreef Kilian Ries : > > > Prometheus metrics seem to be broken, too: > > > ceph_osd_op_r_latency_sum > > ceph_osd_op_w_latency_sum > > > Both of them for example are not reported by the ceph mgr metrics exporter: > > > curl http://192.168.XXX.XXX:9283/metrics |grep ceph_osd_ > > > > I get some merics like "ceph_osd_commit_latency_ms" or > "ceph_osd_apply_latency_ms" but none of the "op_r / op_w" metrics. Do the > lable names have switched and my grafana dashboard is outdated? Or are they > missing at the exporter level? > > > Thanks > > > > > > Von: Kilian Ries > Gesendet: Montag, 25. November 2024 13:37:16 > An: ceph-users@ceph.io > Betreff: AW: Ceph OSD perf metrics missing > > > Any ideas? Still facing the problem ... > > > Von: Kilian Ries > Gesendet: Mittwoch, 23. Oktober 2024 13:59:06 > An: ceph-users@ceph.io > Betreff: Ceph OSD perf metrics missing > > > Hi, > > > i'm running a Ceph v18.2.4 cluster. I'm trying to build some latency > monitoring with the > > > ceph daemon osd.4 perf dump > > > cli command. On most of the OSDs i get all the metrics i need. On some OSDs i > only get zero values: > > > osd.op_latency.avgcount: 0 > > > I already tried restarting the OSD process which didn't help. I also tried to > reset the metrics via > > > ceph daemon osd.4 perf reset all > > > but that didn't help either. How can that be that some OSDs don't show values > here? How can i fix that? > > > Thanks > > Regards, > > Kilian > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
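Until the fix lands, a possible workaround is to open the ceph-exporter port by hand and verify the endpoint, assuming firewalld and the default ceph-exporter port 9926 (host and zone are placeholders, adjust as needed):

firewall-cmd --permanent --add-port=9926/tcp && firewall-cmd --reload
curl -s http://<exporter-host>:9926/metrics | grep ceph_osd_op_r_latency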