[ceph-users] MDS cache is too large and crashes

2023-07-21 Thread Sake Ceph
At 01:27 this morning I received the first email about the MDS cache being too 
large (a mail is sent every 15 minutes while something is wrong). Looking into 
it, it was again a standby-replay host that stopped working.

At 01:00 a few rsync processes start in parallel on a client machine. These 
copy data from an NFS share to a CephFS share to sync the latest changes (we 
want to switch to CephFS in the near future).
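
(Roughly this shape, a handful of rsyncs running in parallel; the paths here 
are only illustrative:)

rsync -a /mnt/nfs/projectA/ /mnt/cephfs/projectA/ &
rsync -a /mnt/nfs/projectB/ /mnt/cephfs/projectB/ &
rsync -a /mnt/nfs/projectC/ /mnt/cephfs/projectC/ &
wait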

This crashing of the standby-replay MDS has happened a couple of times now, so 
I think it would be good to get some help. Where should I look next?

Some cephfs information
--
# ceph fs status
atlassian-opl - 8 clients
=========================
RANK      STATE                MDS                ACTIVITY     DNS    INOS   DIRS   CAPS
 0        active      atlassian-opl.mds5.zsxfep  Reqs:    0 /s  7830   7803    635   3706
0-s   standby-replay  atlassian-opl.mds6.svvuii  Evts:    0 /s  3139   1924    461      0
          POOL             TYPE     USED   AVAIL
cephfs.atlassian-opl.meta  metadata  2186M  1161G
cephfs.atlassian-opl.data    data    23.0G  1161G
atlassian-prod - 12 clients
===========================
RANK      STATE                MDS                 ACTIVITY     DNS    INOS   DIRS   CAPS
 0        active      atlassian-prod.mds1.msydxf  Reqs:    0 /s  2703k  2703k   905k  1585
 1        active      atlassian-prod.mds2.oappgu  Reqs:    0 /s   961k   961k   317k   622
 2        active      atlassian-prod.mds3.yvkjsi  Reqs:    0 /s  2083k  2083k   670k   443
0-s   standby-replay  atlassian-prod.mds4.qlvypn  Evts:    0 /s   352k   352k   102k     0
1-s   standby-replay  atlassian-prod.mds5.egsdfl  Evts:    0 /s   873k   873k   277k     0
2-s   standby-replay  atlassian-prod.mds6.ghonso  Evts:    0 /s  2317k  2316k   679k     0
           POOL              TYPE     USED   AVAIL
cephfs.atlassian-prod.meta  metadata  58.8G  1161G
cephfs.atlassian-prod.data    data    5492G  1161G
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)


Looking at the log on the MDS server, I see the following:
2023-07-21T01:21:01.942+0000 7f668a5e0700 -1 received  signal: Hangup from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2023-07-21T01:23:13.856+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5671 from mon.1
2023-07-21T01:23:18.369+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5672 from mon.1
2023-07-21T01:23:31.719+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5673 from mon.1
2023-07-21T01:23:35.769+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5674 from mon.1
2023-07-21T01:28:23.764+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5675 from mon.1
2023-07-21T01:29:13.657+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5676 from mon.1
2023-07-21T01:33:43.886+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5677 from mon.1
(and another 20 lines about updating MDS map)

Alert mailings:
Mail at 01:27
--
HEALTH_WARN

--- New ---
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large 
(13GB/9GB); 0 inodes in use by clients, 0 stray files


=== Full health status ===
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large 
(13GB/9GB); 0 inodes in use by clients, 0 stray files


Mail at 03:27
--
HEALTH_OK

--- Cleared ---
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large 
(14GB/9GB); 0 inodes in use by clients, 0 stray files


=== Full health status ===


Mail at 04:12
--
HEALTH_WARN

--- New ---
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large 
(15GB/9GB); 0 inodes in use by clients, 0 stray files


=== Full health status ===
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large 
(15GB/9GB); 0 inodes in use by clients, 0 stray files


Best regards,
Sake
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS cache is too large and crashes

2023-07-21 Thread Marc
> 
> At 01:27 this morning I received the first email about the MDS cache being
> too large (a mail is sent every 15 minutes while something is wrong). Looking
> into it, it was again a standby-replay host that stopped working.
> 
> At 01:00 a few rsync processes start in parallel on a client machine.
> These copy data from an NFS share to a CephFS share to sync the latest
> changes (we want to switch to CephFS in the near future).
> 
> This crashing of the standby-replay MDS has happened a couple of times now,
> so I think it would be good to get some help. Where should I look next?
> 

What do you mean by crashing? Is the container just getting OOM-killed and 
restarted? In that case you just have to adapt your settings, no?
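
(For reference, assuming the warning refers to mds_cache_memory_limit, you can 
check the current value and, if the hosts have the memory for it, raise it; the 
16 GiB below is only an example value:)

ceph config get mds mds_cache_memory_limit
ceph config set mds mds_cache_memory_limit 17179869184    # 16 GiB, example value
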
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS stuck in rejoin

2023-07-21 Thread Xiubo Li


On 7/20/23 22:09, Frank Schilder wrote:

> Hi all,
>
> we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients
> failing to advance oldest client/flush tid". I looked at the client and there
> was nothing going on, so I rebooted it. After the client was back, the
> message was still there. To clean this up I failed the MDS. Unfortunately,
> the MDS that took over remained stuck in rejoin without doing anything. All
> that happened in the log was:


BTW, are you using the kclient or the userspace client? How long was the 
MDS stuck in the rejoin state?


This means the oldest client request has been stuck for too long on the client 
side; perhaps under heavy load too many requests were generated in a short time 
and the oldest request then stayed stuck in the MDS for too long.
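
(If it helps, the client session and any stuck requests can usually be 
inspected with the commands below; the MDS name is a placeholder:)

ceph health detail                              # shows which client is failing to advance its tid
ceph tell mds.<active-mds> session ls           # per-client session details, e.g. caps held
ceph tell mds.<active-mds> dump_ops_in_flight   # requests currently pending in the MDS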




> [root@ceph-10 ceph]# tail -f ceph-mds.ceph-10.log
> 2023-07-20T15:54:29.147+0200 7fedb9c9f700  1 mds.2.896604 rejoin_start
> 2023-07-20T15:54:29.161+0200 7fedb9c9f700  1 mds.2.896604 rejoin_joint_start
> 2023-07-20T15:55:28.005+0200 7fedb9c9f700  1 mds.ceph-10 Updating MDS map to version 896614 from mon.4
> 2023-07-20T15:56:00.278+0200 7fedb9c9f700  1 mds.ceph-10 Updating MDS map to version 896615 from mon.4
> [...]
> 2023-07-20T16:02:54.935+0200 7fedb9c9f700  1 mds.ceph-10 Updating MDS map to version 896653 from mon.4
> 2023-07-20T16:03:07.276+0200 7fedb9c9f700  1 mds.ceph-10 Updating MDS map to version 896654 from mon.4


Did you see any slow request logs in the MDS log files? And any other 
suspect logs in dmesg, if it's a kclient?
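
(For example, something along these lines on the MDS host and the client 
respectively; adjust the log path to your setup:)

grep -i "slow request" /var/log/ceph/ceph-mds.*.log
dmesg -T | grep -iE "ceph|libceph"    # only relevant for kernel clients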




> After some time I decided to give another fail a try and, this time, the
> replacement daemon went to active state really fast.
>
> If I have a message like the above, what is the clean way of getting the
> client clean again (version: 15.2.17
> (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable))?


I think your steps are correct.

Thanks

- Xiubo



> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: what is the point of listing "auth: unable to find a keyring on /etc/ceph/ceph.client nfs-ganesha

2023-07-21 Thread Dhairya Parmar
Hi Marc,

Can you confirm that the mon IP in ceph.conf is correct and public, and that
the keyring path is specified correctly?


Dhairya Parmar

Associate Software Engineer, CephFS

Red Hat Inc.

dpar...@redhat.com



On Thu, Jul 20, 2023 at 9:40 PM Marc  wrote:

>
> I need some help understanding this. I have configured nfs-ganesha for
> cephfs using something like this in ganesha.conf
>
> FSAL { Name = CEPH; User_Id = "testing.nfs"; Secret_Access_Key =
> "AAA=="; }
>
> But I constantly have these messages in the ganesha logs, 6x per user_id
>
> auth: unable to find a keyring on /etc/ceph/ceph.client.testing
>
> I thought this was a ganesha authentication order issue, but they[1] say
> it has to do with ceph. I am still on Nautilus so maybe this has been fixed
> in newer releases. I still have a hard time understanding why this is an
> issue of ceph (libraries).
>
>
> [1]
> https://github.com/nfs-ganesha/nfs-ganesha/issues/974
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS cache is too large and crashes

2023-07-21 Thread Patrick Donnelly
Hello Sake,

On Fri, Jul 21, 2023 at 3:43 AM Sake Ceph  wrote:
>
> At 01:27 this morning I received the first email about the MDS cache being too
> large (a mail is sent every 15 minutes while something is wrong). Looking into
> it, it was again a standby-replay host that stopped working.
>
> At 01:00 a few rsync processes start in parallel on a client machine. These
> copy data from an NFS share to a CephFS share to sync the latest changes (we
> want to switch to CephFS in the near future).
>
> This crashing of the standby-replay MDS has happened a couple of times now,
> so I think it would be good to get some help. Where should I look next?

It's this issue: https://tracker.ceph.com/issues/48673

Sorry, I'm still evaluating the fix for it before merging. I hope to be
done with it soon.
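
(Until that fix lands, restarting the affected standby-replay daemon at least 
resets its cache; on a cephadm-managed cluster that would be something along 
the lines of the command below, with the daemon name taken from the warning and 
verified via 'ceph orch ps':)

ceph orch daemon restart mds.atlassian-prod.mds4.qlvypn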

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: what is the point of listing "auth: unable to find a keyring on /etc/ceph/ceph.client nfs-ganesha

2023-07-21 Thread Marc
Hi Dhairya,

Yes, I have the following in ceph.conf (I only copied the lines below; there are 
more entries in these sections). I do not have a keyring path setting in ceph.conf.


public network = a.b.c.111/24

[mon]
mon host = a.b.c.111,a.b.c.112,a.b.c.113

[mon.a]
mon addr = a.b.c.111

[mon.b]
mon addr = a.b.c.112

[mon.c]
mon addr = a.b.c.113


> 
> Can you confirm if the mon ip in ceph.conf is correct and is public;
> also the keyring path is specified correctly?
> 
> 
> 
>   I need some help understanding this. I have configured nfs-ganesha
> for cephfs using something like this in ganesha.conf
> 
>   FSAL { Name = CEPH; User_Id = "testing.nfs"; Secret_Access_Key =
> "AAA=="; }
> 
> 	But I constantly have these messages in the ganesha logs, 6x per
> user_id
> 
>   auth: unable to find a keyring on /etc/ceph/ceph.client.testing
> 
>   I thought this was a ganesha authentication order issue, but
> they[1] say it has to do with ceph. I am still on Nautilus so maybe this
> has been fixed in newer releases. I still have a hard time understanding
> why this is an issue of ceph (libraries).
> 
> 
>   [1]
>   https://github.com/nfs-ganesha/nfs-ganesha/issues/974
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD tries (and fails) to scrub the same PGs over and over

2023-07-21 Thread Vladimir Brik

> what's the cluster status? Is there recovery or backfilling
> going on?
No. Everything is good except this PG is not getting scrubbed.
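
(For reference, the scrub state and schedule for the affected PG can be 
inspected with, e.g.:)

ceph pg dump pgs | grep ^24.3ea        # last scrub / deep-scrub stamps
ceph pg 24.3ea query | grep -i scrub   # scrubber state reported by the primary OSD
ceph config get osd osd_max_scrubs     # concurrent-scrub limit that may be holding it back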

Vlad

On 7/21/23 01:41, Eugen Block wrote:

> Hi,
>
> what's the cluster status? Is there recovery or backfilling
> going on?
>
> Zitat von Vladimir Brik:
>
> > I have a PG that hasn't been scrubbed in over a month and
> > not deep-scrubbed in over two months.
> >
> > I tried forcing with `ceph pg (deep-)scrub` but with no
> > success.
> >
> > Looking at the logs of that PG's primary OSD it looks like
> > every once in a while it attempts (and apparently fails)
> > to scrub that PG, along with two others, over and over.
> > For example:
> >
> > 2023-07-19T16:26:07.082 ... 24.3ea scrub starts
> > 2023-07-19T16:26:10.284 ... 27.aae scrub starts
> > 2023-07-19T16:26:11.169 ... 24.aa scrub starts
> > 2023-07-19T16:26:12.153 ... 24.3ea scrub starts
> > 2023-07-19T16:26:13.346 ... 27.aae scrub starts
> > 2023-07-19T16:26:16.239 ... 24.aa scrub starts
> > ...
> >
> > Lines like that are repeated throughout the log file.
> >
> > Has anyone seen something similar? How can I debug this?
> >
> > I am running 17.2.5
> >
> > Vlad

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: what is the point of listing "auth: unable to find a keyring on /etc/ceph/ceph.client nfs-ganesha

2023-07-21 Thread Dhairya Parmar
Okay, then I'd suggest adding a keyring entry to the client section in ceph.conf;
it is as simple as
keyring = /keyring

I hope the client (that the logs complain about) is in the keyring file. Do let me
know if that works for you; if not, some logs would be good to have to
diagnose further.
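
(A minimal sketch of what that could look like, assuming the user id 
"testing.nfs" from the ganesha config; the exact keyring path is up to you:)

[client.testing.nfs]
        keyring = /etc/ceph/ceph.client.testing.nfs.keyring

The keyring file then has to contain the key for client.testing.nfs.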

On Fri, Jul 21, 2023 at 7:44 PM Marc  wrote:

> Hi Dhairya,
>
> Yes I have in ceph.conf (only copied the lines below, there are more in
> these sections). I do not have a keyring path setting in ceph.conf
>
>
> public network = a.b.c.111/24
>
> [mon]
> mon host = a.b.c.111,a.b.c.112,a.b.c.113
>
> [mon.a]
> mon addr = a.b.c.111
>
> [mon.b]
> mon addr = a.b.c.112
>
> [mon.c]
> mon addr = a.b.c.113
>
>
> >
> > Can you confirm if the mon ip in ceph.conf is correct and is public;
> > also the keyring path is specified correctly?
> >
> >
> >
> >   I need some help understanding this. I have configured nfs-ganesha
> > for cephfs using something like this in ganesha.conf
> >
> >   FSAL { Name = CEPH; User_Id = "testing.nfs"; Secret_Access_Key =
> > "AAA=="; }
> >
> > 	But I constantly have these messages in the ganesha logs, 6x per
> > user_id
> >
> >   auth: unable to find a keyring on /etc/ceph/ceph.client.testing
> >
> >   I thought this was a ganesha authentication order issue, but
> > they[1] say it has to do with ceph. I am still on Nautilus so maybe this
> > has been fixed in newer releases. I still have a hard time understanding
> > why this is an issue of ceph (libraries).
> >
> >
> >   [1]
> >   https://github.com/nfs-ganesha/nfs-ganesha/issues/974
> >
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs - unable to create new subvolume

2023-07-21 Thread Patrick Donnelly
Hello karon,

On Fri, Jun 23, 2023 at 4:55 AM karon karon  wrote:
>
> Hello,
>
> I recently started using CephFS, version 17.2.6.
> I have a pool named "data" and an fs "kube".
> It was working fine until a few days ago; now I can no longer create a new
> subvolume, it gives me the following error:
>
> Error EINVAL: invalid value specified for ceph.dir.subvolume

We have heard other reports of this. We don't know how, but it seems
something has erroneously set the subvolume flag on parent
directories. Please try:

setfattr -n ceph.dir.subvolume -v 0 /volumes/csi

Then check if it works. If still not:

setfattr -n ceph.dir.subvolume -v 0 /volumes/

try again, if still not:

setfattr -n ceph.dir.subvolume -v 0 /

Please let us know which directory fixed the issue for you.
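
If it helps, you can check the current value of the flag on each directory with
getfattr (assuming the attr tools are installed), e.g.:

getfattr -n ceph.dir.subvolume /volumes/csi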

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] July Ceph Science Virtual User Group

2023-07-21 Thread Kevin Hrpcek

Hey all,

We will be having a Ceph science/research/big cluster call on Wednesday 
July 26th. If anyone wants to discuss something specific they can add it 
to the pad linked below. If you have questions or comments you can 
contact me.


This is an informal open call of community members, mostly from 
hpc/htc/research environments, where we discuss whatever is on our minds 
regarding Ceph: updates, outages, features, maintenance, etc. There is 
no set presenter, but I do attempt to keep the conversation lively.


NOTE: we have changed to using Jitsi for the meeting; we are no longer using 
the BlueJeans meeting links. The Ceph calendar event does not yet 
reflect this and has the wrong day as well.


Pad URL:
https://pad.ceph.com/p/Ceph_Science_User_Group_20230726

Ceph calendar event details:
July 26th, 2023
14:00 UTC
4pm Central European
9am Central US

Description: Main pad for discussions: 
https://pad.ceph.com/p/Ceph_Science_User_Group_Index

Meetings will be recorded and posted to the Ceph Youtube channel.

To join the meeting on a computer or mobile phone: 
https://meet.jit.si/ceph-science-wg



Kevin

--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS/TROPICS
Space Science & Engineering Center
University of Wisconsin-Madison
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph quincy repo update to debian bookworm...?

2023-07-21 Thread Luke Hall
Ditto this query. I can't recall if there's a separate list for Debian 
packaging of Ceph or not.


On 22/06/2023 15:25, Christian Peters wrote:

> Hi ceph users/maintainers,
>
> I installed ceph quincy on debian bullseye as a ceph client and now want
> to update to bookworm.
>
> I see that at the moment only bullseye is supported:
>
> https://download.ceph.com/debian-quincy/dists/bullseye/
>
> Will there be an update of
>
> deb https://download.ceph.com/debian-quincy/ bullseye main
>
> to
>
> deb https://download.ceph.com/debian-quincy/ bookworm main
>
> in the near future!?
>
> Regards,
>
> Christian




--
All postal correspondence to:
The Positive Internet Company, 24 Ganton Street, London. W1F 7QY

*Follow us on Twitter* @posipeople

The Positive Internet Company Limited is registered in England and Wales.
Registered company number: 3673639. VAT no: 726 7072 28.
Registered office: Northside House, Mount Pleasant, Barnet, Herts, EN4 9EE.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io