[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing

2024-11-21 Thread Frank Schilder
Hi Alexander,

it might be that you are expecting too much from Ceph. The design of the 
filesystem was not some grand plan with every detail worked out. It was more 
of a classic evolutionary approach: something working was screwed on top of 
RADOS, and things evolved from there.

It is possible that the code and pin-separation is not as clean as one would 
imagine. Here is what I observe before and after pinning everything explicitly:

- before pinning:
  * high MDS load for no apparent reason - the balancer was just going in 
circles
  * stopping an MDS would basically bring all IO down

- after pinning:
  * low MDS load, better user performance, much faster restarts
  * stopping an MDS does not kill all IO immediately - some IO continues, 
but eventually every client gets stuck
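
For completeness, the explicit pinning itself is just the usual extended
attribute set on a mounted filesystem (paths below are only examples):

# pin dir1 and everything below it to rank 0, dir2 to rank 1
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/dir1
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/dir2
# verify a pin
getfattr -n ceph.dir.pin /mnt/cephfs/dir1
# -v -1 removes an explicit pin again
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/dir1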

There is apparently still communication between all ranks about all clients, and 
it is a bit annoying that some of this communication is blocking. I'm not sure if 
it has to be blocking or if the requests to the down rank could be made 
asynchronous. My impression is that Ceph internals are rather bad at making stuff 
asynchronous. So if something in the MDS cluster is not healthy, sooner or later 
IO will stop, waiting for some blocking request to the unhealthy MDS. There seems 
to be no such thing as IO on the other, healthy MDSes continuing as usual.

Specifically rank 0 is critical.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Thursday, November 21, 2024 9:36 AM
To: Александр Руденко
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: [CephFS] Completely exclude some MDS rank from 
directory processing

I'm not aware of any hard limit for the number of Filesystems, but
that doesn't really mean very much. IIRC, last week during a Clyso
talk at Eventbrite I heard someone say that they deployed around 200
Filesystems or so, I don't remember if it was a production environment
or just a lab environment. I assume that you would probably be limited
by the number of OSDs/PGs rather than by the number of Filesystems,
200 Filesystems require at least 400 pools. But maybe someone else has
more experience in scaling CephFS that way. What we did was to scale
the number of active MDS daemons for one CephFS. I believe in the end
the customer had 48 MDS daemons on three MDS servers, 16 of them were
active with directory pinning, at that time they had 16 standby-replay
and 16 standby daemons. But it turned out that standby-replay didn't
help their use case, so we disabled standby-replay.

Can you show the entire 'ceph fs status' output? And maybe also 'ceph
fs dump'?

Zitat von Александр Руденко :

>>
>> Just for testing purposes, have you tried pinning rank 1 to some other
>> directory? Does it still break the CephFS if you stop it?
>
>
> Yes, nothing changed.
>
> It's no problem that the FS hangs when one of the ranks goes down; we will have
> standby-replay for all ranks. What I don't like is that a rank which is not pinned to
> some dir still handles some IO of this dir, or of clients which work with this
> dir.
> I mean that I can't robustly and fully separate client IO by ranks.
>
> Would it be an option to rather use multiple Filesystems instead of
>> multi-active for one CephFS?
>
>
> Yes, it's an option. But it is much more complicated in our case. Btw, do
> you know how many different FS can be created in one cluster? Maybe you
> know some potential problems with 100-200 FSs in one cluster?
>
> ср, 20 нояб. 2024 г. в 17:50, Eugen Block :
>
>> Ah, I misunderstood, I thought you wanted an even distribution across
>> both ranks.
>> Just for testing purposes, have you tried pinning rank 1 to some other
>> directory? Does it still break the CephFS if you stop it? I'm not sure
>> if you can prevent rank 1 from participating, I haven't looked into
>> all the configs in quite a while. Would it be an option to rather use
>> multiple Filesystems instead of multi-active for one CephFS?
>>
>> Zitat von Александр Руденко :
>>
>> > No it's not a typo. It's misleading example)
>> >
>> > dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work
>> without
>> > rank 1.
>> > rank 1 is used for something when I work with this dirs.
>> >
>> > ceph 16.2.13, metadata balancer and policy based balancing not used.
>> >
>> > ср, 20 нояб. 2024 г. в 16:33, Eugen Block :
>> >
>> >> Hi,
>> >>
>> >> > After pinning:
>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>> >>
>> >> is this a typo? If not, you did pin both directories to the same rank.
>> >>
>> >> Zitat von Александр Руденко :
>> >>
>> >> > Hi,
>> >> >
>> >> > I try to distribute all top level dirs in CephFS by different MDS
>> ranks.
>> >> > I have two active MDS with rank *0* and *1 *and I have 2 top dirs like
>> >> > */dir1* and* /dir2*.
>> >> >
>> >> > After pinning:
>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>> >> > setfattr -n c

[ceph-users] Re: MDS blocklist/evict clients during network maintenance

2024-11-21 Thread Eugen Block

Hi Dan,

thanks for the link, I've been reading it over and over again but  
still haven't come to a conclusion yet.
IIRC, the maintenance windows are one hour long, currently every week.  
But it's not entirely clear if the maintenance will even have an  
impact, because apparently, last time nobody complained. But there  
have been interruptions which caused stale clients in the last weeks,  
so it's difficult to predict.
They mainly use rbd and CephFS for k8s clusters, but so far I haven't  
heard about rbd issues during these maintenance windows.
They have grafana showing a drop of many MDS sessions when the network  
is interrupted, I think from around 130 active sessions to around 30.  
So not all sessions were dropped. After the maintenance, they failed  
the MDS and the number of sessions was restored. Since they don't have  
access to the k8s clusters themselves, they can't do much on that  
side. We're still wondering if an MDS failover is really necessary or  
if anything on the client side could be done. But I only have very  
limited details on this. The MDS log (I don't have a copy) shows that  
the session drops are caused by the client evictions.
Do you think it could make sense to disable client  
eviction/blocklisting only during this maintenance window? Or could that  
be dangerous, because we can't predict which clients will actually be  
interrupted and how k8s will handle the returning clients if they  
aren't evicted?


Thanks
Eugen

Zitat von Dan van der Ster :


Hi Eugen,

Disabling blocklisting on eviction is a pretty standard config. In my
experience it allows clients to resume their sessions cleanly without needing a
remount.

There's docs about this here:
https://docs.ceph.com/en/latest/cephfs/eviction/#advanced-configuring-blocklisting

I don't have a good feeling if this will be useful for your network
intervention though... What are you trying to achieve? How long will
clients be unreachable?

Cheers, Dan


--
Dan van der Ster
CTO@CLYSO & CEC Member


On Thu, Nov 21, 2024, 10:15 Eugen Block  wrote:


Hi,

can anyone share some experience with these two configs?

ceph config get mds mds_session_blocklist_on_timeout
true
ceph config get mds mds_session_blocklist_on_evict
true

If there's some network maintenance going on and the client connection
is interrupted, could it help to disable evicting and blocklisting MDS
clients? And what risks should we be aware of if we tried that? We're
not entirely sure yet if this could be a reasonable approach, but
we're trying to figure out how to make network maintenance less
painful for clients.
I'm also looking at some other possible configs, but let's start with
these two first.

Any comments would be appreciated!

Thanks!
Eugen
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing

2024-11-21 Thread Eugen Block
I'm not aware of any hard limit for the number of Filesystems, but  
that doesn't really mean very much. IIRC, last week during a Clyso  
talk at Eventbrite I heard someone say that they deployed around 200  
Filesystems or so, I don't remember if it was a production environment  
or just a lab environment. I assume that you would probably be limited  
by the number of OSDs/PGs rather than by the number of Filesystems,  
200 Filesystems require at least 400 pools. But maybe someone else has  
more experience in scaling CephFS that way. What we did was to scale  
the number of active MDS daemons for one CephFS. I believe in the end  
the customer had 48 MDS daemons on three MDS servers, 16 of them were  
active with directory pinning, at that time they had 16 standby-replay  
and 16 standby daemons. But it turned out that standby-replay didn't  
help their use case, so we disabled standby-replay.
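
As a side note on the pool count: each filesystem needs its own metadata
and data pool. A minimal sketch of creating an additional one (the name
fs2 is just an example):

# the volumes module creates both pools for you
ceph fs volume create fs2

# or the manual route
ceph osd pool create fs2_meta
ceph osd pool create fs2_data
ceph fs new fs2 fs2_meta fs2_data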


Can you show the entire 'ceph fs status' output? And maybe also 'ceph  
fs dump'?


Zitat von Александр Руденко :



Just for testing purposes, have you tried pinning rank 1 to some other
directory? Does it still break the CephFS if you stop it?



Yes, nothing changed.

It's no problem that the FS hangs when one of the ranks goes down; we will have
standby-replay for all ranks. What I don't like is that a rank which is not pinned to
some dir still handles some IO of this dir, or of clients which work with this
dir.
I mean that I can't robustly and fully separate client IO by ranks.

Would it be an option to rather use multiple Filesystems instead of

multi-active for one CephFS?



Yes, it's an option. But it is much more complicated in our case. Btw, do
you know how many different FS can be created in one cluster? Maybe you
know some potential problems with 100-200 FSs in one cluster?

ср, 20 нояб. 2024 г. в 17:50, Eugen Block :


Ah, I misunderstood, I thought you wanted an even distribution across
both ranks.
Just for testing purposes, have you tried pinning rank 1 to some other
directory? Does it still break the CephFS if you stop it? I'm not sure
if you can prevent rank 1 from participating, I haven't looked into
all the configs in quite a while. Would it be an option to rather use
multiple Filesystems instead of multi-active for one CephFS?

Zitat von Александр Руденко :

> No it's not a typo. It's misleading example)
>
> dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work
without
> rank 1.
> rank 1 is used for something when I work with this dirs.
>
> ceph 16.2.13, metadata balancer and policy based balancing not used.
>
> ср, 20 нояб. 2024 г. в 16:33, Eugen Block :
>
>> Hi,
>>
>> > After pinning:
>> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>>
>> is this a typo? If not, you did pin both directories to the same rank.
>>
>> Zitat von Александр Руденко :
>>
>> > Hi,
>> >
>> > I try to distribute all top level dirs in CephFS by different MDS
ranks.
>> > I have two active MDS with rank *0* and *1 *and I have 2 top dirs like
>> > */dir1* and* /dir2*.
>> >
>> > After pinning:
>> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>> >
>> > I can see next INOS and DNS distribution:
>> > RANK  STATE   MDS ACTIVITY DNSINOS   DIRS   CAPS
>> >  0active   c   Reqs:127 /s  12.6k  12.5k   333505
>> >  1active   b   Reqs:11 /s21 24 19  1
>> >
>> > When I write to dir1 I can see a small amount on Reqs: in rank 1.
>> >
>> > Events in journal of MDS with rank 1:
>> > cephfs-journal-tool --rank=fs1:1 event get list
>> >
>> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE:
(scatter_writebehind)
>> >   A2037D53
>> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION:  ()
>> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE:  (lock inest
accounted
>> > scatter stat update)
>> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION:  ()
>> > 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION:  ()
>> > 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION:  ()
>> > 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION:  ()
>> > 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION:  ()
>> > 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION:  ()
>> > 2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT:  ()
>> >   di1/A2037D53
>> > 2024-11-20T12:30:46.909621+0300 0xc5d96f SESSION:  ()
>> > 2024-11-20T12:30:47.908050+0300 0xc5db13 SESSION:  ()
>> > 2024-11-20T12:32:46.907649+0300 0xc5dba0 SESSION:  ()
>> > 2024-11-20T12:32:47.905962+0300 0xc5dd44 SESSION:  ()
>> > 2024-11-20T12:34:44.349348+0300 0xc5ddd1 SESSIONS:  ()
>> >
>> > But the main problem, when I stop MDS rank 1 (without any kind of
>> standby)
>> > - FS hangs for all actions.
>> > Is this correct? Is it possible to completely exclude rank 1 from
>> > processing dir1 and not stop io when rank 1 goes down?
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an ema

[ceph-users] Re: Crush rule examples

2024-11-21 Thread Andre Tann

Hi Frank,

thanks a lot for the hint, and I have read the documentation about this. 
What is not clear to me is this:


== snip
The first category of these failures that we will discuss involves 
inconsistent networks -- if there is a netsplit (a disconnection between 
two servers that splits the network into two pieces), Ceph might be 
unable to mark OSDs down and remove them from the acting PG sets.

== snip

Why is Ceph not able to mark OSDs down, and why is it unclear whether or 
not it is able to do so ("might")?


Cheers
Andre


Am 20.11.24 um 12:23 schrieb Frank Schilder:

Hi Andre,

I think what you really want to look at is stretch mode. There have been long 
discussions on this list about why a crush rule with rep 4 and 2 copies per DC will 
not handle a DC failure as expected. Stretch mode will make sure writes happen 
in a way that prevents split-brain scenarios.

Hand-crafted crush rules for this purpose require 3 or more DCs.
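
For reference, a rough sketch of the stretch mode setup from the Ceph docs
(mon names, rule name and bucket names are only examples, not from this
thread):

ceph mon set election_strategy connectivity
ceph mon set_location a datacenter=dc1
ceph mon set_location b datacenter=dc2
ceph mon set_location e datacenter=dc3   # the tiebreaker site
ceph mon enable_stretch_mode e stretch_rule datacenter

where stretch_rule is a previously created CRUSH rule that places two
copies per datacenter.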

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Janne Johansson 
Sent: Wednesday, November 20, 2024 11:30 AM
To: Andre Tann
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Crush rule examples


Sorry, sent too early. So here we go again:
   My setup looks like this:

DC1
node01
node02
node03
node04
node05
DC2
node06
node07
node08
node09
node10

I want a replicated pool with size=4. Two copies should go in each DC,
and then no two copies on a single node.
How can I describe this in a crush rule?


This post seems to show that, except they have their root named "nvme"
and they split on rack and not dc, but that is not important.

https://unix.stackexchange.com/questions/781250/ceph-crush-rules-explanation-for-multiroom-racks-setup

with the answer at the bottom:

for example this should work as well, to have 4 replicas in total,
distributed across two racks:
step take default class nvme
step choose firstn 2 type rack
step chooseleaf firstn 2 type host
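
Adapted to the DC layout above, the full rule might look roughly like this
(rule name and id are examples):

rule replicated_two_dc {
    id 5
    type replicated
    step take default
    step choose firstn 2 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}

combined with size=4 on the pool, e.g.
ceph osd pool set <pool> crush_rule replicated_two_dc
ceph osd pool set <pool> size 4
but, as noted above, this alone does not give stretch-mode behaviour on a
DC failure.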

--
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Andre Tann
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Crush rule examples

2024-11-21 Thread Janne Johansson
Den tors 21 nov. 2024 kl 09:45 skrev Andre Tann :
> Hi Frank,
> thanks a lot for the hint, and I have read the documentation about this.
> What is not clear to me is this:
>
> == snip
> The first category of these failures that we will discuss involves
> inconsistent networks -- if there is a netsplit (a disconnection between
> two servers that splits the network into two pieces), Ceph might be
> unable to mark OSDs down and remove them from the acting PG sets.
> == snip
>
> Why is Ceph not able to mark OSDs down, and why is it unclear whether or
> not it is able to do so ("might")?

I think designs with 2 DCs usually have one or two mons per DC, and then
a third/fifth mon in a (small) third site so it can arbitrate which
side is up and which
isn't. OSDs report to each other but also to mons about their existence.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Crush rule examples

2024-11-21 Thread Andre Tann

Am 21.11.24 um 10:56 schrieb Janne Johansson:


== snip
The first category of these failures that we will discuss involves
inconsistent networks -- if there is a netsplit (a disconnection between
two servers that splits the network into two pieces), Ceph might be
unable to mark OSDs down and remove them from the acting PG sets.
== snip

Why is Ceph not able to mark OSDs down, and why is it unclear whether or
not it is able to do so ("might")?


I think designs with 2 DCs usually have one or two mons per DC, and then
a third/fifth mon in a (small) third site so it can arbitrate which
side is up and which
isn't. OSDs report to each other but also to mons about their existence.


Yes absolutely, you need another qdevice/witness/mon... in a third 
location for the quorum, and my setup will have that. But still I don't 
see why Ceph should not be able to mark an OSD down if one site went down.


--
Andre Tann
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing

2024-11-21 Thread Eugen Block
They actually did have problems after standby-replay daemons took over  
as active daemons. After each failover they had to clean up some stale  
processes (or something like that). I'm not sure who recommended it,  
probably someone from SUSE engineering, but we switched off  
standby-replay and then the failover only took a minute or so, without  
anything to clean up. With hot standby it took several minutes (and  
cleaning afterwards). But the general recommendation is to not use  
standby-replay anyway, so we follow(ed) that. But it still might be  
useful in some scenarios, so proper testing is necessary.


Zitat von Александр Руденко :



IRC, last week during a Clyso
talk at Eventbrite I heard someone say that they deployed around 200
Filesystems or so, I don't remember if it was a production environment
or just a lab environment



Interesting, thanks!

I assume that you would probably be limited

by the number of OSDs/PGs rather than by the number of Filesystems,
200 Filesystems require at least 400 pools.



Sure, it's clear, thanks.

But it turned out that standby-replay didn't

help their use case, so we disabled standby-replay.



Interesting. There were some problems with standby-replay or they just do
not need "hot" standby?

чт, 21 нояб. 2024 г. в 11:36, Eugen Block :


I'm not aware of any hard limit for the number of Filesystems, but
that doesn't really mean very much. IIRC, last week during a Clyso
talk at Eventbrite I heard someone say that they deployed around 200
Filesystems or so, I don't remember if it was a production environment
or just a lab environment. I assume that you would probably be limited
by the number of OSDs/PGs rather than by the number of Filesystems,
200 Filesystems require at least 400 pools. But maybe someone else has
more experience in scaling CephFS that way. What we did was to scale
the number of active MDS daemons for one CephFS. I believe in the end
the customer had 48 MDS daemons on three MDS servers, 16 of them were
active with directory pinning, at that time they had 16 standby-replay
and 16 standby daemons. But it turned out that standby-replay didn't
help their use case, so we disabled standby-replay.

Can you show the entire 'ceph fs status' output? And maybe also 'ceph
fs dump'?

Zitat von Александр Руденко :

>>
>> Just for testing purposes, have you tried pinning rank 1 to some other
>> directory? Does it still break the CephFS if you stop it?
>
>
> Yes, nothing changed.
>
> It's no problem that FS hangs when one of the ranks goes down, we will
have
> standby-reply for all ranks. I don't like that rank which is not pinned
to
> some dir handled some io of this dir or from clients which work with this
> dir.
> I mean that I can't robustly and fully separate client IO by ranks.
>
> Would it be an option to rather use multiple Filesystems instead of
>> multi-active for one CephFS?
>
>
> Yes, it's an option. But it is much more complicated in our case. Btw, do
> you know how many different FS can be created in one cluster? Maybe you
> know some potential problems with 100-200 FSs in one cluster?
>
> ср, 20 нояб. 2024 г. в 17:50, Eugen Block :
>
>> Ah, I misunderstood, I thought you wanted an even distribution across
>> both ranks.
>> Just for testing purposes, have you tried pinning rank 1 to some other
>> directory? Does it still break the CephFS if you stop it? I'm not sure
>> if you can prevent rank 1 from participating, I haven't looked into
>> all the configs in quite a while. Would it be an option to rather use
>> multiple Filesystems instead of multi-active for one CephFS?
>>
>> Zitat von Александр Руденко :
>>
>> > No it's not a typo. It's misleading example)
>> >
>> > dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work
>> without
>> > rank 1.
>> > rank 1 is used for something when I work with this dirs.
>> >
>> > ceph 16.2.13, metadata balancer and policy based balancing not used.
>> >
>> > ср, 20 нояб. 2024 г. в 16:33, Eugen Block :
>> >
>> >> Hi,
>> >>
>> >> > After pinning:
>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>> >>
>> >> is this a typo? If not, you did pin both directories to the same
rank.
>> >>
>> >> Zitat von Александр Руденко :
>> >>
>> >> > Hi,
>> >> >
>> >> > I try to distribute all top level dirs in CephFS by different MDS
>> ranks.
>> >> > I have two active MDS with rank *0* and *1 *and I have 2 top dirs
like
>> >> > */dir1* and* /dir2*.
>> >> >
>> >> > After pinning:
>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>> >> >
>> >> > I can see next INOS and DNS distribution:
>> >> > RANK  STATE   MDS ACTIVITY DNSINOS   DIRS   CAPS
>> >> >  0active   c   Reqs:127 /s  12.6k  12.5k   333505
>> >> >  1active   b   Reqs:11 /s21 24 19  1
>> >> >
>> >> > When I write to dir1 I can see a small amount on Reqs: in rank 1.
>> >> >
>> >> > E

[ceph-users] Lifetime for ceph

2024-11-21 Thread Steve Brasier
The octopus repos disappeared a couple of days ago - no argument with that,
given it's marked as out of support. However, I see from
https://docs.ceph.com/en/latest/releases/ that quincy is also marked as out
of support, but currently the repos are still there.

Is there any guesstimate of when the quincy repos might disappear please?

many thanks
Steve Brasier
http://stackhpc.com/
Please note I work Tuesday to Friday.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing

2024-11-21 Thread Frank Schilder
I concur with that observation. Standby-replay seems a useless mode of 
operation. The replay daemons use a lot more RAM than the active ones and the 
fail-over took ages. After switching to standby-only, fail-over is usually 5-20s, 
with the lower end being more common.

We have 8 active and 4 standby daemons configured.
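
For anyone wanting to do the same, the switch is a single setting per
filesystem (the fs name is an example):

ceph fs set fs1 allow_standby_replay false

after which the standby-replay daemons should fall back to being regular
standbys.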

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Thursday, November 21, 2024 3:55 PM
To: Александр Руденко
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: [CephFS] Completely exclude some MDS rank from 
directory processing

They actually did have problems after standby-replay daemons took over
as active daemons. After each failover they had to clean up some stale
processes (or something like that). I'm not sure who recommended it,
probably someone from SUSE engineering, but we switched off
standby-replay and then the failover only took a minute or so, without
anything to clean up. With hot standby it took several minutes (and
cleaning afterwards). But the general recommendation is to not use
standby-replay anyway, so we follow(ed) that. But it still might be
useful in some scenarios, so proper testing is necessary.

Zitat von Александр Руденко :

>>
>> IRC, last week during a Clyso
>> talk at Eventbrite I heard someone say that they deployed around 200
>> Filesystems or so, I don't remember if it was a production environment
>> or just a lab environment
>
>
> Interesting, thanks!
>
> I assume that you would probably be limited
>> by the number of OSDs/PGs rather than by the number of Filesystems,
>> 200 Filesystems require at least 400 pools.
>
>
> Sure, it's clear, thanks.
>
> But it turned out that standby-replay didn't
>> help their use case, so we disabled standby-replay.
>
>
> Interesting. There were some problems with standby-replay or they just do
> not need "hot" standby?
>
> чт, 21 нояб. 2024 г. в 11:36, Eugen Block :
>
>> I'm not aware of any hard limit for the number of Filesystems, but
>> that doesn't really mean very much. IIRC, last week during a Clyso
>> talk at Eventbrite I heard someone say that they deployed around 200
>> Filesystems or so, I don't remember if it was a production environment
>> or just a lab environment. I assume that you would probably be limited
>> by the number of OSDs/PGs rather than by the number of Filesystems,
>> 200 Filesystems require at least 400 pools. But maybe someone else has
>> more experience in scaling CephFS that way. What we did was to scale
>> the number of active MDS daemons for one CephFS. I believe in the end
>> the customer had 48 MDS daemons on three MDS servers, 16 of them were
>> active with directory pinning, at that time they had 16 standby-replay
>> and 16 standby daemons. But it turned out that standby-replay didn't
>> help their use case, so we disabled standby-replay.
>>
>> Can you show the entire 'ceph fs status' output? Any maybe also 'ceph
>> fs dump'?
>>
>> Zitat von Александр Руденко :
>>
>> >>
>> >> Just for testing purposes, have you tried pinning rank 1 to some other
>> >> directory? Does it still break the CephFS if you stop it?
>> >
>> >
>> > Yes, nothing changed.
>> >
>> > It's no problem that FS hangs when one of the ranks goes down, we will
>> have
>> > standby-reply for all ranks. I don't like that rank which is not pinned
>> to
>> > some dir handled some io of this dir or from clients which work with this
>> > dir.
>> > I mean that I can't robustly and fully separate client IO by ranks.
>> >
>> > Would it be an option to rather use multiple Filesystems instead of
>> >> multi-active for one CephFS?
>> >
>> >
>> > Yes, it's an option. But it is much more complicated in our case. Btw, do
>> > you know how many different FS can be created in one cluster? Maybe you
>> > know some potential problems with 100-200 FSs in one cluster?
>> >
>> > ср, 20 нояб. 2024 г. в 17:50, Eugen Block :
>> >
>> >> Ah, I misunderstood, I thought you wanted an even distribution across
>> >> both ranks.
>> >> Just for testing purposes, have you tried pinning rank 1 to some other
>> >> directory? Does it still break the CephFS if you stop it? I'm not sure
>> >> if you can prevent rank 1 from participating, I haven't looked into
>> >> all the configs in quite a while. Would it be an option to rather use
>> >> multiple Filesystems instead of multi-active for one CephFS?
>> >>
>> >> Zitat von Александр Руденко :
>> >>
>> >> > No it's not a typo. It's misleading example)
>> >> >
>> >> > dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work
>> >> without
>> >> > rank 1.
>> >> > rank 1 is used for something when I work with this dirs.
>> >> >
>> >> > ceph 16.2.13, metadata balancer and policy based balancing not used.
>> >> >
>> >> > ср, 20 нояб. 2024 г. в 16:33, Eugen Block :
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> > After pinning:
>> >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>> >> >> > setfattr -n ceph.dir.pin

[ceph-users] MDS blocklist/evict clients during network maintenance

2024-11-21 Thread Eugen Block

Hi,

can anyone share some experience with these two configs?

ceph config get mds mds_session_blocklist_on_timeout
true
ceph config get mds mds_session_blocklist_on_evict
true

If there's some network maintenance going on and the client connection  
is interrupted, could it help to disable evicting and blocklisting MDS  
clients? And what risks should we be aware of if we tried that? We're  
not entirely sure yet if this could be a reasonable approach, but  
we're trying to figure out how to make network maintenance less  
painful for clients.
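
Concretely, the idea (untested) would be to flip these for the duration of
the window and revert afterwards:

ceph config set mds mds_session_blocklist_on_timeout false
ceph config set mds mds_session_blocklist_on_evict false
# ... network maintenance ...
ceph config set mds mds_session_blocklist_on_timeout true
ceph config set mds mds_session_blocklist_on_evict true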
I'm also looking at some other possible configs, but let's start with  
these two first.


Any comments would be appreciated!

Thanks!
Eugen
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS subvolumes not inheriting ephemeral distributed pin

2024-11-21 Thread Patrick Donnelly
On Wed, Nov 20, 2024 at 2:05 PM Rajmohan Ramamoorthy
 wrote:
>
> Hi Patrick,
>
> Few other follow up questions.
>
> Is directory fragmentation applicable only when multiple active MDSes are 
> enabled for a Ceph FS?

It has no effect when applied with only one rank (active). It can be
useful to have it already set in case you increase max_mds.

> Will directory fragmentation and distribution of fragments amongst active MDS 
> daemons happen if we turn off the balancer for a Ceph FS volume with `ceph fs 
> set midline-a balance_automate false`? In Squid, the CephFS automatic metadata 
> load (sometimes called “default”) balancer is now disabled by default. 
> (https://docs.ceph.com/en/latest/releases/squid/)

Yes.

> Is there a way for us to ensure that the directory tree of a Subvolume 
> (Kubernetes PV) is part of the same fragment and handled by a single MDS, so 
> that client operations are handled by one MDS?

A subvolume would not be split across two MDS.

> What is the trigger to start fragmenting directories within a Subvolumegroup?

You don't need to do anything more than set the distribute ephemeral pin.

> With the `balance_automate` set to false and `ephemeral distributed pin` 
> enabled for a Subvolumegroup, can we expect (almost) equal distribution of 
> Subvolumes (Kubernetes PVs) amongst the active MDS daemons and stable 
> operation without hotspot migrations?

Yes.
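
A minimal sketch of that combination (the fs name is taken from the
question above, the subvolume group path is only an example):

# disable the automatic metadata balancer for the filesystem
ceph fs set midline-a balance_automate false
# distribute the subvolume directories of the group across ranks
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/volumes/csi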

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing

2024-11-21 Thread Frédéric Nass
Hi Eugen,

During the talk you mentioned, Dan said there's a hard-coded limit of 256 
MDSs per cluster. So with one active and one standby-ish MDS per filesystem, 
that would be 128 filesystems at max per cluster.
Mark said he got to 120, but things start to get wacky by 80. :-)

More fun to come, for sure.

Cheers,
Frédéric.

[1] https://youtu.be/qiCE1Ifws80?t=2602

- Le 21 Nov 24, à 9:36, Eugen Block ebl...@nde.ag a écrit :

> I'm not aware of any hard limit for the number of Filesystems, but
> that doesn't really mean very much. IIRC, last week during a Clyso
> talk at Eventbrite I heard someone say that they deployed around 200
> Filesystems or so, I don't remember if it was a production environment
> or just a lab environment. I assume that you would probably be limited
> by the number of OSDs/PGs rather than by the number of Filesystems,
> 200 Filesystems require at least 400 pools. But maybe someone else has
> more experience in scaling CephFS that way. What we did was to scale
> the number of active MDS daemons for one CephFS. I believe in the end
> the customer had 48 MDS daemons on three MDS servers, 16 of them were
> active with directory pinning, at that time they had 16 standby-replay
> and 16 standby daemons. But it turned out that standby-replay didn't
> help their use case, so we disabled standby-replay.
> 
> Can you show the entire 'ceph fs status' output? Any maybe also 'ceph
> fs dump'?
> 
> Zitat von Александр Руденко :
> 
>>>
>>> Just for testing purposes, have you tried pinning rank 1 to some other
>>> directory? Does it still break the CephFS if you stop it?
>>
>>
>> Yes, nothing changed.
>>
>> It's no problem that FS hangs when one of the ranks goes down, we will have
>> standby-reply for all ranks. I don't like that rank which is not pinned to
>> some dir handled some io of this dir or from clients which work with this
>> dir.
>> I mean that I can't robustly and fully separate client IO by ranks.
>>
>> Would it be an option to rather use multiple Filesystems instead of
>>> multi-active for one CephFS?
>>
>>
>> Yes, it's an option. But it is much more complicated in our case. Btw, do
>> you know how many different FS can be created in one cluster? Maybe you
>> know some potential problems with 100-200 FSs in one cluster?
>>
>> ср, 20 нояб. 2024 г. в 17:50, Eugen Block :
>>
>>> Ah, I misunderstood, I thought you wanted an even distribution across
>>> both ranks.
>>> Just for testing purposes, have you tried pinning rank 1 to some other
>>> directory? Does it still break the CephFS if you stop it? I'm not sure
>>> if you can prevent rank 1 from participating, I haven't looked into
>>> all the configs in quite a while. Would it be an option to rather use
>>> multiple Filesystems instead of multi-active for one CephFS?
>>>
>>> Zitat von Александр Руденко :
>>>
>>> > No it's not a typo. It's misleading example)
>>> >
>>> > dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work
>>> without
>>> > rank 1.
>>> > rank 1 is used for something when I work with this dirs.
>>> >
>>> > ceph 16.2.13, metadata balancer and policy based balancing not used.
>>> >
>>> > ср, 20 нояб. 2024 г. в 16:33, Eugen Block :
>>> >
>>> >> Hi,
>>> >>
>>> >> > After pinning:
>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>>> >>
>>> >> is this a typo? If not, you did pin both directories to the same rank.
>>> >>
>>> >> Zitat von Александр Руденко :
>>> >>
>>> >> > Hi,
>>> >> >
>>> >> > I try to distribute all top level dirs in CephFS by different MDS
>>> ranks.
>>> >> > I have two active MDS with rank *0* and *1 *and I have 2 top dirs like
>>> >> > */dir1* and* /dir2*.
>>> >> >
>>> >> > After pinning:
>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>>> >> >
>>> >> > I can see next INOS and DNS distribution:
>>> >> > RANK  STATE   MDS ACTIVITY DNSINOS   DIRS   CAPS
>>> >> >  0active   c   Reqs:127 /s  12.6k  12.5k   333505
>>> >> >  1active   b   Reqs:11 /s21 24 19  1
>>> >> >
>>> >> > When I write to dir1 I can see a small amount on Reqs: in rank 1.
>>> >> >
>>> >> > Events in journal of MDS with rank 1:
>>> >> > cephfs-journal-tool --rank=fs1:1 event get list
>>> >> >
>>> >> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE:
>>> (scatter_writebehind)
>>> >> >   A2037D53
>>> >> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION:  ()
>>> >> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE:  (lock inest
>>> accounted
>>> >> > scatter stat update)
>>> >> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION:  ()
>>> >> > 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION:  ()
>>> >> > 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION:  ()
>>> >> > 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION:  ()
>>> >> > 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION:  ()
>>> >> > 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION:  

[ceph-users] 2024-11-21 Perf meeting cancelled!

2024-11-21 Thread Matt Vandermeulen
Hi folks, the perf meeting will be cancelled today, as Mark is flying back 
from a conference!


Thanks,
Matt
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing

2024-11-21 Thread Александр Руденко
>
> IIRC, last week during a Clyso
> talk at Eventbrite I heard someone say that they deployed around 200
> Filesystems or so, I don't remember if it was a production environment
> or just a lab environment


Interesting, thanks!

I assume that you would probably be limited
> by the number of OSDs/PGs rather than by the number of Filesystems,
> 200 Filesystems require at least 400 pools.


Sure, it's clear, thanks.

But it turned out that standby-replay didn't
> help their use case, so we disabled standby-replay.


Interesting. Were there some problems with standby-replay, or did they just
not need a "hot" standby?

чт, 21 нояб. 2024 г. в 11:36, Eugen Block :

> I'm not aware of any hard limit for the number of Filesystems, but
> that doesn't really mean very much. IIRC, last week during a Clyso
> talk at Eventbrite I heard someone say that they deployed around 200
> Filesystems or so, I don't remember if it was a production environment
> or just a lab environment. I assume that you would probably be limited
> by the number of OSDs/PGs rather than by the number of Filesystems,
> 200 Filesystems require at least 400 pools. But maybe someone else has
> more experience in scaling CephFS that way. What we did was to scale
> the number of active MDS daemons for one CephFS. I believe in the end
> the customer had 48 MDS daemons on three MDS servers, 16 of them were
> active with directory pinning, at that time they had 16 standby-replay
> and 16 standby daemons. But it turned out that standby-replay didn't
> help their use case, so we disabled standby-replay.
>
> Can you show the entire 'ceph fs status' output? Any maybe also 'ceph
> fs dump'?
>
> Zitat von Александр Руденко :
>
> >>
> >> Just for testing purposes, have you tried pinning rank 1 to some other
> >> directory? Does it still break the CephFS if you stop it?
> >
> >
> > Yes, nothing changed.
> >
> > It's no problem that FS hangs when one of the ranks goes down, we will
> have
> > standby-reply for all ranks. I don't like that rank which is not pinned
> to
> > some dir handled some io of this dir or from clients which work with this
> > dir.
> > I mean that I can't robustly and fully separate client IO by ranks.
> >
> > Would it be an option to rather use multiple Filesystems instead of
> >> multi-active for one CephFS?
> >
> >
> > Yes, it's an option. But it is much more complicated in our case. Btw, do
> > you know how many different FS can be created in one cluster? Maybe you
> > know some potential problems with 100-200 FSs in one cluster?
> >
> > ср, 20 нояб. 2024 г. в 17:50, Eugen Block :
> >
> >> Ah, I misunderstood, I thought you wanted an even distribution across
> >> both ranks.
> >> Just for testing purposes, have you tried pinning rank 1 to some other
> >> directory? Does it still break the CephFS if you stop it? I'm not sure
> >> if you can prevent rank 1 from participating, I haven't looked into
> >> all the configs in quite a while. Would it be an option to rather use
> >> multiple Filesystems instead of multi-active for one CephFS?
> >>
> >> Zitat von Александр Руденко :
> >>
> >> > No it's not a typo. It's misleading example)
> >> >
> >> > dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work
> >> without
> >> > rank 1.
> >> > rank 1 is used for something when I work with this dirs.
> >> >
> >> > ceph 16.2.13, metadata balancer and policy based balancing not used.
> >> >
> >> > ср, 20 нояб. 2024 г. в 16:33, Eugen Block :
> >> >
> >> >> Hi,
> >> >>
> >> >> > After pinning:
> >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
> >> >>
> >> >> is this a typo? If not, you did pin both directories to the same
> rank.
> >> >>
> >> >> Zitat von Александр Руденко :
> >> >>
> >> >> > Hi,
> >> >> >
> >> >> > I try to distribute all top level dirs in CephFS by different MDS
> >> ranks.
> >> >> > I have two active MDS with rank *0* and *1 *and I have 2 top dirs
> like
> >> >> > */dir1* and* /dir2*.
> >> >> >
> >> >> > After pinning:
> >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
> >> >> >
> >> >> > I can see next INOS and DNS distribution:
> >> >> > RANK  STATE   MDS ACTIVITY DNSINOS   DIRS   CAPS
> >> >> >  0active   c   Reqs:127 /s  12.6k  12.5k   333505
> >> >> >  1active   b   Reqs:11 /s21 24 19  1
> >> >> >
> >> >> > When I write to dir1 I can see a small amount on Reqs: in rank 1.
> >> >> >
> >> >> > Events in journal of MDS with rank 1:
> >> >> > cephfs-journal-tool --rank=fs1:1 event get list
> >> >> >
> >> >> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE:
> >> (scatter_writebehind)
> >> >> >   A2037D53
> >> >> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION:  ()
> >> >> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE:  (lock inest
> >> accounted
> >> >> > scatter stat update)
> >> >> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION:  

[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing

2024-11-21 Thread Александр Руденко
Hi, Frank, thanks!

it might be that you are expecting too much from ceph. The design of the
> filesystem was not some grand plan with every detail worked out. It was
> more the classic evolutionary approach, something working was screwed on
> top of rados and things evolved from there on.


There was some hope that it was just a configuration problem in my
environment :)

Specifically rank 0 is critical.


Yes, because we can't re-pin the root of the FS to some other rank. It was
clear that rank 0 is critical. But unfortunately, as we can see, all ranks
are critical for stable operation in any directory.

чт, 21 нояб. 2024 г. в 14:46, Frank Schilder :

> Hi Alexander,
>
> it might be that you are expecting too much from ceph. The design of the
> filesystem was not some grand plan with every detail worked out. It was
> more the classic evolutionary approach, something working was screwed on
> top of rados and things evolved from there on.
>
> It is possible that the code and pin-seperation is not as clean as one
> would imagine. Here is what I observe before and after pinning everything
> explicitly:
>
> - before pinning:
>   * high MDS load for no apparent reason - the balancer was just going in
> circles
>   * stopping an MDS would besically bring all IO down
>
> - after pinning:
>   * low MDS load, better user performance, much faster restarts
>   * stopping an MDS does not kill all IO immediately, some IO continues,
> however, eventually every client gets stuck
>
> There is apparently still communication between all ranks about all
> clients and it is a bit annoying that some of this communication is
> blocking. Not sure if it has to be blocking or if one could make it
> asynchronous requests to the down rank. My impression is that ceph
> internals are rather bad at making stuff asynchronous. So if something in
> the MDS cluster is not healthy sooner or later IO will stop waiting for
> some blocking request to the unhealthy MDS. There seems to be no such thing
> as IO on other healthy MDSes continues as usual.
>
> Specifically rank 0 is critical.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Eugen Block 
> Sent: Thursday, November 21, 2024 9:36 AM
> To: Александр Руденко
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: [CephFS] Completely exclude some MDS rank from
> directory processing
>
> I'm not aware of any hard limit for the number of Filesystems, but
> that doesn't really mean very much. IIRC, last week during a Clyso
> talk at Eventbrite I heard someone say that they deployed around 200
> Filesystems or so, I don't remember if it was a production environment
> or just a lab environment. I assume that you would probably be limited
> by the number of OSDs/PGs rather than by the number of Filesystems,
> 200 Filesystems require at least 400 pools. But maybe someone else has
> more experience in scaling CephFS that way. What we did was to scale
> the number of active MDS daemons for one CephFS. I believe in the end
> the customer had 48 MDS daemons on three MDS servers, 16 of them were
> active with directory pinning, at that time they had 16 standby-replay
> and 16 standby daemons. But it turned out that standby-replay didn't
> help their use case, so we disabled standby-replay.
>
> Can you show the entire 'ceph fs status' output? Any maybe also 'ceph
> fs dump'?
>
> Zitat von Александр Руденко :
>
> >>
> >> Just for testing purposes, have you tried pinning rank 1 to some other
> >> directory? Does it still break the CephFS if you stop it?
> >
> >
> > Yes, nothing changed.
> >
> > It's no problem that FS hangs when one of the ranks goes down, we will
> have
> > standby-reply for all ranks. I don't like that rank which is not pinned
> to
> > some dir handled some io of this dir or from clients which work with this
> > dir.
> > I mean that I can't robustly and fully separate client IO by ranks.
> >
> > Would it be an option to rather use multiple Filesystems instead of
> >> multi-active for one CephFS?
> >
> >
> > Yes, it's an option. But it is much more complicated in our case. Btw, do
> > you know how many different FS can be created in one cluster? Maybe you
> > know some potential problems with 100-200 FSs in one cluster?
> >
> > ср, 20 нояб. 2024 г. в 17:50, Eugen Block :
> >
> >> Ah, I misunderstood, I thought you wanted an even distribution across
> >> both ranks.
> >> Just for testing purposes, have you tried pinning rank 1 to some other
> >> directory? Does it still break the CephFS if you stop it? I'm not sure
> >> if you can prevent rank 1 from participating, I haven't looked into
> >> all the configs in quite a while. Would it be an option to rather use
> >> multiple Filesystems instead of multi-active for one CephFS?
> >>
> >> Zitat von Александр Руденко :
> >>
> >> > No it's not a typo. It's misleading example)
> >> >
> >> > dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work
>

[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing

2024-11-21 Thread Александр Руденко
>
> Can you show the entire 'ceph fs status' output? Any maybe also 'ceph
> fs dump'?


Nothing special, just a small test cluster.
fs1 - 10 clients
===
RANK  STATE   MDS ACTIVITY DNSINOS   DIRS   CAPS
 0active   a   Reqs:0 /s  18.7k  18.4k   351513
 1active   b   Reqs:0 /s21 24 16  1
  POOL  TYPE USED  AVAIL
fs1_meta  metadata   116M  3184G
fs1_datadata23.8G  3184G
STANDBY MDS
 c


fs dump

e48
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
writeable ranges,3=default file layouts on dirs,4=dir inode in separate
object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'fs1' (1)
fs_name fs1
epoch 47
flags 12
created 2024-10-15T18:55:10.905035+0300
modified 2024-11-21T10:55:12.688598+0300
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {}
last_failure 0
last_failure_osd_epoch 943
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 2
in 0,1
up {0=12200812,1=11974933}
failed
damaged
stopped
data_pools [7]
metadata_pool 6
inline_data disabled
balancer
standby_count_wanted 1
[mds.a{0:12200812} state up:active seq 13 addr [v2:
10.7.1.115:6842/1955635987,v1:10.7.1.115:6843/1955635987] compat
{c=[1],r=[1],i=[7ff]}]
[mds.b{1:11974933} state up:active seq 5 addr [v2:
10.7.1.116:6840/536741454,v1:10.7.1.116:6841/536741454] compat
{c=[1],r=[1],i=[7ff]}]


Standby daemons:

[mds.c{-1:11704322} state up:standby seq 1 addr [v2:
10.7.1.117:6848/84247504,v1:10.7.1.117:6849/84247504] compat
{c=[1],r=[1],i=[7ff]}]

чт, 21 нояб. 2024 г. в 11:36, Eugen Block :

> I'm not aware of any hard limit for the number of Filesystems, but
> that doesn't really mean very much. IIRC, last week during a Clyso
> talk at Eventbrite I heard someone say that they deployed around 200
> Filesystems or so, I don't remember if it was a production environment
> or just a lab environment. I assume that you would probably be limited
> by the number of OSDs/PGs rather than by the number of Filesystems,
> 200 Filesystems require at least 400 pools. But maybe someone else has
> more experience in scaling CephFS that way. What we did was to scale
> the number of active MDS daemons for one CephFS. I believe in the end
> the customer had 48 MDS daemons on three MDS servers, 16 of them were
> active with directory pinning, at that time they had 16 standby-replay
> and 16 standby daemons. But it turned out that standby-replay didn't
> help their use case, so we disabled standby-replay.
>
> Can you show the entire 'ceph fs status' output? Any maybe also 'ceph
> fs dump'?
>
> Zitat von Александр Руденко :
>
> >>
> >> Just for testing purposes, have you tried pinning rank 1 to some other
> >> directory? Does it still break the CephFS if you stop it?
> >
> >
> > Yes, nothing changed.
> >
> > It's no problem that FS hangs when one of the ranks goes down, we will
> have
> > standby-reply for all ranks. I don't like that rank which is not pinned
> to
> > some dir handled some io of this dir or from clients which work with this
> > dir.
> > I mean that I can't robustly and fully separate client IO by ranks.
> >
> > Would it be an option to rather use multiple Filesystems instead of
> >> multi-active for one CephFS?
> >
> >
> > Yes, it's an option. But it is much more complicated in our case. Btw, do
> > you know how many different FS can be created in one cluster? Maybe you
> > know some potential problems with 100-200 FSs in one cluster?
> >
> > ср, 20 нояб. 2024 г. в 17:50, Eugen Block :
> >
> >> Ah, I misunderstood, I thought you wanted an even distribution across
> >> both ranks.
> >> Just for testing purposes, have you tried pinning rank 1 to some other
> >> directory? Does it still break the CephFS if you stop it? I'm not sure
> >> if you can prevent rank 1 from participating, I haven't looked into
> >> all the configs in quite a while. Would it be an option to rather use
> >> multiple Filesystems instead of multi-active for one CephFS?
> >>
> >> Zitat von Александр Руденко :
> >>
> >> > No it's not a typo. It's misleading example)
> >> >
> >> > dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work
> >> without
> >> > rank 1.
> >> > rank 1 is used for something when I work with this dirs.
> >> >
> >> > ceph 16.2.13, metadata balancer and policy based balancing not used.
> >> >
> >> > ср, 20 нояб. 2024 г. в 16:33, Eugen Block :
> >> >
> >> >> Hi,
> >> >>
> >> >> > After pinning:
> >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
> >> >>
> >> >> is this a typo? If not

[ceph-users] Re: MDS blocklist/evict clients during network maintenance

2024-11-21 Thread Dan van der Ster
Hi Eugen,

Disabling blocklisting on eviction is a pretty standard config. In my
experience it allows clients to resume their sessions cleanly without needing a
remount.

There's docs about this here:
https://docs.ceph.com/en/latest/cephfs/eviction/#advanced-configuring-blocklisting

I don't have a good feeling if this will be useful for your network
intervention though... What are you trying to achieve? How long will
clients be unreachable?

Cheers, Dan


--
Dan van der Ster
CTO@CLYSO & CEC Member


On Thu, Nov 21, 2024, 10:15 Eugen Block  wrote:

> Hi,
>
> can anyone share some experience with these two configs?
>
> ceph config get mds mds_session_blocklist_on_timeout
> true
> ceph config get mds mds_session_blocklist_on_evict
> true
>
> If there's some network maintenance going on and the client connection
> is interrupted, could it help to disable evicting and blocklisting MDS
> clients? And what risks should we be aware of if we tried that? We're
> not entirely sure yet if this could be a reasonable approach, but
> we're trying to figure out how to make network maintenance less
> painful for clients.
> I'm also looking at some other possible configs, but let's start with
> these two first.
>
> Any comments would be appreciated!
>
> Thanks!
> Eugen
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: multisite sync issue with bucket sync

2024-11-21 Thread Casey Bodley
hey Chris,

On Wed, Nov 20, 2024 at 6:02 PM Christopher Durham  wrote:
>
> Casey,
>
> OR, is there a way to continue with new data syncing (incremental) while the 
> full sync catches up, as the full sync will take a long time and no new 
> incremental data is being replicated?

full sync walks through the entire bucket listing, so it will visit
some new objects along the way. but where possible, multisite tries to
prioritize sync of older data because minimizing the average
time-to-sync is important for disaster recovery. if we tried to
process both full and incremental at the same time, that would slow
down the full sync and some objects could take far longer to
replicate. it would also be less efficient overall, because longer
full sync means more overlap and duplicated effort with incremental
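
for monitoring where a bucket is in that process, something like the
following is usually enough (bucket name is a placeholder):

radosgw-admin bucket sync status --bucket=<bucket>
radosgw-admin sync error list

the first shows whether the bucket shards are still in full sync or already
incremental, the second lists per-object sync errors.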

>
> -Chris
>
> On Wednesday, November 20, 2024 at 03:30:40 PM MST, Christopher Durham 
>  wrote:
>
>
> Casey,
>
> Thanks for your response. So is there a way to abandon a full sync and just 
> move on with an incremental from the time you abandon the full sync?

i'm afraid not. multisite tries very hard to maintain consistency
between zones, so it's not easy to subvert that. 'radosgw-admin bucket
sync init' is probably the only command that can modify bucket sync
status

>
> -Chris
>
> On Wednesday, November 20, 2024 at 12:29:26 PM MST, Casey Bodley 
>  wrote:
>
>
> On Wed, Nov 20, 2024 at 2:10 PM Christopher Durham  wrote:
> >
> >  Ok,
> > Source code review reveals that full sync is marker based, and sync errors 
> > within a marker group *suggest* that data within the marker is re-checked 
> > (I may be wrong about this, but that is consistent with my 304 errors 
> > below). I do, however, have the following question:
> > Is there a way to otherwise abort a full sync of a bucket (as a result of 
> > radosgw-admin bucket sync init --bucket <bucket> and bucket sync run, or a 
> > restart of radosgw) and have it just do incremental sync from then on (yes, 
> > accepting that the objects are not the same on both sides prior to the 
> > 'restart' of an incremental sync)?
> > Would radosgw-admin bucket sync disable --bucket <bucket> followed by 
> > radosgw-admin bucket sync enable --bucket <bucket> do this? Or would that 
> > do another full sync and not an incremental?
>
> 'bucket sync enable' does start a new full sync (to catch objects that
> were uploaded since 'bucket sync disable')
>
>
> Thanks
> > -Chris
> >
> >On Thursday, November 14, 2024 at 04:18:34 PM MST, Christopher Durham 
> >  wrote:
> >
> >  Hi,
> > I have heard nothing on this, but have done some more research.
> > Again, both sides of a multisite s3 configuration are ceph 18.2.4 on Rocky 
> > 9.
> > For a given bucket, there are thousands of 'missing' objects. I did:
> > radosgw-admin bucket sync init --bucket <bucket> --src-zone <src zone>
> > Sync starts after I restart a radosgw on the source zone that has a 
> > sync thread.
> > But based on the number and size of objects needing replication, it NEVER 
> > finishes, as more objects are created while I am going. I may need to increase 
> > the number of radosgws and/or the sync threads.
> >
> > What I have discovered is that if a radosgw on the side with missing objects 
> > is restarted, all syncing starts over! In other words, it starts polling each 
> > object, getting a 304 error in the radosgw log on the server on the 
> > multisite side that has the missing objects. It *appears* to do this sequential 
> > object scan in lexicographic order of object and/or prefix name, although I 
> > cannot be sure.
> >
> > So some questions:
> > 1. Is there a recommendation/rule of thumb/formula for the number of 
> > radosgws/sync threads/etc. based on the number of objects, buckets, bandwidth, 
> > etc.?
> > 2. Why does the syncing restart for a bucket when a radosgw is restarted? Is 
> > there a way to tell it to resume where it left off as opposed to starting 
> > over? There may be reasons to restart a bucket sync if a radosgw restarts, 
> > but there should be a way to checkpoint/force it to not restart/to start 
> > where it left off, etc.
> > 3. Is there a way to 'abort' the sync and cause the bucket to think it is up 
> > to date and only replicate new objects from the time it was marked up to date?
> > Thanks for any information
> > -Chris
> >
> >
> >
> >On Friday, November 8, 2024 at 03:45:05 PM MST, Christopher Durham 
> >  wrote:
> >
> >
> > I have a 2-site multisite configuration on ceph 18.2.4 on EL9.
> > After system updates, we discovered that a particular bucket had several 
> > thousand objects missing, which the other side had. Newly created objects 
> > were being replicated just fine.
> >
> > I decided to 'restart' syncing that bucket. Here is what I did.
> > On the side with missing objects:
> > > radosgw-admin bucket sync init --bucket <bucket> --src-zone <src zone>
> >
> > I restarted the radosgw set up to do the sync thread on the same zone as I 
> > ran the radosgw-admin command.
> >
> > Logs on the radosgw src-zone side show GETs with http code 200 for objects 
> > that do n