[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing
Hi Alexander, it might be that you are expecting too much from ceph. The design of the filesystem was not some grand plan with every detail worked out. It was more the classic evolutionary approach, something working was screwed on top of rados and things evolved from there on. It is possible that the code and pin separation is not as clean as one would imagine. Here is what I observe before and after pinning everything explicitly: - before pinning: * high MDS load for no apparent reason - the balancer was just going in circles * stopping an MDS would basically bring all IO down - after pinning: * low MDS load, better user performance, much faster restarts * stopping an MDS does not kill all IO immediately, some IO continues, however, eventually every client gets stuck There is apparently still communication between all ranks about all clients and it is a bit annoying that some of this communication is blocking. Not sure if it has to be blocking or if one could turn it into asynchronous requests to the down rank. My impression is that ceph internals are rather bad at making stuff asynchronous. So if something in the MDS cluster is not healthy, sooner or later IO will stop waiting for some blocking request to the unhealthy MDS. There seems to be no such thing as IO on other healthy MDSes continuing as usual. Specifically rank 0 is critical. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Thursday, November 21, 2024 9:36 AM To: Александр Руденко Cc: ceph-users@ceph.io Subject: [ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing I'm not aware of any hard limit for the number of Filesystems, but that doesn't really mean very much. IIRC, last week during a Clyso talk at Eventbrite I heard someone say that they deployed around 200 Filesystems or so, I don't remember if it was a production environment or just a lab environment. I assume that you would probably be limited by the number of OSDs/PGs rather than by the number of Filesystems, 200 Filesystems require at least 400 pools. But maybe someone else has more experience in scaling CephFS that way. What we did was to scale the number of active MDS daemons for one CephFS. I believe in the end the customer had 48 MDS daemons on three MDS servers, 16 of them were active with directory pinning, at that time they had 16 standby-replay and 16 standby daemons. But it turned out that standby-replay didn't help their use case, so we disabled standby-replay. Can you show the entire 'ceph fs status' output? And maybe also 'ceph fs dump'? Zitat von Александр Руденко : >> >> Just for testing purposes, have you tried pinning rank 1 to some other >> directory? Does it still break the CephFS if you stop it? > > > Yes, nothing changed. > > It's no problem that FS hangs when one of the ranks goes down, we will have > standby-replay for all ranks. I don't like that a rank which is not pinned to > some dir handles some IO of this dir or from clients which work with this > dir. > I mean that I can't robustly and fully separate client IO by ranks. > > Would it be an option to rather use multiple Filesystems instead of >> multi-active for one CephFS? > > > Yes, it's an option. But it is much more complicated in our case. Btw, do > you know how many different FS can be created in one cluster? Maybe you > know some potential problems with 100-200 FSs in one cluster? > > ср, 20 нояб. 2024 г. в 17:50, Eugen Block : > >> Ah, I misunderstood, I thought you wanted an even distribution across >> both ranks.
>> Just for testing purposes, have you tried pinning rank 1 to some other >> directory? Does it still break the CephFS if you stop it? I'm not sure >> if you can prevent rank 1 from participating, I haven't looked into >> all the configs in quite a while. Would it be an option to rather use >> multiple Filesystems instead of multi-active for one CephFS? >> >> Zitat von Александр Руденко : >> >> > No it's not a typo. It's misleading example) >> > >> > dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work >> without >> > rank 1. >> > rank 1 is used for something when I work with this dirs. >> > >> > ceph 16.2.13, metadata balancer and policy based balancing not used. >> > >> > ср, 20 нояб. 2024 г. в 16:33, Eugen Block : >> > >> >> Hi, >> >> >> >> > After pinning: >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1 >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2 >> >> >> >> is this a typo? If not, you did pin both directories to the same rank. >> >> >> >> Zitat von Александр Руденко : >> >> >> >> > Hi, >> >> > >> >> > I try to distribute all top level dirs in CephFS by different MDS >> ranks. >> >> > I have two active MDS with rank *0* and *1 *and I have 2 top dirs like >> >> > */dir1* and* /dir2*. >> >> > >> >> > After pinning: >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1 >> >> > setfattr -n c
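A minimal sketch of the pinning this thread is aiming for, using the mount point and directory names from the example above: each top-level directory gets its own rank, and the pin can be checked (or removed) from any client mount. This is a sketch, not a guarantee of behaviour.

# pin dir1 to rank 0 and dir2 to rank 1 (instead of both to rank 0)
setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
setfattr -n ceph.dir.pin -v 1 /fs-mountpoint/dir2

# verify the pins
getfattr -n ceph.dir.pin /fs-mountpoint/dir1
getfattr -n ceph.dir.pin /fs-mountpoint/dir2

# a value of -1 removes a pin again
setfattr -n ceph.dir.pin -v -1 /fs-mountpoint/dir2

As the rest of the thread shows, this only controls which rank is authoritative for each subtree; it does not guarantee that a stopped rank has no effect on clients working in the other directory.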
[ceph-users] Re: MDS blocklist/evict clients during network maintenance
Hi Dan, thanks for the link, I've been reading it over and over again but still didn't come to a conclusion yet. IIRC, the maintenance windows are one hour long, currently every week. But it's not entirely clear if the maintenance will even have an impact, because apparently, last time nobody complained. But there have been interruptions which caused stale clients in the last weeks, so it's difficult to predict. They mainly use rbd and CephFS for k8s clusters, but so far I haven't heard about rbd issues during this maintenance windows. They have grafana showing a drop of many MDS sessions when the network is interrupted, I think from around 130 active sessions to around 30. So not all sessions were dropped. After the maintenance, they failed the MDS and the number of sessions was restored. Since they don't have access to the k8s clusters themselves, they can't do much on that side. We're still wondering if a MDS failover is really necessary or if anything on the client side could be done. But I only have very limited details on this. The MDS log (I don't have a copy) shows that the session drops are caused by the client evictions. Do you think it could make sense to disable client eviction/blocklisting only during this maintenance window? Or can that be dangerous because we can't predict which clients will actually be interrupted and how k8s will handle the returning clients if they won't be evicted? Thanks Eugen Zitat von Dan van der Ster : Hi Eugene, Disabling blocklisting on eviction is a pretty standard config. In my experience it allows clients resume their session cleanly without needing a remount. There's docs about this here: https://docs.ceph.com/en/latest/cephfs/eviction/#advanced-configuring-blocklisting I don't have a good feeling if this will be useful for your network intervention though... What are you trying to achieve? How long will clients be unreachable? Cheers, Dan -- Dan van der Ster CTO@CLYSO & CEC Member On Thu, Nov 21, 2024, 10:15 Eugen Block wrote: Hi, can anyone share some experience with these two configs? ceph config get mds mds_session_blocklist_on_timeout true ceph config get mds mds_session_blocklist_on_evict true If there's some network maintenance going on and the client connection is interrupted, could it help to disable evicting and blocklisting MDS clients? And what risks should we be aware of if we tried that? We're not entirely sure yet if this could be a reasonable approach, but we're trying to figure out how to make network maintenance less painful for clients. I'm also looking at some other possible configs, but let's start with these two first. Any comments would be appreciated! Thanks! Eugen ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
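If the decision is to try it, toggling the two options around a maintenance window is only a few commands; a sketch using the option names discussed in this thread, with the filesystem name as a placeholder. Whether this is actually safe for the k8s clients is exactly the open question above.

# before the window: don't blocklist clients that time out or get evicted
ceph config set mds mds_session_blocklist_on_timeout false
ceph config set mds mds_session_blocklist_on_evict false

# after the window: drop the overrides to return to the defaults (true)
ceph config rm mds mds_session_blocklist_on_timeout
ceph config rm mds mds_session_blocklist_on_evict

# optionally, the failover that restored the session count last time
ceph mds fail <fs_name>:0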
[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing
I'm not aware of any hard limit for the number of Filesystems, but that doesn't really mean very much. IIRC, last week during a Clyso talk at Eventbrite I heard someone say that they deployed around 200 Filesystems or so, I don't remember if it was a production environment or just a lab environment. I assume that you would probably be limited by the number of OSDs/PGs rather than by the number of Filesystems, 200 Filesystems require at least 400 pools. But maybe someone else has more experience in scaling CephFS that way. What we did was to scale the number of active MDS daemons for one CephFS. I believe in the end the customer had 48 MDS daemons on three MDS servers, 16 of them were active with directory pinning, at that time they had 16 standby-replay and 16 standby daemons. But it turned out that standby-replay didn't help their use case, so we disabled standby-replay. Can you show the entire 'ceph fs status' output? Any maybe also 'ceph fs dump'? Zitat von Александр Руденко : Just for testing purposes, have you tried pinning rank 1 to some other directory? Does it still break the CephFS if you stop it? Yes, nothing changed. It's no problem that FS hangs when one of the ranks goes down, we will have standby-reply for all ranks. I don't like that rank which is not pinned to some dir handled some io of this dir or from clients which work with this dir. I mean that I can't robustly and fully separate client IO by ranks. Would it be an option to rather use multiple Filesystems instead of multi-active for one CephFS? Yes, it's an option. But it is much more complicated in our case. Btw, do you know how many different FS can be created in one cluster? Maybe you know some potential problems with 100-200 FSs in one cluster? ср, 20 нояб. 2024 г. в 17:50, Eugen Block : Ah, I misunderstood, I thought you wanted an even distribution across both ranks. Just for testing purposes, have you tried pinning rank 1 to some other directory? Does it still break the CephFS if you stop it? I'm not sure if you can prevent rank 1 from participating, I haven't looked into all the configs in quite a while. Would it be an option to rather use multiple Filesystems instead of multi-active for one CephFS? Zitat von Александр Руденко : > No it's not a typo. It's misleading example) > > dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work without > rank 1. > rank 1 is used for something when I work with this dirs. > > ceph 16.2.13, metadata balancer and policy based balancing not used. > > ср, 20 нояб. 2024 г. в 16:33, Eugen Block : > >> Hi, >> >> > After pinning: >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1 >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2 >> >> is this a typo? If not, you did pin both directories to the same rank. >> >> Zitat von Александр Руденко : >> >> > Hi, >> > >> > I try to distribute all top level dirs in CephFS by different MDS ranks. >> > I have two active MDS with rank *0* and *1 *and I have 2 top dirs like >> > */dir1* and* /dir2*. >> > >> > After pinning: >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1 >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2 >> > >> > I can see next INOS and DNS distribution: >> > RANK STATE MDS ACTIVITY DNSINOS DIRS CAPS >> > 0active c Reqs:127 /s 12.6k 12.5k 333505 >> > 1active b Reqs:11 /s21 24 19 1 >> > >> > When I write to dir1 I can see a small amount on Reqs: in rank 1. 
>> > >> > Events in journal of MDS with rank 1: >> > cephfs-journal-tool --rank=fs1:1 event get list >> > >> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE: (scatter_writebehind) >> > A2037D53 >> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION: () >> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE: (lock inest accounted >> > scatter stat update) >> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION: () >> > 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION: () >> > 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION: () >> > 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION: () >> > 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION: () >> > 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION: () >> > 2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT: () >> > di1/A2037D53 >> > 2024-11-20T12:30:46.909621+0300 0xc5d96f SESSION: () >> > 2024-11-20T12:30:47.908050+0300 0xc5db13 SESSION: () >> > 2024-11-20T12:32:46.907649+0300 0xc5dba0 SESSION: () >> > 2024-11-20T12:32:47.905962+0300 0xc5dd44 SESSION: () >> > 2024-11-20T12:34:44.349348+0300 0xc5ddd1 SESSIONS: () >> > >> > But the main problem, when I stop MDS rank 1 (without any kind of >> standby) >> > - FS hangs for all actions. >> > Is this correct? Is it possible to completely exclude rank 1 from >> > processing dir1 and not stop io when rank 1 goes down? >> > ___ >> > ceph-users mailing list -- ceph-users@ceph.io >> > To unsubscribe send an ema
[ceph-users] Re: Crush rule examples
Hi Frank, thanks a lot for the hint, and I have read the documentation about this. What is not clear to me is this: == snip The first category of these failures that we will discuss involves inconsistent networks -- if there is a netsplit (a disconnection between two servers that splits the network into two pieces), Ceph might be unable to mark OSDs down and remove them from the acting PG sets. == snip Why is Ceph not able to mark OSDs down, and why is it unclear whether or not it is able to do so ("might")? Cheers Andre Am 20.11.24 um 12:23 schrieb Frank Schilder: Hi Andre, I think what you really want to look at is stretch mode. There have been long discussions on this list why a crush rule with rep 4 and 2 copies per DC will not handle a DC failure as expected. Stretch mode will make sure writes happen in a way that prevents split brain scenarios. Hand-crafted crush rules for this purpose require 3 or more DCs. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Janne Johansson Sent: Wednesday, November 20, 2024 11:30 AM To: Andre Tann Cc: ceph-users@ceph.io Subject: [ceph-users] Re: Crush rule examples Sorry, sent too early. So here we go again: My setup looks like this: DC1 node01 node02 node03 node04 node05 DC2 node06 node07 node08 node09 node10 I want a replicated pool with size=4. Two copies should go in each DC, and then no two copies on a single node. How can I describe this in a crush rule? This post seem to show that, except they have their root named "nvme" and they split on rack and not dc, but that is not important. https://unix.stackexchange.com/questions/781250/ceph-crush-rules-explanation-for-multiroom-racks-setup with the answer at the bottom: for example this should work as well, to have 4 replicas in total, distributed across two racks: step take default class nvme step choose firstn 2 type rack step chooseleaf firstn 2 type host -- May the most significant bit of your life be positive. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io -- Andre Tann ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
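The rule from that answer, adapted to the datacenter layout above (two DC buckets under the default root instead of racks), might look like the sketch below. It assumes the hosts are grouped under buckets of type datacenter in the CRUSH tree and that the pool uses size=4; the rule name, id and pool name are placeholders. And, as Frank points out, placement alone does not give the split-brain protection of stretch mode.

rule replicated_two_dc {
    id 10
    type replicated
    step take default
    step choose firstn 2 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}

# rough workflow to add it by editing the CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# paste the rule into crushmap.txt, then recompile and inject it
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
ceph osd pool set <pool> crush_rule replicated_two_dc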
[ceph-users] Re: Crush rule examples
Den tors 21 nov. 2024 kl 09:45 skrev Andre Tann : > Hi Frank, > thanks a lot for the hint, and I have read the documentation about this. > What is not clear to me is this: > > == snip > The first category of these failures that we will discuss involves > inconsistent networks -- if there is a netsplit (a disconnection between > two servers that splits the network into two pieces), Ceph might be > unable to mark OSDs down and remove them from the acting PG sets. > == snip > > Why is Ceph not able to mark OSDs down, and why is it unclear whether or > not it is able to do so ("might")? I think designs with 2 DCs usually have one or two mons per DC, and then a third/fifth mon in a (small) third site so it can arbitrate which side is up and which isn't. OSDs report to each other but also to mons about their existence. -- May the most significant bit of your life be positive. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Crush rule examples
Am 21.11.24 um 10:56 schrieb Janne Johansson: == snip The first category of these failures that we will discuss involves inconsistent networks -- if there is a netsplit (a disconnection between two servers that splits the network into two pieces), Ceph might be unable to mark OSDs down and remove them from the acting PG sets. == snip Why is Ceph not able to mark OSDs down, and why is it unclear whether or not it is able to do so ("might")? I think designs with 2 DCs usually have one or two mons per DC, and then a third/fifth mon in a (small) third site so it can arbitrate which side is up and which isn't. OSDs report to each other but also to mons about their existence. Yes absolutely, you need another qdevice/witness/mon... in a third location for the quorum, and my setup will have that. But still I don't see why Ceph should not be able to mark an OSD down if one site went down. -- Andre Tann ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
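If the third site hosts a monitor anyway, stretch mode is the mechanism Frank referred to earlier in the thread. The commands below follow the stretch cluster documentation, but the monitor names, datacenter names and the name of the stretch CRUSH rule are placeholders, so treat this as a sketch rather than a recipe.

# monitors must use the connectivity election strategy
ceph mon set election_strategy connectivity

# tell each monitor where it lives; "e" is the tiebreaker in the third site
ceph mon set_location a datacenter=dc1
ceph mon set_location b datacenter=dc1
ceph mon set_location c datacenter=dc2
ceph mon set_location d datacenter=dc2
ceph mon set_location e datacenter=site3

# enable stretch mode with a replicated CRUSH rule that spans both DCs
ceph mon enable_stretch_mode e stretch_rule datacenter

In stretch mode the monitors (including the tiebreaker) decide which DC keeps serving IO after a netsplit, which is the scenario the quoted documentation paragraph is about.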
[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing
They actually did have problems after standby-replay daemons took over as active daemons. After each failover they had to clean up some stale processes (or something like that). I'm not sure who recommended it, probably someone from SUSE engineering, but we switched off standby-replay and then the failover only took a minute or so, without anything to clean up. With hot standby it took several minutes (and cleaning afterwards). But the general recommendation is to not use standby-replay anyway, so we follow(ed) that. But it still might be useful in some scenarios, so proper testing is necessary. Zitat von Александр Руденко : IRC, last week during a Clyso talk at Eventbrite I heard someone say that they deployed around 200 Filesystems or so, I don't remember if it was a production environment or just a lab environment Interesting, thanks! I assume that you would probably be limited by the number of OSDs/PGs rather than by the number of Filesystems, 200 Filesystems require at least 400 pools. Sure, it's clear, thanks. But it turned out that standby-replay didn't help their use case, so we disabled standby-replay. Interesting. There were some problems with standby-replay or they just do not need "hot" standby? чт, 21 нояб. 2024 г. в 11:36, Eugen Block : I'm not aware of any hard limit for the number of Filesystems, but that doesn't really mean very much. IIRC, last week during a Clyso talk at Eventbrite I heard someone say that they deployed around 200 Filesystems or so, I don't remember if it was a production environment or just a lab environment. I assume that you would probably be limited by the number of OSDs/PGs rather than by the number of Filesystems, 200 Filesystems require at least 400 pools. But maybe someone else has more experience in scaling CephFS that way. What we did was to scale the number of active MDS daemons for one CephFS. I believe in the end the customer had 48 MDS daemons on three MDS servers, 16 of them were active with directory pinning, at that time they had 16 standby-replay and 16 standby daemons. But it turned out that standby-replay didn't help their use case, so we disabled standby-replay. Can you show the entire 'ceph fs status' output? Any maybe also 'ceph fs dump'? Zitat von Александр Руденко : >> >> Just for testing purposes, have you tried pinning rank 1 to some other >> directory? Does it still break the CephFS if you stop it? > > > Yes, nothing changed. > > It's no problem that FS hangs when one of the ranks goes down, we will have > standby-reply for all ranks. I don't like that rank which is not pinned to > some dir handled some io of this dir or from clients which work with this > dir. > I mean that I can't robustly and fully separate client IO by ranks. > > Would it be an option to rather use multiple Filesystems instead of >> multi-active for one CephFS? > > > Yes, it's an option. But it is much more complicated in our case. Btw, do > you know how many different FS can be created in one cluster? Maybe you > know some potential problems with 100-200 FSs in one cluster? > > ср, 20 нояб. 2024 г. в 17:50, Eugen Block : > >> Ah, I misunderstood, I thought you wanted an even distribution across >> both ranks. >> Just for testing purposes, have you tried pinning rank 1 to some other >> directory? Does it still break the CephFS if you stop it? I'm not sure >> if you can prevent rank 1 from participating, I haven't looked into >> all the configs in quite a while. 
Would it be an option to rather use >> multiple Filesystems instead of multi-active for one CephFS? >> >> Zitat von Александр Руденко : >> >> > No it's not a typo. It's misleading example) >> > >> > dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work >> without >> > rank 1. >> > rank 1 is used for something when I work with this dirs. >> > >> > ceph 16.2.13, metadata balancer and policy based balancing not used. >> > >> > ср, 20 нояб. 2024 г. в 16:33, Eugen Block : >> > >> >> Hi, >> >> >> >> > After pinning: >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1 >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2 >> >> >> >> is this a typo? If not, you did pin both directories to the same rank. >> >> >> >> Zitat von Александр Руденко : >> >> >> >> > Hi, >> >> > >> >> > I try to distribute all top level dirs in CephFS by different MDS >> ranks. >> >> > I have two active MDS with rank *0* and *1 *and I have 2 top dirs like >> >> > */dir1* and* /dir2*. >> >> > >> >> > After pinning: >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1 >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2 >> >> > >> >> > I can see next INOS and DNS distribution: >> >> > RANK STATE MDS ACTIVITY DNSINOS DIRS CAPS >> >> > 0active c Reqs:127 /s 12.6k 12.5k 333505 >> >> > 1active b Reqs:11 /s21 24 19 1 >> >> > >> >> > When I write to dir1 I can see a small amount on Reqs: in rank 1. >> >> > >> >> > E
[ceph-users] Lifetime for ceph
The octopus repos disappeared a couple of days ago - no argument with that given it's marked as out of support. However, I see from https://docs.ceph.com/en/latest/releases/ that quincy is also marked as out of support, but currently the repos are still there. Is there any guesstimate of when the quincy repos might disappear, please? Many thanks Steve Brasier http://stackhpc.com/ Please note I work Tuesday to Friday. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing
I concur with that observation. Standby-replay seems a useless mode of operation. The replay daemons use a lot more RAM than the active ones and the fail-over took ages. After switching to standby-only fail-over is usually 5-20s with the lower end being more common. We have 8 active and 4 standby daemons configured. Bets regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Thursday, November 21, 2024 3:55 PM To: Александр Руденко Cc: ceph-users@ceph.io Subject: [ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing They actually did have problems after standby-replay daemons took over as active daemons. After each failover they had to clean up some stale processes (or something like that). I'm not sure who recommended it, probably someone from SUSE engineering, but we switched off standby-replay and then the failover only took a minute or so, without anything to clean up. With hot standby it took several minutes (and cleaning afterwards). But the general recommendation is to not use standby-replay anyway, so we follow(ed) that. But it still might be useful in some scenarios, so proper testing is necessary. Zitat von Александр Руденко : >> >> IRC, last week during a Clyso >> talk at Eventbrite I heard someone say that they deployed around 200 >> Filesystems or so, I don't remember if it was a production environment >> or just a lab environment > > > Interesting, thanks! > > I assume that you would probably be limited >> by the number of OSDs/PGs rather than by the number of Filesystems, >> 200 Filesystems require at least 400 pools. > > > Sure, it's clear, thanks. > > But it turned out that standby-replay didn't >> help their use case, so we disabled standby-replay. > > > Interesting. There were some problems with standby-replay or they just do > not need "hot" standby? > > чт, 21 нояб. 2024 г. в 11:36, Eugen Block : > >> I'm not aware of any hard limit for the number of Filesystems, but >> that doesn't really mean very much. IIRC, last week during a Clyso >> talk at Eventbrite I heard someone say that they deployed around 200 >> Filesystems or so, I don't remember if it was a production environment >> or just a lab environment. I assume that you would probably be limited >> by the number of OSDs/PGs rather than by the number of Filesystems, >> 200 Filesystems require at least 400 pools. But maybe someone else has >> more experience in scaling CephFS that way. What we did was to scale >> the number of active MDS daemons for one CephFS. I believe in the end >> the customer had 48 MDS daemons on three MDS servers, 16 of them were >> active with directory pinning, at that time they had 16 standby-replay >> and 16 standby daemons. But it turned out that standby-replay didn't >> help their use case, so we disabled standby-replay. >> >> Can you show the entire 'ceph fs status' output? Any maybe also 'ceph >> fs dump'? >> >> Zitat von Александр Руденко : >> >> >> >> >> Just for testing purposes, have you tried pinning rank 1 to some other >> >> directory? Does it still break the CephFS if you stop it? >> > >> > >> > Yes, nothing changed. >> > >> > It's no problem that FS hangs when one of the ranks goes down, we will >> have >> > standby-reply for all ranks. I don't like that rank which is not pinned >> to >> > some dir handled some io of this dir or from clients which work with this >> > dir. >> > I mean that I can't robustly and fully separate client IO by ranks. 
>> > >> > Would it be an option to rather use multiple Filesystems instead of >> >> multi-active for one CephFS? >> > >> > >> > Yes, it's an option. But it is much more complicated in our case. Btw, do >> > you know how many different FS can be created in one cluster? Maybe you >> > know some potential problems with 100-200 FSs in one cluster? >> > >> > ср, 20 нояб. 2024 г. в 17:50, Eugen Block : >> > >> >> Ah, I misunderstood, I thought you wanted an even distribution across >> >> both ranks. >> >> Just for testing purposes, have you tried pinning rank 1 to some other >> >> directory? Does it still break the CephFS if you stop it? I'm not sure >> >> if you can prevent rank 1 from participating, I haven't looked into >> >> all the configs in quite a while. Would it be an option to rather use >> >> multiple Filesystems instead of multi-active for one CephFS? >> >> >> >> Zitat von Александр Руденко : >> >> >> >> > No it's not a typo. It's misleading example) >> >> > >> >> > dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work >> >> without >> >> > rank 1. >> >> > rank 1 is used for something when I work with this dirs. >> >> > >> >> > ceph 16.2.13, metadata balancer and policy based balancing not used. >> >> > >> >> > ср, 20 нояб. 2024 г. в 16:33, Eugen Block : >> >> > >> >> >> Hi, >> >> >> >> >> >> > After pinning: >> >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1 >> >> >> > setfattr -n ceph.dir.pin
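For reference, turning standby-replay off for a filesystem (as both posters above ended up doing) is a single flag; the filesystem name here is a placeholder.

# disable standby-replay; the standby-replay daemons should return to the normal standby pool
ceph fs set <fs_name> allow_standby_replay false

# check the daemon states afterwards
ceph fs status <fs_name>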
[ceph-users] MDS blocklist/evict clients during network maintenance
Hi, can anyone share some experience with these two configs? ceph config get mds mds_session_blocklist_on_timeout true ceph config get mds mds_session_blocklist_on_evict true If there's some network maintenance going on and the client connection is interrupted, could it help to disable evicting and blocklisting MDS clients? And what risks should we be aware of if we tried that? We're not entirely sure yet if this could be a reasonable approach, but we're trying to figure out how to make network maintenance less painful for clients. I'm also looking at some other possible configs, but let's start with these two first. Any comments would be appreciated! Thanks! Eugen ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: CephFS subvolumes not inheriting ephemeral distributed pin
On Wed, Nov 20, 2024 at 2:05 PM Rajmohan Ramamoorthy wrote: > > Hi Patrick, > > A few other follow-up questions. > > Is directory fragmentation applicable only when multiple active MDSs are > enabled for a Ceph FS? It has no effect when applied with only one rank (active). It can be useful to have it already set in case you increase max_mds. > Will directory fragmentation and distribution of fragments among active MDSs > happen if we turn off the balancer for a Ceph FS volume `ceph fs set midline-a > balance_automate false` ? In Squid, the CephFS automatic metadata load > (sometimes called “default”) balancer is now disabled by default. > (https://docs.ceph.com/en/latest/releases/squid/) Yes. > Is there a way for us to ensure that the directory tree of a Subvolume > (Kubernetes PV) is part of the same fragment and handled by a single MDS so > that a client's operations are handled by one MDS? A subvolume would not be split across two MDSs. > What is the trigger to start fragmenting directories within a Subvolumegroup? You don't need to do anything more than set the distributed ephemeral pin. > With the `balance_automate` set to false and `ephemeral distributed pin` > enabled for a Subvolumegroup, can we expect (almost) equal distribution of > Subvolumes (Kubernetes PVs) amongst the active MDS daemons and stable > operation without hotspot migrations? Yes. -- Patrick Donnelly, Ph.D. He / Him / His Red Hat Partner Engineer IBM, Inc. GPG: 19F28A586F808C2402351B93C3301A3E258DD79D ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
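As a concrete illustration of the answers above, the balancer switch and the distributed ephemeral pin from this thread would be set roughly as follows. The volume name midline-a is taken from the question; the subvolume group name and the client mount path are placeholders.

# keep the automatic metadata balancer off (the default since Squid)
ceph fs set midline-a balance_automate false

# distributed ephemeral pin on the subvolume group holding the Kubernetes PVs
ceph fs subvolumegroup pin midline-a <group_name> distributed 1

# equivalently, via the vxattr on a client mount of the volume
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/midline-a/volumes/<group_name>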
[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing
Hi Eugen, During the talk you've mentioned, Dan said there's a hard coded limit of 256 MDSs per cluster. So with one active and one standby-ish MDSs per filesystem, that would be 128 filesystems at max per cluster. Mark said he got 120 but.. things start to get wacky by 80. :-) More fun to come, for sure. Cheers, Frédéric. [1] https://youtu.be/qiCE1Ifws80?t=2602 - Le 21 Nov 24, à 9:36, Eugen Block ebl...@nde.ag a écrit : > I'm not aware of any hard limit for the number of Filesystems, but > that doesn't really mean very much. IIRC, last week during a Clyso > talk at Eventbrite I heard someone say that they deployed around 200 > Filesystems or so, I don't remember if it was a production environment > or just a lab environment. I assume that you would probably be limited > by the number of OSDs/PGs rather than by the number of Filesystems, > 200 Filesystems require at least 400 pools. But maybe someone else has > more experience in scaling CephFS that way. What we did was to scale > the number of active MDS daemons for one CephFS. I believe in the end > the customer had 48 MDS daemons on three MDS servers, 16 of them were > active with directory pinning, at that time they had 16 standby-replay > and 16 standby daemons. But it turned out that standby-replay didn't > help their use case, so we disabled standby-replay. > > Can you show the entire 'ceph fs status' output? Any maybe also 'ceph > fs dump'? > > Zitat von Александр Руденко : > >>> >>> Just for testing purposes, have you tried pinning rank 1 to some other >>> directory? Does it still break the CephFS if you stop it? >> >> >> Yes, nothing changed. >> >> It's no problem that FS hangs when one of the ranks goes down, we will have >> standby-reply for all ranks. I don't like that rank which is not pinned to >> some dir handled some io of this dir or from clients which work with this >> dir. >> I mean that I can't robustly and fully separate client IO by ranks. >> >> Would it be an option to rather use multiple Filesystems instead of >>> multi-active for one CephFS? >> >> >> Yes, it's an option. But it is much more complicated in our case. Btw, do >> you know how many different FS can be created in one cluster? Maybe you >> know some potential problems with 100-200 FSs in one cluster? >> >> ср, 20 нояб. 2024 г. в 17:50, Eugen Block : >> >>> Ah, I misunderstood, I thought you wanted an even distribution across >>> both ranks. >>> Just for testing purposes, have you tried pinning rank 1 to some other >>> directory? Does it still break the CephFS if you stop it? I'm not sure >>> if you can prevent rank 1 from participating, I haven't looked into >>> all the configs in quite a while. Would it be an option to rather use >>> multiple Filesystems instead of multi-active for one CephFS? >>> >>> Zitat von Александр Руденко : >>> >>> > No it's not a typo. It's misleading example) >>> > >>> > dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work >>> without >>> > rank 1. >>> > rank 1 is used for something when I work with this dirs. >>> > >>> > ceph 16.2.13, metadata balancer and policy based balancing not used. >>> > >>> > ср, 20 нояб. 2024 г. в 16:33, Eugen Block : >>> > >>> >> Hi, >>> >> >>> >> > After pinning: >>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1 >>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2 >>> >> >>> >> is this a typo? If not, you did pin both directories to the same rank. 
>>> >> >>> >> Zitat von Александр Руденко : >>> >> >>> >> > Hi, >>> >> > >>> >> > I try to distribute all top level dirs in CephFS by different MDS >>> ranks. >>> >> > I have two active MDS with rank *0* and *1 *and I have 2 top dirs like >>> >> > */dir1* and* /dir2*. >>> >> > >>> >> > After pinning: >>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1 >>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2 >>> >> > >>> >> > I can see next INOS and DNS distribution: >>> >> > RANK STATE MDS ACTIVITY DNSINOS DIRS CAPS >>> >> > 0active c Reqs:127 /s 12.6k 12.5k 333505 >>> >> > 1active b Reqs:11 /s21 24 19 1 >>> >> > >>> >> > When I write to dir1 I can see a small amount on Reqs: in rank 1. >>> >> > >>> >> > Events in journal of MDS with rank 1: >>> >> > cephfs-journal-tool --rank=fs1:1 event get list >>> >> > >>> >> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE: >>> (scatter_writebehind) >>> >> > A2037D53 >>> >> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION: () >>> >> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE: (lock inest >>> accounted >>> >> > scatter stat update) >>> >> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION: () >>> >> > 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION: () >>> >> > 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION: () >>> >> > 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION: () >>> >> > 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION: () >>> >> > 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION:
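In case anyone wants to try the many-filesystems approach, each additional filesystem needs its own metadata and data pool plus at least one MDS; a rough sketch, with pool and filesystem names as placeholders:

# explicit pools plus filesystem creation
ceph osd pool create fs2_metadata
ceph osd pool create fs2_data
ceph fs new fs2 fs2_metadata fs2_data

# or via the volumes interface, which creates the pools (and, under cephadm, an MDS service) for you
ceph fs volume create fs2

With one active and one standby MDS each, the 256-MDS limit mentioned above is what caps the filesystem count in practice.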
[ceph-users] 2024-11-21 Perf meeting cancelled!
Hi folks, the perf meeting will be cancelled today, as Mark is flying back from a conference! Thanks, Matt ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing
> > IRC, last week during a Clyso > talk at Eventbrite I heard someone say that they deployed around 200 > Filesystems or so, I don't remember if it was a production environment > or just a lab environment Interesting, thanks! I assume that you would probably be limited > by the number of OSDs/PGs rather than by the number of Filesystems, > 200 Filesystems require at least 400 pools. Sure, it's clear, thanks. But it turned out that standby-replay didn't > help their use case, so we disabled standby-replay. Interesting. There were some problems with standby-replay or they just do not need "hot" standby? чт, 21 нояб. 2024 г. в 11:36, Eugen Block : > I'm not aware of any hard limit for the number of Filesystems, but > that doesn't really mean very much. IIRC, last week during a Clyso > talk at Eventbrite I heard someone say that they deployed around 200 > Filesystems or so, I don't remember if it was a production environment > or just a lab environment. I assume that you would probably be limited > by the number of OSDs/PGs rather than by the number of Filesystems, > 200 Filesystems require at least 400 pools. But maybe someone else has > more experience in scaling CephFS that way. What we did was to scale > the number of active MDS daemons for one CephFS. I believe in the end > the customer had 48 MDS daemons on three MDS servers, 16 of them were > active with directory pinning, at that time they had 16 standby-replay > and 16 standby daemons. But it turned out that standby-replay didn't > help their use case, so we disabled standby-replay. > > Can you show the entire 'ceph fs status' output? Any maybe also 'ceph > fs dump'? > > Zitat von Александр Руденко : > > >> > >> Just for testing purposes, have you tried pinning rank 1 to some other > >> directory? Does it still break the CephFS if you stop it? > > > > > > Yes, nothing changed. > > > > It's no problem that FS hangs when one of the ranks goes down, we will > have > > standby-reply for all ranks. I don't like that rank which is not pinned > to > > some dir handled some io of this dir or from clients which work with this > > dir. > > I mean that I can't robustly and fully separate client IO by ranks. > > > > Would it be an option to rather use multiple Filesystems instead of > >> multi-active for one CephFS? > > > > > > Yes, it's an option. But it is much more complicated in our case. Btw, do > > you know how many different FS can be created in one cluster? Maybe you > > know some potential problems with 100-200 FSs in one cluster? > > > > ср, 20 нояб. 2024 г. в 17:50, Eugen Block : > > > >> Ah, I misunderstood, I thought you wanted an even distribution across > >> both ranks. > >> Just for testing purposes, have you tried pinning rank 1 to some other > >> directory? Does it still break the CephFS if you stop it? I'm not sure > >> if you can prevent rank 1 from participating, I haven't looked into > >> all the configs in quite a while. Would it be an option to rather use > >> multiple Filesystems instead of multi-active for one CephFS? > >> > >> Zitat von Александр Руденко : > >> > >> > No it's not a typo. It's misleading example) > >> > > >> > dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work > >> without > >> > rank 1. > >> > rank 1 is used for something when I work with this dirs. > >> > > >> > ceph 16.2.13, metadata balancer and policy based balancing not used. > >> > > >> > ср, 20 нояб. 2024 г. 
в 16:33, Eugen Block : > >> > > >> >> Hi, > >> >> > >> >> > After pinning: > >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1 > >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2 > >> >> > >> >> is this a typo? If not, you did pin both directories to the same > rank. > >> >> > >> >> Zitat von Александр Руденко : > >> >> > >> >> > Hi, > >> >> > > >> >> > I try to distribute all top level dirs in CephFS by different MDS > >> ranks. > >> >> > I have two active MDS with rank *0* and *1 *and I have 2 top dirs > like > >> >> > */dir1* and* /dir2*. > >> >> > > >> >> > After pinning: > >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1 > >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2 > >> >> > > >> >> > I can see next INOS and DNS distribution: > >> >> > RANK STATE MDS ACTIVITY DNSINOS DIRS CAPS > >> >> > 0active c Reqs:127 /s 12.6k 12.5k 333505 > >> >> > 1active b Reqs:11 /s21 24 19 1 > >> >> > > >> >> > When I write to dir1 I can see a small amount on Reqs: in rank 1. > >> >> > > >> >> > Events in journal of MDS with rank 1: > >> >> > cephfs-journal-tool --rank=fs1:1 event get list > >> >> > > >> >> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE: > >> (scatter_writebehind) > >> >> > A2037D53 > >> >> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION: () > >> >> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE: (lock inest > >> accounted > >> >> > scatter stat update) > >> >> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION:
[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing
Hi, Frank, thanks! it might be that you are expecting too much from ceph. The design of the > filesystem was not some grand plan with every detail worked out. It was > more the classic evolutionary approach, something working was screwed on > top of rados and things evolved from there on. There was some hope that it's just a configuration problem in my environment) Specifically rank 0 is critical. Yes, because we can't re-pin the root of FS to some other rank. It was clear that rank 0 is critical. But unfortunately, as we can see all ranks are critical for stable work in any directories. чт, 21 нояб. 2024 г. в 14:46, Frank Schilder : > Hi Alexander, > > it might be that you are expecting too much from ceph. The design of the > filesystem was not some grand plan with every detail worked out. It was > more the classic evolutionary approach, something working was screwed on > top of rados and things evolved from there on. > > It is possible that the code and pin-seperation is not as clean as one > would imagine. Here is what I observe before and after pinning everything > explicitly: > > - before pinning: > * high MDS load for no apparent reason - the balancer was just going in > circles > * stopping an MDS would besically bring all IO down > > - after pinning: > * low MDS load, better user performance, much faster restarts > * stopping an MDS does not kill all IO immediately, some IO continues, > however, eventually every client gets stuck > > There is apparently still communication between all ranks about all > clients and it is a bit annoying that some of this communication is > blocking. Not sure if it has to be blocking or if one could make it > asynchronous requests to the down rank. My impression is that ceph > internals are rather bad at making stuff asynchronous. So if something in > the MDS cluster is not healthy sooner or later IO will stop waiting for > some blocking request to the unhealthy MDS. There seems to be no such thing > as IO on other healthy MDSes continues as usual. > > Specifically rank 0 is critical. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Eugen Block > Sent: Thursday, November 21, 2024 9:36 AM > To: Александр Руденко > Cc: ceph-users@ceph.io > Subject: [ceph-users] Re: [CephFS] Completely exclude some MDS rank from > directory processing > > I'm not aware of any hard limit for the number of Filesystems, but > that doesn't really mean very much. IIRC, last week during a Clyso > talk at Eventbrite I heard someone say that they deployed around 200 > Filesystems or so, I don't remember if it was a production environment > or just a lab environment. I assume that you would probably be limited > by the number of OSDs/PGs rather than by the number of Filesystems, > 200 Filesystems require at least 400 pools. But maybe someone else has > more experience in scaling CephFS that way. What we did was to scale > the number of active MDS daemons for one CephFS. I believe in the end > the customer had 48 MDS daemons on three MDS servers, 16 of them were > active with directory pinning, at that time they had 16 standby-replay > and 16 standby daemons. But it turned out that standby-replay didn't > help their use case, so we disabled standby-replay. > > Can you show the entire 'ceph fs status' output? Any maybe also 'ceph > fs dump'? > > Zitat von Александр Руденко : > > >> > >> Just for testing purposes, have you tried pinning rank 1 to some other > >> directory? Does it still break the CephFS if you stop it? 
> > > > > > Yes, nothing changed. > > > > It's no problem that FS hangs when one of the ranks goes down, we will > have > > standby-reply for all ranks. I don't like that rank which is not pinned > to > > some dir handled some io of this dir or from clients which work with this > > dir. > > I mean that I can't robustly and fully separate client IO by ranks. > > > > Would it be an option to rather use multiple Filesystems instead of > >> multi-active for one CephFS? > > > > > > Yes, it's an option. But it is much more complicated in our case. Btw, do > > you know how many different FS can be created in one cluster? Maybe you > > know some potential problems with 100-200 FSs in one cluster? > > > > ср, 20 нояб. 2024 г. в 17:50, Eugen Block : > > > >> Ah, I misunderstood, I thought you wanted an even distribution across > >> both ranks. > >> Just for testing purposes, have you tried pinning rank 1 to some other > >> directory? Does it still break the CephFS if you stop it? I'm not sure > >> if you can prevent rank 1 from participating, I haven't looked into > >> all the configs in quite a while. Would it be an option to rather use > >> multiple Filesystems instead of multi-active for one CephFS? > >> > >> Zitat von Александр Руденко : > >> > >> > No it's not a typo. It's misleading example) > >> > > >> > dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work >
[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing
> > Can you show the entire 'ceph fs status' output? Any maybe also 'ceph > fs dump'? Nothing special, just smoll test cluster. fs1 - 10 clients === RANK STATE MDS ACTIVITY DNSINOS DIRS CAPS 0active a Reqs:0 /s 18.7k 18.4k 351513 1active b Reqs:0 /s21 24 16 1 POOL TYPE USED AVAIL fs1_meta metadata 116M 3184G fs1_datadata23.8G 3184G STANDBY MDS c fs dump e48 enable_multiple, ever_enabled_multiple: 1,1 default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} legacy client fscid: 1 Filesystem 'fs1' (1) fs_name fs1 epoch 47 flags 12 created 2024-10-15T18:55:10.905035+0300 modified 2024-11-21T10:55:12.688598+0300 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 required_client_features {} last_failure 0 last_failure_osd_epoch 943 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2} max_mds 2 in 0,1 up {0=12200812,1=11974933} failed damaged stopped data_pools [7] metadata_pool 6 inline_data disabled balancer standby_count_wanted 1 [mds.a{0:12200812} state up:active seq 13 addr [v2: 10.7.1.115:6842/1955635987,v1:10.7.1.115:6843/1955635987] compat {c=[1],r=[1],i=[7ff]}] [mds.b{1:11974933} state up:active seq 5 addr [v2: 10.7.1.116:6840/536741454,v1:10.7.1.116:6841/536741454] compat {c=[1],r=[1],i=[7ff]}] Standby daemons: [mds.c{-1:11704322} state up:standby seq 1 addr [v2: 10.7.1.117:6848/84247504,v1:10.7.1.117:6849/84247504] compat {c=[1],r=[1],i=[7ff]}] чт, 21 нояб. 2024 г. в 11:36, Eugen Block : > I'm not aware of any hard limit for the number of Filesystems, but > that doesn't really mean very much. IIRC, last week during a Clyso > talk at Eventbrite I heard someone say that they deployed around 200 > Filesystems or so, I don't remember if it was a production environment > or just a lab environment. I assume that you would probably be limited > by the number of OSDs/PGs rather than by the number of Filesystems, > 200 Filesystems require at least 400 pools. But maybe someone else has > more experience in scaling CephFS that way. What we did was to scale > the number of active MDS daemons for one CephFS. I believe in the end > the customer had 48 MDS daemons on three MDS servers, 16 of them were > active with directory pinning, at that time they had 16 standby-replay > and 16 standby daemons. But it turned out that standby-replay didn't > help their use case, so we disabled standby-replay. > > Can you show the entire 'ceph fs status' output? Any maybe also 'ceph > fs dump'? > > Zitat von Александр Руденко : > > >> > >> Just for testing purposes, have you tried pinning rank 1 to some other > >> directory? Does it still break the CephFS if you stop it? > > > > > > Yes, nothing changed. > > > > It's no problem that FS hangs when one of the ranks goes down, we will > have > > standby-reply for all ranks. I don't like that rank which is not pinned > to > > some dir handled some io of this dir or from clients which work with this > > dir. > > I mean that I can't robustly and fully separate client IO by ranks. > > > > Would it be an option to rather use multiple Filesystems instead of > >> multi-active for one CephFS? 
> > > > > > Yes, it's an option. But it is much more complicated in our case. Btw, do > > you know how many different FS can be created in one cluster? Maybe you > > know some potential problems with 100-200 FSs in one cluster? > > > > ср, 20 нояб. 2024 г. в 17:50, Eugen Block : > > > >> Ah, I misunderstood, I thought you wanted an even distribution across > >> both ranks. > >> Just for testing purposes, have you tried pinning rank 1 to some other > >> directory? Does it still break the CephFS if you stop it? I'm not sure > >> if you can prevent rank 1 from participating, I haven't looked into > >> all the configs in quite a while. Would it be an option to rather use > >> multiple Filesystems instead of multi-active for one CephFS? > >> > >> Zitat von Александр Руденко : > >> > >> > No it's not a typo. It's misleading example) > >> > > >> > dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work > >> without > >> > rank 1. > >> > rank 1 is used for something when I work with this dirs. > >> > > >> > ceph 16.2.13, metadata balancer and policy based balancing not used. > >> > > >> > ср, 20 нояб. 2024 г. в 16:33, Eugen Block : > >> > > >> >> Hi, > >> >> > >> >> > After pinning: > >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1 > >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2 > >> >> > >> >> is this a typo? If not
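To see which rank is actually authoritative for dir1 and dir2 (and what, if anything, ends up on rank 1), the subtree map of each MDS can be dumped; a sketch using the daemon names a and b from the status output above. The same command is available via "ceph daemon mds.<name> get subtrees" on the MDS host, and the exact JSON field names may differ slightly between releases.

ceph tell mds.a get subtrees | grep -E '"path"|"export_pin"'
ceph tell mds.b get subtrees | grep -E '"path"|"export_pin"'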
[ceph-users] Re: MDS blocklist/evict clients during network maintenance
Hi Eugen, Disabling blocklisting on eviction is a pretty standard config. In my experience it allows clients to resume their sessions cleanly without needing a remount. There are docs about this here: https://docs.ceph.com/en/latest/cephfs/eviction/#advanced-configuring-blocklisting I don't have a good feeling about whether this will be useful for your network intervention though... What are you trying to achieve? How long will clients be unreachable? Cheers, Dan -- Dan van der Ster CTO@CLYSO & CEC Member On Thu, Nov 21, 2024, 10:15 Eugen Block wrote: > Hi, > > can anyone share some experience with these two configs? > > ceph config get mds mds_session_blocklist_on_timeout > true > ceph config get mds mds_session_blocklist_on_evict > true > > If there's some network maintenance going on and the client connection > is interrupted, could it help to disable evicting and blocklisting MDS > clients? And what risks should we be aware of if we tried that? We're > not entirely sure yet if this could be a reasonable approach, but > we're trying to figure out how to make network maintenance less > painful for clients. > I'm also looking at some other possible configs, but let's start with > these two first. > > Any comments would be appreciated! > > Thanks! > Eugen > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
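The page linked above also covers the client side: with blocklisting disabled, clients have to be willing to resume a session the MDS has marked stale. For ceph-fuse/libcephfs clients that is a client option (kernel clients behave differently); a sketch of the ceph.conf snippet on the clients, with the option name taken from that documentation page:

[client]
    # allow the client to reconnect a stale session instead of requiring a remount
    client_reconnect_stale = true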
[ceph-users] Re: multisite sync issue with bucket sync
hey Chris, On Wed, Nov 20, 2024 at 6:02 PM Christopher Durham wrote: > > Casey, > > OR, is there a way to continue on with new data syncing (incremental) as the > full sync catches up, as the full sync will take a long time, and no new > incremental data is being replicated. full sync walks through the entire bucket listing, so it will visit some new objects along the way. but where possible, multisite tries to prioritize sync of older data because minimizing the average time-to-sync is important for disaster recovery. if we tried to process both full and incremental at the same time, that would slow down the full sync and some objects could take far longer to replicate. it would also be less efficient overall, because longer full sync means more overlap and duplicated effort with incremental > > -Chris > > On Wednesday, November 20, 2024 at 03:30:40 PM MST, Christopher Durham > wrote: > > > Casey, > > Thanks for your response. So is there a way to abandon a full sync and just > move on with an incremental from the time you abandon the full sync? i'm afraid not. multisite tries very hard to maintain consistency between zones, so it's not easy to subvert that. 'radosgw-admin bucket sync init' is probably the only command that can modify bucket sync status > > -Chris > > On Wednesday, November 20, 2024 at 12:29:26 PM MST, Casey Bodley > wrote: > > > On Wed, Nov 20, 2024 at 2:10 PM Christopher Durham wrote: > > > > Ok, > > Source code review reveals that full sync is marker based and sync errors > > within a marker group *suggest* that data within the marker isre-checked, > > (I may be wrong about this, but that is consistent with my 304 errors > > below). I do however, have the folllowing question: > > Is there a way to otherwise abort a full sync of a bucket (as a result of > > radosgw-admin bucket sync init --bucket and bucket sync run (or > > restart of radosgw),and have it just do incremental sync from then on (yes, > > having the objects not be the same on both sides prior to the 'restart' of > > an incremental sync. > > Would radosgw-admin bucket sync disable --bucket followed by > > radosgw-admin bucket sync enable --bucket do this? Or would that > > do anotherfull sync and not an incremental? > > 'bucket sync enable' does start a new full sync (to catch objects that > were uploaded since 'bucket sync disable') > > > Thanks > > -Chris > > > >On Thursday, November 14, 2024 at 04:18:34 PM MST, Christopher Durham > > wrote: > > > > Hi, > > I have heard nothing on this, but have done some more research. > > Again, both sides of a multisite s3 configuration are ceph 18.2.4 on Rocky > > 9. > > For a given bucket, there are thousands of 'missing' objects. I did: > > radosgw-admin bucket sync init --bucket --src-zone > zone>sync starts after I restart a radosgw on the source zone that has a > > sync thread. > > But based on number and size of objects needing replication, it NEVER > > finishes, as more objects are created as I am going.I may need to increase > > the number of radosgw and or the sync threads. > > > > What I have discovered that if a radosgw on the side with missing objects > > is restarted, all sycing starts over!In other words, it starts polling each > > object, getting a 304 error in the radosgw log on the server on the > > multisite that has the missing objects.It *appears* to do this sequential > > object scan in lexographic order of object and/or prefix name, although I > > cannot be sure. > > > > So some questions: > > 1. 
Is there a recommendation/rule of thumb/formula for the number of > > radosgws/syncthreads/ etc based on number of objects, buckets, bandwidth, > > etc?2. Why does the syncing restart for a bucket when a radosgw is > > restarted? Is there a way to tell it to restart where it left off as > > opposed to starting over?There may be reasons to restart a bucket sync if a > > radosgw restarts, but there should be a way to checkpoint/force it to not > > restart/start where left off, etc.3. Is there a way to 'abort' the sync and > > cause the bucket to think it is up to date and only replicate new objects > > from the time it was marked up to date? > > Thanks for any information > > -Chris > > > > > > > >On Friday, November 8, 2024 at 03:45:05 PM MST, Christopher Durham > > wrote: > > > > > > I have a 2-site multisite configuration on cdnh 18.2.4 on EL9. > > After system updates, we discovered that a particular bucket had several > > thousand objects missing, which the other side had. Newly created objects > > were being replicated just fine. > > > > I decided to 'restart' syncing that bucket. Here is what I did > > On the side with misisng objects: > > > radosgw-admin bucket sync init --bucket --src-zone > > > > I restarted the radosgw set up to do the sync thread on the same zone as I > > ran the radosgw-admin command. > > > > Logs on the radosgw src-zone side show GETs with http code 200 for objects > > that do n
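For tracking where such a full sync stands (and whether errors are being recorded), the usual commands are roughly the following; bucket and zone names are placeholders, and the exact flag spelling (--source-zone vs. the --src-zone used above) is worth checking against radosgw-admin help on your release.

# per-bucket sync position relative to the source zone
radosgw-admin bucket sync status --bucket=<bucket> --source-zone=<zone>

# overall multisite status and recorded sync errors
radosgw-admin sync status
radosgw-admin sync error list

# the full-sync restart discussed in this thread
radosgw-admin bucket sync init --bucket=<bucket> --source-zone=<zone>
radosgw-admin bucket sync run --bucket=<bucket> --source-zone=<zone>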