[ceph-users] Re: multisite sync issue with bucket sync

2024-11-20 Thread Christopher Durham
 Casey,
Or, is there a way to continue on with new data syncing (incremental) while the 
full sync catches up? The full sync will take a long time, and meanwhile no new 
incremental data is being replicated.
-Chris

On Wednesday, November 20, 2024 at 03:30:40 PM MST, Christopher Durham 
 wrote:   

  Casey,
Thanks for your response. So is there a way to abandon a full sync and just 
move on with an incremental from the time you abandon the full sync?
-Chris

On Wednesday, November 20, 2024 at 12:29:26 PM MST, Casey Bodley 
 wrote:   

 On Wed, Nov 20, 2024 at 2:10 PM Christopher Durham  wrote:
>
>  Ok,
> Source code review reveals that full sync is marker based, and sync errors 
> within a marker group *suggest* that data within the marker is re-checked (I 
> may be wrong about this, but that is consistent with my 304 errors below). I 
> do, however, have the following question:
> Is there a way to otherwise abort a full sync of a bucket (started as a result 
> of radosgw-admin bucket sync init --bucket <bucket> and bucket sync run, or a 
> restart of radosgw), and have it just do incremental sync from then on (yes, 
> accepting that the objects will not be the same on both sides prior to the 
> 'restart' of an incremental sync)?
> Would radosgw-admin bucket sync disable --bucket <bucket> followed by 
> radosgw-admin bucket sync enable --bucket <bucket> do this? Or would that do 
> another full sync and not an incremental?

'bucket sync enable' does start a new full sync (to catch objects that
were uploaded since 'bucket sync disable')
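
For reference, a rough sketch of commands for watching where a bucket's sync
stands (bucket and zone names are placeholders; exact output varies by release):

radosgw-admin bucket sync status --bucket=<bucket>
radosgw-admin bucket sync markers --bucket=<bucket> --source-zone=<source zone>
radosgw-admin sync error list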

Thanks
> -Chris
>
>    On Thursday, November 14, 2024 at 04:18:34 PM MST, Christopher Durham 
> wrote:
>
>  Hi,
> I have heard nothing on this, but have done some more research.
> Again, both sides of a multisite s3 configuration are ceph 18.2.4 on Rocky 9.
> For a given bucket, there are thousands of 'missing' objects. I did:
> radosgw-admin bucket sync init --bucket <bucket> --src-zone <source zone>
> Sync starts after I restart a radosgw on the source zone that has a sync 
> thread.
> But based on the number and size of objects needing replication, it NEVER 
> finishes, as more objects are created as I am going. I may need to increase 
> the number of radosgws and/or the sync threads.
>
> What I have discovered is that if a radosgw on the side with missing objects 
> is restarted, all syncing starts over! In other words, it starts polling each 
> object, getting a 304 error in the radosgw log on the server on the multisite 
> side that has the missing objects. It *appears* to do this sequential object 
> scan in lexicographic order of object and/or prefix name, although I cannot be sure.
>
> So some questions:
> 1. Is there a recommendation/rule of thumb/formula for the number of 
> radosgws/sync threads/etc. based on number of objects, buckets, bandwidth, etc.?
> 2. Why does the syncing restart for a bucket when a radosgw is restarted? Is 
> there a way to tell it to resume where it left off as opposed to starting 
> over? There may be reasons to restart a bucket sync if a radosgw restarts, but 
> there should be a way to checkpoint it / force it to not restart / resume 
> where it left off, etc.
> 3. Is there a way to 'abort' the sync and cause the bucket to think it is up 
> to date and only replicate new objects from the time it was marked up to date?
> Thanks for any information
> -Chris
>
>
>
>    On Friday, November 8, 2024 at 03:45:05 PM MST, Christopher Durham 
> wrote:
>
>
> I have a 2-site multisite configuration on ceph 18.2.4 on EL9.
> After system updates, we discovered that a particular bucket had several 
> thousand objects missing, which the other side had. Newly created objects 
> were being replicated just fine.
>
> I decided to 'restart' syncing that bucket. Here is what I did.
> On the side with missing objects:
> > radosgw-admin bucket sync init --bucket <bucket> --src-zone <source zone>
>
> I restarted the radosgw set up to do the sync thread on the same zone as I 
> ran the radosgw-admin command.
>
> Logs on the radosgw src-zone side show GETs with http code 200 for objects 
> that do not exist on the side with missing objects, and GETs with http 304 
> for objects that already exist on the side with missing objects.
> So far, so good.
> As I said, the bucket is active. So on the src-zone side, data is continually 
> being written to /prefixA/../../ There is also data being written to 
> /prefixB/../../
> prefixA/ comes lexicographically before prefixB/
> What happens is that all the 304s happen as it scans the bucket, then it starts 
> pulling with GETs and http 200s for the objects the side doing the sync 
> doesn't have. This is on /prefixA. When it 'catches up' with all data in 
> /prefixA at the moment, the sync seems to START OVER with /prefixA, giving 
> 304s for everything that existed in the bucket up to the moment it caught up, 
> then doing GETs with 200s for the remaining newer objects. This happens over 
> and over again. It NEVER gets to /prefixB. So it seems to be periodically 
> catching up to /prefixA, but never going on to /prefixB, which is also being 
> written to.

[ceph-users] Re: multisite sync issue with bucket sync

2024-11-20 Thread Christopher Durham
 Casey,
Thanks for your response. So is there a way to abandon a full sync and just 
move on with an incremental from the time you abandon the full sync?
-Chris

On Wednesday, November 20, 2024 at 12:29:26 PM MST, Casey Bodley 
 wrote:   

 On Wed, Nov 20, 2024 at 2:10 PM Christopher Durham  wrote:
>
>  Ok,
> Source code review reveals that full sync is marker based, and sync errors 
> within a marker group *suggest* that data within the marker is re-checked (I 
> may be wrong about this, but that is consistent with my 304 errors below). I 
> do, however, have the following question:
> Is there a way to otherwise abort a full sync of a bucket (started as a result 
> of radosgw-admin bucket sync init --bucket <bucket> and bucket sync run, or a 
> restart of radosgw), and have it just do incremental sync from then on (yes, 
> accepting that the objects will not be the same on both sides prior to the 
> 'restart' of an incremental sync)?
> Would radosgw-admin bucket sync disable --bucket <bucket> followed by 
> radosgw-admin bucket sync enable --bucket <bucket> do this? Or would that do 
> another full sync and not an incremental?

'bucket sync enable' does start a new full sync (to catch objects that
were uploaded since 'bucket sync disable')

Thanks
> -Chris
>
>    On Thursday, November 14, 2024 at 04:18:34 PM MST, Christopher Durham 
> wrote:
>
>  Hi,
> I have heard nothing on this, but have done some more research.
> Again, both sides of a multisite s3 configuration are ceph 18.2.4 on Rocky 9.
> For a given bucket, there are thousands of 'missing' objects. I did:
> radosgw-admin bucket sync init --bucket <bucket> --src-zone <source zone>
> Sync starts after I restart a radosgw on the source zone that has a sync 
> thread.
> But based on the number and size of objects needing replication, it NEVER 
> finishes, as more objects are created as I am going. I may need to increase 
> the number of radosgws and/or the sync threads.
>
> What I have discovered is that if a radosgw on the side with missing objects 
> is restarted, all syncing starts over! In other words, it starts polling each 
> object, getting a 304 error in the radosgw log on the server on the multisite 
> side that has the missing objects. It *appears* to do this sequential object 
> scan in lexicographic order of object and/or prefix name, although I cannot be sure.
>
> So some questions:
> 1. Is there a recommendation/rule of thumb/formula for the number of 
> radosgws/sync threads/etc. based on number of objects, buckets, bandwidth, etc.?
> 2. Why does the syncing restart for a bucket when a radosgw is restarted? Is 
> there a way to tell it to resume where it left off as opposed to starting 
> over? There may be reasons to restart a bucket sync if a radosgw restarts, but 
> there should be a way to checkpoint it / force it to not restart / resume 
> where it left off, etc.
> 3. Is there a way to 'abort' the sync and cause the bucket to think it is up 
> to date and only replicate new objects from the time it was marked up to date?
> Thanks for any information
> -Chris
>
>
>
>    On Friday, November 8, 2024 at 03:45:05 PM MST, Christopher Durham 
> wrote:
>
>
> I have a 2-site multisite configuration on ceph 18.2.4 on EL9.
> After system updates, we discovered that a particular bucket had several 
> thousand objects missing, which the other side had. Newly created objects 
> were being replicated just fine.
>
> I decided to 'restart' syncing that bucket. Here is what I did.
> On the side with missing objects:
> > radosgw-admin bucket sync init --bucket <bucket> --src-zone <source zone>
>
> I restarted the radosgw set up to do the sync thread on the same zone as I 
> ran the radosgw-admin command.
>
> Logs on the radosgw src-zone side show GETs with http code 200 for objects 
> that do not exist on the side with missing objects, and GETs with http 304 
> for objects that already exist on the side with missing objects.
> So far, so good.
> As I said, the bucket is active. So on the src-zone side, data is continually 
> being written to /prefixA/../../ There is also data being written to 
> /prefixB/../../
> prefixA/ comes lexicographically before prefixB/
> What happens is that all the 304s happen as it scans the bucket, then it starts 
> pulling with GETs and http 200s for the objects the side doing the sync 
> doesn't have. This is on /prefixA. When it 'catches up' with all data in 
> /prefixA at the moment, the sync seems to START OVER with /prefixA, giving 
> 304s for everything that existed in the bucket up to the moment it caught up, 
> then doing GETs with 200s for the remaining newer objects. This happens over 
> and over again. It NEVER gets to /prefixB. So it seems to be periodically 
> catching up to /prefixA, but never going on to /prefixB, which is also being 
> written to.
> There are 1.2 million objects in this bucket, with about 35 TiB in the bucket.
> There is a lifecycle expiration happening of 60 days.
> Any thoughts would be appreciated.
> -Chris
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To

[ceph-users] Re: CephFS subvolumes not inheriting ephemeral distributed pin

2024-11-20 Thread Rajmohan Ramamoorthy
Hi Patrick,

Few other follow up questions.

Is directory fragmentation applicable only when multiple active MDS daemons are
enabled for a Ceph FS?

Will directory fragmentation and distribution of fragments among active MDS
daemons happen if we turn off the balancer for a Ceph FS volume with `ceph fs set
midline-a balance_automate false`? In Squid, the CephFS automatic metadata load
(sometimes called "default") balancer is now disabled by default. (
https://docs.ceph.com/en/latest/releases/squid/)

Is there a way for us to ensure that the directory tree of a Subvolume
(Kubernetes PV) is part of the same fragment and handled by a single MDS, so
that a client's operations are handled by one MDS?

What is the trigger to start fragmenting directories within a
Subvolumegroup?

With the `balance_automate` set to false and `ephemeral distributed pin`
enabled for a Subvolumegroup, can we expect (almost) equal distribution of
Subvolumes (Kubernetes PVs) amongst the active MDS daemons and stable
operation without hotspot migrations?
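
For what it's worth, a minimal sketch of how the same pin can be applied and
inspected directly from a client mount (the mount path and MDS name below are
illustrative, and the pin only takes effect when mds_export_ephemeral_distributed
is enabled, which is the default):

# equivalent xattr form of 'ceph fs subvolumegroup pin ... distributed 1'
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/midline-a/volumes/csi
# see which rank ended up authoritative for each subtree/fragment
ceph tell mds.midline.server1.njyfcn get subtrees | jq '.[] | {dir: .dir.path, rank: .auth_first}'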


Regards,
Rajmohan R


On Wed, Nov 20, 2024 at 1:27 PM Patrick Donnelly 
wrote:

> On Tue, Nov 19, 2024 at 9:20 PM Rajmohan Ramamoorthy
>  wrote:
> >
> > ```
> > Subvolumes do not "inherit" the distributed ephemeral pin. What you
> > should expect below is that the "csi" subvolumegroup will be
> > fragmented and distributed across the ranks. Consequently, the
> > subvolumes will also be distributed across ranks as part of the
> > subtrees rooted at each fragment of the "csi" subvolumegroup
> > (directory).
> > ```
> >
> > How is subvolumegroup fragmentation handled?
>
> Fragmentation is automatically applied (to a minimum level) when a
> directory is marked with the distributed ephemeral pin.
>
> > Are the subvolumes equally
> > distributed across all available active MDS?
>
> As the documentation says, it's a consistent hash of the fragments
> (which include the subvolumes which fall into those fragments) across
> ranks.
>
> > In the following scenario,
> > will 3 of the subvolumes be mapped to each of the MDS?
>
> You cannot say. It depends how the fragments are hashed.
>
> > Will setting the ephemeral distributed pin on Subvolumegroup ensure that
> > the subvolumes in it will be equally distributed across MDS ?
>
> Approximately.
>
> > We are looking at
> > ceph-csi use case for Kubernetes. PVs (subvolumes) are dynamically
> created
> > by Kubernetes.
>
> This is an ideal use-case for the distributed ephemeral pin.
>
> > # Ceph FS configuration
> >
> > ceph fs subvolumegroup create midline-a csi
> > ceph fs subvolumegroup pin midline-a csi distributed 1
> >
> > ceph fs subvolume create midline-a subvol1 csi
> > ceph fs subvolume create midline-a subvol2 csi
> > ceph fs subvolume create midline-a subvol3 csi
> > ceph fs subvolume create midline-a subvol4 csi
> > ceph fs subvolume create midline-a subvol5 csi
> > ceph fs subvolume create midline-a subvol6 csi
> >
> > # ceph fs ls
> > name: midline-a, metadata pool: fs-midline-metadata-a, data pools:
> [fs-midline-data-a ]
> >
> > # ceph fs subvolumegroup ls midline-a
> > [
> > {
> > "name": "csi"
> > }
> > ]
> >
> > # ceph fs subvolume ls midline-a csi
> > [
> > {
> > "name": "subvol4"
> > },
> > {
> > "name": "subvol2"
> > },
> > {
> > "name": "subvol3"
> > },
> > {
> > "name": "subvol5"
> > },
> > {
> > "name": "subvol6"
> > },
> > {
> > "name": "subvol1"
> > }
> > ]
> >
> > # ceph fs status
> > midline-a - 2 clients
> > =
> > RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
> > 0 active midline.server1.njyfcn Reqs: 0 /s 514 110 228 36
> > 1 active midline.server2.lpnjmx Reqs: 0 /s 47 22 17 6
> > POOL TYPE USED AVAIL
> > fs-midline-metadata-a metadata 25.4M 25.9T
> > fs-midline-data-a data 216k 25.9T
> > STANDBY MDS
> > midline.server3.wsbxsh
> > MDS version: ceph version 19.2.0
> (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
> >
> > Following are the subtrees output from the MDSs. The directory fragments do
> > not seem to be equally mapped to the MDSs.
> >
> > # ceph tell mds.midline.server1.njyfcn get subtrees | jq
> > [
> > {
> > "is_auth": true,
> > "auth_first": 0,
> > "auth_second": -2,
> > "export_pin": -1,
> > "distributed_ephemeral_pin": false,
> > "random_ephemeral_pin": false,
> > "export_pin_target": -1,
> > "dir": {
> > "path": "",
> > "dirfrag": "0x1",
> > "snapid_first": 2,
> > "projected_version": "1240",
> > "version": "1240",
> > "committing_version": "0",
> > "committed_version": "0",
> > "is_rep": false,
> > "dir_auth": "0",
> > "states": [
> > "auth",
> > "dirty",
> > "complete"
> > ],
> > "is_auth": true,
> > "auth_state": {
> > "replicas": {
> > "1": 1
> > }
> > },
> > "replica_state": {
> > "authority": [
> > 0,
> > -2
> > ],
> > "replica_nonce": 0
> > },
> > "auth_pins": 0,
> > "is_frozen": false,
> > "is_freezing": false,
> > "pins": {
> > "child": 1,
> > "subtree": 1,
> > "subtreetemp": 0,
> > "replicated": 1,
> > "dirty": 1,
> > "waiter": 0,
> > "authpin": 0
> > },
> > "nref": 4
> > }
> > },
> > {
> > "is_auth": true,
>

[ceph-users] Squid: regression in rgw multisite replication from Quincy/Reef clusters

2024-11-20 Thread Casey Bodley
Recent multisite testing has uncovered a regression on Squid that
happens when secondary zones are upgraded to Squid before the metadata
master zone. User metadata replicates incorrectly in this
configuration, such that their access keys are "inactive". As a
result, these users are denied access to API requests on that
secondary zone.

This is tracked in https://tracker.ceph.com/issues/68985, and we'll
prioritize the backport for 19.2.1. In the meantime, we strongly
recommend upgrading the metadata master zone before others.
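
A hedged way to spot-check a replicated user on the secondary zone (the uid is a
placeholder, and the per-key 'active' flag is assumed to be present in Squid's
user info output):

radosgw-admin user info --uid=<uid> | jq '.keys[] | {access_key, active}'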
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing

2024-11-20 Thread Eugen Block
Ah, I misunderstood, I thought you wanted an even distribution across  
both ranks.
Just for testing purposes, have you tried pinning rank 1 to some other  
directory? Does it still break the CephFS if you stop it? I'm not sure  
if you can prevent rank 1 from participating, I haven't looked into  
all the configs in quite a while. Would it be an option to rather use  
multiple Filesystems instead of multi-active for one CephFS?


Quoting Александр Руденко :


No, it's not a typo. It's a misleading example. :)

dir1 and dir2 are pinned to rank 0, but the FS and dir1/dir2 can't work without
rank 1.
Rank 1 is used for something when I work with these dirs.

ceph 16.2.13, metadata balancer and policy based balancing not used.

Wed, Nov 20, 2024 at 16:33, Eugen Block :


Hi,

> After pinning:
> setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2

is this a typo? If not, you did pin both directories to the same rank.

Quoting Александр Руденко :

> Hi,
>
> I try to distribute all top level dirs in CephFS by different MDS ranks.
> I have two active MDS with rank *0* and *1 *and I have 2 top dirs like
> */dir1* and* /dir2*.
>
> After pinning:
> setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>
> I can see next INOS and DNS distribution:
> RANK  STATE   MDS ACTIVITY DNSINOS   DIRS   CAPS
>  0active   c   Reqs:127 /s  12.6k  12.5k   333505
>  1active   b   Reqs:11 /s21 24 19  1
>
> When I write to dir1 I can see a small amount on Reqs: in rank 1.
>
> Events in journal of MDS with rank 1:
> cephfs-journal-tool --rank=fs1:1 event get list
>
> 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE:  (scatter_writebehind)
>   A2037D53
> 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION:  ()
> 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE:  (lock inest accounted
> scatter stat update)
> 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION:  ()
> 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION:  ()
> 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION:  ()
> 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION:  ()
> 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION:  ()
> 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION:  ()
> 2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT:  ()
>   di1/A2037D53
> 2024-11-20T12:30:46.909621+0300 0xc5d96f SESSION:  ()
> 2024-11-20T12:30:47.908050+0300 0xc5db13 SESSION:  ()
> 2024-11-20T12:32:46.907649+0300 0xc5dba0 SESSION:  ()
> 2024-11-20T12:32:47.905962+0300 0xc5dd44 SESSION:  ()
> 2024-11-20T12:34:44.349348+0300 0xc5ddd1 SESSIONS:  ()
>
> But the main problem, when I stop MDS rank 1 (without any kind of
standby)
> - FS hangs for all actions.
> Is this correct? Is it possible to completely exclude rank 1 from
> processing dir1 and not stop io when rank 1 goes down?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Crush rule examples

2024-11-20 Thread Andre Tann

Hi Janne

On 20.11.24 at 11:30, Janne Johansson wrote:


This post seem to show that, except they have their root named "nvme"
and they split on rack and not dc, but that is not important.

https://unix.stackexchange.com/questions/781250/ceph-crush-rules-explanation-for-multiroom-racks-setup


This is indeed a good example, thanks.

Let me put some thoughts/questions here:



step choose firstn 2 type rack


This chooses 2 racks out of all available racks. As there are 2 racks 
available, all are chosen.




step chooseleaf firstn 2 type host


For each rack selected in the previous step, 2 hosts are chosen. But as 
the action is "chooseleaf", the hosts themselves are not picked; instead 
one (pseudo-random) OSD is picked in each of the 2 selected hosts.


In the end we have 4 OSDs in 4 different hosts, 2 in each rack.

Is this understanding correct?


Shouldn't we note this one additionally:

min_size 4
max_size 4

Reason: If we wanted to place more or less than 4 replicas, the rule 
won't work. Or what would happen if we don't specify min/max_size? It 
should lead to an error in case the pool is e.g. size=5, shouldn't it?



One last question: if we edit a crush map after a pool was created on 
it, what happens? In my understanding, this leads to massive data 
shifting so that the placements comply with the new rules. Is that right?


Thanks again

--
Andre Tann
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS maximum filename length

2024-11-20 Thread Naumann, Thomas
Hi at all,

we use a Proxmox cluster (v8.2.8) with Ceph (v18.2.4) and EC pools (all
Ceph options at their defaults). One pool is exported via CephFS as backend
storage for Nextcloud servers.
At the moment data is being migrated from old S3 storage to the CephFS
pool. There are many files with very long filenames.
According to Wikipedia
(see: https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits)
the CephFS maximum filename length is 255 characters, with no limit on
pathname length. Is this true?
What is the maximum filename length for CephFS (reef)?
Which CephFS options could be altered to increase the filename length?
What is the best practice for handling files whose filenames are too long?
Does anyone have experience with this?

best regards
-- 
Thomas Naumann


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: multisite sync issue with bucket sync

2024-11-20 Thread Christopher Durham
 Ok,
Source code review reveals that full sync is marker based, and sync errors 
within a marker group *suggest* that data within the marker is re-checked (I 
may be wrong about this, but that is consistent with my 304 errors below). I do, 
however, have the following question:
Is there a way to otherwise abort a full sync of a bucket (started as a result of 
radosgw-admin bucket sync init --bucket <bucket> and bucket sync run, or a 
restart of radosgw), and have it just do incremental sync from then on (yes, 
accepting that the objects will not be the same on both sides prior to the 
'restart' of an incremental sync)?
Would radosgw-admin bucket sync disable --bucket <bucket> followed by 
radosgw-admin bucket sync enable --bucket <bucket> do this? Or would that do 
another full sync and not an incremental? Thanks
-Chris

On Thursday, November 14, 2024 at 04:18:34 PM MST, Christopher Durham 
 wrote:   

  Hi,
I have heard nothing on this, but have done some more research.
Again, both sides of a multisite s3 configuration are ceph 18.2.4 on Rocky 9.
For a given bucket, there are thousands of 'missing' objects. I did:
radosgw-admin bucket sync init --bucket <bucket> --src-zone <source zone>
Sync starts after I restart a radosgw on the source zone that has a sync 
thread.
But based on the number and size of objects needing replication, it NEVER finishes, 
as more objects are created as I am going. I may need to increase the number of 
radosgws and/or the sync threads.

What I have discovered is that if a radosgw on the side with missing objects is 
restarted, all syncing starts over! In other words, it starts polling each 
object, getting a 304 error in the radosgw log on the server on the multisite 
side that has the missing objects. It *appears* to do this sequential object scan 
in lexicographic order of object and/or prefix name, although I cannot be sure.

So some questions:
1. Is there a recommendation/rule of thumb/formula for the number of 
radosgws/sync threads/etc. based on number of objects, buckets, bandwidth, etc.?
2. Why does the syncing restart for a bucket when a radosgw is restarted? Is there 
a way to tell it to resume where it left off as opposed to starting over? There 
may be reasons to restart a bucket sync if a radosgw restarts, but there should 
be a way to checkpoint it / force it to not restart / resume where it left off, etc.
3. Is there a way to 'abort' the sync and cause the bucket to think it is up to 
date and only replicate new objects from the time it was marked up to date?
Thanks for any information
-Chris



On Friday, November 8, 2024 at 03:45:05 PM MST, Christopher Durham 
 wrote:   

 
I have a 2-site multisite configuration on ceph 18.2.4 on EL9.
After system updates, we discovered that a particular bucket had several 
thousand objects missing, which the other side had. Newly created objects were 
being replicated just fine.

I decided to 'restart' syncing that bucket. Here is what I did.
On the side with missing objects:
> radosgw-admin bucket sync init --bucket <bucket> --src-zone <source zone>

I restarted the radosgw set up to do the sync thread on the same zone as I ran 
the radosgw-admin command. 

Logs on the radosgw src-zone side show GETs with http code 200 for objects that 
do not exist on the side with missing objects, and GETs with http 304 for 
objects that already exist on the side with missing objects.
So far, so good.
As I said, the bucket is active. So on the src-zone side, data is continually 
being written to /prefixA/../../ There is also data being written to 
/prefixB/../../
prefixA/ comes lexicographically before prefixB/
What happens is that all the 304s happen as it scans the bucket, then it starts 
pulling with GETs and http 200s for the objects the side doing the sync doesn't 
have. This is on /prefixA. When it 'catches up' with all data in /prefixA at the 
moment, the sync seems to START OVER with /prefixA, giving 304s for everything 
that existed in the bucket up to the moment it caught up, then doing GETs with 
200s for the remaining newer objects. This happens over and over again. It NEVER 
gets to /prefixB. So it seems to be periodically catching up to /prefixA, but 
never going on to /prefixB, which is also being written to.
There are 1.2 million objects in this bucket, with about 35 TiB in the bucket.
There is a lifecycle expiration happening of 60 days.
Any thoughts would be appreciated.
-Chris



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: multisite sync issue with bucket sync

2024-11-20 Thread Casey Bodley
On Wed, Nov 20, 2024 at 2:10 PM Christopher Durham  wrote:
>
>  Ok,
> Source code review reveals that full sync is marker based, and sync errors 
> within a marker group *suggest* that data within the marker is re-checked (I 
> may be wrong about this, but that is consistent with my 304 errors below). I 
> do, however, have the following question:
> Is there a way to otherwise abort a full sync of a bucket (started as a result 
> of radosgw-admin bucket sync init --bucket <bucket> and bucket sync run, or a 
> restart of radosgw), and have it just do incremental sync from then on (yes, 
> accepting that the objects will not be the same on both sides prior to the 
> 'restart' of an incremental sync)?
> Would radosgw-admin bucket sync disable --bucket <bucket> followed by 
> radosgw-admin bucket sync enable --bucket <bucket> do this? Or would that do 
> another full sync and not an incremental?

'bucket sync enable' does start a new full sync (to catch objects that
were uploaded since 'bucket sync disable')

Thanks
> -Chris
>
> On Thursday, November 14, 2024 at 04:18:34 PM MST, Christopher Durham 
>  wrote:
>
>   Hi,
> I have heard nothing on this, but have done some more research.
> Again, both sides of a multisite s3 configuration are ceph 18.2.4 on Rocky 9.
> For a given bucket, there are thousands of 'missing' objects. I did:
> radosgw-admin bucket sync init --bucket <bucket> --src-zone <source zone>
> Sync starts after I restart a radosgw on the source zone that has a sync 
> thread.
> But based on the number and size of objects needing replication, it NEVER 
> finishes, as more objects are created as I am going. I may need to increase 
> the number of radosgws and/or the sync threads.
>
> What I have discovered is that if a radosgw on the side with missing objects 
> is restarted, all syncing starts over! In other words, it starts polling each 
> object, getting a 304 error in the radosgw log on the server on the multisite 
> side that has the missing objects. It *appears* to do this sequential object 
> scan in lexicographic order of object and/or prefix name, although I cannot be sure.
>
> So some questions:
> 1. Is there a recommendation/rule of thumb/formula for the number of 
> radosgws/sync threads/etc. based on number of objects, buckets, bandwidth, etc.?
> 2. Why does the syncing restart for a bucket when a radosgw is restarted? Is 
> there a way to tell it to resume where it left off as opposed to starting 
> over? There may be reasons to restart a bucket sync if a radosgw restarts, but 
> there should be a way to checkpoint it / force it to not restart / resume 
> where it left off, etc.
> 3. Is there a way to 'abort' the sync and cause the bucket to think it is up 
> to date and only replicate new objects from the time it was marked up to date?
> Thanks for any information
> -Chris
>
>
>
> On Friday, November 8, 2024 at 03:45:05 PM MST, Christopher Durham 
>  wrote:
>
>
> I have a 2-site multisite configuration on ceph 18.2.4 on EL9.
> After system updates, we discovered that a particular bucket had several 
> thousand objects missing, which the other side had. Newly created objects 
> were being replicated just fine.
>
> I decided to 'restart' syncing that bucket. Here is what I did
> On the side with missing objects:
> > radosgw-admin bucket sync init --bucket <bucket> --src-zone <source zone>
>
> I restarted the radosgw set up to do the sync thread on the same zone as I 
> ran the radosgw-admin command.
>
> Logs on the radosgw src-zone side show GETs with http code 200 for objects 
> that do not exist on the side with missing objects, and GETs with http 304 
> for objects that already exist on the side with missing objects.
> So far, so good.
> As I said, the bucket is active. So on the src-zone side, data is continually 
> being written to /prefixA/../../ There is also data being written to 
> /prefixB/../../
> prefixA/ comes lexicographically before prefixB/
> What happens is that all the 304s happen as it scans the bucket, then it starts 
> pulling with GETs and http 200s for the objects the side doing the sync 
> doesn't have. This is on /prefixA. When it 'catches up' with all data in 
> /prefixA at the moment, the sync seems to START OVER with /prefixA, giving 
> 304s for everything that existed in the bucket up to the moment it caught up, 
> then doing GETs with 200s for the remaining newer objects. This happens over 
> and over again. It NEVER gets to /prefixB. So it seems to be periodically 
> catching up to /prefixA, but never going on to /prefixB, which is also being 
> written to.
> There are 1.2 million objects in this bucket, with about 35 TiB in the bucket.
> There is a lifecycle expiration happening of 60 days.
> Any thoughts would be appreciated.
> -Chris
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Crush rule examples

2024-11-20 Thread Joachim Kraftmayer
I have worked with CRUSH and CRUSH rules a lot over the last 12 years. I
would always recommend testing the rules with crushtool, for example:
https://docs.ceph.com/en/reef/man/8/crushtool/
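
A minimal sketch of such a test run (file names are arbitrary; the rule id and
replica count must match the rule being tested):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt        # decompile, then edit the rules in the text file
crushtool -c crushmap.txt -o crushmap-new.bin    # recompile
crushtool -i crushmap-new.bin --test --rule 1 --num-rep 4 --show-mappings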


  joachim.kraftma...@clyso.com

  www.clyso.com

  Hohenzollernstr. 27, 80801 Munich

Utting | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE275430677



On Wed., Nov 20, 2024 at 11:31, Janne Johansson <
icepic...@gmail.com> wrote:

> > Sorry, sent too early. So here we go again:
> >   My setup looks like this:
> >
> >DC1
> >node01
> >node02
> >node03
> >node04
> >node05
> >DC2
> >node06
> >node07
> >node08
> >node09
> >node10
> >
> > I want a replicated pool with size=4. Two copies should go in each DC,
> > and then no two copies on a single node.
> > How can I describe this in a crush rule?
>
> This post seem to show that, except they have their root named "nvme"
> and they split on rack and not dc, but that is not important.
>
>
> https://unix.stackexchange.com/questions/781250/ceph-crush-rules-explanation-for-multiroom-racks-setup
>
> with the answer at the bottom:
>
> for example this should work as well, to have 4 replicas in total,
> distributed across two racks:
> step take default class nvme
> step choose firstn 2 type rack
> step chooseleaf firstn 2 type host
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] [CephFS] Completely exclude some MDS rank from directory processing

2024-11-20 Thread Александр Руденко
Hi,

I am trying to distribute all top-level dirs in CephFS across different MDS ranks.
I have two active MDS daemons with ranks *0* and *1*, and I have 2 top dirs,
*/dir1* and */dir2*.

After pinning:
setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2

I can see the following INOS and DNS distribution:
RANK  STATE   MDS ACTIVITY DNSINOS   DIRS   CAPS
 0active   c   Reqs:127 /s  12.6k  12.5k   333505
 1active   b   Reqs:11 /s21 24 19  1

When I write to dir1 I can see a small number of Reqs on rank 1.

Events in journal of MDS with rank 1:
cephfs-journal-tool --rank=fs1:1 event get list

2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE:  (scatter_writebehind)
  A2037D53
2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION:  ()
2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE:  (lock inest accounted
scatter stat update)
2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION:  ()
2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION:  ()
2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION:  ()
2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION:  ()
2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION:  ()
2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION:  ()
2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT:  ()
  di1/A2037D53
2024-11-20T12:30:46.909621+0300 0xc5d96f SESSION:  ()
2024-11-20T12:30:47.908050+0300 0xc5db13 SESSION:  ()
2024-11-20T12:32:46.907649+0300 0xc5dba0 SESSION:  ()
2024-11-20T12:32:47.905962+0300 0xc5dd44 SESSION:  ()
2024-11-20T12:34:44.349348+0300 0xc5ddd1 SESSIONS:  ()

But the main problem: when I stop MDS rank 1 (without any kind of standby),
the FS hangs for all actions.
Is this correct? Is it possible to completely exclude rank 1 from
processing dir1, so that IO does not stop when rank 1 goes down?
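
(For comparison, a minimal sketch of pinning the two trees to different ranks,
assuming the same mountpoint:)

setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
setfattr -n ceph.dir.pin -v 1 /fs-mountpoint/dir2
getfattr -n ceph.dir.pin /fs-mountpoint/dir1 /fs-mountpoint/dir2   # verify the pins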
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Encrypt OSDs on running System. A good Idea?

2024-11-20 Thread Janne Johansson
> What issues should I expect if I take an OSD (15TB) out one at a time,
> encrypt it, and put it back into the cluster? I would have a long period
> where some OSDs are encrypted and others are not. How dangerous is this?

I don't think it would be more dangerous than if you were redoing OSDs
for any other reasons, so if you empty the OSD, rebuild it with
--dmcrypt and refill again, you would always have the correct number
of copies. It would not be an issue that some OSDs are encrypted and
others are not, this is only an aspect of how data is stored on disk,
when the OSD is up and running it will serve the same data regardless
of the encryption status.
Encryption for OSDs is more related to what is stored on the drive
when you remove it from the server and if that is readable or not in
that state.
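
A per-OSD cycle might look roughly like this (a sketch only: the OSD id and
device are placeholders, and the exact create step depends on whether the
cluster is managed by cephadm, plain ceph-volume, or Proxmox's pveceph tooling):

ceph osd out 12                              # let data drain; wait until all PGs are active+clean
ceph osd safe-to-destroy 12                  # double-check before destroying
ceph osd destroy 12 --yes-i-really-mean-it   # keeps the OSD id for reuse
ceph-volume lvm zap /dev/sdX --destroy       # wipe the old, unencrypted OSD
ceph-volume lvm create --osd-id 12 --data /dev/sdX --dmcrypt
ceph osd in 12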

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Encrypt OSDs on running System. A good Idea?

2024-11-20 Thread Giovanna Ratini

Hello all :-),

We use Ceph both as storage in Proxmox and as storage in K8S. I would 
like to encrypt the OSDs. I have backups of the Proxmox machines, but 
honestly, I would prefer not to have to use them, as it would take two 
days to rebuild everything from scratch.


I ran some tests on a small Proxmox machine, and telling the OSDs to 
encrypt themselves doesn’t seem too difficult. However, nothing is 
running on that storage.


What issues should I expect if I take an OSD (15TB) out one at a time, 
encrypt it, and put it back into the cluster? I would have a long period 
where some OSDs are encrypted and others are not. How dangerous is this? 
Has anyone done it before?


Best wishes,

Gio

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Crush rule examples

2024-11-20 Thread Andre Tann

Sorry, sent too early. So here we go again:


 My setup looks like this:

  DC1
  node01
  node02
  node03
  node04
  node05
  DC2
  node06
  node07
  node08
  node09
  node10

I want a replicated pool with size=4. Two copies should go in each DC, 
and then no two copies on a single node.


How can I describe this in a crush rule?

If someone has a link where a similar strategy is explained, I'm more 
than happy to figure out myself how to do it. However, while the Ceph 
documentation is very good in general, the CRUSH rule explanation is 
difficult to understand for me.


Thanks.


--
Andre Tann
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Crush rule examples

2024-11-20 Thread Andre Tann

Hi all,

I'm trying to understand how crush rules need to be set up, and much to 
my surprise I cannot find examples and/or good explanations (or I'm too 
stupid to understand them ;) )


My setup looks like this:

DC1
node01
node02
node03
node04
node05
DC2
node01
node01
node01
node01




--
Andre Tann
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Crush rule examples

2024-11-20 Thread Andre Tann




How can I describe this in a crush rule?


Let me please add the point that causes me the most difficulties:

I consider both DC and host to be failure domains. I accept that two 
copies go into one DC, but I don't want to accept that two copies go 
to one host.


And also, how can I say "2 copies in DC1 and 2 copies in DC2", and not 
for example 3 copies in one DC, 1 copy in the other?



--
Andre Tann
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Crush rule examples

2024-11-20 Thread Janne Johansson
> Sorry, sent too early. So here we go again:
>   My setup looks like this:
>
>DC1
>node01
>node02
>node03
>node04
>node05
>DC2
>node06
>node07
>node08
>node09
>node10
>
> I want a replicated pool with size=4. Two copies should go in each DC,
> and then no two copies on a single node.
> How can I describe this in a crush rule?

This post seem to show that, except they have their root named "nvme"
and they split on rack and not dc, but that is not important.

https://unix.stackexchange.com/questions/781250/ceph-crush-rules-explanation-for-multiroom-racks-setup

with the answer at the bottom:

for example this should work as well, to have 4 replicas in total,
distributed across two racks:
step take default class nvme
step choose firstn 2 type rack
step chooseleaf firstn 2 type host
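
For the two-DC layout above, a complete rule might look like this (a sketch
assuming datacenter-type buckets named dc1/dc2 under the default root, and
modern syntax without min_size/max_size):

rule replicated_two_dc {
    id 1
    type replicated
    step take default
    step choose firstn 2 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}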

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Crush rule examples

2024-11-20 Thread Frank Schilder
Hi Andre,

I think what you really want to look at is stretch mode. There have been long 
discussions on this list about why a crush rule with rep 4 and 2 copies per DC 
will not handle a DC failure as expected. Stretch mode will make sure writes 
happen in a way that prevents split-brain scenarios.

Hand-crafted crush rules for this purpose require 3 or more DCs.
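
A rough sketch of what entering stretch mode involves (mon names, datacenter
names and the CRUSH rule name are illustrative, and the rule must exist
beforehand; see the stretch-mode docs for the full procedure):

ceph mon set_location a datacenter=dc1
ceph mon set_location b datacenter=dc1
ceph mon set_location c datacenter=dc2
ceph mon set_location d datacenter=dc2
ceph mon set_location e datacenter=dc3          # tiebreaker monitor in a third site
ceph mon enable_stretch_mode e stretch_rule datacenter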

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Janne Johansson 
Sent: Wednesday, November 20, 2024 11:30 AM
To: Andre Tann
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Crush rule examples

> Sorry, sent too early. So here we go again:
>   My setup looks like this:
>
>DC1
>node01
>node02
>node03
>node04
>node05
>DC2
>node06
>node07
>node08
>node09
>node10
>
> I want a replicated pool with size=4. Two copies should go in each DC,
> and then no two copies on a single node.
> How can I describe this in a crush rule?

This post seem to show that, except they have their root named "nvme"
and they split on rack and not dc, but that is not important.

https://unix.stackexchange.com/questions/781250/ceph-crush-rules-explanation-for-multiroom-racks-setup

with the answer at the bottom:

for example this should work as well, to have 4 replicas in total,
distributed across two racks:
step take default class nvme
step choose firstn 2 type rack
step chooseleaf firstn 2 type host

--
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing

2024-11-20 Thread Eugen Block

Hi,


After pinning:
setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2


is this a typo? If not, you did pin both directories to the same rank.

Quoting Александр Руденко :


Hi,

I try to distribute all top level dirs in CephFS by different MDS ranks.
I have two active MDS with rank *0* and *1 *and I have 2 top dirs like
*/dir1* and* /dir2*.

After pinning:
setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2

I can see next INOS and DNS distribution:
RANK  STATE   MDS ACTIVITY DNSINOS   DIRS   CAPS
 0active   c   Reqs:127 /s  12.6k  12.5k   333505
 1active   b   Reqs:11 /s21 24 19  1

When I write to dir1 I can see a small amount on Reqs: in rank 1.

Events in journal of MDS with rank 1:
cephfs-journal-tool --rank=fs1:1 event get list

2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE:  (scatter_writebehind)
  A2037D53
2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION:  ()
2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE:  (lock inest accounted
scatter stat update)
2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION:  ()
2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION:  ()
2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION:  ()
2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION:  ()
2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION:  ()
2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION:  ()
2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT:  ()
  di1/A2037D53
2024-11-20T12:30:46.909621+0300 0xc5d96f SESSION:  ()
2024-11-20T12:30:47.908050+0300 0xc5db13 SESSION:  ()
2024-11-20T12:32:46.907649+0300 0xc5dba0 SESSION:  ()
2024-11-20T12:32:47.905962+0300 0xc5dd44 SESSION:  ()
2024-11-20T12:34:44.349348+0300 0xc5ddd1 SESSIONS:  ()

But the main problem, when I stop MDS rank 1 (without any kind of standby)
- FS hangs for all actions.
Is this correct? Is it possible to completely exclude rank 1 from
processing dir1 and not stop io when rank 1 goes down?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Join us for today's User + Developer Monthly Meetup!

2024-11-20 Thread Laura Flores
Hi all,

Please join us for today's User + Developer Monthly Meetup at 10:00 AM ET.
RSVP here! https://www.meetup.com/ceph-user-group/events/304636936

Thanks,
Laura Flores

-- 

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage 

Chicago, IL

lflo...@ibm.com | lflo...@redhat.com 
M: +17087388804
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing

2024-11-20 Thread Александр Руденко
No, it's not a typo. It's a misleading example. :)

dir1 and dir2 are pinned to rank 0, but the FS and dir1/dir2 can't work without
rank 1.
Rank 1 is used for something when I work with these dirs.

ceph 16.2.13; the metadata balancer and policy-based balancing are not used.

Wed, Nov 20, 2024 at 16:33, Eugen Block :

> Hi,
>
> > After pinning:
> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>
> is this a typo? If not, you did pin both directories to the same rank.
>
> Zitat von Александр Руденко :
>
> > Hi,
> >
> > I try to distribute all top level dirs in CephFS by different MDS ranks.
> > I have two active MDS with rank *0* and *1 *and I have 2 top dirs like
> > */dir1* and* /dir2*.
> >
> > After pinning:
> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
> >
> > I can see next INOS and DNS distribution:
> > RANK  STATE   MDS ACTIVITY DNSINOS   DIRS   CAPS
> >  0active   c   Reqs:127 /s  12.6k  12.5k   333505
> >  1active   b   Reqs:11 /s21 24 19  1
> >
> > When I write to dir1 I can see a small amount on Reqs: in rank 1.
> >
> > Events in journal of MDS with rank 1:
> > cephfs-journal-tool --rank=fs1:1 event get list
> >
> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE:  (scatter_writebehind)
> >   A2037D53
> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION:  ()
> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE:  (lock inest accounted
> > scatter stat update)
> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION:  ()
> > 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION:  ()
> > 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION:  ()
> > 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION:  ()
> > 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION:  ()
> > 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION:  ()
> > 2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT:  ()
> >   di1/A2037D53
> > 2024-11-20T12:30:46.909621+0300 0xc5d96f SESSION:  ()
> > 2024-11-20T12:30:47.908050+0300 0xc5db13 SESSION:  ()
> > 2024-11-20T12:32:46.907649+0300 0xc5dba0 SESSION:  ()
> > 2024-11-20T12:32:47.905962+0300 0xc5dd44 SESSION:  ()
> > 2024-11-20T12:34:44.349348+0300 0xc5ddd1 SESSIONS:  ()
> >
> > But the main problem, when I stop MDS rank 1 (without any kind of
> standby)
> > - FS hangs for all actions.
> > Is this correct? Is it possible to completely exclude rank 1 from
> > processing dir1 and not stop io when rank 1 goes down?
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing

2024-11-20 Thread Александр Руденко
>
> Just for testing purposes, have you tried pinning rank 1 to some other
> directory? Does it still break the CephFS if you stop it?


Yes, nothing changed.

It's no problem that the FS hangs when one of the ranks goes down; we will have
standby-replay for all ranks. What I don't like is that a rank which is not
pinned to some dir handles some IO for this dir, or from clients which work
with this dir.
I mean that I can't robustly and fully separate client IO by ranks.

Would it be an option to rather use multiple Filesystems instead of
> multi-active for one CephFS?


Yes, it's an option, but it is much more complicated in our case. Btw, do
you know how many different filesystems can be created in one cluster? Maybe
you know of some potential problems with 100-200 FSs in one cluster?

Wed, Nov 20, 2024 at 17:50, Eugen Block :

> Ah, I misunderstood, I thought you wanted an even distribution across
> both ranks.
> Just for testing purposes, have you tried pinning rank 1 to some other
> directory? Does it still break the CephFS if you stop it? I'm not sure
> if you can prevent rank 1 from participating, I haven't looked into
> all the configs in quite a while. Would it be an option to rather use
> multiple Filesystems instead of multi-active for one CephFS?
>
> Zitat von Александр Руденко :
>
> > No it's not a typo. It's misleading example)
> >
> > dir1 and dir2 are pinned to rank 0, but FS and dir1,dir2 can't work
> without
> > rank 1.
> > rank 1 is used for something when I work with this dirs.
> >
> > ceph 16.2.13, metadata balancer and policy based balancing not used.
> >
> > ср, 20 нояб. 2024 г. в 16:33, Eugen Block :
> >
> >> Hi,
> >>
> >> > After pinning:
> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
> >>
> >> is this a typo? If not, you did pin both directories to the same rank.
> >>
> >> Zitat von Александр Руденко :
> >>
> >> > Hi,
> >> >
> >> > I try to distribute all top level dirs in CephFS by different MDS
> ranks.
> >> > I have two active MDS with rank *0* and *1 *and I have 2 top dirs like
> >> > */dir1* and* /dir2*.
> >> >
> >> > After pinning:
> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
> >> >
> >> > I can see next INOS and DNS distribution:
> >> > RANK  STATE   MDS ACTIVITY DNSINOS   DIRS   CAPS
> >> >  0active   c   Reqs:127 /s  12.6k  12.5k   333505
> >> >  1active   b   Reqs:11 /s21 24 19  1
> >> >
> >> > When I write to dir1 I can see a small amount on Reqs: in rank 1.
> >> >
> >> > Events in journal of MDS with rank 1:
> >> > cephfs-journal-tool --rank=fs1:1 event get list
> >> >
> >> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE:
> (scatter_writebehind)
> >> >   A2037D53
> >> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION:  ()
> >> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE:  (lock inest
> accounted
> >> > scatter stat update)
> >> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION:  ()
> >> > 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION:  ()
> >> > 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION:  ()
> >> > 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION:  ()
> >> > 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION:  ()
> >> > 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION:  ()
> >> > 2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT:  ()
> >> >   di1/A2037D53
> >> > 2024-11-20T12:30:46.909621+0300 0xc5d96f SESSION:  ()
> >> > 2024-11-20T12:30:47.908050+0300 0xc5db13 SESSION:  ()
> >> > 2024-11-20T12:32:46.907649+0300 0xc5dba0 SESSION:  ()
> >> > 2024-11-20T12:32:47.905962+0300 0xc5dd44 SESSION:  ()
> >> > 2024-11-20T12:34:44.349348+0300 0xc5ddd1 SESSIONS:  ()
> >> >
> >> > But the main problem, when I stop MDS rank 1 (without any kind of
> >> standby)
> >> > - FS hangs for all actions.
> >> > Is this correct? Is it possible to completely exclude rank 1 from
> >> > processing dir1 and not stop io when rank 1 goes down?
> >> > ___
> >> > ceph-users mailing list -- ceph-users@ceph.io
> >> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
>
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io