[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing

2024-11-22 Thread Eugen Block
Hm, the same test worked for me with version 16.2.13... I mean, I only  
do a few writes from a single client, so this may be an invalid test,  
but I don't see any interruption.


Zitat von Eugen Block :

I just tried to reproduce the behaviour but failed to do so. I have  
a Reef (18.2.2) cluster with multi-active MDS. Don't mind the  
hostnames, this cluster was deployed with Nautilus.


# mounted the FS
mount -t ceph nautilus:/ /mnt -o  
name=admin,secret=,mds_namespace=secondfs


# created and pinned directories
nautilus:~ # mkdir /mnt/dir1
nautilus:~ # mkdir /mnt/dir2

nautilus:~ # setfattr -n ceph.dir.pin -v 0 /mnt/dir1
nautilus:~ # setfattr -n ceph.dir.pin -v 0 /mnt/dir2

I stopped all standby daemons while writing into /mnt/dir1, then I  
also stopped rank 1. But the writes were not interrupted (until I  
stopped them). You're on Pacific, I'll see if I can reproduce it  
there.
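
Just as a side note for anyone trying to separate the I/O: the pins would
have to point at different ranks. A minimal sketch (assuming the same
mountpoint, and that dir2 is meant for rank 1):

# pin each directory to its own rank, then verify the xattr
setfattr -n ceph.dir.pin -v 0 /mnt/dir1
setfattr -n ceph.dir.pin -v 1 /mnt/dir2
getfattr -n ceph.dir.pin /mnt/dir1 /mnt/dir2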


Zitat von Александр Руденко :



Can you show the entire 'ceph fs status' output? And maybe also 'ceph
fs dump'?



Nothing special, just a small test cluster.
fs1 - 10 clients
===
RANK  STATE    MDS   ACTIVITY     DNS    INOS   DIRS  CAPS
 0    active    a    Reqs: 0 /s   18.7k  18.4k  351   513
 1    active    b    Reqs: 0 /s   21     24     16    1
    POOL       TYPE     USED   AVAIL
 fs1_meta   metadata    116M   3184G
 fs1_data     data     23.8G   3184G
STANDBY MDS
c


fs dump

e48
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
writeable ranges,3=default file layouts on dirs,4=dir inode in separate
object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'fs1' (1)
fs_name fs1
epoch 47
flags 12
created 2024-10-15T18:55:10.905035+0300
modified 2024-11-21T10:55:12.688598+0300
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {}
last_failure 0
last_failure_osd_epoch 943
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 2
in 0,1
up {0=12200812,1=11974933}
failed
damaged
stopped
data_pools [7]
metadata_pool 6
inline_data disabled
balancer
standby_count_wanted 1
[mds.a{0:12200812} state up:active seq 13 addr [v2:
10.7.1.115:6842/1955635987,v1:10.7.1.115:6843/1955635987] compat
{c=[1],r=[1],i=[7ff]}]
[mds.b{1:11974933} state up:active seq 5 addr [v2:
10.7.1.116:6840/536741454,v1:10.7.1.116:6841/536741454] compat
{c=[1],r=[1],i=[7ff]}]


Standby daemons:

[mds.c{-1:11704322} state up:standby seq 1 addr [v2:
10.7.1.117:6848/84247504,v1:10.7.1.117:6849/84247504] compat
{c=[1],r=[1],i=[7ff]}]

On Thu, 21 Nov 2024 at 11:36, Eugen Block wrote:


I'm not aware of any hard limit for the number of Filesystems, but
that doesn't really mean very much. IIRC, last week during a Clyso
talk at Eventbrite I heard someone say that they deployed around 200
Filesystems or so, I don't remember if it was a production environment
or just a lab environment. I assume that you would probably be limited
by the number of OSDs/PGs rather than by the number of Filesystems,
200 Filesystems require at least 400 pools. But maybe someone else has
more experience in scaling CephFS that way. What we did was to scale
the number of active MDS daemons for one CephFS. I believe in the end
the customer had 48 MDS daemons on three MDS servers, 16 of them were
active with directory pinning, at that time they had 16 standby-replay
and 16 standby daemons. But it turned out that standby-replay didn't
help their use case, so we disabled standby-replay.
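
If you do try the multiple-filesystems route, the volumes interface keeps the
pool handling simple; a minimal sketch (the filesystem name here is just an
example), keeping in mind that every additional filesystem brings its own
metadata and data pool:

ceph fs volume create fs2
ceph fs status fs2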

Can you show the entire 'ceph fs status' output? And maybe also 'ceph
fs dump'?

Zitat von Александр Руденко :



Just for testing purposes, have you tried pinning rank 1 to some other
directory? Does it still break the CephFS if you stop it?



Yes, nothing changed.

It's no problem that the FS hangs when one of the ranks goes down, we will
have standby-replay for all ranks. What I don't like is that a rank which is
not pinned to a directory still handles some I/O of that directory, or of
clients working with it.
I mean that I can't robustly and fully separate client IO by ranks.

Would it be an option to rather use multiple Filesystems instead of
multi-active for one CephFS?



Yes, it's an option. But it is much more complicated in our case. Btw, do
you know how many different FS can be created in one cluster? Maybe you
know some potential problems with 100-200 FSs in one cluster?

On Wed, 20 Nov 2024 at 17:50, Eugen Block wrote:


Ah, I misunderstood, I thought you wanted an even distribution across
both ranks.
Just for testing purposes, have you tried pinning rank 1 to some other
directory? Does it still break the CephFS if you stop it? I'm not sure
if you can prevent rank 1 from participating, I haven't looked into
all the configs in quite a while. Would it be an option to rather use
multiple Filesystems instead of multi-active for one CephFS?

[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing

2024-11-22 Thread Eugen Block
Then you were clearly paying more attention than me. ;-) We had some  
maintenance going on during that talk, so I couldn't really focus  
entirely on listening. But thanks for clarifying!


Zitat von Frédéric Nass :


Hi Eugen,

During the talk you've mentioned, Dan said there's a hard-coded
limit of 256 MDSs per cluster. So with one active and one
standby-ish MDS per filesystem, that would be 128 filesystems at
most per cluster.

Mark said he got 120, but... things start to get wacky by 80. :-)

More fun to come, for sure.

Cheers,
Frédéric.

[1] https://youtu.be/qiCE1Ifws80?t=2602

- On 21 Nov 24, at 9:36, Eugen Block ebl...@nde.ag wrote:


I'm not aware of any hard limit for the number of Filesystems, but
that doesn't really mean very much. IIRC, last week during a Clyso
talk at Eventbrite I heard someone say that they deployed around 200
Filesystems or so, I don't remember if it was a production environment
or just a lab environment. I assume that you would probably be limited
by the number of OSDs/PGs rather than by the number of Filesystems,
200 Filesystems require at least 400 pools. But maybe someone else has
more experience in scaling CephFS that way. What we did was to scale
the number of active MDS daemons for one CephFS. I believe in the end
the customer had 48 MDS daemons on three MDS servers, 16 of them were
active with directory pinning, at that time they had 16 standby-replay
and 16 standby daemons. But it turned out that standby-replay didn't
help their use case, so we disabled standby-replay.

Can you show the entire 'ceph fs status' output? And maybe also 'ceph
fs dump'?

Zitat von Александр Руденко :



Just for testing purposes, have you tried pinning rank 1 to some other
directory? Does it still break the CephFS if you stop it?



Yes, nothing changed.

It's no problem that the FS hangs when one of the ranks goes down, we will have
standby-replay for all ranks. What I don't like is that a rank which is not
pinned to a directory still handles some I/O of that directory, or of clients
working with it.
I mean that I can't robustly and fully separate client IO by ranks.

Would it be an option to rather use multiple Filesystems instead of
multi-active for one CephFS?



Yes, it's an option. But it is much more complicated in our case. Btw, do
you know how many different FS can be created in one cluster? Maybe you
know some potential problems with 100-200 FSs in one cluster?

On Wed, 20 Nov 2024 at 17:50, Eugen Block wrote:


Ah, I misunderstood, I thought you wanted an even distribution across
both ranks.
Just for testing purposes, have you tried pinning rank 1 to some other
directory? Does it still break the CephFS if you stop it? I'm not sure
if you can prevent rank 1 from participating, I haven't looked into
all the configs in quite a while. Would it be an option to rather use
multiple Filesystems instead of multi-active for one CephFS?

Zitat von Александр Руденко :

> No, it's not a typo. It's a misleading example. :)
>
> dir1 and dir2 are pinned to rank 0, but the FS and dir1/dir2 can't work
> without rank 1.
> rank 1 is used for something when I work with these dirs.
>
> ceph 16.2.13, metadata balancer and policy based balancing not used.
>
> On Wed, 20 Nov 2024 at 16:33, Eugen Block wrote:
>
>> Hi,
>>
>> > After pinning:
>> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>>
>> is this a typo? If not, you did pin both directories to the same rank.
>>
>> Zitat von Александр Руденко :
>>
>> > Hi,
>> >
>> > I try to distribute all top level dirs in CephFS by different MDS ranks.
>> > I have two active MDS with rank *0* and *1* and I have 2 top dirs like
>> > */dir1* and */dir2*.
>> >
>> > After pinning:
>> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>> >
>> > I can see next INOS and DNS distribution:
>> > RANK  STATE    MDS   ACTIVITY      DNS    INOS   DIRS  CAPS
>> >  0    active    c    Reqs: 127 /s  12.6k  12.5k  333   505
>> >  1    active    b    Reqs: 11 /s   21     24     19    1
>> >
>> > When I write to dir1 I can see a small amount on Reqs: in rank 1.
>> >
>> > Events in journal of MDS with rank 1:
>> > cephfs-journal-tool --rank=fs1:1 event get list
>> >
>> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE:  (scatter_writebehind)
>> >   A2037D53
>> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION:  ()
>> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE:  (lock inest accounted scatter stat update)
>> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION:  ()
>> > 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION:  ()
>> > 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION:  ()
>> > 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION:  ()
>> > 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION:  ()
>> > 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION:  ()
>> > 2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT:  ()
>> >   di1/A2037D53
>> > 2024-11-20T12:30:46.909621+0300 0xc5d96f 

[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing

2024-11-22 Thread Eugen Block
I just tried to reproduce the behaviour but failed to do so. I have a  
Reef (18.2.2) cluster with multi-active MDS. Don't mind the hostnames,  
this cluster was deployed with Nautilus.


# mounted the FS
mount -t ceph nautilus:/ /mnt -o name=admin,secret=,mds_namespace=secondfs

# created and pinned directories
nautilus:~ # mkdir /mnt/dir1
nautilus:~ # mkdir /mnt/dir2

nautilus:~ # setfattr -n ceph.dir.pin -v 0 /mnt/dir1
nautilus:~ # setfattr -n ceph.dir.pin -v 0 /mnt/dir2

I stopped all standby daemons while writing into /mnt/dir1, then I  
also stopped rank 1. But the writes were not interrupted (until I  
stopped them). You're on Pacific, I'll see if I can reproduce it there.


Zitat von Александр Руденко :



Can you show the entire 'ceph fs status' output? And maybe also 'ceph
fs dump'?



Nothing special, just a small test cluster.
fs1 - 10 clients
===
RANK  STATE    MDS   ACTIVITY     DNS    INOS   DIRS  CAPS
 0    active    a    Reqs: 0 /s   18.7k  18.4k  351   513
 1    active    b    Reqs: 0 /s   21     24     16    1
    POOL       TYPE     USED   AVAIL
 fs1_meta   metadata    116M   3184G
 fs1_data     data     23.8G   3184G
STANDBY MDS
 c


fs dump

e48
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
writeable ranges,3=default file layouts on dirs,4=dir inode in separate
object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'fs1' (1)
fs_name fs1
epoch 47
flags 12
created 2024-10-15T18:55:10.905035+0300
modified 2024-11-21T10:55:12.688598+0300
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {}
last_failure 0
last_failure_osd_epoch 943
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 2
in 0,1
up {0=12200812,1=11974933}
failed
damaged
stopped
data_pools [7]
metadata_pool 6
inline_data disabled
balancer
standby_count_wanted 1
[mds.a{0:12200812} state up:active seq 13 addr [v2:
10.7.1.115:6842/1955635987,v1:10.7.1.115:6843/1955635987] compat
{c=[1],r=[1],i=[7ff]}]
[mds.b{1:11974933} state up:active seq 5 addr [v2:
10.7.1.116:6840/536741454,v1:10.7.1.116:6841/536741454] compat
{c=[1],r=[1],i=[7ff]}]


Standby daemons:

[mds.c{-1:11704322} state up:standby seq 1 addr [v2:
10.7.1.117:6848/84247504,v1:10.7.1.117:6849/84247504] compat
{c=[1],r=[1],i=[7ff]}]

On Thu, 21 Nov 2024 at 11:36, Eugen Block wrote:


I'm not aware of any hard limit for the number of Filesystems, but
that doesn't really mean very much. IIRC, last week during a Clyso
talk at Eventbrite I heard someone say that they deployed around 200
Filesystems or so, I don't remember if it was a production environment
or just a lab environment. I assume that you would probably be limited
by the number of OSDs/PGs rather than by the number of Filesystems,
200 Filesystems require at least 400 pools. But maybe someone else has
more experience in scaling CephFS that way. What we did was to scale
the number of active MDS daemons for one CephFS. I believe in the end
the customer had 48 MDS daemons on three MDS servers, 16 of them were
active with directory pinning, at that time they had 16 standby-replay
and 16 standby daemons. But it turned out that standby-replay didn't
help their use case, so we disabled standby-replay.

Can you show the entire 'ceph fs status' output? And maybe also 'ceph
fs dump'?

Zitat von Александр Руденко :

>>
>> Just for testing purposes, have you tried pinning rank 1 to some other
>> directory? Does it still break the CephFS if you stop it?
>
>
> Yes, nothing changed.
>
> It's no problem that the FS hangs when one of the ranks goes down, we will
> have standby-replay for all ranks. What I don't like is that a rank which is
> not pinned to a directory still handles some I/O of that directory, or of
> clients working with it.
> I mean that I can't robustly and fully separate client IO by ranks.
>
>> Would it be an option to rather use multiple Filesystems instead of
>> multi-active for one CephFS?
>
>
> Yes, it's an option. But it is much more complicated in our case. Btw, do
> you know how many different FS can be created in one cluster? Maybe you
> know some potential problems with 100-200 FSs in one cluster?
>
> On Wed, 20 Nov 2024 at 17:50, Eugen Block wrote:
>
>> Ah, I misunderstood, I thought you wanted an even distribution across
>> both ranks.
>> Just for testing purposes, have you tried pinning rank 1 to some other
>> directory? Does it still break the CephFS if you stop it? I'm not sure
>> if you can prevent rank 1 from participating, I haven't looked into
>> all the configs in quite a while. Would it be an option to rather use
>> multiple Filesystems instead of multi-active for one CephFS?

[ceph-users] Re: Crush rule examples

2024-11-22 Thread Janne Johansson
On Thu, 21 Nov 2024 at 19:18, Andre Tann wrote:
> > This post seems to show that, except they have their root named "nvme"
> > and they split on rack and not dc, but that is not important.
> >
> > https://unix.stackexchange.com/questions/781250/ceph-crush-rules-explanation-for-multiroom-racks-setup
>
> This is indeed a good example, thanks.
> Let me put some thoughts/questions here:
>
> > step choose firstn 2 type rack
>
> This chooses 2 racks out of all available racks. As there are 2 racks
> available, all are chosen.

Yes, and you would name it DC instead, of course.

> > step chooseleaf firstn 2 type host
>
> For each selected rack from the previous step, 2 hosts are chosen. But
> as the action is "chooseleaf", in fact it is not the hosts that are picked,
> but one random (?) OSD in each of the 2 selected hosts.

Well, it picks a leaf out of the host, which is a branch in the tree.
I see it as: after picking the host, don't do anything special, just grab an
OSD from there.

> In the end we have 4 OSDs in 4 different hosts, 2 in each rack.
> Is this understanding correct?

I believe so, yes.
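
For reference, the complete rule as discussed would look roughly like this (a
sketch only; the rule name, id and root are assumptions, with "datacenter" in
place of "rack"):

rule replicated_two_dc {
    id 1
    type replicated
    step take default
    step choose firstn 2 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}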

> Shouldn't we note this one additionally:
>
> min_size 4

Not necessary, you could allow for min_size 3 so that single-drive
problems don't cause the PG to stop.

> max_size 4
>
> Reason: If we wanted to place more or less than 4 replicas, the rule
> won't work. Or what would happen if we don't specify min/max_size?
> Should lead to an error in case the pool is e.g. size=5, shouldn't it?

Yes, but when you figure you need a repl=5 pool you would have to make a rule
that picks 3 from one DC. I'm sure there is a way to say "..and then you pick as
many hosts as needed", but I don't know it offhand. Might be that the above rule
would allow 5 copies, but the fifth ends up on the same host as
one of the others.

> One last question: if we edit a crush map after a pool was created on
> it, what happens? In my understanding, this leads to massive data
> shifting so that the placements comply with the new rules. Is that right?

Yes, but it can be mitigated somewhat by using the remappers and letting the
balancer slowly do the changes.

1. set norebalance
2. stop the balancer
3. apply the new crush rule on pool
4. let the mons figure out all new places for the PGs
5. run one of the remapper tools, jj-balancer, upmap-remapper.py or the
golang pgremapper, which makes most (sometimes all) PGs
think they are in the correct place after all
6. unset norebalance
7. start the ceph balancer with a setting of max misplaced % that suits the
load you want to have during the moves.
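
As a command-level sketch of the steps above (the pool name, rule name and
misplaced ratio are only examples):

ceph osd set norebalance                               # step 1
ceph balancer off                                      # step 2
ceph osd pool set <pool> crush_rule <new-rule>         # step 3
# step 4/5: let the mons settle, then run e.g. upmap-remapper.py
ceph osd unset norebalance                             # step 6
ceph config set mgr target_max_misplaced_ratio 0.05    # step 7
ceph balancer on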

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to speed up OSD deployment process

2024-11-22 Thread Eugen Block

Hi,

I don't see how it would be currently possible. The OSD creation is  
handled by ceph-volume, which activates each OSD separately:


[2024-11-22 14:03:08,415][ceph_volume.main][INFO  ] Running command:  
ceph-volume  activate --osd-id 0 --osd-uuid  
aacabeca-9adb-465c-88ee-935f06fa45f7 --no-systemd --no-tmpfs


[2024-11-22 14:03:09,343][ceph_volume.devices.raw.activate][INFO  ]  
Activating osd.0 uuid aacabeca-9adb-465c-88ee-935f06fa45f7 cluster  
e57f7b6a-a8d9-11ef-af3c-fa163e2ad8c5


The ceph-volume lvm activate description [0] states:

It is possible to activate all existing OSDs at once by using the  
--all flag. For example:


ceph-volume lvm activate --all

This call will inspect all the OSDs created by ceph-volume that are  
inactive and will activate them one by one.


I assume that even if the OSD creation process could be tweaked in a  
way that all OSDs are created first without separate activation, and  
then cephadm would issue "ceph-volume lvm activate --all", the OSDs  
would still be activated one by one.


But as Tim already stated, an hour for almost 200 OSDs is not that  
bad. ;-) I guess you could create a tracker issue for an enhancement,  
maybe some of the devs can clarify why the OSDs need to be activated  
one by one.


Regards,
Eugen

[0] https://docs.ceph.com/en/latest/ceph-volume/lvm/activate/


Zitat von YuFan Chen :


Hi,

I’m setting up a 6-node Ceph cluster using Ceph Squid.
Each node is configured with 32 OSDs (32 HDDs and 8 NVMe SSDs for  
db_devices).


I’ve created an OSD service specification and am using cephadm to
apply the configuration.
The deployment of all 192 OSDs takes about an hour to complete.
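
(For context, a cephadm OSD spec for this kind of layout typically looks
something like the following sketch; the service_id and filters are
assumptions, not the exact spec used here.)

service_type: osd
service_id: hdd_with_nvme_db
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0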

However, I’ve noticed that cephadm creates the OSDs sequentially.
Then, on each node, it starts a single OSD and waits for it to become
ready before moving on to the next.

Is there a way to speed up the OSD deployment process?
Thanks in advance for your help!

Best regards,
Yufan Chen
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: multisite sync issue with bucket sync

2024-11-22 Thread Christopher Durham
 Casey,
Is there any way to 'speed up' full sync? While I (now) understand why full
sync prioritizes older objects first, a use case of frequently reading/using
'recent' objects cannot be met in the current scenario of pulling older
objects first.

In my current scenario it appears that, after a time, the full sync starts
over again at some point, giving HTTP 304 responses in the radosgw logs on
the side it pulls from for all the objects it already has.

As such, I have no way of predicting how long it will take to do a full sync.
Can you help me understand why it does this (the 304 errors and an apparent
'restart' of the sync)? I may have local issues between the sites, but
understanding the sync process and the 304 responses would help me. Thanks.
Again, I have 18.2.4 on Rocky 9.

-Chris



On Thursday, November 21, 2024 at 12:53:58 PM MST, Casey Bodley 
 wrote:   

 hey Chris,

On Wed, Nov 20, 2024 at 6:02 PM Christopher Durham  wrote:
>
> Casey,
>
> OR, is there a way to continue on with new data syncing (incremental) as the 
> full sync catches up, as the full sync will take a long time, and no new 
> incremental data is being replicated.

full sync walks through the entire bucket listing, so it will visit
some new objects along the way. but where possible, multisite tries to
prioritize sync of older data because minimizing the average
time-to-sync is important for disaster recovery. if we tried to
process both full and incremental at the same time, that would slow
down the full sync and some objects could take far longer to
replicate. it would also be less efficient overall, because longer
full sync means more overlap and duplicated effort with incremental

>
> -Chris
>
> On Wednesday, November 20, 2024 at 03:30:40 PM MST, Christopher Durham 
>  wrote:
>
>
> Casey,
>
> Thanks for your response. So is there a way to abandon a full sync and just 
> move on with an incremental from the time you abandon the full sync?

i'm afraid not. multisite tries very hard to maintain consistency
between zones, so it's not easy to subvert that. 'radosgw-admin bucket
sync init' is probably the only command that can modify bucket sync
status.
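
For watching progress rather than modifying it, the read-only status commands
are safe to poll (the bucket name is a placeholder):

radosgw-admin sync status
radosgw-admin bucket sync status --bucket <bucket>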

>
> -Chris
>
> On Wednesday, November 20, 2024 at 12:29:26 PM MST, Casey Bodley 
>  wrote:
>
>
> On Wed, Nov 20, 2024 at 2:10 PM Christopher Durham  wrote:
> >
> >  Ok,
> > Source code review reveals that full sync is marker based, and sync errors
> > within a marker group *suggest* that data within the marker is re-checked
> > (I may be wrong about this, but that is consistent with my 304 errors
> > below). I do, however, have the following question:
> > Is there a way to otherwise abort a full sync of a bucket (as a result of
> > radosgw-admin bucket sync init --bucket <bucket> and bucket sync run, or a
> > restart of radosgw), and have it just do incremental sync from then on (yes,
> > accepting that the objects will not be the same on both sides prior to the
> > 'restart' of an incremental sync)?
> > Would radosgw-admin bucket sync disable --bucket <bucket> followed by
> > radosgw-admin bucket sync enable --bucket <bucket> do this? Or would that
> > do another full sync and not an incremental?
>
> 'bucket sync enable' does start a new full sync (to catch objects that
> were uploaded since 'bucket sync disable')
>
>
> Thanks
> > -Chris
> >
> >    On Thursday, November 14, 2024 at 04:18:34 PM MST, Christopher Durham 
> > wrote:
> >
> >  Hi,
> > I have heard nothing on this, but have done some more research.
> > Again, both sides of a multisite s3 configuration are ceph 18.2.4 on Rocky 
> > 9.
> > For a given bucket, there are thousands of 'missing' objects. I did:
> > radosgw-admin bucket sync init --bucket <bucket> --src-zone <source zone>
> > Sync starts after I restart a radosgw on the source zone that has a
> > sync thread.
> > But based on the number and size of objects needing replication, it NEVER
> > finishes, as more objects are created as I am going. I may need to increase
> > the number of radosgws and/or the sync threads.
> >
> > What I have discovered is that if a radosgw on the side with missing objects
> > is restarted, all syncing starts over! In other words, it starts polling each
> > object, getting a 304 error in the radosgw log on the server on the
> > multisite side that has the missing objects. It *appears* to do this
> > sequential object scan in lexicographic order of object and/or prefix name,
> > although I cannot be sure.
> >
> > So some questions:
> > 1. Is there a recommendation/rule of thumb/formula for the number of
> > radosgws/sync threads/etc. based on the number of objects, buckets,
> > bandwidth, etc?
> > 2. Why does the syncing restart for a bucket when a radosgw is restarted?
> > Is there a way to tell it to restart where it left off as opposed to
> > starting over? There may be reasons to restart a bucket sync if a radosgw
> > restarts, but there should be a way to checkpoint/force it to not
> > restart/start where left off, etc.
> > 3. Is there a way to 'abort' the sync and cause the bucket to think it

[ceph-users] Re: How to speed up OSD deployment process

2024-11-22 Thread YuFan Chen
Hi,

Thank you. The Ceph cluster is running smoothly so far. However, during our
testing, we re-installed it multiple times and observed that the
ceph-volume command took over a minute to activate the OSD.
In the activation stage, ceph-volume called "ceph-bluestore-tool
show-label".

 It appears that the command scans all disks to identify which disk is
being activated.
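
For anyone who wants to see what that step does on a single device, the tool
can also be pointed at one LV directly (a sketch; the LV path is just an
example of what ceph-volume creates):

ceph-bluestore-tool show-label --dev /dev/ceph-<vg-uuid>/osd-block-<osd-uuid>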

Best regards,
Yufan Chen

On Sat, 23 Nov 2024 at 00:50, Joachim Kraftmayer wrote:

> Hi,
> I can remember that, with the tools before cephadm, it somehow took 8 hours
> to deploy a ceph cluster with more than 2000 OSDs.
>
> But I also know that CBT has a much faster approach to installing a Ceph
> cluster.
> Just an idea: maybe you can look at the approach in CBT to make cephadm
> faster.
>
> Regards, Joachim
>
>   joachim.kraftma...@clyso.com
>
>   www.clyso.com
>
>   Hohenzollernstr. 27, 80801 Munich
> 
>
> Utting | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE275430677
>
>
>
On Fri, 22 Nov 2024 at 15:46, Eugen Block wrote:
>
> > Hi,
> >
> > I don't see how it would be currently possible. The OSD creation is
> > handled by ceph-volume, which activates each OSD separately:
> >
> > [2024-11-22 14:03:08,415][ceph_volume.main][INFO  ] Running command:
> > ceph-volume  activate --osd-id 0 --osd-uuid
> > aacabeca-9adb-465c-88ee-935f06fa45f7 --no-systemd --no-tmpfs
> >
> > [2024-11-22 14:03:09,343][ceph_volume.devices.raw.activate][INFO  ]
> > Activating osd.0 uuid aacabeca-9adb-465c-88ee-935f06fa45f7 cluster
> > e57f7b6a-a8d9-11ef-af3c-fa163e2ad8c5
> >
> > The ceph-volume lvm activate description [0] states:
> >
> > > It is possible to activate all existing OSDs at once by using the
> > > --all flag. For example:
> > >
> > > ceph-volume lvm activate --all
> > >
> > > This call will inspect all the OSDs created by ceph-volume that are
> > > inactive and will activate them one by one.
> >
> > I assume that even if the OSD creation process could be tweaked in a
> > way that all OSDs are created first without separate activation, and
> > then cephadm would issue "ceph-volume lvm activate --all", the OSDs
> > would still be activated one by one.
> >
> > But as Tim already stated, an hour for almost 200 OSDs is not that
> > bad. ;-) I guess you could create a tracker issue for an enhancement,
> > maybe some of the devs can clarify why the OSDs need to be activated
> > one by one.
> >
> > Regards,
> > Eugen
> >
> > [0] https://docs.ceph.com/en/latest/ceph-volume/lvm/activate/
> >
> >
> > Zitat von YuFan Chen :
> >
> > > Hi,
> > >
> > > I’m setting up a 6-node Ceph cluster using Ceph Squid.
> > > Each node is configured with 32 OSDs (32 HDDs and 8 NVMe SSDs for
> > > db_devices).
> > >
> > > I’ve created an OSD service specification and am using cephadm to
> > > apply the configuration.
> > > The deployment of all 192 OSDs takes about an hour to complete.
> > >
> > > However, I’ve noticed that cephadm creates the OSDs sequentially.
> > > Then, on each node, it starts a single OSD and waits for it to become
> > > ready before moving on to the next.
> > >
> > > Is there a way to speed up the OSD deployment process?
> > > Thanks in advance for your help!
> > >
> > > Best regards,
> > > Yufan Chen
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to speed up OSD deployment process

2024-11-22 Thread Joachim Kraftmayer
Hi,
I can remember that, with the tools before cephadm, it somehow took 8 hours
to deploy a ceph cluster with more than 2000 OSDs.

But I also know that CBT has a much faster approach to installing a Ceph
cluster.
Just an idea: maybe you can look at the approach in CBT to make cephadm
faster.

Regards, Joachim

  joachim.kraftma...@clyso.com

  www.clyso.com

  Hohenzollernstr. 27, 80801 Munich

Utting | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE275430677



On Fri, 22 Nov 2024 at 15:46, Eugen Block wrote:

> Hi,
>
> I don't see how it would be currently possible. The OSD creation is
> handled by ceph-volume, which activates each OSD separately:
>
> [2024-11-22 14:03:08,415][ceph_volume.main][INFO  ] Running command:
> ceph-volume  activate --osd-id 0 --osd-uuid
> aacabeca-9adb-465c-88ee-935f06fa45f7 --no-systemd --no-tmpfs
>
> [2024-11-22 14:03:09,343][ceph_volume.devices.raw.activate][INFO  ]
> Activating osd.0 uuid aacabeca-9adb-465c-88ee-935f06fa45f7 cluster
> e57f7b6a-a8d9-11ef-af3c-fa163e2ad8c5
>
> The ceph-volume lvm activate description [0] states:
>
> > It is possible to activate all existing OSDs at once by using the
> > --all flag. For example:
> >
> > ceph-volume lvm activate --all
> >
> > This call will inspect all the OSDs created by ceph-volume that are
> > inactive and will activate them one by one.
>
> I assume that even if the OSD creation process could be tweaked in a
> way that all OSDs are created first without separate activation, and
> then cephadm would issue "ceph-volume lvm activate --all", the OSDs
> would still be activated one by one.
>
> But as Tim already stated, an hour for almost 200 OSDs is not that
> bad. ;-) I guess you could create a tracker issue for an enhancement,
> maybe some of the devs can clarify why the OSDs need to be activated
> one by one.
>
> Regards,
> Eugen
>
> [0] https://docs.ceph.com/en/latest/ceph-volume/lvm/activate/
>
>
> Zitat von YuFan Chen :
>
> > Hi,
> >
> > I’m setting up a 6-node Ceph cluster using Ceph Squid.
> > Each node is configured with 32 OSDs (32 HDDs and 8 NVMe SSDs for
> > db_devices).
> >
> > I’ve created an OSD service specification and am using cephadm to
> > apply the configuration.
> > The deployment of all 192 OSDs takes about an hour to complete.
> >
> > However, I’ve noticed that cephadm creates the OSDs sequentially.
> > Then, on each node, it starts a single OSD and waits for it to become
> > ready before moving on to the next.
> >
> > Is there a way to speed up the OSD deployment process?
> > Thanks in advance for your help!
> >
> > Best regards,
> > Yufan Chen
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] please unsubscribe

2024-11-22 Thread Debian 108

Hi
this is the tenth unsubscribe mail I have sent, and after a few minutes I
receive another email.



please, could some admin delete my email from the mailing list?

thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Migrated to cephadm, rgw logs to file even when rgw_ops_log_rados is true

2024-11-22 Thread Paul JURCO
Hi,
we recently migrated a 18.2.2 ceph cluster (Ubuntu with docker) from
ceph-deploy to cephadm.
RGWs are separate VMs.
We noticed syslog increased a lot due to rgw's access logs being sent to it.
And because we log ops, there is a huge ops log file at
/var/log/ceph/cluster-id/ops-log-ceph-client.rgw.hostname-here.log.

While "rgw_ops_log_rados" is "true", the ops log goes to both the file and
the rados log pool.
If false, it doesn't log anything, as expected.
How can I stop dockerized rgws from logging to syslog and to a file on disk,
but keep the ops log in the log pool?

Config is:
global  basic     log_to_journald     false
global  advanced  rgw_enable_ops_log  false
global  advanced  rgw_ops_log_rados   true

A few hours after enabling it back, and after a massive cleanup, it
does log ops, but only to files.
How can I get the ops log into the rados pool and the access log into a file
on disk, but not into syslog?
I have added this to daemon.json to limit the access logs accumulating in the
/var/log/docker/containers/rand/rand/json.log file:

{
  "log-driver": "local",
  "log-opts": {
"max-size": "512m",
"max-file": "3"
  }
}


Thank you!
Paul
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [CephFS] Completely exclude some MDS rank from directory processing

2024-11-22 Thread Frédéric Nass
Well... only because I had this discussion in the back of my mind when I
watched the video yesterday. ;-)

Cheers,
Frédéric.

- On 22 Nov 24, at 8:59, Eugen Block ebl...@nde.ag wrote:

> Then you were clearly paying more attention than me. ;-) We had some
> maintenance going on during that talk, so I couldn't really focus
> entirely on listening. But thanks for clarifying!
> 
> Zitat von Frédéric Nass :
> 
>> Hi Eugen,
>>
>> During the talk you've mentioned, Dan said there's a hard-coded
>> limit of 256 MDSs per cluster. So with one active and one
>> standby-ish MDS per filesystem, that would be 128 filesystems at
>> most per cluster.
>> Mark said he got 120, but... things start to get wacky by 80. :-)
>>
>> More fun to come, for sure.
>>
>> Cheers,
>> Frédéric.
>>
>> [1] https://youtu.be/qiCE1Ifws80?t=2602
>>
>> - On 21 Nov 24, at 9:36, Eugen Block ebl...@nde.ag wrote:
>>
>>> I'm not aware of any hard limit for the number of Filesystems, but
>>> that doesn't really mean very much. IIRC, last week during a Clyso
>>> talk at Eventbrite I heard someone say that they deployed around 200
>>> Filesystems or so, I don't remember if it was a production environment
>>> or just a lab environment. I assume that you would probably be limited
>>> by the number of OSDs/PGs rather than by the number of Filesystems,
>>> 200 Filesystems require at least 400 pools. But maybe someone else has
>>> more experience in scaling CephFS that way. What we did was to scale
>>> the number of active MDS daemons for one CephFS. I believe in the end
>>> the customer had 48 MDS daemons on three MDS servers, 16 of them were
>>> active with directory pinning, at that time they had 16 standby-replay
>>> and 16 standby daemons. But it turned out that standby-replay didn't
>>> help their use case, so we disabled standby-replay.
>>>
>>> Can you show the entire 'ceph fs status' output? And maybe also 'ceph
>>> fs dump'?
>>>
>>> Zitat von Александр Руденко :
>>>
>
> Just for testing purposes, have you tried pinning rank 1 to some other
> directory? Does it still break the CephFS if you stop it?


 Yes, nothing changed.

 It's no problem that the FS hangs when one of the ranks goes down, we will have
 standby-replay for all ranks. What I don't like is that a rank which is not
 pinned to a directory still handles some I/O of that directory, or of clients
 working with it.
 I mean that I can't robustly and fully separate client IO by ranks.

> Would it be an option to rather use multiple Filesystems instead of
> multi-active for one CephFS?


 Yes, it's an option. But it is much more complicated in our case. Btw, do
 you know how many different FS can be created in one cluster? Maybe you
 know some potential problems with 100-200 FSs in one cluster?

 On Wed, 20 Nov 2024 at 17:50, Eugen Block wrote:

> Ah, I misunderstood, I thought you wanted an even distribution across
> both ranks.
> Just for testing purposes, have you tried pinning rank 1 to some other
> directory? Does it still break the CephFS if you stop it? I'm not sure
> if you can prevent rank 1 from participating, I haven't looked into
> all the configs in quite a while. Would it be an option to rather use
> multiple Filesystems instead of multi-active for one CephFS?
>
> Zitat von Александр Руденко :
>
> > No, it's not a typo. It's a misleading example. :)
> >
> > dir1 and dir2 are pinned to rank 0, but the FS and dir1/dir2 can't work
> > without rank 1.
> > rank 1 is used for something when I work with these dirs.
> >
> > ceph 16.2.13, metadata balancer and policy based balancing not used.
> >
> > On Wed, 20 Nov 2024 at 16:33, Eugen Block wrote:
> >
> >> Hi,
> >>
> >> > After pinning:
> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
> >>
> >> is this a typo? If not, you did pin both directories to the same rank.
> >>
> >> Zitat von Александр Руденко :
> >>
> >> > Hi,
> >> >
> >> > I try to distribute all top level dirs in CephFS by different MDS ranks.
> >> > I have two active MDS with rank *0* and *1* and I have 2 top dirs like
> >> > */dir1* and */dir2*.
> >> >
> >> > After pinning:
> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
> >> >
> >> > I can see next INOS and DNS distribution:
> >> > RANK  STATE    MDS   ACTIVITY      DNS    INOS   DIRS  CAPS
> >> >  0    active    c    Reqs: 127 /s  12.6k  12.5k  333   505
> >> >  1    active    b    Reqs: 11 /s   21     24     19    1
> >> >
> >> > When I write to dir1 I can see a small amount on Reqs: in rank 1.
> >> >
> >> > Events in journal of MDS with rank 1:
> >> > cephfs-journal-tool --rank=fs1:1 event get list
> >> >
> 

[ceph-users] Re: please unsubscribe

2024-11-22 Thread Anthony D'Atri
As previously disclosed, the list currently has issues. This is the first
request visible to me. I will get you unsubscribed tonight.

> On Nov 22, 2024, at 8:28 AM, Debian 108  wrote:
> 
> Hi
> this is the tenth unsubscribe mail I have sent, and after a few minutes I
> receive another email.
> 
> 
> please, could some admin delete my email from the mailing list?
> 
> thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io