[ceph-users] Re: Stale monitoring alerts in UI

2021-11-05 Thread Eugen Block

Hi,

Sometimes it helps to fail the MGR service. I just had this with a
customer last week, where we had to fail it twice within a few hours
because the information was not updated. That was on the latest Octopus.


ceph mgr fail
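
A minimal sketch of the full sequence, assuming at least one standby MGR is available:

ceph mgr stat            # shows the currently active MGR and the number of standbys
ceph mgr fail            # fail the active MGR so a standby takes over
ceph -s | grep mgr       # verify a standby became active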

As for the MTU mismatch, I believe there was a thread a few weeks ago,
but I don't have a link at hand. I also can't remember whether there was a
solution.



Quoting Zakhar Kirpichenko:


Hi,

I seem to have some stale monitoring alerts in my Mgr UI, which do not want
to go away. For example (I'm also attaching an image for your convenience):

MTU Mismatch: Node ceph04 has a different MTU size (9000) than the median
value on device storage-int.

The alert appears to be active, but doesn't reflect the actual situation:

06:00 [root@ceph04 ~]# ip li li | grep -E "ens2f0|ens3f0|8:
bond0|storage-int"
4: ens3f0:  mtu 9000 qdisc mq master
bond0 state UP mode DEFAULT group default qlen 1000
6: ens2f0:  mtu 9000 qdisc mq master
bond0 state UP mode DEFAULT group default qlen 1000
8: bond0:  mtu 9000 qdisc noqueue
state UP mode DEFAULT group default qlen 1000
10: storage-int@bond0:  mtu 9000 qdisc
noqueue state UP mode DEFAULT group default qlen 1000

I have similarly stuck alerts about 'high pg count deviation', which
triggered during a cluster rebalance but somehow never cleared, even though
all operations finished successfully and the CLI tools report that the cluster
is healthy. How can I clear these alerts?

I would very much appreciate any advice.

Best regards,
Zakhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io






[ceph-users] Re: large bucket index in multisite environement (how to deal with large omap objects warning)?

2021-11-05 Thread Boris Behrens
Hi Teoman,

I don't sync the bucket content; it's just the metadata that gets synced.
But turning off access to our S3 is not an option, because our customers
rely on it (they make backups and serve objects for their web applications
through it).

On Thu, Nov 4, 2021 at 18:20, Teoman Onay wrote:

> AFAIK dynamic resharding is not supported for multisite setups but you can
> reshard manually.
> Note that this is a very expensive process which requires you to:
>
> - disable the sync of the bucket you want to reshard.
> - Stop all the RGWs (no more access to your Ceph cluster)
> - On a node of the master zone, reshard the bucket
> - On the secondary zone, purge the bucket
> - Restart the RGW(s)
> - re-enable sync of the bucket.
>
> 4m objects/bucket is way too much...
>
> Regards
>
> Teoman
>
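
For reference, the quoted steps above might look roughly like the following. This is only a sketch with a hypothetical bucket name; the exact procedure for your release should be taken from the multisite resharding documentation.

# on the master zone: stop syncing the bucket
radosgw-admin bucket sync disable --bucket=mybucket
# stop all RGW instances in all zones (service name depends on the deployment)
systemctl stop ceph-radosgw.target
# on a node in the master zone: reshard the bucket
radosgw-admin bucket reshard --bucket=mybucket --num-shards=101
# on the secondary zone: purge the bucket
radosgw-admin bucket rm --bucket=mybucket --purge-objects
# restart the RGWs, then re-enable sync on the master zone
systemctl start ceph-radosgw.target
radosgw-admin bucket sync enable --bucket=mybucket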
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: large bucket index in multisite environement (how to deal with large omap objects warning)?

2021-11-05 Thread Boris Behrens
Cheers Istvan,

how do you do this?

On Thu, Nov 4, 2021 at 19:45, Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> wrote:

> This one you need to prepare: you need to preshard any bucket which you
> know will hold many millions of objects.
>
> I have a bucket where we store 1.2 billion objects with 24xxx shards.
> No omap issues.
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
>

-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Optimal Erasure Code profile?

2021-11-05 Thread Eugen Block

Hi,

since you can't change a pool's EC profile afterwards, you have to
choose a reasonable number of chunks. If you need to start with those
6 hosts, I would also recommend spanning the EC profile across all of
those nodes, but keep in mind that the cluster won't be able to recover
if a host fails. That will of course become possible once you expand your
cluster. Which profile to choose for the 6 chunks depends on your
resiliency requirements: how many host failures do you need/want to
sustain? 4:2 is a reasonable profile; your clients would not notice one
host failure, but IO would pause if a second host fails (because the
default min_size is k + 1), while data loss would still be prevented.
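
A minimal sketch of creating such a profile and a pool on top of it (profile name, pool name and PG count are just placeholders):

ceph osd erasure-code-profile set ec42-hdd k=4 m=2 crush-failure-domain=host crush-device-class=hdd
ceph osd erasure-code-profile get ec42-hdd
ceph osd pool create ecpool 128 128 erasure ec42-hdd
ceph osd pool get ecpool min_size    # defaults to k + 1 = 5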


Regards,
Eugen


Quoting Zakhar Kirpichenko:


Hi!

I've got a CEPH 16.2.6 cluster, the hardware is 6 x Supermicro SSG-6029P
nodes, each equipped with:

2 x Intel(R) Xeon(R) Gold 5220R CPUs
384 GB RAM
2 x boot drives
2 x 1.6 TB enterprise NVME drives (DB/WAL)
2 x 6.4 TB enterprise drives (storage tier)
9 x 9TB HDDs (storage tier)
2 x Intel XL710 NICs connected to a pair of 40/100GE switches

Please help me understand the calculation / choice of the optimal EC
profile for this setup. I would like the EC pool to span all 6 nodes on HDD
only and have the optimal combination of resiliency and efficiency with the
view that the cluster will expand. Previously when I had only 3 nodes I
tested EC with:

crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=1
plugin=jerasure
technique=reed_sol_van
w=8

I am leaning towards using the above profile with k=4,m=2 for "production"
use, but am not sure that I understand the math correctly, that this
profile is optimal for my current setup, and that I'll be able to scale it
properly by adding new nodes. I would very much appreciate any advice!

Best regards,
Zakhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io






[ceph-users] Re: Optimal Erasure Code profile?

2021-11-05 Thread Sebastian Mazza
Hi Zakhar,

I don't have much experience with Ceph, so you should read my words with 
reasonable skepticism.

If your failure domain should be the host level, then k=4, m=2 is your most
space-efficient option for 6 servers that still allows write IO when one of
the servers has failed, assuming that you want your pool min_size to be at
least k+1. However, your cluster will not be able to “heal itself” in the event
of a single server outage, since 5 servers are not enough to distribute PGs for
“k=4, m=2”.

When you add one more server (7 in total) with enough space, your cluster will
be able to self-heal from one server outage. Alternatively, you can create a new
profile/rule with “k=5, m=2”, “rebalance” your data onto it, and get better
space efficiency.

I think for a production setup you should go with a CRUSH rule that establishes
a simple host-based failure domain, as you already stated. Furthermore, your
pool min_size for write IO should be at least k+1, so that you are not in
immediate danger of losing data after the first OSD/host failure. However, for
the sake of completeness I want to mention that it is possible to create
CRUSH rules that do not consider hosts and only use the OSD level as the failure
domain. This would allow you to use erasure coding with a larger k (e.g.
k=6, m=2). Furthermore, it is also possible to create CRUSH rules that
remain resilient against a server outage while the pool has “k+m > your
server count”.
See 
* 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033502.html
* 
http://cephnotes.ksperis.com/blog/2017/01/27/erasure-code-on-small-clusters

Example of a CRUSH rule that uses 5 hosts and 2 OSDs per host:
---
rule ec5x2hdd { 
id 123 
type erasure 
min_size 10 
max_size 10 
step set_chooseleaf_tries 5 
step set_choose_tries 100 
step take default class hdd 
step choose indep 5 type host 
step choose indep 2 type osd 
step emit 
}
This selects 5 servers and uses 2 OSDs on each server. With an erasure-coded
pool of “k=7, m=3” the system could take one failing server or two failing HDDs
before you lose write IO. The system can also survive one host + one OSD or 3
OSD failures before you lose data. This would give you theoretically 70% usable
space with only 5 active servers, instead of 66% with the simple host failure
domain on 6 active servers and “k=4, m=2” erasure coding.
But I don't advise you to do that!
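
If you did want to experiment with such a rule on a test cluster, the usual edit cycle is roughly the following sketch (file names are arbitrary):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# add the rule above to crushmap.txt, then recompile and inject it
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new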


Best 
Sebastian


> On 05.11.2021, at 06:14, Zakhar Kirpichenko  wrote:
> 
> Hi!
> 
> I've got a CEPH 16.2.6 cluster, the hardware is 6 x Supermicro SSG-6029P
> nodes, each equipped with:
> 
> 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> 384 GB RAM
> 2 x boot drives
> 2 x 1.6 TB enterprise NVME drives (DB/WAL)
> 2 x 6.4 TB enterprise drives (storage tier)
> 9 x 9TB HDDs (storage tier)
> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
> 
> Please help me understand the calculation / choice of the optimal EC
> profile for this setup. I would like the EC pool to span all 6 nodes on HDD
> only and have the optimal combination of resiliency and efficiency with the
> view that the cluster will expand. Previously when I had only 3 nodes I
> tested EC with:
> 
> crush-device-class=hdd
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=2
> m=1
> plugin=jerasure
> technique=reed_sol_van
> w=8
> 
> I am leaning towards using the above profile with k=4,m=2 for "production"
> use, but am not sure that I understand the math correctly, that this
> profile is optimal for my current setup, and that I'll be able to scale it
> properly by adding new nodes. I would very much appreciate any advice!
> 
> Best regards,
> Zakhar
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph-Dokan Mount Caps at ~1GB transfer?

2021-11-05 Thread Mason-Williams, Gabryel (RFI,RAL,-)
Hello,

I have tried with a native client under Linux and its performance is fine; also,
the performance below 1GB is fine on the Windows machine.

Kind regards

Gabryel

From: Radoslav Milanov 
Sent: 01 November 2021 12:55
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: Ceph-Dokan Mount Caps at ~1GB transfer?

Have you tried this with the native client under Linux? It could be
just slow cephfs?

On 1.11.2021 at 06:40, Mason-Williams, Gabryel (RFI,RAL,-) wrote:
> Hello,
>
> We have been trying to use Ceph-Dokan to mount cephfs on Windows. When 
> transferring any data below ~1GB the transfer speed is as quick as desired 
> and works perfectly. However, once more than ~1GB has been transferred the 
> connection stops being able to send data and everything seems to just hang.
>
> I've ruled out it being a quota problem, as I can transfer just under 1GB,
> close the connection, reopen it, and then transfer just under 1GB again
> with no issues.
>
> Windows Version: 10
> Dokan Version: 1.3.1.1000
>
> Does anyone have any idea why this is occurring and have any suggestions on 
> how to fix it?
>
> Kind regards
>
> Gabryel Mason-Williams
>
> Junior Research Software Engineer
>
> Please bear in mind that I work part-time, so there may be a delay in my 
> response.
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Optimal Erasure Code profile?

2021-11-05 Thread Zakhar Kirpichenko
Many thanks for your detailed advice, gents, I very much appreciate it!

I read in various places that for production environments it's advised to
keep (k+m) <= host count. Looks like for my setup it is 3+2 then. Would it
be best to proceed with 3+2, or should we go with 4+2?

/Z

On Fri, Nov 5, 2021 at 1:33 PM Sebastian Mazza 
wrote:

> Hi Zakhar,
>
> I don't have much experience with Ceph, so you should read my words with
> reasonable skepticism.
>
> If your failure domain should be the host level, then k=4, m=2 is you most
> space efficient option for 6 server that allows you to still do write IO
> when one of the servers failed. Assuming that you want you pool min-size to
> be at least k+1. However, your cluster will not be able to “heal itself” in
> the event of a single server outage, since 5 servers are not enough to
> distribute pgs for “k=4, m=2”.
>
> When you add one more server (7 in total) with enough space your cluster
> will be able to self heal from one server outage. Or you can create a new
> CRUSH rule with “k=5, m=2” and “rebalance” your data with this new rule,
> and get a better space efficiency.
>
> I think for a production setup you should go with a CRUSH rule that
> establishes a simple host based failure domain like you already stated.
> Furthermore, your pool min-size for write IO should be at least k+1, so
> that you are not in immediate danger of losing data after the first
> OSD/Host failure. However, for the sake of completeness I want to mention
> that it is possible to to create CRUSH rules that does not consider Hosts
> and only uses the OSD level as failure domain. This would allow you to use
> erasure coding also with a larger k (e.g. k=6, n=2). Furthermore, it is
> also possible to create CRUSH rules that establishes resilient agains a
> server outage while the pool has “k+m > your server count”.
> See
> *
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033502.html
> *
> http://cephnotes.ksperis.com/blog/2017/01/27/erasure-code-on-small-clusters
>
> Example of a CRUSH rule that uses 5 hosts and 2 OSDs per host:
> ---
> rule ec5x2hdd {
> id 123
> type erasure
> min_size 10
> max_size 10
> step set_chooseleaf_tries 5
> step set_choose_tries 100
> step take default class hdd
> step choose indep 5 type host
> step choose indep 2 type osd
> step emit
> }
> This selects 5 servers and uses 2 OSDs on each server. With an erasure
> coded pool of “k=7, m=3” the system could take one failing server or two
> failing HDDs before you loose write IO. The system can also survive one
> Host + one OSD or 3 OSD fails before you loose data. This would give you
> theoretically 70% usable space with only 5 active servers. Instead of 66%
> with the simple Host failure domain on 6 aktive servers and “k=4, m=2”
> erasure coding.
> But I don't advise you to do that!
>
>
> Best
> Sebastian
>
>
> > On 05.11.2021, at 06:14, Zakhar Kirpichenko  wrote:
> >
> > Hi!
> >
> > I've got a CEPH 16.2.6 cluster, the hardware is 6 x Supermicro SSG-6029P
> > nodes, each equipped with:
> >
> > 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> > 384 GB RAM
> > 2 x boot drives
> > 2 x 1.6 TB enterprise NVME drives (DB/WAL)
> > 2 x 6.4 TB enterprise drives (storage tier)
> > 9 x 9TB HDDs (storage tier)
> > 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
> >
> > Please help me understand the calculation / choice of the optimal EC
> > profile for this setup. I would like the EC pool to span all 6 nodes on
> HDD
> > only and have the optimal combination of resiliency and efficiency with
> the
> > view that the cluster will expand. Previously when I had only 3 nodes I
> > tested EC with:
> >
> > crush-device-class=hdd
> > crush-failure-domain=host
> > crush-root=default
> > jerasure-per-chunk-alignment=false
> > k=2
> > m=1
> > plugin=jerasure
> > technique=reed_sol_van
> > w=8
> >
> > I am leaning towards using the above profile with k=4,m=2 for
> "production"
> > use, but am not sure that I understand the math correctly, that this
> > profile is optimal for my current setup, and that I'll be able to scale
> it
> > properly by adding new nodes. I would very much appreciate any advice!
> >
> > Best regards,
> > Zakhar
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Optimal Erasure Code profile?

2021-11-05 Thread Zakhar Kirpichenko
Thanks! I'll stick to 3:2 for now then.

/Z

On Fri, Nov 5, 2021 at 1:55 PM Szabo, Istvan (Agoda) 
wrote:

> With 6 servers I'd go with 3:2, with 7 can go with 4:2.
>
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
> -Original Message-
> From: Zakhar Kirpichenko 
> Sent: Friday, November 5, 2021 6:45 PM
> To: ceph-users 
> Subject: [ceph-users] Re: Optimal Erasure Code profile?
>
>
> Many thanks for your detailed advices, gents, I very much appreciate them!
>
> I read in various places that for production environments it's advised to
> keep (k+m) <= host count. Looks like for my setup it is 3+2 then. Would it
> be best to proceed with 3+2, or should we go with 4+2?
>
> /Z
>
> On Fri, Nov 5, 2021 at 1:33 PM Sebastian Mazza 
> wrote:
>
> > Hi Zakhar,
> >
> > I don't have much experience with Ceph, so you should read my words
> > with reasonable skepticism.
> >
> > If your failure domain should be the host level, then k=4, m=2 is you
> > most space efficient option for 6 server that allows you to still do
> > write IO when one of the servers failed. Assuming that you want you
> > pool min-size to be at least k+1. However, your cluster will not be
> > able to “heal itself” in the event of a single server outage, since 5
> > servers are not enough to distribute pgs for “k=4, m=2”.
> >
> > When you add one more server (7 in total) with enough space your
> > cluster will be able to self heal from one server outage. Or you can
> > create a new CRUSH rule with “k=5, m=2” and “rebalance” your data with
> > this new rule, and get a better space efficiency.
> >
> > I think for a production setup you should go with a CRUSH rule that
> > establishes a simple host based failure domain like you already stated.
> > Furthermore, your pool min-size for write IO should be at least k+1,
> > so that you are not in immediate danger of losing data after the first
> > OSD/Host failure. However, for the sake of completeness I want to
> > mention that it is possible to to create CRUSH rules that does not
> > consider Hosts and only uses the OSD level as failure domain. This
> > would allow you to use erasure coding also with a larger k (e.g. k=6,
> > n=2). Furthermore, it is also possible to create CRUSH rules that
> > establishes resilient agains a server outage while the pool has “k+m >
> your server count”.
> > See
> > *
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033502.html
> > *
> > http://cephnotes.ksperis.com/blog/2017/01/27/erasure-code-on-small-clu
> > sters
> >
> > Example of a CRUSH rule that uses 5 hosts and 2 OSDs per host:
> > ---
> > rule ec5x2hdd {
> > id 123
> > type erasure
> > min_size 10
> > max_size 10
> > step set_chooseleaf_tries 5
> > step set_choose_tries 100
> > step take default class hdd
> > step choose indep 5 type host
> > step choose indep 2 type osd
> > step emit
> > }
> > This selects 5 servers and uses 2 OSDs on each server. With an erasure
> > coded pool of “k=7, m=3” the system could take one failing server or
> > two failing HDDs before you loose write IO. The system can also
> > survive one Host + one OSD or 3 OSD fails before you loose data. This
> > would give you theoretically 70% usable space with only 5 active
> > servers. Instead of 66% with the simple Host failure domain on 6 aktive
> servers and “k=4, m=2”
> > erasure coding.
> > But I don't advise you to do that!
> >
> >
> > Best
> > Sebastian
> >
> >
> > > On 05.11.2021, at 06:14, Zakhar Kirpichenko  wrote:
> > >
> > > Hi!
> > >
> > > I've got a CEPH 16.2.6 cluster, the hardware is 6 x Supermicro
> > > SSG-6029P nodes, each equipped with:
> > >
> > > 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> > > 384 GB RAM
> > > 2 x boot drives
> > > 2 x 1.6 TB enterprise NVME drives (DB/WAL)
> > > 2 x 6.4 TB enterprise drives (storage tier)
> > > 9 x 9TB HDDs (storage tier)
> > > 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
> > >
> > > Please help me understand the calculation / choice of the optimal EC
> > > profile for this setup. I would like the EC pool to span all 6 nodes
> > > on
> > HDD
> > > only and have the optimal combination of resiliency and efficiency
> > > with
> > the
> > > view that the cluster will expand. Previously when I had only 3
> > > nodes I tested EC with:
> > >
> > > crush-device-class=hdd
> > > crush-failure-domain=host
> > > crush-root=default
> > > jerasure-per-chunk-alignment=false
> > > k=2
> > > m=1
> > > plugin=jerasure
> > > technique=reed_sol_van
> > > w=8
> > >
> > > I am leaning towards using the above profile with k=4,m=2 for
> > "production"
> > > use, but am not s

[ceph-users] Re: large bucket index in multisite environement (how to deal with large omap objects warning)?

2021-11-05 Thread mhnx
I also use this method and I hate it.

Stopping all of the RGW clients is never an option! It shouldn't be.
Sharding is hell. I had 250M objects in a bucket, the reshard failed
after 2 days, and the object count somehow doubled! 2 days of downtime is not an
option.

I wonder: if I stop reads and writes on a bucket while resharding it, is
there any problem using the RGWs with all the other buckets?

Nowadays I advise splitting buckets as much as you can! That means changing
your app's directory tree, but this design requires it.
You need to plan the object count at least 5 years ahead and create the buckets
accordingly. Usually I use 101 shards, which means 10,100,000 objects.
Also, if I need to use versioning I use 2x101 or 3x101, because version counts are
hard to predict. You need to predict how many versions you need and set a
lifecycle even before using the bucket!
The maximum shard count that I use is 1999. I'm not happy about it, but sometimes
you gotta do what you need to do.
Fighting with customers is not an option; you can only advise changing
their app's folder tree, but I've never seen someone accept the deal without
arguing.
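
As a quick sanity check on existing buckets, the index fill level can be inspected roughly like this (the bucket name is just a placeholder):

radosgw-admin bucket limit check                 # per-bucket shard fill status
radosgw-admin bucket stats --bucket=mybucket     # shard count and object count for one bucket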

My offers are usually like this:
1- Core files bucket: no changes or only very limited changes needed. "Calculate
the object count and multiply by 2."
2- Hot data bucket: there will be daily changes and versioning. "Calculate
the object count and multiply by 3."
3- Cold data bucket[s]: there will be no daily changes. You should open new
buckets every year or month. This is good to keep things clean and steady. No
need for versioning, and multisite will not suffer because there are barely any changes.
4- Temp files bucket[s]: this is so important. If you're crawling millions
upon millions of objects every day and delete them at the end of the week or month,
then you should definitely use a temp bucket. No versioning, no multisite,
no index if possible.



On Fri, Nov 5, 2021 at 12:30, Szabo, Istvan (Agoda) wrote:

> You mean prepare or reshard?
> Prepare:
> I collect as much information for the users before onboarding so I can
> prepare for their use case in the future and set things up.
>
> Preshard:
> After creating the bucket:
> radosgw-admin bucket reshard --bucket=ex-bucket --num-shards=101
>
> Also when you shard the buckets, you need to use prime numbers.
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
> From: Boris Behrens 
> Sent: Friday, November 5, 2021 4:22 PM
> To: Szabo, Istvan (Agoda) ; ceph-users@ceph.io
> Subject: Re: [ceph-users] large bucket index in multisite environement
> (how to deal with large omap objects warning)?
>
> Cheers Istvan,
>
> how do you do this?
>
> Am Do., 4. Nov. 2021 um 19:45 Uhr schrieb Szabo, Istvan (Agoda) <
> istvan.sz...@agoda.com>:
> This one you need to prepare, you beed to preshard the bucket which you
> know that will hold more than millions of objects.
>
> I have a bucket where we store 1.2 billions of objects with 24xxx shard.
> No omap issue.
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
>
>
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> groüen Saal.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-05 Thread Manuel Lausch
Hi Sage,

I tested again with paxos_propose_interval = 0.3.
Now stopping OSDs causes far fewer slow ops, and while starting OSDs the
slow ops seem to be gone.
With osd_fast_shutdown_notify_mon = true the slow ops are gone
completely, so I would like to keep the shutdown notify enabled.
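
For reference, the two settings can be applied centrally roughly like this (a sketch; your deployment may manage configuration differently):

ceph config set mon paxos_propose_interval 0.3
ceph config set osd osd_fast_shutdown_notify_mon true
ceph config get mon paxos_propose_interval    # verify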

As far as I understand the paxos propose interval, this is the interval
in which messages are processed, leading, if necessary, to new
OSDMaps. Correct?
I wonder whether a smaller value could lead to load issues on bigger
clusters when something impactful happens, like a host or a whole rack going
down.

Thanks
Manuel

On Thu, 4 Nov 2021 17:51:55 -0500
Sage Weil  wrote:

> Can you try setting paxos_propose_interval to a smaller number,
> like .3 (by default it is 2 seconds) and see if that has any effect.
> 
> It sounds like the problem is not related to getting the OSD marked
> down (or at least that is not the only thing going on).  My next
> guess is that the peering process that follows needs to get OSDs'
> up_thru values to update and there is delay there.
> 
> Thanks!
> sage
> 
> 
> On Thu, Nov 4, 2021 at 4:15 AM Manuel Lausch 
> wrote:
> 
> > On Tue, 2 Nov 2021 09:02:31 -0500
> > Sage Weil  wrote:
> >
> >  
> > >
> > > Just to be clear, you should try
> > >   osd_fast_shutdown = true
> > >   osd_fast_shutdown_notify_mon = false  
> >
> > I added some logs to the tracker ticket with this options set.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-05 Thread Peter Lieven
On 04.11.21 at 23:51, Sage Weil wrote:
> Can you try setting paxos_propose_interval to a smaller number, like .3 (by 
> default it is 2 seconds) and see if that has any effect.
>
> It sounds like the problem is not related to getting the OSD marked down (or 
> at least that is not the only thing going on).  My next guess is that the 
> peering process that follows needs to get OSDs' up_thru values to update and 
> there is delay there.
>
> Thanks!
> sage


I remember that someone wrote earlier that the issues while upgrading from 
Nautilus to Octopus started at the point where the osd compat level is set to 
octopus.

So one of my initial guesses back when I tried to analyze this issue was that 
it has something to do with the new "read from all osds not just the primary" 
feature.

Does that make sense?


Best,

Peter



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-05 Thread Manuel Lausch
Maybe that was me in an earlier mail.
It started at the point where all replica partners are on Octopus.

This makes sense if I look at this code snippet:

  if (!HAVE_FEATURE(recovery_state.get_min_upacting_features(),
                    SERVER_OCTOPUS)) {
    dout(20) << __func__ << " not all upacting has SERVER_OCTOPUS" << dendl;
    return true;
  }

-> https://github.com/ceph/ceph/blob/v15.2.12/src/osd/PrimaryLogPG.cc#L772-L775


On Fri, 5 Nov 2021 14:20:00 +0100
Peter Lieven  wrote:
> 
> I remember that someone wrote earlier that the issues while upgrading
> from Nautilus to Octopus started at the point where the osd compat
> level is set to octopus.
> 
> So one of my initial guesses back when I tried to analyze this issue
> was that it has something to do with the new "read from all osds not
> just the primary" feature.
> 
> Makes that sense?
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: large bucket index in multisite environement (how to deal with large omap objects warning)?

2021-11-05 Thread Сергей Процун
There should not be any issues using RGW for other buckets while
re-sharding.

As for the number of objects doubling after the reshard, that is an interesting
situation. After a manual reshard is done, there might be leftovers from
the old bucket index, since during a reshard new .dir.new_bucket_index objects
are created. They contain all the data related to the objects which are stored
in the buckets.data pool. I am just wondering whether the issue with the doubled
number of objects was related to the old bucket index. If so, it is safe to
delete the old bucket index.
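
A sketch of how leftover index instances can be spotted; the list command is read-only (removal exists as "reshard stale-instances rm", but in a multisite setup check the documentation for your release before removing anything):

radosgw-admin reshard stale-instances list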

 In a perfect world, it would be ideal to know the eventual number of
objects inside the bucket and set the number of shards to the corresponding
value right from the start.

 In the real world, when the client re-purposes the bucket, we
have to deal with reshards.

On Fri, Nov 5, 2021 at 14:43, mhnx wrote:

> I also use this method and I hate it.
>
> Stopping all of the RGW clients is never an option! It shouldn't be.
> Sharding is hell. I was have 250M objects in a bucket and reshard failed
> after 2days and object count doubled somehow! 2 days of downtime is not an
> option.
>
> I wonder if I stop the write-read on a bucket and while resharding it is
> there any problem of using RGW's with all other buckets?
>
> Nowadays I advise splitting buckets as much as you can! That means changing
> your apps directory tree but this design requires it.
> You need to plan object count at least for 5 years and create ones.
> Usually I use 101 shards which means 10.100.000 objects.
> Also If I need to use versioning I use 2x101 or 3x101 because versions are
> hard to predict. You need to predict how many versions you need and set a
> lifecycle even before using the bucket!
> The max shard that I use 1999. I'm not happy about it but sometimes you
> gotta do what you need to do.
> Fighting with customers is not an option, you can only advise changing
> their apps folder tree but I've never seen someone accept the deal without
> arguing.
>
> My offers usually like this:
> 1- Core files bucket: no need to change or very limited changes. "calculate
> the object count and multiply with 2"
> 2- Hot data bucket: There will be daily changes and versioning. "calculate
> the object count and multiply with 3"
> 3- Cold data bucket[s]: There will be no daily changes. You should open new
> buckets every Year or Month. This is good to keep it clean and steady. No
> need for versioning and Multisite Will not suffer due to barely changes.
> 4- Temp files bucket[s]: This is so important. If you're crawling millions
> of millions objects everyday and delete it at the end of the week or month
> then you should definitely use a temp bucket.  No versioning, No multisite,
> No index if it's possible.
>
>
>
> Szabo, Istvan (Agoda) , 5 Kas 2021 Cum, 12:30
> tarihinde şunu yazdı:
>
> > You mean prepare or reshard?
> > Prepare:
> > I collect as much information for the users before onboarding so I can
> > prepare for their use case in the future and set things up.
> >
> > Preshard:
> > After created the bucket:
> > radosgw-admin bucket reshard --bucket=ex-bucket --num-shards=101
> >
> > Also when you shard the buckets, you need to use prime numbers.
> >
> > Istvan Szabo
> > Senior Infrastructure Engineer
> > ---
> > Agoda Services Co., Ltd.
> > e: istvan.sz...@agoda.com
> > ---
> >
> > From: Boris Behrens 
> > Sent: Friday, November 5, 2021 4:22 PM
> > To: Szabo, Istvan (Agoda) ; ceph-users@ceph.io
> > Subject: Re: [ceph-users] large bucket index in multisite environement
> > (how to deal with large omap objects warning)?
> >
> > Cheers Istvan,
> >
> > how do you do this?
> >
> > Am Do., 4. Nov. 2021 um 19:45 Uhr schrieb Szabo, Istvan (Agoda) <
> > istvan.sz...@agoda.com>:
> > This one you need to prepare, you beed to preshard the bucket which you
> > know that will hold more than millions of objects.
> >
> > I have a bucket where we store 1.2 billions of objects with 24xxx shard.
> > No omap issue.
> > Istvan Szabo
> > Senior Infrastructure Engineer
> > ---
> > Agoda Services Co., Ltd.
> > e: istvan.sz...@agoda.com
> > ---
> >
> >
> >
> > --
> > Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> > groüen Saal.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___

[ceph-users] Re: Ceph-Dokan Mount Caps at ~1GB transfer?

2021-11-05 Thread Lucian Petrut
Hi,

Did you build the Windows client yourself or is it a Suse or Cloudbase build? 
Which version is your Ceph cluster running, the one that you’re connecting to?

Early versions had a few known bugs which might behave like that (e.g. 
overflows or connection issues) but it shouldn’t be the case with recent 
versions. Could you try a recent msi from here? 
https://cloudbase.it/ceph-for-windows/

Thanks,
Lucian

From: Mason-Williams, Gabryel 
(RFI,RAL,-)
Sent: Friday, November 5, 2021 1:44 PM
To: Radoslav Milanov; 
ceph-users@ceph.io
Subject: [ceph-users] Re: Ceph-Dokan Mount Caps at ~1GB transfer?

Hello,

I have tried with a native client under Linux and its performance is fine, also 
the performance under 1GB is fine on the windows machine.

Kind regards

Gabryel

From: Radoslav Milanov 
Sent: 01 November 2021 12:55
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: Ceph-Dokan Mount Caps at ~1GB transfer?

Have you tries this with the native client under Linux ? It could be
just slow cephfs ?

On 1.11.2021 г. 06:40 ч., Mason-Williams, Gabryel (RFI,RAL,-) wrote:
> Hello,
>
> We have been trying to use Ceph-Dokan to mount cephfs on Windows. When 
> transferring any data below ~1GB the transfer speed is as quick as desired 
> and works perfectly. However, once more than ~1GB has been transferred the 
> connection stops being able to send data and everything seems to just hang.
>
> I've ruled out it being a quota problem as I can transfer just than just 
> under 1GB close the connection and then reopen it and then transfer just 
> under 1GB again, with no issues.
>
> Windows Version: 10
> Dokan Version: 1.3.1.1000
>
> Does anyone have any idea why this is occurring and have any suggestions on 
> how to fix it?
>
> Kind regards
>
> Gabryel Mason-Williams
>
> Junior Research Software Engineer
>
> Please bear in mind that I work part-time, so there may be a delay in my 
> response.
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] steady increasing of osd map epoch since octopus

2021-11-05 Thread Manuel Lausch
Hello,

I observed an interesting behavior change since upgrading to
Octopus and above: the OSD map epoch is constantly increasing.
Up to Nautilus the epoch only changed when OSDs went
down/out/up/in, snapshots were created or deleted, recovery or
backfilling took place, flags like noout were set or removed, and so
on. In most cases, there were no changes.
But since Nautilus, and in Pacific too, I see a steady increase in the
map epoch.
Is this expected behaviour since then?

Regards
Manuel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How can user home directory quotas be automatically set on CephFS?

2021-11-05 Thread Artur Kerge
Thank you, Magnus, for such a quick reply!

Good pointers in there!

Cheers,
Artur

On Tue, 2 Nov 2021 at 14:50, Magnus HAGDORN  wrote:

> Hi Artur,
> we did write a script (in fact a series of scripts) that we use to
> manage our users and their quotas. Our script adds a new user to our
> LDAP and sets the default quotas for various storage areas. Quota
> information is kept in the LDAP. Another script periodically scans the
> LDAP for changes and creates directories and adjusts quotas
> accordingly.
>
> I have a more detailed description of how we use cephfs to provide home
> directories on my blog here:
>
> https://blogs.ed.ac.uk/mhagdorn/2020/08/04/school-of-geosciences-file-storage/
>
> Hope this helps.
> magnus
>
> On Tue, 2021-11-02 at 13:31 +0200, Artur Kerge wrote:
> >
> > Hello!
> >
> > As I understand CephFS user max file and byte quotas
> > (ceph.quota.max_{files,bytes}) can be set on an MDS (or CephFS
> > client) via
> > setfattr command (https://docs.ceph.com/en/octopus/cephfs/quota/).
> >
> > My question is, how can the quotas be set automatically for every new
> > user's home directory?
> >
> > Currently the setup of adding a user and setting limits is following:
> > new
> > user is added on LDAP, home folder is created with skel on CephFS
> > (mounted
> > via kernel driver) and then the ceph.quota's are set. I'd like to
> > automate
> > the last step.
> >
> > One possible option is to create a script, that sets the
> > max_bytes/files
> > limits after folder creation but that seems hack-y.
> >
> > To re-iterate, is there a way to automatically set quota limits on a
> > CephFS
> > directory?
> >
> > Best regards
> > Artur Kerge
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336. Is e buidheann carthannais a th’ ann an
> Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.
>
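
For reference, the per-directory quota step performed by such a script might look roughly like this (mount path, user name and limits are hypothetical):

setfattr -n ceph.quota.max_bytes -v 107374182400 /cephfs/home/newuser   # 100 GiB
setfattr -n ceph.quota.max_files -v 1000000 /cephfs/home/newuser
getfattr -n ceph.quota.max_bytes /cephfs/home/newuser                   # read it back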
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cephalocon 2022 is official!

2021-11-05 Thread Mike Perez
Hello everyone!

I'm pleased to announce Cephalocon 2022 will be taking place April 5-7
in Portland, Oregon + Virtually!

The CFP is now open until December 10th, so don't delay! Registration
and sponsorship details will be available soon!

I am looking forward to seeing you all in person again soon!

https://ceph.io/en/community/events/2022/cephalocon-portland/

-- 
Mike Perez

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: steady increasing of osd map epoch since octopus

2021-11-05 Thread Dan van der Ster
Hi,

You can get two adjacent osdmap epochs (ceph osd getmap <epoch> -o map.<epoch>).
Then use osdmaptool to print those maps, hopefully revealing what is
changing between the two epochs.
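
A sketch of that comparison (the epoch numbers are placeholders):

ceph osd getmap 1234 -o map.1234
ceph osd getmap 1235 -o map.1235
osdmaptool --print map.1234 > map.1234.txt
osdmaptool --print map.1235 > map.1235.txt
diff map.1234.txt map.1235.txt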

Cheers, Dan


On Fri, Nov 5, 2021 at 4:54 PM Manuel Lausch  wrote:
>
> Hello,
>
> I observed some interessting behavior change since upgrading to
> octopus and above. The OSD map epoch is constantly increasing.
> Until nautilus the epoch did only change if OSDs went
> down/out/up/in, snapshots are created or deleted, recovery or
> backfilling took place, flags like the noout was set or deleted, and so
> on. In most cases, there were no changes.
> But since nautilus, and in pacific too, I see steady increase in the
> map epoch.
> Is this expcted behaviour since then?
>
> Regards
> Manuel
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Regarding bug #53139 "OSD might wrongly attempt to use "slow" device when single device is backing the store"

2021-11-05 Thread Igor Fedotov

Right, only setups where a single device backs everything are affected.
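
A rough way to check which OSDs use that layout, assuming the bluefs_dedicated_db/wal fields in the OSD metadata (the OSD id is a placeholder):

ceph osd metadata 0 | grep -E 'bluefs_dedicated_(db|wal)'
# "0" for both means data, DB and WAL share a single device, i.e. the affected layout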


On 11/5/2021 5:54 PM, J-P Methot wrote:

Hi,

I have a quick question regarding bug #53139, as the language in the 
report is slightly confusing. This bug affects any setup where a 
single OSD's data, Bluestore DB and WAL are all located on the same 
device, right? As opposed to the DB being on another device.



--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Regarding bug #53139 "OSD might wrongly attempt to use "slow" device when single device is backing the store"

2021-11-05 Thread Igor Fedotov
I haven't seen the beginning of the story - OSDs I was troubleshooting 
were failing on startup. But I think the initial failure had happened 
during regular operation - there is nothing specific to startup for that 
issue to pop up...



On 11/5/2021 10:47 PM, J-P Methot wrote:

I see. This issue only happens during the OSD startup process?

We are running 16.2.6 in production, so the last thing we want right 
now is a stuck OSD. We have to choose whether we roll back or upgrade
asap once a patch is out.


On 11/5/21 3:38 PM, Igor Fedotov wrote:
Right, setup with single device for everything is affected only. 



--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-05 Thread Sage Weil
Yeah, I think two different things are going on here.

The read leases were new, and I think the way that OSDs are marked down is
the key thing that affects that behavior. I'm a bit surprised that the
_notify_mon option helps there, and will take a closer look at that Monday
to make sure it's doing what it's supposed to be doing.

The paxos_propose_interval is an upper bound on how long the monitor is
allowed to batch updates before committing them.  Many/most changes are
committed immediately, but the osdmap management tries to batch things up
so that a single osdmap epoch combines lots of changes when they are
happening quickly (there tend to be mini-storms of updates when cluster
changes happen). The default of 2s might be too much for many
environments, though... and we might consider changing the default to
something smaller (maybe more like 250ms).

sage

On Fri, Nov 5, 2021 at 8:40 AM Manuel Lausch  wrote:

> Maybe this was me in an earlier mail
> It started at the point all replica partners are on octopus.
>
> This makes sense if I look at this code snippet:
>
>   if (!HAVE_FEATURE(recovery_state.get_min_upacting_features(),
> SERVER_OCTOPUS)) {
> dout(20) << __func__ << " not all upacting has SERVER_OCTOPUS" <<
> dendl; return true; }
>
> ->
> https://github.com/ceph/ceph/blob/v15.2.12/src/osd/PrimaryLogPG.cc#L772-L775
>
>
> On Fri, 5 Nov 2021 14:20:00 +0100
> Peter Lieven  wrote:
> >
> > I remember that someone wrote earlier that the issues while upgrading
> > from Nautilus to Octopus started at the point where the osd compat
> > level is set to octopus.
> >
> > So one of my initial guesses back when I tried to analyze this issue
> > was that it has something to do with the new "read from all osds not
> > just the primary" feature.
> >
> > Makes that sense?
> >
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: One cephFS snapshot kills performance

2021-11-05 Thread Sebastian Mazza
Hi Stefan,

thank you for sharing your experience! After I read your mail I did some more
testing, and for me the issue is strictly related to snapshots and perfectly
reproducible. However, I made two new observations that were not clear to me
until now.

First, snapshots that were created before the folders I use to test with
`du` were written do not have any negative performance impact.
Second, the metadata a client already has in its cache before a snapshot is taken
is not invalidated after the snapshot has been taken and can still be used, which
makes my executions of `du` very fast.

The following order of actions should illustrate my observations:
$ mkdir share/users/.snap/t1
$ mkdir share/users/.snap/t2
$ mkdir share/users/.snap/t3
$ mkdir share/users/.snap/t4
$ rsync -aH share/backup-remote-1/ share/backup-remote-2/ 
$ umount
$ mount -t ceph ...
$ du -h share/backup-remote-2/
The execution of `du` takes only 2m18.720s, which is reasonable.

Now let's take another snapshot with another ceph client:
$ mkdir share/users/.snap/t5 

Back on our main test machine, which still has the cephFS mounted:
$ du -h share/backup-remote-2/
=> 14.156s, very fast (presumably all required data was read from the client cache)

umount  and mount the cephFS again:
$ umount
$ mount -t ceph …
$ du -h share/backup-remote-2/
=> 20m16.984s …. 10 times slower



> No solution, but good to know there are more workloads out there that hit 
> this issue. If there are any CephFS devs interested in investigating this 
> issue we are more than happy to provide more info.

I would also be happy to provide more info.


Best wishes,
Sebastian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io