[ceph-users] Re: Stale monitoring alerts in UI

2021-11-05 Thread Eugen Block
Hi, sometimes it helps to fail the MGR service. I just had this with a customer last week, where we had to fail it twice within a few hours because the information was not updated. It was on the latest Octopus: ceph mgr fail. As for the MTU mismatch, I believe there was a thread a few weeks ago,
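A minimal sketch of that workaround, assuming the stale alerts live in the active MGR's modules and a standby MGR is available (the second command just confirms the failover):

  ceph mgr fail        # hand the active role over to a standby MGR
  ceph mgr stat        # verify a new active MGR has taken over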

[ceph-users] Re: large bucket index in multisite environement (how to deal with large omap objects warning)?

2021-11-05 Thread Boris Behrens
Hi Teoman, I don't sync the bucket content. It's just the metadata that gets synced. But turning off access to our s3 is not an option, because our customers rely on it (they make backups and serve objects for their web applications through it). On Thu, 4 Nov 2021 at 18:20, Teoman

[ceph-users] Re: large bucket index in multisite environement (how to deal with large omap objects warning)?

2021-11-05 Thread Boris Behrens
Cheers Istvan, how do you do this? On Thu, 4 Nov 2021 at 19:45, Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> wrote: > This one you need to prepare, you need to preshard the bucket which you > know will hold millions of objects. > > I have a bucket where we store 1.2 bill
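A rough sketch of presharding along those lines, assuming radosgw-admin is used; the bucket name and shard count are placeholders, and the usual rule of thumb is on the order of 100k objects per index shard:

  # reshard a bucket (ideally while still empty) to a fixed shard count
  radosgw-admin bucket reshard --bucket=big-bucket --num-shards=1024

  # or raise the default index shard count for newly created buckets
  ceph config set client.rgw rgw_override_bucket_index_max_shards 1024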

[ceph-users] Re: Optimal Erasure Code profile?

2021-11-05 Thread Eugen Block
Hi, since you can't change a pool's EC profile afterwards, you have to choose a reasonable number of chunks. If you need to start with those 6 hosts, I would also recommend spanning the EC profile across all those nodes, but keep in mind that the cluster won't be able to recover if a host fa
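A sketch of such a profile for 6 hosts with a host failure domain; the profile and pool names are made up for illustration:

  ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
  ceph osd pool create ecpool erasure ec42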

[ceph-users] Re: Optimal Erasure Code profile?

2021-11-05 Thread Sebastian Mazza
Hi Zakhar, I don't have much experience with Ceph, so you should read my words with reasonable skepticism. If your failure domain should be the host level, then k=4, m=2 is your most space-efficient option for 6 servers that still allows write IO when one of the servers fails. Assumin

[ceph-users] Re: Ceph-Dokan Mount Caps at ~1GB transfer?

2021-11-05 Thread Mason-Williams, Gabryel (RFI,RAL,-)
Hello, I have tried with a native client under Linux and its performance is fine; the performance below 1 GB is also fine on the Windows machine. Kind regards Gabryel From: Radoslav Milanov Sent: 01 November 2021 12:55 To: ceph-users@ceph.io Subject: [ceph-user

[ceph-users] Re: Optimal Erasure Code profile?

2021-11-05 Thread Zakhar Kirpichenko
Many thanks for your detailed advice, gents, I very much appreciate it! I read in various places that for production environments it's advised to keep (k+m) <= host count. Looks like for my setup it is 3+2 then. Would it be best to proceed with 3+2, or should we go with 4+2? /Z On Fri, Nov 5,

[ceph-users] Re: Optimal Erasure Code profile?

2021-11-05 Thread Zakhar Kirpichenko
Thanks! I'll stick to 3:2 for now then. /Z On Fri, Nov 5, 2021 at 1:55 PM Szabo, Istvan (Agoda) wrote: > With 6 servers I'd go with 3:2, with 7 you can go with 4:2. > > > Istvan Szabo > Senior Infrastructure Engineer > --- > Agoda Services Co., Ltd. >
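For reference, the usable-capacity arithmetic behind the two candidates (not from the thread, just k/(k+m)):

  3+2: 3/5 = 60% of raw capacity usable
  4+2: 4/6 ≈ 67% of raw capacity usable, but needs k+m = 6 failure domains, leaving no spare host on a 6-node cluster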

[ceph-users] Re: large bucket index in multisite environement (how to deal with large omap objects warning)?

2021-11-05 Thread mhnx
I also use this method and I hate it. Stopping all of the RGW clients is never an option! It shouldn't be. Sharding is hell. I had 250M objects in a bucket, the reshard failed after 2 days, and the object count somehow doubled! 2 days of downtime is not an option. I wonder if I stop the write-read
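For what it's worth, reshard progress and pending jobs can be inspected with the standard radosgw-admin subcommands (the bucket name is a placeholder):

  radosgw-admin reshard status --bucket=mybucket   # per-shard reshard state
  radosgw-admin reshard list                       # queued reshard jobs
  radosgw-admin reshard cancel --bucket=mybucket   # drop a queued reshard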

[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-05 Thread Manuel Lausch
Hi Sage, I tested again with paxos_propose_interval = 0.3. Now stopping OSDs causes far fewer slow ops, and while starting OSDs the slow ops seem gone. With osd_fast_shutdown_notify_mon = true the slow ops are gone completely, so I would like to keep the shutdown notify enabled. As far as I unde
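For anyone who wants to reproduce this, a sketch of the same settings via the centralized config; the values are the ones Manuel tested, not general recommendations:

  ceph config set mon paxos_propose_interval 0.3
  ceph config set osd osd_fast_shutdown_notify_mon true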

[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-05 Thread Peter Lieven
On 04.11.21 at 23:51, Sage Weil wrote: > Can you try setting paxos_propose_interval to a smaller number, like .3 (by > default it is 2 seconds) and see if that has any effect. > > It sounds like the problem is not related to getting the OSD marked down (or > at least that is not the only thing g

[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-05 Thread Manuel Lausch
Maybe this was me in an earlier mail. It started at the point where all replica partners are on Octopus. This makes sense if I look at this code snippet: if (!HAVE_FEATURE(recovery_state.get_min_upacting_features(), SERVER_OCTOPUS)) { dout(20) << __func__ << " not all upacting

[ceph-users] Re: large bucket index in multisite environement (how to deal with large omap objects warning)?

2021-11-05 Thread Сергей Процун
There should not be any issues using rgw for other buckets while re-sharding. As for the object count doubling after the reshard, that is an interesting situation. After the manual reshard is done, there might be leftovers from the old bucket index, as during the reshard new .dir.new_bucket_index objects are cr

[ceph-users] Re: Ceph-Dokan Mount Caps at ~1GB transfer?

2021-11-05 Thread Lucian Petrut
Hi, Did you build the Windows client yourself, or is it a Suse or Cloudbase build? Which Ceph version is the cluster you're connecting to running? Early versions had a few known bugs which might behave like that (e.g. overflows or connection issues), but it shouldn't be the case wi

[ceph-users] steady increasing of osd map epoch since octopus

2021-11-05 Thread Manuel Lausch
Hello, I observed an interesting behavior change since upgrading to Octopus and above: the OSD map epoch is constantly increasing. Up to Nautilus, the epoch only changed when OSDs went down/out/up/in, snapshots were created or deleted, recovery or backfilling took place, or flags like the noout was
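A quick way to observe the churn (sketch; the first line of 'ceph osd dump' prints the current epoch):

  ceph osd dump | head -1              # e.g. "epoch 41234"
  sleep 60; ceph osd dump | head -1    # compare one minute later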

[ceph-users] Re: How can user home directory quotas be automatically set on CephFS?

2021-11-05 Thread Artur Kerge
Thank you, Magnus, for such a quick reply! Good pointers in there! Cheers, Artur On Tue, 2 Nov 2021 at 14:50, Magnus HAGDORN wrote: > Hi Artur, > we did write a script (in fact a series of scripts) that we use to > manage our users and their quotas. Our script adds a new user to our > LDAP and se
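For reference, per-directory CephFS quotas are set through extended attributes; a minimal sketch assuming a kernel or FUSE mount at /cephfs and a hypothetical home directory:

  setfattr -n ceph.quota.max_bytes -v 10737418240 /cephfs/home/artur   # 10 GiB cap
  getfattr -n ceph.quota.max_bytes /cephfs/home/artur                  # verify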

[ceph-users] Cephalocon 2022 is official!

2021-11-05 Thread Mike Perez
Hello everyone! I'm pleased to announce Cephalocon 2022 will be taking place April 5-7 in Portland, Oregon + Virtually! The CFP is now open until December 10th, so don't delay! Registration and sponsorship details will be available soon! I am looking forward to seeing you all in person again soo

[ceph-users] Re: steady increasing of osd map epoch since octopus

2021-11-05 Thread Dan van der Ster
Hi, You can get two adjacent osdmap epochs (ceph osd getmap -o map.) Then use osdmaptool to print those maps, hopefully revealing what is changing between the two epochs. Cheers, Dan On Fri, Nov 5, 2021 at 4:54 PM Manuel Lausch wrote: > > Hello, > > I observed some interesting behavior chang
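A sketch of that comparison, with the two epochs filled in as placeholders:

  ceph osd getmap 41000 -o /tmp/osdmap.41000
  ceph osd getmap 41001 -o /tmp/osdmap.41001
  osdmaptool --print /tmp/osdmap.41000 > /tmp/map.41000.txt
  osdmaptool --print /tmp/osdmap.41001 > /tmp/map.41001.txt
  diff /tmp/map.41000.txt /tmp/map.41001.txt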

[ceph-users] Re: Regarding bug #53139 "OSD might wrongly attempt to use "slow" device when single device is backing the store"

2021-11-05 Thread Igor Fedotov
Right, only setups with a single device for everything are affected. On 11/5/2021 5:54 PM, J-P Methot wrote: Hi, I have a quick question regarding bug #53139, as the language in the report is slightly confusing. This bug affects any setup where a single OSD's data, Bluestore DB and WAL are all l

[ceph-users] Re: Regarding bug #53139 "OSD might wrongly attempt to use "slow" device when single device is backing the store"

2021-11-05 Thread Igor Fedotov
I haven't seen the beginning of the story - OSDs I was troubleshooting were failing on startup. But I think the initial failure had happened during regular operation - there is nothing specific to startup for that issue to pop up... On 11/5/2021 10:47 PM, J-P Methot wrote: I see. This issue

[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-05 Thread Sage Weil
Yeah, I think two different things are going on here. The read leases were new, and I think the way that OSDs are marked down is the key thing that affects that behavior. I'm a bit surprised that the _notify_mon option helps there, and will take a closer look at that Monday to make sure it's doin

[ceph-users] Re: One cephFS snapshot kills performance

2021-11-05 Thread Sebastian Mazza
Hi Stefan, thank you for sharing your experience! After I read your mail I did some more testing, and for me the issue is strictly related to snapshots and perfectly reproducible. However, I made two new observations that were not clear to me until now. First, snapshots that were created before
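For anyone trying to reproduce this, CephFS snapshots are created and removed through the hidden .snap directory (the mount path and snapshot name are examples):

  mkdir /mnt/cephfs/testdir/.snap/before-run    # create a snapshot of testdir
  rmdir /mnt/cephfs/testdir/.snap/before-run    # remove it again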