[ceph-users] Single-site cluster - multiple RGW issue

2022-02-28 Thread Adam Olszewski
Hi,
We have deployed two RGW hosts, with two containers each, in our single-site
cluster.
When either of these two hosts is down, the second one becomes unresponsive too,
returning error 500. Are they connected in some way?
The OSDs are split across two racks in the crush map, as are the MDS (2
instances) and MGR (3 instances), so when one rack is down the cluster is still
operable - except for RGW.
Thanks,
Adam


[ceph-users] Re: Single-site cluster - multiple RGW issue

2022-02-28 Thread Janne Johansson
On Mon, 28 Feb 2022 at 10:40, Adam Olszewski wrote:
> Hi,
> We have deployed two RGW hosts, with two containers each, in our single-site
> cluster.
> When either of these two hosts is down, the second one becomes unresponsive
> too, returning error 500. Are they connected in some way?

They are not. You should read the logs of the second one, and do some
basic network checks (do S3 requests reach the second, still-up RGW?)
to see what is actually happening to the second instance when the
first is down.
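
For example, something along these lines (hostname, port, and the journalctl
unit glob are placeholders, and the unit pattern assumes a cephadm deployment,
so adjust to yours):

# Does the surviving RGW answer HTTP at all?
curl -v http://rgw-host-2:8080/

# Is a radosgw process actually listening on that host?
ss -tlnp | grep radosgw

# Recent log output from the RGW daemon (cephadm-style unit name)
journalctl -u 'ceph-*@rgw.*' --since '10 minutes ago'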


-- 
May the most significant bit of your life be positive.


[ceph-users] Re: Single-site cluster - multiple RGW issue

2022-02-28 Thread Adam Olszewski
Hi Janne,
Thanks for the reply.
It's not related to the network. When one rack is down (containing one RGW
host), the 'ceph -s' command shows no RGW services, even though the systemd
ceph daemons are still running on the second RGW host. There is nothing in the
ceph crash list.
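
For completeness, this is what I am comparing (assuming cephadm here; as far
as I know the rgw line in 'ceph -s' comes from the mgr's service map, so I am
looking at both):

# What the orchestrator thinks is deployed and running
ceph orch ps --daemon-type rgw

# What the service map itself contains
ceph service dump

# Any recent daemon crashes
ceph crash ls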

On Mon, 28 Feb 2022 at 11:08, Janne Johansson wrote:

> On Mon, 28 Feb 2022 at 10:40, Adam Olszewski <
> adamolszewski...@gmail.com> wrote:
> > Hi,
> > We have deployed two RGW hosts, with two containers each, in our
> > single-site cluster.
> > When either of these two hosts is down, the second one becomes unresponsive
> > too, returning error 500. Are they connected in some way?
>
> They are not. You should read the logs of the second one, and do some
> basic network checks (do S3 requests reach the second, still-up RGW?)
> to see what is actually happening to the second instance when the
> first is down.
>
>
> --
> May the most significant bit of your life be positive.
>


[ceph-users] Re: Single-site cluster - multiple RGW issue

2022-02-28 Thread Janne Johansson
On Mon, 28 Feb 2022 at 11:18, Adam Olszewski wrote:
>
> Hi Janne,
> Thanks for the reply.
> It's not related to the network. When one rack is down (containing one RGW
> host), the 'ceph -s' command shows no RGW services, even though the systemd
> ceph daemons are still running on the second RGW host. There is nothing in the
> ceph crash list.

Does ceph -s jump from two to zero rgws?


-- 
May the most significant bit of your life be positive.


[ceph-users] Re: Single-site cluster - multiple RGW issue

2022-02-28 Thread Adam Olszewski
Even more: the entire "rgw:" line disappears from the "services:" section.

On Mon, 28 Feb 2022 at 11:29, Janne Johansson wrote:

> On Mon, 28 Feb 2022 at 11:18, Adam Olszewski <
> adamolszewski...@gmail.com> wrote:
> >
> > Hi Janne,
> > Thanks for the reply.
> > It's not related to the network. When one rack is down (containing one RGW
> > host), the 'ceph -s' command shows no RGW services, even though the systemd
> > ceph daemons are still running on the second RGW host. There is nothing in
> > the ceph crash list.
>
> Does ceph -s jump from two to zero rgws?
>
>
> --
> May the most significant bit of your life be positive.
>


[ceph-users] mclock and backgourd best effort

2022-02-28 Thread Luis Domingues
Hello,

As we are testing the mClock scheduler, we have a question we could not find an
answer to in the documentation.

The documentation says mClock distinguishes three types of load: client,
recovery, and best effort. I assume client is the client traffic and recovery
is the recovery traffic when something goes wrong.

Could someone tell me what kind of load is included in best effort?
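
In case it helps frame the question, these are the knobs I have found so far
(assuming osd.0, and the option names as I read them in the mClock config
reference, so please correct me if they are wrong):

# Confirm mclock is the active scheduler and which profile is in use
ceph config show osd.0 osd_op_queue
ceph config show osd.0 osd_mclock_profile

# Reservation / weight / limit applied to the "background best effort" class
ceph config show osd.0 osd_mclock_scheduler_background_best_effort_res
ceph config show osd.0 osd_mclock_scheduler_background_best_effort_wgt
ceph config show osd.0 osd_mclock_scheduler_background_best_effort_lim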

Regards,

Luis Domingues
Proton AG


[ceph-users] How to clear "Too many repaired reads on 1 OSDs" on pacific

2022-02-28 Thread Sascha Vogt



Hi all,

I'd like to clear the "too many repaired reads" warning. In the
changelog (and in some mailing list posts) I found that Nautilus added
a "clear_shards_repaired" command (issued via ceph tell), but
unfortunately, when trying to execute it, I get a "no valid command
found" response.


Is there a way to clear the error counter on pacific? If so, how?
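
For reference, this is what I tried, with the OSD id taken from the health
warning output (the tell command below is the form I believe the changelog
refers to; it is the one that returns "no valid command found" here):

# Shows which OSD(s) raised OSD_TOO_MANY_REPAIRS
ceph health detail

# The command mentioned in the changelog (rejected on this cluster)
ceph tell osd.<id> clear_shards_repaired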

Greetings
-Sascha-


[ceph-users] Re: removing osd, reweight 0, backfilling done, after purge, again backfilling.

2022-02-28 Thread Marc


> > I have a clean cluster state, with the OSDs that I am going to remove at a
> > reweight of 0. And then, after executing 'ceph osd purge 19', I get
> > remapping+backfilling again?
> >
> > Is this indeed the correct procedure, or is this old?
> > https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#removing-osds-manual
> 
> When you either 1) purge an OSD or 2) ceph osd crush reweight it to 0.0,
> you change the total weight of the OSD host. If you only ceph osd
> reweight an OSD, it will push its PGs to other OSDs on the same host
> and empty itself, but that host now holds more PGs than it really
> should. When you then do one of the two steps above, the host weight
> gets corrected and the extra PGs move to other OSD hosts. This will
> also affect the total weight of the whole subtree, so other PGs might
> start moving as well, on hosts not directly related, but this is less
> common.
>

You are right, I did not read my own manual correctly: I applied the reweight
and not the crush reweight.
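
For my own notes, the difference as I now understand it, with osd.19 as the
example from my original mail (please correct me if this is still wrong):

# Temporary override only; the host's crush weight is unchanged, so the
# PGs just shuffle to the other OSDs on the same host
ceph osd reweight 19 0

# Changes the crush weight, so the host (and subtree) weight drops and
# the data moves off the host before removal
ceph osd crush reweight osd.19 0

# After that backfill finishes, the purge itself should then not trigger
# any further data movement
ceph osd purge 19 --yes-i-really-mean-it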



[ceph-users] Errors when scrub ~mdsdir and lots of num_strays

2022-02-28 Thread Arnaud M
Hello to everyone

Our Ceph cluster is healthy and everything seems to be going well, but we have
a very high num_strays:

ceph tell mds.0 perf dump | grep stray
"num_strays": 1990574,
"num_strays_delayed": 0,
"num_strays_enqueuing": 0,
"strays_created": 3,
"strays_enqueued": 17,
"strays_reintegrated": 0,
"strays_migrated": 0,

num_strays doesn't seem to decrease whatever we do (scrub / or scrub
~mdsdir), and when we scrub ~mdsdir (force,recursive,repair) we get errors
like these:

{
"damage_type": "dir_frag",
"id": 3775653237,
"ino": 1099569233128,
"frag": "*",
"path": "~mds0/stray3/100036efce8"
},
{
"damage_type": "dir_frag",
"id": 3776355973,
"ino": 1099567262916,
"frag": "*",
"path": "~mds0/stray3/1000350ecc4"
},
{
"damage_type": "dir_frag",
"id": 3776485071,
"ino": 1099559071399,
"frag": "*",
"path": "~mds0/stray4/10002d3eea7"
},

Just before the end of the ~mdsdir scrub, the MDS crashes and I have to run

ceph mds repaired 0

to bring the filesystem back online.

There are a lot of these damage entries. Do you have any idea what these
errors are and how I should handle them?
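
For reference, these are the commands involved, as far as I understand them
(our filesystem name replaced with <fsname>; as I understand it, 'damage rm'
only clears the table entry, not the underlying problem):

# List the recorded damage entries (the dir_frag items above)
ceph tell mds.<fsname>:0 damage ls

# The scrub we run on the mds directory
ceph tell mds.<fsname>:0 scrub start ~mdsdir recursive,repair,force

# Once an entry has been dealt with, it can be dropped from the table
ceph tell mds.<fsname>:0 damage rm <damage_id>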

We have a lot of data in our CephFS cluster (350 TB+), and we take a snapshot
of / every day and keep them for 1 month (rolling).

here is our cluster state

ceph -s
  cluster:
id: 817b5736-84ae-11eb-bf7b-c9513f2d60a9
health: HEALTH_WARN
78 pgs not deep-scrubbed in time
70 pgs not scrubbed in time

  services:
mon: 3 daemons, quorum ceph-r-112-1,ceph-g-112-3,ceph-g-112-2 (age 10d)
mgr: ceph-g-112-2.ghcodb(active, since 4d), standbys:
ceph-g-112-1.ksojnh
mds: 1/1 daemons up, 1 standby
osd: 67 osds: 67 up (since 14m), 67 in (since 7d)

  data:
volumes: 1/1 healthy
pools:   5 pools, 609 pgs
objects: 186.86M objects, 231 TiB
usage:   351 TiB used, 465 TiB / 816 TiB avail
pgs: 502 active+clean
 82  active+clean+snaptrim_wait
 20  active+clean+snaptrim
 4   active+clean+scrubbing+deep
 1   active+clean+scrubbing+deep+snaptrim_wait

  io:
client:   8.8 MiB/s rd, 39 MiB/s wr, 25 op/s rd, 54 op/s wr

My questions are about the damage found during the ~mdsdir scrub: should I
worry about it? What does it mean? It seems to be linked to my issue with the
high number of strays, is that right? How do I fix it, and how do I reduce
num_strays?

Thanks to all

All the best

Arnaud


[ceph-users] Understanding RGW multi zonegroup replication topology

2022-02-28 Thread Mark Selby
I am designing a Ceph RGW multisite configuration. I have done a fair bit of
reading but am still having trouble grokking the utility of having multiple
zonegroups within a realm. I know that all metadata is replicated between
zonegroups and that replication can be set up between zones across zonegroups,
but I am having trouble understanding the benefits that a multi-zonegroup
topology would provide. I really want to understand this, as I want to make
sure that I design a topology that best meets the company's needs.
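
For background, this is how far I have gotten mapping out an existing setup
(just the standard radosgw-admin calls, listed here only so it is clear what I
am looking at):

radosgw-admin realm list
radosgw-admin zonegroup list
radosgw-admin zone list

# Full view of the current period: realm, zonegroups, zones and endpoints
radosgw-admin period get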

 

Any and all help is greatly appreciated. Thanks!

 

-- 

Mark Selby

mse...@voleon.com 

 




[ceph-users] Re: How to clear "Too many repaired reads on 1 OSDs" on pacific

2022-02-28 Thread Anthony D'Atri
I would think that such an error has a failing drive as the root cause, so
investigate that first. Destroying and redeploying such an OSD should take
care of it.
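
Roughly along these lines, assuming a cephadm-managed cluster and osd.12 /
/dev/sdX as stand-ins (check the drive health first, and only proceed while
the cluster is otherwise healthy):

# Check the drive itself before anything else
smartctl -a /dev/sdX

# Drain and remove the OSD, keeping its ID free for the replacement
ceph orch osd rm 12 --replace

# Watch the drain/backfill progress
ceph orch osd rm status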

> On Feb 28, 2022, at 5:04 PM, Szabo, Istvan (Agoda) wrote:
> 
> Restart osd.
> 
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
> 
> On 2022. Mar 1., at 2:55, Sascha Vogt  wrote:
> 
> Hi all,
> 
> I'd like to clear the "too many repaired reads" warning. In the
> changelog (and in some mailing list posts) I found that Nautilus added
> a "clear_shards_repaired" command (issued via ceph tell), but
> unfortunately, when trying to execute it, I get a "no valid command
> found" response.
> 
> Is there a way to clear the error counter on pacific? If so, how?
> 
> Greetings
> -Sascha-
