[ceph-users] Re: Ceph Mon not able to authenticate

2022-03-30 Thread Konstantin Shalygin
Hi,

You are not the first with this issue.
If you are 146% sure that this is not a network (ARP, IP, MTU, firewall) issue,
I suggest removing this mon and deploying it again, or deploying it on another
(unused) IP address.
Also, you can add --debug_ms=20; you should then see some "lossy channel"
messages before the quorum join fails.
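For example, on a cephadm-managed cluster the redeploy would be roughly as
follows (a sketch only; controller2 is the mon from the log below, and the
target address is a placeholder):

ceph orch daemon rm mon.controller2 --force       # remove the broken mon daemon
ceph mon remove controller2                       # drop it from the monmap, if it is still listed there
ceph orch daemon add mon controller2:<unused-ip>  # redeploy, optionally on another (unused) address

For the messenger debugging, either pass --debug_ms=20 to the mon on startup or
set it centrally with "ceph config set mon debug_ms 20".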


k

> On 29 Mar 2022, at 15:20, Thomas Bruckmann  
> wrote:
> 
> Hello again,
> I have now increased the debug level to the maximum for the mons, and I still
> have no idea what the problem could be.
> 
> So I am just posting the debug log of the mon that fails to join here, in the
> hope that someone can help me. In addition, the mon that is not joining stays
> quite long in the probing phase; sometimes it switches to synchronizing, which
> seems to work, and after that it is back to probing.
> 
> debug 2022-03-29T11:10:53.695+ 7f81c0811700 10 
> mon.controller2@-1(probing) e16 bootstrap
> debug 2022-03-29T11:10:53.695+ 7f81c0811700 10 
> mon.controller2@-1(probing) e16 sync_reset_requester
> debug 2022-03-29T11:10:53.695+ 7f81c0811700 10 
> mon.controller2@-1(probing) e16 unregister_cluster_logger - not registered
> debug 2022-03-29T11:10:53.695+ 7f81c0811700 10 
> mon.controller2@-1(probing) e16 cancel_probe_timeout (none scheduled)
> debug 2022-03-29T11:10:53.695+ 7f81c0811700 10 
> mon.controller2@-1(probing) e16 monmap e16: 3 mons at 
> {controller1=[v2:192.168.9.206:3300/0,v1:192.168.9.206:6789/0],controller4=[v2:192.168.9.209:3300/0,v1:192.168.9.209:6789/0],controller5=[v2:192.168.9.210:3300/0,v1:192.168.9.210:6789/0]}
> debug 2022-03-29T11:10:53.695+ 7f81c0811700 10 
> mon.controller2@-1(probing) e16 _reset
> debug 2022-03-29T11:10:53.695+ 7f81c0811700 10 
> mon.controller2@-1(probing).auth v46972 _set_mon_num_rank num 0 rank 0
> debug 2022-03-29T11:10:53.695+ 7f81c0811700 10 
> mon.controller2@-1(probing) e16 cancel_probe_timeout (none scheduled)
> debug 2022-03-29T11:10:53.695+ 7f81c0811700 10 
> mon.controller2@-1(probing) e16 timecheck_finish
> debug 2022-03-29T11:10:53.695+ 7f81c0811700 15 
> mon.controller2@-1(probing) e16 health_tick_stop
> debug 2022-03-29T11:10:53.695+ 7f81c0811700 15 
> mon.controller2@-1(probing) e16 health_interval_stop
> debug 2022-03-29T11:10:53.695+ 7f81c0811700 10 
> mon.controller2@-1(probing) e16 scrub_event_cancel
> debug 2022-03-29T11:10:53.695+ 7f81c0811700 10 
> mon.controller2@-1(probing) e16 scrub_reset
> debug 2022-03-29T11:10:53.695+ 7f81c0811700 10 
> mon.controller2@-1(probing) e16 cancel_probe_timeout (none scheduled)
> debug 2022-03-29T11:10:53.695+ 7f81c0811700 10 
> mon.controller2@-1(probing) e16 reset_probe_timeout 0x55c46fbb8d80 after 2 
> seconds
> debug 2022-03-29T11:10:53.695+ 7f81c0811700 10 
> mon.controller2@-1(probing) e16 probing other monitors
> debug 2022-03-29T11:10:53.695+ 7f81be00c700 20 
> mon.controller2@-1(probing) e16 _ms_dispatch existing session 0x55c46f8d4900 
> for mon.2
> debug 2022-03-29T11:10:53.695+ 7f81be00c700 20 
> mon.controller2@-1(probing) e16  entity_name  global_id 0 (none) caps allow *
> debug 2022-03-29T11:10:53.695+ 7f81be00c700 20 is_capable service=mon 
> command= read addr v2:192.168.9.210:3300/0 on cap allow *
> debug 2022-03-29T11:10:53.695+ 7f81be00c700 20  allow so far , doing 
> grant allow *
> debug 2022-03-29T11:10:53.695+ 7f81be00c700 20  allow all
> debug 2022-03-29T11:10:53.695+ 7f81be00c700 10 
> mon.controller2@-1(probing) e16 handle_probe mon_probe(reply 
> 9d036488-fb4f-4e5b-85ec-4ccf75501b48 name controller5 quorum 0,1,2 leader 0 
> paxos( fc 133912517 lc 133913211 ) mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+ 7f81be00c700 10 
> mon.controller2@-1(probing) e16 handle_probe_reply mon.2 
> v2:192.168.9.210:3300/0 mon_probe(reply 9d036488-fb4f-4e5b-85ec-4ccf75501b48 
> name controller5 quorum 0,1,2 leader 0 paxos( fc 133912517 lc 133913211 ) 
> mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+ 7f81be00c700 10 
> mon.controller2@-1(probing) e16  monmap is e16: 3 mons at 
> {controller1=[v2:192.168.9.206:3300/0,v1:192.168.9.206:6789/0],controller4=[v2:192.168.9.209:3300/0,v1:192.168.9.209:6789/0],controller5=[v2:192.168.9.210:3300/0,v1:192.168.9.210:6789/0]}
> debug 2022-03-29T11:10:53.695+ 7f81be00c700 10 
> mon.controller2@-1(probing) e16  peer name is controller5
> debug 2022-03-29T11:10:53.695+ 7f81be00c700 10 
> mon.controller2@-1(probing) e16  existing quorum 0,1,2
> debug 2022-03-29T11:10:53.695+ 7f81be00c700 10 
> mon.controller2@-1(probing) e16  peer paxos version 133913211 vs my version 
> 133913204 (ok)
> debug 2022-03-29T11:10:53.695+ 7f81be00c700 10 
> mon.controller2@-1(probing) e16  ready to join, but i'm not in the monmap/my 
> addr is blank/location is wrong, trying to join
> debug 2022-03-29T11:10:53.695+ 7f81be00c700 20 
> mon.controller2@-1(probing) e16 _ms_dispatch existing session 0x55c46f8d4b40 
> for mon.1
> debug 2022-03-29T11:10:53.695+ 7f81be00c7

[ceph-users] Re: replace MON server keeping identity (Octopus)

2022-03-30 Thread York Huang
Hi Nigel,


https://github.com/ceph/ceph-ansible/tree/master/infrastructure-playbooks


The shrink-mon.yml and add-mon.yml playbooks may give you some insights for
such operations (remember to check out the correct Ceph version first).
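For example, something along these lines (a sketch; the inventory path and
hostnames are placeholders, and the exact extra-vars may differ between
ceph-ansible versions):

ansible-playbook -i <inventory> infrastructure-playbooks/shrink-mon.yml -e mon_to_kill=<old-mon-hostname>
# add the replacement host to the [mons] inventory group, then:
ansible-playbook -i <inventory> infrastructure-playbooks/add-mon.yml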
 
 
-- Original --
From:  "Nigel Williams"

[ceph-users] Re: PG down, due to 3 OSD failing

2022-03-30 Thread Fulvio Galeazzi

Ciao Dan,
this is what I did with chunk s3, copying it from osd.121 to 
osd.176 (which is managed by the same host).
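For reference, the export/import was along these lines (a sketch; the data
paths follow the cephpa1-<id> pattern used elsewhere in this thread, the file
name is arbitrary, and the OSDs are stopped while the tool runs):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-121 --no-mon-config --pgid 85.25s3 --op export --file /tmp/85.25s3.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-176 --no-mon-config --op import --file /tmp/85.25s3.export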


But still
pg 85.25 is stuck stale for 85029.707069, current state 
stale+down+remapped, last acting 
[2147483647,2147483647,96,2147483647,2147483647]


So "health detail" apparently plainly ignores osd.176: moreover, its 
output only shows OSD 96, but I just checked again and the other chunks 
are still on OSDs 56,64,140,159 which are all "up".


By the way, you talk about a "bug" in your message: do you have any 
specific one in mind, or was it just a generic synonym for "problem"?

By the way, I uploaded here:
https://pastebin.ubuntu.com/p/dTfPkMb7mD/
a few hundred lines from one of the failed OSDs upon "activate --all".

  Thanks

Fulvio

On 29/03/2022 10:53, Dan van der Ster wrote:

Hi Fulvio,

I don't think upmap will help -- that is used to remap where data
should be "up", but your problem is more that the PG chunks are not
going active due to the bug.

What happens if you export one of the PG chunks then import it to
another OSD -- does that chunk become active?

-- dan



On Tue, Mar 29, 2022 at 10:51 AM Fulvio Galeazzi
 wrote:


Hallo again Dan, I am afraid I'd need a little more help, please...

Current status is as follows.

This is where I moved the chunk which was on osd.121:
~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-176
--no-mon-config --op list-pgs  | grep ^85\.25
85.25s3

while other chunks are (server - osd.id):
=== r2srv101.pa1.box.garr - 96
85.25s2
=== r2srv100.pa1.box.garr - 121  <-- down, chunk is on osd.176
85.25s3
=== r2srv100.pa1.box.garr - 159
85.25s1
85.25s4
=== r1-sto09.pa1.box.garr - 56
85.25s4
=== r1-sto09.pa1.box.garr - 64
85.25s0
=== r3srv15.pa1.box.garr - 140
85.25s1

Health detail shows that just one chunk can be found (if I understand
the output correctly):

~]# ceph health detail | grep 85\.25
  pg 85.25 is stuck stale for 5680.315732, current state
stale+down+remapped, last acting
[2147483647,2147483647,96,2147483647,2147483647]

Can I run some magic upmap command to explain my cluster where all the
chunks are? What would be the right syntax?
Little additional problem: I see s1 and s4 twice... I guess this was due
to remapping, as I was adding disks to the cluster: which one is the
right copy?

Thanks!

 Fulvio

On 3/29/2022 9:35 AM, Fulvio Galeazzi wrote:

Thanks a lot, Dan!

  > The EC pgs have a naming convention like 85.25s1 etc.. for the various
  > k/m EC shards.

That was the bit of information I was missing... I was looking for the
wrong object.
I can now go on and export/import that one PGid chunk.

Thanks again!

  Fulvio

On 28/03/2022 16:27, Dan van der Ster wrote:

Hi Fulvio,

You can check (offline) which PGs are on an OSD with the list-pgs op,
e.g.

ceph-objectstore-tool  --data-path /var/lib/ceph/osd/cephpa1-158/
--op list-pgs


-- dan


On Mon, Mar 28, 2022 at 2:29 PM Fulvio Galeazzi
 wrote:


Hallo,
   all of a sudden, 3 of my OSDs failed, showing similar messages in
the log:

.
   -5> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch:
616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
ec=148456/148456 lis/c
612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
[168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
unknown mbc={}]
enter Started
   -4> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch:
616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
ec=148456/148456 lis/c
612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
[168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
unknown mbc={}]
enter Start
   -3> 2022-03-28 14:19:02.451 7fc20fe99700  1 osd.145 pg_epoch:
616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
ec=148456/148456 lis/c
612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
[168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
unknown mbc={}]
state: transitioning to Stray
   -2> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch:
616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
ec=148456/148456 lis/c
612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
[168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
unknown mbc={}]
exit Start 0.08 0 0.00
   -1> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch:
616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
ec=148456/148456 lis/c
612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
[168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
unknown mbc={}]
enter Started/Stray
0> 2022-03-28 14:19:02.451 7fc20f698700 -1 *** Caught signal
(Aborted) **
in thread 7fc20f698700 thread_name:tp_osd_tp

ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351)
nautilus (stable)
1: (()+0x12ce0) [0x7fc2327dcce0]
2: (gsignal

[ceph-users] Re: quincy v17.2.0 QE Validation status

2022-03-30 Thread Casey Bodley
On Mon, Mar 28, 2022 at 5:48 PM Yuri Weinstein  wrote:
>
> We are trying to release v17.2.0 as soon as possible.
> And need to do a quick approval of tests and review failures.
>
> Still outstanding are two PRs:
> https://github.com/ceph/ceph/pull/45673
> https://github.com/ceph/ceph/pull/45604
>
> Build failing and I need help to fix it ASAP.
> (
> https://shaman.ceph.com/builds/ceph/wip-yuri11-testing-2022-03-28-0907-quincy/61b142c76c991abe3fe77390e384b025e1711757/
> )
>
> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/55089
> Release Notes - https://github.com/ceph/ceph/pull/45048
>
> Seeking approvals for:
>
> smoke - Neha, Josh (the failure appears reproducible)
> rgw - Casey

approved for rgw, based on the latest results in
https://pulpito.ceph.com/yuriw-2022-03-29_21:32:48-rgw-wip-yuri11-testing-2022-03-28-0907-quincy-distro-default-smithi/

this test includes the arrow submodule PR
https://github.com/ceph/ceph/pull/45604 which is now ready for merge.
however, github now requires 6 reviews to merge this for quincy.
should i just tag a few more people for approval?

> fs - Venky, Gerg
> rbd - Ilya, Deepika
> krbd  Ilya, Deepika
> upgrade/octopus-x - Casey

i see ragweed bootstrap failures from octopus, tracked by
https://tracker.ceph.com/issues/53829. these are preventing the
upgrade tests from completing

> powercycle - Brag (SELinux denials)
> ceph-volume - Guillaume, David G
>
> Please reply to this email with approval and/or trackers of known issues/PRs
> to address them.
>
> Thx
> YuriW
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERNAL] Laggy OSDs

2022-03-30 Thread Rice, Christian
We had issues with slow ops on SSD AND NVMe; mostly fixed by raising aio-max-nr 
from 64K to 1M, e.g. "fs.aio-max-nr=1048576", if I remember correctly.
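If you want to try it, that would be something like this (the sysctl.d file
name is arbitrary):

sysctl -w fs.aio-max-nr=1048576                                   # apply immediately
echo "fs.aio-max-nr = 1048576" > /etc/sysctl.d/90-ceph-aio.conf   # persist across reboots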

On 3/29/22, 2:13 PM, "Alex Closs"  wrote:

Hey folks,

We have a 16.2.7 cephadm cluster that's had slow ops and several 
(constantly changing) laggy PGs. The set of OSDs with slow ops seems to change 
at random, among all 6 OSD hosts in the cluster. All drives are enterprise SATA 
SSDs, by either Intel or Micron. We're still not ruling out a network issue, 
but wanted to troubleshoot from the Ceph side in case something broke there.

ceph -s:

 health: HEALTH_WARN
 3 slow ops, oldest one blocked for 246 sec, daemons 
[osd.124,osd.130,osd.141,osd.152,osd.27] have slow ops.

 services:
 mon: 5 daemons, quorum ceph-osd10,ceph-mon0,ceph-mon1,ceph-osd9,ceph-osd11 
(age 28h)
 mgr: ceph-mon0.sckxhj(active, since 25m), standbys: ceph-osd10.xmdwfh, 
ceph-mon1.iogajr
 osd: 143 osds: 143 up (since 92m), 143 in (since 2w)
 rgw: 3 daemons active (3 hosts, 1 zones)

 data:
 pools: 26 pools, 3936 pgs
 objects: 33.14M objects, 144 TiB
 usage: 338 TiB used, 162 TiB / 500 TiB avail
 pgs: 3916 active+clean
 19 active+clean+laggy
 1 active+clean+scrubbing+deep

 io:
 client: 59 MiB/s rd, 98 MiB/s wr, 1.66k op/s rd, 1.68k op/s wr

This is actually much faster than it has been for much of the past hour; it has 
been as low as 50 KB/s and dozens of IOPS in both directions (where the cluster 
typically does 300 MB/s to a few GB/s, and ~4k IOPS).

The cluster has been on 16.2.7 since a few days after release without 
issue. The only recent change was an apt upgrade and reboot on the hosts (which 
was last Friday and didn't show signs of problems).

Happy to provide logs, let me know what would be useful. Thanks for reading 
this wall :)

-Alex

MIT CSAIL
he/they
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG down, due to 3 OSD failing

2022-03-30 Thread Dan van der Ster
Hi Fulvio,

I'm not sure why that PG doesn't register.
But let's look into your log. The relevant lines are:

  -635> 2022-03-30 14:49:57.810 7ff904970700 -1 log_channel(cluster)
log [ERR] : 85.12s0 past_intervals [616435,616454) start interval does
not contain the required bound [605868,616454) start

  -628> 2022-03-30 14:49:57.810 7ff904970700 -1 osd.158 pg_epoch:
616454 pg[85.12s0( empty local-lis/les=0/0 n=0 ec=616435/616435 lis/c
605866/605866 les/c/f 605867/605868/0 616453/616454/616454)
[158,168,64,102,156]/[67,91,82,121,112]p67(0) r=-1 lpr=616454
pi=[616435,616454)/0 crt=0'0 remapped NOTIFY mbc={}] 85.12s0
past_intervals [616435,616454) start interval does not contain the
required bound [605868,616454) start

  -355> 2022-03-30 14:49:57.816 7ff904970700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.22/rpm/el7/BUILD/ceph-14.2.22/src/osd/PG.cc:
In function 'void PG::check_past_interval_bounds() const' thread
7ff904970700 time 2022-03-30 14:49:57.811165

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.22/rpm/el7/BUILD/ceph-14.2.22/src/osd/PG.cc:
956: ceph_abort_msg("past_interval start interval mismatch")


What is the output of `ceph pg 85.12 query`?

What's the history of that PG? Was it moved around recently prior to this crash?
Are the other down OSDs also hosting broken parts of PG 85.12?
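The latter you can check offline with list-pgs again, e.g. (data path as in the
earlier examples in this thread):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-158 --no-mon-config --op list-pgs | grep ^85\.12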

Cheers, Dan

On Wed, Mar 30, 2022 at 3:00 PM Fulvio Galeazzi  wrote:
>
> Ciao Dan,
>  this is what I did with chunk s3, copying it from osd.121 to
> osd.176 (which is managed by the same host).
>
> But still
> pg 85.25 is stuck stale for 85029.707069, current state
> stale+down+remapped, last acting
> [2147483647,2147483647,96,2147483647,2147483647]
>
> So "health detail" apparently plainly ignores osd.176: moreover, its
> output only shows OSD 96, but I just checked again and the other chunks
> are still on OSDs 56,64,140,159 which are all "up".
>
> By the way, you talk about a "bug" in your message: do you have any
> specific one in mind, or was it just a generic synonym for "problem"?
> By the way, I uploaded here:
> https://pastebin.ubuntu.com/p/dTfPkMb7mD/
> a few hundreds of lines from one of the failed OSDs upon "activate --all".
>
>Thanks
>
> Fulvio
>
> On 29/03/2022 10:53, Dan van der Ster wrote:
> > Hi Fulvio,
> >
> > I don't think upmap will help -- that is used to remap where data
> > should be "up", but your problem is more that the PG chunks are not
> > going active due to the bug.
> >
> > What happens if you export one of the PG chunks then import it to
> > another OSD -- does that chunk become active?
> >
> > -- dan
> >
> >
> >
> > On Tue, Mar 29, 2022 at 10:51 AM Fulvio Galeazzi
> >  wrote:
> >>
> >> Hallo again Dan, I am afraid I'd need a little more help, please...
> >>
> >> Current status is as follows.
> >>
> >> This is where I moved the chunk which was on osd.121:
> >> ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-176
> >> --no-mon-config --op list-pgs  | grep ^85\.25
> >> 85.25s3
> >>
> >> while other chunks are (server - osd.id):
> >> === r2srv101.pa1.box.garr - 96
> >> 85.25s2
> >> === r2srv100.pa1.box.garr - 121  <-- down, chunk is on osd.176
> >> 85.25s3
> >> === r2srv100.pa1.box.garr - 159
> >> 85.25s1
> >> 85.25s4
> >> === r1-sto09.pa1.box.garr - 56
> >> 85.25s4
> >> === r1-sto09.pa1.box.garr - 64
> >> 85.25s0
> >> === r3srv15.pa1.box.garr - 140
> >> 85.25s1
> >>
> >> Health detail shows that just one chunk can be found (if I understand
> >> the output correctly):
> >>
> >> ~]# ceph health detail | grep 85\.25
> >>   pg 85.25 is stuck stale for 5680.315732, current state
> >> stale+down+remapped, last acting
> >> [2147483647,2147483647,96,2147483647,2147483647]
> >>
> >> Can I run some magic upmap command to explain my cluster where all the
> >> chunks are? What would be the right syntax?
> >> Little additional problem: I see s1 and s4 twice... I guess this was due
> >> to remapping, as I was adding disks to the cluster: which one is the
> >> right copy?
> >>
> >> Thanks!
> >>
> >>  Fulvio
> >>
> >> On 3/29/2022 9:35 AM, Fulvio Galeazzi wrote:
> >>> Thanks a lot, Dan!
> >>>
> >>>   > The EC pgs have a naming convention like 85.25s1 etc.. for the various
> >>>   > k/m EC shards.
> >>>
> >>> That was the bit of information I was missing... I was looking for the
> >>> wrong object.
> >>> I can now go on and export/import that one PGid chunk.
> >>>
> >>> Thanks again!
> >>>
> >>>   Fulvio
> >>>
> >>> On 28/03/2022 16:27, Dan van der Ster wrote:
>  Hi Fulvio,
> 
>  You can check (offline) which PGs are on an OSD with the list-pgs op,
>  e.g.
> 
>  ceph-objectstore-tool  --data-path /var/lib/ceph/osd/cephpa1-158/
>  --op list-pgs
> 
> >

[ceph-users] Re: Quincy: mClock config propagation does not work properly

2022-03-30 Thread Sridhar Seshasayee
Hi Luis,

As Neha mentioned, I am trying out your steps and investigating this
further.
I will get back to you in the next day or two. Thanks for your patience.

-Sridhar

On Thu, Mar 17, 2022 at 11:51 PM Neha Ojha  wrote:

> Hi Luis,
>
> Thanks for testing the Quincy rc and trying out the mClock settings!
> Sridhar is looking into this issue and will provide his feedback as
> soon as possible.
>
> Thanks,
> Neha
>
> On Thu, Mar 3, 2022 at 5:05 AM Luis Domingues 
> wrote:
> >
> > Hi all,
> >
> > As we are doing some tests on our lab cluster, running Quincy 17.1.0, we
> observed some strange behavior regarding the propagation of the mClock
> parameters to the OSDs. Basically, when the profile has been set to one of
> the predefined ones and we then change it to custom, changes to the
> individual mClock parameters are not propagated.
> >
> > For more details, here is how we reproduce the issue on our lab:
> >
> > ** Step 1
> >
> > We start the OSDs, with this configuration set, using ceph config dump:
> >
> > ```
> >
> > osd advanced osd_mclock_profile custom
> > osd advanced osd_mclock_scheduler_background_recovery_lim 512
> > osd advanced osd_mclock_scheduler_background_recovery_res 128
> > osd advanced osd_mclock_scheduler_background_recovery_wgt 3
> > osd advanced osd_mclock_scheduler_client_lim 80
> > osd advanced osd_mclock_scheduler_client_res 30
> > osd advanced osd_mclock_scheduler_client_wgt 1
> > osd advanced osd_op_queue mclock_scheduler *
> > ```
> >
> > And we can observe that this is what the OSD is running, using ceph
> daemon osd.X config show:
> >
> > ```
> > "osd_mclock_profile": "custom",
> > "osd_mclock_scheduler_anticipation_timeout": "0.00",
> > "osd_mclock_scheduler_background_best_effort_lim": "99",
> > "osd_mclock_scheduler_background_best_effort_res": "1",
> > "osd_mclock_scheduler_background_best_effort_wgt": "1",
> > "osd_mclock_scheduler_background_recovery_lim": "512",
> > "osd_mclock_scheduler_background_recovery_res": "128",
> > "osd_mclock_scheduler_background_recovery_wgt": "3",
> > "osd_mclock_scheduler_client_lim": "80",
> > "osd_mclock_scheduler_client_res": "30",
> > "osd_mclock_scheduler_client_wgt": "1",
> > "osd_mclock_skip_benchmark": "false",
> > "osd_op_queue": "mclock_scheduler",
> > ```
> >
> > At this point, if we change something, the change can be viewed on the
> OSD. Let's say we change the background recovery reservation to 100:
> >
> > `ceph config set osd osd_mclock_scheduler_background_recovery_res 100`
> >
> > The change has been set properly on the OSDs:
> >
> > ```
> > "osd_mclock_profile": "custom",
> > "osd_mclock_scheduler_anticipation_timeout": "0.00",
> > "osd_mclock_scheduler_background_best_effort_lim": "99",
> > "osd_mclock_scheduler_background_best_effort_res": "1",
> > "osd_mclock_scheduler_background_best_effort_wgt": "1",
> > "osd_mclock_scheduler_background_recovery_lim": "512",
> > "osd_mclock_scheduler_background_recovery_res": "100",
> > "osd_mclock_scheduler_background_recovery_wgt": "3",
> > "osd_mclock_scheduler_client_lim": "80",
> > "osd_mclock_scheduler_client_res": "30",
> > "osd_mclock_scheduler_client_wgt": "1",
> > "osd_mclock_skip_benchmark": "false",
> > "osd_op_queue": "mclock_scheduler",
> > ```
> >
> > ** Step 2
> >
> > We change the profile to high_recovery_ops, and remove the old
> configuration
> >
> > ```
> > ceph config set osd osd_mclock_profile high_recovery_ops
> > ceph config rm osd osd_mclock_scheduler_background_recovery_lim
> > ceph config rm osd osd_mclock_scheduler_background_recovery_res
> > ceph config rm osd osd_mclock_scheduler_background_recovery_wgt
> > ceph config rm osd osd_mclock_scheduler_client_lim
> > ceph config rm osd osd_mclock_scheduler_client_res
> > ceph config rm osd osd_mclock_scheduler_client_wgt
> > ```
> >
> > The config contains this now:
> >
> > ```
> > osd advanced osd_mclock_profile high_recovery_ops
> > osd advanced osd_op_queue mclock_scheduler *
> > ```
> >
> > And we can see that the configuration was propagated to the OSDs:
> >
> > ```
> > "osd_mclock_profile": "high_recovery_ops",
> > "osd_mclock_scheduler_anticipation_timeout": "0.00",
> > "osd_mclock_scheduler_background_best_effort_lim": "99",
> > "osd_mclock_scheduler_background_best_effort_res": "1",
> > "osd_mclock_scheduler_background_best_effort_wgt": "2",
> > "osd_mclock_scheduler_background_recovery_lim": "343",
> > "osd_mclock_scheduler_background_recovery_res": "103",
> > "osd_mclock_scheduler_background_recovery_wgt": "2",
> > "osd_mclock_scheduler_client_lim": "137",
> > "osd_mclock_scheduler_client_res": "51",
> > "osd_mclock_scheduler_client_wgt": "1",
> > "osd_mclock_skip_benchmark": "false",
> > "osd_op_queue": "mclock_scheduler",
> >
> > ```
> >
> > ** Step 3
> >
> > The issue comes now, when we try to go back to custom profile:
> >
> > ```
> > ceph config set osd osd_mclock_profile custom
> > ceph config set osd osd_mclock_scheduler_background_rec

[ceph-users] Re: OSD crush with end_of_buffer

2022-03-30 Thread Wissem MIMOUNA
Dear all,

We noticed that the issue we encounter happens exclusively on one host out of 
10 hosts in total (almost all of the 8 OSDs on this host crash periodically, 
~3 times a week).

Is there any idea/suggestion?

Thanks



Hi,

I found more information in the OSD logs about this assertion; maybe it could help:

ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
in thread 7f8002357700 thread_name:msgr-worker-2
*** Caught signal (Aborted) **
what():  buffer::end_of_buffer
terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of_buffer'

Thanks for your help

Subject: [ceph-users] OSD crush on a new ceph cluster

Dear All,

We recently installed a new ceph cluster with ceph-ansible. Everything works 
fine, except that we noticed in the last few days that some OSDs crashed.

Here below is the log for more information.

Thanks for your help.

"crash_id": "2022-03-23T08:27:05.085966Z_xx",





"timestamp": "2022-03-23T08:27:05.085966Z",





"process_name": "ceph-osd",





"entity_name": "osd.xx",





"ceph_version": "15.2.16",





"utsname_hostname": "",





"utsname_sysname": "Linux",





"utsname_release": "4.15.0-169-generic",





"utsname_version": "#177-Ubuntu SMP Thu Feb 3 10:50:38 UTC 2022",





"utsname_machine": "x86_64",





"os_name": "Ubuntu",





"os_id": "ubuntu",



"os_version_id": "18.04",



"os_version": "18.04.6 LTS (Bionic Beaver)",



"backtrace": [



"(()+0x12980) [0x7f557c3f8980]",



"(gsignal()+0xc7) [0x7f557b0aae87]",



"(abort()+0x141) [0x7f557b0ac7f1]",



"(()+0x8c957) [0x7f557ba9f957]",



"(()+0x92ae6) [0x7f557baa5ae6]",



"(()+0x92b21) [0x7f557baa5b21]",



"(()+0x92d54) [0x7f557baa5d54]",



"(()+0x964eda) [0x555f1a9e9eda]",



"(()+0x11f3e87) [0x555f1b278e87]",



"(ceph::buffer::v15_2_0::list::iterator_impl::copy_deep(unsigned 
int, ceph::buffer::v15_2_0::ptr&)+0x77) [0x555f1b2799d7]",




"(CryptoKey::decode(ceph::buffer::v15_2_0::list::iterator_impl&)+0x7a) 
[0x555f1b07e52a]",



"(void 
decode_decrypt_enc_bl(ceph::common::CephContext*, 
CephXServiceTicketInfo&, CryptoKey, ceph::buffer::v15_2_0::list const&, 
std::__cxx11::basic_string, std::allocator 
>&)+0x7ed) [0x555f1b3e364d]",



"(cephx_verify_authorizer(ceph::common::CephContext*, KeyStore const&, 
ceph::buffer::v15_2_0::list::iterator_impl&, unsigned long, 
CephXServiceTicketInfo&, std::unique_ptr >*, 
std::__cxx11::basic_string, std::allocator 
>*, ceph::buffer::v15_2_0::list*)+0x519) [0x555f1b3ddaa9]",



"(CephxAuthorizeHandler::verify_authorizer(ceph::common::CephContext*, 
KeyStore const&, ceph::buffer::v15_2_0::list const&, unsigned long, 
ceph::buffer::v15_2_0::list*, EntityName*, unsigned long*, AuthCapsInfo*, 
CryptoKey*, std::__cxx11::basic_string, 
std::allocator >*, std::unique_ptr >*)+0x74b) [0x555f1b3d1ccb]",



"(MonClient::handle_auth_request(Connection*, AuthConnectionMeta*, 
bool, unsigned int, ceph::buffer::v15_2_0::list const&, 
ceph::buffer::v15_2_0::list*)+0x284) [0x555f1b2a02e4]",



"(ProtocolV1::handle_connect_message_2()+0x7d7) [0x555f1b426167]",



"(ProtocolV1::handle_connect_message_auth(char*, int)+0x80) 
[0x555f1b429430]",



"(()+0x138869d) [0x555f1b40d69d]",



"(AsyncConnection::process()+0x5fc) [0x555f1b40a4bc]",



"(EventCenter::process_events(unsigned int, 
std::chrono::duration >*)+0x7dd) 
[0x555f1b25a6dd]",



"(()+0x11db258) [0x555f1b260258]",



"(()+0xbd6df) [0x7f557bad06df]",



"(()+0x76db) [0x7f557c3ed6db]",



"(clone()+0x3f) [0x7f557b18d61f]"



]









Best Regards

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: quincy v17.2.0 QE Validation status

2022-03-30 Thread Ilya Dryomov
On Mon, Mar 28, 2022 at 11:48 PM Yuri Weinstein  wrote:
>
> We are trying to release v17.2.0 as soon as possible.
> And need to do a quick approval of tests and review failures.
>
> Still outstanding are two PRs:
> https://github.com/ceph/ceph/pull/45673
> https://github.com/ceph/ceph/pull/45604
>
> Build failing and I need help to fix it ASAP.
> (
> https://shaman.ceph.com/builds/ceph/wip-yuri11-testing-2022-03-28-0907-quincy/61b142c76c991abe3fe77390e384b025e1711757/
> )
>
> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/55089
> Release Notes - https://github.com/ceph/ceph/pull/45048
>
> Seeking approvals for:
>
> smoke - Neha, Josh (the failure appears reproducible)
> rgw - Casey
> fs - Venky, Gerg
> rbd - Ilya, Deepika
> krbd  Ilya, Deepika

After an additional rerun for krbd, rbd and krbd approved.

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] zap an osd and it appears again

2022-03-30 Thread Alfredo Rezinovsky
I want to create osds manually

If I zap the osd  0 with:

ceph orch osd rm 0 --zap

as soon as the dev is available the orchestrator creates it again

If I use:

ceph orch apply osd --all-available-devices --unmanaged=true

and then zap the osd.0 it also appears again.

Is there a real way to disable the orch apply persistence, or to disable it
temporarily?

-- 
Alfrenovsky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: zap an osd and it appears again

2022-03-30 Thread Eugen Block
Do you have other osd services defined which would apply to the  
affected host? Check ‚ceph orch ls‘ for other osd services.
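For example (the --export variant shows the full service specs, including
their placement):

ceph orch ls osd
ceph orch ls osd --export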


Quoting Alfredo Rezinovsky :


I want to create osds manually

If I zap the osd  0 with:

ceph orch osd rm 0 --zap

as soon as the dev is available the orchestrator creates it again

If I use:

ceph orch apply osd --all-available-devices --unmanaged=true

and then zap the osd.0 it also appears again.

There is a real way to disable the orch apply persistency or disable it
temporarily?

--
Alfrenovsky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Questions / doubts about rgw users and zones

2022-03-30 Thread Arno Lehmann

Hi Ulrich, all,

took me a while to get back to this, which was because I got as slow 
with $JOB as my Ceph clusters are in general :-)


Am 19.03.2022 um 20:26 schrieb Ulrich Klein:

Hi,

I'm not the expert, either :) So if someone with more experience wants to 
correct me, that’s fine.


At least that allowed me to notice that nobody found anything to correct.


But I think I have a similar setup with a similar goal.

I have two clusters, purely for RGW/S3.
I have a realm R in which I created a zonegroup ZG (not the low tax Kanton:) )


Actually, the taxes the company pays there are noticeable, as far as I 
understand.



On the primary cluster I have a zone ZA as master and on the second cluster a 
zone ZB.
With all set up including the access keys for the zones, metadata and data is 
synced between the two.

Users access only the primary cluster, the secondary is basically a very safe 
backup.


Indeed, that is what I needed as a development environment.


But I want - for some users - that their data is NOT replicated to that 
secondary cluster (cheaper plan or short-lived data).


And that came later, when I decided to actually learn what I can do 
with Ceph.



I found two ways to achieve that.
One is similar to what I understand is your setup:


It is, and it turns out that the explicit disabling of syncing and then 
selectively enabling seems to play an important role.


That is, after I fixed some missing credentials and also tweaked some 
endpoint settings and stuff.


After all my playing around, I'm not sure I could pinpoint where the 
manual was misleading, and where I was just implementing my own issues.


...

My alternative solution was to turn on/off synchronization on buckets:
For any existing (!) bucket one can simply turn off/on synchronization via
# radosgw-admin bucket sync [enable/disable] --bucket=

The problem is that it only works on existing buckets. I've found no way to turn 
synchronization off by default, and even less what I actually need, which is 
to turn synchronization/replication on/off per RGW user.
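(For checking the result, something like this can be used; the bucket name is a
placeholder:)

radosgw-admin bucket sync status --bucket=<bucket-name>
radosgw-admin sync status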

I discarded sync policies as they left the sync status in a suspicious state, were 
complicated in a strange way and the documentation "wasn't too clear to me"

Dunno if this helps, and I'm pretty sure there may be better ways. But this 
worked for me.


In the end, this approach did work for me, and the manual and selective 
en- and disabling was somewhat important, it appears.




Ciao, Uli


PS:
I use s3cmd, rclone and cyberduck for my simple testing. The aws cli I found more 
AWS-centric, and it also doesn't work well with Ceph/RGW tenants.


The applications using this storage system are a boto3-based python 
application and its testing framework. Up to now, $CUSTOMER has found no 
problem to complain about :-)


I've also started using rbd for virtual disk storage with Proxmox, and 
found no issues so far, so I will not venture into new lands for now, 
but thanks for the advice.



And, I'm not sure why you have so many endpoints in the zonegroup, but no load 
balancer a la RGW ingress, i.e. keepalived+haproxy. But that may be my lack of 
expertise.


Rather lack of my expertise... I did not want to deploy more systems 
than necessary, and while having three rgw heads is kind of overkill for 
my purposes, I'm very satisfied it works at all and I do not have to 
understand a load balancer and TLS endpoint :-)



Again, thanks for your advice; while it was not directly telling me what 
to do, it gave me the hints I needed!



Cheers,

Arno
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: zap an osd and it appears again

2022-03-30 Thread Alfredo Rezinovsky
Yes.

osd.all-available-devices 0  -  3h

osd.dashboard-admin-1635797884745 7  4m ago 4M   *

How should I disable the creation?

On Wed, 30 Mar 2022 at 17:24, Eugen Block () wrote:

> Do you have other osd services defined which would apply to the
> affected host? Check ‚ceph orch ls‘ for other osd services.
>
> Zitat von Alfredo Rezinovsky :
>
> > I want to create osds manually
> >
> > If I zap the osd  0 with:
> >
> > ceph orch osd rm 0 --zap
> >
> > as soon as the dev is available the orchestrator creates it again
> >
> > If I use:
> >
> > ceph orch apply osd --all-available-devices --unmanaged=true
> >
> > and then zap the osd.0 it also appears again.
> >
> > There is a real way to disable the orch apply persistency or disable it
> > temporarily?
> >
> > --
> > Alfrenovsky
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Alfrenovsky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: replace MON server keeping identity (Octopus)

2022-03-30 Thread Nigel Williams
Thank you York, that suggestion worked well.

'ceph-deploy mon destroy' on the old server followed by new server identity
change, then 'ceph-deploy mon create' on this replacement worked.
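Concretely, that was roughly (hostnames are placeholders):

ceph-deploy mon destroy <old-mon-host>
# ...move the identity (hostname/IP) over to the replacement server...
ceph-deploy mon create <replacement-host>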



On Wed, 30 Mar 2022 at 19:06, York Huang  wrote:

> the shrink-mon.yml and add-mon.yml playbooks may give you some insights
> for such operations. (remember to check out the correct Ceph version first)
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: zap an osd and it appears again

2022-03-30 Thread Eugen Block
I'm still not sure how that service osd.dashboard-admin-1635797884745  
is created, I've seen a couple of reports with this service. Is it  
created automatically when you try to manage OSDs via dashboard? This  
tracker issue [1] reads like that. By the way, there's a thread [2]  
asking the same question.
Anyway, you could probably just remove this service if you'd rather  
use a drivegroup.yml to have more control over the OSD creation:


ceph orch rm osd.dashboard-admin-1635797884745
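A minimal drivegroup spec for manually controlled OSDs could look roughly like
this (hostname and device path are placeholders; apply it with ceph orch apply
-i <file>.yml, and unmanaged: true can be added to stop the orchestrator from
creating OSDs from the spec automatically):

service_type: osd
service_id: manual_osds
placement:
  hosts:
    - <hostname>
data_devices:
  paths:
    - /dev/sdX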


[1] https://tracker.ceph.com/issues/50296
[2] https://www.spinics.net/lists/ceph-users/msg69816.html

Quoting Alfredo Rezinovsky :


Yes.

osd.all-available-devices 0  -  3h

osd.dashboard-admin-1635797884745 7  4m ago 4M   *

How should I disable the creation?

On Wed, 30 Mar 2022 at 17:24, Eugen Block () wrote:


Do you have other osd services defined which would apply to the
affected host? Check ‚ceph orch ls‘ for other osd services.

> Quoting Alfredo Rezinovsky :

> I want to create osds manually
>
> If I zap the osd  0 with:
>
> ceph orch osd rm 0 --zap
>
> as soon as the dev is available the orchestrator creates it again
>
> If I use:
>
> ceph orch apply osd --all-available-devices --unmanaged=true
>
> and then zap the osd.0 it also appears again.
>
> There is a real way to disable the orch apply persistency or disable it
> temporarily?
>
> --
> Alfrenovsky
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




--
Alfrenovsky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io