[ceph-users] Cluster suspends when Add Mon or stop and start after a while.

2021-03-28 Thread morphin
Hello!

I have a cluster with a datacenter CRUSH map (A+B), 9+9 = 18 servers.
The cluster started on v12.2.0 Luminous 4 years ago.
Over the years I upgraded the cluster Luminous > Mimic > v14.2.16 Nautilus.
Now I have a weird issue. When I add a mon, or stop one for a while and
start it up again, the whole cluster suspends: ceph -s does not respond,
and the other two monitors keep starting elections while the booting mon
is syncing. (logs below)
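
Side note for anyone hitting a similar hang: even while ceph -s is
unresponsive, each monitor's admin socket can usually still be queried
locally on the mon host. A minimal sketch, assuming default admin socket
paths and substituting your own mon IDs:

# ceph daemon mon.SRV-SB-1 mon_status      # sync/election state of this mon
# ceph daemon mon.SRV-SB-1 quorum_status   # who this mon believes is in quorum
# ceph daemon mon.SRV-SB-1 ops             # slow/blocked ops queued on this mon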



2021-03-28 00:18:23.482 7fe2f3610700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:23.782 7fe2eee07700 -1 mon.SRV-SB-1@1(electing) e9
failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:24.292 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:26.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
get_health_metrics reporting 3919 slow ops, oldest is log(1 entries
from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:29.782 7fe2f160c700  1
mon.SRV-SB-1@1(electing).elector(7899) init, last seen epoch 7899,
mid-election, bumping
2021-03-28 00:18:29.812 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:31.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
get_health_metrics reporting 3951 slow ops, oldest is log(1 entries
from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:31.872 7fe2f3610700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:32.072 7fe2f3610700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:32.482 7fe2f3610700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:33.282 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:34.812 7fe2f160c700  1
mon.SRV-SB-1@1(electing).elector(7901) init, last seen epoch 7901,
mid-election, bumping
2021-03-28 00:18:34.842 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:34.872 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:35.072 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:35.492 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:36.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
get_health_metrics reporting 3989 slow ops, oldest is log(1 entries
from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:36.292 7fe2f2e0f700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:39.842 7fe2f160c700  1
mon.SRV-SB-1@1(electing).elector(7903) init, last seen epoch 7903,
mid-election, bumping
2021-03-28 00:18:39.872 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:40.872 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:41.082 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:41.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
get_health_metrics reporting 4027 slow ops, oldest is log(1 entries
from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:41.492 7fe2f3610700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:41.812 7fe2eee07700 -1 mon.SRV-SB-1@1(electing) e9
failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:42.312 7fe2f3610700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:43.882 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:44.082 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:44.492 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:45.302 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:46.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
get_health_metrics reporting 4062 slow ops, oldest is log(1 entries
from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:47.812 7fe2f160c700  1
mon.SRV-SB-1@1(electing).elector(7905) init, last seen epoch 7905,
mid-election, bumping
2021-03-28 00:18:47.842 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:51.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
get_health_metrics reporting 4091 slow ops, oldest is log(1 entries
from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:52.842 7fe2f160c700  1
mon.SRV-SB-1@1(electing).elector(7907) ini

[ceph-users] Nautilus: Reduce the number of managers

2021-03-28 Thread Dave Hall

Hello,

We are in the process of bringing new hardware online that will allow us 
to get all of the MGRs, MONs, MDSs, etc.  off of our OSD nodes and onto 
dedicated management nodes.   I've created MGRs and MONs on the new 
nodes, and I found procedures for disabling the MONs from the OSD nodes.


Now I'm looking for the correct procedure to remove the MGRs from the 
OSD nodes.  I haven't found any reference to this in the documentation.  
Is it as simple as stopping and disabling the systemd service/target?  
Or are there Ceph commands?  Do I need to clean up /var/lib/ceph/mgr?


Same questions about MDS in the near term, but I haven't searched the 
docs yet.
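
A rough sketch of what removing an MGR might look like on a systemd-managed
(non-cephadm) node, with <host> as a placeholder for the daemon id; this is
an assumption to verify against the docs, not an established procedure:

# systemctl stop ceph-mgr@<host>
# systemctl disable ceph-mgr@<host>
# ceph auth del mgr.<host>                 # drop the daemon's cephx key
# rm -rf /var/lib/ceph/mgr/ceph-<host>     # remove the now-unused data dir

The MDS case would presumably follow the same pattern (ceph-mds@<host>,
mds.<host>, /var/lib/ceph/mds/ceph-<host>), with the extra step of making
sure the daemon is a standby rather than an active rank before stopping it.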


Thanks.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Upgrade from Luminous to Nautilus now one MDS with could not get service secret

2021-03-28 Thread Robert LeBlanc
We just upgraded our cluster from Luminous to Nautilus, and after a few
days one of our MDS servers is getting:

2021-03-28 18:06:32.304 7f57c37ff700  5 mds.beacon.sun-gcs01-mds02
Sending beacon up:standby seq 16
2021-03-28 18:06:32.304 7f57c37ff700 20 mds.beacon.sun-gcs01-mds02
sender thread waiting interval 4s
2021-03-28 18:06:32.308 7f57c8809700  5 mds.beacon.sun-gcs01-mds02
received beacon reply up:standby seq 16 rtt 0.0041
2021-03-28 18:06:36.308 7f57c37ff700  5 mds.beacon.sun-gcs01-mds02
Sending beacon up:standby seq 17
2021-03-28 18:06:36.308 7f57c37ff700 20 mds.beacon.sun-gcs01-mds02
sender thread waiting interval 4s
2021-03-28 18:06:36.308 7f57c8809700  5 mds.beacon.sun-gcs01-mds02
received beacon reply up:standby seq 17 rtt 0
2021-03-28 18:06:37.788 7f57c900a700  0 auth: could not find secret_id=34586
2021-03-28 18:06:37.788 7f57c900a700  0 cephx: verify_authorizer could
not get service secret for service mds secret_id=34586
2021-03-28 18:06:37.788 7f57c6004700  5 mds.sun-gcs01-mds02
ms_handle_reset on v2:10.65.101.13:46566/0
2021-03-28 18:06:40.308 7f57c37ff700  5 mds.beacon.sun-gcs01-mds02
Sending beacon up:standby seq 18
2021-03-28 18:06:40.308 7f57c37ff700 20 mds.beacon.sun-gcs01-mds02
sender thread waiting interval 4s
2021-03-28 18:06:40.308 7f57c8809700  5 mds.beacon.sun-gcs01-mds02
received beacon reply up:standby seq 18 rtt 0
2021-03-28 18:06:44.304 7f57c37ff700  5 mds.beacon.sun-gcs01-mds02
Sending beacon up:standby seq 19
2021-03-28 18:06:44.304 7f57c37ff700 20 mds.beacon.sun-gcs01-mds02
sender thread waiting interval 4s

I've tried removing the /var/lib/ceph/mds/ directory and getting the
key again. I've removed the key and generated a new one, and I've checked
the clocks on all the nodes. From what I can tell, everything is
good.
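
For reference, the key regeneration described above would typically look
something like the sketch below; the caps shown are the ones from the
manual MDS deployment docs and may not match what this cluster actually
uses, so treat them as an assumption:

# ceph auth get-or-create mds.sun-gcs01-mds02 mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' > /var/lib/ceph/mds/ceph-sun-gcs01-mds02/keyring
# chown ceph:ceph /var/lib/ceph/mds/ceph-sun-gcs01-mds02/keyring
# systemctl restart ceph-mds@sun-gcs01-mds02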

We did have an issue where the monitor cluster fell over and would not
boot. We reduced the monitors to a single monitor, disabled cephx,
pulled it off the network and restarted the service a few times, which
allowed it to come up. We then expanded back to three mons, re-enabled
cephx, and everything has been good until now. No other services seem
to be affected, and the MDS even appears to work okay despite these
messages. We would still like to figure out how to resolve this.

Thank you,
Robert LeBlanc


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [ Failed ] Upgrade path for Ceph Ansible from Octopus to Pacific

2021-03-28 Thread Lokendra Rathour
Hi Robert,
I just checked your email on the ceph-users list.
I will try to look deeper into the question.
For now I have a query related to the upgrade itself:
is it possible for you to send any links/documents that you are following
to upgrade Ceph?
I am trying to upgrade a Ceph cluster using ceph-ansible on CentOS 7 from
Nautilus to Octopus, but I am hitting many el7-related upgrade issues.
Similarly, I am trying to do it on CentOS 8, but no luck so far.

I have tried posting my query to the community (see the mail in the trail
below), but it is yet to be posted pending moderator approval.

Any support would be appreciated. Thank you once again for your help.

-Lokendra
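
One thing worth checking, offered as an assumption rather than a verified
fix: ceph-ansible branches track specific Ceph releases (stable-5.0 targets
Octopus, stable-6.0 targets Pacific), so the playbooks usually need to be
switched to the matching branch before all.yml can reference a newer
release. Roughly:

# cd /home/ansible/ceph-ansible      # path taken from the task output below
# git fetch origin
# git checkout stable-6.0            # branch targeting Pacific (verify against the ceph-ansible docs)
# pip install -r requirements.txt    # pick up the Ansible version that branch expects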

On Tue, 23 Mar 2021, 10:17 Lokendra Rathour, 
wrote:

> Hi Team,
> I am trying to upgrade my existing Ceph cluster (using ceph-ansible) from
> the current release, Octopus, to Pacific, using a rolling upgrade.
> I am facing various issues getting it done; please see below and suggest:
>
> Issue 1: when updating the all.yml file with the ceph release number set
> to 16 and the ceph release set to Pacific:
>
> TASK [ceph-validate : validate ceph_repository_community] ************
> task path: /home/ansible/ceph-ansible/roles/ceph-validate/tasks/main.yml:20
> Tuesday 23 March 2021  10:00:09 +0530 (0:00:00.141)       0:01:06.028
> fatal: [cephnode1]: FAILED! => changed=false
>   msg: ceph_stable_release must be either 'nautilus' or 'octopus'
> fatal: [cephnode2]: FAILED! => changed=false
>   msg: ceph_stable_release must be either 'nautilus' or 'octopus'
> fatal: [cephnode3]: FAILED! => changed=false
>   msg: ceph_stable_release must be either 'nautilus' or 'octopus'
>
> Issue 2: keeping ceph_stable_release at octopus and changing the ceph
> release number to 16 gives this error:
>
> fatal: [cephnode1 -> cephnode1]: FAILED! => changed=true
>   cmd: [ceph, --cluster, ceph, osd, require-osd-release, pacific]
>   delta: '0:00:00.478257'
>   end: '2021-03-22 19:54:22.892994'
>   invocation:
>     module_args:
>       _raw_params: ceph --cluster ceph osd require-osd-release pacific
>       _uses_shell: false
>       argv: null
>       chdir: null
>       creates: null
>       executable: null
>       removes: null
>       stdin: null
>       stdin_add_newline: true
>       strip_empty_ends: true
>       warn: true
>   msg: non-zero return code
>   rc: 22
>   start: '2021-03-22 19:54:22.414737'
>   stderr: |-
>     Invalid command: pacific not in luminous|mimic|nautilus|octopus
>     osd require-osd-release luminous|mimic|nautilus|octopus [--yes-i-really-mean-it] :  set the minimum allowed OSD release to participate in the cluster
>     Error EINVAL: invalid command
>   stderr_lines:
>   stdout: ''
>   stdout_lines:
>
> Problem statement:
>
> Not able to upgrade the cluster from Octopus to Pacific with ceph-ansible.
> Please suggest/support.
>
> --
> ~ Lokendra
> skype: lokendrarathour
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [ Failed ] Upgrade path for Ceph Ansible from Octopus to Pacific

2021-03-28 Thread Lokendra Rathour
I request the moderators to approve my earlier message.
It has been a while, and a solution to the issue has not been found yet.

-Lokendra

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to clear Health Warning status?

2021-03-28 Thread jinguk.k...@ungleich.ch
Hello there, 

Thank you for your response.
There are no errors in syslog, dmesg, or SMART.

# ceph health detail
HEALTH_WARN Too many repaired reads on 2 OSDs
OSD_TOO_MANY_REPAIRS Too many repaired reads on 2 OSDs
osd.29 had 38 reads repaired
osd.16 had 17 reads repaired

How can I clear this warning?
My Ceph version is 14.2.9 (clear_shards_repaired is not supported).



/dev/sdh1 on /var/lib/ceph/osd/ceph-16 type xfs 
(rw,relatime,attr2,inode64,noquota)

# cat dmesg | grep sdh
[   12.990728] sd 5:2:3:0: [sdh] 19531825152 512-byte logical blocks: (10.0 
TB/9.09 TiB)
[   12.990728] sd 5:2:3:0: [sdh] Write Protect is off
[   12.990728] sd 5:2:3:0: [sdh] Mode Sense: 1f 00 00 08
[   12.990728] sd 5:2:3:0: [sdh] Write cache: enabled, read cache: enabled, 
doesn't support DPO or FUA
[   13.016616]  sdh: sdh1 sdh2
[   13.017780] sd 5:2:3:0: [sdh] Attached SCSI disk

# ceph tell osd.29 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 6.464404,
"bytes_per_sec": 166100668.21318716,
"iops": 39.60148530320815
}
# ceph tell osd.16 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 9.61689458,
"bytes_per_sec": 111651617.26584397,
"iops": 26.619819942914003
}
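
If the disks really are healthy, one hedged option is to raise the
threshold behind this warning rather than clearing the per-OSD counters;
this assumes the mon_osd_warn_num_repaired option (default 10) is honoured
on 14.2.9, which is worth verifying first:

# ceph config set global mon_osd_warn_num_repaired 50   # lift the threshold above the current repair counts
# ceph health detail                                     # check whether OSD_TOO_MANY_REPAIRS clears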

Thank you


> On 26 Mar 2021, at 16:04, Anthony D'Atri  wrote:
> 
> Did you look at syslog, dmesg, or SMART?  Most likely the drives are
> failing.
> 
> 
>> On Mar 25, 2021, at 9:55 PM, jinguk.k...@ungleich.ch wrote:
>> 
>> Hello there,
>> 
>> Thank you in advance.
>> My Ceph version is 14.2.9.
>> I have a repair issue too.
>> 
>> ceph health detail
>> HEALTH_WARN Too many repaired reads on 2 OSDs
>> OSD_TOO_MANY_REPAIRS Too many repaired reads on 2 OSDs
>>   osd.29 had 38 reads repaired
>>   osd.16 had 17 reads repaired
>> 
>> ~# ceph tell osd.16 bench
>> {
>>   "bytes_written": 1073741824,
>>   "blocksize": 4194304,
>>   "elapsed_sec": 7.148673815996,
>>   "bytes_per_sec": 150201541.10217974,
>>   "iops": 35.81083800844663
>> }
>> ~# ceph tell osd.29 bench
>> {
>>   "bytes_written": 1073741824,
>>   "blocksize": 4194304,
>>   "elapsed_sec": 6.924432750002,
>>   "bytes_per_sec": 155065672.9246161,
>>   "iops": 36.970537406114602
>> }
>> 
>> But it looks like those OSDs are OK. How can I clear this warning?
>> 
>> Best regards
>> JG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Do I need to update ceph.conf and restart each OSD after adding more MONs?

2021-03-28 Thread Tony Liu
Thank you Stefan and Josh!
Tony

From: Josh Baergen 
Sent: March 28, 2021 08:28 PM
To: Tony Liu
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Do I need to update ceph.conf and restart each 
OSD after adding more MONs?

As was mentioned in this thread, all of the mon clients (OSDs included) learn 
about other mons through monmaps, which are distributed when mon membership and 
election changes. Thus, your OSDs should already know about the new mons.

mon_host indicates the list of mons that mon clients should try to contact at 
boot. Thus, it's important to have it correct in the config, but it doesn't need 
to be updated after the process starts.

At least that's how I understand it; the config docs aren't terribly clear on 
this behaviour.
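
For illustration, a ceph.conf mon_host entry listing all three mons might look 
like the line below, built from the addresses quoted later in this thread; the 
exact addrvec formatting can be cross-checked against "ceph mon dump":

[global]
mon_host = [v2:10.250.50.80:3300/0,v1:10.250.50.80:6789/0],[v2:10.250.50.81:3300/0,v1:10.250.50.81:6789/0],[v2:10.250.50.82:3300/0,v1:10.250.50.82:6789/0]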

Josh


On Sat., Mar. 27, 2021, 2:07 p.m. Tony Liu, 
mailto:tonyliu0...@hotmail.com>> wrote:
Just realized that all config files (/var/lib/ceph///config)
on all nodes are already updated properly. It must be handled as part of adding
MONs. But "ceph config show" shows only a single host.

mon_host  [v2:10.250.50.80:3300/0,v1:10.250.50.80:6789/0]  file

That means I still need to restart all services to apply the update, right?
Is this supposed to be part of adding MONs as well, or is it an additional
manual step?


Thanks!
Tony

From: Tony Liu mailto:tonyliu0...@hotmail.com>>
Sent: March 27, 2021 12:53 PM
To: Stefan Kooman; ceph-users@ceph.io
Subject: [ceph-users] Re: Do I need to update ceph.conf and restart each OSD 
after adding more MONs?

# ceph config set osd.0 mon_host 
[v2:10.250.50.80:3300/0,v1:10.250.50.80:6789/0,v2:10.250.50.81:3300/0,v1:10.250.50.81:6789/0,v2:10.250.50.82:3300/0,v1:10.250.50.82:6789/0]
Error EINVAL: mon_host is special and cannot be stored by the mon

It seems that the only option is to update ceph.conf and restart the service.


Tony

From: Tony Liu mailto:tonyliu0...@hotmail.com>>
Sent: March 27, 2021 12:20 PM
To: Stefan Kooman; ceph-users@ceph.io
Subject: [ceph-users] Re: Do I need to update ceph.conf and restart each OSD 
after adding more MONs?

I expanded the MONs from 1 to 3 by updating the orch service with "ceph orch apply".
"mon_host" in all services (MON, MGR, OSDs) is not updated. It's still the single
host from source "file".
What's the guidance here to update "mon_host" for all services? I am talking
about Ceph services, not the client side.
Should I update ceph.conf for all services and restart all of them?
Or can I update it on the fly with "ceph config set"?
In the latter case, where is the updated configuration stored? Is it going to
be overridden by ceph.conf when the service restarts?


Thanks!
Tony


From: Stefan Kooman mailto:ste...@bit.nl>>
Sent: March 26, 2021 12:22 PM
To: Tony Liu; ceph-users@ceph.io
Subject: Re: [ceph-users] Do I need to update ceph.conf and restart each OSD 
after adding more MONs?

On 3/26/21 6:06 PM, Tony Liu wrote:
> Hi,
>
> Do I need to update ceph.conf and restart each OSD after adding more MONs?

This should not be necessary, as the OSDs should learn about these
changes through monmaps. Updating the ceph.conf after the mons have been
updated is advised.

> This is with 15.2.8 deployed by cephadm.
>
> When adding MON, "mon_host" should be updated accordingly.
> Given [1], is that update "the monitor cluster’s centralized configuration
> database" or "runtime overrides set by an administrator"?

No need to put that in the centralized config database. I *think* they
mean the ceph.conf file on the clients and hosts. At least, that's what you
would normally do (if not using DNS).

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io