[ceph-users] Re: Balancing PGs across OSDs

2019-11-19 Thread Thomas Schneider
Hello Paul,

thanks for your analysis.

I want to share more statistics of my cluster to follow up on your
response "You have way too few PGs in one of the roots".

Here are the pool details:
root@ld3955:~# ceph osd pool ls detail
pool 11 'hdb_backup' replicated size 3 min_size 2 crush_rule 1
object_hash rjenkins pg_num 8192 pgp_num 8192 autoscale_mode warn
last_change 294572 flags hashpspool,selfmanaged_snaps stripe_width 0
application rbd
    removed_snaps [1~3]
pool 59 'hdd' replicated size 2 min_size 2 crush_rule 3 object_hash
rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 267271
flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
    removed_snaps [1~3]
pool 60 'ssd' replicated size 2 min_size 2 crush_rule 4 object_hash
rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 299719
lfor 299717/299717/299717 flags hashpspool,selfmanaged_snaps
stripe_width 0 application rbd
    removed_snaps [1~3]
pool 61 'nvme' replicated size 2 min_size 2 crush_rule 2 object_hash
rjenkins pg_num 8 pgp_num 8 autoscale_mode warn last_change 267125 flags
hashpspool stripe_width 0 application rbd
pool 62 'cephfs_data' replicated size 3 min_size 2 crush_rule 3
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn
last_change 300312 lfor 300310/300310/300310 flags hashpspool
stripe_width 0 application cephfs
pool 63 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 3
object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode warn last_change
267069 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16
recovery_priority 5 application cephfs

Every pool's pg_num / pgp_num is monitored by Ceph, meaning I get a warning
in the log / health status if a pool is undersized.
I didn't enable the PG autoscaler for any pool, though.
The number of PGs per pool was calculated with pgcalc; here's a screenshot
of this calculation.
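
As a rough sanity check of the hdb_backup sizing (assuming the 336 HDDs
listed further below serve that pool exclusively, and the usual pgcalc
target of ~100 PGs per OSD):

    336 OSDs * 100 PGs/OSD / 3 replicas ≈ 11200

which pgcalc rounds to a power of two, i.e. 8192 or 16384 depending on the
%data assumption; the pool is configured with pg_num 8192.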

My focus is on pool hdb_backup.
Based on these statistics
root@ld3955:~# ceph df detail
RAW STORAGE:
    CLASS SIZE    AVAIL   USED    RAW USED %RAW USED
    hdd   1.4 PiB 744 TiB 729 TiB  730 TiB 49.53
    nvme   23 TiB  23 TiB  43 GiB   51 GiB  0.22
    ssd    27 TiB  25 TiB 1.9 TiB  1.9 TiB  7.15
    TOTAL 1.5 PiB 792 TiB 731 TiB  732 TiB 48.02

POOLS:
    POOL            ID STORED  OBJECTS USED    %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY   USED COMPR UNDER COMPR
    hdb_backup      11 241 TiB  63.29M 241 TiB 57.03    61 TiB N/A           N/A          63.29M        0 B         0 B
    hdd             59 553 GiB 142.16k 553 GiB  0.50    54 TiB N/A           N/A         142.16k        0 B         0 B
    ssd             60 2.0 TiB 530.75k 2.0 TiB  8.72    10 TiB N/A           N/A         530.75k        0 B         0 B
    nvme            61     0 B       0     0 B     0    11 TiB N/A           N/A               0        0 B         0 B
    cephfs_data     62 356 GiB 102.29k 356 GiB  0.32    36 TiB N/A           N/A         102.29k        0 B         0 B
    cephfs_metadata 63 117 MiB      52 117 MiB     0    36 TiB N/A           N/A              52        0 B         0 B

there's only 57% used, but effectively I cannot store much more data
because some OSDs are already filled beyond 80%.

It is true that the disks used exclusively for this pool differ in size:
3x 48 disks of 7.2 TB
4x 48 disks of 1.6 TB
The disk usage ranges from 41% to 54% on the 7.2 TB disks and from 52%
to 81% on the 1.6 TB disks.

If Ceph is not capable of rebalancing this automatically, how can I
proceed to rebalance the data manually?
OSD reweight is not an option in my opinion, because it starts filling
OSDs that are not the ones with the lowest usage.
Can I move PGs to specific OSDs?


THX




Am 18.11.2019 um 20:18 schrieb Paul Emmerich:
> You have way too few PGs in one of the roots. Many OSDs have so few
> PGs that you should see a lot of health warnings because of it.
> The other root has a factor 5 difference in disk size which isn't ideal 
> either.
>
>
> Paul
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancing PGs across OSDs

2019-11-19 Thread Konstantin Shalygin



On 11/19/19 4:01 PM, Thomas Schneider wrote:

If Ceph is not capable of rebalancing this automatically, how can I
proceed to rebalance the data manually?


Use offline upmap for your target pool:

ceph osd getmap -o om; osdmaptool om --upmap upmap.sh 
--upmap-pool=hdb_backup --upmap-deviation 0; bash upmap.sh; rm -f 
upmap.sh om
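
A commented version of the same sequence, plus the per-PG equivalent, for
reference (a sketch only; the PG id and OSD ids below are placeholders, not
values from this cluster):

# grab the current osdmap
ceph osd getmap -o om
# have osdmaptool compute upmap entries for the pool, aiming for 0 PG deviation
osdmaptool om --upmap upmap.sh --upmap-pool=hdb_backup --upmap-deviation 0
# review upmap.sh, then apply it and clean up
bash upmap.sh
rm -f upmap.sh om

A single PG can also be remapped by hand, e.g.

ceph osd pg-upmap-items 11.0 123 456

which moves that PG's copy from osd.123 to osd.456 (within CRUSH constraints).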





k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: add debian buster stable support for ceph-deploy

2019-11-19 Thread Jelle de Jong

On 11/18/19 8:08 PM, Paul Emmerich wrote:

We maintain an unofficial mirror for Buster packages:
https://croit.io/2019/07/07/2019-07-07-debian-mirror


Thank you Paul. Yes, I have seen the repository; however, there is no 
ceph-deploy package in there, and ceph-deploy checks the Debian version 
and reports that buster is not supported.


https://mirror.croit.io/debian-nautilus/pool/main/c/

Regards,

Jelle de Jong
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How proceed to change a crush rule and remap pg's?

2019-11-19 Thread Maarten van Ingen
Hi, 

I have a small but impacting error in my crush rules. 
For unknown reasons the rules are not using host but osd to place the data and 
thus we have some nodes with all three copies instead of three different nodes. 
We noticed this when rebooting a node and a pg became stale. 

My crush rule: 
{ 
"rule_id": 0, 
"rule_name": "replicated_rule", 
"ruleset": 0, 
"type": 1, 
"min_size": 1, 
"max_size": 10, 
"steps": [ 
{ 
"op": "take", 
"item": -2, 
"item_name": "default~hdd" 
}, 
{ 
"op": "chooseleaf_firstn", 
"num": 0, 
"type": "osd" 
}, 
{ 
"op": "emit" 
} 
] 
}, 


The type should be host, of course, and I want to change this and move the PGs 
so that everything is placed as it should be. 
How can I best proceed in correcting this issue? I would like to throttle the 
remapping of the data so that Ceph itself won't become unavailable while the 
data is redistributed. 
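
For what it's worth, my idea for throttling would be to lower the 
backfill/recovery tunables before changing the rule, roughly like this 
(untested here, values only illustrative): 

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1' 
ceph tell osd.* injectargs '--osd-recovery-sleep-hdd 0.1' 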

We are running on Mimic (13.2.6), and this environment has been installed 
freshly as Mimic while using ceph-ansible. 

Current ceph -s output: 



cluster: 

id: < 

health: HEALTH_OK 



services: 

mon: 3 daemons, quorum mon01,mon02,mon03 

mgr: mon01(active), standbys: mon02, mon03 

mds: cephfs-2/2/2 up {0=mon03=up:active,1=mon01=up:active}, 1 up:standby 

osd: 502 osds: 502 up, 502 in 



data: 

pools: 18 pools, 8192 pgs 

objects: 28.74 M objects, 100 TiB 

usage: 331 TiB used, 2.3 PiB / 2.6 PiB avail 

pgs: 8192 active+clean 




Cheers, 

Maarten van Ingen 
| Systems Expert | Distributed Data Processing | SURFsara | Science Park 140 | 
1098 XG Amsterdam | 
| T +31 (0) 20 800 1300 | maarten.vanin...@surfsara.nl | https://surfsara.nl | 



We are ISO 27001 certified and meet the high requirements for information 
security. 


smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: add debian buster stable support for ceph-deploy

2019-11-19 Thread Paul Emmerich
Correct, we don't package ceph-deploy, sorry.
ceph-deploy is currently unmaintained, I wouldn't use it for a
production setup at the moment.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Nov 19, 2019 at 10:29 AM Jelle de Jong
 wrote:
>
> On 11/18/19 8:08 PM, Paul Emmerich wrote:
> > We maintain an unofficial mirror for Buster packages:
> > https://croit.io/2019/07/07/2019-07-07-debian-mirror
>
> Thank you Paul. Yes, I have seen the repository; however, there is no
> ceph-deploy package in there, and ceph-deploy checks the Debian version
> and reports that buster is not supported.
>
> https://mirror.croit.io/debian-nautilus/pool/main/c/
>
> Regards,
>
> Jelle de Jong
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How proceed to change a crush rule and remap pg's?

2019-11-19 Thread Paul Emmerich
I don't think that there's a feasible way to do this in a controlled
manner. I would just change it and trust Ceph's remapping mechanism to work
properly.

You could use crushtool to calculate what the new mapping is and then do
something crazy with upmaps (move them manually to the new locations one by
one and then remove all upmaps and change the rule)... but that's quite
annoying to do and probably doesn't really help.
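
For reference, the crushtool dry run would look roughly like this (file
names arbitrary, not tested against your map):

ceph osd getcrushmap -o cm.bin
crushtool -d cm.bin -o cm.txt
# edit cm.txt: change "type osd" to "type host" in the rule's chooseleaf step
crushtool -c cm.txt -o cm-new.bin
crushtool -i cm-new.bin --test --rule 0 --num-rep 3 --show-mappings | head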

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Tue, Nov 19, 2019 at 11:11 AM Maarten van Ingen <
maarten.vanin...@surfsara.nl> wrote:

> Hi,
>
> I have a small but impacting error in my crush rules.
> For unknown reasons the rules are not using host but osd to place the data
> and thus we have some nodes with all three copies instead of three
> different nodes.
> We noticed this when rebooting a node and a pg became stale.
>
> My crush rule:
> {
> "rule_id": 0,
> "rule_name": "replicated_rule",
> "ruleset": 0,
> "type": 1,
> "min_size": 1,
> "max_size": 10,
> "steps": [
> {
> "op": "take",
> "item": -2,
> "item_name": "default~hdd"
> },
> {
> "op": "chooseleaf_firstn",
> "num": 0,
> "type": "osd"
> },
> {
> "op": "emit"
> }
> ]
> },
>
>
> The type should be host, of course, and I want to change this and move the
> PGs so that everything is placed as it should be.
> How can I best proceed in correcting this issue? I would like to throttle the
> remapping of the data so that Ceph itself won't become unavailable while the
> data is redistributed.
>
> We are running on Mimic (13.2.6), and this environment has been installed
> freshly as Mimic while using ceph-ansible.
>
> Current ceph -s output:
>
>   cluster:
>
> id: <
>
> health: HEALTH_OK
>
>
>
>   services:
>
> mon: 3 daemons, quorum mon01,mon02,mon03
>
> mgr: mon01(active), standbys: mon02, mon03
>
> mds: cephfs-2/2/2 up  {0=mon03=up:active,1=mon01=up:active}, 1
> up:standby
>
> osd: 502 osds: 502 up, 502 in
>
>
>
>   data:
>
> pools:   18 pools, 8192 pgs
>
> objects: 28.74 M objects, 100 TiB
>
> usage:   331 TiB used, 2.3 PiB / 2.6 PiB avail
>
> pgs: 8192 active+clean
>
>
> Cheers,
>
> Maarten van Ingen
> | Systems Expert | Distributed Data Processing | SURFsara | Science Park
> 140 | 1098 XG Amsterdam |
> | T +31 (0) 20 800 1300 | maarten.vanin...@surfsara.nl |
> https://surfsara.nl |
>
> We are ISO 27001 certified and meet the high requirements for information
> security.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How proceed to change a crush rule and remap pg's?

2019-11-19 Thread Maarten van Ingen
Thanks, 

The crushtool approach didn't help me much further unless I did something 
crazy, as you said. 
So I have started by just creating a new, correct rule and changing the 
pools one by one to use the new rule. 
This seems to work fine and, as far as I can see, it didn't impact any users 
(much). 
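
In case it helps anyone else, the steps were essentially the following 
(rule name and placeholders are just examples): 

ceph osd crush rule create-replicated replicated_host default host hdd 
ceph osd pool set <pool> crush_rule replicated_host   # repeated per pool 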



Maarten van Ingen 
| Systems Expert | Distributed Data Processing | SURFsara | Science Park 140 | 
1098 XG Amsterdam | 
| T +31 (0) 20 800 1300 | maarten.vanin...@surfsara.nl | https://surfsara.nl | 



We are ISO 27001 certified and meet the high requirements for information 
security. 


From: "Paul Emmerich"  
To: "Maarten van Ingen"  
Cc: "ceph-users"  
Sent: Tuesday, 19 November, 2019 13:36:04 
Subject: Re: [ceph-users] How proceed to change a crush rule and remap pg's? 

I don't think that there's a feasible way to do this in a controlled manner. I 
would just change it and trust Ceph's remapping mechanism to work properly. 

You could use crushtool to calculate what the new mapping is and then do 
something crazy with upmaps (move them manually to the new locations one by one 
and then remove all upmaps and change the rule)... but that's quite annoying to 
do and probably doesn't really help. 

Paul 

-- 
Paul Emmerich 

Looking for help with your Ceph cluster? Contact us at https://croit.io 

croit GmbH 
Freseniusstr. 31h 
81247 München 
www.croit.io 
Tel: +49 89 1896585 90 


On Tue, Nov 19, 2019 at 11:11 AM Maarten van Ingen 
<maarten.vanin...@surfsara.nl> wrote: 



Hi, 

I have a small but impacting error in my crush rules. 
For unknown reasons the rules are not using host but osd to place the data and 
thus we have some nodes with all three copies instead of three different nodes. 
We noticed this when rebooting a node and a pg became stale. 

My crush rule: 
{ 
"rule_id": 0, 
"rule_name": "replicated_rule", 
"ruleset": 0, 
"type": 1, 
"min_size": 1, 
"max_size": 10, 
"steps": [ 
{ 
"op": "take", 
"item": -2, 
"item_name": "default~hdd" 
}, 
{ 
"op": "chooseleaf_firstn", 
"num": 0, 
"type": "osd" 
}, 
{ 
"op": "emit" 
} 
] 
}, 


The type should be host, of course, and I want to change this and move the PGs 
so that everything is placed as it should be. 
How can I best proceed in correcting this issue? I would like to throttle the 
remapping of the data so that Ceph itself won't become unavailable while the 
data is redistributed. 

We are running on Mimic (13.2.6), and this environment has been installed 
freshly as Mimic while using ceph-ansible. 

Current ceph -s output: 



cluster: 

id: < 

health: HEALTH_OK 



services: 

mon: 3 daemons, quorum mon01,mon02,mon03 

mgr: mon01(active), standbys: mon02, mon03 

mds: cephfs-2/2/2 up {0=mon03=up:active,1=mon01=up:active}, 1 up:standby 

osd: 502 osds: 502 up, 502 in 



data: 

pools: 18 pools, 8192 pgs 

objects: 28.74 M objects, 100 TiB 

usage: 331 TiB used, 2.3 PiB / 2.6 PiB avail 

pgs: 8192 active+clean 




Cheers, 

Maarten van Ingen 
| Systems Expert | Distributed Data Processing | SURFsara | Science Park 140 | 
1098 XG Amsterdam | 
| T +31 (0) 20 800 1300 | maarten.vanin...@surfsara.nl | https://surfsara.nl | 



We are ISO 27001 certified and meet the high requirements for information 
security. 
___ 
ceph-users mailing list -- ceph-users@ceph.io 
To unsubscribe send an email to ceph-users-le...@ceph.io 






smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] jewel OSDs refuse to start up again

2019-11-19 Thread Janne Johansson
Three OSDs, holding the 3 replicas of a PG here, are only half-starting, and
hence that single PG is stuck as "stale+active+clean".
All of them died of a suicide timeout while walking over a huge omap (pool 7,
'default.rgw.buckets.index') and will not bring PG 7.b back online again.

From the logs, they try to start normally, do a bit of leveldb work, replay
the journal, and then say nothing more.

2019-11-19 15:15:46.967543 7fe644fad840  0 set uid:gid to 167:167
(ceph:ceph)
2019-11-19 15:15:46.967600 7fe644fad840  0 ceph version 10.2.2
(45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-osd, pid 5149
2019-11-19 15:15:47.026065 7fe644fad840  0 pidfile_write: ignore empty
--pid-file
2019-11-19 15:15:47.078291 7fe644fad840  0
filestore(/var/lib/ceph/osd/ceph-22) backend xfs (magic 0x58465342)
2019-11-19 15:15:47.079317 7fe644fad840  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2019-11-19 15:15:47.079331 7fe644fad840  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_features:
SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2019-11-19 15:15:47.079352 7fe644fad840  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_features: splice
is supported
2019-11-19 15:15:47.080287 7fe644fad840  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2019-11-19 15:15:47.080529 7fe644fad840  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_feature: extsize is
disabled by conf
2019-11-19 15:15:47.095819 7fe644fad840  1 leveldb: Recovering log #2731809
2019-11-19 15:15:47.119792 7fe644fad840  1 leveldb: Level-0 table #2731812:
started
2019-11-19 15:15:47.132107 7fe644fad840  1 leveldb: Level-0 table #2731812:
140642 bytes OK
2019-11-19 15:15:47.143782 7fe644fad840  1 leveldb: Delete type=0 #2731809

2019-11-19 15:15:47.147198 7fe644fad840  1 leveldb: Delete type=3 #2731792

2019-11-19 15:15:47.159339 7fe644fad840  0
filestore(/var/lib/ceph/osd/ceph-22) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
2019-11-19 15:15:47.243262 7fe644fad840  1 journal _open
/var/lib/ceph/osd/ceph-22/journal fd 18: 21472739328 bytes, block size 4096
bytes, directio = 1, aio = 1

At this point they consume a ton of CPU, systemd thinks all is fine, and
this has been going on for some 5 hours.
ceph -s thinks they are down; I can't talk to the OSDs remotely from a mon,
but ceph daemon on the OSD hosts works normally, except that I can't do
anything from there other than get config or perf numbers.

Strace shows they all keep looping over the same sequence:
machine1:

stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7fffd7c98080) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head", {st_mode=S_IFDIR|0755,
st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B",
{st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7fffd7c98080) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0

machine2:

stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7ffe0b664240) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head", {st_mode=S_IFDIR|0755,
st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B",
{st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7ffe0b664240) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0

machine3:

stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7ff

[ceph-users] Re: msgr2 not used on OSDs in some Nautilus clusters

2019-11-19 Thread Bryan Stillwell
Closing the loop here.  I figured out that I missed a step during the Nautilus 
upgrade which was causing this issue:

ceph osd require-osd-release nautilus

If you don't do this your cluster will start having problems once you enable 
msgr2:

ceph mon enable-msgr2

Based on how hard this was to track down, maybe a check should be added before 
enabling msgr2 to make sure the require-osd-release is set to nautilus?
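
In the meantime, a manual check before enabling msgr2 would be something 
like:

ceph osd dump | grep require_osd_release
# should print: require_osd_release nautilus
ceph mon enable-msgr2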

Bryan

> On Nov 18, 2019, at 5:41 PM, Bryan Stillwell  wrote:
> 
> I cranked up debug_ms to 20 on two of these clusters today and I'm still not 
> understanding why some of the clusters use v2 and some just use v1.
> 
> Here's the boot/peering process for the cluster which uses v2:
> 
> 2019-11-18 16:46:03.027 7fabb6281dc0  0 osd.0 39101 done with init, starting 
> boot process
> 2019-11-18 16:46:03.028 7fabb6281dc0  1 osd.0 39101 start_boot
> 2019-11-18 16:46:03.030 7fabaebac700  5 --2- 
> [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
> [v2:10.0.32.3:6800/1473285,v1:10.0.32.3:6801/1473285] conn(0x5596b30c3000 
> 0x5596b4bf4000 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 
> tx=0).handle_hello received hello: peer_type=16 
> peer_addr_for_me=v2:10.0.32.67:51508/0
> 2019-11-18 16:46:03.034 7faba8116700  1 -- 
> [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] --> 
> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] -- osd_boot(osd.0 booted 0 
> features 4611087854031667199 v39101) v7 -- 0x5596b4bd6000 con 0x5596b3b06400
> 2019-11-18 16:46:03.034 7faba8116700  5 --2- 
> [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 
> 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 tx=0).send_message 
> enqueueing message m=0x5596b4bd6000 type=71 osd_boot(osd.0 booted 0 features 
> 4611087854031667199 v39101) v7
> 2019-11-18 16:46:03.034 7fabaf3ad700 20 --2- 
> [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 
> 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 
> tx=0).prepare_send_message m=osd_boot(osd.0 booted 0 features 
> 4611087854031667199 v39101) v7
> 2019-11-18 16:46:03.034 7fabaf3ad700 20 --2- 
> [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 
> 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 
> tx=0).prepare_send_message encoding features 4611087854031667199 
> 0x5596b4bd6000 osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7
> 2019-11-18 16:46:03.034 7fabaf3ad700  5 --2- 
> [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 
> 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 tx=0).write_message 
> sending message m=0x5596b4bd6000 seq=8 osd_boot(osd.0 booted 0 features 
> 4611087854031667199 v39101) v7
> 2019-11-18 16:46:03.352 7fab9d100700  1 osd.0 39104 state: booting -> active
> 2019-11-18 16:46:03.354 7fabaebac700  5 --2- 
> [v2:10.0.32.67:6802/258117,v1:10.0.32.67:6803/258117] >> 
> [v2:10.0.32.9:6802/3892454,v1:10.0.32.9:6803/3892454] conn(0x5596b4d68800 
> 0x5596b4bf5080 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0 
> tx=0).handle_hello received hello: peer_type=4 
> peer_addr_for_me=v2:10.0.32.67:45488/0
> 2019-11-18 16:46:03.354 7fabafbae700  5 --2- 
> [v2:10.0.32.67:6802/258117,v1:10.0.32.67:6803/258117] >> 
> [v2:10.0.32.142:6810/2881684,v1:10.0.32.142:6811/2881684] conn(0x5596b4d68000 
> 0x5596b4bf4580 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0 
> tx=0).handle_hello received hello: peer_type=4 
> peer_addr_for_me=v2:10.0.32.67:39044/0
> 2019-11-18 16:46:03.355 7fabaf3ad700  5 --2-  >> 
> [v2:10.0.32.67:6814/100535,v1:10.0.32.67:6815/100535] conn(0x5596b4d68400 
> 0x5596b4bf4b00 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 
> tx=0).handle_hello received hello: peer_type=4 
> peer_addr_for_me=v2:10.0.32.67:51558/0
> 2019-11-18 16:46:03.355 7fabaf3ad700  1 -- 10.0.32.67:0/258117 learned_addr 
> learned my addr 10.0.32.67:0/258117 (peer_addr_for_me v2:10.0.32.67:0/0)
> 2019-11-18 16:46:03.355 7fabafbae700  5 --2-  >> 
> [v2:10.0.32.67:6812/100535,v1:10.0.32.67:6813/100535] conn(0x5596b4d68c00 
> 0x5596b4bf5600 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 
> tx=0).handle_hello received hello: peer_type=4 
> peer_addr_for_me=v2:10.0.32.67:40378/0
> 2019-11-18 16:46:03.355 7fabafbae700  1 -- 10.0.32.67:0/258117 learned_addr 
> learned my addr 10.0.32.67:0/258117 (peer_addr_for_me v2:10.0.32.67:0/0)
> 
> 
> You can see at the end it learns the address to be v2:10.0.32.67:0/0, but 
> compare that to the cluster which uses v1:
> 
> 2019-11-18 16:46:05.066 7f9182d8ce00  0 osd.0 46410 done with init, starting 
> boot process
> 2019-11-18 16:46:05.066 7f9182d8ce00  1 osd.0 46410 start_boot
> 2019-11-18 16:46:05.069 7f917becf700  5 --2- 
> [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] >> 
> [v2:10.0.12.131:

[ceph-users] Re: msgr2 not used on OSDs in some Nautilus clusters

2019-11-19 Thread Paul Emmerich
There should be a warning that says something like "all OSDs are
running nautilus but require-osd-release nautilus is not set"

That warning did exist for older releases, pretty sure nautilus also has it?

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Nov 19, 2019 at 8:42 PM Bryan Stillwell  wrote:
>
> Closing the loop here.  I figured out that I missed a step during the 
> Nautilus upgrade which was causing this issue:
>
> ceph osd require-osd-release nautilus
>
> If you don't do this your cluster will start having problems once you enable 
> msgr2:
>
> ceph mon enable-msgr2
>
> Based on how hard this was to track down, maybe a check should be added 
> before enabling msgr2 to make sure the require-osd-release is set to nautilus?
>
> Bryan
>
> > On Nov 18, 2019, at 5:41 PM, Bryan Stillwell  wrote:
> >
> > I cranked up debug_ms to 20 on two of these clusters today and I'm still 
> > not understanding why some of the clusters use v2 and some just use v1.
> >
> > Here's the boot/peering process for the cluster which uses v2:
> >
> > 2019-11-18 16:46:03.027 7fabb6281dc0  0 osd.0 39101 done with init, 
> > starting boot process
> > 2019-11-18 16:46:03.028 7fabb6281dc0  1 osd.0 39101 start_boot
> > 2019-11-18 16:46:03.030 7fabaebac700  5 --2- 
> > [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
> > [v2:10.0.32.3:6800/1473285,v1:10.0.32.3:6801/1473285] conn(0x5596b30c3000 
> > 0x5596b4bf4000 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 
> > tx=0).handle_hello received hello: peer_type=16 
> > peer_addr_for_me=v2:10.0.32.67:51508/0
> > 2019-11-18 16:46:03.034 7faba8116700  1 -- 
> > [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] --> 
> > [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] -- osd_boot(osd.0 booted 0 
> > features 4611087854031667199 v39101) v7 -- 0x5596b4bd6000 con 0x5596b3b06400
> > 2019-11-18 16:46:03.034 7faba8116700  5 --2- 
> > [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
> > [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 
> > 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 
> > tx=0).send_message enqueueing message m=0x5596b4bd6000 type=71 
> > osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7
> > 2019-11-18 16:46:03.034 7fabaf3ad700 20 --2- 
> > [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
> > [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 
> > 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 
> > tx=0).prepare_send_message m=osd_boot(osd.0 booted 0 features 
> > 4611087854031667199 v39101) v7
> > 2019-11-18 16:46:03.034 7fabaf3ad700 20 --2- 
> > [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
> > [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 
> > 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 
> > tx=0).prepare_send_message encoding features 4611087854031667199 
> > 0x5596b4bd6000 osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) 
> > v7
> > 2019-11-18 16:46:03.034 7fabaf3ad700  5 --2- 
> > [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
> > [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 
> > 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 
> > tx=0).write_message sending message m=0x5596b4bd6000 seq=8 osd_boot(osd.0 
> > booted 0 features 4611087854031667199 v39101) v7
> > 2019-11-18 16:46:03.352 7fab9d100700  1 osd.0 39104 state: booting -> active
> > 2019-11-18 16:46:03.354 7fabaebac700  5 --2- 
> > [v2:10.0.32.67:6802/258117,v1:10.0.32.67:6803/258117] >> 
> > [v2:10.0.32.9:6802/3892454,v1:10.0.32.9:6803/3892454] conn(0x5596b4d68800 
> > 0x5596b4bf5080 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0 
> > tx=0).handle_hello received hello: peer_type=4 
> > peer_addr_for_me=v2:10.0.32.67:45488/0
> > 2019-11-18 16:46:03.354 7fabafbae700  5 --2- 
> > [v2:10.0.32.67:6802/258117,v1:10.0.32.67:6803/258117] >> 
> > [v2:10.0.32.142:6810/2881684,v1:10.0.32.142:6811/2881684] 
> > conn(0x5596b4d68000 0x5596b4bf4580 unknown :-1 s=HELLO_CONNECTING pgs=0 
> > cs=0 l=0 rx=0 tx=0).handle_hello received hello: peer_type=4 
> > peer_addr_for_me=v2:10.0.32.67:39044/0
> > 2019-11-18 16:46:03.355 7fabaf3ad700  5 --2-  >> 
> > [v2:10.0.32.67:6814/100535,v1:10.0.32.67:6815/100535] conn(0x5596b4d68400 
> > 0x5596b4bf4b00 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 
> > tx=0).handle_hello received hello: peer_type=4 
> > peer_addr_for_me=v2:10.0.32.67:51558/0
> > 2019-11-18 16:46:03.355 7fabaf3ad700  1 -- 10.0.32.67:0/258117 learned_addr 
> > learned my addr 10.0.32.67:0/258117 (peer_addr_for_me v2:10.0.32.67:0/0)
> > 2019-11-18 16:46:03.355 7fabafbae700  5 --2-  >> 
> > [v2:10.0.32.67:6812/100535,v1:10.0.32.67:6813/100535] conn(0x5596b4d68c00 
> > 0x5596b4bf5600 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 
> > tx=0).handle_hello received hello: peer_type=4 
> > peer

[ceph-users] Re: msgr2 not used on OSDs in some Nautilus clusters

2019-11-19 Thread Bryan Stillwell
I know I've seen that warning before, but for some reason it wasn't alerting on 
these clusters which were upgraded to 14.2.2 first and then to 14.2.4.

Bryan

> On Nov 19, 2019, at 3:20 PM, Paul Emmerich  wrote:
> 
> Notice: This email is from an external sender.
> 
> 
> 
> There should be a warning that says something like "all OSDs are
> running nautilus but require-osd-release nautilus is not set"
> 
> That warning did exist for older releases, pretty sure nautilus also has it?
> 
> Paul
> 
> --
> Paul Emmerich
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
> 
> On Tue, Nov 19, 2019 at 8:42 PM Bryan Stillwell  
> wrote:
>> 
>> Closing the loop here.  I figured out that I missed a step during the 
>> Nautilus upgrade which was causing this issue:
>> 
>> ceph osd require-osd-release nautilus
>> 
>> If you don't do this your cluster will start having problems once you enable 
>> msgr2:
>> 
>> ceph mon enable-msgr2
>> 
>> Based on how hard this was to track down, maybe a check should be added 
>> before enabling msgr2 to make sure the require-osd-release is set to 
>> nautilus?
>> 
>> Bryan
>> 
>>> On Nov 18, 2019, at 5:41 PM, Bryan Stillwell  wrote:
>>> 
>>> I cranked up debug_ms to 20 on two of these clusters today and I'm still 
>>> not understanding why some of the clusters use v2 and some just use v1.
>>> 
>>> Here's the boot/peering process for the cluster which uses v2:
>>> 
>>> 2019-11-18 16:46:03.027 7fabb6281dc0  0 osd.0 39101 done with init, 
>>> starting boot process
>>> 2019-11-18 16:46:03.028 7fabb6281dc0  1 osd.0 39101 start_boot
>>> 2019-11-18 16:46:03.030 7fabaebac700  5 --2- 
>>> [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
>>> [v2:10.0.32.3:6800/1473285,v1:10.0.32.3:6801/1473285] conn(0x5596b30c3000 
>>> 0x5596b4bf4000 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 
>>> tx=0).handle_hello received hello: peer_type=16 
>>> peer_addr_for_me=v2:10.0.32.67:51508/0
>>> 2019-11-18 16:46:03.034 7faba8116700  1 -- 
>>> [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] --> 
>>> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] -- osd_boot(osd.0 booted 0 
>>> features 4611087854031667199 v39101) v7 -- 0x5596b4bd6000 con 0x5596b3b06400
>>> 2019-11-18 16:46:03.034 7faba8116700  5 --2- 
>>> [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
>>> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 
>>> 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 
>>> tx=0).send_message enqueueing message m=0x5596b4bd6000 type=71 
>>> osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7
>>> 2019-11-18 16:46:03.034 7fabaf3ad700 20 --2- 
>>> [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
>>> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 
>>> 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 
>>> tx=0).prepare_send_message m=osd_boot(osd.0 booted 0 features 
>>> 4611087854031667199 v39101) v7
>>> 2019-11-18 16:46:03.034 7fabaf3ad700 20 --2- 
>>> [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
>>> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 
>>> 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 
>>> tx=0).prepare_send_message encoding features 4611087854031667199 
>>> 0x5596b4bd6000 osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) 
>>> v7
>>> 2019-11-18 16:46:03.034 7fabaf3ad700  5 --2- 
>>> [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
>>> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 
>>> 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 
>>> tx=0).write_message sending message m=0x5596b4bd6000 seq=8 osd_boot(osd.0 
>>> booted 0 features 4611087854031667199 v39101) v7
>>> 2019-11-18 16:46:03.352 7fab9d100700  1 osd.0 39104 state: booting -> active
>>> 2019-11-18 16:46:03.354 7fabaebac700  5 --2- 
>>> [v2:10.0.32.67:6802/258117,v1:10.0.32.67:6803/258117] >> 
>>> [v2:10.0.32.9:6802/3892454,v1:10.0.32.9:6803/3892454] conn(0x5596b4d68800 
>>> 0x5596b4bf5080 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0 
>>> tx=0).handle_hello received hello: peer_type=4 
>>> peer_addr_for_me=v2:10.0.32.67:45488/0
>>> 2019-11-18 16:46:03.354 7fabafbae700  5 --2- 
>>> [v2:10.0.32.67:6802/258117,v1:10.0.32.67:6803/258117] >> 
>>> [v2:10.0.32.142:6810/2881684,v1:10.0.32.142:6811/2881684] 
>>> conn(0x5596b4d68000 0x5596b4bf4580 unknown :-1 s=HELLO_CONNECTING pgs=0 
>>> cs=0 l=0 rx=0 tx=0).handle_hello received hello: peer_type=4 
>>> peer_addr_for_me=v2:10.0.32.67:39044/0
>>> 2019-11-18 16:46:03.355 7fabaf3ad700  5 --2-  >> 
>>> [v2:10.0.32.67:6814/100535,v1:10.0.32.67:6815/100535] conn(0x5596b4d68400 
>>> 0x5596b4bf4b00 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 
>>> tx=0).handle_hello received hello: peer_type=4 
>>> peer_addr_for_me=v2:10.0.32.67:51558/0
>>> 2019-11-18 16:46:03.355 7fabaf3ad700  1 -- 10.0.32.67:0/258117 learned_add

[ceph-users] mgr hangs with upmap balancer

2019-11-19 Thread Bryan Stillwell
On multiple clusters we are seeing the mgr hang frequently when the balancer is 
enabled.  It seems that the balancer is getting caught in some kind of infinite 
loop that chews up all the CPU for the mgr, which causes problems with other 
modules like prometheus (we don't have the devicehealth module enabled yet).

I've been able to reproduce the issue doing an offline balance as well using 
the osdmaptool:

osdmaptool --debug-osd 10 osd.map --upmap balance-upmaps.sh --upmap-pool 
default.rgw.buckets.data --upmap-max 100
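
(For anyone wanting to reproduce this: an osdmap can be extracted from a live 
cluster with

ceph osd getmap -o osd.map

and then fed to osdmaptool as above.)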

It seems to loop over the same group of ~7,000 PGs over and over again like 
this without finding any new upmaps that can be added:

2019-11-19 16:39:11.131518 7f85a156f300 10  trying 24.d91
2019-11-19 16:39:11.138035 7f85a156f300 10  trying 24.2e3c
2019-11-19 16:39:11.144162 7f85a156f300 10  trying 24.176b
2019-11-19 16:39:11.149671 7f85a156f300 10  trying 24.ac6
2019-11-19 16:39:11.155115 7f85a156f300 10  trying 24.2cb2
2019-11-19 16:39:11.160508 7f85a156f300 10  trying 24.129c
2019-11-19 16:39:11.166287 7f85a156f300 10  trying 24.181f
2019-11-19 16:39:11.171737 7f85a156f300 10  trying 24.3cb1
2019-11-19 16:39:11.177260 7f85a156f300 10  24.2177 already has pg_upmap_items 
[368,271]
2019-11-19 16:39:11.177268 7f85a156f300 10  trying 24.2177
2019-11-19 16:39:11.182590 7f85a156f300 10  trying 24.a4
2019-11-19 16:39:11.188053 7f85a156f300 10  trying 24.2583
2019-11-19 16:39:11.193545 7f85a156f300 10  24.93e already has pg_upmap_items 
[80,27]
2019-11-19 16:39:11.193553 7f85a156f300 10  trying 24.93e
2019-11-19 16:39:11.198858 7f85a156f300 10  trying 24.e67
2019-11-19 16:39:11.204224 7f85a156f300 10  trying 24.16d9
2019-11-19 16:39:11.209844 7f85a156f300 10  trying 24.11dc
2019-11-19 16:39:11.215303 7f85a156f300 10  trying 24.1f3d
2019-11-19 16:39:11.221074 7f85a156f300 10  trying 24.2a57


While this cluster is running Luminous (12.2.12), I've reproduced the loop 
using the same osdmap on Nautilus (14.2.4).  Is there somewhere I can privately 
upload the osdmap for someone to troubleshoot the problem?

Thanks,
Bryan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io