Looks like the cluster is overloaded and the OSDs are running into heartbeat timeouts. For a test/dev environment: try setting the nodown flag for this experiment if you just want to ignore these timeouts completely.
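A minimal sketch of that, assuming you run it from a node that has the client.admin keyring (the flag should work the same way on Hammer):

    ceph osd set nodown     # monitors stop marking reported OSDs as down
    # ... increase pgp_num and let the recovery experiment finish ...
    ceph osd unset nodown   # restore normal down handling afterwards

Keep in mind that with nodown set, genuinely dead OSDs also won't be marked down, so only leave it in place for the duration of the test.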
Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jun 26, 2019 at 1:26 PM zhanrzh...@teamsun.com.cn <zhanrzh...@teamsun.com.cn> wrote:
> Hi all,
> I started a Ceph cluster on my machine in development mode to estimate
> the recovery time after increasing pgp_num.
> All daemons run on a single machine:
> CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
> Memory: 377 GB
> OS: CentOS Linux release 7.6.1810
> Ceph version: Hammer
>
> I built Ceph according to
> http://docs.ceph.com/docs/hammer/dev/quick_guide/.
> ceph -s shows:
>     cluster 15ec2f3f-86e5-46bc-bf98-4b35841ee6a5
>      health HEALTH_WARN
>             pool rbd pg_num 512 > pgp_num 256
>      monmap e1: 1 mons at {a=172.30.250.25:6789/0}
>             election epoch 2, quorum 0 a
>      osdmap e88: 30 osds: 30 up, 30 in
>       pgmap v829: 512 pgs, 1 pools, 57812 MB data, 14454 objects
>             5691 GB used, 27791 GB / 33483 GB avail
>                  512 active+clean
> and ceph osd tree [3].
> Recovery started after I increased pgp_num. ceph -w shows some OSDs going
> down, but the process keeps running. All OSD and mon configuration items
> are at their defaults [1].
> Some of the messages from ceph -w [2] are shown below:
>
> 2019-06-26 15:03:21.839750 mon.0 [INF] pgmap v842: 512 pgs: 127 active+degraded, 84 activating+degraded, 256 active+clean, 45 active+recovering+degraded; 57812 MB data, 5714 GB used, 27769 GB / 33483 GB avail; 22200/43362 objects degraded (51.197%); 50789 kB/s, 12 objects/s recovering
> 2019-06-26 15:03:21.840884 mon.0 [INF] osd.1 172.30.250.25:6804/22500 failed (3 reports from 3 peers after 24.867116 >= grace 20.000000)
> 2019-06-26 15:03:21.841459 mon.0 [INF] osd.9 172.30.250.25:6836/25078 failed (3 reports from 3 peers after 24.867645 >= grace 20.000000)
> 2019-06-26 15:03:21.841709 mon.0 [INF] osd.0 172.30.250.25:6800/22260 failed (3 reports from 3 peers after 24.846423 >= grace 20.000000)
> 2019-06-26 15:03:21.842286 mon.0 [INF] osd.13 172.30.250.25:6852/26651 failed (3 reports from 3 peers after 24.846896 >= grace 20.000000)
> 2019-06-26 15:03:21.842607 mon.0 [INF] osd.5 172.30.250.25:6820/23661 failed (3 reports from 3 peers after 24.804869 >= grace 20.000000)
> 2019-06-26 15:03:21.842938 mon.0 [INF] osd.10 172.30.250.25:6840/25490 failed (3 reports from 3 peers after 24.805155 >= grace 20.000000)
> 2019-06-26 15:03:21.843134 mon.0 [INF] osd.12 172.30.250.25:6848/26277 failed (3 reports from 3 peers after 24.805329 >= grace 20.000000)
> 2019-06-26 15:03:21.843591 mon.0 [INF] osd.8 172.30.250.25:6832/24722 failed (3 reports from 3 peers after 24.805843 >= grace 20.000000)
> 2019-06-26 15:03:21.849664 mon.0 [INF] osd.21 172.30.250.25:6884/29762 failed (3 reports from 3 peers after 23.497080 >= grace 20.000000)
> 2019-06-26 15:03:21.862729 mon.0 [INF] osd.14 172.30.250.25:6856/27025 failed (3 reports from 3 peers after 23.510172 >= grace 20.000000)
> 2019-06-26 15:03:21.864222 mon.0 [INF] osdmap e91: 30 osds: 29 up, 30 in
> 2019-06-26 15:03:20.336758 osd.11 [WRN] map e91 wrongly marked me down
> 2019-06-26 15:03:23.408659 mon.0 [INF] pgmap v843: 512 pgs: 8 stale+activating+degraded, 8 stale+active+clean, 161 active+degraded, 2 stale+active+recovering+degraded, 33 activating+degraded, 248 active+clean, 45 active+recovering+degraded, 7 stale+active+degraded; 57812 MB data, 5730 GB used, 27752 GB / 33483 GB avail; 27317/43362 objects degraded (62.998%); 61309 kB/s, 14 objects/s recovering
> 2019-06-26 15:03:27.538229 mon.0 [INF] osd.18 172.30.250.25:6872/28632 failed (3 reports from 3 peers after 23.180489 >= grace 20.000000)
> 2019-06-26 15:03:27.539416 mon.0 [INF] osd.7 172.30.250.25:6828/24366 failed (3 reports from 3 peers after 21.900054 >= grace 20.000000)
> 2019-06-26 15:03:27.541831 mon.0 [INF] osdmap e92: 30 osds: 19 up, 30 in
> 2019-06-26 15:03:32.748179 mon.0 [INF] osdmap e93: 30 osds: 17 up, 30 in
> 2019-06-26 15:03:33.678682 mon.0 [INF] pgmap v845: 512 pgs: 17 stale+activating+degraded, 95 stale+active+clean, 55 active+degraded, 13 peering, 18 stale+active+recovering+degraded, 20 activating+degraded, 155 active+clean, 22 active+recovery_wait+degraded, 48 active+recovering+degraded, 69 stale+active+degraded; 57812 MB data, 5734 GB used, 27748 GB / 33483 GB avail; 26979/43362 objects degraded (62.218%); 11510 kB/s, 2 objects/s recovering
> 2019-06-26 15:03:33.775701 osd.1 [WRN] map e92 wrongly marked me down
>
> Has anyone got any thoughts on what might have happened, or tips on how
> to dig further into this?
>
> [1] https://github.com/rongzhen-zhan/myfile/blob/master/osd.0.conf
> [2] https://github.com/rongzhen-zhan/myfile/blob/master/ceph-watch.txt
> [3] https://github.com/rongzhen-zhan/myfile/blob/master/ceph%20osd%20tree