Is your time synchronized across the nodes?

Best regards, Irek Fasikhov (Фасихов Ирек Нургаязович)
Mob.: +79229045757
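A quick way to check this, as a minimal sketch: it assumes classic ntpd (not chrony) and ssh access to each node, with the host names taken from the osd tree quoted below.

    # Do the monitors already complain about skew? (The exact wording
    # of the warning varies between Ceph releases.)
    ceph health detail | grep -i clock

    # Compare NTP peer offsets on every node; they should agree to
    # within a few milliseconds.
    for host in slpeah001 slpeah002 slpeah007 slpeah008; do
        echo "== $host =="
        ssh "$host" ntpq -p
    done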
2015-11-27 15:57 GMT+03:00 Vasiliy Angapov <anga...@gmail.com>:
> > It seems that you played around with the crushmap and did something
> > wrong. Compare the output of 'ceph osd tree' with the crushmap: some
> > 'osd' entries have been replaced by 'device' placeholders. I think
> > that is your problem.
> Is this actually a mistake? What I did was remove a bunch of OSDs from
> my cluster, which is why the numbering is sparse. But is it a problem
> to have sparse OSD numbering?
>
> > Hi.
> > Vasiliy, yes, it is a problem with the crushmap. Look at the weights:
> > -3 14.56000     host slpeah001
> > -2 14.56000     host slpeah002
> What exactly is wrong here?
>
> I also found out that my OSD logs are full of records like these:
> 2015-11-26 08:31:19.273268 7fe4f49b1700  0 cephx: verify_authorizer
> could not get service secret for service osd secret_id=2924
> 2015-11-26 08:31:19.273276 7fe4f49b1700  0 --
> 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000
> sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a520).accept: got bad
> authorizer
> 2015-11-26 08:31:24.273207 7fe4f49b1700  0 auth: could not find
> secret_id=2924
> 2015-11-26 08:31:24.273225 7fe4f49b1700  0 cephx: verify_authorizer
> could not get service secret for service osd secret_id=2924
> 2015-11-26 08:31:24.273231 7fe4f49b1700  0 --
> 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x3f90b000
> sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a3c0).accept: got bad
> authorizer
> [... the same three lines repeat every 5 seconds with the same
> secret_id=2924, through 2015-11-26 08:31:54 ...]
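For context on those cephx messages: "could not find secret_id=NNNN" means the receiving daemon no longer holds the rotating service key the peer presented, which most often traces back to clock drift or to a daemon stuck on stale rotating keys. A minimal sketch of how one might narrow it down; this assumes Hammer-era sysvinit scripts, and osd.NN is a placeholder since the log excerpt does not show which OSD id lives on 192.168.254.18.

    # On the node printing the errors (192.168.254.18 above), compare
    # its wall clock against the peer from the log (192.168.254.12);
    # more than a few seconds of drift is suspect.
    date -u
    ssh 192.168.254.12 date -u

    # Restarting the affected OSD forces it to fetch fresh rotating
    # keys from the monitors (sysvinit syntax; adjust for systemd).
    /etc/init.d/ceph restart osd.NN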
> What does it mean? Google says it might be a time sync issue, but my
> clocks are perfectly synchronized...
>
> 2015-11-26 21:05 GMT+08:00 Irek Fasikhov <malm...@gmail.com>:
> > Hi.
> > Vasiliy, yes, it is a problem with the crushmap. Look at the weights:
> > " -3 14.56000     host slpeah001
> >   -2 14.56000     host slpeah002 "
> >
> > Best regards, Irek Fasikhov (Фасихов Ирек Нургаязович)
> > Mob.: +79229045757
> >
> > 2015-11-26 13:16 GMT+03:00 ЦИТ РТ-Курамшин Камиль Фидаилевич
> > <kamil.kurams...@tatar.ru>:
> >> It seems that you played around with the crushmap and did something
> >> wrong. Compare the output of 'ceph osd tree' with the crushmap: some
> >> 'osd' entries have been replaced by 'device' placeholders. I think
> >> that is your problem.
> >>
> >> Sent from a mobile device.
> >>
> >> -----Original Message-----
> >> From: Vasiliy Angapov <anga...@gmail.com>
> >> To: ceph-users <ceph-users@lists.ceph.com>
> >> Sent: Thu, 26 Nov 2015 7:53
> >> Subject: [ceph-users] Undersized pgs problem
> >>
> >> Hi, colleagues!
> >>
> >> I have a small 4-node Ceph cluster (0.94.2); all pools have size 3,
> >> min_size 1.
> >> Last night one host failed, and the cluster was unable to rebalance,
> >> saying there are a lot of undersized pgs.
> >>
> >> root@slpeah002:[~]:# ceph -s
> >>     cluster 78eef61a-3e9c-447c-a3ec-ce84c617d728
> >>      health HEALTH_WARN
> >>             1486 pgs degraded
> >>             1486 pgs stuck degraded
> >>             2257 pgs stuck unclean
> >>             1486 pgs stuck undersized
> >>             1486 pgs undersized
> >>             recovery 80429/555185 objects degraded (14.487%)
> >>             recovery 40079/555185 objects misplaced (7.219%)
> >>             4/20 in osds are down
> >>             1 mons down, quorum 1,2 slpeah002,slpeah007
> >>      monmap e7: 3 mons at
> >> {slpeah001=192.168.254.11:6780/0,slpeah002=192.168.254.12:6780/0,slpeah007=172.31.252.46:6789/0}
> >>             election epoch 710, quorum 1,2 slpeah002,slpeah007
> >>      osdmap e14062: 20 osds: 16 up, 20 in; 771 remapped pgs
> >>       pgmap v7021316: 4160 pgs, 5 pools, 1045 GB data, 180 kobjects
> >>             3366 GB used, 93471 GB / 96838 GB avail
> >>             80429/555185 objects degraded (14.487%)
> >>             40079/555185 objects misplaced (7.219%)
> >>                 1903 active+clean
> >>                 1486 active+undersized+degraded
> >>                  771 active+remapped
> >>   client io 0 B/s rd, 246 kB/s wr, 67 op/s
> >>
> >> root@slpeah002:[~]:# ceph osd tree
> >> ID  WEIGHT   TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >>  -1 94.63998 root default
> >>  -9 32.75999     host slpeah007
> >>  72  5.45999         osd.72          up  1.00000          1.00000
> >>  73  5.45999         osd.73          up  1.00000          1.00000
> >>  74  5.45999         osd.74          up  1.00000          1.00000
> >>  75  5.45999         osd.75          up  1.00000          1.00000
> >>  76  5.45999         osd.76          up  1.00000          1.00000
> >>  77  5.45999         osd.77          up  1.00000          1.00000
> >> -10 32.75999     host slpeah008
> >>  78  5.45999         osd.78          up  1.00000          1.00000
> >>  79  5.45999         osd.79          up  1.00000          1.00000
> >>  80  5.45999         osd.80          up  1.00000          1.00000
> >>  81  5.45999         osd.81          up  1.00000          1.00000
> >>  82  5.45999         osd.82          up  1.00000          1.00000
> >>  83  5.45999         osd.83          up  1.00000          1.00000
> >>  -3 14.56000     host slpeah001
> >>   1  3.64000         osd.1         down  1.00000          1.00000
> >>  33  3.64000         osd.33        down  1.00000          1.00000
> >>  34  3.64000         osd.34        down  1.00000          1.00000
> >>  35  3.64000         osd.35        down  1.00000          1.00000
> >>  -2 14.56000     host slpeah002
> >>   0  3.64000         osd.0           up  1.00000          1.00000
> >>  36  3.64000         osd.36          up  1.00000          1.00000
> >>  37  3.64000         osd.37          up  1.00000          1.00000
> >>  38  3.64000         osd.38          up  1.00000          1.00000
> >>
> >> Crushmap:
> >>
> >> # begin crush map
> >> tunable choose_local_tries 0
> >> tunable choose_local_fallback_tries 0
> >> tunable choose_total_tries 50
> >> tunable chooseleaf_descend_once 1
> >> tunable chooseleaf_vary_r 1
> >> tunable straw_calc_version 1
> >> tunable allowed_bucket_algs 54
> >>
> >> # devices
> >> device 0 osd.0
> >> device 1 osd.1
> >> device 2 device2
> >> [... devices 3 through 32 are likewise "device N deviceN"
> >> placeholders ...]
> >> device 33 osd.33
> >> device 34 osd.34
> >> device 35 osd.35
> >> device 36 osd.36
> >> device 37 osd.37
> >> device 38 osd.38
> >> device 39 device39
> >> [... devices 40 through 71 are likewise "device N deviceN"
> >> placeholders ...]
> >> device 72 osd.72
> >> device 73 osd.73
> >> device 74 osd.74
> >> device 75 osd.75
> >> device 76 osd.76
> >> device 77 osd.77
> >> device 78 osd.78
> >> device 79 osd.79
> >> device 80 osd.80
> >> device 81 osd.81
> >> device 82 osd.82
> >> device 83 osd.83
> >>
> >> # types
> >> type 0 osd
> >> type 1 host
> >> type 2 chassis
> >> type 3 rack
> >> type 4 row
> >> type 5 pdu
> >> type 6 pod
> >> type 7 room
> >> type 8 datacenter
> >> type 9 region
> >> type 10 root
> >>
> >> # buckets
> >> host slpeah007 {
> >>         id -9   # do not change unnecessarily
> >>         # weight 32.760
> >>         alg straw
> >>         hash 0  # rjenkins1
> >>         item osd.72 weight 5.460
> >>         item osd.73 weight 5.460
> >>         item osd.74 weight 5.460
> >>         item osd.75 weight 5.460
> >>         item osd.76 weight 5.460
> >>         item osd.77 weight 5.460
> >> }
> >> host slpeah008 {
> >>         id -10  # do not change unnecessarily
> >>         # weight 32.760
> >>         alg straw
> >>         hash 0  # rjenkins1
> >>         item osd.78 weight 5.460
> >>         item osd.79 weight 5.460
> >>         item osd.80 weight 5.460
> >>         item osd.81 weight 5.460
> >>         item osd.82 weight 5.460
> >>         item osd.83 weight 5.460
> >> }
> >> host slpeah001 {
> >>         id -3   # do not change unnecessarily
> >>         # weight 14.560
> >>         alg straw
> >>         hash 0  # rjenkins1
> >>         item osd.1 weight 3.640
> >>         item osd.33 weight 3.640
> >>         item osd.34 weight 3.640
> >>         item osd.35 weight 3.640
> >> }
> >> host slpeah002 {
> >>         id -2   # do not change unnecessarily
> >>         # weight 14.560
> >>         alg straw
> >>         hash 0  # rjenkins1
> >>         item osd.0 weight 3.640
> >>         item osd.36 weight 3.640
> >>         item osd.37 weight 3.640
> >>         item osd.38 weight 3.640
> >> }
> >> root default {
> >>         id -1   # do not change unnecessarily
> >>         # weight 94.640
> >>         alg straw
> >>         hash 0  # rjenkins1
> >>         item slpeah007 weight 32.760
> >>         item slpeah008 weight 32.760
> >>         item slpeah001 weight 14.560
> >>         item slpeah002 weight 14.560
> >> }
> >>
> >> # rules
> >> rule default {
> >>         ruleset 0
> >>         type replicated
> >>         min_size 1
> >>         max_size 10
> >>         step take default
> >>         step chooseleaf firstn 0 type host
> >>         step emit
> >> }
> >>
> >> # end crush map
> >>
> >> This is odd because the pools have size 3 and I have three hosts
> >> alive, so why is it saying that undersized pgs are present? It makes
> >> me feel like CRUSH is not working properly.
> >> There is not much data in the cluster at the moment, about 3 TB, and
> >> as you can see from the osd tree each host has at least 14 TB of
> >> disk space on its OSDs.
> >> So I'm a bit stuck now...
> >> How can I find the source of the trouble?
> >>
> >> Thanks in advance!
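Two notes on the quoted output, as a rough sketch rather than a definitive diagnosis. The "deviceN" lines are just the placeholders the decompiler emits for OSD ids that no longer exist, so sparse numbering by itself is normal; you can re-derive that view from the live map at any time. More importantly, the four failed OSDs on slpeah001 are down but still "in" ("4/20 in osds are down", REWEIGHT 1.00000), and PGs mapped to a down+in OSD stay undersized until it is marked out. Standard ceph/crushtool commands below; the temp paths are arbitrary.

    # Fetch and decompile the installed crushmap for a side-by-side
    # comparison with 'ceph osd tree'.
    ceph osd getcrushmap -o /tmp/cm.bin
    crushtool -d /tmp/cm.bin -o /tmp/cm.txt

    # Mark the failed OSDs out so CRUSH stops counting them and
    # recovery re-replicates their PGs onto the surviving hosts.
    for id in 1 33 34 35; do
        ceph osd out "$id"
    done

    # Watch the undersized/degraded counts fall as backfill proceeds.
    ceph -s
    ceph pg dump_stuck unclean | head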
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com