I've been struggling with a broken ceph node, and I have very limited ceph
knowledge. With only 3-4 days of actual hands-on experience, I was tasked
with upgrading it. Everything seemed to go fine at first, but it didn't last.

The next day I was informed that people were unable to create volumes (we
had successfully created a volume immediately after the upgrade, but it no
longer worked). After some investigation, I discovered that 'rados -p
volumes ls' just hangs. Another pool (images) behaves the same way. The
rest don't seem to have any issues.
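
In case it helps, something like this should show which pools hang (just a
sketch; the 10-second timeout is arbitrary and only there so the hung pools
don't block the loop):

root@CTR01:~# for pool in $(rados lspools); do
>   printf '%s: ' "$pool"
>   timeout 10 rados -p "$pool" ls > /dev/null && echo ok || echo 'hangs/fails'
> done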

We are running 6 ceph servers with 72 OSDs. Here is what 'ceph status'
currently reports:

root@CTR01:~# ceph -s
    cluster c14740db-4771-4f95-8268-689bba5598eb
     health HEALTH_WARN
            1538 pgs stale
            282 pgs stuck inactive
            1538 pgs stuck stale
            282 pgs stuck unclean
            too many PGs per OSD (747 > max 300)
     monmap e1: 3 mons at {Ceph02=
192.168.0.12:6789/0,ceph04=192.168.90.14:6789/0,Ceph06=192.168.0.16:6789/0}
            election epoch 3066, quorum 0,1,2 Ceph02,Ceph04,Ceph06
     osdmap e1325: 72 osds: 72 up, 72 in
      pgmap v2515322: 18232 pgs, 19 pools, 1042 GB data, 437 kobjects
            3143 GB used, 127 TB / 130 TB avail
               16412 active+clean
                1538 stale+active+clean
                 282 creating

Some notes:

1538 stale+active+clean -
Most of these (roughly 1250-1350) were left over from the initial
installation and weren't actually being used by the system. I inherited the
cluster with them and was told nobody knew how to get rid of them;
apparently they were part of a ceph false start.
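
For what it's worth, 'ceph pg dump_stuck stale' lists them. If the pools
they belong to really are unused leftovers from that false start, I assume
the whole pools could simply be deleted, which should take the stale PGs
with them (I haven't dared to try; the pool name below is only a
placeholder):

root@CTR01:~# ceph pg dump_stuck stale
root@CTR01:~# ceph osd pool delete <leftover-pool> <leftover-pool> --yes-i-really-really-mean-it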

282 creating -
While I was looking at the issue, I noticed a 'ceph -s' warning about
another pool (one we use for swift). It complained about too few PGs per
OSD, so I increased pg_num and pgp_num from 1024 to 2048, hoping the two
problems were related. I think that's what added the 'creating' status line
(they're also all in 19.xx - is that osd.19?).
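
The change itself was along these lines (the pool name here is a stand-in,
I don't have the real one in front of me):

root@CTR01:~# ceph osd pool get <swift-pool> pg_num
root@CTR01:~# ceph osd pool set <swift-pool> pg_num 2048
root@CTR01:~# ceph osd pool set <swift-pool> pgp_num 2048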


root@MUC1-Tab-CTR01:~# ceph health detail | grep unclean
HEALTH_WARN 1538 pgs stale; 282 pgs stuck inactive; 1538 pgs stuck stale;
282 pgs stuck unclean; too many PGs per OSD (747 > max 300)
pg 19.5b1 is stuck unclean since forever, current state creating, last
acting []
pg 19.c5 is stuck unclean since forever, current state creating, last
acting []
pg 19.c6 is stuck unclean since forever, current state creating, last
acting []
pg 19.c0 is stuck unclean since forever, current state creating, last
acting []
pg 19.c2 is stuck unclean since forever, current state creating, last
acting []
pg 19.726 is stuck unclean since forever, current state creating, last
acting []
pg 19.727 is stuck unclean since forever, current state creating, last
acting []
pg 19.412 is stuck unclean since forever, current state creating, last
acting []
.
.
.
pg 19.26c is stuck unclean since forever, current state creating, last
acting []
pg 19.5be is stuck unclean since forever, current state creating, last
acting []
pg 19.264 is stuck unclean since forever, current state creating, last
acting []
pg 19.5b4 is stuck unclean since forever, current state creating, last
acting []
pg 19.260 is stuck unclean since forever, current state creating, last
acting []

Looking at the osd.19 log, I see the same messages that show up in osd.20.log:

root@Ceph02:~# tail -10 /var/log/ceph/ceph-osd.19.log
2016-11-12 02:18:36.047803 7f973fe58700  0 -- 192.168.92.12:6818/4289 >>
192.168.92.195:6814/4099 pipe(0xc057000 sd=87 :57536 s=2 pgs=1039 cs=21 l=0
c=0xa3a34a0).fault with nothing to send, going to standby
2016-11-12 02:22:49.242045 7f974045e700  0 -- 192.168.92.12:6818/4289 >>
192.168.92.193:6812/4067 pipe(0xa402000 sd=25 :48529 s=2 pgs=986 cs=21 l=0
c=0xa3a5b20).fault with nothing to send, going to standby
2016-11-12 02:22:49.244093 7f973e741700  0 -- 192.168.92.12:6818/4289 >>
192.168.92.196:6810/4118 pipe(0xba4e000 sd=51 :50137 s=2 pgs=933 cs=35 l=0
c=0xb7af760).fault with nothing to send, going to standby
2016-11-12 02:25:20.699763 7f97383e5700  0 -- 192.168.92.12:6818/4289 >>
192.168.92.194:6806/4108 pipe(0xba76000 sd=134 :6818 s=2 pgs=972 cs=21 l=0
c=0xb7afb80).fault with nothing to send, going to standby
2016-11-12 02:28:02.526393 7f9720669700  0 -- 192.168.92.12:6818/4289 >>
192.168.92.193:6806/3964 pipe(0xbb54000 sd=210 :6818 s=0 pgs=0 cs=0 l=0
c=0xc5bc840).accept connect_seq 41 vs existing 41 state standby
2016-11-12 02:28:02.526750 7f9720669700  0 -- 192.168.92.12:6818/4289 >>
192.168.92.193:6806/3964 pipe(0xbb54000 sd=210 :6818 s=0 pgs=0 cs=0 l=0
c=0xc5bc840).accept connect_seq 42 vs existing 41 state standby
2016-11-12 02:33:40.838728 7f973d933700  0 -- 192.168.92.12:6818/4289 >>
192.168.92.193:6822/4147 pipe(0xbbae000 sd=92 :6818 s=0 pgs=0 cs=0 l=0
c=0x5a939c0).accept connect_seq 27 vs existing 27 state standby
2016-11-12 02:33:40.839052 7f973d933700  0 -- 192.168.92.12:6818/4289 >>
192.168.92.193:6822/4147 pipe(0xbbae000 sd=92 :6818 s=0 pgs=0 cs=0 l=0
c=0x5a939c0).accept connect_seq 28 vs existing 27 state standby
2016-11-12 02:34:00.187408 7f9719706700  0 -- 192.168.92.12:6818/4289 >>
192.168.92.193:6818/4140 pipe(0xc052000 sd=65 :6818 s=0 pgs=0 cs=0 l=0
c=0x5a91760).accept connect_seq 31 vs existing 31 state standby
2016-11-12 02:34:00.187686 7f9719706700  0 -- 192.168.92.12:6818/4289 >>
192.168.92.193:6818/4140 pipe(0xc052000 sd=65 :6818 s=0 pgs=0 cs=0 l=0
c=0x5a91760).accept connect_seq 32 vs existing 31 state standby
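
If it would help, I can also post the output of querying one of the stuck
PGs, along the lines of:

root@Ceph02:~# ceph pg map 19.5b1
root@Ceph02:~# ceph pg 19.5b1 query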

At this point I'm stuck. I have no idea what to do to fix the 'volumes'
pool. Does anybody have any suggestions?

-- Joel
