Yes, I looked at tcpdump on each of the OSDs and saw communications between all
3 OSDs before I sent my first question to this list. When I disabled SELinux on
the one offending server based on your feedback (typically we have it disabled
on lab systems that are only on the lab net), the 10 PGs in my test pool all
went to 'active+clean' almost immediately. Unfortunately the 3 default pools
still remain in the creating states and the cluster is not HEALTH_OK. The OSDs
all stayed UP/IN after the SELinux change for the rest of the day, until I made
the mistake of creating an RBD image on demo-pool and its 10 'active+clean'
PGs. I created the rbd, but when I attempted to look at it with 'rbd info' the
cluster went into an endless loop trying to read a placement group, which I
left running overnight. This morning ceph-mon had crashed again. I'll probably
start over from scratch once again on Monday.
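
For anyone else hitting this, checking and flipping SELinux on a RHEL/CentOS
box is roughly the following (a sketch, not the exact commands I ran):

    sestatus        # show current mode: Enforcing / Permissive / Disabled
    setenforce 0    # drop to Permissive for the running system
    # make it stick across reboots:
    sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config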

I deleted ceph-mds and got rid of the 'laggy' comments from 'ceph health'. The
"official" online Ceph docs on that say "coming soon", and most references I
could find were pre-Firefly, so it was a little trial and error to figure out
that I had to use the pool number and not its name to get the removal to work.
Same with 'ceph mds newfs' to get rid of the laggy-ness in the 'ceph health'
output (commands below; a quick way to look up the pool IDs follows them).

[root@essperf3 Ceph]# ceph mds rm 0  mds.essperf3
mds gid 0 dne
[root@essperf3 Ceph]# ceph health
HEALTH_WARN 96 pgs incomplete; 96 pgs peering; 192 pgs stuck inactive; 192 pgs
stuck unclean; mds essperf3 is laggy
[root@essperf3 Ceph]# ceph mds newfs 1 0  --yes-i-really-mean-it
new fs with metadata pool 1 and data pool 0
[root@essperf3 Ceph]# ceph health
HEALTH_WARN 96 pgs incomplete; 96 pgs peering; 192 pgs stuck inactive; 192 pgs 
stuck unclean
[root@essperf3 Ceph]#
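
For reference, the pool IDs that 'ceph mds newfs' expects can be looked up
rather than guessed (a sketch; the example in the comment assumes the stock
Firefly pools):

    ceph osd lspools          # prints ID/name pairs, e.g. "0 data,1 metadata,2 rbd,"
    ceph osd dump | grep pool # same IDs plus the size/min_size settings per pool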



From: Brian Rak [mailto:b...@gameservers.com]
Sent: Friday, August 01, 2014 6:14 PM
To: Bruce McFarland; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Firefly OSDs stuck in creating state forever

What happens if you remove nodown?  I'd be interested to see what OSDs it 
thinks are down. My next thought would be tcpdump on the private interface.  
See if the OSDs are actually managing to connect to each other.
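
Something like this on the cluster-network interface would show whether the
OSDs are actually reaching each other (the interface name is a placeholder;
6800-7300 is the default OSD port range):

    tcpdump -ni ethX 'tcp portrange 6800-7300'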

For comparison, when I bring up a cluster of 3 OSDs it goes to HEALTH_OK nearly
instantly (definitely under a minute!), so it's probably not just taking a while.

Does 'ceph osd dump' show the proper public and private IPs?
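A quick way to check (just a sketch):

    ceph osd dump | grep '^osd\.'

Each osd.N line should list the public address followed by the cluster address;
if either looks wrong, that's worth chasing.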
On 8/1/2014 6:13 PM, Bruce McFarland wrote:
MDS: I assumed that I'd need to bring up a ceph-mds for my cluster at initial
bringup. We also intended to modify the CRUSH map so that its pool is resident
on SSD(s). It is one of the areas of the online docs where there doesn't seem
to be a lot of info, and I haven't spent a lot of time researching it. I'll
stop it.

OSD connectivity: The connectivity is good for both 1GE and 10GE. I thought
moving to 10GE with nothing else on that net might help with placement group
peering etc. and bring the PGs up quicker. I've checked 'tcpdump' output on all
boxes.
Firewall: Thanks for that one - it's the "basic" I overlooked in my Ceph
learning curve. One of the OSDs had selinux=enforcing - all the others were
disabled. After changing that box, the 10 PGs in my demo-pool (I kept the PG
count very small for sanity) are now 'active+clean'. The PGs for the default
pools - data, metadata, rbd - are still stuck in creating+peering or
creating+incomplete. I did have to manually set 'osd pool default min size = 1'
from its default of 2 for these 3 pools to eliminate a bunch of warnings in the
'ceph health detail' output.
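For the record, the same change can also be applied per pool at runtime instead
of only as the ceph.conf default; roughly (pool names are the stock Firefly
ones):

    ceph osd pool set data min_size 1
    ceph osd pool set metadata min_size 1
    ceph osd pool set rbd min_size 1
    # equivalent default for newly created pools, in ceph.conf under [global]:
    #   osd pool default min size = 1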
I'm adding the [mon] setting you suggested, stopping ceph-mds, and bringing
everything up now.
[root@essperf3 Ceph]# ceph -s
    cluster 4b3ffe60-73f4-4512-b7da-b04e4775dd73
     health HEALTH_WARN 96 pgs incomplete; 96 pgs peering; 192 pgs stuck 
inactive; 192 pgs stuck unclean; 28 requests are blocked > 32 sec; 
nodown,noscrub flag(s) set
     monmap e1: 1 mons at {essperf3=209.243.160.35:6789/0}, election epoch 1, 
quorum 0 essperf3
     mdsmap e43: 1/1/1 up {0=essperf3=up:creating}
     osdmap e752: 3 osds: 3 up, 3 in
            flags nodown,noscrub
      pgmap v1483: 202 pgs, 4 pools, 0 bytes data, 0 objects
            134 MB used, 1158 GB / 1158 GB avail
                  96 creating+peering
                  10 active+clean <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<!!!!!!!!
                  96 creating+incomplete
[root@essperf3 Ceph]#
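
If it helps to dig into why the remaining 192 PGs are stuck, I can pull output
from something like this (the pgid is a placeholder):

    ceph health detail            # lists the stuck PGs by ID
    ceph pg dump_stuck inactive   # same list in a more script-friendly form
    ceph pg <pgid> query          # per-PG view of what peering is blocked on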

From: Brian Rak [mailto:b...@gameservers.com]
Sent: Friday, August 01, 2014 2:54 PM
To: Bruce McFarland; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Firefly OSDs stuck in creating state forever

Why do you have an MDS active?  I'd suggest getting rid of that, at least until
you have everything else working.

I see you've set nodown on the OSDs, did you have problems with the OSDs 
flapping?  Do the OSDs have broken connectivity between themselves?  Do you 
have some kind of firewall interfering here?
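If there is a firewall in the way, the ports that matter are 6789 for the
monitor and 6800-7300 for the OSDs; a rough iptables sketch (illustrative only,
not a hardened config):

    iptables -L -n                                       # see what's currently filtered
    iptables -I INPUT -p tcp --dport 6789 -j ACCEPT      # ceph-mon
    iptables -I INPUT -p tcp --dport 6800:7300 -j ACCEPT # ceph-osd default range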
I've seen odd issues when the OSDs have broken private networking; you'll get
one OSD marking all the other ones down. Adding this to my config helped:

[mon]
mon osd min down reporters = 2


On 8/1/2014 5:41 PM, Bruce McFarland wrote:
Hello,
I've run out of ideas and assume I've overlooked something very basic. I've
created 2 Ceph clusters in the last 2 weeks with different OSD HW and private
network fabrics - 1GE and 10GE. I have never been able to get the placement
groups to come up to the 'active+clean' state. I have followed your online
documentation, and at this point the only thing I don't think I've done is
modify the CRUSH map (although I have been looking into that). These are new
clusters with no data and only 1 HDD and 1 SSD per OSD (24 2.5GHz cores with
64GB RAM).

Since the disks are being recycled, is there something I need to flag to let
Ceph just create its mappings, but not scrub for data compatibility? I've tried
setting the noscrub flag to no effect.
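
From what I've read, the usual advice for recycled disks is to zap them before
creating the OSDs rather than to rely on scrub flags; something like this, with
host and device names as placeholders:

    ceph-disk zap /dev/sdX                  # wipes the old partition table and leftover data
    # or, driven from the admin node via ceph-deploy:
    ceph-deploy disk zap <host>:/dev/sdX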

I also have constant OSD flapping. I've set nodown, but I assume that is just
masking a problem that is still occurring.
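If it would help, I can unset it and watch what gets reported, roughly:

    ceph osd unset nodown   # let the monitor mark flapping OSDs down again
    ceph -w                 # watch which OSDs get reported down, and by whom
    ceph osd tree           # snapshot of the up/down state per OSD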

Besides the lack of ever reaching the 'active+clean' state, ceph-mon always
crashes after leaving it running overnight. The OSDs all eventually fill /root
with ceph logs, so I regularly have to bring everything down, delete the logs,
and restart.
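
I suspect I just need to point the logs at /var/log/ceph and turn the debug
levels back down once the debug runs are done; a ceph.conf sketch, assuming the
log path is what's landing them in /root:

    [global]
        log file = /var/log/ceph/$cluster-$name.log
        debug osd = 0/5
        debug ms = 0/5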

I have all sorts of output from the ceph.conf; OSD boot output with 'debug osd
= 20' and 'debug ms = 1'; ceph -w output; and pretty much all of the
debug/monitoring suggestions from the online docs and 2 weeks of Google
searches from online references in blogs, mailing lists, etc.

[root@essperf3 Ceph]# ceph -v
ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
[root@essperf3 Ceph]# ceph -s
    cluster 4b3ffe60-73f4-4512-b7da-b04e4775dd73
     health HEALTH_WARN 96 pgs incomplete; 106 pgs peering; 202 pgs stuck 
inactive; 202 pgs stuck unclean; nodown,noscrub flag(s) set
     monmap e1: 1 mons at {essperf3=209.243.160.35:6789/0}, election epoch 1, 
quorum 0 essperf3
     mdsmap e43: 1/1/1 up {0=essperf3=up:creating}
     osdmap e752: 3 osds: 3 up, 3 in
            flags nodown,noscrub
      pgmap v1476: 202 pgs, 4 pools, 0 bytes data, 0 objects
            134 MB used, 1158 GB / 1158 GB avail
                 106 creating+peering
                  96 creating+incomplete
[root@essperf3 Ceph]#

Suggestions?
Thanks,
Bruce





_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com