The monitor looks like it's not generating a new OSDMap that includes the booting OSDs. I could say with more certainty what's going on if I had the monitor log file, but I'm betting you've got one of the noin or noup family of flags set. I *think* these will be output in "ceph -w" or in "ceph osd dump", although I can't say for certain in Firefly.
-Greg
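A minimal sketch of how the flag check suggested above could be scripted. It assumes that the `ceph osd dump` text contains a line starting with `flags` followed by a comma-separated list of flag names; as noted above, that is not confirmed for Firefly. The helper names `parse_osdmap_flags`, `blocking_flags` and `read_osd_dump` are hypothetical and not part of Ceph.

```python
import subprocess

# The two flags named in the reply above; other members of the family are not checked here.
BLOCKING_FLAGS = {"noup", "noin"}


def parse_osdmap_flags(dump_text):
    """Return the set of flag names found on the 'flags' line of the dump text.

    Assumption: the dump has a line of the form 'flags noup,noin'.
    Returns an empty set if there is no such line or it lists no flags.
    """
    for line in dump_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "flags":
            if len(parts) == 1:
                return set()
            return set(parts[1].split(","))
    return set()


def blocking_flags(dump_text):
    """Return whichever of noup/noin appear in the dump text."""
    return parse_osdmap_flags(dump_text) & BLOCKING_FLAGS


def read_osd_dump():
    """Run `ceph osd dump` on a node with admin access and return its text output."""
    result = subprocess.run(
        ["ceph", "osd", "dump"], check=True, capture_output=True, text=True
    )
    return result.stdout
```

On a cluster node, `blocking_flags(read_osd_dump())` would return a non-empty set if either flag is set, under the output-format assumption above.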
On Fri, Apr 10, 2015 at 1:57 AM, Jacob Reid <lists-c...@jacob-reid.co.uk> wrote:
> On Fri, Apr 10, 2015 at 09:55:20AM +0100, Jacob Reid wrote:
>> On Thu, Apr 09, 2015 at 05:21:47PM +0100, Jacob Reid wrote:
>> > On Thu, Apr 09, 2015 at 08:46:07AM -0700, Gregory Farnum wrote:
>> > > On Thu, Apr 9, 2015 at 8:14 AM, Jacob Reid <lists-c...@jacob-reid.co.uk>
>> > > wrote:
>> > > > On Thu, Apr 09, 2015 at 06:43:45AM -0700, Gregory Farnum wrote:
>> > > >> You can turn up debugging ("debug osd = 10" and "debug filestore = 10"
>> > > >> are probably enough, or maybe 20 each) and see what comes out to get
>> > > >> more information about why the threads are stuck.
>> > > >>
>> > > >> But just from the log my answer is the same as before, and now I don't
>> > > >> trust that controller (or maybe its disks), regardless of what it's
>> > > >> admitting to. ;)
>> > > >> -Greg
>> > > >>
>> > > >
>> > > > Ran with osd and filestore debug both at 20; still nothing jumping out
>> > > > at me. Logfile attached as it got huge fairly quickly, but mostly
>> > > > seems to be the same extra lines. I tried running some test I/O on the
>> > > > drives in question to try and provoke some kind of problem, but they
>> > > > seem fine now...
>> > >
>> > > Okay, this is strange. Something very wonky is happening with your
>> > > scheduler - it looks like these threads are all idle, and they're
>> > > scheduling wakeups that happen an appreciable amount of time after
>> > > they're supposed to.
>> > > For instance:
>> > > 2015-04-09 15:56:55.953116 7f70a7963700 20 filestore(/var/lib/ceph/osd/osd.15) sync_entry woke after 5.416704
>> > > 2015-04-09 15:56:55.953153 7f70a7963700 20 filestore(/var/lib/ceph/osd/osd.15) sync_entry waiting for max_interval 5.000000
>> > >
>> > > This is the thread that syncs your backing store, and it always sets
>> > > itself to get woken up at 5-second intervals - but here it took >5.4
>> > > seconds, and later on in your log it takes more than 6 seconds.
>> > > It looks like all the threads which are getting timed out are also
>> > > idle, but are taking so much longer to wake up than they're set for
>> > > that they get a timeout warning.
>> > >
>> > > There might be some bugs in here where we're expecting wakeups to be
>> > > more precise than they can be, but these sorts of misses are
>> > > definitely not normal. Is this server overloaded on the CPU? Have you
>> > > done something to make the scheduler or wakeups wonky?
>> > > -Greg
>> >
>> > CPU load is minimal - the host does nothing but run OSDs and has 8 cores
>> > that are all sitting idle with a load average of 0.1. I haven't done
>> > anything to scheduling. That was with the debug logging on, if that could
>> > be the cause of any delays. A scheduler issue seems possible - I haven't
>> > done anything to it, but `time sleep 5` run a few times returns anything
>> > spread randomly from 5.002 to 7.1(!) seconds but mostly in the 5.5-6.0
>> > region where it managed fairly consistently <5.2 on the other servers in
>> > the cluster and <5.02 on my desktop. I have disabled the CPU power saving
>> > mode as the only thing I could think of that might be having an effect on
>> > this, and running the same test again gives more sane results... we'll see
>> > if this reflects in the OSD logs or not, I guess. If this is the cause,
>> > it's probably something that the next version might want to make a
>> > specific warning case of detecting.
>> > I will keep you updated as to their
>> > behaviour now...
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> Overnight, nothing changed - I am no longer seeing the timeout in the logs
>> but all the OSDs in question are still happily sitting at booting and
>> showing as down in the tree. Debug 20 logfile attached again.
> ...and here actually *is* the logfile, which I managed to forget... must be
> Friday, I guess.
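The `time sleep 5` check described in the quoted thread can be repeated from a script so that several hosts are compared the same way. The following is only a rough sketch: the function names and the `tolerance` parameter are hypothetical, and it times the sleep call inside one process, so the figures will not exactly match `time sleep 5`, which also includes process start-up.

```python
import time


def measure_sleep_overshoot(interval, runs):
    """Sleep `runs` times for `interval` seconds each.

    Returns a list with, for each run, how many seconds later than
    requested the wakeup was observed (measured with a monotonic clock).
    """
    overshoots = []
    for _ in range(runs):
        start = time.monotonic()
        time.sleep(interval)
        elapsed = time.monotonic() - start
        overshoots.append(elapsed - interval)
    return overshoots


def late_wakeups(overshoots, tolerance):
    """Return the overshoots that exceed `tolerance` seconds."""
    return [o for o in overshoots if o > tolerance]
```

For example, `late_wakeups(measure_sleep_overshoot(5.0, 5), 0.2)` mirrors the five-second interval from the thread; the 0.2 s tolerance is an assumption based on the "<5.2" figure reported for the other servers in the cluster.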