It looks like the monitor is not generating a new OSDMap that includes
the booting OSDs. I could say with more certainty what's going on if I
had the monitor log file, but I'm betting you've got one of the noin or
noup family of flags set. I *think* these will show up in "ceph -w" or
in "ceph osd dump", although I can't say for certain in Firefly.
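
If you'd rather not eyeball the output, something like the rough Python
sketch below could pick the relevant flags out of saved "ceph osd dump"
text. It assumes the dump contains a line starting with "flags" followed
by a comma-separated list, which is how I remember it looking but which
I haven't verified on Firefly. The function name blocking_flags is made
up for this example and is not part of any Ceph tooling.

```python
# Flags that stop the monitors from changing an OSD's up/in state.
BLOCKING_FLAGS = {"noup", "noin", "nodown", "noout"}


def blocking_flags(osd_dump_text):
    """Return the up/in-blocking flags named on the 'flags' line.

    Assumes a line of the form 'flags noup,noin' somewhere in the text
    (an assumption about the `ceph osd dump` output format).
    """
    for line in osd_dump_text.splitlines():
        parts = line.split(None, 1)
        if parts and parts[0] == "flags":
            if len(parts) == 1:
                # 'flags' line present but empty: nothing is set.
                return set()
            return BLOCKING_FLAGS & set(parts[1].strip().split(","))
    # No 'flags' line found at all.
    return set()
```

Feed it the text of "ceph osd dump"; if noup or noin comes back, that
would explain OSDs sitting at booting, and "ceph osd unset noup" (or
noin) should clear it.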
-Greg

On Fri, Apr 10, 2015 at 1:57 AM, Jacob Reid <lists-c...@jacob-reid.co.uk> wrote:
> On Fri, Apr 10, 2015 at 09:55:20AM +0100, Jacob Reid wrote:
>> On Thu, Apr 09, 2015 at 05:21:47PM +0100, Jacob Reid wrote:
>> > On Thu, Apr 09, 2015 at 08:46:07AM -0700, Gregory Farnum wrote:
>> > > On Thu, Apr 9, 2015 at 8:14 AM, Jacob Reid <lists-c...@jacob-reid.co.uk> 
>> > > wrote:
>> > > > On Thu, Apr 09, 2015 at 06:43:45AM -0700, Gregory Farnum wrote:
>> > > >> You can turn up debugging ("debug osd = 10" and "debug filestore = 10"
>> > > >> are probably enough, or maybe 20 each) and see what comes out to get
>> > > >> more information about why the threads are stuck.
>> > > >>
>> > > >> But just from the log my answer is the same as before, and now I don't
>> > > >> trust that controller (or maybe its disks), regardless of what it's
>> > > >> admitting to. ;)
>> > > >> -Greg
>> > > >>
>> > > >
>> > > > Ran with osd and filestore debug both at 20; still nothing jumping out 
>> > > > at me. Logfile attached as it got huge fairly quickly, but mostly 
>> > > > seems to be the same extra lines. I tried running some test I/O on the 
>> > > > drives in question to try and provoke some kind of problem, but they 
>> > > > seem fine now...
>> > >
>> > > Okay, this is strange. Something very wonky is happening with your
>> > > scheduler: it looks like these threads are all idle, and they're
>> > > scheduling wakeups that get handled an appreciable amount of time
>> > > after they're supposed to. For instance:
>> > > 2015-04-09 15:56:55.953116 7f70a7963700 20
>> > > filestore(/var/lib/ceph/osd/osd.15) sync_entry woke after 5.416704
>> > > 2015-04-09 15:56:55.953153 7f70a7963700 20
>> > > filestore(/var/lib/ceph/osd/osd.15) sync_entry waiting for
>> > > max_interval 5.000000
>> > >
>> > > This is the thread that syncs your backing store, and it always sets
>> > > itself to get woken up at 5-second intervals — but here it took >5.4
>> > > seconds, and later on in your log it takes more than 6 seconds.
>> > > It looks like all the threads which are getting timed out are also
>> > > idle, but are taking so much longer to wake up than they're set for
>> > > that they get a timeout warning.
>> > >
>> > > There might be some bugs in here where we're expecting wakeups to be
>> > > more precise than they can be, but these sorts of misses are
>> > > definitely not normal. Is this server overloaded on the CPU? Have you
>> > > done something to make the scheduler or wakeups wonky?
>> > > -Greg
>> >
>> > CPU load is minimal - the host does nothing but run OSDs and has 8 cores 
>> > that are all sitting idle with a load average of 0.1. I haven't done 
>> > anything to scheduling. That was with the debug logging on, if that could 
>> > be the cause of any delays. A scheduler issue seems possible - I haven't 
>> > done anything to it, but `time sleep 5` run a few times returns anything 
>> > spread randomly from 5.002 to 7.1(!) seconds but mostly in the 5.5-6.0 
>> > region, whereas it managed fairly consistently <5.2 on the other servers in 
>> > the cluster and <5.02 on my desktop. I have disabled the CPU power saving 
>> > mode as the only thing I could think of that might be having an effect on 
>> > this, and running the same test again gives more sane results... we'll see 
>> > if this reflects in the OSD logs or not, I guess. If this is the cause, 
>> > it's probably something that the next version might want to make a 
>> > specific warning case of detecting. I will keep you updated as to their 
>> > behaviour now...
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> Overnight, nothing changed - I am no longer seeing the timeout in the logs 
>> but all the OSDs in question are still happily sitting at booting and 
>> showing as down in the tree. Debug 20 logfile attached again.
> ...and here actually *is* the logfile, which I managed to forget... must be 
> Friday, I guess.
>