On Wed, Jun 15, 2016 at 10:21 AM, Kostis Fardelas <dante1...@gmail.com> wrote:
> Hello Jacob, Gregory,
>
> did you manage to start up those OSDs in the end? I came across a very
> similar incident [1] (though with no flags preventing the OSDs from
> getting UP in the cluster, and no hardware problems reported), and I
> wonder if you found out what the culprit was in your case.
>
> [1] http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/30432
Nope, never heard back. That said, it's not clear from your description
whether these are actually the same problem; if they are, you need to
provide monitor logs before anybody can help. If they aren't, you are
skipping steps and need to include OSD logs and things. ;)
-Greg

>
> Best regards,
> Kostis
>
> On 17 April 2015 at 02:04, Gregory Farnum <g...@gregs42.com> wrote:
>> The monitor looks like it's not generating a new OSDMap that includes
>> the booting OSDs. I could say with more certainty what's going on if I
>> could see the monitor log file, but I'm betting you've got one of the
>> noin or noup family of flags set. I *think* these are output in
>> "ceph -w" or in "ceph osd dump", although I can't say for certain in
>> Firefly. (A quick way to check is sketched at the end of this thread.)
>> -Greg
>>
>> On Fri, Apr 10, 2015 at 1:57 AM, Jacob Reid <lists-c...@jacob-reid.co.uk> wrote:
>>> On Fri, Apr 10, 2015 at 09:55:20AM +0100, Jacob Reid wrote:
>>>> On Thu, Apr 09, 2015 at 05:21:47PM +0100, Jacob Reid wrote:
>>>> > On Thu, Apr 09, 2015 at 08:46:07AM -0700, Gregory Farnum wrote:
>>>> > > On Thu, Apr 9, 2015 at 8:14 AM, Jacob Reid <lists-c...@jacob-reid.co.uk> wrote:
>>>> > > > On Thu, Apr 09, 2015 at 06:43:45AM -0700, Gregory Farnum wrote:
>>>> > > >> You can turn up debugging ("debug osd = 10" and "debug
>>>> > > >> filestore = 10" are probably enough, or maybe 20 each) and see
>>>> > > >> what comes out to get more information about why the threads
>>>> > > >> are stuck. (See the config sketch at the end of this thread.)
>>>> > > >>
>>>> > > >> But just from the log my answer is the same as before, and now
>>>> > > >> I don't trust that controller (or maybe its disks), regardless
>>>> > > >> of what it's admitting to. ;)
>>>> > > >> -Greg
>>>> > > >>
>>>> > > >
>>>> > > > Ran with osd and filestore debug both at 20; still nothing
>>>> > > > jumping out at me. Logfile attached, as it got huge fairly
>>>> > > > quickly, but it mostly seems to be the same extra lines. I tried
>>>> > > > running some test I/O on the drives in question to try to
>>>> > > > provoke some kind of problem, but they seem fine now...
>>>> > >
>>>> > > Okay, this is strange. Something very wonky is happening with your
>>>> > > scheduler — it looks like these threads are all idle, and the
>>>> > > wakeups they schedule are firing an appreciable amount of time
>>>> > > after they're supposed to. For instance:
>>>> > > 2015-04-09 15:56:55.953116 7f70a7963700 20
>>>> > > filestore(/var/lib/ceph/osd/osd.15) sync_entry woke after 5.416704
>>>> > > 2015-04-09 15:56:55.953153 7f70a7963700 20
>>>> > > filestore(/var/lib/ceph/osd/osd.15) sync_entry waiting for
>>>> > > max_interval 5.000000
>>>> > >
>>>> > > This is the thread that syncs your backing store, and it always
>>>> > > sets itself to get woken up at 5-second intervals — but here it
>>>> > > took >5.4 seconds, and later on in your log it takes more than 6
>>>> > > seconds. It looks like all the threads that are getting timed out
>>>> > > are also idle, but are taking so much longer to wake up than
>>>> > > they're set for that they trip a timeout warning.
>>>> > >
>>>> > > There might be some bugs in here where we're expecting wakeups to
>>>> > > be more precise than they can be, but these sorts of misses are
>>>> > > definitely not normal. Is this server overloaded on the CPU? Have
>>>> > > you done something to make the scheduler or wakeups wonky?
>>>> > > -Greg
>>>> >
>>>> > CPU load is minimal - the host does nothing but run OSDs and has 8
>>>> > cores that are all sitting idle with a load average of 0.1. I
>>>> > haven't done anything to scheduling.
>>>> > That was with the debug logging on, if that could be the cause of
>>>> > any delays. A scheduler issue seems possible - I haven't done
>>>> > anything to it, but `time sleep 5` run a few times returns times
>>>> > spread randomly from 5.002 to 7.1(!) seconds, mostly in the 5.5-6.0
>>>> > region, whereas it manages fairly consistently <5.2 on the other
>>>> > servers in the cluster and <5.02 on my desktop. (A small measurement
>>>> > loop is sketched at the end of this thread.) I have disabled the CPU
>>>> > power saving mode, as the only thing I could think of that might be
>>>> > having an effect here, and running the same test again gives more
>>>> > sane results... we'll see whether that is reflected in the OSD logs,
>>>> > I guess. If this is the cause, it's probably something the next
>>>> > version might want to detect and warn about specifically. I will
>>>> > keep you updated on their behaviour now...
>>>>
>>>> Overnight, nothing changed - I am no longer seeing the timeouts in the
>>>> logs, but all the OSDs in question are still happily sitting at
>>>> "booting" and showing as down in the tree. Debug 20 logfile attached
>>>> again.
>>> ...and here actually *is* the logfile, which I managed to forget...
>>> must be Friday, I guess.
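A footnote on Greg's noin/noup suggestion above: the cluster-wide flags
are easy to check from the standard Ceph CLI. A minimal sketch (the exact
output format varies a little between releases, so treat the grep as
approximate):

    # Cluster-wide flags appear near the top of the OSD map dump:
    ceph osd dump | grep ^flags

    # ...and in the status summary:
    ceph -s

    # If one of the boot-blocking flags is set, clear it:
    ceph osd unset noup
    ceph osd unset noin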
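The debug levels mentioned above can be set persistently in ceph.conf or
injected into running daemons; a sketch, assuming the usual injectargs
spelling (option names have drifted slightly across releases):

    # In ceph.conf on the OSD host (takes effect on restart):
    [osd]
        debug osd = 20
        debug filestore = 20

    # Or, without a restart, injected into all running OSDs:
    ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20'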
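And on the `time sleep 5` experiment: a small loop like the following
makes the wakeup drift easier to quantify (plain shell with GNU date and
bc; the sysfs paths are the usual Linux ones and may differ per
distro/kernel):

    # Measure how late 5-second sleeps actually return:
    for i in 1 2 3 4 5; do
        start=$(date +%s.%N)
        sleep 5
        end=$(date +%s.%N)
        echo "slept $(echo "$end - $start" | bc)s (expected ~5.0)"
    done

    # Consistent large overshoots often trace back to CPU power
    # management or the clocksource, which are worth checking:
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    cat /sys/devices/system/clocksource/clocksource0/current_clocksource

Setting the governor to "performance" (e.g. via
cpupower frequency-set -g performance) is the quick way to rule the
former out, which matches the power-saving fix reported above.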