On 12/10/2014 18:52, Gregory Farnum wrote:
> On Sun, Oct 12, 2014 at 9:29 AM, Loic Dachary <l...@dachary.org> wrote:
>>
>>
>> On 12/10/2014 18:22, Gregory Farnum wrote:
>>> On Sun, Oct 12, 2014 at 9:10 AM, Loic Dachary <l...@dachary.org> wrote:
>>>>
>>>>
>>>> On 12/10/2014 17:48, Gregory Farnum wrote:
>>>>> On Sun, Oct 12, 2014 at 7:46 AM, Loic Dachary <l...@dachary.org> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On a 0.80.6 cluster the command
>>>>>>
>>>>>> ceph tell osd.6 version
>>>>>>
>>>>>> hangs forever. I checked that it establishes a TCP connection to the 
>>>>>> OSD and raised the OSD debug level to 20, but I do not see
>>>>>>
>>>>>> https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L4991
>>>>>>
>>>>>> in the logs. All other OSDs answer the same "version" command as they 
>>>>>> should, and ceph daemon osd.6 version on the machine running OSD 6 
>>>>>> responds as it should. There is also an ever-growing number of slow 
>>>>>> requests on this OSD, but no errors in the logs. In other words, except 
>>>>>> for taking forever to answer any kind of request, the OSD looks fine.
>>>>>>
>>>>>> Another OSD running on the same machine is behaving well.
>>>>>>
>>>>>> Any idea what that behaviour relates to?
>>>>>
>>>>> What commands have you run? The admin socket commands don't require
>>>>> nearly as many locks, nor do they go through the same event loops that
>>>>> messages do. You might have found a deadlock or something. (In which
>>>>> case just restarting the OSD would probably fix it, but you should
>>>>> grab a core dump first.)
>>>>
>>>> # /etc/init.d/ceph stop osd.6
>>>> === osd.6 ===
>>>> Stopping Ceph osd.6 on g3...kill 23690...kill 23690...done
>>>> root@g3:/var/lib/ceph/osd/ceph-6/current# /etc/init.d/ceph start osd.6
>>>> === osd.6 ===
>>>> Starting Ceph osd.6 on g3...
>>>> starting osd.6 at :/0 osd_data /var/lib/ceph/osd/ceph-6 
>>>> /var/lib/ceph/osd/ceph-6/journal
>>>> root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version
>>>> { "version": "ceph version 0.80.6 
>>>> (f93610a4421cb670b08e974c6550ee715ac528ae)"}
>>>> root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version
>>>>
>>>> and now it blocks. It looks like a deadlock happens shortly after it boots.
>>>
>>> Is this the same cluster you're reporting on in the tracker?
>>
>> Yes, it is the same cluster as http://tracker.ceph.com/issues/9750 . 
>> Although I can't imagine how the two could be related, they probably are.
>>
>>> Anyway, apparently it's a disk state issue. I have no idea what kind
>>> of bug in Ceph could cause this, so my guess is that a syscall is
>>> going out to lunch — although that should get caught up in the
>>> internal heartbeat checkin code. Like I said, grab a core dump and
>>> look for deadlocks or blocked syscalls in the filestore.
>>
>> I created http://tracker.ceph.com/issues/9751 and attached the log with 
>> debug_filestore = 20. There are many slow requests but I can't relate them 
>> to any kind of error.
>>
>> It does not core dump on its own; should I kill it to get a core dump 
>> and then examine it? I've never tried that ;-)
> 
> That's what I was thinking; you send it a SIGQUIT signal and it'll
> dump. Or apparently you can use "gcore" instead, which won't quit it.
> The log doesn't have anything glaringly obvious; was it already "hung"
> when you packaged that? If so, it must be some kind of deadlock and
> the backtraces from the core dump will probably tell us what happened.

Since the OSD process itself is not hung, I was able to attach gdb and 
backtrace all threads; see http://tracker.ceph.com/issues/9751#note-7 . 
If you think creating a core dump would show something else, I can do 
that as well.
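
For reference, this is roughly how the backtraces were collected (a 
sketch; the PID is illustrative and has to be picked out of ps first, 
since two OSDs run on this machine):

  # attach to the live ceph-osd, dump a backtrace of every thread,
  # then detach without stopping the process
  gdb -p 23690 -batch -ex 'thread apply all bt' > /tmp/osd6-threads.txt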

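If a full core turns out to be useful, gcore should be able to write one 
without killing the process (again a sketch; the PID and output path are 
illustrative):

  # dump a core file of the running OSD; the process keeps running
  gcore -o /tmp/osd.6 23690

In the meantime the slow requests can at least be listed through the 
admin socket, which still answers:

  ceph daemon osd.6 dump_ops_in_flight
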
cheers

>> One way or the other the problem will be fixed soon (tonight). I'd like 
>> to take advantage of the broken state while we have it to figure this 
>> out. Resurrecting the OSD may unblock http://tracker.ceph.com/issues/9751 
>> and may also unblock http://tracker.ceph.com/issues/9750, but then we 
>> would lose the chance to diagnose this rare condition.

-- 
Loïc Dachary, Artisan Logiciel Libre
