On Sun, Oct 12, 2014 at 9:10 AM, Loic Dachary <l...@dachary.org> wrote:
>
>
> On 12/10/2014 17:48, Gregory Farnum wrote:
>> On Sun, Oct 12, 2014 at 7:46 AM, Loic Dachary <l...@dachary.org> wrote:
>>> Hi,
>>>
>>> On a 0.80.6 cluster the command
>>>
>>> ceph tell osd.6 version
>>>
>>> hangs forever. I checked that it establishes a TCP connection to the OSD
>>> and raised the OSD debug level to 20, but I do not see
>>>
>>> https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L4991
>>>
>>> in the logs. All other OSDs answer the same "version" command as they
>>> should, and ceph daemon osd.6 version on the machine running OSD 6
>>> responds as it should. There is also an ever-growing number of slow
>>> requests on this OSD, but no errors in the logs. In other words, except
>>> for taking forever to answer any kind of request, the OSD looks fine.
>>>
>>> Another OSD running on the same machine is behaving well.
>>>
>>> Any idea what that behaviour relates to?
>>
>> What commands have you run? The admin socket commands don't require
>> nearly as many locks, nor do they go through the same event loops that
>> messages do. You might have found a deadlock or something. (In which
>> case just restarting the OSD would probably fix it, but you should
>> grab a core dump first.)
>
> # /etc/init.d/ceph stop osd.6
> === osd.6 ===
> Stopping Ceph osd.6 on g3...kill 23690...kill 23690...done
> root@g3:/var/lib/ceph/osd/ceph-6/current# /etc/init.d/ceph start osd.6
> === osd.6 ===
> Starting Ceph osd.6 on g3...
> starting osd.6 at :/0 osd_data /var/lib/ceph/osd/ceph-6 
> /var/lib/ceph/osd/ceph-6/journal
> root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version
> { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version
>
> and now it blocks. It looks like a deadlock happens shortly after it boots.

Is this the same cluster you're reporting on in the tracker?

Anyway, apparently it's a disk state issue. I have no idea what kind
of bug in Ceph could cause this, so my guess is that a syscall is
going out to lunch, although that should get caught by the internal
heartbeat check-in code. Like I said, grab a core dump and look for
deadlocks or blocked syscalls in the filestore.
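
For reference, something along these lines is what I have in mind (just a
sketch; it assumes the pid is in the default /var/run/ceph/osd.6.pid and
that gdb is installed on g3):

PID=$(cat /var/run/ceph/osd.6.pid)

# Write a core of the running process without killing it (gcore ships with gdb).
gcore -o /tmp/ceph-osd.6.core $PID

# Or dump backtraces of every thread in place; threads waiting on each
# other's locks show up directly here.
gdb -p $PID -batch -ex 'thread apply all bt' > /tmp/osd.6.threads.txt

# Threads stuck in a syscall sit in D state; wchan gives a rough idea of where.
ps -L -o tid,stat,wchan:30,comm -p $PID

If every backtrace ends in a lock wait, that points at a deadlock inside the
OSD; if a filestore thread is sitting in D state, the disk or filesystem
underneath it is the more likely suspect.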
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com