On Sun, Oct 12, 2014 at 9:29 AM, Loic Dachary <l...@dachary.org> wrote:
>
>
> On 12/10/2014 18:22, Gregory Farnum wrote:
>> On Sun, Oct 12, 2014 at 9:10 AM, Loic Dachary <l...@dachary.org> wrote:
>>>
>>>
>>> On 12/10/2014 17:48, Gregory Farnum wrote:
>>>> On Sun, Oct 12, 2014 at 7:46 AM, Loic Dachary <l...@dachary.org> wrote:
>>>>> Hi,
>>>>>
>>>>> On a 0.80.6 cluster the command
>>>>>
>>>>> ceph tell osd.6 version
>>>>>
>>>>> hangs forever. I checked that it establishes a TCP connection to the
>>>>> OSD, and I raised the OSD debug level to 20, but I do not see
>>>>>
>>>>> https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L4991
>>>>>
>>>>> in the logs. All other OSDs answer the same "version" command as they
>>>>> should, and ceph daemon osd.6 version on the machine running OSD 6
>>>>> responds as it should. There is also an ever-growing number of slow
>>>>> requests on this OSD, but no error in the logs. In other words, except
>>>>> for taking forever to answer any kind of request, the OSD looks fine.
>>>>>
>>>>> Another OSD running on the same machine is behaving well.
>>>>>
>>>>> Any idea what that behaviour relates to?
>>>>
>>>> What commands have you run? The admin socket commands don't require
>>>> nearly as many locks, nor do they go through the same event loops that
>>>> messages do. You might have found a deadlock or something. (In which
>>>> case just restarting the OSD would probably fix it, but you should
>>>> grab a core dump first.)
>>>
>>> # /etc/init.d/ceph stop osd.6
>>> === osd.6 ===
>>> Stopping Ceph osd.6 on g3...kill 23690...kill 23690...done
>>> root@g3:/var/lib/ceph/osd/ceph-6/current# /etc/init.d/ceph start osd.6
>>> === osd.6 ===
>>> Starting Ceph osd.6 on g3...
>>> starting osd.6 at :/0 osd_data /var/lib/ceph/osd/ceph-6 /var/lib/ceph/osd/ceph-6/journal
>>> root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version
>>> { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
>>> root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version
>>>
>>> and now it blocks. It looks like a deadlock happens shortly after it
>>> boots.
>>
>> Is this the same cluster you're reporting on in the tracker?
>
> Yes, it is the same cluster as http://tracker.ceph.com/issues/9750.
> Although I can't imagine how the two could be related, they probably are.
>
>> Anyway, apparently it's a disk state issue. I have no idea what kind
>> of bug in Ceph could cause this, so my guess is that a syscall is
>> going out to lunch -- although that should get caught up in the
>> internal heartbeat check-in code. Like I said, grab a core dump and
>> look for deadlocks or blocked syscalls in the filestore.
>
> I created http://tracker.ceph.com/issues/9751 and attached the log with
> debug_filestore = 20. There are many slow requests but I can't relate
> them to any kind of error.
>
> It does not core dump; should I kill it to get a coredump and then
> examine it? I've never tried that ;-)
That's what I was thinking; you send it a SIGQUIT signal and it'll dump.
Or apparently you can use "gcore" instead, which won't quit it.

The log doesn't have anything glaringly obvious; was it already "hung"
when you packaged that? If so, it must be some kind of deadlock, and the
backtraces from the core dump will probably tell us what happened.

> One way or the other the problem will be fixed soon (tonight). I'd like
> to take advantage of the broken state we have to figure it out.
> Resurrecting the OSD may unblock http://tracker.ceph.com/issues/9751 and
> may also unblock http://tracker.ceph.com/issues/9750, but then we'll
> lose the chance to diagnose this rare condition.
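
For reference, a rough sketch of what grabbing the core and the thread
backtraces could look like without stopping the OSD. This assumes gdb and
the ceph debug symbols are installed, that the binary lives at
/usr/bin/ceph-osd, and that the "-i 6" pattern uniquely matches the hung
daemon; adjust paths and the pid lookup for your host:

    # Find the pid of the ceph-osd process for osd.6; a second OSD runs on
    # this host, so match on the "-i 6" argument and double-check the result.
    PID=$(pgrep -f 'ceph-osd.*-i 6')

    # gcore (ships with gdb) dumps a core of the running process without
    # killing it; the file ends up as /var/tmp/osd.6.<pid>.
    gcore -o /var/tmp/osd.6 $PID

    # Dump every thread's backtrace from the core. A deadlock usually shows
    # up as threads parked on each other's mutexes; a stuck syscall shows a
    # thread sitting inside a filestore read/write.
    gdb -batch -ex 'thread apply all bt' \
        /usr/bin/ceph-osd /var/tmp/osd.6.$PID > /var/tmp/osd.6.threads.txt

    # The SIGQUIT route also works, but it stops the daemon and only
    # produces a core if core dumps are enabled (ulimit -c unlimited):
    #   kill -QUIT $PID

Attaching gdb directly to the live pid and running "thread apply all bt"
would give the same information, at the cost of pausing the OSD while gdb
is attached.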
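
As an aside, since the admin socket path keeps answering while the
messenger path hangs (as the thread above shows), it can also be used to
look at what the stuck requests are doing. A quick sketch, run on g3
itself, assuming the default socket location and the firefly command
names; whether these answer during the hang depends on which locks the
stuck thread is holding:

    # Over the cluster, through the messenger and the OSD op path -- this
    # is the call that hangs:
    ceph tell osd.6 version

    # Locally, over the admin socket -- bypasses the messenger and most of
    # the OSD locks:
    ceph daemon osd.6 version
    ceph daemon osd.6 dump_ops_in_flight

    # Equivalently, naming the socket explicitly:
    ceph --admin-daemon /var/run/ceph/ceph-osd.6.asok dump_ops_in_flight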