Yes it is the correct IP and port:

ceph3:~$ netstat -anp | fgrep 192.168.206.13:6804
tcp        0      0 192.168.206.13:6804     0.0.0.0:*               LISTEN      22934/ceph-osd
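
A cross-check against the osdmap should show the same address registered for
osd.11; something like this (run from any node with the admin keyring):

ceph0:~$ sudo ceph osd dump | fgrep osd.11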

I turned up the logging on the OSD and I don’t think it received the request.
However, I also noticed a large number of TCP connections to that specific OSD
from the client (192.168.206.17) in CLOSE_WAIT state (131 to be exact).  I
think there may be a bug causing the OSD not to close file descriptors.  Prior
to the hang I had been running tests continuously for several days, so the OSD
process may have been accumulating open sockets.
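
In case it helps, this is roughly how I counted those connections and checked
the OSD’s file descriptor usage (22934 is the ceph-osd PID from the netstat
output above; the fd count and limit are just sanity checks I added):

ceph3:~$ netstat -anp | fgrep 192.168.206.13:6804 | fgrep CLOSE_WAIT | wc -l
ceph3:~$ ls /proc/22934/fd | wc -l
ceph3:~$ fgrep 'Max open files' /proc/22934/limits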

I’m still gathering information, but based on that, is there anything specific
that would be helpful for tracking down the problem?
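
For reference, the way I turned up the logging was roughly this (the subsystems
and levels were just a guess on my part, so let me know if different ones would
be more useful):

ceph3:~$ sudo ceph daemon osd.11 config set debug_ms 20
ceph3:~$ sudo ceph daemon osd.11 config set debug_osd 20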

Thanks,
Phil

> On Apr 24, 2017, at 5:01 PM, Jason Dillaman <jdill...@redhat.com> wrote:
> 
> Just to cover all the bases, is 192.168.206.13:6804 really associated
> with a running daemon for OSD 11?
> 
> On Mon, Apr 24, 2017 at 4:23 PM, Phil Lacroute
> <lacro...@skyportsystems.com> wrote:
>> Jason,
>> 
>> Thanks for the suggestion.  That seems to show it is not the OSD that got
>> stuck:
>> 
>> ceph7:~$ sudo rbd -c debug/ceph.conf info app/image1
>> …
>> 2017-04-24 13:13:49.761076 7f739aefc700  1 -- 192.168.206.17:0/1250293899
>> --> 192.168.206.13:6804/22934 -- osd_op(client.4384.0:3 1.af6f1e38
>> rbd_header.1058238e1f29 [call rbd.get_size,call rbd.get_object_prefix] snapc
>> 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x7f737c0077f0 con
>> 0x7f737c0064e0
>> …
>> 2017-04-24 13:14:04.756328 7f73a2880700  1 -- 192.168.206.17:0/1250293899
>> --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f7374000fc0 con
>> 0x7f737c0064e0
>> 
>> ceph0:~$ sudo ceph pg map 1.af6f1e38
>> osdmap e27 pg 1.af6f1e38 (1.38) -> up [11,16,2] acting [11,16,2]
>> 
>> ceph3:~$ sudo ceph daemon osd.11 ops
>> {
>>    "ops": [],
>>    "num_ops": 0
>> }
>> 
>> I repeated this a few times and it’s always the same command and same
>> placement group that hangs, but OSD11 has no ops (and neither do OSD16 and
>> OSD2, although I think that’s expected).
>> 
>> Is there other tracing I should do on the OSD or something more to look at
>> on the client?
>> 
>> Thanks,
>> Phil
>> 
>> On Apr 24, 2017, at 12:39 PM, Jason Dillaman <jdill...@redhat.com> wrote:
>> 
>> On Mon, Apr 24, 2017 at 2:53 PM, Phil Lacroute
>> <lacro...@skyportsystems.com> wrote:
>> 
>> 2017-04-24 11:30:57.058233 7f5512ffd700  1 -- 192.168.206.17:0/3282647735
>> --> 192.168.206.13:6804/22934 -- osd_op(client.4375.0:3 1.af6f1e38
>> rbd_header.1058238e1f29 [call rbd.get_size,call rbd.get_object_prefix] snapc
>> 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x7f54f40077f0 con
>> 0x7f54f40064e0
>> 
>> 
>> 
>> You can attempt to run "ceph daemon osd.XYZ ops" against the
>> potentially stuck OSD to figure out what it's stuck doing.
>> 
>> --
>> Jason
>> 
>> 
> 
> 
> 
> -- 
> Jason
