What OS are you using? I have a lot more open connections than that, though I have some other issues: rgw sometimes returns 500 errors, but it doesn't stop responding like yours does.
You might try tuning civetweb's num_threads and 'rgw num rados handles':

rgw frontends = civetweb num_threads=125 error_log_file=/var/log/radosgw/civetweb.error.log access_log_file=/var/log/radosgw/civetweb.access.log
rgw num rados handles = 32

You can also up civetweb loglevel:

debug civetweb = 20

-Ben

On Wed, Mar 16, 2016 at 5:03 PM, seapasu...@uchicago.edu <seapasu...@uchicago.edu> wrote:

> I have a cluster of around 630 OSDs with 3 dedicated monitors and 2
> dedicated gateways. The entire cluster is running hammer (0.94.5
> (9764da52395923e0b32908d83a9f7304401fee43)).
>
> (Both of my gateways have stopped responding to curl right now.
> root@host:~# timeout 5 curl localhost ; echo $?
> 124
>
> From here I checked and it looks like radosgw has over 1 million open
> files:
> root@host:~# grep -i rados whatisopen.files.list | wc -l
> 1151753
>
> And around 750 open connections:
> root@host:~# netstat -planet | grep radosgw | wc -l
> 752
> root@host:~# ss -tnlap | grep rados | wc -l
> 752
>
> I don't think that the backend storage is hanging based on the following
> dump:
>
> root@host:~# ceph daemon /var/run/ceph/ceph-client.rgw.kh11-9.asok
> objecter_requests | grep -i mtime
> "mtime": "0.000000",
> "mtime": "0.000000",
> "mtime": "0.000000",
> "mtime": "0.000000",
> "mtime": "0.000000",
> "mtime": "0.000000",
> [...]
> "mtime": "0.000000",
>
> The radosgw log is still showing lots of activity and so does strace which
> makes me think this is a config issue or limit of some kind that is not
> triggering a log. Of what I am not sure as the log doesn't seem to show any
> open file limit being hit and I don't see any big errors showing up in the
> logs.
> (last 500 lines of /var/log/radosgw/client.radosgw.log)
> http://pastebin.com/jmM1GFSA
>
> Perf dump of radosgw
> http://pastebin.com/rjfqkxzE
>
> Radosgw objecter requests:
> http://pastebin.com/skDJiyHb
>
> After restarting the gateway with '/etc/init.d/radosgw restart' the old
> process remains, no error is sent, and then I get connection refused via
> curl or netcat::
> root@kh11-9:~# curl localhost
> curl: (7) Failed to connect to localhost port 80: Connection refused
>
> Once I kill the old radosgw via sigkill the new radosgw instance restarts
> automatically and starts responding::
> root@kh11-9:~# curl localhost
> <?xml version="1.0" encoding="UTF-8"?><ListAllMyBucketsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>anonymous</ID><DisplayName></DisplayName></Owner><Buckets></Buckets></ListAllMyB
>
> What is going on here?
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
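Since the quoted message says the log does not show an open-file limit being hit, it may be worth comparing the descriptor count against the limit directly. Below is a rough sketch, not anything shipped with Ceph: it assumes a Linux /proc filesystem, root access, and that the process name is `radosgw`; the helper names (`parse_max_open_files`, `count_open_fds`, `find_pids`) are made up for illustration.

```python
import os
import sys


def parse_max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' values from /proc/<pid>/limits text.

    Values are returned as strings because either one may be 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # Layout: Max open files <soft> <hard> files
            return fields[3], fields[4]
    return None


def count_open_fds(pid):
    """Count entries in /proc/<pid>/fd (needs root for another user's process)."""
    return len(os.listdir("/proc/%d/fd" % pid))


def find_pids(name):
    """Return PIDs whose /proc/<pid>/comm equals name; empty list without /proc."""
    if not os.path.isdir("/proc"):
        return []
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/comm" % entry) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited while we were scanning
    return pids


if __name__ == "__main__":
    # Process name is an assumption; pass a different one as the first argument.
    name = sys.argv[1] if len(sys.argv) > 1 else "radosgw"
    for pid in find_pids(name):
        with open("/proc/%d/limits" % pid) as f:
            limits = parse_max_open_files(f.read())
        print(pid, "open fds:", count_open_fds(pid), "limit (soft, hard):", limits)
```

If the descriptor count sits at or near the soft limit while the gateway is hung, that would point toward a limit problem rather than the thread settings above; if it is far below, the limit is probably not the cause.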