What OS are you using?

I have a lot more open connections than that. (I do have some other
issues, where rgw sometimes returns 500 errors, but it doesn't stop
responding like yours.)

You might try tuning civetweb's num_threads and 'rgw num rados handles':

rgw frontends = civetweb num_threads=125 error_log_file=/var/log/radosgw/civetweb.error.log access_log_file=/var/log/radosgw/civetweb.access.log
rgw num rados handles = 32

You can also raise civetweb's log level:

debug civetweb = 20
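
Since the question is whether some limit is being hit without being
logged, it may also help to compare radosgw's open descriptor count
with its limit directly from /proc. A rough sketch, assuming a Linux
host; fd_usage is a made-up helper name, not part of Ceph:

```shell
# Hypothetical helper (not a Ceph tool): print the number of open file
# descriptors and the soft "Max open files" limit for a given PID.
fd_usage() {
    pid="$1"
    # Each entry in /proc/<pid>/fd is one open descriptor.
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    # The soft limit is the 4th field of the "Max open files" row.
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "pid=$pid open_fds=$open soft_limit=$limit"
}

# Example (run as root): inspect the oldest running radosgw process.
# fd_usage "$(pgrep -o radosgw)"
```

If open_fds is close to soft_limit, the limit is a likely suspect even
if nothing shows up in the radosgw log.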

-Ben

On Wed, Mar 16, 2016 at 5:03 PM, seapasu...@uchicago.edu <
seapasu...@uchicago.edu> wrote:

> I have a cluster of around 630 OSDs with 3 dedicated monitors and 2
> dedicated gateways. The entire cluster is running hammer (0.94.5
> (9764da52395923e0b32908d83a9f7304401fee43)).
>
> Both of my gateways have stopped responding to curl right now:
> root@host:~# timeout 5 curl localhost ; echo $?
> 124
>
> From here I checked and it looks like radosgw has over 1 million open
> files:
> root@host:~# grep -i rados whatisopen.files.list | wc -l
> 1151753
>
> And around 750 open connections:
> root@host:~# netstat -planet | grep radosgw | wc -l
> 752
> root@host:~# ss -tnlap | grep rados | wc -l
> 752
>
> I don't think that the backend storage is hanging based on the following
> dump:
>
> root@host:~# ceph daemon /var/run/ceph/ceph-client.rgw.kh11-9.asok objecter_requests | grep -i mtime
>             "mtime": "0.000000",
>             "mtime": "0.000000",
>             "mtime": "0.000000",
>             "mtime": "0.000000",
>             "mtime": "0.000000",
>             "mtime": "0.000000",
>             [...]
>             "mtime": "0.000000",
>
> The radosgw log still shows lots of activity, and so does strace, which
> makes me think this is a config issue or a limit of some kind that is
> not being logged. Which one, I am not sure, as the log doesn't seem to
> show any open file limit being hit and I don't see any big errors in the
> logs.
> (last 500 lines of /var/log/radosgw/client.radosgw.log)
> http://pastebin.com/jmM1GFSA
>
> Perf dump of radosgw
> http://pastebin.com/rjfqkxzE
>
> Radosgw objecter requests:
> http://pastebin.com/skDJiyHb
>
> After restarting the gateway with '/etc/init.d/radosgw restart' the old
> process remains, no error is reported, and then I get connection refused
> via curl or netcat:
> root@kh11-9:~# curl localhost
> curl: (7) Failed to connect to localhost port 80: Connection refused
>
> Once I kill the old radosgw with SIGKILL, the new radosgw instance starts
> automatically and begins responding:
> root@kh11-9:~# curl localhost
> <?xml version="1.0" encoding="UTF-8"?><ListAllMyBucketsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>anonymous</ID><DisplayName></DisplayName></Owner><Buckets></Buckets></ListAllMyB
>
> What is going on here?
>
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>