Interesting: the responses are received, but it seems that they
aren't being handled (hence the pings that follow). There are a few
things that you could look at. First, try connecting to the admin
socket and see whether you get any useful information from there. In
particular, check the in-flight requests and look for other requests
that have not completed. Also see if there's any indication of request
throttling.
Another thing to look at is the seemingly unrelated timeout
messages. These should not happen, and they might indicate that
something is holding you up that shouldn't be. Try searching the log
for the thread id specified in these messages (omit the 0x prefix),
and see what the last thing that thread was doing is.
You could also try turning on 'debug objecter = 20' to see if it
provides more info (it's very verbose though).
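
The thread id search described above can be scripted. What follows is
only a rough sketch, not a Ceph tool: it assumes that each log line
carries the id of the thread that wrote it as a whitespace-separated
token (as in the heartbeat_map lines quoted below), and the function
names and the log path passed on the command line are made up for
illustration.

```python
import re
import sys

# Matches the thread id quoted inside a heartbeat_map timeout line, e.g.
#   ... 'RGWProcess::m_tp thread 0x7f00934bb700' had timed out after 600
TIMEOUT_RE = re.compile(r"thread\s+0x([0-9a-f]+)' had timed out")


def timed_out_threads(lines):
    """Return the set of thread ids (without the 0x prefix) named in timeout messages."""
    ids = set()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            ids.add(match.group(1))
    return ids


def last_activity(lines, thread_id):
    """Return the last line logged by thread_id itself, or None if there is none.

    Assumption: the logging thread's id appears as its own token on the line.
    The timeout messages only mention the id as "0x<id>'", so they do not match.
    """
    last = None
    for line in lines:
        if thread_id in line.split():
            last = line
    return last


if __name__ == "__main__":
    # Usage (hypothetical path): python find_stuck.py /path/to/radosgw.log
    with open(sys.argv[1], errors="replace") as log:
        log_lines = log.read().splitlines()
    for tid in sorted(timed_out_threads(log_lines)):
        print(tid, "->", last_activity(log_lines, tid))
```

Run it against a log taken with the debug levels turned up; the line
it prints per thread is where to start reading backwards.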

How much load are you putting on the gateway before that happens?
We've seen a similar issue in the past that was related to the fcgi
library that is dynamically linked with the radosgw process (that is,
not the apache mod_fastcgi module). That, however, would only happen
under heavy load, once the fd numbers handled by radosgw surpassed
1024 (a buggy library that was using select() instead of poll()).
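
To check whether you are anywhere near that limit, you can look at the
fd numbers the radosgw process currently holds. This is a rough,
Linux-only sketch that reads /proc/<pid>/fd; the helper names are made
up, and the 1024 figure is the usual FD_SETSIZE on Linux/glibc, which
I'm assuming applies to your build.

```python
import os

# Typical FD_SETSIZE on Linux/glibc (assumption): select() cannot track
# file descriptors numbered at or above this value.
SELECT_FD_LIMIT = 1024


def open_fds(pid, proc_root="/proc"):
    """Return the sorted fd numbers currently open in process `pid` (Linux only)."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    return sorted(int(name) for name in os.listdir(fd_dir) if name.isdigit())


def exceeds_select_limit(fds, limit=SELECT_FD_LIMIT):
    """True if any fd number is too large to fit in a select() fd_set."""
    return any(fd >= limit for fd in fds)


if __name__ == "__main__":
    import sys

    # Usage: python check_fds.py <pid of radosgw>  (needs permission to read /proc/<pid>/fd)
    fds = open_fds(int(sys.argv[1]))
    print("open fds:", len(fds), "highest:", fds[-1] if fds else None)
    print("beyond select() limit:", exceeds_select_limit(fds))
```

If the highest fd stays well below 1024 when the download stalls, this
particular bug is probably not what you are hitting.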

Yehuda

On Fri, Nov 29, 2013 at 7:28 AM, Sebastian <[email protected]> wrote:
> Hi,
>
> thanks for the hint. I tried this again and noticed that the time out message 
> does seem to be unrelated. Here is the log file for a stalling request with 
> debug turned on:
> http://pastebin.com/DcQuc9wP
>
> I cannot really find a real "error" in the log. The download stalls at 
> about 500kb at that point though. Restarting radosgw fixes it for 1 download 
> only, the next one is broken again. But as I said this does not happen for 
> all files.
>
> Sebastian
>
> On 27.11.2013, at 21:53, Yehuda Sadeh wrote:
>
>> On Wed, Nov 27, 2013 at 4:46 AM, Sebastian <[email protected]> wrote:
>>> Hi,
>>>
>>> we have a setup of 4 Servers running ceph and radosgw. We use it as an 
>>> internal S3 service for our files. The Servers run Debian Squeeze with Ceph 
>>> 0.67.4.
>>>
>>> The cluster has been running smoothly for quite a while, but we are 
>>> currently experiencing issues with the radosgw. For some files the HTTP 
>>> Download just stalls at around 500kb.
>>>
>>> The Apache error log just says:
>>> [error] [client ] FastCGI: comm with server "/var/www/s3gw.fcgi" aborted: 
>>> idle timeout (30 sec)
>>> [error] [client ] Handler for fastcgi-script returned invalid result code 1
>>>
>>> radosgw logging:
>>> 7f00bc66a700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 
>>> 0x7f00934bb700' had timed out after 600
>>> 7f00bc66a700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 
>>> 0x7f00ab4eb700' had timed out after 600
>>>
>>> The interesting thing is that the cluster health is fine and only some files 
>>> are not working properly. Most of them just work fine. A restart of radosgw 
>>> fixes the issue. The other ceph logs are also clean.
>>>
>>> Any idea why this happens?
>>>
>>
>> No, but you can turn on 'debug ms = 1' on your gateway ceph.conf, and
>> that might give some better indication.
>>
>> Yehuda
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com