On Sat, Feb 20, 2016 at 2:26 AM, <aaa...@gmail.com> wrote:

> Hi All.
>
> We're running a 180-node cluster in docker containers -- the official
> ceph:hammer image.
> Recently, we've found a rarely reproducible problem on it: sometimes
> data transfer freezes for a significant time (5-15 minutes). The issue
> shows up while using radosgw & librados apps (docker-distribution).
> The problem can be worked around by decreasing the "ms tcp read
> timeout" parameter to 2-3 seconds on the client side, but that does
> not seem to be a good solution.
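> For reference, the workaround amounts to something like this in the
> clients' ceph.conf (the default for this timeout is 900 seconds):
>
> [client]
> ms tcp read timeout = 2
>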
> To reproduce the problem, I've written a bash script that reads every
> object (and its omap/xattrs) from the data pool with the 'rados' CLI
> utility in an infinite loop. Running it on 3 hosts simultaneously
> against docker-distribution's pool (4 MB objects) for 8 hours resulted
> in 25 reads, each of which took more than 60 seconds.
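> Roughly, the loop looks like this (a simplified sketch, not the exact
> script):
>
> #!/bin/bash
> # Endlessly read every object (plus its xattrs and omap) from the pool
> # and report any read that takes longer than 60 seconds.
> POOL=distribution
> while true; do
>   rados -p "$POOL" ls | while read -r obj; do
>     start=$(date +%s)
>     rados -p "$POOL" get "$obj" /dev/null
>     rados -p "$POOL" listxattr "$obj" > /dev/null
>     rados -p "$POOL" listomapvals "$obj" > /dev/null
>     elapsed=$(( $(date +%s) - start ))
>     if [ "$elapsed" -gt 60 ]; then
>       echo "$(date +%H:%M:%S) consumed $elapsed seconds reading $obj"
>     fi
>   done
> done
>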
> Script results here (hostnames substituted):
>
> https://gist.github.com/aaaler/cb190c1eb636564519a5#file-distribution-pool-err-sorted
> But there's nothing suspicious in the corresponding OSD logs.
> For example, take a look at one of these faulty reads:
>  21:44:32 consumed 1891 seconds reading
> blob:daa46e8d-170e-43ab-8c00-526782f95e02-0 on host1(192.168.1.133)
> osdmap e80485 pool 'distribution' (17) object
> 'blob:daa46e8d-170e-43ab-8c00-526782f95e02-0' -> pg 17.97f485f (17.5f)
> -> up ([139,149,167], p139) acting ([139,149,167], p139)
>
> So we've got 1891 seconds of waiting, after which the client simply
> proceeded without any error. I tried to find something useful in the
> osd.139 logs
> (https://gist.github.com/aaaler/cb190c1eb636564519a5#file-osd-139-log),
> but could not find anything interesting.
>
> Another example (the next line in the script output) showed 2983
> seconds spent reading blob:f5c22093-6e6d-41a6-be36-462330b36c67-71
> from osd.56. Again, there is nothing in the osd.56 logs during that
> time:
> https://gist.github.com/aaaler/cb190c1eb636564519a5#file-osd-56-log
>
> How can I troubleshoot this? Cranking up logging across a 180-node
> cluster would generate a lot of traffic and make it hard to find the
> right host whose logs to check :(
>

So you don't see any slow requests with ceph -s or in
/var/log/ceph/ceph.log on the monitor side?
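Something like this on a monitor host should show them if they are being
reported:

ceph -s                      # health will mention "requests are blocked"
ceph health detail           # shows which OSDs the blocked ops are on
grep 'slow request' /var/log/ceph/ceph.log | tail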

You mentioned that "ms tcp read timeout" has an effect on your case, so I
guess we may have a problem with the poll call.

The only potentially related PR is https://github.com/ceph/ceph/pull/6971
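If turning up logging cluster-wide is too much, you could raise messenger
debugging on just the OSDs your script points at (e.g. osd.139 above)
while reproducing, something like:

ceph tell osd.139 injectargs '--debug-ms 5'
# ... reproduce the stuck read, then turn it back down:
ceph tell osd.139 injectargs '--debug-ms 0'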


> A few words about the underlying configuration:
> - ceph:hammer containers in docker 1.9.1 (--net=host)
> - gentoo with 3.14.18/3.18.10 kernels
> - 1 Gbps LAN
> - OSDs use a directory in /var
> - hosts share the OSD workload with some php-fpm workers
>
> The configuration is pretty much default, except for some OSD
> parameters meant to reduce the impact of scrubbing and recovery:
> [osd]
> osd disk thread ioprio class = idle
> osd disk thread ioprio priority = 5
> osd recovery max active = 1
> osd max backfills = 2
>
> --
>  Sincerely, Alexey Griazin
>



-- 

Best Regards,

Wheat
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
