What does "ceph status" say? I had a problem with similar symptoms some
months ago that was accompanied by OSDs getting marked out for no apparent
reason and the cluster going into a HEALTH_WARN state intermittently.
Ultimately the root of the problem turned out to be a faulty NIC. Once I
took that out of the picture, everything started flying right.
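
For what it's worth, roughly what I would look at first (just a sketch, not
a recipe -- substitute your own interface name, and "ceph osd perf" assumes
your release supports it):

  # overall cluster state, plus detail on any warnings
  # (flapping OSDs, slow/blocked requests, etc.)
  ceph status
  ceph health detail

  # per-OSD commit/apply latency; one OSD far worse than the rest
  # usually points at a bad disk, controller, or network path
  ceph osd perf

  # since a flaky NIC was my culprit: check error/drop counters on each
  # node ("eth0" here is only a placeholder for your 10GbE interface)
  ethtool -S eth0 | grep -iE 'err|drop'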

QH

On Fri, Jul 17, 2015 at 8:21 AM, Mark Nelson <mnel...@redhat.com> wrote:

> On 07/17/2015 08:38 AM, J David wrote:
>
>> This is the same cluster I posted about back in April.  Since then,
>> the situation has gotten significantly worse.
>>
>> Here is what iostat looks like for the one active RBD image on this
>> cluster:
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s  avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> vdb               0.00     0.00   14.10    0.00   685.65     0.00     97.26     3.43  299.40  299.40    0.00  70.92 100.00
>> vdb               0.00     0.00    1.10    0.00   140.80     0.00    256.00     3.00 2753.09 2753.09    0.00 909.09 100.00
>> vdb               0.00     0.00   17.40    0.00  2227.20     0.00    256.00     3.00  178.78  178.78    0.00  57.47 100.00
>> vdb               0.00     0.00    1.30    0.00   166.40     0.00    256.00     3.00 2256.62 2256.62    0.00 769.23 100.00
>> vdb               0.00     0.00    8.20    0.00  1049.60     0.00    256.00     3.00  362.10  362.10    0.00 121.95 100.00
>> vdb               0.00     0.00    1.10    0.00   140.80     0.00    256.00     3.00 2517.45 2517.45    0.00 909.45 100.04
>> vdb               0.00     0.00    1.10    0.00   140.66     0.00    256.00     3.00 2863.64 2863.64    0.00 909.09  99.90
>> vdb               0.00     0.00    0.70    0.00    89.60     0.00    256.00     3.00 3898.86 3898.86    0.00 1428.57 100.00
>> vdb               0.00     0.00    0.60    0.00    76.80     0.00    256.00     3.00 5093.33 5093.33    0.00 1666.67 100.00
>> vdb               0.00     0.00    1.20    0.00   153.60     0.00    256.00     3.00 2568.33 2568.33    0.00 833.33 100.00
>> vdb               0.00     0.00    1.30    0.00   166.40     0.00    256.00     3.00 2457.85 2457.85    0.00 769.23 100.00
>> vdb               0.00     0.00   13.90    0.00  1779.20     0.00    256.00     3.00  220.95  220.95    0.00  71.94 100.00
>> vdb               0.00     0.00    1.00    0.00   128.00     0.00    256.00     3.00 2250.40 2250.40    0.00 1000.00 100.00
>> vdb               0.00     0.00    1.30    0.00   166.40     0.00    256.00     3.00 2798.77 2798.77    0.00 769.23 100.00
>> vdb               0.00     0.00    0.90    0.00   115.20     0.00    256.00     3.00 3304.00 3304.00    0.00 1111.11 100.00
>> vdb               0.00     0.00    0.90    0.00   115.20     0.00    256.00     3.00 3425.33 3425.33    0.00 1111.11 100.00
>> vdb               0.00     0.00    1.30    0.00   166.40     0.00    256.00     3.00 2290.77 2290.77    0.00 769.23 100.00
>> vdb               0.00     0.00    4.30    0.00   550.40     0.00    256.00     3.00  721.30  721.30    0.00 232.56 100.00
>> vdb               0.00     0.00    1.60    0.00   204.80     0.00    256.00     3.00 1894.75 1894.75    0.00 625.00 100.00
>> vdb               0.00     0.00    1.20    0.00   153.60     0.00    256.00     3.00 2375.00 2375.00    0.00 833.33 100.00
>> vdb               0.00     0.00    0.90    0.00   115.20     0.00    256.00     3.00 3036.44 3036.44    0.00 1111.11 100.00
>> vdb               0.00     0.00    1.10    0.00   140.80     0.00    256.00     3.00 3086.18 3086.18    0.00 909.09 100.00
>> vdb               0.00     0.00    0.90    0.00   115.20     0.00    256.00     3.00 2480.44 2480.44    0.00 1111.11 100.00
>> vdb               0.00     0.00    1.20    0.00   153.60     0.00    256.00     3.00 3124.33 3124.33    0.00 833.67 100.04
>> vdb               0.00     0.00    0.80    0.00   102.40     0.00    256.00     3.00 3228.00 3228.00    0.00 1250.00 100.00
>> vdb               0.00     0.00    1.20    0.00   153.60     0.00    256.00     3.00 2439.33 2439.33    0.00 833.33 100.00
>> vdb               0.00     0.00    1.30    0.00   166.40     0.00    256.00     3.00 2567.08 2567.08    0.00 769.23 100.00
>> vdb               0.00     0.00    0.80    0.00   102.40     0.00    256.00     3.00 3023.00 3023.00    0.00 1250.00 100.00
>> vdb               0.00     0.00    4.80    0.00   614.40     0.00    256.00     3.00  712.50  712.50    0.00 208.33 100.00
>> vdb               0.00     0.00    1.30    0.00   118.75     0.00    182.69     3.00 2003.69 2003.69    0.00 769.23 100.00
>> vdb               0.00     0.00   10.50    0.00  1344.00     0.00    256.00     3.00  344.46  344.46    0.00  95.24 100.00
>>
>> So: between 0 and 15 reads per second, no write activity, a constant
>> queue depth of 3+, wait times measured in seconds, and 100% I/O
>> utilization, all for read throughput of 100-200KB/sec.  Even trivial
>> writes can hang for 15-60 seconds before completing.
>>
>> Sometimes this behavior will "go away" for a while and performance
>> returns to what we saw in April: 50 IOPS (read or write) and 5-20MB/sec
>> of I/O throughput.  But the stalls always come back.
>>
>> The hardware of the ceph cluster is:
>> - Three ceph nodes
>> - Two of the ceph nodes have 64GiB RAM and 12 x 5TB SATA drives
>> - One of the ceph nodes has 32GiB RAM and 4 x 5TB SATA drives
>> - All ceph nodes have Intel E5-2609 v2 (2.50GHz quad-core) CPUs
>> - Everything is 10GBase-T
>> - All three nodes running Ceph 0.80.9
>>
>> The ceph hardware is all borderline idle.  The CPU is 3-5% utilized
>> and iostat reports the individual disks hover around 4-7% utilized at
>> any given time.  It does appear to be using most of the available RAM
>> for OSD caching.
>>
>> The client is a KVM virtual machine running on a server by itself.
>> Inside the virtual machine, the CPU shows 100% utilization in iowait.
>> On the virtualization host itself, everything is reported as idle
>> (99.1% idle).
>>
>> Something is *definitely* wrong.  Does anyone have any idea what it might
>> be?
>>
>> Thanks for any help with this!
>>
>
> Hi J David,
>
> Forgive me if you covered this in April, but have you tried rados bench
> from the hypervisor (or another client node)?
>
> Something like:
>
> rados -p <pool> bench 30 write
>
> just to see how it handles 4MB object writes.  You can play around with
> the -t and -b parameters to try different object workloads.  If rados bench
> is also terribly slow, then you might want to start looking for evidence of
> IO getting hung up on a specific disk or node.
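>
> For example, something along these lines (just a sketch -- substitute
> your own pool name, and --no-cleanup assumes your rados build supports
> it, since it is only needed to follow the write test with a read test):
>
>   # 30 seconds of 4MB object writes, 16 in flight, keeping the objects
>   # around so they can be read back afterwards
>   rados -p <pool> bench 30 write -t 16 --no-cleanup
>
>   # sequential reads of the objects written above
>   rados -p <pool> bench 30 seq -t 16
>
>   # smaller objects, e.g. 64KB, to approximate small-block traffic
>   rados -p <pool> bench 30 write -t 16 -b 65536
>
> If writes look fine but the seq reads crawl (or vice versa), that also
> helps narrow down where to look.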
>
> Mark
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
