What does "ceph status" say? I had a problem with similar symptoms some months ago; it was accompanied by OSDs getting marked out for no apparent reason and the cluster intermittently going into HEALTH_WARN. Ultimately the root of the problem turned out to be a faulty NIC. Once I took that out of the picture, everything started flying right.
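In case it's useful, these are roughly the commands I'd start with. "eth0" is only a stand-in for whatever interface actually carries your cluster/public network:

  # Cluster health, and whether any OSDs are down/out or flapping:
  ceph status
  ceph health detail
  ceph osd tree

  # On each OSD host, look for NIC-level errors or drops that would
  # point at a bad card, cable, or switch port:
  ip -s link show eth0
  ethtool -S eth0 | grep -iE 'err|drop|crc'

If the error counters are climbing on one host while the others stay clean, that host's networking is the first place I'd dig.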
QH

On Fri, Jul 17, 2015 at 8:21 AM, Mark Nelson <mnel...@redhat.com> wrote:
> On 07/17/2015 08:38 AM, J David wrote:
>>
>> This is the same cluster I posted about back in April. Since then,
>> the situation has gotten significantly worse.
>>
>> Here is what iostat looks like for the one active RBD image on this
>> cluster:
>>
>> Device:  rrqm/s  wrqm/s    r/s   w/s    rkB/s  wkB/s  avgrq-sz  avgqu-sz    await  r_await  w_await    svctm  %util
>> vdb        0.00    0.00  14.10  0.00   685.65   0.00     97.26      3.43   299.40   299.40     0.00    70.92 100.00
>> vdb        0.00    0.00   1.10  0.00   140.80   0.00    256.00      3.00  2753.09  2753.09     0.00   909.09 100.00
>> vdb        0.00    0.00  17.40  0.00  2227.20   0.00    256.00      3.00   178.78   178.78     0.00    57.47 100.00
>> vdb        0.00    0.00   1.30  0.00   166.40   0.00    256.00      3.00  2256.62  2256.62     0.00   769.23 100.00
>> vdb        0.00    0.00   8.20  0.00  1049.60   0.00    256.00      3.00   362.10   362.10     0.00   121.95 100.00
>> vdb        0.00    0.00   1.10  0.00   140.80   0.00    256.00      3.00  2517.45  2517.45     0.00   909.45 100.04
>> vdb        0.00    0.00   1.10  0.00   140.66   0.00    256.00      3.00  2863.64  2863.64     0.00   909.09  99.90
>> vdb        0.00    0.00   0.70  0.00    89.60   0.00    256.00      3.00  3898.86  3898.86     0.00  1428.57 100.00
>> vdb        0.00    0.00   0.60  0.00    76.80   0.00    256.00      3.00  5093.33  5093.33     0.00  1666.67 100.00
>> vdb        0.00    0.00   1.20  0.00   153.60   0.00    256.00      3.00  2568.33  2568.33     0.00   833.33 100.00
>> vdb        0.00    0.00   1.30  0.00   166.40   0.00    256.00      3.00  2457.85  2457.85     0.00   769.23 100.00
>> vdb        0.00    0.00  13.90  0.00  1779.20   0.00    256.00      3.00   220.95   220.95     0.00    71.94 100.00
>> vdb        0.00    0.00   1.00  0.00   128.00   0.00    256.00      3.00  2250.40  2250.40     0.00  1000.00 100.00
>> vdb        0.00    0.00   1.30  0.00   166.40   0.00    256.00      3.00  2798.77  2798.77     0.00   769.23 100.00
>> vdb        0.00    0.00   0.90  0.00   115.20   0.00    256.00      3.00  3304.00  3304.00     0.00  1111.11 100.00
>> vdb        0.00    0.00   0.90  0.00   115.20   0.00    256.00      3.00  3425.33  3425.33     0.00  1111.11 100.00
>> vdb        0.00    0.00   1.30  0.00   166.40   0.00    256.00      3.00  2290.77  2290.77     0.00   769.23 100.00
>> vdb        0.00    0.00   4.30  0.00   550.40   0.00    256.00      3.00   721.30   721.30     0.00   232.56 100.00
>> vdb        0.00    0.00   1.60  0.00   204.80   0.00    256.00      3.00  1894.75  1894.75     0.00   625.00 100.00
>> vdb        0.00    0.00   1.20  0.00   153.60   0.00    256.00      3.00  2375.00  2375.00     0.00   833.33 100.00
>> vdb        0.00    0.00   0.90  0.00   115.20   0.00    256.00      3.00  3036.44  3036.44     0.00  1111.11 100.00
>> vdb        0.00    0.00   1.10  0.00   140.80   0.00    256.00      3.00  3086.18  3086.18     0.00   909.09 100.00
>> vdb        0.00    0.00   0.90  0.00   115.20   0.00    256.00      3.00  2480.44  2480.44     0.00  1111.11 100.00
>> vdb        0.00    0.00   1.20  0.00   153.60   0.00    256.00      3.00  3124.33  3124.33     0.00   833.67 100.04
>> vdb        0.00    0.00   0.80  0.00   102.40   0.00    256.00      3.00  3228.00  3228.00     0.00  1250.00 100.00
>> vdb        0.00    0.00   1.20  0.00   153.60   0.00    256.00      3.00  2439.33  2439.33     0.00   833.33 100.00
>> vdb        0.00    0.00   1.30  0.00   166.40   0.00    256.00      3.00  2567.08  2567.08     0.00   769.23 100.00
>> vdb        0.00    0.00   0.80  0.00   102.40   0.00    256.00      3.00  3023.00  3023.00     0.00  1250.00 100.00
>> vdb        0.00    0.00   4.80  0.00   614.40   0.00    256.00      3.00   712.50   712.50     0.00   208.33 100.00
>> vdb        0.00    0.00   1.30  0.00   118.75   0.00    182.69      3.00  2003.69  2003.69     0.00   769.23 100.00
>> vdb        0.00    0.00  10.50  0.00  1344.00   0.00    256.00      3.00   344.46   344.46     0.00    95.24 100.00
>>
>> So, between 0 and 15 reads per second, no write activity, a constant
>> queue depth of 3+, wait times in seconds, and 100% I/O utilization,
>> all for read performance of 100-200K/sec. Even trivial writes can
>> hang for 15-60 seconds before completing.
>>
>> Sometimes this behavior will "go away" for a while and it will go back
>> to what we saw in April: 50 IOPS (read or write) and 5-20 MB/sec of I/O
>> throughput. But it always comes back.
>>
>> The hardware of the ceph cluster is:
>> - Three ceph nodes
>> - Two of the ceph nodes have 64GiB RAM and 12 5TB SATA drives
>> - One of the ceph nodes has 32GiB RAM and 4 5TB SATA drives
>> - All ceph nodes have Intel E5-2609 v2 (2.50GHz quad-core) CPUs
>> - Everything is 10GBase-T
>> - All three nodes running Ceph 0.80.9
>>
>> The ceph hardware is all borderline idle. The CPU is 3-5% utilized,
>> and iostat reports the individual disks hover around 4-7% utilized at
>> any given time. It does appear to be using most of the available RAM
>> for OSD caching.
>>
>> The client is a KVM virtual machine running on a server by itself.
>> Inside the virtual machine it reports 100% CPU utilization by iowait.
>> Outside the virtual machine, the host reports everything is idle
>> (99.1% idle).
>>
>> Something is *definitely* wrong. Does anyone have any idea what it
>> might be?
>>
>> Thanks for any help with this!
>
> Hi J David,
>
> Forgive me if you covered this in April, but have you tried rados bench
> from the hypervisor (or another client node)?
>
> Something like:
>
> rados bench -p <pool> 30 write
>
> just to see how it handles 4MB object writes. You can play around with
> the -t and -b parameters to try different object workloads. If rados
> bench is also terribly slow, then you might want to start looking for
> evidence of IO getting hung up on a specific disk or node.
>
> Mark
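To make Mark's suggestion concrete, this is roughly the sequence I'd run from the hypervisor. The pool name "rbd" is only a placeholder, the -t/-b values are just examples, and the seq test reads back objects left behind by a write run with --no-cleanup:

  # Default workload: 4MB object writes for 30 seconds, keeping the objects:
  rados bench -p rbd 30 write --no-cleanup
  # Sequential reads of the objects written above:
  rados bench -p rbd 30 seq
  # Smaller objects with more concurrency, closer to small-block RBD traffic:
  rados bench -p rbd 30 write -t 32 -b 65536
  # Remove the leftover benchmark objects when done (if your rados build has the subcommand):
  rados -p rbd cleanup

If the cluster-side numbers look healthy while the VM still crawls, that points back at the client/virtio path rather than the OSDs.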
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com