On Aug 4, 2014, at 10:53 PM, Christian Balzer wrote:
> On Mon, 4 Aug 2014 15:11:39 -0400 Chris Kitzmiller wrote:
>> On Aug 2, 2014, at 12:03 AM, Christian Balzer wrote:
>>> On Fri, 1 Aug 2014 14:23:28 -0400 Chris Kitzmiller wrote:
>>>> I have 3 nodes each running a MON and 30 OSDs. 
>>>> ...
>>>> When I test my cluster
>>>> with either rados bench or with fio via a 10GbE client using RBD I get
>>>> great initial speeds >900MBps and I max out my 10GbE links for a
>>>> while. Then something goes wrong: performance falters and the
>>>> cluster stops responding altogether. I'll see a monitor call for a
>>>> new election, and then my OSDs mark each other down and complain
>>>> that they've been wrongly marked down, and I get slow request
>>>> warnings of 30 and >60 seconds. This eventually resolves itself and
>>>> the cluster recovers, but then it recurs right away. Sometimes, via
>>>> fio, I'll get an I/O error and it will bail.

This appears to still be happening. :(

Following your advice, Christian, I monitored my cluster with atop and found 
that I did have one HDD pegged at 100% while the rest of the cluster sat at 0% 
utilization. I replaced that disk and set the cluster back up again. I wrote 
~20T of data into a 3x replicated pool and that went very smoothly. My speeds 
did decrease from ~600MBps down to ~230MBps over the course of that write, but 
I was still getting steady, responsive writes.
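
In case it helps the next person chasing one of these, the same check can be 
scripted instead of eyeballed in atop; a minimal sketch, assuming sysstat's 
iostat and sdX device names:

    # Print any disk whose %util (the last column) crosses 90, sampling
    # once per second; note the first report is the since-boot average.
    iostat -x 1 | awk '/^sd/ && $NF+0 > 90 { print $1, "%util:", $NF }'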

Today, I'm seeing the problem recur. The trouble is that this time no drive is 
pegged at 100%; in fact, every drive sits at 0% utilization during these 
incidents. `ceph osd perf` doesn't seem to have any useful information in it. 
dump_historic_ops has what looks like interesting information, but I'm lost 
when it comes to interpreting its output (e.g. 
http://pastebin.com/raw.php?i=4KHFuyGi ; see the jq sketch at the end of this 
mail). So right now I have two main questions:

1) How do I figure out what is going on? What explains the periods of no 
activity seen here http://pastebin.com/raw.php?i=Mv2y3Tka if not a slow OSD 
drive like before? (See the capture loop sketched at the end of this mail.)

2) Why does fio exit with IO errors like these?

fio: io_u error on file /mnt/image1/temp.58.fio: Input/output error
     write offset=79754690560, buflen=4194304
fio: io_u error on file /mnt/image1/temp.69.fio: Input/output error
     write offset=67515711488, buflen=4194304
fio: io_u error on file /mnt/image1/temp.71.fio: Input/output error
     write offset=38646317056, buflen=4194304
fio: io_u error on file /mnt/image1/temp.68.fio: Input/output error
     write offset=103263764480, buflen=4194304
fio: pid=10972, err=5/file:io_u.c:1373, func=io_u error, error=Input/output error

4m-randwrite: (groupid=0, jobs=1): err= 5 (file:io_u.c:1373, func=io_u error, error=Input/output error): pid=10972: Fri Aug  8 11:01:48 2014
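
For 1), my current plan is to leave something like the following running on 
each node so the next stall leaves a trace (a sketch; osd.0 here stands in 
for each local OSD):

    # Once a second, record cluster health plus any ops currently stuck
    # in flight, so the quiet periods can be lined up against OSD state.
    while sleep 1; do
        date
        ceph health detail
        ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
    done >> /tmp/ceph-stall-trace.log 2>&1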
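
For 2), as far as I can tell err=5 is plain EIO coming back from the kernel 
rather than anything fio generates itself, so the kernel log should have the 
real story (libceph socket errors, or the filesystem on the RBD aborting and 
remounting read-only after a stalled write). Next time it bails I'll grab:

    # EIO reaches fio from the kernel; look for libceph/rbd trouble or a
    # filesystem abort around the time of the error.
    dmesg | egrep -i 'libceph|rbd|ext4|xfs|error' | tail -50
    mount | grep /mnt/image1   # a forced read-only remount shows "ro" here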
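
And the jq sketch promised above, for slicing dump_historic_ops: each op 
carries a timestamped event list (initiated, reached_pg, waiting for subops, 
commit_queued_for_journal_write, op_commit, op_applied, ...), and the gap 
between consecutive events is where the time went; a long gap around the 
sub_op events points at a slow replica rather than the primary. To rank ops 
by duration (the key names are taken from my paste and may differ between 
releases):

    # Pull historic ops from one OSD's admin socket and print the
    # slowest first; adjust "Ops"/"duration"/"description" to match
    # whatever your release emits.
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops \
      | jq -r '.Ops | sort_by(.duration | tonumber) | reverse
               | .[] | "\(.duration)s  \(.description)"'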
