On Aug 4, 2014, at 10:53 PM, Christian Balzer wrote:
> On Mon, 4 Aug 2014 15:11:39 -0400 Chris Kitzmiller wrote:
>> On Aug 2, 2014, at 12:03 AM, Christian Balzer wrote:
>>> On Fri, 1 Aug 2014 14:23:28 -0400 Chris Kitzmiller wrote:
>>>> I have 3 nodes each running a MON and 30 OSDs.
>>>> ...
>>>> When I test my cluster with either rados bench or with fio via a
>>>> 10GbE client using RBD I get great initial speeds (>900MBps) and I
>>>> max out my 10GbE links for a while. Then something goes wrong: the
>>>> performance falters and the cluster stops responding altogether.
>>>> I'll see a monitor call for a new election and then my OSDs mark
>>>> each other down, they complain that they've been wrongly marked
>>>> down, and I get slow request warnings of >30 and >60 seconds. This
>>>> eventually resolves itself and the cluster recovers, but then it
>>>> recurs right away. Sometimes, via fio, I'll get an I/O error and it
>>>> will bail.
This appears to still be happening. :( Following your advice, Christian, I monitored my cluster with atop and found that I did have one HDD pegged at 100% utilization while the rest of the cluster sat at 0%. I replaced that disk and set my cluster back up again. I wrote ~20T of data into a 3x pool and that went very smoothly. My speeds did decrease from ~600MBps down to ~230MBps over the course of that write, but I was still getting steady, responsive writes.

Today I'm seeing the problem recur. The trouble is that this time I don't have any drive at 100% like before; in fact, every drive sits at 0% utilization during these incidents. `ceph osd perf` doesn't seem to have any useful information in it. dump_historic_ops has what looks like interesting information, but I'm lost when it comes to interpreting its output (e.g. http://pastebin.com/raw.php?i=4KHFuyGi ).

So right now I have two main questions:

1) How do I figure out what is going on? What explains the periods of no activity seen here http://pastebin.com/raw.php?i=Mv2y3Tka if not a slow OSD drive like before?

2) Why does fio exit with I/O errors like these?

fio: io_u error on file /mnt/image1/temp.58.fio: Input/output error: write offset=79754690560, buflen=4194304
fio: io_u error on file /mnt/image1/temp.69.fio: Input/output error: write offset=67515711488, buflen=4194304
fio: io_u error on file /mnt/image1/temp.71.fio: Input/output error: write offset=38646317056, buflen=4194304
fio: io_u error on file /mnt/image1/temp.68.fio: Input/output error: write offset=103263764480, buflen=4194304
fio: pid=10972, err=5/file:io_u.c:1373, func=io_u error, error=Input/output error
4m-randwrite: (groupid=0, jobs=1): err= 5 (file:io_u.c:1373, func=io_u error, error=Input/output error): pid=10972: Fri Aug 8 11:01:48 2014
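For context, the fio job producing those errors would have looked roughly like this. This is a reconstruction from the output above, not the exact file: the 4 MiB block size (buflen=4194304), the job name, and the /mnt/image1/temp.N.fio filenames match the errors, but the size, numjobs, iodepth, and file count are placeholders.

```ini
; 4m-randwrite -- approximate fio job reconstructed from the error output.
; bs=4m matches buflen=4194304; filename_format matches temp.N.fio.
; size, nrfiles, and iodepth below are guesses, not the original values.
[global]
directory=/mnt/image1
ioengine=libaio
direct=1
rw=randwrite
bs=4m
iodepth=4

[4m-randwrite]
nrfiles=100
filename_format=temp.$filenum.fio
size=200g
```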
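In case it helps anyone else stare at dump_historic_ops output, a short script along these lines at least ranks the ops by duration so the slow ones stand out. This is a sketch under assumptions: it expects a JSON object with an "ops" (or older "Ops") list whose entries carry "duration" (seconds) and "description" fields; key names vary across Ceph releases, so check them against your own dump first.

```python
#!/usr/bin/env python
"""Rank the ops in a `ceph daemon osd.N dump_historic_ops` JSON dump
by duration, slowest first.

Assumptions (verify against your Ceph version's output): the dump is a
JSON object with an "ops" (or "Ops") list, and each op has "duration"
and "description" fields.
"""
import json
import sys


def slowest_ops(dump, limit=5):
    """Return the `limit` slowest ops as (duration, description) pairs."""
    ops = dump.get("ops") or dump.get("Ops") or []
    ranked = sorted(
        ((float(op.get("duration", 0)), op.get("description", "?"))
         for op in ops),
        reverse=True,
    )
    return ranked[:limit]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: ceph daemon osd.12 dump_historic_ops > historic.json
    #        python rank_ops.py historic.json
    with open(sys.argv[1]) as f:
        dump = json.load(f)
    for duration, desc in slowest_ops(dump):
        # Truncate long op descriptions so one op fits on one line.
        print("%8.3fs  %s" % (duration, desc[:80]))
```

Running it over the pastebin dump above should make it obvious whether a handful of ops (and which OSDs they touched) account for the stalls.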
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com