It's really bizarre, since we can easily pump ~1GB/s into the cluster with rados bench from a single 10Gig-E client. We only observe this with kernel CephFS on that host -- which is why our original theory was something like this:

- the client caches ~4GB of writes
- the client then starts opening many IOs in parallel to flush that cache
- each individual 4MB write takes longer than 30s to travel from the client to the OSD, because they all share the 1Gig-E network interface on the client.
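Very roughly, the arithmetic supports that theory -- assuming ~110 MB/s of usable bandwidth on the client's 1Gig-E link (an assumption, not a measurement) and the kernel client queueing the 4MB ops more or less all at once:

    # Back-of-envelope check (Python). The 110 MB/s figure and the idea that
    # all ops are queued at roughly the same time are assumptions.
    dirty_bytes = 4 * 1024**3          # ~4GB of buffered CephFS writes
    link_bytes_per_s = 110 * 1024**2   # usable 1Gig-E throughput, assumed

    flush_seconds = dirty_bytes / link_bytes_per_s
    print("draining the whole cache takes ~%.0fs" % flush_seconds)   # ~37s

    # If the 4MB ops all reach the OSDs near the start of the flush and then
    # share the 1Gig-E link, the last ones only complete when the whole flush
    # does, i.e. well past the 30s slow-request threshold.

An 8GB flush at that rate would take ~75s, which is in the same ballpark as the 78s conv=fdatasync run quoted below.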
But with that theory we're assuming quite a lot about the implementations of librados and the OSD. Still, something like this would also explain why only the CephFS writes become slow -- the 2kHz of other (mostly RBD) IOs are not affected by this "overload".

Cheers, Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --

On Tue, Feb 25, 2014 at 7:25 AM, Gregory Farnum <g...@inktank.com> wrote:
> I'm with Zheng on this one. I'm a little confused though, because I
> thought this was a pretty large cluster that should be able to absorb
> that much data pretty easily. But if you're using a custom striping
> strategy and pushing it all through one OSD, that could do it. Or
> anything else with that sort of outcome, because obviously you've got
> OSDs that are simply getting overloaded by the traffic pattern.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Fri, Feb 21, 2014 at 4:06 PM, Yan, Zheng <uker...@gmail.com> wrote:
> > On Sat, Feb 22, 2014 at 12:04 AM, Dan van der Ster
> > <daniel.vanders...@cern.ch> wrote:
> >> Hi Greg,
> >> Yes, this still happens after the updatedb fix.
> >>
> >> [root@xxx dan]# mount
> >> ...
> >> zzz:6789:/ on /mnt/ceph type ceph (name=cephfs,key=client.cephfs)
> >>
> >> [root@xxx dan]# pwd
> >> /mnt/ceph/dan
> >>
> >> [root@xxx dan]# dd if=/dev/zero of=yyy bs=4M count=2000
> >> 2000+0 records in
> >> 2000+0 records out
> >> 8388608000 bytes (8.4 GB) copied, 9.21217 s, 911 MB/s
> >>
> >> Then 30s later:
> >>
> >> 2014-02-21 16:16:11.315110 osd.326 x:6836/31929 683 : [WRN] 1 slow requests,
> >> 1 included below; oldest blocked for > 32.432401 secs
> >> 2014-02-21 16:16:11.315317 osd.326 x:6836/31929 684 : [WRN] slow request
> >> 32.432401 seconds old, received at 2014-02-21 16:15:38.882584:
> >> osd_op(client.16735018.1:22522476 100000352bf.000002a4 [write 0~4194304
> >> [8@0],startsync 0~0] 0.5447d769 snapc 1=[] e42655) v4 currently waiting for
> >> subops from [357,191]
> >>
> >> And no slow requests for other active clients.
> >>
> >> Reminder: this is a 1GigE client, 64GB RAM, kernel 3.13.0-1.el6.elrepo.x86_64,
> >> kernel-mounted CephFS. I can't reproduce this on a 1GigE client with only
> >> 8GB RAM, 3.11.0-15-generic and 3.13.4-031304-generic. (The smaller-RAM
> >> client writes at 110-120MB/s vs the 900MB/s seen on the big-RAM machine --
> >> obviously the writes are all buffered on the big-RAM machine.) Maybe the
> >> RAM isn't related, though, as with fdatasync mode we still see the
> >> slow requests:
> >>
> >> [root@xxx dan]# dd if=/dev/zero of=yyy bs=4M count=2000 conv=fdatasync
> >> 2000+0 records in
> >> 2000+0 records out
> >> 8388608000 bytes (8.4 GB) copied, 78.26 s, 107 MB/s
> >
> > It's likely this issue is related to the big RAM. Big RAM allows the kernel
> > to cache a large amount of dirty data, and the kernel therefore creates
> > lots of OSD requests when flushing it.
> > (conv=fdatasync doesn't help here because dd calls fdatasync only after
> > all the buffered writes have finished.)
> >
> > Regards
> > Yan, Zheng
> >
> >>
> >> 2014-02-21 16:26:15.202047 osd.818 x:6803/128164 1219 : [WRN] 1 slow
> >> requests, 1 included below; oldest blocked for > 30.446683 secs
> >> 2014-02-21 16:26:15.202194 osd.818 x:6803/128164 1220 : [WRN] slow request
> >> 30.446683 seconds old, received at 2014-02-21 16:25:44.754914:
> >> osd_op(client.16735018.1:22524842 100000352bf.00000355 [write 0~4194304
> >> [12@0],startsync 0~0] 0.c36d4557 snapc 1=[] e42655) v4 currently waiting for
> >> subops from [558,827]
> >>
> >> Cheers, Dan
> >>
> >> -- Dan van der Ster || Data & Storage Services || CERN IT Department --
> >>
> >> On Thu, Feb 20, 2014 at 4:02 PM, Gregory Farnum <g...@inktank.com> wrote:
> >>>
> >>> Arne,
> >>> Sorry this got dropped -- I had it marked in my mail but didn't have
> >>> the chance to think about it seriously when you sent it. Does this
> >>> still happen after the updatedb config change you guys made recently?
> >>> -Greg
> >>> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >>>
> >>> On Fri, Jan 31, 2014 at 5:52 AM, Arne Wiebalck <arne.wieba...@cern.ch> wrote:
> >>> > Hi,
> >>> >
> >>> > We observe that we can easily create slow requests with a simple dd on
> >>> > CephFS:
> >>> >
> >>> > -->
> >>> > [root@p05153026953834 dd]# dd if=/dev/zero of=xxx bs=4M count=1000
> >>> > 1000+0 records in
> >>> > 1000+0 records out
> >>> > 4194304000 bytes (4.2 GB) copied, 4.27824 s, 980 MB/s
> >>> >
> >>> > ceph -w:
> >>> > 2014-01-31 14:28:44.009543 osd.450 [WRN] 1 slow requests, 1 included below;
> >>> > oldest blocked for > 31.088950 secs
> >>> > 2014-01-31 14:28:44.009676 osd.450 [WRN] slow request 31.088950 seconds old,
> >>> > received at 2014-01-31 14:28:12.920423: osd_op(client.16735018.1:22493091
> >>> > 100000352b3.000002e9 [write 0~4194304,startsync 0~0] 0.518f2eef snapc 1=[]
> >>> > e32400) v4 currently waiting for subops from [87,1190]
> >>> > <---
> >>> >
> >>> > From what we see, the OSDs are not busy, so we suspect that it is the
> >>> > client starting all the requests, but that the requests then take longer
> >>> > than 30 secs to finish writing, i.e. to flush the client-side buffers.
> >>> >
> >>> > Is our understanding correct?
> >>> > Do these slow requests have an impact on requests from other clients,
> >>> > i.e. are some OSD resources consumed by these clients?
> >>> >
> >>> > The setup is:
> >>> > Client: kernel 3.13.0, 1GbE
> >>> > MDS: Emperor 0.72.2
> >>> > OSDs: Dumpling 0.67.5
> >>> >
> >>> > Thanks!
> >>> > Dan & Arne
> >>> >
> >>> > --
> >>> > Arne Wiebalck
> >>> > CERN IT
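One more note on Zheng's conv=fdatasync remark above: as I understand it, dd buffers all 2000 writes and only then issues a single fdatasync, so the burst of dirty data toward the OSDs has already been queued by the time the sync runs. A minimal sketch of the difference (hypothetical Python, hypothetical file path -- not our actual test):

    import os

    CHUNK = 4 * 1024 * 1024          # 4MB, same block size as the dd tests
    COUNT = 2000                     # ~8GB total, as in the dd reproducer

    def dd_conv_fdatasync(path):
        # What 'dd ... conv=fdatasync' effectively does: buffer everything,
        # then a single fdatasync at the very end -- the dirty-data burst
        # has already happened by the time we sync.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
        buf = b"\0" * CHUNK
        for _ in range(COUNT):
            os.write(fd, buf)
        os.fdatasync(fd)             # one sync, after all the writes
        os.close(fd)

    def sync_per_chunk(path):
        # Hypothetical variant: fdatasync after every chunk, which would keep
        # the amount of dirty data (and in-flight OSD ops) small.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
        buf = b"\0" * CHUNK
        for _ in range(COUNT):
            os.write(fd, buf)
            os.fdatasync(fd)
        os.close(fd)

Presumably syncing per chunk like the second variant would smooth out the flush, at the cost of throughput.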