It's really bizarre, since we can easily pump ~1GB/s into the cluster with
rados bench from a single 10Gig-E client. We only observe this with kernel
CephFS on that host -- which is why our original theory was something like this:
   - the client caches 4GB of writes
   - the client starts many IOs in parallel to flush that cache
   - each individual 4MB write takes longer than 30s to get from the client
     to the OSD, due to the 1Gig-E network interface on the client.
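
A quick back-of-the-envelope check of that theory (my own rough numbers,
assuming ~110MB/s of usable bandwidth on the 1Gig-E link):

  # time to drain 4GB (4096MB) of dirty cache over a ~110MB/s link:
  $ echo "scale=1; 4096 / 110" | bc
  37.2

So just draining the cache takes well over 30s, and since all the parallel
4MB writes share that link, any individual one can easily sit in flight for
longer than the 30s complaint window.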

But in that theory we assume quite a lot about the implementations of librados
and the OSD. Still, something like this would also explain why only the CephFS
writes are becoming slow -- the ~2kHz of other (mostly RBD) IOs are not
affected by this "overload".

Cheers, Dan


-- Dan van der Ster || Data & Storage Services || CERN IT Department --


On Tue, Feb 25, 2014 at 7:25 AM, Gregory Farnum <g...@inktank.com> wrote:

> I'm with Zheng on this one. I'm a little confused though, because I
> thought this was a pretty large cluster that should be able to absorb
> that much data pretty easily. But if you're using a custom striping
> strategy and pushing it all through one OSD, that could do it. Or
> anything else with that sort of outcome, because obviously you've got
> OSDs that are simply getting overloaded by the traffic pattern.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
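
(One cheap way to check the "everything through one OSD" idea -- assuming the
CephFS data pool is the default 'data' pool, and taking an object name from one
of the slow request lines below; the second name here is just my guess at the
next object in the sequence:

  $ ceph osd map data 100000352bf.000002a4
  $ ceph osd map data 100000352bf.000002a5

If consecutive objects of the file map to different acting OSD sets, the writes
are fanning out normally and it's not a striping problem.)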
>
>
> On Fri, Feb 21, 2014 at 4:06 PM, Yan, Zheng <uker...@gmail.com> wrote:
> > On Sat, Feb 22, 2014 at 12:04 AM, Dan van der Ster
> > <daniel.vanders...@cern.ch> wrote:
> >> Hi Greg,
> >> Yes, this still happens after the updatedb fix.
> >>
> >> [root@xxx dan]# mount
> >> ...
> >> zzz:6789:/ on /mnt/ceph type ceph (name=cephfs,key=client.cephfs)
> >>
> >> [root@xxx dan]# pwd
> >> /mnt/ceph/dan
> >>
> >> [root@xxx dan]# dd if=/dev/zero of=yyy bs=4M count=2000
> >> 2000+0 records in
> >> 2000+0 records out
> >> 8388608000 bytes (8.4 GB) copied, 9.21217 s, 911 MB/s
> >>
> >>
> >> Then 30s later:
> >>
> >> 2014-02-21 16:16:11.315110 osd.326 x:6836/31929 683 : [WRN] 1 slow
> >> requests, 1 included below; oldest blocked for > 32.432401 secs
> >> 2014-02-21 16:16:11.315317 osd.326 x:6836/31929 684 : [WRN] slow request
> >> 32.432401 seconds old, received at 2014-02-21 16:15:38.882584:
> >> osd_op(client.16735018.1:22522476 100000352bf.000002a4 [write 0~4194304
> >> [8@0],startsync 0~0] 0.5447d769 snapc 1=[] e42655) v4 currently waiting
> >> for subops from [357,191]
> >>
> >> And no slow requests for other active clients.
> >>
> >> Reminder, this is a 1GigE client, 64GB RAM, kernel
> >> 3.13.0-1.el6.elrepo.x86_64, kernel-mounted cephfs. I can't reproduce this
> >> on a 1GigE client with only 8GB RAM, 3.11.0-15-generic and
> >> 3.13.4-031304-generic. (The smaller-RAM client is writing at 110-120MB/s
> >> vs the 900MB/s writes seen on the big-RAM machine -- obviously the writes
> >> are all buffered on the big-RAM machine.) Maybe the RAM isn't related,
> >> though, as with fdatasync mode we still see the slow requests:
> >>
> >> [root@xxx dan]# dd if=/dev/zero of=yyy bs=4M count=2000 conv=fdatasync
> >> 2000+0 records in
> >> 2000+0 records out
> >> 8388608000 bytes (8.4 GB) copied, 78.26 s, 107 MB/s
> >
> > It's likely this issue is related to the big RAM. Big RAM allows the kernel
> > to cache a large amount of dirty data, so the kernel creates lots of OSD
> > requests when flushing that dirty data. (conv=fdatasync doesn't help here
> > because dd calls fdatasync only after all the buffered writes finish.)
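
(If the big-RAM/dirty-cache theory holds, two things might be worth trying on
that client -- just a sketch, I haven't tested either here:

  # see how much dirty data the kernel is currently allowed to accumulate
  $ sysctl vm.dirty_ratio vm.dirty_background_ratio

  # cap it to e.g. 512MB / 256MB so the flush window stays small
  $ sysctl -w vm.dirty_bytes=$((512*1024*1024))
  $ sysctl -w vm.dirty_background_bytes=$((256*1024*1024))

  # or sync every output block instead of only at the end (unlike conv=fdatasync):
  $ dd if=/dev/zero of=yyy bs=4M count=2000 oflag=dsync

Either should keep the number of in-flight OSD writes small, presumably at the
cost of the dd dropping back to roughly the NIC speed.)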
> >
> > Regards
> > Yan, Zheng
> >
> >>
> >> 2014-02-21 16:26:15.202047 osd.818 x:6803/128164 1219 : [WRN] 1 slow
> >> requests, 1 included below; oldest blocked for > 30.446683 secs
> >> 2014-02-21 16:26:15.202194 osd.818 x:6803/128164 1220 : [WRN] slow request
> >> 30.446683 seconds old, received at 2014-02-21 16:25:44.754914:
> >> osd_op(client.16735018.1:22524842 100000352bf.00000355 [write 0~4194304
> >> [12@0],startsync 0~0] 0.c36d4557 snapc 1=[] e42655) v4 currently waiting
> >> for subops from [558,827]
> >>
> >>
> >> Cheers, Dan
> >>
> >>
> >>
> >> -- Dan van der Ster || Data & Storage Services || CERN IT Department --
> >>
> >>
> >> On Thu, Feb 20, 2014 at 4:02 PM, Gregory Farnum <g...@inktank.com> wrote:
> >>>
> >>> Arne,
> >>> Sorry this got dropped -- I had it marked in my mail but didn't have
> >>> the chance to think about it seriously when you sent it. Does this
> >>> still happen after the updatedb config change you guys made recently?
> >>> -Greg
> >>> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >>>
> >>>
> >>> On Fri, Jan 31, 2014 at 5:52 AM, Arne Wiebalck <arne.wieba...@cern.ch>
> >>> wrote:
> >>> > Hi,
> >>> >
> >>> > We observe that we can easily create slow requests with a simple dd on
> >>> > CephFS:
> >>> >
> >>> > -->
> >>> > [root@p05153026953834 dd]# dd if=/dev/zero of=xxx bs=4M count=1000
> >>> > 1000+0 records in
> >>> > 1000+0 records out
> >>> > 4194304000 bytes (4.2 GB) copied, 4.27824 s, 980 MB/s
> >>> >
> >>> > ceph -w:
> >>> > 2014-01-31 14:28:44.009543 osd.450 [WRN] 1 slow requests, 1 included
> >>> > below; oldest blocked for > 31.088950 secs
> >>> > 2014-01-31 14:28:44.009676 osd.450 [WRN] slow request 31.088950 seconds
> >>> > old, received at 2014-01-31 14:28:12.920423:
> >>> > osd_op(client.16735018.1:22493091 100000352b3.000002e9 [write
> >>> > 0~4194304,startsync 0~0] 0.518f2eef snapc 1=[] e32400) v4 currently
> >>> > waiting for subops from [87,1190]
> >>> > <---
> >>> >
> >>> > From what we see, the OSDs are not busy, so we suspect that it is the
> >>> > client starting all the requests, but then the requests take longer than
> >>> > 30 secs to finish writing, i.e. flushing the client-side buffers.
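
(Side note: if that's what's happening, the buffering should be directly
visible on the client while the dd runs -- e.g. something like

  $ watch -n1 'grep -e Dirty: -e Writeback: /proc/meminfo'

should show Dirty: jump to a few GB as soon as the dd returns and then drain
slowly, at roughly the 1GbE line rate.)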
> >>> >
> >>> > Is our understanding correct?
> >>> > Do these slow requests have an impact on requests from other clients,
> >>> > i.e. are some OSD resources consumed by these clients?
> >>> >
> >>> > The setup is:
> >>> > Client: kernel 3.13.0, 1GbE
> >>> > MDS Emperor 0.72.2
> >>> > OSDs Dumpling 0.67.5
> >>> >
> >>> > Thanks!
> >>> >  Dan & Arne
> >>> >
> >>> >
> >>> > --
> >>> > Arne Wiebalck
> >>> > CERN IT
> >>> >
> >>> >
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
