Thanks, John. I'm pretty sure the root of my slow OSD issues is filestore
subfolder splitting.
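
In case it's useful to anyone else chasing the same thing, these are the knobs I'm looking at -- a rough sketch only, with osd.15 taken from the logs below and the pool name a placeholder for our cephfs data pool:

    # current thresholds on a running OSD, via the admin socket
    ceph daemon osd.15 config get filestore_merge_threshold
    ceph daemon osd.15 config get filestore_split_multiple

    # a subfolder is split once it holds more than roughly
    # filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects

    # raise the split point at runtime across all OSDs
    ceph tell osd.* injectargs '--filestore_split_multiple 8 --filestore_merge_threshold -10'

If I'm reading the docs right, recent ceph-objectstore-tool builds can also pre-split a stopped OSD offline ("--op apply-layout-settings --pool cephfs_data"), which should avoid paying the split cost during client I/O, but I haven't tried that yet.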


On Wed, Mar 14, 2018 at 2:17 PM, John Spray <jsp...@redhat.com> wrote:

> On Tue, Mar 13, 2018 at 7:17 PM, David C <dcsysengin...@gmail.com> wrote:
> > Hi All
> >
> > I have a Samba server that is exporting directories from a Cephfs Kernel
> > mount. Performance has been pretty good for the last year but users have
> > recently been complaining of short "freezes", these seem to coincide with
> > MDS related slow requests in the monitor ceph.log such as:
> >
> >> 2018-03-13 13:34:58.461030 osd.15 osd.15 10.10.10.211:6812/13367 5752 : cluster [WRN] slow request 31.834418 seconds old, received at 2018-03-13 13:34:26.626474: osd_repop(mds.0.5495:810644 3.3e e14085/14019 3:7cea5bac:::10001a88b8f.00000000:head v 14085'846936) currently commit_sent
> >> 2018-03-13 13:34:59.461270 osd.15 osd.15 10.10.10.211:6812/13367 5754 : cluster [WRN] slow request 32.832059 seconds old, received at 2018-03-13 13:34:26.629151: osd_repop(mds.0.5495:810671 2.dc2 e14085/14020 2:43bdcc3f:::10001e91a91.00000000:head v 14085'21394) currently commit_sent
> >> 2018-03-13 14:23:57.409427 osd.30 osd.30 10.10.10.212:6824/14997 5708 : cluster [WRN] slow request 30.536832 seconds old, received at 2018-03-13 14:23:26.872513: osd_repop(mds.0.5495:865403 2.fb6 e14085/14077 2:6df955ef:::10001e93542.000000c4:head v 14085'21296) currently commit_sent
> >> 2018-03-13 14:23:57.409449 osd.30 osd.30 10.10.10.212:6824/14997 5709 : cluster [WRN] slow request 30.529640 seconds old, received at 2018-03-13 14:23:26.879704: osd_repop(mds.0.5495:865407 2.595 e14085/14019 2:a9a56101:::10001e93542.000000c8:head v 14085'20437) currently commit_sent
> >> 2018-03-13 14:23:57.409453 osd.30 osd.30 10.10.10.212:6824/14997 5710 : cluster [WRN] slow request 30.503138 seconds old, received at 2018-03-13 14:23:26.906207: osd_repop(mds.0.5495:865423 2.ea e14085/14055 2:57096bbf:::10001e93542.000000d8:head v 14085'21147) currently commit_sent
> >
> >
> > --
> >
> > Looking in the MDS log, with debug set to 4, it's full of "setfilelockrule 1" and "setfilelockrule 2":
> >
> >> 2018-03-13 14:23:00.446905 7fde43e73700  4 mds.0.server handle_client_request client_request(client.9174621:141162337 setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 120, length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=1155, caller_gid=1131{}) v2
> >> 2018-03-13 14:23:00.447050 7fde43e73700  4 mds.0.server handle_client_request client_request(client.9174621:141162338 setfilelockrule 2, type 4, owner 14971048137043556787, pid 4632, start 0, length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0, caller_gid=0{}) v2
> >> 2018-03-13 14:23:00.447258 7fde43e73700  4 mds.0.server handle_client_request client_request(client.9174621:141162339 setfilelockrule 2, type 4, owner 14971048137043550643, pid 4632, start 0, length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0, caller_gid=0{}) v2
> >> 2018-03-13 14:23:00.447393 7fde43e73700  4 mds.0.server handle_client_request client_request(client.9174621:141162340 setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 124, length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=1155, caller_gid=1131{}) v2
>
> The MDS reporting slow requests when file locking is in use is a known
> bug; the ticket is:
> http://tracker.ceph.com/issues/22428
>
> Probably only indirectly related to the stuck OSD requests: perhaps
> the application itself is having trouble promptly releasing locks
> because it is hung up on flushing its data to slow OSDs.
>
> John
>
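
That theory fits what we're seeing. For anyone else following the thread, the admin socket commands that look relevant for checking it are roughly these (daemon names are examples -- substitute your active MDS and one of the OSDs from ceph.log):

    # client requests currently queued in the MDS, including the setfilelock ops
    ceph daemon mds.ceph1 dump_ops_in_flight

    # RADOS operations the MDS itself has outstanding against the OSDs
    ceph daemon mds.ceph1 objecter_requests

    # recent slow ops on one of the OSDs named in the cluster log
    ceph daemon osd.15 dump_historic_ops
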
> >
> > --
> >
> > I don't have a particularly good monitoring setup on this cluster yet, but
> > a cursory look at a few things such as iostat doesn't suggest the OSDs
> > are being hammered.
> >
> > Some questions:
> >
> > 1) Can anyone recommend a way of diagnosing this issue?
> > 2) Are the multiple "setfilelockrule" per inode to be expected? I assume
> > this is something to do with the Samba oplocks.
> > 3) What's the recommended highest MDS debug setting before performance
> > starts to be adversely affected (I'm aware log files will get huge)?
> > 4) What's the best way of matching inodes in the MDS log to the file names
> > in cephfs?
> >
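
(Partly answering my own question 4 after some digging -- corrections welcome. The inode in the MDS log is hex, so something along these lines seems workable; the mount point and data pool name are placeholders for ours:

    # convert the hex inode to decimal and search the kernel mount for it
    find /mnt/cephfs -inum $(printf '%d' 0x10001e8dc37)

    # or pull the backtrace stored on the file's first data object and decode it
    rados -p cephfs_data getxattr 10001e8dc37.00000000 parent > /tmp/parent
    ceph-dencoder type inode_backtrace_t import /tmp/parent decode dump_json

The second approach avoids walking the whole filesystem; the object name is just the hex inode without the 0x, plus the ".00000000" chunk suffix.)
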
> > Hardware/Versions:
> >
> > Luminous 12.1.1
> > Cephfs client 3.10.0-514.2.2.el7.x86_64
> > Samba 4.4.4
> > 4-node cluster, each node 1x Intel 3700 NVMe, 12x SATA, 40Gbps networking
> >
> > Thanks in advance!
> >
> > Cheers,
> > David
> >
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
