On Mon, Oct 22, 2018 at 7:47 PM Dylan McCulloch <d...@unimelb.edu.au> wrote:
>
> > On Mon, Oct 22, 2018 at 2:37 PM Dylan McCulloch <dmc at unimelb.edu.au> wrote:
> > >
> > > > On Mon, Oct 22, 2018 at 9:46 AM Dylan McCulloch <dmc at unimelb.edu.au> wrote:
> > > > >
> > > > > > On Mon, Oct 8, 2018 at 2:57 PM Dylan McCulloch <dmc at unimelb.edu.au> wrote:
> > > > > >>
> > > > > >> Hi all,
> > > > > >>
> > > > > >> We have identified some unexpected blocking behaviour by the
> > > > > >> ceph-fs kernel client.
> > > > > >>
> > > > > >> When performing 'rm' on large files (100+GB), there appears to be
> > > > > >> a significant delay of 10 seconds or more before a 'stat'
> > > > > >> operation can be performed on the same directory on the filesystem.
> > > > > >>
> > > > > >> Looking at the kernel client's mds inflight-ops, we observe that
> > > > > >> there are pending UNLINK operations corresponding to the deleted files.
> > > > > >>
> > > > > >> We have noted some correlation between files being in the client
> > > > > >> page cache and the blocking behaviour. For example, if the cache
> > > > > >> is dropped or the filesystem is remounted, the blocking will not occur.
> > > > > >>
> > > > > >> Test scenario below:
> > > > > >>
> > > > > >> /mnt/cephfs_mountpoint type ceph
> > > > > >> (rw,relatime,name=ceph_filesystem,secret=<hidden>,noshare,acl,wsize=16777216,rasize=268439552,caps_wanted_delay_min=1,caps_wanted_delay_max=1)
> > > > > >>
> > > > > >> Test1:
> > > > > >>
> > > > > >> 1) unmount & remount
> > > > > >>
> > > > > >> 2) Add 10 x 100GB files to a directory:
> > > > > >>
> > > > > >> for i in {1..10}; do dd if=/dev/zero
> > > > > >> of=/mnt/cephfs_mountpoint/file$i.txt count=102400 bs=1048576; done
> > > > > >>
> > > > > >> 3) Delete all files in the directory:
> > > > > >>
> > > > > >> for i in {1..10}; do rm -f /mnt/cephfs_mountpoint/file$i.txt; done
> > > > > >>
> > > > > >> 4) Immediately perform ls on the directory:
> > > > > >>
> > > > > >> time ls /mnt/cephfs_mountpoint/test1
> > > > > >>
> > > > > >> Result: delay ~16 seconds
> > > > > >>
> > > > > >> real 0m16.818s
> > > > > >> user 0m0.000s
> > > > > >> sys 0m0.002s
> > > > > >>
> > > > > > Are the cephfs metadata pool and data pool on the same set of OSDs? Is it
> > > > > > possible that heavy data IO slowed down metadata IO?
> > > > >
> > > > > Test results are from a new pre-production cluster that does not have
> > > > > any significant data IO. We've also confirmed the same behaviour on
> > > > > another cluster with a similar configuration. Both clusters have a
> > > > > separate device-class/crush rule for the metadata pool using NVMe OSDs,
> > > > > while the data pool uses HDD OSDs.
> > > > > Most metadata operations are unaffected. It appears that only metadata
> > > > > operations on files that exist in the client page cache prior to the
> > > > > rm are delayed.
> > > >
> > > > Ok. Please enable kernel debug when running 'ls' and send the kernel log
> > > > to us.
> > > >
> > > > echo module ceph +p > /sys/kernel/debug/dynamic_debug/control;
> > > > time ls /mnt/cephfs_mountpoint/test1
> > > > echo module ceph -p > /sys/kernel/debug/dynamic_debug/control;
> > > >
> > > > Yan, Zheng
> > >
> > > Thanks Yan, Zheng
> > > I've attached two logfiles, as I ran the test twice.
> > > The first time was as previously described in Test1 - cephfs_kern.log
> > > The second time I dropped caches prior to the rm, as in the previous Test2 -
> > > cephfs_drop_caches_kern.log
> >
> > The log shows that the client waited 16 seconds for the readdir reply. Please
> > try again with debug mds/ms enabled and send both the kernel log and the mds
> > log to us.
> >
> > Before writing data to the files, enable debug_mds and debug_ms (on the
> > machine where mds.0 runs, run 'ceph daemon mds.x config set debug_mds 10;
> > ceph daemon mds.x config set debug_ms 1')
> > ...
> > echo module ceph +p > /sys/kernel/debug/dynamic_debug/control
> > time ls /mnt/cephfs_mountpoint/test1
> > echo module ceph -p > /sys/kernel/debug/dynamic_debug/control
> > disable debug_mds and debug_ms
> >
> > Yan, Zheng
>
> Tarball of the kernel log and mds debug log uploaded:
> https://swift.rc.nectar.org.au:8888/v1/AUTH_42/cephfs_issue/mds_debug_kern_logs_20181022_2141.tar.gz?temp_url_sig=51f74f07c77346138a164ed229dc8a92f18bed8d&temp_url_expires=1545046086
>
> Thanks,
> Dylan
>
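For reference, the capture steps described above, run end to end, look roughly like the sketch below. The mount point, dd/rm loops and debug commands are taken from the thread; the mds daemon name mds.x is a placeholder and the debug levels restored at the end are assumed defaults, so adjust both for your cluster.

# On the host running mds.0, raise mds/messenger debug (mds.x is a placeholder daemon name)
ceph daemon mds.x config set debug_mds 10
ceph daemon mds.x config set debug_ms 1

# On the client, create the test files and then remove them
for i in {1..10}; do dd if=/dev/zero of=/mnt/cephfs_mountpoint/file$i.txt count=102400 bs=1048576; done
for i in {1..10}; do rm -f /mnt/cephfs_mountpoint/file$i.txt; done

# Enable ceph kernel-client dynamic debug only around the readdir being timed
echo module ceph +p > /sys/kernel/debug/dynamic_debug/control
time ls /mnt/cephfs_mountpoint/test1
echo module ceph -p > /sys/kernel/debug/dynamic_debug/control

# Drop the mds debug levels back down afterwards (assumed defaults)
ceph daemon mds.x config set debug_mds 1
ceph daemon mds.x config set debug_ms 0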
The log shows that the mds sent the reply immediately after receiving the
readdir request, but the reply message was delayed for 16 seconds. (The mds
sent 5 messages at 2018-10-22 21:39:12; the last one is the readdir reply. The
kclient received the first message at 18739.612013 and received the readdir
reply at 18755.894441.) The delay pattern is that the kclient received a
message, nothing happened for 4 seconds, it received another one or two
messages, nothing happened for another 4 seconds, and so on.

Could you please try using the simple messenger? If that does not work, try
disabling the NIC's offload functions.

Regards,
Yan, Zheng

> > > > > >>
> > > > > >> Test2:
> > > > > >>
> > > > > >> 1) unmount & remount
> > > > > >>
> > > > > >> 2) Add 10 x 100GB files to a directory
> > > > > >>
> > > > > >> for i in {1..10}; do dd if=/dev/zero
> > > > > >> of=/mnt/cephfs_mountpoint/file$i.txt count=102400 bs=1048576; done
> > > > > >>
> > > > > >> 3) Either a) unmount & remount; or b) drop caches
> > > > > >>
> > > > > >> echo 3 >>/proc/sys/vm/drop_caches
> > > > > >>
> > > > > >> 4) Delete files in the directory:
> > > > > >>
> > > > > >> for i in {1..10}; do rm -f /mnt/cephfs_mountpoint/file$i.txt; done
> > > > > >>
> > > > > >> 5) Immediately perform ls on the directory:
> > > > > >>
> > > > > >> time ls /mnt/cephfs_mountpoint/test1
> > > > > >>
> > > > > >> Result: no delay
> > > > > >>
> > > > > >> real 0m0.010s
> > > > > >> user 0m0.000s
> > > > > >> sys 0m0.001s
> > > > > >>
> > > > > >> Our understanding of ceph-fs' file deletion mechanism is that
> > > > > >> there should be no blocking observed on the client.
> > > > > >> http://docs.ceph.com/docs/mimic/dev/delayed-delete/
> > > > > >>
> > > > > >> It appears that if files are cached on the client, either by being
> > > > > >> created or accessed recently, the kernel client will block for
> > > > > >> reasons we have not identified.
> > > > > >>
> > > > > >> Is this a known issue, and are there any ways to mitigate this
> > > > > >> behaviour?
> > > > > >>
> > > > > >> Our production system relies on our clients' processes having
> > > > > >> concurrent access to the file system, and access contention must
> > > > > >> be avoided.
> > > > > >>
> > > > > >> An old mailing list post that discusses changes to the client's
> > > > > >> page cache behaviour may be relevant:
> > > > > >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005692.html
> > > > > >>
> > > > > >> Client System:
> > > > > >>
> > > > > >> OS: RHEL7
> > > > > >> Kernel: 4.15.15-1
> > > > > >>
> > > > > >> Cluster:
> > > > > >>
> > > > > >> Ceph: Luminous 12.2.8
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Dylan
> > > > > >>
> > > > > >> _______________________________________________
> > > > > >> ceph-users mailing list
> > > > > >> ceph-users at lists.ceph.com
> > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
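For anyone wanting to try the two suggestions above, a rough sketch follows. The ms_type option name (Luminous era), the particular offload flags, and the interface name eth0 are assumptions to verify against your own setup; the messenger type cannot be changed at runtime, so a daemon restart is needed.

# In ceph.conf on the MDS host(s), under [global], then restart the daemon
# (Luminous option name assumed):
#     ms_type = simple
#
# Disable common NIC offload features on the client and MDS hosts
# (eth0 is a placeholder interface name):
ethtool -K eth0 tso off gso off gro off lro off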