Hi Dan,

We also experienced very high network usage and memory pressure with our 
machine learning workload. This patch [1] (currently being tested; it may be 
merged in 6.5) may fix it. See [2] for more details about my experiments with 
this issue.

[1]: 
https://lkml.kernel.org/ceph-devel/20230515012044.98096-1-xiu...@redhat.com/T/#t
[2]: 
https://lore.kernel.org/ceph-devel/20230504082510.247-1-seh...@mail.scut.edu.cn

Weiwen Hu

On 30 May 2023, at 02:26, Dan van der Ster <dan.vanders...@clyso.com> wrote:

Hi,

Sorry for poking this old thread, but does this issue still persist in
the 6.3 kernels?

Cheers, Dan

______________________________
Clyso GmbH | https://www.clyso.com


On Wed, Dec 7, 2022 at 3:42 AM William Edwards <wedwa...@cyberfusion.nl> wrote:


On 7 Dec 2022, at 11:59, Stefan Kooman <ste...@bit.nl> wrote:

On 5/13/22 09:38, Xiubo Li wrote:
On 5/12/22 12:06 AM, Stefan Kooman wrote:
Hi List,

We have quite a few Linux kernel clients for CephFS. One of our customers has 
been running mainline kernels (CentOS 7, ELRepo) for the past two years. They 
started out with 3.x kernels (default CentOS 7), but upgraded to mainline when 
those kernels would frequently generate MDS warnings like "failing to respond 
to capability release". That worked fine until the 5.14 kernel: 5.14 and up 
would use a lot of CPU and *way* more bandwidth on CephFS than older kernels 
(an order of magnitude more). After the MDS was upgraded from Nautilus to 
Octopus that behavior was gone (CPU / bandwidth usage comparable to older 
kernels). However, the newer kernels are now the ones that give "failing to 
respond to capability release", and, worse, clients get evicted (unresponsive 
as far as the MDS is concerned). Even the latest 5.17 kernels show this. No 
difference is observed between messenger v1 and v2. The MDS version is 15.2.16.
Surprisingly, the latest stable kernels from CentOS 7 now work flawlessly. 
Although that is good news, newer operating systems come with newer kernels.
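
For reference, when chasing the "failing to respond to capability release" 
warnings we find it useful to dump the MDS sessions and sort clients by the 
number of caps they hold. A minimal sketch in Python, assuming the ceph CLI is 
available with admin access; "mds.a" below is just a placeholder for the 
active MDS daemon name, and "session ls" prints JSON:

import json
import subprocess

MDS = "mds.a"  # placeholder: replace with your active MDS daemon name

# 'session ls' prints a JSON array of client sessions.
sessions = json.loads(subprocess.check_output(["ceph", "tell", MDS, "session", "ls"]))

# Show the clients holding the most capabilities first.
for s in sorted(sessions, key=lambda s: s.get("num_caps", 0), reverse=True):
    meta = s.get("client_metadata", {})
    print(s.get("id"), s.get("num_caps", 0), meta.get("hostname", "?"), meta.get("kernel_version", "?"))

Clients near the top of that list that also show up in the MDS health warnings 
are usually the first ones worth looking at.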

Does anyone else observe the same behavior with newish kernel clients?
There are some known bugs, which have recently been fixed or are being fixed, 
even in mainline, and I am not sure whether they are related. See, for example, 
[1][2][3][4]. For more detail, please see the ceph-client repo testing branch [5].

None of the issues you mentioned were related. We have gained some more 
experience with newer kernel clients, specifically on Ubuntu Focal / Jammy 
(5.15). Performance issues seem to arise in certain workloads, specifically 
load-balanced Apache shared web hosting clusters on CephFS. We have tested 
Linux kernel clients from 5.8 up to and including 6.0 with a production 
workload, and the short summary is:

< 5.13: everything works fine
5.13 and up: gives issues

I see this issue on 6.0.0 as well.


We tested 5.13-rc1 as well, and already that kernel shows the issues. So 
something changed in 5.13 that results in a performance regression in certain 
workloads, and I wonder whether it has something to do with the changes related 
to fscache that have been, and still are, happening in the kernel. These web 
servers might access the same directories / files concurrently.
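
As a quick sanity check for the fscache angle, it can help to confirm whether 
any of the CephFS mounts actually have the "fsc" option enabled (it is off by 
default). A minimal sketch in Python that just parses /proc/mounts, assuming 
the in-kernel client (which shows up with fstype "ceph"):

# Report CephFS kernel mounts and whether the 'fsc' (fscache) option is set.
with open("/proc/mounts") as f:
    for line in f:
        device, mountpoint, fstype, options, _ = line.split(None, 4)
        if fstype == "ceph":
            opts = options.split(",")
            enabled = "fsc" in opts or any(o.startswith("fsc=") for o in opts)
            print(f"{mountpoint}: fsc {'enabled' if enabled else 'disabled'} ({options})")

Even with fsc disabled, the fscache/netfs read-path rework (which I believe 
started landing around 5.13) is still in play, so this only tells you whether 
the on-disk cache itself is involved.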

Note: we have quite a few 5.15 kernel clients that do not run any 
(load-balanced) web-based workload (container clusters on CephFS) and that do 
not show any performance issues on these kernels.

Issue: poor CephFS performance
Symptom / result: excessive CephFS network usage (an order of magnitude higher 
than for older kernels not showing this issue); within a minute there are a 
bunch of slow web service processes claiming large amounts of virtual memory, 
which results in heavy swap usage and basically renders the node unusably slow.

Other users who replied to this thread experienced similar symptoms. It is 
reproducible both on CentOS (ELRepo mainline kernels) and on Ubuntu (HWE as 
well as default release kernels).

MDS version used: 15.2.16 (with a backported patch from 15.2.17) (single active 
/ standby-replay)

Does this ring a bell?

Gr. Stefan

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

