[ceph-users] Re: Newer linux kernel cephfs clients is more trouble?

2022-05-13 Thread Xiubo Li


On 5/12/22 12:06 AM, Stefan Kooman wrote:

Hi List,

We have quite a few linux kernel clients for CephFS. One of our 
customers has been running mainline kernels (CentOS 7 elrepo) for the 
past two years. They started out with 3.x kernels (default CentOS 7), 
but upgraded to mainline when those kernels would frequently generate 
MDS warnings like "failing to respond to capability release". That 
worked fine until the 5.14 kernel; 5.14 and up would use a lot of CPU and 
*way* more bandwidth on CephFS than older kernels (order of 
magnitude). After the MDS was upgraded from Nautilus to Octopus that 
behavior is gone (comparable CPU / bandwidth usage as older kernels). 
However, the newer kernels are now the ones that give "failing to 
respond to capability release", and worse, clients get evicted 
(unresponsive as far as the MDS is concerned). Even the latest 5.17 
kernels have that. No difference is observed between using messenger 
v1 or v2. MDS version is 15.2.16.
Surprisingly the latest stable kernels from CentOS 7 work flawlessly 
now. Although that is good news, newer operating systems come with 
newer kernels.


Does anyone else observe the same behavior with newish kernel clients?


There are some known bugs which have recently been fixed or are still 
being fixed, even in mainline, and I am not sure whether they are 
related, such as [1][2][3][4]. For more detail please see the 
ceph-client repo testing branch [5].


I have never seen the "failing to respond to capability release" issue 
myself yet. If you can provide the MDS logs (debug_mds = 25 and 
debug_ms = 1) and the kernel debug logs, that would help to debug it 
further; otherwise please provide the steps to reproduce it.
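
For example, something like this should work to raise the MDS debug 
levels and turn on the kernel client's dynamic debug output (just a 
rough sketch; both are very verbose, so remember to revert them 
afterwards):

ceph config set mds debug_mds 25
ceph config set mds debug_ms 1

# on the client node, if the kernel was built with CONFIG_DYNAMIC_DEBUG
echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control
echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control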


[1] https://tracker.ceph.com/issues/55332
[2] https://tracker.ceph.com/issues/55421
[3] https://bugzilla.redhat.com/show_bug.cgi?id=2063929
[4] https://tracker.ceph.com/issues/55377
[5] https://github.com/ceph/ceph-client/commits/testing

Thanks

-- Xiubo



Gr. Stefan




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The last 15 'degraded' items take as many hours as the first 15K?

2022-05-13 Thread Janne Johansson
Den fre 13 maj 2022 kl 08:56 skrev Stefan Kooman :
> >>
> > Thanks Janne and all for the insights!  The reason why I half-jokingly
> > suggested the cluster 'lost interest' in those last few fixes is that
> > the recovery statistics' included in ceph -s reported near to zero
> > activity for so long.  After a long while those last few 'were fixed'
> > --- but if the cluster was moving metadata around to fix the 'holdout
> > repairs' that traffic wasn't in the stats.  Those last few objects/pgs
> > to be repaired seemingly got fixed 'by magic that didn't include moving
> > data counted in the ceph -s stats'.
>
> It's probably the OMAP data (lots of key-values) that takes a lot of
> time to replicate (we have PGs with over 4 million objects holding just
> OMAP) and those can take up to 45 minutes to recover, all while doing a
> little bit of network throughput (those are NVMe OSDs). You can check
> this with "watch -n 3 ceph pg ls remapped" and see how long each
> backfill takes. And also if it has a lot of OMAP_BYTES and OMAP_KEYS ...
> but no "BYTES".

Yes, RGW does (or did) place a lot of zero-sized objects in some pools
with tons of metadata attached to each 0-byte object as a placeholder
for said data. While recovering such PGs on spin drives, the number of
metadata-per-second one can recover is probably bound by spindrive
IOPS limits at some point, and as Stefan says, the bytes-per-second
looks abysmal because it does a very simple calculation that doesn't
take these kinds of objects into account.

So for example, if a zero-sized obj has 100 metadata "things"
attached to it, and a normal spindrive can do 100 IOPS, ceph -s would
tell me I am fixing one object per second at the incredible rate of
0b/s. That would indeed make it look like the cluster "doesn't care
anymore" even if the destination drives are flipping the write head
back and forth 100 times per second, as fast as they physically can,
probably showing near 100% utilization in iostat and similar tools.
But the helicopter view summary line on recovery speed looks like the
cluster doesn't want to finish repairs...
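
If you want to see the contrast yourself, something along these lines
should do (only a rough sketch; the exact column names can differ a bit
between releases):

watch -n 3 'ceph pg ls remapped'  # OMAP_BYTES*/OMAP_KEYS* shrink while BYTES barely moves
iostat -x 3                       # yet the backfill target's drive sits near 100% util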


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Need advice how to proceed with [WRN] CEPHADM_HOST_CHECK_FAILED

2022-05-13 Thread Kalin Nikolov
Hello,
For about a year and a half I have been supporting a Ceph cluster for my
company (v15.2.3 on CentOS 8, which is already out of support) that is used
only for S3. Until recently there were no serious problems that I could not
deal with, but I have not been able to find a solution on my own to the
problem that appeared about two months ago.
After a firewall was added, each of the hosts was isolated from the
monitoring servers for a short time (about 15-20 minutes), which led to the
following error message:

ceph> health detail
HEALTH_ERR 8 hosts fail cephadm check; failed to probe daemons or devices;
Module 'cephadm' has failed: cannot send (already closed?)
[WRN] CEPHADM_HOST_CHECK_FAILED: 8 hosts fail cephadm check
host mon4 failed check: cannot send (already closed?)
host mon5 failed check: cannot send (already closed?)
host rgw1 failed check: cannot send (already closed?)
host srv1 failed check: cannot send (already closed?)
host srv2 failed check: cannot send (already closed?)
host srv3 failed check: cannot send (already closed?)
host srv4 failed check: cannot send (already closed?)
host srv5 failed check: cannot send (already closed?)

[WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
host mon4 scrape failed: cannot send (already closed?)
host mon4 ceph-volume inventory failed: cannot send (already closed?)
host mon5 scrape failed: cannot send (already closed?)
host mon5 ceph-volume inventory failed: cannot send (already closed?)
host rgw1 scrape failed: cannot send (already closed?)
host rgw1 ceph-volume inventory failed: cannot send (already closed?)
host srv1 scrape failed: cannot send (already closed?)
host srv1 ceph-volume inventory failed: cannot send (already closed?)
host srv2 scrape failed: cannot send (already closed?)
host srv2 ceph-volume inventory failed: cannot send (already closed?)
host srv3 scrape failed: cannot send (already closed?)
host srv3 ceph-volume inventory failed: cannot send (already closed?)
host srv4 scrape failed: cannot send (already closed?)
host srv4 ceph-volume inventory failed: cannot send (already closed?)
host srv5 scrape failed: cannot send (already closed?)
host srv5 ceph-volume inventory failed: cannot send (already closed?)

Despite these errors, the cluster is working and the data is currently being
accessed normally; I have not noticed any services dropping. Even so, it was
necessary to add a new server, srv6, which was added to the cluster normally
and worked as expected, but immediately after that another error occurred:

[ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: cannot send (already
closed?)
Module 'cephadm' has failed: cannot send (already closed?)

This put the cluster into an ERROR state. The hosts are alive and connected.

#ceph orch host ls
HOST ADDR LABELS STATUS
adm adm mgr
mon1 mon1 mgr
mon2 mon2
mon3 mon3 mgr
mon4 mon4
mon5 mon5
rgw1 rgw1
rgw2-real rgw2-real
srv1 srv1
srv2 srv2
srv3 srv3
srv4 srv4
srv5 192.168.236.215
srv6 192.168.236.216

Any advice is welcome. I read everything that is related to the errors in
question and that I was able to find in the different groups, but none of
the proposed solutions led to a positive result.

Regards,
Kalin

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Multi-datacenter filesystem

2022-05-13 Thread Daniel Persson
Hi Team

We have grown out of our current solution, and we plan to migrate to
multiple data centers.

Our setup is a mix of radosgw data and filesystem data. But we have many
legacy systems that require a filesystem at the moment, so we will probably
run it for some of our data for at least 3-5 years.

At the moment, we have about 0.5 Petabytes of data, so it is a small
cluster. Still, we want more redundancy, so we will partner with a company
with multiple data centers within the city and have redundant fiber between
the locations.

Our current center has multiple 10GB connections, so the communication
between the new locations and our existing data center will be slower.
Still, I hope the available bandwidth will suffice for a multi-datacenter setup.

Currently, I plan to assign OSDs to different sites and racks so we can
configure a good replication rule to keep a copy of the data in each data
center.
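
Roughly what I have in mind for the CRUSH side is something like this
(just a sketch with made-up bucket names dc1/dc2/dc3; racks and hosts
would then be moved under their datacenter bucket):

ceph osd crush add-bucket dc1 datacenter
ceph osd crush add-bucket dc2 datacenter
ceph osd crush add-bucket dc3 datacenter
ceph osd crush move dc1 root=default
ceph osd crush move dc2 root=default
ceph osd crush move dc3 root=default
ceph osd crush rule create-replicated replicated_per_dc default datacenter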

My question is how to handle the monitor setup for good redundancy. For
example, should I set up two new monitors in each new location and have one
in our existing data center, so I get five monitors in total, or should I
keep it as three monitors, one for each data center? Or should I go for
nine monitors, three in each data center?

Should I use a stretch mode setup to define the location of each
monitor? Can you do the same for MDSes? Do I need to configure the
mounting of the filesystem differently to signal which data center the
client is located in?
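
From what I have read, stretch mode is aimed at two data centers plus a
tiebreaker monitor in a third location, and would look roughly like this
(only a sketch, assuming a Pacific or newer cluster, made-up monitor
names, and a CRUSH rule called stretch_rule created beforehand):

ceph mon set_location mon1 datacenter=dc1
ceph mon set_location mon2 datacenter=dc2
ceph mon set_location mon3 datacenter=dc3
ceph mon enable_stretch_mode mon3 stretch_rule datacenter

but I am not sure how well that maps to more than two data centers, or
to the MDS side.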

Does anyone know of a partner we could consult on these issues?

Best regards
Daniel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io