[ceph-users] Re: Newer linux kernel cephfs clients is more trouble?
On 5/12/22 12:06 AM, Stefan Kooman wrote:
> Hi List,
>
> We have quite a few Linux kernel clients for CephFS. One of our customers has been running mainline kernels (CentOS 7 elrepo) for the past two years. They started out with 3.x kernels (the CentOS 7 default), but upgraded to mainline when those kernels would frequently generate MDS warnings like "failing to respond to capability release". That worked fine until the 5.14 kernel: 5.14 and up would use a lot of CPU and *way* more bandwidth on CephFS than older kernels (an order of magnitude more). After the MDS was upgraded from Nautilus to Octopus that behavior is gone (CPU / bandwidth usage comparable to older kernels). However, the newer kernels are now the ones that give "failing to respond to capability release", and worse, clients get evicted (unresponsive as far as the MDS is concerned). Even the latest 5.17 kernels have that. No difference is observed between using messenger v1 or v2. The MDS version is 15.2.16.
>
> Surprisingly, the latest stable kernels from CentOS 7 work flawlessly now. Although that is good news, newer operating systems come with newer kernels.
>
> Does anyone else observe the same behavior with newish kernel clients?

There are some known bugs that have recently been fixed, or are still being fixed, even in mainline, and I am not sure whether they are related; see for example [1][2][3][4]. For more detail please see the ceph-client repo testing branch [5].

I have never seen the "failing to respond to capability release" issue yet. If you have the MDS logs (debug_mds = 25 and debug_ms = 1) and the kernel debug logs, that would help to debug it further; or please provide the steps to reproduce it.

[1] https://tracker.ceph.com/issues/55332
[2] https://tracker.ceph.com/issues/55421
[3] https://bugzilla.redhat.com/show_bug.cgi?id=2063929
[4] https://tracker.ceph.com/issues/55377
[5] https://github.com/ceph/ceph-client/commits/testing

Thanks
-- Xiubo

> Gr. Stefan
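For reference, the debug levels Xiubo mentions can be raised at runtime from the cluster side. A minimal sketch, assuming an MDS daemon named mds.a (the daemon name is only an example) and a cluster using the centralized config database:

  # persist the levels for all MDS daemons via the config database
  ceph config set mds debug_mds 25
  ceph config set mds debug_ms 1

  # or raise them on a single running daemon only
  ceph tell mds.a config set debug_mds 25
  ceph tell mds.a config set debug_ms 1

Level 25 MDS logging is extremely verbose, so it is worth lowering the levels again once the issue has been captured.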
[ceph-users] Re: The last 15 'degraded' items take as many hours as the first 15K?
On Fri, May 13, 2022 at 08:56, Stefan Kooman wrote:
>> Thanks Janne and all for the insights! The reason why I half-jokingly suggested the cluster 'lost interest' in those last few fixes is that the recovery statistics included in ceph -s reported near-zero activity for so long. After a long while those last few 'were fixed' --- but if the cluster was moving metadata around to fix the 'holdout repairs', that traffic wasn't in the stats. Those last few objects/PGs to be repaired seemingly got fixed 'by magic that didn't include moving data counted in the ceph -s stats'.
>
> It's probably the OMAP data (lots of key-value pairs) that takes a lot of time to replicate (we have PGs with over 4 million objects holding just OMAP), and those can take up to 45 minutes to recover, all while doing only a little bit of network throughput (those are NVMe OSDs). You can check this with "watch -n 3 ceph pg ls remapped" and see how long each backfill takes, and also whether a PG has a lot of OMAP_BYTES and OMAP_KEYS ... but no "BYTES".

Yes, RGW does (or did) place a lot of zero-sized objects in some pools, with tons of metadata attached to each 0-byte object as a placeholder for said data. While recovering such PGs on spinning drives, the number of metadata entries per second one can recover is probably bound by spindrive IOPS limits at some point, and as Stefan says, the bytes-per-second figure looks abysmal because it comes from a very simple calculation that doesn't take these kinds of objects into account.

So, for example, if a zero-sized object had 100 metadata "things" attached to it, and a normal spindrive can do 100 IOPS, ceph -s would tell me I am fixing one object per second at the incredible rate of 0 B/s. That would indeed make it look like the cluster "doesn't care anymore", even if the destination drives are flipping the write head back and forth 100 times per second, as fast as they physically can, probably showing near 100% utilization in iostat and similar tools. But the helicopter-view summary line on recovery speed makes it look like the cluster doesn't want to finish repairs...

-- 
May the most significant bit of your life be positive.
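For reference, this is roughly the kind of picture that shows up in that view for an OMAP-heavy backfill (illustrative only: the numbers are made up and the column set is trimmed; the exact layout varies between Ceph releases):

  $ ceph pg ls remapped
  PG    OBJECTS  MISPLACED  BYTES  OMAP_BYTES*  OMAP_KEYS*  STATE
  7.1a  4012345  4012345    0      9867452311   52331876    active+remapped+backfilling
  7.3c  3998710  3998710    0      9712004582   51876223    active+remapped+backfilling

A PG like that can stay in backfilling for a long time while ceph -s reports almost no recovery bytes per second, because nearly all of the work is key-value (OMAP) traffic rather than object data.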
[ceph-users] Need advice how to proceed with [WRN] CEPHADM_HOST_CHECK_FAILED
Hello,

For about a year and a half I have been supporting a Ceph cluster for my company (v15.2.3 on CentOS 8, which is already out of support). It is used only for S3, and until recently there were no serious problems that I could not deal with, but for the last problem, which appeared about 2 months ago, I cannot find a solution on my own.

After a firewall was added, each of the hosts was isolated from the monitoring servers for a short time (about 15-20 minutes), which led to the following error messages:

ceph> health detail
HEALTH_ERR 8 hosts fail cephadm check; failed to probe daemons or devices; Module 'cephadm' has failed: cannot send (already closed?)
[WRN] CEPHADM_HOST_CHECK_FAILED: 8 hosts fail cephadm check
    host mon4 failed check: cannot send (already closed?)
    host mon5 failed check: cannot send (already closed?)
    host rgw1 failed check: cannot send (already closed?)
    host srv1 failed check: cannot send (already closed?)
    host srv2 failed check: cannot send (already closed?)
    host srv3 failed check: cannot send (already closed?)
    host srv4 failed check: cannot send (already closed?)
    host srv5 failed check: cannot send (already closed?)
[WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
    host mon4 scrape failed: cannot send (already closed?)
    host mon4 ceph-volume inventory failed: cannot send (already closed?)
    host mon5 scrape failed: cannot send (already closed?)
    host mon5 ceph-volume inventory failed: cannot send (already closed?)
    host rgw1 scrape failed: cannot send (already closed?)
    host rgw1 ceph-volume inventory failed: cannot send (already closed?)
    host srv1 scrape failed: cannot send (already closed?)
    host srv1 ceph-volume inventory failed: cannot send (already closed?)
    host srv2 scrape failed: cannot send (already closed?)
    host srv2 ceph-volume inventory failed: cannot send (already closed?)
    host srv3 scrape failed: cannot send (already closed?)
    host srv3 ceph-volume inventory failed: cannot send (already closed?)
    host srv4 scrape failed: cannot send (already closed?)
    host srv4 ceph-volume inventory failed: cannot send (already closed?)
    host srv5 scrape failed: cannot send (already closed?)
    host srv5 ceph-volume inventory failed: cannot send (already closed?)

Despite these errors the cluster is working, the data is currently being accessed normally, and I have not noticed any of the services go down. Despite the errors it was necessary to add a new server, srv6, which was added to the cluster normally and worked as expected, but immediately after that another error occurred:

[ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: cannot send (already closed?)
    Module 'cephadm' has failed: cannot send (already closed?)

which put the cluster into the ERROR state. The hosts are alive and connected:

# ceph orch host ls
HOST       ADDR             LABELS  STATUS
adm        adm              mgr
mon1       mon1             mgr
mon2       mon2
mon3       mon3             mgr
mon4       mon4
mon5       mon5
rgw1       rgw1
rgw2-real  rgw2-real
srv1       srv1
srv2       srv2
srv3       srv3
srv4       srv4
srv5       192.168.236.215
srv6       192.168.236.216

Any advice is welcome. I have read everything related to these errors that I was able to find in the various groups, but none of the proposed solutions led to a positive result.

Regards,
Kalin
[ceph-users] Multi-datacenter filesystem
Hi Team,

We have grown out of our current solution, and we plan to migrate to multiple data centers. Our setup is a mix of radosgw data and filesystem data, but we have many legacy systems that require a filesystem at the moment, so we will probably run it for some of our data for at least 3-5 years.

At the moment we have about 0.5 petabytes of data, so it is a small cluster. Still, we want more redundancy, so we will partner with a company with multiple data centers within the city and have redundant fiber between the locations. Our current center has multiple 10GB connections, so the communication between the new locations and our existing data center will be slower. Still, I hope the network will suffice for a multi-datacenter setup.

Currently, I plan to assign OSDs to different sites and racks so we can configure a good replication rule to keep a copy of the data in each data center.

My question is how to handle the monitor setup for good redundancy. For example, should I set up two new monitors in each new location and keep one in our existing data center, so I get five monitors in total, or should I keep it at three monitors, one per data center? Or should I go for nine monitors, three in each data center? Should I use a stretch setup to define the location of each monitor? Can you do the same for MDSes? Do I need to configure the mounting of the filesystem differently to signal which data center the client is located in?

Does anyone know of a partner we could consult on these issues?

Best regards
Daniel
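For reference, a minimal sketch of the kind of "one copy per data center" replication rule described above, assuming three CRUSH datacenter buckets (the bucket, host, and pool names here are only examples, not taken from the original post):

  # create datacenter buckets and hang the hosts under them
  ceph osd crush add-bucket dc1 datacenter
  ceph osd crush move dc1 root=default
  ceph osd crush move host-a datacenter=dc1
  # repeat for dc2/dc3 and the remaining hosts

  # replicated rule whose failure domain is the datacenter
  ceph osd crush rule create-replicated rep-3dc default datacenter

  # apply it to the CephFS data pool, one replica per datacenter
  ceph osd pool set cephfs_data crush_rule rep-3dc
  ceph osd pool set cephfs_data size 3

With size=3 and three datacenter buckets, each PG keeps one replica per site; the monitor and MDS placement questions above are a separate decision.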