[ceph-users] Re: Scrubbing?

2024-01-25 Thread Jan Marek
Hello Peter, your irony is perfect, it is worth noticing. The meaning of my previous post was that the Ceph cluster didn't fulfill my needs and, although I had set the mClock profile to "high_client_ops" (because I have plenty of time for rebalancing and scrubbing), my clients ran into problems. And there

[ceph-users] Re: Stupid question about ceph fs volume

2024-01-25 Thread Eugen Block
Hi, it's really as easy as it sounds (fresh test cluster on 18.2.1 without any pools yet): ceph:~ # ceph fs volume create cephfs (wait a minute or two) ceph:~ # ceph fs status cephfs - 0 clients == RANK STATE MDS ACTIVITY DNSINOS DIRS CAPS 0

[ceph-users] Re: Scrubbing?

2024-01-25 Thread Sridhar Seshasayee
Hello Jan, > The meaning of my previous post was that the Ceph cluster didn't fulfill > my needs and, although I had set the mClock profile to > "high_client_ops" (because I have plenty of time for rebalancing > and scrubbing), my clients ran into problems. > As far as the question around mClock is concern
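
For reference, a minimal sketch of how the mClock profile discussed in this thread is set and checked, assuming Quincy or later where mClock is the default OSD scheduler (osd.0 is just an example daemon):

  # apply the profile to all OSDs
  ceph config set osd osd_mclock_profile high_client_ops
  # confirm what a particular OSD is actually using
  ceph config show osd.0 osd_mclock_profile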

[ceph-users] Re: Questions about the CRUSH details

2024-01-25 Thread Henry lol
It's reasonable enough. Actually, I expected the client to have just thousands of "PG-to-OSDs" mappings. Nevertheless, it's so heavy that the client calculates the location on demand, right? If the client with an outdated map sends a request to the wrong OSD, does the OSD handle it somehow thro

[ceph-users] Re: Stupid question about ceph fs volume

2024-01-25 Thread Albert Shih
On 25/01/2024 at 08:42:19+, Eugen Block wrote: > Hi, > > it's really as easy as it sounds (fresh test cluster on 18.2.1 without any > pools yet): > > ceph:~ # ceph fs volume create cephfs Yes... I already tried that with the label and it works fine. But I prefer to use «my» pools, because I have

[ceph-users] Re: Stupid question about ceph fs volume

2024-01-25 Thread David C.
Albert, Never used EC for the (root) data pool. On Thu, 25 Jan 2024 at 12:08, Albert Shih wrote: > On 25/01/2024 at 08:42:19+, Eugen Block wrote > > Hi, > > > > it's really as easy as it sounds (fresh test cluster on 18.2.1 without > any > > pools yet): > > > > ceph:~ # ceph fs volume creat

[ceph-users] Re: Stupid question about ceph fs volume

2024-01-25 Thread Eugen Block
Did you set the ec-overwrites flag for the pool as mentioned in the docs? https://docs.ceph.com/en/latest/cephfs/createfs/#using-erasure-coded-pools-with-cephfs If you plan to use pre-created pools anyway then the slightly more manual method is the way to go. You can set the pg_num (and pgp_nu
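
A sketch of the manual method and the ec-overwrites flag mentioned above, assuming a replicated metadata pool plus an EC data pool (pool names and pg_num values are only illustrative):

  ceph osd pool create cephfs_metadata 32
  ceph osd pool create cephfs_data 128 128 erasure
  ceph osd pool set cephfs_data allow_ec_overwrites true
  # --force is needed because an EC default data pool is normally refused
  ceph fs new cephfs cephfs_metadata cephfs_data --force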

[ceph-users] Re: Stupid question about ceph fs volume

2024-01-25 Thread Eugen Block
I'm not sure if using EC as the default data pool for CephFS is still discouraged, as stated in the output when attempting to do that; the docs don't mention it (at least not in the link I sent in the last mail): ceph:~ # ceph fs new cephfs cephfs_metadata cephfs_data Error EINVAL: pool 'cephf

[ceph-users] Re: cephfs-top causes 16 mgr modules have recently crashed

2024-01-25 Thread Özkan Göksu
Hello Jos. I checked the diff and noticed the difference: https://github.com/ceph/ceph/pull/52127/files Thank you for the guide link and for the fix. Have a great day. Regards. On Tue, 23 Jan 2024 at 11:07, Jos Collin wrote: > This fix is in the mds. > I think you need to read > https:/

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-25 Thread Özkan Göksu
Hello Eugen. I read all of your MDS-related topics and thank you so much for your effort on this. There is not much information and I couldn't find an MDS tuning guide at all. It seems that you are the right person to discuss MDS debugging and tuning with. Do you have any documents or may I learn w

[ceph-users] Re: Questions about the CRUSH details

2024-01-25 Thread Janne Johansson
On Thu, 25 Jan 2024 at 11:57, Henry lol wrote: > > It's reasonable enough. > Actually, I expected the client to have just thousands of > "PG-to-OSDs" mappings. Yes, but filename to PG is done with a pseudorandom algo. > Nevertheless, it's so heavy that the client calculates the location on > deman
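
To see the on-demand calculation being described here, the cluster can be asked to compute the mapping for any object name (pool and object names below are only examples):

  # shows the PG the object hashes to and its up/acting OSD set
  ceph osd map mypool myobject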

[ceph-users] Re: Scrubbing?

2024-01-25 Thread Jan Marek
Hello Sridhar, On Thu, Jan 25, 2024 at 09:53:26 CET, Sridhar Seshasayee wrote: > Hello Jan, > > The meaning of my previous post was that the Ceph cluster didn't fulfill > my needs and, although I had set the mClock profile to > "high_client_ops" (because I have plenty of time for rebalancing > and scrub

[ceph-users] Re: Stupid question about ceph fs volume

2024-01-25 Thread David C.
In case the root is EC, it is likely that it is not possible to apply the disaster recovery procedure (no xattr layout/parent on the data pool). Regards, *David CASIER* On Thu.

[ceph-users] Re: Stupid question about ceph fs volume

2024-01-25 Thread Eugen Block
Oh right, I forgot about that, good point! But if that is (still) true then this should definitely be in the docs as a warning for EC pools in CephFS! Quoting "David C.": In case the root is EC, it is likely that it is not possible to apply the disaster recovery procedure (no xattr layout/

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2024-01-25 Thread Adrien Georget
We are heavily impacted by this issue with the MGR in Pacific. This has to be fixed. As someone suggested in the issue tracker, we limited the memory usage of the MGR in the systemd unit (MemoryLimit=16G) in order to kill the MGR before it consumes all the memory of the server and impacts other serv
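
A sketch of the systemd limit described above; the exact unit name depends on how the MGR was deployed (ceph-mgr@<host>.service for package installs, ceph-<fsid>@mgr.<name>.service under cephadm), so adjust accordingly:

  systemctl edit ceph-mgr@$(hostname).service
  # add in the drop-in:
  #   [Service]
  #   MemoryLimit=16G
  systemctl daemon-reload
  systemctl restart ceph-mgr@$(hostname).service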

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-25 Thread Eugen Block
There is no definitive answer wrt MDS tuning. As is mentioned everywhere, it's about finding the right setup for your specific workload. If you can synthesize your workload (maybe scaled down a bit), try optimizing it in a test cluster without interrupting your developers too much. But wha
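
There is no single recipe, but a sketch of the knobs most commonly looked at for cache-pressure issues follows; the values are examples, not recommendations, and <name> is a placeholder for the active MDS:

  # current cache memory target (default is 4 GiB)
  ceph config get mds mds_cache_memory_limit
  # example: raise to 8 GiB if the MDS host has RAM to spare
  ceph config set mds mds_cache_memory_limit 8589934592
  # inspect live cache/caps counters on the MDS host
  ceph daemon mds.<name> perf dump mds_mem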

[ceph-users] Re: Stupid question about ceph fs volume

2024-01-25 Thread David C.
It would be a pleasure to complete the documentation, but we would need to test or have someone confirm what I have assumed. Concerning the warning, I think we should not talk about the disaster recovery procedure. While the recovery procedure has already saved some entities, it has also put entities at r

[ceph-users] TLS 1.2 for dashboard

2024-01-25 Thread Sake Ceph
After upgrading to 17.2.7 our load balancers can't check the status of the manager nodes for the dashboard. After some troubleshooting I noticed only TLS 1.3 is available for the dashboard. Looking at the source (quincy), the TLS config got changed from 1.2 to 1.3. Searching in the tracker I found
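
A quick way to verify which protocol versions the dashboard endpoint actually accepts, assuming it listens on the default port 8443 (replace mgr-host accordingly):

  # handshake succeeds only if TLS 1.2 is still offered
  openssl s_client -connect mgr-host:8443 -tls1_2 </dev/null
  # same check for TLS 1.3
  openssl s_client -connect mgr-host:8443 -tls1_3 </dev/null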

[ceph-users] Re: TLS 1.2 for dashboard

2024-01-25 Thread Nizamudeen A
Hi, I'll re-open the PR and will merge it to Quincy. Btw, I want to know if the load balancers will support TLS 1.3 in the future, because we were planning to completely drop TLS 1.2 support from the dashboard for security reasons. (But so far we are planning to keep it as it is at least for

[ceph-users] Re: TLS 1.2 for dashboard

2024-01-25 Thread Sake Ceph
Hi Nizamudeen, Thank you for your quick response! The load balancers support TLS 1.3, but the administrators need to reconfigure the health checks. The only problem is that it's a global change for all load balancers... so it's not something they change overnight; they need to plan and test for it. Best regards,

[ceph-users] Re: TLS 1.2 for dashboard

2024-01-25 Thread Nizamudeen A
Ah okay, thanks for the clarification. In that case, we'll probably need to keep this 1.2 fix for Squid, I guess. I'll check and will update as necessary. On Thu, Jan 25, 2024, 20:12 Sake Ceph wrote: > Hi Nizamudeen, > > Thank you for your quick response! > > The load balancers support TLS 1.3,

[ceph-users] Re: TLS 1.2 for dashboard

2024-01-25 Thread Sake Ceph
I would say drop it for the Squid release; or if you keep it in Squid but are going to disable it in a minor release later, please make a note in the release notes when the option is being removed. Just my 2 cents :) Best regards, Sake

[ceph-users] Re: TLS 1.2 for dashboard

2024-01-25 Thread Nizamudeen A
Understood, thank you. On Thu, Jan 25, 2024, 20:24 Sake Ceph wrote: > I would say drop it for the Squid release; or if you keep it in Squid but > are going to disable it in a minor release later, please make a note in the > release notes when the option is being removed. > Just my 2 cents :) > > Best rega

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-25 Thread Özkan Göksu
I will try my best to explain my situation. I don't have a separate MDS server. I have 5 identical nodes, 3 of them are mons, and I use the other 2 as active and standby MDS. (Currently I have leftovers from max_mds 4.) root@ud-01:~# ceph -s cluster: id: e42fd4b0-313b-11ee-9a00-31da71873773

[ceph-users] RGW crashes when rgw_enable_ops_log is enabled

2024-01-25 Thread Marc Singer
Hi Ceph Users, I am encountering a problem with the RGW Admin Ops Socket. I am setting up the socket as follows: rgw_enable_ops_log = true rgw_ops_log_socket_path = /tmp/ops/rgw-ops.socket rgw_ops_log_data_backlog = 16Mi It seems like the socket fills up over time and doesn't get flush

[ceph-users] Re: cephadm discovery service certificate absent after upgrade.

2024-01-25 Thread Nicolas FOURNIL
Gotcha! I've got the point. After restarting the CA certificate creation with: ceph restful create-self-signed-cert I get this error: Module 'cephadm' has failed: Expected 4 octets in 'fd30:::0:1101:2:0:501' *Ouch, 4 octets = an IPv4 address is expected... some nice code in perspective.* I

[ceph-users] Re: RGW crashes when rgw_enable_ops_log is enabled

2024-01-25 Thread Matt Benjamin
Hi Marc, The ops log code is designed to discard data if the socket is flow-controlled, iirc. Maybe we just need to handle the signal. Of course, you should have something consuming data on the socket, but it's still a problem if radosgw exits unexpectedly. Matt On Thu, Jan 25, 2024 at 10:08 A

[ceph-users] Re: cephadm discovery service certificate absent after upgrade.

2024-01-25 Thread David C.
It would be cool, actually, to have the metrics working in 18.2.2 for IPv6-only. Otherwise, everything works fine on my side. Regards, *David CASIER* On Thu, 25 Jan 2024 at

[ceph-users] Re: RGW crashes when rgw_enable_ops_log is enabled

2024-01-25 Thread Marc Singer
Hi, I am using a unix socket client to connect to it and read the data from it. Do I need to do anything like signal the socket that this data has been read? Or am I not reading fast enough and data is backing up? What I am also noticing is that at some point (probably after something with the
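
For what it's worth, a minimal sketch of a consumer that keeps the ops-log socket drained, assuming the socket path from the original post and that socat is available (this is not necessarily how Marc's client works):

  # connect to the ops-log socket and append every record to a file
  socat -u UNIX-CONNECT:/tmp/ops/rgw-ops.socket - >> /var/log/rgw-ops.log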

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-25 Thread Eugen Block
I understand that your MDS shows a high CPU usage, but other than that what is your performance issue? Do users complain? Do some operations take longer than expected? Are OSDs saturated during those phases? Because the cache pressure messages don’t necessarily mean that users will notice.

[ceph-users] Re: Questions about the CRUSH details

2024-01-25 Thread Henry lol
Oh! That's why data imbalance occurs in Ceph. I totally misunderstood Ceph's placement algorithm until just now. Thank you very much for your detailed explanation :) Sincerely, On Thu, 25 Jan 2024 at 9:32 PM, Janne Johansson wrote: > > On Thu, 25 Jan 2024 at 11:57, Henry lol wrote: > > > > It's reasonab

[ceph-users] Re: Questions about the CRUSH details

2024-01-25 Thread Robert Sander
On 1/25/24 13:32, Janne Johansson wrote: It doesn't take OSD usage into consideration except at creation time or OSD in/out/reweighing (or manual displacements with upmap and so forth), so this is why "ceph df" will tell you a pool has X free space, where X is "smallest free space on the OSDs on

[ceph-users] podman / docker issues

2024-01-25 Thread Marc
More and more I am annoyed by the 'dumb' design decisions of Red Hat. Just now I have an issue on an 'air-gapped' VM where I am unable to start a docker/podman container, because it tries to contact the repository to update the image and, instead of using the on-disk image, it just fails. (Not to m
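
For containers started by hand, the pull behaviour can be pinned so that only the local image is used and no registry is contacted (image name below is just an example):

  # fail immediately instead of reaching out to a registry
  podman run --pull=never --rm quay.io/ceph/ceph:v17.2.7 ceph --version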

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-25 Thread Özkan Göksu
Every user has 1x subvolume and I only have 1 pool. At the beginning we were using each subvolume for the LDAP home directory + user data. When a user logged in to any docker container on any host, it was using the cluster for home, and for user-related data we had a second directory in the same subvolume. Ti

[ceph-users] Re: Questions about the CRUSH details

2024-01-25 Thread Janne Johansson
On Thu, 25 Jan 2024 at 17:47, Robert Sander wrote: > > forth), so this is why "ceph df" will tell you a pool has X free > > space, where X is "smallest free space on the OSDs on which this pool > > lies, times the number of OSDs". Given the pseudorandom placement of > > objects to PGs, there is n

[ceph-users] Re: podman / docker issues

2024-01-25 Thread Kai Stian Olstad
On 25.01.2024 18:19, Marc wrote: More and more I am annoyed by the 'dumb' design decisions of Red Hat. Just now I have an issue on an 'air-gapped' VM where I am unable to start a docker/podman container, because it tries to contact the repository to update the image and, instead of using the on-d

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-25 Thread Özkan Göksu
These are client-side metrics from a client warned with "failing to respond to cache pressure". root@datagen-27:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1282187# cat bdi/stats BdiWriteback: 0 kB BdiReclaimable: 0 kB BdiDirtyThresh: 0 kB Di

[ceph-users] Re: podman / docker issues

2024-01-25 Thread Daniel Brown
For the OP - IBM appears to have some relevant info in their Ceph docs: https://www.ibm.com/docs/en/storage-ceph/5?topic=cluster-performing-disconnected-installation Questions: Is it possible to reset “container_image” after the cluster has been deployed? sudo ceph config dump | grep conta
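
On the container_image question: the image cephadm uses for newly created daemons comes from that config key, so one approach (assuming a reachable local registry; the registry name is illustrative) would be:

  ceph config get global container_image
  # point cephadm at a mirror inside the air-gapped network
  ceph config set global container_image registry.local:5000/ceph/ceph:v17.2.7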

[ceph-users] Re: Throughput metrics missing iwhen updating Ceph Quincy to Reef

2024-01-25 Thread Eugen Block
Hi, I got those metrics back after setting: reef01:~ # ceph config set mgr mgr/prometheus/exclude_perf_counters false reef01:~ # curl http://localhost:9283/metrics | grep ceph_osd_op | head % Total% Received % Xferd Average Speed TimeTime Time Current

[ceph-users] Re: Throughput metrics missing iwhen updating Ceph Quincy to Reef

2024-01-25 Thread Eugen Block
Yeah, it's mentioned in the upgrade docs [2]: Monitoring & Alerting Ceph-exporter: Now the performance metrics for Ceph daemons are exported by ceph-exporter, which deploys on each daemon rather than using prometheus exporter. This will reduce performance bottlenecks. [2] https:/

[ceph-users] Re: Throughput metrics missing iwhen updating Ceph Quincy to Reef

2024-01-25 Thread Eugen Block
Ah, there they are (different port): reef01:~ # curl http://localhost:9926/metrics | grep ceph_osd_op | head % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 100 124k 100 124k0

[ceph-users] Re: Questions about the CRUSH details

2024-01-25 Thread Anthony D'Atri
> >>> forth), so this is why "ceph df" will tell you a pool has X free >>> space, where X is "smallest free space on the OSDs on which this pool >>> lies, times the number of OSDs". To be even more precise, this depends on the failure domain. With the typical "rack" failure domain, say you u

[ceph-users] 6 pgs not deep-scrubbed in time

2024-01-25 Thread Michel Niyoyita
Hello team, I have a cluster in production composed of 3 OSD servers with 20 disks each, deployed using ceph-ansible and Ubuntu OS, and the version is Pacific. These days it is in WARN state caused by PGs which are not deep-scrubbed in time. I tried to deep-scrub some PGs manually but it seems that

[ceph-users] Re: 6 pgs not deep-scrubbed in time

2024-01-25 Thread E Taka
We had the same problem. It turned out that one disk was slowly dying. It was easy to identify with these commands (in your case): ceph pg dump | grep -F 6.78 ceph pg dump | grep -F 6.60 … This command shows the OSDs of a PG in square brackets. If the same OSD number always shows up there, then you've found th
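
A slightly more direct way to list the up/acting OSDs of the affected PGs, using the PG IDs from this thread (output format varies a little between releases):

  for pg in 6.78 6.60; do ceph pg map $pg; done
  # a deep scrub can also be requested manually per PG
  ceph pg deep-scrub 6.78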

[ceph-users] Re: 6 pgs not deep-scrubbed in time

2024-01-25 Thread Michel Niyoyita
It seems they are different OSDs, as shown here. How have you managed to sort this out? ceph pg dump | grep -F 6.78 dumped all 6.78 44268 0 0 00 1786796401180 0 10099 10099 active+clean 2024-01-26T03:51:26.781438+0200 1