[ceph-users] Re: mgr module 'orchestrator' is not enabled/loaded
Hi Alex,

Were you upgrading to 19.2.0? There should be a fix available in 19.2.1 for the issue.

Best,
Laimis J.

> On 4 Mar 2025, at 12:30, Alex from North wrote:
>
> found the answer here

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Module 'devicehealth' has failed
Hello everybody!

Running 19.2.0, I ran into an issue I still can't get past. It is:

Module 'devicehealth' has failed: Expecting value: line 1 column 2378 (char 2377)

In the MGR log I see:

Mar 04 12:48:07 node2.ec.mts ceph-mgr[3821449]: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 399, in serve
    self._do_serve()
  File "/usr/share/ceph/mgr/mgr_module.py", line 549, in check
    return func(self, *args, **kwargs)
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 355, in _do_serve
    if self.db_ready() and self.enable_monitoring:
  File "/usr/share/ceph/mgr/mgr_module.py", line 1286, in db_ready
    return self.db is not None
  File "/usr/share/ceph/mgr/mgr_module.py", line 1298, in db
    self._db = self.open_db()
  File "/usr/share/ceph/mgr/mgr_module.py", line 1265, in open_db
    if not self.pool_exists(self.MGR_POOL_NAME):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1106, in pool_exists
    pools = [p['pool_name'] for p in self.get('osd_map')['pools']]
  File "/usr/share/ceph/mgr/mgr_module.py", line 1416, in get
    obj = json.loads(obj)
  File "/lib64/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/lib64/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/lib64/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 2378 (char 2377)

Unfortunately, I do not know how to turn devicehealth off, as disabling it is forbidden even via standard commands like ceph device monitoring off.

Any hint is appreciated. Thanks in advance, guys!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
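The traceback above ends in json.loads() choking on the OSD map that the mgr fetches for its pool check, so a useful first step is to pull the same map out of band and see whether it parses, and what sits around the reported offset. Below is a rough, hedged sketch only (it assumes the ceph CLI and an admin keyring on the node; the mgr reads the map through its internal interface rather than the CLI, but a malformed pool entry would normally show up in both):

#!/usr/bin/env python3
"""Check whether the OSD map JSON parses; on failure, show context around the bad offset."""
import json
import subprocess

# Dump the OSD map as JSON, roughly what devicehealth's pool_exists() walks through.
raw = subprocess.run(
    ["ceph", "osd", "dump", "--format", "json"],
    check=True, capture_output=True, text=True,
).stdout

try:
    osdmap = json.loads(raw)
    print("OSD map JSON parses fine; pools:",
          [p["pool_name"] for p in osdmap["pools"]])
except json.JSONDecodeError as e:
    # Print a window around the character the decoder complained about
    # (char 2377 in the traceback from the mgr log).
    start, end = max(0, e.pos - 80), e.pos + 80
    print(f"JSON error at char {e.pos}: {e.msg}")
    print("context:", raw[start:end])

If the dump parses cleanly here but the mgr still fails, the problem is more likely in what the mgr itself receives (for example a pool or device-class name that breaks its JSON encoding) than in the devicehealth module itself.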
[ceph-users] Re: When 18.2.5 will be released?
It's a good question. Any news?

From: wojiaowugen
Sent: Friday, December 6, 2024 8:07 AM
To: ceph-users@ceph.io
Subject: [ceph-users] When 18.2.5 will be released?

Hi, everyone,
It's an honor to ask questions here. Can I ask when 18.2.5 will be released?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Module 'devicehealth' has failed
Do you have a pool named ".mgr"?

Zitat von Alex from North:

Hello everybody!

Running 19.2.0, I ran into an issue I still can't get past. It is:

Module 'devicehealth' has failed: Expecting value: line 1 column 2378 (char 2377)

In the MGR log I see:

Mar 04 12:48:07 node2.ec.mts ceph-mgr[3821449]: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 399, in serve
    self._do_serve()
  File "/usr/share/ceph/mgr/mgr_module.py", line 549, in check
    return func(self, *args, **kwargs)
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 355, in _do_serve
    if self.db_ready() and self.enable_monitoring:
  File "/usr/share/ceph/mgr/mgr_module.py", line 1286, in db_ready
    return self.db is not None
  File "/usr/share/ceph/mgr/mgr_module.py", line 1298, in db
    self._db = self.open_db()
  File "/usr/share/ceph/mgr/mgr_module.py", line 1265, in open_db
    if not self.pool_exists(self.MGR_POOL_NAME):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1106, in pool_exists
    pools = [p['pool_name'] for p in self.get('osd_map')['pools']]
  File "/usr/share/ceph/mgr/mgr_module.py", line 1416, in get
    obj = json.loads(obj)
  File "/lib64/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/lib64/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/lib64/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 2378 (char 2377)

Unfortunately, I do not know how to turn devicehealth off, as disabling it is forbidden even via standard commands like ceph device monitoring off. Any hint is appreciated. Thanks in advance, guys!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Module 'devicehealth' has failed
Yes, I do:

.mgr   10   1   769 KiB   2   2.3 MiB   0   4.7 PiB

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Severe Latency Issues in Ceph Cluster
> it's not Ceph but the network

It's almost always the network ;-)

Ramin: This reminds me of an outage we had at CERN caused by routing / ECMP / a faulty line card. One of the main symptoms of that is high TCP retransmits on the Ceph nodes.

Basically, OSDs keep many connections open with each other, with different src/dst port combinations. If your cluster has OSD hosts connected across routers, then you're likely using ECMP, and each connection's src/dst ip/port combination takes a different path (different routers, different line cards). Then what happens is that if one line card is faulty -- which is often difficult to alert on -- some of the connections will work, but some will not. This is visible in the host retransmit counters, and it causes OSDs to flap up and down or other badness.

One quick way to diagnose if this is the root cause here is to use netcat to try to connect between two ceph hosts using a range of source ports. E.g., assuming you can ssh from one OSD host to another, do this from one ceph host:

echo {2..20050} | xargs -t -n1 -I{} nc -z -p {} <other-ceph-host> 22

If all your network paths are okay, you'll get something like in the PS. If some paths are broken, you'll get errors! (A Python equivalent of the same sweep is sketched after the quoted thread below.)

Hope that helps.

-- dan

bash-5.2$ echo {2..20050} | xargs -t -n1 -I{} nc -z -p {} 192.168.1.248 22
nc -z -p 2 192.168.1.248 22
Connection to 192.168.1.248 port 22 [tcp/ssh] succeeded!
nc -z -p 20001 192.168.1.248 22
Connection to 192.168.1.248 port 22 [tcp/ssh] succeeded!
nc -z -p 20002 192.168.1.248 22
Connection to 192.168.1.248 port 22 [tcp/ssh] succeeded!
nc -z -p 20003 192.168.1.248 22
Connection to 192.168.1.248 port 22 [tcp/ssh] succeeded!
nc -z -p 20004 192.168.1.248 22
Connection to 192.168.1.248 port 22 [tcp/ssh] succeeded!
nc -z -p 20005 192.168.1.248 22
Connection to 192.168.1.248 port 22 [tcp/ssh] succeeded!
...

--
Dan van der Ster
Ceph Executive Council | CTO @ CLYSO
Try our Ceph Analyzer -- https://analyzer.clyso.com/
https://clyso.com | dan.vanders...@clyso.com

On Tue, Mar 4, 2025 at 12:08 AM Eugen Block wrote:
>
> A few years ago, one of our customers complained about latency issues.
> We investigated and the only real evidence we found were also high
> retransmit values. So we recommended to let their network team look
> into it. For months they refused to do anything, until they hired
> another company to investigate the network. It was a network issue,
> basically all cabling was replaced. I don't recall anymore if switches
> and other components were replaced as well, but it definitely was
> resolved after that. So if you ask me, I'd say it's not Ceph but the
> network. ;-)
>
> Zitat von Ramin Najjarbashi:
>
> > The Ceph version is 17.2.7.
> >
> > • OSDs are a mix of SSD and HDD, with DB/WAL colocated on the same OSDs.
> > • SSDs are used for metadata and index pools with replication 3.
> > • HDDs store the data pool using EC 4+2.
> >
> > Interestingly, the same issue has appeared on another cluster where DB/WAL
> > is placed on NVMe disks, but the pool distribution is the same: meta and
> > index on SSDs, and data on HDDs.
> >
> > It seems to be network-related, as I've checked the interfaces, and there
> > are no obvious hardware or connectivity issues. However, we're still seeing
> > a high number of retransmissions and duplicate packets on the network.
> >
> > Let me know if you have any insights or suggestions.
> >
> > On Mon, Mar 3, 2025 at 12:36 Stefan Kooman wrote:
> >
> >> On 01-03-2025 15:10, Ramin Najjarbashi wrote:
> >> > Hi
> >> > We are currently facing severe latency issues in our Ceph cluster,
> >> > particularly affecting read and write operations. At times, write
> >> > operations completely stall, leading to significant service degradation.
> >> > Below is a detailed breakdown of the issue, our observations, and the
> >> > mitigation steps we have taken so far. We would greatly appreciate any
> >> > insights or suggestions.
> >>
> >> What ceph version?
> >>
> >> How are OSDs provisioned (WAL+DB, single OSD, etc.). Type of disks.
> >>
> >> Gr. Stefan
> >>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
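For readers who want to run the same source-port sweep without netcat, here is a rough Python equivalent (a sketch only; the destination host argument is a placeholder you supply, and port 22 is assumed to be reachable on the peer, as in Dan's example):

#!/usr/bin/env python3
"""Sweep client source ports against one destination to spot broken ECMP paths."""
import socket
import sys

dest_host = sys.argv[1]   # e.g. another OSD host (placeholder, passed on the command line)
dest_port = 22            # any port known to be open on the peer; ssh in this example

failures = []
for src_port in range(20000, 20051):            # a modest range of source ports
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.settimeout(2)
    try:
        s.bind(("", src_port))                  # pin the source port so each try can hash onto a different path
        s.connect((dest_host, dest_port))
    except OSError as e:
        failures.append((src_port, e))
    finally:
        s.close()

for port, err in failures:
    print(f"src port {port}: FAILED ({err})")
if not failures:
    print("all source ports connected fine")

With ECMP, each src/dst ip/port combination can hash onto a different physical path, so a handful of source ports that consistently fail against an otherwise reachable host is exactly the faulty-line-card signature described above.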
[ceph-users] Ceph cluster unable to read/write data properly and cannot recover normally.
Hi,

I'm facing a critical issue with my Ceph cluster. It has become unable to read/write data properly and cannot recover normally. What steps should I take to resolve this?

[root@ceph-node1 ~]# ceph -s
  cluster:
    id:     76956086-25f5-445d-a49e-b7824393c17b
    health: HEALTH_WARN
            1 pools have many more objects per pg than average
            102131233/124848552 objects misplaced (81.804%)
            Reduced data availability: 40 pgs inactive
            Degraded data redundancy: 6402821/124848552 objects degraded (5.128%), 11 pgs degraded, 31 pgs undersized

  services:
    mon: 3 daemons, quorum ceph-node1,ceph-node2,ceph-node3
    mgr: ceph-node2(active), standbys: ceph-node3, ceph-node1
    osd: 6 osds: 6 up, 6 in; 169 remapped pgs

  data:
    pools:   7 pools, 216 pgs
    objects: 41.62 M objects, 2.6 TiB
    usage:   18 TiB used, 12 TiB / 30 TiB avail
    pgs:     12.500% pgs unknown
             6.019% pgs not active
             6402821/124848552 objects degraded (5.128%)
             102131233/124848552 objects misplaced (81.804%)
             143 active+clean+remapped
             27  unknown
             15  active+undersized+remapped
             13  active+clean
             9   undersized+degraded+peered
             4   undersized+peered
             2   active+undersized+degraded
             1   active+undersized
             1   active+remapped+backfilling
             1   active+clean+remapped+scrubbing+deep

  io:
    recovery: 1.3 MiB/s, 20 objects/s

[root@ceph-node1 ~]# ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE    USE     DATA    OMAP    META    AVAIL   %USE  VAR  PGS TYPE NAME
-1       30.0   -        30 TiB  18 TiB  17 TiB  125 GiB 302 GiB 12 TiB  58.88 1.00   - root default
-3       10.0   -        10 TiB  5.9 TiB 5.8 TiB 38 GiB  101 GiB 4.1 TiB 59.11 1.00   -     host ceph-node1
 0   hdd  5.0   0.21257  5.0 TiB 2.3 TiB 2.2 TiB 25 GiB  41 GiB  2.7 TiB 46.21 0.78  61         osd.0
 3   hdd  5.0   0.13638  5.0 TiB 3.6 TiB 3.5 TiB 13 GiB  60 GiB  1.4 TiB 72.02 1.22  84         osd.3
-5       10.0   -        10 TiB  6.2 TiB 6.1 TiB 42 GiB  105 GiB 3.8 TiB 62.40 1.06   -     host ceph-node2
 1   hdd  5.0   0.13644  5.0 TiB 3.6 TiB 3.5 TiB 11 GiB  61 GiB  1.4 TiB 71.99 1.22  95         osd.1
 4   hdd  5.0   0.1      5.0 TiB 2.6 TiB 2.6 TiB 31 GiB  44 GiB  2.4 TiB 52.81 0.90 104         osd.4
-7       10.0   -        10 TiB  5.5 TiB 5.4 TiB 45 GiB  96 GiB  4.5 TiB 55.11 0.94   -     host ceph-node3
 2   hdd  5.0   0.1      5.0 TiB 3.5 TiB 3.4 TiB 22 GiB  62 GiB  1.5 TiB 70.19 1.19  79         osd.2
 5   hdd  5.0   0.21664  5.0 TiB 2.0 TiB 1.9 TiB 23 GiB  34 GiB  3.0 TiB 40.04 0.68 100         osd.5
                  TOTAL  30 TiB  18 TiB  17 TiB  125 GiB 302 GiB 12 TiB  58.88
MIN/MAX VAR: 0.68/1.22  STDDEV: 13.38

Thanks~

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
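With 27 PGs unknown and several undersized/peered, the first thing to pin down is which PGs are stuck and which OSDs they map to. A small, hedged sketch (assuming the ceph CLI with an admin keyring, e.g. run on ceph-node1) that pulls this from the JSON output:

#!/usr/bin/env python3
"""List stuck (inactive) PGs and the OSDs they map to, as a starting point for recovery."""
import json
import subprocess

def ceph_json(*args):
    out = subprocess.run(["ceph", *args, "--format", "json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)

stuck = ceph_json("pg", "dump_stuck", "inactive")
# Depending on the release this is either a bare list or wrapped in "stuck_pg_stats".
pg_stats = stuck.get("stuck_pg_stats", []) if isinstance(stuck, dict) else stuck

for pg in pg_stats:
    print(pg["pgid"], pg["state"], "up:", pg.get("up"), "acting:", pg.get("acting"))

From there, ceph pg <pgid> query on each stuck PG usually shows what it is waiting for (peering, a down OSD, a backfill reservation, and so on).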
[ceph-users] mgr module 'orchestrator' is not enabled/loaded
Hi all!

I've run into a sad situation and I don't know where to dig, which is why I am here again, hoping for hints.

Situation: all the "orch" commands aren't available and give this error in the mgr log:

mgr.server reply reply (95) Operation not supported Module 'orchestrator' is not enabled/loaded (required by command 'orch ls'): use `ceph mgr module enable orchestrator` to enable it

Mar 04 09:03:22 node2.ec.mts conmon[1238298]: 2025-03-04T09:03:22.842+ 7f0659aed640 -1 mgr.server reply reply (95) Operation not supported Module 'orchestrator' is not enabled/loaded (required by command 'orch ls'): use `ceph mgr module enable orchestrator` to enable it

Of course I've tried to enable it as suggested, but it is already enabled (always-on). Tried to fail/systemd-restart the mgr -- all for nothing.

As I understand it, the orchestrator is also responsible for listening on port 8765 to give Prometheus the needed info. So I also have a "devicehealth" error, as nothing is listening on port 8765 where Prometheus tries to knock.

Thinking that maybe a redeploy would help, and having no possibility to use the orchestrator, I tried to deploy a new manager manually -- same story, orch is not available.

Maybe the orchestrator module is in a crash loop? Where do I check the logs? There is nothing in /var/log/ceph/cephadm.log and nothing in the mgr log.

I am really stuck. Any hint appreciated. Thanks in advance.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
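When orch commands fail like this even though the module is always-on, it can help to look at what the active mgr reports about its modules and whether anything has crashed recently. A hedged sketch (assuming the ceph CLI still works against the monitors):

#!/usr/bin/env python3
"""Show mgr module state and recent crash reports to spot a module that refuses to load."""
import json
import subprocess

def ceph_json(*args):
    out = subprocess.run(["ceph", *args, "--format", "json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)

mods = ceph_json("mgr", "module", "ls")
print("always-on modules:", mods.get("always_on_modules"))
print("enabled modules:  ", mods.get("enabled_modules"))
print("disabled modules: ", mods.get("disabled_modules"))

# A module that dies on load usually leaves a crash report behind.
for crash in ceph_json("crash", "ls"):
    print(crash.get("crash_id"), crash.get("entity_name"))

If orchestrator shows up as always-on yet every orch command still returns error 95, failing over the mgr (ceph mgr fail) and then watching the new active mgr's log from startup usually surfaces the traceback of whichever module failed to import; that tends to be more informative than cephadm.log.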
[ceph-users] Re: Severe Latency Issues in Ceph Cluster
A few years ago, one of our customers complained about latency issues. We investigated and the only real evidence we found were also high retransmit values. So we recommended to let their network team look into it. For months they refused to do anything, until they hired another company to investigate the network. It was a network issue, basically all cabling was replaced. I don't recall anymore if switches and other components were replaced as well, but it definitely was resolved after that. So if you ask me, I'd say it's not Ceph but the network. ;-)

Zitat von Ramin Najjarbashi:

The Ceph version is 17.2.7.

• OSDs are a mix of SSD and HDD, with DB/WAL colocated on the same OSDs.
• SSDs are used for metadata and index pools with replication 3.
• HDDs store the data pool using EC 4+2.

Interestingly, the same issue has appeared on another cluster where DB/WAL is placed on NVMe disks, but the pool distribution is the same: meta and index on SSDs, and data on HDDs.

It seems to be network-related, as I've checked the interfaces, and there are no obvious hardware or connectivity issues. However, we're still seeing a high number of retransmissions and duplicate packets on the network.

Let me know if you have any insights or suggestions.

On Mon, Mar 3, 2025 at 12:36 Stefan Kooman wrote:

On 01-03-2025 15:10, Ramin Najjarbashi wrote:
> Hi
> We are currently facing severe latency issues in our Ceph cluster,
> particularly affecting read and write operations. At times, write
> operations completely stall, leading to significant service degradation.
> Below is a detailed breakdown of the issue, our observations, and the
> mitigation steps we have taken so far. We would greatly appreciate any
> insights or suggestions.

What ceph version?

How are OSDs provisioned (WAL+DB, single OSD, etc.). Type of disks.

Gr. Stefan

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: mgr module 'orchestrator' is not enabled/loaded
found the answer here https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/QJHES7GKTI6O7BT6UBGCHK6WFTJRNJHE/

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Severe Latency Issues in Ceph Cluster
On Tue, Mar 04, 2025 at 06:46:20PM +, Eugen Block wrote:
> > It's almost always the network ;-)
>
> I know, I have memorized your famous tweet about Ceph being the best network
> monitor 😄

It seems to be ;-)

When I spun up my small cluster, I used a noname 10G switch. Ceph complained bitterly about massive latencies across the board, mostly around one node (hosting OSD, MDS, monitors).

Running iperf3 between nodes:
- A and B: 10G line speed
- A and C: barely 10 MBit/s ... but ethtool says 10G ... WAT?

Checked the switch management UI: turns out, one of the ports seemed to be bad, with the error counters going brrr at high speed.

Replacing the switch with a Mikrotik one (keeping the same NICs and DAC cables) solved the problem.

Kind regards,
Alex.

--
"Opportunity is missed by most people because it is dressed in overalls and looks like work." -- Thomas A. Edison

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
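The per-port error counters that gave the switch away can also be watched from the host side, without the switch UI. A small sketch (the interface name eth0 is just an assumed placeholder, and not every driver exposes every counter):

#!/usr/bin/env python3
"""Print NIC error/drop counters from /sys/class/net for a quick link-health check."""
import sys
from pathlib import Path

iface = sys.argv[1] if len(sys.argv) > 1 else "eth0"   # interface name: adjust for your host
stats_dir = Path("/sys/class/net") / iface / "statistics"

counters = ["rx_errors", "tx_errors", "rx_dropped", "tx_dropped",
            "rx_crc_errors", "rx_missed_errors"]

for name in counters:
    path = stats_dir / name
    if path.exists():                   # some drivers omit individual counters
        print(f"{name:>18}: {path.read_text().strip()}")

Counters that keep climbing on one host while iperf3 between other node pairs stays clean point at that port, cable, or transceiver rather than at Ceph itself.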
[ceph-users] Upgrade: 5 pgs have unknown state; cannot draw any conclusions
Hi,

we have a cluster in which we have lost the OSDs containing the 5 PGs. How should we proceed to get the cluster working again? Of course we will lose data. We cannot upgrade or downgrade while PGs are in an unknown state.

Best regards

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
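Not an authoritative recipe, but the usual (and destructive) sequence when the OSDs behind a PG are gone for good is: confirm nothing serves the PG, mark the dead OSDs lost, then recreate the PG empty. A hedged sketch of the inspection step, with the destructive command left commented out and the PG IDs as placeholders:

#!/usr/bin/env python3
"""Inspect unknown PGs whose OSDs are permanently lost; recreating them DESTROYS their data."""
import subprocess

# Placeholder PG IDs -- take the real ones from `ceph pg dump_stuck inactive`.
unknown_pgs = ["1.2f", "3.1a"]

for pgid in unknown_pgs:
    # `ceph pg <id> query` tends to hang when no OSD serves the PG, so cap it with a timeout.
    try:
        res = subprocess.run(["ceph", "pg", pgid, "query", "--format", "json"],
                             capture_output=True, text=True, timeout=10)
        print(pgid, "query exit code:", res.returncode)
    except subprocess.TimeoutExpired:
        print(pgid, "query timed out -- no OSD appears to be serving it")

# Only after marking the dead OSDs lost
# (`ceph osd lost <osd-id> --yes-i-really-mean-it`) and accepting the data loss,
# the PGs can be recreated empty:
#   ceph osd force-create-pg <pgid> --yes-i-really-mean-it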
[ceph-users] Re: Severe Latency Issues in Ceph Cluster
> It's almost always the network ;-)

I know, I have memorized your famous tweet about Ceph being the best network monitor 😄 and there hasn't been a single week that I haven't thought about that. 🙂

Zitat von Dan van der Ster:

> it's not Ceph but the network

It's almost always the network ;-)

Ramin: This reminds me of an outage we had at CERN caused by routing / ECMP / a faulty line card. One of the main symptoms of that is high TCP retransmits on the Ceph nodes.

Basically, OSDs keep many connections open with each other, with different src/dst port combinations. If your cluster has OSD hosts connected across routers, then you're likely using ECMP, and each connection's src/dst ip/port combination takes a different path (different routers, different line cards). Then what happens is that if one line card is faulty -- which is often difficult to alert on -- some of the connections will work, but some will not. This is visible in the host retransmit counters, and it causes OSDs to flap up and down or other badness.

One quick way to diagnose if this is the root cause here is to use netcat to try to connect between two ceph hosts using a range of source ports. E.g., assuming you can ssh from one OSD host to another, do this from one ceph host:

echo {2..20050} | xargs -t -n1 -I{} nc -z -p {} <other-ceph-host> 22

If all your network paths are okay, you'll get something like in the PS. If some paths are broken, you'll get errors!

Hope that helps.

-- dan

bash-5.2$ echo {2..20050} | xargs -t -n1 -I{} nc -z -p {} 192.168.1.248 22
nc -z -p 2 192.168.1.248 22
Connection to 192.168.1.248 port 22 [tcp/ssh] succeeded!
nc -z -p 20001 192.168.1.248 22
Connection to 192.168.1.248 port 22 [tcp/ssh] succeeded!
nc -z -p 20002 192.168.1.248 22
Connection to 192.168.1.248 port 22 [tcp/ssh] succeeded!
nc -z -p 20003 192.168.1.248 22
Connection to 192.168.1.248 port 22 [tcp/ssh] succeeded!
nc -z -p 20004 192.168.1.248 22
Connection to 192.168.1.248 port 22 [tcp/ssh] succeeded!
nc -z -p 20005 192.168.1.248 22
Connection to 192.168.1.248 port 22 [tcp/ssh] succeeded!
...

--
Dan van der Ster
Ceph Executive Council | CTO @ CLYSO
Try our Ceph Analyzer -- https://analyzer.clyso.com/
https://clyso.com | dan.vanders...@clyso.com

On Tue, Mar 4, 2025 at 12:08 AM Eugen Block wrote:

A few years ago, one of our customers complained about latency issues. We investigated and the only real evidence we found were also high retransmit values. So we recommended to let their network team look into it. For months they refused to do anything, until they hired another company to investigate the network. It was a network issue, basically all cabling was replaced. I don't recall anymore if switches and other components were replaced as well, but it definitely was resolved after that. So if you ask me, I'd say it's not Ceph but the network. ;-)

Zitat von Ramin Najjarbashi:

> The Ceph version is 17.2.7.
>
> • OSDs are a mix of SSD and HDD, with DB/WAL colocated on the same OSDs.
> • SSDs are used for metadata and index pools with replication 3.
> • HDDs store the data pool using EC 4+2.
>
> Interestingly, the same issue has appeared on another cluster where DB/WAL
> is placed on NVMe disks, but the pool distribution is the same: meta and
> index on SSDs, and data on HDDs.
>
> It seems to be network-related, as I've checked the interfaces, and there
> are no obvious hardware or connectivity issues. However, we're still seeing
> a high number of retransmissions and duplicate packets on the network.
>
> Let me know if you have any insights or suggestions.
>
> On Mon, Mar 3, 2025 at 12:36 Stefan Kooman wrote:
>
>> On 01-03-2025 15:10, Ramin Najjarbashi wrote:
>> > Hi
>> > We are currently facing severe latency issues in our Ceph cluster,
>> > particularly affecting read and write operations. At times, write
>> > operations completely stall, leading to significant service degradation.
>> > Below is a detailed breakdown of the issue, our observations, and the
>> > mitigation steps we have taken so far. We would greatly appreciate any
>> > insights or suggestions.
>>
>> What ceph version?
>>
>> How are OSDs provisioned (WAL+DB, single OSD, etc.). Type of disks.
>>
>> Gr. Stefan
>>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io