[ceph-users] Re: Upgrade stalled after upgrading managers

2024-12-17 Thread Torkil Svensgaard
turning it off with ceph balancer off Best, Laimis J. On 17 Dec 2024, at 13:15, Torkil Svensgaard wrote: On 17/12/2024 12:05, Torkil Svensgaard wrote: Hi Running upgrade from 18.2.4 to 19.2.0 and it managed to upgrade the managers but no further progress. Now it actually seems to have upgraded
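The suggestion above (disabling the balancer before retrying) plus checking upgrade state; a minimal sketch, assuming a cephadm-managed cluster:
"
ceph balancer off
ceph orch upgrade status
"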

[ceph-users] Re: Upgrade stalled after upgrading managers

2024-12-17 Thread Torkil Svensgaard
On 17/12/2024 12:05, Torkil Svensgaard wrote: Hi Running upgrade from 18.2.4 to 19.2.0 and it managed to upgrade the managers but no further progress. Now it actually seems to have upgraded 1 MON now then the orchestrator crashed again: " { "mon": { "

[ceph-users] Upgrade stalled after upgrading managers

2024-12-17 Thread Torkil Svensgaard
[17/Dec/2024:10:43:11] ENGINE Bus STARTED
2024-12-17T10:43:11.964+ 7f70ebaf6640 0 log_channel(cephadm) log [INF] : [17/Dec/2024:10:43:11] ENGINE Bus STARTED
... " It will recover after some timeout, maybe 5-10 mins, and then just sit there with no upgrade progress. Nothing in mgr/ceph
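When an upgrade sits idle like this, a common nudge (standard cephadm tooling, not specific to this thread) is failing the manager over and resuming:
"
ceph mgr fail
ceph orch upgrade resume
ceph orch upgrade status
"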

[ceph-users] done, waiting for purge

2024-11-18 Thread Torkil Svensgaard
Hi 18.2.4 We had some hard drives going AWOL due to a failing SAS expander so I initiated "ceph orch host drain host". After a couple days I'm now looking at this: "
OSD  HOST   STATE                    PGS  REPLACE  FORCE  ZAP  DRAIN STARTED AT
528  gimpy  done, waiting for purge  0
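For context, the drain above follows the usual orchestrator flow; a sketch with the host name as a placeholder:
"
ceph orch host drain <host>
ceph orch osd rm status
"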

[ceph-users] Re: Error ENOENT: Module not found - ceph orch commands stopped working

2024-11-12 Thread Torkil Svensgaard
h mgr fail And then it hopefully works again. Indeed it did, thanks! =) Mvh. Torkil Zitat von Torkil Svensgaard : On 12-11-2024 09:29, Eugen Block wrote: Hi Torkil, Hi Eugen this sounds suspiciously like https://tracker.ceph.com/issues/67329 Do you have the same (or similar) stack trace in t

[ceph-users] Re: Error ENOENT: Module not found - ceph orch commands stopped working

2024-11-12 Thread Torkil Svensgaard
ar to me from the tracker how to recover though. The issue seems to be resolved, so should I be able to just pull new container images somehow? Mvh. Torkil Regards, Eugen Zitat von Torkil Svensgaard : Hi 18.2.4. After failing over the active manager ceph orch commands seem to have

[ceph-users] Error ENOENT: Module not found - ceph orch commands stopped working

2024-11-12 Thread Torkil Svensgaard
Hi 18.2.4. After failing over the active manager ceph orch commands seem to have stopped working. There's this in the mgr log: "
2024-11-12T08:16:30.136+ 7f1b2d887640 0 log_channel(audit) log [DBG] : from='client.2088861125 -' entity='client.admin' cmd=[{"prefix": "orch osd rm status"

[ceph-users] Re: Snapshot getting stuck

2024-08-14 Thread Torkil Svensgaard
o firewall logs. Joachim Am Di., 13. Aug. 2024 um 14:36 Uhr schrieb Eugen Block : Hi Torkil, did anything change in the network setup? If those errors haven't popped up before, what changed? I'm not sure if I have seen this one yet... Zitat von Torkil Svensgaard : Ceph version 18.2.1.

[ceph-users] Snapshot getting stuck

2024-08-08 Thread Torkil Svensgaard
Hypervisor <-> Palo Alto firewall <-> OpenBSD firewall <-> Ceph Any ideas? I haven't found anything in the ceph logs yet. Mvh. Torkil -- Torkil Svensgaard Sysadmin MR-Forskningssektionen, afs. 714 DRCMR, Danish Research Centre for Magnetic Resonance Hvidovre Hospital Kettegå

[ceph-users] Re: Resize RBD - New size not compatible with object map

2024-08-06 Thread Torkil Svensgaard
On 06/08/2024 12:37, Ilya Dryomov wrote: On Tue, Aug 6, 2024 at 11:55 AM Torkil Svensgaard wrote: Hi
[ceph: root@ceph-flash1 /]# rbd info rbd_ec/projects
rbd image 'projects':
    size 750 TiB in 196608000 objects
    order 22 (4 MiB objects)
    snapsho

[ceph-users] Resize RBD - New size not compatible with object map

2024-08-06 Thread Torkil Svensgaard
size not compatible with object map We can do 800T though:
[ceph: root@ceph-flash1 /]# rbd resize rbd_ec/projects --size 800T
Resizing image: 100% complete...done.
A problem with the --1024T notation? Or are we hitting some sort of size limit for RBD? Mvh. Torkil -- Torkil Svensgaard S
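Hedged note: rbd resize accepts unit suffixes (T, G, ...), so spelling the size out explicitly can sidestep notation ambiguity; a sketch reusing the image from the thread:
"
rbd resize rbd_ec/projects --size 800T
rbd info rbd_ec/projects
"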

[ceph-users] Re: OSD service specs in mixed environment

2024-06-28 Thread Torkil Svensgaard
and use models instead of sizes for everything not HDD but we have a lot of different models so as long as it's not broken this will do. Thanks for the suggestions! Mvh. Torkil Regards, Frédéric. - Le 26 Juin 24, à 8:48, Torkil Svensgaard tor...@drcmr.dk a écrit : Hi We have a

[ceph-users] Re: OSD service specs in mixed environment

2024-06-26 Thread Torkil Svensgaard
On 26/06/2024 08:48, Torkil Svensgaard wrote: Hi We have a bunch of HDD OSD hosts with DB/WAL on PCI NVMe, either 2 x 3.2TB or 1 x 6.4TB. We used to have 4 SSDs pr node for journals before bluestore and those have been repurposed for an SSD pool (wear level is fine). We've been usin

[ceph-users] OSD service specs in mixed environment

2024-06-25 Thread Torkil Svensgaard
ifier to be AND. I can do a osd.fast2 spec with size: 7000G: and change the db_devices size for osd.slow to something like 1000G:7000G but curious to see if anyone would have a different suggestion? Mvh. Torkil -- Torkil Svensgaard Sysadmin MR-Forskningssektionen, afs. 714 DRCMR, Danish Resea
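A sketch of the spec idea above, using the documented drivegroup size range filter (service name and bounds are illustrative):
"
service_type: osd
service_id: slow
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    size: '1000G:7000G'
"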

[ceph-users] Re: Safe to move misplaced hosts between failure domains in the crush tree?

2024-06-13 Thread Torkil Svensgaard
ceph kept going even though I panicked and flailed with my arms a lot until I managed to revert the bad crush map changes. Good to know, thanks =) Mvh. Torkil -- Torkil Svensgaard Sysadmin MR-Forskningssektionen, afs. 714 DRCMR, Danish Research Centre for Magnetic Resonance Hvidovre Hosp

[ceph-users] Re: Safe to move misplaced hosts between failure domains in the crush tree?

2024-06-13 Thread Torkil Svensgaard
638492 > Com. register: Amtsgericht Munich HRB 231263 > Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx > >> On 12. Jun 2024, at 09:13, Torkil Svensgaard <tor...@drcmr.dk> wrote: >>

[ceph-users] Re: Safe to move misplaced hosts between failure domains in the crush tree?

2024-06-12 Thread Torkil Svensgaard
CEO: Martin Verges - VAT-ID: DE310638492 Com. register: Amtsgericht Munich HRB 231263 Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx On 12. Jun 2024, at 09:33, Torkil Svensgaard wrote: On 12/06/2024 10:22, Matthias Grandl wrote: Correct, this should only result in misplaced objects. &

[ceph-users] Re: Safe to move misplaced hosts between failure domains in the crush tree?

2024-06-12 Thread Torkil Svensgaard
https://goo.gl/PGE1Bx On 12. Jun 2024, at 09:13, Torkil Svensgaard wrote: Hi We have 3 servers for replica 3 with failure domain datacenter:
 -1  4437.29248  root default
-33  1467.84814      datacenter 714
-69    69.86389          host ceph-flash1
-34  1511.25378

[ceph-users] Safe to move misplaced hosts between failure domains in the crush tree?

2024-06-12 Thread Torkil Svensgaard
lashX datacenter=Y" we will just end up with a lot of misplaced data and some churn, right? Or will the affected pool go degraded/unavailable? Mvh. Torkil -- Torkil Svensgaard Sysadmin MR-Forskningssektionen, afs. 714 DRCMR, Danish Research Centre for Magnetic Resonance Hvidovre Hospital
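The move being contemplated is one CRUSH command per host; a sketch with the placeholder names from the question:
"
ceph osd crush move ceph-flashX datacenter=Y
"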

[ceph-users] Re: Best practice and expected benefits of using separate WAL and DB devices with Bluestore

2024-04-19 Thread Torkil Svensgaard
and if so how? Do I run multiple SSDs in RAID? I do realize that for some of these, there might not be the one perfect answer that fits all use cases. I am looking for best practices and in general just trying to avoid any obvious mistakes. Any advice is much appreciated. Sincerely

[ceph-users] Ceph alert module different code path?

2024-04-09 Thread Torkil Svensgaard
d the alert module is somewhat broken? Mvh. Torkil -- Torkil Svensgaard Systems Administrator Danish Research Centre for Magnetic Resonance DRCMR, Section 714 Copenhagen University Hospital Amager and Hvidovre Kettegaard Allé 30, 2650 Hvidovre, Denmark ___

[ceph-users] Re: NFS never recovers after slow ops

2024-04-06 Thread Torkil Svensgaard
On 06-04-2024 18:10, Torkil Svensgaard wrote: Hi Cephadm Reef 18.2.1 Started draining 5 18-20 TB HDD OSDs (DB/WAL om NVMe) on one host. Even with osd_max_backfills at 1 the OSDs get slow ops from time to time which seems odd as we recently did a huge reshuffle[1] involving the same host

[ceph-users] NFS never recovers after slow ops

2024-04-06 Thread Torkil Svensgaard
ed (3.073%)
9931 active+clean
893  active+remapped+backfill_wait
24   active+remapped+backfilling
1    active+clean+inconsistent
io:
client: 3.5 KiB/s rd, 2.0 MiB/s wr, 5 op/s rd, 115 op/s wr
" Any ideas on how to get the nfsd threa

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Torkil Svensgaard
On 25-03-2024 23:07, Kai Stian Olstad wrote: On Mon, Mar 25, 2024 at 10:58:24PM +0100, Kai Stian Olstad wrote: On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote: My tally came to 412 out of 539 OSDs showing up in a blocked_by list and that is about every OSD with data prior

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Torkil Svensgaard
On 25-03-2024 22:58, Kai Stian Olstad wrote: On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote: My tally came to 412 out of 539 OSDs showing up in a blocked_by list and that is about every OSD with data prior to adding ~100 empty OSDs. How 400 read targets and 100 write

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Torkil Svensgaard
t the numbers I want so that will do. Thank you all for taking the time to look at this. Mvh. Torkil On 25-03-2024 20:44, Anthony D'Atri wrote: First try "ceph osd down 89" On Mar 25, 2024, at 15:37, Alexander E. Patrakov wrote: On Mon, Mar 25, 2024 at 7:37 PM Torkil Svens
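For readers landing here: "ceph osd down" only marks the OSD down in the OSD map; the running daemon immediately re-asserts itself, and the resulting re-peering often clears stale backfill reservations:
"
ceph osd down 89
"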

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Torkil Svensgaard
On 24/03/2024 01:14, Torkil Svensgaard wrote: On 24-03-2024 00:31, Alexander E. Patrakov wrote: Hi Torkil, Hi Alexander Thanks for the update. Even though the improvement is small, it is still an improvement, consistent with the osd_max_backfills value, and it proves that there are still

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-24 Thread Torkil Svensgaard
On 24-03-2024 13:41, Tyler Stachecki wrote: On Sat, Mar 23, 2024, 4:26 AM Torkil Svensgaard wrote: Hi ... Using mclock with high_recovery_ops profile. What is the bottleneck here? I would have expected a huge number of simultaneous backfills. Backfill reservation logjam? mClock is very

[ceph-users] Re: log_latency slow operation observed for submit_transact, latency = 22.644258499s

2024-03-24 Thread Torkil Svensgaard
No latency spikes seen the last 24 hours after manually compacting all the OSDs so it seemed to solve it for us at least. Thanks all. Mvh. Torkil On 23-03-2024 12:32, Torkil Svensgaard wrote: Hi guys Thanks for the suggestions, we'll do the offline compaction and see how big an impa
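The manual compaction referenced above can be issued cluster-wide; a sketch (compaction adds load while it runs):
"
ceph tell 'osd.*' compact
"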

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Torkil Svensgaard
rt. On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard wrote: On 23-03-2024 21:19, Alexander E. Patrakov wrote: Hi Torkil, Hi Alexander I have looked at the CRUSH rules, and the equivalent rules work on my test cluster. So this cannot be the cause of the blockage. Thank you for taki

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Torkil Svensgaard
_max_pg_per_osd 250 Mvh. Torkil On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard wrote: On 2024-03-23 17:54, Kai Stian Olstad wrote: On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote: The other output is too big for pastebin and I'm not familiar with paste services,

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Torkil Svensgaard
output of "ceph osd pool ls detail". On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov wrote: Hi Torkil, Unfortunately, your files contain nothing obviously bad or suspicious, except for two things: more PGs than usual and bad balance. What's your "mon max pg per osd"
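Checking and raising the limit being asked about; a sketch, value illustrative:
"
ceph config get mon mon_max_pg_per_osd
ceph config set mon mon_max_pg_per_osd 500
"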

[ceph-users] Re: log_latency slow operation observed for submit_transact, latency = 22.644258499s

2024-03-23 Thread Torkil Svensgaard
he whole host and your failure domain allows for that) 3. ceph config set osd osd_compact_on_start false The OSD will restart, but will not show as "up" until the compaction process completes. In your case, I would expect it to take up to 40 minutes. On Fri, Mar 22, 2024 at 3:46 PM Torkil S

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Torkil Svensgaard
too big for pastebin and I'm not familiar with paste services, any suggestion for a preferred way to share such output? Mvh. Torkil On Sat, Mar 23, 2024 at 4:26 PM Torkil Svensgaard wrote: Hi We have this after adding some hosts and changing crush failure domain to datacenter:

[ceph-users] Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Torkil Svensgaard
with 6 hosts and ~400 HDD OSDs with DB/WAL on NVMe. Using mclock with high_recovery_ops profile. What is the bottleneck here? I would have expected a huge number of simultaneous backfills. Backfill reservation logjam? Mvh. Torkil -- Torkil Svensgaard Systems Administrator Danish Research

[ceph-users] Re: log_latency slow operation observed for submit_transact, latency = 22.644258499s

2024-03-22 Thread Torkil Svensgaard
hanks. Mvh. Torkil Thanks, Igor On 3/22/2024 9:59 AM, Torkil Svensgaard wrote: Good morning, Cephadm Reef 18.2.1. We recently added 4 hosts and changed a failure domain from host to datacenter which is the reason for the large misplaced percentage. We were seeing some pretty crazy

[ceph-users] log_latency slow operation observed for submit_transact, latency = 22.644258499s

2024-03-22 Thread Torkil Svensgaard
n between with normal low latencies I think it unlikely that it is just because the cluster is busy. Also, how come there's only a small amount of PGs doing backfill when we have such a large misplaced percentage? Can this be just from backfill reservation logjam? Mvh. Torkil -- Tork

[ceph-users] Num values for 3 DC 4+2 crush rule

2024-03-15 Thread Torkil Svensgaard
ld just change 3 to 2 for the chooseleaf line for the 4+2 rule since for 4+5 each DC needs 3 shards and for 4+2 each DC needs 2 shards. Comments? Mvh. Torkil [1] https://docs.ceph.com/en/reef/rados/operations/crush-map-edits/ -- Torkil Svensgaard Systems Administrator Danish Research Centre for
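The 3-to-2 change described above lands on the chooseleaf line of the 4+2 rule; a sketch of such a rule, not the thread's exact one (id illustrative):
"
rule ec42 {
    id 5
    type erasure
    step take default
    step choose indep 3 type datacenter
    step chooseleaf indep 2 type host
    step emit
}
"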

[ceph-users] Re: Remove cluster_network without routing

2024-03-07 Thread Torkil Svensgaard
On 13/02/2024 13:31, Torkil Svensgaard wrote: Hi Cephadm Reef 18.2.0. We would like to remove our cluster_network without stopping the cluster and without having to route between the networks.
global  advanced  cluster_network  192.168.100.0/24  *
global
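A sketch of the removal itself, assuming a rolling OSD restart afterwards since OSDs only bind to the cluster network at startup:
"
ceph config rm global cluster_network
"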

[ceph-users] Re: Unable to map RBDs after running pg-upmap-primary on the pool

2024-03-07 Thread Torkil Svensgaard
On 07/03/2024 08:52, Torkil Svensgaard wrote: Hi I tried to do offline read optimization[1] this morning but I am now unable to map the RBDs in the pool. I did this prior to running the pg-upmap-primary commands suggested by the optimizer, as suggested by the latest documentation[2

[ceph-users] Unable to map RBDs after running pg-upmap-primary on the pool

2024-03-06 Thread Torkil Svensgaard
Mvh. Torkil [1] https://docs.ceph.com/en/reef/rados/operations/read-balancer/ [2] https://docs.ceph.com/en/latest/rados/operations/read-balancer/ -- Torkil Svensgaard Sysadmin MR-Forskningssektionen, afs. 714 DRCMR, Danish Research Centre for Magnetic Resonance Hvidovre Hospital Kettegård
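Hedged note: kernel RBD clients that predate pg-upmap-primary support cannot map images from a pool using it, which matches the symptom above; the optimizer's entries can be rolled back per PG, pgid as placeholder:
"
ceph osd rm-pg-upmap-primary <pgid>
"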

[ceph-users] Remove cluster_network without routing

2024-02-13 Thread Torkil Svensgaard
g-ref/#id3 -- Torkil Svensgaard Sysadmin MR-Forskningssektionen, afs. 714 DRCMR, Danish Research Centre for Magnetic Resonance Hvidovre Hospital Kettegård Allé 30 DK-2650 Hvidovre Denmark Tel: +45 386 22828 E-mail: tor...@drcmr.dk ___ ceph-users mailing li

[ceph-users] Re: Syslog server log naming

2024-02-01 Thread Torkil Svensgaard
github.com/ceph/ceph-container.git, CEPH_POINT_RELEASE=-18.2.0, org.label-schema.build-date=20231212, org.label-schema.name=CentOS Stream 8 Base Image, org.label-schema.schema-version=1.0) ... Feb 01 04:10:08 dopey practical_hypatia[766758]: 167 167 ... Feb 01 04:10:08 dopey systemd[1]: libpod-conmon-95

[ceph-users] Re: Syslog server log naming

2024-02-01 Thread Torkil Svensgaard
dopey practical_hypatia[766758]: 167 167 ... Feb 01 04:10:08 dopey systemd[1]: libpod-conmon-95967a040795bd61588dcfdc6ba5daf92553cd2cb3ecd7318cd8b16c1b15782d.scope: Deactivated successfully " Mvh. Torkil On 01/02/2024 08:24, Torkil Svensgaard wrote: We have ceph (currently 18.2.0) log to

[ceph-users] Syslog server log naming

2024-01-31 Thread Torkil Svensgaard
: {} 2024-02-01T05:42:17+01:00 dopey goofy_hypatia[845150]: 167 167 " Anyone else had this issue? Suggestions on how to get a real program name instead? Mvh. Torkil -- Torkil Svensgaard Sysadmin MR-Forskningssektionen, afs. 714 DRCMR, Danish Research Centre for Magnetic Resonance Hvidovre

[ceph-users] Re: NFS HA - "virtual_ip": null after upgrade to reef

2024-01-31 Thread Torkil Svensgaard
ig in the logs. Thanks! Mvh. Torkil Zitat von Torkil Svensgaard : On 31/01/2024 09:36, Eugen Block wrote: Hi, if I understand this correctly, with the "keepalive-only" option only one ganesha instance is supposed to be deployed: If a user additionally supplies --ingress-mode

[ceph-users] Re: NFS HA - "virtual_ip": null after upgrade to reef

2024-01-31 Thread Torkil Svensgaard
gress service is puzzling me, as it worked just fine prior to the upgrade and the upgrade shouldn't have touched the service spec in any way? Mvh. Torkil [1] https://docs.ceph.com/en/latest/cephadm/services/nfs/#nfs-with-virtual-ip-but-no-haproxy Regards, Eugen Zitat von Torkil Svensg

[ceph-users] Re: NFS HA - "virtual_ip": null after upgrade to reef

2024-01-31 Thread Torkil Svensgaard
On 31/01/2024 08:38, Torkil Svensgaard wrote: Hi Last week we created an NFS service like this: " ceph nfs cluster create jumbo "ceph-flash1,ceph-flash2,ceph-flash3" --ingress --virtual_ip 172.21.15.74/22 --ingress-mode keepalive-only " Worked like a charm. Yester

[ceph-users] NFS HA - "virtual_ip": null after upgrade to reef

2024-01-30 Thread Torkil Svensgaard
} ], "virtual_ip": null } } " Service spec: "
service_type: nfs
service_id: jumbo
service_name: nfs.jumbo
placement:
  count: 1
  hosts:
  - ceph-flash1
  - ceph-flash2
  - ceph-flash3
spec:
  port: 2049
  virtual_ip: 172.21.15.74
" I've tried restarting the nf
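Once the spec is corrected, re-applying it is the standard cephadm route; a sketch assuming the spec above is saved to a file (name illustrative):
"
ceph orch apply -i nfs-jumbo.yaml
ceph orch ls nfs --export
"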

[ceph-users] Odd auto-scaler warnings about too few/many PGs

2024-01-26 Thread Torkil Svensgaard
bd_ec_data stores 683TB in 4096 pgs -> warn should be 1024
Pool rbd_internal stores 86TB in 1024 pgs -> warn should be 2048
That makes no sense to me based on the amount of data stored. Is this a bug or what am I missing? Ceph version is 17.2.7. Mvh. Torkil -- Torkil Svensgaard Systems
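The autoscaler's inputs can be inspected directly, and it can be silenced per pool while investigating; a sketch:
"
ceph osd pool autoscale-status
ceph osd pool set rbd_internal pg_autoscale_mode off
"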

[ceph-users] Re: Wide EC pool causes very slow backfill?

2024-01-18 Thread Torkil Svensgaard
k-max-backfills-recovery-limits -Sridhar ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io -- Torkil Svensgaard Systems Administrator Danish Research Centre for Magnetic Resonance DRCMR, Section

[ceph-users] Re: Wide EC pool causes very slow backfill?

2024-01-18 Thread Torkil Svensgaard
5139 active+remapped+backfill_wait
109  active+remapped+backfilling
io:
client: 28 MiB/s rd, 258 MiB/s wr, 677 op/s rd, 772 op/s wr
recovery: 3.5 GiB/s, 1.00k objects/s
" Thanks again. Mvh. Torkil On 18-01-2024 13:26, Torkil Svensgaard wrote: Np. Thanks, we'

[ceph-users] Re: Wide EC pool causes very slow backfill?

2024-01-18 Thread Torkil Svensgaard
ceph config set osd osd_op_queue wpq [1] https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/ [2] https://docs.clyso.com/blog/2023/03/22/ceph-how-do-disable-mclock-scheduler/ Zitat von Torkil Svensgaard : Hi Our 17.2.7 cluster: " -33  886.00842   

[ceph-users] Wide EC pool causes very slow backfill?

2024-01-18 Thread Torkil Svensgaard
m anywhere near the target capacity, and the one we just added has 22 empty OSDs, having just 22 PGs backfilling and 1 recovering seems somewhat underwhelming. Is this to be expected with such a pool? Mclock profile is high_recovery_ops. Mvh. Torkil -- Torkil Svensgaard Sysadmin MR-Fors

[ceph-users] Re: Adding OSD's results in slow ops, inactive PG's

2024-01-18 Thread Torkil Svensgaard
, 2024 9:46 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Adding OSD's results in slow ops, inactive PG's I'm glad to hear (or read) that it worked for you as well. :-) Zitat von Torkil Svensgaard : On 18/01/2024 09:30, Eugen Block wrote: Hi, [ceph: root@lazy /]# ceph-con

[ceph-users] Re: Adding OSD's results in slow ops, inactive PG's

2024-01-18 Thread Torkil Svensgaard
ame in right away, some are stuck on the aio thing. Hopefully they will recover eventually. Thank you again for the osd_max_pg_per_osd_hard_ratio suggestion, that seems to have solved the core issue =) Mvh. Torkil Zitat von Torkil Svensgaard : On 18/01/2024 07:48, Eugen Block wrot
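The setting credited above, for reference; a sketch with an illustrative value (the default ratio is lower):
"
ceph config set osd osd_max_pg_per_osd_hard_ratio 5
"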

[ceph-users] Re: Adding OSD's results in slow ops, inactive PG's

2024-01-17 Thread Torkil Svensgaard
ev(0x56295d586400 /var/lib/ceph/osd/ceph-436/block) aio_submit retries 108 ... " Daemons are running but those last OSDs won't come online. I've tried upping bdev_aio_max_queue_depth but it didn't seem to make a difference. Mvh. Torkil Zitat von Torkil Svensgaard :
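Hedged guess for the aio_submit retries: besides bdev_aio_max_queue_depth, the kernel-wide AIO limit can be the bottleneck when many OSDs start on one host; raising it is a cheap test (value illustrative):
"
sysctl -w fs.aio-max-nr=1048576
"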

[ceph-users] Re: Adding OSD's results in slow ops, inactive PG's

2024-01-17 Thread Torkil Svensgaard
ol, placed on spinning rust, some 200-ish disks distributed across 13 nodes. I'm not sure if other pools break, but that particular 4+2 EC pool is rather important so I'm a little wary of experimenting blindly. Any thoughts on where to look next? Thanks, Ruben Vestergaard [1] https://docs.ceph.com/en/reef/rados/trouble

[ceph-users] Re: 3 DC with 4+5 EC not quite working

2024-01-12 Thread Torkil Svensgaard
r AIT Risø Campus Bygning 109, rum S14 From: Torkil Svensgaard Sent: Friday, January 12, 2024 10:17 AM To: Frédéric Nass Cc: ceph-users@ceph.io; Ruben Vestergaard Subject: [ceph-users] Re: 3 DC with 4+5 EC not quite working On 12-01-2024 09:35, Frédéric

[ceph-users] Re: 3 DC with 4+5 EC not quite working

2024-01-12 Thread Torkil Svensgaard
related replicated pools. Looking at it now I guess that was because the 5 OSDs were blocked for everything and not just the PGs for that data pool? We tried restarting the 5 blocked OSDs to no avail and eventually resorted to deleting the cephfs.hdd.data data pool to restore service. Any suggestio

[ceph-users] 3 DC with 4+5 EC not quite working

2024-01-11 Thread Torkil Svensgaard
the PGs for that data pool? We tried restarting the 5 blocked OSDs to no avail and eventually resorted to deleting the cephfs.hdd.data data pool to restore service. Any suggestions as to what we did wrong? Something to do with min_size? The crush rule? Thanks. Mvh. Torkil -- Torkil Svens
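For a 4+5 profile the usual min_size is k+1 = 5, and losing one of three datacenters still leaves six shards, so PGs should have stayed active; a sketch for verifying and adjusting, pool name from the thread:
"
ceph osd pool ls detail | grep cephfs.hdd.data
ceph osd pool set cephfs.hdd.data min_size 5
"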

[ceph-users] Re: Upgrading From RHCS v4 to OSS Ceph

2023-11-16 Thread Torkil Svensgaard
age quay.io/ceph/ceph:v17.2.7 " Mvh. Torkil Any advice or feedback is much appreciated. Best, Josh ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io -- Torkil Svensgaard Systems Admini

[ceph-users] Re: Is nfs-ganesha + kerberos actually a thing?

2023-10-13 Thread Torkil Svensgaard
On 13-10-2023 16:57, John Mulligan wrote: On Friday, October 13, 2023 10:46:24 AM EDT Torkil Svensgaard wrote: On 13-10-2023 16:40, Torkil Svensgaard wrote: On 13-10-2023 14:00, John Mulligan wrote: On Friday, October 13, 2023 6:11:18 AM EDT Torkil Svensgaard wrote: Hi We have kerberos

[ceph-users] Re: Is nfs-ganesha + kerberos actually a thing?

2023-10-13 Thread Torkil Svensgaard
On 13-10-2023 16:40, Torkil Svensgaard wrote: On 13-10-2023 14:00, John Mulligan wrote: On Friday, October 13, 2023 6:11:18 AM EDT Torkil Svensgaard wrote: Hi We have kerberos working with bare metal kernel NFS exporting RBDs. I can see in the ceph documentation[1] that nfs-ganesha should

[ceph-users] Re: Is nfs-ganesha + kerberos actually a thing?

2023-10-13 Thread Torkil Svensgaard
On 13-10-2023 14:00, John Mulligan wrote: On Friday, October 13, 2023 6:11:18 AM EDT Torkil Svensgaard wrote: Hi We have kerberos working with bare metal kernel NFS exporting RBDs. I can see in the ceph documentation[1] that nfs-ganesha should work with kerberos but I'm having little

[ceph-users] Is nfs-ganesha + kerberos actually a thing?

2023-10-13 Thread Torkil Svensgaard
/#create-cephfs-export -- Torkil Svensgaard Systems Administrator Danish Research Centre for Magnetic Resonance DRCMR, Section 714 Copenhagen University Hospital Amager and Hvidovre Kettegaard Allé 30, 2650 Hvidovre, Denmark ___ ceph-users mailing list -- cep

[ceph-users] Re: 2 pgs backfill_toofull but plenty of space

2023-01-16 Thread Torkil Svensgaard
y: 96 MiB/s, 49 objects/s progress: Global Recovery Event (2d) [===.] (remaining: 59m) " Mvh. Torkil Thanks, Kevin ________ From: Torkil Svensgaard Sent: Tuesday, January 10, 2023 2:36 AM To: ceph-users-a8pt6iju...@public.g

[ceph-users] 2 pgs backfill_toofull but plenty of space

2023-01-10 Thread Torkil Svensgaard
30%    1.0 261 osd.12 | 53.58% 1.0 51 osd.82
32.17% 1.0 172 osd.4  | 53.52% 1.0 50 osd.72
0%     0   0   osd.49
+--------
" Mvh. Torkil -- Torkil Svensgaard Systems Administrator Danish Researc

[ceph-users] Re: Recovery very slow after upgrade to quincy

2022-08-15 Thread Torkil Svensgaard
have done after the upgrade? Mvh. Torkil -- Torkil Svensgaard Sysadmin MR-Forskningssektionen, afs. 714 DRCMR, Danish Research Centre for Magnetic Resonance Hvidovre Hospital Kettegård Allé 30 DK-2650 Hvidovre Denmark Tel: +45 386 22828 E-mail: tor...@drcmr.dk __

[ceph-users] Recovery very slow after upgrade to quincy

2022-08-12 Thread Torkil Svensgaard
scrubs and snaptrim, no difference. Am I missing something obvious I should have done after the upgrade? Mvh. Torkil -- Torkil Svensgaard Sysadmin MR-Forskningssektionen, afs. 714 DRCMR, Danish Research Centre for Magnetic Resonance Hvidovre Hospital Kettegård Allé 30 DK-2650 Hvidovre Denmark Tel

[ceph-users] Re: OSDs crashing/flapping

2022-08-04 Thread Torkil Svensgaard
On 8/4/22 09:17, Torkil Svensgaard wrote: Hi We have a lot of OSDs flapping during recovery and eventually they don't come up again until kicked with "ceph orch daemon restart osd.x". This is the end of the log for one OSD going down for good: " 2022-08-04T09:57:31.7

[ceph-users] OSDs crashing/flapping

2022-08-04 Thread Torkil Svensgaard
r::cpu_tp thread 0x7fab9ede5700' had timed out after 0.0s
Aug 04 06:59:29 dcn-ceph-01 bash[5230]: debug 2022-08-04T06:59:29.808+ 7fab9cde1700 1 mon.dcn-ceph-01@4(electing) e21 collect_metadata md0: no unique device id for md0: fallback method has no model nor serial' "

[ceph-users] 270.98 GB was requested for block_db_size, but only 270.98 GB can be fulfilled

2022-06-07 Thread Torkil Svensgaard
void the fraction:
290966526510 / 4,194,304 = 69,371.82581663132 extents per db
290963062784 / 4,194,304 = 69,371 extents per db
69,371 x 11 = 763,081 total extents
69,371 x 10 = 693,710 used extents
"
pvdisplay /dev/nvme0n1
PE Size       4.00 MiB
Total PE      763089
Free PE
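The usual way around the rounding above is to pin an explicit, slightly smaller block_db_size in the OSD spec; a sketch using the thread's figure, other fields illustrative:
"
service_type: osd
service_id: hdd_with_db
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
  block_db_size: '290963062784'
"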

[ceph-users] Re: RBD mirroring bootstrap peers - direction

2021-12-15 Thread Torkil Svensgaard
On 12/15/21 14:18, Arthur Outhenin-Chalandre wrote: On 12/15/21 13:50, Torkil Svensgaard wrote: Ah, so as long as I don't run the mirror daemons on site-a there is no risk of overwriting production data there? To be perfectly clear there should be no risk whatsoever (as Ilya also sai

[ceph-users] Re: RBD mirroring bootstrap peers - direction

2021-12-15 Thread Torkil Svensgaard
On 15/12/2021 13.58, Ilya Dryomov wrote: Hi Torkil, Hi Ilya I would recommend sticking to rx-tx to make potential failback back to the primary cluster easier. There shouldn't be any issue with running rbd-mirror daemons at both sites either -- it doesn't start replicating until it is instruc

[ceph-users] Re: Snapshot mirroring problem

2021-12-15 Thread Torkil Svensgaard
On 15/12/2021 10.17, Arthur Outhenin-Chalandre wrote: Hi Torkil, Hi Arthur On 12/15/21 09:45, Torkil Svensgaard wrote: I'm having trouble getting snapshot replication to work. I have 2 clusters, 714-ceph on RHEL/16.2.0-146.el8cp and dcn-ceph on CentOS Stream 8/16.2.6. I'm trying to e

[ceph-users] Re: RBD mirroring bootstrap peers - direction

2021-12-15 Thread Torkil Svensgaard
On 15/12/2021 13.44, Arthur Outhenin-Chalandre wrote: Hi Torkil, Hi Arthur On 12/15/21 13:24, Torkil Svensgaard wrote: I'm confused by the direction parameter in the documentation[1]. If I have my data at site-a and want one way replication to site-b should the mirroring be configur

[ceph-users] RBD mirroring bootstrap peers - direction

2021-12-15 Thread Torkil Svensgaard
Hi I'm confused by the direction parameter in the documentation[1]. If I have my data at site-a and want one way replication to site-b should the mirroring be configured as the documentation example, directionwise? E.g. rbd --cluster site-a mirror pool peer bootstrap create --site-name site
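The documentation's flow for strict one-way replication, sketched with the thread's site names and a placeholder pool:
"
rbd --cluster site-a mirror pool peer bootstrap create --site-name site-a <pool> > token
rbd --cluster site-b mirror pool peer bootstrap import --site-name site-b --direction rx-only <pool> token
"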

[ceph-users] Snapshot mirroring problem

2021-12-15 Thread Torkil Svensgaard
Hi I'm having trouble getting snapshot replication to work. I have 2 clusters, 714-ceph on RHEL/16.2.0-146.el8cp and dcn-ceph on CentOS Stream 8/16.2.6. I'm trying to enable one-way replication from 714-ceph -> dcn-ceph. Adding peer: "
# rbd mirror pool info
Mode: image
Site Name: dcn-ceph
P

[ceph-users] Re: 2 fast allocations != 4 num_osds

2021-08-22 Thread Torkil Svensgaard
On 22/08/2021 00.42, Torkil Svensgaard wrote: Hi Any suggestions as to the cause of this error? The device list seems fine, a mix of already active OSDs and 4 empty, available drives. There were 2 orphaned LVs on the db device. After I removed those the 4 available devices came up as OSDs

[ceph-users] 2 fast allocations != 4 num_osds

2021-08-21 Thread Torkil Svensgaard
Hi Any suggestions as to the cause of this error? The device list seems fine, a mix of already active OSDs and 4 empty, available drives. RuntimeError: Failed command: /usr/bin/docker run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group
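Per the follow-up above, the culprit was orphaned LVs on the db device; a sketch for spotting and removing them (verify against ceph-volume before deleting anything):
"
ceph-volume lvm list
lvs -o lv_name,vg_name,lv_tags
lvremove <vg>/<orphaned_lv>
"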

[ceph-users] Re: EC and rbd-mirroring

2021-08-18 Thread Torkil Svensgaard
On 18/08/2021 21.26, Torkil Svensgaard wrote: Did I miss something obvious? Restarting the rbd-mirror daemons was the thing I missed. All good now. Thanks, Torkil Thanks, Torkil On 18/08/2021 14.30, Ilya Dryomov wrote: On Wed, Aug 18, 2021 at 12:40 PM Torkil Svensgaard wrote: Hi I

[ceph-users] Re: EC and rbd-mirroring

2021-08-18 Thread Torkil Svensgaard
6.84360 GiB Did I miss something obvious? Thanks, Torkil On 18/08/2021 14.30, Ilya Dryomov wrote: On Wed, Aug 18, 2021 at 12:40 PM Torkil Svensgaard wrote: Hi I am looking at one way mirroring from cluster A to cluster B. As per [1] I have configured two pools for RBD on cluster B: 1
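For reference, RBD on EC keeps image metadata in a replicated pool and points data at the EC pool; a sketch with placeholder pool names:
"
rbd create --size 1T --data-pool <ec_pool> <replicated_pool>/<image>
"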

[ceph-users] EC and rbd-mirroring

2021-08-18 Thread Torkil Svensgaard
-- Torkil Svensgaard Sysadmin MR-Forskningssektionen, afs. 714 DRCMR, Danish Research Centre for Magnetic Resonance Hvidovre Hospital Kettegård Allé 30 DK-2650 Hvidovre Denmark Tel: +45 386 22828 E-mail: tor...@drcmr.dk ___ ceph-users mailing list -- ceph-users

[ceph-users] Re: Module 'devicehealth' has failed:

2021-06-15 Thread Torkil Svensgaard
" Mvh. Torkil On 15/06/2021 11.38, Sebastian Wagner wrote: Hi Torkil, you should see more information in the MGR log file. Might be an idea to restart the MGR to get some recent logs. On 15.06.21 at 09:41, Torkil Svensgaard wrote: Hi Looking at this error in v15.2.13: " [

[ceph-users] Module 'devicehealth' has failed:

2021-06-15 Thread Torkil Svensgaard
Hi Looking at this error in v15.2.13: " [ERR] MGR_MODULE_ERROR: Module 'devicehealth' has failed: Module 'devicehealth' has failed: " It used to work. Since the module is always on I can't seem to restart it and I've found no clue as to why it failed. I've tried rebooting all hosts to no
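Always-on mgr modules restart with the active manager, so failing it over is the usual reset; a sketch, plus checking for recorded crashes:
"
ceph mgr fail
ceph crash ls
"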