Hello, I have a test cluster built from some mini-PCs. It runs Proxmox and has two Ceph RBD pools (one for LXCs, one for VMs).
The purpose of this test cluster was to try out Docker Swarm. I wanted to get a feel for orchestration - our five-node production cluster is very simple, and Kubernetes would be overkill. Each node boots off NVMe, and each node has one OSD on a PCIe Gen 4 M.2 NVMe drive. I understand this equipment is not optimal, but please keep in mind this is a test cluster. All things considered, it ran fine for two months; I even made some of our non-critical beta programs available for internal use within our organization.

Yesterday, I connected a 2.5GbE unmanaged switch to the second 2.5GbE NIC of each node, creating a private/cluster network for Ceph. Since then, every node, VM, LXC, etc. has been moving at a glacial pace. To give an example, a sudo apt update or even just logging in via SSH can take sixty seconds.

Here is my ceph.conf:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 172.16.1.0/24
fsid = 3c395d5c-7d46-4dc7-ad4b-8a6761f167b0
mon_allow_pool_delete = true
mon_host = 192.168.128.156 192.168.128.150 192.168.128.158
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 192.168.128.156/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mon.asusNuc1]
public_addr = 192.168.128.150

[mon.chyna2gb]
public_addr = 192.168.128.156

[mon.chyna4tb]
public_addr = 192.168.128.158

And the CRUSH map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host chyna2gb {
    id -3        # do not change unnecessarily
    id -4 class nvme        # do not change unnecessarily
    # weight 1.86299
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 1.86299
}
host chyna4tb {
    id -5        # do not change unnecessarily
    id -6 class nvme        # do not change unnecessarily
    # weight 3.63869
    alg straw2
    hash 0    # rjenkins1
    item osd.1 weight 3.63869
}
host nuc {
    id -7        # do not change unnecessarily
    id -8 class nvme        # do not change unnecessarily
    # weight 0.90970
    alg straw2
    hash 0    # rjenkins1
    item osd.2 weight 0.90970
}
host asusNuc1 {
    id -9        # do not change unnecessarily
    id -10 class nvme        # do not change unnecessarily
    # weight 3.63869
    alg straw2
    hash 0    # rjenkins1
    item osd.3 weight 3.63869
}
root default {
    id -1        # do not change unnecessarily
    id -2 class nvme        # do not change unnecessarily
    # weight 10.05006
    alg straw2
    hash 0    # rjenkins1
    item chyna2gb weight 1.86299
    item chyna4tb weight 3.63869
    item nuc weight 0.90970
    item asusNuc1 weight 3.63869
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
# end crush map

Output of ceph osd perf:

root@asusNuc1:~# ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
  0                   6                  6
  3                  15                 15
  2                  23                 23
  1                   4                  4

To be clear, the "dumb switch" is isolated and not connected to the rest of the network.

Regards,

Anthony Fecarotta
Founder & President
anth...@linehaul.ai
224-339-1182 | (855) 625-0300
1 Mid America Plz, Flr 3, Oakbrook Terrace, IL 60181
www.linehaul.ai
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
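P.S. One detail I noticed while re-reading the config above: public_network is written as a host address with a /24 mask (192.168.128.156/24), whereas cluster_network uses the network address proper (172.16.1.0/24). I don't know whether Ceph normalizes this, but the distinction is easy to see with Python's ipaddress module (purely illustrative, not part of my setup):

```python
import ipaddress

# Values taken from the ceph.conf above.
public_net = "192.168.128.156/24"
cluster_net = "172.16.1.0/24"

# A strict parse rejects the public_network value because host bits are set:
try:
    ipaddress.ip_network(public_net)
except ValueError as e:
    print("strict parse failed:", e)

# Masking the host bits off shows the network presumably intended:
print(ipaddress.ip_network(public_net, strict=False))  # 192.168.128.0/24

# cluster_network parses cleanly as-is:
print(ipaddress.ip_network(cluster_net))  # 172.16.1.0/24
```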