Hello,

I have a test cluster of some mini-PCs. It runs Proxmox and has two Ceph RBD
storage pools (one for LXCs, one for VMs).

The purpose of this test cluster was to test Docker Swarm. I wanted to get a
feel for orchestration; our five-node production setup is very simple and
Kubernetes would be overkill.

Each node boots from NVMe, and each has a single OSD on a PCIe Gen 4 M.2 NVMe
drive. I understand this equipment is not optimal, but please keep in mind this
is a test cluster. All things considered, it ran fine for two months; I even
made some of our non-critical beta programs available for internal use within
our organization.

Yesterday, I connected a 2.5GbE unmanaged switch to the second 2.5GbE NIC of
each node in the cluster, creating a private cluster network for Ceph. Since
then, every node, VM, and LXC has been moving at a glacial pace. To give an
example, a sudo apt update or just logging in via SSH can take sixty seconds.
My ceph.conf:

[global]
    auth_client_required = cephx
    auth_cluster_required = cephx
    auth_service_required = cephx
    cluster_network = 172.16.1.0/24
    fsid = 3c395d5c-7d46-4dc7-ad4b-8a6761f167b0
    mon_allow_pool_delete = true
    mon_host = 192.168.128.156 192.168.128.150 192.168.128.158
    ms_bind_ipv4 = true
    ms_bind_ipv6 = false
    osd_pool_default_min_size = 2
    osd_pool_default_size = 3
    public_network = 192.168.128.156/24

[client]
    keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
    keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
    keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mon.asusNuc1]
    public_addr = 192.168.128.150

[mon.chyna2gb]
    public_addr = 192.168.128.156

[mon.chyna4tb]
    public_addr = 192.168.128.158
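As a sanity check on my end: cluster_network is 172.16.1.0/24 while every
address in the config above is 192.168.128.x, so each node's second NIC needs
its own address inside the cluster subnet. A minimal sketch of that check using
Python's ipaddress module (the NIC addresses below are hypothetical, for
illustration only):

```python
# Sketch: does a node's second-NIC address fall inside Ceph's cluster_network?
import ipaddress

CLUSTER_NET = ipaddress.ip_network("172.16.1.0/24")

def on_cluster_net(addr: str) -> bool:
    """Return True if addr belongs to the cluster_network subnet."""
    return ipaddress.ip_address(addr) in CLUSTER_NET

# Hypothetical second-NIC addresses, one per node (not taken from the cluster):
nics = {
    "chyna2gb": "172.16.1.2",
    "chyna4tb": "172.16.1.3",
    "nuc": "172.16.1.4",
    "asusNuc1": "192.168.128.150",  # example of a NIC left on the public net
}
for host, addr in nics.items():
    print(host, on_cluster_net(addr))
```

If any node lacks a reachable address in 172.16.1.0/24, OSD heartbeats and
replication over the new cluster network would stall, which could explain the
slowdown.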
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host chyna2gb {
    id -3        # do not change unnecessarily
    id -4 class nvme        # do not change unnecessarily
    # weight 1.86299
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 1.86299
}
host chyna4tb {
    id -5        # do not change unnecessarily
    id -6 class nvme        # do not change unnecessarily
    # weight 3.63869
    alg straw2
    hash 0  # rjenkins1
    item osd.1 weight 3.63869
}
host nuc {
    id -7        # do not change unnecessarily
    id -8 class nvme        # do not change unnecessarily
    # weight 0.90970
    alg straw2
    hash 0  # rjenkins1
    item osd.2 weight 0.90970
}
host asusNuc1 {
    id -9        # do not change unnecessarily
    id -10 class nvme        # do not change unnecessarily
    # weight 3.63869
    alg straw2
    hash 0  # rjenkins1
    item osd.3 weight 3.63869
}
root default {
    id -1        # do not change unnecessarily
    id -2 class nvme        # do not change unnecessarily
    # weight 10.05006
    alg straw2
    hash 0  # rjenkins1
    item chyna2gb weight 1.86299
    item chyna4tb weight 3.63869
    item nuc weight 0.90970
    item asusNuc1 weight 3.63869
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
# end crush map
root@asusNuc1:~# ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
  0                   6                  6
  3                  15                 15
  2                  23                 23
  1                   4                  4
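For readability, a small sketch that turns the raw perf numbers above into
per-OSD latencies (column meaning taken from the `ceph osd perf` header: OSD
id, commit latency, apply latency, all in ms):

```python
# Parse `ceph osd perf` rows into {osd_id: (commit_ms, apply_ms)}.
raw = "0 6 6 3 15 15 2 23 23 1 4 4"  # the four OSD rows from the output above

def parse_perf(tokens: str) -> dict[int, tuple[int, int]]:
    nums = [int(t) for t in tokens.split()]
    # Each row is: osd_id, commit_latency(ms), apply_latency(ms)
    return {nums[i]: (nums[i + 1], nums[i + 2]) for i in range(0, len(nums), 3)}

perf = parse_perf(raw)
slowest = max(perf, key=lambda osd: perf[osd][0])
print(perf)     # {0: (6, 6), 3: (15, 15), 2: (23, 23), 1: (4, 4)}
print(slowest)  # 2
```

So the worst commit latency here is 23 ms on osd.2, which by itself does not
look slow enough to explain sixty-second SSH logins.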
To be clear, the "dumb switch" is isolated and not connected to the rest of the 
network.


Regards,
Anthony Fecarotta
Founder & President
anth...@linehaul.ai
224-339-1182 | (855) 625-0300
1 Mid America Plz Flr 3, Oakbrook Terrace, IL 60181
www.linehaul.ai
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io