Hey everyone,

We are currently running a 4-node Proxmox cluster with an external Ceph cluster (Ceph on CentOS 7). There are 4 Ceph OSD nodes, each with the following specification:

- 8-core Intel Xeon processor
- 32 GB RAM
- 2 x 600 GB SAS HDD for the CentOS system (RAID1)
- 9 x 1200 GB SAS HDD for data (each as a single-disk RAID0, bluestore), with 2 x 480 GB SSD for block.db & block.wal
- 3 x 960 GB SSD for a faster pool (each as a single-disk RAID0, bluestore, no separate block.db & block.wal)
- 10 Gb Ethernet network
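To illustrate the OSD layout (these are not our exact deployment commands, and the device names are only placeholders): the HDD OSDs are bluestore with block.db and block.wal on partitions of the shared 480 GB SSDs, while the SSD OSDs are plain bluestore, roughly as if created like this:

# illustration only -- device names are placeholders
# HDD OSD: data on a 1.2 TB SAS disk, db/wal on partitions of a 480 GB SSD
ceph-volume lvm create --bluestore --data /dev/sdX \
    --block.db /dev/sdY1 --block.wal /dev/sdY2

# SSD OSD: plain bluestore, no separate db/wal device
ceph-volume lvm create --bluestore --data /dev/sdZ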
So in total we have 36 HDD OSDs and 12 SSD OSDs. Here is our network topology: https://imgur.com/eAHb18I

On this cluster I created 4 pools, each with 3 replicas:

1. rbd-data (mounted on Proxmox to store VM data block devices; this pool is placed on the HDD OSDs)
2. rbd-os (mounted on Proxmox to store VM OS block devices for better performance; this pool is placed on the SSD OSDs)
3. cephfs-data (uses the same devices and ruleset as rbd-data, mounted on Proxmox as CephFS data)
4. cephfs-metadata

Here is our crush map (to show that we have already separated the SSD and HDD disks into different pools and rulesets):

# begin crush map
...

# buckets
host z1 {
        id -3           # do not change unnecessarily
        id -16 class hdd        # do not change unnecessarily
        id -22 class ssd        # do not change unnecessarily
        # weight 10.251
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 1.139
        item osd.1 weight 1.139
        item osd.2 weight 1.139
        item osd.3 weight 1.139
        item osd.4 weight 1.139
        item osd.5 weight 1.139
        item osd.6 weight 1.139
        item osd.7 weight 1.139
        item osd.8 weight 1.139
}
host z2 {
        id -5           # do not change unnecessarily
        id -17 class hdd        # do not change unnecessarily
        id -23 class ssd        # do not change unnecessarily
        # weight 10.251
        alg straw2
        hash 0  # rjenkins1
        item osd.9 weight 1.139
        item osd.10 weight 1.139
        item osd.11 weight 1.139
        item osd.12 weight 1.139
        item osd.13 weight 1.139
        item osd.14 weight 1.139
        item osd.15 weight 1.139
        item osd.16 weight 1.139
        item osd.17 weight 1.139
}
host z3 {
        id -7           # do not change unnecessarily
        id -18 class hdd        # do not change unnecessarily
        id -24 class ssd        # do not change unnecessarily
        # weight 10.251
        alg straw2
        hash 0  # rjenkins1
        item osd.18 weight 1.139
        item osd.19 weight 1.139
        item osd.20 weight 1.139
        item osd.21 weight 1.139
        item osd.22 weight 1.139
        item osd.23 weight 1.139
        item osd.24 weight 1.139
        item osd.25 weight 1.139
        item osd.26 weight 1.139
}
host s1 {
        id -9           # do not change unnecessarily
        id -19 class hdd        # do not change unnecessarily
        id -25 class ssd        # do not change unnecessarily
        # weight 10.251
        alg straw2
        hash 0  # rjenkins1
        item osd.27 weight 1.139
        item osd.28 weight 1.139
        item osd.29 weight 1.139
        item osd.30 weight 1.139
        item osd.31 weight 1.139
        item osd.32 weight 1.139
        item osd.33 weight 1.139
        item osd.34 weight 1.139
        item osd.35 weight 1.139
}
root sas {
        id -1           # do not change unnecessarily
        id -21 class hdd        # do not change unnecessarily
        id -26 class ssd        # do not change unnecessarily
        # weight 51.496
        alg straw2
        hash 0  # rjenkins1
        item z1 weight 12.874
        item z2 weight 12.874
        item z3 weight 12.874
        item s1 weight 12.874
}
host z1-ssd {
        id -101         # do not change unnecessarily
        id -2 class hdd         # do not change unnecessarily
        id -11 class ssd        # do not change unnecessarily
        # weight 2.619
        alg straw2
        hash 0  # rjenkins1
        item osd.36 weight 0.873
        item osd.37 weight 0.873
        item osd.38 weight 0.873
}
host z2-ssd {
        id -104         # do not change unnecessarily
        id -4 class hdd         # do not change unnecessarily
        id -12 class ssd        # do not change unnecessarily
        # weight 2.619
        alg straw2
        hash 0  # rjenkins1
        item osd.39 weight 0.873
        item osd.40 weight 0.873
        item osd.41 weight 0.873
}
host z3-ssd {
        id -107         # do not change unnecessarily
        id -6 class hdd         # do not change unnecessarily
        id -13 class ssd        # do not change unnecessarily
        # weight 2.619
        alg straw2
        hash 0  # rjenkins1
        item osd.42 weight 0.873
        item osd.43 weight 0.873
        item osd.44 weight 0.873
}
host s1-ssd {
        id -110         # do not change unnecessarily
        id -8 class hdd         # do not change unnecessarily
        id -14 class ssd        # do not change unnecessarily
        # weight 2.619
        alg straw2
        hash 0  # rjenkins1
        item osd.45 weight 0.873
        item osd.46 weight 0.873
        item osd.47 weight 0.873
}
root ssd {
        id -20          # do not change unnecessarily
        id -10 class hdd        # do not change unnecessarily
        id -15 class ssd        # do not change unnecessarily
        # weight 10.476
        alg straw2
        hash 0  # rjenkins1
        item z1-ssd weight 2.619
        item z2-ssd weight 2.619
        item z3-ssd weight 2.619
        item s1-ssd weight 2.619
}

# rules
rule sas_ruleset {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take sas
        step chooseleaf firstn 0 type host
        step emit
}
rule ssd_ruleset {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
}
rule cephfs_ruleset {
        id 2
        type replicated
        min_size 1
        max_size 10
        step take sas
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map
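For reference, the pool-to-rule mapping can be double-checked with the standard CLI like this (assuming a Luminous-style CLI where the pool property is called crush_rule; older releases call it crush_ruleset):

ceph osd pool get rbd-data crush_rule      # expect sas_ruleset
ceph osd pool get rbd-os crush_rule        # expect ssd_ruleset
ceph osd pool get cephfs-data crush_rule   # expect cephfs_ruleset
ceph osd crush rule dump ssd_ruleset
ceph osd df tree                           # per-OSD usage under the sas and ssd roots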
So far, functional testing of the system has been fine with no problems. But we need to prove that the system's performance is good, especially in terms of IOPS. Our method is:

1. Benchmark a single disk of each type to establish a baseline.
2. Calculate the theoretical maximum IOPS (read & write) of the whole array, based on step 1.
3. Benchmark one VM whose OS disk is an RBD image from a pool on the Ceph cluster.

KPI: we expect the IOPS measured in the VM test to reach at least 60-70% of the calculated whole-array maximum.

We use fio for the benchmarks, with the following configs (an expanded example job file is sketched after the results below):

READ RUN
ioengine=libaio
sync=0
fsync=1
direct=1
runtime=180
ramp_time=30
numjobs=1
filesize=20g

WRITE RUN
ioengine=psync
direct=0
ramp_time=30
runtime=180
numjobs=1
filesize=20g

Okay, here we go. First we ran the tests on a Ceph node directly against a single SSD and a single SAS disk, to use as the baseline. Results:

SSD: read IOPS = 50k, write IOPS = 20k
HDD: read IOPS = 1k, write IOPS = 1k

From these results, with 36 HDD OSDs and 12 SSD OSDs we expect roughly the following totals:

SSD: read IOPS = 12 x 50k = 600k; write IOPS = 12 x 20k / 3 (replication) = 80k
HDD: read IOPS = 36 x 1k = 36k; write IOPS = 36 x 1k / 3 (replication) = 12k

Next we attached one RBD image from the SSD pool to a VM as its root (/) OS disk and ran fio inside the VM with exactly the same config, but we only got:

SSD pool (from the VM): read IOPS = 46k, write IOPS = 14.4k

That is about the IOPS of a single SSD, nowhere near the whole-array calculation.

At first we assumed that running two fio tests simultaneously, on two VMs with two RBD images from the same pool, would give the same per-VM result; in theory we could then run up to 12 VMs to reach the cumulative maximum of the 12 SSD OSDs. But when we ran it, the results were simply divided by two, far from that assumption:

VM 1 (SSD pool): read IOPS = 23k, write IOPS = 7k
VM 2 (SSD pool): read IOPS = 23k, write IOPS = 7k

So the first fio test really is the maximum performance, which means the system is delivering only 8-10% of what it should, while our KPI requires at least 60-70%.

As a second try, I changed the VM's OS disk to an RBD image from the HDD pool. Strangely enough, I got exactly the same result as with the image from the SSD pool:

HDD pool (from the VM): read IOPS = 46k, write IOPS = 14.4k

I'm a little confused now. I expected different results when using images from different pools, but I seem to get the same performance either way, even though we are really sure the SSD and HDD pools and crush rules are separate.
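For reference, here is roughly what the two runs above look like as one complete fio job file. The rw, bs, iodepth and filename values are placeholders I have added for illustration only; they were not part of the config quoted above.

# sketch of the runs above -- run with: fio bench.fio
# options marked "assumed" are illustrative placeholders, not from our original config

[read-run]
ioengine=libaio
sync=0
fsync=1
direct=1
runtime=180
ramp_time=30
numjobs=1
filesize=20g
# assumed placeholders:
rw=randread
bs=4k
iodepth=1
filename=/root/fio-test.bin

[write-run]
# run after the read job finishes
stonewall
ioengine=psync
direct=0
ramp_time=30
runtime=180
numjobs=1
filesize=20g
# assumed placeholders:
rw=randwrite
bs=4k
filename=/root/fio-test.bin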
My questions are:

1. Why do I get the same test results even though I tested with two different RBD images from two different pools and rulesets (SSD and HDD)?
2. Which performance am I actually measuring? If it is really the SSD pool, why is it so poor, and why does the HDD pool test show the same SSD numbers?
3. Conversely, if what I am seeing is the HDD pool performance, which would at least be theoretically plausible for that pool, why does testing the SSD pool also show these HDD pool numbers?
4. Where did I go wrong? Is my understanding of the concept wrong, or is the concept right and I just need to change something in the system configuration?

If you need any other data from my system to help with the analysis, I will provide it ASAP (some examples of what I can share are listed below). Thank you all :)
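P.S. For a start, I can share the output of commands like these (standard ceph/rbd CLI; <image> stands for the actual RBD image name):

ceph -s
ceph osd df tree
ceph osd pool ls detail
ceph osd crush rule dump
rbd info rbd-os/<image>
rbd info rbd-data/<image>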