Hi,

I have bundled the public NICs and added 2 more monitors (running on 2 of the 3 OSD hosts).
This seems to improve things, but I still have high latency. Also, performance of the SSD pool
is worse than the HDD pool, which is very confusing.

The SSD pool is using one Toshiba PX05SMB040Y per server (for a total of 3 OSDs), while the
HDD pool is using 2 Seagate ST600MM0006 disks per server (for a total of 6 OSDs).
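To rule out the drives themselves, I am also planning to run a single-threaded sync write test
directly against one of the PX05SMB040Y devices, outside of Ceph (just a sketch; /dev/sdX below
is a placeholder for the SSD's device path, and the test overwrites data on that disk):

# WARNING: destructive raw-device test; /dev/sdX is only a placeholder
fio --name=ssd-sync-test --filename=/dev/sdX --direct=1 --sync=1 --rw=write \
    --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based --group_reporting

If that shows only a few hundred IOPS at queue depth 1, the bottleneck is more likely the
drive/controller path (e.g. cache settings) than Ceph itself.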
Note that I have also disabled C-states in the BIOS and added
"intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll" to GRUB.

Any hints/suggestions will be greatly appreciated.

[root@osd04 ~]# ceph status
  cluster:
    id:     37161a51-a159-4895-a7fd-3b0d857f1b66
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            application not enabled on 2 pool(s)
            mon osd02 is low on available space

  services:
    mon:         3 daemons, quorum osd01,osd02,mon01
    mgr:         mon01(active)
    osd:         9 osds: 9 up, 9 in
                 flags noscrub,nodeep-scrub
    tcmu-runner: 6 daemons active

  data:
    pools:   2 pools, 228 pgs
    objects: 50384 objects, 196 GB
    usage:   402 GB used, 3504 GB / 3906 GB avail
    pgs:     228 active+clean

  io:
    client:   46061 kB/s rd, 852 B/s wr, 15 op/s rd, 0 op/s wr

[root@osd04 ~]# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
 -9       4.50000 root ssds
-10       1.50000     host osd01-ssd
  6   hdd 1.50000         osd.6          up  1.00000 1.00000
-11       1.50000     host osd02-ssd
  7   hdd 1.50000         osd.7          up  1.00000 1.00000
-12       1.50000     host osd04-ssd
  8   hdd 1.50000         osd.8          up  1.00000 1.00000
 -1       2.72574 root default
 -3       1.09058     host osd01
  0   hdd 0.54529         osd.0          up  1.00000 1.00000
  4   hdd 0.54529         osd.4          up  1.00000 1.00000
 -5       1.09058     host osd02
  1   hdd 0.54529         osd.1          up  1.00000 1.00000
  3   hdd 0.54529         osd.3          up  1.00000 1.00000
 -7       0.54459     host osd04
  2   hdd 0.27229         osd.2          up  1.00000 1.00000
  5   hdd 0.27229         osd.5          up  1.00000 1.00000

rados bench -p ssdpool 300 -t 32 write --no-cleanup && rados bench -p ssdpool 300 -t 32 seq

Total time run:         302.058832
Total writes made:      4100
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     54.2941
Stddev Bandwidth:       70.3355
Max bandwidth (MB/sec): 252
Min bandwidth (MB/sec): 0
Average IOPS:           13
Stddev IOPS:            17
Max IOPS:               63
Min IOPS:               0
Average Latency(s):     2.35655
Stddev Latency(s):      4.4346
Max latency(s):         29.7027
Min latency(s):         0.045166

rados bench -p rbd 300 -t 32 write --no-cleanup && rados bench -p rbd 300 -t 32 seq

Total time run:         301.428571
Total writes made:      8753
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     116.154
Stddev Bandwidth:       71.5763
Max bandwidth (MB/sec): 320
Min bandwidth (MB/sec): 0
Average IOPS:           29
Stddev IOPS:            17
Max IOPS:               80
Min IOPS:               0
Average Latency(s):     1.10189
Stddev Latency(s):      1.80203
Max latency(s):         15.0715
Min latency(s):         0.0210309

[root@osd04 ~]# ethtool -k gth0
Features for gth0:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: on [fixed]
        tx-checksum-sctp: on
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: on [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
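Before digging further into Ceph itself, I want to verify two more things on each node: that the
C-state / GRUB changes actually took effect, and that the bonded cluster links behave as expected.
Roughly along these lines (assuming cpupower from kernel-tools and iperf3 are installed; the
192.168.0.x address is just an example from my cluster network, not a real host entry):

# confirm the kernel booted with the new parameters and C-states are limited
cat /proc/cmdline
cpupower idle-info | head

# throughput and latency between two OSD hosts over the cluster network
iperf3 -s                            # on one OSD host
iperf3 -c 192.168.0.12 -P 4 -t 30    # from another OSD host (example address)
ping -c 50 -i 0.2 192.168.0.12

If throughput stays well under 10 Gbit/s or latency is spiky, I will revisit the bond mode and
xmit_hash_policy before blaming the OSDs.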
On 22 January 2018 at 12:09, Steven Vacaroaia <ste...@gmail.com> wrote:
> Hi David,
>
> I noticed the public interface of the server I am running the test from is
> heavily used so I will bond that one too
>
> I doubt though that this explains the poor performance
>
> Thanks for your advice
>
> Steven
>
> On 22 January 2018 at 12:02, David Turner <drakonst...@gmail.com> wrote:
>
>> I'm not speaking to anything other than your configuration.
>>
>> "I am using 2 x 10 GB bonded ( BONDING_OPTS="mode=4 miimon=100
>> xmit_hash_policy=1 lacp_rate=1") for cluster and 1 x 1GB for public"
>> It might not be a bad idea for you to forgo the public network on the 1Gb
>> interfaces and either put everything on one network or use VLANs on the
>> 10Gb connections. I lean more towards that in particular because your
>> public network doesn't have a bond on it. Just as a note, communication
>> between the OSDs and the MONs is all done on the public network. If that
>> interface goes down, then the OSDs are likely to be marked down/out from
>> your cluster. I'm a fan of VLANs, but if you don't have the equipment or
>> expertise to go that route, then just using the same subnet for public and
>> private is a decent way to go.
>>
>> On Mon, Jan 22, 2018 at 11:37 AM Steven Vacaroaia <ste...@gmail.com>
>> wrote:
>>
>>> I did test with rados bench ..here are the results
>>>
>>> rados bench -p ssdpool 300 -t 12 write --no-cleanup && rados bench -p
>>> ssdpool 300 -t 12 seq
>>>
>>> Total time run:         300.322608
>>> Total writes made:      10632
>>> Write size:             4194304
>>> Object size:            4194304
>>> Bandwidth (MB/sec):     141.608
>>> Stddev Bandwidth:       74.1065
>>> Max bandwidth (MB/sec): 264
>>> Min bandwidth (MB/sec): 0
>>> Average IOPS:           35
>>> Stddev IOPS:            18
>>> Max IOPS:               66
>>> Min IOPS:               0
>>> Average Latency(s):     0.33887
>>> Stddev Latency(s):      0.701947
>>> Max latency(s):         9.80161
>>> Min latency(s):         0.015171
>>>
>>> Total time run:       300.829945
>>> Total reads made:     10070
>>> Read size:            4194304
>>> Object size:          4194304
>>> Bandwidth (MB/sec):   133.896
>>> Average IOPS:         33
>>> Stddev IOPS:          14
>>> Max IOPS:             68
>>> Min IOPS:             3
>>> Average Latency(s):   0.35791
>>> Max latency(s):       4.68213
>>> Min latency(s):       0.0107572
>>>
>>> rados bench -p scbench256 300 -t 12 write --no-cleanup && rados bench -p
>>> scbench256 300 -t 12 seq
>>>
>>> Total time run:         300.747004
>>> Total writes made:      10239
>>> Write size:             4194304
>>> Object size:            4194304
>>> Bandwidth (MB/sec):     136.181
>>> Stddev Bandwidth:       75.5
>>> Max bandwidth (MB/sec): 272
>>> Min bandwidth (MB/sec): 0
>>> Average IOPS:           34
>>> Stddev IOPS:            18
>>> Max IOPS:               68
>>> Min IOPS:               0
>>> Average Latency(s):     0.352339
>>> Stddev Latency(s):      0.72211
>>> Max latency(s):         9.62304
>>> Min latency(s):         0.00936316
>>> hints = 1
>>>
>>> Total time run:       300.610761
>>> Total reads made:     7628
>>> Read size:            4194304
>>> Object size:          4194304
>>> Bandwidth (MB/sec):   101.5
>>> Average IOPS:         25
>>> Stddev IOPS:          11
>>> Max IOPS:             61
>>> Min IOPS:             0
>>> Average Latency(s):   0.472321
>>> Max latency(s):       15.636
>>> Min latency(s):       0.0188098
>>>
>>> On 22 January 2018 at 11:34, Steven Vacaroaia <ste...@gmail.com> wrote:
>>>
>>>> sorry ..send the message too soon
>>>> Here is more info
>>>>
>>>> Vendor Id            : SEAGATE
>>>> Product Id           : ST600MM0006
>>>> State                : Online
>>>> Disk Type            : SAS,Hard Disk Device
>>>> Capacity             : 558.375 GB
>>>> Power State          : Active
>>>>
>>>> ( SSD is in slot 0)
>>>>
>>>> megacli -LDGetProp -Cache -LALL -a0
>>>>
>>>> Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAheadNone,
>>>> Direct, No Write Cache if bad BBU
>>>> Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive,
>>>> Direct, No Write Cache if bad BBU
>>>> Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive,
>>>> Direct, No Write Cache if bad BBU
>>>> Adapter 0-VD 3(target id: 3): Cache Policy:WriteBack, ReadAdaptive,
>>>> Direct, No Write Cache if bad BBU
>>>> Adapter 0-VD 4(target id: 4): Cache Policy:WriteBack, ReadAdaptive,
>>>> Direct, No Write Cache if bad BBU
>>>> Adapter 0-VD 5(target id: 5): Cache Policy:WriteBack, ReadAdaptive,
>>>> Direct, No Write Cache if bad BBU
>>>>
>>>> [root@osd01 ~]# megacli -LDGetProp -DskCache -LALL -a0
>>>>
>>>> Adapter 0-VD 0(target id: 0): Disk Write Cache : Disabled
>>>> Adapter 0-VD 1(target id: 1): Disk Write Cache : Disk's Default
>>>> Adapter 0-VD 2(target id: 2): Disk Write Cache : Disk's Default
>>>> Adapter 0-VD 3(target id: 3): Disk Write Cache : Disk's Default
>>>> Adapter 0-VD 4(target id: 4): Disk Write Cache : Disk's Default
>>>> Adapter 0-VD 5(target id: 5): Disk Write Cache : Disk's Default
>>>>
>>>> CPU
>>>> Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
>>>>
>>>> Centos 7 kernel 3.10.0-693.11.6.el7.x86_64
>>>>
>>>> sysctl -p
>>>> net.ipv4.tcp_sack = 0
>>>> net.core.netdev_budget = 600
>>>> net.ipv4.tcp_window_scaling = 1
>>>> net.core.rmem_max = 16777216
>>>> net.core.wmem_max = 16777216
>>>> net.core.rmem_default = 16777216
>>>> net.core.wmem_default = 16777216
>>>> net.core.optmem_max = 40960
>>>> net.ipv4.tcp_rmem = 4096 87380 16777216
>>>> net.ipv4.tcp_wmem = 4096 65536 16777216
>>>> net.ipv4.tcp_syncookies = 0
>>>> net.core.somaxconn = 1024
>>>> net.core.netdev_max_backlog = 20000
>>>> net.ipv4.tcp_max_syn_backlog = 30000
>>>> net.ipv4.tcp_max_tw_buckets = 2000000
>>>> net.ipv4.tcp_tw_reuse = 1
>>>> net.ipv4.tcp_slow_start_after_idle = 0
>>>> net.ipv4.conf.all.send_redirects = 0
>>>> net.ipv4.conf.all.accept_redirects = 0
>>>> net.ipv4.conf.all.accept_source_route = 0
>>>> vm.min_free_kbytes = 262144
>>>> vm.swappiness = 0
>>>> vm.vfs_cache_pressure = 100
>>>> fs.suid_dumpable = 0
>>>> kernel.core_uses_pid = 1
>>>> kernel.msgmax = 65536
>>>> kernel.msgmnb = 65536
>>>> kernel.randomize_va_space = 1
>>>> kernel.sysrq = 0
>>>> kernel.pid_max = 4194304
>>>> fs.file-max = 100000
>>>>
>>>> ceph.conf
>>>>
>>>> public_network = 10.10.30.0/24
>>>> cluster_network = 192.168.0.0/24
>>>>
>>>> osd_op_num_threads_per_shard = 2
>>>> osd_op_num_shards = 25
>>>> osd_pool_default_size = 2
>>>> osd_pool_default_min_size = 1 # Allow writing 1 copy in a degraded state
>>>> osd_pool_default_pg_num = 256
>>>> osd_pool_default_pgp_num = 256
>>>> osd_crush_chooseleaf_type = 1
>>>> osd_scrub_load_threshold = 0.01
>>>> osd_scrub_min_interval = 137438953472
>>>> osd_scrub_max_interval = 137438953472
>>>> osd_deep_scrub_interval = 137438953472
>>>> osd_max_scrubs = 16
>>>> osd_op_threads = 8
>>>> osd_max_backfills = 1
>>>> osd_recovery_max_active = 1
>>>> osd_recovery_op_priority = 1
>>>>
>>>> debug_lockdep = 0/0
>>>> debug_context = 0/0
>>>> debug_crush = 0/0
>>>> debug_buffer = 0/0
>>>> debug_timer = 0/0
>>>> debug_filer = 0/0
>>>> debug_objecter = 0/0
>>>> debug_rados = 0/0
>>>> debug_rbd = 0/0
>>>> debug_journaler = 0/0
>>>> debug_objectcatcher = 0/0
>>>> debug_client = 0/0
>>>> debug_osd = 0/0
>>>> debug_optracker = 0/0
>>>> debug_objclass = 0/0
>>>> debug_filestore = 0/0
>>>> debug_journal = 0/0
>>>> debug_ms = 0/0
>>>> debug_monc = 0/0
>>>> debug_tp = 0/0
>>>> debug_auth = 0/0
>>>> debug_finisher = 0/0
>>>> debug_heartbeatmap = 0/0
>>>> debug_perfcounter = 0/0
>>>> debug_asok = 0/0
>>>> debug_throttle = 0/0
>>>> debug_mon = 0/0
>>>> debug_paxos = 0/0
>>>> debug_rgw = 0/0
>>>>
>>>> [mon]
>>>> mon_allow_pool_delete = true
>>>>
>>>> [osd]
>>>> osd_heartbeat_grace = 20
>>>> osd_heartbeat_interval = 5
>>>> bluestore_block_db_size = 16106127360
>>>> bluestore_block_wal_size = 1073741824
>>>>
>>>> [osd.6]
>>>> host = osd01
>>>> osd_journal = /dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.1d58775a-5019-42ea-8149-a126f51a2501
>>>> crush_location = root=ssds host=osd01-ssd
>>>>
>>>> [osd.7]
>>>> host = osd02
>>>> osd_journal = /dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.683dc52d-5d69-4ff0-b5d9-b17056a55681
>>>> crush_location = root=ssds host=osd02-ssd
>>>>
>>>> [osd.8]
>>>> host = osd04
>>>> osd_journal = /dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.bd7c0088-b724-441e-9b88-9457305c541d
>>>> crush_location = root=ssds host=osd04-ssd
>>>>
>>>> On 22 January 2018 at 11:29, Steven Vacaroaia <ste...@gmail.com> wrote:
>>>>
>>>>> Hi David,
>>>>>
>>>>> Yes, I meant no separate partitions for WAL and DB
>>>>>
>>>>> I am using 2 x 10 GB bonded ( BONDING_OPTS="mode=4 miimon=100
>>>>> xmit_hash_policy=1 lacp_rate=1") for cluster and 1 x 1GB for public
>>>>> Disks are
>>>>> Vendor Id            : TOSHIBA
>>>>> Product Id           : PX05SMB040Y
>>>>> State                : Online
>>>>> Disk Type            : SAS,Solid State Device
>>>>> Capacity             : 372.0 GB
>>>>>
>>>>> On 22 January 2018 at 11:24, David Turner <drakonst...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Disk models, other hardware information including CPU, network
>>>>>> config? You say you're using Luminous, but then say journal on same
>>>>>> device. I'm assuming you mean that you just have the bluestore OSD
>>>>>> configured without a separate WAL or DB partition? Any more specifics
>>>>>> you can give will be helpful.
>>>>>>
>>>>>> On Mon, Jan 22, 2018 at 11:20 AM Steven Vacaroaia <ste...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'll appreciate it if you can provide some guidance / suggestions
>>>>>>> regarding performance issues on a test cluster (3 x DELL R620,
>>>>>>> 1 Enterprise SSD, 3 x 600 GB Enterprise HDD, 8 cores, 64 GB RAM)
>>>>>>>
>>>>>>> I created 2 pools (replication factor 2), one with only SSD and the
>>>>>>> other with only HDD
>>>>>>> (journal on same disk for both)
>>>>>>>
>>>>>>> The performance is quite similar although I was expecting it to be at
>>>>>>> least 5 times better
>>>>>>> No issues noticed using atop
>>>>>>>
>>>>>>> What should I check / tune ?
>>>>>>>
>>>>>>> Many thanks
>>>>>>> Steven
>>>>>>>
>>>>>>> HDD based pool ( journal on the same disk)
>>>>>>>
>>>>>>> ceph osd pool get scbench256 all
>>>>>>>
>>>>>>> size: 2
>>>>>>> min_size: 1
>>>>>>> crash_replay_interval: 0
>>>>>>> pg_num: 256
>>>>>>> pgp_num: 256
>>>>>>> crush_rule: replicated_rule
>>>>>>> hashpspool: true
>>>>>>> nodelete: false
>>>>>>> nopgchange: false
>>>>>>> nosizechange: false
>>>>>>> write_fadvise_dontneed: false
>>>>>>> noscrub: false
>>>>>>> nodeep-scrub: false
>>>>>>> use_gmt_hitset: 1
>>>>>>> auid: 0
>>>>>>> fast_read: 0
>>>>>>>
>>>>>>> rbd bench --io-type write image1 --pool=scbench256
>>>>>>> bench  type write io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
>>>>>>>   SEC       OPS   OPS/SEC   BYTES/SEC
>>>>>>>     1     46816  46836.46  191842139.78
>>>>>>>     2     90658  45339.11  185709011.80
>>>>>>>     3    133671  44540.80  182439126.08
>>>>>>>     4    177341  44340.36  181618100.14
>>>>>>>     5    217300  43464.04  178028704.54
>>>>>>>     6    259595  42555.85  174308767.05
>>>>>>> elapsed:     6  ops:   262144  ops/sec: 42694.50  bytes/sec: 174876688.23
>>>>>>>
>>>>>>> fio /home/cephuser/write_256.fio
>>>>>>> write-4M: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>>>>>> fio-2.2.8
>>>>>>> Starting 1 process
>>>>>>> rbd engine: RBD version: 1.12.0
>>>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [66284KB/0KB/0KB /s] [16.6K/0/0 iops] [eta 00m:00s]
>>>>>>>
>>>>>>> fio /home/cephuser/write_256.fio
>>>>>>> write-4M: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>>>>>> fio-2.2.8
>>>>>>> Starting 1 process
>>>>>>> rbd engine: RBD version: 1.12.0
>>>>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/14464KB/0KB /s] [0/3616/0 iops] [eta 00m:00s]
>>>>>>>
>>>>>>> SSD based pool
>>>>>>>
>>>>>>> ceph osd pool get ssdpool all
>>>>>>>
>>>>>>> size: 2
>>>>>>> min_size: 1
>>>>>>> crash_replay_interval: 0
>>>>>>> pg_num: 128
>>>>>>> pgp_num: 128
>>>>>>> crush_rule: ssdpool
>>>>>>> hashpspool: true
>>>>>>> nodelete: false
>>>>>>> nopgchange: false
>>>>>>> nosizechange: false
>>>>>>> write_fadvise_dontneed: false
>>>>>>> noscrub: false
>>>>>>> nodeep-scrub: false
>>>>>>> use_gmt_hitset: 1
>>>>>>> auid: 0
>>>>>>> fast_read: 0
>>>>>>>
>>>>>>> rbd -p ssdpool create --size 52100 image2
>>>>>>>
>>>>>>> rbd bench --io-type write image2 --pool=ssdpool
>>>>>>> bench  type write io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
>>>>>>>   SEC       OPS   OPS/SEC   BYTES/SEC
>>>>>>>     1     42412  41867.57  171489557.93
>>>>>>>     2     78343  39180.86  160484805.88
>>>>>>>     3    118082  39076.48  160057256.16
>>>>>>>     4    155164  38683.98  158449572.38
>>>>>>>     5    192825  38307.59  156907885.84
>>>>>>>     6    230701  37716.95  154488608.16
>>>>>>> elapsed:     7  ops:   262144  ops/sec: 36862.89  bytes/sec: 150990387.29
>>>>>>>
>>>>>>> [root@osd01 ~]# fio /home/cephuser/write_256.fio
>>>>>>> write-4M: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>>>>>> fio-2.2.8
>>>>>>> Starting 1 process
>>>>>>> rbd engine: RBD version: 1.12.0
>>>>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/20224KB/0KB /s] [0/5056/0 iops] [eta 00m:00s]
>>>>>>>
>>>>>>> fio /home/cephuser/write_256.fio
>>>>>>> write-4M: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>>>>>> fio-2.2.8
>>>>>>> Starting 1 process
>>>>>>> rbd engine: RBD version: 1.12.0
>>>>>>> Jobs: 1 (f=1): [r(1)] [100.0% done] [76096KB/0KB/0KB /s] [19.3K/0/0 iops] [eta 00m:00s]
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com