Hello,

On Mon, 11 Jul 2016 09:54:59 +0300 K K wrote:

> > I hope the fastest of these MONs (CPU and storage) has the lowest IP
> > number and thus is the leader.
> No, the lowest IP has the slowest CPU. But Zabbix didn't show any load
> on the mons at all.
In your use case and configuration no surprise, but again, the lowest IP
will be the leader by default and thus the busiest.
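If you want to verify which mon actually is the leader at the moment,
something along these lines should show it (the quorum_leader_name field,
if I remember the name correctly):
---
ceph quorum_status --format json-pretty | grep quorum_leader_name
---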
> > Also what Ceph, OS, kernel version?
> Ubuntu 16.04, kernel 4.4.0-22
Check the ML archives, I remember people having performance issues with
the 4.4 kernels.
Still don't know your Ceph version, is it the latest Jewel?

> > Two GbE ports, given the "frontend" up there with the MON description I
> > assume that's 1 port per client (front) and cluster (back) network?
> Yes, one GbE for the Ceph clients, one GbE for the back network.
OK, so (from a single GbE client) 100MB/s at most.

> > Is there any other client than that Windows VM on your Ceph cluster?
> Yes, one other instance, but without load.
OK.

> > Is Ceph understanding this now?
> > Other than that, the queue options aren't likely to do much good with
> > pure HDD OSDs.
> I can't find those parameters in the running config:
> ceph --admin-daemon /var/run/ceph/ceph-mon.block01.asok config show|grep
> "filestore_queue"
These are OSD parameters, you need to query an OSD daemon.
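For example, something like this on one of your block nodes (adjust the
OSD number and socket path to one that actually exists there):
---
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep filestore_queue
---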
> "filestore_queue_max_ops": "3000",
> "filestore_queue_max_bytes": "1048576000",
> "filestore_queue_max_delay_multiple": "0",
> "filestore_queue_high_delay_multiple": "0",
> "filestore_queue_low_threshhold": "0.3",
> "filestore_queue_high_threshhold": "0.9",

> > That should be 512, 1024 really with one RBD pool.
> Yes, I know. Today, for a test, I added an scbench pool with 128 PGs.
> Here is the output of status and osd tree:
> ceph status
>     cluster 830beb43-9898-4fa9-98c1-ee04c1cdf69c
>      health HEALTH_OK
>      monmap e6: 3 mons at
> {block01=10.30.9.21:6789/0,object01=10.30.9.129:6789/0,object02=10.30.9.130:6789/0}
>             election epoch 238, quorum 0,1,2 block01,object01,object02
>      osdmap e6887: 18 osds: 18 up, 18 in
>       pgmap v9738812: 1280 pgs, 3 pools, 17440 GB data, 4346 kobjects
>             35049 GB used, 15218 GB / 50267 GB avail
>                 1275 active+clean
>                    3 active+clean+scrubbing+deep
>                    2 active+clean+scrubbing
Check the ML archives and restrict scrubs to off-peak hours as well as
tune things to keep their impact low.
Scrubbing is a major performance killer, especially on non-SSD journal
OSDs and with older Ceph versions and/or non-tuned parameters:
---
osd_scrub_end_hour = 6
osd_scrub_load_threshold = 2.5
osd_scrub_sleep = 0.1
---
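Those go into ceph.conf on the OSD nodes, but you should also be able to
inject them on the fly, roughly like this (depending on your Ceph version
some of them may still need an OSD restart to really take effect):
---
ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'
ceph tell osd.* injectargs '--osd_scrub_load_threshold 2.5'
---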
> client io 5030 kB/s rd, 1699 B/s wr, 19 op/s rd, 0 op/s wr
>
> ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 54.00000 root default
> -2 27.00000     host cn802
>  0  3.00000         osd.0        up  1.00000          1.00000
>  2  3.00000         osd.2        up  1.00000          1.00000
>  4  3.00000         osd.4        up  1.00000          1.00000
>  6  3.00000         osd.6        up  0.89995          1.00000
>  8  3.00000         osd.8        up  1.00000          1.00000
> 10  3.00000         osd.10       up  1.00000          1.00000
> 12  3.00000         osd.12       up  0.89999          1.00000
> 16  3.00000         osd.16       up  1.00000          1.00000
> 18  3.00000         osd.18       up  0.90002          1.00000
> -3 27.00000     host cn803
>  1  3.00000         osd.1        up  1.00000          1.00000
>  3  3.00000         osd.3        up  0.95316          1.00000
>  5  3.00000         osd.5        up  1.00000          1.00000
>  7  3.00000         osd.7        up  1.00000          1.00000
>  9  3.00000         osd.9        up  1.00000          1.00000
> 11  3.00000         osd.11       up  0.95001          1.00000
> 13  3.00000         osd.13       up  1.00000          1.00000
> 17  3.00000         osd.17       up  0.84999          1.00000
> 19  3.00000         osd.19       up  1.00000          1.00000

> > Wrong way to test this, test it from a monitor node, another client
> > node (like your openstack nodes).
> > In your 2 node cluster half of the reads or writes will be local, very
> > much skewing your results.
> I have tested from a compute node as well, with the same result,
> 80-100MB/sec.
That's about as good as it gets (not 148MB/s, though!).
But rados bench is not the same as real client I/O.

> > Very high max latency, telling us that your cluster ran out of steam
> > at some point.
> I am copying data from my Windows instance right now.
Re-do any testing when you've stopped all scrubbing.
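The simplest way to do that for the duration of a test is to set the
cluster-wide flags, something like:
---
ceph osd set noscrub
ceph osd set nodeep-scrub
# run your tests, then re-enable:
ceph osd unset noscrub
ceph osd unset nodeep-scrub
---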
> > I'd de-frag anyway, just to rule that out.
> >
> > When doing your tests or normal (busy) operations from the client VM,
> > run atop on your storage nodes and observe your OSD HDDs.
> > Do they get busy, around 100%?
> Yes, high IO load (600-800 IOPS). But this is very strange for SATA HDDs.
> Each HDD has its own OSD daemon and is presented to the OS as a hardware
> RAID0 (each block node has a hardware RAID controller). Example:
Your RAID controller and its HW cache are likely to help with that speed,
also all of these are reads, most likely the scrubs above, not a single
write to be seen.

> avg-cpu:  %user  %nice %system %iowait %steal  %idle
>            1.44   0.00    3.56   17.56   0.00  77.44
>
> Device: rrqm/s wrqm/s    r/s  w/s     rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
> sdb       0.00   0.00   0.00 0.00      0.00   0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
> sdc       0.00   0.00 649.00 0.00  82912.00   0.00   255.51     8.30  12.74   12.74    0.00  1.26 81.60
> sdd       0.00   0.00   0.00 0.00      0.00   0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
> sde       0.00   0.00   0.00 0.00      0.00   0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
> sdf       0.00   0.00 761.00 0.00  94308.00   0.00   247.85     8.66  11.26   11.26    0.00  1.18 90.00
> sdg       0.00   0.00 761.00 0.00  97408.00   0.00   256.00     7.80  10.22   10.22    0.00  1.01 76.80
> sdh       0.00   0.00 801.00 0.00 102344.00   0.00   255.54     8.05  10.05   10.05    0.00  0.96 76.80
> sdi       0.00   0.00   0.00 0.00      0.00   0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
> sdj       0.00   0.00 537.00 0.00  68736.00   0.00   256.00     5.54  10.26   10.26    0.00  0.98 52.80

> > Check with iperf or NPtcp that your network to the clients from the
> > storage nodes is fully functional.
> The network has been tested with iperf: 950-970 Mbit/s among all nodes
> in the cluster (OpenStack and Ceph).
Didn't think it was that, one thing off the list to check.

Christian

Monday, 11 July 2016, 10:58 +05:00 from Christian Balzer <ch...@gol.com>:
>
> >
> >Hello,
> >
> >On Mon, 11 Jul 2016 07:35:02 +0300 K K wrote:
> >
> >> Hello, guys
> >>
> >> I'm facing poor performance in a Windows 2k12r2 instance running on
> >> RBD (OpenStack cluster). The RBD disk has a size of 17TB. My Ceph
> >> cluster consists of:
> >> - 3 monitor nodes (Celeron G530/6Gb RAM, DualCore E6500/2Gb RAM,
> >> Core2Duo E7500/2Gb RAM). Each node has a 1Gbit link to the frontend
> >> subnet of the Ceph cluster.
> >
> >I hope the fastest of these MONs (CPU and storage) has the lowest IP
> >number and thus is the leader.
> >
> >Also what Ceph, OS, kernel version?
> >
> >> - 2 block nodes (Xeon E5620/32Gb RAM/2*1Gbit NIC). Each node has
> >> 2*500Gb HDDs for the operating system and 9*3Tb SATA HDDs (WD SE).
> >> Total 18 OSD daemons on 2 nodes.
> >
> >Two GbE ports, given the "frontend" up there with the MON description I
> >assume that's 1 port per client (front) and cluster (back) network?
> >
> >> Journals are placed on the same HDDs as the rados data. I know that
> >> it would be better to use a separate SSD disk for that purpose.
> >Indeed...
> >
> >> When I tested a new Windows instance, performance was good
> >> (read/write something about 100MB/sec). But after I copied 16TB of
> >> data to the Windows instance, read performance dropped to 10MB/sec.
> >> The type of data on the VM is images and video.
> >>
> >100MB/s would be absolute perfect with the setup you have, assuming no
> >contention (other clients).
> >
> >Is there any other client than that Windows VM on your Ceph cluster?
> >
> >> ceph.conf on the client side:
> >> [global]
> >> auth cluster required = cephx
> >> auth service required = cephx
> >> auth client required = cephx
> >> filestore xattr use omap = true
> >> filestore max sync interval = 10
> >> filestore queue max ops = 3000
> >> filestore queue commiting max bytes = 1048576000
> >> filestore queue commiting max ops = 5000
> >> filestore queue max bytes = 1048576000
> >> filestore queue committing max ops = 4096
> >> filestore queue committing max bytes = 16 MiB
> >                                         ^^^
> >Is Ceph understanding this now?
> >Other than that, the queue options aren't likely to do much good with
> >pure HDD OSDs.
> >
> >> filestore op threads = 20
> >> filestore flusher = false
> >> filestore journal parallel = false
> >> filestore journal writeahead = true
> >> journal dio = true
> >> journal aio = true
> >> journal force aio = true
> >> journal block align = true
> >> journal max write bytes = 1048576000
> >> journal_discard = true
> >> osd pool default size = 2 # Write an object n times.
> >> osd pool default min size = 1
> >> osd pool default pg num = 333
> >> osd pool default pgp num = 333
> >That should be 512, 1024 really with one RBD pool.
> >http://ceph.com/pgcalc/
> >
> >> osd crush chooseleaf type = 1
> >>
> >> [client]
> >> rbd cache = true
> >> rbd cache size = 67108864
> >> rbd cache max dirty = 50331648
> >> rbd cache target dirty = 33554432
> >> rbd cache max dirty age = 2
> >> rbd cache writethrough until flush = true
> >>
> >>
> >> rados bench from a block node shows:
> >Wrong way to test this, test it from a monitor node, another client node
> >(like your openstack nodes).
> >In your 2 node cluster half of the reads or writes will be local, very
> >much skewing your results.
> >
> >> rados bench -p scbench 120 write --no-cleanup
> >
> >Default tests with 4MB "blocks", what are the writes or reads from your
> >client VM like?
> >
> >> Total time run:         120.399337
> >> Total writes made:      3538
> >> Write size:             4194304
> >> Object size:            4194304
> >> Bandwidth (MB/sec):     117.542
> >> Stddev Bandwidth:       9.31244
> >> Max bandwidth (MB/sec): 148
> >                          ^^^
> >That wouldn't be possible from an external client.
> >
> >> Min bandwidth (MB/sec): 92
> >> Average IOPS:           29
> >> Stddev IOPS:            2
> >> Max IOPS:               37
> >> Min IOPS:               23
> >> Average Latency(s):     0.544365
> >> Stddev Latency(s):      0.35825
> >> Max latency(s):         5.42548
> >Very high max latency, telling us that your cluster ran out of steam at
> >some point.
> >
> >> Min latency(s):         0.101533
> >>
> >> rados bench -p scbench 120 seq
> >> Total time run:       120.880920
> >> Total reads made:     1932
> >> Read size:            4194304
> >> Object size:          4194304
> >> Bandwidth (MB/sec):   63.9307
> >> Average IOPS:         15
> >> Stddev IOPS:          3
> >> Max IOPS:             25
> >> Min IOPS:             5
> >> Average Latency(s):   0.999095
> >> Max latency(s):       8.50774
> >> Min latency(s):       0.0391591
> >>
> >> rados bench -p scbench 120 rand
> >> Total time run:       121.059005
> >> Total reads made:     1920
> >> Read size:            4194304
> >> Object size:          4194304
> >> Bandwidth (MB/sec):   63.4401
> >> Average IOPS:         15
> >> Stddev IOPS:          4
> >> Max IOPS:             26
> >> Min IOPS:             1
> >> Average Latency(s):   1.00785
> >> Max latency(s):       6.48138
> >> Min latency(s):       0.038925
> >>
> >> On the XFS partitions, fragmentation is no more than 1%.
> >I'd de-frag anyway, just to rule that out.
> >
> >When doing your tests or normal (busy) operations from the client VM, run
> >atop on your storage nodes and observe your OSD HDDs.
> >Do they get busy, around 100%?
> >
> >Check with iperf or NPtcp that your network to the clients from the
> >storage nodes is fully functional.
> >
> >Christian
> >--
> >Christian Balzer        Network/Systems Engineer
> >ch...@gol.com           Global OnLine Japan/Rakuten Communications
> >http://www.gol.com/
>

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com