Hello,

On Mon, 11 Jul 2016 09:54:59 +0300 K K wrote:

> 
> > I hope the fastest of these MONs (CPU and storage) has the lowest IP
> > number and thus is the leader.
> no, the lowest IP has the slowest CPU. But Zabbix didn't show any load at
> all on the MONs.

In your use case and configuration that's no surprise, but again, the MON
with the lowest IP will be the leader by default and thus the busiest.
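
To confirm which MON is currently the leader, you can ask the cluster
directly; the leader is reported as quorum_leader_name in the JSON output:

ceph quorum_status | grep quorum_leader_name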

> > Also what Ceph, OS, kernel version?
> 
> ubuntu 16.04 kernel 4.4.0-22
> 
Check the ML archives, I remember people having performance issues with the
4.4 kernels.

Still don't know your Ceph version, is it the latest Jewel?
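
To check, something like this (first the locally installed binaries, then
what a running daemon actually reports):

ceph -v
ceph tell osd.0 version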

> > Two GbE ports, given the "frontend" up there with the MON description I
> > assume that's 1 port per client (front) and cluster (back) network?
> yes, one GbE for ceph client, one GbE for back network.
OK, so (from a single GbE client) about 100MB/s at most (1Gb/s is ~125MB/s
raw, minus TCP/IP and Ceph protocol overhead).

> > Is there any other client other than that Windows VM on your Ceph cluster?
> Yes, one other instance, but without load.
OK.

> > Is Ceph understanding this now?
> > Other than that, the queue options aren't likely to do much good with pure
> >HDD OSDs.
> 
> I can't find those parameters in the running config:
> ceph --admin-daemon /var/run/ceph/ceph-mon.block01.asok config show|grep 
> "filestore_queue"

These are OSD parameters, you need to query an OSD daemon. 
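
For example, on one of your storage nodes (adjust the OSD id to one that
actually lives on that node):

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep filestore_queue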

> "filestore_queue_max_ops": "3000",
> "filestore_queue_max_bytes": "1048576000",
> "filestore_queue_max_delay_multiple": "0",
> "filestore_queue_high_delay_multiple": "0",
> "filestore_queue_low_threshhold": "0.3",
> "filestore_queue_high_threshhold": "0.9",
> > That should be 512, 1024 really with one RBD pool.
> 
> Yes, I know. Today, for testing, I added the scbench pool with 128 PGs.
> Here are the outputs of status and osd tree:
> ceph status
> cluster 830beb43-9898-4fa9-98c1-ee04c1cdf69c
> health HEALTH_OK
> monmap e6: 3 mons at 
> {block01=10.30.9.21:6789/0,object01=10.30.9.129:6789/0,object02=10.30.9.130:6789/0}
> election epoch 238, quorum 0,1,2 block01,object01,object02
> osdmap e6887: 18 osds: 18 up, 18 in
> pgmap v9738812: 1280 pgs, 3 pools, 17440 GB data, 4346 kobjects
> 35049 GB used, 15218 GB / 50267 GB avail
> 1275 active+clean
> 3 active+clean+scrubbing+deep
> 2 active+clean+scrubbing
>
Check the ML archives and restrict scrubs to off-peak hours as well as
tune things to keep their impact low.

Scrubbing is a major performance killer, especially on non-SSD journal
OSDs and with older Ceph versions and/or non-tuned parameters:
---
osd_scrub_end_hour = 6
osd_scrub_load_threshold = 2.5
osd_scrub_sleep = 0.1
---
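
These should be injectable into the running OSDs without a restart (put them
in ceph.conf as well so they survive restarts), something along the lines of:

ceph tell osd.* injectargs '--osd_scrub_sleep 0.1 --osd_scrub_load_threshold 2.5'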

> client io 5030 kB/s rd, 1699 B/s wr, 19 op/s rd, 0 op/s wr
> 
> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY 
> -1 54.00000 root default 
> -2 27.00000 host cn802 
> 0 3.00000 osd.0 up 1.00000 1.00000 
> 2 3.00000 osd.2 up 1.00000 1.00000 
> 4 3.00000 osd.4 up 1.00000 1.00000 
> 6 3.00000 osd.6 up 0.89995 1.00000 
> 8 3.00000 osd.8 up 1.00000 1.00000 
> 10 3.00000 osd.10 up 1.00000 1.00000 
> 12 3.00000 osd.12 up 0.89999 1.00000 
> 16 3.00000 osd.16 up 1.00000 1.00000 
> 18 3.00000 osd.18 up 0.90002 1.00000 
> -3 27.00000 host cn803 
> 1 3.00000 osd.1 up 1.00000 1.00000 
> 3 3.00000 osd.3 up 0.95316 1.00000 
> 5 3.00000 osd.5 up 1.00000 1.00000 
> 7 3.00000 osd.7 up 1.00000 1.00000 
> 9 3.00000 osd.9 up 1.00000 1.00000 
> 11 3.00000 osd.11 up 0.95001 1.00000 
> 13 3.00000 osd.13 up 1.00000 1.00000 
> 17 3.00000 osd.17 up 0.84999 1.00000 
> 19 3.00000 osd.19 up 1.00000 1.00000
> > Wrong way to test this, test it from a monitor node, another client node
> > (like your openstack nodes).
> > In your 2 node cluster half of the reads or writes will be local, very
> > much skewing your results.
> I have also tested from a compute node and got the same result, 80-100MB/sec.
> 
That's about as good as it gets (not 148MB/s, though!).
But rados bench is not the same as real client I/O.
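
If you want something closer to real RBD client I/O, one option is fio with
its rbd engine, run from one of your OpenStack nodes (this assumes fio was
built with rbd support; "testimg" is just an example image name that would
need to be created first):

rbd -p scbench create testimg --size 10240
fio --name=rbdtest --ioengine=rbd --clientname=admin --pool=scbench \
    --rbdname=testimg --invalidate=0 --rw=randread --bs=64k --iodepth=16 \
    --runtime=120 --time_based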

> > Very high max latency, telling us that your cluster ran out of steam at
> > some point.
> 
> I am copying data from my Windows instance right now.

Re-do any testing when you've stopped all scrubbing.
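
The simplest way to do that for a test window is to set the noscrub flags and
remove them again afterwards:

ceph osd set noscrub
ceph osd set nodeep-scrub
(run your tests)
ceph osd unset noscrub
ceph osd unset nodeep-scrub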

> > I'd de-frag anyway, just to rule that out.
> 
> 
> >When doing your tests or normal (busy) operations from the client VM, run
> > atop on your storage nodes and observe your OSD HDDs. 
> > Do they get busy, around 100%?
> 
> Yes, high IO load (600-800 IOPS). But this is very strange for SATA HDDs.
> Each HDD has its own OSD daemon and is presented to the OS as a hardware
> RAID0 volume (each block node has a hardware RAID controller). Example:

Your RAID controller and its HW cache are likely helping with that speed.
Also, these are all reads (an avgrq-sz of ~256 sectors means ~128KB per
request, i.e. large sequential reads), most likely the scrubs above; not a
single write to be seen.
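
You can verify that these reads are scrubs by checking which PGs are
currently scrubbing and which OSDs they map to, for example:

ceph pg dump pgs_brief 2>/dev/null | grep -i scrub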

> avg-cpu: %user %nice %system %iowait %steal %idle
> 1.44 0.00 3.56 17.56 0.00 77.44
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdc 0.00 0.00 649.00 0.00 82912.00 0.00 255.51 8.30 12.74 12.74 0.00 1.26 81.60
> sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdf 0.00 0.00 761.00 0.00 94308.00 0.00 247.85 8.66 11.26 11.26 0.00 1.18 90.00
> sdg 0.00 0.00 761.00 0.00 97408.00 0.00 256.00 7.80 10.22 10.22 0.00 1.01 76.80
> sdh 0.00 0.00 801.00 0.00 102344.00 0.00 255.54 8.05 10.05 10.05 0.00 0.96 76.80
> sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdj 0.00 0.00 537.00 0.00 68736.00 0.00 256.00 5.54 10.26 10.26 0.00 0.98 52.80
> 
> 
> > Check with iperf or NPtcp that your network to the clients from the
> > storage nodes is fully functional. 
> The network has been tested with iperf: 950-970Mbit/s between all nodes in
> the cluster (OpenStack and Ceph).

Didn't think it was that, but that's one more thing checked off the list.

Christian

Monday, 11 July 2016, 10:58 +05:00 from Christian Balzer
<ch...@gol.com>:
> >
> >
> >Hello,
> >
> >On Mon, 11 Jul 2016 07:35:02 +0300 K K wrote:
> >
> >> 
> >> Hello, guys
> >> 
> >> I am facing poor performance in a Windows 2k12r2 instance running
> >> on RBD (OpenStack cluster). The RBD disk has a size of 17Tb. My Ceph
> >> cluster consists of:
> >> - 3 monitor nodes (Celeron G530/6Gb RAM, DualCore E6500/2Gb RAM,
> >> Core2Duo E7500/2Gb RAM). Each node has a 1Gbit link to the frontend
> >> subnet of the Ceph cluster
> >
> >I hope the fastest of these MONs (CPU and storage) has the lowest IP
> >number and thus is the leader.
> >
> >Also what Ceph, OS, kernel version?
> >
> >> - 2 block nodes (Xeon E5620/32Gb RAM/2*1Gbit NIC). Each node has
> >> 2*500Gb HDDs for the operating system and 9*3Tb SATA HDDs (WD SE).
> >> 18 OSD daemons in total on the 2 nodes.
> >
> >Two GbE ports, given the "frontend" up there with the MON description I
> >assume that's 1 port per client (front) and cluster (back) network?
> >
> >> Journals are placed on the same HDDs as the RADOS data. I know that a
> >> separate SSD would be better for that purpose.
> >Indeed...
> >
> >> When I tested a
> >> new Windows instance, performance was good (read/write around
> >> 100MB/sec). But after I copied 16Tb of data to the Windows instance,
> >> read performance dropped to 10MB/sec. The data on the VM is images and
> >> video.
> >> 
> >100MB/s would be absolutely perfect with the setup you have, assuming no
> >contention (other clients).
> >
> >Is there any other client other than that Windows VM on your Ceph cluster?
> >
> >> ceph.conf on client side:
> >> [global]
> >> auth cluster required = cephx
> >> auth service required = cephx
> >> auth client required = cephx
> >> filestore xattr use omap = true
> >> filestore max sync interval = 10
> >> filestore queue max ops = 3000
> >> filestore queue commiting max bytes = 1048576000
> >> filestore queue commiting max ops = 5000
> >> filestore queue max bytes = 1048576000
> >> filestore queue committing max ops = 4096
> >> filestore queue committing max bytes = 16 MiB
> >                                            ^^^
> >Is Ceph understanding this now?
> >Other than that, the queue options aren't likely to do much good with pure
> >HDD OSDs.
> >
> >> filestore op threads = 20
> >> filestore flusher = false
> >> filestore journal parallel = false
> >> filestore journal writeahead = true
> >> journal dio = true
> >> journal aio = true
> >> journal force aio = true
> >> journal block align = true
> >> journal max write bytes = 1048576000
> >> journal_discard = true
> >> osd pool default size = 2 # Write an object n times.
> >> osd pool default min size = 1
> >> osd pool default pg num = 333
> >> osd pool default pgp num = 333
> >That should be 512, 1024 really with one RBD pool.
> >http://ceph.com/pgcalc/
> >
> >> osd crush chooseleaf type = 1
> >> 
> >> [client]
> >> rbd cache = true
> >> rbd cache size = 67108864
> >> rbd cache max dirty = 50331648
> >> rbd cache target dirty = 33554432
> >> rbd cache max dirty age = 2
> >> rbd cache writethrough until flush = true
> >> 
> >> 
> >> rados bench from a block node shows:
> >Wrong way to test this, test it from a monitor node, another client node
> >(like your openstack nodes).
> >In your 2 node cluster half of the reads or writes will be local, very
> >much skewing your results.
> >
> >> rados bench -p scbench 120 write --no-cleanup
> >
> >Default tests with 4MB "blocks"; what are the writes or reads from your
> >client VM like?
> >
> >> Total time run: 120.399337
> >> Total writes made: 3538
> >> Write size: 4194304
> >> Object size: 4194304
> >> Bandwidth (MB/sec): 117.542
> >> Stddev Bandwidth: 9.31244
> >> Max bandwidth (MB/sec): 148 
> >                          ^^^
> >That wouldn't be possible from an external client.
> >
> >> Min bandwidth (MB/sec): 92
> >> Average IOPS: 29
> >> Stddev IOPS: 2
> >> Max IOPS: 37
> >> Min IOPS: 23
> >> Average Latency(s): 0.544365
> >> Stddev Latency(s): 0.35825
> >> Max latency(s): 5.42548
> >Very high max latency, telling us that your cluster ran out of steam at
> >some point.
> >
> >> Min latency(s): 0.101533
> >> 
> >> rados bench -p scbench 120 seq
> >> Total time run: 120.880920
> >> Total reads made: 1932
> >> Read size: 4194304
> >> Object size: 4194304
> >> Bandwidth (MB/sec): 63.9307
> >> Average IOPS 15
> >> Stddev IOPS: 3
> >> Max IOPS: 25
> >> Min IOPS: 5
> >> Average Latency(s): 0.999095
> >> Max latency(s): 8.50774
> >> Min latency(s): 0.0391591
> >> 
> >> rados bench -p scbench 120 rand
> >> Total time run: 121.059005
> >> Total reads made: 1920
> >> Read size: 4194304
> >> Object size: 4194304
> >> Bandwidth (MB/sec): 63.4401
> >> Average IOPS: 15
> >> Stddev IOPS: 4
> >> Max IOPS: 26
> >> Min IOPS: 1
> >> Average Latency(s): 1.00785
> >> Max latency(s): 6.48138
> >> Min latency(s): 0.038925
> >> 
> >> Fragmentation on the XFS partitions is no more than 1%.
> >I'd de-frag anyway, just to rule that out.
> >
> >When doing your tests or normal (busy) operations from the client VM, run
> >atop on your storage nodes and observe your OSD HDDs. 
> >Do they get busy, around 100%?
> >
> >Check with iperf or NPtcp that your network to the clients from the
> >storage nodes is fully functional. 
> >
> >Christian
> >-- 
> >Christian Balzer        Network/Systems Engineer 
> >ch...@gol.com Global OnLine Japan/Rakuten Communications
> >http://www.gol.com/
> 


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/