What you are seeing is probably averaged over 1 second or so. So yes, over 
1 second the IO will have run on all OSDs. But at any single point in time 
a single thread will only be running against 1 OSD (+2 replicas), assuming 
the IO size isn't bigger than the object size. 

 

For RBD, if data is striped in 4MB chunks, then you have to read/write 
more than 4MB at a time to cross over into the next object. You get exactly 
the same problem on reads if you don't set the readahead above 4MB.
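
For example, on a kernel-mapped RBD you can raise the readahead above the 
4MB object size with blockdev (a sketch; /dev/rbd0 as the device name is an 
assumption, and the value is given in 512-byte sectors):

blockdev --setra 16384 /dev/rbd0    # 16384 sectors = 8MB readahead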

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
w...@globe.de
Sent: 21 July 2016 14:05
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

That cannot be correct.

Check it on your cluster with dstat as I said...

You will see parallel IO on every OSD and journal on every node...

 

On 21.07.16 at 15:02, Jake Young wrote:

I think the answer is that with 1 thread you can only ever write to one journal 
at a time. Theoretically, you would need 10 threads to be able to write to 10 
nodes at the same time.  

 

Jake

On Thursday, July 21, 2016, w...@globe.de wrote:

What I don't really understand is this:

Let's say the Intel P3700 delivers 200 MByte/s in a single-thread rados 
bench... See Nick's results below...

Say we have multiple OSD nodes, for example 10 nodes.

Every node has exactly 1x P3700 NVMe built in.

Why is the single-thread performance on the rbd client still exactly 
200 MByte/s with a 10-node OSD cluster???

I would expect 10 nodes * 200 MByte/s = 2000 MByte/s.

 

Everyone can check this on their own cluster: 

dstat -D sdb,sdc,sdd,sdX ...

You will see that Ceph stripes the data over all OSDs in the cluster if you 
test from the client side with rados bench:

rados bench -p rbd 60 write -b 4M -t 1

 

 

On 21.07.16 at 14:38, w...@globe.de wrote:

Is there no way to enable the Linux page cache, i.e. not use O_DSYNC...? 

That would improve performance dramatically. 
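
(For reference: client-side write-back caching for RBD is controlled by the 
rbd cache options in ceph.conf rather than by the kernel page cache. A 
minimal sketch for librbd clients, with assumed values:)

[client]
rbd cache = true
rbd cache writethrough until flush = true
rbd cache size = 67108864    # 64MB; the size value here is an assumption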


On 21.07.16 at 14:33, Nick Fisk wrote: 



-----Original Message----- 
From: w...@globe.de [mailto:w...@globe.de] 
Sent: 21 July 2016 13:23 
To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net> 
Cc: ceph-users@lists.ceph.com 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance 

Okay, and what is your plan now to speed things up? 

Now that I have come up with a lower-latency hardware design, there is not 
much further improvement to be had until persistent RBD caching is 
implemented, as that will move the SSD/NVMe closer to the client. But I'm 
happy with what I can achieve at the moment. You could also experiment with 
bcache on the RBD; see the sketch below. 
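
A bcache experiment on a mapped RBD would look roughly like this (a sketch; 
/dev/rbd0 as the mapped image and /dev/nvme0n1p1 as a spare NVMe partition 
are assumptions, and make-bcache wipes both devices):

make-bcache -C /dev/nvme0n1p1 -B /dev/rbd0      # create cache + backing pair
echo /dev/nvme0n1p1 > /sys/fs/bcache/register   # register with the kernel
echo /dev/rbd0 > /sys/fs/bcache/register
# the combined device then shows up as /dev/bcache0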




Would it help to put multiple P3700s in each OSD node to improve performance 
for a single thread (for example Storage vMotion)? 

Most likely not; it's all the other parts of the puzzle that are causing the 
latency. ESXi was designed for storage arrays that service IOs in the 
100us-1ms range; Ceph is probably about 10x slower than this, hence the 
problem. Disable the BBWC on a RAID controller or SAN and you will see the 
same behaviour. 




Regards 


On 21.07.16 at 14:17, Nick Fisk wrote: 



-----Original Message----- 
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
Of w...@globe.de 
Sent: 21 July 2016 13:04 
To: n...@fisk.me.uk; 'Horace Ng' <hor...@hkisl.net> 
Cc: ceph-users@lists.ceph.com 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance 

Hi, 

Hmm, I think 200 MByte/s is really bad. Is your cluster in production right now? 

It's just been built, not running yet. 




So if you start a storage migration you only get 200 MByte/s, right? 

I wish. My current cluster (not this new one) storage-migrates at 
~10-15MB/s. Serial latency is the problem: without being able to buffer, 
ESXi waits for an ack for each IO before sending the next. It also submits 
the migrations in 64kB chunks unless you get VAAI working; I think ESXi will 
then try to do them in parallel, which helps as well. 



I think it would be awesome if you got 1000 MByte/s. 

Where is the Bottleneck? 

Latency serialisation: without a buffer, you can't drive the devices 
to 100%. With buffered IO (or high queue depths) I can max out the journals. 
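
As a rough worked example (assuming ~20ms end-to-end latency per 4MB write, 
which matches the rados bench output below): at queue depth 1, 
throughput = IO size / latency = 4MB / 0.02s = 200MB/s, 
no matter how many OSD nodes are in the cluster.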




A fio test from Sebastien Han gives us 400 MByte/s raw performance from the 
P3700: 

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/ 
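
For reference, the journal-suitability test in that post is a sync-write fio 
run along these lines (a sketch; the device path is an assumption, and the 
test writes to the raw device, so only run it on an unused drive):

fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
    --name=journal-test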

How could it be that the rbd client performance is 50% slower? 

Regards 


On 21.07.16 at 12:15, Nick Fisk wrote: 



I've had a lot of pain with this; smaller block sizes are even worse. 
You want to try and minimise latency at every point, as there is no 
buffering happening in the iSCSI stack. This means: 

1. Fast journals (NVMe or NVRAM) 
2. 10Gb or better networking 
3. Fast CPUs (GHz) 
4. Pin CPU C-states to C1 
5. Fix the CPU frequency to max (see the example commands below for 4 and 5) 
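
A sketch of how 4 and 5 are commonly done on Intel hardware (treat the 
exact values as assumptions for your platform):

# add to the kernel command line (GRUB) and reboot, to cap C-states at C1:
intel_idle.max_cstate=1 processor.max_cstate=1

# pin the frequency governor to performance (cpupower from linux-tools):
cpupower frequency-set -g performance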

Also, I can't be sure, but I think there is a metadata update 
happening with VMFS, particularly if you are using thin VMDKs; this 
can also be a major bottleneck. For my use case I've switched over to NFS, 
as it has given much better performance at scale and less headache. 



For the RADOS Run, here you go (400GB P3700): 

Total time run:         60.026491 
Total writes made:      3104 
Write size:             4194304 
Object size:            4194304 
Bandwidth (MB/sec):     206.842 
Stddev Bandwidth:       8.10412 
Max bandwidth (MB/sec): 224 
Min bandwidth (MB/sec): 180 
Average IOPS:           51 
Stddev IOPS:            2 
Max IOPS:               56 
Min IOPS:               45 
Average Latency(s):     0.0193366 
Stddev Latency(s):      0.00148039 
Max latency(s):         0.0377946 
Min latency(s):         0.015909 

Nick 




-----Original Message----- 
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On 
Behalf Of Horace 
Sent: 21 July 2016 10:26 
To: w...@globe.de 
Cc: ceph-users@lists.ceph.com 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance 

Hi, 

Same here. I've read a blog post saying that VMware frequently 
re-verifies locking on VMFS over iSCSI, hence it gets much slower 
performance than NFS (which uses a different locking mechanism). 

Regards, 
Horace Ng 

----- Original Message ----- 
From: w...@globe.de 
To: ceph-users@lists.ceph.com 
Sent: Thursday, July 21, 2016 5:11:21 PM 
Subject: [ceph-users] Ceph + VMware + Single Thread Performance 

Hi everyone, 

we are seeing relatively slow single-thread performance on the iSCSI nodes 
of our cluster. 


Our setup: 

3 Racks: 

18x data nodes, 3 mon nodes, 3 iSCSI gateway nodes with tgt (rbd cache off). 

2x Samsung SM863 enterprise SSDs for journals (3 OSDs per SSD) and 6x 
WD Red 1TB per data node as OSDs. 

Replication = 3 

chooseleaf type rack in the crush map, so the 3 replicas land in different racks 
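
(For reference, a rack-level chooseleaf rule in the CRUSH map looks roughly 
like the sketch below; the rule name and ruleset number are assumptions:)

rule replicated_rack {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type rack
    step emit
}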


We only get about 90 MByte/s on the iSCSI gateway servers with: 

rados bench -p rbd 60 write -b 4M -t 1 


If we test with: 

rados bench -p rbd 60 write -b 4M -t 32 

we get about 600-700 MByte/s. 


We plan to replace the Samsung SSDs with Intel DC P3700 PCIe NVMe drives 
for the journal to get better single-thread performance. 

Is there anyone out there who has an Intel P3700 as journal and 
can share test results for: 


rados bench -p rbd 60 write -b 4M -t 1 


Thank you very much !! 

Kind Regards !! 


 



 





 
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
