Hey,

Thanks to all for your replies. We finished the migration to XFS yesterday 
morning, and the load average on our VMs is back to normal.

Our cluster was just a test before scaling up with bigger nodes. We haven't 
decided yet how to split the SSDs between journals (as recommended) and/or 
cache pools on the hypervisors.

Cheers,
Alexandre

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Christian Balzer
Sent: Monday, 29 September 2014 17:40
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] IO wait spike in VM

On Mon, 29 Sep 2014 09:04:51 +0000 Quenten Grasso wrote:

> Hi Alexandre,
> 
> No problem, I hope this saves you some pain
> 
> It's probably worth going for a larger journal, around 20 GB, if you 
> wish to play with tuning "filestore max sync interval"; that can give 
> some interesting results. You probably already know this, but most of 
> us, when starting with Ceph, put the journal in a file on XFS instead 
> of on a raw partition; using a raw partition removes the file system 
> overhead on the journal.
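> 
> As a rough sketch of that kind of setup, the relevant ceph.conf entries 
> could look something like the lines below (the 20 GB size, the 30 s 
> interval and /dev/sdb1 are only example/placeholder values, adjust them 
> to your hardware):
> 
>   [osd]
>   # journal size in MB (~20 GB)
>   osd journal size = 20480
>   # how long the filestore may accumulate writes before syncing (default 5 s)
>   filestore max sync interval = 30
> 
>   [osd.0]
>   # point this OSD's journal at a raw partition to skip file system overhead
>   osd journal = /dev/sdb1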
> 
> I highly recommend looking into dedicated journals for your systems, as 
> your spinning disks are going to work very hard trying to keep up with 
> all the read/write seeking, particularly if you're using them for VMs. 
> Also, with the journal on the same disk every client write hits the 
> spindle twice (once for the journal, once for the filestore), so you'll 
> get about 1/3 of the write performance as a "best case scenario", and 
> this comes down to the disk's IOPS.
> 
> Depending on your hardware & budget you could look into using one of 
> these options for dedicated journals
> 
> Intel DC P3700 400GB PCIe - good for about ~1000 MB/s write (I haven't 
> tested these myself, but we are looking to use them in our additional 
> nodes).
> Intel DC S3700 200GB - good for about ~360 MB/s write.
> 
> At the time we used the Intel DC S3700 100GB; these drives don't have 
> enough throughput, so I'd recommend you stay away from that particular 
> 100GB model.
> 
That's of course a very subjective statement. ^o^ I'm using 4 of those with 8 
HDDs because it was the best bang for the proverbial buck for me, and they are 
plenty fast enough in that scenario.

Also, in my world I run out of IOPS way, way before I run out of bandwidth.

> So if you have spare hard disk slots in your servers, the 200GB DC 
> S3700 is the best bang for buck. Usually I run 6 spinning disks to 1 
> SSD; in an ideal world I'd like to cut this back to 4 instead of 6, 
> though, when using the 200GB disks.
> 
Precisely.
A 6:1 ratio might actually still be sufficient to keep the HDDs busy, but 
unless you have a large cluster, losing 6 OSDs because one SSD misbehaved is 
painful.
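
As a rough back-of-the-envelope (assuming something like 100-150 MB/s of 
sequential write per 7200 RPM HDD and the ~360 MB/s quoted above for the 
200GB S3700): six HDDs behind one SSD could in theory take 600-900 MB/s, so 
at 6:1 the journal SSD is the ceiling for big sequential writes, while at 
4:1 the two sides are roughly matched. With small random I/O the picture 
flips and the HDDs run out of IOPS long before the SSD runs out of bandwidth.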

> Both of these SSD options would do nicely and have on board capacitors 
> and very high write/wear rates as well.
> 
Agreed.

Christian
> Cheers,
> Quenten Grasso
> 
> -----Original Message-----
> From: Bécholey Alexandre [mailto:alexandre.becho...@nagra.com]
> Sent: Monday, 29 September 2014 4:15 PM
> To: Quenten Grasso; ceph-users@lists.ceph.com
> Cc: Aviolat Romain
> Subject: RE: [ceph-users] IO wait spike in VM
> 
> Hello Quenten,
> 
> Thanks for your reply.
> 
> We have a 5GB journal for each OSD on the same disk.
> 
> Right now, we are migrating our OSDs to XFS and we'll add a 5th monitor.
> We will perform the benchmarks afterwards.
> 
> Cheers,
> Alexandre
> 
> -----Original Message-----
> From: Quenten Grasso [mailto:qgra...@onq.com.au]
> Sent: Monday, 29 September 2014 01:56
> To: Bécholey Alexandre; ceph-users@lists.ceph.com
> Cc: Aviolat Romain
> Subject: RE: [ceph-users] IO wait spike in VM
> 
> G'day Alexandre
> 
> I'm not sure if this is causing your issues, however it could be 
> contributing to them.
> 
> I noticed you have 4 mons; this could be contributing to your problems. 
> Because of the Paxos algorithm Ceph uses to reach quorum among the 
> mons, you should be running an odd number of them: 1, 3, 5, 7, etc. 
> It's also worth mentioning that with 4 mons quorum still requires 3 
> (a majority, floor(4/2) + 1), so you can only lose 1 mon without an 
> outage, exactly as with 3 mons.
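> 
> If you want to double-check the quorum state, the standard mon commands 
> for that are along the lines of:
> 
>   ceph mon stat
>   ceph quorum_status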
> 
> Spec-wise the machines look pretty good; the only things I can see are 
> the lack of dedicated journals and the use of btrfs at this stage.
> 
> You could try some iperf testing between the machines to make sure the 
> networking is working as expected.
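> 
> A quick sanity check could look something like this (the host name is 
> just a placeholder; -t 30 runs for 30 seconds, -P 4 uses 4 parallel 
> streams):
> 
>   iperf -s                          # on one node
>   iperf -c <other-node> -t 30 -P 4  # on a second node, pointed at the first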
> 
> If you do rados benches for an extended time, what kind of stats do you see?
> 
> For example,
> 
> Write)
> ceph osd pool create benchmark1 XXXX XXXX
> ceph osd pool set benchmark1 size 3
> rados bench -p benchmark1 180 write --no-cleanup --concurrent-ios=32
> 
> * I suggest you create a 2nd benchmark pool and write for another 180 
> seconds or so to ensure nothing is cached, then do a read test.
> 
> Read)
> rados bench -p benchmark1 180 seq --concurrent-ios=32
> 
> You can also try the same using 4k blocks
> 
> rados bench -p benchmark1 180 write -b 4096 --no-cleanup --concurrent-ios=32
> rados bench -p benchmark1 180 seq -b 4096
> 
> As you may know, increasing the concurrent IOs will increase CPU/disk 
> load.
> 
> XXXX = Total PGs = (OSDs * 100) / Replicas
> i.e. a 50-OSD system with 3 replicas works out to about 1666 (50 * 100 / 3)
> 
> Hope this helps a little,
> 
> Cheers,
> Quenten Grasso
> 
> 
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
> Of Bécholey Alexandre Sent: Thursday, 25 September 2014 1:27 AM
> To: ceph-users@lists.ceph.com
> Cc: Aviolat Romain
> Subject: [ceph-users] IO wait spike in VM
> 
> Dear Ceph guru,
> 
> We have a Ceph cluster (version 0.80.5
> 38b73c67d375a2552d8ed67843c8a65c2c0feba6) with 4 MONs and 16 OSDs (4 per
> host) used as backend storage for libvirt.
> 
> Hosts:
> Ubuntu 14.04
> CPU: 2 Xeon X5650
> RAM: 48 GB (no swap)
> No SSD for journals
> HDD: 4 WDC WD2003FYYS-02W0B0 (2 TB, 7200 rpm) dedicated to OSD (one 
> partition for the journal, the rest for the OSD)
> FS: btrfs (I know it's not recommended in the doc and I hope it's not the culprit)
> Network: dedicated 10GbE
> 
> As we added some VMs to the cluster, we saw sporadic huge IO wait in 
> the VMs. The hosts running the OSDs seem fine. I followed a similar 
> discussion here:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040621.html
> 
> Here is an example of a transaction that took some time:
> 
>         { "description": "osd_op(client.5275.0:262936 rbd_data.22e42ae8944a.0000000000000807 [] 3.c9699248 ack+ondisk+write e3158)",
>           "received_at": "2014-09-23 15:23:30.820958",
>           "age": "108.329989",
>           "duration": "5.814286",
>           "type_data": [
>                 "commit sent; apply or cleanup",
>                 { "client": "client.5275",
>                   "tid": 262936},
>                 [
>                     { "time": "2014-09-23 15:23:30.821097",
>                       "event": "waiting_for_osdmap"},
>                     { "time": "2014-09-23 15:23:30.821282",
>                       "event": "reached_pg"},
>                     { "time": "2014-09-23 15:23:30.821384",
>                       "event": "started"},
>                     { "time": "2014-09-23 15:23:30.821401",
>                       "event": "started"},
>                     { "time": "2014-09-23 15:23:30.821459",
>                       "event": "waiting for subops from 14"},
>                     { "time": "2014-09-23 15:23:30.821561",
>                       "event": "commit_queued_for_journal_write"},
>                     { "time": "2014-09-23 15:23:30.821666",
>                       "event": "write_thread_in_journal_buffer"},
>                     { "time": "2014-09-23 15:23:30.822591",
>                       "event": "op_applied"},
>                     { "time": "2014-09-23 15:23:30.824707",
>                       "event": "sub_op_applied_rec"},
>                     { "time": "2014-09-23 15:23:31.225157",
>                       "event": "journaled_completion_queued"},
>                     { "time": "2014-09-23 15:23:31.225297",
>                       "event": "op_commit"},
>                     { "time": "2014-09-23 15:23:36.635085",
>                       "event": "sub_op_commit_rec"},
>                     { "time": "2014-09-23 15:23:36.635132",
>                       "event": "commit_sent"},
>                     { "time": "2014-09-23 15:23:36.635244",
>                       "event": "done"}]]}
> 
> sub_op_commit_rec took about 5 seconds to complete, which makes me think 
> that the replication is the bottleneck.
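> 
> (For reference, a dump like this can be pulled from an OSD's admin 
> socket with something along the lines of:
> 
>   ceph daemon osd.<id> dump_historic_ops
> 
> run on the node hosting that OSD.)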
> 
> I didn't find any timeout in Ceph's logs. The cluster is healthy. When 
> some VMs have high IO wait, the cluster op/s is between 2 and 40. I 
> would gladly provide more information if needed.
> 
> How can I dig deeper? For example, is there a way to know which OSD 
> the replication went to for that specific transaction?
> 
> Cheers,
> Alexandre Bécholey
> 
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
