Hi Alexandre,

No problem, I hope this saves you some pain.

It's probably worth going for a larger journal, around 20GB, if you wish to play 
with tuning "filestore max sync interval"; that can have some interesting results.
Also, you probably already know this, but most of us, when starting with Ceph, use 
a file on xfs for the journal instead of a raw partition; using a raw partition 
removes the filesystem overhead on the journal.
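
A rough sketch of what that could look like in ceph.conf (the values are only 
illustrative starting points, not tested recommendations, and the partition path 
is made up):

[osd]
# ~20GB journal (the value is in MB)
osd journal size = 20480
# sync the filestore less often so the larger journal actually gets used
filestore max sync interval = 30
# point the journal at a raw partition instead of a file
osd journal = /dev/disk/by-partlabel/journal-$id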

I highly recommend looking into dedicated journals for your systems, as your 
spinning disks are going to work very hard trying to keep up with all the 
read/write seeking, particularly if you're going to be using them for VMs.
Also, with journals on the same disk you'll get about 1/3 of the write performance 
as a best case scenario: every client write hits the disk twice (once for the 
journal, once for the data) and the head has to seek between the two, so it really 
comes down to the disk's IOPS.

Depending on your hardware & budget, you could look into using one of these 
options for dedicated journals:

Intel DC P3700 400GB PCIe: good for about ~1000 MB/s write (I haven't tested these 
myself, but we're looking to use them in our additional nodes)
Intel DC S3700 200GB: good for about ~360 MB/s write

At the time we used the Intel DC S3700 100GB; these drives don't have enough 
throughput, so I'd recommend you stay away from this particular 100GB model.

So if you have spare drive slots in your servers, the 200GB DC S3700 is the best 
bang for buck. Usually I run 6 spinning disks to 1 SSD; in an ideal world I'd like 
to cut this back to 4 instead of 6 when using the 200GB disks, though, so the SSD 
is less likely to become the write bottleneck.

Both of these SSD options would do nicely; they have on-board capacitors 
(power-loss protection) and very high write endurance as well.
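
If you do go that way, the layout is basically one small journal partition on the 
SSD per OSD. A rough sketch with ceph-deploy, where sdb-sde are the spinners and 
sdf is the SSD (hostname and device names are made up):

ceph-deploy osd prepare nodeA:sdb:/dev/sdf1
ceph-deploy osd prepare nodeA:sdc:/dev/sdf2
ceph-deploy osd prepare nodeA:sdd:/dev/sdf3
ceph-deploy osd prepare nodeA:sde:/dev/sdf4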

Cheers,
Quenten Grasso

-----Original Message-----
From: Bécholey Alexandre [mailto:alexandre.becho...@nagra.com] 
Sent: Monday, 29 September 2014 4:15 PM
To: Quenten Grasso; ceph-users@lists.ceph.com
Cc: Aviolat Romain
Subject: RE: [ceph-users] IO wait spike in VM

Hello Quenten,

Thanks for your reply.

We have a 5GB journal for each OSD on the same disk.

Right now, we are migrating our OSDs to XFS and we'll add a 5th monitor. We will 
perform the benchmarks afterwards.

Cheers,
Alexandre

-----Original Message-----
From: Quenten Grasso [mailto:qgra...@onq.com.au] 
Sent: Monday, 29 September 2014 01:56
To: Bécholey Alexandre; ceph-users@lists.ceph.com
Cc: Aviolat Romain
Subject: RE: [ceph-users] IO wait spike in VM

G'day Alexandre

I'm not sure if this is causing your issues, however it could be contributing 
to them. 

I noticed you have 4 mons; this could be contributing to your problems. Because of 
the Paxos algorithm Ceph uses to achieve quorum among the monitors, it's 
recommended to run an odd number of mons: 1, 3, 5, 7, etc. It's also worth 
mentioning that running 4 mons would still only allow for the failure of 1 mon 
without an outage, since a majority (3 of 4) is needed for quorum, which is no 
better than running 3.

Spec-wise the machines look pretty good; the only things I can see are the lack of 
dedicated journals and the use of btrfs at this stage.

You could try some iperf testing between the machines to make sure the 
networking is working as expected.
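
For example, between any two of the nodes (iperf defaults to TCP on port 5001):

# on the first node
iperf -s
# on the second node, push traffic at it for 30 seconds
iperf -c <first-node-ip> -t 30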

If you run rados benches for an extended time, what kind of stats do you see?

For example,

Write)
ceph osd pool create benchmark1 XXXX XXXX
ceph osd pool set benchmark1 size 3
rados bench -p benchmark1 180 write --no-cleanup --concurrent-ios=32

* I suggest you create a 2nd benchmark pool and write to it for another 180 
seconds or so to ensure nothing is cached, then do the read test.
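
For example (benchmark2 is just an illustrative name for the second pool):

ceph osd pool create benchmark2 XXXX XXXX
ceph osd pool set benchmark2 size 3
rados bench -p benchmark2 180 write --no-cleanup --concurrent-ios=32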

Read)
rados bench -p benchmark1 180 seq --concurrent-ios=32

You can also try the same using 4k blocks

rados bench -p benchmark1 180 write -b 4096 --no-cleanup --concurrent-ios=32 
rados bench -p benchmark1 180 seq -b 4096

As you may know, increasing the concurrent IOs will increase CPU/disk load.

XXXX = Total PGs = OSDs * 100 / Replicas
e.g. a 50 OSD system with 3 replicas would be around 1600
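
For example, with the 16 OSDs you mentioned: 16 * 100 / 3 is about 533, which 
you'd normally round to a nearby power of two, so a pg_num/pgp_num of 512 (or 
1024) would be in the right ballpark.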

Hope this helps a little,

Cheers,
Quenten Grasso


-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Bécholey Alexandre
Sent: Thursday, 25 September 2014 1:27 AM
To: ceph-users@lists.ceph.com
Cc: Aviolat Romain
Subject: [ceph-users] IO wait spike in VM

Dear Ceph guru,

We have a Ceph cluster (version 0.80.5 
38b73c67d375a2552d8ed67843c8a65c2c0feba6) with 4 MONs and 16 OSDs (4 per host) 
used as backend storage for libvirt.

Hosts:
Ubuntu 14.04
CPU: 2 Xeon X5650
RAM: 48 GB (no swap)
No SSD for journals
HDD: 4 WDC WD2003FYYS-02W0B0 (2 TB, 7200 rpm) dedicated to OSD (one partition 
for the journal, the rest for the OSD)
FS: btrfs (I know it's not recommended in the doc and I hope it's not the 
culprit)
Network: dedicated 10GbE

As we added VMs to the cluster, we saw sporadic, huge IO wait spikes in the VMs. 
The hosts running the OSDs seem fine.
I followed a similar discussion here: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040621.html

Here is an example of a transaction that took some time:

        { "description": "osd_op(client.5275.0:262936 
rbd_data.22e42ae8944a.0000000000000807 [] 3.c9699248 ack+ondisk+write e3158)",
          "received_at": "2014-09-23 15:23:30.820958",
          "age": "108.329989",
          "duration": "5.814286",
          "type_data": [
                "commit sent; apply or cleanup",
                { "client": "client.5275",
                  "tid": 262936},
                [
                    { "time": "2014-09-23 15:23:30.821097",
                      "event": "waiting_for_osdmap"},
                    { "time": "2014-09-23 15:23:30.821282",
                      "event": "reached_pg"},
                    { "time": "2014-09-23 15:23:30.821384",
                      "event": "started"},
                    { "time": "2014-09-23 15:23:30.821401",
                      "event": "started"},
                    { "time": "2014-09-23 15:23:30.821459",
                      "event": "waiting for subops from 14"},
                    { "time": "2014-09-23 15:23:30.821561",
                      "event": "commit_queued_for_journal_write"},
                    { "time": "2014-09-23 15:23:30.821666",
                      "event": "write_thread_in_journal_buffer"},
                    { "time": "2014-09-23 15:23:30.822591",
                      "event": "op_applied"},
                    { "time": "2014-09-23 15:23:30.824707",
                      "event": "sub_op_applied_rec"},
                    { "time": "2014-09-23 15:23:31.225157",
                      "event": "journaled_completion_queued"},
                    { "time": "2014-09-23 15:23:31.225297",
                      "event": "op_commit"},
                    { "time": "2014-09-23 15:23:36.635085",
                      "event": "sub_op_commit_rec"},
                    { "time": "2014-09-23 15:23:36.635132",
                      "event": "commit_sent"},
                    { "time": "2014-09-23 15:23:36.635244",
                      "event": "done"}]]}

sub_op_commit_rec took about 5 seconds to complete, which makes me think that the 
replication is the bottleneck.

I didn't find any timeouts in Ceph's logs. The cluster is healthy. When some VMs 
have high IO wait, the cluster is doing between 2 and 40 op/s.
I would gladly provide more information if needed.

How can I dig deeper? For example, is there a way to know which OSDs the 
replication goes to for that specific transaction?

Cheers,
Alexandre Bécholey


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com