On 11/15/16 22:13, Thomas Danan wrote:
> Very interesting ...
>
> Any idea why the optimal tunables would help here?
I think there are some versions where it rebalances the data a lot to even
things out... I don't remember why I think that, or where I read it.
Maybe it was only argonaut vs. newer releases. But having to rebalance
75% of the data makes me feel more confident that it will help. (And keep
in mind it significantly changes the client version compatibility
requirements, especially for kernel drivers, which may not even exist yet
in a compatible version.)
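
If you want to see what you'd be getting into before committing, something
like this should work (from memory, so double-check against the docs for
your release):

    # show the current CRUSH tunables profile and the individual values
    ceph osd crush show-tunables
    # switch profiles; this is what triggers the big rebalance
    ceph osd crush tunables optimal
    # watch how much data is misplaced / moving
    ceph -s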

And looking at iostat etc. at the times when requests block, it seems like
1-2 disks are at 100% util while the rest are nearly idle, and the SSD
journals rarely go above 10% or so (I bought 2 expensive Micron DC ones
per node). So I think balance is the most important thing I need, and
plain efficiency is the next thing (which might come from bluestore when
it's ready, especially for rbd snapshot CoW). Having 2 disks at 100% is
something like 300-560 iops, whereas the whole server ought to do about
1700 iops (per node: 3 disks that do about 280 and 6 more that do about
150 direct sync random write 4k iops). That's only about 21% utilization
before it blocks.
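
(For reference, the per-disk numbers I'm talking about come from plain
iostat, roughly like this; %util is the last column:)

    # extended per-device stats, refreshed every 5 seconds
    iostat -x 5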

You could try getting the data out of my ganglia here (sda and sdb are the
SSDs, and ceph2's sdg is broken and missing, so its graph data is bogus):
http://www.brockmann-consult.de/ganglia/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=ceph.*&mreg%5B%5D=sd%5Bc-z%5D_util&gtype=line&glegend=show&aggregate=1

But it's not that easy to get this info out of ganglia... highly
customized graphing isn't its strong point.

>  On our cluster we have 500 TB of data; I am a bit concerned about
> changing it without taking a lot of precautions...
I can't guarantee a bug-free experience, but you can change it, look at
the percentage of misplaced objects, and if you don't like it, change it
back (maybe it will be much less going from firefly to hammer than to
jewel like me). But if you wait an hour before changing it back, you can
bet it will take another hour to settle again (or set nobackfill first,
maybe). I don't like this, but I don't know what to do about it other
than rate limit the recovery and accept the enormous wait.
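
Roughly, the sequence I'd try on your side is something like this (only a
sketch, and firefly -> hammer is my assumption about which profile you'd
pick):

    # optionally pause data movement while you look at the impact
    ceph osd set nobackfill
    # change the tunables profile
    ceph osd crush tunables hammer
    # check the misplaced objects percentage in the status output
    ceph -s
    # if you don't like it, put it back...
    ceph osd crush tunables firefly
    # ...and then allow backfill again
    ceph osd unset nobackfill
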
> I am curious to know how much time it takes you to change the tunables,
> the size of your cluster, and the observed impact on client IO...
Well... it was at 77.65% or so (the tunables made it 75%, plus a bit more
from adding pgs), and now after almost 3 hours it's at 75.141%... so I
imagine it'll take somewhere between 75 hours and forever minus a day or
two. But with the sleep settings, it seems not to cause any issues. So if
there's any chance of it balancing out the load on the OSDs, I'll try it.
(And these numbers are with me fiddling with it and watching it every now
and then... I'll set max backfills back to 1 and the sleep back to about
0.6 when I go to bed... maybe then it'll run at half speed.)
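
(For the record, I'm changing those at runtime with injectargs, roughly
like this; the option names are from memory, so check them with
"ceph daemon osd.0 config show" on your version:)

    # throttle backfill and add sleeps so client IO survives the rebalance
    ceph tell osd.* injectargs '--osd_max_backfills 1'
    ceph tell osd.* injectargs '--osd_recovery_sleep 0.6'
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.5'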

Also FYI I only have 31% space used (most of the disks I added were to
make it not horribly slow rather than to add space, since it was so slow
with only 3 disks per node).

The cluster is just 3 nodes, each with 2x Micron S630DC-400, 3x
HUS724040ALS640, and 6x Hitachi HUA722020ALA330 (minus one dead one). The
last model is SATA... just some old stuff I added to speed things up,
which helped even though they're slower.

> # ceph df
> GLOBAL:
>     SIZE       AVAIL      RAW USED     %RAW USED
>     65173G     45110G       20063G         30.78
And as for impact... I could tell you more tomorrow. But with the sleep
settings, the 4k randwrite iops in fio benchmarks seem to be maybe half of
or the same as before, and other behavior doesn't seem so bad... maybe
even better than before on average, with a few more hiccups, but less
blocking killing QEMU VMs (which I can't explain... do tunables do that
right away? Or did the snap trim sleep do something? I doubt the recovery
sleep did, since there was no recovery until I decided to change things.
Or it's just luck so far, and tomorrow morning some VMs will be dead like
every morning for the past week, needing SIGKILL).
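
(The fio numbers I mean are from something like the following, run against
an rbd-backed disk; the filename and size are just placeholders for
whatever you test on:)

    # 4k random sync writes, the worst case for these spinners
    fio --name=randwrite4k --rw=randwrite --bs=4k --direct=1 --sync=1 \
        --ioengine=libaio --iodepth=1 --runtime=60 --time_based \
        --size=1G --filename=/path/to/testfile
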
>
> Yes, we do have daily rbd snapshots from 16 different Ceph RBD clients.
> Snapshotting the RBD image is quite immediate, while we are seeing the
> issue continuously during the day...
>
> Will check all of this tomorrow . ..
>
> Thanks again
>
> Thomas
>
> -------- Original message --------
> From: Peter Maloney <peter.malo...@brockmann-consult.de>
> Date: 11/15/16 21:27 (GMT+01:00)
> To: Thomas Danan <thomas.da...@mycom-osi.com>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] ceph cluster having blocke requests very
> frequently
>
> On 11/15/16 14:05, Thomas Danan wrote:
> > Hi Peter,
> >
> > Ceph cluster version is 0.94.5 and we are running with firefly
> > tunables, and also we have 10K PGs instead of the 30K/40K we should have.
> > The Linux kernel version is 3.10.0-327.36.1.el7.x86_64 with RHEL 7.2
> >
> > On our side we have the following settings:
> > mon_osd_adjust_heartbeat_grace = false
> > mon_osd_adjust_down_out_interval = false
> > mon_osd_min_down_reporters = 5
> > mon_osd_min_down_reports = 10
> >
> > explaining why the OSDs are not flapping, but they are still behaving
> > wrongly and generating the slow requests I am describing.
> >
> > The osd_op_complaint_time is at the default value (30 sec); not
> > sure I want to change it based on your experience.
> I wasn't saying you should set the complaint time to 5, just saying
> that's why I have complaints logged with such low block times.
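> (For reference, on my side that's just something like this in ceph.conf:
>     osd op complaint time = 5
> which is why ops of only a few seconds show up as slow requests.)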
> > Thomas
>
> And now I'm testing this:
>         osd recovery sleep = 0.5
>         osd snap trim sleep = 0.5
>
> (or fiddling with it as low as 0.1 to make it rebalance faster)
>
> I'm also changing the tunables to optimal at the same time (which will
> rebalance 75% of the objects). That has had very good results so far (a
> few <14s blocks right at the start, and none since, over an hour ago).
>
> And I'm somehow hoping that will fix my rbd export-diff issue too... but
> at the very least it appears to fix the blocks caused by the rebalance.
>
> Do you use rbd snapshots? I think that may be causing my issues, based
> on things like:
>
> >             "description": "osd_op(client.692201.0:20455419 4.1b5a5bc1
> > rbd_data.94a08238e1f29.000000000000617b [] snapc 918d=[918d]
> > ack+ondisk+write+known_if_redirected e40036)",
> >             "initiated_at": "2016-11-15 20:57:48.313432",
> >             "age": 409.634862,
> >             "duration": 3.377347,
> >             ...
> >                     {
> >                         "time": "2016-11-15 20:57:48.313767",
> >                         "event": "waiting for subops from 0,1,8,22"
> >                     },
> >             ...
> >                     {
> >                         "time": "2016-11-15 20:57:51.688530",
> >                         "event": "sub_op_applied_rec from 22"
> >                     },
>
>
> That has "snapc" in there (CoW?), and I think it shows that just one OSD
> is delayed a few seconds while the rest are really fast, like you said.
> (And I'm not sure why I see 4 OSDs here when I have size 3... node1 osd 0
> and 1, and node3 osd 8 and 22.)
>
> Or some (shorter, I think) ops have a description like:
> > osd_repop(client.426591.0:203051290 4.1f9
> > 4:9fe4c001:::rbd_data.4cf92238e1f29.00000000000014ef:head v
> 40047'2531604)
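>
> (In case you want to pull the same info from your OSDs, these come from
> the admin socket, something like:
>     ceph daemon osd.22 dump_historic_ops
> run on the host where that OSD lives.)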
