Re: [ceph-users] Split-brain in a multi-site cluster

2017-02-05 Thread Ilia Sokolinski
Thank you Joao and Gregory!

I will investigate whether we can use the mon_osd_reporter_subtree_level option.
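
For reference, a minimal sketch of what I have in mind, assuming our
CRUSH map defines a datacenter bucket type (the value must match a
type in your own map, and the behaviour interacts with
mon_osd_min_down_reporters):

[mon]
# Group OSD failure reporters by this CRUSH bucket type, so an OSD is
# only marked down once reports arrive from more than one such subtree.
mon_osd_reporter_subtree_level = datacenter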

Ilia
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-02-05 Thread Peter Maloney
On 02/03/17 19:54, David Turner wrote:
> Our current solution in Hammer involves a daemon monitoring the
> cluster load and setting the osd_snap_trim_sleep accordingly between 0
> and 0.35, which does a good job of preventing IO blocking and helps
> us clear out the snap_trim_q each day.  If these settings are not
> injectable in Jewel, that would rule out using variable settings
> throughout the day.
Are you sure they're not injectable? [Almost?] everything says
"unchangeable", but it takes effect anyway. I have tested
"snap_trim_sleep", and as Nick has pointed out, it seems to cause rather
than prevent blocks (I found the sweet spot is 0, though he seems to
think it's higher). I can reproduce that very reliably using injectargs
(+5 to the sleep means +5s of block length), so unless luck can strike
in such extreme ways, or somehow only part of the effect changed, they
are changeable.

I'm using Jewel. So I am now using:

osd_pg_max_concurrent_snap_trims=1
osd_snap_trim_sleep=0
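
For anyone who wants to flip these at runtime, this is roughly what I
inject (standard ceph tell syntax; adjust the osd target to your
cluster):

ceph tell osd.* injectargs '--osd_snap_trim_sleep 0'
ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 1'

You still get the "unchangeable" warning back, but as described above
the change takes effect anyway.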

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-02-05 Thread David Turner
Yes, we will be able to test the modifications to master. We should be able
to wait for Luminous as our next production upgrade if it means getting
this patch.
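
(For anyone else who wants to do the same, roughly how we plan to get a
test build from master; treat this as a sketch per the upstream README
of the day rather than exact instructions:)

git clone https://github.com/ceph/ceph.git
cd ceph
./install-deps.sh            # pull build dependencies for this distro
./do_cmake.sh && cd build
make -j$(nproc) ceph-osd     # build just the OSD daemon for testing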

On Fri, Feb 3, 2017 at 12:09 PM Samuel Just  wrote:

> Ok, I'm still working on a branch for master that will introduce a limiter
> on how many PGs can be trimming per OSD at once.  It should backport
> trivially to Kraken, but Jewel will require more work once we've got it
> into master.  Would you be willing to test the master version to determine
> whether it's adequate?
> -Sam
>
> On Fri, Feb 3, 2017 at 10:54 AM, David Turner <
> david.tur...@storagecraft.com> wrote:
>
> We found where it is in 10.2.5.  It is implemented in the OSD.h file in
> Jewel, but it is implemented in OSD.cc in Master.  We assumed it would be
> in the same place.
>
> We delete over 100TB of snapshots spread across thousands of snapshots
> every day.  We haven't yet found any combination of settings that allows us
> to delete snapshots in Jewel without blocking requests in a test cluster
> with a fraction of that workload.  We went as far as setting
> osd_snap_trim_cost to 512MB with the default osd_snap_trim_priority (before
> we noticed the priority setting), and setting osd_snap_trim_cost to 4MB (the
> size of our objects) with osd_snap_trim_priority set to 1.  We stopped
> testing there because we thought these weren't implemented in
> Jewel.  We will continue our testing and provide an update when we have it.
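>
> For reference, the kind of thing we had in ceph.conf for those runs (a
> sketch of our test matrix, values as described above; the two runs used
> these separately, not together):
>
> [osd]
> # run 1: huge cost, default priority
> osd_snap_trim_cost = 536870912      # 512MB in bytes
> # run 2: cost = our 4MB object size, minimum priority
> # osd_snap_trim_cost = 4194304
> # osd_snap_trim_priority = 1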
>
> Our current solution in Hammer involves a daemon monitoring the cluster
> load and setting the osd_snap_trim_sleep accordingly between 0 and 0.35,
> which does a good job of preventing IO blocking and helps us clear out
> the snap_trim_q each day.  If these settings are not injectable in Jewel,
> that would rule out using variable settings throughout the day.
>
> --
>
>  David Turner | Cloud Operations Engineer | StorageCraft
> Technology Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
>
> --
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
>
> --
>
> --
> *From:* Samuel Just [sj...@redhat.com]
> *Sent:* Friday, February 03, 2017 11:24 AM
> *To:* David Turner
> *Cc:* Nick Fisk; ceph-users
>
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
> They do seem to exist in Jewel.
> -Sam
>
> On Fri, Feb 3, 2017 at 10:12 AM, David Turner <
> david.tur...@storagecraft.com> wrote:
>
> After searching the code, osd_snap_trim_cost and osd_snap_trim_priority
> exist in Master but not in Jewel or Kraken.  If osd_snap_trim_sleep was
> made useless in Jewel by moving snap trimming to the main op thread, and no
> new feature was added to Jewel to allow clusters to throttle snap
> trimming... what recourse do people who use a lot of snapshots have on
> Jewel?  Luckily this thread came around right before we were ready to push
> to production; we tested snap trimming heavily in QA and found that we
> can't even handle half of the snap trimming on Jewel that we would need
> to.  None of these settings are injectable into the osd daemon either, so
> it would take a full restart of all of the OSDs to change their
> settings...
>
> Does anyone have any success stories for snap trimming on Jewel?
>
> --
>
>  David Turner | Cloud Operations Engineer | StorageCraft
> Technology Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
>
> --
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
>
> --
>
> --
> *From:* Samuel Just [sj...@redhat.com]
> *Sent:* Thursday, January 26, 2017 1:14 PM
> *To:* Nick Fisk
> *Cc:* David Turner; ceph-users
>
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
> Just an update.  I think the real goal with the sleep configs in general
> was to reduce the number of concurrent snap trims happening.  To that end,
> I've put together a branch which adds an AsyncReserver (as with backfill)
> for snap trims to each OSD.  Before actually starting to do trim work, the
> primary will wait in line to get one of the slots and will hold that slot
> until the repops are complete.
> https://

Re: [ceph-users] Why is bandwidth not fully saturated?

2017-02-05 Thread Christian Balzer

Hello,

On Sun, 5 Feb 2017 02:03:57 +0100 Marc Roos wrote:

>  
> 
> I have a 3 node test cluster with one osd per node, and I write a file 
> to a pool with size 1. Why doesn’t ceph just use the full 110MB/s of 
> the network (as with the default rados bench test)? Does ceph 'reserve' 
> bandwidth for other concurrent connections? Can this be tuned?
> 

Firstly, benchmarking from within the cluster will always give you skewed
results, typically better than what a client would see since some actions
will be local.

Secondly, you're not telling us much about the cluster, but I'll assume
these are plain drives with in-line journals.
The same goes for your rados bench run: I'll assume defaults there.

What you're seeing here is with near certainty the difference between a
single client (your "put") and multiple ones (rados bench does 16 threads
by default).

So rados bench gets to distribute all writes amongst all OSDs and is able
to saturate things, while your put has to wait for a single OSD plus the
latency of the network.
A single SATA HDD typically can do about 150MB/s of writes, but only
sequential ones, which RBD isn't. The journal takes half of that, the FS
journals and overhead take some more, so about 40MB/s of effective
performance doesn't surprise me at all.

Run this test with atop or iostat active on all machines to confirm.
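
For example, something along these lines (using your pool name; -t sets
the number of concurrent operations, so -t 1 approximates your single
put):

rados bench -p data1 60 write -t 16   # the default: 16 ops in flight
rados bench -p data1 60 write -t 1    # one op in flight, like your put
iostat -x 1                           # on each OSD node, watch util/await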


As for your LACP question, it is what it is.
With a sufficient number of real clients and OSDs things will get
distributed eventually, but a single client will never get more than one
link's worth.
Typically people find that while bandwidth certainly is desirable (up to a
point), it is the lower latency of faster links like 10/25/40Gb/s Ethernet
or Infiniband that makes their clients (VMs) happier.

Christian

> Putting from ram drive on first node
> time rados -p data1 put test3.img test3.img
> 
> --net/eth0-- --net/eth1-- --dsk/sda-- --dsk/sdb-- --dsk/sdc--
>  recv  send: recv  send| read  writ: read  writ: read  writ
> 2440B 4381B:1973B0 |   0 0 :   0 0 :   0 0
> 1036B 3055B:1958B  124B|   0 0 :   0 0 :   0 0
> 1382B 3277B:1316B0 |   0 0 :   0 0 :   0 0
> 1227B 2850B:1243B0 |   0 0 :   0 0 :   0 0
> 1216B  120k:2300B0 |   0 0 :   0 0 :   0 0
> 1714B 8257k:  15k0 |   0  4096B:   0 0 :   0 0
> 1006B   24M:  40k0 |   014k:   0 0 :   0 0
> 1589B   36M:  58k0 |   032k:   0 0 :   0 0
>  856B   36M:  57k0 |   0 0 :   0 0 :   0 0
>  856B   40M:  64k0 |   0 0 :   0 0 :   0 0
> 2031B   36M:  58k0 |   0 0 :   0 0 :   0 0
>  865B   36M:  58k0 |   024k:   0 0 :   0 0
> 1713B   39M:  61k0 |   037k:   0 0 :   0 0
>  997B   38M:  59k0 |   0 0 :   0 0 :   0 0
>   66B   36M:  58k0 |   0 0 :   0 0 :   0 0
> 1782B   36M:  57k0 |   0 0 :   0 0 :   0 0
>  931B   36M:  58k0 |   0  8192B:   0 0 :   0 0
>  931B   36M:  57k0 |   045k:   0 0 :   0 0
>  724B   36M:  57k0 |   0 0 :   0 0 :   0 0
>  922B   28M:  47k0 |   0 0 :   0 0 :   0 0
> 2506B 4117B:2261B0 |   0 0 :   0 0 :   0 0
>  865B 7630B:2631B0 |   015k:   0 0 :   0 0
> 
> 
> Goes to 3rd node
> 
> --net/eth0-- --net/eth1-- --dsk/sda-- --dsk/sdb-- --dsk/sdc-- --dsk/sdd--
>  recv  send: recv  send| read  writ: read  writ: read  writ: read  writ
>   66B 1568B: 733B0 |   0 0 :   0 0 :   0 0 :   0 0
> 2723B 4979B:1469B0 |   0 0 :   0 0 :   0 0 :   0 0
>   66B 1761B:1347B0 |   0 0 :   0 0 :   0 0 :   0 0
>   66B 2480B: 119k0 |   0 0 :   0 0 :   0 0 :   0 0
>  103k   22k:  12M0 |   0  4096B:   0 0 :   012M:   0 0
>  784B   41k:  24M0 |   017k:   0 0 :   024M:   0 0
> 1266B   63k:  38M0 |   040k:   0 0 :   037M:   0 0
>   66B   60k:  39M0 |   0 0 :   0 0 :   041M:   0 0
>   66B   60k:  37M0 |   0 0 :   0 0 :   038M:   0 0
>  104k   62k:  38M0 |   0 0 :   0 0 :   038M:   0 0
>   66B   61k:  38M0 |   015k:   0 0 :   042M:   0 0
> 1209B   59k:  36M0 |   044k:   0 0 :   039M:   0 0
>   66B   60k:  38M0 |   087k:   0 0 :   039M:   0 0
>  980B   62k:  38M0 |   0 0 :   0 0 :   041M:   0 0
>  103k   52k:  32M0 |   0  4096B:   0 0 :  60k   42M:   0 0
>   66B   61k:  38M0 |   0  8192B:   0 0 :   040M:   0 0
>  476B   58k:  36M0 |   045k:   0 0 :   041M:   0 0
> 1514B   55k:  34M0 |   0 0 :   0 0 :8192B   41M:   0 0
>  856B   42k:  24M0 |   0 0 :   0 0 :   028M:   0 0
>  103k 3010B:1681B0 |   0 0 :   0 0 :   0 0 :   0 0
>  126B 3363B:4187B0 |   015k:   0 

Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-02-05 Thread David Turner
I accidentally responded from the wrong email address. We will be able to build 
off of master and test that.

Sent from my iPhone

On Feb 5, 2017, at 3:27 PM, Peter Maloney
<peter.malo...@brockmann-consult.de> wrote:

On 02/03/17 19:54, David Turner wrote:
Our current solution in Hammer involves a daemon monitoring the cluster load 
and setting the osd_snap_trim_sleep accordingly between 0 and 0.35, which does 
a good job of preventing IO blocking and helps us clear out the snap_trim_q 
each day.  If these settings are not injectable in Jewel, that would rule out 
using variable settings throughout the day.
Are you sure they're not injectable? [Almost?] everything says "unchangeable", 
but it takes effect anyway. I have tested "snap_trim_sleep", and as Nick has 
pointed out, it seems to cause rather than prevent blocks (I found the sweet 
spot is 0, though he seems to think it's higher). I can reproduce that very 
reliably using injectargs (+5 to the sleep means +5s of block length), so 
unless luck can strike in such extreme ways, or somehow only part of the 
effect changed, they are changeable.

I'm using Jewel. So I am now using:

osd_pg_max_concurrent_snap_trims=1
osd_snap_trim_sleep=0




David Turner | Cloud Operations Engineer | StorageCraft Technology 
Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com