Any thoughts?

On Tue, Mar 7, 2017 at 3:17 PM, Alejandro Comisario <alejan...@nubeliu.com> wrote:
> Gregory, thanks for the response. What you've said is by far the most
> enlightening thing I've heard about Ceph in a long time.
>
> What raises even greater doubt is that this "non-functional" pool was
> only 1.5 GB large, vs. 50-150 GB for the other affected pools. The tiny
> pool was still being used, and just because that pool was blocking
> requests, the whole cluster was unresponsive.
>
> So, what do you mean by a "non-functional" pool? How can a pool become
> non-functional? And what assures me that tomorrow (just because I
> deleted the 1.5 GB pool to fix the whole problem) another pool won't
> become non-functional?
>
> A Ceph bug?
> Another bug?
> Something that can be avoided?
>
> On Tue, Mar 7, 2017 at 2:11 PM, Gregory Farnum <gfar...@redhat.com> wrote:
>
>> Some facts:
>> The OSDs use a lot of gossip protocols to distribute information.
>> The OSDs limit how many client messages they let into the system at a
>> time.
>> The OSDs do not distinguish between client ops for different pools (the
>> blocking happens before they have any idea what the target is).
>>
>> So, yes: if you have a non-functional pool and clients keep trying to
>> access it, those requests can fill up the OSD memory queues and block
>> access to other pools as it cascades across the system.
>>
>> On Sun, Mar 5, 2017 at 6:22 PM Alejandro Comisario <alejan...@nubeliu.com> wrote:
>>
>>> Hi, we have a 7-node Ubuntu Ceph Hammer cluster (78 OSDs to be exact).
>>> This weekend we've experienced a huge outage for our customers' VMs
>>> (located on pool CUSTOMERS, replica size 3) when lots of OSDs started
>>> to slow request/block PGs on pool PRIVATE (replica size 1). Basically
>>> all blocked PGs had just one OSD in the acting set, but all customers
>>> on the other pool got their VMs almost frozen.
>>>
>>> While trying basic troubleshooting, like setting noout and then
>>> bringing down the OSD that slowed/blocked the most, immediately
>>> another OSD slowed/locked IOPS on PGs from the same PRIVATE pool, so
>>> we rolled back that change and started to move data around with the
>>> same logic (reweighting down those OSDs), with exactly the same result.
>>>
>>> So we made a decision: we deleted the pool whose PGs were always the
>>> ones slowed/locked, regardless of the OSD.
>>>
>>> Not even 10 seconds passed after the pool deletion before there were
>>> not only no more degraded PGs, but ALL slow IOPS disappeared for good,
>>> and performance of hundreds of VMs came back to normal immediately.
>>>
>>> I must say I was kind of scared to see that happen, basically because
>>> only ONE pool's PGs were ever slowed, yet the performance hit landed
>>> on the other pool. So... aren't the PGs that exist in one pool
>>> separate from those of the other pool?
>>> If that is true, why did OSDs blocking IOPS on one pool's PGs slow
>>> down all the other PGs from other pools?
>>>
>>> Again, I just deleted a pool that had almost no traffic, because its
>>> PGs were locked and were affecting PGs on another pool, and as soon as
>>> that happened, the whole cluster came back to normal (and of course,
>>> HEALTH_OK and no slow transactions whatsoever).
>>>
>>> Please, someone help me understand the gap where I'm missing
>>> something, since as far as my Ceph knowledge goes, this makes no
>>> sense.
>>>
>>> PS: I found someone who looks like they went through the same thing here:
>>> https://forum.proxmox.com/threads/ceph-osd-failure-causing-proxmox-node-to-crash.20781/
>>> but I still don't understand what happened.
>>>
>>> Hoping to get help from the community.
>>>
>>> --
>>> Alejandrito.

--
Alejandro Comisario
CTO | NUBELIU
E-mail: alejandro@nubeliu.com
Cell: +54 9 11 3770 1857
www.nubeliu.com
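
For anyone who hits the same situation: the limit Gregory mentions ("the OSDs limit how many client messages they let into the system at a time") is, as far as I understand it, the per-OSD client message throttle. A minimal ceph.conf sketch of the settings involved (the values below are illustrative, not recommendations; check what your OSDs are actually running with `ceph daemon osd.N config show`):

    [osd]
        # Cap on the number of in-flight client messages an OSD will accept
        # before it stops reading from client connections. Ops aimed at a
        # hung pool count against this cap just like any other op, which is
        # how one bad pool can starve the rest.
        osd client message cap = 100

        # Cap on the total size, in bytes, of in-flight client messages.
        osd client message size cap = 524288000

Raising these only changes how much queued work an OSD will hold before it stops accepting new client messages; it does not make a non-functional pool functional again.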
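Before deleting a pool as a last resort, it is also worth confirming which pool the blocked requests actually belong to. A rough sketch of the checks (osd.12 is just an example id; the `ceph daemon` commands must be run on the host where that OSD lives):

    # Which OSDs are currently reporting slow/blocked requests?
    ceph health detail

    # Dump the ops a suspect OSD is sitting on. Each op description
    # includes a PG id such as "17.2b"; the number before the dot is the
    # pool id.
    ceph daemon osd.12 dump_ops_in_flight
    ceph daemon osd.12 dump_historic_ops

    # Map pool ids back to names and look at per-pool client I/O.
    ceph osd lspools
    ceph osd pool stats

That at least shows whether the stuck ops really are confined to one pool before reaching for `ceph osd pool delete`.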