Any thoughts?

On Tue, Mar 7, 2017 at 3:17 PM, Alejandro Comisario <alejan...@nubeliu.com> wrote:
> Gregory, thanks for the response. What you've said is by far the most
> enlightening thing I've heard about Ceph in a long time.
>
> What raises even greater doubt is that this "non-functional" pool was
> only 1.5 GB large, vs. 50-150 GB for the other affected pools. The tiny
> pool was still being used, and just because that pool was blocking
> requests, the whole cluster was unresponsive.
>
> So, what do you mean by a "non-functional" pool? How can a pool become
> non-functional? And what assures me that tomorrow (just because I
> deleted the 1.5 GB pool to fix the whole problem) another pool won't
> become non-functional?
>
> A Ceph bug?
> Another bug?
> Something that can be avoided?
>
> On Tue, Mar 7, 2017 at 2:11 PM, Gregory Farnum <gfar...@redhat.com> wrote:
>
>> Some facts:
>> The OSDs use a lot of gossip protocols to distribute information.
>> The OSDs limit how many client messages they let into the system at a
>> time.
>> The OSDs do not distinguish between client ops for different pools (the
>> blocking happens before they have any idea what the target is).
>>
>> So, yes: if you have a non-functional pool and clients keep trying to
>> access it, those requests can fill up the OSD memory queues and block
>> access to other pools as it cascades across the system.
>>
>> On Sun, Mar 5, 2017 at 6:22 PM Alejandro Comisario <alejan...@nubeliu.com> wrote:
>>
>>> Hi, we have a 7-node Ubuntu Ceph Hammer cluster (78 OSDs to be exact).
>>> This weekend we've experienced a huge outage for our customers' VMs
>>> (located on pool CUSTOMERS, replica size 3) when lots of OSDs started
>>> to slow request/block PGs on pool PRIVATE (replica size 1). Basically
>>> all blocked PGs had just one OSD in the acting set, but all customers
>>> on the other pool got their VMs almost frozen.
>>>
>>> While trying basic troubleshooting, like setting noout and then
>>> bringing down the OSD that slowed/blocked the most, immediately
>>> another OSD slowed/locked IOPS on PGs from the same PRIVATE pool, so
>>> we rolled back that change and started to move data around with the
>>> same logic (reweighting down those OSDs), with exactly the same result.
>>>
>>> So we made a decision: we deleted the pool whose PGs were always the
>>> ones slowed/locked, regardless of the OSD.
>>>
>>> Not even 10 seconds passed after the pool deletion before there were
>>> not only no more degraded PGs, but ALL slow IOPS disappeared for good,
>>> and performance of hundreds of VMs came back to normal immediately.
>>>
>>> I must say I was kind of scared to see that happen, basically because
>>> only ONE pool's PGs were ever slowed, yet the performance hit landed
>>> on the other pool. So... aren't the PGs that exist in one pool
>>> separate from those of the other pool?
>>> If that is true, why did OSDs blocking IOPS on one pool's PGs slow
>>> down all the other PGs from other pools?
>>>
>>> Again, I just deleted a pool that had almost no traffic, because its
>>> PGs were locked and were affecting PGs on another pool, and as soon as
>>> that happened, the whole cluster came back to normal (and of course,
>>> HEALTH_OK and no slow transactions whatsoever).
>>>
>>> Please, someone help me understand the gap where I'm missing
>>> something, since as far as my Ceph knowledge goes, this makes no
>>> sense.
>>>
>>> PS: I found someone who looks like they went through the same thing here:
>>> https://forum.proxmox.com/threads/ceph-osd-failure-causing-proxmox-node-to-crash.20781/
>>> but I still don't understand what happened.
>>>
>>> Hoping to get help from the community.
>>>
>>> --
>>> Alejandrito.

--
Alejandro Comisario
CTO | NUBELIU
E-mail: alejandro@nubeliu.com
Cell: +54 9 11 3770 1857
www.nubeliu.com
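
For anyone who hits the same situation: the limit Gregory mentions ("the OSDs limit how many client messages they let into the system at a time") is, as far as I understand it, the per-OSD client message throttle. A minimal ceph.conf sketch of the settings involved (the values below are illustrative, not recommendations; check what your OSDs are actually running with `ceph daemon osd.N config show`):

    [osd]
        # Cap on the number of in-flight client messages an OSD will accept
        # before it stops reading from client connections. Ops aimed at a
        # hung pool count against this cap just like any other op, which is
        # how one bad pool can starve the rest.
        osd client message cap = 100

        # Cap on the total size, in bytes, of in-flight client messages.
        osd client message size cap = 524288000

Raising these only changes how much queued work an OSD will hold before it stops accepting new client messages; it does not make a non-functional pool functional again.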
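Before deleting a pool as a last resort, it is also worth confirming which pool the blocked requests actually belong to. A rough sketch of the checks (osd.12 is just an example id; the `ceph daemon` commands must be run on the host where that OSD lives):

    # Which OSDs are currently reporting slow/blocked requests?
    ceph health detail

    # Dump the ops a suspect OSD is sitting on. Each op description
    # includes a PG id such as "17.2b"; the number before the dot is the
    # pool id.
    ceph daemon osd.12 dump_ops_in_flight
    ceph daemon osd.12 dump_historic_ops

    # Map pool ids back to names and look at per-pool client I/O.
    ceph osd lspools
    ceph osd pool stats

That at least shows whether the stuck ops really are confined to one pool before reaching for `ceph osd pool delete`.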