[ceph-users] Re: mgr hangs with upmap balancer

2019-11-22 Thread Eugen Block
Hi, we have also been facing some problems with the MGR: we had to switch off the balancer and pg_autoscaler because the active MGR would end up using a whole CPU, resulting in a hanging dashboard and hanging ceph commands. There are several similar threads on the ML, e.g. [1] and [2]. I'm not aware of a s
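
For anyone who wants to script the same workaround, here is a minimal sketch (not from the thread; assumes a Nautilus-era cluster where pg_autoscaler is an ordinary mgr module, the ceph CLI is on PATH, and the keyring has mgr/mon admin caps):

    # Turn off the upmap balancer and disable the pg_autoscaler mgr module,
    # the two modules Eugen mentions switching off, then check balancer status.
    import subprocess

    def ceph(*args):
        # Run a ceph CLI command and return its stdout, raising on failure.
        return subprocess.run(["ceph", *args], check=True,
                              capture_output=True, text=True).stdout

    ceph("balancer", "off")                            # stop the balancer
    ceph("mgr", "module", "disable", "pg_autoscaler")  # Nautilus: autoscaler is a regular module
    print(ceph("balancer", "status"))                  # should now report "active": false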

[ceph-users] Re: mgr hangs with upmap balancer

2019-11-22 Thread Bryan Stillwell
Thanks Eugen. I created this bug report to track the issue if you want to watch it: https://tracker.ceph.com/issues/42971 Bryan > On Nov 22, 2019, at 6:34 AM, Eugen Block wrote: Hi, we have also been facing some problems with

[ceph-users] EC PGs stuck activating, 2^31-1 as OSD ID, automatic recovery not kicking in

2019-11-22 Thread Zoltan Arnold Nagy
Hi, We have a cluster where we mix HDDs and NVMe drives using device classes, with a specific crush rule for each class. One of our NVMe drives physically died, which caused some of our PGs to go into this state: pg 26.ac is stuck undersized for 60830.991784, current state activating+undersi
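
As a quick way to list PGs in that state, a small sketch (not from the thread; assumes the ceph CLI on PATH with read access, and handles the JSON layout defensively since it differs slightly between releases):

    # Print PGs stuck undersized together with their state and up/acting sets.
    import json, subprocess

    raw = subprocess.run(["ceph", "pg", "dump_stuck", "undersized", "-f", "json"],
                         check=True, capture_output=True, text=True).stdout
    data = json.loads(raw)
    # Some releases wrap the entries in "stuck_pg_stats", others return a bare list.
    stats = data.get("stuck_pg_stats", []) if isinstance(data, dict) else data
    for pg in stats:
        print(pg["pgid"], pg["state"], "up:", pg["up"], "acting:", pg["acting"])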

[ceph-users] Re: EC PGs stuck activating, 2^31-1 as OSD ID, automatic recovery not kicking in

2019-11-22 Thread Paul Emmerich
On Fri, Nov 22, 2019 at 9:33 PM Zoltan Arnold Nagy wrote: > The 2^31-1 in there seems to indicate an overflow somewhere - the way we were able to figure out where exactly is to query the PG and compare the "up" and "acting" sets - only _one_ of them had the 2^31-1 number in place of the c
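
For reference, a minimal sketch of that check (not from the thread; assumes the ceph CLI on PATH and uses pg 26.ac from the original report): query the PG and look for 2147483647 (= 2^31-1), the placeholder Ceph prints where CRUSH could not map an OSD to a slot.

    # Compare a PG's "up" and "acting" sets and flag the 2^31-1 placeholder.
    import json, subprocess

    NO_OSD = 2**31 - 1  # 2147483647

    q = json.loads(subprocess.run(["ceph", "pg", "26.ac", "query", "-f", "json"],
                                  check=True, capture_output=True, text=True).stdout)
    print("up:    ", q["up"])
    print("acting:", q["acting"])
    for name in ("up", "acting"):
        if NO_OSD in q[name]:
            print(name, "contains 2^31-1 -> no OSD mapped for that shard")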

[ceph-users] Re: EC PGs stuck activating, 2^31-1 as OSD ID, automatic recovery not kicking in

2019-11-22 Thread Zoltan Arnold Nagy
On 2019-11-22 21:45, Paul Emmerich wrote: On Fri, Nov 22, 2019 at 9:33 PM Zoltan Arnold Nagy wrote: The 2^31-1 in there seems to indicate an overflow somewhere - the way we were able to figure out where exactly is to query the PG and compare the "up" and "acting" sets - only _one_ of them had