On Fri, Apr 29, 2016 at 9:34 AM, Mike Lovell wrote:
> On Fri, Apr 29, 2016 at 5:54 AM, Alexey Sheplyakov <
> asheplya...@mirantis.com> wrote:
>
>> Hi,
>>
>> > i also wonder if just taking 148 out of the cluster (probably just
>> > marking it out) would help
>>
>> As far as I understand this can only harm your data.
Hi,

> i also wonder if just taking 148 out of the cluster (probably just
> marking it out) would help

As far as I understand this can only harm your data. The acting set of PG
17.73 is [41, 148], so after stopping/taking out OSD 148, OSD 41 will store
the only copy of objects in PG 17.73.
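For anyone following along, the acting set and replication settings can be checked before taking any OSD out. This is a sketch, not from the thread itself; the pool name "rbd" below is a placeholder:

```shell
# Show which OSDs currently serve PG 17.73 (up set and acting set).
ceph pg map 17.73

# Check how many replicas the pool keeps, and how many must be
# available for I/O to proceed. Pool name "rbd" is a placeholder.
ceph osd pool get rbd size
ceph osd pool get rbd min_size
```

If size is 2 and one OSD of the acting set is removed, the remaining OSD holds the only copy of those objects, which is exactly the risk described above.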
i attempted to grab some logs from the two osds in question with debug_ms
and debug_osd at 20. i have looked through them a little bit but digging
through the logs at this verbosity is something i don't have much
experience with. hopefully someone on the list can help make sense of it.
the logs ar
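One way to capture that level of logging on both sides without restarting the daemons is `ceph tell ... injectargs`; the OSD ids 41 and 148 are the pair from this thread, and the "revert" values below are assumed typical defaults, not settings confirmed in the thread:

```shell
# Raise messenger and OSD debug logging at runtime on both OSDs of the
# stuck pair. Level 20 logs grow very quickly, so revert when done.
ceph tell osd.41 injectargs '--debug_ms 20 --debug_osd 20'
ceph tell osd.148 injectargs '--debug_ms 20 --debug_osd 20'

# Revert to (assumed) defaults afterwards.
ceph tell osd.41 injectargs '--debug_ms 0 --debug_osd 1'
ceph tell osd.148 injectargs '--debug_ms 0 --debug_osd 1'
```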
I'd guess that to make any progress we'll need debug ms = 20 on both
sides of the connection when a message is lost.
-Sam
On Thu, Apr 28, 2016 at 2:38 PM, Mike Lovell wrote:
> there was a problem on one of the clusters i manage a couple weeks ago where
> pairs of OSDs would wait indefinitely on subops from the other OSD in the pair.
there was a problem on one of the clusters i manage a couple weeks ago
where pairs of OSDs would wait indefinitely on subops from the other OSD in
the pair. we used a liberal dose of "ceph osd down ##" on the osds and
eventually things just sorted themselves out a couple days later.
it seems to have come back.
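The "ceph osd down ##" workaround above marks an OSD down in the OSD map without touching its data, which forces its placement groups to re-peer when the daemon reports back in. A sketch, using osd.148 from this thread as the example:

```shell
# Mark the OSD down in the OSD map. The daemon keeps running, notices,
# reports back to the monitors, and comes back up; this forces its
# placement groups to restart peering. No data is removed.
ceph osd down 148

# This is different from marking the OSD out, which triggers rebalancing
# and would leave PG 17.73 with a single copy on osd.41:
# ceph osd out 148   # <- the risky operation warned about earlier
```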