Re: [ceph-users] help troubleshooting some osd communication problems

2016-04-29 Thread Mike Lovell
On Fri, Apr 29, 2016 at 9:34 AM, Mike Lovell wrote:
> On Fri, Apr 29, 2016 at 5:54 AM, Alexey Sheplyakov <asheplya...@mirantis.com> wrote:
>> Hi,
>>
>> > i also wonder if just taking 148 out of the cluster (probably just marking it out) would help
>>
>> As far as I understand this can onl

Re: [ceph-users] help troubleshooting some osd communication problems

2016-04-29 Thread Mike Lovell
On Fri, Apr 29, 2016 at 5:54 AM, Alexey Sheplyakov wrote:
> Hi,
>
> > i also wonder if just taking 148 out of the cluster (probably just marking it out) would help
>
> As far as I understand this can only harm your data. The acting set of PG 17.73 is [41, 148], so after stopping/taking out

Re: [ceph-users] help troubleshooting some osd communication problems

2016-04-29 Thread Alexey Sheplyakov
Hi,
> i also wonder if just taking 148 out of the cluster (probably just marking it out) would help
As far as I understand this can only harm your data. The acting set of PG 17.73 is [41, 148], so after stopping/taking out OSD 148, OSD 41 will store the only copy of objects in PG 17.73 (so it wo
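The replica-count argument in the reply above can be sketched as a toy calculation. This is only an illustration of the reasoning, not how Ceph itself computes placement: the function name `surviving_replicas` is hypothetical, and the sketch ignores the remapping and backfill that a real cluster would start once an OSD is marked out.

```python
def surviving_replicas(acting_set, removed_osds):
    """Return the OSD ids from acting_set that remain once removed_osds are gone.

    Toy model only: it does not account for CRUSH remapping or backfill.
    """
    removed = set(removed_osds)
    return [osd for osd in acting_set if osd not in removed]


# Acting set of PG 17.73 as quoted in the thread.
acting = [41, 148]

# Taking OSD 148 out leaves a single OSD holding the PG's objects.
remaining = surviving_replicas(acting, [148])
print("remaining OSDs:", remaining)
print("copies left:", len(remaining))
```

With only one entry left in the list, any further failure of that OSD before recovery completes would leave no copy, which is the risk the reply points out.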

Re: [ceph-users] help troubleshooting some osd communication problems

2016-04-29 Thread Mike Lovell
i attempted to grab some logs from the two osds in question with debug_ms and debug_osd at 20. i have looked through them a little bit but digging through the logs at this verbosity is something i don't have much experience with. hopefully someone on the list can help make sense of it. the logs ar

Re: [ceph-users] help troubleshooting some osd communication problems

2016-04-28 Thread Samuel Just
I'd guess that to make any progress we'll need debug ms = 20 on both sides of the connection when a message is lost.
-Sam
On Thu, Apr 28, 2016 at 2:38 PM, Mike Lovell wrote:
> there was a problem on one of the clusters i manage a couple weeks ago where pairs of OSDs would wait indefinitely on s

[ceph-users] help troubleshooting some osd communication problems

2016-04-28 Thread Mike Lovell
there was a problem on one of the clusters i manage a couple weeks ago where pairs of OSDs would wait indefinitely on subops from the other OSD in the pair. we used a liberal dose of "ceph osd down ##" on the osds and eventually things just sorted themselves out a couple days later. it seems to have com