We have just unset nodown and noout, and you were right: they were masking an
issue, as we had OSDs down. With the flags unset, the recovery is going much
better and we are now able to do writes to the cluster again.
Things are slowly coming back online.
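(For anyone following the thread later: a rough sketch of how recovery progress can be watched from here; these are standard commands, not necessarily the exact ones used in this case.)

    # Refresh the cluster summary every few seconds and watch the recovery counters drop
    watch -n 5 ceph -s
    # Or stream cluster events, including recovery/backfill progress, to the terminal
    ceph -w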
Thank you all for your help, it is much appreciated.
You definitely have my sympathies; we encountered a similar situation a
couple of years ago and it was a very hairy ordeal indeed. We found most
of the suggestions in this mailing list post to be extremely beneficial in
coaxing our cluster back to life:
https://www.mail-archive.com/ceph-users@
A couple of things here: you have nodown and noout set, which is understandable
based on what you were doing, but now it's probably time to let Ceph do its
thing, since you believe all of the OSDs are back in service and should stay up
and in. You may be masking a problem by having these set. D
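For reference, checking and clearing the flags is just the standard ceph CLI, roughly:

    # Show the cluster-wide flags currently set (look for nodown/noout in the flags line)
    ceph osd dump | grep flags
    # Clear them so OSDs can be marked down/out again and recovery can proceed
    ceph osd unset nodown
    ceph osd unset noout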
I would start troubleshooting by restarting those OSDs that don't respond
to ceph pg query.
Check the log files of those OSDs when you restart them.
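Per OSD that would look roughly like the following (osd id 12 is just an example, and since this is Jewel on Ubuntu 14.04 the daemons are most likely managed by upstart; adjust the restart command to your init system):

    # Restart a single OSD daemon (upstart syntax; use systemctl on systemd-based systems)
    sudo restart ceph-osd id=12
    # Follow its log while it comes back up and re-peers
    tail -f /var/log/ceph/ceph-osd.12.log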
Saverio
2017-06-02 14:42 GMT+02:00 Grant Morley :
> We were just applying a security patch that had been released.
>
> After that, on one of the nodes, we ran a "service ceph-all restart" and the
> cluster went into error status.
We were just applying a security patch that had been released.
After that, on one of the nodes, we ran a "service ceph-all restart" and the
cluster went into error status.
What we have been doing to query the PGs so far is: we get a list of the PGs
from the output of ceph health detail, or we have used:
With ceph health detail, get a list of problematic PGs.
With ceph pg query, check why the PGs are stuck.
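As a concrete sketch (the pg id below is made up, substitute one from your own health detail output):

    # List the problematic PGs and their current states
    ceph health detail | grep '^pg'
    # Query one stuck PG to see which OSDs it is peering with / waiting on
    ceph pg 3.1f query | less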
Check the log files of all OSDs on the node where the restart triggered the
problem.
Saverio
2017-06-02 14:16 GMT+02:00 Grant Morley :
> We are using Ceph Jewel (10.2.7) running on Ubuntu 14.04LTS
Additionally, it could be useful to know what you did during the
maintenance.
Cheers,
Tyanko
On 2 June 2017 at 14:08, Saverio Proto wrote:
> To give you some help you need to tell us the ceph version you are
> using and from ceph.conf in the section [osd] what values you have for
> the following?
We are using Ceph Jewel (10.2.7) running on Ubuntu 14.04LTS
"osd_recovery_max_active": "1"
"osd_max_backfills": "1"
"osd_recovery_op_priority": "3"
Limit            Soft Limit   Hard Limit   Units
Max cpu time     unlimited    unlimited    seconds
Max file size
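In case it helps, output like the above can be gathered roughly as follows (osd.0 is just an example id; the admin socket command has to run on the node hosting that OSD):

    # Ask a running OSD for its effective config values via the admin socket
    sudo ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'
    # Show the resource limits of a running ceph-osd process
    cat /proc/$(pidof ceph-osd | awk '{print $1}')/limits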
To give you some help you need to tell us the ceph version you are
using and from ceph.conf in the section [osd] what values you have for
the following?
[osd]
osd max backfills
osd recovery max active
osd recovery op priority
These three settings can influence the recovery speed.
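They can also be raised temporarily at runtime to speed recovery up, roughly like this (the values are only examples, not a recommendation for this cluster, and should be reverted once recovery completes):

    # Inject higher backfill/recovery limits into all running OSDs without a restart
    ceph tell osd.* injectargs '--osd-max-backfills 2 --osd-recovery-max-active 4'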
Also, do you h
Having 9 ceph-mon servers doesn't help...
I would look at the stuck PGs in order to find the common OSDs and focus on
them.
Their logs will probably have details on where the problem is.
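A rough one-liner for that, assuming the health detail output contains "last acting [...]" lines as it does on Jewel; it simply counts how often each OSD id appears across the problematic PGs:

    # Count which OSDs show up most often in the acting sets of the stuck PGs
    ceph health detail | grep -o 'acting \[[^]]*\]' | grep -o '[0-9]\+' | sort -n | uniq -c | sort -rn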
> On Jun 2, 2017, at 07:51, Grant Morley wrote:
>
> Hi All,
>
> I wonder if anyone could help at all.
>
HEALTH_ERR 210 pgs are stuck inactive for more than 300 seconds; 296 pgs
backfill_wait; 3 pgs backfilling; 1 pgs degraded; 202 pgs peering; 1 pgs
recovery_wait; 1 pgs stuck degraded; 210 pgs stuck inactive; 510 pgs stuck
unclean; 3308 requests are blocked > 32 sec; 41 osds have slow requests;
recov
Usually 'ceph health detail' gives better info on what is making
everything stuck.
Saverio
2017-06-02 13:51 GMT+02:00 Grant Morley :
> Hi All,
>
> I wonder if anyone could help at all.
>
> We were doing some routine maintenance on our ceph cluster and after running
> a "service ceph-all restart"
Hi All,
I wonder if anyone could help at all.
We were doing some routine maintenance on our ceph cluster and after
running a "service ceph-all restart" on one of our nodes we noticed that
something wasn't quite right. The cluster has gone into an error mode and
we have multiple stuck PGs and the