Re: [Openstack-operators] Ceph recovery going unusually slow

2017-06-02 Thread Grant Morley
We have just unset nodown and noout, and you were right, they were masking an issue as we had OSDs down. With the flags unset the recovery is going much better and we are now able to do writes to the cluster again. Things are slowly coming back online. Thank you all for your help, it is much appreciated.
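Unsetting the flags and watching recovery catch up uses the standard flag commands; a minimal sketch:

    ceph osd unset nodown
    ceph osd unset noout
    ceph -w    # follow the cluster log while recovery and backfill progress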

Re: [Openstack-operators] Ceph recovery going unusually slow

2017-06-02 Thread Nick Jones
You definitely have my sympathies; we encountered a similar situation a couple of years ago and it was a very hairy ordeal indeed. We found most of the suggestions in this mailing list post to be extremely beneficial in coaxing our cluster back into life: https://www.mail-archive.com/ceph-users@

Re: [Openstack-operators] Ceph recovery going unusually slow

2017-06-02 Thread Mike Lowe
A couple of things here: you have nodown and noout set, which is understandable based on what you were doing, but now it's probably time to let Ceph do its thing, since you believe all of the OSDs are back in service and should stay up and in. You may be masking a problem by having these set.
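Confirming that the OSDs really are up and in before clearing the flags only needs the stock status commands; a small sketch, nothing here is specific to this cluster:

    ceph osd stat                  # OSD up/in counts plus any flags (nodown,noout) still set
    ceph osd tree | grep -i down   # any OSDs the cluster still considers down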

Re: [Openstack-operators] Ceph recovery going unusually slow

2017-06-02 Thread Saverio Proto
I would start troubleshooting by restarting those OSDs that don't respond to ceph pg query. Check the log files of those OSDs when you restart them. Saverio 2017-06-02 14:42 GMT+02:00 Grant Morley : > We were just applying a security patch that got released. > > After that on one of the nodes we ran a service ceph-all restart
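Restarting a single OSD and following its log might look roughly like this; the OSD id 12 is a placeholder, the log path is the default one, and the restart syntax depends on the init system:

    sudo restart ceph-osd id=12             # Upstart, as on Ubuntu 14.04
    # sudo systemctl restart ceph-osd@12    # systemd equivalent
    tail -f /var/log/ceph/ceph-osd.12.log   # watch peering/recovery messages after the restart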

Re: [Openstack-operators] Ceph recovery going unusually slow

2017-06-02 Thread Grant Morley
We were just applying a security patch that got released. After that, on one of the nodes, we ran a service ceph-all restart and the cluster went into error status. What we have been doing to query the PGs so far: we get a list of the PGs from the output of ceph health detail, or we have used:
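Pulling the stuck PG ids out of ceph health detail is usually a one-liner along these lines; a sketch that assumes the usual "pg <id> is stuck ..." line format, which can vary by release:

    ceph health detail | awk '/^pg / {print $2}' | sort -u
    ceph pg dump_stuck inactive    # alternative: only the PGs currently stuck inactive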

Re: [Openstack-operators] Ceph recovery going unusually slow

2017-06-02 Thread Saverio Proto
With ceph health detail, get a list of problematic PGs. With ceph pg query, check why those PGs are stuck. Check the log files of all OSDs on the node where the restart triggered the problem. Saverio 2017-06-02 14:16 GMT+02:00 Grant Morley : > We are using Ceph Jewel (10.2.7) running on
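Querying one PG directly shows which OSDs it is waiting on; a sketch with a placeholder PG id:

    ceph pg 3.1a query | less    # 3.1a is a placeholder; check the recovery_state section
                                 # and any blocked_by entries naming specific OSDs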

Re: [Openstack-operators] Ceph recovery going unusually slow

2017-06-02 Thread Tyanko Aleksiev
Additionally, it could be useful to know what you did during the maintenance. Cheers, Tyanko On 2 June 2017 at 14:08, Saverio Proto wrote: > To give you some help you need to tell us the Ceph version you are > using and, from ceph.conf in the [osd] section, what values you have for > the following

Re: [Openstack-operators] Ceph recovery going unusually slow

2017-06-02 Thread Grant Morley
We are using Ceph Jewel (10.2.7) running on Ubuntu 14.04 LTS.

    "osd_recovery_max_active": "1"
    "osd_max_backfills": "1"
    "osd_recovery_op_priority": "3"

    Limit           Soft Limit   Hard Limit   Units
    Max cpu time    unlimited    unlimited    seconds
    Max file size
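Those values can also be read back from (and, once the cluster is stable again, temporarily raised on) the running daemons; a sketch, run on the host carrying osd.0:

    ceph daemon osd.0 config show | egrep 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'
    # once healthy, recovery can be sped up cluster-wide, e.g.:
    # ceph tell osd.* injectargs '--osd-max-backfills 2 --osd-recovery-max-active 2'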

Re: [Openstack-operators] Ceph recovery going unusually slow

2017-06-02 Thread Saverio Proto
To give you some help you need to tell us the Ceph version you are using and, from the [osd] section of ceph.conf, what values you have for the following: osd max backfills, osd recovery max active, osd recovery op priority. These three settings can influence the recovery speed.
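For reference, the section being asked about typically looks like this in ceph.conf; the values shown are only illustrative of a conservative recovery tuning:

    [osd]
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 3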

Re: [Openstack-operators] Ceph recovery going unusually slow

2017-06-02 Thread George Mihaiescu
Having 9 ceph-mon servers doesn't help... I would look at the stuck PGs in order to find the common OSDs and focus on them. Their logs will probably have details on where the problem is. > On Jun 2, 2017, at 07:51, Grant Morley wrote: > > Hi All, > > I wonder if anyone could help at all. >
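A rough way to count how often each OSD appears in the stuck PGs' acting sets, assuming the usual "last acting [a,b,c]" wording in ceph health detail:

    ceph health detail | grep 'last acting' \
      | sed 's/.*last acting \[\([0-9,]*\)\].*/\1/' | tr ',' '\n' \
      | sort -n | uniq -c | sort -rn | head   # the OSD ids with the highest counts are the ones to inspect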

Re: [Openstack-operators] Ceph recovery going unusually slow

2017-06-02 Thread Grant Morley
HEALTH_ERR 210 pgs are stuck inactive for more than 300 seconds; 296 pgs backfill_wait; 3 pgs backfilling; 1 pgs degraded; 202 pgs peering; 1 pgs recovery_wait; 1 pgs stuck degraded; 210 pgs stuck inactive; 510 pgs stuck unclean; 3308 requests are blocked > 32 sec; 41 osds have slow requests; recov
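The "41 osds have slow requests" part of that report can be broken down per OSD from the same output; a sketch that assumes the usual wording of those lines:

    ceph health detail | grep -iE 'blocked|slow requests'   # names the specific osd.N holding blocked ops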

Re: [Openstack-operators] Ceph recovery going unusually slow

2017-06-02 Thread Saverio Proto
Usually 'ceph health detail' gives better info on what is making everything stuck. Saverio 2017-06-02 13:51 GMT+02:00 Grant Morley : > Hi All, > > I wonder if anyone could help at all. > > We were doing some routine maintenance on our ceph cluster and after running > a "service ceph-all restart"

[Openstack-operators] Ceph recovery going unusually slow

2017-06-02 Thread Grant Morley
Hi All, I wonder if anyone could help at all. We were doing some routine maintenance on our ceph cluster and after running a "service ceph-all restart" on one of our nodes we noticed that something wasn't quite right. The cluster has gone into an error mode and we have multiple stuck PGs and the
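The restart in question and a first-pass check afterwards are just the stock commands; a minimal sketch:

    sudo service ceph-all restart   # the per-node restart run during the maintenance
    ceph -s                         # overall health, PG states and recovery/backfill summary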