Thanks David,
Taking the list of all OSDs that are stuck shows that a little over 50% of
all OSDs are in this condition. There isn’t any discernible pattern that I can
find, and they are spread across the three servers. All of the OSDs are online
as far as the service is concerned.
I have also taken all PGs reported in the health detail output and
looked for any that report “peering_blocked_by”, but none do, so I can’t tell
whether any OSD is actually blocking the peering operation.
As suggested, I got a report of all peering PGs:
[root@carf-ceph-osd01 ~]# ceph health detail | grep "pg " | grep peering | sort -k13
pg 14.fe0 is stuck peering since forever, current state peering, last acting [104,94,108]
pg 14.fe0 is stuck unclean since forever, current state peering, last acting [104,94,108]
pg 14.fbc is stuck peering since forever, current state peering, last acting [110,91,0]
pg 14.fd1 is stuck peering since forever, current state peering, last acting [130,62,111]
pg 14.fd1 is stuck unclean since forever, current state peering, last acting [130,62,111]
pg 14.fed is stuck peering since forever, current state peering, last acting [32,33,82]
pg 14.fed is stuck unclean since forever, current state peering, last acting [32,33,82]
pg 14.fee is stuck peering since forever, current state peering, last acting [37,96,68]
pg 14.fee is stuck unclean since forever, current state peering, last acting [37,96,68]
pg 14.fe8 is stuck peering since forever, current state peering, last acting [45,31,107]
pg 14.fe8 is stuck unclean since forever, current state peering, last acting [45,31,107]
pg 14.fc1 is stuck peering since forever, current state peering, last acting [59,124,39]
pg 14.ff2 is stuck peering since forever, current state peering, last acting [62,117,7]
pg 14.ff2 is stuck unclean since forever, current state peering, last acting [62,117,7]
pg 14.fe4 is stuck peering since forever, current state peering, last acting [84,55,92]
pg 14.fe4 is stuck unclean since forever, current state peering, last acting [84,55,92]
pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
pg 14.ffc is stuck peering since forever, current state peering, last acting [96,53,70]
pg 14.ffc is stuck unclean since forever, current state peering, last acting [96,53,70]
Some PGs share common OSDs, but some OSDs are listed only once.
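For reference, a quick pipeline along these lines can count how often each OSD
id appears in the acting sets (a sketch; two sample lines from the output
above are recreated in a temp file so it can be tried standalone):

```shell
# Sketch: count OSD occurrences across the acting sets of stuck-peering PGs.
# peering.txt stands in for `ceph health detail` output; two sample lines
# from above are recreated here so the pipeline runs on its own.
cat > peering.txt <<'EOF'
pg 14.fe0 is stuck peering since forever, current state peering, last acting [104,94,108]
pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
EOF
grep 'stuck peering' peering.txt \
  | sed -n 's/.*acting \[\([0-9,]*\)\].*/\1/p' \
  | tr ',' '\n' | sort -n | uniq -c | sort -rn
```

OSDs at the top of the count (here osd.94) are the ones shared by the most
stuck PGs, and would be the first candidates to kick.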
Should I try just marking OSDs with stuck requests down to see if that will
re-assert them?
Thanks!!
-Bryan
From: David Turner [mailto:[email protected]]
Sent: Friday, February 16, 2018 2:51 PM
To: Bryan Banister <[email protected]>
Cc: Bryan Stillwell <[email protected]>; Janne Johansson
<[email protected]>; Ceph Users <[email protected]>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
Note: External Email
________________________________
I'll start with the questions I definitely know the answer to, and then we'll
continue from there. If an OSD is blocking peering but is online, when you
mark it as down in the cluster it logs a message saying it was wrongly marked
down and tells the mons it is online. That gets it to stop what it was doing
and start talking again. I referred to that as re-asserting. If the OSD that
you marked down doesn't mark itself back up within a couple of minutes,
restarting the OSD might be a good idea. Then again, actually restarting the
daemon could be bad because the daemon may be in the middle of doing
something. With plenty of other places to start, restarting the daemons is
probably something I would wait to do for now.
The reason the cluster doesn't know anything about the PG is that it's still
creating and hasn't actually been created. Starting with some of the OSDs
where you see blocked requests would be a good idea. Eventually you'll down an
OSD that, when it comes back up, gets things peering and looking much better.
Below is the list of OSDs from your previous email; any that still have stuck
requests are good candidates to start with. On closer review, it's almost all
of them... but you have to start somewhere. Another possible place to start is
to look at a list of all of the peering PGs and see if there are any common
OSDs when you look at all of them at once. Some patterns may emerge and would
be good options to try.
osds 7,39,60,103,133 have stuck requests > 67108.9 sec
osds 5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131 have stuck requests > 134218 sec
osds 4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132 have stuck requests > 268435 sec
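One way to work through the first list (a sketch that only prints the
commands, so each one can be run by hand while watching whether the OSD
re-asserts):

```shell
# Sketch: emit one `ceph osd down` command per stuck-request OSD so they can
# be run one at a time. Printing rather than executing leaves room to watch,
# between commands, for the "wrongly marked me down" log line and the OSD
# marking itself back up before moving to the next one.
for osd in 7 39 60 103 133; do
    printf 'ceph osd down %s\n' "$osd"
done
```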
On Fri, Feb 16, 2018 at 2:53 PM Bryan Banister <[email protected]> wrote:
Thanks David,
I have set the nobackfill, norecover, noscrub, and nodeep-scrub options at this
point and the backfills have stopped. I’ll also stop the backups from pushing
into ceph for now.
I don’t want to make things worse, so I’m asking for some more guidance now.
1) In looking at a PG that is still peering or one that is “unknown”, Ceph
complains that it doesn’t have that pgid:
pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
[root@carf-ceph-osd03 ~]# ceph pg 14.fb0 query
Error ENOENT: i don't have pgid 14.fb0
[root@carf-ceph-osd03 ~]#
2) One that is activating shows this for the recovery_state:
[root@carf-ceph-osd03 ~]# ceph pg 14.fe1 query | less
[snip]
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2018-02-13 14:33:21.406919",
"might_have_unfound": [
{
"osd": "84(0)",
"status": "not queried"
}
],
"recovery_progress": {
"backfill_targets": [
"56(0)",
"87(1)",
"88(2)"
],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"recovery_ops": [],
"read_ops": []
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.seed": 0,
"scrubber.waiting_on": 0,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2018-02-13 14:33:17.491148"
}
],
Sorry for all the hand-holding, but how do I determine whether I need to set an
OSD as ‘down’ to fix the issues, and how does it go about re-asserting itself?
I again tried looking at the Ceph docs on troubleshooting OSDs but didn’t find
any details. The man page also has no details.
Thanks again,
-Bryan
From: David Turner [mailto:[email protected]]
Sent: Friday, February 16, 2018 1:21 PM
To: Bryan Banister <[email protected]>
Cc: Bryan Stillwell <[email protected]>; Janne Johansson
<[email protected]>; Ceph Users <[email protected]>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
________________________________
Your problem might have been creating too many PGs at once. I generally
increase pg_num and pgp_num by no more than 256 at a time, making sure that
all PGs are created, peered, and healthy (other than backfilling) before
increasing them again.
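Sketched out, that stepwise increase looks like the following (commands are
printed, not executed; the pool name .rgw.buckets and the 1024 starting point
are assumptions taken from elsewhere in this thread):

```shell
# Sketch: step pg_num/pgp_num up by 256 at a time toward 4096. In a real run
# you would wait between steps until every new PG is created, peered, and
# healthy before issuing the next pair of commands.
for n in $(seq 1280 256 4096); do
    printf 'ceph osd pool set .rgw.buckets pg_num %s\n' "$n"
    printf 'ceph osd pool set .rgw.buckets pgp_num %s\n' "$n"
done
```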
To help you get back to a healthy state, let's start off by getting all of your
PGs peered. Go ahead and put a stop to backfilling, recovery, scrubbing, etc.
Those are all hindering the peering effort right now. The more clients you can
disable, the better.
ceph osd set nobackfill
ceph osd set norecover
ceph osd set noscrub
ceph osd set nodeep-scrub
After that, look at your peering PGs and find out what is blocking their
peering. This is where you might need to use `ceph osd down 23` (assuming
you needed to kick osd.23) to mark them down in the cluster and let them
re-assert themselves. Once all PGs are done peering, go ahead and unset
nobackfill and norecover and let the cluster start moving data around.
Whether to keep the noscrub and nodeep-scrub flags set is up to you. I'll
never say it's better to leave scrubbing disabled, but scrubbing does use a
fair bit of spindle time while you're trying to backfill.
On Fri, Feb 16, 2018 at 2:12 PM Bryan Banister <[email protected]> wrote:
Well I decided to try the increase in PGs to 4096 and that seems to have caused
some issues:
2018-02-16 12:38:35.798911 mon.carf-ceph-osd01 [ERR] overall HEALTH_ERR 61802168/241154376 objects misplaced (25.628%); Reduced data availability: 2081 pgs inactive, 322 pgs peering; Degraded data redundancy: 552/241154376 objects degraded (0.000%), 3099 pgs unclean, 38 pgs degraded; 163 stuck requests are blocked > 4096 sec
The cluster is actively backfilling misplaced objects, but not all PGs are
active at this point, and many are stuck peering, stuck unclean, or in an
unknown state:
PG_AVAILABILITY Reduced data availability: 2081 pgs inactive, 322 pgs peering
pg 14.fae is stuck inactive for 253360.025730, current state activating+remapped, last acting [85,12,41]
pg 14.faf is stuck inactive for 253368.511573, current state unknown, last acting []
pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
pg 14.fb1 is stuck inactive for 253362.605886, current state activating+remapped, last acting [6,74,34]
[snip]
The health also shows a large number of degraded data redundancy PGs:
PG_DEGRADED Degraded data redundancy: 552/241154376 objects degraded (0.000%), 3099 pgs unclean, 38 pgs degraded
pg 14.fc7 is stuck unclean for 253368.511573, current state unknown, last acting []
pg 14.fc8 is stuck unclean for 531622.531271, current state active+remapped+backfill_wait, last acting [73,132,71]
pg 14.fca is stuck unclean for 420540.396199, current state active+remapped+backfill_wait, last acting [0,80,61]
pg 14.fcb is stuck unclean for 531622.421855, current state activating+remapped, last acting [70,26,75]
[snip]
We also now have a number of stuck requests:
REQUEST_STUCK 163 stuck requests are blocked > 4096 sec
69 ops are blocked > 268435 sec
66 ops are blocked > 134218 sec
28 ops are blocked > 67108.9 sec
osds 7,39,60,103,133 have stuck requests > 67108.9 sec
osds 5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131 have stuck requests > 134218 sec
osds 4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132 have stuck requests > 268435 sec
I tried looking through the mailing list archive on how to solve the stuck
requests, and it seems that restarting the OSDs is the right way?
At this point we have just been watching the backfills running and see a steady
but slow decrease of misplaced objects. When the cluster is idle, the overall
OSD disk utilization is not too bad at roughly 40% on the physical disks
running these backfills.
However, we still have our backups trying to push new images to the cluster.
This worked OK for the first few days, but yesterday we were getting failure
alerts. I checked the status of the RGW service and noticed that 2 of the 3
RGW civetweb servers were not responsive. I restarted the RGWs on the ones
that appeared hung, which got them working for a while, but then the same
condition happened. The RGWs seem to have recovered on their own now, but
again the cluster is idle and only backfills are currently doing anything (that
I can tell). I did see these log entries:
2018-02-15 16:46:07.541542 7fffe6c56700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffcec26700' had timed out after 600
2018-02-15 16:46:12.541613 7fffe6c56700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffdbc40700' had timed out after 600
2018-02-15 16:46:12.541629 7fffe6c56700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffcec26700' had timed out after 600
2018-02-15 16:46:17.541701 7fffe6c56700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffdbc40700' had timed out after 600
At this point we do not know how to proceed with recovery efforts. I tried
looking at the ceph docs and mailing list archives but wasn’t able to determine
the right path forward here.
Any help is appreciated,
-Bryan
From: Bryan Stillwell [mailto:[email protected]]
Sent: Tuesday, February 13, 2018 2:27 PM
To: Bryan Banister <[email protected]>; Janne Johansson
<[email protected]>
Cc: Ceph Users <[email protected]>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
________________________________
It may work fine, but I would suggest limiting the number of operations going
on at the same time.
Bryan
From: Bryan Banister <[email protected]>
Date: Tuesday, February 13, 2018 at 1:16 PM
To: Bryan Stillwell <[email protected]>, Janne Johansson
<[email protected]>
Cc: Ceph Users <[email protected]>
Subject: RE: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
Thanks for the response Bryan!
Would it be good to go ahead and do the increase up to 4096 PGs for the pool
given that it’s only 52% done with the rebalance backfilling operations?
Thanks in advance!!
-Bryan
-----Original Message-----
From: Bryan Stillwell [mailto:[email protected]]
Sent: Tuesday, February 13, 2018 12:43 PM
To: Bryan Banister <[email protected]>; Janne Johansson
<[email protected]>
Cc: Ceph Users <[email protected]>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
-------------------------------------------------
Bryan,
Based on the information you've provided so far, I would say that your largest
pool still doesn't have enough PGs.
If you originally had only 512 PGs for your largest pool (I'm guessing
.rgw.buckets has 99% of your data), then on a balanced cluster you would have
just ~11.5 PGs per OSD (3*512/133). That's way lower than the recommended 100
PGs/OSD.
Based on the number of disks and assuming your .rgw.buckets pool has 99% of the
data, you should have around 4,096 PGs for that pool. You'll still end up with
an uneven distribution, but the outliers shouldn't be as far out.
Sage recently wrote a new balancer plugin that makes balancing a cluster
something that happens automatically. He gave a great talk at LinuxConf
Australia that you should check out, here's a link into the video where he
talks about the balancer and the need for it:
https://youtu.be/GrStE7XSKFE?t=20m14s
Even though your objects are fairly large, they are getting broken up into
chunks that are spread across the cluster. You can see how large each of your
PGs is with a command like this:
ceph pg dump | grep '[0-9]*\.[0-9a-f]*' | awk '{ print $1 "\t" $7 }' | sort -n -k2
You'll see that within a pool the PG sizes are fairly close to the same size,
but in your cluster the PGs are fairly large (~200GB would be my guess).
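A follow-on sketch along the same lines computes the average PG size per pool,
keying on the pool-id prefix of the PG id and the BYTES column ($7 in the
luminous `ceph pg dump` format); a two-line sample stands in for a real dump:

```shell
# Sketch: average PG size per pool from a saved `ceph pg dump`. The sample
# file below replaces the real dump so this runs standalone; column 1 is the
# PG id (pool.hash) and column 7 is BYTES.
cat > pg_dump.txt <<'EOF'
14.fe0 1000 0 0 0 0 2000000000
14.fb0 1000 0 0 0 0 4000000000
EOF
awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {
         split($1, p, "."); sum[p[1]] += $7; cnt[p[1]]++
     }
     END { for (pool in sum)
               printf "pool %s: avg PG size %.1f GB\n",
                      pool, sum[pool] / cnt[pool] / 1e9 }' pg_dump.txt
```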
Bryan
From: ceph-users <[email protected]> on behalf of
Bryan Banister <[email protected]>
Date: Monday, February 12, 2018 at 2:19 PM
To: Janne Johansson <[email protected]>
Cc: Ceph Users <[email protected]>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
Hi Janne and others,
We used the “ceph osd reweight-by-utilization” command to move a small amount
of data off of the top four OSDs by utilization. Then we updated the pg_num
and pgp_num on the pool from 512 to 1024, which started moving roughly 50% of
the objects around as a result. The unfortunate issue is that the weights on
the OSDs are still roughly equivalent, and the OSDs that are nearfull were
still getting allocated objects during the rebalance backfill operations.
At this point I have made some massive changes to the weights of the OSDs in an
attempt to stop Ceph from allocating any more data to OSDs that are getting
close to full. Basically, the OSD with the lowest utilization (21% at the
time) keeps a weight of 1, and each of the other OSDs is reduced in weight in
proportion to how far its utilization exceeds that baseline. This means the
most-full OSD, currently at 86% full, now has a weight of only 0.33 (it was at
89% when the reweight was applied). I’m not sure this is a good idea, but it
seemed like the only option I had. Please let me know if I’m making a bad
situation worse!
I still have the question on how this happened in the first place and how to
prevent it from happening going forward without a lot of monitoring and
reweighting on weekends/etc to keep things balanced. It sounds like Ceph is
really expecting that objects stored into a pool will roughly have the same
size, is that right?
Our backups going into this pool have very large variation in size, so would it
be better to create multiple pools based on expected size of objects and then
put backups of similar size into each pool?
The backups also have basically the same names with the only difference being
the date which it was taken (e.g. backup name difference in subsequent days can
be one digit at times), so does this mean that large backups with basically the
same name will end up being placed in the same PGs based on the CRUSH
calculation using the object name?
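One way to get a feel for the answer: placement starts from a hash of the
object name (rjenkins) modulo the pool’s pgp_num, so names differing by a
single digit still land in unrelated PGs. The sketch below uses cksum as a
stand-in hash, so the exact PG numbers are not what Ceph would compute; the
lack of name locality is the point.

```shell
# Sketch: hash near-identical backup names onto 1024 PGs. cksum stands in
# for Ceph's rjenkins hash here; a one-character name difference produces an
# unrelated hash value, so similar names do not cluster into the same PG.
pgp_num=1024
for name in backup-2018-02-12 backup-2018-02-13 backup-2018-02-14; do
    h=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
    echo "$name -> pg $((h % pgp_num))"
done
```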
Thanks,
-Bryan
From: Janne Johansson [mailto:[email protected]]
Sent: Wednesday, January 31, 2018 9:34 AM
To: Bryan Banister <[email protected]>
Cc: Ceph Users <[email protected]>
Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminus 1.2.2
2018-01-31 15:58 GMT+01:00 Bryan Banister <[email protected]>:
Given that this will move data around (I think), should we increase the pg_num
and pgp_num first and then see how it looks?
I guess adding PGs and bumping pgp_num will move stuff around too, but if the
PGCALC formula says you should have more, then that would still be a good
start. Still, a few manual reweights first to take the 85-90% ones down might
be good; some move operations will refuse to add things to too-full OSDs, so
you would not want to get accidentally bumped above such a limit by temp data
created during moves.
Also, don't bump PGs like crazy; you can never go back down. Aim for getting
~100 per OSD at most, and perhaps even then in smaller steps, so that the
creation (and evening out of data into the new empty PGs) doesn't kill normal
client I/O perf in the meantime.
--
May the most significant bit of your life be positive.
Note: This email is for the confidential use of the named addressee(s) only and
may contain proprietary, confidential or privileged information. If you are not
the intended recipient, you are hereby notified that any review, dissemination
or copying of this email is strictly prohibited, and to please notify the
sender immediately and destroy this email and any attachments. Email
transmission cannot be guaranteed to be secure or error-free. The Company,
therefore, does not make any guarantees as to the completeness or accuracy of
this email or any attachments. This email is for informational purposes only
and does not constitute a recommendation, offer, request or solicitation of any
kind to buy, sell, subscribe, redeem or perform any type of transaction of a
financial product.
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com