Another "new" thing we see with hammer is constant: mon.0 [INF] from='client.52217412 :/0' entity='client.admin' cmd='[{"prefix": "osd blacklist", "blacklistop": "add", "addr": "<client_ip>:0/3562049007"}]': finished
On Tue, Dec 1, 2015 at 11:02 AM, Tom Christensen <pav...@gmail.com> wrote:

> Another thing that we don't quite grasp: when we see slow requests now, they almost always (probably 95%) have the "known_if_redirected" state set. What does this state mean? Does it indicate we have OSD maps that are lagging and the cluster isn't really in sync? Could this be the cause of our growing osdmaps?
>
> -Tom
>
> On Tue, Dec 1, 2015 at 2:35 AM, HEWLETT, Paul (Paul) <paul.hewl...@alcatel-lucent.com> wrote:
>
>> I believe that ‘filestore xattr use omap’ is no longer used in Ceph – can anybody confirm this? I could not find any usage in the Ceph source code except that the value is set in some of the test software…
>>
>> Paul
>>
>> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Tom Christensen <pav...@gmail.com>
>> Date: Monday, 30 November 2015 at 23:20
>> To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs
>>
>> What counts as ancient? Concurrent with our hammer upgrade we went from 3.16 to 3.19 on Ubuntu 14.04. We are looking to revert to the 3.16 kernel we'd been running, because we're also seeing an intermittent (it's happened twice in 2 weeks) massive load spike that completely hangs the OSD node (we're talking about load averages that hit 20k+ before the box becomes completely unresponsive). We saw similar behavior on a 3.13 kernel, which was resolved by moving to 3.16. I'll try to catch one with debug_ms=1 and see if we're hitting a similar hang.
>>
>> To your comment about omap, we do have filestore xattr use omap = true in our conf... which we believe was placed there by ceph-deploy (which we used to deploy this cluster). We are on xfs, but we do take tons of RBD snapshots. If either of those use cases causes osdmaps to grow, then we may just be exceeding the limit on the number of RBD snapshots ceph can handle (we take about 4,000-5,000/day, 1 per RBD in the cluster).
>>
>> An interesting note: we had an OSD flap earlier this morning, and immediately after it came back I checked its meta directory size with du -sh. This returned immediately and showed a size of 107GB. The fact that it returned immediately indicated to me that something had just read through that whole directory, so it was all cached in the FS cache; normally a du -sh on the meta directory takes a good 5 minutes to return. Anyway, since it dropped this morning its meta directory size has continued to shrink and is down to 93GB. So it feels like something happens that makes the OSD read all its historical maps, which hangs the OSD because there are a ton of them, and then it wakes up and realizes it can delete a bunch of them...
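(An aside on the historical-maps point above, in case it helps anyone reproduce: the way we check how many maps an OSD is still holding is via the admin socket, which on the hammer builds we run reports the oldest and newest map epochs; the paths below assume the default data dir.)

    # oldest_map / newest_map epochs this OSD still holds, plus pg count
    ceph daemon osd.1191 status

    # on-disk size of those stored maps (the "meta" directory discussed above)
    du -sh /var/lib/ceph/osd/ceph-1191/current/meta

    # current cluster epoch, for comparison
    ceph osd stat

osd.1191 is just the OSD from the log excerpt further down; substitute your own id.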
>> On Mon, Nov 30, 2015 at 2:11 PM, Dan van der Ster <dvand...@gmail.com> wrote:
>>
>>> The trick with debugging heartbeat problems is to grep back through the log to find the last thing the affected thread was doing, e.g. is 0x7f5affe72700 stuck in messaging, writing to the disk, reading through the omap, etc.
>>>
>>> I agree this doesn't look to be network related, but if you want to rule it out you should use debug_ms=1.
>>>
>>> Last week we upgraded a 1200-OSD cluster from firefly to 0.94.5 and similarly started getting slow requests. To make a long story short, our issue turned out to be sendmsg blocking (very rarely), probably due to an ancient el6 kernel (these OSD servers had ~800 days' uptime). The signature of this was 900s of slow requests, then an ms log showing "initiating reconnect". Until we got the kernel upgraded everywhere, we used a workaround of ms tcp read timeout = 60. So, check your kernels, and upgrade if they're ancient. The latest el6 kernels work for us.
>>>
>>> Otherwise, those huge OSD leveldbs don't look right (unless you're using tons and tons of omap...). It kinda reminds me of the other problem we hit after the hammer upgrade, namely the return of the ever-growing mon leveldb issue; the solution there was to recreate the mons one by one. Perhaps you've hit something similar with the OSDs. debug_osd=10 might be good enough to see what the osd is doing; maybe you need debug_filestore=10 also. If that doesn't show the problem, bump those up to 20.
>>>
>>> Good luck,
>>>
>>> Dan
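(For reference, this is roughly how we've been applying the settings Dan mentions. The injectargs line is standard; persisting the tcp read timeout under [osd] in ceph.conf is just our guess at how his workaround would be carried, not something he spelled out.)

    # bump debug levels on a suspect OSD at runtime while trying to catch a flap
    ceph tell osd.1191 injectargs '--debug_osd 10 --debug_filestore 10 --debug_ms 1'

    # the el6 sendmsg workaround Dan describes, persisted in ceph.conf
    [osd]
        ms tcp read timeout = 60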
>>> On 30 Nov 2015 20:56, "Tom Christensen" <pav...@gmail.com> wrote:
>>>
>>> > We recently upgraded to 0.94.3 from firefly and now for the last week have had intermittent slow requests and flapping OSDs. We have been unable to nail down the cause, but it feels like it may be related to our osdmaps not getting deleted properly. Most of our OSDs are now storing over 100GB of data in the meta directory, and almost all of that is historical osdmaps more than 7 days old.
>>> >
>>> > We did make a small cluster change (we added 35 OSDs to a 1445-OSD cluster); the rebalance took about 36 hours and completed 10 days ago. Since then the cluster has been HEALTH_OK and all pgs have been active+clean, except when we have an OSD flap.
>>> >
>>> > When the OSDs flap they do not crash and restart; they just go unresponsive for 1-3 minutes and then come back alive on their own. They get marked down by peers, cause some peering, and then rejoin the cluster and continue on their merry way.
>>> >
>>> > We see a bunch of this in the logs while the OSD is catatonic:
>>> >
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143166 7f5b03679700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143176 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143210 7f5b04e7c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143218 7f5b04e7c700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143288 7f5b03679700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143293 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
>>> >
>>> > I have a chunk of logs at debug 20/5; not sure if I should have done just 20... It's pretty hard to catch: we basically have to see the slow requests and get debug logging set in about a 5-10 second window before the OSD stops responding to the admin socket...
>>> >
>>> > As networking is almost always the cause of flapping OSDs, we have tested the network quite extensively. It hasn't changed physically since before the hammer upgrade and was performing well. We have run large numbers of ping tests and have not seen a single dropped packet between OSD nodes or between OSD nodes and mons.
>>> >
>>> > I don't see any error packets or drops on the switches either.
>>> >
>>> > Ideas?
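(One more aside: when we do manage to catch that 5-10 second window, the kind of thing we grab from the admin socket before the OSD stops answering looks roughly like this; again, osd.1191 is just the example from the log above.)

    # what the OSD thinks it is working on right now
    ceph daemon osd.1191 dump_ops_in_flight

    # the slowest recently completed ops
    ceph daemon osd.1191 dump_historic_ops

    # related timeout settings; osd_op_thread_timeout is, we believe, the 15s in the messages above
    ceph daemon osd.1191 config show | grep -E 'osd_op_thread_timeout|osd_heartbeat_grace'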