Another "new" thing we see with hammer is constant: mon.0 [INF] from='client.52217412 :/0' entity='client.admin' cmd='[{"prefix": "osd blacklist", "blacklistop": "add", "addr": "<client_ip>:0/3562049007"}]': finished
On Tue, Dec 1, 2015 at 11:02 AM, Tom Christensen <pav...@gmail.com> wrote:

> Another thing that we don't quite grasp: when we see slow requests now, they almost always (probably 95%) have the "known_if_redirected" state set. What does this state mean? Does it indicate we have OSD maps that are lagging and the cluster isn't really in sync? Could this be the cause of our growing osdmaps?
>
> -Tom
>
> On Tue, Dec 1, 2015 at 2:35 AM, HEWLETT, Paul (Paul) <paul.hewl...@alcatel-lucent.com> wrote:
>
>> I believe that ‘filestore xattr use omap’ is no longer used in Ceph – can anybody confirm this? I could not find any usage in the Ceph source code except that the value is set in some of the test software…
>>
>> Paul
>>
>> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Tom Christensen <pav...@gmail.com>
>> Date: Monday, 30 November 2015 at 23:20
>> To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs
>>
>> What counts as ancient? Concurrent with our hammer upgrade we went from 3.16 to 3.19 on Ubuntu 14.04. We are looking to revert to the 3.16 kernel we'd been running, because we're also seeing an intermittent (it's happened twice in 2 weeks) massive load spike that completely hangs the OSD node (we're talking about load averages that hit 20k+ before the box becomes completely unresponsive). We saw similar behavior on a 3.13 kernel, which was resolved by moving to 3.16. I'll try to catch one with debug_ms=1 and see if we're hitting a similar hang.
>>
>> To your comment about omap, we do have filestore xattr use omap = true in our conf... which we believe was placed there by ceph-deploy (which we used to deploy this cluster). We are on xfs, but we do take tons of RBD snapshots. If either of those use cases causes osdmaps to grow, then we may just be exceeding the limit on the number of RBD snapshots ceph can handle (we take about 4,000-5,000/day, 1 per RBD in the cluster).
>>
>> An interesting note: we had an OSD flap earlier this morning, and immediately after it came back I checked its meta directory size with du -sh. This returned immediately and showed a size of 107GB. The fact that it returned immediately indicated to me that something had just read through that whole directory, so it was all cached in the FS cache; normally a du -sh on the meta directory takes a good 5 minutes to return. Anyway, since it dropped this morning its meta directory size has continued to shrink and is down to 93GB. So it feels like something happens that makes the OSD read all its historical maps, which hangs the OSD because there are a ton of them, and then it wakes up and realizes it can delete a bunch of them...
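(An aside on the historical-maps point above, in case it helps anyone reproduce: the way we check how many maps an OSD is still holding is via the admin socket, which on the hammer builds we run reports the oldest and newest map epochs; the paths below assume the default data dir.)

    # oldest_map / newest_map epochs this OSD still holds, plus pg count
    ceph daemon osd.1191 status

    # on-disk size of those stored maps (the "meta" directory discussed above)
    du -sh /var/lib/ceph/osd/ceph-1191/current/meta

    # current cluster epoch, for comparison
    ceph osd stat

osd.1191 is just the OSD from the log excerpt further down; substitute your own id.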
>> On Mon, Nov 30, 2015 at 2:11 PM, Dan van der Ster <dvand...@gmail.com> wrote:
>>
>>> The trick with debugging heartbeat problems is to grep back through the log to find the last thing the affected thread was doing, e.g. is 0x7f5affe72700 stuck in messaging, writing to the disk, reading through the omap, etc.
>>>
>>> I agree this doesn't look to be network related, but if you want to rule it out you should use debug_ms=1.
>>>
>>> Last week we upgraded a 1200-OSD cluster from firefly to 0.94.5 and similarly started getting slow requests. To make a long story short, our issue turned out to be sendmsg blocking (very rarely), probably due to an ancient el6 kernel (these OSD servers had ~800 days' uptime). The signature of this was 900s of slow requests, then an ms log showing "initiating reconnect". Until we got the kernel upgraded everywhere, we used a workaround of ms tcp read timeout = 60. So, check your kernels, and upgrade if they're ancient. The latest el6 kernels work for us.
>>>
>>> Otherwise, those huge OSD leveldbs don't look right (unless you're using tons and tons of omap...). It kinda reminds me of the other problem we hit after the hammer upgrade, namely the return of the ever-growing mon leveldb issue; the solution there was to recreate the mons one by one. Perhaps you've hit something similar with the OSDs. debug_osd=10 might be good enough to see what the osd is doing; maybe you need debug_filestore=10 also. If that doesn't show the problem, bump those up to 20.
>>>
>>> Good luck,
>>>
>>> Dan
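(For reference, this is roughly how we've been applying the settings Dan mentions. The injectargs line is standard; persisting the tcp read timeout under [osd] in ceph.conf is just our guess at how his workaround would be carried, not something he spelled out.)

    # bump debug levels on a suspect OSD at runtime while trying to catch a flap
    ceph tell osd.1191 injectargs '--debug_osd 10 --debug_filestore 10 --debug_ms 1'

    # the el6 sendmsg workaround Dan describes, persisted in ceph.conf
    [osd]
        ms tcp read timeout = 60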
>>> On 30 Nov 2015 20:56, "Tom Christensen" <pav...@gmail.com> wrote:
>>>
>>> > We recently upgraded to 0.94.3 from firefly and now for the last week have had intermittent slow requests and flapping OSDs. We have been unable to nail down the cause, but it feels like it may be related to our osdmaps not getting deleted properly. Most of our OSDs are now storing over 100GB of data in the meta directory, and almost all of that is historical osdmaps more than 7 days old.
>>> >
>>> > We did make a small cluster change (we added 35 OSDs to a 1445-OSD cluster); the rebalance took about 36 hours and completed 10 days ago. Since then the cluster has been HEALTH_OK and all pgs have been active+clean, except when we have an OSD flap.
>>> >
>>> > When the OSDs flap they do not crash and restart; they just go unresponsive for 1-3 minutes and then come back alive on their own. They get marked down by peers, cause some peering, and then rejoin the cluster and continue on their merry way.
>>> >
>>> > We see a bunch of this in the logs while the OSD is catatonic:
>>> >
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143166 7f5b03679700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143176 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143210 7f5b04e7c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143218 7f5b04e7c700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143288 7f5b03679700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143293 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
>>> >
>>> > I have a chunk of logs at debug 20/5; not sure if I should have done just 20... It's pretty hard to catch: we basically have to see the slow requests and get debug logging set in about a 5-10 second window before the OSD stops responding to the admin socket...
>>> >
>>> > As networking is almost always the cause of flapping OSDs, we have tested the network quite extensively. It hasn't changed physically since before the hammer upgrade and was performing well. We have run large numbers of ping tests and have not seen a single dropped packet between OSD nodes or between OSD nodes and mons.
>>> >
>>> > I don't see any error packets or drops on the switches either.
>>> >
>>> > Ideas?
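(One more aside: when we do manage to catch that 5-10 second window, the kind of thing we grab from the admin socket before the OSD stops answering looks roughly like this; again, osd.1191 is just the example from the log above.)

    # what the OSD thinks it is working on right now
    ceph daemon osd.1191 dump_ops_in_flight

    # the slowest recently completed ops
    ceph daemon osd.1191 dump_historic_ops

    # related timeout settings; osd_op_thread_timeout is, we believe, the 15s in the messages above
    ceph daemon osd.1191 config show | grep -E 'osd_op_thread_timeout|osd_heartbeat_grace'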