On Wed, Dec 6, 2017 at 2:35 PM David Turner <drakonst...@gmail.com> wrote:
> I have no proof or anything other than a hunch, but OSDs don't trim omaps
> unless all PGs are healthy. If this PG is actually not healthy, but the
> cluster doesn't realize it while these 11 involved OSDs do realize that
> the PG is unhealthy... you would see this exact problem. The OSDs think a
> PG is unhealthy, so they aren't trimming their omaps, while the cluster
> doesn't seem to be aware of it and everything else is trimming its omaps
> properly.

I think you're confusing omaps and OSDMaps here. OSDMaps, like omap, are
stored in leveldb, but they have different trimming rules. (A quick way to
check whether OSDMap retention is what's eating the space is sketched at
the bottom of this mail.)

> I don't know what to do about it, but I hope it helps get you (or someone
> else on the ML) towards a resolution.
>
> On Wed, Dec 6, 2017 at 1:59 PM <george.vasilaka...@stfc.ac.uk> wrote:
>
>> Hi ceph-users,
>>
>> We have a Ceph cluster (running Kraken) that is exhibiting some odd
>> behaviour.
>> A couple of weeks ago, the LevelDBs on some of our OSDs started growing
>> large (now around 20G in size).
>>
>> The one thing they have in common is that the 11 disks with inflating
>> LevelDBs are all in the set for one PG in one of our pools (EC 8+3). This
>> pool started to see use around the time the LevelDBs started inflating.
>> Compactions are running and they do bring the size down a bit, but the
>> overall trend is one of rapid growth. The other 2000+ OSDs in the cluster
>> have LevelDBs between 650M and 1.2G.
>> This PG has nothing to separate it from the others in its pool: it is
>> within 5% of the average number of objects per PG, there is no
>> hot-spotting in terms of load, and no weird states are reported by ceph
>> status.
>>
>> The one odd thing about it is that the pg query output says it is
>> active+clean, but it has a recovery state, which it enters every morning
>> between 9 and 10am, in which it mentions a "might_have_unfound" situation
>> and having probed all other set members. A deep scrub of the PG didn't
>> turn up anything.

You need to be more specific here. What do you mean it "enters into" the
recovery state every morning?

How many PGs are in your 8+3 pool, and are all your OSDs hosting EC pools?
What are you using the cluster for?

>> The cluster is now starting to manifest slow requests on the OSDs with
>> the large LevelDBs, although not in the particular PG.
>>
>> What can I do to diagnose and resolve this?
>>
>> Thanks,
>>
>> George
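Since the question is whether these OSDs are holding on to old OSDMaps,
here is a rough, untested sketch of one way to check. Assumptions that are
not from this thread: the OSD ids below are placeholders for the 11 OSDs in
the affected PG's acting set, "ceph daemon" has to be run on each OSD's own
host (wrap it in ssh if needed), and the osdmap_first_committed /
osdmap_last_committed fields are as they appear in "ceph report" output on
the releases I've looked at, so verify the names on Kraken.

#!/usr/bin/env python
# Sketch: compare the OSDMap epochs each suspect OSD still holds with the
# range the monitors have committed. A much older oldest_map on just these
# 11 OSDs would point at OSDMap retention, not omap data, as the source of
# the leveldb growth.
import json
import subprocess

SUSPECT_OSDS = [12, 34, 56]  # placeholders; substitute the real acting set


def ceph_json(*args):
    """Run a ceph CLI command and parse its JSON stdout."""
    out = subprocess.check_output(["ceph"] + list(args))
    return json.loads(out)


def main():
    report = ceph_json("report")
    print("mon committed osdmaps: %d .. %d" % (
        report["osdmap_first_committed"], report["osdmap_last_committed"]))

    for osd_id in SUSPECT_OSDS:
        # Admin socket command: must be run on the host that owns the OSD.
        status = ceph_json("daemon", "osd.%d" % osd_id, "status")
        print("osd.%d holds maps %d .. %d (%d epochs)" % (
            osd_id, status["oldest_map"], status["newest_map"],
            status["newest_map"] - status["oldest_map"]))


if __name__ == "__main__":
    main()

If those 11 OSDs report an oldest_map far behind what the rest of the
cluster holds, the odd-looking PG is probably what's pinning the old maps,
and that's where the leveldb space would be going. It would also be worth
capturing "ceph pg <pgid> query" at the moment the recovery state appears.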
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com