From: Gregory Farnum <gfar...@redhat.com>
Date: Tuesday, 12 December 2017 at 19:24
To: "Vasilakakos, George (STFC,RAL,SC)" <george.vasilaka...@stfc.ac.uk>
Cc: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Sudden omap growth on some OSDs

On Tue, Dec 12, 2017 at 3:16 AM 
<george.vasilaka...@stfc.ac.uk> wrote:

On 11 Dec 2017, at 18:24, Gregory Farnum <gfar...@redhat.com> wrote:

Hmm, this does all sound odd. Have you tried just restarting the primary OSD 
yet? That frequently resolves transient oddities like this.
If not, I'll go poke at the kraken source and one of the developers more 
familiar with the recovery processes we're seeing here.
-Greg


Hi Greg,

I’ve tried this, with no effect. Also, on Friday we tried removing an OSD (not the
primary); the OSD chosen to replace it has had its LevelDB grow to 7GiB by now.
Yesterday it was 5.3GiB.
We’re not seeing any errors logged by the OSDs with the default logging level 
either.

Do you have any comments on the fact that the primary sees the PG’s state as 
being different to what the peers think?

Yes. It's super weird. :p

Now, with a new primary, I’m seeing the last peer in the set reporting
‘active+clean’, as does the primary, while all the others say it’s
‘active+clean+degraded’ (according to pg query output).

Has the last OSD in the list shrunk down its LevelDB instance?

No, the last peer has the largest LevelDB currently in the PG, at 14GiB.

If so (or even if not), I'd try restarting all the OSDs in the PG and see if 
that changes things.

Will try that and report back.
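
In case it’s useful, this is roughly how I plan to cycle them; a sketch only, with a
placeholder pg id, assuming systemd-managed OSDs and ssh access to the OSD hosts:

#!/usr/bin/env python
# Sketch only: restart every OSD in the PG's acting set, one at a time.
# Assumes systemd-managed OSDs (ceph-osd@N), passwordless ssh to the OSD
# hosts, and a placeholder pg id.
import json
import subprocess
import time

PGID = "70.fa"  # placeholder, not our real pg id

def ceph_json(*args):
    out = subprocess.check_output(["ceph"] + list(args) + ["--format", "json"])
    return json.loads(out)

acting = ceph_json("pg", "map", PGID)["acting"]
for osd in acting:
    # 'ceph osd find' should tell us which host the OSD lives on
    host = ceph_json("osd", "find", str(osd))["crush_location"]["host"]
    print("restarting osd.%d on %s" % (osd, host))
    subprocess.check_call(["ssh", host, "systemctl", "restart",
                           "ceph-osd@%d" % osd])
    # crude pause; in practice I'd wait for the PG to re-peer before the next one
    time.sleep(60)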

If it doesn't...well, it's about to be Christmas and Luminous saw quite a bit 
of change in this space, so it's unlikely to get a lot of attention. :/

Yeah, this being Kraken I doubt it will get looked into deeply.

But the next step would be to gather high-level debug logs from the OSDs in 
question, especially as a peering action takes place.

I’ll be re-introducing the old primary this week so maybe I’ll bump the logging 
levels (to what?) on these OSDs and see what they come up with.
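
Something like the sketch below is what I had in mind, unless you’d suggest other
levels; debug_osd 20 and debug_ms 1 are my guess at what’s useful, and the pg id is
again a placeholder:

#!/usr/bin/env python
# Sketch: raise debug levels on every OSD in the PG's acting set.
# debug_osd 20 / debug_ms 1 are a guess at useful levels; pg id is a placeholder.
import json
import subprocess

PGID = "70.fa"  # placeholder
out = subprocess.check_output(["ceph", "pg", "map", PGID, "--format", "json"])
for osd in json.loads(out)["acting"]:
    subprocess.check_call(["ceph", "tell", "osd.%d" % osd, "injectargs",
                           "--debug_osd 20 --debug_ms 1"])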

Oh!
I didn't notice you previously mentioned "custom gateways using the 
libradosstriper". Are those backing onto this pool? What operations are they 
doing?
Something like repeated overwrites of EC data could definitely have symptoms 
similar to this (apart from the odd peering bit.)
-Greg

Think of these as using the cluster as an object store. Most of the time we’re
writing an object in, reading it out anywhere from zero to thousands of times
(running stat alongside each read), and eventually maybe deleting it. Once written,
there’s no reason for an object to be overwritten. They’re backing onto the EC pools
(one per “tenant”), but the particular pool this PG is a part of has barely seen any
use: the most used pool is storing petabytes, while this one was barely reaching
100TiB when this came up.
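
To make the access pattern concrete, it boils down to the sketch below. It’s
illustrative only, using the plain librados python bindings rather than our actual
libradosstriper C++ gateway code, and the pool/object names are placeholders:

#!/usr/bin/env python
# Illustration of the gateway access pattern, not our actual code (which uses
# libradosstriper from C++). Pool/object names are placeholders.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("tenant-ec-pool")  # placeholder pool name
try:
    ioctx.write_full("some-object", b"payload")    # written once, never overwritten
    size, mtime = ioctx.stat("some-object")        # stat accompanies every read
    data = ioctx.read("some-object", length=size)  # read zero to thousands of times
    # ...and eventually, maybe:
    ioctx.remove_object("some-object")
finally:
    ioctx.close()
    cluster.shutdown()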



This problem is quite weird, I think. I copied a LevelDB and dumped a key list; the
largest store by size had only 66% of the number of keys that the average LevelDB
has. The main difference from the LevelDBs that have been around for a while is that
the big ones hold a lot more files last touched on the days when the problem
started; most other LevelDBs have compacted those away and only keep files around 7
days old (as opposed to the 3-week-old ones the big ones hang on to). The big ones
do seem to run compactions, they just don’t seem to get rid of that stuff.
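
For reference, the comparison is done with something like the sketch below, which
just assumes one prefix/key pair per line in the dumped key list:

#!/usr/bin/env python
# Sketch: tally keys per prefix from a dumped LevelDB key list, to compare a
# bloated store against a normal one. Assumes one "prefix<separator>key" pair
# per line.
import sys
from collections import Counter

counts = Counter()
with open(sys.argv[1]) as f:
    for line in f:
        if line.strip():
            counts[line.split(None, 1)[0]] += 1

for prefix, n in counts.most_common():
    print("%10d  %s" % (n, prefix))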



Thanks,

George


On Fri, Dec 8, 2017 at 7:30 AM 
<george.vasilaka...@stfc.ac.uk> wrote:

________________________________
From: Gregory Farnum 
[gfar...@redhat.com]
Sent: 07 December 2017 21:57
To: Vasilakakos, George (STFC,RAL,SC)
Cc: 
drakonst...@gmail.com; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Sudden omap growth on some OSDs



On Thu, Dec 7, 2017 at 4:41 AM 
<george.vasilaka...@stfc.ac.uk> wrote:

________________________________
From: Gregory Farnum 
[gfar...@redhat.com]
Sent: 06 December 2017 22:50
To: David Turner
Cc: Vasilakakos, George (STFC,RAL,SC); 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Sudden omap growth on some OSDs

On Wed, Dec 6, 2017 at 2:35 PM David Turner 
<drakonst...@gmail.com> wrote:
I have no proof, nothing other than a hunch, but OSDs don't trim omaps unless all
PGs are healthy. If this PG is actually not healthy, but the cluster doesn't realize
it while the 11 involved OSDs do, you would see this exact problem: the OSDs think
the PG is unhealthy so they aren't trimming their omaps, the cluster doesn't seem to
be aware of it, and everything else is trimming its omaps properly.

I think you're confusing omaps and OSDMaps here. OSDMaps, like omap, are stored 
in leveldb, but they have different trimming rules.


I don't know what to do about it, but I hope it helps get you (or someone else 
on the ML) towards a resolution.

On Wed, Dec 6, 2017 at 1:59 PM 
<george.vasilaka...@stfc.ac.uk> wrote:
Hi ceph-users,

We have a Ceph cluster (running Kraken) that is exhibiting some odd behaviour.
A couple of weeks ago, the LevelDBs on some of our OSDs started growing large (now
at around 20G in size).

The one thing they have in common is that the 11 disks with inflating LevelDBs are
all in the set for one PG in one of our pools (EC 8+3). This pool started to see use
around the time the LevelDBs started inflating. Compactions are running, and they do
bring the size down a bit, but the overall trend is one of rapid growth. The other
2000+ OSDs in the cluster have LevelDBs between 650M and 1.2G.
This PG has nothing to distinguish it from the others in its pool: it is within 5%
of the average number of objects per PG, with no hot-spotting in terms of load and
no weird states reported by ceph status.

The one odd thing about it is that the pg query output says it is active+clean, but
it has a recovery state, which it re-enters every morning between 9 and 10am, where
it mentions a "might_have_unfound" situation and having probed all other set
members. A deep scrub of the PG didn't turn up anything.

You need to be more specific here. What do you mean it "enters into" the 
recovery state every morning?

Here's what PG query showed me yesterday:
    "recovery_state": [
        {
            "name": "Started\/Primary\/Active",
            "enter_time": "2017-12-05 09:48:57.730385",
            "might_have_unfound": [
                {
                    "osd": "79(1)",
                    "status": "already probed"
                },
                {
                    "osd": "337(9)",
                    "status": "already probed"
                },... it goes on to list all peers of this OSD in that PG.

IIRC that's just a normal thing when there's any kind of recovery happening — 
it builds up a set during peering of OSDs that might have data, in case it 
discovers stuff missing.

OK. But this is the only PG mentioning "might_have_unfound" across the two most used
pools in our cluster, and it's the only one whose OSDs all have omap dirs more than
15 times the average size for the cluster.
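
(The omap sizes come from something like the sketch below, run on each OSD host; it
assumes filestore OSDs with the default /var/lib/ceph/osd/ceph-*/current/omap
layout.)

#!/usr/bin/env python
# Sketch: size each OSD's omap LevelDB on this host. Assumes filestore OSDs
# with the default /var/lib/ceph/osd/ceph-*/current/omap layout.
import glob
import os

for omap in sorted(glob.glob("/var/lib/ceph/osd/ceph-*/current/omap")):
    total = 0
    for root, _, files in os.walk(omap):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    print("%-50s %6.1f GiB" % (omap, total / float(1 << 30)))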



How many PGs are in your 8+3 pool, and are all your OSDs hosting EC pools? What 
are you using the cluster for?

2048 PGs in this pool, also another 2048 PG EC pool (same profile) and two more 
1024 PG EC pools (also same profile). Then a set of RGW auxiliary pools with 
3-way replication.
I'm not 100% sure but I think all of our OSDs should have a few PGs from one of 
the EC pools. Our rules don't make a distinction so it's probabilistic. We're 
using the cluster as an object store, with minor RGW use and custom gateways using
libradosstriper.
It's also worth pointing out that an OSD in that PG was taken out of the 
cluster earlier today and pg query shows the following weirdness:
The primary thinks it's active+clean but, in the peer_info section, all peers report
it as "active+undersized+degraded+remapped+backfilling". It has shown this
discrepancy before, with the primary thinking it's a+c and the rest of the set
seeing it as a+c+degraded.

Again, exactly what output makes you say the primary thinks it's active+clean 
but the others have more complex recovery states?

I've pastebin'ed a query output: https://pastebin.com/PbGSaZyF. The PG state is 
reported "active+clean" in the top section but the peer_info sections have it 
"active+undersized+degraded". I may be reading the query output wrong but this 
seems odd.
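
In case I'm misreading it, this is how I'm pulling the two views out of the query
output; a sketch that assumes the json layout we see on Kraken (a top-level "state"
plus a "state" under each peer_info entry's "stats"):

#!/usr/bin/env python
# Sketch: compare the primary's view of the PG state with each peer's.
# Assumes the Kraken pg query json layout (top-level "state", plus
# "stats" -> "state" in each peer_info entry). PG id comes from argv.
import json
import subprocess
import sys

pgid = sys.argv[1]
q = json.loads(subprocess.check_output(["ceph", "pg", pgid, "query",
                                        "--format", "json"]))
print("primary: %s" % q["state"])
for peer in q.get("peer_info", []):
    print("peer %-8s %s" % (peer.get("peer"), peer["stats"]["state"]))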

In the query output we're showing the following for recovery state:
"recovery_state": [
        {
            "name": "Started\/Primary\/Active",
            "enter_time": "2017-12-07 08:41:57.850220",
            "might_have_unfound": [],
            "recovery_progress": {
                "backfill_targets": [],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "recovery_ops": [],
                    "read_ops": []
                }


The cluster is now starting to manifest slow requests on the OSDs with the 
large LevelDBs, although not in the particular PG.

Well, there have been a few causes of large LevelDBs, but given that you have 
degraded PGs and a bunch of EC pools, my guess is that the PG logs are getting 
extended thanks to the PG states. EC PG logs can be much larger than replicated 
ones, since EC pools need to be able to reverse the IO in those cases. So you 
need to get your PGs clean first and then see if the LevelDB shrinks down or 
not.

To be expected, I guess. However, the average LevelDB in this cluster is around
200MB, while all OSDs in this PG are in the multi-gigabyte range, with a couple
pushing the 20GB mark. The numbers on this PG do not otherwise stand out, so I'm
wondering what makes it so special. No PGs on these OSDs are reporting any state
other than active+clean.
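
For what it's worth, this is how I'm checking whether its pg log stands out from its
neighbours; a sketch that assumes the "pg_stats" layout that ceph pg dump
--format json gives us on Kraken:

#!/usr/bin/env python
# Sketch: print log_size / ondisk_log_size for every PG that maps to a given
# OSD, to see whether the suspect PG's log stands out. Assumes the Kraken
# "ceph pg dump" json layout.
import json
import subprocess
import sys

osd = int(sys.argv[1])
dump = json.loads(subprocess.check_output(["ceph", "pg", "dump",
                                           "--format", "json"]))
for pg in dump["pg_stats"]:
    if osd in pg.get("acting", []):
        print("%-10s log_size=%-8s ondisk_log_size=%s"
              % (pg["pgid"], pg["log_size"], pg["ondisk_log_size"]))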

Thanks,

George
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
