[ceph-users] Incomplete PGs, how do I get them back without data loss?

2016-05-05 Thread george.vasilakakos
Hi folks,

I've got a serious issue with a Ceph cluster that's used for RBD.

There are 4 PGs stuck in an incomplete state and I'm trying to repair this 
problem to no avail.

Here's ceph status:
health HEALTH_WARN
4 pgs incomplete
4 pgs stuck inactive
4 pgs stuck unclean
100 requests are blocked > 32 sec
 monmap e13: 3 mons at ...
election epoch 2084, quorum 0,1,2 mon4,mon5,mon3
 osdmap e154083: 203 osds: 197 up, 197 in
  pgmap v37369382: 9856 pgs, 5 pools, 20932 GB data, 22321 kobjects
64871 GB used, 653 TB / 716 TB avail
9851 active+clean
   4 incomplete
   1 active+clean+scrubbing

The 4 PGs all have the same primary OSD, which is on a host that had its OSDs 
turned off as it was quite flaky.

1.1bdb   incomplete   [52,100,130]   52   [52,100,130]   52
1.5c2    incomplete   [52,191,109]   52   [52,191,109]   52
1.f98    incomplete   [52,92,37]     52   [52,92,37]     52
1.11dc   incomplete   [52,176,12]    52   [52,176,12]    52

One thing that strikes me as odd is that once osd.52 is taken out, these sets 
change completely.
The situation currently is that, for each of these PGs, the three OSDs have 
different amounts of data.
They all have similar but different amounts, with osd.52 having the smallest 
amount (not by too much though) in each case.

Querying those PGs doesn't return a response after a few minutes, manually 
triggering scrubs or repairs on them does nothing.
I've lowered the min_size from 2 to 1 but I'm not seeing any activity to fix 
this.
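
For reference, the sort of commands in play here (the pool name is a placeholder 
and 1.1bdb is one of the stuck PGs):

ceph osd pool set rbd min_size 1   # lowering min_size as described above
ceph pg map 1.1bdb                 # shows the up/acting sets
ceph pg 1.1bdb query               # hangs for these incomplete PGs
ceph pg repair 1.1bdb              # no visible effect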

Is there something that can be done to recover without losing that data (losing 
it would mean each VM has a 75% chance of being destroyed)?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Incomplete PGs, how do I get them back without data loss?

2016-05-11 Thread george.vasilakakos
Hey Dan,

This is on Hammer 0.94.5. osd.52 was always on a problematic machine and when 
this happened had less data on its local disk than the other OSDs. I've tried 
adapting that blog post's solution to this situation to no avail.

I've tried things like looking at all the probing OSDs in the query output and 
importing the data from one copy to all of them to get it to be consistent. One 
of the major red flags here was that, when I looked at the original acting set's 
disks, I found each OSD had a different amount of data for the same PG. There is 
at least one PG here where 52 (the primary for all four) actually had about 1GB 
(~27%) less data; everything has just been really inconsistent.

Here's hoping Cunningham will come to the rescue.

Cheers,

George 


From: Dan van der Ster [d...@vanderster.com]
Sent: 11 May 2016 17:28
To: Vasilakakos, George (STFC,RAL,SC)
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Incomplete PGs, how do I get them back without data 
loss?

Hi George,

Which version of Ceph is this?
I've never had incomplete PGs stuck like this before. AFAIK it means
that osd.52 would need to be brought up before you can restore those
PGs.

Perhaps you'll need ceph-objectstore-tool to help dump osd.52 and
bring up its data elsewhere. A quick check on this list pointed to
https://ceph.com/community/incomplete-pgs-oh-my/ -- did you try that?
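
Roughly (untested on my end, paths and IDs as examples only, and with the OSD 
daemons involved stopped first), the export/import from that post looks like:

ceph-objectstore-tool --op export --pgid 1.1bdb \
    --data-path /var/lib/ceph/osd/ceph-52 \
    --journal-path /var/lib/ceph/osd/ceph-52/journal \
    --file /root/pg1.1bdb.export
# then copy the file to another OSD host and, with that OSD stopped:
ceph-objectstore-tool --op import \
    --data-path /var/lib/ceph/osd/ceph-100 \
    --journal-path /var/lib/ceph/osd/ceph-100/journal \
    --file /root/pg1.1bdb.export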

Or perhaps I'm spewing enough nonsense here that Cunningham's Law will
bring you the solution.

Cheers, Dan



On Thu, May 5, 2016 at 8:21 PM,   wrote:
> Hi folks,
>
> I've got a serious issue with a Ceph cluster that's used for RBD.
>
> There are 4 PGs stuck in an incomplete state and I'm trying to repair this 
> problem to no avail.
>
> Here's ceph status:
> health HEALTH_WARN
> 4 pgs incomplete
> 4 pgs stuck inactive
> 4 pgs stuck unclean
> 100 requests are blocked > 32 sec
>  monmap e13: 3 mons at ...
> election epoch 2084, quorum 0,1,2 mon4,mon5,mon3
>  osdmap e154083: 203 osds: 197 up, 197 in
>   pgmap v37369382: 9856 pgs, 5 pools, 20932 GB data, 22321 kobjects
> 64871 GB used, 653 TB / 716 TB avail
> 9851 active+clean
>4 incomplete
>1 active+clean+scrubbing
>
> The 4 PGs all have the same primary OSD, which is on a host that had its OSDs 
> turned off as it was quite flaky.
>
> 1.1bdb   incomplete   [52,100,130]   52   [52,100,130]   52
> 1.5c2    incomplete   [52,191,109]   52   [52,191,109]   52
> 1.f98    incomplete   [52,92,37]     52   [52,92,37]     52
> 1.11dc   incomplete   [52,176,12]    52   [52,176,12]    52
>
> One thing that strikes me as odd is that once osd.52 is taken out, these sets 
> change completely.
> The situation currently is that, for each of these PGs, the three OSDs have 
> different amounts of data.
> They all have similar but different amounts, with osd.52 having the smallest 
> amount (not by too much though) in each case.
>
> Querying those PGs doesn't return a response after a few minutes, manually 
> triggering scrubs or repairs on them does nothing.
> I've lowered the min_size from 2 to 1 but I'm not seeing any activity to fix 
> this.
>
> Is there something that can be done to recover without losing that data (it 
> means each VM has a 75% chance of being destroyed)?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Incomplete PGs, how do I get them back without data loss?

2016-05-12 Thread george.vasilakakos
What exactly do you mean by log? As in a journal of the actions taken, or the 
logging done by a daemon?
I'm making the same guess but I'm not sure what else I can try at this point. 
The PG I've been working on actively reports it needs to probe 4 OSDs (the new 
set and the old primary) which are all up and have the same amount of data, 
last changed at the same time. The PG is still incomplete. I'll be increasing 
the logging levels to max today to see what pops up.
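
Concretely, that would be something along these lines (OSD number as an example, 
to be reverted afterwards):

ceph tell osd.52 injectargs '--debug_osd 20/20 --debug_ms 1/1 --debug_filestore 10/10'
# or, via the admin socket on the OSD's host:
ceph daemon osd.52 config set debug_osd 20/20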

From: Dan van der Ster [d...@vanderster.com]
Sent: 12 May 2016 09:26
To: Vasilakakos, George (STFC,RAL,SC)
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Incomplete PGs, how do I get them back without data 
loss?

On Wed, May 11, 2016 at 6:53 PM,   wrote:
> Hey Dan,
>
> This is on Hammer 0.94.5. osd.52 was always on a problematic machine and when 
> this happened had less data on its local disk than the other OSDs. I've tried 
> adapting that blog post's solution to this situation to no avail.


Do you have a log of what you did and why it didn't work? I guess the
solution to your issue lies in a version of that procedure.

-- dan




>
> I've tried things like looking at all probing OSDs in the query output and 
> importing the data from one copy to all of them to get it to be consistent. 
> One of the major red flags here was that when I looked at the original acting 
> set's disks I found each OSD had a different amount of data for the same PG, 
> there is at least one PG here where 52 (the primary for all four) actually 
> had about 1GB (~27%) less data, everything has just been really inconsistent.
>
> Here's hoping Cunningham will come to the rescue.
>
> Cheers,
>
> George
>
> 
> From: Dan van der Ster [d...@vanderster.com]
> Sent: 11 May 2016 17:28
> To: Vasilakakos, George (STFC,RAL,SC)
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Incomplete PGs, how do I get them back without data 
> loss?
>
> Hi George,
>
> Which version of Ceph is this?
> I've never had incomplete PGs stuck like this before. AFAIK it means
> that osd.52 would need to be brought up before you can restore those
> PGs.
>
> Perhaps you'll need ceph-objectstore-tool to help dump osd.52 and
> bring up its data elsewhere. A quick check on this list pointed to
> https://ceph.com/community/incomplete-pgs-oh-my/ -- did you try that?
>
> Or perhaps I'm spewing enough nonsense here that Cunningham's Law will
> bring you the solution.
>
> Cheers, Dan
>
>
>
> On Thu, May 5, 2016 at 8:21 PM,   wrote:
>> Hi folks,
>>
>> I've got a serious issue with a Ceph cluster that's used for RBD.
>>
>> There are 4 PGs stuck in an incomplete state and I'm trying to repair this 
>> problem to no avail.
>>
>> Here's ceph status:
>> health HEALTH_WARN
>> 4 pgs incomplete
>> 4 pgs stuck inactive
>> 4 pgs stuck unclean
>> 100 requests are blocked > 32 sec
>>  monmap e13: 3 mons at ...
>> election epoch 2084, quorum 0,1,2 mon4,mon5,mon3
>>  osdmap e154083: 203 osds: 197 up, 197 in
>>   pgmap v37369382: 9856 pgs, 5 pools, 20932 GB data, 22321 kobjects
>> 64871 GB used, 653 TB / 716 TB avail
>> 9851 active+clean
>>4 incomplete
>>1 active+clean+scrubbing
>>
>> The 4 PGs all have the same primary OSD, which is on a host that had its 
>> OSDs turned off as it was quite flaky.
>>
>> 1.1bdb   incomplete   [52,100,130]   52   [52,100,130]   52
>> 1.5c2    incomplete   [52,191,109]   52   [52,191,109]   52
>> 1.f98    incomplete   [52,92,37]     52   [52,92,37]     52
>> 1.11dc   incomplete   [52,176,12]    52   [52,176,12]    52
>>
>> One thing that strikes me as odd is that once osd.52 is taken out, these 
>> sets change completely.
>> The situation currently is that, for each of these PGs, the three OSDs have 
>> different amounts of data.
>> They all have similar but different amounts, with osd.52 having the smallest 
>> amount (not by too much though) in each case.
>>
>> Querying those PGs doesn't return a response after a few minutes, manually 
>> triggering scrubs or repairs on them does nothing.
>> I've lowered the min_size from 2 to 1 but I'm not seeing any activity to fix 
>> this.
>>
>> Is there something that can be done to recover without losing that data (it 
>> means each VM has a 75% chance of being destroyed)?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph and IPv4 -> IPv6

2017-06-27 Thread george.vasilakakos
Hey Ceph folks,

I was wondering what the current status/roadmap/intentions etc. are on the 
possibility of providing a way of transitioning a cluster from IPv4 to IPv6 in 
the future.

My current understanding is that this is not possible at the moment and that one 
should deploy initially with the IP version they want long term.

However, given the general lack of widespread readiness, I think lots of us 
have deployed with IPv4 and were hoping to go to IPv6 when the rest of our 
environments enabled it.

Is adding such a capability to a future version of Ceph being considered?


Best regards,

George V.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and IPv4 -> IPv6

2017-06-28 Thread george.vasilakakos
I don't think you can do that; it would require running a mixed cluster which, 
going by the docs, doesn't seem to be supported.

From: Jake Young [jak3...@gmail.com]
Sent: 27 June 2017 22:42
To: Wido den Hollander; ceph-users@lists.ceph.com; Vasilakakos, George 
(STFC,RAL,SC)
Subject: Re: [ceph-users] Ceph and IPv4 -> IPv6

On Tue, Jun 27, 2017 at 2:19 PM Wido den Hollander 
mailto:w...@42on.com>> wrote:

> Op 27 juni 2017 om 19:00 schreef 
> george.vasilaka...@stfc.ac.uk:
>
>
> Hey Ceph folks,
>
> I was wondering what the current status/roadmap/intentions etc. are on the 
> possibility of providing a way of transitioning a cluster from IPv4 to IPv6 
> in the future.
>
> My current understanding is that this not possible at the moment and that one 
> should deploy initially with the version they want long term.
>
> However, given the general lack of widespread readiness, I think lots of us 
> have deployed with IPv4 and were hoping to go to IPv6 when the rest of our 
> environments enabled it.
>
> Is adding such a capability to a future version of Ceph being considered?
>

I think you can, but not without downtime.

The main problem is the monmap which contains IPv4 addresses and you want to 
change that to IPv6.

I haven't tried this, but I think you should be able to:
- Extract MONMap
- Update the IPv4 addresses to IPv6 using monmaptool
- Set noout flag
- Stop all OSDs
- Inject new monmap
- Stop MONs
- Make sure IPv6 is fixed on MONs
- Start MONs
- Start OSDs

Again, this is from the top of my head, haven't tried it, but something like 
that should probably work.
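
A very rough command-line sketch of those steps (untested; the mon name and 
address below are placeholders):

ceph osd set noout
ceph mon getmap -o /tmp/monmap
monmaptool --print /tmp/monmap                        # current IPv4 addresses
monmaptool --rm mon1 /tmp/monmap
monmaptool --add mon1 [2001:db8::1]:6789 /tmp/monmap  # repeat per monitor
# stop the OSDs, stop the mons, then on each monitor host:
ceph-mon -i mon1 --inject-monmap /tmp/monmap
# make sure IPv6 connectivity/bind addresses are in place, start mons, start OSDs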

Wido


>
> Best regards,
>
> George V.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

I think you could configure all of your mons, osds and clients as dual-stack 
(both IPv4 and IPv6) in advance.

Once you have confirmed IPv6 connectivity everywhere, add a new mon using its 
IPv6 address.

You would then replace each mon one by one with IPv6 addressed mons.

You can then start to deconfigure the IPv4 interfaces.

Just a thought

Jake

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and IPv4 -> IPv6

2017-06-28 Thread george.vasilakakos
Hey Wido,

Thanks for your suggestion. It sounds like the process might be feasible, but I'd 
be looking for an "official" procedure to apply to a production cluster: 
something that's documented on ceph.com/docs, tested and "endorsed", if you will, 
by the Ceph team.
We could try this on a pre-prod environment but it sounds rather "hacky", as in, 
there are a number of manual steps involved, including a couple where you're 
basically manipulating internal, persistent state information.

I think most people would not be too inclined to do this to their production 
setups.

Best regards,

George

From: Wido den Hollander [w...@42on.com]
Sent: 27 June 2017 19:19
To: Vasilakakos, George (STFC,RAL,SC); ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph and IPv4 -> IPv6

> Op 27 juni 2017 om 19:00 schreef george.vasilaka...@stfc.ac.uk:
>
>
> Hey Ceph folks,
>
> I was wondering what the current status/roadmap/intentions etc. are on the 
> possibility of providing a way of transitioning a cluster from IPv4 to IPv6 
> in the future.
>
> My current understanding is that this not possible at the moment and that one 
> should deploy initially with the version they want long term.
>
> However, given the general lack of widespread readiness, I think lots of us 
> have deployed with IPv4 and were hoping to go to IPv6 when the rest of our 
> environments enabled it.
>
> Is adding such a capability to a future version of Ceph being considered?
>

I think you can, but not without downtime.

The main problem is the monmap which contains IPv4 addresses and you want to 
change that to IPv6.

I haven't tried this, but I think you should be able to:
- Extract MONMap
- Update the IPv4 addresses to IPv6 using monmaptool
- Set noout flag
- Stop all OSDs
- Inject new monmap
- Stop MONs
- Make sure IPv6 is fixed on MONs
- Start MONs
- Start OSDs

Again, this is from the top of my head, haven't tried it, but something like 
that should probably work.

Wido


>
> Best regards,
>
> George V.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and IPv4 -> IPv6

2017-06-28 Thread george.vasilakakos
> I don't think either. I don't think there is another way then just 'hacky' 
> changing the MONMaps. There have been talks of being able to make Ceph 
> dual-stack, but I don't think there is any code in the source right now.

Yeah, that's what I'd like to know. What do the Ceph team think of providing 
ways for switching?
We're not in any need to do so now, but it'd be good to know if the team plans to 
support dual-stack, or at least to test and document a way to do it, as opposed 
to a "this should work but you're on your own" kind of deal.


> 
> From: Wido den Hollander [w...@42on.com]
> Sent: 27 June 2017 19:19
> To: Vasilakakos, George (STFC,RAL,SC); ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph and IPv4 -> IPv6
>
> > Op 27 juni 2017 om 19:00 schreef george.vasilaka...@stfc.ac.uk:
> >
> >
> > Hey Ceph folks,
> >
> > I was wondering what the current status/roadmap/intentions etc. are on the 
> > possibility of providing a way of transitioning a cluster from IPv4 to IPv6 
> > in the future.
> >
> > My current understanding is that this not possible at the moment and that 
> > one should deploy initially with the version they want long term.
> >
> > However, given the general lack of widespread readiness, I think lots of us 
> > have deployed with IPv4 and were hoping to go to IPv6 when the rest of our 
> > environments enabled it.
> >
> > Is adding such a capability to a future version of Ceph being considered?
> >
>
> I think you can, but not without downtime.
>
> The main problem is the monmap which contains IPv4 addresses and you want to 
> change that to IPv6.
>
> I haven't tried this, but I think you should be able to:
> - Extract MONMap
> - Update the IPv4 addresses to IPv6 using monmaptool
> - Set noout flag
> - Stop all OSDs
> - Inject new monmap
> - Stop MONs
> - Make sure IPv6 is fixed on MONs
> - Start MONs
> - Start OSDs
>
> Again, this is from the top of my head, haven't tried it, but something like 
> that should probably work.
>
> Wido
>
>
> >
> > Best regards,
> >
> > George V.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and IPv4 -> IPv6

2017-07-03 Thread george.vasilakakos
Good to know. Frankly, the RGW isn't my major concern at the moment; it seems 
to be able to handle things well enough.

It’s the RBD/CephFS side of things for one cluster where we will eventually 
need to support IPv6 clients but will not necessarily be able to switch 
everyone to IPv6 in one go.

On another cluster I’m concerned about the RADOS side of things where we have 
some custom “gateways” that use librados to talk to Ceph and expose the storage 
to their clients via other protocols.
Hopefully the dual-stack support on those is mature enough to handle talking to 
Ceph over IPv4 and clients over IPv6 so we can de-couple the transitions.



On 02/07/2017, 18:46, "Simon Leinen"  wrote:

>> I have it running the other way around. The RGW has IPv4 and IPv6, but
>> the Ceph cluster is IPv6-only.
>
>> RGW/librados talks to Ceph over IPv6 and handles client traffic on
>> both protocols.
>
>> No problem to run the RGW dual-stacked.
>
>Just for the record, we've been doing exactly the same for several
>years, on multiple clusters.  So you're not alone!
>-- 
>Simon.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Speeding up backfill after increasing PGs and or adding OSDs

2017-07-06 Thread george.vasilakakos
Hey folks,

We have a cluster that's currently backfilling from increasing PG counts. We 
have tuned recovery and backfill way down as a "precaution" and would now like 
to tune them back up to a good balance between recovery and client I/O.

At the moment we're in the process of bumping up PG numbers for pools serving 
production workloads. Said pools are EC 8+3.

It looks like we're having very low numbers of PGs backfilling as in:

2567 TB used, 5062 TB / 7630 TB avail
145588/849529410 objects degraded (0.017%)
5177689/849529410 objects misplaced (0.609%)
7309 active+clean
  23 active+clean+scrubbing
  18 active+clean+scrubbing+deep
  13 active+remapped+backfill_wait
   5 active+undersized+degraded+remapped+backfilling
   4 active+undersized+degraded+remapped+backfill_wait
   3 active+remapped+backfilling
   1 active+clean+inconsistent
recovery io 1966 MB/s, 96 objects/s
  client io 726 MB/s rd, 147 MB/s wr, 89 op/s rd, 71 op/s wr

Also, the rate of recovery in terms of data and object throughput varies a lot, 
even with the number of PGs backfilling remaining constant.

Here's the config in the OSDs:

"osd_max_backfills": "1",
"osd_min_recovery_priority": "0",
"osd_backfill_full_ratio": "0.85",
"osd_backfill_retry_interval": "10",
"osd_allow_recovery_below_min_size": "true",
"osd_recovery_threads": "1",
"osd_backfill_scan_min": "16",
"osd_backfill_scan_max": "64",
"osd_recovery_thread_timeout": "30",
"osd_recovery_thread_suicide_timeout": "300",
"osd_recovery_sleep": "0",
"osd_recovery_delay_start": "0",
"osd_recovery_max_active": "5",
"osd_recovery_max_single_start": "1",
"osd_recovery_max_chunk": "8388608",
"osd_recovery_max_omap_entries_per_chunk": "64000",
"osd_recovery_forget_lost_objects": "false",
"osd_scrub_during_recovery": "false",
"osd_kill_backfill_at": "0",
"osd_debug_skip_full_check_in_backfill_reservation": "false",
"osd_debug_reject_backfill_probability": "0",
"osd_recovery_op_priority": "5",
"osd_recovery_priority": "5",
"osd_recovery_cost": "20971520",
"osd_recovery_op_warn_multiple": "16",

What I'm looking for, first of all, is a better understanding of the mechanism 
that schedules the backfilling/recovery work; the end goal is to understand how 
to tune this safely and get as close as possible to an optimal balance between 
the rates at which recovery and client work are performed.

I'm thinking things like osd_max_backfills, 
osd_backfill_scan_min/osd_backfill_scan_max might be prime candidates for 
tuning.
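
In concrete terms I guess that means runtime changes like the following (values 
are just examples, not recommendations):

ceph tell osd.* injectargs '--osd_max_backfills 2'
ceph tell osd.* injectargs '--osd_backfill_scan_min 32 --osd_backfill_scan_max 128'
ceph daemon osd.0 config show | grep -E 'backfill|recovery'   # verify on one OSD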

Any thoughts/insights from the Ceph community will be greatly appreciated,

George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Speeding up backfill after increasing PGs and or adding OSDs

2017-07-06 Thread george.vasilakakos
Thanks for your response David.

What you've described has been what I've been thinking about too. We have 1401 
OSDs in the cluster currently and this output is from the tail end of the 
backfill for a +64 PG increase on the biggest pool.

The problem is that we see this cluster do at most 20 backfills at the same 
time, and as the queue of PGs to backfill gets smaller there are fewer and fewer 
actively backfilling, which I don't quite understand.

Out of the PGs currently backfilling, all of them have completely changed their 
sets (the difference between acting and up sets is 11), which makes some sense 
since what moves around are the newly spawned PGs. That's 5 PGs currently in 
backfilling states, which ties up 110 OSDs (5 PGs x 22 OSDs across their old and 
new sets). What happened to the other 1300? That's what's strange to me. There 
are another 7 waiting to backfill.
Out of all the OSDs in the up and acting sets of all PGs currently backfilling 
or waiting to backfill there are 13 OSDs in common so I guess that kind of 
answers it. I haven't checked to see but I suspect each backfilling PG has at 
least one OSD in one of its sets in common with either set of one of the 
waiting PGs.

So I guess we can't do much about the tail end taking so long: there's no way 
for more of the PGs to actually be backfilling at the same time.

I think we'll have to try bumping osd_max_backfills. Has anyone tried bumping 
the relative priorities of recovery vs others? What about noscrub?
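
For anyone following along, the knobs I have in mind are along these lines 
(illustrative values only):

ceph osd set noscrub
ceph osd set nodeep-scrub
ceph tell osd.* injectargs '--osd_max_backfills 3 --osd_recovery_op_priority 10'
# and to revert once backfill has caught up:
ceph osd unset noscrub
ceph osd unset nodeep-scrub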

Best regards,

George


From: David Turner [drakonst...@gmail.com]
Sent: 06 July 2017 16:08
To: Vasilakakos, George (STFC,RAL,SC); ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Speeding up backfill after increasing PGs and or 
adding OSDs

Just a quick place to start is osd_max_backfills.  You have this set to 1.  
Each PG is on 11 OSDs.  When you have a PG moving, it is on the original 11 
OSDs and the new X number of OSDs that it is going to.  For each of your PGs 
that is moving, an OSD can only move 1 at a time (your osd_max_backfills), and 
each PG is on 11 + X OSDs.

So with your cluster.  I don't see how many OSDs you have, but you have 25 PGs 
moving around and 8 of them are actively backfilling.  Assuming you were only 
changing 1 OSD per backfill operation, that would mean that you had at least 96 
OSDs ((11+1) * 8).  That would be a perfect distribution of OSDs for the PGs 
backfilling.  Let's say now that you're averaging closer to 3 OSDs changing per 
PG and that the remaining 17 PGs waiting to backfill are blocked by a few OSDs 
each (because those OSDs are already included in the 8 actively backfilling 
PGs).  That would indicate that you have closer to 200+ OSDs.

Every time I'm backfilling and want to speed things up, I watch iostat on some 
of my OSDs and increase osd_max_backfills until I'm consistently using about 
70% of the disk to allow for customer overhead.  You can always figure out 
what's best for your use case though.  Generally I've been ok running with 
osd_max_backfills=5 without much problem and bringing that up some when I know 
that client IO will be minimal, but again it depends on your use case and 
cluster.
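
In practice that is just something like the following (interval and values are 
examples only):

iostat -x 5                                         # watch %util on the OSD data disks
ceph tell osd.* injectargs '--osd_max_backfills 2'  # bump by one, watch, repeat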

On Thu, Jul 6, 2017 at 10:08 AM 
mailto:george.vasilaka...@stfc.ac.uk>> wrote:
Hey folks,

We have a cluster that's currently backfilling from increasing PG counts. We 
have tuned recovery and backfill way down as a "precaution" and would like to 
start tuning it to bring up to a good balance between that and client I/O.

At the moment we're in the process of bumping up PG numbers for pools serving 
production workloads. Said pools are EC 8+3.

It looks like we're having very low numbers of PGs backfilling as in:

2567 TB used, 5062 TB / 7630 TB avail
145588/849529410 objects degraded (0.017%)
5177689/849529410 objects misplaced (0.609%)
7309 active+clean
  23 active+clean+scrubbing
  18 active+clean+scrubbing+deep
  13 active+remapped+backfill_wait
   5 active+undersized+degraded+remapped+backfilling
   4 active+undersized+degraded+remapped+backfill_wait
   3 active+remapped+backfilling
   1 active+clean+inconsistent
recovery io 1966 MB/s, 96 objects/s
  client io 726 MB/s rd, 147 MB/s wr, 89 op/s rd, 71 op/s wr

Also, the rate of recovery in terms of data and object throughput varies a lot, 
even with the number of PGs backfilling remaining constant.

Here's the config in the OSDs:

"osd_max_backfills": "1",
"osd_min_recovery_priority": "0",
"osd_backfill_full_ratio": "0.85",
"osd_backfill_retry_interval": "10",
"osd_allow_recovery_below_min_size": "true",
"osd_recovery_threads": "1",
"osd_backfill_scan_min": "16",
"osd_backfill_scan_max": "64",
"osd_recovery_thread_timeout": "30",
"osd_recovery_thread_suicide_timeout": "300",
"osd_recovery_sleep": "0",
"osd_recovery_delay_start": "0",
"osd_recover

Re: [ceph-users] Speeding up backfill after increasing PGs and or adding OSDs

2017-07-07 Thread george.vasilakakos
@Christian: I think the slow tail end is just the fact that there is contention 
for the same OSDs.
@David: Yes that's what I did, used shell/awk/python to grab and compare the 
set of OSDs locked for backfilling versus the ones waiting.
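
The comparison itself was nothing fancy; an untested sketch of the idea (the pg 
dump output format varies between releases):

ceph pg dump pgs 2>/dev/null | grep 'backfilling' | grep -o '\[[0-9,]*\]' \
    | tr -d '[]' | tr ',' '\n' | sort -un > busy.osds
ceph pg dump pgs 2>/dev/null | grep 'backfill_wait' | grep -o '\[[0-9,]*\]' \
    | tr -d '[]' | tr ',' '\n' | sort -un > waiting.osds
comm -12 busy.osds waiting.osds    # OSDs common to both groups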

From: Christian Balzer [ch...@gol.com]
Sent: 07 July 2017 01:46
To: ceph-users@lists.ceph.com
Cc: Vasilakakos, George (STFC,RAL,SC)
Subject: Re: [ceph-users] Speeding up backfill after increasing PGs and or 
adding OSDs

Hello,

On Thu, 6 Jul 2017 17:57:06 + george.vasilaka...@stfc.ac.uk wrote:

> Thanks for your response David.
>
> What you've described has been what I've been thinking about too. We have 
> 1401 OSDs in the cluster currently and this output is from the tail end of 
> the backfill for +64 PG increase on the biggest pool.
>
> The problem is we see this cluster do at most 20 backfills at the same time 
> and as the queue of PGs to backfill gets smaller there are fewer and fewer 
> actively backfilling which I don't quite understand.
>

Welcome to the club.
You're not the first one to wonder about this and while David's comment
about max_backfill is valid, it simply doesn't explain all of this.

See this and my thoughts about things, unfortunately no developer ever
followed up on it:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009704.html

Christian

> Out of the PGs currently backfilling, all of them have completely changed 
> their sets (difference between acting and up sets is 11), which makes some 
> sense since what moves around are the newly spawned PGs. That's 5 PGs 
> currently in backfilling states which makes 110 OSDs blocked. What happened 
> to the other 1300? That's what's strange to me. There are another 7 waiting 
> to backfill.
> Out of all the OSDs in the up and acting sets of all PGs currently 
> backfilling or waiting to backfill there are 13 OSDs in common so I guess 
> that kind of answers it. I haven't checked to see but I suspect each 
> backfilling PG has at least one OSD in one of its sets in common with either 
> set of one of the waiting PGs.
>
> So I guess we can't do much about the tail end taking so long: there's no way 
> for more of the PGs to actually be backfilling at the same time.
>
> I think we'll have to try bumping osd_max_backfills. Has anyone tried bumping 
> the relative priorities of recovery vs others? What about noscrub?
>
> Best regards,
>
> George
>
> 
> From: David Turner [drakonst...@gmail.com]
> Sent: 06 July 2017 16:08
> To: Vasilakakos, George (STFC,RAL,SC); ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Speeding up backfill after increasing PGs and or 
> adding OSDs
>
> Just a quick place to start is osd_max_backfills.  You have this set to 1.  
> Each PG is on 11 OSDs.  When you have a PG moving, it is on the original 11 
> OSDs and the new X number of OSDs that it is going to.  For each of your PGs 
> that is moving, an OSD can only move 1 at a time (your osd_max_backfills), 
> and each PG is on 11 + X OSDs.
>
> So with your cluster.  I don't see how many OSDs you have, but you have 25 
> PGs moving around and 8 of them are actively backfilling.  Assuming you were 
> only changing 1 OSD per backfill operation, that would mean that you had at 
> least 96 OSDs (11+1 * 8).  That would be a perfect distribution of OSDs for 
> the PGs backfilling.  Let's say now that you're averaging closer to 3 OSDs 
> changing per PG and that the remaining 17 PGs waiting to backfill are blocked 
> by a few OSDs each (because those OSDs are already included in the 8 active 
> backfilling PGs.  That would indicate that you have closer to 200+ OSDs.
>
> Every time I'm backfilling and want to speed things up, I watch iostat on 
> some of my OSDs and increase osd_max_backfills until I'm consistently using 
> about 70% of the disk to allow for customer overhead.  You can always figure 
> out what's best for your use case though.  Generally I've been ok running 
> with osd_max_backfills=5 without much problem and bringing that up some when 
> I know that client IO will be minimal, but again it depends on your use case 
> and cluster.
>
> On Thu, Jul 6, 2017 at 10:08 AM 
> mailto:george.vasilaka...@stfc.ac.uk>> wrote:
> Hey folks,
>
> We have a cluster that's currently backfilling from increasing PG counts. We 
> have tuned recovery and backfill way down as a "precaution" and would like to 
> start tuning it to bring up to a good balance between that and client I/O.
>
> At the moment we're in the process of bumping up PG numbers for pools serving 
> production workloads. Said pools are EC 8+3.
>
> It looks like we're having very low numbers of PGs backfilling as in:
>
> 2567 TB used, 5062 TB / 7630 TB avail
> 145588/849529410 objects degraded (0.017%)
> 5177689/849529410 objects misplaced (0.609%)
> 7309 active+clean
>   23 active+clean+scrubbing
>   

[ceph-users] OSDs in EC pool flapping

2017-08-22 Thread george.vasilakakos
Hey folks,


I'm staring at a problem that I have found no solution for and which is causing 
major issues.
We've had a PG go down with the first 3 OSDs all crashing and coming back only 
to crash again with the following error in their logs:

-1> 2017-08-22 17:27:50.961633 7f4af4057700 -1 osd.1290 pg_epoch: 72946 
pg[1.138s0( v 72946'430011 (62760'421568,72
946'430011] local-les=72945 n=22918 ec=764 les/c/f 72945/72881/0 
72942/72944/72944) [1290,927,672,456,177,1094,194,1513
,236,302,1326]/[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0 
lpr=72944 pi=72880-72943/24 bft=1513(7) crt=
72946'430011 lcod 72889'430010 mlcod 72889'430010 
active+undersized+degraded+remapped+backfilling] recover_replicas: ob
ject added to missing set for backfill, but is not in recovering, error!
 0> 2017-08-22 17:27:50.965861 7f4af4057700 -1 *** Caught signal (Aborted) 
**
 in thread 7f4af4057700 thread_name:tp_osd_tp

This has been going on over the weekend, during which we saw a different error 
message before we upgraded from 11.2.0 to 11.2.1.
The pool is running EC 8+3.

The OSDs crash with that error only to be restarted by systemd and fail again 
the exact same way. Eventually systemd gives up, the mon_osd_down_out_interval 
expires and the PG just stays down+remapped while others recover and go 
active+clean.

Can anybody help with this type of problem?


Best regards,

George Vasilakakos
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs in EC pool flapping

2017-08-23 Thread george.vasilakakos
No, nothing like that.

The cluster is in the process of having more OSDs added and, while that was 
ongoing, one was removed because the underlying disk was throwing up a bunch of 
read errors.
Shortly after, the first three OSDs in this PG started crashing with error 
messages about corrupted EC shards. We seemed to be running into 
http://tracker.ceph.com/issues/18624 so we moved on to 11.2.1 which essentially 
means they now fail with a different error message. Our problem looks a bit 
like this: http://tracker.ceph.com/issues/18162

For a bit more context here's two more events going backwards in the dump:


-3> 2017-08-22 17:42:09.443216 7fa2e283d700  0 osd.1290 pg_epoch: 73324 
pg[1.138s0( v 73085'430014 (62760'421568,73
085'430014] local-les=73323 n=22919 ec=764 les/c/f 73323/72881/0 
73321/73322/73322) [1290,927,672,456,177,1094,194,1513
,236,302,1326]/[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0 
lpr=73322 pi=72880-73321/179 rops=1 bft=1513
(7) crt=73085'430014 lcod 0'0 mlcod 0'0 
active+undersized+degraded+remapped+backfilling] failed_push 1:1c959fdd:::datad
isk%2frucio%2fmc16_13TeV%2f41%2f30%2fAOD.11927271._003020.pool.root.1.:head
 from shard 177(4), reps on 
 unfound? 0
-2> 2017-08-22 17:42:09.443299 7fa2e283d700  5 -- op tracker -- seq: 490, 
time: 2017-08-22 17:42:09.443297, event: 
done, op: MOSDECSubOpReadReply(1.138s0 73324 ECSubReadReply(tid=5, 
attrs_read=0))

No amount of taking OSDs out or restarting them fixes it. At this point we've 
had the first 3 marked out by Ceph: they flapped enough that systemd gave up 
trying to restart them, and they stayed down long enough that 
mon_osd_down_out_interval expired. Now the pg map looks like this:

# ceph pg map 1.138
osdmap e73599 pg 1.138 (1.138) -> up 
[111,1325,437,456,177,1094,194,1513,236,302,1326] acting 
[2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,1326]

Looking at #18162, it looks a lot like what we're seeing in our production 
system (which is experiencing a service outage because of this), but the fact 
that the issue is marked as minor severity and hasn't had any updates in two 
months is disconcerting.

As for deep scrubbing, it sounds like it could possibly work in a general 
corruption situation, but not with a PG stuck in down+remapped and its first 3 
OSDs crashing out after 5 minutes of operation.


Thanks, 

George



From: Paweł Woszuk [pwos...@man.poznan.pl]
Sent: 22 August 2017 19:19
To: ceph-users@lists.ceph.com; Vasilakakos, George (STFC,RAL,SC)
Subject: Re: [ceph-users] OSDs in EC pool flapping

Have you experienced huge memory consumption by the flapping OSD daemons? The 
restarts could be triggered by running out of memory (OOM killer).

If yes, this could be connected with an OSD device error (bad blocks?), but we 
experienced something similar on Jewel, not on a Kraken release. The solution 
was to find the PG causing the error, set it to deep scrub manually and restart 
the PG's primary OSD.
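
In concrete terms (using the PG and primary from this thread) that would be 
something like:

ceph pg deep-scrub 1.138          # queue a manual deep scrub of the suspect PG
systemctl restart ceph-osd@1290   # restart that PG's primary OSD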



Hope that helps, or at least lead to some solution.



On 22 August 2017 at 18:39:47 CEST, george.vasilaka...@stfc.ac.uk wrote:

Hey folks,


I'm staring at a problem that I have found no solution for and which is causing 
major issues.
We've had a PG go down with the first 3 OSDs all crashing and coming back only 
to crash again with the following error in their logs:

-1> 2017-08-22 17:27:50.961633 7f4af4057700 -1 osd.1290 pg_epoch: 72946 
pg[1.138s0( v 72946'430011 (62760'421568,72
946'430011] local-les=72945 n=22918 ec=764 les/c/f 72945/72881/0 
72942/72944/72944) [1290,927,672,456,177,1094,194,1513
,236,302,1326]/[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0 
lpr=72944 pi=72880-72943/24 bft=1513(7) crt=
72946'430011 lcod 72889'430010 mlcod 72889'430010 
active+undersized+degraded+remapped+backfilling] recover_replicas: ob
ject added to missing set for backfill, but is not in recovering, error!
 0> 2017-08-22 17:27:50.965861 7f4af4057700 -1 *** Caught signal (Aborted) 
**
 in thread 7f4af4057700 thread_name:tp_osd_tp

This has been going on over the weekend when we saw a different error message 
before upgrading from 11.2.0 to 11.2.1.
The pool is running EC 8+3.

The OSDs crash with that error only to be restarted by systemd and fail again 
the exact same way. Eventually systemd gives up, the mon_osd_down_out_interval 
expires and the PG just stays down+remapped while others recover and go 
active+clean.

Can anybody help with this type of problem?


Best regards,

George Vasilakakos

ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





Paweł Woszuk
PCSS, Poznańskie Centrum Superkomputerowo-Sieciowe
ul. Jana Pawła II nr 10, 61-139 Poznań
Polska


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Sudden omap growth on some OSDs

2017-12-06 Thread george.vasilakakos
Hi ceph-users,

We have a Ceph cluster (running Kraken) that is exhibiting some odd behaviour.
A couple of weeks ago, the LevelDBs on some of our OSDs started growing large 
(now at around 20G in size).

The one thing they have in common is that the 11 disks with inflating LevelDBs 
are all in the set for one PG in one of our pools (EC 8+3). This pool started to 
see use around the time the LevelDBs started inflating. Compactions are running 
and they do go down in size a bit but the overall trend is one of rapid growth. 
The other 2000+ OSDs in the cluster have LevelDBs between 650M and 1.2G.
This PG has nothing to separate it from the others in its pool, within 5% of 
average number of objects per PG, no hot-spotting in terms of load, no weird 
states reported by ceph status.
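
For what it's worth, the size comparison is easy to reproduce on each storage 
node, assuming the default FileStore layout (paths may differ):

du -sh /var/lib/ceph/osd/ceph-*/current/omap | sort -h | tail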

The one odd thing about it is the pg query output mentions it is active+clean, 
but it has a recovery state, which it enters every morning between 9 and 10am, 
where it mentions a "might_have_unfound" situation and having probed all other 
set members. A deep scrub of the PG didn't turn up anything.

The cluster is now starting to manifest slow requests on the OSDs with the 
large LevelDBs, although not in the particular PG.

What can I do to diagnose and resolve this?

Thanks,

George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sudden omap growth on some OSDs

2017-12-07 Thread george.vasilakakos


From: Gregory Farnum [gfar...@redhat.com]
Sent: 06 December 2017 22:50
To: David Turner
Cc: Vasilakakos, George (STFC,RAL,SC); ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Sudden omap growth on some OSDs

On Wed, Dec 6, 2017 at 2:35 PM David Turner 
mailto:drakonst...@gmail.com>> wrote:
I have no proof or anything other than a hunch, but OSDs don't trim omaps 
unless all PGs are healthy.  If this PG is actually not healthy, but the 
cluster doesn't realize it while these 11 involved OSDs do realize that the PG 
is unhealthy... You would see this exact problem.  The OSDs think a PG is 
unhealthy so they aren't trimming their omaps while the cluster doesn't seem to 
be aware of it and everything else is trimming their omaps properly.

I think you're confusing omaps and OSDMaps here. OSDMaps, like omap, are stored 
in leveldb, but they have different trimming rules.


I don't know what to do about it, but I hope it helps get you (or someone else 
on the ML) towards a resolution.

On Wed, Dec 6, 2017 at 1:59 PM 
mailto:george.vasilaka...@stfc.ac.uk>> wrote:
Hi ceph-users,

We have a Ceph cluster (running Kraken) that is exhibiting some odd behaviour.
A couple weeks ago, the LevelDBs on some our OSDs started growing large (now at 
around 20G size).

The one thing they have in common is the 11 disks with inflating LevelDBs are 
all in the set for one PG in one of our pools (EC 8+3). This pool started to 
see use around the time the LevelDBs started inflating. Compactions are running 
and they do go down in size a bit but the overall trend is one of rapid growth. 
The other 2000+ OSDs in the cluster have LevelDBs between 650M and 1.2G.
This PG has nothing to separate it from the others in its pool, within 5% of 
average number of objects per PG, no hot-spotting in terms of load, no weird 
states reported by ceph status.

The one odd thing about it is the pg query output mentions it is active+clean, 
but it has a recovery state, which it enters every morning between 9 and 10am, 
where it mentions a "might_have_unfound" situation and having probed all other 
set members. A deep scrub of the PG didn't turn up anything.

You need to be more specific here. What do you mean it "enters into" the 
recovery state every morning?

Here's what PG query showed me yesterday:
"recovery_state": [
{
"name": "Started\/Primary\/Active",
"enter_time": "2017-12-05 09:48:57.730385",
"might_have_unfound": [
{
"osd": "79(1)",
"status": "already probed"
},
{
"osd": "337(9)",
"status": "already probed"
},... it goes on to list all peers of this OSD in that PG.


How many PGs are in your 8+3 pool, and are all your OSDs hosting EC pools? What 
are you using the cluster for?

2048 PGs in this pool, also another 2048 PG EC pool (same profile) and two more 
1024 PG EC pools (also same profile). Then a set of RGW auxiliary pools with 
3-way replication.
I'm not 100% sure but I think all of our OSDs should have a few PGs from one of 
the EC pools. Our rules don't make a distinction so it's probabilistic. We're 
using the cluster as an object store, minor RGW use and custom gateways using 
libradosstriper.
It's also worth pointing out that an OSD in that PG was taken out of the 
cluster earlier today and pg query shows the following weirdness:
The primary thinks it's active+clean but, in the peer_info section all peers 
report it is "active+undersized+degraded+remapped+backfilling". It has shown 
this discrepancy before between the primary thinking it's a+c and the rest of 
the set seeing it a+c+degraded. In the query output we're showing the following 
for recovery state:
"recovery_state": [
{
"name": "Started\/Primary\/Active",
"enter_time": "2017-12-07 08:41:57.850220",
"might_have_unfound": [],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"recovery_ops": [],
"read_ops": []
}


The cluster is now starting to manifest slow requests on the OSDs with the 
large LevelDBs, although not in the particular PG.

What can I do to diagnose and resolve this?

Thanks,

George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
c

Re: [ceph-users] Sudden omap growth on some OSDs

2017-12-08 Thread george.vasilakakos


From: Gregory Farnum [gfar...@redhat.com]
Sent: 07 December 2017 21:57
To: Vasilakakos, George (STFC,RAL,SC)
Cc: drakonst...@gmail.com; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Sudden omap growth on some OSDs



On Thu, Dec 7, 2017 at 4:41 AM 
mailto:george.vasilaka...@stfc.ac.uk>> wrote:


From: Gregory Farnum [gfar...@redhat.com]
Sent: 06 December 2017 22:50
To: David Turner
Cc: Vasilakakos, George (STFC,RAL,SC); 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Sudden omap growth on some OSDs

On Wed, Dec 6, 2017 at 2:35 PM David Turner 
mailto:drakonst...@gmail.com>>>
 wrote:
I have no proof or anything other than a hunch, but OSDs don't trim omaps 
unless all PGs are healthy.  If this PG is actually not healthy, but the 
cluster doesn't realize it while these 11 involved OSDs do realize that the PG 
is unhealthy... You would see this exact problem.  The OSDs think a PG is 
unhealthy so they aren't trimming their omaps while the cluster doesn't seem to 
be aware of it and everything else is trimming their omaps properly.

I think you're confusing omaps and OSDMaps here. OSDMaps, like omap, are stored 
in leveldb, but they have different trimming rules.


I don't know what to do about it, but I hope it helps get you (or someone else 
on the ML) towards a resolution.

On Wed, Dec 6, 2017 at 1:59 PM 
mailto:george.vasilaka...@stfc.ac.uk>>>
 wrote:
Hi ceph-users,

We have a Ceph cluster (running Kraken) that is exhibiting some odd behaviour.
A couple weeks ago, the LevelDBs on some our OSDs started growing large (now at 
around 20G size).

The one thing they have in common is the 11 disks with inflating LevelDBs are 
all in the set for one PG in one of our pools (EC 8+3). This pool started to 
see use around the time the LevelDBs started inflating. Compactions are running 
and they do go down in size a bit but the overall trend is one of rapid growth. 
The other 2000+ OSDs in the cluster have LevelDBs between 650M and 1.2G.
This PG has nothing to separate it from the others in its pool, within 5% of 
average number of objects per PG, no hot-spotting in terms of load, no weird 
states reported by ceph status.

The one odd thing about it is the pg query output mentions it is active+clean, 
but it has a recovery state, which it enters every morning between 9 and 10am, 
where it mentions a "might_have_unfound" situation and having probed all other 
set members. A deep scrub of the PG didn't turn up anything.

You need to be more specific here. What do you mean it "enters into" the 
recovery state every morning?

Here's what PG query showed me yesterday:
"recovery_state": [
{
"name": "Started\/Primary\/Active",
"enter_time": "2017-12-05 09:48:57.730385",
"might_have_unfound": [
{
"osd": "79(1)",
"status": "already probed"
},
{
"osd": "337(9)",
"status": "already probed"
},... it goes on to list all peers of this OSD in that PG.

IIRC that's just a normal thing when there's any kind of recovery happening — 
it builds up a set during peering of OSDs that might have data, in case it 
discovers stuff missing.

OK. But this is the only PG mentioning "might_have_unfound" across the two most 
used pools in our cluster and it's the only one that has all of its omap dirs 
at sizes more than 15 times the average for the cluster.



How many PGs are in your 8+3 pool, and are all your OSDs hosting EC pools? What 
are you using the cluster for?

2048 PGs in this pool, also another 2048 PG EC pool (same profile) and two more 
1024 PG EC pools (also same profile). Then a set of RGW auxiliary pools with 
3-way replication.
I'm not 100% sure but I think all of our OSDs should have a few PGs from one of 
the EC pools. Our rules don't make a distinction so it's probabilistic. We're 
using the cluster as an object store, minor RGW use and custom gateways using 
libradosstriper.
It's also worth pointing out that an OSD in that PG was taken out of the 
cluster earlier today and pg query shows the following weirdness:
The primary thinks it's active+clean but, in the peer_info section all peers 
report it is "active+undersized+degraded+remapped+backfilling". It has shown 
this discrepancy before between the primary thinking it's a+c and the rest of 
the set seeing it a+c+degraded.

Again, exactly what output makes you say the primary thinks it's active+clean 
but the others have more complex recovery states?

I've pastebin'ed a query output: https://pastebin.com/PbGSaZyF. The PG state is 
reported "active+clean" in the top section but the peer_info sections have it 
"active+un

Re: [ceph-users] Sudden omap growth on some OSDs

2017-12-12 Thread george.vasilakakos

On 11 Dec 2017, at 18:24, Gregory Farnum 
mailto:gfar...@redhat.com>> wrote:

Hmm, this does all sound odd. Have you tried just restarting the primary OSD 
yet? That frequently resolves transient oddities like this.
If not, I'll go poke at the kraken source and one of the developers more 
familiar with the recovery processes we're seeing here.
-Greg


Hi Greg,

I’ve tried this, no effect. Also, on Friday, we tried removing an OSD (not the 
primary); the OSD that was chosen to replace it has had its LevelDB grow to 7GiB 
by now. Yesterday it was at 5.3GiB.
We’re not seeing any errors logged by the OSDs with the default logging level 
either.

Do you have any comments on the fact that the primary sees the PG’s state as 
being different to what the peers think?
Now, with a new primary, I’m seeing the last peer in the set reporting it’s 
‘active+clean’, as is the primary; all the others are saying it’s 
‘active+clean+degraded’ (according to PG query output).

This problem is quite weird, I think. I copied a LevelDB and dumped a key list; 
the largest one (by size in GiB) had 66% of the number of keys that the average 
LevelDB has. The main difference with the ones that have been around for a while 
is that they have a lot more files that were last touched on the days when the 
problem started, whereas most other LevelDBs have compacted those away and only 
have files about 7 days old (as opposed to the 3-week-old ones that the big ones 
keep around). The big ones do seem to run compactions; they just don’t seem to 
get rid of that stuff.



Thanks,

George


On Fri, Dec 8, 2017 at 7:30 AM 
mailto:george.vasilaka...@stfc.ac.uk>> wrote:


From: Gregory Farnum [gfar...@redhat.com]
Sent: 07 December 2017 21:57
To: Vasilakakos, George (STFC,RAL,SC)
Cc: drakonst...@gmail.com; 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Sudden omap growth on some OSDs



On Thu, Dec 7, 2017 at 4:41 AM 
mailto:george.vasilaka...@stfc.ac.uk>>>
 wrote:


From: Gregory Farnum 
[gfar...@redhat.com>]
Sent: 06 December 2017 22:50
To: David Turner
Cc: Vasilakakos, George (STFC,RAL,SC); 
ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Sudden omap growth on some OSDs

On Wed, Dec 6, 2017 at 2:35 PM David Turner 
mailto:drakonst...@gmail.com>>>

Re: [ceph-users] Sudden omap growth on some OSDs

2017-12-12 Thread george.vasilakakos
From: Gregory Farnum 
Date: Tuesday, 12 December 2017 at 19:24
To: "Vasilakakos, George (STFC,RAL,SC)" 
Cc: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] Sudden omap growth on some OSDs

On Tue, Dec 12, 2017 at 3:16 AM 
mailto:george.vasilaka...@stfc.ac.uk>> wrote:

On 11 Dec 2017, at 18:24, Gregory Farnum 
mailto:gfar...@redhat.com>>>
 wrote:

Hmm, this does all sound odd. Have you tried just restarting the primary OSD 
yet? That frequently resolves transient oddities like this.
If not, I'll go poke at the kraken source and one of the developers more 
familiar with the recovery processes we're seeing here.
-Greg


Hi Greg,

I’ve tried this, no effect. Also, on Friday, we tried removing an OSD (not the 
primary), the OSD that was chosen to replace it had it’s LevelDB grow to 7GiB 
by now. Yesterday it was 5.3.
We’re not seeing any errors logged by the OSDs with the default logging level 
either.

Do you have any comments on the fact that the primary sees the PG’s state as 
being different to what the peers think?

Yes. It's super weird. :p

Now, with a new primary I’m seeing the last peer in the set reporting it’s 
‘active+clean’, as is the primary, all other are saying it’s 
‘active+clean+degraded’ (according to PG query output).

Has the last OSD in the list shrunk down its LevelDB instance?

No, the last peer has the largest LevelDB of any OSD currently in the PG, at 14GiB.

If so (or even if not), I'd try restarting all the OSDs in the PG and see if 
that changes things.

Will try that and report back.

If it doesn't...well, it's about to be Christmas and Luminous saw quite a bit 
of change in this space, so it's unlikely to get a lot of attention. :/

Yeah, this being Kraken I doubt it will get looked into deeply.

But the next step would be to gather high-level debug logs from the OSDs in 
question, especially as a peering action takes place.

I’ll be re-introducing the old primary this week so maybe I’ll bump the logging 
levels (to what?) on these OSDs and see what they come up with.

Oh!
I didn't notice you previously mentioned "custom gateways using the 
libradosstriper". Are those backing onto this pool? What operations are they 
doing?
Something like repeated overwrites of EC data could definitely have symptoms 
similar to this (apart from the odd peering bit.)
-Greg

Think of these as using the cluster as an object store. Most of the time we’re 
writing something in, reading it out anywhere from zero to thousands of times 
(each time running stat as well) and eventually may be deleting it. Once 
written, there’s no reason for it to be overwritten. They’re backing onto the EC pools 
(one per “tenant”) but the particular pool that this PG is a part of has barely 
seen any use. The most used one is storing petabytes and this one was barely 
reaching 100TiB when this came up.



This problem is quite weird I think. I copied a LevelDB and dumped a key list; 
the largest in GiB had 66% the number of keys that the average LevelDB has. The 
main difference with the ones that have been around for a while is that they 
have a lot more files that were last touched on the days when the problem 
started but most other LevelDBs have compacted those away and only have about 7 
days old files (as opposed to 3 week old ones that the big ones keep around). 
The big ones do seem to do compactions, they just don’t seem to get rid of that 
stuff.



Thanks,

George


On Fri, Dec 8, 2017 at 7:30 AM 
mailto:george.vasilaka...@stfc.ac.uk>>>
 wrote:


From: Gregory Farnum 
[gfar...@redhat.com>]
Sent: 07 December 2017 21:57
To: Vasilakakos, George (STFC,RAL,SC)
Cc: 
drakonst...@gmail.com>;
 
ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Sudden omap growth on some OSDs



On Thu, Dec 7, 2017 at 4:41 AM 
mailto:george.vasilaka...@stfc.ac.uk>>>>>]
Sent: 06 December 2017 22:50
To: David Turner
Cc: Vasilakakos, George (STFC,RAL,SC); 
ceph-users@lists.ceph.com>

Re: [ceph-users] Sudden omap growth on some OSDs

2017-12-13 Thread george.vasilakakos
Hi Greg,

I have re-introduced the OSD that was taken out (the one that used to be a 
primary). I have kept debug 20 logs from both the re-introduced primary and the 
outgoing primary. I have used ceph-post-file to upload these, tag: 
5b305f94-83e2-469c-a301-7299d2279d94
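
For anyone unfamiliar with ceph-post-file, usage is along these lines (the 
description and log file name here are just placeholders):

ceph-post-file -d "omap growth, debug 20 OSD logs" /var/log/ceph/ceph-osd.NNN.log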

Hope this helps, let me know if you'd like me to do another test.

Thanks,

George

From: Gregory Farnum [gfar...@redhat.com]
Sent: 13 December 2017 00:04
To: Vasilakakos, George (STFC,RAL,SC)
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Sudden omap growth on some OSDs



On Tue, Dec 12, 2017 at 3:36 PM 
mailto:george.vasilaka...@stfc.ac.uk>> wrote:
From: Gregory Farnum mailto:gfar...@redhat.com>>
Date: Tuesday, 12 December 2017 at 19:24
To: "Vasilakakos, George (STFC,RAL,SC)" 
mailto:george.vasilaka...@stfc.ac.uk>>
Cc: "ceph-users@lists.ceph.com" 
mailto:ceph-users@lists.ceph.com>>
Subject: Re: [ceph-users] Sudden omap growth on some OSDs

On Tue, Dec 12, 2017 at 3:16 AM 
mailto:george.vasilaka...@stfc.ac.uk>>>
 wrote:

On 11 Dec 2017, at 18:24, Gregory Farnum 
mailto:gfar...@redhat.com>>

[ceph-users] PGs of EC pool stuck in peering state

2017-01-12 Thread george.vasilakakos
Hi Ceph folks,

I’ve just posted a bug report http://tracker.ceph.com/issues/18508 

I have a cluster (Jewel 10.2.3, SL7) that has trouble creating PGs in EC pools. 
Essentially, I’ll get a lot of CRUSH_ITEM_NONE (2147483647) in there and PGs 
will stay in peering states. This sometimes affects other pools (EC and rep.) 
where their PGs fall into peering states too.

Restarting the primary OSD for a PG will get it to peer.
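
The workaround in practice is just the following (PG id is an example; the first 
OSD listed in the up/acting set is the primary):

ceph pg map 1.2ab
systemctl restart ceph-osd@<primary-osd-id>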

Has anyone run into this issue before, if so what did you do to fix it?


Cheers,

George


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG stuck peering after host reboot

2017-02-08 Thread george.vasilakakos
Hi Ceph folks,

I have a cluster running Jewel 10.2.5 using a mix EC and replicated pools.

After rebooting a host last night, one PG refuses to complete peering

pg 1.323 is stuck inactive for 73352.498493, current state peering, last acting 
[595,1391,240,127,937,362,267,320,7,634,716]

Restarting OSDs or hosts does nothing to help, or sometimes results in things 
like this:

pg 1.323 is remapped+peering, acting 
[2147483647,1391,240,127,937,362,267,320,7,634,716]


The host that was rebooted is home to osd.7 (rank 8 in the acting set). If I go 
onto it to look at the logs for osd.7, this is what I see:

$ tail -f /var/log/ceph/ceph-osd.7.log
2017-02-08 15:41:00.445247 7f5fcc2bd700  0 -- XXX.XXX.XXX.172:6905/20510 >> 
XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400 sd=34 :42828 s=2 pgs=319 cs=471 
l=0 c=0x7f6070086700).fault, initiating reconnect

I'm assuming that in IP1:port1/PID1 >> IP2:port2/PID2 the >> indicates the 
direction of communication. I've traced these to osd.7 (rank 8 in the stuck PG) 
reaching out to osd.595 (the primary in the stuck PG).

Meanwhile, looking at the logs of osd.595 I see this:

$ tail -f /var/log/ceph/ceph-osd.595.log
2017-02-08 15:41:15.760708 7f1765673700  0 -- XXX.XXX.XXX.192:6921/55371 >> 
XXX.XXX.XXX.172:6905/20510 pipe(0x7f17b2911400 sd=101 :6921 s=0 pgs=0 cs=0 l=0 
c=0x7f17b7beaf00).accept connect_seq 478 vs existing 477 state standby
2017-02-08 15:41:20.768844 7f1765673700  0 bad crc in front 1941070384 != exp 
3786596716

which again shows osd.595 reaching out to osd.7; from what I could gather, the 
CRC problem is in the messaging between them.

Google searching has yielded nothing particularly useful on how to get this 
unstuck.

ceph pg 1.323 query seems to hang forever but it completed once last night and 
I noticed this:

"peering_blocked_by_detail": [
{
"detail": "peering_blocked_by_history_les_bound"
}

We have seen this before and it was cleared by setting 
osd_find_best_info_ignore_history_les to true for the first two OSDs of the 
stuck PGs (that was on a 3-replica pool). That hasn't worked in this case, and I 
suspect the option needs to be set on either a majority of the OSDs or on at 
least k of them, so that their data can be used while the history is ignored.
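
For context, the way we set that option previously was roughly the following (a 
sketch, not an exact transcript; use with care, since ignoring the history can 
discard recent writes):

ceph tell osd.<id> injectargs '--osd_find_best_info_ignore_history_les=true'
# or, if injectargs doesn't respond, set it in ceph.conf under [osd.<id>]:
#   osd_find_best_info_ignore_history_les = true
# and then restart that OSD
systemctl restart ceph-osd@<id>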

We would really appreciate any guidance and/or help the community can offer!


George V.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck peering after host reboot

2017-02-08 Thread george.vasilakakos
Hi Corentin,

I've tried that, the primary hangs when trying to injectargs so I set the 
option in the config file and restarted all OSDs in the PG, it came up with:

pg 1.323 is remapped+peering, acting 
[595,1391,2147483647,127,937,362,267,320,7,634,716]

Still can't query the PG, no error messages in the logs of osd.240.
The logs on osd.595 and osd.7 still fill up with the same messages.

Regards,

George

From: Corentin Bonneton [l...@titin.fr]
Sent: 08 February 2017 16:31
To: Vasilakakos, George (STFC,RAL,SC)
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] PG stuck peering after host reboot

Hello,

I already had the case, I applied the parameter 
(osd_find_best_info_ignore_history_les) to all the osd that have reported the 
queries blocked.

--
Regards,
CEO FEELB | Corentin BONNETON
cont...@feelb.io

On 8 Feb 2017, at 17:17, george.vasilaka...@stfc.ac.uk wrote:

Hi Ceph folks,

I have a cluster running Jewel 10.2.5 using a mix EC and replicated pools.

After rebooting a host last night, one PG refuses to complete peering

pg 1.323 is stuck inactive for 73352.498493, current state peering, last acting 
[595,1391,240,127,937,362,267,320,7,634,716]

Restarting OSDs or hosts does nothing to help, or sometimes results in things 
like this:

pg 1.323 is remapped+peering, acting 
[2147483647,1391,240,127,937,362,267,320,7,634,716]


The host that was rebooted is home to osd.7 (8). If I go onto it to look at the 
logs for osd.7 this is what I see:

$ tail -f /var/log/ceph/ceph-osd.7.log
2017-02-08 15:41:00.445247 7f5fcc2bd700  0 -- XXX.XXX.XXX.172:6905/20510 >> 
XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400 sd=34 :42828 s=2 pgs=319 cs=471 
l=0 c=0x7f6070086700).fault, initiating reconnect

I'm assuming that in IP1:port1/PID1 >> IP2:port2/PID2 the >> indicates the 
direction of communication. I've traced these to osd.7 (rank 8 in the stuck PG) 
reaching out to osd.595 (the primary in the stuck PG).

Meanwhile, looking at the logs of osd.595 I see this:

$ tail -f /var/log/ceph/ceph-osd.595.log
2017-02-08 15:41:15.760708 7f1765673700  0 -- XXX.XXX.XXX.192:6921/55371 >> 
XXX.XXX.XXX.172:6905/20510 pipe(0x7f17b2911400 sd=101 :6921 s=0 pgs=0 cs=0 l=0 
c=0x7f17b7beaf00).accept connect_seq 478 vs existing 477 state standby
2017-02-08 15:41:20.768844 7f1765673700  0 bad crc in front 1941070384 != exp 
3786596716

which again shows osd.595 reaching out to osd.7 and from what I could gather 
the CRC problem is about messaging.

Google searching has yielded nothing particularly useful on how to get this 
unstuck.

ceph pg 1.323 query seems to hang forever but it completed once last night and 
I noticed this:

   "peering_blocked_by_detail": [
   {
   "detail": "peering_blocked_by_history_les_bound"
   }

We have seen this before and it was cleared by setting 
osd_find_best_info_ignore_history_les to true for the first two OSDs on the 
stuck PGs (this was on a 3 replica pool). This hasn't worked in this case and I 
suspect the option needs to be set on either a majority of OSDs or enough k 
number of OSDs to be able to use their data and ignore history.

We would really appreciate any guidance and/or help the community can offer!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck peering after host reboot

2017-02-08 Thread george.vasilakakos
Hi Greg,

> Yes, "bad crc" indicates that the checksums on an incoming message did
> not match what was provided — ie, the message got corrupted. You
> shouldn't try and fix that by playing around with the peering settings
> as it's not a peering bug.
> Unless there's a bug in the messaging layer causing this (very
> unlikely), you have bad hardware or a bad network configuration
> (people occasionally talk about MTU settings?). Fix that and things
> will work; don't and the only software tweaks you could apply are more
> likely to result in lost data than a happy cluster.
> -Greg


I thought of the network initially, but I didn't observe packet loss between the 
two hosts and neither host is having trouble talking to the rest of its peers. 
It's only these two OSDs that can't talk to each other, so I figured it's not 
likely to be a network issue. Network monitoring does show virtually 
non-existent inbound traffic over those links compared to the other ports on the 
switch, but no other peerings fail.
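
For the record, the kind of low-level checks I have in mind for that path are 
along these lines (a rough sketch; host and interface names are placeholders):

# test the path with the DF bit set, at jumbo and standard frame sizes
ping -M do -s 8972 -c 10 <osd595-host>
ping -M do -s 1472 -c 10 <osd595-host>
# compare MTU and error/drop counters on both ends
ip link show <iface> | grep mtu
ethtool -S <iface> | grep -iE 'err|drop|crc'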

Is there something you can suggest to do to drill down deeper?
Also, am I correct in assuming that, as a last resort, I can pull one of these 
OSDs from the cluster to cause a remapping to a different OSD, as a potential 
quick/temporary fix to get the cluster serving I/O properly again?


Many thanks for your help,

George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck peering after host reboot

2017-02-08 Thread george.vasilakakos
Hey Greg,

Thanks for your quick responses. I have to leave the office now but I'll look 
deeper into it tomorrow to try to understand what's causing this. I'll 
try to find other peerings between these two hosts and check those OSDs' logs 
for potential anomalies. I'll also have a look at any potential configuration 
changes that might have affected the host post-reboot.

I'll be back here with more info once I have it tomorrow.

Thanks again!

George

From: Gregory Farnum [gfar...@redhat.com]
Sent: 08 February 2017 18:29
To: Vasilakakos, George (STFC,RAL,SC)
Cc: Ceph Users
Subject: Re: [ceph-users] PG stuck peering after host reboot

On Wed, Feb 8, 2017 at 10:25 AM,   wrote:
> Hi Greg,
>
>> Yes, "bad crc" indicates that the checksums on an incoming message did
>> not match what was provided — ie, the message got corrupted. You
>> shouldn't try and fix that by playing around with the peering settings
>> as it's not a peering bug.
>> Unless there's a bug in the messaging layer causing this (very
>> unlikely), you have bad hardware or a bad network configuration
>> (people occasionally talk about MTU settings?). Fix that and things
>> will work; don't and the only software tweaks you could apply are more
>> likely to result in lost data than a happy cluster.
>> -Greg
>
>
> I thought of the network initially but I didn't observe packet loss between 
> the two hosts and neither host is having trouble talking to the rest of its 
> peers. It's these two OSDs that can't talk to each other so I figured it's 
> not likely to be a network issue. Network monitoring does show virtually 
> non-existent inbound traffic over those links compared to the other ports on 
> the switch but no other peerings fail.
>
> Is there something you can suggest to do to drill down deeper?

Sadly no. It being a single route is indeed weird and hopefully
somebody with more networking background can suggest a cause. :)

> Also, am I correct in assuming that I can pull one of these OSDs from the 
> cluster as a last resort to cause a remapping to a different to potentially 
> give this a quick/temp fix and get the cluster serving I/O properly again?

I'd expect so!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck peering after host reboot

2017-02-09 Thread george.vasilakakos
OK, I've had a look.

Haven't been able to take a proper look at the network yet but here's what I've 
gathered on other fronts so far:

* Marking either osd.595 or osd.7 out results in this:

$ ceph health detail | grep -v stuck | grep 1.323
pg 1.323 is remapped+peering, acting 
[2147483647,1391,240,127,937,362,267,320,7,634,716]

The only way to fix this is to restart 595 and 1391 a couple of times. Then you 
get a proper set with 595(0) and a peering state, as opposed to remapped+peering.

* I have looked through the PG mappings and found that:
** PG 1.323 is the only PG which has both 595 and 7 in its acting set.
** there are 218 PGs which have OSDs living on both of the hosts that 595 and 7 
live on.

Given the above, I'm not very inclined to think it's a network issue. If it 
were, I'd expect at least one other PG that depends on the same network path to 
be failing as well. 

As it stands this persists after:

* having restarted all OSDs in the acting set with 
osd_find_best_info_ignore_history_les = true
* having restarted both hosts that the OSDs failing to talk to each other live 
on
* marking either OSD out and allowing recovery to finish

Also worth noting is that, after multiple restarts, osd.595 is still not 
responding to `ceph tell osd.595` or `ceph pg 1.323 query`.
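
One thing I can still poke at on the host itself is the admin socket (a rough 
sketch of what I have in mind, assuming the default socket path):

ceph daemon osd.595 status
ceph daemon osd.595 dump_ops_in_flight
ceph daemon osd.595 dump_historic_ops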


Anybody have any clue as to what this might be?


George



From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of 
george.vasilaka...@stfc.ac.uk [george.vasilaka...@stfc.ac.uk]
Sent: 08 February 2017 18:32
To: gfar...@redhat.com
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] PG stuck peering after host reboot

Hey Greg,

Thanks for your quick responses. I have to leave the office now but I'll look 
deeper into it tomorrow to try and understand what's the cause of this. I'll 
try to find other peerings between these two hosts and check those OSDs' logs 
for potential anomalies. I'll also have a look at any potential configuration 
changes that might have affected the host post-reboot.

I'll be back here with more info once I have it tomorrow.

Thanks again!

George

From: Gregory Farnum [gfar...@redhat.com]
Sent: 08 February 2017 18:29
To: Vasilakakos, George (STFC,RAL,SC)
Cc: Ceph Users
Subject: Re: [ceph-users] PG stuck peering after host reboot

On Wed, Feb 8, 2017 at 10:25 AM,   wrote:
> Hi Greg,
>
>> Yes, "bad crc" indicates that the checksums on an incoming message did
>> not match what was provided — ie, the message got corrupted. You
>> shouldn't try and fix that by playing around with the peering settings
>> as it's not a peering bug.
>> Unless there's a bug in the messaging layer causing this (very
>> unlikely), you have bad hardware or a bad network configuration
>> (people occasionally talk about MTU settings?). Fix that and things
>> will work; don't and the only software tweaks you could apply are more
>> likely to result in lost data than a happy cluster.
>> -Greg
>
>
> I thought of the network initially but I didn't observe packet loss between 
> the two hosts and neither host is having trouble talking to the rest of its 
> peers. It's these two OSDs that can't talk to each other so I figured it's 
> not likely to be a network issue. Network monitoring does show virtually 
> non-existent inbound traffic over those links compared to the other ports on 
> the switch but no other peerings fail.
>
> Is there something you can suggest to do to drill down deeper?

Sadly no. It being a single route is indeed weird and hopefully
somebody with more networking background can suggest a cause. :)

> Also, am I correct in assuming that I can pull one of these OSDs from the 
> cluster as a last resort to cause a remapping to a different to potentially 
> give this a quick/temp fix and get the cluster serving I/O properly again?

I'd expect so!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck peering after host reboot

2017-02-14 Thread george.vasilakakos
Hi Brad,

I'll be doing so later in the day.

Thanks,

George

From: Brad Hubbard [bhubb...@redhat.com]
Sent: 13 February 2017 22:03
To: Vasilakakos, George (STFC,RAL,SC); Ceph Users
Subject: Re: [ceph-users] PG stuck peering after host reboot

I'd suggest creating a tracker and uploading a full debug log from the
primary so we can look at this in more detail.

On Mon, Feb 13, 2017 at 9:11 PM,   wrote:
> Hi Brad,
>
> I could not tell you that as `ceph pg 1.323 query` never completes, it just 
> hangs there.
>
> On 11/02/2017, 00:40, "Brad Hubbard"  wrote:
>
> On Thu, Feb 9, 2017 at 3:36 AM,   wrote:
> > Hi Corentin,
> >
> > I've tried that, the primary hangs when trying to injectargs so I set 
> the option in the config file and restarted all OSDs in the PG, it came up 
> with:
> >
> > pg 1.323 is remapped+peering, acting 
> [595,1391,2147483647,127,937,362,267,320,7,634,716]
> >
> > Still can't query the PG, no error messages in the logs of osd.240.
> > The logs on osd.595 and osd.7 still fill up with the same messages.
>
> So what does "peering_blocked_by_detail" show in that case since it
> can no longer show "peering_blocked_by_history_les_bound"?
>
> >
> > Regards,
> >
> > George
> > 
> > From: Corentin Bonneton [l...@titin.fr]
> > Sent: 08 February 2017 16:31
> > To: Vasilakakos, George (STFC,RAL,SC)
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] PG stuck peering after host reboot
> >
> > Hello,
> >
> > I already had the case, I applied the parameter 
> (osd_find_best_info_ignore_history_les) to all the osd that have reported the 
> queries blocked.
> >
> > --
> > Cordialement,
> > CEO FEELB | Corentin BONNETON
> > cont...@feelb.io
> >
> > Le 8 févr. 2017 à 17:17, 
> george.vasilaka...@stfc.ac.uk a écrit :
> >
> > Hi Ceph folks,
> >
> > I have a cluster running Jewel 10.2.5 using a mix EC and replicated 
> pools.
> >
> > After rebooting a host last night, one PG refuses to complete peering
> >
> > pg 1.323 is stuck inactive for 73352.498493, current state peering, 
> last acting [595,1391,240,127,937,362,267,320,7,634,716]
> >
> > Restarting OSDs or hosts does nothing to help, or sometimes results in 
> things like this:
> >
> > pg 1.323 is remapped+peering, acting 
> [2147483647,1391,240,127,937,362,267,320,7,634,716]
> >
> >
> > The host that was rebooted is home to osd.7 (8). If I go onto it to 
> look at the logs for osd.7 this is what I see:
> >
> > $ tail -f /var/log/ceph/ceph-osd.7.log
> > 2017-02-08 15:41:00.445247 7f5fcc2bd700  0 -- 
> XXX.XXX.XXX.172:6905/20510 >> XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400 
> sd=34 :42828 s=2 pgs=319 cs=471 l=0 c=0x7f6070086700).fault, initiating 
> reconnect
> >
> > I'm assuming that in IP1:port1/PID1 >> IP2:port2/PID2 the >> indicates 
> the direction of communication. I've traced these to osd.7 (rank 8 in the 
> stuck PG) reaching out to osd.595 (the primary in the stuck PG).
> >
> > Meanwhile, looking at the logs of osd.595 I see this:
> >
> > $ tail -f /var/log/ceph/ceph-osd.595.log
> > 2017-02-08 15:41:15.760708 7f1765673700  0 -- 
> XXX.XXX.XXX.192:6921/55371 >> XXX.XXX.XXX.172:6905/20510 pipe(0x7f17b2911400 
> sd=101 :6921 s=0 pgs=0 cs=0 l=0 c=0x7f17b7beaf00).accept connect_seq 478 vs 
> existing 477 state standby
> > 2017-02-08 15:41:20.768844 7f1765673700  0 bad crc in front 1941070384 
> != exp 3786596716
> >
> > which again shows osd.595 reaching out to osd.7 and from what I could 
> gather the CRC problem is about messaging.
> >
> > Google searching has yielded nothing particularly useful on how to get 
> this unstuck.
> >
> > ceph pg 1.323 query seems to hang forever but it completed once last 
> night and I noticed this:
> >
> >"peering_blocked_by_detail": [
> >{
> >"detail": "peering_blocked_by_history_les_bound"
> >}
> >
> > We have seen this before and it was cleared by setting 
> osd_find_best_info_ignore_history_les to true for the first two OSDs on the 
> stuck PGs (this was on a 3 replica pool). This hasn't worked in this case and 
> I suspect the option needs to be set on either a majority of OSDs or enough k 
> number of OSDs to be able to use their data and ignore history.
> >
> > We would really appreciate any guidance and/or help the community can 
> offer!
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Cheers,
> Brad
>
>



--
Cheers,
Brad

Re: [ceph-users] PG stuck peering after host reboot

2017-02-16 Thread george.vasilakakos
Hi folks,

I have just made a tracker for this issue: http://tracker.ceph.com/issues/18960
I used ceph-post-file to upload some logs from the primary OSD for the troubled 
PG.

Any help would be appreciated.

If we can't get it to peer, we'd like to at least get it unstuck, even if it 
means data loss.

What's the proper way to go about doing that?

Best regards,

George

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of 
george.vasilaka...@stfc.ac.uk [george.vasilaka...@stfc.ac.uk]
Sent: 14 February 2017 10:27
To: bhubb...@redhat.com; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] PG stuck peering after host reboot

Hi Brad,

I'll be doing so later in the day.

Thanks,

George

From: Brad Hubbard [bhubb...@redhat.com]
Sent: 13 February 2017 22:03
To: Vasilakakos, George (STFC,RAL,SC); Ceph Users
Subject: Re: [ceph-users] PG stuck peering after host reboot

I'd suggest creating a tracker and uploading a full debug log from the
primary so we can look at this in more detail.

On Mon, Feb 13, 2017 at 9:11 PM,   wrote:
> Hi Brad,
>
> I could not tell you that as `ceph pg 1.323 query` never completes, it just 
> hangs there.
>
> On 11/02/2017, 00:40, "Brad Hubbard"  wrote:
>
> On Thu, Feb 9, 2017 at 3:36 AM,   wrote:
> > Hi Corentin,
> >
> > I've tried that, the primary hangs when trying to injectargs so I set 
> the option in the config file and restarted all OSDs in the PG, it came up 
> with:
> >
> > pg 1.323 is remapped+peering, acting 
> [595,1391,2147483647,127,937,362,267,320,7,634,716]
> >
> > Still can't query the PG, no error messages in the logs of osd.240.
> > The logs on osd.595 and osd.7 still fill up with the same messages.
>
> So what does "peering_blocked_by_detail" show in that case since it
> can no longer show "peering_blocked_by_history_les_bound"?
>
> >
> > Regards,
> >
> > George
> > 
> > From: Corentin Bonneton [l...@titin.fr]
> > Sent: 08 February 2017 16:31
> > To: Vasilakakos, George (STFC,RAL,SC)
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] PG stuck peering after host reboot
> >
> > Hello,
> >
> > I already had the case, I applied the parameter 
> (osd_find_best_info_ignore_history_les) to all the osd that have reported the 
> queries blocked.
> >
> > --
> > Cordialement,
> > CEO FEELB | Corentin BONNETON
> > cont...@feelb.io
> >
> > Le 8 févr. 2017 à 17:17, 
> george.vasilaka...@stfc.ac.uk a écrit :
> >
> > Hi Ceph folks,
> >
> > I have a cluster running Jewel 10.2.5 using a mix EC and replicated 
> pools.
> >
> > After rebooting a host last night, one PG refuses to complete peering
> >
> > pg 1.323 is stuck inactive for 73352.498493, current state peering, 
> last acting [595,1391,240,127,937,362,267,320,7,634,716]
> >
> > Restarting OSDs or hosts does nothing to help, or sometimes results in 
> things like this:
> >
> > pg 1.323 is remapped+peering, acting 
> [2147483647,1391,240,127,937,362,267,320,7,634,716]
> >
> >
> > The host that was rebooted is home to osd.7 (8). If I go onto it to 
> look at the logs for osd.7 this is what I see:
> >
> > $ tail -f /var/log/ceph/ceph-osd.7.log
> > 2017-02-08 15:41:00.445247 7f5fcc2bd700  0 -- 
> XXX.XXX.XXX.172:6905/20510 >> XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400 
> sd=34 :42828 s=2 pgs=319 cs=471 l=0 c=0x7f6070086700).fault, initiating 
> reconnect
> >
> > I'm assuming that in IP1:port1/PID1 >> IP2:port2/PID2 the >> indicates 
> the direction of communication. I've traced these to osd.7 (rank 8 in the 
> stuck PG) reaching out to osd.595 (the primary in the stuck PG).
> >
> > Meanwhile, looking at the logs of osd.595 I see this:
> >
> > $ tail -f /var/log/ceph/ceph-osd.595.log
> > 2017-02-08 15:41:15.760708 7f1765673700  0 -- 
> XXX.XXX.XXX.192:6921/55371 >> XXX.XXX.XXX.172:6905/20510 pipe(0x7f17b2911400 
> sd=101 :6921 s=0 pgs=0 cs=0 l=0 c=0x7f17b7beaf00).accept connect_seq 478 vs 
> existing 477 state standby
> > 2017-02-08 15:41:20.768844 7f1765673700  0 bad crc in front 1941070384 
> != exp 3786596716
> >
> > which again shows osd.595 reaching out to osd.7 and from what I could 
> gather the CRC problem is about messaging.
> >
> > Google searching has yielded nothing particularly useful on how to get 
> this unstuck.
> >
> > ceph pg 1.323 query seems to hang forever but it completed once last 
> night and I noticed this:
> >
> >"peering_blocked_by_detail": [
> >{
> >"detail": "peering_blocked_by_history_les_bound"
> >}
> >
> > We have seen this before and it was cleared by se

Re: [ceph-users] PG stuck peering after host reboot

2017-02-17 Thread george.vasilakakos
Hi Wido,

In an effort to get the cluster to complete peering that PG (as we need to be 
able to use our pool) we have removed osd.595 from the CRUSH map to allow a new 
mapping to occur.
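
(For clarity, the removal was along these lines; a sketch, not the exact 
transcript:)

ceph osd crush remove osd.595
ceph osd tree | grep -w 595    # confirm it is gone from the CRUSH hierarchy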

When I left the office yesterday osd.307 had replaced osd.595 in the up set but 
the acting set had CRUSH_ITEM_NONE in place of the primary. The PG was in a 
remapped+peering state and recovery was taking place for the other PGs that 
lived on that OSD.
Worth noting that osd.307 is on the same host as osd.595.

We’ll have a look on osd.595 like you suggested.



On 17/02/2017, 06:48, "Wido den Hollander"  wrote:

>
>> Op 16 februari 2017 om 14:55 schreef george.vasilaka...@stfc.ac.uk:
>> 
>> 
>> Hi folks,
>> 
>> I have just made a tracker for this issue: 
>> http://tracker.ceph.com/issues/18960
>> I used ceph-post-file to upload some logs from the primary OSD for the 
>> troubled PG.
>> 
>> Any help would be appreciated.
>> 
>> If we can't get it to peer, we'd like to at least get it unstuck, even if it 
>> means data loss.
>> 
>> What's the proper way to go about doing that?
>
>Can you try this:
>
>1. Go to the host
>2. Stop OSD 595
>3. ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-595 --op info 
>--pgid 1.323
>
>What does osd.595 think about that PG?
>
>You could even try 'rm-past-intervals' with the object-store tool, but that 
>might be a bit dangerous. Wouldn't do that immediately.
>
>Wido
>
>> 
>> Best regards,
>> 
>> George
>> 
>> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of 
>> george.vasilaka...@stfc.ac.uk [george.vasilaka...@stfc.ac.uk]
>> Sent: 14 February 2017 10:27
>> To: bhubb...@redhat.com; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] PG stuck peering after host reboot
>> 
>> Hi Brad,
>> 
>> I'll be doing so later in the day.
>> 
>> Thanks,
>> 
>> George
>> 
>> From: Brad Hubbard [bhubb...@redhat.com]
>> Sent: 13 February 2017 22:03
>> To: Vasilakakos, George (STFC,RAL,SC); Ceph Users
>> Subject: Re: [ceph-users] PG stuck peering after host reboot
>> 
>> I'd suggest creating a tracker and uploading a full debug log from the
>> primary so we can look at this in more detail.
>> 
>> On Mon, Feb 13, 2017 at 9:11 PM,   wrote:
>> > Hi Brad,
>> >
>> > I could not tell you that as `ceph pg 1.323 query` never completes, it 
>> > just hangs there.
>> >
>> > On 11/02/2017, 00:40, "Brad Hubbard"  wrote:
>> >
>> > On Thu, Feb 9, 2017 at 3:36 AM,   wrote:
>> > > Hi Corentin,
>> > >
>> > > I've tried that, the primary hangs when trying to injectargs so I 
>> > set the option in the config file and restarted all OSDs in the PG, it 
>> > came up with:
>> > >
>> > > pg 1.323 is remapped+peering, acting 
>> > [595,1391,2147483647,127,937,362,267,320,7,634,716]
>> > >
>> > > Still can't query the PG, no error messages in the logs of osd.240.
>> > > The logs on osd.595 and osd.7 still fill up with the same messages.
>> >
>> > So what does "peering_blocked_by_detail" show in that case since it
>> > can no longer show "peering_blocked_by_history_les_bound"?
>> >
>> > >
>> > > Regards,
>> > >
>> > > George
>> > > 
>> > > From: Corentin Bonneton [l...@titin.fr]
>> > > Sent: 08 February 2017 16:31
>> > > To: Vasilakakos, George (STFC,RAL,SC)
>> > > Cc: ceph-users@lists.ceph.com
>> > > Subject: Re: [ceph-users] PG stuck peering after host reboot
>> > >
>> > > Hello,
>> > >
>> > > I already had the case, I applied the parameter 
>> > (osd_find_best_info_ignore_history_les) to all the osd that have reported 
>> > the queries blocked.
>> > >
>> > > --
>> > > Cordialement,
>> > > CEO FEELB | Corentin BONNETON
>> > > cont...@feelb.io
>> > >
>> > > Le 8 févr. 2017 à 17:17, 
>> > george.vasilaka...@stfc.ac.uk a 
>> > écrit :
>> > >
>> > > Hi Ceph folks,
>> > >
>> > > I have a cluster running Jewel 10.2.5 using a mix EC and replicated 
>> > pools.
>> > >
>> > > After rebooting a host last night, one PG refuses to complete peering
>> > >
>> > > pg 1.323 is stuck inactive for 73352.498493, current state peering, 
>> > last acting [595,1391,240,127,937,362,267,320,7,634,716]
>> > >
>> > > Restarting OSDs or hosts does nothing to help, or sometimes results 
>> > in things like this:
>> > >
>> > > pg 1.323 is remapped+peering, acting 
>> > [2147483647,1391,240,127,937,362,267,320,7,634,716]
>> > >
>> > >
>> > > The host that was rebooted is home to osd.7 (8). If I go onto it to 
>> > look at the logs for osd.7 this is what I see:
>> > >
>> > > $ tail -f /var/log/ceph/ceph-osd.7.log
>> > > 2017-02-08 15:41:00.445247 7f5fcc2bd700  0 -- 
>> > XXX.XXX.XXX.172:6905/20510 >> XXX.XXX.XXX.192:6921/55371 
>> > pipe(0x7f6074a0b400 sd=34 :42828 s=

Re: [ceph-users] PG stuck peering after host reboot

2017-02-17 Thread george.vasilakakos
On 17/02/2017, 12:00, "Wido den Hollander"  wrote:



>
>> Op 17 februari 2017 om 11:09 schreef george.vasilaka...@stfc.ac.uk:
>> 
>> 
>> Hi Wido,
>> 
>> In an effort to get the cluster to complete peering that PG (as we need to 
>> be able to use our pool) we have removed osd.595 from the CRUSH map to allow 
>> a new mapping to occur.
>> 
>> When I left the office yesterday osd.307 had replaced osd.595 in the up set 
>> but the acting set had CRUSH_ITEM_NONE in place of the primary. The PG was 
>> in a remapped+peering state and recovery was taking place for the other PGs 
>> that lived on that OSD.
>> Worth noting that osd.307 is on the same host as osd.595.
>> 
>> We’ll have a look on osd.595 like you suggested.
>> 
>
>If the PG still doesn't recover do the same on osd.307 as I think that 'ceph 
>pg X query' still hangs?

Will do, what’s even more worrying is that ceph pg X query also hangs on PG 
with osd.1391 as the primary (which is rank 1 in the stuck PG). osd.1391 had 3 
threads running 100% CPU during recovery, osd.307 was idling. OSDs 595 and 1391 
were also unresponsive to ceph tell but responsive to ceph daemon. I’ve not 
tried it with 307.

>
>The info from ceph-objectstore-tool might shed some more light on this PG.
>
>Wido
>
>> 
>> 
>> On 17/02/2017, 06:48, "Wido den Hollander"  wrote:
>> 
>> >
>> >> Op 16 februari 2017 om 14:55 schreef george.vasilaka...@stfc.ac.uk:
>> >> 
>> >> 
>> >> Hi folks,
>> >> 
>> >> I have just made a tracker for this issue: 
>> >> http://tracker.ceph.com/issues/18960
>> >> I used ceph-post-file to upload some logs from the primary OSD for the 
>> >> troubled PG.
>> >> 
>> >> Any help would be appreciated.
>> >> 
>> >> If we can't get it to peer, we'd like to at least get it unstuck, even if 
>> >> it means data loss.
>> >> 
>> >> What's the proper way to go about doing that?
>> >
>> >Can you try this:
>> >
>> >1. Go to the host
>> >2. Stop OSD 595
>> >3. ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-595 --op info 
>> >--pgid 1.323
>> >
>> >What does osd.595 think about that PG?
>> >
>> >You could even try 'rm-past-intervals' with the object-store tool, but that 
>> >might be a bit dangerous. Wouldn't do that immediately.
>> >
>> >Wido
>> >
>> >> 
>> >> Best regards,
>> >> 
>> >> George
>> >> 
>> >> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of 
>> >> george.vasilaka...@stfc.ac.uk [george.vasilaka...@stfc.ac.uk]
>> >> Sent: 14 February 2017 10:27
>> >> To: bhubb...@redhat.com; ceph-users@lists.ceph.com
>> >> Subject: Re: [ceph-users] PG stuck peering after host reboot
>> >> 
>> >> Hi Brad,
>> >> 
>> >> I'll be doing so later in the day.
>> >> 
>> >> Thanks,
>> >> 
>> >> George
>> >> 
>> >> From: Brad Hubbard [bhubb...@redhat.com]
>> >> Sent: 13 February 2017 22:03
>> >> To: Vasilakakos, George (STFC,RAL,SC); Ceph Users
>> >> Subject: Re: [ceph-users] PG stuck peering after host reboot
>> >> 
>> >> I'd suggest creating a tracker and uploading a full debug log from the
>> >> primary so we can look at this in more detail.
>> >> 
>> >> On Mon, Feb 13, 2017 at 9:11 PM,   wrote:
>> >> > Hi Brad,
>> >> >
>> >> > I could not tell you that as `ceph pg 1.323 query` never completes, it 
>> >> > just hangs there.
>> >> >
>> >> > On 11/02/2017, 00:40, "Brad Hubbard"  wrote:
>> >> >
>> >> > On Thu, Feb 9, 2017 at 3:36 AM,   
>> >> > wrote:
>> >> > > Hi Corentin,
>> >> > >
>> >> > > I've tried that, the primary hangs when trying to injectargs so I 
>> >> > set the option in the config file and restarted all OSDs in the PG, it 
>> >> > came up with:
>> >> > >
>> >> > > pg 1.323 is remapped+peering, acting 
>> >> > [595,1391,2147483647,127,937,362,267,320,7,634,716]
>> >> > >
>> >> > > Still can't query the PG, no error messages in the logs of 
>> >> > osd.240.
>> >> > > The logs on osd.595 and osd.7 still fill up with the same 
>> >> > messages.
>> >> >
>> >> > So what does "peering_blocked_by_detail" show in that case since it
>> >> > can no longer show "peering_blocked_by_history_les_bound"?
>> >> >
>> >> > >
>> >> > > Regards,
>> >> > >
>> >> > > George
>> >> > > 
>> >> > > From: Corentin Bonneton [l...@titin.fr]
>> >> > > Sent: 08 February 2017 16:31
>> >> > > To: Vasilakakos, George (STFC,RAL,SC)
>> >> > > Cc: ceph-users@lists.ceph.com
>> >> > > Subject: Re: [ceph-users] PG stuck peering after host reboot
>> >> > >
>> >> > > Hello,
>> >> > >
>> >> > > I already had the case, I applied the parameter 
>> >> > (osd_find_best_info_ignore_history_les) to all the osd that have 
>> >> > reported the queries blocked.
>> >> > >
>> >> > > --
>> >> > > Cordialement,
>> >> > > CEO FEELB | Corentin BONNETON
>> >> > > cont...@feelb.io
>> >> > >
>> >> > > Le 8 févr. 2017 à 17:17, 
>> >> > george.vasi

Re: [ceph-users] PG stuck peering after host reboot

2017-02-20 Thread george.vasilakakos
Hi Wido,

Just to make sure I have everything straight,

> If the PG still doesn't recover do the same on osd.307 as I think that 'ceph 
> pg X query' still hangs?

> The info from ceph-objectstore-tool might shed some more light on this PG.

You mean run the objectstore command on 307, not remove it from the CRUSH map 
as well. Am I correct?

Assuming I am, I tried this command on all OSDs in that PG, including 307, and 
they all say "PG '1.323' not found", which is weird and worrying.

Best regards,

George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck peering after host reboot

2017-02-21 Thread george.vasilakakos
> Can you for the sake of redundancy post your sequence of commands you 
> executed and their output?

[root@ceph-sn852 ~]# systemctl stop ceph-osd@307
[root@ceph-sn852 ~]# ceph-objectstore-tool --data-path 
/var/lib/ceph/osd/ceph-307 --op info --pgid 1.323
PG '1.323' not found
[root@ceph-sn852 ~]# systemctl start ceph-osd@307

I did the same thing for 307 (the new up, but not acting, primary) and all the 
OSDs in the original set (including 595). The output was exactly the same. I 
don't have the whole session log handy from all those sessions, but here's a 
sample from one that's easy to pick out:

[root@ceph-sn832 ~]# systemctl stop ceph-osd@7
[root@ceph-sn832 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 
--op info --pgid 1.323
PG '1.323' not found
[root@ceph-sn832 ~]# systemctl start ceph-osd@7
[root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/
0.18_head/  11.1c8s5_TEMP/  13.3b_head/ 1.74s1_TEMP/2.256s6_head/   
2.c3s10_TEMP/   3.b9s4_head/
0.18_TEMP/  1.16s1_head/13.3b_TEMP/ 1.8bs9_head/2.256s6_TEMP/   
2.c4s3_head/3.b9s4_TEMP/
1.106s10_head/  1.16s1_TEMP/1.3a6s0_head/   1.8bs9_TEMP/2.2d5s2_head/   
2.c4s3_TEMP/4.34s10_head/
1.106s10_TEMP/  1.274s5_head/   1.3a6s0_TEMP/   2.174s10_head/  2.2d5s2_TEMP/   
2.dbs7_head/4.34s10_TEMP/
11.12as10_head/ 1.274s5_TEMP/   1.3e4s9_head/   2.174s10_TEMP/  2.340s8_head/   
2.dbs7_TEMP/commit_op_seq
11.12as10_TEMP/ 1.2ds8_head/1.3e4s9_TEMP/   2.1c1s10_head/  2.340s8_TEMP/   
3.159s3_head/   meta/
11.148s2_head/  1.2ds8_TEMP/14.1a_head/ 2.1c1s10_TEMP/  2.36es10_head/  
3.159s3_TEMP/   nosnap
11.148s2_TEMP/  1.323s8_head/   14.1a_TEMP/ 2.1d0s6_head/   2.36es10_TEMP/  
3.170s1_head/   omap/
11.165s6_head/  1.323s8_TEMP/   1.6fs9_head/2.1d0s6_TEMP/   2.3d3s10_head/  
3.170s1_TEMP/   
11.165s6_TEMP/  13.32_head/ 1.6fs9_TEMP/2.1efs2_head/   2.3d3s10_TEMP/  
3.1aas5_head/   
11.1c8s5_head/  13.32_TEMP/ 1.74s1_head/2.1efs2_TEMP/   2.c3s10_head/   
3.1aas5_TEMP/   
[root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/1.323s8_
1.323s8_head/ 1.323s8_TEMP/ 
[root@ceph-sn832 ~]# ll 
/var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_
DIR_3/ DIR_7/ DIR_B/ DIR_F/ 
[root@ceph-sn832 ~]# ll 
/var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_3/DIR_
DIR_0/ DIR_1/ DIR_2/ DIR_3/ DIR_4/ DIR_5/ DIR_6/ DIR_7/ DIR_8/ DIR_9/ DIR_A/ 
DIR_B/ DIR_C/ DIR_D/ DIR_E/ DIR_F/
[root@ceph-sn832 ~]# ll 
/var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_3/DIR_1/
total 271276
-rw-r--r--. 1 ceph ceph 8388608 Feb  3 22:07 
datadisk\srucio\sdata16\u13TeV\s11\sad\sDAOD\uTOPQ4.09383728.\u000436.pool.root.1.0001__head_2BA91323__1__8

> If you run a find in the data directory of the OSD, does that PG show up?

OSDs 595 (used to be 0), 1391(1), 240(2), 7(7, the one that started this) have 
a 1.323_headsX directory. OSD 307 does not.
I have not checked the other OSDs in the PG yet.

Wido

>
> Best regards,
>
> George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck peering after host reboot

2017-02-21 Thread george.vasilakakos
I have noticed something odd with the ceph-objectstore-tool command:

It always reports PG X not found, even on healthy OSDs/PGs. The 'list' op works 
on both healthy and unhealthy PGs.


From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of 
george.vasilaka...@stfc.ac.uk [george.vasilaka...@stfc.ac.uk]
Sent: 21 February 2017 10:17
To: w...@42on.com; ceph-users@lists.ceph.com; bhubb...@redhat.com
Subject: Re: [ceph-users] PG stuck peering after host reboot

> Can you for the sake of redundancy post your sequence of commands you 
> executed and their output?

[root@ceph-sn852 ~]# systemctl stop ceph-osd@307
[root@ceph-sn852 ~]# ceph-objectstore-tool --data-path 
/var/lib/ceph/osd/ceph-307 --op info --pgid 1.323
PG '1.323' not found
[root@ceph-sn852 ~]# systemctl start ceph-osd@307

I did the same thing for 307 (new up but not acting primary) and all the OSDs 
in the original set (including 595). The output was the exact same. I don't 
have the whole session log handy from all those sessions but here's a sample 
from one that's easy to pick out:

[root@ceph-sn832 ~]# systemctl stop ceph-osd@7
[root@ceph-sn832 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 
--op info --pgid 1.323
PG '1.323' not found
[root@ceph-sn832 ~]# systemctl start ceph-osd@7
[root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/
0.18_head/  11.1c8s5_TEMP/  13.3b_head/ 1.74s1_TEMP/2.256s6_head/   
2.c3s10_TEMP/   3.b9s4_head/
0.18_TEMP/  1.16s1_head/13.3b_TEMP/ 1.8bs9_head/2.256s6_TEMP/   
2.c4s3_head/3.b9s4_TEMP/
1.106s10_head/  1.16s1_TEMP/1.3a6s0_head/   1.8bs9_TEMP/2.2d5s2_head/   
2.c4s3_TEMP/4.34s10_head/
1.106s10_TEMP/  1.274s5_head/   1.3a6s0_TEMP/   2.174s10_head/  2.2d5s2_TEMP/   
2.dbs7_head/4.34s10_TEMP/
11.12as10_head/ 1.274s5_TEMP/   1.3e4s9_head/   2.174s10_TEMP/  2.340s8_head/   
2.dbs7_TEMP/commit_op_seq
11.12as10_TEMP/ 1.2ds8_head/1.3e4s9_TEMP/   2.1c1s10_head/  2.340s8_TEMP/   
3.159s3_head/   meta/
11.148s2_head/  1.2ds8_TEMP/14.1a_head/ 2.1c1s10_TEMP/  2.36es10_head/  
3.159s3_TEMP/   nosnap
11.148s2_TEMP/  1.323s8_head/   14.1a_TEMP/ 2.1d0s6_head/   2.36es10_TEMP/  
3.170s1_head/   omap/
11.165s6_head/  1.323s8_TEMP/   1.6fs9_head/2.1d0s6_TEMP/   2.3d3s10_head/  
3.170s1_TEMP/
11.165s6_TEMP/  13.32_head/ 1.6fs9_TEMP/2.1efs2_head/   2.3d3s10_TEMP/  
3.1aas5_head/
11.1c8s5_head/  13.32_TEMP/ 1.74s1_head/2.1efs2_TEMP/   2.c3s10_head/   
3.1aas5_TEMP/
[root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/1.323s8_
1.323s8_head/ 1.323s8_TEMP/
[root@ceph-sn832 ~]# ll 
/var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_
DIR_3/ DIR_7/ DIR_B/ DIR_F/
[root@ceph-sn832 ~]# ll 
/var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_3/DIR_
DIR_0/ DIR_1/ DIR_2/ DIR_3/ DIR_4/ DIR_5/ DIR_6/ DIR_7/ DIR_8/ DIR_9/ DIR_A/ 
DIR_B/ DIR_C/ DIR_D/ DIR_E/ DIR_F/
[root@ceph-sn832 ~]# ll 
/var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_3/DIR_1/
total 271276
-rw-r--r--. 1 ceph ceph 8388608 Feb  3 22:07 
datadisk\srucio\sdata16\u13TeV\s11\sad\sDAOD\uTOPQ4.09383728.\u000436.pool.root.1.0001__head_2BA91323__1__8

> If you run a find in the data directory of the OSD, does that PG show up?

OSDs 595 (used to be 0), 1391(1), 240(2), 7(7, the one that started this) have 
a 1.323_headsX directory. OSD 307 does not.
I have not checked the other OSDs in the PG yet.

Wido

>
> Best regards,
>
> George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck peering after host reboot

2017-02-22 Thread george.vasilakakos
Brad Hubbard pointed out on the bug tracker 
(http://tracker.ceph.com/issues/18960) that, for EC, we need to add the shard 
suffix to the PGID parameter in the command, e.g. --pgid 1.323s0
The command now works and produces the same output as PG query.
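
In other words (the shard ids below match the OSDs' ranks in the acting set, 
e.g. the 1.323s8_head directory seen earlier on osd.7; a sketch for 
illustration):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --op info --pgid 1.323s8
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-595 --op info --pgid 1.323s0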

To avoid spamming the list, I've put the outputs of this command for 307, 595 
and 1391 in a Gist 
(https://gist.github.com/gvasilak/3bf155a89a4b2703e639c4326df01460).


From: Wido den Hollander [w...@42on.com]
Sent: 22 February 2017 12:18
To: Vasilakakos, George (STFC,RAL,SC); ceph-users@lists.ceph.com
Subject: RE: [ceph-users] PG stuck peering after host reboot

> Op 21 februari 2017 om 15:35 schreef george.vasilaka...@stfc.ac.uk:
>
>
> I have noticed something odd with the ceph-objectstore-tool command:
>
> It always reports PG X not found even on healthly OSDs/PGs. The 'list' op 
> works on both and unhealthy PGs.
>

Are you sure you are supplying the correct PG ID?

I just tested with (Jewel 10.2.5):

$ ceph pg ls-by-osd 5
$ systemctl stop ceph-osd@5
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 --op info --pgid 
10.d0
$ systemctl start ceph-osd@5

Can you double-check that?

It's weird that the PG can't be found on those OSDs by the tool.

Wido


> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of 
> george.vasilaka...@stfc.ac.uk [george.vasilaka...@stfc.ac.uk]
> Sent: 21 February 2017 10:17
> To: w...@42on.com; ceph-users@lists.ceph.com; bhubb...@redhat.com
> Subject: Re: [ceph-users] PG stuck peering after host reboot
>
> > Can you for the sake of redundancy post your sequence of commands you 
> > executed and their output?
>
> [root@ceph-sn852 ~]# systemctl stop ceph-osd@307
> [root@ceph-sn852 ~]# ceph-objectstore-tool --data-path 
> /var/lib/ceph/osd/ceph-307 --op info --pgid 1.323
> PG '1.323' not found
> [root@ceph-sn852 ~]# systemctl start ceph-osd@307
>
> I did the same thing for 307 (new up but not acting primary) and all the OSDs 
> in the original set (including 595). The output was the exact same. I don't 
> have the whole session log handy from all those sessions but here's a sample 
> from one that's easy to pick out:
>
> [root@ceph-sn832 ~]# systemctl stop ceph-osd@7
> [root@ceph-sn832 ~]# ceph-objectstore-tool --data-path 
> /var/lib/ceph/osd/ceph-7 --op info --pgid 1.323
> PG '1.323' not found
> [root@ceph-sn832 ~]# systemctl start ceph-osd@7
> [root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/
> 0.18_head/  11.1c8s5_TEMP/  13.3b_head/ 1.74s1_TEMP/2.256s6_head/ 
>   2.c3s10_TEMP/   3.b9s4_head/
> 0.18_TEMP/  1.16s1_head/13.3b_TEMP/ 1.8bs9_head/2.256s6_TEMP/ 
>   2.c4s3_head/3.b9s4_TEMP/
> 1.106s10_head/  1.16s1_TEMP/1.3a6s0_head/   1.8bs9_TEMP/2.2d5s2_head/ 
>   2.c4s3_TEMP/4.34s10_head/
> 1.106s10_TEMP/  1.274s5_head/   1.3a6s0_TEMP/   2.174s10_head/  2.2d5s2_TEMP/ 
>   2.dbs7_head/4.34s10_TEMP/
> 11.12as10_head/ 1.274s5_TEMP/   1.3e4s9_head/   2.174s10_TEMP/  2.340s8_head/ 
>   2.dbs7_TEMP/commit_op_seq
> 11.12as10_TEMP/ 1.2ds8_head/1.3e4s9_TEMP/   2.1c1s10_head/  2.340s8_TEMP/ 
>   3.159s3_head/   meta/
> 11.148s2_head/  1.2ds8_TEMP/14.1a_head/ 2.1c1s10_TEMP/  
> 2.36es10_head/  3.159s3_TEMP/   nosnap
> 11.148s2_TEMP/  1.323s8_head/   14.1a_TEMP/ 2.1d0s6_head/   
> 2.36es10_TEMP/  3.170s1_head/   omap/
> 11.165s6_head/  1.323s8_TEMP/   1.6fs9_head/2.1d0s6_TEMP/   
> 2.3d3s10_head/  3.170s1_TEMP/
> 11.165s6_TEMP/  13.32_head/ 1.6fs9_TEMP/2.1efs2_head/   
> 2.3d3s10_TEMP/  3.1aas5_head/
> 11.1c8s5_head/  13.32_TEMP/ 1.74s1_head/2.1efs2_TEMP/   2.c3s10_head/ 
>   3.1aas5_TEMP/
> [root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/1.323s8_
> 1.323s8_head/ 1.323s8_TEMP/
> [root@ceph-sn832 ~]# ll 
> /var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_
> DIR_3/ DIR_7/ DIR_B/ DIR_F/
> [root@ceph-sn832 ~]# ll 
> /var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_3/DIR_
> DIR_0/ DIR_1/ DIR_2/ DIR_3/ DIR_4/ DIR_5/ DIR_6/ DIR_7/ DIR_8/ DIR_9/ DIR_A/ 
> DIR_B/ DIR_C/ DIR_D/ DIR_E/ DIR_F/
> [root@ceph-sn832 ~]# ll 
> /var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_3/DIR_1/
> total 271276
> -rw-r--r--. 1 ceph ceph 8388608 Feb  3 22:07 
> datadisk\srucio\sdata16\u13TeV\s11\sad\sDAOD\uTOPQ4.09383728.\u000436.pool.root.1.0001__head_2BA91323__1__8
>
> > If you run a find in the data directory of the OSD, does that PG show up?
>
> OSDs 595 (used to be 0), 1391(1), 240(2), 7(7, the one that started this) 
> have a 1.323_headsX directory. OSD 307 does not.
> I have not checked the other OSDs in the PG yet.
>
> Wido
>
> >
> > Best regards,
> >
> > George
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ce

Re: [ceph-users] PG stuck peering after host reboot

2017-02-22 Thread george.vasilakakos
So what I see there is this for osd.307:

"empty": 1,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 0,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
}

last_epoch_started is 0 and empty is 1. The other OSDs are reporting 
last_epoch_started 16806 and empty 0.

I noticed that too and was wondering why it never completed recovery and joined 
the acting set.

> If you stop osd.307 and maybe mark it as out, does that help?

No, I see the same thing I saw when I took 595 out: 

[root@ceph-mon1 ~]# ceph pg map 1.323
osdmap e22392 pg 1.323 (1.323) -> up 
[985,1391,240,127,937,362,267,320,7,634,716] acting 
[2147483647,1391,240,127,937,362,267,320,7,634,716]

Another OSD gets chosen as the primary but never becomes acting on its own. 

Another 11 PGs are reporting being undersized and having ITEM_NONE in their 
acting sets as well.

> 
> From: Wido den Hollander [w...@42on.com]
> Sent: 22 February 2017 12:18
> To: Vasilakakos, George (STFC,RAL,SC); ceph-users@lists.ceph.com
> Subject: RE: [ceph-users] PG stuck peering after host reboot
>
> > Op 21 februari 2017 om 15:35 schreef george.vasilaka...@stfc.ac.uk:
> >
> >
> > I have noticed something odd with the ceph-objectstore-tool command:
> >
> > It always reports PG X not found even on healthly OSDs/PGs. The 'list' op 
> > works on both and unhealthy PGs.
> >
>
> Are you sure you are supplying the correct PG ID?
>
> I just tested with (Jewel 10.2.5):
>
> $ ceph pg ls-by-osd 5
> $ systemctl stop ceph-osd@5
> $ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 --op info --pgid 
> 10.d0
> $ systemctl start ceph-osd@5
>
> Can you double-check that?
>
> It's weird that the PG can't be found on those OSDs by the tool.
>
> Wido
>
>
> > 
> > From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of 
> > george.vasilaka...@stfc.ac.uk [george.vasilaka...@stfc.ac.uk]
> > Sent: 21 February 2017 10:17
> > To: w...@42on.com; ceph-users@lists.ceph.com; bhubb...@redhat.com
> > Subject: Re: [ceph-users] PG stuck peering after host reboot
> >
> > > Can you for the sake of redundancy post your sequence of commands you 
> > > executed and their output?
> >
> > [root@ceph-sn852 ~]# systemctl stop ceph-osd@307
> > [root@ceph-sn852 ~]# ceph-objectstore-tool --data-path 
> > /var/lib/ceph/osd/ceph-307 --op info --pgid 1.323
> > PG '1.323' not found
> > [root@ceph-sn852 ~]# systemctl start ceph-osd@307
> >
> > I did the same thing for 307 (new up but not acting primary) and all the 
> > OSDs in the original set (including 595). The output was the exact same. I 
> > don't have the whole session log handy from all those sessions but here's a 
> > sample from one that's easy to pick out:
> >
> > [root@ceph-sn832 ~]# systemctl stop ceph-osd@7
> > [root@ceph-sn832 ~]# ceph-objectstore-tool --data-path 
> > /var/lib/ceph/osd/ceph-7 --op info --pgid 1.323
> > PG '1.323' not found
> > [root@ceph-sn832 ~]# systemctl start ceph-osd@7
> > [root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/
> > 0.18_head/  11.1c8s5_TEMP/  13.3b_head/ 1.74s1_TEMP/
> > 2.256s6_head/   2.c3s10_TEMP/   3.b9s4_head/
> > 0.18_TEMP/  1.16s1_head/13.3b_TEMP/ 1.8bs9_head/
> > 2.256s6_TEMP/   2.c4s3_head/3.b9s4_TEMP/
> > 1.106s10_head/  1.16s1_TEMP/1.3a6s0_head/   1.8bs9_TEMP/
> > 2.2d5s2_head/   2.c4s3_TEMP/4.34s10_head/
> > 1.106s10_TEMP/  1.274s5_head/   1.3a6s0_TEMP/   2.174s10_head/  
> > 2.2d5s2_TEMP/   2.dbs7_head/4.34s10_TEMP/
> > 11.12as10_head/ 1.274s5_TEMP/   1.3e4s9_head/   2.174s10_TEMP/  
> > 2.340s8_head/   2.dbs7_TEMP/commit_op_seq
> > 11.12as10_TEMP/ 1.2ds8_head/1.3e4s9_TEMP/   2.1c1s10_head/  
> > 2.340s8_TEMP/   3.159s3_head/   meta/
> > 11.148s2_head/  1.2ds8_TEMP/14.1a_head/ 2.1c1s10_TEMP/  
> > 2.36es10_head/  3.159s3_TEMP/   nosnap
> > 11.148s2_TEMP/  1.323s8_head/   14.1a_TEMP/ 2.1d0s6_head/   
> > 2.36es10_TEMP/  3.170s1_head/   omap/
> > 11.165s6_head/  1.323s8_TEMP/   1.6fs9_head/2.1d0s6_TEMP/   
> > 2.3d3s10_head/  3.170s1_TEMP/
> > 11.165s6_TEMP/  13.32_head/ 1.6fs9_TEMP/2.1efs2_head/   
> > 2.3d3s10_TEMP/  3.1aas5_head/
> > 11.1c8s5_head/  13.32_TEMP/ 1.74s1_head/2.1efs2_TEMP/   
> > 2.c3s10_head/   3.1aas5_TEMP/
> > [root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/1.323s8_
> > 1.323s8_head/ 1.323s8_TEMP/
> > [root@ceph-sn832 ~]# ll 
> > /var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_
> > DIR_3/ DIR_7/ DIR_B/ DIR_F/
> > [root@ceph-sn832 ~]# ll 
> > /var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_3/DIR_
> > DIR_0/ DIR_1/ DIR_2/ DIR_3/ DIR_4/ DIR_5/ DIR_6/ DIR_7/ DIR_8/ DIR_9/ 
> > DIR_A/ DIR_B/ DIR_C/ DIR_D/ DIR_E/ DIR_F/
> > [root@ceph-sn832 ~]# ll 
> > /var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_3/DIR_1/
> > total 271276
> > -rw-r--r--. 1 ceph ceph 8388608 Feb  3 22:07 
> > datadisk\srucio\sdata16\u13TeV\s11\s

Re: [ceph-users] PG stuck peering after host reboot

2017-02-23 Thread george.vasilakakos
Since we need this pool to work again, we decided to accept the data loss and 
try to move on.

So far, no luck. We tried a force create but, as expected, with a PG that is not 
peering this did absolutely nothing. We also tried the rm-past-intervals and 
remove operations of ceph-objectstore-tool, as well as manually deleting the 
PG's data directories on the disks. The PG remains down+remapped with two OSDs 
failing to join the acting set. Those OSDs have been restarted multiple times to 
no avail.
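
Concretely, the sequence we attempted looked roughly like this (a sketch from 
memory, not an exact transcript; the objectstore-tool steps were run per OSD, 
with that OSD stopped, and the shard id varies with the OSD's rank):

ceph pg force_create_pg 1.323
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --op rm-past-intervals --pgid 1.323s<rank>
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --op remove --pgid 1.323s<rank>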

# ceph pg map 1.323
osdmap e23122 pg 1.323 (1.323) -> up 
[595,1391,240,127,937,362,267,320,986,634,716] acting 
[595,1391,240,127,937,362,267,320,986,2147483647,2147483647]

We have also seen some very odd behaviour. 
# ceph pg map 1.323
osdmap e22909 pg 1.323 (1.323) -> up 
[595,1391,240,127,937,362,267,320,986,634,716] acting 
[595,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]

This is straight after a restart of all OSDs in the PG, once everything else has 
settled down. From that state, restarting 595 results in:

# ceph pg map 1.323
osdmap e22921 pg 1.323 (1.323) -> up 
[595,1391,240,127,937,362,267,320,986,634,716] acting 
[2147483647,1391,240,127,937,362,267,320,986,634,716]

Restarting 595 again doesn't change this. Another restart of all OSDs in the PG 
results in the state seen above, with the last two slots replaced by ITEM_NONE.

Another strange thing is that on osd.7 (the one originally at rank 8 that was 
restarted and caused this problem) the objectstore tool fails to remove the PG 
and crashes out:

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --op remove --pgid 
1.323s8
 marking collection for removal
setting '_remove' omap key
finish_remove_pgs 1.323s8_head removing 1.323s8
 *** Caught signal (Aborted) **
 in thread 7fa713782700 thread_name:tp_fstore_op
 ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
 1: (()+0x97463a) [0x7fa71c47563a]
 2: (()+0xf370) [0x7fa71935a370]
 3: (snappy::RawUncompress(snappy::Source*, char*)+0x374) [0x7fa71abd0cd4]
 4: (snappy::RawUncompress(char const*, unsigned long, char*)+0x3d) 
[0x7fa71abd0e2d]
 5: (leveldb::ReadBlock(leveldb::RandomAccessFile*, leveldb::ReadOptions 
const&, leveldb::BlockHandle const&, leveldb::BlockContents*)+0x35e) 
[0x7fa71b08007e]
 6: (leveldb::Table::BlockReader(void*, leveldb::ReadOptions const&, 
leveldb::Slice const&)+0x276) [0x7fa71b081196]
 7: (()+0x3c820) [0x7fa71b083820]
 8: (()+0x3c9cd) [0x7fa71b0839cd]
 9: (()+0x3ca3e) [0x7fa71b083a3e]
 10: (()+0x39c75) [0x7fa71b080c75]
 11: (()+0x21e20) [0x7fa71b068e20]
 12: (()+0x223c5) [0x7fa71b0693c5]
 13: (LevelDBStore::LevelDBWholeSpaceIteratorImpl::seek_to_first(std::string 
const&)+0x3d) [0x7fa71c3ecb1d]
 14: (LevelDBStore::LevelDBTransactionImpl::rmkeys_by_prefix(std::string 
const&)+0x138) [0x7fa71c3ec028]
 15: (DBObjectMap::clear_header(std::shared_ptr, 
std::shared_ptr)+0x1d0) [0x7fa71c400a40]
 16: (DBObjectMap::_clear(std::shared_ptr, 
std::shared_ptr)+0xa1) [0x7fa71c401171]
 17: (DBObjectMap::clear(ghobject_t const&, SequencerPosition const*)+0x1ff) 
[0x7fa71c4075bf]
 18: (FileStore::lfn_unlink(coll_t const&, ghobject_t const&, SequencerPosition 
const&, bool)+0x241) [0x7fa71c2c0d41]
 19: (FileStore::_remove(coll_t const&, ghobject_t const&, SequencerPosition 
const&)+0x8e) [0x7fa71c2c171e]
 20: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, 
ThreadPool::TPHandle*)+0x433e) [0x7fa71c2d8c6e]
 21: (FileStore::_do_transactions(std::vector >&, unsigned long, 
ThreadPool::TPHandle*)+0x3b) [0x7fa71c2db75b]
 22: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x2cd) 
[0x7fa71c2dba5d]
 23: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0x7fa71c63e189]
 24: (ThreadPool::WorkThread::entry()+0x10) [0x7fa71c63f160]
 25: (()+0x7dc5) [0x7fa719352dc5]
 26: (clone()+0x6d) [0x7fa71843e73d]
Aborted
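
(In hindsight, exporting the shard before any of the destructive steps would 
have been a safer path; a sketch, run with the OSD stopped:)

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --op export --pgid 1.323s8 --file /root/pg1.323s8.export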

At this point all we want to achieve is for the PG to peer again (and soon) 
without us having to delete the pool.

Any help would be appreciated...

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of 
george.vasilaka...@stfc.ac.uk [george.vasilaka...@stfc.ac.uk]
Sent: 22 February 2017 14:35
To: w...@42on.com; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] PG stuck peering after host reboot

So what I see there is this for osd.307:

"empty": 1,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 0,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
}

last_epoch_started is 0 and empty is 1. The other OSDs are reporting 
last_epoch_started 16806 and empty 0.

I noticed that too and was wondering why it never completed recovery and joined

> If you stop osd.307 and maybe mark it as out, does that help?

No, I see the same thing I saw when I took 595 out:

[root@ceph-mon1 ~]# ceph pg map 1.323
osdmap e22392 pg 1.323 (1.323) -> up 
[985,1391,240,127,937,362,267,320,7,634,716] acting 
[2147483647,1391,2

[ceph-users] Large OSD omap directories (LevelDBs)

2017-05-19 Thread george.vasilakakos
Hello Ceph folk,


We have a Ceph cluster (info at the bottom) with some odd omap directory sizes 
on our OSDs. We're looking at 1439 OSDs where the most common omap sizes are 
15-40MB. However, a quick sampling reveals some outliers: looking at roughly the 
100 largest omaps, sizes climb to a few hundred MB, then into single-digit GB, 
and then jump sharply for the last 10 or so:

14G/var/lib/ceph/osd/ceph-769/current/omap
35G/var/lib/ceph/osd/ceph-1278/current/omap
48G/var/lib/ceph/osd/ceph-899/current/omap
49G/var/lib/ceph/osd/ceph-27/current/omap
57G/var/lib/ceph/osd/ceph-230/current/omap
58G/var/lib/ceph/osd/ceph-343/current/omap
58G/var/lib/ceph/osd/ceph-948/current/omap
60G/var/lib/ceph/osd/ceph-470/current/omap
66G/var/lib/ceph/osd/ceph-348/current/omap
67G/var/lib/ceph/osd/ceph-980/current/omap
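
(For reference, the numbers above were gathered with something along these 
lines, run on each storage node; a rough sketch assuming the default FileStore 
paths:)

for osd in /var/lib/ceph/osd/ceph-*; do
    du -sh "$osd/current/omap"
done | sort -h | tail -20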


Any omap that's 500MB when most are 25MB is worrying, but 67GB is extremely 
worrying; something doesn't seem right. The 67GB omap has 37k .sst files, and 
the oldest file in there is from Feb 21st.

Anyone seen this before who can point me in the right direction to start 
digging?

Cluster info:

ceph version 11.2.0

 monmap e8: 5 mons at {...}
election epoch 1850, quorum 0,1,2,3,4 
ceph-mon1,ceph-mon2,ceph-mon3,ceph-mon4,ceph-mon5
mgr active: ceph-mon4 standbys: ceph-mon3, ceph-mon2, ceph-mon1, 
ceph-mon5
 osdmap e27138: 1439 osds: 1439 up, 1439 in
flags sortbitwise,require_jewel_osds,require_kraken_osds
  pgmap v10911626: 5120 pgs, 21 pools, 1834 TB data, 61535 kobjects
2525 TB used, 5312 TB / 7837 TB avail
5087 active+clean
  17 active+clean+scrubbing
  16 active+clean+scrubbing+deep

Most pools have 64 PGs and hold RGW metadata. There are 3 pools with 1024 PGs 
and another 2 with 512 PGs that hold our data. The data pools all use EC 8+3; 
the auxiliary ones are replicated.

Our data is put into the pools via the libradosstriper interface, which adds 
some xattrs needed to read the data back (stripe count, stripe size, stripe unit 
size, and the original pre-striping size); the client also adds a couple of 
checksum-related attributes.
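
(As an aside, those attributes can be inspected on an individual object with 
something like the following; a sketch, and the exact striper xattr names are 
from memory rather than checked against the code:)

rados -p <pool> listxattr <object-name>
rados -p <pool> getxattr <object-name> striper.layout.stripe_count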



Thanks,

George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OSD omap directories (LevelDBs)

2017-05-23 Thread george.vasilakakos
> Your RGW buckets, how many objects in them, and do they have the index
> sharded?

> I know we have some very large & old buckets (10M+ RGW objects in a
> single bucket), with correspondingly large OMAPs wherever that bucket
> index is living (sufficently large that trying to list the entire thing
> online is fruitless). ceph's pgmap status says we have 2G RADOS objects
> however, and you're only at 61M RADOS objects.


According to radosgw-admin bucket stats, the most populous bucket contains 
568101 objects. There is no index sharding. The default.rgw.buckets.data pool 
contains 4162566 objects; I think striping is done by default with 4MB stripes.

Bear in mind RGW is a small use case for us currently.
Most of the data lives in a pool that is accessed by specialized servers that 
have plugins based on libradosstriper. That pool stores around 1.8 PB in 
32920055 objects.

One thing of note is that we have this in our ceph.conf:
filestore_xattr_use_omap=1
and libradosstriper makes use of xattrs for its striping metadata and locking 
mechanisms.

This option seems to have been removed some time ago, but the question is: 
could it have any effect? This cluster was built in January and ran Jewel initially.

I do see the xattrs in XFS, but a sampling of an omap dir from an OSD suggested 
there might be some xattrs in there too.
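
For anyone wanting to reproduce the XFS-side check, something along these lines 
works; the object path is shortened to a placeholder and getfattr comes from 
the attr package:

getfattr -d -m '.*' -e hex /var/lib/ceph/osd/ceph-348/current/<pg>_head/<stripe object file>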

I'm going to try restarting an OSD with a big omap and also extracting a copy 
of one for further inspection.
It seems to me like they might not be cleaning up old data. I'm fairly certain 
an active cluster would've compacted enough for 3 month old SSTs to go away.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OSD omap directories (LevelDBs)

2017-05-23 Thread george.vasilakakos
Hi Wido,

I see your point. I would expect omaps to grow with the number of objects, but 
multiple OSDs reaching tens of GB for their omaps seems excessive. I find it 
difficult to believe that not sharding the index for a 500k-object RGW bucket 
causes the 10 largest OSD omaps to grow to a total of 512GB, roughly 2000x the 
size of 10 average omaps. Given the relative usage of our pools, and the much 
greater prominence of our non-RGW pools on the OSDs with huge omaps, I'm not 
inclined to think this is caused by some RGW configuration (or lack thereof).

It's also worth pointing out that we've seen problems with files being slow to 
retrieve (I'm talking about rados get doing 120MB/sec on one file and 2MB/sec 
on another), and subsequently the omap of the OSD hosting the first stripe of 
those files grew from 30MB to 5GB in the span of an hour, during which the logs 
were flooded with LevelDB compaction activity.
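
Roughly speaking, figures like those come from timing a whole-object fetch from 
a client node; the pool and object names here are placeholders:

time rados -p our-data-pool get some-object /dev/null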

Best regards,

George

From: Wido den Hollander [w...@42on.com]
Sent: 23 May 2017 14:00
To: Vasilakakos, George (STFC,RAL,SC); ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Large OSD omap directories (LevelDBs)

> Op 23 mei 2017 om 13:01 schreef george.vasilaka...@stfc.ac.uk:
>
>
> > Your RGW buckets, how many objects in them, and do they have the index
> > sharded?
>
> > I know we have some very large & old buckets (10M+ RGW objects in a
> > single bucket), with correspondingly large OMAPs wherever that bucket
> > index is living (sufficently large that trying to list the entire thing
> > online is fruitless). ceph's pgmap status says we have 2G RADOS objects
> > however, and you're only at 61M RADOS objects.
>
>
> According to radosgw-admin bucket stats the most populous bucket contains 
> 568101 objects. There is no index sharding. The default.rgw.buckets.data pool 
> contains 4162566 objects, I think striping is done by default for 4MB sizes 
> stripes.
>

Without index sharding 500k objects in a bucket can already cause larger OMAP 
directories. I'd recommend that you at least start to shard them.
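
Something along these lines is the usual way to go about it; the shard count is 
just an example, and the reshard command depends on the radosgw-admin build you 
have available:

# for newly created buckets: set on the RGW host(s) in ceph.conf, restart radosgw
rgw_override_bucket_index_max_shards = 16

# for existing buckets, on releases whose radosgw-admin supports it:
radosgw-admin bucket reshard --bucket=BUCKET_NAME --num-shards=16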

Wido

> Bear in mind RGW is a small use case for us currently.
> Most of the data lives in a pool that is accessed by specialized servers that 
> have plugins based on libradosstriper. That pool stores around 1.8 PB in 
> 32920055 objects.
>
> One thing of note is that we have this:
> filestore_xattr_use_omap=1
> in our ceph.conf and libradosstriper makes use of xattrs for striping 
> metadata and locking mechanisms.
>
> This seems to have been removed some time ago but the question is could have 
> any effect? This cluster was built in January and ran Jewel initially.
>
> I do see the xattrs in XFS but a sampling of an omap dir from an OSD showed 
> like there might be some xattrs in there too.
>
> I'm going to try restarting an OSD with a big omap and also extracting a copy 
> of one for further inspection.
> It seems to me like they might not be cleaning up old data. I'm fairly 
> certain an active cluster would've compacted enough for 3 month old SSTs to 
> go away.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OSD omap directories (LevelDBs)

2017-05-24 Thread george.vasilakakos
Hi Greg,

> This does sound weird, but I also notice that in your earlier email you
> seemed to have only ~5k PGs across  ~1400 OSDs, which is a pretty
> low number. You may just have a truly horrible PG balance; can you share
> more details (eg ceph osd df)?


Our distribution is pretty bad; we're getting close to the point where the most 
filled disk approaches the nearfull ratio while already sitting at more than 
twice the cluster-wide average fill. My view is that we need to at least double 
the PG count across the cluster. Here's some data: 
https://pastebin.com/qX0LXxid
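
Back of the envelope from the numbers above, assuming the auxiliary pools are 
3x replicated: the data pools are (3x1024 + 2x512) = 4096 PGs at EC width 11, 
i.e. ~45k PG shards, plus roughly 16x64 = 1024 replicated PGs at 3 copies. 
That's ~48k placements over 1439 OSDs, or about 33 per OSD, well short of the 
~100 usually aimed for, which fits the poor balance we're seeing.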

However, I think this particular issue is down to compaction problems. The 
oldest SST files in the largest LevelDBs date back to Feb 21 (the oldest files 
in normal-sized LevelDBs are no more than a week old):

# du -sh /var/lib/ceph/osd/ceph-348/current/omap/
66G /var/lib/ceph/osd/ceph-348/current/omap/
# ll -t /var/lib/ceph/osd/ceph-348/current/omap/ | tail
-rw-r--r--. 1 ceph ceph  2109703 Feb 21 01:07 013472.sst
-rw-r--r--. 1 ceph ceph  2104172 Feb 21 01:07 013470.sst
-rw-r--r--. 1 ceph ceph  2102942 Feb 21 01:07 013468.sst
-rw-r--r--. 1 ceph ceph  2102906 Feb 21 01:04 013446.sst
-rw-r--r--. 1 ceph ceph  2102977 Feb 21 01:04 013444.sst
-rw-r--r--. 1 ceph ceph  2102667 Feb 21 01:04 013442.sst
-rw-r--r--. 1 ceph ceph  2102903 Feb 21 01:04 013440.sst
-rw-r--r--. 1 ceph ceph  172 Jan  6 15:45 LOG
-rw-r--r--. 1 ceph ceph   57 Jan  6 15:45 LOG.old
-rw-r--r--. 1 ceph ceph0 Jan  6 15:45 LOCK

The corresponding daemon has been running for a while:

# systemctl status ceph-osd@348
● ceph-osd@348.service - Ceph object storage daemon osd.348
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor 
preset: disabled)
   Active: active (running) since Mon 2017-03-13 14:23:27 GMT; 2 months 11 days 
ago

This is confirmed as being the case for the top 3 largest LevelDBs.

Given the inflation that was observed while the OSD was reporting compacting 
operations I thought this might be a compaction issue.

I have performed the following test:

I chose osd.101 which had an average sized LevelDB and proceeded to extract 
that LevelDB and poke around a bit.

- the size of osd.101's omap directory was 25M
- it contained 99627 keys
- when compacted it went down to 15M

By comparison, osd.980's omap directory:

- was 67G in size
- it contained 101773 keys
- when compacted it went down to 44M

Both omaps had similar key and value sizes.
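
For anyone who wants to repeat this, it can be reproduced on copies rather than 
the live stores, roughly as below; the key counting and compaction use 
ceph-kvstore-tool, whose argument order varies a little between releases, and 
the copy path is just an example.

# ideally taken with the OSD stopped so the DB is quiescent
cp -a /var/lib/ceph/osd/ceph-980/current/omap /root/omap-980
ceph-kvstore-tool leveldb /root/omap-980 list | wc -l
ceph-kvstore-tool leveldb /root/omap-980 compact && du -sh /root/omap-980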

We do not have any options regarding OSD LevelDB compaction set in ceph.conf, 
so OSDs compact when they see fit. This seems to work for the most part. What's 
troubling is that many of these LevelDBs go into a compacting frenzy, where the 
OSD spends upwards of an hour compacting a LevelDB, during which time the 
LevelDB actually explodes in size and then remains at that size for at least a 
couple of weeks.

This seems a bit similar to http://tracker.ceph.com/issues/13990 which Dan van 
der Ster pointed out, although not quite the same behaviour. Is there a way we 
can try to trigger the OSD to do compaction and/or manually do it and see what 
happens? How risky is this (this is our production service after all)?
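
The approaches we're aware of are a compact-on-start option (the option name is 
what we've found for the LevelDB store settings, so treat it as an assumption) 
or an offline compaction with the daemon stopped, along these lines:

# per-OSD section in ceph.conf, then restart the OSD:
leveldb_compact_on_mount = true

# or offline, against the OSD's own store (argument order differs per release):
systemctl stop ceph-osd@348
ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-348/current/omap compact
systemctl start ceph-osd@348

We'd like to hear how safe either of these is on a busy production cluster.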


Thanks,

George
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EXT :Re: ceph auth list - access denied

2016-04-05 Thread george.vasilakakos
For future reference,

You can reset your keyring's permissions using the keyring located on the
monitors at /var/lib/ceph/mon/your-mon/keyring. Point the ceph command at it
with the -k option (full path to the keyring) and you can correct this without
having to restart the cluster a couple of times.
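
For example, something along these lines (the mon directory name and the caps 
themselves are illustrative; authenticate as mon. since that is the entity the 
mon keyring holds):

ceph -n mon. -k /var/lib/ceph/mon/your-mon/keyring auth caps client.admin \
    mon 'allow *' osd 'allow *' mds 'allow *'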

On Mon, 2016-04-04 at 18:25 +, Plewes, Dave (IS) wrote:
> Oliver,
> 
> Following your recommendation to stop the cluster and restart it with 
> authentication disabled, I was able to fix the incorrect capability settings 
> on the client.admin user.  Then I re-enabled authentication and restarted 
> the cluster.  Everything is back to normal.
> 
> Thank you for your help,
> 
> Dave
> 
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Plewes, Dave (IS)
> Sent: Monday, April 04, 2016 10:34 AM
> To: Oliver Dzombic; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] EXT :Re: ceph auth list - access denied
> 
> Oliver,
> 
> Thank you for the quick response.  
> 
> I suspected that I made a mistake with the update rather than a write. 
> 
> My cluster still shows a "HEALTH_OK" with all 3 osds and all 3 mons in quorum 
> but I suspect that totally killed auth.
> 
> I will look at re-establishing authentication according to your 
> recommendation.
> 
> Thanks,
> 
> Dave
> 
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Oliver Dzombic
> Sent: Monday, April 04, 2016 10:14 AM
> To: ceph-users@lists.ceph.com
> Subject: EXT :Re: [ceph-users] ceph auth list - access denied
> 
> Hi David,
> 
> you killed your auth.
> 
> When updating auth caps, you always have to specify the full set of caps, not 
> only the change.
> 
> That is, when you update, the entity's caps are completely replaced according 
> to your change.
> 
> So it's not adding, it's replacing.
> 
> ---
> 
> You will have to start your cluster now without auth, giving your key again 
> ALL rights on everything.
> 
> Then restart the cluster again with authentication enabled.
> 
> --
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 04.04.2016 um 16:07 schrieb Plewes, Dave (IS):
> > All,
> > 
> >  
> > 
> > I am fairly new to using Ceph.  I have successfully established a Ceph 
> > Cluster with 3 OSDs of 8TB each for a total cluster of 24TBs.
> > Recently, I was attempting to use Libvirt with Ceph RBD as documented here:
> > http://docs.ceph.com/docs/hammer/rbd/libvirt/
> > 
> >  
> > 
> > I was able to create (and list) a pool and the image using the 
> > following
> > commands:
> > 
> >  
> > 
> > 1)  ceph osd pool create libvirt-pool 128 128
> > 
> > 2)  ceph osd lspools
> > 
> > 3)  rbd create libvirt-image --size 1024 --pool libvirt-pool
> > 
> > 4)  rbd ls libvirt-pool
> > 
> > 5)  rbd --image libvirt-image -p libvirt-pool info
> > 
> > Then, I wanted to modify the "client.admin" user to allow access to 
> > the pool using the following command:
> > 
> >  
> > 
> > ceph auth caps client.admin mon 'allow r' osd 'allow rwx pool=libvirt-pool'
> > 
> > which returned response of:
> > 
> >  
> > 
> >updated caps for client.admin
> > 
> >  
> > 
> > However, I can no longer execute a "ceph auth list" command.  I 
> > receive the following access denied:
> > 
> >  
> > 
> > Error EACCES: access denied
> > 
> >  
> > 
> >  
> > 
> > How can I recover access to "ceph auth"?
> > 
> >  
> > 
> > Prior to the "ceph auth caps" command I could execute the "ceph auth" 
> > command with no problem, and it returned the auth entries for osd.0, 
> > osd.1, osd.2, client.admin, client.bootstrap-mds, client.bootstrap-osd, 
> > and client.bootstrap-rgw.
> > 
> >  
> > 
> >  
> > 
> > Any assistance will be helpful and appreciated.
> > 
> >  
> > 
> > Thanks,
> > 
> >  
> > 
> > Dave P.
> > 
> >  
> > 
> >  
> > 
> >  
> > 
> >  
> > 
> > 
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com