[ceph-users] Re: Ceph Health error right after starting balancer

2019-11-01 Thread Paul Emmerich
Looks like you didn't tell the whole story; please post the *full*
output of ceph -s and ceph osd df tree.

Wild guess: you need to increase "mon max pg per osd"
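For reference, a minimal sketch of how that limit could be checked and raised
(the value 400 here is purely illustrative, not a recommendation):

    ceph config get mon mon_max_pg_per_osd
    ceph config set global mon_max_pg_per_osd 400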

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Oct 31, 2019 at 8:17 PM Thomas <74cmo...@gmail.com> wrote:
>
> This is the output of OSD.270 that remains with slow requests blocked
> even after restarting.
> What's the interpretation of it?
>
> root@ld5507:~# ceph daemon osd.270 dump_blocked_ops
> {
>  "ops": [
>  {
>  "description": "osd_pg_create(e293649 59.b:267033
> 59.2c:267033)",
>  "initiated_at": "2019-10-31 19:22:13.563017",
>  "age": 2785.269856041,
>  "duration": 2785.269905628,
>  "type_data": {
>  "flag_point": "started",
>  "events": [
>  {
>  "time": "2019-10-31 19:22:13.563017",
>  "event": "initiated"
>  },
>  {
>  "time": "2019-10-31 19:22:13.563017",
>  "event": "header_read"
>  },
>  {
>  "time": "2019-10-31 19:22:13.563011",
>  "event": "throttled"
>  },
>  {
>  "time": "2019-10-31 19:22:13.563024",
>  "event": "all_read"
>  },
>  {
>  "time": "2019-10-31 20:07:43.881441",
>  "event": "dispatched"
>  },
>  {
>  "time": "2019-10-31 20:07:43.881472",
>  "event": "wait for new map"
>  },
>  {
>  "time": "2019-10-31 20:07:44.665714",
>  "event": "started"
>  }
>  ]
>  }
>  },
>  {
>  "description": "osd_pg_create(e293650 59.b:267033
> 59.2c:267033)",
>  "initiated_at": "2019-10-31 19:23:16.150040",
>  "age": 2722.682833165,
>  "duration": 2722.683007228,
>  "type_data": {
>  "flag_point": "delayed",
>  "events": [
>  {
>  "time": "2019-10-31 19:23:16.150040",
>  "event": "initiated"
>  },
>  {
>  "time": "2019-10-31 19:23:16.150040",
>  "event": "header_read"
>  },
>  {
>  "time": "2019-10-31 19:23:16.150035",
>  "event": "throttled"
>  },
>  {
>  "time": "2019-10-31 19:23:16.150055",
>  "event": "all_read"
>  },
>  {
>  "time": "2019-10-31 20:07:43.882197",
>  "event": "dispatched"
>  },
>  {
>  "time": "2019-10-31 20:07:43.882198",
>  "event": "wait for new map"
>  }
>  ]
>  }
>  },
>  {
>  "description": "osd_pg_create(e293651 59.b:267033
> 59.2c:267033)",
>  "initiated_at": "2019-10-31 19:23:17.779034",
>  "age": 2721.0538393319998,
>  "duration": 2721.0541152350002,
>  "type_data": {
>  "flag_point": "delayed",
>  "events": [
>  {
>  "time": "2019-10-31 19:23:17.779034",
>  "event": "initiated"
>  },
>  {
>  "time": "2019-10-31 19:23:17.779034",
>  "event": "header_read"
>  },
>  {
>  "time": "2019-10-31 19:23:17.779027",
>  "event": "throttled"
>  },
>  {
>  "time": "2019-10-31 19:23:17.779044",
>  "event": "all_read"
>  },
>  {
>  "time": "2019-10-31 20:07:43.882326",
>  "event": "dispatched"
>  },
>  {
>  "time": "2019-10-31 20:07:43.882328",
>  "event": "wait for new map"
>  }
>  ]
>  }
>  },
>  {
>  "description": "osd_pg_create

[ceph-users] Re: V/v Multiple pool for data in Ceph object

2019-11-01 Thread tuan dung
Ok, thanks.



Br,
--
Dương Tuấn Dũng
Email: dungdt.aicgr...@gmail.com
Tel: 0986153686


On Wed, Oct 30, 2019 at 1:51 PM Konstantin Shalygin  wrote:

> On 10/29/19 3:45 PM, tuan dung wrote:
>
> I have a cluster running Ceph object storage on version 14.2.1. I want to create 2
> bucket-data pools for security purposes:
> + one bucket-data pool for public client access from the internet (name
> *zone1.rgw.buckets.data-pub*)
> + one bucket-data pool for private client access from the local network (name
> *zone1.rgw.buckets.data-priv*)
> Each bucket-data pool has its own access key: a public access key
> (for the public pool) and a private access key (for the private pool).
> Can you give me a recommendation or a best practice that you've used?
> What needs to be done?
> Or give me your best solution for securing a Ceph object cluster with
> both public and private client access?
>
> You need to add an extra placement target. This setup is pretty useless IMHO
> because you will still be going through one rgw zone.
>
>
>
> k
>
>
>
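For readers of the archive: a rough sketch of what an extra placement target
could look like with radosgw-admin. The placement id and pool names below are
only illustrative, loosely following the pool names discussed above.

    radosgw-admin zonegroup placement add --rgw-zonegroup=default \
        --placement-id=private-placement
    radosgw-admin zone placement add --rgw-zone=zone1 \
        --placement-id=private-placement \
        --data-pool=zone1.rgw.buckets.data-priv \
        --index-pool=zone1.rgw.buckets.index \
        --data-extra-pool=zone1.rgw.buckets.non-ec
    radosgw-admin period update --commit

A user or bucket can then be pointed at the new placement target, e.g. via the
user's default_placement or the S3 LocationConstraint, while access keys remain
per-user as usual.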
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: subtrees have overcommitted (target_size_bytes / target_size_ratio)

2019-11-01 Thread Lars Täuber
Is there anybody who can explain the overcommitment calculation?

Thanks


Mon, 28 Oct 2019 11:24:54 +0100
Lars Täuber  ==> ceph-users  :
> Is there a way to get rid of these warnings with the autoscaler activated,
> besides adding new OSDs?
> 
> Yet I couldn't get a satisfactory answer to the question why this all happens.
> 
> ceph osd pool autoscale-status :
>  POOL   SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET 
> RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE 
>  cephfs_data  122.2T1.5165.4T  1.1085
> 0.8500   1.01024  on 
> 
> versus
> 
>  ceph df  :
> RAW STORAGE:
> CLASS SIZEAVAIL   USEDRAW USED %RAW USED 
> hdd   165 TiB  41 TiB 124 TiB  124 TiB 74.95 
>  
> POOLS:
> POOLID STORED OBJECTS USED%USED 
> MAX AVAIL 
> cephfs_data  1 75 TiB  49.31M 122 TiB 87.16   
>  12 TiB 
> 
> 
> It seems that the overcommitment is wrongly calculated. Isn't the RATE 
> already used to calculate the SIZE?
> 
> It seems USED(df) = SIZE(autoscale-status)
> Isn't the RATE already taken into account here?
> 
> Could someone please explain the numbers to me?
> 
> 
> Thanks!
> Lars
> 
> Fri, 25 Oct 2019 07:42:58 +0200
> Lars Täuber  ==> Nathan Fish  :
> > Hi Nathan,
> > 
> > Thu, 24 Oct 2019 10:59:55 -0400
> > Nathan Fish  ==> Lars Täuber  :  
> > > Ah, I see! The BIAS reflects the number of placement groups it should
> > > create. Since cephfs metadata pools are usually very small, but have
> > > many objects and high IO, the autoscaler gives them 4x the number of
> > > placement groups that it would normally give for that amount of data.
> > > 
> > ah ok, I understand.
> >   
> > > So, your cephfs_data is set to a ratio of 0.9, and cephfs_metadata to
> > > 0.3? Are the two pools using entirely different device classes, so
> > > they are not sharing space?
> > 
> > Yes, the metadata is on SSDs and the data on HDDs.
> >   
> > > Anyway, I see that your overcommit is only "1.031x". So if you set
> > > cephfs_data to 0.85, it should go away.
> > 
> > This is not the case. I set the target_ratio to 0.7 and get this:
> > 
> >  POOL   SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET 
> > RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE 
> >  cephfs_metadata  15736M3.0 2454G  0.0188
> > 0.3000   4.0 256  on
> >  cephfs_data  122.2T1.5165.4T  1.1085
> > 0.7000   1.01024  on
> > 
> > The RATIO seems to have nothing to do with the target_ratio, only with the
> > SIZE and the RAW_CAPACITY.
> > Because the pool is still receiving more data, the SIZE increases and
> > therefore the RATIO increases.
> > The RATIO seems to be calculated by this formula
> > RATIO = SIZE * RATE / RAW_CAPACITY.
> > 
> > This is what I don't understand. The data in the cephfs_data pool seems to 
> > need more space than the raw capacity of the cluster provides. Hence the 
> > situation is called "overcommitment".
> > 
> > But why is this only the case when the autoscaler is active?
> > 
> > Thanks
> > Lars
> >   
> > > 
> > > On Thu, Oct 24, 2019 at 10:09 AM Lars Täuber  wrote:
> > > >
> > > > Thanks Nathan for your answer,
> > > >
> > > > but I set the Target Ratio to 0.9. It is the cephfs_data pool that
> > > > causes the trouble.
> > > >
> > > > The 4.0 is the BIAS from the cephfs_metadata pool. This "BIAS" is not 
> > > > explained on the page linked below. So I don't know its meaning.
> > > >
> > > > How can a pool be overcommitted when it is the only pool on a set of
> > > > OSDs?
> > > >
> > > > Best regards,
> > > > Lars
> > > >
> > > > Thu, 24 Oct 2019 09:39:51 -0400
> > > > Nathan Fish  ==> Lars Täuber  :   
> > > >
> > > > > The formatting is mangled on my phone, but if I am reading it 
> > > > > correctly,
> > > > > you have set Target Ratio to 4.0. This means you have told the 
> > > > > balancer
> > > > > that this pool will occupy 4x the space of your whole cluster, and to
> > > > > optimize accordingly. This is naturally a problem. Setting it to 0 
> > > > > will
> > > > > clear the setting and allow the autobalancer to work.
> > > > >
> > > > > On Thu., Oct. 24, 2019, 5:18 a.m. Lars Täuber,  
> > > > > wrote:
> > > > >  
> > > > > > This question is answered here:
> > > > > > https://ceph.io/rados/new-in-nautilus-pg-merging-and-autotuning/
> > > > > >
> > > > > > But it tells me that there is more data stored in the pool than the 
> > > > > > raw
> > > > > > capacity provides (taking the replication factor RATE into account) 
> > > > > > hence
> > > > > > the RATIO being above 1.0 .
> > > > > >
> > > > > > How come this is the case? Is data stored outside of the pool?
> > > > > > How come this is only the case when the autoscaler is active?
> > > > > >
> > > > > > Thanks
> > > > > > Lars
> > > >
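As a worked check of the formula quoted above, plugging in the numbers from the
autoscale-status output gives 122.2 TiB * 1.5 / 165.4 TiB = ca. 1.108, which
matches the reported RATIO of 1.1085; i.e. the RATE is applied to a SIZE that,
per the ceph df output, already includes replication. For completeness, the
target ratio itself is a per-pool setting, e.g.:

    ceph osd pool set cephfs_data target_size_ratio 0.85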

[ceph-users] Re: subtrees have overcommitted (target_size_bytes / target_size_ratio)

2019-11-01 Thread Sage Weil
This was fixed a few weeks back.  It should be resolved in 14.2.5.

https://tracker.ceph.com/issues/41567
https://github.com/ceph/ceph/pull/31100

sage


On Fri, 1 Nov 2019, Lars Täuber wrote:

> Is there anybody who can explain the overcommitment calculation?
> 
> Thanks
> 
> 
> Mon, 28 Oct 2019 11:24:54 +0100
> Lars Täuber  ==> ceph-users  :
> > Is there a way to get rid of these warnings with the autoscaler activated,
> > besides adding new OSDs?
> > 
> > Yet I couldn't get a satisfactory answer to the question why this all 
> > happens.
> > 
> > ceph osd pool autoscale-status :
> >  POOL   SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET 
> > RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE 
> >  cephfs_data  122.2T1.5165.4T  1.1085
> > 0.8500   1.01024  on 
> > 
> > versus
> > 
> >  ceph df  :
> > RAW STORAGE:
> > CLASS SIZEAVAIL   USEDRAW USED %RAW USED 
> > hdd   165 TiB  41 TiB 124 TiB  124 TiB 74.95 
> >  
> > POOLS:
> > POOLID STORED OBJECTS USED%USED 
> > MAX AVAIL 
> > cephfs_data  1 75 TiB  49.31M 122 TiB 87.16 
> >12 TiB 
> > 
> > 
> > It seems that the overcommitment is wrongly calculated. Isn't the RATE 
> > already used to calculate the SIZE?
> > 
> > It seems USED(df) = SIZE(autoscale-status)
> > Isn't the RATE already taken into account here?
> > 
> > Could someone please explain the numbers to me?
> > 
> > 
> > Thanks!
> > Lars
> > 
> > Fri, 25 Oct 2019 07:42:58 +0200
> > Lars Täuber  ==> Nathan Fish  :
> > > Hi Nathan,
> > > 
> > > Thu, 24 Oct 2019 10:59:55 -0400
> > > Nathan Fish  ==> Lars Täuber  :  
> > > > Ah, I see! The BIAS reflects the number of placement groups it should
> > > > create. Since cephfs metadata pools are usually very small, but have
> > > > many objects and high IO, the autoscaler gives them 4x the number of
> > > > placement groups that it would normally give for that amount of data.
> > > > 
> > > ah ok, I understand.
> > >   
> > > > So, your cephfs_data is set to a ratio of 0.9, and cephfs_metadata to
> > > > 0.3? Are the two pools using entirely different device classes, so
> > > > they are not sharing space?
> > > 
> > > Yes, the metadata is on SSDs and the data on HDDs.
> > >   
> > > > Anyway, I see that your overcommit is only "1.031x". So if you set
> > > > cephfs_data to 0.85, it should go away.
> > > 
> > > This is not the case. I set the target_ratio to 0.7 and get this:
> > > 
> > >  POOL   SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET 
> > > RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE 
> > >  cephfs_metadata  15736M3.0 2454G  0.0188
> > > 0.3000   4.0 256  on
> > >  cephfs_data  122.2T1.5165.4T  1.1085
> > > 0.7000   1.01024  on
> > > 
> > > The ratio seems to have nothing to do with the target_ratio but the SIZE 
> > > and the RAW_CAPACITY.
> > > Because the pool is still getting more data the SIZE increases and 
> > > therefore the RATIO increases.
> > > The RATIO seems to be calculated by this formula
> > > RATIO = SIZE * RATE / RAW_CAPACITY.
> > > 
> > > This is what I don't understand. The data in the cephfs_data pool seems 
> > > to need more space than the raw capacity of the cluster provides. Hence 
> > > the situation is called "overcommitment".
> > > 
> > > But why is this only the case when the autoscaler is active?
> > > 
> > > Thanks
> > > Lars
> > >   
> > > > 
> > > > On Thu, Oct 24, 2019 at 10:09 AM Lars Täuber  wrote:   
> > > >  
> > > > >
> > > > > Thanks Nathan for your answer,
> > > > >
> > > > > but I set the Target Ratio to 0.9. It is the cephfs_data pool
> > > > > that causes the trouble.
> > > > >
> > > > > The 4.0 is the BIAS from the cephfs_metadata pool. This "BIAS" is not 
> > > > > explained on the page linked below. So I don't know its meaning.
> > > > >
> > > > > How can a pool be overcommitted when it is the only pool on a set of
> > > > > OSDs?
> > > > >
> > > > > Best regards,
> > > > > Lars
> > > > >
> > > > > Thu, 24 Oct 2019 09:39:51 -0400
> > > > > Nathan Fish  ==> Lars Täuber  : 
> > > > >  
> > > > > > The formatting is mangled on my phone, but if I am reading it 
> > > > > > correctly,
> > > > > > you have set Target Ratio to 4.0. This means you have told the 
> > > > > > balancer
> > > > > > that this pool will occupy 4x the space of your whole cluster, and 
> > > > > > to
> > > > > > optimize accordingly. This is naturally a problem. Setting it to 0 
> > > > > > will
> > > > > > clear the setting and allow the autobalancer to work.
> > > > > >
> > > > > > On Thu., Oct. 24, 2019, 5:18 a.m. Lars Täuber,  
> > > > > > wrote:
> > > > > >  
> > > > > > > This question is answered here:
> > > > > > > https://ceph.io/rados/new-in-nautilus-pg-merging-and-autotunin

[ceph-users] Re: subtrees have overcommitted (target_size_bytes / target_size_ratio)

2019-11-01 Thread Lars Täuber
Thanks a lot!

Lars

Fri, 1 Nov 2019 13:03:25 + (UTC)
Sage Weil  ==> Lars Täuber  :
> This was fixed a few weeks back.  It should be resolved in 14.2.5.
> 
> https://tracker.ceph.com/issues/41567
> https://github.com/ceph/ceph/pull/31100
> 
> sage
> 
> 
> On Fri, 1 Nov 2019, Lars Täuber wrote:
> 
> > Is there anybody who can explain the overcommitment calculation?
> > 
> > Thanks
> > 
> > 
> > Mon, 28 Oct 2019 11:24:54 +0100
> > Lars Täuber  ==> ceph-users  :  
> > > Is there a way to get rid of these warnings with the autoscaler activated,
> > > besides adding new OSDs?
> > > 
> > > Yet I couldn't get a satisfactory answer to the question why this all 
> > > happens.
> > > 
> > > ceph osd pool autoscale-status :
> > >  POOL   SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET 
> > > RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE 
> > >  cephfs_data  122.2T1.5165.4T  1.1085
> > > 0.8500   1.01024  on 
> > > 
> > > versus
> > > 
> > >  ceph df  :
> > > RAW STORAGE:
> > > CLASS SIZEAVAIL   USEDRAW USED %RAW USED 
> > > hdd   165 TiB  41 TiB 124 TiB  124 TiB 74.95 
> > >  
> > > POOLS:
> > > POOLID STORED OBJECTS USED%USED   
> > >   MAX AVAIL 
> > > cephfs_data  1 75 TiB  49.31M 122 TiB 87.16   
> > >  12 TiB 
> > > 
> > > 
> > > It seems that the overcommitment is wrongly calculated. Isn't the RATE 
> > > already used to calculate the SIZE?
> > > 
> > > It seems USED(df) = SIZE(autoscale-status)
> > > Isn't the RATE already taken into account here?
> > > 
> > > Could someone please explain the numbers to me?
> > > 
> > > 
> > > Thanks!
> > > Lars
> > > 
> > > Fri, 25 Oct 2019 07:42:58 +0200
> > > Lars Täuber  ==> Nathan Fish  :  
> > > > Hi Nathan,
> > > > 
> > > > Thu, 24 Oct 2019 10:59:55 -0400
> > > > Nathan Fish  ==> Lars Täuber  :   
> > > >  
> > > > > Ah, I see! The BIAS reflects the number of placement groups it should
> > > > > create. Since cephfs metadata pools are usually very small, but have
> > > > > many objects and high IO, the autoscaler gives them 4x the number of
> > > > > placement groups that it would normally give for that amount of data.
> > > > >   
> > > > ah ok, I understand.
> > > > 
> > > > > So, your cephfs_data is set to a ratio of 0.9, and cephfs_metadata to
> > > > > 0.3? Are the two pools using entirely different device classes, so
> > > > > they are not sharing space?  
> > > > 
> > > > Yes, the metadata is on SSDs and the data on HDDs.
> > > > 
> > > > > Anyway, I see that your overcommit is only "1.031x". So if you set
> > > > > cephfs_data to 0.85, it should go away.  
> > > > 
> > > > This is not the case. I set the target_ratio to 0.7 and get this:
> > > > 
> > > >  POOL   SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  
> > > > TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE 
> > > >  cephfs_metadata  15736M3.0 2454G  0.0188   
> > > >  0.3000   4.0 256  on
> > > >  cephfs_data  122.2T1.5165.4T  1.1085   
> > > >  0.7000   1.01024  on
> > > > 
> > > > The ratio seems to have nothing to do with the target_ratio but the 
> > > > SIZE and the RAW_CAPACITY.
> > > > Because the pool is still getting more data the SIZE increases and 
> > > > therefore the RATIO increases.
> > > > The RATIO seems to be calculated by this formula
> > > > RATIO = SIZE * RATE / RAW_CAPACITY.
> > > > 
> > > > This is what I don't understand. The data in the cephfs_data pool seems 
> > > > to need more space than the raw capacity of the cluster provides. Hence 
> > > > the situation is called "overcommitment".
> > > > 
> > > > But why is this only the case when the autoscaler is active?
> > > > 
> > > > Thanks
> > > > Lars
> > > > 
> > > > > 
> > > > > On Thu, Oct 24, 2019 at 10:09 AM Lars Täuber  wrote: 
> > > > >  
> > > > > >
> > > > > > Thanks Nathan for your answer,
> > > > > >
> > > > > > but I set the Target Ratio to 0.9. It is the cephfs_data pool
> > > > > > that causes the trouble.
> > > > > >
> > > > > > The 4.0 is the BIAS from the cephfs_metadata pool. This "BIAS" is 
> > > > > > not explained on the page linked below. So I don't know its meaning.
> > > > > >
> > > > > > How can a pool be overcommitted when it is the only pool on a set of
> > > > > > OSDs?
> > > > > >
> > > > > > Best regards,
> > > > > > Lars
> > > > > >
> > > > > > Thu, 24 Oct 2019 09:39:51 -0400
> > > > > > Nathan Fish  ==> Lars Täuber  
> > > > > > :
> > > > > > > The formatting is mangled on my phone, but if I am reading it 
> > > > > > > correctly,
> > > > > > > you have set Target Ratio to 4.0. This means you have told the 
> > > > > > > balancer
> > > > > > > that this pool will occupy 4x the space of your whole cluster, 
> > > > > > > and to
> > > > > > > opti

[ceph-users] Re: Ceph Health error right after starting balancer

2019-11-01 Thread Thomas

Hi Paul,

the situation has changed in the meantime.

However, I can reproduce a similar behaviour.
This means:
- I disable the balancer (ceph balancer off)
- and then start reweighting a specific OSD (ceph osd reweight 134 1.0)

The cluster immediately reports slow requests.

root@ld3955:~# ceph health detail
HEALTH_WARN 434 slow requests are blocked > 32 sec; mon ld5505 is low on 
available space

REQUEST_SLOW 434 slow requests are blocked > 32 sec
    131 ops are blocked > 131.072 sec
    270 ops are blocked > 65.536 sec
    33 ops are blocked > 32.768 sec
    osd.66 has blocked requests > 32.768 sec
    osds 19,65,67,426 have blocked requests > 65.536 sec
    osds 
0,2,3,5,6,7,8,16,24,28,29,30,32,34,35,36,37,38,39,40,41,59,60,61,62,63,64,68,69,70,71,72,73,74,75,173,174,178,180,181,184,185
,186,187,188,268,269,270,271,368,369,370,420,421,423,424,429,431,432,433,434,435,436 
have blocked requests > 131.072 sec

MON_DISK_LOW mon ld5505 is low on available space
    mon.ld5505 has 24% avail

root@ld3955:~# ceph -s
  cluster:
    id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
    health: HEALTH_WARN
    453 slow requests are blocked > 32 sec
    mon ld5505 is low on available space

  services:
    mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 23h)
    mgr: ld5505(active, since 22h), standbys: ld5506, ld5507, ld5508
    mds: cephfs:1 {0=ld4465=up:active} 1 up:standby
    osd: 442 osds: 442 up, 442 in; 5 remapped pgs

  data:
    pools:   6 pools, 8312 pgs
    objects: 63.92M objects, 244 TiB
    usage:   731 TiB used, 800 TiB / 1.5 PiB avail
    pgs: 37702/191053577 objects misplaced (0.020%)
 8249 active+clean
 29   active+clean+scrubbing+deep
 29   active+clean+scrubbing
 3    active+remapped+backfill_wait
 2    active+remapped+backfilling

  io:
    client:   1.4 KiB/s rd, 87 MiB/s wr, 1 op/s rd, 22 op/s wr
    recovery: 34 MiB/s, 8 objects/s

In this example I have reweighted an HDD, but many of the OSDs that have
blocked requests (0,2,3,5,6,7,8) are SSDs.
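As an aside (not part of the report above): if the slow requests are purely a
side effect of the data movement triggered by the reweight, one common
mitigation is to temporarily cap backfill/recovery concurrency while the
cluster rebalances; the values below are illustrative and cluster-dependent.

    ceph tell 'osd.*' injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
    # revert to the previous values once the cluster is healthy again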


root@ld3955:~# ceph osd df tree
ID  CLASS WEIGHT REWEIGHT SIZE    RAW USE  DATA OMAP META 
AVAIL   %USE  VAR  PGS STATUS TYPE NAME
-17   1363.19983    - 1.3 PiB  725 TiB  724 TiB  31 MiB  1.4 TiB 
637 TiB 53.25 1.12   -    root hdd_strgbox
-43    349.43994    - 349 TiB  171 TiB  171 TiB 6.3 MiB  300 GiB 
178 TiB 48.91 1.02   -    host ld4257-hdd_strgbox
371   hdd    7.28000  1.0 7.3 TiB  3.5 TiB  3.5 TiB 180 KiB  5.9 GiB 
3.8 TiB 47.84 1.00 118 up osd.371
372   hdd    7.28000  1.0 7.3 TiB  3.5 TiB  3.5 TiB 128 KiB  6.2 GiB 
3.7 TiB 48.65 1.02 120 up osd.372
373   hdd    7.28000  1.0 7.3 TiB  3.5 TiB  3.5 TiB   8 KiB  6.1 GiB 
3.7 TiB 48.71 1.02 120 up osd.373
374   hdd    7.28000  1.0 7.3 TiB  3.4 TiB  3.4 TiB  68 KiB  6.0 GiB 
3.8 TiB 47.34 0.99 117 up osd.374
375   hdd    7.28000  1.0 7.3 TiB  3.9 TiB  3.9 TiB 208 KiB  6.7 GiB 
3.4 TiB 53.49 1.12 132 up osd.375
376   hdd    7.28000  1.0 7.3 TiB  3.1 TiB  3.1 TiB  72 KiB  5.7 GiB 
4.2 TiB 42.11 0.88 104 up osd.376
377   hdd    7.28000  1.0 7.3 TiB  3.1 TiB  3.1 TiB 120 KiB  5.7 GiB 
4.2 TiB 42.13 0.88 104 up osd.377
378   hdd    7.28000  1.0 7.3 TiB  3.4 TiB  3.4 TiB 176 KiB  6.0 GiB 
3.8 TiB 47.37 0.99 117 up osd.378
379   hdd    7.28000  1.0 7.3 TiB  3.6 TiB  3.6 TiB  32 KiB  6.2 GiB 
3.7 TiB 49.47 1.04 122 up osd.379
380   hdd    7.28000  1.0 7.3 TiB  3.3 TiB  3.3 TiB 168 KiB  5.8 GiB 
4.0 TiB 45.33 0.95 112 up osd.380
381   hdd    7.28000  1.0 7.3 TiB  3.7 TiB  3.7 TiB 284 KiB  6.4 GiB 
3.6 TiB 50.30 1.05 124 up osd.381
382   hdd    7.28000  1.0 7.3 TiB  3.4 TiB  3.4 TiB  12 KiB  5.8 GiB 
3.9 TiB 46.92 0.98 116 up osd.382
383   hdd    7.28000  1.0 7.3 TiB  3.6 TiB  3.6 TiB 172 KiB  6.2 GiB 
3.7 TiB 49.75 1.04 123 up osd.383
384   hdd    7.28000  1.0 7.3 TiB  3.8 TiB  3.8 TiB  60 KiB  7.3 GiB 
3.5 TiB 51.88 1.09 128 up osd.384
385   hdd    7.28000  1.0 7.3 TiB  3.4 TiB  3.4 TiB  76 KiB  6.5 GiB 
3.8 TiB 47.10 0.99 116 up osd.385
386   hdd    7.28000  1.0 7.3 TiB  3.9 TiB  3.9 TiB  84 KiB  7.2 GiB 
3.4 TiB 53.83 1.13 133 up osd.386
387   hdd    7.28000  1.0 7.3 TiB  3.6 TiB  3.6 TiB 200 KiB  6.3 GiB 
3.7 TiB 49.36 1.03 122 up osd.387
388   hdd    7.28000  1.0 7.3 TiB  3.4 TiB  3.4 TiB  72 KiB  5.8 GiB 
3.9 TiB 46.62 0.98 115 up osd.388
389   hdd    7.28000  1.0 7.3 TiB  3.8 TiB  3.8 TiB 276 KiB  6.6 GiB 
3.5 TiB 52.24 1.09 128 up osd.389
390   hdd    7.28000  1.0 7.3 TiB  3.1 TiB  3.1 TiB  72 KiB  5.3 GiB 
4.2 TiB 42.24 0.88 104 up osd.390
391   hdd    7.28000  1.0 7.3 TiB  3.4 TiB  3.4 TiB 148 KiB  5.8 GiB 
3.9 TiB 46.57 0.98 115 up osd.391
392   hdd    

[ceph-users] mgr daemons becoming unresponsive

2019-11-01 Thread Oliver Freyermuth
Dear Cephers,

this is a 14.2.4 cluster with device health metrics enabled - for about a day
now, all mgr daemons have been going "silent" on me after a few hours, i.e.
"ceph -s" shows:

  cluster:
id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
health: HEALTH_WARN
no active mgr
1/3 mons down, quorum mon001,mon002
 
  services:
mon:3 daemons, quorum mon001,mon002 (age 57m), out of quorum: mon003
mgr:no daemons active (since 56m)
...
(the third mon has a planned outage and will come back in a few days)

Checking the logs of the mgr daemons, I find some "reset" messages at the time 
when it goes "silent", first for the first mgr:

2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap 
v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 
TiB avail
2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on 
v2:10.160.16.1:6800/401248
2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap 
v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 
TiB avail

and a bit later, on the standby mgr:

2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap 
v1798: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 
active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on 
v2:10.160.16.2:6800/352196
2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap 
v1799: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 
active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail

Interestingly, the dashboard still works, but presents outdated information,
showing for example zero I/O going on.
I believe this started to happen mainly after the third mon went into the known 
downtime, but I am not fully sure if this was the trigger, since the cluster is 
still growing. 
It may also have been the addition of 24 more OSDs. 


I also find other messages in the mgr logs which seem problematic, but I am not 
sure they are related:
--
2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading OMAP: 
[errno 22] Failed to operate read op for oid 
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in 
put_device_metrics
ioctx.operate_read_op(op, devid)
  File "rados.pyx", line 516, in rados.requires.wrapper.validate_func 
(/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUIL
D/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
  File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op 
(/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
InvalidArgumentError: [errno 22] Failed to operate read op for oid 
--
or:
--
2019-11-01 21:33:53.977 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON 
result from daemon osd.51 ()
2019-11-01 21:33:53.978 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON 
result from daemon osd.52 ()
2019-11-01 21:33:53.979 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON 
result from daemon osd.53 ()
--

The reason why I am cautious about the health metrics is that I observed a 
crash when trying to query them:
--
2019-11-01 20:21:23.661 7fa46314a700  0 log_channel(audit) log [DBG] : 
from='client.174136 -' entity='client.admin' cmd=[{"prefix": "device 
get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
2019-11-01 20:21:23.661 7fa46394b700  0 mgr[devicehealth] handle_command
2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation fault) 
**
 in thread 7fa46394b700 thread_name:mgr-fin

 ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
(stable)
 1: (()+0xf5f0) [0x7fa488cee5f0]
 2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
 3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
 4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
 5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
 6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
 7: (()+0x709c8) [0x7fa48ae479c8]
 8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
 9: (()+0x5aaa5) [0x7fa48ae31aa5]
 10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
 11: (()+0x4bb95) [0x7fa48ae22b95]
 12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
 13: (ActivePyModule::handle_command(std::map >, std::vector >, 
std::vector > >, std::less, 
std::allocator >, 
std::vector >, std::vector > > > > > const&, ceph::buffer::v14_2_0::list const&, 
std::basic_stringstream, std::allocator >*, 
std::basic_stringstream, std::allocator 
>*)+0x20e) [0x55c3

[ceph-users] Weird blocked OP issue.

2019-11-01 Thread Robert LeBlanc
We had an OSD host with 13 OSDs fail today and we have a weird blocked
OP message that I can't understand. There are no OSDs with blocked
ops, just `mon` (multiple times), and some of the rgw instances.

  cluster:
   id: 570bcdbb-9fdf-406f-9079-b0181025f8d0
   health: HEALTH_WARN
   1 large omap objects
   Degraded data redundancy: 2083023/195702437 objects
degraded (1.064%), 880 pgs degraded, 880 pgs undersized
   1609 pgs not deep-scrubbed in time
   4 slow ops, oldest one blocked for 506699 sec, daemons
[mon,sun-gcs02-rgw01,mon,sun-gcs02-rgw02,mon,sun-gcs02-rgw03] have
slow ops.

 services:
   mon: 3 daemons, quorum
sun-gcs02-rgw01,sun-gcs02-rgw02,sun-gcs02-rgw03 (age 6m)
   mgr: sun-gcs02-rgw02(active, since 5d), standbys: sun-gcs02-rgw03,
sun-gcs02-rgw04
   osd: 767 osds: 754 up (since 10m), 754 in (since 104m); 880 remapped pgs
   rgw: 16 daemons active (sun-gcs02-rgw01.rgw0, sun-gcs02-rgw01.rgw1,
sun-gcs02-rgw01.rgw2, sun-gcs02-rgw01.rgw3, sun-gcs02-rgw02.rgw0,
sun-gcs02-rgw02.rgw1, sun-gcs02-rgw02.rgw2, sun-gcs02-rgw02.rgw3,
sun-gcs02-rgw03.rgw0, sun-gcs02-rgw03.rgw1, sun-gcs02-rgw03.rgw2, s
un-gcs02-rgw03.rgw3, sun-gcs02-rgw04.rgw0, sun-gcs02-rgw04.rgw1,
sun-gcs02-rgw04.rgw2, sun-gcs02-rgw04.rgw3)

 data:
   pools:   7 pools, 8240 pgs
   objects: 19.57M objects, 52 TiB
   usage:   88 TiB used, 6.1 PiB / 6.2 PiB avail
   pgs: 2083023/195702437 objects degraded (1.064%)
43492/195702437 objects misplaced (0.022%)
7360 active+clean
868  active+undersized+degraded+remapped+backfill_wait
12   active+undersized+degraded+remapped+backfilling

 io:
   client:   150 MiB/s rd, 642 op/s rd, 0 op/s wr
   recovery: 626 MiB/s, 223 objects/s

$ ceph versions
{
   "mon": {
   "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
nautilus (stable)": 3
   },
   "mgr": {
   "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
nautilus (stable)": 3
   },
   "osd": {
   "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
nautilus (stable)": 754
   },
   "mds": {},
   "rgw": {
   "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
nautilus (stable)": 16
   },
   "overall": {
   "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
nautilus (stable)": 754,
   "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
nautilus (stable)": 22
   }
}

I restarted one of the monitors and it dropped out of the list, leaving only
2 blocked ops showing, but then it showed up again a little while later.

Any ideas on where to look?

Thanks,
Robert LeBlanc

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
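Since the daemons named in the slow-ops warning are the monitors themselves,
one way to see what those ops actually are is to query the mon admin socket on
the affected host (a sketch; the daemon name is taken from the status output
above):

    ceph daemon mon.sun-gcs02-rgw01 ops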
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-01 Thread Oliver Freyermuth
Dear Cephers,

interestingly, after:
 ceph device monitoring off
the mgrs seem to be stable now. The active one still went silent a few minutes
later, but the standby took over and remained stable, and after restarting the
broken one it has now been stable for an hour as well. So a restart of the mgr
is probably needed after disabling device monitoring to get things stable again.
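A minimal sketch of that sequence, assuming the mgr is managed by systemd (the
daemon name is a placeholder):

    ceph device monitoring off
    systemctl restart ceph-mgr@<name>    # on the mgr host, or: ceph mgr fail <active-mgr>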

So it seems to be caused by a problem with the device health metrics. In case 
this is a red herring and the mgrs become unstable again in the next few days,
I'll let you know. 

Cheers,
Oliver

Am 01.11.19 um 23:09 schrieb Oliver Freyermuth:
> Dear Cephers,
> 
> this is a 14.2.4 cluster with device health metrics enabled - since about a 
> day, all mgr daemons go "silent" on me after a few hours, i.e. "ceph -s" 
> shows:
> 
>   cluster:
> id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
> health: HEALTH_WARN
> no active mgr
> 1/3 mons down, quorum mon001,mon002
>  
>   services:
> mon:3 daemons, quorum mon001,mon002 (age 57m), out of quorum: 
> mon003
> mgr:no daemons active (since 56m)
> ...
> (the third mon has a planned outage and will come back in a few days)
> 
> Checking the logs of the mgr daemons, I find some "reset" messages at the 
> time when it goes "silent", first for the first mgr:
> 
> 2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
> pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB 
> / 138 TiB avail
> 2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on 
> v2:10.160.16.1:6800/401248
> 2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
> pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB 
> / 138 TiB avail
> 
> and a bit later, on the standby mgr:
> 
> 2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
> pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 
> active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 
> TiB / 138 TiB avail
> 2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on 
> v2:10.160.16.2:6800/352196
> 2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
> pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 
> active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 
> TiB / 138 TiB avail
> 
> Interestingly, the dashboard still works, but presents outdated information, 
> and for example zero I/O going on. 
> I believe this started to happen mainly after the third mon went into the 
> known downtime, but I am not fully sure if this was the trigger, since the 
> cluster is still growing. 
> It may also have been the addition of 24 more OSDs. 
> 
> 
> I also find other messages in the mgr logs which seem problematic, but I am 
> not sure they are related:
> --
> 2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading OMAP: 
> [errno 22] Failed to operate read op for oid 
> Traceback (most recent call last):
>   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in 
> put_device_metrics
> ioctx.operate_read_op(op, devid)
>   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func 
> (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUIL
> D/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
>   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op 
> (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
> InvalidArgumentError: [errno 22] Failed to operate read op for oid 
> --
> or:
> --
> 2019-11-01 21:33:53.977 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON 
> result from daemon osd.51 ()
> 2019-11-01 21:33:53.978 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON 
> result from daemon osd.52 ()
> 2019-11-01 21:33:53.979 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON 
> result from daemon osd.53 ()
> --
> 
> The reason why I am cautious about the health metrics is that I observed a 
> crash when trying to query them:
> --
> 2019-11-01 20:21:23.661 7fa46314a700  0 log_channel(audit) log [DBG] : 
> from='client.174136 -' entity='client.admin' cmd=[{"prefix": "device 
> get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
> 2019-11-01 20:21:23.661 7fa46394b700  0 mgr[devicehealth] handle_command
> 2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation 
> fault) **
>  in thread 7fa46394b700 thread_name:mgr-fin
> 
>  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> (stable)
>  1: (()+0xf5f0) [0x7fa48

[ceph-users] Re: mgr daemons becoming unresponsive

2019-11-01 Thread Sage Weil
On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
> Dear Cephers,
> 
> interestingly, after:
>  ceph device monitoring off
> the mgrs seem to be stable now - the active one still went silent a few 
> minutes later,
> but the standby took over and was stable, and restarting the broken one, it's 
> now stable since an hour, too,
> so probably, a restart of the mgr is needed after disabling device monitoring 
> to get things stable again. 
> 
> So it seems to be caused by a problem with the device health metrics. In case 
> this is a red herring and mgrs become unstable again in the next days,
> I'll let you know. 

If this seems to stabilize things, and you can tolerate inducing the 
failure again, reproducing the problem with mgr logs cranked up (debug_mgr 
= 20, debug_ms = 1) would probably give us a good idea of why the mgr is 
hanging.  Let us know!

Thanks,
sage
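(A sketch of one way to raise those levels at runtime and revert afterwards;
the log location is the usual default and may differ per setup:)

    ceph config set mgr debug_mgr 20
    ceph config set mgr debug_ms 1
    # reproduce the hang, collect /var/log/ceph/ceph-mgr.*.log, then revert:
    ceph config rm mgr debug_mgr
    ceph config rm mgr debug_ms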

 > 
> Cheers,
>   Oliver
> 
> Am 01.11.19 um 23:09 schrieb Oliver Freyermuth:
> > Dear Cephers,
> > 
> > this is a 14.2.4 cluster with device health metrics enabled - since about a 
> > day, all mgr daemons go "silent" on me after a few hours, i.e. "ceph -s" 
> > shows:
> > 
> >   cluster:
> > id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
> > health: HEALTH_WARN
> > no active mgr
> > 1/3 mons down, quorum mon001,mon002
> >  
> >   services:
> > mon:3 daemons, quorum mon001,mon002 (age 57m), out of quorum: 
> > mon003
> > mgr:no daemons active (since 56m)
> > ...
> > (the third mon has a planned outage and will come back in a few days)
> > 
> > Checking the logs of the mgr daemons, I find some "reset" messages at the 
> > time when it goes "silent", first for the first mgr:
> > 
> > 2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
> > pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 
> > TiB / 138 TiB avail
> > 2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on 
> > v2:10.160.16.1:6800/401248
> > 2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : 
> > pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 
> > TiB / 138 TiB avail
> > 
> > and a bit later, on the standby mgr:
> > 
> > 2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
> > pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 
> > active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 
> > 136 TiB / 138 TiB avail
> > 2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on 
> > v2:10.160.16.2:6800/352196
> > 2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : 
> > pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 
> > active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 
> > 136 TiB / 138 TiB avail
> > 
> > Interestingly, the dashboard still works, but presents outdated 
> > information, and for example zero I/O going on. 
> > I believe this started to happen mainly after the third mon went into the 
> > known downtime, but I am not fully sure if this was the trigger, since the 
> > cluster is still growing. 
> > It may also have been the addition of 24 more OSDs. 
> > 
> > 
> > I also find other messages in the mgr logs which seem problematic, but I am 
> > not sure they are related:
> > --
> > 2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading 
> > OMAP: [errno 22] Failed to operate read op for oid 
> > Traceback (most recent call last):
> >   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in 
> > put_device_metrics
> > ioctx.operate_read_op(op, devid)
> >   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func 
> > (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUIL
> > D/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
> >   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op 
> > (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
> > InvalidArgumentError: [errno 22] Failed to operate read op for oid 
> > --
> > or:
> > --
> > 2019-11-01 21:33:53.977 7f7bd38bc700  0 mgr[devicehealth] Fail to parse 
> > JSON result from daemon osd.51 ()
> > 2019-11-01 21:33:53.978 7f7bd38bc700  0 mgr[devicehealth] Fail to parse 
> > JSON result from daemon osd.52 ()
> > 2019-11-01 21:33:53.979 7f7bd38bc700  0 mgr[devicehealth] Fail to parse 
> > JSON result from daemon osd.53 ()
> > --
> > 
> > The reason why I am cautious about the health metrics is that I observed a 
> > crash when trying to query them:
> > --
> > 2019-11-01 20:21:23.661 7fa46314a700  0

[ceph-users] Re: Weird blocked OP issue.

2019-11-01 Thread Robert LeBlanc
On Fri, Nov 1, 2019 at 6:10 PM Robert LeBlanc  wrote:
>
> We had an OSD host with 13 OSDs fail today and we have a weird blocked
> OP message that I can't understand. There are no OSDs with blocked
> ops, just `mon` (multiple times), and some of the rgw instances.
>
>   cluster:
>id: 570bcdbb-9fdf-406f-9079-b0181025f8d0
>health: HEALTH_WARN
>1 large omap objects
>Degraded data redundancy: 2083023/195702437 objects
> degraded (1.064%), 880 pgs degraded, 880 pgs undersized
>1609 pgs not deep-scrubbed in time
>4 slow ops, oldest one blocked for 506699 sec, daemons
> [mon,sun-gcs02-rgw01,mon,sun-gcs02-rgw02,mon,sun-gcs02-rgw03] have
> slow ops.
>
>  services:
>mon: 3 daemons, quorum
> sun-gcs02-rgw01,sun-gcs02-rgw02,sun-gcs02-rgw03 (age 6m)
>mgr: sun-gcs02-rgw02(active, since 5d), standbys: sun-gcs02-rgw03,
> sun-gcs02-rgw04
>osd: 767 osds: 754 up (since 10m), 754 in (since 104m); 880 remapped pgs
>rgw: 16 daemons active (sun-gcs02-rgw01.rgw0, sun-gcs02-rgw01.rgw1,
> sun-gcs02-rgw01.rgw2, sun-gcs02-rgw01.rgw3, sun-gcs02-rgw02.rgw0,
> sun-gcs02-rgw02.rgw1, sun-gcs02-rgw02.rgw2, sun-gcs02-rgw02.rgw3,
> sun-gcs02-rgw03.rgw0, sun-gcs02-rgw03.rgw1, sun-gcs02-rgw03.rgw2, s
> un-gcs02-rgw03.rgw3, sun-gcs02-rgw04.rgw0, sun-gcs02-rgw04.rgw1,
> sun-gcs02-rgw04.rgw2, sun-gcs02-rgw04.rgw3)
>
>  data:
>pools:   7 pools, 8240 pgs
>objects: 19.57M objects, 52 TiB
>usage:   88 TiB used, 6.1 PiB / 6.2 PiB avail
>pgs: 2083023/195702437 objects degraded (1.064%)
> 43492/195702437 objects misplaced (0.022%)
> 7360 active+clean
> 868  active+undersized+degraded+remapped+backfill_wait
> 12   active+undersized+degraded+remapped+backfilling
>
>  io:
>client:   150 MiB/s rd, 642 op/s rd, 0 op/s wr
>recovery: 626 MiB/s, 223 objects/s
>
> $ ceph versions
> {
>"mon": {
>"ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
> nautilus (stable)": 3
>},
>"mgr": {
>"ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
> nautilus (stable)": 3
>},
>"osd": {
>"ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
> nautilus (stable)": 754
>},
>"mds": {},
>"rgw": {
>"ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
> nautilus (stable)": 16
>},
>"overall": {
>"ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
> nautilus (stable)": 754,
>"ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
> nautilus (stable)": 22
>}
> }
>
> I restarted one of the monitors and it dropped out of the list only
> showing 2 blocked ops, but then showed up again a little while later.
>
> Any ideas on where to look?

For posterity's sake, it looks like I got things happy again.

The rgw data pool is 8+2 EC, but was set to min_size=10. I thought I
had configured min_size=9, but PGs were recovering, so I didn't
think about it at the time. Then one OSD started crashing with
something about strays; it would be restarted and crash again. Then
incomplete PGs showed up. I dropped the min_size to 8 to get things
recovered and marked osd.119 out to empty it off. Once the cluster
recovered and all PGs were healthy, I set min_size=9. I then noticed
that what I thought were rgw instances being blocked were actually
the names of the monitors (the hosts are named after the rgws, but
mon, mgr and rgw are all containers on the boxes). I thought, well, let
me try to roll the first monitor again and see if that unblocks the
op; sure enough, it looks like it unblocked this time and has not
shown up again in 10 minutes. After letting osd.119 sit empty for
about 10 minutes, I set it back in and it doesn't seem to be crashing
anymore, so I wonder if it had some bad db entry. It's almost halfway
back in and so far so good.
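For reference, the steps above correspond roughly to the following commands
(the data pool name is a placeholder):

    ceph osd pool set <rgw-data-pool> min_size 8   # temporary, to let the incomplete PGs recover
    ceph osd out 119                               # drain the crashing OSD
    # ...wait for recovery / all PGs active+clean...
    ceph osd pool set <rgw-data-pool> min_size 9
    ceph osd in 119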


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io