Re: [ceph-users] Slow requests from bluestore osds

2019-05-14 Thread Stefan Kooman
Quoting Marc Schöchlin (m...@256bit.org):

> Our new setup is now:
> (12.2.10 on Ubuntu 16.04)
> 
> [osd]
> osd deep scrub interval = 2592000
> osd scrub begin hour = 19
> osd scrub end hour = 6
> osd scrub load threshold = 6
> osd scrub sleep = 0.3
> osd snap trim sleep = 0.4
> pg max concurrent snap trims = 1
> 
> [osd.51]
> osd memory target = 8589934592

I would upgrade to 12.2.12 and set the following:

[osd]
bluestore_allocator = bitmap
bluefs_allocator = bitmap

Just to make sure you're not hit by the "stupid allocator" behaviour,
which (also) might result in slow ops after $period of OSD uptime.
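
The allocator is only picked up when the OSD starts, so after adding those
lines restart the OSDs one at a time and check via the admin socket that the
new value is in effect. A minimal sketch, assuming systemd-managed OSDs and
taking osd.51 as an example:

  systemctl restart ceph-osd@51
  ceph daemon osd.51 config get bluestore_allocator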

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/    Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MGR CRASH : balancer module

2019-05-14 Thread xie.xingguo
Should be fixed by https://github.com/ceph/ceph/pull/27225 


You can simply upgrade to v14.2.1 to get rid of it,


or you can do 'ceph balancer off' to temporarily disable automatic balancing...
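
For example (a rough sketch):

  ceph balancer off       # stop automatic balancing for now
  ceph balancer status    # confirm the module is no longer active
  # ...upgrade to v14.2.1...
  ceph balancer on        # re-enable once the fix is in place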

Original Message

From: Tarek Zegar
To: ceph-users@lists.ceph.com
Date: 2019-05-14 01:53
Subject: [ceph-users] Ceph MGR CRASH : balancer module


Hello,

My manager keeps dying, the last meta log is below. What is causing this? I do 
have two roots in the osd tree with shared hosts(see below), I can't imagine 
that is causing balancer to fail?


meta log:
{
"crash_id": 
"2019-05-11_19:09:17.999875Z_aa7afa7c-bc7e-43ec-b32a-821bd47bd68b",
"timestamp": "2019-05-11 19:09:17.999875Z",
"process_name": "ceph-mgr",
"entity_name": "mgr.pok1-qz1-sr1-rk023-s08",
"ceph_version": "14.2.0",
"utsname_hostname": "pok1-qz1-sr1-rk023-s08",
"utsname_sysname": "Linux",
"utsname_release": "4.15.0-1014-ibm-gt",
"utsname_version": "#16-Ubuntu SMP Tue Dec 11 11:19:10 UTC 2018",
"utsname_machine": "x86_64",
"os_name": "Ubuntu",
"os_id": "ubuntu",
"os_version_id": "18.04",
"os_version": "18.04.1 LTS (Bionic Beaver)",
"assert_condition": "osd_weight.count(i.first)",
"assert_func": "int OSDMap::calc_pg_upmaps(CephContext*, float, int, const 
std::set&, OSDMap::Incremental*)",
"assert_file": "/build/ceph-14.2.0/src/osd/OSDMap.cc",
"assert_line": 4743,
"assert_thread_name": "balancer",
"assert_msg": "/build/ceph-14.2.0/src/osd/OSDMap.cc: In function 'int 
OSDMap::calc_pg_upmaps(CephContext*, float, int, const std::set&, 
OSDMap::Incremental*)' thread 7fffd6572700 time 2019-05-11 
19:09:17.998114\n/build/ceph-14.2.0/src/osd/OSDMap.cc: 4743: FAILED 
ceph_assert(osd_weight.count(i.first))\n",
"backtrace": [
"(()+0x12890) [0x7fffee586890]",
"(gsignal()+0xc7) [0x7fffed67ee97]",
"(abort()+0x141) [0x7fffed680801]",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1a3) [0x7fffef1eb7d3]",
"(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
char const*, )+0) [0x7fffef1eb95d]",
"(OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set, std::allocator > const&, OSDMap::Incremental*)+0x274b) 
[0x7fffef61bb3b]",
"(()+0x1d52b6) [0x557292b6]",
"(PyEval_EvalFrameEx()+0x8010) [0x7fffeeab21d0]",
"(PyEval_EvalCodeEx()+0x7d8) [0x7fffeebe2278]",
"(PyEval_EvalFrameEx()+0x5bf6) [0x7fffeeaafdb6]",
"(PyEval_EvalFrameEx()+0x8b5b) [0x7fffeeab2d1b]",
"(PyEval_EvalFrameEx()+0x8b5b) [0x7fffeeab2d1b]",
"(PyEval_EvalCodeEx()+0x7d8) [0x7fffeebe2278]",
"(()+0x1645f9) [0x7fffeeb675f9]",
"(PyObject_Call()+0x43) [0x7fffeea57333]",
"(()+0x1abd1c) [0x7fffeebaed1c]",
"(PyObject_Call()+0x43) [0x7fffeea57333]",
"(PyObject_CallMethod()+0xc8) [0x7fffeeb7bc78]",
"(PyModuleRunner::serve()+0x62) [0x55725f32]",
"(PyModuleRunner::PyModuleRunnerThread::entry()+0x1cf) 
[0x557265df]",
"(()+0x76db) [0x7fffee57b6db]",
"(clone()+0x3f) [0x7fffed76188f]"
]
}

OSD TREE:
ID  CLASS WEIGHT   TYPE NAME   STATUS REWEIGHT PRI-AFF
-29       54.58200 root tzrootthreenodes
-25       18.19400     host pok1-qz1-sr1-rk001-s20
  0   ssd   1.81898 osd.0   up  1.0 1.0
122   ssd   1.81898 osd.122 up  1.0 1.0
135   ssd   1.81898 osd.135 up  1.0 1.0
149   ssd   1.81898 osd.149 up  1.0 1.0
162   ssd   1.81898 osd.162 up  1.0 1.0
175   ssd   1.81898 osd.175 up  1.0 1.0
188   ssd   1.81898 osd.188 up  1.0 1.0
200   ssd   1.81898 osd.200 up  1.0 1.0
213   ssd   1.81898 osd.213 up  1.0 1.0
225   ssd   1.81898 osd.225 up  1.0 1.0
 -5       18.19400     host pok1-qz1-sr1-rk002-s05
112   ssd   1.81898 osd.112 up  1.0 1.0
120   ssd   1.81898 osd.120 up  1.0 1.0
132   ssd   1.81898 osd.132 up  1.0 1.0
144   ssd   1.81898 osd.144 up  1.0 1.0
156   ssd   1.81898 osd.156 up  1.0 1.0
168   ssd   1.81898 osd.168 up  1.0 1.0
180   ssd   1.81898 osd.180 up  1.0 1.0
192   ssd   1.81898 osd.192 up  1.0 1.0
204   ssd   1.81898 osd.204 up  1.0 1.0
216   ssd   1.81898 

Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-14 Thread Stefan Kooman
Quoting Frank Schilder (fr...@dtu.dk):

If at all possible I would:

Upgrade to 13.2.5 (there have been quite a few MDS fixes since 13.2.2).
Use more recent kernels on the clients.

Below settings for [mds] might help with trimming (you might already
have changed mds_log_max_segments to 128 according to logs):

[mds]
mds_log_max_expiring = 80  # default 20
# trim max $value segments in parallel
# Defaults are too conservative.
mds_log_max_segments = 120  # default 30
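
If you don't want to restart the MDS, the same values can also be injected at
runtime, roughly like this (substitute your MDS name):

  ceph tell mds.<name> injectargs '--mds_log_max_expiring=80 --mds_log_max_segments=120'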


> 1) Is there a bug with having MDS daemons acting as standby-replay?
I can't tell what bug you are referring to based on info below. It does
seem to work as designed.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/    Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Kevin Flöh

On 13.05.19 10:51 nachm., Lionel Bouton wrote:

Le 13/05/2019 à 16:20, Kevin Flöh a écrit :

Dear ceph experts,

[...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
Here is what happened: One osd daemon could not be started and 
therefore we decided to mark the osd as lost and set it up from 
scratch. Ceph started recovering and then we lost another osd with 
the same behavior. We did the same as for the first osd.


With 3+1 you only allow a single OSD failure per pg at a given time. 
You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2 
separate servers (assuming standard crush rules) is a death sentence 
for the data on some pgs using both of those OSD (the ones not fully 
recovered before the second failure).


OK, so the 2 OSDs (4,23) failed shortly one after the other but we think 
that the recovery of the first was finished before the second failed. 
Nonetheless, both problematic pgs have been on both OSDs. We think, that 
we still have enough shards left. For one of the pgs, the recovery state 
looks like this:


    "recovery_state": [
    {
    "name": "Started/Primary/Peering/Incomplete",
    "enter_time": "2019-05-09 16:11:48.625966",
    "comment": "not enough complete instances of this PG"
    },
    {
    "name": "Started/Primary/Peering",
    "enter_time": "2019-05-09 16:11:48.611171",
    "past_intervals": [
    {
    "first": "49767",
    "last": "59313",
    "all_participants": [
    {
    "osd": 2,
    "shard": 0
    },
    {
    "osd": 4,
    "shard": 1
    },
    {
    "osd": 23,
    "shard": 2
    },
    {
    "osd": 24,
    "shard": 0
    },
    {
    "osd": 72,
    "shard": 1
    },
    {
    "osd": 79,
    "shard": 3
    }
    ],
    "intervals": [
    {
    "first": "58860",
    "last": "58861",
    "acting": "4(1),24(0),79(3)"
    },
    {
    "first": "58875",
    "last": "58877",
    "acting": "4(1),23(2),24(0)"
    },
    {
    "first": "59002",
    "last": "59009",
    "acting": "4(1),23(2),79(3)"
    },
    {
    "first": "59010",
    "last": "59012",
    "acting": "2(0),4(1),23(2),79(3)"
    },
    {
    "first": "59197",
    "last": "59233",
    "acting": "23(2),24(0),79(3)"
    },
    {
    "first": "59234",
    "last": "59313",
    "acting": "23(2),24(0),72(1),79(3)"
    }
    ]
    }
    ],
    "probing_osds": [
    "2(0)",
    "4(1)",
    "23(2)",
    "24(0)",
    "72(1)",
    "79(3)"
    ],
    "down_osds_we_would_probe": [],
    "peering_blocked_by": [],
    "peering_blocked_by_detail": [
    {
    "detail": "peering_blocked_by_history_les_bound"
    }
    ]
    },
    {
    "name": "Started",
    "enter_time": "2019-05-09 16:11:48.611121"
    }
    ],
Is there a chance to recover this pg from the shards on OSDs 2, 72, 79? 
ceph pg repair/deep-scrub/scrub did not work.


We are also worried about the MDS being behind on trimming, or is this 
not too problematic?



MDS_TRIM 1 MDSs behind on trimming
    mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46178/128) 
max_segments: 128, num_segments: 46178



Depending on the data stored (CephFS ?) you probably can recover most 
of it but some of it is irremediably lost.


If you can recover the data from the failed OSD at the time they 
failed you might be able to recover some of your lost data (with the 
help of Ceph devs), if not there's nothing to do.


In the latter case I'd add a new server to use at least 3+2 for a fresh 
p

Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Kevin Flöh


On 13.05.19 11:21 nachm., Dan van der Ster wrote:

Presumably the 2 OSDs you marked as lost were hosting those incomplete PGs?
It would be useful to double confirm that: check with `ceph pg 
query` and `ceph pg dump`.
(If so, this is why the ignore history les thing isn't helping; you
don't have the minimum 3 stripes up for those 3+1 PGs.)


yes, but as written in my other mail, we still have enough shards, at 
least I think so.




If those "lost" OSDs by some miracle still have the PG data, you might
be able to export the relevant PG stripes with the
ceph-objectstore-tool. I've never tried this myself, but there have
been threads in the past where people export a PG from a nearly dead
hdd, import to another OSD, then backfilling works.

I guess that is not possible.


If OTOH those PGs are really lost forever, and someone else should
confirm what I say here, I think the next step would be to force
recreate the incomplete PGs then run a set of cephfs scrub/repair
disaster recovery cmds to recover what you can from the cephfs.

-- dan


would this let us recover at least some of the data on the pgs? If not 
we would just set up a new ceph directly without fixing the old one and 
copy whatever is left.


Best regards,

Kevin





On Mon, May 13, 2019 at 4:20 PM Kevin Flöh  wrote:

Dear ceph experts,

we have several (maybe related) problems with our ceph cluster, let me
first show you the current ceph status:

cluster:
  id: 23e72372-0d44-4cad-b24f-3641b14b86f4
  health: HEALTH_ERR
  1 MDSs report slow metadata IOs
  1 MDSs report slow requests
  1 MDSs behind on trimming
  1/126319678 objects unfound (0.000%)
  19 scrub errors
  Reduced data availability: 2 pgs inactive, 2 pgs incomplete
  Possible data damage: 7 pgs inconsistent
  Degraded data redundancy: 1/500333881 objects degraded
(0.000%), 1 pg degraded
  118 stuck requests are blocked > 4096 sec. Implicated osds
24,32,91

services:
  mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
  mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
  mds: cephfs-1/1/1 up  {0=ceph-node02.etp.kit.edu=up:active}, 3
up:standby
  osd: 96 osds: 96 up, 96 in

data:
  pools:   2 pools, 4096 pgs
  objects: 126.32M objects, 260TiB
  usage:   372TiB used, 152TiB / 524TiB avail
  pgs: 0.049% pgs not active
   1/500333881 objects degraded (0.000%)
   1/126319678 objects unfound (0.000%)
   4076 active+clean
   10   active+clean+scrubbing+deep
   7    active+clean+inconsistent
   2    incomplete
   1    active+recovery_wait+degraded

io:
  client:   449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr


and ceph health detail:


HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests;
1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19
scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs
incomplete; Possible data damage: 7 pgs inconsistent; Degraded data
redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded; 118
stuck requests are blocked > 4096 sec. Implicated osds 24,32,91
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
  mdsceph-node02.etp.kit.edu(mds.0): 100+ slow metadata IOs are
blocked > 30 secs, oldest blocked for 351193 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
  mdsceph-node02.etp.kit.edu(mds.0): 4 slow requests are blocked > 30 sec
MDS_TRIM 1 MDSs behind on trimming
  mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46034/128)
max_segments: 128, num_segments: 46034
OBJECT_UNFOUND 1/126319687 objects unfound (0.000%)
  pg 1.24c has 1 unfound objects
OSD_SCRUB_ERRORS 19 scrub errors
PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
  pg 1.5dd is incomplete, acting [24,4,23,79] (reducing pool ec31
min_size from 3 may help; search ceph.com/docs for 'incomplete')
  pg 1.619 is incomplete, acting [91,23,4,81] (reducing pool ec31
min_size from 3 may help; search ceph.com/docs for 'incomplete')
PG_DAMAGED Possible data damage: 7 pgs inconsistent
  pg 1.17f is active+clean+inconsistent, acting [65,49,25,4]
  pg 1.1e0 is active+clean+inconsistent, acting [11,32,4,81]
  pg 1.203 is active+clean+inconsistent, acting [43,49,4,72]
  pg 1.5d3 is active+clean+inconsistent, acting [37,27,85,4]
  pg 1.779 is active+clean+inconsistent, acting [50,4,77,62]
  pg 1.77c is active+clean+inconsistent, acting [21,49,40,4]
  pg 1.7c3 is active+clean+inconsistent, acting [1,14,68,4]
PG_DEGRADED Degraded data redundancy: 1/500333908 objects degraded
(0.000%), 1 pg degraded
  pg 1.24c is active+recovery_wait+degraded, acting [32,4,61,36], 1
unfound
REQUEST_STUCK 118 stuck requests are blocked > 4096 sec. Implicated osds
24,32,91
  118 ops are blocked > 536871 sec
  osd

Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Dan van der Ster
On Tue, May 14, 2019 at 10:02 AM Kevin Flöh  wrote:
>
> On 13.05.19 10:51 nachm., Lionel Bouton wrote:
> > Le 13/05/2019 à 16:20, Kevin Flöh a écrit :
> >> Dear ceph experts,
> >>
> >> [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
> >> Here is what happened: One osd daemon could not be started and
> >> therefore we decided to mark the osd as lost and set it up from
> >> scratch. Ceph started recovering and then we lost another osd with
> >> the same behavior. We did the same as for the first osd.
> >
> > With 3+1 you only allow a single OSD failure per pg at a given time.
> > You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2
> > separate servers (assuming standard crush rules) is a death sentence
> > for the data on some pgs using both of those OSD (the ones not fully
> > recovered before the second failure).
>
> OK, so the 2 OSDs (4,23) failed shortly one after the other but we think
> that the recovery of the first was finished before the second failed.
> Nonetheless, both problematic pgs have been on both OSDs. We think, that
> we still have enough shards left. For one of the pgs, the recovery state
> looks like this:
>
>  "recovery_state": [
>  {
>  "name": "Started/Primary/Peering/Incomplete",
>  "enter_time": "2019-05-09 16:11:48.625966",
>  "comment": "not enough complete instances of this PG"
>  },
>  {
>  "name": "Started/Primary/Peering",
>  "enter_time": "2019-05-09 16:11:48.611171",
>  "past_intervals": [
>  {
>  "first": "49767",
>  "last": "59313",
>  "all_participants": [
>  {
>  "osd": 2,
>  "shard": 0
>  },
>  {
>  "osd": 4,
>  "shard": 1
>  },
>  {
>  "osd": 23,
>  "shard": 2
>  },
>  {
>  "osd": 24,
>  "shard": 0
>  },
>  {
>  "osd": 72,
>  "shard": 1
>  },
>  {
>  "osd": 79,
>  "shard": 3
>  }
>  ],
>  "intervals": [
>  {
>  "first": "58860",
>  "last": "58861",
>  "acting": "4(1),24(0),79(3)"
>  },
>  {
>  "first": "58875",
>  "last": "58877",
>  "acting": "4(1),23(2),24(0)"
>  },
>  {
>  "first": "59002",
>  "last": "59009",
>  "acting": "4(1),23(2),79(3)"
>  },
>  {
>  "first": "59010",
>  "last": "59012",
>  "acting": "2(0),4(1),23(2),79(3)"
>  },
>  {
>  "first": "59197",
>  "last": "59233",
>  "acting": "23(2),24(0),79(3)"
>  },
>  {
>  "first": "59234",
>  "last": "59313",
>  "acting": "23(2),24(0),72(1),79(3)"
>  }
>  ]
>  }
>  ],
>  "probing_osds": [
>  "2(0)",
>  "4(1)",
>  "23(2)",
>  "24(0)",
>  "72(1)",
>  "79(3)"
>  ],
>  "down_osds_we_would_probe": [],
>  "peering_blocked_by": [],
>  "peering_blocked_by_detail": [
>  {
>  "detail": "peering_blocked_by_history_les_bound"
>  }
>  ]
>  },
>  {
>  "name": "Started",
>  "enter_time": "2019-05-09 16:11:48.611121"
>  }
>  ],
> Is there a chance to recover this pg from the shards on OSDs 2, 72, 79?
> ceph pg repair/deep-scrub/scrub did not work.

repair/scrub are not related to this problem so they won't help.

How exactly did you use the osd_find_best_info_ignore_history_les option?

One correct procedure would be to set it to true in ceph.conf, then
restart each of 

Re: [ceph-users] Ceph MGR CRASH : balancer module

2019-05-14 Thread EDH - Manuel Rios Fernandez
We can confirm that the balancer module works smoothly in 14.2.1.

We're balancing with bytes and pg. Now all OSDs are 100% balanced.

From: ceph-users  On behalf of 
xie.xing...@zte.com.cn
Sent: Tuesday, 14 May 2019 9:53
To: tze...@us.ibm.com
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph MGR CRASH : balancer module

 

Should be fixed by https://github.com/ceph/ceph/pull/27225 

You can simply upgrade to v14.2.1 to get rid of it,

or you can do 'ceph balancer off' to temporarily disable automatic balancing...

Original Message

From: Tarek Zegar <tze...@us.ibm.com>

To: ceph-users@lists.ceph.com

Date: 2019-05-14 01:53

Subject: [ceph-users] Ceph MGR CRASH : balancer module


Hello,

My manager keeps dying, the last meta log is below. What is causing this? I do 
have two roots in the osd tree with shared hosts(see below), I can't imagine 
that is causing balancer to fail?


meta log:
{
   "crash_id": 
"2019-05-11_19:09:17.999875Z_aa7afa7c-bc7e-43ec-b32a-821bd47bd68b",
   "timestamp": "2019-05-11 19:09:17.999875Z",
   "process_name": "ceph-mgr",
   "entity_name": "mgr.pok1-qz1-sr1-rk023-s08",
   "ceph_version": "14.2.0",
   "utsname_hostname": "pok1-qz1-sr1-rk023-s08",
   "utsname_sysname": "Linux",
   "utsname_release": "4.15.0-1014-ibm-gt",
   "utsname_version": "#16-Ubuntu SMP Tue Dec 11 11:19:10 UTC 2018",
   "utsname_machine": "x86_64",
   "os_name": "Ubuntu",
   "os_id": "ubuntu",
   "os_version_id": "18.04",
   "os_version": "18.04.1 LTS (Bionic Beaver)",
   "assert_condition": "osd_weight.count(i.first)",
   "assert_func": "int OSDMap::calc_pg_upmaps(CephContext*, float, int, const 
std::set&, OSDMap::Incremental*)",
   "assert_file": "/build/ceph-14.2.0/src/osd/OSDMap.cc",
   "assert_line": 4743,
   "assert_thread_name": "balancer",
   "assert_msg": "/build/ceph-14.2.0/src/osd/OSDMap.cc: In function 'int 
OSDMap::calc_pg_upmaps(CephContext*, float, int, const std::set&, 
OSDMap::Incremental*)' thread 7fffd6572700 time 2019-05-11 
19:09:17.998114\n/build/ceph-14.2.0/src/osd/OSDMap.cc: 4743: FAILED 
ceph_assert(osd_weight.count(i.first))\n",
   "backtrace": [
   "(()+0x12890) [0x7fffee586890]",
   "(gsignal()+0xc7) [0x7fffed67ee97]",
   "(abort()+0x141) [0x7fffed680801]",
   "(ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1a3) [0x7fffef1eb7d3]",
   "(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
char const*, )+0) [0x7fffef1eb95d]",
   "(OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set, std::allocator > const&, OSDMap::Incremental*)+0x274b) 
[0x7fffef61bb3b]",
   "(()+0x1d52b6) [0x557292b6]",
   "(PyEval_EvalFrameEx()+0x8010) [0x7fffeeab21d0]",
   "(PyEval_EvalCodeEx()+0x7d8) [0x7fffeebe2278]",
   "(PyEval_EvalFrameEx()+0x5bf6) [0x7fffeeaafdb6]",
   "(PyEval_EvalFrameEx()+0x8b5b) [0x7fffeeab2d1b]",
   "(PyEval_EvalFrameEx()+0x8b5b) [0x7fffeeab2d1b]",
   "(PyEval_EvalCodeEx()+0x7d8) [0x7fffeebe2278]",
   "(()+0x1645f9) [0x7fffeeb675f9]",
   "(PyObject_Call()+0x43) [0x7fffeea57333]",
   "(()+0x1abd1c) [0x7fffeebaed1c]",
   "(PyObject_Call()+0x43) [0x7fffeea57333]",
   "(PyObject_CallMethod()+0xc8) [0x7fffeeb7bc78]",
   "(PyModuleRunner::serve()+0x62) [0x55725f32]",
   "(PyModuleRunner::PyModuleRunnerThread::entry()+0x1cf) [0x557265df]",
   "(()+0x76db) [0x7fffee57b6db]",
   "(clone()+0x3f) [0x7fffed76188f]"
   ]
}

OSD TREE:
ID  CLASS WEIGHT   TYPE NAME   STATUS REWEIGHT PRI-AFF
-29       54.58200 root tzrootthreenodes
-25       18.19400     host pok1-qz1-sr1-rk001-s20
 0   ssd   1.81898 osd.0   up  1.0 1.0
122   ssd   1.81898 osd.122 up  1.0 1.0
135   ssd   1.81898 osd.135 up  1.0 1.0
149   ssd   1.81898 osd.149 up  1.0 1.0
162   ssd   1.81898 osd.162 up  1.0 1.0
175   ssd   1.81898 osd.175 up  1.0 1.0
188   ssd   1.81898 osd.188 up  1.0 1.0
200   ssd   1.81898 osd.200 up  1.0 1.0
213   ssd   1.81898 osd.213 up  1.0 1.0
225   ssd   1.81898 osd.225 up  1.0 1.0
 -5       18.19400     host pok1-qz1-sr1-rk002-s05
112   ssd   1.81898 osd.112 up  1.0 1.0
120   ssd   1.81898 osd.120 up  1.0 1.0
132   ssd   1.81898 osd.132 up  1.0 1.0
144   ssd   1.81898 osd.144

Re: [ceph-users] Rolling upgrade fails with flag norebalance with background IO [EXT]

2019-05-14 Thread Matthew Vernon
On 14/05/2019 00:36, Tarek Zegar wrote:
> It's not just mimic to nautilus
> I confirmed with luminous to mimic
>  
> They are checking for clean pgs with flags set, they should unset flags,
> then check. Set flags again, move on to next osd

I think I'm inclined to agree that "norebalance" is likely to get in the
way when upgrading a cluster - our rolling upgrade playbook omits it.
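
If an upgrade run does get stuck on that check, the flag can be cleared (and
re-set) by hand while the playbook waits, roughly:

  ceph osd unset norebalance    # let the PGs go active+clean
  # ...wait for the PGs to settle...
  ceph osd set norebalance      # re-set it before continuing, if desired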

OTOH, you might want to raise this on the ceph-ansible list (
ceph-ansi...@lists.ceph.com ) and/or as a github issue - I don't think
the ceph-ansible maintainers routinely watch this list.

HTH,

Matthew


-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Kevin Flöh


On 14.05.19 10:08 vorm., Dan van der Ster wrote:

On Tue, May 14, 2019 at 10:02 AM Kevin Flöh  wrote:

On 13.05.19 10:51 nachm., Lionel Bouton wrote:

Le 13/05/2019 à 16:20, Kevin Flöh a écrit :

Dear ceph experts,

[...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
Here is what happened: One osd daemon could not be started and
therefore we decided to mark the osd as lost and set it up from
scratch. Ceph started recovering and then we lost another osd with
the same behavior. We did the same as for the first osd.

With 3+1 you only allow a single OSD failure per pg at a given time.
You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2
separate servers (assuming standard crush rules) is a death sentence
for the data on some pgs using both of those OSD (the ones not fully
recovered before the second failure).

OK, so the 2 OSDs (4,23) failed shortly one after the other but we think
that the recovery of the first was finished before the second failed.
Nonetheless, both problematic pgs have been on both OSDs. We think, that
we still have enough shards left. For one of the pgs, the recovery state
looks like this:

  "recovery_state": [
  {
  "name": "Started/Primary/Peering/Incomplete",
  "enter_time": "2019-05-09 16:11:48.625966",
  "comment": "not enough complete instances of this PG"
  },
  {
  "name": "Started/Primary/Peering",
  "enter_time": "2019-05-09 16:11:48.611171",
  "past_intervals": [
  {
  "first": "49767",
  "last": "59313",
  "all_participants": [
  {
  "osd": 2,
  "shard": 0
  },
  {
  "osd": 4,
  "shard": 1
  },
  {
  "osd": 23,
  "shard": 2
  },
  {
  "osd": 24,
  "shard": 0
  },
  {
  "osd": 72,
  "shard": 1
  },
  {
  "osd": 79,
  "shard": 3
  }
  ],
  "intervals": [
  {
  "first": "58860",
  "last": "58861",
  "acting": "4(1),24(0),79(3)"
  },
  {
  "first": "58875",
  "last": "58877",
  "acting": "4(1),23(2),24(0)"
  },
  {
  "first": "59002",
  "last": "59009",
  "acting": "4(1),23(2),79(3)"
  },
  {
  "first": "59010",
  "last": "59012",
  "acting": "2(0),4(1),23(2),79(3)"
  },
  {
  "first": "59197",
  "last": "59233",
  "acting": "23(2),24(0),79(3)"
  },
  {
  "first": "59234",
  "last": "59313",
  "acting": "23(2),24(0),72(1),79(3)"
  }
  ]
  }
  ],
  "probing_osds": [
  "2(0)",
  "4(1)",
  "23(2)",
  "24(0)",
  "72(1)",
  "79(3)"
  ],
  "down_osds_we_would_probe": [],
  "peering_blocked_by": [],
  "peering_blocked_by_detail": [
  {
  "detail": "peering_blocked_by_history_les_bound"
  }
  ]
  },
  {
  "name": "Started",
  "enter_time": "2019-05-09 16:11:48.611121"
  }
  ],
Is there a chance to recover this pg from the shards on OSDs 2, 72, 79?
ceph pg repair/deep-scrub/scrub did not work.

repair/scrub are not related to this problem so they won't help.

How exactly did you use the osd_find_best_info_ignore_history_les option?

One correct procedure would be to set it to true in ceph.conf, then
restart each of the probing_osd's above.
(Once the PG has peered, you need to unset the option and restart
those osds again).
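
For concreteness, a rough sketch of that procedure (assuming systemd-managed
OSDs; the ids come from the probing_osds list above):

  # add to ceph.conf on the hosts holding those OSDs
  [osd]
  osd_find_best_info_ignore_history_les = true

  systemctl restart ceph-osd@2 ceph-osd@4 ceph-osd@23 ceph-osd@24 ceph-osd@72 ceph-osd@79

  # once the PG has peered, remove the option again and restart the same OSDs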


We execu

Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Dan van der Ster
On Tue, May 14, 2019 at 10:59 AM Kevin Flöh  wrote:
>
>
> On 14.05.19 10:08 vorm., Dan van der Ster wrote:
>
> On Tue, May 14, 2019 at 10:02 AM Kevin Flöh  wrote:
>
> On 13.05.19 10:51 nachm., Lionel Bouton wrote:
>
> Le 13/05/2019 à 16:20, Kevin Flöh a écrit :
>
> Dear ceph experts,
>
> [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
> Here is what happened: One osd daemon could not be started and
> therefore we decided to mark the osd as lost and set it up from
> scratch. Ceph started recovering and then we lost another osd with
> the same behavior. We did the same as for the first osd.
>
> With 3+1 you only allow a single OSD failure per pg at a given time.
> You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2
> separate servers (assuming standard crush rules) is a death sentence
> for the data on some pgs using both of those OSD (the ones not fully
> recovered before the second failure).
>
> OK, so the 2 OSDs (4,23) failed shortly one after the other but we think
> that the recovery of the first was finished before the second failed.
> Nonetheless, both problematic pgs have been on both OSDs. We think, that
> we still have enough shards left. For one of the pgs, the recovery state
> looks like this:
>
>  "recovery_state": [
>  {
>  "name": "Started/Primary/Peering/Incomplete",
>  "enter_time": "2019-05-09 16:11:48.625966",
>  "comment": "not enough complete instances of this PG"
>  },
>  {
>  "name": "Started/Primary/Peering",
>  "enter_time": "2019-05-09 16:11:48.611171",
>  "past_intervals": [
>  {
>  "first": "49767",
>  "last": "59313",
>  "all_participants": [
>  {
>  "osd": 2,
>  "shard": 0
>  },
>  {
>  "osd": 4,
>  "shard": 1
>  },
>  {
>  "osd": 23,
>  "shard": 2
>  },
>  {
>  "osd": 24,
>  "shard": 0
>  },
>  {
>  "osd": 72,
>  "shard": 1
>  },
>  {
>  "osd": 79,
>  "shard": 3
>  }
>  ],
>  "intervals": [
>  {
>  "first": "58860",
>  "last": "58861",
>  "acting": "4(1),24(0),79(3)"
>  },
>  {
>  "first": "58875",
>  "last": "58877",
>  "acting": "4(1),23(2),24(0)"
>  },
>  {
>  "first": "59002",
>  "last": "59009",
>  "acting": "4(1),23(2),79(3)"
>  },
>  {
>  "first": "59010",
>  "last": "59012",
>  "acting": "2(0),4(1),23(2),79(3)"
>  },
>  {
>  "first": "59197",
>  "last": "59233",
>  "acting": "23(2),24(0),79(3)"
>  },
>  {
>  "first": "59234",
>  "last": "59313",
>  "acting": "23(2),24(0),72(1),79(3)"
>  }
>  ]
>  }
>  ],
>  "probing_osds": [
>  "2(0)",
>  "4(1)",
>  "23(2)",
>  "24(0)",
>  "72(1)",
>  "79(3)"
>  ],
>  "down_osds_we_would_probe": [],
>  "peering_blocked_by": [],
>  "peering_blocked_by_detail": [
>  {
>  "detail": "peering_blocked_by_history_les_bound"
>  }
>  ]
>  },
>  {
>  "name": "Started",
>  "enter_time": "2019-05-09 16:11:48.611121"
>  }
>  ],
> Is there a chance to recover this pg from the shards on OSDs 2, 72, 79?
> ceph pg repair/deep-scrub/scrub did not work.
>
> repair/scrub are not related to this problem so they won't help.
>
> How exactly did you use the osd_find_best_info_ignore_history_les option?

Re: [ceph-users] ceph mimic and samba vfs_ceph

2019-05-14 Thread Ansgar Jazdzewski
hi,

i was able to compile samba 4.10.2 using the mimic-headerfiles and it
works fine so far.
now we are looking forward to doing some real load tests.
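
For reference, a minimal smb.conf share using vfs_ceph looks roughly like
this (a sketch following the vfs_ceph manpage; share name, path and cephx
user are placeholders):

  [cephfs]
      path = /
      vfs objects = ceph
      ceph:config_file = /etc/ceph/ceph.conf
      ceph:user_id = samba
      kernel share modes = no
      read only = no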

Have a nice one,
Ansgar

Am Fr., 10. Mai 2019 um 13:33 Uhr schrieb Ansgar Jazdzewski
:
>
> thanks,
>
> i will try to "backport" this to ubuntu 16.04
>
> Ansgar
>
> Am Do., 9. Mai 2019 um 12:33 Uhr schrieb Paul Emmerich 
> :
> >
> > We maintain vfs_ceph for samba at mirror.croit.io for Debian Stretch and 
> > Buster.
> >
> > We apply a9c5be394da4f20bcfea7f6d4f5919d5c0f90219 on Samba 4.9 for
> > Buster to fix this.
> >
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> >
> > On Thu, May 9, 2019 at 9:25 AM Robert Sander
> >  wrote:
> > >
> > > On 08.05.19 23:23, Gregory Farnum wrote:
> > >
> > > > Fixing the wiring wouldn't be that complicated if you can hack on the
> > > > code at all, but there are some other issues with the Samba VFS
> > > > implementation that have prevented anyone from prioritizing it so far.
> > > > (Namely, smb forks for every incoming client connection, which means
> > > > every smb client gets a completely independent cephfs client, which is
> > > > very inefficient.)
> > >
> > > Inefficient because of multiplying the local cache efforts or because
> > > too much clients stress the MDS?
> > >
> > > I thought it would be more efficient to run multiple clients (in
> > > userspace) that interact in parallel with the Ceph cluster.
> > > Instead of having only one mounted filesystem (kernel or FUSE) where all
> > > the data passes through.
> > >
> > > Regards
> > > --
> > > Robert Sander
> > > Heinlein Support GmbH
> > > Schwedter Str. 8/9b, 10119 Berlin
> > >
> > > https://www.heinlein-support.de
> > >
> > > Tel: 030 / 405051-43
> > > Fax: 030 / 405051-19
> > >
> > > Amtsgericht Berlin-Charlottenburg - HRB 93818 B
> > > Geschäftsführer: Peer Heinlein - Sitz: Berlin
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph nautilus deep-scrub health error

2019-05-14 Thread nokia ceph
Hi Team,

After upgrading from Luminous to Nautilus, we see a "654 pgs not
deep-scrubbed in time" error in ceph status. How can we disable this flag?
In our setup we disable deep-scrubbing for performance issues.

Thanks,
Muthu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph nautilus deep-scrub health error

2019-05-14 Thread EDH - Manuel Rios Fernandez
Hi Muthu

 

We found the same issue near 2000 pgs not deep-scrubbed in time.

 

We’re manually force-scrubbing with:

 

ceph health detail | grep -i not | awk '{print $2}' | while read i; do ceph pg 
deep-scrub ${i}; done

 

It launches roughly 20-30 pgs to be deep-scrubbed. I think you can improve this with a 
sleep of 120 secs between scrubs to prevent overloading your OSDs.
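
For example, the same one-liner with the suggested pause added:

  ceph health detail | grep -i not | awk '{print $2}' | while read i; do ceph pg deep-scrub ${i}; sleep 120; done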

 

To disable deep-scrub you can use “ceph osd set nodeep-scrub”. Also, you can 
restrict deep-scrub with a time window and load threshold:

#Start Scrub 22:00

osd scrub begin hour = 22

#Stop Scrub 8

osd scrub end hour = 8

#Scrub Load 0.5

osd scrub load threshold = 0.5

 

Regards,

 

Manuel

 

 

 

 

From: ceph-users  On behalf of nokia ceph
Sent: Tuesday, 14 May 2019 11:44
To: Ceph Users 
Subject: [ceph-users] ceph nautilus deep-scrub health error

 

Hi Team,

 

After upgrading from Luminous to Nautilus , we see 654 pgs not deep-scrubbed in 
time error in ceph status . How can we disable this flag? . In our setup we 
disable deep-scrubbing for performance issues.

 

Thanks,

Muthu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Lost OSD from PCIe error, recovered, to restore OSD process

2019-05-14 Thread Tarek Zegar

Someone nuked an OSD that had 1-replica PGs. They accidentally did echo 1
> /sys/block/nvme0n1/device/device/remove
We got it back doing a echo 1 > /sys/bus/pci/rescan
However, it reenumerated as a different drive number (guess we didn't have
udev rules)
They restored the LVM volume (vgcfgrestore
ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841 ; vgchange -ay
ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841)

lsblk
nvme0n2   259:9   0  1.8T  0 disk
  ceph--8c81b2a3--6c8e--4cae--a3c0--e2d91f82d841-osd--data--74b01ec2--124d--427d--9812--e437f90261d4   253:1   0  1.8T  0 lvm

We are stuck here. How do we attach an OSD daemon to the drive? It was
OSD.122 previously

Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Kevin Flöh

ok, so now we see at least a difference in the recovery state:

    "recovery_state": [
    {
    "name": "Started/Primary/Peering/Incomplete",
    "enter_time": "2019-05-14 14:15:15.650517",
    "comment": "not enough complete instances of this PG"
    },
    {
    "name": "Started/Primary/Peering",
    "enter_time": "2019-05-14 14:15:15.243756",
    "past_intervals": [
    {
    "first": "49767",
    "last": "59580",
    "all_participants": [
    {
    "osd": 2,
    "shard": 0
    },
    {
    "osd": 4,
    "shard": 1
    },
    {
    "osd": 23,
    "shard": 2
    },
    {
    "osd": 24,
    "shard": 0
    },
    {
    "osd": 72,
    "shard": 1
    },
    {
    "osd": 79,
    "shard": 3
    }
    ],
    "intervals": [
    {
    "first": "59562",
    "last": "59563",
    "acting": "4(1),24(0),79(3)"
    },
    {
    "first": "59564",
    "last": "59567",
    "acting": "23(2),24(0),79(3)"
    },
    {
    "first": "59570",
    "last": "59574",
    "acting": "4(1),23(2),79(3)"
    },
    {
    "first": "59577",
    "last": "59580",
    "acting": "4(1),23(2),24(0)"
    }
    ]
    }
    ],
    "probing_osds": [
    "2(0)",
    "4(1)",
    "23(2)",
    "24(0)",
    "72(1)",
    "79(3)"
    ],
    "down_osds_we_would_probe": [],
    "peering_blocked_by": []
    },
    {
    "name": "Started",
    "enter_time": "2019-05-14 14:15:15.243663"
    }
    ],

the peering does not seem to be blocked anymore. But still there is no 
recovery going on. Is there anything else we can try?



On 14.05.19 11:02 vorm., Dan van der Ster wrote:

On Tue, May 14, 2019 at 10:59 AM Kevin Flöh  wrote:


On 14.05.19 10:08 vorm., Dan van der Ster wrote:

On Tue, May 14, 2019 at 10:02 AM Kevin Flöh  wrote:

On 13.05.19 10:51 nachm., Lionel Bouton wrote:

Le 13/05/2019 à 16:20, Kevin Flöh a écrit :

Dear ceph experts,

[...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
Here is what happened: One osd daemon could not be started and
therefore we decided to mark the osd as lost and set it up from
scratch. Ceph started recovering and then we lost another osd with
the same behavior. We did the same as for the first osd.

With 3+1 you only allow a single OSD failure per pg at a given time.
You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2
separate servers (assuming standard crush rules) is a death sentence
for the data on some pgs using both of those OSD (the ones not fully
recovered before the second failure).

OK, so the 2 OSDs (4,23) failed shortly one after the other but we think
that the recovery of the first was finished before the second failed.
Nonetheless, both problematic pgs have been on both OSDs. We think, that
we still have enough shards left. For one of the pgs, the recovery state
looks like this:

  "recovery_state": [
  {
  "name": "Started/Primary/Peering/Incomplete",
  "enter_time": "2019-05-09 16:11:48.625966",
  "comment": "not enough complete instances of this PG"
  },
  {
  "name": "Started/Primary/Peering",
  "enter_time": "2019-05-09 16:11:48.611171",
  "past_intervals": [
  {
  "first": "49767",
  "last": "59313",
  "all_participants": [
  {
  "osd": 2,
  "shard": 0
  },
  {
  "osd": 4,
  "shard": 1
  },
  {
  "osd": 23,
  "shard": 2
  

Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Dan van der Ster
On Tue, May 14, 2019 at 5:13 PM Kevin Flöh  wrote:
>
> ok, so now we see at least a diffrence in the recovery state:
>
>  "recovery_state": [
>  {
>  "name": "Started/Primary/Peering/Incomplete",
>  "enter_time": "2019-05-14 14:15:15.650517",
>  "comment": "not enough complete instances of this PG"
>  },
>  {
>  "name": "Started/Primary/Peering",
>  "enter_time": "2019-05-14 14:15:15.243756",
>  "past_intervals": [
>  {
>  "first": "49767",
>  "last": "59580",
>  "all_participants": [
>  {
>  "osd": 2,
>  "shard": 0
>  },
>  {
>  "osd": 4,
>  "shard": 1
>  },
>  {
>  "osd": 23,
>  "shard": 2
>  },
>  {
>  "osd": 24,
>  "shard": 0
>  },
>  {
>  "osd": 72,
>  "shard": 1
>  },
>  {
>  "osd": 79,
>  "shard": 3
>  }
>  ],
>  "intervals": [
>  {
>  "first": "59562",
>  "last": "59563",
>  "acting": "4(1),24(0),79(3)"
>  },
>  {
>  "first": "59564",
>  "last": "59567",
>  "acting": "23(2),24(0),79(3)"
>  },
>  {
>  "first": "59570",
>  "last": "59574",
>  "acting": "4(1),23(2),79(3)"
>  },
>  {
>  "first": "59577",
>  "last": "59580",
>  "acting": "4(1),23(2),24(0)"
>  }
>  ]
>  }
>  ],
>  "probing_osds": [
>  "2(0)",
>  "4(1)",
>  "23(2)",
>  "24(0)",
>  "72(1)",
>  "79(3)"
>  ],
>  "down_osds_we_would_probe": [],
>  "peering_blocked_by": []
>  },
>  {
>  "name": "Started",
>  "enter_time": "2019-05-14 14:15:15.243663"
>  }
>  ],
>
> the peering does not seem to be blocked anymore. But still there is no
> recovery going on. Is there anything else we can try?

What is the state of the hdd's which had osds 4 & 23?
You may be able to use ceph-objectstore-tool to export those PG shards
and import to another operable OSD.
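
A rough sketch of that, assuming the old disks were still readable, the
involved OSD daemons are stopped, and default data paths (here pg 1.5dd,
whose shard 1 lived on osd.4):

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 --pgid 1.5dds1 --op export --file /root/1.5dds1.export
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<target> --op import --file /root/1.5dds1.export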

-- dan



>
>
> On 14.05.19 11:02 vorm., Dan van der Ster wrote:
> > On Tue, May 14, 2019 at 10:59 AM Kevin Flöh  wrote:
> >>
> >> On 14.05.19 10:08 vorm., Dan van der Ster wrote:
> >>
> >> On Tue, May 14, 2019 at 10:02 AM Kevin Flöh  wrote:
> >>
> >> On 13.05.19 10:51 nachm., Lionel Bouton wrote:
> >>
> >> Le 13/05/2019 à 16:20, Kevin Flöh a écrit :
> >>
> >> Dear ceph experts,
> >>
> >> [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
> >> Here is what happened: One osd daemon could not be started and
> >> therefore we decided to mark the osd as lost and set it up from
> >> scratch. Ceph started recovering and then we lost another osd with
> >> the same behavior. We did the same as for the first osd.
> >>
> >> With 3+1 you only allow a single OSD failure per pg at a given time.
> >> You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2
> >> separate servers (assuming standard crush rules) is a death sentence
> >> for the data on some pgs using both of those OSD (the ones not fully
> >> recovered before the second failure).
> >>
> >> OK, so the 2 OSDs (4,23) failed shortly one after the other but we think
> >> that the recovery of the first was finished before the second failed.
> >> Nonetheless, both problematic pgs have been on both OSDs. We think, that
> >> we still have enough shards left. For one of the pgs, the recovery state
> >> looks like this:
> >>
> >>   "recovery_state": [
> >>   {
> >>   "name": "Started/Primary/Peering/Incomplete",
> >>   "enter_time": "2019-05-09 16:11:48.625966",
> >>   "comment": "not enough complete instances of this PG"
> >>   },
> >>   {
> >>   "name": "Started/Pri

Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Konstantin Shalygin

  peering does not seem to be blocked anymore. But still there is no
recovery going on. Is there anything else we can try?



Try to reduce min_size for problem pool as 'health detail' suggested: 
`ceph osd pool set ec31 min_size 2`.




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph nautilus deep-scrub health error

2019-05-14 Thread Brett Chancellor
You can increase your scrub intervals.
osd deep scrub interval
osd scrub max interval
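
For example, on Nautilus these can be raised cluster-wide through the config
database (values in seconds; both default to one week, the numbers below are
only illustrative):

  ceph config set osd osd_deep_scrub_interval 2592000
  ceph config set osd osd_scrub_max_interval 2592000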

On Tue, May 14, 2019 at 7:00 AM EDH - Manuel Rios Fernandez <
mrios...@easydatahost.com> wrote:

> Hi Muthu
>
>
>
> We found the same issue near 2000 pgs not deep-scrubbed in time.
>
>
>
> We’re manually force scrubbing with :
>
>
>
> ceph health detail | grep -i not | awk '{print $2}' | while read i; do
> ceph pg deep-scrub ${i}; done
>
>
>
> It launch near 20-30 pgs to be deep-scrubbed. I think you can improve
>  with a sleep of 120 secs between scrub to prevent overload your osd.
>
>
>
> For disable deep-scrub you can use “ceph osd set nodeep-scrub” , Also you
> can setup deep-scrub with threshold .
>
> #Start Scrub 22:00
>
> osd scrub begin hour = 22
>
> #Stop Scrub 8
>
> osd scrub end hour = 8
>
> #Scrub Load 0.5
>
> osd scrub load threshold = 0.5
>
>
>
> Regards,
>
>
>
> Manuel
>
>
>
>
>
>
>
>
>
> *From:* ceph-users  *On behalf of* nokia
> ceph
> *Sent:* Tuesday, 14 May 2019 11:44
> *To:* Ceph Users 
> *Subject:* [ceph-users] ceph nautilus deep-scrub health error
>
>
>
> Hi Team,
>
>
>
> After upgrading from Luminous to Nautilus , we see 654 pgs not
> deep-scrubbed in time error in ceph status . How can we disable this flag?
> . In our setup we disable deep-scrubbing for performance issues.
>
>
>
> Thanks,
>
> Muthu
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Health Cron Script

2019-05-14 Thread Georgios Dimitrakakis

Hello,

I am wondering if there are people out there that still use 
"old-fashioned" CRON scripts to check Ceph's health, monitor it and receive 
email alerts.


If there are do you mind sharing your implementation?

Probably something similar to this: 
https://github.com/cernceph/ceph-scripts/blob/master/ceph-health-cron/ceph-health-cron
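
i.e., in its simplest form something like this (a sketch; the alert address
is a placeholder):

  #!/bin/bash
  # cron job: mail an alert whenever the cluster is not HEALTH_OK
  STATUS=$(ceph health)
  if [ "${STATUS}" != "HEALTH_OK" ]; then
      ceph health detail | mail -s "ceph health: ${STATUS}" admin@example.com
  fi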



Best regards,

G.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph -s finds 4 pools but ceph osd lspools says no pool which is the expected answer

2019-05-14 Thread Rainer Krienke
Hello,

for a freshly set up ceph cluster I see a strange difference between the
number of existing pools in the output of ceph -s and what I know should
be there: no pools at all.

I set up a fresh Nautilus cluster with 144 OSDs on 9 hosts. Just to play
around I created a pool named rbd with

$ ceph osd pool create rbd 512 512 replicated

In ceph -s I saw the pool but also saw a warning:

 cluster:
id: a-b-c-d-e
health: HEALTH_WARN
too few PGs per OSD (21 < min 30)

So I experimented around, removed the pool (ceph osd pool remove rbd)
and it was gone in ceph osd lspools, and created a new one with some
more PGs and repeated this a few times with larger PG nums. In the end
in the output of ceph -s I see that 4 pools do exist:

  cluster:
id: a-b-c-d-e
health: HEALTH_OK

  services:
mon: 3 daemons, quorum c2,c5,c8 (age 8h)
mgr: c2(active, since 8h)
osd: 144 osds: 144 up (since 8h), 144 in (since 8h)

  data:
pools:   4 pools, 0 pgs
objects: 0 objects, 0 B
usage:   155 GiB used, 524 TiB / 524 TiB avail
pgs:

but:

$ ceph osd lspools


Since I deleted each pool I created, 0 pools is the correct answer.
I could add another "ghost" pool by creating another pool named rbd with
only 512 PGs and then delete it again right away. ceph -s would then
show me 5 pools. This is the way I came from 3 to 4 "ghost pools".

This does not seem to happen if I use 2048 PGs for the new pool which I
do delete right afterwards. In this case the pool is created and ceph -s
shows one pool more (5) and if delete this pool again the counter in
ceph -s goes back to 4 again.

How can I fix the system so that ceph -s also understands that there are
actually no pools? There must be some inconsistency. Any ideas?

Thanks
Rainer
--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse  1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287
1001312
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rolling upgrade fails with flag norebalance with background IO [EXT]

2019-05-14 Thread Tarek Zegar

https://github.com/ceph/ceph-ansible/issues/3961   <--- created ticket

Thanks
Tarek



From:   Matthew Vernon 
To: Tarek Zegar , solarflo...@gmail.com
Cc: ceph-users@lists.ceph.com
Date:   05/14/2019 04:41 AM
Subject:[EXTERNAL] Re: [ceph-users] Rolling upgrade fails with flag
norebalance with background IO [EXT]



On 14/05/2019 00:36, Tarek Zegar wrote:
> It's not just mimic to nautilus
> I confirmed with luminous to mimic
>
> They are checking for clean pgs with flags set, they should unset flags,
> then check. Set flags again, move on to next osd

I think I'm inclined to agree that "norebalance" is likely to get in the
way when upgrading a cluster - our rolling upgrade playbook omits it.

OTOH, you might want to raise this on the ceph-ansible list (
ceph-ansi...@lists.ceph.com ) and/or as a github issue - I don't think
the ceph-ansible maintainers routinely watch this list.

HTH,

Matthew


--
 The Wellcome Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lost OSD from PCIe error, recovered, to restore OSD process

2019-05-14 Thread Bob R
Does 'ceph-volume lvm list' show it? If so you can try to activate it with
'ceph-volume lvm activate 122 74b01ec2--124d--427d--9812--e437f90261d4'

Bob

On Tue, May 14, 2019 at 7:35 AM Tarek Zegar  wrote:

> Someone nuked and OSD that had 1 replica PGs. They accidentally did echo 1
> > /sys/block/nvme0n1/device/device/remove
> We got it back doing a echo 1 > /sys/bus/pci/rescan
> However, it reenumerated as a different drive number (guess we didn't have
> udev rules)
> They restored the LVM volume (vgcfgrestore
> ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841 ; vgchange -ay
> ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841)
>
> lsblk
> nvme0n2 259:9 0 1.8T 0 diskc
> ceph--8c81b2a3--6c8e--4cae--a3c0--e2d91f82d841-osd--data--74b01ec2--124d--427d--9812--e437f90261d4
> 253:1 0 1.8T 0 lvm
>
> We are stuck here. How do we attach an OSD daemon to the drive? It was
> OSD.122 previously
>
> Thanks
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Using centralized management configuration drops some unrecognized config options

2019-05-14 Thread EDH - Manuel Rios Fernandez
Hi

 

We're moving our config to centralized management configuration with "ceph
config set" and with the minimal ceph.conf in all nodes.

 

Several options from ceph are not allowed. Why? 

ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
(stable)

 

ceph config set osd osd_mkfs_type xfs

Error EINVAL: unrecognized config option 'osd_mkfs_type'

ceph config set osd osd_op_threads 12

Error EINVAL: unrecognized config option 'osd_op_threads'

ceph config set osd osd_disk_threads 2

Error EINVAL: unrecognized config option 'osd_disk_threads'

ceph config set osd osd_recovery_threads 4

Error EINVAL: unrecognized config option 'osd_recovery_threads'

ceph config set osd osd_recovery_thread 4

Error EINVAL: unrecognized config option 'osd_recovery_thread'

 

Bug? Failed in the cli setup?
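
As a side note, 'ceph config help' can at least be used to check whether a
name is still a recognized option in this release, e.g.:

  ceph config help osd_op_threads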

 

Regards

 

Manuel

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Kevin Flöh

Hi,

since we have 3+1 ec I didn't try before. But when I run the command you 
suggested I get the following error:


ceph osd pool set ec31 min_size 2
Error EINVAL: pool min_size must be between 3 and 4

On 14.05.19 6:18 nachm., Konstantin Shalygin wrote:



  peering does not seem to be blocked anymore. But still there is no
recovery going on. Is there anything else we can try?



Try to reduce min_size for problem pool as 'health detail' suggested: 
`ceph osd pool set ec31 min_size 2`.




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph -s finds 4 pools but ceph osd lspools says no pool which is the expected answer

2019-05-14 Thread Rainer Krienke
Hello,

since I had no idea what the wrong pool number in the ceph -s output
could be caused by, I simply rebooted all machines of this cluster (it does
not yet contain any real data), which solved the problem.

So it seems that some caching problem might have caused this issue.

Thanks
Rainer

Am 14.05.19 um 20:03 schrieb Rainer Krienke:
> Hello,
> 
> for a fresh setup ceph cluster I see a strange difference in the number
> of existing pools in the output of ceph -s and what I know that should
> be there: no pools at all.
> 
> I set up a fresh Nautilus cluster with 144 OSDs on 9 hosts. Just to play
> around I created a pool named rbd with
> 



-- 
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
Web: http://userpages.uni-koblenz.de/~krienke
PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Kevin Flöh
The hdds of OSDs 4 and 23 are completely lost, we cannot access them in 
any way. Is it possible to use the shards which are maybe stored on 
working OSDs as shown in the all_participants list?


On 14.05.19 5:24 nachm., Dan van der Ster wrote:

On Tue, May 14, 2019 at 5:13 PM Kevin Flöh  wrote:

ok, so now we see at least a diffrence in the recovery state:

  "recovery_state": [
  {
  "name": "Started/Primary/Peering/Incomplete",
  "enter_time": "2019-05-14 14:15:15.650517",
  "comment": "not enough complete instances of this PG"
  },
  {
  "name": "Started/Primary/Peering",
  "enter_time": "2019-05-14 14:15:15.243756",
  "past_intervals": [
  {
  "first": "49767",
  "last": "59580",
  "all_participants": [
  {
  "osd": 2,
  "shard": 0
  },
  {
  "osd": 4,
  "shard": 1
  },
  {
  "osd": 23,
  "shard": 2
  },
  {
  "osd": 24,
  "shard": 0
  },
  {
  "osd": 72,
  "shard": 1
  },
  {
  "osd": 79,
  "shard": 3
  }
  ],
  "intervals": [
  {
  "first": "59562",
  "last": "59563",
  "acting": "4(1),24(0),79(3)"
  },
  {
  "first": "59564",
  "last": "59567",
  "acting": "23(2),24(0),79(3)"
  },
  {
  "first": "59570",
  "last": "59574",
  "acting": "4(1),23(2),79(3)"
  },
  {
  "first": "59577",
  "last": "59580",
  "acting": "4(1),23(2),24(0)"
  }
  ]
  }
  ],
  "probing_osds": [
  "2(0)",
  "4(1)",
  "23(2)",
  "24(0)",
  "72(1)",
  "79(3)"
  ],
  "down_osds_we_would_probe": [],
  "peering_blocked_by": []
  },
  {
  "name": "Started",
  "enter_time": "2019-05-14 14:15:15.243663"
  }
  ],

the peering does not seem to be blocked anymore. But still there is no
recovery going on. Is there anything else we can try?

What is the state of the hdd's which had osds 4 & 23?
You may be able to use ceph-objectstore-tool to export those PG shards
and import to another operable OSD.

-- dan





On 14.05.19 11:02 vorm., Dan van der Ster wrote:

On Tue, May 14, 2019 at 10:59 AM Kevin Flöh  wrote:

On 14.05.19 10:08 vorm., Dan van der Ster wrote:

On Tue, May 14, 2019 at 10:02 AM Kevin Flöh  wrote:

On 13.05.19 10:51 nachm., Lionel Bouton wrote:

Le 13/05/2019 à 16:20, Kevin Flöh a écrit :

Dear ceph experts,

[...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
Here is what happened: One osd daemon could not be started and
therefore we decided to mark the osd as lost and set it up from
scratch. Ceph started recovering and then we lost another osd with
the same behavior. We did the same as for the first osd.

With 3+1 you only allow a single OSD failure per pg at a given time.
You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2
separate servers (assuming standard crush rules) is a death sentence
for the data on some pgs using both of those OSD (the ones not fully
recovered before the second failure).

OK, so the 2 OSDs (4,23) failed shortly one after the other but we think
that the recovery of the first was finished before the second failed.
Nonetheless, both problematic pgs have been on both OSDs. We think, that
we still have enough shards left. For one of the pgs, the recovery state
looks like this:

   "recovery_state": [
   {
   "name": "Started/Primary/Peering/Incomplete",
   "enter_time": "2019-05-09 16:11:48.625966",
   "comment": "not enough complete instances of this PG"
   },
   {
   "name": "Started/Primary/Peering",