Re: [ceph-users] Cluster crash - FAILED assert(interval.last > last)

2018-01-11 Thread Nick Fisk
I take my hat off to you, well done for solving that!!! > -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Zdenek Janda > Sent: 11 January 2018 13:01 > To: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Cluster cra

Re: [ceph-users] Cluster crash - FAILED assert(interval.last > last)

2018-01-11 Thread Zdenek Janda
Hi, we have restored the damaged OSDs that were not starting after the bug caused by this issue. Detailed steps are available for reference at http://tracker.ceph.com/issues/21142#note-9 ; should anybody hit this, they should fix it for you. Thanks Zdenek Janda On 11.1.2018 11:40, Zdenek Janda wrote: > Hi, > I have suc

Re: [ceph-users] Cluster crash - FAILED assert(interval.last > last)

2018-01-11 Thread Zdenek Janda
Hi, I have succeeded in identifying the faulty PG: -3450> 2018-01-11 11:32:20.015658 7f066e2a3e00 10 osd.15 15340 12.62d needs 13939-15333 -3449> 2018-01-11 11:32:20.019405 7f066e2a3e00 1 osd.15 15340 build_past_intervals_parallel over 13939-15333 -3448> 2018-01-11 11:32:20.019436 7f066e2a3e00 10
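For anyone trying the same approach on their own OSD logs, here is a minimal sketch of pulling the PG id and epoch range out of "needs" lines like the one quoted above. The line format is inferred only from the two log lines in this message, and the names `NEEDS_RE` and `find_pgs_needing_intervals` are hypothetical helpers, not part of Ceph. It only lists candidate PGs; which of them actually triggers the assert still has to be judged from the surrounding log.

```python
import re

# Pattern inferred from the quoted log line (an assumption, not a Ceph spec):
#   "... osd.15 15340 12.62d needs 13939-15333"
NEEDS_RE = re.compile(
    r"osd\.(?P<osd>\d+)\s+(?P<epoch>\d+)\s+"
    r"(?P<pg>\d+\.[0-9a-f]+)\s+needs\s+(?P<first>\d+)-(?P<last>\d+)"
)


def find_pgs_needing_intervals(lines):
    """Return (pg_id, first_epoch, last_epoch) for every 'needs' line."""
    results = []
    for line in lines:
        match = NEEDS_RE.search(line)
        if match:
            results.append(
                (match.group("pg"), int(match.group("first")), int(match.group("last")))
            )
    return results


# The two complete log lines quoted in the message above.
sample = [
    "-3450> 2018-01-11 11:32:20.015658 7f066e2a3e00 10 osd.15 15340 12.62d needs 13939-15333",
    "-3449> 2018-01-11 11:32:20.019405 7f066e2a3e00 1 osd.15 15340 build_past_intervals_parallel over 13939-15333",
]

for pg, first, last in find_pgs_needing_intervals(sample):
    print(f"PG {pg} needs past intervals for epochs {first}-{last}")
```

To run it against a real log, pass an open file object instead of `sample`; the function only iterates over lines.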

Re: [ceph-users] Cluster crash - FAILED assert(interval.last > last)

2018-01-11 Thread Zdenek Janda
Hi, updated the issue at http://tracker.ceph.com/issues/21142#note-5 with last 1 lines of strace before ABRT. Crash ends with: 0.002429 pread64(22, "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\354:\0\0"..., 12288, 908492996608) = 12288 0.007869 pread64(22,

Re: [ceph-users] Cluster crash - FAILED assert(interval.last > last)

2018-01-11 Thread Zdenek Janda
Hi, can anyone suggest what to do with this? I have identified the underlying crashing code in src/osd/osd_types.cc [assert(interval.last > last);], committed by Sage Weil, but I have not figured out the exact mechanism of the function or why it crashes. Also unclear is the mechanism by which this bug spread and cr

Re: [ceph-users] Cluster crash - FAILED assert(interval.last > last)

2018-01-11 Thread Josef Zelenka
I have posted logs/strace from our OSDs, with details, to a ticket in the Ceph bug tracker; see http://tracker.ceph.com/issues/21142. You can see exactly where the OSDs crash etc., which may help if someone decides to debug it. JZ On 10/01/18 22:05, Josef Zelenka wrote: Hi, today w

[ceph-users] Cluster crash - FAILED assert(interval.last > last)

2018-01-10 Thread Josef Zelenka
Hi, today we had a disastrous crash. We are running a 3-node cluster with 24 OSDs in total (8 per node), with SSDs for blockdb and HDDs for BlueStore data. This cluster is used as a radosgw backend for storing a large number of thumbnails for a file hosting site, around 110m files in total. We were adding