Hi, I got another failure and this time I was able to investigate a bit.

1. If I delete the OSD and recreate it with the exact same setup, the OSD
boots up successfully.
2. However, diffing the logs between the failed run and the successful one,
I noticed something odd: https://www.diffchecker.com/sSHrxwC9

In every successful OSD startup, the following lines are executed:

Running command: ln -snf
/dev/inaugurator/ed5a15e8-20b9-4312-991c-1a4d91b284bd-wal
/var/lib/ceph/osd/ceph-5/block.wal
Running command: chown -h ceph:ceph
/dev/inaugurator/ed5a15e8-20b9-4312-991c-1a4d91b284bd-wal


However, in every failed run these two lines are missing. Any idea why this
would occur?
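
In case it helps anyone reproduce, re-running the two missing steps by hand
before starting the OSD should be equivalent to what ceph-volume does (a
sketch based on the logged commands above; the device path and OSD id are
from my setup):

# recreate the WAL symlink that activation normally creates
ln -snf /dev/inaugurator/ed5a15e8-20b9-4312-991c-1a4d91b284bd-wal \
    /var/lib/ceph/osd/ceph-5/block.wal
# give the link target the ownership the OSD expects
chown -h ceph:ceph /dev/inaugurator/ed5a15e8-20b9-4312-991c-1a4d91b284bd-wal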


Last but not least: I have set the log level to 20; however, it seems that
BlueStore crashes before even getting to the point where things are logged.
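
(If it helps, I believe the equivalent foreground invocation with the debug
overrides would look something like this; the OSD id is from my setup, and I
am assuming the default cluster name. With -d the daemon logs to stderr, so
even very early output should be captured:)

/usr/bin/ceph-osd -d --cluster ceph --id 5 \
    --debug-bluestore 20 --debug-bluefs 20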

Regards
Benoit



On Mon, 6 Aug 2018 at 13:07, Benoit Hudzia <ben...@stratoscale.com> wrote:

> Thanks, I'll try to check if I can reproduce it. It's really sporadic and
> occurs every 20-30 runs. I might check whether it always occurs on the same
> server; maybe it's a HW issue.
>
> On Mon, 6 Aug 2018 at 06:12, Gregory Farnum <gfar...@redhat.com> wrote:
>
>> This isn't very complete as it just indicates that something went wrong
>> with a read. Since I presume it happens on every startup, it may help if
>> you set "debug bluestore = 20" in the OSD's config and provide that log
>> (perhaps with ceph-post-file if it's large).
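>>
>> A minimal sketch of that override (in the [osd] section of ceph.conf on
>> that node, followed by an OSD restart):
>>
>>   [osd]
>>   debug bluestore = 20
>>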
>> I also went through my email and found
>> https://tracker.ceph.com/issues/24639, in case you have anything in common
>> with that deployment. (But you probably don't; a read error generally
>> points to bad state on disk that was created somewhere else.)
>> -Greg
>>
>> On Sun, Aug 5, 2018 at 3:19 PM Benoit Hudzia <ben...@stratoscale.com>
>> wrote:
>>
>>> Hi,
>>>
>>> We have started seeing core dumps with Luminous 12.2.7. Any idea where
>>> this is coming from? We started having issues with BlueStore core dumping
>>> when we moved to 12.2.6 and hoped that 12.2.7 would have fixed it. We might
>>> need to revert to 12.2.5, as it seems a lot more stable.
>>>
>>> Pastebin link for full log: https://pastebin.com/na4E3m3N
>>>
>>>
>>> Core dump:
>>>
>>> starting osd.7 at - osd_data /var/lib/ceph/osd/ceph-7 
>>> /var/lib/ceph/osd/ceph-7/journal
>>> *** Caught signal (Segmentation fault) **
>>>  in thread 7fa8830cfd80 thread_name:ceph-osd
>>>  ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous 
>>> (stable)
>>>  1: (()+0xa48ec1) [0x55e010afcec1]
>>>  2: (()+0xf6d0) [0x7fa8807966d0]
>>>  3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned 
>>> long, unsigned long, ceph::buffer::list*, char*)+0x452) [0x55e010ab1e72]
>>>  4: (BlueFS::_replay(bool)+0x2ef) [0x55e010ac526f]
>>>  5: (BlueFS::mount()+0x1d4) [0x55e010ac8fd4]
>>>  6: (BlueStore::_open_db(bool)+0x1847) [0x55e0109e2da7]
>>>  7: (BlueStore::_mount(bool)+0x40e) [0x55e010a1406e]
>>>  8: (OSD::init()+0x3bd) [0x55e0105c934d]
>>>  9: (main()+0x2d07) [0x55e0104ce947]
>>>  10: (__libc_start_main()+0xf5) [0x7fa87f7a3445]
>>>  11: (()+0x4b9003) [0x55e01056d003]
>>> [the same timestamped backtrace is repeated twice more in the log dump,
>>> each copy followed by:]
>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed 
>>> to interpret this.
>>>
>>> /osd_entrypoint: line 98: 119388 Segmentation fault      (core dumped) 
>>> /usr/bin/ceph-osd -f --cluster "${CEPH_CLUSTERNAME}" --id "${OSD_ID}" 
>>> --setuser root --setgroup root
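>>>
>>> (For what it's worth, since the crash happens in BlueFS::_replay during
>>> mount, a read-only consistency check of the store might help narrow down
>>> where the bad on-disk state is; a sketch, assuming ceph-bluestore-tool is
>>> installed on the node:)
>>>
>>> ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-7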
>>>
>>>
>>>
>>>
>>>
>>
>
>
>

-- 
Dr. Benoit Hudzia

Mobile (UK): +44 (0) 75 346 78673
Mobile (IE):  +353 (0) 89 219 3675
Email: ben...@stratoscale.com



Web <http://www.stratoscale.com/> | Blog <http://www.stratoscale.com/blog/>
 | Twitter <https://twitter.com/Stratoscale> | Google+
<https://plus.google.com/u/1/b/108421603458396133912/108421603458396133912/posts>
 | Linkedin <https://www.linkedin.com/company/stratoscale>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
