Hello,

We are running Ceph 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43) on a
4.5.2 kernel. Our cluster currently consists of 5 nodes with 6 OSDs each.

An issue has been filed here (also containing logs, etc.):
http://tracker.ceph.com/issues/15813

Last night we observed a single OSD (osd.11) die with an assertion:

2016-05-10 03:16:30.718936 7fa5166dc700 -1 common/Mutex.cc: In function
'void Mutex::Lock(bool)' thread 7fa5166dc700 time 2016-05-10 03:16:30.688044

common/Mutex.cc: 100: FAILED assert(r == 0)


 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)

 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x80) [0xb34520]

 2: (Mutex::Lock(bool)+0x105) [0xae2395]

 3: (DispatchQueue::discard_queue(unsigned long)+0x37) [0xbeff67]

 4: (Pipe::fault(bool)+0x426) [0xc16256]

 5: (Pipe::reader()+0x3f2) [0xc1d752]

 6: (Pipe::Reader::entry()+0xd) [0xc2880d]

 7: (()+0x7474) [0x7fa54c70b474]

 8: (clone()+0x6d) [0x7fa54ac01acd]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.


--- begin dump of recent events ---

   -74> 2016-05-09 11:56:56.015711 7fa54d1fc7c0  5 asok(0x5644000)
register_command perfcounters_dump hook 0x5624030

   -73> 2016-05-09 11:56:56.015739 7fa54d1fc7c0  5 asok(0x5644000)
register_command 1 hook 0x5624030

   -72> 2016-05-09 11:56:56.015745 7fa54d1fc7c0  5 asok(0x5644000)
register_command perf dump hook 0x5624030

   -71> 2016-05-09 11:56:56.015751 7fa54d1fc7c0  5 asok(0x5644000)
register_command perfcounters_schema hook 0x5624030

   -70> 2016-05-09 11:56:56.015756 7fa54d1fc7c0  5 asok(0x5644000)
register_command 2 hook 0x5624030

   -69> 2016-05-09 11:56:56.015758 7fa54d1fc7c0  5 asok(0x5644000)
register_command perf schema hook 0x5624030

   -68> 2016-05-09 11:56:56.015763 7fa54d1fc7c0  5 asok(0x5644000)
register_command perf reset hook 0x5624030

   -67> 2016-05-09 11:56:56.015766 7fa54d1fc7c0  5 asok(0x5644000)
register_command config show hook 0x5624030

   -66> 2016-05-09 11:56:56.015770 7fa54d1fc7c0  5 asok(0x5644000)
register_command config set hook 0x5624030

   -65> 2016-05-09 11:56:56.015773 7fa54d1fc7c0  5 asok(0x5644000)
register_command config get hook 0x5624030

   -64> 2016-05-09 11:56:56.015776 7fa54d1fc7c0  5 asok(0x5644000)
register_command config diff hook 0x5624030

   -63> 2016-05-09 11:56:56.015779 7fa54d1fc7c0  5 asok(0x5644000)
register_command log flush hook 0x5624030

   -62> 2016-05-09 11:56:56.015783 7fa54d1fc7c0  5 asok(0x5644000)
register_command log dump hook 0x5624030

   -61> 2016-05-09 11:56:56.015786 7fa54d1fc7c0  5 asok(0x5644000)
register_command log reopen hook 0x5624030

   -60> 2016-05-09 11:56:56.017553 7fa54d1fc7c0  0 ceph version 0.94.5
(9764da52395923e0b32908d83a9f7304401fee43), process ceph-osd, pid 76017

   -59> 2016-05-09 11:56:56.027154 7fa54d1fc7c0  0
filestore(/var/lib/ceph/osd/ceph-11) backend xfs (magic 0x58465342)

   -58> 2016-05-09 11:56:56.028635 7fa54d1fc7c0  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP
ioctl is supported and appears to work

   -57> 2016-05-09 11:56:56.028644 7fa54d1fc7c0  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option

   -56> 2016-05-09 11:56:56.042822 7fa54d1fc7c0  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)

   -55> 2016-05-09 11:56:56.043047 7fa54d1fc7c0  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature: extsize is
supported and kernel 4.5.2-1-ARCH >= 3.5

   -54> 2016-05-09 11:56:56.109483 7fa54d1fc7c0  0
filestore(/var/lib/ceph/osd/ceph-11) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled

   -53> 2016-05-09 11:56:56.110110 7fa54d1fc7c0 -1 journal
FileJournal::_open: disabling aio for non-block journal.  Use
journal_force_aio to force use of aio anyway

   -52> 2016-05-09 11:56:56.110825 7fa54d1fc7c0  0 <cls>
cls/hello/cls_hello.cc:271: loading cls_hello

   -51> 2016-05-09 11:56:56.114886 7fa54d1fc7c0  0 osd.11 9819 crush map
has features 283675107524608, adjusting msgr requires for clients

   -50> 2016-05-09 11:56:56.114895 7fa54d1fc7c0  0 osd.11 9819 crush map
has features 283675107524608 was 8705, adjusting msgr requires for mons

   -49> 2016-05-09 11:56:56.114899 7fa54d1fc7c0  0 osd.11 9819 crush map
has features 283675107524608, adjusting msgr requires for osds

   -48> 2016-05-09 11:56:56.114919 7fa54d1fc7c0  0 osd.11 9819 load_pgs

   -47> 2016-05-09 11:56:57.000584 7fa54d1fc7c0  0 osd.11 9819 load_pgs
opened 55 pgs

   -46> 2016-05-09 11:56:57.000991 7fa54d1fc7c0 -1 osd.11 9819
log_to_monitors {default=true}

   -45> 2016-05-09 11:56:57.052319 7fa54d1fc7c0  0 osd.11 9819 done with
init, starting boot process

   -44> 2016-05-09 11:57:01.103141 7fa50d84e700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::3]:6804/75598
pipe(0x11152000 sd=80 :6806 s=0 pgs=0 cs=0 l=0 c=0x10ec6580).accept
connect_seq 0 vs existing 0 state connecting

   -43> 2016-05-09 12:11:57.005225 7fa51a614700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6800/20004064
pipe(0x104da000 sd=33 :6098 s=2 pgs=934 cs=1 l=0 c=0x97de700).fault with
nothing to send, going to standby

   -42> 2016-05-09 12:11:57.115818 7fa5188f7700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6807/20003350
pipe(0x10a13000 sd=68 :20154 s=2 pgs=1642 cs=1 l=0 c=0x1054e100).fault with
nothing to send, going to standby

   -41> 2016-05-09 12:11:58.959851 7fa519806700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6817/21003580
pipe(0x10845000 sd=47 :52276 s=2 pgs=1636 cs=1 l=0 c=0x1054d200).fault with
nothing to send, going to standby

   -40> 2016-05-09 12:12:01.173115 7fa50d84e700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::3]:6804/75598
pipe(0x11152000 sd=80 :6806 s=2 pgs=25 cs=1 l=0 c=0x10ec5c80).fault with
nothing to send, going to standby

   -39> 2016-05-09 12:19:10.564593 7fa539988700  0 --
[fd00:2380:0:20::3]:6806/76017 >> [fd00:2380:0:20::4]:0/1580628965
pipe(0x1285d000 sd=193 :6806 s=0 pgs=0 cs=0 l=0 c=0x10ec4d80).accept peer
addr is really [fd00:2380:0:20::4]:0/1580628965 (socket is
[fd00:2380:0:20::4]:29739/0)

   -38> 2016-05-09 12:59:17.227432 7fa509ea6700  0 --
[fd00:2380:0:20::3]:6806/76017 >> [fd00:2380:0:20::1]:0/2953656339
pipe(0x12122000 sd=271 :6806 s=0 pgs=0 cs=0 l=0 c=0x12549080).accept peer
addr is really [fd00:2380:0:20::1]:0/2953656339 (socket is
[fd00:2380:0:20::1]:41764/0)

   -37> 2016-05-10 02:35:46.567783 7fa52052d700  0 log_channel(cluster) log
[INF] : 3.1db scrub starts

   -36> 2016-05-10 02:35:51.683231 7fa520d2e700  0 log_channel(cluster) log
[INF] : 3.1db scrub ok

   -35> 2016-05-10 02:36:35.583938 7fa51fd2c700  0 log_channel(cluster) log
[INF] : 3.2c scrub starts

   -34> 2016-05-10 02:36:40.647568 7fa520d2e700  0 log_channel(cluster) log
[INF] : 3.2c scrub ok


At this point, the primary monitor marked the OSD as "down".


   -33> 2016-05-10 03:01:54.554677 7fa519b09700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::1]:6800/4229
pipe(0x10828000 sd=52 :6696 s=2 pgs=75 cs=1 l=0 c=0x1054ca80).fault with
nothing to send, going to standby

   -32> 2016-05-10 03:01:54.554757 7fa51a715700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6813/22004022
pipe(0x104df000 sd=34 :4162 s=2 pgs=1199 cs=1 l=0 c=0x97deb80).fault with
nothing to send, going to standby

   -31> 2016-05-10 03:01:54.554933 7fa5189f8700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6800/20003787
pipe(0x10a25000 sd=67 :42436 s=2 pgs=1845 cs=1 l=0 c=0x1054ee80).fault with
nothing to send, going to standby

   -30> 2016-05-10 03:01:54.555193 7fa51a412700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::1]:6810/71171
pipe(0x1052c000 sd=40 :9588 s=2 pgs=59 cs=1 l=0 c=0x97dea00).fault with
nothing to send, going to standby

   -29> 2016-05-10 03:01:54.556082 7fa5185f4700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::4]:6804/3787
pipe(0x10a82000 sd=77 :16254 s=2 pgs=1053 cs=1 l=0 c=0x1054fd80).fault with
nothing to send, going to standby

   -28> 2016-05-10 03:01:54.556134 7fa51a816700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::4]:6800/3315
pipe(0x104b9000 sd=32 :9106 s=2 pgs=627 cs=1 l=0 c=0x97de280).fault with
nothing to send, going to standby

   -27> 2016-05-10 03:01:54.556252 7fa51a917700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::4]:6806/4027
pipe(0x5bde000 sd=28 :12780 s=2 pgs=1145 cs=1 l=0 c=0x97dd800).fault with
nothing to send, going to standby

   -26> 2016-05-10 03:01:54.557199 7fa5186f5700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6803/126743
pipe(0x10a6c000 sd=72 :43264 s=2 pgs=62 cs=1 l=0 c=0x1054f900).fault with
nothing to send, going to standby

   -25> 2016-05-10 03:01:54.557588 7fa519a08700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::1]:6808/5200
pipe(0x10840000 sd=54 :36606 s=2 pgs=51 cs=1 l=0 c=0x1054cf00).fault with
nothing to send, going to standby

   -24> 2016-05-10 03:01:54.557850 7fa518cfb700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::1]:6806/4952
pipe(0x10531000 sd=65 :42550 s=2 pgs=61 cs=1 l=0 c=0x97dfd80).fault with
nothing to send, going to standby

   -23> 2016-05-10 03:01:54.558839 7fa519604700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6801/21003546
pipe(0x10978000 sd=27 :52456 s=2 pgs=937 cs=1 l=0 c=0x1054d680).fault with
nothing to send, going to standby

   -22> 2016-05-10 03:01:54.558903 7fa519402700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::4]:6811/102397
pipe(0x1097d000 sd=38 :36150 s=2 pgs=62 cs=1 l=0 c=0x1054db00).fault with
nothing to send, going to standby

   -21> 2016-05-10 03:01:54.559397 7fa518bfa700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6802/93235
pipe(0x10a0e000 sd=48 :6828 s=2 pgs=54 cs=1 l=0 c=0x1054de00).fault with
nothing to send, going to standby

   -20> 2016-05-10 03:01:54.559578 7fa51a311700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6812/22043768
pipe(0x1050e000 sd=29 :52358 s=2 pgs=830 cs=1 l=0 c=0x97de580).fault with
nothing to send, going to standby

   -19> 2016-05-10 03:01:54.560105 7fa51a513700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6813/22079394
pipe(0x104fc000 sd=35 :3652 s=2 pgs=855 cs=1 l=0 c=0x97ddf80).fault with
nothing to send, going to standby

   -18> 2016-05-10 03:01:54.560870 7fa519c0a700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6809/22003317
pipe(0x106ff000 sd=50 :32878 s=2 pgs=1407 cs=1 l=0 c=0x1054c600).fault with
nothing to send, going to standby

   -17> 2016-05-10 03:01:54.561127 7fa51aa18700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::1]:6802/4449
pipe(0x104b4000 sd=31 :30636 s=2 pgs=111 cs=1 l=0 c=0x97ddc80).fault with
nothing to send, going to standby

   -16> 2016-05-10 03:01:54.562713 7fa5187f6700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::4]:6808/63020
pipe(0x10a20000 sd=64 :7038 s=2 pgs=608 cs=1 l=0 c=0x1054e400).fault with
nothing to send, going to standby

   -15> 2016-05-10 03:01:54.564040 7fa51a210700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::4]:6802/3556
pipe(0x10501000 sd=30 :41532 s=2 pgs=1367 cs=1 l=0 c=0x97df180).fault with
nothing to send, going to standby

   -14> 2016-05-10 03:01:54.564845 7fa51a00e700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6809/21003823
pipe(0x106fa000 sd=41 :37786 s=2 pgs=1743 cs=1 l=0 c=0x1054c180).fault with
nothing to send, going to standby

   -13> 2016-05-10 03:02:19.610294 7fa519e0c700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::1]:6800/4229
pipe(0x10828000 sd=19 :7102 s=1 pgs=75 cs=1 l=0 c=0x1054ca80).connect got
RESETSESSION

   -12> 2016-05-10 03:02:19.610500 7fa518dfc700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6803/126743
pipe(0x10a6c000 sd=17 :43648 s=1 pgs=62 cs=1 l=0 c=0x1054f900).connect got
RESETSESSION

   -11> 2016-05-10 03:02:20.509905 7fa519907700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6817/21003580
pipe(0x10845000 sd=22 :52680 s=1 pgs=1636 cs=1 l=0 c=0x1054d200).connect
got RESETSESSION

   -10> 2016-05-10 03:02:20.510001 7fa518ffe700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6800/20003787
pipe(0x10a25000 sd=23 :42830 s=1 pgs=1845 cs=1 l=0 c=0x1054ee80).connect
got RESETSESSION

    -9> 2016-05-10 03:02:21.372603 7fa51a10f700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6809/21003823
pipe(0x106fa000 sd=25 :38202 s=1 pgs=1743 cs=1 l=0 c=0x1054c180).connect
got RESETSESSION

    -8> 2016-05-10 03:02:21.372712 7fa519705700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6801/21003546
pipe(0x10978000 sd=27 :52864 s=1 pgs=937 cs=1 l=0 c=0x1054d680).connect got
RESETSESSION

    -7> 2016-05-10 03:02:21.392847 7fa51a00e700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6801/21003546
pipe(0x10978000 sd=27 :52864 s=2 pgs=941 cs=1 l=0 c=0x1054d680).fault,
initiating reconnect

    -6> 2016-05-10 03:02:21.393394 7fa519705700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6801/21003546
pipe(0x10978000 sd=27 :52868 s=1 pgs=941 cs=2 l=0 c=0x1054d680).connect got
RESETSESSION

    -5> 2016-05-10 03:02:21.393697 7fa51b11f700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::4]:6800/3315
pipe(0x104b9000 sd=32 :9540 s=1 pgs=627 cs=1 l=0 c=0x97de280).connect got
RESETSESSION

    -4> 2016-05-10 03:02:24.368774 7fa519806700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6800/20003787
pipe(0x10a25000 sd=23 :42830 s=2 pgs=1848 cs=1 l=0 c=0x1054ee80).fault,
initiating reconnect

    -3> 2016-05-10 03:02:24.368786 7fa5189f8700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6817/21003580
pipe(0x10845000 sd=22 :52680 s=2 pgs=1641 cs=1 l=0 c=0x1054d200).fault,
initiating reconnect

    -2> 2016-05-10 03:02:24.369396 7fa518ffe700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6800/20003787
pipe(0x10a25000 sd=23 :42842 s=1 pgs=1848 cs=2 l=0 c=0x1054ee80).connect
got RESETSESSION

    -1> 2016-05-10 03:02:24.369480 7fa519907700  0 --
[fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6817/21003580
pipe(0x10845000 sd=22 :52692 s=1 pgs=1641 cs=2 l=0 c=0x1054d200).connect
got RESETSESSION


After several minutes, osd.11 was marked OUT by mon.0, resulting in some
data movement.


2016-05-10 03:07:01.826746 mon.0 [fd00:2380:0:20::c9]:6789/0 11317108 :
cluster [INF] osd.11 out (down for 307.274008)


The next thing we see is osd.11 terminating because of an assertion.


     0> 2016-05-10 03:16:30.718936 7fa5166dc700 -1 common/Mutex.cc: In
function 'void Mutex::Lock(bool)' thread 7fa5166dc700 time 2016-05-10
03:16:30.688044

common/Mutex.cc: 100: FAILED assert(r == 0)


 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)

 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x80) [0xb34520]

 2: (Mutex::Lock(bool)+0x105) [0xae2395]

 3: (DispatchQueue::discard_queue(unsigned long)+0x37) [0xbeff67]

 4: (Pipe::fault(bool)+0x426) [0xc16256]

 5: (Pipe::reader()+0x3f2) [0xc1d752]

 6: (Pipe::Reader::entry()+0xd) [0xc2880d]

 7: (()+0x7474) [0x7fa54c70b474]

 8: (clone()+0x6d) [0x7fa54ac01acd]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.


--- logging levels ---

   0/ 5 none

   0/ 0 lockdep

   0/ 0 context

   0/ 0 crush

   1/ 5 mds

   1/ 5 mds_balancer

   1/ 5 mds_locker

   1/ 5 mds_log

   1/ 5 mds_log_expire

   1/ 5 mds_migrator

   0/ 0 buffer

   0/ 0 timer

   0/ 1 filer

   0/ 1 striper

   0/ 1 objecter

   0/ 5 rados

   0/ 5 rbd

   0/ 5 rbd_replay

   0/ 0 journaler

   0/ 5 objectcacher

   0/ 5 client

   0/ 0 osd

   0/ 0 optracker

   0/ 0 objclass

   0/ 0 filestore

   1/ 3 keyvaluestore

   0/ 0 journal

   0/ 0 ms

   1/ 5 mon

   0/ 0 monc

   1/ 5 paxos

   0/ 0 tp

   0/ 0 auth

   1/ 5 crypto

   0/ 0 finisher

   0/ 0 heartbeatmap

   0/ 0 perfcounter

   1/ 5 rgw

   1/10 civetweb

   1/ 5 javaclient

   0/ 0 asok

   0/ 0 throttle

   0/ 0 refs

   1/ 5 xio

  -2/-2 (syslog threshold)

  -1/-1 (stderr threshold)

  max_recent     10000

  max_new         1000

  log_file /var/log/ceph/ceph-osd.11.log

--- end dump of recent events ---

2016-05-10 03:16:30.721888 7fa5166dc700 -1 *** Caught signal (Aborted) **

 in thread 7fa5166dc700


 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)

 1: /usr/bin/ceph-osd() [0xa36cfd]

 2: (()+0x10ef0) [0x7fa54c714ef0]

 3: (gsignal()+0x35) [0x7fa54ab4d295]

 4: (abort()+0x16a) [0x7fa54ab4e6da]

 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fa54b4663dd]

 6: (()+0x8f196) [0x7fa54b464196]

 7: (()+0x8f1e1) [0x7fa54b4641e1]

 8: (()+0x8f3f8) [0x7fa54b4643f8]

 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x27e) [0xb3471e]

 10: (Mutex::Lock(bool)+0x105) [0xae2395]

 11: (DispatchQueue::discard_queue(unsigned long)+0x37) [0xbeff67]

 12: (Pipe::fault(bool)+0x426) [0xc16256]

 13: (Pipe::reader()+0x3f2) [0xc1d752]

 14: (Pipe::Reader::entry()+0xd) [0xc2880d]

 15: (()+0x7474) [0x7fa54c70b474]

 16: (clone()+0x6d) [0x7fa54ac01acd]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.


--- begin dump of recent events ---

     0> 2016-05-10 03:16:30.721888 7fa5166dc700 -1 *** Caught signal
(Aborted) **

 in thread 7fa5166dc700


 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)

 1: /usr/bin/ceph-osd() [0xa36cfd]

 2: (()+0x10ef0) [0x7fa54c714ef0]

 3: (gsignal()+0x35) [0x7fa54ab4d295]

 4: (abort()+0x16a) [0x7fa54ab4e6da]

 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fa54b4663dd]

 6: (()+0x8f196) [0x7fa54b464196]

 7: (()+0x8f1e1) [0x7fa54b4641e1]

 8: (()+0x8f3f8) [0x7fa54b4643f8]

 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x27e) [0xb3471e]

 10: (Mutex::Lock(bool)+0x105) [0xae2395]

 11: (DispatchQueue::discard_queue(unsigned long)+0x37) [0xbeff67]

 12: (Pipe::fault(bool)+0x426) [0xc16256]

 13: (Pipe::reader()+0x3f2) [0xc1d752]

 14: (Pipe::Reader::entry()+0xd) [0xc2880d]

 15: (()+0x7474) [0x7fa54c70b474]

 16: (clone()+0x6d) [0x7fa54ac01acd]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.


--- logging levels ---

   0/ 5 none

   0/ 0 lockdep

   0/ 0 context

   0/ 0 crush

   1/ 5 mds

   1/ 5 mds_balancer

   1/ 5 mds_locker

   1/ 5 mds_log

   1/ 5 mds_log_expire

   1/ 5 mds_migrator

   0/ 0 buffer

   0/ 0 timer

   0/ 1 filer

   0/ 1 striper

   0/ 1 objecter

   0/ 5 rados

   0/ 5 rbd

   0/ 5 rbd_replay

   0/ 0 journaler

   0/ 5 objectcacher

   0/ 5 client

   0/ 0 osd

   0/ 0 optracker

   0/ 0 objclass

   0/ 0 filestore

   1/ 3 keyvaluestore

   0/ 0 journal

   0/ 0 ms

   1/ 5 mon

   0/ 0 monc

   1/ 5 paxos

   0/ 0 tp

   0/ 0 auth

   1/ 5 crypto

   0/ 0 finisher

   0/ 0 heartbeatmap

   0/ 0 perfcounter

   1/ 5 rgw

   1/10 civetweb

   1/ 5 javaclient

   0/ 0 asok

   0/ 0 throttle

   0/ 0 refs

   1/ 5 xio

  -2/-2 (syslog threshold)

  -1/-1 (stderr threshold)

  max_recent     10000

  max_new         1000

  log_file /var/log/ceph/ceph-osd.11.log

--- end dump of recent events ---

Since we have been having issues with the FD usage of some OSDs (although
they never terminated), we had already set the following in the systemd
service file for the ceph-osd daemons:

LimitNOFILE=1048576

LimitNPROC=1048576

TasksMax=infinity

Only one OSD was affected, which is why I immediately checked these limits;
both were well below the configured maximum. There is also no issue with the
actual SSD it is running on: SMART reports that everything is good, the
counters are fine, etc. Also, while all of this happened, the other OSDs on
the same node kept running flawlessly, which makes me rule out a network
problem.

The OSD has been restarted and has been working flawlessly since.

I would appreciate a hint as to what might be causing this. I am not too
deep into debugging these kinds of assertions myself, but I will provide
anything I can.

Thanks,
Torsten
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com