Hi Marcel,
FileStore doesn't subscribe to any such event from the device. At present, it 
relies on the filesystem to return an error during IO and asserts based on that 
error (that is the FileStore assert you see).
The FileJournal assert you are getting in the aio path relies on the Linux aio 
system to report an error.
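
As an illustration of the aio error reporting: the journal log line quoted 
below ends in "wrote 18446744073709551611". A minimal sketch, assuming that 
number is a signed result printed as an unsigned 64-bit integer (an inference 
from the value, not something confirmed from the Ceph source), decodes it back 
to a negative errno. The helper name decode_aio_result is made up for this 
example.

```python
import errno

def decode_aio_result(raw):
    """Reinterpret an unsigned 64-bit value as signed and name the errno.

    Assumption: the logged number is a signed result that was printed as
    unsigned, so values >= 2**63 are really negative error codes.
    """
    signed = raw - (1 << 64) if raw >= (1 << 63) else raw
    if signed < 0:
        # Negative results follow the usual -errno convention.
        return signed, errno.errorcode.get(-signed, "unknown")
    return signed, "ok"

# Value taken from the journal log line quoted in this thread.
print(decode_aio_result(18446744073709551611))
```

Under that assumption the value is 2^64 - 5, i.e. -5, which is EIO on Linux: 
the aio layer reported a plain I/O error for the 4096-byte journal write.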
You should get these asserts pretty quickly, not after a couple of minutes, if 
IO is going on.
Are you saying the crash timestamp is a couple of minutes after the drive was 
pulled?
BTW, if you are on Ubuntu, upstart will restart the OSDs after a crash and, 
based on some logic (too-frequent crashes), will eventually decide not to. So 
try to find the very first crash trace in the log and see when it occurred.
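
A rough sketch of how one might pull the first crash trace out of an OSD log. 
It assumes the log format shown in the messages quoted below (a timestamped 
"In function" line followed by an untimestamped "FAILED assert" line); the 
function name first_failed_assert is hypothetical, and in practice you would 
pass it the open log file of the OSD in question instead of the sample lines.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the quoted OSD log lines.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)")

def first_failed_assert(lines):
    """Return (timestamp, line) for the first 'FAILED assert', or None.

    The assert line itself carries no timestamp, so the most recent
    timestamp seen on an earlier line is reported with it.
    """
    last_ts = None
    for line in lines:
        m = TS.match(line)
        if m:
            last_ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if "FAILED assert" in line:
            return last_ts, line.strip()
    return None

# Sample lines copied from the journal OSD's log in this thread.
sample = [
    "2016-04-27 14:57:25.642669 7f1b9ed10700 -1 os/FileJournal.cc: In function "
    "'void FileJournal::write_finish_thread_entry()' thread 7f1b9ed10700 "
    "time 2016-04-27 14:57:25.613475",
    'os/FileJournal.cc: 1426: FAILED assert(0 == "unexpected aio error")',
]
print(first_failed_assert(sample))
```

Because it stops at the first match, later crashes caused by upstart restarting 
the OSD do not hide the original one.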
BTW, I hope you are aware that recovery will not kick off until a 
(configurable) grace period is over.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Marcel Lauhoff
Sent: Tuesday, May 17, 2016 5:59 AM
To: ceph-users
Subject: [ceph-users] OSD process doesn't die immediately after device disappears


Hi,

We recently played the good ol' pull-a-hard-drive game and wondered why the OSD 
processes took a couple of minutes to recognize their misfortune.

In our configuration, two OSDs share an HDD:
  OSD n uses it as its journal device,
  OSD n+1 uses it as its filesystem.

We expected the OSDs to detect this kind of failure and shut down immediately, 
so that transactions aren't blocked and recovery can start as soon as possible.

What do you think?


I read through the FileStore code about a year ago and can't remember any code 
that somehow subscribes to events of the underlying devices.

Does anyone use external watchdog tools for this type of failure?



~irq0


The last messages of the two OSD daemons:

2016-04-27 14:57:25.613408 7f1b9ed10700 -1 journal aio to 0~4096 wrote 18446744073709551611
2016-04-27 14:57:25.642669 7f1b9ed10700 -1 os/FileJournal.cc: In function 'void FileJournal::write_finish_thread_entry()' thread 7f1b9ed10700 time 2016-04-27 14:57:25.613475
os/FileJournal.cc: 1426: FAILED assert(0 == "unexpected aio error")

2016-04-27 14:57:22.534578 7f0e0c6a5700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f0e0c6a5700 time 2016-04-27 14:57:22.489978
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")

--
Marcel Lauhoff
Mail: lauh...@uni-mainz.de
XMPP: mlauh...@jabber.uni-mainz.de
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com