[ceph-users] OSD process doesn't die immediately after device disappears

Marcel Lauhoff Tue, 17 May 2016 05:59:53 -0700

Hi,

We recently played the good ol' pull-a-hard-drive game and wondered why
the OSD processes took a couple of minutes to recognize their misfortune.


In our configuration two OSDs share an HDD:
  OSD n uses it as its journal device,
  OSD n+1 uses it for its filesystem.

We expected the OSDs to detect this kind of failure and shut down
immediately, so that transactions aren't blocked and recovery can start
as soon as possible.

What do you think?


I read through the FileStore code about a year ago and can't remember
any code that somehow subscribes to events of the underlying devices.

Does anyone use external watchdog tools for this type of failure?
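
To make the watchdog question concrete, here is a minimal sketch of what
such an external check could look like. It assumes that the device node
disappears from /dev once the drive is pulled (which depends on udev and
the controller), and the OSD names and device paths are hypothetical.
It only reports affected OSDs; how to stop them is left open.

```python
import os
import stat


def device_present(path):
    """Return True if path exists and is a block device node."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Node is gone (or unreadable): treat the device as missing.
        return False
    return stat.S_ISBLK(mode)


def osds_with_missing_devices(osd_devices):
    """Given {osd_name: [device paths]}, return OSDs missing any device."""
    return sorted(
        osd
        for osd, paths in osd_devices.items()
        if not all(device_present(p) for p in paths)
    )


if __name__ == "__main__":
    # Hypothetical layout: one HDD shared as journal and filesystem.
    layout = {
        "osd.1": ["/dev/sdx1"],  # journal partition (assumed path)
        "osd.2": ["/dev/sdx2"],  # filesystem partition (assumed path)
    }
    print(osds_with_missing_devices(layout))
```

A check like this would have to run periodically, and it cannot catch a
drive that is still enumerated but no longer answers I/O.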



~irq0


The last messages of the two OSD daemons:

2016-04-27 14:57:25.613408 7f1b9ed10700 -1 journal aio to 0~4096 wrote 18446744073709551611
2016-04-27 14:57:25.642669 7f1b9ed10700 -1 os/FileJournal.cc: In function 'void FileJournal::write_finish_thread_entry()' thread 7f1b9ed10700 time 2016-04-27 14:57:25.613475
os/FileJournal.cc: 1426: FAILED assert(0 == "unexpected aio error")

2016-04-27 14:57:22.534578 7f0e0c6a5700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f0e0c6a5700 time 2016-04-27 14:57:22.489978
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")
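
The "wrote 18446744073709551611" in the journal message looks like a
negative return value printed as an unsigned 64-bit integer. The sketch
below reinterprets it as a signed value and looks up the errno name.
That the value is a negated errno is our assumption, not something the
log states.

```python
import errno
import os


def to_signed(value, bits=64):
    """Reinterpret an unsigned integer as two's-complement signed."""
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


raw = 18446744073709551611  # value from the journal log line
signed = to_signed(raw)
# Assumption: a negative result is a negated errno code.
name = errno.errorcode.get(-signed, "unknown")
print(signed, name, os.strerror(-signed))
```

On Linux this gives -5, i.e. EIO, which would fit an I/O error from the
removed drive.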

--
Marcel Lauhoff
Mail: lauh...@uni-mainz.de
XMPP: mlauh...@jabber.uni-mainz.de
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
