Hi Sage/Sam,

I faced a crash in OSD with latest Ceph master. Here is the log trace for the 
same.

ceph version 0.85-677-gd5777c4 (d5777c421548e7f039bb2c77cb0df2e9c7404723)
1: ceph-osd() [0x990def]
2: (()+0xfbb0) [0x7f72ae6e6bb0]
3: (gsignal()+0x37) [0x7f72acc08f77]
4: (abort()+0x148) [0x7f72acc0c5e8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f72ad5146e5]
6: (()+0x5e856) [0x7f72ad512856]
7: (()+0x5e883) [0x7f72ad512883]
8: (()+0x5eaae) [0x7f72ad512aae]
9: (ceph::buffer::list::substr_of(ceph::buffer::list const&, unsigned int, 
unsigned int)+0x277) [0xa88747]
10: (ceph::buffer::list::write(int, int, std::ostream&) const+0x81) [0xa89541]
11: (operator<<(std::ostream&, OSDOp const&)+0x1f6) [0x717a16]
12: (MOSDOp::print(std::ostream&) const+0x172) [0x6e5e32]
13: (TrackedOp::dump(utime_t, ceph::Formatter*) const+0x223) [0x6b6483]
14: (OpTracker::dump_ops_in_flight(ceph::Formatter*)+0xa7) [0x6b7057]
15: (OSD::asok_command(std::string, std::map<std::string, 
boost::variant<std::string, bool, long, double, std::vector<std::string, 
std::allocator<std::string> >, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_>, 
std::less<std::string>, std::allocator<std::pair<std::string const, 
boost::variant<std::string, bool, long, double, std::vector<std::string, 
std::allocator<std::string> >, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_> > > >&, 
std::string, std::ostream&)+0x1d7) [0x612cb7]
16: (OSDSocketHook::call(std::string, std::map<std::string, 
boost::variant<std::string, bool, long, double, std::vector<std::string, 
std::allocator<std::string> >, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_>, 
std::less<std::string>, std::allocator<std::pair<std::string const, 
boost::variant<std::string, bool, long, double, std::vector<std::string, 
std::allocator<std::string> >, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_, 
boost::detail::variant::void_, boost::detail::variant::void_> > > >&, 
std::string, ceph::buffer::list&)+0x67) [0x67c8b7]
17: (AdminSocket::do_accept()+0x1007) [0xa79817]
18: (AdminSocket::entry()+0x258) [0xa7b448]
19: (()+0x7f6e) [0x7f72ae6def6e]
20: (clone()+0x6d) [0x7f72acccc9cd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.

Steps to reproduce:
-----------------------


1.       Run ios

2.       While ios running , run the following command continuously.



"ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight"


3.       At some point the osd will be crashed.

I think I have root caused it..


1.       OpTracker::RemoveOnDelete::operator() is calling op->_unregistered() 
which clears out message->data() and payload

2.       After that, if optracking is enabled we are calling 
unregister_inflight_op() which removed the op from the xlist.

3.       Now, while dumping ops, we are calling _dump_op_descriptor_unlocked() 
from TrackedOP::dump, which tries to print the message.

4.       So, there is a race condition when it tries to print the message whoes 
ops (data) field is already cleared.

Fix could be, call this op->_unregistered (in case optracking is enabled) after 
it is removed from xlist.

With this fix, I am not getting the crash anymore.

If my observation is correct, please let me know. I will raise a bug and will 
fix that as part of the overall optracker performance improvement (I will 
submit that pull request soon).

Thanks & Regards
Somnath

________________________________

PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to