Hello,

We recently had an outage on our Ceph storage cluster caused by what I believe 
to be a bug in Ceph. At the time of the incident all MONs, OSDs and clients 
(except for one) were running Ceph Hammer 0.94.6.

To start describing the incident, here is the hierarchy of the RBD 
volumes/snapshots/clones involved:
root:   volumes/volume-35429025-844f-42bb-8fb1-7071cabf0e9a
child:  volumes/volume-35429025-844f-42bb-8fb1-7071cabf0e9a@snapshot-37cfc5c6-312c-45f1-a586-0398b04a2e35
child:  volumes/volume-02557bbd-d513-4c7c-9035-73dac6704a93

This is a long-lived cluster, so I am not sure which version the above 
construction of volumes was created under. I believe we started out on Firefly 
at some point back in time.
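
For reference, a chain like the one above is typically produced by something 
along these lines; this is only an illustrative sketch, and I cannot say 
whether this is exactly how ours was created:

rbd -p volumes snap create volume-35429025-844f-42bb-8fb1-7071cabf0e9a@snapshot-37cfc5c6-312c-45f1-a586-0398b04a2e35
rbd -p volumes snap protect volume-35429025-844f-42bb-8fb1-7071cabf0e9a@snapshot-37cfc5c6-312c-45f1-a586-0398b04a2e35
rbd clone volumes/volume-35429025-844f-42bb-8fb1-7071cabf0e9a@snapshot-37cfc5c6-312c-45f1-a586-0398b04a2e35 volumes/volume-02557bbd-d513-4c7c-9035-73dac6704a93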

At approximately 09:22 a request was issued to delete 
snapshot-37cfc5c6-312c-45f1-a586-0398b04a2e35.
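
As far as I understand, removing a snapshot that still has a protected clone 
should not be possible without unprotecting it first, and unprotecting should 
fail while children exist. Something like the following is what I would have 
expected to be necessary (illustrative only; I do not know the exact sequence 
the client issued):

# should list volumes/volume-02557bbd-d513-4c7c-9035-73dac6704a93
rbd children volumes/volume-35429025-844f-42bb-8fb1-7071cabf0e9a@snapshot-37cfc5c6-312c-45f1-a586-0398b04a2e35
# should be refused as long as the child above exists
rbd snap unprotect volumes/volume-35429025-844f-42bb-8fb1-7071cabf0e9a@snapshot-37cfc5c6-312c-45f1-a586-0398b04a2e35
rbd snap rm volumes/volume-35429025-844f-42bb-8fb1-7071cabf0e9a@snapshot-37cfc5c6-312c-45f1-a586-0398b04a2e35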

Some time after this, a «rbd -p volumes disk-usage» command was also issued 
from a 10.2.0 client. I don't know if it is relevant, but it seemed worth 
noting.

A short while later, these messages occurred in the logs:
2016-05-16 09:27:47.858897 osd.38 [xxx]:6805/2576 995 : cluster [ERR] osd.38: 
inconsistent clone_overlap found for oid 
46b786d5/rbd_data.26b3cb5fbccda4.0000000000000185/head//29 clone 0
2016-05-16 09:27:46.011693 osd.48 [xxx]:6801/2362 697 : cluster [ERR] osd.48: 
inconsistent clone_overlap found for oid 
9e88fdc5/rbd_data.26b3cb5fbccda4.0000000000000009/head//29 clone 0
2016-05-16 09:27:48.280009 osd.48 [xxx]:6801/2362 698 : cluster [ERR] osd.48: 
inconsistent clone_overlap found for oid 
81532b89/rbd_data.26b3cb5fbccda4.00000000000001ce/head//29 clone 0
2016-05-16 09:27:49.706609 osd.44 [xxx]:6801/2434 613 : cluster [ERR] osd.44: 
inconsistent clone_overlap found for oid 
bddaa5a1/rbd_data.26b3cb5fbccda4.0000000000000304/head//29 clone 0
2016-05-16 09:27:48.999400 osd.37 [xxx]:6803/2538 902 : cluster [ERR] osd.37: 
inconsistent clone_overlap found for oid 
ca18f5ca/rbd_data.26b3cb5fbccda4.000000000000026b/head//29 clone 0
2016-05-16 09:27:49.014679 osd.36 [xxx]:6807/2598 1015 : cluster [ERR] osd.36: 
inconsistent clone_overlap found for oid 
eca5f5dc/rbd_data.26b3cb5fbccda4.000000000000026c/head//29 clone 0
2016-05-16 09:27:49.235251 osd.36 [xxx]:6807/2598 1016 : cluster [ERR] osd.36: 
inconsistent clone_overlap found for oid 
9fb5b7dc/rbd_data.26b3cb5fbccda4.00000000000002a1/head//29 clone 0
2016-05-16 09:27:49.526915 osd.50 [xxx]:6805/2410 693 : cluster [ERR] osd.50: 
inconsistent clone_overlap found for oid 
ab5eff09/rbd_data.26b3cb5fbccda4.00000000000002c6/head//29 clone 0
2016-05-16 09:27:50.336825 osd.36 [xxx]:6807/2598 1017 : cluster [ERR] osd.36: 
inconsistent clone_overlap found for oid 
d7922b2b/rbd_data.26b3cb5fbccda4.0000000000000392/head//29 clone 0
2016-05-16 09:27:50.037706 osd.38 [xxx]:6805/2576 996 : cluster [ERR] osd.38: 
inconsistent clone_overlap found for oid 
d70156d5/rbd_data.26b3cb5fbccda4.000000000000034f/head//29 clone 0
2016-05-16 09:27:51.875372 osd.44 [xxx]:6801/2434 614 : cluster [ERR] osd.44: 
inconsistent clone_overlap found for oid 
e839050d/rbd_data.26b3cb5fbccda4.00000000000004f0/head//29 clone 0
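
In case it is useful for debugging: the snapset/clone overlap of the objects 
named in these errors can presumably be inspected with something like the 
following (assuming the objects live in the volumes pool; object name taken 
from the first log line above):

rados -p volumes listsnaps rbd_data.26b3cb5fbccda4.0000000000000185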

At around 10:00 OSDs started to crash. After a long stretch of debugging we 
figured out that the OSD crashes were tied to write operations on the volume at 
the bottom of the ancestry outlined above.

The tricky thing is that while we were working this out, the crashing OSDs 
caused a ripple effect throughout the cluster: when the first OSD was 
eventually marked down+out, the next one to handle the requests would start 
crashing, and so on. This had a devastating effect on our cluster, and we were 
effectively more or less down for 12-17 hours while trying to figure out the 
root cause of the problem.

During the search for the root cause we first upgraded the cluster to Hammer 
0.94.7 and then to Jewel 10.2.1, so all of these versions are affected by the 
issue.

There are three separate problems here:
1) Why was the attempt to delete the snapshot successful(?), and why did it 
cause the error messages above?
2) Would it be useful to attempt to detect requests that crash OSDs and at some 
point deny them?
3) We still have the «affected» volume in our cluster and cannot do anything 
with it. Attempts to delete or otherwise modify it still cause OSDs to crash, 
and we need some way of removing it (see the sketch after this list).
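
For completeness, the normal way to get rid of the clone at the bottom of the 
chain would be something along the lines of the sketch below, but in our case 
any such write or delete operation on it is exactly what triggers the OSD 
crashes:

# detach the clone from its parent, then remove it
rbd flatten volumes/volume-02557bbd-d513-4c7c-9035-73dac6704a93
rbd rm volumes/volume-02557bbd-d513-4c7c-9035-73dac6704a93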

Crashdump caught with gdb from one of the OSDs:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fc2e4f9f700 (LWP 10954)]
0x00007fc30d45aaaa in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) bt
#0  0x00007fc30d45aaaa in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x0000000000889e99 in operator-- (this=<synthetic pointer>) at 
/usr/include/c++/4.8/bits/stl_tree.h:204
#2  operator* (this=<optimized out>) at 
/usr/include/c++/4.8/bits/stl_iterator.h:163
#3  operator-> (this=<optimized out>) at 
/usr/include/c++/4.8/bits/stl_iterator.h:173
#4  ReplicatedPG::make_writeable (this=this@entry=0x1926a400, 
ctx=ctx@entry=0x40c7800) at osd/ReplicatedPG.cc:5508
#5  0x000000000088b973 in ReplicatedPG::prepare_transaction 
(this=this@entry=0x1926a400, ctx=ctx@entry=0x40c7800) at 
osd/ReplicatedPG.cc:5703
#6  0x000000000088bfba in ReplicatedPG::execute_ctx 
(this=this@entry=0x1926a400, ctx=ctx@entry=0x40c7800) at 
osd/ReplicatedPG.cc:2269
#7  0x0000000000891b99 in ReplicatedPG::do_op (this=0x1926a400, 
op=std::tr1::shared_ptr (count 4, weak 0) 0x2762af00) at 
osd/ReplicatedPG.cc:1752
#8  0x000000000082b30a in ReplicatedPG::do_request (this=0x1926a400, 
op=std::tr1::shared_ptr (count 4, weak 0) 0x2762af00, handle=...)
    at osd/ReplicatedPG.cc:1296
#9  0x0000000000697e35 in OSD::dequeue_op (this=0x41ee000, pg=..., 
op=std::tr1::shared_ptr (count 4, weak 0) 0x2762af00, handle=...) at 
osd/OSD.cc:8503
#10 0x0000000000698388 in OSD::ShardedOpWQ::_process (this=0x41ef028, 
thread_index=<optimized out>, hb=<optimized out>) at osd/OSD.cc:8397
#11 0x0000000000ba1975 in ShardedThreadPool::shardedthreadpool_worker 
(this=0x41ee640, thread_index=<optimized out>) at common/WorkQueue.cc:340
#12 0x0000000000ba3a90 in ShardedThreadPool::WorkThreadSharded::entry 
(this=<optimized out>) at common/WorkQueue.h:627
#13 0x00007fc30e694182 in start_thread (arg=0x7fc2e4f9f700) at 
pthread_create.c:312
#14 0x00007fc30cbff47d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) frame 8
#8  0x000000000082b30a in ReplicatedPG::do_request (this=0x1926a400, 
op=std::tr1::shared_ptr (count 4, weak 0) 0x2762af00, handle=...)
    at osd/ReplicatedPG.cc:1296
1296    osd/ReplicatedPG.cc: No such file or directory.
(gdb) print *this

You can find the output of print *this here: 
https://gist.github.com/fnordahl/e29321a1b027a89b56080540348532d7


PS:
I attempted to create a tracker issue for this, but my logon currently does not 
work and the password recovery procedure also seems to be broken.

--
Frode Nordahl