Hi Ilya,

In the dmesg there are also a lot of libceph socket errors, which I think may have been caused by my stopping the ceph service without unmapping the rbd devices first. Here is a log of more than 10,000 lines with more info: http://jmp.sh/NcokrfT

Thanks for being willing to help.
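For next time I plan to unmap everything cleanly before stopping the service. Roughly this (the mount point /mnt/rbd1 is just an example, and the way the ceph daemons are stopped depends on how they are managed on your setup):

  # list the currently mapped rbd devices
  rbd showmapped
  # unmount any filesystem on each device, then unmap it
  umount /mnt/rbd1
  rbd unmap /dev/rbd1
  # only then stop the ceph daemons
  service ceph stop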
van
chaofa...@owtware.com

> On Jul 28, 2015, at 7:11 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
>
> On Tue, Jul 28, 2015 at 11:19 AM, van <chaofa...@owtware.com> wrote:
>> Hi, Ilya,
>>
>> Thanks for your quick reply.
>>
>> Here is the link http://ceph.com/docs/cuttlefish/faq/ , under the
>> "HOW CAN I GIVE CEPH A TRY?" section, which talks about the old kernel
>> stuff.
>>
>> By the way, what's the main reason for using kernel 4.1? Are there a
>> lot of critical bug fixes in that version besides the perf improvements?
>> I am worried kernel 4.1 is so new that it may introduce other problems.
>
> Well, I'm not sure what exactly is in 3.10.0.229, so I can't tell you
> off hand. I can think of one important memory pressure related fix
> that's probably not in there.
>
> I'm suggesting the latest stable version of 4.1 (currently 4.1.3),
> because if you hit a deadlock (remember, this is a configuration that
> is neither recommended nor guaranteed to work), it'll be easier to
> debug and fix if the fix turns out to be worth it.
>
> If 4.1 is not acceptable for you, try the latest stable version of 3.18
> (that is 3.18.19). It's an LTS kernel, so that should mitigate some of
> your concerns.
>
>> And if I'm using the librbd API, does the kernel version matter?
>
> No, not so much.
>
>> In my tests, I built a 2-node cluster, each node with only one OSD,
>> running CentOS 7.1, kernel 3.10.0.229 and ceph v0.94.2.
>> I created several rbds and ran mkfs.xfs on them to create filesystems
>> (the kernel client was running on the ceph cluster itself).
>> I performed heavy IO tests on those filesystems and found that some fio
>> processes got hung and turned into D state forever (uninterruptible
>> sleep).
>> I suspect it's the deadlock that makes the fio processes hang.
>> However, the ceph-osd daemons are still responsive, and I can operate
>> rbd via the librbd API.
>> Does this mean it's not the loopback mount deadlock that causes the fio
>> processes to hang?
>> Or is it also a deadlock phenomenon, where only one thread is blocked
>> in memory allocation while other threads can still receive API
>> requests, so the ceph-osd daemons remain responsive?
>>
>> Worth mentioning is that after I restart the ceph-osd daemon, all
>> processes in D state come back to normal.
>>
>> Below is the related kernel log:
>>
>> Jul 7 02:25:39 node0 kernel: INFO: task xfsaild/rbd1:24795 blocked for more than 120 seconds.
>> Jul 7 02:25:39 node0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Jul 7 02:25:39 node0 kernel: xfsaild/rbd1 D ffff880c2fc13680 0 24795 2 0x00000080
>> Jul 7 02:25:39 node0 kernel: ffff8801d6343d40 0000000000000046 ffff8801d6343fd8 0000000000013680
>> Jul 7 02:25:39 node0 kernel: ffff8801d6343fd8 0000000000013680 ffff880c0c0b0000 ffff880c0c0b0000
>> Jul 7 02:25:39 node0 kernel: ffff880c2fc14340 0000000000000001 0000000000000000 ffff8805bace2528
>> Jul 7 02:25:39 node0 kernel: Call Trace:
>> Jul 7 02:25:39 node0 kernel: [<ffffffff81609e39>] schedule+0x29/0x70
>> Jul 7 02:25:39 node0 kernel: [<ffffffffa03a1890>] _xfs_log_force+0x230/0x290 [xfs]
>> Jul 7 02:25:39 node0 kernel: [<ffffffff810a9620>] ? wake_up_state+0x20/0x20
>> Jul 7 02:25:39 node0 kernel: [<ffffffffa03a1916>] xfs_log_force+0x26/0x80 [xfs]
>> Jul 7 02:25:39 node0 kernel: [<ffffffffa03a6390>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
>> Jul 7 02:25:39 node0 kernel: [<ffffffffa03a64e1>] xfsaild+0x151/0x5e0 [xfs]
>> Jul 7 02:25:39 node0 kernel: [<ffffffffa03a6390>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
>> Jul 7 02:25:39 node0 kernel: [<ffffffff8109739f>] kthread+0xcf/0xe0
>> Jul 7 02:25:39 node0 kernel: [<ffffffff810972d0>] ? kthread_create_on_node+0x140/0x140
>> Jul 7 02:25:39 node0 kernel: [<ffffffff8161497c>] ret_from_fork+0x7c/0xb0
>> Jul 7 02:25:39 node0 kernel: [<ffffffff810972d0>] ? kthread_create_on_node+0x140/0x140
>> Jul 7 02:25:39 node0 kernel: INFO: task xfsaild/rbd5:2914 blocked for more than 120 seconds.
>
> Is that all there is in dmesg? Can you paste the entire dmesg?
>
> Thanks,
>
>                Ilya
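P.S. If the linked file is inconvenient, I can also regenerate the full kernel log directly and paste it; something along these lines (the output file name is just an example):

  # full kernel ring buffer with human-readable timestamps
  dmesg -T > dmesg-full.txt
  # or everything the kernel has logged via journald since boot
  journalctl -k -b --no-pager > dmesg-full.txt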