Hi, Ilya,
  
  In the dmesg, there are also a lot of libceph socket errors, which I think may 
have been caused by stopping the ceph service without unmapping the rbd devices first.
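  (Something like the following should unmap everything before the service is 
stopped next time; just a rough sketch wrapping the rbd CLI, and the JSON layout 
of “rbd showmapped” may differ between releases.)

import json
import subprocess

# Unmap every currently mapped krbd device before ceph is stopped, so
# libceph is not left retrying dead sockets afterwards.
mapped = json.loads(subprocess.check_output(
    ["rbd", "showmapped", "--format", "json"]).decode())
# Older releases return a dict keyed by device id, newer ones a list.
entries = mapped.values() if isinstance(mapped, dict) else mapped
for entry in entries:
    subprocess.check_call(["rbd", "unmap", entry["device"]])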
  
  Here is a log of more than 10,000 lines that contains more info: http://jmp.sh/NcokrfT 
  
  Thanks for being willing to help.

van
chaofa...@owtware.com



> On Jul 28, 2015, at 7:11 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
> 
> On Tue, Jul 28, 2015 at 11:19 AM, van <chaofa...@owtware.com> wrote:
>> Hi, Ilya,
>> 
>>  Thanks for your quick reply.
>> 
>>  Here is the link http://ceph.com/docs/cuttlefish/faq/ , under the "HOW
>> CAN I GIVE CEPH A TRY?" section, which talks about the old kernel stuff.
>> 
>>  By the way, what’s the main reason for using kernel 4.1? Are there a lot of
>> critical bugs fixed in that version besides the perf improvements?
>>  I am worried that kernel 4.1 is too new and may introduce other problems.
> 
> Well, I'm not sure what exactly is in 3.10.0.229, so I can't tell you
> off hand.  I can think of one important memory pressure related fix
> that's probably not in there.
> 
> I'm suggesting the latest stable version of 4.1 (currently 4.1.3),
> because if you hit a deadlock (remember, this is a configuration that
> is neither recommended nor guaranteed to work), it'll be easier to
> debug and fix if the fix turns out to be worth it.
> 
> If 4.1 is not acceptable for you, try the latest stable version of 3.18
> (that is 3.18.19).  It's an LTS kernel, so that should mitigate some of
> your concerns.
> 
>>  And if I’m using the librbd API, does the kernel version matter?
> 
> No, not so much.
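> 
> For example, with the python bindings you can drive an image entirely in
> userspace, nothing mapped through the krbd module (just a sketch; the pool
> and image names here are made up):
> 
> import rados
> import rbd
> 
> # Talk to the cluster through librados/librbd only; no kernel mapping,
> # so the loopback-mount concern doesn't apply.
> cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
> cluster.connect()
> try:
>     ioctx = cluster.open_ioctx("rbd")                  # pool name: assumption
>     try:
>         rbd.RBD().create(ioctx, "testimg", 1 << 30)    # 1 GiB test image
>         image = rbd.Image(ioctx, "testimg")
>         try:
>             image.write(b"x" * 4096, 0)                # 4 KiB write at offset 0
>             print(image.read(0, 4096) == b"x" * 4096)  # read it back
>         finally:
>             image.close()
>     finally:
>         ioctx.close()
> finally:
>     cluster.shutdown()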
> 
>> 
>>  In my tests, I built a 2-node cluster, each node with only one OSD, running
>> CentOS 7.1, kernel version 3.10.0.229 and ceph v0.94.2.
>>  I created several rbds and ran mkfs.xfs on them to create filesystems
>> (the kernel client was running on the ceph cluster itself).
>>  I performed heavy IO tests on those filesystems and found that some fio
>> processes got hung and stayed in D state forever (uninterruptible sleep).
>>  I suspect it’s the deadlock that makes the fio processes hang.
>>  However, the ceph-osd daemons are still responsive, and I can operate rbds
>> via the librbd API.
>>  Does this mean it’s not the loopback mount deadlock that caused the fio
>> processes to hang?
>>  Or is it also a deadlock phenomenon, where only one thread is blocked in
>> memory allocation while other threads can still receive API requests, so
>> the ceph-osd daemons remain responsive?
>> 
>>  What’s worth mentioning is that after I restart the ceph-osd daemon, all
>> processes in D state come back to normal.
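>> 
>>  In case it helps, here is a rough sketch of how the stuck tasks can be
>> listed from /proc (generic code, nothing ceph-specific; reading
>> /proc/<pid>/stack needs root):
>> 
>> import os
>> 
>> # Walk /proc and report tasks in uninterruptible sleep (state "D"),
>> # printing their kernel stacks when we are allowed to read them.
>> for pid in filter(str.isdigit, os.listdir("/proc")):
>>     try:
>>         with open("/proc/%s/stat" % pid) as f:
>>             fields = f.read().split()
>>         comm, state = fields[1], fields[2]
>>         if state != "D":
>>             continue
>>         print("%s %s is in D state" % (pid, comm))
>>         with open("/proc/%s/stack" % pid) as f:  # needs root
>>             print(f.read())
>>     except (IOError, OSError):
>>         pass  # task exited or permission denied; skip it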
>> 
>>  Below is the related kernel log:
>> 
>> Jul  7 02:25:39 node0 kernel: INFO: task xfsaild/rbd1:24795 blocked for more
>> than 120 seconds.
>> Jul  7 02:25:39 node0 kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Jul  7 02:25:39 node0 kernel: xfsaild/rbd1    D ffff880c2fc13680     0 24795
>> 2 0x00000080
>> Jul  7 02:25:39 node0 kernel: ffff8801d6343d40 0000000000000046
>> ffff8801d6343fd8 0000000000013680
>> Jul  7 02:25:39 node0 kernel: ffff8801d6343fd8 0000000000013680
>> ffff880c0c0b0000 ffff880c0c0b0000
>> Jul  7 02:25:39 node0 kernel: ffff880c2fc14340 0000000000000001
>> 0000000000000000 ffff8805bace2528
>> Jul  7 02:25:39 node0 kernel: Call Trace:
>> Jul  7 02:25:39 node0 kernel: [<ffffffff81609e39>] schedule+0x29/0x70
>> Jul  7 02:25:39 node0 kernel: [<ffffffffa03a1890>]
>> _xfs_log_force+0x230/0x290 [xfs]
>> Jul  7 02:25:39 node0 kernel: [<ffffffff810a9620>] ? wake_up_state+0x20/0x20
>> Jul  7 02:25:39 node0 kernel: [<ffffffffa03a1916>] xfs_log_force+0x26/0x80
>> [xfs]
>> Jul  7 02:25:39 node0 kernel: [<ffffffffa03a6390>] ?
>> xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
>> Jul  7 02:25:39 node0 kernel: [<ffffffffa03a64e1>] xfsaild+0x151/0x5e0 [xfs]
>> Jul  7 02:25:39 node0 kernel: [<ffffffffa03a6390>] ?
>> xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
>> Jul  7 02:25:39 node0 kernel: [<ffffffff8109739f>] kthread+0xcf/0xe0
>> Jul  7 02:25:39 node0 kernel: [<ffffffff810972d0>] ?
>> kthread_create_on_node+0x140/0x140
>> Jul  7 02:25:39 node0 kernel: [<ffffffff8161497c>] ret_from_fork+0x7c/0xb0
>> Jul  7 02:25:39 node0 kernel: [<ffffffff810972d0>] ?
>> kthread_create_on_node+0x140/0x140
>> Jul  7 02:25:39 node0 kernel: INFO: task xfsaild/rbd5:2914 blocked for more
>> than 120 seconds.
> 
> Is that all there is in dmesg?  Can you paste the entire dmesg?
> 
> Thanks,
> 
>                Ilya
