> On Jul 29, 2015, at 12:40 AM, Ilya Dryomov <idryo...@gmail.com> wrote:
> 
> On Tue, Jul 28, 2015 at 7:20 PM, van <chaofa...@owtware.com> wrote:
>> 
>>> On Jul 28, 2015, at 7:57 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
>>> 
>>> On Tue, Jul 28, 2015 at 2:46 PM, van <chaofa...@owtware.com> wrote:
>>>> Hi, Ilya,
>>>> 
>>>> In the dmesg, there are also a lot of libceph socket errors, which I
>>>> think may be caused by my stopping the ceph service without unmapping rbd.
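>>>> 
>>>> For the record, a teardown order that avoids those socket errors would be
>>>> to unmap the device before stopping the daemons (the device path and the
>>>> sysvinit-style command below are just examples for my setup):
>>>> 
>>>>   # unmap the rbd device first, so libceph closes its OSD sessions cleanly
>>>>   rbd unmap /dev/rbd0
>>>>   # then stop the ceph daemons (use systemctl on systemd-based distros)
>>>>   service ceph stop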
>>> 
>>> Well, sure enough, if you kill all OSDs, the filesystem mounted on top
>>> of the rbd device will get stuck.
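>>> 
>>> A quick way to confirm that is to look at the kernel client's in-flight
>>> requests via debugfs (assuming debugfs is mounted at /sys/kernel/debug);
>>> requests that never leave the list are stuck:
>>> 
>>>   # in-flight osd requests of the libceph/rbd kernel client
>>>   cat /sys/kernel/debug/ceph/*/osdc
>>>   # hung-task warnings from processes blocked on the stuck filesystem
>>>   dmesg | grep "blocked for more than"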
>> 
>> Sure, it will get stuck if the OSDs are stopped. And since rados requests 
>> have a retry policy, the stuck requests will recover after I start the 
>> daemons again.
>> 
>> But in my case, the OSDs are running in a normal state and the librbd API 
>> can read/write normally.
>> Meanwhile, a heavy fio test on the filesystem mounted on top of the rbd 
>> device will get stuck.
>> 
>> I wonder if this phenomenon is triggered by running the rbd kernel client 
>> on machines that also run ceph daemons, i.e. the annoying loopback mount 
>> deadlock issue.
>> 
>> In my opinion, if it’s due to the loopback mount deadlock, the OSDs will 
>> become unresponsive, no matter whether the requests come from user space 
>> (like the API) or from the kernel client.
>> Am I right?
> 
> Not necessarily.
> 
>> 
>> If so, my case seems to be triggered by another bug.
>> 
>> Anyway, it seems that I should separate client and daemons at least.
> 
> Try 3.18.19 if you can.  I'd be interested in your results.

It’s strange: after I dropped the page cache and restarted my OSDs, the same 
heavy IO tests on the rbd folder now work fine.
The deadlock doesn’t seem that easy to trigger. Maybe I need longer tests.
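
In case it helps reproduce this, the sequence I’m running is roughly the
following (the mount point and fio job parameters are my own choices, not
anything prescribed):

  # drop clean page caches so reads hit the OSDs again (needs root)
  sync && echo 3 > /proc/sys/vm/drop_caches
  # longer random-write stress on the filesystem on top of the rbd device
  fio --name=rbd-stress --directory=/mnt/rbd --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=4k --size=2G --numjobs=4 \
      --time_based --runtime=3600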

I’ll try 3.18.19 LTS, thanks.

> 
> Thanks,
> 
>                Ilya

