> On Jul 29, 2015, at 12:40 AM, Ilya Dryomov <idryo...@gmail.com> wrote:
>
> On Tue, Jul 28, 2015 at 7:20 PM, van <chaofa...@owtware.com> wrote:
>>
>>> On Jul 28, 2015, at 7:57 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
>>>
>>> On Tue, Jul 28, 2015 at 2:46 PM, van <chaofa...@owtware.com> wrote:
>>>> Hi, Ilya,
>>>>
>>>> In the dmesg there are also a lot of libceph socket errors, which I
>>>> think may be caused by my stopping the ceph service without unmapping
>>>> the rbd device.
>>>
>>> Well, sure enough, if you kill all OSDs, the filesystem mounted on top
>>> of the rbd device will get stuck.
>>
>> Sure, it will get stuck if the OSDs are stopped. And since rados requests
>> have a retry policy, the stuck requests will recover after I start the
>> daemons again.
>>
>> But in my case the OSDs are running in a normal state and the librbd API
>> can read/write normally. Meanwhile, a heavy fio test on the filesystem
>> mounted on top of the rbd device will get stuck.
>>
>> I wonder if this is triggered by running the rbd kernel client on
>> machines that also run ceph daemons, i.e. the annoying loopback mount
>> deadlock issue.
>>
>> In my opinion, if it were due to the loopback mount deadlock, the OSDs
>> would become unresponsive, no matter whether the requests come from user
>> space (like the API) or from the kernel client. Am I right?
>
> Not necessarily.
>
>>
>> If so, my case seems to be triggered by another bug.
>>
>> Anyway, it seems that I should separate clients and daemons at least.
>
> Try 3.18.19 if you can. I'd be interested in your results.
It’s strange: after I drop the page cache and restart my OSDs, the same
heavy IO tests on the rbd-backed folder now work fine. The deadlock seems
not that easy to trigger; maybe I need longer tests (the sequence I’m
running is sketched at the end of this mail). I’ll try the 3.18.19 LTS
kernel, thanks.

>
> Thanks,
>
>                Ilya
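For reference, the test sequence is roughly the following. The mount point,
the OSD restart command and the fio parameters are only illustrative, not
the exact job I ran:

  # flush dirty data and drop pagecache/dentries/inodes (as root)
  sync
  echo 3 > /proc/sys/vm/drop_caches

  # restart the local OSDs (the command depends on the init system in use)
  /etc/init.d/ceph restart osd

  # heavy IO against the filesystem mounted on top of the rbd device
  fio --name=rbd-fs-stress --directory=/mnt/rbd0 \
      --ioengine=libaio --direct=1 --rw=randwrite \
      --bs=4k --iodepth=32 --numjobs=4 --size=1G \
      --runtime=600 --time_based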