On Tue, Nov 10, 2015 at 12:06 AM, Burkhard Linke <burkhard.li...@computational.bio.uni-giessen.de> wrote:
> Hi,
>
> On 11/09/2015 04:03 PM, Gregory Farnum wrote:
>>
>> On Mon, Nov 9, 2015 at 6:57 AM, Burkhard Linke
>> <burkhard.li...@computational.bio.uni-giessen.de> wrote:
>>>
>>> Hi,
>>>
>>> On 11/09/2015 02:07 PM, Burkhard Linke wrote:
>>>>
>>>> Hi,
>>>
>>> *snipsnap*
>>>
>>>>
>>>> Cluster is running Hammer 0.94.5 on top of Ubuntu 14.04. Clients use
>>>> ceph-fuse with patches for improved page cache handling, but the problem
>>>> also occurs with the official hammer packages from download.ceph.com.
>>>
>>> I've tested the same setup with clients running kernel 4.2.5 and using
>>> the kernel cephfs client. I was not able to reproduce the problem in
>>> that setup.
>>
>> What's the workload you're running, precisely? I would not generally
>> expect multiple accesses to a sqlite database to work *well*, but
>> offhand I'm not entirely certain why it would work differently between
>> the kernel and userspace clients. (Probably something to do with the
>> timing of the shared requests and any writes happening.)
>
> Using SQLite on network filesystems is somewhat challenging, especially if
> multiple instances write to the database. The reproducible test case does
> not write to the database at all; it simply extracts the table structure
> from the default database. The applications themselves only read from the
> database and do not modify anything. The underlying SQLite library may
> attempt to use locking to protect certain operations. According to dmesg,
> the processes are blocked within fuse calls:
>
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.543966] INFO: task ceph-fuse:6298 blocked for more than 120 seconds.
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544014]       Not tainted 4.2.5-040205-generic #201510270124
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544054] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544119] ceph-fuse       D ffff881fbf8d64c0     0  6298   3262 0x00000100
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544125] ffff881f9768f838 0000000000000086 ffff883fb2d83700 ffff881f97b38dc0
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544130] 0000000000001000 ffff881f97690000 ffff881fbf8d64c0 7fffffffffffffff
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544134] 0000000000000002 ffffffff817dc300 ffff881f9768f858 ffffffff817dbb07
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544138] Call Trace:
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544147]  [<ffffffff817dc300>] ? bit_wait+0x50/0x50
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544156]  [<ffffffff817deba9>] schedule_timeout+0x189/0x250
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544166]  [<ffffffff817dc300>] ? bit_wait+0x50/0x50
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544176]  [<ffffffff810bcb64>] ? prepare_to_wait_exclusive+0x54/0x80
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544185]  [<ffffffff817dc0bb>] __wait_on_bit_lock+0x4b/0xa0
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544195]  [<ffffffff810bd0e0>] ? autoremove_wake_function+0x40/0x40
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544205]  [<ffffffff8106d962>] ? get_user_pages_fast+0x112/0x190
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544213]  [<ffffffff812173df>] ? ilookup5_nowait+0x6f/0x90
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544222]  [<ffffffff812f922d>] fuse_notify+0x14d/0x830
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544230]  [<ffffffff812f85d4>] ? fuse_copy_do+0x84/0xf0
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544239]  [<ffffffff810a4f7d>] ? ttwu_do_activate.constprop.89+0x5d/0x70
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544248]  [<ffffffff811fc0dc>] do_iter_readv_writev+0x6c/0xa0
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544257]  [<ffffffff811bc9d8>] ? mprotect_fixup+0x148/0x230
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544264]  [<ffffffff811fdae9>] SyS_writev+0x59/0xf0
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672548]       Not tainted 4.2.5-040205-generic #201510270124
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672654] ceph-fuse       D ffff881fbf8d64c0     0  6298   3262 0x00000100
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672665] 0000000000001000 ffff881f97690000 ffff881fbf8d64c0 7fffffffffffffff
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672673] Call Trace:
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672687]  [<ffffffff817dbb07>] schedule+0x37/0x80
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672698]  [<ffffffff8101dcd9>] ? read_tsc+0x9/0x10
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672707]  [<ffffffff817db114>] io_schedule_timeout+0xa4/0x110
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672717]  [<ffffffff817dc335>] bit_wait_io+0x35/0x50
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672726]  [<ffffffff8118186b>] __lock_page+0xbb/0xe0
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672736]  [<ffffffff811934cc>] invalidate_inode_pages2_range+0x22c/0x460
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672745]  [<ffffffff81304a80>] ? fuse_init_file_inode+0x30/0x30
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672753]  [<ffffffff813068a6>] fuse_reverse_inval_inode+0x66/0x90
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672761]  [<ffffffff813c8e12>] ? iov_iter_get_pages+0xa2/0x220
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672770]  [<ffffffff812f9f0d>] fuse_dev_do_write+0x22d/0x380
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672779]  [<ffffffff812fa41b>] fuse_dev_write+0x5b/0x80
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672786]  [<ffffffff811fcc66>] do_readv_writev+0x196/0x250
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672796]  [<ffffffff811fcda9>] vfs_writev+0x39/0x50
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672803]  [<ffffffff817dfb72>] entry_SYSCALL_64_fastpath+0x16/0x75
>
>
> The fact that the kernel client is working so far may be timing related.
> I've also done test runs on the cluster with 20 instances of the
> application and a small dataset running in parallel, without any problems
> so far.
>

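For reference, the read-only access pattern described above boils down to something like the minimal sketch below (sqlite3 C API). The mount path and the schema query are illustrative assumptions, not taken from the report; the point is that even a pure read makes SQLite take a SHARED advisory lock via fcntl() for each read transaction, so the workload still drives lock and page-cache traffic through ceph-fuse:

// repro.cc -- minimal read-only reproducer sketch; run several copies in
// parallel against the same database file on the ceph-fuse mount.
// NOTE: /ceph/mnt/test.db is a hypothetical path.
#include <sqlite3.h>
#include <cstdio>

int main() {
  sqlite3 *db = nullptr;
  // Read-only open; SQLite still takes a SHARED fcntl() lock on the file
  // for the duration of each read transaction.
  if (sqlite3_open_v2("/ceph/mnt/test.db", &db, SQLITE_OPEN_READONLY,
                      nullptr) != SQLITE_OK) {
    std::fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
    return 1;
  }

  // Extract the table structure, as the test case described above does.
  sqlite3_stmt *stmt = nullptr;
  const char *sql = "SELECT name, sql FROM sqlite_master WHERE type = 'table'";
  if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) == SQLITE_OK) {
    while (sqlite3_step(stmt) == SQLITE_ROW)
      std::printf("%s: %s\n",
                  reinterpret_cast<const char *>(sqlite3_column_text(stmt, 0)),
                  reinterpret_cast<const char *>(sqlite3_column_text(stmt, 1)));
    sqlite3_finalize(stmt);
  }
  sqlite3_close(db);
  return 0;
}

Built with "g++ repro.cc -lsqlite3" and run in roughly 20 concurrent copies, this should approximate the failing workload.
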
It seems the hang is related to the async invalidate path: the second trace
shows ceph-fuse itself blocked in fuse_reverse_inval_inode() while writing
the invalidation to /dev/fuse. Please try the following patch:
---
diff --git a/src/client/Client.cc b/src/client/Client.cc
index 0d85db2..afbb896 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -3151,8 +3151,6 @@ void Client::_async_invalidate(Inode *in, int64_t off, int64_t len, bool keep_ca
   ino_invalidate_cb(callback_handle, in->vino(), off, len);
 
   client_lock.Lock();
-  if (!keep_caps)
-    check_caps(in, false);
   put_inode(in);
   client_lock.Unlock();
   ldout(cct, 10) << "_async_invalidate " << off << "~" << len << (keep_caps ? " keep_caps" : "") << " done" << dendl;
@@ -3163,7 +3161,7 @@ void Client::_schedule_invalidate_callback(Inode *in, int64_t off, int64_t len,
   if (ino_invalidate_cb)
     // we queue the invalidate, which calls the callback and decrements the ref
     async_ino_invalidator.queue(new C_Client_CacheInvalidate(this, in, off, len, keep_caps));
-  else if (!keep_caps)
+  if (!keep_caps)
     check_caps(in, false);
 }
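
With this change the async invalidator thread only runs the callback and drops the inode reference; check_caps() is now always called synchronously from _schedule_invalidate_callback(), whether or not an invalidate was queued, instead of from the invalidator thread. That should keep cap handling out of the path the kernel is blocked on in the second trace above.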

> Best regards,
> Burkhard
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com