Re: kernel panic in skb_copy_bits

2013-07-04 Thread David Miller
From: Eric Dumazet Date: Thu, 04 Jul 2013 03:12:10 -0700 > It looks like a typical COW issue to me. Generically speaking, if we have to mess with page protections this eliminates the performance gain from bypass/zerocopy/whatever that these virtualization layers are doing. But there may be othe

Re: kernel panic in skb_copy_bits

2013-07-04 Thread Alex Bligh
--On 4 July 2013 03:12:10 -0700 Eric Dumazet wrote: It looks like a typical COW issue to me. If the page content is written while there is still a reference on this page, we should allocate a new page and copy the previous content. And this has little to do with networking. I suspect this
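The copy-on-write rule quoted above (a write to a page that still has another reference should go to a freshly allocated copy) can be sketched in userspace C. This is only an illustrative model: `model_page`, `model_page_alloc`, `model_page_get`, `model_page_put` and `model_cow_write` are hypothetical names, not kernel APIs, and a plain integer stands in for the kernel's page reference count.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical userspace stand-in for a page; not the kernel's struct page. */
struct model_page {
    int refcount;
    unsigned char data[MODEL_PAGE_SIZE];
};

/* Allocate a zeroed page holding a single reference. */
static struct model_page *model_page_alloc(void)
{
    struct model_page *p = calloc(1, sizeof(*p));

    if (p)
        p->refcount = 1;
    return p;
}

/* Take an extra reference, as a second user of the page would. */
static void model_page_get(struct model_page *p)
{
    p->refcount++;
}

/* Drop a reference; the page is freed when the last one goes away. */
static void model_page_put(struct model_page *p)
{
    assert(p->refcount > 0);
    if (--p->refcount == 0)
        free(p);
}

/*
 * Write one byte through *slot. If someone else still holds a reference,
 * copy the page first so the other holder keeps seeing the old content.
 * Returns 0 on success, -1 on a bad offset or allocation failure.
 */
static int model_cow_write(struct model_page **slot, size_t off,
                           unsigned char val)
{
    struct model_page *old = *slot;

    if (off >= MODEL_PAGE_SIZE)
        return -1;

    if (old->refcount > 1) {
        struct model_page *copy = model_page_alloc();

        if (!copy)
            return -1;
        memcpy(copy->data, old->data, MODEL_PAGE_SIZE);
        model_page_put(old);   /* the writer gives up its old reference */
        *slot = copy;
    }

    (*slot)->data[off] = val;
    return 0;
}
```

In this model a second holder that called `model_page_get()` before the write keeps the original bytes, while the writer ends up with its own private page.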

Re: kernel panic in skb_copy_bits

2013-07-04 Thread Eric Dumazet
On Thu, 2013-07-04 at 10:52 +0100, Ian Campbell wrote: > Might just be that no one has observed it with vmsplice()+splice()? Most > of the time this happens silently and you'll probably never notice, it's > just the behaviour of Xen which escalates the issue into one you can > see. The point I wa

Re: kernel panic in skb_copy_bits

2013-07-04 Thread Ian Campbell
On Thu, 2013-07-04 at 02:34 -0700, Eric Dumazet wrote: > On Thu, 2013-07-04 at 09:59 +0100, Ian Campbell wrote: > > On Thu, 2013-07-04 at 16:55 +0800, Joe Jin wrote: > > > > > > Another way is add new page flag like PG_send, when sendpage() be called, > > > set the bit, when page be put, clear the

Re: kernel panic in skb_copy_bits

2013-07-04 Thread Eric Dumazet
On Thu, 2013-07-04 at 09:59 +0100, Ian Campbell wrote: > On Thu, 2013-07-04 at 16:55 +0800, Joe Jin wrote: > > > > Another way is add new page flag like PG_send, when sendpage() be called, > > set the bit, when page be put, clear the bit. Then xen-blkback can wait > > on the pagequeue. > > These

Re: kernel panic in skb_copy_bits

2013-07-04 Thread Ian Campbell
On Thu, 2013-07-04 at 16:55 +0800, Joe Jin wrote: > On 07/01/13 16:11, Ian Campbell wrote: > > On Mon, 2013-07-01 at 11:18 +0800, Joe Jin wrote: > >>> A workaround is to turn off O_DIRECT use by Xen as that ensures > >>> the pages are copied. Xen 4.3 does this by default. > >>> > >>> I believe fixe

Re: kernel panic in skb_copy_bits

2013-07-04 Thread Joe Jin
On 07/01/13 16:11, Ian Campbell wrote: > On Mon, 2013-07-01 at 11:18 +0800, Joe Jin wrote: >>> A workaround is to turn off O_DIRECT use by Xen as that ensures >>> the pages are copied. Xen 4.3 does this by default. >>> >>> I believe fixes for this are in 4.3 and 4.2.2 if using the >>> qemu upstream

Re: kernel panic in skb_copy_bits

2013-07-01 Thread David Miller
From: Eric Dumazet Date: Fri, 28 Jun 2013 02:37:42 -0700 > [PATCH] neighbour: fix a race in neigh_destroy() > > There is a race in neighbour code, because neigh_destroy() uses > skb_queue_purge(&neigh->arp_queue) without holding neighbour lock, > while other parts of the code assume neighbour rw

Re: kernel panic in skb_copy_bits

2013-07-01 Thread Joe Jin
On 07/01/13 16:11, Ian Campbell wrote: > On Mon, 2013-07-01 at 11:18 +0800, Joe Jin wrote: >>> A workaround is to turn off O_DIRECT use by Xen as that ensures >>> the pages are copied. Xen 4.3 does this by default. >>> >>> I believe fixes for this are in 4.3 and 4.2.2 if using the >>> qemu upstream

Re: kernel panic in skb_copy_bits

2013-07-01 Thread Alex Bligh
Joe, Do you know if have a fix for above? so far we also suspected the grant page be unmapped earlier, we using 4.1 stable during our test. A true fix? No, but I posted a patch set (see later email message for a link) that you could forward port. The workaround is: A workaround is to turn of

Re: kernel panic in skb_copy_bits

2013-07-01 Thread Ian Campbell
On Mon, 2013-07-01 at 11:18 +0800, Joe Jin wrote: > > A workaround is to turn off O_DIRECT use by Xen as that ensures > > the pages are copied. Xen 4.3 does this by default. > > > > I believe fixes for this are in 4.3 and 4.2.2 if using the > > qemu upstream DM. Note these aren't real fixes, just

Re: kernel panic in skb_copy_bits

2013-06-30 Thread Joe Jin
On 06/30/13 17:13, Alex Bligh wrote: > > > --On 28 June 2013 12:17:43 +0800 Joe Jin wrote: > >> Find a similar issue >> http://www.gossamer-threads.com/lists/xen/devel/265611 So copied to Xen >> developer as well. > > I thought this sounded familiar. I haven't got the start of this > thread, b

Re: kernel panic in skb_copy_bits

2013-06-30 Thread Alex Bligh
--On 30 June 2013 10:13:35 +0100 Alex Bligh wrote: The nature of the bug is extensively discussed in that thread - you'll also find a reference to a thread on linux-nfs which concludes it isn't an nfs problem, and even some patches to fix it in the kernel adding reference counting. Some mor

Re: kernel panic in skb_copy_bits

2013-06-30 Thread Alex Bligh
--On 28 June 2013 12:17:43 +0800 Joe Jin wrote: Find a similar issue http://www.gossamer-threads.com/lists/xen/devel/265611 So copied to Xen developer as well. I thought this sounded familiar. I haven't got the start of this thread, but what version of Xen are you running and what device mo

Re: kernel panic in skb_copy_bits

2013-06-30 Thread Eric Dumazet
On Sun, 2013-06-30 at 08:26 +0800, Joe Jin wrote: > So far we suspected it caused by iscsi called sendpage(), and later page > be unmapped but still trying copy skb. We'll try to disable sg to see if > help or no. sendpage() should increment page refcounts for every page frag of an skb, therefore

Re: kernel panic in skb_copy_bits

2013-06-29 Thread Joe Jin
On 06/29/13 15:20, Eric Dumazet wrote: > On Sat, 2013-06-29 at 07:36 +0800, Joe Jin wrote: >> Hi Eric, >> >> The patch not fix the issue and panic as same as early I posted: >>> BUG: unable to handle kernel paging request at 88006d9e8d48 >>> IP: [] memcpy+0xb/0x120 >>> PGD 1798067 PUD 1fd2067 P

Re: kernel panic in skb_copy_bits

2013-06-29 Thread Ben Greear
On 06/29/2013 09:26 AM, Eric Dumazet wrote: On Sat, 2013-06-29 at 09:11 -0700, Ben Greear wrote: Do you know if your patch should go in 3.9? Yes it should. Ok, I'll add that to my tree. Your test case sounds a bit like what gives us the rare crash in tcp_collapse (we have lots of bouncin

Re: kernel panic in skb_copy_bits

2013-06-29 Thread Eric Dumazet
On Sat, 2013-06-29 at 09:11 -0700, Ben Greear wrote: > Do you know if your patch should go in 3.9? > Yes it should. > Your test case sounds a bit like what gives us the rare crash in tcp_collapse > (we have lots of bouncing wifi interfaces running slow-speed TCP traffic). > But, > it takes day

Re: kernel panic in skb_copy_bits

2013-06-29 Thread Ben Greear
On 06/29/2013 12:20 AM, Eric Dumazet wrote: On Sat, 2013-06-29 at 07:36 +0800, Joe Jin wrote: Hi Eric, The patch not fix the issue and panic as same as early I posted: BUG: unable to handle kernel paging request at 88006d9e8d48 IP: [] memcpy+0xb/0x120 PGD 1798067 PUD 1fd2067 PMD 213f067 PT

Re: kernel panic in skb_copy_bits

2013-06-29 Thread Eric Dumazet
On Sat, 2013-06-29 at 07:36 +0800, Joe Jin wrote: > Hi Eric, > > The patch not fix the issue and panic as same as early I posted: > > BUG: unable to handle kernel paging request at 88006d9e8d48 > > IP: [] memcpy+0xb/0x120 > > PGD 1798067 PUD 1fd2067 PMD 213f067 PTE 0 > > Oops: [#1] SMP >

Re: kernel panic in skb_copy_bits

2013-06-29 Thread Eric Dumazet
On Sat, 2013-06-29 at 07:36 +0800, Joe Jin wrote: > Hi Eric, > > The patch not fix the issue and panic as same as early I posted: At least it fixes my own panics ;) My test bed was : Launch 24 concurrent "netperf -t UDP_STREAM -H destination -- -m 128" Then on "destination" disconnect the eth

Re: kernel panic in skb_copy_bits

2013-06-28 Thread Joe Jin
Hi Eric, The patch not fix the issue and panic as same as early I posted: > BUG: unable to handle kernel paging request at 88006d9e8d48 > IP: [] memcpy+0xb/0x120 > PGD 1798067 PUD 1fd2067 PMD 213f067 PTE 0 > Oops: [#1] SMP > CPU 7 > Modules linked in: dm_nfs tun nfs fscache auth_rpcgss

Re: kernel panic in skb_copy_bits

2013-06-28 Thread Joe Jin
Hi Eric, Thanks for your patch, I'll test it then get back to you. Regards, Joe On 06/28/13 17:37, Eric Dumazet wrote: > OK please try the following patch > > > [PATCH] neighbour: fix a race in neigh_destroy() > > There is a race in neighbour code, because neigh_destroy() uses > skb_queue_purg

Re: kernel panic in skb_copy_bits

2013-06-28 Thread Eric Dumazet
OK please try the following patch [PATCH] neighbour: fix a race in neigh_destroy() There is a race in neighbour code, because neigh_destroy() uses skb_queue_purge(&neigh->arp_queue) without holding neighbour lock, while other parts of the code assume neighbour rwlock is what protects arp_queue
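The locking rule described in the patch summary above (a queue must be purged under the same lock that every other user of the queue relies on) can be modelled in userspace C. This is a sketch under stated assumptions, not the patch itself: `model_neigh`, `model_pkt` and the helper functions are hypothetical, and a pthread mutex stands in for the neighbour rwlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queued packet. */
struct model_pkt {
    struct model_pkt *next;
};

/* Hypothetical stand-in for a neighbour entry and its pending queue. */
struct model_neigh {
    pthread_mutex_t lock;     /* stands in for the neighbour rwlock */
    struct model_pkt *head;   /* stands in for arp_queue */
    unsigned int qlen;
};

static void model_neigh_init(struct model_neigh *n)
{
    pthread_mutex_init(&n->lock, NULL);
    n->head = NULL;
    n->qlen = 0;
}

/* Queue one packet; every queue access happens under n->lock. */
static int model_neigh_enqueue(struct model_neigh *n)
{
    struct model_pkt *p = malloc(sizeof(*p));

    if (!p)
        return -1;

    pthread_mutex_lock(&n->lock);
    p->next = n->head;
    n->head = p;
    n->qlen++;
    pthread_mutex_unlock(&n->lock);
    return 0;
}

/*
 * Purge the queue. The list is detached while holding the same lock the
 * other users take, so no one can observe a half-freed queue; the packets
 * are then freed outside the lock. Returns the number of packets freed.
 */
static unsigned int model_neigh_purge(struct model_neigh *n)
{
    struct model_pkt *list, *next;
    unsigned int freed = 0;

    pthread_mutex_lock(&n->lock);
    list = n->head;
    n->head = NULL;
    n->qlen = 0;
    pthread_mutex_unlock(&n->lock);

    while (list) {
        next = list->next;
        free(list);
        list = next;
        freed++;
    }
    return freed;
}
```

Detaching the list under the lock and freeing it afterwards keeps the critical section short; whether the kernel patch takes the same approach is not visible in the truncated message.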

Re: kernel panic in skb_copy_bits

2013-06-27 Thread Eric Dumazet
On Fri, 2013-06-28 at 12:17 +0800, Joe Jin wrote: > Find a similar issue http://www.gossamer-threads.com/lists/xen/devel/265611 > So copied to Xen developer as well. > > On 06/27/13 13:31, Eric Dumazet wrote: > > On Thu, 2013-06-27 at 10:58 +0800, Joe Jin wrote: > >> Hi, > >> > >> When we do fail

Re: kernel panic in skb_copy_bits

2013-06-27 Thread Joe Jin
Find a similar issue http://www.gossamer-threads.com/lists/xen/devel/265611 So copied to Xen developer as well. On 06/27/13 13:31, Eric Dumazet wrote: > On Thu, 2013-06-27 at 10:58 +0800, Joe Jin wrote: >> Hi, >> >> When we do fail over test with iscsi + multipath by reset the switches >> on OVM(

Re: kernel panic in skb_copy_bits

2013-06-27 Thread Joe Jin
Hi Eric, Thanks for your response, will test it and get back to you. Regards, Joe On 06/27/13 13:31, Eric Dumazet wrote: > On Thu, 2013-06-27 at 10:58 +0800, Joe Jin wrote: >> Hi, >> >> When we do fail over test with iscsi + multipath by reset the switches >> on OVM(2.6.39) we hit the panic: >> >>

Re: kernel panic in skb_copy_bits

2013-06-26 Thread Eric Dumazet
On Thu, 2013-06-27 at 10:58 +0800, Joe Jin wrote: > Hi, > > When we do fail over test with iscsi + multipath by reset the switches > on OVM(2.6.39) we hit the panic: > > BUG: unable to handle kernel paging request at 88006d9e8d48 > IP: [] memcpy+0xb/0x120 > PGD 1798067 PUD 1fd2067 PMD 213f067

kernel panic in skb_copy_bits

2013-06-26 Thread Joe Jin
Hi, When we do fail over test with iscsi + multipath by reset the switches on OVM(2.6.39) we hit the panic: BUG: unable to handle kernel paging request at 88006d9e8d48 IP: [] memcpy+0xb/0x120 PGD 1798067 PUD 1fd2067 PMD 213f067 PTE 0 Oops: [#1] SMP CPU 7 Modules linked in: dm_nfs tun n