Re: Dcache oops

2016-06-04 Thread Oleg Drokin
On Jun 3, 2016, at 8:56 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 07:58:37PM -0400, Oleg Drokin wrote: > >>> EOPENSTALE, that is... Oleg, could you check if the following works? >> >> Yes, this one lasted for an hour with no crashing, so it must be good. >> Thanks. >> (note, I am not equipp

Re: Dcache oops

2016-06-04 Thread Jeff Layton
On Sat, 2016-06-04 at 01:56 +0100, Al Viro wrote: > On Fri, Jun 03, 2016 at 07:58:37PM -0400, Oleg Drokin wrote: > > > > > > > > > EOPENSTALE, that is...  Oleg, could you check if the following works? > > Yes, this one lasted for an hour with no crashing, so it must be good. > > Thanks. > > (not

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 07:58:37PM -0400, Oleg Drokin wrote: > > EOPENSTALE, that is... Oleg, could you check if the following works? > > Yes, this one lasted for an hour with no crashing, so it must be good. > Thanks. > (note, I am not equipped to verify correctness of NFS operations, though).

Re: Dcache oops

2016-06-03 Thread Oleg Drokin
On Jun 3, 2016, at 6:37 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 11:23:55PM +0100, Al Viro wrote: > >> It's not that. It's explicit put_link() in do_last(), followed by >> ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN" >> looking at now-freed nd->last.name. IOW,

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 03:36:22PM -0700, Linus Torvalds wrote: > Happy to hear that you seem to have figured it out. > > But why did it apparently only start happening now? Oleg has started to use Lustre torture tests on NFS, that's all. Note, BTW, that first they'd triggered an oopsable bug (

Re: Dcache oops

2016-06-03 Thread Oleg Drokin
On Jun 3, 2016, at 6:36 PM, Linus Torvalds wrote: > On Fri, Jun 3, 2016 at 3:23 PM, Al Viro wrote: >> On Fri, Jun 03, 2016 at 03:00:02PM -0700, Linus Torvalds wrote: >>> Normally it's done at terminate_walk() time. But I note that in >>> walk_component(), we do put_link(nd) which does a do

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 11:23:55PM +0100, Al Viro wrote: > It's not that. It's explicit put_link() in do_last(), followed by > ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN" > looking at now-freed nd->last.name. IOW, the bug predates delayed_call > stuff. EOPENSTALE,

Re: Dcache oops

2016-06-03 Thread Linus Torvalds
On Fri, Jun 3, 2016 at 3:23 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 03:00:02PM -0700, Linus Torvalds wrote: >>> >> Normally it's done at terminate_walk() time. But I note that in >> walk_component(), we do put_link(nd) which does a do_delayed_call(), >> but does *not* do a clear_delayed_call(

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 11:23:55PM +0100, Al Viro wrote: > It's not that. It's explicit put_link() in do_last(), followed by > ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN" > looking at now-freed nd->last.name. IOW, the bug predates delayed_call > stuff. FWIW, I'd st

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 03:00:02PM -0700, Linus Torvalds wrote: > Is perhaps the "delayed_call" logic broken, and the symlink is free'd too > early? > > That whole set_delayed_call/do_delayed_call thing came in 4.5. Maybe > something broke that logic, and we've executed the delayed freeing > bef

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 10:46:31PM +0100, Al Viro wrote: > On Fri, Jun 03, 2016 at 05:17:06PM -0400, Oleg Drokin wrote: > > > > Can the same thing be reproduced (with NFS fix) on v4.6, ede4090, 7f427d3, > > > 4e8440b? > > > > Well, that was faster than I expected. 4e8440b triggers right away, so

Re: Dcache oops

2016-06-03 Thread Linus Torvalds
On Fri, Jun 3, 2016 at 2:26 PM, Al Viro wrote: >> >> in the __d_lookup() disassembly. And %rdi contains 2, so there were >> supposed to be two more characters at 'ct' (which is %rdx). > > ... and since r8 and rsi are 0, we couldn't have consumed anything. Right you are. So it really started out p

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 05:17:06PM -0400, Oleg Drokin wrote: > > Can the same thing be reproduced (with NFS fix) on v4.6, ede4090, 7f427d3, > > 4e8440b? > > Well, that was faster than I expected. 4e8440b triggers right away, so I guess > there's no point in trying the later ones? > BTW, just to c

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 02:18:15PM -0700, Linus Torvalds wrote: > So something must have corrupted the qstr. > > The remaining length *should* in %edi, judging by the > > 0x81243b82 <+306>: cmp $0x7,%edi > > in the __d_lookup() disassembly. And %rdi contains 2, so there were > sup

Re: Dcache oops

2016-06-03 Thread Linus Torvalds
On Fri, Jun 3, 2016 at 1:07 PM, Al Viro wrote: > > Aha... It's load_unaligned_zeropad() from dentry_string_cmp(), hitting > a genuinely unmapped address. That sends it into fixup, where it tries to > load an aligned word containing the address in question, in hope that > fault was on attempt to

Re: Dcache oops

2016-06-03 Thread Oleg Drokin
On Jun 3, 2016, at 4:07 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 02:35:41PM -0400, Oleg Drokin wrote: >>>> [ 2642.364383] BUG: unable to handle kernel paging request at 880113f82000 >>>> [ 2642.365014] IP: [] bad_gs+0xd1d/0x1ba9 >>> >>> *ow* >>> Could you dump your vmlinux (and

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 02:35:41PM -0400, Oleg Drokin wrote: > >> [ 2642.364383] BUG: unable to handle kernel paging request at > >> 880113f82000 > >> [ 2642.365014] IP: [] bad_gs+0xd1d/0x1ba9 > > > > *ow* > > Could you dump your vmlinux (and System.map) somewhere on anonftp? > > This 'bad_g

Re: Dcache oops

2016-06-03 Thread Oleg Drokin
On Jun 3, 2016, at 2:22 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 12:38:40PM -0400, Oleg Drokin wrote: >> I am dropping NFS people since it seems to be converting into a generic >> VFS/dcache bug even though you need NFS or the like to trigger it - the >> lookup_open path. > > NFS bug is re

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 12:38:40PM -0400, Oleg Drokin wrote: > I am dropping NFS people since it seems to be converting into a generic > VFS/dcache bug even though you need NFS or the like to trigger it - the > lookup_open path. NFS bug is real; there might very well be something else, but that