Re: Does vm_operations_struct require a .owner field?
On Tue, Jan 05, 2016 at 11:27:45AM -0500, Alan Stern wrote: > Hello: > > Question: The vm_operations_struct structure contains lots of callback > pointers. Is there any mechanism to prevent the callback routines and > the structure itself being unloaded from memory (if they are built into > modules) while the relevant VMAs are still in use? > > Consider a simple example: A user program calls mmap(2) on a device > file. Later on, the file is closed and the device driver's module is > unloaded. But until munmap(2) is called or the user program exits, the > memory mapping and the corresponding VMA will remain in existence. > (The man page for munmap specifically says "closing the file descriptor > does not unmap the region".) Thus when the user program does do an > munmap(), the callback to vma->vm_ops->close will reference nonexistent > memory and cause an oops. > > Normally this sort of thing is prevented by try_module_get(...->owner). > But vm_operations_struct doesn't include a .owner field. > > Am I missing something here? mmap(2) takes reference of the file, therefore the file is not closed from kernel POV until vma is gone and you cannot unload relevant module. See get_file() in mmap_region(). -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 03/78] xarray: Add the xa_lock to the radix_tree_root
On Fri, Dec 15, 2017 at 02:03:35PM -0800, Matthew Wilcox wrote: > From: Matthew Wilcox > > This results in no change in structure size on 64-bit x86 as it fits in > the padding between the gfp_t and the void *. The patch does more than described in the subject and commit message. At first I was confused why do you need to touch idr here. It took few minutes to figure it out. Could you please add more into commit message about lockname and xa_ locking interface since you introduce it here? -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 04/78] page cache: Use xa_lock
On Fri, Dec 15, 2017 at 02:03:36PM -0800, Matthew Wilcox wrote: > From: Matthew Wilcox > > Remove the address_space ->tree_lock and use the xa_lock newly added to > the radix_tree_root. Rename the address_space ->page_tree to ->pages, > since we don't really care that it's a tree. Take the opportunity to > rearrange the elements of address_space to pack them better on 64-bit, > and make the comments more useful. The description sounds a lot like three commits ;) -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 05/78] xarray: Replace exceptional entries
On Fri, Dec 15, 2017 at 02:03:37PM -0800, Matthew Wilcox wrote: > From: Matthew Wilcox > > Introduce xarray value entries to replace the radix tree exceptional > entry code. This is a slight change in encoding to allow the use of an > extra bit (we can now store BITS_PER_LONG - 1 bits in a value entry). > It is also a change in emphasis; exceptional entries are intimidating > and different. As the comment explains, you can choose to store values > or pointers in the xarray and they are both first-class citizens. > > Signed-off-by: Matthew Wilcox > --- > arch/powerpc/include/asm/book3s/64/pgtable.h| 4 +- > arch/powerpc/include/asm/nohash/64/pgtable.h| 4 +- > drivers/gpu/drm/i915/i915_gem.c | 17 ++-- > drivers/staging/lustre/lustre/mdc/mdc_request.c | 2 +- > fs/btrfs/compression.c | 2 +- > fs/btrfs/inode.c| 4 +- > fs/dax.c| 115 > > fs/proc/task_mmu.c | 2 +- > include/linux/fs.h | 48 ++ > include/linux/radix-tree.h | 36 ++-- > include/linux/swapops.h | 19 ++-- > include/linux/xarray.h | 40 + > lib/idr.c | 63 ++--- > lib/radix-tree.c| 21 ++--- > mm/filemap.c| 10 +-- > mm/khugepaged.c | 2 +- > mm/madvise.c| 2 +- > mm/memcontrol.c | 2 +- > mm/mincore.c| 2 +- > mm/readahead.c | 2 +- > mm/shmem.c | 10 +-- > mm/swap.c | 2 +- > mm/truncate.c | 12 +-- > mm/workingset.c | 12 ++- > tools/testing/radix-tree/idr-test.c | 6 +- > tools/testing/radix-tree/linux/radix-tree.h | 1 + > tools/testing/radix-tree/multiorder.c | 47 +- > tools/testing/radix-tree/test.c | 2 +- > 28 files changed, 249 insertions(+), 240 deletions(-) Everything looks fine to me after quick scan, but hat's a lot of changes for one patch... > @@ -565,7 +565,7 @@ unsigned long invalidate_mapping_pages(struct > address_space *mapping, > if (index > end) > break; > > - if (radix_tree_exceptional_entry(page)) { > + if (xa_is_value(page)) { > invalidate_exceptional_entry(mapping, index, >page); > continue; > @@ -696,7 +696,7 @@ int invalidate_inode_pages2_range(struct address_space > *mapping, > if (index > end) > break; > > - if (radix_tree_exceptional_entry(page)) { > + if (xa_is_value(page)) { > if (!invalidate_exceptional_entry2(mapping, > index, page)) > ret = -EBUSY; invalidate_exceptional_entry? Are we going to leave the terminology here as is? -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 06/78] xarray: Change definition of sibling entries
On Fri, Dec 15, 2017 at 02:03:38PM -0800, Matthew Wilcox wrote: > From: Matthew Wilcox > > Instead of storing a pointer to the slot containing the canonical entry, > store the offset of the slot. Produces slightly more efficient code > (~300 bytes) and simplifies the implementation. > > Signed-off-by: Matthew Wilcox > --- > include/linux/xarray.h | 82 > ++ > lib/radix-tree.c | 65 +++ > 2 files changed, 100 insertions(+), 47 deletions(-) > > diff --git a/include/linux/xarray.h b/include/linux/xarray.h > index 49fffc354431..f175350560fd 100644 > --- a/include/linux/xarray.h > +++ b/include/linux/xarray.h > @@ -49,6 +49,17 @@ static inline bool xa_is_value(const void *entry) > return (unsigned long)entry & 1; > } > > +/** > + * xa_is_internal() - Is the entry an internal entry? > + * @entry: Entry retrieved from the XArray > + * > + * Return: %true if the entry is an internal entry. > + */ What does it mean "internal entry"? Is it just a term for non-value and non-data pointer entry? Do we allow anybody besides xarray implementation to use internal entires? Do we have it documented? -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 03/78] xarray: Add the xa_lock to the radix_tree_root
On Tue, Dec 26, 2017 at 07:43:40PM -0800, Matthew Wilcox wrote: > On Tue, Dec 26, 2017 at 07:54:40PM +0300, Kirill A. Shutemov wrote: > > On Fri, Dec 15, 2017 at 02:03:35PM -0800, Matthew Wilcox wrote: > > > From: Matthew Wilcox > > > > > > This results in no change in structure size on 64-bit x86 as it fits in > > > the padding between the gfp_t and the void *. > > > > The patch does more than described in the subject and commit message. At > > first > > I was confused why do you need to touch idr here. It took few minutes to > > figure > > it out. > > > > Could you please add more into commit message about lockname and xa_ locking > > interface since you introduce it here? > > Sure! How's this? > > xarray: Add the xa_lock to the radix_tree_root > > This results in no change in structure size on 64-bit x86 as it fits in > the padding between the gfp_t and the void *. > > Initialising the spinlock requires a name for the benefit of lockdep, > so RADIX_TREE_INIT() now needs to know the name of the radix tree it's > initialising, and so do IDR_INIT() and IDA_INIT(). > > Also add the xa_lock() and xa_unlock() family of wrappers to make it > easier to use the lock. If we could rely on -fplan9-extensions in > the compiler, we could avoid all of this syntactic sugar, but that > wasn't added until gcc 4.6. > Looks great, thanks. -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 03/78] xarray: Add the xa_lock to the radix_tree_root
On Tue, Dec 26, 2017 at 07:58:15PM -0800, Matthew Wilcox wrote: > On Tue, Dec 26, 2017 at 07:43:40PM -0800, Matthew Wilcox wrote: > > Also add the xa_lock() and xa_unlock() family of wrappers to make it > > easier to use the lock. If we could rely on -fplan9-extensions in > > the compiler, we could avoid all of this syntactic sugar, but that > > wasn't added until gcc 4.6. > > Oh, in case anyone's wondering, here's how I'd do it with plan9 extensions: > > struct xarray { > spinlock_t; > int xa_flags; > void *xa_head; > }; > > ... > spin_lock_irqsave(&mapping->pages, flags); > __delete_from_page_cache(page, NULL); > spin_unlock_irqrestore(&mapping->pages, flags); > ... > > The plan9 extensions permit passing a pointer to a struct which has an > unnamed element to a function which is expecting a pointer to the type > of that element. The compiler does any necessary arithmetic to produce > a pointer. It's exactly as if I had written: > > spin_lock_irqsave(&mapping->pages.xa_lock, flags); > __delete_from_page_cache(page, NULL); > spin_unlock_irqrestore(&mapping->pages.xa_lock, flags); > > More details here: https://9p.io/sys/doc/compiler.html Yeah, that's neat. Dealing with old compilers is frustrating... -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 05/78] xarray: Replace exceptional entries
On Tue, Dec 26, 2017 at 07:05:34PM -0800, Matthew Wilcox wrote: > On Tue, Dec 26, 2017 at 08:15:42PM +0300, Kirill A. Shutemov wrote: > > > 28 files changed, 249 insertions(+), 240 deletions(-) > > > > Everything looks fine to me after quick scan, but hat's a lot of changes for > > one patch... > > Yeah. It's pretty mechanical though. > > > > - if (radix_tree_exceptional_entry(page)) { > > > + if (xa_is_value(page)) { > > > if (!invalidate_exceptional_entry2(mapping, > > > index, page)) > > > ret = -EBUSY; > > > > invalidate_exceptional_entry? Are we going to leave the terminology here as > > is? > > That is a great question. If the page cache wants to call its value > entries exceptional entries, it can continue to do that. I think there's > a better name for them, but I'm not sure what it is. Right now, the > page cache uses value entries to store: > > 1. Shadow entries (for workingset) > 2. Swap entries (for shmem) > 3. DAX entries > > I can't come up with a good name for these three things. 'nonpage' is > the only thing which hasn't immediately fallen off my ideas list. Yeah, naming problem... > But I think renaming exceptional entries in the page cache is a great idea, > and I don't want to do it as part of this patch set ;-) Fair enough. -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 06/78] xarray: Change definition of sibling entries
On Tue, Dec 26, 2017 at 07:13:26PM -0800, Matthew Wilcox wrote: > On Tue, Dec 26, 2017 at 08:21:53PM +0300, Kirill A. Shutemov wrote: > > > +/** > > > + * xa_is_internal() - Is the entry an internal entry? > > > + * @entry: Entry retrieved from the XArray > > > + * > > > + * Return: %true if the entry is an internal entry. > > > + */ > > > > What does it mean "internal entry"? Is it just a term for non-value and > > non-data pointer entry? Do we allow anybody besides xarray implementation to > > use internal entires? > > > > Do we have it documented? > > We do! include/linux/radix-tree.h has it documented right now: Looks good. Thanks. -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: kernel BUG at ./include/linux/mm.h:LINE! (3)
On Thu, Dec 28, 2017 at 04:03:46PM -0800, Andrew Morton wrote: > > Is anyone able to reproduce this? The VM_BUG_ON_PAGE() in get_page() > went bang. Yep, it triggers for me. Looks like MON_IOCT_RING_SIZE reallocates ring buffer without any serialization wrt mon_bin_vma_fault(). By the time of get_page() the page may be freed. The patch below seems help the crash to go away, but I *think* more work is required. For instance, after ring buffer reallocation the old pages will stay mapped. Nothing pulls them. 8<-- diff --git a/drivers/usb/mon/mon_bin.c b/drivers/usb/mon/mon_bin.c index f6ae753ab99b..ac168fecf04f 100644 --- a/drivers/usb/mon/mon_bin.c +++ b/drivers/usb/mon/mon_bin.c @@ -1228,15 +1228,24 @@ static void mon_bin_vma_close(struct vm_area_struct *vma) static int mon_bin_vma_fault(struct vm_fault *vmf) { struct mon_reader_bin *rp = vmf->vma->vm_private_data; - unsigned long offset, chunk_idx; + unsigned long offset, chunk_idx, flags; struct page *pageptr; + mutex_lock(&rp->fetch_lock); + spin_lock_irqsave(&rp->b_lock, flags); offset = vmf->pgoff << PAGE_SHIFT; - if (offset >= rp->b_size) + if (offset >= rp->b_size) { + spin_unlock_irqrestore(&rp->b_lock, flags); + mutex_unlock(&rp->fetch_lock); return VM_FAULT_SIGBUS; + } chunk_idx = offset / CHUNK_SIZE; + pageptr = rp->b_vec[chunk_idx].pg; get_page(pageptr); + spin_unlock_irqrestore(&rp->b_lock, flags); + mutex_unlock(&rp->fetch_lock); + vmf->page = pageptr; return 0; } 8<-- > > > Begin forwarded message: > > Date: Wed, 27 Dec 2017 14:22:01 -0800 > From: syzbot > To: a...@linux-foundation.org, alexander.deuc...@amd.com, > ch...@chris-wilson.co.uk, dave.ji...@intel.com, da...@davemloft.net, > gre...@linuxfoundation.org, linux-ker...@vger.kernel.org, > linux-usb@vger.kernel.org, mche...@kernel.org, mi...@kernel.org, > syzkaller-b...@googlegroups.com > Subject: kernel BUG at ./include/linux/mm.h:LINE! (3) > > > Hello, > > syzkaller hit the following crash on > 37759fa6d0fa9e4d6036d19ac12f555bfc0aeafd > git://git.cmpxchg.org/linux-mmots.git/master > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached > Raw console output is attached. > C reproducer is attached > syzkaller reproducer is attached. See https://goo.gl/kgGztJ > for information about syzkaller reproducers > > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: > It will help syzbot understand when the bug is fixed. See footer for > details. > If you forward the report, please keep this part and the footer. > > [ cut here ] > kernel BUG at ./include/linux/mm.h:844! > invalid opcode: [#1] SMP KASAN > Dumping ftrace buffer: > (ftrace buffer empty) > Modules linked in: > CPU: 1 PID: 3152 Comm: syzkaller757651 Not tainted 4.15.0-rc4-mm1+ #49 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > RIP: 0010:get_page include/linux/mm.h:844 [inline] > RIP: 0010:mon_bin_vma_fault+0x2f4/0x400 drivers/usb/mon/mon_bin.c:1239 > RSP: 0018:8801cb0af510 EFLAGS: 00010203 > RAX: RBX: 110039615ea4 RCX: > RDX: RSI: 110039615dee RDI: ed0039615e90 > RBP: 8801cb0af5e8 R08: 110039615db0 R09: > R10: dc00 R11: R12: 110039615ea8 > R13: dc00 R14: 8801cb0af840 R15: ea000723fc00 > FS: 7f49b1386700() GS:8801db30() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 55cfb39ce0b8 CR3: 0001ccb19003 CR4: 001606e0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > __do_fault+0xeb/0x30f mm/memory.c:3206 > do_read_fault mm/memory.c:3616 [inline] > do_fault mm/memory.c:3716 [inline] > handle_pte_fault mm/memory.c:3947 [inline] > __handle_mm_fault+0x1d8f/0x3ce0 mm/memory.c:4071 > handle_mm_fault+0x38f/0x930 mm/memory.c:4108 > faultin_page mm/gup.c:502 [inline] > __get_user_pages+0x50c/0x15f0 mm/gup.c:699 > populate_vma_page_range+0x20e/0x2f0 mm/gup.c:1200 > __mm_populate+0x23a/0x450 mm/gup.c:1250 > mm_populate include/linux/mm.h:2233 [inline] > vm_mmap_pgoff+0x241/0x280 mm/util.c:338 > SYSC_mmap_pgoff mm/mmap.c:1544 [inline] > SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1502 > SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline] > SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91 > entry_SYSCALL_64_fastpath+0x1f/0x96 > RIP: 0033:0x449249 > RSP: 002b:7f49b1385d98 EFLAGS: 0212 ORIG_RAX: 0009 > RAX: ffda RBX: 006dac24 RCX: 00449249 > RD
Re: kernel BUG at ./include/linux/mm.h:LINE! (3)
On Wed, Jan 03, 2018 at 01:02:38AM -0600, Pete Zaitcev wrote: > On Fri, 29 Dec 2017 16:24:20 +0300 > "Kirill A. Shutemov" wrote: > > > Looks like MON_IOCT_RING_SIZE reallocates ring buffer without any > > serialization wrt mon_bin_vma_fault(). By the time of get_page() the page > > may be freed. > > Okay. Who knew that you could fork while holding an open descriptor. :-) It's threads. But yeah. > > The patch below seems help the crash to go away, but I *think* more work > > is required. For instance, after ring buffer reallocation the old pages > > will stay mapped. Nothing pulls them. > > You know, this bothered me all these years too, but I was assured > back in the day (as much as I can remember), that doing get_page() > in the .fault() is just the right thing. In my defense, you can > see other drivers doing it, such as: > > ./drivers/char/agp/alpha-agp.c > ./drivers/hsi/clients/cmt_speech.c > > I'd appreciate insight from someone who knows how VM subsystem works. get_page() is not a problem. It's right thing to do in ->fault. After more thought, I think it's not a problem at all. As long as userspace is aware that old mapping is no good after changing size of the buffer everything would work fine. Even if userspace would use old mapping nothing bad would happen from kernel POV. Just userspace may see old/inconsistent data. But there's no crashes or such. > Now, about the code: > > > diff --git a/drivers/usb/mon/mon_bin.c b/drivers/usb/mon/mon_bin.c > > index f6ae753ab99b..ac168fecf04f 100644 > > --- a/drivers/usb/mon/mon_bin.c > > +++ b/drivers/usb/mon/mon_bin.c > > @@ -1228,15 +1228,24 @@ static void mon_bin_vma_close(struct vm_area_struct > > *vma) > > static int mon_bin_vma_fault(struct vm_fault *vmf) > > { > > struct mon_reader_bin *rp = vmf->vma->vm_private_data; > > - unsigned long offset, chunk_idx; > > + unsigned long offset, chunk_idx, flags; > > struct page *pageptr; > > > > + mutex_lock(&rp->fetch_lock); > > + spin_lock_irqsave(&rp->b_lock, flags); > > offset = vmf->pgoff << PAGE_SHIFT; > > - if (offset >= rp->b_size) > > + if (offset >= rp->b_size) { > > + spin_unlock_irqrestore(&rp->b_lock, flags); > > + mutex_unlock(&rp->fetch_lock); > > return VM_FAULT_SIGBUS; > > + } > > chunk_idx = offset / CHUNK_SIZE; > > + > > pageptr = rp->b_vec[chunk_idx].pg; > > get_page(pageptr); > > + spin_unlock_irqrestore(&rp->b_lock, flags); > > + mutex_unlock(&rp->fetch_lock); > > + > > vmf->page = pageptr; > > return 0; > > } > > I think that grabbing the spinlock is not really necessary in > this case. The ->b_lock is designed for things that are accessed > from interrupts that Host Controller Driver serves -- mostly > various pointers. By defintion it's not covering things that > are related to re-allocation. Now, the re-allocation itself > grabs it, because it resets indexes into the new buffer, but > does not appear to apply here, does it now? Please, double check everything. I remember that the mutex wasn't enough to stop bug from triggering. But I didn't spend much time understanding the code. -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html