Re: [patch] mm: fix PageUptodate data race
Sorry, way behind on email here. I'll get through it slowly...

On Sat, Jan 26, 2008 at 10:03:56PM -0800, Andrew Morton wrote:
> On Tue, 22 Jan 2008 05:01:14 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> >
> > After running SetPageUptodate, preceding stores to the page contents to
> > actually bring it uptodate may not be ordered with the store to set the
> > page uptodate.
> >
> > Therefore, another CPU which checks PageUptodate is true, then reads the
> > page contents, can get stale data.
> >
> > Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
> > PageUptodate.
> >
> > Many places that test PageUptodate do so with the page locked, and this
> > would be enough to ensure memory ordering in those places if
> > SetPageUptodate were only called while the page is locked. Unfortunately
> > that is not always the case for some filesystems, but it could be an idea
> > for the future.
> >
> > Also bring the handling of anonymous page uptodateness in line with that
> > of file backed page management, by marking anon pages as uptodate when
> > they _are_ uptodate, rather than when our implementation requires that
> > they be marked as such. Doing so allows us to get rid of the smp_wmb's in
> > the page copying functions, which were especially added for anonymous
> > pages for an analogous memory ordering problem. Both file and anonymous
> > pages are handled with the same barriers.
>
> So... it's two patches in one.

I guess so. Hmm, at least I appreciate it (them) getting testing in -mm for
now. I guess I should break it in two, do you agree Hugh? Do you like/dislike
the anonymous page change?

> What kernel is this against?  Looks like mainline.  Is it complete and
> correct when applied against the large number of pending MM changes?

Uh, I forget. But luckily this one should be quite correct regardless of
pending mm changes... unless something there has fundamentally changed the
semantics or locking of PG_uptodate... which wouldn't be too surprising
actually ;)

No, it should be OK. I'll double check when I look at resubmitting it as 2
patches.
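As an illustration of the pairing described in the changelog (a sketch only,
not the patch itself; make_page_uptodate(), read_page() and use_contents()
are hypothetical names used for this example):

static void make_page_uptodate(struct page *page, const void *src)
{
	/* stores that actually bring the page contents uptodate */
	memcpy(page_address(page), src, PAGE_SIZE);
	/* order the contents before the uptodate flag becomes visible */
	smp_wmb();
	SetPageUptodate(page);
}

static void read_page(struct page *page)
{
	if (PageUptodate(page)) {
		/* order the flag check before reading the contents */
		smp_rmb();
		use_contents(page_address(page));	/* hypothetical consumer */
	}
}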
Re: x86 arch updates also broke s390
On Thu, Jan 31, 2008 at 11:24:54AM +0100, Ingo Molnar wrote:
>
> * Martin Schwidefsky <[EMAIL PROTECTED]> wrote:
>
> > On Thu, 2008-01-31 at 02:33 +0200, Adrian Bunk wrote:
> > > <--  snip  -->
> > >
> > > ...
> > >   CC      arch/s390/kernel/asm-offsets.s
> > > In file included from
> > > /home/bunk/linux/kernel-2.6/git/linux-2.6/arch/s390/kernel/asm-offsets.c:7:
> > > /home/bunk/linux/kernel-2.6/git/linux-2.6/include/linux/sched.h: In
> > > function 'spin_needbreak':
> > > /home/bunk/linux/kernel-2.6/git/linux-2.6/include/linux/sched.h:1931:
> > > error: implicit declaration of function '__raw_spin_is_contended'
> > > make[2]: *** [arch/s390/kernel/asm-offsets.s] Error 1
> > >
> > > <--  snip  -->
> >
> > Defining GENERIC_LOCKBREAK in arch/s390/Kconfig takes care of it. I'll
> > cook up a patch and queue it in git390.
>
> thanks!

Yeah thanks, don't know what happened with this, sorry. I thought I had
defined it for all SMP capable ones, so maybe it was a quilt error or
something on my part.
Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
On Friday 01 February 2008 09:45, Frederik Himpe wrote:
> On ma, 2008-01-28 at 12:46 +1100, Nick Piggin wrote:
> > On Sunday 27 January 2008 00:29, Frederik Himpe wrote:
> > > On di, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:
> > > > > > On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> > > > > > > With Linux 2.6.24-rc8 I often have the problem that the pan
> > > > > > > usenet reader starts using 100% of CPU time after some time.
> > > > > > > When this happens, kill -9 does not work, and strace just hangs
> > > > > > > when trying to attach to the process. The same with gdb. ps
> > > > > > > shows the process as being in the R state.
> >
> > Well after trying a lot of writev combinations, I've reproduced a hang
> > *hangs head*.
> >
> > Does this help?
>
> Just to confirm: in four days of testing, I haven't seen the problem
> anymore, so it looks like this was indeed the right fix.

Thanks very much for reporting and testing. This patch needs to go into
2.6.24.stable and upstream.
Re: [bug] as_merged_requests(): possible recursive locking detected
On Friday 01 February 2008 21:31, Jens Axboe wrote:
> On Fri, Feb 01 2008, Jens Axboe wrote:
> > I think the right solution is to remove swap_io_context() and fix the io
> > context referencing in as-iosched.c instead.
>
> IOW, the below. I don't know why Nick originally wanted to swap io
> contexts for a rq <-> rq merge, there seems little (if any) benefit to
> doing so.

Yeah, I guess this patch is fine. Simpler is better.

> diff --git a/block/as-iosched.c b/block/as-iosched.c
> index 9603684..852803e 100644
> --- a/block/as-iosched.c
> +++ b/block/as-iosched.c
> @@ -1266,22 +1266,8 @@ static void as_merged_requests(struct request_queue *q, struct request *req,
>  	 */
>  	if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
>  		if (time_before(rq_fifo_time(next), rq_fifo_time(req))) {
> -			struct io_context *rioc = RQ_IOC(req);
> -			struct io_context *nioc = RQ_IOC(next);
> -
>  			list_move(&req->queuelist, &next->queuelist);
>  			rq_set_fifo_time(req, rq_fifo_time(next));
> -			/*
> -			 * Don't copy here but swap, because when anext is
> -			 * removed below, it must contain the unused context
> -			 */
> -			if (rioc != nioc) {
> -				double_spin_lock(&rioc->lock, &nioc->lock,
> -						rioc < nioc);
> -				swap_io_context(&rioc, &nioc);
> -				double_spin_unlock(&rioc->lock, &nioc->lock,
> -						rioc < nioc);
> -			}
>  		}
>  	}
>
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index 6d16755..80245dc 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -176,15 +176,6 @@ void copy_io_context(struct io_context **pdst, struct io_context **psrc)
>  }
>  EXPORT_SYMBOL(copy_io_context);
>
> -void swap_io_context(struct io_context **ioc1, struct io_context **ioc2)
> -{
> -	struct io_context *temp;
> -	temp = *ioc1;
> -	*ioc1 = *ioc2;
> -	*ioc2 = temp;
> -}
> -EXPORT_SYMBOL(swap_io_context);
> -
>  int __init blk_ioc_init(void)
>  {
>  	iocontext_cachep = kmem_cache_create("blkdev_ioc",
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index baba233..bbe3cf4 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -39,7 +39,6 @@ void exit_io_context(void);
>  struct io_context *get_io_context(gfp_t gfp_flags, int node);
>  struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
>  void copy_io_context(struct io_context **pdst, struct io_context **psrc);
> -void swap_io_context(struct io_context **ioc1, struct io_context **ioc2);
>
>  struct request;
>  typedef void (rq_end_io_fn)(struct request *, int);
Re: Feature Removals for 2.6.25
On Thu, Jan 31, 2008 at 05:38:42PM -0800, Harvey Harrison wrote:
> ---
> Ping?
>
> What:	vm_ops.nopage
> When:	Soon, provided in-kernel callers have been converted
> Why:	This interface is replaced by vm_ops.fault, but it has been around
> 	forever, is used by a lot of drivers, and doesn't cost much to
> 	maintain.
> Who:	Nick Piggin <[EMAIL PROTECTED]>

Well the in-kernel callers have not all been converted yet. I have actually
done the work, but it needs testing and merging by maintainers. Getting it
done during this merge window would be nice, I'm going to try to make that
happen after I get back from LCA. Otherwise probably 2.6.26.
Re: [PATCH 3/3] uio: vm_operations_struct ->nopage to ->fault method conversion
On Saturday 02 February 2008 20:51, Denis Cheng wrote:
> Signed-off-by: Denis Cheng <[EMAIL PROTECTED]>

Thanks, but already patched in -mm.

> ---
>  drivers/uio/uio.c | 19 ---
>  1 files changed, 8 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/uio/uio.c b/drivers/uio/uio.c
> index cc246fa..47e0c32 100644
> --- a/drivers/uio/uio.c
> +++ b/drivers/uio/uio.c
> @@ -417,30 +417,27 @@ static void uio_vma_close(struct vm_area_struct *vma)
>  	idev->vma_count--;
>  }
>
> -static struct page *uio_vma_nopage(struct vm_area_struct *vma,
> -				unsigned long address, int *type)
> +static int uio_vma_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>  	struct uio_device *idev = vma->vm_private_data;
> -	struct page* page = NOPAGE_SIGBUS;
>
>  	int mi = uio_find_mem_index(vma);
>  	if (mi < 0)
> -		return page;
> +		return VM_FAULT_SIGBUS;
>
>  	if (idev->info->mem[mi].memtype == UIO_MEM_LOGICAL)
> -		page = virt_to_page(idev->info->mem[mi].addr);
> +		vmf->page = virt_to_page(idev->info->mem[mi].addr);
>  	else
> -		page = vmalloc_to_page((void*)idev->info->mem[mi].addr);
> -	get_page(page);
> -	if (type)
> -		*type = VM_FAULT_MINOR;
> -	return page;
> +		vmf->page = vmalloc_to_page((void *)idev->info->mem[mi].addr);
> +	get_page(vmf->page);
> +
> +	return 0;
>  }
>
>  static struct vm_operations_struct uio_vm_ops = {
>  	.open = uio_vma_open,
>  	.close = uio_vma_close,
> -	.nopage = uio_vma_nopage,
> +	.fault = uio_vma_fault,
>  };
>
>  static int uio_mmap_physical(struct vm_area_struct *vma)
Re: [rfc] direct IO submission and completion scalability issues
On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
>
> Second experiment which we did was migrating the IO submission to the
> IO completion cpu. Instead of submitting the IO on the same cpu where the
> request arrived, in this experiment the IO submission gets migrated to the
> cpu that is processing IO completions(interrupt). This will minimize the
> access to remote cachelines (that happens in timers, slab, scsi layers).
> The IO submission request is forwarded to the kblockd thread on the cpu
> receiving the interrupts. As part of this, we also made kblockd thread on
> each cpu as the highest priority thread, so that IO gets submitted as soon
> as possible on the interrupt cpu with out any delay. On x86_64 SMP platform
> with 16 cores, this resulted in 2% performance improvement and 3.3%
> improvement on two node ia64 platform.
>
> Quick and dirty prototype patch(not meant for inclusion) for this io
> migration experiment is appended to this e-mail.
>
> Observation #1 mentioned above is also applicable to this experiment.
> CPU's processing interrupts will now have to cater IO submission/processing
> load aswell.
>
> Observation #2: This introduces some migration overhead during IO
> submission. With the current prototype, every incoming IO request results
> in an IPI and context switch(to kblockd thread) on the interrupt processing
> cpu. This issue needs to be addressed and main challenge to address is
> the efficient mechanism of doing this IO migration(how much batching to do
> and when to send the migrate request?), so that we don't delay the IO much
> and at the same point, don't cause much overhead during migration.

Hi guys,

Just had another way we might do this. Migrate the completions out to the
submitting CPUs rather than migrate submission into the completing CPU.

I've got a basic patch that passes some stress testing. It seems fairly
simple to do at the block layer, and the bulk of the patch involves
introducing a scalable smp_call_function for it.

Now it could be optimised more by looking at batching up IPIs or optimising
the call function path or even migrating the completion event at a different
level...

However, this is a first cut. It actually seems like it might be taking
slightly more CPU to process block IO (~0.2%)... however, this is on my dual
core system that shares an llc, which means that there are very few cache
benefits to the migration, but non-zero overhead. So on multisocket systems
hopefully it might get to positive territory.

---
Index: linux-2.6/arch/x86/kernel/smp_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp_64.c
+++ linux-2.6/arch/x86/kernel/smp_64.c
@@ -321,6 +321,99 @@ void unlock_ipi_call_lock(void)
 	spin_unlock_irq(&call_lock);
 }
 
+struct call_single_data {
+	struct list_head list;
+	void (*func) (void *info);
+	void *info;
+	int wait;
+};
+
+struct call_single_queue {
+	spinlock_t lock;
+	struct list_head list;
+};
+static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);
+
+int __cpuinit init_smp_call(void)
+{
+	int i;
+
+	for_each_cpu_mask(i, cpu_possible_map) {
+		spin_lock_init(&per_cpu(call_single_queue, i).lock);
+		INIT_LIST_HEAD(&per_cpu(call_single_queue, i).list);
+	}
+	return 0;
+}
+core_initcall(init_smp_call);
+
+/*
+ * this function sends a 'generic call function' IPI to all other CPU
+ * of the system defined in the mask.
+ */
+int smp_call_function_fast(int cpu, void (*func)(void *), void *info,
+				int wait)
+{
+	struct call_single_data *data;
+	struct call_single_queue *dst = &per_cpu(call_single_queue, cpu);
+	cpumask_t mask = cpumask_of_cpu(cpu);
+	int ipi;
+
+	data = kmalloc(sizeof(struct call_single_data), GFP_ATOMIC);
+	data->func = func;
+	data->info = info;
+	data->wait = wait;
+
+	spin_lock_irq(&dst->lock);
+	ipi = list_empty(&dst->list);
+	list_add_tail(&data->list, &dst->list);
+	spin_unlock_irq(&dst->lock);
+
+	if (ipi)
+		send_IPI_mask(mask, CALL_FUNCTION_SINGLE_VECTOR);
+
+	if (wait) {
+		/* Wait for response */
+		while (data->wait)
+			cpu_relax();
+		kfree(data);
+	}
+
+	return 0;
+}
+
+asmlinkage void smp_call_function_fast_interrupt(void)
+{
+	struct call_single_queue *q;
+	unsigned long flags;
+	LIST_HEAD(list);
+
+	ack_APIC_irq();
+
+	q = &__get_cpu_var(call_single_queue);
+	spin_lock_irqsave(&q->lock, flags);
+	list_replace_init(&q->list, &list);
+	spin_unlock_irqrestore(&q->lock, flags);
+
+	exit_idle();
+	irq_enter();
+	while (!list_empty(&list)) {
+		struct call_single_data *data;
+
+		data = list_entry(list.next, struct call_single_data, list);
Re: [rfc] direct IO submission and completion scalability issues
On Sun, Feb 03, 2008 at 12:53:02PM +0200, Pekka Enberg wrote:
> Hi Nick,
>
> On Feb 3, 2008 11:52 AM, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > +asmlinkage void smp_call_function_fast_interrupt(void)
> > +{
>
> [snip]
>
> > +	while (!list_empty(&list)) {
> > +		struct call_single_data *data;
> > +
> > +		data = list_entry(list.next, struct call_single_data, list);
> > +		list_del(&data->list);
> > +
> > +		data->func(data->info);
> > +		if (data->wait) {
> > +			smp_mb();
> > +			data->wait = 0;
>
> Why do we need smp_mb() here (maybe add a comment to keep
> Andrew/checkpatch happy)?

Yeah, definitely... it's just a really basic RFC, but I should get into the
habit of just doing it anyway.

Thanks,
Nick
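For what it's worth, here is one way such a comment might read. This is an
illustrative guess at the barrier's intent, inferred from the prototype patch
above (the submitter spins on data->wait and then frees data), not a comment
taken from a later posting:

		data->func(data->info);
		if (data->wait) {
			/*
			 * Make sure the effects of func() and our reads of
			 * *data are complete before ->wait is cleared: as
			 * soon as the submitting CPU sees ->wait == 0 it may
			 * kfree(data), so nothing may touch it afterwards.
			 */
			smp_mb();
			data->wait = 0;
		}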
Re: [rfc] direct IO submission and completion scalability issues
On Mon, Feb 04, 2008 at 03:40:20PM +1100, David Chinner wrote:
> On Sun, Feb 03, 2008 at 08:14:45PM -0800, Arjan van de Ven wrote:
> > David Chinner wrote:
> > > Hi Nick,
> > >
> > > When Matthew was describing this work at an LCA presentation (not
> > > sure whether you were at that presentation or not), Zach came up
> > > with the idea that allowing the submitting application control the
> > > CPU that the io completion processing was occurring would be a good
> > > approach to try. That is, we submit a "completion cookie" with the
> > > bio that indicates where we want completion to run, rather than
> > > dictating that completion runs on the submission CPU.
> > >
> > > The reasoning is that only the higher level context really knows
> > > what is optimal, and that changes from application to application.
> >
> > well.. kinda. One of the really hard parts of the submit/completion
> > stuff is that the slab/slob/slub/slib allocator ends up basically
> > "cycling" memory through the system; there's a sink of free memory on
> > all the submission cpus and a source of free memory on the completion
> > cpu. I don't think applications are capable of working out what is
> > best in this scenario..
>
> Applications as in "anything that calls submit_bio()". i.e, direct I/O,
> filesystems, etc. i.e. not userspace but in-kernel applications.
>
> In XFS, simultaneous io completion on multiple CPUs can contribute greatly
> to contention of global structures in XFS. By controlling where completions
> are delivered, we can greatly reduce this contention, especially on large,
> mulitpathed devices that deliver interrupts to multiple CPUs that may be
> far distant from each other. We have all the state and intelligence
> necessary to control this sort policy decision effectively.

Hi Dave,

Thanks for taking a look at the patch... yes it would be easy to turn this
bit of state into a more flexible cookie (eg. complete on submitter; complete
on interrupt; complete on CPUx/nodex etc.). Maybe we'll need something that
complex... I'm not sure, it would probably need more fine tuning. That said,
I just wanted to get this approach out there early for rfc.

I guess both you and Arjan have points. For a _lot_ of things, completing on
the same CPU as the submitter (whether that is migrating submission as in the
original patch in the thread, or migrating completion like I do) is a win.
You get better behaviour in the slab and page allocators and locality and
cache hotness of memory. For example, I guess in a filesystem / pagecache
heavy workload, you have to touch each struct page, buffer head, fs private
state, and also often have to wake the thread for completion. Much of this
data has just been touched at submit time, so doing this on the same CPU is
nice...

I'm surprised that the xfs global state bouncing would outweigh the bouncing
of all the per-page/block/bio/request/etc data that gets touched during
completion. We'll see.
Re: [rfc] direct IO submission and completion scalability issues
On Mon, Feb 04, 2008 at 11:12:44AM +0100, Jens Axboe wrote:
> On Sun, Feb 03 2008, Nick Piggin wrote:
> > On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
> >
> > Hi guys,
> >
> > Just had another way we might do this. Migrate the completions out to
> > the submitting CPUs rather than migrate submission into the completing
> > CPU.
> >
> > I've got a basic patch that passes some stress testing. It seems fairly
> > simple to do at the block layer, and the bulk of the patch involves
> > introducing a scalable smp_call_function for it.
> >
> > Now it could be optimised more by looking at batching up IPIs or
> > optimising the call function path or even migrating the completion event
> > at a different level...
> >
> > However, this is a first cut. It actually seems like it might be taking
> > slightly more CPU to process block IO (~0.2%)... however, this is on my
> > dual core system that shares an llc, which means that there are very few
> > cache benefits to the migration, but non-zero overhead. So on multisocket
> > systems hopefully it might get to positive territory.
>
> That's pretty funny, I did pretty much the exact same thing last week!

Oh nice ;)

> The primary difference between yours and mine is that I used a more
> private interface to signal a softirq raise on another CPU, instead of
> allocating call data and exposing a generic interface. That put the
> locking in blk-core instead, turning blk_cpu_done into a structure with
> a lock and list_head instead of just being a list head, and intercepted
> at blk_complete_request() time instead of waiting for an already raised
> softirq on that CPU.

Yeah I was looking at that... didn't really want to add the spinlock overhead
to the non-migration case. Anyway, I guess that sort of fine implementation
detail is going to have to be sorted out with results.
Re: Monthly md check == hung machine; how do I debug?
On Monday 04 February 2008 08:21, Robin Lee Powell wrote:
> I've got a machine with a 4 disk SATA raid10 configuration using md.
> The entire disk is loop-AES encrypted, but that shouldn't matter
> here.
>
> Once a month, Debian runs:
>
> /usr/share/mdadm/checkarray --cron --all --quiet
>
> and the machine hangs within 30 minutes of that starting.
>
> It seems that I can avoid the hang by not having "mdadm --monitor"
> running, but I'm not certain if that's the case or if I've just been
> lucky this go-round.
>
> I'm on kernel 2.6.23.1, my own compile thereof, x86_64, AMD
> Athlon(tm) 64 Processor 3700+.
>
> I've looked through all the 2.6.23 and 2.6.24 Changelogs, and I
> can't find anything that looks relevant.
>
> So, how can I (help you all) debug this?

Do you have a serial console? Does it respond to pings? Can you try to get
sysrq+T traces, and sysrq+P traces, and post them?
Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
On Tuesday 05 February 2008 01:49, Mike Galbraith wrote:
> On Tue, 2008-01-22 at 06:47 +0100, Mike Galbraith wrote:
> > On Tue, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:
> > > On Tuesday 22 January 2008 16:03, Mike Galbraith wrote:
> > > > I've hit same twice recently (not pan, and not repeatable).
> > >
> > > Nasty. The attached patch is something really simple that can
> > > sometimes help. sysrq+p is also an option, if you're on a UP system.
> >
> > SMP (P4/HT imitating real cores)
> >
> > > Any luck getting traces?
> >
> > We'll see. Armed.
>
> Hm. ld just went loopy (but killable) in v2.6.24-6928-g9135f19. During
> kbuild, modpost segfaulted, restart build, ld goes gaga. Third attempt,
> build finished. Not what I hit before, but mentionable.
>
> [  674.589134] modpost[18588]: segfault at 3e8dc42c ip 0804a96d sp af982920
> error 5 in modpost[8048000+9000]
> [  674.589211] mm/memory.c:115: bad pgd 3e081163.
> [  674.589214] mm/memory.c:115: bad pgd 3e0d2163.
> [  674.589217] mm/memory.c:115: bad pgd 3eb01163.

Hmm, this _could_ be bad memory. Or if it is very easy to reproduce with a
particular kernel version, then it is probably a memory scribble from another
part of the kernel :(

First thing I guess would be easy and helpful to run memtest86 for a while if
you have time. If that's clean, then I don't have another good option except
to bisect the problem. Turning on DEBUG_VM, DEBUG_SLAB, DEBUG_LIST,
DEBUG_PAGEALLOC, DEBUG_STACKOVERFLOW, DEBUG_RODATA might help catch it
sooner... SLAB and PAGEALLOC could slow you down quite a bit though. And if
the problem is quite reproduceable, then obviously don't touch your config ;)

Thanks,
Nick

> [ 1407.322144]  =======================
> [ 1407.322144] ld            R running      0 21963  21962
> [ 1407.322144]  db9d7f1c 00200086 c75f9020 b1814300 b0428300 b0428300
>                 b0428300 c75f9280
> [ 1407.322144]  b1814300 00000001 db9d7000 d08c2f90 dba4f300 00000002
> [ 1407.322144]  b1810120 dba4f334 00200046 db9d7000 c75f9020 db9d7f30
>                 b02f333f
> [ 1407.322144] Call Trace:
> [ 1407.322144]  [] preempt_schedule_irq+0x45/0x5b
> [ 1407.322144]  [] ? do_page_fault+0x0/0x470
> [ 1407.322144]  [] need_resched+0x1f/0x21
> [ 1407.322144]  [] ? do_page_fault+0x0/0x470
> [ 1407.322144]  [] ? do_page_fault+0x4c/0x470
> [ 1407.322144]  [] ? do_page_fault+0x0/0x470
> [ 1407.322144]  [] ? error_code+0x72/0x78
> [ 1407.322144]  [] ? init_transmeta+0xcf/0x22f   <== zzt P4
Re: [git pull] SLUB updates for 2.6.25
On Tuesday 05 February 2008 10:47, Christoph Lameter wrote:
> On Tue, 5 Feb 2008, Nick Piggin wrote:
> > > erk, sorry, I misremembered.  I was about to merge all the patches we
> > > weren't going to merge.  oops.
> >
> > While you're there, can you drop the patch(es?) I commented on
> > and didn't get an answer to. Like the ones that open code their
> > own locking primitives and do risky looking things with barriers
> > to boot...
>
> That patch will be moved to a special archive for
> microbenchmarks. It shows the same issues like the __unlock patch.

Ok. But the approach is just not so good. If you _really_ need something like
that and it is a win over the regular non-atomic unlock, then you just have
to implement it as a generic locking / atomic operation and allow all
architectures to implement the optimal (and correct) memory barriers.

Anyway

> > Also, WRT this one:
> > slub-use-non-atomic-bit-unlock.patch
> >
> > This is strange that it is unwanted. Avoiding atomic operations
> > is a pretty good idea. The fact that it appears to be slower on
> > some microbenchmark on some architecture IMO either means that
> > their __clear_bit_unlock or the CPU isn't implemented so well...
>
> Its slower on x86_64 and that is a pretty important arch. So
> I am to defer this until we have analyzed the situation some more. Could
> there be some effect of atomic ops on the speed with which a cacheline is
> released?

I'm sure it could have an effect. But why is the common case in SLUB for the
cacheline to be bouncing? What's the benchmark? What does SLAB do in that
benchmark, is it faster than SLUB there? What does the non-atomic bit unlock
do to Willy's database workload?
Re: [git pull] SLUB updates for 2.6.25
On Tuesday 05 February 2008 09:30, Andrew Morton wrote:
> On Mon, 4 Feb 2008 14:28:45 -0800
> Andrew Morton <[EMAIL PROTECTED]> wrote:
> > > root (1):
> > >       SLUB: Do not upset lockdep
> >
> > err, what?  I thought I was going to merge these:
> >
> > slub-move-count_partial.patch
> > slub-rename-numa-defrag_ratio-to-remote_node_defrag_ratio.patch
> > slub-consolidate-add_partial-and-add_partial_tail-to-one-function.patch
> > slub-use-non-atomic-bit-unlock.patch
> > slub-fix-coding-style-violations.patch
> > slub-noinline-some-functions-to-avoid-them-being-folded-into-alloc-free.patch
> > slub-move-kmem_cache_node-determination-into-add_full-and-add_partial.patch
> > slub-avoid-checking-for-a-valid-object-before-zeroing-on-the-fast-path.patch
> > slub-__slab_alloc-exit-path-consolidation.patch
> > slub-provide-unique-end-marker-for-each-slab.patch
> > slub-avoid-referencing-kmem_cache-structure-in-__slab_alloc.patch
> > slub-optional-fast-path-using-cmpxchg_local.patch
> > slub-do-our-own-locking-via-slab_lock-and-slab_unlock.patch
> > slub-restructure-slab-alloc.patch
> > slub-comment-kmem_cache_cpu-structure.patch
> > slub-fix-sysfs-refcounting.patch
> >
> > before you went and changed things under my feet.
>
> erk, sorry, I misremembered.  I was about to merge all the patches we
> weren't going to merge.  oops.

While you're there, can you drop the patch(es?) I commented on and didn't get
an answer to. Like the ones that open code their own locking primitives and
do risky looking things with barriers to boot...

Also, WRT this one:
slub-use-non-atomic-bit-unlock.patch

This is strange that it is unwanted. Avoiding atomic operations is a pretty
good idea. The fact that it appears to be slower on some microbenchmark on
some architecture IMO either means that their __clear_bit_unlock or the CPU
isn't implemented so well...
Re: [git pull] SLUB updates for 2.6.25
On Tuesday 05 February 2008 11:32, Christoph Lameter wrote:
> On Tue, 5 Feb 2008, Nick Piggin wrote:
> > Ok. But the approach is just not so good. If you _really_ need something
> > like that and it is a win over the regular non-atomic unlock, then you
> > just have to implement it as a generic locking / atomic operation and
> > allow all architectures to implement the optimal (and correct) memory
> > barriers.
>
> Assuming this really gives a benefit on several benchmarks then we need
> to think about how to do this some more. Its a rather strange form of
> locking.
>
> Basically you lock the page with a single atomic operation that sets
> PageLocked and retrieves the page flags.

This operation is not totally unusual. I could use it for my optimised page
lock patches for example (although I need an operation that clears a flag and
has release semantics, but similar class of "thing").

> Then we shovel the page state
> around a couple of functions in a register and finally store the page
> state back which at the same time unlocks the page.

And this is a store-for-unlock (eg. with release semantics). Nothing too
special about that either I guess. (it is almost the word equivalent of
clear_bit_unlock).

> So two memory
> references with one of them being atomic with none in between. We have
> nothing that can do something like that right now.

The load you are trying to avoid in the lock really isn't that expensive. The
cacheline is in L1. Even after a store, many CPUs have store forwarding so it
is probably not going to matter at all on those.

Anyway, not saying the operations are useless, but they should be made
available to core kernel and implemented per-arch. (if they are found to be
useful)
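For readers unfamiliar with the idea being debated, here is a minimal, purely
illustrative sketch of a "lock and fetch state" / "store to unlock" pair of
the kind described above. It is not the SLUB code from the patches in
question; the function names and the cmpxchg-based implementation are
assumptions made only for illustration, and the unlock side uses a
conservative full barrier where a real implementation would use the cheapest
per-arch release barrier.

#include <linux/mm_types.h>
#include <linux/page-flags.h>
#include <asm/system.h>

/* Acquire the lock bit and return the whole flags word in a register. */
static unsigned long page_lock_state(struct page *page)
{
	unsigned long old, new;

	for (;;) {
		old = page->flags & ~(1UL << PG_locked);	/* expected: unlocked */
		new = old | (1UL << PG_locked);
		if (cmpxchg(&page->flags, old, new) == old)
			return new;	/* caller shovels this state around in a register */
		cpu_relax();		/* someone else holds the lock; retry */
	}
}

/* Store the (possibly modified) state back; the same store unlocks. */
static void page_unlock_state(struct page *page, unsigned long state)
{
	smp_mb();	/* conservative release ordering for the plain store */
	page->flags = state & ~(1UL << PG_locked);
}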
Re: [git pull for -mm] CPU isolation extensions (updated2)
On Tuesday 12 February 2008 15:10, Max Krasnyansky wrote:
> Rusty - Stop machine.
> After doing a bunch of testing the last three days I actually downgraded
> stop machine changes from [highly experimental] to simply [experimental].
> Please see this thread for more info:
> http://marc.info/?l=linux-kernel&m=120243837206248&w=2
> Short story is that I ran several insmod/rmmod workloads on live
> multi-core boxes with stop machine _completely_ disabled and did not see
> any issues. Rusty did not get a chance to reply yet, I'm hoping that we'll
> be able to make "stop machine" completely optional for some configurations.

stop machine is used for more than just module loading and unloading.
I don't think you can just disable it.
Re: [PATCH] Avoid buffer overflows in get_user_pages()
On Tuesday 12 February 2008 14:16, Robert Hancock wrote:
> Nick Piggin wrote:
> > On Tuesday 12 February 2008 10:17, Jonathan Corbet wrote:
> > > Avoid buffer overflows in get_user_pages()
> > >
> > > So I spent a while pounding my head against my monitor trying to figure
> > > out the vmsplice() vulnerability - how could a failure to check for
> > > *read* access turn into a root exploit?  It turns out that it's a
> > > buffer overflow problem which is made easy by the way get_user_pages()
> > > is coded.
> > >
> > > In particular, "len" is a signed int, and it is only checked at the
> > > *end* of a do {} while() loop.  So, if it is passed in as zero, the
> > > loop will execute once and decrement len to -1.  At that point, the
> > > loop will proceed until the next invalid address is found; in the
> > > process, it will likely overflow the pages array passed in to
> > > get_user_pages().
> > >
> > > I think that, if get_user_pages() has been asked to grab zero pages,
> > > that's what it should do.  Thus this patch; it is, among other things,
> > > enough to block the (already fixed) root exploit and any others which
> > > might be lurking in similar code.  I also think that the number of
> > > pages should be unsigned, but changing the prototype of this function
> > > probably requires some more careful review.
> > >
> > > Signed-off-by: Jonathan Corbet <[EMAIL PROTECTED]>
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index e5628a5..7f50fd8 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -989,6 +989,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > >  	int i;
> > >  	unsigned int vm_flags;
> > >
> > > +	if (len <= 0)
> > > +		return 0;
> >
> > BUG_ON()?
>
> Well, not if the code involved in the exploit can pass a zero value,

Which is a bug, and you want to catch it.

> otherwise it's just turning it into a DoS..

If it is due to a security bug, then the fix is to fix the point where the
kernel starts trusting an untrusted value. Not to hide the bug like this.

Arguably, a BUG_ON is better in the case of a security hole because you want
to halt the process as soon as you detect a problem.
Re: [PATCH] Avoid buffer overflows in get_user_pages()
On Tuesday 12 February 2008 10:17, Jonathan Corbet wrote:
> Avoid buffer overflows in get_user_pages()
>
> So I spent a while pounding my head against my monitor trying to figure
> out the vmsplice() vulnerability - how could a failure to check for
> *read* access turn into a root exploit?  It turns out that it's a buffer
> overflow problem which is made easy by the way get_user_pages() is
> coded.
>
> In particular, "len" is a signed int, and it is only checked at the
> *end* of a do {} while() loop.  So, if it is passed in as zero, the loop
> will execute once and decrement len to -1.  At that point, the loop will
> proceed until the next invalid address is found; in the process, it will
> likely overflow the pages array passed in to get_user_pages().
>
> I think that, if get_user_pages() has been asked to grab zero pages,
> that's what it should do.  Thus this patch; it is, among other things,
> enough to block the (already fixed) root exploit and any others which
> might be lurking in similar code.  I also think that the number of pages
> should be unsigned, but changing the prototype of this function probably
> requires some more careful review.
>
> Signed-off-by: Jonathan Corbet <[EMAIL PROTECTED]>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e5628a5..7f50fd8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -989,6 +989,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  	int i;
>  	unsigned int vm_flags;
>
> +	if (len <= 0)
> +		return 0;

BUG_ON()?
Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70
On Wednesday 13 February 2008 09:27, Alexey Dobriyan wrote:
> On Tue, Feb 12, 2008 at 02:04:30PM -0800, Andrew Morton wrote:
> > On Sun, 10 Feb 2008 17:00:31 +0300
> > Alexey Dobriyan <[EMAIL PROTECTED]> wrote:
> > > This happened during LTP. FWIW, modprobe/rmmod trivial empty module
> > > together with cat /proc/*/wchan and cat /proc/modules were also
> > > running.
> > >
> > > Box is E6400, much debugging is on, config below.
> > >
> > > [ 4057.31] BUG: unable to handle kernel paging request at 810101dbc008
> > > [ 4057.31] IP: [] iov_iter_advance+0x38/0x70
> > > [ 4057.31] PGD 8063 PUD c063 PMD 153baa163 PTE 800101dbc160
> > > [ 4057.31] Oops: [1] SMP DEBUG_PAGEALLOC
> > > [ 4057.31] CPU 0
> > > [ 4057.31] Modules linked in: [last unloaded: foo]
> >
> > what is this foo.ko of which you speak, and did it wreck your kernel?
>
> It's a trivial dumb module which does nothing but loads and unloads.
> I redid ftest03 later without any suspicious activity and it oopsed the
> same way.

Ah crap. Hmm, maybe I didn't consider all cases with my last patch to that
code... is there an easy way to get the ftest03 source and run it?
Re: Kernel BUG at fs/mpage.c:489
On Wednesday 13 February 2008 08:50, Alan Cox wrote:
> > Feb 12 19:55:08 butterfly kernel: hde: dma timeout error: status=0xd0 { Busy }
> > Feb 12 19:55:08 butterfly kernel: ide: failed opcode was: unknown
>
> Your drive stopped responding.
>
> > Feb 12 19:55:08 butterfly kernel: hde: DMA disabled
> > Feb 12 19:55:08 butterfly kernel: PDC202XX: Primary channel reset.
> > Feb 12 19:55:08 butterfly kernel: PDC202XX: Secondary channel reset.
>
> We gave it a good kicking and it stayed offline
>
> > Feb 12 19:55:08 butterfly kernel: hde: set_drive_speed_status: status=0xd0 { Busy }
> > Feb 12 19:55:08 butterfly kernel: ide: failed opcode was: unknown
> > Feb 12 19:55:47 butterfly kernel: ide2: reset timed-out, status=0xd0
> > Feb 12 19:55:47 butterfly kernel: hde: status timeout: status=0xd0 { Busy }
>
> And we gave up.
>
> Almost certainly a hardware fail of some sort.

Right, but the kernel shouldn't go bug... I don't have a copy of your exact
source code... which condition in __mpage_writepage went BUG?
Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70
On Wednesday 13 February 2008 11:17, Nick Piggin wrote:
> On Wednesday 13 February 2008 09:27, Alexey Dobriyan wrote:
> > It's a trivial dumb module which does nothing but loads and unloads.
> > I redid ftest03 later without any suspicious activity and it oopsed the
> > same way.
>
> Ah crap. Hmm, maybe I didn't consider all cases with my last patch to
> that code... is there an easy way to get the ftest03 source and run
> it?

OK I didn't realise it is a test from ltp. But I can't reproduce it for the
life of me with the latest git kernel and latest ltp tarball.

Is it easy to reproduce? Are you reproducing it simply by running the ftest03
binary directly from the shell? How many times between oopses? It is
multi-process but no threads, so races should be minimal down this path --
can you get an strace of the failing process?

Thanks,
Nick
Re: [git pull for -mm] CPU isolation extensions (updated2)
On Wednesday 13 February 2008 14:32, Max Krasnyansky wrote:
> David Miller wrote:
> > From: Nick Piggin <[EMAIL PROTECTED]>
> > Date: Tue, 12 Feb 2008 17:41:21 +1100
> >
> > > stop machine is used for more than just module loading and unloading.
> > > I don't think you can just disable it.
> >
> > Right, in particular it is used for CPU hotplug.
>
> Ooops. Totally missed that. And a bunch of other places.
>
> [EMAIL PROTECTED] cpuisol-2.6.git]$ git grep -l stop_machine_run
> Documentation/cpu-hotplug.txt
> arch/s390/kernel/kprobes.c
> drivers/char/hw_random/intel-rng.c
> include/linux/stop_machine.h
> kernel/cpu.c
> kernel/module.c
> kernel/stop_machine.c
> mm/page_alloc.c
>
> I wonder why I did not see any issues when I disabled stop machine
> completely. I mentioned in the other thread that I commented out the part
> that actually halts the machine and ran it for several hours on my dual
> core laptop and on the quad core server. Tried all kinds of workloads,
> which include constant module removal and insertion, and cpu hotplug as
> well. It cannot be just luck :).

It really is. With subtle races, it can take a lot more than a few hours.
Consider that we have subtle races still in the kernel now, which are almost
never or rarely hit in maybe 10,000 hours * every single person who has been
using the current kernel for the past year.

For a less theoretical example -- when I was writing the RCU radix tree code,
I tried to run directed stress tests on a 64 CPU Altix machine (which found
no bugs). Then I ran it on a dedicated test harness that could actually do a
lot more than the existing kernel users are able to, and promptly found a
couple more bugs (on a 2 CPU system).

But your primary defence against concurrency bugs _has_ to be knowing the
code and all its interactions.

> Clearly though, you guys are right. It cannot be simply disabled. Based on
> the above grep it's needed for CPU hotplug, mem hotplug, kprobes on s390
> and intel rng driver. Hopefully we can avoid it at least in module
> insertion/removal.

Yes, reducing the number of users by going through their code and showing
that it is safe is the right way to do this. Also, you could avoid module
insertion/removal?

FWIW, I think the idea of trying to turn Linux into giving hard realtime
guarantees is just insane. If that is what you want, you would IMO be much
better off to spend effort with something like improving adeos and
communication/administration between Linux and the hard-rt kernel.

But don't let me dissuade you from making these good improvements to Linux as
well :) Just that it isn't really going to be hard-rt in general.
Re: [ALPHA] ES40 fails to boot with >=kernel 2.6.23
On Tuesday 12 February 2008 04:27, Raúl Porcel wrote:
> Hi,
>
> We have a Compaq AlphaServer ES40 and since 2.6.23 it won't boot. I'm
> attaching the console log and the kernel config.
>
> Need to say that with a DEC Xp1000 it works fine, although they're
> different machines, of course.
> With .22 it boots fine, and by booting fine i mean after we reverted to
> 2.6.22 it booted again and everything worked as expected.
> Still hangs with latest kernel.
>
> I'm attaching the verlinux output as well, hope it helps. If i'm missing
> something, please don't hesitate to ask.
>
> Thanks

Hi,

Thanks for reporting. I'm not an alpha person, but I have cc'ed them in case
they missed this.
Re: [PATCH 2/2 resend] mm: various cleanups in get_user_pages()
On Wednesday 13 February 2008 00:10, Eugene Teo wrote:
> Sorry for the repeated emails. Kindly ignore the previous resend. Please
> review this instead. Thanks.

I have tested this. If it is causing this much problems, can you split the
cleanups into their own patches?

> [PATCH 2/2] mm: various cleanups in get_user_pages()
>
> This patch contains various cleanups, including making sure vma is valid,
> and the return value of follow_hugetlb_page() is validated.
>
> Signed-off-by: Eugene Teo <[EMAIL PROTECTED]>
> ---
>  mm/memory.c | 24
>  1 file changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 54f951b..c7e0610 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1003,7 +1003,9 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		unsigned int foll_flags;
>
>  		vma = find_extend_vma(mm, start);
> -		if (!vma && in_gate_area(tsk, start)) {
> +		if (!vma)
> +			goto finish_or_fault;
> +		if (in_gate_area(tsk, start)) {
>  			unsigned long pg = start & PAGE_MASK;
>  			struct vm_area_struct *gate_vma = get_gate_vma(tsk);
>  			pgd_t *pgd;

Doesn't this break the logic? If you don't have a vma, but you are in the
gate area, then you should use the gate vma. With your patch, gate area will
fault.

> @@ -1011,7 +1013,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			pmd_t *pmd;
>  			pte_t *pte;
>  			if (write) /* user gate pages are read-only */
> -				return i ? : -EFAULT;
> +				goto finish_or_fault;

I don't know if this is exactly a cleanup or not... I guess gcc probably
isn't smart enough to fold them all together, so it should use a little less
code in the unlikely branches. Does it?
Re: [git pull for -mm] CPU isolation extensions (updated2)
On Wednesday 13 February 2008 17:06, Max Krasnyansky wrote:
> Nick Piggin wrote:
> > But don't let me dissuade you from making these good improvements
> > to Linux as well :) Just that it isn't really going to be hard-rt
> > in general.
>
> Actually that's the cool thing about CPU isolation. Get rid of all latency
> sources from the CPU(s) and you get yourself as hard-RT as it gets.

Hmm, maybe. Removing all sources of latency from the CPU kind of implies that
you have to audit the whole kernel for sources of latency.

> I mean I _already_ have multi-core hard-RT systems that show ~1.2 usec
> worst case and ~200nsec average latency. I do not even need Adeos/Xenomai
> or Preempt-RT, just a few very small patches. And it can be used for non-RT
> stuff too.

OK, but you then are very restricted in what you can do, and can easily break
it, especially if you run any userspace on that CPU. If you just run a kernel
module that, after setup, doesn't use any other kernel resources except
interrupt handling, then you might be OK (depending on whether even interrupt
handling can run into contended locks)...

If you started doing very much more, then you can easily run into trouble.
Re: Kernel BUG at fs/mpage.c:489
On Wednesday 13 February 2008 20:01, Andrew Morton wrote:
> On Wed, 13 Feb 2008 08:26:27 +0100 Bart Dopheide <[EMAIL PROTECTED]> wrote:
> > On Wed, Feb 13, 2008 at 12:05:45PM +1100, Nick Piggin wrote:
> > :) On Wednesday 13 February 2008 08:50, Alan Cox wrote:
> > :) > Almost certainly a hardware fail of some sort.
> > :)
> > :) Right, but the kernel shouldn't go bug...
> >
> > Indeed, that's why I'm reporting.
> >
> > :) I don't have a copy of your exact source code... which condition in
> > :) __mpage_writepage went BUG?
> >
> > BUG_ON(buffer_locked(bh));
> >
> > In a bit of context:
> > 482:	if (page_has_buffers(page)) {
> > 483:		struct buffer_head *head = page_buffers(page);
> > 484:		struct buffer_head *bh = head;
> > 485:
> > 486:		/* If they're all mapped and dirty, do it */
> > 487:		page_block = 0;
> > 488:		do {
> > 489:			BUG_ON(buffer_locked(bh));
> > 490:			if (!buffer_mapped(bh)) {
> > 491:				/*
> > 492:				 * unmapped dirty buffers are created by
> > 493:				 * __set_page_dirty_buffers -> mmapped data
> > 494:				 */
> > 495:				if (buffer_dirty(bh))
> > 496:					goto confused;
> > 497:				if (first_unmapped == blocks_per_page)
> > 498:					first_unmapped = page_block;
> > 499:				continue;
> > 500:			}
>
> Probably means that either fat, IDE, block or fs/buffer.c failed to unlock
> a buffer_head when the IO error happened. It's unlikely to be fat.

Yes that looks like it would be the problem. I can't really see anything in
buffer.c that would do it...

BTW is it really true that the buffer can never be locked by anything else at
this point? What about fsync_buffers_list?
Re: Kernel BUG at fs/mpage.c:489
On Wednesday 13 February 2008 20:32, Andrew Morton wrote:
> On Wed, 13 Feb 2008 20:24:03 +1100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > BTW is it really true that the buffer can never be locked by
> > anything else at this point?
>
> It has been for the past five or six years. With the page locked, nobody
> else can get at that page.

Hmm OK.

> > What about fsync_buffers_list?
>
> They're metadata buffers, not regular file data. Things might get ugly if
> IO to /dev/sda went via that path, but it doesn't.

Yeah right... so the BUG_ON is basically because you want to avoid the
overhead of locking the buffer (which would presumably allow it to work in
situations where someone else might lock the buffer without locking the
page?). OK, makes sense.
Re: [patch 3/6] mmu_notifier: invalidate_page callbacks
On Saturday 16 February 2008 14:37, Andrew Morton wrote:
> On Thu, 14 Feb 2008 22:49:02 -0800 Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > Two callbacks to remove individual pages as done in rmap code
> >
> > 	invalidate_page()
> >
> > Called from the inner loop of rmap walks to invalidate pages.
> >
> > 	age_page()
> >
> > Called for the determination of the page referenced status.
> >
> > If we do not care about page referenced status then an age_page callback
> > may be omitted. PageLock and pte lock are held when either of the
> > functions is called.
>
> The age_page mystery shallows.

BTW, can this callback be called mmu_notifier_clear_flush_young? To match the
core VM.
Re: LatencyTOP: sync_page description
On Saturday 16 February 2008 08:56, Török Edwin wrote:
> Hi Arjan,
>
> LatencyTOP says that sync_page is 'Writing a page to disk', however
> I see that even when no writes are involved, such as during a
> readdir, lseek, etc.
> Naming it a write is misleading, as no program is running that is
> doing constant writes to the disk. The only program is writing to a
> temp dir in /dev/shm.
>
> What would be a better description for sync_page?

Waiting on a page state change (usually: waiting for IO, but it can also be
waiting for the page lock, which is taken by some other part of the kernel,
eg in page reclaim, truncate, buffered writes, page faults).

> Here are some /proc/latency_stats containing sync_page:
>
> 125 6937678 210821 sync_page sync_page_killable sync_page_killable
> __lock_page_killable wake_bit_function generic_file_aio_read
> get_unused_fd_flags path_walk do_sync_read autoremove_wake_function
> security_file_permission rw_verify_area
> 306 5677749 215746 sync_page sync_page_killable sync_page_killable
> __lock_page_killable wake_bit_function generic_file_aio_read
> do_sync_read autoremove_wake_function security_file_permission
> rw_verify_area vfs_read vfs_llseek
> 21 435657 59966 sync_page sync_page __lock_page wake_bit_function
> read_cache_page_async ntfs_readpage read_cache_page map_mft_record
> ntfs_read_locked_inode ntfs_alloc_big_inode iget5_locked
> ntfs_test_inode
> 195 2716409 133660 blk_unplug sync_page sync_page __lock_page
> wake_bit_function read_cache_page_async ntfs_readpage
> read_cache_page map_mft_record ntfs_read_locked_inode
> ntfs_alloc_big_inode iget5_locked
> 28 1881278 181986 add_to_page_cache_lru sync_page sync_page_killable
> sync_page_killable __lock_page_killable wake_bit_function
> generic_file_aio_read get_unused_fd_flags path_walk do_sync_read
> autoremove_wake_function security_file_permission
> 2 17132 9746 add_to_page_cache_lru sync_page sync_page_killable
> sync_page_killable __lock_page_killable wake_bit_function
> generic_file_aio_read do_sync_read autoremove_wake_function
> security_file_permission rw_verify_area vfs_read
> 1 70 70 irq_exit sync_page sync_page_killable sync_page_killable
> __lock_page_killable wake_bit_function generic_file_aio_read
> do_sync_read autoremove_wake_function security_file_permission
> rw_verify_area vfs_read
> 23 306682 114514 blk_unplug sync_page sync_page_killable
> sync_page_killable __lock_page_killable wake_bit_function
> generic_file_aio_read do_sync_read autoremove_wake_function
> security_file_permission rw_verify_area vfs_read
> 1 153 153 hrtimer_interrupt smp_apic_timer_interrupt sync_page
> sync_page_killable sync_page_killable __lock_page_killable
> wake_bit_function generic_file_aio_read do_sync_read
> autoremove_wake_function cfq_idle_slice_timer security_file_permission
Re: IO queueing and complete affinity w/ threads: Some results
On Mon, Feb 18, 2008 at 02:33:17PM +0100, Andi Kleen wrote:
> Jens Axboe <[EMAIL PROTECTED]> writes:
>
> > and that scrapping the remote
> > softirq trigger stuff is sanest.
>
> I actually liked Nick's queued smp_function_call_single() patch. So even
> if it was not used for block I would still like to see it being merged
> in some form to speed up all the other IPI users.

Yeah, that hasn't been forgotten (nor have your comments about folding my
special function into smp_call_function_single). The call function path is
terribly unscalable at the moment on a lot of architectures, and also it
isn't allowed to be used with interrupts off due to deadlock (which the
queued version can allow, provided that wait=0).

I will get around to sending that upstream soon.
Re: [PATCH 1/3] Fix Unlikely(x) == y
On Tuesday 19 February 2008 01:39, Andi Kleen wrote:
> Arjan van de Ven <[EMAIL PROTECTED]> writes:
> > you have more faith in the authors knowledge of how his code actually
> > behaves than I think is warranted :)
>
> iirc there was a mm patch some time ago to keep track of the actual
> unlikely values at runtime and it showed indeed some wrong ones. But the
> far majority of them are probably correct.
>
> > Or faith in that he knows what "unlikely" means.
> >
> > I should write docs about this; but unlikely() means:
> >
> > 1) It happens less than 0.01% of the cases.
> > 2) The compiler couldn't have figured this out by itself
> >    (NULL pointer checks are compiler done already, same for some other
> >    conditions)
> > 3) It's a hot codepath where shaving 0.5 cycles (less even on x86)
> >    matters (and the author is ok with taking a 500 cycles hit if he's
> >    wrong)
>
> One more thing unlikely() does is to move the unlikely code out of line.
> So it should conserve some icache in critical functions, which might
> well be worth some more cycles (don't have numbers though).

I actually once measured context switching performance in the scheduler, and
removing the unlikely hint for testing RT tasks IIRC gave about 5%
performance drop.

This was on a P4, which is very different from more modern CPUs both in terms
of branch performance characteristics and icache characteristics. However,
the P4's branch predictor is pretty good, and it should easily be able to
correctly predict the rt_task check if it has enough entries. So I think much
of the savings came from code transformation and movement.

Anyway, it is definitely worthwhile if used correctly.

Actually one thing I don't like about gcc is that I think it still emits
cmovs for likely/unlikely branches, which is silly (the gcc developers seem
to be in love with that instruction). If that goes away, then branch hints
may be even better.

> But overall I agree with you that unlikely is in most cases a bad
> idea (and I submitted the original patch introducing it originally @). That
> is because it is often used in situations where gcc's default branch
> prediction heuristics would make exactly the same decision
>
> 	if (unlikely(x == NULL))
>
> is simply totally useless because gcc already assumes all x == NULL
> tests are unlikely. I appended some of the builtin heuristics from
> a recent gcc source so people can see them.
>
> Note in particular the last predictors; assuming branch ending
> with goto, including call, causing early function return or
> returning negative constant are not taken. Just these alone
> are likely 95+% of the unlikelies in the kernel.

Yes, gcc should be able to do pretty good heuristics, considering the quite
good numbers that cold CPU predictors can attain. However for really
performance critical code (or really "never" executed code), then I think it
is OK to have the hints and not have to rely on gcc heuristics.

> -Andi

[snip]

Interesting, thanks!
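For reference, the hint macros being discussed are thin wrappers around the
GCC builtin; their definitions in include/linux/compiler.h are essentially:

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

The !! normalises any truth value to 0 or 1 before it is handed to
__builtin_expect, so the hint works for pointer and integer expressions
alike.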
Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70
On Wednesday 13 February 2008 09:27, Alexey Dobriyan wrote: > On Tue, Feb 12, 2008 at 02:04:30PM -0800, Andrew Morton wrote: > > > [ 4057.31] Pid: 7035, comm: ftest03 Not tainted > > > 2.6.24-25f666300625d894ebe04bac2b4b3aadb907c861 #2 [ 4057.31] RIP: > > > 0010:[] [] > > > iov_iter_advance+0x38/0x70 [ 4057.31] RSP: 0018:810110329b20 > > > EFLAGS: 00010246 > > > [ 4057.31] RAX: RBX: 0800 RCX: > > > [ 4057.31] RDX: RSI: > > > 0800 RDI: 810110329ba8 [ 4057.31] RBP: > > > 0800 R08: R09: 810101dbc000 [ > > > 4057.31] R10: 0004 R11: R12: > > > 00026000 [ 4057.31] R13: 81010d765c98 R14: > > > 1000 R15: [ 4057.31] FS: > > > 7fee589146d0() GS:80501000() knlGS: > > > [ 4057.31] CS: 0010 DS: ES: CR0: 8005003b [ > > > 4057.31] CR2: 810101dbc008 CR3: 0001103da000 CR4: > > > 06e0 [ 4057.31] DR0: DR1: > > > DR2: [ 4057.31] DR3: > > > DR6: 0ff0 DR7: 0400 [ > > > 4057.31] Process ftest03 (pid: 7035, threadinfo 810110328000, > > > task 810160b0) [ 4057.31] Stack: 8025b413 > > > 81010d765ab0 804e6fd8 001201d2 [ 4057.31] > > > 810110329db8 00026000 810110329d38 81017b9fb500 [ > > > 4057.31] 81010d765c98 804175e0 81010d765ab0 > > > [ 4057.31] Call Trace: > > > [ 4057.31] [] ? > > > generic_file_buffered_write+0x1e3/0x6f0 [ 4057.31] > > > [] ? current_fs_time+0x1e/0x30 [ 4057.31] > > > [] ? __generic_file_aio_write_nolock+0x28f/0x440 [ > > > 4057.31] [] ? generic_file_aio_write+0x63/0xd0 [ > > > 4057.31] [] ? ext3_file_write+0x23/0xc0 [ > > > 4057.31] [] ? ext3_file_write+0x0/0xc0 [ > > > 4057.31] [] ? do_sync_readv_writev+0xcb/0x110 [ > > > 4057.31] [] ? autoremove_wake_function+0x0/0x30 > > > [ 4057.31] [] ? > > > debug_check_no_locks_freed+0x7d/0x130 [ 4057.31] > > > [] ? trace_hardirqs_on+0xcf/0x150 [ 4057.31] > > > [] ? __kmalloc+0x15/0xc0 > > > [ 4057.31] [] ? rw_copy_check_uvector+0x9d/0x130 > > > [ 4057.31] [] ? do_readv_writev+0xe0/0x170 > > > [ 4057.31] [] ? mutex_lock_nested+0x1a7/0x280 > > > [ 4057.31] [] ? trace_hardirqs_on+0xcf/0x150 > > > [ 4057.31] [] ? > > > __mutex_unlock_slowpath+0xc9/0x170 [ 4057.31] [] > > > ? trace_hardirqs_on+0xcf/0x150 [ 4057.31] [] ? > > > trace_hardirqs_on_thunk+0x35/0x3a [ 4057.31] [] > > > ? sys_writev+0x53/0x90 > > > [ 4057.31] [] ? > > > system_call_after_swapgs+0x7b/0x80 [ 4057.31] > > > [ 4057.31] > > > [ 4057.31] Code: 48 01 77 10 48 29 77 18 c3 0f 0b eb fe 66 66 90 66 > > > 66 90 4c 8b 0f 48 8b 4f 10 49 89 f0 eb 07 66 66 66 90 49 29 c0 4d 85 c0 > > > 75 07 <49> 83 79 08 00 75 23 49 8b 51 08 48 89 d0 48 29 c8 49 39 c0 49 > > > [ 4057.31] RIP [] iov_iter_advance+0x38/0x70 [ > > > 4057.31] RSP > > > [ 4057.31] CR2: 810101dbc008 > > > [ 4057.31] Kernel panic - not syncing: Fatal exception Can you try this patch please? Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -1753,9 +1753,10 @@ static void __iov_iter_advance_iov(struc /* * The !iov->iov_len check ensures we skip over unlikely - * zero-length segments. + * zero-length segments. But we mustn't try to "skip" if + * we have come to the end (i->count == bytes). */ - while (bytes || !iov->iov_len) { + while (bytes || (unlikely(!iov->iov_len) && i->count > bytes)) { int copy = min(bytes, iov->iov_len - base); bytes -= copy;
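To make the failure mode concrete: the old loop condition keeps "skipping" zero-length segments even when the advance has consumed the whole iterator, so with a trailing zero-length iovec it can read one struct iovec past the end of the array. A hedged userspace sketch of the shape of input involved (this is not the original ftest03 reproducer, and whether it actually oopses depends on what lies beyond the array):

#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	static char buf[4096];
	struct iovec iov[2] = {
		{ .iov_base = buf,  .iov_len = sizeof(buf) },
		{ .iov_base = NULL, .iov_len = 0 },	/* trailing zero-length segment */
	};

	/*
	 * Writing exactly sizeof(buf) bytes makes the kernel advance its
	 * iterator to the very end; with the old "bytes || !iov->iov_len"
	 * condition the helper then inspects iov[2], which does not exist.
	 * The added "i->count > bytes" check stops the walk at the end.
	 */
	return writev(STDOUT_FILENO, iov, 2) == sizeof(buf) ? 0 : 1;
}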
Re: [PATCH 1/3] Fix Unlikely(x) == y
On Tuesday 19 February 2008 13:40, Arjan van de Ven wrote: > On Tue, 19 Feb 2008 13:33:53 +1100 > > Nick Piggin <[EMAIL PROTECTED]> wrote: > > Actually one thing I don't like about gcc is that I think it still > > emits cmovs for likely/unlikely branches, which is silly (the gcc > > developers seem to be in love with that instruction). If that goes > > away, then branch hints may be even better. > > only for -Os and only if the result is smaller afaik. What is your evidence for saying this? Because here, with the latest kernel and recent gcc-4.3 snapshot, it spits out cmov like crazy even when compiled with -O2. [EMAIL PROTECTED]:~/usr/src/linux-2.6$ grep cmov kernel/sched.s | wc -l 45 And yes it even does for hinted branches and even at -O2/3 [EMAIL PROTECTED]:~/tests$ cat cmov.c int test(int a, int b) { if (__builtin_expect(a < b, 0)) return a; else return b; } [EMAIL PROTECTED]:~/tests$ gcc-4.3 -S -O2 cmov.c [EMAIL PROTECTED]:~/tests$ head -13 cmov.s .file "cmov.c" .text .p2align 4,,15 .globl test .type test, @function test: .LFB2: cmpl %edi, %esi cmovle %esi, %edi movl %edi, %eax ret .LFE2: .size test, .-test This definitely should be a branch, IMO. > (cmov tends to be a performance loss most of the time so for -O2 and such > it isn't used as far as I know.. it does make for nice small code however > ;-) It shouldn't be hard to work out the cutover point based on how expensive cmov is, how expensive branch and branch mispredicts are, and how often the branch is likely to be mispredicted. For an unpredictable branch, cmov is normally quite a good win even on modern CPUs. But gcc overuses it I think. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH, RFC] kthread: (possibly) a missing memory barrier in kthread_stop()
On Tuesday 19 February 2008 10:03, Dmitry Adamushko wrote: > Hi, > > > [ description ] > > Subject: kthread: add a memory barrier to kthread_stop() > > 'kthread' threads do a check in the following order: > - set_current_state(TASK_INTERRUPTIBLE); > - kthread_should_stop(); > > and set_current_state() implies an smp_mb(). > > on another side (kthread_stop), wake_up_process() does not seem to > guarantee a full mb. > > And 'kthread_stop_info.k' must be visible before wake_up_process() > checks for/modifies a state of the 'kthread' task. > > (the patch is at the end of the message) > > > [ more detailed description ] > > the current code might well be safe in case a to-be-stopped 'kthread' > task is _not_ running on another CPU at the moment when kthread_stop() > is called (in this case, 'rq->lock' will act as a kind of synch. > point/barrier). > > Another case is as follows: > > CPU#0: > > ... > while (kthread_should_stop()) { > >if (condition) > schedule(); > >/* ... do something useful ... */ <--- EIP > >set_current_state(TASK_INTERRUPTIBLE); > } > > so a 'kthread' task is about to call > set_current_state(TASK_INTERRUPTIBLE) ... > > > (in the mean time) > > CPU#1: > > kthread_stop() > > -> kthread_stop_info.k = k (*) > -> wake_up_process() > > wake_up_process() looks like: > > (try_to_wake_up) > > IRQ_OFF > LOCK > > old_state = p->state; > if (!(old_state & state)) (**) > goto out; > > ... > > UNLOCK > IRQ_ON > > > let's suppose (*) and (**) are reordered > (according to Documentation/memory-barriers.txt, neither IRQ_OFF nor > LOCK may prevent it from happening). > > - the state is TASK_RUNNING, so we are about to return. > > - CPU#1 is about to execute (*) (it's guaranteed to be done before > spin_unlock(&rq->lock) at the end of try_to_wake_up()) > > > (in the mean time) > > CPU#0: > > - set_current_state(TASK_INTERRUPTIBLE); > - kthread_should_stop(); > > here, kthread_stop_info.k is not yet visible > > - schedule() > > ... > > we missed a 'kthread_stop' event. > > hum? Looks like you are correct to me. > TIA, > > --- > > From: Dmitry Adamushko <[EMAIL PROTECTED]> > Subject: kthread: add a memory barrier to kthread_stop() > > 'kthread' threads do a check in the following order: > - set_current_state(TASK_INTERRUPTIBLE); > - kthread_should_stop(); > > and set_current_state() implies an smp_mb(). > > on another side (kthread_stop), wake_up_process() is not guaranteed to > act as a full mb. > > 'kthread_stop_info.k' must be visible before wake_up_process() checks > for/modifies a state of the 'kthread' task. > > > Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]> > > > diff --git a/kernel/kthread.c b/kernel/kthread.c > index 0ac8878..5167110 100644 > --- a/kernel/kthread.c > +++ b/kernel/kthread.c > @@ -211,6 +211,10 @@ int kthread_stop(struct task_struct *k) > > /* Now set kthread_should_stop() to true, and wake it up. */ > kthread_stop_info.k = k; > + > + /* The previous store operation must not get ahead of the wakeup. */ > + smp_mb(); > + > wake_up_process(k); > put_task_struct(k); > > > > -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
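For context, the kthread-side loop that the changelog's ordering argument assumes looks roughly like this (a sketch with a made-up thread function); set_current_state() supplies the barrier on this side, and the patch adds the matching smp_mb() on the kthread_stop() side:

#include <linux/kthread.h>
#include <linux/sched.h>

static int my_kthread(void *data)
{
	while (!kthread_should_stop()) {
		/* ... do the work ... */

		set_current_state(TASK_INTERRUPTIBLE);	/* implies smp_mb() */
		if (!kthread_should_stop())
			schedule();
		__set_current_state(TASK_RUNNING);
	}
	return 0;
}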
Re: [PATCH 1/3] Fix Unlikely(x) == y
On Tuesday 19 February 2008 16:58, Willy Tarreau wrote: > On Tue, Feb 19, 2008 at 01:33:53PM +1100, Nick Piggin wrote: > > > Note in particular the last predictors; assuming branch ending > > > with goto, including call, causing early function return or > > > returning negative constant are not taken. Just these alone > > > are likely 95+% of the unlikelies in the kernel. > > > > Yes, gcc should be able to do pretty good heuristics, considering > > the quite good numbers that cold CPU predictors can attain. However > > for really performance critical code (or really "never" executed > > code), then I think it is OK to have the hints and not have to rely > > on gcc heuristics. > > in my experience, the real problem is that gcc does what *it* wants and not > what *you* want. I've been annoyed a lot by the way it coded some loops > that could really be blazingly fast, but which resulted in a ton of > branches due to its predictors. And using unlikely() there was a real mess, > because instead of just hinting the compiler with probabilities to write > some linear code for the *most* common case, it ended up with awful > branches everywhere with code sent far away and even duplicated for some > branches. > > Sometimes, for performance critical paths, I would like gcc to be dumb and > follow *my* code and not its hard-coded probabilities. For instance, in a > tree traversal, you really know how you want to build your loop. And these > days, it seems like the single method of getting it your way is doing asm, > which obviously is not portable :-( Probably all true. > Maybe one thing we would need would be the ability to assign probabilities > to each branch based on what we expect, so that gcc could build a better > tree keeping most frequently used code tight. I don't know if that would *directly* lead to gcc being smarter. I think perhaps they probably don't benchmark on code bases that have much explicit annotation (I'm sure they wouldn't seriously benchmark any parts of Linux as part of daily development). I think the key is to continue to use annotations _properly_, and eventually gcc should go in the right direction if enough code uses it. And if you have really good examples like it sounds like above, then I guess that should be reported to gcc? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH] the proposal of improve page reclaim by throttle
On Tuesday 19 February 2008 16:44, KOSAKI Motohiro wrote: > background > > current VM implementation doesn't has limit of # of parallel reclaim. > when heavy workload, it bring to 2 bad things > - heavy lock contention > - unnecessary swap out > > abount 2 month ago, KAMEZA Hiroyuki proposed the patch of page > reclaim throttle and explain it improve reclaim time. > http://marc.info/?l=linux-mm&m=119667465917215&w=2 > > but unfortunately it works only memcgroup reclaim. > Today, I implement it again for support global reclaim and mesure it. > > > test machine, method and result > == > > CPU: IA64 x8 > MEM: 8GB > SWAP: 2GB > > > got hackbench from > http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c > > $ /usr/bin/time hackbench 120 process 1000 > > this parameter mean consume all physical memory and > 1GB swap space on my test environment. > > > > before: > hackbench result: 282.30 > /usr/bin/time result > user: 14.16 > sys:1248.47 > elapse: 432.93 > major fault:29026 > max parallel reclaim tasks: 1298 > max consumption time of >try_to_free_pages(): 70394 > > after: > hackbench result: 30.36 > /usr/bin/time result > user: 14.26 > sys:294.44 > elapse: 118.01 > major fault:3064 > max parallel reclaim tasks: 4 > max consumption time of >try_to_free_pages(): 12234 > > > conclusion > = > this patch improve 3 things. > 1. reduce unnecessary swap >(see above major fault. about 90% reduced) > 2. improve throughput performance >(see above hackbench result. about 90% reduced) > 3. improve interactive performance. >(see above max consumption of try_to_free_pages. > about 80% reduced) > 4. reduce lock contention. >(see above sys time. about 80% reduced) > > > Now, we got about 1000% performance improvement of hackbench :) > > > > foture works > == > - more discussion with memory controller guys. Hi, Yeah this is definitely needed and a nice result. I'm worried about a) placing a global limit on parallelism, and b) placing a limit on parallelism at all. I think it should maybe be a per-zone thing... What happens if you make it a per-zone mutex, and allow just a single process to reclaim pages from a given zone at a time? I guess that is going to slow down throughput a little bit in some cases though... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
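A rough sketch of the per-zone serialisation suggested above, to make the idea concrete. The reclaim_mutex field and the wrapper are hypothetical (this is not the posted throttle patch), and the code assumes it sits inside mm/vmscan.c where shrink_zone() and struct scan_control live:

/* hypothetical field added to struct zone in mmzone.h:
 *	struct mutex	reclaim_mutex;
 */

static unsigned long shrink_zone_serialised(int priority, struct zone *zone,
					    struct scan_control *sc)
{
	unsigned long nr_reclaimed;

	if (!mutex_trylock(&zone->reclaim_mutex)) {
		/*
		 * Someone else is already reclaiming this zone: wait for
		 * them to finish, then let the caller recheck watermarks
		 * instead of piling a second scan on top of the first.
		 */
		mutex_lock(&zone->reclaim_mutex);
		mutex_unlock(&zone->reclaim_mutex);
		return 0;
	}
	nr_reclaimed = shrink_zone(priority, zone, sc);	/* existing entry point */
	mutex_unlock(&zone->reclaim_mutex);
	return nr_reclaimed;
}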
[patch] my mmu notifier sample driver
Index: linux-2.6/drivers/char/mmu_notifier_skel.c === --- /dev/null +++ linux-2.6/drivers/char/mmu_notifier_skel.c @@ -0,0 +1,255 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static DEFINE_SPINLOCK(mmn_lock); +static RADIX_TREE(rmap_tree, GFP_ATOMIC); +static seqcount_t rmap_seq = SEQCNT_ZERO; + +static int __rmap_add(unsigned long mem, unsigned long vaddr) +{ + int err; + + err = radix_tree_insert(&rmap_tree, mem >> PAGE_SHIFT, (void *)vaddr); + + return err; +} + +static void __rmap_del(unsigned long mem) +{ + void *ret; + + ret = radix_tree_delete(&rmap_tree, mem >> PAGE_SHIFT); + BUG_ON(!ret); +} + +static unsigned long rmap_find(unsigned long mem) +{ + unsigned long vaddr; + + rcu_read_lock(); + vaddr = (unsigned long)radix_tree_lookup(&rmap_tree, mem >> PAGE_SHIFT); + rcu_read_unlock(); + + return vaddr; +} + +static struct page *follow_page_atomic(struct mm_struct *mm, unsigned long address, int write) +{ + struct vm_area_struct *vma; + + vma = find_vma(mm, address); +if (!vma || (vma->vm_start > address)) +return NULL; + + if (vma->vm_flags & (VM_IO | VM_PFNMAP)) + return NULL; + + return follow_page(vma, address, FOLL_GET|(write ? FOLL_WRITE : 0)); +} + +static int mmn_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct mm_struct *mm = vma->vm_mm; + unsigned long source_vaddr = (unsigned long)vmf->pgoff << PAGE_SHIFT; + unsigned long dest_vaddr = (unsigned long)vmf->virtual_address; + unsigned long pfn; + struct page *page; + pgprot_t prot; + int write = vmf->flags & FAULT_FLAG_WRITE; + int ret; + + printk("mmn_vm_fault [EMAIL PROTECTED] sourcing from %lx\n", write ? "write" : "read", dest_vaddr, source_vaddr); + + BUG_ON(mm != current->mm); /* disallow get_user_pages */ + +again: + spin_lock(&mmn_lock); + write_seqcount_begin(&rmap_seq); + page = follow_page_atomic(mm, source_vaddr, write); + if (unlikely(!page)) { + write_seqcount_end(&rmap_seq); + spin_unlock(&mmn_lock); + ret = get_user_pages(current, mm, source_vaddr, + 1, write, 0, &page, NULL); + if (ret != 1) + goto out_err; + put_page(page); + goto again; + } + + ret = __rmap_add(source_vaddr, dest_vaddr); + if (ret) + goto out_lock; + + pfn = page_to_pfn(page); + prot = vma->vm_page_prot; + if (!write) + vma->vm_page_prot = vm_get_page_prot(vma->vm_flags & ~(VM_WRITE|VM_MAYWRITE)); + ret = vm_insert_pfn(vma, dest_vaddr, pfn); + vma->vm_page_prot = prot; + if (ret) { + if (ret == -EBUSY) + WARN_ON(1); + goto out_rmap; + } + write_seqcount_end(&rmap_seq); + spin_unlock(&mmn_lock); + put_page(page); + +return VM_FAULT_NOPAGE; + +out_rmap: + __rmap_del(source_vaddr); +out_lock: + write_seqcount_end(&rmap_seq); + spin_unlock(&mmn_lock); + put_page(page); +out_err: + switch (ret) { + case -EFAULT: + case -EEXIST: + case -EBUSY: + return VM_FAULT_SIGBUS; + case -ENOMEM: + return VM_FAULT_OOM; + default: + BUG(); + } +} + +struct vm_operations_struct mmn_vm_ops = { +.fault = mmn_vm_fault, +}; + +static int mmu_notifier_busy; +static struct mmu_notifier mmu_notifier; + +static int mmn_clear_young(struct mmu_notifier *mn, unsigned long address) +{ + unsigned long vaddr; + unsigned seq; + struct mm_struct *mm = mn->mm; + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *ptep, pte; + + do { + seq = read_seqcount_begin(&rmap_seq); + vaddr = rmap_find(address); + } while (read_seqcount_retry(&rmap_seq, seq)); + + if (vaddr == 0) + return 0; + + printk("[EMAIL PROTECTED] sourced from %lx\n", vaddr, address); + + 
spin_lock(&mmn_lock); +pgd = pgd_offset(mm, vaddr); +pud = pud_offset(pgd, vaddr); + if (pud) { + pmd = pmd_offset(pud, vaddr); + if (pmd) { + ptep = pte_offset_map(pmd, vaddr); + if (ptep) { + pte = *ptep; + if (!pte_present(pte)) { + /* x86 specific, don't have a vma */ + ptep_get_and_clear(mm, vaddr, ptep); + __flush_tlb_one(vaddr); +
[patch] my mmu notifiers
Well I started reviewing the mmu notifier code, but it is kind of hard to know what you're talking about just by reading through code and not trying your suggestions for yourself... So I implemented mmu notifiers slightly differently. Andrea's mmu notifiers are rather similar. However I have tried to make a point of minimising the impact the the core mm/. I don't see why we need to invalidate or flush anything when changing the pte to be _more_ permissive, and I don't understand the need for invalidate_begin/invalidate_end pairs at all. What I have done is basically create it so that the notifiers get called basically in the same place as the normal TLB flushing is done, and nowhere else. I also wanted to avoid calling notifier code from inside eg. hardware TLB or pte manipulation primitives. These things are already pretty well spaghetti, so I'd like to just place them right where needed first... I think eventually it will need a bit of a rethink to make it more consistent and more general. But I prefer to do put them in the caller for the moment. I have also attempted to write a skeleton driver. Not like Christoph's drivers, but one that actually does something. This one can mmap a window into its own virtual address space. It's not perfect yet (I need to replace page_mkwrite with ->fault in the core mm before I can get enough information to do protection properly I think). However I think it may be race-free in the fault vs unmap paths. It's pretty complex, I must say. --- Index: linux-2.6/include/linux/mm_types.h === --- linux-2.6.orig/include/linux/mm_types.h +++ linux-2.6/include/linux/mm_types.h @@ -228,6 +228,9 @@ struct mm_struct { #ifdef CONFIG_CGROUP_MEM_CONT struct mem_cgroup *mem_cgroup; #endif +#ifdef CONFIG_MMU_NOTIFIER + struct hlist_head mmu_notifier_list; +#endif }; #endif /* _LINUX_MM_TYPES_H */ Index: linux-2.6/include/linux/mmu_notifier.h === --- /dev/null +++ linux-2.6/include/linux/mmu_notifier.h @@ -0,0 +1,69 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +#include +#include + +struct mmu_notifier; +struct mmu_notifier_operations; + +#ifdef CONFIG_MMU_NOTIFIER + +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_operations *ops; + struct mm_struct *mm; +}; + +struct mmu_notifier_operations { + void (*release)(struct mmu_notifier *mn); + int (*clear_young)(struct mmu_notifier *mn, unsigned long address); + void (*unmap)(struct mmu_notifier *mn, unsigned long address); + void (*invalidate_range)(struct mmu_notifier *mn, unsigned long start, unsigned long end); +}; + +static inline void mmu_notifier_init_mm(struct mm_struct *mm) +{ + INIT_HLIST_HEAD(&mm->mmu_notifier_list); +} + +static inline void mmu_notifier_init(struct mmu_notifier *mn, const struct mmu_notifier_operations *ops, struct mm_struct *mm) +{ + INIT_HLIST_NODE(&mn->hlist); + mn->ops = ops; + mn->mm = mm; +} + +extern void mmu_notifier_register(struct mmu_notifier *mn); +extern void mmu_notifier_unregister(struct mmu_notifier *mn); + +extern void mmu_notifier_exit_mm(struct mm_struct *mm); +extern int mmu_notifier_clear_young(struct mm_struct *mm, unsigned long address); +extern void mmu_notifier_unmap(struct mm_struct *mm, unsigned long address); +extern void mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end); + +#else /* CONFIG_MMU_NOTIFIER */ + +static inline void mmu_notifier_init_mm(struct mm_struct *mm) +{ +} + +static inline void mmu_notifier_exit_mm(struct mm_struct *mm) +{ +} + +static inline int 
mmu_notifier_clear_young(struct mm_struct *mm, unsigned long address) +{ + return 0; +} + +static inline void mmu_notifier_unmap(struct mm_struct *mm, unsigned long address) +{ +} + +static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end) +{ +} +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif Index: linux-2.6/kernel/fork.c === --- linux-2.6.orig/kernel/fork.c +++ linux-2.6/kernel/fork.c @@ -43,6 +43,7 @@ #include #include #include +#include #include #include #include @@ -358,6 +359,7 @@ static struct mm_struct * mm_init(struct mm->ioctx_list = NULL; mm->free_area_cache = TASK_UNMAPPED_BASE; mm->cached_hole_size = ~0UL; + mmu_notifier_init_mm(mm); mm_init_cgroup(mm, p); if (likely(!mm_alloc_pgd(mm))) { Index: linux-2.6/mm/filemap_xip.c === --- linux-2.6.orig/mm/filemap_xip.c +++ linux-2.6/mm/filemap_xip.c @@ -195,6 +195,7 @@ __xip_unmap (struct address_space * mapp /* Nuke the page table entry. */ flush_cache_page(vma, addre
Re: [patch 3/6] mmu_notifier: invalidate_page callbacks
On Sunday 17 February 2008 06:22, Christoph Lameter wrote: > On Fri, 15 Feb 2008, Andrew Morton wrote: > > > flush_cache_page(vma, address, pte_pfn(*pte)); > > > entry = ptep_clear_flush(vma, address, pte); > > > + mmu_notifier(invalidate_page, mm, address); > > > > I just don't see how ths can be done if the callee has another thread in > > the middle of establishing IO against this region of memory. > > ->invalidate_page() _has_ to be able to block. Confused. > > The page lock is held and that holds off I/O? I think the actual answer is that "it doesn't matter". ptes are not exactly the entity via which IO gets established, so all we really care about here is that after the callback finishes, we will not get any more reads or writes to the page via the external mapping. As far as holding off local IO goes, that is the job of the core VM. (And no, page lock does not necessarily hold it off FYI -- it can be writeback IO or even IO directly via buffers). Holding off IO via the external references I guess is a job for the notifier driver. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] Fix Unlikely(x) == y
On Tuesday 19 February 2008 20:25, Andi Kleen wrote: > On Tue, Feb 19, 2008 at 01:33:53PM +1100, Nick Piggin wrote: > > I actually once measured context switching performance in the scheduler, > > and removing the unlikely hint for testing RT tasks IIRC gave about 5% > > performance drop. > > OT: what benchmarks did you use for that? I had a change some time > ago to the CFS scheduler to avoid unpredicted indirect calls for > the common case, but I wasn't able to benchmark a difference with the usual > suspect benchmark (lmbench). Since it increased code size by > a few bytes it was rejected then. I think it was just a simple context switch benchmark, but not lmbench (which I found to be a bit too variable). But it was a long time ago... > > This was on a P4 which is very different from more modern CPUs both in > > terms of branch performance characteristics, > > > > and icache characteristics. > > Hmm, the P4 the trace cache actually should not care about inline > code that is not executed. Yeah, which is why it is a bit different than other CPUs. Although the L2 cache I guess is still going to suffer from sparse code, but I guess that is a bit less important. > > However, the P4's branch predictor is pretty good, and it should easily > > I think it depends on the generation. Prescott class branch > prediction should be much better than the earlier ones. I was using a Nocona Xeon, which I think is a Prescott class? And don't they have much higher mispredict penalty (than older P4s)? > > Actually one thing I don't like about gcc is that I think it still emits > > cmovs for likely/unlikely branches, > > That's -Os. And -O2 and -O3, on the gccs that I'm using, AFAIKS. > > which is silly (the gcc developers > > It depends on the CPU. e.g. on K8 and P6 using CMOV if possible > makes sense. P4 doesn't like it though. If the branch is completely predictable (eg. annotated), then I think branches should be used anyway. Even on well predicted branches, cmov is similar speed on microbenchmarks, but it will increase data hazards I think, so it will probably be worse for some real world situations. > > the quite good numbers that cold CPU predictors can attain. However > > for really performance critical code (or really "never" executed > > code), then I think it is OK to have the hints and not have to rely > > on gcc heuristics. > > But only when the explicit hints are different from what the implicit > branch predictors would predict anyways. And if you look at the > heuristics that is not often the case... But a likely branch will be _strongly_ predicted to be taken, wheras a lot of the gcc heuristics simply have slightly more or slightly less probability. So it's not just a question of which way is more likely, but also _how_ likely it is to go that way. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
On Friday 15 February 2008 17:49, Christoph Lameter wrote: > The invalidation of address ranges in a mm_struct needs to be > performed when pages are removed or permissions etc change. > > If invalidate_range_begin() is called with locks held then we > pass a flag into invalidate_range() to indicate that no sleeping is > possible. Locks are only held for truncate and huge pages. > > In two cases we use invalidate_range_begin/end to invalidate > single pages because the pair allows holding off new references > (idea by Robin Holt). > > do_wp_page(): We hold off new references while we update the pte. > > xip_unmap: We are not taking the PageLock so we cannot > use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end > stands in. This whole thing would be much better if you didn't rely on the page lock at all, but either a) used the same locking as Linux does for its ptes/tlbs, or b) have some locking that is private to the mmu notifier code. Then there is not all this new stuff that has to be understood in the core VM. Also, why do you have to "invalidate" ranges when switching to a _more_ permissive state? This stuff should basically be the same as (a subset of) the TLB flushing API AFAIKS. Anything more is a pretty big burden to put in the core VM. See my alternative patch I posted -- I can't see why it won't work just like a TLB. As far as sleeping inside callbacks goes... I think there are big problems with the patch (the sleeping patch and the external rmap patch). I don't think it is workable in its current state. Either we have to make some big changes to the core VM, or we have to turn some locks into sleeping locks to do it properly AFAIKS. Neither one is good. But anyway, I don't really think the two approaches (Andrea's notifiers vs sleeping/xrmap) should be tangled up too much. I think Andrea's can possibly be quite unintrusive and useful very soon. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
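What "treat it like (a subset of) the TLB flushing API" means in practice, sketched with the unmap hook from the alternative notifier patch posted earlier in this thread (placement modelled on try_to_unmap_one(); this is not Christoph's begin/end pair):

static void unmap_one_pte(struct vm_area_struct *vma, unsigned long address,
			  pte_t *pte)
{
	pte_t pteval;

	flush_cache_page(vma, address, pte_pfn(*pte));
	pteval = ptep_clear_flush(vma, address, pte);	/* CPU TLB invalidated here */
	mmu_notifier_unmap(vma->vm_mm, address);	/* secondary MMU invalidated at the same point */

	/* ... the rest of the normal unmap path (dirty/young propagation
	 * from pteval, rmap accounting, etc.) ... */
}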
Re: [PATCH 1/3] Fix Unlikely(x) == y
On Tuesday 19 February 2008 20:57, Andi Kleen wrote: > On Tue, Feb 19, 2008 at 08:46:46PM +1100, Nick Piggin wrote: > > I think it was just a simple context switch benchmark, but not lmbench > > (which I found to be a bit too variable). But it was a long time ago... > > Do you still have it? > > I thought about writing my own but ended up being too lazy for that @) Had a quick look but couldn't find it. It was just two threads running and switching to each other with a couple of mutexes or yield. If I find it, then I'll send it over. > > > > Actually one thing I don't like about gcc is that I think it still > > > > emits cmovs for likely/unlikely branches, > > > > > > That's -Os. > > > > And -O2 and -O3, on the gccs that I'm using, AFAIKS. > > Well if it still happens on gcc 4.2 with P4 tuning you should > perhaps open a gcc PR. They tend to ignore these bugs mostly in > my experience, but sometimes they act on them. I'm not sure about P4 tuning... But even IMO it should not on predictable branches too much for any (especially OOOE) CPU. > > > > which is silly (the gcc developers > > > > > > It depends on the CPU. e.g. on K8 and P6 using CMOV if possible > > > makes sense. P4 doesn't like it though. > > > > If the branch is completely predictable (eg. annotated), then I > > think branches should be used anyway. Even on well predicted > > branches, cmov is similar speed on microbenchmarks, but it will > > increase data hazards I think, so it will probably be worse for > > some real world situations. > > At least the respective optimization manuals say they should be used. > I presume they only made this recommendation after some extensive > benchmarking. What I have seen is that they tell you definitely not to use it for predictable branches. Eg. the Intel optimization manual says Use the setcc and cmov instructions to eliminate unpredictable conditional branches where possible. Do not do this for predictable branches. Do not use these instructions to eliminate all unpredictable conditional branches, because using these instructions will incur execution overhead due to executing both paths of a conditional branch. In addition, converting conditional branches to cmovs or setcc trades control-flow dependence for data dependence and restricts the capability of the out-of-order engine. > > But a likely branch will be _strongly_ predicted to be taken, > > wheras a lot of the gcc heuristics simply have slightly more or > > slightly less probability. So it's not just a question of which > > way is more likely, but also _how_ likely it is to go that way. > > Yes, but a lot of the heuristics are pretty strong (>80%) and gcc will > act on them unless it has a very strong contra cue. And that should > normally not be the case. True, but if you know a branch is 99%+, then use of likely/unlikely can still be a good idea. 80% may not be enough to choose a branch over a cmov for example. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: > On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: > > are rather similar. However I have tried to make a point of minimising the > > impact the the core mm/. I don't see why we need to invalidate or flush > > I also tried hard to minimise the impact of the core mm/, I also > argued with Christoph that cluttering mm/ wasn't a good idea for > things like age_page that could be a 1 liner change instead of a > multiple-liner change, without any loss of flexibility or readability. > > > anything when changing the pte to be _more_ permissive, and I don't > > Note that in my patch the invalidate_pages in mprotect can be > trivially switched to a mprotect_pages with proper params. This will > prevent page faults completely in the secondary MMU (there will only > be tlb misses after the tlb flush just like for the core linux pte), > and it'll allow all the secondary MMU pte blocks (512/1024 at time > with my PT lock design) to be updated to have proper permissions > matching the core linux pte. > > > understand the need for invalidate_begin/invalidate_end pairs at all. > > The need of the pairs is crystal clear to me: range_begin is needed > for GRU _but_only_if_ range_end is called after releasing the > reference that the VM holds on the page. _begin will flush the GRU tlb > and at the same time it will take a mutex that will block further GRU > tlb-miss-interrupts (no idea how they manange those nightmare locking, > I didn't even try to add more locking to KVM and I get away with the > fact KVM takes the pin on the page itself). > > My patch calls invalidate_page/pages before the reference is released > on the page, so GRU will work fine despite lack of > range_begin. Furthermore with my patch GRU will be auto-serialized by > the PT lock w/o the need of any additional locking. That's why I don't understand the need for the pairs: it should be done like this. > > What I have done is basically create it so that the notifiers get called > > basically in the same place as the normal TLB flushing is done, and nowhere > > else. > > That was one of my objectives too. > > > I also wanted to avoid calling notifier code from inside eg. hardware TLB > > or pte manipulation primitives. These things are already pretty well > > spaghetti, so I'd like to just place them right where needed first... I > > think eventually it will need a bit of a rethink to make it more consistent > > and more general. But I prefer to do put them in the caller for the moment. > > Your patch should also work for KVM but it's suboptimal, my patch can > be orders of magnitude more efficient for GRU thanks to the > invalidate_pages optimization. Christoph complained about having to > call one method per pte. OK, I didn't see the invalidate_pages call... > And adding invalidate_range is useless unless you fully support > xpmem. You're calling invalidate_range in places that can't sleep... I thought that could be used by a non-sleeping user (not intending to try supporting sleeping users). If it is useless then it should go away (BTW. I didn't see your recent patch, some of my confusion I think stems from Christoph's novel way of merging and splitting patches). > No idea why xpmem needs range_begin, I perfectly understand why GRU > needs _begin with Chrisotph's patch (gru lacks the page pin) but I > dunno why xpmem needs range_begin (xpmem has the page pin so I also > think it could avoid using range_begin). 
Still to support GRU you need > both to call invalidate_range in places that can sleep and you need > the external rmap notifier. The moment you add xpmem into the equation > your and my clean patches become Christoph's one... Sorry, I kind of didn't have time to follow the conversation so well before; are there patches posted for gru and/or xpmem? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 08:27:25AM -0600, Jack Steiner wrote: > > On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: > > > understand the need for invalidate_begin/invalidate_end pairs at all. > > > > The need of the pairs is crystal clear to me: range_begin is needed > > for GRU _but_only_if_ range_end is called after releasing the > > reference that the VM holds on the page. _begin will flush the GRU tlb > > and at the same time it will take a mutex that will block further GRU > > tlb-miss-interrupts (no idea how they manange those nightmare locking, > > I didn't even try to add more locking to KVM and I get away with the > > fact KVM takes the pin on the page itself). > > As it turns out, no actual mutex is required. _begin_ simply increments a > count of active range invalidates, _end_ decrements the count. New TLB > dropins are deferred while range callouts are active. > > This would appear to be racy but the GRU has special hardware that > simplifies locking. When the GRU sees a TLB invalidate, all outstanding > misses & potentially inflight TLB dropins are marked by the GRU with a > "kill" bit. When the dropin finally occurs, the dropin is ignored & the > instruction is simply restarted. The instruction will fault again & the TLB > dropin will be repeated. This is optimized for the case where invalidates > are rare - true for users of the GRU. OK (thanks to Robin as well). Now I understand why you are using it, but I don't understand why you don't defer new TLBs after the point where the linux pte changes. If you can do that, then you look and act much more like a TLB from the point of view of the Linux vm. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
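A sketch of the counting scheme Jack describes, with made-up names (this is not the real GRU driver code): _begin_ bumps a counter and flushes, _end_ drops it, and the dropin path backs off while the counter is non-zero so the faulting GRU instruction simply restarts and refaults later:

static atomic_t gru_range_invalidates = ATOMIC_INIT(0);

static void gru_invalidate_range_begin(struct mmu_notifier *mn,
				       unsigned long start, unsigned long end)
{
	atomic_inc(&gru_range_invalidates);
	gru_flush_tlb_range(mn, start, end);		/* hypothetical flush helper */
}

static void gru_invalidate_range_end(struct mmu_notifier *mn,
				     unsigned long start, unsigned long end)
{
	atomic_dec(&gru_range_invalidates);
}

static int gru_tlb_dropin(struct gru_fault *fault)	/* hypothetical fault descriptor */
{
	if (atomic_read(&gru_range_invalidates))
		return -EAGAIN;	/* defer: instruction restarts and refaults */

	/* ... install the translation in the GRU TLB ... */
	return 0;
}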
Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
On Friday 15 February 2008 17:49, Christoph Lameter wrote: > The invalidation of address ranges in a mm_struct needs to be > performed when pages are removed or permissions etc change. > > If invalidate_range_begin() is called with locks held then we > pass a flag into invalidate_range() to indicate that no sleeping is > possible. Locks are only held for truncate and huge pages. You can't sleep inside rcu_read_lock()! I must say that for a patch that is up to v8 or whatever and is posted twice a week to such a big cc list, it is kind of slack to not even test it and expect other people to review it. Also, what we are going to need here are not skeleton drivers that just do all the *easy* bits (of registering their callbacks), but actual fully working examples that do everything that any real driver will need to do. If not for the sanity of the driver writer, then for the sanity of the VM developers (I don't want to have to understand xpmem or infiniband in order to understand how the VM works). > In two cases we use invalidate_range_begin/end to invalidate > single pages because the pair allows holding off new references > (idea by Robin Holt). > > do_wp_page(): We hold off new references while we update the pte. > > xip_unmap: We are not taking the PageLock so we cannot > use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end > stands in. > > Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> > Signed-off-by: Robin Holt <[EMAIL PROTECTED]> > Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> > > --- > mm/filemap_xip.c |5 + > mm/fremap.c |3 +++ > mm/hugetlb.c |3 +++ > mm/memory.c | 35 +-- > mm/mmap.c|2 ++ > mm/mprotect.c|3 +++ > mm/mremap.c |7 ++- > 7 files changed, 51 insertions(+), 7 deletions(-) > > Index: linux-2.6/mm/fremap.c > === > --- linux-2.6.orig/mm/fremap.c2008-02-14 18:43:31.0 -0800 > +++ linux-2.6/mm/fremap.c 2008-02-14 18:45:07.0 -0800 > @@ -15,6 +15,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns > spin_unlock(&mapping->i_mmap_lock); > } > > + mmu_notifier(invalidate_range_begin, mm, start, start + size, 0); > err = populate_range(mm, vma, start, size, pgoff); > + mmu_notifier(invalidate_range_end, mm, start, start + size, 0); > if (!err && !(flags & MAP_NONBLOCK)) { > if (unlikely(has_write_lock)) { > downgrade_write(&mm->mmap_sem); > Index: linux-2.6/mm/memory.c > === > --- linux-2.6.orig/mm/memory.c2008-02-14 18:43:31.0 -0800 > +++ linux-2.6/mm/memory.c 2008-02-14 18:45:07.0 -0800 > @@ -51,6 +51,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -611,6 +612,9 @@ int copy_page_range(struct mm_struct *ds > if (is_vm_hugetlb_page(vma)) > return copy_hugetlb_page_range(dst_mm, src_mm, vma); > > + if (is_cow_mapping(vma->vm_flags)) > + mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0); > + > dst_pgd = pgd_offset(dst_mm, addr); > src_pgd = pgd_offset(src_mm, addr); > do { > @@ -621,6 +625,11 @@ int copy_page_range(struct mm_struct *ds > vma, addr, next)) > return -ENOMEM; > } while (dst_pgd++, src_pgd++, addr = next, addr != end); > + > + if (is_cow_mapping(vma->vm_flags)) > + mmu_notifier(invalidate_range_end, src_mm, > + vma->vm_start, end, 0); > + > return 0; > } > > @@ -893,13 +902,16 @@ unsigned long zap_page_range(struct vm_a > struct mmu_gather *tlb; > unsigned long end = address + size; > unsigned long nr_accounted = 0; > + int atomic = details ? 
(details->i_mmap_lock != 0) : 0; > > lru_add_drain(); > tlb = tlb_gather_mmu(mm, 0); > update_hiwater_rss(mm); > + mmu_notifier(invalidate_range_begin, mm, address, end, atomic); > end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); > if (tlb) > tlb_finish_mmu(tlb, address, end); > + mmu_notifier(invalidate_range_end, mm, address, end, atomic); > return end; > } > Where do you invalidate for munmap()? Also, how to you resolve the case where you are not allowed to sleep? I would have thought either you have to handle it, in which case nobody needs to sleep; or you can't handle it, in which case the code is broken. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: > On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: > > anything when changing the pte to be _more_ permissive, and I don't > > Note that in my patch the invalidate_pages in mprotect can be > trivially switched to a mprotect_pages with proper params. This will > prevent page faults completely in the secondary MMU (there will only > be tlb misses after the tlb flush just like for the core linux pte), > and it'll allow all the secondary MMU pte blocks (512/1024 at time > with my PT lock design) to be updated to have proper permissions > matching the core linux pte. Sorry, I realise I still didn't get this through my head yet (and also have not seen your patch recently). So I don't know exactly what you are doing... But why does _anybody_ (why does Christoph's patches) need to invalidate when they are going to be more permissive? This should be done lazily by the driver, I would have thought. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
On Friday 15 February 2008 17:49, Christoph Lameter wrote: > These special additional callbacks are required because XPmem (and likely > other mechanisms) do use their own rmap (multiple processes on a series > of remote Linux instances may be accessing the memory of a process). > F.e. XPmem may have to send out notifications to remote Linux instances > and receive confirmation before a page can be freed. > > So we handle this like an additional Linux reverse map that is walked after > the existing rmaps have been walked. We leave the walking to the driver > that is then able to use something else than a spinlock to walk its reverse > maps. So we can actually call the driver without holding spinlocks while we > hold the Pagelock. I don't know how this is supposed to solve anything. The sleeping problem happens I guess mostly in truncate. And all you are doing is putting these rmap callbacks in page_mkclean and try_to_unmap. > However, we cannot determine the mm_struct that a page belongs to at > that point. The mm_struct can only be determined from the rmaps by the > device driver. > > We add another pageflag (PageExternalRmap) that is set if a page has > been remotely mapped (f.e. by a process from another Linux instance). > We can then only perform the callbacks for pages that are actually in > remote use. > > Rmap notifiers need an extra page bit and are only available > on 64 bit platforms. This functionality is not available on 32 bit! > > A notifier that uses the reverse maps callbacks does not need to provide > the invalidate_page() method that is called when locks are held. That doesn't seem right. To start with, the new callbacks aren't even called in the places where invalidate_page isn't allowed to sleep. The problem is unmap_mapping_range, right? And unmap_mapping_range must walk the rmaps with the mmap lock held, which is why it can't sleep. And it can't hold any mmap_sem so it cannot prevent address space modifications of the processes in question between the time you unmap them from the linux ptes with unmap_mapping_range, and the time that you unmap them from your driver. So in the meantime, you could have eg. a fault come in and set up a new page for one of the processes, and that page might even get exported via the same external driver. And now you have a totally inconsistent view. Preventing new mappings from being set up until the old mapping is completely flushed is basically what we need to ensure for any sane TLB as far as I can tell. To do that, you'll need to make the mmap lock sleep, and either take mmap_sem inside it (which is a deadlock condition at the moment), or make ptl sleep as well. These are simply the locks we use to prevent that from happening, so I can't see how you can possibly hope to have a coherent TLB without invalidating inside those locks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
On Wednesday 20 February 2008 14:00, Robin Holt wrote: > On Wed, Feb 20, 2008 at 02:00:38AM +0100, Andrea Arcangeli wrote: > > On Wed, Feb 20, 2008 at 10:08:49AM +1100, Nick Piggin wrote: > > > Also, how to you resolve the case where you are not allowed to sleep? > > > I would have thought either you have to handle it, in which case nobody > > > needs to sleep; or you can't handle it, in which case the code is > > > broken. > > > > I also asked exactly this, glad you reasked this too. > > Currently, we BUG_ON having a PFN in our tables and not being able > to sleep. These are mappings which MPT has never supported in the past > and XPMEM was already not allowing page faults for VMAs which are not > anonymous so it should never happen. If the file-backed operations can > ever get changed to allow for sleeping and a customer has a need for it, > we would need to change XPMEM to allow those types of faults to succeed. Do you really want to be able to swap, or are you just interested in keeping track of unmaps / prot changes? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
On Wednesday 20 February 2008 14:12, Robin Holt wrote: > For XPMEM, we do not currently allow file backed > mapping pages from being exported so we should never reach this condition. > It has been an issue since day 1. We have operated with that assumption > for 6 years and have not had issues with that assumption. The user of > xpmem is MPT and it controls the communication buffers so it is reasonable > to expect this type of behavior. OK, that makes things simpler. So why can't you export a device from your xpmem driver, which can be mmap()ed to give out "anonymous" memory pages to be used for these communication buffers? I guess you may also want an "munmap/mprotect" callback, which we don't have in the kernel right now... but at least you could prototype it easily by having an ioctl to be called before munmapping or mprotecting (eg. the ioctl could prevent new TLB setup for the region, and shoot down existing ones). This is actually going to be much faster for you if you use any threaded applications, because you will be able to do all the shootdown round trips outside mmap_sem, and so you will be able to have other threads faulting and even mmap()ing / munmaping at the same time as the shootdown is happening. I guess there is some catch... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
On Wednesday 20 February 2008 20:00, Robin Holt wrote: > On Wed, Feb 20, 2008 at 02:51:45PM +1100, Nick Piggin wrote: > > On Wednesday 20 February 2008 14:12, Robin Holt wrote: > > > For XPMEM, we do not currently allow file backed > > > mapping pages from being exported so we should never reach this > > > condition. It has been an issue since day 1. We have operated with > > > that assumption for 6 years and have not had issues with that > > > assumption. The user of xpmem is MPT and it controls the communication > > > buffers so it is reasonable to expect this type of behavior. > > > > OK, that makes things simpler. > > > > So why can't you export a device from your xpmem driver, which > > can be mmap()ed to give out "anonymous" memory pages to be used > > for these communication buffers? > > Because we need to have heap and stack available as well. MPT does > not control all the communication buffer areas. I haven't checked, but > this is the same problem that IB will have. I believe they are actually > allowing any memory region be accessible, but I am not sure of that. Then you should create a driver that the user program can register and unregister regions of their memory with. The driver can do a get_user_pages to get the pages, and then you'd just need to set up some kind of mapping so that userspace can unmap pages / won't leak memory (and an exit_mm notifier I guess). Because you don't need to swap, you don't need coherency, and you are in control of the areas, then this seems like the best choice. It would allow you to use heap, stack, file-backed, anything. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
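A minimal sketch of that registration driver, using the 2.6.24-era get_user_pages() signature already seen elsewhere in this thread. The request structure, helper name and the write/force choices are assumptions for illustration, not an existing driver:

struct reg_request {			/* hypothetical ABI passed in via ioctl */
	unsigned long	vaddr;
	unsigned long	nr_pages;
};

static long register_comm_buffer(struct mm_struct *mm, struct reg_request *req,
				 struct page **pages)
{
	long pinned;

	down_read(&mm->mmap_sem);
	pinned = get_user_pages(current, mm, req->vaddr, req->nr_pages,
				1 /* write */, 0 /* force */, pages, NULL);
	up_read(&mm->mmap_sem);

	/*
	 * The pages stay pinned until the matching unregister ioctl (or an
	 * exit_mm / file-release hook) does put_page() on each of them, so
	 * no swapping support and no coherency callbacks are needed for
	 * these regions.
	 */
	return pinned;
}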
Re: [patch] my mmu notifiers
On Tue, Feb 19, 2008 at 05:40:50PM -0600, Jack Steiner wrote: > On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote: > > On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote: > > > On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote: > > > > anything when changing the pte to be _more_ permissive, and I don't > > > > > > Note that in my patch the invalidate_pages in mprotect can be > > > trivially switched to a mprotect_pages with proper params. This will > > > prevent page faults completely in the secondary MMU (there will only > > > be tlb misses after the tlb flush just like for the core linux pte), > > > and it'll allow all the secondary MMU pte blocks (512/1024 at time > > > with my PT lock design) to be updated to have proper permissions > > > matching the core linux pte. > > > > Sorry, I realise I still didn't get this through my head yet (and also > > have not seen your patch recently). So I don't know exactly what you > > are doing... > > > > But why does _anybody_ (why does Christoph's patches) need to invalidate > > when they are going to be more permissive? This should be done lazily by > > the driver, I would have thought. > > > Agree. Although for most real applications, the performance difference > is probably negligible. But importantly, doing it that way means you share test coverage with the CPU TLB flushing code, and you don't introduce a new concept to the VM. So, it _has_ to be lazy flushing, IMO (as there doesn't seem to be a good reason otherwise). mprotect shouldn't really be a special case, because it still has to flush the CPU tlbs as well when restricting access. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] my mmu notifiers
On Wed, Feb 20, 2008 at 02:09:41AM +0100, Andrea Arcangeli wrote: > On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote: > > Sorry, I realise I still didn't get this through my head yet (and also > > have not seen your patch recently). So I don't know exactly what you > > are doing... > > The last version was posted here: > > http://marc.info/?l=kvm-devel&m=120321732521533&w=2 > > > But why does _anybody_ (why does Christoph's patches) need to invalidate > > when they are going to be more permissive? This should be done lazily by > > the driver, I would have thought. > > This can be done lazily by the driver yes. The place where I've an > invalidate_pages in mprotect however can also become less permissive. That's OK, because we have to flush tlbs there too. > It's simpler to invalidate always and it's not guaranteed the > secondary mmu page fault is capable of refreshing the spte across a > writeprotect fault. I think we just have to make sure that it _can_ do writeprotect faults. AFAIKS, that will be possible if the driver registers a .page_mkwrite handler (actually not quite -- page_mkwrite is fairly crap, so I have a patch to merge it together with .fault so we get address information as well). Anyway, I really think we should do it that way. > In the future this can be changed to > mprotect_pages though, so no page fault will happen in the secondary > mmu. Possibly, but hopefully not needed for performance. Let's wait and see. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mmu notifiers #v6
On Wed, Feb 20, 2008 at 11:39:42AM +0100, Andrea Arcangeli wrote: > Given Nick's comments I ported my version of the mmu notifiers to > latest mainline. There are no known bugs AFIK and it's obviously safe > (nothing is allowed to schedule inside rcu_read_lock taken by > mmu_notifier() with my patch). Thanks! Yes the seqlock you are using now ends up looking similar to what I did and I couldn't find a hole in that either. So I think this is going to work. I do prefer some parts of my patch, however for everyone's sanity, I think you should be the maintainer of the mmu notifiers, and I will send you incremental changes that can be discussed more easily that way (nothing major, mainly style and minor things). > XPMEM simply can't use RCU for the registration locking if it wants to > schedule inside the mmu notifier calls. So I guess it's better to add > the XPMEM invalidate_range_end/begin/external-rmap as a whole > different subsystem that will have to use a mutex (not RCU) to > serialize, and at the same time that CONFIG_XPMEM will also have to > switch the i_mmap_lock to a mutex. I doubt xpmem fits inside a > CONFIG_MMU_NOTIFIER anymore, or we'll all run a bit slower because of > it. It's really a call of how much we want to optimize the MMU > notifier, by keeping things like RCU for the registration. I agree: your coherent, non-sleeping mmu notifiers are pretty simple and unintrusive. The sleeping version is fundamentally going to either need to change VM locks, or be non-coherent, so I don't think there is a question of making one solution fit everybody. So the sleeping / xrmap patch should be kept either completely independent, or as an add-on to this one. I will post some suggestions to you when I get a chance. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
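To illustrate why the callbacks cannot schedule in this scheme, here is roughly what a non-sleeping dispatch looks like, sketched with the hook and list names from the patch posted earlier in this thread (Andrea's #v6 differs in detail, e.g. the seqlock around registration), using the 2.6.24-era hlist_for_each_entry_rcu() signature:

static int mmu_notifier_clear_young_sketch(struct mm_struct *mm,
					   unsigned long address)
{
	struct mmu_notifier *mn;
	struct hlist_node *node;
	int young = 0;

	rcu_read_lock();	/* registration list is RCU protected: no sleeping below */
	hlist_for_each_entry_rcu(mn, node, &mm->mmu_notifier_list, hlist) {
		if (mn->ops->clear_young)
			young |= mn->ops->clear_young(mn, address);
	}
	rcu_read_unlock();

	return young;
}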
Re: [PATCH] mmu notifiers #v6
On Wed, Feb 20, 2008 at 01:03:24PM +0100, Andrea Arcangeli wrote: > If there's agreement that the VM should alter its locking from > spinlock to mutex for its own good, then Christoph's > one-config-option-fits-all becomes a lot more appealing (replacing RCU > with a mutex in the mmu notifier list registration locking isn't my > main worry and the non-sleeping-users may be ok to live with it). Just from a high level view, in some cases we can just say that no we aren't going to support this. And this may well be one of those cases. The more constraints placed on the VM, the harder it becomes to improve and adapt in future. And this seems like a pretty big restriction. (especially if we can eg. work around it completely by having a special purpose driver to get_user_pages on comm buffers as I suggested in the other mail). At any rate, I believe Andrea's patch really places minimal or no further constraints than a regular CPU TLB (or the hash tables that some archs implement). So we're kind of in 2 different leagues here. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Document huge memory/cache overhead of memory controller in Kconfig
On Wednesday 20 February 2008 23:52, Balbir Singh wrote: > Andi Kleen wrote: > > Document huge memory/cache overhead of memory controller in Kconfig > > > > I was a little surprised that 2.6.25-rc* increased struct page for the > > memory controller. At least on many x86-64 machines it will not fit into > > a single cache line now anymore and also costs considerable amounts of > > RAM. > > The size of struct page earlier was 56 bytes on x86_64 and with 64 bytes it > won't fit into the cacheline anymore? Please also look at > http://lwn.net/Articles/234974/ BTW. We'll probably want to increase the width of some counters in struct page at some point for 64-bit, so then it really will go over with the memory controller! Actually, an external data structure is a pretty good idea. We could probably do it easily with a radix tree (pfn->memory controller). And that might be a better option for distros. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
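For what it's worth, the external-lookup idea could be as simple as a pfn-indexed radix tree. The sketch below only illustrates the lookup; struct page_container, the global spinlock and the charge/lookup helpers are all hypothetical, not the actual memory controller code.

/*
 * Keep the controller's per-page state out of struct page and find it
 * by pfn instead (minimal sketch, not tuned for scalability).
 */
#include <linux/radix-tree.h>
#include <linux/spinlock.h>
#include <linux/slab.h>
#include <linux/mm.h>

struct page_container {
	void *container;		/* owning memory controller */
};

static RADIX_TREE(pfn_to_container, GFP_ATOMIC);
static DEFINE_SPINLOCK(pfn_container_lock);

static int container_charge_page(struct page *page, void *container)
{
	struct page_container *pc;
	int err;

	pc = kmalloc(sizeof(*pc), GFP_KERNEL);
	if (!pc)
		return -ENOMEM;
	pc->container = container;

	spin_lock(&pfn_container_lock);
	err = radix_tree_insert(&pfn_to_container, page_to_pfn(page), pc);
	spin_unlock(&pfn_container_lock);
	if (err)
		kfree(pc);
	return err;
}

static void *page_get_container(struct page *page)
{
	struct page_container *pc;

	spin_lock(&pfn_container_lock);
	pc = radix_tree_lookup(&pfn_to_container, page_to_pfn(page));
	spin_unlock(&pfn_container_lock);
	return pc ? pc->container : NULL;
}

The obvious cost is an extra lookup on every charge and uncharge, which is the trade-off a distro would have to weigh against the extra pointer in struct page.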
Re: [git pull] more SLUB updates for 2.6.25
On Friday 08 February 2008 13:13, Christoph Lameter wrote: > are available in the git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm.git slub-linus > > (includes the cmpxchg_local fastpath since the cmpxchg_local work > by Matheiu is in now, and the non atomic unlock by Nick. Verified that > this is not doing any harm after some other patches had been removed. Ah, good. I think it is always a good thing to be able to remove atomics. They place quite a bit of burden on the CPU, especially x86 where it also has implicit memory ordering semantics (although x86 can speculatively get around much of the problem, it's obviously worse than no restriction) Even if perhaps some cache coherency or timing quirk makes the non-atomic version slower (all else being equal), then I'd still say that the non atomic version should be preferred. Thanks, Nick -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
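For readers not following the SLUB internals, the difference under discussion is essentially the one below: the atomic bit operation is a lock-prefixed read-modify-write on x86 (with the ordering that implies), while the double-underscore variant is a plain store and is only correct when nothing else can touch the word concurrently. Purely illustrative, not the actual slab code.

#include <linux/bitops.h>

#define MY_LOCK_BIT	0	/* hypothetical flag bit */

static void unlock_object_atomic(unsigned long *flags)
{
	clear_bit(MY_LOCK_BIT, flags);		/* lock btr on x86 */
}

static void unlock_object_nonatomic(unsigned long *flags)
{
	__clear_bit(MY_LOCK_BIT, flags);	/* plain RMW, no #LOCK */
}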
Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)
On Thu, Feb 07, 2008 at 07:25:45PM +0100, Jens Axboe wrote: > Hi, > > Here's a variant using kernel threads only, the nasty arch bits are then > not needed. Works for me, no performance testing (that's a hint for Alan > to try and queue up some testing for this variant as well :-) Well this stuff looks pretty nice (although I'm not sure whether the softirq->thread changes are a good idea for performance, I guess we'll see). You still don't have the option that the Intel patch gave, that is, to submit on the completer. I guess that you could do it somewhat generically by having a cpuid in the request queue, and update that with the completing cpu. At least they reported it to be the most efficient scheme in their testing, and Dave thought that migrating completions out to submitters might be a bottleneck in some cases. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
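The "cpuid in the request queue" idea would amount to something like the sketch below: remember which CPU completed the last request and steer the next submission there. The structure, field and function names are invented for illustration only and are not part of Jens' patches or the block layer.

#include <linux/smp.h>
#include <linux/cpumask.h>

struct my_request_queue {
	int last_complete_cpu;		/* -1 until the first completion */
	/* ... */
};

static void my_note_completion(struct my_request_queue *q)
{
	q->last_complete_cpu = smp_processor_id();
}

static int my_choose_submit_cpu(struct my_request_queue *q)
{
	int cpu = q->last_complete_cpu;

	if (cpu >= 0 && cpu_online(cpu))
		return cpu;		/* submit where completions happen */
	return smp_processor_id();	/* otherwise stay local */
}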
Re: [rfc] direct IO submission and completion scalability issues
On Tue, Feb 05, 2008 at 11:14:19AM +1100, David Chinner wrote: > On Mon, Feb 04, 2008 at 11:09:59AM +0100, Nick Piggin wrote: > > You get better behaviour in the slab and page allocators and locality > > and cache hotness of memory. For example, I guess in a filesystem / > > pagecache heavy workload, you have to touch each struct page, buffer head, > > fs private state, and also often have to wake the thread for completion. > > Much of this data has just been touched at submit time, so doin this on > > the same CPU is nice... > > [] > > > I'm surprised that the xfs global state bouncing would outweigh the > > bouncing of all the per-page/block/bio/request/etc data that gets touched > > during completion. We'll see. > > per-page/block.bio/request/etc is local to a single I/O. the only > penalty is a cacheline bounce for each of the structures from one > CPU to another. That is, there is no global state modified by these > completions. Yeah, but it is going from _all_ submitting CPUs to the one completing CPU. So you could bottleneck the interconnect at the completing CPU just as much as if you had cachelines being pulled the other way (ie. many CPUs trying to pull in a global cacheline). > The real issue is metadata. The transaction log I/O completion > funnels through a state machine protected by a single lock, which > means completions on different CPUs pulls that lock to all > completion CPUs. Given that the same lock is used during transaction > completion for other state transitions (in task context, not intr), > the more cpus active at once touches, the worse the problem gets. OK, once you add locking (and not simply cacheline contention), then the problem gets harder I agree. But I think that if the submitting side takes the same locks as log completion (eg. maybe for starting a new transaction), then it is not going to be a clear win either way, and you'd have to measure it in the end. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)
On Fri, Feb 08, 2008 at 08:47:47AM +0100, Jens Axboe wrote: > On Fri, Feb 08 2008, Nick Piggin wrote: > > On Thu, Feb 07, 2008 at 07:25:45PM +0100, Jens Axboe wrote: > > > Hi, > > > > > > Here's a variant using kernel threads only, the nasty arch bits are then > > > not needed. Works for me, no performance testing (that's a hint for Alan > > > to try and queue up some testing for this variant as well :-) > > > > Well this stuff looks pretty nice (although I'm not sure whether the > > softirq->thread changes are a good idea for performance, I guess we'll > > see). > > Yeah, that is indeed an open question and why I have two seperate > patches for now (io-cpu-affinity branch and io-cpu-affinity-kthread > branch). As Ingo mentioned, this is how softirqs are handled in the -rt > branch already. True, although there are some IO workloads where -rt falls behind mainline. May not be purely due to irq threads though, of course. > > You still don't have the option that the Intel patch gave, that is, > > to submit on the completer. I guess that you could do it somewhat > > generically by having a cpuid in the request queue, and update that > > with the completing cpu. > > Not sure what you mean, if setting queue_affinity doesn't accomplish it. > If you know the completing CPU to begin with, surely you can just set > the queuing affinity appropriately? And if you don't? > > At least they reported it to be the most efficient scheme in their > > testing, and Dave thought that migrating completions out to submitters > > might be a bottleneck in some cases. > > More so than migrating submitters to completers? The advantage of only > movign submitters is that you get rid of the completion locking. Apart > from that, the cost should be the same, especially for the thread based > solution. Not specifically for the block layer, but higher layers like xfs. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [git pull] more SLUB updates for 2.6.25
On Friday 08 February 2008 18:29, Eric Dumazet wrote: > Nick Piggin a écrit : > > On Friday 08 February 2008 13:13, Christoph Lameter wrote: > >> are available in the git repository at: > >> > >> git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm.git > >> slub-linus > >> > >> (includes the cmpxchg_local fastpath since the cmpxchg_local work > >> by Matheiu is in now, and the non atomic unlock by Nick. Verified that > >> this is not doing any harm after some other patches had been removed. > > > > Ah, good. I think it is always a good thing to be able to remove atomics. > > They place quite a bit of burden on the CPU, especially x86 where it also > > has implicit memory ordering semantics (although x86 can speculatively > > get around much of the problem, it's obviously worse than no restriction) > > > > Even if perhaps some cache coherency or timing quirk makes the non-atomic > > version slower (all else being equal), then I'd still say that the non > > atomic version should be preferred. > > What about IRQ masking then ? I really did mean all else being equal. eg. "clear_bit" vs "__clear_bit". > Many CPU pay high cost for cli/sti pair... True, and many UP architectures have to implement atomic operations with cli/sti pairs... so those are more reasons to use non-atomics. > And SLAB/SLUB allocators, even if only used from process context, want to > disable/re-enable interrupts... > > I understand kmalloc() want generic pools, but dedicated pools could avoid > this cli/sti Sure, I guess that would be possible. I've kind of toyed with doing some cli/sti mitigation in the page allocator, but in that case I found that it wasn't a win outside microbenchmarks: the cache characteristics of the returned pages are just as important if not more so than cli/sti costs (although that balance would change depending on the CPU and workload I guess). For slub yes you could do it with fewer downsides with process context pools. Is it possible instead for architectures where cli/sti is so expensive to change their lowest level of irq handling to do this by setting and clearing a soft flag somewhere? That's what I'd rather see, if possible. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
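The "soft flag" scheme mentioned in the last paragraph is roughly the following. This is only a sketch with made-up names, not how any particular architecture did it at the time, although powerpc's lazy interrupt disabling is in this spirit: local_irq_disable() becomes a flag write, and the real interrupt entry defers delivery while the flag is set.

#include <linux/percpu.h>
#include <linux/compiler.h>

static DEFINE_PER_CPU(int, soft_irqs_disabled);
static DEFINE_PER_CPU(int, irq_pending);

static inline void soft_local_irq_disable(void)
{
	__get_cpu_var(soft_irqs_disabled) = 1;
	barrier();
}

static inline void soft_local_irq_enable(void)
{
	barrier();
	__get_cpu_var(soft_irqs_disabled) = 0;
	if (__get_cpu_var(irq_pending)) {
		__get_cpu_var(irq_pending) = 0;
		/* replay or unmask the deferred interrupt here */
	}
}

/* called from the low-level (hardware) interrupt entry */
static inline int soft_irq_may_deliver(void)
{
	if (__get_cpu_var(soft_irqs_disabled)) {
		__get_cpu_var(irq_pending) = 1;
		return 0;	/* mask the line and return immediately */
	}
	return 1;		/* deliver normally */
}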
Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)
On Fri, Feb 08, 2008 at 08:59:55AM +0100, Jens Axboe wrote: > On Fri, Feb 08 2008, Nick Piggin wrote: > > And if you don't? > > Well if you don't ask for anything, you wont get anything :-) > As I mentioned, the patch is a playing ground for trying various setups. > Everything defaults to 'do as usual', set options to setup certain test > scenarios. I mean if you don't know the completing CPU. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)
On Fri, Feb 08, 2008 at 09:24:22AM +0100, Jens Axboe wrote: > On Fri, Feb 08 2008, Nick Piggin wrote: > > On Fri, Feb 08, 2008 at 08:59:55AM +0100, Jens Axboe wrote: > > > On Fri, Feb 08 2008, Nick Piggin wrote: > > > > And if you don't? > > > > > > Well if you don't ask for anything, you wont get anything :-) > > > As I mentioned, the patch is a playing ground for trying various setups. > > > Everything defaults to 'do as usual', set options to setup certain test > > > scenarios. > > > > I mean if you don't know the completing CPU. > > I still don't know quite what part of that patch you are referring to > here. If you don't have queue_affinity set, queueing a new request with > the hardware is generally done on the same CPU that just completed a > request. That is true even without any patches. Generally, but I guess not always. The database workloads in question (which you might know very well about ;)) apparently has a lot of queue empty and unplug conditions. Which I guess is the reason for Intel's initial patch. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] block layer: kmemcheck fixes
On Fri, Feb 08, 2008 at 07:09:07AM -0800, Arjan van de Ven wrote: > David Miller wrote: > >From: Linus Torvalds <[EMAIL PROTECTED]> > >Date: Thu, 7 Feb 2008 09:42:56 -0800 (PST) > > > >>Can we please just stop doing these one-by-one assignments, and just do > >>something like > >> > >>memset(rq, 0, sizeof(*rq)); > >>rq->q = q; > >>rq->ref_count = 1; > >>INIT_HLIST_NODE(&rq->hash); > >>RB_CLEAR_NODE(&rq->rb_node); > >> > >>instead? > >> > >>The memset() is likely faster and smaller than one-by-one assignments > >>anyway, even if the one-by-ones can avoid initializing some field or > >>there ends up being a double initialization.. > > > >The problem is store buffer compression. At least a few years > >ago this made a huge difference in sk_buff initialization in the > >networking. > > > >Maybe cpus these days have so much store bandwith that doing > >things like the above is OK, but I doubt it :-) > > on modern x86 cpus the memset may even be faster if the memory isn't in > cache; > the "explicit" method ends up doing Write Allocate on the cache lines > (so read them from memory) even though they then end up being written > entirely. > With memset the CPU is told that the entire range is set to a new value, and > the WA can be avoided for the whole-cachelines in the range. Don't you have write combining store buffers? Or is it still speculatively issuing the reads even before the whole cacheline is combined? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] block layer: kmemcheck fixes
On Fri, Feb 08, 2008 at 02:56:09PM -0800, Arjan van de Ven wrote: > Nick Piggin wrote: > >>>Maybe cpus these days have so much store bandwith that doing > >>>things like the above is OK, but I doubt it :-) > >>on modern x86 cpus the memset may even be faster if the memory isn't in > >>cache; > >>the "explicit" method ends up doing Write Allocate on the cache lines > >>(so read them from memory) even though they then end up being written > >>entirely. > >>With memset the CPU is told that the entire range is set to a new value, > >>and > >>the WA can be avoided for the whole-cachelines in the range. > > > >Don't you have write combining store buffers? Or is it still speculatively > >issuing the reads even before the whole cacheline is combined? > > x86 memory order model doesn't allow that quite; and you need a "series" of > at least 64 bytes > without any other memory accesses in between even if it would > not happening in practice. OK, fair enough... then it will be a very nice test to see if it helps. I'm sure you could have an arch specific initialisation function if it makes a significant difference. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
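The arch-specific initialisation idea amounts to something like the sketch below, following the memset-plus-a-few-stores shape Linus suggested earlier in the thread. ARCH_HAS_RQ_INIT and rq_init_arch() are hypothetical names; whether the bulk clear really avoids the read-for-ownership traffic is exactly what would need to be measured.

#include <linux/blkdev.h>
#include <linux/list.h>
#include <linux/rbtree.h>
#include <linux/string.h>

#ifndef ARCH_HAS_RQ_INIT
/* generic version: clear the whole request in one go */
static inline void rq_init_arch(struct request *rq)
{
	memset(rq, 0, sizeof(*rq));
}
#endif

static void my_rq_init(struct request_queue *q, struct request *rq)
{
	rq_init_arch(rq);
	/* then only the fields that are not simply zero */
	rq->q = q;
	rq->ref_count = 1;
	INIT_HLIST_NODE(&rq->hash);
	RB_CLEAR_NODE(&rq->rb_node);
}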
Re: Oops report for the week upto Feb 10th 2008
On Monday 11 February 2008 11:35, Arjan van de Ven wrote: > The http://www.kerneloops.org website collects kernel oops and > warning reports from various mailing lists and bugzillas as well as > with a client users can install to auto-submit oopses. > Below is a top 10 list of the oopses/backtraces collected in the last 7 > days. (Reports prior to 2.6.23 have been omitted in collecting the top 10) > > This week, a total of 323 oopses and warnings have been reported, > compared to 110 reports in the previous week. > > (This sharp increase is due to Fedora 9 alpha shipping the oops data > collection client in the default install, giving us much wider coverage > in the issues that actual users hit; many thanks to the Fedora project > for this) > > With the 2.6.25-rc1 release out, this will be the last report that includes > 2.6.23; future reports will only include issues from 2.6.24 and later. > > > Rank 1: set_dentry_child_flags > WARN_ON at fs/inotify.c:172 set_dentry_child_flags > Reported 93 times (116 total reports) > This is a user triggered WARN_ON in inotify. Sadly inotify seems to be > unmaintained. More info: > http://www.kerneloops.org/search.php?search=set_dentry_child_flags I was never able to trigger this or get anyone to reliably trigger it with a debug patch in. Which is why it has taken so long to fix. It looks like kde4 is triggering this big rash of new reports. Anyway, I have fixed a race or two and removed that warning code (which was also a little racy). So I think that should be OK. > Rank 9: mark_buffer_dirty > WARN_ON at fs/buffer.c:1169 > This indicates that a non-uptodate buffer is marked dirty. > This can lead to data corruption! > Reported 5 times (12 total reports) - Only seen since 2.6.24-rc6 > Usually happens during umount() > More info: http://www.kerneloops.org/search.php?search=mark_buffer_dirty That's interesting. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] alloc_percpu() fails to allocate percpu data
On Friday 22 February 2008 09:26, Peter Zijlstra wrote: > On Thu, 2008-02-21 at 19:00 +0100, Eric Dumazet wrote: > > Some oprofile results obtained while using tbench on a 2x2 cpu machine > > were very surprising. > > > > For example, loopback_xmit() function was using high number of cpu > > cycles to perform the statistic updates, supposed to be real cheap > > since they use percpu data > > > > pcpu_lstats = netdev_priv(dev); > > lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id()); > > lb_stats->packets++; /* HERE : serious contention */ > > lb_stats->bytes += skb->len; > > > > > > struct pcpu_lstats is a small structure containing two longs. It > > appears that on my 32bits platform, alloc_percpu(8) allocates a single > > cache line, instead of giving to each cpu a separate cache line. > > > > Using the following patch gave me impressive boost in various > > benchmarks ( 6 % in tbench) (all percpu_counters hit this bug too) > > > > Long term fix (ie >= 2.6.26) would be to let each CPU allocate their > > own block of memory, so that we dont need to roudup sizes to > > L1_CACHE_BYTES, or merging the SGI stuff of course... > > > > Note : SLUB vs SLAB is important here to *show* the improvement, since > > they dont have the same minimum allocation sizes (8 bytes vs 32 > > bytes). This could very well explain regressions some guys reported > > when they switched to SLUB. > > I've complained about this false sharing as well, so until we get the > new and improved percpu allocators, What I don't understand is why the slab allocators have something like this in them:

	if ((flags & SLAB_HWCACHE_ALIGN) && size > cache_line_size() / 2)
		return max_t(unsigned long, align, cache_line_size());

If you ask for HWCACHE_ALIGN, then you should get it. I don't understand why they think they know better than the caller. Things like this are just going to lead to very difficult to track down performance problems. Possibly correctness problems in rare cases. There could be another flag for "maybe align". -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
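Until the allocator is fixed, the usual workaround on the caller's side is to make the per-cpu structure itself a full cache line, so that no packing scheme can put two CPUs' counters into the same line. A sketch based on the pcpu_lstats example quoted above; the alignment annotation is the point, the rest is illustrative.

#include <linux/cache.h>
#include <linux/percpu.h>
#include <linux/smp.h>

/*
 * ____cacheline_aligned_in_smp rounds sizeof(struct pcpu_lstats) up to a
 * cache line, so even a tightly packed alloc_percpu() cannot make two
 * CPUs share one.
 */
struct pcpu_lstats {
	unsigned long packets;
	unsigned long bytes;
} ____cacheline_aligned_in_smp;

static struct pcpu_lstats *lstats;	/* from alloc_percpu(struct pcpu_lstats) */

static void lb_count(unsigned int len)
{
	struct pcpu_lstats *lb_stats;

	lb_stats = per_cpu_ptr(lstats, smp_processor_id());
	lb_stats->packets++;		/* this CPU is now the line's only writer */
	lb_stats->bytes += len;
}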
Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70
On Wednesday 20 February 2008 09:01, Alexey Dobriyan wrote:
> On Tue, Feb 19, 2008 at 11:47:11PM +0300, wrote:
> > > Are you reproducing it simply by running the
> > > ftest03 binary directly from the shell? How many times between oopses?
> > > It is multi-process but no threads, so races should be minimal down
> > > this path -- can you get an strace of the failing process?
> Speaking of multi-proceseness, changing MAXCHILD to 1, nchild to 1,
> AFAICS, generates one child which oopses the very same way (in parallel
> with generic LTP) But, lowering MAXIOVCNT to 8 generates no oops.

Thanks, I was able to reproduce quite easily with these settings. I think
I have the correct patch now (at least it isn't triggerable any more here).

Thanks,
Nick

diff --git a/mm/filemap.c b/mm/filemap.c
index 5c74b68..2650073 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1750,14 +1750,18 @@ static void __iov_iter_advance_iov(struct iov_iter *i, size_t bytes)
 	} else {
 		const struct iovec *iov = i->iov;
 		size_t base = i->iov_offset;
+		size_t copied = 0;
 
 		/*
 		 * The !iov->iov_len check ensures we skip over unlikely
-		 * zero-length segments.
+		 * zero-length segments (without overruning the iovec).
 		 */
-		while (bytes || !iov->iov_len) {
-			int copy = min(bytes, iov->iov_len - base);
+		while (copied < bytes ||
+				unlikely(!iov->iov_len && copied < i->count)) {
+			int copy;
 
+			copy = min(bytes, iov->iov_len - base);
+			copied += copy;
 			bytes -= copy;
 			base += copy;
 			if (iov->iov_len == base) {
Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
On Thursday 21 February 2008 21:58, Robin Holt wrote: > On Thu, Feb 21, 2008 at 03:20:02PM +1100, Nick Piggin wrote: > > > > So why can't you export a device from your xpmem driver, which > > > > can be mmap()ed to give out "anonymous" memory pages to be used > > > > for these communication buffers? > > > > > > Because we need to have heap and stack available as well. MPT does > > > not control all the communication buffer areas. I haven't checked, but > > > this is the same problem that IB will have. I believe they are > > > actually allowing any memory region be accessible, but I am not sure of > > > that. > > > > Then you should create a driver that the user program can register > > and unregister regions of their memory with. The driver can do a > > get_user_pages to get the pages, and then you'd just need to set up > > some kind of mapping so that userspace can unmap pages / won't leak > > memory (and an exit_mm notifier I guess). > > OK. You need to explain this better to me. How would this driver > supposedly work? What we have is an MPI library. It gets invoked at > process load time to establish its rank-to-rank communication regions. > It then turns control over to the processes main(). That is allowed to > run until it hits the > MPI_Init(&argc, &argv); > > The process is then totally under the users control until: > MPI_Send(intmessage, m_size, MPI_INT, my_rank+half, tag, > MPI_COMM_WORLD); > MPI_Recv(intmessage, m_size, MPI_INT, my_rank+half,tag, MPI_COMM_WORLD, > &status); > > That is it. That is all our allowed interaction with the users process. OK, when you said something along the lines of "the MPT library has control of the comm buffer", then I assumed it was an area of virtual memory which is set up as part of initialization, rather than during runtime. I guess I jumped to conclusions. > That doesn't seem too unreasonable, except when you compare it to how the > driver currently works. Remember, this is done from a library which has > no insight into what the user has done to its own virtual address space. > As a result, each MPI_Send() would result in a system call (or we would > need to have a set of callouts for changes to a processes VMAs) which > would be a significant increase in communication overhead. > > Maybe I am missing what you intend to do, but what we need is a means of > tracking one processes virtual address space changes so other processes > can do direct memory accesses without the need for a system call on each > communication event. Yeah it's tricky. BTW. what is the performance difference between having a system call or no? > > Because you don't need to swap, you don't need coherency, and you > > are in control of the areas, then this seems like the best choice. > > It would allow you to use heap, stack, file-backed, anything. > > You are missing one point here. The MPI specifications that have > been out there for decades do not require the process use a library > for allocating the buffer. I realize that is a horrible shortcoming, > but that is the world we live in. Even if we could change that spec, Can you change the spec? Are you working on it? > we would still need to support the existing specs. As a result, the > user can change their virtual address space as they need and still expect > communications be cheap. That's true. How has it been supported up to now? Are you using these kind of notifiers in patched kernels? 
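For reference, the "driver that the user program can register regions of memory with" suggestion boils down to something like the sketch below: pin the user pages once with get_user_pages() and keep the struct page array around for whoever needs to reach the buffer. All names here are invented, the ioctl plumbing, unregister/exit paths and the export to the remote side are omitted, and this only illustrates the approach being argued about, not any of the XPMEM or mmu notifier patches.

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/pagemap.h>
#include <linux/slab.h>

struct reg_region {
	unsigned long	uaddr;
	int		nr_pages;
	struct page	**pages;
};

static struct reg_region *register_region(unsigned long uaddr, size_t len)
{
	struct reg_region *r;
	unsigned long start = uaddr & PAGE_MASK;
	int nr = (uaddr + len - start + PAGE_SIZE - 1) >> PAGE_SHIFT;
	int got = 0;

	r = kzalloc(sizeof(*r), GFP_KERNEL);
	if (!r)
		return NULL;
	r->pages = kcalloc(nr, sizeof(struct page *), GFP_KERNEL);
	if (!r->pages)
		goto out_free;

	down_read(&current->mm->mmap_sem);
	got = get_user_pages(current, current->mm, start,
			     nr, 1 /* write */, 0 /* force */,
			     r->pages, NULL);
	up_read(&current->mm->mmap_sem);
	if (got != nr)
		goto out_release;

	r->uaddr = start;
	r->nr_pages = nr;
	return r;

out_release:
	while (got > 0)
		page_cache_release(r->pages[--got]);
	kfree(r->pages);
out_free:
	kfree(r);
	return NULL;
}

The obvious drawback, as discussed above, is that the library has to know about and register every buffer in advance, which is exactly where MPI-1 style applications make life hard.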
Re: Proposal for "proper" durable fsync() and fdatasync()
On Tuesday 26 February 2008 18:59, Jamie Lokier wrote: > Andrew Morton wrote: > > On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier <[EMAIL PROTECTED]> wrote: > > > (It would be nicer if sync_file_range() > > > took a vector of ranges for better elevator scheduling, but let's > > > ignore that :-) > > > > Two passes: > > > > Pass 1: shove each of the segments into the queue with > > SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE > > > > Pass 2: wait for them all to complete and return accumulated result > > with SYNC_FILE_RANGE_WAIT_AFTER > > Thanks. > > Seems ok, though being able to cork the I/O until the last one would > be a bonus (like TCP_MORE... SYNC_FILE_RANGE_MORE?) > > I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE. Is there a > reason why you have it there? The man page isn't very enlightening. Yeah, sync_file_range has slightly unusual semantics and introduces the new concept, "writeout", to userspace (does "writeout" include "in drive cache"? the kernel doesn't think so, but the only way to make sync_file_range "safe" is if you do consider it writeout). If it makes it any easier to understand, we can add SYNC_FILE_ASYNC and SYNC_FILE_SYNC parts that just deal with the safe/unsafe and sync/async semantics that are part of the normal POSIX API. Anyway, the idea of making fsync/fdatasync etc. safe by default is a good one IMO, and it is a bad bug that we don't do that :( -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
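Spelled out as userspace code, Andrew's two-pass scheme looks something like the sketch below. The range array and the helper are hypothetical; note also that sync_file_range() never asks the drive to flush its volatile write cache, which is exactly the "safe vs. unsafe" distinction discussed above.

#define _GNU_SOURCE
#include <fcntl.h>

struct range {
	off64_t offset;
	off64_t nbytes;
};

static int sync_ranges(int fd, const struct range *r, int n)
{
	int i, err = 0;

	/* pass 1: start writeout for every range */
	for (i = 0; i < n; i++)
		if (sync_file_range(fd, r[i].offset, r[i].nbytes,
				    SYNC_FILE_RANGE_WAIT_BEFORE |
				    SYNC_FILE_RANGE_WRITE))
			err = -1;

	/* pass 2: wait for all of them to complete */
	for (i = 0; i < n; i++)
		if (sync_file_range(fd, r[i].offset, r[i].nbytes,
				    SYNC_FILE_RANGE_WAIT_AFTER))
			err = -1;

	return err;
}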
Re: [ofa-general] Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
On Tuesday 26 February 2008 18:21, Gleb Natapov wrote: > On Tue, Feb 26, 2008 at 05:11:32PM +1100, Nick Piggin wrote: > > > You are missing one point here. The MPI specifications that have > > > been out there for decades do not require the process use a library > > > for allocating the buffer. I realize that is a horrible shortcoming, > > > but that is the world we live in. Even if we could change that spec, > > > > Can you change the spec? > > Not really. It will break all existing codes. I meant as in eg. submit changes to MPI-3 > MPI-2 provides a call for > memory allocation (and it's beneficial to use this call for some > interconnects), but many (most?) applications are still written for MPI-1 > and those that are written for MPI-2 mostly uses the old habit of > allocating memory by malloc(), or even use stack or BSS memory for > communication buffer purposes. OK, so MPI-2 already has some way to do that... I'm not saying that we can now completely dismiss the idea of using notifiers for this, but it is just a good data point to know. Thanks, Nick -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: oops when using git gc --auto
On Wednesday 27 February 2008 00:22, Otavio Salvador wrote: > Hello, > > Today I got this oops, someone has an idea of what's going wrong? > > Unable to handle kernel paging request at 0200 RIP: > [] find_get_pages+0x3c/0x69 At this point, the most likely candidate is a memory corruption error, probably hardware. Can you run memtest86 for a few hours to get a bit more confidence in the hw (preferably overnight)? I did recently see another quite similar corruption in the pagecache radix-tree, though. Coincidence maybe? > PGD 0 > Oops: [1] SMP > CPU 3 > Modules linked in: sha256_generic aes_generic aes_x86_64 cbc blkcipher > nvidia(P) rfcomm l2cap bluetooth ac battery ipv6 nfs lockd nfs_acl sunrpc > bridge ext2 mbcache dm_crypt tun kvm_intel kvm loop snd_usb_audio > snd_usb_lib snd_rawmidi snd_hda_intel e1000e i2c_i801 serio_raw > snd_seq_device snd_pcm intel_agp button snd_timer pcspkr psmouse snd_hwdep > snd snd_page_alloc soundcore evdev i2c_core xfs dm_mirror dm_snapshot > dm_mod raid0 md_mod sg sr_mod cdrom sd_mod usbhid hid usb_storage > pata_marvell floppy ahci ata_generic libata scsi_mod ehci_hcd uhci_hcd > thermal processor fan Pid: 15684, comm: git Tainted: P > 2.6.24-1-amd64 #1 > RIP: 0010:[] [] > find_get_pages+0x3c/0x69 RSP: 0018:8100394dfd98 EFLAGS: 00010097 > RAX: 0009 RBX: 000e RCX: 0009 > RDX: 0200 RSI: 000a RDI: 0040 > RBP: 810042964350 R08: 0040 R09: 000a > R10: 8100425a06c8 R11: 000a R12: 000e > R13: 8100394dfdf8 R14: 810042964350 R15: > FS: 2ae326df2190() GS:81007d7aeb40() > knlGS: CS: 0010 DS: ES: CR0: 8005003b > CR2: 0200 CR3: 358f9000 CR4: 26e0 > DR0: DR1: DR2: > DR3: DR6: 0ff0 DR7: 0400 > Process git (pid: 15684, threadinfo 8100394de000, task > 8100359cd800) Stack: 000d 8100394dfde8 > 000d 000e 000e 802794d6 > 8100014a7768 80279b04 > Call Trace: > [] pagevec_lookup+0x17/0x1e > [] truncate_inode_pages_range+0x108/0x2bd > [] generic_delete_inode+0xbf/0x127 > [] do_unlinkat+0xd5/0x144 > [] sys_write+0x45/0x6e > [] system_call+0x7e/0x83 > > > Code: 48 8b 02 25 00 40 02 00 48 3d 00 40 02 00 75 04 48 8b 52 10 > RIP [] find_get_pages+0x3c/0x69 > RSP > CR2: 0200 > ---[ end trace cb43a9f4488b815a ]--- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: unfair stress on non memory allocating apps while swapout (in 2.4)
> why are programs which do not allocate memory be delayed while one > program is eating up all memory. This clearly means they are not delayed in > the malloc call but simply the kernel will not schedule them while he is bussy > to page out processes. Bernd, The reason why programs not allocating memory start waiting when the system starts swapping is because they get some of their working set paged out, which must be loaded into physical memory when they next use it. By working set I mean the code and data they will use in their next timeslice. Nick PS this is my first post to lkml so please keep that in mind... PPS ... so, was I right? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Topic for discussion: OS Design
> So what we really need to do is get some custom "RAM blitter" into our > hardware to do the memory copies needed for fast context switching and > message passing. don't you think you should quit while you're behind? > Too bad nobody on this list works at an electronics design company... ;-P you would probably be surprised. Nick. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
> I'm trying to write a server that handles 10000 clients. On 2.4.x, > the RT signal queue stuff looks like the way to achieve that. I would suggest you try multiple polling threads. Not only will you get better SMP scalability, but if you have, say, 16 threads, each one only has to handle ~600 fds. Nick. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
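The multiple-pollers suggestion, in toy form: split the descriptors across a fixed number of threads so each poll() call only scans its own share. Error handling and the actual socket I/O are left out; this is only a sketch of the idea, not code from the thread.

#include <poll.h>
#include <pthread.h>
#include <stdlib.h>

#define NTHREADS 16

struct poller {
	struct pollfd *fds;
	int nfds;
};

static void *poll_loop(void *arg)
{
	struct poller *p = arg;

	for (;;) {
		if (poll(p->fds, p->nfds, -1) < 0)
			break;
		/* handle the ready descriptors in p->fds here */
	}
	return NULL;
}

/* divide nfds descriptors roughly evenly over NTHREADS pollers */
static void start_pollers(struct pollfd *all, int nfds)
{
	int i, per = (nfds + NTHREADS - 1) / NTHREADS;
	pthread_t tid;

	for (i = 0; i < NTHREADS; i++) {
		struct poller *p = malloc(sizeof(*p));

		p->fds = all + i * per;
		p->nfds = (i + 1) * per <= nfds ? per : nfds - i * per;
		if (p->nfds > 0)
			pthread_create(&tid, NULL, poll_loop, p);
		else
			free(p);
	}
}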
2.4.0-test9 Oopses
Did the following with 2.4.0-test9 + reiserfs 3.6.18 (all ext2 filesystem, however) and all ide block devices. scsi0 : SCSI host adapter emulation for IDE ATAPI devices Vendor: RICOH Model: CD-R/RW MP7060A Rev: 1.50 Type: CD-ROM ANSI SCSI revision: 02 Vendor: ATAPI Model: CD-ROM DRIVE-24X Rev: U40M Type: CD-ROM ANSI SCSI revision: 02 Detected scsi CD-ROM sr0 at scsi0, channel 0, id 0, lun 0 Detected scsi CD-ROM sr1 at scsi0, channel 0, id 1, lun 0 sr0: scsi3-mmc drive: 24x/24x writer cd/rw xa/form2 cdda tray sr1: scsi3-mmc drive: 20x/20x xa/form2 cdda tray scsi : 0 hosts left. (loaded ide-scsi modules as you can see) After trying to access the /proc/scsi directory I got this oops: Unable to handle kernel paging request at virtual address c4858010 printing eip: c01461ac Oops: 0002 CPU:0 EIP:0010:[proc_get_inode+156/288] EFLAGS: 00010286 eax: c4858000 ebx: c39390c0 ecx: c21835c8 edx: 0023 esi: c2183540 edi: c3939114 ebp: c21830c0 esp: c3ef3eec ds: 0018 es: 0018 ss: 0018 Process bash (pid: 407, stackpage=c3ef3000) Stack: c0f04460 c0f044c8 c0147d09 c111c800 115d c39390c0 fff4 c0f04460 c21830c0 c0f042e0 ffea c01378df c21830c0 c0f04460 c3ef3f68 c0f042e0 c3ef3fa4 c0138009 c0f042e0 c3ef3f68 c1fb1000 Call Trace: [proc_lookup+121/160] [real_lookup+79/192] [path_walk+1369/1952] [][__user_walk+60/96] [sys_newstat+22/112] [sys_close+72/96] [system_call+51/56] Code: ff 40 10 8b 43 24 80 48 14 18 0f b7 43 08 25 00 f0 ff ff 66 After rebooting, loading the same modules, then doing cdrecord -scanbus I got these two oopses Unable to handle kernel NULL pointer dereference at virtual address printing eip: c013a551 Oops: CPU:0 EIP:0010:[vfs_follow_link+33/368] EFLAGS: 00010217 eax: ebx: c3c5bf90 ecx: 0341 edx: c02b6040 esi: edi: ebp: esp: c3c5befc ds: 0018 es: 0018 ss: 0018 Process devfsd (pid: 12, stackpage=c3c5b000) Stack: c3c5bf90 c0a01e20 c3c5bf90 c01547cf c3c5bf90 c3c5a000 c0138143 c0a01e20 c3c5bf90 c09c0b40 c3d69000 c3c5bf90 bfffecdc 0001 bfffecdc c3c5bf94 0009 c0a01e20 c3d69005 0003 Call Trace: [devfs_follow_link+31/48] [path_walk+1683/1952] [__user_walk+60/96] [sys_chown+22/80] [sys_chown16+48/64] [system_call+51/56] Code: 80 3f 2f 0f 85 c6 00 00 00 53 e8 90 d2 ff ff ba 00 e0 ff ff Unable to handle kernel NULL pointer dereference at virtual address printing eip: c013a551 Oops: CPU:0 EIP:0010:[vfs_follow_link+33/368] EFLAGS: 00010217 eax: ebx: c081df80 ecx: 0341 edx: c02b6040 esi: c0a01b20 edi: ebp: esp: c081ded0 ds: 0018 es: 0018 ss: 0018 Process cdrecord (pid: 758, stackpage=c081d000) Stack: c081df80 c0a01b20 c0a01b20 c081df80 c01547cf c081df80 c081c000 c0138143 c0a01b20 c081df80 c09c0b40 c117d000 0002 0003 0001 08074094 c081df84 0001 c0a01b20 c117d005 0003 Call Trace: [devfs_follow_link+31/48] [path_walk+1683/1952] [open_namei+128/1504] [filp_open+59/96] [sys_open+67/208] [system_call+51/56] Code: 80 3f 2f 0f 85 c6 00 00 00 53 e8 90 d2 ff ff ba 00 e0 ff ff - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
[patch] BSD process accounting: new locking
I have attached a very small patch (test9) to remove the kernel lock from kernel/acct.c. If I am missing something major (a brain?), I apologise in advance. I have tested this on my UP x86 with spinlock debugging. I would appreciate comments or an explanation of why this can't be done if you have the time. Thanks. Nick bsdacct.patch
Re: [patch] BSD process accounting: new locking
> I have attached a very small patch (test9) to remove the kernel lock from > kernel/acct.c. If I am missing something major (a brain?), I apologise in > advance. I have tested this on my UP x86 with spinlock debugging. I would > appreciate comments or an explanation of why this can't be done if you have > the time. Thanks. > > Nick > Maybe there was a possibility of a race when acct_auto_close calls sys_acct in the last patch. If so, this should fix it. Nick. bsdacct2.patch
test9 oops (in block_read_full_page)
I apologise if this oops has already been fixed: it has happened twice but I can't find the exact way to trigger it, I just want to make sure it is reported ;) Nick oops
Re: 2.4.0-test9 Oopses
Just a note that this oops still occurs in test10. The problem occurs because get_devfs_entry_from_vfs_inode in devfs_follow_link (and/or devfs_read_link), seems to return invalid or incorrect devfs entries whose .u.symlink.linkname is null which causes the line: if (*link == '/') { in fs/namei.c: __vfs_follow_link to oops. The oops is due to trying to follow an sg? link in /dev. Nick. - Original Message ----- From: "Nick Piggin" <[EMAIL PROTECTED]> To: "Linux-Kernel" <[EMAIL PROTECTED]> Sent: Wednesday, October 25, 2000 9:16 PM Subject: 2.4.0-test9 Oopses > Did the following with 2.4.0-test9 + reiserfs 3.6.18 (all ext2 filesystem, > however) and all ide block devices. > > scsi0 : SCSI host adapter emulation for IDE ATAPI devices > Vendor: RICOH Model: CD-R/RW MP7060A Rev: 1.50 > Type: CD-ROM ANSI SCSI revision: 02 > Vendor: ATAPI Model: CD-ROM DRIVE-24X Rev: U40M > Type: CD-ROM ANSI SCSI revision: 02 > Detected scsi CD-ROM sr0 at scsi0, channel 0, id 0, lun 0 > Detected scsi CD-ROM sr1 at scsi0, channel 0, id 1, lun 0 > sr0: scsi3-mmc drive: 24x/24x writer cd/rw xa/form2 cdda tray > sr1: scsi3-mmc drive: 20x/20x xa/form2 cdda tray > scsi : 0 hosts left. > > (loaded ide-scsi modules as you can see) [snip] > Doing cdrecord -scanbus I > got these two oopses > > Unable to handle kernel NULL pointer dereference at virtual address > printing eip: > c013a551 > Oops: > CPU:0 > EIP:0010:[vfs_follow_link+33/368] > EFLAGS: 00010217 > eax: ebx: c3c5bf90 ecx: 0341 edx: c02b6040 > esi: edi: ebp: esp: c3c5befc > ds: 0018 es: 0018 ss: 0018 > Process devfsd (pid: 12, stackpage=c3c5b000) > Stack: c3c5bf90 c0a01e20 c3c5bf90 c01547cf c3c5bf90 > >c3c5a000 c0138143 c0a01e20 c3c5bf90 c09c0b40 c3d69000 > c3c5bf90 >bfffecdc 0001 bfffecdc c3c5bf94 0009 c0a01e20 c3d69005 > 0003 > Call Trace: [devfs_follow_link+31/48] [path_walk+1683/1952] > [__user_walk+60/96] [sys_chown+22/80] [sys_chown16+48/64] > [system_call+51/56] > Code: 80 3f 2f 0f 85 c6 00 00 00 53 e8 90 d2 ff ff ba 00 e0 ff ff > > Unable to handle kernel NULL pointer dereference at virtual address > printing eip: > c013a551 > Oops: > CPU:0 > EIP:0010:[vfs_follow_link+33/368] > EFLAGS: 00010217 > eax: ebx: c081df80 ecx: 0341 edx: c02b6040 > esi: c0a01b20 edi: ebp: esp: c081ded0 > ds: 0018 es: 0018 ss: 0018 > Process cdrecord (pid: 758, stackpage=c081d000) > Stack: c081df80 c0a01b20 c0a01b20 c081df80 c01547cf c081df80 > >c081c000 c0138143 c0a01b20 c081df80 c09c0b40 c117d000 > 0002 >0003 0001 08074094 c081df84 0001 c0a01b20 c117d005 > 0003 > Call Trace: [devfs_follow_link+31/48] [path_walk+1683/1952] > [open_namei+128/1504] [filp_open+59/96] [sys_open+67/208] > [system_call+51/56] > Code: 80 3f 2f 0f 85 c6 00 00 00 53 e8 90 d2 ff ff ba 00 e0 ff ff > > > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
2.4.0-test10 oopses (bug in devfs)
Maybe this got ignored because the subject was test9 oops when test 10 had been released, or people tend to ignore .edu addresses... > Just a note that this oops still occurs in test10. The problem occurs > because get_devfs_entry_from_vfs_inode in devfs_follow_link (and/or > devfs_read_link), seems to return invalid or incorrect devfs entries whose > .u.symlink.linkname is null which causes the line: > if (*link == '/') { > in fs/namei.c: __vfs_follow_link to oops. > > The oops is due to trying to follow an sg? link in /dev. > > Nick. > > ----- Original Message - > From: "Nick Piggin" <[EMAIL PROTECTED]> > To: "Linux-Kernel" <[EMAIL PROTECTED]> > Sent: Wednesday, October 25, 2000 9:16 PM > Subject: 2.4.0-test9 Oopses > > > > Did the following with 2.4.0-test9 + reiserfs 3.6.18 (all ext2 filesystem, > > however) and all ide block devices. > > > > scsi0 : SCSI host adapter emulation for IDE ATAPI devices > > Vendor: RICOH Model: CD-R/RW MP7060A Rev: 1.50 > > Type: CD-ROM ANSI SCSI revision: 02 > > Vendor: ATAPI Model: CD-ROM DRIVE-24X Rev: U40M > > Type: CD-ROM ANSI SCSI revision: 02 > > Detected scsi CD-ROM sr0 at scsi0, channel 0, id 0, lun 0 > > Detected scsi CD-ROM sr1 at scsi0, channel 0, id 1, lun 0 > > sr0: scsi3-mmc drive: 24x/24x writer cd/rw xa/form2 cdda tray > > sr1: scsi3-mmc drive: 20x/20x xa/form2 cdda tray > > scsi : 0 hosts left. > > > > (loaded ide-scsi modules as you can see) > [snip] > > Doing cdrecord -scanbus I > > got these two oopses > > > > Unable to handle kernel NULL pointer dereference at virtual address > > > printing eip: > > c013a551 > > Oops: > > CPU:0 > > EIP:0010:[vfs_follow_link+33/368] > > EFLAGS: 00010217 > > eax: ebx: c3c5bf90 ecx: 0341 edx: c02b6040 > > esi: edi: ebp: esp: c3c5befc > > ds: 0018 es: 0018 ss: 0018 > > Process devfsd (pid: 12, stackpage=c3c5b000) > > Stack: c3c5bf90 c0a01e20 c3c5bf90 c01547cf c3c5bf90 > > > >c3c5a000 c0138143 c0a01e20 c3c5bf90 c09c0b40 c3d69000 > > c3c5bf90 > >bfffecdc 0001 bfffecdc c3c5bf94 0009 c0a01e20 c3d69005 > > 0003 > > Call Trace: [devfs_follow_link+31/48] [path_walk+1683/1952] > > [__user_walk+60/96] [sys_chown+22/80] [sys_chown16+48/64] > > [system_call+51/56] > > Code: 80 3f 2f 0f 85 c6 00 00 00 53 e8 90 d2 ff ff ba 00 e0 ff ff > > > > Unable to handle kernel NULL pointer dereference at virtual address > > > printing eip: > > c013a551 > > Oops: > > CPU:0 > > EIP:0010:[vfs_follow_link+33/368] > > EFLAGS: 00010217 > > eax: ebx: c081df80 ecx: 0341 edx: c02b6040 > > esi: c0a01b20 edi: ebp: esp: c081ded0 > > ds: 0018 es: 0018 ss: 0018 > > Process cdrecord (pid: 758, stackpage=c081d000) > > Stack: c081df80 c0a01b20 c0a01b20 c081df80 c01547cf c081df80 > > > >c081c000 c0138143 c0a01b20 c081df80 c09c0b40 c117d000 > > 0002 > >0003 0001 08074094 c081df84 0001 c0a01b20 c117d005 > > 0003 > > Call Trace: [devfs_follow_link+31/48] [path_walk+1683/1952] > > [open_namei+128/1504] [filp_open+59/96] [sys_open+67/208] > > [system_call+51/56] > > Code: 80 3f 2f 0f 85 c6 00 00 00 53 e8 90 d2 ff ff ba 00 e0 ff ff - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
bkl usage
Hi. In my efforts to understand the linux kernel v2.4 I found the bkl being used in kernel/acct.c to lock seemingly local data. Would someone please explain what races this prevents vs. say: --- linux/kernel/acct.c Mon Oct 30 01:02:56 2000 +++ linux-2.4.0-test10/kernel/acct.c Mon Oct 30 01:10:20 2000 @@ -41,6 +41,10 @@ * Oh, fsck... Oopsable SMP race in do_process_acct() - we must hold * ->mmap_sem to walk the vma list of current->mm. Nasty, since it leaks * a struct file opened for write. Fixed. 2/6/2000, AV. + * + * 2000-10-27 Modified by Nick <[EMAIL PROTECTED]> to remove usage + * of the big kernel lock in favour of a local spinlock + * */ #include @@ -77,6 +81,7 @@ static struct file *acct_file; static struct timer_list acct_timer; static void do_acct_process(long, struct file *); +static spinlock_t acct_lock = SPIN_LOCK_UNLOCKED; /* * Called whenever the timer says to check the free space. @@ -95,11 +100,11 @@ int res; int act; - lock_kernel(); +spin_lock(&acct_lock); res = acct_active; if (!file || !acct_needcheck) - goto out; - unlock_kernel(); + goto out_unlock; +spin_unlock(&acct_lock); /* May block */ if (vfs_statfs(file->f_dentry->d_inode->i_sb, &sbuf)) @@ -113,14 +118,14 @@ act = 0; /* - * If some joker switched acct_file under us we'ld better be + * If some joker switched acct_file under us we'd better be * silent and _not_ touch anything. */ - lock_kernel(); +spin_lock(&acct_lock); if (file != acct_file) { if (act) res = act>0; - goto out; + goto out_unlock; } if (acct_active) { @@ -140,8 +145,8 @@ acct_timer.expires = jiffies + ACCT_TIMEOUT*HZ; add_timer(&acct_timer); res = acct_active; -out: - unlock_kernel(); +out_unlock: +spin_unlock(&acct_lock); return res; } @@ -182,7 +187,7 @@ } error = 0; - lock_kernel(); +spin_lock(&acct_lock); if (acct_file) { old_acct = acct_file; del_timer(&acct_timer); @@ -200,7 +205,7 @@ acct_timer.expires = jiffies + ACCT_TIMEOUT*HZ; add_timer(&acct_timer); } - unlock_kernel(); +spin_unlock(&acct_lock); if (old_acct) { do_acct_process(0,old_acct); filp_close(old_acct, NULL); @@ -214,10 +219,24 @@ void acct_auto_close(kdev_t dev) { - lock_kernel(); - if (acct_file && acct_file->f_dentry->d_inode->i_dev == dev) - sys_acct(NULL); - unlock_kernel(); +struct file *old_acct; + +spin_lock(&acct_lock); + if (acct_file && acct_file->f_dentry->d_inode->i_dev == dev) { + +/* Run the same code as sys_acct(NULL) here. This simplifies locking */ +old_acct = acct_file; +del_timer(&acct_timer); +acct_active = 0; +acct_needcheck = 0; +acct_file = NULL; + +spin_unlock(&acct_lock); + +do_acct_process(0, old_acct); +filp_close(old_acct, NULL); +} else +spin_unlock(&acct_lock); } /* @@ -348,15 +367,15 @@ int acct_process(long exitcode) { struct file *file = NULL; - lock_kernel(); +spin_lock(&acct_lock); if (acct_file) { file = acct_file; get_file(file); - unlock_kernel(); +spin_unlock(&acct_lock); do_acct_process(exitcode, acct_file); fput(file); } else - unlock_kernel(); +spin_unlock(&acct_lock); return 0; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.12-rc2-mm1
Andrew Morton wrote: +sched-remove-unnecessary-sched-domains.patch +sched-improve-pinned-task-handling-again.patch [snip] CPU scheduler updates It is no problem that you picked these up for testing. But don't merge them yet, please. Suresh's underlying problem with the unnecessary sched domains is a failing of sched-balance-exec and sched-balance-fork, which I am working on now. Removing unnecessary domains is a nice optimisation, but just needs to account for a few more flags before declaring that a domain is unnecessary (not to mention this probably breaks if isolcpus= is used). I have made some modifications to the patch to fix these problems. Lastly, I'd like to be a bit less intrusive with pinned task handling improvements. I think we can do this while still being effective in preventing livelocks. I will keep you posted with regards to the various scheduler patches. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.12-rc2-mm1
Siddha, Suresh B wrote: On Tue, Apr 05, 2005 at 05:33:49PM +1000, Nick Piggin wrote: Suresh's underlying problem with the unnecessary sched domains is a failing of sched-balance-exec and sched-balance-fork, which That wasn't the only motivation. For example, on non-HT cpu's we shouldn't be setting up SMT sched-domain, same with NUMA domains on non-NUMA systems. Yep, sure. It is a good, if slight, optimisation. And I've also just slightly extended your patch, so we don't have any domains if booting with maxcpus=1 I am working on now. Removing unnecessary domains is a nice optimisation, but just needs to account for a few more flags before declaring that a Can you elaborate when we require a domain with special flags but has no or only one group in it. The SD_WAKE_* flags do not use groups, so it would be legitimate to have a domain that has one of these set, with no groups. domain is unnecessary (not to mention this probably breaks if isolcpus= is used). I have made some modifications to the patch I have tested my patch with "ioslcpus=" and it works just fine. OK, my apologies ;) to fix these problems. Lastly, I'd like to be a bit less intrusive with pinned task handling improvements. I think we can do this while still being effective in preventing livelocks. We want to see this fixed. Please post your patch and I can let you know the test results. I will try to get it working and tested tonight for you. I will keep you posted with regards to the various scheduler patches. Nick, Can you post the patches you sent me earlier to this list? Yep, I'll post them. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 1/5] sched: remove degenerate domains
This is Suresh's patch with some modifications. -- SUSE Labs, Novell Inc. Remove degenerate scheduler domains during the sched-domain init. For example on x86_64, we always have NUMA configured in. On Intel EM64T systems, top most sched domain will be of NUMA and with only one sched_group in it. With fork/exec balances(recent Nick's fixes in -mm tree), we always endup taking wrong decisions because of this topmost domain (as it contains only one group and find_idlest_group always returns NULL). We will endup loading HT package completely first, letting active load balance kickin and correct it. In general, this patch also makes sense with out recent Nick's fixes in -mm. Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]> Modified to account for more than just sched_groups when scanning for degenerate domains by Nick Piggin. Allow a runqueue's sd to go NULL, which required small changes to the smtnice code. Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> Index: linux-2.6/kernel/sched.c === --- linux-2.6.orig/kernel/sched.c 2005-04-05 16:38:21.0 +1000 +++ linux-2.6/kernel/sched.c2005-04-05 18:39:09.0 +1000 @@ -2583,11 +2583,15 @@ out: #ifdef CONFIG_SCHED_SMT static inline void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) { - struct sched_domain *sd = this_rq->sd; + struct sched_domain *tmp, *sd = NULL; cpumask_t sibling_map; int i; + + for_each_domain(this_cpu, tmp) + if (tmp->flags & SD_SHARE_CPUPOWER) + sd = tmp; - if (!(sd->flags & SD_SHARE_CPUPOWER)) + if (!sd) return; /* @@ -2628,13 +2632,17 @@ static inline void wake_sleeping_depende static inline int dependent_sleeper(int this_cpu, runqueue_t *this_rq) { - struct sched_domain *sd = this_rq->sd; + struct sched_domain *tmp, *sd = NULL; cpumask_t sibling_map; prio_array_t *array; int ret = 0, i; task_t *p; - if (!(sd->flags & SD_SHARE_CPUPOWER)) + for_each_domain(this_cpu, tmp) + if (tmp->flags & SD_SHARE_CPUPOWER) + sd = tmp; + + if (!sd) return 0; /* @@ -4604,6 +4612,11 @@ static void sched_domain_debug(struct sc { int level = 0; + if (!sd) { + printk(KERN_DEBUG "CPU%d attaching NULL sched-domain.\n", cpu); + return; + } + printk(KERN_DEBUG "CPU%d attaching sched-domain:\n", cpu); do { @@ -4809,6 +4822,50 @@ static void init_sched_domain_sysctl(voi } #endif +static int __devinit sd_degenerate(struct sched_domain *sd) +{ + if (cpus_weight(sd->span) == 1) + return 1; + + /* Following flags need at least 2 groups */ + if (sd->flags & (SD_LOAD_BALANCE | +SD_BALANCE_NEWIDLE | +SD_BALANCE_FORK | +SD_BALANCE_EXEC)) { + if (sd->groups != sd->groups->next) + return 0; + } + + /* Following flags don't use groups */ + if (sd->flags & (SD_WAKE_IDLE | +SD_WAKE_AFFINE | +SD_WAKE_BALANCE)) + return 0; + + return 1; +} + +static int __devinit sd_parent_degenerate(struct sched_domain *sd, + struct sched_domain *parent) +{ + unsigned long cflags = sd->flags, pflags = parent->flags; + + if (sd_degenerate(parent)) + return 1; + + if (!cpus_equal(sd->span, parent->span)) + return 0; + + /* Does parent contain flags not in child? */ + /* WAKE_BALANCE is a subset of WAKE_AFFINE */ + if (cflags & SD_WAKE_AFFINE) + pflags &= ~SD_WAKE_BALANCE; + if ((~sd->flags) & parent->flags) + return 0; + + return 1; +} + /* * Attach the domain 'sd' to 'cpu' as its base domain. Callers must * hold the hotplug lock. @@ -4819,6 +4876,19 @@ void __devinit cpu_attach_domain(struct unsigned long flags; runqueue_t *rq = cpu_rq(cpu); int local = 1; + struct sched_domain *tmp; + + /* Remove the sched domains which do not contribute to scheduling. 
*/ + for (tmp = sd; tmp; tmp = tmp->parent) { + struct sched_domain *parent = tmp->parent; + if (!parent) + break; + if (sd_parent_degenerate(tmp, parent)) + tmp->parent = parent->parent; + } + + if (sd_degenerate(sd)) + sd = sd->parent; sched_domain_debug(sd, cpu);
[patch 2/5] sched: NULL domains
2/5 The previous patch fixed the last 2 places that directly access a runqueue's sched-domain and assume it cannot be NULL. We can now use a NULL domain instead of a dummy domain to signify no balancing is to happen. No functional changes. Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> Index: linux-2.6/kernel/sched.c === --- linux-2.6.orig/kernel/sched.c 2005-04-05 16:38:40.0 +1000 +++ linux-2.6/kernel/sched.c2005-04-05 18:39:08.0 +1000 @@ -4887,7 +4887,7 @@ void __devinit cpu_attach_domain(struct tmp->parent = parent->parent; } - if (sd_degenerate(sd)) + if (sd && sd_degenerate(sd)) sd = sd->parent; sched_domain_debug(sd, cpu); @@ -5054,7 +5054,7 @@ static void __devinit arch_init_sched_do cpus_and(cpu_default_map, cpu_default_map, cpu_online_map); /* -* Set up domains. Isolated domains just stay on the dummy domain. +* Set up domains. Isolated domains just stay on the NULL domain. */ for_each_cpu_mask(i, cpu_default_map) { int group; @@ -5167,18 +5167,11 @@ static void __devinit arch_destroy_sched #endif /* ARCH_HAS_SCHED_DOMAIN */ -/* - * Initial dummy domain for early boot and for hotplug cpu. Being static, - * it is initialized to zero, so all balancing flags are cleared which is - * what we want. - */ -static struct sched_domain sched_domain_dummy; - #ifdef CONFIG_HOTPLUG_CPU /* * Force a reinitialization of the sched domains hierarchy. The domains * and groups cannot be updated in place without racing with the balancing - * code, so we temporarily attach all running cpus to a "dummy" domain + * code, so we temporarily attach all running cpus to the NULL domain * which will prevent rebalancing while the sched domains are recalculated. */ static int update_sched_domains(struct notifier_block *nfb, @@ -5190,7 +5183,7 @@ static int update_sched_domains(struct n case CPU_UP_PREPARE: case CPU_DOWN_PREPARE: for_each_online_cpu(i) - cpu_attach_domain(&sched_domain_dummy, i); + cpu_attach_domain(NULL, i); arch_destroy_sched_domains(); return NOTIFY_OK; @@ -5253,7 +5246,7 @@ void __init sched_init(void) rq->best_expired_prio = MAX_PRIO; #ifdef CONFIG_SMP - rq->sd = &sched_domain_dummy; + rq->sd = NULL; for (j = 1; j < 3; j++) rq->cpu_load[j] = 0; rq->active_balance = 0;
[patch 3/5] sched: multilevel sbe and sbf
3/5 The fundamental problem that Suresh has with balance on exec and fork is that it only tries to balance the top level domain with the flag set. This was worked around by removing degenerate domains, but is still a problem if people want to start using more complex sched-domains, especially multilevel NUMA that ia64 is already using. This patch makes balance on fork and exec try balancing over not just the top most domain with the flag set, but all the way down the domain tree. Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> Index: linux-2.6/kernel/sched.c === --- linux-2.6.orig/kernel/sched.c 2005-04-05 16:38:53.0 +1000 +++ linux-2.6/kernel/sched.c2005-04-05 18:39:07.0 +1000 @@ -1320,21 +1320,24 @@ void fastcall wake_up_new_task(task_t * sd = tmp; if (sd) { + cpumask_t span; int new_cpu; struct sched_group *group; +again: schedstat_inc(sd, sbf_cnt); + span = sd->span; cpu = task_cpu(p); group = find_idlest_group(sd, p, cpu); if (!group) { schedstat_inc(sd, sbf_balanced); - goto no_forkbalance; + goto nextlevel; } new_cpu = find_idlest_cpu(group, cpu); if (new_cpu == -1 || new_cpu == cpu) { schedstat_inc(sd, sbf_balanced); - goto no_forkbalance; + goto nextlevel; } if (cpu_isset(new_cpu, p->cpus_allowed)) { @@ -1344,9 +1347,21 @@ void fastcall wake_up_new_task(task_t * rq = task_rq_lock(p, &flags); cpu = task_cpu(p); } + + /* Now try balancing at a lower domain level */ +nextlevel: + sd = NULL; + for_each_domain(cpu, tmp) { + if (cpus_subset(span, tmp->span)) + break; + if (tmp->flags & SD_BALANCE_FORK) + sd = tmp; + } + + if (sd) + goto again; } -no_forkbalance: #endif /* * We decrease the sleep average of forking parents @@ -1712,25 +1727,41 @@ void sched_exec(void) sd = tmp; if (sd) { + cpumask_t span; struct sched_group *group; +again: schedstat_inc(sd, sbe_cnt); + span = sd->span; group = find_idlest_group(sd, current, this_cpu); if (!group) { schedstat_inc(sd, sbe_balanced); - goto out; + goto nextlevel; } new_cpu = find_idlest_cpu(group, this_cpu); if (new_cpu == -1 || new_cpu == this_cpu) { schedstat_inc(sd, sbe_balanced); - goto out; + goto nextlevel; } schedstat_inc(sd, sbe_pushed); put_cpu(); sched_migrate_task(current, new_cpu); - return; + + /* Now try balancing at a lower domain level */ + this_cpu = get_cpu(); +nextlevel: + sd = NULL; + for_each_domain(this_cpu, tmp) { + if (cpus_subset(span, tmp->span)) + break; + if (tmp->flags & SD_BALANCE_EXEC) + sd = tmp; + } + + if (sd) + goto again; } -out: + put_cpu(); }
[patch 4/5] sched: RCU sched domains
4/5 One of the problems with the multilevel balance-on-fork/exec is that it needs to jump through hoops to satisfy sched-domain's locking semantics (that is, you may traverse your own domain when not preemptable, and you may traverse others' domains when holding their runqueue lock). balance-on-exec had to potentially migrate between more than one CPU before finding a final CPU to migrate to, and balance-on-fork needed to potentially take multiple runqueue locks. So bite the bullet and make sched-domains go completely RCU. This actually simplifies the code quite a bit. Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> Index: linux-2.6/kernel/sched.c === --- linux-2.6.orig/kernel/sched.c 2005-04-05 16:39:14.0 +1000 +++ linux-2.6/kernel/sched.c2005-04-05 18:39:05.0 +1000 @@ -825,22 +825,12 @@ inline int task_curr(const task_t *p) } #ifdef CONFIG_SMP -enum request_type { - REQ_MOVE_TASK, - REQ_SET_DOMAIN, -}; - typedef struct { struct list_head list; - enum request_type type; - /* For REQ_MOVE_TASK */ task_t *task; int dest_cpu; - /* For REQ_SET_DOMAIN */ - struct sched_domain *sd; - struct completion done; } migration_req_t; @@ -862,7 +852,6 @@ static int migrate_task(task_t *p, int d } init_completion(&req->done); - req->type = REQ_MOVE_TASK; req->task = p; req->dest_cpu = dest_cpu; list_add(&req->list, &rq->migration_queue); @@ -4365,17 +4354,9 @@ static int migration_thread(void * data) req = list_entry(head->next, migration_req_t, list); list_del_init(head->next); - if (req->type == REQ_MOVE_TASK) { - spin_unlock(&rq->lock); - __migrate_task(req->task, cpu, req->dest_cpu); - local_irq_enable(); - } else if (req->type == REQ_SET_DOMAIN) { - rq->sd = req->sd; - spin_unlock_irq(&rq->lock); - } else { - spin_unlock_irq(&rq->lock); - WARN_ON(1); - } + spin_unlock(&rq->lock); + __migrate_task(req->task, cpu, req->dest_cpu); + local_irq_enable(); complete(&req->done); } @@ -4606,7 +4587,6 @@ static int migration_call(struct notifie migration_req_t *req; req = list_entry(rq->migration_queue.next, migration_req_t, list); - BUG_ON(req->type != REQ_MOVE_TASK); list_del_init(&req->list); complete(&req->done); } @@ -4903,10 +4883,7 @@ static int __devinit sd_parent_degenerat */ void __devinit cpu_attach_domain(struct sched_domain *sd, int cpu) { - migration_req_t req; - unsigned long flags; runqueue_t *rq = cpu_rq(cpu); - int local = 1; struct sched_domain *tmp; /* Remove the sched domains which do not contribute to scheduling. */ @@ -4923,24 +4900,7 @@ void __devinit cpu_attach_domain(struct sched_domain_debug(sd, cpu); - spin_lock_irqsave(&rq->lock, flags); - - if (cpu == smp_processor_id() || !cpu_online(cpu)) { - rq->sd = sd; - } else { - init_completion(&req.done); - req.type = REQ_SET_DOMAIN; - req.sd = sd; - list_add(&req.list, &rq->migration_queue); - local = 0; - } - - spin_unlock_irqrestore(&rq->lock, flags); - - if (!local) { - wake_up_process(rq->migration_thread); - wait_for_completion(&req.done); - } + rq->sd = sd; } /* cpus with isolated domains */ @@ -5215,6 +5175,7 @@ static int update_sched_domains(struct n case CPU_DOWN_PREPARE: for_each_online_cpu(i) cpu_attach_domain(NULL, i); + synchronize_kernel(); arch_destroy_sched_domains(); return NOTIFY_OK;
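[Editorial sketch, not part of the original mail: the convention the RCU conversion implies, using the 2.6.12-era primitives that appear in the patch above. new_sd is a placeholder for a freshly built domain tree; the reader-side rule is the one stated in the changelog, i.e. a domain walk must run non-preemptibly.]

	/* Reader: the tree hanging off rq->sd may be replaced at any time,
	 * so traverse it with preemption disabled (the RCU read side here). */
	preempt_disable();
	for_each_domain(cpu, sd) {
		/* inspect sd->span, sd->flags, ...; must not sleep */
	}
	preempt_enable();

	/* Updater (cf. cpu_attach_domain()/update_sched_domains() above):
	 * publish the new tree, wait out all pre-existing readers, and only
	 * then tear the old domains down. */
	rq->sd = new_sd;		/* plain assignment, as in the patch */
	synchronize_kernel();		/* every CPU passes a quiescent state */
	arch_destroy_sched_domains();	/* old tree is no longer reachable */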
[patch 5/5] sched: consolidate sbe sbf
5/5 Any ideas about what to do with schedstats? Do we really need balance on exec and fork as seperate statistics? Consolidate balance-on-exec with balance-on-fork. This is made easy by the sched-domains RCU patches. As well as the general goodness of code reduction, this allows the runqueues to be unlocked during balance-on-fork. schedstats is a problem. Maybe just have balance-on-event instead of distinguishing fork and exec? Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> Index: linux-2.6/kernel/sched.c === --- linux-2.6.orig/kernel/sched.c 2005-04-05 18:39:14.0 +1000 +++ linux-2.6/kernel/sched.c2005-04-05 18:40:18.0 +1000 @@ -1013,8 +1013,57 @@ static int find_idlest_cpu(struct sched_ return idlest; } +/* + * sched_balance_self: balance the current task (running on cpu) in domains + * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and + * SD_BALANCE_EXEC. + * + * Balance, ie. select the least loaded group. + * + * Returns the target CPU number, or the same CPU if no balancing is needed. + * + * preempt must be disabled. + */ +static int sched_balance_self(int cpu, int flag) +{ + struct task_struct *t = current; + struct sched_domain *tmp, *sd = NULL; -#endif + for_each_domain(cpu, tmp) + if (tmp->flags & flag) + sd = tmp; + + while (sd) { + cpumask_t span; + struct sched_group *group; + int new_cpu; + + span = sd->span; + group = find_idlest_group(sd, t, cpu); + if (!group) + goto nextlevel; + + new_cpu = find_idlest_cpu(group, cpu); + if (new_cpu == -1 || new_cpu == cpu) + goto nextlevel; + + /* Now try balancing at a lower domain level */ + cpu = new_cpu; +nextlevel: + sd = NULL; + for_each_domain(cpu, tmp) { + if (cpus_subset(span, tmp->span)) + break; + if (tmp->flags & flag) + sd = tmp; + } + /* while loop will break here if sd == NULL */ + } + + return cpu; +} + +#endif /* CONFIG_SMP */ /* * wake_idle() will wake a task on an idle cpu if task->cpu is @@ -1295,63 +1344,22 @@ void fastcall wake_up_new_task(task_t * int this_cpu, cpu; runqueue_t *rq, *this_rq; #ifdef CONFIG_SMP - struct sched_domain *tmp, *sd = NULL; -#endif + int new_cpu; + cpu = task_cpu(p); + preempt_disable(); + new_cpu = sched_balance_self(cpu, SD_BALANCE_FORK); + preempt_enable(); + if (new_cpu != cpu) + set_task_cpu(p, new_cpu); +#endif + + cpu = task_cpu(p); rq = task_rq_lock(p, &flags); - BUG_ON(p->state != TASK_RUNNING); this_cpu = smp_processor_id(); - cpu = task_cpu(p); - -#ifdef CONFIG_SMP - for_each_domain(cpu, tmp) - if (tmp->flags & SD_BALANCE_FORK) - sd = tmp; - - if (sd) { - cpumask_t span; - int new_cpu; - struct sched_group *group; - -again: - schedstat_inc(sd, sbf_cnt); - span = sd->span; - cpu = task_cpu(p); - group = find_idlest_group(sd, p, cpu); - if (!group) { - schedstat_inc(sd, sbf_balanced); - goto nextlevel; - } - new_cpu = find_idlest_cpu(group, cpu); - if (new_cpu == -1 || new_cpu == cpu) { - schedstat_inc(sd, sbf_balanced); - goto nextlevel; - } - - if (cpu_isset(new_cpu, p->cpus_allowed)) { - schedstat_inc(sd, sbf_pushed); - set_task_cpu(p, new_cpu); - task_rq_unlock(rq, &flags); - rq = task_rq_lock(p, &flags); - cpu = task_cpu(p); - } - - /* Now try balancing at a lower domain level */ -nextlevel: - sd = NULL; - for_each_domain(cpu, tmp) { - if (cpus_subset(span, tmp->span)) - break; - if (tmp->flags & SD_BALANCE_FORK) - sd = tmp; - } - - if (sd) - goto again; - } + BUG_ON(p->state != TASK_RUNNING); -#endif /* * We decrease the sleep average of forking parents * and children as well, to keep max-interactive tasks @@ -1699,59 +1707,17 @@ out: task_rq_unlock(rq, 
&flags); } -/* - * sche
Re: [patch 1/5] sched: remove degenerate domains
Ingo Molnar wrote: * Nick Piggin <[EMAIL PROTECTED]> wrote: This is Suresh's patch with some modifications. Remove degenerate scheduler domains during the sched-domain init. actually, i'd suggest to not do this patch. The point of booting with a CONFIG_NUMA kernel on a non-NUMA box is mostly for testing, and the 'degenerate' toplevel domain exposed conceptual bugs in the sched-domains code. In that sense removing such 'unnecessary' domains inhibits debuggability to a certain degree. If we had this patch earlier we'd not have experienced the wrong decisions taken by the scheduler, only on the much rarer 'really NUMA' boxes. True. Although I'd imagine it may be something distros want. For example, a generic x86-64 kernel for both AMD and Intel systems could easily have SMT and NUMA turned on. I agree with the downside of exercising fewer code paths though. What about putting it in as a config option (default to off for 2.6) in the embedded config menu? is there any case where we'd want to simplify the domain tree? One more domain level is just one (and very minor) aspect of CONFIG_NUMA - i'd not want to run a CONFIG_NUMA kernel on a non-NUMA box, even if the domain tree got optimized. Hm? I guess there is the SMT issue too, and even booting an SMP kernel on a UP system. Also small ia64 NUMA systems will probably have one redundant NUMA level. If/when topologies get more complex (for example, the recent Altix discussions we had with Paul), it will generally be easier to set up all levels in a generic way, then weed them out using something like this, rather than put the logic in the domain setup code. Nick -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] sched: NULL domains
Ingo Molnar wrote: * Ingo Molnar <[EMAIL PROTECTED]> wrote: * Nick Piggin <[EMAIL PROTECTED]> wrote: 2/5 The previous patch fixed the last 2 places that directly access a runqueue's sched-domain and assume it cannot be NULL. We can now use a NULL domain instead of a dummy domain to signify no balancing is to happen. No functional changes. Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]> ^^^ Thanks. if the previous 'remove degenerate domains' patch goes away then this patch needs to be merged/modified (and most of the others as well). I probably should respin this so it goes in *first* anyway, rather than doing half in the remove-degenerate-domains patch and half here. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 3/5] sched: multilevel sbe and sbf
Ingo Molnar wrote: Acked-by: Ingo Molnar <[EMAIL PROTECTED]> note that no matter how much scheduler logic we add, in the end cross-scheduling of tasks between nodes on NUMA will always have a permanent penalty (i.e. the 'migration cost' is 'infinity' in the long run), so the primary focus _has to be_ on 'get it right initially'. Tasks having to spill over to other nodes will always remain a special case. So balance-on-fork/exec/[clone] definitely needs to be aware of the full domain tree picture. Yes, well put. I imagine this will only become more important as the push towards multiprocessing machines grows, along with the need for higher memory bandwidth and lower latency to CPUs. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 4/5] sched: RCU sched domains
Ingo Molnar wrote: * Nick Piggin <[EMAIL PROTECTED]> wrote: 4/5 One of the problems with the multilevel balance-on-fork/exec is that it needs to jump through hoops to satisfy sched-domain's locking semantics (that is, you may traverse your own domain when not preemptable, and you may traverse others' domains when holding their runqueue lock). balance-on-exec had to potentially migrate between more than one CPU before finding a final CPU to migrate to, and balance-on-fork needed to potentially take multiple runqueue locks. So bite the bullet and make sched-domains go completely RCU. This actually simplifies the code quite a bit. Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> i like it conceptually, so: Acked-by: Ingo Molnar <[EMAIL PROTECTED]> Oh good, thanks. from now on, all domain-tree readonly uses have to be rcu_read_lock()-ed (or otherwise have to be in a non-preemptible section). But there's a bug in show_schedstat() which does a for_each_domain() from within a preemptible section. (It was a bug with the current hotplug logic too i think.) Ah, thanks. That looks like a bug in the code with the locking we have now too... At a minimum i think we need the fix+comment below. Well if we say "this is actually RCU", then yes. And we should probably change the preempt_{dis|en}ables in other places to rcu_read_lock. OTOH, if we say we just want all running threads to process through a preemption stage, then this would just be a preempt_disable/enable pair. In practice that makes no difference yet, but it looks like you and Paul are working to distinguish these two cases in the RCU code, to accommodate your low latency RCU stuff? I'd prefer the latter (i.e. just disable preempt, and use synchronize_sched), but I'm not too sure of what is going on with your low latency RCU work...? Ingo Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]> Thanks for catching that. I may just push it through first as a fix to the current 2.6 schedstats code (using preempt_disable), and afterwards we can change it to rcu_read_lock if that is required. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
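[Editorial sketch, not the patch Ingo attached (which is not preserved in this archive): the pattern under discussion is simply to make the for_each_domain() walk in show_schedstat() non-preemptible, since the domain tree can now be freed via RCU.]

	/* in show_schedstat(), around the per-domain statistics: */
	preempt_disable();	/* domain tree may be freed after a grace period */
	for_each_domain(cpu, sd) {
		/* ... print the per-domain counters into the seq_file ... */
	}
	preempt_enable();

As the later follow-up in this thread notes, the same section could equally be bracketed with rcu_read_lock()/rcu_read_unlock() once the "this is really RCU" convention is settled.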
Re: [patch 5/5] sched: consolidate sbe sbf
Ingo Molnar wrote: * Nick Piggin <[EMAIL PROTECTED]> wrote: 5/5 Any ideas about what to do with schedstats? Do we really need balance on exec and fork as separate statistics? Consolidate balance-on-exec with balance-on-fork. This is made easy by the sched-domains RCU patches. As well as the general goodness of code reduction, this allows the runqueues to be unlocked during balance-on-fork. schedstats is a problem. Maybe just have balance-on-event instead of distinguishing fork and exec? Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> looks good. One problem I just noticed, sorry. This is doing set_cpus_allowed without holding the runqueue lock and without checking the hard affinity mask either. We could just do a set_cpus_allowed, or take the lock, set_cpus_allowed, and take the new lock, but that's probably a bit heavy if we can avoid it. In the interests of speed in this fast path, do you think we can do this in sched_fork, before the task has even been put on the tasklist? That would avoid all locking problems. Passing clone_flags into sched_fork would not be a problem if we want to distinguish fork() and clone(CLONE_VM). Yes? I'll cut a new patch to do just that. Acked-by: Ingo Molnar <[EMAIL PROTECTED]> while the code is now consolidated, i think we still need the separate fork/exec stats for schedstat. This makes it a bit harder, then, to get good stats in the sched-domain (which is really what we want). It would basically mean doing if (balance fork) schedstat_inc(sbf_cnt); else if (balance exec) schedstat_inc(sbe_cnt); etc. That should all get optimised out by the compiler, but it is still a bit ugly. Any ideas? -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
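[Editorial sketch of the if/else Ingo describes, assuming sched_balance_self() keeps a 'flag' argument; sbf_cnt/sbe_cnt are the existing schedstat fields. Where the flag is a compile-time constant at an inlined call site, the dead branch should indeed be optimised away.]

	/* inside the consolidated balancing path: */
	if (flag & SD_BALANCE_FORK)
		schedstat_inc(sd, sbf_cnt);
	else if (flag & SD_BALANCE_EXEC)
		schedstat_inc(sd, sbe_cnt);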
Re: [patch 1/5] sched: remove degenerate domains
Ingo Molnar wrote: * Siddha, Suresh B <[EMAIL PROTECTED]> wrote: Similarly I am working on adding a new core domain for dual-core systems! All these domains are unnecessary and cause performance issues on non multi-threading/multi-core capable CPUs! Agreed that the performance impact will be minor but still... ok, let's keep it then. It may in fact simplify the domain setup code: we could generate the 'most generic' layout for a given arch all the time, and then optimize it automatically. I.e. in theory we could have just a single domain-setup routine, which would e.g. generate the NUMA domains on SMP too, which would then be optimized away. Yep, exactly. Even so, Andrew: please ignore this patch series and I'll redo it for you when we all agree on everything. Thanks. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 5/5] sched: consolidate sbe sbf
Nick Piggin wrote: One problem I just noticed, sorry. This is doing set_cpus_allowed without holding the runqueue lock and without checking the hard affinity mask either. Err, that is to say set_task_cpu, not set_cpus_allowed. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Coding style: mixed-case
Kenneth Aafløy wrote: On Wednesday 06 April 2005 04:09, Matt Mackall wrote: While there may be reasons why mixed case is suboptimal, the real reason is that it's hard to keep track of which style is used where. It's annoying and error-prone to have to remember the naming format for everything in addition to its name. As most things are in a standard style, things are made easier by having every piece of new code follow that style and let us slowly approach uniformity. My primary concern was this: why does the kernel's own coding style deviate from the advice given in its documentation? Other than that Probably it's been like that for a long time, and nobody has really bothered to change it. If you posted a patch for pf_locked() and friends (and note that it's lowercase to match function-like usage), you'd probably find some enthusiasts and some naysayers. Most of the naysayers would object on the grounds of "it ain't broke", but if someone were to do it as part of a series of more substantial clean-ups, it'd likely be accepted. Certainly I would like to have a go at a patch, but I must say that I do not feel particularly familiar with the code in question to make such a change. I would have risen to the challenge had this been a driver level change, but the mmu is something that I will not touch until I feel comfortable. Well, the only patch that could possibly be considered would be a straight search and replace, with absolutely no functional changes; I think you would be up to it ;) A few suggestions: Don't use PF_*. That namespace is already being used by at least process flags and protocol flags. Maybe page_locked, page_dirty, etc. would be better. There could be quite a bit of external code using these interfaces. Typically we wouldn't just rename public interfaces in a stable series "just because", but the rules are a bit different for 2.6. Your best bet would be to first do a patch to create the new interface names but keep the old ones in place for backwards compatibility (just #defined to the new name), then a second patch to convert over all the in-kernel users. The compatibility stuff can be removed in N years. Lastly, it is quite likely that many people will consider this to be more trouble than it's worth. So keep in mind it is not guaranteed to get included. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
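[Editorial sketch of the two-step rename Nick outlines; page_locked() is a hypothetical new name, not an existing kernel interface.]

	/* step 1: add the new names (e.g. in include/linux/page-flags.h) */
	static inline int page_locked(struct page *page)	/* hypothetical */
	{
		return test_bit(PG_locked, &page->flags);
	}

	/* keep the old spelling as a compatibility alias for external users */
	#define PageLocked(page)	page_locked(page)

	/* step 2, as a separate patch: convert all in-kernel callers to the
	 * new names; the alias can be removed much later. */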
Re: return value of ptep_get_and_clear
Kumar Gala wrote: ptep_get_and_clear has a signature that looks something like: static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep) It appears that it's supposed to return the pte_t pointed to by ptep before it's modified. Why do we bother doing this? The caller seems perfectly able to dereference ptep and hold on to it. Am I missing something here? You need to be able to *atomically* clear the pte and retrieve the old value. Nick -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
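[Editorial illustration of why the atomicity matters, not code from the thread: the page-table walker can set the Accessed/Dirty bits in a pte concurrently with the kernel, so a plain read-then-clear can lose a dirty-bit update made in the window. The i386 non-PAE implementation is roughly an xchg:]

	/* racy: the MMU may set _PAGE_DIRTY between the two statements,
	 * and that update is then overwritten and lost */
	pte_t old = *ptep;
	set_pte(ptep, __pte(0));

	/* atomic: fetch the old value and clear the pte in one operation */
	static inline pte_t example_ptep_get_and_clear(pte_t *ptep)	/* illustrative */
	{
		return __pte(xchg(&ptep->pte_low, 0));
	}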
Re: 2.6.12-rc2-mm1
Siddha, Suresh B wrote: On Tue, Apr 05, 2005 at 05:33:49PM +1000, Nick Piggin wrote: Lastly, I'd like to be a bit less intrusive with pinned task handling improvements. I think we can do this while still being effective in preventing livelocks. We want to see this fixed. Please post your patch and I can let you know the test results. Using the attached patch, a puny dual PIII-650 with ~400MB RAM swapped itself to death after 2 infinite loop tasks had been pinned to one of the CPUs. See how you go. -- SUSE Labs, Novell Inc. Index: linux-2.6/kernel/sched.c === --- linux-2.6.orig/kernel/sched.c 2005-04-07 02:39:22.0 +1000 +++ linux-2.6/kernel/sched.c2005-04-07 02:45:26.0 +1000 @@ -2041,6 +2041,12 @@ static runqueue_t *find_busiest_queue(st } /* + * Max backoff if we encounter pinned tasks. Pretty arbitrary value, but + * so long as it is large enough. + */ +#define MAX_PINNED_INTERVAL1024 + +/* * Check this_cpu to ensure it is balanced within domain. Attempt to move * tasks if there is an imbalance. * @@ -2052,7 +2058,7 @@ static int load_balance(int this_cpu, ru struct sched_group *group; runqueue_t *busiest; unsigned long imbalance; - int nr_moved, all_pinned; + int nr_moved, all_pinned = 0; int active_balance = 0; spin_lock(&this_rq->lock); @@ -2143,7 +2149,8 @@ out_balanced: sd->nr_balance_failed = 0; /* tune up the balancing interval */ - if (sd->balance_interval < sd->max_interval) + if ((all_pinned && sd->balance_interval < MAX_PINNED_INTERVAL) || + (sd->balance_interval < sd->max_interval)) sd->balance_interval *= 2; return 0;
[patch] sched: consolidate sbe sbf
Hi Ingo, What do you think of the following patch? I won't send the whole series again, I'll queue them up with Andrew if you think this one looks OK (which is the only major change). Thanks, Nick -- SUSE Labs, Novell Inc. Consolidate balance-on-exec with balance-on-fork. This is made easy by the sched-domains RCU patches. As well as the general goodness of code reduction, this allows the runqueues to be unlocked during balance-on-fork. schedstats is a problem. Maybe just have balance-on-event instead of distinguishing fork and exec? Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> Index: linux-2.6/kernel/sched.c === --- linux-2.6.orig/kernel/sched.c 2005-04-07 02:39:21.0 +1000 +++ linux-2.6/kernel/sched.c2005-04-07 12:34:06.0 +1000 @@ -1022,8 +1022,57 @@ static int find_idlest_cpu(struct sched_ return idlest; } +/* + * sched_balance_self: balance the current task (running on cpu) in domains + * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and + * SD_BALANCE_EXEC. + * + * Balance, ie. select the least loaded group. + * + * Returns the target CPU number, or the same CPU if no balancing is needed. + * + * preempt must be disabled. + */ +static int sched_balance_self(int cpu, int flag) +{ + struct task_struct *t = current; + struct sched_domain *tmp, *sd = NULL; -#endif + for_each_domain(cpu, tmp) + if (tmp->flags & flag) + sd = tmp; + + while (sd) { + cpumask_t span; + struct sched_group *group; + int new_cpu; + + span = sd->span; + group = find_idlest_group(sd, t, cpu); + if (!group) + goto nextlevel; + + new_cpu = find_idlest_cpu(group, cpu); + if (new_cpu == -1 || new_cpu == cpu) + goto nextlevel; + + /* Now try balancing at a lower domain level */ + cpu = new_cpu; +nextlevel: + sd = NULL; + for_each_domain(cpu, tmp) { + if (cpus_subset(span, tmp->span)) + break; + if (tmp->flags & flag) + sd = tmp; + } + /* while loop will break here if sd == NULL */ + } + + return cpu; +} + +#endif /* CONFIG_SMP */ /* * wake_idle() will wake a task on an idle cpu if task->cpu is @@ -1241,8 +1290,17 @@ int fastcall wake_up_state(task_t *p, un * Perform scheduler related setup for a newly forked process p. * p is forked by current. */ -void fastcall sched_fork(task_t *p) +void fastcall sched_fork(task_t *p, int clone_flags) { + int cpu = smp_processor_id(); + +#ifdef CONFIG_SMP + preempt_disable(); + cpu = sched_balance_self(cpu, SD_BALANCE_FORK); + preempt_enable(); +#endif + set_task_cpu(p, cpu); + /* * We mark the process as running here, but have not actually * inserted it onto the runqueue yet. 
This guarantees that @@ -1303,64 +1361,12 @@ void fastcall wake_up_new_task(task_t * unsigned long flags; int this_cpu, cpu; runqueue_t *rq, *this_rq; -#ifdef CONFIG_SMP - struct sched_domain *tmp, *sd = NULL; -#endif rq = task_rq_lock(p, &flags); BUG_ON(p->state != TASK_RUNNING); this_cpu = smp_processor_id(); cpu = task_cpu(p); -#ifdef CONFIG_SMP - for_each_domain(cpu, tmp) - if (tmp->flags & SD_BALANCE_FORK) - sd = tmp; - - if (sd) { - cpumask_t span; - int new_cpu; - struct sched_group *group; - -again: - schedstat_inc(sd, sbf_cnt); - span = sd->span; - cpu = task_cpu(p); - group = find_idlest_group(sd, p, cpu); - if (!group) { - schedstat_inc(sd, sbf_balanced); - goto nextlevel; - } - - new_cpu = find_idlest_cpu(group, cpu); - if (new_cpu == -1 || new_cpu == cpu) { - schedstat_inc(sd, sbf_balanced); - goto nextlevel; - } - - if (cpu_isset(new_cpu, p->cpus_allowed)) { - schedstat_inc(sd, sbf_pushed); - set_task_cpu(p, new_cpu); - task_rq_unlock(rq, &flags); - rq = task_rq_lock(p, &flags); - cpu = task_cpu(p); - } - - /* Now try balancing at a lower domain level */ -nextlevel: - sd = NULL; - for_each_domain(cpu, tmp) { - if (cpus_subset(span, tmp->span)) -
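[Editorial note: the hunk updating the caller is not visible in this archive (the diff above is truncated); the new sched_fork() signature implies a change along these lines at the copy_process() call site in kernel/fork.c.]

-	sched_fork(p);
+	sched_fork(p, clone_flags);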
Re: [patch 4/5] sched: RCU sched domains
Ingo Molnar wrote: * Nick Piggin <[EMAIL PROTECTED]> wrote: At a minimum i think we need the fix+comment below. Well if we say "this is actually RCU", then yes. And we should probably change the preempt_{dis|en}ables in other places to rcu_read_lock. OTOH, if we say we just want all running threads to process through a preemption stage, then this would just be a preempt_disable/enable pair. In practice that makes no difference yet, but it looks like you and Paul are working to distinguish these two cases in the RCU code, to accommodate your low latency RCU stuff? it doesn't impact PREEMPT_RCU/PREEMPT_RT directly, because the scheduler itself always needs to be non-preemptible. those few places where we currently do preempt_disable(), which should thus be rcu_read_lock(), are never in codepaths that can take a lot of time. but yes, in principle you are right, but in this particular (and special) case it's not a big issue. We should document the RCU read-lock dependencies cleanly and make all rcu-read-lock cases truly rcu_read_lock(), but it's not a pressing issue even considering possible future features like PREEMPT_RT. the only danger in this area is to PREEMPT_RT: it is a bug on PREEMPT_RT if kernel code has an implicit 'spinlock means preempt-off and thus RCU-read-lock' assumption. Most of the time these get discovered via PREEMPT_DEBUG. (preempt_disable() disables preemption on PREEMPT_RT too, so that is not a problem either.) OK, thanks for the good explanation. So I'll keep it as is for now, and whatever needs cleaning up later can be worked out as it comes up. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/