Re: [patch] mm: fix PageUptodate data race

2008-01-31 Thread Nick Piggin
Sorry, way behind on email here. I'll get through it slowly...

On Sat, Jan 26, 2008 at 10:03:56PM -0800, Andrew Morton wrote:
> > On Tue, 22 Jan 2008 05:01:14 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > 
> > After running SetPageUptodate, preceding stores to the page contents to
> > actually bring it uptodate may not be ordered with the store to set the page
> > uptodate.
> > 
> > Therefore, another CPU which checks PageUptodate is true, then reads the
> > page contents can get stale data.
> > 
> > Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
> > PageUptodate.
> > 
> > Many places that test PageUptodate do so with the page locked, and this
> > would be enough to ensure memory ordering in those places if SetPageUptodate
> > were only called while the page is locked. Unfortunately that is not always
> > the case for some filesystems, but it could be an idea for the future.
> > 
> > Also bring the handling of anonymous page uptodateness in line with that of
> > file backed page management, by marking anon pages as uptodate when they
> > _are_ uptodate, rather than when our implementation requires that they be
> > marked as such. Doing so allows us to get rid of the smp_wmb's in the page
> > copying functions, which were especially added for anonymous pages for an
> > analogous memory ordering problem. Both file and anonymous pages are handled
> > with the same barriers.
> > 
> 
> So...  it's two patches in one.

I guess so. Hmm, at least I appreciate it (them) getting testing in -mm
for now. I guess I should break it in two; do you agree, Hugh? Do you
like/dislike the anonymous page change?
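
For reference when reviewing the split patches, the ordering the first
patch establishes is just this pairing (illustrative sketch, not the
actual diff):

    /* writer, bringing the page uptodate: */
    ... stores filling in the page contents ...
    smp_wmb();      /* order the contents before the uptodate bit */
    SetPageUptodate(page);

    /* reader: */
    if (PageUptodate(page)) {
        smp_rmb();  /* order the uptodate check before the contents */
        ... loads of the page contents ...
    }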


> What kernel is this against?  Looks like mainline.  Is it complete and
> correct when applied against the large number of pending MM changes?

Uh, I forget. But luckily this one should be quite correct regardless
of pending mm changes... unless something there has fundamentally changed
the semantics or locking of PG_uptodate... which wouldn't be too surprising
actually ;)

No, it should be OK. I'll double check when I look at resubmitting it as
2 patches.


Re: x86 arch updates also broke s390

2008-01-31 Thread Nick Piggin
On Thu, Jan 31, 2008 at 11:24:54AM +0100, Ingo Molnar wrote:
> 
> * Martin Schwidefsky <[EMAIL PROTECTED]> wrote:
> 
> > On Thu, 2008-01-31 at 02:33 +0200, Adrian Bunk wrote:
> > > <--  snip  -->
> > > 
> > > ...
> > >   CC  arch/s390/kernel/asm-offsets.s
> > > In file included from 
> > > /home/bunk/linux/kernel-2.6/git/linux-2.6/arch/s390/kernel/asm-offsets.c:7:
> > > /home/bunk/linux/kernel-2.6/git/linux-2.6/include/linux/sched.h: In
> > > function 'spin_needbreak':
> > > /home/bunk/linux/kernel-2.6/git/linux-2.6/include/linux/sched.h:1931:
> > > error: implicit declaration of function '__raw_spin_is_contended'
> > > make[2]: *** [arch/s390/kernel/asm-offsets.s] Error 1
> > > 
> > > <--  snip  -->
> > 
> > Defining GENERIC_LOCKBREAK in arch/s390/Kconfig takes care of it. I'll 
> > cook up a patch and queue it in git390.
> 
> thanks!

Yeah thanks, I don't know what happened with this, sorry. I thought I
had defined it for all SMP-capable architectures, so maybe it was a quilt
error or something on my part.


Re: 2.6.24 regression: pan hanging unkilleable and un-straceable

2008-02-01 Thread Nick Piggin
On Friday 01 February 2008 09:45, Frederik Himpe wrote:
> On ma, 2008-01-28 at 12:46 +1100, Nick Piggin wrote:
> > On Sunday 27 January 2008 00:29, Frederik Himpe wrote:
> > > On di, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:
> > > > > > On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> > > > > > > With Linux 2.6.24-rc8 I often have the problem that the pan
> > > > > > > usenet reader starts using 100% of CPU time after some time.
> > > > > > > When this happens, kill -9 does not work, and strace just hangs
> > > > > > > when trying to attach to the process. The same with gdb. ps
> > > > > > > shows the process as being in the R state.
> >
> > Well after trying a lot of writev combinations, I've reproduced a hang
> > *hangs head*.
> >
> > Does this help?
>
> Just to confirm: in four days of testing, I haven't seen the problem
> anymore, so it looks like this was indeed the right fix.

Thanks very much for reporting and testing. This patch needs to go
into 2.6.24.stable and upstream.


Re: [bug] as_merged_requests(): possible recursive locking detected

2008-02-01 Thread Nick Piggin
On Friday 01 February 2008 21:31, Jens Axboe wrote:
> On Fri, Feb 01 2008, Jens Axboe wrote:

> > I think the right solution is to remove swap_io_context() and fix the io
> > context referencing in as-iosched.c instead.
>
> IOW, the below. I don't know why Nick originally wanted to swap io
> contexts for a rq <-> rq merge, there seems little (if any) benefit to
> doing so.

Yeah, I guess this patch is fine. Simpler is better.

>
> diff --git a/block/as-iosched.c b/block/as-iosched.c
> index 9603684..852803e 100644
> --- a/block/as-iosched.c
> +++ b/block/as-iosched.c
> @@ -1266,22 +1266,8 @@ static void as_merged_requests(struct request_queue *q, struct request *req,
>  */
>   if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
>   if (time_before(rq_fifo_time(next), rq_fifo_time(req))) {
> - struct io_context *rioc = RQ_IOC(req);
> - struct io_context *nioc = RQ_IOC(next);
> -
>   list_move(&req->queuelist, &next->queuelist);
>   rq_set_fifo_time(req, rq_fifo_time(next));
> - /*
> -  * Don't copy here but swap, because when anext is
> -  * removed below, it must contain the unused context
> -  */
> - if (rioc != nioc) {
> - double_spin_lock(&rioc->lock, &nioc->lock,
> - rioc < nioc);
> - swap_io_context(&rioc, &nioc);
> - double_spin_unlock(&rioc->lock, &nioc->lock,
> - rioc < nioc);
> - }
>   }
>   }
>
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index 6d16755..80245dc 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -176,15 +176,6 @@ void copy_io_context(struct io_context **pdst, struct io_context **psrc)
>  }
>  EXPORT_SYMBOL(copy_io_context);
>
> -void swap_io_context(struct io_context **ioc1, struct io_context **ioc2)
> -{
> - struct io_context *temp;
> - temp = *ioc1;
> - *ioc1 = *ioc2;
> - *ioc2 = temp;
> -}
> -EXPORT_SYMBOL(swap_io_context);
> -
>  int __init blk_ioc_init(void)
>  {
>   iocontext_cachep = kmem_cache_create("blkdev_ioc",
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index baba233..bbe3cf4 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -39,7 +39,6 @@ void exit_io_context(void);
>  struct io_context *get_io_context(gfp_t gfp_flags, int node);
>  struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
>  void copy_io_context(struct io_context **pdst, struct io_context **psrc);
> -void swap_io_context(struct io_context **ioc1, struct io_context **ioc2);
>
>  struct request;
>  typedef void (rq_end_io_fn)(struct request *, int);


Re: Feature Removals for 2.6.25

2008-02-01 Thread Nick Piggin
On Thu, Jan 31, 2008 at 05:38:42PM -0800, Harvey Harrison wrote:
> ---
> Ping?
> What: vm_ops.nopage
> When: Soon, provided in-kernel callers have been converted
> Why:  This interface is replaced by vm_ops.fault, but it has been around
>   forever, is used by a lot of drivers, and doesn't cost much to
>       maintain.
> Who:  Nick Piggin <[EMAIL PROTECTED]>

Well the in-kernel callers have not all been converted yet. I have
actually done the work, but it needs testing and merging by maintainers.
Getting it done during this merge window would be nice; I'm going to
try to make that happen after I get back from LCA. Otherwise probably
2.6.26.


Re: [PATCH 3/3] uio: vm_operations_struct ->nopage to ->fault method conversion

2008-02-02 Thread Nick Piggin
On Saturday 02 February 2008 20:51, Denis Cheng wrote:
> Signed-off-by: Denis Cheng <[EMAIL PROTECTED]>

Thanks, but already patched in -mm.

> ---
>  drivers/uio/uio.c |   19 ---
>  1 files changed, 8 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/uio/uio.c b/drivers/uio/uio.c
> index cc246fa..47e0c32 100644
> --- a/drivers/uio/uio.c
> +++ b/drivers/uio/uio.c
> @@ -417,30 +417,27 @@ static void uio_vma_close(struct vm_area_struct *vma)
>   idev->vma_count--;
>  }
>
> -static struct page *uio_vma_nopage(struct vm_area_struct *vma,
> -unsigned long address, int *type)
> +static int uio_vma_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>   struct uio_device *idev = vma->vm_private_data;
> - struct page* page = NOPAGE_SIGBUS;
>
>   int mi = uio_find_mem_index(vma);
>   if (mi < 0)
> - return page;
> + return VM_FAULT_SIGBUS;
>
>   if (idev->info->mem[mi].memtype == UIO_MEM_LOGICAL)
> - page = virt_to_page(idev->info->mem[mi].addr);
> + vmf->page = virt_to_page(idev->info->mem[mi].addr);
>   else
> - page = vmalloc_to_page((void*)idev->info->mem[mi].addr);
> - get_page(page);
> - if (type)
> - *type = VM_FAULT_MINOR;
> - return page;
> + vmf->page = vmalloc_to_page((void *)idev->info->mem[mi].addr);
> + get_page(vmf->page);
> +
> + return 0;
>  }
>
>  static struct vm_operations_struct uio_vm_ops = {
>   .open = uio_vma_open,
>   .close = uio_vma_close,
> - .nopage = uio_vma_nopage,
> + .fault = uio_vma_fault,
>  };
>
>  static int uio_mmap_physical(struct vm_area_struct *vma)


Re: [rfc] direct IO submission and completion scalability issues

2008-02-03 Thread Nick Piggin
On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
> 
> Second experiment which we did was migrating the IO submission to the
> IO completion cpu. Instead of submitting the IO on the same cpu where the
> request arrived, in this experiment  the IO submission gets migrated to the
> cpu that is processing IO completions(interrupt). This will minimize the
> access to remote cachelines (that happens in timers, slab, scsi layers). The
> IO submission request is forwarded to the kblockd thread on the cpu receiving
> the interrupts. As part of this, we also made the kblockd thread on each cpu
> the highest priority thread, so that IO gets submitted as soon as possible on
> the interrupt cpu without any delay. On an x86_64 SMP platform with 16 cores,
> this resulted in a 2% performance improvement, and a 3.3% improvement on a
> two node ia64 platform.
> 
> Quick and dirty prototype patch(not meant for inclusion) for this io migration
> experiment is appended to this e-mail.
> 
> Observation #1 mentioned above is also applicable to this experiment. CPUs
> processing interrupts will now have to cater to the IO submission/processing
> load as well.
> 
> Observation #2: This introduces some migration overhead during IO submission.
> With the current prototype, every incoming IO request results in an IPI and
> context switch (to the kblockd thread) on the interrupt processing cpu.
> This issue needs to be addressed, and the main challenge is finding an
> efficient mechanism for doing this IO migration (how much batching to do and
> when to send the migrate request?), so that we don't delay the IO much and
> at the same time don't cause much overhead during migration.

Hi guys,

Just had another way we might do this. Migrate the completions out to
the submitting CPUs rather than migrate submission into the completing
CPU.

I've got a basic patch that passes some stress testing. It seems fairly
simple to do at the block layer, and the bulk of the patch involves
introducing a scalable smp_call_function for it.

Now it could be optimised more by looking at batching up IPIs or
optimising the call function path or even migrating the completion event
at a different level...

However, this is a first cut. It actually seems like it might be taking
slightly more CPU to process block IO (~0.2%)... however, this is on my
dual core system that shares an llc, which means that there are very few
cache benefits to the migration, but non-zero overhead. So on multisocket
systems hopefully it might get to positive territory.

---

Index: linux-2.6/arch/x86/kernel/smp_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp_64.c
+++ linux-2.6/arch/x86/kernel/smp_64.c
@@ -321,6 +321,99 @@ void unlock_ipi_call_lock(void)
spin_unlock_irq(&call_lock);
 }
 
+struct call_single_data {
+   struct list_head list;
+   void (*func) (void *info);
+   void *info;
+   int wait;
+};
+
+struct call_single_queue {
+   spinlock_t lock;
+   struct list_head list;
+};
+static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);
+
+int __cpuinit init_smp_call(void)
+{
+   int i;
+
+   for_each_cpu_mask(i, cpu_possible_map) {
+   spin_lock_init(&per_cpu(call_single_queue, i).lock);
+   INIT_LIST_HEAD(&per_cpu(call_single_queue, i).list);
+   }
+   return 0;
+}
+core_initcall(init_smp_call);
+
+/*
+ * this function sends a 'generic call function' IPI to all other CPU
+ * of the system defined in the mask.
+ */
+int smp_call_function_fast(int cpu, void (*func)(void *), void *info,
+   int wait)
+{
+   struct call_single_data *data;
+   struct call_single_queue *dst = &per_cpu(call_single_queue, cpu);
+   cpumask_t mask = cpumask_of_cpu(cpu);
+   int ipi;
+
+   data = kmalloc(sizeof(struct call_single_data), GFP_ATOMIC);
+   data->func = func;
+   data->info = info;
+   data->wait = wait;
+
+   spin_lock_irq(&dst->lock);
+   ipi = list_empty(&dst->list);
+   list_add_tail(&data->list, &dst->list);
+   spin_unlock_irq(&dst->lock);
+
+   if (ipi)
+   send_IPI_mask(mask, CALL_FUNCTION_SINGLE_VECTOR);
+
+   if (wait) {
+   /* Wait for response */
+   while (data->wait)
+   cpu_relax();
+   kfree(data);
+   }
+
+   return 0;
+}
+
+asmlinkage void smp_call_function_fast_interrupt(void)
+{
+   struct call_single_queue *q;
+   unsigned long flags;
+   LIST_HEAD(list);
+
+   ack_APIC_irq();
+
+   q = &__get_cpu_var(call_single_queue);
+   spin_lock_irqsave(&q->lock, flags);
+   list_replace_init(&q->list, &list);
+   spin_unlock_irqrestore(&q->lock, flags);
+
+   exit_idle();
+   irq_enter();
+   while (!list_empty(&list)) {
+   struct call_single_data *data;
+
+   data = list_entry(list.next, struct call_single_data, list);

Re: [rfc] direct IO submission and completion scalability issues

2008-02-03 Thread Nick Piggin
On Sun, Feb 03, 2008 at 12:53:02PM +0200, Pekka Enberg wrote:
> Hi Nick,
> 
> On Feb 3, 2008 11:52 AM, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > +asmlinkage void smp_call_function_fast_interrupt(void)
> > +{
> 
> [snip]
> 
> > +   while (!list_empty(&list)) {
> > +   struct call_single_data *data;
> > +
> > +   data = list_entry(list.next, struct call_single_data, list);
> > +   list_del(&data->list);
> > +
> > +   data->func(data->info);
> > +   if (data->wait) {
> > +   smp_mb();
> > +   data->wait = 0;
> 
> Why do we need smp_mb() here (maybe add a comment to keep
> Andrew/checkpatch happy)?

Yeah, definitely... it's just a really basic RFC, but I should get
into the habit of just doing it anyway.
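
Something along these lines is what I have in mind (the comment wording
is just off the top of my head):

    data->func(data->info);
    if (data->wait) {
        /*
         * Ensure func()'s stores are visible before the submitter
         * sees wait == 0 and potentially kfree()s the call data.
         */
        smp_mb();
        data->wait = 0;
    }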

Thanks,
Nick


Re: [rfc] direct IO submission and completion scalability issues

2008-02-04 Thread Nick Piggin
On Mon, Feb 04, 2008 at 03:40:20PM +1100, David Chinner wrote:
> On Sun, Feb 03, 2008 at 08:14:45PM -0800, Arjan van de Ven wrote:
> > David Chinner wrote:
> > >Hi Nick,
> > >
> > >When Matthew was describing this work at an LCA presentation (not
> > >sure whether you were at that presentation or not), Zach came up
> > >with the idea that allowing the submitting application to control the
> > >CPU on which the io completion processing occurs would be a good
> > >approach to try.  That is, we submit a "completion cookie" with the
> > >bio that indicates where we want completion to run, rather than
> > >dictating that completion runs on the submission CPU.
> > >
> > >The reasoning is that only the higher level context really knows
> > >what is optimal, and that changes from application to application.
> > 
> > well.. kinda. One of the really hard parts of the submit/completion stuff
> > is that the slab/slob/slub/slib allocator ends up basically "cycling"
> > memory through the system; there's a sink of free memory on all the
> > submission cpus and a source of free memory on the completion cpu. I
> > don't think applications are capable of working out what is best in
> > this scenario..
> 
> Applications as in "anything that calls submit_bio()". i.e, direct I/O,
> filesystems, etc. i.e. not userspace but in-kernel applications.
> 
> In XFS, simultaneous io completion on multiple CPUs can contribute greatly to
> contention of global structures in XFS. By controlling where completions are
> delivered, we can greatly reduce this contention, especially on large,
> multipathed devices that deliver interrupts to multiple CPUs that may be far
> distant from each other.  We have all the state and intelligence necessary
> to control this sort of policy decision effectively.

Hi Dave,

Thanks for taking a look at the patch... yes it would be easy to turn
this bit of state into a more flexible cookie (eg. complete on submitter;
complete on interrupt; complete on CPUx/nodex etc.). Maybe we'll need
something that complex... I'm not sure, it would probably need more
fine tuning. That said, I just wanted to get this approach out there
early for rfc.
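
e.g. something very roughly like this on the bio (purely illustrative,
the names are made up and it is not what the current patch does):

    enum bio_complete_where {
        BIO_COMPLETE_ANY,        /* wherever the interrupt lands (today) */
        BIO_COMPLETE_SUBMITTER,  /* migrate completion to the submitting cpu */
        BIO_COMPLETE_EXPLICIT,   /* caller nominates a cpu below */
    };

    struct bio {
        /* ... */
        unsigned short bi_complete_where;
        unsigned short bi_complete_cpu;   /* for BIO_COMPLETE_EXPLICIT */
        /* ... */
    };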

I guess both you and Arjan have points. For a _lot_ of things, completing
on the same CPU as the submitter is the right thing to do (whether that is
by migrating submission as in the original patch in the thread, or by
migrating completion like I do): you get better behaviour in the slab and
page allocators, and locality and cache hotness of memory. For example, I
guess in a filesystem / pagecache heavy workload, you have to touch each
struct page, buffer head, fs private state, and also often have to wake
the thread for completion. Much of this data has just been touched at
submit time, so doing this on the same CPU is nice...

I'm surprised that the xfs global state bouncing would outweigh the
bouncing of all the per-page/block/bio/request/etc data that gets touched
during completion. We'll see.



Re: [rfc] direct IO submission and completion scalability issues

2008-02-04 Thread Nick Piggin
On Mon, Feb 04, 2008 at 11:12:44AM +0100, Jens Axboe wrote:
> On Sun, Feb 03 2008, Nick Piggin wrote:
> > On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
> > 
> > Hi guys,
> > 
> > Just had another way we might do this. Migrate the completions out to
> > the submitting CPUs rather than migrate submission into the completing
> > CPU.
> > 
> > I've got a basic patch that passes some stress testing. It seems fairly
> > simple to do at the block layer, and the bulk of the patch involves
> > introducing a scalable smp_call_function for it.
> > 
> > Now it could be optimised more by looking at batching up IPIs or
> > optimising the call function path or even migrating the completion event
> > at a different level...
> > 
> > However, this is a first cut. It actually seems like it might be taking
> > slightly more CPU to process block IO (~0.2%)... however, this is on my
> > dual core system that shares an llc, which means that there are very few
> > cache benefits to the migration, but non-zero overhead. So on multisocket
> > systems hopefully it might get to positive territory.
> 
> That's pretty funny, I did pretty much the exact same thing last week!

Oh nice ;)


> The primary difference between yours and mine is that I used a more
> private interface to signal a softirq raise on another CPU, instead of
> allocating call data and exposing a generic interface. That put the
> locking in blk-core instead, turning blk_cpu_done into a structure with
> a lock and list_head instead of just being a list head, and intercepted
> at blk_complete_request() time instead of waiting for an already raised
> softirq on that CPU.

Yeah I was looking at that... didn't really want to add the spinlock
overhead to the non-migration case. Anyway, I guess that sort of
fine implementation detail is going to have to be sorted out with
results.
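
The idea being roughly this shape (hand-wavy sketch; submission_cpu and
blk_done_remote are invented names, and it reuses the
smp_call_function_fast() from the earlier RFC patch):

    void blk_complete_request(struct request *req)
    {
        int cpu = req->submission_cpu;          /* invented field */

        if (cpu == smp_processor_id()) {
            /* local case: plain per-cpu list, no lock, as today */
            unsigned long flags;

            local_irq_save(flags);
            list_add_tail(&req->donelist, &__get_cpu_var(blk_cpu_done));
            raise_softirq_irqoff(BLOCK_SOFTIRQ);
            local_irq_restore(flags);
        } else {
            /* remote case: pay for the IPI machinery only here */
            smp_call_function_fast(cpu, blk_done_remote, req, 0);
        }
    }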


Re: Monthly md check == hung machine; how do I debug?

2008-02-04 Thread Nick Piggin
On Monday 04 February 2008 08:21, Robin Lee Powell wrote:
> I've got a machine with a 4 disk SATA raid10 configuration using md.
> The entire disk is loop-AES encrypted, but that shouldn't matter
> here.
>
> Once a month, Debian runs:
>
> /usr/share/mdadm/checkarray --cron --all --quiet
>
> and the machine hangs within 30 minutes of that starting.
>
> It seems that I can avoid the hang by not having "mdadm --monitor"
> running, but I'm not certain if that's the case or if I've just been
> lucky this go-round.
>
> I'm on kernel 2.6.23.1, my own compile thereof, x86_64, AMD
> Athlon(tm) 64 Processor 3700+.
>
> I've looked through all the 2.6.23 and 2.6.24 Changelogs, and I
> can't find anything that looks relevant.
>
> So, how can I (help you all) debug this?

Do you have a serial console? Does it respond to pings?

Can you try to get sysrq+T traces, and sysrq+P traces, and post
them?


Re: 2.6.24 regression: pan hanging unkilleable and un-straceable

2008-02-04 Thread Nick Piggin
On Tuesday 05 February 2008 01:49, Mike Galbraith wrote:
> On Tue, 2008-01-22 at 06:47 +0100, Mike Galbraith wrote:
> > On Tue, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:
> > > On Tuesday 22 January 2008 16:03, Mike Galbraith wrote:
> > > > I've hit same twice recently (not pan, and not repeatable).
> > >
> > > Nasty. The attached patch is something really simple that can sometimes
> > > help. sysrq+p is also an option, if you're on a UP system.
> >
> > SMP (P4/HT imitating real cores)
> >
> > > Any luck getting traces?
> >
> > We'll see.  Armed.
>
> Hm.  ld just went loopy (but killable) in v2.6.24-6928-g9135f19.  During
> kbuild, modpost segfaulted, restart build, ld goes gaga.  Third attempt,
> build finished.  Not what I hit before, but mentionable.
>
>
> [  674.589134] modpost[18588]: segfault at 3e8dc42c ip 0804a96d sp af982920
> error 5 in modpost[8048000+9000] [  674.589211] mm/memory.c:115: bad pgd
> 3e081163.
> [  674.589214] mm/memory.c:115: bad pgd 3e0d2163.
> [  674.589217] mm/memory.c:115: bad pgd 3eb01163.

Hmm, this _could_ be bad memory. Or if it is very easy to reproduce with
a particular kernel version, then it is probably a memory scribble from
another part of the kernel :(

First thing I guess would be easy and helpful to run memtest86 for a
while if you have time.

If that's clean, then I don't have another good option except to bisect
the problem. Turning on DEBUG_VM, DEBUG_SLAB, DEBUG_LIST, DEBUG_PAGEALLOC,
DEBUG_STACKOVERFLOW, DEBUG_RODATA might help catch it sooner... SLAB and
PAGEALLOC could slow you down quite a bit though. And if the problem is
quite reproducible, then obviously don't touch your config ;)

Thanks,
Nick


>
> [ 1407.322144]  ===
> [ 1407.322144] ldR running  0 21963  21962
> [ 1407.322144]db9d7f1c 00200086 c75f9020 b1814300 b0428300 b0428300
> b0428300 c75f9280 [ 1407.322144]b1814300 0001 db9d7000 
> d08c2f90 dba4f300 0002  [ 1407.322144]b1810120 dba4f334
> 00200046  db9d7000 c75f9020 db9d7f30 b02f333f [ 1407.322144] Call
> Trace:
> [ 1407.322144]  [] preempt_schedule_irq+0x45/0x5b
> [ 1407.322144]  [] ? do_page_fault+0x0/0x470
> [ 1407.322144]  [] need_resched+0x1f/0x21
> [ 1407.322144]  [] ? do_page_fault+0x0/0x470
> [ 1407.322144]  [] ? do_page_fault+0x4c/0x470
> [ 1407.322144]  [] ? do_page_fault+0x0/0x470
> [ 1407.322144]  [] ? error_code+0x72/0x78
> [ 1407.322144]  [] ? init_transmeta+0xcf/0x22f <== zzt P4


Re: [git pull] SLUB updates for 2.6.25

2008-02-04 Thread Nick Piggin
On Tuesday 05 February 2008 10:47, Christoph Lameter wrote:
> On Tue, 5 Feb 2008, Nick Piggin wrote:
> > > erk, sorry, I misremembered.   I was about to merge all the patches we
> > > weren't going to merge.  oops.
> >
> > While you're there, can you drop the patch(es?) I commented on
> > and didn't get an answer to. Like the ones that open code their
> > own locking primitives and do risky looking things with barriers
> > to boot...
>
> That patch will be moved to a special archive for
> microbenchmarks. It shows the same issues as the __unlock patch.

Ok. But the approach is just not so good. If you _really_ need something
like that and it is a win over the regular non-atomic unlock, then you
just have to implement it as a generic locking / atomic operation and
allow all architectures to implement the optimal (and correct) memory
barriers.

Anyway


> > Also, WRT this one:
> > slub-use-non-atomic-bit-unlock.patch
> >
> > It is strange that this is unwanted. Avoiding atomic operations
> > is a pretty good idea. The fact that it appears to be slower on
> > some microbenchmark on some architecture IMO means that either
> > their __clear_bit_unlock or the CPU isn't implemented so well...
>
> It's slower on x86_64 and that is a pretty important arch. So
> I am going to defer this until we have analyzed the situation some more. Could
> there be some effect of atomic ops on the speed with which a cacheline is
> released?

I'm sure it could have an effect. But why is the common case in SLUB
for the cacheline to be bouncing? What's the benchmark? What does SLAB
do in that benchmark; is it faster than SLUB there? What does the
non-atomic bit unlock do to Willy's database workload?


Re: [git pull] SLUB updates for 2.6.25

2008-02-04 Thread Nick Piggin
On Tuesday 05 February 2008 09:30, Andrew Morton wrote:
> On Mon, 4 Feb 2008 14:28:45 -0800
>
> Andrew Morton <[EMAIL PROTECTED]> wrote:
> > > root (1):
> > >   SLUB: Do not upset lockdep
> >
> > err, what?  I though I was going to merge these:
> >
> > slub-move-count_partial.patch
> > slub-rename-numa-defrag_ratio-to-remote_node_defrag_ratio.patch
> > slub-consolidate-add_partial-and-add_partial_tail-to-one-function.patch
> > slub-use-non-atomic-bit-unlock.patch
> > slub-fix-coding-style-violations.patch
> > slub-noinline-some-functions-to-avoid-them-being-folded-into-alloc-free.p
> >atch
> > slub-move-kmem_cache_node-determination-into-add_full-and-add_partial.pat
> >ch
> > slub-avoid-checking-for-a-valid-object-before-zeroing-on-the-fast-path.pa
> >tch slub-__slab_alloc-exit-path-consolidation.patch
> > slub-provide-unique-end-marker-for-each-slab.patch
> > slub-avoid-referencing-kmem_cache-structure-in-__slab_alloc.patch
> > slub-optional-fast-path-using-cmpxchg_local.patch
> > slub-do-our-own-locking-via-slab_lock-and-slab_unlock.patch
> > slub-restructure-slab-alloc.patch
> > slub-comment-kmem_cache_cpu-structure.patch
> > slub-fix-sysfs-refcounting.patch
> >
> > before you went and changed things under my feet.
>
> erk, sorry, I misremembered.   I was about to merge all the patches we
> weren't going to merge.  oops.

While you're there, can you drop the patch(es?) I commented on
and didn't get an answer to. Like the ones that open code their
own locking primitives and do risky looking things with barriers
to boot...

Also, WRT this one:
slub-use-non-atomic-bit-unlock.patch

It is strange that this is unwanted. Avoiding atomic operations
is a pretty good idea. The fact that it appears to be slower on
some microbenchmark on some architecture IMO means that either
their __clear_bit_unlock or the CPU isn't implemented so well...


Re: [git pull] SLUB updates for 2.6.25

2008-02-04 Thread Nick Piggin
On Tuesday 05 February 2008 11:32, Christoph Lameter wrote:
> On Tue, 5 Feb 2008, Nick Piggin wrote:
> > Ok. But the approach is just not so good. If you _really_ need something
> > like that and it is a win over the regular non-atomic unlock, then you
> > just have to implement it as a generic locking / atomic operation and
> > allow all architectures to implement the optimal (and correct) memory
> > barriers.
>
> Assuming this really gives a benefit on several benchmarks then we need
> to think about how to do this some more. Its a rather strange form of
> locking.
>
> Basically you lock the page with a single atomic operation that sets
> PageLocked and retrieves the page flags.

This operation is not totally unusual. I could use it for my optimised
page lock patches for example (although I need an operation that clears
a flag and has release semantics, but similar class of "thing").


> Then we shovel the page state 
> around a couple of functions in a register and finally store the page
> state back which at the same time unlocks the page.

And this is a store-for-unlock (eg. with release semantics).
Nothing too special about that either I guess. (it is almost the word
equivalent of clear_bit_unlock).


> So two memory 
> references with one of them being atomic with none in between. We have
> nothing that can do something like that right now.

The load you are trying to avoid in the lock really isn't that
expensive. The cacheline is in L1. Even after a store, many CPUs
have store forwarding so it is probably not going to matter at all
on those.

Anyway, not saying the operations are useless, but they should be
made available to core kernel and implemented per-arch. (if they are
found to be useful)
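
Roughly the kind of generic primitives I mean, for illustration only
(names invented, untested, and the barriers would be per-arch):

    /* atomically set PG_locked and return the whole flags word (acquire) */
    static inline unsigned long page_lock_return_flags(struct page *page)
    {
        unsigned long old;

        for (;;) {
            old = page->flags;
            if (old & (1UL << PG_locked)) {
                cpu_relax();
                continue;
            }
            if (cmpxchg(&page->flags, old,
                        old | (1UL << PG_locked)) == old)
                break;
        }
        return old | (1UL << PG_locked);
    }

    /* store the updated word back, clearing PG_locked (release) */
    static inline void page_unlock_store_flags(struct page *page,
                        unsigned long flags)
    {
        smp_mb();    /* stronger than needed; release would be enough */
        page->flags = flags & ~(1UL << PG_locked);
    }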


Re: [git pull for -mm] CPU isolation extensions (updated2)

2008-02-11 Thread Nick Piggin
On Tuesday 12 February 2008 15:10, Max Krasnyansky wrote:

> Rusty - Stop machine.
>    After doing a bunch of testing over the last three days I actually
> downgraded the stop machine changes from [highly experimental] to simply
> [experimental]. Please see this thread for more info:
> http://marc.info/?l=linux-kernel&m=120243837206248&w=2
> Short story is that I ran several insmod/rmmod workloads on live multi-core
> boxes with stop machine _completely_ disabled and did not see any issues.
> Rusty did not get a chance to reply yet; I'm hoping that we'll be able to
> make "stop machine" completely optional for some configurations.

stop machine is used for more than just module loading and unloading.
I don't think you can just disable it.


Re: [PATCH] Avoid buffer overflows in get_user_pages()

2008-02-11 Thread Nick Piggin
On Tuesday 12 February 2008 14:16, Robert Hancock wrote:
> Nick Piggin wrote:
> > On Tuesday 12 February 2008 10:17, Jonathan Corbet wrote:
> >> Avoid buffer overflows in get_user_pages()
> >>
> >> So I spent a while pounding my head against my monitor trying to figure
> >> out the vmsplice() vulnerability - how could a failure to check for
> >> *read* access turn into a root exploit?  It turns out that it's a buffer
> >> overflow problem which is made easy by the way get_user_pages() is
> >> coded.
> >>
> >> In particular, "len" is a signed int, and it is only checked at the
> >> *end* of a do {} while() loop.  So, if it is passed in as zero, the loop
> >> will execute once and decrement len to -1.  At that point, the loop will
> >> proceed until the next invalid address is found; in the process, it will
> >> likely overflow the pages array passed in to get_user_pages().
> >>
> >> I think that, if get_user_pages() has been asked to grab zero pages,
> >> that's what it should do.  Thus this patch; it is, among other things,
> >> enough to block the (already fixed) root exploit and any others which
> >> might be lurking in similar code.  I also think that the number of pages
> >> should be unsigned, but changing the prototype of this function probably
> >> requires some more careful review.
> >>
> >> Signed-off-by: Jonathan Corbet <[EMAIL PROTECTED]>
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index e5628a5..7f50fd8 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -989,6 +989,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> >>    int i;
> >>unsigned int vm_flags;
> >>
> >> +  if (len <= 0)
> >> +  return 0;
> >
> > BUG_ON()?
>
> Well, not if the code involved in the exploit can pass a zero value,

Which is a bug, and you want to catch it.


> otherwise it's just turning it into a DoS..

If it is due to a security bug, then the fix is to fix the point
where the kernel starts trusting an untrusted value. Not to hide
the bug like this. Arguably, a BUG_ON is better in the case of a
security hole because you want to halt the process as soon as you
detect a problem.
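
For anyone who hasn't looked at the code, the loop is roughly this shape
(simplified sketch, not the real get_user_pages() body):

    int i = 0;
    do {
        struct page *page;

        /* ... look up or fault in the page at "start" ... */
        if (pages)
            pages[i] = page;    /* no bound check against the array
                                 * the caller sized for "len" pages */
        i++;
        start += PAGE_SIZE;
        len--;                  /* len == 0 on entry becomes -1 here */
    } while (len && start < vma->vm_end);
    return i;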


Re: [PATCH] Avoid buffer overflows in get_user_pages()

2008-02-11 Thread Nick Piggin
On Tuesday 12 February 2008 10:17, Jonathan Corbet wrote:
> Avoid buffer overflows in get_user_pages()
>
> So I spent a while pounding my head against my monitor trying to figure
> out the vmsplice() vulnerability - how could a failure to check for
> *read* access turn into a root exploit?  It turns out that it's a buffer
> overflow problem which is made easy by the way get_user_pages() is
> coded.
>
> In particular, "len" is a signed int, and it is only checked at the
> *end* of a do {} while() loop.  So, if it is passed in as zero, the loop
> will execute once and decrement len to -1.  At that point, the loop will
> proceed until the next invalid address is found; in the process, it will
> likely overflow the pages array passed in to get_user_pages().
>
> I think that, if get_user_pages() has been asked to grab zero pages,
> that's what it should do.  Thus this patch; it is, among other things,
> enough to block the (already fixed) root exploit and any others which
> might be lurking in similar code.  I also think that the number of pages
> should be unsigned, but changing the prototype of this function probably
> requires some more careful review.
>
> Signed-off-by: Jonathan Corbet <[EMAIL PROTECTED]>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e5628a5..7f50fd8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -989,6 +989,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>   int i;
>   unsigned int vm_flags;
>
> + if (len <= 0)
> + return 0;

BUG_ON()?


Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70

2008-02-12 Thread Nick Piggin
On Wednesday 13 February 2008 09:27, Alexey Dobriyan wrote:
> On Tue, Feb 12, 2008 at 02:04:30PM -0800, Andrew Morton wrote:
> > On Sun, 10 Feb 2008 17:00:31 +0300
> >
> > Alexey Dobriyan <[EMAIL PROTECTED]> wrote:
> > > This happened during LTP. FWIW, modprobe/rmmod trivial empty module
> > > together with cat /proc/*/wchan and cat /proc/modules were also
> > > running.
> > >
> > > Box is E6400, much debugging is on, config below.
> > >
> > >
> > > [ 4057.31] BUG: unable to handle kernel paging request at
> > > 810101dbc008 [ 4057.31] IP: []
> > > iov_iter_advance+0x38/0x70 [ 4057.31] PGD 8063 PUD c063 PMD
> > > 153baa163 PTE 800101dbc160 [ 4057.31] Oops:  [1] SMP
> > > DEBUG_PAGEALLOC
> > > [ 4057.31] CPU 0
> > > [ 4057.31] Modules linked in: [last unloaded: foo]
> >
> > what is this foo.ko of which you speak, and did it wreck your kernel?
>
> It's a trivial dumb module which does nothing but loads and unloads.
> I redid ftest03 later without any suspicious activity and it oopsed the
> same way.

Ah crap. Hmm, maybe I didn't consider all cases with my last patch to
that code... is there an easy way to get the ftest03 source and run
it?


Re: Kernel BUG at fs/mpage.c:489

2008-02-12 Thread Nick Piggin
On Wednesday 13 February 2008 08:50, Alan Cox wrote:
> > Feb 12 19:55:08 butterfly kernel: hde: dma timeout error: status=0xd0 {
> > Busy } Feb 12 19:55:08 butterfly kernel: ide: failed opcode was: unknown
>
> Your drive stopped responding.
>
> > Feb 12 19:55:08 butterfly kernel: hde: DMA disabled
> > Feb 12 19:55:08 butterfly kernel: PDC202XX: Primary channel reset.
> > Feb 12 19:55:08 butterfly kernel: PDC202XX: Secondary channel reset.
>
> We gave it a good kicking and it stayed offline
>
> > Feb 12 19:55:08 butterfly kernel: hde: set_drive_speed_status:
> > status=0xd0 { Busy } Feb 12 19:55:08 butterfly kernel: ide: failed opcode
> > was: unknown Feb 12 19:55:47 butterfly kernel: ide2: reset timed-out,
> > status=0xd0 Feb 12 19:55:47 butterfly kernel: hde: status timeout:
> > status=0xd0 { Busy }
>
> And we gave up.
>
> Almost certainly a hardware fail of some sort.

Right, but the kernel shouldn't go bug...

I don't have a copy of your exact source code... which condition in
__mpage_writepage went BUG?


Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70

2008-02-12 Thread Nick Piggin
On Wednesday 13 February 2008 11:17, Nick Piggin wrote:
> On Wednesday 13 February 2008 09:27, Alexey Dobriyan wrote:

> > It's a trivial dumb module which does nothing but loads and unloads.
> > I redid ftest03 later without any suspicious activity and it oopsed the
> > same way.
>
> Ah crap. Hmm, maybe I didn't consider all cases with my last patch to
> that code... is there an easy way to get the ftest03 source and run
> it?

OK, I didn't realise it was a test from ltp.

But I can't reproduce it for the life of me with the latest git kernel
and latest ltp tarball.

Is it easy to reproduce? Are you reproducing it simply by running the
ftest03 binary directly from the shell? How many times between oopses?
It is multi-process but no threads, so races should be minimal down
this path -- can you get an strace of the failing process?

Thanks,
Nick


Re: [git pull for -mm] CPU isolation extensions (updated2)

2008-02-12 Thread Nick Piggin
On Wednesday 13 February 2008 14:32, Max Krasnyansky wrote:
> David Miller wrote:
> > From: Nick Piggin <[EMAIL PROTECTED]>
> > Date: Tue, 12 Feb 2008 17:41:21 +1100
> >
> >> stop machine is used for more than just module loading and unloading.
> >> I don't think you can just disable it.
> >
> > Right, in particular it is used for CPU hotplug.
>
> Ooops. Totally missed that. And a bunch of other places.
>
> [EMAIL PROTECTED] cpuisol-2.6.git]$ git grep -l stop_machine_run
> Documentation/cpu-hotplug.txt
> arch/s390/kernel/kprobes.c
> drivers/char/hw_random/intel-rng.c
> include/linux/stop_machine.h
> kernel/cpu.c
> kernel/module.c
> kernel/stop_machine.c
> mm/page_alloc.c
>
> I wonder why I did not see any issues when I disabled stop machine
> completely. I mentioned in the other thread that I commented out the part
> that actually halts the machine and ran it for several hours on my dual
> core laptop and on the quad core server. Tried all kinds of workloads,
> which include constant module removal and insertion, and cpu hotplug as
> well. It cannot be just luck :).

It really is. With subtle races, it can take a lot more than a few
hours. Consider that we have subtle races still in the kernel now,
which are almost never or rarely hit in maybe 10,000 hours * every
single person who has been using the current kernel for the past
year.

For a less theoretical example -- when I was writing the RCU radix
tree code, I tried to run directed stress tests on a 64 CPU Altix
machine (which found no bugs). Then I ran it on a dedicated test
harness that could actually do a lot more than the existing kernel
users are able to, and promptly found a couple more bugs (on a 2
CPU system).

But your primary defence against concurrency bugs _has_ to be
knowing the code and all its interactions.


> Clearly though, you guys are right. It cannot be simply disabled. Based on
> the above grep it's needed for CPU hotplug, mem hotplug, kprobes on s390
> and intel rng driver. Hopefully we can avoid it at least in module
> insertion/removal.

Yes, reducing the number of users by going through their code and
showing that it is safe is the right way to do this. Also, you
could avoid module insertion/removal?

FWIW, I think the idea of trying to turn Linux into something that gives
hard realtime guarantees is just insane. If that is what you want, you
would IMO be much better off spending the effort on something like
improving adeos and communication/administration between Linux and
the hard-rt kernel.

But don't let me dissuade you from making these good improvements
to Linux as well :) Just that it isn't really going to be hard-rt
in general.


Re: [ALPHA] ES40 fails to boot with >=kernel 2.6.23

2008-02-12 Thread Nick Piggin
On Tuesday 12 February 2008 04:27, Raúl Porcel wrote:
> Hi,
>
> We have a Compaq AlphaServer ES40 and since 2.6.23 it won't boot. I'm
> attaching the console log and the kernel config.
>
> Need to say that with a DEC Xp1000 it works fine, although they're
> different machines, of course.
> With .22 it boots fine, and by booting fine i mean after we reverted to
> 2.6.22 it booted again and everything worked as expected.
> Still hangs with latest kernel.
>
> I'm attaching the verlinux output as well, hope it helps. If i'm missing
> something, please don't hesitate to ask.
>
> Thanks

Hi,

Thanks for reporting. I'm not an alpha person, but I have
cc'ed them in case they missed this.


Re: [PATCH 2/2 resend] mm: various cleanups in get_user_pages()

2008-02-12 Thread Nick Piggin
On Wednesday 13 February 2008 00:10, Eugene Teo wrote:
> Sorry for the repeated emails. Kindly ignore the previous resend. Please
> review this instead. Thanks. I have tested this.

If it is causing this many problems, can you split the cleanups into
their own patches?


> [PATCH 2/2] mm: various cleanups in get_user_pages()
>
> This patch contains various cleanups, including making sure vma is valid,
> and the return value of follow_hugetlb_page() is validated.
>
> Signed-off-by: Eugene Teo <[EMAIL PROTECTED]>
> ---
>  mm/memory.c |   24 
>  1 file changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 54f951b..c7e0610 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1003,7 +1003,9 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>   unsigned int foll_flags;
>
>   vma = find_extend_vma(mm, start);
> - if (!vma && in_gate_area(tsk, start)) {
> + if (!vma)
> + goto finish_or_fault;
> + if (in_gate_area(tsk, start)) {
>   unsigned long pg = start & PAGE_MASK;
>   struct vm_area_struct *gate_vma = get_gate_vma(tsk);
>   pgd_t *pgd;

Doesn't this break the logic?

If you don't have a vma, but you are in the gate area, then you
should use the gate vma. With your patch, the gate area will fault.
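
i.e. the existing flow is roughly (simplified, from memory):

    vma = find_extend_vma(mm, start);
    if (!vma && in_gate_area(tsk, start)) {
        /* no normal vma, but the address is in the gate area:
         * look the page up via the gate vma rather than failing */
        ...
        continue;
    }
    if (!vma || /* ...other checks... */)
        return i ? : -EFAULT;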

> @@ -1011,7 +1013,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>   pmd_t *pmd;
>   pte_t *pte;
>   if (write) /* user gate pages are read-only */
> - return i ? : -EFAULT;
> + goto finish_or_fault;

I don't know if this is exactly a cleanup or not... I guess gcc
probably isn't smart enough to fold them all together, so it should
use a little less code in the unlikely branches. Does it?


Re: [git pull for -mm] CPU isolation extensions (updated2)

2008-02-12 Thread Nick Piggin
On Wednesday 13 February 2008 17:06, Max Krasnyansky wrote:
> Nick Piggin wrote:

> > But don't let me dissuade you from making these good improvements
> > to Linux as well :) Just that it isn't really going to be hard-rt
> > in general.
>
> Actually that's the cool thing about CPU isolation. Get rid of all latency
> sources from the CPU(s) and you get yourself as hard-RT as it gets.

Hmm, maybe. Removing all sources of latency from the CPU kind of
implies that you have to audit the whole kernel for sources of
latency.

> I mean I _already_ have multi-core hard-RT systems that show ~1.2 usec
> worst case and ~200nsec average latency. I do not even need Adeos/Xenomai
> or Preempt-RT, just a few very small patches. And it can be used for non-RT
> stuff too.

OK, but you then are very restricted in what you can do, and easily
can break it especially if you run any userspace on that CPU. If
you just run a kernel module that, after setup, doesn't use any
other kernel resources except interrupt handling, then you might be
OK (depending on whether even interrupt handling can run into
contended locks)...

If you started doing very much more, then you can easily run into
trouble.


Re: Kernel BUG at fs/mpage.c:489

2008-02-13 Thread Nick Piggin
On Wednesday 13 February 2008 20:01, Andrew Morton wrote:
> On Wed, 13 Feb 2008 08:26:27 +0100 Bart Dopheide <[EMAIL PROTECTED]> wrote:
> > On Wed, Feb 13, 2008 at 12:05:45PM +1100, Nick Piggin wrote:
> > :)On Wednesday 13 February 2008 08:50, Alan Cox wrote:
> > :)> Almost certainly a hardware fail of some sort.
> > :)
> > :)Right, but the kernel shouldn't go bug...
> >
> > Indeed, that's why I'm reporting.
> >
> > :)I don't have a copy of your exact source code... which condition in
> > :)__mpage_writepage went BUG?
> >
> > BUG_ON(buffer_locked(bh));
> >
> > In a bit of context:
> > 482:if (page_has_buffers(page)) {
> > 483:struct buffer_head *head = page_buffers(page);
> > 484:struct buffer_head *bh = head;
> > 485:
> > 486:/* If they're all mapped and dirty, do it */
> > 487:page_block = 0;
> > 488:do {
> > 489:BUG_ON(buffer_locked(bh));
> > 490:if (!buffer_mapped(bh)) {
> > 491:/*
> > 492: * unmapped dirty buffers are created by
> > 493: * __set_page_dirty_buffers -> mmapped
> > data 494: */
> > 495:if (buffer_dirty(bh))
> > 496:goto confused;
> > 497:if (first_unmapped == blocks_per_page)
> > 498:first_unmapped = page_block;
> > 499:continue;
> > 500:}
>
> Probably means that either fat, IDE, block or fs/buffer.c failed to unlock
> a buffer_head when the IO error happened.  It's unlikely to be fat.

Yes that looks like it would be the problem. I can't really
see anything in buffer.c that would do it... 

BTW is it really true that the buffer can never be locked by
anything else at this point? What about fsync_buffers_list?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Kernel BUG at fs/mpage.c:489

2008-02-13 Thread Nick Piggin
On Wednesday 13 February 2008 20:32, Andrew Morton wrote:
> On Wed, 13 Feb 2008 20:24:03 +1100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > BTW is it really true that the buffer can never be locked by
> > anything else at this point?
>
> It has been for the past five or six years.  With the page locked, nobody
> else can get at that page.

Hmm OK.


> > What about fsync_buffers_list?
>
> They're metadata buffers, not regular file data.  Things might get ugly if
> IO to /dev/sda went via that path, but it doesn't.

Yeah right... so the BUG_ON is basically because you want to avoid
the overhead of locking the buffer (which would presumably allow it
to work in situations where someone else might lock the buffer without
locking the page?). OK, makes sense.


Re: [patch 3/6] mmu_notifier: invalidate_page callbacks

2008-02-17 Thread Nick Piggin
On Saturday 16 February 2008 14:37, Andrew Morton wrote:
> On Thu, 14 Feb 2008 22:49:02 -0800 Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > Two callbacks to remove individual pages as done in rmap code
> >
> > invalidate_page()
> >
> > Called from the inner loop of rmap walks to invalidate pages.
> >
> > age_page()
> >
> > Called for the determination of the page referenced status.
> >
> > If we do not care about page referenced status then an age_page callback
> > may be be omitted. PageLock and pte lock are held when either of the
> > functions is called.
>
> The age_page mystery shallows.

BTW, can this callback be called mmu_notifier_clear_flush_young, to
match the core VM?



Re: LatencyTOP: sync_page description

2008-02-17 Thread Nick Piggin
On Saturday 16 February 2008 08:56, Török Edwin wrote:
> Hi Arjan,
>
> LatencyTOP says that sync_page is 'Writing a page to disk', however
> I see that even when no writes are involved, such as during a
> readdir, lseek, etc.
> Naming it a write is misleading, as no program is running that is
> doing constant writes to the disk. The only program is writing to a
> temp dir in /dev/shm.
>
> What would be a better description for sync_page?

Waiting on a page state change (usually: waiting for IO, but it can
also be waiting for the page lock, which is taken by some other part of
the kernel, eg. in page reclaim, truncate, buffered writes, or page
faults).
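
In terms of the code, the traces below are roughly doing this
(simplified, from memory):

    generic_file_aio_read()
      -> __lock_page_killable()
           -> __wait_on_bit_lock(..., sync_page_killable, ...)
                -> sync_page()
                     /* unplugs the backing device so any outstanding
                      * IO actually gets issued, then io_schedule()s
                      * until the page is unlocked */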

> Here are some /proc/latency_stats containing sync_page:
>
> 125 6937678 210821 sync_page sync_page_killable sync_page_killable
> __lock_page_killable wake_bit_function generic_file_aio_read
> get_unused_fd_flags path_walk do_sync_read autoremove_wake_function
> security_file_permission rw_verify_area
> 306 5677749 215746 sync_page sync_page_killable sync_page_killable
> __lock_page_killable wake_bit_function generic_file_aio_read
> do_sync_read autoremove_wake_function security_file_permission
> rw_verify_area vfs_read vfs_llseek
> 21 435657 59966 sync_page sync_page __lock_page wake_bit_function
> read_cache_page_async ntfs_readpage read_cache_page map_mft_record
> ntfs_read_locked_inode ntfs_alloc_big_inode iget5_locked
> ntfs_test_inode
> 195 2716409 133660 blk_unplug sync_page sync_page __lock_page
> wake_bit_function read_cache_page_async ntfs_readpage
> read_cache_page map_mft_record ntfs_read_locked_inode
> ntfs_alloc_big_inode iget5_locked
> 28 1881278 181986 add_to_page_cache_lru sync_page sync_page_killable
> sync_page_killable __lock_page_killable wake_bit_function
> generic_file_aio_read get_unused_fd_flags path_walk do_sync_read
> autoremove_wake_function security_file_permission
> 2 17132 9746 add_to_page_cache_lru sync_page sync_page_killable
> sync_page_killable __lock_page_killable wake_bit_function
> generic_file_aio_read do_sync_read autoremove_wake_function
> security_file_permission rw_verify_area vfs_read
> 1 70 70 irq_exit sync_page sync_page_killable sync_page_killable
> __lock_page_killable wake_bit_function generic_file_aio_read
> do_sync_read autoremove_wake_function security_file_permission
> rw_verify_area vfs_read
> 23 306682 114514 blk_unplug sync_page sync_page_killable
> sync_page_killable __lock_page_killable wake_bit_function
> generic_file_aio_read do_sync_read autoremove_wake_function
> security_file_permission rw_verify_area vfs_read
> 1 153 153 hrtimer_interrupt smp_apic_timer_interrupt sync_page
> sync_page_killable sync_page_killable __lock_page_killable
> wake_bit_function generic_file_aio_read do_sync_read
> autoremove_wake_function cfq_idle_slice_timer security_file_permission



Re: IO queueing and complete affinity w/ threads: Some results

2008-02-18 Thread Nick Piggin
On Mon, Feb 18, 2008 at 02:33:17PM +0100, Andi Kleen wrote:
> Jens Axboe <[EMAIL PROTECTED]> writes:
> 
> > and that scrapping the remote
> > softirq trigger stuff is sanest.
> 
> I actually liked Nick's queued smp_function_call_single() patch. So even
> if it was not used for block I would still like to see it being merged 
> in some form to speed up all the other IPI users.

Yeah, that hasn't been forgotten (nor have your comments about folding
my special function into smp_call_function_single).

The call function path is terribly unscalable at the moment on a lot
of architectures, and also it isn't allowed to be used with interrupts
off due to deadlock (which the queued version can allow, provided
that wait=0).

I will get around to sending that upstream soon.
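
For reference, a very rough sketch of what I mean by the queued
approach (this is not the actual patch, and arch_send_call_single_ipi()
is just a placeholder name for whatever arch hook raises the IPI):

struct call_single_data {
	struct list_head list;
	void (*func)(void *info);
	void *info;
};

struct call_single_queue {
	spinlock_t lock;
	struct list_head list;
};
static DEFINE_PER_CPU(struct call_single_queue, call_queue);

/* sender: never spins, so it is safe to call with interrupts off */
static void queue_call_single(int cpu, struct call_single_data *data)
{
	struct call_single_queue *q = &per_cpu(call_queue, cpu);
	unsigned long flags;

	spin_lock_irqsave(&q->lock, flags);
	list_add_tail(&data->list, &q->list);
	spin_unlock_irqrestore(&q->lock, flags);
	arch_send_call_single_ipi(cpu);		/* placeholder arch hook */
}

/* IPI handler on the target CPU just drains its queue */
static void call_single_interrupt(void)
{
	struct call_single_queue *q = &__get_cpu_var(call_queue);
	struct call_single_data *data, *next;
	LIST_HEAD(list);

	spin_lock(&q->lock);
	list_splice_init(&q->list, &list);
	spin_unlock(&q->lock);

	list_for_each_entry_safe(data, next, &list, list)
		data->func(data->info);
}

With wait=0 semantics the caller doesn't touch the data again after
queueing it, so there is nothing to spin on, and no deadlock even if
two CPUs are sending to each other with interrupts disabled.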


Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-18 Thread Nick Piggin
On Tuesday 19 February 2008 01:39, Andi Kleen wrote:
> Arjan van de Ven <[EMAIL PROTECTED]> writes:
> > you have more faith in the authors knowledge of how his code actually
> > behaves than I think is warranted  :)
>
> iirc there was a mm patch some time ago to keep track of the actual
> unlikely values at runtime and it showed indeed some wrong ones. But the
> far majority of them are probably correct.
>
> > Or faith in that he knows what "unlikely" means.
> > I should write docs about this; but unlikely() means:
> > 1) It happens less than 0.01% of the cases.
> > 2) The compiler couldn't have figured this out by itself
> >(NULL pointer checks are compiler done already, same for some other
> > conditions) 3) It's a hot codepath where shaving 0.5 cycles (less even on
> > x86) matters (and the author is ok with taking a 500 cycles hit if he's
> > wrong)
>
> One more thing unlikely() does is to move the unlikely code out of line.
> So it should conserve some icache in critical functions, which might
> well be worth some more cycles (don't have numbers though).

I actually once measured context switching performance in the scheduler,
and removing the unlikely hint for testing RT tasks IIRC gave about a 5%
performance drop.

This was on a P4 which is very different from more modern CPUs both in
terms of branch performance characteristics, and icache characteristics.
However, the P4's branch predictor is pretty good, and it should easily
be able to correctly predict the rt_task check if it has enough entries.
So I think much of the savings came from code transformation and movement.
Anyway, it is definitely worthwhile if used correctly.
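
For reference, the hint itself is nothing magic -- it is just the usual
__builtin_expect() wrapper, e.g. (the rt_task check is illustrative
only, and handle_rt_wakeup() is a made-up name, not the real code):

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

	if (unlikely(rt_task(p)))
		handle_rt_wakeup(p);	/* gcc moves this out of line */

so all the annotation changes is which arm gcc lays out as the
fall-through / hot path.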

Actually one thing I don't like about gcc is that I think it still emits
cmovs for likely/unlikely branches, which is silly (the gcc developers
seem to be in love with that instruction). If that goes away, then
branch hints may be even better.

>
> But overall I agree with you that unlikely is in most cases a bad
> idea (and I submitted the original patch introducing it originally @). That
> is because it is often used in situations where gcc's default branch
> prediction heuristics would make exactly the same decision
>
>if (unlikely(x == NULL))
>
> is simply totally useless because gcc already assumes all x == NULL
> tests are unlikely. I appended some of the builtin heuristics from
> a recent gcc source so people can see them.
>
> Note in particular the last predictors; assuming branch ending
> with goto, including call, causing early function return or
> returning negative constant are not taken. Just these alone
> are likely 95+% of the unlikelies in the kernel.

Yes, gcc should be able to do pretty good heuristics, considering
the quite good numbers that cold CPU predictors can attain. However
for really performance critical code (or really "never" executed
code), then I think it is OK to have the hints and not have to rely
on gcc heuristics.

>
> -Andi

[snip]

Interesting, thanks!



Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70

2008-02-18 Thread Nick Piggin
On Wednesday 13 February 2008 09:27, Alexey Dobriyan wrote:
> On Tue, Feb 12, 2008 at 02:04:30PM -0800, Andrew Morton wrote:

> > > [ 4057.31] Pid: 7035, comm: ftest03 Not tainted
> > > 2.6.24-25f666300625d894ebe04bac2b4b3aadb907c861 #2 [ 4057.31] RIP:
> > > 0010:[]  []
> > > iov_iter_advance+0x38/0x70 [ 4057.31] RSP: 0018:810110329b20 
> > > EFLAGS: 00010246
> > > [ 4057.31] RAX:  RBX: 0800 RCX:
> > >  [ 4057.31] RDX:  RSI:
> > > 0800 RDI: 810110329ba8 [ 4057.31] RBP:
> > > 0800 R08:  R09: 810101dbc000 [
> > > 4057.31] R10: 0004 R11:  R12:
> > > 00026000 [ 4057.31] R13: 81010d765c98 R14:
> > > 1000 R15:  [ 4057.31] FS: 
> > > 7fee589146d0() GS:80501000() knlGS:
> > > [ 4057.31] CS:  0010 DS:  ES:  CR0: 8005003b [
> > > 4057.31] CR2: 810101dbc008 CR3: 0001103da000 CR4:
> > > 06e0 [ 4057.31] DR0:  DR1:
> > >  DR2:  [ 4057.31] DR3:
> > >  DR6: 0ff0 DR7: 0400 [
> > > 4057.31] Process ftest03 (pid: 7035, threadinfo 810110328000,
> > > task 810160b0) [ 4057.31] Stack:  8025b413
> > > 81010d765ab0 804e6fd8 001201d2 [ 4057.31] 
> > > 810110329db8 00026000 810110329d38 81017b9fb500 [
> > > 4057.31]  81010d765c98 804175e0 81010d765ab0
> > >  [ 4057.31] Call Trace:
> > > [ 4057.31]  [] ?
> > > generic_file_buffered_write+0x1e3/0x6f0 [ 4057.31] 
> > > [] ? current_fs_time+0x1e/0x30 [ 4057.31] 
> > > [] ? __generic_file_aio_write_nolock+0x28f/0x440 [
> > > 4057.31]  [] ? generic_file_aio_write+0x63/0xd0 [
> > > 4057.31]  [] ? ext3_file_write+0x23/0xc0 [
> > > 4057.31]  [] ? ext3_file_write+0x0/0xc0 [
> > > 4057.31]  [] ? do_sync_readv_writev+0xcb/0x110 [
> > > 4057.31]  [] ? autoremove_wake_function+0x0/0x30
> > > [ 4057.31]  [] ?
> > > debug_check_no_locks_freed+0x7d/0x130 [ 4057.31] 
> > > [] ? trace_hardirqs_on+0xcf/0x150 [ 4057.31] 
> > > [] ? __kmalloc+0x15/0xc0
> > > [ 4057.31]  [] ? rw_copy_check_uvector+0x9d/0x130
> > > [ 4057.31]  [] ? do_readv_writev+0xe0/0x170
> > > [ 4057.31]  [] ? mutex_lock_nested+0x1a7/0x280
> > > [ 4057.31]  [] ? trace_hardirqs_on+0xcf/0x150
> > > [ 4057.31]  [] ?
> > > __mutex_unlock_slowpath+0xc9/0x170 [ 4057.31]  []
> > > ? trace_hardirqs_on+0xcf/0x150 [ 4057.31]  [] ?
> > > trace_hardirqs_on_thunk+0x35/0x3a [ 4057.31]  []
> > > ? sys_writev+0x53/0x90
> > > [ 4057.31]  [] ?
> > > system_call_after_swapgs+0x7b/0x80 [ 4057.31]
> > > [ 4057.31]
> > > [ 4057.31] Code: 48 01 77 10 48 29 77 18 c3 0f 0b eb fe 66 66 90 66
> > > 66 90 4c 8b 0f 48 8b 4f 10 49 89 f0 eb 07 66 66 66 90 49 29 c0 4d 85 c0
> > > 75 07 <49> 83 79 08 00 75 23 49 8b 51 08 48 89 d0 48 29 c8 49 39 c0 49
> > > [ 4057.31] RIP  [] iov_iter_advance+0x38/0x70 [
> > > 4057.31]  RSP 
> > > [ 4057.31] CR2: 810101dbc008
> > > [ 4057.31] Kernel panic - not syncing: Fatal exception

Can you try this patch please?
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1753,9 +1753,10 @@ static void __iov_iter_advance_iov(struc
 
 		/*
 		 * The !iov->iov_len check ensures we skip over unlikely
-		 * zero-length segments.
+		 * zero-length segments. But we mustn't try to "skip" if
+		 * we have come to the end (i->count == bytes).
 		 */
-		while (bytes || !iov->iov_len) {
+		while (bytes || (unlikely(!iov->iov_len) && i->count > bytes)) {
 			int copy = min(bytes, iov->iov_len - base);
 
 			bytes -= copy;


Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-18 Thread Nick Piggin
On Tuesday 19 February 2008 13:40, Arjan van de Ven wrote:
> On Tue, 19 Feb 2008 13:33:53 +1100
>
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> > Actually one thing I don't like about gcc is that I think it still
> > emits cmovs for likely/unlikely branches, which is silly (the gcc
> > developers seem to be in love with that instruction). If that goes
> > away, then branch hints may be even better.
>
> only for -Os and only if the result is smaller afaik.

What is your evidence for saying this? Because here, with the latest
kernel and recent gcc-4.3 snapshot, it spits out cmov like crazy even
when compiled with -O2.

[EMAIL PROTECTED]:~/usr/src/linux-2.6$ grep cmov kernel/sched.s | wc -l
45

And yes it even does for hinted branches and even at -O2/3

[EMAIL PROTECTED]:~/tests$ cat cmov.c
int test(int a, int b)
{
if (__builtin_expect(a < b, 0))
return a;
else
return b;
}
[EMAIL PROTECTED]:~/tests$ gcc-4.3 -S -O2 cmov.c
[EMAIL PROTECTED]:~/tests$ head -13 cmov.s
        .file   "cmov.c"
        .text
        .p2align 4,,15
.globl test
        .type   test, @function
test:
.LFB2:
        cmpl    %edi, %esi
        cmovle  %esi, %edi
        movl    %edi, %eax
        ret
.LFE2:
        .size   test, .-test

This definitely should be a branch, IMO.

> (cmov tends to be a performance loss most of the time so for -O2 and such
> it isn't used as far as I know.. it does make for nice small code however
> ;-)

It shouldn't be hard to work out the cutover point based on how
expensive cmov is, how expensive branch and branch mispredicts are,
and how often the branch is likely to be mispredicted. For an
unpredictable branch, cmov is normally quite a good win even on
modern CPUs. But gcc overuses it I think.
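
As a back-of-the-envelope comparison (the numbers here are invented,
but in the right ballpark for a modern OOOE x86):

  branch, 5% mispredict rate:  0.95 * ~0 + 0.05 * 20 cycles ~= 1 cycle
  cmov:                        ~2 cycles, plus a data dependence on both
                               inputs that the OOO engine cannot
                               speculate past

So even a branch that is only 95% predictable is already competitive
with cmov, and an annotated 99%+ branch should just about always win.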



Re: [PATCH, RFC] kthread: (possibly) a missing memory barrier in kthread_stop()

2008-02-18 Thread Nick Piggin
On Tuesday 19 February 2008 10:03, Dmitry Adamushko wrote:
> Hi,
>
>
> [ description ]
>
> Subject: kthread: add a memory barrier to kthread_stop()
>
> 'kthread' threads do a check in the following order:
> - set_current_state(TASK_INTERRUPTIBLE);
> - kthread_should_stop();
>
> and set_current_state() implies an smp_mb().
>
> on another side (kthread_stop), wake_up_process() does not seem to
> guarantee a full mb.
>
> And 'kthread_stop_info.k' must be visible before wake_up_process()
> checks for/modifies a state of the 'kthread' task.
>
> (the patch is at the end of the message)
>
>
> [ more detailed description ]
>
> the current code might well be safe in case a to-be-stopped 'kthread'
> task is _not_ running on another CPU at the moment when kthread_stop()
> is called (in this case, 'rq->lock' will act as a kind of synch.
> point/barrier).
>
> Another case is as follows:
>
> CPU#0:
>
> ...
> while (kthread_should_stop()) {
>
>if (condition)
>  schedule();
>
>/* ... do something useful ... */   <--- EIP
>
>set_current_state(TASK_INTERRUPTIBLE);
> }
>
> so a 'kthread' task is about to call
> set_current_state(TASK_INTERRUPTIBLE) ...
>
>
> (in the mean time)
>
> CPU#1:
>
> kthread_stop()
>
> -> kthread_stop_info.k = k (*)
> -> wake_up_process()
>
> wake_up_process() looks like:
>
> (try_to_wake_up)
>
> IRQ_OFF
> LOCK
>
> old_state = p->state;
> if (!(old_state & state))  (**)
>  goto out;
>
> ...
>
> UNLOCK
> IRQ_ON
>
>
> let's suppose (*) and (**) are reordered
> (according to Documentation/memory-barriers.txt, neither IRQ_OFF nor
> LOCK may prevent it from happening).
>
> - the state is TASK_RUNNING, so we are about to return.
>
> - CPU#1 is about to execute (*) (it's guaranteed to be done before
> spin_unlock(&rq->lock) at the end of try_to_wake_up())
>
>
> (in the mean time)
>
> CPU#0:
>
> - set_current_state(TASK_INTERRUPTIBLE);
> - kthread_should_stop();
>
> here, kthread_stop_info.k is not yet visible
>
> - schedule()
>
> ...
>
> we missed a 'kthread_stop' event.
>
> hum?

Looks like you are correct to me.
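
To restate the pairing that is needed (illustrative only, not the
actual kthread.c code):

	/* sleeper (kthread main loop) */
	set_current_state(TASK_INTERRUPTIBLE);	/* implies smp_mb() */
	if (kthread_should_stop())		/* reads kthread_stop_info.k */
		break;
	schedule();

	/* waker (kthread_stop) */
	kthread_stop_info.k = k;
	smp_mb();				/* the barrier being added here */
	wake_up_process(k);			/* reads k->state */

Without the smp_mb(), the store to kthread_stop_info.k can be reordered
after the read of k->state inside try_to_wake_up(), so the wakeup can
become a no-op while the sleeper still sees the old value of .k and
goes to sleep -- exactly the lost wakeup described above.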


> TIA,
>
> ---
>
> From: Dmitry Adamushko <[EMAIL PROTECTED]>
> Subject: kthread: add a memory barrier to kthread_stop()
>
> 'kthread' threads do a check in the following order:
> - set_current_state(TASK_INTERRUPTIBLE);
> - kthread_should_stop();
>
> and set_current_state() implies an smp_mb().
>
> on another side (kthread_stop), wake_up_process() is not guaranteed to
> act as a full mb.
>
> 'kthread_stop_info.k' must be visible before wake_up_process() checks
> for/modifies a state of the 'kthread' task.
>
>
> Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]>
>
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 0ac8878..5167110 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -211,6 +211,10 @@ int kthread_stop(struct task_struct *k)
>
>   /* Now set kthread_should_stop() to true, and wake it up. */
>   kthread_stop_info.k = k;
> +
> + /* The previous store operation must not get ahead of the wakeup. */
> + smp_mb();
> +
>   wake_up_process(k);
>   put_task_struct(k);
>
>
>
> --



Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-18 Thread Nick Piggin
On Tuesday 19 February 2008 16:58, Willy Tarreau wrote:
> On Tue, Feb 19, 2008 at 01:33:53PM +1100, Nick Piggin wrote:
> > > Note in particular the last predictors; assuming branch ending
> > > with goto, including call, causing early function return or
> > > returning negative constant are not taken. Just these alone
> > > are likely 95+% of the unlikelies in the kernel.
> >
> > Yes, gcc should be able to do pretty good heuristics, considering
> > the quite good numbers that cold CPU predictors can attain. However
> > for really performance critical code (or really "never" executed
> > code), then I think it is OK to have the hints and not have to rely
> > on gcc heuristics.
>
> in my experience, the real problem is that gcc does what *it* wants and not
> what *you* want. I've been annoyed a lot by the way it coded some loops
> that could really be blazingly fast, but which resulted in a ton of
> branches due to its predictors. And using unlikely() there was a real mess,
> because instead of just hinting the compiler with probabilities to write
> some linear code for the *most* common case, it ended up with awful
> branches everywhere with code sent far away and even duplicated for some
> branches.
>
> Sometimes, for performance critical paths, I would like gcc to be dumb and
> follow *my* code and not its hard-coded probabilities. For instance, in a
> tree traversal, you really know how you want to build your loop. And these
> days, it seems like the single method of getting it your way is doing asm,
> which obviously is not portable :-(

Probably all true.


> Maybe one thing we would need would be the ability to assign probabilities
> to each branch based on what we expect, so that gcc could build a better
> tree keeping most frequently used code tight.

I don't know if that would *directly* lead to gcc being smarter. I
think they probably don't benchmark on code bases that have
much explicit annotation (I'm sure they wouldn't seriously benchmark
any parts of Linux as part of daily development). I think the key is
to continue to use annotations _properly_, and eventually gcc should
go in the right direction if enough code uses it.

And if you have really good examples like it sounds like above, then
I guess that should be reported to gcc?



Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

2008-02-19 Thread Nick Piggin
On Tuesday 19 February 2008 16:44, KOSAKI Motohiro wrote:
> background
> 
> the current VM implementation doesn't have a limit on the # of parallel reclaims.
> under heavy workload, this brings 2 bad things
>   - heavy lock contention
>   - unnecessary swap out
>
> about 2 months ago, KAMEZAWA Hiroyuki proposed a page reclaim
> throttle patch and explained that it improves reclaim time.
>   http://marc.info/?l=linux-mm&m=119667465917215&w=2
>
> but unfortunately it works only for memcgroup reclaim.
> Today, I implemented it again to support global reclaim and measured it.
>
>
> test machine, method and result
> ==
> 
>   CPU:  IA64 x8
>   MEM:  8GB
>   SWAP: 2GB
>
> 
>   got hackbench from
>   http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
>
>   $ /usr/bin/time hackbench 120 process 1000
>
>   this parameter mean consume all physical memory and
>   1GB swap space on my test environment.
>
> 
>
> before:
>   hackbench result:   282.30
>   /usr/bin/time result
>   user:   14.16
>   sys:1248.47
>   elapse: 432.93
>   major fault:29026
>   max parallel reclaim tasks: 1298
>   max consumption time of
>try_to_free_pages():   70394
>
> after:
>   hackbench result:   30.36
>   /usr/bin/time result
>   user:   14.26
>   sys:294.44
>   elapse: 118.01
>   major fault:3064
>   max parallel reclaim tasks: 4
>   max consumption time of
>try_to_free_pages():   12234
>
>
> conclusion
> =
> this patch improves 4 things.
> 1. reduce unnecessary swap
>(see above major fault. about 90% reduced)
> 2. improve throughput performance
>(see above hackbench result. about 90% reduced)
> 3. improve interactive performance.
>(see above max consumption of try_to_free_pages.
> about 80% reduced)
> 4. reduce lock contention.
>(see above sys time. about 80% reduced)
>
>
> Now, we got about 1000% performance improvement of hackbench :)
>
>
>
> future works
> ==
>  - more discussion with memory controller guys.

Hi,

Yeah this is definitely needed and a nice result.

I'm worried about a) placing a global limit on parallelism, and b)
placing a limit on parallelism at all.

I think it should maybe be a per-zone thing...

What happens if you make it a per-zone mutex, and allow just a single
process to reclaim pages from a given zone at a time? I guess that is
going to slow down throughput a little bit in some cases though...
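
Something like this is what I have in mind (completely untested sketch;
zone->reclaim_mutex is a field that would need to be added, and the
shrink_zone() call is only roughly where it would go in the direct
reclaim path):

	if (!mutex_trylock(&zone->reclaim_mutex)) {
		/*
		 * Someone else is already reclaiming this zone: throttle
		 * behind them rather than piling in, then recheck the
		 * watermarks -- their reclaim may have freed enough.
		 */
		mutex_lock(&zone->reclaim_mutex);
		mutex_unlock(&zone->reclaim_mutex);
	} else {
		shrink_zone(priority, zone, sc);
		mutex_unlock(&zone->reclaim_mutex);
	}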



[patch] my mmu notifier sample driver

2008-02-19 Thread Nick Piggin

Index: linux-2.6/drivers/char/mmu_notifier_skel.c
===
--- /dev/null
+++ linux-2.6/drivers/char/mmu_notifier_skel.c
@@ -0,0 +1,255 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static DEFINE_SPINLOCK(mmn_lock);
+static RADIX_TREE(rmap_tree, GFP_ATOMIC);
+static seqcount_t rmap_seq = SEQCNT_ZERO;
+
+static int __rmap_add(unsigned long mem, unsigned long vaddr)
+{
+   int err;
+
+   err = radix_tree_insert(&rmap_tree, mem >> PAGE_SHIFT, (void *)vaddr);
+
+   return err;
+}
+
+static void __rmap_del(unsigned long mem)
+{
+   void *ret;
+
+   ret = radix_tree_delete(&rmap_tree, mem >> PAGE_SHIFT);
+   BUG_ON(!ret);
+}
+
+static unsigned long rmap_find(unsigned long mem)
+{
+   unsigned long vaddr;
+
+   rcu_read_lock();
+   vaddr = (unsigned long)radix_tree_lookup(&rmap_tree, mem >> PAGE_SHIFT);
+   rcu_read_unlock();
+
+   return vaddr;
+}
+
+static struct page *follow_page_atomic(struct mm_struct *mm, unsigned long address, int write)
+{
+   struct vm_area_struct *vma;
+
+   vma = find_vma(mm, address);
+if (!vma || (vma->vm_start > address))
+return NULL;
+
+   if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+   return NULL;
+
+   return follow_page(vma, address, FOLL_GET|(write ? FOLL_WRITE : 0));
+}
+
+static int mmn_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   unsigned long source_vaddr = (unsigned long)vmf->pgoff << PAGE_SHIFT;
+   unsigned long dest_vaddr = (unsigned long)vmf->virtual_address;
+   unsigned long pfn;
+   struct page *page;
+   pgprot_t prot;
+   int write = vmf->flags & FAULT_FLAG_WRITE;
+   int ret;
+
+   printk("mmn_vm_fault [EMAIL PROTECTED] sourcing from %lx\n", write ? 
"write" : "read", dest_vaddr, source_vaddr);
+
+   BUG_ON(mm != current->mm); /* disallow get_user_pages */
+
+again:
+   spin_lock(&mmn_lock);
+   write_seqcount_begin(&rmap_seq);
+   page = follow_page_atomic(mm, source_vaddr, write);
+   if (unlikely(!page)) {
+   write_seqcount_end(&rmap_seq);
+   spin_unlock(&mmn_lock);
+   ret = get_user_pages(current, mm, source_vaddr,
+   1, write, 0, &page, NULL);
+   if (ret != 1)
+   goto out_err;
+   put_page(page);
+   goto again;
+   }
+
+   ret = __rmap_add(source_vaddr, dest_vaddr);
+   if (ret)
+   goto out_lock;
+
+   pfn = page_to_pfn(page);
+   prot = vma->vm_page_prot;
+   if (!write)
+   vma->vm_page_prot = vm_get_page_prot(vma->vm_flags & ~(VM_WRITE|VM_MAYWRITE));
+   ret = vm_insert_pfn(vma, dest_vaddr, pfn);
+   vma->vm_page_prot = prot;
+   if (ret) {
+   if (ret == -EBUSY)
+   WARN_ON(1);
+   goto out_rmap;
+   }
+   write_seqcount_end(&rmap_seq);
+   spin_unlock(&mmn_lock);
+   put_page(page);
+
+return VM_FAULT_NOPAGE;
+
+out_rmap:
+   __rmap_del(source_vaddr);
+out_lock:
+   write_seqcount_end(&rmap_seq);
+   spin_unlock(&mmn_lock);
+   put_page(page);
+out_err:
+   switch (ret) {
+   case -EFAULT:
+   case -EEXIST:
+   case -EBUSY:
+   return VM_FAULT_SIGBUS;
+   case -ENOMEM:
+   return VM_FAULT_OOM;
+   default:
+   BUG();
+   }
+}
+
+struct vm_operations_struct mmn_vm_ops = {
+.fault = mmn_vm_fault,
+};
+
+static int mmu_notifier_busy;
+static struct mmu_notifier mmu_notifier;
+
+static int mmn_clear_young(struct mmu_notifier *mn, unsigned long address)
+{
+   unsigned long vaddr;
+   unsigned seq;
+   struct mm_struct *mm = mn->mm;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep, pte;
+
+   do {
+   seq = read_seqcount_begin(&rmap_seq);
+   vaddr = rmap_find(address);
+   } while (read_seqcount_retry(&rmap_seq, seq));
+
+   if (vaddr == 0)
+   return 0;
+
+   printk("[EMAIL PROTECTED] sourced from %lx\n", vaddr, address);
+
+   spin_lock(&mmn_lock);
+pgd = pgd_offset(mm, vaddr);
+pud = pud_offset(pgd, vaddr);
+   if (pud) {
+   pmd = pmd_offset(pud, vaddr);
+   if (pmd) {
+   ptep = pte_offset_map(pmd, vaddr);
+   if (ptep) {
+   pte = *ptep;
+   if (!pte_present(pte)) {
+   /* x86 specific, don't have a vma */
+   ptep_get_and_clear(mm, vaddr, ptep);
+   __flush_tlb_one(vaddr);
+   

[patch] my mmu notifiers

2008-02-19 Thread Nick Piggin
Well I started reviewing the mmu notifier code, but it is kind of hard to
know what you're talking about just by reading through code and not trying
your suggestions for yourself...

So I implemented mmu notifiers slightly differently. Andrea's mmu notifiers
are rather similar. However I have tried to make a point of minimising the
impact the the core mm/. I don't see why we need to invalidate or flush
anything when changing the pte to be _more_ permissive, and I don't
understand the need for invalidate_begin/invalidate_end pairs at all.
What I have done is basically create it so that the notifiers get called
in the same place as the normal TLB flushing is done, and nowhere
else.

I also wanted to avoid calling notifier code from inside eg. hardware TLB
or pte manipulation primitives. These things are already pretty well
spaghetti, so I'd like to just place them right where needed first... I
think eventually it will need a bit of a rethink to make it more consistent
and more general. But I prefer to put them in the caller for the moment.

I have also attempted to write a skeleton driver. Not like Christoph's
drivers, but one that actually does something. This one can mmap a
window into its own virtual address space. It's not perfect yet (I need
to replace page_mkwrite with ->fault in the core mm before I can get
enough information to do protection properly I think). However I think it
may be race-free in the fault vs unmap paths. It's pretty complex, I must
say.

---

Index: linux-2.6/include/linux/mm_types.h
===
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -228,6 +228,9 @@ struct mm_struct {
 #ifdef CONFIG_CGROUP_MEM_CONT
struct mem_cgroup *mem_cgroup;
 #endif
+#ifdef CONFIG_MMU_NOTIFIER
+   struct hlist_head mmu_notifier_list;
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===
--- /dev/null
+++ linux-2.6/include/linux/mmu_notifier.h
@@ -0,0 +1,69 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include 
+#include 
+
+struct mmu_notifier;
+struct mmu_notifier_operations;
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier {
+   struct hlist_node hlist;
+   const struct mmu_notifier_operations *ops;
+   struct mm_struct *mm;
+};
+
+struct mmu_notifier_operations {
+   void (*release)(struct mmu_notifier *mn);
+   int (*clear_young)(struct mmu_notifier *mn, unsigned long address);
+   void (*unmap)(struct mmu_notifier *mn, unsigned long address);
+   void (*invalidate_range)(struct mmu_notifier *mn, unsigned long start, unsigned long end);
+};
+
+static inline void mmu_notifier_init_mm(struct mm_struct *mm)
+{
+   INIT_HLIST_HEAD(&mm->mmu_notifier_list);
+}
+
+static inline void mmu_notifier_init(struct mmu_notifier *mn, const struct mmu_notifier_operations *ops, struct mm_struct *mm)
+{
+   INIT_HLIST_NODE(&mn->hlist);
+   mn->ops = ops;
+   mn->mm = mm;
+}
+
+extern void mmu_notifier_register(struct mmu_notifier *mn);
+extern void mmu_notifier_unregister(struct mmu_notifier *mn);
+
+extern void mmu_notifier_exit_mm(struct mm_struct *mm);
+extern int mmu_notifier_clear_young(struct mm_struct *mm, unsigned long address);
+extern void mmu_notifier_unmap(struct mm_struct *mm, unsigned long address);
+extern void mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end);
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+static inline void mmu_notifier_init_mm(struct mm_struct *mm)
+{
+}
+
+static inline void mmu_notifier_exit_mm(struct mm_struct *mm)
+{
+}
+
+static inline int mmu_notifier_clear_young(struct mm_struct *mm, unsigned long address)
+{
+   return 0;
+}
+
+static inline void mmu_notifier_unmap(struct mm_struct *mm, unsigned long address)
+{
+}
+
+static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end)
+{
+}
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif
Index: linux-2.6/kernel/fork.c
===
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -358,6 +359,7 @@ static struct mm_struct * mm_init(struct
mm->ioctx_list = NULL;
mm->free_area_cache = TASK_UNMAPPED_BASE;
mm->cached_hole_size = ~0UL;
+   mmu_notifier_init_mm(mm);
mm_init_cgroup(mm, p);
 
if (likely(!mm_alloc_pgd(mm))) {
Index: linux-2.6/mm/filemap_xip.c
===
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -195,6 +195,7 @@ __xip_unmap (struct address_space * mapp
/* Nuke the page table entry. */
flush_cache_page(vma, addre

Re: [patch 3/6] mmu_notifier: invalidate_page callbacks

2008-02-19 Thread Nick Piggin
On Sunday 17 February 2008 06:22, Christoph Lameter wrote:
> On Fri, 15 Feb 2008, Andrew Morton wrote:

> > >   flush_cache_page(vma, address, pte_pfn(*pte));
> > >   entry = ptep_clear_flush(vma, address, pte);
> > > + mmu_notifier(invalidate_page, mm, address);
> >
> > I just don't see how ths can be done if the callee has another thread in
> > the middle of establishing IO against this region of memory.
> > ->invalidate_page() _has_ to be able to block.  Confused.
>
> The page lock is held and that holds off I/O?

I think the actual answer is that "it doesn't matter".

ptes are not exactly the entity via which IO gets established, so
all we really care about here is that after the callback finishes,
we will not get any more reads or writes to the page via the
external mapping.

As far as holding off local IO goes, that is the job of the core
VM. (And no, page lock does not necessarily hold it off FYI -- it
can be writeback IO or even IO directly via buffers).

Holding off IO via the external references I guess is a job for
the notifier driver.



Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-19 Thread Nick Piggin
On Tuesday 19 February 2008 20:25, Andi Kleen wrote:
> On Tue, Feb 19, 2008 at 01:33:53PM +1100, Nick Piggin wrote:

> > I actually once measured context switching performance in the scheduler,
> > and removing the  unlikely hint for testing RT tasks IIRC gave about 5%
> > performance drop.
>
> OT: what benchmarks did you use for that? I had a change some time
> ago to the CFS scheduler to avoid unpredicted indirect calls for
> the common case, but I wasn't able to benchmark a difference with the usual
> suspect benchmark (lmbench). Since it increased code size by
> a few bytes it was rejected then.

I think it was just a simple context switch benchmark, but not lmbench
(which I found to be a bit too variable). But it was a long time ago...


> > This was on a P4 which is very different from more modern CPUs both in
> > terms of branch performance characteristics,
> >
> > and icache characteristics.
>
> Hmm, the P4 the trace cache actually should not care about inline
> code that is not executed.

Yeah, which is why it is a bit different from other CPUs. The L2
cache is, I guess, still going to suffer from sparse code, but that
is a bit less important.


> > However, the P4's branch predictor is pretty good, and it should easily
>
> I think it depends on the generation. Prescott class branch
> prediction should be much better than the earlier ones.

I was using a Nocona Xeon, which I think is a Prescott class? And
don't they have much higher mispredict penalty (than older P4s)?


> > Actually one thing I don't like about gcc is that I think it still emits
> > cmovs for likely/unlikely branches,
>
> That's -Os.

And -O2 and -O3, on the gccs that I'm using, AFAIKS.


> > which is silly (the gcc developers
>
> It depends on the CPU. e.g. on K8 and P6 using CMOV if possible
> makes sense. P4 doesn't like it though.

If the branch is completely predictable (eg. annotated), then I
think branches should be used anyway. Even on well predicted
branches, cmov is similar speed on microbenchmarks, but it will
increase data hazards I think, so it will probably be worse for
some real world situations.


> > the quite good numbers that cold CPU predictors can attain. However
> > for really performance critical code (or really "never" executed
> > code), then I think it is OK to have the hints and not have to rely
> > on gcc heuristics.
>
> But only when the explicit hints are different from what the implicit
> branch predictors would predict anyways. And if you look at the
> heuristics that is not often the case...

But a likely branch will be _strongly_ predicted to be taken,
wheras a lot of the gcc heuristics simply have slightly more or
slightly less probability. So it's not just a question of which
way is more likely, but also _how_ likely it is to go that way.



Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Nick Piggin
On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> The invalidation of address ranges in a mm_struct needs to be
> performed when pages are removed or permissions etc change.
>
> If invalidate_range_begin() is called with locks held then we
> pass a flag into invalidate_range() to indicate that no sleeping is
> possible. Locks are only held for truncate and huge pages.
>
> In two cases we use invalidate_range_begin/end to invalidate
> single pages because the pair allows holding off new references
> (idea by Robin Holt).
>
> do_wp_page(): We hold off new references while we update the pte.
>
> xip_unmap: We are not taking the PageLock so we cannot
> use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
> stands in.

This whole thing would be much better if you didn't rely on the page
lock at all, but either a) used the same locking as Linux does for its
ptes/tlbs, or b) had some locking that is private to the mmu notifier
code. Then there is not all this new stuff that has to be understood in
the core VM.

Also, why do you have to "invalidate" ranges when switching to a
_more_ permissive state? This stuff should basically be the same as
(a subset of) the TLB flushing API AFAIKS. Anything more is a pretty
big burden to put in the core VM.

See my alternative patch I posted -- I can't see why it won't work
just like a TLB.

As far as sleeping inside callbacks goes... I think there are big
problems with the patch (the sleeping patch and the external rmap
patch). I don't think it is workable in its current state. Either
we have to make some big changes to the core VM, or we have to turn
some locks into sleeping locks to do it properly AFAIKS. Neither
one is good.

But anyway, I don't really think the two approaches (Andrea's
notifiers vs sleeping/xrmap) should be tangled up too much. I
think Andrea's can possibly be quite unintrusive and useful very
soon.



Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-19 Thread Nick Piggin
On Tuesday 19 February 2008 20:57, Andi Kleen wrote:
> On Tue, Feb 19, 2008 at 08:46:46PM +1100, Nick Piggin wrote:

> > I think it was just a simple context switch benchmark, but not lmbench
> > (which I found to be a bit too variable). But it was a long time ago...
>
> Do you still have it?
>
> I thought about writing my own but ended up being too lazy for that @)

Had a quick look but couldn't find it. It was just two threads running
and switching to each other with a couple of mutexes or yield. If I
find it, then I'll send it over.


> > > > Actually one thing I don't like about gcc is that I think it still
> > > > emits cmovs for likely/unlikely branches,
> > >
> > > That's -Os.
> >
> > And -O2 and -O3, on the gccs that I'm using, AFAIKS.
>
> Well if it still happens on gcc 4.2 with P4 tuning you should
> perhaps open a gcc PR. They tend to ignore these bugs mostly in
> my experience, but sometimes they act on them.

I'm not sure about P4 tuning... But IMO it should not use cmov on
predictable branches too much for any (especially OOOE) CPU.


> > > > which is silly (the gcc developers
> > >
> > > It depends on the CPU. e.g. on K8 and P6 using CMOV if possible
> > > makes sense. P4 doesn't like it though.
> >
> > If the branch is completely predictable (eg. annotated), then I
> > think branches should be used anyway. Even on well predicted
> > branches, cmov is similar speed on microbenchmarks, but it will
> > increase data hazards I think, so it will probably be worse for
> > some real world situations.
>
> At least the respective optimization manuals say they should be used.
> I presume they only made this recommendation after some extensive
> benchmarking.

What I have seen is that they tell you definitely not to use it for
predictable branches. Eg. the Intel optimization manual says

 Use the setcc and cmov instructions to eliminate unpredictable
 conditional branches where possible. Do not do this for predictable
 branches. Do not use these instructions to eliminate all
 unpredictable conditional branches, because using these instructions
 will incur execution overhead due to executing both paths of a
 conditional branch. In addition, converting conditional branches to
 cmovs or setcc trades control-flow dependence for data dependence
 and restricts the capability of the out-of-order engine.


> > But a likely branch will be _strongly_ predicted to be taken,
> > wheras a lot of the gcc heuristics simply have slightly more or
> > slightly less probability. So it's not just a question of which
> > way is more likely, but also _how_ likely it is to go that way.
>
> Yes, but a lot of the heuristics are pretty strong (>80%) and gcc will
> act on them unless it has a very strong contra cue. And that should
> normally not be the case.

True, but if you know a branch is 99%+, then use of likely/unlikely
can still be a good idea. 80% may not be enough to choose a branch
over a cmov for example.



Re: [patch] my mmu notifiers

2008-02-19 Thread Nick Piggin
On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote:
> > are rather similar. However I have tried to make a point of minimising the
> > impact on the core mm/. I don't see why we need to invalidate or flush
> 
> I also tried hard to minimise the impact of the core mm/, I also
> argued with Christoph that cluttering mm/ wasn't a good idea for
> things like age_page that could be a 1 liner change instead of a
> multiple-liner change, without any loss of flexibility or readability.
> 
> > anything when changing the pte to be _more_ permissive, and I don't
> 
> Note that in my patch the invalidate_pages in mprotect can be
> trivially switched to a mprotect_pages with proper params. This will
> prevent page faults completely in the secondary MMU (there will only
> be tlb misses after the tlb flush just like for the core linux pte),
> and it'll allow all the secondary MMU pte blocks (512/1024 at time
> with my PT lock design) to be updated to have proper permissions
> matching the core linux pte.
> 
> > understand the need for invalidate_begin/invalidate_end pairs at all.
> 
> The need of the pairs is crystal clear to me: range_begin is needed
> for GRU _but_only_if_ range_end is called after releasing the
> reference that the VM holds on the page. _begin will flush the GRU tlb
> and at the same time it will take a mutex that will block further GRU
> tlb-miss-interrupts (no idea how they manange those nightmare locking,
> I didn't even try to add more locking to KVM and I get away with the
> fact KVM takes the pin on the page itself).
> 
> My patch calls invalidate_page/pages before the reference is released
> on the page, so GRU will work fine despite lack of
> range_begin. Furthermore with my patch GRU will be auto-serialized by
> the PT lock w/o the need of any additional locking.

That's why I don't understand the need for the pairs: it should be
done like this.


> > What I have done is basically create it so that the notifiers get called
> > basically in the same place as the normal TLB flushing is done, and nowhere
> > else.
> 
> That was one of my objectives too.
> 
> > I also wanted to avoid calling notifier code from inside eg. hardware TLB
> > or pte manipulation primitives. These things are already pretty well
> > spaghetti, so I'd like to just place them right where needed first... I
> > think eventually it will need a bit of a rethink to make it more consistent
> > and more general. But I prefer to do put them in the caller for the moment.
> 
> Your patch should also work for KVM but it's suboptimal, my patch can
> be orders of magnitude more efficient for GRU thanks to the
> invalidate_pages optimization. Christoph complained about having to
> call one method per pte.

OK, I didn't see the invalidate_pages call...

 
> And adding invalidate_range is useless unless you fully support
> xpmem. You're calling invalidate_range in places that can't sleep...

I thought that could be used by a non-sleeping user (not intending
to try supporting sleeping users). If it is useless then it should
go away (BTW. I didn't see your recent patch, some of my confusion
I think stems from Christoph's novel way of merging and splitting
patches).


> No idea why xpmem needs range_begin, I perfectly understand why GRU
> needs _begin with Chrisotph's patch (gru lacks the page pin) but I
> dunno why xpmem needs range_begin (xpmem has the page pin so I also
> think it could avoid using range_begin). Still to support GRU you need
> both to call invalidate_range in places that can sleep and you need
> the external rmap notifier. The moment you add xpmem into the equation
> your and my clean patches become Christoph's one...

Sorry, I kind of didn't have time to follow the conversation so well
before; are there patches posted for gru and/or xpmem?



Re: [patch] my mmu notifiers

2008-02-19 Thread Nick Piggin
On Tue, Feb 19, 2008 at 08:27:25AM -0600, Jack Steiner wrote:
> > On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
> > > understand the need for invalidate_begin/invalidate_end pairs at all.
> > 
> > The need of the pairs is crystal clear to me: range_begin is needed
> > for GRU _but_only_if_ range_end is called after releasing the
> > reference that the VM holds on the page. _begin will flush the GRU tlb
> > and at the same time it will take a mutex that will block further GRU
> > tlb-miss-interrupts (no idea how they manange those nightmare locking,
> > I didn't even try to add more locking to KVM and I get away with the
> > fact KVM takes the pin on the page itself).
> 
> As it turns out, no actual mutex is required. _begin_ simply increments a
> count of active range invalidates, _end_ decrements the count. New TLB
> dropins are deferred while range callouts are active.
> 
> This would appear to be racy but the GRU has special hardware that
> simplifies locking. When the GRU sees a TLB invalidate, all outstanding
> misses & potentially inflight TLB dropins are marked by the GRU with a
> "kill" bit. When the dropin finally occurs, the dropin is ignored & the
> instruction is simply restarted. The instruction will fault again & the TLB
> dropin will be repeated.  This is optimized for the case where invalidates
> are rare - true for users of the GRU.

OK (thanks to Robin as well). Now I understand why you are using it,
but I don't understand why you don't defer new TLBs after the point
where the linux pte changes. If you can do that, then you look and
act much more like a TLB from the point of view of the Linux vm.
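
For reference, my (possibly wrong) reading of that scheme, with made-up
names:

	static atomic_t range_invalidates_active = ATOMIC_INIT(0);

	/* invalidate_range_begin callback */
	atomic_inc(&range_invalidates_active);
	gru_flush_external_tlb(mm, start, end);	/* hypothetical hw flush */

	/* invalidate_range_end callback */
	atomic_dec(&range_invalidates_active);

	/* TLB dropin (miss handling) path */
	if (atomic_read(&range_invalidates_active))
		return -EAGAIN;	/* dropin deferred, instruction retries */

That works, but it means the window where dropins are refused is
bounded by the begin/end pair rather than by the pte update itself,
which is what I was getting at above.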




Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Nick Piggin
On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> The invalidation of address ranges in a mm_struct needs to be
> performed when pages are removed or permissions etc change.
>
> If invalidate_range_begin() is called with locks held then we
> pass a flag into invalidate_range() to indicate that no sleeping is
> possible. Locks are only held for truncate and huge pages.

You can't sleep inside rcu_read_lock()!

I must say that for a patch that is up to v8 or whatever and is
posted twice a week to such a big cc list, it is kind of slack to
not even test it and expect other people to review it.

Also, what we are going to need here are not skeleton drivers
that just do all the *easy* bits (of registering their callbacks),
but actual fully working examples that do everything that any
real driver will need to do. If not for the sanity of the driver
writer, then for the sanity of the VM developers (I don't want
to have to understand xpmem or infiniband in order to understand
how the VM works).



> In two cases we use invalidate_range_begin/end to invalidate
> single pages because the pair allows holding off new references
> (idea by Robin Holt).
>
> do_wp_page(): We hold off new references while we update the pte.
>
> xip_unmap: We are not taking the PageLock so we cannot
> use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
> stands in.
>
> Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>
> Signed-off-by: Robin Holt <[EMAIL PROTECTED]>
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
>
> ---
>  mm/filemap_xip.c |5 +
>  mm/fremap.c  |3 +++
>  mm/hugetlb.c |3 +++
>  mm/memory.c  |   35 +--
>  mm/mmap.c|2 ++
>  mm/mprotect.c|3 +++
>  mm/mremap.c  |7 ++-
>  7 files changed, 51 insertions(+), 7 deletions(-)
>
> Index: linux-2.6/mm/fremap.c
> ===
> --- linux-2.6.orig/mm/fremap.c2008-02-14 18:43:31.0 -0800
> +++ linux-2.6/mm/fremap.c 2008-02-14 18:45:07.0 -0800
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
>   spin_unlock(&mapping->i_mmap_lock);
>   }
>
> + mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
>   err = populate_range(mm, vma, start, size, pgoff);
> + mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
>   if (!err && !(flags & MAP_NONBLOCK)) {
>   if (unlikely(has_write_lock)) {
>   downgrade_write(&mm->mmap_sem);
> Index: linux-2.6/mm/memory.c
> ===
> --- linux-2.6.orig/mm/memory.c2008-02-14 18:43:31.0 -0800
> +++ linux-2.6/mm/memory.c 2008-02-14 18:45:07.0 -0800
> @@ -51,6 +51,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -611,6 +612,9 @@ int copy_page_range(struct mm_struct *ds
>   if (is_vm_hugetlb_page(vma))
>   return copy_hugetlb_page_range(dst_mm, src_mm, vma);
>
> + if (is_cow_mapping(vma->vm_flags))
> + mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0);
> +
>   dst_pgd = pgd_offset(dst_mm, addr);
>   src_pgd = pgd_offset(src_mm, addr);
>   do {
> @@ -621,6 +625,11 @@ int copy_page_range(struct mm_struct *ds
>   vma, addr, next))
>   return -ENOMEM;
>   } while (dst_pgd++, src_pgd++, addr = next, addr != end);
> +
> + if (is_cow_mapping(vma->vm_flags))
> + mmu_notifier(invalidate_range_end, src_mm,
> + vma->vm_start, end, 0);
> +
>   return 0;
>  }
>
> @@ -893,13 +902,16 @@ unsigned long zap_page_range(struct vm_a
>   struct mmu_gather *tlb;
>   unsigned long end = address + size;
>   unsigned long nr_accounted = 0;
> + int atomic = details ? (details->i_mmap_lock != 0) : 0;
>
>   lru_add_drain();
>   tlb = tlb_gather_mmu(mm, 0);
>   update_hiwater_rss(mm);
> + mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
>   end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
>   if (tlb)
>   tlb_finish_mmu(tlb, address, end);
> + mmu_notifier(invalidate_range_end, mm, address, end, atomic);
>   return end;
>  }
>

Where do you invalidate for munmap()?

Also, how do you resolve the case where you are not allowed to sleep?
I would have thought either you have to handle it, in which case nobody
needs to sleep; or you can't handle it, in which case the code is
broken.


Re: [patch] my mmu notifiers

2008-02-19 Thread Nick Piggin
On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote:
> > anything when changing the pte to be _more_ permissive, and I don't
> 
> Note that in my patch the invalidate_pages in mprotect can be
> trivially switched to a mprotect_pages with proper params. This will
> prevent page faults completely in the secondary MMU (there will only
> be tlb misses after the tlb flush just like for the core linux pte),
> and it'll allow all the secondary MMU pte blocks (512/1024 at time
> with my PT lock design) to be updated to have proper permissions
> matching the core linux pte.

Sorry, I realise I still didn't get this through my head yet (and also
have not seen your patch recently). So I don't know exactly what you
are doing...

But why does _anybody_ (why does Christoph's patches) need to invalidate
when they are going to be more permissive? This should be done lazily by
the driver, I would have thought.


Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-19 Thread Nick Piggin
On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> These special additional callbacks are required because XPmem (and likely
> other mechanisms) do use their own rmap (multiple processes on a series
> of remote Linux instances may be accessing the memory of a process).
> F.e. XPmem may have to send out notifications to remote Linux instances
> and receive confirmation before a page can be freed.
>
> So we handle this like an additional Linux reverse map that is walked after
> the existing rmaps have been walked. We leave the walking to the driver
> that is then able to use something else than a spinlock to walk its reverse
> maps. So we can actually call the driver without holding spinlocks while we
> hold the Pagelock.

I don't know how this is supposed to solve anything. The sleeping
problem happens I guess mostly in truncate. And all you are doing
is putting these rmap callbacks in page_mkclean and try_to_unmap.


> However, we cannot determine the mm_struct that a page belongs to at
> that point. The mm_struct can only be determined from the rmaps by the
> device driver.
>
> We add another pageflag (PageExternalRmap) that is set if a page has
> been remotely mapped (f.e. by a process from another Linux instance).
> We can then only perform the callbacks for pages that are actually in
> remote use.
>
> Rmap notifiers need an extra page bit and are only available
> on 64 bit platforms. This functionality is not available on 32 bit!
>
> A notifier that uses the reverse maps callbacks does not need to provide
> the invalidate_page() method that is called when locks are held.

That doesn't seem right. To start with, the new callbacks aren't
even called in the places where invalidate_page isn't allowed to
sleep.

The problem is unmap_mapping_range, right? And unmap_mapping_range
must walk the rmaps with the mmap lock held, which is why it can't
sleep. And it can't hold any mmap_sem so it cannot prevent address
space modifications of the processes in question between the time
you unmap them from the linux ptes with unmap_mapping_range, and the
time that you unmap them from your driver.

So in the meantime, you could have eg. a fault come in and set up a
new page for one of the processes, and that page might even get
exported via the same external driver. And now you have a totally
inconsistent view.

Preventing new mappings from being set up until the old mapping is
completely flushed is basically what we need to ensure for any sane
TLB as far as I can tell. To do that, you'll need to make the mmap
lock sleep, and either take mmap_sem inside it (which is a
deadlock condition at the moment), or make ptl sleep as well. These
are simply the locks we use to prevent that from happening, so I
can't see how you can possibly hope to have a coherent TLB without
invalidating inside those locks.



Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Nick Piggin
On Wednesday 20 February 2008 14:00, Robin Holt wrote:
> On Wed, Feb 20, 2008 at 02:00:38AM +0100, Andrea Arcangeli wrote:
> > On Wed, Feb 20, 2008 at 10:08:49AM +1100, Nick Piggin wrote:

> > > Also, how do you resolve the case where you are not allowed to sleep?
> > > I would have thought either you have to handle it, in which case nobody
> > > needs to sleep; or you can't handle it, in which case the code is
> > > broken.
> >
> > I also asked exactly this, glad you reasked this too.
>
> Currently, we BUG_ON having a PFN in our tables and not being able
> to sleep.  These are mappings which MPT has never supported in the past
> and XPMEM was already not allowing page faults for VMAs which are not
> anonymous so it should never happen.  If the file-backed operations can
> ever get changed to allow for sleeping and a customer has a need for it,
> we would need to change XPMEM to allow those types of faults to succeed.

Do you really want to be able to swap, or are you just interested
in keeping track of unmaps / prot changes?



Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-19 Thread Nick Piggin
On Wednesday 20 February 2008 14:12, Robin Holt wrote:
> For XPMEM, we do not currently allow file backed
> mapping pages from being exported so we should never reach this condition.
> It has been an issue since day 1.  We have operated with that assumption
> for 6 years and have not had issues with that assumption.  The user of
> xpmem is MPT and it controls the communication buffers so it is reasonable
> to expect this type of behavior.

OK, that makes things simpler.

So why can't you export a device from your xpmem driver, which
can be mmap()ed to give out "anonymous" memory pages to be used
for these communication buffers?

I guess you may also want an "munmap/mprotect" callback, which
we don't have in the kernel right now... but at least you could
prototype it easily by having an ioctl to be called before
munmapping or mprotecting (eg. the ioctl could prevent new TLB
setup for the region, and shoot down existing ones).
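
Roughly what I mean, from the userspace side (the ioctl name and number
are invented, this is purely a prototype sketch):

	struct xpmem_range {
		unsigned long start;
		unsigned long len;
	};
	#define XPMEM_IOC_UNEXPORT	_IOW('x', 1, struct xpmem_range)

	/* before unmapping or mprotecting an exported buffer */
	struct xpmem_range r = { .start = addr, .len = len };
	ioctl(xpmem_fd, XPMEM_IOC_UNEXPORT, &r);
	/* driver blocks new remote TLB setup for the range and shoots
	   down existing entries before returning */
	munmap((void *)addr, len);

Clumsy, but it would let you prototype the semantics without any new
core VM hooks.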

This is actually going to be much faster for you if you use any
threaded applications, because you will be able to do all the
shootdown round trips outside mmap_sem, and so you will be able
to have other threads faulting and even mmap()ing / munmaping
at the same time as the shootdown is happening.

I guess there is some catch...



Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-20 Thread Nick Piggin
On Wednesday 20 February 2008 20:00, Robin Holt wrote:
> On Wed, Feb 20, 2008 at 02:51:45PM +1100, Nick Piggin wrote:
> > On Wednesday 20 February 2008 14:12, Robin Holt wrote:
> > > For XPMEM, we do not currently allow file backed
> > > mapping pages from being exported so we should never reach this
> > > condition. It has been an issue since day 1.  We have operated with
> > > that assumption for 6 years and have not had issues with that
> > > assumption.  The user of xpmem is MPT and it controls the communication
> > > buffers so it is reasonable to expect this type of behavior.
> >
> > OK, that makes things simpler.
> >
> > So why can't you export a device from your xpmem driver, which
> > can be mmap()ed to give out "anonymous" memory pages to be used
> > for these communication buffers?
>
> Because we need to have heap and stack available as well.  MPT does
> not control all the communication buffer areas.  I haven't checked, but
> this is the same problem that IB will have.  I believe they are actually
> allowing any memory region be accessible, but I am not sure of that.

Then you should create a driver that the user program can register
and unregister regions of their memory with. The driver can do a
get_user_pages to get the pages, and then you'd just need to set up
some kind of mapping so that userspace can unmap pages / won't leak
memory (and an exit_mm notifier I guess).

Because you don't need to swap, you don't need coherency, and you
are in control of the areas, then this seems like the best choice.
It would allow you to use heap, stack, file-backed, anything.
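
A sketch of the sort of registration path I mean (names invented, error
handling and the unregister / exit_mm side omitted):

	static int xpmem_register_region(unsigned long start, int nr_pages,
					 int write, struct page **pages)
	{
		int got;

		down_read(&current->mm->mmap_sem);
		got = get_user_pages(current, current->mm, start, nr_pages,
				     write, 0, pages, NULL);
		up_read(&current->mm->mmap_sem);

		/*
		 * The pages are now pinned and can be exported to the
		 * remote side. The unregister path must drop the
		 * references again with put_page(), and an exit_mm
		 * notifier can clean up if the process dies without
		 * unregistering.
		 */
		return got;
	}

Because the pages stay pinned for the life of the registration, there
is no coherency problem left to solve -- which is the point.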



Re: [patch] my mmu notifiers

2008-02-20 Thread Nick Piggin
On Tue, Feb 19, 2008 at 05:40:50PM -0600, Jack Steiner wrote:
> On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote:
> > On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
> > > On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote:
> > > > anything when changing the pte to be _more_ permissive, and I don't
> > > 
> > > Note that in my patch the invalidate_pages in mprotect can be
> > > trivially switched to a mprotect_pages with proper params. This will
> > > prevent page faults completely in the secondary MMU (there will only
> > > be tlb misses after the tlb flush just like for the core linux pte),
> > > and it'll allow all the secondary MMU pte blocks (512/1024 at time
> > > with my PT lock design) to be updated to have proper permissions
> > > matching the core linux pte.
> > 
> > Sorry, I realise I still didn't get this through my head yet (and also
> > have not seen your patch recently). So I don't know exactly what you
> > are doing...
> > 
> > But why does _anybody_ (why does Christoph's patches) need to invalidate
> > when they are going to be more permissive? This should be done lazily by
> > the driver, I would have thought.
> 
> 
> Agree. Although for most real applications, the performance difference
> is probably negligible.

But importantly, doing it that way means you share test coverage with
the CPU TLB flushing code, and you don't introduce a new concept to the
VM.

So, it _has_ to be lazy flushing, IMO (as there doesn't seem to be a
good reason otherwise). mprotect shouldn't really be a special case,
because it still has to flush the CPU tlbs as well when restricting
access.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] my mmu notifiers

2008-02-20 Thread Nick Piggin
On Wed, Feb 20, 2008 at 02:09:41AM +0100, Andrea Arcangeli wrote:
> On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote:
> > Sorry, I realise I still didn't get this through my head yet (and also
> > have not seen your patch recently). So I don't know exactly what you
> > are doing...
> 
> The last version was posted here:
> 
> http://marc.info/?l=kvm-devel&m=120321732521533&w=2
> 
> > But why does _anybody_ (why does Christoph's patches) need to invalidate
> > when they are going to be more permissive? This should be done lazily by
> > the driver, I would have thought.
> 
> This can be done lazily by the driver yes. The place where I've an
> invalidate_pages in mprotect however can also become less permissive.

That's OK, because we have to flush tlbs there too.


> It's simpler to invalidate always and it's not guaranteed the
> secondary mmu page fault is capable of refreshing the spte across a
> writeprotect fault.

I think we just have to make sure that it _can_ do writeprotect
faults. AFAIKS, that will be possible if the driver registers a
.page_mkwrite handler (actually not quite -- page_mkwrite is fairly
crap, so I have a patch to merge it together with .fault so we get
address information as well). Anyway, I really think we should do
it that way.
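
e.g. the driver side would look something like this (sketch only; the
my_* helpers are made up, and this uses the current page_mkwrite
prototype, which as I said above only gives you the page, not the
address):

static void my_spte_make_writable(struct vm_area_struct *vma,
				  struct page *page);	/* made-up hook */
static int my_fault(struct vm_area_struct *vma, struct vm_fault *vmf);

static int my_page_mkwrite(struct vm_area_struct *vma, struct page *page)
{
	/*
	 * The core VM is about to make its pte writable: upgrade (or
	 * simply invalidate) the matching secondary MMU entry so it
	 * refaults with write permission.
	 */
	my_spte_make_writable(vma, page);
	return 0;
}

static struct vm_operations_struct my_vm_ops = {
	.fault		= my_fault,
	.page_mkwrite	= my_page_mkwrite,
};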

> In the future this can be changed to
> mprotect_pages though, so no page fault will happen in the secondary
> mmu.

Possibly, but hopefully not needed for performance. Let's wait and
see.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mmu notifiers #v6

2008-02-20 Thread Nick Piggin
On Wed, Feb 20, 2008 at 11:39:42AM +0100, Andrea Arcangeli wrote:
> Given Nick's comments I ported my version of the mmu notifiers to
> latest mainline. There are no known bugs AFIK and it's obviously safe
> (nothing is allowed to schedule inside rcu_read_lock taken by
> mmu_notifier() with my patch).

Thanks! Yes the seqlock you are using now ends up looking similar
to what I did and I couldn't find a hole in that either. So I
think this is going to work.

I do prefer some parts of my patch; however, for everyone's sanity, I
think you should be the maintainer of the mmu notifiers, and I
will send you incremental changes that can be discussed more easily
that way (nothing major, mainly style and minor things).


> XPMEM simply can't use RCU for the registration locking if it wants to
> schedule inside the mmu notifier calls. So I guess it's better to add
> the XPMEM invalidate_range_end/begin/external-rmap as a whole
> different subsystem that will have to use a mutex (not RCU) to
> serialize, and at the same time that CONFIG_XPMEM will also have to
> switch the i_mmap_lock to a mutex. I doubt xpmem fits inside a
> CONFIG_MMU_NOTIFIER anymore, or we'll all run a bit slower because of
> it. It's really a call of how much we want to optimize the MMU
> notifier, by keeping things like RCU for the registration.

I agree: your coherent, non-sleeping mmu notifiers are pretty simple
and unintrusive. The sleeping version is fundamentally going to either
need to change VM locks, or be non-coherent, so I don't think there is
a question of making one solution fit everybody. So the sleeping /
xrmap patch should be kept either completely independent, or as an
add-on to this one.

I will post some suggestions to you when I get a chance.

 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mmu notifiers #v6

2008-02-20 Thread Nick Piggin
On Wed, Feb 20, 2008 at 01:03:24PM +0100, Andrea Arcangeli wrote:
> If there's agreement that the VM should alter its locking from
> spinlock to mutex for its own good, then Christoph's
> one-config-option-fits-all becomes a lot more appealing (replacing RCU
> with a mutex in the mmu notifier list registration locking isn't my
> main worry and the non-sleeping-users may be ok to live with it).

Just from a high level view, in some cases we can just say that no we
aren't going to support this. And this may well be one of those cases.

The more constraints placed on the VM, the harder it becomes to
improve and adapt in future, and this seems like a pretty big restriction
(especially if we can eg. work around it completely by having a special
purpose driver to get_user_pages on comm buffers, as I suggested in the
other mail).

At any rate, I believe Andrea's patch really places minimal or no further
constraints than a regular CPU TLB (or the hash tables that some archs
implement). So we're kind of in 2 different leagues here.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Document huge memory/cache overhead of memory controller in Kconfig

2008-02-21 Thread Nick Piggin
On Wednesday 20 February 2008 23:52, Balbir Singh wrote:
> Andi Kleen wrote:
> > Document huge memory/cache overhead of memory controller in Kconfig
> >
> > I was a little surprised that 2.6.25-rc* increased struct page for the
> > memory controller.  At least on many x86-64 machines it will not fit into
> > a single cache line now anymore and also costs considerable amounts of
> > RAM.
>
> The size of struct page earlier was 56 bytes on x86_64 and with 64 bytes it
> won't fit into the cacheline anymore? Please also look at
> http://lwn.net/Articles/234974/

BTW. We'll probably want to increase the width of some counters
in struct page at some point for 64-bit, so then it really will
go over with the memory controller!

Actually, an external data structure is a pretty good idea. We
could probably do it easily with a radix tree (pfn->memory
controller). And that might be a better option for distros.
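
Roughly what I have in mind (sketch only; mem_ctl_state and the
function names are made up):

#include <linux/radix-tree.h>
#include <linux/spinlock.h>

struct mem_ctl_state;	/* whatever per-page state the controller needs */

static RADIX_TREE(pfn_to_memctl, GFP_ATOMIC);
static DEFINE_SPINLOCK(pfn_to_memctl_lock);

static int memctl_attach(unsigned long pfn, struct mem_ctl_state *state)
{
	int err;

	spin_lock(&pfn_to_memctl_lock);
	err = radix_tree_insert(&pfn_to_memctl, pfn, state);
	spin_unlock(&pfn_to_memctl_lock);
	return err;
}

static struct mem_ctl_state *memctl_lookup(unsigned long pfn)
{
	return radix_tree_lookup(&pfn_to_memctl, pfn);
}

That keeps struct page at its current size; the cost is an extra
lookup on the charge/uncharge paths, which distros might well prefer
to a fatter struct page.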

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git pull] more SLUB updates for 2.6.25

2008-02-07 Thread Nick Piggin
On Friday 08 February 2008 13:13, Christoph Lameter wrote:
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm.git slub-linus
>
> (includes the cmpxchg_local fastpath since the cmpxchg_local work
> by Matheiu is in now, and the non atomic unlock by Nick. Verified that
> this is not doing any harm after some other patches had been removed.

Ah, good. I think it is always a good thing to be able to remove atomics.
They place quite a bit of burden on the CPU, especially on x86 where they
also carry implicit memory ordering semantics (although x86 can speculatively
get around much of the problem, it's obviously worse than no restriction).

Even if perhaps some cache coherency or timing quirk makes the non-atomic
version slower (all else being equal), then I'd still say that the non
atomic version should be preferred.

Thanks,
Nick

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)

2008-02-07 Thread Nick Piggin
On Thu, Feb 07, 2008 at 07:25:45PM +0100, Jens Axboe wrote:
> Hi,
> 
> Here's a variant using kernel threads only, the nasty arch bits are then
> not needed. Works for me, no performance testing (that's a hint for Alan
> to try and queue up some testing for this variant as well :-)

Well this stuff looks pretty nice (although I'm not sure whether the
softirq->thread changes are a good idea for performance, I guess we'll
see).

You still don't have the option that the Intel patch gave, that is,
to submit on the completer. I guess that you could do it somewhat
generically by having a cpuid in the request queue, and update that
with the completing cpu.

At least they reported it to be the most efficient scheme in their
testing, and Dave thought that migrating completions out to submitters
might be a bottleneck in some cases.
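
Something along these lines is all I mean (sketch; the field and
helpers are made up, this is not how the block layer tracks it):

#include <linux/smp.h>

struct queue_affinity {
	int last_complete_cpu;		/* -1 until the first completion */
};

static void note_completion_cpu(struct queue_affinity *qa)
{
	qa->last_complete_cpu = smp_processor_id();
}

static int pick_submit_cpu(struct queue_affinity *qa)
{
	int cpu = qa->last_complete_cpu;

	/* Submit on the completer if we know it, otherwise stay put. */
	return cpu >= 0 ? cpu : smp_processor_id();
}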

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc] direct IO submission and completion scalability issues

2008-02-07 Thread Nick Piggin
On Tue, Feb 05, 2008 at 11:14:19AM +1100, David Chinner wrote:
> On Mon, Feb 04, 2008 at 11:09:59AM +0100, Nick Piggin wrote:
> > You get better behaviour in the slab and page allocators and locality
> > and cache hotness of memory. For example, I guess in a filesystem /
> > pagecache heavy workload, you have to touch each struct page, buffer head,
> > fs private state, and also often have to wake the thread for completion.
> > Much of this data has just been touched at submit time, so doin this on
> > the same CPU is nice...
> 
> []
> 
> > I'm surprised that the xfs global state bouncing would outweigh the
> > bouncing of all the per-page/block/bio/request/etc data that gets touched
> > during completion. We'll see.
> 
> per-page/block.bio/request/etc is local to a single I/O. the only
> penalty is a cacheline bounce for each of the structures from one
> CPU to another.  That is, there is no global state modified by these
> completions.

Yeah, but it is going from _all_ submitting CPUs to the one completing
CPU. So you could bottleneck the interconnect at the completing CPU
just as much as if you had cachelines being pulled the other way (ie.
many CPUs trying to pull in a global cacheline).

 
> The real issue is metadata. The transaction log I/O completion
> funnels through a state machine protected by a single lock, which
> means completions on different CPUs pulls that lock to all
> completion CPUs. Given that the same lock is used during transaction
> completion for other state transitions (in task context, not intr),
> the more cpus active at once touches, the worse the problem gets.

OK, once you add locking (and not simply cacheline contention), then
the problem gets harder I agree. But I think that if the submitting
side takes the same locks as log completion (eg. maybe for starting a
new transaction), then it is not going to be a clear win either way,
and you'd have to measure it in the end.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)

2008-02-07 Thread Nick Piggin
On Fri, Feb 08, 2008 at 08:47:47AM +0100, Jens Axboe wrote:
> On Fri, Feb 08 2008, Nick Piggin wrote:
> > On Thu, Feb 07, 2008 at 07:25:45PM +0100, Jens Axboe wrote:
> > > Hi,
> > > 
> > > Here's a variant using kernel threads only, the nasty arch bits are then
> > > not needed. Works for me, no performance testing (that's a hint for Alan
> > > to try and queue up some testing for this variant as well :-)
> > 
> > Well this stuff looks pretty nice (although I'm not sure whether the
> > softirq->thread changes are a good idea for performance, I guess we'll
> > see).
> 
> Yeah, that is indeed an open question and why I have two seperate
> patches for now (io-cpu-affinity branch and io-cpu-affinity-kthread
> branch). As Ingo mentioned, this is how softirqs are handled in the -rt
> branch already.
 
True, although there are some IO workloads where -rt falls behind
mainline. May not be purely due to irq threads though, of course.


> > You still don't have the option that the Intel patch gave, that is,
> > to submit on the completer. I guess that you could do it somewhat
> > generically by having a cpuid in the request queue, and update that
> > with the completing cpu.
> 
> Not sure what you mean, if setting queue_affinity doesn't accomplish it.
> If you know the completing CPU to begin with, surely you can just set
> the queuing affinity appropriately?

And if you don't?


> > At least they reported it to be the most efficient scheme in their
> > testing, and Dave thought that migrating completions out to submitters
> > might be a bottleneck in some cases.
> 
> More so than migrating submitters to completers? The advantage of only
> movign submitters is that you get rid of the completion locking. Apart
> from that, the cost should be the same, especially for the thread based
> solution.

Not specifically for the block layer, but higher layers like xfs.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git pull] more SLUB updates for 2.6.25

2008-02-08 Thread Nick Piggin
On Friday 08 February 2008 18:29, Eric Dumazet wrote:
> Nick Piggin a écrit :
> > On Friday 08 February 2008 13:13, Christoph Lameter wrote:
> >> are available in the git repository at:
> >>
> >>   git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm.git
> >> slub-linus
> >>
> >> (includes the cmpxchg_local fastpath since the cmpxchg_local work
> >> by Matheiu is in now, and the non atomic unlock by Nick. Verified that
> >> this is not doing any harm after some other patches had been removed.
> >
> > Ah, good. I think it is always a good thing to be able to remove atomics.
> > They place quite a bit of burden on the CPU, especially x86 where it also
> > has implicit memory ordering semantics (although x86 can speculatively
> > get around much of the problem, it's obviously worse than no restriction)
> >
> > Even if perhaps some cache coherency or timing quirk makes the non-atomic
> > version slower (all else being equal), then I'd still say that the non
> > atomic version should be preferred.
>
> What about IRQ masking then ?

I really did mean all else being equal. eg. "clear_bit" vs "__clear_bit".


> Many CPU pay high cost for cli/sti pair...

True, and many UP architectures have to implement atomic operations
with cli/sti pairs... so those are more reasons to use non-atomics.
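
To be concrete, this is all I mean by "all else being equal" (sketch):

#include <linux/bitops.h>
#include <linux/spinlock.h>

struct obj {
	spinlock_t lock;
	unsigned long flags;
};

#define MY_FLAG 0	/* made-up example flag */

static void mark_obj(struct obj *obj)
{
	spin_lock(&obj->lock);
	/*
	 * flags is only ever modified under obj->lock here, so the
	 * non-atomic bitop is enough: a plain load/modify/store, no
	 * lock-prefixed RMW and no implied memory barrier on x86.
	 */
	__set_bit(MY_FLAG, &obj->flags);
	spin_unlock(&obj->lock);
}

static void mark_obj_unlocked(struct obj *obj)
{
	/* No exclusion here, so the atomic version is required. */
	set_bit(MY_FLAG, &obj->flags);
}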


> And SLAB/SLUB allocators, even if only used from process context, want to
> disable/re-enable interrupts...
>
> I understand kmalloc() want generic pools, but dedicated pools could avoid
> this cli/sti

Sure, I guess that would be possible. I've kind of toyed with doing
some cli/sti mitigation in the page allocator, but in that case I
found that it wasn't a win outside microbenchmarks: the cache
characteristics of the returned pages are just as important if not
more so than cli/sti costs (although that balance would change
depending on the CPU and workload I guess).

For slub, yes, you could do it with fewer downsides by using process
context pools.

Is it possible instead for architectures where cli/sti is so expensive
to change their lowest level of irq handling to do this by setting and
clearing a soft flag somewhere? That's what I'd rather see, if possible.
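
Conceptually something like this (made-up names; the real thing, as
powerpc does it, also has to remember and replay interrupts that
arrive while soft-disabled):

#include <linux/compiler.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(int, soft_irq_disabled);

static inline void cheap_local_irq_disable(void)
{
	__get_cpu_var(soft_irq_disabled) = 1;
	barrier();			/* no expensive cli */
}

static inline void cheap_local_irq_enable(void)
{
	barrier();
	__get_cpu_var(soft_irq_disabled) = 0;
	/* real implementation: replay anything that was held off here */
}

/* the low-level interrupt entry path asks this before calling handlers */
static inline int irq_should_defer(void)
{
	return __get_cpu_var(soft_irq_disabled);
}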
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)

2008-02-08 Thread Nick Piggin
On Fri, Feb 08, 2008 at 08:59:55AM +0100, Jens Axboe wrote:
> On Fri, Feb 08 2008, Nick Piggin wrote:
> > And if you don't?
> 
> Well if you don't ask for anything, you wont get anything :-)
> As I mentioned, the patch is a playing ground for trying various setups.
> Everything defaults to 'do as usual', set options to setup certain test
> scenarios.

I mean if you don't know the completing CPU.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)

2008-02-08 Thread Nick Piggin
On Fri, Feb 08, 2008 at 09:24:22AM +0100, Jens Axboe wrote:
> On Fri, Feb 08 2008, Nick Piggin wrote:
> > On Fri, Feb 08, 2008 at 08:59:55AM +0100, Jens Axboe wrote:
> > > On Fri, Feb 08 2008, Nick Piggin wrote:
> > > > And if you don't?
> > > 
> > > Well if you don't ask for anything, you wont get anything :-)
> > > As I mentioned, the patch is a playing ground for trying various setups.
> > > Everything defaults to 'do as usual', set options to setup certain test
> > > scenarios.
> > 
> > I mean if you don't know the completing CPU.
> 
> I still don't know quite what part of that patch you are referring to
> here. If you don't have queue_affinity set, queueing a new request with
> the hardware is generally done on the same CPU that just completed a
> request. That is true even without any patches.

Generally, but I guess not always. The database workloads in question
(which you might know very well about ;)) apparently have a lot of
queue-empty and unplug conditions, which I guess is the reason for
Intel's initial patch.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] block layer: kmemcheck fixes

2008-02-08 Thread Nick Piggin
On Fri, Feb 08, 2008 at 07:09:07AM -0800, Arjan van de Ven wrote:
> David Miller wrote:
> >From: Linus Torvalds <[EMAIL PROTECTED]>
> >Date: Thu, 7 Feb 2008 09:42:56 -0800 (PST)
> >
> >>Can we please just stop doing these one-by-one assignments, and just do 
> >>something like
> >>
> >>memset(rq, 0, sizeof(*rq));
> >>rq->q = q;
> >>rq->ref_count = 1;
> >>INIT_HLIST_NODE(&rq->hash);
> >>RB_CLEAR_NODE(&rq->rb_node);
> >>
> >>instead?
> >>
> >>The memset() is likely faster and smaller than one-by-one assignments 
> >>anyway, even if the one-by-ones can avoid initializing some field or 
> >>there ends up being a double initialization..
> >
> >The problem is store buffer compression.  At least a few years
> >ago this made a huge difference in sk_buff initialization in the
> >networking.
> >
> >Maybe cpus these days have so much store bandwith that doing
> >things like the above is OK, but I doubt it :-)
> 
> on modern x86 cpus the memset may even be faster if the memory isn't in 
> cache;
> the "explicit" method ends up doing Write Allocate on the cache lines
> (so read them from memory) even though they then end up being written 
> entirely.
> With memset the CPU is told that the entire range is set to a new value, and
> the WA can be avoided for the whole-cachelines in the range.

Don't you have write combining store buffers? Or is it still speculatively
issuing the reads even before the whole cacheline is combined?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] block layer: kmemcheck fixes

2008-02-08 Thread Nick Piggin
On Fri, Feb 08, 2008 at 02:56:09PM -0800, Arjan van de Ven wrote:
> Nick Piggin wrote:
> >>>Maybe cpus these days have so much store bandwith that doing
> >>>things like the above is OK, but I doubt it :-)
> >>on modern x86 cpus the memset may even be faster if the memory isn't in 
> >>cache;
> >>the "explicit" method ends up doing Write Allocate on the cache lines
> >>(so read them from memory) even though they then end up being written 
> >>entirely.
> >>With memset the CPU is told that the entire range is set to a new value, 
> >>and
> >>the WA can be avoided for the whole-cachelines in the range.
> >
> >Don't you have write combining store buffers? Or is it still speculatively
> >issuing the reads even before the whole cacheline is combined?
> 
> x86 memory order model doesn't allow that quite; and you need a "series" of 
> at least 64 bytes
> without any other memory accesses in between even if it would
> not happening in practice.

OK, fair enough... then it will be a very nice test to see if it
helps. I'm sure you could have an arch specific initialisation
function if it makes a significant difference.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Oops report for the week upto Feb 10th 2008

2008-02-10 Thread Nick Piggin
On Monday 11 February 2008 11:35, Arjan van de Ven wrote:
> The http://www.kerneloops.org website collects kernel oops and
> warning reports from various mailing lists and bugzillas as well as
> with a client users can install to auto-submit oopses.
> Below is a top 10 list of the oopses/backtraces collected in the last 7
> days. (Reports prior to 2.6.23 have been omitted in collecting the top 10)
>
> This week, a total of 323 oopses and warnings have been reported,
> compared to 110 reports in the previous week.
>
> (This sharp increase is due to Fedora 9 alpha shipping the oops data
> collection client in the default install, giving us much wider coverage
> in the issues that actual users hit; many thanks to the Fedora project
> for this)
>
> With the 2.6.25-rc1 release out, this will be the last report that includes
> 2.6.23; future reports will only include issues from 2.6.24 and later.
>
>
> Rank 1: set_dentry_child_flags
>   WARN_ON at fs/inotify.c:172 set_dentry_child_flags
>   Reported 93 times (116 total reports)
>   This is a user triggered WARN_ON in inotify. Sadly inotify seems to be
> unmaintained. More info:
> http://www.kerneloops.org/search.php?search=set_dentry_child_flags

I was never able to trigger this or get anyone to reliably trigger it with
a debug patch in, which is why it has taken so long to fix. It looks like
kde4 is triggering this big rash of new reports.

Anyway, I have fixed a race or two and removed that warning code (which was
also a little racy). So I think that should be OK.


> Rank 9: mark_buffer_dirty
>   WARN_ON at fs/buffer.c:1169
>   This indicates that a non-uptodate buffer is marked dirty.
>   This can lead to data corruption!
>   Reported 5 times (12 total reports) - Only seen since 2.6.24-rc6
>   Usually happens during umount()
>   More info: http://www.kerneloops.org/search.php?search=mark_buffer_dirty

That's interesting.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] alloc_percpu() fails to allocate percpu data

2008-02-23 Thread Nick Piggin
On Friday 22 February 2008 09:26, Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 19:00 +0100, Eric Dumazet wrote:
> > Some oprofile results obtained while using tbench on a 2x2 cpu machine
> > were very surprising.
> >
> > For example, loopback_xmit() function was using high number of cpu
> > cycles to perform the statistic updates, supposed to be real cheap
> > since they use percpu data
> >
> > pcpu_lstats = netdev_priv(dev);
> > lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
> > lb_stats->packets++;  /* HERE : serious contention */
> > lb_stats->bytes += skb->len;
> >
> >
> > struct pcpu_lstats is a small structure containing two longs. It
> > appears that on my 32bits platform, alloc_percpu(8) allocates a single
> > cache line,  instead of giving to each cpu a separate cache line.
> >
> > Using the following patch gave me impressive boost in various
> > benchmarks ( 6 % in tbench) (all percpu_counters hit this bug too)
> >
> > Long term fix (ie >= 2.6.26) would be to let each CPU allocate their
> > own block of memory, so that we dont need to roudup sizes to
> > L1_CACHE_BYTES, or merging the SGI stuff of course...
> >
> > Note : SLUB vs SLAB is important here to *show* the improvement, since
> > they dont have the same minimum allocation sizes (8 bytes vs 32
> > bytes). This could very well explain regressions some guys reported
> > when they switched to SLUB.
>
> I've complained about this false sharing as well, so until we get the
> new and improved percpu allocators,

What I don't understand is why the slab allocators have something like
this in them:

	if ((flags & SLAB_HWCACHE_ALIGN) &&
			size > cache_line_size() / 2)
		return max_t(unsigned long, align, cache_line_size());

If you ask for HWCACHE_ALIGN, then you should get it. I don't
understand why they think they know better than the caller.
Things like this are just going to lead to very difficult to track
performance problems, and possibly correctness problems in rare cases.

There could be another flag for "maybe align".
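
(In the meantime a caller can always force the issue by padding; e.g.
for the pcpu_lstats case above, a sketch:)

#include <linux/cache.h>
#include <linux/percpu.h>

/* Pad each CPU's counters out to a full cacheline so two CPUs can
 * never share a line, whatever the allocator decides to pack. */
struct pcpu_lstats {
	unsigned long packets;
	unsigned long bytes;
} ____cacheline_aligned_in_smp;

/* used as e.g.: pcpu_lstats = alloc_percpu(struct pcpu_lstats); */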

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70

2008-02-25 Thread Nick Piggin
On Wednesday 20 February 2008 09:01, Alexey Dobriyan wrote:
> On Tue, Feb 19, 2008 at 11:47:11PM +0300,  wrote:

> > > Are you reproducing it simply by running the
> > > ftest03 binary directly from the shell? How many times between oopses?
> > > It is multi-process but no threads, so races should be minimal down
> > > this path -- can you get an strace of the failing process?
>
> Speaking of multi-proceseness, changing MAXCHILD to 1, nchild to 1,
> AFAICS, generates one child which oopses the very same way (in parallel
> with generic LTP) But, lowering MAXIOVCNT to 8 generates no oops.

Thanks, I was able to reproduce quite easily with these settings.
I think I have the correct patch now (at least it isn't triggerable
any more here).

Thanks,
Nick
diff --git a/mm/filemap.c b/mm/filemap.c
index 5c74b68..2650073 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1750,14 +1750,18 @@ static void __iov_iter_advance_iov(struct iov_iter *i, size_t bytes)
 	} else {
 		const struct iovec *iov = i->iov;
 		size_t base = i->iov_offset;
+		size_t copied = 0;
 
 		/*
 		 * The !iov->iov_len check ensures we skip over unlikely
-		 * zero-length segments.
+		 * zero-length segments (without overruning the iovec).
 		 */
-		while (bytes || !iov->iov_len) {
-			int copy = min(bytes, iov->iov_len - base);
+		while (copied < bytes ||
+			unlikely(!iov->iov_len && copied < i->count)) {
+			int copy;
 
+			copy = min(bytes, iov->iov_len - base);
+			copied += copy;
 			bytes -= copy;
 			base += copy;
 			if (iov->iov_len == base) {


Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-25 Thread Nick Piggin
On Thursday 21 February 2008 21:58, Robin Holt wrote:
> On Thu, Feb 21, 2008 at 03:20:02PM +1100, Nick Piggin wrote:
> > > > So why can't you export a device from your xpmem driver, which
> > > > can be mmap()ed to give out "anonymous" memory pages to be used
> > > > for these communication buffers?
> > >
> > > Because we need to have heap and stack available as well.  MPT does
> > > not control all the communication buffer areas.  I haven't checked, but
> > > this is the same problem that IB will have.  I believe they are
> > > actually allowing any memory region be accessible, but I am not sure of
> > > that.
> >
> > Then you should create a driver that the user program can register
> > and unregister regions of their memory with. The driver can do a
> > get_user_pages to get the pages, and then you'd just need to set up
> > some kind of mapping so that userspace can unmap pages / won't leak
> > memory (and an exit_mm notifier I guess).
>
> OK.  You need to explain this better to me.  How would this driver
> supposedly work?  What we have is an MPI library.  It gets invoked at
> process load time to establish its rank-to-rank communication regions.
> It then turns control over to the processes main().  That is allowed to
> run until it hits the
>   MPI_Init(&argc, &argv);
>
> The process is then totally under the users control until:
>   MPI_Send(intmessage, m_size, MPI_INT, my_rank+half, tag, 
> MPI_COMM_WORLD);
>   MPI_Recv(intmessage, m_size, MPI_INT, my_rank+half,tag, MPI_COMM_WORLD,
> &status);
>
> That is it.  That is all our allowed interaction with the users process.

OK, when you said something along the lines of "the MPT library has
control of the comm buffer", then I assumed it was an area of virtual
memory which is set up as part of initialization, rather than during
runtime. I guess I jumped to conclusions.


> That doesn't seem too unreasonable, except when you compare it to how the
> driver currently works.  Remember, this is done from a library which has
> no insight into what the user has done to its own virtual address space.
> As a result, each MPI_Send() would result in a system call (or we would
> need to have a set of callouts for changes to a processes VMAs) which
> would be a significant increase in communication overhead.
>
> Maybe I am missing what you intend to do, but what we need is a means of
> tracking one processes virtual address space changes so other processes
> can do direct memory accesses without the need for a system call on each
> communication event.

Yeah it's tricky. BTW, what is the performance difference between
having a system call or not?


> > Because you don't need to swap, you don't need coherency, and you
> > are in control of the areas, then this seems like the best choice.
> > It would allow you to use heap, stack, file-backed, anything.
>
> You are missing one point here.  The MPI specifications that have
> been out there for decades do not require the process use a library
> for allocating the buffer.  I realize that is a horrible shortcoming,
> but that is the world we live in.  Even if we could change that spec,

Can you change the spec? Are you working on it?


> we would still need to support the existing specs.  As a result, the
> user can change their virtual address space as they need and still expect
> communications be cheap.

That's true. How has it been supported up to now? Are you using
this kind of notifier in patched kernels?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Nick Piggin
On Tuesday 26 February 2008 18:59, Jamie Lokier wrote:
> Andrew Morton wrote:
> > On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier <[EMAIL PROTECTED]> 
wrote:
> > > (It would be nicer if sync_file_range()
> > > took a vector of ranges for better elevator scheduling, but let's
> > > ignore that :-)
> >
> > Two passes:
> >
> > Pass 1: shove each of the segments into the queue with
> > SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
> >
> > Pass 2: wait for them all to complete and return accumulated result
> > with SYNC_FILE_RANGE_WAIT_AFTER
>
> Thanks.
>
> Seems ok, though being able to cork the I/O until the last one would
> be a bonus (like TCP_MORE...  SYNC_FILE_RANGE_MORE?)
>
> I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE.  Is there a
> reason why you have it there?  The man page isn't very enlightening.


Yeah, sync_file_range has slightly unusual semantics and introduces a
new concept, "writeout", to userspace (does "writeout" include
"in drive cache"? the kernel doesn't think so, but the only way to
make sync_file_range "safe" is if you do consider it writeout).

If it makes it any easier to understand, we can add in
SYNC_FILE_ASYNC and SYNC_FILE_SYNC parts that just deal with the
safe/unsafe and sync/async semantics that are part of the normal
POSIX API.

Anyway, making fsync/fdatasync etc. safe by default is a good idea
IMO, and it is a bad bug that we don't do that :(
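
For reference, the two-pass scheme Andrew describes looks something
like this from userspace (sketch; and note that, as above, it still
says nothing about the drive's write cache):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

struct range {
	off_t offset;
	off_t nbytes;
};

static int sync_ranges(int fd, const struct range *r, size_t n)
{
	size_t i;
	int ret = 0;

	/* Pass 1: start writeout for every segment. */
	for (i = 0; i < n; i++)
		if (sync_file_range(fd, r[i].offset, r[i].nbytes,
				    SYNC_FILE_RANGE_WAIT_BEFORE |
				    SYNC_FILE_RANGE_WRITE))
			ret = -1;

	/* Pass 2: wait for all of them to complete. */
	for (i = 0; i < n; i++)
		if (sync_file_range(fd, r[i].offset, r[i].nbytes,
				    SYNC_FILE_RANGE_WAIT_AFTER))
			ret = -1;

	return ret;
}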

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ofa-general] Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-26 Thread Nick Piggin
On Tuesday 26 February 2008 18:21, Gleb Natapov wrote:
> On Tue, Feb 26, 2008 at 05:11:32PM +1100, Nick Piggin wrote:
> > > You are missing one point here.  The MPI specifications that have
> > > been out there for decades do not require the process use a library
> > > for allocating the buffer.  I realize that is a horrible shortcoming,
> > > but that is the world we live in.  Even if we could change that spec,
> >
> > Can you change the spec?
>
> Not really. It will break all existing codes.

I meant as in eg. submit changes to MPI-3


> MPI-2 provides a call for 
> memory allocation (and it's beneficial to use this call for some
> interconnects), but many (most?) applications are still written for MPI-1
> and those that are written for MPI-2 mostly uses the old habit of
> allocating memory by malloc(), or even use stack or BSS memory for
> communication buffer purposes.

OK, so MPI-2 already has some way to do that... I'm not saying that we
can now completely dismiss the idea of using notifiers for this, but it
is just a good data point to know.

Thanks,
Nick

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: oops when using git gc --auto

2008-02-26 Thread Nick Piggin
On Wednesday 27 February 2008 00:22, Otavio Salvador wrote:
> Hello,
>
> Today I got this oops, someone has an idea of what's going wrong?
>
> Unable to handle kernel paging request at 0200 RIP:
>  [] find_get_pages+0x3c/0x69

At this point, the most likely candidate is a memory corruption
error, probably hardware. Can you run memtest86 for a few hours
to get a bit more confidence in the hw (preferably overnight)?

I did recently see another quite similar corruption in the
pagecache radix-tree, though. Coincidence maybe?

> PGD 0
> Oops:  [1] SMP
> CPU 3
> Modules linked in: sha256_generic aes_generic aes_x86_64 cbc blkcipher
> nvidia(P) rfcomm l2cap bluetooth ac battery ipv6 nfs lockd nfs_acl sunrpc
> bridge ext2 mbcache dm_crypt tun kvm_intel kvm loop snd_usb_audio
> snd_usb_lib snd_rawmidi snd_hda_intel e1000e i2c_i801 serio_raw
> snd_seq_device snd_pcm intel_agp button snd_timer pcspkr psmouse snd_hwdep
> snd snd_page_alloc soundcore evdev i2c_core xfs dm_mirror dm_snapshot
> dm_mod raid0 md_mod sg sr_mod cdrom sd_mod usbhid hid usb_storage
> pata_marvell floppy ahci ata_generic libata scsi_mod ehci_hcd uhci_hcd
> thermal processor fan Pid: 15684, comm: git Tainted: P   
> 2.6.24-1-amd64 #1
> RIP: 0010:[]  []
> find_get_pages+0x3c/0x69 RSP: 0018:8100394dfd98  EFLAGS: 00010097
> RAX: 0009 RBX: 000e RCX: 0009
> RDX: 0200 RSI: 000a RDI: 0040
> RBP: 810042964350 R08: 0040 R09: 000a
> R10: 8100425a06c8 R11: 000a R12: 000e
> R13: 8100394dfdf8 R14: 810042964350 R15: 
> FS:  2ae326df2190() GS:81007d7aeb40()
> knlGS: CS:  0010 DS:  ES:  CR0: 8005003b
> CR2: 0200 CR3: 358f9000 CR4: 26e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0400
> Process git (pid: 15684, threadinfo 8100394de000, task
> 8100359cd800) Stack:  000d 8100394dfde8
> 000d 000e 000e 802794d6
> 8100014a7768 80279b04  
>   Call Trace:
>  [] pagevec_lookup+0x17/0x1e
>  [] truncate_inode_pages_range+0x108/0x2bd
>  [] generic_delete_inode+0xbf/0x127
>  [] do_unlinkat+0xd5/0x144
>  [] sys_write+0x45/0x6e
>  [] system_call+0x7e/0x83
>
>
> Code: 48 8b 02 25 00 40 02 00 48 3d 00 40 02 00 75 04 48 8b 52 10
> RIP  [] find_get_pages+0x3c/0x69
>  RSP 
> CR2: 0200
> ---[ end trace cb43a9f4488b815a ]---

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: unfair stress on non memory allocating apps while swapout (in 2.4)

2000-10-21 Thread Nick Piggin

> why are programs which do not allocate memory be delayed while one
> program is eating up all memory. This clearly means they are not delayed in
> the malloc call but simply the kernel will not schedule them while he is bussy
> to page out processes.

Bernd,
The reason why programs not allocating memory start waiting when the system
starts swapping is because they get some of their working set paged out,
which must be loaded into physical memory when they next use it. By working
set I mean the code and data they will use in their next timeslice.

Nick

PS this is my first post to lkml so please keep that in mind...
PPS ... so, was I right?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Topic for discussion: OS Design

2000-10-22 Thread Nick Piggin

> So what we really need to do is get some custom "RAM blitter" into our
> hardware to do the memory copies needed for fast context switching and 
> message passing.

don't you think you should quit while you're behind?

> Too bad nobody on this list works at an electronics design company... ;-P

you would probably be surprised.

Nick.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Nick Piggin

> I'm trying to write a server that handles 1 clients.  On 2.4.x,
> the RT signal queue stuff looks like the way to achieve that.

I would suggest you try multiple polling threads. Not only will you get
better SMP scalability, but with, say, 16 threads each one only has to
handle ~600 fds.
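
Something like this is what I mean (sketch):

#include <poll.h>
#include <pthread.h>

#define NTHREADS 16

struct poller {
	struct pollfd *fds;	/* this thread's share of the descriptors */
	int nfds;
};

static void *poll_loop(void *arg)
{
	struct poller *p = arg;

	for (;;) {
		int n = poll(p->fds, p->nfds, -1);
		if (n < 0)
			break;
		/* ... service the ready fds in p->fds ... */
	}
	return NULL;
}

static void start_pollers(struct poller pollers[NTHREADS])
{
	pthread_t tid;
	int i;

	/* Each thread scans only its own ~1/NTHREADS share of the fds,
	 * and the threads can run on different CPUs. */
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid, NULL, poll_loop, &pollers[i]);
}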

Nick.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



2.4.0-test9 Oopses

2000-10-25 Thread Nick Piggin

Did the following with 2.4.0-test9 + reiserfs 3.6.18 (all ext2 filesystem,
however) and all ide block devices.

scsi0 : SCSI host adapter emulation for IDE ATAPI devices
Vendor: RICOH Model: CD-R/RW MP7060A   Rev: 1.50
Type:   CD-ROM ANSI SCSI revision: 02
Vendor: ATAPI Model: CD-ROM DRIVE-24X  Rev: U40M
Type:   CD-ROM ANSI SCSI revision: 02
Detected scsi CD-ROM sr0 at scsi0, channel 0, id 0, lun 0
Detected scsi CD-ROM sr1 at scsi0, channel 0, id 1, lun 0
sr0: scsi3-mmc drive: 24x/24x writer cd/rw xa/form2 cdda tray
sr1: scsi3-mmc drive: 20x/20x xa/form2 cdda tray
scsi : 0 hosts left.

(loaded ide-scsi modules as you can see)
After trying to access the /proc/scsi directory I got this oops:

Unable to handle kernel paging request at virtual address c4858010
 printing eip:
c01461ac
Oops: 0002
CPU:0
EIP:0010:[proc_get_inode+156/288]
EFLAGS: 00010286
eax: c4858000   ebx: c39390c0   ecx: c21835c8   edx: 0023
esi: c2183540   edi: c3939114   ebp: c21830c0   esp: c3ef3eec
ds: 0018   es: 0018   ss: 0018
Process bash (pid: 407, stackpage=c3ef3000)
Stack: c0f04460 c0f044c8 c0147d09 c111c800 115d c39390c0 fff4
c0f04460
   c21830c0 c0f042e0 ffea c01378df c21830c0 c0f04460 c3ef3f68

   c0f042e0 c3ef3fa4 c0138009 c0f042e0 c3ef3f68  c1fb1000

Call Trace: [proc_lookup+121/160] [real_lookup+79/192] [path_walk+1369/1952]
[][__user_walk+60/96] [sys_newstat+22/112] [sys_close+72/96]
[system_call+51/56]
Code: ff 40 10 8b 43 24 80 48 14 18 0f b7 43 08 25 00 f0 ff ff 66

After rebooting, loading the same modules, then doing cdrecord -scanbus I
got these two oopses

Unable to handle kernel NULL pointer dereference at virtual address 
 printing eip:
c013a551
Oops: 
CPU:0
EIP:0010:[vfs_follow_link+33/368]
EFLAGS: 00010217
eax:    ebx: c3c5bf90   ecx: 0341   edx: c02b6040
esi:    edi:    ebp:    esp: c3c5befc
ds: 0018   es: 0018   ss: 0018
Process devfsd (pid: 12, stackpage=c3c5b000)
Stack: c3c5bf90  c0a01e20 c3c5bf90  c01547cf c3c5bf90

   c3c5a000 c0138143 c0a01e20 c3c5bf90 c09c0b40 c3d69000 
c3c5bf90
   bfffecdc 0001 bfffecdc c3c5bf94 0009 c0a01e20 c3d69005
0003
Call Trace: [devfs_follow_link+31/48] [path_walk+1683/1952]
[__user_walk+60/96] [sys_chown+22/80] [sys_chown16+48/64]
[system_call+51/56]
Code: 80 3f 2f 0f 85 c6 00 00 00 53 e8 90 d2 ff ff ba 00 e0 ff ff

Unable to handle kernel NULL pointer dereference at virtual address 
 printing eip:
c013a551
Oops: 
CPU:0
EIP:0010:[vfs_follow_link+33/368]
EFLAGS: 00010217
eax:    ebx: c081df80   ecx: 0341   edx: c02b6040
esi: c0a01b20   edi:    ebp:    esp: c081ded0
ds: 0018   es: 0018   ss: 0018
Process cdrecord (pid: 758, stackpage=c081d000)
Stack: c081df80 c0a01b20 c0a01b20 c081df80  c01547cf c081df80

   c081c000 c0138143 c0a01b20 c081df80 c09c0b40 c117d000 
0002
   0003 0001 08074094 c081df84 0001 c0a01b20 c117d005
0003
Call Trace: [devfs_follow_link+31/48] [path_walk+1683/1952]
[open_namei+128/1504] [filp_open+59/96] [sys_open+67/208]
[system_call+51/56]
Code: 80 3f 2f 0f 85 c6 00 00 00 53 e8 90 d2 ff ff ba 00 e0 ff ff



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



[patch] BSD process accounting: new locking

2000-10-29 Thread Nick Piggin

I have attached a very small patch (test9) to remove the kernel lock from
kernel/acct.c. If I am missing something major (a brain?), I apologise in
advance. I have tested this on my UP x86 with spinlock debugging. I would
appreciate comments or an explanation of why this can't be done if you have
the time. Thanks.

Nick

 bsdacct.patch


Re: [patch] BSD process accounting: new locking

2000-10-29 Thread Nick Piggin

> I have attached a very small patch (test9) to remove the kernel lock from
> kernel/acct.c. If I am missing something major (a brain?), I apologise in
> advance. I have tested this on my UP x86 with spinlock debugging. I would
> appreciate comments or an explanation of why this can't be done if you
have
> the time. Thanks.
>
> Nick
>

Maybe there was a possibility of a race when acct_auto_close calls sys_acct
in the last patch. If so, this should fix it.

Nick.

 bsdacct2.patch


test9 oops (in block_read_full_page)

2000-10-29 Thread Nick Piggin

I apologise if this oops has already been fixed: it has happened twice, but I
can't find the exact way to trigger it; I just want to make sure it is
reported ;)

Nick

 oops


Re: 2.4.0-test9 Oopses

2000-11-02 Thread Nick Piggin

Just a note that this oops still occurs in test10. The problem occurs
because get_devfs_entry_from_vfs_inode in devfs_follow_link (and/or
devfs_read_link) seems to return invalid or incorrect devfs entries whose
.u.symlink.linkname is NULL, which causes the line:
if (*link == '/') {
in fs/namei.c:__vfs_follow_link to oops.
The oops is due to trying to follow an sg? link in /dev.
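
(For illustration only, not a proposed fix: a defensive check in the
generic code would look like the below; the real bug is devfs handing
back an entry with no target.)

/* illustrative wrapper around fs/namei.c:vfs_follow_link() */
static inline int vfs_follow_link_checked(struct nameidata *nd,
					  const char *link)
{
	if (!link)
		return -EIO;	/* bad devfs entry: no linkname at all */
	return vfs_follow_link(nd, link);
}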

Nick.

----- Original Message -----
From: "Nick Piggin" <[EMAIL PROTECTED]>
To: "Linux-Kernel" <[EMAIL PROTECTED]>
Sent: Wednesday, October 25, 2000 9:16 PM
Subject: 2.4.0-test9 Oopses


> Did the following with 2.4.0-test9 + reiserfs 3.6.18 (all ext2 filesystem,
> however) and all ide block devices.
>
> scsi0 : SCSI host adapter emulation for IDE ATAPI devices
> Vendor: RICOH Model: CD-R/RW MP7060A   Rev: 1.50
> Type:   CD-ROM ANSI SCSI revision: 02
> Vendor: ATAPI Model: CD-ROM DRIVE-24X  Rev: U40M
> Type:   CD-ROM ANSI SCSI revision: 02
> Detected scsi CD-ROM sr0 at scsi0, channel 0, id 0, lun 0
> Detected scsi CD-ROM sr1 at scsi0, channel 0, id 1, lun 0
> sr0: scsi3-mmc drive: 24x/24x writer cd/rw xa/form2 cdda tray
> sr1: scsi3-mmc drive: 20x/20x xa/form2 cdda tray
> scsi : 0 hosts left.
>
> (loaded ide-scsi modules as you can see)
[snip]
> Doing cdrecord -scanbus I
> got these two oopses
>
> Unable to handle kernel NULL pointer dereference at virtual address

>  printing eip:
> c013a551
> Oops: 
> CPU:0
> EIP:0010:[vfs_follow_link+33/368]
> EFLAGS: 00010217
> eax:    ebx: c3c5bf90   ecx: 0341   edx: c02b6040
> esi:    edi:    ebp:    esp: c3c5befc
> ds: 0018   es: 0018   ss: 0018
> Process devfsd (pid: 12, stackpage=c3c5b000)
> Stack: c3c5bf90  c0a01e20 c3c5bf90  c01547cf c3c5bf90
> 
>c3c5a000 c0138143 c0a01e20 c3c5bf90 c09c0b40 c3d69000 
> c3c5bf90
>bfffecdc 0001 bfffecdc c3c5bf94 0009 c0a01e20 c3d69005
> 0003
> Call Trace: [devfs_follow_link+31/48] [path_walk+1683/1952]
> [__user_walk+60/96] [sys_chown+22/80] [sys_chown16+48/64]
> [system_call+51/56]
> Code: 80 3f 2f 0f 85 c6 00 00 00 53 e8 90 d2 ff ff ba 00 e0 ff ff
>
> Unable to handle kernel NULL pointer dereference at virtual address

>  printing eip:
> c013a551
> Oops: 
> CPU:0
> EIP:0010:[vfs_follow_link+33/368]
> EFLAGS: 00010217
> eax:    ebx: c081df80   ecx: 0341   edx: c02b6040
> esi: c0a01b20   edi:    ebp:    esp: c081ded0
> ds: 0018   es: 0018   ss: 0018
> Process cdrecord (pid: 758, stackpage=c081d000)
> Stack: c081df80 c0a01b20 c0a01b20 c081df80  c01547cf c081df80
> 
>c081c000 c0138143 c0a01b20 c081df80 c09c0b40 c117d000 
> 0002
>0003 0001 08074094 c081df84 0001 c0a01b20 c117d005
> 0003
> Call Trace: [devfs_follow_link+31/48] [path_walk+1683/1952]
> [open_namei+128/1504] [filp_open+59/96] [sys_open+67/208]
> [system_call+51/56]
> Code: 80 3f 2f 0f 85 c6 00 00 00 53 e8 90 d2 ff ff ba 00 e0 ff ff
>
>
>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



2.4.0-test10 oopses (bug in devfs)

2000-11-04 Thread Nick Piggin

Maybe this got ignored because the subject was test9 oops when test 10 had
been released, or people tend to ignore .edu addresses...

> Just  a note that this oops still occurs in test10. The problem occurs
> because get_devfs_entry_from_vfs_inode in devfs_follow_link (and/or
> devfs_read_link), seems to return invalid or incorrect devfs entries whose
> .u.symlink.linkname is null which causes the line:
> if (*link == '/') {
> in fs/namei.c: __vfs_follow_link to oops.
>
> The oops is due to trying to follow an sg? link in /dev.
>
> Nick.
>
> ----- Original Message -----
> From: "Nick Piggin" <[EMAIL PROTECTED]>
> To: "Linux-Kernel" <[EMAIL PROTECTED]>
> Sent: Wednesday, October 25, 2000 9:16 PM
> Subject: 2.4.0-test9 Oopses
>
>
> > Did the following with 2.4.0-test9 + reiserfs 3.6.18 (all ext2
filesystem,
> > however) and all ide block devices.
> >
> > scsi0 : SCSI host adapter emulation for IDE ATAPI devices
> > Vendor: RICOH Model: CD-R/RW MP7060A   Rev: 1.50
> > Type:   CD-ROM ANSI SCSI revision: 02
> > Vendor: ATAPI Model: CD-ROM DRIVE-24X  Rev: U40M
> > Type:   CD-ROM ANSI SCSI revision: 02
> > Detected scsi CD-ROM sr0 at scsi0, channel 0, id 0, lun 0
> > Detected scsi CD-ROM sr1 at scsi0, channel 0, id 1, lun 0
> > sr0: scsi3-mmc drive: 24x/24x writer cd/rw xa/form2 cdda tray
> > sr1: scsi3-mmc drive: 20x/20x xa/form2 cdda tray
> > scsi : 0 hosts left.
> >
> > (loaded ide-scsi modules as you can see)
> [snip]
> > Doing cdrecord -scanbus I
> > got these two oopses
> >
> > Unable to handle kernel NULL pointer dereference at virtual address
> 
> >  printing eip:
> > c013a551
> > Oops: 
> > CPU:0
> > EIP:0010:[vfs_follow_link+33/368]
> > EFLAGS: 00010217
> > eax:    ebx: c3c5bf90   ecx: 0341   edx: c02b6040
> > esi:    edi:    ebp:    esp: c3c5befc
> > ds: 0018   es: 0018   ss: 0018
> > Process devfsd (pid: 12, stackpage=c3c5b000)
> > Stack: c3c5bf90  c0a01e20 c3c5bf90  c01547cf c3c5bf90
> > 
> >c3c5a000 c0138143 c0a01e20 c3c5bf90 c09c0b40 c3d69000 
> > c3c5bf90
> >bfffecdc 0001 bfffecdc c3c5bf94 0009 c0a01e20 c3d69005
> > 0003
> > Call Trace: [devfs_follow_link+31/48] [path_walk+1683/1952]
> > [__user_walk+60/96] [sys_chown+22/80] [sys_chown16+48/64]
> > [system_call+51/56]
> > Code: 80 3f 2f 0f 85 c6 00 00 00 53 e8 90 d2 ff ff ba 00 e0 ff ff
> >
> > Unable to handle kernel NULL pointer dereference at virtual address
> 
> >  printing eip:
> > c013a551
> > Oops: 
> > CPU:0
> > EIP:0010:[vfs_follow_link+33/368]
> > EFLAGS: 00010217
> > eax:    ebx: c081df80   ecx: 0341   edx: c02b6040
> > esi: c0a01b20   edi:    ebp:    esp: c081ded0
> > ds: 0018   es: 0018   ss: 0018
> > Process cdrecord (pid: 758, stackpage=c081d000)
> > Stack: c081df80 c0a01b20 c0a01b20 c081df80  c01547cf c081df80
> > 
> >c081c000 c0138143 c0a01b20 c081df80 c09c0b40 c117d000 
> > 0002
> >0003 0001 08074094 c081df84 0001 c0a01b20 c117d005
> > 0003
> > Call Trace: [devfs_follow_link+31/48] [path_walk+1683/1952]
> > [open_namei+128/1504] [filp_open+59/96] [sys_open+67/208]
> > [system_call+51/56]
> > Code: 80 3f 2f 0f 85 c6 00 00 00 53 e8 90 d2 ff ff ba 00 e0 ff ff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



bkl usage

2000-11-12 Thread Nick Piggin

Hi. In my efforts to understand the linux kernel v2.4 I found the bkl being
used in kernel/acct.c to lock seemingly local data. Would someone please
explain what races this prevents vs. say:


--- linux/kernel/acct.c Mon Oct 30 01:02:56 2000
+++ linux-2.4.0-test10/kernel/acct.c Mon Oct 30 01:10:20 2000
@@ -41,6 +41,10 @@
  * Oh, fsck... Oopsable SMP race in do_process_acct() - we must hold
  * ->mmap_sem to walk the vma list of current->mm. Nasty, since it leaks
  * a struct file opened for write. Fixed. 2/6/2000, AV.
+ *
+ *  2000-10-27 Modified by Nick <[EMAIL PROTECTED]> to remove usage
+ *  of the big kernel lock in favour of a local spinlock
+ *
  */

 #include 
@@ -77,6 +81,7 @@
 static struct file *acct_file;
 static struct timer_list acct_timer;
 static void do_acct_process(long, struct file *);
+static spinlock_t acct_lock = SPIN_LOCK_UNLOCKED;

 /*
  * Called whenever the timer says to check the free space.
@@ -95,11 +100,11 @@
  int res;
  int act;

- lock_kernel();
+spin_lock(&acct_lock);
  res = acct_active;
  if (!file || !acct_needcheck)
-  goto out;
- unlock_kernel();
+  goto out_unlock;
+spin_unlock(&acct_lock);

  /* May block */
  if (vfs_statfs(file->f_dentry->d_inode->i_sb, &sbuf))
@@ -113,14 +118,14 @@
   act = 0;

  /*
-  * If some joker switched acct_file under us we'ld better be
+  * If some joker switched acct_file under us we'd better be
   * silent and _not_ touch anything.
   */
- lock_kernel();
+spin_lock(&acct_lock);
  if (file != acct_file) {
   if (act)
res = act>0;
-  goto out;
+  goto out_unlock;
  }

  if (acct_active) {
@@ -140,8 +145,8 @@
  acct_timer.expires = jiffies + ACCT_TIMEOUT*HZ;
  add_timer(&acct_timer);
  res = acct_active;
-out:
- unlock_kernel();
+out_unlock:
+spin_unlock(&acct_lock);
  return res;
 }

@@ -182,7 +187,7 @@
  }

  error = 0;
- lock_kernel();
+spin_lock(&acct_lock);
  if (acct_file) {
   old_acct = acct_file;
   del_timer(&acct_timer);
@@ -200,7 +205,7 @@
   acct_timer.expires = jiffies + ACCT_TIMEOUT*HZ;
   add_timer(&acct_timer);
  }
- unlock_kernel();
+spin_unlock(&acct_lock);
  if (old_acct) {
   do_acct_process(0,old_acct);
   filp_close(old_acct, NULL);
@@ -214,10 +219,24 @@

 void acct_auto_close(kdev_t dev)
 {
- lock_kernel();
- if (acct_file && acct_file->f_dentry->d_inode->i_dev == dev)
-  sys_acct(NULL);
- unlock_kernel();
+struct file *old_acct;
+
+spin_lock(&acct_lock);
+ if (acct_file && acct_file->f_dentry->d_inode->i_dev == dev) {
+
+/* Run the same code as sys_acct(NULL) here. This simplifies locking */
+old_acct = acct_file;
+del_timer(&acct_timer);
+acct_active = 0;
+acct_needcheck = 0;
+acct_file = NULL;
+
+spin_unlock(&acct_lock);
+
+do_acct_process(0, old_acct);
+filp_close(old_acct, NULL);
+} else
+spin_unlock(&acct_lock);
 }

 /*
@@ -348,15 +367,15 @@
 int acct_process(long exitcode)
 {
  struct file *file = NULL;
- lock_kernel();
+spin_lock(&acct_lock);
  if (acct_file) {
   file = acct_file;
   get_file(file);
-  unlock_kernel();
+spin_unlock(&acct_lock);
   do_acct_process(exitcode, acct_file);
   fput(file);
  } else
-  unlock_kernel();
+spin_unlock(&acct_lock);
  return 0;
 }


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.6.12-rc2-mm1

2005-04-05 Thread Nick Piggin
Andrew Morton wrote:
> +sched-remove-unnecessary-sched-domains.patch
> +sched-improve-pinned-task-handling-again.patch
[snip]
>  CPU scheduler updates
It is no problem that you picked these up for testing. But
don't merge them yet, please.
Suresh's underlying problem with the unnecessary sched domains
is a failing of sched-balance-exec and sched-balance-fork, which
I am working on now.
Removing unnecessary domains is a nice optimisation, but just
needs to account for a few more flags before declaring that a
domain is unnecessary (not to mention this probably breaks if
isolcpus= is used). I have made some modifications to the patch
to fix these problems.
Lastly, I'd like to be a bit less intrusive with pinned task
handling improvements. I think we can do this while still being
effective in preventing livelocks.
I will keep you posted with regards to the various scheduler
patches.
--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.12-rc2-mm1

2005-04-05 Thread Nick Piggin
Siddha, Suresh B wrote:
On Tue, Apr 05, 2005 at 05:33:49PM +1000, Nick Piggin wrote:

Suresh's underlying problem with the unnecessary sched domains
is a failing of sched-balance-exec and sched-balance-fork, which

That wasn't the only motivation. For example, on non-HT cpu's we shouldn't
be setting up SMT sched-domain, same with NUMA domains on non-NUMA systems.
Yep, sure. It is a good, if slight, optimisation. And I've also just
slightly extended your patch, so we don't have any domains if booting
with maxcpus=1

I am working on now.
Removing unnecessary domains is a nice optimisation, but just
needs to account for a few more flags before declaring that a

Can you elaborate when we require a domain with special flags but has
no or only one group in it.
The SD_WAKE_* flags do not use groups, so it would be legitimate to
have a domain that has one of these set, with no groups.

domain is unnecessary (not to mention this probably breaks if
isolcpus= is used). I have made some modifications to the patch

I have tested my patch with "isolcpus=" and it works just fine.
OK, my apologies ;)

to fix these problems.
Lastly, I'd like to be a bit less intrusive with pinned task
handling improvements. I think we can do this while still being
effective in preventing livelocks.

We want to see this fixed. Please post your patch and I can let you know
the test results.
I will try to get it working and tested tonight for you.

I will keep you posted with regards to the various scheduler
patches.

Nick, Can you post the patches you sent me earlier to this list?
Yep, I'll post them.
--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 1/5] sched: remove degenerate domains

2005-04-05 Thread Nick Piggin
This is Suresh's patch with some modifications.
--
SUSE Labs, Novell Inc.
Remove degenerate scheduler domains during the sched-domain init.

For example on x86_64, we always have NUMA configured in. On Intel EM64T
systems, top most sched domain will be of NUMA and with only one sched_group in
it. 

With fork/exec balances (recent Nick's fixes in the -mm tree), we always end up
taking wrong decisions because of this topmost domain (as it contains only
one group and find_idlest_group always returns NULL). We will end up loading the
HT package completely first, letting active load balance kick in and correct it.

In general, this patch also makes sense without Nick's recent fixes
in -mm.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>

Modified to account for more than just sched_groups when scanning for
degenerate domains by Nick Piggin. Allow a runqueue's sd to go NULL, which
required small changes to the smtnice code.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>


Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c   2005-04-05 16:38:21.0 +1000
+++ linux-2.6/kernel/sched.c2005-04-05 18:39:09.0 +1000
@@ -2583,11 +2583,15 @@ out:
 #ifdef CONFIG_SCHED_SMT
 static inline void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq)
 {
-   struct sched_domain *sd = this_rq->sd;
+   struct sched_domain *tmp, *sd = NULL;
cpumask_t sibling_map;
int i;
+   
+   for_each_domain(this_cpu, tmp)
+   if (tmp->flags & SD_SHARE_CPUPOWER)
+   sd = tmp;
 
-   if (!(sd->flags & SD_SHARE_CPUPOWER))
+   if (!sd)
return;
 
/*
@@ -2628,13 +2632,17 @@ static inline void wake_sleeping_depende
 
 static inline int dependent_sleeper(int this_cpu, runqueue_t *this_rq)
 {
-   struct sched_domain *sd = this_rq->sd;
+   struct sched_domain *tmp, *sd = NULL;
cpumask_t sibling_map;
prio_array_t *array;
int ret = 0, i;
task_t *p;
 
-   if (!(sd->flags & SD_SHARE_CPUPOWER))
+   for_each_domain(this_cpu, tmp)
+   if (tmp->flags & SD_SHARE_CPUPOWER)
+   sd = tmp;
+
+   if (!sd)
return 0;
 
/*
@@ -4604,6 +4612,11 @@ static void sched_domain_debug(struct sc
 {
int level = 0;
 
+   if (!sd) {
+   printk(KERN_DEBUG "CPU%d attaching NULL sched-domain.\n", cpu);
+   return;
+   }
+   
printk(KERN_DEBUG "CPU%d attaching sched-domain:\n", cpu);
 
do {
@@ -4809,6 +4822,50 @@ static void init_sched_domain_sysctl(voi
 }
 #endif
 
+static int __devinit sd_degenerate(struct sched_domain *sd)
+{
+   if (cpus_weight(sd->span) == 1)
+   return 1;
+
+   /* Following flags need at least 2 groups */
+   if (sd->flags & (SD_LOAD_BALANCE |
+SD_BALANCE_NEWIDLE |
+SD_BALANCE_FORK |
+SD_BALANCE_EXEC)) {
+   if (sd->groups != sd->groups->next)
+   return 0;
+   }
+
+   /* Following flags don't use groups */
+   if (sd->flags & (SD_WAKE_IDLE |
+SD_WAKE_AFFINE |
+SD_WAKE_BALANCE))
+   return 0;
+
+   return 1;
+}
+
+static int __devinit sd_parent_degenerate(struct sched_domain *sd,
+   struct sched_domain *parent)
+{
+   unsigned long cflags = sd->flags, pflags = parent->flags;
+
+   if (sd_degenerate(parent))
+   return 1;
+
+   if (!cpus_equal(sd->span, parent->span))
+   return 0;
+
+   /* Does parent contain flags not in child? */
+   /* WAKE_BALANCE is a subset of WAKE_AFFINE */
+   if (cflags & SD_WAKE_AFFINE)
+   pflags &= ~SD_WAKE_BALANCE;
+   if ((~sd->flags) & parent->flags)
+   return 0;
+
+   return 1;
+}
+
 /*
  * Attach the domain 'sd' to 'cpu' as its base domain.  Callers must
  * hold the hotplug lock.
@@ -4819,6 +4876,19 @@ void __devinit cpu_attach_domain(struct 
unsigned long flags;
runqueue_t *rq = cpu_rq(cpu);
int local = 1;
+   struct sched_domain *tmp;
+
+   /* Remove the sched domains which do not contribute to scheduling. */
+   for (tmp = sd; tmp; tmp = tmp->parent) {
+   struct sched_domain *parent = tmp->parent;
+   if (!parent)
+   break;
+   if (sd_parent_degenerate(tmp, parent))
+   tmp->parent = parent->parent;
+   }
+
+   if (sd_degenerate(sd))
+   sd = sd->parent;
 
sched_domain_debug(sd, cpu);
 


[patch 2/5] sched: NULL domains

2005-04-05 Thread Nick Piggin
2/5
The previous patch fixed the last 2 places that directly access a
runqueue's sched-domain and assume it cannot be NULL.

We can now use a NULL domain instead of a dummy domain to signify
no balancing is to happen. No functional changes.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c   2005-04-05 16:38:40.0 +1000
+++ linux-2.6/kernel/sched.c2005-04-05 18:39:08.0 +1000
@@ -4887,7 +4887,7 @@ void __devinit cpu_attach_domain(struct 
tmp->parent = parent->parent;
}
 
-   if (sd_degenerate(sd))
+   if (sd && sd_degenerate(sd))
sd = sd->parent;
 
sched_domain_debug(sd, cpu);
@@ -5054,7 +5054,7 @@ static void __devinit arch_init_sched_do
cpus_and(cpu_default_map, cpu_default_map, cpu_online_map);
 
/*
-* Set up domains. Isolated domains just stay on the dummy domain.
+* Set up domains. Isolated domains just stay on the NULL domain.
 */
for_each_cpu_mask(i, cpu_default_map) {
int group;
@@ -5167,18 +5167,11 @@ static void __devinit arch_destroy_sched
 
 #endif /* ARCH_HAS_SCHED_DOMAIN */
 
-/*
- * Initial dummy domain for early boot and for hotplug cpu. Being static,
- * it is initialized to zero, so all balancing flags are cleared which is
- * what we want.
- */
-static struct sched_domain sched_domain_dummy;
-
 #ifdef CONFIG_HOTPLUG_CPU
 /*
  * Force a reinitialization of the sched domains hierarchy.  The domains
  * and groups cannot be updated in place without racing with the balancing
- * code, so we temporarily attach all running cpus to a "dummy" domain
+ * code, so we temporarily attach all running cpus to the NULL domain
  * which will prevent rebalancing while the sched domains are recalculated.
  */
 static int update_sched_domains(struct notifier_block *nfb,
@@ -5190,7 +5183,7 @@ static int update_sched_domains(struct n
case CPU_UP_PREPARE:
case CPU_DOWN_PREPARE:
for_each_online_cpu(i)
-   cpu_attach_domain(&sched_domain_dummy, i);
+   cpu_attach_domain(NULL, i);
arch_destroy_sched_domains();
return NOTIFY_OK;
 
@@ -5253,7 +5246,7 @@ void __init sched_init(void)
rq->best_expired_prio = MAX_PRIO;
 
 #ifdef CONFIG_SMP
-   rq->sd = &sched_domain_dummy;
+   rq->sd = NULL;
for (j = 1; j < 3; j++)
rq->cpu_load[j] = 0;
rq->active_balance = 0;


[patch 3/5] sched: multilevel sbe and sbf

2005-04-05 Thread Nick Piggin
3/5
The fundamental problem that Suresh has with balance on exec and fork
is that it only tries to balance the top level domain with the flag
set.

This was worked around by removing degenerate domains, but is still a
problem if people want to start using more complex sched-domains, especially
multilevel NUMA that ia64 is already using.

This patch makes balance on fork and exec try balancing over not just the
top most domain with the flag set, but all the way down the domain tree.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c   2005-04-05 16:38:53.0 +1000
+++ linux-2.6/kernel/sched.c2005-04-05 18:39:07.0 +1000
@@ -1320,21 +1320,24 @@ void fastcall wake_up_new_task(task_t * 
sd = tmp;
 
if (sd) {
+   cpumask_t span;
int new_cpu;
struct sched_group *group;
 
+again:
schedstat_inc(sd, sbf_cnt);
+   span = sd->span;
cpu = task_cpu(p);
group = find_idlest_group(sd, p, cpu);
if (!group) {
schedstat_inc(sd, sbf_balanced);
-   goto no_forkbalance;
+   goto nextlevel;
}
 
new_cpu = find_idlest_cpu(group, cpu);
if (new_cpu == -1 || new_cpu == cpu) {
schedstat_inc(sd, sbf_balanced);
-   goto no_forkbalance;
+   goto nextlevel;
}
 
if (cpu_isset(new_cpu, p->cpus_allowed)) {
@@ -1344,9 +1347,21 @@ void fastcall wake_up_new_task(task_t * 
rq = task_rq_lock(p, &flags);
cpu = task_cpu(p);
}
+
+   /* Now try balancing at a lower domain level */
+nextlevel:
+   sd = NULL;
+   for_each_domain(cpu, tmp) {
+   if (cpus_subset(span, tmp->span))
+   break;
+   if (tmp->flags & SD_BALANCE_FORK)
+   sd = tmp;
+   }
+
+   if (sd)
+   goto again;
}
 
-no_forkbalance:
 #endif
/*
 * We decrease the sleep average of forking parents
@@ -1712,25 +1727,41 @@ void sched_exec(void)
sd = tmp;
 
if (sd) {
+   cpumask_t span;
struct sched_group *group;
+again:
schedstat_inc(sd, sbe_cnt);
+   span = sd->span;
group = find_idlest_group(sd, current, this_cpu);
if (!group) {
schedstat_inc(sd, sbe_balanced);
-   goto out;
+   goto nextlevel;
}
new_cpu = find_idlest_cpu(group, this_cpu);
if (new_cpu == -1 || new_cpu == this_cpu) {
schedstat_inc(sd, sbe_balanced);
-   goto out;
+   goto nextlevel;
}
 
schedstat_inc(sd, sbe_pushed);
put_cpu();
sched_migrate_task(current, new_cpu);
-   return;
+   
+   /* Now try balancing at a lower domain level */
+   this_cpu = get_cpu();
+nextlevel:
+   sd = NULL;
+   for_each_domain(this_cpu, tmp) {
+   if (cpus_subset(span, tmp->span))
+   break;
+   if (tmp->flags & SD_BALANCE_EXEC)
+   sd = tmp;
+   }
+
+   if (sd)
+   goto again;
}
-out:
+
put_cpu();
 }
 


[patch 4/5] sched: RCU sched domains

2005-04-05 Thread Nick Piggin
4/5
One of the problems with the multilevel balance-on-fork/exec is that it
needs to jump through hoops to satisfy sched-domain's locking semantics
(that is, you may traverse your own domain when not preemptable, and
you may traverse others' domains when holding their runqueue lock).

balance-on-exec had to potentially migrate between more than one CPU before
finding a final CPU to migrate to, and balance-on-fork needed to potentially
take multiple runqueue locks.

So bite the bullet and make sched-domains go completely RCU. This actually
simplifies the code quite a bit.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c   2005-04-05 16:39:14.0 +1000
+++ linux-2.6/kernel/sched.c2005-04-05 18:39:05.0 +1000
@@ -825,22 +825,12 @@ inline int task_curr(const task_t *p)
 }
 
 #ifdef CONFIG_SMP
-enum request_type {
-   REQ_MOVE_TASK,
-   REQ_SET_DOMAIN,
-};
-
 typedef struct {
struct list_head list;
-   enum request_type type;
 
-   /* For REQ_MOVE_TASK */
task_t *task;
int dest_cpu;
 
-   /* For REQ_SET_DOMAIN */
-   struct sched_domain *sd;
-
struct completion done;
 } migration_req_t;
 
@@ -862,7 +852,6 @@ static int migrate_task(task_t *p, int d
}
 
init_completion(&req->done);
-   req->type = REQ_MOVE_TASK;
req->task = p;
req->dest_cpu = dest_cpu;
list_add(&req->list, &rq->migration_queue);
@@ -4365,17 +4354,9 @@ static int migration_thread(void * data)
req = list_entry(head->next, migration_req_t, list);
list_del_init(head->next);
 
-   if (req->type == REQ_MOVE_TASK) {
-   spin_unlock(&rq->lock);
-   __migrate_task(req->task, cpu, req->dest_cpu);
-   local_irq_enable();
-   } else if (req->type == REQ_SET_DOMAIN) {
-   rq->sd = req->sd;
-   spin_unlock_irq(&rq->lock);
-   } else {
-   spin_unlock_irq(&rq->lock);
-   WARN_ON(1);
-   }
+   spin_unlock(&rq->lock);
+   __migrate_task(req->task, cpu, req->dest_cpu);
+   local_irq_enable();
 
complete(&req->done);
}
@@ -4606,7 +4587,6 @@ static int migration_call(struct notifie
migration_req_t *req;
req = list_entry(rq->migration_queue.next,
 migration_req_t, list);
-   BUG_ON(req->type != REQ_MOVE_TASK);
list_del_init(&req->list);
complete(&req->done);
}
@@ -4903,10 +4883,7 @@ static int __devinit sd_parent_degenerat
  */
 void __devinit cpu_attach_domain(struct sched_domain *sd, int cpu)
 {
-   migration_req_t req;
-   unsigned long flags;
runqueue_t *rq = cpu_rq(cpu);
-   int local = 1;
struct sched_domain *tmp;
 
/* Remove the sched domains which do not contribute to scheduling. */
@@ -4923,24 +4900,7 @@ void __devinit cpu_attach_domain(struct 
 
sched_domain_debug(sd, cpu);
 
-   spin_lock_irqsave(&rq->lock, flags);
-
-   if (cpu == smp_processor_id() || !cpu_online(cpu)) {
-   rq->sd = sd;
-   } else {
-   init_completion(&req.done);
-   req.type = REQ_SET_DOMAIN;
-   req.sd = sd;
-   list_add(&req.list, &rq->migration_queue);
-   local = 0;
-   }
-
-   spin_unlock_irqrestore(&rq->lock, flags);
-
-   if (!local) {
-   wake_up_process(rq->migration_thread);
-   wait_for_completion(&req.done);
-   }
+   rq->sd = sd;
 }
 
 /* cpus with isolated domains */
@@ -5215,6 +5175,7 @@ static int update_sched_domains(struct n
case CPU_DOWN_PREPARE:
for_each_online_cpu(i)
cpu_attach_domain(NULL, i);
+   synchronize_kernel();
arch_destroy_sched_domains();
return NOTIFY_OK;
 


[patch 5/5] sched: consolidate sbe sbf

2005-04-05 Thread Nick Piggin
5/5
Any ideas about what to do with schedstats?
Do we really need balance on exec and fork as separate
statistics?
Consolidate balance-on-exec with balance-on-fork. This is made easy
by the sched-domains RCU patches.

As well as the general goodness of code reduction, this allows
the runqueues to be unlocked during balance-on-fork.

schedstats is a problem. Maybe just have balance-on-event instead
of distinguishing fork and exec?

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c   2005-04-05 18:39:14.0 +1000
+++ linux-2.6/kernel/sched.c2005-04-05 18:40:18.0 +1000
@@ -1013,8 +1013,57 @@ static int find_idlest_cpu(struct sched_
return idlest;
 }
 
+/*
+ * sched_balance_self: balance the current task (running on cpu) in domains
+ * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
+ * SD_BALANCE_EXEC.
+ *
+ * Balance, ie. select the least loaded group.
+ *
+ * Returns the target CPU number, or the same CPU if no balancing is needed.
+ *
+ * preempt must be disabled.
+ */
+static int sched_balance_self(int cpu, int flag)
+{
+   struct task_struct *t = current;
+   struct sched_domain *tmp, *sd = NULL;
 
-#endif
+   for_each_domain(cpu, tmp)
+   if (tmp->flags & flag)
+   sd = tmp;
+
+   while (sd) {
+   cpumask_t span;
+   struct sched_group *group;
+   int new_cpu;
+
+   span = sd->span;
+   group = find_idlest_group(sd, t, cpu);
+   if (!group)
+   goto nextlevel;
+
+   new_cpu = find_idlest_cpu(group, cpu);
+   if (new_cpu == -1 || new_cpu == cpu)
+   goto nextlevel;
+
+   /* Now try balancing at a lower domain level */
+   cpu = new_cpu;
+nextlevel:
+   sd = NULL;
+   for_each_domain(cpu, tmp) {
+   if (cpus_subset(span, tmp->span))
+   break;
+   if (tmp->flags & flag)
+   sd = tmp;
+   }
+   /* while loop will break here if sd == NULL */
+   }
+
+   return cpu;
+}
+
+#endif /* CONFIG_SMP */
 
 /*
  * wake_idle() will wake a task on an idle cpu if task->cpu is
@@ -1295,63 +1344,22 @@ void fastcall wake_up_new_task(task_t * 
int this_cpu, cpu;
runqueue_t *rq, *this_rq;
 #ifdef CONFIG_SMP
-   struct sched_domain *tmp, *sd = NULL;
-#endif
+   int new_cpu;
 
+   cpu = task_cpu(p);
+   preempt_disable();
+   new_cpu = sched_balance_self(cpu, SD_BALANCE_FORK);
+   preempt_enable();
+   if (new_cpu != cpu)
+   set_task_cpu(p, new_cpu);
+#endif
+   
+   cpu = task_cpu(p);
rq = task_rq_lock(p, &flags);
-   BUG_ON(p->state != TASK_RUNNING);
this_cpu = smp_processor_id();
-   cpu = task_cpu(p);
-
-#ifdef CONFIG_SMP
-   for_each_domain(cpu, tmp)
-   if (tmp->flags & SD_BALANCE_FORK)
-   sd = tmp;
-
-   if (sd) {
-   cpumask_t span;
-   int new_cpu;
-   struct sched_group *group;
-
-again:
-   schedstat_inc(sd, sbf_cnt);
-   span = sd->span;
-   cpu = task_cpu(p);
-   group = find_idlest_group(sd, p, cpu);
-   if (!group) {
-   schedstat_inc(sd, sbf_balanced);
-   goto nextlevel;
-   }
 
-   new_cpu = find_idlest_cpu(group, cpu);
-   if (new_cpu == -1 || new_cpu == cpu) {
-   schedstat_inc(sd, sbf_balanced);
-   goto nextlevel;
-   }
-
-   if (cpu_isset(new_cpu, p->cpus_allowed)) {
-   schedstat_inc(sd, sbf_pushed);
-   set_task_cpu(p, new_cpu);
-   task_rq_unlock(rq, &flags);
-   rq = task_rq_lock(p, &flags);
-   cpu = task_cpu(p);
-   }
-
-   /* Now try balancing at a lower domain level */
-nextlevel:
-   sd = NULL;
-   for_each_domain(cpu, tmp) {
-   if (cpus_subset(span, tmp->span))
-   break;
-   if (tmp->flags & SD_BALANCE_FORK)
-   sd = tmp;
-   }
-
-   if (sd)
-   goto again;
-   }
+   BUG_ON(p->state != TASK_RUNNING);
 
-#endif
/*
 * We decrease the sleep average of forking parents
 * and children as well, to keep max-interactive tasks
@@ -1699,59 +1707,17 @@ out:
task_rq_unlock(rq, &flags);
 }
 
-/*
- * sche

Re: [patch 1/5] sched: remove degenerate domains

2005-04-06 Thread Nick Piggin
Ingo Molnar wrote:
* Nick Piggin <[EMAIL PROTECTED]> wrote:

This is Suresh's patch with some modifications.

Remove degenerate scheduler domains during the sched-domain init.

actually, i'd suggest to not do this patch. The point of booting with a 
CONFIG_NUMA kernel on a non-NUMA box is mostly for testing, and the 
'degenerate' toplevel domain exposed conceptual bugs in the 
sched-domains code. In that sense removing such 'unnecessary' domains 
inhibits debuggability to a certain degree. If we had this patch earlier 
we'd not have experienced the wrong decisions taken by the scheduler, 
only on the much rarer 'really NUMA' boxes.

True. Although I'd imagine it may be something distros may want.
For example, a generic x86-64 kernel for both AMD and Intel systems
could easily have SMT and NUMA turned on.
I agree with the downside of exercising less code paths though.
What about putting as a (default to off for 2.6) config option in
the config embedded menu?
is there any case where we'd want to simplify the domain tree? One more 
domain level is just one (and very minor) aspect of CONFIG_NUMA - i'd 
not want to run a CONFIG_NUMA kernel on a non-NUMA box, even if the 
domain tree got optimized. Hm?

I guess there is the SMT issue too, and even booting an SMP kernel
on a UP system. Also small ia64 NUMA systems will probably have one
redundant NUMA level.
If/when topologies get more complex (for example, the recent Altix
discussions we had with Paul), it will be generally easier to set
up all levels in a generic way, then weed them out using something
like this, rather than put the logic in the domain setup code.
Nick
--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 2/5] sched: NULL domains

2005-04-06 Thread Nick Piggin
Ingo Molnar wrote:
* Ingo Molnar <[EMAIL PROTECTED]> wrote:

* Nick Piggin <[EMAIL PROTECTED]> wrote:

2/5

The previous patch fixed the last 2 places that directly access a
runqueue's sched-domain and assume it cannot be NULL.
We can now use a NULL domain instead of a dummy domain to signify
no balancing is to happen. No functional changes.
Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
  ^^^
Thanks.

if the previous 'remove degenerate domains' patch would go away then 
this patch needs to be merged/modified. (and most of the others as well)

I probably should respin this so it goes in *first* anyway.
Rather than doing half in the remove degenerate domains and
half here.
--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/5] sched: multilevel sbe and sbf

2005-04-06 Thread Nick Piggin
Ingo Molnar wrote:
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
note that no matter how much scheduler logic, in the end 
cross-scheduling of tasks between nodes on NUMA will always have a 
permanent penalty (i.e. the 'migration cost' is 'infinity' in the long 
run), so the primary focus _has to be_ on 'get it right initially'. When
tasks must spill over to other nodes, that will always remain a special case.
So balance-on-fork/exec/[clone] definitely needs to be aware of the full 
domain tree picture.

Yes, well put. I imagine this will only become more important as
the push towards multiprocessing machines grows, along with the
need for higher memory bandwidth and lower latency to CPUs.
--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 4/5] sched: RCU sched domains

2005-04-06 Thread Nick Piggin
Ingo Molnar wrote:
* Nick Piggin <[EMAIL PROTECTED]> wrote:

4/5

One of the problems with the multilevel balance-on-fork/exec is that 
it needs to jump through hoops to satisfy sched-domain's locking 
semantics (that is, you may traverse your own domain when not 
preemptable, and you may traverse others' domains when holding their 
runqueue lock).

balance-on-exec had to potentially migrate between more than one CPU 
before finding a final CPU to migrate to, and balance-on-fork needed 
to potentially take multiple runqueue locks.

So bite the bullet and make sched-domains go completely RCU. This 
actually simplifies the code quite a bit.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

i like it conceptually, so:
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
Oh good, thanks.
from now on, all domain-tree readonly uses have to be rcu_read_lock()-ed 
(or otherwise have to be in a non-preemptible section). But there's a 
bug in show_shedstats() which does a for_each_domain() from within a 
preemptible section. (It was a bug with the current hotplug logic too i 
think.)

Ah, thanks. That looks like a bug in the code with the locking
we have now too...
At a minimum i think we need the fix+comment below.
Well if we say "this is actually RCU", then yes. And we should
probably change the preempt_{dis|en}ables in other places to
rcu_read_lock.
OTOH, if we say we just want all running threads to process through
a preemption stage, then this would just be a preempt_disable/enable
pair.
In practice that makes no difference yet, but it looks like you and
Paul are working to distinguish these two cases in the RCU code, to
accommodate your low latency RCU stuff?
I'd prefer the latter (ie. just disable preempt, and use
synchronize_sched), but I'm not too sure of what is going on with
your the low latency RCU work...?
Ingo
Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
Thanks for catching that. I may just push it through first as a fix
to the current 2.6 schedstats code (using preempt_disable), and
afterwards we can change it to rcu_read_lock if that is required.
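
For reference, a minimal sketch of the kind of fix being discussed -- assuming
a made-up helper (print_cpu_domains is not the real show_schedstat() code, and
the seq_file output is only illustrative):

        static void print_cpu_domains(struct seq_file *seq, int cpu)
        {
                struct sched_domain *sd;

                /*
                 * With RCU sched-domains the tree may be replaced under us,
                 * so the walk must run in a non-preemptible (RCU read-side)
                 * section.
                 */
                preempt_disable();      /* later: rcu_read_lock() */
                for_each_domain(cpu, sd)
                        seq_printf(seq, "domain: span weight %d flags 0x%x\n",
                                        cpus_weight(sd->span), sd->flags);
                preempt_enable();       /* later: rcu_read_unlock() */
        }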
--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 5/5] sched: consolidate sbe sbf

2005-04-06 Thread Nick Piggin
Ingo Molnar wrote:
* Nick Piggin <[EMAIL PROTECTED]> wrote:

5/5
Any ideas about what to do with schedstats?
Do we really need balance on exec and fork as separate
statistics?

Consolidate balance-on-exec with balance-on-fork. This is made easy
by the sched-domains RCU patches.
As well as the general goodness of code reduction, this allows
the runqueues to be unlocked during balance-on-fork.
schedstats is a problem. Maybe just have balance-on-event instead
of distinguishing fork and exec?
Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

looks good.
One problem I just noticed, sorry. This is doing set_cpus_allowed
without holding the runqueue lock and without checking the hard
affinity mask either.
We could just do a set_cpus_allowed, or take the lock, set_cpus_allowed,
and take the new lock, but that's probably a bit heavy if we can avoid it.
In the interests of speed in this fast path, do you think we can do this
in sched_fork, before the task has even been put on the tasklist?
That would avoid all locking problems. Passing clone_flags into sched_fork
would not be a problem if we want to distinguish fork() and clone(CLONE_VM).
Yes? I'll cut a new patch to do just that.
 Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
while the code is now consolidated, i think we still need the separate 
fork/exec stats for schedstat.
This makes it a bit harder, then, to get good stats in the sched-domain
(which is really what we want). It would basically mean doing
if (balance fork)
schedstat_inc(sbf_cnt);
else if (balance exec)
schedstat_inc(sbe_cnt);
etc.
That should all get optimised out by the compiler, but still a bit ugly.
Any ideas?
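
Purely as a sketch of one option (schedstat_balance_event is an invented
helper, not existing code; sbf_cnt/sbe_cnt are the current schedstat fields):

        /* Keeps the fork/exec distinction for schedstats while the
         * balancing code itself stays common. */
        static inline void
        schedstat_balance_event(struct sched_domain *sd, int flag)
        {
                if (flag == SD_BALANCE_FORK)
                        schedstat_inc(sd, sbf_cnt);
                else if (flag == SD_BALANCE_EXEC)
                        schedstat_inc(sd, sbe_cnt);
        }

Though that still leaves the balanced/pushed counters to deal with.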
--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/5] sched: remove degenerate domains

2005-04-06 Thread Nick Piggin
Ingo Molnar wrote:
* Siddha, Suresh B <[EMAIL PROTECTED]> wrote:

Similarly I am working on adding a new core domain for dual-core 
systems! All these domains are unnecessary and cause performance 
isssues on non Multi-threading/Multi-core capable cpus! Agreed that 
performance impact will be minor but still...

ok, lets keep it then. It may in fact simplify the domain setup code: we 
could generate the 'most generic' layout for a given arch all the time, 
and then optimize it automatically. I.e. in theory we could have just a 
single domain-setup routine, which would e.g. generate the NUMA domains 
on SMP too, which would then be optimized away.

Yep, exactly. Even so, Andrew: please ignore this patch series
and I'll redo it for you when we all agree on everything.
Thanks.
--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 5/5] sched: consolidate sbe sbf

2005-04-06 Thread Nick Piggin
Nick Piggin wrote:
One problem I just noticed, sorry. This is doing set_cpus_allowed
without holding the runqueue lock and without checking the hard
affinity mask either.
Err, that is to say set_task_cpu, not set_cpus_allowed.
--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Coding style: mixed-case

2005-04-06 Thread Nick Piggin
Kenneth Aafløy wrote:
On Wednesday 06 April 2005 04:09, Matt Mackall wrote:
While there may be reasons why mixed case is suboptimal, the real
reason is that it's hard to keep track of which style is used where.
It's annoying and error-prone to have to remember the naming format
for everything in addition to its name. As most things are in a
standard style, things are made easier by having every piece of new
code follow that style and let us slowly approach uniformity.

My primary concern was this: why does the kernel's own coding style
deviate from the advice given in its documentation? Other than that
Probably it's been like that for a long time, and nobody has
really bothered to change it.
If you posted a patch for pf_locked() and friends (and note that it's
lowercase to match function-like usage), you'd probably find some
enthusiasts and some naysayers. Most of the naysayers would object on
the grounds of "it ain't broke", but if someone were to do it as part
of a series of more substantial clean-ups, it'd likely be accepted.

Certainly I would like to have a go at a patch, but I must say that I do not
feel particularly familiar with the code in question to make such a change.
I would have risen to the challenge had this been a driver level change,
but the mmu is something that I will not touch until I feel comfortable.
Well the only patch that could possibly be considered would be a
straight search and replace, and absolutely no functional changes;
I think you would be up to it ;)
A few suggestions:
Don't use PF_*. That namespace is already being used by at least
process flags and protocol flags. Maybe page_locked, page_dirty,
etc. might be better
There could be a quite a bit of external code using these interfaces.
Typically we wouldn't just rename public interfaces in a stable
series "just because", but the rules are a bit different for 2.6.
Your best bet would be to firstly do a patch to create the new interface
names but keep the old ones in place for backwards compatibility (just
#defined to the new name), then a second patch to convert over all the
in-kernel users. The compatibility stuff can be removed in N years.
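As a purely hypothetical illustration of that first step (page_locked() is
an invented name here, not an agreed interface):

        /* New lower-case accessor. */
        static inline int page_locked(struct page *page)
        {
                return test_bit(PG_locked, &page->flags);
        }

        /* Keep the old mixed-case name as an alias until in-kernel and
         * external users have been converted, then remove it. */
        #define PageLocked(page)        page_locked(page)
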
Lastly, it is quite likely that many people will consider this to be
more trouble than it's worth. So keep in mind it is not guaranteed to
get included.
--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: return value of ptep_get_and_clear

2005-04-06 Thread Nick Piggin
Kumar Gala wrote:
ptep_get_and_clear has a signature that looks something like:
static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
   pte_t *ptep)

It appears that it's supposed to return the pte_t pointed to by ptep
before it's modified.  Why do we bother doing this?  The caller seems
perfectly able to dereference ptep and hold on to it.  Am I missing 
something here?

You need to be able to *atomically* clear the pte and retrieve the
old value.
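(Purely as an illustration of why -- roughly the shape of the i386 non-PAE
version, from memory, not the exact kernel code:)

        /*
         * xchg reads the old pte and writes zero in one atomic instruction,
         * so an accessed/dirty bit update done by another CPU's MMU cannot
         * slip in between a separate load and store and get lost.
         */
        static inline pte_t example_ptep_get_and_clear(pte_t *ptep)
        {
                return __pte(xchg(&ptep->pte_low, 0));
        }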
Nick
--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.12-rc2-mm1

2005-04-06 Thread Nick Piggin
Siddha, Suresh B wrote:
On Tue, Apr 05, 2005 at 05:33:49PM +1000, Nick Piggin wrote:

Lastly, I'd like to be a bit less intrusive with pinned task
handling improvements. I think we can do this while still being
effective in preventing livelocks.

We want to see this fixed. Please post your patch and I can let you know
the test results.
Using the attached patch, a puny dual PIII-650 with ~400MB RAM swapped
itself to death after 2 infinite loop tasks had been pinned to one
of the CPUs. See how you go.
--
SUSE Labs, Novell Inc.
Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c   2005-04-07 02:39:22.0 +1000
+++ linux-2.6/kernel/sched.c2005-04-07 02:45:26.0 +1000
@@ -2041,6 +2041,12 @@ static runqueue_t *find_busiest_queue(st
 }
 
 /*
+ * Max backoff if we encounter pinned tasks. Pretty arbitrary value, but
+ * so long as it is large enough.
+ */
+#define MAX_PINNED_INTERVAL1024
+
+/*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
  *
@@ -2052,7 +2058,7 @@ static int load_balance(int this_cpu, ru
struct sched_group *group;
runqueue_t *busiest;
unsigned long imbalance;
-   int nr_moved, all_pinned;
+   int nr_moved, all_pinned = 0;
int active_balance = 0;
 
spin_lock(&this_rq->lock);
@@ -2143,7 +2149,8 @@ out_balanced:
 
sd->nr_balance_failed = 0;
/* tune up the balancing interval */
-   if (sd->balance_interval < sd->max_interval)
+   if ((all_pinned && sd->balance_interval < MAX_PINNED_INTERVAL) ||
+   (sd->balance_interval < sd->max_interval))
sd->balance_interval *= 2;
 
return 0;


[patch] sched: consolidate sbe sbf

2005-04-06 Thread Nick Piggin
Hi Ingo,
What do you think of the following patch? I won't send the
whole series again, I'll queue them up with Andrew if you
think this one looks OK (which is the only major change).
Thanks,
Nick
--
SUSE Labs, Novell Inc.
Consolidate balance-on-exec with balance-on-fork. This is made easy
by the sched-domains RCU patches.

As well as the general goodness of code reduction, this allows
the runqueues to be unlocked during balance-on-fork.

schedstats is a problem. Maybe just have balance-on-event instead
of distinguishing fork and exec?

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c   2005-04-07 02:39:21.0 +1000
+++ linux-2.6/kernel/sched.c2005-04-07 12:34:06.0 +1000
@@ -1022,8 +1022,57 @@ static int find_idlest_cpu(struct sched_
return idlest;
 }
 
+/*
+ * sched_balance_self: balance the current task (running on cpu) in domains
+ * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
+ * SD_BALANCE_EXEC.
+ *
+ * Balance, ie. select the least loaded group.
+ *
+ * Returns the target CPU number, or the same CPU if no balancing is needed.
+ *
+ * preempt must be disabled.
+ */
+static int sched_balance_self(int cpu, int flag)
+{
+   struct task_struct *t = current;
+   struct sched_domain *tmp, *sd = NULL;
 
-#endif
+   for_each_domain(cpu, tmp)
+   if (tmp->flags & flag)
+   sd = tmp;
+
+   while (sd) {
+   cpumask_t span;
+   struct sched_group *group;
+   int new_cpu;
+
+   span = sd->span;
+   group = find_idlest_group(sd, t, cpu);
+   if (!group)
+   goto nextlevel;
+
+   new_cpu = find_idlest_cpu(group, cpu);
+   if (new_cpu == -1 || new_cpu == cpu)
+   goto nextlevel;
+
+   /* Now try balancing at a lower domain level */
+   cpu = new_cpu;
+nextlevel:
+   sd = NULL;
+   for_each_domain(cpu, tmp) {
+   if (cpus_subset(span, tmp->span))
+   break;
+   if (tmp->flags & flag)
+   sd = tmp;
+   }
+   /* while loop will break here if sd == NULL */
+   }
+
+   return cpu;
+}
+
+#endif /* CONFIG_SMP */
 
 /*
  * wake_idle() will wake a task on an idle cpu if task->cpu is
@@ -1241,8 +1290,17 @@ int fastcall wake_up_state(task_t *p, un
  * Perform scheduler related setup for a newly forked process p.
  * p is forked by current.
  */
-void fastcall sched_fork(task_t *p)
+void fastcall sched_fork(task_t *p, int clone_flags)
 {
+   int cpu = smp_processor_id();
+   
+#ifdef CONFIG_SMP
+   preempt_disable();
+   cpu = sched_balance_self(cpu, SD_BALANCE_FORK);
+   preempt_enable();
+#endif
+   set_task_cpu(p, cpu);
+
/*
 * We mark the process as running here, but have not actually
 * inserted it onto the runqueue yet. This guarantees that
@@ -1303,64 +1361,12 @@ void fastcall wake_up_new_task(task_t * 
unsigned long flags;
int this_cpu, cpu;
runqueue_t *rq, *this_rq;
-#ifdef CONFIG_SMP
-   struct sched_domain *tmp, *sd = NULL;
-#endif
 
rq = task_rq_lock(p, &flags);
BUG_ON(p->state != TASK_RUNNING);
this_cpu = smp_processor_id();
cpu = task_cpu(p);
 
-#ifdef CONFIG_SMP
-   for_each_domain(cpu, tmp)
-   if (tmp->flags & SD_BALANCE_FORK)
-   sd = tmp;
-
-   if (sd) {
-   cpumask_t span;
-   int new_cpu;
-   struct sched_group *group;
-
-again:
-   schedstat_inc(sd, sbf_cnt);
-   span = sd->span;
-   cpu = task_cpu(p);
-   group = find_idlest_group(sd, p, cpu);
-   if (!group) {
-   schedstat_inc(sd, sbf_balanced);
-   goto nextlevel;
-   }
-
-   new_cpu = find_idlest_cpu(group, cpu);
-   if (new_cpu == -1 || new_cpu == cpu) {
-   schedstat_inc(sd, sbf_balanced);
-   goto nextlevel;
-   }
-
-   if (cpu_isset(new_cpu, p->cpus_allowed)) {
-   schedstat_inc(sd, sbf_pushed);
-   set_task_cpu(p, new_cpu);
-   task_rq_unlock(rq, &flags);
-   rq = task_rq_lock(p, &flags);
-   cpu = task_cpu(p);
-   }
-
-   /* Now try balancing at a lower domain level */
-nextlevel:
-   sd = NULL;
-   for_each_domain(cpu, tmp) {
-   if (cpus_subset(span, tmp->span))
-  

Re: [patch 4/5] sched: RCU sched domains

2005-04-07 Thread Nick Piggin
Ingo Molnar wrote:
* Nick Piggin <[EMAIL PROTECTED]> wrote:

At a minimum i think we need the fix+comment below.
Well if we say "this is actually RCU", then yes. And we should 
probably change the preempt_{dis|en}ables in other places to 
rcu_read_lock.

OTOH, if we say we just want all running threads to process through a 
preemption stage, then this would just be a preempt_disable/enable 
pair.

In practice that makes no difference yet, but it looks like you and 
Paul are working to distinguish these two cases in the RCU code, to 
accommodate your low latency RCU stuff?

it doesnt impact PREEMPT_RCU/PREEMPT_RT directly, because the scheduler 
itself always needs to be non-preemptible.

those few places where we currently do preempt_disable(), which should 
thus be rcu_read_lock(), are never in codepaths that can take alot of 
time.

but yes, in principle you are right, but in this particular (and 
special) case it's not a big issue. We should document the RCU read-lock 
dependencies cleanly and make all rcu-read-lock cases truly 
rcu_read_lock(), but it's not a pressing issue even considering possible 
future features like PREEMPT_RT.

the only danger in this area is to PREEMPT_RT: it is a bug on PREEMPT_RT 
if kernel code has an implicit 'spinlock means preempt-off and thus 
RCU-read-lock' assumption. Most of the time these get discovered via 
PREEMPT_DEBUG. (preempt_disable() disables preemption on PREEMPT_RT too, 
so that is not a problem either.)

OK thanks for the good explanation. So I'll keep it as is for now,
and whatever needs cleaning up later can be worked out as it comes
up.
--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

