Re: [PATCH] gpiolib: Fix use after free in gpiochip_add_pin_range

2012-11-21 Thread Linus Walleij
On Wed, Nov 21, 2012 at 7:33 AM, Axel Lin  wrote:

> This is introduced by commit 9ab6e988
> "gpiolib: return any error code from range creation".
>
> Signed-off-by: Axel Lin 
> ---
> This patch is against LinusW's linux-pinctrl tree, for-next branch.
> Axel

Oops thanks a lot for catching this Axel!

Yours,
Linus Walleij


Re: [PATCH] gpiolib: rename pin range arguments

2012-11-21 Thread Viresh Kumar
On 21 November 2012 13:20, Linus Walleij  wrote:
> From: Linus Walleij 
>
> To be crystal clear about what the arguments mean in this
> function, which deals with both GPIO and pin ranges with
> confusingly similar naming, we now have gpio_offset and
> pin_offset, and it is clear that these are offsets into the
> specific GPIO chip and pin controller respectively. The GPIO
> chip itself will of course keep track of the base offset into
> the global GPIO number space.
>
> Signed-off-by: Linus Walleij 
> ---
>  drivers/gpio/gpiolib.c | 19 ++-
>  include/asm-generic/gpio.h |  4 ++--
>  include/linux/gpio.h   |  2 +-
>  3 files changed, 13 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c
> index 317ff04..26e27c1 100644
> --- a/drivers/gpio/gpiolib.c
> +++ b/drivers/gpio/gpiolib.c
> @@ -1191,13 +1191,13 @@ EXPORT_SYMBOL_GPL(gpiochip_find);
>   * gpiochip_add_pin_range() - add a range for GPIO <-> pin mapping
>   * @chip: the gpiochip to add the range for
>   * @pinctrl_name: the dev_name() of the pin controller to map to
> - * @offset: the start offset in the current gpio_chip number space
> - * @pin_base: the start offset in the pin controller number space
> + * @gpio_offset: the start offset in the current gpio_chip number space
> + * @pin_offset: the start offset in the pin controller number space
>   * @npins: the number of pins from the offset of each pin space (GPIO and
>   * pin controller) to accumulate in this range
>   */
>  int gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
> -  unsigned int offset, unsigned int pin_base,
> +  unsigned int gpio_offset, unsigned int pin_offset,
>unsigned int npins)
>  {
> struct gpio_pin_range *pin_range;
> @@ -1210,11 +1210,11 @@ int gpiochip_add_pin_range(struct gpio_chip *chip, 
> const char *pinctl_name,
> }
>
> /* Use local offset as range ID */
> -   pin_range->range.id = offset;
> +   pin_range->range.id = gpio_offset;
> pin_range->range.gc = chip;
> pin_range->range.name = chip->label;
> -   pin_range->range.base = chip->base + offset;
> -   pin_range->range.pin_base = pin_base;
> +   pin_range->range.base = chip->base + gpio_offset;
> +   pin_range->range.pin_base = pin_offset;
> pin_range->range.npins = npins;
> pin_range->pctldev = pinctrl_find_and_add_gpio_range(pinctl_name,
> &pin_range->range);
> @@ -1224,9 +1224,10 @@ int gpiochip_add_pin_range(struct gpio_chip *chip, 
> const char *pinctl_name,
> kfree(pin_range);
> return PTR_ERR(pin_range->pctldev);
> }
> -   pr_debug("%s: GPIO chip: created GPIO range %d->%d ==> PIN %d->%d\n",
> -chip->label, offset, offset + npins - 1,
> -pin_base, pin_base + npins - 1);
> +   pr_debug("GPIO chip %s: created GPIO range %d->%d ==> %s PIN %d->%d\n",
> +chip->label, gpio_offset, gpio_offset + npins - 1,
> +pinctl_name,
> +pin_offset, pin_offset + npins - 1);
>
> list_add_tail(&pin_range->node, &chip->pin_ranges);
>
> diff --git a/include/asm-generic/gpio.h b/include/asm-generic/gpio.h
> index ec58fdb..9fd3093 100644
> --- a/include/asm-generic/gpio.h
> +++ b/include/asm-generic/gpio.h
> @@ -283,7 +283,7 @@ struct gpio_pin_range {
>  };
>
>  int gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
> -  unsigned int offset, unsigned int pin_base,
> +  unsigned int gpio_offset, unsigned int pin_offset,
>unsigned int npins);
>  void gpiochip_remove_pin_ranges(struct gpio_chip *chip);
>
> @@ -291,7 +291,7 @@ void gpiochip_remove_pin_ranges(struct gpio_chip *chip);
>
>  static inline int
>  gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
> -  unsigned int offset, unsigned int pin_base,
> +  unsigned int gpio_offset, unsigned int pin_offset,
>unsigned int npins)
>  {
> return 0;
> diff --git a/include/linux/gpio.h b/include/linux/gpio.h
> index 99861c6..bfe6656 100644
> --- a/include/linux/gpio.h
> +++ b/include/linux/gpio.h
> @@ -233,7 +233,7 @@ static inline int irq_to_gpio(unsigned irq)
>
>  static inline int
>  gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
> -  unsigned int offset, unsigned int pin_base,
> +  unsigned int gpio_offset, unsigned int pin_offset,
>unsigned int npins)
>  {
> WARN_ON(1);

Reviewed-by: Viresh Kumar 
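
To make the renamed arguments concrete, a minimal sketch of a caller follows; the chip, the "pinctrl-foo" controller name and the ranges are invented for illustration, not taken from a real driver.

#include <linux/gpio.h>

/*
 * Map GPIOs 0..15 of this chip onto pins 32..47 of the pin controller
 * named "pinctrl-foo" (all values are hypothetical).
 */
static int foo_add_pin_range(struct gpio_chip *gc)
{
	return gpiochip_add_pin_range(gc, "pinctrl-foo",
				      0,	/* gpio_offset into this chip */
				      32,	/* pin_offset into the pin controller */
				      16);	/* npins */
}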

Re: The bug of iput() removal from flusher thread?

2012-11-21 Thread Andrew Morton
On Wed, 21 Nov 2012 02:48:51 +0100 Jan Kara  wrote:

> +/*
> + * Add inode to LRU if needed (inode is unused and clean).
> + *
> + * Needs inode->i_lock held.
> + */
> +void inode_add_lru(struct inode *inode)
> +{
> + if (!(inode->i_state & (I_DIRTY | I_FREEING | I_SYNC)) &&
> + !atomic_read(&inode->i_count) && inode->i_sb->s_flags & MS_ACTIVE)
> + inode_lru_list_add(inode);
> +}

Is i_lock sufficient to stabilise i_count?



Is evict_inodes() wrong to test i_count outside i_lock?

invalidate_inodes() looks better.

can_unuse() must be called under i_lock, and is.  Apparently this
requirement was sufficiently obvious to not need documenting.

prune_icache_sb() gets it right.

iput() gets it right.

So to answer my own question: yes, it is sufficient.  But a) the
comment for inode.i_lock is out of date and b) evict_inodes() looks
fishy.
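
As a minimal sketch of the rule being confirmed here (illustrative only, not actual kernel code; inode_add_lru() is the helper added by Jan's patch above): both i_state and i_count are only stable while inode->i_lock is held, so a caller is expected to look like this:

#include <linux/fs.h>

static void example_maybe_add_to_lru(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	/* i_state and i_count may only be inspected under i_lock */
	inode_add_lru(inode);
	spin_unlock(&inode->i_lock);
}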



Re: [PATCH v2] mm: dmapool: use provided gfp flags for all dma_alloc_coherent() calls

2012-11-21 Thread Marek Szyprowski

Hello,

On 11/20/2012 8:33 PM, Andrew Morton wrote:

On Tue, 20 Nov 2012 15:31:45 +0100
Marek Szyprowski  wrote:

> dmapool always calls dma_alloc_coherent() with the GFP_ATOMIC flag,
> regardless of the flags provided by the caller. This causes excessive
> pruning of emergency memory pools without any good reason. Additionally,
> on the ARM architecture any driver which is using dmapools will sooner or
> later trigger the following error:
> "ERROR: 256 KiB atomic DMA coherent pool is too small!
> Please increase it with coherent_pool= kernel parameter!".
> Increasing the coherent pool size usually doesn't help much and only
> delays such an error, because all GFP_ATOMIC DMA allocations are always
> served from the special, very limited memory pool.
>

Is this problem serious enough to justify merging the patch into 3.7?
And into -stable kernels?


I wonder if it is a good idea to merge such change at the end of current
-rc period. It changes the behavior of dma pool allocations, and I bet there
are some drivers which don't care much about the gfp flags they pass, since
for ages it simply worked for them even when the allocations were done from
atomic context. What do you think? Technically it is also not a pure bugfix,
so IMHO it shouldn't be considered for -stable.

On the other hand, at least for ARM users of the sata_mv driver (which is just
an innocent client of dma pool, correctly passing the GFP_KERNEL flag) it would
solve the issues related to the shortage of the atomic pool for DMA allocations,
which might justify pushing it to 3.7.

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center
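
As an illustration of the kind of caller the change is about, a sketch of a driver allocating descriptors from process context through a dma_pool; the device, sizes and names are hypothetical, and the point is only that with the patch the GFP_KERNEL passed here reaches dma_alloc_coherent() instead of being forced to GFP_ATOMIC:

#include <linux/dmapool.h>
#include <linux/gfp.h>

static void *alloc_example_desc(struct device *dev, dma_addr_t *dma)
{
	struct dma_pool *pool;

	/* 64-byte descriptors, 64-byte aligned (hypothetical values) */
	pool = dma_pool_create("example-desc", dev, 64, 64, 0);
	if (!pool)
		return NULL;

	/* process context: GFP_KERNEL no longer drains the atomic coherent pool */
	return dma_pool_alloc(pool, GFP_KERNEL, dma);
}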




Re: Problem in Page Cache Replacement

2012-11-21 Thread metin d


>  Curious. Added linux-mm list to CC to catch more attention. If you run
> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?


I'm guessing it'd evict the entries, but am wondering if we could run any more 
diagnostics before trying this.

We regularly use a setup where we have two databases; one gets used frequently 
and the other one about once a month. It seems like the memory manager keeps 
unused pages in memory at the expense of frequently used database's performance.

My understanding was that under memory pressure from heavily accessed pages, 
unused pages would eventually get evicted. Is there anything else we can try on 
this host to understand why this is happening?

Thank you,

Metin


- Original Message -
From: Jan Kara 
To: metin d 
Cc: "linux-kernel@vger.kernel.org" ; 
linux...@kvack.org
Sent: Tuesday, November 20, 2012 8:25 PM
Subject: Re: Problem in Page Cache Replacement

On Tue 20-11-12 09:42:42, metin d wrote:
> I have two PostgreSQL databases named data-1 and data-2 that sit on the
> same machine. Both databases keep 40 GB of data, and the total memory
> available on the machine is 68GB.
> 
> I started data-1 and data-2, and ran several queries to go over all their
> data. Then, I shut down data-1 and kept issuing queries against data-2.
> For some reason, the OS still holds on to large parts of data-1's pages
> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> a result, my queries on data-2 keep hitting disk.
> 
> I'm checking page cache usage with fincore. When I run a table scan query
> against data-2, I see that data-2's pages get evicted and put back into
> the cache in a round-robin manner. Nothing happens to data-1's pages,
> although they haven't been touched for days.
> 
> Does anybody know why data-1's pages aren't evicted from the page cache?
> I'm open to all kind of suggestions you think it might relate to problem.
  Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches
  does it evict data-1 pages from memory?

> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> swap space. The kernel version is:
> 
> $ uname -r
> 3.2.28-45.62.amzn1.x86_64
> Edit:
> 
> and it seems that I use one NUMA instance, if you think that it can be a
> problem.
> 
> $ numactl --hardware
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 70007 MB
> node 0 free: 360 MB
> node distances:
> node   0
>   0:  10

                                Honza
-- 
Jan Kara 
SUSE Labs, CR



Re: [PATCH, v2] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones

2012-11-21 Thread Ingo Molnar

* Rik van Riel  wrote:

> >+}
> >+
> >+/*
> >+ * Add a simple loop to also fetch ptes within the same pmd:
> >+ */
> 
> That's not a very useful comment. How about something like:
> 
>   /*
>* Also fault over nearby ptes from within the same pmd and vma,
>* in order to minimize the overhead from page fault exceptions
>* and TLB flushes.
>*/

There's no TLB flushes here. But I'm fine with the other part so 
I've updated the comment to say:

/*
 * Also fault over nearby ptes from within the same pmd and vma,
 * in order to minimize the overhead from page fault exceptions:
 */

Thanks,

Ingo


Re: [PATCH V3 08/11] ARM: ux500: convert timer suspend/resume to clock_event_device

2012-11-21 Thread Linus Walleij
On Mon, Nov 19, 2012 at 7:31 PM, Stephen Warren  wrote:

> From: Stephen Warren 
>
> Move ux500's timer suspend/resume functions from struct sys_timer
> ux500_timer into struct clock_event_device nmdk_clkevt. This
> will allow the sys_timer suspend/resume fields to be removed, and
> eventually lead to a complete removal of struct sys_timer.
>
> Cc: Srinidhi Kasagar 
> Cc: Linus Walleij 
> Signed-off-by: Stephen Warren 

Acked-by: Linus Walleij 

We have a comment on the .resume member but I'll find
the right patch and comment there.

Yours,
Linus Walleij


Re: [PATCHv9 1/3] Runtime Interpreted Power Sequences

2012-11-21 Thread Tomi Valkeinen
On 2012-11-21 03:56, Alex Courbot wrote:
> Hi Tomi,
> 
> On Tuesday 20 November 2012 22:48:18 Tomi Valkeinen wrote:
>> I guess there's a reason, but the above looks a bit inconsistent. For
>> gpio you define the gpio resource inside the step. For power and pwm the
>> resource is defined before the steps. Why wouldn't "pwm = <&pwm 2
>> 500>;" work in step2?
> 
> That's mostly a framework issue. Most frameworks do not export a function 
> that 
> allow to dereference a phandle - they expect resources to be declared right 
> under the device node and accessed by name through foo_get(device, name). So 
> using phandles in power sequences would require to export these additional 

Right, I expected something like that.

> functions and also opens the door to some inconsistencies - for instance, 
> your 
> PWM phandle could be referenced a second time in the sequence with a 
> different 
> period - how do you know that these are actually referring the same PWM 
> device?

This I didn't understand. Doesn't "<&pwm 2 xyz>" point to a single
device, no matter where and how many times it's used?

>>> +When a power sequence is run, its steps is executed one after the other
>>> until +one step fails or the end of the sequence is reached.
>>
>> The document doesn't give any hint of what the driver should do if
>> running the power sequence fails. Run the "opposite" power sequence?
>> Will that work for all resources? I'm mainly thinking of a case where
>> each enable of the resource should be matched by a disable, i.e. you
>> can't call disable if no enable was called.
> 
> We discussed that issue already (around v5 I think) and the conclusion was 
> that it should be up to the driver. When we simply enable/disable resources 
> it 
> is easy to revert, but in the future non-boolean properties will likely be 
> introduced, and these cannot easily be reverted. Moreover some drivers might 
> have more complex recovery needs. This deserves more discussion I think, as 
> I'd like to have some "generic" recovery mechanism that covers most of the 
> cases.

Ok. I'll need to dig up the conversation. Did you consider any examples
of how some driver could handle the error cases?

What I'm worried about is that, as far as I understand, the power
sequence is kinda like black box to the driver. The driver just does
"power-up", without knowing what really goes on in there.

And if it doesn't know what goes on in there, nor what's in "power-down"
sequence, how can it do anything when an error happens? The only option
I see is that the driver doesn't do anything, which will leave some
resources enabled, or it can run the power-down sequence, which may or
may not work.

 Tomi






[PATCH] Add the values related to buddy system for filtering free pages

2012-11-21 Thread Atsushi Kumagai
This patch adds the values related to the buddy system to the vmcoreinfo
data so that makedumpfile (the dump filtering command) can filter out all
free pages with its new logic.
It's faster than the current logic because it can distinguish free pages
by analyzing the page structure at the same time as filtering out other
unnecessary pages (e.g. anonymous pages).
OTOH, the current logic has to trace free_list to distinguish free
pages while analyzing the page structure to filter out other unnecessary
pages.

The new logic uses the fact that a buddy page is marked by _mapcount ==
PAGE_BUDDY_MAPCOUNT_VALUE. The values below are required to distinguish
such pages.

Required values:
  - OFFSET(page._mapcount)
  - OFFSET(page.private)
  - SIZE(pageflags)
  - NUMBER(PG_slab)
  - NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE)
  
What's makedumpfile:
  makedumpfile creates a small dumpfile by excluding unnecessary pages
  for the analysis. To distinguish unnecessary pages, makedumpfile gets
  the vmcoreinfo data which has the minimum debugging information only
  for dump filtering.

Signed-off-by: Atsushi Kumagai 
---
 include/linux/kexec.h | 3 +++
 kernel/kexec.c| 5 +
 2 files changed, 8 insertions(+)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d0b8458..a90b148 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -158,6 +158,9 @@ unsigned long paddr_vmcoreinfo_note(void);
 #define VMCOREINFO_STRUCT_SIZE(name) \
vmcoreinfo_append_str("SIZE(%s)=%lu\n", #name, \
  (unsigned long)sizeof(struct name))
+#define VMCOREINFO_ENUM_SIZE(name) \
+   vmcoreinfo_append_str("SIZE(%s)=%lu\n", #name, \
+ (unsigned long)sizeof(enum name))
 #define VMCOREINFO_OFFSET(name, field) \
vmcoreinfo_append_str("OFFSET(%s.%s)=%lu\n", #name, #field, \
  (unsigned long)offsetof(struct name, field))
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 5e4bd78..511151b 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1485,10 +1485,13 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_STRUCT_SIZE(zone);
VMCOREINFO_STRUCT_SIZE(free_area);
VMCOREINFO_STRUCT_SIZE(list_head);
+   VMCOREINFO_ENUM_SIZE(pageflags);
VMCOREINFO_SIZE(nodemask_t);
VMCOREINFO_OFFSET(page, flags);
VMCOREINFO_OFFSET(page, _count);
VMCOREINFO_OFFSET(page, mapping);
+   VMCOREINFO_OFFSET(page, _mapcount);
+   VMCOREINFO_OFFSET(page, private);
VMCOREINFO_OFFSET(page, lru);
VMCOREINFO_OFFSET(pglist_data, node_zones);
VMCOREINFO_OFFSET(pglist_data, nr_zones);
@@ -1512,6 +1515,8 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_NUMBER(PG_lru);
VMCOREINFO_NUMBER(PG_private);
VMCOREINFO_NUMBER(PG_swapcache);
+   VMCOREINFO_NUMBER(PG_slab);
+   VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);

arch_crash_save_vmcoreinfo();
update_vmcoreinfo_note();
--
1.7.11
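
To illustrate how the exported values are meant to be used on the makedumpfile side, a rough sketch of the check follows (not makedumpfile's actual code; it assumes _mapcount and private have already been read out of the dump using the OFFSET()/NUMBER() entries above):

/*
 * A page is a free buddy page iff page._mapcount equals
 * PAGE_BUDDY_MAPCOUNT_VALUE; page.private then holds the buddy order,
 * so the whole 2^order block can be excluded without walking free_list.
 */
static unsigned long free_pages_in_block(long mapcount, unsigned long private,
					 long buddy_mapcount_value)
{
	if (mapcount == buddy_mapcount_value)
		return 1UL << private;	/* number of pages to filter out */
	return 0;			/* not a free buddy page */
}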


Re: [Xen-devel] [PATCH] xen: tmem: selfballooning should be enabled when xen tmem is enabled

2012-11-21 Thread Jan Beulich
>>> On 20.11.12 at 23:42, Dan Magenheimer  wrote:
> Konrad: Any chance this can get in for the upcoming window?
> (Or is it enough of a bug fix that it can go in at an -rcN?)
> 
> It was just pointed out to me that some kernels have
> cleancache and frontswap and xen_tmem enabled but NOT
> xen_selfballooning!  While this configuration should be
> possible, nearly all kernels that have CONFIG_XEN_TMEM=y should
> also have CONFIG_XEN_SELFBALLOONING=y, since Transcendent
> Memory (tmem) for Xen has very limited value without
> selfballooning.
> 
> This is probably a result of a Kconfig mistake fixed I think
> by the patch below.  Note that the year-old Oracle UEK2 kernel
> distro has both CONFIG_XEN_TMEM and CONFIG_XEN_SELFBALLOONING
> enabled, as does a Fedora 17 kernel update (3.6.6-1.fc17), so
> the combination should be well tested.  Also, Xen tmem (and thus
> selfballooning) are currently only enabled when a kernel boot
> parameter is supplied so there is no runtime impact without
> that boot parameter.
> 
> Signed-off-by: Dan Magenheimer 
> 
> diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
> index d4dffcd..b5f02f3 100644
> --- a/drivers/xen/Kconfig
> +++ b/drivers/xen/Kconfig
> @@ -10,9 +10,9 @@ config XEN_BALLOON
> return unneeded memory to the system.
>  
>  config XEN_SELFBALLOONING
> - bool "Dynamically self-balloon kernel memory to target"

Why would you want to take away the configurability of this?
You wanting it always on in your use case doesn't mean everyone
agrees. This would be the right way only when the option being
off despite all its dependencies being enabled is actively wrong.

> - depends on XEN && XEN_BALLOON && CLEANCACHE && SWAP && XEN_TMEM
> - default n
> + bool
> + depends on XEN_BALLOON && SWAP
> + default y if XEN_TMEM

Changing the default, otoh, is certainly acceptable. However, this
should imo be (assuming that you dropped the CLEANCACHE
dependency for an unrelated [and unexplained] reason),

depends on XEN_BALLOON && SWAP && XEN_TMEM
default XEN_TMEM

i.e. the default selection can be simplified, but if you indeed
have a good reason to drop the prompt, the
dependencies should continue to include the symbol referenced
by the default directive, as otherwise you may end up with a
.config pointlessly having

# CONFIG_XEN_SELFBALLOONING is not set

This is particularly annoying when subsequently this gets a prompt
re-added, since at that point a "make oldconfig" won't ask for the
item to get possibly enabled, as there is a value known for it already.

>   help
> Self-ballooning dynamically balloons available kernel memory driven
> by the current usage of anonymous memory ("committed AS") and

If you take away the prompt, keeping the help text isn't useful
either.

Jan



Re: [PATCH V3 10/11] ARM: remove struct sys_timer suspend and resume fields

2012-11-21 Thread Linus Walleij
On Mon, Nov 19, 2012 at 7:31 PM, Stephen Warren  wrote:

> From: Stephen Warren 
>
> These fields duplicate e.g. struct clock_event_device's suspend and
> resume fields, so remove them now that nothing is using them. The aim
> is to remove all fields from struct sys_timer except .init, then replace
> the ARM machine descriptor's .timer field with a .init_time function
> instead, and delete struct sys_timer.
>
> Signed-off-by: Stephen Warren 

Reviewed-by: Linus Walleij 

Yours,
Linus Walleij


Re: The bug of iput() removal from flusher thread?

2012-11-21 Thread Dave Chinner
On Wed, Nov 21, 2012 at 12:05:33AM -0800, Andrew Morton wrote:
> On Wed, 21 Nov 2012 02:48:51 +0100 Jan Kara  wrote:
> 
> > +/*
> > + * Add inode to LRU if needed (inode is unused and clean).
> > + *
> > + * Needs inode->i_lock held.
> > + */
> > +void inode_add_lru(struct inode *inode)
> > +{
> > +   if (!(inode->i_state & (I_DIRTY | I_FREEING | I_SYNC)) &&
> > +   !atomic_read(&inode->i_count) && inode->i_sb->s_flags & MS_ACTIVE)
> > +   inode_lru_list_add(inode);
> > +}
> 
> Is i_lock sufficient to stabilise i_count?
> 
> 
> 
> Is evict_inodes() wrong to test i_count outside i_lock?
> 
> invalidate_inodes() looks better.
> 
> can_unuse() must be called under i_lock, and is.  Apparently this
> requirement was sufficiently obvious to not meed documenting.

It is documented. can_unuse() looks at i_state and i_count, and both
are documented as requiring the i_lock at the top of the file in
the locking rules section. See also __iget(), which is likewise
mentioned in the locking rules.

> prune_icache_sb() gets it right.
> 
> iput() gets it right.
> 
> So to answer my own question: yes, it is sufficient.  But a) the
> comment for inode.i_lock is out of date

If you mean the one in fs.h, then yeah, it's way out of date.
>
> and b) evict_inodes() looks
> fishy.

As I understand it, evict_inodes() is special - it's only called
from generic_shutdown_super() after the MS_ACTIVE flag has been
removed from the filesystem, the dcache has been pruned and all the
inodes cleaned. So there should be no new references to the inodes
occurring, and hence we don't need to hold the lock to serialise
against new references being taken.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications

2012-11-21 Thread Glauber Costa
On 11/20/2012 10:23 PM, David Rientjes wrote:
> Anton can correct me if I'm wrong, but I certainly don't think this is 
> where mempressure is headed: I don't think any accounting needs to be done 
> and, if it is, it's a design issue that should be addressed now rather 
> than later.  I believe notifications should occur on current's mempressure 
> cgroup depending on its level of reclaim: nobody cares if your memcg has a 
> limit of 64GB when you only have 32GB of RAM, we'll want the notification.

My main concern is that to trigger those notifications, one would have
to first determine whether or not the particular group of tasks is under
pressure. And to do that, we need to somehow know how much memory we are
using, and how much we are reclaiming, etc. On a system-wide level, we
have this information. On a group level, this is already accounted by memcg.

In fact, the current code already seems to rely on memcg:

+   vmpressure(sc->target_mem_cgroup,
+  sc->nr_scanned - nr_scanned, nr_reclaimed);

Now, let's start simple: Assume we will have a different cgroup.
We want per-group pressure notifications for that group. How would you
determine that the specific group is under pressure?




Re: [PATCH V3 10/11] ARM: remove struct sys_timer suspend and resume fields

2012-11-21 Thread Linus Walleij
Oh and there was this comment/TODO:

On Mon, Nov 19, 2012 at 7:31 PM, Stephen Warren  wrote:

> @@ -17,15 +17,6 @@
>   *   Initialise the kernels jiffy timer source, claim interrupt
>   *   using setup_irq.  This is called early on during initialisation
>   *   while interrupts are still disabled on the local CPU.
> - * - suspend
> - *   Suspend the kernel jiffy timer source, if necessary.  This
> - *   is called with interrupts disabled, after all normal devices
> - *   have been suspended.  If no action is required, set this to
> - *   NULL.
> - * - resume
> - *   Resume the kernel jiffy timer source, if necessary.  This
> - *   is called with interrupts disabled before any normal devices
> - *   are resumed.  If no action is required, set this to NULL.
>   * - offset
>   *   Return the timer offset in microseconds since the last timer
>   *   interrupt.  Note: this must take account of any unprocessed
> @@ -33,8 +24,6 @@
>   */
>  struct sys_timer {
> void(*init)(void);
> -   void(*suspend)(void);
> -   void(*resume)(void);
>  };

So from the above it is quite clear that the sys_timer is breaking
the suspend_noirq/resume_noirq naming convention from
runtime PM as IRQs are disabled on these paths.

The same goes for struct clock_event_device ...

So while this looks just as bad after as before the patch,
we should take a mental note to rename the .suspend
and .resume hooks in the clock_event_device to
.suspend_noirq and .resume_noirq at some point.

I was thinking that if your patch set is introducing a
plethora of new users of these hooks we should maybe
stick a patch at the beginning of the series renaming the
hooks to *_noirq, but if it's a major obstacle it can surely wait.

Yours,
Linus Walleij


Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"

2012-11-21 Thread Glauber Costa
On 11/21/2012 12:18 AM, Andrew Morton wrote:
> On Tue, 20 Nov 2012 13:18:19 +0400
> Glauber Costa  wrote:
> 
>> On 11/12/2012 03:37 PM, Mel Gorman wrote:
>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>> index 02c1c971..d0a7967 100644
>>> --- a/include/linux/gfp.h
>>> +++ b/include/linux/gfp.h
>>> @@ -31,6 +31,7 @@ struct vm_area_struct;
>>>  #define ___GFP_THISNODE	0x40000u
>>>  #define ___GFP_RECLAIMABLE	0x80000u
>>>  #define ___GFP_NOTRACK	0x200000u
>>> +#define ___GFP_NO_KSWAPD	0x400000u
>>>  #define ___GFP_OTHER_NODE	0x800000u
>>>  #define ___GFP_WRITE	0x1000000u
>>
>> Keep in mind that this bit has been reused in -mm.
>> If this patch needs to be reverted, we'll need to first change
>> the definition of __GFP_KMEMCG (and __GFP_BITS_SHIFT as a result), or it
>> would break things.
> 
> I presently have
> 
> /* Plain integer GFP bitmasks. Do not use this directly. */
> #define ___GFP_DMA		0x01u
> #define ___GFP_HIGHMEM		0x02u
> #define ___GFP_DMA32		0x04u
> #define ___GFP_MOVABLE		0x08u
> #define ___GFP_WAIT		0x10u
> #define ___GFP_HIGH		0x20u
> #define ___GFP_IO		0x40u
> #define ___GFP_FS		0x80u
> #define ___GFP_COLD		0x100u
> #define ___GFP_NOWARN		0x200u
> #define ___GFP_REPEAT		0x400u
> #define ___GFP_NOFAIL		0x800u
> #define ___GFP_NORETRY		0x1000u
> #define ___GFP_MEMALLOC		0x2000u
> #define ___GFP_COMP		0x4000u
> #define ___GFP_ZERO		0x8000u
> #define ___GFP_NOMEMALLOC	0x10000u
> #define ___GFP_HARDWALL		0x20000u
> #define ___GFP_THISNODE		0x40000u
> #define ___GFP_RECLAIMABLE	0x80000u
> #define ___GFP_KMEMCG		0x100000u
> #define ___GFP_NOTRACK		0x200000u
> #define ___GFP_NO_KSWAPD	0x400000u
> #define ___GFP_OTHER_NODE	0x800000u
> #define ___GFP_WRITE		0x1000000u
> 
> and
> 

Humm, I didn't realize there was also another free space at 0x100000u.
This seems fine.

> #define __GFP_BITS_SHIFT 25   /* Room for N __GFP_FOO bits */
> #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
> 
> Which I think is OK?
Yes, if we haven't increased the size of the flag-space, no need to
change it.

> 
> I'd forgotten about __GFP_BITS_SHIFT.  Should we do this?
> 
> --- a/include/linux/gfp.h~a
> +++ a/include/linux/gfp.h
> @@ -35,6 +35,7 @@ struct vm_area_struct;
>  #define ___GFP_NO_KSWAPD 0x40u
>  #define ___GFP_OTHER_NODE0x80u
>  #define ___GFP_WRITE 0x100u
> +/* If the above are modified, __GFP_BITS_SHIFT may need updating */
>  
This is a very helpful comment.
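
For reference, the arithmetic behind that answer (using the values listed above) can be restated as a quick compile-time check; this is just an illustration, not something proposed for gfp.h:

/* ___GFP_WRITE = 0x1000000u is bit 24, the highest plain GFP bit in the
 * list above, so a 25-bit mask still covers every flag. */
#define ___GFP_WRITE		0x1000000u
#define __GFP_BITS_SHIFT	25
_Static_assert(___GFP_WRITE < (1u << __GFP_BITS_SHIFT),
	       "__GFP_BITS_SHIFT must cover the highest GFP bit");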


Re: [PATCH v2 1/2] block: Remove deadlock in disk_clear_events

2012-11-21 Thread Andrew Morton
On Mon, 19 Nov 2012 18:07:01 -0800 Derek Basehore  
wrote:

> In disk_clear_events, do not put work on system_nrt_freezable_wq. Instead, put
> it on system_nrt_wq.
> 
> There is a race between probing a usb and suspending the device. Since 
> probing a
> usb calls disk_clear_events, which puts work on a frozen workqueue, probing
> cannot finish after the workqueue is frozen. However, suspending cannot finish
> until the usb probe is finished, so we get a deadlock.

um,

- this is identical to v1

- ten days ago Jens said "thanks, applied" to v1, but it isn't in linux-next.

- At that time Jens asked you whether a -stable backport was
  warranted but I see no reply on that topic.

> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -1571,7 +1571,13 @@ unsigned int disk_clear_events(struct gendisk *disk, 
> unsigned int mask)
>  
>   /* uncondtionally schedule event check and wait for it to finish */
>   disk_block_events(disk);
> - queue_delayed_work(system_freezable_wq, &ev->dwork, 0);
> + /* We need to put the work on system_nrt_wq here since there is a

Like this:

/*
 * We need to ...

> +  * deadlock that happens while probing a usb device while suspending. If
> +  * we put work on a freezable worqueue here, a usb probe will wait here

s/worqueue/workqueue/

> +  * until the workqueue is unfrozen during suspend. Since suspend waits
> +  * on all probes to complete, we have a deadlock
> +  */
> + queue_delayed_work(system_nrt_wq, &ev->dwork, 0);
>   flush_delayed_work(&ev->dwork);
>   __disk_unblock_events(disk, false);
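
Putting those two notes together, the hunk would read roughly like this (a sketch of the suggested wording only, not a tested replacement):

	/*
	 * We need to put the work on system_nrt_wq here since there is a
	 * deadlock that happens while probing a usb device while suspending.
	 * If we put work on a freezable workqueue here, a usb probe will
	 * wait here until the workqueue is unfrozen during suspend.  Since
	 * suspend waits on all probes to complete, we have a deadlock.
	 */
	queue_delayed_work(system_nrt_wq, &ev->dwork, 0);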



Re: [PATCHv9 1/3] Runtime Interpreted Power Sequences

2012-11-21 Thread Alex Courbot
On Wednesday 21 November 2012 16:13:47 Tomi Valkeinen wrote:
> 
> On 2012-11-21 03:56, Alex Courbot wrote:
> > Hi Tomi,
> > 
> > On Tuesday 20 November 2012 22:48:18 Tomi Valkeinen wrote:
> >> I guess there's a reason, but the above looks a bit inconsistent. For
> >> gpio you define the gpio resource inside the step. For power and pwm the
> >> resource is defined before the steps. Why wouldn't "pwm = <&pwm 2
> >> 500>;" work in step2?
> > 
> > That's mostly a framework issue. Most frameworks do not export a function
> > that allow to dereference a phandle - they expect resources to be
> > declared right under the device node and accessed by name through
> > foo_get(device, name). So using phandles in power sequences would require
> > to export these additional
> Right, I expected something like that.
> 
> > functions and also opens the door to some inconsistencies - for instance,
> > your PWM phandle could be referenced a second time in the sequence with a
> > different period - how do you know that these are actually referring the
> > same PWM device?
> 
> This I didn't understand. Doesn't "<&pwm 2 xyz>" point to a single
> device, no matter where and how many times it's used?
> 
> >>> +When a power sequence is run, its steps is executed one after the other
> >>> until +one step fails or the end of the sequence is reached.
> >> 
> >> The document doesn't give any hint of what the driver should do if
> >> running the power sequence fails. Run the "opposite" power sequence?
> >> Will that work for all resources? I'm mainly thinking of a case where
> >> each enable of the resource should be matched by a disable, i.e. you
> >> can't call disable if no enable was called.
> > 
> > We discussed that issue already (around v5 I think) and the conclusion was
> > that it should be up to the driver. When we simply enable/disable
> > resources it is easy to revert, but in the future non-boolean properties
> > will likely be introduced, and these cannot easily be reverted. Moreover
> > some drivers might have more complex recovery needs. This deserves more
> > discussion I think, as I'd like to have some "generic" recovery mechanism
> > that covers most of the cases.
> 
> Ok. I'll need to dig up the conversation

IIRC it was somewhere around here:

https://lkml.org/lkml/2012/9/7/662

See the parent messages too.

> Did you consider any examples
> of how some driver could handle the error cases?

For all the (limited) use cases I considered, playing the power-off sequence 
when power-on fails just works. If power-off also fails you are potentially in 
more trouble though. Maybe we could have another "run" function that does not 
stop on errors for handling such cases where you want to "stop everything you 
can".

> What I'm worried about is that, as far as I understand, the power
> sequence is kinda like black box to the driver. The driver just does
> "power-up", without knowing what really goes on in there.

The driver could always inspect the sequence, but you are right in that this 
is not how it is intended to be done.

> And if it doesn't know what goes on in there, nor what's in "power-down"
> sequence, how can it do anything when an error happens? The only option
> I see is that the driver doesn't do anything, which will leave some
> resources enabled, or it can run the power-down sequence, which may or
> may not work.

Failures might be better handled if sequences have some "recovery policy" 
about what to do when they fail, as mentioned in the link above. As you 
pointed out, the driver might not always know enough about the resources 
involved to do the right thing.

Alex.
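
For what it's worth, a sketch of the recovery policy being discussed, seen from the driver side; power_seq_run() and the sequence names are placeholders for whatever the final interpreted-power-sequences API exposes, not the actual interface:

/* Hypothetical driver-side recovery: if "power-on" fails part-way,
 * run "power-off" best-effort to release whatever was enabled. */
static int panel_power_up(struct device *dev)
{
	int err;

	err = power_seq_run(dev, "power-on");	/* placeholder API */
	if (err) {
		/* errors here are deliberately ignored, matching the
		 * "stop everything you can" idea above */
		power_seq_run(dev, "power-off");
		return err;
	}
	return 0;
}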



Re: [PATCH RFT] regulator: da9055: Properly handle voltage range that doesn't start with 0 offset

2012-11-21 Thread Ashish Jangam
> This patch implements the map_voltage and list_voltage callbacks to properly
> handle the case where the voltage range doesn't start with a 0 offset.
> 
> Now we adjust the selector in map_voltage() before calling set_voltage_sel().
> And return 0 in list_voltage() for invalid selectors.
> 
> With above change, we can remove da9055_regulator_set_voltage_bits function.
> 
> One tricky part is that we need to add voffset to n_voltages.
> Although selectors below voffset are invalid, we need to add voffset to
> n_voltages so regulator_list_voltage() won't fail while checking the
> boundary for the selector before calling the list_voltage callback.
> 
> Signed-off-by: Axel Lin 
> ---
> Hi Ashish,
>   I don't have this hardware to test this patch.
>   Can you help to review and test this patch?
> Thank you,
> Axel
This patch looks good to me.
I have tested this patch on SMDK6410 using the DA9055 evaluation board.
Tested-by: Ashish Jangam 
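
A minimal sketch of the scheme described in the changelog; the step size, minimum voltage and "voffset" below are invented for illustration and are not the real DA9055 values:

#include <linux/regulator/driver.h>

#define FOO_VOFFSET	2		/* first valid selector */
#define FOO_MIN_UV	900000		/* voltage at selector == FOO_VOFFSET */
#define FOO_STEP_UV	25000
#define FOO_N_VOLTAGES	(FOO_VOFFSET + 64)	/* voffset added to n_voltages */

static int foo_list_voltage(struct regulator_dev *rdev, unsigned int sel)
{
	if (sel < FOO_VOFFSET)
		return 0;		/* invalid selector */
	return FOO_MIN_UV + (sel - FOO_VOFFSET) * FOO_STEP_UV;
}

static int foo_map_voltage(struct regulator_dev *rdev, int min_uV, int max_uV)
{
	unsigned int sel;

	if (min_uV < FOO_MIN_UV)
		min_uV = FOO_MIN_UV;
	/* adjust by voffset before the selector reaches set_voltage_sel() */
	sel = DIV_ROUND_UP(min_uV - FOO_MIN_UV, FOO_STEP_UV) + FOO_VOFFSET;
	if (sel >= FOO_N_VOLTAGES || foo_list_voltage(rdev, sel) > max_uV)
		return -EINVAL;
	return sel;
}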




Re: [PATCH 8/8] staging: wlags49_h2: ap_h2.c: fixes inconsistent spacing around *

2012-11-21 Thread Dan Carpenter
On Tue, Nov 20, 2012 at 04:45:00PM +0200, Johan Meiring wrote:
> This commit fixes an inconsistent spacing issue around *
> 

The others are good, but this one is not right.

> Signed-off-by: Johan Meiring 
> ---
>  drivers/staging/wlags49_h2/ap_h2.c |8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/staging/wlags49_h2/ap_h2.c 
> b/drivers/staging/wlags49_h2/ap_h2.c
> index e524153..c2d43ec 100644
> --- a/drivers/staging/wlags49_h2/ap_h2.c
> +++ b/drivers/staging/wlags49_h2/ap_h2.c
> @@ -3256,7 +3256,7 @@ static const CFG_PROG_STRCT fw_image_code[] = {
>   0x0146, /* sizeof(fw_image_1_data), */
>   0x0060, /* Target 
> address in NIC Memory */
>   0x, /* CRC: yes/no  
> TYPE: primary/station/tertiary */
> - (hcf_8 FAR *) fw_image_1_data
> + (hcf_8 FAR*) fw_image_1_data

We don't use far pointers in Linux.  When you're casting something
there is no space between the cast operation and the variable.  The
reason is that casting is a high precedence operation.

More readable:  (char *)p + 1;
Less readable:  (char *) p + 1;

In the first line it's obvious that we cast first and then add 1.

So this should be:
(hcf_8 *)fw_image_1_data,

regards,
dan carpenter



Re: Problem in Page Cache Replacement

2012-11-21 Thread Jaegeuk Hanse

Cc Fengguang Wu.

On 11/21/2012 04:13 PM, metin d wrote:

   Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?

I'm guessing it'd evict the entries, but am wondering if we could run any more 
diagnostics before trying this.

We regularly use a setup where we have two databases; one gets used frequently 
and the other one about once a month. It seems like the memory manager keeps 
unused pages in memory at the expense of frequently used database's performance.

My understanding was that under memory pressure from heavily accessed pages, 
unused pages would eventually get evicted. Is there anything else we can try on 
this host to understand why this is happening?

Thank you,

Metin

On Tue 20-11-12 09:42:42, metin d wrote:

I have two PostgreSQL databases named data-1 and data-2 that sit on the
same machine. Both databases keep 40 GB of data, and the total memory
available on the machine is 68GB.

I started data-1 and data-2, and ran several queries to go over all their
data. Then, I shut down data-1 and kept issuing queries against data-2.
For some reason, the OS still holds on to large parts of data-1's pages
in its page cache, and reserves about 35 GB of RAM to data-2's files. As
a result, my queries on data-2 keep hitting disk.

I'm checking page cache usage with fincore. When I run a table scan query
against data-2, I see that data-2's pages get evicted and put back into
the cache in a round-robin manner. Nothing happens to data-1's pages,
although they haven't been touched for days.

Does anybody know why data-1's pages aren't evicted from the page cache?
I'm open to all kind of suggestions you think it might relate to problem.

   Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches
   does it evict data-1 pages from memory?


This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
swap space. The kernel version is:

$ uname -r
3.2.28-45.62.amzn1.x86_64
Edit:

and it seems that I use one NUMA instance, if you think that it can be a problem.

$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 70007 MB
node 0 free: 360 MB
node distances:
node   0
0:  10




Re: [patch] mm, memcg: avoid unnecessary function call when memcg is disabled

2012-11-21 Thread Michal Hocko
On Tue 20-11-12 13:49:32, Andrew Morton wrote:
> On Mon, 19 Nov 2012 17:44:34 -0800 (PST)
> David Rientjes  wrote:
> 
> > While profiling numa/core v16 with cgroup_disable=memory on the command 
> > line, I noticed mem_cgroup_count_vm_event() still showed up as high as 
> > 0.60% in perftop.
> > 
> > This occurs because the function is called extremely often even when memcg 
> > is disabled.
> > 
> > To fix this, inline the check for mem_cgroup_disabled() so we avoid the 
> > unnecessary function call if memcg is disabled.
> > 
> > ...
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -181,7 +181,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct 
> > zone *zone, int order,
> > gfp_t gfp_mask,
> > unsigned long *total_scanned);
> >  
> > -void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item 
> > idx);
> > +void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item 
> > idx);
> > +static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
> > +enum vm_event_item idx)
> > +{
> > +   if (mem_cgroup_disabled() || !mm)
> > +   return;
> > +   __mem_cgroup_count_vm_event(mm, idx);
> > +}
> 
> Does the !mm case occur frequently enough to justify inlining it, or
> should that test remain out-of-line?

Now that you've asked about it I started looking around and I cannot see
how mm can ever be NULL. The condition is there since the very beginning
(456f998e memcg: add the pagefault count into memcg stats) but all the
callers are page fault handlers and those shouldn't have mm==NULL.
Or is there anything obvious I am missing?

Ying, the whole thread starts https://lkml.org/lkml/2012/11/19/545 but
the primary question is why we need !mm test for mem_cgroup_count_vm_event
at all.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] mm: dmapool: use provided gfp flags for all dma_alloc_coherent() calls

2012-11-21 Thread Andrew Morton
On Wed, 21 Nov 2012 09:08:52 +0100 Marek Szyprowski  
wrote:

> Hello,
> 
> On 11/20/2012 8:33 PM, Andrew Morton wrote:
> > On Tue, 20 Nov 2012 15:31:45 +0100
> > Marek Szyprowski  wrote:
> >
> > > dmapool always calls dma_alloc_coherent() with the GFP_ATOMIC flag,
> > > regardless of the flags provided by the caller. This causes excessive
> > > pruning of emergency memory pools without any good reason. Additionally,
> > > on the ARM architecture any driver which is using dmapools will sooner or
> > > later trigger the following error:
> > > "ERROR: 256 KiB atomic DMA coherent pool is too small!
> > > Please increase it with coherent_pool= kernel parameter!".
> > > Increasing the coherent pool size usually doesn't help much and only
> > > delays such an error, because all GFP_ATOMIC DMA allocations are always
> > > served from the special, very limited memory pool.
> > >
> >
> > Is this problem serious enough to justify merging the patch into 3.7?
> > And into -stable kernels?
> 
> I wonder if it is a good idea to merge such change at the end of current
> -rc period.

I'm not sure what you mean by this.

But what we do sometimes if we think a patch needs a bit more
real-world testing before backporting is to merge it into -rc1 in the
normal merge window, and tag it for -stable backporting.  That way it
gets a few weeks(?) testing in mainline before getting backported.



Lockdep complain for zram

2012-11-21 Thread Minchan Kim
Hi all,

Today I saw the lockdep complaint below.
As a matter of fact, I knew about it a long time ago but had forgotten it.
The reason lockdep complains is that zram now uses GFP_KERNEL
in the reclaim path (e.g. __zram_make_request) :(
I can fix that by replacing GFP_KERNEL with GFP_NOIO.
But the bigger problem is the vzalloc in zram_init_device, which uses GFP_KERNEL.
Of course, I can change it to __vmalloc, which can take a gfp_t.
But we still have a problem. Although __vmalloc can take a gfp_t, it still
performs GFP_KERNEL allocations internally. That's why I sent this patch:
https://lkml.org/lkml/2012/4/23/77
Since then I forgot about it, saw the bug today, and the question popped up again.

Yes, the fundamental problem is the utter crap vmalloc API.
If we could fix that, everyone would be happy. But life isn't that simple,
as the thread for that patch shows.

So the next option is to move zram_init_device to disksize setting time.
But that wastes metadata until zram is actually used (that's why
Nitin moved zram_init_device from disksize setting time to make_request), and
it means the user must set the disksize before use, which is a behavior change.

I would like to clean up this issue before promoting zram because it might
change usage behavior.

Do you have any idea?
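
To make the first two options concrete, a sketch of what they would look like (function names are made up, not the actual zram code); note the caveat above still applies, since __vmalloc() allocates its page tables with GFP_KERNEL internally:

#include <linux/gfp.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Option 1: on the I/O (reclaim) path, use GFP_NOIO instead of GFP_KERNEL. */
static void *zram_bvec_buffer(size_t len)
{
	return kmalloc(len, GFP_NOIO);
}

/* Option 2: replace vzalloc() (which hard-codes GFP_KERNEL) with __vmalloc()
 * so a gfp_t can be passed down for the large metadata allocation. */
static void *zram_meta_alloc_sketch(size_t len)
{
	return __vmalloc(len, GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM,
			 PAGE_KERNEL);
}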

 8< ==


[  335.772277] =
[  335.772615] [ INFO: inconsistent lock state ]
[  335.772955] 3.7.0-rc6 #162 Tainted: G C  
[  335.773320] -
[  335.773663] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
[  335.774170] kswapd0/23 [HC0[0]:SC0[0]:HE1:SE1] takes:
[  335.774564]  (&zram->init_lock){+-}, at: [] 
zram_make_request+0x4a/0x260 [zram]
[  335.775321] {RECLAIM_FS-ON-W} state was registered at:
[  335.775716]   [] mark_held_locks+0x82/0x130
[  335.776009]   [] lockdep_trace_alloc+0x67/0xc0
[  335.776009]   [] __alloc_pages_nodemask+0x94/0xa00
[  335.776009]   [] alloc_pages_current+0xb6/0x120
[  335.776009]   [] __get_free_pages+0x14/0x50
[  335.776009]   [] kmalloc_order_trace+0x3f/0xf0
[  335.776009]   [] zram_init_device+0x7b/0x220 [zram]
[  335.776009]   [] zram_make_request+0x24a/0x260 [zram]
[  335.776009]   [] generic_make_request+0xca/0x100
[  335.776009]   [] submit_bio+0x7b/0x160
[  335.776009]   [] submit_bh+0xf2/0x120
[  335.776009]   [] block_read_full_page+0x235/0x3a0
[  335.776009]   [] blkdev_readpage+0x18/0x20
[  335.776009]   [] __do_page_cache_readahead+0x2c7/0x2d0
[  335.776009]   [] force_page_cache_readahead+0x79/0xb0
[  335.776009]   [] page_cache_sync_readahead+0x43/0x50
[  335.776009]   [] generic_file_aio_read+0x4f0/0x760
[  335.776009]   [] blkdev_aio_read+0xbb/0xf0
[  335.776009]   [] do_sync_read+0xa3/0xe0
[  335.776009]   [] vfs_read+0xb0/0x180
[  335.776009]   [] sys_read+0x52/0xa0
[  335.776009]   [] system_call_fastpath+0x16/0x1b
[  335.776009] irq event stamp: 97589
[  335.776009] hardirqs last  enabled at (97589): [] 
throtl_update_dispatch_stats+0x94/0xf0
[  335.776009] hardirqs last disabled at (97588): [] 
throtl_update_dispatch_stats+0x4d/0xf0
[  335.776009] softirqs last  enabled at (67416): [] 
__do_softirq+0x139/0x280
[  335.776009] softirqs last disabled at (67395): [] 
call_softirq+0x1c/0x30
[  335.776009] 
[  335.776009] other info that might help us debug this:
[  335.776009]  Possible unsafe locking scenario:
[  335.776009] 
[  335.776009]CPU0
[  335.776009]
[  335.776009]   lock(&zram->init_lock);
[  335.776009]   
[  335.776009] lock(&zram->init_lock);
[  335.776009] 
[  335.776009]  *** DEADLOCK ***
[  335.776009] 
[  335.776009] no locks held by kswapd0/23.
[  335.776009] 
[  335.776009] stack backtrace:
[  335.776009] Pid: 23, comm: kswapd0 Tainted: G C   3.7.0-rc6 #162
[  335.776009] Call Trace:
[  335.776009]  [] print_usage_bug+0x1f5/0x206
[  335.776009]  [] ? save_stack_trace+0x2f/0x50
[  335.776009]  [] mark_lock+0x295/0x2f0
[  335.776009]  [] ? 
print_irq_inversion_bug.part.37+0x1f0/0x1f0
[  335.776009]  [] ? blk_throtl_bio+0x88/0x630
[  335.776009]  [] __lock_acquire+0x564/0x1c00
[  335.776009]  [] ? trace_hardirqs_on_caller+0x105/0x190
[  335.776009]  [] ? blk_throtl_bio+0x3c2/0x630
[  335.776009]  [] ? blk_throtl_bio+0x88/0x630
[  335.776009]  [] ? create_task_io_context+0xdc/0x150
[  335.776009]  [] ? create_task_io_context+0xdc/0x150
[  335.776009]  [] ? zram_make_request+0x4a/0x260 [zram]
[  335.776009]  [] lock_acquire+0x85/0x130
[  335.776009]  [] ? zram_make_request+0x4a/0x260 [zram]
[  335.776009]  [] down_read+0x4c/0x61
[  335.776009]  [] ? zram_make_request+0x4a/0x260 [zram]
[  335.776009]  [] ? generic_make_request_checks+0x222/0x420
[  335.776009]  [] ? test_set_page_writeback+0x6e/0x1a0
[  335.776009]  [] zram_make_request+0x4a/0x260 [zram]
[  335.776009]  [] generic_make_request+0xca/0x100
[  335.776009]  [] submit_bio+0x7b/0x160
[  335.776009]  [] ? account_page_writeback+0x13/0x20
[  335.776009]  [] ? test_set_page_writeback+0xf5/0x1a0
[  335.776009]  [] swap_writepage+0x1b9/0x240
[  335

Re: numa/core regressions fixed - more testers wanted

2012-11-21 Thread Alex Shi
>
> Those of you who would like to test all the latest patches are
> welcome to pick up latest bits at tip:master:
>
>git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master
>

I am wondering if it is a problem, but it still exists on HEAD: c418de93e39891
http://article.gmane.org/gmane.linux.kernel.mm/90131/match=compiled+with+name+pl+and+start+it+on+my

For example, when I just start 4 pl tasks, often 3 are running on node 0 and 1
is running on node 1.
The old balancer would spread the tasks evenly across different nodes and cores.

Regards
Alex


[PATCH v2 RESEND] Add NumaChip remote PCI support

2012-11-21 Thread Daniel J Blueman
Add a NumaChip-specific PCI access mechanism via MMCONFIG cycles, but
prevent access to AMD Northbridges, which shouldn't respond.

v2: Use PCI_DEVFN in precomputed constant limit; drop unneeded includes

Signed-off-by: Daniel J Blueman 
---
 arch/x86/include/asm/numachip/numachip.h |   20 +
 arch/x86/kernel/apic/apic_numachip.c |2 +
 arch/x86/pci/Makefile|1 +
 arch/x86/pci/numachip.c  |  134 ++
 4 files changed, 157 insertions(+)
 create mode 100644 arch/x86/include/asm/numachip/numachip.h
 create mode 100644 arch/x86/pci/numachip.c

diff --git a/arch/x86/include/asm/numachip/numachip.h 
b/arch/x86/include/asm/numachip/numachip.h
new file mode 100644
index 000..d35e71a
--- /dev/null
+++ b/arch/x86/include/asm/numachip/numachip.h
@@ -0,0 +1,20 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Numascale NumaConnect-specific header file
+ *
+ * Copyright (C) 2012 Numascale AS. All rights reserved.
+ *
+ * Send feedback to 
+ *
+ */
+
+#ifndef _ASM_X86_NUMACHIP_NUMACHIP_H
+#define _ASM_X86_NUMACHIP_NUMACHIP_H
+
+extern int __init pci_numachip_init(void);
+
+#endif /* _ASM_X86_NUMACHIP_NUMACHIP_H */
+
diff --git a/arch/x86/kernel/apic/apic_numachip.c 
b/arch/x86/kernel/apic/apic_numachip.c
index a65829a..9c2aa89 100644
--- a/arch/x86/kernel/apic/apic_numachip.c
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -179,6 +180,7 @@ static int __init numachip_system_init(void)
return 0;
 
x86_cpuinit.fixup_cpu_id = fixup_cpu_id;
+   x86_init.pci.arch_init = pci_numachip_init;
 
map_csrs();
 
diff --git a/arch/x86/pci/Makefile b/arch/x86/pci/Makefile
index 3af5a1e..ee0af58 100644
--- a/arch/x86/pci/Makefile
+++ b/arch/x86/pci/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_STA2X11)   += sta2x11-fixup.o
 obj-$(CONFIG_X86_VISWS)+= visws.o
 
 obj-$(CONFIG_X86_NUMAQ)+= numaq_32.o
+obj-$(CONFIG_X86_NUMACHIP) += numachip.o
 
 obj-$(CONFIG_X86_INTEL_MID)+= mrst.o
 
diff --git a/arch/x86/pci/numachip.c b/arch/x86/pci/numachip.c
new file mode 100644
index 000..3773e05
--- /dev/null
+++ b/arch/x86/pci/numachip.c
@@ -0,0 +1,129 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Numascale NumaConnect-specific PCI code
+ *
+ * Copyright (C) 2012 Numascale AS. All rights reserved.
+ *
+ * Send feedback to 
+ *
+ * PCI accessor functions derived from mmconfig_64.c
+ *
+ */
+
+#include <linux/pci.h>
+#include <asm/pci_x86.h>
+
+static u8 limit __read_mostly;
+
+static inline char __iomem *pci_dev_base(unsigned int seg, unsigned int bus, 
unsigned int devfn)
+{
+   struct pci_mmcfg_region *cfg = pci_mmconfig_lookup(seg, bus);
+
+   if (cfg && cfg->virt)
+   return cfg->virt + (PCI_MMCFG_BUS_OFFSET(bus) | (devfn << 12));
+   return NULL;
+}
+
+static int pci_mmcfg_read_numachip(unsigned int seg, unsigned int bus,
+ unsigned int devfn, int reg, int len, u32 *value)
+{
+   char __iomem *addr;
+
+   /* Why do we have this when nobody checks it. How about a BUG()!? -AK */
+   if (unlikely((bus > 255) || (devfn > 255) || (reg > 4095))) {
+err:   *value = -1;
+   return -EINVAL;
+   }
+
+   /* Ensure AMD Northbridges don't decode reads to other devices */
+   if (unlikely(bus == 0 && devfn >= limit)) {
+   *value = -1;
+   return 0;
+   }
+
+   rcu_read_lock();
+   addr = pci_dev_base(seg, bus, devfn);
+   if (!addr) {
+   rcu_read_unlock();
+   goto err;
+   }
+
+   switch (len) {
+   case 1:
+   *value = mmio_config_readb(addr + reg);
+   break;
+   case 2:
+   *value = mmio_config_readw(addr + reg);
+   break;
+   case 4:
+   *value = mmio_config_readl(addr + reg);
+   break;
+   }
+   rcu_read_unlock();
+
+   return 0;
+}
+
+static int pci_mmcfg_write_numachip(unsigned int seg, unsigned int bus,
+  unsigned int devfn, int reg, int len, u32 value)
+{
+   char __iomem *addr;
+
+   /* Why do we have this when nobody checks it. How about a BUG()!? -AK */
+   if (unlikely((bus > 255) || (devfn > 255) || (reg > 4095)))
+   return -EINVAL;
+
+   /* Ensure AMD Northbridges don't decode writes to other devices */
+   if (unlikely(bus == 0 && devfn >= limit))
+   return 0;
+
+   rcu_read_lock();
+   addr = pci_dev_base(seg, bus, devfn);
+   if (!addr) {
+   rcu_read_unlock();
+   return -EINVAL;
+   

[PATCHv2 2/2] dw_dmac: make usage of dw_dma_slave optional

2012-11-21 Thread Andy Shevchenko
The driver currently requires a custom slave configuration to be present in
order to perform slave transfers. However, in some cases the only information
needed beyond the generic slave configuration is the request line, which is
already provided by the slave_id parameter of the dma_slave_config structure.
For such cases the custom slave configuration can therefore be made optional.

Signed-off-by: Andy Shevchenko 
Acked-by: Viresh Kumar 
---
 drivers/dma/dw_dmac.c |   13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/dma/dw_dmac.c b/drivers/dma/dw_dmac.c
index 1100fa0..6e20746 100644
--- a/drivers/dma/dw_dmac.c
+++ b/drivers/dma/dw_dmac.c
@@ -50,11 +50,12 @@ static inline unsigned int dwc_get_sms(struct dw_dma_slave 
*slave)
struct dw_dma_slave *__slave = (_chan->private);\
struct dw_dma_chan *_dwc = to_dw_dma_chan(_chan);   \
struct dma_slave_config *_sconfig = &_dwc->dma_sconfig; \
+   bool _is_slave = is_slave_xfer(_dwc->direction);\
int _dms = dwc_get_dms(__slave);\
int _sms = dwc_get_sms(__slave);\
-   u8 _smsize = __slave ? _sconfig->src_maxburst : \
+   u8 _smsize = _is_slave ? _sconfig->src_maxburst :   \
DW_DMA_MSIZE_16;\
-   u8 _dmsize = __slave ? _sconfig->dst_maxburst : \
+   u8 _dmsize = _is_slave ? _sconfig->dst_maxburst :   \
DW_DMA_MSIZE_16;\
\
(DWC_CTLL_DST_MSIZE(_dmsize)\
@@ -325,7 +326,7 @@ dwc_descriptor_complete(struct dw_dma_chan *dwc, struct 
dw_desc *desc,
list_splice_init(&desc->tx_list, &dwc->free_list);
list_move(&desc->desc_node, &dwc->free_list);
 
-   if (!dwc->chan.private) {
+   if (!is_slave_xfer(dwc->direction)) {
struct device *parent = chan2parent(&dwc->chan);
if (!(txd->flags & DMA_COMPL_SKIP_DEST_UNMAP)) {
if (txd->flags & DMA_COMPL_DEST_UNMAP_SINGLE)
@@ -806,7 +807,7 @@ dwc_prep_slave_sg(struct dma_chan *chan, struct scatterlist 
*sgl,
 
dev_vdbg(chan2dev(chan), "%s\n", __func__);
 
-   if (unlikely(!dws || !sg_len))
+   if (unlikely(!is_slave_xfer(direction) || !sg_len))
return NULL;
 
dwc->direction = direction;
@@ -982,8 +983,8 @@ set_runtime_config(struct dma_chan *chan, struct 
dma_slave_config *sconfig)
 {
struct dw_dma_chan *dwc = to_dw_dma_chan(chan);
 
-   /* Check if it is chan is configured for slave transfers */
-   if (!chan->private)
+   /* Check if chan will be configured for slave transfers */
+   if (!is_slave_xfer(sconfig->direction))
return -EINVAL;
 
memcpy(&dwc->dma_sconfig, sconfig, sizeof(*sconfig));
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCHv2 1/2] dw_dmac: store direction in the custom channel structure

2012-11-21 Thread Andy Shevchenko
Currently the direction value comes both from the generic slave configuration
structure and explicitly as a preparation function parameter. The former is
effectively obsolete, so we have to store the value passed to the preparation
function somewhere in our own structures to be able to use it later. The best
place for that storage is the custom channel structure. For now we still keep
and check the direction field of the slave config structure as well.

Signed-off-by: Andy Shevchenko 
Acked-by: Viresh Kumar 
---
 drivers/dma/dw_dmac.c  |   12 ++--
 drivers/dma/dw_dmac_regs.h |   14 --
 2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/dma/dw_dmac.c b/drivers/dma/dw_dmac.c
index 5e2c4dc..1100fa0 100644
--- a/drivers/dma/dw_dmac.c
+++ b/drivers/dma/dw_dmac.c
@@ -178,9 +178,9 @@ static void dwc_initialize(struct dw_dma_chan *dwc)
cfghi = dws->cfg_hi;
cfglo |= dws->cfg_lo & ~DWC_CFGL_CH_PRIOR_MASK;
} else {
-   if (dwc->dma_sconfig.direction == DMA_MEM_TO_DEV)
+   if (dwc->direction == DMA_MEM_TO_DEV)
cfghi = DWC_CFGH_DST_PER(dwc->dma_sconfig.slave_id);
-   else if (dwc->dma_sconfig.direction == DMA_DEV_TO_MEM)
+   else if (dwc->direction == DMA_DEV_TO_MEM)
cfghi = DWC_CFGH_SRC_PER(dwc->dma_sconfig.slave_id);
}
 
@@ -723,6 +723,8 @@ dwc_prep_dma_memcpy(struct dma_chan *chan, dma_addr_t dest, 
dma_addr_t src,
return NULL;
}
 
+   dwc->direction = DMA_MEM_TO_MEM;
+
data_width = min_t(unsigned int, dwc->dw->data_width[dwc_get_sms(dws)],
 dwc->dw->data_width[dwc_get_dms(dws)]);
 
@@ -807,6 +809,8 @@ dwc_prep_slave_sg(struct dma_chan *chan, struct scatterlist 
*sgl,
if (unlikely(!dws || !sg_len))
return NULL;
 
+   dwc->direction = direction;
+
prev = first = NULL;
 
switch (direction) {
@@ -983,6 +987,7 @@ set_runtime_config(struct dma_chan *chan, struct 
dma_slave_config *sconfig)
return -EINVAL;
 
memcpy(&dwc->dma_sconfig, sconfig, sizeof(*sconfig));
+   dwc->direction = sconfig->direction;
 
convert_burst(&dwc->dma_sconfig.src_maxburst);
convert_burst(&dwc->dma_sconfig.dst_maxburst);
@@ -1339,6 +1344,8 @@ struct dw_cyclic_desc *dw_dma_cyclic_prep(struct dma_chan 
*chan,
if (unlikely(!is_slave_xfer(direction)))
goto out_err;
 
+   dwc->direction = direction;
+
if (direction == DMA_MEM_TO_DEV)
reg_width = __ffs(sconfig->dst_addr_width);
else
@@ -1713,6 +1720,7 @@ static int __devinit dw_probe(struct platform_device 
*pdev)
channel_clear_bit(dw, CH_EN, dwc->mask);
 
dwc->dw = dw;
+   dwc->direction = DMA_TRANS_NONE;
 
/* hardware configuration */
if (autocfg) {
diff --git a/drivers/dma/dw_dmac_regs.h b/drivers/dma/dw_dmac_regs.h
index 8881e9b..f9532c2 100644
--- a/drivers/dma/dw_dmac_regs.h
+++ b/drivers/dma/dw_dmac_regs.h
@@ -9,6 +9,7 @@
  * published by the Free Software Foundation.
  */
 
+#include <linux/dmaengine.h>
 #include 
 
 #define DW_DMA_MAX_NR_CHANNELS 8
@@ -184,12 +185,13 @@ enum dw_dmac_flags {
 };
 
 struct dw_dma_chan {
-   struct dma_chan chan;
-   void __iomem*ch_regs;
-   u8  mask;
-   u8  priority;
-   boolpaused;
-   boolinitialized;
+   struct dma_chan chan;
+   void __iomem*ch_regs;
+   u8  mask;
+   u8  priority;
+   enum dma_transfer_direction direction;
+   boolpaused;
+   boolinitialized;
 
/* software emulation of the LLP transfers */
struct list_head*tx_list;
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RESEND] Fix printing when no interrupt is allocated

2012-11-21 Thread Daniel J Blueman
Previously, a newline was implicitly added in the no-GSI case:

[7.185182] pci 0001:00:12.0: can't derive routing for PCI INT A
[7.191352] pci 0001:00:12.0: PCI INT A: no GSI
[7.195956]  - using ISA IRQ 10

The code thus prints a blank line where no legacy IRQ is available:

[1.650124] pci :00:14.0: can't derive routing for PCI INT A
[1.650126] pci :00:14.0: PCI INT A: no GSI
[1.650126] 
[1.650180] pci :00:14.0: can't derive routing for PCI INT A

Fix this by making the newline explicit and removing the superfluous
one.

Signed-off-by: Daniel J Blueman 
---
 drivers/acpi/pci_irq.c |8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
index 0eefa12..2c37996 100644
--- a/drivers/acpi/pci_irq.c
+++ b/drivers/acpi/pci_irq.c
@@ -459,7 +459,7 @@ int acpi_pci_irq_enable(struct pci_dev *dev)
 */
if (gsi < 0) {
u32 dev_gsi;
-   dev_warn(&dev->dev, "PCI INT %c: no GSI", pin_name(pin));
+   dev_warn(&dev->dev, "PCI INT %c: no GSI\n", pin_name(pin));
/* Interrupt Line values above 0xF are forbidden */
if (dev->irq > 0 && (dev->irq <= 0xF) &&
(acpi_isa_irq_to_gsi(dev->irq, &dev_gsi) == 0)) {
@@ -467,11 +467,9 @@ int acpi_pci_irq_enable(struct pci_dev *dev)
acpi_register_gsi(&dev->dev, dev_gsi,
  ACPI_LEVEL_SENSITIVE,
  ACPI_ACTIVE_LOW);
-   return 0;
-   } else {
-   printk("\n");
-   return 0;
}
+
+   return 0;
}
 
rc = acpi_register_gsi(&dev->dev, gsi, triggering, polarity);
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv9 1/3] Runtime Interpreted Power Sequences

2012-11-21 Thread Tomi Valkeinen
On 2012-11-21 10:32, Alex Courbot wrote:

>> Ok. I'll need to dig up the conversation
> 
> IIRC it was somewhere around here:
> 
> https://lkml.org/lkml/2012/9/7/662
> 
> See the parent messages too.

Thanks.

>> Did you consider any examples
>> of how some driver could handle the error cases?
> 
> For all the (limited) use cases I considered, playing the power-off sequence 
> when power-on fails just works. If power-off also fails you are potentially 
> in 
> more trouble though. Maybe we could have another "run" function that does not 
> stop on errors for handling such cases where you want to "stop everything you 
> can".

If the power-off sequence disables a regulator that was supposed to be
enabled by the power-on sequence (but wasn't enabled because of an
error), the regulator_disable is still called when the driver runs the
power-off sequence, isn't it? Regulator enables and disables are ref
counted, and the enables should match the disables.
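
A minimal sketch of the imbalance I mean (hypothetical driver code, not taken
from this patch set):

	#include <linux/regulator/consumer.h>

	/*
	 * The power-off sequence mirrors power-on unconditionally, so a failed
	 * enable still leads to a disable and the refcount goes out of balance.
	 */
	static void panel_power_off(struct regulator *vdd)
	{
		regulator_disable(vdd);	/* underflows if the enable never succeeded */
	}

	static int panel_power_on(struct regulator *vdd)
	{
		int err = regulator_enable(vdd);

		if (err) {
			panel_power_off(vdd);	/* "run power-off on failure" policy */
			return err;
		}
		return 0;
	}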

> Failures might be better handled if sequences have some "recovery policy" 
> about what to do when they fail, as mentioned in the link above. As you 
> pointed out, the driver might not always know enough about the resources 
> involved to do the right thing.

Yes, I think such recovery policy would be needed.

 Tomi






Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications

2012-11-21 Thread Anton Vorontsov
On Wed, Nov 21, 2012 at 12:27:28PM +0400, Glauber Costa wrote:
> On 11/20/2012 10:23 PM, David Rientjes wrote:
> > Anton can correct me if I'm wrong, but I certainly don't think this is 
> > where mempressure is headed: I don't think any accounting needs to be done

Yup, I'd rather not do any accounting, at least not in bytes.

> > and, if it is, it's a design issue that should be addressed now rather 
> > than later.  I believe notifications should occur on current's mempressure 
> > cgroup depending on its level of reclaim: nobody cares if your memcg has a 
> > limit of 64GB when you only have 32GB of RAM, we'll want the notification.
> 
> My main concern is that to trigger those notifications, one would have
> to first determine whether or not the particular group of tasks is under
> pressure.

As far as I understand, the notifications will be triggered by a process
that tries to allocate memory. So, effectively that would be a per-process
pressure.

So, if one process in a group is suffering, we notify that "a process in the
group is under pressure", and the notification goes to the cgroup's listeners.

> And to do that, we need to somehow know how much memory we are
> using, and how much we are reclaiming, etc. On a system-wide level, we
> have this information. On a grouplevel, this is already accounted by memcg.
> 
> In fact, the current code already seems to rely on memcg:
> 
> + vmpressure(sc->target_mem_cgroup,
> +sc->nr_scanned - nr_scanned, nr_reclaimed);

Well, I'm still unsure about the details, but I guess in the "mempressure"
cgroup approach this will be derived from current->, i.e. from the task.

But note that we won't report pressure to a memcg cgroup; we will notify
only the mempressure cgroup. A process can be in both of them
simultaneously. In the code, mempressure and memcg will not depend on
each other.

> Now, let's start simple: Assume we will have a different cgroup.
> We want per-group pressure notifications for that group. How would you
> determine that the specific group is under pressure?

If a process that tries to allocate memory & causes reclaim is a part of
the cgroup, then the cgroup is under pressure.

At least that's a very brief understanding of the idea, details to be
investigated... But I welcome David to comment on whether I got everything
correctly. :)

Thanks,
Anton.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH resend] TPM: Issue TPM_STARTUP at driver load if the TPM has not been started

2012-11-21 Thread Peter.Huewe
Hi Jason, Hi Kent,
>
>Discussion on this got a bit sidetracked talking about
>suspend/resume.. To be clear, this fixes a real, serious, problem on
>normal embedded cases where the kernel refuses to attach the TPM driver
>at all.
>
>Key, are you happy with this as-is for the next merge window?
>
>This version is rebased and retested to 3.7-rc6. Tested on Atmel
>and Nuvoton LPC TPMs on PPC32.

Sorry I was tied up at work with other stuff, so I couldn't try out your 
initial patch.
Thanks for resubmitting!

I just gave the new version a run on my beagleboard with our Infineon SLB9635 
TT 1.2 Soft I2C TPM
and it seems to work as expected. (Tested with and without previous startup).

Tested-by: Peter Huewe 

I just have some minor comments - please see below.


>+++ b/drivers/char/tpm/tpm.c
>+#define TPM_ORD_STARTUP cpu_to_be32(153)
>+#define TPM_ST_CLEAR cpu_to_be16(1)
>+#define TPM_ST_STATE cpu_to_be16(2)
>+#define TPM_ST_DEACTIVATED cpu_to_be16(3)
>+static const struct tpm_input_header tpm_startup_header = {
>+  .tag = TPM_TAG_RQU_COMMAND,
>+  .length = cpu_to_be32(12),
>+  .ordinal = TPM_ORD_STARTUP
>+};
>+
> ssize_t tpm_getcap(struct device *dev, __be32 subcap_id, cap_t *cap,
>  const char *desc)
> {


Purely cosmetic question, but why did you define this before tpm_getcap and
not before tpm_startup?
All the other definitions are made directly before they are used - so perhaps
this should be moved directly before tpm_startup.
(Maybe we should move out these definitions to a common location? Header?)




>@@ -541,11 +560,27 @@ int tpm_get_timeouts(struct tpm_chip *chip)
>   tpm_cmd.params.getcap_in.cap = TPM_CAP_PROP;
>   tpm_cmd.params.getcap_in.subcap_size = cpu_to_be32(4);
>   tpm_cmd.params.getcap_in.subcap = TPM_CAP_PROP_TIS_TIMEOUT;
>+  rc = transmit_cmd(chip, &tpm_cmd, TPM_INTERNAL_RESULT_SIZE, 0);

Please use NULL instead of 0, otherwise sparse complains - so 
-   rc = transmit_cmd(chip, &tpm_cmd, TPM_INTERNAL_RESULT_SIZE, 0);
+   rc = transmit_cmd(chip, &tpm_cmd, TPM_INTERNAL_RESULT_SIZE, NULL);


>+  if (rc == TPM_ERR_INVALID_POSTINIT) {
>   ...
>+  rc = transmit_cmd(chip, &tpm_cmd, TPM_INTERNAL_RESULT_SIZE, 0);

Same here:
please use NULL instead of 0, otherwise sparse complains - so
-   rc = transmit_cmd(chip, &tpm_cmd, TPM_INTERNAL_RESULT_SIZE, 0);
+   rc = transmit_cmd(chip, &tpm_cmd, TPM_INTERNAL_RESULT_SIZE, 
NULL);


>diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
> extern ssize_t tpm_show_pubek(struct device *, struct device_attribute *attr,
>@@ -291,6 +292,10 @@ struct tpm_getrandom_in {
>   __be32 num_bytes;
> }__attribute__((packed));
> 
>+struct tpm_startup_in {
>+  __be16  startup_type;
>+} __packed;


All the other structs use
__attribute__((packed));
Care to change this one to be consistent?


Apart from these three minor topics also a 
Reviewed-by: Peter Huewe 

Thanks,
Peter
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Problem in Page Cache Replacement

2012-11-21 Thread Fengguang Wu
On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
> Cc Fengguang Wu.
> 
> On 11/21/2012 04:13 PM, metin d wrote:
> >>   Curious. Added linux-mm list to CC to catch more attention. If you run
> >>echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
> >I'm guessing it'd evict the entries, but am wondering if we could run any 
> >more diagnostics before trying this.
> >
> >We regularly use a setup where we have two databases; one gets used 
> >frequently and the other one about once a month. It seems like the memory 
> >manager keeps unused pages in memory at the expense of frequently used 
> >database's performance.

> >My understanding was that under memory pressure from heavily
> >accessed pages, unused pages would eventually get evicted. Is there
> >anything else we can try on this host to understand why this is
> >happening?

We may debug it this way.

1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
   (please double check via /proc/vmstat whether it does the expected work)

2) run 'page-types -r' with root, to view the page status for the
   remaining pages of data-1

The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
Please compile them with options "-Dlinux -I. -D_GNU_SOURCE 
-D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"

page-types can be found in the kernel source tree tools/vm/page-types.c

Sorry that sounds a bit twisted... I do have a patch to directly dump the
page cache status of a user-specified file, however it's not
upstreamed yet.

Thanks,
Fengguang

> >On Tue 20-11-12 09:42:42, metin d wrote:
> >>I have two PostgreSQL databases named data-1 and data-2 that sit on the
> >>same machine. Both databases keep 40 GB of data, and the total memory
> >>available on the machine is 68GB.
> >>
> >>I started data-1 and data-2, and ran several queries to go over all their
> >>data. Then, I shut down data-1 and kept issuing queries against data-2.
> >>For some reason, the OS still holds on to large parts of data-1's pages
> >>in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> >>a result, my queries on data-2 keep hitting disk.
> >>
> >>I'm checking page cache usage with fincore. When I run a table scan query
> >>against data-2, I see that data-2's pages get evicted and put back into
> >>the cache in a round-robin manner. Nothing happens to data-1's pages,
> >>although they haven't been touched for days.
> >>
> >>Does anybody know why data-1's pages aren't evicted from the page cache?
> >>I'm open to all kind of suggestions you think it might relate to problem.
> >   Curious. Added linux-mm list to CC to catch more attention. If you run
> >echo 1 >/proc/sys/vm/drop_caches
> >   does it evict data-1 pages from memory?
> >
> >>This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> >>swap space. The kernel version is:
> >>
> >>$ uname -r
> >>3.2.28-45.62.amzn1.x86_64
> >>Edit:
> >>
> >>and it seems that I use one NUMA instance, if  you think that it can a 
> >>problem.
> >>
> >>$ numactl --hardware
> >>available: 1 nodes (0)
> >>node 0 cpus: 0 1 2 3 4 5 6 7
> >>node 0 size: 70007 MB
> >>node 0 free: 360 MB
> >>node distances:
> >>node   0
> >>0:  10
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

#include "fadvise.h"

char *progname;

static void usage(void)
{
	fprintf(stderr, "Usage: %s filename offset length advice [loops]\n", progname);
	fprintf(stderr, "  advice: normal sequential willneed noreuse "
	"dontneed asyncwrite writewait\n");
	exit(1);
}

int
main(int argc, char *argv[])
{
	int c;
	int fd;
	char *sadvice;
	char *filename;
	loff_t offset;
	unsigned long length;
	int advice = 0;
	int ret;
	int loops = 1;

	progname = argv[0];

	while ((c = getopt(argc, argv, "")) != -1) {
		switch (c) {
		}
	}

	if (optind == argc)
		usage();
	filename = argv[optind++];

	if (optind == argc)
		usage();
	offset = strtoull(argv[optind++], NULL, 0);

	if (optind == argc)
		usage();
	length = strtol(argv[optind++], NULL, 0);

	if (optind == argc)
		usage();
	sadvice = argv[optind++];

	if (optind != argc)
		loops = strtol(argv[optind++], NULL, 0);

	if (optind != argc)
		usage();

	if (!strcmp(sadvice, "normal"))
		advice = POSIX_FADV_NORMAL;
	else if (!strcmp(sadvice, "sequential"))
		advice = POSIX_FADV_SEQUENTIAL;
	else if (!strcmp(sadvice, "willneed"))
		advice = POSIX_FADV_WILLNEED;
	else if (!strcmp(sadvice, "noreuse"))
		advice = POSIX_FADV_NOREUSE;
	else if (!strcmp(sadvice, "dontneed"))
		advice = POSIX_FADV_DONTNEED;
	else if (!strcmp(sadvice, "asyncwrite"))
		advice = LINUX_FADV_ASYNC_WRITE;
	else if (!strcmp(sadvice, "writewait"))
		advice = LINUX_FADV_WRITE_WAIT;
	else
		usage();

	fd = open(filename, O_RDONLY);
	if (fd < 0) {
		fprintf(stderr, "%s: cannot open `%s': %s\n",
			progname, filename, strerror(errno));
		exit(1);
	}

	while (loops--) {
		ret = __posix_fadvise64(fd, offset, length, advice);
		if (ret) {
			fprintf(stderr, "%s: fadvise() failed: %s\n",
p

RE: [PATCH 0/4] thermal: Add support for interrupt based notification to thermal layer

2012-11-21 Thread Zhang, Rui
Hi, Amit,

Now that THERMAL_TREND_RAISE_FULL/THERMAL_TREND_DROP_FULL
have been introduced into the thermal -next tree,
what is your plan for this patch set?

Thanks,
Rui

> -Original Message-
> From: linux-acpi-ow...@vger.kernel.org [mailto:linux-acpi-
> ow...@vger.kernel.org] On Behalf Of Amit Daniel Kachhap
> Sent: Thursday, November 08, 2012 12:26 PM
> To: linux...@lists.linux-foundation.org
> Cc: linux-samsung-...@vger.kernel.org; linux-kernel@vger.kernel.org; R,
> Durgadoss; l...@kernel.org; Zhang, Rui; linux-a...@vger.kernel.org;
> amit.kach...@linaro.org; jonghwa3@samsung.com
> Subject: [PATCH 0/4] thermal: Add support for interrupt based
> notification to thermal layer
> Importance: High
> 
> The patch submitted by Jonghwa Lee
> (https://patchwork.kernel.org/patch/1683441/)
> adds support for interrupt based notification to thermal layer. This is
> a good feature but the current thermal framework needs polling/regular
> notification for invoking suitable cooling action. So adding 2 new
> thermal trend type to implement this feature.
> 
> All these patches are based on thermal maintainer next tree.
> git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux.git next
> 
> Amit Daniel Kachhap (3):
>   thermal: Add new thermal trend type to support quick cooling
>   thermal: exynos: Miscellaneous fixes to support falling threshold
> interrupt
>   thermal: exynos: Use the new thermal trend type for quick cooling
> action.
> 
> Jonghwa Lee (1):
>   Thermal: exynos: Add support for temperature falling interrupt.
> 
>  drivers/thermal/exynos_thermal.c |  105 +++---
> 
>  drivers/thermal/step_wise.c  |   19 -
>  include/linux/platform_data/exynos_thermal.h |3 +
>  include/linux/thermal.h  |2 +
>  4 files changed, 80 insertions(+), 49 deletions(-)
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi"
> in the body of a message to majord...@vger.kernel.org More majordomo
> info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH] PM / devfreq: Add runtime-pm support

2012-11-21 Thread Rajagopal Venkat
Instead of having each devfreq device driver explicitly call the devfreq
suspend and resume APIs, typically from its runtime-pm suspend and resume
callbacks, let the devfreq core handle this automatically.

Attach the devfreq core to the runtime-pm framework so that a devfreq device
driver's pm_runtime_suspend() will automatically suspend devfreq and its
pm_runtime_resume() will resume devfreq.
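
For illustration, this removes boilerplate like the following from drivers
(a hypothetical driver, shown only to make the point):

	#include <linux/devfreq.h>
	#include <linux/clk.h>

	/* runtime PM callbacks that no longer need to touch devfreq directly */
	static int foo_runtime_suspend(struct device *dev)
	{
		struct foo_drvdata *priv = dev_get_drvdata(dev);

		devfreq_suspend_device(priv->devfreq);	/* now done by devfreq core */
		clk_disable_unprepare(priv->clk);
		return 0;
	}

	static int foo_runtime_resume(struct device *dev)
	{
		struct foo_drvdata *priv = dev_get_drvdata(dev);

		clk_prepare_enable(priv->clk);
		devfreq_resume_device(priv->devfreq);	/* now done by devfreq core */
		return 0;
	}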

Signed-off-by: Rajagopal Venkat 
---
 drivers/devfreq/devfreq.c | 145 --
 include/linux/devfreq.h   |  12 
 2 files changed, 102 insertions(+), 55 deletions(-)

diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
index 45e053e..190e414 100644
--- a/drivers/devfreq/devfreq.c
+++ b/drivers/devfreq/devfreq.c
@@ -25,10 +25,9 @@
 #include 
 #include 
 #include 
+#include <linux/pm_runtime.h>
 #include "governor.h"
 
-static struct class *devfreq_class;
-
 /*
  * devfreq core provides delayed work based load monitoring helper
  * functions. Governors can use these or can implement their own
@@ -414,6 +413,93 @@ static void devfreq_dev_release(struct device *dev)
 }
 
 /**
+ * devfreq_suspend_device() - Suspend devfreq of a device.
+ * @devfreq: the devfreq instance to be suspended
+ */
+static int devfreq_suspend_device(struct devfreq *devfreq)
+{
+   if (!devfreq)
+   return -EINVAL;
+
+   if (!devfreq->governor)
+   return 0;
+
+   return devfreq->governor->event_handler(devfreq,
+   DEVFREQ_GOV_SUSPEND, NULL);
+}
+
+/**
+ * devfreq_resume_device() - Resume devfreq of a device.
+ * @devfreq: the devfreq instance to be resumed
+ */
+static int devfreq_resume_device(struct devfreq *devfreq)
+{
+   if (!devfreq)
+   return -EINVAL;
+
+   if (!devfreq->governor)
+   return 0;
+
+   return devfreq->governor->event_handler(devfreq,
+   DEVFREQ_GOV_RESUME, NULL);
+}
+
+static int devfreq_runtime_suspend(struct device *dev)
+{
+   int ret;
+   struct devfreq *devfreq;
+
+   mutex_lock(&devfreq_list_lock);
+   devfreq = find_device_devfreq(dev);
+   mutex_unlock(&devfreq_list_lock);
+
+   ret = devfreq_suspend_device(devfreq);
+   if (ret < 0)
+   goto out;
+
+   ret = pm_generic_runtime_suspend(dev);
+out:
+   return ret;
+}
+
+static int devfreq_runtime_resume(struct device *dev)
+{
+   int ret;
+   struct devfreq *devfreq;
+
+   mutex_lock(&devfreq_list_lock);
+   devfreq = find_device_devfreq(dev);
+   mutex_unlock(&devfreq_list_lock);
+
+   ret = devfreq_resume_device(devfreq);
+   if (ret < 0)
+   goto out;
+
+   ret = pm_generic_runtime_resume(dev);
+out:
+   return ret;
+}
+
+static int devfreq_runtime_idle(struct device *dev)
+{
+   return pm_generic_runtime_idle(dev);
+}
+
+static const struct dev_pm_ops devfreq_pm_ops = {
+   SET_RUNTIME_PM_OPS(
+   devfreq_runtime_suspend,
+   devfreq_runtime_resume,
+   devfreq_runtime_idle
+   )
+};
+
+static struct class devfreq_class = {
+   .name = "devfreq",
+   .owner = THIS_MODULE,
+   .pm = &devfreq_pm_ops,
+};
+
+/**
  * devfreq_add_device() - Add devfreq feature to the device
  * @dev:   the device to add devfreq feature.
  * @profile:   device-specific profile to run devfreq.
@@ -454,8 +540,9 @@ struct devfreq *devfreq_add_device(struct device *dev,
 
mutex_init(&devfreq->lock);
mutex_lock(&devfreq->lock);
+   dev->class = &devfreq_class;
devfreq->dev.parent = dev;
-   devfreq->dev.class = devfreq_class;
+   devfreq->dev.class = &devfreq_class;
devfreq->dev.release = devfreq_dev_release;
devfreq->profile = profile;
strncpy(devfreq->governor_name, governor_name, DEVFREQ_NAME_LEN);
@@ -498,6 +585,9 @@ struct devfreq *devfreq_add_device(struct device *dev,
goto err_init;
}
 
+   pm_runtime_get_noresume(dev);
+   pm_runtime_set_active(dev);
+
return devfreq;
 
 err_init:
@@ -526,40 +616,6 @@ int devfreq_remove_device(struct devfreq *devfreq)
 EXPORT_SYMBOL(devfreq_remove_device);
 
 /**
- * devfreq_suspend_device() - Suspend devfreq of a device.
- * @devfreq: the devfreq instance to be suspended
- */
-int devfreq_suspend_device(struct devfreq *devfreq)
-{
-   if (!devfreq)
-   return -EINVAL;
-
-   if (!devfreq->governor)
-   return 0;
-
-   return devfreq->governor->event_handler(devfreq,
-   DEVFREQ_GOV_SUSPEND, NULL);
-}
-EXPORT_SYMBOL(devfreq_suspend_device);
-
-/**
- * devfreq_resume_device() - Resume devfreq of a device.
- * @devfreq: the devfreq instance to be resumed
- */
-int devfreq_resume_device(struct devfreq *devfreq)
-{
-   if (!devfreq)
-   return -EINVAL;
-
-   if (!devfreq->governor)
-   return 0;
-
-   return devfreq->governor->event_handler(devfreq,
- 

Re: [PATCH 10/10] staging: cxt1e1: sbecrc.c: fixes 80+ char line length issue

2012-11-21 Thread Dan Carpenter
On Tue, Nov 20, 2012 at 07:28:52PM +0200, Johan Meiring wrote:
> This commit sorts out a single case where a line was longer than
> 80 characters.
> 
> Signed-off-by: Johan Meiring 
> ---
>  drivers/staging/cxt1e1/sbecrc.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/staging/cxt1e1/sbecrc.c b/drivers/staging/cxt1e1/sbecrc.c
> index 87512a5..59dd7e2 100644
> --- a/drivers/staging/cxt1e1/sbecrc.c
> +++ b/drivers/staging/cxt1e1/sbecrc.c
> @@ -101,7 +101,8 @@ sbeCrc(u_int8_t *buffer,  /* data buffer to crc */
>   tbl = &CRCTable;
>   genCrcTable(tbl);
>  #else
> - tbl = (u_int32_t *) OS_kmalloc(CRC_TABLE_ENTRIES * 
> sizeof(u_int32_t));
> + tbl = (u_int32_t *) OS_kmalloc(CRC_TABLE_ENTRIES
> + * sizeof(u_int32_t));

The way we would normally break this is:

tbl = (u_int32_t *)OS_kmalloc(CRC_TABLE_ENTRIES *
  sizeof(u_int32_t));

* goes on the first line so that it shows this is a partial line.
The sizeof() lines up with the first parameter.  You will have to
use spaces since it's not exactly on a tab stop.

But really it's better to just get rid of the call to OS_kmalloc().

tbl = kmalloc(CRC_TABLE_ENTRIES * sizeof(*tbl), GFP_KERNEL);

OS_kmalloc() adds a GFP_DMA flag and a memset() but it's not needed
here.
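
Something like this, as a sketch -- the surrounding code isn't shown in
this hunk, so the error handling is a guess:

	tbl = kmalloc(CRC_TABLE_ENTRIES * sizeof(*tbl), GFP_KERNEL);
	if (!tbl)
		return;		/* adjust to sbeCrc()'s actual error convention */
	genCrcTable(tbl);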

regards,
dan carpenter

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[regression] 3.7+ suspend to RAM/offline CPU fails with nmi_watchdog=0 (bisected)

2012-11-21 Thread Norbert Warmuth
3.7-rc6 booted with nmi_watchdog=0 fails to suspend to RAM or to
offline CPUs. It's reproducible with a KVM guest and on a physical
system.

git bisect identified (config used for bisecting attached):
  commit bcd951cf10f24e341defcd002c15a1f4eea13ddb
  Author: Thomas Gleixner 
  Date:   Mon Jul 16 10:42:38 2012 +
  
  watchdog: Use hotplug thread infrastructure

(re-)tested with:
- uname -m: x86_64
- getconf _NPROCESSORS_ONLN: 8
- kernel: 3.7-rc6
- cpuid (physical system): "Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz"

Tests:
  echo processors > /sys/power/pm_test
  echo mem > /sys/power/state

Results:
- nmi_watchdog=1: OK

- nmi_watchdog=1, echo 0 > /proc/sys/kernel/nmi_watchdog: OK

- nmi_watchdog=0: FAIL
  Disabling non-boot CPUs ...
  Unregister pv shared memory for cpu 1
  [hang, reset required]

- nmi_watchdog=0, echo 0 > /sys/devices/system/cpu/cpu7/online: FAIL
  Unregister pv shared memory for cpu 7
  [hang, reset required]

- nmi_watchdog=0, bcd951cf10f reverted: OK

I used to require nmi_watchdog=0 for virtualization, but have not
verified that lately; a quick search finds references related to
oprofile and power saving.

Norbert

CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CPU_AUTOPROBE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx 
-fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 
-fcall-saved-r11"
CONFIG_ARCH_CPU_PROBE_RELEASE=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_HAVE_IRQ_WORK=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

CONFIG_EXPERIMENTAL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
CONFIG_LOCALVERSION=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_KERNEL_GZIP=y
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_FHANDLE=y
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y
CONFIG_HAVE_GENERIC_HARDIRQS=y

CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

CONFIG_TICK_CPU_ACCOUNTING=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y

CONFIG_TREE_RCU=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
CONFIG_RCU_FAST_NO_HZ=y
CONFIG_LOG_BUF_SHIFT=17
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_CGROUPS=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_RESOURCE_COUNTERS=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_BLK_CGROUP=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
CONFIG_UID16=y
CONFIG_KALLSYMS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_HAVE_PERF_EVENTS=y

CONFIG_PERF_EVENTS=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLAB=y
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_OPROFILE=m
CONFIG_HAVE_OPROFILE=y
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_KPROBES=y
CONF

Re: [PATCH 06/10] staging: cxt1e1: musycc.c: fixes placement of parentheses

2012-11-21 Thread Dan Carpenter
On Tue, Nov 20, 2012 at 07:28:48PM +0200, Johan Meiring wrote:
> This commit fixes several incorrect placements of parantheses, as
> identified by the checkpatch.pl tool.
> 

This patch is fine, and all.
Acked-by: Dan Carpenter 

But you could go beyond fixing just checkpatch.pl warnings.

> Signed-off-by: Johan Meiring 
> ---
>  drivers/staging/cxt1e1/musycc.c |  490 
> +++
>  1 file changed, 245 insertions(+), 245 deletions(-)
> 
> diff --git a/drivers/staging/cxt1e1/musycc.c b/drivers/staging/cxt1e1/musycc.c
> index 42e1ca4..b2cc68a 100644
> --- a/drivers/staging/cxt1e1/musycc.c
> +++ b/drivers/staging/cxt1e1/musycc.c
> @@ -60,21 +60,21 @@ extern ci_t *CI;/* dummy pointr to board 
> ZEROE's data - DEBUG
>  
>  /***/

This line could be deleted.

>  /* forward references */

Obvious comment is obvious.

> -voidc4_fifo_free (mpi_t *, int);
> -voidc4_wk_chan_restart (mch_t *);
> -voidmusycc_bh_tx_eom (mpi_t *, int);
> -int musycc_chan_up (ci_t *, int);
> -status_t __init musycc_init (ci_t *);
> -STATIC void __init musycc_init_port (mpi_t *);
> -voidmusycc_intr_bh_tasklet (ci_t *);
> -voidmusycc_serv_req (mpi_t *, u_int32_t);
> -voidmusycc_update_timeslots (mpi_t *);
> +voidc4_fifo_free(mpi_t *, int);
> +voidc4_wk_chan_restart(mch_t *);
> +voidmusycc_bh_tx_eom(mpi_t *, int);
> +int musycc_chan_up(ci_t *, int);
> +status_t __init musycc_init(ci_t *);
> +STATIC void __init musycc_init_port(mpi_t *);
> +voidmusycc_intr_bh_tasklet(ci_t *);
> +voidmusycc_serv_req(mpi_t *, u_int32_t);
> +voidmusycc_update_timeslots(mpi_t *);


These would look better done properly.

void musycc_serv_req(mpi_t *pi, u_int32_t req);

Keep the parameter names because they serve as documentation.  And
actually, they should be moved to a header file which is included
instead of declared in the .c files where they are used.
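
Something along these lines, as a sketch (the header name and parameter names
are made up; the driver's own type headers would need to be pulled in first):

	/* drivers/staging/cxt1e1/musycc_proto.h -- hypothetical shared header */
	#ifndef _CXT1E1_MUSYCC_PROTO_H_
	#define _CXT1E1_MUSYCC_PROTO_H_

	void c4_fifo_free(mpi_t *pi, int chan);
	void c4_wk_chan_restart(mch_t *ch);
	void musycc_bh_tx_eom(mpi_t *pi, int gchan);
	int musycc_chan_up(ci_t *ci, int channum);
	void musycc_serv_req(mpi_t *pi, u_int32_t req);
	void musycc_update_timeslots(mpi_t *pi);
	/* ... and so on for the rest */

	#endif /* _CXT1E1_MUSYCC_PROTO_H_ */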

regards,
dan carpenter

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Problem in Page Cache Replacement

2012-11-21 Thread Fengguang Wu
On Wed, Nov 21, 2012 at 05:02:04PM +0800, Fengguang Wu wrote:
> On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
> > Cc Fengguang Wu.
> > 
> > On 11/21/2012 04:13 PM, metin d wrote:
> > >>   Curious. Added linux-mm list to CC to catch more attention. If you run
> > >>echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
> > >I'm guessing it'd evict the entries, but am wondering if we could run any 
> > >more diagnostics before trying this.
> > >
> > >We regularly use a setup where we have two databases; one gets used 
> > >frequently and the other one about once a month. It seems like the memory 
> > >manager keeps unused pages in memory at the expense of frequently used 
> > >database's performance.
> 
> > >My understanding was that under memory pressure from heavily
> > >accessed pages, unused pages would eventually get evicted. Is there
> > >anything else we can try on this host to understand why this is
> > >happening?
> 
> We may debug it this way.

Better to add a step

0) run 'page-types -r' to get an initial view of the page cache
   status.

Thanks,
Fengguang

> 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
>(please double check via /proc/vmstat whether it does the expected work)
> 
> 2) run 'page-types -r' with root, to view the page status for the
>remaining pages of data-1
> 
> The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
> Please compile them with options "-Dlinux -I. -D_GNU_SOURCE 
> -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"
> 
> page-types can be found in the kernel source tree tools/vm/page-types.c
> 
> Sorry that sounds a bit twisted.. I do have a patch to directly dump
> page cache status of a user specified file, however it's not
> upstreamed yet.
> 
> Thanks,
> Fengguang
> 
> > >On Tue 20-11-12 09:42:42, metin d wrote:
> > >>I have two PostgreSQL databases named data-1 and data-2 that sit on the
> > >>same machine. Both databases keep 40 GB of data, and the total memory
> > >>available on the machine is 68GB.
> > >>
> > >>I started data-1 and data-2, and ran several queries to go over all their
> > >>data. Then, I shut down data-1 and kept issuing queries against data-2.
> > >>For some reason, the OS still holds on to large parts of data-1's pages
> > >>in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> > >>a result, my queries on data-2 keep hitting disk.
> > >>
> > >>I'm checking page cache usage with fincore. When I run a table scan query
> > >>against data-2, I see that data-2's pages get evicted and put back into
> > >>the cache in a round-robin manner. Nothing happens to data-1's pages,
> > >>although they haven't been touched for days.
> > >>
> > >>Does anybody know why data-1's pages aren't evicted from the page cache?
> > >>I'm open to all kind of suggestions you think it might relate to problem.
> > >   Curious. Added linux-mm list to CC to catch more attention. If you run
> > >echo 1 >/proc/sys/vm/drop_caches
> > >   does it evict data-1 pages from memory?
> > >
> > >>This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> > >>swap space. The kernel version is:
> > >>
> > >>$ uname -r
> > >>3.2.28-45.62.amzn1.x86_64
> > >>Edit:
> > >>
> > >>and it seems that I use one NUMA instance, if  you think that it can a 
> > >>problem.
> > >>
> > >>$ numactl --hardware
> > >>available: 1 nodes (0)
> > >>node 0 cpus: 0 1 2 3 4 5 6 7
> > >>node 0 size: 70007 MB
> > >>node 0 free: 360 MB
> > >>node distances:
> > >>node   0
> > >>0:  10

> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> #include "fadvise.h"
> 
> char *progname;
> 
> static void usage(void)
> {
>   fprintf(stderr, "Usage: %s filename offset length advice [loops]\n", 
> progname);
>   fprintf(stderr, "  advice: normal sequential willneed noreuse "
>   "dontneed asyncwrite writewait\n");
>   exit(1);
> }
> 
> int
> main(int argc, char *argv[])
> {
>   int c;
>   int fd;
>   char *sadvice;
>   char *filename;
>   loff_t offset;
>   unsigned long length;
>   int advice = 0;
>   int ret;
>   int loops = 1;
> 
>   progname = argv[0];
> 
>   while ((c = getopt(argc, argv, "")) != -1) {
>   switch (c) {
>   }
>   }
> 
>   if (optind == argc)
>   usage();
>   filename = argv[optind++];
> 
>   if (optind == argc)
>   usage();
>   offset = strtoull(argv[optind++], NULL, 0);
> 
>   if (optind == argc)
>   usage();
>   length = strtol(argv[optind++], NULL, 0);
> 
>   if (optind == argc)
>   usage();
>   sadvice = argv[optind++];
> 
>   if (optind != argc)
>   loops = strtol(argv[optind++], NULL, 0);
> 
>   if (optind != argc)
>   usage();
> 
>   if (!strcmp(sadvice, "normal"))
>   advice = POSIX_FADV_NORMAL;
>   else i

Re: [GIT PULL] Calxeda cpuidle support

2012-11-21 Thread Olof Johansson
On Mon, Nov 19, 2012 at 08:45:53AM -0600, Rob Herring wrote:
> Arnd, Olof,
> 
> Please pull cpuidle support for highbank. This is the first driver in
> drivers/cpuidle. All the existing drivers for ARM are in arch/arm. I'm
> asking for you to pull since there seems to be a lack of maintainer for
> cpuidle drivers.
> 
> Regards,
> Rob
> 
> The following changes since commit 8f0d8163b50e01f398b14bcd4dc039ac5ab18d64:
> 
>   Linux 3.7-rc3 (2012-10-28 12:24:48 -0700)
> 
> are available in the git repository at:
> 
>   git://sources.calxeda.com/kernel/linux.git tags/highbank-cpuidle

Thanks, pulled into next/soc.


-Olof
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] mm: dmapool: use provided gfp flags for all dma_alloc_coherent() calls

2012-11-21 Thread Marek Szyprowski

Hello,

On 11/21/2012 9:36 AM, Andrew Morton wrote:

On Wed, 21 Nov 2012 09:08:52 +0100 Marek Szyprowski  
wrote:

> Hello,
>
> On 11/20/2012 8:33 PM, Andrew Morton wrote:
> > On Tue, 20 Nov 2012 15:31:45 +0100
> > Marek Szyprowski  wrote:
> >
> > > dmapool always calls dma_alloc_coherent() with GFP_ATOMIC flag,
> > > regardless the flags provided by the caller. This causes excessive
> > > pruning of emergency memory pools without any good reason. Additionaly,
> > > on ARM architecture any driver which is using dmapools will sooner or
> > > later  trigger the following error:
> > > "ERROR: 256 KiB atomic DMA coherent pool is too small!
> > > Please increase it with coherent_pool= kernel parameter!".
> > > Increasing the coherent pool size usually doesn't help much and only
> > > delays such error, because all GFP_ATOMIC DMA allocations are always
> > > served from the special, very limited memory pool.
> > >
> >
> > Is this problem serious enough to justify merging the patch into 3.7?
> > And into -stable kernels?
>
> I wonder if it is a good idea to merge such change at the end of current
> -rc period.

I'm not sure what you mean by this.

But what we do sometimes if we think a patch needs a bit more
real-world testing before backporting is to merge it into -rc1 in the
normal merge window, and tag it for -stable backporting.  That way it
gets a few weeks(?) testing in mainline before getting backported.


I just wondered that if it gets merged into v3.7-rc7 there won't be much time
for real-world testing before the final v3.7 release. This patch has been in
linux-next for over a week and I'm not aware of any issues, but -rc releases
get much more attention and testing than the linux-next tree.

If you think it's fine to put such a change into v3.7-rc7, I will send a pull
request and tag it for stable asap.

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications

2012-11-21 Thread Glauber Costa
On 11/21/2012 12:46 PM, Anton Vorontsov wrote:
> On Wed, Nov 21, 2012 at 12:27:28PM +0400, Glauber Costa wrote:
>> On 11/20/2012 10:23 PM, David Rientjes wrote:
>>> Anton can correct me if I'm wrong, but I certainly don't think this is 
>>> where mempressure is headed: I don't think any accounting needs to be done
> 
> Yup, I'd rather not do any accounting, at least not in bytes.

It doesn't matter here, but memcg doesn't do any accounting in bytes either.
It only displays it in bytes; internally, it's all pages. The bytes
representation is convenient because it lets you stay agnostic of
page sizes.

> 
>>> and, if it is, it's a design issue that should be addressed now rather 
>>> than later.  I believe notifications should occur on current's mempressure 
>>> cgroup depending on its level of reclaim: nobody cares if your memcg has a 
>>> limit of 64GB when you only have 32GB of RAM, we'll want the notification.
>>
>> My main concern is that to trigger those notifications, one would have
>> to first determine whether or not the particular group of tasks is under
>> pressure.
> 
> As far as I understand, the notifications will be triggered by a process
> that tries to allocate memory. So, effectively that would be a per-process
> pressure.
> 
> So, if one process in a group is suffering, we notify that "a process in a
> group is under pressure", and the notification goes to a cgroup listener


If you effectively have a per-process mechanism, why do you need an
extra cgroup at all?

It seems to me that this is simply something that should be inherited
over fork, and then you register the notifier in your first process, and
it will be valid for everybody in the process tree.

If you need tasks in different processes to respond to the same
notifier, then you just register the same notifier in two different
processes.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] of/i2c: check status property for i2c client

2012-11-21 Thread Bongkyu Kim
Because of_i2c_register_devices() does not check the status property,
all i2c clients are registered.

This patch adds a status property check for i2c clients. With it, an
i2c client is registered only if its status property is absent, "okay"
or "ok".

Signed-off-by: Bongkyu Kim 
---
 drivers/of/of_i2c.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/of/of_i2c.c b/drivers/of/of_i2c.c
index 3550f3b..2552fc5 100644
--- a/drivers/of/of_i2c.c
+++ b/drivers/of/of_i2c.c
@@ -37,6 +37,9 @@ void of_i2c_register_devices(struct i2c_adapter *adap)
 
dev_dbg(&adap->dev, "of_i2c: register %s\n", node->full_name);
 
+   if (!of_device_is_available(node))
+   continue;
+
if (of_modalias_node(node, info.type, sizeof(info.type)) < 0) {
dev_err(&adap->dev, "of_i2c: modalias failure on %s\n",
node->full_name);
-- 
1.8.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] of/i2c: support more interrupt specifiers

2012-11-21 Thread Bongkyu Kim
This patch adds support for additional interrupt specifiers on i2c client nodes.

Signed-off-by: Bongkyu Kim 
---
 drivers/of/of_i2c.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/of/of_i2c.c b/drivers/of/of_i2c.c
index 3550f3b..c6d9b7e 100644
--- a/drivers/of/of_i2c.c
+++ b/drivers/of/of_i2c.c
@@ -34,6 +34,7 @@ void of_i2c_register_devices(struct i2c_adapter *adap)
struct dev_archdata dev_ad = {};
const __be32 *addr;
int len;
+   int nr = 0;
 
dev_dbg(&adap->dev, "of_i2c: register %s\n", node->full_name);
 
@@ -57,7 +58,9 @@ void of_i2c_register_devices(struct i2c_adapter *adap)
continue;
}
 
-   info.irq = irq_of_parse_and_map(node, 0);
+   info.irq = irq_of_parse_and_map(node, nr++);
+   while (irq_of_parse_and_map(node, nr))
+   nr++;
info.of_node = of_node_get(node);
info.archdata = &dev_ad;
 
-- 
1.8.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications

2012-11-21 Thread Kirill A. Shutemov
On Tue, Nov 20, 2012 at 10:02:45AM -0800, David Rientjes wrote:
> On Mon, 19 Nov 2012, Glauber Costa wrote:
> 
> > >> In the case I outlined below, for backwards compatibility. What I
> > >> actually mean is that memcg *currently* allows arbitrary notifications.
> > >> One way to merge those, while moving to a saner 3-point notification, is
> > >> to still allow the old writes and fit them in the closest bucket.
> > >>
> > > 
> > > Yeah, but I'm wondering why three is the right answer.
> > > 
> > 
> > This is unrelated to what I am talking about.
> > I am talking about pre-defined values with a specific event meaning (in
> > his patchset, 3) vs arbitrary numbers valued in bytes.
> > 
> 
> Right, and I don't see how you can map the memcg thresholds onto Anton's 
> scheme

BTW, there's an interface for OOM notification in memcg; see oom_control.
I guess other pressure levels could also fit into that interface.
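
For reference, a minimal userspace sketch of how that notification interface
is used today (error handling omitted; the group path is made up and assumes
a cgroup-v1 memcg mount):

	#include <sys/eventfd.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		int efd = eventfd(0, 0);
		int oom = open("/sys/fs/cgroup/memory/mygroup/memory.oom_control",
			       O_RDONLY);
		int ctl = open("/sys/fs/cgroup/memory/mygroup/cgroup.event_control",
			       O_WRONLY);
		char buf[32];
		uint64_t cnt;

		/* register: "<eventfd> <fd of memory.oom_control>" */
		snprintf(buf, sizeof(buf), "%d %d", efd, oom);
		write(ctl, buf, strlen(buf));

		read(efd, &cnt, sizeof(cnt));	/* blocks until the group hits OOM */
		printf("OOM event received\n");
		return 0;
	}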

-- 
 Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH, v2] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones

2012-11-21 Thread Ingo Molnar

* David Rientjes  wrote:

> Ok, this is significantly better, it almost cut the regression 
> in half on my system. [...]

The other half still seems to be related to the emulation faults 
that I fixed in the other patch:

>  0.49%  [kernel]  [k] page_fault  
>  
>  0.06%  [kernel]  [k] emulate_vsyscall
>  

Plus TLB flush costs:

>  0.13%  [kernel]  [k] generic_smp_call_function_interrupt
>  0.08%  [kernel]  [k] flush_tlb_func

for which you should try the third patch I sent.

So please try all my fixes - the easiest way to do that would be 
to try the latest tip:master that has all related fixes 
integrated and send me a new perf top output - most page fault 
and TLB flush overhead should be gone from the profile.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: numa/core regressions fixed - more testers wanted

2012-11-21 Thread Ingo Molnar

* David Rientjes  wrote:

> I started profiling on a new machine that is an exact 
> duplicate of the 16-way, 4 node, 32GB machine I was profiling 
> with earlier to rule out any machine-specific problems.  I 
> pulled master and ran new comparisons with THP enabled at 
> c418de93e398 ("Merge branch 'x86/mm'"):
> 
>   CONFIG_NUMA_BALANCING disabled  136521.55 SPECjbb2005 bops
>   CONFIG_NUMA_BALANCING enabled   132476.07 SPECjbb2005 bops 
> (-3.0%)
> 
> Aside: neither 4739578c3ab3 ("x86/mm: Don't flush the TLB on 
> #WP pmd fixups") nor 01e9c2441eee ("x86/vsyscall: Add Kconfig 
> option to use native vsyscalls and switch to it") 
> significantly improved upon the throughput on this system.

Could you please send an updated profile done with latest -tip? 
The last profile I got from you still had the vsyscall emulation 
page faults in it.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v11] kvm: notify host when the guest is panicked

2012-11-21 Thread Gleb Natapov
On Tue, Nov 20, 2012 at 07:33:49PM -0200, Marcelo Tosatti wrote:
> On Tue, Nov 20, 2012 at 06:09:48PM +0800, Hu Tao wrote:
> > Hi Marcelo,
> > 
> > On Tue, Nov 13, 2012 at 12:19:08AM -0200, Marcelo Tosatti wrote:
> > > On Fri, Nov 09, 2012 at 03:17:39PM -0500, Sasha Levin wrote:
> > > > On Mon, Nov 5, 2012 at 8:58 PM, Hu Tao  wrote:
> > > > > But in the case of panic notification, more dependency means more
> > > > > chances of failure of panic notification. Say, if we use a virtio 
> > > > > device
> > > > > to do panic notification, then we will fail if: virtio itself has
> > > > > problems, virtio for some reason can't be deployed(neither built-in or
> > > > > as a module), or guest doesn't support virtio, etc.
> > > > 
> > > > Add polling to your virtio device. If it didn't notify of a panic but
> > > > taking more than 20 sec to answer your poll request you can assume
> > > > it's dead.
> > > > 
> > > > Actually, just use virtio-serial and something in userspace on the 
> > > > guest.
> > > 
> > > They want the guest to stop, so a memory dump can be taken by management
> > > interface.
> > > 
> > > Hu Tao, lets assume port I/O is the preferred method for communication.
> > 
> > Okey.
> > 
> > > Now, the following comments have still not been addressed:
> > > 
> > > 1) Lifecycle of the stopped guest and interaction with other stopped
> > > states in QEMU.
> > 
> > Patch 3 already deals with run state transitions. But in case I'm
> > missing something, could you be more specific?
> 
> - What are the possibilities during migration? Say:
>   - migration starts.
>   - guest panics.
>   - migration starts vm on other side?
> - Guest stopped due to EIO.
>   - guest vcpuN panics, VMEXIT but still outside QEMU.
>   - QEMU EIO error, stop vm.
>   - guest vcpuN completes, processes IO exit.
> - system_reset due to panic.
> - Add all possibilities that should be verified (that is, interaction 
> of this feature with other stopped states in QEMU).
> 
BTW I do remember getting asserts while using breakpoints via gdbstub
and stop/cont from the monitor.

> ---
> 
> - What happens if the guest has reboot-on-panic configured? Does it take
> precedence over hypervisor notification?
> 
> 
> 
> Out of curiosity, does kexec support memory dumping?
> 
> > > 2) Format of the interface for other architectures (you can choose
> > > a different KVM supported architecture and write an example).
> > > 
> > > 3) Clear/documented management interface for the feature.
> > 
> > It is documented in patch 0: Documentation/virtual/kvm/pv_event.txt.
> > Does it need to be improved?
> 
> This is documentation for the host<->guest interface. There is no 
> documentation on the interface for management.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
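
To make the port I/O approach discussed above concrete, a minimal guest-side
sketch follows. It is illustrative only: the port number and event value are
assumptions for illustration, not the ones defined by the patch set, and the
management-side behaviour is exactly what is still being debated above.

/* Hypothetical sketch: notify the host of a guest panic via a port write. */
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/notifier.h>
#include <asm/io.h>

#define PV_EVENT_PORT		0x505	/* assumption, not the real port */
#define PV_EVENT_PANICKED	1	/* assumption, not the real value */

static int pv_panic_notify(struct notifier_block *nb, unsigned long code,
			   void *unused)
{
	/* Tell the hypervisor we panicked so it can stop the VM and let
	 * management take a memory dump. */
	outb(PV_EVENT_PANICKED, PV_EVENT_PORT);
	return NOTIFY_DONE;
}

static struct notifier_block pv_panic_nb = {
	.notifier_call = pv_panic_notify,
};

static int __init pv_panic_init(void)
{
	atomic_notifier_chain_register(&panic_notifier_list, &pv_panic_nb);
	return 0;
}
core_initcall(pv_panic_init);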


[PATCH V4 1/3] arm: dma mapping: Export a dma ops function arm_dma_set_mask

2012-11-21 Thread Gregory CLEMENT
Expose another DMA operations function: arm_dma_set_mask. This
function will be added to a custom DMA ops structure for Armada 370/XP.
Depending on its configuration, Armada 370/XP can be set up as a "nearly"
coherent architecture. In this case the DMA ops are made of:
- functions specific to this architecture
- already exposed ARM DMA functions
- arm_dma_set_mask, which was not exposed yet.

Signed-off-by: Gregory CLEMENT 
---
 arch/arm/include/asm/dma-mapping.h |2 ++
 arch/arm/mm/dma-mapping.c  |4 +---
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm/include/asm/dma-mapping.h 
b/arch/arm/include/asm/dma-mapping.h
index 2300484..98d4dab 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -111,6 +111,8 @@ static inline void dma_free_noncoherent(struct device *dev, 
size_t size,
 
 extern int dma_supported(struct device *dev, u64 mask);
 
+extern int arm_dma_set_mask(struct device *dev, u64 dma_mask);
+
 /**
  * arm_dma_alloc - allocate consistent memory for DMA
  * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 58bc3e4..5383bc0 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -124,8 +124,6 @@ static void arm_dma_sync_single_for_device(struct device 
*dev,
__dma_page_cpu_to_dev(page, offset, size, dir);
 }
 
-static int arm_dma_set_mask(struct device *dev, u64 dma_mask);
-
 struct dma_map_ops arm_dma_ops = {
.alloc  = arm_dma_alloc,
.free   = arm_dma_free,
@@ -971,7 +969,7 @@ int dma_supported(struct device *dev, u64 mask)
 }
 EXPORT_SYMBOL(dma_supported);
 
-static int arm_dma_set_mask(struct device *dev, u64 dma_mask)
+int arm_dma_set_mask(struct device *dev, u64 dma_mask)
 {
if (!dev->dma_mask || !dma_supported(dev, dma_mask))
return -EIO;
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
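
To show how the newly exported helper is meant to be consumed, here is a
rough sketch of a custom dma_map_ops built from the generic ARM helpers.
The structure and attach helper below are illustrative only; the actual
Armada 370/XP ops added in patch 3/3 pick their own combination of helpers.

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <asm/dma-mapping.h>

/* Illustrative only: mix already exported ARM helpers with the
 * platform-specific hooks and the newly exported arm_dma_set_mask(). */
static struct dma_map_ops example_nearly_coherent_dma_ops = {
	.alloc		= arm_dma_alloc,	/* generic ARM helper */
	.free		= arm_dma_free,		/* generic ARM helper */
	.set_dma_mask	= arm_dma_set_mask,	/* exported by this patch */
	/* .map_page / .unmap_page would point at the platform hooks */
};

static void example_attach_dma_ops(struct device *dev)
{
	/* Switch one device over to the custom ops */
	set_dma_ops(dev, &example_nearly_coherent_dma_ops);
}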


[PATCH V4 3/3] arm: mvebu: Add hardware I/O Coherency support

2012-11-21 Thread Gregory CLEMENT
Armada 370 and XP come with a unit called the coherency fabric. This
unit allows the Armada 370/XP to be used as a nearly coherent
architecture. The coherency mechanism uses snoop filters to ensure the
coherency between caches, DRAM and devices. This mechanism needs a
synchronization barrier which guarantees that all the memory writes
initiated by the devices have reached their target and do not reside in
intermediate write buffers. That's why the architecture is not totally
coherent and we need to provide our own functions for some DMA
operations.

Besides the use of the coherency fabric, the device units have to set
the attribute flag of the decoding address window to select the
appropriate coherency process for the memory transaction. This is done
when each device driver programs the DRAM address windows. The value of
the attribute set by the driver is retrieved through the
orion_addr_map_cfg struct filled during the early initialization of the
platform.

Signed-off-by: Gregory CLEMENT 
Reviewed-by: Yehuda Yitschak 
---
 .../devicetree/bindings/arm/coherency-fabric.txt   |9 ++-
 arch/arm/boot/dts/armada-370-xp.dtsi   |3 +-
 arch/arm/mach-mvebu/addr-map.c |3 +
 arch/arm/mach-mvebu/coherency.c|   73 
 4 files changed, 85 insertions(+), 3 deletions(-)

diff --git a/Documentation/devicetree/bindings/arm/coherency-fabric.txt 
b/Documentation/devicetree/bindings/arm/coherency-fabric.txt
index 2bfbf67..17d8cd1 100644
--- a/Documentation/devicetree/bindings/arm/coherency-fabric.txt
+++ b/Documentation/devicetree/bindings/arm/coherency-fabric.txt
@@ -5,12 +5,17 @@ Available on Marvell SOCs: Armada 370 and Armada XP
 Required properties:
 
 - compatible: "marvell,coherency-fabric"
-- reg: Should contain,coherency fabric registers location and length.
+
+- reg: Should contain coherency fabric registers location and
+  length. First pair for the coherency fabric registers, second pair
+  for the per-CPU fabric registers.
 
 Example:
 
 coherency-fabric@d0020200 {
compatible = "marvell,coherency-fabric";
-   reg = <0xd0020200 0xb0>;
+   reg = <0xd0020200 0xb0>,
+   <0xd0021810 0x1c>;
+
 };
 
diff --git a/arch/arm/boot/dts/armada-370-xp.dtsi 
b/arch/arm/boot/dts/armada-370-xp.dtsi
index b0d075b..98a6b26 100644
--- a/arch/arm/boot/dts/armada-370-xp.dtsi
+++ b/arch/arm/boot/dts/armada-370-xp.dtsi
@@ -38,7 +38,8 @@
 
coherency-fabric@d0020200 {
compatible = "marvell,coherency-fabric";
-   reg = <0xd0020200 0xb0>;
+   reg = <0xd0020200 0xb0>,
+ <0xd0021810 0x1c>;
};
 
soc {
diff --git a/arch/arm/mach-mvebu/addr-map.c b/arch/arm/mach-mvebu/addr-map.c
index fe454a4..595f6b7 100644
--- a/arch/arm/mach-mvebu/addr-map.c
+++ b/arch/arm/mach-mvebu/addr-map.c
@@ -108,6 +108,9 @@ static int __init armada_setup_cpu_mbus(void)
 
addr_map_cfg.bridge_virt_base = mbus_unit_addr_decoding_base;
 
+   if (of_find_compatible_node(NULL, NULL, "marvell,coherency-fabric"))
+   addr_map_cfg.hw_io_coherency = 1;
+
/*
 * Disable, clear and configure windows.
 */
diff --git a/arch/arm/mach-mvebu/coherency.c b/arch/arm/mach-mvebu/coherency.c
index 1bc02d0..9413bd5 100644
--- a/arch/arm/mach-mvebu/coherency.c
+++ b/arch/arm/mach-mvebu/coherency.c
@@ -22,6 +22,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include "armada-370-xp.h"
 
@@ -32,10 +34,13 @@
  * value matching its virtual mapping
  */
 static void __iomem *coherency_base = ARMADA_370_XP_REGS_VIRT_BASE + 0x20200;
+static void __iomem *coherency_cpu_base;
 
 /* Coherency fabric registers */
 #define COHERENCY_FABRIC_CFG_OFFSET   0x4
 
+#define IO_SYNC_BARRIER_CTL_OFFSET0x0
+
 static struct of_device_id of_coherency_table[] = {
{.compatible = "marvell,coherency-fabric"},
{ /* end of list */ },
@@ -66,6 +71,70 @@ int set_cpu_coherent(unsigned int hw_cpu_id, int 
smp_group_id)
return 0;
 }
 
+static inline void mvebu_hwcc_sync_io_barrier(void)
+{
+   writel(0x1, coherency_cpu_base + IO_SYNC_BARRIER_CTL_OFFSET);
+   while (readl(coherency_cpu_base + IO_SYNC_BARRIER_CTL_OFFSET) & 0x1);
+}
+
+static dma_addr_t mvebu_hwcc_dma_map_page(struct device *dev, struct page 
*page,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs)
+{
+   if (dir != DMA_TO_DEVICE)
+   mvebu_hwcc_sync_io_barrier();
+   return pfn_to_dma(dev, page_to_pfn(page)) + offset;
+}
+
+
+static void mvebu_hwcc_dma_unmap_page(struct device *dev, dma_addr_t 
dma_handle,
+ size_t size, enum dma_data_direction dir,
+ struct dma_attrs *attrs)
+{
+   if (dir != DMA_TO_DEVICE)
+   mvebu_

[PATCH V4 2/3] arm: plat-orion: Add coherency attribute when setup mbus target

2012-11-21 Thread Gregory CLEMENT
Recent SoCs such as Armada 370/XP come with the possibility to handle
I/O coherency in hardware. In this case the transaction attribute of
the window must be flagged as "Shared transaction". Once this flag is
set, the transactions are forced through the coherency block; otherwise
the transaction is driven directly to DRAM.

Signed-off-by: Gregory CLEMENT 
Reviewed-by: Yehuda Yitschak 
Acked-by: Thomas Petazzoni 
---
 arch/arm/plat-orion/addr-map.c  |4 
 arch/arm/plat-orion/include/plat/addr-map.h |1 +
 2 files changed, 5 insertions(+)

diff --git a/arch/arm/plat-orion/addr-map.c b/arch/arm/plat-orion/addr-map.c
index a7b8060..febe386 100644
--- a/arch/arm/plat-orion/addr-map.c
+++ b/arch/arm/plat-orion/addr-map.c
@@ -42,6 +42,8 @@ EXPORT_SYMBOL_GPL(mv_mbus_dram_info);
 #define WIN_REMAP_LO_OFF   0x0008
 #define WIN_REMAP_HI_OFF   0x000c
 
+#define ATTR_HW_COHERENCY  (0x1 << 4)
+
 /*
  * Default implementation
  */
@@ -163,6 +165,8 @@ void __init orion_setup_cpu_mbus_target(const struct 
orion_addr_map_cfg *cfg,
w = &orion_mbus_dram_info.cs[cs++];
w->cs_index = i;
w->mbus_attr = 0xf & ~(1 << i);
+   if (cfg->hw_io_coherency)
+   w->mbus_attr |= ATTR_HW_COHERENCY;
w->base = base & 0x;
w->size = (size | 0x) + 1;
}
diff --git a/arch/arm/plat-orion/include/plat/addr-map.h 
b/arch/arm/plat-orion/include/plat/addr-map.h
index ec63e4a..b76c065 100644
--- a/arch/arm/plat-orion/include/plat/addr-map.h
+++ b/arch/arm/plat-orion/include/plat/addr-map.h
@@ -17,6 +17,7 @@ struct orion_addr_map_cfg {
const int num_wins; /* Total number of windows */
const int remappable_wins;
void __iomem *bridge_virt_base;
+   int hw_io_coherency;
 
/* If NULL, the default cpu_win_can_remap will be used, using
   the value in remappable_wins */
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Streamlining Developer's Certificate of Origin, Signed-off-by tag

2012-11-21 Thread Jiri Slaby
On 11/21/2012 02:13 AM, Luis R. Rodriguez wrote:
> Ah so keep the original in place to let references to the original in
> whatever way those may exist to keep pointing but promote new usage to
> a copy and.. perhaps refer to the new copy in master, or just leave
> that in place as is?

It depends on whether they really want to have the same thing we do.
I.e. don't they want to rephrase the document a bit? If so, there is no
point in linking the document at all.

If not, we can create a separate document from the one in the kernel so
that people can link to it at some fixed version using a git commit SHA.
This can easily be done with a link to git.kernel.org.

The link to git.kernel.org might seem long. One could create a dynamic
helper on some web page, like signed-off-by.cgi?id=SHA, that returns
the document at that version (basically a redirect).

regards,
-- 
js
suse labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH V4 0/3] Add hardware I/O coherency support for Armada 370/XP

2012-11-21 Thread Gregory CLEMENT
The purpose of this patch set is to add hardware I/O coherency support
for Armada 370 and Armada XP. These SoCs come with a unit called the
coherency fabric. Initial support for this unit was introduced with the
SMP patch set. This series extends that support: the coherency fabric
unit allows the Armada XP and the Armada 370 to be used as nearly
coherent architectures.

The third patch enables this new feature and registers our own set of
DMA ops, to benefit from this hardware enhancement.

The first patch exports the DMA operation function needed to register
our own set of DMA ops.

The second patch introduces a new flag for the address decoding
configuration in order to be able to set the memory windows as
shared memory.

This series depends on the SMP patch set (V5 was posted a few minutes
earlier).

The git branch called HWIOCC-for-3.8-V3 is also available at
https://github.com/MISL-EBU-System-SW/mainline-public.git.

Changelog:
V3->V4:
- Exposed only the needed dma ops function

V2 -> V3:
- Rebased on to ArmadaXP-SMP-for-3.8-V5
- Use the coherent version of dma ops for .alloc() and .free()

V1 -> V2:
- Rebased on to v3.7-rc5
- Added a new patch to exports the dma ops functions
- Renamed the function for a more generic name mvebu_hwcc
- removed the non SMP case during init
- spelling and wording issues
- updating the binding documentation for coherency fabric

Gregory CLEMENT (3):
  arm: dma mapping: Export a dma ops function arm_dma_set_mask
  arm: plat-orion: Add coherency attribute when setup mbus target
  arm: mvebu: Add hardware I/O Coherency support

 .../devicetree/bindings/arm/coherency-fabric.txt   |9 ++-
 arch/arm/boot/dts/armada-370-xp.dtsi   |3 +-
 arch/arm/include/asm/dma-mapping.h |2 +
 arch/arm/mach-mvebu/addr-map.c |3 +
 arch/arm/mach-mvebu/coherency.c|   73 
 arch/arm/mm/dma-mapping.c  |4 +-
 arch/arm/plat-orion/addr-map.c |4 ++
 arch/arm/plat-orion/include/plat/addr-map.h|1 +
 8 files changed, 93 insertions(+), 6 deletions(-)

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Problem in Page Cache Replacement

2012-11-21 Thread Jaegeuk Hanse

On 11/21/2012 05:02 PM, Fengguang Wu wrote:

On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:

Cc Fengguang Wu.

On 11/21/2012 04:13 PM, metin d wrote:

   Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?

I'm guessing it'd evict the entries, but am wondering if we could run any more 
diagnostics before trying this.

We regularly use a setup where we have two databases; one gets used frequently 
and the other one about once a month. It seems like the memory manager keeps 
unused pages in memory at the expense of frequently used database's performance.
My understanding was that under memory pressure from heavily
accessed pages, unused pages would eventually get evicted. Is there
anything else we can try on this host to understand why this is
happening?

We may debug it this way.

1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
(please double check via /proc/vmstat whether it does the expected work)

2) run 'page-types -r' with root, to view the page status for the
remaining pages of data-1

The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 
-D_LARGEFILE64_SOURCE"

page-types can be found in the kernel source tree tools/vm/page-types.c

Sorry that sounds a bit twisted.. I do have a patch to directly dump
page cache status of a user specified file, however it's not
upstreamed yet.
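
For reference, the 'fadvise data-2 0 0 dontneed' step above boils down to a
single posix_fadvise() call over the whole file. A minimal stand-in for the
tool (illustrative only; the real ext3-tools program supports more advice
types) would be:

/* Minimal stand-in for "fadvise <file> 0 0 dontneed": ask the kernel to
 * drop the cached pages of the whole file. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	int fd, err;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* offset 0, len 0 means "to the end of the file" */
	err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	if (err)
		fprintf(stderr, "posix_fadvise: error %d\n", err);
	return err ? 1 : 0;
}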


Hi Fengguang,

Thanks for your detailed steps, I think metin can have a try.

         flags    page-count       MB  symbolic-flags    long-symbolic-flags
0x                    607699     2373  ___
0x0001                343227     1340  ___r___           reserved


But I have some questions about the page-types output:

Does 2373MB here mean the total memory in use, including page cache? I
don't think so.

Which kinds of pages get marked reserved?
Which line of long-symbolic-flags is for the page cache?

Regards,
Jaegeuk



Thanks,
Fengguang


On Tue 20-11-12 09:42:42, metin d wrote:

I have two PostgreSQL databases named data-1 and data-2 that sit on the
same machine. Both databases keep 40 GB of data, and the total memory
available on the machine is 68GB.

I started data-1 and data-2, and ran several queries to go over all their
data. Then, I shut down data-1 and kept issuing queries against data-2.
For some reason, the OS still holds on to large parts of data-1's pages
in its page cache, and reserves about 35 GB of RAM to data-2's files. As
a result, my queries on data-2 keep hitting disk.

I'm checking page cache usage with fincore. When I run a table scan query
against data-2, I see that data-2's pages get evicted and put back into
the cache in a round-robin manner. Nothing happens to data-1's pages,
although they haven't been touched for days.

Does anybody know why data-1's pages aren't evicted from the page cache?
I'm open to all kinds of suggestions you think might relate to the problem.

   Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches
   does it evict data-1 pages from memory?


This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
swap space. The kernel version is:

$ uname -r
3.2.28-45.62.amzn1.x86_64
Edit:

and it seems that I use a single NUMA node, if you think that could be a problem.

$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 70007 MB
node 0 free: 360 MB
node distances:
node   0
0:  10


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] checkpatch: add double empty line check

2012-11-21 Thread Eilon Greenstein
On Tue, 2012-11-20 at 15:41 -0800, Joe Perches wrote:
> On Tue, 2012-11-20 at 23:19 +, Andy Whitcroft wrote:
> > On Tue, Nov 20, 2012 at 01:58:48PM -0800, Joe Perches wrote:
> > 
> > > +# check for multiple blank lines, warn only on the second one in a block
> > > + if ($rawline =~ /^.\s*$/ &&
> > > + $prevrawline =~ /^.\s*$/ &&
> > > + $linenr != $last_blank_linenr + 1) {
> > > + CHK("DOUBLE_EMPTY_LINE",
> > > + "One blank line separating blocks is generally 
> > > sufficient\n" . $herecurr);
> > > + $last_blank_linenr = $linenr;
> > > + }
> > > +
> > >  # check for line continuations in quoted strings with odd counts of "
> > >   if ($rawline =~ /\\$/ && $rawline =~ tr/"/"/ % 2) {
> > >   WARN("LINE_CONTINUATIONS",
> > 
> > Pretty sure that will fail with combination which have removed lines.
> 
> Not as far as I can tell.
> Deleted lines followed by inserted lines seem
> to work OK.
> 
> This check is located after the test that ensures
> the current $line/$rawline is an insertion.
> 

But you do not look at the next line, so you will miss something like
that:

diff --git a/test.c b/test.c
index e3c46d4..e1c6ffc 100644
--- a/test.c
+++ b/test.c
@@ -15,7 +15,8 @@
  * something
  * something
  * something
- * next line was already empty */
+ * next line was already empty, but I'm adding another one now*/
+
 
 /* something else
  * something else

> > I have a version here which I am testing with the combinations I have
> > isolated to far ...
> 
> Enjoy.
> Can you please test my proposal against those combinations too?
> 

The way I see it, we have to handle the following cases:
a. The patch adds more than a single consecutive empty line - easy
enough, the only "problem" here is to warn only once and there are many
ways to do that.
b. The patch is adding a new empty line after an existing empty line -
for that, we must check the previous line.
c. The patch is adding a new empty line before an existing empty line -
for that, we must check the next line. If we are already checking the
next line, we can tell if this is the last empty line added and
therefore do not need to save anything in order to warn only once per
block.

My version of the patch addresses all 3 cases above, and I do not see
how we can do it without looking at the next line and the previous line
- so I think it is a valid approach.

The only identified downside is that it might fail to warn about double
empty lines if we ever find a diff utility that adds the deleted lines
after the inserted lines - but even in that case, it will not generate
any annoying false positives and no other perl warnings. To address
this, I could add a loop that looks forward to check whether the next
line after a newly added empty line is a deleted line, but I think this
is excessive and I would only be able to test it on manually generated
files, since the diff utilities I'm familiar with behave nicely and
delete before adding.

Anyway, I'm looking forward to your version.

Thanks,
Eilon


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/2] ARM: EXYNOS: Add aliases for i2c controller for exynos4

2012-11-21 Thread Kukjin Kim
Doug Anderson wrote:
> 
> This is similar to a recent commit for exynos5250 titled:
>   ARM: EXYNOS: Add aliases for i2c controller
> 
> Adding aliases will be useful to prevent warnings in a future
> change.  See:
>   i2c: s3c2410: Get the i2c bus number from alias id
> 
> Signed-off-by: Doug Anderson 
> 
> ---
>  arch/arm/boot/dts/exynos4.dtsi |   24 
>  1 files changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm/boot/dts/exynos4.dtsi
> b/arch/arm/boot/dts/exynos4.dtsi
> index a26c3dd..824d362 100644
> --- a/arch/arm/boot/dts/exynos4.dtsi
> +++ b/arch/arm/boot/dts/exynos4.dtsi
> @@ -28,6 +28,14 @@
>   spi0 = &spi_0;
>   spi1 = &spi_1;
>   spi2 = &spi_2;
> + i2c0 = &i2c_0;
> + i2c1 = &i2c_1;
> + i2c2 = &i2c_2;
> + i2c3 = &i2c_3;
> + i2c4 = &i2c_4;
> + i2c5 = &i2c_5;
> + i2c6 = &i2c_6;
> + i2c7 = &i2c_7;
>   };
> 
>   gic:interrupt-controller@1049 {
> @@ -121,7 +129,7 @@
>   status = "disabled";
>   };
> 
> - i2c@1386 {
> + i2c_0: i2c@1386 {
>   #address-cells = <1>;
>   #size-cells = <0>;
>   compatible = "samsung,s3c2440-i2c";
> @@ -130,7 +138,7 @@
>   status = "disabled";
>   };
> 
> - i2c@1387 {
> + i2c_1: i2c@1387 {
>   #address-cells = <1>;
>   #size-cells = <0>;
>   compatible = "samsung,s3c2440-i2c";
> @@ -139,7 +147,7 @@
>   status = "disabled";
>   };
> 
> - i2c@1388 {
> + i2c_2: i2c@1388 {
>   #address-cells = <1>;
>   #size-cells = <0>;
>   compatible = "samsung,s3c2440-i2c";
> @@ -148,7 +156,7 @@
>   status = "disabled";
>   };
> 
> - i2c@1389 {
> + i2c_3: i2c@1389 {
>   #address-cells = <1>;
>   #size-cells = <0>;
>   compatible = "samsung,s3c2440-i2c";
> @@ -157,7 +165,7 @@
>   status = "disabled";
>   };
> 
> - i2c@138A {
> + i2c_4: i2c@138A {
>   #address-cells = <1>;
>   #size-cells = <0>;
>   compatible = "samsung,s3c2440-i2c";
> @@ -166,7 +174,7 @@
>   status = "disabled";
>   };
> 
> - i2c@138B {
> + i2c_5: i2c@138B {
>   #address-cells = <1>;
>   #size-cells = <0>;
>   compatible = "samsung,s3c2440-i2c";
> @@ -175,7 +183,7 @@
>   status = "disabled";
>   };
> 
> - i2c@138C {
> + i2c_6: i2c@138C {
>   #address-cells = <1>;
>   #size-cells = <0>;
>   compatible = "samsung,s3c2440-i2c";
> @@ -184,7 +192,7 @@
>   status = "disabled";
>   };
> 
> - i2c@138D {
> + i2c_7: i2c@138D {
>   #address-cells = <1>;
>   #size-cells = <0>;
>   compatible = "samsung,s3c2440-i2c";
> --
> 1.7.7.3

Applied, thanks.

Best regards,
Kgene.
--
Kukjin Kim , Senior Engineer,
SW Solution Development Team, Samsung Electronics Co., Ltd.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 2/2] i2c: s3c2410: Get the i2c bus number from alias id

2012-11-21 Thread Kukjin Kim
Doug Anderson wrote:
> 
> From: Padmavathi Venna 
> 
> Get the i2c bus number that the device is connected to using the alias
> id.  This makes debugging / grokking of kernel messages much easier.
> 
> [dianders: slight patch cleanup from Padmavathi's original.]
> 
> Signed-off-by: Padmavathi Venna 
> Signed-off-by: Doug Anderson 

Acked-by: Kukjin Kim 

Thanks.

Best regards,
Kgene.
--
Kukjin Kim , Senior Engineer,
SW Solution Development Team, Samsung Electronics Co., Ltd.

> ---
>  drivers/i2c/busses/i2c-s3c2410.c |   10 +-
>  1 files changed, 9 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/i2c/busses/i2c-s3c2410.c b/drivers/i2c/busses/i2c-
> s3c2410.c
> index 3e0335f..ca43590 100644
> --- a/drivers/i2c/busses/i2c-s3c2410.c
> +++ b/drivers/i2c/busses/i2c-s3c2410.c
> @@ -899,11 +899,19 @@ static void
>  s3c24xx_i2c_parse_dt(struct device_node *np, struct s3c24xx_i2c *i2c)
>  {
>   struct s3c2410_platform_i2c *pdata = i2c->pdata;
> + int id;
> 
>   if (!np)
>   return;
> 
> - pdata->bus_num = -1; /* i2c bus number is dynamically assigned */
> + id = of_alias_get_id(np, "i2c");
> + if (id < 0) {
> + dev_warn(i2c->dev, "failed to get alias id:%d\n", id);
> + pdata->bus_num = -1;
> + } else {
> + /* i2c bus number is statically assigned from alias */
> + pdata->bus_num = id;
> + }
>   of_property_read_u32(np, "samsung,i2c-sda-delay", &pdata-
> >sda_delay);
>   of_property_read_u32(np, "samsung,i2c-slave-addr", &pdata-
> >slave_addr);
>   of_property_read_u32(np, "samsung,i2c-max-bus-freq",
> --
> 1.7.7.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] ARM: exynos: dt: add all i2c busses to auxdata

2012-11-21 Thread Kukjin Kim
Doug Anderson wrote:
> 
> From: Olof Johansson 
> 
> Needed to match device ids for clocks, etc.
> 
> Signed-off-by: Olof Johansson 
> Signed-off-by: Doug Anderson 
> 
> ---
>  arch/arm/mach-exynos/mach-exynos5-dt.c |   10 ++
>  1 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/arm/mach-exynos/mach-exynos5-dt.c b/arch/arm/mach-
> exynos/mach-exynos5-dt.c
> index ed37273..e1491f7 100644
> --- a/arch/arm/mach-exynos/mach-exynos5-dt.c
> +++ b/arch/arm/mach-exynos/mach-exynos5-dt.c
> @@ -52,6 +52,16 @@ static const struct of_dev_auxdata
> exynos5250_auxdata_lookup[] __initconst = {
>   "s3c2440-i2c.1", NULL),
>   OF_DEV_AUXDATA("samsung,s3c2440-i2c", EXYNOS5_PA_IIC(2),
>   "s3c2440-i2c.2", NULL),
> + OF_DEV_AUXDATA("samsung,s3c2440-i2c", EXYNOS5_PA_IIC(3),
> + "s3c2440-i2c.3", NULL),
> + OF_DEV_AUXDATA("samsung,s3c2440-i2c", EXYNOS5_PA_IIC(4),
> + "s3c2440-i2c.4", NULL),
> + OF_DEV_AUXDATA("samsung,s3c2440-i2c", EXYNOS5_PA_IIC(5),
> + "s3c2440-i2c.5", NULL),
> + OF_DEV_AUXDATA("samsung,s3c2440-i2c", EXYNOS5_PA_IIC(6),
> + "s3c2440-i2c.6", NULL),
> + OF_DEV_AUXDATA("samsung,s3c2440-i2c", EXYNOS5_PA_IIC(7),
> + "s3c2440-i2c.7", NULL),
>   OF_DEV_AUXDATA("samsung,s3c2440-hdmiphy-i2c", EXYNOS5_PA_IIC(8),
>   "s3c2440-hdmiphy-i2c", NULL),
>   OF_DEV_AUXDATA("samsung,exynos5250-dw-mshc", EXYNOS5_PA_DWMCI0,
> --
> 1.7.7.3

Looks ok to me, applied.

Thanks.

Best regards,
Kgene.
--
Kukjin Kim , Senior Engineer,
SW Solution Development Team, Samsung Electronics Co., Ltd.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 2/2] ARM: dts: snow: Add board dts file for Snow board (ARM Chromebook)

2012-11-21 Thread Kukjin Kim
Kukjin Kim wrote:
> 
> Olof Johansson wrote:
> >

[...]

> >
> > Acked-by: Olof Johansson 
> >
> > Kukjin, since your pull requests came in today, can you ack this and
> > I'll just apply it on top of your branches?
> >
> Hi Olof, yeah, I sent pull-request but there are some patches for dt and
> exynos5440 stuff in my tree not sent yet. If you're ok, let me take this
> with your ack.
> 
Applied, thanks.

Best regards,
Kgene.
--
Kukjin Kim , Senior Engineer,
SW Solution Development Team, Samsung Electronics Co., Ltd.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] pkt_sched: QFQ Plus: fair-queueing service at DRR cost

2012-11-21 Thread Paolo Valente

Il 20/11/2012 19:54, David Miller ha scritto:

From: Paolo Valente 
Date: Tue, 20 Nov 2012 18:45:13 +0100


-   struct sk_buff *skb;
+   struct sk_buff *skb = NULL;


This is not really an improvement,

Sorry for trying this silly shortcut

now the compiler can think that NULL is passed eventually into
qdisc_bstats_update().

Please make the logic easier for the compiler to digest.

For example, restructure the top-level logic into something like:

skb = NULL;
if (!list_empty(&in_serv_agg->active))
skb = qfq_peek_skb(in_serv_agg, &cl, &len);
else
len = 0; /* no more active classes in the in-service agg */

if (len == 0 || in_serv_agg->budget < len) {
  ...
/*
 * If we get here, there are other aggregates queued:
 * choose the new aggregate to serve.
 */
in_serv_agg = q->in_serv_agg = qfq_choose_next_agg(q);
skb = qfq_peek_skb(in_serv_agg, &cl, &len);
}
if (!skb)
return NULL;

That way it is clearer, to both humans and the compiler, what is
going on here.

Got it. Actually, if the first qfq_peek_skb returns NULL, then the 
example version that you are proposing apparently may behave in a 
different way than the original one: in your proposal the scheduler 
tries to switch to a new aggregate and may return a non-NULL value, 
whereas the original version would immediately return NULL. I guess that 
this slightly different behavior is fine as well, and I am preparing a 
new patch that integrates these changes.

Thanks.




--
---
| Paolo Valente  ||
| Algogroup  ||
| Dip. Ing. Informazione | tel:   +39 059 2056318 |
| Via Vignolese 905/b| fax:   +39 059 2056129 |
| 41125 Modena - Italy   ||
| home:  http://algo.ing.unimo.it/people/paolo/   |
---
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 1/2] thermal: exynos: Fix wrong bit to control tmu core

2012-11-21 Thread Amit Kachhap
On 20 November 2012 11:23, Zhang Rui  wrote:
> On Tue, 2012-11-20 at 10:39 +0900, Kyungmin Park wrote:
>> On 11/20/12, Jonghwan Choi  wrote:
>> > [0]bit is used to enable/disable tmu core. [1] bit is a reserved bit.
>> >
>> > Signed-off-by: Jonghwan Choi 
>> Acked-by: Kyungmin Park 
>
> Amit and Donggeun Kim,
>
> any comments on this patch?
>
> thanks,
> rui
Hi Rui,

This patch follows the suggestion I made earlier and looks fine.
Acked-by: Amit Daniel Kachhap 

Thanks,
Amit Daniel
>
>> > ---
>> >  drivers/thermal/exynos_thermal.c |   16 
>> >  1 files changed, 12 insertions(+), 4 deletions(-)
>> >
>> > diff --git a/drivers/thermal/exynos_thermal.c
>> > b/drivers/thermal/exynos_thermal.c
>> > index 6dd29e4..129e827 100644
>> > --- a/drivers/thermal/exynos_thermal.c
>> > +++ b/drivers/thermal/exynos_thermal.c
>> > @@ -52,9 +52,12 @@
>> >
>> >  #define EXYNOS_TMU_TRIM_TEMP_MASK  0xff
>> >  #define EXYNOS_TMU_GAIN_SHIFT  8
>> > +#define EXYNOS_TMU_GAIN_MASK   (0xF << EXYNOS_TMU_GAIN_SHIFT)
>> >  #define EXYNOS_TMU_REF_VOLTAGE_SHIFT   24
>> > -#define EXYNOS_TMU_CORE_ON 3
>> > -#define EXYNOS_TMU_CORE_OFF2
>> > +#define EXYNOS_TMU_REF_VOLTAGE_MASK(0x1F <<
>> > EXYNOS_TMU_REF_VOLTAGE_SHIFT)
>> > +#define EXYNOS_TMU_CORE_ON BIT(0)
>> > +#define EXYNOS_TMU_CORE_ON_SHIFT   0
>> > +#define EXYNOS_TMU_CORE_ON_MASK(0x1 <<
>> > EXYNOS_TMU_CORE_ON_SHIFT)
>> >  #define EXYNOS_TMU_DEF_CODE_TO_TEMP_OFFSET 50
>> >
>> >  /* Exynos4210 specific registers */
>> > @@ -85,7 +88,9 @@
>> >  #define EXYNOS_TMU_CLEAR_FALL_INT  (0x111 << 16)
>> >  #define EXYNOS_MUX_ADDR_VALUE  6
>> >  #define EXYNOS_MUX_ADDR_SHIFT  20
>> > +#define EXYNOS_MUX_ADDR_MASK   (0x7 << EXYNOS_MUX_ADDR_SHIFT)
>> >  #define EXYNOS_TMU_TRIP_MODE_SHIFT 13
>> > +#define EXYNOS_TMU_TRIP_MODE_MASK  (0x7 << EXYNOS_TMU_TRIP_MODE_SHIFT)
>> >
>> >  #define EFUSE_MIN_VALUE 40
>> >  #define EFUSE_MAX_VALUE 100
>> > @@ -650,10 +655,14 @@ static void exynos_tmu_control(struct platform_device
>> > *pdev, bool on)
>> > mutex_lock(&data->lock);
>> > clk_enable(data->clk);
>> >
>> > -   con = pdata->reference_voltage << EXYNOS_TMU_REF_VOLTAGE_SHIFT |
>> > +   con = readl(data->base + EXYNOS_TMU_REG_CONTROL);
>> > +   con &= ~(EXYNOS_TMU_REF_VOLTAGE_MASK | EXYNOS_TMU_GAIN_MASK |
>> > +   EXYNOS_TMU_CORE_ON_MASK);
>> > +   con |= pdata->reference_voltage << EXYNOS_TMU_REF_VOLTAGE_SHIFT |
>> > pdata->gain << EXYNOS_TMU_GAIN_SHIFT;
>> >
>> > if (data->soc == SOC_ARCH_EXYNOS) {
>> > +   con &= ~(EXYNOS_TMU_TRIP_MODE_MASK | EXYNOS_MUX_ADDR_MASK);
>> > con |= pdata->noise_cancel_mode <<
>> > EXYNOS_TMU_TRIP_MODE_SHIFT;
>> > con |= (EXYNOS_MUX_ADDR_VALUE << EXYNOS_MUX_ADDR_SHIFT);
>> > }
>> > @@ -665,7 +674,6 @@ static void exynos_tmu_control(struct platform_device
>> > *pdev, bool on)
>> > pdata->trigger_level1_en << 4 |
>> > pdata->trigger_level0_en;
>> > } else {
>> > -   con |= EXYNOS_TMU_CORE_OFF;
>> > interrupt_en = 0; /* Disable all interrupts */
>> > }
>> > writel(interrupt_en, data->base + EXYNOS_TMU_REG_INTEN);
>> > --
>> > 1.7.4.1
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> > the body of a message to majord...@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> > Please read the FAQ at  http://www.tux.org/lkml/
>> >
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] powerpc/dma/raidengine: add raidengine device

2012-11-21 Thread b29237
From: Xuelin Shi 

The RaidEngine is a new Freescale hardware block that is used for
parity computation offloading in RAID5/6.

This patch adds the device node in the device tree and the related
binding documentation.

Signed-off-by: Harninder Rai 
Signed-off-by: Naveen Burmi 
Signed-off-by: Xuelin Shi 
---
 .../devicetree/bindings/powerpc/fsl/raideng.txt|   81 +++
 arch/powerpc/boot/dts/fsl/p5020si-post.dtsi|1 +
 arch/powerpc/boot/dts/fsl/p5020si-pre.dtsi |6 ++
 arch/powerpc/boot/dts/fsl/qoriq-raid1.0-0.dtsi |   85 
 4 files changed, 173 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/powerpc/fsl/raideng.txt
 create mode 100644 arch/powerpc/boot/dts/fsl/qoriq-raid1.0-0.dtsi

diff --git a/Documentation/devicetree/bindings/powerpc/fsl/raideng.txt 
b/Documentation/devicetree/bindings/powerpc/fsl/raideng.txt
new file mode 100644
index 000..4ad29b9
--- /dev/null
+++ b/Documentation/devicetree/bindings/powerpc/fsl/raideng.txt
@@ -0,0 +1,81 @@
+* Freescale 85xx RAID Engine nodes
+
+RAID Engine nodes are defined to describe on-chip RAID accelerators.  Each RAID
+Engine should have a separate node.
+
+Supported chips:
+P5020, P5040
+
+Required properties:
+
+- compatible:  Should contain "fsl,raideng-v1.0" as the value
+   This identifies RAID Engine block. 1 in 1.0 represents
+   major number whereas 0 represents minor number. The
+   version matches the hardware IP version.
+- reg: offset and length of the register set for the device
+- ranges:  standard ranges property specifying the translation
+   between child address space and parent address space
+
+Example:
+   /* P5020 */
+   raideng: raideng@32 {
+   compatible = "fsl,raideng-v1.0";
+   #address-cells = <1>;
+   #size-cells = <1>;
+   reg = <0x32 0x1>;
+   ranges  = <0 0x32 0x1>;
+   };
+
+
+There must be a sub-node for each job queue present in RAID Engine
+This node must be a sub-node of the main RAID Engine node
+
+- compatible:  Should contain "fsl,raideng-v1.0-job-queue" as the value
+   This identifies the job queue interface
+- reg: offset and length of the register set for job queue
+- ranges:  standard ranges property specifying the translation
+   between child address space and parent address space
+
+Example:
+   /* P5020 */
+   raideng_jq0@1000 {
+   compatible = "fsl,raideng-v1.0-job-queue";
+   reg= <0x1000 0x1000>;
+   ranges = <0x0 0x1000 0x1000>;
+   };
+
+
+There must be a sub-node for each job ring present in RAID Engine
+This node must be a sub-node of job queue node
+
+- compatible:  Must contain "fsl,raideng-v1.0-job-ring" as the value
+   This identifies job ring. Should contain either
+   "fsl,raideng-v1.0-hp-ring" or "fsl,raideng-v1.0-lp-ring"
+   depending upon whether ring has high or low priority
+- reg: offset and length of the register set for job ring
+- interrupts:  interrupt mapping for job ring IRQ
+
+Optional property:
+
+- fsl,liodn:   Specifies the LIODN to be used for Job Ring. This
+   property is normally set by firmware. Value
+   is of 12-bits which is the LIODN number for this JR.
+   This property is used by the IOMMU (PAMU) to distinguish
+   transactions from this JR and then be able to do address
+   translation & protection accordingly.
+
+Example:
+   /* P5020 */
+   raideng_jq0@1000 {
+   compatible = "fsl,raideng-v1.0-job-queue";
+   reg= <0x1000 0x1000>;
+   ranges = <0x0 0x1000 0x1000>;
+
+   raideng_jr0: jr@0 {
+   compatible = "fsl,raideng-v1.0-job-ring", 
"fsl,raideng-v1.0-hp-ring";
+   reg= <0x0 0x400>;
+   interrupts = <139 2 0 0>;
+   interrupt-parent = <&mpic>;
+   fsl,liodn = <0x41>;
+   };
+   };
diff --git a/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi 
b/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi
index 64b6abe..5d7205b 100644
--- a/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi
@@ -354,4 +354,5 @@
 /include/ "qoriq-sata2-0.dtsi"
 /include/ "qoriq-sata2-1.dtsi"
 /include/ "qoriq-sec4.2-0.dtsi"
+/include/ "qoriq-raid1.0-0.dtsi"
 };
diff --git a/arch/powerpc/boot/dts/fsl/p5020si-pre.dtsi 
b/arch/powerpc/boot/dts/fsl/p5020si-pre.dtsi
index 0a198b0..8df47fc 100644
--- a/arch/powerpc/boot/dts/fsl/p5020si-pre.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p5020si-pre.dtsi
@@ -73,6 +73,12 @@
rtic_c = &rtic_c;
rtic_d = &rtic_d;
sec_mon = &sec_mon;
+
+   raideng = &raideng;
+   rai

[PATCH 2/2] powerpc/dma/raidengine: enable Freescale RaidEngine device

2012-11-21 Thread b29237
From: Xuelin Shi 

The RaidEngine is a new FSL hardware block that is used as a hardware
accelerator for RAID5/6.

This patch enables the RaidEngine functionality and provides hardware
offloading capability for memcpy, xor and raid6 pq computation. It works
under dmaengine control with the async layer interface.

Signed-off-by: Harninder Rai 
Signed-off-by: Naveen Burmi 
Signed-off-by: Xuelin Shi 
---
 drivers/dma/Kconfig|   14 +
 drivers/dma/Makefile   |1 +
 drivers/dma/fsl_raid.c |  990 
 drivers/dma/fsl_raid.h |  317 
 4 files changed, 1322 insertions(+)
 create mode 100644 drivers/dma/fsl_raid.c
 create mode 100644 drivers/dma/fsl_raid.h

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index d4c1218..aa37279 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -320,6 +320,20 @@ config MMP_PDMA
help
  Support the MMP PDMA engine for PXA and MMP platfrom.
 
+config FSL_RAID
+tristate "Freescale RAID Engine Device Driver"
+depends on FSL_SOC && !FSL_DMA
+select DMA_ENGINE
+select ASYNC_TX_ENABLE_CHANNEL_SWITCH
+select ASYNC_MEMCPY
+select ASYNC_XOR
+select ASYNC_PQ
+---help---
+  Enable support for Freescale RAID Engine. RAID Engine is
+  available on some QorIQ SoCs (like P5020). It has
+  the capability to offload RAID5/RAID6 operations from CPU.
+  RAID5 is XOR and memcpy. RAID6 is P/Q and memcpy
+
 config DMA_ENGINE
bool
 
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index 7428fea..29b65eb 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_DMATEST) += dmatest.o
 obj-$(CONFIG_INTEL_IOATDMA) += ioat/
 obj-$(CONFIG_INTEL_IOP_ADMA) += iop-adma.o
 obj-$(CONFIG_FSL_DMA) += fsldma.o
+obj-$(CONFIG_FSL_RAID) += fsl_raid.o
 obj-$(CONFIG_MPC512X_DMA) += mpc512x_dma.o
 obj-$(CONFIG_MV_XOR) += mv_xor.o
 obj-$(CONFIG_DW_DMAC) += dw_dmac.o
diff --git a/drivers/dma/fsl_raid.c b/drivers/dma/fsl_raid.c
new file mode 100644
index 000..ec19817
--- /dev/null
+++ b/drivers/dma/fsl_raid.c
@@ -0,0 +1,990 @@
+/*
+ * drivers/dma/fsl_raid.c
+ *
+ * Freescale RAID Engine device driver
+ *
+ * Author:
+ * Harninder Rai 
+ * Naveen Burmi 
+ *
+ * Copyright (c) 2010-2012 Freescale Semiconductor, Inc.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ * * Redistributions of source code must retain the above copyright
+ *   notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions and the following disclaimer in the
+ *   documentation and/or other materials provided with the distribution.
+ * * Neither the name of Freescale Semiconductor nor the
+ *   names of its contributors may be used to endorse or promote products
+ *   derived from this software without specific prior written permission.
+ *
+ * ALTERNATIVELY, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") as published by the Free Software
+ * Foundation, either version 2 of that License or (at your option) any
+ * later version.
+ *
+ * THIS SOFTWARE IS PROVIDED BY Freescale Semiconductor ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL Freescale Semiconductor BE LIABLE FOR ANY
+ * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF 
THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * Theory of operation:
+ *
+ * General capabilities:
+ * RAID Engine (RE) block is capable of offloading XOR, memcpy and P/Q
+ * calculations required in RAID5 and RAID6 operations. RE driver
+ * registers with Linux's ASYNC layer as dma driver. RE hardware
+ * maintains strict ordering of the requests through chained
+ * command queueing.
+ *
+ * Data flow:
+ * Software RAID layer of Linux (MD layer) maintains RAID partitions,
+ * strips, stripes etc. It sends requests to the underlying ASYNC layer
+ * which further passes it to RE driver. ASYNC layer decides which request
+ * goes to which job ring of RE hardware. For every request processed by
+ * RAID Engine, driver gets an interrupt unless coalescing is set. The
+ * per job ring interrupt handler checks the status register for errors,
+ * 

RE: [PATCH] ARM: exynos: add UART3 to DEBUG_LL ports

2012-11-21 Thread Kukjin Kim
Olof Johansson wrote:
> 
> On Tue, Nov 20, 2012 at 02:48:58PM -0800, Doug Anderson wrote:
> > From: Olof Johansson 
> >
> > UART3 is used for debugging on exynos5250-snow.
> >
> > [dianders: cleaned commit message.]
> >
> > Signed-off-by: Olof Johansson 
> > Signed-off-by: Doug Anderson 
> 
> >
> > ---
> >  arch/arm/Kconfig.debug|   11 +++
> >  arch/arm/plat-samsung/Kconfig |1 +
> >  2 files changed, 12 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/arm/Kconfig.debug b/arch/arm/Kconfig.debug
> > index 33a8930..35ba7dc 100644
> > --- a/arch/arm/Kconfig.debug
> > +++ b/arch/arm/Kconfig.debug
> > @@ -355,6 +355,17 @@ choice
> >   The uncompressor code port configuration is now handled
> >   by CONFIG_S3C_LOWLEVEL_UART_PORT.
> >
> > +   config DEBUG_S3C_UART3
> > +   depends on PLAT_SAMSUNG
> 
> 
> Sorry, the reason I hadn't re-posted this is that Kukjin had proposed
> to keep users of platforms with <= 3 UARTs from selecting it. An added
> "Depends on ARCH_EXYNOS4 || ARCH_EXYNOS5" should cover that. Can you
> add and repost, please?
> 
Yes, please :-)

Thanks.

Best regards,
Kgene.
--
Kukjin Kim , Senior Engineer,
SW Solution Development Team, Samsung Electronics Co., Ltd.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V4 1/3] of: introduce for_each_matching_node_and_match()

2012-11-21 Thread Arnd Bergmann
On Tuesday 20 November 2012, Stephen Warren wrote:
> However, this results in iterating over the table twice; the second time
> inside of_match_node(). The implementation of for_each_matching_node()
> already found the match, so this is redundant. Invent new function
> of_find_matching_node_and_match() and macro
> for_each_matching_node_and_match() to remove the double iteration,
> thus transforming the above code to:
> 
> for_each_matching_node_and_match(np, table, &match)
> 
> Signed-off-by: Stephen Warren 

This looks useful, but I wonder if the interface would make more sense if you
made the last argument to the macro a normal pointer, rather than a
pointer-to-pointer. You could take the reference as part of the macro.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
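
To make the suggestion concrete, the variant Arnd seems to have in mind could
look roughly like this. It is only a sketch against the v4 signature of
of_find_matching_node_and_match(), not a tested patch, and the macro is given
a different name here to avoid confusion with Stephen's version.

#include <linux/of.h>

/* Sketch: the caller passes a plain pointer and the macro takes the
 * reference itself, so call sites need no '&'. */
#define for_each_matching_node_and_match_p(dn, matches, match)		       \
	for (dn = of_find_matching_node_and_match(NULL, matches, &(match));   \
	     dn; dn = of_find_matching_node_and_match(dn, matches, &(match)))

static void example_scan(const struct of_device_id *table)
{
	struct device_node *np;
	const struct of_device_id *match;

	for_each_matching_node_and_match_p(np, table, match) {
		/* np is the matching node, match is the table entry that hit */
	}
}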


Re: opp_get_notifier() needs to be under rcu_lock?

2012-11-21 Thread MyungJoo Ham
> Hi,
> 
> It looks like find_device_opp() (called from opp_get_notifier()) needs
> to be under RCU read lock, but this doesn't seem to be happening in
> drivers/devfreq/devfreq.c. Doesn't this run the risk of referencing a
> freed variable?
> 
> Thanks,
> 
> -Kees

Yes, that's an issue requiring updates.

Thank you for pointing it out.



Cheers,
MyungJoo

> 
> -- 
> Kees Cook
> Chrome OS Security
> 
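
The kind of update being referred to is essentially wrapping the OPP lookup
in rcu_read_lock(). A rough sketch of what a devfreq-side helper could look
like, assuming the 2012-era opp_get_notifier() API; this is illustrative
only, not the actual fix:

#include <linux/device.h>
#include <linux/err.h>
#include <linux/notifier.h>
#include <linux/opp.h>
#include <linux/rcupdate.h>

static int example_register_opp_notifier(struct device *dev,
					 struct notifier_block *nb)
{
	struct srcu_notifier_head *nh;

	/* find_device_opp() walks an RCU-protected list, so the lookup
	 * must be covered by the RCU read lock. */
	rcu_read_lock();
	nh = opp_get_notifier(dev);
	rcu_read_unlock();
	if (IS_ERR(nh))
		return PTR_ERR(nh);

	return srcu_notifier_chain_register(nh, nb);
}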





Re: [PATCH 0/4] Dove pinctrl fixes and DT enabling

2012-11-21 Thread Linus Walleij
On Mon, Nov 19, 2012 at 10:39 AM, Sebastian Hesselbarth
 wrote:

> This patch relies on a patch set for mvebu pinctrl taken through
> Linus' pinctrl branch. As there is no other platform than Dove
> involved, I suggest to take it though Jason's tree to avoid any
> further conflicts.

Sounds like a plan. So you have some commit history pulled
in from the pinctrl tree in the MVEBU tree?

Yours,
Linus Walleij
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Problem in Page Cache Replacement

2012-11-21 Thread metin d


Hi Fengguang,

I ran the tests and attached the results. The line below, I guess, shows
the data-1 page cache:

0x0008006c    6584051    25718  __RU_lA___P    referenced,uptodate,lru,active,private

Metin



From: Jaegeuk Hanse 
To: Fengguang Wu  
Cc: metin d ; Jan Kara ; 
"linux-kernel@vger.kernel.org" ; 
"linux...@kvack.org"  
Sent: Wednesday, November 21, 2012 11:42 AM
Subject: Re: Problem in Page Cache Replacement

On 11/21/2012 05:02 PM, Fengguang Wu wrote:
> On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
>> Cc Fengguang Wu.
>>
>> On 11/21/2012 04:13 PM, metin d wrote:
    Curious. Added linux-mm list to CC to catch more attention. If you run
 echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
>>> I'm guessing it'd evict the entries, but am wondering if we could run any 
>>> more diagnostics before trying this.
>>>
>>> We regularly use a setup where we have two databases; one gets used 
>>> frequently and the other one about once a month. It seems like the memory 
>>> manager keeps unused pages in memory at the expense of frequently used 
>>> database's performance.
>>> My understanding was that under memory pressure from heavily
>>> accessed pages, unused pages would eventually get evicted. Is there
>>> anything else we can try on this host to understand why this is
>>> happening?
> We may debug it this way.
>
> 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
>     (please double check via /proc/vmstat whether it does the expected work)
>
> 2) run 'page-types -r' with root, to view the page status for the
>     remaining pages of data-1
>
> The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
> Please compile them with options "-Dlinux -I. -D_GNU_SOURCE 
> -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"
>
> page-types can be found in the kernel source tree tools/vm/page-types.c
>
> Sorry that sounds a bit twisted.. I do have a patch to directly dump
> page cache status of a user specified file, however it's not
> upstreamed yet.

Hi Fengguang,

Thanks for your detailed steps, I think metin can have a try.

         flags    page-count       MB  symbolic-flags    long-symbolic-flags
0x                    607699     2373  ___
0x0001                343227     1340  ___r___           reserved

But I have some questions about the page-types output:

Does 2373MB here mean the total memory in use, including page cache? I
don't think so.
Which kinds of pages get marked reserved?
Which line of long-symbolic-flags is for the page cache?

Regards,
Jaegeuk

>
> Thanks,
> Fengguang
>
>>> On Tue 20-11-12 09:42:42, metin d wrote:
 I have two PostgreSQL databases named data-1 and data-2 that sit on the
 same machine. Both databases keep 40 GB of data, and the total memory
 available on the machine is 68GB.

 I started data-1 and data-2, and ran several queries to go over all their
 data. Then, I shut down data-1 and kept issuing queries against data-2.
 For some reason, the OS still holds on to large parts of data-1's pages
 in its page cache, and reserves about 35 GB of RAM to data-2's files. As
 a result, my queries on data-2 keep hitting disk.

 I'm checking page cache usage with fincore. When I run a table scan query
 against data-2, I see that data-2's pages get evicted and put back into
 the cache in a round-robin manner. Nothing happens to data-1's pages,
 although they haven't been touched for days.

 Does anybody know why data-1's pages aren't evicted from the page cache?
 I'm open to all kinds of suggestions you think might relate to the problem.
>>>    Curious. Added linux-mm list to CC to catch more attention. If you run
>>> echo 1 >/proc/sys/vm/drop_caches
>>>    does it evict data-1 pages from memory?
>>>
 This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
 swap space. The kernel version is:

 $ uname -r
 3.2.28-45.62.amzn1.x86_64
 Edit:

 and it seems that I use a single NUMA node, if you think that could be a
 problem.

 $ numactl --hardware
 available: 1 nodes (0)
 node 0 cpus: 0 1 2 3 4 5 6 7
 node 0 size: 70007 MB
 node 0 free: 360 MB
 node distances:
 node   0
     0:  10

         flags    page-count       MB  symbolic-flags    long-symbolic-flags
0x                   5508317    21516  ___
0x0001                335993     1312  ___r___           reserved
0x0021                 35634      139  ___rO__           reserved,owner_private
0x0001                 45069      176  T__               compound_tail
0x0020                  1516        5  O__

Re: [PATCHv9 1/3] Runtime Interpreted Power Sequences

2012-11-21 Thread Alex Courbot
On Wednesday 21 November 2012 16:48:45 Tomi Valkeinen wrote:
> If the power-off sequence disables a regulator that was supposed to be
> enabled by the power-on sequence (but wasn't enabled because of an
> error), the regulator_disable is still called when the driver runs the
> power-off sequence, isn't it? Regulator enables and disables are ref
> counted, and the enables should match the disables.

And there collapses my theory.

> > Failures might be better handled if sequences have some "recovery policy"
> > about what to do when they fail, as mentioned in the link above. As you
> > pointed out, the driver might not always know enough about the resources
> > involved to do the right thing.
> 
> Yes, I think such recovery policy would be needed.

Indeed, from your last paragraph this makes even more sense now.

Oh, and I noticed I forgot to reply to this:

> This I didn't understand. Doesn't "<&pwm 2 xyz>" point to a single
> device, no matter where and how many times it's used?

That's true - however when dereferencing the phandle, the underlying framework 
will try to acquire the PWM, which will result in failure if the same resource 
is referenced several times.

One could compare the phandles to avoid this, but in your example you must 
know that for PWMs the "xyz" part is not relevant for comparison.

This makes referencing resources by name much easier to implement and more 
elegant with respect to leveraging the existing frameworks.

Alex.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC/PATCH 0/1] ubi: Add ubiblock driver

2012-11-21 Thread Thomas Petazzoni
Dear Ezequiel Garcia,

On Tue, 20 Nov 2012 19:39:38 -0300, Ezequiel Garcia wrote:

> * Read/write support
> 
> Yes, this implementation supports read/write access.

While I think the original ubiblk that was read-only made sense to
allow the usage of read-only filesystems like squashfs, I am not sure a
read/write ubiblock is useful.

Using a standard block read/write filesystem on top of ubiblock is going
to cause damage to your flash. Even though UBI does wear-leveling, your
standard block read/write filesystem will think it has 512-byte blocks
below it, and will do a crazy number of writes to small blocks. Even
though you have a one LEB cache, it is going to be defeated quite
strongly by the small random I/O of the read/write filesystem.

I am not sure letting people use read/write block filesystems on top of
flashes, even through UBI, is a good idea.

Thomas
-- 
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/4] pinctrl: mvebu: Fix dove_audio1_ctrl_set function

2012-11-21 Thread Linus Walleij
On Mon, Nov 19, 2012 at 10:39 AM, Sebastian Hesselbarth
 wrote:

> From: Axel Lin 
>
> When setting audio1 pinmux the bits in the corresponding registers
> are not cleared. This fix first clears all bits and then sets the
> required bits according to the selected function.
>
> Signed-off-by: Axel Lin 
> Signed-off-by: Sebastian Hesselbarth 

Acked-by: Linus Walleij 

Yours,
Linus Walleij
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/4] pinctrl: mvebu: fix iomem pointer for dove pinctrl

2012-11-21 Thread Linus Walleij
On Mon, Nov 19, 2012 at 10:39 AM, Sebastian Hesselbarth
 wrote:

> There has been a change in readl/writel to require registers
> addresses marked as IOMEM(). This patch takes care of this and
> also replaces ORing address offsets with adding them.
>
> Signed-off-by: Sebastian Hesselbarth 

Acked-by: Linus Walleij 

Yours,
Linus Walleij
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] drivers/pnp: fixed coding style on numerous lines.

2012-11-21 Thread Michael L. Hobbs
Fixed coding style on numerous lines.

Signed-off-by: Michael L. Hobbs 
---
 drivers/pnp/card.c |   18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/pnp/card.c b/drivers/pnp/card.c
index bc00693..400568a 100644
--- a/drivers/pnp/card.c
+++ b/drivers/pnp/card.c
@@ -152,7 +152,8 @@ static void pnp_release_card(struct device *dmdev)
kfree(card);
 }
 
-struct pnp_card *pnp_alloc_card(struct pnp_protocol *protocol, int id, char 
*pnpid)
+struct pnp_card *pnp_alloc_card(struct pnp_protocol *protocol,
+int id, char *pnpid)
 {
struct pnp_card *card;
struct pnp_id *dev_id;
@@ -165,7 +166,8 @@ struct pnp_card *pnp_alloc_card(struct pnp_protocol 
*protocol, int id, char *pnp
card->number = id;
 
card->dev.parent = &card->protocol->dev;
-   dev_set_name(&card->dev, "%02x:%02x", card->protocol->number, 
card->number);
+   dev_set_name(&card->dev, "%02x:%02x", card->protocol->number,
+card->number);
 
card->dev.coherent_dma_mask = DMA_BIT_MASK(24);
card->dev.dma_mask = &card->dev.coherent_dma_mask;
@@ -186,7 +188,7 @@ static ssize_t pnp_show_card_name(struct device *dmdev,
struct pnp_card *card = to_pnp_card(dmdev);
 
str += sprintf(str, "%s\n", card->name);
-   return (str - buf);
+   return str - buf;
 }
 
 static DEVICE_ATTR(name, S_IRUGO, pnp_show_card_name, NULL);
@@ -202,7 +204,7 @@ static ssize_t pnp_show_card_ids(struct device *dmdev,
str += sprintf(str, "%s\n", pos->id);
pos = pos->next;
}
-   return (str - buf);
+   return str - buf;
 }
 
 static DEVICE_ATTR(card_id, S_IRUGO, pnp_show_card_ids, NULL);
@@ -366,6 +368,7 @@ err_out:
dev->card_link = NULL;
return NULL;
 }
+EXPORT_SYMBOL(pnp_request_card_device);
 
 /**
  * pnp_release_card_device - call this when the driver no longer needs the 
device
@@ -379,6 +382,7 @@ void pnp_release_card_device(struct pnp_dev *dev)
device_release_driver(&dev->dev);
drv->link.remove = &card_remove_first;
 }
+EXPORT_SYMBOL(pnp_release_card_device);
 
 /*
  * suspend/resume callbacks
@@ -404,6 +408,7 @@ static int card_resume(struct pnp_dev *dev)
return 0;
 }
 
+
 /**
  * pnp_register_card_driver - registers a PnP card driver with the PnP Layer
  * @drv: pointer to the driver to register
@@ -436,6 +441,7 @@ int pnp_register_card_driver(struct pnp_card_driver *drv)
}
return 0;
 }
+EXPORT_SYMBOL(pnp_register_card_driver);
 
 /**
  * pnp_unregister_card_driver - unregisters a PnP card driver from the PnP 
Layer
@@ -448,8 +454,4 @@ void pnp_unregister_card_driver(struct pnp_card_driver *drv)
spin_unlock(&pnp_lock);
pnp_unregister_driver(&drv->link);
 }
-
-EXPORT_SYMBOL(pnp_request_card_device);
-EXPORT_SYMBOL(pnp_release_card_device);
-EXPORT_SYMBOL(pnp_register_card_driver);
 EXPORT_SYMBOL(pnp_unregister_card_driver);
-- 
1.7.9.5



Re: [PATCHv2 1/2] dw_dmac: store direction in the custom channel structure

2012-11-21 Thread Viresh Kumar
On 21 November 2012 14:12, Andy Shevchenko
 wrote:
> Currently the direction value comes from the generic slave configuration
> structure and explicitly as a preparation function parameter. The first one
> is considered obsolete. Thus, we have to store the value passed to the
> preparation function somewhere in our structures to be able to use it
> later. The best candidate to provide the storage is the custom channel
> structure. For now we still keep and check the direction field of the
> slave config structure as well.
>
> Signed-off-by: Andy Shevchenko 
> Acked-by: Viresh Kumar 

Vinod, these patches look fine to me. Please apply them for 3.8.


Re: [PATCH V4 1/3] of: introduce for_each_matching_node_and_match()

2012-11-21 Thread Grant Likely
On Wed, Nov 21, 2012 at 9:53 AM, Arnd Bergmann  wrote:
> On Tuesday 20 November 2012, Stephen Warren wrote:
>> However, this results in iterating over table twice; the second time
>> inside of_match_node(). The implementation of for_each_matching_node()
>> already found the match, so this is redundant. Invent new function
>> of_find_matching_node_and_match() and macro
>> for_each_matching_node_and_match() to remove the double iteration,
>> thus transforming the above code to:
>>
>> for_each_matching_node_and_match(np, table, &match)
>>
>> Signed-off-by: Stephen Warren 
>
> This look useful, but I wonder if the interface would make more sense if you
> make the last argument to the macro a normal pointer, rather than a
> pointer-to-pointer. You can take the reference as part of the macro.

To me that makes for harder to understand code. It *looks* like an
argument to a normal function call, but it gets changed by the caller.

g.


Re: Problem in Page Cache Replacement

2012-11-21 Thread Metin Döşlü
On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse  wrote:
>
> On 11/21/2012 05:58 PM, metin d wrote:
>
> Hi Fengguang,
>
> I ran the tests and attached the results. I guess the line below shows the
> data-1 page cache pages.
>
> 0x0008006c  6584051  25718  __RU_lA___P  referenced,uptodate,lru,active,private
>
>
> I think this is just one state of page cache pages.

But why are these page caches in this state as opposed to the other page
caches? From the results I conclude that:

data-1 pages are in state : referenced,uptodate,lru,active,private
data-2 pages are in state : referenced,uptodate,lru,mappedtodisk

>
>
>
>
> Metin
>
>
> - Original Message -
> From: Jaegeuk Hanse 
> To: Fengguang Wu 
> Cc: metin d ; Jan Kara ; 
> "linux-kernel@vger.kernel.org" ; 
> "linux...@kvack.org" 
> Sent: Wednesday, November 21, 2012 11:42 AM
> Subject: Re: Problem in Page Cache Replacement
>
> On 11/21/2012 05:02 PM, Fengguang Wu wrote:
> > On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
> >> Cc Fengguang Wu.
> >>
> >> On 11/21/2012 04:13 PM, metin d wrote:
> Curious. Added linux-mm list to CC to catch more attention. If you run
>  echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
> >>> I'm guessing it'd evict the entries, but am wondering if we could run any 
> >>> more diagnostics before trying this.
> >>>
> >>> We regularly use a setup where we have two databases; one gets used 
> >>> frequently and the other one about once a month. It seems like the memory 
> >>> manager keeps unused pages in memory at the expense of frequently used 
> >>> database's performance.
> >>> My understanding was that under memory pressure from heavily
> >>> accessed pages, unused pages would eventually get evicted. Is there
> >>> anything else we can try on this host to understand why this is
> >>> happening?
> > We may debug it this way.
> >
> > 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
> >(please double check via /proc/vmstat whether it does the expected work)
> >
> > 2) run 'page-types -r' with root, to view the page status for the
> >remaining pages of data-1
> >
> > The fadvise tool comes from Andrew Morton's ext3-tools. (source code 
> > attached)
> > Please compile them with options "-Dlinux -I. -D_GNU_SOURCE 
> > -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"
> >
> > page-types can be found in the kernel source tree tools/vm/page-types.c
> >
> > Sorry that sounds a bit twisted.. I do have a patch to directly dump
> > page cache status of a user specified file, however it's not
> > upstreamed yet.
>
> Hi Fengguang,
>
> Thanks for you detail steps, I think metin can have a try.
>
> flagspage-count  MB  symbolic-flags long-symbolic-flags
> 0x6076992373
> ___
> 0x00013432271340
> ___r___reserved
>
> But I have some questions of the print of page-type:
>
> Does 2373MB here mean total memory in use, including page cache? I don't
> think so.
> Which kind of pages will be marked reserved?
> Which line of long-symbolic-flags is for page cache?
>
> Regards,
> Jaegeuk
>
> >
> > Thanks,
> > Fengguang
> >
> >>> On Tue 20-11-12 09:42:42, metin d wrote:
>  I have two PostgreSQL databases named data-1 and data-2 that sit on the
>  same machine. Both databases keep 40 GB of data, and the total memory
>  available on the machine is 68GB.
> 
>  I started data-1 and data-2, and ran several queries to go over all their
>  data. Then, I shut down data-1 and kept issuing queries against data-2.
>  For some reason, the OS still holds on to large parts of data-1's pages
>  in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>  a result, my queries on data-2 keep hitting disk.
> 
>  I'm checking page cache usage with fincore. When I run a table scan query
>  against data-2, I see that data-2's pages get evicted and put back into
>  the cache in a round-robin manner. Nothing happens to data-1's pages,
>  although they haven't been touched for days.
> 
>  Does anybody know why data-1's pages aren't evicted from the page cache?
>  I'm open to all kind of suggestions you think it might relate to problem.
> >>>Curious. Added linux-mm list to CC to catch more attention. If you run
> >>> echo 1 >/proc/sys/vm/drop_caches
> >>>does it evict data-1 pages from memory?
> >>>
>  This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
>  swap space. The kernel version is:
> 
>  $ uname -r
>  3.2.28-45.62.amzn1.x86_64
>  Edit:
> 
>  and it seems that I use one NUMA instance, if  you think that it can a 
>  problem.
> 
>  $ numactl --hardware
>  available: 1 nodes (0)
>  node 0 cpus: 0 1 2 3 4 5 6 7
>  node 0 size: 70007 MB
>  node 0 free: 360 MB
>  node distan
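
For anyone without Andrew's ext3-tools handy, a minimal userspace sketch of
what step 1 quoted above ("fadvise data-2 0 0 dontneed") boils down to; note
that POSIX_FADV_DONTNEED only drops clean pages, so dirty data needs a sync
first:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd, ret;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* offset 0, length 0: apply to the whole file */
	ret = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	if (ret)
		fprintf(stderr, "posix_fadvise: %d\n", ret);
	close(fd);
	return ret ? 1 : 0;
}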

Re: [PATCH 1/9] vfs: Handle O_SYNC AIO DIO in generic code properly

2012-11-21 Thread Christoph Hellwig
On Mon, Nov 19, 2012 at 11:41:23PM -0800, Darrick J. Wong wrote:
> Provide VFS helpers for handling O_SYNC AIO DIO writes.  Filesystems wanting 
> to
> use the helpers have to pass DIO_SYNC_WRITES to __blockdev_direct_IO.  If the
> filesystem doesn't provide its own direct IO end_io handler, the generic code
> will take care of issuing the flush.  Otherwise, the filesystem's custom 
> end_io
> handler is passed struct dio_sync_io_work pointer as 'private' argument, and 
> it
> must call generic_dio_end_io() to finish the AIO DIO.  The generic code then
> takes care to call generic_write_sync() from a workqueue context when AIO DIO
> is complete.
> 
> Since all filesystems using blockdev_direct_IO() need O_SYNC aio dio handling
> and the generic suffices for them, make blockdev_direct_IO() pass the new
> DIO_SYNC_WRITES flag.

I'd like to use this as a vehicle to revisit how dio completions work.
Now that the generic code has a reason to defer aio completions to a
workqueue, can we maybe move the whole offload-to-a-workqueue code into
the direct-io code instead of reimplementing it in ext4 and xfs?

From a simplicity point of view I'd love to do it unconditionally, but I
also remember that this was causing performance regressions on important
workloads.  So maybe we just need a flag in the dio structure, with a way
that the get_blocks callback can communicate that it's needed.

For the specific case of O_(D)SYNC aio this would also allow calling
->fsync from generic code instead of the filesystems having to
reimplement this.

> + if (dio->sync_work)
> + private = dio->sync_work;
> + else
> + private = dio->private;
> +
>   dio->end_io(dio->iocb, offset, transferred,
> - dio->private, ret, is_async);
> + private, ret, is_async);

Eww.  I'd be much happier to add a new argument than having two
different members passed as the private argument.

Maybe it's even time to bite the bullet and make struct dio public
and pass that to the end_io argument as well as generic_dio_end_io.

> + /* No IO submitted? Skip syncing... */
> + if (!dio->result && dio->sync_work) {
> + kfree(dio->sync_work);
> + dio->sync_work = NULL;
> + }
> + generic_dio_end_io(dio->iocb, offset, transferred,
> +dio->sync_work, ret, is_async);


Any reason the check above isn't done inside of generic_dio_end_io?

> +static noinline int dio_create_flush_wq(struct super_block *sb)
> +{
> + struct workqueue_struct *wq =
> + alloc_workqueue("dio-sync", WQ_UNBOUND, 1);
> +
> + if (!wq)
> + return -ENOMEM;
> + /*
> +  * Atomically put workqueue in place. Release our one in case someone
> +  * else won the race and attached workqueue to superblock.
> +  */
> + if (cmpxchg(&sb->s_dio_flush_wq, NULL, wq))
> + destroy_workqueue(wq);
> + return 0;

Eww.  Workqueues are cheap, just create it on bootup instead of this
ugliness.  Also I don't really see any reason to make it per-fs instead
of global.
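
A rough sketch of the boot-time alternative being suggested (the initcall
level, naming, and the single global workqueue are assumptions, not code
from the patch):

static struct workqueue_struct *dio_flush_wq;

static int __init dio_flush_wq_init(void)
{
	/* One global unbound workqueue for O_SYNC AIO DIO completions */
	dio_flush_wq = alloc_workqueue("dio-sync", WQ_UNBOUND, 1);
	return dio_flush_wq ? 0 : -ENOMEM;
}
fs_initcall(dio_flush_wq_init);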



Re: [PATCH 3/9] xfs: factor out everything but the filemap_write_and_wait from xfs_file_fsync

2012-11-21 Thread Christoph Hellwig
On Mon, Nov 19, 2012 at 11:41:38PM -0800, Darrick J. Wong wrote:
> Hi,
> 
> Fsyncing is tricky business, so factor out the bits of the xfs_file_fsync
> function that can be used from the I/O post-processing path.

Why would we need to skip the filemap_write_and_wait_range call here?
If we're doing direct I/O we should not have any pages in this region
anyway.  You're also not skipping it in the generic implementation as
far as I can see, so I see no point in doing it just in XFS.


Re: [PATCH V4 1/3] of: introduce for_each_matching_node_and_match()

2012-11-21 Thread Thomas Petazzoni

On Wed, 21 Nov 2012 10:06:16 +, Grant Likely wrote:
> On Wed, Nov 21, 2012 at 9:53 AM, Arnd Bergmann  wrote:
> > On Tuesday 20 November 2012, Stephen Warren wrote:
> >> However, this results in iterating over table twice; the second time
> >> inside of_match_node(). The implementation of for_each_matching_node()
> >> already found the match, so this is redundant. Invent new function
> >> of_find_matching_node_and_match() and macro
> >> for_each_matching_node_and_match() to remove the double iteration,
> >> thus transforming the above code to:
> >>
> >> for_each_matching_node_and_match(np, table, &match)
> >>
> >> Signed-off-by: Stephen Warren 
> >
> > This look useful, but I wonder if the interface would make more sense if you
> > make the last argument to the macro a normal pointer, rather than a
> > pointer-to-pointer. You can take the reference as part of the macro.
> 
> To me that makes for harder to understand code. It *looks* like an
> argument to a normal function call, but it gets changed by the caller.

Agreed. Too much magic is too much.

Thomas
-- 
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com


[PATCH] PM / devfreq: missing rcu_read_lock() added for find_device_opp()

2012-11-21 Thread MyungJoo Ham
opp_get_notifier() uses find_device_opp(), which requires
rcu_read_lock to be held. In order to keep the notifier head
valid, rcu_read_lock() is now held around the call.

Reported-by: Kees Cook 
Signed-off-by: MyungJoo Ham 
---
 drivers/devfreq/devfreq.c |   26 --
 1 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
index 45e053e..e91cb22 100644
--- a/drivers/devfreq/devfreq.c
+++ b/drivers/devfreq/devfreq.c
@@ -1023,11 +1023,18 @@ struct opp *devfreq_recommended_opp(struct device *dev, 
unsigned long *freq,
  */
 int devfreq_register_opp_notifier(struct device *dev, struct devfreq *devfreq)
 {
-   struct srcu_notifier_head *nh = opp_get_notifier(dev);
+   struct srcu_notifier_head *nh;
+   int ret = 0;
 
+   rcu_read_lock();
+   nh = opp_get_notifier(dev);
if (IS_ERR(nh))
-   return PTR_ERR(nh);
-   return srcu_notifier_chain_register(nh, &devfreq->nb);
+   ret = PTR_ERR(nh);
+   if (!ret)
+   ret = srcu_notifier_chain_register(nh, &devfreq->nb);
+   rcu_read_unlock();
+
+   return ret;
 }
 
 /**
@@ -1042,11 +1049,18 @@ int devfreq_register_opp_notifier(struct device *dev, 
struct devfreq *devfreq)
  */
 int devfreq_unregister_opp_notifier(struct device *dev, struct devfreq 
*devfreq)
 {
-   struct srcu_notifier_head *nh = opp_get_notifier(dev);
+   struct srcu_notifier_head *nh;
+   int ret = 0;
 
+   rcu_read_lock();
+   nh = opp_get_notifier(dev);
if (IS_ERR(nh))
-   return PTR_ERR(nh);
-   return srcu_notifier_chain_unregister(nh, &devfreq->nb);
+   ret = PTR_ERR(nh);
+   if (!ret)
+   ret = srcu_notifier_chain_unregister(nh, &devfreq->nb);
+   rcu_read_unlock();
+
+   return ret;
 }
 
 MODULE_AUTHOR("MyungJoo Ham ");
-- 
1.7.5.4



Re: [PATCH] Staging: rtl8187se: fixed some checkpatch warnings and errors in r8180_wx.c

2012-11-21 Thread MAACHE Mehdi
On 20/11/2012 15:25, Dan Carpenter wrote:

> On Tue, Nov 20, 2012 at 02:26:42PM +0100, MAACHE Mehdi wrote:
>> This is a patch to the r8180_wx.c file that fixes up some warnings and 
>> errors found by the checkpatch.pl tool
>> - WARNING: line over 80 characters
>> - ERROR: "(foo*)" should be "(foo *)"
>> - ERROR: "foo* bar" should be "foo *bar"
>> - ERROR: trailing whitespace
>> - ERROR: that open brace { should be on the previous line
>>
> 
> This needs to be broken into 4-5 separate patches and sent as
> series.  One patch per warning type.
> 
>> -if (erq->flags & IW_ENCODE_DISABLED)
>> -
>> -if (erq->length > 0) {
>> -u32* tkey = (u32*) key;
>> +if ((erq->flags & IW_ENCODE_DISABLED) && erq->length > 0) {
>> +u32 *tkey = (u32 *) key;
> 
> Interesting...  You have preserved the meaning of the original code,
> but actually the original code is buggy.  Just delete the
> "if (erq->flags & IW_ENCODE_DISABLED)" check.  This bug was
> introduced in de171bd6ff "Staging: rtl8187se: r8180_wx: fixed a lot
> of checkptahc.pl issues".
> 
> Send that as a separate patch and mark it as a bugfix.
> 
> regards,
> dan carpenter
> 


Ok, I will do this. Thanks for the advice.

regards,

maache mehdi


Re: [PATCH 2/2] PM/devfreq: Fix return value in devfreq_remove_governor()

2012-11-21 Thread MyungJoo Ham
> Use the value obtained from the function instead of -EINVAL.
> 
> Signed-off-by: Sachin Kamat 

Acked-by: MyungJoo Ham 

Both patches applied as they are obvious bugfixes.
I'll send a pull request with other bugfixes within days.

> ---
>  drivers/devfreq/devfreq.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
> index 83c2129..2bd9ab0 100644
> --- a/drivers/devfreq/devfreq.c
> +++ b/drivers/devfreq/devfreq.c
> @@ -644,7 +644,7 @@ int devfreq_remove_governor(struct devfreq_governor 
> *governor)
>   if (IS_ERR(g)) {
>   pr_err("%s: governor %s not registered\n", __func__,
>  governor->name);
> - err = -EINVAL;
> + err = PTR_ERR(g);
>   goto err_out;
>   }
>   list_for_each_entry(devfreq, &devfreq_list, node) {
> -- 
> 1.7.4.1


Re: [RFC/PATCH 1/1] ubi: Add ubiblock driver

2012-11-21 Thread Ezequiel Garcia
On Tue, Nov 20, 2012 at 8:59 PM, richard -rw- weinberger
 wrote:
> On Tue, Nov 20, 2012 at 11:39 PM, Ezequiel Garcia  
> wrote:
>> Block device emulation on top of ubi volumes with read/write support.
>> Block devices get automatically created for each ubi volume present.
>>
>> Each ubiblock is fairly cheap since it's based on workqueues
>> and not on threads.
>>
>> Read/write access is expected to work fairly well because the
>> block elevator's request queue orders block transfers to be space-effective.
>> In other words, it's expected that reads and writes get ordered
>> to point to the same LEB.
>>
>> To help this and reduce access to the UBI volume, a 1-LEB size
>> write-back cache has been implemented.
>> Every read and every write, goes through this cache and the write is
>> only done when a request arrives to read or write to a different LEB
>> or when the device is released, when the last file handle is closed.
>
> Did you also benchmark your driver with two caches?
> (One for reading and one for writing.)
> By using two caches you can lower the amount of atomic LEB changes.
>
> Maybe it would also be good to ensure that a cache entry does not become too old.
>

Yes, I thought of this.

For now, I decided to keep the implementation as simple as possible.

Regards,

Ezequiel
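
For reference, a minimal sketch of the flush policy described in the quoted
changelog; all names below are made up for illustration, not the actual
driver code:

struct ubiblock_cache {
	void *buf;		/* one LEB worth of data */
	int leb_num;		/* LEB currently cached, -1 if none */
	bool dirty;
};

/* Write the cached LEB back only when a request targets a different LEB;
 * the same write-back happens when the last file handle is closed. */
static int ubiblock_cache_seek(struct ubiblock_dev *dev,
			       struct ubiblock_cache *cache, int leb)
{
	int ret = 0;

	if (cache->leb_num == leb)
		return 0;

	if (cache->dirty)
		ret = ubiblock_write_cache(dev, cache);	/* hypothetical helper */
	cache->leb_num = leb;
	cache->dirty = false;
	return ret;
}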


Re: [PATCH 0/4] Dove pinctrl fixes and DT enabling

2012-11-21 Thread Sebastian Hesselbarth

On 11/21/2012 10:59 AM, Linus Walleij wrote:

On Mon, Nov 19, 2012 at 10:39 AM, Sebastian Hesselbarth
  wrote:

This patch relies on a patch set for mvebu pinctrl taken through
Linus' pinctrl branch. As there is no other platform than Dove
involved, I suggest taking it through Jason's tree to avoid any
further conflicts.


Sounds like a plan. So you have some commit history pulled
in from the pinctrl tree in the MVEBU tree?


Linus,

I am referring to patches for a pinctrl/mvebu subfolder. IIRC Thomas
posted that patch a while ago. Jason is currently sorting things out
for mvebu pull requests. I guess both can comment on your question,
as I don't fully understand it.

Sebastian


[PATCH 01/46] x86: mm: only do a local tlb flush in ptep_set_access_flags()

2012-11-21 Thread Mel Gorman
From: Rik van Riel 

The function ptep_set_access_flags() is only ever invoked to set access
flags or add write permission on a PTE.  The write bit is only ever set
together with the dirty bit.

Because we only ever upgrade a PTE, it is safe to skip flushing entries on
remote TLBs. The worst that can happen is a spurious page fault on other
CPUs, which would flush that TLB entry.

Lazily letting another CPU incur a spurious page fault occasionally is
(much!) cheaper than aggressively flushing everybody else's TLB.

Signed-off-by: Rik van Riel 
Cc: Linus Torvalds 
Cc: Andrew Morton 
Cc: Peter Zijlstra 
Cc: Michel Lespinasse 
Cc: Ingo Molnar 
---
 arch/x86/mm/pgtable.c |9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..be3bb46 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -301,6 +301,13 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
free_page((unsigned long)pgd);
 }
 
+/*
+ * Used to set accessed or dirty bits in the page table entries
+ * on other architectures. On x86, the accessed and dirty bits
+ * are tracked by hardware. However, do_wp_page calls this function
+ * to also make the pte writeable at the same time the dirty bit is
+ * set. In that case we do actually need to write the PTE.
+ */
 int ptep_set_access_flags(struct vm_area_struct *vma,
  unsigned long address, pte_t *ptep,
  pte_t entry, int dirty)
@@ -310,7 +317,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
if (changed && dirty) {
*ptep = entry;
pte_update_defer(vma->vm_mm, address, ptep);
-   flush_tlb_page(vma, address);
+   __flush_tlb_one(address);
}
 
return changed;
-- 
1.7.9.2



[PATCH 10/46] mm: compaction: Add scanned and isolated counters for compaction

2012-11-21 Thread Mel Gorman
Compaction already has tracepoints to count scanned and isolated pages
but it requires that ftrace be enabled and if that information has to be
written to disk then it can be disruptive. This patch adds vmstat counters
for compaction called compact_migrate_scanned, compact_free_scanned and
compact_isolated.

With these counters, it is possible to define a basic cost model for
compaction. This approximates how much work compaction is doing; it can
be compared with an oprofile showing TLB misses to see if the cost of
compaction is being offset by THP, for example. Minimally, a compaction patch
can be evaluated in terms of whether it increases or decreases cost. The
basic cost model looks like this:

Fundamental unit u: a word, sizeof(void *)

Ca  = cost of struct page access = sizeof(struct page) / u

Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
Cmf = Cost migrate failure   = Ca * 2
Ci  = Cost page isolation    = (Ca + Wi)
      where Wi is a constant that should reflect the approximate
      cost of the locking operation.

Csm = Cost migrate scanning = Ca
Csf = Cost free scanning    = Ca

Overall cost = (Csm * compact_migrate_scanned) +
               (Csf * compact_free_scanned)    +
               (Ci  * compact_isolated)        +
               (Cmc * pgmigrate_success)       +
               (Cmf * pgmigrate_failed)

Where the values are read from /proc/vmstat.

This is very basic and ignores certain costs such as the allocation cost
to do a migrate page copy but any improvement to the model would still
use the same vmstat counters.
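
As an illustration, a small userspace program that evaluates the model from
/proc/vmstat; sizeof(struct page), PAGE_SIZE and Wi below are assumed values,
not something this patch defines:

#include <stdio.h>
#include <string.h>

/* Read one counter from /proc/vmstat, 0 if not present. */
static unsigned long vmstat(const char *name)
{
	char key[64];
	unsigned long val;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 0;
	while (fscanf(f, "%63s %lu", key, &val) == 2) {
		if (!strcmp(key, name)) {
			fclose(f);
			return val;
		}
	}
	fclose(f);
	return 0;
}

int main(void)
{
	const unsigned long u   = sizeof(void *);
	const unsigned long Ca  = 64 / u;		/* assumes sizeof(struct page) == 64 */
	const unsigned long Wi  = 1;			/* assumed isolation lock cost */
	const unsigned long Cmc = (Ca + 4096 / u) * 2;	/* assumes PAGE_SIZE == 4096 */
	const unsigned long Cmf = Ca * 2;
	const unsigned long Ci  = Ca + Wi;
	unsigned long cost;

	cost = Ca  * vmstat("compact_migrate_scanned") +
	       Ca  * vmstat("compact_free_scanned") +
	       Ci  * vmstat("compact_isolated") +
	       Cmc * vmstat("pgmigrate_success") +
	       Cmf * vmstat("pgmigrate_fail");
	printf("approximate compaction cost: %lu word accesses\n", cost);
	return 0;
}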

Signed-off-by: Mel Gorman 
Reviewed-by: Rik van Riel 
---
 include/linux/vm_event_item.h |2 ++
 mm/compaction.c   |8 
 mm/vmstat.c   |3 +++
 3 files changed, 13 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 8aa7cb9..a1f750b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -42,6 +42,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
 #endif
 #ifdef CONFIG_COMPACTION
+   COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
+   COMPACTISOLATED,
COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
 #endif
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/compaction.c b/mm/compaction.c
index 2c077a7..aee7443 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -356,6 +356,10 @@ static unsigned long isolate_freepages_block(struct 
compact_control *cc,
if (blockpfn == end_pfn)
update_pageblock_skip(cc, valid_page, total_isolated, false);
 
+   count_vm_events(COMPACTFREE_SCANNED, nr_scanned);
+   if (total_isolated)
+   count_vm_events(COMPACTISOLATED, total_isolated);
+
return total_isolated;
 }
 
@@ -646,6 +650,10 @@ next_pageblock:
 
trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
 
+   count_vm_events(COMPACTMIGRATE_SCANNED, nr_scanned);
+   if (nr_isolated)
+   count_vm_events(COMPACTISOLATED, nr_isolated);
+
return low_pfn;
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 89a7fd6..3a067fa 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -779,6 +779,9 @@ const char * const vmstat_text[] = {
"pgmigrate_fail",
 #endif
 #ifdef CONFIG_COMPACTION
+   "compact_migrate_scanned",
+   "compact_free_scanned",
+   "compact_isolated",
"compact_stall",
"compact_fail",
"compact_success",
-- 
1.7.9.2



[PATCH 12/46] mm: numa: pte_numa() and pmd_numa()

2012-11-21 Thread Mel Gorman
From: Andrea Arcangeli 

Implement pte_numa and pmd_numa.

We must atomically set the numa bit and clear the present bit to
define a pte_numa or pmd_numa.

Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
a thread touches a virtual address in the corresponding virtual range,
a NUMA hinting page fault will trigger. The NUMA hinting page fault
will clear the NUMA bit and set the present bit again to resolve the
page fault.

The expectation is that a NUMA hinting page fault is used as part
of a placement policy that decides if a page should remain on the
current node or migrated to a different node.

Acked-by: Rik van Riel 
Signed-off-by: Andrea Arcangeli 
Signed-off-by: Mel Gorman 
---
 arch/x86/include/asm/pgtable.h |   11 --
 include/asm-generic/pgtable.h  |   74 
 init/Kconfig   |   33 ++
 3 files changed, 116 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5fe03aa..9cd7b72 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -404,7 +404,8 @@ static inline int pte_same(pte_t a, pte_t b)
 
 static inline int pte_present(pte_t a)
 {
-   return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
+   return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+  _PAGE_NUMA);
 }
 
 #define pte_accessible pte_accessible
@@ -426,7 +427,8 @@ static inline int pmd_present(pmd_t pmd)
 * the _PAGE_PSE flag will remain set at all times while the
 * _PAGE_PRESENT bit is clear).
 */
-   return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
+   return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
+_PAGE_NUMA);
 }
 
 static inline int pmd_none(pmd_t pmd)
@@ -485,6 +487,11 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, 
unsigned long address)
 
 static inline int pmd_bad(pmd_t pmd)
 {
+#ifdef CONFIG_BALANCE_NUMA
+   /* pmd_numa check */
+   if ((pmd_flags(pmd) & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)
+   return 0;
+#endif
return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
 
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 48fc1dc..1e236fe 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -558,6 +558,80 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
+#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
+/*
+ * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
+ * same bit too). It's set only when _PAGE_PRESENT is not set and it's
+ * never set if _PAGE_PRESENT is set.
+ *
+ * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
+ * fault triggers on those regions if pte/pmd_numa returns true
+ * (because _PAGE_PRESENT is not set).
+ */
+#ifndef pte_numa
+static inline int pte_numa(pte_t pte)
+{
+   return (pte_flags(pte) &
+   (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+#endif
+
+#ifndef pmd_numa
+static inline int pmd_numa(pmd_t pmd)
+{
+   return (pmd_flags(pmd) &
+   (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+#endif
+
+/*
+ * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
+ * because they're called by the NUMA hinting minor page fault. If we
+ * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
+ * would be forced to set it later while filling the TLB after we
+ * return to userland. That would trigger a second write to memory
+ * that we optimize away by setting _PAGE_ACCESSED here.
+ */
+#ifndef pte_mknonnuma
+static inline pte_t pte_mknonnuma(pte_t pte)
+{
+   pte = pte_clear_flags(pte, _PAGE_NUMA);
+   return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+#endif
+
+#ifndef pmd_mknonnuma
+static inline pmd_t pmd_mknonnuma(pmd_t pmd)
+{
+   pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
+   return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+#endif
+
+#ifndef pte_mknuma
+static inline pte_t pte_mknuma(pte_t pte)
+{
+   pte = pte_set_flags(pte, _PAGE_NUMA);
+   return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+#endif
+
+#ifndef pmd_mknuma
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+   pmd = pmd_set_flags(pmd, _PAGE_NUMA);
+   return pmd_clear_flags(pmd, _PAGE_PRESENT);
+}
+#endif
+#else
+extern int pte_numa(pte_t pte);
+extern int pmd_numa(pmd_t pmd);
+extern pte_t pte_mknonnuma(pte_t pte);
+extern pmd_t pmd_mknonnuma(pmd_t pmd);
+extern pte_t pte_mknuma(pte_t pte);
+extern pmd_t pmd_mknuma(pmd_t pmd);
+#endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
+
 #endif /* CONFIG_MMU */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/init/Kconfig b/init/Kconfig
index 6fdd6e3..6897a05 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -696,6 +696,39 @@ config LOG_BUF_SHIFT
 config HAVE_UNSTABLE_SCHED_CLOCK
bool
 
+#
+# For architectures that want to enable the support fo

[PATCH 21/46] mm: mempolicy: Add MPOL_MF_LAZY

2012-11-21 Thread Mel Gorman
From: Lee Schermerhorn 

NOTE: Once again there is a lot of patch stealing and the end result
is sufficiently different that I had to drop the signed-offs.
Will re-add if the original authors are ok with that.

This patch adds another mbind() flag to request "lazy migration".  The
flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
pages are marked PROT_NONE. The pages will be migrated in the fault
path on "first touch", if the policy dictates at that time.

"Lazy Migration" will allow testing of migrate-on-fault via mbind().
Also allows applications to specify that only subsequently touched
pages be migrated to obey new policy, instead of all pages in range.
This can be useful for multi-threaded applications working on a
large shared data area that is initialized by an initial thread
resulting in all pages on one [or a few, if overflowed] nodes.
After PROT_NONE, the pages in regions assigned to the worker threads
will be automatically migrated local to the threads on 1st touch.
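
For illustration, a sketch of how an application would request lazy
migration with this flag (MPOL_MF_LAZY is defined locally since it is not in
released headers; note that a later patch in this series hides the flag from
userspace again for now):

#include <numaif.h>		/* mbind(), MPOL_BIND, MPOL_MF_MOVE; link with -lnuma */

#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY	(1 << 3)	/* value introduced by this patch */
#endif

/* Bind [buf, buf + len) to the given nodes, but migrate pages lazily on
 * first touch instead of eagerly at mbind() time. */
static int bind_lazy(void *buf, unsigned long len,
		     const unsigned long *nodemask, unsigned long maxnode)
{
	return mbind(buf, len, MPOL_BIND, nodemask, maxnode,
		     MPOL_MF_MOVE | MPOL_MF_LAZY);
}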

Signed-off-by: Mel Gorman 
Reviewed-by: Rik van Riel 
---
 include/linux/mm.h |5 ++
 include/uapi/linux/mempolicy.h |   13 ++-
 mm/mempolicy.c |  185 
 3 files changed, 185 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa16152..471185e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1551,6 +1551,11 @@ static inline pgprot_t vm_get_page_prot(unsigned long 
vm_flags)
 }
 #endif
 
+#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
+void change_prot_numa(struct vm_area_struct *vma,
+   unsigned long start, unsigned long end);
+#endif
+
 struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 472de8a..6a1baae 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -49,9 +49,16 @@ enum mpol_rebind_step {
 
 /* Flags for mbind */
 #define MPOL_MF_STRICT (1<<0)  /* Verify existing pages in the mapping */
-#define MPOL_MF_MOVE   (1<<1)  /* Move pages owned by this process to conform 
to mapping */
-#define MPOL_MF_MOVE_ALL (1<<2)/* Move every page to conform to 
mapping */
-#define MPOL_MF_INTERNAL (1<<3)/* Internal flags start here */
+#define MPOL_MF_MOVE(1<<1) /* Move pages owned by this process to conform
+  to policy */
+#define MPOL_MF_MOVE_ALL (1<<2)/* Move every page to conform to policy 
*/
+#define MPOL_MF_LAZY(1<<3) /* Modifies '_MOVE:  lazy migrate on fault */
+#define MPOL_MF_INTERNAL (1<<4)/* Internal flags start here */
+
+#define MPOL_MF_VALID  (MPOL_MF_STRICT   | \
+MPOL_MF_MOVE | \
+MPOL_MF_MOVE_ALL | \
+MPOL_MF_LAZY)
 
 /*
  * Internal flags that share the struct mempolicy flags word with
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index df1466d..51d3ebd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -90,6 +90,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -565,6 +566,145 @@ static inline int check_pgd_range(struct vm_area_struct 
*vma,
return 0;
 }
 
+#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
+/*
+ * Here we search for not shared page mappings (mapcount == 1) and we
+ * set up the pmd/pte_numa on those mappings so the very next access
+ * will fire a NUMA hinting page fault.
+ */
+static int
+change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
+   unsigned long address)
+{
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte, *_pte;
+   struct page *page;
+   unsigned long _address, end;
+   spinlock_t *ptl;
+   int ret = 0;
+
+   VM_BUG_ON(address & ~PAGE_MASK);
+
+   pgd = pgd_offset(mm, address);
+   if (!pgd_present(*pgd))
+   goto out;
+
+   pud = pud_offset(pgd, address);
+   if (!pud_present(*pud))
+   goto out;
+
+   pmd = pmd_offset(pud, address);
+   if (pmd_none(*pmd))
+   goto out;
+
+   if (pmd_trans_huge_lock(pmd, vma) == 1) {
+   int page_nid;
+   ret = HPAGE_PMD_NR;
+
+   VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+   if (pmd_numa(*pmd)) {
+   spin_unlock(&mm->page_table_lock);
+   goto out;
+   }
+
+   page = pmd_page(*pmd);
+
+   /* only check non-shared pages */
+   if (page_mapcount(page) != 1) {
+   spin_unlock(&mm->page_table_lock);
+   goto out;
+   }
+
+   page_nid = page_to_nid(page);
+
+   if (pmd_numa(*pmd)) 

[PATCH 23/46] mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now

2012-11-21 Thread Mel Gorman
The use of MPOL_NOOP and MPOL_MF_LAZY to allow an application to
explicitly request lazy migration is a good idea but the actual
API has not been well reviewed and once released we have to support it.
For now this patch prevents an application from using the services. This
will need to be revisited.

Signed-off-by: Mel Gorman 
---
 include/uapi/linux/mempolicy.h |4 +---
 mm/mempolicy.c |9 -
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 6a1baae..16fb4e6 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -21,7 +21,6 @@ enum {
MPOL_BIND,
MPOL_INTERLEAVE,
MPOL_LOCAL,
-   MPOL_NOOP,  /* retain existing policy for range */
MPOL_MAX,   /* always last member of enum */
 };
 
@@ -57,8 +56,7 @@ enum mpol_rebind_step {
 
 #define MPOL_MF_VALID  (MPOL_MF_STRICT   | \
 MPOL_MF_MOVE | \
-MPOL_MF_MOVE_ALL | \
-MPOL_MF_LAZY)
+MPOL_MF_MOVE_ALL)
 
 /*
  * Internal flags that share the struct mempolicy flags word with
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 75d4600..a7a62fe 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -252,7 +252,7 @@ static struct mempolicy *mpol_new(unsigned short mode, 
unsigned short flags,
pr_debug("setting mode %d flags %d nodes[0] %lx\n",
 mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
 
-   if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
+   if (mode == MPOL_DEFAULT) {
if (nodes && !nodes_empty(*nodes))
return ERR_PTR(-EINVAL);
return NULL;
@@ -1186,7 +1186,7 @@ static long do_mbind(unsigned long start, unsigned long 
len,
if (start & ~PAGE_MASK)
return -EINVAL;
 
-   if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
+   if (mode == MPOL_DEFAULT)
flags &= ~MPOL_MF_STRICT;
 
len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -1241,7 +1241,7 @@ static long do_mbind(unsigned long start, unsigned long 
len,
  flags | MPOL_MF_INVERT, &pagelist);
 
err = PTR_ERR(vma); /* maybe ... */
-   if (!IS_ERR(vma) && mode != MPOL_NOOP)
+   if (!IS_ERR(vma))
err = mbind_range(mm, start, end, new);
 
if (!err) {
@@ -2530,7 +2530,6 @@ static const char * const policy_modes[] =
[MPOL_BIND]   = "bind",
[MPOL_INTERLEAVE] = "interleave",
[MPOL_LOCAL]  = "local",
-   [MPOL_NOOP]   = "noop", /* should not actually be used */
 };
 
 
@@ -2581,7 +2580,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, 
int no_context)
break;
}
}
-   if (mode >= MPOL_MAX || mode == MPOL_NOOP)
+   if (mode >= MPOL_MAX)
goto out;
 
switch (mode) {
-- 
1.7.9.2



[PATCH 31/46] mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting

2012-11-21 Thread Mel Gorman
From: Andrea Arcangeli 

This defines the per-node data used by Migrate On Fault in order to
rate limit the migration. The rate limiting is applied independently
to each destination node.
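
To show how these fields are meant to be used, a sketch of a rate-limit
check (the window length and page limit are placeholders; the real check and
its tunables come later in the series):

#define NUMA_MIGRATE_WINDOW	(HZ)		/* placeholder interval */
#define NUMA_MIGRATE_MAX_PAGES	(32UL << 8)	/* placeholder page limit */

static bool numa_migrate_rate_limited(pg_data_t *pgdat, unsigned long nr_pages)
{
	bool limited = false;

	spin_lock(&pgdat->balancenuma_migrate_lock);
	/* Start a new window if the old one has expired */
	if (time_after(jiffies, pgdat->balancenuma_migrate_next_window)) {
		pgdat->balancenuma_migrate_nr_pages = 0;
		pgdat->balancenuma_migrate_next_window = jiffies +
							 NUMA_MIGRATE_WINDOW;
	}
	/* Refuse the migration if this window's budget is exhausted */
	if (pgdat->balancenuma_migrate_nr_pages + nr_pages >
	    NUMA_MIGRATE_MAX_PAGES)
		limited = true;
	else
		pgdat->balancenuma_migrate_nr_pages += nr_pages;
	spin_unlock(&pgdat->balancenuma_migrate_lock);

	return limited;
}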

Signed-off-by: Andrea Arcangeli 
Signed-off-by: Mel Gorman 
---
 include/linux/mmzone.h |   13 +
 mm/page_alloc.c|5 +
 2 files changed, 18 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a23923b..1ed16e5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -717,6 +717,19 @@ typedef struct pglist_data {
struct task_struct *kswapd; /* Protected by lock_memory_hotplug() */
int kswapd_max_order;
enum zone_type classzone_idx;
+#ifdef CONFIG_BALANCE_NUMA
+   /*
+* Lock serializing the per destination node AutoNUMA memory
+* migration rate limiting data.
+*/
+   spinlock_t balancenuma_migrate_lock;
+
+   /* Rate limiting time interval */
+   unsigned long balancenuma_migrate_next_window;
+
+   /* Number of pages migrated during the rate limiting time interval */
+   unsigned long balancenuma_migrate_nr_pages;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5953dc2..df58654 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4449,6 +4449,11 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat,
int ret;
 
pgdat_resize_init(pgdat);
+#ifdef CONFIG_BALANCE_NUMA
+   spin_lock_init(&pgdat->balancenuma_migrate_lock);
+   pgdat->balancenuma_migrate_nr_pages = 0;
+   pgdat->balancenuma_migrate_next_window = jiffies;
+#endif
init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
pgdat_page_cgroup_init(pgdat);
-- 
1.7.9.2



[PATCH 35/46] mm: numa: Introduce last_nid to the page frame

2012-11-21 Thread Mel Gorman
This patch introduces a last_nid field to the page struct. This is used
to build a two-stage filter in the next patch that is aimed at
mitigating a problem whereby pages migrate to the wrong node when
referenced by a process that was running off its home node.

Signed-off-by: Mel Gorman 
---
 include/linux/mm.h   |   30 ++
 include/linux/mm_types.h |4 
 mm/page_alloc.c  |2 ++
 3 files changed, 36 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d04c2f0..a0834e1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -693,6 +693,36 @@ static inline int page_to_nid(const struct page *page)
 }
 #endif
 
+#ifdef CONFIG_BALANCE_NUMA
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+   return xchg(&page->_last_nid, nid);
+}
+
+static inline int page_last_nid(struct page *page)
+{
+   return page->_last_nid;
+}
+static inline void reset_page_last_nid(struct page *page)
+{
+   page->_last_nid = -1;
+}
+#else
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+   return page_to_nid(page);
+}
+
+static inline int page_last_nid(struct page *page)
+{
+   return page_to_nid(page);
+}
+
+static inline void reset_page_last_nid(struct page *page)
+{
+}
+#endif
+
 static inline struct zone *page_zone(const struct page *page)
 {
return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b40f4ef..6b478ff 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -175,6 +175,10 @@ struct page {
 */
void *shadow;
 #endif
+
+#ifdef CONFIG_BALANCE_NUMA
+   int _last_nid;
+#endif
 }
 /*
  * The struct page can be forced to be double word aligned so that atomic ops
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index df58654..fd6a073 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -608,6 +608,7 @@ static inline int free_pages_check(struct page *page)
bad_page(page);
return 1;
}
+   reset_page_last_nid(page);
if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
return 0;
@@ -3826,6 +3827,7 @@ void __meminit memmap_init_zone(unsigned long size, int 
nid, unsigned long zone,
mminit_verify_page_links(page, zone, nid, pfn);
init_page_count(page);
reset_page_mapcount(page);
+   reset_page_last_nid(page);
SetPageReserved(page);
/*
 * Mark the block movable so that blocks are reserved for
-- 
1.7.9.2



[PATCH 42/46] sched: numa: CPU follows memory

2012-11-21 Thread Mel Gorman
NOTE: This is heavily based on "autonuma: CPU follows memory algorithm"
and "autonuma: mm_autonuma and task_autonuma data structures"
with bits taken but worked within the scheduler hooks and home
node mechanism as defined by schednuma.

This patch adds per-mm and per-task data structures to track the number
of faults in total and on a per-nid basis. On each NUMA fault it
checks if the system would benefit if the current task was migrated
to another node. If the task should be migrated, its home node is
updated and the task is requeued.

[dhi...@gmail.com: remove unnecessary check]
Signed-off-by: Mel Gorman 
---
 include/linux/sched.h |1 -
 kernel/sched/fair.c   |  228 -
 2 files changed, 226 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7b6625a..269ff7d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2040,7 +2040,6 @@ extern unsigned int sysctl_balance_numa_scan_delay;
 extern unsigned int sysctl_balance_numa_scan_period_min;
 extern unsigned int sysctl_balance_numa_scan_period_max;
 extern unsigned int sysctl_balance_numa_scan_size;
-extern unsigned int sysctl_balance_numa_settle_count;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fc8f95d..495eed8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -837,15 +837,229 @@ unsigned int sysctl_balance_numa_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_balance_numa_scan_delay = 1000;
 
+#define BALANCENUMA_SCALE 1000
+static inline unsigned long balancenuma_weight(unsigned long nid_faults,
+  unsigned long total_faults)
+{
+   if (nid_faults > total_faults)
+   nid_faults = total_faults;
+
+   return nid_faults * BALANCENUMA_SCALE / total_faults;
+}
+
+static inline unsigned long balancenuma_task_weight(struct task_struct *p,
+   int nid)
+{
+   struct task_balancenuma *task_balancenuma = p->task_balancenuma;
+   unsigned long nid_faults, total_faults;
+
+   nid_faults = task_balancenuma->task_numa_fault[nid];
+   total_faults = task_balancenuma->task_numa_fault_tot;
+   return balancenuma_weight(nid_faults, total_faults);
+}
+
+static inline unsigned long balancenuma_mm_weight(struct task_struct *p,
+   int nid)
+{
+   struct mm_balancenuma *mm_balancenuma = p->mm->mm_balancenuma;
+   unsigned long nid_faults, total_faults;
+
+   nid_faults = mm_balancenuma->mm_numa_fault[nid];
+   total_faults = mm_balancenuma->mm_numa_fault_tot;
+
+   /* It's possible for total_faults to decay to 0 in parallel so check */
+   return total_faults ? balancenuma_weight(nid_faults, total_faults) : 0;
+}
+
+/*
+ * Examines all other nodes examining remote tasks to see if there would
+ * be fewer remote numa faults if tasks swapped home nodes
+ */
+static void task_numa_find_placement(struct task_struct *p)
+{
+   struct cpumask *allowed = tsk_cpus_allowed(p);
+   int this_cpu = smp_processor_id();
+   int this_nid = numa_node_id();
+   long p_task_weight, p_mm_weight;
+   long weight_diff_max = 0;
+   struct task_struct *selected_task = NULL;
+   int selected_nid = -1;
+   int nid;
+
+   p_task_weight = balancenuma_task_weight(p, this_nid);
+   p_mm_weight = balancenuma_mm_weight(p, this_nid);
+
+   /* Examine a task on every other node */
+   for_each_online_node(nid) {
+   int cpu;
+   for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
+   struct rq *rq;
+   struct mm_struct *other_mm;
+   struct task_struct *other_task;
+   long this_weight, other_weight, p_weight;
+   long other_diff, this_diff;
+
+   if (!cpu_online(cpu) || idle_cpu(cpu))
+   continue;
+
+   /* Racy check if a task is running on the other rq */
+   rq = cpu_rq(cpu);
+   other_mm = rq->curr->mm;
+   if (!other_mm || !other_mm->mm_balancenuma)
+   continue;
+
+   /* Effectively pin the other task to get fault stats */
+   raw_spin_lock_irq(&rq->lock);
+   other_task = rq->curr;
+   other_mm = other_task->mm;
+
+   /* Ensure the other task has usable stats */
+   if (!other_task->task_balancenuma ||
+   !other_task->task_balancenuma->task_numa_fault_tot 
||
+   !other_mm ||
+  

[PATCH 44/46] sched: numa: Consider only one CPU per node for CPU-follows-memory

2012-11-21 Thread Mel Gorman
The implementation of CPU follows memory was intended to reflect
the considerations made by autonuma on the basis that it had the
best performance figures at the time of writing. However, a major
criticism was the use of kernel threads and the impact of the
cost of the load balancer paths. As a consequence, the cpu follows
memory algorithm moved to the task_numa_work() path where it would
be incurred directly by the process. Unfortunately, it's still very
heavy; it's just much easier to measure now.

This patch attempts to reduce the cost of the path. Only one CPU
per node is considered for tasks to swap. If there is a task running
on that CPU, the calculations will determine if the system would be
better overall if the tasks were swapped. If the CPU is idle, the
check is whether running this task on that node would be better than
running it on the current node.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c |   21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 495eed8..2c9300f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -899,9 +899,18 @@ static void task_numa_find_placement(struct task_struct *p)
long this_weight, other_weight, p_weight;
long other_diff, this_diff;
 
-   if (!cpu_online(cpu) || idle_cpu(cpu))
+   if (!cpu_online(cpu))
continue;
 
+   /* Idle CPU, consider running this task on that node */
+   if (idle_cpu(cpu)) {
+   this_weight = balancenuma_task_weight(p, nid);
+   other_weight = 0;
+   other_task = NULL;
+   p_weight = p_task_weight;
+   goto compare_other;
+   }
+
/* Racy check if a task is running on the other rq */
rq = cpu_rq(cpu);
other_mm = rq->curr->mm;
@@ -947,6 +956,7 @@ static void task_numa_find_placement(struct task_struct *p)
 
raw_spin_unlock_irq(&rq->lock);
 
+compare_other:
/*
 * other_diff: How much does the current task perfer to
 * run on the remote node thn the task that is
@@ -975,13 +985,20 @@ static void task_numa_find_placement(struct task_struct 
*p)
selected_task = other_task;
}
}
+
+   /*
+* Examine just one task per node. Examing all tasks
+* disrupts the system excessively
+*/
+   break;
}
}
 
/* Swap the task on the selected target node */
if (selected_nid != -1 && selected_nid != this_nid) {
sched_setnode(p, selected_nid);
-   sched_setnode(selected_task, this_nid);
+   if (selected_task)
+   sched_setnode(selected_task, this_nid);
}
 }
 
-- 
1.7.9.2



[PATCH 46/46] Simple CPU follow

2012-11-21 Thread Mel Gorman
---
 kernel/sched/fair.c |  112 +++
 1 file changed, 15 insertions(+), 97 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5cc5b60..fd53f17 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -877,118 +877,36 @@ static inline unsigned long balancenuma_mm_weight(struct 
task_struct *p,
  */
 static void task_numa_find_placement(struct task_struct *p)
 {
-   struct cpumask *allowed = tsk_cpus_allowed(p);
-   int this_cpu = smp_processor_id();
int this_nid = numa_node_id();
long p_task_weight, p_mm_weight;
-   long weight_diff_max = 0;
-   struct task_struct *selected_task = NULL;
+   long max_weight = 0;
int selected_nid = -1;
int nid;
 
p_task_weight = balancenuma_task_weight(p, this_nid);
p_mm_weight = balancenuma_mm_weight(p, this_nid);
 
-   /* Examine a task on every other node */
+   /* Check if this task should run on another node */
for_each_online_node(nid) {
-   int cpu;
-   for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
-   struct rq *rq;
-   struct mm_struct *other_mm;
-   struct task_struct *other_task;
-   long this_weight, other_weight, p_weight;
-   long other_diff, this_diff;
-
-   if (!cpu_online(cpu))
-   continue;
-
-   /* Idle CPU, consider running this task on that node */
-   if (idle_cpu(cpu)) {
-   this_weight = balancenuma_task_weight(p, nid);
-   other_weight = 0;
-   other_task = NULL;
-   p_weight = p_task_weight;
-   goto compare_other;
-   }
-
-   /* Racy check if a task is running on the other rq */
-   rq = cpu_rq(cpu);
-   other_mm = rq->curr->mm;
-   if (!other_mm || !other_mm->mm_balancenuma)
-   continue;
-
-   /* Effectively pin the other task to get fault stats */
-   raw_spin_lock_irq(&rq->lock);
-   other_task = rq->curr;
-   other_mm = other_task->mm;
-
-   /* Ensure the other task has usable stats */
-   if (!other_task->task_balancenuma ||
-   !other_task->task_balancenuma->task_numa_fault_tot 
||
-   !other_mm ||
-   !other_mm->mm_balancenuma ||
-   !other_mm->mm_balancenuma->mm_numa_fault_tot) {
-   raw_spin_unlock_irq(&rq->lock);
-   continue;
-   }
-
-   /*
-* Read the fault statistics. If the remote task is a
-* thread in the process then use the task statistics.
-* Otherwise use the per-mm statistics.
-*/
-   if (other_mm == p->mm) {
-   this_weight = balancenuma_task_weight(p, nid);
-   other_weight = 
balancenuma_task_weight(other_task, nid);
-   p_weight = p_task_weight;
-   } else {
-   this_weight = balancenuma_mm_weight(p, nid);
-   other_weight = 
balancenuma_mm_weight(other_task, nid);
-   p_weight = p_mm_weight;
-   }
-
-   raw_spin_unlock_irq(&rq->lock);
-
-compare_other:
-   /*
-* other_diff: How much does the current task perfer to
-* run on the remote node thn the task that is
-* currently running there?
-*/
-   other_diff = this_weight - other_weight;
+   unsigned long nid_weight;
 
-   /*
-* this_diff: How much does the currrent task prefer to
-* run on the remote NUMA node compared to the current
-* node?
-*/
-   this_diff = this_weight - p_weight;
-
-   /*
-* Would nid reduce the overall cross-node NUMA faults?
-*/
-   if (other_diff > 0 && this_diff > 0) {
-   long weight_diff = other_diff + this_diff;
-
-   /* Remember the best candidate. */
-   if (weight_diff > weight_diff_max) {
-

[PATCH 45/46] balancenuma: no task swap in finding placement

2012-11-21 Thread Mel Gorman
From: Hillf Danton 

A node is selected on behalf of the given task, but there is no reason to
punish the tasks currently running on other nodes. That punishment may be a
benefit, who knows. Better if they are not treated in a random way.

Signed-off-by: Hillf Danton 
---
 kernel/sched/fair.c |   15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2c9300f..5cc5b60 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -873,7 +873,7 @@ static inline unsigned long balancenuma_mm_weight(struct 
task_struct *p,
 
 /*
  * Examines all other nodes examining remote tasks to see if there would
- * be fewer remote numa faults if tasks swapped home nodes
+ * be fewer remote numa faults
  */
 static void task_numa_find_placement(struct task_struct *p)
 {
@@ -932,13 +932,6 @@ static void task_numa_find_placement(struct task_struct *p)
continue;
}
 
-   /* Ensure the other task can be swapped */
-   if (!cpumask_test_cpu(this_cpu,
- tsk_cpus_allowed(other_task))) {
-   raw_spin_unlock_irq(&rq->lock);
-   continue;
-   }
-
/*
 * Read the fault statistics. If the remote task is a
 * thread in the process then use the task statistics.
@@ -972,8 +965,7 @@ compare_other:
this_diff = this_weight - p_weight;
 
/*
-* Would swapping the tasks reduce the overall
-* cross-node NUMA faults?
+* Would nid reduce the overall cross-node NUMA faults?
 */
if (other_diff > 0 && this_diff > 0) {
long weight_diff = other_diff + this_diff;
@@ -994,11 +986,8 @@ compare_other:
}
}
 
-   /* Swap the task on the selected target node */
if (selected_nid != -1 && selected_nid != this_nid) {
sched_setnode(p, selected_nid);
-   if (selected_task)
-   sched_setnode(selected_task, this_nid);
}
 }
 
-- 
1.7.9.2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 38/46] sched: numa: Introduce tsk_home_node()

2012-11-21 Thread Mel Gorman
From: Peter Zijlstra 

Introduce the home-node concept for tasks. In order to keep memory
locality we need to have something to stay local to, so we define the
home-node of a task as the node we prefer to allocate memory from and
prefer to execute on.

These are not hard guarantees, merely soft preferences. This allows for
optimal resource usage: we can run a task away from the home-node, and the
remote memory hit -- while expensive -- is less expensive than not
running at all, or barely running, due to severe cpu overload.

Similarly, we can allocate memory from another node if our home-node
is depleted, again, some memory is better than no memory.

This patch merely introduces the basic infrastructure, all policy
comes later.
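
For illustration only (not part of this patch), a later placement policy
could use the helper along these lines to spot a task running away from
its preferred node; task_is_offnode() is a made-up name for this sketch:

    static inline bool task_is_offnode(struct task_struct *p)
    {
            int home = tsk_home_node(p);    /* -1 when no home node is set */

            return home != -1 && home != cpu_to_node(task_cpu(p));
    }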

Signed-off-by: Peter Zijlstra 
Cc: Lee Schermerhorn 
Cc: Rik van Riel 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Signed-off-by: Ingo Molnar 
Signed-off-by: Mel Gorman 
---
 include/linux/init_task.h |8 
 include/linux/sched.h |   10 ++
 kernel/sched/core.c   |   36 
 3 files changed, 54 insertions(+)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6d087c5..fdf0692 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -143,6 +143,13 @@ extern struct task_group root_task_group;
 
 #define INIT_TASK_COMM "swapper"
 
+#ifdef CONFIG_BALANCE_NUMA
+# define INIT_TASK_NUMA(tsk)   \
+   .home_node = -1,
+#else
+# define INIT_TASK_NUMA(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1f (=2MB)
@@ -210,6 +217,7 @@ extern struct task_group root_task_group;
INIT_TRACE_RECURSION\
INIT_TASK_RCU_PREEMPT(tsk)  \
INIT_CPUSET_SEQ \
+   INIT_TASK_NUMA(tsk) \
 }
 
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a2b06ea..b8580f5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1480,6 +1480,7 @@ struct task_struct {
short pref_node_fork;
 #endif
 #ifdef CONFIG_BALANCE_NUMA
+   int home_node;
int numa_scan_seq;
int numa_migrate_seq;
unsigned int numa_scan_period;
@@ -1569,6 +1570,15 @@ static inline void task_numa_fault(int node, int pages)
 }
 #endif
 
+static inline int tsk_home_node(struct task_struct *p)
+{
+#ifdef CONFIG_BALANCE_NUMA
+   return p->home_node;
+#else
+   return -1;
+#endif
+}
+
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 047e3c7..55dcf53 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5972,6 +5972,42 @@ static struct sched_domain_topology_level default_topology[] = {
 
static struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
+#ifdef CONFIG_BALANCE_NUMA
+
+/*
+ * Requeues a task ensuring its on the right load-balance list so
+ * that it might get migrated to its new home.
+ *
+ * Note that we cannot actively migrate ourselves since our callers
+ * can be from atomic context. We rely on the regular load-balance
+ * mechanisms to move us around -- its all preference anyway.
+ */
+void sched_setnode(struct task_struct *p, int node)
+{
+   unsigned long flags;
+   int on_rq, running;
+   struct rq *rq;
+
+   rq = task_rq_lock(p, &flags);
+   on_rq = p->on_rq;
+   running = task_current(rq, p);
+
+   if (on_rq)
+   dequeue_task(rq, p, 0);
+   if (running)
+   p->sched_class->put_prev_task(rq, p);
+
+   p->home_node = node;
+
+   if (running)
+   p->sched_class->set_curr_task(rq);
+   if (on_rq)
+   enqueue_task(rq, p, 0);
+   task_rq_unlock(rq, p, &flags);
+}
+
+#endif /* CONFIG_BALANCE_NUMA */
+
 #ifdef CONFIG_NUMA
 
 static int sched_domains_numa_levels;
-- 
1.7.9.2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 37/46] mm: numa: Add THP migration for the NUMA working set scanning fault case.

2012-11-21 Thread Mel Gorman
Note: This is very heavily based on a patch from Peter Zijlstra with
fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner.  That patch
put a lot of migration logic into mm/huge_memory.c where it does
not belong. This version tries to share some of the migration
logic with migrate_misplaced_page.  However, it should be noted
that now migrate.c is doing more with the pagetable manipulation
than is preferred. The end result is barely recognisable so as
before, the signed-offs had to be removed but will be re-added if
the original authors are ok with it.

Add THP migration for the NUMA working set scanning fault case.

It uses the page lock to serialize. No migration pte dance is
necessary because the pte is already unmapped when we decide
to migrate.
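
As a simplified sketch of that serialisation (illustrative only, reusing
the variable names from the hunk below):

    /* page was pinned earlier under the page table lock */
    lock_page(page);                /* serialise against other THP migrations */

    /* re-check the pmd under the page table lock before migrating */
    spin_lock(&mm->page_table_lock);
    if (unlikely(!pmd_same(pmd, *pmdp))) {
            /* someone else changed the pmd; back out */
            spin_unlock(&mm->page_table_lock);
            unlock_page(page);
            put_page(page);
            return 0;
    }
    spin_unlock(&mm->page_table_lock);

    /* the pmd is already unmapped (pmd_numa), so no migration ptes needed */
    migrated = migrate_misplaced_transhuge_page(mm, vma, pmdp, pmd,
                                                addr, page, target_nid);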

Signed-off-by: Mel Gorman 
---
 include/linux/migrate.h |   16 
 mm/huge_memory.c|   55 +++-
 mm/internal.h   |2 +
 mm/migrate.c|  212 ++-
 4 files changed, 226 insertions(+), 59 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 0d4ee94..23dc324 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -41,6 +41,11 @@ extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
  struct page *newpage, struct page *page);
 extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
+   struct vm_area_struct *vma,
+   pmd_t *pmd, pmd_t entry,
+   unsigned long address,
+   struct page *page, int node);
 extern bool migrate_ratelimited(int node);
 #else
 
@@ -80,6 +85,17 @@ int migrate_misplaced_page(struct page *page, int node)
 {
return -EAGAIN; /* can't migrate now */
 }
+
+static inline
+int migrate_misplaced_transhuge_page(struct mm_struct *mm,
+   struct vm_area_struct *vma,
+   pmd_t *pmd, pmd_t entry,
+   unsigned long address,
+   struct page *page, int node)
+{
+   return -EAGAIN;
+}
+
 static inline
 bool migrate_ratelimited(int node)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8f89a98..e74cb93 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -600,7 +600,7 @@ out:
 }
 __setup("transparent_hugepage=", setup_transparent_hugepage);
 
-static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
if (likely(vma->vm_flags & VM_WRITE))
pmd = pmd_mkwrite(pmd);
@@ -1022,9 +1022,11 @@ out:
 int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd, pmd_t *pmdp)
 {
-   struct page *page = NULL;
+   struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
int target_nid;
+   bool migrated;
+   bool page_locked = false;
 
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp)))
@@ -1032,39 +1034,54 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
page = pmd_page(pmd);
get_page(page);
-   spin_unlock(&mm->page_table_lock);
count_vm_numa_event(NUMA_HINT_FAULTS);
 
target_nid = mpol_misplaced(page, vma, haddr);
-   if (target_nid == -1)
+   if (target_nid == -1) {
+   put_page(page);
goto clear_pmdnuma;
+   }
 
-   /*
-* Due to lacking code to migrate thp pages, we'll split
-* (which preserves the special PROT_NONE) and re-take the
-* fault on the normal pages.
-*/
-   split_huge_page(page);
-   put_page(page);
-
-   return 0;
+   /* Acquire the page lock to serialise THP migrations */
+   spin_unlock(&mm->page_table_lock);
+   lock_page(page);
+   page_locked = true;
 
-clear_pmdnuma:
+   /* Confirm the PTE did not while locked */
spin_lock(&mm->page_table_lock);
-   if (unlikely(!pmd_same(pmd, *pmdp)))
+   if (unlikely(!pmd_same(pmd, *pmdp))) {
+   unlock_page(page);
+   put_page(page);
goto out_unlock;
+   }
+   spin_unlock(&mm->page_table_lock);
+
+   /* Migrate the THP to the requested node */
+   migrated = migrate_misplaced_transhuge_page(mm, vma,
+   pmdp, pmd, addr,
+   page, target_nid);
+   if (!migrated) {
+   spin_lock(&mm->page_table_lock);
+   if (unlikely(!pmd_same(pmd, *pmdp))) {
+   unlock_page(page);
+   goto out_unlock;
+   }
+   goto

[PATCH 43/46] sched: numa: Rename mempolicy to HOME

2012-11-21 Thread Mel Gorman
Rename the policy flag to reflect that, while allocations and migrations
are still based on the referencing node, the home node is taken into
account for migration decisions.

Signed-off-by: Mel Gorman 
---
 include/uapi/linux/mempolicy.h |9 -
 mm/mempolicy.c |9 ++---
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 0d11c3d..4506772 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -67,7 +67,14 @@ enum mpol_rebind_step {
 #define MPOL_F_LOCAL   (1 << 1)/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)  /* identify policies in rebinding */
 #define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */
-#define MPOL_F_MORON   (1 << 4) /* Migrate On pte_numa Reference On Node */
+#define MPOL_F_HOME(1 << 4) /*
+ * Migrate towards referencing node.
+ * By building up stats on faults, the
+ * scheduler will reinforce the choice
+ * by identifying a home node and
+ * queueing the task on that node
+ * where possible.
+ */
 
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index fd20e28..3da7435 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2316,8 +2316,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
BUG();
}
 
-   /* Migrate the page towards the node whose CPU is referencing it */
-   if (pol->flags & MPOL_F_MORON) {
+   /*
+* Migrate pages towards their referencing node. Based on the fault
+* statistics a home node will be chosen by the scheduler
+*/
+   if (pol->flags & MPOL_F_HOME) {
int last_nid;
 
polnid = numa_node_id();
@@ -2540,7 +2543,7 @@ void __init numa_policy_init(void)
preferred_node_policy[nid] = (struct mempolicy) {
.refcnt = ATOMIC_INIT(1),
.mode = MPOL_PREFERRED,
-   .flags = MPOL_F_MOF | MPOL_F_MORON,
+   .flags = MPOL_F_MOF | MPOL_F_HOME,
.v = { .preferred_node = nid, },
};
}
-- 
1.7.9.2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 41/46] sched: numa: Introduce per-mm and per-task structures

2012-11-21 Thread Mel Gorman
NOTE: This is heavily based on "autonuma: CPU follows memory algorithm"
and "autonuma: mm_autonuma and task_autonuma data structures"

At the most basic level, any placement policy is going to make some
sort of smart decision based on per-mm and per-task statistics. This
patch simply introduces the structures with basic fault statistics
that can be expanded upon or replaced later. It may be that a placement
policy can approximate without needing both structures in which case
they can be safely deleted later while still having a comparison point
to ensure the approximation is accurate.
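
As an illustration of how a policy might consume these counters (not part
of this patch, and the helper name is made up; the real weighting is
introduced later in the series):

    /* share, in percent, of this task's recent NUMA faults that hit @nid */
    static inline unsigned long task_numa_fault_share(struct task_struct *p,
                                                      int nid)
    {
            struct task_balancenuma *tb = p->task_balancenuma;

            if (!tb || !tb->task_numa_fault_tot)
                    return 0;

            return tb->task_numa_fault[nid] * 100 / tb->task_numa_fault_tot;
    }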

[dhi...@gmail.com: Use @pages parameter for fault statistics]
Signed-off-by: Mel Gorman 
---
 include/linux/mm_types.h |   26 ++
 include/linux/sched.h|   18 ++
 kernel/fork.c|   18 ++
 kernel/sched/core.c  |3 +++
 kernel/sched/fair.c  |   25 -
 kernel/sched/sched.h |   14 ++
 6 files changed, 103 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6b478ff..9588a91 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -312,6 +312,29 @@ struct mm_rss_stat {
atomic_long_t count[NR_MM_COUNTERS];
 };
 
+#ifdef CONFIG_BALANCE_NUMA
+/*
+ * Per-mm structure that contains the NUMA memory placement statistics
+ * generated by pte_numa faults.
+ */
+struct mm_balancenuma {
+   /*
+* Number of pages that will trigger NUMA faults for this mm. Total
+* decays each time whether the home node should change to keep
+* track only of recent events
+*/
+   unsigned long mm_numa_fault_tot;
+
+   /*
+* Number of pages that will trigger NUMA faults for each [nid].
+* Also decays.
+*/
+   unsigned long mm_numa_fault[0];
+
+   /* do not add more variables here, the above array size is dynamic */
+};
+#endif /* CONFIG_BALANCE_NUMA */
+
 struct mm_struct {
struct vm_area_struct * mmap;   /* list of VMAs */
struct rb_root mm_rb;
@@ -415,6 +438,9 @@ struct mm_struct {
 
/* numa_scan_seq prevents two threads setting pte_numa */
int numa_scan_seq;
+
+   /* this is used by the scheduler and the page allocator */
+   struct mm_balancenuma *mm_balancenuma;
 #endif
struct uprobes_state uprobes_state;
 };
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1cccfc3..7b6625a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1188,6 +1188,23 @@ enum perf_event_task_context {
perf_nr_task_contexts,
 };
 
+#ifdef CONFIG_BALANCE_NUMA
+/*
+ * Per-task structure that contains the NUMA memory placement statistics
+ * generated by pte_numa faults. This structure is dynamically allocated
+ * when the first pte_numa fault is handled.
+ */
+struct task_balancenuma {
+   /* Total number of eligible pages that triggered NUMA faults */
+   unsigned long task_numa_fault_tot;
+
+   /* Number of pages that triggered NUMA faults for each [nid] */
+   unsigned long task_numa_fault[0];
+
+   /* do not add more variables here, the above array size is dynamic */
+};
+#endif /* CONFIG_BALANCE_NUMA */
+
 struct task_struct {
volatile long state;/* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1488,6 +1505,7 @@ struct task_struct {
unsigned int numa_scan_period;
u64 node_stamp; /* migration stamp  */
struct callback_head numa_work;
+   struct task_balancenuma *task_balancenuma;
 #endif /* CONFIG_BALANCE_NUMA */
 
struct rcu_head rcu;
diff --git a/kernel/fork.c b/kernel/fork.c
index 8b20ab7..c8752f6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -525,6 +525,20 @@ static void mm_init_aio(struct mm_struct *mm)
 #endif
 }
 
+#ifdef CONFIG_BALANCE_NUMA
+static inline void free_mm_balancenuma(struct mm_struct *mm)
+{
+   if (mm->mm_balancenuma)
+   kfree(mm->mm_balancenuma);
+
+   mm->mm_balancenuma = NULL;
+}
+#else
+static inline void free_mm_balancenuma(struct mm_struct *mm)
+{
+}
+#endif
+
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 {
atomic_set(&mm->mm_users, 1);
@@ -539,6 +553,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
spin_lock_init(&mm->page_table_lock);
mm->free_area_cache = TASK_UNMAPPED_BASE;
mm->cached_hole_size = ~0UL;
+   mm->mm_balancenuma = NULL;
mm_init_aio(mm);
mm_init_owner(mm, p);
 
@@ -548,6 +563,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
return mm;
}
 
+   free_mm_balancenuma(mm);
free_mm(mm);
return NULL;
 }
@@ -597,6 +613,7 @@ void __mmdrop(struct mm_struct *mm)
destroy_context(mm);
mmu_notifier_mm_destroy(mm);
check_mm(mm);
+   free_mm_b

[PATCH 40/46] sched: numa: Implement home-node awareness

2012-11-21 Thread Mel Gorman
NOTE: Entirely based on "sched, numa, mm: Implement home-node awareness"
but only a subset of it. There was stuff in there that was disabled
by default and generally did slightly more than what I felt was
necessary at this stage. In particular the random queue selection
logic is gone because it looks broken, but that does mean that the
last CPU in a node may see increased scheduling pressure, which
is almost certainly the wrong thing to do. Needs re-examination.
Signed-offs were removed as a result but will be re-added if the
authors are ok with it.

Implement home node preference in the scheduler's load-balancer.

- task_numa_hot(); make it harder to migrate tasks away from their
  home-node, controlled using the NUMA_HOMENODE_PREFERRED feature flag.

- load_balance(); during the regular pull load-balance pass, try
  pulling tasks that are on the wrong node first with a preference of
  moving them nearer to their home-node through task_numa_hot(), controlled
  through the NUMA_PULL feature flag.

- load_balance(); when the balancer finds no imbalance, introduce
  some such that it still prefers to move tasks towards their home-node,
  using active load-balance if needed, controlled through the NUMA_PULL_BIAS
  feature flag.

  In particular, only introduce this BIAS if the system is otherwise properly
  (weight) balanced and we either have an offnode or !numa task to trade
  for it.

In order to easily find off-node tasks, split the per-cpu task list
into two parts.
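
For illustration, the split list could be fed at enqueue time roughly as
below (a sketch only; the helper name and exact bookkeeping are made up,
but the rq fields are the ones added by this patch):

    static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
    {
            struct list_head *tasks = &rq->cfs_tasks;
            int home = tsk_home_node(p);

            if (home != -1 && home != cpu_to_node(cpu_of(rq))) {
                    /* off its home node: keep it easy to find for NUMA pull */
                    tasks = &rq->offnode_tasks;
                    rq->offnode_running++;
                    rq->offnode_weight += p->se.load.weight;
            } else {
                    rq->onnode_running++;
            }

            list_add_tail(&p->se.group_node, tasks);
    }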

Signed-off-by: Mel Gorman 
---
 include/linux/sched.h   |3 +
 kernel/sched/core.c |   14 ++-
 kernel/sched/debug.c|3 +
 kernel/sched/fair.c |  298 +++
 kernel/sched/features.h |   18 +++
 kernel/sched/sched.h|   16 +++
 6 files changed, 324 insertions(+), 28 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b8580f5..1cccfc3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -823,6 +823,7 @@ enum cpu_idle_type {
#define SD_ASYM_PACKING0x0800  /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING  0x1000  /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP 0x2000  /* sched_domains of this level overlap */
+#define SD_NUMA0x4000  /* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
@@ -1481,6 +1482,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_BALANCE_NUMA
int home_node;
+   unsigned long numa_contrib;
int numa_scan_seq;
int numa_migrate_seq;
unsigned int numa_scan_period;
@@ -2104,6 +2106,7 @@ extern int sched_setscheduler(struct task_struct *, int,
  const struct sched_param *);
 extern int sched_setscheduler_nocheck(struct task_struct *, int,
  const struct sched_param *);
+extern void sched_setnode(struct task_struct *p, int node);
 extern struct task_struct *idle_task(int cpu);
 /**
  * is_idle_task - is the specified task an idle task?
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 55dcf53..3d9fc26 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5978,9 +5978,9 @@ static struct sched_domain_topology_level *sched_domain_topology = default_topol
  * Requeues a task ensuring its on the right load-balance list so
  * that it might get migrated to its new home.
  *
- * Note that we cannot actively migrate ourselves since our callers
- * can be from atomic context. We rely on the regular load-balance
- * mechanisms to move us around -- its all preference anyway.
+ * Since home-node is pure preference there's no hard migrate to force
+ * us anywhere, this also allows us to call this from atomic context if
+ * required.
  */
 void sched_setnode(struct task_struct *p, int node)
 {
@@ -6053,6 +6053,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
| 0*SD_SHARE_PKG_RESOURCES
| 1*SD_SERIALIZE
| 0*SD_PREFER_SIBLING
+   | 1*SD_NUMA
| sd_local_flags(level)
,
.last_balance   = jiffies,
@@ -6914,7 +6915,12 @@ void __init sched_init(void)
rq->avg_idle = 2*sysctl_sched_migration_cost;
 
INIT_LIST_HEAD(&rq->cfs_tasks);
-
+#ifdef CONFIG_BALANCE_NUMA
+   INIT_LIST_HEAD(&rq->offnode_tasks);
+   rq->onnode_running = 0;
+   rq->offnode_running = 0;
+   rq->offnode_weight = 0;
+#endif
rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ
rq->nohz_flags = 0;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 6f79596..2474a02 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/d
