date:20161012

Re: [PATCH] iwlwifi: pcie: reduce "unsupported splx" to a warning

2016-10-12 Thread Luca Coelho

Hi Chris,
On Tue, 2016-10-11 at 09:09 -0500, Chris Rorvick wrote:
> On Tue, Oct 11, 2016 at 5:11 AM, Paul Bolle  wrote:
> > > This is not coming from the NIC itself, but from the platform's ACPI
> > > tables.  Can you tell us which platform you are using?
> 
> 
> Interesting.  I'm running a Dell XPS 13 9350.  I replaced the
> factory-provided Broadcom card with an AC 8260.  I can update the
> commit log to reflect this.

Okay, so this makes sense.  Those entries are probably formatted for
the Broadcom card, which the iwlwifi driver obviously doesn't
understand.  The best we can do, as I already said, is to ignore values
we don't understand.

I will also check what the correct procedure is in such cases, because
it is possible, in theory, that the format *matches* but applies only
to another device.

> > > If this is really bothering you, I guess I could apply this patch for
> > > now.  But as I said, this is not solving the actual problem.
> > 
> > 
> > Bikeshedding: I think IWL_INFO() is more appropriate, as info doesn't
> > imply one needs to act on this message, while warn does imply that
> > action is needed.
> 
> 
> Agreed.  I still think making this a warning is appropriate, but it
> seems pretty clear this is not an error.  This has nothing to do with
> how much it bothers me.  An error tells the user something needs to be
> fixed, but in this case the interface is working fine.  Making it a
> warning with an improved message will result in fewer people wasting
> their time.

Yes, so I'll try to stop wasting people's time by trying to do the
correct thing without bothering the user at all. :)

Thanks for pointing this all out!

--
Cheers,
Luca.

Re: [PATCH] gpio: pca953x: add a comment explaining the need for a lockdep subclass

2016-10-12 Thread Wolfram Sang

On Mon, Sep 26, 2016 at 11:54:15AM +0200, Bartosz Golaszewski wrote:
> This is a follow-up to commit 559b46990e76 ("gpio: pca953x: fix an
> incorrect lockdep warning"). The reason for calling
> lockdep_set_subclass() in pca953x_probe() is not explained in
> the code.
> 
> Add a comment describing the problem, partial solution and required
> future extensions.
> 
> Signed-off-by: Bartosz Golaszewski 

Applied to for-current, thanks!




Re: [PATCH 00/44] Convert FibreChannel bsg code to use bsg-lib

2016-10-12 Thread Johannes Thumshirn

On Tue, Oct 11, 2016 at 09:49:38AM -0700, Christoph Hellwig wrote:
> Hi Johannes,
> 
> this looks great to me.  But is there a chance to consolidate it into
> a more manageable set of patches?  E.g. all the patches to call
> export fc_bsg_jobdone, use it directly and remove the function pointer
> could go together, possibly even including the new calling convention.
> Similar all the patches about fc_bsg_to_shost could be merged into one,
> and if we add the bsg refcounting early, we could maybe skip a few
> steps of the conversion later on?

Sure, I think 44 patches is a bit huge. Especially given the 0day bot
fallout it generated. Let me see how I can slim it down.

Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de  +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: [PATCH v3 07/11] arm64/tracing: fix compat syscall handling

2016-10-12 Thread Marcin Nowakowski


Hi Will,

On 11.10.2016 15:36, Will Deacon wrote:

On Tue, Oct 11, 2016 at 12:42:52PM +0200, Marcin Nowakowski wrote:

Add arch_syscall_addr for arm64 and define NR_compat_syscalls, as the
number of compat syscalls for arm64 exceeds the number defined by
NR_syscalls.

Signed-off-by: Marcin Nowakowski 
Cc: Steven Rostedt 
Cc: Ingo Molnar 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: linux-arm-ker...@lists.infradead.org
---
 arch/arm64/include/asm/ftrace.h | 12 +---
 arch/arm64/include/asm/unistd.h |  1 +
 arch/arm64/kernel/Makefile  |  1 +
 arch/arm64/kernel/ftrace.c  | 16 
 4 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index caa955f..b57ff7c 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -41,17 +41,7 @@ static inline unsigned long ftrace_call_adjust(unsigned long 
addr)

 #define ftrace_return_address(n) return_address(n)

-/*
- * Because AArch32 mode does not share the same syscall table with AArch64,
- * tracing compat syscalls may result in reporting bogus syscalls or even
- * hang-up, so just do not trace them.
- * See kernel/trace/trace_syscalls.c
- *
- * x86 code says:
- * If the user really wants these, then they should use the
- * raw syscall tracepoints with filtering.
- */
-#define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS
+#define ARCH_COMPAT_SYSCALL_NUMBERS_OVERLAP 1
 static inline bool arch_trace_is_compat_syscall(struct pt_regs *regs)
 {
return is_compat_task();
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index e78ac26..276d049 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -45,6 +45,7 @@
 #define __ARM_NR_compat_set_tls        (__ARM_NR_COMPAT_BASE+5)

 #define __NR_compat_syscalls   394
+#define NR_compat_syscalls (__NR_compat_syscalls)


We may as well just define NR_compat_syscalls instead of
__NR_compat_syscalls and move the handful of users over.


I had tried to minimise the amount of arch-specific changes here - 
especially those that are not directly related to the proposed syscall 
handling change. But I agree having these 2 #defines is a bit 
unnecessary ...



diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
index 40ad08a..75d010f 100644
--- a/arch/arm64/kernel/ftrace.c
+++ b/arch/arm64/kernel/ftrace.c
@@ -176,4 +176,20 @@ int ftrace_disable_ftrace_graph_caller(void)
return ftrace_modify_graph_caller(false);
 }
 #endif /* CONFIG_DYNAMIC_FTRACE */
+
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
+
+#if (defined CONFIG_FTRACE_SYSCALLS) && (defined CONFIG_COMPAT)
+
+extern const void *sys_call_table[];
+extern const void *compat_sys_call_table[];
+
+unsigned long __init arch_syscall_addr(int nr, bool compat)
+{
+   if (compat)
+   return (unsigned long)compat_sys_call_table[nr];
+
+   return (unsigned long)sys_call_table[nr];
+}


Do we care about the compat private syscalls (from base 0x0f)? We
need to make sure that we exhibit the same behaviour as a native
32-bit ARM machine.

Will


Tracing of such syscalls has been disabled for a long time (see
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=086ba77a6db0).
Apart from using non-contiguous numbers, they are not defined using 
standard SYSCALL macros, so they do not have any metadata generated either.
My suggestion is that if you wanted those to be included in the trace 
then it should be done separately from these changes.


Marcin
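
The table selection discussed in this message can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: the handler functions, the two small tables and the helper name syscall_addr() are made up for illustration and are not the arm64 kernel's definitions, and the bounds check is an addition that the quoted patch does not contain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*syscall_fn)(void);

/* Hypothetical handlers; they only exist so the tables have entries. */
static void native_a(void) {}
static void native_b(void) {}
static void compat_a(void) {}
static void compat_b(void) {}
static void compat_c(void) {}

/* Stand-ins for sys_call_table[] and compat_sys_call_table[]. */
static const syscall_fn native_table[] = { native_a, native_b };
static const syscall_fn compat_table[] = { compat_a, compat_b, compat_c };

#define NR_NATIVE (sizeof(native_table) / sizeof(native_table[0]))
#define NR_COMPAT (sizeof(compat_table) / sizeof(compat_table[0]))

/* The compat table is the larger one here, mirroring the case in the thread. */
static_assert(NR_COMPAT > NR_NATIVE, "compat table expected to be larger");

/*
 * Return the handler for syscall number nr, picking the table by ABI.
 * The same number can name different syscalls in the two tables, so the
 * caller has to say which ABI it means. Returns NULL when nr is out of
 * range for the selected table.
 */
static syscall_fn syscall_addr(int nr, bool compat)
{
	size_t limit = compat ? NR_COMPAT : NR_NATIVE;

	if (nr < 0 || (size_t)nr >= limit)
		return NULL;

	return compat ? compat_table[nr] : native_table[nr];
}
```

In this model, number 2 is valid only for the compat table, which is why a single limit derived from the native table would not be enough.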

Re: [PATCH 1/2] driver core: skip removal test for non-removable drivers

2016-10-12 Thread Laszlo Ersek

Hi Rob,

On 10/11/16 20:41, Rob Herring wrote:
> Some drivers do not support removal/unbinding. These drivers should have
> drv->suppress_bind_attrs set to true, so use that to skip the removal
> test.
> 
> This doesn't fix anything reported so far, but should prevent some other
> cases. Some drivers will need fixes to set suppress_bind_attrs to avoid
> this test.
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=177021
> Fixes: bea5b158ff0d ("driver core: add test of driver remove calls during 
> probe")
> Reported-by: Laszlo Ersek 
> Signed-off-by: Rob Herring 
> ---
>  drivers/base/dd.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index d22a7260f42b..8937a7ad7165 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -324,7 +324,8 @@ static int really_probe(struct device *dev, struct 
> device_driver *drv)
>  {
>   int ret = -EPROBE_DEFER;
>   int local_trigger_count = atomic_read(&deferred_trigger_count);
> - bool test_remove = IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE);
> + bool test_remove = IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE) &&
> +!drv->suppress_bind_attrs;
>  
>   if (defer_all_probes) {
>   /*
> 

can you please repost the full series with me CC'd on all of the
messages; I'm not subscribed to LKML.

Thanks,
Laszlo

Re: [mm] c4344e8035: WARNING: CPU: 0 PID: 101 at mm/memory.c:303 __tlb_remove_page_size+0x25/0x99

2016-10-12 Thread Ye Xiaolong

On 10/12, Aneesh Kumar K.V wrote:
>kernel test robot  writes:
>
>> FYI, we noticed the following commit:
>>
>> https://github.com/0day-ci/linux 
>> Aneesh-Kumar-K-V/mm-Use-the-correct-page-size-when-removing-the-page/20161012-013446
>> commit c4344e80359420d7574b3b90fddf53311f1d24e6 ("mm: Remove the page size 
>> change check in tlb_remove_page")
>>
>> in testcase: boot
>>
>> on test machine: qemu-system-i386 -enable-kvm -cpu Haswell,+smep,+smap -m 
>> 360M
>>
>> caused below changes:
>>
>>
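>> +------------------------------------------------+------------+------------+
>> |                                                | eff764128d | c4344e8035 |
>> +------------------------------------------------+------------+------------+
>> | boot_successes                                 | 59         | 0          |
>> | boot_failures                                  | 0          | 43         |
>> | WARNING:at_mm/memory.c:#__tlb_remove_page_size | 0          | 43         |
>> | calltrace:SyS_execve                           | 0          | 43         |
>> | calltrace:run_init_process                     | 0          | 21         |
>> +------------------------------------------------+------------+------------+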
>>
>>
>>
>> [4.096204] Write protecting the kernel text: 3148k
>> [4.096911] Write protecting the kernel read-only data: 1444k
>> [4.120357] ------------[ cut here ]------------
>> [4.121078] WARNING: CPU: 0 PID: 101 at mm/memory.c:303 
>> __tlb_remove_page_size+0x25/0x99
>> [4.122380] Modules linked in:
>> [4.122788] CPU: 0 PID: 101 Comm: run-parts Not tainted 
>> 4.8.0-mm1-00315-gc4344e8 #5
>> [4.123956]  bd145dc4 b111e5e6 bd145de0 b10320dc 012f b10974d1 
>> bd145e70 c4954170
>> [4.125277]  c4954170 bd145df4 b103215f 0009   
>> bd145e04 b10974d1
>> [4.126424]  c4954170 bd145e70 bd145e14 b10263ca bd145e70 bd47bafc 
>> bd145e40 b109767a
>> [4.127622] Call Trace:
>
>Thanks for the report. The below change should fix this.
>
>commit 18c929e7cf672da617dc218c6265366bf78b1644
>Author: Aneesh Kumar K.V 
>Date:   Wed Oct 12 08:40:41 2016 +0530
>
>update mmu gather page size before flushing page table cache
>
>diff --git a/mm/memory.c b/mm/memory.c
>index 26d1ba8c87e6..7e7eccb82a2b 100644
>--- a/mm/memory.c
>+++ b/mm/memory.c
>@@ -526,7 +526,11 @@ void free_pgd_range(struct mmu_gather *tlb,
>   end -= PMD_SIZE;
>   if (addr > end - 1)
>   return;
>-
>+  /*
>+   * We add page table cache pages with PAGE_SIZE,
>+   * (see pte_free_tlb()), flush the tlb if we need
>+   */
>+  tlb_remove_check_page_size_change(tlb, PAGE_SIZE);
>   pgd = pgd_offset(tlb->mm, addr);
>   do {
>   next = pgd_addr_end(addr, end);
>

I just applied this fix on top of commit c4344e8035 and confirmed that
the reported warning is gone with it.

Tested-by: Xiaolong Ye 

=========================================================================================
compiler/kconfig/rootfs/sleep/tbox_group/testcase:
  gcc-6/i386-randconfig-s1-201641/quantal-core-i386.cgz/1/vm-vp-quantal-i386/boot

commit:
  c4344e80359420d7574b3b90fddf53311f1d24e6
  384db818365c90b91d8bad80be188765e801cf58 ("update mmu gather page size before flushing page table cache")

c4344e80359420d7 384db818365c90b91d8bad80be
---------------- --------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
         24:24        -100%            :5     dmesg.WARNING:at_mm/memory.c:#__tlb_remove_page_size

Thanks,
Xiaolong

Re: [PATCH v2 3/4] mm: try to exhaust highatomic reserve before the OOM

2016-10-12 Thread Vlastimil Babka


On 10/12/2016 07:33 AM, Minchan Kim wrote:

It's weird to show that zone has enough free memory above min
watermark but OOMed with 4K GFP_KERNEL allocation due to
reserved highatomic pages. As last resort, try to unreserve
highatomic pages again and if it has moved pages to
non-highatomic free list, retry reclaim once more.


I would move the details (OOM report etc) from the cover letter here, otherwise 
they end up in Patch 1's changelog, which is less helpful.



Signed-off-by: Michal Hocko 
Signed-off-by: Minchan Kim 


Acked-by: Vlastimil Babka 


---
 mm/page_alloc.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 18808f392718..a7472426663f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2080,7 +2080,7 @@ static void reserve_highatomic_pageblock(struct page 
*page, struct zone *zone,
  * intense memory pressure but failed atomic allocations should be easier
  * to recover from than an OOM.
  */
-static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
+static bool unreserve_highatomic_pageblock(const struct alloc_context *ac)
 {
struct zonelist *zonelist = ac->zonelist;
unsigned long flags;
@@ -2088,6 +2088,7 @@ static void unreserve_highatomic_pageblock(const struct 
alloc_context *ac)
struct zone *zone;
struct page *page;
int order;
+   bool ret = false;

for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
ac->nodemask) {
@@ -2136,12 +2137,14 @@ static void unreserve_highatomic_pageblock(const struct 
alloc_context *ac)
 * may increase.
 */
set_pageblock_migratetype(page, ac->migratetype);
-   move_freepages_block(zone, page, ac->migratetype);
+   ret = move_freepages_block(zone, page, ac->migratetype);
spin_unlock_irqrestore(&zone->lock, flags);
-   return;
+   return ret;
}
spin_unlock_irqrestore(&zone->lock, flags);
}
+
+   return ret;
 }

 /* Remove an element from the buddy allocator from the fallback list */
@@ -3457,8 +3460,12 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 * Make sure we converge to OOM if we cannot make any progress
 * several times in the row.
 */
-   if (*no_progress_loops > MAX_RECLAIM_RETRIES)
+   if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
+   /* Before OOM, exhaust highatomic_reserve */
+   if (unreserve_highatomic_pageblock(ac))
+   return true;
return false;
+   }

/*
 * Keep reclaiming pages while there is a chance this will lead

Re: [PATCH v2 4/4] mm: make unreserve highatomic functions reliable

2016-10-12 Thread Vlastimil Babka


On 10/12/2016 07:33 AM, Minchan Kim wrote:

Currently, unreserve_highatomic_pageblock bails out if it found
highatomic pageblock regardless of really moving free pages
from the one so that it could mitigate unreserve logic's goal
which saves OOM of a process.

This patch makes unreserve functions bail out only if it moves
some pages out of !highatomic free list to avoid such false
positive.

Another potential problem is that by race between page freeing and
reserve highatomic function, pages could be in highatomic free list
even though the pageblock is !high atomic migratetype. In that case,
unreserve_highatomic_pageblock can be void if count of highatomic
reserve is less than pageblock_nr_pages. We could solve it simply
via draining all of reserved pages before the OOM. It would have
a safeguard role to exhaust reserved pages before converging to OOM.

Signed-off-by: Michal Hocko 


Ah, I think that the first S-o-b has to match "From:" for the chain to be
valid (also for 3/4).



Signed-off-by: Minchan Kim 


Acked-by: Vlastimil Babka 


---
 mm/page_alloc.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a7472426663f..565589eae6a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2079,8 +2079,12 @@ static void reserve_highatomic_pageblock(struct page 
*page, struct zone *zone,
  * potentially hurts the reliability of high-order allocations when under
  * intense memory pressure but failed atomic allocations should be easier
  * to recover from than an OOM.
+ *
+ * If @drain is true, try to move all of reserved pages out of highatomic
+ * free list.
  */
-static bool unreserve_highatomic_pageblock(const struct alloc_context *ac)
+static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
+   bool drain)
 {
struct zonelist *zonelist = ac->zonelist;
unsigned long flags;
@@ -2092,8 +2096,12 @@ static bool unreserve_highatomic_pageblock(const struct 
alloc_context *ac)

for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
ac->nodemask) {
-   /* Preserve at least one pageblock */
-   if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
+   /*
+* Preserve at least one pageblock unless memory pressure
+* is really high.
+*/
+   if (!drain && zone->nr_reserved_highatomic <=
+   pageblock_nr_pages)
continue;

spin_lock_irqsave(&zone->lock, flags);
@@ -2138,8 +2146,10 @@ static bool unreserve_highatomic_pageblock(const struct 
alloc_context *ac)
 */
set_pageblock_migratetype(page, ac->migratetype);
ret = move_freepages_block(zone, page, ac->migratetype);
-   spin_unlock_irqrestore(&zone->lock, flags);
-   return ret;
+   if (!drain && ret) {
+   spin_unlock_irqrestore(&zone->lock, flags);
+   return ret;
+   }
}
spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -3343,7 +3353,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int 
order,
 * Shrink them them and try again
 */
if (!page && !drained) {
-   unreserve_highatomic_pageblock(ac);
+   unreserve_highatomic_pageblock(ac, false);
drain_all_pages(NULL);
drained = true;
goto retry;
@@ -3462,7 +3472,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 */
if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
/* Before OOM, exhaust highatomic_reserve */
-   if (unreserve_highatomic_pageblock(ac))
+   if (unreserve_highatomic_pageblock(ac, true))
return true;
return false;
}
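
The effect of the new drain flag can be illustrated with a toy user-space model. This is a sketch under stated assumptions, not the kernel's code: struct toy_zone, struct toy_block, toy_unreserve() and the PAGEBLOCK_NR_PAGES value are all hypothetical, and the real function walks per-order free lists under zone->lock rather than a flat array of blocks. The model returns true if any call to it moved pages, which is a simplification of how ret is handled in the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGEBLOCK_NR_PAGES 512UL /* illustrative value only */
#define MAX_BLOCKS 4

struct toy_block {
	bool reserved;            /* block is currently in the reserve */
	unsigned long free_pages; /* free pages that can be handed back */
};

struct toy_zone {
	unsigned long nr_reserved; /* reserved pages accounted to the zone */
	int nr_blocks;
	struct toy_block blocks[MAX_BLOCKS];
};

/*
 * Without drain: keep at least one pageblock per zone in reserve and stop
 * as soon as some pages were actually moved.
 * With drain: ignore the one-pageblock floor and visit every reserved block.
 */
static bool toy_unreserve(struct toy_zone *zones, int nr_zones, bool drain)
{
	bool moved_any = false;

	for (int i = 0; i < nr_zones; i++) {
		struct toy_zone *z = &zones[i];

		assert(z->nr_blocks >= 0 && z->nr_blocks <= MAX_BLOCKS);

		if (!drain && z->nr_reserved <= PAGEBLOCK_NR_PAGES)
			continue;

		for (int b = 0; b < z->nr_blocks; b++) {
			struct toy_block *blk = &z->blocks[b];
			unsigned long give_back;

			if (!blk->reserved)
				continue;

			/* Never subtract more than is accounted. */
			give_back = z->nr_reserved < PAGEBLOCK_NR_PAGES ?
				    z->nr_reserved : PAGEBLOCK_NR_PAGES;
			z->nr_reserved -= give_back;
			blk->reserved = false;

			if (blk->free_pages > 0) {
				blk->free_pages = 0; /* now on the ordinary free list */
				moved_any = true;
				if (!drain)
					return true;
			}
		}
	}

	return moved_any;
}
```

A block that is unreserved but holds no free pages does not count as progress in this model, which is the "false positive" the changelog wants to avoid.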

Re: [PATCH v2 3/4] mm: try to exhaust highatomic reserve before the OOM

2016-10-12 Thread Michal Hocko

On Wed 12-10-16 14:33:35, Minchan Kim wrote:
> It's weird to show that zone has enough free memory above min
> watermark but OOMed with 4K GFP_KERNEL allocation due to
> reserved highatomic pages. As last resort, try to unreserve
> highatomic pages again and if it has moved pages to
> non-highatomic free list, retry reclaim once more.

Agreed with Vlastimil on the OOM report in the changelog. The above does
not tell the reader much about what the situation looks like or whether
the patch is really needed in their particular case.

A few nits below, but in general it looks good to me.

> Signed-off-by: Michal Hocko 
> Signed-off-by: Minchan Kim 
> ---
>  mm/page_alloc.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 18808f392718..a7472426663f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2080,7 +2080,7 @@ static void reserve_highatomic_pageblock(struct page 
> *page, struct zone *zone,
>   * intense memory pressure but failed atomic allocations should be easier
>   * to recover from than an OOM.
>   */
> -static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
> +static bool unreserve_highatomic_pageblock(const struct alloc_context *ac)
>  {
>   struct zonelist *zonelist = ac->zonelist;
>   unsigned long flags;
> @@ -2088,6 +2088,7 @@ static void unreserve_highatomic_pageblock(const struct 
> alloc_context *ac)
>   struct zone *zone;
>   struct page *page;
>   int order;
> + bool ret = false;

no need for initialization, see below
>  
>   for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
>   ac->nodemask) {
> @@ -2136,12 +2137,14 @@ static void unreserve_highatomic_pageblock(const 
> struct alloc_context *ac)
>* may increase.
>*/
>   set_pageblock_migratetype(page, ac->migratetype);
> - move_freepages_block(zone, page, ac->migratetype);
> + ret = move_freepages_block(zone, page, ac->migratetype);
>   spin_unlock_irqrestore(&zone->lock, flags);
> - return;
> + return ret;
>   }
>   spin_unlock_irqrestore(&zone->lock, flags);
>   }
> +
> + return ret;

return false;
>  }
>  
>  /* Remove an element from the buddy allocator from the fallback list */
> @@ -3457,8 +3460,12 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>* Make sure we converge to OOM if we cannot make any progress
>* several times in the row.
>*/
> - if (*no_progress_loops > MAX_RECLAIM_RETRIES)
> + if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
> + /* Before OOM, exhaust highatomic_reserve */
> + if (unreserve_highatomic_pageblock(ac))
> + return true;

return unreserve_highatomic_pageblock(ac);

>   return false;
> + }
>  
>   /*
>* Keep reclaiming pages while there is a chance this will lead
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs
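
The last nit in this review, returning the helper's result directly, can be checked with a small user-space sketch. Everything here is a stand-in chosen for illustration: fake_unreserve() replaces unreserve_highatomic_pageblock(), MAX_RETRIES is a placeholder rather than the kernel's MAX_RECLAIM_RETRIES, and the final "return true" merely stands for the rest of should_reclaim_retry().

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RETRIES 16 /* placeholder value */

static bool unreserve_result; /* what the fake helper will report */

/* Stand-in for the real unreserve helper. */
static bool fake_unreserve(void)
{
	return unreserve_result;
}

/* Shape of the code as posted in the patch. */
static bool retry_verbose(int no_progress_loops)
{
	if (no_progress_loops > MAX_RETRIES) {
		if (fake_unreserve())
			return true;
		return false;
	}
	return true; /* stands for the remaining retry logic */
}

/* Shape suggested in the review: return the helper's result directly. */
static bool retry_compact(int no_progress_loops)
{
	if (no_progress_loops > MAX_RETRIES)
		return fake_unreserve();
	return true; /* stands for the remaining retry logic */
}

/* Both shapes must agree for any input. */
static void check_equivalent(int loops, bool result)
{
	unreserve_result = result;
	assert(retry_verbose(loops) == retry_compact(loops));
}
```

The two forms behave identically; the compact one simply drops a redundant branch.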

Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area

2016-10-12 Thread zijun_hu

On 10/12/2016 02:53 PM, Michal Hocko wrote:
> On Wed 12-10-16 08:28:17, zijun_hu wrote:
>> On 2016/10/12 1:22, Michal Hocko wrote:
>>> On Tue 11-10-16 21:24:50, zijun_hu wrote:
>>>> From: zijun_hu 
>>>>
>>>> the LSB of a chunk->map element is used for free/in-use flag of a area
>>>> and the other bits for offset, the sufficient and necessary condition of
>>>> this usage is that both size and alignment of a area must be even numbers
>>>> however, pcpu_alloc() doesn't force its @align parameter a even number
>>>> explicitly, so a odd @align maybe causes a series of errors, see below
>>>> example for concrete descriptions.
>>>
>>> Is or was there any user who would use an alignment other than an even one
>>> (or a power of 2)? If not, is this really worth handling?
>>>
>>
>> it seems only a power-of-2 alignment other than 1 can make sure it works
>> well; that is a strict limit, maybe this stricter limit should be checked
> 
> I fail to see how any other alignment would actually make any sense
> whatsoever. Look, I am not a maintainer of this code, but adding new
> code to catch something that doesn't make any sense sounds dubious at
> best to me.
> 
> I could understand this patch if you saw a problem and wanted to prevent
> it from repeating, but making these kinds of changes just in case sounds
> like a bad idea.
> 

Thanks for your reply.

Should we have a general discussion on whether patches like this, which
handle boundary or rare conditions, are necessary?

I found the following code segment in mm/vmalloc.c:
static struct vmap_area *alloc_vmap_area(unsigned long size,
				unsigned long align,
				unsigned long vstart, unsigned long vend,
				int node, gfp_t gfp_mask)
{
	...

	BUG_ON(!size);
	BUG_ON(offset_in_page(size));
	BUG_ON(!is_power_of_2(align));


Should we adopt the following conventions?
1) When we say 'alignment', we mean aligning to a power-of-2 value.
   For example, aligning value @v to @b implies that @b is a power of 2;
   aligning 10 to 4 gives 12.
2) When we say 'round value @v up/down to boundary @b', we mean the
   result is a multiple of @b; it doesn't require @b to be a power of 2.
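
The two conventions proposed above can be sketched in plain C as follows. This is a minimal illustration, not kernel code: the helper names is_pow2(), align_up(), round_up_to() and round_down_to() are invented here, although the mask and division techniques themselves are standard.

```c
#include <assert.h>
#include <stdbool.h>

/* True when n has exactly one bit set. */
static bool is_pow2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Convention 1: align v up to boundary b, where b must be a power of 2. */
static unsigned long align_up(unsigned long v, unsigned long b)
{
	assert(is_pow2(b));
	return (v + b - 1) & ~(b - 1);
}

/* Convention 2: round v up to a multiple of b; b may be any non-zero value. */
static unsigned long round_up_to(unsigned long v, unsigned long b)
{
	assert(b != 0);
	return ((v + b - 1) / b) * b;
}

/* Convention 2: round v down to a multiple of b; b may be any non-zero value. */
static unsigned long round_down_to(unsigned long v, unsigned long b)
{
	assert(b != 0);
	return (v / b) * b;
}
```

Aligning 10 to 4 gives 12 with either helper, but only the division-based form is correct for a boundary such as 6: rounding 10 up to 6 gives 12, whereas the mask expression (10 + 6 - 1) & ~(6 - 1) evaluates to 10, which is not a multiple of 6. That is why the mask form needs the power-of-2 check.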

Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-12 Thread Jan Beulich

>>> On 11.10.16 at 17:53,  wrote:
> On Tue, Oct 11, 2016 at 6:08 AM, Jan Beulich  wrote:
> Andrew Cooper  10/10/16 6:44 PM >>>
>>>On 10/10/16 01:35, Haozhong Zhang wrote:
>>>> Xen hypervisor needs assistance from Dom0 Linux kernel for following tasks:
>>>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
>>>>    memory management data structures, i.e. frame table and M2P table.
>>>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
>>>>    hypervisor.
>>>
>>>However, I can't see any justification for 1).  Dom0 should not be
>>>involved in Xen's management of its own frame table and m2p.  The mfns
>>>making up the pmem/pblk regions should be treated just like any other
>>>MMIO regions, and be handed wholesale to dom0 by default.
>>
>> That precludes the use as RAM extension, and I thought earlier rounds of
>> discussion had got everyone in agreement that at least for the pmem case
>> we will need some control data in Xen.
> 
> The missing piece for me is why this reservation for control data
> needs to be done in the libnvdimm core?  I would expect that any dax
> capable file could be mapped and made available to a guest.  This
> includes /dev/ramX devices that are dax capable, but are external to
> the libnvdimm sub-system.

Despite me being the only one on the To list, I don't think the question
was really meant to be directed to me.

Jan

Re: [linux-sunxi] [PATCH 4/5] ARM: dts: sun6i: add pinmux for PWM0

2016-10-12 Thread Chen-Yu Tsai

On Wed, Oct 12, 2016 at 12:20 PM, Icenowy Zheng  wrote:
> PWM0 is used by sun6i tablets as the backlight PWM.
>
> Add pinmux for it.
>
> Signed-off-by: Icenowy Zheng 
> ---
>  arch/arm/boot/dts/sun6i-a31.dtsi | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/arm/boot/dts/sun6i-a31.dtsi 
> b/arch/arm/boot/dts/sun6i-a31.dtsi
> index 97626ce..76f5a06 100644
> --- a/arch/arm/boot/dts/sun6i-a31.dtsi
> +++ b/arch/arm/boot/dts/sun6i-a31.dtsi
> @@ -494,6 +494,13 @@
> allwinner,pull = ;
> };
>
> +   pwm0_pins: pwm0@0 {
> +   allwinner,pins = "PH13";
> +   allwinner,function = "pwm0";
> +   allwinner,drive = ;
> +   allwinner,pull = ;

Maxime is updating the pinctrl bindings to use generic pinconf,
but otherwise this patch looks good.

ChenYu

> +   };
> +
> mmc0_pins_a: mmc0@0 {
> allwinner,pins = "PF0", "PF1", "PF2",
>  "PF3", "PF4", "PF5";
> --
> 2.10.1
>
> --
> You received this message because you are subscribed to the Google Groups 
> "linux-sunxi" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to linux-sunxi+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Re: [linux-sunxi] [PATCH 2/5] pwm: sun4i: Add support for PWM controller on sun6i SoCs

2016-10-12 Thread Chen-Yu Tsai

Hi,

On Wed, Oct 12, 2016 at 12:20 PM, Icenowy Zheng  wrote:
> The PWM controller in A31 is different from other Allwinner SoCs, with a
> control register per channel (in other SoCs the control register is
> shared), and each channel is allocated 16 bytes of address space (but only
> 8 bytes are used). The register map in one channel is just like a
> single-channel A10 PWM controller; however, A31 has a different
> prescaler table than other SoCs.
>
> In order to use the driver for all 4 channels, device nodes should be
> created per channel.

I think Maxime wants you to support the different register offsets
in this driver, and have all 4 channels in the same device (node).

ChenYu


> Signed-off-by: Icenowy Zheng 
> ---
>  drivers/pwm/pwm-sun4i.c | 37 -
>  1 file changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pwm/pwm-sun4i.c b/drivers/pwm/pwm-sun4i.c
> index 03a99a5..3e93bdf 100644
> --- a/drivers/pwm/pwm-sun4i.c
> +++ b/drivers/pwm/pwm-sun4i.c
> @@ -46,7 +46,7 @@
>
>  #define BIT_CH(bit, chan)  ((bit) << ((chan) * PWMCH_OFFSET))
>
> -static const u32 prescaler_table[] = {
> +static const u32 prescaler_table_a10[] = {
> 120,
> 180,
> 240,
> @@ -65,10 +65,30 @@ static const u32 prescaler_table[] = {
> 0, /* Actually 1 but tested separately */
>  };
>
> +static const u32 prescaler_table_a31[] = {
> +   1,
> +   2,
> +   4,
> +   8,
> +   16,
> +   32,
> +   64,
> +   0,
> +   0,
> +   0,
> +   0,
> +   0,
> +   0,
> +   0,
> +   0,
> +   0,
> +};
> +
>  struct sun4i_pwm_data {
> bool has_prescaler_bypass;
> bool has_rdy;
> unsigned int npwm;
> +   const u32 *prescaler_table;
>  };
>
>  struct sun4i_pwm_chip {
> @@ -100,6 +120,7 @@ static int sun4i_pwm_config(struct pwm_chip *chip, struct 
> pwm_device *pwm,
> int duty_ns, int period_ns)
>  {
> struct sun4i_pwm_chip *sun4i_pwm = to_sun4i_pwm_chip(chip);
> +   const u32 *prescaler_table = sun4i_pwm->data->prescaler_table;
> u32 prd, dty, val, clk_gate;
> u64 clk_rate, div = 0;
> unsigned int prescaler = 0;
> @@ -264,24 +285,35 @@ static const struct sun4i_pwm_data sun4i_pwm_data_a10 = 
> {
> .has_prescaler_bypass = false,
> .has_rdy = false,
> .npwm = 2,
> +   .prescaler_table = prescaler_table_a10,
>  };
>
>  static const struct sun4i_pwm_data sun4i_pwm_data_a10s = {
> .has_prescaler_bypass = true,
> .has_rdy = true,
> .npwm = 2,
> +   .prescaler_table = prescaler_table_a10,
>  };
>
>  static const struct sun4i_pwm_data sun4i_pwm_data_a13 = {
> .has_prescaler_bypass = true,
> .has_rdy = true,
> .npwm = 1,
> +   .prescaler_table = prescaler_table_a10,
>  };
>
>  static const struct sun4i_pwm_data sun4i_pwm_data_a20 = {
> .has_prescaler_bypass = true,
> .has_rdy = true,
> .npwm = 2,
> +   .prescaler_table = prescaler_table_a10,
> +};
> +
> +static const struct sun4i_pwm_data sun4i_pwm_data_a31 = {
> +   .has_prescaler_bypass = false,
> +   .has_rdy = true,
> +   .npwm = 1,
> +   .prescaler_table = prescaler_table_a31,
>  };
>
>  static const struct of_device_id sun4i_pwm_dt_ids[] = {
> @@ -298,6 +330,9 @@ static const struct of_device_id sun4i_pwm_dt_ids[] = {
> .compatible = "allwinner,sun7i-a20-pwm",
> .data = &sun4i_pwm_data_a20,
> }, {
> +   .compatible = "allwinner,sun6i-a31-pwm",
> +   .data = &sun4i_pwm_data_a31
> +   }, {
> /* sentinel */
> },
>  };
> --
> 2.10.1
>
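
The suggestion in this reply, one device covering all channels with per-SoC register offsets, could look roughly like the user-space sketch below. It is only an illustration under stated assumptions: struct pwm_variant, the ctrl_stride field, find_variant() and channel_ctrl_offset() are hypothetical names, the A10-style prescaler table is cut down to the three entries visible in the quoted hunk, and the npwm value of 4 for A31 follows the reply rather than the posted patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Only the entries quoted in the patch; the A10 table is deliberately cut short. */
static const uint32_t prescalers_a10[] = { 120, 180, 240 };
static const uint32_t prescalers_a31[] = { 1, 2, 4, 8, 16, 32, 64 };

struct pwm_variant {
	const char *compatible;
	unsigned int npwm;          /* channels handled by one device */
	size_t ctrl_stride;         /* hypothetical: bytes between per-channel
	                             * control registers, 0 when shared */
	const uint32_t *prescalers;
	size_t nr_prescalers;
};

static const struct pwm_variant variants[] = {
	{ "allwinner,sun7i-a20-pwm", 2, 0, prescalers_a10, 3 },
	{ "allwinner,sun6i-a31-pwm", 4, 16, prescalers_a31, 7 },
};

/* Look up the per-SoC data by compatible string; NULL if unknown. */
static const struct pwm_variant *find_variant(const char *compatible)
{
	for (size_t i = 0; i < sizeof(variants) / sizeof(variants[0]); i++)
		if (strcmp(variants[i].compatible, compatible) == 0)
			return &variants[i];
	return NULL;
}

/* Byte offset of a channel's control register within the device. */
static size_t channel_ctrl_offset(const struct pwm_variant *v, unsigned int chan)
{
	assert(v != NULL && chan < v->npwm);
	return (size_t)chan * v->ctrl_stride;
}
```

With a stride of 16 bytes, channel 3 of the A31-style variant lands at offset 48, while every channel of a shared-register variant maps to offset 0.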

Re: [RFC PATCH 00/11] Introduce writeback connectors

2016-10-12 Thread Brian Starkey


Hi Eric,

On Tue, Oct 11, 2016 at 12:01:14PM -0700, Eric Anholt wrote:

Brian Starkey  writes:


Hi,

This RFC series introduces a new connector type:
 DRM_MODE_CONNECTOR_WRITEBACK
It is a follow-on from a previous discussion: [1]

Writeback connectors are used to expose the memory writeback engines
found in some display controllers, which can write a CRTC's
composition result to a memory buffer.
This is useful e.g. for testing, screen-recording, screenshots,
wireless display, display cloning, memory-to-memory composition.

Patches 1-7 include the core framework changes required, and patches
8-11 implement a writeback connector for the Mali-DP writeback engine.
The Mali-DP patches depend on this other series: [2].

The connector is given the FB_ID property for the output framebuffer,
and two new read-only properties: PIXEL_FORMATS and
PIXEL_FORMATS_SIZE, which expose the supported framebuffer pixel
formats of the engine.

The EDID property is not exposed for writeback connectors.

Writeback connector usage:
--------------------------
Due to connector routing changes being treated as "full modeset"
operations, any client which wishes to use a writeback connector
should include the connector in every modeset. The writeback will not
actually become active until a framebuffer is attached.

The writeback itself is enabled by attaching a framebuffer to the
FB_ID property of the connector. The driver must then ensure that the
CRTC content of that atomic commit is written into the framebuffer.

The writeback works in a one-shot mode with each atomic commit. This
prevents the same content from being written multiple times.
In some cases (front-buffer rendering) there might be a desire for
continuous operation - I think a property could be added later for
this kind of control.

Writeback can be disabled by setting FB_ID to zero.


> I think this sounds great, and the interface is just right IMO.



Thanks, glad you like it! Hopefully you're equally agreeable with the
changes Daniel has been suggesting.


> I don't really see a use for continuous mode -- a sequence of one-shots
> makes a lot more sense because then you can know what data has changed,
> which anyone trying to use the writeback buffer would need to know.



Agreed - we've never found a use for it.


>> Known issues:
>> -------------
>>  * I'm not sure what "DPMS" should mean for writeback connectors.
>>    It could be used to disable writeback (even when a framebuffer is
>>    attached), or it could be hidden entirely (which would break the
>>    legacy DPMS call for writeback connectors).
>>  * With Daniel's recent re-iteration of the userspace API rules, I
>>    fully expect to provide some userspace code to support this. The
>>    question is what, and where? We want to use writeback for testing,
>>    so perhaps some tests in igt would be suitable.
>>  * Documentation. Probably some portion of this cover letter needs to
>>    make it into Documentation/
>>  * Synchronisation. Our hardware will finish the writeback by the next
>>    vsync. I've not implemented fence support here, but it would be an
>>    obvious addition.


> My hardware won't necessarily finish by the next vsync -- it trickles
> out at whatever rate it can find memory bandwidth to get the job done,
> and fires an interrupt when it's finished.



Is it bounded? You presumably have to finish the write-out before you
can change any input buffers?


> So I would like some definition for how syncing works.  One answer would
> be that these flips don't trigger their pageflip events until the
> writeback is done (so I need to collect both the vsync irq and the
> writeback irq before sending).  Another would be that we manage an
> independent fence for the writeback fb, so that you still immediately
> know when framebuffers from the previous scanout-only frame are idle.



I much prefer the sound of the explicit fence approach.

Hopefully we can agree that a new atomic commit can't be completed
whilst there's a writeback ongoing, otherwise managing the fence and
framebuffer lifetime sounds really tricky - they'd need to be decoupled
from the atomic_state and outlive the commit that spawned them.

Cheers,
-Brian


> Also, tests for this in igt, please.  Writeback in igt will give us so
> much more ability to cover KMS functionality on non-Intel hardware.
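
For illustration only, the first synchronisation option discussed above
(hold the pageflip event until both the vsync interrupt and the writeback
interrupt have been collected) can be sketched as a tiny state machine.
This is a standalone model, not DRM code: struct commit_completion and the
on_*_irq() functions are hypothetical names invented for this sketch.

```c
#include <stdbool.h>

/* Hypothetical per-commit completion state; not a real DRM structure. */
struct commit_completion {
	bool vsync_seen;      /* the vsync interrupt has arrived */
	bool writeback_done;  /* the writeback-finished interrupt has arrived */
	bool event_sent;      /* the pageflip event was already delivered */
};

/* Returns true exactly once: when both interrupts have been collected. */
static bool try_send_event(struct commit_completion *c)
{
	if (c->event_sent || !c->vsync_seen || !c->writeback_done)
		return false;
	c->event_sent = true;
	return true;
}

/* Called from the (modelled) vsync interrupt handler. */
bool on_vsync_irq(struct commit_completion *c)
{
	c->vsync_seen = true;
	return try_send_event(c);
}

/* Called from the (modelled) writeback-complete interrupt handler. */
bool on_writeback_irq(struct commit_completion *c)
{
	c->writeback_done = true;
	return try_send_event(c);
}
```

Whichever interrupt arrives second triggers the event, so the order of the
two interrupts does not matter. The explicit-fence alternative preferred in
the reply would instead signal a separate fence from the writeback handler.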

Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

2016-10-12 Thread Vincent Guittot

On 11 October 2016 at 20:57, Matt Fleming  wrote:
> On Tue, 11 Oct, at 03:14:47PM, Vincent Guittot wrote:
>> >
>> > I see a regression,
>> >
>> >   baseline: 2.41228
>> >   patched : 2.64528 (-9.7%)
>>
>> Just to be sure; By baseline you mean v4.8 ?
>
> Baseline is actually tip/sched/core commit 447976ef4fd0
> ("sched/irqtime: Consolidate irqtime flushing code") but I could try
> out v4.8 instead if you'd prefer that.

ok. In fact, I have noticed another regression with tip/sched/core and
hackbench while looking at yours.
I have bisected it to:
10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings")

hackbench -P -g 1

        v4.8          tip/sched/core   tip/sched/core + revert of
                                       10e2f1acd010 and 1b568f0aabf2
min     0.051         0.052            0.049
avg     0.057 (0%)    0.062 (-7%)      0.056 (+1%)
max     0.070         0.073            0.067
stdev   +/-8%         +/-10%           +/-9%

The issue seems to be that it prevents some migrations at wakeup at
the end of the hackbench test, so the last tasks compete for the
same CPU while other CPUs in the same MC domain are idle. I haven't
yet looked more deeply into which part of the patch causes the regression.

 >
>> >   cat /tmp/trace.$1 | grep -E "wakeup_new.*comm=hackbench" | \
>> > sed -e 's/.*target_cpu=//' | sort | uniq -c | awk '{print $1}'
>>
>> nice command to evaluate spread
>
> Thanks!

Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support

2016-10-12 Thread Jan Kara

On Tue 11-10-16 16:51:30, Ross Zwisler wrote:
> On Tue, Oct 11, 2016 at 10:31:52AM +0200, Jan Kara wrote:
> > On Fri 07-10-16 15:09:02, Ross Zwisler wrote:
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index ac3cd05..e51d51f 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct 
> > > address_space *mapping,
> > >* queue to the start of that PMD.  This ensures that all offsets in
> > >* the range covered by the PMD map to the same bit lock.
> > >*/
> > > - if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > > + if ((unsigned long)entry & RADIX_DAX_PMD)
> > >   index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
> > 
> > I agree with Christoph - helper for masking type bits would make this
> > nicer.
> 
> Fixed via a dax_flag_test() helper as I outlined in the mail to Christoph.  It
> seems clean to me, but if you or Christoph feel strongly that it would be
> cleaner as a local 'flags' variable, I'll make the change.

One idea I had is that you could have helpers like:

dax_is_pmd_entry()
dax_is_pte_entry()
dax_is_empty_entry()
dax_is_hole_entry()

And then you would use these helpers - all the flags would be hidden in the
helpers so even if we decide to change the flagging scheme to compress
things or so, it should be pretty local change.
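
A minimal sketch of what such helpers could look like, for illustration
only. The bit values below are made up for this sketch, RADIX_DAX_EMPTY is
an assumed flag name, and treating a "hole" entry as one with
RADIX_DAX_HZP set is also an assumption; the real layout is whatever the
patch series defines.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; not the layout used by fs/dax.c. */
#define RADIX_DAX_PMD   (1UL << 2)
#define RADIX_DAX_HZP   (1UL << 3)  /* huge zero page, treated as a hole */
#define RADIX_DAX_EMPTY (1UL << 4)  /* assumed name for the empty flag */

/* Callers only ever see these predicates, never the raw flag bits. */
static inline bool dax_is_pmd_entry(const void *entry)
{
	return ((uintptr_t)entry & RADIX_DAX_PMD) != 0;
}

static inline bool dax_is_pte_entry(const void *entry)
{
	return !dax_is_pmd_entry(entry);
}

static inline bool dax_is_hole_entry(const void *entry)
{
	return ((uintptr_t)entry & RADIX_DAX_HZP) != 0;
}

static inline bool dax_is_empty_entry(const void *entry)
{
	return ((uintptr_t)entry & RADIX_DAX_EMPTY) != 0;
}
```

With this shape, a later change to the flag encoding only touches the four
helper bodies, which is the locality argument made above.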

> > > - entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> > > -RADIX_DAX_ENTRY_LOCK);
> > > +
> > > + /*
> > > +  * Besides huge zero pages the only other thing that gets
> > > +  * downgraded are empty entries which don't need to be
> > > +  * unmapped.
> > > +  */
> > > + if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> > > + unmap_mapping_range(mapping,
> > > + (index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > > +
> > >   spin_lock_irq(&mapping->tree_lock);
> > > - err = radix_tree_insert(&mapping->page_tree, index, entry);
> > > +
> > > + if (pmd_downgrade) {
> > > + radix_tree_delete(&mapping->page_tree, index);
> > > + mapping->nrexceptional--;
> > > + dax_wake_mapping_entry_waiter(mapping, index, entry,
> > > + false);
> > 
> > You need to set 'wake_all' argument here to true. Otherwise there could be
> > waiters waiting for non-existent entry forever...
> 
> Interesting.   Fixed, but let me make sure I understand.  So is the issue that
> you could have say 2 tasks waiting on a PMD index that has been rounded down
> to the PMD index via dax_entry_waitqueue()?
> 
> The person holding the lock on the entry would remove the PMD, insert a PTE
> and wake just one of the PMD aligned waiters.  That waiter would wake up, do
> something PTE based (since the PMD space is now polluted with PTEs), and then
> wake any waiters on its PTE index.  Meanwhile, the second waiter could sleep
> forever on the PMD aligned index.  Is this correct?

Yes.

> So, perhaps more succinctly:
> 
> Thread 1                  Thread 2                        Thread 3
> --------                  --------                        --------
> index 0x202, hold PMD lock 0x200
>                           index 0x203, sleep on 0x200
>                                                           index 0x204, sleep on 0x200
> downgrade, removing 0x200
> wake one waiter on 0x200
> insert PTE @ 0x202
>                           wake up, grab index 0x203
>                           ...
>                           wake one waiter on index 0x203
>
>                                                           ... sleeps forever
> Right?
 
Exactly.
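
As a worked example of why all three threads above share one wait queue
key, here is the index rounding from dax_entry_waitqueue() in isolation.
The shift values are assumptions for this sketch (4 KiB pages and 2 MiB
PMD entries, as on x86-64), and pmd_aligned_index() is a name made up here.

```c
#define PAGE_SHIFT 12  /* assumed: 4 KiB pages */
#define PMD_SHIFT  21  /* assumed: 2 MiB PMD entries, as on x86-64 */

/*
 * Round a page index down to the first index covered by its PMD,
 * using the same mask expression as the quoted hunk.
 */
unsigned long pmd_aligned_index(unsigned long index)
{
	return index & ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
}
```

With these values a PMD spans 512 page indices, so 0x202, 0x203 and 0x204
all round down to 0x200. Once the PMD entry is replaced by PTE entries, a
waiter woken on 0x200 will only wake others on its own PTE index, which is
why the downgrade path has to wake every waiter on the PMD-aligned index.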

> > > @@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct 
> > > address_space *mapping,
> > >   error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
> > >   if (error)
> > >   return ERR_PTR(error);
> > > + } else if (((unsigned long)entry & RADIX_DAX_HZP) &&
> > > + !(flags & RADIX_DAX_HZP)) {
> > > + /* replacing huge zero page with PMD block mapping */
> > > + unmap_mapping_range(mapping,
> > > + (vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > >   }
> > >  
> > >   spin_lock_irq(&mapping->tree_lock);
> > > - new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > > -RADIX_DAX_ENTRY_LOCK);
> > > + new_entry = dax_radix_entry(sector, flags);
> > > +
> > 
> > You've lost the RADIX_DAX_ENTRY_LOCK flag here?
> 
> Oh, nope, that's embedded in the dax_radix_entry() helper:
> 
> /* entries begin locked */
> static inline void *dax_radix_entry(sector_t sector, unsigned long flags)
> {
>   return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
>   ((unsigned long)sector << RADIX_DAX_SHIFT) |
>   RADIX_DAX_ENTRY_LOCK);
> }
> 
> I'll s/dax_radix_entry/dax_radix_locked_entry/ or something to make this
> clearer to the reader.

Yep, that wo

Re: [PATCH v2] x86/tsc: Set X86_FEATURE_TSC_RELIABLE to skip refined calibration

2016-10-12 Thread Thomas Gleixner

On Tue, 11 Oct 2016, Bin Gao wrote:
> On Fri, Aug 26, 2016 at 12:14:58PM +0200, Thomas Gleixner wrote:
> 
> The Linux kernel does think a reliable calibration implies the reliability
> (i.e. no watchdog required). I'm posting some code pieces to explain.

I know that and I know exactly how all that works. And I certainly did not
ask for an explanation of the current state of affairs. Here is what I
wrote:

> > Second thoughts. We should separate the calibration aspect from the
> > reliability aspect.
> > 
> > If a MSR/CPUID readout provides reliable calibration then this does not tell
> > us about the reliability (i.e. no watchdog required). So having two flags for
> > this - and sure you can set both on those SoCs - is the proper solution.

In other words: I want to have two separate flags:

1) FEATURE_KNOWN_FREQUENCY - Grab the frequency from CPUID/MSR or whatever
   and skip the whole calibration thing

2) FEATURE_RELIABLE - Do not invoke the watchdog

Thanks,

tglx
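
For illustration only, the two-flag split requested above can be modelled
as two independent bits with one decision function each. The flag names
come from the mail; the bit positions and the needs_*() helpers are
assumptions made for this sketch, not kernel code.

```c
#include <stdbool.h>

/* Flag names from the mail; bit positions are arbitrary for this sketch. */
#define FEATURE_KNOWN_FREQUENCY (1u << 0)
#define FEATURE_RELIABLE        (1u << 1)

/* Calibration is skipped only when the frequency is known up front. */
bool needs_calibration(unsigned int features)
{
	return !(features & FEATURE_KNOWN_FREQUENCY);
}

/* The watchdog is skipped only when the clocksource is marked reliable. */
bool needs_watchdog(unsigned int features)
{
	return !(features & FEATURE_RELIABLE);
}
```

Because the two decisions read different bits, a platform that knows its
frequency but is not otherwise trusted still gets the watchdog, and an SoC
that qualifies for both simply sets both flags.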

Re: [PATCH v3 0/1] man/set_mempolicy.2,mbind.2: add MPOL_LOCAL NUMA memory policy documentation

2016-10-12 Thread Michael Kerrisk (man-pages)

Hello Piotr,

On 10/10/2016 06:23 PM, Piotr Kwapulinski wrote:
> The MPOL_LOCAL mode has been implemented by
> Peter Zijlstra 
> (commit: 479e2802d09f1e18a97262c4c6f8f17ae5884bd8).
> Add the documentation for this mode.

Thanks. I've applied this patch. I have a question below.

> Signed-off-by: Piotr Kwapulinski 
> ---
> This version fixes grammar
> ---
>  man2/mbind.2 | 28 
>  man2/set_mempolicy.2 | 19 ++-
>  2 files changed, 42 insertions(+), 5 deletions(-)
> 
> diff --git a/man2/mbind.2 b/man2/mbind.2
> index 3ea24f6..854580c 100644
> --- a/man2/mbind.2
> +++ b/man2/mbind.2
> @@ -130,8 +130,9 @@ argument must specify one of
>  .BR MPOL_DEFAULT ,
>  .BR MPOL_BIND ,
>  .BR MPOL_INTERLEAVE ,
> +.BR MPOL_PREFERRED ,
>  or
> -.BR MPOL_PREFERRED .
> +.BR MPOL_LOCAL .
>  All policy modes except
>  .B MPOL_DEFAULT
>  require the caller to specify via the
> @@ -258,9 +259,26 @@ and
>  .I maxnode
>  arguments specify the empty set, then the memory is allocated on
>  the node of the CPU that triggered the allocation.
> -This is the only way to specify "local allocation" for a
> -range of memory via
> -.BR mbind ().
> +
> +.B MPOL_LOCAL
> +specifies the "local allocation", the memory is allocated on
> +the node of the CPU that triggered the allocation, "local node".
> +The
> +.I nodemask
> +and
> +.I maxnode
> +arguments must specify the empty set. If the "local node" is low
> +on free memory the kernel will try to allocate memory from other
> +nodes. The kernel will allocate memory from the "local node"
> +whenever memory for this node is available. If the "local node"
> +is not allowed by the process's current cpuset context the kernel
> +will try to allocate memory from other nodes. The kernel will
> +allocate memory from the "local node" whenever it becomes allowed
> +by the process's current cpuset context. In contrast
> +.B MPOL_DEFAULT
> +reverts to the policy of the process which may have been set with
> +.BR set_mempolicy (2).
> +It may not be the "local allocation".

What is the sense of "may not be" here? (And repeated below).
Is the meaning "this could be something other than"?
Presumably the answer is yes, in which case I'll clarify
the wording there. Let me know.

Cheers,

Michael


>  
>  If
>  .B MPOL_MF_STRICT
> @@ -440,6 +458,8 @@ To select explicit "local allocation" for a memory range,
>  specify a
>  .I mode
>  of
> +.B MPOL_LOCAL
> +or
>  .B MPOL_PREFERRED
>  with an empty set of nodes.
>  This method will work for
> diff --git a/man2/set_mempolicy.2 b/man2/set_mempolicy.2
> index 1f02037..22b0f7c 100644
> --- a/man2/set_mempolicy.2
> +++ b/man2/set_mempolicy.2
> @@ -79,8 +79,9 @@ argument must specify one of
>  .BR MPOL_DEFAULT ,
>  .BR MPOL_BIND ,
>  .BR MPOL_INTERLEAVE ,
> +.BR MPOL_PREFERRED ,
>  or
> -.BR MPOL_PREFERRED .
> +.BR MPOL_LOCAL .
>  All modes except
>  .B MPOL_DEFAULT
>  require the caller to specify via the
> @@ -211,6 +212,22 @@ arguments specify the empty set, then the policy
>  specifies "local allocation"
>  (like the system default policy discussed above).
>  
> +.B MPOL_LOCAL
> +specifies the "local allocation", the memory is allocated on
> +the node of the CPU that triggered the allocation, "local node".
> +The
> +.I nodemask
> +and
> +.I maxnode
> +arguments must specify the empty set. If the "local node" is low
> +on free memory the kernel will try to allocate memory from other
> +nodes. The kernel will allocate memory from the "local node"
> +whenever memory for this node is available. If the "local node"
> +is not allowed by the process's current cpuset context the kernel
> +will try to allocate memory from other nodes. The kernel will
> +allocate memory from the "local node" whenever it becomes allowed
> +by the process's current cpuset context.
> +
>  The thread memory policy is preserved across an
>  .BR execve (2),
>  and is inherited by child threads created using
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Re: [PATCH 2/2] powernv: Pass PSSCR value and mask to power9_idle_stop

2016-10-12 Thread Stewart Smith

Gautham R Shenoy  writes:
> On Tue, Oct 04, 2016 at 10:33:27PM +1100, Balbir Singh wrote:
>> 
>> 
>> On 04/10/16 21:32, Michael Ellerman wrote:
>> > "Gautham R. Shenoy"  writes:
>> > 
>> >> From: "Gautham R. Shenoy" 
>> >>
>> >> The power9_idle_stop method currently takes only the requested stop
>> >> level as a parameter and picks up the rest of the PSSCR bits from a
>> >> hand-coded macro. This is not a very flexible design, especially when
>> >> the firmware has the capability to communicate the psscr value and the
>> >> mask associated with a particular stop state via device tree.
>> >>
>> >> This patch modifies the power9_idle_stop API to take as parameters the
>> >> PSSCR value and the PSSCR mask corresponding to the stop state that
>> >> needs to be set. These PSSCR value and mask are respectively obtained
>> >> by parsing the "ibm,cpu-idle-state-psscr" and
>> >> "ibm,cpu-idle-state-psscr-mask" fields from the device tree.
>> >>
>> >> In addition to this, the patch adds support for handling stop states
>> >> for which ESL and EC bits in the PSSCR are zero. As per the
>> >> architecture, a wakeup from these stop states resumes execution from
>> >> the subsequent instruction as opposed to waking up at the System
>> >> Vector.
>> > 
>> > That looks good.
>> > 
>> >> This patch depends on the following skiboot patch that exports the
>> >> PSSCR values and the mask for all the stop states:
>> >> https://lists.ozlabs.org/pipermail/skiboot/2016-September/004869.html
>> > 
>> > But we can't depend on a skiboot patch. The kernel has to cope with
>> > running on an old skiboot.
>>
>
> Hmm.. We can still do that. The older skiboot only provides the RL
> field of the PSSCR value for each stop state and the corresponding
> PSSCR mask is set to 0xF in the older skiboot for all the stop states.
>
> We can insist that the future skiboot sets the ESL, EC, PSLL, TR, MTL
> and the RL fields of the PSSCR for any exported stop state. This
> should be reflected in the psscr_mask of that stop state.  Thus, the
> psscr_mask of any stop state proposed in the future will have:
> (PSSCR_ESL_MASK | PSCCR_EC_MASK | PSCCR_PSLL_MASK | PSSCR_TR_MASK |
> PSSCR_MTL_MASK | PSSCR_RL_MASK) bits set in the skiboot.
>
> To handle the older firmware, we can do something like the following
> during the discovery of the stop states to mimic the behaviour present
> in the 4.8 kernel running on older firmware.
>
> === drivers/cpuidle/cpuidle-powernv.c ===
> /*
>  * By default we set the ESL and EC bits in the PSSCR.
>  * The MTL and PSLL are set to the maximum value possible as per the
>  * ISA, i.e 15.
>  * The Transition Rate is set to the Maximum value 3.
>  */
> #define DEFAULT_PSSCR_VAL  (PSSCR_ESL_MASK |   \
>  PSCCR_EC_MASK | PSCCR_PSLL_MASK |\
>  PSSCR_TR_MASK | PSSCR_MTL_MASK)
>
> #define DEFAULT_PSSCR_MASK (PSSCR_ESL_MASK |   \
>  PSCCR_EC_MASK | PSCCR_PSLL_MASK |\
>  PSSCR_TR_MASK | PSSCR_MTL_MASK | \
>  PSSCR_RL_MASK)
>
>
> static int powernv_add_idle_states(void)
> {
>   .
>   .
>   .
>   for (i = 0; i < dt_idle_states; i++) {
>   u64 val, mask;
>   .   
>   .
>   .
>   val = (DEFAULT_PSSCR_VAL & ~psscr_mask[i]) | psscr_val[i];
>   mask = DEFAULT_PSSCR_MASK | psscr_mask[i];
>   stop_psscr_table[nr_idle_states].val = val;
>   stop_psscr_table[nr_idle_states].mask = mask;
>   }
> } 
> 
>
>
> Is this approach ok ?

What if we just treat the 0xF state from firmware as special and set it
to DEFAULT_PSSCR_MASK in that case? That deals with old skiboot, new
kernel, and sets a pretty small special case that's easy to track into
the future as something we should watch out for.

Additionally, if we make skiboot set sane values in ~DEFAULT_PSSCR_MASK
for valid fields in PSSCR on boot/(also kexec?), then
we should end up in a situation where everything works with everything
(even if you don't get the best power saving). Specifically, new
skiboot, old kernel... but it looks like there's nothing currently
missing there

Should this patch also have Fixes: 3005c597ba4 and CC to stable?

-- 
Stewart Smith
OPAL Architect, IBM.
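
A minimal sketch of the special case suggested above: a firmware mask of
exactly 0xF is taken to mean "old skiboot, only RL is valid", and anything
else is trusted as given. The field mask values, the PSSCR_*_MASK spelling,
struct stop_psscr and resolve_stop_psscr() are all assumptions made for
this illustration, not the kernel's definitions.

```c
#include <stdint.h>

/* Illustrative field masks; check the ISA and kernel headers for real ones. */
#define PSSCR_RL_MASK    0x0000000FULL
#define PSSCR_MTL_MASK   0x000000F0ULL
#define PSSCR_TR_MASK    0x00000300ULL
#define PSSCR_PSLL_MASK  0x000F0000ULL
#define PSSCR_EC_MASK    0x00100000ULL
#define PSSCR_ESL_MASK   0x00200000ULL

#define DEFAULT_PSSCR_VAL  (PSSCR_ESL_MASK | PSSCR_EC_MASK | \
			    PSSCR_PSLL_MASK | PSSCR_TR_MASK | PSSCR_MTL_MASK)
#define DEFAULT_PSSCR_MASK (DEFAULT_PSSCR_VAL | PSSCR_RL_MASK)

/* Mask exported by older firmware for every stop state (per the thread). */
#define OLD_FIRMWARE_MASK  0xFULL

struct stop_psscr {
	uint64_t val;
	uint64_t mask;
};

/* Hypothetical helper: pick the value/mask pair to program for one state. */
struct stop_psscr resolve_stop_psscr(uint64_t fw_val, uint64_t fw_mask)
{
	struct stop_psscr s;

	if (fw_mask == OLD_FIRMWARE_MASK) {
		/* Old firmware: keep its RL field, fill the rest with defaults. */
		s.val = DEFAULT_PSSCR_VAL | (fw_val & PSSCR_RL_MASK);
		s.mask = DEFAULT_PSSCR_MASK;
	} else {
		/* New firmware: honour exactly what it exported. */
		s.val = fw_val;
		s.mask = fw_mask;
	}
	return s;
}
```

Compared with OR-ing the defaults into every state, this keeps stop states
with ESL and EC cleared intact when newer firmware exports them.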

[PATCH v6 2/2] devicetree: bindings: uart: Add new compatible string for ZynqMP

2016-10-12 Thread Nava kishore Manne

From: Nava kishore Manne 

This patch adds the new compatible string for ZynqMP SoC.

Signed-off-by: Nava kishore Manne 
---
Changes for v6:
-Added new compatible string for ZynqMP SoC as
 suggested by Rob Herring.
Changes for v5:
-Modified the compatible section.
Changes for v4:
-Modified the ChangeLog comment.
Changes for v3:
-Added changeLog comment.
Changes for v2:
-None

 Documentation/devicetree/bindings/serial/cdns,uart.txt | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/serial/cdns,uart.txt 
b/Documentation/devicetree/bindings/serial/cdns,uart.txt
index a3eb154..227bb77 100644
--- a/Documentation/devicetree/bindings/serial/cdns,uart.txt
+++ b/Documentation/devicetree/bindings/serial/cdns,uart.txt
@@ -1,7 +1,9 @@
 Binding for Cadence UART Controller
 
 Required properties:
-- compatible : should be "cdns,uart-r1p8", or "xlnx,xuartps"
+- compatible :
+  Use "xlnx,xuartps","cdns,uart-r1p8" for Zynq-7xxx SoC.
+  Use "xlnx,zynqmp-uart","cdns,uart-r1p12" for Zynq Ultrascale+ MPSoC.
 - reg: Should contain UART controller registers location and length.
 - interrupts: Should contain UART controller interrupts.
 - clocks: Must contain phandles to the UART clocks
-- 
2.1.1

[PATCH v3 1/4] mm: don't steal highatomic pageblock

2016-10-12 Thread Minchan Kim

In the page freeing path, the migratetype is racy, so a highorderatomic
page could be freed into a non-highorderatomic free list. If that page
is allocated, the VM can change the pageblock from highorderatomic to
something else. In that case, highatomic pageblock accounting is broken
and stops working (e.g., the VM cannot reserve highorderatomic pageblocks
any more although it hasn't reached the 1% limit).

So, this patch prohibits changing a pageblock from highatomic to another
type. This is not a problem because MIGRATE_HIGHATOMIC is not listed in
the fallback array, so stealing will only happen due to unexpected races,
which are really rare. Also, this prohibition keeps highatomic pageblocks
around longer, which is better for highorderatomic page allocation.

Signed-off-by: Minchan Kim 
Acked-by: Vlastimil Babka 
Acked-by: Mel Gorman 
---
 mm/page_alloc.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 55ad0229ebf3..79853b258211 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2154,7 +2154,8 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, 
int start_migratetype)
 
page = list_first_entry(&area->free_list[fallback_mt],
struct page, lru);
-   if (can_steal)
+   if (can_steal &&
+   get_pageblock_migratetype(page) != MIGRATE_HIGHATOMIC)
steal_suitable_fallback(zone, page, start_migratetype);
 
/* Remove the page from the freelists */
@@ -2555,7 +2556,8 @@ int __isolate_free_page(struct page *page, unsigned int 
order)
struct page *endpage = page + (1 << order) - 1;
for (; page < endpage; page += pageblock_nr_pages) {
int mt = get_pageblock_migratetype(page);
-   if (!is_migrate_isolate(mt) && !is_migrate_cma(mt))
+   if (!is_migrate_isolate(mt) && !is_migrate_cma(mt)
+   && mt != MIGRATE_HIGHATOMIC)
set_pageblock_migratetype(page,
  MIGRATE_MOVABLE);
}
-- 
2.7.4

[PATCH v3 2/4] mm: prevent double decrease of nr_reserved_highatomic

2016-10-12 Thread Minchan Kim

There is a race between page freeing and unreserving a highatomic
pageblock.

        CPU 0                               CPU 1

free_hot_cold_page
  mt = get_pfnblock_migratetype
  set_pcppage_migratetype(page, mt)
                                    unreserve_highatomic_pageblock
                                    spin_lock_irqsave(&zone->lock)
                                    move_freepages_block
                                    set_pageblock_migratetype(page)
                                    spin_unlock_irqrestore(&zone->lock)
  free_pcppages_bulk
    __free_one_page(mt) <- mt is stale

Because of the above race, a page on CPU 0 could go to a
non-highorderatomic free list since the pageblock's type has changed.
As a result, the unreserve logic of highorderatomic can decrease the
reserved count for the same pageblock several times, which creates a
mismatch between nr_reserved_highatomic and the number of reserved
pageblocks.

So, this patch verifies whether the pageblock is highatomic or not
and decreases the count only if the pageblock is highatomic.

Signed-off-by: Minchan Kim 
Acked-by: Vlastimil Babka 
Acked-by: Mel Gorman 
---
 mm/page_alloc.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 79853b258211..18808f392718 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2106,13 +2106,25 @@ static void unreserve_highatomic_pageblock(const struct 
alloc_context *ac)
continue;
 
/*
-* It should never happen but changes to locking could
-* inadvertently allow a per-cpu drain to add pages
-* to MIGRATE_HIGHATOMIC while unreserving so be safe
-* and watch for underflows.
+* In page freeing path, migratetype change is racy so
+* we can encounter several free pages in a pageblock
+* in this loop although we changed the pageblock type
+* from highatomic to ac->migratetype. So we should
+* adjust the count once.
 */
-   zone->nr_reserved_highatomic -= min(pageblock_nr_pages,
-   zone->nr_reserved_highatomic);
+   if (get_pageblock_migratetype(page) ==
+   MIGRATE_HIGHATOMIC) {
+   /*
+* It should never happen but changes to
+* locking could inadvertently allow a per-cpu
+* drain to add pages to MIGRATE_HIGHATOMIC
+* while unreserving so be safe and watch for
+* underflows.
+*/
+   zone->nr_reserved_highatomic -= min(
+   pageblock_nr_pages,
+   zone->nr_reserved_highatomic);
+   }
 
/*
 * Convert to ac->migratetype and avoid the normal
-- 
2.7.4
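
For illustration only, the accounting rule from the patch above (decrease
the reserve only while the pageblock is still highatomic, and clamp the
decrease so the counter cannot underflow) can be shown in a standalone
model. The types and names below are invented for this sketch, and the
pageblock size of 512 pages is an assumption.

```c
#include <stdbool.h>

#define PAGEBLOCK_NR_PAGES 512UL  /* assumed: 2 MiB pageblock, 4 KiB pages */

enum migratetype_model { MT_MOVABLE, MT_HIGHATOMIC };

struct pageblock_model {
	enum migratetype_model mt;
};

struct zone_model {
	unsigned long nr_reserved_highatomic;  /* reserve size, in pages */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Model of one loop iteration: returns true if the reserve counter was
 * adjusted, false if the block had already been converted earlier.
 */
bool unreserve_block(struct zone_model *zone, struct pageblock_model *block)
{
	bool accounted = false;

	if (block->mt == MT_HIGHATOMIC) {
		/* Clamp so the counter can never wrap below zero. */
		zone->nr_reserved_highatomic -=
			min_ul(PAGEBLOCK_NR_PAGES, zone->nr_reserved_highatomic);
		accounted = true;
	}
	block->mt = MT_MOVABLE;  /* convert, as the real code does */
	return accounted;
}
```

Seeing a second free page from the same pageblock therefore leaves the
counter untouched, which is the double decrease the patch prevents.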

Re: [PATCH v2 4/4] mm: make unreserve highatomic functions reliable

2016-10-12 Thread Minchan Kim

On Wed, Oct 12, 2016 at 09:33:28AM +0200, Michal Hocko wrote:
> On Wed 12-10-16 14:33:36, Minchan Kim wrote:
> [...]
> > @@ -2138,8 +2146,10 @@ static bool unreserve_highatomic_pageblock(const 
> > struct alloc_context *ac)
> >  */
> > set_pageblock_migratetype(page, ac->migratetype);
> > ret = move_freepages_block(zone, page, ac->migratetype);
> > -   spin_unlock_irqrestore(&zone->lock, flags);
> > -   return ret;
> > +   if (!drain && ret) {
> > +   spin_unlock_irqrestore(&zone->lock, flags);
> > +   return ret;
> > +   }
> 
> I've already mentioned that during the previous discussion. This sounds

Yep, we did, but I sent the wrong version from my git tree. :(

> overly aggressive to me. Why do we want to drain the whole reserve and
> risk that we won't be able to build up a new one after OOM. Doing one
> block at the time should be sufficient IMHO.

I will resend after addressing every review point.

Thanks.

[PATCH v3 0/4] use up highorder free pages before OOM

2016-10-12 Thread Minchan Kim

I got an OOM report from the production team with a v4.4 kernel.
It had enough free memory but failed to allocate a GFP_KERNEL order-0
page and finally encountered an OOM kill. It occurred during a QA process
which launches several apps, switches between them and so on. It happened
rarely. IOW, in the normal situation it was not a problem, but if we are
unlucky and several apps use peak memory at the same time, it can happen.
If we manage to get past that phase, the system keeps working well.

I could reproduce it easily with my test (a memory spike). Look below.

The reason is that the free pages (19M) of the DMA32 zone are reserved for
HIGHORDERATOMIC and are not unreserved before the OOM.

balloon invoked oom-killer: 
gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
balloon cpuset=/ mems_allowed=0
CPU: 1 PID: 8473 Comm: balloon Tainted: GW  OE   
4.8.0-rc7-00219-g3f74c9559583-dirty #3161
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
  88007f15bbc8 8138eb13 88007f15bd88
 88005a72a4c0 88007f15bc28 811d2d13 88007f15bc08
 8146a5ca 81c8df60 0015 0206
Call Trace:
 [] dump_stack+0x63/0x90
 [] dump_header+0x5c/0x1ce
 [] ? virtballoon_oom_notify+0x2a/0x80
 [] oom_kill_process+0x22e/0x400
 [] out_of_memory+0x1ac/0x210
 [] __alloc_pages_nodemask+0x101e/0x1040
 [] handle_mm_fault+0xa0a/0xbf0
 [] __do_page_fault+0x1dd/0x4d0
 [] trace_do_page_fault+0x43/0x130
 [] do_async_page_fault+0x1a/0xa0
 [] async_page_fault+0x28/0x30
Mem-Info:
active_anon:383949 inactive_anon:106724 isolated_anon:0
 active_file:15 inactive_file:44 isolated_file:0
 unevictable:0 dirty:0 writeback:24 unstable:0
 slab_reclaimable:2483 slab_unreclaimable:3326
 mapped:0 shmem:0 pagetables:1906 bounce:0
 free:6898 free_pcp:291 free_cma:0
Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB 
inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB 
pages_scanned:1418 all_unreclaimable? no
DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB 
inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB 
writepending:0kB present:15992kB managed:15908kB mlocked:0kB 
slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB 
bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 1952 1952 1952
DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB 
inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB 
writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB 
slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB 
pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 
2*4096kB (H) = 8192kB
DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 
2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
51131 total pagecache pages
50795 pages in swap cache
Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
Free swap  = 8kB
Total swap = 255996kB
524158 pages RAM
0 pages HighMem/MovableOnly
12658 pages reserved
0 pages cma reserved
0 pages hwpoisoned

Another example, where the limit was exceeded because of the race, is:

in:imklog: page allocation failure: order:0, 
mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
CPU: 0 PID: 476 Comm: in:imklog Tainted: GE   
4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
  880077c37590 81389033 
  880077c37618 8117519b 02280020
  81cedb40  0040
Call Trace:
 [] dump_stack+0x63/0x90
 [] warn_alloc_failed+0xdb/0x130
 [] __alloc_pages_nodemask+0x4d6/0xdb0
 [] ? bdev_write_page+0xa9/0xd0
 [] ? __page_check_address+0xd3/0x130
 [] ? deactivate_slab+0x12a/0x3e0
 [] new_slab+0x339/0x490
 [] ___slab_alloc.constprop.74+0x367/0x480
 [] ? alloc_indirect.isra.14+0x1d/0x50
 [] ? default_wake_function+0x12/0x20
 [] __slab_alloc.constprop.73+0x20/0x40
 [] __kmalloc+0x1a4/0x1e0
 [] alloc_indirect.isra.14+0x1d/0x50
 [] virtqueue_add_sgs+0x1c4/0x470
 [] ? __bt_get.isra.8+0xe5/0x1c0
 [] __virtblk_add_req+0xae/0x1f0
 [] ? wake_atomic_t_function+0x60/0x60
 [] ? sched_clock+0x9/0x10
 [] ? __blk_mq_alloc_request+0x10b/0x230
 [] ? blk_rq_map_sg+0x213/0x550
 [] virtio_queue_rq+0x12d/0x290
 [] __blk_mq_run_hw_queue+0x239/0x370
 [] blk_mq_run_hw_queue+0x8f/0xb0
 [] blk_mq_insert_requests+0x18c/0x1a0
 [] blk_mq_flush_plug_list+0x125/0x140
 [] blk_flush_plug_list+0xc7/0x220
 [] blk_finish_plug+0x2c/0x40
 [] __do_page_cache_readahead+0x196/0x230
 [] ? zram_free_page+0x3a/0xb0 [zram]
 [] filemap_fault+0x448/0x4f0
 [] ? allo

[PATCH v3 4/4] mm: make unreserve highatomic functions reliable

2016-10-12 Thread Minchan Kim

Currently, unreserve_highatomic_pageblock bails out as soon as it finds a
highatomic pageblock, regardless of whether it really moved any free pages
out of it. That can defeat the goal of the unreserve logic, which is to
save a process from OOM.

This patch makes the unreserve function bail out only if it actually moves
some pages to a !highatomic free list, to avoid such false positives.

Another potential problem is that, due to the race between page freeing and
the reserve highatomic function, pages could be on the highatomic free list
even though the pageblock is of a !highatomic migratetype. In that case,
unreserve_highatomic_pageblock can do nothing if the highatomic reserve
count is less than pageblock_nr_pages. We can solve this simply by draining
all of the reserved pages before the OOM. It acts as a safeguard that
exhausts the reserved pages before converging on OOM.

Signed-off-by: Minchan Kim 
Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
---
 mm/page_alloc.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd2f0e1bffc4..163d7fa759a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2079,8 +2079,12 @@ static void reserve_highatomic_pageblock(struct page 
*page, struct zone *zone,
  * potentially hurts the reliability of high-order allocations when under
  * intense memory pressure but failed atomic allocations should be easier
  * to recover from than an OOM.
+ *
+ * If @force is true, try to unreserve a pageblock even though highatomic
+ * pageblock is exhausted.
  */
-static bool unreserve_highatomic_pageblock(const struct alloc_context *ac)
+static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
+   bool force)
 {
struct zonelist *zonelist = ac->zonelist;
unsigned long flags;
@@ -2092,8 +2096,12 @@ static bool unreserve_highatomic_pageblock(const struct 
alloc_context *ac)
 
for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
ac->nodemask) {
-   /* Preserve at least one pageblock */
-   if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
+   /*
+* Preserve at least one pageblock unless memory pressure
+* is really high.
+*/
+   if (!force && zone->nr_reserved_highatomic <=
+   pageblock_nr_pages)
continue;
 
spin_lock_irqsave(&zone->lock, flags);
@@ -2138,8 +2146,10 @@ static bool unreserve_highatomic_pageblock(const struct 
alloc_context *ac)
 */
set_pageblock_migratetype(page, ac->migratetype);
ret = move_freepages_block(zone, page, ac->migratetype);
-   spin_unlock_irqrestore(&zone->lock, flags);
-   return ret;
+   if (ret) {
+   spin_unlock_irqrestore(&zone->lock, flags);
+   return ret;
+   }
}
spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -3343,7 +3353,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int 
order,
 * Shrink them them and try again
 */
if (!page && !drained) {
-   unreserve_highatomic_pageblock(ac);
+   unreserve_highatomic_pageblock(ac, false);
drain_all_pages(NULL);
drained = true;
goto retry;
@@ -3462,7 +3472,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 */
if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
/* Before OOM, exhaust highatomic_reserve */
-   return unreserve_highatomic_pageblock(ac);
+   return unreserve_highatomic_pageblock(ac, true);
}
 
/*
-- 
2.7.4

[PATCH v3 3/4] mm: try to exhaust highatomic reserve before the OOM

2016-10-12 Thread Minchan Kim

I got an OOM report from the production team with a v4.4 kernel.
The system had enough free memory but failed to allocate a GFP_KERNEL
order-0 page and finally hit an OOM kill. It occurred during a QA process
which launches several apps, switches between them and so on. It happened
rarely. IOW, in the normal situation it was not a problem, but if we are
unlucky and several apps use peak memory at the same time, it can happen.
If we manage to get past that phase, the system can go on working well.

I could reproduce it easily with my test (memory spike). Look below.

The reason is that the free pages (19M) of the DMA32 zone are reserved for
HIGHORDERATOMIC and are not unreserved before the OOM.

balloon invoked oom-killer: 
gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
balloon cpuset=/ mems_allowed=0
CPU: 1 PID: 8473 Comm: balloon Tainted: GW  OE   
4.8.0-rc7-00219-g3f74c9559583-dirty #3161
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
  88007f15bbc8 8138eb13 88007f15bd88
 88005a72a4c0 88007f15bc28 811d2d13 88007f15bc08
 8146a5ca 81c8df60 0015 0206
Call Trace:
 [] dump_stack+0x63/0x90
 [] dump_header+0x5c/0x1ce
 [] ? virtballoon_oom_notify+0x2a/0x80
 [] oom_kill_process+0x22e/0x400
 [] out_of_memory+0x1ac/0x210
 [] __alloc_pages_nodemask+0x101e/0x1040
 [] handle_mm_fault+0xa0a/0xbf0
 [] __do_page_fault+0x1dd/0x4d0
 [] trace_do_page_fault+0x43/0x130
 [] do_async_page_fault+0x1a/0xa0
 [] async_page_fault+0x28/0x30
Mem-Info:
active_anon:383949 inactive_anon:106724 isolated_anon:0
 active_file:15 inactive_file:44 isolated_file:0
 unevictable:0 dirty:0 writeback:24 unstable:0
 slab_reclaimable:2483 slab_unreclaimable:3326
 mapped:0 shmem:0 pagetables:1906 bounce:0
 free:6898 free_pcp:291 free_cma:0
Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB 
inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB 
pages_scanned:1418 all_unreclaimable? no
DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB 
inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB 
writepending:0kB present:15992kB managed:15908kB mlocked:0kB 
slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB 
bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 1952 1952 1952
DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB 
inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB 
writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB 
slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB 
pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 
2*4096kB (H) = 8192kB
DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 
2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
51131 total pagecache pages
50795 pages in swap cache
Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
Free swap  = 8kB
Total swap = 255996kB
524158 pages RAM
0 pages HighMem/MovableOnly
12658 pages reserved
0 pages cma reserved
0 pages hwpoisoned
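
As a cross-check of the report above, the DMA32 free-list line can be summed
by hand. The sketch below is a small userspace illustration, not kernel code;
the struct and function names are hypothetical, and the counts are copied from
the "DMA32:" line. Every entry on that line carries the (H) tag, so the whole
total sits on the highatomic free list.

```c
#include <stddef.h>

/* One column of the "DMA32:" line: `count` free blocks of `size_kb` each. */
struct toy_free_area {
	unsigned long count;
	unsigned long size_kb;
};

/* Values copied from the DMA32 line of the report; all are tagged (H). */
static const struct toy_free_area toy_dma32_free[] = {
	{ 7, 4 }, { 8, 8 }, { 30, 16 }, { 31, 32 }, { 14, 64 }, { 9, 128 },
	{ 2, 256 }, { 2, 512 }, { 4, 1024 }, { 5, 2048 }, { 0, 4096 },
};

/* Sum count * size over all orders to get the free memory in kB. */
static unsigned long toy_total_free_kb(const struct toy_free_area *areas,
				       size_t nr)
{
	unsigned long total = 0;

	for (size_t i = 0; i < nr; i++)
		total += areas[i].count * areas[i].size_kb;
	return total;
}

/* Free memory in kB that the DMA32 line attributes to (H) blocks. */
unsigned long toy_dma32_highatomic_free_kb(void)
{
	return toy_total_free_kb(toy_dma32_free,
				 sizeof(toy_dma32_free) / sizeof(toy_dma32_free[0]));
}
```

The sum matches the 19484kB printed at the end of the DMA32 line, which is
the roughly 19M of free memory that an order-0 GFP_KERNEL request could not
use.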

Another example, where the limit was exceeded because of the race, is

in:imklog: page allocation failure: order:0, 
mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
CPU: 0 PID: 476 Comm: in:imklog Tainted: GE   
4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
  880077c37590 81389033 
  880077c37618 8117519b 02280020
  81cedb40  0040
Call Trace:
 [] dump_stack+0x63/0x90
 [] warn_alloc_failed+0xdb/0x130
 [] __alloc_pages_nodemask+0x4d6/0xdb0
 [] ? bdev_write_page+0xa9/0xd0
 [] ? __page_check_address+0xd3/0x130
 [] ? deactivate_slab+0x12a/0x3e0
 [] new_slab+0x339/0x490
 [] ___slab_alloc.constprop.74+0x367/0x480
 [] ? alloc_indirect.isra.14+0x1d/0x50
 [] ? default_wake_function+0x12/0x20
 [] __slab_alloc.constprop.73+0x20/0x40
 [] __kmalloc+0x1a4/0x1e0
 [] alloc_indirect.isra.14+0x1d/0x50
 [] virtqueue_add_sgs+0x1c4/0x470
 [] ? __bt_get.isra.8+0xe5/0x1c0
 [] __virtblk_add_req+0xae/0x1f0
 [] ? wake_atomic_t_function+0x60/0x60
 [] ? sched_clock+0x9/0x10
 [] ? __blk_mq_alloc_request+0x10b/0x230
 [] ? blk_rq_map_sg+0x213/0x550
 [] virtio_queue_rq+0x12d/0x290
 [] __blk_mq_run_hw_queue+0x239/0x370
 [] blk_mq_run_hw_queue+0x8f/0xb0
 [] blk_mq_insert_requests+0x18c/0x1a0
 [] blk_mq_flush_plug_list+0x125/0x140
 [] blk_flush_plug_list+0xc7/0x220
 [] blk_finish_plug+0x2c/0x40
 [] __do_page_cache_readahead+0x196/0x230
 [] ? zram_free_page+0x3a/0xb0 [zram]
 [] filemap_fault+0x448/0x4f0
 [] ? allo

Re: [PATCH 2/2] intel_pmc_core: avoid boot time warning for !CONFIG_DEBUGFS_FS

2016-10-12 Thread Darren Hart

On Mon, Oct 10, 2016 at 02:29:17PM +0200, Greg Kroah-Hartman wrote:
> On Mon, Oct 10, 2016 at 01:12:58PM +0200, Arnd Bergmann wrote:
> > While looking at a patch that introduced a compile-time warning
> > "‘pmc_core_dev_state_get’ defined but not used" (I sent a patch
> > for debugfs to fix it), I noticed that the same patch caused
> > it in intel_pmc_core also introduced a bogus run-time warning:
> > "PMC Core: debugfs register failed".
> > 
> > The problem is the IS_ERR_OR_NULL() check that as usual gets
> > things wrong: when CONFIG_DEBUGFS_FS is disabled,
> > debugfs_create_dir() fails with an error code, and we don't
> > need to warn about it, unlike the case in which it returns
> > NULL.
> > 
> > This reverts the driver to the previous state of not warning
> > about CONFIG_DEBUGFS_FS being disabled. I chose not to
> > restore the driver to making a runtime error in debugfs
> > fatal in pmc_core_probe().
> > 
> > Fixes: df2294fb6428 ("intel_pmc_core: Convert to DEFINE_DEBUGFS_ATTRIBUTE")
> > Signed-off-by: Arnd Bergmann 
> > ---
> >  drivers/platform/x86/intel_pmc_core.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/platform/x86/intel_pmc_core.c 
> > b/drivers/platform/x86/intel_pmc_core.c
> > index 520b58a04daa..e8b1b836ca2d 100644
> > --- a/drivers/platform/x86/intel_pmc_core.c
> > +++ b/drivers/platform/x86/intel_pmc_core.c
> > @@ -100,7 +100,7 @@ static int pmc_core_dbgfs_register(struct pmc_dev 
> > *pmcdev)
> > struct dentry *dir, *file;
> >  
> > dir = debugfs_create_dir("pmc_core", NULL);
> > -   if (IS_ERR_OR_NULL(dir))
> > +   if (!dir)
> > return -ENOMEM;
> 
> Hah, no, you shouldn't ever care about any return value being "wrong"
> from debugfs, the code should just continue on as normal.
> 
> And yes, you are correct, the IS_ERR_OR_NULL() is totally wrong.
> 
> thanks,
> 
> greg k-h
> 

Thanks Arnd and Greg, appreciate the catch and the fix.

Applied.
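
For readers following the IS_ERR_OR_NULL() discussion, here is a minimal
userspace sketch of the difference between the two checks. It only models the
kernel's error-pointer encoding: every toy_* name is hypothetical, and it
assumes the disabled-debugfs stub returns an error pointer carrying -ENODEV
(the thread only says "an error code").

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Error pointers live in the top 4095 values of the address space. */
#define TOY_MAX_ERRNO 4095

static void *toy_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static bool toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

static bool toy_is_err_or_null(const void *ptr)
{
	return !ptr || toy_is_err(ptr);
}

enum toy_debugfs_state { TOY_DEBUGFS_OK, TOY_DEBUGFS_DISABLED, TOY_DEBUGFS_NOMEM };

/* Stand-in for debugfs_create_dir(): error pointer when disabled, NULL on failure. */
static void *toy_create_dir(enum toy_debugfs_state state)
{
	static int dentry;

	switch (state) {
	case TOY_DEBUGFS_DISABLED:
		return toy_err_ptr(-ENODEV);	/* assumed error code */
	case TOY_DEBUGFS_NOMEM:
		return NULL;
	default:
		return &dentry;
	}
}

/* Old check: treats "debugfs compiled out" as a failure. */
int toy_register_old(enum toy_debugfs_state state)
{
	void *dir = toy_create_dir(state);

	if (toy_is_err_or_null(dir))
		return -ENOMEM;
	return 0;
}

/* Check from the patch: only a NULL return is reported. */
int toy_register_new(enum toy_debugfs_state state)
{
	void *dir = toy_create_dir(state);

	if (!dir)
		return -ENOMEM;
	return 0;
}
```

With debugfs disabled, the old check reports -ENOMEM (which is what produced
the bogus boot-time warning), while the new one returns 0. Per Greg's comment,
callers arguably should not look at the return value at all; the sketch only
shows why the two checks behave differently.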

-- 
Darren Hart
Intel Open Source Technology Center

mmotm git tree since-4.8 branch created (was: mmotm 2016-10-11-15-46 uploaded)

2016-10-12 Thread Michal Hocko

I have just created since-4.8 branch in mm git tree
(http://git.kernel.org/?p=linux/kernel/git/mhocko/mm.git;a=summary). It
is based on v4.8 tag in Linus tree and mmotm-2016-10-11-15-46.

As usual mmotm trees are tagged with signed tag
(finger print BB43 1E25 7FB8 660F F2F1 D22D 48E2 09A2 B310 E347)

The shortlog says:
Aaron Lu (1):
  thp: reduce usage of huge zero page's atomic counter

Ales Novak (1):
  ptrace: clear TIF_SYSCALL_TRACE on ptrace detach

Alexander Potapenko (3):
  include/linux: provide a safe version of container_of()
  llist: introduce llist_entry_safe()
  kcov: do not instrument lib/stackdepot.c

Alexandre Bounine (1):
  rapidio/rio_cm: use memdup_user() instead of duplicating code

Alexey Dobriyan (5):
  mm: unrig VMA cache hit ratio
  proc: much faster /proc/vmstat
  proc: faster /proc/*/status
  include/linux/ctype.h: make isdigit() table lookupless
  lib/kstrtox.c: smaller _parse_integer()

Andrea Arcangeli (6):
  mm: vm_page_prot: update with WRITE_ONCE/READ_ONCE
  mm: vma_adjust: remove superfluous confusing update in remove_next == 1 
case
  mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk
  mm: vma_adjust: remove superfluous check for next not NULL
  mm: vma_adjust: minor comment correction
  mm: vma_merge: correct false positive from __vma_unlink->validate_mm_rb

Andrew Morton (1):
  mm/page_io.c: replace some BUG_ON()s with VM_BUG_ON_PAGE()

Andrey Konovalov (1):
  kcov: properly check if we are in an interrupt

Aneesh Kumar K.V (1):
  mm: use zonelist name instead of using hardcoded index

Baoyou Xie (1):
  mm: move phys_mem_access_prot_allowed() declaration to pgtable.h

Bart Van Assche (1):
  do_generic_file_read(): fail immediately if killed

Borislav Petkov (1):
  config/android: Remove CONFIG_IPV6_PRIVACY

Catalin Marinas (1):
  mm: kmemleak: avoid using __va() on addresses that don't have a lowmem 
mapping

Christoph Hellwig (1):
  kprobes: include  instead of 

Dan Williams (1):
  mm: fix cache mode tracking in vm_insert_mixed()

Darrick J. Wong (3):
  block: invalidate the page cache when issuing BLKZEROOUT
  block: require write_same and discard requests align to logical block size
  block: implement (some of) fallocate for block devices

Davidlohr Bueso (3):
  ipc/msg: batch queue sender wakeups
  ipc/msg: make ss_wakeup() kill arg boolean
  ipc/msg: avoid waking sender upon full queue

Ganesh Mahendran (2):
  mm/zsmalloc: add trace events for zs_compact
  mm/zsmalloc: add per-class compact trace event

Gerald Schaefer (3):
  mm/hugetlb: fix memory offline with hugepage size > memory block size
  mm/hugetlb: check for reserved hugepages during memory offline
  mm/hugetlb: improve locking in dissolve_free_huge_pages()

Hidehiro Kawai (2):
  x86/panic: replace smp_send_stop() with kdump friendly version in panic 
path
  mips/panic: replace smp_send_stop() with kdump friendly version in panic 
path

Huang Ying (4):
  mm, swap: add swap_cluster_list
  mm: don't use radix tree writeback tags for pages in swap cache
  mm, swap: use offset of swap entry as key of swap cache
  mm: remove page_file_index

Ian Kent (5):
  autofs: fix autofs4_fill_super() error exit handling
  autofs: remove ino free in autofs4_dir_symlink()
  autofs: fix dev ioctl number range check
  autofs: add autofs_dev_ioctl_version() for AUTOFS_DEV_IOCTL_VERSION_CMD
  autofs4: move linux/auto_dev-ioctl.h to uapi/linux

James Morse (3):
  mm: pagewalk: fix the comment for test_walk
  fs/proc/task_mmu.c: make the task_mmu walk_page_range() limit in 
clear_refs_write() obvious
  mm/memcontrol.c: make the walk_page_range() limit obvious

Jason Cooper (7):
  random: simplify API for random address requests
  x86: use simpler API for random address requests
  ARM: use simpler API for random address requests
  arm64: use simpler API for random address requests
  tile: use simpler API for random address requests
  unicore32: use simpler API for random address requests
  random: remove unused randomize_range()

Joe Perches (15):
  seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
  meminfo: break apart a very long seq_printf with #ifdefs
  checkpatch: see if modified files are marked obsolete in MAINTAINERS
  checkpatch: look for symbolic permissions and suggest octal instead
  checkpatch: test multiple line block comment alignment
  checkpatch: don't test for prefer ether_addr_
  checkpatch: externalize the structs that should be const
  const_structs.checkpatch: add frequently used from Julia Lawall's list
  checkpatch: speed up checking for filenames in sections marked obsolete
  checkpatch: improve the block comment * alignment test
  checkpatch: add --strict test for macro argument reuse
  c

Re: [PATCH v3 1/4] net: phy: dp83867: Add documentation for optional impedance control

2016-10-12 Thread Mugunthan V N

On Monday 10 October 2016 06:48 PM, Rob Herring wrote:
> On Thu, Oct 06, 2016 at 10:43:52AM +0530, Mugunthan V N wrote:
>> Add documentation of ti,impedance-control which can be used to
> 
> Needs updating.

Oops, will update this in next version.

> 
>> correct MAC impedance mismatch using phy extended registers.
>>
>> Signed-off-by: Mugunthan V N 
>> ---
>>  Documentation/devicetree/bindings/net/ti,dp83867.txt | 12 
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/net/ti,dp83867.txt 
>> b/Documentation/devicetree/bindings/net/ti,dp83867.txt
>> index 5d21141..85bf945 100644
>> --- a/Documentation/devicetree/bindings/net/ti,dp83867.txt
>> +++ b/Documentation/devicetree/bindings/net/ti,dp83867.txt
>> @@ -9,6 +9,18 @@ Required properties:
>>  - ti,fifo-depth - Transmitt FIFO depth- see dt-bindings/net/ti-dp83867.h
>>  for applicable values
>>  
>> +Optional property:
>> +- ti,min-output-impedance - MAC Interface Impedance control to set
>> +the programmable output impedance to
>> +minimum value (35 ohms).
>> +- ti,max-output-impedance - MAC Interface Impedance control to set
>> +the programmable output impedance to
>> +maximum value (70 ohms).
> 
> Define what are valid range of values for these.

The values are already mentioned in the documentation as 35/70 ohms.
Are you referring to the register values?

Regards
Mugunthan V N

Re: [PATCH] ideapad-laptop: Add Lenovo Yoga 910-13IKB to no_hw_rfkill dmi list

2016-10-12 Thread Darren Hart

On Tue, Oct 11, 2016 at 07:28:02PM -0400, Brian Masney wrote:
> The Lenovo Yoga 910-13IKB does not have a hw rfkill switch, and trying
> to read the hw rfkill switch through the ideapad module causes it to
> always report as blocked.
> 
> This commit adds the Lenovo Yoga 910-13IKB to the no_hw_rfkill dmi list,
> fixing the WiFi breakage.
> 
> Signed-off-by: Brian Masney 

Thanks Brian,

Queued to testing.

-- 
Darren Hart
Intel Open Source Technology Center

Re: [PATCH] irqchip/jcore: fix lost per-cpu interrupts

2016-10-12 Thread Thomas Gleixner

On Tue, 11 Oct 2016, Rich Felker wrote:
> On Sun, Oct 09, 2016 at 09:23:58PM +0200, Thomas Gleixner wrote:
> > On Sun, 9 Oct 2016, Rich Felker wrote:
> > > On Sun, Oct 09, 2016 at 01:03:10PM +0200, Thomas Gleixner wrote:
> > > My preference would just be to keep the branch, but with your improved
> > > version that doesn't need a function call:
> > > 
> > >   irqd_is_per_cpu(irq_desc_get_irq_data(desc))
> > >
> > > While there is some overhead testing this condition every time, I can
> > > probably come up with several better places to look for a ~10 cycle
> > > improvement in the irq code path without imposing new requirements on
> > > the DT bindings.
> > 
> > Fair enough. Your call.
> >  
> > > As noted in my followup to the clocksource stall thread, there's also
> > > a possibility that it might make sense to consider the current
> > > behavior of having non-percpu irqs bound to a particular cpu as part
> > > of what's required by the compatible tag, in which case
> > > handle_percpu_irq or something similar/equivalent might be suitable
> > > for both the percpu and non-percpu cases. I don't understand the irq
> > > subsystem well enough to insist on that but I think it's worth
> > > consideration since it looks like it would improve performance of
> > > non-percpu interrupts a bit.
> > 
> > Well, you can use handle_percpu_irq() for your device interrupts if you
> > guarantee at the hardware level that there is no reentrancy. Once you make
> > the hardware capable of delivering them on either core the picture changes.
> 
> One more concern here -- I see that handle_simple_irq is handling the
> soft-disable / IRQS_PENDING flag behavior, and irq_check_poll stuff
> that's perhaps important too. Since soft-disable is all we have
> (there's no hard-disable of interrupts), is this a problem? In other
> words, can drivers have an expectation of not receiving interrupts
> when the irq is disabled? I would think anything compatible with irq
> sharing can't have such an expectation, but perhaps the kernel needs
> disabling internally for synchronization at module-unload time or
> similar cases?

Sure. A driver would be surprised getting an interrupt when it is disabled,
but with your exceptionally well thought out interrupt controller a pending
(level) interrupt which is not handled will be reraised forever and just
hard lock the machine.

> If you think any of these things are problems I'll switch back to the
> conditional version rather than using handle_percpu_irq for
> everything.

It might be the approach of least surprise, but it won't make a difference
for the situation described above.

Thanks,

tglx

Re: [GIT pull] locking fix for 4.9

2016-10-12 Thread Steven Rostedt

On Mon, 10 Oct 2016 10:29:27 -0700
Linus Torvalds  wrote:

> On Sat, Oct 8, 2016 at 5:47 AM, Thomas Gleixner  wrote:
> >
> > A single fix which prevents newer GCCs from spamming the build output with
> > overly eager warnings about __builtin_return_address() uses which are
> > correct.  
> 
> Ugh. This feels over-engineered to me.
> 
> We already disable that warning unconditionally for the trace
> subdirectory, and for mm/usercopy.c.
> 
> I feel that the simpler solution is to just disable the warning
> globally, and not worry about "when this config option is enabled we
> need to disable it".
> 
> Basically, we disable the warning every time we ever use
> __builtin_return_address(), so maybe we should just disable it once
> and for all.

The only advantage of doing this is to make it a pain to use
__builtin_return_address(n) with n > 0, so that we don't accidentally
use it without knowing what we are doing.

> 
> It's not like the __builtin_return_address() warning is so incredibly
> useful anyway.
> 

But I agree. We have lived a long time without the need for this
warning. I'm not strongly advocating keeping the warning around and
just disabling it totally. But it all comes down to how much we
trust those that inherit this after we are gone ;-)

/me is feeling his age.

-- Steve

Re: [PATCH 24/54] md/raid1: Improve another size determination in setup_conf()

2016-10-12 Thread Dan Carpenter

Compare:

foo = kmalloc(sizeof(*foo), GFP_KERNEL);

This says you are allocating enough space for foo.  It can be reviewed
by looking at one line.  If you change the type of foo it will still
work.

foo = kmalloc(sizeof(struct whatever), GFP_KERNEL);

There isn't enough information to say if this is correct.  If you change
the type of foo then you have to update the allocation as well.

It's not a super common type of bug, but I see it occasionally.
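
To make the comparison concrete, here is a small userspace sketch using
malloc() in place of kmalloc(). The struct and function names are
hypothetical; the scenario assumed is that foo used to point at the smaller
struct and its type was later changed.

```c
#include <stddef.h>
#include <stdlib.h>

struct toy_conf_v1 { int raid_disks; };
struct toy_conf_v2 { int raid_disks; long mirrors[8]; };

/* The size follows the pointer's type, so a type change needs no other edit. */
struct toy_conf_v2 *toy_alloc_by_pointee(void)
{
	struct toy_conf_v2 *foo = malloc(sizeof(*foo));

	return foo;
}

/* What sizeof(*foo) yields after foo's type was changed to toy_conf_v2. */
size_t toy_size_by_pointee(void)
{
	struct toy_conf_v2 *foo = NULL;

	return sizeof(*foo);	/* operand is not evaluated, so NULL is fine */
}

/* What a forgotten sizeof(struct ...) yields after the same type change. */
size_t toy_size_by_stale_name(void)
{
	return sizeof(struct toy_conf_v1);
}
```

The stale-name variant still compiles cleanly but would allocate too little
for the new type, which is the kind of bug the first form rules out.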

regards,
dan carpenter

Re: [PATCH 6/6] cpufreq: pxa: convert to clock API

2016-10-12 Thread Robert Jarzmik

Viresh Kumar  writes:

> On 12-10-16, 08:22, Robert Jarzmik wrote:
>> Viresh Kumar  writes:
>> 
>> > On 10-10-16, 22:09, Robert Jarzmik wrote:
>> >> As the clock settings have been introduced into the clock pxa drivers,
>> >> which are now available to change the CPU clock by themselves, remove
>> >> the clock handling from this driver, and rely on pxa clock drivers.
>> >> 
>> >> Signed-off-by: Robert Jarzmik 
>> >> ---
>> >>  drivers/cpufreq/pxa2xx-cpufreq.c | 191 
>> >> ---
>> >>  1 file changed, 39 insertions(+), 152 deletions(-)
>> >
>> > As you mentioned in the previous patch, why can't you use cpufreq-dt
>> > driver now and delete this one ?
>> 
>> The PXA architecture has both legacy platform_data based configurations and
>> new devicetree based ones.
>
> I don't see any platform data specific code in this driver. What am I
> missing ?

In a legacy platform, ie. without devicetree, we have CONFIG_OF=n.
How would cpufreq-dt be usable in this case ?

You can see such a platform in arch/arm/configs/mainstone_defconfig and
arch/arm/mach-pxa/mainstone.c as an example.

Cheers.

-- 
Robert

Re: [PATCH v2 4/4] mm: make unreserve highatomic functions reliable

2016-10-12 Thread Michal Hocko

On Wed 12-10-16 14:33:36, Minchan Kim wrote:
[...]
> @@ -2138,8 +2146,10 @@ static bool unreserve_highatomic_pageblock(const 
> struct alloc_context *ac)
>*/
>   set_pageblock_migratetype(page, ac->migratetype);
>   ret = move_freepages_block(zone, page, ac->migratetype);
> - spin_unlock_irqrestore(&zone->lock, flags);
> - return ret;
> + if (!drain && ret) {
> + spin_unlock_irqrestore(&zone->lock, flags);
> + return ret;
> + }

I've already mentioned that during the previous discussion. This sounds
overly aggressive to me. Why do we want to drain the whole reserve and
risk that we won't be able to build up a new one after OOM? Doing one
block at a time should be sufficient IMHO.

if (ret) {
spin_unlock_irqrestore(&zone->lock, flags);
return ret;
}

will do the trick and work for both the drain and !drain cases, which is a
good thing, because even the !drain case would like to see a block freed.
The only difference between the two is that the drain one would really
like to free something and so ignores the "at least one block" reserve.
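
The control flow being argued for can be sketched as a toy userspace model.
This is not the kernel function: the toy_* names are hypothetical, each zone
is reduced to two counters, and 512 is an assumed stand-in for
pageblock_nr_pages.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for pageblock_nr_pages; the real value varies. */
#define TOY_PAGEBLOCK_NR_PAGES 512UL

struct toy_zone {
	unsigned long nr_reserved_highatomic;	/* pages accounted as reserved */
	unsigned long movable_free_pages;	/* free pages still on the highatomic list */
};

/* Stand-in for move_freepages_block(): returns how many pages were moved. */
static unsigned long toy_move_freepages(struct toy_zone *zone)
{
	unsigned long moved = zone->movable_free_pages;

	zone->movable_free_pages = 0;
	return moved;
}

/* Returns true only if some pages were really moved out of a reserve. */
bool toy_unreserve(struct toy_zone *zones, size_t nr_zones, bool force)
{
	for (size_t i = 0; i < nr_zones; i++) {
		struct toy_zone *zone = &zones[i];

		/* Keep at least one pageblock unless the caller forces it. */
		if (!force && zone->nr_reserved_highatomic <= TOY_PAGEBLOCK_NR_PAGES)
			continue;

		/* Stop after the first block that frees something. */
		if (toy_move_freepages(zone))
			return true;
	}
	return false;
}
```

In this model the forced and unforced paths share the same "return on first
success" rule; forcing only removes the "at least one block" guard.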

-- 
Michal Hocko
SUSE Labs

[PATCH] rtc: Add support for maxim dallas rtc max-6917

2016-10-12 Thread VENKAT PRASHANTH B U

This is a patch to add support for
maxim dallas rtc max6917.

Signed-off-by: Venkat Prashanth B U 
---
---
 drivers/rtc/Kconfig   |   9 +
 drivers/rtc/Makefile  |   1 +
 drivers/rtc/rtc-max6917.c | 406 ++
 3 files changed, 416 insertions(+)

diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig
index e215f50..2163606 100644
--- a/drivers/rtc/Kconfig
+++ b/drivers/rtc/Kconfig
@@ -277,6 +277,15 @@ config RTC_DRV_MAX6900
  This driver can also be built as a module. If so, the module
  will be called rtc-max6900.
 
+config RTC_DRV_MAX6917
+   tristate "Maxim MAX6917"
+   help
+ If you say yes here you will get support for the
+ Maxim MAX6917 I2C RTC chip.
+
+ This driver can also be built as a module. If so, the module
+ will be called rtc-max6917.
+
 config RTC_DRV_MAX8907
tristate "Maxim MAX8907"
depends on MFD_MAX8907
diff --git a/drivers/rtc/Makefile b/drivers/rtc/Makefile
index 7cf7ad5..29332fb 100644
--- a/drivers/rtc/Makefile
+++ b/drivers/rtc/Makefile
@@ -87,6 +87,7 @@ obj-$(CONFIG_RTC_DRV_M48T86)  += rtc-m48t86.o
 obj-$(CONFIG_RTC_DRV_MAX6900)  += rtc-max6900.o
 obj-$(CONFIG_RTC_DRV_MAX6902)  += rtc-max6902.o
 obj-$(CONFIG_RTC_DRV_MAX6916)  += rtc-max6916.o
+obj-$(CONFIG_RTC_DRV_MAX6917)  += rtc-max6917.o
 obj-$(CONFIG_RTC_DRV_MAX77686) += rtc-max77686.o
 obj-$(CONFIG_RTC_DRV_MAX8907)  += rtc-max8907.o
 obj-$(CONFIG_RTC_DRV_MAX8925)  += rtc-max8925.o
diff --git a/drivers/rtc/rtc-max6917.c b/drivers/rtc/rtc-max6917.c
index e69de29..1176384 100644
--- a/drivers/rtc/rtc-max6917.c
+++ b/drivers/rtc/rtc-max6917.c
@@ -0,0 +1,406 @@
+   /* rtc-max6917.c
+   *
+   * Driver for MAXIM  max6917  I2C-Compatible Real Time Clock
+   *
+   * Author : Venkat Prashanth B U 
+   *
+   * This program is free software; you can redistribute it and/or modify
+   * it under the terms of the GNU General Public License version 2 as
+   * published by the Free Software Foundation.
+   *
+   */
+
+   #include 
+   #include 
+   #include 
+   #include 
+   #include 
+   #include 
+   #include 
+   #include 
+   #include 
+
+   #define MAX6917_REG_SECS0x01/* 00-59 */
+   #define MAX6917_REG_MIN 0x02/* 00-59 */
+   #define MAX6917_REG_HOUR0x03/* 00-23, or 
1-12{am,pm} */
+   #define MAX6917_REG_WDAY0x04/* 01-07 */
+   #define MAX6917_REG_MDAY0x05/* 01-31 */
+   #define MAX6917_REG_MONTH   0x06/* 01-12 */
+   #define MAX6917_REG_YEAR0x07/* 00-99 */
+   #define MAX6917_REG_CONTROL 0x08
+   #define MAX6917_REG_STATUS  0x0c
+   #define MAX6917_REG_ALARM   0x0a
+   #define MAX6917_BURST_LEN   8   /* can burst r/w first 
8 regs */
+   #define MAX6917_REG_CENTURY 9   /* century */
+   #define MAX6917_REG_LEN 10
+   #define MAX6917_REG_CT_WP   (1 << 7)/* Write 
Protect */
+   /*
+   * register read/write commands
+   */
+   #define MAX6917_REG_CONTROL_WRITE   0x8e
+   #define MAX6917_REG_CENTURY_WRITE   0x92
+   #define MAX6917_REG_CENTURY_READ0x93
+   #define MAX6917_REG_RESERVED_READ   0x96
+   #define MAX6917_REG_BURST_WRITE 0xbe
+   #define MAX6917_REG_BURST_READ  0xbf
+
+   #define MAX6917_IDLE_TIME_AFTER_WRITE   3   /* specification says 
2.5 mS */
+
+   static struct i2c_driver max6917_driver;
+
+   struct max6917
+   {
+   u8 offset;  /* register's offset */
+   u8 regs[11];
+   u16 nvram_offset;
+   struct bin_attribute *nvram;
+   unsigned long flags;
+   #define HAS_NVRAM   0   /* bit 0 == sysfs file 
active */
+   #define HAS_ALARM   1   /* bit 1 == irq claimed 
*/
+   struct i2c_client *client;
+   struct rtc_device *rtc;
+   s32 (*read_block_data) (const struct i2c_client * client, u8 
command,
+   u8 length, u8 * values);
+   s32 (*write_block_data) (const struct i2c_client * client, u8 
command,
+   u8 length, const u8 * values);
+   };
+
+   struct chip_desc
+   {
+   unsigned alarm:1;
+   u16 nvram_offset;
+   u16 nvram_size;
+   };
+
+   static int
+   max6917_i2c_read_regs (struct i2c_client *client, u8 * buf)
+   {
+   u8 reg_burst_read[1] = { MAX6917_REG_BURST_READ };
+   u8 reg_century_read[1] = { MAX6917_REG_CENTURY_READ };
+   struct i2c_msg msgs[4] = {
+   {
+   .addr = client->addr,
+   .flags = 0, /* write */
+   .len = siz

Re: [PATCH v3 3/4] mm: try to exhaust highatomic reserve before the OOM

2016-10-12 Thread Michal Hocko

Looks much better. Thanks! I am wondering whether we want to have this
marked for stable. The patch is quite non-intrusive and fires only when
we are really OOM. It is definitely better to try harder than go and
disrupt the system by the OOM killer. So I would add
Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic 
allocations on demand")
Cc: stable # 4.4+

The backport will look slightly different for kernels prior to 4.6 because
we do not have should_reclaim_retry there yet, but the check might hook in
right before __alloc_pages_may_oom.
-- 
Michal Hocko
SUSE Labs

Re: [PATCH RFC 2/2] KVM: x86: Support using the VMX preemption timer for APIC Timer periodic/oneshot mode

2016-10-12 Thread Wanpeng Li

2016-10-11 23:06 GMT+08:00 Radim Krčmář :
> 2016-10-11 20:17+0800, Wanpeng Li:
>> From: Wanpeng Li 
>>
>> Most windows guests still utilize APIC Timer periodic/oneshot mode
>> instead of tsc-deadline mode, and the APIC Timer periodic/oneshot
>> mode are still emulated by high overhead hrtimer on host. This patch
>> converts the expected expire time of the periodic/oneshot mode to
>> guest deadline tsc in order to leverage VMX preemption timer logic
>> for APIC Timer tsc-deadline mode.
>>
>> Cc: Paolo Bonzini 
>> Cc: Radim Krčmář 
>> Cc: Yunhong Jiang 
>> Signed-off-by: Wanpeng Li 
>> ---
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> @@ -1101,13 +1101,20 @@ static u32 apic_get_tmcct(struct kvm_lapic *apic)
>>   apic->lapic_timer.period == 0)
>>   return 0;
>>
>> + if (kvm_lapic_hv_timer_in_use(apic->vcpu)) {
>> + u64 tscl = rdtsc();
>> +
>> + tmcct = apic->lapic_timer.tscdeadline -
>> + kvm_read_l1_tsc(apic->vcpu, tscl);
>
> Yes, this won't work.  The easiest way to return a less bogus TMCCT
> would be to remember the timeout when setting up the timer and replace
> hrtimer_get_remaining() with it -- our delivery method shouldn't change
> the expiration time.

Agreed.

>
>> + } else {
>> + remaining = hrtimer_get_remaining(&apic->lapic_timer.timer);
>> + if (ktime_to_ns(remaining) < 0)
>> + remaining = ktime_set(0, 0);
>> +
>> + ns = mod_64(ktime_to_ns(remaining), apic->lapic_timer.period);
>> + tmcct = div64_u64(ns,
>> +  (APIC_BUS_CYCLE_NS * apic->divide_count));
>> + }
>>
>>   return tmcct;
>>  }
>> @@ -1400,52 +1407,65 @@ bool kvm_lapic_hv_timer_in_use(struct kvm_vcpu *vcpu)
>>  }
>>  EXPORT_SYMBOL_GPL(kvm_lapic_hv_timer_in_use);
>>
>> -static void cancel_hv_tscdeadline(struct kvm_lapic *apic)
>> +static void cancel_hv_timer(struct kvm_lapic *apic)
>>  {
>>   kvm_x86_ops->cancel_hv_timer(apic->vcpu);
>>   apic->lapic_timer.hv_timer_in_use = false;
>>  }
>>
>> +static bool start_hv_timer(struct kvm_lapic *apic)
>>  {
>> + u64 tscdeadline;
>>
>> + if (apic_lvtt_period(apic) || apic_lvtt_oneshot(apic)) {
>> + u64 tscl = rdtsc();
>>
>> + apic->lapic_timer.period = (u64)kvm_lapic_get_reg(apic, 
>> APIC_TMICT)
>> + * APIC_BUS_CYCLE_NS * apic->divide_count;
>> + apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, 
>> tscl) +
>> + nsec_to_cycles(apic->vcpu, apic->lapic_timer.period);
>
> start_hv_timer() is called from kvm_lapic_switch_to_hv_timer(), which
> can happen mid-period.  The worst case is that the timer will never
> fire, because we always convert back and forth.  You have to compute the
> equivalent deadline only once, and carry it around.

Agreed. Thanks for your review. :) Please see RFC V2.
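
To illustrate the "compute the deadline only once" point, here is a toy
userspace model with abstract time units. None of these names exist in KVM;
they are hypothetical and only show why recomputing the deadline on every
backend switch keeps pushing the expiry out.

```c
#include <stdint.h>

struct toy_timer {
	uint64_t period;	/* programmed interval, in abstract ticks */
	uint64_t deadline;	/* absolute expiry time */
};

/* Called once, when the guest programs the timer. */
void toy_timer_program(struct toy_timer *t, uint64_t now, uint64_t period)
{
	t->period = period;
	t->deadline = now + period;
}

/* Problematic switch: restarts the full period from "now". */
void toy_switch_recompute(struct toy_timer *t, uint64_t now)
{
	t->deadline = now + t->period;
}

/* Suggested behaviour: switching backends leaves the deadline untouched. */
void toy_switch_keep(struct toy_timer *t, uint64_t now)
{
	(void)t;
	(void)now;
}

/* Remaining time derived from the stored deadline, clamped at zero. */
uint64_t toy_remaining(const struct toy_timer *t, uint64_t now)
{
	return t->deadline > now ? t->deadline - now : 0;
}
```

With a period of 100 programmed at time 0 and a switch at time 60, the
recomputing variant moves the expiry to 160, and repeated switches can delay
it indefinitely. The keeping variant still expires at 100, and the remaining
count can be derived from the same stored value.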

Regards,
Wanpeng Li

>
>> + }
>> +
>> + tscdeadline = apic->lapic_timer.tscdeadline;
>>
>>   if (atomic_read(&apic->lapic_timer.pending) ||
>>   kvm_x86_ops->set_hv_timer(apic->vcpu, tscdeadline)) {
>>   if (apic->lapic_timer.hv_timer_in_use)
>> + cancel_hv_timer(apic);
>>   } else {
>>   apic->lapic_timer.hv_timer_in_use = true;
>>   hrtimer_cancel(&apic->lapic_timer.timer);
>>
>>   /* In case the sw timer triggered in the window */
>>   if (atomic_read(&apic->lapic_timer.pending))
>> + cancel_hv_timer(apic);
>>   }
>>   trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
>>   apic->lapic_timer.hv_timer_in_use);
>>   return apic->lapic_timer.hv_timer_in_use;
>>  }
>>
>> +void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu)
>> +{
>> + struct kvm_lapic *apic = vcpu->arch.apic;
>> +
>> + WARN_ON(!apic->lapic_timer.hv_timer_in_use);
>> + WARN_ON(swait_active(&vcpu->wq));
>> + cancel_hv_timer(apic);
>> + apic_timer_expired(apic);
>> +
>> + if (apic_lvtt_period(apic))
>> + start_hv_timer(apic);
>> +}
>> +EXPORT_SYMBOL_GPL(kvm_lapic_expired_hv_timer);
>> +
>>  void kvm_lapic_switch_to_hv_timer(struct kvm_vcpu *vcpu)
>>  {
>>   struct kvm_lapic *apic = vcpu->arch.apic;
>>
>>   WARN_ON(apic->lapic_timer.hv_timer_in_use);
>>
>> - if (apic_lvtt_tscdeadline(apic))
>> - start_hv_tscdeadline(apic);
>> + start_hv_timer(apic);
>>  }
>>  EXPORT_SYMBOL_GPL(kvm_lapic_switch_to_hv_timer);
>>

Re: [PATCH v4 10/10] ARM: sunxi: Enable sun8i-emac driver on multi_v7_defconfig

2016-10-12 Thread LABBE Corentin

On Tue, Oct 11, 2016 at 11:40:42AM +0200, Maxime Ripard wrote:
> On Mon, Oct 10, 2016 at 03:09:43PM +0200, Jean-Francois Moine wrote:
> > On Mon, 10 Oct 2016 14:35:11 +0200
> > LABBE Corentin  wrote:
> > 
> > > On Mon, Oct 10, 2016 at 02:30:46PM +0200, Maxime Ripard wrote:
> > > > On Fri, Oct 07, 2016 at 10:25:57AM +0200, Corentin Labbe wrote:
> > > > > Enable the sun8i-emac driver in the multi_v7 default configuration
> > > > > 
> > > > > Signed-off-by: Corentin Labbe 
> > > > > ---
> > > > >  arch/arm/configs/multi_v7_defconfig | 1 +
> > > > >  1 file changed, 1 insertion(+)
> > > > > 
> > > > > diff --git a/arch/arm/configs/multi_v7_defconfig 
> > > > > b/arch/arm/configs/multi_v7_defconfig
> > > > > index 5845910..f44d633 100644
> > > > > --- a/arch/arm/configs/multi_v7_defconfig
> > > > > +++ b/arch/arm/configs/multi_v7_defconfig
> > > > > @@ -229,6 +229,7 @@ CONFIG_NETDEVICES=y
> > > > >  CONFIG_VIRTIO_NET=y
> > > > >  CONFIG_HIX5HD2_GMAC=y
> > > > >  CONFIG_SUN4I_EMAC=y
> > > > > +CONFIG_SUN8I_EMAC=y
> > > > 
> > > > Any reason to build it statically?
> > > 
> > > > No, I just copied what CONFIG_SUN4I_EMAC does, which probably does
> > > > not need it either.
> > 
> > All arm configs are done the same way, and, some day, the generic ARM
> > V7 kernel will not be loadable in 1Gb RAM...
> 
> Yeah, if possible, I'd really like to avoid introducing statically
> built drivers to multi_v7.
> 

I forgot to say it in my first answer, but yes, I will change it.

Regards

Re: [PATCH v3 4/4] mm: make unreserve highatomic functions reliable

2016-10-12 Thread Michal Hocko

On Wed 12-10-16 17:03:49, Minchan Kim wrote:
> Currently, unreserve_highatomic_pageblock bails out if it found
> highatomic pageblock regardless of really moving free pages
> from the one so that it could mitigate unreserve logic's goal
> which saves OOM of a process.
> 
> This patch makes unreserve functions bail out only if it moves
> some pages out of !highatomic free list to avoid such false
> positive.
> 
> Another potential problem is that by race between page freeing and
> reserve highatomic function, pages could be in highatomic free list
> even though the pageblock is !high atomic migratetype. In that case,
> unreserve_highatomic_pageblock can be void if count of highatomic
> reserve is less than pageblock_nr_pages. We could solve it simply
> via draining all of reserved pages before the OOM. It would have
> a safeguard role to exhaust reserved pages before converging to OOM.
> 
> Signed-off-by: Minchan Kim 
> Signed-off-by: Michal Hocko 
> Acked-by: Vlastimil Babka 

Looks good to me as well. If the previous one is agreed to go to stable
this one should go with it IMHO.

Thanks!

> ---
>  mm/page_alloc.c | 24 +---
>  1 file changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fd2f0e1bffc4..163d7fa759a2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2079,8 +2079,12 @@ static void reserve_highatomic_pageblock(struct page 
> *page, struct zone *zone,
>   * potentially hurts the reliability of high-order allocations when under
>   * intense memory pressure but failed atomic allocations should be easier
>   * to recover from than an OOM.
> + *
> + * If @force is true, try to unreserve a pageblock even though highatomic
> + * pageblock is exhausted.
>   */
> -static bool unreserve_highatomic_pageblock(const struct alloc_context *ac)
> +static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> + bool force)
>  {
>   struct zonelist *zonelist = ac->zonelist;
>   unsigned long flags;
> @@ -2092,8 +2096,12 @@ static bool unreserve_highatomic_pageblock(const 
> struct alloc_context *ac)
>  
>   for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
>   ac->nodemask) {
> - /* Preserve at least one pageblock */
> - if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
> + /*
> +  * Preserve at least one pageblock unless memory pressure
> +  * is really high.
> +  */
> + if (!force && zone->nr_reserved_highatomic <=
> + pageblock_nr_pages)
>   continue;
>  
>   spin_lock_irqsave(&zone->lock, flags);
> @@ -2138,8 +2146,10 @@ static bool unreserve_highatomic_pageblock(const 
> struct alloc_context *ac)
>*/
>   set_pageblock_migratetype(page, ac->migratetype);
>   ret = move_freepages_block(zone, page, ac->migratetype);
> - spin_unlock_irqrestore(&zone->lock, flags);
> - return ret;
> + if (ret) {
> + spin_unlock_irqrestore(&zone->lock, flags);
> + return ret;
> + }
>   }
>   spin_unlock_irqrestore(&zone->lock, flags);
>   }
> @@ -3343,7 +3353,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned 
> int order,
>* Shrink them them and try again
>*/
>   if (!page && !drained) {
> - unreserve_highatomic_pageblock(ac);
> + unreserve_highatomic_pageblock(ac, false);
>   drain_all_pages(NULL);
>   drained = true;
>   goto retry;
> @@ -3462,7 +3472,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>*/
>   if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
>   /* Before OOM, exhaust highatomic_reserve */
> - return unreserve_highatomic_pageblock(ac);
> + return unreserve_highatomic_pageblock(ac, true);
>   }
>  
>   /*
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs
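
The effect of the new @force argument and of the "return only if pages were moved" change can be followed in a small userspace model. This is a sketch under stated assumptions, not the kernel code: `struct zone_model`, `unreserve_model()` and the `PAGEBLOCK_NR_PAGES` value are hypothetical stand-ins, and locking, free-list walking and the reserve counter update are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGEBLOCK_NR_PAGES 512UL /* illustrative value, not taken from the patch */

/* Hypothetical model of the per-zone state the patch looks at. */
struct zone_model {
	unsigned long nr_reserved_highatomic; /* pages currently reserved */
	unsigned long movable_free_pages;     /* free pages a move would transfer */
};

/*
 * Return true only when free pages were actually moved out of a
 * highatomic pageblock. Without force, a zone holding at most one
 * reserved pageblock is skipped, as in the patched loop.
 */
static bool unreserve_model(struct zone_model *zones, size_t nr_zones, bool force)
{
	assert(zones != NULL || nr_zones == 0);

	for (size_t i = 0; i < nr_zones; i++) {
		struct zone_model *z = &zones[i];
		unsigned long moved;

		if (!force && z->nr_reserved_highatomic <= PAGEBLOCK_NR_PAGES)
			continue;

		/* Stand-in for move_freepages_block(): it may move nothing. */
		moved = z->movable_free_pages;
		z->movable_free_pages = 0;
		if (moved)
			return true;
		/* Nothing moved: keep scanning instead of reporting success. */
	}
	return false;
}
```

In this model, a zone with exactly one reserved pageblock is left alone for the direct-reclaim style call (force false) and is only drained by the pre-OOM style call (force true), which matches the intent described in the commit message.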

Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area

2016-10-12 Thread zijun_hu

On 10/12/2016 04:25 PM, Michal Hocko wrote:
> On Wed 12-10-16 15:24:33, zijun_hu wrote:
>> On 10/12/2016 02:53 PM, Michal Hocko wrote:
>>> On Wed 12-10-16 08:28:17, zijun_hu wrote:
>>>> On 2016/10/12 1:22, Michal Hocko wrote:
>>>>> On Tue 11-10-16 21:24:50, zijun_hu wrote:
>>>>>> From: zijun_hu 
>>
>> should we have a generic discussion whether such patches which considers
>> many boundary or rare conditions are necessary.
> 
> In general, I believe that kernel internal interfaces which have no
> userspace exposure shouldn't be cluttered with sanity checks.
> 

You are right and I agree with you, but many internal interfaces
do perform sanity checks in the current Linux sources.

>> i found the following code segments in mm/vmalloc.c
>> static struct vmap_area *alloc_vmap_area(unsigned long size,
>> unsigned long align,
>> unsigned long vstart, unsigned long vend,
>> int node, gfp_t gfp_mask)
>> {
>> ...
>>
>> BUG_ON(!size);
>> BUG_ON(offset_in_page(size));
>> BUG_ON(!is_power_of_2(align));
> 
> See a recent Linus rant about BUG_ONs. These BUG_ONs are quite old and
> from a quick look they are even unnecessary. So rather than adding more
> of those, I think removing those that are not needed is much more
> preferred.
>
I noticed that; the code segment above is only meant to illustrate that
input parameter checking is sometimes necessary.

>> should we make below declarations as conventions
>> 1) when we say 'alignment', it means align to a power of 2 value
>>    for example, aligning value @v to @b implies @b is a power of 2;
>>    aligning 10 to 4 gives 12
> 
> alignment other than power-of-two makes only very limited sense to me.
> 
you are right and i agree with you.
>> 2) when we say 'round value @v up/down to boundary @b', it means the
>>    result is a multiple of @b; it doesn't require @b to be a power of 2
> 

I will write to Linus to ask whether we should formally define
the meaning of 'align' and 'round up/down' and whether such
patches are necessary.
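
The two conventions proposed above can be written down as a short userspace sketch. The helper names (`is_pow2`, `align_up_pow2`, `round_up_any`, `round_down_any`) are hypothetical and chosen for illustration; they are not the kernel's macros.

```c
#include <assert.h>
#include <stdbool.h>

/* True when n is a non-zero power of two. */
static bool is_pow2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Convention 1: "align" uses the mask trick and needs a power-of-two b. */
static unsigned long align_up_pow2(unsigned long v, unsigned long b)
{
	assert(is_pow2(b));
	return (v + b - 1) & ~(b - 1);
}

/* Convention 2: "round up" yields the next multiple of any non-zero b. */
static unsigned long round_up_any(unsigned long v, unsigned long b)
{
	assert(b != 0);
	return ((v + b - 1) / b) * b;
}

/* Convention 2: "round down" yields the previous multiple of any non-zero b. */
static unsigned long round_down_any(unsigned long v, unsigned long b)
{
	assert(b != 0);
	return (v / b) * b;
}
```

Aligning 10 to 4 gives 12 with either form. The mask form silently breaks for other boundaries: with v = 10 and b = 6 it yields 10, while the next multiple of 6 is 12, which is why the sketch asserts on the power-of-two requirement.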

Re: [PATCH 6/6] cpufreq: pxa: convert to clock API

2016-10-12 Thread Viresh Kumar

On 12-10-16, 10:29, Robert Jarzmik wrote:
> Viresh Kumar  writes:
> 
> > On 12-10-16, 08:22, Robert Jarzmik wrote:
> >> Viresh Kumar  writes:
> >> 
> >> > On 10-10-16, 22:09, Robert Jarzmik wrote:
> >> >> As the clock settings have been introduced into the clock pxa drivers,
> >> >> which are now available to change the CPU clock by themselves, remove
> >> >> the clock handling from this driver, and rely on pxa clock drivers.
> >> >> 
> >> >> Signed-off-by: Robert Jarzmik 
> >> >> ---
> >> >>  drivers/cpufreq/pxa2xx-cpufreq.c | 191 
> >> >> ---
> >> >>  1 file changed, 39 insertions(+), 152 deletions(-)
> >> >
> >> > As you mentioned in the previous patch, why can't you use cpufreq-dt
> >> > driver now and delete this one ?
> >> 
> >> PXA architecture have both legacy platform_data based configurations and 
> >> new
> >> devicetree based ones.
> >
> > I don't see any platform data specific code in this driver. What am I
> > missing ?
> 
> In a legacy platform, ie. without devicetree, we have CONFIG_OF=n.
> How would cpufreq-dt be usable in this case ?
> 
> You can see such a platform in arch/arm/configs/mainstone_defconfig and
> arch/arm/mach-pxa/mainstone.c as an example.

Okay, so it's not about platform_data as you said earlier. Rather it is
about the CONFIG_OF option.

In that case, what about making this driver depend on !CONFIG_OF? So
that the DT users don't use it anymore.

-- 
viresh

Re: [PATCH] gpio: mockup: add sysfs dependency

2016-10-12 Thread Linus Walleij

On Mon, Oct 10, 2016 at 2:42 PM, Arnd Bergmann  wrote:

> Building the gpio-mockup driver without SYSFS results in a harmless Kconfig
> warning:
>
> warning: (GPIO_MOCKUP) selects GPIO_SYSFS which has unmet direct dependencies 
> (GPIOLIB && SYSFS)
>
> We can easily avoid that warning by adding a dependency on SYSFS.
>
> Fixes: 0f98dd1b27d2 ("gpio/mockup: add virtual gpio device")
> Signed-off-by: Arnd Bergmann 

Patch applied.

Yours,
Linus Walleij

Re: [PATCH v3 03/11] tracing/syscalls: add compat syscall metadata

2016-10-12 Thread Michael Ellerman

Marcin Nowakowski  writes:

> Now that compat syscalls are properly distinguished from native calls,
> we can add metadata for compat syscalls as well.
> All the macros used to generate the metadata are the same as for
> standard syscalls, but with a compat_ prefix to distinguish them easily.
>
> Signed-off-by: Marcin Nowakowski 
> Cc: Steven Rostedt 
> Cc: Ingo Molnar 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: linuxppc-...@lists.ozlabs.org
> ---
>  arch/powerpc/include/asm/ftrace.h | 15 +---
>  include/linux/compat.h| 74 
> +++
>  kernel/trace/trace_syscalls.c |  8 +++--
>  3 files changed, 90 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/ftrace.h 
> b/arch/powerpc/include/asm/ftrace.h
> index 686c5f7..9697a73 100644
> --- a/arch/powerpc/include/asm/ftrace.h
> +++ b/arch/powerpc/include/asm/ftrace.h
> @@ -73,12 +73,17 @@ struct dyn_arch_ftrace {
>  static inline bool arch_syscall_match_sym_name(const char *sym, const char 
> *name)
>  {
>   /*
> -  * Compare the symbol name with the system call name. Skip the .sys or 
> .SyS
> -  * prefix from the symbol name and the sys prefix from the system call 
> name and
> -  * just match the rest. This is only needed on ppc64 since symbol names 
> on
> -  * 32bit do not start with a period so the generic function will work.
> +  * Compare the symbol name with the system call name. Skip the .sys,
> +  * .SyS or .compat_sys prefix from the symbol name and the sys prefix
> +  * from the system call name and just match the rest. This is only
> +  * needed on ppc64 since symbol names on 32bit do not start with a
> +  * period so the generic function will work.
>*/
> - return !strcmp(sym + 4, name + 3);
> + int prefix_len = 3;
> +
> + if (!strncasecmp(name, "compat_", 7))
> + prefix_len = 10;
> + return !strcmp(sym + prefix_len + 1, name + prefix_len);
>  }

It's annoying that we have to duplicate all that just to do a + 1.

How about this as a precursor?

cheers


diff --git a/Documentation/trace/ftrace-design.txt 
b/Documentation/trace/ftrace-design.txt
index dd5f916b351d..bd65f2adeb09 100644
--- a/Documentation/trace/ftrace-design.txt
+++ b/Documentation/trace/ftrace-design.txt
@@ -226,10 +226,6 @@ You need very few things to get the syscalls tracing in an 
arch.
 - If the system call table on this arch is more complicated than a simple array
   of addresses of the system calls, implement an arch_syscall_addr to return
   the address of a given system call.
-- If the symbol names of the system calls do not match the function names on
-  this arch, define ARCH_HAS_SYSCALL_MATCH_SYM_NAME in asm/ftrace.h and
-  implement arch_syscall_match_sym_name with the appropriate logic to return
-  true if the function name corresponds with the symbol name.
 - Tag this arch as HAVE_SYSCALL_TRACEPOINTS.
 
 
diff --git a/arch/powerpc/include/asm/ftrace.h 
b/arch/powerpc/include/asm/ftrace.h
index 686c5f70eb84..dc48f5b2878d 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -60,6 +60,12 @@ struct dyn_arch_ftrace {
struct module *mod;
 };
 #endif /*  CONFIG_DYNAMIC_FTRACE */
+
+#ifdef PPC64_ELF_ABI_v1
+/* On ppc64 ABIv1 (BE) we have to skip the leading '.' in the symbol name */
+#define ARCH_SYM_NAME_SKIP_CHARS 1
+#endif
+
 #endif /* __ASSEMBLY__ */
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
@@ -67,20 +73,4 @@ struct dyn_arch_ftrace {
 #endif
 #endif
 
-#if defined(CONFIG_FTRACE_SYSCALLS) && !defined(__ASSEMBLY__)
-#ifdef PPC64_ELF_ABI_v1
-#define ARCH_HAS_SYSCALL_MATCH_SYM_NAME
-static inline bool arch_syscall_match_sym_name(const char *sym, const char 
*name)
-{
-   /*
-* Compare the symbol name with the system call name. Skip the .sys or 
.SyS
-* prefix from the symbol name and the sys prefix from the system call 
name and
-* just match the rest. This is only needed on ppc64 since symbol names 
on
-* 32bit do not start with a period so the generic function will work.
-*/
-   return !strcmp(sym + 4, name + 3);
-}
-#endif
-#endif /* CONFIG_FTRACE_SYSCALLS && !__ASSEMBLY__ */
-
 #endif /* _ASM_POWERPC_FTRACE */
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index b2b6efc083a4..91a7315dbe43 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -31,8 +31,11 @@ extern struct syscall_metadata *__stop_syscalls_metadata[];
 
 static struct syscall_metadata **syscalls_metadata;
 
-#ifndef ARCH_HAS_SYSCALL_MATCH_SYM_NAME
-static inline bool arch_syscall_match_sym_name(const char *sym, const char 
*name)
+#ifndef ARCH_SYM_NAME_SKIP_CHARS
+#define ARCH_SYM_NAME_SKIP_CHARS 0
+#endif
+
+static inline bool syscall_match_sym_name(const char *sym, const char *name)
 {
/*
 * Only compare after the "sys" prefix. Archs that use
@@ -40,9 +43,8 @@
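
The idea behind ARCH_SYM_NAME_SKIP_CHARS can be tried out in userspace. This is a sketch only: the body of the generic function is cut off in the excerpt above, so the exact comparison is an assumption, and `match_sym_name()` with its `skip_chars` parameter is a hypothetical helper that stands in for the compile-time macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Compare a symbol name with a syscall name. skip_chars plays the role
 * of ARCH_SYM_NAME_SKIP_CHARS (1 for a leading '.', 0 otherwise); it is
 * a parameter here only so both cases can run in one program.
 */
static bool match_sym_name(const char *sym, const char *name, size_t skip_chars)
{
	/* Refuse strings too short to hold the skipped prefix. */
	if (strlen(sym) < skip_chars + 3 || strlen(name) < 3)
		return false;

	/* Skip the arch lead-in, then the 3-char "sys"/"SyS" prefix. */
	return strcmp(sym + skip_chars + 3, name + 3) == 0;
}
```

With a skip count of 1, ".SyS_read" matches "sys_read"; with a skip count of 0 the same pair does not match, which is the case the old ppc64-only override existed to handle.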

Re: [PATCH v4 08/10] ARM: dts: sun8i: Enable sun8i-emac on the Orange Pi 2

2016-10-12 Thread Maxime Ripard

On Wed, Oct 12, 2016 at 10:55:59AM +0200, Jean-Francois Moine wrote:
> On Fri,  7 Oct 2016 10:25:55 +0200
> Corentin Labbe  wrote:
> 
> > The sun8i-emac hardware is present on the Orange PI 2.
> > It uses the internal PHY.
> > 
> > This patch create the needed emac node.
> > 
> > Signed-off-by: Corentin Labbe 
> > ---
> >  arch/arm/boot/dts/sun8i-h3-orangepi-2.dts | 8 
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts 
> > b/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts
> > index f93f5d1..5608eb4 100644
> > --- a/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts
> > +++ b/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts
> > @@ -54,6 +54,7 @@
> >  
> > aliases {
> > serial0 = &uart0;
> > +   ethernet0 = &emac;
> 
> As there is no 'of_alias_get_id' in the driver, this alias is
> useless.

Not really, this is used by U-Boot to set the mac address.

Maxime

-- 
Maxime Ripard, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com



RE: [PATCH]"drm: change DRM_MIPI_DSI module type from "bool" to "tristate".

2016-10-12 Thread Sun, Jing A

I think "installing a kernel with my changes for both drm and i915" takes more
time and effort than "only updating the DRM/i915 modules without
rebuilding the whole kernel". In some cases, the latter is beneficial.

Also, reloadability is always a good thing to have, and I truly hope
Hajda/Iwai's patches will be accepted and merged.
There is no downside to it, after all.

Regards,
Sun, Jing

-Original Message-
From: Daniel Vetter [mailto:daniel.vet...@ffwll.ch] On Behalf Of Daniel Vetter
Sent: Wednesday, October 12, 2016 2:52 PM
To: Sun, Jing A
Cc: Andrzej Hajda; Jani Nikula; Takashi Iwai; Emil Velikov; 
linux-kernel@vger.kernel.org; dri-de...@lists.freedesktop.org; Vetter, Daniel; 
Thierry Reding
Subject: Re: [PATCH]"drm: change DRM_MIPI_DSI module type from "bool" to 
"tristate".

On Wed, Oct 12, 2016 at 03:08:24AM +, Sun, Jing A wrote:
> Interestingly, I am able to reload i915 and drm. Our CI has tests for
> i915 unload/reload, but does not check drm. In any case the config 
> problem should not impact the reloadability of i915.
> ==
> Sorry that I didn't make myself clear. In order to replace the default
> i915 module with an updated one, the related DRM modules also need to 
> be updated to match the updated i915, hence the restriction.

Just to avoid tears in the future: If you plan to ship this in product, you 
won't ship.

And for debugging, just install a kernel with your changes for both drm and 
i915.

In short, your use-case isn't really valid (but we could still make the dsi 
code modular if people feel like).
-Daniel

> 
> Regards,
> Sun, Jing
> 
> 
> -Original Message-
> From: Andrzej Hajda [mailto:a.ha...@samsung.com]
> Sent: Tuesday, October 11, 2016 5:53 PM
> To: Jani Nikula; Sun, Jing A; Takashi Iwai
> Cc: airl...@linux.ie; Vetter, Daniel; linux-kernel@vger.kernel.org; 
> dri-de...@lists.freedesktop.org; Thierry Reding; Emil Velikov
> Subject: Re: [PATCH]"drm: change DRM_MIPI_DSI module type from "bool" to 
> "tristate".
> 
> On 11.10.2016 11:33, Jani Nikula wrote:
> > On Tue, 11 Oct 2016, "Sun, Jing A"  wrote:
> >> It's needed that DRM Driver module could be removed and reloaded 
> >> after kernel booting on the projects that I have been working on, 
> >> and I hope such module type change could be accepted. Looks like 
> >> Iwai has similar change request as well. Would you please review it 
> >> and let us know if any concerns?
> > Looking at the Kconfig, selecting CONFIG_DRM_MIPI_DSI is against the 
> > recommendations of Documentation/kbuild/kconfig-language.txt:
> >
> > select should be used with care. select will force
> > a symbol to a value without visiting the dependencies.
> > By abusing select you are able to select a symbol FOO even
> > if FOO depends on BAR that is not set.
> > In general use select only for non-visible symbols
> > (no prompts anywhere) and for symbols with no dependencies.
> > That will limit the usefulness but on the other hand avoid
> > the illegal configurations all over.
> 
> All existing drivers which selects DRM_MIPI_DSI also depends on DRM.
> So the dependency is always true. I am not sure if it could not change 
> in the future, but in such case mipi_dsi bus should be completely 
> detached from DRM framework, I hope we have not such case yet :)
> 
> >
> > Indeed, you may end up with CONFIG_DRM_MIPI_DSI=y and CONFIG_DRM=m, 
> > which violates DRM_MIPI_DSI dependency on CONFIG_DRM. This is broken 
> > and should be fixed. The suggested patch does *not* fix this issue.
> 
> At the moment it should not be possible.
> 
> Regards
> Andrzej
> 

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

Re: [mac80211] BUG_ON with current -git (4.8.0-11417-g24532f7)

2016-10-12 Thread Johannes Berg

Hi,

Sorry - I meant to look into this yesterday but forgot.

> Andy, can this be related to CONFIG_VMAP_STACK?

I think it is.

> > current -git kills my system. 

Can you elaborate on how exactly it kills your system?

> > adding
> > 
> > if (!virt_addr_valid(&aad[2])) {
> > WARN_ON(1);
> > return -EINVAL;
> > }

That's pretty obviously false with VMAP_STACK, since the caller
(ieee80211_crypto_ccmp_decrypt) puts the aad on the stack. b_0 is also
on the stack, but maybe that doesn't matter.

Herbert, do you know what could cause this, and how we should fix it?

We can't really afford to do an allocation here, and we don't have
space in the skb (not even in skb->cb at that point), so if we really
have no way to continue using the stack we'd ... not sure, use a per-
CPU buffer perhaps.
We need 32 bytes for aad and 16 bytes for b_0, if that also can't be on
the stack any more.

johannes
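
As a rough illustration of the "not on the stack" option mentioned above, the two buffers could live in one fixed scratch structure. This is a userspace sketch under loud assumptions: thread-local storage stands in for a per-CPU buffer, and `struct ccmp_scratch` and `ccmp_scratch_get()` are hypothetical names, not mac80211 code. Only the sizes (32 bytes for aad, 16 bytes for b_0) come from the message.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CCMP_AAD_LEN 32 /* from the message: 32 bytes for aad */
#define CCMP_B0_LEN  16 /* from the message: 16 bytes for b_0 */

/* Hypothetical container keeping both buffers off the stack. */
struct ccmp_scratch {
	uint8_t aad[CCMP_AAD_LEN];
	uint8_t b_0[CCMP_B0_LEN];
};

/* Assumption: thread-local storage models one buffer per CPU. */
static _Thread_local struct ccmp_scratch ccmp_scratch_buf;

/* Hand out the scratch area, cleared so no stale data leaks between uses. */
static struct ccmp_scratch *ccmp_scratch_get(void)
{
	memset(&ccmp_scratch_buf, 0, sizeof(ccmp_scratch_buf));
	return &ccmp_scratch_buf;
}
```

A real per-CPU variant would also have to make sure the user cannot be moved to another CPU or re-entered while the buffer is in use; the sketch does not attempt to model that.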

Re: [PATCH v4 08/10] ARM: dts: sun8i: Enable sun8i-emac on the Orange Pi 2

2016-10-12 Thread Jean-Francois Moine

On Fri,  7 Oct 2016 10:25:55 +0200
Corentin Labbe  wrote:

> The sun8i-emac hardware is present on the Orange PI 2.
> It uses the internal PHY.
> 
> This patch create the needed emac node.
> 
> Signed-off-by: Corentin Labbe 
> ---
>  arch/arm/boot/dts/sun8i-h3-orangepi-2.dts | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts 
> b/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts
> index f93f5d1..5608eb4 100644
> --- a/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts
> +++ b/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts
> @@ -54,6 +54,7 @@
>  
>   aliases {
>   serial0 = &uart0;
> + ethernet0 = &emac;

As there is no 'of_alias_get_id' in the driver, this alias is useless.

>   };
>  
>   chosen {
> @@ -184,3 +185,10 @@
>   usb1_vbus-supply = <&reg_usb1_vbus>;
>   status = "okay";
>  };
> +
> +&emac {
> + phy-handle = <&int_mii_phy>;
> + phy-mode = "mii";
> + allwinner,leds-active-low;
> + status = "okay";
> +};
> -- 
> 2.7.3
> 
> 


-- 
Ken ar c'hentañ | ** Breizh ha Linux atav! **
Jef |   http://moinejf.free.fr/

Re: [PATCH] kexec: Export memory sections virtual addresses to vmcoreinfo

2016-10-12 Thread Pratyush Anand




On Wednesday 12 October 2016 05:56 AM, Baoquan He wrote:

> PAGE_OFFSET can be obtained via vaddr - paddr from the ELF PT_LOADs, so only
> VMALLOC_BASE and VMEMMAP_BASE are necessary..

Well, yes, I was wrong. I wrongly thought of the kernel text virtual address
when I wrote the reply.


So, if you can get PAGE_OFFSET then, probably you do not need to know 
anything else.


I think, we can simplify makedumpfile code, where we do not need to 
depend on VMALLOC_START or VMEMMAP_START etc.


"If we know PAGE_OFFSET, we can read from swapper space. If we can read 
from swapper space, then we can know PA of any kernel VA, whether it is 
VMALLOC, or vmemmap or module or kernel text area."



In fact, I have cleanup patches for ARM64 [1], which take the above approach
and get rid of the need for VMALLOC_START or VMEMMAP_START etc. I will be
sending them upstream soon.


Probably, x86 can take a similar approach.

~Pratyush

[1] 
https://github.com/pratyushanand/makedumpfile/blob/arm64_devel/arch/arm64.c#L228
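
The "vaddr - paddr" derivation discussed above amounts to one subtraction per PT_LOAD entry. A minimal sketch, assuming a linear mapping where virtual = physical + PAGE_OFFSET and a segment that lies in that mapping (not the kernel text mapping); `struct load_segment` and both helpers are hypothetical and only mirror the two program header fields involved.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for the two PT_LOAD fields that matter here. */
struct load_segment {
	uint64_t p_vaddr; /* virtual address of the segment */
	uint64_t p_paddr; /* physical address of the segment */
};

/* Derive the linear-map offset from one segment of the linear mapping. */
static uint64_t page_offset_from_load(const struct load_segment *seg)
{
	return seg->p_vaddr - seg->p_paddr;
}

/* Translate a linear-map virtual address back to a physical address. */
static uint64_t linear_virt_to_phys(uint64_t vaddr, uint64_t page_offset)
{
	return vaddr - page_offset;
}
```

The example addresses used to exercise this are made up; on an architecture where the linear map also involves a physical base offset, the subtraction gives the difference of the two rather than PAGE_OFFSET itself.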

Re: [PATCH] mm: page_alloc: Use KERN_CONT where appropriate

2016-10-12 Thread Michal Hocko

On Tue 11-10-16 19:24:55, Joe Perches wrote:
> Recent changes to printk require KERN_CONT uses to continue logging
> messages.  So add KERN_CONT where necessary.

I was really wondering what happened when Aaron reported an allocation
failure http://lkml.kernel.org/r/20161012065423.ga16...@aaronlu.sh.intel.com
See the attached log for the current Linus' tree

Fixes: 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation 
lines")
> Signed-off-by: Joe Perches 

Acked-by: Michal Hocko 

I believe we can simplify the code a bit as well. What do you think
about the following on top?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6f8c356140a0..7e1b74ee79cb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4078,10 +4078,12 @@ unsigned long nr_free_pagecache_pages(void)
return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE));
 }
 
-static inline void show_node(struct zone *zone)
+static inline void show_zone_node(struct zone *zone)
 {
if (IS_ENABLED(CONFIG_NUMA))
-   printk("Node %d ", zone_to_nid(zone));
+   printk("Node %d %s", zone_to_nid(zone), zone->name);
+   else
+   printk("%s: ", zone->name);
 }
 
 long si_mem_available(void)
@@ -4329,9 +4331,8 @@ void show_free_areas(unsigned int filter)
for_each_online_cpu(cpu)
free_pcp += per_cpu_ptr(zone->pageset, cpu)->pcp.count;
 
-   show_node(zone);
+   show_zone_node(zone);
printk(KERN_CONT
-   "%s"
" free:%lukB"
" min:%lukB"
" low:%lukB"
@@ -4354,7 +4355,6 @@ void show_free_areas(unsigned int filter)
" local_pcp:%ukB"
" free_cma:%lukB"
"\n",
-   zone->name,
K(zone_page_state(zone, NR_FREE_PAGES)),
K(min_wmark_pages(zone)),
K(low_wmark_pages(zone)),
@@ -4379,7 +4379,6 @@ void show_free_areas(unsigned int filter)
printk("lowmem_reserve[]:");
for (i = 0; i < MAX_NR_ZONES; i++)
printk(KERN_CONT " %ld", zone->lowmem_reserve[i]);
-   printk(KERN_CONT "\n");
}
 
for_each_populated_zone(zone) {
@@ -4389,8 +4388,7 @@ void show_free_areas(unsigned int filter)
 
if (skip_free_areas_node(filter, zone_to_nid(zone)))
continue;
-   show_node(zone);
-   printk(KERN_CONT "%s: ", zone->name);
+   show_zone_node(zone);
 
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
-- 
Michal Hocko
SUSE Labs

Re: [RFC 0/6] Module for tracking/accounting shared memory buffers

2016-10-12 Thread Christian König


On 12.10.2016 at 01:50, Ruchi Kandoi wrote:

> This patchstack adds memtrack hooks into dma-buf and ion.  If there's upstream
> interest in memtrack, it can be extended to other memory allocators as well,
> such as GEM implementations.

We have run into similar problems before. Because of this I already 
proposed a solution for this quite a while ago, but never pushed on 
upstreaming this since it was only done for a special use case.


Instead of keeping track of how much memory a process has bound (which 
is very fragile) my solution  only added some more debugging info on a 
per fd basis (e.g. how much memory is bound to this fd).


This information was then used by the OOM killer (for example) to make a 
better decision on which process to reap.


Shouldn't be too hard to expose this through debugfs or maybe a new fcntl 
to userspace for debugging.


I haven't looked at the code in detail, but messing with the per process 
memory accounting like you did in this proposal is clearly not a good 
idea if you ask me.


Regards,
Christian.

Re: [PATCH 3/3] mtd: s3c2410: parse the device configuration from OF node

2016-10-12 Thread Boris Brezillon

Hi Sergio,

On Wed,  5 Oct 2016 20:46:57 -0300
Sergio Prado  wrote:

> Allows configuring Samsung's s3c2410 memory controller using a
> devicetree.
> 
> Signed-off-by: Sergio Prado 
> ---
>  drivers/mtd/nand/s3c2410.c | 171 
> ++---
>  include/linux/platform_data/mtd-nand-s3c2410.h |   1 +
>  2 files changed, 156 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/mtd/nand/s3c2410.c b/drivers/mtd/nand/s3c2410.c
> index 174ac9dc4265..352cf2656bc8 100644
> --- a/drivers/mtd/nand/s3c2410.c
> +++ b/drivers/mtd/nand/s3c2410.c

[...]

> +
> +static int s3c2410_nand_init_timings(struct s3c2410_nand_info *info,
> +  struct nand_chip *chip)
> +{
> + struct s3c2410_platform_nand *pdata = info->platform;
> + const struct nand_sdr_timings *t;
> + int tacls, mode;
> +
> + mode = onfi_get_async_timing_mode(chip);
> + if (mode == ONFI_TIMING_MODE_UNKNOWN)
> + mode = chip->onfi_timing_mode_default;
> +
> + t = onfi_async_timing_mode_to_sdr_timings(mode);
> + if (IS_ERR(t))
> + return PTR_ERR(t);

We recently introduced a method to automate timing selection and
configuration [1]. Can you switch to this approach (the changes are in
the nand/next branch [2] and should appear in 4.9-rc1)?

Thanks,

Boris

[1]https://www.spinics.net/lists/arm-kernel/msg532007.html 
[2]https://github.com/linux-nand/linux/tree/nand/next

[GIT PULL] fbdev changes for 4.9

2016-10-12 Thread Tomi Valkeinen

Hi Linus,

Please pull fbdev changes for 4.9.

 Tomi

The following changes since commit 29b4817d4018df78086157ea3a55c1d9424a7cfc:

  Linux 4.8-rc1 (2016-08-07 18:18:00 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux.git tags/fbdev-4.9

for you to fetch changes up to c456a2f30de53e77a2eb8eeb4202d742516aa76b:

  video: smscufx: remove unused variable (2016-09-27 11:47:37 +0300)


fbdev changes for 4.9

Main changes:

- amba-clcd: DT backlight support, Nomadik support, Versatile improvements,
  fixes
- efifb: fix fbcon RGB565 palette
- exynos: remove unused DSI driver


Arnd Bergmann (3):
  video: ARM CLCD: fix endpoint lookup logic
  video: ARM CLCD: export symbols for driver module
  video: fbdev: mb862xx: remove unused variable

Bhaktipriya Shridhar (2):
  omapfb: panel-dsi-cm: Remove deprecated create_singlethread_workqueue
  fbdev: Remove deprecated create_singlethread_workqueue

Chen-Yu Tsai (1):
  simplefb: Disable and release clocks and regulators in destroy callback

Colin Ian King (2):
  video: fbdev: add missing \n at end of printk error message
  video: fbdev: i810: add in missing white space in error message text

Dan Carpenter (2):
  video: fbdev: pxafb: potential NULL dereference on error
  fb: adv7393: off by one in probe function

Javier Martinez Canillas (1):
  fb: adv7393: Use IS_ENABLED() instead of checking for built-in or module

Julia Lawall (2):
  matroxfb: constify local structures
  video: fbdev: constify fb_fix_screeninfo and fb_var_screeninfo structures

Krzysztof Kozlowski (3):
  video: s3c2410fb: Register cpufreq notifier only on S3C24xx
  video: fbdev: exynos: Remove old non-working MIPI driver
  ARM: exynos_defconfig: Remove old non-working MIPI driver

LABBE Corentin (2):
  fbdev: ssd1307fb: constify the device_info pointer
  fbdev: ssd1307fb: fix a possible NULL dereference

Linus Walleij (7):
  video: ARM CLCD: backlight support for OF
  video: ARM CLCD: support DT signal inversion flags
  video: ARM CLCD: support pads connected in reverse order
  video: ARM CLCD: support Nomadik variant
  video: ARM CLCD: add special board and panel hooks for Nomadik
  video: ARM CLCD: add special panel hook for Versatiles
  video: ARM CLCD: fix up Integrator support

Marek Vasut (1):
  video: mxsfb: Fix framebuffer corruption on mx6sx

Mark Brown (1):
  omapfb: Fix regulator API abuse in dss.c and hdmi4/5.c

Max Staudt (1):
  fbdev/efifb: Fix 16 color palette entry calculation

Nicholas Mc Guire (1):
  omapfb/dss: wait_for_completion_interruptible_timeout expects long

Oleg Drokin (1):
  mx3fb: Fix print format string

Sudip Mukherjee (3):
  video: fbdev: intelfb: remove impossible condition
  matroxfb: fix size of memcpy
  video: smscufx: remove unused variable

Tomi Valkeinen (1):
  MAINTAINERS: update fbdev entries

Vladimir Murzin (3):
  fbdev: vfb: add description to module parameters
  fbdev: vfb: add option for video mode
  fbdev: vfb: simplify memory management

Wei Yongjun (3):
  video: ARM CLCD: fix return value check in versatile_clcd_init_panel()
  video: fbdev: pxafb: add missing of_node_put() in of_get_pxafb_mode_info()
  omapfb: fix return value check in dsi_bind()

Wolfram Sang (1):
  video: fbdev: mb862xx: mb862xx-i2c: don't print error when adding adapter 
fails

Yongji Xie (1):
  video: fbdev: offb: Call pci_enable_device() before using the PCI VGA 
device

 MAINTAINERS|  12 -
 arch/arm/configs/exynos_defconfig  |   2 -
 drivers/video/fbdev/Kconfig|   7 +-
 drivers/video/fbdev/Makefile   |   3 +-
 drivers/video/fbdev/amba-clcd-nomadik.c| 259 ++
 drivers/video/fbdev/amba-clcd-nomadik.h|  24 +
 drivers/video/fbdev/amba-clcd-versatile.c  | 395 +
 drivers/video/fbdev/amba-clcd-versatile.h  |  17 +
 drivers/video/fbdev/amba-clcd.c| 190 -
 drivers/video/fbdev/arcfb.c|   4 +-
 drivers/video/fbdev/asiliantfb.c   |   4 +-
 drivers/video/fbdev/aty/aty128fb.c |   6 +-
 drivers/video/fbdev/aty/atyfb_base.c   |   2 +-
 drivers/video/fbdev/aty/radeon_monitor.c   |   2 +-
 drivers/video/fbdev/bfin_adv7393fb.c   |   5 +-
 drivers/video/fbdev/efifb.c|   6 +-
 drivers/video/fbdev/exynos/Kconfig |  32 -
 drivers/video/fbdev/exynos/Makefile|   9 -
 drivers/video/fbdev/exynos/exynos_mipi_dsi.c   | 574 -
 .../video/fbdev/exynos/exynos_mipi_dsi_common.c| 880 
 .../video/fbdev/exynos/exynos_

Re: [patch] drm/amdgpu: potential NULL dereference in debugfs code

2016-10-12 Thread Christian König


On 12.10.2016 at 08:17, Dan Carpenter wrote:

> debugfs_create_file() returns NULL on error, it only returns error
> pointers if debugfs isn't enabled in the config and we checked for that
> earlier so it can't happen.
>
> Fixes: 4f4824b55650 ('drm/amd/amdgpu: Convert ring debugfs entries to binary')
> Signed-off-by: Dan Carpenter 

Reviewed-by: Christian König .



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 85aeb0a..8d16eaf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -371,8 +371,8 @@ static int amdgpu_debugfs_ring_init(struct amdgpu_device *adev,
 	ent = debugfs_create_file(name,
 				  S_IFREG | S_IRUGO, root,
 				  ring, &amdgpu_debugfs_ring_fops);
-	if (IS_ERR(ent))
-		return PTR_ERR(ent);
+	if (!ent)
+		return -ENOMEM;
 
 	i_size_write(ent->d_inode, ring->ring_size + 12);
 	ring->ent = ent;
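
The distinction the patch relies on can be illustrated outside the kernel.
The following is a minimal userspace sketch, not kernel code: ERR_PTR(),
PTR_ERR() and IS_ERR() are simplified re-implementations of the kernel
helpers, and check_with_is_err() / check_with_null() are hypothetical names
used only for illustration. It shows that an IS_ERR() test lets a NULL
return slip through, while a NULL test catches it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static void *ERR_PTR(long error)
{
	/* Only small negative errno values can be encoded as pointers. */
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

static long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static int IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO addresses; NULL does not. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical caller that assumes failures are reported as error pointers. */
static int check_with_is_err(const void *ent)
{
	if (IS_ERR(ent))
		return (int)PTR_ERR(ent);
	return 0;	/* a NULL "ent" is wrongly treated as success */
}

/* Hypothetical caller that assumes failures are reported as NULL. */
static int check_with_null(const void *ent)
{
	if (!ent)
		return -ENOMEM;
	return 0;
}
```

With these definitions, check_with_is_err(NULL) reports success, so a later
dereference of the entry would fault, whereas check_with_null(NULL) returns
-ENOMEM.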

Re: [PATCH v2] timers: Fix usleep_range() in the context of wake_up_process()

2016-10-12 Thread Mark Brown

On Tue, Oct 11, 2016 at 09:33:15AM -0700, Doug Anderson wrote:
> On Tue, Oct 11, 2016 at 12:14 AM, Thomas Gleixner  wrote:
> > On Mon, 10 Oct 2016, Douglas Anderson wrote:

> >> Users of usleep_range() expect that it will _never_ return in less time
> >> than the minimum passed parameter.  However, nothing in any of the code
> >> ensures this.  Specifically:

> > There is no such guarantee for that interface and never has been, so how
> > did you make sure that none of the existing users is relying on this?

> > You can't just declare that all of the users expect that and
> > be done with it.

> You're right that I can't guarantee that no callers are relying on the
> existing behavior of a wake_up_process() causing usleep_range() to
> return early.  I would say, however, that all callers I've seen are
> absolutely relying on the min delay being enforced and I've never
> personally seen a caller relying on being woken up from
> usleep_range().  All the users relying on the min delay being enforced

Indeed.  It's *highly* surprising for any sleep interface to undershoot
on delays, the usual thing is that they might delay for longer.  If the
function doesn't actually reliably delay for the minimum time then I'd
expect that a large proportion of those conversions and other recent
code that's been added is buggy.

> one of two functions: usleep_atlest() and usleep_wakeable().  As
> argued below I think that usleep_range() name implies that it will at
> least sleep the minimum so I would really like to avoid keeping the
> name usleep_range() and also keeping the existing behavior.

I tend to agree with everything Doug is saying in terms of API
expectations.
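
The expectation discussed above (a sleep may overshoot but should never
undershoot) can be sketched in userspace. This is only an illustration of
the idea, under the assumption of a POSIX system with CLOCK_MONOTONIC; it is
not the kernel patch. sleep_at_least_us() and now_ns() are hypothetical
names. The loop re-checks an absolute deadline, so an early wakeup simply
leads to another sleep for the remaining time.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Sleep for at least min_us microseconds and return the time that
 * actually elapsed, in nanoseconds. An early return from nanosleep()
 * (for example because of a signal) does not shorten the delay,
 * because the loop condition is an absolute deadline.
 */
static int64_t sleep_at_least_us(int64_t min_us)
{
	assert(min_us >= 0);

	const int64_t start = now_ns();
	const int64_t deadline = start + min_us * 1000;
	int64_t now;

	while ((now = now_ns()) < deadline) {
		const int64_t left = deadline - now;
		struct timespec req = {
			.tv_sec = (time_t)(left / 1000000000LL),
			.tv_nsec = (long)(left % 1000000000LL),
		};

		nanosleep(&req, NULL);	/* may wake early; loop re-checks */
	}
	return now - start;
}
```

A caller can rely on sleep_at_least_us(2000) returning a value of at least
2000000, whatever wakeups happen in between.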



Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area

2016-10-12 Thread Michal Hocko

On Wed 12-10-16 15:24:33, zijun_hu wrote:
> On 10/12/2016 02:53 PM, Michal Hocko wrote:
> > On Wed 12-10-16 08:28:17, zijun_hu wrote:
> >> On 2016/10/12 1:22, Michal Hocko wrote:
> >>> On Tue 11-10-16 21:24:50, zijun_hu wrote:
> >>>> From: zijun_hu 
> >>>> 
> >>>> the LSB of a chunk->map element is used as the free/in-use flag of an area
> >>>> and the other bits for the offset; the necessary and sufficient condition for
> >>>> this usage is that both the size and the alignment of an area are even numbers.
> >>>> however, pcpu_alloc() doesn't explicitly force its @align parameter to be even,
> >>>> so an odd @align may cause a series of errors; see the example below
> >>>> for a concrete description.
> >>>
> >>> Is or was there any user who would use an alignment other than even (or power of 2)?
> >>> If not, is this really worth handling?
> >>>
> >>
> >> it seems that only a power-of-2 alignment other than 1 can make sure it works well;
> >> that is a strict limit, and maybe this stricter limit should be checked
> > 
> > I fail to see how any other alignment would actually make any sense
> > whatsoever. Look, I am not a maintainer of this code but adding new
> > code to catch something that doesn't make any sense sounds dubious at
> > best to me.
> > 
> > I could understand this patch if you see a problem and want to prevent
> > it from repeating, but doing these kinds of changes just in case sounds
> > like a bad idea.
> > 
> 
> thanks for your reply
> 
> should we have a generic discussion about whether patches like this, which
> handle many boundary or rare conditions, are necessary?

In general, I believe that kernel internal interfaces which have no
userspace exposure shouldn't be cluttered with sanity checks.

> i found the following code segments in mm/vmalloc.c
> static struct vmap_area *alloc_vmap_area(unsigned long size,
> unsigned long align,
> unsigned long vstart, unsigned long vend,
> int node, gfp_t gfp_mask)
> {
> ...
> 
> BUG_ON(!size);
> BUG_ON(offset_in_page(size));
> BUG_ON(!is_power_of_2(align));

See a recent Linus rant about BUG_ONs. These BUG_ONs are quite old and
from a quick look they are even unnecessary. So rather than adding more
of those, I think removing those that are not needed is much more
preferred.
 
> should we adopt the declarations below as conventions?
> 1) when we say 'alignment', it means aligning to a power-of-2 value;
>    for example, aligning value @v to @b implies that @b is a power of 2,
>    so aligning 10 to 4 gives 12

alignment other than power-of-two makes only very limited sense to me.

> 2) when we say 'round value @v up/down to boundary @b', it means the
>    result is a multiple of @b; it doesn't require @b to be a power of 2

-- 
Michal Hocko
SUSE Labs
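
The two conventions proposed in the quoted text correspond to two different
computations. The sketch below is a userspace illustration with hypothetical
names (align_up(), round_up_to_multiple(), mask_trick_unchecked()); it is not
taken from mm/percpu.c. Mask-based alignment is only valid for power-of-2
boundaries, while division-based rounding works for any non-zero boundary.

```c
#include <assert.h>
#include <stdbool.h>

/* True if n is a non-zero power of two. */
static bool is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Mask-based alignment: only correct when a is a power of two. */
static unsigned long align_up(unsigned long v, unsigned long a)
{
	assert(is_power_of_2(a));
	return (v + a - 1) & ~(a - 1);
}

/* Division-based rounding: correct for any non-zero boundary b. */
static unsigned long round_up_to_multiple(unsigned long v, unsigned long b)
{
	assert(b != 0);
	return ((v + b - 1) / b) * b;
}

/*
 * The same mask trick without the power-of-two check, kept only to
 * show what goes wrong when it is applied to an arbitrary boundary.
 */
static unsigned long mask_trick_unchecked(unsigned long v, unsigned long a)
{
	return (v + a - 1) & ~(a - 1);
}
```

Aligning 10 to 4 gives 12 and rounding 10 up to a multiple of 6 also gives
12, but applying the mask trick to a boundary of 6 returns 10, which is not
a multiple of 6.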

Re: [RESEND RFC PATCH v2 1/1] mm/vmalloc.c: simplify /proc/vmallocinfo implementation

2016-10-12 Thread Michal Hocko

On Wed 12-10-16 16:23:01, zijun_hu wrote:
> From: zijun_hu 
> 
> many seq_file helpers exist to simplify the implementation of virtual files,
> especially /proc nodes. however, although helpers for iterating over a
> list_head are available, they aren't currently used to implement
> /proc/vmallocinfo.
> 
> simplify the /proc/vmallocinfo implementation by using the existing seq_file helpers

the simplification is nice and code duplication removal useful

> Signed-off-by: zijun_hu 

Acked-by: Michal Hocko 

Thanks!

> ---
>  Changes in v2:
>   - the redundant type cast is removed as advised by rient...@google.com
>   - commit messages are updated
> 
>  mm/vmalloc.c | 27 +--
>  1 file changed, 5 insertions(+), 22 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index f2481cb4e6b2..e73948afac70 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2574,32 +2574,13 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms)
>  static void *s_start(struct seq_file *m, loff_t *pos)
>   __acquires(&vmap_area_lock)
>  {
> - loff_t n = *pos;
> - struct vmap_area *va;
> -
>   spin_lock(&vmap_area_lock);
> - va = list_first_entry(&vmap_area_list, typeof(*va), list);
> - while (n > 0 && &va->list != &vmap_area_list) {
> - n--;
> - va = list_next_entry(va, list);
> - }
> - if (!n && &va->list != &vmap_area_list)
> - return va;
> -
> - return NULL;
> -
> + return seq_list_start(&vmap_area_list, *pos);
>  }
>  
>  static void *s_next(struct seq_file *m, void *p, loff_t *pos)
>  {
> - struct vmap_area *va = p, *next;
> -
> - ++*pos;
> - next = list_next_entry(va, list);
> - if (&next->list != &vmap_area_list)
> - return next;
> -
> - return NULL;
> + return seq_list_next(p, &vmap_area_list, pos);
>  }
>  
>  static void s_stop(struct seq_file *m, void *p)
> @@ -2634,9 +2615,11 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v)
>  
>  static int s_show(struct seq_file *m, void *p)
>  {
> - struct vmap_area *va = p;
> + struct vmap_area *va;
>   struct vm_struct *v;
>  
> + va = list_entry(p, struct vmap_area, list);
> +
>   /*
>* s_show can encounter race with remove_vm_area, !VM_VM_AREA on
>* behalf of vmap area is being tear down or vm_map_ram allocation.
> -- 
> 1.9.1
> 

-- 
Michal Hocko
SUSE Labs
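
To show what the list helpers used in the patch do, here is a userspace
sketch. The *_sketch functions are simplified stand-ins written for this
illustration (the kernel versions take a loff_t position), and struct area
and walk_from() are hypothetical. Note that the iterator hands back
struct list_head pointers, which is why the patch converts them back to
the containing structure in s_show().

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

static void list_add_tail_sketch(struct list_head *new, struct list_head *head)
{
	new->prev = head->prev;
	new->next = head;
	head->prev->next = new;
	head->prev = new;
}

/* Recover the structure that embeds a given list_head member. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Return the pos-th node of the list, or NULL if the list is shorter. */
static struct list_head *seq_list_start_sketch(struct list_head *head,
					       long long pos)
{
	struct list_head *lh;

	assert(pos >= 0);
	for (lh = head->next; lh != head; lh = lh->next)
		if (pos-- == 0)
			return lh;
	return NULL;
}

/* Advance to the next node and bump the position; NULL at the end. */
static struct list_head *seq_list_next_sketch(void *v, struct list_head *head,
					      long long *ppos)
{
	struct list_head *lh = ((struct list_head *)v)->next;

	++*ppos;
	return lh == head ? NULL : lh;
}

/* Hypothetical element type standing in for struct vmap_area. */
struct area {
	unsigned long addr;
	struct list_head list;
};

/* Build a list of n areas (addr = 1..n), walk it from pos, sum the addrs. */
static unsigned long walk_from(int n, long long pos)
{
	struct area areas[8];
	struct list_head head, *p;
	unsigned long sum = 0;
	int i;

	assert(n >= 0 && n <= 8);
	list_init(&head);
	for (i = 0; i < n; i++) {
		areas[i].addr = (unsigned long)i + 1;
		list_add_tail_sketch(&areas[i].list, &head);
	}
	for (p = seq_list_start_sketch(&head, pos); p;
	     p = seq_list_next_sketch(p, &head, &pos))
		sum += container_of_sketch(p, struct area, list)->addr;
	return sum;
}
```

Starting at position 2 of a four-element list visits only the last two
elements, and starting at or past the end visits nothing.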

MPOL_BIND on memory only nodes

2016-10-12 Thread Anshuman Khandual

Hi,

We have the following function policy_zonelist() which selects a zonelist
during various allocation paths. With this, general user space allocations
(which, IIUC, might not have __GFP_THISNODE) fail while trying to get memory from
a memory-only node without CPUs, as the application runs somewhere else
and that node is not part of the nodemask. Why do we insist on __GFP_THISNODE?
On any memory-only node it's likely that the local node "nd" might not be
part of the nodemask; hence, does it make sense to pick up the first node of
the nodemask in those cases without looking for __GFP_THISNODE?

/* Return a zonelist indicated by gfp for node representing a mempolicy */
static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
	int nd)
{
	switch (policy->mode) {
	case MPOL_PREFERRED:
		if (!(policy->flags & MPOL_F_LOCAL))
			nd = policy->v.preferred_node;
		break;
	case MPOL_BIND:
		/*
		 * Normally, MPOL_BIND allocations are node-local within the
		 * allowed nodemask.  However, if __GFP_THISNODE is set and the
		 * current node isn't part of the mask, we use the zonelist for
		 * the first node in the mask instead.
		 */
		if (unlikely(gfp & __GFP_THISNODE) &&
				unlikely(!node_isset(nd, policy->v.nodes)))
			nd = first_node(policy->v.nodes);
		break;
	default:
		BUG();
	}
	return node_zonelist(nd, gfp);
}

- Anshuman
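
The node selection in the MPOL_BIND branch above can be reduced to a small
userspace sketch. Everything here is a simplification made for illustration:
the nodemask is a plain bitmask, and bind_pick_node(), node_isset_sketch()
and first_node_sketch() are hypothetical names. It only models which node's
zonelist is chosen, not how the allocator later applies the nodemask.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit n of the mask set means node n is allowed (up to 32 nodes here). */
static bool node_isset_sketch(int nd, unsigned long mask)
{
	assert(nd >= 0 && nd < 32);
	return ((mask >> nd) & 1UL) != 0;
}

/* Lowest-numbered node present in a non-empty mask. */
static int first_node_sketch(unsigned long mask)
{
	int n = 0;

	assert(mask != 0);
	while (!((mask >> n) & 1UL))
		n++;
	return n;
}

/*
 * Mirror of the MPOL_BIND branch: the local node nd is kept unless the
 * caller asked for "this node only" and nd is outside the allowed mask.
 */
static int bind_pick_node(int nd, unsigned long mask, bool thisnode)
{
	if (thisnode && !node_isset_sketch(nd, mask))
		nd = first_node_sketch(mask);
	return nd;
}
```

For a task running on node 0 with a mask that contains only node 1, the
sketch keeps node 0 when the "this node" flag is clear and switches to
node 1 when it is set, which is the behaviour the question is about.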

[PATCH v6 1/2] serial: xuartps: Add new compatible string for ZynqMP

2016-10-12 Thread Nava kishore Manne

This patch adds a new compatible string for the ZynqMP SoC.

Signed-off-by: Nava kishore Manne 
---
Changes for v6:
-Added New patch.

 drivers/tty/serial/xilinx_uartps.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/tty/serial/xilinx_uartps.c b/drivers/tty/serial/xilinx_uartps.c
index f37edaa..dd4c02f 100644
--- a/drivers/tty/serial/xilinx_uartps.c
+++ b/drivers/tty/serial/xilinx_uartps.c
@@ -1200,6 +1200,7 @@ static int __init cdns_early_console_setup(struct earlycon_device *device,
 OF_EARLYCON_DECLARE(cdns, "xlnx,xuartps", cdns_early_console_setup);
 OF_EARLYCON_DECLARE(cdns, "cdns,uart-r1p8", cdns_early_console_setup);
 OF_EARLYCON_DECLARE(cdns, "cdns,uart-r1p12", cdns_early_console_setup);
+OF_EARLYCON_DECLARE(cdns, "xlnx,zynqmp-uart", cdns_early_console_setup);
 
 /**
  * cdns_uart_console_write - perform write operation
@@ -1438,6 +1439,7 @@ static const struct of_device_id cdns_uart_of_match[] = {
{ .compatible = "xlnx,xuartps", },
{ .compatible = "cdns,uart-r1p8", },
{ .compatible = "cdns,uart-r1p12", .data = &zynqmp_uart_def },
+   { .compatible = "xlnx,zynqmp-uart", .data = &zynqmp_uart_def },
{}
 };
 MODULE_DEVICE_TABLE(of, cdns_uart_of_match);
-- 
2.1.1

Re: [PATCH v8 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-12 Thread Daniel Lezcano

On Tue, Oct 11, 2016 at 04:28:50PM -0400, Rich Felker wrote:
> On Tue, Oct 11, 2016 at 08:18:12PM +0200, Daniel Lezcano wrote:
> > 
> > Hi Rich,
> > 
> > On Sun, Oct 09, 2016 at 05:34:22AM +, Rich Felker wrote:
> > > At the hardware level, the J-Core PIT is integrated with the interrupt
> > > controller, but it is represented as its own device and has an
> > > independent programming interface. It provides a 12-bit countdown
> > > timer, which is not presently used, and a periodic timer. The interval
> > > length for the latter is programmable via a 32-bit throttle register
> > > whose units are determined by a bus-period register. The periodic
> > > timer is used to implement both periodic and oneshot clock event
> > > modes; in oneshot mode the interrupt handler simply disables the timer
> > > as soon as it fires.
> > > 
> > > Despite its device tree node representing an interrupt for the PIT,
> > > the actual irq generated is programmable, not hard-wired. The driver
> > > is responsible for programming the PIT to generate the hardware irq
> > > number that the DT assigns to it.
> > > 
> > > On SMP configurations, J-Core provides cpu-local instances of the PIT;
> > > no broadcast timer is needed. This driver supports the creation of the
> > > necessary per-cpu clock_event_device instances.
> > 
> > > For my personal information, why is no broadcast timer needed?
> 
> Broadcast timer is only needed if you don't have percpu local timers.
> Early on in SMP development I actually tested with an ipi broadcast
> timer and performance was noticeably worse.

Obviously. I thought there was another reason related to power management.
 
> > Are the CPUs on always-on power down ?
> 
> For now they are always on and don't even have the sleep instruction
> (i.e. stop cpu clock until interrupt) implemented. Adding sleep will
> be the first power-saving step, and perhaps the only one for now,
> since there doesn't seem to be any indication (according to the ppl
> working on the hardware) that a deeper sleep would provide significant
> additional savings.

Ok.

However, the 'sleep' state is not, in the power management terminology,
the idle state described above. It is called "clock gated" / "Wait for
Interrupt".

The 'sleep' state loses the CPU context.
 
> > > A nanosecond-resolution clocksource is provided using the J-Core "RTC"
> > > registers, which give a 64-bit seconds count and 32-bit nanoseconds
> > > that wrap every second. The driver converts these to a full-range
> > > 32-bit nanoseconds count.
> > > 
> > > Signed-off-by: Rich Felker 
> > > ---
> > >  drivers/clocksource/Kconfig |  10 ++
> > >  drivers/clocksource/Makefile|   1 +
> > > >  drivers/clocksource/jcore-pit.c | 231
> > >  include/linux/cpuhotplug.h  |   1 +
> > >  4 files changed, 243 insertions(+)
> > >  create mode 100644 drivers/clocksource/jcore-pit.c
> > > 
> > > diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
> > > index 5677886..95dd78b 100644
> > > --- a/drivers/clocksource/Kconfig
> > > +++ b/drivers/clocksource/Kconfig
> > > @@ -407,6 +407,16 @@ config SYS_SUPPORTS_SH_TMU
> > >  config SYS_SUPPORTS_EM_STI
> > >  bool
> > >  
> > > +config CLKSRC_JCORE_PIT
> > > + bool "J-Core PIT timer driver"
> > > + depends on OF && (SUPERH || COMPILE_TEST)
> > 
> > Actually the idea is to have SUPERH select this timer, not to create
> > a dependency on SUPERH from here.
> > 
> > We don't want to prompt for the drivers in the configuration menu because it
> > would be impossible for anyone to know which timer comes with which
> > hardware, so we let the platform select the timer it needs.
> 
> I thought we discussed this before. For users building a kernel for
> legacy SH systems, especially in the current state where they're only
> supported with hard-coded board files rather than device tree, it
> makes no sense to build drivers for J-core hardware. It would make
> sense to be on by default for CONFIG_SH_DEVICE_TREE with a compatible
> CPU selection, but at least at this time, not for SUPERH in general.

Probably I am missing the point, but why would the user have to unselect
this driver manually? The user wants a config file, nothing more, or a very
trivial option. Can you imagine that someone could know every single IP block for
each board of the same arch and be able to disable/enable the right ones?

> Anyway I'd really like to do this non-invasively as long as we have a
> mix of legacy and new stuff and the legacy stuff is not readily
> testable. Once all of arch/sh is moved over to device tree, could we
> revisit this and make all the drivers follow a common policy (on by
> default if they're associated with boards/SoCs using a matching or
> compatible CPU model, or something like that, but still able to be
> disabled manually, since the user might be trying to get a tiny-ish
> embedded kernel)?

I understand the goal is to have one single configuration and ev

Re: [RFC] net: phy: smsc: Disable auto-negotiation on startup

2016-10-12 Thread Florian Fainelli

On 10/10/2016 10:41 AM, Kyle Roeschley wrote:
> Because the SMSC PHY completes auto-negotiation before the driver is
> ready to handle interrupts, the PHY state machine never realizes that we
> have a link. Clear the ANENABLE bit on initialization, which lets
> genphy_config_aneg do its thing when that code is hit later.
> 
> While this patch does fix the problem we see (no link on boot without
> re-plugging the cable), it seems like the generic PHY code should be
> able to handle auto-negotiation completing before interrupts are
> enabled. Submitted as an RFC in the hopes that someone has an idea as to
> how that could be done.
> 
> This fix is copied from commit 99f81afc139c ("phy: micrel: Disable auto
> negotiation on startup").

Do you mind trying:

https://www.spinics.net/lists/netdev/msg397857.html

and see if you do get link interrupts without your patch applied? Thanks!
-- 
Florian

Re: MPOL_BIND on memory only nodes

2016-10-12 Thread Michal Hocko

On Wed 12-10-16 14:55:24, Anshuman Khandual wrote:
> Hi,
> 
> We have the following function policy_zonelist() which selects a zonelist
> during various allocation paths. With this, general user space allocations
> (which, IIUC, might not have __GFP_THISNODE) fail while trying to get memory from
> a memory-only node without CPUs, as the application runs somewhere else
> and that node is not part of the nodemask.

I am not sure I understand. So you have a task with MPOL_BIND without a
CPU-less node in the mask and you are wondering why the memory is not
allocated from that node?

> Why do we insist on __GFP_THISNODE?

AFAIU __GFP_THISNODE just overrides the given node to the policy
nodemask in case the current node is not part of that node mask. In
other words we are ignoring the given node and use what the policy says. 
I can see how this can be confusing especially when confronting the
documentation:

 * __GFP_THISNODE forces the allocation to be satisfied from the requested
 *   node with no fallbacks or placement policy enforcements.

-- 
Michal Hocko
SUSE Labs

Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area

2016-10-12 Thread Michal Hocko

On Wed 12-10-16 16:44:31, zijun_hu wrote:
> On 10/12/2016 04:25 PM, Michal Hocko wrote:
> > On Wed 12-10-16 15:24:33, zijun_hu wrote:
[...]
> >> i found the following code segments in mm/vmalloc.c
> >> static struct vmap_area *alloc_vmap_area(unsigned long size,
> >> unsigned long align,
> >> unsigned long vstart, unsigned long vend,
> >> int node, gfp_t gfp_mask)
> >> {
> >> ...
> >>
> >> BUG_ON(!size);
> >> BUG_ON(offset_in_page(size));
> >> BUG_ON(!is_power_of_2(align));
> > 
> > See a recent Linus rant about BUG_ONs. These BUG_ONs are quite old and
> > from a quick look they are even unnecessary. So rather than adding more
> > of those, I think removing those that are not needed is much more
> > preferred.
> >
> i notice that, and the above code segments are used to illustrate that
> input parameter checking is sometimes necessary

Why do you think it is necessary here?

-- 
Michal Hocko
SUSE Labs

drm/i915: WARN_ON_ONCE(!crtc_clock || cdclk < crtc_clock)

2016-10-12 Thread Paul Bolle

On a laptop that tracks the latest stable release (i.e., it now runs
v4.8.1) I see this WARNING
    WARN_ON_ONCE(!crtc_clock || cdclk < crtc_clock)

Full trace pasted below. I never saw this WARNING before v4.8. Since
v4.8 I've had it in all (four, actually) boots.

What am I expected to do about this WARNING?

Thanks,


Paul Bolle

WARNING: CPU: 3 PID: 1368 at drivers/gpu/drm/i915/intel_display.c:14178 skl_max_scale.part.120+0x75/0x80 [i915]
WARN_ON_ONCE(!crtc_clock || cdclk < crtc_clock)
Modules linked in:
 rfcomm fuse nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter 
ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat 
ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 cmac 
nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 
iptable_mangle iptable_raw iptable_security ebtable_filter ebtables 
ip6table_filter ip6_tables bnep vfat fat arc4 snd_hda_codec_hdmi snd_soc_skl 
dell_led snd_soc_skl_ipc snd_soc_sst_ipc snd_soc_sst_dsp snd_hda_ext_core 
snd_soc_sst_match snd_soc_core intel_rapl snd_hda_codec_realtek 
snd_hda_codec_generic x86_pkg_temp_thermal coretemp kvm_intel snd_compress 
snd_pcm_dmaengine ac97_bus kvm snd_hda_intel iwlmvm snd_hda_codec mac80211 
iTCO_wdt
 iTCO_vendor_support uvcvideo snd_hda_core snd_hwdep snd_seq irqbypass 
dell_laptop i2c_designware_platform i2c_designware_core dell_wmi 
crct10dif_pclmul dell_smbios dcdbas crc32_pclmul snd_seq_device iwlwifi 
videobuf2_vmalloc videobuf2_memops ghash_clmulni_intel snd_pcm videobuf2_v4l2 
videobuf2_core cfg80211 videodev media joydev pcspkr mei_me rtsx_pci_ms 
memstick snd_timer i2c_i801 i2c_smbus mei snd btusb soundcore shpchp hci_uart 
btrtl btbcm btqca idma64 btintel bluetooth intel_pch_thermal 
processor_thermal_device intel_lpss_pci intel_soc_dts_iosf wmi 
pinctrl_sunrisepoint intel_lpss_acpi rfkill pinctrl_intel intel_lpss 
int3400_thermal acpi_als int3403_thermal int340x_thermal_zone kfifo_buf 
acpi_thermal_rel intel_hid industrialio sparse_keymap acpi_pad tpm_tis 
tpm_tis_core tpm nfsd auth_rpcgss
 nfs_acl lockd grace sunrpc hid_multitouch i915 rtsx_pci_sdmmc mmc_core 
i2c_algo_bit drm_kms_helper crc32c_intel drm serio_raw nvme rtsx_pci nvme_core 
i2c_hid video fjes
CPU: 3 PID: 1368 Comm: Xorg Not tainted 4.8.1-1.local1.fc24.x86_64 #1
Hardware name: Dell Inc. XPS 13 9350/09JHRY, BIOS 1.4.4 06/14/2016
 0286 df2f374c a31528d53910 b83e5cfd
 a31528d53960  a31528d53950 b80a7d5b
 3762c72b3010 a3151e4d8cc0 a31526c23800 a31526e6
Call Trace:
 [] dump_stack+0x63/0x86
 [] __warn+0xcb/0xf0
 [] warn_slowpath_fmt+0x5f/0x80
 [] ? sort+0x147/0x220
 [] ? drm_atomic_helper_normalize_zpos+0x264/0x300 
[drm_kms_helper]
 [] skl_max_scale.part.120+0x75/0x80 [i915]
 [] intel_check_primary_plane+0xc6/0xe0 [i915]
 [] ? drm_atomic_helper_normalize_zpos+0x264/0x300 
[drm_kms_helper]
 [] intel_plane_atomic_check+0x132/0x1f0 [i915]
 [] drm_atomic_helper_check_planes+0x84/0x200 [drm_kms_helper]
 [] intel_atomic_check+0x9a7/0x11a0 [i915]
 [] ? __kmalloc_track_caller+0x17a/0x210
 [] drm_atomic_check_only+0x187/0x610 [drm]
 [] ? drm_atomic_get_crtc_state+0x88/0x100 [drm]
 [] drm_atomic_commit+0x17/0x60 [drm]
 [] drm_atomic_helper_update_plane+0xec/0x130 [drm_kms_helper]
 [] __setplane_internal+0x22b/0x270 [drm]
 [] drm_mode_cursor_universal+0x139/0x240 [drm]
 [] drm_mode_cursor_common+0x7e/0x180 [drm]
 [] drm_mode_cursor2_ioctl+0xe/0x10 [drm]
 [] drm_ioctl+0x1da/0x4b0 [drm]
 [] ? drm_mode_cursor_ioctl+0x70/0x70 [drm]
 [] ? enqueue_hrtimer+0x3d/0x80
 [] do_vfs_ioctl+0xa3/0x5e0
 [] ? __sys_recvmsg+0x51/0x90
 [] SyS_ioctl+0x79/0x90
 [] entry_SYSCALL_64_fastpath+0x1a/0xa4

[PATCH] mm: kmemleak: Ensure that the task stack is not freed during scanning

2016-10-12 Thread Catalin Marinas

Commit 68f24b08ee89 ("sched/core: Free the stack early if
CONFIG_THREAD_INFO_IN_TASK") may cause the task->stack to be freed
during kmemleak_scan() execution, leading to either a NULL pointer
fault (if task->stack is NULL) or kmemleak accessing already freed
memory. This patch uses the new try_get_task_stack() API to ensure that
the task stack is not freed during kmemleak stack scanning.

Fixes: 68f24b08ee89 ("sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK")
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: CAI Qian 
Reported-by: CAI Qian 
Signed-off-by: Catalin Marinas 
---

This was reported in a subsequent comment here:

https://bugzilla.kernel.org/show_bug.cgi?id=173901

However, the original bugzilla entry doesn't look related to task stack
freeing as it was first reported on 4.8-rc8. Andy, sorry for cc'ing you
to bugzilla, please feel free to remove your email from the bug above (I
can't seem to be able to do it).

 mm/kmemleak.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index a5e453cf05c4..e5355a5b423f 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1453,8 +1453,11 @@ static void kmemleak_scan(void)
 
 		read_lock(&tasklist_lock);
 		do_each_thread(g, p) {
-			scan_block(task_stack_page(p), task_stack_page(p) +
-				   THREAD_SIZE, NULL);
+			void *stack = try_get_task_stack(p);
+			if (stack) {
+				scan_block(stack, stack + THREAD_SIZE, NULL);
+				put_task_stack(p);
+			}
 		} while_each_thread(g, p);
 		read_unlock(&tasklist_lock);
 	}
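
The pattern used by the patch (take a reference only if the object is still
alive, use it, then drop the reference) can be sketched in userspace as
follows. This is a single-threaded illustration with a plain integer
counter; struct task_sketch and the *_sketch functions are hypothetical and
do not reproduce the kernel's try_get_task_stack() or put_task_stack(),
which have to be safe against concurrent users.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task whose stack can go away early. */
struct task_sketch {
	void *stack;		/* NULL once the stack has been released */
	int stack_refcount;	/* 0 means the stack is already gone */
};

/* Return the stack with an extra reference, or NULL if it is gone. */
static void *try_get_stack_sketch(struct task_sketch *t)
{
	if (t->stack_refcount == 0)
		return NULL;
	t->stack_refcount++;
	return t->stack;
}

/* Drop a reference; the last put releases the stack. */
static void put_stack_sketch(struct task_sketch *t)
{
	assert(t->stack_refcount > 0);
	if (--t->stack_refcount == 0)
		t->stack = NULL;	/* stand-in for freeing the memory */
}

/* Count non-zero words in the stack, touching it only while pinned. */
static size_t scan_task_sketch(struct task_sketch *t, size_t words)
{
	unsigned long *stack = try_get_stack_sketch(t);
	size_t found = 0;
	size_t i;

	if (stack) {
		for (i = 0; i < words; i++)
			if (stack[i])
				found++;
		put_stack_sketch(t);
	}
	return found;
}
```

A task whose stack has already been released is skipped instead of being
dereferenced, and the reference count of a live task is unchanged after
the scan.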

Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area

2016-10-12 Thread zijun_hu

On 10/12/2016 05:54 PM, Michal Hocko wrote:
> On Wed 12-10-16 16:44:31, zijun_hu wrote:
>> On 10/12/2016 04:25 PM, Michal Hocko wrote:
>>> On Wed 12-10-16 15:24:33, zijun_hu wrote:
> [...]
>>>> i found the following code segments in mm/vmalloc.c
>>>> static struct vmap_area *alloc_vmap_area(unsigned long size,
>>>> unsigned long align,
>>>> unsigned long vstart, unsigned long vend,
>>>> int node, gfp_t gfp_mask)
>>>> {
>>>> ...
>>>>
>>>> BUG_ON(!size);
>>>> BUG_ON(offset_in_page(size));
>>>> BUG_ON(!is_power_of_2(align));
>>>
>>> See a recent Linus rant about BUG_ONs. These BUG_ONs are quite old and
>>> from a quick look they are even unnecessary. So rather than adding more
>>> of those, I think removing those that are not needed is much more
>>> preferred.
>>>
>> i notice that, and the above code segments are used to illustrate that
>> input parameter checking is sometimes necessary
> 
> Why do you think it is necessary here?
> 
i am sorry for the late reply
i don't know whether it is necessary
i just find that there are so many sanity checks in the current internal interfaces
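
For context on why the original patch in this thread cares about even
offsets, the encoding it describes (in-use flag in the least significant
bit of a map entry, offset in the remaining bits) can be sketched like
this. The function names are hypothetical and the code is a userspace
illustration of the encoding as stated in the patch description, not a
copy of mm/percpu.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Pack an offset and an in-use flag into one map entry. */
static int map_encode(int offset, bool in_use)
{
	assert(offset >= 0);
	return offset | (in_use ? 1 : 0);
}

/* Recover the offset by masking off the flag bit. */
static int map_offset(int entry)
{
	return entry & ~1;
}

/* Recover the in-use flag from the least significant bit. */
static bool map_in_use(int entry)
{
	return (entry & 1) != 0;
}
```

An even offset such as 16 survives the round trip, but an odd offset such
as 15 stored as "free" reads back as offset 14 and "in use", because its
own low bit collides with the flag.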

Re: [PATCH v3 08/11] powerpc/tracing: fix compat syscall handling

2016-10-12 Thread Michael Ellerman

Marcin Nowakowski  writes:

> Adapt the code to make use of new syscall handling interface
>
> Signed-off-by: Marcin Nowakowski 
> Cc: Steven Rostedt 
> Cc: Ingo Molnar 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: linuxppc-...@lists.ozlabs.org
> ---
>  arch/powerpc/include/asm/ftrace.h | 11 +++
>  arch/powerpc/kernel/ftrace.c  |  4 

I went to test this and noticed the exit and enter events appear to be
reversed in time? (your series on top of 24532f768121)

  ls-4221  [003] 83.766113: compat_sys_rt_sigprocmask -> 0x2
  ls-4221  [003] 83.766137: compat_sys_rt_sigprocmask(how: 2, nset: 1010db30, oset: 0, sigsetsize: 8)
  ls-4221  [003] 83.766175: compat_sys_rt_sigaction -> 0x14
  ls-4221  [003] 83.766175: compat_sys_rt_sigaction(sig: 14, act: ffbd33c4, oact: ffbd3338, sigsetsize: 8)
  ls-4221  [003] 83.766177: compat_sys_rt_sigaction -> 0x15
  ls-4221  [003] 83.766177: compat_sys_rt_sigaction(sig: 15, act: ffbd33c4, oact: ffbd3338, sigsetsize: 8)
  ls-4221  [003] 83.766178: compat_sys_rt_sigaction -> 0x16
  ls-4221  [003] 83.766178: compat_sys_rt_sigaction(sig: 16, act: ffbd33d4, oact: ffbd3348, sigsetsize: 8)
  ls-4221  [003] 83.766179: sys_setpgid -> 0x107d
  ls-4221  [003] 83.766179: sys_setpgid(pid: 107d, pgid: 107d)
  ls-4221  [003] 83.766180: compat_sys_rt_sigprocmask -> 0x0
  ls-4221  [003] 83.766181: compat_sys_rt_sigprocmask(how: 0, nset: ffbd34b0, oset: ffbd3530, sigsetsize: 8)
  ls-4221  [003] 83.766186: compat_sys_ioctl -> 0xff
  ls-4221  [003] 83.766187: compat_sys_ioctl(fd: ff, cmd: 80047476, arg32: ffbd3488)
  ls-4221  [003] 83.766188: compat_sys_rt_sigprocmask -> 0x2
  ls-4221  [003] 83.766189: compat_sys_rt_sigprocmask(how: 2, nset: ffbd3530, oset: 0, sigsetsize: 8)
  ls-4221  [003] 83.766189: sys_close -> 0x4
  ls-4221  [003] 83.766190: sys_close(fd: 4)
  ls-4221  [003] 83.766191: sys_read -> 0x3
  ls-4221  [003] 83.766191: sys_read(fd: 3, buf: ffbd35dc, count: 1)
  ls-4221  [003] 83.766235: sys_close -> 0x3
  ls-4221  [003] 83.766235: sys_close(fd: 3)

cheers

[PATCH] drm/bridge: analogix: protect power when get_modes or detect

2016-10-12 Thread Mark Yao

The drm callbacks ->detect and ->get_modes do not seem to be power safe:
they may be called while the device is powered off, and register access in
detect or get_modes will then cause the system to die.

Here is a call path that reaches ->detect before analogix_dp is powered on:
[] analogix_dp_detect+0x44/0xdc
[] drm_helper_probe_single_connector_modes_merge_bits+0xe8/0x41c
[] drm_helper_probe_single_connector_modes+0x10/0x18
[] drm_mode_getconnector+0xf4/0x304
[] drm_ioctl+0x23c/0x390
[] do_vfs_ioctl+0x4b8/0x58c
[] SyS_ioctl+0x60/0x88

Cc: Inki Dae 
Cc: Sean Paul 
Cc: Gustavo Padovan 
Cc: "Ville Syrjälä" 

Signed-off-by: Mark Yao 
---
 drivers/gpu/drm/bridge/analogix/analogix_dp_core.c | 28 ++
 1 file changed, 28 insertions(+)

diff --git a/drivers/gpu/drm/bridge/analogix/analogix_dp_core.c b/drivers/gpu/drm/bridge/analogix/analogix_dp_core.c
index efac8ab..09dece2 100644
--- a/drivers/gpu/drm/bridge/analogix/analogix_dp_core.c
+++ b/drivers/gpu/drm/bridge/analogix/analogix_dp_core.c
@@ -1062,6 +1062,13 @@ int analogix_dp_get_modes(struct drm_connector *connector)
return 0;
}
 
+   if (dp->dpms_mode != DRM_MODE_DPMS_ON) {
+   pm_runtime_get_sync(dp->dev);
+
+   if (dp->plat_data->power_on)
+   dp->plat_data->power_on(dp->plat_data);
+   }
+
if (analogix_dp_handle_edid(dp) == 0) {
drm_mode_connector_update_edid_property(&dp->connector, edid);
num_modes += drm_add_edid_modes(&dp->connector, edid);
@@ -1073,6 +1080,13 @@ int analogix_dp_get_modes(struct drm_connector *connector)
if (dp->plat_data->get_modes)
num_modes += dp->plat_data->get_modes(dp->plat_data, connector);
 
+   if (dp->dpms_mode != DRM_MODE_DPMS_ON) {
+   if (dp->plat_data->power_off)
+   dp->plat_data->power_off(dp->plat_data);
+
+   pm_runtime_put_sync(dp->dev);
+   }
+
ret = analogix_dp_prepare_panel(dp, false, false);
if (ret)
DRM_ERROR("Failed to unprepare panel (%d)\n", ret);
@@ -1106,9 +1120,23 @@ analogix_dp_detect(struct drm_connector *connector, bool force)
return connector_status_disconnected;
}
 
+   if (dp->dpms_mode != DRM_MODE_DPMS_ON) {
+   pm_runtime_get_sync(dp->dev);
+
+   if (dp->plat_data->power_on)
+   dp->plat_data->power_on(dp->plat_data);
+   }
+
if (!analogix_dp_detect_hpd(dp))
status = connector_status_connected;
 
+   if (dp->dpms_mode != DRM_MODE_DPMS_ON) {
+   if (dp->plat_data->power_off)
+   dp->plat_data->power_off(dp->plat_data);
+
+   pm_runtime_put_sync(dp->dev);
+   }
+
ret = analogix_dp_prepare_panel(dp, false, false);
if (ret)
DRM_ERROR("Failed to unprepare panel (%d)\n", ret);
-- 
1.9.1

Re: [PATCH v3 07/11] arm64/tracing: fix compat syscall handling

2016-10-12 Thread Will Deacon

On Wed, Oct 12, 2016 at 09:07:03AM +0200, Marcin Nowakowski wrote:
> On 11.10.2016 15:36, Will Deacon wrote:
> >On Tue, Oct 11, 2016 at 12:42:52PM +0200, Marcin Nowakowski wrote:
> >>diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> >>index e78ac26..276d049 100644
> >>--- a/arch/arm64/include/asm/unistd.h
> >>+++ b/arch/arm64/include/asm/unistd.h
> >>@@ -45,6 +45,7 @@
> >> #define __ARM_NR_compat_set_tls	(__ARM_NR_COMPAT_BASE+5)
> >>
> >> #define __NR_compat_syscalls   394
> >>+#define NR_compat_syscalls (__NR_compat_syscalls)
> >
> >We may as well just define NR_compat_syscalls instead of
> >__NR_compat_syscalls and move the handful of users over.
> 
> I had tried to minimise the amount of arch-specific changes here -
> especially those that are not directly related to the proposed syscall
> handling change. But I agree having these 2 #defines is a bit unnecessary

There are only three users of __NR_compat_syscalls, so I think you can
move them over.

> >>diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
> >>index 40ad08a..75d010f 100644
> >>--- a/arch/arm64/kernel/ftrace.c
> >>+++ b/arch/arm64/kernel/ftrace.c
> >>@@ -176,4 +176,20 @@ int ftrace_disable_ftrace_graph_caller(void)
> >>return ftrace_modify_graph_caller(false);
> >> }
> >> #endif /* CONFIG_DYNAMIC_FTRACE */
> >>+
> >> #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
> >>+
> >>+#if (defined CONFIG_FTRACE_SYSCALLS) && (defined CONFIG_COMPAT)
> >>+
> >>+extern const void *sys_call_table[];
> >>+extern const void *compat_sys_call_table[];
> >>+
> >>+unsigned long __init arch_syscall_addr(int nr, bool compat)
> >>+{
> >>+   if (compat)
> >>+   return (unsigned long)compat_sys_call_table[nr];
> >>+
> >>+   return (unsigned long)sys_call_table[nr];
> >>+}
> >
> >Do we care about the compat private syscalls (from base 0x0f)? We
> >need to make sure that we exhibit the same behaviour as a native
> >32-bit ARM machine.
> >
> >Will
> 
> Tracing of such syscalls has been disabled for a long time (see
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=086ba77a6db0).
> Apart from using non-contiguous numbers, they are not defined using standard
> SYSCALL macros, so they do not have any metadata generated either.
> My suggestion is that if you wanted those to be included in the trace then
> it should be done separately from these changes.

Fine by me -- I just wanted to make sure our compat behaviour matched
the behaviour of native arch/arm/. It sounds like it does, so no need to
change anything here.

Acked-by: Will Deacon 

Will
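
For readers following the arch_syscall_addr() discussion above, here is a
minimal user-space sketch of the same idea: one lookup helper that picks
between a native and a compat table. The toy_* names, the table contents and
the bounds checks are assumptions made for illustration; the quoted patch
indexes the kernel tables directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical handler type and handlers; not the kernel's real ones. */
typedef long (*toy_syscall_fn)(void);

static long toy_native_call(void) { return 64; }
static long toy_compat_call(void) { return 32; }

/* Two separate tables, as with sys_call_table and compat_sys_call_table. */
static const toy_syscall_fn toy_native_table[] = { toy_native_call };
static const toy_syscall_fn toy_compat_table[] = { toy_compat_call };

#define TOY_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Return the handler for syscall number nr from the table selected by
 * compat, or NULL if nr has no entry in that table.
 */
static toy_syscall_fn toy_syscall_addr(size_t nr, bool compat)
{
    if (compat)
        return nr < TOY_ARRAY_SIZE(toy_compat_table) ?
               toy_compat_table[nr] : NULL;

    return nr < TOY_ARRAY_SIZE(toy_native_table) ?
           toy_native_table[nr] : NULL;
}
```

In this sketch a number with no table entry just yields NULL, which loosely
mirrors the point made in the thread that calls outside the regular table
carry no metadata to trace.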

Re: [PATCH v2] z3fold: add shrinker

2016-10-12 Thread Vitaly Wool

On Wed, 12 Oct 2016 09:52:06 +1100
Dave Chinner  wrote:


> 
> > +static unsigned long z3fold_shrink_scan(struct shrinker *shrink,
> > +   struct shrink_control *sc)
> > +{
> > +   struct z3fold_pool *pool = container_of(shrink, struct z3fold_pool,
> > +   shrinker);
> > +   struct z3fold_header *zhdr;
> > +   int i, nr_to_scan = sc->nr_to_scan;
> > +
> > +   spin_lock(&pool->lock);
> 
> Do not do this. Shrinkers should not run entirely under a spin lock
> like this - it causes scheduling latency problems and when the
> shrinker is run concurrently on different CPUs it will simply burn
> CPU doing no useful work. Especially, in this case, as each call to
> z3fold_compact_page() may be copying a significant amount of data
> around and so there is potentially a /lot/ of work being done on
> each call to the shrinker.
> 
> If you need compaction exclusion for the shrinker invocation, then
> please use a sleeping lock to protect the compaction work.

Well, as far as I recall, spin_lock() will resolve to a sleeping lock
for PREEMPT_RT, so it is not that much of a problem for configurations
which do care much about latencies. Please also note that the time
spent in the loop is deterministic since we take not more than one entry
from every unbuddied list.

What I could do though is add the following piece of code at the end of
the loop, right after the /break/:
spin_unlock(&pool->lock);
cond_resched();
spin_lock(&pool->lock);

Would that make sense for you?
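
As an illustration only, the unlock / reschedule / relock pattern proposed
above can be sketched in user space. This is not z3fold code: the toy_* names
are hypothetical, a C11 atomic_flag stands in for the spinlock and
sched_yield() stands in for cond_resched().

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_NR_LISTS 4

/* Hypothetical pool: one counter of pending entries per list. */
struct toy_pool {
    atomic_flag lock;
    int lists[TOY_NR_LISTS];
    size_t nr_lists;
};

static void toy_pool_init(struct toy_pool *p)
{
    atomic_flag_clear(&p->lock);
    for (size_t i = 0; i < TOY_NR_LISTS; i++)
        p->lists[i] = 0;
    p->nr_lists = TOY_NR_LISTS;
}

static void toy_lock(struct toy_pool *p)
{
    while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
        sched_yield();
}

static void toy_unlock(struct toy_pool *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/*
 * Take at most one entry from every list, up to nr_to_scan entries in
 * total. The lock is dropped and retaken after each list so that other
 * threads get a chance to run in between.
 */
static size_t toy_shrink_scan(struct toy_pool *p, size_t nr_to_scan)
{
    size_t done = 0;

    toy_lock(p);
    for (size_t i = 0; i < p->nr_lists && done < nr_to_scan; i++) {
        if (p->lists[i] > 0) {
            p->lists[i]--;
            done++;
        }
        toy_unlock(p);
        sched_yield();
        toy_lock(p);
    }
    toy_unlock(p);

    return done;
}
```

One consequence of this shape is that the pool may change while the lock is
released, so anything read before the unlock has to be treated as stale after
the relock.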

> 
> >  */
> > @@ -234,6 +335,13 @@ static struct z3fold_pool *z3fold_create_pool(gfp_t 
> > gfp,
> > INIT_LIST_HEAD(&pool->unbuddied[i]);
> > INIT_LIST_HEAD(&pool->buddied);
> > INIT_LIST_HEAD(&pool->lru);
> > +   pool->shrinker.count_objects = z3fold_shrink_count;
> > +   pool->shrinker.scan_objects = z3fold_shrink_scan;
> > +   pool->shrinker.seeks = DEFAULT_SEEKS;
> > +   if (register_shrinker(&pool->shrinker)) {
> > +   pr_warn("z3fold: could not register shrinker\n");
> > +   pool->no_shrinker = true;
> > +   } 
> 
> Just fail creation of the pool. If you can't register a shrinker,
> then much bigger problems are about to happen to your system, and
> running a new memory consumer that /can't be shrunk/ is not going to
> help anyone.

I don't have a strong opinion on this but it doesn't look fatal to me
in _this_ particular case (z3fold) since even without the shrinker, the
compression ratio will never be lower than the one of zbud, which
doesn't have a shrinker at all.

Best regards,
   Vitaly

[PATCH v1 3/4] Add hwcap2 for x86

2016-10-12 Thread Grzegorz Andrejczuk

Add hwcap2 attribute for x86.
Reserve 1st bit of HWCAP2 for exposing Xeon Phi ring 3 monitor/mwait.
With this userspace apps can detect Ring 3 MONITOR/MWAIT instructions.

Change-Id: I37d0354d1e2b9594d7feebc2bacda30b68163efe
Signed-off-by: Grzegorz Andrejczuk 
---
 arch/x86/include/asm/elf.h| 7 +++
 arch/x86/include/uapi/asm/hwcap.h | 7 +++
 arch/x86/kernel/cpu/common.c  | 3 +++
 3 files changed, 17 insertions(+)
 create mode 100644 arch/x86/include/uapi/asm/hwcap.h

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index e7f155c..62d060a 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -258,6 +258,13 @@ extern int force_personality32;
 
 #define ELF_HWCAP  (boot_cpu_data.x86_capability[CPUID_1_EDX])
 
+extern unsigned int elf_hwcap2;
+
+/* HWCAP2 supplies kernel enabled CPU feature, so that the application
+   can know that it can safely use them. The bits are defined in
+   uapi/asm/hwcap.h. */
+#define ELF_HWCAP2 elf_hwcap2
+
 /* This yields a string that ld.so will use to load implementation
specific libraries for optimization.  This is more specific in
intent than poking at uname or /proc/cpuinfo.
diff --git a/arch/x86/include/uapi/asm/hwcap.h 
b/arch/x86/include/uapi/asm/hwcap.h
new file mode 100644
index 000..d1f4f98
--- /dev/null
+++ b/arch/x86/include/uapi/asm/hwcap.h
@@ -0,0 +1,7 @@
+#ifndef _ASM_HWCAP_H
+#define _ASM_HWCAP_H 1
+
+/* Kernel enabled Ring 3 MWAIT for Xeon Phi*/
+#define HWCAP2_PHIR3MWAIT  (1 << 0)
+/* upto bit 31 free */
+#endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index bcc9ccc..93ffaa5 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -51,6 +52,8 @@
 
 #include "cpu.h"
 
+unsigned elf_hwcap2 __read_mostly;
+
 /* all of these masks are initialized in setup_cpu_local_masks() */
 cpumask_var_t cpu_initialized_mask;
 cpumask_var_t cpu_callout_mask;
-- 
2.5.1
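
A sketch of how a userspace program might consume the bit proposed in this
patch, assuming the kernel exports elf_hwcap2 through the AT_HWCAP2 auxiliary
vector entry. getauxval() and AT_HWCAP2 are existing glibc/ELF interfaces; the
TOY_HWCAP2_PHIR3MWAIT constant simply copies the value from the patch and the
helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/auxv.h>

/* Bit value copied from the patch above; the TOY_ name is made up here. */
#define TOY_HWCAP2_PHIR3MWAIT (1UL << 0)

/* Pure helper: does this HWCAP2 word advertise ring 3 MONITOR/MWAIT? */
static bool toy_hwcap2_has_r3mwait(unsigned long hwcap2)
{
    return (hwcap2 & TOY_HWCAP2_PHIR3MWAIT) != 0;
}

/* Query the running process's auxiliary vector. */
static bool toy_cpu_has_r3mwait(void)
{
    /* getauxval() returns 0 when there is no AT_HWCAP2 entry. */
    return toy_hwcap2_has_r3mwait(getauxval(AT_HWCAP2));
}
```

A program would call toy_cpu_has_r3mwait() once at startup and fall back to
another waiting strategy when it returns false.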

[RESEND RFC PATCH v2 1/1] mm/vmalloc.c: simplify /proc/vmallocinfo implementation

2016-10-12 Thread zijun_hu

From: zijun_hu 

Many seq_file helpers exist to simplify the implementation of virtual
files, especially /proc nodes. However, although helpers for iterating
over a list_head are available, /proc/vmallocinfo does not currently use
them.

Simplify the /proc/vmallocinfo implementation by using the existing
seq_file list helpers.

Signed-off-by: zijun_hu 
---
 Changes in v2:
  - the redundant type cast is removed as advised by rient...@google.com
  - commit messages are updated

 mm/vmalloc.c | 27 +--
 1 file changed, 5 insertions(+), 22 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f2481cb4e6b2..e73948afac70 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2574,32 +2574,13 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int 
nr_vms)
 static void *s_start(struct seq_file *m, loff_t *pos)
__acquires(&vmap_area_lock)
 {
-   loff_t n = *pos;
-   struct vmap_area *va;
-
spin_lock(&vmap_area_lock);
-   va = list_first_entry(&vmap_area_list, typeof(*va), list);
-   while (n > 0 && &va->list != &vmap_area_list) {
-   n--;
-   va = list_next_entry(va, list);
-   }
-   if (!n && &va->list != &vmap_area_list)
-   return va;
-
-   return NULL;
-
+   return seq_list_start(&vmap_area_list, *pos);
 }
 
 static void *s_next(struct seq_file *m, void *p, loff_t *pos)
 {
-   struct vmap_area *va = p, *next;
-
-   ++*pos;
-   next = list_next_entry(va, list);
-   if (&next->list != &vmap_area_list)
-   return next;
-
-   return NULL;
+   return seq_list_next(p, &vmap_area_list, pos);
 }
 
 static void s_stop(struct seq_file *m, void *p)
@@ -2634,9 +2615,11 @@ static void show_numa_info(struct seq_file *m, struct 
vm_struct *v)
 
 static int s_show(struct seq_file *m, void *p)
 {
-   struct vmap_area *va = p;
+   struct vmap_area *va;
struct vm_struct *v;
 
+   va = list_entry(p, struct vmap_area, list);
+
/*
 * s_show can encounter race with remove_vm_area, !VM_VM_AREA on
 * behalf of vmap area is being tear down or vm_map_ram allocation.
-- 
1.9.1
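
To make the patch above easier to follow, here is a user-space sketch of what
list-based start/next helpers do: walk a circular doubly linked list to a
given position, and step to the next node while advancing the position. The
toy_* names and types are stand-ins written for this note, not the kernel's
list_head or seq_file code.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the list_head idea. */
struct toy_list {
    struct toy_list *next, *prev;
};

/* Hypothetical payload embedding a list node, like vmap_area embeds one. */
struct toy_area {
    int id;
    struct toy_list list;
};

/* Recover the containing structure from a pointer to its list member. */
#define toy_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static void toy_list_init(struct toy_list *head)
{
    head->next = head->prev = head;
}

static void toy_list_add_tail(struct toy_list *node, struct toy_list *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Return the node at position pos, or NULL if the list is shorter. */
static struct toy_list *toy_seq_list_start(struct toy_list *head, long pos)
{
    for (struct toy_list *lh = head->next; lh != head; lh = lh->next)
        if (pos-- == 0)
            return lh;
    return NULL;
}

/* Advance *pos and return the following node, or NULL at the end. */
static struct toy_list *toy_seq_list_next(struct toy_list *cur,
                                          struct toy_list *head, long *pos)
{
    ++*pos;
    return cur->next == head ? NULL : cur->next;
}
```

Because the iterator now hands back the list node rather than the containing
object, the show callback has to convert it back, which is why the patch adds
the list_entry() call in s_show().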

Re: Intermittent perf build failures

2016-10-12 Thread Jiri Olsa

On Tue, Oct 11, 2016 at 02:18:49PM -0700, Laura Abbott wrote:
> On 10/11/2016 01:59 PM, Jiri Olsa wrote:
> > On Tue, Oct 11, 2016 at 01:43:36PM -0700, Laura Abbott wrote:
> > > Hi,
> > > 
> > > While building today's Fedora rawhide kernel, there was a failure
> > > building perf with -j4 [1]:

ok, the -j 4 is the problem

running "make -j 4  install-bin install-traceevent-plugins"

  BUILD:   Doing 'make -j4' parallel build
  BUILD:   Doing 'make -j4' parallel build

will run parallel make instances for install-bin and install-traceevent-plugins
which will eventually touch the same files and crash..

the main perf Makefile actually detects the number of cpus
and runs Makefile.perf with the -j X option, so there's no need
to specify it at the top level.. you can always customize it via
the JOBS=X make variable

so if you don't specify the -j X option it will run
'Makefile.perf install-bin install-traceevent-plugins' with
-j X set, and it should execute sequentially and fix your problem

thanks,
jirka

[PATCH v1 0/4] Enabling Ring 3 MONITOR/MWAIT feature for Knights Landing

2016-10-12 Thread Grzegorz Andrejczuk

These patches enable the Intel Xeon Phi x200 feature that allows the
MONITOR/MWAIT instructions to be used in ring 3 (userspace). The patches set
MSR 0x140 for all logical CPUs, then expose it as a CPU feature and introduce
an ELF HWCAP2 capability for x86.
Reference:
https://software.intel.com/en-us/blogs/2016/10/06/intel-xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait

Grzegorz Andrejczuk (4):
  Add R3MWAIT register and bit to msr-info.h
  Add enabling of the R3 MWAIT during boot for KNL
  Add hwcap2 for x86
  Add R3MWAIT to CPU features

 arch/x86/include/asm/cpufeature.h|  6 --
 arch/x86/include/asm/cpufeatures.h   |  6 +-
 arch/x86/include/asm/disabled-features.h |  3 ++-
 arch/x86/include/asm/elf.h   |  7 +++
 arch/x86/include/asm/msr-index.h |  5 +
 arch/x86/include/asm/required-features.h |  3 ++-
 arch/x86/include/uapi/asm/hwcap.h|  7 +++
 arch/x86/kernel/cpu/common.c |  6 ++
 arch/x86/kernel/cpu/intel.c  | 27 +++
 9 files changed, 65 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/include/uapi/asm/hwcap.h

-- 
2.5.1

[PATCH v1 4/4] Add R3MWAIT to CPU features

2016-10-12 Thread Grzegorz Andrejczuk

Add cpu feature for ring 3 monitor/mwait.

Change-Id: Iba4d20639efd8d3637d37db9294cbc43a98f009a
Signed-off-by: Grzegorz Andrejczuk 
---
 arch/x86/include/asm/cpufeature.h| 6 --
 arch/x86/include/asm/cpufeatures.h   | 6 +-
 arch/x86/include/asm/disabled-features.h | 3 ++-
 arch/x86/include/asm/required-features.h | 3 ++-
 arch/x86/kernel/cpu/common.c | 3 +++
 arch/x86/kernel/cpu/intel.c  | 1 +
 6 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h 
b/arch/x86/include/asm/cpufeature.h
index 1d2b69f..1baa1df 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -78,8 +78,9 @@ extern const char * const x86_bug_flags[NBUGINTS*32];
   CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 15, feature_bit) ||\
   CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 16, feature_bit) ||\
   CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 17, feature_bit) ||\
+  CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 18, feature_bit) ||\
   REQUIRED_MASK_CHECK||\
-  BUILD_BUG_ON_ZERO(NCAPINTS != 18))
+  BUILD_BUG_ON_ZERO(NCAPINTS != 19))
 
 #define DISABLED_MASK_BIT_SET(feature_bit) \
 ( CHECK_BIT_IN_MASK_WORD(DISABLED_MASK,  0, feature_bit) ||\
@@ -100,8 +101,9 @@ extern const char * const x86_bug_flags[NBUGINTS*32];
   CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 15, feature_bit) ||\
   CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 16, feature_bit) ||\
   CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 17, feature_bit) ||\
+  CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 18, feature_bit) ||\
   DISABLED_MASK_CHECK||\
-  BUILD_BUG_ON_ZERO(NCAPINTS != 18))
+  BUILD_BUG_ON_ZERO(NCAPINTS != 19))
 
 #define cpu_has(c, bit)
\
(__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 :  \
diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index 92a8308..242cd16 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -12,7 +12,7 @@
 /*
  * Defines x86 CPU feature bits
  */
-#define NCAPINTS   18  /* N 32-bit words worth of info */
+#define NCAPINTS   19  /* N 32-bit words worth of info */
 #define NBUGINTS   1   /* N 32-bit bug flags */
 
 /*
@@ -286,6 +286,10 @@
 #define X86_FEATURE_SUCCOR (17*32+1) /* Uncorrectable error containment 
and recovery */
 #define X86_FEATURE_SMCA   (17*32+3) /* Scalable MCA */
 
+
+/* non architectural Intel-defined CPU features not present in CPUID, word 18 
*/
+#define X86_FEATURE_PHIR3MWAIT (18*32+ 0)
+
 /*
  * BUG word(s)
  */
diff --git a/arch/x86/include/asm/disabled-features.h 
b/arch/x86/include/asm/disabled-features.h
index 85599ad..8b45e08 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -57,6 +57,7 @@
 #define DISABLED_MASK150
 #define DISABLED_MASK16(DISABLE_PKU|DISABLE_OSPKE)
 #define DISABLED_MASK170
-#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)
+#define DISABLED_MASK180
+#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
 
 #endif /* _ASM_X86_DISABLED_FEATURES_H */
diff --git a/arch/x86/include/asm/required-features.h 
b/arch/x86/include/asm/required-features.h
index fac9a5c..6847d85 100644
--- a/arch/x86/include/asm/required-features.h
+++ b/arch/x86/include/asm/required-features.h
@@ -100,6 +100,7 @@
 #define REQUIRED_MASK150
 #define REQUIRED_MASK160
 #define REQUIRED_MASK170
-#define REQUIRED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)
+#define REQUIRED_MASK180
+#define REQUIRED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
 
 #endif /* _ASM_X86_REQUIRED_FEATURES_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 93ffaa5..15fe27f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1108,6 +1108,9 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 #endif
/* The boot/hotplug time assigment got cleared, restore it */
c->logical_proc_id = topology_phys_to_logical_pkg(c->phys_proc_id);
+
+   if (cpu_has(c, X86_FEATURE_PHIR3MWAIT))
+   elf_hwcap2 |= HWCAP2_PHIR3MWAIT;
 }
 
 /*
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 7f0f01a..1f65815 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -236,6 +236,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
rdmsrl(MSR_PHI_MISC_THD_FEATURE_ENABLE, prev);
wrmsrl(MSR_PHI_MISC_THD_FEATURE_ENABLE,
   prev | MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT);
+   set_cpu_cap(c, X86_FEATURE_PHIR3MWAIT);
}
 }
 
-- 
2.5.1

[PATCH v1 2/4] Add enabling of the R3 MWAIT during boot for KNL

2016-10-12 Thread Grzegorz Andrejczuk

If the processor is an Intel Xeon Phi, enable the user-level mwait feature.
Enabling this feature suppresses the invalid-opcode error when MONITOR/MWAIT
is called from ring 3.

Change-Id: I1c7defb99296b022790a068a6c725b3e860cd68c
Signed-off-by: Grzegorz Andrejczuk 
---
 arch/x86/kernel/cpu/intel.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index fcd484d..7f0f01a 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -61,6 +61,14 @@ void check_mpx_erratum(struct cpuinfo_x86 *c)
}
 }
 
+static int phir3mwait = 1;
+static int __init phir3mwait_disable(char *value)
+{
+   phir3mwait = 0;
+   return 1;
+}
+__setup("intel-phir3mwait=disable", phir3mwait_disable);
+
 static void early_init_intel(struct cpuinfo_x86 *c)
 {
u64 misc_enable;
@@ -211,6 +219,24 @@ static void early_init_intel(struct cpuinfo_x86 *c)
}
 
check_mpx_erratum(c);
+
+   /*
+   * Setting ring 3 MONITOR/MWAIT for all threads
+   * when CPU is Xeon Phi Family x200
+   * This can be disabled with the intel-phir3mwait=disable cmdline switch.
+   * We preserve the reserved values and set only 2nd bit.
+   * Ref:
+   * 
https://software.intel.com/en-us/blogs/2016/10/06/intel-xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait
+   */
+   if (c->x86 == 6 &&
+   c->x86_model == INTEL_FAM6_XEON_PHI_KNL &&
+   phir3mwait) {
+   u64 prev;
+
+   rdmsrl(MSR_PHI_MISC_THD_FEATURE_ENABLE, prev);
+   wrmsrl(MSR_PHI_MISC_THD_FEATURE_ENABLE,
+  prev | MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT);
+   }
 }
 
 #ifdef CONFIG_X86_32
-- 
2.5.1

[PATCH v1 1/4] Add R3MWAIT register and bit to msr-info.h

2016-10-12 Thread Grzegorz Andrejczuk

Intel Xeon Phi x200 (codenamed Knights Landing) has MSR
MISC_THD_FEATURE_ENABLE 0x140.

Setting its 2nd bit (bit 1) stops the MONITOR and MWAIT instructions from
causing an invalid-opcode exception in ring 3.

This commit adds this register, prefixed by PHI, and the bit to msr-index.h.
Reference:
https://software.intel.com/en-us/blogs/2016/10/06/intel-xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait

Change-Id: If3b14c78f4e66d734e5a00921023a8c7cafc0cf3
Signed-off-by: Grzegorz Andrejczuk 
---
 arch/x86/include/asm/msr-index.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 56f4c66..3eb1713 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -540,6 +540,11 @@
 #define MSR_IA32_MISC_ENABLE_IP_PREF_DISABLE_BIT   39
 #define MSR_IA32_MISC_ENABLE_IP_PREF_DISABLE   (1ULL << 
MSR_IA32_MISC_ENABLE_IP_PREF_DISABLE_BIT)
 
+/* Intel Xeon Phi x200 ring 3 MONITOR/MWAIT */
+#define MSR_PHI_MISC_THD_FEATURE_ENABLE0x0140
+#define MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT_BIT1
+#define MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT(1ULL << 
MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT_BIT)
+
 #define MSR_IA32_TSC_DEADLINE  0x06E0
 
 /* P4/Xeon+ specific */
-- 
2.5.1
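
A small worked example of the bit arithmetic in this series, assuming the
definitions from the patch above: the "2nd bit" is bit index 1, so the mask is
0x2, and the enable step is a read-modify-write that leaves every other bit of
the register untouched. The TOY_ names are hypothetical and no real MSR is
accessed.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit position as MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT_BIT above. */
#define TOY_R3MWAIT_BIT 1
#define TOY_R3MWAIT     (1ULL << TOY_R3MWAIT_BIT)

/*
 * Compute the value to write back: keep whatever was read (including
 * reserved bits) and set only the R3MWAIT bit.
 */
static uint64_t toy_enable_r3mwait(uint64_t prev)
{
    return prev | TOY_R3MWAIT;
}
```

Applying the helper twice gives the same result as applying it once, so
running it on every logical CPU is harmless even if the bit is already set.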

Re: [PATCH] mm: kmemleak: Ensure that the task stack is not freed during scanning

2016-10-12 Thread Hillf Danton

> @@ -1453,8 +1453,11 @@ static void kmemleak_scan(void)
> 
>   read_lock(&tasklist_lock);
>   do_each_thread(g, p) {

Take a look at this commit please.
1da4db0cd5 ("oom_kill: change oom_kill.c to use for_each_thread()")

> - scan_block(task_stack_page(p), task_stack_page(p) +
> -THREAD_SIZE, NULL);
> + void *stack = try_get_task_stack(p);
> + if (stack) {
> + scan_block(stack, stack + THREAD_SIZE, NULL);
> + put_task_stack(p);
> + }
>   } while_each_thread(g, p);
>   read_unlock(&tasklist_lock);
>   }
>

Re: [PATCH v8 6/6] mfd: lpc_ich: Add support for Intel Apollo Lake GPIO pinctrl in non-ACPI system

2016-10-12 Thread Andy Shevchenko

On Wed, 2016-10-12 at 14:51 +0800, Tan Jui Nee wrote:
> This driver uses the P2SB hide/unhide mechanism cooperatively
> to pass the PCI BAR address to the gpio platform driver.
> 

Mostly minor issues below.

> --- a/drivers/mfd/Makefile
> +++ b/drivers/mfd/Makefile
> @@ -161,6 +161,10 @@ obj-$(CONFIG_MFD_INTEL_QUARK_I2C_GPIO)   +=
> intel_quark_i2c_gpio.o
>  obj-$(CONFIG_LPC_SCH)+= lpc_sch.o
>  lpc_ich-objs := lpc_ich_core.o

^^^

>  obj-$(CONFIG_LPC_ICH)+= lpc_ich.o
> +lpc_ich-objs := lpc_ich_core.o

^^^ duplication.

> +ifeq ($(CONFIG_X86_INTEL_IVI),y)
> +lpc_ich-objs += lpc_ich_apl.o
> +endif

> +++ b/drivers/mfd/lpc_ich_apl.c
> @@ -0,0 +1,120 @@
> +/*
> + * Intel Apollo Lake In-Vehicle Infotainment (IVI) systems used in
> cars support

> + *
> + * Copyright (C) 2016 Intel Corporation
> + *
> + * Author: Tan, Jui Nee 
> + *
> + * This program is free software; you can redistribute it and/or
> modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include 

Hmm... asm stuff is platform specific, better to put it in separate
section like:

#include 
#include 

#include 

#include "c.h"

> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "lpc_ich_apl.h"
> +
> 

> +int lpc_ich_add_gpio(struct pci_dev *dev, enum lpc_chipsets chipset)
> +{
> + unsigned int i;
> + int ret;
> + struct resource base;
> +
> 

> + if (chipset != LPC_APL)
> + return -ENODEV;

Replace this by positive check (see below). Moreover -ENODEV will be
returned if no cells were added.

> + /*
> +  * Apollo lake, has not 1, but 4 gpio controllers,

Perhaps "Apollo lake has 4 gpio controllers,"

> +  * handle it a bit differently.
> +  */
> +
> + ret = p2sb_bar(dev, PCI_DEVFN(PCI_IDSEL_P2SB, 0), &base);
> + if (ret)
> + goto warn_continue;
> +
> + for (i = 0; i < APL_GPIO_COMMUNITY_MAX; i++) {
> + struct resource *res = &apl_gpio_io_res[i];
> +
> + /* Fill MEM resource */
> + res->start += base.start;
> + res->end += base.start;
> + res->flags = base.flags;
> +
> + res++;
> + }
> +
> + ret = mfd_add_devices(&dev->dev, 0,
> + apl_gpio_devices, ARRAY_SIZE(apl_gpio_devices),
> + NULL, 0, NULL);
> +

> + if (ret)
> +warn_continue:

Swap them.

> + dev_warn(&dev->dev,
> + "Failed to add Apollo Lake GPIO: %d\n",
> + ret);
> +
> + return ret;
> +}


> +++ b/drivers/mfd/lpc_ich_apl.h
> @@ -0,0 +1,29 @@
> +/*
> + * lpc_ich_apl.h - Intel In-Vehicle Infotainment (IVI) systems used
> in cars
> + * support

> + *
> + * Copyright (C) 2016, Intel Corporation
> + *
> + * Author: Tan, Jui Nee 
> + *
> + * This program is free software; you can redistribute it and/or
> modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef __LPC_ICH_APL_H__
> +#define __LPC_ICH_APL_H__
> +
> +#include 
> +
> +#if IS_ENABLED(CONFIG_X86_INTEL_IVI)
> +int lpc_ich_add_gpio(struct pci_dev *dev, enum lpc_chipsets chipset);
> +#else /* CONFIG_X86_INTEL_IVI is not set */
> +static inline int lpc_ich_add_gpio(struct pci_dev *dev,
> + enum lpc_chipsets chipset)
> +{
> + return -ENODEV;
> +}
> +#endif

Add a comment here if you want it to look like p2sb.h.

> +
> +#endif
> 

> --- a/drivers/mfd/lpc_ich_core.c
> +++ b/drivers/mfd/lpc_ich_core.c
> @@ -70,6 +70,8 @@
>  #include 
>  #include 
>  
> +#include "lpc_ich_apl.h"
> +
>  #define ACPIBASE 0x40
>  #define ACPIBASE_GPE_OFF 0x28
>  #define ACPIBASE_GPE_END 0x2f
> @@ -1032,6 +1034,9 @@ static int lpc_ich_probe(struct pci_dev *dev,
>   cell_added = true;
>   }
>  
> + if (!lpc_ich_add_gpio(dev, priv->chipset))
> + cell_added = true;
> +

Like it's already used:

if (priv->chipset == XXX) {
 do_yyy(dev);
 cell_added = true;
}


-- 
Andy Shevchenko 
Intel Finland Oy
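
A sketch of the "Swap them" remark in the review above, under the assumption
that it means placing the warn label before the if (ret) check so that both
the early-failure path and the normal path share one test. The toy_* names
are hypothetical, and plain integers stand in for the results of p2sb_bar()
and mfd_add_devices().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical results of the two steps; 0 means success. */
struct toy_steps {
    int bar_ret; /* stands in for the p2sb_bar() result */
    int add_ret; /* stands in for the mfd_add_devices() result */
};

static int toy_add_gpio(const struct toy_steps *s, int *warnings)
{
    int ret;

    ret = s->bar_ret;
    if (ret)
        goto warn;

    ret = s->add_ret;

warn:
    /* Label first, then the check: one test of ret serves both paths. */
    if (ret) {
        fprintf(stderr, "Failed to add GPIO: %d\n", ret);
        (*warnings)++;
    }

    return ret;
}
```

With the label inside the if body, as in the quoted patch, the goto jumps into
the middle of a conditional statement; with the order swapped the control flow
reads top to bottom.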

Re: [PATCH v8 4/6] mfd: move enum lpc_chipsets into lpc_ich.h

2016-10-12 Thread Andy Shevchenko

On Wed, 2016-10-12 at 14:51 +0800, Tan Jui Nee wrote:
> Move the enum's definition into a standalone header file which can be
> used
> wherever its definition is needed.
> 
> --- a/include/linux/mfd/lpc_ich.h
> +++ b/include/linux/mfd/lpc_ich.h
> @@ -43,4 +43,75 @@ struct lpc_ich_info {
>   u8 use_gpio;
>  };
>  
> +/* chipset related info */
> +enum lpc_chipsets {

Maybe it is worth adding that the list should not be shuffled and that new
items should go at the end.

But it is up to you.

-- 
Andy Shevchenko 
Intel Finland Oy

[PATCH 3/4] Add hwcap2 for x86

2016-10-12 Thread Grzegorz Andrejczuk

Add hwcap2 attribute for x86.
Reserve 1st bit of HWCAP2 for exposing Xeon Phi ring 3 monitor/mwait.
With this userspace apps can detect Ring 3 MONITOR/MWAIT instructions.

Change-Id: I37d0354d1e2b9594d7feebc2bacda30b68163efe
Signed-off-by: Grzegorz Andrejczuk 
---
 arch/x86/include/asm/elf.h| 7 +++
 arch/x86/include/uapi/asm/hwcap.h | 7 +++
 arch/x86/kernel/cpu/common.c  | 3 +++
 3 files changed, 17 insertions(+)
 create mode 100644 arch/x86/include/uapi/asm/hwcap.h

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index e7f155c..a3f7856 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -258,6 +258,13 @@ extern int force_personality32;
 
 #define ELF_HWCAP  (boot_cpu_data.x86_capability[CPUID_1_EDX])
 
+extern unsigned int elf_hwcap2;
+
+/* HWCAP2 supplies kernel enabled CPU feature, so that the application
+   can know that it can safely use them. The bits are defined in
+   uapi/asm/hwcap.h. */
+#define ELF_HWCAP2 elf_hwcap2
+
 /* This yields a string that ld.so will use to load implementation
specific libraries for optimization.  This is more specific in
intent than poking at uname or /proc/cpuinfo.
diff --git a/arch/x86/include/uapi/asm/hwcap.h 
b/arch/x86/include/uapi/asm/hwcap.h
new file mode 100644
index 000..d1f4f98
--- /dev/null
+++ b/arch/x86/include/uapi/asm/hwcap.h
@@ -0,0 +1,7 @@
+#ifndef _ASM_HWCAP_H
+#define _ASM_HWCAP_H 1
+
+/* Kernel enabled Ring 3 MWAIT for Xeon Phi*/
+#define HWCAP2_PHIR3MWAIT  (1 << 0)
+/* upto bit 31 free */
+#endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index bcc9ccc..93ffaa5 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -51,6 +52,8 @@
 
 #include "cpu.h"
 
+unsigned elf_hwcap2 __read_mostly;
+
 /* all of these masks are initialized in setup_cpu_local_masks() */
 cpumask_var_t cpu_initialized_mask;
 cpumask_var_t cpu_callout_mask;
-- 
2.5.1

Re: [PATCH v7 2/8] power: add power sequence library

2016-10-12 Thread Heiko Stuebner

Hi,

Am Dienstag, 20. September 2016, 11:36:41 CEST schrieb Peter Chen:
> We have an well-known problem that the device needs to do some power
> sequence before it can be recognized by related host, the typical
> example like hard-wired mmc devices and usb devices.
> 
> This power sequence is hard to be described at device tree and handled by
> related host driver, so we have created a common power sequence
> library to cover this requirement. The core code has supplied
> some common helpers for host driver, and individual power sequence
> libraries handle kinds of power sequence for devices.
> 
> pwrseq_generic is intended for general purpose of power sequence, which
> handles gpios and clocks currently, and can cover regulator and pinctrl
> in future. The host driver just needs to call of_pwrseq_on/of_pwrseq_off
> if only one power sequence is needed, else call of_pwrseq_on_list
> /of_pwrseq_off_list instead (eg, USB hub driver).
>
> Signed-off-by: Peter Chen 
> Tested-by Joshua Clayton 
> Reviewed-by: Matthias Kaehlcke 
> Tested-by: Matthias Kaehlcke 

first of all, glad to see this move forward. I've only some qualms with the 
static number of allocated power sequences below.

[...]

> diff --git a/drivers/power/pwrseq/Kconfig b/drivers/power/pwrseq/Kconfig
> new file mode 100644
> index 000..dff5e35
> --- /dev/null
> +++ b/drivers/power/pwrseq/Kconfig
> @@ -0,0 +1,45 @@
> +#
> +# Power Sequence library
> +#
> +
> +config POWER_SEQUENCE
> + bool
> +
> +menu "Power Sequence Support"
> +
> +config PWRSEQ_GENERIC
> + bool "Generic power sequence control"
> + depends on OF
> + select POWER_SEQUENCE
> + help
> +It is used for drivers which needs to do power sequence
> +(eg, turn on clock, toggle reset gpio) before the related
> +devices can be found by hardware. This generic one can be
> +used for common power sequence control.
> +
> +config PWRSEQ_GENERIC_INSTANCE_NUMBER
> + int "Number of Generic Power Sequence Instance"
> + depends on PWRSEQ_GENERIC
> + range 1 10
> + default 2
> + help
> +Usually, there are not so many devices needs power sequence, we set 
> two
> +as default value.

limiting this to some arbitrary compile-time number seems crippling for the
single-image approach. I.e. a distribution might select some value and,
during its lifetime, a board requiring n+1 power sequences appears and thus
needs a different kernel version just to support that additional sequence.

Also, board designers are creative, and there were already complex examples 
mentioned elsewhere, so nothing keeps people from inventing something even 
more complex.
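
One possible alternative to a compile-time instance count, sketched in user
space: allocate each instance on demand and keep the instances on a list, so
the only limit is available memory. This is an illustration of the concern
above, not the pwrseq library's code; every toy_* name is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical power sequence instance. */
struct toy_pwrseq {
    unsigned int id;
    struct toy_pwrseq *next;
};

/* Hypothetical registry of all instances handed out so far. */
struct toy_pwrseq_pool {
    struct toy_pwrseq *head;
    unsigned int count;
};

/* Allocate a new instance on demand; returns NULL only if memory runs out. */
static struct toy_pwrseq *toy_pwrseq_get(struct toy_pwrseq_pool *pool)
{
    struct toy_pwrseq *seq = calloc(1, sizeof(*seq));

    if (!seq)
        return NULL;

    seq->id = pool->count++;
    seq->next = pool->head;
    pool->head = seq;
    return seq;
}

/* Release every instance and reset the registry. */
static void toy_pwrseq_put_all(struct toy_pwrseq_pool *pool)
{
    while (pool->head) {
        struct toy_pwrseq *seq = pool->head;

        pool->head = seq->next;
        free(seq);
    }
    pool->count = 0;
}
```

With this shape, a board that needs an eleventh sequence gets one without a
rebuild, whereas the quoted Kconfig range stops at 10.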

[...]

> diff --git a/drivers/power/pwrseq/pwrseq_generic.c
> b/drivers/power/pwrseq/pwrseq_generic.c new file mode 100644
> index 000..bcd16c3
> --- /dev/null
> +++ b/drivers/power/pwrseq/pwrseq_generic.c

[...]

> +static int pwrseq_generic_get(struct device_node *np, struct pwrseq
> *pwrseq) +{
> + struct pwrseq_generic *pwrseq_gen = to_generic_pwrseq(pwrseq);
> + enum of_gpio_flags flags;
> + int reset_gpio, clk, ret = 0;
> +
> + for (clk = 0; clk < PWRSEQ_MAX_CLKS; clk++) {
> + pwrseq_gen->clks[clk] = of_clk_get(np, clk);
> + if (IS_ERR(pwrseq_gen->clks[clk])) {
> + ret = PTR_ERR(pwrseq_gen->clks[clk]);
> + if (ret != -ENOENT)
> + goto err_put_clks;
> + pwrseq_gen->clks[clk] = NULL;
> + break;
> + }
> + }
> +
> + reset_gpio = of_get_named_gpio_flags(np, "reset-gpios", 0, &flags);
> + if (gpio_is_valid(reset_gpio)) {
> + unsigned long gpio_flags;
> +
> + if (flags & OF_GPIO_ACTIVE_LOW)
> + gpio_flags = GPIOF_ACTIVE_LOW | GPIOF_OUT_INIT_LOW;
> + else
> + gpio_flags = GPIOF_OUT_INIT_HIGH;
> +
> + ret = gpio_request_one(reset_gpio, gpio_flags,
> + "pwrseq-reset-gpios");
> + if (ret)
> + goto err_put_clks;
> +
> + pwrseq_gen->gpiod_reset = gpio_to_desc(reset_gpio);
> + of_property_read_u32(np, "reset-duration-us",
> + &pwrseq_gen->duration_us);
> + } else {
> + if (reset_gpio == -ENOENT)
> + return 0;
> +
> + ret = reset_gpio;
> + pr_err("Failed to get reset gpio on %s, err = %d\n",
> + np->full_name, reset_gpio);
> + goto err_put_clks;
> + }
> +
> + return ret;
> +
> +err_put_clks:
> + while (--clk >= 0)
> + clk_put(pwrseq_gen->clks[clk]);
> + return ret;
> +}
> +
> +static const struct of_device_id generic_id_table[] = {
> + { .compatible = "generic",},
> + { /* sentinel */ }
> +};
> +
> +static int __init pwrseq_generic_register(void)
> +{
> + struct pwrseq

Re: [PATCH v1 4/4] Add R3MWAIT to CPU features

2016-10-12 Thread Borislav Petkov

On Wed, Oct 12, 2016 at 12:13:10PM +0200, Grzegorz Andrejczuk wrote:
> Add cpu feature for ring 3 monitor/mwait.
> 
> Change-Id: Iba4d20639efd8d3637d37db9294cbc43a98f009a

Please no internal IDs in upstream submission.

> Signed-off-by: Grzegorz Andrejczuk 

...

> diff --git a/arch/x86/include/asm/cpufeatures.h 
> b/arch/x86/include/asm/cpufeatures.h
> index 92a8308..242cd16 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -12,7 +12,7 @@
>  /*
>   * Defines x86 CPU feature bits
>   */
> -#define NCAPINTS 18  /* N 32-bit words worth of info */
> +#define NCAPINTS 19  /* N 32-bit words worth of info */
>  #define NBUGINTS 1   /* N 32-bit bug flags */
>  
>  /*
> @@ -286,6 +286,10 @@
>  #define X86_FEATURE_SUCCOR   (17*32+1) /* Uncorrectable error containment 
> and recovery */
>  #define X86_FEATURE_SMCA (17*32+3) /* Scalable MCA */
>  
> +
> +/* non architectural Intel-defined CPU features not present in CPUID, word 
> 18 */
> +#define X86_FEATURE_PHIR3MWAIT   (18*32+ 0)

Please use init_scattered_cpuid_features() for the whole
thing. There are some free bits in word 3 for example, see
arch/x86/include/asm/cpufeatures.h.

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
--

Re: [PATCH v8 3/6] x86/intel-ivi: Add Intel In-Vehicle Infotainment (IVI) systems used in cars support

2016-10-12 Thread Andy Shevchenko

On Wed, 2016-10-12 at 14:51 +0800, Tan Jui Nee wrote:
> Add support for non ACPI system, such as system that uses Advanced
> Boot
> Loader (ABL) whereby a platform device has to be created in order to
> bind
> with PINCTRL/GPIO.
> 
> At the moment, Intel Apollo Lake SoC requires P2SB driver to hide and
> unhide P2SB to lookup P2SB BAR and pass the PCI BAR address to GPIO.

I dunno if this patch would go as a last in the series.

> 
> +config X86_INTEL_IVI
> + bool "Intel In-Vehicle Infotainment (IVI) systems used in
> cars"
> + ---help---
> +   Select this option to enable MMIO BAR access over the P2SB
> for
> +   non-ACPI Intel Apollo Lake SoC platforms.

This does not sound like what the option is used for.
From what I see in the code, it is as simple as "Enable support of Intel IVI
systems. This enables necessary drivers and libraries which are used in
IVI systems."


>  This driver uses the P2SB
> +   hide/unhide mechanism cooperatively to pass the PCI BAR
> address to
> +   the platform driver, currently GPIO.


-- 
Andy Shevchenko 
Intel Finland Oy

Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-12 Thread Haozhong Zhang

On 10/11/16 13:17 -0700, Dan Williams wrote:

On Tue, Oct 11, 2016 at 12:48 PM, Konrad Rzeszutek Wilk
 wrote:

On Tue, Oct 11, 2016 at 12:28:56PM -0700, Dan Williams wrote:

On Tue, Oct 11, 2016 at 11:33 AM, Konrad Rzeszutek Wilk
 wrote:
> On Tue, Oct 11, 2016 at 10:51:19AM -0700, Dan Williams wrote:
[..]
>> Right, but why does the libnvdimm core need to know about this
>> specific Xen reservation?  For example, if Xen wants some in-kernel
>
> Let me turn this around - why does the libnvdimm core need to know about
> Linux specific parts? Shouldn't this be OS agnostic, so that FreeBSD
> for example can also poke a hole in this and fill it with its
> OS-management meta-data?

Specifically the core needs to know so that it can answer the Linux
specific question of whether the pfn returned by ->direct_access() has
a corresponding struct page or not. It's tied to the lifetime of the
device and the usage of the reservation needs to be coordinated
against the references of those pages.  If FreeBSD decides it needs to
reserve "struct page" capacity at the start of the device, I would
hope that it reuses the same on-device info block that Linux is using
and not create a new "FreeBSD-mode" device type.

The issue here (as I understand, I may be missing something new)
is that the size of this special namespace may be different. That is
the 'struct page' on FreeBSD could be 256 bytes while on Linux it is
64 bytes (numbers pulled out of the sky).

Hence one would have to expand or such to re-use this.

Sure, but we could support that today.  If FreeBSD lays down the info
block it is free to make a bigger reservation and Linux would be happy
to use a smaller subset.  If we, as an industry, want this "struct
page" reservation to be common we can take it to a standards body to
make as a cross-OS guarantee... but I think this is separate from the
Xen reservation.

To be honest I do not yet understand what metadata Xen wants to store
in the device, but it seems the producer and consumer of that metadata
is Xen itself and not the wider Linux kernel as is the case with
struct page.  Can you fill me in on what problem Xen solves with this

Exactly!

reservation?

The same as Linux - its variant of 'struct page'. Which I think is
smaller than the Linux one, but perhaps it is not?

If the hypervisor needs to know where it can store some metadata, can
that be satisfied with userspace tooling in Dom0? Something like,
"/dev/pmem0p1 == Xen metadata" and "/dev/pmem0p2 == DAX filesystem
with files to hand to guests".  So my question is not about the
rationale for having metadata, it's why does the Linux kernel need to
know about the Xen reservation? As far as I can see it is independent
/ opaque to the kernel.

Thanks, everyone, for all these comments!

How about doing the reservation in the following way:

1. Create partition(s) on /dev/pmemX and make sure space besides the
  partition table and potential padding before the first partition is
  large enough to hold Xen's management structures and a super block
  introduced in step 2. The space besides the partition table,
  padding and the super block will be used as the reserved area.

2. Write a super block before above reserved area. The super block
  records the base address and the size of the reserved area. It also
  contains a signature and a checksum to identify itself.

The layout is shown in the following diagram.

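+---------------+-----------+-------+----------+--------------+
| whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
|  by kernel    |   Table   | Block | for Xen  |              |
+---------------+-----------+-------+----------+--------------+
\______________________________ ______________________________/
                               V
                           /dev/pmem0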

Above two steps can be done via a userspace program and do not need
Xen hypervisor running. The partitions on the device can be used
regardless of the existence of Xen hypervisor.

3. When Xen is running, implement a function in Dom0 Linux xen driver
  (drivers/xen/) to respond to udevd events that notify the
  detection of the pmem regions.

  This function searches on the pmem region for the super block
  created in step 2. If one is found, it will know this pmem region
  has been prepared for Xen usage.

  Then it gets the base address and size of the reserved area (from
  super block) and the entire address ranges of the pmem region (from
  pmem driver), and reports them to Xen hypervisor.

The implementation of this step can be contained entirely in the
kernel Xen driver. (It could also be implemented as a udevd service in
userspace, if that is not considered unsafe.)
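
To make steps 2 and 3 concrete, here is a minimal userspace sketch of
such a super block, assuming a fixed 8-byte signature and a plain
additive checksum. Every name in it (xen_pmem_sb, XEN_PMEM_SB_SIGNATURE,
the field layout) is made up for illustration and is not an existing Xen
or Linux interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signature: 7 characters plus the terminating NUL = 8 bytes. */
#define XEN_PMEM_SB_SIGNATURE "XENPMEM"

/* Hypothetical on-media layout of the super block from step 2. */
struct xen_pmem_sb {
	char     signature[8];	/* identifies the block */
	uint64_t rsv_base;	/* base address of the reserved area */
	uint64_t rsv_size;	/* size of the reserved area in bytes */
	uint64_t checksum;	/* byte sum of all fields above */
};

/* Additive checksum over every byte that precedes the checksum field. */
static uint64_t xen_pmem_sb_sum(const struct xen_pmem_sb *sb)
{
	const unsigned char *p = (const unsigned char *)sb;
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < offsetof(struct xen_pmem_sb, checksum); i++)
		sum += p[i];
	return sum;
}

/* Step 2: what a userspace tool would write before the reserved area. */
static void xen_pmem_sb_init(struct xen_pmem_sb *sb, uint64_t base,
			     uint64_t size)
{
	assert(sizeof(XEN_PMEM_SB_SIGNATURE) == sizeof(sb->signature));

	memset(sb, 0, sizeof(*sb));
	memcpy(sb->signature, XEN_PMEM_SB_SIGNATURE, sizeof(sb->signature));
	sb->rsv_base = base;
	sb->rsv_size = size;
	sb->checksum = xen_pmem_sb_sum(sb);
}

/* Step 3: returns 1 if the candidate block carries a valid super block. */
static int xen_pmem_sb_valid(const struct xen_pmem_sb *sb)
{
	if (memcmp(sb->signature, XEN_PMEM_SB_SIGNATURE,
		   sizeof(sb->signature)) != 0)
		return 0;
	return sb->checksum == xen_pmem_sb_sum(sb);
}
```

A Dom0 driver (or udevd service) would run the validity check on each
detected pmem region and only report rsv_base/rsv_size to the hypervisor
when it passes. A real format would likely want a stronger checksum and
an explicit endianness.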

Thanks,
Haozhong

[PATCH] qede: fix CONFIG_INFINIBAND_QEDR=m build error

2016-10-12 Thread Arnd Bergmann

The newly introduced INFINIBAND_QEDR option is 'tristate' but
fails to build when set to 'm':

drivers/net/built-in.o: In function `qed_hw_init':
(.text+0x1c0e17): undefined reference to `qed_rdma_dpm_bar'
drivers/net/built-in.o: In function `qed_eq_completion':
(.text+0x1d185b): undefined reference to `qed_async_roce_event'
drivers/net/built-in.o: In function `qed_ll2_txq_completion':
qed_ll2.c:(.text+0x1e2fdd): undefined reference to 
`qed_ll2b_complete_tx_gsi_packet'
drivers/net/built-in.o: In function `qed_ll2_rxq_completion':
qed_ll2.c:(.text+0x1e479a): undefined reference to 
`qed_ll2b_complete_rx_gsi_packet'
drivers/net/built-in.o: In function `qed_ll2_terminate_connection':
(.text+0x1e5645): undefined reference to `qed_ll2b_release_tx_gsi_packet'

There are multiple problems here:

- The option should be 'bool', as this is not a separate module
  but rather a single file that gets added to the normal driver
  module

- The qed_rdma_dpm_bar() helper function should have been 'static
  inline' as it's declared in a header file, the current workaround
  of including qed_roce.h conditionally is not good

- There is no reason to use '#if' all the time to check for the
  symbol, it should use 'if IS_ENABLED()' to make the code
  more readable and get better compile coverage.

This addresses all three of the above.
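
The difference between '#if' and 'if IS_ENABLED()' can be sketched in
plain userspace C. The IS_ENABLED() below is a simplified stand-in
modelled on the macro trick in include/linux/kconfig.h (it ignores the
=m case), and the CONFIG_DEMO_* symbols and demo_* functions are
hypothetical names used only for this illustration.

```c
/*
 * Simplified stand-in for the kernel's IS_ENABLED(): expands to 1 when
 * the option is defined to 1 and to 0 when it is not defined at all.
 */
#define ARG_PLACEHOLDER_1 0,
#define TAKE_SECOND_ARG(ignored, val, ...) val
#define IS_DEFINED_3(arg1_or_junk) TAKE_SECOND_ARG(arg1_or_junk 1, 0, 0)
#define IS_DEFINED_2(val) IS_DEFINED_3(ARG_PLACEHOLDER_##val)
#define IS_ENABLED(option) IS_DEFINED_2(option)

#define CONFIG_DEMO_RDMA 1
/* CONFIG_DEMO_OTHER is intentionally left undefined. */

static int demo_rdma_setup_calls;

/* Always declared, so callers compile whether or not the option is set. */
static void demo_rdma_setup(void)
{
	demo_rdma_setup_calls++;
}

/* Replaces an #if/#else/#endif pair around two #define lines. */
static int demo_page_size_shift(void)
{
	return IS_ENABLED(CONFIG_DEMO_RDMA) ? 4 : 3;
}

/* Replaces an #if/#endif pair around the call. */
static void demo_hw_init(int cond)
{
	if (IS_ENABLED(CONFIG_DEMO_RDMA) && cond)
		demo_rdma_setup();
}
```

Because the condition is a compile-time constant, the compiler can still
discard the dead branch, but both branches are parsed and type-checked
in every configuration, which is where the better compile coverage comes
from.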

Fixes: cee9fbd8e2e9 ("qede: Add qedr framework")
Signed-off-by: Arnd Bergmann 
---
 drivers/net/ethernet/qlogic/Kconfig|  2 +-
 drivers/net/ethernet/qlogic/qed/qed_cxt.c  |  6 +-
 drivers/net/ethernet/qlogic/qed/qed_dev.c  |  7 +++
 drivers/net/ethernet/qlogic/qed/qed_main.c | 24 +++-
 drivers/net/ethernet/qlogic/qed/qed_roce.h |  4 
 drivers/net/ethernet/qlogic/qed/qed_spq.c  | 13 ++---
 6 files changed, 22 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/Kconfig 
b/drivers/net/ethernet/qlogic/Kconfig
index 0df1391f9663..90562cf8fa19 100644
--- a/drivers/net/ethernet/qlogic/Kconfig
+++ b/drivers/net/ethernet/qlogic/Kconfig
@@ -108,7 +108,7 @@ config QEDE
  This enables the support for ...
 
 config INFINIBAND_QEDR
-   tristate "QLogic qede RoCE sources [debug]"
+   bool "QLogic qede RoCE sources [debug]"
depends on QEDE && 64BIT
select QED_LL2
default n
diff --git a/drivers/net/ethernet/qlogic/qed/qed_cxt.c 
b/drivers/net/ethernet/qlogic/qed/qed_cxt.c
index 82370a1a59ad..0a3ffcd9f073 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_cxt.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_cxt.c
@@ -48,12 +48,8 @@
 #define TM_ELEM_SIZE4
 
 /* ILT constants */
-#if IS_ENABLED(CONFIG_INFINIBAND_QEDR)
 /* For RoCE we configure to 64K to cover for RoCE max tasks 256K purpose. */
-#define ILT_DEFAULT_HW_P_SIZE  4
-#else
-#define ILT_DEFAULT_HW_P_SIZE  3
-#endif
+#define ILT_DEFAULT_HW_P_SIZE  (IS_ENABLED(CONFIG_INFINIBAND_QEDR) ? 4 : 3)
 
 #define ILT_PAGE_IN_BYTES(hw_p_size)   (1U << ((hw_p_size) + 12))
 #define ILT_CFG_REG(cli, reg)  PSWRQ2_REG_ ## cli ## _ ## reg ## _RT_OFFSET
diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c 
b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index 754f6a908858..63a38e3b8f3f 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -890,7 +890,7 @@ qed_hw_init_pf_doorbell_bar(struct qed_hwfn *p_hwfn, struct 
qed_ptt *p_ptt)
n_cpus = 1;
rc = qed_hw_init_dpi_size(p_hwfn, p_ptt, pwm_regsize, n_cpus);
 
-   if (cond)
+   if (IS_ENABLED(CONFIG_INFINIBAND_QEDR) && cond)
qed_rdma_dpm_bar(p_hwfn, p_ptt);
}
 
@@ -1422,19 +1422,18 @@ static void qed_hw_set_feat(struct qed_hwfn *p_hwfn)
u32 *feat_num = p_hwfn->hw_info.feat_num;
int num_features = 1;
 
-#if IS_ENABLED(CONFIG_INFINIBAND_QEDR)
/* Roce CNQ each requires: 1 status block + 1 CNQ. We divide the
 * status blocks equally between L2 / RoCE but with consideration as
 * to how many l2 queues / cnqs we have
 */
-   if (p_hwfn->hw_info.personality == QED_PCI_ETH_ROCE) {
+   if (IS_ENABLED(CONFIG_INFINIBAND_QEDR) &&
+   p_hwfn->hw_info.personality == QED_PCI_ETH_ROCE) {
num_features++;
 
feat_num[QED_RDMA_CNQ] =
min_t(u32, RESC_NUM(p_hwfn, QED_SB) / num_features,
  RESC_NUM(p_hwfn, QED_RDMA_CNQ_RAM));
}
-#endif
feat_num[QED_PF_L2_QUE] = min_t(u32, RESC_NUM(p_hwfn, QED_SB) /
num_features,
RESC_NUM(p_hwfn, QED_L2_QUEUE));
diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c 
b/drivers/net/ethernet/qlogic/qed/qed_main.c
index 4ee3151e80c2..36023a3583f2 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -33,10 +33,8 @@
 #include "qed_hw.h"
 #include "qed_selftest.h

Re: [PATCH v1 2/4] Add enabling of the R3 MWAIT during boot for KNL

2016-10-12 Thread Borislav Petkov

On Wed, Oct 12, 2016 at 12:13:08PM +0200, Grzegorz Andrejczuk wrote:
> If processor is Intel Xeon Phi we enable user-level mwait feature.
> Enabling this feature suppresses invalid-opcode error, when MONITOR/MWAIT
> is called from ring 3.
> 
> Change-Id: I1c7defb99296b022790a068a6c725b3e860cd68c
> Signed-off-by: Grzegorz Andrejczuk 
> ---
>  arch/x86/kernel/cpu/intel.c | 26 ++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index fcd484d..7f0f01a 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -61,6 +61,14 @@ void check_mpx_erratum(struct cpuinfo_x86 *c)
>   }
>  }
>  
> +static int phir3mwait = 1;
> +static int __init phir3mwait_disable(char *value)
> +{
> + phir3mwait = 0;
> + return 1;
> +}
> +__setup("intel-phir3mwait=disable", phir3mwait_disable);

That's a lot of typing on the cmdline. "r3mwait=disable" looks just as
fine to me, for example.

>  static void early_init_intel(struct cpuinfo_x86 *c)
>  {
>   u64 misc_enable;
> @@ -211,6 +219,24 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>   }
>  
>   check_mpx_erratum(c);
> +
> + /*
> + * Setting ring 3 MONITOR/MWAIT for all threads
> + * when CPU is Xeon Phi Family x200
> + * This can be disabled with phir3mwait=disable cmdline switch.
> + * We preserve the reserved values and set only 2nd bit.
> + * Ref:
> + * 
> https://software.intel.com/en-us/blogs/2016/10/06/intel-xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait
> + */
> + if (c->x86 == 6 &&
> + c->x86_model == INTEL_FAM6_XEON_PHI_KNL &&
> + phir3mwait) {
> + u64 prev;
> +
> + rdmsrl(MSR_PHI_MISC_THD_FEATURE_ENABLE, prev);
> + wrmsrl(MSR_PHI_MISC_THD_FEATURE_ENABLE,
> +prev | MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT);

Wanna test the MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT bit before doing
the MSR write?

Btw, you might want to shorten those define names - they're huuge.
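
A rough userspace sketch of the "test the bit before writing" idea, with
a plain variable standing in for the MSR. The demo_* helpers and
DEMO_R3MWAIT are hypothetical shorter names, and the bit position
(bit 1) is only an assumption based on the "2nd bit" wording in the
quoted comment.

```c
#include <stdint.h>

#define DEMO_R3MWAIT (1ULL << 1)	/* assumed position of the enable bit */

static uint64_t demo_msr;		/* stands in for the real MSR */
static unsigned int demo_msr_writes;	/* counts simulated MSR writes */

static uint64_t demo_rdmsr(void)
{
	return demo_msr;
}

static void demo_wrmsr(uint64_t val)
{
	demo_msr = val;
	demo_msr_writes++;
}

/* Set the enable bit, preserving the other bits, only if it is clear. */
static void demo_enable_r3mwait(void)
{
	uint64_t prev = demo_rdmsr();

	if (prev & DEMO_R3MWAIT)
		return;			/* already enabled: skip the write */

	demo_wrmsr(prev | DEMO_R3MWAIT);
}
```

Calling demo_enable_r3mwait() a second time performs no further write,
which is the point of testing the bit first.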

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
--

Re: [PATCH] mm: page_alloc: Use KERN_CONT where appropriate

2016-10-12 Thread Joe Perches

(resending as lkml bounced)

On Wed, 2016-10-12 at 11:10 +0200, Michal Hocko wrote:
> On Tue 11-10-16 19:24:55, Joe Perches wrote:
> > Recent changes to printk require KERN_CONT uses to continue logging
> > messages.  So add KERN_CONT where necessary.
> 
> 
> 
> I was really wondering what happened when Aaron reported an allocation
> failure http://lkml.kernel.org/r/20161012065423.ga16...@aaronlu.sh.intel.com
> See the attached log got the current Linus' tree
> 
> Fixes: 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation 
> lines")
> > Signed-off-by: Joe Perches 
> 
> 
> 
> Acked-by: Michal Hocko 
> 
> I believe we can simplify the code a bit as well. What do you think
> about the following on top?


Hi Michal

I think the show_node to show_zone_node renaming is superfluous,
but if it makes you happy, it doesn't bother me.

This recent change to printk logging making KERN_CONT necessary to
continue a line might be reverted when it's better known just how
many instances in the kernel tree will need to be changed.

For now, I'd rather keep the KERN_CONT "\n" and trailing "\n" as
there are _very_ few missing newlines in logging messages today
and removing them now might be a bit early process-wise.

Dunno.

> --- 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6f8c356140a0..7e1b74ee79cb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4078,10 +4078,12 @@ unsigned long nr_free_pagecache_pages(void)
>   return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE));
>  }
>  
> -static inline void show_node(struct zone *zone)
> +static inline void show_zone_node(struct zone *zone)
>  {
>   if (IS_ENABLED(CONFIG_NUMA))
> - printk("Node %d ", zone_to_nid(zone));
> + printk("Node %d %s", zone_to_nid(zone), zone->name);
> + else
> + printk("%s: ", zone->name);
>  }
>  
>  long si_mem_available(void)
> @@ -4329,9 +4331,8 @@ void show_free_areas(unsigned int filter)
>   for_each_online_cpu(cpu)
>   free_pcp += per_cpu_ptr(zone->pageset, cpu)->pcp.count;
>  
> - show_node(zone);
> + show_zone_node(zone);
>   printk(KERN_CONT
> - "%s"
>   " free:%lukB"
>   " min:%lukB"
>   " low:%lukB"
> @@ -4354,7 +4355,6 @@ void show_free_areas(unsigned int filter)
>   " local_pcp:%ukB"
>   " free_cma:%lukB"
>   "\n",
> - zone->name,
>   K(zone_page_state(zone, NR_FREE_PAGES)),
>   K(min_wmark_pages(zone)),
>   K(low_wmark_pages(zone)),
> @@ -4379,7 +4379,6 @@ void show_free_areas(unsigned int filter)
>   printk("lowmem_reserve[]:");
>   for (i = 0; i < MAX_NR_ZONES; i++)
>   printk(KERN_CONT " %ld", zone->lowmem_reserve[i]);
> - printk(KERN_CONT "\n");
>   }
>  
>   for_each_populated_zone(zone) {
> @@ -4389,8 +4388,7 @@ void show_free_areas(unsigned int filter)
>  
>   if (skip_free_areas_node(filter, zone_to_nid(zone)))
>   continue;
> - show_node(zone);
> - printk(KERN_CONT "%s: ", zone->name);
> + show_zone_node(zone);
>  
>   spin_lock_irqsave(&zone->lock, flags);
>   for (order = 0; order < MAX_ORDER; order++) {

Re: MPOL_BIND on memory only nodes

2016-10-12 Thread Anshuman Khandual

On 10/12/2016 03:13 PM, Michal Hocko wrote:
> On Wed 12-10-16 14:55:24, Anshuman Khandual wrote:
>> Hi,
>>
>> We have the following function policy_zonelist() which selects a zonelist
>> during various allocation paths. With this, general user space allocations
>> (IIUC might not have __GFP_THISNODE) fails while trying to get memory from
>> a memory only node without CPUs as the application runs some where else
>> and that node is not part of the nodemask.

My bad. Was playing with some changes to the zonelists rebuild after
a memory node hotplug and the order of various zones in them.

> 
> I am not sure I understand. So you have a task with MPOL_BIND without a
> cpu less node in the mask and you are wondering why the memory is not
> allocated from that node?

In my experiment, there is a MPOL_BIND call with a CPU less node in
the node mask and the memory is not allocated from that CPU less node.
That's because the zone of the CPU less node was absent from the
FALLBACK zonelist of the local node.

> 
>> Why we insist on __GFP_THISNODE ?
> 
> AFAIU __GFP_THISNODE just overrides the given node to the policy
> nodemask in case the current node is not part of that node mask. In
> other words we are ignoring the given node and use what the policy says. 

Right, but only if the gfp flag has __GFP_THISNODE in it. In the absence
of __GFP_THISNODE, the node from the nodemask will not be selected. I
still wonder why. Can we always go to the first node in the nodemask
for MPOL_BIND interface calls? Just curious to know why preference
is given to the local node and its FALLBACK zonelist.

> I can see how this can be confusing especially when confronting the
> documentation:
> 
>  * __GFP_THISNODE forces the allocation to be satisified from the requested
>  *   node with no fallbacks or placement policy enforcements.
> 

Yeah, right.

Thanks for your reply.

Re: [PATCH] mm: kmemleak: Ensure that the task stack is not freed during scanning

2016-10-12 Thread Michal Hocko

On Wed 12-10-16 10:57:03, Catalin Marinas wrote:
> Commit 68f24b08ee89 ("sched/core: Free the stack early if
> CONFIG_THREAD_INFO_IN_TASK") may cause the task->stack to be freed
> during kmemleak_scan() execution, leading to either a NULL pointer
> fault (if task->stack is NULL) or kmemleak accessing already freed
> memory. This patch uses the new try_get_task_stack() API to ensure that
> the task stack is not freed during kmemleak stack scanning.

Looks good to me
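
For readers unfamiliar with the try-get/put pattern the patch relies on,
here is a minimal userspace sketch using C11 atomics. It only
illustrates the general idea (take a reference only while the object is
still live); struct demo_task and the demo_* functions are hypothetical
and are not how the kernel implements try_get_task_stack().

```c
#include <stdatomic.h>
#include <stddef.h>

struct demo_task {
	atomic_int stack_refs;	/* 0 means the stack has been released */
	void *stack;
};

/* Take a reference only if the stack is still live; NULL otherwise. */
static void *demo_try_get_stack(struct demo_task *t)
{
	int old = atomic_load(&t->stack_refs);

	while (old != 0) {
		/* On failure 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&t->stack_refs, &old,
						 old + 1))
			return t->stack;
	}
	return NULL;
}

/* Drop a reference; the last put releases the stack. */
static void demo_put_stack(struct demo_task *t)
{
	if (atomic_fetch_sub(&t->stack_refs, 1) == 1)
		t->stack = NULL;	/* a real implementation would free here */
}
```

A scanner written against this sketch would call demo_try_get_stack(),
scan only when it returns non-NULL, and then call demo_put_stack(),
mirroring the shape of the hunk below.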
 
> Fixes: 68f24b08ee89 ("sched/core: Free the stack early if 
> CONFIG_THREAD_INFO_IN_TASK")
> Cc: Andrew Morton 
> Cc: Andy Lutomirski 
> Cc: CAI Qian 
> Reported-by: CAI Qian 
> Signed-off-by: Catalin Marinas 

Acked-by: Michal Hocko 

> ---
> 
> This was reported in a subsequent comment here:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=173901
> 
> However, the original bugzilla entry doesn't look related to task stack
> freeing as it was first reported on 4.8-rc8. Andy, sorry for cc'ing you
> to bugzilla, please feel free to remove your email from the bug above (I
> can't seem to be able to do it).
> 
>  mm/kmemleak.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index a5e453cf05c4..e5355a5b423f 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -1453,8 +1453,11 @@ static void kmemleak_scan(void)
>  
>   read_lock(&tasklist_lock);
>   do_each_thread(g, p) {
> - scan_block(task_stack_page(p), task_stack_page(p) +
> -THREAD_SIZE, NULL);
> + void *stack = try_get_task_stack(p);
> + if (stack) {
> + scan_block(stack, stack + THREAD_SIZE, NULL);
> + put_task_stack(p);
> + }
>   } while_each_thread(g, p);
>   read_unlock(&tasklist_lock);
>   }
> 

-- 
Michal Hocko
SUSE Labs

Re: [PATCH] ext4: super.c: Update logging style using PR_CONT

2016-10-12 Thread Jan Kara

On Tue 11-10-16 18:57:58, Joe Perches wrote:
> Recent commit require line continuing printks to use PR_CONT.
> 
> Update super.c to use PR_CONT and use vsprintf extension %pV
> to avoid a printk/vprintk/printk("\n") sequence as well.

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza
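
As an aside, the "one call instead of printk/vprintk/printk" idea can be
sketched in userspace. Plain C has no %pV, so this sketch, under that
assumption, formats the caller's message into a temporary buffer first
and then emits prefix, message and newline with a single format call.
demo_format_error() and the buffer size are hypothetical.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Build the whole log line in one go so that no continuation lines are
 * needed. Returns what snprintf() returns for the final line.
 */
static int demo_format_error(char *buf, size_t len, const char *dev,
			     const char *function, unsigned int line,
			     const char *fmt, ...)
{
	char msg[128];	/* temporary home for the caller's message */
	va_list args;

	va_start(args, fmt);
	vsnprintf(msg, sizeof(msg), fmt, args);
	va_end(args);

	return snprintf(buf, len, "EXT4-fs error (device %s): %s:%u: %s\n",
			dev, function, line, msg);
}
```

The kernel's %pV avoids the temporary buffer by letting vsnprintf
recurse into the nested format and va_list directly.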

> 
> Signed-off-by: Joe Perches 
> ---
>  fs/ext4/super.c | 21 +++--
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6db81fbcbaa6..20da99da0a34 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -597,14 +597,15 @@ void __ext4_std_error(struct super_block *sb, const 
> char *function,
>  void __ext4_abort(struct super_block *sb, const char *function,
>   unsigned int line, const char *fmt, ...)
>  {
> + struct va_format vaf;
>   va_list args;
>  
>   save_error_info(sb, function, line);
>   va_start(args, fmt);
> - printk(KERN_CRIT "EXT4-fs error (device %s): %s:%d: ", sb->s_id,
> -function, line);
> - vprintk(fmt, args);
> - printk("\n");
> + vaf.fmt = fmt;
> + vaf.va = &args;
> + printk(KERN_CRIT "EXT4-fs error (device %s): %s:%d: %pV\n",
> +sb->s_id, function, line, &vaf);
>   va_end(args);
>  
>   if ((sb->s_flags & MS_RDONLY) == 0) {
> @@ -2715,12 +2716,12 @@ static void print_daily_error_info(unsigned long arg)
>  es->s_first_error_func,
>  le32_to_cpu(es->s_first_error_line));
>   if (es->s_first_error_ino)
> - printk(": inode %u",
> + printk(KERN_CONT ": inode %u",
>  le32_to_cpu(es->s_first_error_ino));
>   if (es->s_first_error_block)
> - printk(": block %llu", (unsigned long long)
> + printk(KERN_CONT ": block %llu", (unsigned long long)
>  le64_to_cpu(es->s_first_error_block));
> - printk("\n");
> + printk(KERN_CONT "\n");
>   }
>   if (es->s_last_error_time) {
>   printk(KERN_NOTICE "EXT4-fs (%s): last error at time %u: 
> %.*s:%d",
> @@ -2729,12 +2730,12 @@ static void print_daily_error_info(unsigned long arg)
>  es->s_last_error_func,
>  le32_to_cpu(es->s_last_error_line));
>   if (es->s_last_error_ino)
> - printk(": inode %u",
> + printk(KERN_CONT ": inode %u",
>  le32_to_cpu(es->s_last_error_ino));
>   if (es->s_last_error_block)
> - printk(": block %llu", (unsigned long long)
> + printk(KERN_CONT ": block %llu", (unsigned long long)
>  le64_to_cpu(es->s_last_error_block));
> - printk("\n");
> + printk(KERN_CONT "\n");
>   }
>   mod_timer(&sbi->s_err_report, jiffies + 24*60*60*HZ);  /* Once a day */
>  }
> -- 
> 2.10.0.rc2.1.g053435c
> 
-- 
Jan Kara 
SUSE Labs, CR

Re: [PATCH] mm: kmemleak: Ensure that the task stack is not freed during scanning

2016-10-12 Thread Catalin Marinas

On Wed, Oct 12, 2016 at 06:16:46PM +0800, Hillf Danton wrote:
> > @@ -1453,8 +1453,11 @@ static void kmemleak_scan(void)
> > 
> > read_lock(&tasklist_lock);
> > do_each_thread(g, p) {
> 
> Take a look at this commit please.
>   1da4db0cd5 ("oom_kill: change oom_kill.c to use for_each_thread()")

Thanks. Isn't holding tasklist_lock here enough to avoid such races?

-- 
Catalin

RE: [PATCH]"drm: change DRM_MIPI_DSI module type from "bool" to "tristate".

2016-10-12 Thread Jani Nikula

On Wed, 12 Oct 2016, "Sun, Jing A"  wrote:
> I think "installing a kernel with my changes for both drm and i915"
> takes more time and effort to complete than "only updating DRM/i915
> modules without rebuilding the whole kernel". In some cases, that's
> beneficial.

It's possible to change and rebuild and update just the drm and i915,
but you need to be careful to build against the same tree as the ones
you are replacing. This is like using out-of-tree modules (which is
something I can't recommend no matter what, but that's another
discussion).

However, this is completely different from planning to update drm and
i915 modules on a running production system by unloading the old ones
and probing the new ones. Don't do that. It will be a disaster.

> Also reloadablility is always a good thing to have and I truly hope
> Hajda/Iwai's patches would be accepted and merged.  No downside of it
> after all.

I think it's good to be able to unload and reload modules for debugging
and development, but not for normal use.

BR,
Jani.

-- 
Jani Nikula, Intel Open Source Technology Center

Re: [PATCH 1/7 v4] sched: factorize attach entity

2016-10-12 Thread Vincent Guittot

On 7 October 2016 at 01:11, Vincent Guittot  wrote:
>
> On 5 October 2016 at 11:38, Dietmar Eggemann  wrote:
> > On 09/26/2016 01:19 PM, Vincent Guittot wrote:
> >>
> >> Factorize post_init_entity_util_avg and part of attach_task_cfs_rq
> >> in one function attach_entity_cfs_rq
> >>
> >> Signed-off-by: Vincent Guittot 
> >> ---
> >>  kernel/sched/fair.c | 19 +++
> >>  1 file changed, 11 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 986c10c..e8ed8d1 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -697,9 +697,7 @@ void init_entity_runnable_average(struct sched_entity
> >> *se)
> >>  }
> >>
> >>  static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
> >> -static int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool
> >> update_freq);
> >> -static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force);
> >> -static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct
> >> sched_entity *se);
> >> +static void attach_entity_cfs_rq(struct sched_entity *se);
> >>
> >>  /*
> >>   * With new tasks being created, their initial util_avgs are extrapolated
> >> @@ -764,9 +762,7 @@ void post_init_entity_util_avg(struct sched_entity
> >> *se)
> >>   }
> >>   }
> >
> >
> > You now could move the 'u64 now = cfs_rq_clock_task(cfs_rq);' into the
> > if condition to handle !fair_sched_class tasks.
>
> yes
>
> >
> >> - update_cfs_rq_load_avg(now, cfs_rq, false);
> >> - attach_entity_load_avg(cfs_rq, se);
> >> - update_tg_load_avg(cfs_rq, false);
> >> + attach_entity_cfs_rq(se);
> >>  }
> >>
> >>  #else /* !CONFIG_SMP */
> >> @@ -8501,9 +8497,8 @@ static void detach_task_cfs_rq(struct task_struct
> >> *p)
> >>   update_tg_load_avg(cfs_rq, false);
> >>  }
> >>
> >> -static void attach_task_cfs_rq(struct task_struct *p)
> >> +static void attach_entity_cfs_rq(struct sched_entity *se)
> >>  {
> >> - struct sched_entity *se = &p->se;
> >>   struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >
> >
> > Both callers of attach_entity_cfs_rq() already use cfs_rq_of(se). You
> > could pass it into attach_entity_cfs_rq().
>
> Yes that would make sense

In fact there is a third caller, online_fair_sched_group(), which calls
attach_entity_cfs_rq() and doesn't already use cfs_rq_of(se), so I
wonder if it's worth doing the interface change.


>
>
> >
> >>   u64 now = cfs_rq_clock_task(cfs_rq);
> >> @@ -8519,6 +8514,14 @@ static void attach_task_cfs_rq(struct task_struct
> >> *p)
> >
> >
> > The old comment /* Synchronize task ... */ should be changed to /*
> > Synchronize entity ... */
>
> yes
>
> >
> >>   update_cfs_rq_load_avg(now, cfs_rq, false);
> >>   attach_entity_load_avg(cfs_rq, se);
> >>   update_tg_load_avg(cfs_rq, false);
> >> +}
> >> +
> >> +static void attach_task_cfs_rq(struct task_struct *p)
> >> +{
> >> + struct sched_entity *se = &p->se;
> >> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >> +
> >> + attach_entity_cfs_rq(se);
> >>
> >>   if (!vruntime_normalized(p))
> >>   se->vruntime += cfs_rq->min_vruntime;
> >>
> >

Re: MPOL_BIND on memory only nodes

2016-10-12 Thread Michal Hocko

On Wed 12-10-16 16:08:48, Anshuman Khandual wrote:
> On 10/12/2016 03:13 PM, Michal Hocko wrote:
> > On Wed 12-10-16 14:55:24, Anshuman Khandual wrote:
> >> Hi,
> >>
> >> We have the following function policy_zonelist() which selects a zonelist
> >> during various allocation paths. With this, general user space allocations
> >> (IIUC might not have __GFP_THISNODE) fails while trying to get memory from
> >> a memory only node without CPUs as the application runs some where else
> >> and that node is not part of the nodemask.
> 
> My bad. Was playing with some changes to the zonelists rebuild after
> a memory node hotplug and the order of various zones in them.
> 
> > 
> > I am not sure I understand. So you have a task with MPOL_BIND without a
> > cpu less node in the mask and you are wondering why the memory is not
> > allocated from that node?
> 
> In my experiment, there is a MPOL_BIND call with a CPU less node in
> the node mask and the memory is not allocated from that CPU less node.
> That's because the zone of the CPU less node was absent from the
> FALLBACK zonelist of the local node.

So do I understand this correctly that the issue was caused by
non-upstream changes?

> >> Why we insist on __GFP_THISNODE ?
> > 
> > AFAIU __GFP_THISNODE just overrides the given node to the policy
> > nodemask in case the current node is not part of that node mask. In
> > other words we are ignoring the given node and use what the policy says. 
> 
> Right, but only if the gfp flag has __GFP_THISNODE in it. In the absence
> of __GFP_THISNODE, the node from the nodemask will not be selected.

In the absence of __GFP_THISNODE we will use the zonelist for the given node
and that should contain even memoryless nodes for the fallback. The
nodemask from policy_nodemask() will then make sure that only nodes
relevant to the used policy are used.

> I still wonder why. Can we always go to the first node in the
> nodemask for MPOL_BIND interface calls? Just curious to know why
> preference is given to the local node and its FALLBACK zonelist.

It is not always a local node. Look at how do_huge_pmd_wp_page_fallback
tries to put all the pages on the same node. Also we have
alloc_pages_current(), which tries to allocate from the local node and
should not fall back to the first node in the policy nodemask.

-- 
Michal Hocko
SUSE Labs

Re: [PATCH RESEND] ARM: dts: keystone-k2*: Increase SPI Flash partition size for U-Boot

2016-10-12 Thread Vignesh R

Hi,

On Monday 10 October 2016 08:01 PM, Russell King - ARM Linux wrote:
> On Mon, Oct 10, 2016 at 07:41:41PM +0530, Vignesh R wrote:
>> U-Boot SPI Boot image is now more than 512KB for Keystone2 devices and
>> cannot fit into existing partition. So, increase the SPI Flash partition
>> for U-Boot to 1MB for all Keystone2 devices.
>>
>> Signed-off-by: Vignesh R 
>> ---
>>
>> This was submitted to v4.9 merge window but was never picked up:
>> https://patchwork.kernel.org/patch/9135023/
> 
> I think you need to explain why it's safe to change the layout of the
> flash partitions like this.
> 
> - What is this "misc" partition?
> 

This partition seems to have existed from the very beginning. I believe
it is just a spare area of flash that can be used as the end user
requires, either to store a small filesystem or a kernel. Copying
Murali, who added the above partition, in case he has any input here.

> - Why is it safe to move the "misc" partition in this way?
> 
> - Do users need to do anything with data stored in the "misc" partition
>   when changing kernels?
> 

MTD layer will take care of most abstractions (like start address etc).
Will add a note in commit message informing about the reduction in size
of the partition.

> If the "misc" partition is simply unused space on the flash device, why
> list it in DT?
> 

If the unused space is not listed in the DT, then no /dev/mtdX node is
created for the unused section. The user would then have to manually
edit the DT in order to get the node and mount it. Instead, let's make
it available by default.

-- 
Regards
Vignesh

Re: [PATCH] x86/apic: Fix suspicious RCU usage in smp_trace_call_function_interrupt

2016-10-12 Thread Wanpeng Li

2016-09-19 16:10 GMT+08:00 Peter Zijlstra :
> On Thu, Sep 15, 2016 at 10:58:04AM +0200, Thomas Gleixner wrote:
>> On Thu, 15 Sep 2016, Wanpeng Li wrote:
>> > ---
>> >  arch/x86/include/asm/apic.h | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
>> > index 1243577..71c1fe2 100644
>> > --- a/arch/x86/include/asm/apic.h
>> > +++ b/arch/x86/include/asm/apic.h
>> > @@ -650,8 +650,8 @@ static inline void entering_ack_irq(void)
>> >
>> >  static inline void ipi_entering_ack_irq(void)
>> >  {
>> > -   ack_APIC_irq();
>> > irq_enter();
>> > +   ack_APIC_irq();
>> >  }
>>
>> which makes ipi_entering_ack_irq() the same as entering_ack_irq() and
>> therefor pointless.
>
> entering_ack_irq() seems to use entering_irq() instead of irq_enter().
> Which is close but not the same. This thing seems to also do
> exit_idle().
>
> Now, there's only a handfull of ipi_entering_ack_irq() users, and it
> doesn't seem to make sense to me to only call exit_idle() on IPIs, why
> don't we need to call exit_idle() on regular IRQs ?!
>
> All in all, that stuff is crufty and needs a cleanup I'd say.

[  116.587762]
[  116.587768] ===
[  116.587770] [ INFO: suspicious RCU usage. ]
[  116.587773] 4.8.0+ #24 Not tainted
[  116.587775] ---
[  116.58] ./arch/x86/include/asm/msr-trace.h:47 suspicious
rcu_dereference_check() usage!
[  116.587779]
[  116.587779] other info that might help us debug this:
[  116.587779]
[  116.587782]
[  116.587782] RCU used illegally from idle CPU!
[  116.587782] rcu_scheduler_active = 1, debug_locks = 0
[  116.587785] RCU used illegally from extended quiescent state!
[  116.587787] no locks held by swapper/1/0.
[  116.587788]
[  116.587788] stack backtrace:
[  116.587792] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.8.0+ #24
[  116.587794] Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03
01/08/2015
[  116.587796]  90285de03f58 9d44a0c9 90285ca5d100
0001
[  116.587803]  90285de03f88 9d0ebd67 902845165410
080b
[  116.587809]    90285de03fb8
9d492b95
[  116.587814] Call Trace:
[  116.587817][] dump_stack+0x99/0xd0
[  116.587827]  [] lockdep_rcu_suspicious+0xe7/0x120
[  116.587832]  [] do_trace_write_msr+0x135/0x140
[  116.587836]  [] native_write_msr+0x20/0x30
[  116.587841]  [] native_apic_msr_eoi_write+0x1d/0x30
[  116.587845]  [] smp_reschedule_interrupt+0x1d/0x30
[  116.587849]  [] reschedule_interrupt+0x96/0xa0
[  116.587851][] ? cpuidle_enter_state+0xe4/0x360
[  116.587858]  [] ? cpuidle_enter_state+0xcf/0x360
[  116.587861]  [] cpuidle_enter+0x17/0x20
[  116.587865]  [] call_cpuidle+0x23/0x50
[  116.587868]  [] cpu_startup_entry+0x15c/0x280
[  116.587872]  [] start_secondary+0x154/0x180

irq_enter(), which is called in scheduler_ipi(), comes too late to tell
the RCU subsystem to end the extended quiescent state before
ack_APIC_irq() runs. Any ideas?
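
To make the ordering problem concrete, here is a toy userspace model in
C. It is only a sketch under stated assumptions: every model_* name, the
rcu_watching flag and the splat counter are made up for illustration and
are not kernel APIs. It mimics one thing only: the EOI write is traced,
and the tracepoint complains if RCU is not watching the CPU yet.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; not the kernel's real data structures. */
static bool rcu_watching; /* false while the CPU is in an extended quiescent state */
static int rcu_splats;    /* number of "suspicious RCU usage" reports */

/* Idle entry: RCU stops watching this CPU. */
static void model_idle_enter(void)
{
	rcu_watching = false;
}

/* Stands in for irq_enter(): RCU starts watching again. */
static void model_irq_enter(void)
{
	rcu_watching = true;
}

/* Stands in for irq_exit(): back to the idle state. */
static void model_irq_exit(void)
{
	assert(rcu_watching);
	rcu_watching = false;
}

/*
 * Stands in for ack_APIC_irq() when the EOI is an MSR write and the
 * write_msr tracepoint is enabled: the tracepoint needs RCU watching.
 */
static void model_ack_apic_irq(void)
{
	if (!rcu_watching)
		rcu_splats++;
}

/* Order seen in the trace: ack first, irq_enter() only later. */
int model_ack_before_irq_enter(void)
{
	rcu_splats = 0;
	model_idle_enter();
	model_ack_apic_irq();
	model_irq_enter();
	model_irq_exit();
	return rcu_splats;
}

/* Alternative order: leave the quiescent state first, then ack. */
int model_irq_enter_before_ack(void)
{
	rcu_splats = 0;
	model_idle_enter();
	model_irq_enter();
	model_ack_apic_irq();
	model_irq_exit();
	return rcu_splats;
}
```

In this model the first function reports one splat and the second
reports none. That only restates the ordering question; it says nothing
about whether reordering is the right fix in the real entry code.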

Regards,
Wanpeng Li

Re: [Intel-gfx] drm/i915: WARN_ON_ONCE(!crtc_clock || cdclk < crtc_clock)

2016-10-12 Thread Joonas Lahtinen

On ke, 2016-10-12 at 11:56 +0200, Paul Bolle wrote:
> On a laptop that tracks the latest stable release (i.e., it now runs
> v4.8.1) I see this WARNING
>     WARN_ON_ONCE(!crtc_clock || cdclk < crtc_clock)
> 
> Full trace pasted below. I never saw this WARNING before v4.8. Since
> v4.8 I've had it in all (four, actually) boots.
> 
> What am I expected to do about this WARNING?
> 

Bisecting the offending commit between v4.8 and v4.8.1 would be a good
start.

Regards, Joonas
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation

Re: [PATCH RESEND] ARM: dts: keystone-k2*: Increase SPI Flash partition size for U-Boot

2016-10-12 Thread Vignesh R

Hi,

On Monday 10 October 2016 09:31 PM, Santosh Shilimkar wrote:
> Vignesh,
> 
> On 10/10/2016 7:31 AM, Russell King - ARM Linux wrote:
>> On Mon, Oct 10, 2016 at 07:41:41PM +0530, Vignesh R wrote:
>>> U-Boot SPI Boot image is now more than 512KB for Keystone2 devices and
>>> cannot fit into existing partition. So, increase the SPI Flash partition
>>> for U-Boot to 1MB for all Keystone2 devices.
>>>
>>> Signed-off-by: Vignesh R 
>>> ---
>>>
>>> This was submitted to v4.9 merge window but was never picked up:
>>> https://patchwork.kernel.org/patch/9135023/
> 
> Another point is, if you want me to pick your patch, please copy
> me next time :-). AFAIK, I am seeing this patch in my inbox for the first time.
> 

Sorry, I did address the previous patch to you. Not sure what happened :(

>>
>> I think you need to explain why it's safe to change the layout of the
>> flash partitions like this.
>>
>> - What is this "misc" partition?
>>
>> - Why is it safe to move the "misc" partition in this way?
>>
>> - Do users need to do anything with data stored in the "misc" partition
>>   when changing kernels?
>>
>> If the "misc" partition is simply unused space on the flash device, why
>> list it in DT?
>>
> Thanks Russell. Yes, the above clarification would be good to get first.


OK, I will send v2 with an updated commit message, as per my reply in
the other thread.

-- 
Regards
Vignesh
