Re: [PATCH] iwlwifi: pcie: reduce "unsupported splx" to a warning
Hi Chris,

On Tue, 2016-10-11 at 09:09 -0500, Chris Rorvick wrote:
> On Tue, Oct 11, 2016 at 5:11 AM, Paul Bolle wrote:
> > > This is not coming from the NIC itself, but from the platform's ACPI
> > > tables. Can you tell us which platform you are using?
> >
> Interesting. I'm running a Dell XPS 13 9350. I replaced the
> factory-provided Broadcom card with an AC 8260. I can update the
> commit log to reflect this.

Okay, so this makes sense. Those entries are probably formatted for the Broadcom card, which the iwlwifi driver obviously doesn't understand. The best we can do, as I already said, is to ignore values we don't understand. I will also check what the correct procedure is in such cases, because it is possible, in theory, that the format *matches* but applies only to another device.

> > > If this is really bothering you, I guess I could apply this patch for
> > > now. But as I said, this is not solving the actual problem.
> >
> > Bikeshedding: I think IWL_INFO() is more appropriate, as info doesn't
> > imply one needs to act on this message, while warn does imply that
> > action is needed.
> >
> Agreed. I still think making this a warning is appropriate, but it
> seems pretty clear this is not an error. This has nothing to do with
> how much it bothers me. An error tells the user something needs to be
> fixed, but in this case the interface is working fine. Making it a
> warning with an improved message will result in fewer people wasting
> their time.

Yes, so I'll try to stop wasting people's time by trying to do the correct thing without bothering the user at all. :)

Thanks for pointing this all out!

--
Cheers,
Luca.
Re: [PATCH] gpio: pca953x: add a comment explaining the need for a lockdep subclass
On Mon, Sep 26, 2016 at 11:54:15AM +0200, Bartosz Golaszewski wrote: > This is a follow-up to commit 559b46990e76 ("gpio: pca953x: fix an > incorrect lockdep warning"). The reason for calling > lockdep_set_subclass() in pca953x_probe() is not explained in > the code. > > Add a comment describing the problem, partial solution and required > future extensions. > > Signed-off-by: Bartosz Golaszewski Applied to for-current, thanks!
Re: [PATCH 00/44] Convert FibreChannel bsg code to use bsg-lib
On Tue, Oct 11, 2016 at 09:49:38AM -0700, Christoph Hellwig wrote: > Hi Johannes, > > this looks great to me. But is there a chance to consolidate it into > a more manageable set of patches? E.g. all the patches to call > export fc_bsg_jobdone, use it directly and remove the function pointer > could go together, possibly even including the new calling convention. > Similar all the patches about fc_bsg_to_shost could be merged into one, > and if we add the bsg refcounting early, we could maybe skip a few > steps of the conversion later on? Sure, I think 44 patches is a bit huge. Especially given the 0day bot fallout it generated. Let me see how I can slim it down. Johannes -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH v3 07/11] arm64/tracing: fix compat syscall handling
Hi Will, On 11.10.2016 15:36, Will Deacon wrote: On Tue, Oct 11, 2016 at 12:42:52PM +0200, Marcin Nowakowski wrote: Add arch_syscall_addr for arm64 and define NR_compat_syscalls, as the number of compat syscalls for arm64 exceeds the number defined by NR_syscalls. Signed-off-by: Marcin Nowakowski Cc: Steven Rostedt Cc: Ingo Molnar Cc: Catalin Marinas Cc: Will Deacon Cc: linux-arm-ker...@lists.infradead.org --- arch/arm64/include/asm/ftrace.h | 12 +--- arch/arm64/include/asm/unistd.h | 1 + arch/arm64/kernel/Makefile | 1 + arch/arm64/kernel/ftrace.c | 16 4 files changed, 19 insertions(+), 11 deletions(-) diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h index caa955f..b57ff7c 100644 --- a/arch/arm64/include/asm/ftrace.h +++ b/arch/arm64/include/asm/ftrace.h @@ -41,17 +41,7 @@ static inline unsigned long ftrace_call_adjust(unsigned long addr) #define ftrace_return_address(n) return_address(n) -/* - * Because AArch32 mode does not share the same syscall table with AArch64, - * tracing compat syscalls may result in reporting bogus syscalls or even - * hang-up, so just do not trace them. - * See kernel/trace/trace_syscalls.c - * - * x86 code says: - * If the user really wants these, then they should use the - * raw syscall tracepoints with filtering. - */ -#define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS +#define ARCH_COMPAT_SYSCALL_NUMBERS_OVERLAP 1 static inline bool arch_trace_is_compat_syscall(struct pt_regs *regs) { return is_compat_task(); diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h index e78ac26..276d049 100644 --- a/arch/arm64/include/asm/unistd.h +++ b/arch/arm64/include/asm/unistd.h @@ -45,6 +45,7 @@ #define __ARM_NR_compat_set_tls(__ARM_NR_COMPAT_BASE+5) #define __NR_compat_syscalls 394 +#define NR_compat_syscalls (__NR_compat_syscalls) We may as well just define NR_compat_syscalls instead of __NR_compat_syscalls and move the handful of users over. 
I had tried to minimise the amount of arch-specific changes here - especially those that are not directly related to the proposed syscall handling change. But I agree having these 2 #defines is a bit unnecessary ... diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c index 40ad08a..75d010f 100644 --- a/arch/arm64/kernel/ftrace.c +++ b/arch/arm64/kernel/ftrace.c @@ -176,4 +176,20 @@ int ftrace_disable_ftrace_graph_caller(void) return ftrace_modify_graph_caller(false); } #endif /* CONFIG_DYNAMIC_FTRACE */ + #endif /* CONFIG_FUNCTION_GRAPH_TRACER */ + +#if (defined CONFIG_FTRACE_SYSCALLS) && (defined CONFIG_COMPAT) + +extern const void *sys_call_table[]; +extern const void *compat_sys_call_table[]; + +unsigned long __init arch_syscall_addr(int nr, bool compat) +{ + if (compat) + return (unsigned long)compat_sys_call_table[nr]; + + return (unsigned long)sys_call_table[nr]; +} Do we care about the compat private syscalls (from base 0x0f)? We need to make sure that we exhibit the same behaviour as a native 32-bit ARM machine. Will Tracing of such syscalls has been disabled for a long time (see http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=086ba77a6db0). Apart from using non-contiguous numbers, they are not defined using standard SYSCALL macros, so they do not have any metadata generated either. My suggestion is that if you wanted those to be included in the trace then it should be done separately from these changes. Marcin
Re: [PATCH 1/2] driver core: skip removal test for non-removable drivers
Hi Rob, On 10/11/16 20:41, Rob Herring wrote: > Some drivers do not support removal/unbinding. These drivers should have > drv->suppress_bind_attrs set to true, so use that to skip the removal > test. > > This doesn't fix anything reported so far, but should prevent some other > cases. Some drivers will need fixes to set suppress_bind_attrs to avoid > this test. > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=177021 > Fixes: bea5b158ff0d ("driver core: add test of driver remove calls during > probe") > Reported-by: Laszlo Ersek > Signed-off-by: Rob Herring > --- > drivers/base/dd.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/base/dd.c b/drivers/base/dd.c > index d22a7260f42b..8937a7ad7165 100644 > --- a/drivers/base/dd.c > +++ b/drivers/base/dd.c > @@ -324,7 +324,8 @@ static int really_probe(struct device *dev, struct > device_driver *drv) > { > int ret = -EPROBE_DEFER; > int local_trigger_count = atomic_read(&deferred_trigger_count); > - bool test_remove = IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE); > + bool test_remove = IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE) && > +!drv->suppress_bind_attrs; > > if (defer_all_probes) { > /* > can you please repost the full series with me CC'd on all of the messages; I'm not subscribed to LKML. Thanks, Laszlo
Re: [mm] c4344e8035: WARNING: CPU: 0 PID: 101 at mm/memory.c:303 __tlb_remove_page_size+0x25/0x99
On 10/12, Aneesh Kumar K.V wrote: >kernel test robot writes: > >> FYI, we noticed the following commit: >> >> https://github.com/0day-ci/linux >> Aneesh-Kumar-K-V/mm-Use-the-correct-page-size-when-removing-the-page/20161012-013446 >> commit c4344e80359420d7574b3b90fddf53311f1d24e6 ("mm: Remove the page size >> change check in tlb_remove_page") >> >> in testcase: boot >> >> on test machine: qemu-system-i386 -enable-kvm -cpu Haswell,+smep,+smap -m >> 360M >> >> caused below changes: >> >> >> ++++ >> || eff764128d | c4344e8035 | >> ++++ >> | boot_successes | 59 | 0 | >> | boot_failures | 0 | 43 | >> | WARNING:at_mm/memory.c:#__tlb_remove_page_size | 0 | 43 | >> | calltrace:SyS_execve | 0 | 43 | >> | calltrace:run_init_process | 0 | 21 | >> ++++ >> >> >> >> [4.096204] Write protecting the kernel text: 3148k >> [4.096911] Write protecting the kernel read-only data: 1444k >> [4.120357] [ cut here ] >> [4.121078] WARNING: CPU: 0 PID: 101 at mm/memory.c:303 >> __tlb_remove_page_size+0x25/0x99 >> [4.122380] Modules linked in: >> [4.122788] CPU: 0 PID: 101 Comm: run-parts Not tainted >> 4.8.0-mm1-00315-gc4344e8 #5 >> [4.123956] bd145dc4 b111e5e6 bd145de0 b10320dc 012f b10974d1 >> bd145e70 c4954170 >> [4.125277] c4954170 bd145df4 b103215f 0009 >> bd145e04 b10974d1 >> [4.126424] c4954170 bd145e70 bd145e14 b10263ca bd145e70 bd47bafc >> bd145e40 b109767a >> [4.127622] Call Trace: > >Thanks for the report. The below change should fix this. 
>
>commit 18c929e7cf672da617dc218c6265366bf78b1644
>Author: Aneesh Kumar K.V
>Date: Wed Oct 12 08:40:41 2016 +0530
>
>update mmu gather page size before flushing page table cache
>
>diff --git a/mm/memory.c b/mm/memory.c
>index 26d1ba8c87e6..7e7eccb82a2b 100644
>--- a/mm/memory.c
>+++ b/mm/memory.c
>@@ -526,7 +526,11 @@ void free_pgd_range(struct mmu_gather *tlb,
> end -= PMD_SIZE;
> if (addr > end - 1)
> return;
>-
>+ /*
>+ * We add page table cache pages with PAGE_SIZE,
>+ * (see pte_free_tlb()), flush the tlb if we need
>+ */
>+ tlb_remove_check_page_size_change(tlb, PAGE_SIZE);
> pgd = pgd_offset(tlb->mm, addr);
> do {
> next = pgd_addr_end(addr, end);
>

Just applied this fix on top of commit c4344e8035 and confirmed that the reported warning is gone with this fix.

Tested-by: Xiaolong Ye

=
compiler/kconfig/rootfs/sleep/tbox_group/testcase:
gcc-6/i386-randconfig-s1-201641/quantal-core-i386.cgz/1/vm-vp-quantal-i386/boot

commit:
c4344e80359420d7574b3b90fddf53311f1d24e6
384db818365c90b91d8bad80be188765e801cf58 ("update mmu gather page size before flushing page table cache")

c4344e80359420d7 384db818365c90b91d8bad80be
--
fail:runs  %reproduction  fail:runs
    |            |            |
  24:24       -100%          :5     dmesg.WARNING:at_mm/memory.c:#__tlb_remove_page_size

Thanks,
Xiaolong
Re: [PATCH v2 3/4] mm: try to exhaust highatomic reserve before the OOM
On 10/12/2016 07:33 AM, Minchan Kim wrote: It's weird to show that zone has enough free memory above min watermark but OOMed with 4K GFP_KERNEL allocation due to reserved highatomic pages. As last resort, try to unreserve highatomic pages again and if it has moved pages to non-highatmoc free list, retry reclaim once more. I would move the details (OOM report etc) from the cover letter here, otherwise they end up in Patch 1's changelog, which is less helpful. Signed-off-by: Michal Hocko Signed-off-by: Minchan Kim Acked-by: Vlastimil Babka --- mm/page_alloc.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 18808f392718..a7472426663f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2080,7 +2080,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone, * intense memory pressure but failed atomic allocations should be easier * to recover from than an OOM. */ -static void unreserve_highatomic_pageblock(const struct alloc_context *ac) +static bool unreserve_highatomic_pageblock(const struct alloc_context *ac) { struct zonelist *zonelist = ac->zonelist; unsigned long flags; @@ -2088,6 +2088,7 @@ static void unreserve_highatomic_pageblock(const struct alloc_context *ac) struct zone *zone; struct page *page; int order; + bool ret = false; for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx, ac->nodemask) { @@ -2136,12 +2137,14 @@ static void unreserve_highatomic_pageblock(const struct alloc_context *ac) * may increase. 
*/ set_pageblock_migratetype(page, ac->migratetype); - move_freepages_block(zone, page, ac->migratetype); + ret = move_freepages_block(zone, page, ac->migratetype); spin_unlock_irqrestore(&zone->lock, flags); - return; + return ret; } spin_unlock_irqrestore(&zone->lock, flags); } + + return ret; } /* Remove an element from the buddy allocator from the fallback list */ @@ -3457,8 +3460,12 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, * Make sure we converge to OOM if we cannot make any progress * several times in the row. */ - if (*no_progress_loops > MAX_RECLAIM_RETRIES) + if (*no_progress_loops > MAX_RECLAIM_RETRIES) { + /* Before OOM, exhaust highatomic_reserve */ + if (unreserve_highatomic_pageblock(ac)) + return true; return false; + } /* * Keep reclaiming pages while there is a chance this will lead
Re: [PATCH v2 4/4] mm: make unreserve highatomic functions reliable
On 10/12/2016 07:33 AM, Minchan Kim wrote: Currently, unreserve_highatomic_pageblock bails out if it found highatomic pageblock regardless of really moving free pages from the one so that it could mitigate unreserve logic's goal which saves OOM of a process. This patch makes unreserve functions bail out only if it moves some pages out of !highatomic free list to avoid such false positive. Another potential problem is that by race between page freeing and reserve highatomic function, pages could be in highatomic free list even though the pageblock is !high atomic migratetype. In that case, unreserve_highatomic_pageblock can be void if count of highatomic reserve is less than pageblock_nr_pages. We could solve it simply via draining all of reserved pages before the OOM. It would have a safeguard role to exhuast reserved pages before converging to OOM. Signed-off-by: Michal Hocko Ah, I think that the first S-o-b has to match "From:" to be valid chain (also for 3/4). Signed-off-by: Minchan Kim Acked-by: Vlastimil Babka --- mm/page_alloc.c | 24 +--- 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a7472426663f..565589eae6a2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2079,8 +2079,12 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone, * potentially hurts the reliability of high-order allocations when under * intense memory pressure but failed atomic allocations should be easier * to recover from than an OOM. + * + * If @drain is true, try to move all of reserved pages out of highatomic + * free list. 
*/ -static bool unreserve_highatomic_pageblock(const struct alloc_context *ac) +static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, + bool drain) { struct zonelist *zonelist = ac->zonelist; unsigned long flags; @@ -2092,8 +2096,12 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac) for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx, ac->nodemask) { - /* Preserve at least one pageblock */ - if (zone->nr_reserved_highatomic <= pageblock_nr_pages) + /* +* Preserve at least one pageblock unless memory pressure +* is really high. +*/ + if (!drain && zone->nr_reserved_highatomic <= + pageblock_nr_pages) continue; spin_lock_irqsave(&zone->lock, flags); @@ -2138,8 +2146,10 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac) */ set_pageblock_migratetype(page, ac->migratetype); ret = move_freepages_block(zone, page, ac->migratetype); - spin_unlock_irqrestore(&zone->lock, flags); - return ret; + if (!drain && ret) { + spin_unlock_irqrestore(&zone->lock, flags); + return ret; + } } spin_unlock_irqrestore(&zone->lock, flags); } @@ -3343,7 +3353,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order, * Shrink them them and try again */ if (!page && !drained) { - unreserve_highatomic_pageblock(ac); + unreserve_highatomic_pageblock(ac, false); drain_all_pages(NULL); drained = true; goto retry; @@ -3462,7 +3472,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, */ if (*no_progress_loops > MAX_RECLAIM_RETRIES) { /* Before OOM, exhaust highatomic_reserve */ - if (unreserve_highatomic_pageblock(ac)) + if (unreserve_highatomic_pageblock(ac, true)) return true; return false; }
Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area
On 10/12/2016 02:53 PM, Michal Hocko wrote:
> On Wed 12-10-16 08:28:17, zijun_hu wrote:
>> On 2016/10/12 1:22, Michal Hocko wrote:
>>> On Tue 11-10-16 21:24:50, zijun_hu wrote:
>>>> From: zijun_hu
>>>>
>>>> the LSB of a chunk->map element is used for free/in-use flag of a area
>>>> and the other bits for offset, the sufficient and necessary condition of
>>>> this usage is that both size and alignment of a area must be even numbers
>>>> however, pcpu_alloc() doesn't force its @align parameter a even number
>>>> explicitly, so a odd @align maybe causes a series of errors, see below
>>>> example for concrete descriptions.
>>>
>>> Is or was there any user who would use a different than even (or power of 2)
>>> alighment? If not is this really worth handling?
>>>
>>
>> it seems only a power of 2 alignment except 1 can make sure it work very
>> well,
>> that is a strict limit, maybe this more strict limit should be checked
>
> I fail to see how any other alignment would actually make any sense
> what so ever. Look, I am not a maintainer of this code but adding a new
> code to catch something that doesn't make any sense sounds dubious at
> best to me.
>
> I could understand this patch if you see a problem and want to prevent
> it from repeating bug doing these kind of changes just in case sounds
> like a bad idea.
>

Thanks for your reply.

Should we have a general discussion about whether patches that handle many boundary or rare conditions are necessary?

Should we adopt the following conventions?

1) When we say 'alignment', we mean aligning to a power-of-2 value. For example, aligning value @v to @b implies that @b is a power of 2; aligning 10 to 4 gives 12.
2) When we say 'round value @v up/down to boundary @b', we mean that the result is a multiple of @b; it does not require @b to be a power of 2.
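
To make the two proposed conventions concrete, here is a minimal userspace sketch. The helper names align_up_pow2() and round_up_any() are hypothetical and used only for illustration; they are modelled on the mask-based and division-based styles of rounding and are not copied from the kernel sources.

```c
/*
 * Sketch only: align_up_pow2() and round_up_any() are illustrative
 * names, not kernel APIs.
 */

/* Convention 1: "align" @v up to @b. Only correct when @b is a power of 2. */
static unsigned long align_up_pow2(unsigned long v, unsigned long b)
{
	return (v + b - 1) & ~(b - 1);
}

/* Convention 2: "round" @v up to a multiple of @b. @b may be any non-zero value. */
static unsigned long round_up_any(unsigned long v, unsigned long b)
{
	return ((v + b - 1) / b) * b;
}
```

With these definitions, aligning 10 to 4 gives 12 under both helpers, while for a non-power-of-2 boundary such as 3 only round_up_any() returns a multiple of the boundary (8 rounds up to 9); the mask-based helper returns 8 unchanged, which is the kind of silent misbehaviour the thread is worried about.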
Re: [PATCH v2 3/4] mm: try to exhaust highatomic reserve before the OOM
On Wed 12-10-16 14:33:35, Minchan Kim wrote: > It's weird to show that zone has enough free memory above min > watermark but OOMed with 4K GFP_KERNEL allocation due to > reserved highatomic pages. As last resort, try to unreserve > highatomic pages again and if it has moved pages to > non-highatmoc free list, retry reclaim once more. Agreed with Vlastimil on the OOM report in the changelog. The above will not tell the reader much to understand how does the situation look like and whether the patch is really needed in his particular situation. Few nits below but in general looks good to me > Signed-off-by: Michal Hocko > Signed-off-by: Minchan Kim > --- > mm/page_alloc.c | 15 +++ > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 18808f392718..a7472426663f 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2080,7 +2080,7 @@ static void reserve_highatomic_pageblock(struct page > *page, struct zone *zone, > * intense memory pressure but failed atomic allocations should be easier > * to recover from than an OOM. > */ > -static void unreserve_highatomic_pageblock(const struct alloc_context *ac) > +static bool unreserve_highatomic_pageblock(const struct alloc_context *ac) > { > struct zonelist *zonelist = ac->zonelist; > unsigned long flags; > @@ -2088,6 +2088,7 @@ static void unreserve_highatomic_pageblock(const struct > alloc_context *ac) > struct zone *zone; > struct page *page; > int order; > + bool ret = false; no need to initialization, see below > > for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx, > ac->nodemask) { > @@ -2136,12 +2137,14 @@ static void unreserve_highatomic_pageblock(const > struct alloc_context *ac) >* may increase. 
>*/ > set_pageblock_migratetype(page, ac->migratetype); > - move_freepages_block(zone, page, ac->migratetype); > + ret = move_freepages_block(zone, page, ac->migratetype); > spin_unlock_irqrestore(&zone->lock, flags); > - return; > + return ret; > } > spin_unlock_irqrestore(&zone->lock, flags); > } > + > + return ret; return false; > } > > /* Remove an element from the buddy allocator from the fallback list */ > @@ -3457,8 +3460,12 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, >* Make sure we converge to OOM if we cannot make any progress >* several times in the row. >*/ > - if (*no_progress_loops > MAX_RECLAIM_RETRIES) > + if (*no_progress_loops > MAX_RECLAIM_RETRIES) { > + /* Before OOM, exhaust highatomic_reserve */ > + if (unreserve_highatomic_pageblock(ac)) > + return true; return unreserve_highatomic_pageblock(ac); > return false; > + } > > /* >* Keep reclaiming pages while there is a chance this will lead > -- > 2.7.4 > -- Michal Hocko SUSE Labs
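
The last nit above (returning the result of unreserve_highatomic_pageblock() directly) can be sketched in isolation as follows. This is a simplified userspace model, not the kernel code: the bool parameter stands in for the real call, the value of MAX_RECLAIM_RETRIES is an assumption, and the trailing "return true" is only a placeholder for the remaining checks in should_reclaim_retry().

```c
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16	/* assumed value, for illustration only */

/* Form as posted in the patch: explicit if/return true/return false. */
static bool retry_as_posted(int no_progress_loops, bool can_unreserve)
{
	if (no_progress_loops > MAX_RECLAIM_RETRIES) {
		/* Before OOM, exhaust highatomic_reserve */
		if (can_unreserve)
			return true;
		return false;
	}
	return true;	/* placeholder for the remaining reclaim checks */
}

/* Form suggested in the review: return the result directly. */
static bool retry_as_suggested(int no_progress_loops, bool can_unreserve)
{
	if (no_progress_loops > MAX_RECLAIM_RETRIES)
		return can_unreserve;
	return true;	/* placeholder for the remaining reclaim checks */
}
```

Both forms return the same value for every input; the suggested one just drops the redundant branch.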
Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area
On 10/12/2016 02:53 PM, Michal Hocko wrote:
> On Wed 12-10-16 08:28:17, zijun_hu wrote:
>> On 2016/10/12 1:22, Michal Hocko wrote:
>>> On Tue 11-10-16 21:24:50, zijun_hu wrote:
>>>> From: zijun_hu
>>>>
>>>> the LSB of a chunk->map element is used for free/in-use flag of a area
>>>> and the other bits for offset, the sufficient and necessary condition of
>>>> this usage is that both size and alignment of a area must be even numbers
>>>> however, pcpu_alloc() doesn't force its @align parameter a even number
>>>> explicitly, so a odd @align maybe causes a series of errors, see below
>>>> example for concrete descriptions.
>>>
>>> Is or was there any user who would use a different than even (or power of 2)
>>> alighment? If not is this really worth handling?
>>>
>>
>> it seems only a power of 2 alignment except 1 can make sure it work very
>> well,
>> that is a strict limit, maybe this more strict limit should be checked
>
> I fail to see how any other alignment would actually make any sense
> what so ever. Look, I am not a maintainer of this code but adding a new
> code to catch something that doesn't make any sense sounds dubious at
> best to me.
>
> I could understand this patch if you see a problem and want to prevent
> it from repeating bug doing these kind of changes just in case sounds
> like a bad idea.
>

Thanks for your reply.

Should we have a general discussion about whether patches that handle many boundary or rare conditions are necessary?

I found the following code segment in mm/vmalloc.c:

static struct vmap_area *alloc_vmap_area(unsigned long size,
		unsigned long align,
		unsigned long vstart, unsigned long vend,
		int node, gfp_t gfp_mask)
{
	...
	BUG_ON(!size);
	BUG_ON(offset_in_page(size));
	BUG_ON(!is_power_of_2(align));

Should we adopt the following conventions?

1) When we say 'alignment', we mean aligning to a power-of-2 value. For example, aligning value @v to @b implies that @b is a power of 2; aligning 10 to 4 gives 12.
2) When we say 'round value @v up/down to boundary @b', we mean that the result is a multiple of @b; it does not require @b to be a power of 2.
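
The concern quoted at the top of this message (the LSB of a map element doubling as the in-use flag) can be illustrated with a small model. This is a simplified sketch under stated assumptions: map_pack(), map_offset(), map_in_use() and is_pow2() are hypothetical names, and the single-int encoding only approximates what the thread describes for chunk->map; it is not the percpu allocator's actual code.

```c
#include <stdbool.h>

/* Hypothetical check, similar in spirit to the is_power_of_2() test quoted above. */
static bool is_pow2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Pack an area offset and its in-use flag into one int; bit 0 is the flag. */
static int map_pack(int offset, bool in_use)
{
	return offset | (in_use ? 1 : 0);
}

/* Recover the offset by masking off the flag bit. */
static int map_offset(int entry)
{
	return entry & ~1;
}

/* Recover the in-use flag from bit 0. */
static bool map_in_use(int entry)
{
	return (entry & 1) != 0;
}
```

An even offset such as 8 survives the round trip, but an odd offset such as 7 packed as "free" reads back as offset 6 and "in use", because the offset's own low bit collides with the flag. Note that is_pow2(1) is true in this sketch, which matches the remark in the thread that a power-of-2 check alone still lets an alignment of 1 through.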
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
>>> On 11.10.16 at 17:53, wrote: > On Tue, Oct 11, 2016 at 6:08 AM, Jan Beulich wrote: > Andrew Cooper 10/10/16 6:44 PM >>> >>>On 10/10/16 01:35, Haozhong Zhang wrote: Xen hypervisor needs assistance from Dom0 Linux kernel for following tasks: 1) Reserve an area on NVDIMM devices for Xen hypervisor to place memory management data structures, i.e. frame table and M2P table. 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen hypervisor. >>> >>>However, I can't see any justification for 1). Dom0 should not be >>>involved in Xen's management of its own frame table and m2p. The mfns >>>making up the pmem/pblk regions should be treated just like any other >>>MMIO regions, and be handed wholesale to dom0 by default. >> >> That precludes the use as RAM extension, and I thought earlier rounds of >> discussion had got everyone in agreement that at least for the pmem case >> we will need some control data in Xen. > > The missing piece for me is why this reservation for control data > needs to be done in the libnvdimm core? I would expect that any dax > capable file could be mapped and made available to a guest. This > includes /dev/ramX devices that are dax capable, but are external to > the libnvdimm sub-system. Despite me being the only one on the To list, I don't think the question was really meant to be directed to me. Jan
Re: [linux-sunxi] [PATCH 4/5] ARM: dts: sun6i: add pinmux for PWM0
On Wed, Oct 12, 2016 at 12:20 PM, Icenowy Zheng wrote: > PWM0 is used by sun6i tablets as the backlight PWM. > > Add pinmux for it. > > Signed-off-by: Icenowy Zheng > --- > arch/arm/boot/dts/sun6i-a31.dtsi | 7 +++ > 1 file changed, 7 insertions(+) > > diff --git a/arch/arm/boot/dts/sun6i-a31.dtsi > b/arch/arm/boot/dts/sun6i-a31.dtsi > index 97626ce..76f5a06 100644 > --- a/arch/arm/boot/dts/sun6i-a31.dtsi > +++ b/arch/arm/boot/dts/sun6i-a31.dtsi > @@ -494,6 +494,13 @@ > allwinner,pull = ; > }; > > + pwm0_pins: pwm0@0 { > + allwinner,pins = "PH13"; > + allwinner,function = "pwm0"; > + allwinner,drive = ; > + allwinner,pull = ; Maxime is updating the pinctrl bindings to use generic pinconf, but otherwise this patch looks good. ChenYu > + }; > + > mmc0_pins_a: mmc0@0 { > allwinner,pins = "PF0", "PF1", "PF2", > "PF3", "PF4", "PF5"; > -- > 2.10.1
Re: [linux-sunxi] [PATCH 2/5] pwm: sun4i: Add support for PWM controller on sun6i SoCs
Hi, On Wed, Oct 12, 2016 at 12:20 PM, Icenowy Zheng wrote: > The PWM controller in A31 is different with other Allwinner SoCs, with a > control register per channel (in other SoCs the control register is > shared), and each channel are allocated 16 bytes of address (but only 8 > bytes are used.). The register map in one channel is just like a > single-channel A10 PWM controller, however, A31 have a different > prescaler table than other SoCs. > > In order to use the driver for all 4 channels, device nodes should be > created per channel. I think Maxime wants you to support the different register offsets in this driver, and have all 4 channels in the same device (node). ChenYu > Signed-off-by: Icenowy Zheng > --- > drivers/pwm/pwm-sun4i.c | 37 - > 1 file changed, 36 insertions(+), 1 deletion(-) > > diff --git a/drivers/pwm/pwm-sun4i.c b/drivers/pwm/pwm-sun4i.c > index 03a99a5..3e93bdf 100644 > --- a/drivers/pwm/pwm-sun4i.c > +++ b/drivers/pwm/pwm-sun4i.c > @@ -46,7 +46,7 @@ > > #define BIT_CH(bit, chan) ((bit) << ((chan) * PWMCH_OFFSET)) > > -static const u32 prescaler_table[] = { > +static const u32 prescaler_table_a10[] = { > 120, > 180, > 240, > @@ -65,10 +65,30 @@ static const u32 prescaler_table[] = { > 0, /* Actually 1 but tested separately */ > }; > > +static const u32 prescaler_table_a31[] = { > + 1, > + 2, > + 4, > + 8, > + 16, > + 32, > + 64, > + 0, > + 0, > + 0, > + 0, > + 0, > + 0, > + 0, > + 0, > + 0, > +}; > + > struct sun4i_pwm_data { > bool has_prescaler_bypass; > bool has_rdy; > unsigned int npwm; > + const u32 *prescaler_table; > }; > > struct sun4i_pwm_chip { > @@ -100,6 +120,7 @@ static int sun4i_pwm_config(struct pwm_chip *chip, struct > pwm_device *pwm, > int duty_ns, int period_ns) > { > struct sun4i_pwm_chip *sun4i_pwm = to_sun4i_pwm_chip(chip); > + const u32 *prescaler_table = sun4i_pwm->data->prescaler_table; > u32 prd, dty, val, clk_gate; > u64 clk_rate, div = 0; > unsigned int prescaler = 0; > @@ -264,24 +285,35 @@ static const struct 
sun4i_pwm_data sun4i_pwm_data_a10 = > { > .has_prescaler_bypass = false, > .has_rdy = false, > .npwm = 2, > + .prescaler_table = prescaler_table_a10, > }; > > static const struct sun4i_pwm_data sun4i_pwm_data_a10s = { > .has_prescaler_bypass = true, > .has_rdy = true, > .npwm = 2, > + .prescaler_table = prescaler_table_a10, > }; > > static const struct sun4i_pwm_data sun4i_pwm_data_a13 = { > .has_prescaler_bypass = true, > .has_rdy = true, > .npwm = 1, > + .prescaler_table = prescaler_table_a10, > }; > > static const struct sun4i_pwm_data sun4i_pwm_data_a20 = { > .has_prescaler_bypass = true, > .has_rdy = true, > .npwm = 2, > + .prescaler_table = prescaler_table_a10, > +}; > + > +static const struct sun4i_pwm_data sun4i_pwm_data_a31 = { > + .has_prescaler_bypass = false, > + .has_rdy = true, > + .npwm = 1, > + .prescaler_table = prescaler_table_a31, > }; > > static const struct of_device_id sun4i_pwm_dt_ids[] = { > @@ -298,6 +330,9 @@ static const struct of_device_id sun4i_pwm_dt_ids[] = { > .compatible = "allwinner,sun7i-a20-pwm", > .data = &sun4i_pwm_data_a20, > }, { > + .compatible = "allwinner,sun6i-a31-pwm", > + .data = &sun4i_pwm_data_a31 > + }, { > /* sentinel */ > }, > }; > -- > 2.10.1
Re: [RFC PATCH 00/11] Introduce writeback connectors
Hi Eric,

On Tue, Oct 11, 2016 at 12:01:14PM -0700, Eric Anholt wrote:
> Brian Starkey writes:
>
> > Hi,
> >
> > This RFC series introduces a new connector type:
> >  DRM_MODE_CONNECTOR_WRITEBACK
> > It is a follow-on from a previous discussion: [1]
> >
> > Writeback connectors are used to expose the memory writeback engines
> > found in some display controllers, which can write a CRTC's
> > composition result to a memory buffer.
> > This is useful e.g. for testing, screen-recording, screenshots,
> > wireless display, display cloning, memory-to-memory composition.
> >
> > Patches 1-7 include the core framework changes required, and patches
> > 8-11 implement a writeback connector for the Mali-DP writeback engine.
> > The Mali-DP patches depend on this other series: [2].
> >
> > The connector is given the FB_ID property for the output framebuffer,
> > and two new read-only properties: PIXEL_FORMATS and
> > PIXEL_FORMATS_SIZE, which expose the supported framebuffer pixel
> > formats of the engine.
> >
> > The EDID property is not exposed for writeback connectors.
> >
> > Writeback connector usage:
> > --------------------------
> > Due to connector routing changes being treated as "full modeset"
> > operations, any client which wishes to use a writeback connector
> > should include the connector in every modeset. The writeback will not
> > actually become active until a framebuffer is attached.
> >
> > The writeback itself is enabled by attaching a framebuffer to the
> > FB_ID property of the connector. The driver must then ensure that the
> > CRTC content of that atomic commit is written into the framebuffer.
> >
> > The writeback works in a one-shot mode with each atomic commit. This
> > prevents the same content from being written multiple times.
> > In some cases (front-buffer rendering) there might be a desire for
> > continuous operation - I think a property could be added later for
> > this kind of control.
> >
> > Writeback can be disabled by setting FB_ID to zero.
>
> I think this sounds great, and the interface is just right IMO.

Thanks, glad you like it! Hopefully you're equally agreeable with the
changes Daniel has been suggesting.

> I don't really see a use for continuous mode -- a sequence of one-shots
> makes a lot more sense because then you can know what data has changed,
> which anyone trying to use the writeback buffer would need to know.

Agreed - we've never found a use for it.

> > Known issues:
> > -------------
> >  * I'm not sure what "DPMS" should mean for writeback connectors.
> >    It could be used to disable writeback (even when a framebuffer is
> >    attached), or it could be hidden entirely (which would break the
> >    legacy DPMS call for writeback connectors).
> >  * With Daniel's recent re-iteration of the userspace API rules, I
> >    fully expect to provide some userspace code to support this. The
> >    question is what, and where? We want to use writeback for testing,
> >    so perhaps some tests in igt is suitable.
> >  * Documentation. Probably some portion of this cover letter needs to
> >    make it into Documentation/
> >  * Synchronisation. Our hardware will finish the writeback by the next
> >    vsync. I've not implemented fence support here, but it would be an
> >    obvious addition.
>
> My hardware won't necessarily finish by the next vsync -- it trickles
> out at whatever rate it can find memory bandwidth to get the job done,
> and fires an interrupt when it's finished.

Is it bounded? You presumably have to finish the write-out before you
can change any input buffers?

> So I would like some definition for how syncing works. One answer would
> be that these flips don't trigger their pageflip events until the
> writeback is done (so I need to collect both the vsync irq and the
> writeback irq before sending). Another would be that we manage an
> independent fence for the writeback fb, so that you still immediately
> know when framebuffers from the previous scanout-only frame are idle.

I much prefer the sound of the explicit fence approach.

Hopefully we can agree that a new atomic commit can't be completed
whilst there's a writeback ongoing, otherwise managing the fence and
framebuffer lifetime sounds really tricky - they'd need to be decoupled
from the atomic_state and outlive the commit that spawned them.

Cheers,
-Brian

> Also, tests for this in igt, please. Writeback in igt will give us so
> much more ability to cover KMS functionality on non-Intel hardware.
Re: [PATCH] sched/fair: Do not decay new task load on first enqueue
On 11 October 2016 at 20:57, Matt Fleming wrote:
> On Tue, 11 Oct, at 03:14:47PM, Vincent Guittot wrote:
>> >
>> > I see a regression,
>> >
>> >   baseline: 2.41228
>> >   patched : 2.64528 (-9.7%)
>>
>> Just to be sure; By baseline you mean v4.8 ?
>
> Baseline is actually tip/sched/core commit 447976ef4fd0
> ("sched/irqtime: Consolidate irqtime flushing code") but I could try
> out v4.8 instead if you'd prefer that.

ok. In fact, I have noticed another regression with tip/sched/core and
hackbench while looking at yours.
I have bisected it to:
10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings")

hackbench -P -g 1

        v4.8         tip/sched/core   tip/sched/core+revert 10e2f1acd010
                                      and 1b568f0aabf2
min     0.051        0.052            0.049
avg     0.057(0%)    0.062(-7%)       0.056(+1%)
max     0.070        0.073            0.067
stdev   +/-8%        +/-10%           +/-9%

The issue seems to be that it prevents some migration at wake up at the
end of the hackbench test, so the last tasks compete for the same CPU
while other CPUs in the same MC domain are idle. I haven't yet looked
more deeply into which part of the patch causes the regression.

>
>> > cat /tmp/trace.$1 | grep -E "wakeup_new.*comm=hackbench" | \
>> > sed -e 's/.*target_cpu=//' | sort | uniq -c | awk '{print $1}'
>>
>> nice command to evaluate spread
>
> Thanks!
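
As a side note for readers checking the quoted figures: the baseline/patched pair above can be turned into a percentage with the usual relative-change formula. The helper below is only an illustrative sketch; the name demo_pct_change is made up here and is not part of hackbench or the kernel. A positive result means the patched run took longer, which the thread writes with a minus sign because lower times are better.

```c
/*
 * Relative change of a benchmark time against a baseline, in percent.
 * Hypothetical helper for illustration only.
 * Positive result: the patched run was slower than the baseline.
 */
static inline double demo_pct_change(double baseline, double patched)
{
	return (patched - baseline) / baseline * 100.0;
}
```

With baseline 2.41228 and patched 2.64528 this gives roughly 9.7, matching the "(-9.7%)" reported in the quoted mail.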
Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
On Tue 11-10-16 16:51:30, Ross Zwisler wrote:
> On Tue, Oct 11, 2016 at 10:31:52AM +0200, Jan Kara wrote:
> > On Fri 07-10-16 15:09:02, Ross Zwisler wrote:
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index ac3cd05..e51d51f 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
> > >    * queue to the start of that PMD.  This ensures that all offsets in
> > >    * the range covered by the PMD map to the same bit lock.
> > >    */
> > > - if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > > + if ((unsigned long)entry & RADIX_DAX_PMD)
> > >           index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
> >
> > I agree with Christoph - helper for masking type bits would make this
> > nicer.
>
> Fixed via a dax_flag_test() helper as I outlined in the mail to Christoph. It
> seems clean to me, but if you or Christoph feel strongly that it would be
> cleaner as a local 'flags' variable, I'll make the change.

One idea I had is that you could have helpers like:

dax_is_pmd_entry()
dax_is_pte_entry()
dax_is_empty_entry()
dax_is_hole_entry()

And then you would use these helpers - all the flags would be hidden in the
helpers so even if we decide to change the flagging scheme to compress
things or so, it should be pretty local change.

> > > -         entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> > > -                        RADIX_DAX_ENTRY_LOCK);
> > > +
> > > +         /*
> > > +          * Besides huge zero pages the only other thing that gets
> > > +          * downgraded are empty entries which don't need to be
> > > +          * unmapped.
> > > +          */
> > > +         if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> > > +                 unmap_mapping_range(mapping,
> > > +                         (index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > > +
> > >           spin_lock_irq(&mapping->tree_lock);
> > > -         err = radix_tree_insert(&mapping->page_tree, index, entry);
> > > +
> > > +         if (pmd_downgrade) {
> > > +                 radix_tree_delete(&mapping->page_tree, index);
> > > +                 mapping->nrexceptional--;
> > > +                 dax_wake_mapping_entry_waiter(mapping, index, entry,
> > > +                                 false);
> >
> > You need to set 'wake_all' argument here to true. Otherwise there could be
> > waiters waiting for non-existent entry forever...
>
> Interesting. Fixed, but let me make sure I understand. So is the issue that
> you could have say 2 tasks waiting on a PMD index that has been rounded down
> to the PMD index via dax_entry_waitqueue()?
>
> The person holding the lock on the entry would remove the PMD, insert a PTE
> and wake just one of the PMD aligned waiters. That waiter would wake up, do
> something PTE based (since the PMD space is now polluted with PTEs), and then
> wake any waiters on it's PTE index. Meanwhile, the second waiter could sleep
> forever on the PMD aligned index. Is this correct?

Yes.

> So, perhaps more succinctly:
>
> Thread 1                      Thread 2                      Thread 3
>
> index 0x202, hold PMD lock 0x200
>                               index 0x203, sleep on 0x200
>                                                             index 0x204, sleep on 0x200
> downgrade, removing 0x200
> wake one waiter on 0x200
> insert PTE @ 0x202
>                               wake up, grab index 0x203
>                               ...
>                               wake one waiter on index 0x203
>
>                                                             ... sleeps forever
> Right?

Exactly.

> > > @@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> > >           error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
> > >           if (error)
> > >                   return ERR_PTR(error);
> > > + } else if (((unsigned long)entry & RADIX_DAX_HZP) &&
> > > +                 !(flags & RADIX_DAX_HZP)) {
> > > +         /* replacing huge zero page with PMD block mapping */
> > > +         unmap_mapping_range(mapping,
> > > +                 (vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > >   }
> > >
> > >   spin_lock_irq(&mapping->tree_lock);
> > > - new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > > -                RADIX_DAX_ENTRY_LOCK);
> > > + new_entry = dax_radix_entry(sector, flags);
> > > +
> >
> > You've lost the RADIX_DAX_ENTRY_LOCK flag here?
>
> Oh, nope, that's embedded in the dax_radix_entry() helper:
>
> /* entries begin locked */
> static inline void *dax_radix_entry(sector_t sector, unsigned long flags)
> {
>         return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
>                         ((unsigned long)sector << RADIX_DAX_SHIFT) |
>                         RADIX_DAX_ENTRY_LOCK);
> }
>
> I'll s/dax_radix_entry/dax_radix_locked_entry/ or something to make this
> clearer to the reader.

Yep, that wo
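
The helper idea and the "entries begin locked" constructor discussed in this thread can be sketched in user space as follows. This is only an illustration under stated assumptions: every DEMO_ constant is a made-up stand-in (the real bit positions live in the kernel headers and are not given in this mail), the demo_ function names are hypothetical, and mapping the "hole" helper to the huge-zero-page flag is a guess at the intent.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only, not the kernel's definitions. */
#define DEMO_PAGE_SHIFT		12	/* assumption: 4 KiB pages */
#define DEMO_PMD_SHIFT		21	/* assumption: 2 MiB PMD entries */

#define DEMO_EXCEPTIONAL_ENTRY	0x2UL		/* stand-in for RADIX_TREE_EXCEPTIONAL_ENTRY */
#define DEMO_DAX_ENTRY_LOCK	(1UL << 2)
#define DEMO_DAX_PMD		(1UL << 3)
#define DEMO_DAX_HZP		(1UL << 4)
#define DEMO_DAX_EMPTY		(1UL << 5)
#define DEMO_DAX_SHIFT		6

/* Type-test helpers: callers never touch the flag bits directly. */
static inline bool demo_dax_is_pmd_entry(const void *entry)
{
	return ((uintptr_t)entry & DEMO_DAX_PMD) != 0;
}

static inline bool demo_dax_is_pte_entry(const void *entry)
{
	return !demo_dax_is_pmd_entry(entry);
}

static inline bool demo_dax_is_empty_entry(const void *entry)
{
	return ((uintptr_t)entry & DEMO_DAX_EMPTY) != 0;
}

/* Assumption: a "hole" entry is one backed by the huge zero page. */
static inline bool demo_dax_is_hole_entry(const void *entry)
{
	return ((uintptr_t)entry & DEMO_DAX_HZP) != 0;
}

/* Build an entry that starts out locked, as in the quoted helper. */
static inline void *demo_dax_radix_locked_entry(unsigned long sector,
						unsigned long flags)
{
	return (void *)(uintptr_t)(DEMO_EXCEPTIONAL_ENTRY | flags |
				   (sector << DEMO_DAX_SHIFT) |
				   DEMO_DAX_ENTRY_LOCK);
}

/* PMD entries share one wait queue slot: round the index down to the PMD. */
static inline unsigned long demo_waitqueue_index(const void *entry,
						 unsigned long index)
{
	if (demo_dax_is_pmd_entry(entry))
		index &= ~((1UL << (DEMO_PMD_SHIFT - DEMO_PAGE_SHIFT)) - 1);
	return index;
}
```

Under these assumptions, indices 0x202, 0x203 and 0x204 of a PMD entry all round down to 0x200, which is why every waiter in the three-thread scenario sleeps on the same index and a wake-one after the downgrade can strand the remaining sleeper.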
Re: [PATCH v2] x86/tsc: Set X86_FEATURE_TSC_RELIABLE to skip refined calibration
On Tue, 11 Oct 2016, Bin Gao wrote:
> On Fri, Aug 26, 2016 at 12:14:58PM +0200, Thomas Gleixner wrote:
>
> The Linux kernel does think a reliable calibration implies the reliability
> (i.e. no watchdog required). I'm posting some code pieces to explain.

I know that and I know exactly how all that works. And I certainly did not ask
for an explanation of the current state of affairs.

Here is what I wrote:

> > Second thoughts. We should separate the calibration aspect from the
> > reliability aspect.
> >
> > If a MSR/CPUID readout provides reliable calibration then this does not tell
> > us about the reliability (i.e. no watchdog required). So having two flags for
> > this - and sure you can set both on those SoCs is the proper solution.

In other words: I want to have two separate flags:

 1) FEATURE_KNOWN_FREQUENCY - Grab the frequency from CPUID/MSR or whatever and
    skip the whole calibration thing

 2) FEATURE_RELIABLE - Do not invoke the watchdog

Thanks,

	tglx
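
The two-flag split requested above could look roughly like the sketch below. It is a user-space illustration only: the DEMO_ bit values and the demo_ function names are invented here, and nothing in this mail says how the real feature bits are numbered or where they would be checked.

```c
#include <stdbool.h>

/* Hypothetical feature bits mirroring the two flags named in the mail. */
#define DEMO_FEATURE_KNOWN_FREQUENCY	(1u << 0)	/* frequency read from CPUID/MSR */
#define DEMO_FEATURE_RELIABLE		(1u << 1)	/* clocksource needs no watchdog */

/* Calibration is skipped only when the frequency is already known. */
static inline bool demo_needs_calibration(unsigned int features)
{
	return !(features & DEMO_FEATURE_KNOWN_FREQUENCY);
}

/* The watchdog decision is independent of how the frequency was obtained. */
static inline bool demo_needs_watchdog(unsigned int features)
{
	return !(features & DEMO_FEATURE_RELIABLE);
}
```

The point of keeping the bits separate is that a platform can set only the first one: calibration is skipped but the watchdog still runs. A SoC that qualifies for both sets both.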
Re: [PATCH v3 0/1] man/set_mempolicy.2,mbind.2: add MPOL_LOCAL NUMA memory policy documentation
Hello Piotr,

On 10/10/2016 06:23 PM, Piotr Kwapulinski wrote:
> The MPOL_LOCAL mode has been implemented by
> Peter Zijlstra
> (commit: 479e2802d09f1e18a97262c4c6f8f17ae5884bd8).
> Add the documentation for this mode.

Thanks. I've applied this patch. I have a question below.

> Signed-off-by: Piotr Kwapulinski
> ---
> This version fixes grammar
> ---
>  man2/mbind.2         | 28
>  man2/set_mempolicy.2 | 19 ++-
>  2 files changed, 42 insertions(+), 5 deletions(-)
>
> diff --git a/man2/mbind.2 b/man2/mbind.2
> index 3ea24f6..854580c 100644
> --- a/man2/mbind.2
> +++ b/man2/mbind.2
> @@ -130,8 +130,9 @@ argument must specify one of
>  .BR MPOL_DEFAULT ,
>  .BR MPOL_BIND ,
>  .BR MPOL_INTERLEAVE ,
> +.BR MPOL_PREFERRED ,
>  or
> -.BR MPOL_PREFERRED .
> +.BR MPOL_LOCAL .
>  All policy modes except
>  .B MPOL_DEFAULT
>  require the caller to specify via the
> @@ -258,9 +259,26 @@ and
>  .I maxnode
>  arguments specify the empty set, then the memory is allocated on
>  the node of the CPU that triggered the allocation.
> -This is the only way to specify "local allocation" for a
> -range of memory via
> -.BR mbind ().
> +
> +.B MPOL_LOCAL
> +specifies the "local allocation", the memory is allocated on
> +the node of the CPU that triggered the allocation, "local node".
> +The
> +.I nodemask
> +and
> +.I maxnode
> +arguments must specify the empty set. If the "local node" is low
> +on free memory the kernel will try to allocate memory from other
> +nodes. The kernel will allocate memory from the "local node"
> +whenever memory for this node is available. If the "local node"
> +is not allowed by the process's current cpuset context the kernel
> +will try to allocate memory from other nodes. The kernel will
> +allocate memory from the "local node" whenever it becomes allowed
> +by the process's current cpuset context. In contrast
> +.B MPOL_DEFAULT
> +reverts to the policy of the process which may have been set with
> +.BR set_mempolicy (2).
> +It may not be the "local allocation".

What is the sense of "may not be" here? (And repeated below).
Is the meaning "this could be something other than"?
Presumably the answer is yes, in which case I'll clarify
the wording there. Let me know.

Cheers,

Michael

>
>  If
>  .B MPOL_MF_STRICT
> @@ -440,6 +458,8 @@ To select explicit "local allocation" for a memory range,
>  specify a
>  .I mode
>  of
> +.B MPOL_LOCAL
> +or
>  .B MPOL_PREFERRED
>  with an empty set of nodes.
>  This method will work for
> diff --git a/man2/set_mempolicy.2 b/man2/set_mempolicy.2
> index 1f02037..22b0f7c 100644
> --- a/man2/set_mempolicy.2
> +++ b/man2/set_mempolicy.2
> @@ -79,8 +79,9 @@ argument must specify one of
>  .BR MPOL_DEFAULT ,
>  .BR MPOL_BIND ,
>  .BR MPOL_INTERLEAVE ,
> +.BR MPOL_PREFERRED ,
>  or
> -.BR MPOL_PREFERRED .
> +.BR MPOL_LOCAL .
>  All modes except
>  .B MPOL_DEFAULT
>  require the caller to specify via the
> @@ -211,6 +212,22 @@ arguments specify the empty set, then the policy
>  specifies "local allocation"
>  (like the system default policy discussed above).
>
> +.B MPOL_LOCAL
> +specifies the "local allocation", the memory is allocated on
> +the node of the CPU that triggered the allocation, "local node".
> +The
> +.I nodemask
> +and
> +.I maxnode
> +arguments must specify the empty set. If the "local node" is low
> +on free memory the kernel will try to allocate memory from other
> +nodes. The kernel will allocate memory from the "local node"
> +whenever memory for this node is available. If the "local node"
> +is not allowed by the process's current cpuset context the kernel
> +will try to allocate memory from other nodes. The kernel will
> +allocate memory from the "local node" whenever it becomes allowed
> +by the process's current cpuset context.
> +
>  The thread memory policy is preserved across an
>  .BR execve (2),
>  and is inherited by child threads created using
> --
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Re: [PATCH 2/2] powernv: Pass PSSCR value and mask to power9_idle_stop
Gautham R Shenoy writes:
> On Tue, Oct 04, 2016 at 10:33:27PM +1100, Balbir Singh wrote:
>>
>>
>> On 04/10/16 21:32, Michael Ellerman wrote:
>> > "Gautham R. Shenoy" writes:
>> >
>> >> From: "Gautham R. Shenoy"
>> >>
>> >> The power9_idle_stop method currently takes only the requested stop
>> >> level as a parameter and picks up the rest of the PSSCR bits from a
>> >> hand-coded macro. This is not a very flexible design, especially when
>> >> the firmware has the capability to communicate the psscr value and the
>> >> mask associated with a particular stop state via device tree.
>> >>
>> >> This patch modifies the power9_idle_stop API to take as parameters the
>> >> PSSCR value and the PSSCR mask corresponding to the stop state that
>> >> needs to be set. These PSSCR value and mask are respectively obtained
>> >> by parsing the "ibm,cpu-idle-state-psscr" and
>> >> "ibm,cpu-idle-state-psscr-mask" fields from the device tree.
>> >>
>> >> In addition to this, the patch adds support for handling stop states
>> >> for which ESL and EC bits in the PSSCR are zero. As per the
>> >> architecture, a wakeup from these stop states resumes execution from
>> >> the subsequent instruction as opposed to waking up at the System
>> >> Vector.
>> >
>> > That looks good.
>> >
>> >> This patch depends on the following skiboot patch that exports the
>> >> PSSCR values and the mask for all the stop states:
>> >> https://lists.ozlabs.org/pipermail/skiboot/2016-September/004869.html
>> >
>> > But we can't depend on a skiboot patch. The kernel has to cope with
>> > running on an old skiboot.
>> >
> Hmm.. We can still do that. The older skiboot only provides the RL
> field of the PSSCR value for each stop state and the corresponding
> PSSCR mask is set to 0xF in the older skiboot for all the stop states.
>
> We can insist that the future skiboot sets the ESL, EC, PSLL, TR, MTL
> and the RL fields of the PSSCR for any exported stop state. This
> should be reflected in the psscr_mask of that stop state. Thus, the
> psscr_mask of any stop state proposed in the future will have:
> (PSSCR_ESL_MASK | PSCCR_EC_MASK | PSCCR_PSLL_MASK | PSSCR_TR_MASK |
> PSSCR_MTL_MASK | PSSCR_RL_MASK) bits set in the skiboot.
>
> To handle the older firmware, we can do something like the following
> during the discovery of the stop states to mimic the behaviour present
> in the 4.8 kernel running on older firmware.
>
> === drivers/cpuidle/cpuidle-powernv.c ===
> /*
>  * By default we set the ESL and EC bits in the PSSCR.
>  * The MTL and PSLL are set to the maximum value possible as per the
>  * ISA, i.e 15.
>  * The Transition Rate is set to the Maximum value 3.
>  */
> #define DEFAULT_PSSCR_VAL       PSSCR_ESL_MASK | \
>                                 PSCCR_EC_MASK | PSCCR_PSLL_MASK |\
>                                 PSSCR_TR_MASK | PSSCR_MTL_MASK
>
> #define DEFAULT_PSSCR_MASK      PSSCR_ESL_MASK | \
>                                 PSCCR_EC_MASK | PSCCR_PSLL_MASK |\
>                                 PSSCR_TR_MASK | PSSCR_MTL_MASK | \
>                                 PSSCR_RL_MASK
>
>
> static int powernv_add_idle_states(void)
> {
>         .
>         .
>         .
>         for (i = 0; i < dt_idle_states; i++) {
>                 u64 val, mask;
>                 .
>                 .
>                 .
>                 val = (DEFAULT_PSSCR_VAL & ~psscr_mask[i]) | psscr_val[i];
>                 mask = DEFAULT_PSSCR_MASK | psscr_mask[i];
>                 stop_psscr_table[nr_idle_states].val = val;
>                 stop_psscr_table[nr_idle_states].mask = mask;
>         }
> }
>
>
>
> Is this approach ok ?

What if we just treat the 0xF state from firmware as special and set it
to DEFAULT_PSSCR_MASK in that case? That deals with old skiboot, new
kernel, and sets a pretty small special case that's easy to track into
the future as something we should watch out for.

Additionally, if we make skiboot set sane values in ~DEFAULT_PSSCR_MASK
for valid fields in PSSCR on boot/(also kexec?), then we should end up
in a situation where everything works with everything (even if you don't
get the best power saving). Specifically, new skiboot, old kernel... but
it looks like there's nothing currently missing there.

Should this patch also have Fixes: 3005c597ba4 and CC to stable?

--
Stewart Smith
OPAL Architect, IBM.
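
For readers following the value/mask merge proposed in the quoted snippet, here is a small user-space sketch of the same arithmetic. The DEMO_ field masks are assumed bit layouts chosen for illustration, and demo_merge_psscr is a made-up name; neither is confirmed by this thread. One deliberate difference from the quoted macros: the defaults are wrapped in parentheses, because in C `&` binds tighter than `|`, so an unparenthesized `A | B & ~mask` would only mask the last term.

```c
#include <stdint.h>

/* Assumed PSSCR field masks, for illustration only. */
#define DEMO_PSSCR_RL_MASK	0x0000000FULL
#define DEMO_PSSCR_MTL_MASK	0x000000F0ULL
#define DEMO_PSSCR_TR_MASK	0x00000300ULL
#define DEMO_PSSCR_PSLL_MASK	0x000F0000ULL
#define DEMO_PSSCR_EC_MASK	0x00100000ULL
#define DEMO_PSSCR_ESL_MASK	0x00200000ULL

/* Defaults: ESL and EC set, PSLL/TR/MTL at their maximum values. */
#define DEMO_DEFAULT_PSSCR_VAL	(DEMO_PSSCR_ESL_MASK | DEMO_PSSCR_EC_MASK | \
				 DEMO_PSSCR_PSLL_MASK | DEMO_PSSCR_TR_MASK | \
				 DEMO_PSSCR_MTL_MASK)
#define DEMO_DEFAULT_PSSCR_MASK	(DEMO_DEFAULT_PSSCR_VAL | DEMO_PSSCR_RL_MASK)

struct demo_stop_psscr {
	uint64_t val;
	uint64_t mask;
};

/*
 * Fields the firmware owns (bits set in fw_mask) come from fw_val;
 * every other field falls back to the default value.
 */
static inline struct demo_stop_psscr demo_merge_psscr(uint64_t fw_val,
						      uint64_t fw_mask)
{
	struct demo_stop_psscr out;

	out.val = (DEMO_DEFAULT_PSSCR_VAL & ~fw_mask) | fw_val;
	out.mask = DEMO_DEFAULT_PSSCR_MASK | fw_mask;
	return out;
}
```

With an old-firmware style mask of 0xF only the RL field is taken from firmware and the defaults fill in the rest. With a mask covering all six fields the firmware value is used unchanged, including ESL and EC set to zero.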
[PATCH v6 2/2] devicetree: bindings: uart: Add new compatible string for ZynqMP
From: Nava kishore Manne

This patch adds the new compatible string for ZynqMP SoC.

Signed-off-by: Nava kishore Manne
---
Changes for v6:
-Added new compatible string for ZynqMP SoC as suggested by Rob Herring.
Changes for v5:
-Modified the compatible section.
Changes for v4:
-Modified the ChangeLog comment.
Changes for v3:
-Added changeLog comment.
Changes for v2:
-None

 Documentation/devicetree/bindings/serial/cdns,uart.txt | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/serial/cdns,uart.txt b/Documentation/devicetree/bindings/serial/cdns,uart.txt
index a3eb154..227bb77 100644
--- a/Documentation/devicetree/bindings/serial/cdns,uart.txt
+++ b/Documentation/devicetree/bindings/serial/cdns,uart.txt
@@ -1,7 +1,9 @@
 Binding for Cadence UART Controller

 Required properties:
-- compatible : should be "cdns,uart-r1p8", or "xlnx,xuartps"
+- compatible :
+  Use "xlnx,xuartps","cdns,uart-r1p8" for Zynq-7xxx SoC.
+  Use "xlnx,zynqmp-uart","cdns,uart-r1p12" for Zynq Ultrascale+ MPSoC.
 - reg: Should contain UART controller registers location and length.
 - interrupts: Should contain UART controller interrupts.
 - clocks: Must contain phandles to the UART clocks
--
2.1.1
[PATCH v3 1/4] mm: don't steal highatomic pageblock
In the page freeing path, migratetype is racy, so a highorderatomic
page could be freed into a non-highorderatomic free list. If that page
is allocated, the VM can change the pageblock from highorderatomic to
something else. In that case, highatomic pageblock accounting is broken
so it doesn't work (e.g., the VM cannot reserve highorderatomic pageblocks
any more although it hasn't reached the 1% limit).

So, this patch prohibits changing a pageblock from highatomic to another
type. It's no problem because MIGRATE_HIGHATOMIC is not listed in the
fallback array, so stealing will only happen due to unexpected races,
which are really rare. Also, this prohibition keeps highatomic
pageblocks around longer, so it would be better for highorderatomic page
allocation.

Signed-off-by: Minchan Kim
Acked-by: Vlastimil Babka
Acked-by: Mel Gorman
---
 mm/page_alloc.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 55ad0229ebf3..79853b258211 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2154,7 +2154,8 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
                page = list_first_entry(&area->free_list[fallback_mt],
                                                struct page, lru);
-               if (can_steal)
+               if (can_steal &&
+                       get_pageblock_migratetype(page) != MIGRATE_HIGHATOMIC)
                        steal_suitable_fallback(zone, page, start_migratetype);

                /* Remove the page from the freelists */
@@ -2555,7 +2556,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
                struct page *endpage = page + (1 << order) - 1;
                for (; page < endpage; page += pageblock_nr_pages) {
                        int mt = get_pageblock_migratetype(page);
-                       if (!is_migrate_isolate(mt) && !is_migrate_cma(mt))
+                       if (!is_migrate_isolate(mt) && !is_migrate_cma(mt)
+                               && mt != MIGRATE_HIGHATOMIC)
                                set_pageblock_migratetype(page,
                                                          MIGRATE_MOVABLE);
                }
--
2.7.4
[PATCH v3 2/4] mm: prevent double decrease of nr_reserved_highatomic
There is a race between page freeing and unreserving highatomic pageblocks.

 CPU 0                                  CPU 1

 free_hot_cold_page
   mt = get_pfnblock_migratetype
   set_pcppage_migratetype(page, mt)
                                        unreserve_highatomic_pageblock
                                        spin_lock_irqsave(&zone->lock)
                                        move_freepages_block
                                        set_pageblock_migratetype(page)
                                        spin_unlock_irqrestore(&zone->lock)
   free_pcppages_bulk
     __free_one_page(mt) <- mt is stale

Because of the above race, a page on CPU 0 could go to a
non-highorderatomic free list since the pageblock's type has changed.
As a result, the highorderatomic unreserve logic can decrease the
reserved count for the same pageblock several times, which causes a
mismatch between nr_reserved_highatomic and the number of reserved
pageblocks.

So, this patch verifies whether the pageblock is highatomic or not
and decreases the count only if the pageblock is highatomic.

Signed-off-by: Minchan Kim
Acked-by: Vlastimil Babka
Acked-by: Mel Gorman
---
 mm/page_alloc.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 79853b258211..18808f392718 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2106,13 +2106,25 @@ static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
                                continue;

                        /*
-                        * It should never happen but changes to locking could
-                        * inadvertently allow a per-cpu drain to add pages
-                        * to MIGRATE_HIGHATOMIC while unreserving so be safe
-                        * and watch for underflows.
+                        * In page freeing path, migratetype change is racy so
+                        * we can encounter several free pages in a pageblock
+                        * in this loop although we changed the pageblock type
+                        * from highatomic to ac->migratetype. So we should
+                        * adjust the count once.
                         */
-                       zone->nr_reserved_highatomic -= min(pageblock_nr_pages,
-                               zone->nr_reserved_highatomic);
+                       if (get_pageblock_migratetype(page) ==
+                                               MIGRATE_HIGHATOMIC) {
+                               /*
+                                * It should never happen but changes to
+                                * locking could inadvertently allow a per-cpu
+                                * drain to add pages to MIGRATE_HIGHATOMIC
+                                * while unreserving so be safe and watch for
+                                * underflows.
+                                */
+                               zone->nr_reserved_highatomic -= min(
+                                               pageblock_nr_pages,
+                                               zone->nr_reserved_highatomic);
+                       }

                        /*
                         * Convert to ac->migratetype and avoid the normal
--
2.7.4
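
The accounting rule this patch introduces (decrement the reserve only while the pageblock is still highatomic, and never below zero) can be modelled in isolation. The sketch below is a user-space illustration; demo_unreserve_account and its parameters are hypothetical names, not kernel symbols.

```c
#include <stdbool.h>

/*
 * Model of the guarded decrement: return the new reserved-page count.
 * The count changes only when the block is still marked highatomic,
 * and the subtraction is clamped so it cannot underflow.
 */
static inline unsigned long demo_unreserve_account(unsigned long nr_reserved,
						   unsigned long pageblock_nr_pages,
						   bool block_is_highatomic)
{
	unsigned long dec;

	if (!block_is_highatomic)
		return nr_reserved;	/* already converted: do not count it twice */

	dec = pageblock_nr_pages < nr_reserved ? pageblock_nr_pages : nr_reserved;
	return nr_reserved - dec;
}
```

A second stale free page from the same, already converted pageblock then leaves the counter untouched, which is the double decrease the commit message describes.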
Re: [PATCH v2 4/4] mm: make unreserve highatomic functions reliable
On Wed, Oct 12, 2016 at 09:33:28AM +0200, Michal Hocko wrote:
> On Wed 12-10-16 14:33:36, Minchan Kim wrote:
> [...]
> > @@ -2138,8 +2146,10 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac)
> >                    */
> >                   set_pageblock_migratetype(page, ac->migratetype);
> >                   ret = move_freepages_block(zone, page, ac->migratetype);
> > -                 spin_unlock_irqrestore(&zone->lock, flags);
> > -                 return ret;
> > +                 if (!drain && ret) {
> > +                         spin_unlock_irqrestore(&zone->lock, flags);
> > +                         return ret;
> > +                 }
>
> I've already mentioned that during the previous discussion. This sounds

Yep, we did, but I sent the wrong version from my git tree. :(

> overly aggressive to me. Why do we want to drain the whole reserve and
> risk that we won't be able to build up a new one after OOM. Doing one
> block at the time should be sufficient IMHO.

I will resend with every review point addressed.
Thanks.
[PATCH v3 0/4] use up highorder free pages before OOM
I got an OOM report from the production team with a v4.4 kernel. It had
enough free memory but failed to allocate a GFP_KERNEL order-0 page and
finally encountered an OOM kill. It occurred during a QA process which
launches several apps, switches between them and so on. It happened
rarely. IOW, in a normal situation it was not a problem, but if we are
unlucky and several apps use peak memory at the same time, it can
happen. If we manage to pass that phase, the system can go on working
well.

I could reproduce it easily with my test (memory spike). Look at below.

The reason is that the free pages (19M) of the DMA32 zone are reserved
for HIGHORDERATOMIC and are not unreserved before the OOM.

balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
balloon cpuset=/ mems_allowed=0
CPU: 1 PID: 8473 Comm: balloon Tainted: GW OE 4.8.0-rc7-00219-g3f74c9559583-dirty #3161
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
 88007f15bbc8 8138eb13 88007f15bd88 88005a72a4c0 88007f15bc28 811d2d13 88007f15bc08 8146a5ca 81c8df60 0015 0206
Call Trace:
 [] dump_stack+0x63/0x90
 [] dump_header+0x5c/0x1ce
 [] ? virtballoon_oom_notify+0x2a/0x80
 [] oom_kill_process+0x22e/0x400
 [] out_of_memory+0x1ac/0x210
 [] __alloc_pages_nodemask+0x101e/0x1040
 [] handle_mm_fault+0xa0a/0xbf0
 [] __do_page_fault+0x1dd/0x4d0
 [] trace_do_page_fault+0x43/0x130
 [] do_async_page_fault+0x1a/0xa0
 [] async_page_fault+0x28/0x30
Mem-Info:
active_anon:383949 inactive_anon:106724 isolated_anon:0
 active_file:15 inactive_file:44 isolated_file:0
 unevictable:0 dirty:0 writeback:24 unstable:0
 slab_reclaimable:2483 slab_unreclaimable:3326
 mapped:0 shmem:0 pagetables:1906 bounce:0
 free:6898 free_pcp:291 free_cma:0
Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable?
no DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 1952 1952 1952 DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB 51131 total pagecache pages 50795 pages in swap cache Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228 Free swap = 8kB Total swap = 255996kB 524158 pages RAM 0 pages HighMem/MovableOnly 12658 pages reserved 0 pages cma reserved 0 pages hwpoisoned Another example exceeded the limit by the race is in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK) CPU: 0 PID: 476 Comm: in:imklog Tainted: GE 4.8.0-rc7-00217-g266ef83c51e5-dirty #3135 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 880077c37590 81389033 880077c37618 8117519b 02280020 81cedb40 0040 Call Trace: [] dump_stack+0x63/0x90 [] warn_alloc_failed+0xdb/0x130 [] __alloc_pages_nodemask+0x4d6/0xdb0 [] ? bdev_write_page+0xa9/0xd0 [] ? __page_check_address+0xd3/0x130 [] ? deactivate_slab+0x12a/0x3e0 [] new_slab+0x339/0x490 [] ___slab_alloc.constprop.74+0x367/0x480 [] ? alloc_indirect.isra.14+0x1d/0x50 [] ? 
default_wake_function+0x12/0x20 [] __slab_alloc.constprop.73+0x20/0x40 [] __kmalloc+0x1a4/0x1e0 [] alloc_indirect.isra.14+0x1d/0x50 [] virtqueue_add_sgs+0x1c4/0x470 [] ? __bt_get.isra.8+0xe5/0x1c0 [] __virtblk_add_req+0xae/0x1f0 [] ? wake_atomic_t_function+0x60/0x60 [] ? sched_clock+0x9/0x10 [] ? __blk_mq_alloc_request+0x10b/0x230 [] ? blk_rq_map_sg+0x213/0x550 [] virtio_queue_rq+0x12d/0x290 [] __blk_mq_run_hw_queue+0x239/0x370 [] blk_mq_run_hw_queue+0x8f/0xb0 [] blk_mq_insert_requests+0x18c/0x1a0 [] blk_mq_flush_plug_list+0x125/0x140 [] blk_flush_plug_list+0xc7/0x220 [] blk_finish_plug+0x2c/0x40 [] __do_page_cache_readahead+0x196/0x230 [] ? zram_free_page+0x3a/0xb0 [zram] [] filemap_fault+0x448/0x4f0 [] ? allo
[PATCH v3 4/4] mm: make unreserve highatomic functions reliable
Currently, unreserve_highatomic_pageblock bails out as soon as it finds
a highatomic pageblock, regardless of whether it really moved free
pages out of it, which can defeat the unreserve logic's goal of saving
a process from OOM.

This patch makes the unreserve functions bail out only if they move some
pages out of the !highatomic free list, to avoid such false positives.

Another potential problem is that, due to a race between page freeing
and the reserve highatomic function, pages could be in the highatomic
free list even though the pageblock is of !highatomic migratetype. In
that case, unreserve_highatomic_pageblock can be a no-op if the count
of the highatomic reserve is less than pageblock_nr_pages. We could
solve it simply by draining all of the reserved pages before the OOM.
It would act as a safeguard to exhaust reserved pages before converging
to OOM.

Signed-off-by: Minchan Kim
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
---
 mm/page_alloc.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd2f0e1bffc4..163d7fa759a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2079,8 +2079,12 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
  * potentially hurts the reliability of high-order allocations when under
  * intense memory pressure but failed atomic allocations should be easier
  * to recover from than an OOM.
+ *
+ * If @force is true, try to unreserve a pageblock even though highatomic
+ * pageblock is exhausted.
  */
-static bool unreserve_highatomic_pageblock(const struct alloc_context *ac)
+static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
+                                               bool force)
 {
        struct zonelist *zonelist = ac->zonelist;
        unsigned long flags;
@@ -2092,8 +2096,12 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac)
        for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
                                                                ac->nodemask) {
-               /* Preserve at least one pageblock */
-               if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
+               /*
+                * Preserve at least one pageblock unless memory pressure
+                * is really high.
+                */
+               if (!force && zone->nr_reserved_highatomic <=
+                       pageblock_nr_pages)
                        continue;

                spin_lock_irqsave(&zone->lock, flags);
@@ -2138,8 +2146,10 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac)
                         */
                        set_pageblock_migratetype(page, ac->migratetype);
                        ret = move_freepages_block(zone, page, ac->migratetype);
-                       spin_unlock_irqrestore(&zone->lock, flags);
-                       return ret;
+                       if (ret) {
+                               spin_unlock_irqrestore(&zone->lock, flags);
+                               return ret;
+                       }
                }
                spin_unlock_irqrestore(&zone->lock, flags);
        }
@@ -3343,7 +3353,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
         * Shrink them them and try again
         */
        if (!page && !drained) {
-               unreserve_highatomic_pageblock(ac);
+               unreserve_highatomic_pageblock(ac, false);
                drain_all_pages(NULL);
                drained = true;
                goto retry;
@@ -3462,7 +3472,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
         */
        if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
                /* Before OOM, exhaust highatomic_reserve */
-               return unreserve_highatomic_pageblock(ac);
+               return unreserve_highatomic_pageblock(ac, true);
        }

        /*
--
2.7.4
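
The two behavioural changes in this patch (the force flag and bailing out only when pages were actually moved) can be restated as a tiny user-space model. Both function names and the array-of-results representation are invented for this sketch; the real code walks zones and free lists under zone->lock.

```c
#include <stdbool.h>

/*
 * Zone-skip rule: without force, keep at least one pageblock reserved;
 * with force (the last step before OOM) every zone is considered.
 */
static inline bool demo_skip_zone(unsigned long nr_reserved,
				  unsigned long pageblock_nr_pages,
				  bool force)
{
	return !force && nr_reserved <= pageblock_nr_pages;
}

/*
 * Scan rule: moved_pages[i] stands for the result of moving the free
 * pages of the i-th candidate pageblock. Report success only once a
 * block really yielded pages, instead of stopping at the first block.
 */
static inline bool demo_unreserve_scan(const int *moved_pages, int nr_blocks)
{
	for (int i = 0; i < nr_blocks; i++) {
		if (moved_pages[i] > 0)
			return true;
	}
	return false;
}
```

In this model the direct-reclaim caller corresponds to force being false and the pre-OOM caller in should_reclaim_retry() to force being true, matching the two call sites changed in the diff.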
[PATCH v3 3/4] mm: try to exhaust highatomic reserve before the OOM
I got an OOM report from the production team with a v4.4 kernel. It had
enough free memory but failed to allocate a GFP_KERNEL order-0 page and
finally encountered an OOM kill. It occurred during a QA process which
launches several apps, switches between them and so on. It happened
rarely. IOW, in a normal situation it was not a problem, but if we are
unlucky and several apps use peak memory at the same time, it can
happen. If we manage to pass that phase, the system can go on working
well.

I could reproduce it easily with my test (memory spike). Look at below.

The reason is that the free pages (19M) of the DMA32 zone are reserved
for HIGHORDERATOMIC and are not unreserved before the OOM.

balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
balloon cpuset=/ mems_allowed=0
CPU: 1 PID: 8473 Comm: balloon Tainted: GW OE 4.8.0-rc7-00219-g3f74c9559583-dirty #3161
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
 88007f15bbc8 8138eb13 88007f15bd88 88005a72a4c0 88007f15bc28 811d2d13 88007f15bc08 8146a5ca 81c8df60 0015 0206
Call Trace:
 [] dump_stack+0x63/0x90
 [] dump_header+0x5c/0x1ce
 [] ? virtballoon_oom_notify+0x2a/0x80
 [] oom_kill_process+0x22e/0x400
 [] out_of_memory+0x1ac/0x210
 [] __alloc_pages_nodemask+0x101e/0x1040
 [] handle_mm_fault+0xa0a/0xbf0
 [] __do_page_fault+0x1dd/0x4d0
 [] trace_do_page_fault+0x43/0x130
 [] do_async_page_fault+0x1a/0xa0
 [] async_page_fault+0x28/0x30
Mem-Info:
active_anon:383949 inactive_anon:106724 isolated_anon:0
 active_file:15 inactive_file:44 isolated_file:0
 unevictable:0 dirty:0 writeback:24 unstable:0
 slab_reclaimable:2483 slab_unreclaimable:3326
 mapped:0 shmem:0 pagetables:1906 bounce:0
 free:6898 free_pcp:291 free_cma:0
Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable?
no DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 1952 1952 1952 DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB 51131 total pagecache pages 50795 pages in swap cache Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228 Free swap = 8kB Total swap = 255996kB 524158 pages RAM 0 pages HighMem/MovableOnly 12658 pages reserved 0 pages cma reserved 0 pages hwpoisoned Another example exceeded the limit by the race is in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK) CPU: 0 PID: 476 Comm: in:imklog Tainted: GE 4.8.0-rc7-00217-g266ef83c51e5-dirty #3135 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 880077c37590 81389033 880077c37618 8117519b 02280020 81cedb40 0040 Call Trace: [] dump_stack+0x63/0x90 [] warn_alloc_failed+0xdb/0x130 [] __alloc_pages_nodemask+0x4d6/0xdb0 [] ? bdev_write_page+0xa9/0xd0 [] ? __page_check_address+0xd3/0x130 [] ? deactivate_slab+0x12a/0x3e0 [] new_slab+0x339/0x490 [] ___slab_alloc.constprop.74+0x367/0x480 [] ? alloc_indirect.isra.14+0x1d/0x50 [] ? 
default_wake_function+0x12/0x20 [] __slab_alloc.constprop.73+0x20/0x40 [] __kmalloc+0x1a4/0x1e0 [] alloc_indirect.isra.14+0x1d/0x50 [] virtqueue_add_sgs+0x1c4/0x470 [] ? __bt_get.isra.8+0xe5/0x1c0 [] __virtblk_add_req+0xae/0x1f0 [] ? wake_atomic_t_function+0x60/0x60 [] ? sched_clock+0x9/0x10 [] ? __blk_mq_alloc_request+0x10b/0x230 [] ? blk_rq_map_sg+0x213/0x550 [] virtio_queue_rq+0x12d/0x290 [] __blk_mq_run_hw_queue+0x239/0x370 [] blk_mq_run_hw_queue+0x8f/0xb0 [] blk_mq_insert_requests+0x18c/0x1a0 [] blk_mq_flush_plug_list+0x125/0x140 [] blk_flush_plug_list+0xc7/0x220 [] blk_finish_plug+0x2c/0x40 [] __do_page_cache_readahead+0x196/0x230 [] ? zram_free_page+0x3a/0xb0 [zram] [] filemap_fault+0x448/0x4f0 [] ? allo
Re: [PATCH 2/2] intel_pmc_core: avoid boot time warning for !CONFIG_DEBUGFS_FS
On Mon, Oct 10, 2016 at 02:29:17PM +0200, Greg Kroah-Hartman wrote: > On Mon, Oct 10, 2016 at 01:12:58PM +0200, Arnd Bergmann wrote: > > While looking at a patch that introduced a compile-time warning > > "‘pmc_core_dev_state_get’ defined but not used" (I sent a patch > > for debugfs to fix it), I noticed that the same patch caused > > it in intel_pmc_core also introduced a bogus run-time warning: > > "PMC Core: debugfs register failed". > > > > The problem is the IS_ERR_OR_NULL() check that as usual gets > > things wrong: when CONFIG_DEBUGFS_FS is disabled, > > debugfs_create_dir() fails with an error code, and we don't > > need to warn about it, unlike the case in which it returns > > NULL. > > > > This reverts the driver to the previous state of not warning > > about CONFIG_DEBUGFS_FS being disabled. I chose not to > > restore the driver to making a runtime error in debugfs > > fatal in pmc_core_probe(). > > > > Fixes: df2294fb6428 ("intel_pmc_core: Convert to DEFINE_DEBUGFS_ATTRIBUTE") > > Signed-off-by: Arnd Bergmann > > --- > > drivers/platform/x86/intel_pmc_core.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/platform/x86/intel_pmc_core.c > > b/drivers/platform/x86/intel_pmc_core.c > > index 520b58a04daa..e8b1b836ca2d 100644 > > --- a/drivers/platform/x86/intel_pmc_core.c > > +++ b/drivers/platform/x86/intel_pmc_core.c > > @@ -100,7 +100,7 @@ static int pmc_core_dbgfs_register(struct pmc_dev > > *pmcdev) > > struct dentry *dir, *file; > > > > dir = debugfs_create_dir("pmc_core", NULL); > > - if (IS_ERR_OR_NULL(dir)) > > + if (!dir) > > return -ENOMEM; > > Hah, no, you shouldn't ever care about any return value being "wrong" > from debugfs, the code should just continue on as normal. > > And yes, you are correct, the IS_ERR_OR_NULL() is totally wrong. > > thanks, > > greg k-h > Thanks Arnd and Greg, appreciate the catch and the fix. Applied. -- Darren Hart Intel Open Source Technology Center
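
To make the difference between the two checks concrete, here is a minimal userspace sketch. It is not kernel code: ERR_PTR(), IS_ERR() and IS_ERR_OR_NULL() are imitations of the kernel helpers, and create_dir_debugfs_disabled() is a hypothetical stand-in for the debugfs_create_dir() stub, which as described above returns an error code rather than NULL when debugfs is compiled out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace imitations of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline bool IS_ERR(const void *ptr)
{
	/* Error pointers live in the last MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline bool IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}

/* Hypothetical stand-in for debugfs_create_dir() with debugfs disabled. */
static void *create_dir_debugfs_disabled(void)
{
	return ERR_PTR(-ENODEV);
}

/* The check before the fix: "debugfs disabled" counts as a failure. */
bool old_check_fails(void)
{
	return IS_ERR_OR_NULL(create_dir_debugfs_disabled());
}

/* The check from the patch: only a NULL return counts as a failure. */
bool new_check_fails(void)
{
	return !create_dir_debugfs_disabled();
}
```

With debugfs disabled, old_check_fails() is true (hence the bogus "debugfs register failed" message), while new_check_fails() is false. Greg's point goes further: callers should not need to look at the return value at all.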
mmotm git tree since-4.8 branch created (was: mmotm 2016-10-11-15-46 uploaded)
I have just created since-4.8 branch in mm git tree (http://git.kernel.org/?p=linux/kernel/git/mhocko/mm.git;a=summary). It is based on v4.8 tag in Linus tree and mmotm-2016-10-11-15-46. As usual mmotm trees are tagged with signed tag (finger print BB43 1E25 7FB8 660F F2F1 D22D 48E2 09A2 B310 E347) The shortlog says: Aaron Lu (1): thp: reduce usage of huge zero page's atomic counter Ales Novak (1): ptrace: clear TIF_SYSCALL_TRACE on ptrace detach Alexander Potapenko (3): include/linux: provide a safe version of container_of() llist: introduce llist_entry_safe() kcov: do not instrument lib/stackdepot.c Alexandre Bounine (1): rapidio/rio_cm: use memdup_user() instead of duplicating code Alexey Dobriyan (5): mm: unrig VMA cache hit ratio proc: much faster /proc/vmstat proc: faster /proc/*/status include/linux/ctype.h: make isdigit() table lookupless lib/kstrtox.c: smaller _parse_integer() Andrea Arcangeli (6): mm: vm_page_prot: update with WRITE_ONCE/READ_ONCE mm: vma_adjust: remove superfluous confusing update in remove_next == 1 case mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk mm: vma_adjust: remove superfluous check for next not NULL mm: vma_adjust: minor comment correction mm: vma_merge: correct false positive from __vma_unlink->validate_mm_rb Andrew Morton (1): mm/page_io.c: replace some BUG_ON()s with VM_BUG_ON_PAGE() Andrey Konovalov (1): kcov: properly check if we are in an interrupt Aneesh Kumar K.V (1): mm: use zonelist name instead of using hardcoded index Baoyou Xie (1): mm: move phys_mem_access_prot_allowed() declaration to pgtable.h Bart Van Assche (1): do_generic_file_read(): fail immediately if killed Borislav Petkov (1): config/android: Remove CONFIG_IPV6_PRIVACY Catalin Marinas (1): mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping Christoph Hellwig (1): kprobes: include instead of Dan Williams (1): mm: fix cache mode tracking in vm_insert_mixed() Darrick J. 
Wong (3): block: invalidate the page cache when issuing BLKZEROOUT block: require write_same and discard requests align to logical block size block: implement (some of) fallocate for block devices Davidlohr Bueso (3): ipc/msg: batch queue sender wakeups ipc/msg: make ss_wakeup() kill arg boolean ipc/msg: avoid waking sender upon full queue Ganesh Mahendran (2): mm/zsmalloc: add trace events for zs_compact mm/zsmalloc: add per-class compact trace event Gerald Schaefer (3): mm/hugetlb: fix memory offline with hugepage size > memory block size mm/hugetlb: check for reserved hugepages during memory offline mm/hugetlb: improve locking in dissolve_free_huge_pages() Hidehiro Kawai (2): x86/panic: replace smp_send_stop() with kdump friendly version in panic path mips/panic: replace smp_send_stop() with kdump friendly version in panic path Huang Ying (4): mm, swap: add swap_cluster_list mm: don't use radix tree writeback tags for pages in swap cache mm, swap: use offset of swap entry as key of swap cache mm: remove page_file_index Ian Kent (5): autofs: fix autofs4_fill_super() error exit handling autofs: remove ino free in autofs4_dir_symlink() autofs: fix dev ioctl number range check autofs: add autofs_dev_ioctl_version() for AUTOFS_DEV_IOCTL_VERSION_CMD autofs4: move linux/auto_dev-ioctl.h to uapi/linux James Morse (3): mm: pagewalk: fix the comment for test_walk fs/proc/task_mmu.c: make the task_mmu walk_page_range() limit in clear_refs_write() obvious mm/memcontrol.c: make the walk_page_range() limit obvious Jason Cooper (7): random: simplify API for random address requests x86: use simpler API for random address requests ARM: use simpler API for random address requests arm64: use simpler API for random address requests tile: use simpler API for random address requests unicore32: use simpler API for random address requests random: remove unused randomize_range() Joe Perches (15): seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char meminfo: break 
apart a very long seq_printf with #ifdefs checkpatch: see if modified files are marked obsolete in MAINTAINERS checkpatch: look for symbolic permissions and suggest octal instead checkpatch: test multiple line block comment alignment checkpatch: don't test for prefer ether_addr_ checkpatch: externalize the structs that should be const const_structs.checkpatch: add frequently used from Julia Lawall's list checkpatch: speed up checking for filenames in sections marked obsolete checkpatch: improve the block comment * alignment test checkpatch: add --strict test for macro argument reuse c
Re: [PATCH v3 1/4] net: phy: dp83867: Add documentation for optional impedance control
On Monday 10 October 2016 06:48 PM, Rob Herring wrote:
> On Thu, Oct 06, 2016 at 10:43:52AM +0530, Mugunthan V N wrote:
>> Add documention of ti,impedance-control which can be used to
>
> Needs updating.

Oops, will update this in the next version.

>
>> correct MAC impedance mismatch using phy extended registers.
>>
>> Signed-off-by: Mugunthan V N
>> ---
>> Documentation/devicetree/bindings/net/ti,dp83867.txt | 12
>> 1 file changed, 12 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/net/ti,dp83867.txt b/Documentation/devicetree/bindings/net/ti,dp83867.txt
>> index 5d21141..85bf945 100644
>> --- a/Documentation/devicetree/bindings/net/ti,dp83867.txt
>> +++ b/Documentation/devicetree/bindings/net/ti,dp83867.txt
>> @@ -9,6 +9,18 @@ Required properties:
>> - ti,fifo-depth - Transmitt FIFO depth- see dt-bindings/net/ti-dp83867.h
>> for applicable values
>>
>> +Optional property:
>> +- ti,min-output-impedance - MAC Interface Impedance control to set
>> +the programmable output impedance to
>> +minimum value (35 ohms).
>> +- ti,max-output-impedance - MAC Interface Impedance control to set
>> +the programmable output impedance to
>> +maximum value (70 ohms).
>
> Define what are valid range of values for these.

The values are already mentioned in the documentation as 35/70 ohms. Are you referring to the register values?

Regards
Mugunthan V N
Re: [PATCH] ideapad-laptop: Add Lenovo Yoga 910-13IKB to no_hw_rfkill dmi list
On Tue, Oct 11, 2016 at 07:28:02PM -0400, Brian Masney wrote:
> The Lenovo Yoga 910-13IKB does not have a hw rfkill switch, and trying
> to read the hw rfkill switch through the ideapad module causes it to
> always report as blocked.
>
> This commit adds the Lenovo Yoga 910-13IKB to the no_hw_rfkill dmi list,
> fixing the WiFi breakage.
>
> Signed-off-by: Brian Masney

Thanks Brian,

Queued to testing.

--
Darren Hart
Intel Open Source Technology Center
Re: [PATCH] irqchip/jcore: fix lost per-cpu interrupts
On Tue, 11 Oct 2016, Rich Felker wrote: > On Sun, Oct 09, 2016 at 09:23:58PM +0200, Thomas Gleixner wrote: > > On Sun, 9 Oct 2016, Rich Felker wrote: > > > On Sun, Oct 09, 2016 at 01:03:10PM +0200, Thomas Gleixner wrote: > > > My preference would just be to keep the branch, but with your improved > > > version that doesn't need a function call: > > > > > > irqd_is_per_cpu(irq_desc_get_irq_data(desc)) > > > > > > While there is some overhead testing this condition every time, I can > > > probably come up with several better places to look for a ~10 cycle > > > improvement in the irq code path without imposing new requirements on > > > the DT bindings. > > > > Fair enough. Your call. > > > > > As noted in my followup to the clocksource stall thread, there's also > > > a possibility that it might make sense to consider the current > > > behavior of having non-percpu irqs bound to a particular cpu as part > > > of what's required by the compatible tag, in which case > > > handle_percpu_irq or something similar/equivalent might be suitable > > > for both the percpu and non-percpu cases. I don't understand the irq > > > subsystem well enough to insist on that but I think it's worth > > > consideration since it looks like it would improve performance of > > > non-percpu interrupts a bit. > > > > Well, you can use handle_percpu_irq() for your device interrupts if you > > guarantee at the hardware level that there is no reentrancy. Once you make > > the hardware capable of delivering them on either core the picture changes. > > One more concern here -- I see that handle_simple_irq is handling the > soft-disable / IRQS_PENDING flag behavior, and irq_check_poll stuff > that's perhaps important too. Since soft-disable is all we have > (there's no hard-disable of interrupts), is this a problem? In other > words, can drivers have an expectation of not receiving interrupts > when the irq is disabled? 
I would think anything compatible with irq > sharing can't have such an expectation, but perhaps the kernel needs > disabling internally for synchronization at module-unload time or > similar cases? Sure. A driver would be surprised getting an interrupt when it is disabled, but with your exceptionally well thought out interrupt controller a pending (level) interrupt which is not handled will be reraised forever and just hard lock the machine. > If you think any of these things are problems I'll switch back to the > conditional version rather than using handle_percpu_irq for > everything. It might be the approach of least surprise, but it won't make a difference for the situation described above. Thanks, tglx
Re: [GIT pull] locking fix for 4.9
On Mon, 10 Oct 2016 10:29:27 -0700 Linus Torvalds wrote:
> On Sat, Oct 8, 2016 at 5:47 AM, Thomas Gleixner wrote:
> >
> > A single fix which prevents newer GCCs from spamming the build output with
> > overly eager warnings about __builtin_return_address() uses which are
> > correct.
>
> Ugh. This feels over-engineered to me.
>
> We already disable that warning unconditionally for the trace
> subdirectory, and for mm/usercopy.c.
>
> I feel that the simpler solution is to just disable the warning
> globally, and not worry about "when this config option is enabled we
> need to disable it".
>
> Basically, we disable the warning every time we ever use
> __builtin_return_address(), so maybe we should just disable it once
> and for all.

The only advantage of doing this is to make it a pain to use
__builtin_return_address(n) with n > 0, so that we don't accidentally
use it without knowing what we are doing.

>
> It's not like the __builtin_return_address() warning is so incredibly
> useful anyway.
>

But I agree. We have lived a long time without the need for this
warning. I'm not strongly advocating keeping the warning around and just
disabling it totally. But it all comes down to how much we trust those
that inherit this after we are gone ;-)

/me is feeling his age.

-- Steve
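
For readers who have not met the builtin under discussion, a minimal userspace sketch follows. The function name caller_address() is made up for illustration. Level 0 is the uncontroversial case; as far as I know, the warning this thread is about is GCC's -Wframe-address, which is aimed at nonzero levels, where the compiler has to walk frames it cannot verify.

```c
#include <stddef.h>

/*
 * Return the address this function will return to, i.e. a location
 * inside whoever called us. noinline keeps a real call frame so the
 * result refers to the direct caller.
 */
__attribute__((noinline)) void *caller_address(void)
{
	/* Level 0 asks for the current function's own return address. */
	return __builtin_return_address(0);
}
```

Passing 1 or more instead of 0 asks for the return address of an outer frame, which is the usage Steve describes as something one should only do deliberately.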
Re: [PATCH 24/54] md/raid1: Improve another size determination in setup_conf()
Compare:

	foo = kmalloc(sizeof(*foo), GFP_KERNEL);

This says you are allocating enough space for foo. It can be reviewed by looking at one line. If you change the type of foo it will still work.

	foo = kmalloc(sizeof(struct whatever), GFP_KERNEL);

There isn't enough information to say if this is correct. If you change the type of foo then you have to update the allocation as well. It's not a super common type of bug, but I see it occasionally.

regards,
dan carpenter
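
The failure mode can be shown in plain userspace C. This is only a sketch: struct small and struct big are invented types, and the scenario assumed is that a pointer's type was changed from the former to the latter while one size expression was left behind.

```c
#include <stddef.h>

/* Invented types: foo used to point at struct small, now at struct big. */
struct small {
	int id;
};

struct big {
	int id;
	char name[64];
};

/* sizeof(*foo) follows the pointer's current type automatically. */
size_t size_by_pointer(void)
{
	struct big *foo = NULL;

	/* sizeof does not evaluate its operand, so NULL is harmless here. */
	return sizeof(*foo);
}

/* The spelled-out type was never updated, so it is now too small. */
size_t size_by_stale_type(void)
{
	return sizeof(struct small);
}
```

An allocation sized with size_by_stale_type() would be shorter than the object later written through the pointer, and nothing on the allocation line itself reveals that.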
Re: [PATCH 6/6] cpufreq: pxa: convert to clock API
Viresh Kumar writes: > On 12-10-16, 08:22, Robert Jarzmik wrote: >> Viresh Kumar writes: >> >> > On 10-10-16, 22:09, Robert Jarzmik wrote: >> >> As the clock settings have been introduced into the clock pxa drivers, >> >> which are now available to change the CPU clock by themselves, remove >> >> the clock handling from this driver, and rely on pxa clock drivers. >> >> >> >> Signed-off-by: Robert Jarzmik >> >> --- >> >> drivers/cpufreq/pxa2xx-cpufreq.c | 191 >> >> --- >> >> 1 file changed, 39 insertions(+), 152 deletions(-) >> > >> > As you mentioned in the previous patch, why can't you use cpufreq-dt >> > driver now and delete this one ? >> >> PXA architecture have both legacy platform_data based configurations and new >> devicetree based ones. > > I don't see any platform data specific code in this driver. What am I > missing ? In a legacy platform, ie. without devicetree, we have CONFIG_OF=n. How would cpufreq-dt be usable in this case ? You can see such a platform in arch/arm/configs/mainstone_defconfig and arch/arm/mach-pxa/mainstone.c as an example. Cheers. -- Robert
Re: [PATCH v2 4/4] mm: make unreserve highatomic functions reliable
On Wed 12-10-16 14:33:36, Minchan Kim wrote:
[...]
> @@ -2138,8 +2146,10 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac)
>  			 */
>  			set_pageblock_migratetype(page, ac->migratetype);
>  			ret = move_freepages_block(zone, page, ac->migratetype);
> -			spin_unlock_irqrestore(&zone->lock, flags);
> -			return ret;
> +			if (!drain && ret) {
> +				spin_unlock_irqrestore(&zone->lock, flags);
> +				return ret;
> +			}

I've already mentioned that during the previous discussion. This sounds overly aggressive to me. Why do we want to drain the whole reserve and risk that we won't be able to build up a new one after OOM? Doing one block at a time should be sufficient IMHO.

	if (ret) {
		spin_unlock_irqrestore(&zone->lock, flags);
		return ret;
	}

will do the trick and work for both the drain and !drain cases, which is a good thing, because even the !drain case would like to see a block freed. The only difference between those two is that the drain one would really like to free something and ignore the "at least one block" reserve.

--
Michal Hocko
SUSE Labs
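
The two bail-out conditions can be compared in a small userspace model. Everything here is hypothetical (struct block, move_free_pages(), the flat array); it mimics only the control flow being discussed, not the real mm/page_alloc.c code, locking or free lists.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reserved pageblock: pages that would move if unreserved. */
struct block {
	int movable_free_pages;
	bool reserved;
};

/* Model of move_freepages_block(): returns the number of pages moved. */
static int move_free_pages(struct block *b)
{
	b->reserved = false;
	return b->movable_free_pages;
}

/* Original behaviour: return after the first reserved block, moved or not. */
int unreserve_first_found(struct block *blocks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!blocks[i].reserved)
			continue;
		return move_free_pages(&blocks[i]);
	}
	return 0;
}

/* "if (ret) return ret;": keep scanning until a block really yields pages. */
int unreserve_until_progress(struct block *blocks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int ret;

		if (!blocks[i].reserved)
			continue;
		ret = move_free_pages(&blocks[i]);
		if (ret)
			return ret;
	}
	return 0;
}
```

With a first block that has nothing to move and a second one holding 32 free pages, the first variant reports no progress while the second returns 32, and it still stops at the first block that makes progress rather than draining everything.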
[PATCH] rtc: Add support for maxim dallas rtc max-6917
This is a patch to add support for maxim dallas rtc max6917. Signed-off-by: Venkat Prashanth B U --- --- drivers/rtc/Kconfig | 9 + drivers/rtc/Makefile | 1 + drivers/rtc/rtc-max6917.c | 406 ++ 3 files changed, 416 insertions(+) diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig index e215f50..2163606 100644 --- a/drivers/rtc/Kconfig +++ b/drivers/rtc/Kconfig @@ -277,6 +277,15 @@ config RTC_DRV_MAX6900 This driver can also be built as a module. If so, the module will be called rtc-max6900. +config RTC_DRV_MAX6917 + tristate "Maxim MAX6917" + help + If you say yes here you will get support for the + Maxim MAX6917 I2C RTC chip. + + This driver can also be built as a module. If so, the module + will be called rtc-max6917. + config RTC_DRV_MAX8907 tristate "Maxim MAX8907" depends on MFD_MAX8907 diff --git a/drivers/rtc/Makefile b/drivers/rtc/Makefile index 7cf7ad5..29332fb 100644 --- a/drivers/rtc/Makefile +++ b/drivers/rtc/Makefile @@ -87,6 +87,7 @@ obj-$(CONFIG_RTC_DRV_M48T86) += rtc-m48t86.o obj-$(CONFIG_RTC_DRV_MAX6900) += rtc-max6900.o obj-$(CONFIG_RTC_DRV_MAX6902) += rtc-max6902.o obj-$(CONFIG_RTC_DRV_MAX6916) += rtc-max6916.o +obj-$(CONFIG_RTC_DRV_MAX6917) += rtc-max6917.o obj-$(CONFIG_RTC_DRV_MAX77686) += rtc-max77686.o obj-$(CONFIG_RTC_DRV_MAX8907) += rtc-max8907.o obj-$(CONFIG_RTC_DRV_MAX8925) += rtc-max8925.o diff --git a/drivers/rtc/rtc-max6917.c b/drivers/rtc/rtc-max6917.c index e69de29..1176384 100644 --- a/drivers/rtc/rtc-max6917.c +++ b/drivers/rtc/rtc-max6917.c @@ -0,0 +1,406 @@ + /* rtc-max6917.c + * + * Driver for MAXIM max6917 I2C-Compatible Real Time Clock + * + * Author : Venkat Prashanth B U + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. 
+ * + */ + + #include + #include + #include + #include + #include + #include + #include + #include + #include + + #define MAX6917_REG_SECS0x01/* 00-59 */ + #define MAX6917_REG_MIN 0x02/* 00-59 */ + #define MAX6917_REG_HOUR0x03/* 00-23, or 1-12{am,pm} */ + #define MAX6917_REG_WDAY0x04/* 01-07 */ + #define MAX6917_REG_MDAY0x05/* 01-31 */ + #define MAX6917_REG_MONTH 0x06/* 01-12 */ + #define MAX6917_REG_YEAR0x07/* 00-99 */ + #define MAX6917_REG_CONTROL 0x08 + #define MAX6917_REG_STATUS 0x0c + #define MAX6917_REG_ALARM 0x0a + #define MAX6917_BURST_LEN 8 /* can burst r/w first 8 regs */ + #define MAX6917_REG_CENTURY 9 /* century */ + #define MAX6917_REG_LEN 10 + #define MAX6917_REG_CT_WP (1 << 7)/* Write Protect */ + /* + * register read/write commands + */ + #define MAX6917_REG_CONTROL_WRITE 0x8e + #define MAX6917_REG_CENTURY_WRITE 0x92 + #define MAX6917_REG_CENTURY_READ0x93 + #define MAX6917_REG_RESERVED_READ 0x96 + #define MAX6917_REG_BURST_WRITE 0xbe + #define MAX6917_REG_BURST_READ 0xbf + + #define MAX6917_IDLE_TIME_AFTER_WRITE 3 /* specification says 2.5 mS */ + + static struct i2c_driver max6917_driver; + + struct max6917 + { + u8 offset; /* register's offset */ + u8 regs[11]; + u16 nvram_offset; + struct bin_attribute *nvram; + unsigned long flags; + #define HAS_NVRAM 0 /* bit 0 == sysfs file active */ + #define HAS_ALARM 1 /* bit 1 == irq claimed */ + struct i2c_client *client; + struct rtc_device *rtc; + s32 (*read_block_data) (const struct i2c_client * client, u8 command, + u8 length, u8 * values); + s32 (*write_block_data) (const struct i2c_client * client, u8 command, + u8 length, const u8 * values); + }; + + struct chip_desc + { + unsigned alarm:1; + u16 nvram_offset; + u16 nvram_size; + }; + + static int + max6917_i2c_read_regs (struct i2c_client *client, u8 * buf) + { + u8 reg_burst_read[1] = { MAX6917_REG_BURST_READ }; + u8 reg_century_read[1] = { MAX6917_REG_CENTURY_READ }; + struct i2c_msg msgs[4] = { + { + .addr = client->addr, + .flags = 0, /* write 
*/ + .len = siz
Re: [PATCH v3 3/4] mm: try to exhaust highatomic reserve before the OOM
Looks much better. Thanks!

I am wondering whether we want to have this marked for stable. The patch is quite non-intrusive and fires only when we are really OOM. It is definitely better to try harder than to go and disrupt the system with the OOM killer. So I would add

Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
Cc: stable # 4.4+

The backport will look slightly different for kernels prior to 4.6 because we do not have should_reclaim_retry yet, but the check might hook in right before __alloc_pages_may_oom.

--
Michal Hocko
SUSE Labs
Re: [PATCH RFC 2/2] KVM: x86: Support using the VMX preemption timer for APIC Timer periodic/oneshot mode
2016-10-11 23:06 GMT+08:00 Radim Krčmář : > 2016-10-11 20:17+0800, Wanpeng Li: >> From: Wanpeng Li >> >> Most windows guests still utilize APIC Timer periodic/oneshot mode >> instead of tsc-deadline mode, and the APIC Timer periodic/oneshot >> mode are still emulated by high overhead hrtimer on host. This patch >> converts the expected expire time of the periodic/oneshot mode to >> guest deadline tsc in order to leverage VMX preemption timer logic >> for APIC Timer tsc-deadline mode. >> >> Cc: Paolo Bonzini >> Cc: Radim Krčmář >> Cc: Yunhong Jiang >> Signed-off-by: Wanpeng Li >> --- >> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c >> @@ -1101,13 +1101,20 @@ static u32 apic_get_tmcct(struct kvm_lapic *apic) >> apic->lapic_timer.period == 0) >> return 0; >> >> + if (kvm_lapic_hv_timer_in_use(apic->vcpu)) { >> + u64 tscl = rdtsc(); >> + >> + tmcct = apic->lapic_timer.tscdeadline - >> + kvm_read_l1_tsc(apic->vcpu, tscl); > > Yes, this won't work. The easiest way to return a less bogus TMCCT > would be remember the timeout when setting up the timer and replace > hrtimer_get_remaining() with it -- our deliver method shouldn't change > the expiration time. Agreed. 
> >> + } else { >> + remaining = hrtimer_get_remaining(&apic->lapic_timer.timer); >> + if (ktime_to_ns(remaining) < 0) >> + remaining = ktime_set(0, 0); >> + >> + ns = mod_64(ktime_to_ns(remaining), apic->lapic_timer.period); >> + tmcct = div64_u64(ns, >> + (APIC_BUS_CYCLE_NS * apic->divide_count)); >> + } >> >> return tmcct; >> } >> @@ -1400,52 +1407,65 @@ bool kvm_lapic_hv_timer_in_use(struct kvm_vcpu *vcpu) >> } >> EXPORT_SYMBOL_GPL(kvm_lapic_hv_timer_in_use); >> >> -static void cancel_hv_tscdeadline(struct kvm_lapic *apic) >> +static void cancel_hv_timer(struct kvm_lapic *apic) >> { >> kvm_x86_ops->cancel_hv_timer(apic->vcpu); >> apic->lapic_timer.hv_timer_in_use = false; >> } >> >> +static bool start_hv_timer(struct kvm_lapic *apic) >> { >> + u64 tscdeadline; >> >> + if (apic_lvtt_period(apic) || apic_lvtt_oneshot(apic)) { >> + u64 tscl = rdtsc(); >> >> + apic->lapic_timer.period = (u64)kvm_lapic_get_reg(apic, >> APIC_TMICT) >> + * APIC_BUS_CYCLE_NS * apic->divide_count; >> + apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, >> tscl) + >> + nsec_to_cycles(apic->vcpu, apic->lapic_timer.period); > > start_hv_timer() is called from kvm_lapic_switch_to_hv_timer(), which > can happen mid-period. The worst case is that the timer will never > fire, because we always convert back and forth. You have to compute the > equivalent deadline only once, and carry it around. Agreed. Thanks for your review. :) Please see RFC V2. 
Regards, Wanpeng Li > >> + } >> + >> + tscdeadline = apic->lapic_timer.tscdeadline; >> >> if (atomic_read(&apic->lapic_timer.pending) || >> kvm_x86_ops->set_hv_timer(apic->vcpu, tscdeadline)) { >> if (apic->lapic_timer.hv_timer_in_use) >> + cancel_hv_timer(apic); >> } else { >> apic->lapic_timer.hv_timer_in_use = true; >> hrtimer_cancel(&apic->lapic_timer.timer); >> >> /* In case the sw timer triggered in the window */ >> if (atomic_read(&apic->lapic_timer.pending)) >> + cancel_hv_timer(apic); >> } >> trace_kvm_hv_timer_state(apic->vcpu->vcpu_id, >> apic->lapic_timer.hv_timer_in_use); >> return apic->lapic_timer.hv_timer_in_use; >> } >> >> +void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu) >> +{ >> + struct kvm_lapic *apic = vcpu->arch.apic; >> + >> + WARN_ON(!apic->lapic_timer.hv_timer_in_use); >> + WARN_ON(swait_active(&vcpu->wq)); >> + cancel_hv_timer(apic); >> + apic_timer_expired(apic); >> + >> + if (apic_lvtt_period(apic)) >> + start_hv_timer(apic); >> +} >> +EXPORT_SYMBOL_GPL(kvm_lapic_expired_hv_timer); >> + >> void kvm_lapic_switch_to_hv_timer(struct kvm_vcpu *vcpu) >> { >> struct kvm_lapic *apic = vcpu->arch.apic; >> >> WARN_ON(apic->lapic_timer.hv_timer_in_use); >> >> - if (apic_lvtt_tscdeadline(apic)) >> - start_hv_tscdeadline(apic); >> + start_hv_timer(apic); >> } >> EXPORT_SYMBOL_GPL(kvm_lapic_switch_to_hv_timer); >>
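
The review point about computing the deadline only once can be sketched in userspace C. This is a simplified model, not the KVM code: struct timer_model and the function names are invented, APIC_BUS_CYCLE_NS is assumed to be 1 as in the lapic emulation of that era, and the ns-to-cycles conversion assumes a fixed TSC frequency given in kHz.

```c
#include <stdbool.h>
#include <stdint.h>

#define APIC_BUS_CYCLE_NS 1	/* assumed value, see lead-in */

/* Timer period in ns: initial count * bus cycle time * divider. */
uint64_t apic_period_ns(uint32_t tmict, uint32_t divide_count)
{
	return (uint64_t)tmict * APIC_BUS_CYCLE_NS * divide_count;
}

/* ns -> TSC cycles for a TSC running at tsc_khz (may overflow for huge ns). */
uint64_t ns_to_cycles(uint64_t ns, uint64_t tsc_khz)
{
	return ns * tsc_khz / 1000000ULL;
}

/* Invented model of the timer state that is carried around. */
struct timer_model {
	uint64_t tscdeadline;
	bool armed;
};

/* Compute the equivalent TSC deadline once, when the guest arms the timer. */
void arm_timer(struct timer_model *t, uint64_t now_tsc, uint32_t tmict,
	       uint32_t divide_count, uint64_t tsc_khz)
{
	uint64_t period = apic_period_ns(tmict, divide_count);

	t->tscdeadline = now_tsc + ns_to_cycles(period, tsc_khz);
	t->armed = true;
}

/* Switching delivery method reuses the stored deadline, whatever "now" is. */
uint64_t switch_to_hv_timer(const struct timer_model *t, uint64_t now_tsc)
{
	(void)now_tsc;	/* deliberately unused: no recomputation mid-period */
	return t->tscdeadline;
}
```

If the switch path instead recomputed now + period on every call, a vcpu that keeps converting back and forth mid-period would push the deadline out each time, which is the "timer will never fire" case raised in the review.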
Re: [PATCH v4 10/10] ARM: sunxi: Enable sun8i-emac driver on multi_v7_defconfig
On Tue, Oct 11, 2016 at 11:40:42AM +0200, Maxime Ripard wrote:
> On Mon, Oct 10, 2016 at 03:09:43PM +0200, Jean-Francois Moine wrote:
> > On Mon, 10 Oct 2016 14:35:11 +0200
> > LABBE Corentin wrote:
> >
> > > On Mon, Oct 10, 2016 at 02:30:46PM +0200, Maxime Ripard wrote:
> > > > On Fri, Oct 07, 2016 at 10:25:57AM +0200, Corentin Labbe wrote:
> > > > > Enable the sun8i-emac driver in the multi_v7 default configuration
> > > > >
> > > > > Signed-off-by: Corentin Labbe
> > > > > ---
> > > > > arch/arm/configs/multi_v7_defconfig | 1 +
> > > > > 1 file changed, 1 insertion(+)
> > > > >
> > > > > diff --git a/arch/arm/configs/multi_v7_defconfig b/arch/arm/configs/multi_v7_defconfig
> > > > > index 5845910..f44d633 100644
> > > > > --- a/arch/arm/configs/multi_v7_defconfig
> > > > > +++ b/arch/arm/configs/multi_v7_defconfig
> > > > > @@ -229,6 +229,7 @@ CONFIG_NETDEVICES=y
> > > > > CONFIG_VIRTIO_NET=y
> > > > > CONFIG_HIX5HD2_GMAC=y
> > > > > CONFIG_SUN4I_EMAC=y
> > > > > +CONFIG_SUN8I_EMAC=y
> > > >
> > > > Any reason to build it statically?
> > >
> > > No, just copied the same than CONFIG_SUN4I_EMAC that probably do
> > > not need it also.
> >
> > All arm configs are done the same way, and, some day, the generic ARM
> > V7 kernel will not be loadable in 1Gb RAM...
>
> Yeah, if possible, I'd really like to avoid introducing statically
> built drivers to multi_v7.
>

I forgot to say it in my first answer, but yes, I will change it.

Regards
Re: [PATCH v3 4/4] mm: make unreserve highatomic functions reliable
On Wed 12-10-16 17:03:49, Minchan Kim wrote: > Currently, unreserve_highatomic_pageblock bails out if it found > highatomic pageblock regardless of really moving free pages > from the one so that it could mitigate unreserve logic's goal > which saves OOM of a process. > > This patch makes unreserve functions bail out only if it moves > some pages out of !highatomic free list to avoid such false > positive. > > Another potential problem is that by race between page freeing and > reserve highatomic function, pages could be in highatomic free list > even though the pageblock is !high atomic migratetype. In that case, > unreserve_highatomic_pageblock can be void if count of highatomic > reserve is less than pageblock_nr_pages. We could solve it simply > via draining all of reserved pages before the OOM. It would have > a safeguard role to exhuast reserved pages before converging to OOM. > > Signed-off-by: Minchan Kim > Signed-off-by: Michal Hocko > Acked-by: Vlastimil Babka Looks good to me as well. If the previous one is agreed to go to stable this one should go with it IMHO. Thanks! > --- > mm/page_alloc.c | 24 +--- > 1 file changed, 17 insertions(+), 7 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index fd2f0e1bffc4..163d7fa759a2 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2079,8 +2079,12 @@ static void reserve_highatomic_pageblock(struct page > *page, struct zone *zone, > * potentially hurts the reliability of high-order allocations when under > * intense memory pressure but failed atomic allocations should be easier > * to recover from than an OOM. > + * > + * If @force is true, try to unreserve a pageblock even though highatomic > + * pageblock is exhausted. 
> */ > -static bool unreserve_highatomic_pageblock(const struct alloc_context *ac) > +static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, > + bool force) > { > struct zonelist *zonelist = ac->zonelist; > unsigned long flags; > @@ -2092,8 +2096,12 @@ static bool unreserve_highatomic_pageblock(const > struct alloc_context *ac) > > for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx, > ac->nodemask) { > - /* Preserve at least one pageblock */ > - if (zone->nr_reserved_highatomic <= pageblock_nr_pages) > + /* > + * Preserve at least one pageblock unless memory pressure > + * is really high. > + */ > + if (!force && zone->nr_reserved_highatomic <= > + pageblock_nr_pages) > continue; > > spin_lock_irqsave(&zone->lock, flags); > @@ -2138,8 +2146,10 @@ static bool unreserve_highatomic_pageblock(const > struct alloc_context *ac) >*/ > set_pageblock_migratetype(page, ac->migratetype); > ret = move_freepages_block(zone, page, ac->migratetype); > - spin_unlock_irqrestore(&zone->lock, flags); > - return ret; > + if (ret) { > + spin_unlock_irqrestore(&zone->lock, flags); > + return ret; > + } > } > spin_unlock_irqrestore(&zone->lock, flags); > } > @@ -3343,7 +3353,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned > int order, >* Shrink them them and try again >*/ > if (!page && !drained) { > - unreserve_highatomic_pageblock(ac); > + unreserve_highatomic_pageblock(ac, false); > drain_all_pages(NULL); > drained = true; > goto retry; > @@ -3462,7 +3472,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, >*/ > if (*no_progress_loops > MAX_RECLAIM_RETRIES) { > /* Before OOM, exhaust highatomic_reserve */ > - return unreserve_highatomic_pageblock(ac); > + return unreserve_highatomic_pageblock(ac, true); > } > > /* > -- > 2.7.4 > -- Michal Hocko SUSE Labs
Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area
On 10/12/2016 04:25 PM, Michal Hocko wrote:
> On Wed 12-10-16 15:24:33, zijun_hu wrote:
>> On 10/12/2016 02:53 PM, Michal Hocko wrote:
>>> On Wed 12-10-16 08:28:17, zijun_hu wrote:
On 2016/10/12 1:22, Michal Hocko wrote:
> On Tue 11-10-16 21:24:50, zijun_hu wrote:
>> From: zijun_hu
>>
>> should we have a generic discussion whether such patches which considers
>> many boundary or rare conditions are necessary.
>
> In general, I believe that kernel internal interfaces which have no
> userspace exposure shouldn't be cluttered with sanity checks.
>

You are right and I agree with you, but many internal interfaces in the current Linux sources do perform sanity checks.

>> i found the following code segments in mm/vmalloc.c
>> static struct vmap_area *alloc_vmap_area(unsigned long size,
>> 				unsigned long align,
>> 				unsigned long vstart, unsigned long vend,
>> 				int node, gfp_t gfp_mask)
>> {
>> ...
>>
>> 	BUG_ON(!size);
>> 	BUG_ON(offset_in_page(size));
>> 	BUG_ON(!is_power_of_2(align));
>
> See a recent Linus rant about BUG_ONs. These BUG_ONs are quite old and
> from a quick look they are even unnecessary. So rather than adding more
> of those, I think removing those that are not needed is much more
> preferred.
>

I noticed that. The code segment above was only meant to illustrate that input parameter checking is sometimes necessary.

>> should we make below declarations as conventions
>> 1) when we say 'alignment', it means align to a power of 2 value
>>    for example, aligning value @v to @b implicit @v is power of 2
>>    , align 10 to 4 is 12
>
> alignment other than power-of-two makes only very limited sense to me.
>

You are right and I agree with you.

>> 2) when we say 'round value @v up/down to boundary @b', it means the
>>    result is a times of @b, it don't requires @b is a power of 2
>

I will write to Linus to ask for his opinion on whether we should formally define the meaning of 'align' and 'round up/down', and whether such patches are necessary.
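
The distinction between the two proposed conventions is easy to see in a few lines of userspace C. The helper names below are illustrative only and are not the kernel's ALIGN() or roundup() macros, although they follow the same arithmetic.

```c
#include <stdbool.h>

/* True when n has exactly one bit set. */
bool is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Mask-based alignment: only meaningful when b is a power of two. */
unsigned long align_up(unsigned long v, unsigned long b)
{
	return (v + b - 1) & ~(b - 1);
}

/* Division-based rounding: valid for any nonzero b. */
unsigned long round_up_multiple(unsigned long v, unsigned long b)
{
	return ((v + b - 1) / b) * b;
}
```

Aligning 10 to 4 gives 12 with either helper, matching the example in the mail. With a non-power-of-two boundary such as 6, round_up_multiple(10, 6) gives 12, while the mask trick in align_up(10, 6) returns 10, which is not a multiple of 6. That silent wrong answer is what a BUG_ON(!is_power_of_2(align)) style check guards against.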
Re: [PATCH 6/6] cpufreq: pxa: convert to clock API
On 12-10-16, 10:29, Robert Jarzmik wrote: > Viresh Kumar writes: > > > On 12-10-16, 08:22, Robert Jarzmik wrote: > >> Viresh Kumar writes: > >> > >> > On 10-10-16, 22:09, Robert Jarzmik wrote: > >> >> As the clock settings have been introduced into the clock pxa drivers, > >> >> which are now available to change the CPU clock by themselves, remove > >> >> the clock handling from this driver, and rely on pxa clock drivers. > >> >> > >> >> Signed-off-by: Robert Jarzmik > >> >> --- > >> >> drivers/cpufreq/pxa2xx-cpufreq.c | 191 > >> >> --- > >> >> 1 file changed, 39 insertions(+), 152 deletions(-) > >> > > >> > As you mentioned in the previous patch, why can't you use cpufreq-dt > >> > driver now and delete this one ? > >> > >> PXA architecture have both legacy platform_data based configurations and > >> new > >> devicetree based ones. > > > > I don't see any platform data specific code in this driver. What am I > > missing ? > > In a legacy platform, ie. without devicetree, we have CONFIG_OF=n. > How would cpufreq-dt be usable in this case ? > > You can see such a platform in arch/arm/configs/mainstone_defconfig and > arch/arm/mach-pxa/mainstone.c as an example. Okay, so its not about platform_data as you said earlier. Rather it is about the CONFIG_OF option. In that case, what about making this driver depends_on !CONFIG_OF ? So that the DT users don't use it anymore. -- viresh
Re: [PATCH] gpio: mockup: add sysfs dependency
On Mon, Oct 10, 2016 at 2:42 PM, Arnd Bergmann wrote: > Building the gpio-mockup driver without SYSFS results in a harmless Kconfig > warning: > > warning: (GPIO_MOCKUP) selects GPIO_SYSFS which has unmet direct dependencies > (GPIOLIB && SYSFS) > > We can easily avoid that warning by adding a dependency on SYSFS. > > Fixes: 0f98dd1b27d2 ("gpio/mockup: add virtual gpio device") > Signed-off-by: Arnd Bergmann Patch applied. Yours, Linus Walleij
Re: [PATCH v3 03/11] tracing/syscalls: add compat syscall metadata
Marcin Nowakowski writes: > Now that compat syscalls are properly distinguished from native calls, > we can add metadata for compat syscalls as well. > All the macros used to generate the metadata are the same as for > standard syscalls, but with a compat_ prefix to distinguish them easily. > > Signed-off-by: Marcin Nowakowski > Cc: Steven Rostedt > Cc: Ingo Molnar > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Michael Ellerman > Cc: linuxppc-...@lists.ozlabs.org > --- > arch/powerpc/include/asm/ftrace.h | 15 +--- > include/linux/compat.h| 74 > +++ > kernel/trace/trace_syscalls.c | 8 +++-- > 3 files changed, 90 insertions(+), 7 deletions(-) > > diff --git a/arch/powerpc/include/asm/ftrace.h > b/arch/powerpc/include/asm/ftrace.h > index 686c5f7..9697a73 100644 > --- a/arch/powerpc/include/asm/ftrace.h > +++ b/arch/powerpc/include/asm/ftrace.h > @@ -73,12 +73,17 @@ struct dyn_arch_ftrace { > static inline bool arch_syscall_match_sym_name(const char *sym, const char > *name) > { > /* > - * Compare the symbol name with the system call name. Skip the .sys or > .SyS > - * prefix from the symbol name and the sys prefix from the system call > name and > - * just match the rest. This is only needed on ppc64 since symbol names > on > - * 32bit do not start with a period so the generic function will work. > + * Compare the symbol name with the system call name. Skip the .sys, > + * .SyS or .compat_sys prefix from the symbol name and the sys prefix > + * from the system call name and just match the rest. This is only > + * needed on ppc64 since symbol names on 32bit do not start with a > + * period so the generic function will work. >*/ > - return !strcmp(sym + 4, name + 3); > + int prefix_len = 3; > + > + if (!strncasecmp(name, "compat_", 7)) > + prefix_len = 10; > + return !strcmp(sym + prefix_len + 1, name + prefix_len); > } It's annoying that we have to duplicate all that just to do a + 1. How about this as a precursor? 
cheers diff --git a/Documentation/trace/ftrace-design.txt b/Documentation/trace/ftrace-design.txt index dd5f916b351d..bd65f2adeb09 100644 --- a/Documentation/trace/ftrace-design.txt +++ b/Documentation/trace/ftrace-design.txt @@ -226,10 +226,6 @@ You need very few things to get the syscalls tracing in an arch. - If the system call table on this arch is more complicated than a simple array of addresses of the system calls, implement an arch_syscall_addr to return the address of a given system call. -- If the symbol names of the system calls do not match the function names on - this arch, define ARCH_HAS_SYSCALL_MATCH_SYM_NAME in asm/ftrace.h and - implement arch_syscall_match_sym_name with the appropriate logic to return - true if the function name corresponds with the symbol name. - Tag this arch as HAVE_SYSCALL_TRACEPOINTS. diff --git a/arch/powerpc/include/asm/ftrace.h b/arch/powerpc/include/asm/ftrace.h index 686c5f70eb84..dc48f5b2878d 100644 --- a/arch/powerpc/include/asm/ftrace.h +++ b/arch/powerpc/include/asm/ftrace.h @@ -60,6 +60,12 @@ struct dyn_arch_ftrace { struct module *mod; }; #endif /* CONFIG_DYNAMIC_FTRACE */ + +#ifdef PPC64_ELF_ABI_v1 +/* On ppc64 ABIv1 (BE) we have to skip the leading '.' in the symbol name */ +#define ARCH_SYM_NAME_SKIP_CHARS 1 +#endif + #endif /* __ASSEMBLY__ */ #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS @@ -67,20 +73,4 @@ struct dyn_arch_ftrace { #endif #endif -#if defined(CONFIG_FTRACE_SYSCALLS) && !defined(__ASSEMBLY__) -#ifdef PPC64_ELF_ABI_v1 -#define ARCH_HAS_SYSCALL_MATCH_SYM_NAME -static inline bool arch_syscall_match_sym_name(const char *sym, const char *name) -{ - /* -* Compare the symbol name with the system call name. Skip the .sys or .SyS -* prefix from the symbol name and the sys prefix from the system call name and -* just match the rest. This is only needed on ppc64 since symbol names on -* 32bit do not start with a period so the generic function will work. 
-*/ - return !strcmp(sym + 4, name + 3); -} -#endif -#endif /* CONFIG_FTRACE_SYSCALLS && !__ASSEMBLY__ */ - #endif /* _ASM_POWERPC_FTRACE */ diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c index b2b6efc083a4..91a7315dbe43 100644 --- a/kernel/trace/trace_syscalls.c +++ b/kernel/trace/trace_syscalls.c @@ -31,8 +31,11 @@ extern struct syscall_metadata *__stop_syscalls_metadata[]; static struct syscall_metadata **syscalls_metadata; -#ifndef ARCH_HAS_SYSCALL_MATCH_SYM_NAME -static inline bool arch_syscall_match_sym_name(const char *sym, const char *name) +#ifndef ARCH_SYM_NAME_SKIP_CHARS +#define ARCH_SYM_NAME_SKIP_CHARS 0 +#endif + +static inline bool syscall_match_sym_name(const char *sym, const char *name) { /* * Only compare after the "sys" prefix. Archs that use @@ -40,9 +43,8 @@
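The diff above is cut off before the body of the new generic helper, so the following is only a guess at its shape: a minimal userspace sketch, assuming the helper skips an arch-defined number of leading characters plus the three-character "sys"/"SyS" prefix. The name match_sym_name() and the explicit skip parameter are hypothetical, standing in for the ARCH_SYM_NAME_SKIP_CHARS constant proposed in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for the generic matcher: compare the symbol name
 * and the syscall name after their three-character "sys"/"SyS" prefixes,
 * first skipping `skip` extra leading characters in the symbol name
 * (1 for the leading '.' on ppc64 ABIv1, 0 elsewhere).
 */
static bool match_sym_name(const char *sym, const char *name, size_t skip)
{
	/* Both strings must be long enough for the offsets used below. */
	assert(strlen(sym) >= skip + 3 && strlen(name) >= 3);
	return strcmp(sym + skip + 3, name + 3) == 0;
}
```

With skip = 1 this reduces to the existing powerpc comparison, !strcmp(sym + 4, name + 3), and with skip = 0 to a plain comparison after the prefix.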
Re: [PATCH v4 08/10] ARM: dts: sun8i: Enable sun8i-emac on the Orange Pi 2
On Wed, Oct 12, 2016 at 10:55:59AM +0200, Jean-Francois Moine wrote: > On Fri, 7 Oct 2016 10:25:55 +0200 > Corentin Labbe wrote: > > > The sun8i-emac hardware is present on the Orange PI 2. > > It uses the internal PHY. > > > > This patch create the needed emac node. > > > > Signed-off-by: Corentin Labbe > > --- > > arch/arm/boot/dts/sun8i-h3-orangepi-2.dts | 8 > > 1 file changed, 8 insertions(+) > > > > diff --git a/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts > > b/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts > > index f93f5d1..5608eb4 100644 > > --- a/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts > > +++ b/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts > > @@ -54,6 +54,7 @@ > > > > aliases { > > serial0 = &uart0; > > + ethernet0 = &emac; > > As there is no 'of_alias_get_id' in the driver, this alias is > useless. Not really, this is used by U-Boot to set the mac address. Maxime -- Maxime Ripard, Free Electrons Embedded Linux and Kernel engineering http://free-electrons.com
RE: [PATCH]"drm: change DRM_MIPI_DSI module type from "bool" to "tristate".
I think "installing a kernel with my changes for both drm and i915" takes more time and effort to complete than "only updating DRM/i915 modules without rebuilding the whole kernel". In some cases, that's beneficial. Also reloadablility is always a good thing to have and I truly hope Hajda/Iwai's patches would be accepted and merged. No downside of it after all. Regards, Sun, Jing -Original Message- From: Daniel Vetter [mailto:daniel.vet...@ffwll.ch] On Behalf Of Daniel Vetter Sent: Wednesday, October 12, 2016 2:52 PM To: Sun, Jing A Cc: Andrzej Hajda; Jani Nikula; Takashi Iwai; Emil Velikov; linux-kernel@vger.kernel.org; dri-de...@lists.freedesktop.org; Vetter, Daniel; Thierry Reding Subject: Re: [PATCH]"drm: change DRM_MIPI_DSI module type from "bool" to "tristate". On Wed, Oct 12, 2016 at 03:08:24AM +, Sun, Jing A wrote: > Interestingly, I am able to reload i915 and drm. Our CI has tests for > i915 unload/reload, but does not check drm. In any case the config > problem should not impact the reloadability of i915. > == > Sorry that I didn't make myself clear. In order to replace the default > i915 module with an updated one, the related DRM modules also need to > be updated to match the updated i915, hence the restriction. Just to avoid tears in the future: If you plan to ship this in product, you won't ship. And for debugging, just install a kernel with your changes for both drm and i915. In short, your use-case isn't really valid (but we could still make the dsi code modular if people feel like). -Daniel > > Regards, > Sun, Jing > > > -Original Message- > From: Andrzej Hajda [mailto:a.ha...@samsung.com] > Sent: Tuesday, October 11, 2016 5:53 PM > To: Jani Nikula; Sun, Jing A; Takashi Iwai > Cc: airl...@linux.ie; Vetter, Daniel; linux-kernel@vger.kernel.org; > dri-de...@lists.freedesktop.org; Thierry Reding; Emil Velikov > Subject: Re: [PATCH]"drm: change DRM_MIPI_DSI module type from "bool" to > "tristate". 
> > On 11.10.2016 11:33, Jani Nikula wrote: > > On Tue, 11 Oct 2016, "Sun, Jing A" wrote: > >> It's needed that DRM Driver module could be removed and reloaded > >> after kernel booting on the projects that I have been working on, > >> and I hope such module type change could be accepted. Looks like > >> Iwai has similar change request as well. Would you please review it > >> and let us know if any concerns? > > Looking at the Kconfig, selecting CONFIG_DRM_MIPI_DSI is against the > > recommendations of Documentation/kbuild/kconfig-language.txt: > > > > select should be used with care. select will force > > a symbol to a value without visiting the dependencies. > > By abusing select you are able to select a symbol FOO even > > if FOO depends on BAR that is not set. > > In general use select only for non-visible symbols > > (no prompts anywhere) and for symbols with no dependencies. > > That will limit the usefulness but on the other hand avoid > > the illegal configurations all over. > > All existing drivers which selects DRM_MIPI_DSI also depends on DRM. > So the dependency is always true. I am not sure if it could not change > in the future, but in such case mipi_dsi bus should be completely > detached from DRM framework, I hope we have not such case yet :) > > > > > Indeed, you may end up with CONFIG_DRM_MIPI_DSI=y and CONFIG_DRM=m, > > which violates DRM_MIPI_DSI dependency on CONFIG_DRM. This is broken > > and should be fixed. The suggested patch does *not* fix this issue. > > At the moment it should not be possible. > > Regards > Andrzej > > ___ > dri-devel mailing list > dri-de...@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
Re: [mac80211] BUG_ON with current -git (4.8.0-11417-g24532f7)
Hi, Sorry - I meant to look into this yesterday but forgot. > Andy, can this be related to CONFIG_VMAP_STACK? I think it is. > > current -git kills my system. Can you elaborate on how exactly it kills your system? > > adding > > > > if (!virt_addr_valid(&aad[2])) { > > WARN_ON(1); > > return -EINVAL; > > } That's pretty obviously false with VMAP_STACK, since the caller (ieee80211_crypto_ccmp_decrypt) puts the aad on the stack. b_0 is also on the stack, but maybe that doesn't matter. Herbert, do you know what could cause this, and how we should fix it? We can't really afford to do an allocation here, and we don't have space in the skb (not even in skb->cb at that point), so if we really have no way to continue using the stack we'd ... not sure, use a per- CPU buffer perhaps. We need 32 bytes for aad and 16 bytes for b_0, if that also can't be on the stack any more. johannes
Re: [PATCH v4 08/10] ARM: dts: sun8i: Enable sun8i-emac on the Orange Pi 2
On Fri, 7 Oct 2016 10:25:55 +0200 Corentin Labbe wrote: > The sun8i-emac hardware is present on the Orange PI 2. > It uses the internal PHY. > > This patch create the needed emac node. > > Signed-off-by: Corentin Labbe > --- > arch/arm/boot/dts/sun8i-h3-orangepi-2.dts | 8 > 1 file changed, 8 insertions(+) > > diff --git a/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts > b/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts > index f93f5d1..5608eb4 100644 > --- a/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts > +++ b/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts > @@ -54,6 +54,7 @@ > > aliases { > serial0 = &uart0; > + ethernet0 = &emac; As there is no 'of_alias_get_id' in the driver, this alias is useless. > }; > > chosen { > @@ -184,3 +185,10 @@ > usb1_vbus-supply = <&reg_usb1_vbus>; > status = "okay"; > }; > + > +&emac { > + phy-handle = <&int_mii_phy>; > + phy-mode = "mii"; > + allwinner,leds-active-low; > + status = "okay"; > +}; > -- > 2.7.3 > > > ___ > linux-arm-kernel mailing list > linux-arm-ker...@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel -- Ken ar c'hentañ | ** Breizh ha Linux atav! ** Jef | http://moinejf.free.fr/
Re: [PATCH] kexec: Export memory sections virtual addresses to vmcoreinfo
On Wednesday 12 October 2016 05:56 AM, Baoquan He wrote: > PAGE_OFFSET can be get via vaddr - paddr from elf pt_loads so only > VMALLOC_BASE and VMEMMAP_BASE is necessary.. Well, yes, I was wrong. I was thinking of the kernel text virtual address when I wrote the reply. So, if you can get PAGE_OFFSET, then you probably do not need to know anything else. I think we can simplify the makedumpfile code so that it does not need to depend on VMALLOC_START, VMEMMAP_START, etc. "If we know PAGE_OFFSET, we can read from swapper space. If we can read from swapper space, then we can know the PA of any kernel VA, whether it is in the VMALLOC, vmemmap, module or kernel text area." In fact, I have cleanup patches for ARM64 [1] which take the above approach and get rid of the need for VMALLOC_START, VMEMMAP_START, etc. I will be sending them upstream soon. x86 can probably take a similar approach. ~Pratyush [1] https://github.com/pratyushanand/makedumpfile/blob/arm64_devel/arch/arm64.c#L228
Re: [PATCH] mm: page_alloc: Use KERN_CONT where appropriate
On Tue 11-10-16 19:24:55, Joe Perches wrote: > Recent changes to printk require KERN_CONT uses to continue logging > messages. So add KERN_CONT where necessary. I was really wondering what happened when Aaron reported an allocation failure http://lkml.kernel.org/r/20161012065423.ga16...@aaronlu.sh.intel.com See the attached log got the current Linus' tree Fixes: 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation lines") > Signed-off-by: Joe Perches Acked-by: Michal Hocko I believe we can simplify the code a bit as well. What do you think about the following on top? --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6f8c356140a0..7e1b74ee79cb 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4078,10 +4078,12 @@ unsigned long nr_free_pagecache_pages(void) return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE)); } -static inline void show_node(struct zone *zone) +static inline void show_zone_node(struct zone *zone) { if (IS_ENABLED(CONFIG_NUMA)) - printk("Node %d ", zone_to_nid(zone)); + printk("Node %d %s", zone_to_nid(zone), zone->name); + else + printk("%s: ", zone->name); } long si_mem_available(void) @@ -4329,9 +4331,8 @@ void show_free_areas(unsigned int filter) for_each_online_cpu(cpu) free_pcp += per_cpu_ptr(zone->pageset, cpu)->pcp.count; - show_node(zone); + show_zone_node(zone); printk(KERN_CONT - "%s" " free:%lukB" " min:%lukB" " low:%lukB" @@ -4354,7 +4355,6 @@ void show_free_areas(unsigned int filter) " local_pcp:%ukB" " free_cma:%lukB" "\n", - zone->name, K(zone_page_state(zone, NR_FREE_PAGES)), K(min_wmark_pages(zone)), K(low_wmark_pages(zone)), @@ -4379,7 +4379,6 @@ void show_free_areas(unsigned int filter) printk("lowmem_reserve[]:"); for (i = 0; i < MAX_NR_ZONES; i++) printk(KERN_CONT " %ld", zone->lowmem_reserve[i]); - printk(KERN_CONT "\n"); } for_each_populated_zone(zone) { @@ -4389,8 +4388,7 @@ void show_free_areas(unsigned int filter) if (skip_free_areas_node(filter, zone_to_nid(zone))) continue; - 
show_node(zone); - printk(KERN_CONT "%s: ", zone->name); + show_zone_node(zone); spin_lock_irqsave(&zone->lock, flags); for (order = 0; order < MAX_ORDER; order++) { -- Michal Hocko SUSE Labs
Re: [RFC 0/6] Module for tracking/accounting shared memory buffers
Am 12.10.2016 um 01:50 schrieb Ruchi Kandoi: This patchstack adds memtrack hooks into dma-buf and ion. If there's upstream interest in memtrack, it can be extended to other memory allocators as well, such as GEM implementations. We have run into similar problems before. Because of this I already proposed a solution for this quite a while ago, but never pushed on upstreaming this since it was only done for a special use case. Instead of keeping track of how much memory a process has bound (which is very fragile) my solution only added some more debugging info on a per fd basis (e.g. how much memory is bound to this fd). This information was then used by the OOM killer (for example) to make a better decision on which process to reap. Shouldn't be too hard to expose this through debugfs or maybe a new fcntl to userspace for debugging. I haven't looked at the code in detail, but messing with the per process memory accounting like you did in this proposal is clearly not a good idea if you ask me. Regards, Christian.
Re: [PATCH 3/3] mtd: s3c2410: parse the device configuration from OF node
Hi Sergio, On Wed, 5 Oct 2016 20:46:57 -0300 Sergio Prado wrote: > Allows configuring Samsung's s3c2410 memory controller using a > devicetree. > > Signed-off-by: Sergio Prado > --- > drivers/mtd/nand/s3c2410.c | 171 > ++--- > include/linux/platform_data/mtd-nand-s3c2410.h | 1 + > 2 files changed, 156 insertions(+), 16 deletions(-) > > diff --git a/drivers/mtd/nand/s3c2410.c b/drivers/mtd/nand/s3c2410.c > index 174ac9dc4265..352cf2656bc8 100644 > --- a/drivers/mtd/nand/s3c2410.c > +++ b/drivers/mtd/nand/s3c2410.c [...] > + > +static int s3c2410_nand_init_timings(struct s3c2410_nand_info *info, > + struct nand_chip *chip) > +{ > + struct s3c2410_platform_nand *pdata = info->platform; > + const struct nand_sdr_timings *t; > + int tacls, mode; > + > + mode = onfi_get_async_timing_mode(chip); > + if (mode == ONFI_TIMING_MODE_UNKNOWN) > + mode = chip->onfi_timing_mode_default; > + > + t = onfi_async_timing_mode_to_sdr_timings(mode); > + if (IS_ERR(t)) > + return PTR_ERR(t); We recently introduced a method to automate timing selection and configuration [1]. Can you switch to this approach (the changes are in the nand/next branch [2] and should appear in 4.9-rc1)? Thanks, Boris [1]https://www.spinics.net/lists/arm-kernel/msg532007.html [2]https://github.com/linux-nand/linux/tree/nand/next
[GIT PULL] fbdev changes for 4.9
Hi Linus, Please pull fbdev changes for 4.9. Tomi The following changes since commit 29b4817d4018df78086157ea3a55c1d9424a7cfc: Linux 4.8-rc1 (2016-08-07 18:18:00 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux.git tags/fbdev-4.9 for you to fetch changes up to c456a2f30de53e77a2eb8eeb4202d742516aa76b: video: smscufx: remove unused variable (2016-09-27 11:47:37 +0300) fbdev changes for 4.9 Main changes: - amba-cldc: DT backlight support, Nomadik support, Versatile improvements, fixes - efifb: fix fbcon RGB565 palette - exynos: remove unused DSI driver Arnd Bergmann (3): video: ARM CLCD: fix endpoint lookup logic video: ARM CLCD: export symbols for driver module video: fbdev: mb862xx: remove unused variable Bhaktipriya Shridhar (2): omapfb: panel-dsi-cm: Remove deprecated create_singlethread_workqueue fbdev: Remove deprecated create_singlethread_workqueue Chen-Yu Tsai (1): simplefb: Disable and release clocks and regulators in destroy callback Colin Ian King (2): video: fbdev: add missing \n at end of printk error message video: fbdev: i810: add in missing white space in error message text Dan Carpenter (2): video: fbdev: pxafb: potential NULL dereference on error fb: adv7393: off by one in probe function Javier Martinez Canillas (1): fb: adv7393: Use IS_ENABLED() instead of checking for built-in or module Julia Lawall (2): matroxfb: constify local structures video: fbdev: constify fb_fix_screeninfo and fb_var_screeninfo structures Krzysztof Kozlowski (3): video: s3c2410fb: Register cpufreq notifier only on S3C24xx video: fbdev: exynos: Remove old non-working MIPI driver ARM: exynos_defconfig: Remove old non-working MIPI driver LABBE Corentin (2): fbdev: ssd1307fb: constify the device_info pointer fbdev: ssd1307fb: fix a possible NULL dereference Linus Walleij (7): video: ARM CLCD: backlight support for OF video: ARM CLCD: support DT signal inversion flags video: ARM CLCD: support pads connected in reverse order 
video: ARM CLCD: support Nomadik variant video: ARM CLCD: add special board and panel hooks for Nomadik video: ARM CLCD: add special panel hook for Versatiles video: ARM CLCD: fix up Integrator support Marek Vasut (1): video: mxsfb: Fix framebuffer corruption on mx6sx Mark Brown (1): omapfb: Fix regulator API abuse in dss.c and hdmi4/5.c Max Staudt (1): fbdev/efifb: Fix 16 color palette entry calculation Nicholas Mc Guire (1): omapfb/dss: wait_for_completion_interruptible_timeout expects long Oleg Drokin (1): mx3fb: Fix print format string Sudip Mukherjee (3): video: fbdev: intelfb: remove impossible condition matroxfb: fix size of memcpy video: smscufx: remove unused variable Tomi Valkeinen (1): MAINTAINERS: update fbdev entries Vladimir Murzin (3): fbdev: vfb: add description to module parameters fbdev: vfb: add option for video mode fbdev: vfb: simplify memory management Wei Yongjun (3): video: ARM CLCD: fix return value check in versatile_clcd_init_panel() video: fbdev: pxafb: add missing of_node_put() in of_get_pxafb_mode_info() omapfb: fix return value check in dsi_bind() Wolfram Sang (1): video: fbdev: mb862xx: mb862xx-i2c: don't print error when adding adapter fails Yongji Xie (1): video: fbdev: offb: Call pci_enable_device() before using the PCI VGA device MAINTAINERS| 12 - arch/arm/configs/exynos_defconfig | 2 - drivers/video/fbdev/Kconfig| 7 +- drivers/video/fbdev/Makefile | 3 +- drivers/video/fbdev/amba-clcd-nomadik.c| 259 ++ drivers/video/fbdev/amba-clcd-nomadik.h| 24 + drivers/video/fbdev/amba-clcd-versatile.c | 395 + drivers/video/fbdev/amba-clcd-versatile.h | 17 + drivers/video/fbdev/amba-clcd.c| 190 - drivers/video/fbdev/arcfb.c| 4 +- drivers/video/fbdev/asiliantfb.c | 4 +- drivers/video/fbdev/aty/aty128fb.c | 6 +- drivers/video/fbdev/aty/atyfb_base.c | 2 +- drivers/video/fbdev/aty/radeon_monitor.c | 2 +- drivers/video/fbdev/bfin_adv7393fb.c | 5 +- drivers/video/fbdev/efifb.c| 6 +- drivers/video/fbdev/exynos/Kconfig | 32 - 
drivers/video/fbdev/exynos/Makefile| 9 - drivers/video/fbdev/exynos/exynos_mipi_dsi.c | 574 - .../video/fbdev/exynos/exynos_mipi_dsi_common.c| 880 .../video/fbdev/exynos/exynos_
Re: [patch] drm/amdgpu: potential NULL dereference in debugfs code
Am 12.10.2016 um 08:17 schrieb Dan Carpenter: debugfs_create_file() returns NULL on error, it only returns error pointers if debugfs isn't enabled in the config and we checked for that earlier so it can't happen. Fixes: 4f4824b55650 ('drm/amd/amdgpu: Convert ring debugfs entries to binary') Signed-off-by: Dan Carpenter Reviewed-by: Christian König . diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c index 85aeb0a..8d16eaf 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c @@ -371,8 +371,8 @@ static int amdgpu_debugfs_ring_init(struct amdgpu_device *adev, ent = debugfs_create_file(name, S_IFREG | S_IRUGO, root, ring, &amdgpu_debugfs_ring_fops); - if (IS_ERR(ent)) - return PTR_ERR(ent); + if (!ent) + return -ENOMEM; i_size_write(ent->d_inode, ring->ring_size + 12); ring->ent = ent;
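To illustrate why the IS_ERR() check in the patch above could never catch a NULL return, here is a minimal userspace sketch. It re-creates the kernel's ERR_PTR()/PTR_ERR()/IS_ERR() helpers in simplified form (an assumption: the real definitions in include/linux/err.h carry extra annotations), and the two check_* functions are hypothetical stand-ins for the old and new error handling.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer in the top 4095 addresses. */
static inline void *ERR_PTR(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True only for pointers in the error range; NULL is NOT in that range. */
static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Old check: a NULL return slips through and would be dereferenced later. */
static int check_with_is_err(const void *ent)
{
	if (IS_ERR(ent))
		return (int)PTR_ERR(ent);
	return 0;
}

/* New check: NULL is treated as the failure case. */
static int check_with_null(const void *ent)
{
	if (!ent)
		return -ENOMEM;
	return 0;
}
```

Passing NULL to check_with_is_err() reports success, which is the potential NULL dereference the patch title refers to.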
Re: [PATCH v2] timers: Fix usleep_range() in the context of wake_up_process()
On Tue, Oct 11, 2016 at 09:33:15AM -0700, Doug Anderson wrote: > On Tue, Oct 11, 2016 at 12:14 AM, Thomas Gleixner wrote: > > On Mon, 10 Oct 2016, Douglas Anderson wrote: > >> Users of usleep_range() expect that it will _never_ return in less time > >> than the minimum passed parameter. However, nothing in any of the code > >> ensures this. Specifically: > > There is no such guarantee for that interface and never has been, so how > > did you make sure that none of the existing users is relying on this? > > You can't just can't just declare that all all of the users expect that and > > be done with it. > You're right that I can't guarantee that no callers are relying on the > existing behavior of a wake_up_process() causing usleep_range() to > return early. I would say, however, that all callers I've seen are > absolutely relying on the min delay being enforced and I've never > personally seen a caller relying on being woken up from > usleep_range(). All the users relying on the min delay being enforced Indeed. It's *highly* surprising for any sleep interface to undershoot on delays, the usual thing is that they might delay for longer. If the function doesn't actually reliably delay for the minimum time then I'd expect that a large proportion of those conversions and other recent code that's been added is buggy. > one of two functions: usleep_atlest() and usleep_wakeable(). As > argued below I think that usleep_range() name implies that it will at > least sleep the minimum so I would really like to avoid keeping the > name usleep_range() and also keeping the existing behavior. I tend to agree with everything Doug is saying in terms of API expectations.
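As a userspace analogy for the "never return before the minimum" expectation discussed above (this is not the kernel patch, only a sketch under that assumption), a sleep can be made robust against early wakeups by sleeping to an absolute deadline and retrying while interrupted. The helper names now_ns() and sleep_at_least_us() are made up for this example; clock_nanosleep() with TIMER_ABSTIME is the POSIX interface.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC time in nanoseconds. */
static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Sleep for at least min_us microseconds, even if woken early by a signal. */
static void sleep_at_least_us(int64_t min_us)
{
	assert(min_us >= 0);

	int64_t deadline = now_ns() + min_us * 1000;
	struct timespec abs = {
		.tv_sec = deadline / 1000000000LL,
		.tv_nsec = deadline % 1000000000LL,
	};

	/*
	 * clock_nanosleep() returns the error number directly. With an
	 * absolute deadline, retrying after EINTR re-arms the same target,
	 * so early wakeups cannot shorten the total delay.
	 */
	while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &abs, NULL) == EINTR)
		;
}
```

The point of the absolute deadline is that the caller may overshoot, as any sleep interface can, but it cannot undershoot.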
Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area
On Wed 12-10-16 15:24:33, zijun_hu wrote: > On 10/12/2016 02:53 PM, Michal Hocko wrote: > > On Wed 12-10-16 08:28:17, zijun_hu wrote: > >> On 2016/10/12 1:22, Michal Hocko wrote: > >>> On Tue 11-10-16 21:24:50, zijun_hu wrote: > From: zijun_hu > > the LSB of a chunk->map element is used for free/in-use flag of a area > and the other bits for offset, the sufficient and necessary condition of > this usage is that both size and alignment of a area must be even numbers > however, pcpu_alloc() doesn't force its @align parameter a even number > explicitly, so a odd @align maybe causes a series of errors, see below > example for concrete descriptions. > >>> > >>> Is or was there any user who would use a different than even (or power of > >>> 2) > >>> alighment? If not is this really worth handling? > >>> > >> > >> it seems only a power of 2 alignment except 1 can make sure it work very > >> well, > >> that is a strict limit, maybe this more strict limit should be checked > > > > I fail to see how any other alignment would actually make any sense > > what so ever. Look, I am not a maintainer of this code but adding a new > > code to catch something that doesn't make any sense sounds dubious at > > best to me. > > > > I could understand this patch if you see a problem and want to prevent > > it from repeating bug doing these kind of changes just in case sounds > > like a bad idea. > > > > thanks for your reply > > should we have a generic discussion whether such patches which considers > many boundary or rare conditions are necessary. In general, I believe that kernel internal interfaces which have no userspace exposure shouldn't be cluttered with sanity checks. > i found the following code segments in mm/vmalloc.c > static struct vmap_area *alloc_vmap_area(unsigned long size, > unsigned long align, > unsigned long vstart, unsigned long vend, > int node, gfp_t gfp_mask) > { > ... 
> > BUG_ON(!size); > BUG_ON(offset_in_page(size)); > BUG_ON(!is_power_of_2(align)); See a recent Linus rant about BUG_ONs. These BUG_ONs are quite old and from a quick look they are even unnecessary. So rather than adding more of those, I think removing those that are not needed is much more preferred. > should we make below declarations as conventions > 1) when we say 'alignment', it means align to a power of 2 value >for example, aligning value @v to @b implicit @v is power of 2 >, align 10 to 4 is 12 alignment other than power-of-two makes only very limited sense to me. > 2) when we say 'round value @v up/down to boundary @b', it means the >result is a times of @b, it don't requires @b is a power of 2 -- Michal Hocko SUSE Labs
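The distinction drawn above between "align" (power-of-two boundary) and "round up" (any boundary) can be sketched in plain C. The helper names below are made up for illustration and only loosely mirror the kernel's ALIGN() and roundup() macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True when n has exactly one bit set (1, 2, 4, 8, ...). */
static bool is_pow2(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Align v up to boundary b using a mask; only valid for power-of-two b. */
static size_t align_up(size_t v, size_t b)
{
	assert(is_pow2(b));
	return (v + b - 1) & ~(b - 1);
}

/* Round v up to the next multiple of b; b may be any non-zero value. */
static size_t round_up_to(size_t v, size_t b)
{
	assert(b != 0);
	return ((v + b - 1) / b) * b;
}
```

Aligning 10 to 4 gives 12 with either helper; only round_up_to() is meaningful for a boundary such as 6, because the mask trick in align_up() relies on b having a single bit set.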
Re: [RESEND RFC PATCH v2 1/1] mm/vmalloc.c: simplify /proc/vmallocinfo implementation
On Wed 12-10-16 16:23:01, zijun_hu wrote: > From: zijun_hu > > many seq_file helpers exist for simplifying implementation of virtual files > especially, for /proc nodes. however, the helpers for iteration over > list_head are available but aren't adopted to implement /proc/vmallocinfo > currently. > > simplify /proc/vmallocinfo implementation by existing seq_file helpers the simplification is nice and code duplication removal useful > Signed-off-by: zijun_hu Acked-by: Michal Hocko Thanks! > --- > Changes in v2: > - the redundant type cast is removed as advised by rient...@google.com > - commit messages are updated > > mm/vmalloc.c | 27 +-- > 1 file changed, 5 insertions(+), 22 deletions(-) > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index f2481cb4e6b2..e73948afac70 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -2574,32 +2574,13 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int > nr_vms) > static void *s_start(struct seq_file *m, loff_t *pos) > __acquires(&vmap_area_lock) > { > - loff_t n = *pos; > - struct vmap_area *va; > - > spin_lock(&vmap_area_lock); > - va = list_first_entry(&vmap_area_list, typeof(*va), list); > - while (n > 0 && &va->list != &vmap_area_list) { > - n--; > - va = list_next_entry(va, list); > - } > - if (!n && &va->list != &vmap_area_list) > - return va; > - > - return NULL; > - > + return seq_list_start(&vmap_area_list, *pos); > } > > static void *s_next(struct seq_file *m, void *p, loff_t *pos) > { > - struct vmap_area *va = p, *next; > - > - ++*pos; > - next = list_next_entry(va, list); > - if (&next->list != &vmap_area_list) > - return next; > - > - return NULL; > + return seq_list_next(p, &vmap_area_list, pos); > } > > static void s_stop(struct seq_file *m, void *p) > @@ -2634,9 +2615,11 @@ static void show_numa_info(struct seq_file *m, struct > vm_struct *v) > > static int s_show(struct seq_file *m, void *p) > { > - struct vmap_area *va = p; > + struct vmap_area *va; > struct vm_struct *v; > > + va = list_entry(p, struct 
vmap_area, list); > + > /* >* s_show can encounter race with remove_vm_area, !VM_VM_AREA on >* behalf of vmap area is being tear down or vm_map_ram allocation. > -- > 1.9.1 -- Michal Hocko SUSE Labs
MPOL_BIND on memory only nodes
Hi, We have the following function policy_zonelist() which selects a zonelist during various allocation paths. With this, general user space allocations (which, IIUC, might not have __GFP_THISNODE set) fail while trying to get memory from a memory-only node without CPUs, as the application runs somewhere else and that node is not part of the nodemask. Why do we insist on __GFP_THISNODE? With any memory-only node it's likely that the local node "nd" is not part of the nodemask, so would it make sense to pick the first node of the nodemask in those cases without checking for __GFP_THISNODE? /* Return a zonelist indicated by gfp for node representing a mempolicy */ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy, int nd) { switch (policy->mode) { case MPOL_PREFERRED: if (!(policy->flags & MPOL_F_LOCAL)) nd = policy->v.preferred_node; break; case MPOL_BIND: /* * Normally, MPOL_BIND allocations are node-local within the * allowed nodemask. However, if __GFP_THISNODE is set and the * current node isn't part of the mask, we use the zonelist for * the first node in the mask instead. */ if (unlikely(gfp & __GFP_THISNODE) && unlikely(!node_isset(nd, policy->v.nodes))) nd = first_node(policy->v.nodes); break; default: BUG(); } return node_zonelist(nd, gfp); } - Anshuman
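To make the question concrete, here is a toy userspace model of just the MPOL_BIND node choice, with a 64-bit integer standing in for the nodemask. All names are hypothetical, and the model deliberately ignores everything else the allocator does with the zonelist and nodemask afterwards. It only contrasts the quoted behaviour with the variant being asked about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy nodemask: bit n set means node n is allowed by the policy. */
typedef uint64_t nodemask_model_t;

static bool model_node_isset(int nd, nodemask_model_t mask)
{
	return (mask >> nd) & 1u;
}

/* Index of the lowest set bit; the mask must not be empty. */
static int model_first_node(nodemask_model_t mask)
{
	assert(mask != 0);

	int nd = 0;

	while (!((mask >> nd) & 1u))
		nd++;
	return nd;
}

/* Quoted behaviour: switch nodes only when __GFP_THISNODE is set. */
static int bind_node_current(int nd, nodemask_model_t allowed, bool thisnode)
{
	if (thisnode && !model_node_isset(nd, allowed))
		nd = model_first_node(allowed);
	return nd;
}

/* Variant asked about: switch whenever the local node is outside the mask. */
static int bind_node_any_gfp(int nd, nodemask_model_t allowed)
{
	if (!model_node_isset(nd, allowed))
		nd = model_first_node(allowed);
	return nd;
}
```

For a task running on node 0 and bound to a memory-only node 2 (mask 0x4), the current logic keeps nd = 0 unless __GFP_THISNODE is set, while the variant always starts from node 2.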
[PATCH v6 1/2] serial: xuartps: Add new compatible string for ZynqMP
This patch adds a new compatible string for the ZynqMP SoC. Signed-off-by: Nava kishore Manne --- Changes for v6: -Added New patch. drivers/tty/serial/xilinx_uartps.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/tty/serial/xilinx_uartps.c b/drivers/tty/serial/xilinx_uartps.c index f37edaa..dd4c02f 100644 --- a/drivers/tty/serial/xilinx_uartps.c +++ b/drivers/tty/serial/xilinx_uartps.c @@ -1200,6 +1200,7 @@ static int __init cdns_early_console_setup(struct earlycon_device *device, OF_EARLYCON_DECLARE(cdns, "xlnx,xuartps", cdns_early_console_setup); OF_EARLYCON_DECLARE(cdns, "cdns,uart-r1p8", cdns_early_console_setup); OF_EARLYCON_DECLARE(cdns, "cdns,uart-r1p12", cdns_early_console_setup); +OF_EARLYCON_DECLARE(cdns, "xlnx,zynqmp-uart", cdns_early_console_setup); /** * cdns_uart_console_write - perform write operation @@ -1438,6 +1439,7 @@ static const struct of_device_id cdns_uart_of_match[] = { { .compatible = "xlnx,xuartps", }, { .compatible = "cdns,uart-r1p8", }, { .compatible = "cdns,uart-r1p12", .data = &zynqmp_uart_def }, + { .compatible = "xlnx,zynqmp-uart", .data = &zynqmp_uart_def }, {} }; MODULE_DEVICE_TABLE(of, cdns_uart_of_match); -- 2.1.1
Re: [PATCH v8 2/2] clocksource: add J-Core timer/clocksource driver
On Tue, Oct 11, 2016 at 04:28:50PM -0400, Rich Felker wrote: > On Tue, Oct 11, 2016 at 08:18:12PM +0200, Daniel Lezcano wrote: > > > > Hi Rich, > > > > On Sun, Oct 09, 2016 at 05:34:22AM +, Rich Felker wrote: > > > At the hardware level, the J-Core PIT is integrated with the interrupt > > > controller, but it is represented as its own device and has an > > > independent programming interface. It provides a 12-bit countdown > > > timer, which is not presently used, and a periodic timer. The interval > > > length for the latter is programmable via a 32-bit throttle register > > > whose units are determined by a bus-period register. The periodic > > > timer is used to implement both periodic and oneshot clock event > > > modes; in oneshot mode the interrupt handler simply disables the timer > > > as soon as it fires. > > > > > > Despite its device tree node representing an interrupt for the PIT, > > > the actual irq generated is programmable, not hard-wired. The driver > > > is responsible for programming the PIT to generate the hardware irq > > > number that the DT assigns to it. > > > > > > On SMP configurations, J-Core provides cpu-local instances of the PIT; > > > no broadcast timer is needed. This driver supports the creation of the > > > necessary per-cpu clock_event_device instances. > > > > For my personnal information, why no broadcast timer is needed ? > > Broadcast timer is only needed if you don't have percpu local timers. > Early on in SMP development I actually tested with an ipi broadcast > timer and performance was noticably worse. Obviously. I thought there were another reason related to power management. > > Are the CPUs on always-on power down ? > > For now they are always on and don't even have the sleep instruction > (i.e. stop cpu clock until interrupt) implemented. 
Adding sleep will > be the first power-saving step, and perhaps the only one for now, > since there doesn't seem to be any indication (according to the ppl > working on the hardware) that a deeper sleep would provide significant > additional savings. Ok. However, the 'sleep' state is not, in the power management terminology, the idle state described above. It is called "clock gated" / "Wait for Interrupt". The 'sleep' state lose the CPU context. > > > A nanosecond-resolution clocksource is provided using the J-Core "RTC" > > > registers, which give a 64-bit seconds count and 32-bit nanoseconds > > > that wrap every second. The driver converts these to a full-range > > > 32-bit nanoseconds count. > > > > > > Signed-off-by: Rich Felker > > > --- > > > drivers/clocksource/Kconfig | 10 ++ > > > drivers/clocksource/Makefile| 1 + > > > drivers/clocksource/jcore-pit.c | 231 > > > > > > include/linux/cpuhotplug.h | 1 + > > > 4 files changed, 243 insertions(+) > > > create mode 100644 drivers/clocksource/jcore-pit.c > > > > > > diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig > > > index 5677886..95dd78b 100644 > > > --- a/drivers/clocksource/Kconfig > > > +++ b/drivers/clocksource/Kconfig > > > @@ -407,6 +407,16 @@ config SYS_SUPPORTS_SH_TMU > > > config SYS_SUPPORTS_EM_STI > > > bool > > > > > > +config CLKSRC_JCORE_PIT > > > + bool "J-Core PIT timer driver" > > > + depends on OF && (SUPERH || COMPILE_TEST) > > > > Actually the idea is to have the SUPERH to select this timer, not create > > a dependency on SUPERH from here. > > > > We don't want to prompt in the configuration menu the drivers because it > > would be impossible to anyone to know which timer comes with which > > hardware, so we let the platform to select the timer it needs. > > I thought we discussed this before. 
For users building a kernel for > legacy SH systems, especially in the current state where they're only > supported with hard-coded board files rather than device tree, it > makes no sense to build drivers for J-core hardware. It would make > sense to be on by default for CONFIG_SH_DEVICE_TREE with a compatible > CPU selection, but at least at this time, not for SUPERH in general. Probably I am missing the point but why the user would have to unselect this driver manually ? The user wants a config file nothing more or a very trivial option. Can you imagine someone can know every single IP block for each boards of the same arch and be able to disable/enable the right ones ? > Anyway I'd really like to do this non-invasively as long as we have a > mix of legacy and new stuff and the legacy stuff is not readily > testable. Once all of arch/sh is moved over to device tree, could we > revisit this and make all the drivers follow a common policy (on by > default if they're associated with boards/SoCs using a matching or > compatible CPU model, or something like that, but still able to be > disabled manually, since the user might be trying to get a tiny-ish > embedded kernel)? I understand the goal is to have one single configuration and ev
Re: [RFC] net: phy: smsc: Disable auto-negotiation on startup
On 10/10/2016 10:41 AM, Kyle Roeschley wrote: > Because the SMSC PHY completes auto-negotiation before the driver is > ready to handle interrupts, the PHY state machine never realizes that we > have a link. Clear the ANENABLE bit on initialization, which lets > genphy_config_aneg do its thing when that code is hit later. > > While this patch does fix the problem we see (no link on boot without > re-plugging the cable), it seems like the generic PHY code should be > able to handle auto-negotiation completing before interrupts are > enabled. Submitted as an RFC in the hopes that someone has an idea as to > how that could be done. > > This fix is copied from commit 99f81afc139c ("phy: micrel: Disable auto > negotiation on startup"). Do you mind trying: https://www.spinics.net/lists/netdev/msg397857.html and see if you do get link interrupts without your patch applied? Thanks! -- Florian
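The patch body is not quoted in this reply, so the following is only a sketch of what "clear the ANENABLE bit" amounts to on the value of the MII basic mode control register. The function name is made up; the bit value matches BMCR_ANENABLE in linux/mii.h. A driver would read the register with its PHY accessors, apply this mask and write the result back.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as BMCR_ANENABLE in linux/mii.h (auto-negotiation enable). */
#define MODEL_BMCR_ANENABLE 0x1000u

/* Return the BMCR value with auto-negotiation disabled, other bits untouched. */
static uint16_t model_disable_aneg(uint16_t bmcr)
{
	uint16_t out = (uint16_t)(bmcr & ~MODEL_BMCR_ANENABLE);

	assert(!(out & MODEL_BMCR_ANENABLE));
	return out;
}
```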
Re: MPOL_BIND on memory only nodes
On Wed 12-10-16 14:55:24, Anshuman Khandual wrote: > Hi, > > We have the following function policy_zonelist() which selects a zonelist > during various allocation paths. With this, general user space allocations > (IIUC might not have __GFP_THISNODE) fails while trying to get memory from > a memory only node without CPUs as the application runs some where else > and that node is not part of the nodemask. I am not sure I understand. So you have a task with MPOL_BIND without a cpu less node in the mask and you are wondering why the memory is not allocated from that node? > Why we insist on __GFP_THISNODE ? AFAIU __GFP_THISNODE just overrides the given node to the policy nodemask in case the current node is not part of that node mask. In other words we are ignoring the given node and use what the policy says. I can see how this can be confusing especially when confronting the documentation: * __GFP_THISNODE forces the allocation to be satisified from the requested * node with no fallbacks or placement policy enforcements. -- Michal Hocko SUSE Labs
Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area
On Wed 12-10-16 16:44:31, zijun_hu wrote: > On 10/12/2016 04:25 PM, Michal Hocko wrote: > > On Wed 12-10-16 15:24:33, zijun_hu wrote: [...] > >> i found the following code segments in mm/vmalloc.c > >> static struct vmap_area *alloc_vmap_area(unsigned long size, > >> unsigned long align, > >> unsigned long vstart, unsigned long vend, > >> int node, gfp_t gfp_mask) > >> { > >> ... > >> > >> BUG_ON(!size); > >> BUG_ON(offset_in_page(size)); > >> BUG_ON(!is_power_of_2(align)); > > > > See a recent Linus rant about BUG_ONs. These BUG_ONs are quite old and > > from a quick look they are even unnecessary. So rather than adding more > > of those, I think removing those that are not needed is much more > > preferred. > > > i notice that, and the above code segments is used to illustrate that > input parameter checking is necessary sometimes Why do you think it is necessary here? -- Michal Hocko SUSE Labs
drm/i915: WARN_ON_ONCE(!crtc_clock || cdclk < crtc_clock)
On a laptop that tracks the latest stable release (Ie, it now runs v4.8.1) I see this WARNING WARN_ON_ONCE(!crtc_clock || cdclk < crtc_clock) Full trace pasted below. I never saw this WARNING before v4.8. Since v4.8 I've had it in all (four, actually) boots. What am I expected to do about this WARNING? Thanks, Paul Bolle WARNING: CPU: 3 PID: 1368 at drivers/gpu/drm/i915/intel_display.c:14178 skl_max_scale.part.120+0x75/0x80 [i915] WARN_ON_ONCE(!crtc_clock || cdclk < crtc_clock) Modules linked in: rfcomm fuse nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 cmac nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables bnep vfat fat arc4 snd_hda_codec_hdmi snd_soc_skl dell_led snd_soc_skl_ipc snd_soc_sst_ipc snd_soc_sst_dsp snd_hda_ext_core snd_soc_sst_match snd_soc_core intel_rapl snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal coretemp kvm_intel snd_compress snd_pcm_dmaengine ac97_bus kvm snd_hda_intel iwlmvm snd_hda_codec mac80211 iTCO_wdt iTCO_vendor_support uvcvideo snd_hda_core snd_hwdep snd_seq irqbypass dell_laptop i2c_designware_platform i2c_designware_core dell_wmi crct10dif_pclmul dell_smbios dcdbas crc32_pclmul snd_seq_device iwlwifi videobuf2_vmalloc videobuf2_memops ghash_clmulni_intel snd_pcm videobuf2_v4l2 videobuf2_core cfg80211 videodev media joydev pcspkr mei_me rtsx_pci_ms memstick snd_timer i2c_i801 i2c_smbus mei snd btusb soundcore shpchp hci_uart btrtl btbcm btqca idma64 btintel bluetooth intel_pch_thermal processor_thermal_device intel_lpss_pci intel_soc_dts_iosf wmi pinctrl_sunrisepoint intel_lpss_acpi rfkill pinctrl_intel intel_lpss int3400_thermal acpi_als int3403_thermal int340x_thermal_zone 
kfifo_buf acpi_thermal_rel intel_hid industrialio sparse_keymap acpi_pad tpm_tis tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc hid_multitouch i915 rtsx_pci_sdmmc mmc_core i2c_algo_bit drm_kms_helper crc32c_intel drm serio_raw nvme rtsx_pci nvme_core i2c_hid video fjes CPU: 3 PID: 1368 Comm: Xorg Not tainted 4.8.1-1.local1.fc24.x86_64 #1 Hardware name: Dell Inc. XPS 13 9350/09JHRY, BIOS 1.4.4 06/14/2016 0286 df2f374c a31528d53910 b83e5cfd a31528d53960 a31528d53950 b80a7d5b 3762c72b3010 a3151e4d8cc0 a31526c23800 a31526e6 Call Trace: [] dump_stack+0x63/0x86 [] __warn+0xcb/0xf0 [] warn_slowpath_fmt+0x5f/0x80 [] ? sort+0x147/0x220 [] ? drm_atomic_helper_normalize_zpos+0x264/0x300 [drm_kms_helper] [] skl_max_scale.part.120+0x75/0x80 [i915] [] intel_check_primary_plane+0xc6/0xe0 [i915] [] ? drm_atomic_helper_normalize_zpos+0x264/0x300 [drm_kms_helper] [] intel_plane_atomic_check+0x132/0x1f0 [i915] [] drm_atomic_helper_check_planes+0x84/0x200 [drm_kms_helper] [] intel_atomic_check+0x9a7/0x11a0 [i915] [] ? __kmalloc_track_caller+0x17a/0x210 [] drm_atomic_check_only+0x187/0x610 [drm] [] ? drm_atomic_get_crtc_state+0x88/0x100 [drm] [] drm_atomic_commit+0x17/0x60 [drm] [] drm_atomic_helper_update_plane+0xec/0x130 [drm_kms_helper] [] __setplane_internal+0x22b/0x270 [drm] [] drm_mode_cursor_universal+0x139/0x240 [drm] [] drm_mode_cursor_common+0x7e/0x180 [drm] [] drm_mode_cursor2_ioctl+0xe/0x10 [drm] [] drm_ioctl+0x1da/0x4b0 [drm] [] ? drm_mode_cursor_ioctl+0x70/0x70 [drm] [] ? enqueue_hrtimer+0x3d/0x80 [] do_vfs_ioctl+0xa3/0x5e0 [] ? __sys_recvmsg+0x51/0x90 [] SyS_ioctl+0x79/0x90 [] entry_SYSCALL_64_fastpath+0x1a/0xa4
[PATCH] mm: kmemleak: Ensure that the task stack is not freed during scanning
Commit 68f24b08ee89 ("sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK") may cause the task->stack to be freed during kmemleak_scan() execution, leading to either a NULL pointer fault (if task->stack is NULL) or kmemleak accessing already freed memory. This patch uses the new try_get_task_stack() API to ensure that the task stack is not freed during kmemleak stack scanning. Fixes: 68f24b08ee89 ("sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK") Cc: Andrew Morton Cc: Andy Lutomirski Cc: CAI Qian Reported-by: CAI Qian Signed-off-by: Catalin Marinas --- This was reported in a subsequent comment here: https://bugzilla.kernel.org/show_bug.cgi?id=173901 However, the original bugzilla entry doesn't look related to task stack freeing as it was first reported on 4.8-rc8. Andy, sorry for cc'ing you to bugzilla, please feel free to remove your email from the bug above (I can't seem to be able to do it). mm/kmemleak.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/mm/kmemleak.c b/mm/kmemleak.c index a5e453cf05c4..e5355a5b423f 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -1453,8 +1453,11 @@ static void kmemleak_scan(void) read_lock(&tasklist_lock); do_each_thread(g, p) { - scan_block(task_stack_page(p), task_stack_page(p) + - THREAD_SIZE, NULL); + void *stack = try_get_task_stack(p); + if (stack) { + scan_block(stack, stack + THREAD_SIZE, NULL); + put_task_stack(p); + } } while_each_thread(g, p); read_unlock(&tasklist_lock); }
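The fix relies on a "try-get" pattern: take a reference only if the object is still alive, use it, then drop the reference. Below is a user-space sketch of that pattern with C11 atomics. It is an illustration, not the kernel implementation; `struct model_task` and the `model_` functions are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a task whose stack is reference counted. */
struct model_task {
	atomic_int stack_refs;  /* 0 means the stack has already been released */
	void *stack;
};

/* Take a reference only if at least one is still held; otherwise return NULL. */
static void *model_try_get_stack(struct model_task *t)
{
	int refs = atomic_load(&t->stack_refs);

	while (refs > 0) {
		/* On failure refs is reloaded with the current count and rechecked. */
		if (atomic_compare_exchange_weak(&t->stack_refs, &refs, refs + 1))
			return t->stack;
	}
	return NULL;
}

/* Drop a reference taken earlier. */
static void model_put_stack(struct model_task *t)
{
	int old = atomic_fetch_sub(&t->stack_refs, 1);

	assert(old > 0);
	/* A real implementation would free the stack when old == 1. */
}
```

In the patched kmemleak_scan() the scan_block() call sits between the get and the put, so the stack cannot be freed while it is being scanned, and a task whose stack is already gone is skipped.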
Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area
On 10/12/2016 05:54 PM, Michal Hocko wrote: > On Wed 12-10-16 16:44:31, zijun_hu wrote: >> On 10/12/2016 04:25 PM, Michal Hocko wrote: >>> On Wed 12-10-16 15:24:33, zijun_hu wrote: > [...] >>>> i found the following code segments in mm/vmalloc.c >>>> static struct vmap_area *alloc_vmap_area(unsigned long size, >>>> unsigned long align, >>>> unsigned long vstart, unsigned long vend, >>>> int node, gfp_t gfp_mask) >>>> { >>>> ... >>>> >>>> BUG_ON(!size); >>>> BUG_ON(offset_in_page(size)); >>>> BUG_ON(!is_power_of_2(align)); >>> >>> See a recent Linus rant about BUG_ONs. These BUG_ONs are quite old and >>> from a quick look they are even unnecessary. So rather than adding more >>> of those, I think removing those that are not needed is much more >>> preferred. >>> >> i notice that, and the above code segments is used to illustrate that >> input parameter checking is necessary sometimes > > Why do you think it is necessary here? > I am sorry for the late reply. I don't know whether it is necessary; I just find that there are so many sanity checks in the current internal interfaces.
Re: [PATCH v3 08/11] powerpc/tracing: fix compat syscall handling
Marcin Nowakowski writes: > Adapt the code to make use of new syscall handling interface > > Signed-off-by: Marcin Nowakowski > Cc: Steven Rostedt > Cc: Ingo Molnar > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Michael Ellerman > Cc: linuxppc-...@lists.ozlabs.org > --- > arch/powerpc/include/asm/ftrace.h | 11 +++ > arch/powerpc/kernel/ftrace.c | 4 I went to test this and noticed the exit and enter events appear to be reversed in time? (your series on top of 24532f768121) ls-4221 [003] 83.766113: compat_sys_rt_sigprocmask -> 0x2 ls-4221 [003] 83.766137: compat_sys_rt_sigprocmask(how: 2, nset: 1010db30, oset: 0, sigsetsize: 8) ls-4221 [003] 83.766175: compat_sys_rt_sigaction -> 0x14 ls-4221 [003] 83.766175: compat_sys_rt_sigaction(sig: 14, act: ffbd33c4, oact: ffbd3338, sigsetsize: 8) ls-4221 [003] 83.766177: compat_sys_rt_sigaction -> 0x15 ls-4221 [003] 83.766177: compat_sys_rt_sigaction(sig: 15, act: ffbd33c4, oact: ffbd3338, sigsetsize: 8) ls-4221 [003] 83.766178: compat_sys_rt_sigaction -> 0x16 ls-4221 [003] 83.766178: compat_sys_rt_sigaction(sig: 16, act: ffbd33d4, oact: ffbd3348, sigsetsize: 8) ls-4221 [003] 83.766179: sys_setpgid -> 0x107d ls-4221 [003] 83.766179: sys_setpgid(pid: 107d, pgid: 107d) ls-4221 [003] 83.766180: compat_sys_rt_sigprocmask -> 0x0 ls-4221 [003] 83.766181: compat_sys_rt_sigprocmask(how: 0, nset: ffbd34b0, oset: ffbd3530, sigsetsize: 8) ls-4221 [003] 83.766186: compat_sys_ioctl -> 0xff ls-4221 [003] 83.766187: compat_sys_ioctl(fd: ff, cmd: 80047476, arg32: ffbd3488) ls-4221 [003] 83.766188: compat_sys_rt_sigprocmask -> 0x2 ls-4221 [003] 83.766189: compat_sys_rt_sigprocmask(how: 2, nset: ffbd3530, oset: 0, sigsetsize: 8) ls-4221 [003] 83.766189: sys_close -> 0x4 ls-4221 [003] 83.766190: sys_close(fd: 4) ls-4221 [003] 83.766191: sys_read -> 0x3 ls-4221 [003] 83.766191: sys_read(fd: 3, buf: ffbd35dc, count: 1) ls-4221 [003] 83.766235: sys_close -> 0x3 ls-4221 [003] 83.766235: sys_close(fd: 3) cheers
[PATCH] drm/bridge: analogix: protect power when get_modes or detect
The drm callbacks ->detect and ->get_modes do not appear to be power safe: they may be called while the device is powered off, and accessing registers in detect or get_modes at that point will cause the system to die. Here is the call path that reaches ->detect before analogix_dp is powered on: [] analogix_dp_detect+0x44/0xdc [] drm_helper_probe_single_connector_modes_merge_bits+0xe8/0x41c [] drm_helper_probe_single_connector_modes+0x10/0x18 [] drm_mode_getconnector+0xf4/0x304 [] drm_ioctl+0x23c/0x390 [] do_vfs_ioctl+0x4b8/0x58c [] SyS_ioctl+0x60/0x88 Cc: Inki Dae Cc: Sean Paul Cc: Gustavo Padovan Cc: "Ville Syrjälä" Signed-off-by: Mark Yao --- drivers/gpu/drm/bridge/analogix/analogix_dp_core.c | 28 ++ 1 file changed, 28 insertions(+) diff --git a/drivers/gpu/drm/bridge/analogix/analogix_dp_core.c b/drivers/gpu/drm/bridge/analogix/analogix_dp_core.c index efac8ab..09dece2 100644 --- a/drivers/gpu/drm/bridge/analogix/analogix_dp_core.c +++ b/drivers/gpu/drm/bridge/analogix/analogix_dp_core.c @@ -1062,6 +1062,13 @@ int analogix_dp_get_modes(struct drm_connector *connector) return 0; } + if (dp->dpms_mode != DRM_MODE_DPMS_ON) { + pm_runtime_get_sync(dp->dev); + + if (dp->plat_data->power_on) + dp->plat_data->power_on(dp->plat_data); + } + if (analogix_dp_handle_edid(dp) == 0) { drm_mode_connector_update_edid_property(&dp->connector, edid); num_modes += drm_add_edid_modes(&dp->connector, edid); @@ -1073,6 +1080,13 @@ int analogix_dp_get_modes(struct drm_connector *connector) if (dp->plat_data->get_modes) num_modes += dp->plat_data->get_modes(dp->plat_data, connector); + if (dp->dpms_mode != DRM_MODE_DPMS_ON) { + if (dp->plat_data->power_off) + dp->plat_data->power_off(dp->plat_data); + + pm_runtime_put_sync(dp->dev); + } + ret = analogix_dp_prepare_panel(dp, false, false); if (ret) DRM_ERROR("Failed to unprepare panel (%d)\n", ret); @@ -1106,9 +1120,23 @@ analogix_dp_detect(struct drm_connector *connector, bool force) return connector_status_disconnected; } + if (dp->dpms_mode != DRM_MODE_DPMS_ON) { +
pm_runtime_get_sync(dp->dev); + + if (dp->plat_data->power_on) + dp->plat_data->power_on(dp->plat_data); + } + if (!analogix_dp_detect_hpd(dp)) status = connector_status_connected; + if (dp->dpms_mode != DRM_MODE_DPMS_ON) { + if (dp->plat_data->power_off) + dp->plat_data->power_off(dp->plat_data); + + pm_runtime_put_sync(dp->dev); + } + ret = analogix_dp_prepare_panel(dp, false, false); if (ret) DRM_ERROR("Failed to unprepare panel (%d)\n", ret); -- 1.9.1
Re: [PATCH v3 07/11] arm64/tracing: fix compat syscall handling
On Wed, Oct 12, 2016 at 09:07:03AM +0200, Marcin Nowakowski wrote: > On 11.10.2016 15:36, Will Deacon wrote: > >On Tue, Oct 11, 2016 at 12:42:52PM +0200, Marcin Nowakowski wrote: > >>diff --git a/arch/arm64/include/asm/unistd.h > >>b/arch/arm64/include/asm/unistd.h > >>index e78ac26..276d049 100644 > >>--- a/arch/arm64/include/asm/unistd.h > >>+++ b/arch/arm64/include/asm/unistd.h > >>@@ -45,6 +45,7 @@ > >> #define __ARM_NR_compat_set_tls(__ARM_NR_COMPAT_BASE+5) > >> > >> #define __NR_compat_syscalls 394 > >>+#define NR_compat_syscalls (__NR_compat_syscalls) > > > >We may as well just define NR_compat_syscalls instead of > >__NR_compat_syscalls and move the handful of users over. > > I had tried to minimise the amount of arch-specific changes here - > especially those that are not directly related to the proposed syscall > handling change. But I agree having these 2 #defines is a bit unnecessary There's only three users of __NR_compat_syscalls, so I think you can move them over. > >>diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c > >>index 40ad08a..75d010f 100644 > >>--- a/arch/arm64/kernel/ftrace.c > >>+++ b/arch/arm64/kernel/ftrace.c > >>@@ -176,4 +176,20 @@ int ftrace_disable_ftrace_graph_caller(void) > >>return ftrace_modify_graph_caller(false); > >> } > >> #endif /* CONFIG_DYNAMIC_FTRACE */ > >>+ > >> #endif /* CONFIG_FUNCTION_GRAPH_TRACER */ > >>+ > >>+#if (defined CONFIG_FTRACE_SYSCALLS) && (defined CONFIG_COMPAT) > >>+ > >>+extern const void *sys_call_table[]; > >>+extern const void *compat_sys_call_table[]; > >>+ > >>+unsigned long __init arch_syscall_addr(int nr, bool compat) > >>+{ > >>+ if (compat) > >>+ return (unsigned long)compat_sys_call_table[nr]; > >>+ > >>+ return (unsigned long)sys_call_table[nr]; > >>+} > > > >Do we care about the compat private syscalls (from base 0x0f)? We > >need to make sure that we exhibit the same behaviour as a native > >32-bit ARM machine. 
> > > >Will > > Tracing of such syscalls has been disabled for a long time (see > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=086ba77a6db0). > Apart from using non-contiguous numbers, they are not defined using standard > SYSCALL macros, so they do not have any metadata generated either. > My suggestion is that if you wanted those to be included in the trace then > it should be done separately from these changes. Fine by me -- I just wanted to make sure our compat behaviour matched the behaviour of native arch/arm/. It sounds like it does, so no need to change anything here. Acked-by: Will Deacon Will
Re: [PATCH v2] z3fold: add shrinker
On Wed, 12 Oct 2016 09:52:06 +1100 Dave Chinner wrote: > > > +static unsigned long z3fold_shrink_scan(struct shrinker *shrink, > > + struct shrink_control *sc) > > +{ > > + struct z3fold_pool *pool = container_of(shrink, struct z3fold_pool, > > + shrinker); > > + struct z3fold_header *zhdr; > > + int i, nr_to_scan = sc->nr_to_scan; > > + > > + spin_lock(&pool->lock); > > Do not do this. Shrinkers should not run entirely under a spin lock > like this - it causes scheduling latency problems and when the > shrinker is run concurrently on different CPUs it will simply burn > CPU doing no useful work. Especially, in this case, as each call to > z3fold_compact_page() may be copying a significant amount of data > around and so there is potentially a /lot/ of work being done on > each call to the shrinker. > > If you need compaction exclusion for the shrinker invocation, then > please use a sleeping lock to protect the compaction work. Well, as far as I recall, spin_lock() will resolve to a sleeping lock for PREEMPT_RT, so it is not that much of a problem for configurations which do care much about latencies. Please also note that the time spent in the loop is deterministic since we take not more than one entry from every unbuddied list. What I could do though is add the following piece of code at the end of the loop, right after the /break/: spin_unlock(&pool->lock); cond_resched(); spin_lock(&pool->lock); Would that make sense for you? > > > */ > > @@ -234,6 +335,13 @@ static struct z3fold_pool *z3fold_create_pool(gfp_t > > gfp, > > INIT_LIST_HEAD(&pool->unbuddied[i]); > > INIT_LIST_HEAD(&pool->buddied); > > INIT_LIST_HEAD(&pool->lru); > > + pool->shrinker.count_objects = z3fold_shrink_count; > > + pool->shrinker.scan_objects = z3fold_shrink_scan; > > + pool->shrinker.seeks = DEFAULT_SEEKS; > > + if (register_shrinker(&pool->shrinker)) { > > + pr_warn("z3fold: could not register shrinker\n"); > > + pool->no_shrinker = true; > > + } > > Just fail creation of the pool. 
If you can't register a shrinker, > then much bigger problems are about to happen to your system, and > running a new memory consumer that /can't be shrunk/ is not going to > help anyone. I don't have a strong opinion on this but it doesn't look fatal to me in _this_ particular case (z3fold) since even without the shrinker, the compression ratio will never be lower than the one of zbud, which doesn't have a shrinker at all. Best regards, Vitaly
[PATCH v1 3/4] Add hwcap2 for x86
Add hwcap2 attribute for x86. Reserve 1st bit of HWCAP2 for exposing Xeon Phi ring 3 monitor/mwait. With this userspace apps can detect Ring 3 MONITOR/MWAIT instructions. Change-Id: I37d0354d1e2b9594d7feebc2bacda30b68163efe Signed-off-by: Grzegorz Andrejczuk --- arch/x86/include/asm/elf.h| 7 +++ arch/x86/include/uapi/asm/hwcap.h | 7 +++ arch/x86/kernel/cpu/common.c | 3 +++ 3 files changed, 17 insertions(+) create mode 100644 arch/x86/include/uapi/asm/hwcap.h diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h index e7f155c..62d060a 100644 --- a/arch/x86/include/asm/elf.h +++ b/arch/x86/include/asm/elf.h @@ -258,6 +258,13 @@ extern int force_personality32; #define ELF_HWCAP (boot_cpu_data.x86_capability[CPUID_1_EDX]) +extern unsigned int elf_hwcap2 + +/* HWCAP2 supplies kernel enabled CPU feature, so that the application + can know that it can safely use them. The bits are defined in + uapi/asm/hwcap.h. */ +#define ELF_HWCAP2 elf_hwcap2 + /* This yields a string that ld.so will use to load implementation specific libraries for optimization. This is more specific in intent than poking at uname or /proc/cpuinfo. diff --git a/arch/x86/include/uapi/asm/hwcap.h b/arch/x86/include/uapi/asm/hwcap.h new file mode 100644 index 000..d1f4f98 --- /dev/null +++ b/arch/x86/include/uapi/asm/hwcap.h @@ -0,0 +1,7 @@ +#ifndef _ASM_HWCAP_H +#define _ASM_HWCAP_H 1 + +/* Kernel enabled Ring 3 MWAIT for Xeon Phi*/ +#define HWCAP2_PHIR3MWAIT (1 << 0) +/* upto bit 31 free */ +#endif diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index bcc9ccc..93ffaa5 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -51,6 +52,8 @@ #include "cpu.h" +unsigned elf_hwcap2 __read_mostly; + /* all of these masks are initialized in setup_cpu_local_masks() */ cpumask_var_t cpu_initialized_mask; cpumask_var_t cpu_callout_mask; -- 2.5.1
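As an illustration of how a userspace application could consume such a bit, here is a sketch built on getauxval(AT_HWCAP2). It assumes the bit position proposed in this patch (bit 0). The macro and function names are made up for the example and are not taken from any released header.

```c
#include <assert.h>
#include <sys/auxv.h>

/* Bit value as proposed in the patch above; an assumption, not a released ABI. */
#define MODEL_HWCAP2_PHIR3MWAIT (1UL << 0)

/* Pure helper: does this AT_HWCAP2 value advertise ring 3 MONITOR/MWAIT? */
static int model_has_ring3_mwait(unsigned long hwcap2)
{
	return (hwcap2 & MODEL_HWCAP2_PHIR3MWAIT) != 0;
}

/* Ask the running kernel; getauxval() returns 0 if AT_HWCAP2 is not provided. */
static int model_ring3_mwait_available(void)
{
	return model_has_ring3_mwait(getauxval(AT_HWCAP2));
}
```

An application would call model_ring3_mwait_available() once at startup and fall back to another wait mechanism when it returns 0.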
[RESEND RFC PATCH v2 1/1] mm/vmalloc.c: simplify /proc/vmallocinfo implementation
From: zijun_hu many seq_file helpers exist for simplifying implementation of virtual files especially, for /proc nodes. however, the helpers for iteration over list_head are available but aren't adopted to implement /proc/vmallocinfo currently. simplify /proc/vmallocinfo implementation by existing seq_file helpers Signed-off-by: zijun_hu --- Changes in v2: - the redundant type cast is removed as advised by rient...@google.com - commit messages are updated mm/vmalloc.c | 27 +-- 1 file changed, 5 insertions(+), 22 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index f2481cb4e6b2..e73948afac70 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2574,32 +2574,13 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) static void *s_start(struct seq_file *m, loff_t *pos) __acquires(&vmap_area_lock) { - loff_t n = *pos; - struct vmap_area *va; - spin_lock(&vmap_area_lock); - va = list_first_entry(&vmap_area_list, typeof(*va), list); - while (n > 0 && &va->list != &vmap_area_list) { - n--; - va = list_next_entry(va, list); - } - if (!n && &va->list != &vmap_area_list) - return va; - - return NULL; - + return seq_list_start(&vmap_area_list, *pos); } static void *s_next(struct seq_file *m, void *p, loff_t *pos) { - struct vmap_area *va = p, *next; - - ++*pos; - next = list_next_entry(va, list); - if (&next->list != &vmap_area_list) - return next; - - return NULL; + return seq_list_next(p, &vmap_area_list, pos); } static void s_stop(struct seq_file *m, void *p) @@ -2634,9 +2615,11 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v) static int s_show(struct seq_file *m, void *p) { - struct vmap_area *va = p; + struct vmap_area *va; struct vm_struct *v; + va = list_entry(p, struct vmap_area, list); + /* * s_show can encounter race with remove_vm_area, !VM_VM_AREA on * behalf of vmap area is being tear down or vm_map_ram allocation. -- 1.9.1
Re: Intermittent perf build failures
On Tue, Oct 11, 2016 at 02:18:49PM -0700, Laura Abbott wrote: > On 10/11/2016 01:59 PM, Jiri Olsa wrote: > > On Tue, Oct 11, 2016 at 01:43:36PM -0700, Laura Abbott wrote: > > > Hi, > > > > > > While building today's Fedora rawhide kernel, there was a failure > > > building perf with -j4 [1]: ok, the -j 4 is the problem. Running "make -j 4 install-bin install-traceevent-plugins" BUILD: Doing 'make -j4' parallel build BUILD: Doing 'make -j4' parallel build will run parallel make instances for install-bin and install-traceevent-plugins, which will eventually touch the same files and crash. The main perf Makefile actually detects the number of CPUs and runs Makefile.perf with the -j X option, so there's no need to specify it at the top level. You can always customize it via the JOBS=X make variable. So if you don't specify the -j X option, it will run 'Makefile.perf install-bin install-traceevent-plugins' with -j X set, and it should execute sequentially and fix your problem. thanks, jirka
[PATCH v1 0/4] Enabling Ring 3 MONITOR/MWAIT feature for Knights Landing
These patches enable the Intel Xeon Phi x200 feature that allows the MONITOR/MWAIT instructions to be used in ring 3 (userspace). The patches set MSR 0x140 for all logical CPUs, then expose it as a CPU feature and introduce an ELF HWCAP capability for x86. Reference: https://software.intel.com/en-us/blogs/2016/10/06/intel-xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait Grzegorz Andrejczuk (4): Add R3MWAIT register and bit to msr-info.h Add enabling of the R3 MWAIT during boot for KNL Add hwcap2 for x86 Add R3MWAIT to CPU features arch/x86/include/asm/cpufeature.h| 6 -- arch/x86/include/asm/cpufeatures.h | 6 +- arch/x86/include/asm/disabled-features.h | 3 ++- arch/x86/include/asm/elf.h | 7 +++ arch/x86/include/asm/msr-index.h | 5 + arch/x86/include/asm/required-features.h | 3 ++- arch/x86/include/uapi/asm/hwcap.h| 7 +++ arch/x86/kernel/cpu/common.c | 6 ++ arch/x86/kernel/cpu/intel.c | 27 +++ 9 files changed, 65 insertions(+), 5 deletions(-) create mode 100644 arch/x86/include/uapi/asm/hwcap.h -- 2.5.1
[PATCH v1 4/4] Add R3MWAIT to CPU features
Add a CPU feature flag for ring 3 MONITOR/MWAIT.

Change-Id: Iba4d20639efd8d3637d37db9294cbc43a98f009a
Signed-off-by: Grzegorz Andrejczuk
---
 arch/x86/include/asm/cpufeature.h        | 6 --
 arch/x86/include/asm/cpufeatures.h       | 6 +-
 arch/x86/include/asm/disabled-features.h | 3 ++-
 arch/x86/include/asm/required-features.h | 3 ++-
 arch/x86/kernel/cpu/common.c             | 3 +++
 arch/x86/kernel/cpu/intel.c              | 1 +
 6 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 1d2b69f..1baa1df 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -78,8 +78,9 @@ extern const char * const x86_bug_flags[NBUGINTS*32];
 	   CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 15, feature_bit) ||	\
 	   CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 16, feature_bit) ||	\
 	   CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 17, feature_bit) ||	\
+	   CHECK_BIT_IN_MASK_WORD(REQUIRED_MASK, 18, feature_bit) ||	\
 	   REQUIRED_MASK_CHECK					  ||	\
-	   BUILD_BUG_ON_ZERO(NCAPINTS != 18))
+	   BUILD_BUG_ON_ZERO(NCAPINTS != 19))
 
 #define DISABLED_MASK_BIT_SET(feature_bit)				\
 	 ( CHECK_BIT_IN_MASK_WORD(DISABLED_MASK,  0, feature_bit) ||	\
@@ -100,8 +101,9 @@ extern const char * const x86_bug_flags[NBUGINTS*32];
 	   CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 15, feature_bit) ||	\
 	   CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 16, feature_bit) ||	\
 	   CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 17, feature_bit) ||	\
+	   CHECK_BIT_IN_MASK_WORD(DISABLED_MASK, 18, feature_bit) ||	\
 	   DISABLED_MASK_CHECK					  ||	\
-	   BUILD_BUG_ON_ZERO(NCAPINTS != 18))
+	   BUILD_BUG_ON_ZERO(NCAPINTS != 19))
 
 #define cpu_has(c, bit)							\
 	(__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 :	\
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 92a8308..242cd16 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -12,7 +12,7 @@
 /*
  * Defines x86 CPU feature bits
  */
-#define NCAPINTS	18	/* N 32-bit words worth of info */
+#define NCAPINTS	19	/* N 32-bit words worth of info */
 #define NBUGINTS	1	/* N 32-bit bug flags */
 
 /*
@@ -286,6 +286,10 @@
 #define X86_FEATURE_SUCCOR	(17*32+1) /* Uncorrectable error containment and recovery */
 #define X86_FEATURE_SMCA	(17*32+3) /* Scalable MCA */
 
+
+/* non architectural Intel-defined CPU features not present in CPUID, word 18 */
+#define X86_FEATURE_PHIR3MWAIT	(18*32+ 0)
+
 /*
  * BUG word(s)
  */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 85599ad..8b45e08 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -57,6 +57,7 @@
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE)
 #define DISABLED_MASK17	0
-#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)
+#define DISABLED_MASK18	0
+#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
 
 #endif /* _ASM_X86_DISABLED_FEATURES_H */
diff --git a/arch/x86/include/asm/required-features.h b/arch/x86/include/asm/required-features.h
index fac9a5c..6847d85 100644
--- a/arch/x86/include/asm/required-features.h
+++ b/arch/x86/include/asm/required-features.h
@@ -100,6 +100,7 @@
 #define REQUIRED_MASK15	0
 #define REQUIRED_MASK16	0
 #define REQUIRED_MASK17	0
-#define REQUIRED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)
+#define REQUIRED_MASK18	0
+#define REQUIRED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
 
 #endif /* _ASM_X86_REQUIRED_FEATURES_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 93ffaa5..15fe27f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1108,6 +1108,9 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 #endif
 	/* The boot/hotplug time assigment got cleared, restore it */
 	c->logical_proc_id = topology_phys_to_logical_pkg(c->phys_proc_id);
+
+	if (cpu_has(c, X86_FEATURE_PHIR3MWAIT))
+		elf_hwcap2 |= HWCAP2_PHIR3MWAIT;
 }
 
 /*
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 7f0f01a..1f65815 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -236,6 +236,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 		rdmsrl(MSR_PHI_MISC_THD_FEATURE_ENABLE, prev);
 		wrmsrl(MSR_PHI_MISC_THD_FEATURE_ENABLE,
 		       prev | MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT);
+		set_cpu_cap(c, X86_FEATURE_PHIR3MWAIT);
 	}
 }
 
--
2.5.1
[PATCH v1 2/4] Add enabling of the R3 MWAIT during boot for KNL
If the processor is an Intel Xeon Phi, enable the user-level MWAIT feature. Enabling this feature suppresses the invalid-opcode error when MONITOR/MWAIT is called from ring 3.

Change-Id: I1c7defb99296b022790a068a6c725b3e860cd68c
Signed-off-by: Grzegorz Andrejczuk
---
 arch/x86/kernel/cpu/intel.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index fcd484d..7f0f01a 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -61,6 +61,14 @@ void check_mpx_erratum(struct cpuinfo_x86 *c)
 	}
 }
 
+static int phir3mwait = 1;
+static int __init phir3mwait_disable(char *value)
+{
+	phir3mwait = 0;
+	return 1;
+}
+__setup("intel-phir3mwait=disable", phir3mwait_disable);
+
 static void early_init_intel(struct cpuinfo_x86 *c)
 {
 	u64 misc_enable;
@@ -211,6 +219,24 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 	}
 
 	check_mpx_erratum(c);
+
+	/*
+	 * Setting ring 3 MONITOR/MWAIT for all threads
+	 * when CPU is Xeon Phi Family x200
+	 * This can be disabled with phir3mwait=disable cmdline switch.
+	 * We preserve the reserved values and set only 2nd bit.
+	 * Ref:
+	 * https://software.intel.com/en-us/blogs/2016/10/06/intel-xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait
+	 */
+	if (c->x86 == 6 &&
+	    c->x86_model == INTEL_FAM6_XEON_PHI_KNL &&
+	    phir3mwait) {
+		u64 prev;
+
+		rdmsrl(MSR_PHI_MISC_THD_FEATURE_ENABLE, prev);
+		wrmsrl(MSR_PHI_MISC_THD_FEATURE_ENABLE,
+		       prev | MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT);
+	}
 }
 
 #ifdef CONFIG_X86_32
--
2.5.1
[PATCH v1 1/4] Add R3MWAIT register and bit to msr-info.h
Intel Xeon Phi x200 (codenamed Knights Landing) has the MSR MISC_THD_FEATURE_ENABLE at 0x140. Setting its 2nd bit (bit 1) stops the MONITOR and MWAIT instructions from raising an invalid-opcode exception. This commit adds the register, prefixed with PHI, and the bit to msr-index.h.

Reference:
https://software.intel.com/en-us/blogs/2016/10/06/intel-xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait

Change-Id: If3b14c78f4e66d734e5a00921023a8c7cafc0cf3
Signed-off-by: Grzegorz Andrejczuk
---
 arch/x86/include/asm/msr-index.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 56f4c66..3eb1713 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -540,6 +540,11 @@
 #define MSR_IA32_MISC_ENABLE_IP_PREF_DISABLE_BIT	39
 #define MSR_IA32_MISC_ENABLE_IP_PREF_DISABLE		(1ULL << MSR_IA32_MISC_ENABLE_IP_PREF_DISABLE_BIT)
 
+/* Intel Xeon Phi x200 ring 3 MONITOR/MWAIT */
+#define MSR_PHI_MISC_THD_FEATURE_ENABLE			0x0140
+#define MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT_BIT	1
+#define MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT		(1ULL << MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT_BIT)
+
 #define MSR_IA32_TSC_DEADLINE		0x06E0
 
 /* P4/Xeon+ specific */
--
2.5.1
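For readers following the bit arithmetic, here is a minimal userspace sketch of what "set only the 2nd bit and preserve the rest" means. The two macros are copied from the patch; the helper name r3mwait_enable() is hypothetical and exists only for this illustration. It operates on a plain 64-bit value and does not touch any real MSR.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in the patch to msr-index.h. */
#define MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT_BIT 1
#define MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT \
	(1ULL << MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT_BIT)

/*
 * Hypothetical helper (not part of the patch): given the value read
 * from the MSR, return the value to write back. Only bit 1 is forced
 * on; every other bit, including reserved ones, is passed through.
 */
static inline uint64_t r3mwait_enable(uint64_t msr_val)
{
	return msr_val | MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT;
}
```

With this definition, an input of 0xF0 comes back as 0xF2, and an input that already has bit 1 set is returned unchanged.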
Re: [PATCH] mm: kmemleak: Ensure that the task stack is not freed during scanning
> @@ -1453,8 +1453,11 @@ static void kmemleak_scan(void)
>
> 	read_lock(&tasklist_lock);
> 	do_each_thread(g, p) {

Take a look at this commit please.

1da4db0cd5 ("oom_kill: change oom_kill.c to use for_each_thread()")

> -		scan_block(task_stack_page(p), task_stack_page(p) +
> -			   THREAD_SIZE, NULL);
> +		void *stack = try_get_task_stack(p);
> +		if (stack) {
> +			scan_block(stack, stack + THREAD_SIZE, NULL);
> +			put_task_stack(p);
> +		}
> 	} while_each_thread(g, p);
> 	read_unlock(&tasklist_lock);
> }
>
Re: [PATCH v8 6/6] mfd: lpc_ich: Add support for Intel Apollo Lake GPIO pinctrl in non-ACPI system
On Wed, 2016-10-12 at 14:51 +0800, Tan Jui Nee wrote: > This driver uses the P2SB hide/unhide mechanism cooperatively > to pass the PCI BAR address to the gpio platform driver. > Almost minor issues below. > --- a/drivers/mfd/Makefile > +++ b/drivers/mfd/Makefile > @@ -161,6 +161,10 @@ obj-$(CONFIG_MFD_INTEL_QUARK_I2C_GPIO) += > intel_quark_i2c_gpio.o > obj-$(CONFIG_LPC_SCH)+= lpc_sch.o > lpc_ich-objs := lpc_ich_core.o ^^^ > obj-$(CONFIG_LPC_ICH)+= lpc_ich.o > +lpc_ich-objs := lpc_ich_core.o ^^^ duplication. > +ifeq ($(CONFIG_X86_INTEL_IVI),y) > +lpc_ich-objs += lpc_ich_apl.o > +endif > +++ b/drivers/mfd/lpc_ich_apl.c > @@ -0,0 +1,120 @@ > +/* > + * Intel Apollo Lake In-Vehicle Infotainment (IVI) systems used in > cars support > + * > + * Copyright (C) 2016 Intel Corporation > + * > + * Author: Tan, Jui Nee > + * > + * This program is free software; you can redistribute it and/or > modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. > + */ > + > +#include Hmm... asm stuff is platform specific, better to put it in separate section like: #include #include #include #include "c.h" > +#include > +#include > +#include > +#include > + > +#include "lpc_ich_apl.h" > + > > +int lpc_ich_add_gpio(struct pci_dev *dev, enum lpc_chipsets chipset) > +{ > + unsigned int i; > + int ret; > + struct resource base; > + > > + if (chipset != LPC_APL) > + return -ENODEV; Replace this by positive check (see below). Moreover -ENODEV will be returned if no cells were added. > + /* > + * Apollo lake, has not 1, but 4 gpio controllers, Perhaps "Apollo lake has 4 gpio controllers," > + * handle it a bit differently. 
> + */ > + > + ret = p2sb_bar(dev, PCI_DEVFN(PCI_IDSEL_P2SB, 0), &base); > + if (ret) > + goto warn_continue; > + > + for (i = 0; i < APL_GPIO_COMMUNITY_MAX; i++) { > + struct resource *res = &apl_gpio_io_res[i]; > + > + /* Fill MEM resource */ > + res->start += base.start; > + res->end += base.start; > + res->flags = base.flags; > + > + res++; > + } > + > + ret = mfd_add_devices(&dev->dev, 0, > + apl_gpio_devices, ARRAY_SIZE(apl_gpio_devices), > + NULL, 0, NULL); > + > + if (ret) > +warn_continue: Swap them. > + dev_warn(&dev->dev, > + "Failed to add Apollo Lake GPIO: %d\n", > + ret); > + > + return ret; > +} > +++ b/drivers/mfd/lpc_ich_apl.h > @@ -0,0 +1,29 @@ > +/* > + * lpc_ich_apl.h - Intel In-Vehicle Infotainment (IVI) systems used > in cars > + * support > + * > + * Copyright (C) 2016, Intel Corporation > + * > + * Author: Tan, Jui Nee > + * > + * This program is free software; you can redistribute it and/or > modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. > + */ > + > +#ifndef __LPC_ICH_APL_H__ > +#define __LPC_ICH_APL_H__ > + > +#include > + > +#if IS_ENABLED(CONFIG_X86_INTEL_IVI) > +int lpc_ich_add_gpio(struct pci_dev *dev, enum lpc_chipsets chipset); > +#else /* CONFIG_X86_INTEL_IVI is not set */ > +static inline int lpc_ich_add_gpio(struct pci_dev *dev, > + enum lpc_chipsets chipset) > +{ > + return -ENODEV; > +} > +#endif Add comment here if you want to be looking like in p2sb.h. 
> + > +#endif > > --- a/drivers/mfd/lpc_ich_core.c > +++ b/drivers/mfd/lpc_ich_core.c > @@ -70,6 +70,8 @@ > #include > #include > > +#include "lpc_ich_apl.h" > + > #define ACPIBASE 0x40 > #define ACPIBASE_GPE_OFF 0x28 > #define ACPIBASE_GPE_END 0x2f > @@ -1032,6 +1034,9 @@ static int lpc_ich_probe(struct pci_dev *dev, > cell_added = true; > } > > + if (!lpc_ich_add_gpio(dev, priv->chipset)) > + cell_added = true; > + Like it's already used: if (priv->chipset == XXX) { do_yyy(dev); cell_added = true; } -- Andy Shevchenko Intel Finland Oy
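The "positive check" suggested in this review can be sketched in isolation as follows. This is only an illustration under stated assumptions: the two-entry enum is a stand-in for the driver's real enum lpc_chipsets, add_gpio_stub() is a hypothetical replacement for lpc_ich_add_gpio() with the chipset test removed from it, and, unlike the reviewer's snippet, the sketch still looks at the return value before setting cell_added.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical two-entry stand-in for the driver's enum lpc_chipsets. */
enum lpc_chipsets { LPC_OTHER, LPC_APL };

static int gpio_add_calls;	/* call counter, for illustration only */

/* Stub for the GPIO registration; it no longer checks the chipset. */
static inline int add_gpio_stub(void)
{
	gpio_add_calls++;
	return 0;
}

/* Caller decides, with a positive check, whether GPIO cells apply. */
static inline bool probe_gpio(enum lpc_chipsets chipset)
{
	bool cell_added = false;

	if (chipset == LPC_APL) {
		if (!add_gpio_stub())
			cell_added = true;
	}
	return cell_added;
}
```

Moving the test to the caller means the helper is never entered for other chipsets, so it does not need to return -ENODEV just to signal "not applicable".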
Re: [PATCH v8 4/6] mfd: move enum lpc_chipsets into lpc_ich.h
On Wed, 2016-10-12 at 14:51 +0800, Tan Jui Nee wrote:
> Move the enum's definition into a standalone header file which can be
> used wherever its definition is needed.
>
> --- a/include/linux/mfd/lpc_ich.h
> +++ b/include/linux/mfd/lpc_ich.h
> @@ -43,4 +43,75 @@ struct lpc_ich_info {
> 	u8 use_gpio;
> };
>
> +/* chipset related info */
> +enum lpc_chipsets {

Maybe it is worth adding a note that the list should not be shuffled and that new items should go at the end. But it is up to you.

--
Andy Shevchenko
Intel Finland Oy
[PATCH 3/4] Add hwcap2 for x86
Add the hwcap2 attribute for x86. Reserve the 1st bit (bit 0) of HWCAP2 for exposing Xeon Phi ring 3 MONITOR/MWAIT. With this, userspace applications can detect whether the ring 3 MONITOR/MWAIT instructions are available.

Change-Id: I37d0354d1e2b9594d7feebc2bacda30b68163efe
Signed-off-by: Grzegorz Andrejczuk
---
 arch/x86/include/asm/elf.h        | 7 +++
 arch/x86/include/uapi/asm/hwcap.h | 7 +++
 arch/x86/kernel/cpu/common.c      | 3 +++
 3 files changed, 17 insertions(+)
 create mode 100644 arch/x86/include/uapi/asm/hwcap.h

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index e7f155c..a3f7856 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -258,6 +258,13 @@ extern int force_personality32;
 
 #define ELF_HWCAP		(boot_cpu_data.x86_capability[CPUID_1_EDX])
 
+extern unsigned int elf_hwcap2;
+
+/* HWCAP2 supplies kernel enabled CPU feature, so that the application
+   can know that it can safely use them. The bits are defined in
+   uapi/asm/hwcap.h. */
+#define ELF_HWCAP2		elf_hwcap2
+
 /* This yields a string that ld.so will use to load implementation
    specific libraries for optimization. This is more specific in
    intent than poking at uname or /proc/cpuinfo.
diff --git a/arch/x86/include/uapi/asm/hwcap.h b/arch/x86/include/uapi/asm/hwcap.h
new file mode 100644
index 000..d1f4f98
--- /dev/null
+++ b/arch/x86/include/uapi/asm/hwcap.h
@@ -0,0 +1,7 @@
+#ifndef _ASM_HWCAP_H
+#define _ASM_HWCAP_H 1
+
+/* Kernel enabled Ring 3 MWAIT for Xeon Phi*/
+#define HWCAP2_PHIR3MWAIT (1 << 0)
+/* upto bit 31 free */
+#endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index bcc9ccc..93ffaa5 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -35,6 +35,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -51,6 +52,8 @@
 
 #include "cpu.h"
 
+unsigned elf_hwcap2 __read_mostly;
+
 /* all of these masks are initialized in setup_cpu_local_masks() */
 cpumask_var_t cpu_initialized_mask;
 cpumask_var_t cpu_callout_mask;
--
2.5.1
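As a rough sketch of how a userspace program might consume this bit on Linux with glibc, assuming the patch were applied as posted: HWCAP2_PHIR3MWAIT below is the value proposed in this patch, not an established ABI constant, and the helper names hwcap2_has_r3mwait() and cpu_has_r3mwait() are hypothetical. getauxval() and AT_HWCAP2 are the existing glibc/Linux interface for reading the auxiliary vector.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/auxv.h>	/* getauxval() */

#ifndef AT_HWCAP2
#define AT_HWCAP2 26	/* auxiliary vector key, as in the Linux UAPI headers */
#endif

/* Bit proposed by this patch; an assumption, not a released ABI value. */
#define HWCAP2_PHIR3MWAIT (1 << 0)

/* Pure bit test, so it can be checked without special hardware. */
static inline bool hwcap2_has_r3mwait(unsigned long hwcap2)
{
	return (hwcap2 & HWCAP2_PHIR3MWAIT) != 0;
}

/* Ask the kernel, through the auxiliary vector, on the running system. */
static inline bool cpu_has_r3mwait(void)
{
	return hwcap2_has_r3mwait(getauxval(AT_HWCAP2));
}
```

An application would call cpu_has_r3mwait() once at startup and fall back to a different wait primitive when it returns false, since executing MONITOR/MWAIT in ring 3 without the feature raises an invalid-opcode exception.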
Re: [PATCH v7 2/8] power: add power sequence library
Hi, Am Dienstag, 20. September 2016, 11:36:41 CEST schrieb Peter Chen: > We have an well-known problem that the device needs to do some power > sequence before it can be recognized by related host, the typical > example like hard-wired mmc devices and usb devices. > > This power sequence is hard to be described at device tree and handled by > related host driver, so we have created a common power sequence > library to cover this requirement. The core code has supplied > some common helpers for host driver, and individual power sequence > libraries handle kinds of power sequence for devices. > > pwrseq_generic is intended for general purpose of power sequence, which > handles gpios and clocks currently, and can cover regulator and pinctrl > in future. The host driver just needs to call of_pwrseq_on/of_pwrseq_off > if only one power sequence is needed, else call of_pwrseq_on_list > /of_pwrseq_off_list instead (eg, USB hub driver). > > Signed-off-by: Peter Chen > Tested-by Joshua Clayton > Reviewed-by: Matthias Kaehlcke > Tested-by: Matthias Kaehlcke first of all, glad to see this move forward. I've only some qualms with the static number of allocated power sequences below. [...] > diff --git a/drivers/power/pwrseq/Kconfig b/drivers/power/pwrseq/Kconfig > new file mode 100644 > index 000..dff5e35 > --- /dev/null > +++ b/drivers/power/pwrseq/Kconfig > @@ -0,0 +1,45 @@ > +# > +# Power Sequence library > +# > + > +config POWER_SEQUENCE > + bool > + > +menu "Power Sequence Support" > + > +config PWRSEQ_GENERIC > + bool "Generic power sequence control" > + depends on OF > + select POWER_SEQUENCE > + help > +It is used for drivers which needs to do power sequence > +(eg, turn on clock, toggle reset gpio) before the related > +devices can be found by hardware. This generic one can be > +used for common power sequence control. 
> + > +config PWRSEQ_GENERIC_INSTANCE_NUMBER > + int "Number of Generic Power Sequence Instance" > + depends on PWRSEQ_GENERIC > + range 1 10 > + default 2 > + help > +Usually, there are not so many devices needs power sequence, we set > two > +as default value. limiting this to some arbitary compile-time number somehow seems crippling for the single-image approach. I.e. a distribution might select something and during its lifetime the board requiring n+1 power-sequences appears and thus needs a different kernel version just to support that additional sequence. Also, board designers are creative, and there were already complex examples mentioned elsewhere, so nothing keeps people from inventing something even more complex. [...] > diff --git a/drivers/power/pwrseq/pwrseq_generic.c > b/drivers/power/pwrseq/pwrseq_generic.c new file mode 100644 > index 000..bcd16c3 > --- /dev/null > +++ b/drivers/power/pwrseq/pwrseq_generic.c [...] > +static int pwrseq_generic_get(struct device_node *np, struct pwrseq > *pwrseq) +{ > + struct pwrseq_generic *pwrseq_gen = to_generic_pwrseq(pwrseq); > + enum of_gpio_flags flags; > + int reset_gpio, clk, ret = 0; > + > + for (clk = 0; clk < PWRSEQ_MAX_CLKS; clk++) { > + pwrseq_gen->clks[clk] = of_clk_get(np, clk); > + if (IS_ERR(pwrseq_gen->clks[clk])) { > + ret = PTR_ERR(pwrseq_gen->clks[clk]); > + if (ret != -ENOENT) > + goto err_put_clks; > + pwrseq_gen->clks[clk] = NULL; > + break; > + } > + } > + > + reset_gpio = of_get_named_gpio_flags(np, "reset-gpios", 0, &flags); > + if (gpio_is_valid(reset_gpio)) { > + unsigned long gpio_flags; > + > + if (flags & OF_GPIO_ACTIVE_LOW) > + gpio_flags = GPIOF_ACTIVE_LOW | GPIOF_OUT_INIT_LOW; > + else > + gpio_flags = GPIOF_OUT_INIT_HIGH; > + > + ret = gpio_request_one(reset_gpio, gpio_flags, > + "pwrseq-reset-gpios"); > + if (ret) > + goto err_put_clks; > + > + pwrseq_gen->gpiod_reset = gpio_to_desc(reset_gpio); > + of_property_read_u32(np, "reset-duration-us", > + &pwrseq_gen->duration_us); > + 
} else { > + if (reset_gpio == -ENOENT) > + return 0; > + > + ret = reset_gpio; > + pr_err("Failed to get reset gpio on %s, err = %d\n", > + np->full_name, reset_gpio); > + goto err_put_clks; > + } > + > + return ret; > + > +err_put_clks: > + while (--clk >= 0) > + clk_put(pwrseq_gen->clks[clk]); > + return ret; > +} > + > +static const struct of_device_id generic_id_table[] = { > + { .compatible = "generic",}, > + { /* sentinel */ } > +}; > + > +static int __init pwrseq_generic_register(void) > +{ > + struct pwrseq
Re: [PATCH v1 4/4] Add R3MWAIT to CPU features
On Wed, Oct 12, 2016 at 12:13:10PM +0200, Grzegorz Andrejczuk wrote: > Add cpu feature for ring 3 monitor/mwait. > > Change-Id: Iba4d20639efd8d3637d37db9294cbc43a98f009a Please no internal IDs in upstream submission. > Signed-off-by: Grzegorz Andrejczuk ... > diff --git a/arch/x86/include/asm/cpufeatures.h > b/arch/x86/include/asm/cpufeatures.h > index 92a8308..242cd16 100644 > --- a/arch/x86/include/asm/cpufeatures.h > +++ b/arch/x86/include/asm/cpufeatures.h > @@ -12,7 +12,7 @@ > /* > * Defines x86 CPU feature bits > */ > -#define NCAPINTS 18 /* N 32-bit words worth of info */ > +#define NCAPINTS 19 /* N 32-bit words worth of info */ > #define NBUGINTS 1 /* N 32-bit bug flags */ > > /* > @@ -286,6 +286,10 @@ > #define X86_FEATURE_SUCCOR (17*32+1) /* Uncorrectable error containment > and recovery */ > #define X86_FEATURE_SMCA (17*32+3) /* Scalable MCA */ > > + > +/* non architectural Intel-defined CPU features not present in CPUID, word > 18 */ > +#define X86_FEATURE_PHIR3MWAIT (18*32+ 0) Please use init_scattered_cpuid_features() for the whole thing. There are some free bits in word 3 for example, see arch/x86/include/asm/cpufeatures.h. -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) --
Re: [PATCH v8 3/6] x86/intel-ivi: Add Intel In-Vehicle Infotainment (IVI) systems used in cars support
On Wed, 2016-10-12 at 14:51 +0800, Tan Jui Nee wrote: > Add support for non ACPI system, such as system that uses Advanced > Boot > Loader (ABL) whereby a platform device has to be created in order to > bind > with PINCTRL/GPIO. > > At the moment, Intel Apollo Lake SoC requires P2SB driver to hide and > unhide P2SB to lookup P2SB BAR and pass the PCI BAR address to GPIO. I dunno if this patch would go as a last in the series. > > +config X86_INTEL_IVI > + bool "Intel In-Vehicle Infotainment (IVI) systems used in > cars" > + ---help--- > + Select this option to enable MMIO BAR access over the P2SB > for > + non-ACPI Intel Apollo Lake SoC platforms. This sounds not what the option is used for. What I see from the code as simple as "Enable support of Intel IVI systems. This enables necessary drivers and libraries which are used in IVI systems." > This driver uses the P2SB > + hide/unhide mechanism cooperatively to pass the PCI BAR > address to > + the platform driver, currently GPIO. -- Andy Shevchenko Intel Finland Oy
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 10/11/16 13:17 -0700, Dan Williams wrote: On Tue, Oct 11, 2016 at 12:48 PM, Konrad Rzeszutek Wilk wrote: On Tue, Oct 11, 2016 at 12:28:56PM -0700, Dan Williams wrote: On Tue, Oct 11, 2016 at 11:33 AM, Konrad Rzeszutek Wilk wrote: > On Tue, Oct 11, 2016 at 10:51:19AM -0700, Dan Williams wrote: [..] >> Right, but why does the libnvdimm core need to know about this >> specific Xen reservation? For example, if Xen wants some in-kernel > > Let me turn this around - why does the libnvdimm core need to know about > Linux specific parts? Shouldn't this be OS agnostic, so that FreeBSD > for example can also poke a hole in this and fill it with its > OS-management meta-data? Specifically the core needs to know so that it can answer the Linux specific question of whether the pfn returned by ->direct_access() has a corresponding struct page or not. It's tied to the lifetime of the device and the usage of the reservation needs to be coordinated against the references of those pages. If FreeBSD decides it needs to reserve "struct page" capacity at the start of the device, I would hope that it reuses the same on-device info block that Linux is using and not create a new "FreeBSD-mode" device type. The issue here (as I understand, I may be missing something new) is that the size of this special namespace may be different. That is the 'struct page' on FreeBSD could be 256 bytes while on Linux it is 64 bytes (numbers pulled out of the sky). Hence one would have to expand or such to re-use this. Sure, but we could support that today. If FreeBSD lays down the info block it is free to make a bigger reservation and Linux would be happy to use a smaller subset. If we, as an industry, want this "struct page" reservation to be common we can take it to a standards body to make as a cross-OS guarantee... but I think this is separate from the Xen reservation. 
To be honest I do not yet understand what metadata Xen wants to store in the device, but it seems the producer and consumer of that metadata is Xen itself and not the wider Linux kernel as is the case with struct page. Can you fill me in on what problem Xen solves with this reservation?

Exactly! The same as Linux - its variant of 'struct page'. Which I think is smaller than the Linux one, but perhaps it is not?

If the hypervisor needs to know where it can store some metadata, can that be satisfied with userspace tooling in Dom0? Something like, "/dev/pmem0p1 == Xen metadata" and "/dev/pmem0p2 == DAX filesystem with files to hand to guests". So my question is not about the rationale for having metadata, it's why does the Linux kernel need to know about the Xen reservation? As far as I can see it is independent / opaque to the kernel.

Thanks, everyone, for all these comments!

How about doing the reservation in the following way:

1. Create partition(s) on /dev/pmemX and make sure the space besides the partition table and potential padding before the first partition is large enough to hold Xen's management structures and the super block introduced in step 2. The space besides the partition table, padding and the super block will be used as the reserved area.

2. Write a super block before the above reserved area. The super block records the base address and the size of the reserved area. It also contains a signature and a checksum to identify itself.

   The layout is shown in the following diagram.

   +---------------+-----------+-------+----------+--------------+
   | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
   |  by kernel    |   Table   | Block | for Xen  |              |
   +---------------+-----------+-------+----------+--------------+
   \_____________________________ _______________________________/
                                 V
                            /dev/pmem0

   The above two steps can be done by a userspace program and do not need the Xen hypervisor to be running. The partitions on the device can be used regardless of whether the Xen hypervisor is present.

3. When Xen is running, implement a function in the Dom0 Linux Xen driver (drivers/xen/) to respond to udevd events that notify the detection of pmem regions. This function searches the pmem region for the super block created in step 2. If one is found, it knows this pmem region has been prepared for Xen usage. Then it gets the base address and size of the reserved area (from the super block) and the entire address range of the pmem region (from the pmem driver), and reports them to the Xen hypervisor.

   The implementation of this step can be completely contained in the kernel Xen driver. (It may also be implemented as a udevd service in userspace, if that is not considered unsafe.)

Thanks,
Haozhong
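To make step 2 of the proposal more concrete, here is a purely hypothetical sketch of such a super block. Nothing in it comes from an actual Xen or Linux definition: the struct name, field names, magic value and the XOR checksum are all placeholders chosen for illustration, and a real design would presumably pick a stronger checksum and a fixed on-media byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder signature ("XENPMEM0" in ASCII); an assumption. */
#define XEN_PMEM_SB_MAGIC 0x58454e504d454d30ULL

/* Hypothetical super block placed right before the reserved area. */
struct xen_pmem_sb {
	uint64_t signature;	/* identifies a prepared pmem region */
	uint64_t rsv_base;	/* base address of the reserved area */
	uint64_t rsv_size;	/* size of the reserved area in bytes */
	uint64_t checksum;	/* covers the three fields above */
};

/* Placeholder checksum: XOR of the covered fields. */
static inline uint64_t sb_checksum(const struct xen_pmem_sb *sb)
{
	return sb->signature ^ sb->rsv_base ^ sb->rsv_size;
}

/* What the userspace preparation tool (steps 1 and 2) would fill in. */
static inline void sb_init(struct xen_pmem_sb *sb, uint64_t base,
			   uint64_t size)
{
	sb->signature = XEN_PMEM_SB_MAGIC;
	sb->rsv_base = base;
	sb->rsv_size = size;
	sb->checksum = sb_checksum(sb);
}

/* What the Dom0 driver (step 3) would check before trusting the block. */
static inline bool sb_valid(const struct xen_pmem_sb *sb)
{
	return sb->signature == XEN_PMEM_SB_MAGIC &&
	       sb->checksum == sb_checksum(sb);
}
```

A region without a valid super block would simply be treated as not prepared for Xen, which matches the point in the proposal that the partitions remain usable whether or not Xen is present.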
[PATCH] qede: fix CONFIG_INFINIBAND_QEDR=m build error
The newly introduced INFINIBAND_QEDR option is 'tristate' but fails to build when set to 'm': drivers/net/built-in.o: In function `qed_hw_init': (.text+0x1c0e17): undefined reference to `qed_rdma_dpm_bar' drivers/net/built-in.o: In function `qed_eq_completion': (.text+0x1d185b): undefined reference to `qed_async_roce_event' drivers/net/built-in.o: In function `qed_ll2_txq_completion': qed_ll2.c:(.text+0x1e2fdd): undefined reference to `qed_ll2b_complete_tx_gsi_packet' drivers/net/built-in.o: In function `qed_ll2_rxq_completion': qed_ll2.c:(.text+0x1e479a): undefined reference to `qed_ll2b_complete_rx_gsi_packet' drivers/net/built-in.o: In function `qed_ll2_terminate_connection': (.text+0x1e5645): undefined reference to `qed_ll2b_release_tx_gsi_packet' There are multiple problems here: - The option should be 'bool', as this is not a separate module but rather a single file that gets added to the normal driver module - The qed_rdma_dpm_bar() helper function should have been 'static inline' as it's declared in a header file, the current workaround of including qed_roce.h conditionally is not good - There is no reason to use '#if' all the time to check for the symbol, it should use use 'if IS_ENABLED()' to make the code more readable and get better compile coverage. This addresses all three of the above. 
Fixes: cee9fbd8e2e9 ("qede: Add qedr framework") Signed-off-by: Arnd Bergmann --- drivers/net/ethernet/qlogic/Kconfig| 2 +- drivers/net/ethernet/qlogic/qed/qed_cxt.c | 6 +- drivers/net/ethernet/qlogic/qed/qed_dev.c | 7 +++ drivers/net/ethernet/qlogic/qed/qed_main.c | 24 +++- drivers/net/ethernet/qlogic/qed/qed_roce.h | 4 drivers/net/ethernet/qlogic/qed/qed_spq.c | 13 ++--- 6 files changed, 22 insertions(+), 34 deletions(-) diff --git a/drivers/net/ethernet/qlogic/Kconfig b/drivers/net/ethernet/qlogic/Kconfig index 0df1391f9663..90562cf8fa19 100644 --- a/drivers/net/ethernet/qlogic/Kconfig +++ b/drivers/net/ethernet/qlogic/Kconfig @@ -108,7 +108,7 @@ config QEDE This enables the support for ... config INFINIBAND_QEDR - tristate "QLogic qede RoCE sources [debug]" + bool "QLogic qede RoCE sources [debug]" depends on QEDE && 64BIT select QED_LL2 default n diff --git a/drivers/net/ethernet/qlogic/qed/qed_cxt.c b/drivers/net/ethernet/qlogic/qed/qed_cxt.c index 82370a1a59ad..0a3ffcd9f073 100644 --- a/drivers/net/ethernet/qlogic/qed/qed_cxt.c +++ b/drivers/net/ethernet/qlogic/qed/qed_cxt.c @@ -48,12 +48,8 @@ #define TM_ELEM_SIZE4 /* ILT constants */ -#if IS_ENABLED(CONFIG_INFINIBAND_QEDR) /* For RoCE we configure to 64K to cover for RoCE max tasks 256K purpose. */ -#define ILT_DEFAULT_HW_P_SIZE 4 -#else -#define ILT_DEFAULT_HW_P_SIZE 3 -#endif +#define ILT_DEFAULT_HW_P_SIZE IS_ENABLED(CONFIG_INFINIBAND_QEDR) ? 
4 : 3 #define ILT_PAGE_IN_BYTES(hw_p_size) (1U << ((hw_p_size) + 12)) #define ILT_CFG_REG(cli, reg) PSWRQ2_REG_ ## cli ## _ ## reg ## _RT_OFFSET diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c b/drivers/net/ethernet/qlogic/qed/qed_dev.c index 754f6a908858..63a38e3b8f3f 100644 --- a/drivers/net/ethernet/qlogic/qed/qed_dev.c +++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c @@ -890,7 +890,7 @@ qed_hw_init_pf_doorbell_bar(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt) n_cpus = 1; rc = qed_hw_init_dpi_size(p_hwfn, p_ptt, pwm_regsize, n_cpus); - if (cond) + if (IS_ENABLED(CONFIG_INFINIBAND_QEDR) && cond) qed_rdma_dpm_bar(p_hwfn, p_ptt); } @@ -1422,19 +1422,18 @@ static void qed_hw_set_feat(struct qed_hwfn *p_hwfn) u32 *feat_num = p_hwfn->hw_info.feat_num; int num_features = 1; -#if IS_ENABLED(CONFIG_INFINIBAND_QEDR) /* Roce CNQ each requires: 1 status block + 1 CNQ. We divide the * status blocks equally between L2 / RoCE but with consideration as * to how many l2 queues / cnqs we have */ - if (p_hwfn->hw_info.personality == QED_PCI_ETH_ROCE) { + if (IS_ENABLED(CONFIG_INFINIBAND_QEDR) && + p_hwfn->hw_info.personality == QED_PCI_ETH_ROCE) { num_features++; feat_num[QED_RDMA_CNQ] = min_t(u32, RESC_NUM(p_hwfn, QED_SB) / num_features, RESC_NUM(p_hwfn, QED_RDMA_CNQ_RAM)); } -#endif feat_num[QED_PF_L2_QUE] = min_t(u32, RESC_NUM(p_hwfn, QED_SB) / num_features, RESC_NUM(p_hwfn, QED_L2_QUEUE)); diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c b/drivers/net/ethernet/qlogic/qed/qed_main.c index 4ee3151e80c2..36023a3583f2 100644 --- a/drivers/net/ethernet/qlogic/qed/qed_main.c +++ b/drivers/net/ethernet/qlogic/qed/qed_main.c @@ -33,10 +33,8 @@ #include "qed_hw.h" #include "qed_selftest.h
Re: [PATCH v1 2/4] Add enabling of the R3 MWAIT during boot for KNL
On Wed, Oct 12, 2016 at 12:13:08PM +0200, Grzegorz Andrejczuk wrote: > If processor is Intel Xeon Phi we enable user-level mwait feature. > Enabling this feature suppreses invalid-opcode error, when MONITOR/MWAIT > is called from ring 3. > > Change-Id: I1c7defb99296b022790a068a6c725b3e860cd68c > Signed-off-by: Grzegorz Andrejczuk > --- > arch/x86/kernel/cpu/intel.c | 26 ++ > 1 file changed, 26 insertions(+) > > diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c > index fcd484d..7f0f01a 100644 > --- a/arch/x86/kernel/cpu/intel.c > +++ b/arch/x86/kernel/cpu/intel.c > @@ -61,6 +61,14 @@ void check_mpx_erratum(struct cpuinfo_x86 *c) > } > } > > +static int phir3mwait = 1; > +static int __init phir3mwait_disable(char *value) > +{ > + phir3mwait = 0; > + return 1; > +} > +__setup("intel-phir3mwait=disable", phir3mwait_disable); That's a lot of typing on the cmdline. "r3mwait=disable" looks just as fine to me, for example. > static void early_init_intel(struct cpuinfo_x86 *c) > { > u64 misc_enable; > @@ -211,6 +219,24 @@ static void early_init_intel(struct cpuinfo_x86 *c) > } > > check_mpx_erratum(c); > + > + /* > + * Setting ring 3 MONITOR/MWAIT for all threads > + * when CPU is Xeon Phi Family x200 > + * This can be disabled with phir3mwait=disable cmdline switch. > + * We preserve the reserved values and set only 2nd bit. > + * Ref: > + * > https://software.intel.com/en-us/blogs/2016/10/06/intel-xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait > + */ > + if (c->x86 == 6 && > + c->x86_model == INTEL_FAM6_XEON_PHI_KNL && > + phir3mwait) { > + u64 prev; > + > + rdmsrl(MSR_PHI_MISC_THD_FEATURE_ENABLE, prev); > + wrmsrl(MSR_PHI_MISC_THD_FEATURE_ENABLE, > +prev | MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT); Wanna test the MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT bit before doing the MSR write? Btw, you might want to shorten those define names - they're huuge. -- Regards/Gruss, Boris. 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) --
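The "test the bit before doing the MSR write" suggestion can be sketched in userspace with a simulated register. Everything here is an assumption made for illustration: struct fake_msr and its accessors stand in for rdmsrl()/wrmsrl(), and R3MWAIT_BIT is a shortened name for the bit-1 mask from the series.

```c
#include <assert.h>
#include <stdint.h>

/* Shortened stand-in for MSR_PHI_MISC_THD_FEATURE_ENABLE_R3MWAIT. */
#define R3MWAIT_BIT (1ULL << 1)

/* Simulated MSR that also counts writes, for illustration only. */
struct fake_msr {
	uint64_t val;
	unsigned int writes;
};

static inline uint64_t fake_rdmsr(const struct fake_msr *m)
{
	return m->val;
}

static inline void fake_wrmsr(struct fake_msr *m, uint64_t v)
{
	m->val = v;
	m->writes++;
}

/* Read, skip the write if the bit is already set, otherwise set it. */
static inline void enable_r3mwait(struct fake_msr *m)
{
	uint64_t prev = fake_rdmsr(m);

	if (prev & R3MWAIT_BIT)
		return;

	fake_wrmsr(m, prev | R3MWAIT_BIT);
}
```

Calling enable_r3mwait() twice on the same simulated register performs exactly one write, which is the effect the review comment is after.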
Re: [PATCH] mm: page_alloc: Use KERN_CONT where appropriate
(resending as lkml bounced) On Wed, 2016-10-12 at 11:10 +0200, Michal Hocko wrote: > On Tue 11-10-16 19:24:55, Joe Perches wrote: > > Recent changes to printk require KERN_CONT uses to continue logging > > messages. So add KERN_CONT where necessary. > > > > I was really wondering what happened when Aaron reported an allocation > failure http://lkml.kernel.org/r/20161012065423.ga16...@aaronlu.sh.intel.com > See the attached log got the current Linus' tree > > Fixes: 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation > lines") > > Signed-off-by: Joe Perches > > > > Acked-by: Michal Hocko > > I believe we can simplify the code a bit as well. What do you think > about the following on top? Hi Michal I think the show_node to show_zone_node renaming is superfluous, but if it makes you happy, it doesn't bother me. This recent change to printk logging making KERN_CONT necessary to continue a line might be reverted when it's better known just how many instances in the kernel tree will need to be changed. For now, I'd rather keep the KERN_CONT "\n" and trailing "\n" as there are _very_ few missing newlines in logging messages today and removing them now might be a bit early process-wise. Dunno. 
> --- > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 6f8c356140a0..7e1b74ee79cb 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4078,10 +4078,12 @@ unsigned long nr_free_pagecache_pages(void) > return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE)); > } > > -static inline void show_node(struct zone *zone) > +static inline void show_zone_node(struct zone *zone) > { > if (IS_ENABLED(CONFIG_NUMA)) > - printk("Node %d ", zone_to_nid(zone)); > + printk("Node %d %s", zone_to_nid(zone), zone->name); > + else > + printk("%s: ", zone->name); > } > > long si_mem_available(void) > @@ -4329,9 +4331,8 @@ void show_free_areas(unsigned int filter) > for_each_online_cpu(cpu) > free_pcp += per_cpu_ptr(zone->pageset, cpu)->pcp.count; > > - show_node(zone); > + show_zone_node(zone); > printk(KERN_CONT > - "%s" > " free:%lukB" > " min:%lukB" > " low:%lukB" > @@ -4354,7 +4355,6 @@ void show_free_areas(unsigned int filter) > " local_pcp:%ukB" > " free_cma:%lukB" > "\n", > - zone->name, > K(zone_page_state(zone, NR_FREE_PAGES)), > K(min_wmark_pages(zone)), > K(low_wmark_pages(zone)), > @@ -4379,7 +4379,6 @@ void show_free_areas(unsigned int filter) > printk("lowmem_reserve[]:"); > for (i = 0; i < MAX_NR_ZONES; i++) > printk(KERN_CONT " %ld", zone->lowmem_reserve[i]); > - printk(KERN_CONT "\n"); > } > > for_each_populated_zone(zone) { > @@ -4389,8 +4388,7 @@ void show_free_areas(unsigned int filter) > > if (skip_free_areas_node(filter, zone_to_nid(zone))) > continue; > - show_node(zone); > - printk(KERN_CONT "%s: ", zone->name); > + show_zone_node(zone); > > spin_lock_irqsave(&zone->lock, flags); > for (order = 0; order < MAX_ORDER; order++) {
Re: MPOL_BIND on memory only nodes
On 10/12/2016 03:13 PM, Michal Hocko wrote: > On Wed 12-10-16 14:55:24, Anshuman Khandual wrote: >> Hi, >> >> We have the following function policy_zonelist() which selects a zonelist >> during various allocation paths. With this, general user space allocations >> (IIUC might not have __GFP_THISNODE) fails while trying to get memory from >> a memory only node without CPUs as the application runs some where else >> and that node is not part of the nodemask. My bad. I was playing with some changes to the zonelist rebuild after a memory node hotplug and to the order of the various zones in the zonelists. > > I am not sure I understand. So you have a task with MPOL_BIND without a > cpu less node in the mask and you are wondering why the memory is not > allocated from that node? In my experiment, there is an MPOL_BIND call with a CPU-less node in the node mask, and the memory is not allocated from that CPU-less node. That's because the zone of the CPU-less node was absent from the FALLBACK zonelist of the local node. > >> Why we insist on __GFP_THISNODE ? > > AFAIU __GFP_THISNODE just overrides the given node to the policy > nodemask in case the current node is not part of that node mask. In > other words we are ignoring the given node and use what the policy says. Right, but only provided the gfp flag has __GFP_THISNODE in it. In the absence of __GFP_THISNODE, the node from the nodemask will not be selected. I still wonder why. Can we always go to the first node in the nodemask for MPOL_BIND interface calls? Just curious to know why preference is given to the local node and its FALLBACK zonelist. > I can see how this can be confusing especially when confronting the > documentation: > > * __GFP_THISNODE forces the allocation to be satisified from the requested > * node with no fallbacks or placement policy enforcements. > Yeah, right. Thanks for your reply.
Re: [PATCH] mm: kmemleak: Ensure that the task stack is not freed during scanning
On Wed 12-10-16 10:57:03, Catalin Marinas wrote: > Commit 68f24b08ee89 ("sched/core: Free the stack early if > CONFIG_THREAD_INFO_IN_TASK") may cause the task->stack to be freed > during kmemleak_scan() execution, leading to either a NULL pointer > fault (if task->stack is NULL) or kmemleak accessing already freed > memory. This patch uses the new try_get_task_stack() API to ensure that > the task stack is not freed during kmemleak stack scanning. Looks good to me > Fixes: 68f24b08ee89 ("sched/core: Free the stack early if > CONFIG_THREAD_INFO_IN_TASK") > Cc: Andrew Morton > Cc: Andy Lutomirski > Cc: CAI Qian > Reported-by: CAI Qian > Signed-off-by: Catalin Marinas Acked-by: Michal Hocko > --- > > This was reported in a subsequent comment here: > > https://bugzilla.kernel.org/show_bug.cgi?id=173901 > > However, the original bugzilla entry doesn't look related to task stack > freeing as it was first reported on 4.8-rc8. Andy, sorry for cc'ing you > to bugzilla, please feel free to remove your email from the bug above (I > can't seem to be able to do it). > > mm/kmemleak.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/mm/kmemleak.c b/mm/kmemleak.c > index a5e453cf05c4..e5355a5b423f 100644 > --- a/mm/kmemleak.c > +++ b/mm/kmemleak.c > @@ -1453,8 +1453,11 @@ static void kmemleak_scan(void) > > read_lock(&tasklist_lock); > do_each_thread(g, p) { > - scan_block(task_stack_page(p), task_stack_page(p) + > -THREAD_SIZE, NULL); > + void *stack = try_get_task_stack(p); > + if (stack) { > + scan_block(stack, stack + THREAD_SIZE, NULL); > + put_task_stack(p); > + } > } while_each_thread(g, p); > read_unlock(&tasklist_lock); > } > -- Michal Hocko SUSE Labs
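The try-get/put pattern in the hunk above can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: `struct fake_task`, `fake_try_get_stack()`, `fake_put_stack()` and `scan_tasks()` are invented names, the reference count is a plain int, and nothing here is atomic, whereas the kernel helpers have to be safe against a concurrent free.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task with a separately freed stack. */
struct fake_task {
	void *stack;
	int stack_refs;	/* 0 means the stack is already gone */
};

/* Take a reference only if the stack is still alive. */
static void *fake_try_get_stack(struct fake_task *t)
{
	if (t->stack_refs == 0)
		return NULL;
	t->stack_refs++;
	return t->stack;
}

/* Drop a reference; the last one releases the stack. */
static void fake_put_stack(struct fake_task *t)
{
	assert(t->stack_refs > 0);
	if (--t->stack_refs == 0)
		t->stack = NULL;	/* real code would free the memory here */
}

/* Mirrors the loop shape in the patch: skip tasks whose stack is gone. */
static int scan_tasks(struct fake_task *tasks, size_t n)
{
	int scanned = 0;

	for (size_t i = 0; i < n; i++) {
		void *stack = fake_try_get_stack(&tasks[i]);

		if (stack) {
			scanned++;	/* stands in for scan_block() */
			fake_put_stack(&tasks[i]);
		}
	}
	return scanned;
}
```

The point of the shape is that the scanner never touches a stack pointer it has not pinned first, and a task whose stack was already released is simply skipped instead of dereferenced.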
Re: [PATCH] ext4: super.c: Update logging style using PR_CONT
On Tue 11-10-16 18:57:58, Joe Perches wrote: > Recent commit require line continuing printks to use PR_CONT. > > Update super.c to use PR_CONT and use vsprintf extension %pV > to avoid a printk/vprintk/printk("\n") sequence as well. Looks good. You can add: Reviewed-by: Jan Kara Honza > > Signed-off-by: Joe Perches > --- > fs/ext4/super.c | 21 +++-- > 1 file changed, 11 insertions(+), 10 deletions(-) > > diff --git a/fs/ext4/super.c b/fs/ext4/super.c > index 6db81fbcbaa6..20da99da0a34 100644 > --- a/fs/ext4/super.c > +++ b/fs/ext4/super.c > @@ -597,14 +597,15 @@ void __ext4_std_error(struct super_block *sb, const > char *function, > void __ext4_abort(struct super_block *sb, const char *function, > unsigned int line, const char *fmt, ...) > { > + struct va_format vaf; > va_list args; > > save_error_info(sb, function, line); > va_start(args, fmt); > - printk(KERN_CRIT "EXT4-fs error (device %s): %s:%d: ", sb->s_id, > -function, line); > - vprintk(fmt, args); > - printk("\n"); > + vaf.fmt = fmt; > + vaf.va = &args; > + printk(KERN_CRIT "EXT4-fs error (device %s): %s:%d: %pV\n", > +sb->s_id, function, line, &vaf); > va_end(args); > > if ((sb->s_flags & MS_RDONLY) == 0) { > @@ -2715,12 +2716,12 @@ static void print_daily_error_info(unsigned long arg) > es->s_first_error_func, > le32_to_cpu(es->s_first_error_line)); > if (es->s_first_error_ino) > - printk(": inode %u", > + printk(KERN_CONT ": inode %u", > le32_to_cpu(es->s_first_error_ino)); > if (es->s_first_error_block) > - printk(": block %llu", (unsigned long long) > + printk(KERN_CONT ": block %llu", (unsigned long long) > le64_to_cpu(es->s_first_error_block)); > - printk("\n"); > + printk(KERN_CONT "\n"); > } > if (es->s_last_error_time) { > printk(KERN_NOTICE "EXT4-fs (%s): last error at time %u: > %.*s:%d", > @@ -2729,12 +2730,12 @@ static void print_daily_error_info(unsigned long arg) > es->s_last_error_func, > le32_to_cpu(es->s_last_error_line)); > if (es->s_last_error_ino) > - printk(": inode %u", > + 
printk(KERN_CONT ": inode %u", > le32_to_cpu(es->s_last_error_ino)); > if (es->s_last_error_block) > - printk(": block %llu", (unsigned long long) > + printk(KERN_CONT ": block %llu", (unsigned long long) > le64_to_cpu(es->s_last_error_block)); > - printk("\n"); > + printk(KERN_CONT "\n"); > } > mod_timer(&sbi->s_err_report, jiffies + 24*60*60*HZ); /* Once a day */ > } > -- > 2.10.0.rc2.1.g053435c > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Jan Kara SUSE Labs, CR
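As a rough user-space analogue of the %pV change in __ext4_abort() above: instead of emitting the prefix, the caller's message and the newline as three separate writes, the whole line is formatted first and emitted once. `format_error()` is a hypothetical helper written for this note, and plain C has no %pV, so the sketch uses snprintf() followed by vsnprintf() into one buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "EXT4-fs error (device D): F:L: <message>" in a single buffer.
 * Returns the length written, or -1 if the buffer is too small.
 */
static int format_error(char *out, size_t size, const char *dev,
			const char *function, unsigned int line,
			const char *fmt, ...)
{
	va_list args;
	int used, rest;

	used = snprintf(out, size, "EXT4-fs error (device %s): %s:%u: ",
			dev, function, line);
	if (used < 0 || (size_t)used >= size)
		return -1;

	/* Append the caller's message directly after the prefix. */
	va_start(args, fmt);
	rest = vsnprintf(out + used, size - used, fmt, args);
	va_end(args);
	if (rest < 0 || (size_t)rest >= size - used)
		return -1;

	return used + rest;
}
```

A caller would then print the finished buffer with one call, so no other message can land between the prefix and the text, which is the property the patch is after.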
Re: [PATCH] mm: kmemleak: Ensure that the task stack is not freed during scanning
On Wed, Oct 12, 2016 at 06:16:46PM +0800, Hillf Danton wrote: > > @@ -1453,8 +1453,11 @@ static void kmemleak_scan(void) > > > > read_lock(&tasklist_lock); > > do_each_thread(g, p) { > > Take a look at this commit please. > 1da4db0cd5 ("oom_kill: change oom_kill.c to use for_each_thread()") Thanks. Isn't holding tasklist_lock here enough to avoid such races? -- Catalin
RE: [PATCH]"drm: change DRM_MIPI_DSI module type from "bool" to "tristate".
On Wed, 12 Oct 2016, "Sun, Jing A" wrote: > I think "installing a kernel with my changes for both drm and i915" > takes more time and effort to complete than "only updating DRM/i915 > modules without rebuilding the whole kernel". In some cases, that's > beneficial. It's possible to change and rebuild and update just the drm and i915, but you need to be careful to build against the same tree as the ones you are replacing. This is like using out-of-tree modules (which is something I can't recommend no matter what, but that's another discussion). However, this is completely different from planning to update drm and i915 modules on a running production system by unloading the old ones and probing the new ones. Don't do that. It will be a disaster. > Also reloadablility is always a good thing to have and I truly hope > Hajda/Iwai's patches would be accepted and merged. No downside of it > after all. I think it's good to be able to unload and reload modules for debugging and development, but not for normal use. BR, Jani. -- Jani Nikula, Intel Open Source Technology Center
Re: [PATCH 1/7 v4] sched: factorize attach entity
On 7 October 2016 at 01:11, Vincent Guittot wrote: > > On 5 October 2016 at 11:38, Dietmar Eggemann wrote: > > On 09/26/2016 01:19 PM, Vincent Guittot wrote: > >> > >> Factorize post_init_entity_util_avg and part of attach_task_cfs_rq > >> in one function attach_entity_cfs_rq > >> > >> Signed-off-by: Vincent Guittot > >> --- > >> kernel/sched/fair.c | 19 +++ > >> 1 file changed, 11 insertions(+), 8 deletions(-) > >> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >> index 986c10c..e8ed8d1 100644 > >> --- a/kernel/sched/fair.c > >> +++ b/kernel/sched/fair.c > >> @@ -697,9 +697,7 @@ void init_entity_runnable_average(struct sched_entity > >> *se) > >> } > >> > >> static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq); > >> -static int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool > >> update_freq); > >> -static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force); > >> -static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct > >> sched_entity *se); > >> +static void attach_entity_cfs_rq(struct sched_entity *se); > >> > >> /* > >> * With new tasks being created, their initial util_avgs are extrapolated > >> @@ -764,9 +762,7 @@ void post_init_entity_util_avg(struct sched_entity > >> *se) > >> } > >> } > > > > > > You now could move the 'u64 now = cfs_rq_clock_task(cfs_rq);' into the > > if condition to handle !fair_sched_class tasks. 
> > yes > > > > >> - update_cfs_rq_load_avg(now, cfs_rq, false); > >> - attach_entity_load_avg(cfs_rq, se); > >> - update_tg_load_avg(cfs_rq, false); > >> + attach_entity_cfs_rq(se); > >> } > >> > >> #else /* !CONFIG_SMP */ > >> @@ -8501,9 +8497,8 @@ static void detach_task_cfs_rq(struct task_struct > >> *p) > >> update_tg_load_avg(cfs_rq, false); > >> } > >> > >> -static void attach_task_cfs_rq(struct task_struct *p) > >> +static void attach_entity_cfs_rq(struct sched_entity *se) > >> { > >> - struct sched_entity *se = &p->se; > >> struct cfs_rq *cfs_rq = cfs_rq_of(se); > > > > > > Both callers of attach_entity_cfs_rq() already use cfs_rq_of(se). You > > could pass it into attach_entity_cfs_rq(). > > Yes that would make sense In fact there is a 3rd caller online_fair_sched_group which calls attach_entity_cfs_rq and doesn't already use cfs_rq_of(se) so i wonder if it's worth doing the interface change. > > > > > >> u64 now = cfs_rq_clock_task(cfs_rq); > >> @@ -8519,6 +8514,14 @@ static void attach_task_cfs_rq(struct task_struct > >> *p) > > > > > > The old comment /* Synchronize task ... */ should be changed to /* > > Synchronize entity ... */ > > yes > > > > >> update_cfs_rq_load_avg(now, cfs_rq, false); > >> attach_entity_load_avg(cfs_rq, se); > >> update_tg_load_avg(cfs_rq, false); > >> +} > >> + > >> +static void attach_task_cfs_rq(struct task_struct *p) > >> +{ > >> + struct sched_entity *se = &p->se; > >> + struct cfs_rq *cfs_rq = cfs_rq_of(se); > >> + > >> + attach_entity_cfs_rq(se); > >> > >> if (!vruntime_normalized(p)) > >> se->vruntime += cfs_rq->min_vruntime; > >>
Re: MPOL_BIND on memory only nodes
On Wed 12-10-16 16:08:48, Anshuman Khandual wrote: > On 10/12/2016 03:13 PM, Michal Hocko wrote: > > On Wed 12-10-16 14:55:24, Anshuman Khandual wrote: > >> Hi, > >> > >> We have the following function policy_zonelist() which selects a zonelist > >> during various allocation paths. With this, general user space allocations > >> (IIUC might not have __GFP_THISNODE) fails while trying to get memory from > >> a memory only node without CPUs as the application runs some where else > >> and that node is not part of the nodemask. > > My bad. Was playing with some changes to the zonelists rebuild after > a memory node hotplug and the order of various zones in them. > > > > > I am not sure I understand. So you have a task with MPOL_BIND without a > > cpu less node in the mask and you are wondering why the memory is not > > allocated from that node? > > In my experiment, there is a MPOL_BIND call with a CPU less node in > the node mask and the memory is not allocated from that CPU less node. > Thats because the zone of the CPU less node was absent from the > FALLBACK zonelist of the local node. So do I understand correctly that the issue was caused by non-upstream changes? > >> Why we insist on __GFP_THISNODE ? > > > > AFAIU __GFP_THISNODE just overrides the given node to the policy > > nodemask in case the current node is not part of that node mask. In > > other words we are ignoring the given node and use what the policy says. > > Right but provided the gfp flag has __GFP_THISNODE in it. In absence > of __GFP_THISNODE, the node from the nodemask will not be selected. In the absence of __GFP_THISNODE we will use the zonelist for the given node, and that should contain even memoryless nodes for the fallback. The nodemask from policy_nodemask() will then make sure that only nodes relevant to the policy in use are used. > I still wonder why ? Can we always go to the first node in the > nodemask for MPOL_BIND interface calls ? 
Just curious to know why > preference is given to the local node and it's FALLBACK zonelist. It is not always a local node. Look at how do_huge_pmd_wp_page_fallback tries to put all the pages on the same node. Also, we have alloc_pages_current(), which tries to allocate from the local node and should not fall back to the first node in the policy nodemask. -- Michal Hocko SUSE Labs
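To make the zonelist-plus-nodemask interaction described in this thread concrete, here is a small sketch. It is a simplification with invented names (`pick_node()`, `NO_NODE`, a zonelist reduced to an ordered array of node ids, a nodemask reduced to a bitmask) and is not how the allocator is actually written; it only assumes what the replies say: the walk follows the starting node's fallback list in order, and the policy nodemask filters which entries may be used.

```c
#include <assert.h>

#define NO_NODE (-1)

/*
 * Walk a node's fallback list in order and return the first node that
 * is allowed by the policy nodemask and still has free pages.
 */
static int pick_node(const int *zonelist, int len, unsigned long nodemask,
		     const int *free_pages)
{
	for (int i = 0; i < len; i++) {
		int nid = zonelist[i];

		if (!(nodemask & (1UL << nid)))
			continue;	/* not permitted by the policy */
		if (free_pages[nid] > 0)
			return nid;
	}
	return NO_NODE;
}
```

With a fallback list of {0, 1, 2} and a mask containing only node 2, the walk starts at the local node but still ends up on node 2. If node 2 is missing from the list, as in the experiment with the modified zonelist rebuild, no node qualifies even though the mask allows it.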
Re: [PATCH RESEND] ARM: dts: keystone-k2*: Increase SPI Flash partition size for U-Boot
Hi, On Monday 10 October 2016 08:01 PM, Russell King - ARM Linux wrote: > On Mon, Oct 10, 2016 at 07:41:41PM +0530, Vignesh R wrote: >> U-Boot SPI Boot image is now more than 512KB for Keystone2 devices and >> cannot fit into existing partition. So, increase the SPI Flash partition >> for U-Boot to 1MB for all Keystone2 devices. >> >> Signed-off-by: Vignesh R >> --- >> >> This was submitted to v4.9 merge window but was never picked up: >> https://patchwork.kernel.org/patch/9135023/ > > I think you need to explain why it's safe to change the layout of the > flash partitions like this. > > - What is this "misc" partition? > This partition seems to have existed from the very beginning. I believe this is just a spare area of flash that can be used as the end user requires, either to store a small filesystem or a kernel. Copying Murali, who added the above partition, in case he has any input here. > - Why is it safe to move the "misc" partition in this way? > > - Do users need to do anything with data stored in the "misc" partition > when changing kernels? > The MTD layer will take care of most abstractions (such as the start address). I will add a note to the commit message about the reduction in size of the partition. > If the "misc" partition is simply unused space on the flash device, why > list it in DT? > If the unused space is not listed in the DT, then no /dev/mtdX node is created for the unused section. The user will then have to edit the DT manually in order to get the node and mount it. Instead, let's make it available by default. -- Regards Vignesh
Re: [PATCH] x86/apic: Fix suspicious RCU usage in smp_trace_call_function_interrupt
2016-09-19 16:10 GMT+08:00 Peter Zijlstra : > On Thu, Sep 15, 2016 at 10:58:04AM +0200, Thomas Gleixner wrote: >> On Thu, 15 Sep 2016, Wanpeng Li wrote: >> > --- >> > arch/x86/include/asm/apic.h | 2 +- >> > 1 file changed, 1 insertion(+), 1 deletion(-) >> > >> > diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h >> > index 1243577..71c1fe2 100644 >> > --- a/arch/x86/include/asm/apic.h >> > +++ b/arch/x86/include/asm/apic.h >> > @@ -650,8 +650,8 @@ static inline void entering_ack_irq(void) >> > >> > static inline void ipi_entering_ack_irq(void) >> > { >> > - ack_APIC_irq(); >> > irq_enter(); >> > + ack_APIC_irq(); >> > } >> >> which makes ipi_entering_ack_irq() the same as entering_ack_irq() and >> therefor pointless. > > entering_ack_irq() seems to use entering_irq() instead of irq_enter(). > Which is close but not the same. This thing seems to also do > exit_idle(). > > Now, there's only a handfull of ipi_entering_ack_irq() users, and it > doesn't seem to make sense to me to only call exit_idle() on IPIs, why > don't we need to call exit_idle() on regular IRQs ?! > > All in all, that stuff is crufty and needs a cleanup I'd say. [ 116.587762] [ 116.587768] === [ 116.587770] [ INFO: suspicious RCU usage. ] [ 116.587773] 4.8.0+ #24 Not tainted [ 116.587775] --- [ 116.58] ./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage! [ 116.587779] [ 116.587779] other info that might help us debug this: [ 116.587779] [ 116.587782] [ 116.587782] RCU used illegally from idle CPU! [ 116.587782] rcu_scheduler_active = 1, debug_locks = 0 [ 116.587785] RCU used illegally from extended quiescent state! [ 116.587787] no locks held by swapper/1/0. [ 116.587788] [ 116.587788] stack backtrace: [ 116.587792] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.8.0+ #24 [ 116.587794] Hardware name: Dell Inc. 
OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015 [ 116.587796] 90285de03f58 9d44a0c9 90285ca5d100 0001 [ 116.587803] 90285de03f88 9d0ebd67 902845165410 080b [ 116.587809] 90285de03fb8 9d492b95 [ 116.587814] Call Trace: [ 116.587817][] dump_stack+0x99/0xd0 [ 116.587827] [] lockdep_rcu_suspicious+0xe7/0x120 [ 116.587832] [] do_trace_write_msr+0x135/0x140 [ 116.587836] [] native_write_msr+0x20/0x30 [ 116.587841] [] native_apic_msr_eoi_write+0x1d/0x30 [ 116.587845] [] smp_reschedule_interrupt+0x1d/0x30 [ 116.587849] [] reschedule_interrupt+0x96/0xa0 [ 116.587851][] ? cpuidle_enter_state+0xe4/0x360 [ 116.587858] [] ? cpuidle_enter_state+0xcf/0x360 [ 116.587861] [] cpuidle_enter+0x17/0x20 [ 116.587865] [] call_cpuidle+0x23/0x50 [ 116.587868] [] cpu_startup_entry+0x15c/0x280 [ 116.587872] [] start_secondary+0x154/0x180 irq_enter(), which is called in scheduler_ipi(), runs too late to tell the RCU subsystem to end the extended quiescent state before ack_APIC_irq(). Any ideas? Regards, Wanpeng Li
Re: [Intel-gfx] drm/i915: WARN_ON_ONCE(!crtc_clock || cdclk < crtc_clock)
On Wed, 2016-10-12 at 11:56 +0200, Paul Bolle wrote: > On a laptop that tracks the latest stable release (Ie, it now runs > v4.8.1) I see this WARNING > WARN_ON_ONCE(!crtc_clock || cdclk < crtc_clock) > > Full trace pasted below. I never saw this WARNING before v4.8. Since > v4.8 I've had it in all (four, actually) boots. > > What am I expected to do about this WARNING? > Bisecting the offending commit between v4.8 and v4.8.1 would be a good start. Regards, Joonas -- Joonas Lahtinen Open Source Technology Center Intel Corporation
Re: [PATCH RESEND] ARM: dts: keystone-k2*: Increase SPI Flash partition size for U-Boot
Hi, On Monday 10 October 2016 09:31 PM, Santosh Shilimkar wrote: > Vignesh, > > On 10/10/2016 7:31 AM, Russell King - ARM Linux wrote: >> On Mon, Oct 10, 2016 at 07:41:41PM +0530, Vignesh R wrote: >>> U-Boot SPI Boot image is now more than 512KB for Keystone2 devices and >>> cannot fit into existing partition. So, increase the SPI Flash partition >>> for U-Boot to 1MB for all Keystone2 devices. >>> >>> Signed-off-by: Vignesh R >>> --- >>> >>> This was submitted to v4.9 merge window but was never picked up: >>> https://patchwork.kernel.org/patch/9135023/ > > Another point is, if you want me to pick your patch, please copy > me next time :-). AFAIK, am seeing this patch in my inbox first time. > Sorry, I did address the previous patch to you. Not sure what happened :( >> >> I think you need to explain why it's safe to change the layout of the >> flash partitions like this. >> >> - What is this "misc" partition? >> >> - Why is it safe to move the "misc" partition in this way? >> >> - Do users need to do anything with data stored in the "misc" partition >> when changing kernels? >> >> If the "misc" partition is simply unused space on the flash device, why >> list it in DT? >> > Thanks Russell. Yes, above clarification would be good to get first. Ok, will send v2 with updated commit message as per my reply in other thread. -- Regards Vignesh