Re: [PATCH v3 0/7] Fixes for perf probe issues on ppc

2015-04-28 Thread Srikar Dronamraju
* Arnaldo Carvalho de Melo  [2015-04-28 10:54:53]:

> On Tue, Apr 28, 2015 at 05:35:33PM +0530, Naveen N. Rao wrote:
> > This patchset fixes various issues with perf probe on powerpc across ABIv1 
> > and
> > ABIv2:
> > - in the presence of DWARF debug-info,
> > - in the absence of DWARF, but with the symbol table, and
> > - in the absence of debug-info, but with kallsyms.
> > 
> > Arnaldo,
> > I have moved all patches to use __weak functions. Kindly take a look and 
> > let me
> > know if this is what you had in mind.
> 
> Ok, I applied all but the first, for which I am waiting for Masami's
> reaction. I kept Srikar's reviewed-by for the other patches, but would
> also like his word that it still stands after the __weak changes.
> 
> So, for now, I'll leave it for a while sitting in my local tree, to give
> time to Masami and Srikar, ok?
> 

Yes Arnaldo, I have looked at the patches after the __weak changes and
they look good to me.

> - Arnaldo
> 

-- 
Thanks and Regards
Srikar Dronamraju

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC 01/11] sched: introduce sys_cpumask in tsk to adapt asymmetric system

2014-11-12 Thread Srikar Dronamraju
* kernelf...@gmail.com  [2014-10-16 15:29:50]:

> On some systems such as powerpc, some tasks (vcpu threads) can only run
> on dedicated cpus, since we adopt an asymmetric method to monitor the
> whole physical cpu (powerKVM only allows the primary hwthread to
> set up the runtime env for the secondaries when entering a guest).
> 
> Nowadays, powerKVM runs with all the secondary hwthreads offline to
> ensure that the vcpu threads run only on the primary threads. But we
> plan to keep all cpus online when running powerKVM, to give more power
> when switching back to the host, so introduce a sys_allowed cpumask to
> reflect the cpu set on which a vcpu thread can run.
> 
> Signed-off-by: Liu Ping Fan 
> ---
>  include/linux/init_task.h |  1 +
>  include/linux/sched.h |  6 ++
>  kernel/sched/core.c   | 10 --
>  3 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index 2bb4c4f3..c56f69e 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -172,6 +172,7 @@ extern struct task_group root_task_group;
>   .normal_prio= MAX_PRIO-20,  \
>   .policy = SCHED_NORMAL, \
>   .cpus_allowed   = CPU_MASK_ALL, \
> + .sys_allowed = CPU_MASK_ALL,\

Do we really need another mask? Can't we just use cpus_allowed itself?

>   .nr_cpus_allowed= NR_CPUS,  \
>   .mm = NULL, \
>   .active_mm  = &init_mm, \
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5c2c885..ce429f3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1260,7 +1260,10 @@ struct task_struct {
>  
>   unsigned int policy;
>   int nr_cpus_allowed;
> + /* Anded user and sys_allowed */
>   cpumask_t cpus_allowed;
> + /* due to the feature of asymmetric, some tsk can only run on such cpu */
> + cpumask_t sys_allowed;
>  
>  #ifdef CONFIG_PREEMPT_RCU
>   int rcu_read_lock_nesting;
> @@ -2030,6 +2033,9 @@ static inline void tsk_restore_flags(struct task_struct *task,
>  }
>  
>  #ifdef CONFIG_SMP
> +extern void set_cpus_sys_allowed(struct task_struct *p,
> + const struct cpumask *new_mask);
> +
>  extern void do_set_cpus_allowed(struct task_struct *p,
>  const struct cpumask *new_mask);
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ec1a286..2cd1ae3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4596,13 +4596,19 @@ void init_idle(struct task_struct *idle, int cpu)
>  }
>  
>  #ifdef CONFIG_SMP
> +void set_cpus_sys_allowed(struct task_struct *p,
> + const struct cpumask *new_mask)
> +{
> + cpumask_copy(&p->sys_allowed, new_mask);
> +}
> +

This function doesn't seem to be used anywhere; I'm not sure why it is
being introduced.

>  void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
>  {
>   if (p->sched_class && p->sched_class->set_cpus_allowed)
>   p->sched_class->set_cpus_allowed(p, new_mask);
>  
> - cpumask_copy(&p->cpus_allowed, new_mask);
> - p->nr_cpus_allowed = cpumask_weight(new_mask);
> + cpumask_and(&p->cpus_allowed, &p->sys_allowed, new_mask);
> + p->nr_cpus_allowed = cpumask_weight(&p->cpus_allowed);
>  }
>  
>  /*
> -- 
> 1.8.3.1
> 
> 

-- 
Thanks and Regards
Srikar Dronamraju

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC 02/11] powerpc: kvm: ensure vcpu-thread run only on primary hwthread

2014-11-12 Thread Srikar Dronamraju
* kernelf...@gmail.com  [2014-10-16 15:29:51]:

> When a vcpu thread runs for the first time, ensure that it sticks
> to the primary thread.
> 
> Signed-off-by: Liu Ping Fan 
> ---
>  arch/powerpc/include/asm/kvm_host.h |  3 +++
>  arch/powerpc/kvm/book3s_hv.c| 17 +
>  2 files changed, 20 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 98d9dd5..9a3355e 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -666,6 +666,9 @@ struct kvm_vcpu_arch {
>   spinlock_t tbacct_lock;
>   u64 busy_stolen;
>   u64 busy_preempt;
> +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
> + bool cpu_selected;
> +#endif
>  #endif
>  };
>  
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 27cced9..ba258c8 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1909,6 +1909,23 @@ static int kvmppc_vcpu_run_hv(struct kvm_run *run, struct kvm_vcpu *vcpu)
>  {
>   int r;
>   int srcu_idx;
> +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
> + int cpu = smp_processor_id();
> + int target_cpu;
> + unsigned int cpu;

There are two variables with the same name here: cpu.

> + struct task_struct *p = current;
> +
> + if (unlikely(!vcpu->arch.cpu_selected)) {
> + vcpu->arch.cpu_selected = true;

Nit: a name like cpumask_set seems better than cpu_selected.

> + for (cpu = 0; cpu < NR_CPUS; cpu+=threads_per_core) {
> + cpumask_set_cpu(cpu, &p->sys_allowed);
Don't we need to reset the cpumask before setting bits in it here?

> + }
> + if (cpu%threads_per_core != 0) {

At this point, cpu will be NR_CPUS (or the next multiple of
threads_per_core above it), so most of the time this condition is false.
Unfortunately there is no cpu with cpu number NR_CPUS anyway.

> + target_cpu = cpu/threads_per_core*threads_per_core;

It's probably better to have parentheses here.

> + migrate_task_to(current, target_cpu);
We are probably migrating to a non-existent cpu.
Also, don't you need to check whether target_cpu is part of the cpumask?

> + }
> +     }
> +#endif
>  
>   if (!vcpu->arch.sane) {
>   run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
> -- 
> 1.8.3.1
> 
> 

-- 
Thanks and Regards
Srikar Dronamraju

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC 03/11] powerpc: kvm: add interface to control kvm function on a core

2014-11-12 Thread Srikar Dronamraju
* kernelf...@gmail.com  [2014-10-16 15:29:52]:

> When kvm is enabled on a core, we migrate all external irqs to its
> primary thread, since the kvm irq logic is currently handled by the
> primary hwthread.
> 
> Todo: this patch lacks re-enabling of irqbalance when kvm is disabled
> on the core
> 
> Signed-off-by: Liu Ping Fan 
> ---
>  arch/powerpc/kernel/sysfs.c| 39 ++
>  arch/powerpc/sysdev/xics/xics-common.c | 12 +++
>  2 files changed, 51 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> index 67fd2fd..a2595dd 100644
> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -552,6 +552,45 @@ static void sysfs_create_dscr_default(void)
>   if (cpu_has_feature(CPU_FTR_DSCR))
>   err = device_create_file(cpu_subsys.dev_root, &dev_attr_dscr_default);
>  }
> +
> +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
> +#define NR_CORES (CONFIG_NR_CPUS/threads_per_core)
> +static DECLARE_BITMAP(kvm_on_core, NR_CORES) __read_mostly
> +
> +static ssize_t show_kvm_enable(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> +}
> +
> +static ssize_t __used store_kvm_enable(struct device *dev,
> + struct device_attribute *attr, const char *buf,
> + size_t count)
> +{
> + struct cpumask stop_cpus;
> + unsigned long core, thr;
> +
> + sscanf(buf, "%lx", &core);
> + if (core > NR_CORES)
> + return -1;
> + if (!test_bit(core, &kvm_on_core))
> + for (thr = 1; thr < threads_per_core; thr++)
> + if (cpu_online(thr * threads_per_core + thr))
> + cpumask_set_cpu(thr * threads_per_core + thr, &stop_cpus);

Shouldn't this be
        if (cpu_online(core * threads_per_core + thr))
                cpumask_set_cpu(core * threads_per_core + thr, &stop_cpus);
?


-- 
Thanks and Regards
Srikar Dronamraju

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2 1/4] uprobes: add trap variant helper

2013-03-26 Thread Srikar Dronamraju
* Ananth N Mavinakayanahalli  [2013-03-22 20:46:27]:

> From: Ananth N Mavinakayanahalli 
> 
> Some architectures like powerpc have multiple variants of the trap
> instruction. Introduce an additional helper is_trap_insn() for run-time
> handling of non-uprobe traps on such architectures.
> 
> While there, change is_swbp_at_addr() to is_trap_at_addr() for reading
> clarity.
> 
> With this change, the uprobe registration path will supersede any trap
> instruction inserted at the requested location, while taking care of
> delivering the SIGTRAP for cases where the trap notification came in
> for an address without a uprobe. See [1] for a more detailed explanation.
> 
> [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2013-March/104771.html
> 
> This change was suggested by Oleg Nesterov.
> 
> Signed-off-by: Ananth N Mavinakayanahalli 

Acked-by: Srikar Dronamraju 

> ---
>  include/linux/uprobes.h |1 +
>  kernel/events/uprobes.c |   32 
>  2 files changed, 29 insertions(+), 4 deletions(-)
> 
> Index: linux-3.9-rc3/include/linux/uprobes.h
> ===
> --- linux-3.9-rc3.orig/include/linux/uprobes.h
> +++ linux-3.9-rc3/include/linux/uprobes.h
> @@ -100,6 +100,7 @@ struct uprobes_state {
>  extern int __weak set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
>  extern int __weak set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
>  extern bool __weak is_swbp_insn(uprobe_opcode_t *insn);
> +extern bool __weak is_trap_insn(uprobe_opcode_t *insn);
>  extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
>  extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
>  extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
> Index: linux-3.9-rc3/kernel/events/uprobes.c
> ===
> --- linux-3.9-rc3.orig/kernel/events/uprobes.c
> +++ linux-3.9-rc3/kernel/events/uprobes.c
> @@ -173,6 +173,20 @@ bool __weak is_swbp_insn(uprobe_opcode_t
>   return *insn == UPROBE_SWBP_INSN;
>  }
> 
> +/**
> + * is_trap_insn - check if instruction is breakpoint instruction.
> + * @insn: instruction to be checked.
> + * Default implementation of is_trap_insn
> + * Returns true if @insn is a breakpoint instruction.
> + *
> + * This function is needed for the case where an architecture has multiple
> + * trap instructions (like powerpc).
> + */
> +bool __weak is_trap_insn(uprobe_opcode_t *insn)
> +{
> + return is_swbp_insn(insn);
> +}
> +
>  static void copy_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *opcode)
>  {
>   void *kaddr = kmap_atomic(page);
> @@ -185,6 +199,15 @@ static int verify_opcode(struct page *pa
>   uprobe_opcode_t old_opcode;
>   bool is_swbp;
> 
> + /*
> +  * Note: We only check if the old_opcode is UPROBE_SWBP_INSN here.
> +  * We do not check if it is any other 'trap variant' which could
> +  * be conditional trap instruction such as the one powerpc supports.
> +  *
> +  * The logic is that we do not care if the underlying instruction
> +  * is a trap variant; uprobes always wins over any other (gdb)
> +  * breakpoint.
> +  */
>   copy_opcode(page, vaddr, &old_opcode);
>   is_swbp = is_swbp_insn(&old_opcode);
> 
> @@ -204,7 +227,7 @@ static int verify_opcode(struct page *pa
>   * Expect the breakpoint instruction to be the smallest size instruction for
>   * the architecture. If an arch has variable length instruction and the
>   * breakpoint instruction is not of the smallest length instruction
> - * supported by that architecture then we need to modify is_swbp_at_addr and
> + * supported by that architecture then we need to modify is_trap_at_addr and
>   * write_opcode accordingly. This would never be a problem for archs that
>   * have fixed length instructions.
>   */
> @@ -1431,7 +1454,7 @@ static void mmf_recalc_uprobes(struct mm
>   clear_bit(MMF_HAS_UPROBES, &mm->flags);
>  }
> 
> -static int is_swbp_at_addr(struct mm_struct *mm, unsigned long vaddr)
> +static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
>  {
>   struct page *page;
>   uprobe_opcode_t opcode;
> @@ -1452,7 +1475,8 @@ static int is_swbp_at_addr(struct mm_str
>   copy_opcode(page, vaddr, &opcode);
>   put_page(page);
>   out:
> - return is_swbp_insn(&opcode);
> + /* This needs to return true for any variant of the trap insn */

Re: [PATCH v2 2/4] uprobes: refuse uprobe on trap variants

2013-03-26 Thread Srikar Dronamraju
* Ananth N Mavinakayanahalli  [2013-03-22 20:47:58]:

> From: Ananth N Mavinakayanahalli 
> 
> Refuse to place a uprobe if a trap variant already exists in the
> file copy at the address.
> 
> Signed-off-by: Ananth N Mavinakayanahalli 

Acked-by: Srikar Dronamraju 

> ---
>  kernel/events/uprobes.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-3.9-rc3/kernel/events/uprobes.c
> ===
> --- linux-3.9-rc3.orig/kernel/events/uprobes.c
> +++ linux-3.9-rc3/kernel/events/uprobes.c
> @@ -573,7 +573,7 @@ static int prepare_uprobe(struct uprobe
>   goto out;
> 
>   ret = -ENOTSUPP;
> - if (is_swbp_insn((uprobe_opcode_t *)uprobe->arch.insn))
> + if (is_trap_insn((uprobe_opcode_t *)uprobe->arch.insn))
>   goto out;
> 
>   ret = arch_uprobe_analyze_insn(&uprobe->arch, mm, vaddr);

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH v2 3/4] uprobes/powerpc: teach uprobes to ignore gdb breakpoints

2013-03-26 Thread Srikar Dronamraju
* Ananth N Mavinakayanahalli  [2013-03-22 20:48:38]:

> From: Ananth N Mavinakayanahalli 
> 
> Powerpc has many trap variants that could be used by entities like gdb.
> Currently, running gdb on a program being traced by uprobes causes an
> endless loop since uprobes doesn't understand that the trap was inserted
> by some other entity and a SIGTRAP needs to be delivered.
> 
> Teach uprobes to ignore breakpoints that do not belong to it.
> 
> Signed-off-by: Ananth N Mavinakayanahalli 

Acked-by: Srikar Dronamraju 

> ---
>  arch/powerpc/kernel/uprobes.c |   10 ++
>  1 file changed, 10 insertions(+)
> 
> Index: linux-3.9-rc3/arch/powerpc/kernel/uprobes.c
> ===
> --- linux-3.9-rc3.orig/arch/powerpc/kernel/uprobes.c
> +++ linux-3.9-rc3/arch/powerpc/kernel/uprobes.c
> @@ -31,6 +31,16 @@
>  #define UPROBE_TRAP_NR   UINT_MAX
> 
>  /**
> + * is_trap_insn - check if the instruction is a trap variant
> + * @insn: instruction to be checked.
> + * Returns true if @insn is a trap variant.
> + */
> +bool is_trap_insn(uprobe_opcode_t *insn)
> +{
> + return (is_trap(*insn));
> +}
> +
> +/**
>   * arch_uprobe_analyze_insn
>   * @mm: the probed address space.
>   * @arch_uprobe: the probepoint information.

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH v2 4/4] uprobes/powerpc: remove additional trap instruction check

2013-03-26 Thread Srikar Dronamraju
* Ananth N Mavinakayanahalli  [2013-03-22 20:49:46]:

> From: Ananth N Mavinakayanahalli 
> 
> prepare_uprobe() already checks if the underlying instruction
> (on file) is a trap variant. We don't need to check this again.
> 
> Signed-off-by: Ananth N Mavinakayanahalli 

Acked-by: Srikar Dronamraju 

> ---
>  arch/powerpc/kernel/uprobes.c |6 --
>  1 file changed, 6 deletions(-)
> 
> Index: linux-3.9-rc3/arch/powerpc/kernel/uprobes.c
> ===
> --- linux-3.9-rc3.orig/arch/powerpc/kernel/uprobes.c
> +++ linux-3.9-rc3/arch/powerpc/kernel/uprobes.c
> @@ -53,12 +53,6 @@ int arch_uprobe_analyze_insn(struct arch
>   if (addr & 0x03)
>   return -EINVAL;
> 
> - /*
> -  * We currently don't support a uprobe on an already
> -  * existing breakpoint instruction underneath
> -  */
> - if (is_trap(auprobe->ainsn))
> - return -ENOTSUPP;
>   return 0;
>  }
> 

-- 
Thanks and Regards
Srikar Dronamraju



Panic on ppc64 with numa_balancing and !sparsemem_vmemmap

2014-02-19 Thread Srikar Dronamraju

On a powerpc machine with CONFIG_NUMA_BALANCING=y and CONFIG_SPARSEMEM_VMEMMAP
not enabled, the kernel panics.

This is true of kernel versions 3.13 up to the latest commit 960dfc4, which is
3.14-rc3+, i.e. the recent three fixups from Aneesh don't seem to help this case.

Sometimes it fails during boot itself; otherwise a kernel compile is enough to
trigger it. I am seeing this on a Power 7 box.

Kernel 3.14.0-rc3-mainline_v313-00168-g960dfc4 on an ppc64

transam2s-lp1 login: qla2xxx [0003:01:00.1]-8038:2: Cable is unplugged...
Unable to handle kernel paging request for data at address 0x0457
Faulting instruction address: 0xc00d6004
cpu 0x38: Vector: 300 (Data Access) at [c0171561f700]
pc: c00d6004: .task_numa_fault+0x604/0xa30
lr: c00d62fc: .task_numa_fault+0x8fc/0xa30
sp: c0171561f980
   msr: 80009032
   dar: 457
 dsisr: 4000
  current = 0xc017155d9b00
  paca= 0xcec1e000   softe: 0irq_happened: 0x00
pid   = 16898, comm = gzip
enter ? for help
[c0171561fa70] c01b0fb0 .do_numa_page+0x1b0/0x2a0
[c0171561fb20] c01b2788 .handle_mm_fault+0x538/0xca0
[c0171561fc00] c082f498 .do_page_fault+0x378/0x880
[c0171561fe30] c0009568 handle_page_fault+0x10/0x30
--- Exception: 301 (Data Access) at 100031d8
SP (3fffd45ea2d0) is in userspace
38:mon>


(gdb) list *(task_numa_fault+0x604)
0xc00d6004 is in task_numa_fault (/home/srikar/work/linux.git/include/linux/mm.h:753).
748 return cpupid_to_cpu(cpupid) == (-1 & LAST__CPU_MASK);
749 }
750
751 static inline bool __cpupid_match_pid(pid_t task_pid, int cpupid)
752 {
753 return (task_pid & LAST__PID_MASK) == cpupid_to_pid(cpupid);
754 }
755
756 #define cpupid_match_pid(task, cpupid) __cpupid_match_pid(task->pid, 
cpupid)
757 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
(gdb) 


However, this doesn't seem to happen if CONFIG_SPARSEMEM_VMEMMAP=y is set in
the config.


-- 
Thanks and Regards
Srikar Dronamraju



Tasks stuck in futex code (in 3.14-rc6)

2014-03-19 Thread Srikar Dronamraju
Hi, 

When running specjbb on a power7 numa box, I am seeing java threads
getting stuck in futex calls:

#  ps -Ao pid,tt,user,fname,tmout,f,wchan | grep futex
 14808 pts/0root java - 0 futex_wait_queue_me
 14925 pts/0root java - 0 futex_wait_queue_me
#

In the stack traces, I see:
[ 1843.426591] Call Trace:
[ 1843.426595] [c017101d74d0] [0020] 0x20 (unreliable)
[ 1843.426601] [c017101d76a0] [c0014c50] .__switch_to+0x1e0/0x390
[ 1843.426607] [c017101d7750] [c06ed314] .__schedule+0x364/0x8c0
[ 1843.426613] [c017101d79d0] [c0139c28] 
.futex_wait_queue_me+0xf8/0x1a0
[ 1843.426619] [c017101d7a60] [c013afbc] .futex_wait+0x17c/0x2a0
[ 1843.426626] [c017101d7c10] [c013cee4] .do_futex+0x254/0xd80
[ 1843.426631] [c017101d7d60] [c013db2c] .SyS_futex+0x11c/0x1d0
[ 1843.426638] [c017101d7e30] [c0009efc] syscall_exit+0x0/0x7c
[ 1843.426643] javaS 3fffa08b16a0 0 14812  14203 0x0080
[ 1843.426650] Call Trace:
[ 1843.426653] [c0170c6034d0] [c01710b09cf8] 0xc01710b09cf8 
(unreliable)
[ 1843.426660] [c0170c6036a0] [c0014c50] .__switch_to+0x1e0/0x390
[ 1843.42] [c0170c603750] [c06ed314] .__schedule+0x364/0x8c0
[ 1843.426672] [c0170c6039d0] [c0139c28] 
.futex_wait_queue_me+0xf8/0x1a0
[ 1843.426679] [c0170c603a60] [c013afbc] .futex_wait+0x17c/0x2a0
[ 1843.453383] [c0170c603c10] [c013cee4] .do_futex+0x254/0xd80
[ 1843.453389] [c0170c603d60] [c013db2c] .SyS_futex+0x11c/0x1d0
[ 1843.453395] [c0170c603e30] [c0009efc] syscall_exit+0x0/0x7c
[ 1843.453400] javaS 3fffa08b1a74 0 14813  14203 0x0080
[ 1843.453407] Call Trace:

There are 332 tasks all stuck in futex_wait_queue_me().
I am able to reproduce this consistently.

In fact, I can reproduce this if the java_constraint is either node,
socket, or system. However, I am not able to reproduce it if
java_constraint is set to core.

I ran git bisect between v3.12 and v3.14-rc6 and found that

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b0c29f79ecea0b6fbcefc999e70f2843ae8306db

commit b0c29f79ecea0b6fbcefc999e70f2843ae8306db
Author: Davidlohr Bueso 
Date:   Sun Jan 12 15:31:25 2014 -0800

futexes: Avoid taking the hb->lock if there's nothing to wake up

was the commit causing the threads to be stuck in futex.

I reverted b0c29f79ecea0b6fbcefc999e70f2843ae8306db on top of v3.14-rc6
and confirmed that reverting the commit solved the problem.

-- 
Thanks and Regards
Srikar Dronamraju


Re: Tasks stuck in futex code (in 3.14-rc6)

2014-03-19 Thread Srikar Dronamraju
> > 
> > In fact, I can reproduce this if the java_constraint is either node,
> > socket, or system.
> > However, I am not able to reproduce it if java_constraint is set to core.
> 
> What's any of that mean?
> 

Using the constraint, one can specify how many jvm instances should
participate in the specjbb run.

For example on a 4 node box, I can say 2 jvms per constraint with
constraint set to node and specjbb will run with 8 instances of java.

I was running with 1 jvm per constraint. But when I set the constraint
to node/system, I keep seeing this problem. However, if I set the
constraint to core (which means running more instances of java), the
problem is not seen. My guess is that the fewer the java instances, the
easier it is to reproduce.

-- 
Thanks and Regards
Srikar Dronamraju


Re: Tasks stuck in futex code (in 3.14-rc6)

2014-03-19 Thread Srikar Dronamraju
> > Joy,.. let me look at that with ppc in mind.
> 
> OK; so while pretty much all the comments from that patch are utter
> nonsense (what was I thinking), I cannot actually find a real bug.
> 
> But could you try the below which replaces a control dependency with a
> full barrier. The control flow is plenty convoluted that I think the
> control barrier isn't actually valid anymore and that might indeed
> explain the fail.
> 

Unfortunately the patch didn't help. Still seeing tasks stuck:

# ps -Ao pid,tt,user,fname,tmout,f,wchan | grep futex
14680 pts/0root java - 0 futex_wait_queue_me
14797 pts/0root java - 0 futex_wait_queue_me
# :> /var/log/messages
# echo t > /proc/sysrq-trigger 
# grep futex_wait_queue_me /var/log/messages | wc -l 
334
#

[ 6904.211478] Call Trace:
[ 6904.211481] [c00fa1f1b4d0] [0020] 0x20 (unreliable)
[ 6904.211486] [c00fa1f1b6a0] [c0015208] .__switch_to+0x1e8/0x330
[ 6904.211491] [c00fa1f1b750] [c0702f00] .__schedule+0x360/0x8b0
[ 6904.211495] [c00fa1f1b9d0] [c0147348] 
.futex_wait_queue_me+0xf8/0x1a0
[ 6904.211500] [c00fa1f1ba60] [c01486dc] .futex_wait+0x17c/0x2a0
[ 6904.211505] [c00fa1f1bc10] [c014a614] .do_futex+0x254/0xd80
[ 6904.211510] [c00fa1f1bd60] [c014b25c] .SyS_futex+0x11c/0x1d0
[ 6904.238874] [c00fa1f1be30] [c000a0fc] syscall_exit+0x0/0x7c
[ 6904.238879] javaS 3fff825f6044 0 14682  14076 0x0080

Is there any other information I can provide that would help?

-- 
Thanks and Regards
Srikar Dronamraju


Re: Tasks stuck in futex code (in 3.14-rc6)

2014-03-20 Thread Srikar Dronamraju
> This problem suggests that we missed a wakeup for a task that was adding
> itself to the queue in a wait path. And the only place that can happen
> is with the hb spinlock check for any pending waiters. Just in case we
> missed some assumption about checking the hash bucket spinlock as a way
> of detecting any waiters (powerpc?), could you revert this commit and
> try the original atomic operations variant:
> 
> https://lkml.org/lkml/2013/12/19/630

I think the above url and the commit id that I reverted, i.e.
git://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b0c29f79ecea0b6fbcefc999
are the same.

Or am I missing something? 

-- 
Thanks and Regards
Srikar Dronamraju


Re: Tasks stuck in futex code (in 3.14-rc6)

2014-03-20 Thread Srikar Dronamraju
> 
> Ok, so a big reason why this patch doesn't apply cleanly after reverting
> is because *most* of the changes were done at the top of the file with
> regards to documenting the ordering guarantees, the actual code changes
> are quite minimal.
> 
> I reverted commits 99b60ce6 (documentation) and b0c29f79 (the offending
> commit), and then I cleanly applied the equivalent ones from v3 of the
> series (which was already *tested* and ready for upstream until you
> suggested looking into the alternative spinlock approach):
> 
> https://lkml.org/lkml/2013/12/19/624
> https://lkml.org/lkml/2013/12/19/630

I reverted commits 99b60ce6 and b0c29f79, then applied the patches from
the above urls. The last one had a reject, but it was straightforward to
resolve. After this, specjbb completes.

So reverting and applying v3 3/4 and 4/4 patches works for me.

-- 
Thanks and Regards
Srikar Dronamraju


Re: Tasks stuck in futex code (in 3.14-rc6)

2014-03-21 Thread Srikar Dronamraju

> > So reverting and applying v3 3/4 and 4/4 patches works for me.
> 
> Ok, I verified that the above ends up resulting in the same tree as
> the minimal patch I sent out, modulo (a) some comments and (b) an
> #ifdef CONFIG_SMP in futex_get_mm() that doesn't really matter.
> 
> So I committed the minimal patch with your tested-by.
> 

Just to be sure, I have verified with the latest mainline, with HEAD at
commit 08edb33c4 (Merge branch 'drm-fixes' of
git://people.freedesktop.org/~airlied/linux), which also has commit
11d4616bd0 ("futex: revert back to the explicit waiter counting code").

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH] sched/cpuacct: Check for NULL when using task_pt_regs()

2016-04-06 Thread Srikar Dronamraju
* Anton Blanchard  [2016-04-06 21:59:50]:

> Looks good, and the patch below does fix the oops for me.
> 
> Anton
> --
> 
> task_pt_regs() can return NULL for kernel threads, so add a check.
> This fixes an oops at boot on ppc64.
> 
> Signed-off-by: Anton Blanchard 

Works for me too.

Reported-and-Tested-by: Srikar Dronamraju 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

2019-03-18 Thread Srikar Dronamraju
> > node 0 (because firmware doesn't provide the distance information for
> > memoryless/cpuless nodes):
> > 
> >   node   0   1   2   3
> > 0:  10  40  10  10
> > 1:  40  10  40  40
> > 2:  10  40  10  10
> > 3:  10  40  10  10
> 
> *groan*... what does it do for things like percpu memory? ISTR the
> per-cpu chunks are all allocated early too. Having them all use memory
> out of node-0 would seem sub-optimal.

In the specific failing case, there is only one node with memory; all
other nodes are cpu-only nodes.

However, in the generic case, since this is just a cpu hotplug op, the
memory allocated early for per-cpu chunks would remain.

Maybe Michael Ellerman can correct me here.

> 
> > We should have:
> > 
> >   node   0   1   2   3
> > 0:  10  40  40  40
> > 1:  40  10  40  40
> > 2:  40  40  10  40
> > 3:  40  40  40  10
> 
> Can it happen that it introduces a new distance in the table? One that
> hasn't been seen before? This example only has 10 and 40, but suppose
> the new node lands at distance 20 (or 80); can such a thing happen?
> 
> If not; why not?

Yes, distances can be 20, 40 or 80. There is nothing that forces the
node distance to always be 40.

> So you're relying on sched_domain_numa_masks_set/clear() to fix this up,
> but that in turn relies on the sched_domain_numa_levels thing to stay
> accurate.
> 
> This all seems very fragile and unfortunate.
> 

Any reasons why this is fragile?

-- 
Thanks and Regards
Srikar Dronamraju



Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

2019-03-18 Thread Srikar Dronamraju
* Laurent Vivier  [2019-03-15 12:12:45]:

> 
> Another way to avoid the nodes overlapping for the offline nodes at
> startup is to ensure the default values don't define a distance that
> merge all offline nodes into node 0.
> 
> A powerpc specific patch can workaround the kernel crash by doing this:
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 87f0dd0..3ba29bb 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -623,6 +623,7 @@ static int __init parse_numa_properties(void)
> struct device_node *memory;
> int default_nid = 0;
> unsigned long i;
> +   int nid, dist;
> 
> if (numa_enabled == 0) {
> printk(KERN_WARNING "NUMA disabled by user\n");
> @@ -636,6 +637,10 @@ static int __init parse_numa_properties(void)
> 
> dbg("NUMA associativity depth for CPU/Memory: %d\n",
> min_common_depth);
> 
> +   for (nid = 0; nid < MAX_NUMNODES; nid ++)
> +   for (dist = 0; dist < MAX_DISTANCE_REF_POINTS; dist++)
> +   distance_lookup_table[nid][dist] = nid;
> +

The only reason this would have worked in the specific case is that we
are overriding the distance_lookup_table with a unique value per node,
so node_distance() for any node other than itself returns the maximum
distance, which is 40 in this case (since distance_ref_points_depth
is 2).

I am not sure this will work if the node distance between the two nodes
happens to be 20.

> /*
>  * Even though we connect cpus to numa domains later in SMP
>  * init, we need to know the node ids now. This is because
> 

-- 
Thanks and Regards
Srikar Dronamraju



[PATCH 0/3] Offline memoryless cpuless node 0

2020-03-11 Thread Srikar Dronamraju
A Linux kernel configured with CONFIG_NUMA on a system with multiple
possible nodes marks node 0 as online at boot. However, in practice
there are systems which have node 0 as memoryless and cpuless.

This can cause:
1. numa_balancing to be enabled on systems with only one online node.
2. The existence of a dummy (cpuless and memoryless) node, which can
confuse users/scripts looking at the output of lscpu / numactl.

This patchset corrects this anomaly.

This should only affect systems that have CONFIG_MEMORYLESS_NODES.
Currently only two architectures, ia64 and powerpc, have this config.

v5.6-rc4
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
--
 /sys/devices/system/node/online:0,2
 /proc/sys/kernel/numa_balancing:1
 /sys/devices/system/node/has_cpu:   2
 /sys/devices/system/node/has_memory:2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:  0-31

v5.6-rc4 + patches
--
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
--
/sys/devices/system/node/online:2
/proc/sys/kernel/numa_balancing:0
/sys/devices/system/node/has_cpu:   2
/sys/devices/system/node/has_memory:2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:  0-31

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 

Srikar Dronamraju (3):
  powerpc/numa: Set numa_node for all possible cpus
  powerpc/numa: Prefer node id queried from vphn
  mm/page_alloc: Keep memoryless cpuless node 0 offline

 arch/powerpc/mm/numa.c | 32 ++--
 mm/page_alloc.c|  4 +++-
 2 files changed, 25 insertions(+), 11 deletions(-)

-- 
1.8.3.1



[PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-11 Thread Srikar Dronamraju
A Powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 does not have any cpus
or memory attached to it. As per PAPR, node affinity of a cpu is only
available once it is present / online. For all cpus that are possible but
not present, cpu_to_node() would point to node 0.

To ensure a cpuless, memoryless dummy node is not online, powerpc needs
to make sure cpu_to_node() for all possible but not present cpus is set to
a proper node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8a399db..54dcd49 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -931,8 +931,20 @@ void __init mem_topology_setup(void)
 
reset_numa_cpu_lookup_table();
 
-   for_each_present_cpu(cpu)
-   numa_setup_cpu(cpu);
+   for_each_possible_cpu(cpu) {
+   /*
+* Powerpc with CONFIG_NUMA always used to have a node 0,
+* even if it was memoryless or cpuless. For all cpus that
+* are possible but not present, cpu_to_node() would point
+* to node 0. To remove a cpuless, memoryless dummy node,
+* powerpc need to make sure all possible but not present
+* cpu_to_node are set to a proper node.
+*/
+   if (cpu_present(cpu))
+   numa_setup_cpu(cpu);
+   else
+   set_cpu_numa_node(cpu, first_online_node);
+   }
 }
 
 void __init initmem_init(void)
-- 
1.8.3.1



[PATCH 2/3] powerpc/numa: Prefer node id queried from vphn

2020-03-11 Thread Srikar Dronamraju
The node id queried from the static device tree may not be correct. For
example, it may always show 0 on a shared processor. Hence prefer the node
id queried from vphn, and fall back on the device-tree-based node id if the
vphn query fails.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 54dcd49..8735fed 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -719,20 +719,20 @@ static int __init parse_numa_properties(void)
 */
for_each_present_cpu(i) {
struct device_node *cpu;
-   int nid;
-
-   cpu = of_get_cpu_node(i, NULL);
-   BUG_ON(!cpu);
-   nid = of_node_to_nid_single(cpu);
-   of_node_put(cpu);
+   int nid = vphn_get_nid(i);
 
/*
 * Don't fall back to default_nid yet -- we will plug
 * cpus into nodes once the memory scan has discovered
 * the topology.
 */
-   if (nid < 0)
-   continue;
+   if (nid == NUMA_NO_NODE) {
+   cpu = of_get_cpu_node(i, NULL);
+   if (cpu) {
+   nid = of_node_to_nid_single(cpu);
+   of_node_put(cpu);
+   }
+   }
node_set_online(nid);
}
 
-- 
1.8.3.1



[PATCH 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-03-11 Thread Srikar Dronamraju
Currently a Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes marks node 0 as online at boot. However, in practice there
are systems which have node 0 as memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy node, which is cpuless
and memoryless, can confuse users/scripts looking at the output of lscpu /
numactl.

Let's stop assuming that node 0 is always online.

v5.6-rc4
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
--
 /sys/devices/system/node/online:0,2
 /proc/sys/kernel/numa_balancing:1
 /sys/devices/system/node/has_cpu:   2
 /sys/devices/system/node/has_memory:2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:  0-31

v5.6-rc4 + patch
--
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
--
/sys/devices/system/node/online:2
/proc/sys/kernel/numa_balancing:0
/sys/devices/system/node/has_cpu:   2
/sys/devices/system/node/has_memory:2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:  0-31

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Signed-off-by: Srikar Dronamraju 
---
 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb75..68e635f4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -116,8 +116,10 @@ struct pcpu_drain {
  */
 nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+   [N_ONLINE] = NODE_MASK_NONE,
+#else
[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
[N_NORMAL_MEMORY] = { { [0] = 1UL } },
 #ifdef CONFIG_HIGHMEM
[N_HIGH_MEMORY] = { { [0] = 1UL } },
-- 
1.8.3.1



Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-11 Thread Srikar Dronamraju
* Michal Hocko  [2020-03-11 12:57:35]:

> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> > A Powerpc system with multiple possible nodes and with CONFIG_NUMA
> > enabled always used to have a node 0, even if node 0 does not any cpus
> > or memory attached to it. As per PAPR, node affinity of a cpu is only
> > available once its present / online. For all cpus that are possible but
> > not present, cpu_to_node() would point to node 0.
> > 
> > To ensure a cpuless, memoryless dummy node is not online, powerpc need
> > to make sure all possible but not present cpu_to_node are set to a
> > proper node.
> 
> Just curious, is this somehow related to
> http://lkml.kernel.org/r/20200227182650.gg3...@dhcp22.suse.cz?
> 

The issue I am trying to fix has been a known issue in Powerpc for many
years, so this is surely not a problem introduced by a75056fc1e7c
("mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node").

I tried v5.6-rc4 + a75056fc1e7c but didn't face any issues booting the
kernel. I will work with Sachin/Abdul (the reporters of the issue).


> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: linux...@kvack.org
> > Cc: linux-ker...@vger.kernel.org
> > Cc: Michal Hocko 
> > Cc: Mel Gorman 
> > Cc: Vlastimil Babka 
> > Cc: "Kirill A. Shutemov" 
> > Cc: Christopher Lameter 
> > Cc: Michael Ellerman 
> > Cc: Andrew Morton 
> > Cc: Linus Torvalds 
> > Signed-off-by: Srikar Dronamraju 
> > ---
> >  arch/powerpc/mm/numa.c | 16 ++--
> >  1 file changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 8a399db..54dcd49 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -931,8 +931,20 @@ void __init mem_topology_setup(void)
> >  
> > reset_numa_cpu_lookup_table();
> >  
> > -   for_each_present_cpu(cpu)
> > -   numa_setup_cpu(cpu);
> > +   for_each_possible_cpu(cpu) {
> > +   /*
> > +* Powerpc with CONFIG_NUMA always used to have a node 0,
> > +* even if it was memoryless or cpuless. For all cpus that
> > +* are possible but not present, cpu_to_node() would point
> > +* to node 0. To remove a cpuless, memoryless dummy node,
> > +* powerpc need to make sure all possible but not present
> > +* cpu_to_node are set to a proper node.
> > +*/
> > +   if (cpu_present(cpu))
> > +   numa_setup_cpu(cpu);
> > +   else
> > +   set_cpu_numa_node(cpu, first_online_node);
> > +   }
> >  }
> >  
> >  void __init initmem_init(void)
> > -- 
> > 1.8.3.1
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-12 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-12 10:30:50]:

> On 3/12/20 9:23 AM, Sachin Sant wrote:
> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju  
> >> wrote:
> >> * Michal Hocko  [2020-03-11 12:57:35]:
> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
> >>>> to make sure all possible but not present cpu_to_node are set to a
> >>>> proper node.
> >>> 
> >>> Just curious, is this somehow related to
> >>> http://lkml.kernel.org/r/20200227182650.gg3...@dhcp22.suse.cz?
> >>> 
> >> 
> >> The issue I am trying to fix is a known issue in Powerpc since many years.
> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
> >> shrinker_map on appropriate NUMA node"). 
> >> 
> >> I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the
> >> kernel. Will work with Sachin/Abdul (reporters of the issue).

I had used v1 and not v2. So my mistake.

> > I applied this 3 patch series on top of March 11 next tree (commit 
> > d44a64766795 )
> > The kernel still fails to boot with same call trace.
> 

While I am not an expert in the slub area, I looked at the patch
a75056fc1e7c and had some thoughts on why this could be causing this issue.

On the system where the crash happens, the number of possible nodes is much
greater than the number of online nodes. The pgdat, i.e. the NODE_DATA, is
only available for online nodes.

With a75056fc1e7c's memcg_alloc_shrinker_maps, we end up calling
kzalloc_node for all possible nodes, and in ___slab_alloc we end up looking
at node_present_pages, which is NODE_DATA(nid)->node_present_pages.
I.e. for a node whose pgdat struct is not allocated, we end up
dereferencing a NULL pointer.

Also, for a memoryless/cpuless node or possible but not present nodes,
node_to_mem_node(node) will still end up as node (at least on powerpc).

I tried with this hunk below and it works.

But I am not sure if we need to check at other places where
node_present_pages is being called.

diff --git a/mm/slub.c b/mm/slub.c
index 626cbcbd977f..bddb93bed55e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t 
gfpflags, int node,
if (unlikely(!node_match(page, node))) {
int searchnode = node;
 
-   if (node != NUMA_NO_NODE && !node_present_pages(node))
-   searchnode = node_to_mem_node(node);
-
+   if (node != NUMA_NO_NODE) {
+   if (!node_online(node) || !node_present_pages(node)) {
+   searchnode = node_to_mem_node(node);
+   if (!node_online(searchnode))
+   searchnode = first_online_node;
+   }
+   }
if (unlikely(!node_match(page, searchnode))) {
stat(s, ALLOC_NODE_MISMATCH);
deactivate_slab(s, page, c->freelist, c);

> > 
> 

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-12 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-12 14:51:38]:

> > * Vlastimil Babka  [2020-03-12 10:30:50]:
> > 
> >> On 3/12/20 9:23 AM, Sachin Sant wrote:
> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju 
> >> >>  wrote:
> >> >> * Michal Hocko  [2020-03-11 12:57:35]:
> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> >> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
> >> >>>> to make sure all possible but not present cpu_to_node are set to a
> >> >>>> proper node.
> >> >>> 
> >> >>> Just curious, is this somehow related to
> >> >>> http://lkml.kernel.org/r/20200227182650.gg3...@dhcp22.suse.cz?
> >> >>> 
> >> >> 
> >> >> The issue I am trying to fix is a known issue in Powerpc since many 
> >> >> years.
> >> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: 
> >> >> allocate
> >> >> shrinker_map on appropriate NUMA node"). 
> >> >> 
> > 
> > While I am not an expert in the slub area, I looked at the patch
> > a75056fc1e7c and had some thoughts on why this could be causing this issue.
> > 
> > On the system where the crash happens, the possible number of nodes is much
> > greater than the number of onlined nodes. The pdgat or the NODE_DATA is only
> > available for onlined nodes.
> > 
> > With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node
> > for all possible nodes and in ___slab_alloc we end up looking at the
> > node_present_pages which is NODE_DATA(nid)->node_present_pages.
> > i.e for a node whose pdgat struct is not allocated, we are trying to
> > dereference.
> 
> From what we saw, the pgdat does exist, the problem is that slab's per-node 
> data
> doesn't exist for a node that doesn't have present pages, as it would be a 
> waste
> of memory.

Just to be clear
Before my 3 patches to fix dummy node:
srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
0-31
srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
0-1

> 
> Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In
> Sachin's first report [1] we have
> 
> [0.00] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
> [0.00] numa: NODE_DATA(0) on node 1
> [0.00] numa:   NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff]
> 

So even if the pgdat exists for nodes 0 and 1, there is no pgdat for the
remaining 30 nodes.

> But in this thread, with your patches Sachin reports:

and with my patches
srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
0-31
srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
1

> 
> [0.00] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
> 

so we only see one pgdat.

> So I assume it's just node 1. In that case, node_present_pages is really 
> dangerous.
> 
> [1]
> https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/
> 
> > Also for a memoryless/cpuless node or possible but not present nodes,
> > node_to_mem_node(node) will still end up as node (atleast on powerpc).
> 
> I think that's the place where this would be best to fix.
> 

Maybe. I thought about it, but the current set_numa_mem semantics are apt
for a memoryless cpu node and not for possible nodes. We could have up to
256 possible nodes and only 2 nodes (1,2) with cpus and 1 node (1) with
memory. node_to_mem_node seems to return what is set in set_numa_mem().
set_numa_mem() seems to say: set the numa_mem node for the current
memoryless node to the param passed.

But how do we set numa_mem for all the other possible nodes, which will
probably have 0 as the default?

Should we introduce another API so that we could update it for all possible
nodes?

> > I tried with this hunk below and it works.
> > 
> > But I am not sure if we need to check at other places were
> > node_present_pages is being called.
> 
> I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it
> return only nodes that are online with present memory?
> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: 
> add
> support for node_to_mem_node() to determine the fallback node")
> 

Agree 

> I think we do need well defined and documented rules around 
> node_to_mem_node(),
> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so
> everyone handles it the same, safe way.
> 

Another option would be to tweak Kirill Tkhai's patch such that we call
kvmalloc_node()/kzalloc_node() if the node is online and call
kvmalloc/kvzalloc if the node is offline.

Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-13 Thread Srikar Dronamraju
* Joonsoo Kim  [2020-03-13 18:47:49]:

> > >>
> > >> > Also for a memoryless/cpuless node or possible but not present nodes,
> > >> > node_to_mem_node(node) will still end up as node (atleast on powerpc).
> > >>
> > >> I think that's the place where this would be best to fix.
> > >>
> > >
> > > Maybe. I thought about it but the current set_numa_mem semantics are apt
> > > for memoryless cpu node and not for possible nodes.  We could have upto 
> > > 256
> > > possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memory.
> > > node_to_mem_node seems to return what is set in set_numa_mem().
> > > set_numa_mem() seems to say set my numa_mem node for the current 
> > > memoryless
> > > node to the param passed.
> > >
> > > But how do we set numa_mem for all the other 253 possible nodes, which
> > > probably will have 0 as default?
> > >
> > > Should we introduce another API such that we could update for all possible
> > > nodes?
> >
> > If we want to rely on node_to_mem_node() to give us something safe for each
> > possible node, then probably it would have to be like that, yeah.
> >
> > >> > I tried with this hunk below and it works.
> > >> >
> > >> > But I am not sure if we need to check at other places were
> > >> > node_present_pages is being called.
> > >>
> > >> I think this seems to defeat the purpose of node_to_mem_node()? 
> > >> Shouldn't it
> > >> return only nodes that are online with present memory?
> > >> CCing Joonsoo who seems to have introduced this in ad2c8144418c 
> > >> ("topology: add
> > >> support for node_to_mem_node() to determine the fallback node")
> > >>
> > >
> > > Agree
> 
> I lost all the memory about it. :)
> Anyway, how about this?
> 
> 1. make node_present_pages() safer
> static inline node_present_pages(nid)
> {
> if (!node_online(nid)) return 0;
> return (NODE_DATA(nid)->node_present_pages);
> }
> 

Yes this would help.

> 2. make node_to_mem_node() safer for all cases
> In ppc arch's mem_topology_setup(void)
> for_each_present_cpu(cpu) {
>  numa_setup_cpu(cpu);
>  mem_node = node_to_mem_node(numa_mem_id());
>  if (!node_present_pages(mem_node)) {
>   _node_numa_mem_[numa_mem_id()] = first_online_node;
>  }
> }
> 

But here, as discussed above, we miss the case of possible but not present
nodes. For such nodes, the above change may not update their entries, so
they would still have 0. And node 0 itself can be possible but not present.

-- 
Thanks and Regards
Srikar Dronamraju



Re: [5.6.0-rc2-next-20200218/powerpc] Boot failure on POWER9

2020-03-13 Thread Srikar Dronamraju
* Michael Ellerman  [2020-03-13 21:48:06]:

> Sachin Sant  writes:
> >> The patch below might work. Sachin can you test this? I tried faking up
> >> a system with a memoryless node zero but couldn't get it to even start
> >> booting.
> >> 
> > The patch did not help. The kernel crashed during
> > the boot with the same call trace.
> >
> > BUG_ON() introduced with the patch was not triggered.
> 
> OK, that's weird.
> 
> I eventually managed to get a memoryless node going in sim, and it
> appears to work there.
> 
> eg in dmesg:
> 
>   [0.00][T0] numa:   NODE_DATA [mem 0x2000fffa2f80-0x2000fffa7fff]
>   [0.00][T0] numa: NODE_DATA(0) on node 1
>   [0.00][T0] numa:   NODE_DATA [mem 0x2000fff9df00-0x2000fffa2f7f]
>   ...
>   [0.00][T0] Early memory node ranges
>   [0.00][T0]   node   1: [mem 
> 0x-0x]
>   [0.00][T0]   node   1: [mem 
> 0x2000-0x2000]
>   [0.00][T0] Could not find start_pfn for node 0
>   [0.00][T0] Initmem setup node 0 [mem 
> 0x-0x]
>   [0.00][T0] On node 0 totalpages: 0
>   [0.00][T0] Initmem setup node 1 [mem 
> 0x-0x2000]
>   [0.00][T0] On node 1 totalpages: 131072
>   
>   # dmesg | grep set_numa
>   [0.00][T0] set_numa_mem: mem node for 0 = 1
>   [0.005654][T0] set_numa_mem: mem node for 1 = 1
> 
> So is the problem more than just node zero having no memory?
> 

The problem happens with possible nodes which are not yet present, i.e.
nodes with no cpus and no memory attached.

Please look at
http://lore.kernel.org/lkml/20200312131438.gb3...@linux.vnet.ibm.com/t/#u
for more details.

The summary being: the pgdat/NODE_DATA for such nodes is not allocated.
Hence node_present_pages(nid), called where nid is a possible but not yet
present node, fails. Currently node_present_pages(nid) and
node_to_mem_node() don't seem to be equipped to handle possible but not
present nodes.

> cheers

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-13 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-12 17:41:58]:

> On 3/12/20 5:13 PM, Srikar Dronamraju wrote:
> > * Vlastimil Babka  [2020-03-12 14:51:38]:
> > 
> >> > * Vlastimil Babka  [2020-03-12 10:30:50]:
> >> > 
> >> >> On 3/12/20 9:23 AM, Sachin Sant wrote:
> >> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju 
> >> >> >>  wrote:
> >> >> >> * Michal Hocko  [2020-03-11 12:57:35]:
> >> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> >> I think we do need well defined and documented rules around 
> >> node_to_mem_node(),
> >> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so
> >> everyone handles it the same, safe way.
> 
> So let's try to brainstorm how this would look like? What I mean are some 
> rules
> like below, even if some details in my current understanding are most likely
> incorrect:
> 

Agree.

> with nid present in:
> N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online
> node with memory so that we don't require everyone to search for it in 
> slightly
> different ways
> N_ONLINE - pgdat must exist, there doesn't have to be present memory,
> node_to_mem_node() still has to return something else (?)

Right, I think this has been taken care of at this time.

> N_NORMAL_MEMORY - there is present memory, node_to_mem_node() returns itself
> N_HIGH_MEMORY - node has present high memory
> 

I don't see any problems with the above two either. That leaves us with
N_POSSIBLE.

> > 
> > Other option would be to tweak Kirill Tkhai's patch such that we call
> > kvmalloc_node()/kzalloc_node() if node is online and call kvmalloc/kvzalloc
> > if the node is offline.
> 
> I really would like a solution that hides these ugly details from callers so
> they don't have to workaround the APIs we provide. kvmalloc_node() really
> shouldn't crash, and it should fallback automatically if we don't give it
> __GFP_THISNODE
> 

Agreed, it's better to make APIs robust where possible.

> However, taking a step back, memcg_alloc_shrinker_maps() is probably rather
> wasteful on systems with 256 possible nodes and only few present, by 
> allocating
> effectively dead structures for each memcg.
> 

If we don't allocate now, we would have to allocate them when we online the
nodes. To me it looks better to allocate as soon as the nodes are onlined.

> SLUB tries to be smart, so it allocates the per-node per-cache structures only
> when the node goes online in slab_mem_going_online_callback(). This is why
> there's a crash when such non-existing structures are accessed for a node 
> that's
> not online, and why they shouldn't be accessed.
> 
> Perhaps memcg should do the same on-demand allocation, if possible.
> 

Right.

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 1/2] powerpc/smp: Drop superfluous NULL check

2020-03-13 Thread Srikar Dronamraju
* Michael Ellerman  [2020-03-13 22:20:19]:

> We don't need the NULL check of np, the result is the same because the
> OF helpers cope with NULL, of_node_to_nid(NULL) == NUMA_NO_NODE (-1).
> 

Looks good to me.

Reviewed-by: Srikar Dronamraju 

> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/kernel/smp.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 37c12e3bab9e..aae61a3b3201 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1197,11 +1197,8 @@ int get_physical_package_id(int cpu)
>*/
>   if (pkg_id == -1 && firmware_has_feature(FW_FEATURE_LPAR)) {
>   struct device_node *np = of_get_cpu_node(cpu, NULL);
> -
> - if (np) {
> - pkg_id = of_node_to_nid(np);
> - of_node_put(np);
> - }
> + pkg_id = of_node_to_nid(np);
> + of_node_put(np);
>   }
>  #endif /* CONFIG_PPC_SPLPAR */
> 
> -- 
> 2.21.1
> 

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 2/2] powerpc/smp: Use IS_ENABLED() to avoid #ifdef

2020-03-13 Thread Srikar Dronamraju
* Michael Ellerman  [2020-03-13 22:20:20]:

> We can avoid the #ifdef by using IS_ENABLED() in the existing
> condition check.
> 

Looks good to me.

Reviewed-by: Srikar Dronamraju 

> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/kernel/smp.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index aae61a3b3201..6d2a3a3666f0 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1189,18 +1189,17 @@ int get_physical_package_id(int cpu)
>  {
>   int pkg_id = cpu_to_chip_id(cpu);
> 
> -#ifdef CONFIG_PPC_SPLPAR
>   /*
>* If the platform is PowerNV or Guest on KVM, ibm,chip-id is
>* defined. Hence we would return the chip-id as the result of
>* get_physical_package_id.
>*/
> - if (pkg_id == -1 && firmware_has_feature(FW_FEATURE_LPAR)) {
> + if (pkg_id == -1 && firmware_has_feature(FW_FEATURE_LPAR) &&
> + IS_ENABLED(CONFIG_PPC_SPLPAR)) {
>   struct device_node *np = of_get_cpu_node(cpu, NULL);
>   pkg_id = of_node_to_nid(np);
>   of_node_put(np);
>   }
> -#endif /* CONFIG_PPC_SPLPAR */
> 
>   return pkg_id;
>  }
> -- 
> 2.21.1
> 

-- 
Thanks and Regards
Srikar Dronamraju



[PATCH 2/4] mm/slub: Use mem_node to allocate a new slab

2020-03-17 Thread Srikar Dronamraju
Currently, while allocating a slab for an offline node, we use its
associated node_numa_mem to search for a partial slab. If we don't find
a partial slab, we try allocating a slab from the offline node using
__alloc_pages_node. However, this is bound to fail.

NIP [c039a300] __alloc_pages_nodemask+0x130/0x3b0
LR [c039a3c4] __alloc_pages_nodemask+0x1f4/0x3b0
Call Trace:
[c008b36837f0] [c039a3b4] __alloc_pages_nodemask+0x1e4/0x3b0 
(unreliable)
[c008b3683870] [c03d1ff8] new_slab+0x128/0xcf0
[c008b3683950] [c03d6060] ___slab_alloc+0x410/0x820
[c008b3683a40] [c03d64a4] __slab_alloc+0x34/0x60
[c008b3683a70] [c03d78b0] __kmalloc_node+0x110/0x490
[c008b3683af0] [c0343a08] kvmalloc_node+0x58/0x110
[c008b3683b30] [c03ffd44] mem_cgroup_css_online+0x104/0x270
[c008b3683b90] [c0234e08] online_css+0x48/0xd0
[c008b3683bc0] [c023dedc] cgroup_apply_control_enable+0x2ec/0x4d0
[c008b3683ca0] [c02416f8] cgroup_mkdir+0x228/0x5f0
[c008b3683d10] [c0520360] kernfs_iop_mkdir+0x90/0xf0
[c008b3683d50] [c043e400] vfs_mkdir+0x110/0x230
[c008b3683da0] [c0441ee0] do_mkdirat+0xb0/0x1a0
[c008b3683e20] [c000b278] system_call+0x5c/0x68

Mitigate this by allocating the new slab from the node_numa_mem.

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 

Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Signed-off-by: Srikar Dronamraju 
---
 mm/slub.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1c55bf7892bf..fdf7f38f96e6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1970,14 +1970,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
flags, int node,
struct kmem_cache_cpu *c)
 {
void *object;
-   int searchnode = node;
 
-   if (node == NUMA_NO_NODE)
-   searchnode = numa_mem_id();
-   else if (!node_present_pages(node))
-   searchnode = node_to_mem_node(node);
-
-   object = get_partial_node(s, get_node(s, searchnode), c, flags);
+   object = get_partial_node(s, get_node(s, node), c, flags);
if (object || node != NUMA_NO_NODE)
return object;
 
@@ -2470,6 +2464,11 @@ static inline void *new_slab_objects(struct kmem_cache 
*s, gfp_t flags,
 
WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
 
+   if (node == NUMA_NO_NODE)
+   node = numa_mem_id();
+   else if (!node_present_pages(node))
+   node = node_to_mem_node(node);
+
freelist = get_partial(s, flags, node, c);
 
if (freelist)
@@ -2569,12 +2568,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t 
gfpflags, int node,
 redo:
 
if (unlikely(!node_match(page, node))) {
-   int searchnode = node;
-
if (node != NUMA_NO_NODE && !node_present_pages(node))
-   searchnode = node_to_mem_node(node);
+   node = node_to_mem_node(node);
 
-   if (unlikely(!node_match(page, searchnode))) {
+   if (unlikely(!node_match(page, node))) {
stat(s, ALLOC_NODE_MISMATCH);
deactivate_slab(s, page, c->freelist, c);
goto new_slab;
-- 
2.18.1



[PATCH 0/4] Fix kmalloc_node on offline nodes

2020-03-17 Thread Srikar Dronamraju
Sachin recently reported that linux-next was no longer bootable on a few
systems.
https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/

# numactl -H
available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31
node 1 size: 35247 MB
node 1 free: 30907 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10
#

Sachin bisected the problem to commit a75056fc1e7c ("mm/memcontrol.c: allocate
shrinker_map on appropriate NUMA node").

The root cause analysis showed that mm/slub and powerpc/numa had some 
shortcomings
with respect to offline nodes.

This patch series is on top of patches posted at
https://lore.kernel.org/linuxppc-dev/2020030237.5731-1-sri...@linux.vnet.ibm.com/t/#u

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 

Srikar Dronamraju (4):
  mm: Check for node_online in node_present_pages
  mm/slub: Use mem_node to allocate a new slab
  mm: Implement reset_numa_mem
  powerpc/numa: Set fallback nodes for offline nodes

 arch/powerpc/mm/numa.c | 11 ++-
 include/asm-generic/topology.h |  3 +++
 include/linux/mmzone.h |  6 --
 include/linux/topology.h   |  7 +++
 mm/slub.c  | 19 ---
 5 files changed, 32 insertions(+), 14 deletions(-)

--
2.18.1



[PATCH 3/4] mm: Implement reset_numa_mem

2020-03-17 Thread Srikar Dronamraju
For memoryless or offline nodes, node_numa_mem refers to an N_MEMORY
fallback node. Currently the kernel has an API, set_numa_mem(), that sets
node_numa_mem for a memoryless node. However, this API cannot be used for
offline nodes, so all offline nodes will have their node_numa_mem set to 0.
But a system can itself have node 0 offline, i.e. memoryless and cpuless,
at this time. In such cases, node_to_mem_node() fails to provide an
N_MEMORY fallback node.

Mitigate this with a new API that sets the default node_numa_mem for
offline nodes to first_memory_node.

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 

Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Signed-off-by: Srikar Dronamraju 
---
 include/asm-generic/topology.h | 3 +++
 include/linux/topology.h   | 7 +++
 2 files changed, 10 insertions(+)

diff --git a/include/asm-generic/topology.h b/include/asm-generic/topology.h
index 238873739550..e803ee7850e6 100644
--- a/include/asm-generic/topology.h
+++ b/include/asm-generic/topology.h
@@ -68,6 +68,9 @@
 #ifndef set_numa_mem
 #define set_numa_mem(node)
 #endif
+#ifndef reset_numa_mem
+#define reset_numa_mem(node)
+#endif
 #ifndef set_cpu_numa_mem
 #define set_cpu_numa_mem(cpu, node)
 #endif
diff --git a/include/linux/topology.h b/include/linux/topology.h
index eb2fe6edd73c..bebda80038bf 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -147,6 +147,13 @@ static inline int node_to_mem_node(int node)
 }
 #endif
 
+#ifndef reset_numa_mem
+static inline void reset_numa_mem(int node)
+{
+   _node_numa_mem_[node] = first_memory_node;
+}
+#endif
+
 #ifndef numa_mem_id
 /* Returns the number of the nearest Node with memory */
 static inline int numa_mem_id(void)
-- 
2.18.1



[PATCH 4/4] powerpc/numa: Set fallback nodes for offline nodes

2020-03-17 Thread Srikar Dronamraju
Currently fallback nodes for offline nodes aren't set. Hence node 0 ends
up being the default fallback node. However node 0 might itself be offline.

Fix this by explicitly setting the fallback node. Ensure first_memory_node
is set before the kernel does the explicit setting of fallback nodes.

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 

Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 281531340230..6e97ab6575cb 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -827,7 +827,16 @@ void __init dump_numa_cpu_topology(void)
if (!numa_enabled)
return;
 
-   for_each_online_node(node) {
+   for_each_node(node) {
+   /*
+* For all possible but not yet online nodes, ensure their
+* node_numa_mem is set correctly so that kmalloc_node works
+* for such nodes.
+*/
+   if (!node_online(node)) {
+   reset_numa_mem(node);
+   continue;
+   }
pr_info("Node %d CPUs:", node);
 
count = 0;
-- 
2.18.1



[PATCH 1/4] mm: Check for node_online in node_present_pages

2020-03-17 Thread Srikar Dronamraju
Calling kmalloc_node() on a possible node which is not yet online can
lead to a panic. Currently node_present_pages() doesn't verify that the
node is online before accessing its pgdat. However the pgdat struct may
not be available, resulting in a crash.

NIP [c03d55f4] ___slab_alloc+0x1f4/0x760
LR [c03d5b94] __slab_alloc+0x34/0x60
Call Trace:
[c008b3783960] [c03d5734] ___slab_alloc+0x334/0x760 (unreliable)
[c008b3783a40] [c03d5b94] __slab_alloc+0x34/0x60
[c008b3783a70] [c03d6fa0] __kmalloc_node+0x110/0x490
[c008b3783af0] [c03443d8] kvmalloc_node+0x58/0x110
[c008b3783b30] [c03fee38] mem_cgroup_css_online+0x108/0x270
[c008b3783b90] [c0235aa8] online_css+0x48/0xd0
[c008b3783bc0] [c023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
[c008b3783ca0] [c0242318] cgroup_mkdir+0x228/0x5f0
[c008b3783d10] [c051e170] kernfs_iop_mkdir+0x90/0xf0
[c008b3783d50] [c043dc00] vfs_mkdir+0x110/0x230
[c008b3783da0] [c0441c90] do_mkdirat+0xb0/0x1a0
[c008b3783e20] [c000b278] system_call+0x5c/0x68

Fix this by verifying the node is online before accessing the pgdat
structure. Fix the same for node_spanned_pages() too.

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 

Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Signed-off-by: Srikar Dronamraju 
---
 include/linux/mmzone.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f3f264826423..88078a3b95e5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -756,8 +756,10 @@ typedef struct pglist_data {
atomic_long_t   vm_stat[NR_VM_NODE_STAT_ITEMS];
 } pg_data_t;
 
-#define node_present_pages(nid)(NODE_DATA(nid)->node_present_pages)
-#define node_spanned_pages(nid)(NODE_DATA(nid)->node_spanned_pages)
+#define node_present_pages(nid)\
+   (node_online(nid) ? NODE_DATA(nid)->node_present_pages : 0)
+#define node_spanned_pages(nid)\
+   (node_online(nid) ? NODE_DATA(nid)->node_spanned_pages : 0)
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
 #define pgdat_page_nr(pgdat, pagenr)   ((pgdat)->node_mem_map + (pagenr))
 #else
-- 
2.18.1



Re: [PATCH 1/4] mm: Check for node_online in node_present_pages

2020-03-17 Thread Srikar Dronamraju
* Srikar Dronamraju  [2020-03-17 18:47:50]:

> 
> Reported-by: Sachin Sant 
> Tested-by: Sachin Sant 
> Signed-off-by: Srikar Dronamraju 
> ---
>  include/linux/mmzone.h | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f3f264826423..88078a3b95e5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -756,8 +756,10 @@ typedef struct pglist_data {
>   atomic_long_t   vm_stat[NR_VM_NODE_STAT_ITEMS];
>  } pg_data_t;
> 
> -#define node_present_pages(nid)  (NODE_DATA(nid)->node_present_pages)
> -#define node_spanned_pages(nid)  (NODE_DATA(nid)->node_spanned_pages)
> +#define node_present_pages(nid)  \
> + (node_online(nid) ? NODE_DATA(nid)->node_present_pages : 0)
> +#define node_spanned_pages(nid)  \
> + (node_online(nid) ? NODE_DATA(nid)->node_spanned_pages : 0)
>  #ifdef CONFIG_FLAT_NODE_MEM_MAP
>  #define pgdat_page_nr(pgdat, pagenr) ((pgdat)->node_mem_map + (pagenr))
>  #else

If we indeed start having pgdat/NODE_DATA even for offline nodes, as Michal
Hocko suggested, then we may not need this particular patch.

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 2/4] mm/slub: Use mem_node to allocate a new slab

2020-03-17 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-17 14:34:25]:

> On 3/17/20 2:17 PM, Srikar Dronamraju wrote:
> > Currently while allocating a slab for an offline node, we use its
> > associated node_numa_mem to search for a partial slab. If we don't find
> > a partial slab, we try allocating a slab from the offline node using
> > __alloc_pages_node. However this is bound to fail.
> > 
> > NIP [c039a300] __alloc_pages_nodemask+0x130/0x3b0
> > LR [c039a3c4] __alloc_pages_nodemask+0x1f4/0x3b0
> > Call Trace:
> > [c008b36837f0] [c039a3b4] __alloc_pages_nodemask+0x1e4/0x3b0 
> > (unreliable)
> > [c008b3683870] [c03d1ff8] new_slab+0x128/0xcf0
> > [c008b3683950] [c03d6060] ___slab_alloc+0x410/0x820
> > [c008b3683a40] [c03d64a4] __slab_alloc+0x34/0x60
> > [c008b3683a70] [c03d78b0] __kmalloc_node+0x110/0x490
> > [c008b3683af0] [c0343a08] kvmalloc_node+0x58/0x110
> > [c008b3683b30] [c03ffd44] mem_cgroup_css_online+0x104/0x270
> > [c008b3683b90] [c0234e08] online_css+0x48/0xd0
> > [c008b3683bc0] [c023dedc] 
> > cgroup_apply_control_enable+0x2ec/0x4d0
> > [c008b3683ca0] [c02416f8] cgroup_mkdir+0x228/0x5f0
> > [c008b3683d10] [c0520360] kernfs_iop_mkdir+0x90/0xf0
> > [c008b3683d50] [c043e400] vfs_mkdir+0x110/0x230
> > [c008b3683da0] [c0441ee0] do_mkdirat+0xb0/0x1a0
> > [c008b3683e20] [c000b278] system_call+0x5c/0x68
> > 
> > Mitigate this by allocating the new slab from the node_numa_mem.
> 
> Are you sure this is really needed and the other 3 patches are not enough for
> the current SLUB code to work as needed? It seems you are changing the 
> semantics
> here...
> 

The other 3 patches are not enough because we don't carry the searchnode
when the actual alloc_pages_node gets called. 

With only the 3 patches, we see the above panic; its signature is slightly
different from what Sachin first reported and which I have carried in the
1st patch.

> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1970,14 +1970,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
> > flags, int node,
> > struct kmem_cache_cpu *c)
> >  {
> > void *object;
> > -   int searchnode = node;
> >  
> > -   if (node == NUMA_NO_NODE)
> > -   searchnode = numa_mem_id();
> > -   else if (!node_present_pages(node))
> > -   searchnode = node_to_mem_node(node);
> > -
> > -   object = get_partial_node(s, get_node(s, searchnode), c, flags);
> > +   object = get_partial_node(s, get_node(s, node), c, flags);
> > 	if (object || node != NUMA_NO_NODE)
> > 		return object;
> >
> >   return get_any_partial(s, flags, c);
> 
> I.e. here in this if(), now node will never equal NUMA_NO_NODE (thanks to the
> hunk below), thus the get_any_partial() call becomes dead code?
> 
> > @@ -2470,6 +2464,11 @@ static inline void *new_slab_objects(struct 
> > kmem_cache *s, gfp_t flags,
> >  
> > WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
> >  
> > +   if (node == NUMA_NO_NODE)
> > +   node = numa_mem_id();
> > +   else if (!node_present_pages(node))
> > +   node = node_to_mem_node(node);
> > +
> > freelist = get_partial(s, flags, node, c);
> >  
> > if (freelist)
> > @@ -2569,12 +2568,10 @@ static void *___slab_alloc(struct kmem_cache *s, 
> > gfp_t gfpflags, int node,
> >  redo:
> >  
> > if (unlikely(!node_match(page, node))) {
> > -   int searchnode = node;
> > -
> > if (node != NUMA_NO_NODE && !node_present_pages(node))
> > -   searchnode = node_to_mem_node(node);
> > +   node = node_to_mem_node(node);
> >  
> > -   if (unlikely(!node_match(page, searchnode))) {
> > +   if (unlikely(!node_match(page, node))) {
> > stat(s, ALLOC_NODE_MISMATCH);
> > deactivate_slab(s, page, c->freelist, c);
> > goto new_slab;
> > 
> 

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 4/4] powerpc/numa: Set fallback nodes for offline nodes

2020-03-17 Thread Srikar Dronamraju
* Bharata B Rao  [2020-03-17 19:52:32]:

Thanks Bharata.

> This patchset can also fix a related problem that I reported earlier at
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-March/206076.html
> with an additional change, suggested by Srikar as shown below:
> 
> >  
> > -   for_each_online_node(node) {
> > +   for_each_node(node) {
> > +   /*
> > +* For all possible but not yet online nodes, ensure their
> > +* node_numa_mem is set correctly so that kmalloc_node works
> > +* for such nodes.
> > +*/
> > +   if (!node_online(node)) {
> 
> Change the above line to like below:
> 
> +   if (!node_state(node, N_MEMORY)) {
> 

Just to clarify, this is needed if we don't have
http://lore.kernel.org/lkml/20200311110237.5731-1-sri...@linux.vnet.ibm.com/t/#u


-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 2/4] mm/slub: Use mem_node to allocate a new slab

2020-03-17 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-17 14:53:26]:

> >> > 
> >> > Mitigate this by allocating the new slab from the node_numa_mem.
> >> 
> >> Are you sure this is really needed and the other 3 patches are not enough 
> >> for
> >> the current SLUB code to work as needed? It seems you are changing the 
> >> semantics
> >> here...
> >> 
> > 
> > The other 3 patches are not enough because we don't carry the searchnode
> > when the actual alloc_pages_node gets called. 
> > 
> > With only the 3 patches, we see the above Panic, its signature is slightly
> > different from what Sachin first reported and which I have carried in 1st
> > patch.
> 
> Ah, I see. So that's the missing pgdat after your series [1] right?

Yes the pgdat would be missing after my cpuless, memoryless node patchset.
However..
> 
> That sounds like an argument for Michal's suggestions that pgdats exist and 
> have
> correctly populated zonelists for all possible nodes.

Only the first patch in this series would be affected by pgdat existing or
not. Even if the pgdat existed, NODE_DATA(nid)->node_present_pages
would be 0. Right? So it would look at node_to_mem_node(). And since node 0 is
cpuless, it would return 0, and we would end up passing node 0 (which is
memoryless/cpuless) to alloc_pages_node. Please note I am setting
node_numa_mem only for offline nodes. However we could change this to set it
for all offline and memoryless nodes.

> node_to_mem_node() could be just a shortcut for the first zone's node in the
> zonelist, so that fallback follows the topology.
> 
> [1]
> https://lore.kernel.org/linuxppc-dev/2020030237.5731-1-sri...@linux.vnet.ibm.com/t/#m76e5b4c4084380b1d4b193d5aa0359b987f2290e
> 



-- 
Thanks and Regards
Srikar Dronamraju



Re: Slub: Increased mem consumption on cpu,mem-less node powerpc guest

2020-03-17 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-17 16:56:04]:

> 
> I wonder why do you get a memory leak while Sachin in the same situation [1]
> gets a crash? I don't understand anything anymore.

Sachin was testing on linux-next which has Kirill's patch which modifies
slub to use kmalloc_node instead of kmalloc. While Bharata is testing on
upstream, which doesn't have this. 

> 
> [1]
> https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/
> 

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 2/4] mm/slub: Use mem_node to allocate a new slab

2020-03-17 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-17 14:34:25]:

> 
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1970,14 +1970,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
> > flags, int node,
> > struct kmem_cache_cpu *c)
> >  {
> > void *object;
> > -   int searchnode = node;
> >  
> > -   if (node == NUMA_NO_NODE)
> > -   searchnode = numa_mem_id();
> > -   else if (!node_present_pages(node))
> > -   searchnode = node_to_mem_node(node);
> > -
> > -   object = get_partial_node(s, get_node(s, searchnode), c, flags);
> > +   object = get_partial_node(s, get_node(s, node), c, flags);
> > 	if (object || node != NUMA_NO_NODE)
> > 		return object;
> >
> >   return get_any_partial(s, flags, c);
> 
> I.e. here in this if(), now node will never equal NUMA_NO_NODE (thanks to the
> hunk below), thus the get_any_partial() call becomes dead code?

Very true.

Would it be okay if we remove the node != NUMA_NO_NODE check? Then

if (object || node != NUMA_NO_NODE)
    return object;

will become

if (object)
    return object;

-- 
Thanks and Regards
Srikar Dronamraju



Re: Slub: Increased mem consumption on cpu,mem-less node powerpc guest

2020-03-17 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-17 17:45:15]:

> On 3/17/20 5:25 PM, Srikar Dronamraju wrote:
> > * Vlastimil Babka  [2020-03-17 16:56:04]:
> > 
> >> 
> >> I wonder why do you get a memory leak while Sachin in the same situation 
> >> [1]
> >> gets a crash? I don't understand anything anymore.
> > 
> > Sachin was testing on linux-next which has Kirill's patch which modifies
> > slub to use kmalloc_node instead of kmalloc. While Bharata is testing on
> > upstream, which doesn't have this. 
> 
> Yes, that Kirill's patch was about the memcg shrinker map allocation. But the
> patch hunk that Bharata posted as a "hack" that fixes the problem, it follows
> that there has to be something else that calls kmalloc_node(node) where node 
> is
> one that doesn't have present pages.
> 
> He mentions alloc_fair_sched_group() which has:
> 
> for_each_possible_cpu(i) {
> cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
>   GFP_KERNEL, cpu_to_node(i));
> ...
> se = kzalloc_node(sizeof(struct sched_entity),
>   GFP_KERNEL, cpu_to_node(i));
> 


Sachin's experiment:
Upstream-next / memcg
possible nodes were 0-31
online nodes were 0-1
kmalloc_node called for_each_node / for_each_possible_node.
This would crash while allocating a slab from !N_ONLINE nodes.

Bharata's experiment:
Upstream
possible nodes were 0-1
online nodes were 0-1
kmalloc_node called for_each_online_node / for_each_possible_cpu,
i.e. kmalloc_node is called only for N_ONLINE nodes.
So it wouldn't crash.

Even if his possible nodes were 0-256, I don't think we have kmalloc_node
being called on !N_ONLINE nodes. Hence it's not crashing.
If we look at the code you quoted, kzalloc_node is using cpu_to_node,
which in Bharata's case will always return 1.


> I assume one of these structs is 1k and other 512 bytes (rounded) and that for
> some possible cpu's cpu_to_node(i) will be 0, which has no present pages. And 
> as
> Bharata pasted, node_to_mem_node(0) = 0
> So this looks like the same scenario, but it doesn't crash? Is the node 0
> actually online here, and/or does it have N_NORMAL_MEMORY state?

I still don't have any clue about the leak, though.

-- 
Thanks and Regards
Srikar Dronamraju



[PATCH v2 0/4] Fix kmalloc_node on offline nodes

2020-03-18 Thread Srikar Dronamraju
Changelog v1 -> v2:
- Handled comments from Vlastimil Babka and Bharata B Rao
- Changes only in patch 2 and 4.

Sachin recently reported that linux-next was no longer bootable on a few
powerpc systems.
https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/

# numactl -H
available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 35247 MB
node 1 free: 30907 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10
#

Sachin bisected the problem to Commit a75056fc1e7c ("mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node")

The root cause analysis showed that mm/slub and powerpc/numa had some
shortcomings with respect to offline nodes.

This patch series is on top of patches posted at
https://lore.kernel.org/linuxppc-dev/2020030237.5731-1-sri...@linux.vnet.ibm.com/t/#u

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 
Cc: Nathan Lynch 

Srikar Dronamraju (4):
  mm: Check for node_online in node_present_pages
  mm/slub: Use mem_node to allocate a new slab
  mm: Implement reset_numa_mem
  powerpc/numa: Set fallback nodes for offline nodes

 arch/powerpc/include/asm/topology.h | 16 
 arch/powerpc/kernel/smp.c   |  1 +
 include/asm-generic/topology.h  |  3 +++
 include/linux/mmzone.h  |  6 --
 include/linux/topology.h|  7 +++
 mm/slub.c   |  9 +
 6 files changed, 36 insertions(+), 6 deletions(-)

-- 
2.18.1



[PATCH v2 1/4] mm: Check for node_online in node_present_pages

2020-03-18 Thread Srikar Dronamraju
Calling kmalloc_node() on a possible node which is not yet online can
lead to a panic. Currently node_present_pages() doesn't verify that the
node is online before accessing its pgdat. However the pgdat struct may
not be available, resulting in a crash.

NIP [c03d55f4] ___slab_alloc+0x1f4/0x760
LR [c03d5b94] __slab_alloc+0x34/0x60
Call Trace:
[c008b3783960] [c03d5734] ___slab_alloc+0x334/0x760 (unreliable)
[c008b3783a40] [c03d5b94] __slab_alloc+0x34/0x60
[c008b3783a70] [c03d6fa0] __kmalloc_node+0x110/0x490
[c008b3783af0] [c03443d8] kvmalloc_node+0x58/0x110
[c008b3783b30] [c03fee38] mem_cgroup_css_online+0x108/0x270
[c008b3783b90] [c0235aa8] online_css+0x48/0xd0
[c008b3783bc0] [c023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
[c008b3783ca0] [c0242318] cgroup_mkdir+0x228/0x5f0
[c008b3783d10] [c051e170] kernfs_iop_mkdir+0x90/0xf0
[c008b3783d50] [c043dc00] vfs_mkdir+0x110/0x230
[c008b3783da0] [c0441c90] do_mkdirat+0xb0/0x1a0
[c008b3783e20] [c000b278] system_call+0x5c/0x68

Fix this by verifying the node is online before accessing the pgdat
structure. Fix the same for node_spanned_pages() too.

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 
Cc: Nathan Lynch 

Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Signed-off-by: Srikar Dronamraju 
---
 include/linux/mmzone.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f3f264826423..88078a3b95e5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -756,8 +756,10 @@ typedef struct pglist_data {
atomic_long_t   vm_stat[NR_VM_NODE_STAT_ITEMS];
 } pg_data_t;
 
-#define node_present_pages(nid)(NODE_DATA(nid)->node_present_pages)
-#define node_spanned_pages(nid)(NODE_DATA(nid)->node_spanned_pages)
+#define node_present_pages(nid)\
+   (node_online(nid) ? NODE_DATA(nid)->node_present_pages : 0)
+#define node_spanned_pages(nid)\
+   (node_online(nid) ? NODE_DATA(nid)->node_spanned_pages : 0)
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
 #define pgdat_page_nr(pgdat, pagenr)   ((pgdat)->node_mem_map + (pagenr))
 #else
-- 
2.18.1



[PATCH v2 2/4] mm/slub: Use mem_node to allocate a new slab

2020-03-18 Thread Srikar Dronamraju
Currently while allocating a slab for an offline node, we use its
associated node_numa_mem to search for a partial slab. If we don't find
a partial slab, we try allocating a slab from the offline node using
__alloc_pages_node. However this is bound to fail.

NIP [c039a300] __alloc_pages_nodemask+0x130/0x3b0
LR [c039a3c4] __alloc_pages_nodemask+0x1f4/0x3b0
Call Trace:
[c008b36837f0] [c039a3b4] __alloc_pages_nodemask+0x1e4/0x3b0 (unreliable)
[c008b3683870] [c03d1ff8] new_slab+0x128/0xcf0
[c008b3683950] [c03d6060] ___slab_alloc+0x410/0x820
[c008b3683a40] [c03d64a4] __slab_alloc+0x34/0x60
[c008b3683a70] [c03d78b0] __kmalloc_node+0x110/0x490
[c008b3683af0] [c0343a08] kvmalloc_node+0x58/0x110
[c008b3683b30] [c03ffd44] mem_cgroup_css_online+0x104/0x270
[c008b3683b90] [c0234e08] online_css+0x48/0xd0
[c008b3683bc0] [c023dedc] cgroup_apply_control_enable+0x2ec/0x4d0
[c008b3683ca0] [c02416f8] cgroup_mkdir+0x228/0x5f0
[c008b3683d10] [c0520360] kernfs_iop_mkdir+0x90/0xf0
[c008b3683d50] [c043e400] vfs_mkdir+0x110/0x230
[c008b3683da0] [c0441ee0] do_mkdirat+0xb0/0x1a0
[c008b3683e20] [c000b278] system_call+0x5c/0x68

Mitigate this by allocating the new slab from the node_numa_mem.

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 
Cc: Nathan Lynch 

Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1 -> v2:
- Handled comments from Vlastimil Babka
- Now node gets set to node_numa_mem in new_slab_objects.

 mm/slub.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1c55bf7892bf..2dc603a84290 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2475,6 +2475,9 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
if (freelist)
return freelist;
 
+   if (node != NUMA_NO_NODE && !node_present_pages(node))
+   node = node_to_mem_node(node);
+
page = new_slab(s, flags, node);
if (page) {
c = raw_cpu_ptr(s->cpu_slab);
@@ -2569,12 +2572,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 redo:
 
if (unlikely(!node_match(page, node))) {
-   int searchnode = node;
-
if (node != NUMA_NO_NODE && !node_present_pages(node))
-   searchnode = node_to_mem_node(node);
+   node = node_to_mem_node(node);
 
-   if (unlikely(!node_match(page, searchnode))) {
+   if (unlikely(!node_match(page, node))) {
stat(s, ALLOC_NODE_MISMATCH);
deactivate_slab(s, page, c->freelist, c);
goto new_slab;
-- 
2.18.1



[PATCH v2 3/4] mm: Implement reset_numa_mem

2020-03-18 Thread Srikar Dronamraju
For memoryless or offline nodes, node_numa_mem refers to an N_MEMORY
fallback node. Currently the kernel has an API, set_numa_mem, that sets
node_numa_mem for a memoryless node. However this API cannot be used for
offline nodes, so all offline nodes have their node_numa_mem left at 0.
Yet a system can itself have node 0 offline, i.e. memoryless and cpuless,
at boot. In such cases, node_to_mem_node() fails to provide an N_MEMORY
fallback node.

Mitigate this by having a new API that sets the default node_numa_mem for
offline nodes to be first_memory_node.

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 
Cc: Nathan Lynch 

Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Signed-off-by: Srikar Dronamraju 
---
 include/asm-generic/topology.h | 3 +++
 include/linux/topology.h   | 7 +++
 2 files changed, 10 insertions(+)

diff --git a/include/asm-generic/topology.h b/include/asm-generic/topology.h
index 238873739550..e803ee7850e6 100644
--- a/include/asm-generic/topology.h
+++ b/include/asm-generic/topology.h
@@ -68,6 +68,9 @@
 #ifndef set_numa_mem
 #define set_numa_mem(node)
 #endif
+#ifndef reset_numa_mem
+#define reset_numa_mem(node)
+#endif
 #ifndef set_cpu_numa_mem
 #define set_cpu_numa_mem(cpu, node)
 #endif
diff --git a/include/linux/topology.h b/include/linux/topology.h
index eb2fe6edd73c..bebda80038bf 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -147,6 +147,13 @@ static inline int node_to_mem_node(int node)
 }
 #endif
 
+#ifndef reset_numa_mem
+static inline void reset_numa_mem(int node)
+{
+   _node_numa_mem_[node] = first_memory_node;
+}
+#endif
+
 #ifndef numa_mem_id
 /* Returns the number of the nearest Node with memory */
 static inline int numa_mem_id(void)
-- 
2.18.1



[PATCH v2 4/4] powerpc/numa: Set fallback nodes for offline nodes

2020-03-18 Thread Srikar Dronamraju
Currently fallback nodes for offline nodes aren't set. Hence node 0 ends
up being the default fallback node. However node 0 might itself be offline.

Fix this by explicitly setting the fallback node. Ensure first_memory_node
is set before the kernel does the explicit setting of fallback nodes.

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 
Cc: Nathan Lynch 

Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1 -> v2:
- Handled comments from Bharata B Rao
- Dont use dump_numa_cpu_topology to set fallback nodes

 arch/powerpc/include/asm/topology.h | 16 
 arch/powerpc/kernel/smp.c   |  1 +
 2 files changed, 17 insertions(+)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 2db7ba789720..baa89364197c 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -62,6 +62,21 @@ static inline int early_cpu_to_node(int cpu)
 */
return (nid < 0) ? 0 : nid;
 }
+
+static inline void update_default_numa_mem(void)
+{
+   unsigned int node;
+
+   for_each_node(node) {
+   /*
+* For all possible but not yet online nodes, ensure their
+* node_numa_mem is set correctly so that kmalloc_node works
+* for such nodes.
+*/
+   if (!node_online(node))
+   reset_numa_mem(node);
+   }
+}
 #else
 
 static inline int early_cpu_to_node(int cpu) { return 0; }
@@ -90,6 +105,7 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
return 0;
 }
 
+static inline void update_default_numa_mem(void) {}
 #endif /* CONFIG_NUMA */
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 37c12e3bab9e..d23faa70ea2d 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1383,6 +1383,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
if (smp_ops && smp_ops->bringup_done)
smp_ops->bringup_done();
 
+   update_default_numa_mem();
dump_numa_cpu_topology();
 
 #ifdef CONFIG_SCHED_SMT
-- 
2.18.1



Re: [PATCH 2/4] mm/slub: Use mem_node to allocate a new slab

2020-03-18 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-17 16:29:21]:

> > If we pass this node 0 (which is memoryless/cpuless) to
> > alloc_pages_node. Please note I am only setting node_numa_mem only
> > for offline nodes. However we could change this to set for all offline and
> > memoryless nodes.
> 
> That would indeed make sense.
> 
> But I guess that alloc_pages would still crash as the result of
> numa_to_mem_node() is not passed down to alloc_pages() without this patch. In
> __alloc_pages_node() we currently have "The node must be valid and online" so
> offline nodes don't have zonelists. Either they get them, or we indeed need
> something like this patch. But in order to not make get_any_partial() dead 
> code,
> the final replacement of invalid node with a valid one should be done in
> alloc_slab_page() I guess?
> 

I am posting v2 with this change.

> >> node_to_mem_node() could be just a shortcut for the first zone's node in 
> >> the
> >> zonelist, so that fallback follows the topology.

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-03-18 Thread Srikar Dronamraju
* Michal Hocko  [2020-03-16 09:54:25]:

> On Sun 15-03-20 14:20:05, Cristopher Lameter wrote:
> > On Wed, 11 Mar 2020, Srikar Dronamraju wrote:
> > 
> > > Currently Linux kernel with CONFIG_NUMA on a system with multiple
> > > possible nodes, marks node 0 as online at boot.  However in practice,
> > > there are systems which have node 0 as memoryless and cpuless.
> > 
> > Would it not be better and simpler to require that node 0 always has
> > memory (and processors)? A minimum operational set?
> 
> I do not think you can simply ignore the reality. I cannot say that I am
> a fan of memoryless/cpuless numa configurations but they are a sad
> reality of different LPAR configurations. We have to deal with them.
> Besides that I do not really see any strong technical arguments to lack
> a support for those crippled configurations. We do have zonelists that
> allow to do reasonable decisions on memoryless nodes. So no, I do not
> think that this is a viable approach.
> 

I agree with Michal; the kernel should accept the reality and work with
different LPAR configurations.

> > We can dynamically number the nodes right? So just make sure that the
> > firmware properly creates memory on node 0?
> 
> Are you suggesting that the OS would renumber NUMA nodes coming
> from FW just to satisfy node 0 existence? If yes then I believe this is
> really a bad idea because it would make HW/LPAR configuration matching
> to the resulting memory layout really hard to follow.
> 
> -- 
> Michal Hocko
> SUSE Labs

Michal, Vlastimil, Christoph and others, do you have any more comments,
suggestions or other feedback? If not, can you please add your
Reviewed-by, Acked-by, etc.?

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH v2 1/4] mm: Check for node_online in node_present_pages

2020-03-18 Thread Srikar Dronamraju
* Michal Hocko  [2020-03-18 11:02:56]:

> On Wed 18-03-20 12:58:07, Srikar Dronamraju wrote:
> > Calling a kmalloc_node on a possible node which is not yet onlined can
> > lead to panic. Currently node_present_pages() doesn't verify the node is
> > online before accessing the pgdat for the node. However pgdat struct may
> > not be available resulting in a crash.
> >
> > NIP [c03d55f4] ___slab_alloc+0x1f4/0x760
> > LR [c03d5b94] __slab_alloc+0x34/0x60
> > Call Trace:
> > [c008b3783960] [c03d5734] ___slab_alloc+0x334/0x760 (unreliable)
> > [c008b3783a40] [c03d5b94] __slab_alloc+0x34/0x60
> > [c008b3783a70] [c03d6fa0] __kmalloc_node+0x110/0x490
> > [c008b3783af0] [c03443d8] kvmalloc_node+0x58/0x110
> > [c008b3783b30] [c03fee38] mem_cgroup_css_online+0x108/0x270
> > [c008b3783b90] [c0235aa8] online_css+0x48/0xd0
> > [c008b3783bc0] [c023eaec] 
> > cgroup_apply_control_enable+0x2ec/0x4d0
> > [c008b3783ca0] [c0242318] cgroup_mkdir+0x228/0x5f0
> > [c008b3783d10] [c051e170] kernfs_iop_mkdir+0x90/0xf0
> > [c008b3783d50] [c043dc00] vfs_mkdir+0x110/0x230
> > [c008b3783da0] [c0441c90] do_mkdirat+0xb0/0x1a0
> > [c008b3783e20] [c000b278] system_call+0x5c/0x68
> >
> > Fix this by verifying the node is online before accessing the pgdat
> > structure. Fix the same for node_spanned_pages() too.
> >
> > Cc: Andrew Morton 
> > Cc: linux...@kvack.org
> > Cc: Mel Gorman 
> > Cc: Michael Ellerman 
> > Cc: Sachin Sant 
> > Cc: Michal Hocko 
> > Cc: Christopher Lameter 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: Joonsoo Kim 
> > Cc: Kirill Tkhai 
> > Cc: Vlastimil Babka 
> > Cc: Srikar Dronamraju 
> > Cc: Bharata B Rao 
> > Cc: Nathan Lynch 
> >
> > Reported-by: Sachin Sant 
> > Tested-by: Sachin Sant 
> > Signed-off-by: Srikar Dronamraju 
> > ---
> >  include/linux/mmzone.h | 6 --
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index f3f264826423..88078a3b95e5 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -756,8 +756,10 @@ typedef struct pglist_data {
> > atomic_long_t   vm_stat[NR_VM_NODE_STAT_ITEMS];
> >  } pg_data_t;
> >
> > -#define node_present_pages(nid)(NODE_DATA(nid)->node_present_pages)
> > -#define node_spanned_pages(nid)(NODE_DATA(nid)->node_spanned_pages)
> > +#define node_present_pages(nid)\
> > +   (node_online(nid) ? NODE_DATA(nid)->node_present_pages : 0)
> > +#define node_spanned_pages(nid)\
> > +   (node_online(nid) ? NODE_DATA(nid)->node_spanned_pages : 0)
>
> I believe this is a wrong approach. We really do not want to special
> case all the places which require NODE_DATA. Can we please go and
> allocate pgdat for all possible nodes?
>

I can do that, but the question I had was whether we should make this change
just for powerpc or for the other archs as well.

NODE_DATA initialization always seems to be in arch specific code.

The other archs that are affected seem to be mips, sh and sparc.
These archs seem to make the assumption that NODE_DATA has to be local
only.

For example on sparc / arch/sparc/mm/init_64.c in allocate_node_data function.

  NODE_DATA(nid) = memblock_alloc_node(sizeof(struct pglist_data),
 SMP_CACHE_BYTES, nid);
if (!NODE_DATA(nid)) {
prom_printf("Cannot allocate pglist_data for nid[%d]\n", nid);
prom_halt();
}

NODE_DATA(nid)->node_id = nid;

So even if I make changes to allocate NODE_DATA from a fallback node, I may
not be able to test them.

So please let me know your thoughts around the same.

> The current state of memory less hacks subtle bugs poping up here and
> there just prove that we should have done that from the very begining
> IMHO.
>
> >  #ifdef CONFIG_FLAT_NODE_MEM_MAP
> >  #define pgdat_page_nr(pgdat, pagenr)   ((pgdat)->node_mem_map + 
> > (pagenr))
> >  #else
> > --
> > 2.18.1
>
> --
> Michal Hocko
> SUSE Labs
>

--
Thanks and Regards
Srikar Dronamraju



Re: [RFC 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-19 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-19 14:47:58]:

> 8<
> diff --git a/mm/slub.c b/mm/slub.c
> index 17dc00e33115..7113b1f9cd77 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1973,8 +1973,6 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
> flags, int node,
> 
>   if (node == NUMA_NO_NODE)
>   searchnode = numa_mem_id();
> - else if (!node_present_pages(node))
> - searchnode = node_to_mem_node(node);
> 
>   object = get_partial_node(s, get_node(s, searchnode), c, flags);

Are we okay with passing a node to get_partial_node with !NUMA_NO_NODE and
!N_MEMORY including possible nodes?

>   if (object || node != NUMA_NO_NODE)

-- 
Thanks and Regards
Srikar Dronamraju



Re: [RFC 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-20 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-19 15:10:19]:

> On 3/19/20 3:05 PM, Srikar Dronamraju wrote:
> > * Vlastimil Babka  [2020-03-19 14:47:58]:
> > 
> >> 8<
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index 17dc00e33115..7113b1f9cd77 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -1973,8 +1973,6 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
> >> flags, int node,
> >> 
> >>if (node == NUMA_NO_NODE)
> >>searchnode = numa_mem_id();
> >> -  else if (!node_present_pages(node))
> >> -  searchnode = node_to_mem_node(node);
> >> 
> >>object = get_partial_node(s, get_node(s, searchnode), c, flags);
> > 
> > Are we okay with passing a node to get_partial_node with !NUMA_NO_NODE and
> > !N_MEMORY including possible nodes?
> 
> No, but AFAICS, such node values are already handled in ___slab_alloc, and
> cannot reach get_partial(). If you see something I missed, please do tell.
> 

Ah I probably got confused with your previous version where
alloc_slab_page() was modified. I see no problems with this version.

Sorry for the noise.

A question just for my better understanding,
How much worse would it be to set node to numa_mem_id() instead of NUMA_NODE_ID
when the current node is !N_NORMAL_MEMORY?

> >>if (object || node != NUMA_NO_NODE)
> > 
> 

-- 
Thanks and Regards
Srikar Dronamraju



Re: [RFC 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-20 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-20 09:43:11]:

> On 3/20/20 8:46 AM, Srikar Dronamraju wrote:
> > * Vlastimil Babka  [2020-03-19 15:10:19]:
> > 
> >> On 3/19/20 3:05 PM, Srikar Dronamraju wrote:
> >> > * Vlastimil Babka  [2020-03-19 14:47:58]:
> >> > 
> >> 
> >> No, but AFAICS, such node values are already handled in ___slab_alloc, and
> >> cannot reach get_partial(). If you see something I missed, please do tell.
> >> 
> > 
> > Ah I probably got confused with your previous version where
> > alloc_slab_page() was modified. I see no problems with this version.
> 
> Thanks!
> 
> > Sorry for the noise.
> 
> No problem.
> 
> > A question just for my better understanding,
> > How much worse would it be to set node to numa_mem_id() instead of 
> > NUMA_NODE_ID
> > when the current node is !N_NORMAL_MEMORY?
> 

Yes,

> (I'm assuming you mean s/NUMA_NODE_ID/NUMA_NO_NODE/)
> 
> Well, numa_mem_id() should work too, but it would make the allocation
> constrained to the node of current cpu, with all the consequences 
> (deactivating
> percpu slab if it was from a different node etc).
> 
> There's no reason why this cpu's node should be the closest node to the one 
> that
> was originally requested (but has no memory), so it's IMO pointless or even
> suboptimal to constraint to it. This can be revisited in case we get 
> guaranteed
> existence of node data with zonelists for all possible nodes, but for now
> NUMA_NO_NODE seems the most reasonable fix to me.
> 

Okay.


-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-20 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-20 12:55:32]:

> Sachin reports [1] a crash in SLUB __slab_alloc():
> 
> BUG: Kernel NULL pointer dereference on read at 0x73b0
> Faulting instruction address: 0xc03d55f4
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in:
> CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
> NIP:  c03d55f4 LR: c03d5b94 CTR: 
> REGS: c008b37836d0 TRAP: 0300   Not tainted  
> (5.6.0-rc2-next-20200218-autotest)
> MSR:  80009033   CR: 24004844  XER: 
> CFAR: c000dec4 DAR: 73b0 DSISR: 4000 IRQMASK: 1
> GPR00: c03d5b94 c008b3783960 c155d400 c008b301f500
> GPR04: 0dc0 0002 c03443d8 c008bb398620
> GPR08: 0008ba2f 0001  
> GPR12: 24004844 c0001ec52a00  
> GPR16: c008a1b20048 c1595898 c1750c18 0002
> GPR20: c1750c28 c1624470 000fffe0 5deadbeef122
> GPR24: 0001 0dc0 0002 c03443d8
> GPR28: c008b301f500 c008bb398620  c00c02287180
> NIP [c03d55f4] ___slab_alloc+0x1f4/0x760
> LR [c03d5b94] __slab_alloc+0x34/0x60
> Call Trace:
> [c008b3783960] [c03d5734] ___slab_alloc+0x334/0x760 (unreliable)
> [c008b3783a40] [c03d5b94] __slab_alloc+0x34/0x60
> [c008b3783a70] [c03d6fa0] __kmalloc_node+0x110/0x490
> [c008b3783af0] [c03443d8] kvmalloc_node+0x58/0x110
> [c008b3783b30] [c03fee38] mem_cgroup_css_online+0x108/0x270
> [c008b3783b90] [c0235aa8] online_css+0x48/0xd0
> [c008b3783bc0] [c023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
> [c008b3783ca0] [c0242318] cgroup_mkdir+0x228/0x5f0
> [c008b3783d10] [c051e170] kernfs_iop_mkdir+0x90/0xf0
> [c008b3783d50] [c043dc00] vfs_mkdir+0x110/0x230
> [c008b3783da0] [c0441c90] do_mkdirat+0xb0/0x1a0
> [c008b3783e20] [c000b278] system_call+0x5c/0x68
> 
> This is a PowerPC platform with following NUMA topology:
> 
> available: 2 nodes (0-1)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
> 25 26 27 28 29 30 31
> node 1 size: 35247 MB
> node 1 free: 30907 MB
> node distances:
> node   0   1
>   0:  10  40
>   1:  40  10
> 
> possible numa nodes: 0-31
> 
> This only happens with a mmotm patch "mm/memcontrol.c: allocate shrinker_map 
> on
> appropriate NUMA node" [2] which effectively calls kmalloc_node for each
> possible node. SLUB however only allocates kmem_cache_node on online
> N_NORMAL_MEMORY nodes, and relies on node_to_mem_node to return such valid 
> node
> for other nodes since commit a561ce00b09e ("slub: fall back to
> node_to_mem_node() node if allocating on memoryless node"). This is however 
> not
> true in this configuration where the _node_numa_mem_ array is not initialized
> for nodes 0 and 2-31, thus it contains zeroes and get_partial() ends up
> accessing non-allocated kmem_cache_node.
> 
> A related issue was reported by Bharata (originally by Ramachandran) [3] where
> a similar PowerPC configuration, but with mainline kernel without patch [2]
> ends up allocating large amounts of pages by kmalloc-1k kmalloc-512. This 
> seems
> to have the same underlying issue with node_to_mem_node() not behaving as
> expected, and might probably also lead to an infinite loop with
> CONFIG_SLUB_CPU_PARTIAL [4].
> 
> This patch should fix both issues by not relying on node_to_mem_node() anymore
> and instead simply falling back to NUMA_NO_NODE, when kmalloc_node(node) is
> attempted for a node that's not online, or has no usable memory. The "usable
> memory" condition is also changed from node_present_pages() to N_NORMAL_MEMORY
> node state, as that is exactly the condition that SLUB uses to allocate
> kmem_cache_node structures. The check in get_partial() is removed completely,
> as the checks in ___slab_alloc() are now sufficient to prevent get_partial()
> being reached with an invalid node.
> 
> [1] 
> https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/
> [2] 
> https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c...@virtuozzo.com/
> [3] https://lore.kernel.org/linux-mm/20200317092624.gb22...@in.ibm.com/
> [4] 
> https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125a...@suse.cz/
> 
> R

Re: [PATCH v4 0/5] Early node associativity

2019-11-15 Thread Srikar Dronamraju
Hi Michael,

> Nathan Lynch  writes:
> > Srikar Dronamraju  writes:
> >> Abdul reported  a warning on a shared lpar.
> >> "WARNING: workqueue cpumask: online intersect > possible intersect".
> >> This is because per node workqueue possible mask is set very early in the
> >> boot process even before the system was querying the home node
> >> associativity. However per node workqueue online cpumask gets updated
> >> dynamically. Hence there is a chance when per node workqueue online cpumask
> >> is a superset of per node workqueue possible mask.
> >
> > Sorry for the delay in following up on these. The series looks good to
> > me, and I've given it a little testing with LPM and DLPAR. I've also
> > verified that the cpu assignments occur early as intended on an LPAR
> > where that workqueue warning had been triggered.
> 
> If this is applied I think we can remove about 500 loc from numa.c. Once
> splpar cpu-node assignments are done in the same sequence as with
> dedicated processor mode, numa_update_cpu_topology() and related code
> becomes unneeded.

Since it's been 2 months since version 5 was posted and more than a month
since Nathan added his Reviewed-by, can you please let us know if there is
any reason why this patchset has not been applied yet? Or do let me know if
you have any further comments.

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH v2] powerpc/smp: Use nid as fallback for package_id

2019-11-15 Thread Srikar Dronamraju
Hey Michael,

* Srikar Dronamraju  [2019-08-22 20:08:53]:

> Package_id is used to find out all cores that are part of the same chip. On
> PowerNV machines, package_id defaults to chip_id. However, the ibm,chip_id
> property is not present in the device tree of PowerVM LPARs. Hence lscpu
> output shows one core per socket and multiple cores.
> 

Can you let me know if you have any comments on this patch?

-- 
Thanks and Regards
Srikar Dronamraju



[PATCH 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-04 Thread Srikar Dronamraju
ri Lelli 
Cc: Waiman Long 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/spinlock.h | 5 +++--
 arch/powerpc/mm/numa.c  | 4 
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index e9a960e28f3c..866f6ca0427a 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -35,11 +35,12 @@
 #define LOCK_TOKEN 1
 #endif
 
-#ifdef CONFIG_PPC_PSERIES
+#if defined(CONFIG_PPC_PSERIES) && defined(CONFIG_PPC_SPLPAR)
+DECLARE_STATIC_KEY_FALSE(shared_processor);
 #define vcpu_is_preempted vcpu_is_preempted
 static inline bool vcpu_is_preempted(int cpu)
 {
-   if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+   if (!static_branch_unlikely(&shared_processor))
return false;
return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
 }
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 50d68d21ddcc..ffb971f3a63c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1568,9 +1568,13 @@ int prrn_is_enabled(void)
return prrn_enabled;
 }
 
+DEFINE_STATIC_KEY_FALSE(shared_processor);
+EXPORT_SYMBOL_GPL(shared_processor);
+
 void __init shared_proc_topology_init(void)
 {
if (lppaca_shared_proc(get_lppaca())) {
+   static_branch_enable(&shared_processor);
bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
nr_cpumask_bits);
numa_update_cpu_topology(false);
-- 
2.18.1



[PATCH 2/2] powerpc/shared: Use static key to detect shared processor

2019-12-04 Thread Srikar Dronamraju
With the static key shared processor available, is_shared_processor()
can return without having to query the lppaca structure.

Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/spinlock.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 866f6ca0427a..699eb306a835 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -115,9 +115,8 @@ static inline bool is_shared_processor(void)
  * LPPACA is only available on Pseries so guard anything LPPACA related to
  * allow other platforms (which include this common header) to compile.
  */
-#ifdef CONFIG_PPC_PSERIES
-   return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
-   lppaca_shared_proc(local_paca->lppaca_ptr));
+#if defined(CONFIG_PPC_PSERIES) && defined(CONFIG_PPC_SPLPAR)
+   return static_branch_unlikely(&shared_processor);
 #else
return false;
 #endif
-- 
2.18.1



Re: [PATCH 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-04 Thread Srikar Dronamraju
* Srikar Dronamraju  [2019-12-04 19:14:58]:

> 
> 
>   # perf stat -a -r 5 ./schbench
> v5.4  v5.4 + patch
> Latency percentiles (usec)  Latency percentiles (usec)
>   50.th: 47   50.th: 33
>   75.th: 64   75.th: 44
>   90.th: 76   90.th: 50
>   95.th: 83   95.th: 53
>   *99.th: 103 *99.th: 57
>   99.5000th: 2124 99.5000th: 59
>   99.9000th: 7976 99.9000th: 83
>   min=-1, max=10519   min=0, max=117
> Latency percentiles (usec)  Latency percentiles (usec)
>   50.th: 45   50.th: 34
>   75.th: 61   75.th: 45
>   90.th: 70   90.th: 52
>   95.th: 77   95.th: 56
>   *99.th: 504 *99.th: 62
>   99.5000th: 4012 99.5000th: 64
>   99.9000th: 8168 99.9000th: 79
>   min=-1, max=14500   min=0, max=123
> Latency percentiles (usec)  Latency percentiles (usec)
>   50.th: 48   50.th: 35
>   75.th: 65   75.th: 47
>   90.th: 76   90.th: 55
>   95.th: 82   95.th: 59
>   *99.th: 1098*99.th: 67
>   99.5000th: 3988 99.5000th: 71
>   99.9000th: 9360 99.9000th: 98
>   min=-1, max=19283   min=0, max=137
> Latency percentiles (usec)  Latency percentiles (usec)
>   50.th: 46   50.th: 35
>   75.th: 63   75.th: 46
>   90.th: 73   90.th: 53
>   95.th: 78   95.th: 57
>   *99.th: 113 *99.th: 63
>   99.5000th: 2316 99.5000th: 65
>   99.9000th: 7704 99.9000th: 83
>   min=-1, max=17976   min=0, max=139
> Latency percentiles (usec)  Latency percentiles (usec)
>   50.th: 46   50.th: 34
>   75.th: 62   75.th: 46
>   90.th: 73   90.th: 53
>   95.th: 79   95.th: 57
>   *99.th: 97  *99.th: 64
>   99.5000th: 1398 99.5000th: 70
>   99.9000th: 8136 99.9000th: 100
>   min=-1, max=10008       min=0, max=142
> 

Just to be clear, since these are latency values, a lower number is better.

-- 
Thanks and Regards
Srikar Dronamraju



[PATCH v2 2/2] powerpc/shared: Use static key to detect shared processor

2019-12-04 Thread Srikar Dronamraju
With the static key shared processor available, is_shared_processor()
can return without having to query the lppaca structure.

Signed-off-by: Srikar Dronamraju 
---
Changelog v1->v2:
Now that we no more refer to lppaca, remove the comment.

 arch/powerpc/include/asm/spinlock.h | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 866f6ca0427a..251fe6e47471 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -111,13 +111,8 @@ static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
 
 static inline bool is_shared_processor(void)
 {
-/*
- * LPPACA is only available on Pseries so guard anything LPPACA related to
- * allow other platforms (which include this common header) to compile.
- */
-#ifdef CONFIG_PPC_PSERIES
-   return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
-   lppaca_shared_proc(local_paca->lppaca_ptr));
+#if defined(CONFIG_PPC_PSERIES) && defined(CONFIG_PPC_SPLPAR)
+   return static_branch_unlikely(&shared_processor);
 #else
return false;
 #endif
-- 
2.18.1



[PATCH v3 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-05 Thread Srikar Dronamraju
With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted
vCPUs"), the scheduler avoids scheduling tasks on preempted vCPUs at wakeup.
This leads to a wrong choice of CPU, which in turn leads to larger wakeup
latencies. Eventually, it leads to performance regression in latency
sensitive benchmarks like soltp, schbench etc.

On Powerpc, vcpu_is_preempted only looks at yield_count. If the
yield_count is odd, the vCPU is assumed to be preempted. However,
yield_count is incremented whenever the LPAR enters CEDE state. So any
vCPU that has entered CEDE state is assumed to be preempted.

Even if the vCPU of a dedicated LPAR is preempted/donated, it should have
right of first use, since it is supposed to own the vCPU.

On a Power9 System with 32 cores
 # lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):   16
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127

  # perf stat -a -r 5 ./schbench
v5.4 v5.4 + patch
Latency percentiles (usec)   Latency percentiles (usec)
50.th: 45   50.th: 39
75.th: 62   75.th: 53
90.th: 71   90.th: 67
95.th: 77   95.th: 76
*99.th: 91  *99.th: 89
99.5000th: 707  99.5000th: 93
99.9000th: 6920 99.9000th: 118
min=0, max=10048min=0, max=211
Latency percentiles (usec)   Latency percentiles (usec)
50.th: 45   50.th: 34
75.th: 61   75.th: 45
90.th: 72   90.th: 53
95.th: 79   95.th: 56
*99.th: 691 *99.th: 61
99.5000th: 3972 99.5000th: 63
99.9000th: 8368 99.9000th: 78
min=0, max=16606min=0, max=228
Latency percentiles (usec)   Latency percentiles (usec)
50.th: 45   50.th: 34
75.th: 61   75.th: 45
90.th: 71   90.th: 53
95.th: 77   95.th: 57
*99.th: 106 *99.th: 63
99.5000th: 2364 99.5000th: 68
99.9000th: 7480 99.9000th: 100
min=0, max=10001min=0, max=134
Latency percentiles (usec)   Latency percentiles (usec)
50.th: 45   50.th: 34
75.th: 62   75.th: 46
90.th: 72   90.th: 53
95.th: 78   95.th: 56
*99.th: 93  *99.th: 61
99.5000th: 108  99.5000th: 64
99.9000th: 6792 99.9000th: 85
min=0, max=17681min=0, max=121
Latency percentiles (usec)   Latency percentiles (usec)
50.th: 46   50.th: 33
75.th: 62   75.th: 44
90.th: 73   90.th: 51
95.th: 79   95.th: 54
*99.th: 113 *99.th: 61
99.5000th: 2724 99.5000th: 64
99.9000th: 6184 99.9000th: 82
min=0, max=9887 min=0, max=121

 Performance counter stats for 'system wide' (5 runs):

context-switches43,373  ( +-  0.40% )   44,597 ( +-  0.55% )
cpu-migrations   1,211  ( +-  5.04% )  220 ( +-  6.23% )
page-faults 15,983  ( +-  5.21% )   15,360 ( +-  3.38% )

Waiman Long suggested using static_keys.

Reported-by: Parth Shah 
Reported-by: Ihor Pasichnyk 
Cc: Parth Shah 
Cc: Ihor Pasichnyk 
Cc: Juri Lelli 
Cc: Phil Auld 
Cc: Waiman Long 
Cc: Gautham R. Shenoy 
Tested-by: Juri Lelli 
Acked-by: Waiman Long 
Reviewed-by: Gautham R. Shenoy 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1 (https://patchwork.ozlabs.org/patch/1204190/) ->v3:
Code is now under CONFIG_PPC_SPLPAR as it depends on CONFIG_PPC_PSERIES.
This was suggested by Waiman Long.

 arch/powerpc/include/as

[PATCH v3 2/2] powerpc/shared: Use static key to detect shared processor

2019-12-05 Thread Srikar Dronamraju
With the static key shared processor available, is_shared_processor()
can return without having to query the lppaca structure.

Cc: Parth Shah 
Cc: Ihor Pasichnyk 
Cc: Juri Lelli 
Cc: Phil Auld 
Cc: Waiman Long 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1 (https://patchwork.ozlabs.org/patch/1204192/) ->v2:
Now that we no more refer to lppaca, remove the comment.

Changelog v2->v3:
Code is now under CONFIG_PPC_SPLPAR as it depends on CONFIG_PPC_PSERIES.
This was suggested by Waiman Long.

 arch/powerpc/include/asm/spinlock.h | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index de817c25deff..e83d57f27566 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -111,13 +111,8 @@ static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
 
 static inline bool is_shared_processor(void)
 {
-/*
- * LPPACA is only available on Pseries so guard anything LPPACA related to
- * allow other platforms (which include this common header) to compile.
- */
-#ifdef CONFIG_PPC_PSERIES
-   return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
-   lppaca_shared_proc(local_paca->lppaca_ptr));
+#ifdef CONFIG_PPC_SPLPAR
+   return static_branch_unlikely(&shared_processor);
 #else
return false;
 #endif
-- 
2.18.1



[PATCH v2 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-04-28 Thread Srikar Dronamraju
A Powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 does not have any
cpus or memory attached to it. As per PAPR, node affinity of a cpu is
only available once it's present / online. For all cpus that are
possible but not present, cpu_to_node() would point to node 0.

To ensure a cpuless, memoryless dummy node is not online, powerpc needs
to make sure cpu_to_node for all possible but not present cpus is set
to a proper node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1:->v2:
- Rebased to v5.7-rc3

 arch/powerpc/mm/numa.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9fcf2d195830..b3615b7fdbdf 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -931,8 +931,20 @@ void __init mem_topology_setup(void)
 
reset_numa_cpu_lookup_table();
 
-   for_each_present_cpu(cpu)
-   numa_setup_cpu(cpu);
+   for_each_possible_cpu(cpu) {
+   /*
+* Powerpc with CONFIG_NUMA always used to have a node 0,
+* even if it was memoryless or cpuless. For all cpus that
+* are possible but not present, cpu_to_node() would point
+* to node 0. To remove a cpuless, memoryless dummy node,
+* powerpc need to make sure all possible but not present
+* cpu_to_node are set to a proper node.
+*/
+   if (cpu_present(cpu))
+   numa_setup_cpu(cpu);
+   else
+   set_cpu_numa_node(cpu, first_online_node);
+   }
 }
 
 void __init initmem_init(void)
-- 
2.20.1



[PATCH v2 2/3] powerpc/numa: Prefer node id queried from vphn

2020-04-28 Thread Srikar Dronamraju
The node id queried from the static device tree may not be correct.
For example, it may always show 0 on a shared processor. Hence prefer
the node id queried from vphn, and fall back on the device tree based
node id if the vphn query fails.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1:->v2:
- Rebased to v5.7-rc3

 arch/powerpc/mm/numa.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b3615b7fdbdf..281531340230 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -719,20 +719,20 @@ static int __init parse_numa_properties(void)
 */
for_each_present_cpu(i) {
struct device_node *cpu;
-   int nid;
-
-   cpu = of_get_cpu_node(i, NULL);
-   BUG_ON(!cpu);
-   nid = of_node_to_nid_single(cpu);
-   of_node_put(cpu);
+   int nid = vphn_get_nid(i);
 
/*
 * Don't fall back to default_nid yet -- we will plug
 * cpus into nodes once the memory scan has discovered
 * the topology.
 */
-   if (nid < 0)
-   continue;
+   if (nid == NUMA_NO_NODE) {
+   cpu = of_get_cpu_node(i, NULL);
+   if (cpu) {
+   nid = of_node_to_nid_single(cpu);
+   of_node_put(cpu);
+   }
+   }
node_set_online(nid);
}
 
-- 
2.20.1



[PATCH v2 0/3] Offline memoryless cpuless node 0

2020-04-28 Thread Srikar Dronamraju
Changelog v1:->v2:
- Rebased to v5.7-rc3
- Updated the changelog.

A Linux kernel configured with CONFIG_NUMA on a system with multiple
possible nodes marks node 0 as online at boot. However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause
1. numa_balancing to be enabled on systems with only one online node.
2. Existence of a dummy (cpuless and memoryless) node which can confuse
users/scripts looking at the output of lscpu / numactl.

This patchset wants to correct this anomaly.

This should only affect systems that have CONFIG_MEMORYLESS_NODES.
Currently there are only 2 architectures ia64 and powerpc that have this
config.

Note: Patch 3 in this patch series depends on patches 1 and 2.
Without patches 1 and 2, patch 3 might crash powerpc.

v5.7-rc3
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
--
 /sys/devices/system/node/online:0,2
 /proc/sys/kernel/numa_balancing:1
 /sys/devices/system/node/has_cpu:   2
 /sys/devices/system/node/has_memory:2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:  0-31

v5.7-rc3 + patches
--
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
--
/sys/devices/system/node/online:2
/proc/sys/kernel/numa_balancing:0
/sys/devices/system/node/has_cpu:   2
/sys/devices/system/node/has_memory:2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:  0-31

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 

Srikar Dronamraju (3):
  powerpc/numa: Set numa_node for all possible cpus
  powerpc/numa: Prefer node id queried from vphn
  mm/page_alloc: Keep memoryless cpuless node 0 offline

 arch/powerpc/mm/numa.c | 32 ++--
 mm/page_alloc.c|  4 +++-
 2 files changed, 25 insertions(+), 11 deletions(-)

-- 
2.20.1



[PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-04-28 Thread Srikar Dronamraju
Currently a Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes marks node 0 as online at boot. However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy cpuless and memoryless
node can confuse users/scripts looking at the output of lscpu / numactl.

By marking N_ONLINE as NODE_MASK_NONE, let's stop assuming that Node 0 is
always online.

v5.7-rc3
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
--
 /sys/devices/system/node/online:0,2
 /proc/sys/kernel/numa_balancing:1
 /sys/devices/system/node/has_cpu:   2
 /sys/devices/system/node/has_memory:2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:  0-31

v5.7-rc3 + patch
--
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
--
/sys/devices/system/node/online:2
/proc/sys/kernel/numa_balancing:0
/sys/devices/system/node/has_cpu:   2
/sys/devices/system/node/has_memory:2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:  0-31

Note: On Powerpc, cpu_to_node of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without the 2 commits, Powerpc system might crash.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1:->v2:
- Rebased to v5.7-rc3
- Updated the changelog.

 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 69827d4fa052..03b89592af04 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
  */
 nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+   [N_ONLINE] = NODE_MASK_NONE,
+#else
[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
[N_NORMAL_MEMORY] = { { [0] = 1UL } },
 #ifdef CONFIG_HIGHMEM
[N_HIGH_MEMORY] = { { [0] = 1UL } },
-- 
2.20.1



Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-04-28 Thread Srikar Dronamraju
> > 
> > By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
> > always online.
> > 
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> >   */
> >  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> > [N_POSSIBLE] = NODE_MASK_ALL,
> > +#ifdef CONFIG_NUMA
> > +   [N_ONLINE] = NODE_MASK_NONE,
> > +#else
> > [N_ONLINE] = { { [0] = 1UL } },
> > -#ifndef CONFIG_NUMA
> > [N_NORMAL_MEMORY] = { { [0] = 1UL } },
> >  #ifdef CONFIG_HIGHMEM
> > [N_HIGH_MEMORY] = { { [0] = 1UL } },
> 
> So on all other NUMA machines, when does node 0 get marked online?
> 
> This change means that for some time during boot, such machines will
> now be running with node 0 marked as offline.  What are the
> implications of this?  Will something break?

Till the nodes are detected, marking Node 0 as online tends to be redundant,
because the system doesn't know if it's a NUMA or a non-NUMA system.
Once we detect the nodes, we online them immediately. Hence I don't see any
side-effects or negative implications of this change.

However if I am missing anything, please do let me know.

From my part, I have tested this on
1. Non-NUMA Single node but CPUs and memory coming from zero node.
2. Non-NUMA Single node but CPUs and memory coming from non-zero node.
3. NUMA Multi node but with CPUs and memory from node 0.
4. NUMA Multi node but with no CPUs and memory from node 0.

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v2 2/3] powerpc/numa: Prefer node id queried from vphn

2020-04-29 Thread Srikar Dronamraju
* Gautham R Shenoy  [2020-04-29 12:22:29]:

> Hello Srikar,
> 
> 
> > +   if (nid == NUMA_NO_NODE) {
> > +   cpu = of_get_cpu_node(i, NULL);
> > +   if (cpu) {
> 
> Why are we not retaining the BUG_ON(!cpu) assert here ?
> 
> > +   nid = of_node_to_nid_single(cpu);
> > +   of_node_put(cpu);
> > +   }
> > +   }
> 
> Is it possible at this point that both vphn_get_nid(i) and
> of_node_to_nid_single(cpu) returns NUMA_NO_NODE ? If so,
> should we still call node_set_online() below ?

Yeah, I think it makes sense to retain the BUG_ON and the if check.

Will incorporate both of them in the next version.

> 
> 
> > node_set_online(nid);
> > }
> > 
> > -- 
> > 2.20.1
> > 
> --
> Thanks and Regards
> gautham.

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-04-30 Thread Srikar Dronamraju
* Michal Hocko  [2020-04-29 14:22:11]:

> On Wed 29-04-20 07:11:45, Srikar Dronamraju wrote:
> > > > 
> > > > By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 
> > > > is
> > > > always online.
> > > > 
> > > > ...
> > > >
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> > > >   */
> > > >  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> > > > [N_POSSIBLE] = NODE_MASK_ALL,
> > > > +#ifdef CONFIG_NUMA
> > > > +   [N_ONLINE] = NODE_MASK_NONE,
> > > > +#else
> > > > [N_ONLINE] = { { [0] = 1UL } },
> > > > -#ifndef CONFIG_NUMA
> > > > [N_NORMAL_MEMORY] = { { [0] = 1UL } },
> > > >  #ifdef CONFIG_HIGHMEM
> > > > [N_HIGH_MEMORY] = { { [0] = 1UL } },
> > > 
> > > So on all other NUMA machines, when does node 0 get marked online?
> > > 
> > > This change means that for some time during boot, such machines will
> > > now be running with node 0 marked as offline.  What are the
> > > implications of this?  Will something break?
> > 
> > Till the nodes are detected, marking Node 0 as online tends to be redundant.
> > Because the system doesn't know if its a NUMA or a non-NUMA system.
> > Once we detect the nodes, we online them immediately. Hence I don't see any
> > side-effects or negative implications of this change.
> > 
> > However if I am missing anything, please do let me know.
> > 
> > From my part, I have tested this on
> > 1. Non-NUMA Single node but CPUs and memory coming from zero node.
> > 2. Non-NUMA Single node but CPUs and memory coming from non-zero node.
> > 3. NUMA Multi node but with CPUs and memory from node 0.
> > 4. NUMA Multi node but with no CPUs and memory from node 0.
> 
> Have you tested on something else than ppc? Each arch does the NUMA
> setup separately and this is a big mess. E.g. x86 marks even memory less
> nodes (see init_memory_less_node) as online.
> 

While I have predominantly tested on ppc, I did test on x86 with CONFIG_NUMA
enabled/disabled on both single node and multi node machines.
However, I don't have a cpuless/memoryless x86 system.

> Honestly I have hard time to evaluate the effect of this patch. It makes
> some sense to assume all nodes offline before they get online but this
> is a land mine territory.
> 
> I am also not sure what kind of problem this is going to address. You
> have mentioned numa balancing without many details.

1. On a machine with just one node, whose node number is not 0,
the current setup will end up showing 2 online nodes. And when there is
more than one online node, numa_balancing gets enabled.

Without patch
$ grep numa /proc/vmstat
numa_hit 95179
numa_miss 0
numa_foreign 0
numa_interleave 3764
numa_local 95179
numa_other 0
numa_pte_updates 1206973 <--
numa_huge_pte_updates 4654 <--
numa_hint_faults 19560 <--
numa_hint_faults_local 19560 <--
numa_pages_migrated 0


With patch
$ grep numa /proc/vmstat 
numa_hit 322338756
numa_miss 0
numa_foreign 0
numa_interleave 3790
numa_local 322338756
numa_other 0
numa_pte_updates 0 <--
numa_huge_pte_updates 0 <--
numa_hint_faults 0     <--
numa_hint_faults_local 0 <--
numa_pages_migrated 0

So we have a redundant page hinting numa faults which we can avoid.

2. A few people have complained about the existence of this dummy node when
parsing lscpu and numactl output. They somehow start to think that the tools
are reporting incorrectly or that the kernel is not able to recognize
resources connected to the node.

-- 
Thanks and Regards
Srikar Dronamraju


[PATCH v3 0/3] Offline memoryless cpuless node 0

2020-04-30 Thread Srikar Dronamraju
Changelog v2:->v3:
- Resolved comments from Gautham.
Link v2: 
https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-sri...@linux.vnet.ibm.com/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3
- Updated the changelog.
Link v1: 
https://lore.kernel.org/linuxppc-dev/2020030237.5731-1-sri...@linux.vnet.ibm.com/t/#u

A Linux kernel configured with CONFIG_NUMA on a system with multiple
possible nodes marks node 0 as online at boot. However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause
1. numa_balancing to be enabled on systems with only one online node.
2. The existence of a dummy (cpuless and memoryless) node, which can
confuse users/scripts looking at the output of lscpu / numactl.

This patchset wants to correct this anomaly.

This should only affect systems that have CONFIG_MEMORYLESS_NODES.
Currently there are only 2 architectures ia64 and powerpc that have this
config.

Note: Patch 3 in this patch series depends on patches 1 and 2.
Without patches 1 and 2, patch 3 might crash powerpc.

v5.7-rc3
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
--
 /sys/devices/system/node/online:0,2
 /proc/sys/kernel/numa_balancing:1
 /sys/devices/system/node/has_cpu:   2
 /sys/devices/system/node/has_memory:2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:  0-31

v5.7-rc3 + patches
--
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
--
/sys/devices/system/node/online:2
/proc/sys/kernel/numa_balancing:0
/sys/devices/system/node/has_cpu:   2
/sys/devices/system/node/has_memory:2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:  0-31

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Gautham R Shenoy 

Srikar Dronamraju (3):
  powerpc/numa: Set numa_node for all possible cpus
  powerpc/numa: Prefer node id queried from vphn
  mm/page_alloc: Keep memoryless cpuless node 0 offline

 arch/powerpc/mm/numa.c | 35 ---
 mm/page_alloc.c|  4 +++-
 2 files changed, 27 insertions(+), 12 deletions(-)

-- 
2.20.1



[PATCH v3 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-04-30 Thread Srikar Dronamraju
A Powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 does not have any
cpus or memory attached to it. As per PAPR, the node affinity of a cpu is
only available once it is present / online. For all cpus that are possible
but not present, cpu_to_node() would point to node 0.

To ensure a cpuless, memoryless dummy node is not online, powerpc needs
to make sure all possible but not present cpu_to_node entries are set to a
proper node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Gautham R Shenoy 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1:->v2:
- Rebased to v5.7-rc3

 arch/powerpc/mm/numa.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9fcf2d195830..b3615b7fdbdf 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -931,8 +931,20 @@ void __init mem_topology_setup(void)
 
reset_numa_cpu_lookup_table();
 
-   for_each_present_cpu(cpu)
-   numa_setup_cpu(cpu);
+   for_each_possible_cpu(cpu) {
+   /*
+* Powerpc with CONFIG_NUMA always used to have a node 0,
+* even if it was memoryless or cpuless. For all cpus that
+* are possible but not present, cpu_to_node() would point
+* to node 0. To remove a cpuless, memoryless dummy node,
+* powerpc need to make sure all possible but not present
+* cpu_to_node are set to a proper node.
+*/
+   if (cpu_present(cpu))
+   numa_setup_cpu(cpu);
+   else
+   set_cpu_numa_node(cpu, first_online_node);
+   }
 }
 
 void __init initmem_init(void)
-- 
2.20.1



[PATCH v3 2/3] powerpc/numa: Prefer node id queried from vphn

2020-04-30 Thread Srikar Dronamraju
The node id queried from the static device tree may not
be correct. For example, it may always show 0 on a shared processor.
Hence prefer the node id queried from vphn and fall back on the device
tree based node id if the vphn query fails.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Gautham R Shenoy 
Signed-off-by: Srikar Dronamraju 
---
Changelog v2:->v3:
- Resolved comments from Gautham.
Link v2: 
https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-sri...@linux.vnet.ibm.com/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3

 arch/powerpc/mm/numa.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b3615b7fdbdf..79c74a2753c1 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -719,21 +719,22 @@ static int __init parse_numa_properties(void)
 */
for_each_present_cpu(i) {
struct device_node *cpu;
-   int nid;
-
-   cpu = of_get_cpu_node(i, NULL);
-   BUG_ON(!cpu);
-   nid = of_node_to_nid_single(cpu);
-   of_node_put(cpu);
+   int nid = vphn_get_nid(i);
 
/*
 * Don't fall back to default_nid yet -- we will plug
 * cpus into nodes once the memory scan has discovered
 * the topology.
 */
-   if (nid < 0)
-   continue;
-   node_set_online(nid);
+   if (nid == NUMA_NO_NODE) {
+   cpu = of_get_cpu_node(i, NULL);
+   BUG_ON(!cpu);
+   nid = of_node_to_nid_single(cpu);
+   of_node_put(cpu);
+   }
+
+   if (likely(nid > 0))
+   node_set_online(nid);
}
 
get_n_mem_cells(&n_mem_addr_cells, &n_mem_size_cells);
-- 
2.20.1



[PATCH v3 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-04-30 Thread Srikar Dronamraju
Currently, a Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes marks node 0 as online at boot.  However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy node, which is cpuless
and memoryless, can confuse users/scripts looking at the output of lscpu /
numactl.

By marking N_ONLINE as NODE_MASK_NONE, let's stop assuming that Node 0 is
always online.

v5.7-rc3
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
--
 /sys/devices/system/node/online:0,2
 /proc/sys/kernel/numa_balancing:1
 /sys/devices/system/node/has_cpu:   2
 /sys/devices/system/node/has_memory:2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:  0-31

v5.7-rc3 + patch
--
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
--
/sys/devices/system/node/online:2
/proc/sys/kernel/numa_balancing:0
/sys/devices/system/node/has_cpu:   2
/sys/devices/system/node/has_memory:2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:  0-31

Note: On Powerpc, cpu_to_node of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without the 2 commits, Powerpc system might crash.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Gautham R Shenoy 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1:->v2:
- Rebased to v5.7-rc3
Link v2: 
https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-sri...@linux.vnet.ibm.com/t/#u

 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 69827d4fa052..03b89592af04 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
  */
 nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+   [N_ONLINE] = NODE_MASK_NONE,
+#else
[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
[N_NORMAL_MEMORY] = { { [0] = 1UL } },
 #ifdef CONFIG_HIGHMEM
[N_HIGH_MEMORY] = { { [0] = 1UL } },
-- 
2.20.1



Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-05-08 Thread Srikar Dronamraju
 to online some of the resources or the resources have gone bad.
Please do note that on hypervisors like PowerVM, the admins don't have
control over which nodes the resources are allocated.

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v3 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-05-08 Thread Srikar Dronamraju
* Christopher Lameter  [2020-05-02 23:05:28]:

> On Fri, 1 May 2020, Srikar Dronamraju wrote:
> 
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> >   */
> >  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> > [N_POSSIBLE] = NODE_MASK_ALL,
> > +#ifdef CONFIG_NUMA
> > +   [N_ONLINE] = NODE_MASK_NONE,
> 
> Hmmm I would have expected that you would have added something early
> in boot that would mark the current node (whatever is is) online instead?

Do correct me if I'm wrong, but these are structure initializations in
page_alloc.c. Wouldn't these happen much before the numa initialization?
I think we are already marking nodes as online as soon as we detect the
nodes.

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v3 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-05-08 Thread Srikar Dronamraju
* Christopher Lameter  [2020-05-02 22:55:16]:

> On Fri, 1 May 2020, Srikar Dronamraju wrote:
> 
> > -   for_each_present_cpu(cpu)
> > -   numa_setup_cpu(cpu);
> > +   for_each_possible_cpu(cpu) {
> > +   /*
> > +* Powerpc with CONFIG_NUMA always used to have a node 0,
> > +* even if it was memoryless or cpuless. For all cpus that
> > +* are possible but not present, cpu_to_node() would point
> > +* to node 0. To remove a cpuless, memoryless dummy node,
> > +* powerpc need to make sure all possible but not present
> > +* cpu_to_node are set to a proper node.
> > +*/
> > +   if (cpu_present(cpu))
> > +   numa_setup_cpu(cpu);
> > +   else
> > +   set_cpu_numa_node(cpu, first_online_node);
> > +   }
> >  }
> 
> 
> Can this be folded into numa_setup_cpu?
> 
> This looks more like numa_setup_cpu needs to change?
> 

We can fold this into numa_setup_cpu().

However, till now we were sure that numa_setup_cpu() would be called only
for a present cpu; that assumption will change. Plus there would be an
(inconsequential) additional check every time a cpu is hotplugged in.

If Michael Ellerman is okay with the change, I can fold it in.

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-05-11 Thread Srikar Dronamraju
* David Hildenbrand  [2020-05-08 15:42:12]:

Hi David,

Thanks for the steps to tryout.

> > 
> > #! /bin/bash
> > sudo x86_64-softmmu/qemu-system-x86_64 \
> > --enable-kvm \
> > -m 4G,maxmem=20G,slots=2 \
> > -smp sockets=2,cores=2 \
> > -numa node,nodeid=0,cpus=0-1,mem=4G -numa node,nodeid=1,cpus=2-3,mem=0G 
> > \
> 
> Sorry, this line has to be
> 
> -numa node,nodeid=0,cpus=0-3,mem=4G -numa node,nodeid=1,mem=0G \
> 
> > -kernel /home/dhildenb/git/linux/arch/x86_64/boot/bzImage \
> > -append "console=ttyS0 rd.shell rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0" \
> > -initrd /boot/initramfs-5.2.8-200.fc30.x86_64.img \
> > -machine pc,nvdimm \
> > -nographic \
> > -nodefaults \
> > -chardev stdio,id=serial \
> > -device isa-serial,chardev=serial \
> > -chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \
> > -mon chardev=monitor,mode=readline
> > 
> > to get a cpu-less and memory-less node 1. Never tried with node 0.
> > 

I tried 

qemu-system-x86_64 -enable-kvm -m 4G,maxmem=20G,slots=2 -smp sockets=2,cores=2 
-cpu host -numa node,nodeid=0,cpus=0-3,mem=4G -numa node,nodeid=1,mem=0G -vga 
none -nographic -serial mon:stdio /home/srikar/fedora.qcow2

and the resulting guest was.

[root@localhost ~]# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 3927 MB
node 0 free: 3316 MB
node distances:
node   0
  0:  10

[root@localhost ~]# lscpu
Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
Address sizes:   40 bits physical, 48 bits virtual
CPU(s):  4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):   2
NUMA node(s):1
Vendor ID:   GenuineIntel
CPU family:  6
Model:   46
Model name:  Intel(R) Xeon(R) CPU   X7560  @ 2.27GHz
Stepping:6
CPU MHz: 2260.986
BogoMIPS:4521.97
Virtualization:  VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:   32K
L1i cache:   32K
L2 cache:4096K
L3 cache:16384K
NUMA node0 CPU(s):   0-3
Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc 
arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni vmx ssse3 cx16 
sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cpuid_fault 
pti ssbd ibrs ibpb tpr_shadow vnmi flexpriority ept vpid tsc_adjust arat umip 
arch_capabilities

[root@localhost ~]# cat /sys/devices/system/node/online
0
[root@localhost ~]# cat /sys/devices/system/node/possible
0-1

-

I also tried

qemu-system-x86_64 -enable-kvm -m 4G,maxmem=20G,slots=2 -smp sockets=2,cores=2 
-cpu host -numa node,nodeid=1,cpus=0-3,mem=4G -numa node,nodeid=0,mem=0G -vga 
none -nographic -serial mon:stdio /home/srikar/fedora.qcow2

and the resulting guest was.

[root@localhost ~]# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 3927 MB
node 0 free: 3316 MB
node distances:
node   0
  0:  10

[root@localhost ~]# lscpu
Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
Address sizes:   40 bits physical, 48 bits virtual
CPU(s):  4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):   2
NUMA node(s):1
Vendor ID:   GenuineIntel
CPU family:  6
Model:   46
Model name:  Intel(R) Xeon(R) CPU   X7560  @ 2.27GHz
Stepping:6
CPU MHz: 2260.986
BogoMIPS:4521.97
Virtualization:  VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:   32K
L1i cache:   32K
L2 cache:4096K
L3 cache:16384K
NUMA node0 CPU(s):   0-3
Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc 
arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni vmx ssse3 cx16 
sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cpuid_fault 
pti ssbd ibrs ibpb tpr_shadow vnmi flexpriority ept vpid tsc_adjust arat umip 
arch_capabilities

[root@localhost ~]# cat /sys/devices/system/node/online
0
[root@localhost ~]# cat /sys/devices/system/node/possible
0-1

Even without my patch, with both combinations, I am still unable to see a
cpuless, memoryless node being online. And the interesting part is that even
if I mark node 0 as cpuless, memoryless and node 1 as the actual node, the
system somewhere marks node 0 as the actual node.

> 
> David / dhildenb
> 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-05-12 Thread Srikar Dronamraju
* David Hildenbrand  [2020-05-12 09:49:05]:

> On 11.05.20 19:47, Srikar Dronamraju wrote:
> > * David Hildenbrand  [2020-05-08 15:42:12]:
> > 
> > 
> > [root@localhost ~]# cat /sys/devices/system/node/online
> > 0
> > [root@localhost ~]# cat /sys/devices/system/node/possible
> > 0-1
> > 
> > Even without my patch, both the combinations, I am still unable to see a
> > cpuless, memoryless node being online. And the interesting part being even
> 
> Yeah, I think on x86, all memory-less and cpu-less nodes are offline as
> default. Especially when hotunplugging cpus/memory, we set them offline
> as well.

I also came to the same conclusion that we may not have a cpuless, memoryless
node on x86.

> 
> But as Michal mentioned, the node handling code is complicated and
> differs between various architectures.
> 

I do agree that node handling code differs across various architectures and
quite complicated.

> > if I mark node 0 as cpuless,memoryless and node 1 as actual node, the system
> > somewhere marks node 0 as the actual node.
> 
> Is the kernel maybe mapping PXM 1 to node 0 in that case, because it
> always requires node 0 to be online/contain memory? Would be interesting
> what happens if you hotplug a DIMM to (QEMU )node 0 - if PXM 0 will be
> mapped to node 1 then as well.
> 

Satheesh Rajendran had tried cpu hotplug on a similar setup and we found
that it crashes the x86 system.
reference: https://bugzilla.kernel.org/show_bug.cgi?id=202187

Even if we were able to hotplug 1 DIMM of memory into node 1, that node
would no longer be memoryless.

-- 
Thanks and Regards
Srikar Dronamraju


[PATCH v4 0/3] Offline memoryless cpuless node 0

2020-05-12 Thread Srikar Dronamraju
Changelog v3:->v4:
- Resolved comments from Christopher.
Link v3: 
http://lore.kernel.org/lkml/20200501031128.19584-1-sri...@linux.vnet.ibm.com/t/#u

Changelog v2:->v3:
- Resolved comments from Gautham.
Link v2: 
https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-sri...@linux.vnet.ibm.com/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3
- Updated the changelog.
Link v1: 
https://lore.kernel.org/linuxppc-dev/2020030237.5731-1-sri...@linux.vnet.ibm.com/t/#u

A Linux kernel configured with CONFIG_NUMA on a system with multiple
possible nodes marks node 0 as online at boot. However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause
1. numa_balancing to be enabled on systems with only one online node.
2. The existence of a dummy (cpuless and memoryless) node, which can
confuse users/scripts looking at the output of lscpu / numactl.

This patchset wants to correct this anomaly.

This should only affect systems that have CONFIG_MEMORYLESS_NODES.
Currently there are only 2 architectures ia64 and powerpc that have this
config.

Note: Patch 3 in this patch series depends on patches 1 and 2.
Without patches 1 and 2, patch 3 might crash powerpc.

v5.7-rc3
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
--
 /sys/devices/system/node/online:0,2
 /proc/sys/kernel/numa_balancing:1
 /sys/devices/system/node/has_cpu:   2
 /sys/devices/system/node/has_memory:2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:  0-31

v5.7-rc3 + patches
--
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
--
/sys/devices/system/node/online:2
/proc/sys/kernel/numa_balancing:0
/sys/devices/system/node/has_cpu:   2
/sys/devices/system/node/has_memory:2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:  0-31

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Gautham R Shenoy 
Cc: Satheesh Rajendran 
Cc: David Hildenbrand 

Srikar Dronamraju (3):
  powerpc/numa: Set numa_node for all possible cpus
  powerpc/numa: Prefer node id queried from vphn
  mm/page_alloc: Keep memoryless cpuless node 0 offline

 arch/powerpc/mm/numa.c | 32 ++--
 mm/page_alloc.c|  4 +++-
 2 files changed, 25 insertions(+), 11 deletions(-)

-- 
1.8.3.1



[PATCH v4 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-05-12 Thread Srikar Dronamraju
A Powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 does not have any
cpus or memory attached to it. As per PAPR, the node affinity of a cpu is
only available once it is present / online. For all cpus that are possible
but not present, cpu_to_node() would point to node 0.

To ensure a cpuless, memoryless dummy node is not online, powerpc needs
to make sure all possible but not present cpu_to_node entries are set to a
proper node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Gautham R Shenoy 
Cc: Satheesh Rajendran 
Cc: David Hildenbrand 
Signed-off-by: Srikar Dronamraju 
---
Changelog v3:->v4:
- Resolved comments from Christopher.
Link v3: 
http://lore.kernel.org/lkml/20200501031128.19584-1-sri...@linux.vnet.ibm.com/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3

 arch/powerpc/mm/numa.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9fcf2d1..5b7918c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -506,6 +506,11 @@ static int numa_setup_cpu(unsigned long lcpu)
int fcpu = cpu_first_thread_sibling(lcpu);
int nid = NUMA_NO_NODE;
 
+   if (!cpu_present(lcpu)) {
+   set_cpu_numa_node(lcpu, first_online_node);
+   return first_online_node;
+   }
+
/*
 * If a valid cpu-to-node mapping is already available, use it
 * directly instead of querying the firmware, since it represents
@@ -931,8 +936,17 @@ void __init mem_topology_setup(void)
 
reset_numa_cpu_lookup_table();
 
-   for_each_present_cpu(cpu)
+   for_each_possible_cpu(cpu) {
+   /*
+* Powerpc with CONFIG_NUMA always used to have a node 0,
+* even if it was memoryless or cpuless. For all cpus that
+* are possible but not present, cpu_to_node() would point
+* to node 0. To remove a cpuless, memoryless dummy node,
+* powerpc need to make sure all possible but not present
+* cpu_to_node are set to a proper node.
+*/
numa_setup_cpu(cpu);
+   }
 }
 
 void __init initmem_init(void)
-- 
1.8.3.1



[PATCH v4 2/3] powerpc/numa: Prefer node id queried from vphn

2020-05-12 Thread Srikar Dronamraju
The node id queried from the static device tree may not
be correct. For example, it may always show 0 on a shared processor.
Hence prefer the node id queried from vphn and fall back on the device
tree based node id if the vphn query fails.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Gautham R Shenoy 
Cc: Satheesh Rajendran 
Cc: David Hildenbrand 
Signed-off-by: Srikar Dronamraju 
---
Changelog v2:->v3:
- Resolved comments from Gautham.
Link v2: 
https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-sri...@linux.vnet.ibm.com/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3

 arch/powerpc/mm/numa.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b3615b7..2815313 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -719,20 +719,20 @@ static int __init parse_numa_properties(void)
 */
for_each_present_cpu(i) {
struct device_node *cpu;
-   int nid;
-
-   cpu = of_get_cpu_node(i, NULL);
-   BUG_ON(!cpu);
-   nid = of_node_to_nid_single(cpu);
-   of_node_put(cpu);
+   int nid = vphn_get_nid(i);
 
/*
 * Don't fall back to default_nid yet -- we will plug
 * cpus into nodes once the memory scan has discovered
 * the topology.
 */
-   if (nid < 0)
-   continue;
-   node_set_online(nid);
+   if (nid == NUMA_NO_NODE) {
+   cpu = of_get_cpu_node(i, NULL);
+   BUG_ON(!cpu);
+   nid = of_node_to_nid_single(cpu);
+   of_node_put(cpu);
+   }
+
+   if (likely(nid > 0))
+   node_set_online(nid);
}
 
get_n_mem_cells(&n_mem_addr_cells, &n_mem_size_cells);
-- 
1.8.3.1



[PATCH v4 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-05-12 Thread Srikar Dronamraju
Currently, a Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes marks node 0 as online at boot.  However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy node, which is cpuless
and memoryless, can confuse users/scripts looking at the output of lscpu /
numactl.

By marking N_ONLINE as NODE_MASK_NONE, let's stop assuming that Node 0 is
always online.

v5.7-rc3
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
--
 /sys/devices/system/node/online:0,2
 /proc/sys/kernel/numa_balancing:1
 /sys/devices/system/node/has_cpu:   2
 /sys/devices/system/node/has_memory:2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:  0-31

v5.7-rc3 + patch
--
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
--
/sys/devices/system/node/online:2
/proc/sys/kernel/numa_balancing:0
/sys/devices/system/node/has_cpu:   2
/sys/devices/system/node/has_memory:2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:  0-31

Note: On Powerpc, cpu_to_node of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without the 2 commits, Powerpc system might crash.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Gautham R Shenoy 
Cc: Satheesh Rajendran 
Cc: David Hildenbrand 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1->v2:
- Rebased to v5.7-rc3
Link v2: 
https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-sri...@linux.vnet.ibm.com/t/#u

 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 69827d4..03b8959 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -116,8 +116,10 @@ struct pcpu_drain {
  */
 nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+   [N_ONLINE] = NODE_MASK_NONE,
+#else
[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
[N_NORMAL_MEMORY] = { { [0] = 1UL } },
 #ifdef CONFIG_HIGHMEM
[N_HIGH_MEMORY] = { { [0] = 1UL } },
-- 
1.8.3.1



[PATCH] powerpc/topology: Check at boot for topology updates

2018-08-07 Thread Srikar Dronamraju
0-327,416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s):152-159,248-255,344-351,440-447
NUMA node11 CPU(s):160-167,256-263,352-359,448-455

Previous attempt to solve this problem
https://patchwork.ozlabs.org/patch/530090/

Reported-by: Manjunatha H R 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/topology.h |  4 ++++
 arch/powerpc/kernel/smp.c           |  6 ++++++
 arch/powerpc/mm/numa.c              | 22 ++++++++++++++--------
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index 16b077801a5f..a14550476bc7 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -92,6 +92,7 @@ extern int stop_topology_update(void);
 extern int prrn_is_enabled(void);
 extern int find_and_online_cpu_nid(int cpu);
 extern int timed_topology_update(int nsecs);
+extern void check_topology_updates(void);
 #else
 static inline int start_topology_update(void)
 {
@@ -113,6 +114,9 @@ static inline int timed_topology_update(int nsecs)
 {
return 0;
 }
+static void check_topology_updates(void)
+{
+}
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
 #include 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 4794d6b4f4d2..2aa0ffd954c9 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1156,6 +1156,12 @@ void __init smp_cpus_done(unsigned int max_cpus)
if (smp_ops && smp_ops->bringup_done)
smp_ops->bringup_done();
 
+   /*
+* On a shared LPAR, associativity needs to be requested.
+* Hence, check for numa topology updates before dumping
+* cpu topology
+*/
+   check_topology_updates();
dump_numa_cpu_topology();
 
/*
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0c7e05d89244..eab46a44436f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1515,6 +1515,7 @@ int start_topology_update(void)
   lppaca_shared_proc(get_lppaca())) {
if (!vphn_enabled) {
vphn_enabled = 1;
+   topology_update_needed = 1;
setup_cpu_associativity_change_counters();
timer_setup(&topology_timer, topology_timer_fn,
TIMER_DEFERRABLE);
@@ -1551,6 +1552,19 @@ int prrn_is_enabled(void)
return prrn_enabled;
 }
 
+void check_topology_updates(void)
+{
+   /* Do not poll for changes if disabled at boot */
+   if (topology_updates_enabled)
+   start_topology_update();
+
+   if (topology_update_needed) {
+   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
+   nr_cpumask_bits);
+   numa_update_cpu_topology(false);
+   }
+}
+
 static int topology_read(struct seq_file *file, void *v)
 {
if (vphn_enabled || prrn_enabled)
@@ -1597,10 +1611,6 @@ static const struct file_operations topology_ops = {
 
 static int topology_update_init(void)
 {
-   /* Do not poll for changes if disabled at boot */
-   if (topology_updates_enabled)
-   start_topology_update();
-
if (vphn_enabled)
topology_schedule_update();
 
@@ -1608,10 +1618,6 @@ static int topology_update_init(void)
return -ENOMEM;
 
topology_inited = 1;
-   if (topology_update_needed)
-   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
-   nr_cpumask_bits);
-
return 0;
 }
 device_initcall(topology_update_init);
-- 
2.17.1



[PATCH v2] powerpc/topology: Check at boot for topology updates

2018-08-08 Thread Srikar Dronamraju
416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s):152-159,248-255,344-351,440-447
NUMA node11 CPU(s):160-167,256-263,352-359,448-455

Previous attempt to solve this problem
https://patchwork.ozlabs.org/patch/530090/

Reported-by: Manjunatha H R 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1->v2:
Fix compile warnings and checkpatch issues.

 arch/powerpc/include/asm/topology.h |  4 ++++
 arch/powerpc/kernel/smp.c           |  6 ++++++
 arch/powerpc/mm/numa.c              | 22 ++++++++++++++--------
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index 16b077801a5f..1024f1587e18 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -43,6 +43,7 @@ extern void __init dump_numa_cpu_topology(void);
 extern int sysfs_add_device_to_node(struct device *dev, int nid);
 extern void sysfs_remove_device_from_node(struct device *dev, int nid);
 extern int numa_update_cpu_topology(bool cpus_locked);
+extern void check_topology_updates(void);
 
 static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node)
 {
@@ -84,6 +85,9 @@ static inline int numa_update_cpu_topology(bool cpus_locked)
 
 static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) {}
 
+static void check_topology_updates(void)
+{
+}
 #endif /* CONFIG_NUMA */
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 4794d6b4f4d2..2aa0ffd954c9 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1156,6 +1156,12 @@ void __init smp_cpus_done(unsigned int max_cpus)
if (smp_ops && smp_ops->bringup_done)
smp_ops->bringup_done();
 
+   /*
+* On a shared LPAR, associativity needs to be requested.
+* Hence, check for numa topology updates before dumping
+* cpu topology
+*/
+   check_topology_updates();
dump_numa_cpu_topology();
 
/*
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0c7e05d89244..eab46a44436f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1515,6 +1515,7 @@ int start_topology_update(void)
   lppaca_shared_proc(get_lppaca())) {
if (!vphn_enabled) {
vphn_enabled = 1;
+   topology_update_needed = 1;
setup_cpu_associativity_change_counters();
timer_setup(&topology_timer, topology_timer_fn,
TIMER_DEFERRABLE);
@@ -1551,6 +1552,19 @@ int prrn_is_enabled(void)
return prrn_enabled;
 }
 
+void check_topology_updates(void)
+{
+   /* Do not poll for changes if disabled at boot */
+   if (topology_updates_enabled)
+   start_topology_update();
+
+   if (topology_update_needed) {
+   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
+   nr_cpumask_bits);
+   numa_update_cpu_topology(false);
+   }
+}
+
 static int topology_read(struct seq_file *file, void *v)
 {
if (vphn_enabled || prrn_enabled)
@@ -1597,10 +1611,6 @@ static const struct file_operations topology_ops = {
 
 static int topology_update_init(void)
 {
-   /* Do not poll for changes if disabled at boot */
-   if (topology_updates_enabled)
-   start_topology_update();
-
if (vphn_enabled)
topology_schedule_update();
 
@@ -1608,10 +1618,6 @@ static int topology_update_init(void)
return -ENOMEM;
 
topology_inited = 1;
-   if (topology_update_needed)
-   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
-   nr_cpumask_bits);
-
return 0;
 }
 device_initcall(topology_update_init);
-- 
2.17.1



Re: [PATCH v6 2/2] powerpc: Use cpu_smallcore_sibling_mask at SMT level on bigcores

2018-08-09 Thread Srikar Dronamraju
* Gautham R. Shenoy  [2018-08-09 11:02:08]:

> 
> 3) ppc64_cpu --smt=2
>SMT domain ceases to exist as each domain consists of just one
>group.
> 

When read in isolation, the above makes it look as though the
ppc64_cpu --smt=2 output itself says "SMT domain ceases to exist".

> @@ -999,7 +1012,17 @@ static void add_cpu_to_masks(int cpu)
>  {
>   int first_thread = cpu_first_thread_sibling(cpu);
>   int chipid = cpu_to_chip_id(cpu);
> - int i;
> +
> + struct thread_groups tg;
> + int i, cpu_group_start = -1;
> +
> + if (has_big_cores) {
> + struct device_node *dn = of_get_cpu_node(cpu, NULL);
> +

Shouldn't we check dn for validity? And aren't we missing of_node_put() calls?

> + parse_thread_groups(dn, &tg);
> + cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
> + cpumask_set_cpu(cpu, cpu_smallcore_sibling_mask(cpu));
> + }
> 
>   /*
>* This CPU will not be in the online mask yet so we need to manually

The rest looks good



Re: [PATCH v6 1/2] powerpc: Detect the presence of big-cores via "ibm,thread-groups"

2018-08-09 Thread Srikar Dronamraju
* Gautham R. Shenoy  [2018-08-09 11:02:07]:

> 
>  int threads_per_core, threads_per_subcore, threads_shift;
> +bool has_big_cores;
>  cpumask_t threads_core_mask;
>  EXPORT_SYMBOL_GPL(threads_per_core);
>  EXPORT_SYMBOL_GPL(threads_per_subcore);
>  EXPORT_SYMBOL_GPL(threads_shift);
> +EXPORT_SYMBOL_GPL(has_big_cores);

Why do we need EXPORT_SYMBOL_GPL?

>  EXPORT_SYMBOL_GPL(threads_core_mask);
> 
> + *
> + * Returns 0 on success, -EINVAL if the property does not exist,
> + * -ENODATA if property does not have a value, and -EOVERFLOW if the
> + * property data isn't large enough.
> + */
> +int parse_thread_groups(struct device_node *dn,
> + struct thread_groups *tg)
> +{
> + unsigned int nr_groups, threads_per_group, property;
> + int i;
> + u32 thread_group_array[3 + MAX_THREAD_LIST_SIZE];
> + u32 *thread_list;
> + size_t total_threads;
> + int ret;
> +
> + ret = of_property_read_u32_array(dn, "ibm,thread-groups",
> +  thread_group_array, 3);
> +
> + if (ret)
> + goto out_err;
> +
> + property = thread_group_array[0];
> + nr_groups = thread_group_array[1];
> + threads_per_group = thread_group_array[2];
> + total_threads = nr_groups * threads_per_group;
> +

Shouldn't we check property and nr_groups here?
If property is not 1 or nr_groups < 1, we should error out.
No point in calling of_property_read_u32_array() again if the property is
not right.


Nit:
Can't we assign directly to tg->property, tg->nr_groups and
tg->threads_per_group, and thereby avoid the local variables property,
nr_groups and threads_per_group?

> + ret = of_property_read_u32_array(dn, "ibm,thread-groups",
> +  thread_group_array,
> +  3 + total_threads);
> +
> +static inline bool dt_has_big_core(struct device_node *dn,
> +struct thread_groups *tg)
> +{
> + if (parse_thread_groups(dn, tg))
> + return false;
> +
> + if (tg->property != 1)
> + return false;
> +
> + if (tg->nr_groups < 1)
> + return false;

We can avoid these checks if we do the checking in parse_thread_groups().

>  /**
>   * setup_cpu_maps - initialize the following cpu maps:
>   *  cpu_possible_mask
> @@ -457,6 +605,7 @@ void __init smp_setup_cpu_maps(void)
>   int cpu = 0;
>   int nthreads = 1;
> 
> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> index 755dc98..f5717de 100644
> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include "cacheinfo.h"
>  #include "setup.h"
> @@ -1025,6 +1026,33 @@ static ssize_t show_physical_id(struct device *dev,
>  }
>  static DEVICE_ATTR(physical_id, 0444, show_physical_id, NULL);
> 
> +static ssize_t show_small_core_siblings(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct cpu *cpu = container_of(dev, struct cpu, dev);
> + struct device_node *dn = of_get_cpu_node(cpu->dev.id, NULL);
> + struct thread_groups tg;
> + int i, j;
> + ssize_t ret = 0;
> +

Here we need to check for validity of dn and error out accordingly.


> + if (parse_thread_groups(dn, &tg))
> + return -ENODATA;

Did we miss an of_node_put(dn)?

> +
> + i = get_cpu_thread_group_start(cpu->dev.id, &tg);
> +
> + if (i == -1)
> + return -ENODATA;
> +
> + for (j = 0; j < tg.threads_per_group - 1; j++)
> + ret += sprintf(buf + ret, "%d,", tg.thread_list[i + j]);

Here, we are making the assumption that group_start will always be the
first thread in the thread group. However, we didn't make the same
assumption in get_cpu_thread_group_start().



[PATCH v3] powerpc/topology: Check at boot for topology updates

2018-08-09 Thread Srikar Dronamraju
416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s):152-159,248-255,344-351,440-447
NUMA node11 CPU(s):160-167,256-263,352-359,448-455

Previous attempt to solve this problem
https://patchwork.ozlabs.org/patch/530090/

Reported-by: Manjunatha H R 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1->v2
Fix compile warnings and checkpatch issues.
Changelog v2->v3
Fix compile warnings on !CONFIG_SMP

 arch/powerpc/include/asm/topology.h |  5 +++++
 arch/powerpc/kernel/smp.c           |  6 ++++++
 arch/powerpc/mm/numa.c              | 22 ++++++++++++++--------
 3 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index 16b077801a5f..70f2d2285ba7 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -92,6 +92,7 @@ extern int stop_topology_update(void);
 extern int prrn_is_enabled(void);
 extern int find_and_online_cpu_nid(int cpu);
 extern int timed_topology_update(int nsecs);
+extern void __init check_topology_updates(void);
 #else
 static inline int start_topology_update(void)
 {
@@ -113,6 +114,10 @@ static inline int timed_topology_update(int nsecs)
 {
return 0;
 }
+
+#ifdef CONFIG_SMP
+static inline void check_topology_updates(void) {}
+#endif
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
 #include 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 4794d6b4f4d2..2aa0ffd954c9 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1156,6 +1156,12 @@ void __init smp_cpus_done(unsigned int max_cpus)
if (smp_ops && smp_ops->bringup_done)
smp_ops->bringup_done();
 
+   /*
+* On a shared LPAR, associativity needs to be requested.
+* Hence, check for numa topology updates before dumping
+* cpu topology
+*/
+   check_topology_updates();
dump_numa_cpu_topology();
 
/*
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0c7e05d89244..32c13a208589 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1515,6 +1515,7 @@ int start_topology_update(void)
   lppaca_shared_proc(get_lppaca())) {
if (!vphn_enabled) {
vphn_enabled = 1;
+   topology_update_needed = 1;
setup_cpu_associativity_change_counters();
timer_setup(&topology_timer, topology_timer_fn,
TIMER_DEFERRABLE);
@@ -1551,6 +1552,19 @@ int prrn_is_enabled(void)
return prrn_enabled;
 }
 
+void __init check_topology_updates(void)
+{
+   /* Do not poll for changes if disabled at boot */
+   if (topology_updates_enabled)
+   start_topology_update();
+
+   if (topology_update_needed) {
+   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
+   nr_cpumask_bits);
+   numa_update_cpu_topology(false);
+   }
+}
+
 static int topology_read(struct seq_file *file, void *v)
 {
if (vphn_enabled || prrn_enabled)
@@ -1597,10 +1611,6 @@ static const struct file_operations topology_ops = {
 
 static int topology_update_init(void)
 {
-   /* Do not poll for changes if disabled at boot */
-   if (topology_updates_enabled)
-   start_topology_update();
-
if (vphn_enabled)
topology_schedule_update();
 
@@ -1608,10 +1618,6 @@ static int topology_update_init(void)
return -ENOMEM;
 
topology_inited = 1;
-   if (topology_update_needed)
-   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
-   nr_cpumask_bits);
-
return 0;
 }
 device_initcall(topology_update_init);
-- 
2.17.1



Re: [PATCH v3] powerpc/topology: Check at boot for topology updates

2018-08-10 Thread Srikar Dronamraju
* Michael Ellerman  [2018-08-10 21:42:28]:

> Srikar Dronamraju  writes:
> > diff --git a/arch/powerpc/include/asm/topology.h 
> > b/arch/powerpc/include/asm/topology.h
> > index 16b077801a5f..70f2d2285ba7 100644
> > --- a/arch/powerpc/include/asm/topology.h
> > +++ b/arch/powerpc/include/asm/topology.h
> > @@ -113,6 +114,10 @@ static inline int timed_topology_update(int nsecs)
> >  {
> > return 0;
> >  }
> > +
> > +#ifdef CONFIG_SMP
> > +static inline void check_topology_updates(void) {}
> > +#endif
> 
> I realise that's only called from smp.c but it doesn't really need to be
> inside #ifdef CONFIG_SMP, it's harmless to have an empty version for
> SMP=n.
> 

I did that just to fix
https://openpower.xyz/job/snowpatch/job/snowpatch-linux-sparse/577//


> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 4794d6b4f4d2..2aa0ffd954c9 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -1156,6 +1156,12 @@ void __init smp_cpus_done(unsigned int max_cpus)
> > if (smp_ops && smp_ops->bringup_done)
> > smp_ops->bringup_done();
> >  
> > +   /*
> > +* On a shared LPAR, associativity needs to be requested.
> > +* Hence, check for numa topology updates before dumping
> > +* cpu topology
> > +*/
> > +   check_topology_updates();
> 
> I get that you're calling it here because you want to get it correct
> before we do the dump below, but we could move the dump later also.
> 
> You mention we need to do the check before the sched domains are
> initialised, but where exactly does that happen?

Almost immediately after smp_cpus_done, 

kernel_init_freeable() calls smp_init() and sched_init_smp() back to
back.

smp_init() calls smp_cpus_done()
sched_init_smp() calls sched_init_numa() and sched_init_domains()

sched_init_numa() initialises sched_domains_numa_masks, which is the
cpumask that matters the most in our discussion.
sched_init_domains() initialises the sched domains.

So I don't think we can move it any later.

> 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 0c7e05d89244..32c13a208589 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -1515,6 +1515,7 @@ int start_topology_update(void)
> >lppaca_shared_proc(get_lppaca())) {
> > if (!vphn_enabled) {
> > vphn_enabled = 1;
> > +   topology_update_needed = 1;
> 
> I really don't understand what topology_update_needed is for.
> We set it here.

topology_update_needed is also set by numa_update_cpu_topology(), which
might get called from rtasd.

> 
> > setup_cpu_associativity_change_counters();
> > timer_setup(&topology_timer, topology_timer_fn,
> > TIMER_DEFERRABLE);
> > @@ -1551,6 +1552,19 @@ int prrn_is_enabled(void)
> > return prrn_enabled;
> >  }
> >  
> > +void __init check_topology_updates(void)
> > +{
> > +   /* Do not poll for changes if disabled at boot */
> > +   if (topology_updates_enabled)
> > +   start_topology_update();
> 
> Here we call start_topology_update(), which might set
> topology_update_needed to 1.
> 
> > +
> > +   if (topology_update_needed) {
> 
> And then we check it here?
> Why is this logic not *in* start_topology_update() ?

I have split topology_update_init() into two parts, the first part being
check_topology_updates() and the second being topology_update_init().

By the time the current topology_update_init() gets called, sched
domains are already created. So we need to move this part earlier.

However, the scheduled topology updates due to vphn can wait, at least
that's how the current code is written. topology_inited indicates
whether the scheduled topology updates have been set up.

> 
> > +   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
> > +   nr_cpumask_bits);
> > +   numa_update_cpu_topology(false);
> > +   }
> > +}
> 
> I'm not really clear on what check_topology_update() is responsible for
> doing vs topology_update_init(). Why do we need both?

I am okay with moving the complete topology_update_init() above and
deleting the device_initcall. But then we would be moving the polling of
vphn events to a slightly earlier stage of initialisation.

> 
> > @@ -1608,10 +1618,6 @@ static int topology_update_init(void)
> > return -ENOMEM;
> >  
> > topology_inited = 1;
> 
What does that mean? Didn't

Re: [PATCH] sched/topology: Use Identity node only if required

2018-08-10 Thread Srikar Dronamraju
* Peter Zijlstra  [2018-08-08 09:58:41]:

> On Wed, Aug 08, 2018 at 12:39:31PM +0530, Srikar Dronamraju wrote:
> > With Commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node
> > sched domain") scheduler introduces an extra numa level. However that
> > leads to
> > 
> >  - numa topology on 2 node systems no more marked as NUMA_DIRECT.  After
> >this commit, it gets reported as NUMA_BACKPLANE. This is because
> >sched_domains_numa_level now equals 2 on 2 node systems.
> > 
> >  - Extra numa sched domain that gets added and degenerated on most
> >machines.  The Identity node is only needed on very few systems.
> >Also all non-numa systems will end up populating
> >sched_domains_numa_distance and sched_domains_numa_masks tables.
> > 
> >  - On shared lpars like powerpc, this extra sched domain creation can
> >lead to repeated rcu stalls, sometimes even causing unresponsive
> >systems on boot. On such stalls, it was noticed that
> >init_sched_groups_capacity() (sg != sd->groups is always true).
> 
> The idea was that if the topology level is redundant (as it often is);
> then the degenerate code would take it out.
> 
> Why is that not working (right) and can we fix that instead?
> 

Here is my analysis on another box showing the same issue.
numactl output

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39 64 65 66 67 68 69 70 71 96 
97 98 99 100 101 102 103 128 129 130 131 132 133 134 135 160 161 162 163 164 
165 166 167 192 193 194 195 196 197 198 199 224 225 226 227 228 229 230 231 256 
257 258 259 260 261 262 263 288 289 290 291 292 293 294 295
node 0 size: 536888 MB
node 0 free: 533582 MB
node 1 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63 88 89 90 91 92 93 
94 95 120 121 122 123 124 125 126 127 152 153 154 155 156 157 158 159 184 185 
186 187 188 189 190 191 216 217 218 219 220 221 222 223 248 249 250 251 252 253 
254 255 280 281 282 283 284 285 286 287
node 1 size: 502286 MB
node 1 free: 501283 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 80 81 82 83 84 85 
86 87 112 113 114 115 116 117 118 119 144 145 146 147 148 149 150 151 176 177 
178 179 180 181 182 183 208 209 210 211 212 213 214 215 240 241 242 243 244 245 
246 247 272 273 274 275 276 277 278 279
node 2 size: 503054 MB
node 2 free: 502854 MB
node 3 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 
79 104 105 106 107 108 109 110 111 136 137 138 139 140 141 142 143 168 169 170 
171 172 173 174 175 200 201 202 203 204 205 206 207 232 233 234 235 236 237 238 
239 264 265 266 267 268 269 270 271 296 297 298 299 300 301 302 303
node 3 size: 503310 MB
node 3 free: 498465 MB
node distances:
node   0   1   2   3
  0:  10  40  40  40
  1:  40  10  40  40
  2:  40  40  10  40
  3:  40  40  40  10

Contents of dmesg, extracted using the sched_debug kernel parameter:

CPU0 attaching NULL sched-domain.
CPU1 attaching NULL sched-domain.


CPU302 attaching NULL sched-domain.
CPU303 attaching NULL sched-domain.
BUG: arch topology borken
 the DIE domain not a subset of the NODE domain
BUG: arch topology borken
 the DIE domain not a subset of the NODE domain
.
.
BUG: arch topology borken
 the DIE domain not a subset of the NODE domain
BUG: arch topology borken
 the DIE domain not a subset of the NODE domain
BUG: arch topology borken
 the DIE domain not a subset of the NODE domain

 CPU0 attaching sched-domain(s):
   domain-2: sdA, span=0-303 level=NODE
groups: sg=sgL 0:{ 
span=0-7,32-39,64-71,96-103,128-135,160-167,192-199,224-231,256-263,288-295 
cap=81920 }, sgM 8:{ 
span=8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303 
cap=81920 }, sdN 16:{ 
span=16-23,48-55,80-87,112-119,144-151,176-183,208-215,240-247,272-279 
cap=73728 }, sgO 24:{ 
span=24-31,56-63,88-95,120-127,152-159,184-191,216-223,248-255,280-287 
cap=73728 }
CPU1  attaching sched-domain(s):
   domain-2: sdB, span=0-303 level=NODE
[  367.739387] groups: sg=sgL 0:{ 
span=0-7,32-39,64-71,96-103,128-135,160-167,192-199,224-231,256-263,288-295 
cap=81920 }, sgM 8:{ 
span=8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303 
cap=81920 }, sdN 16:{ 
span=16-23,48-55,80-87,112-119,144-151,176-183,208-215,240-247,272-279 
cap=73728 }, sgO 24:{ 
span=24-31,56-63,88-95,120-127,152-159,184-191,216-223,248-255,280-287 
cap=73728 }


CPU8  attaching sched-domain(s):
   domain-2: sdC, 
span=8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303 
level=NODE
groups: sgM 8:{ 
span=8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303 
cap=81920 }
domain-3: sdD, span=0-303 level=NUMA
 groups: sgX 8:{ 
span=8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303 
cap=81920 }, sgY 16:{ 
span=16-23,48-55,80-87,112-119,144

[PATCH 1/2] sched/topology: Set correct numa topology type

2018-08-10 Thread Srikar Dronamraju
With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node
sched domain") the scheduler introduces a new numa level. However, this
means the numa topology on 2 node systems is no longer marked as
NUMA_DIRECT; after this commit it gets reported as NUMA_BACKPLANE. This
is because sched_domains_numa_levels is now 2 on 2 node systems.

Fix this by marking systems that have up to 2 numa levels as
NUMA_DIRECT.

While here, remove code that assumes level can be 0.

Fixes: 051f3ca02e46 "Introduce NUMA identity node sched domain"
Signed-off-by: Srikar Dronamraju 
---
 kernel/sched/topology.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a6e6b855ba81..cec3ee3ed320 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1315,7 +1315,7 @@ static void init_numa_topology_type(void)
 
n = sched_max_numa_distance;
 
-   if (sched_domains_numa_levels <= 1) {
+   if (sched_domains_numa_levels <= 2) {
sched_numa_topology_type = NUMA_DIRECT;
return;
}
@@ -1400,9 +1400,6 @@ void sched_init_numa(void)
break;
}
 
-   if (!level)
-   return;
-
/*
 * 'level' contains the number of unique distances
 *
-- 
2.12.3



[PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch

2018-08-10 Thread Srikar Dronamraju
With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node
sched domain") the scheduler introduces a new numa level. However, on
shared lpars like powerpc, this extra sched domain creation can lead to
repeated rcu stalls, sometimes even causing unresponsive systems on
boot. During such stalls, it was noticed that in
init_sched_groups_capacity(), sg != sd->groups is always true.

INFO: rcu_sched self-detected stall on CPU
 1-: (240039 ticks this GP) idle=c32/1/4611686018427387906 softirq=782/782 
fqs=80012
  (t=240039 jiffies g=6272 c=6271 q=263040)
NMI backtrace for cpu 1
CPU: 1 PID: 1576 Comm: kworker/1:1 Kdump: loaded Tainted: GE 
4.18.0-rc7-master+ #42
Workqueue: events topology_work_fn
Call Trace:
[c0832132f190] [c09557ac] dump_stack+0xb0/0xf4 (unreliable)
[c0832132f1d0] [c095ed54] nmi_cpu_backtrace+0x1b4/0x230
[c0832132f270] [c095efac] nmi_trigger_cpumask_backtrace+0x1dc/0x220
[c0832132f310] [c005f77c] arch_trigger_cpumask_backtrace+0x2c/0x40
[c0832132f330] [c01a32d4] rcu_dump_cpu_stacks+0x100/0x15c
[c0832132f380] [c01a2024] rcu_check_callbacks+0x894/0xaa0
[c0832132f4a0] [c01ace9c] update_process_times+0x4c/0xa0
[c0832132f4d0] [c01c5400] tick_sched_handle.isra.13+0x50/0x80
[c0832132f4f0] [c01c549c] tick_sched_timer+0x6c/0xd0
[c0832132f530] [c01ae044] __hrtimer_run_queues+0x134/0x360
[c0832132f5b0] [c01aeea4] hrtimer_interrupt+0x124/0x300
[c0832132f660] [c0024a04] timer_interrupt+0x114/0x2f0
[c0832132f6c0] [c00090f4] decrementer_common+0x114/0x120
--- interrupt: 901 at __bitmap_weight+0x70/0x100
LR = __bitmap_weight+0x78/0x100
[c0832132f9b0] [c09bb738] __func__.61127+0x0/0x20 (unreliable)
[c0832132fa00] [c016c178] build_sched_domains+0xf98/0x13f0
[c0832132fb30] [c016d73c] partition_sched_domains+0x26c/0x440
[c0832132fc20] [c01ee284] rebuild_sched_domains_locked+0x64/0x80
[c0832132fc50] [c01f11ec] rebuild_sched_domains+0x3c/0x60
[c0832132fc80] [c007e1c4] topology_work_fn+0x24/0x40
[c0832132fca0] [c0126704] process_one_work+0x1a4/0x470
[c0832132fd30] [c0126a68] worker_thread+0x98/0x540
[c0832132fdc0] [c012f078] kthread+0x168/0x1b0
[c0832132fe30] [c000b65c]
ret_from_kernel_thread+0x5c/0x80

Similar problem was earlier also reported at
https://lwn.net/ml/linux-kernel/20180512100233.GB3738@osiris/

Allow arch to set and clear masks corresponding to numa sched domain.

Cc: linuxppc-dev 
Cc: Michael Ellerman 
Cc: Peter Zijlstra 
Cc: Heiko Carstens 
Cc: Ingo Molnar 
Cc: LKML 
Fixes: 051f3ca02e46 "Introduce NUMA identity node sched domain"
Signed-off-by: Srikar Dronamraju 

---
 include/linux/sched/topology.h | 6 ++
 kernel/sched/sched.h   | 4 
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..13c7baeb7789 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -52,6 +52,12 @@ static inline int cpu_numa_flags(void)
 {
return SD_NUMA;
 }
+
+extern void sched_domains_numa_masks_set(unsigned int cpu);
+extern void sched_domains_numa_masks_clear(unsigned int cpu);
+#else
+static inline void sched_domains_numa_masks_set(unsigned int cpu) { }
+static inline void sched_domains_numa_masks_clear(unsigned int cpu) { }
 #endif
 
 extern int arch_asym_cpu_priority(int cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7742dcc136c..1028f3df8777 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1057,12 +1057,8 @@ extern bool find_numa_distance(int distance);
 
 #ifdef CONFIG_NUMA
 extern void sched_init_numa(void);
-extern void sched_domains_numa_masks_set(unsigned int cpu);
-extern void sched_domains_numa_masks_clear(unsigned int cpu);
 #else
 static inline void sched_init_numa(void) { }
-static inline void sched_domains_numa_masks_set(unsigned int cpu) { }
-static inline void sched_domains_numa_masks_clear(unsigned int cpu) { }
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-- 
2.12.3



[PATCH v4] powerpc/topology: Get topology for shared processors at boot

2018-08-13 Thread Srikar Dronamraju
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s):152-159,248-255,344-351,440-447
NUMA node11 CPU(s):160-167,256-263,352-359,448-455

Previous attempt to solve this problem
https://patchwork.ozlabs.org/patch/530090/

Reported-by: Manjunatha H R 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1->v2
 Fix compile warnings and checkpatch issues.

Changelog v2->v3
 Fix compile warnings on !CONFIG_SMP

Changelog v3->v4
 Now do early topology init on shared processors.  Earlier we used to do it
 only when vphn was enabled.  However we want this update to happen even when
 topology_updates=off.  Changed the patch title accordingly.

 arch/powerpc/include/asm/topology.h |  5 +
 arch/powerpc/kernel/smp.c   |  6 ++
 arch/powerpc/mm/numa.c  | 22 ++
 3 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index 16b077801a5f..a4a718dbfec6 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -92,6 +92,7 @@ extern int stop_topology_update(void);
 extern int prrn_is_enabled(void);
 extern int find_and_online_cpu_nid(int cpu);
 extern int timed_topology_update(int nsecs);
+extern void __init shared_proc_topology_init(void);
 #else
 static inline int start_topology_update(void)
 {
@@ -113,6 +114,10 @@ static inline int timed_topology_update(int nsecs)
 {
return 0;
 }
+
+#ifdef CONFIG_SMP
+static inline void shared_proc_topology_init(void) {}
+#endif
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
 #include 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 4794d6b4f4d2..b3142c7b9c31 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1156,6 +1156,11 @@ void __init smp_cpus_done(unsigned int max_cpus)
if (smp_ops && smp_ops->bringup_done)
smp_ops->bringup_done();
 
+   /*
+* On a shared LPAR, associativity needs to be requested.
+* Hence, get numa topology before dumping cpu topology
+*/
+   shared_proc_topology_init();
dump_numa_cpu_topology();
 
/*
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0c7e05d89244..16250fb25411 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1551,6 +1551,15 @@ int prrn_is_enabled(void)
return prrn_enabled;
 }
 
+void __init shared_proc_topology_init(void)
+{
+   if (lppaca_shared_proc(get_lppaca())) {
+   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
+   nr_cpumask_bits);
+   numa_update_cpu_topology(false);
+   }
+}
+
 static int topology_read(struct seq_file *file, void *v)
 {
if (vphn_enabled || prrn_enabled)
-- 
2.17.1



[PATCH v5] powerpc/topology: Get topology for shared processors at boot

2018-08-17 Thread Srikar Dronamraju
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s):152-159,248-255,344-351,440-447
NUMA node11 CPU(s):160-167,256-263,352-359,448-455

Previous attempt to solve this problem
https://patchwork.ozlabs.org/patch/530090/

Reported-by: Manjunatha H R 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1->v2
 Fix compile warnings and checkpatch issues.

Changelog v2->v3
 Fix compile warnings on !CONFIG_SMP

Changelog v3->v4
 Now do early topology init on shared processors.  Earlier we used to do it
 only when vphn was enabled.  However we want this update to happen even when
 topology_updates=off.  Changed the patch title accordingly.

Changelog v4->v5:
 Handled comments from Michael Ellerman where numa_update_cpu_topology
 was simply returning without updating the topology.

 arch/powerpc/include/asm/topology.h |  5 +
 arch/powerpc/kernel/smp.c   |  6 ++
 arch/powerpc/mm/numa.c  | 22 ++
 3 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index 16b077801a5f..a4a718dbfec6 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -92,6 +92,7 @@ extern int stop_topology_update(void);
 extern int prrn_is_enabled(void);
 extern int find_and_online_cpu_nid(int cpu);
 extern int timed_topology_update(int nsecs);
+extern void __init shared_proc_topology_init(void);
 #else
 static inline int start_topology_update(void)
 {
@@ -113,6 +114,10 @@ static inline int timed_topology_update(int nsecs)
 {
return 0;
 }
+
+#ifdef CONFIG_SMP
+static inline void shared_proc_topology_init(void) {}
+#endif
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
 #include 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 4794d6b4f4d2..b3142c7b9c31 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1156,6 +1156,11 @@ void __init smp_cpus_done(unsigned int max_cpus)
if (smp_ops && smp_ops->bringup_done)
smp_ops->bringup_done();
 
+   /*
+* On a shared LPAR, associativity needs to be requested.
+* Hence, get numa topology before dumping cpu topology
+*/
+   shared_proc_topology_init();
dump_numa_cpu_topology();
 
/*
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0c7e05d89244..35ac5422903a 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1078,7 +1078,6 @@ static int prrn_enabled;
 static void reset_topology_timer(void);
 static int topology_timer_secs = 1;
 static int topology_inited;
-static int topology_update_needed;
 
 /*
  * Change polling interval for associativity changes.
@@ -1306,11 +1305,8 @@ int numa_update_cpu_topology(bool cpus_locked)
struct device *dev;
int weight, new_nid, i = 0;
 
-   if (!prrn_enabled && !vphn_enabled) {
-   if (!topology_inited)
-   topology_update_needed = 1;
+   if (!prrn_enabled && !vphn_enabled && topology_inited)
return 0;
-   }
 
weight = cpumask_weight(&cpu_associativity_changes_mask);
if (!weight)
@@ -1423,7 +1419,6 @@ int numa_update_cpu_topology(bool cpus_locked)
 
 out:
kfree(updates);
-   topology_update_needed = 0;
return changed;
 }
 
@@ -1551,6 +1546,15 @@ int prrn_is_enabled(void)
return prrn_enabled;
 }
 
+void __init shared_proc_topology_init(void)
+{
+   if (lppaca_shared_proc(get_lppaca())) {
+   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
+   nr_cpumask_bits);
+   numa_update_cpu_topology(false);
+   }
+}
+
 static int topology_read(struct seq_file *file, void *v)
 {
if (vphn_enabled || prrn_enabled)
@@ -1608,10 +1612,6 @@ static int topology_update_init(void)
return -ENOMEM;
 
topology_inited = 1;
-   if (topology_update_needed)
-   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
-   nr_cpumask_bits);
-
return 0;
 }
 device_initcall(topology_update_init);
-- 
2.17.1



Re: [PATCH v7 1/2] powerpc: Detect the presence of big-cores via "ibm,thread-groups"

2018-08-20 Thread Srikar Dronamraju
* Gautham R. Shenoy  [2018-08-20 11:11:43]:

> From: "Gautham R. Shenoy" 
> 
> On IBM POWER9, the device tree exposes a property array identifed by
one small nit:
s/identifed/identified/g

> "ibm,thread-groups" which will indicate which groups of threads share
> a particular set of resources.
> 
> As of today we only have one form of grouping identifying the group of
> threads in the core that share the L1 cache, translation cache and
> instruction data flow.
> 
> This patch defines the helper function to parse the contents of
> "ibm,thread-groups" and a new structure to contain the parsed output.
> 
> The patch also creates the sysfs file named "small_core_siblings" that
> returns the physical ids of the threads in the core that share the L1
> cache, translation cache and instruction data flow.
> 
> Signed-off-by: Gautham R. Shenoy 

Otherwise looks good to me.

Reviewed-by: Srikar Dronamraju 



Re: [PATCH v7 2/2] powerpc: Use cpu_smallcore_sibling_mask at SMT level on bigcores

2018-08-20 Thread Srikar Dronamraju
* Gautham R. Shenoy  [2018-08-20 11:11:44]:

> From: "Gautham R. Shenoy" 
> 
> Each of the SMT4 cores forming a big-core are more or less independent
> units. Thus when multiple tasks are scheduled to run on the fused
> core, we get the best performance when the tasks are spread across the
> pair of SMT4 cores.
> 
> This patch achieves this by setting the SMT level mask to correspond
> to the smallcore sibling mask on big-core systems. This patch also
> ensures that while checked for shared-caches on big-core system, we
> use the smallcore_sibling_mask to compare with the l2_cache_mask.
> This ensure that the CACHE level sched-domain is created, whose groups
> correspond to the threads of the big-core.
> 
> With this patch, the SMT sched-domain with SMT=8,4,2 on big-core
> systems are as follows:


Reviewed-by: Srikar Dronamraju 



Re: [PATCH 1/2] sched/topology: Set correct numa topology type

2018-08-21 Thread Srikar Dronamraju
* Srikar Dronamraju  [2018-08-10 22:30:18]:

> With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node
> sched domain") the scheduler introduces a new numa level. However this
> leads to the numa topology on 2 node systems no longer being marked as
> NUMA_DIRECT. After this commit, it gets reported as NUMA_BACKPLANE. This
> is because sched_domains_numa_level is now 2 on 2 node systems.
> 
> Fix this by allowing systems that have up to 2 numa levels to be set as
> NUMA_DIRECT.
> 
> While here, remove code that assumes the level can be 0.
> 
> Fixes: 051f3ca02e46 "Introduce NUMA identity node sched domain"
> Signed-off-by: Srikar Dronamraju 
> ---

Hey Peter,

Did you look at these two patches?

-- 
Thanks and Regards
Srikar Dronamraju



Re: [v5] powerpc/topology: Get topology for shared processors at boot

2018-08-21 Thread Srikar Dronamraju
* Michael Ellerman  [2018-08-21 20:35:23]:

> On Fri, 2018-08-17 at 14:54:39 UTC, Srikar Dronamraju wrote:
> > On a shared lpar, Phyp will not update the cpu associativity at boot
> > time. Just after the boot system does recognize itself as a shared lpar and
> > trigger a request for correct cpu associativity. But by then the scheduler
> > would have already created/destroyed its sched domains.
> > 
> > This causes
> > - Broken load balance across Nodes causing islands of cores.
> > - Performance degradation esp if the system is lightly loaded
> > - dmesg to wrongly report all cpus to be in Node 0.
> > - Messages in dmesg saying borken topology.
> > - With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity
> >   node sched domain"), can cause rcu stalls at boot up.
> > 
> > 
> > Previous attempt to solve this problem
> > https://patchwork.ozlabs.org/patch/530090/
> > 
> > Reported-by: Manjunatha H R 
> > Signed-off-by: Srikar Dronamraju 
> 
> Applied to powerpc next, thanks.
> 
> https://git.kernel.org/powerpc/c/2ea62630681027c455117aa471ea3a
> 

Once it gets to Linus's tree, can we request this to be included in
stable trees?

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH] sched/topology: Use Identity node only if required

2018-08-31 Thread Srikar Dronamraju
* Peter Zijlstra  [2018-08-29 10:43:48]:

> On Fri, Aug 10, 2018 at 09:45:33AM -0700, Srikar Dronamraju wrote:
> 
> > 
> > CPU302 attaching NULL sched-domain.
> > CPU303 attaching NULL sched-domain.
> > BUG: arch topology borken
> >  the DIE domain not a subset of the NODE domain
> 
> ^ CLUE!!
> 
> but nowhere did you show what it thinks the DIE mask is.
> 
> >  CPU0 attaching sched-domain(s):
> >domain-2: sdA, span=0-303 level=NODE
> > groups: sg=sgL 0:{ 
> > span=0-7,32-39,64-71,96-103,128-135,160-167,192-199,224-231,256-263,288-295 
> > cap=81920 }, sgM 8:{ 
> > span=8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303
> >  cap=81920 }, sdN 16:{ 
> > span=16-23,48-55,80-87,112-119,144-151,176-183,208-215,240-247,272-279 
> > cap=73728 }, sgO 24:{ 
> > span=24-31,56-63,88-95,120-127,152-159,184-191,216-223,248-255,280-287 
> > cap=73728 }
> > CPU1  attaching sched-domain(s):
> >domain-2: sdB, span=0-303 level=NODE
> > [  367.739387] groups: sg=sgL 0:{ 
> > span=0-7,32-39,64-71,96-103,128-135,160-167,192-199,224-231,256-263,288-295 
> > cap=81920 }, sgM 8:{ 
> > span=8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303
> >  cap=81920 }, sdN 16:{ 
> > span=16-23,48-55,80-87,112-119,144-151,176-183,208-215,240-247,272-279 
> > cap=73728 }, sgO 24:{ 
> > span=24-31,56-63,88-95,120-127,152-159,184-191,216-223,248-255,280-287 
> > cap=73728 }
> 
> You forgot to provide the rest of it... what's domain-[01] look like?

At boot: Before topology update.

For  CPU 0 
domain-0: span=0-7 level=SMT
 groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 
5:{ span=5 }, 6:{ span=6 }, 7:{ span=7 }
 domain-1: span=0-303 level=DIE
  groups: 0:{ span=0-7 cap=8192 }, 8:{ span=8-15 cap=8192 }, 16:{ span=16-23 
cap=8192 }, 24:{ span=24-31 cap=8192 }, 32:{ span=32-39 cap=8192 }, 40:{ 
span=40-47 cap=8192 }, 48:{ span=48-55 cap=8192 }, 56:{ span=56-63 cap=8192 }, 
64:{ span=64-71 cap=8192 }, 72:{ span=72-79 cap=8192 }, 80:{ span=80-87 
cap=8192 }, 88:{ span=88-95 cap=8192 }, 96:{ span=96-103 cap=8192 }, 104:{ 
span=104-111 cap=8192 }, 112:{ span=112-119 cap=8192 }, 120:{ span=120-127 
cap=8192 }, 128:{ span=128-135 cap=8192 }, 136:{ span=136-143 cap=8192 }, 144:{ 
span=144-151 cap=8192 }, 152:{ span=152-159 cap=8192 }, 160:{ span=160-167 
cap=8192 }, 168:{ span=168-175 cap=8192 }, 176:{ span=176-183 cap=8192 }, 184:{ 
span=184-191 cap=8192 }, 192:{ span=192-199 cap=8192 }, 200:{ span=200-207 
cap=8192 }, 208:{ span=208-215 cap=8192 }, 216:{ span=216-223 cap=8192 }, 224:{ 
span=224-231 cap=8192 }, 232:{ span=232-239 cap=8192 }, 240:{ span=240-247 
cap=8192 }, 248:{ span=248-255 cap=8192 }, 256:{ span=256-263 cap=8192 }, 264:{ 
span=264-271 cap=8192 }, 272:{ span=272-279 cap=8192 }, 280:{ span=280-287 
cap=8192 }, 288:{ span=288-295 cap=8192 }, 296:{ span=296-303 cap=8192 }

For  CPU 1 
domain-0: span=0-7 level=SMT
 groups: 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 
6:{ span=6 }, 7:{ span=7 }, 0:{ span=0 }
 domain-1: span=0-303 level=DIE
  groups: 0:{ span=0-7 cap=8192 }, 8:{ span=8-15 cap=8192 }, 16:{ span=16-23 
cap=8192 }, 24:{ span=24-31 cap=8192 }, 32:{ span=32-39 cap=8192 }, 40:{ 
span=40-47 cap=8192 }, 48:{ span=48-55 cap=8192 }, 56:{ span=56-63 cap=8192 }, 
64:{ span=64-71 cap=8192 }, 72:{ span=72-79 cap=8192 }, 80:{ span=80-87 
cap=8192 }, 88:{ span=88-95 cap=8192 }, 96:{ span=96-103 cap=8192 }, 104:{ 
span=104-111 cap=8192 }, 112:{ span=112-119 cap=8192 }, 120:{ span=120-127 
cap=8192 }, 128:{ span=128-135 cap=8192 }, 136:{ span=136-143 cap=8192 }, 144:{ 
span=144-151 cap=8192 }, 152:{ span=152-159 cap=8192 }, 160:{ span=160-167 
cap=8192 }, 168:{ span=168-175 cap=8192 }, 176:{ span=176-183 cap=8192 }, 184:{ 
span=184-191 cap=8192 }, 192:{ span=192-199 cap=8192 }, 200:{ span=200-207 
cap=8192 }, 208:{ span=208-215 cap=8192}, 216:{ span=216-223 cap=8192 }, 224:{ 
span=224-231 cap=8192 }, 232:{ span=232-239 cap=8192 }, 240:{ span=240-247 
cap=8192 }, 248:{ span=248-255 cap=8192 }, 256:{ span=256-263 cap=8192 }, 264:{ 
span=264-271 cap=8192 }, 272:{ span=272-279 cap=8192 }, 280:{ span=280-287 
cap=8192 }, 288:{ span=288-295 cap=8192 }, 296:{ span=296-303 cap=8192 }


For  CPU 8
domain-0: span=8-15 level=SMT
 groups: 8:{ span=8 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 12:{ 
span=12 }, 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }
 domain-1: span=0-303 level=DIE
  groups: 8:{ span=8-15 cap=8192 }, 16:{ span=16-23 cap=8192 }, 24:{ span=24-31 
cap=8192 }, 32:{ span=32-39 cap=8192 }, 40:{ span=40-47 cap=8192 }, 48:{ 
span=48-55 cap=8192 }, 56:{ span=56-63 cap=8192 }, 64:{ span=64-71 cap=8192 }, 
72:{ span=72-79 cap=8192 }, 80:{ span=80-87 cap=8192 }, 88:{ span=88-95 
cap=8192 }, 96:{ span=96-103 cap=8192 

Re: [PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch

2018-08-31 Thread Srikar Dronamraju
* Peter Zijlstra  [2018-08-29 10:02:19]:

> On Fri, Aug 10, 2018 at 10:30:19PM +0530, Srikar Dronamraju wrote:
> > With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node
> > sched domain") the scheduler introduces a new numa level. However on shared
> > lpars like powerpc, this extra sched domain creation can lead to
> > repeated rcu stalls, sometimes even causing unresponsive systems on
> > boot. On such stalls, it was noticed that init_sched_groups_capacity()
> > loops forever (sg != sd->groups is always true).
> > 
> > INFO: rcu_sched self-detected stall on CPU
> >  1-: (240039 ticks this GP) idle=c32/1/4611686018427387906 
> > softirq=782/782 fqs=80012
> >   (t=240039 jiffies g=6272 c=6271 q=263040)
> > NMI backtrace for cpu 1
> 
> > --- interrupt: 901 at __bitmap_weight+0x70/0x100
> > LR = __bitmap_weight+0x78/0x100
> > [c0832132f9b0] [c09bb738] __func__.61127+0x0/0x20 (unreliable)
> > [c0832132fa00] [c016c178] build_sched_domains+0xf98/0x13f0
> > [c0832132fb30] [c016d73c] partition_sched_domains+0x26c/0x440
> > [c0832132fc20] [c01ee284] rebuild_sched_domains_locked+0x64/0x80
> > [c0832132fc50] [c01f11ec] rebuild_sched_domains+0x3c/0x60
> > [c0832132fc80] [c007e1c4] topology_work_fn+0x24/0x40
> > [c0832132fca0] [c0126704] process_one_work+0x1a4/0x470
> > [c0832132fd30] [c0126a68] worker_thread+0x98/0x540
> > [c0832132fdc0] [c012f078] kthread+0x168/0x1b0
> > [c0832132fe30] [c000b65c]
> > ret_from_kernel_thread+0x5c/0x80
> > 
> > Similar problem was earlier also reported at
> > https://lwn.net/ml/linux-kernel/20180512100233.GB3738@osiris/
> > 
> > Allow arch to set and clear masks corresponding to numa sched domain.
> 

> What this Changelog fails to do is explain the problem and motivate why
> this is the right solution.
> 
> As-is, this reads like, something's buggered, I changed this random thing
> and it now works.
> 
> So what is causing that domain construction error?
> 

Powerpc lpars running on Phyp have two modes: dedicated and shared.

Dedicated lpars are similar to a kvm guest with vcpupin.

Shared lpars are similar to a kvm guest without any pinning. When running in
shared lpar mode, Phyp allows overcommitting. Now if more lpars are
created/destroyed, Phyp will internally move / consolidate the cores. The
objective is similar to what autonuma tries to achieve on the host, but with
a different approach (consolidating onto optimal nodes to achieve the best
possible output).  This means the actual underlying cpu/node mapping has
changed. Phyp will propagate an event upwards to the lpar.  The lpar / OS
can choose to ignore or act on it.

We have found that acting on the event provides up to 40% improvement over
ignoring it. Acting on the event means moving cpus from one node to the
other, and topology_work_fn does exactly that.

In the case where we didn't have the NUMA sched domain, we would build
independent (aka overlapping) sched_groups. With the NUMA sched domain
introduction, we try to reuse sched_groups (aka non-overlapping). This
results in the above, which I tried to explain in
https://lwn.net/ml/linux-kernel/20180810164533.gb42...@linux.vnet.ibm.com

In the typical case above, let's take a 2 node system with 8 cores each,
every core having 8 SMT threads.  Initially all 8 cores might come from
node 0.  Hence sched_domains_numa_masks[NODE][node1] and
sched_domains_numa_masks[NUMA][node1], which are set at sched_init_numa(),
will have blank cpumasks.

Let's say Phyp decides to move some of the load to another node, node 1,
which until now had 0 cpus.  Hence we will see

"BUG: arch topology borken \n the DIE domain not a subset of the NODE
domain", which is probably okay. This problem was present even before the
NODE domain was created, and systems still booted and ran.

However with the introduction of the NODE sched_domain,
init_sched_groups_capacity() gets called for non-overlapping sched_domains,
which gets us into even worse problems. Here we end up in a situation where
the group ring sgA->sgB->sgC->sgD->sgA gets converted into
sgA->sgB->sgC->sgB, which ends up creating cpu stalls.

So the request is to expose sched_domains_numa_masks_set() /
sched_domains_numa_masks_clear() to the arch, so that on a topology update,
i.e. an event from Phyp, the arch can set the masks correctly. The scheduler
seems to take care of everything else.

-- 
Thanks and Regards
Srikar Dronamraju


