Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Tang,

On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
> On 01/31/2013 02:19 PM, Simon Jeons wrote:
> > Hi Tang,
> > On Thu, 2013-01-31 at 11:31 +0800, Tang Chen wrote:
> >> Hi Simon,
> >>
> >> Please see below. :)
> >>
> >> On 01/31/2013 09:22 AM, Simon Jeons wrote:
> >>>
> >>> Sorry, I'm still confused. :(
> >>> update node_states[N_NORMAL_MEMORY] to node_states[N_MEMORY] or
> >>> node_states[N_NORMAL_MEMORY] represents 0...ZONE_MOVABLE?
> >>>
> >>> node_states is what? node_states[N_NORMAL_MEMORY] or
> >>> node_states[N_MEMORY]?
> >>
> >> Are you asking what node_states[] is?
> >>
> >> node_states[] is an array of nodemasks,
> >>
> >> extern nodemask_t node_states[NR_NODE_STATES];
> >>
> >> For example, node_states[N_NORMAL_MEMORY] represents which nodes have
> >> normal memory.
> >> If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, node_states[N_MEMORY] is
> >> node_states[N_NORMAL_MEMORY]. So it represents which nodes have 0 ...
> >> ZONE_MOVABLE.
> >>
> >
> > Sorry, how can node_states[N_NORMAL_MEMORY] represent a node that has 0 ...
> > *ZONE_MOVABLE*? The comment on enum node_states says that
> > N_NORMAL_MEMORY just means the node has regular memory.
> >
>
> Hi Simon,
>
> Let's say it in this way.
>
> If we don't have CONFIG_HIGHMEM, N_HIGH_MEMORY == N_NORMAL_MEMORY. We
> don't have a separate macro to represent highmem because we don't have
> highmem.
> This is easy to understand, right?
>
> Now, think of it just like above:
> If we don't have CONFIG_MOVABLE_NODE, N_MEMORY == N_HIGH_MEMORY ==
> N_NORMAL_MEMORY.
> This means we don't allow a node to have only movable memory, not that we
> don't have movable memory.
> A node could have normal memory and movable memory. So
> node_states[N_NORMAL_MEMORY] represents a node that has 0 ...
> *ZONE_MOVABLE*.
>
> I think the point is: CONFIG_MOVABLE_NODE means we allow a node to have
> only movable memory.
> So without CONFIG_MOVABLE_NODE, it doesn't mean a node cannot have
> movable memory. It means the node cannot have only movable memory. It can
> have normal memory and movable memory.
>
> 1) With CONFIG_MOVABLE_NODE:
>    N_NORMAL_MEMORY: nodes that have normal memory.
>        normal memory only
>        normal and highmem
>        normal and highmem and movablemem
>        normal and movablemem
>    N_MEMORY: nodes that have memory (any memory)
>        normal memory only
>        normal and highmem
>        normal and highmem and movablemem
>        normal and movablemem  --- We can have movablemem.
>        highmem only
>        highmem and movablemem
>        movablemem only        --- We can have movablemem only. ***
>
> 2) Without CONFIG_MOVABLE_NODE:
>    N_MEMORY == N_NORMAL_MEMORY: (Here, I omit N_HIGH_MEMORY)
>        normal memory only
>        normal and highmem
>        normal and highmem and movablemem
>        normal and movablemem  --- We can have movablemem.
>        No movablemem only     --- We cannot have movablemem only. ***
>
> The semantics are not that clear here. So we can only try to understand
> them from the code where we use N_MEMORY. :)
>
> That is my understanding of this.

Thanks for the clarification, it is very clear now. :)

>
> Thanks. :)
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
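For readers following the N_MEMORY / N_NORMAL_MEMORY discussion, here is a minimal userspace sketch of the aliasing Tang describes. It is a model under stated assumptions, not the kernel header: MODEL_MOVABLE_NODE, model_node_states[], model_node_state() and model_add_node() are hypothetical names, N_HIGH_MEMORY is omitted as in Tang's second list, and a nodemask is reduced to one unsigned long.

```c
#include <stdbool.h>

/* Assumption: set to 1 to model a kernel built with CONFIG_MOVABLE_NODE.
 * This macro exists only in this sketch. */
#ifndef MODEL_MOVABLE_NODE
#define MODEL_MOVABLE_NODE 0
#endif

/* Reduced model of the state enum; the real enum has more entries. */
enum model_node_state {
	N_NORMAL_MEMORY,
#if MODEL_MOVABLE_NODE
	N_MEMORY,				/* separate mask */
#else
	N_MEMORY = N_NORMAL_MEMORY,		/* alias: same mask */
#endif
	NR_NODE_STATES
};

/* One bit per node id; a stand-in for nodemask_t. */
static unsigned long model_node_states[NR_NODE_STATES];

static bool model_node_state(int nid, enum model_node_state s)
{
	return (model_node_states[s] >> nid) & 1UL;
}

/* Record a node. Returns false if the node layout is not allowed. */
static bool model_add_node(int nid, bool has_normal, bool has_movable)
{
	if (!has_normal && !has_movable)
		return false;		/* no memory at all */
	if (!has_normal && !MODEL_MOVABLE_NODE)
		return false;		/* movable-only node not allowed */
	if (has_normal)
		model_node_states[N_NORMAL_MEMORY] |= 1UL << nid;
	model_node_states[N_MEMORY] |= 1UL << nid;
	return true;
}
```

With MODEL_MOVABLE_NODE left at 0, a node with normal and movable memory is visible through both N_MEMORY and N_NORMAL_MEMORY because they index the same mask, while a movable-only node is rejected. That matches case 2) in the lists above.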
Re: [GIT PULL 00/21] perf/core improvements and fixes
* Arnaldo Carvalho de Melo wrote:

> Hi Ingo,
>
> 	Please consider pulling.
>
> 	Namhyung, Jiri, the 'group report' patches are at acme/perf/group,
> will send a pull req later if it survives further testing.
>
> - Arnaldo
>
> The following changes since commit a2d28d0c198b65fac28ea6212f5f8edc77b29c27:
>
>   Merge tag 'perf-core-for-mingo' of
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
>   (2013-01-25 11:34:00 +0100)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
>   tags/perf-core-for-mingo
>
> for you to fetch changes up to 5809fde040de2afa477a6c593ce2e8fd2c11d9d3:
>
>   perf header: Fix double fclose() on do_write(fd, xxx) failure
>   (2013-01-30 10:40:44 -0300)
>
> perf/core improvements and fixes:
>
> . Fix some leaks in exit paths.
>
> . Use memdup where applicable
>
> . Remove some die() calls, allowing callers to handle exit paths
>   gracefully.
>
> . Correct typo in tools Makefile, fix from Borislav Petkov.
>
> . Add 'perf bench numa mem' NUMA performance measurement suite, from Ingo
>   Molnar.
>
> . Handle dynamic array's element size properly, fix from Jiri Olsa.
>
> . Fix memory leaks on evsel->counts, from Namhyung Kim.
>
> . Make numa benchmark optional, allowing the build in machines where required
>   numa libraries are not present, fix from Peter Hurley.
>
> . Add interval printing in 'perf stat', from Stephane Eranian.
>
> . Fix compile warnings in tests/attr.c, from Sukadev Bhattiprolu.
>
> . Fix double free, pclose instead of fclose, leaks and double fclose errors
>   found with the cppcheck tool, from Thomas Jarosch.
>
> Signed-off-by: Arnaldo Carvalho de Melo
>
> Arnaldo Carvalho de Melo (8):
>       perf tools: Stop using 'self' in strlist
>       perf tools: Stop using 'self' in map.[ch]
>       perf tools: Use memdup in map__clone
>       perf kmem: Use memdup()
>       perf header: Stop using die() calls when processing tracing data
>       perf ui browser: Free browser->helpline() on ui_browser__hide()
>       perf tests: Call machine__exit in the vmlinux matches kallsyms test
>       perf tests: Fix leaks on PERF_RECORD_* test
>
> Borislav Petkov (1):
>       tools: Correct typo in tools Makefile
>
> Ingo Molnar (1):
>       perf: Add 'perf bench numa mem' NUMA performance measurement suite
>
> Jiri Olsa (1):
>       tools lib traceevent: Handle dynamic array's element size properly
>
> Namhyung Kim (1):
>       perf evsel: Fix memory leaks on evsel->counts
>
> Peter Hurley (1):
>       perf tools: Make numa benchmark optional
>
> Stephane Eranian (2):
>       perf evsel: Add prev_raw_count field
>       perf stat: Add interval printing
>
> Sukadev Bhattiprolu (1):
>       perf tools, powerpc: Fix compile warnings in tests/attr.c
>
> Thomas Jarosch (5):
>       perf tools: Fix possible double free on error
>       perf sort: Use pclose() instead of fclose() on pipe stream
>       perf tools: Fix memory leak on error
>       perf header: Fix memory leak for the "Not caching a kptr_restrict'ed /proc/kallsyms" case
>       perf header: Fix double fclose() on do_write(fd, xxx) failure
>
>  tools/Makefile                           |    2 +-
>  tools/lib/traceevent/event-parse.c       |   39 +-
>  tools/perf/Documentation/perf-stat.txt   |    4 +
>  tools/perf/Makefile                      |   13 +
>  tools/perf/arch/common.c                 |    1 +
>  tools/perf/bench/bench.h                 |    1 +
>  tools/perf/bench/numa.c                  | 1731 ++
>  tools/perf/builtin-bench.c               |   17 +
>  tools/perf/builtin-kmem.c                |    6 +-
>  tools/perf/builtin-stat.c                |  158 ++-
>  tools/perf/config/feature-tests.mak      |   11 +
>  tools/perf/tests/attr.c                  |    5 +
>  tools/perf/tests/open-syscall-all-cpus.c |    1 +
>  tools/perf/tests/perf-record.c           |   12 +-
>  tools/perf/tests/vmlinux-kallsyms.c      |    4 +-
>  tools/perf/ui/browser.c                  |    2 +
>  tools/perf/util/event.c                  |    4 +-
>  tools/perf/util/evsel.c                  |   31 +
>  tools/perf/util/evsel.h                  |    2 +
>  tools/perf/util/header.c                 |   25 +-
>  tools/perf/util/map.c                    |  118 +-
>  tools/perf/util/map.h                    |   24 +-
>  tools/perf/util/sort.c                   |    7 +-
>  tools/perf/util/strlist.c                |   54 +-
>  tools/perf/util/strlist.h                |   42 +-
>  25 files changed, 2154 insertions(+), 160 deletions(-)
>  create mode 100644 tools/perf/bench/numa.c

Pulled, thanks a lot Arnaldo!

	Ingo
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Tang,

On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:

1. IIUC, a machine that supports memory hot-remove has a button for it.
What is the difference between pressing the button and echoing to /sys?

2. Since kernel memory is linearly mapped (I mean the direct mapping
part), why can't we put the kernel's direct-mapped memory into one memory
device and all other memory into the other devices? As you know, x86_64
doesn't need highmem, so IIUC all kernel memory is linearly mapped in this
case. Is my idea workable? If it is, x86_32 can't be handled the same way,
since highmem (kmap/kmap_atomic/vmalloc) can map any address, so it is
hard to confine kernel memory to a single memory device.

3. In the current implementation, does memory hotplug need only memory
subsystem and ACPI code support, or does firmware also need to take part?
I hope you can explain in detail, thanks in advance. :)

4. What is the status of memory hotplug? Apart from not being able to
remove kernel memory, is everything else fully implemented?
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Simon,

On 01/31/2013 04:48 PM, Simon Jeons wrote:
> Hi Tang,
> On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
>
> 1. IIUC, a machine that supports memory hot-remove has a button for it.
> What is the difference between pressing the button and echoing to /sys?

No important difference, I think. Since I don't have the machine you are
describing, I cannot answer for sure. :)
AFAIK, pressing the button triggers the hotplug from hardware, and sysfs
is just another entry point. In the end, both run into the same code.

> 2. Since kernel memory is linearly mapped (I mean the direct mapping
> part), why can't we put the kernel's direct-mapped memory into one memory
> device and all other memory into the other devices?

We cannot do that because we would lose NUMA performance.

If you know NUMA, you will understand the following example:

    node0:                node1:
    cpu0~cpu15            cpu16~cpu31
    memory0~memory511     memory512~memory1023

cpu16~cpu31 access memory512~memory1023 much faster than memory0~memory511.
If we put the direct mapping area in node0 and the movable area in node1,
then kernel code running on cpu16~cpu31 will have to access
memory0~memory511. This is a terrible performance drop.

> As you know, x86_64 doesn't need highmem, so IIUC all kernel memory is
> linearly mapped in this case. Is my idea workable? If it is, x86_32
> can't be handled the same way, since highmem (kmap/kmap_atomic/vmalloc)
> can map any address, so it is hard to confine kernel memory to a single
> memory device.

Sorry, I'm not very familiar with x86_32 boxes.

> 3. In the current implementation, does memory hotplug need only memory
> subsystem and ACPI code support, or does firmware also need to take part?
> I hope you can explain in detail, thanks in advance. :)

We need firmware to take part, such as the SRAT in the ACPI BIOS, or the
firmware-based memory migration mentioned by Liu Jiang.

So far, this is all I know. :)

> 4. What is the status of memory hotplug? Apart from not being able to
> remove kernel memory, is everything else fully implemented?

I think the main work is done for now. There are still bugs to fix, and
this functionality is not stable yet.

Thanks. :)
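The locality argument in Tang's NUMA example can be written down as a tiny model. This is only a sketch using the numbers from the example (16 CPUs and 512 memory blocks per node); the names cpu_to_node_model(), mem_to_node_model() and access_is_local() are hypothetical and have nothing to do with the kernel's real topology helpers.

```c
#include <stdbool.h>

/* Layout taken from the example in the mail: two nodes,
 * cpu0~cpu15 + memory0~memory511 on node0,
 * cpu16~cpu31 + memory512~memory1023 on node1. */
enum {
	MODEL_CPUS_PER_NODE = 16,
	MODEL_MEM_BLOCKS_PER_NODE = 512,
};

static int cpu_to_node_model(int cpu)
{
	return cpu / MODEL_CPUS_PER_NODE;
}

static int mem_to_node_model(int block)
{
	return block / MODEL_MEM_BLOCKS_PER_NODE;
}

/* True when the CPU and the memory block sit on the same node. */
static bool access_is_local(int cpu, int block)
{
	return cpu_to_node_model(cpu) == mem_to_node_model(block);
}
```

If all kernel (direct-mapped) memory lived in memory0~memory511, access_is_local() would be false for every kernel access made from cpu16~cpu31, which is the performance drop described above.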
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Tang,

On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote:
> Hi Simon,
>
> On 01/31/2013 04:48 PM, Simon Jeons wrote:
> > Hi Tang,
> > On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
> >
> > 1. IIUC, a machine that supports memory hot-remove has a button for it.
> > What is the difference between pressing the button and echoing to /sys?
>
> No important difference, I think. Since I don't have the machine you are
> describing, I cannot answer for sure. :)
> AFAIK, pressing the button triggers the hotplug from hardware, and sysfs
> is just another entry point. In the end, both run into the same code.
>
> > 2. Since kernel memory is linearly mapped (I mean the direct mapping
> > part), why can't we put the kernel's direct-mapped memory into one
> > memory device and all other memory into the other devices?
>
> We cannot do that because we would lose NUMA performance.
>
> If you know NUMA, you will understand the following example:
>
>     node0:                node1:
>     cpu0~cpu15            cpu16~cpu31
>     memory0~memory511     memory512~memory1023
>
> cpu16~cpu31 access memory512~memory1023 much faster than memory0~memory511.
> If we put the direct mapping area in node0 and the movable area in node1,
> then kernel code running on cpu16~cpu31 will have to access
> memory0~memory511. This is a terrible performance drop.

So with NUMA configured, kernel memory is no longer linearly mapped? For
example,

    Node 0        Node 1
    0 ~ 10G       11G ~ 14G

is kernel memory only on Node 0? Can part of kernel memory also be on
Node 1?

How big is the kernel direct mapping on x86_64? Is there a maximum? It
seems to be only around 896MB on x86_32.

>
> > As you know, x86_64 doesn't need highmem, so IIUC all kernel memory is
> > linearly mapped in this case. Is my idea workable? If it is, x86_32
> > can't be handled the same way, since highmem (kmap/kmap_atomic/vmalloc)
> > can map any address, so it is hard to confine kernel memory to a single
> > memory device.
>
> Sorry, I'm not very familiar with x86_32 boxes.
>
> > 3. In the current implementation, does memory hotplug need only memory
> > subsystem and ACPI code support, or does firmware also need to take
> > part? I hope you can explain in detail, thanks in advance. :)
>
> We need firmware to take part, such as the SRAT in the ACPI BIOS, or the
> firmware-based memory migration mentioned by Liu Jiang.

Is there any material about firmware-based memory migration?

>
> So far, this is all I know. :)
>
> > 4. What is the status of memory hotplug? Apart from not being able to
> > remove kernel memory, is everything else fully implemented?
>
> I think the main work is done for now. There are still bugs to fix, and
> this functionality is not stable yet.
>
> Thanks. :)
Re: [PATCH 1/5] KVM: PPC: e500: Move VCPU's MMUCFG register initialization earlier
On 30.01.2013, at 14:29, Mihai Caraman wrote:

> VCPU's MMUCFG register initialization should not depend on KVM_CAP_SW_TLB
> ioctl call. Move it earlier into tlb initialization phase.

Quite the contrary. The fact that there is an mfspr() in e500_mmu.c already tells us that the code is broken. The TLB guest code should only depend on input from the SW_TLB configuration. It's completely orthogonal to the host capabilities.


Alex

>
> Signed-off-by: Mihai Caraman
> ---
>  arch/powerpc/kvm/e500_mmu.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
> index 5c44759..bb1b2b0 100644
> --- a/arch/powerpc/kvm/e500_mmu.c
> +++ b/arch/powerpc/kvm/e500_mmu.c
> @@ -692,8 +692,6 @@ int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
>  	vcpu_e500->gtlb_offset[0] = 0;
>  	vcpu_e500->gtlb_offset[1] = params.tlb_sizes[0];
>
> -	vcpu->arch.mmucfg = mfspr(SPRN_MMUCFG) & ~MMUCFG_LPIDSIZE;
> -
>  	vcpu->arch.tlbcfg[0] &= ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
>  	if (params.tlb_sizes[0] <= 2048)
>  		vcpu->arch.tlbcfg[0] |= params.tlb_sizes[0];
> @@ -781,6 +779,8 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
>  	if (!vcpu_e500->g2h_tlb1_map)
>  		goto err;
>
> +	vcpu->arch.mmucfg = mfspr(SPRN_MMUCFG) & ~MMUCFG_LPIDSIZE;
> +
>  	/* Init TLB configuration register */
>  	vcpu->arch.tlbcfg[0] = mfspr(SPRN_TLB0CFG) &
>  			     ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
> --
> 1.7.4.1
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] KVM: PPC: e500: Emulate TLBnPS registers
On 30.01.2013, at 14:29, Mihai Caraman wrote:

> Emulate TLBnPS registers which are available in MMU Architecture Version
> (MAV) 2.0.
>
> Signed-off-by: Mihai Caraman
> ---
>  arch/powerpc/include/asm/kvm_host.h |    1 +
>  arch/powerpc/kvm/e500.h             |    5 +++++
>  arch/powerpc/kvm/e500_emulate.c     |   10 ++++++++++
>  arch/powerpc/kvm/e500_mmu.c         |    5 +++++
>  4 files changed, 21 insertions(+), 0 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 8a72d59..88fcfe6 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -501,6 +501,7 @@ struct kvm_vcpu_arch {
>  	spinlock_t wdt_lock;
>  	struct timer_list wdt_timer;
>  	u32 tlbcfg[4];
> +	u32 tlbps[4];
>  	u32 mmucfg;
>  	u32 epr;
>  	struct kvmppc_booke_debug_reg dbg_reg;
> diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
> index 41cefd4..b9f76d8 100644
> --- a/arch/powerpc/kvm/e500.h
> +++ b/arch/powerpc/kvm/e500.h
> @@ -303,4 +303,9 @@ static inline unsigned int get_tlbmiss_tid(struct kvm_vcpu *vcpu)
>  #define get_tlb_sts(gtlbe)              (MAS1_TS)
>  #endif /* !BOOKE_HV */
>
> +static inline unsigned int has_mmu_v2(const struct kvm_vcpu *vcpu)

bool. Also rename it to "is_..." then.

> +{
> +	return ((vcpu->arch.mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V2);
> +}
> +
>  #endif /* KVM_E500_H */
> diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
> index e78f353..5515dc5 100644
> --- a/arch/powerpc/kvm/e500_emulate.c
> +++ b/arch/powerpc/kvm/e500_emulate.c
> @@ -329,6 +329,16 @@ int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int sprn, ulong *spr_val)
>  		*spr_val = vcpu->arch.ivor[BOOKE_IRQPRIO_DBELL_CRIT];
>  		break;
>  #endif
> +	case SPRN_TLB0PS:
> +		if (!has_mmu_v2(vcpu))
> +			return EMULATE_FAIL;
> +		*spr_val = vcpu->arch.tlbps[0];
> +		break;
> +	case SPRN_TLB1PS:
> +		if (!has_mmu_v2(vcpu))
> +			return EMULATE_FAIL;
> +		*spr_val = vcpu->arch.tlbps[1];
> +		break;
>  	default:
>  		emulated = kvmppc_booke_emulate_mfspr(vcpu, sprn, spr_val);
>  	}
> diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
> index bb1b2b0..129299a 100644
> --- a/arch/powerpc/kvm/e500_mmu.c
> +++ b/arch/powerpc/kvm/e500_mmu.c
> @@ -794,6 +794,11 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
>  	vcpu->arch.tlbcfg[1] |=
>  		vcpu_e500->gtlb_params[1].ways << TLBnCFG_ASSOC_SHIFT;
>
> +	if (has_mmu_v2(vcpu)) {
> +		vcpu->arch.tlbps[0] = mfspr(SPRN_TLB0PS);
> +		vcpu->arch.tlbps[1] = mfspr(SPRN_TLB1PS);

So I suppose that means that user space doesn't tell us the possible TLB entry sizes through the SW_TLB config? Then we should add them there.

To not break untested code paths, we can still compare if the values user space asks for are identical to what physical hardware does. But eventually we shouldn't care.


Alex
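To make the two review points concrete (return bool, and accept a user-space TLBnPS value only when it matches the host), here is a minimal userspace sketch. Everything in it is an assumption for illustration: struct model_vcpu, the MODEL_* mask values, is_mmu_v2() and model_set_tlbps() are hypothetical stand-ins, not the KVM code and not a proposed SW_TLB interface.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in values for this sketch only; the real masks are defined in the
 * kernel's Book E register header. */
#define MODEL_MMUCFG_MAVN	0x3u
#define MODEL_MMUCFG_MAVN_V2	0x1u

/* Reduced stand-in for the fields of kvm_vcpu_arch used in the patch. */
struct model_vcpu {
	uint32_t mmucfg;
	uint32_t tlbps[4];
};

/* Predicate returning bool, as requested in the review. */
static inline bool is_mmu_v2(const struct model_vcpu *vcpu)
{
	return (vcpu->mmucfg & MODEL_MMUCFG_MAVN) == MODEL_MMUCFG_MAVN_V2;
}

/*
 * Take the TLBnPS value from user space instead of reading it with mfspr(),
 * but for now reject anything that differs from what the host reports.
 */
static int model_set_tlbps(struct model_vcpu *vcpu, unsigned int idx,
			   uint32_t requested, uint32_t host_value)
{
	if (idx >= 4 || !is_mmu_v2(vcpu))
		return -EINVAL;
	if (requested != host_value)
		return -EINVAL;
	vcpu->tlbps[idx] = requested;
	return 0;
}
```

The equality check is the "compare with physical hardware" step from the review; dropping it later would give the "eventually we shouldn't care" behaviour.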
Re: [PATCH 3/5] KVM: PPC: e500: Remove E.PT category from VCPUs
On 30.01.2013, at 14:29, Mihai Caraman wrote:

> Embedded.Page Table (E.PT) category in VMs requires indirect tlb entries
> emulation which is not supported yet. Configure TLBnCFG to remove E.PT
> category from VCPUs.
>
> Signed-off-by: Mihai Caraman

Please do this in a separate function that you call from these locations. That way the code is self-documenting on what it actually does.

Also add a comment to this one function that removes E.PT related bits from TLBCFG, saying that our _guest_ mmu emulation currently doesn't handle E.PT.


Alex

> ---
>  arch/powerpc/kvm/e500_mmu.c |   10 ++++++----
>  1 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
> index 129299a..9a1f7b7 100644
> --- a/arch/powerpc/kvm/e500_mmu.c
> +++ b/arch/powerpc/kvm/e500_mmu.c
> @@ -692,12 +692,14 @@ int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
>  	vcpu_e500->gtlb_offset[0] = 0;
>  	vcpu_e500->gtlb_offset[1] = params.tlb_sizes[0];
>
> -	vcpu->arch.tlbcfg[0] &= ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
> +	vcpu->arch.tlbcfg[0] &=
> +		~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);
>  	if (params.tlb_sizes[0] <= 2048)
>  		vcpu->arch.tlbcfg[0] |= params.tlb_sizes[0];
>  	vcpu->arch.tlbcfg[0] |= params.tlb_ways[0] << TLBnCFG_ASSOC_SHIFT;
>
> -	vcpu->arch.tlbcfg[1] &= ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
> +	vcpu->arch.tlbcfg[1] &=
> +		~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);
>  	vcpu->arch.tlbcfg[1] |= params.tlb_sizes[1];
>  	vcpu->arch.tlbcfg[1] |= params.tlb_ways[1] << TLBnCFG_ASSOC_SHIFT;
>
> @@ -783,13 +785,13 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
>
>  	/* Init TLB configuration register */
>  	vcpu->arch.tlbcfg[0] = mfspr(SPRN_TLB0CFG) &
> -			     ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
> +			     ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);
>  	vcpu->arch.tlbcfg[0] |= vcpu_e500->gtlb_params[0].entries;
>  	vcpu->arch.tlbcfg[0] |=
>  		vcpu_e500->gtlb_params[0].ways << TLBnCFG_ASSOC_SHIFT;
>
>  	vcpu->arch.tlbcfg[1] = mfspr(SPRN_TLB1CFG) &
> -			     ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
> +			     ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);
>  	vcpu->arch.tlbcfg[1] |= vcpu_e500->gtlb_params[1].entries;
>  	vcpu->arch.tlbcfg[1] |=
>  		vcpu_e500->gtlb_params[1].ways << TLBnCFG_ASSOC_SHIFT;
> --
> 1.7.4.1
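One possible shape for the separate helper asked for in the review, as a hedged userspace sketch. The function names model_tlbcfg_clear_ept() and model_tlbcfg_init() are hypothetical, and the MODEL_TLBnCFG_* values are stand-ins chosen for this example rather than quotes from the kernel header.

```c
#include <stdint.h>

/* Stand-in mask values for this sketch only. */
#define MODEL_TLBnCFG_N_ENTRY		0x00000fffu
#define MODEL_TLBnCFG_IND		0x00020000u
#define MODEL_TLBnCFG_ASSOC		0xff000000u
#define MODEL_TLBnCFG_ASSOC_SHIFT	24

/*
 * The guest MMU emulation does not handle E.PT (indirect TLB entries) yet,
 * so hide the E.PT related bit from the guest-visible TLBnCFG value.
 */
static uint32_t model_tlbcfg_clear_ept(uint32_t tlbcfg)
{
	return tlbcfg & ~MODEL_TLBnCFG_IND;
}

/* Build a guest TLBnCFG value from a base value plus entries and ways. */
static uint32_t model_tlbcfg_init(uint32_t base, uint32_t entries,
				  uint32_t ways)
{
	uint32_t cfg = base & ~(MODEL_TLBnCFG_N_ENTRY | MODEL_TLBnCFG_ASSOC);

	cfg = model_tlbcfg_clear_ept(cfg);
	cfg |= entries & MODEL_TLBnCFG_N_ENTRY;
	cfg |= (ways << MODEL_TLBnCFG_ASSOC_SHIFT) & MODEL_TLBnCFG_ASSOC;
	return cfg;
}
```

All four call sites in the patch would then go through one function, and the comment explaining why the bit is cleared lives in a single place.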
Re: [PATCH 4/5] KVM: PPC: e500: Emulate EPTCFG register
On 30.01.2013, at 14:29, Mihai Caraman wrote:

> EPTCFG register defined by E.PT is accessed unconditionally by Linux guests
> in the presence of MAV 2.0. Emulate EPTCFG register now.
>
> Signed-off-by: Mihai Caraman
> ---
>  arch/powerpc/include/asm/kvm_host.h |    1 +
>  arch/powerpc/kvm/e500.h             |    6 ++++++
>  arch/powerpc/kvm/e500_emulate.c     |    9 +++++++++
>  arch/powerpc/kvm/e500_mmu.c         |    5 +++++
>  4 files changed, 21 insertions(+), 0 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 88fcfe6..f480b20 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -503,6 +503,7 @@ struct kvm_vcpu_arch {
>  	u32 tlbcfg[4];
>  	u32 tlbps[4];
>  	u32 mmucfg;
> +	u32 eptcfg;

This too needs to be settable through SW_TLB.

>  	u32 epr;
>  	struct kvmppc_booke_debug_reg dbg_reg;
>  #endif
> diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
> index b9f76d8..983eb95 100644
> --- a/arch/powerpc/kvm/e500.h
> +++ b/arch/powerpc/kvm/e500.h
> @@ -308,4 +308,10 @@ static inline unsigned int has_mmu_v2(const struct kvm_vcpu *vcpu)
>  	return ((vcpu->arch.mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V2);
>  }
>
> +static inline unsigned int supports_page_tables(const struct kvm_vcpu *vcpu)

bool again. Can we generalize this a bit more? How about a small framework that allows us to differentiate across e.XX features?

  if (has_feature(vcpu, FEATURE_E_PT))
    ...

> +{
> +	return ((vcpu->arch.tlbcfg[0] & TLBnCFG_IND)
> +		|| (vcpu->arch.tlbcfg[1] & TLBnCFG_IND));
> +}
> +
>  #endif /* KVM_E500_H */
> diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
> index 5515dc5..493e231 100644
> --- a/arch/powerpc/kvm/e500_emulate.c
> +++ b/arch/powerpc/kvm/e500_emulate.c
> @@ -339,6 +339,15 @@ int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int sprn, ulong *spr_val)
>  			return EMULATE_FAIL;
>  		*spr_val = vcpu->arch.tlbps[1];
>  		break;
> +	case SPRN_EPTCFG:
> +		if (!has_mmu_v2(vcpu))
> +			return EMULATE_FAIL;
> +		/*
> +		 * Legacy Linux guests access EPTCFG register even if the E.PT
> +		 * category is disabled in the VM. Give them a chance to live.
> +		 */
> +		*spr_val = vcpu->arch.eptcfg;
> +		break;
>  	default:
>  		emulated = kvmppc_booke_emulate_mfspr(vcpu, sprn, spr_val);
>  	}
> diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
> index 9a1f7b7..199c11e 100644
> --- a/arch/powerpc/kvm/e500_mmu.c
> +++ b/arch/powerpc/kvm/e500_mmu.c
> @@ -799,6 +799,11 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
>  	if (has_mmu_v2(vcpu)) {
>  		vcpu->arch.tlbps[0] = mfspr(SPRN_TLB0PS);
>  		vcpu->arch.tlbps[1] = mfspr(SPRN_TLB1PS);
> +
> +		if (supports_page_tables(vcpu))
> +			vcpu->arch.eptcfg = mfspr(SPRN_EPTCFG);

Please don't introduce new mfspr()s here :). Just have user space set it.


Alex

> +		else
> +			vcpu->arch.eptcfg = 0;
>  	}
>
>  	kvmppc_recalc_tlb1map_range(vcpu_e500);
> --
> 1.7.4.1
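A rough sketch of what the has_feature() idea from the review might look like, written as standalone C. Only the call shape has_feature(vcpu, FEATURE_E_PT) comes from the mail; the enum, struct feat_vcpu, FEATURE_MMU_V2 and model_read_eptcfg() are hypothetical names invented for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per optional MMU feature (names assumed for this sketch). */
enum vcpu_feature {
	FEATURE_MMU_V2	= 1u << 0,	/* MMU Architecture Version 2.0 */
	FEATURE_E_PT	= 1u << 1,	/* Embedded.Page Table category */
};

/* Reduced stand-in for the vcpu state; eptcfg would be set by user space. */
struct feat_vcpu {
	uint32_t features;
	uint32_t eptcfg;
};

static inline bool has_feature(const struct feat_vcpu *vcpu,
			       enum vcpu_feature f)
{
	return (vcpu->features & (uint32_t)f) != 0;
}

/*
 * Model of the EPTCFG read path: fail without MAV 2.0, read as zero when
 * E.PT is disabled, otherwise return the configured value.
 */
static bool model_read_eptcfg(const struct feat_vcpu *vcpu, uint32_t *val)
{
	if (!has_feature(vcpu, FEATURE_MMU_V2))
		return false;
	*val = has_feature(vcpu, FEATURE_E_PT) ? vcpu->eptcfg : 0;
	return true;
}
```

A single feature word keeps the per-feature predicates (has_mmu_v2(), supports_page_tables()) from multiplying as more e.XX categories are emulated.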
Re: [PATCH 5/5] KVM: PPC: e500mc: Enable e6500 cores
On 30.01.2013, at 14:29, Mihai Caraman wrote:

> Extend processor compatibility names to e6500 cores.
>
> Signed-off-by: Mihai Caraman

Looks good to me.

Reviewed-by: Alexander Graf


Alex

> ---
>  arch/powerpc/kvm/e500mc.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/arch/powerpc/kvm/e500mc.c b/arch/powerpc/kvm/e500mc.c
> index 1f89d26..6c87299 100644
> --- a/arch/powerpc/kvm/e500mc.c
> +++ b/arch/powerpc/kvm/e500mc.c
> @@ -172,6 +172,8 @@ int kvmppc_core_check_processor_compat(void)
>  		r = 0;
>  	else if (strcmp(cur_cpu_spec->cpu_name, "e5500") == 0)
>  		r = 0;
> +	else if (strcmp(cur_cpu_spec->cpu_name, "e6500") == 0)
> +		r = 0;
>  	else
>  		r = -ENOTSUPP;
>
> --
> 1.7.4.1
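As a side note, the growing strcmp() chain could also be expressed as a table lookup. The sketch below is a standalone illustration, not a proposal for the patch: model_supported_cpus[] and model_check_processor_compat() are hypothetical names, the table lists only the two core names visible in the quoted hunk, and MODEL_ENOTSUPP is a local stand-in because ENOTSUPP is a kernel-internal error code.

```c
#include <stddef.h>
#include <string.h>

/* Local stand-in for the kernel-internal ENOTSUPP value. */
#define MODEL_ENOTSUPP 524

/* Only the names visible in the quoted hunk; the real list may be longer. */
static const char *const model_supported_cpus[] = {
	"e5500",
	"e6500",
};

/* Return 0 if cpu_name is in the table, -MODEL_ENOTSUPP otherwise. */
static int model_check_processor_compat(const char *cpu_name)
{
	size_t n = sizeof(model_supported_cpus) /
		   sizeof(model_supported_cpus[0]);

	for (size_t i = 0; i < n; i++)
		if (strcmp(cpu_name, model_supported_cpus[i]) == 0)
			return 0;
	return -MODEL_ENOTSUPP;
}
```

Adding another core then becomes a one-line change to the table.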
Re: [PATCH 2/5] KVM: PPC: e500: Emulate TLBnPS registers
On 31.01.2013, at 14:24, Alexander Graf wrote:

>
> On 30.01.2013, at 14:29, Mihai Caraman wrote:
>
>> Emulate TLBnPS registers which are available in MMU Architecture Version
>> (MAV) 2.0.
>>
>> Signed-off-by: Mihai Caraman
>> ---
>>  arch/powerpc/include/asm/kvm_host.h |    1 +
>>  arch/powerpc/kvm/e500.h             |    5 +++++
>>  arch/powerpc/kvm/e500_emulate.c     |   10 ++++++++++
>>  arch/powerpc/kvm/e500_mmu.c         |    5 +++++
>>  4 files changed, 21 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index 8a72d59..88fcfe6 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -501,6 +501,7 @@ struct kvm_vcpu_arch {
>>  	spinlock_t wdt_lock;
>>  	struct timer_list wdt_timer;
>>  	u32 tlbcfg[4];
>> +	u32 tlbps[4];
>>  	u32 mmucfg;
>>  	u32 epr;
>>  	struct kvmppc_booke_debug_reg dbg_reg;
>> diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
>> index 41cefd4..b9f76d8 100644
>> --- a/arch/powerpc/kvm/e500.h
>> +++ b/arch/powerpc/kvm/e500.h
>> @@ -303,4 +303,9 @@ static inline unsigned int get_tlbmiss_tid(struct kvm_vcpu *vcpu)
>>  #define get_tlb_sts(gtlbe)              (MAS1_TS)
>>  #endif /* !BOOKE_HV */
>>
>> +static inline unsigned int has_mmu_v2(const struct kvm_vcpu *vcpu)
>
> bool. Also rename it to "is_..." then.

In light of the comment I made on a later patch, this too could be converted to feature flags.


Alex
Re: [RFC PATCH v2 01/12] Add sys_hotplug.h for system device hotplug framework
On Thu, 2013-01-31 at 05:24 +, Greg KH wrote:
> On Wed, Jan 30, 2013 at 06:15:12PM -0700, Toshi Kani wrote:
> > > Please make it a "real" pointer, and not a void *, those shouldn't be
> > > used at all if possible.
> >
> > How about changing the "void *handle" to acpi_dev_node below?
> >
> >    struct acpi_dev_node    acpi_node;
> >
> > Basically, it has the same challenge as struct device, which uses
> > acpi_dev_node as well. We can add other FW node when needed (just like
> > device also has *of_node).
>
> That sounds good to me.

Great! Thanks Greg,
-Toshi
RE: [PATCH 1/5] KVM: PPC: e500: Move VCPU's MMUCFG register initialization earlier
> -----Original Message-----
> From: Alexander Graf [mailto:ag...@suse.de]
> Sent: Thursday, January 31, 2013 3:21 PM
> To: Caraman Mihai Claudiu-B02008
> Cc: kvm-...@vger.kernel.org; k...@vger.kernel.org; linuxppc-
> d...@lists.ozlabs.org
> Subject: Re: [PATCH 1/5] KVM: PPC: e500: Move VCPU's MMUCFG register
> initialization earlier
>
>
> On 30.01.2013, at 14:29, Mihai Caraman wrote:
>
> > VCPU's MMUCFG register initialization should not depend on KVM_CAP_SW_TLB
> > ioctl call. Move it earlier into tlb initialization phase.
>
> Quite the contrary. The fact that there is an mfspr() in e500_mmu.c
> already tells us that the code is broken. The TLB guest code should only
> depend on input from the SW_TLB configuration. It's completely orthogonal
> to the host capabilities.

Then we have the same issue for the TLBnCFG registers, which need to be
configured via the SW_TLB ioctl. What is the purpose of the guest TLB
initialization in e500_mmu.c if we rely on SW_TLB?

-Mike
Re: [PATCH 1/5] KVM: PPC: e500: Move VCPU's MMUCFG register initialization earlier
On 31.01.2013, at 15:56, Caraman Mihai Claudiu-B02008 wrote:

>> -----Original Message-----
>> From: Alexander Graf [mailto:ag...@suse.de]
>> Sent: Thursday, January 31, 2013 3:21 PM
>> To: Caraman Mihai Claudiu-B02008
>> Cc: kvm-...@vger.kernel.org; k...@vger.kernel.org; linuxppc-
>> d...@lists.ozlabs.org
>> Subject: Re: [PATCH 1/5] KVM: PPC: e500: Move VCPU's MMUCFG register
>> initialization earlier
>>
>>
>> On 30.01.2013, at 14:29, Mihai Caraman wrote:
>>
>>> VCPU's MMUCFG register initialization should not depend on KVM_CAP_SW_TLB
>>> ioctl call. Move it earlier into tlb initialization phase.
>>
>> Quite the contrary. The fact that there is an mfspr() in e500_mmu.c
>> already tells us that the code is broken. The TLB guest code should only
>> depend on input from the SW_TLB configuration. It's completely orthogonal
>> to the host capabilities.
>
> Then we have the same issue for TLBnCFG registers which need to be configured
> via SW_TLB ioctl. What is the purpose of guest tlb initialization in e500_mmu.c
> if we rely on SW_TLB?

It's to provide a fallback to user space that doesn't implement SW_TLB configuration yet.

Alex
RE: [PATCH 4/5] KVM: PPC: e500: Emulate EPTCFG register
> -----Original Message-----
> From: Alexander Graf [mailto:ag...@suse.de]
> Sent: Thursday, January 31, 2013 3:31 PM
> To: Caraman Mihai Claudiu-B02008
> Cc: kvm-...@vger.kernel.org; k...@vger.kernel.org; linuxppc-
> d...@lists.ozlabs.org
> Subject: Re: [PATCH 4/5] KVM: PPC: e500: Emulate EPTCFG register
>
>
> On 30.01.2013, at 14:29, Mihai Caraman wrote:
>
> > EPTCFG register defined by E.PT is accessed unconditionally by Linux guests
> > in the presence of MAV 2.0. Emulate EPTCFG register now.
> >
> > Signed-off-by: Mihai Caraman
> > ---
> > arch/powerpc/include/asm/kvm_host.h |1 +
> > arch/powerpc/kvm/e500.h |6 ++
> > arch/powerpc/kvm/e500_emulate.c |9 +
> > arch/powerpc/kvm/e500_mmu.c |5 +
> > 4 files changed, 21 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> > index 88fcfe6..f480b20 100644
> > --- a/arch/powerpc/include/asm/kvm_host.h
> > +++ b/arch/powerpc/include/asm/kvm_host.h
> > @@ -503,6 +503,7 @@ struct kvm_vcpu_arch {
> > u32 tlbcfg[4];
> > u32 tlbps[4];
> > u32 mmucfg;
> > + u32 eptcfg;
>
> This too needs to be settable through SW_TLB.
>
> > u32 epr;
> > struct kvmppc_booke_debug_reg dbg_reg;
> > #endif
> > diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
> > index b9f76d8..983eb95 100644
> > --- a/arch/powerpc/kvm/e500.h
> > +++ b/arch/powerpc/kvm/e500.h
> > @@ -308,4 +308,10 @@ static inline unsigned int has_mmu_v2(const struct kvm_vcpu *vcpu)
> > return ((vcpu->arch.mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V2);
> > }
> >
> > +static inline unsigned int supports_page_tables(const struct kvm_vcpu *vcpu)
>
> bool again. Can we generalize this a bit more? How about a small
> framework that allows us to differentiate across e.XX features?

I thought you would ask for it :)

-Mike
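The review comments in this thread ask for the predicates to return bool and float the idea of a small feature-flag framework. The following is only a rough sketch of what that could look like, not what the series ended up doing. The names `vcpu_mmu_state`, `E500_FEAT_*`, `e500_update_features` and `e500_has_feature` are hypothetical, and the `MMUCFG_MAVN` values are assumed here purely so the example is self-contained.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed field layout, for illustration only; check the real headers. */
#define MMUCFG_MAVN     0x00000003u  /* MMU architecture version field */
#define MMUCFG_MAVN_V2  0x00000001u  /* field value meaning MAV 2.0 */

/* Hypothetical feature bits; they are not part of the posted patches. */
enum e500_feature {
	E500_FEAT_MMU_V2      = 1u << 0,
	E500_FEAT_PAGE_TABLES = 1u << 1,  /* never set in this sketch */
};

/* Stand-in for the relevant fields of struct kvm_vcpu_arch. */
struct vcpu_mmu_state {
	uint32_t mmucfg;
	uint32_t features;
};

/* Derive the feature mask once, whenever the guest MMUCFG value changes. */
static inline void e500_update_features(struct vcpu_mmu_state *s)
{
	s->features = 0;
	if ((s->mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V2)
		s->features |= E500_FEAT_MMU_V2;
}

/* A single bool predicate replaces one helper per feature. */
static inline bool e500_has_feature(const struct vcpu_mmu_state *s,
				    enum e500_feature f)
{
	return (s->features & f) != 0;
}
```

With something along these lines, callers would test `e500_has_feature(s, E500_FEAT_MMU_V2)` instead of open-coding the MMUCFG comparison in each emulation path.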
RE: [PATCH 1/5] KVM: PPC: e500: Move VCPU's MMUCFG register initialization earlier
> -----Original Message-----
> From: Alexander Graf [mailto:ag...@suse.de]
> Sent: Thursday, January 31, 2013 4:58 PM
> To: Caraman Mihai Claudiu-B02008
> Cc: kvm-...@vger.kernel.org; k...@vger.kernel.org; linuxppc-
> d...@lists.ozlabs.org
> Subject: Re: [PATCH 1/5] KVM: PPC: e500: Move VCPU's MMUCFG register
> initialization earlier
>
>
> On 31.01.2013, at 15:56, Caraman Mihai Claudiu-B02008 wrote:
>
> >> -----Original Message-----
> >> From: Alexander Graf [mailto:ag...@suse.de]
> >> Sent: Thursday, January 31, 2013 3:21 PM
> >> To: Caraman Mihai Claudiu-B02008
> >> Cc: kvm-...@vger.kernel.org; k...@vger.kernel.org; linuxppc-
> >> d...@lists.ozlabs.org
> >> Subject: Re: [PATCH 1/5] KVM: PPC: e500: Move VCPU's MMUCFG register
> >> initialization earlier
> >>
> >>
> >> On 30.01.2013, at 14:29, Mihai Caraman wrote:
> >>
> >>> VCPU's MMUCFG register initialization should not depend on KVM_CAP_SW_TLB
> >>> ioctl call. Move it earlier into tlb initialization phase.
> >>
> >> Quite the contrary. The fact that there is an mfspr() in e500_mmu.c
> >> already tells us that the code is broken. The TLB guest code should only
> >> depend on input from the SW_TLB configuration. It's completely orthogonal
> >> to the host capabilities.
> >
> > Then we have the same issue for TLBnCFG registers which need to be configured
> > via SW_TLB ioctl. What is the purpose of guest tlb initialization in e500_mmu.c
> > if we rely on SW_TLB?
>
> It's to provide a fallback to user space that doesn't implement SW_TLB
> configuration yet.

Do we have such a case now or is it just hypothetical? For the fallback we need to initialize the MMUCFG register, which is what I intended to say in the commit message.

>
>
> Alex
>
Re: [PATCH 1/5] KVM: PPC: e500: Move VCPU's MMUCFG register initialization earlier
On 01/31/2013 09:26:20 AM, Caraman Mihai Claudiu-B02008 wrote:
> > -----Original Message-----
> > From: Alexander Graf [mailto:ag...@suse.de]
> > Sent: Thursday, January 31, 2013 4:58 PM
> > To: Caraman Mihai Claudiu-B02008
> > Cc: kvm-...@vger.kernel.org; k...@vger.kernel.org; linuxppc-
> > d...@lists.ozlabs.org
> > Subject: Re: [PATCH 1/5] KVM: PPC: e500: Move VCPU's MMUCFG register
> > initialization earlier
> >
> >
> > On 31.01.2013, at 15:56, Caraman Mihai Claudiu-B02008 wrote:
> >
> > >> -----Original Message-----
> > >> From: Alexander Graf [mailto:ag...@suse.de]
> > >> Sent: Thursday, January 31, 2013 3:21 PM
> > >> To: Caraman Mihai Claudiu-B02008
> > >> Cc: kvm-...@vger.kernel.org; k...@vger.kernel.org; linuxppc-
> > >> d...@lists.ozlabs.org
> > >> Subject: Re: [PATCH 1/5] KVM: PPC: e500: Move VCPU's MMUCFG register
> > >> initialization earlier
> > >>
> > >>
> > >> On 30.01.2013, at 14:29, Mihai Caraman wrote:
> > >>
> > >>> VCPU's MMUCFG register initialization should not depend on KVM_CAP_SW_TLB
> > >>> ioctl call. Move it earlier into tlb initialization phase.
> > >>
> > >> Quite the contrary. The fact that there is an mfspr() in e500_mmu.c
> > >> already tells us that the code is broken. The TLB guest code should only
> > >> depend on input from the SW_TLB configuration. It's completely orthogonal
> > >> to the host capabilities.
> > >
> > > Then we have the same issue for TLBnCFG registers which need to be configured
> > > via SW_TLB ioctl. What is the purpose of guest tlb initialization in e500_mmu.c
> > > if we rely on SW_TLB?
> >
> > It's to provide a fallback to user space that doesn't implement SW_TLB
> > configuration yet.
>
> Do we have such a case now or is it just hypothetical? For the fallback we
> need to initialize the MMUCFG register, which is what I intended to say in
> the commit message.

I don't think we need to support a fallback for e6500, since there's nothing to be backwards compatible with.

As for use case, I don't see us ever supporting the guest being a different CPU than the host. Page sizes probably aren't a problem, but there are other barriers. The main reasons that TLBnCFG are settable through SW_TLB are:

1. The guest TLB can be enlarged as a performance hack (like in Topaz, though QEMU doesn't currently do this),
2. The legacy default in KVM is based on the e500v1 TLB0 size, which is half of what e500v2/e500mc have, and
3. QEMU needs to know the exact geometry of the TLB so that it can interpret the shared data properly.

#3 seems like a compelling reason here, to avoid silent weirdness if there's a slight mismatch between what QEMU thinks it's modelling and what we're actually running on.

-Scott
[PATCH 24/25] perf/POWER7: Make some POWER7 events available in sysfs
From: Sukadev Bhattiprolu

Make some POWER7-specific perf events available in sysfs.

    $ /bin/ls -1 /sys/bus/event_source/devices/cpu/events/
    branch-instructions
    branch-misses
    cache-misses
    cache-references
    cpu-cycles
    instructions
    PM_BRU_FIN
    PM_BRU_MPRED
    PM_CMPLU_STALL
    PM_CYC
    PM_GCT_NOSLOT_CYC
    PM_INST_CMPL
    PM_LD_MISS_L1
    PM_LD_REF_L1
    stalled-cycles-backend
    stalled-cycles-frontend

where the 'PM_*' events are POWER specific and the others are the
generic events.

This will enable users to specify these events with their symbolic
names rather than with their raw code.

    perf stat -e 'cpu/PM_CYC' ...

Signed-off-by: Sukadev Bhattiprolu
Cc: Andi Kleen
Cc: Anton Blanchard
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: Robert Richter
Cc: Stephane Eranian
Cc: linuxppc-...@ozlabs.org
Link: http://lkml.kernel.org/r/20130123062528.ge13...@us.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo
---
 arch/powerpc/include/asm/perf_event_server.h | 3 +++
 arch/powerpc/perf/power7-pmu.c | 18 ++
 2 files changed, 21 insertions(+)

diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index b9b6c55..b29fcc6 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -132,3 +132,6 @@ extern ssize_t power_events_sysfs_show(struct device *dev,

 #define GENERIC_EVENT_ATTR(_name, _id) EVENT_ATTR(_name, _id, _g)
 #define GENERIC_EVENT_PTR(_id) EVENT_PTR(_id, _g)
+
+#define POWER_EVENT_ATTR(_name, _id) EVENT_ATTR(PM_##_name, _id, _p)
+#define POWER_EVENT_PTR(_id) EVENT_PTR(_id, _p)
diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index 269bf24..b554879 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -384,6 +384,15 @@ GENERIC_EVENT_ATTR(cache-misses, LD_MISS_L1);
 GENERIC_EVENT_ATTR(branch-instructions, BRU_FIN);
 GENERIC_EVENT_ATTR(branch-misses, BRU_MPRED);

+POWER_EVENT_ATTR(CYC, CYC);
+POWER_EVENT_ATTR(GCT_NOSLOT_CYC, GCT_NOSLOT_CYC);
+POWER_EVENT_ATTR(CMPLU_STALL, CMPLU_STALL);
+POWER_EVENT_ATTR(INST_CMPL, INST_CMPL);
+POWER_EVENT_ATTR(LD_REF_L1, LD_REF_L1);
+POWER_EVENT_ATTR(LD_MISS_L1, LD_MISS_L1);
+POWER_EVENT_ATTR(BRU_FIN, BRU_FIN)
+POWER_EVENT_ATTR(BRU_MPRED, BRU_MPRED);
+
 static struct attribute *power7_events_attr[] = {
 	GENERIC_EVENT_PTR(CYC),
 	GENERIC_EVENT_PTR(GCT_NOSLOT_CYC),
@@ -393,6 +402,15 @@ static struct attribute *power7_events_attr[] = {
 	GENERIC_EVENT_PTR(LD_MISS_L1),
 	GENERIC_EVENT_PTR(BRU_FIN),
 	GENERIC_EVENT_PTR(BRU_MPRED),
+
+	POWER_EVENT_PTR(CYC),
+	POWER_EVENT_PTR(GCT_NOSLOT_CYC),
+	POWER_EVENT_PTR(CMPLU_STALL),
+	POWER_EVENT_PTR(INST_CMPL),
+	POWER_EVENT_PTR(LD_REF_L1),
+	POWER_EVENT_PTR(LD_MISS_L1),
+	POWER_EVENT_PTR(BRU_FIN),
+	POWER_EVENT_PTR(BRU_MPRED),
 	NULL
 };

--
1.8.1.1.361.gec3ae6e
[PATCH 25/25] perf: Document the ABI of perf sysfs entries
From: Sukadev Bhattiprolu

This patchset adds two new sets of files to sysfs for the POWER architecture.

    - perf event config format in /sys/devices/cpu/format/event
    - generic and POWER-specific perf events in /sys/devices/cpu/events/

The format of the first file is already documented in:

    sysfs-bus-event_source-devices-format

Document the format of the second set of files '/sys/devices/cpu/events/*'
which would also become part of the ABI.

Changelog[v4]:
    [Jiri Olsa]: Mention that multiple event= like terms can be specified
    in the 'events' file.
    [Jiri Olsa]: Remove the documentation for the 'config format' file as
    it is already documented in 'Documentation/ABI/testing/'.
    [Jiri Olsa]: Move ABI documentation from 'stable/' to 'testing/'

Changelog[v3]:
    [Greg KH] Include ABI documentation.

Signed-off-by: Sukadev Bhattiprolu
Acked-by: Jiri Olsa
Cc: Andi Kleen
Cc: Anton Blanchard
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: Robert Richter
Cc: Stephane Eranian
Cc: linuxppc-...@ozlabs.org
Link: http://lkml.kernel.org/r/20130123062645.gg13...@us.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo
---
 Documentation/ABI/stable/sysfs-devices-cpu-events | 0
 .../testing/sysfs-bus-event_source-devices-events | 62 ++
 2 files changed, 62 insertions(+)
 delete mode 100644 Documentation/ABI/stable/sysfs-devices-cpu-events
 create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-events

diff --git a/Documentation/ABI/stable/sysfs-devices-cpu-events b/Documentation/ABI/stable/sysfs-devices-cpu-events
deleted file mode 100644
index e69de29..0000000
diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-events b/Documentation/ABI/testing/sysfs-bus-event_source-devices-events
new file mode 100644
index 0000000..0adeb52
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-events
@@ -0,0 +1,62 @@
+What:		/sys/devices/cpu/events/
+		/sys/devices/cpu/events/branch-misses
+		/sys/devices/cpu/events/cache-references
+		/sys/devices/cpu/events/cache-misses
+		/sys/devices/cpu/events/stalled-cycles-frontend
+		/sys/devices/cpu/events/branch-instructions
+		/sys/devices/cpu/events/stalled-cycles-backend
+		/sys/devices/cpu/events/instructions
+		/sys/devices/cpu/events/cpu-cycles
+
+Date:		2013/01/08
+
+Contact:	Linux kernel mailing list
+
+Description:	Generic performance monitoring events
+
+		A collection of performance monitoring events that may be
+		supported by many/most CPUs. These events can be monitored
+		using the 'perf(1)' tool.
+
+		The contents of each file would look like:
+
+			event=0xNNNN
+
+		where 'N' is a hex digit and the number '0xNNNN' shows the
+		"raw code" for the perf event identified by the file's
+		"basename".
+
+
+What:		/sys/devices/cpu/events/PM_LD_MISS_L1
+		/sys/devices/cpu/events/PM_LD_REF_L1
+		/sys/devices/cpu/events/PM_CYC
+		/sys/devices/cpu/events/PM_BRU_FIN
+		/sys/devices/cpu/events/PM_GCT_NOSLOT_CYC
+		/sys/devices/cpu/events/PM_BRU_MPRED
+		/sys/devices/cpu/events/PM_INST_CMPL
+		/sys/devices/cpu/events/PM_CMPLU_STALL
+
+Date:		2013/01/08
+
+Contact:	Linux kernel mailing list
+		Linux Powerpc mailing list
+
+Description:	POWER-systems specific performance monitoring events
+
+		A collection of performance monitoring events that may be
+		supported by the POWER CPU. These events can be monitored
+		using the 'perf(1)' tool.
+
+		These events may not be supported by other CPUs.
+
+		The contents of each file would look like:
+
+			event=0xNNNN
+
+		where 'N' is a hex digit and the number '0xNNNN' shows the
+		"raw code" for the perf event identified by the file's
+		"basename".
+
+		Further, multiple terms like 'event=0xNNNN' can be specified
+		and separated with comma. All available terms are defined in
+		the /sys/bus/event_source/devices//format file.
--
1.8.1.1.361.gec3ae6e
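The ABI text above describes each events file as a line of the form `event=0x` followed by hex digits. As a hedged illustration of how a user-space consumer might read that back, here is a small sketch. The helper name `parse_event_code` is made up for this example, and it deliberately handles only the single-term form, not the comma-separated multi-term form the document also allows.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Parse the text of one events file, e.g. "event=0x400f0\n", and store
 * the raw event code in *code. Returns false if the text does not start
 * with a single "event=0x<hex>" term. Comma-separated terms are not
 * handled by this sketch.
 */
static bool parse_event_code(const char *text, unsigned long long *code)
{
	return sscanf(text, "event=0x%llx", code) == 1;
}
```

A caller would read the file contents into a buffer (for instance with `fgets`) and pass the buffer to the helper; the resulting value is the same raw code that can be given to perf directly.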
[GIT PULL 00/25] perf/core improvements and fixes
Hi Ingo, Please consider pulling, - Arnaldo The following changes since commit 152fefa921535665f95840c08062844ab2f5593e: Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core (2013-01-31 10:20:14 +0100) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux tags/perf-core-for-mingo for you to fetch changes up to 2ac3634a7e1c8eedc961030c87c5c36ebd5bbf8e: perf: Document the ABI of perf sysfs entries (2013-01-31 13:07:51 -0300) perf/core improvements and fixes: . Make some POWER7 events available in sysfs, equivalent to what was done on x86, from Sukadev Bhattiprolu. . Add event group view, from Namyung Kim: To use it, 'perf record' should group events when recording. And then perf report parses the saved group relation from file header and prints them together if --group option is provided. You can use 'perf evlist' command to see event group information: $ perf record -e '{ref-cycles,cycles}' noploop 1 [ perf record: Woken up 2 times to write data ] [ perf record: Captured and wrote 0.385 MB perf.data (~16807 samples) ] $ perf evlist --group {ref-cycles,cycles} With this example, default perf report will show you each event separately like this: $ perf report ... # group: {ref-cycles,cycles} # # Samples: 3K of event 'ref-cycles' # Event count (approx.): 3153797218 # # Overhead Command Shared Object Symbol # ... . .. 99.84% noploop noploop[.] main 0.07% noploop ld-2.15.so [.] strcmp 0.03% noploop [kernel.kallsyms] [k] timerqueue_del 0.03% noploop [kernel.kallsyms] [k] sched_clock_cpu 0.02% noploop [kernel.kallsyms] [k] account_user_time 0.01% noploop [kernel.kallsyms] [k] __alloc_pages_nodemask 0.00% noploop [kernel.kallsyms] [k] native_write_msr_safe # Samples: 3K of event 'cycles' # Event count (approx.): 3722310525 # # Overhead Command Shared Object Symbol # ... . . 99.76% noploop noploop[.] 
main 0.11% noploop [kernel.kallsyms] [k] _raw_spin_lock 0.06% noploop [kernel.kallsyms] [k] find_get_page 0.03% noploop [kernel.kallsyms] [k] sched_clock_cpu 0.02% noploop [kernel.kallsyms] [k] rcu_check_callbacks 0.02% noploop [kernel.kallsyms] [k] __current_kernel_time 0.00% noploop [kernel.kallsyms] [k] native_write_msr_safe In this case the event group information will be shown in the end of header area. So you can use --group option to enable event group view. $ perf report --group ... # group: {ref-cycles,cycles} # # Samples: 7K of event 'anon group { ref-cycles, cycles }' # Event count (approx.): 6876107743 # # Overhead Command Shared Object Symbol # ... . .. 99.84% 99.76% noploop noploop[.] main 0.07% 0.00% noploop ld-2.15.so [.] strcmp 0.03% 0.00% noploop [kernel.kallsyms] [k] timerqueue_del 0.03% 0.03% noploop [kernel.kallsyms] [k] sched_clock_cpu 0.02% 0.00% noploop [kernel.kallsyms] [k] account_user_time 0.01% 0.00% noploop [kernel.kallsyms] [k] __alloc_pages_nodemask 0.00% 0.00% noploop [kernel.kallsyms] [k] native_write_msr_safe 0.00% 0.11% noploop [kernel.kallsyms] [k] _raw_spin_lock 0.00% 0.06% noploop [kernel.kallsyms] [k] find_get_page 0.00% 0.02% noploop [kernel.kallsyms] [k] rcu_check_callbacks 0.00% 0.02% noploop [kernel.kallsyms] [k] __current_kernel_time As you can see the Overhead column now contains both of ref-cycles and cycles and header line shows group information also - 'anon group { ref-cycles, cycles }'. The output is sorted by period of group leader first. If perf.data file doesn't contain group information, this --group option does nothing. 
So if you want enable event group view by default you can set it in ~/.perfconfig file: $ cat ~/.perfconfig [report] group = true It can be overridden with command line if you want: $ perf report --no-group Signed-off-by: Arnaldo Carvalho de Melo Arnaldo Carvalho de Melo (2): perf top: Stop using exit() perf top: Delete maps on exit Namhyung Kim (18): perf tools: Keep group information perf tests: Add group test conditions perf header: Add HEADER_GROUP_DES
[PATCH 23/25] perf/POWER7: Make generic event translations available in sysfs
From: Sukadev Bhattiprolu Make the generic perf events in POWER7 available via sysfs. $ ls /sys/bus/event_source/devices/cpu/events branch-instructions branch-misses cache-misses cache-references cpu-cycles instructions stalled-cycles-backend stalled-cycles-frontend $ cat /sys/bus/event_source/devices/cpu/events/cache-misses event=0x400f0 This patch is based on commits that implement this functionality on x86. Eg: commit a47473939db20e3961b200eb00acf5fcf084d755 Author: Jiri Olsa Date: Wed Oct 10 14:53:11 2012 +0200 perf/x86: Make hardware event translations available in sysfs Changelog:[v2] [Jiri Osla] Drop EVENT_ID() macro since it is only used once. Signed-off-by: Sukadev Bhattiprolu Cc: Andi Kleen Cc: Anton Blanchard Cc: Ingo Molnar Cc: Jiri Olsa Cc: Paul Mackerras Cc: Peter Zijlstra Cc: Robert Richter Cc: Stephane Eranian Cc: linuxppc-...@ozlabs.org Link: http://lkml.kernel.org/r/20130123062454.gd13...@us.ibm.com Signed-off-by: Arnaldo Carvalho de Melo --- Documentation/ABI/stable/sysfs-devices-cpu-events | 0 arch/powerpc/include/asm/perf_event_server.h | 23 +++ arch/powerpc/perf/core-book3s.c | 12 arch/powerpc/perf/power7-pmu.c| 34 +++ 4 files changed, 69 insertions(+) create mode 100644 Documentation/ABI/stable/sysfs-devices-cpu-events diff --git a/Documentation/ABI/stable/sysfs-devices-cpu-events b/Documentation/ABI/stable/sysfs-devices-cpu-events new file mode 100644 index 000..e69de29 diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h index 9710be3..b9b6c55 100644 --- a/arch/powerpc/include/asm/perf_event_server.h +++ b/arch/powerpc/include/asm/perf_event_server.h @@ -11,6 +11,7 @@ #include #include +#include #define MAX_HWEVENTS 8 #define MAX_EVENT_ALTERNATIVES 8 @@ -35,6 +36,7 @@ struct power_pmu { void(*disable_pmc)(unsigned int pmc, unsigned long mmcr[]); int (*limited_pmc_event)(u64 event_id); u32 flags; + const struct attribute_group**attr_groups; int n_generic; int *generic_events; int 
(*cache_events)[PERF_COUNT_HW_CACHE_MAX] @@ -109,3 +111,24 @@ extern unsigned long perf_instruction_pointer(struct pt_regs *regs); * If an event_id is not subject to the constraint expressed by a particular * field, then it will have 0 in both the mask and value for that field. */ + +extern ssize_t power_events_sysfs_show(struct device *dev, + struct device_attribute *attr, char *page); + +/* + * EVENT_VAR() is same as PMU_EVENT_VAR with a suffix. + * + * Having a suffix allows us to have aliases in sysfs - eg: the generic + * event 'cpu-cycles' can have two entries in sysfs: 'cpu-cycles' and + * 'PM_CYC' where the latter is the name by which the event is known in + * POWER CPU specification. + */ +#defineEVENT_VAR(_id, _suffix) event_attr_##_id##_suffix +#defineEVENT_PTR(_id, _suffix) &EVENT_VAR(_id, _suffix) + +#defineEVENT_ATTR(_name, _id, _suffix) \ + PMU_EVENT_ATTR(_name, EVENT_VAR(_id, _suffix), PME_PM_##_id,\ + power_events_sysfs_show) + +#defineGENERIC_EVENT_ATTR(_name, _id) EVENT_ATTR(_name, _id, _g) +#defineGENERIC_EVENT_PTR(_id) EVENT_PTR(_id, _g) diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index aa2465e..fa476d5 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -1305,6 +1305,16 @@ static int power_pmu_event_idx(struct perf_event *event) return event->hw.idx; } +ssize_t power_events_sysfs_show(struct device *dev, + struct device_attribute *attr, char *page) +{ + struct perf_pmu_events_attr *pmu_attr; + + pmu_attr = container_of(attr, struct perf_pmu_events_attr, attr); + + return sprintf(page, "event=0x%02llx\n", pmu_attr->id); +} + struct pmu power_pmu = { .pmu_enable = power_pmu_enable, .pmu_disable= power_pmu_disable, @@ -1537,6 +1547,8 @@ int __cpuinit register_power_pmu(struct power_pmu *pmu) pr_info("%s performance monitor hardware support registered\n", pmu->name); + power_pmu.attr_groups = ppmu->attr_groups; + #ifdef MSR_HV /* * Use FCHV to ignore kernel events if MSR.HV is 
set. diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c index eebb36d..269bf24 100644 --- a/arch/powerpc/perf/power7-pmu.c +++ b/arch/powerpc/perf/power7-pmu.c @@ -374,6 +374,39 @@ static int power7_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = { }, }; + +GENERIC_EVEN
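The show routine in this patch recovers the enclosing `perf_pmu_events_attr` from the attribute pointer it is handed and prints the event id with the `event=0x%02llx` format. The user-space sketch below mimics that pattern so it can be compiled and tried outside the kernel. The struct names, `container_of_sketch` and `events_show` are stand-ins invented for the example, not kernel definitions.

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-ins for device_attribute and perf_pmu_events_attr. */
struct dev_attr {
	const char *name;
};

struct events_attr {
	struct dev_attr attr;
	unsigned long long id;
};

/* Same idea as the kernel's container_of(), without its type checking. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Format the event id the way the patch does. Returns the number of
 * characters written (excluding the terminating NUL), as snprintf does.
 */
static int events_show(struct dev_attr *attr, char *page, size_t len)
{
	struct events_attr *ea =
		container_of_sketch(attr, struct events_attr, attr);

	return snprintf(page, len, "event=0x%02llx\n", ea->id);
}
```

Passing the address of the embedded `attr` member is enough for the routine to find the id, which is why one show function can serve every event attribute.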
[PATCH 21/25] perf/Power7: Use macros to identify perf events
From: Sukadev Bhattiprolu

Define and use macros to identify perf event codes. This would make it
easier and more readable when these event codes need to be used in more
than one place.

Signed-off-by: Sukadev Bhattiprolu
Acked-by: Jiri Olsa
Cc: Andi Kleen
Cc: Anton Blanchard
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: Robert Richter
Cc: Stephane Eranian
Cc: linuxppc-...@ozlabs.org
Link: http://lkml.kernel.org/r/20130123062353.gb13...@us.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo
---
 arch/powerpc/perf/power7-pmu.c | 28
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index 2ee01e3..eebb36d 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -51,6 +51,18 @@
 #define MMCR1_PMCSEL_MSK 0xff

 /*
+ * Power7 event codes.
+ */
+#define PME_PM_CYC             0x1e
+#define PME_PM_GCT_NOSLOT_CYC  0x100f8
+#define PME_PM_CMPLU_STALL     0x4000a
+#define PME_PM_INST_CMPL       0x2
+#define PME_PM_LD_REF_L1       0xc880
+#define PME_PM_LD_MISS_L1      0x400f0
+#define PME_PM_BRU_FIN         0x10068
+#define PME_PM_BRU_MPRED       0x400f6
+
+/*
  * Layout of constraint bits:
  * 6666555555555544444444443333333333222222222211111111110000000000
  * 3210987654321098765432109876543210987654321098765432109876543210
@@ -307,14 +319,14 @@ static void power7_disable_pmc(unsigned int pmc, unsigned long mmcr[])
 }

 static int power7_generic_events[] = {
-	[PERF_COUNT_HW_CPU_CYCLES] = 0x1e,
-	[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = 0x100f8, /* GCT_NOSLOT_CYC */
-	[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x4000a, /* CMPLU_STALL */
-	[PERF_COUNT_HW_INSTRUCTIONS] = 2,
-	[PERF_COUNT_HW_CACHE_REFERENCES] = 0xc880, /* LD_REF_L1_LSU*/
-	[PERF_COUNT_HW_CACHE_MISSES] = 0x400f0, /* LD_MISS_L1 */
-	[PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x10068, /* BRU_FIN */
-	[PERF_COUNT_HW_BRANCH_MISSES] = 0x400f6, /* BR_MPRED */
+	[PERF_COUNT_HW_CPU_CYCLES] = PME_PM_CYC,
+	[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = PME_PM_GCT_NOSLOT_CYC,
+	[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = PME_PM_CMPLU_STALL,
+	[PERF_COUNT_HW_INSTRUCTIONS] = PME_PM_INST_CMPL,
+	[PERF_COUNT_HW_CACHE_REFERENCES] = PME_PM_LD_REF_L1,
+	[PERF_COUNT_HW_CACHE_MISSES] = PME_PM_LD_MISS_L1,
+	[PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = PME_PM_BRU_FIN,
+	[PERF_COUNT_HW_BRANCH_MISSES] = PME_PM_BRU_MPRED,
 };

 #define C(x) PERF_COUNT_HW_CACHE_##x
--
1.8.1.1.361.gec3ae6e
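The change above swaps raw hex literals for named macros inside a table built with designated initializers. A stripped-down, stand-alone sketch of the same pattern follows. The `hw_event` enum and `event_code` helper are invented for illustration (the kernel uses the `PERF_COUNT_HW_*` indices), while the three code values are taken from the patch.

```c
/* Illustrative index enum; the kernel uses PERF_COUNT_HW_* instead. */
enum hw_event {
	HW_CPU_CYCLES,
	HW_INSTRUCTIONS,
	HW_CACHE_MISSES,
	HW_EVENT_MAX
};

/* Event codes as named in the patch. */
#define PME_PM_CYC         0x1e
#define PME_PM_INST_CMPL   0x2
#define PME_PM_LD_MISS_L1  0x400f0

/*
 * Designated initializers tie each generic event to its raw code by
 * name, so the table stays correct even if the enum order changes.
 */
static const int generic_events[HW_EVENT_MAX] = {
	[HW_CPU_CYCLES]   = PME_PM_CYC,
	[HW_INSTRUCTIONS] = PME_PM_INST_CMPL,
	[HW_CACHE_MISSES] = PME_PM_LD_MISS_L1,
};

/* Look up the raw code for a generic event. */
static int event_code(enum hw_event ev)
{
	return generic_events[ev];
}
```

Because the codes now have names, the same macros can be reused elsewhere (for example when exporting the events through sysfs) without repeating the magic numbers.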
[PATCH 22/25] perf: Make EVENT_ATTR global
From: Sukadev Bhattiprolu

Rename EVENT_ATTR() to PMU_EVENT_ATTR() and make it global so it is
available to all architectures.

Further to allow architectures flexibility, have PMU_EVENT_ATTR() pass
in the variable name as a parameter.

Changelog[v2]
    - [Jiri Olsa] No need to define PMU_EVENT_PTR()

Signed-off-by: Sukadev Bhattiprolu
Acked-by: Jiri Olsa
Cc: Andi Kleen
Cc: Anton Blanchard
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: Robert Richter
Cc: Stephane Eranian
Cc: linuxppc-...@ozlabs.org
Link: http://lkml.kernel.org/r/20130123062422.gc13...@us.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo
---
 arch/x86/kernel/cpu/perf_event.c | 13 +++--
 include/linux/perf_event.h | 11 +++
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 6774c17..c0df5ed2 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1310,11 +1310,6 @@ static struct attribute_group x86_pmu_format_group = {
 	.attrs = NULL,
 };

-struct perf_pmu_events_attr {
-	struct device_attribute attr;
-	u64 id;
-};
-
 /*
  * Remove all undefined events (x86_pmu.event_map(id) == 0)
  * out of events_attr attributes.
@@ -1348,11 +1343,9 @@ static ssize_t events_sysfs_show(struct device *dev, struct device_attribute *at
 #define EVENT_VAR(_id) event_attr_##_id
 #define EVENT_PTR(_id) &event_attr_##_id.attr.attr

-#define EVENT_ATTR(_name, _id) \
-static struct perf_pmu_events_attr EVENT_VAR(_id) = { \
-	.attr = __ATTR(_name, 0444, events_sysfs_show, NULL), \
-	.id = PERF_COUNT_HW_##_id, \
-};
+#define EVENT_ATTR(_name, _id) \
+	PMU_EVENT_ATTR(_name, EVENT_VAR(_id), PERF_COUNT_HW_##_id, \
+			events_sysfs_show)

 EVENT_ATTR(cpu-cycles, CPU_CYCLES );
 EVENT_ATTR(instructions, INSTRUCTIONS);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6bfb2faa..42adf01 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -817,6 +817,17 @@ do { \
 } while (0)


+struct perf_pmu_events_attr {
+	struct device_attribute attr;
+	u64 id;
+};
+
+#define PMU_EVENT_ATTR(_name, _var, _id, _show) \
+static struct perf_pmu_events_attr _var = { \
+	.attr = __ATTR(_name, 0444, _show, NULL), \
+	.id = _id, \
+};
+
 #define PMU_FORMAT_ATTR(_name, _format) \
 static ssize_t \
 _name##_show(struct device *dev, \
--
1.8.1.1.361.gec3ae6e
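The point of passing the variable name into the macro is that one event id can be exposed under more than one sysfs name. The sketch below shows that idea in plain C with a simplified struct. `pmu_events_attr` and `PMU_EVENT_ATTR_SKETCH` are hypothetical stand-ins for this example: the real macro wraps a `device_attribute` via `__ATTR` and takes a show callback, both of which are left out here.

```c
/* Simplified stand-in for struct perf_pmu_events_attr. */
struct pmu_events_attr {
	const char *name;        /* name the file would have in sysfs */
	unsigned long long id;   /* raw event code */
};

/*
 * The caller chooses both the displayed name and the C variable name.
 * #_name turns the name tokens into a string, so names with a dash
 * such as cpu-cycles work even though they are not valid identifiers.
 */
#define PMU_EVENT_ATTR_SKETCH(_name, _var, _id) \
static struct pmu_events_attr _var = { .name = #_name, .id = (_id) }

/* Two aliases for the same event code, each with its own variable. */
PMU_EVENT_ATTR_SKETCH(cpu-cycles, event_attr_CYC_g, 0x1e);
PMU_EVENT_ATTR_SKETCH(PM_CYC, event_attr_CYC_p, 0x1e);
```

Had the variable name been derived from the id alone, the second definition would collide with the first, which is the flexibility the commit message refers to.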
Re: [RFC PATCH v2 01/12] Add sys_hotplug.h for system device hotplug framework
On Wednesday, January 30, 2013 07:57:45 PM Toshi Kani wrote:
> On Tue, 2013-01-29 at 23:58 -0500, Greg KH wrote:
> > On Thu, Jan 10, 2013 at 04:40:19PM -0700, Toshi Kani wrote:
> > > +/*
> > > + * Hot-plug device information
> > > + */
> >
> > Again, stop it with the "generic" hotplug term here, and everywhere
> > else. You are doing a very _specific_ type of hotplug devices, so spell
> > it out. We've worked hard to hotplug _everything_ in Linux, you are
> > going to confuse a lot of people with this type of terms.
>
> Agreed. I will clarify in all places.
>
> > > +union shp_dev_info {
> > > + struct shp_cpu {
> > > + u32 cpu_id;
> > > + } cpu;
> >
> > What is this? Why not point to the system device for the cpu?
>
> This info is used to on-line a new CPU and create its system/cpu device.
> In other words, a system/cpu device is created as a result of CPU
> hotplug.
>
> > > + struct shp_memory {
> > > + int node;
> > > + u64 start_addr;
> > > + u64 length;
> > > + } mem;
> >
> > Same here, why not point to the system device?
>
> Same as above.
>
> > > + struct shp_hostbridge {
> > > + } hb;
> > > +
> > > + struct shp_node {
> > > + } node;
> >
> > What happened here with these? Empty structures? Huh?
>
> They are place holders for now. PCI bridge hot-plug and node hot-plug
> are still very much work in progress, so I have not integrated them into
> this framework yet.
>
> > > +};
> > > +
> > > +struct shp_device {
> > > + struct list_head list;
> > > + struct device *device;
> >
> > No, make it a "real" device, embed the device into it.
>
> This device pointer is used to send KOBJ_ONLINE/OFFLINE event during CPU
> online/offline operation in order to maintain the current behavior. CPU
> online/offline operation only changes the state of CPU, so its
> system/cpu device continues to be present before and after an operation.
> (Whereas, CPU hot-add/delete operation creates or removes a system/cpu
> device.) So, this "*device" needs to be a pointer to reference an
> existing device that is to be on-lined/off-lined.
>
> > But, again, I'm going to ask why you aren't using the existing cpu /
> > memory / bridge / node devices that we have in the kernel. Please use
> > them, or give me a _really_ good reason why they will not work.
>
> We cannot use the existing system devices or ACPI devices here. During
> hot-plug, ACPI handler sets this shp_device info, so that cpu and memory
> handlers (drivers/cpu.c and mm/memory_hotplug.c) can obtain their target
> device information in a platform-neutral way. During hot-add, we first
> create an ACPI device node (i.e. device under /sys/bus/acpi/devices),
> but platform-neutral modules cannot use them as they are ACPI-specific.

But suppose we're smart and have ACPI scan handlers that will create
"physical" device nodes for those devices during the ACPI namespace scan.
Then, the platform-neutral nodes will be able to bind to those "physical"
nodes. Moreover, it should be possible to get a hierarchy of device objects
this way that will reflect all of the dependencies we need to take into
account during hot-add and hot-remove operations. That may not be what we
have today, but I don't see any *fundamental* obstacles preventing us from
using this approach.

This is already done for PCI host bridges and platform devices and I don't
see why we can't do that for the other types of devices too.

The only missing piece I see is a way to handle the "eject" problem, i.e.
when we try to eject a device at the top of a subtree and need to tear down
the entire subtree below it, but if that's going to lead to a system crash,
for example, we want to cancel the eject. It seems to me that we'll need some
help from the driver core here.

Thanks,
Rafael

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On 2013/1/31 18:38, Simon Jeons wrote:
> Hi Tang,
> On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote:
>> Hi Simon,
>>
>> On 01/31/2013 04:48 PM, Simon Jeons wrote:
>>> Hi Tang,
>>> On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
>>>
>>> 1. IIUC, there is a button on machine which supports hot-remove memory,
>>> then what's the difference between press button and echo to /sys?
>>
>> No important difference, I think. Since I don't have the machine you are
>> saying, I cannot surely answer you. :)
>> AFAIK, pressing the button means trigger the hotplug from hardware, sysfs
>> is just another entrance. At last, they will run into the same code.
>>
>>> 2. Since kernel memory is linear mapping(I mean direct mapping part),
>>> why can't put kernel direct mapping memory into one memory device, and
>>> other memory into the other devices?
>>
>> We cannot do that because in that way, we will lose NUMA performance.
>>
>> If you know NUMA, you will understand the following example:
>>
>> node0:                  node1:
>>    cpu0~cpu15              cpu16~cpu31
>>    memory0~memory511       memory512~memory1023
>>
>> cpu16~cpu31 access memory512~memory1023 much faster than memory0~memory511.
>> If we set direct mapping area in node0, and movable area in node1, then
>> the kernel code running on cpu16~cpu31 will have to access
>> memory0~memory511.
>> This is a terrible performance down.
>
> So if config NUMA, kernel memory will not be linear mapping anymore? For
> example,
>
> Node 0  Node 1
>
> 0 ~ 10G 11G~14G
>
> kernel memory only at Node 0? Can part of kernel memory also at Node 1?
>
> How big is kernel direct mapping memory in x86_64? Is there max limit?

Max kernel direct mapping memory in x86_64 is 64TB.

> It seems that only around 896MB on x86_32.
>
>>
>>> As you know x86_64 don't need
>>> highmem, IIUC, all kernel memory will linear mapping in this case. Is my
>>> idea available? If is correct, x86_32 can't implement in the same way
>>> since highmem(kmap/kmap_atomic/vmalloc) can map any address, so it's
>>> hard to focus kernel memory on single memory device.
>>
>> Sorry, I'm not quite familiar with x86_32 box.
>>
>>> 3. In current implementation, if memory hotplug just need memory
>>> subsystem and ACPI codes support? Or also needs firmware take part in?
>>> Hope you can explain in details, thanks in advance. :)
>>
>> We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
>> based memory migration mentioned by Liu Jiang.
>
> Is there any material about firmware based memory migration?
>
>>
>> So far, I only know this. :)
>>
>>> 4. What's the status of memory hotplug? Apart from can't remove kernel
>>> memory, other things are fully implementation?
>>
>> I think the main job is done for now. And there are still bugs to fix.
>> And this functionality is not stable.
>>
>> Thanks. :)
Re: [RFC PATCH v2 01/12] Add sys_hotplug.h for system device hotplug framework
On Thu, 2013-01-31 at 21:54 +0100, Rafael J. Wysocki wrote:
> On Wednesday, January 30, 2013 07:57:45 PM Toshi Kani wrote:
> > On Tue, 2013-01-29 at 23:58 -0500, Greg KH wrote:
> > > On Thu, Jan 10, 2013 at 04:40:19PM -0700, Toshi Kani wrote:
 :
> > > > +};
> > > > +
> > > > +struct shp_device {
> > > > +	struct list_head	list;
> > > > +	struct device		*device;
> > >
> > > No, make it a "real" device, embed the device into it.
> >
> > This device pointer is used to send KOBJ_ONLINE/OFFLINE event during CPU
> > online/offline operation in order to maintain the current behavior.  CPU
> > online/offline operation only changes the state of CPU, so its
> > system/cpu device continues to be present before and after an operation.
> > (Whereas, CPU hot-add/delete operation creates or removes a system/cpu
> > device.)  So, this "*device" needs to be a pointer to reference an
> > existing device that is to be on-lined/off-lined.
> >
> > > But, again, I'm going to ask why you aren't using the existing cpu /
> > > memory / bridge / node devices that we have in the kernel.  Please use
> > > them, or give me a _really_ good reason why they will not work.
> >
> > We cannot use the existing system devices or ACPI devices here.  During
> > hot-plug, ACPI handler sets this shp_device info, so that cpu and memory
> > handlers (drivers/cpu.c and mm/memory_hotplug.c) can obtain their target
> > device information in a platform-neutral way.  During hot-add, we first
> > create an ACPI device node (i.e. device under /sys/bus/acpi/devices),
> > but platform-neutral modules cannot use them as they are ACPI-specific.
>
> But suppose we're smart and have ACPI scan handlers that will create
> "physical" device nodes for those devices during the ACPI namespace scan.
> Then, the platform-neutral nodes will be able to bind to those "physical"
> nodes.  Moreover, it should be possible to get a hierarchy of device objects
> this way that will reflect all of the dependencies we need to take into
> account during hot-add and hot-remove operations.  That may not be what we
> have today, but I don't see any *fundamental* obstacles preventing us from
> using this approach.

I misstated in my previous email.  The system/cpu device is actually created
by the ACPI driver during ACPI scan in case of hot-add.  This is done by
acpi_processor_hotadd_init(), which I consider a hack, but it can be done.
The system/memory device is created in add_memory() by the mm module.

> This is already done for PCI host bridges and platform devices and I don't
> see why we can't do that for the other types of devices too.
>
> The only missing piece I see is a way to handle the "eject" problem, i.e.
> when we try do eject a device at the top of a subtree and need to tear down
> the entire subtree below it, but if that's going to lead to a system crash,
> for example, we want to cancel the eject.  It seems to me that we'll need some
> help from the driver core here.

There are three different approaches suggested for system device
hot-plug:
 A. Proceed within system device bus scan.
 B. Proceed within ACPI bus scan.
 C. Proceed with a sequence (as a mini-boot).

Option A uses system devices as tokens, option B uses acpi devices as
tokens, and option C uses resource tables as tokens, for their handlers.

Here is a summary of key questions & answers so far.  I hope this
clarifies why I am suggesting option C.

1. What are the system devices?
System devices provide system-wide core computing resources, which are
essential to compose a computer system.  System devices are not
connected to any particular standard buses.

2. Why are the system devices special?
The system devices are initialized during early boot-time, by multiple
subsystems, from the boot-up sequence, in pre-defined order.  They
provide low-level services to enable other subsystems to come up.

3. Why can't we initialize the system devices from the driver structure
at boot?
The driver structure is initialized at the end of the boot sequence and
requires the low-level services from the system devices initialized
beforehand.

4. Why do we need a new common framework?
Sysfs CPU and memory on-lining/off-lining are performed within the CPU
and memory modules.  They are common code and do not depend on ACPI.
Therefore, a new common framework is necessary to integrate both
on-lining/off-lining operation and hot-plugging operation of system
devices into a single framework.

5. Why can't we do everything with ACPI bus scan?
Software dependency among system devices may not be dictated by the ACPI
hierarchy.  For instance, memory should be initialized before CPUs (i.e.
a new cpu may need its local memory), but such ordering cannot be
guaranteed by the ACPI hierarchy.  Also, as described in 4,
online/offline operations are independent from ACPI.

Thanks,
-Toshi
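The ordering argument in answer 5 (memory before CPUs on hot-add) can be sketched as a table of handlers sorted by an explicit order value. This is only an illustration of the "sequence" idea behind option C. Every name below (`demo_handler`, `demo_hot_add`, `demo_hot_delete`) is hypothetical and is not taken from the shp_* patch set, and running hot-delete in the reverse order is an assumption made here for symmetry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical device classes; not the identifiers used by the patch. */
enum demo_dev { DEMO_DEV_MEMORY = 1, DEMO_DEV_CPU = 2 };

struct demo_handler {
	int order;          /* smaller value runs earlier on hot-add */
	enum demo_dev dev;  /* which resource this handler brings up */
};

/* qsort comparator: ascending by order value. */
static int demo_cmp(const void *a, const void *b)
{
	const struct demo_handler *x = a, *y = b;

	return (x->order > y->order) - (x->order < y->order);
}

/* Hot-add: visit handlers in ascending order and record the sequence. */
static void demo_hot_add(struct demo_handler *h, size_t n, enum demo_dev *out)
{
	qsort(h, n, sizeof(*h), demo_cmp);
	for (size_t i = 0; i < n; i++)
		out[i] = h[i].dev;
}

/* Hot-delete (assumed mirror image): visit the same table in reverse. */
static void demo_hot_delete(struct demo_handler *h, size_t n, enum demo_dev *out)
{
	qsort(h, n, sizeof(*h), demo_cmp);
	for (size_t i = 0; i < n; i++)
		out[i] = h[n - 1 - i].dev;
}
```

With a CPU handler registered at order 20 and a memory handler at order 10, `demo_hot_add()` visits memory first regardless of registration order, which is the property an ACPI namespace walk alone does not guarantee.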
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On 02/01/2013 09:36 AM, Simon Jeons wrote:
> On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote:
>>>
>>> So if config NUMA, kernel memory will not be linear mapping anymore? For
>>> example,
>>>
>>> Node 0  Node 1
>>>
>>> 0 ~ 10G 11G~14G

It has nothing to do with linear mapping, I think.

>>>
>>> kernel memory only at Node 0? Can part of kernel memory also at Node 1?

Please refer to find_zone_movable_pfns_for_nodes().
The kernel is not only on node0. It uses all the online nodes evenly. :)

>>>
>>> How big is kernel direct mapping memory in x86_64? Is there max limit?
>>
>>
>> Max kernel direct mapping memory in x86_64 is 64TB.
>
> For example, I have 8G memory, all of them will be direct mapping for
> kernel? then userspace memory allocated from where?

I think you misunderstood what Wu tried to say. :)

The kernel mapped that large space, it doesn't mean it is using that
large space.
The mapping is to make kernel be able to access all the memory, not for
the kernel to use only. User space can also use the memory, but each
process has its own mapping.

For example:

                            64TB, what ever     xxxTB, what ever
logic address space:     |_____kernel_____|_______user_______|
                              \  \  /  /
                               \  /\  /
physical address space:  |______\/__\/_____|   4GB or 8GB, what ever
                                  *

The * part physical is mapped to user space in the process' own
pagetable.
It is also direct mapped in kernel's pagetable. So the kernel can also
access it. :)

>>
>>> It seems that only around 896MB on x86_32.
>>>

We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware
based memory migration mentioned by Liu Jiang.

>>>
>>> Is there any material about firmware based memory migration?

No, I don't have any because this is a functionality of machine from HUAWEI.
I think you can ask Liu Jiang or Wu Jianguo to share some with you. :)

Thanks. :)
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On 2013/2/1 9:36, Simon Jeons wrote: > On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote: >> On 2013/1/31 18:38, Simon Jeons wrote: >> >>> Hi Tang, >>> On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote: Hi Simon, On 01/31/2013 04:48 PM, Simon Jeons wrote: > Hi Tang, > On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote: > > 1. IIUC, there is a button on machine which supports hot-remove memory, > then what's the difference between press button and echo to /sys? No important difference, I think. Since I don't have the machine you are saying, I cannot surely answer you. :) AFAIK, pressing the button means trigger the hotplug from hardware, sysfs is just another entrance. At last, they will run into the same code. > 2. Since kernel memory is linear mapping(I mean direct mapping part), > why can't put kernel direct mapping memory into one memory device, and > other memory into the other devices? We cannot do that because in that way, we will lose NUMA performance. If you know NUMA, you will understand the following example: node0:node1: cpu0~cpu15cpu16~cpu31 memory0~memory511 memory512~memory1023 cpu16~cpu31 access memory16~memory1023 much faster than memory0~memory511. If we set direct mapping area in node0, and movable area in node1, then the kernel code running on cpu16~cpu31 will have to access memory0~memory511. This is a terrible performance down. >>> >>> So if config NUMA, kernel memory will not be linear mapping anymore? For >>> example, >>> >>> Node 0 Node 1 >>> >>> 0 ~ 10G 11G~14G >>> >>> kernel memory only at Node 0? Can part of kernel memory also at Node 1? >>> >>> How big is kernel direct mapping memory in x86_64? Is there max limit? >> >> >> Max kernel direct mapping memory in x86_64 is 64TB. > > For example, I have 8G memory, all of them will be direct mapping for > kernel? then userspace memory allocated from where? 
Direct-mapped memory means you can use __va() and __pa() on it. It does not mean that the memory can only be used by the kernel; user space can use it too, as long as it is free.
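A minimal sketch of what "you can use __va() and __pa()" means in practice: inside the direct mapping, converting between a physical and a kernel virtual address is plain offset arithmetic. The helper names and the base constant below are illustrative assumptions, not the kernel's definitions (the real base depends on kernel version and configuration); only the 64TB size comes from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed, illustrative base of the direct mapping; not authoritative. */
#define DEMO_PAGE_OFFSET   0xffff880000000000ULL
/* 64TB, the x86_64 direct-mapping limit quoted in the thread. */
#define DEMO_DIRECT_MAP_SZ (64ULL << 40)

/* Physical -> direct-mapped virtual address: add a constant. */
static inline uint64_t demo_va(uint64_t pa)
{
	return pa + DEMO_PAGE_OFFSET;
}

/* Direct-mapped virtual -> physical address: subtract the same constant. */
static inline uint64_t demo_pa(uint64_t va)
{
	return va - DEMO_PAGE_OFFSET;
}

/* True when va falls inside the (assumed) direct-mapping window. */
static inline int demo_in_direct_map(uint64_t va)
{
	return va >= DEMO_PAGE_OFFSET &&
	       va - DEMO_PAGE_OFFSET < DEMO_DIRECT_MAP_SZ;
}
```

The arithmetic only says where a page is visible in the kernel's address space. It says nothing about who owns the page, which is why a machine with 8GB of RAM can have all of it direct mapped and still hand most of it to user space.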
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Jianguo, On Fri, 2013-02-01 at 09:57 +0800, Jianguo Wu wrote: > On 2013/2/1 9:36, Simon Jeons wrote: > > > On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote: > >> On 2013/1/31 18:38, Simon Jeons wrote: > >> > >>> Hi Tang, > >>> On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote: > Hi Simon, > > On 01/31/2013 04:48 PM, Simon Jeons wrote: > > Hi Tang, > > On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote: > > > > 1. IIUC, there is a button on machine which supports hot-remove memory, > > then what's the difference between press button and echo to /sys? > > No important difference, I think. Since I don't have the machine you are > saying, I cannot surely answer you. :) > AFAIK, pressing the button means trigger the hotplug from hardware, sysfs > is just another entrance. At last, they will run into the same code. > > > 2. Since kernel memory is linear mapping(I mean direct mapping part), > > why can't put kernel direct mapping memory into one memory device, and > > other memory into the other devices? > > We cannot do that because in that way, we will lose NUMA performance. > > If you know NUMA, you will understand the following example: > > node0:node1: > cpu0~cpu15cpu16~cpu31 > memory0~memory511 memory512~memory1023 > > cpu16~cpu31 access memory16~memory1023 much faster than > memory0~memory511. > If we set direct mapping area in node0, and movable area in node1, then > the kernel code running on cpu16~cpu31 will have to access > memory0~memory511. > This is a terrible performance down. > >>> > >>> So if config NUMA, kernel memory will not be linear mapping anymore? For > >>> example, > >>> > >>> Node 0 Node 1 > >>> > >>> 0 ~ 10G 11G~14G > >>> > >>> kernel memory only at Node 0? Can part of kernel memory also at Node 1? > >>> > >>> How big is kernel direct mapping memory in x86_64? Is there max limit? > >> > >> > >> Max kernel direct mapping memory in x86_64 is 64TB. > > > > For example, I have 8G memory, all of them will be direct mapping for > > kernel? 
then userspace memory allocated from where? > > Direct mapping memory means you can use __va() and __pa(), but not means that > them > can be only used by kernel, them can be used by user-space too, as long as > them are free. IIUC, the benefit of __va() and __pa() is just that they give the virtual/physical address quickly by taking advantage of the linear mapping. But the MMU still needs to go through pgd/pud/pmd/pte, correct?
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On 2013/2/1 10:06, Simon Jeons wrote: > Hi Jianguo, > On Fri, 2013-02-01 at 09:57 +0800, Jianguo Wu wrote: >> On 2013/2/1 9:36, Simon Jeons wrote: >> >>> On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote: On 2013/1/31 18:38, Simon Jeons wrote: > Hi Tang, > On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote: >> Hi Simon, >> >> On 01/31/2013 04:48 PM, Simon Jeons wrote: >>> Hi Tang, >>> On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote: >>> >>> 1. IIUC, there is a button on machine which supports hot-remove memory, >>> then what's the difference between press button and echo to /sys? >> >> No important difference, I think. Since I don't have the machine you are >> saying, I cannot surely answer you. :) >> AFAIK, pressing the button means trigger the hotplug from hardware, sysfs >> is just another entrance. At last, they will run into the same code. >> >>> 2. Since kernel memory is linear mapping(I mean direct mapping part), >>> why can't put kernel direct mapping memory into one memory device, and >>> other memory into the other devices? >> >> We cannot do that because in that way, we will lose NUMA performance. >> >> If you know NUMA, you will understand the following example: >> >> node0:node1: >> cpu0~cpu15cpu16~cpu31 >> memory0~memory511 memory512~memory1023 >> >> cpu16~cpu31 access memory16~memory1023 much faster than >> memory0~memory511. >> If we set direct mapping area in node0, and movable area in node1, then >> the kernel code running on cpu16~cpu31 will have to access >> memory0~memory511. >> This is a terrible performance down. > > So if config NUMA, kernel memory will not be linear mapping anymore? For > example, > > Node 0 Node 1 > > 0 ~ 10G 11G~14G > > kernel memory only at Node 0? Can part of kernel memory also at Node 1? > > How big is kernel direct mapping memory in x86_64? Is there max limit? Max kernel direct mapping memory in x86_64 is 64TB. >>> >>> For example, I have 8G memory, all of them will be direct mapping for >>> kernel? 
then userspace memory allocated from where? >> >> Direct mapping memory means you can use __va() and __pa(), but not means that >> them >> can be only used by kernel, them can be used by user-space too, as long as >> them are free. > > IIUC, the benefit of __va() and __pa() is just for quick get > virtual/physical address, it takes advantage of linear mapping. But mmu > still need to go through pgd/pud/pmd/pte, correct? Yes.
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Simon,

On 02/01/2013 10:17 AM, Simon Jeons wrote:
>> For example:
>>
>>                             64TB, what ever     xxxTB, what ever
>> logic address space:     |_____kernel_____|_______user_______|
>>                               \  \  /  /
>>                                \  /\  /
>> physical address space:  |______\/__\/_____|   4GB or 8GB, what ever
>>                                   *
>
> How much address space user process can have on x86_64? Also 8GB?

Usually, we don't say that.

8GB is your physical memory, right?
But kernel space and user space are logical concepts in the OS. They live
in the logical address space.

So both the kernel space and the user space can use all the physical memory.
But if a page is already in use by either of them, the other one
cannot use it.
For example, if some pages are direct mapped to the kernel and are in use
by the kernel, user space cannot map them.

>>
>> The * part physical is mapped to user space in the process' own
>> pagetable.
>> It is also direct mapped in kernel's pagetable. So the kernel can also
>> access it. :)
>
> But how to protect user process not modify kernel memory?

This is the job of the CPU. On Intel CPUs, user space code runs in
level 3, and kernel space code runs in level 0. So code in level 3
cannot access the data segment in level 0.

Thanks. :)
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Tang, On Fri, 2013-02-01 at 09:57 +0800, Tang Chen wrote: > On 02/01/2013 09:36 AM, Simon Jeons wrote: > > On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote: > >>> > >>> So if config NUMA, kernel memory will not be linear mapping anymore? For > >>> example, > >>> > >>> Node 0 Node 1 > >>> > >>> 0 ~ 10G 11G~14G > > It has nothing to do with linear mapping, I think. > > >>> > >>> kernel memory only at Node 0? Can part of kernel memory also at Node 1? > > Please refer to find_zone_movable_pfns_for_nodes(). I see, thanks. :) > The kernel is not only on node0. It uses all the online nodes evenly. :) > > >>> > >>> How big is kernel direct mapping memory in x86_64? Is there max limit? > >> > >> > >> Max kernel direct mapping memory in x86_64 is 64TB. > > > > For example, I have 8G memory, all of them will be direct mapping for > > kernel? then userspace memory allocated from where? > > I think you misunderstood what Wu tried to say. :) > > The kernel mapped that large space, it doesn't mean it is using that > large space. > The mapping is to make kernel be able to access all the memory, not for > the kernel > to use only. User space can also use the memory, but each process has > its own mapping. > > For example: > > 64TB, what ever > xxxTB, what ever > logic address space: |_kernel___|_user_| > \ \ / / > \ /\ / > physical address space: |___\/__\/_| 4GB or > 8GB, what ever >* How much address space user process can have on x86_64? Also 8GB? > > The * part physical is mapped to user space in the process' own > pagetable. > It is also direct mapped in kernel's pagetable. So the kernel can also > access it. :) But how to protect user process not modify kernel memory? > > > > >> > >>> It seems that only around 896MB on x86_32. > >>> > > We need firmware take part in, such as SRAT in ACPI BIOS, or the firmware > based memory migration mentioned by Liu Jiang. > >>> > >>> Is there any material about firmware based memory migration? 
> > No, I don't have any because this is a functionality of machine from HUAWEI. > I think you can ask Liu Jiang or Wu Jianguo to share some with you. :) > > Thanks. :) ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: kernel/kgdb.c: fix memory leakage
On 01/14/2013 11:26 AM, Cong Ding wrote:
> the variable backup_current_thread_info isn't freed before existing the
> function.
>
> Signed-off-by: Cong Ding
> ---
>  arch/powerpc/kernel/kgdb.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kernel/kgdb.c b/arch/powerpc/kernel/kgdb.c
> index 8747447..5ca82cd 100644
> --- a/arch/powerpc/kernel/kgdb.c
> +++ b/arch/powerpc/kernel/kgdb.c
> @@ -154,12 +154,12 @@ static int kgdb_handle_breakpoint(struct pt_regs *regs)
>  static int kgdb_singlestep(struct pt_regs *regs)
>  {
>  	struct thread_info *thread_info, *exception_thread_info;
> -	struct thread_info *backup_current_thread_info = \
> -		(struct thread_info *)kmalloc(sizeof(struct thread_info), GFP_KERNEL);
> +	struct thread_info *backup_current_thread_info;

Woh...

This is definitely wrong.  You have found a problem for sure, but this
is not the right way to fix it.  It is not a good idea to kmalloc while
single stepping because you can hang the kernel if you single step any
operation in kmalloc().

I am in the process of going through all the kgdb mails from the last
few months while I had been away from the project, so I didn't catch
this one and I see it has upstream commit (fefd9e6f8).

I'll submit another patch to fix this the right way and use a static
variable.  This is ok to use a static variable here because this is not
something we can recursively call at a single CPU level.  If Ben prefers
we not burn the memory unless kgdb is active we can kmalloc / kfree the
space we need at the time that kgdb is initialized.  Else we can go with
this patch you see below.  We'll see what Ben desires.

-----
diff --git a/arch/powerpc/kernel/kgdb.c b/arch/powerpc/kernel/kgdb.c
index a7bc752..bb12c8b 100644
--- a/arch/powerpc/kernel/kgdb.c
+++ b/arch/powerpc/kernel/kgdb.c
@@ -151,15 +151,16 @@ static int kgdb_handle_breakpoint(struct pt_regs *regs)
 	return 1;
 }
 
+static struct thread_info kgdb_backup_thread_info[NR_CPUS];
+
 static int kgdb_singlestep(struct pt_regs *regs)
 {
 	struct thread_info *thread_info, *exception_thread_info;
-	struct thread_info *backup_current_thread_info;
+	int cpu = raw_smp_processor_id();
 
 	if (user_mode(regs))
 		return 0;
 
-	backup_current_thread_info = (struct thread_info *)kmalloc(sizeof(struct thread_info), GFP_KERNEL);
 	/*
 	 * On Book E and perhaps other processors, singlestep is handled on
 	 * the critical exception stack.  This causes current_thread_info()
@@ -175,7 +176,7 @@ static int kgdb_singlestep(struct pt_regs *regs)
 
 	if (thread_info != exception_thread_info) {
 		/* Save the original current_thread_info. */
-		memcpy(backup_current_thread_info, exception_thread_info, sizeof *thread_info);
+		memcpy(&kgdb_backup_thread_info[cpu], exception_thread_info, sizeof *thread_info);
 		memcpy(exception_thread_info, thread_info, sizeof *thread_info);
 	}
 
@@ -183,9 +184,8 @@ static int kgdb_singlestep(struct pt_regs *regs)
 
 	if (thread_info != exception_thread_info)
 		/* Restore current_thread_info lastly. */
-		memcpy(exception_thread_info, backup_current_thread_info, sizeof *thread_info);
+		memcpy(exception_thread_info, &kgdb_backup_thread_info[cpu], sizeof *thread_info);
 
-	kfree(backup_current_thread_info);
 	return 1;
 }
 
-----

Thanks,
Jason.

>
>  	if (user_mode(regs))
>  		return 0;
>
> +	backup_current_thread_info = (struct thread_info *)kmalloc(sizeof(struct thread_info), GFP_KERNEL);
>  	/*
>  	 * On Book E and perhaps other processors, singlestep is handled on
>  	 * the critical exception stack.  This causes current_thread_info()
> @@ -185,6 +185,7 @@ static int kgdb_singlestep(struct pt_regs *regs)
>  		/* Restore current_thread_info lastly. */
>  		memcpy(exception_thread_info, backup_current_thread_info, sizeof *thread_info);
>
> +	kfree(backup_current_thread_info);
>  	return 1;
>  }
>
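The pattern in Jason's patch, a statically allocated backup slot per CPU instead of a kmalloc() on every single-step, can be shown outside the kernel with a small user-space model. This is a sketch only: `demo_thread_info`, `DEMO_NR_CPUS`, `demo_enter()` and `demo_exit()` are made-up stand-ins for the kernel's `struct thread_info`, `NR_CPUS` and the two memcpy() sites in kgdb_singlestep().

```c
#include <assert.h>
#include <string.h>

#define DEMO_NR_CPUS 4  /* stand-in for NR_CPUS */

/* Tiny stand-in for struct thread_info. */
struct demo_thread_info {
	int cpu;
	unsigned long flags;
};

/* One statically allocated backup slot per CPU: no allocation on the hot path. */
static struct demo_thread_info demo_backup[DEMO_NR_CPUS];

/* Save the exception stack's thread_info, then overwrite it with the real one. */
static void demo_enter(int cpu, struct demo_thread_info *exception_ti,
		       const struct demo_thread_info *ti)
{
	memcpy(&demo_backup[cpu], exception_ti, sizeof(*exception_ti));
	memcpy(exception_ti, ti, sizeof(*ti));
}

/* Restore the saved copy; there is nothing to free afterwards. */
static void demo_exit(int cpu, struct demo_thread_info *exception_ti)
{
	memcpy(exception_ti, &demo_backup[cpu], sizeof(*exception_ti));
}
```

Indexing by CPU is what makes the static buffer safe under the condition Jason states: the handler is not re-entered on the same CPU, so each slot has at most one user at a time.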
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Tang,
On Fri, 2013-02-01 at 10:42 +0800, Tang Chen wrote:

I am confused!

> Hi Simon,
>
> On 02/01/2013 10:17 AM, Simon Jeons wrote:
> >> For example:
> >>
> >>                             64TB, what ever     xxxTB, what ever
> >> logic address space:     |_____kernel_____|_______user_______|
> >>                               \  \  /  /
> >>                                \  /\  /
> >> physical address space:  |______\/__\/_____|   4GB or 8GB, what ever
> >>                                   *
> >
> > How much address space user process can have on x86_64? Also 8GB?
>
> Usually, we don't say that.
>
> 8GB is your physical memory, right ?
> But kernel space and user space is the logic conception in OS. They are
> in logic
> address space.
>
> So both the kernel space and the user space can use all the physical memory.
> But if the page is already in use by either of them, the other one
> cannot use it.
> For example, some pages are direct mapped to kernel, and is in use by
> kernel, the
> user space cannot map it.

How can I distinguish "mapped" from "in use"? I mean, how can I confirm
that memory is used by the kernel rather than just mapped?

> >
> >>
> >> The * part physical is mapped to user space in the process' own
> >> pagetable.
> >> It is also direct mapped in kernel's pagetable. So the kernel can also
> >> access it. :)
> >
> > But how to protect user process not modify kernel memory?
>
> This is the job of CPU. On intel cpus, user space code is running in
> level 3, and
> kernel space code is running in level 0. So the code in level 3 cannot
> access the data
> segment in level 0.

1) If a user process and the kernel map the same physical memory, the user
process will get SIGSEGV during #PF if it accesses this memory. But will a
user process map the same memory that the kernel maps? Why? It can't
access it.

2) If two user processes map the same physical memory, what will happen
if one process accesses the memory?

>
> Thanks. :)
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
On Fri, 2013-02-01 at 09:32 +0800, Jianguo Wu wrote:
> On 2013/1/31 18:38, Simon Jeons wrote:
>
> > Hi Tang,
> > On Thu, 2013-01-31 at 17:44 +0800, Tang Chen wrote:
> >> Hi Simon,
> >>
> >> On 01/31/2013 04:48 PM, Simon Jeons wrote:
> >>> Hi Tang,
> >>> On Thu, 2013-01-31 at 15:10 +0800, Tang Chen wrote:
> >>>
> >>> 1. IIUC, there is a button on machines that support memory
> >>> hot-remove. What is the difference between pressing the button
> >>> and echoing to /sys?
> >>
> >> No important difference, I think. Since I don't have the machine you
> >> are describing, I cannot answer for sure. :)
> >> AFAIK, pressing the button triggers the hotplug from hardware; sysfs
> >> is just another entry point. In the end, they run into the same code.
> >>
> >>> 2. Since kernel memory is linearly mapped (I mean the direct mapping
> >>> part), why can't we put the kernel's direct-mapped memory into one
> >>> memory device, and other memory into the other devices?
> >>
> >> We cannot do that because that way we would lose NUMA performance.
> >>
> >> If you know NUMA, you will understand the following example:
> >>
> >> node0:                  node1:
> >> cpu0~cpu15              cpu16~cpu31
> >> memory0~memory511       memory512~memory1023
> >>
> >> cpu16~cpu31 access memory512~memory1023 much faster than
> >> memory0~memory511.
> >> If we put the direct mapping area in node0 and the movable area in
> >> node1, then the kernel code running on cpu16~cpu31 will have to
> >> access memory0~memory511.
> >> This is a terrible performance drop.
> >
> > So with NUMA configured, kernel memory is no longer linearly mapped?
> > For example,
> >
> > Node 0        Node 1
> >
> > 0 ~ 10G       11G~14G
> >
> > Is kernel memory only on Node 0? Can part of kernel memory also be on
> > Node 1?
> >
> > How big is the kernel direct mapping on x86_64? Is there a max limit?
>
> The max kernel direct mapping on x86_64 is 64TB.

For example, I have 8G of memory. Will all of it be direct mapped by the
kernel? Then where is userspace memory allocated from?

>
> > It seems that it is only around 896MB on x86_32.
> >
> >>
> >>> As you know, x86_64 doesn't need highmem; IIUC, all kernel memory
> >>> will be linearly mapped in this case. Is my idea workable? If that
> >>> is correct, x86_32 can't be implemented in the same way, since
> >>> highmem (kmap/kmap_atomic/vmalloc) can map any address, so it's
> >>> hard to concentrate kernel memory on a single memory device.
> >>
> >> Sorry, I'm not very familiar with x86_32 boxes.
> >>
> >>> 3. In the current implementation, does memory hotplug need only the
> >>> memory subsystem and ACPI code? Or does firmware also need to take
> >>> part? I hope you can explain in detail, thanks in advance. :)
> >>
> >> We need firmware to take part, such as SRAT in the ACPI BIOS, or the
> >> firmware based memory migration mentioned by Liu Jiang.
> >
> > Is there any material about firmware based memory migration?
> >
> >>
> >> So far, I only know this. :)
> >>
> >>> 4. What's the status of memory hotplug? Apart from not being able to
> >>> remove kernel memory, is everything else fully implemented?
> >>
> >> I think the main job is done for now. And there are still bugs to fix.
> >> And this functionality is not stable.
> >>
> >> Thanks. :)
> >
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majord...@kvack.org. For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> >
>
>

___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
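The direct-mapping question above comes down to simple address arithmetic. The following is a minimal userspace sketch of it, not kernel code. It assumes the x86_64 layout documented for kernels of this era (direct mapping starting at 0xffff880000000000 and covering 64TB, as stated in the reply above); the names DIRECT_MAP_BASE, DIRECT_MAP_SIZE and the helper functions are hypothetical stand-ins for the kernel's PAGE_OFFSET, __va() and __pa().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for x86_64 kernels of this era; verify against your tree. */
#define DIRECT_MAP_BASE 0xffff880000000000ULL  /* start of the direct mapping */
#define DIRECT_MAP_SIZE (64ULL << 40)          /* 64TB limit mentioned above  */

/* Return 1 if a physical address fits inside the direct mapping window. */
static int phys_is_direct_mapped(uint64_t phys)
{
    return phys < DIRECT_MAP_SIZE;
}

/* Hypothetical analogue of __va(): a fixed linear offset, nothing more. */
static uint64_t phys_to_direct_virt(uint64_t phys)
{
    assert(phys_is_direct_mapped(phys));
    return DIRECT_MAP_BASE + phys;
}

/* Hypothetical analogue of __pa() for addresses inside the direct mapping. */
static uint64_t direct_virt_to_phys(uint64_t virt)
{
    assert(virt >= DIRECT_MAP_BASE && virt - DIRECT_MAP_BASE < DIRECT_MAP_SIZE);
    return virt - DIRECT_MAP_BASE;
}
```

Under these assumptions, a machine with 8G of RAM has every physical page inside the window, so all of it has a direct-map address. Having such an address only means the kernel can reach the page; it says nothing about who currently owns it.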
Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
Hi Simon,

On 02/01/2013 11:06 AM, Simon Jeons wrote:
> How can we distinguish map and use? I mean, how can we confirm memory
> is used by the kernel rather than just mapped?

If the page is free, for example, it is in the buddy system, it is not
in use. Even if it is direct mapped by the kernel, the kernel logic
should not access it because you didn't allocate it. This is the
kernel's logic. Of course the hardware and the user will not know this.

If you want to access some memory, you should first have a logical
address, right? So how can you get a logical address? You call an alloc
API.

For example, when you are coding, of course you write:

    p = alloc_xxx();    allocate memory; now it is in use, and
                        alloc_xxx() makes the kernel know it
    *p = ..             use the memory

You won't write:

    p = 0x8745;         if so, the kernel doesn't know it is in use
    *p = ..             wrong...

right?

The kernel having mapped a page doesn't mean it is using the page. You
should allocate it. That is just the kernel's allocating logic.

Well, I think I can only give you this answer now. If you want something
deeper, I think you need to read how the kernel manages the physical
pages. :)

> 1) If a user process and the kernel map the same physical memory, the
> user process will get SIGSEGV during #PF if it accesses this memory.
> But will a user process map the same memory that the kernel maps? Why?
> It can't access it.

When you call malloc() to allocate memory in user space, the OS logic
will ensure that you won't map a page that is already in use by the
kernel.

If a page is mapped by the kernel, but not used by the kernel (not
allocated, like above), malloc() could allocate it and map it to user
space. This is the situation you are talking about, right?

Now it is mapped by both kernel and user, but it is only allocated by
the user. So the kernel will not use it. When the kernel wants some
memory, it will allocate some other memory. This is just the kernel
logic. This is what the memory management subsystem does.

I think I cannot answer more because I'm also a student in memory
management. This is just my understanding. And I hope it is helpful. :)

> 2) If two user processes map the same physical memory, what will
> happen if one process accesses the memory?

Obviously you don't need to worry about this situation. We can swap out
the page used by process 1, and process 2 can use the same page. When
process 1 wants to access it again, we swap it in. This only happens
when there is not enough physical memory. :)

And also, if you are using shared memory in user space, like shmget(),
shmat().., it is shared memory, and both processes can use it at the
same time.

Thanks. :)
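The difference between "mapped" and "in use" described above can be pictured with a toy page allocator. This is a minimal userspace sketch, not how the buddy system is implemented: every page of the arena is always addressable (the analogue of the direct mapping), but a page only counts as used after the allocator hands it out. All names (toy_arena, toy_alloc_page, and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_PAGES  8
#define TOY_PAGE_SIZE 4096

/* Ownership is tracked separately from addressability. */
enum toy_owner { TOY_FREE = 0, TOY_KERNEL, TOY_USER };

/* The whole arena is always addressable, like the kernel direct mapping. */
static unsigned char toy_arena[TOY_NR_PAGES][TOY_PAGE_SIZE];
static enum toy_owner toy_owner_of[TOY_NR_PAGES];

/* Hand the first free page to 'who'; return NULL if none is left. */
static void *toy_alloc_page(enum toy_owner who)
{
    assert(who != TOY_FREE);
    for (size_t i = 0; i < TOY_NR_PAGES; i++) {
        if (toy_owner_of[i] == TOY_FREE) {
            toy_owner_of[i] = who;   /* only now is the page "in use" */
            return toy_arena[i];
        }
    }
    return NULL;
}

/* Give a page back; it stays addressable but is no longer owned. */
static void toy_free_page(void *page)
{
    size_t i = (size_t)((unsigned char *)page - &toy_arena[0][0]) / TOY_PAGE_SIZE;
    assert(i < TOY_NR_PAGES && toy_owner_of[i] != TOY_FREE);
    toy_owner_of[i] = TOY_FREE;
}

/* Count pages that some owner has actually allocated. */
static size_t toy_pages_in_use(void)
{
    size_t n = 0;
    for (size_t i = 0; i < TOY_NR_PAGES; i++)
        n += (toy_owner_of[i] != TOY_FREE);
    return n;
}
```

In this sketch a page freed by the "kernel" owner can be handed to the "user" owner on the next allocation, which mirrors the point that a page mapped by the kernel but not allocated by it is still available to user space.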
Re: [RFC PATCH v2 01/12] Add sys_hotplug.h for system device hotplug framework
On Thu, Jan 31, 2013 at 09:54:51PM +0100, Rafael J. Wysocki wrote:
> > > But, again, I'm going to ask why you aren't using the existing cpu /
> > > memory / bridge / node devices that we have in the kernel. Please use
> > > them, or give me a _really_ good reason why they will not work.
> >
> > We cannot use the existing system devices or ACPI devices here. During
> > hot-plug, the ACPI handler sets this shp_device info, so that cpu and
> > memory handlers (drivers/cpu.c and mm/memory_hotplug.c) can obtain
> > their target device information in a platform-neutral way. During
> > hot-add, we first create an ACPI device node (i.e. device under
> > /sys/bus/acpi/devices), but platform-neutral modules cannot use them
> > as they are ACPI-specific.
>
> But suppose we're smart and have ACPI scan handlers that will create
> "physical" device nodes for those devices during the ACPI namespace scan.
> Then, the platform-neutral nodes will be able to bind to those "physical"
> nodes. Moreover, it should be possible to get a hierarchy of device objects
> this way that will reflect all of the dependencies we need to take into
> account during hot-add and hot-remove operations. That may not be what we
> have today, but I don't see any *fundamental* obstacles preventing us from
> using this approach.

I would _much_ rather see that be the solution here as I think it is the
proper one.

> This is already done for PCI host bridges and platform devices and I don't
> see why we can't do that for the other types of devices too.

I agree.

> The only missing piece I see is a way to handle the "eject" problem, i.e.
> when we try to eject a device at the top of a subtree and need to tear down
> the entire subtree below it, but if that's going to lead to a system crash,
> for example, we want to cancel the eject. It seems to me that we'll need some
> help from the driver core here.

I say do what we always have done here: if the user asked us to tear
something down, let it happen, as they are the ones that know best :)

Seriously, I guess this gets back to the "fail disconnect" idea that the
ACPI developers keep harping on. I thought we already resolved this
properly by having them implement it in their bus code; no reason the
same thing couldn't happen here, right?

I don't think the core needs to do anything special, but if so, I'll be
glad to review it.

thanks,

greg k-h
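The cancellable eject that Rafael describes can be pictured as a two-phase walk over the device subtree: first ask every device whether it can go away, and tear down only if nobody objects. The following is a minimal sketch under that assumption; struct toy_dev and its fields are hypothetical and do not correspond to any driver core or ACPI API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device node; not the kernel's struct device. */
struct toy_dev {
    const char *name;
    int busy;                 /* nonzero: removing this device is unsafe */
    int removed;              /* set once the device has been torn down  */
    struct toy_dev *child;    /* first child                             */
    struct toy_dev *sibling;  /* next sibling                            */
};

/* Phase 1: return 1 only if dev and everything below it may be removed. */
static int toy_can_eject(const struct toy_dev *dev)
{
    if (dev->busy)
        return 0;
    for (const struct toy_dev *c = dev->child; c; c = c->sibling)
        if (!toy_can_eject(c))
            return 0;
    return 1;
}

/* Phase 2: tear down children before their parent. */
static void toy_remove_subtree(struct toy_dev *dev)
{
    for (struct toy_dev *c = dev->child; c; c = c->sibling)
        toy_remove_subtree(c);
    dev->removed = 1;
}

/* Return 0 on success, -1 if the eject was cancelled with nothing touched. */
static int toy_eject(struct toy_dev *top)
{
    assert(top != NULL);
    if (!toy_can_eject(top))
        return -1;
    toy_remove_subtree(top);
    return 0;
}
```

A real implementation would also have to stop devices from becoming busy between the two phases, which this sketch ignores. Whether that check lives in the driver core or in bus code, as suggested above, is exactly the point under discussion.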
Re: [RFC PATCH v2 01/12] Add sys_hotplug.h for system device hotplug framework
On Thu, Jan 31, 2013 at 06:32:18PM -0700, Toshi Kani wrote:
> > This is already done for PCI host bridges and platform devices and I
> > don't see why we can't do that for the other types of devices too.
> >
> > The only missing piece I see is a way to handle the "eject" problem,
> > i.e. when we try to eject a device at the top of a subtree and need to
> > tear down the entire subtree below it, but if that's going to lead to
> > a system crash, for example, we want to cancel the eject. It seems to
> > me that we'll need some help from the driver core here.
>
> There are three different approaches suggested for system device
> hot-plug:
> A. Proceed within system device bus scan.
> B. Proceed within ACPI bus scan.
> C. Proceed with a sequence (as a mini-boot).
>
> Option A uses system devices as tokens, option B uses acpi devices as
> tokens, and option C uses resource tables as tokens, for their handlers.
>
> Here is a summary of key questions & answers so far. I hope this
> clarifies why I am suggesting option C.
>
> 1. What are the system devices?
> System devices provide system-wide core computing resources, which are
> essential to compose a computer system. System devices are not
> connected to any particular standard buses.

Not a problem, lots of devices are not connected to any "particular
standard buses". All this means is that system devices are connected
to the "system" bus, nothing more.

> 2. Why are the system devices special?
> The system devices are initialized during early boot-time, by multiple
> subsystems, from the boot-up sequence, in pre-defined order. They
> provide low-level services to enable other subsystems to come up.

Sorry, no, that doesn't mean they are special; nothing here is unique
from the point of view of the driver model compared to any other device
or bus.

> 3. Why can't we initialize the system devices from the driver structure
> at boot?
> The driver structure is initialized at the end of the boot sequence and
> requires the low-level services from the system devices initialized
> beforehand.

Wait, what "driver structure"? If you need to initialize the driver
core earlier, then do so. Or, even better, just wait until enough of
the system has come up and then go initialize all of the devices you
have found so far as part of your boot process.

None of the above things you have stated seem to have anything to do
with your proposed patch, so I don't understand why you have mentioned
them...

> 4. Why do we need a new common framework?
> Sysfs CPU and memory on-lining/off-lining are performed within the CPU
> and memory modules. They are common code and do not depend on ACPI.
> Therefore, a new common framework is necessary to integrate both
> on-lining/off-lining operation and hot-plugging operation of system
> devices into a single framework.

{sigh}

Removing and adding devices and handling hotplug operations is what the
driver core was written for, almost 10 years ago. To somehow think that
your devices are "special" just because they don't use ACPI is odd,
because the driver core itself has nothing to do with ACPI. Don't get
the current mix of x86 system code tied into ACPI confused with any
driver core issues here, please.

> 5. Why can't we do everything with ACPI bus scan?
> Software dependency among system devices may not be dictated by the ACPI
> hierarchy. For instance, memory should be initialized before CPUs (i.e.
> a new cpu may need its local memory), but such ordering cannot be
> guaranteed by the ACPI hierarchy. Also, as described in 4,
> online/offline operations are independent from ACPI.

That's fine, the driver core is independent from ACPI. I don't care how
you do the scanning of your devices, but I do care about you creating
new driver core pieces that duplicate the existing functionality of
what we have today.

In short, I like Rafael's proposal better, and I fail to see how
anything you have stated here would matter in how this is implemented.
:)

thanks,

greg k-h
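The ordering constraint in point 5 (memory before CPUs on hot-add, regardless of where the devices sit in the ACPI hierarchy) can be pictured as handlers carrying an explicit order value. This is only a sketch of that idea, with hypothetical names (struct toy_handler, add_order); it is not the proposed framework's code and not a driver core interface.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical hot-plug handler descriptor. */
struct toy_handler {
    const char *name;
    int add_order;   /* lower values run earlier on hot-add */
};

/* qsort() comparator: ascending add_order, written to avoid overflow. */
static int toy_cmp_add(const void *a, const void *b)
{
    const struct toy_handler *x = a;
    const struct toy_handler *y = b;
    return (x->add_order > y->add_order) - (x->add_order < y->add_order);
}

/* Put handlers into hot-add order; hot-remove would walk them in reverse. */
static void toy_sort_for_add(struct toy_handler *h, size_t n)
{
    assert(h != NULL);
    qsort(h, n, sizeof(*h), toy_cmp_add);
}
```

With memory given a lower order value than cpu, memory is always brought up first no matter how the handlers were registered. The same ordering could in principle be expressed through a parent/child device hierarchy instead, which is the alternative argued for in the reply above.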