Re: [PATCH] mm/x86: Patch out arch_flush_lazy_mmu_mode() when running on bare metal
----- jwbo...@redhat.com wrote:

> On Wed, Mar 13, 2013 at 09:25:44AM -0400, Boris Ostrovsky wrote:
> > On 03/01/2013 07:14 AM, Josh Boyer wrote:
> > > On Thu, Feb 28, 2013 at 04:52:20PM -0800, H. Peter Anvin wrote:
> > > > On 02/28/2013 04:42 PM, Josh Boyer wrote:
> > > > > On Fri, Mar 01, 2013 at 01:36:29AM +0100, Borislav Petkov wrote:
> > > > > > On Thu, Feb 28, 2013 at 04:15:45PM -0800, H. Peter Anvin wrote:
> > > > > > > I'll try to get someone to test this tomorrow.
> > > > > >
> > > > > > Btw, you'd need to apply that other patch too
> > > > > >
> > > > > > http://marc.info/?l=xen-devel&m=136206183814547&w=2
> > > > > >
> > > > > > so that arch_flush_lazy_mmu_mode() has at least one caller on
> > > > > > x86_64.
> > > > >
> > > > > Yeah, we already have that applied. It stops crashes in Xen
> > > > > environments so we pulled it in as a bugfix. Thanks though!
> > > >
> > > > Who are "we"?
> > >
> > > Sorry, Fedora. That patch has a link to a bug in it. We applied the
> > > patch for that bug. I'll apply Boris' patch on top and get the same
> > > people to test it.
> >
> > Josh, have you had a chance to test this?
>
> I've tested it on bare metal for a while now. No problems noticed at
> all. I've not heard back from Krishna who was testing it in the Xen
> environment.

Krishna? Any updates?

Thanks.
-boris
Re: [PATCH 2/2] mm/x86: Patch out arch_flush_lazy_mmu_mode() when running on bare metal
On 03/23/2013 09:36 AM, Konrad Rzeszutek Wilk wrote:

From: Boris Ostrovsky

Invoking arch_flush_lazy_mmu_mode() results in calls to
preempt_enable()/disable() which may have a performance impact.

Since lazy MMU is not used on bare metal we can patch away
arch_flush_lazy_mmu_mode() so that it is never called in such an
environment.

Signed-off-by: Boris Ostrovsky
Tested-by: Josh Boyer
Tested-by: Konrad Rzeszutek Wilk
Acked-by: Borislav Petkov
Signed-off-by: Konrad Rzeszutek Wilk

Peter, what's the status of these two patches? They are not going into
3.9, right?

Thanks.
-boris

---
 arch/x86/include/asm/paravirt.h       |  5 -
 arch/x86/include/asm/paravirt_types.h |  2 ++
 arch/x86/kernel/paravirt.c            | 25 +
 arch/x86/lguest/boot.c                |  1 +
 arch/x86/xen/mmu.c                    |  1 +
 5 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 5edd174..7361e47 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -703,7 +703,10 @@ static inline void arch_leave_lazy_mmu_mode(void)
 	PVOP_VCALL0(pv_mmu_ops.lazy_mode.leave);
 }
 
-void arch_flush_lazy_mmu_mode(void);
+static inline void arch_flush_lazy_mmu_mode(void)
+{
+	PVOP_VCALL0(pv_mmu_ops.lazy_mode.flush);
+}
 
 static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 				phys_addr_t phys, pgprot_t flags)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 142236e..b3b0ec1 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -91,6 +91,7 @@ struct pv_lazy_ops {
 	/* Set deferred update mode, used for batching operations. */
 	void (*enter)(void);
 	void (*leave)(void);
+	void (*flush)(void);
 };
 
 struct pv_time_ops {
@@ -679,6 +680,7 @@ void paravirt_end_context_switch(struct task_struct *next);
 
 void paravirt_enter_lazy_mmu(void);
 void paravirt_leave_lazy_mmu(void);
+void paravirt_flush_lazy_mmu(void);
 
 void _paravirt_nop(void);
 u32 _paravirt_ident_32(u32);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 17fff18..8bfb335 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -263,6 +263,18 @@ void paravirt_leave_lazy_mmu(void)
 	leave_lazy(PARAVIRT_LAZY_MMU);
 }
 
+void paravirt_flush_lazy_mmu(void)
+{
+	preempt_disable();
+
+	if (paravirt_get_lazy_mode() == PARAVIRT_LAZY_MMU) {
+		arch_leave_lazy_mmu_mode();
+		arch_enter_lazy_mmu_mode();
+	}
+
+	preempt_enable();
+}
+
 void paravirt_start_context_switch(struct task_struct *prev)
 {
 	BUG_ON(preemptible());
@@ -292,18 +304,6 @@ enum paravirt_lazy_mode paravirt_get_lazy_mode(void)
 	return this_cpu_read(paravirt_lazy_mode);
 }
 
-void arch_flush_lazy_mmu_mode(void)
-{
-	preempt_disable();
-
-	if (paravirt_get_lazy_mode() == PARAVIRT_LAZY_MMU) {
-		arch_leave_lazy_mmu_mode();
-		arch_enter_lazy_mmu_mode();
-	}
-
-	preempt_enable();
-}
-
 struct pv_info pv_info = {
 	.name = "bare hardware",
 	.paravirt_enabled = 0,
@@ -475,6 +475,7 @@ struct pv_mmu_ops pv_mmu_ops = {
 	.lazy_mode = {
 		.enter = paravirt_nop,
 		.leave = paravirt_nop,
+		.flush = paravirt_nop,
 	},
 
 	.set_fixmap = native_set_fixmap,
diff --git a/arch/x86/lguest/boot.c b/arch/x86/lguest/boot.c
index 1cbd89c..7114c63 100644
--- a/arch/x86/lguest/boot.c
+++ b/arch/x86/lguest/boot.c
@@ -1334,6 +1334,7 @@ __init void lguest_init(void)
 	pv_mmu_ops.read_cr3 = lguest_read_cr3;
 	pv_mmu_ops.lazy_mode.enter = paravirt_enter_lazy_mmu;
 	pv_mmu_ops.lazy_mode.leave = lguest_leave_lazy_mmu_mode;
+	pv_mmu_ops.lazy_mode.flush = paravirt_flush_lazy_mmu;
 	pv_mmu_ops.pte_update = lguest_pte_update;
 	pv_mmu_ops.pte_update_defer = lguest_pte_update;
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index e8e3493..f4f4105 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -2197,6 +2197,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
 	.lazy_mode = {
 		.enter = paravirt_enter_lazy_mmu,
 		.leave = xen_leave_lazy_mmu,
+		.flush = paravirt_flush_lazy_mmu,
 	},
 
 	.set_fixmap = xen_set_fixmap,
[PATCH 0/2] Enable WC+ memory type on AMD family 10h processors
From: Boris Ostrovsky

Enable WC+ memory type on AMD family 10h processors if the BIOS doesn't
do this. WC+ is used only in virtualized scenarios and never in bare
metal cases (see AMD APM v2).

Also clean up init_amd() a little.

Boris Ostrovsky (2):
  AMD,x86: Clean up init_amd()
  x86,AMD: Enable WC+ memory type on family 10 processors

 arch/x86/include/uapi/asm/msr-index.h |    1 +
 arch/x86/kernel/cpu/amd.c             |   49 ++---
 2 files changed, 28 insertions(+), 22 deletions(-)
[PATCH 1/2] AMD,x86: Clean up init_amd()
From: Boris Ostrovsky

Clean up multiple declarations of the variable used for rd/wrmsr.

Signed-off-by: Boris Ostrovsky
---
 arch/x86/kernel/cpu/amd.c | 29 -
 1 files changed, 12 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 15239ff..dd4a5b6 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -518,10 +518,9 @@ static void __cpuinit early_init_amd(struct cpuinfo_x86 *c)
 static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 {
 	u32 dummy;
-
-#ifdef CONFIG_SMP
 	unsigned long long value;
 
+#ifdef CONFIG_SMP
 	/*
 	 * Disable TLB flush filter by setting HWCR.FFDIS on K8
 	 * bit 6 of msr C001_0015
@@ -559,12 +558,10 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 	 * (AMD Erratum #110, docId: 25759).
 	 */
 	if (c->x86_model < 0x14 && cpu_has(c, X86_FEATURE_LAHF_LM)) {
-		u64 val;
-
 		clear_cpu_cap(c, X86_FEATURE_LAHF_LM);
-		if (!rdmsrl_amd_safe(0xc001100d, &val)) {
-			val &= ~(1ULL << 32);
-			wrmsrl_amd_safe(0xc001100d, val);
+		if (!rdmsrl_amd_safe(0xc001100d, &value)) {
+			value &= ~(1ULL << 32);
+			wrmsrl_amd_safe(0xc001100d, value);
 		}
 	}
 
@@ -617,13 +614,12 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 	if ((c->x86 == 0x15) &&
 	    (c->x86_model >= 0x10) && (c->x86_model <= 0x1f) &&
 	    !cpu_has(c, X86_FEATURE_TOPOEXT)) {
-		u64 val;
-
-		if (!rdmsrl_safe(0xc0011005, &val)) {
-			val |= 1ULL << 54;
-			wrmsrl_safe(0xc0011005, val);
-			rdmsrl(0xc0011005, val);
-			if (val & (1ULL << 54)) {
+		if (!rdmsrl_safe(0xc0011005, &value)) {
+			value |= 1ULL << 54;
+			wrmsrl_safe(0xc0011005, value);
+			rdmsrl(0xc0011005, value);
+			if (value & (1ULL << 54)) {
 				set_cpu_cap(c, X86_FEATURE_TOPOEXT);
 				printk(KERN_INFO FW_INFO "CPU: Re-enabling "
 				       "disabled Topology Extensions Support\n");
@@ -637,11 +633,10 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 	 */
 	if ((c->x86 == 0x15) &&
 	    (c->x86_model >= 0x02) && (c->x86_model < 0x20)) {
-		u64 val;
-
-		if (!rdmsrl_safe(0xc0011021, &val) && !(val & 0x1E)) {
-			val |= 0x1E;
-			wrmsrl_safe(0xc0011021, val);
+		if (!rdmsrl_safe(0xc0011021, &value) && !(value & 0x1E)) {
+			value |= 0x1E;
+			wrmsrl_safe(0xc0011021, value);
 		}
 	}
-- 
1.7.1
[PATCH 2/2] x86,AMD: Enable WC+ memory type on family 10 processors
From: Boris Ostrovsky

In some cases the BIOS may not enable WC+ memory type on family 10h
processors, instead converting what would be WC+ memory to CD type.
On guests using nested pages this could result in performance
degradation. This patch enables WC+.

Signed-off-by: Boris Ostrovsky
---
 arch/x86/include/uapi/asm/msr-index.h |    1 +
 arch/x86/kernel/cpu/amd.c             |   21 -
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index 433a59f..158cde9 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -173,6 +173,7 @@
 #define MSR_AMD64_OSVW_ID_LENGTH	0xc0010140
 #define MSR_AMD64_OSVW_STATUS		0xc0010141
 #define MSR_AMD64_DC_CFG		0xc0011022
+#define MSR_AMD64_BU_CFG2		0xc001102a
 #define MSR_AMD64_IBSFETCHCTL		0xc0011030
 #define MSR_AMD64_IBSFETCHLINAD		0xc0011031
 #define MSR_AMD64_IBSFETCHPHYSAD	0xc0011032
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index dd4a5b6..721ef32 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -698,13 +698,11 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 	if (c->x86 > 0x11)
 		set_cpu_cap(c, X86_FEATURE_ARAT);
 
-	/*
-	 * Disable GART TLB Walk Errors on Fam10h. We do this here
-	 * because this is always needed when GART is enabled, even in a
-	 * kernel which has no MCE support built in.
-	 */
 	if (c->x86 == 0x10) {
 		/*
+		 * Disable GART TLB Walk Errors on Fam10h. We do this here
+		 * because this is always needed when GART is enabled, even in a
+		 * kernel which has no MCE support built in.
		 * BIOS should disable GartTlbWlk Errors themself. If
 		 * it doesn't do it here as suggested by the BKDG.
 		 *
@@ -718,6 +716,19 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 			mask |= (1 << 10);
 			wrmsrl_safe(MSR_AMD64_MCx_MASK(4), mask);
 		}
+
+		/*
+		 * On family 10h BIOS may not have properly enabled WC+ support,
+		 * causing it to be converted to CD memtype. This may result in
+		 * performance degradation for certain nested-paging guests.
+		 * Prevent this conversion by clearing bit 24 in
+		 * MSR_AMD64_BU_CFG2.
+		 */
+		if (c->x86 == 0x10) {
+			rdmsrl(MSR_AMD64_BU_CFG2, value);
+			value &= ~(1ULL << 24);
+			wrmsrl(MSR_AMD64_BU_CFG2, value);
+		}
 	}
 
 	rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);
-- 
1.7.1
[PATCH v2 0/2] Enable WC+ memory type on AMD family 10h processors
Resending since the previous message went out with an incorrect From
header. Also fixing a redundant check in the second patch.

Enable WC+ memory type on AMD family 10h processors if the BIOS doesn't
do this. WC+ is used only in virtualized scenarios and never in bare
metal cases (see AMD APM v2).

Also clean up init_amd() a little.

Boris Ostrovsky (2):
  AMD,x86: Clean up init_amd()
  x86,AMD: Enable WC+ memory type on family 10 processors

 arch/x86/include/uapi/asm/msr-index.h |    1 +
 arch/x86/kernel/cpu/amd.c             |   49 ++---
 2 files changed, 28 insertions(+), 22 deletions(-)
[PATCH v2 1/2] AMD,x86: Clean up init_amd()
Clean up multiple declarations of the variable used for rd/wrmsr.

Signed-off-by: Boris Ostrovsky
---
 arch/x86/kernel/cpu/amd.c | 29 -
 1 files changed, 12 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 15239ff..dd4a5b6 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -518,10 +518,9 @@ static void __cpuinit early_init_amd(struct cpuinfo_x86 *c)
 static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 {
 	u32 dummy;
-
-#ifdef CONFIG_SMP
 	unsigned long long value;
 
+#ifdef CONFIG_SMP
 	/*
 	 * Disable TLB flush filter by setting HWCR.FFDIS on K8
 	 * bit 6 of msr C001_0015
@@ -559,12 +558,10 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 	 * (AMD Erratum #110, docId: 25759).
 	 */
 	if (c->x86_model < 0x14 && cpu_has(c, X86_FEATURE_LAHF_LM)) {
-		u64 val;
-
 		clear_cpu_cap(c, X86_FEATURE_LAHF_LM);
-		if (!rdmsrl_amd_safe(0xc001100d, &val)) {
-			val &= ~(1ULL << 32);
-			wrmsrl_amd_safe(0xc001100d, val);
+		if (!rdmsrl_amd_safe(0xc001100d, &value)) {
+			value &= ~(1ULL << 32);
+			wrmsrl_amd_safe(0xc001100d, value);
 		}
 	}
 
@@ -617,13 +614,12 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 	if ((c->x86 == 0x15) &&
 	    (c->x86_model >= 0x10) && (c->x86_model <= 0x1f) &&
 	    !cpu_has(c, X86_FEATURE_TOPOEXT)) {
-		u64 val;
-
-		if (!rdmsrl_safe(0xc0011005, &val)) {
-			val |= 1ULL << 54;
-			wrmsrl_safe(0xc0011005, val);
-			rdmsrl(0xc0011005, val);
-			if (val & (1ULL << 54)) {
+		if (!rdmsrl_safe(0xc0011005, &value)) {
+			value |= 1ULL << 54;
+			wrmsrl_safe(0xc0011005, value);
+			rdmsrl(0xc0011005, value);
+			if (value & (1ULL << 54)) {
 				set_cpu_cap(c, X86_FEATURE_TOPOEXT);
 				printk(KERN_INFO FW_INFO "CPU: Re-enabling "
 				       "disabled Topology Extensions Support\n");
@@ -637,11 +633,10 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 	 */
 	if ((c->x86 == 0x15) &&
 	    (c->x86_model >= 0x02) && (c->x86_model < 0x20)) {
-		u64 val;
-
-		if (!rdmsrl_safe(0xc0011021, &val) && !(val & 0x1E)) {
-			val |= 0x1E;
-			wrmsrl_safe(0xc0011021, val);
+		if (!rdmsrl_safe(0xc0011021, &value) && !(value & 0x1E)) {
+			value |= 0x1E;
+			wrmsrl_safe(0xc0011021, value);
 		}
 	}
-- 
1.7.1
[PATCH v2 2/2] x86,AMD: Enable WC+ memory type on family 10 processors
In some cases the BIOS may not enable WC+ memory type on family 10h
processors, instead converting what would be WC+ memory to CD type.
On guests using nested pages this could result in performance
degradation. This patch enables WC+.

Signed-off-by: Boris Ostrovsky
---
 arch/x86/include/uapi/asm/msr-index.h |    1 +
 arch/x86/kernel/cpu/amd.c             |   19 ++-
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index 433a59f..158cde9 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -173,6 +173,7 @@
 #define MSR_AMD64_OSVW_ID_LENGTH	0xc0010140
 #define MSR_AMD64_OSVW_STATUS		0xc0010141
 #define MSR_AMD64_DC_CFG		0xc0011022
+#define MSR_AMD64_BU_CFG2		0xc001102a
 #define MSR_AMD64_IBSFETCHCTL		0xc0011030
 #define MSR_AMD64_IBSFETCHLINAD		0xc0011031
 #define MSR_AMD64_IBSFETCHPHYSAD	0xc0011032
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index dd4a5b6..a873bcc 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -698,13 +698,11 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 	if (c->x86 > 0x11)
 		set_cpu_cap(c, X86_FEATURE_ARAT);
 
-	/*
-	 * Disable GART TLB Walk Errors on Fam10h. We do this here
-	 * because this is always needed when GART is enabled, even in a
-	 * kernel which has no MCE support built in.
-	 */
 	if (c->x86 == 0x10) {
 		/*
+		 * Disable GART TLB Walk Errors on Fam10h. We do this here
+		 * because this is always needed when GART is enabled, even in a
+		 * kernel which has no MCE support built in.
		 * BIOS should disable GartTlbWlk Errors themself. If
 		 * it doesn't do it here as suggested by the BKDG.
 		 *
@@ -718,6 +716,17 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 			mask |= (1 << 10);
 			wrmsrl_safe(MSR_AMD64_MCx_MASK(4), mask);
 		}
+
+		/*
+		 * On family 10h BIOS may not have properly enabled WC+ support,
+		 * causing it to be converted to CD memtype. This may result in
+		 * performance degradation for certain nested-paging guests.
+		 * Prevent this conversion by clearing bit 24 in
+		 * MSR_AMD64_BU_CFG2.
+		 */
+		rdmsrl(MSR_AMD64_BU_CFG2, value);
+		value &= ~(1ULL << 24);
+		wrmsrl(MSR_AMD64_BU_CFG2, value);
 	}
 
 	rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);
-- 
1.7.1
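[Editorial note: for readers who want to observe the bit this patch clears on a
running family 10h box, the following is a minimal, hypothetical userspace
sketch (not part of the patch). It reads MSR_AMD64_BU_CFG2 (0xc001102a) through
the standard msr driver and reports the state of bit 24; it assumes the 'msr'
module is loaded (modprobe msr) and root privileges.]

/* Hedged sketch, not kernel code: check BU_CFG2 bit 24 on CPU 0. */
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	uint64_t v;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0) {
		perror("open /dev/cpu/0/msr");
		return 1;
	}
	/* For the msr driver, the MSR number is the pread offset. */
	if (pread(fd, &v, sizeof(v), 0xc001102a) != sizeof(v)) {
		perror("pread BU_CFG2");
		close(fd);
		return 1;
	}
	printf("BU_CFG2 = 0x%" PRIx64 ", WC+-to-CD conversion bit 24 is %s\n",
	       v, (v & (1ULL << 24)) ? "set" : "clear");
	close(fd);
	return 0;
}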
[PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
When CONFIG_DEBUG_PAGEALLOC is set, page table updates made by
kernel_map_pages() are not made visible (via TLB flush) immediately if
lazy MMU is on. In environments that support lazy MMU (e.g. Xen) this
may lead to fatal page faults, for example, when zap_pte_range() needs
to allocate pages in __tlb_remove_page() -> tlb_next_batch().

Signed-off-by: Boris Ostrovsky
---
 arch/x86/mm/pageattr.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index ca1f1c2..7b3216e 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1369,6 +1369,8 @@ void kernel_map_pages(struct page *page, int numpages, int enable)
 	 * but that can deadlock->flush only current cpu:
 	 */
 	__flush_tlb_all();
+
+	arch_flush_lazy_mmu_mode();
 }
 
 #ifdef CONFIG_HIBERNATION
-- 
1.8.1.2
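[Editorial note: to illustrate why the extra flush matters: under lazy MMU
mode, PTE updates are queued and are only guaranteed to reach the hardware
page tables once the batch is flushed. Below is a toy, self-contained model of
that deferred-update pattern; all names are hypothetical and this is not the
kernel's implementation.]

/* Toy model of lazy-MMU batching; names are hypothetical. */
#include <stdio.h>

#define QMAX 16
static unsigned long queue[QMAX];
static int qlen, lazy;

static void enter_lazy(void) { lazy = 1; }

static void flush_lazy(void)
{
	/* Apply all queued updates; mirrors "leave + re-enter". */
	for (int i = 0; i < qlen; i++)
		printf("applying update %#lx\n", queue[i]);
	qlen = 0;
}

static void set_pte(unsigned long v)
{
	if (lazy && qlen < QMAX)
		queue[qlen++] = v;	/* deferred: not yet visible */
	else
		printf("applying update %#lx\n", v);
}

int main(void)
{
	enter_lazy();
	set_pte(0x1000);	/* queued */
	set_pte(0x2000);	/* queued */
	/* Without an explicit flush here, a user of the new mapping
	 * would fault -- the analogue of the bug being fixed. */
	flush_lazy();
	return 0;
}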
Re: [PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
----- h...@zytor.com wrote:

> On 02/26/2013 02:56 PM, Boris Ostrovsky wrote:
> > When CONFIG_DEBUG_PAGEALLOC is set, page table updates made by
> > kernel_map_pages() are not made visible (via TLB flush) immediately
> > if lazy MMU is on. In environments that support lazy MMU (e.g. Xen)
> > this may lead to fatal page faults, for example, when zap_pte_range()
> > needs to allocate pages in __tlb_remove_page() -> tlb_next_batch().
> >
> > Signed-off-by: Boris Ostrovsky
> > ---
> >  arch/x86/mm/pageattr.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> > index ca1f1c2..7b3216e 100644
> > --- a/arch/x86/mm/pageattr.c
> > +++ b/arch/x86/mm/pageattr.c
> > @@ -1369,6 +1369,8 @@ void kernel_map_pages(struct page *page, int numpages, int enable)
> >  	 * but that can deadlock->flush only current cpu:
> >  	 */
> >  	__flush_tlb_all();
> > +
> > +	arch_flush_lazy_mmu_mode();
> >  }
> >
> >  #ifdef CONFIG_HIBERNATION
>
> This sounds like a critical fix, i.e. a -stable candidate. Am I
> correct?

I considered copying stable but then I decided that this is a debugging
feature --- kernel_map_pages() is only defined if CONFIG_DEBUG_PAGEALLOC
is set, and my thinking was that stable kernels usually don't do this.

-boris
Re: Is: x86: mm: Fix vmalloc_fault oops during lazy MMU updates Was: Re: [PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
On 02/28/2013 11:10 AM, Borislav Petkov wrote:
> On Thu, Feb 28, 2013 at 07:53:44AM -0800, H. Peter Anvin wrote:
> > At the very least we should have an early filter for the **COMMON!**
> > case that we are not on a PV platform.
>
> ... or, patch it out with the alternatives on baremetal, as Steven
> suggested.

I think making a check for paravirt_enabled() is safe enough. I'll send
a patch shortly.

-boris
Re: Is: x86: mm: Fix vmalloc_fault oops during lazy MMU updates Was: Re: [PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
On 02/28/2013 11:22 AM, Borislav Petkov wrote:
> On Thu, Feb 28, 2013 at 11:20:20AM -0500, Boris Ostrovsky wrote:
> > On 02/28/2013 11:10 AM, Borislav Petkov wrote:
> > > On Thu, Feb 28, 2013 at 07:53:44AM -0800, H. Peter Anvin wrote:
> > > > At the very least we should have an early filter for the
> > > > **COMMON!** case that we are not on a PV platform.
> > >
> > > ... or, patch it out with the alternatives on baremetal, as Steven
> > > suggested.

What was the suggestion exactly? I don't remember seeing that message.

-boris

> > I think making a check for paravirt_enabled() is safe enough. I'll
> > send a patch shortly.
>
> Why not make it absolutely for free?
[PATCH] mm/x86: Patch out arch_flush_lazy_mmu_mode() when running on bare metal
Invoking arch_flush_lazy_mmu_mode() results in calls to
preempt_enable()/disable() which may have a performance impact.

Since lazy MMU is not used on bare metal we can patch away
arch_flush_lazy_mmu_mode() so that it is never called in such an
environment.

Signed-off-by: Boris Ostrovsky
---
 arch/x86/include/asm/paravirt.h       |  5 -
 arch/x86/include/asm/paravirt_types.h |  2 ++
 arch/x86/kernel/paravirt.c            | 25 +
 arch/x86/lguest/boot.c                |  1 +
 arch/x86/xen/mmu.c                    |  1 +
 5 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 5edd174..7361e47 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -703,7 +703,10 @@ static inline void arch_leave_lazy_mmu_mode(void)
 	PVOP_VCALL0(pv_mmu_ops.lazy_mode.leave);
 }
 
-void arch_flush_lazy_mmu_mode(void);
+static inline void arch_flush_lazy_mmu_mode(void)
+{
+	PVOP_VCALL0(pv_mmu_ops.lazy_mode.flush);
+}
 
 static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 				phys_addr_t phys, pgprot_t flags)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 142236e..b3b0ec1 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -91,6 +91,7 @@ struct pv_lazy_ops {
 	/* Set deferred update mode, used for batching operations. */
 	void (*enter)(void);
 	void (*leave)(void);
+	void (*flush)(void);
 };
 
 struct pv_time_ops {
@@ -679,6 +680,7 @@ void paravirt_end_context_switch(struct task_struct *next);
 
 void paravirt_enter_lazy_mmu(void);
 void paravirt_leave_lazy_mmu(void);
+void paravirt_flush_lazy_mmu(void);
 
 void _paravirt_nop(void);
 u32 _paravirt_ident_32(u32);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 17fff18..8bfb335 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -263,6 +263,18 @@ void paravirt_leave_lazy_mmu(void)
 	leave_lazy(PARAVIRT_LAZY_MMU);
 }
 
+void paravirt_flush_lazy_mmu(void)
+{
+	preempt_disable();
+
+	if (paravirt_get_lazy_mode() == PARAVIRT_LAZY_MMU) {
+		arch_leave_lazy_mmu_mode();
+		arch_enter_lazy_mmu_mode();
+	}
+
+	preempt_enable();
+}
+
 void paravirt_start_context_switch(struct task_struct *prev)
 {
 	BUG_ON(preemptible());
@@ -292,18 +304,6 @@ enum paravirt_lazy_mode paravirt_get_lazy_mode(void)
 	return this_cpu_read(paravirt_lazy_mode);
 }
 
-void arch_flush_lazy_mmu_mode(void)
-{
-	preempt_disable();
-
-	if (paravirt_get_lazy_mode() == PARAVIRT_LAZY_MMU) {
-		arch_leave_lazy_mmu_mode();
-		arch_enter_lazy_mmu_mode();
-	}
-
-	preempt_enable();
-}
-
 struct pv_info pv_info = {
 	.name = "bare hardware",
 	.paravirt_enabled = 0,
@@ -475,6 +475,7 @@ struct pv_mmu_ops pv_mmu_ops = {
 	.lazy_mode = {
 		.enter = paravirt_nop,
 		.leave = paravirt_nop,
+		.flush = paravirt_nop,
 	},
 
 	.set_fixmap = native_set_fixmap,
diff --git a/arch/x86/lguest/boot.c b/arch/x86/lguest/boot.c
index 1cbd89c..7114c63 100644
--- a/arch/x86/lguest/boot.c
+++ b/arch/x86/lguest/boot.c
@@ -1334,6 +1334,7 @@ __init void lguest_init(void)
 	pv_mmu_ops.read_cr3 = lguest_read_cr3;
 	pv_mmu_ops.lazy_mode.enter = paravirt_enter_lazy_mmu;
 	pv_mmu_ops.lazy_mode.leave = lguest_leave_lazy_mmu_mode;
+	pv_mmu_ops.lazy_mode.flush = paravirt_flush_lazy_mmu;
 	pv_mmu_ops.pte_update = lguest_pte_update;
 	pv_mmu_ops.pte_update_defer = lguest_pte_update;
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index e8e3493..f4f4105 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -2197,6 +2197,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
 	.lazy_mode = {
 		.enter = paravirt_enter_lazy_mmu,
 		.leave = xen_leave_lazy_mmu,
+		.flush = paravirt_flush_lazy_mmu,
 	},
 
 	.set_fixmap = xen_set_fixmap,
-- 
1.8.1.2
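[Editorial note: the mechanism the patch relies on, reduced to its essence, is
that pv_mmu_ops holds function pointers and on bare metal the flush slot points
at a nop, so the call site costs (almost) nothing. Below is a minimal,
self-contained C analogy with hypothetical names; the real kernel additionally
patches the indirect call site via the paravirt alternatives machinery.]

/* Toy analogy of the pv_lazy_ops dispatch; names are hypothetical. */
#include <stdio.h>

struct lazy_ops {
	void (*enter)(void);
	void (*leave)(void);
	void (*flush)(void);
};

static void nop(void) { }			/* bare-metal default */
static void xen_flush(void) { printf("flushing lazy-MMU batch\n"); }

static struct lazy_ops ops = { nop, nop, nop };	/* native defaults */

static void arch_flush_lazy_mmu_mode(void) { ops.flush(); }

int main(void)
{
	arch_flush_lazy_mmu_mode();	/* nop on "bare metal" */
	ops.flush = xen_flush;		/* a "hypervisor" installs its op */
	arch_flush_lazy_mmu_mode();	/* now does real work */
	return 0;
}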
Re: [GIT PULL] x86/cpu changes for v3.9
On 02/19/2013 12:57 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Feb 19, 2013 at 06:47:58PM +0100, Borislav Petkov wrote:
> > On Tue, Feb 19, 2013 at 09:38:31AM -0800, H. Peter Anvin wrote:
> > > My fault... I was tracking the fix and lost track of the thread.
> > > The problem is that the fix is necessary but not sufficient, as it
> > > introduces an undesirable host-guest dependency. In order to allow
> > > newer guests to work on older hosts we also should use the
> > > {rd,wr}msr_safe() functions to manipulate this MSR, with a comment
> > > as to why.
> > >
> > > Boris, could you prepare such a patch, please?
> >
> > I don't think Boris O. is at AMD anymore.
>
> Want me to add that to my fix for kvm or prep a separate patch?
> CC-ing Boris.

BorisP's patch is what I should have done. Can you take it?

-boris
Re: [GIT PULL] x86/cpu changes for v3.9
On 02/19/2013 01:21 PM, H. Peter Anvin wrote:
> On 02/19/2013 10:19 AM, Boris Ostrovsky wrote:
> > On 02/19/2013 12:57 PM, Konrad Rzeszutek Wilk wrote:
> > > On Tue, Feb 19, 2013 at 06:47:58PM +0100, Borislav Petkov wrote:
> > > > On Tue, Feb 19, 2013 at 09:38:31AM -0800, H. Peter Anvin wrote:
> > > > > My fault... I was tracking the fix and lost track of the
> > > > > thread. The problem is that the fix is necessary but not
> > > > > sufficient, as it introduces an undesirable host-guest
> > > > > dependency. In order to allow newer guests to work on older
> > > > > hosts we also should use the {rd,wr}msr_safe() functions to
> > > > > manipulate this MSR, with a comment as to why.
> > > > >
> > > > > Boris, could you prepare such a patch, please?
> > > >
> > > > I don't think Boris O. is at AMD anymore.
> > >
> > > Want me to add that to my fix for kvm or prep a separate patch?
> > > CC-ing Boris.
> >
> > BorisP's patch is what I should have done. Can you take it?
>
> As I stated:
>
> > > > > The problem is that the fix is necessary but not sufficient, as
> > > > > it introduces an undesirable host-guest dependency. In order to
> > > > > allow newer guests to work on older hosts we also should use
> > > > > the {rd,wr}msr_safe() functions to manipulate this MSR, with a
> > > > > comment as to why.

Ah, sorry --- I missed that part.

-boris
Re: [PATCH] mm/x86: Patch out arch_flush_lazy_mmu_mode() when running on bare metal
On 03/01/2013 07:14 AM, Josh Boyer wrote:
> On Thu, Feb 28, 2013 at 04:52:20PM -0800, H. Peter Anvin wrote:
> > On 02/28/2013 04:42 PM, Josh Boyer wrote:
> > > On Fri, Mar 01, 2013 at 01:36:29AM +0100, Borislav Petkov wrote:
> > > > On Thu, Feb 28, 2013 at 04:15:45PM -0800, H. Peter Anvin wrote:
> > > > > I'll try to get someone to test this tomorrow.
> > > >
> > > > Btw, you'd need to apply that other patch too
> > > >
> > > > http://marc.info/?l=xen-devel&m=136206183814547&w=2
> > > >
> > > > so that arch_flush_lazy_mmu_mode() has at least one caller on
> > > > x86_64.
> > >
> > > Yeah, we already have that applied. It stops crashes in Xen
> > > environments so we pulled it in as a bugfix. Thanks though!
> >
> > Who are "we"?
>
> Sorry, Fedora. That patch has a link to a bug in it. We applied the
> patch for that bug. I'll apply Boris' patch on top and get the same
> people to test it.

Josh, have you had a chance to test this?

-boris
[PATCH] x86/mce: Use MCG_CAP MSR to find out number of banks on AMD
Currently the number of error reporting register banks is hardcoded to
6 on AMD processors. This may break in virtualized scenarios when a
hypervisor prefers to report fewer banks than the physical HW provides.

Since the number of supported banks is reported in
MSR_IA32_MCG_CAP[7:0], that's what we should use.

Signed-off-by: Boris Ostrovsky
---
 arch/x86/kernel/cpu/mcheck/mce_amd.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 1ac581f..cb7c739 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -33,7 +33,6 @@
 #include
 #include
 
-#define NR_BANKS	6
 #define NR_BLOCKS	9
 #define THRESHOLD_MAX	0xFFF
 #define INT_TYPE_APIC	0x0002
@@ -57,9 +56,9 @@ static const char * const th_names[] = {
 	"execution_unit",
 };
 
-static DEFINE_PER_CPU(struct threshold_bank * [NR_BANKS], threshold_banks);
+static DEFINE_PER_CPU(struct threshold_bank **, threshold_banks);
 
-static unsigned char shared_bank[NR_BANKS] = {
+static unsigned char shared_bank[MAX_NR_BANKS] = {
 	0, 0, 0, 0, 1
 };
@@ -214,7 +213,7 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
 	unsigned int bank, block;
 	int offset = -1;
 
-	for (bank = 0; bank < NR_BANKS; ++bank) {
+	for (bank = 0; bank < mca_cfg.banks; ++bank) {
 		for (block = 0; block < NR_BLOCKS; ++block) {
 			if (block == 0)
 				address = MSR_IA32_MC0_MISC + bank * 4;
@@ -276,7 +275,7 @@ static void amd_threshold_interrupt(void)
 	mce_setup(&m);
 
 	/* assume first bank caused it */
-	for (bank = 0; bank < NR_BANKS; ++bank) {
+	for (bank = 0; bank < mca_cfg.banks; ++bank) {
 		if (!(per_cpu(bank_map, m.cpu) & (1 << bank)))
 			continue;
 		for (block = 0; block < NR_BLOCKS; ++block) {
@@ -467,7 +466,7 @@ static __cpuinit int allocate_threshold_blocks(unsigned int cpu,
 	u32 low, high;
 	int err;
 
-	if ((bank >= NR_BANKS) || (block >= NR_BLOCKS))
+	if ((bank >= mca_cfg.banks) || (block >= NR_BLOCKS))
 		return 0;
 
 	if (rdmsr_safe_on_cpu(cpu, address, &low, &high))
@@ -637,7 +636,12 @@ static __cpuinit int threshold_create_device(unsigned int cpu)
 	unsigned int bank;
 	int err = 0;
 
-	for (bank = 0; bank < NR_BANKS; ++bank) {
+	per_cpu(threshold_banks, cpu) = kzalloc(sizeof(struct threshold_bank *)
+						* mca_cfg.banks, GFP_KERNEL);
+	if (per_cpu(threshold_banks, cpu) == NULL)
+		return -ENOMEM;
+
+	for (bank = 0; bank < mca_cfg.banks; ++bank) {
 		if (!(per_cpu(bank_map, cpu) & (1 << bank)))
 			continue;
 		err = threshold_create_bank(cpu, bank);
@@ -719,11 +723,12 @@ static void threshold_remove_device(unsigned int cpu)
 {
 	unsigned int bank;
 
-	for (bank = 0; bank < NR_BANKS; ++bank) {
+	for (bank = 0; bank < mca_cfg.banks; ++bank) {
 		if (!(per_cpu(bank_map, cpu) & (1 << bank)))
 			continue;
 		threshold_remove_bank(cpu, bank);
 	}
+	kfree(per_cpu(threshold_banks, cpu));
 }
 
 /* get notified when a cpu comes on/off */
-- 
1.8.1.2
[PATCH v2 1/2] x86/mce: Replace shared_bank array with is_shared_bank() helper
Use a helper function instead of an array to report whether a register
bank is shared. Currently only bank 4 (northbridge) is shared.

Signed-off-by: Boris Ostrovsky
---
 arch/x86/kernel/cpu/mcheck/mce_amd.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 1ac581f..654a155 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -58,11 +58,6 @@ static const char * const th_names[] = {
 };
 
 static DEFINE_PER_CPU(struct threshold_bank * [NR_BANKS], threshold_banks);
-
-static unsigned char shared_bank[NR_BANKS] = {
-	0, 0, 0, 0, 1
-};
-
 static DEFINE_PER_CPU(unsigned char, bank_map);	/* see which banks are on */
 
 static void amd_threshold_interrupt(void);
@@ -79,6 +74,12 @@ struct thresh_restart {
 	u16			old_limit;
 };
 
+static inline bool is_shared_bank(int bank)
+{
+	/* Bank 4 is for northbridge reporting and is thus shared */
+	return (bank == 4);
+}
+
 static const char * const bank4_names(struct threshold_block *b)
 {
 	switch (b->address) {
@@ -575,7 +576,7 @@ static __cpuinit int threshold_create_bank(unsigned int cpu, unsigned int bank)
 	const char *name = th_names[bank];
 	int err = 0;
 
-	if (shared_bank[bank]) {
+	if (is_shared_bank(bank)) {
 		nb = node_to_amd_nb(amd_get_nb_id(cpu));
 
 		/* threshold descriptor already initialized on this node? */
@@ -609,7 +610,7 @@ static __cpuinit int threshold_create_bank(unsigned int cpu, unsigned int bank)
 
 	per_cpu(threshold_banks, cpu)[bank] = b;
 
-	if (shared_bank[bank]) {
+	if (is_shared_bank(bank)) {
 		atomic_set(&b->cpus, 1);
 
 		/* nb is already initialized, see above */
@@ -691,7 +692,7 @@ static void threshold_remove_bank(unsigned int cpu, int bank)
 	if (!b->blocks)
 		goto free_out;
 
-	if (shared_bank[bank]) {
+	if (is_shared_bank(bank)) {
 		if (!atomic_dec_and_test(&b->cpus)) {
 			__threshold_remove_blocks(b);
 			per_cpu(threshold_banks, cpu)[bank] = NULL;
-- 
1.8.1.2
[PATCH v2 0/2] AMD MCE fixes
Boris,

Here is the updated patch for determining the number of register banks
on AMD, plus a patch removing the shared_bank array, as you suggested.

Offline/online testing didn't show any issues.

Boris Ostrovsky (2):
  x86/mce: Replace shared_bank array with is_shared_bank() helper
  x86/mce: Use MCG_CAP MSR to find out number of banks on AMD

 arch/x86/kernel/cpu/mcheck/mce_amd.c | 38 ++--
 1 file changed, 23 insertions(+), 15 deletions(-)

-- 
1.8.1.2
[PATCH v2 2/2] x86/mce: Use MCG_CAP MSR to find out number of banks on AMD
Currently the number of error reporting register banks is hardcoded to
6 on AMD processors. This may break in virtualized scenarios when a
hypervisor prefers to report fewer banks than what the physical HW
provides.

Since the number of supported banks is reported in
MSR_IA32_MCG_CAP[7:0], that's what we should use.

Signed-off-by: Boris Ostrovsky
---
 arch/x86/kernel/cpu/mcheck/mce_amd.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 654a155..13a22e2 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -33,7 +33,6 @@
 #include
 #include
 
-#define NR_BANKS	6
 #define NR_BLOCKS	9
 #define THRESHOLD_MAX	0xFFF
 #define INT_TYPE_APIC	0x0002
@@ -57,7 +56,7 @@ static const char * const th_names[] = {
 	"execution_unit",
 };
 
-static DEFINE_PER_CPU(struct threshold_bank * [NR_BANKS], threshold_banks);
+static DEFINE_PER_CPU(struct threshold_bank **, threshold_banks);
 static DEFINE_PER_CPU(unsigned char, bank_map);	/* see which banks are on */
 
 static void amd_threshold_interrupt(void);
@@ -215,7 +214,7 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
 	unsigned int bank, block;
 	int offset = -1;
 
-	for (bank = 0; bank < NR_BANKS; ++bank) {
+	for (bank = 0; bank < mca_cfg.banks; ++bank) {
 		for (block = 0; block < NR_BLOCKS; ++block) {
 			if (block == 0)
 				address = MSR_IA32_MC0_MISC + bank * 4;
@@ -277,7 +276,7 @@ static void amd_threshold_interrupt(void)
 	mce_setup(&m);
 
 	/* assume first bank caused it */
-	for (bank = 0; bank < NR_BANKS; ++bank) {
+	for (bank = 0; bank < mca_cfg.banks; ++bank) {
 		if (!(per_cpu(bank_map, m.cpu) & (1 << bank)))
 			continue;
 		for (block = 0; block < NR_BLOCKS; ++block) {
@@ -468,7 +467,7 @@ static __cpuinit int allocate_threshold_blocks(unsigned int cpu,
 	u32 low, high;
 	int err;
 
-	if ((bank >= NR_BANKS) || (block >= NR_BLOCKS))
+	if ((bank >= mca_cfg.banks) || (block >= NR_BLOCKS))
 		return 0;
 
 	if (rdmsr_safe_on_cpu(cpu, address, &low, &high))
@@ -636,9 +635,16 @@ static __cpuinit int threshold_create_bank(unsigned int cpu, unsigned int bank)
 static __cpuinit int threshold_create_device(unsigned int cpu)
 {
 	unsigned int bank;
+	struct threshold_bank **bp;
 	int err = 0;
 
-	for (bank = 0; bank < NR_BANKS; ++bank) {
+	bp = kzalloc(sizeof(struct threshold_bank *) * mca_cfg.banks,
+		     GFP_KERNEL);
+	if (bp == NULL)
+		return -ENOMEM;
+	per_cpu(threshold_banks, cpu) = bp;
+
+	for (bank = 0; bank < mca_cfg.banks; ++bank) {
 		if (!(per_cpu(bank_map, cpu) & (1 << bank)))
 			continue;
 		err = threshold_create_bank(cpu, bank);
@@ -720,11 +726,12 @@ static void threshold_remove_device(unsigned int cpu)
 {
 	unsigned int bank;
 
-	for (bank = 0; bank < NR_BANKS; ++bank) {
+	for (bank = 0; bank < mca_cfg.banks; ++bank) {
 		if (!(per_cpu(bank_map, cpu) & (1 << bank)))
 			continue;
 		threshold_remove_bank(cpu, bank);
 	}
+	kfree(per_cpu(threshold_banks, cpu));
 }
 
 /* get notified when a cpu comes on/off */
-- 
1.8.1.2
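[Editorial note: to see the value the patch now consults on a given machine,
one can read MSR_IA32_MCG_CAP (0x179) from userspace; bits [7:0] are the bank
count. A hypothetical sketch (not part of the patch) using the standard msr
driver, assuming 'modprobe msr' and root:]

/* Hedged sketch: print MCG_CAP[7:0], the MCA bank count, for CPU 0. */
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	uint64_t cap;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0) {
		perror("open /dev/cpu/0/msr");
		return 1;
	}
	/* The MSR number is the read offset for the msr driver. */
	if (pread(fd, &cap, sizeof(cap), 0x179) != sizeof(cap)) {
		perror("pread MCG_CAP");
		close(fd);
		return 1;
	}
	printf("MCG_CAP = 0x%" PRIx64 ", banks = %u\n",
	       cap, (unsigned int)(cap & 0xff));
	close(fd);
	return 0;
}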
Re: [xen] double fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
----- torva...@linux-foundation.org wrote:

> On Sun, Oct 6, 2013 at 1:23 AM, Fengguang Wu wrote:
> >
> > I got the below dmesg and the first bad commit is commit cf39c8e5352b:
> > Merge tag 'stable/for-linus-3.12-rc0-tag' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
>
> Ugh. How reliable is the double fault? Because bisecting it to the
> merge that didn't even have any conflicts in it as far as I can
> remember means that there's something really subtle going on wrt some
> semantic conflict or other. Or, alternatively, it means that the
> bisect failed because the double fault isn't 100% reliable..
>
> Anyway, the stack is crap when the original fault happens at
> "boot_tvec_bases+0x1fe", and that causes the double fault debug code
> to take *another* fault, which means that it doesn't even show the
> right code sequence. Too bad. So ignore the latter part of the oops,
> but the top part looks valid:
>
> > [4.136137] double fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > [4.137521] CPU: 0 PID: 132 Comm: bootlogd Not tainted 3.12.0-rc2-00153-g14951f2 #129
> > [4.139156] task: 88000c9a6580 ti: 88000c9ba000 task.ti: 88000c9ba000
> > [4.140042] RIP: 0010:[] [] boot_tvec_bases+0x1fe/0x2080
> > [4.140042] RSP: 0018:88000cd8 EFLAGS: 00010212
> > [4.140042] RAX: 004f RBX: 0100 RCX:
> > [4.140042] RDX: 0f1e RSI: 81f746a8 RDI: 81f31c48
> > [4.140042] RBP: 88000f003ee0 R08: R09:
> > [4.140042] R10: 0001 R11: 88000f00a000 R12: 88000c9bbfd8
> > [4.140042] R13: 81f31c48 R14: 81f31c48 R15: 81f31c48
> > [4.140042] FS: 7fb1f9662700() GS:88000f00() knlGS:
> > [4.140042] CS: 0010 DS: ES: CR0: 80050033
> > [4.140042] CR2: 88000cc8 CR3: 0c9cd000 CR4: 06b0
> > [4.140042] Stack:
>
> but it has jumped into a data section and is executing random data as
> code, and there is no sign of where it jumped *from*, since the random
> code clearly corrupted the stack - resulting in the double fault in
> the first place.
>
> So the oops is almost entirely useless as a debug aid in this
> situation. I'm almost hoping that your bisect was wrong, and you could
> try to see if you could do that again..

For what it's worth, the commit in question touches almost exclusively
Xen files, the only exception being lib/swiotlb.c (with what appear to
be fairly trivial changes). And CONFIG_XEN in the config file for this
report is not set.

-boris
[PATCH] xen/p2m: Don't call get_balloon_scratch_page() twice, keep interrupts disabled for multicalls
m2p_remove_override() calls get_balloon_scratch_page() in
MULTI_update_va_mapping() even though it already has a pointer to this
page from the earlier call (in scratch_page). This second call doesn't
have a matching put_balloon_scratch_page(), thus not restoring the
preempt count. (Also, there is no put_balloon_scratch_page() in the
error path.)

In addition, the second multicall uses __xen_mc_entry(), which does not
disable interrupts. Rearrange the xen_mc_* calls to keep interrupts off
while performing the multicalls.

Signed-off-by: Boris Ostrovsky
---
 arch/x86/xen/p2m.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 0d4ec35..8b901e8 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -990,10 +990,13 @@ int m2p_remove_override(struct page *page,
 			printk(KERN_WARNING "m2p_remove_override: "
 				"pfn %lx mfn %lx, failed to modify kernel mappings",
 				pfn, mfn);
+			put_balloon_scratch_page();
 			return -1;
 		}
 
-		mcs = xen_mc_entry(
+		xen_mc_batch();
+
+		mcs = __xen_mc_entry(
 			sizeof(struct gnttab_unmap_and_replace));
 		unmap_op = mcs.args;
 		unmap_op->host_addr = kmap_op->host_addr;
@@ -1003,12 +1006,11 @@ int m2p_remove_override(struct page *page,
 		MULTI_grant_table_op(mcs.mc,
 			GNTTABOP_unmap_and_replace, unmap_op, 1);
 
-		xen_mc_issue(PARAVIRT_LAZY_MMU);
-
 		mcs = __xen_mc_entry(0);
 		MULTI_update_va_mapping(mcs.mc, scratch_page_address,
-				pfn_pte(page_to_pfn(get_balloon_scratch_page()),
+				pfn_pte(page_to_pfn(scratch_page),
 				PAGE_KERNEL_RO), 0);
+
+		xen_mc_issue(PARAVIRT_LAZY_MMU);
 
 		kmap_op->host_addr = 0;
-- 
1.8.3.1
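[Editorial note: the first bug here is a classic unbalanced get/put --
get_balloon_scratch_page() disables preemption, so every call needs a matching
put on every exit path, including errors. A stripped-down, hypothetical
illustration of the pattern follows, using a plain counter in place of the
kernel's preempt count; it is not the kernel's code.]

/* Toy illustration of balanced get/put; names are hypothetical. */
#include <assert.h>
#include <stdio.h>

static int preempt_count;

static void *get_scratch(void) { preempt_count++; return (void *)1; }
static void put_scratch(void)  { preempt_count--; }

static int remove_override(int fail)
{
	void *page = get_scratch();

	(void)page;
	if (fail) {
		put_scratch();	/* the fix: put on the error path too */
		return -1;
	}
	put_scratch();
	return 0;
}

int main(void)
{
	remove_override(0);
	remove_override(1);
	/* With the missing put, this assertion would trip after the
	 * error path -- the analogue of the leaked preempt count. */
	assert(preempt_count == 0);
	printf("preempt count balanced\n");
	return 0;
}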
Re: [PATCH] x86/xen: remove deprecated IRQF_DISABLED
----- michael.opdenac...@free-electrons.com wrote:

> This patch proposes to remove the IRQF_DISABLED flag from x86/xen
> code. It's a NOOP since 2.6.35 and it will be removed one day.
>
> Signed-off-by: Michael Opdenacker
>
> ---
>  arch/x86/xen/smp.c      | 10 +-
>  arch/x86/xen/spinlock.c |  2 +-
>  arch/x86/xen/time.c     |  3 +--
>  3 files changed, 7 insertions(+), 8 deletions(-)

If you are cleaning up Xen's use of IRQF_DISABLED then you should
probably also update drivers/xen/evtchn.c and
drivers/xen/platform-pci.c.

-boris
[PATCH v1 2/5] xen/PMU: Sysfs interface for setting Xen PMU mode
Set Xen's PMU mode via /sys/hypervisor/pmu/pmu_mode. Add the XENPMU
hypercall.

Signed-off-by: Boris Ostrovsky
---
 arch/x86/include/asm/xen/hypercall.h |   6 ++
 arch/x86/xen/xen-head.S              |   5 +-
 drivers/xen/sys-hypervisor.c         | 118 +++
 include/xen/interface/xen.h          |   1 +
 include/xen/interface/xenpmu.h       |  44 +
 5 files changed, 173 insertions(+), 1 deletion(-)
 create mode 100644 include/xen/interface/xenpmu.h

diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index e709884..f022cef 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -465,6 +465,12 @@ HYPERVISOR_tmem_op(
 	return _hypercall1(int, tmem_op, op);
 }
 
+static inline int
+HYPERVISOR_xenpmu_op(unsigned int op, void *arg)
+{
+	return _hypercall2(int, xenpmu_op, op, arg);
+}
+
 static inline void
 MULTI_fpu_taskswitch(struct multicall_entry *mcl, int set)
 {
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 7faed58..f2415e6 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -73,8 +73,11 @@ NEXT_HYPERCALL(sysctl)
 NEXT_HYPERCALL(domctl)
 NEXT_HYPERCALL(kexec_op)
 NEXT_HYPERCALL(tmem_op) /* 38 */
+ENTRY(xenclient_rsvd)
+	.skip 32
+NEXT_HYPERCALL(xenpmu_op) /* 40 */
 ENTRY(xen_hypercall_rsvr)
-	.skip 320
+	.skip 256
 NEXT_HYPERCALL(mca) /* 48 */
 NEXT_HYPERCALL(arch_1)
 NEXT_HYPERCALL(arch_2)
diff --git a/drivers/xen/sys-hypervisor.c b/drivers/xen/sys-hypervisor.c
index 96453f8..2218154 100644
--- a/drivers/xen/sys-hypervisor.c
+++ b/drivers/xen/sys-hypervisor.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 
 #define HYPERVISOR_ATTR_RO(_name) \
 static struct hyp_sysfs_attr  _name##_attr = __ATTR_RO(_name)
@@ -368,6 +369,115 @@ static void xen_properties_destroy(void)
 	sysfs_remove_group(hypervisor_kobj, &xen_properties_group);
 }
 
+struct pmu_mode {
+	const char *name;
+	uint32_t mode;
+};
+
+struct pmu_mode pmu_modes[] = {
+	{"enable", VPMU_ON},
+	{"priv_enable", VPMU_PRIV},
+	{"disable", 0}
+};
+
+static ssize_t pmu_mode_store(struct hyp_sysfs_attr *attr,
+			      const char *buffer, size_t len)
+{
+	int ret;
+	struct xenpmu_params xp;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(pmu_modes); i++) {
+		if (!strncmp(buffer, pmu_modes[i].name,
+			     strlen(pmu_modes[i].name))) {
+			xp.control = pmu_modes[i].mode;
+			break;
+		}
+	}
+
+	if (i == ARRAY_SIZE(pmu_modes))
+		return -EINVAL;
+
+	ret = HYPERVISOR_xenpmu_op(XENPMU_mode_set, &xp);
+	if (ret)
+		return ret;
+
+	return len;
+}
+
+static ssize_t pmu_mode_show(struct hyp_sysfs_attr *attr, char *buffer)
+{
+	int ret;
+	struct xenpmu_params xp;
+	int i;
+	uint32_t mode;
+
+	ret = HYPERVISOR_xenpmu_op(XENPMU_mode_get, &xp);
+	if (ret)
+		return ret;
+
+	mode = (uint32_t)xp.control & VPMU_MODE_MASK;
+	for (i = 0; i < ARRAY_SIZE(pmu_modes); i++) {
+		if (mode == pmu_modes[i].mode)
+			return sprintf(buffer, "%s\n", pmu_modes[i].name);
+	}
+
+	return -EINVAL;
+}
+HYPERVISOR_ATTR_RW(pmu_mode);
+
+static ssize_t pmu_flags_store(struct hyp_sysfs_attr *attr,
+			       const char *buffer, size_t len)
+{
+	int ret;
+	uint32_t flags;
+	struct xenpmu_params xp;
+
+	ret = kstrtou32(buffer, 0, &flags);
+	if (ret)
+		return ret;
+
+	xp.control = flags;
+	ret = HYPERVISOR_xenpmu_op(XENPMU_flags_set, &xp);
+	if (ret)
+		return ret;
+
+	return len;
+}
+
+static ssize_t pmu_flags_show(struct hyp_sysfs_attr *attr, char *buffer)
+{
+	int ret;
+	struct xenpmu_params xp;
+
+	ret = HYPERVISOR_xenpmu_op(XENPMU_flags_get, &xp);
+	if (ret)
+		return ret;
+
+	return sprintf(buffer, "0x%x\n", (uint32_t)xp.control);
+}
+HYPERVISOR_ATTR_RW(pmu_flags);
+
+static struct attribute *xen_pmu_attrs[] = {
+	&pmu_mode_attr.attr,
+	&pmu_flags_attr.attr,
+	NULL
+};
+
+static const struct attribute_group xen_pmu_group = {
+	.name = "pmu",
+	.attrs = xen_pmu_attrs,
+};
+
+static int __init xen_pmu_init(void)
+{
+	return sysfs_create_group(hypervisor_kobj, &xen_pmu_group);
+}
+
+static void xen_pmu_destroy(void)
+{
+	sysfs_remove_group(hypervisor_kobj, &xen_pmu_group);
+}
+
 static int __init hyper_sysfs_init(void)
 {
 	int ret;
@@ -390,9 +500,16 @@ static int __init hyper_sysfs_init(void)
 	ret = xen_properties_init();
 	if (ret)
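[Editorial note: for completeness, here is a minimal, hypothetical userspace
sketch (not part of the series) showing how the mode would be set through the
node this patch creates. The path and the accepted strings come from the patch
itself; it assumes dom0, root, and a kernel with the patch applied.]

/* Hedged sketch: select the Xen PMU mode via the new sysfs node. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/hypervisor/pmu/pmu_mode", "w");

	if (!f) {
		perror("fopen /sys/hypervisor/pmu/pmu_mode");
		return 1;
	}
	/* Accepted values per pmu_modes[]: enable, priv_enable, disable */
	if (fputs("enable", f) == EOF)
		perror("fputs");
	return fclose(f) ? 1 : 0;
}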
[PATCH v1 0/5] xen/PMU: PMU support for Xen PV guests
This is the Linux side of Xen PMU support for PV guests, including
dom0. Only kernel changes are here; the toolstack patch will be
provided separately.

Here is the description from the hypervisor patch submission, which
applies to this series as well:

This version has the following limitations:
* For accurate profiling of dom0/Xen, dom0 VCPUs should be pinned.
* Hypervisor code is only profiled on processors that have running
  dom0 VCPUs on them.
* No backtrace support.
* Will fail to load under XSM: we ran out of bits in the permissions
  vector and this needs to be fixed separately.

A few notes that may help reviewing:

* A shared data structure (xenpmu_data_t) between each PV VCPU and
  hypervisor CPU is used for passing registers' values as well as PMU
  state at the time of the PMU interrupt.
* PMU interrupts are taken by the hypervisor at NMI level for both HVM
  and PV.
* The guest's interrupt handler does not read/write PMU MSRs directly.
  Instead, it accesses xenpmu_data_t and flushes it to HW before
  returning.
* PMU mode is controlled at runtime via
  /sys/hypervisor/pmu/{pmu_mode,pmu_flags} in addition to the 'vpmu'
  boot option (which is preserved for backward compatibility). The
  following modes are provided:
  * disable: VPMU is off
  * enable: VPMU is on. Guests can profile themselves; dom0 profiles
    itself and Xen.
  * priv_enable: dom0-only profiling. dom0 collects samples for
    everyone. Sampling in guests is suspended.
* The /proc/xen/xensyms file exports the hypervisor's symbols to dom0
  (similar to /proc/kallsyms).
* The VPMU infrastructure is now used for both HVM and PV and has
  therefore been moved up from the hvm subtree.

Boris Ostrovsky (5):
  xen: xensyms support
  xen/PMU: Sysfs interface for setting Xen PMU mode
  xen/PMU: Initialization code for Xen PMU
  xen/PMU: Add support for PMU registers on PV guests
  xen/PMU: Cache MSR accesses during interrupt handling

 arch/x86/include/asm/xen/hypercall.h |   6 +
 arch/x86/xen/Makefile                |   2 +-
 arch/x86/xen/enlighten.c             |  27 ++-
 arch/x86/xen/pmu.c                   | 422 +++
 arch/x86/xen/pmu.h                   |  15 ++
 arch/x86/xen/smp.c                   |  31 ++-
 arch/x86/xen/xen-head.S              |   5 +-
 drivers/xen/Kconfig                  |   5 +
 drivers/xen/sys-hypervisor.c         | 118 ++++++++++
 drivers/xen/xenfs/Makefile           |   1 +
 drivers/xen/xenfs/super.c            |   3 +
 drivers/xen/xenfs/xenfs.h            |   1 +
 drivers/xen/xenfs/xensyms.c          | 170 ++++++++++++++
 include/xen/interface/platform.h     |  21 ++
 include/xen/interface/xen.h          |   2 +
 include/xen/interface/xenpmu.h       | 127 +++++++++++
 16 files changed, 947 insertions(+), 9 deletions(-)
 create mode 100644 arch/x86/xen/pmu.c
 create mode 100644 arch/x86/xen/pmu.h
 create mode 100644 drivers/xen/xenfs/xensyms.c
 create mode 100644 include/xen/interface/xenpmu.h

-- 
1.8.1.4
[PATCH v1 1/5] xen: xensyms support
Export Xen symbols to dom0 via /proc/xen/xensyms (similar to
/proc/kallsyms).

Signed-off-by: Boris Ostrovsky
---
 drivers/xen/Kconfig              |   5 ++
 drivers/xen/xenfs/Makefile       |   1 +
 drivers/xen/xenfs/super.c        |   3 +
 drivers/xen/xenfs/xenfs.h        |   1 +
 drivers/xen/xenfs/xensyms.c      | 170 +++
 include/xen/interface/platform.h |  21 +
 6 files changed, 201 insertions(+)
 create mode 100644 drivers/xen/xenfs/xensyms.c

diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index 9e02d60..a21fce9 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -242,4 +242,9 @@ config XEN_MCE_LOG
 config XEN_HAVE_PVMMU
 	bool
 
+config XEN_SYMS
+	bool "Xen symbols"
+	depends on XEN_DOM0 && XENFS && KALLSYMS
+	default y
+
 endmenu
diff --git a/drivers/xen/xenfs/Makefile b/drivers/xen/xenfs/Makefile
index b019865..1a83010 100644
--- a/drivers/xen/xenfs/Makefile
+++ b/drivers/xen/xenfs/Makefile
@@ -2,3 +2,4 @@ obj-$(CONFIG_XENFS) += xenfs.o
 xenfs-y			  = super.o
 xenfs-$(CONFIG_XEN_DOM0) += xenstored.o
+xenfs-$(CONFIG_XEN_SYMS) += xensyms.o
diff --git a/drivers/xen/xenfs/super.c b/drivers/xen/xenfs/super.c
index 06092e0..8559a71 100644
--- a/drivers/xen/xenfs/super.c
+++ b/drivers/xen/xenfs/super.c
@@ -57,6 +57,9 @@ static int xenfs_fill_super(struct super_block *sb, void *data, int silent)
 		{ "privcmd", &xen_privcmd_fops, S_IRUSR|S_IWUSR },
 		{ "xsd_kva", &xsd_kva_file_ops, S_IRUSR|S_IWUSR},
 		{ "xsd_port", &xsd_port_file_ops, S_IRUSR|S_IWUSR},
+#ifdef CONFIG_XEN_SYMS
+		{ "xensyms", &xensyms_ops, S_IRUSR},
+#endif
 		{""},
 	};
diff --git a/drivers/xen/xenfs/xenfs.h b/drivers/xen/xenfs/xenfs.h
index 6b80c77..2c5934e 100644
--- a/drivers/xen/xenfs/xenfs.h
+++ b/drivers/xen/xenfs/xenfs.h
@@ -3,5 +3,6 @@
 
 extern const struct file_operations xsd_kva_file_ops;
 extern const struct file_operations xsd_port_file_ops;
+extern const struct file_operations xensyms_ops;
 
 #endif	/* _XENFS_XENBUS_H */
diff --git a/drivers/xen/xenfs/xensyms.c b/drivers/xen/xenfs/xensyms.c
new file mode 100644
index 000..e3301f0
--- /dev/null
+++ b/drivers/xen/xenfs/xensyms.c
@@ -0,0 +1,170 @@
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "xenfs.h"
+
+
+struct xensyms_state {
+	struct xenpf_symdata symdata;
+	ssize_t off;
+};
+
+/* Grab next output page from the hypervisor */
+static int xensyms_next_page(struct xensyms_state *state)
+{
+	int ret;
+	struct xen_platform_op op;
+
+	op.cmd = XENPF_get_symbols;
+	op.u.symdata = state->symdata;
+
+	ret = HYPERVISOR_dom0_op(&op);
+	if (ret < 0)
+		return ret;
+	if (ret == 0)
+		/* End of symbols */
+		return 1;
+
+	state->symdata = op.u.symdata;
+	state->off = 0;
+
+	return 0;
+}
+
+static void *xensyms_start(struct seq_file *m, loff_t *pos)
+{
+	struct xensyms_state *state = (struct xensyms_state *)m->private;
+
+	if (state->off == UINT_MAX) {
+		if (xensyms_next_page(state))
+			return NULL;
+	}
+
+	return m->private;
+}
+
+static void *xensyms_next(struct seq_file *m, void *p, loff_t *pos)
+{
+	struct xensyms_state *state = (struct xensyms_state *)m->private;
+	unsigned int str_len, next_off;
+
+	str_len = strlen(&state->symdata.buf[state->off]) + 1;
+	next_off = state->off + str_len;
+
+	/* If there is no more data on this page then ask Xen for more */
+	if (next_off >= XENSYMS_SZ || state->symdata.buf[next_off] == 0) {
+		if (xensyms_next_page(state))
+			return NULL;
+	} else
+		state->off = next_off;
+
+	return p;
+}
+
+static int xensyms_show(struct seq_file *m, void *p)
+{
+	struct xensyms_state *state = (struct xensyms_state *)m->private;
+
+	seq_printf(m, "%s", &state->symdata.buf[state->off]);
+
+	return 0;
+}
+
+static void xensyms_stop(struct seq_file *m, void *p)
+{
+}
+
+static const struct seq_operations xensyms_seq_ops = {
+	.start = xensyms_start,
+	.next = xensyms_next,
+	.show = xensyms_show,
+	.stop = xensyms_stop,
+};
+
+static int xensyms_open(struct inode *inode, struct file *file)
+{
+	int ret;
+	struct xensyms_state *state;
+	struct seq_file *m;
+
+	ret = seq_open_private(file, &xensyms_seq_ops,
+			       sizeof(struct xensyms_state));
+	if (ret)
+		return ret;
+
+	m = file->private_data;
+	state = (struct xensyms_state *)m->private
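[Editorial note: a minimal consumer sketch (hypothetical, not in the patch):
read the file and print it, the same way one would read /proc/kallsyms.
Requires dom0 with CONFIG_XEN_SYMS=y and root.]

/* Hedged sketch: stream hypervisor symbols from /proc/xen/xensyms. */
#include <stdio.h>

int main(void)
{
	char line[512];
	FILE *f = fopen("/proc/xen/xensyms", "r");

	if (!f) {
		perror("fopen /proc/xen/xensyms");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	return fclose(f) ? 1 : 0;
}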
[PATCH v1 5/5] xen/PMU: Cache MSR accesses during interrupt handling
Avoid trapping to the hypervisor on each MSR access during interrupt
handling. Instead, use the cached MSR values provided in the shared
xenpmu_data by Xen. When handling is completed, flush the registers to
the hypervisor, which will load them into HW.

Signed-off-by: Boris Ostrovsky
---
 arch/x86/xen/pmu.c             | 15 ++-
 include/xen/interface/xenpmu.h |  1 +
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
index d8b059b..bdd7c4a 100644
--- a/arch/x86/xen/pmu.c
+++ b/arch/x86/xen/pmu.c
@@ -345,14 +345,27 @@ static void xen_convert_regs(struct cpu_user_regs *xen_regs,
 
 irqreturn_t xen_pmu_irq_handler(int irq, void *dev_id)
 {
-	int ret = IRQ_NONE;
+	int err, ret = IRQ_NONE;
 	struct pt_regs regs;
 	struct xenpmu_data *xenpmu_data = per_cpu(xenpmu_shared,
 						  smp_processor_id());
 
+	/*
+	 * While handling the interrupt MSR accesses will be cached
+	 * in PMU context
+	 */
+	xenpmu_data->pmu_flags |= PMU_CACHED;
+
 	xen_convert_regs(&xenpmu_data->regs, &regs);
 	if (x86_pmu.handle_irq(&regs))
 		ret = IRQ_HANDLED;
 
+	xenpmu_data->pmu_flags &= ~PMU_CACHED;
+
+	/* Write out cached context to HW */
+	err = HYPERVISOR_xenpmu_op(XENPMU_flush, NULL);
+	if (err) {
+		WARN(1, "%s failed hypercall, err: %d\n", __func__, err);
+		return IRQ_NONE;
+	}
+
 	return ret;
 }
diff --git a/include/xen/interface/xenpmu.h b/include/xen/interface/xenpmu.h
index 16fe1ab..ec0e802 100644
--- a/include/xen/interface/xenpmu.h
+++ b/include/xen/interface/xenpmu.h
@@ -16,6 +16,7 @@
 #define XENPMU_init        4
 #define XENPMU_finish      5
 #define XENPMU_lvtpc_set   6
+#define XENPMU_flush       7 /* Write cached MSR values to HW */
 
 /* Parameter structure for HYPERVISOR_xenpmu_op call */
 struct xenpmu_params {
-- 
1.8.1.4
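[Editorial note: the pattern here is a simple write-back cache with an explicit
flush: while the handler runs, register accesses hit a local copy, and one
hypercall pushes the accumulated state to hardware at the end. A hypothetical,
self-contained illustration, not the kernel's code:]

/* Toy write-back cache with explicit flush; names are hypothetical. */
#include <stdio.h>

#define NREGS 4
static unsigned long hw[NREGS];		/* "hardware" registers */
static unsigned long cache[NREGS];
static int cached;			/* analogous to PMU_CACHED */

static void wrmsr(int i, unsigned long v)
{
	if (cached)
		cache[i] = v;	/* cheap: no trap while the flag is set */
	else
		hw[i] = v;	/* would trap to the hypervisor per access */
}

static void flush(void)
{
	for (int i = 0; i < NREGS; i++)	/* one "hypercall" for all regs */
		hw[i] = cache[i];
}

int main(void)
{
	cached = 1;		/* interrupt handler entry */
	wrmsr(0, 0xc0ffee);
	wrmsr(1, 0xfeed);
	cached = 0;		/* handler done */
	flush();
	printf("hw[0]=%#lx hw[1]=%#lx\n", hw[0], hw[1]);
	return 0;
}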
[PATCH v1 3/5] xen/PMU: Initialization code for Xen PMU
Map shared data structure that will hold CPU registers, VPMU context, VCPU/PCPI IDs of the VCPU interrupted by PMU interrupt. Hypervisor fills this information in its handler and passes it to the guest for further processing. Set up PMU VIRQ. Signed-off-by: Boris Ostrovsky --- arch/x86/xen/Makefile | 2 +- arch/x86/xen/pmu.c | 122 + arch/x86/xen/pmu.h | 12 arch/x86/xen/smp.c | 31 ++- include/xen/interface/xen.h| 1 + include/xen/interface/xenpmu.h | 77 ++ 6 files changed, 243 insertions(+), 2 deletions(-) create mode 100644 arch/x86/xen/pmu.c create mode 100644 arch/x86/xen/pmu.h diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile index 96ab2c0..b187df5 100644 --- a/arch/x86/xen/Makefile +++ b/arch/x86/xen/Makefile @@ -13,7 +13,7 @@ CFLAGS_mmu.o := $(nostackp) obj-y := enlighten.o setup.o multicalls.o mmu.o irq.o \ time.o xen-asm.o xen-asm_$(BITS).o \ grant-table.o suspend.o platform-pci-unplug.o \ - p2m.o + p2m.o pmu.o obj-$(CONFIG_EVENT_TRACING) += trace.o diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c new file mode 100644 index 000..da061d4 --- /dev/null +++ b/arch/x86/xen/pmu.c @@ -0,0 +1,122 @@ +#include +#include + +#include +#include +#include +#include +#include + +#include "xen-ops.h" +#include "pmu.h" + +/* x86_pmu.handle_irq definition */ +#include <../kernel/cpu/perf_event.h> + + +/* Shared page between hypervisor and domain */ +DEFINE_PER_CPU(struct xenpmu_data *, xenpmu_shared); + +/* perf callbacks*/ +int xen_is_in_guest(void) +{ + struct xenpmu_data *xenpmu_data = per_cpu(xenpmu_shared, + smp_processor_id()); + + if (!xen_initial_domain() || + xenpmu_data->domain_id > DOMID_SELF || xenpmu_data->domain_id == 0) + return 0; + + return 1; +} + +static int xen_is_user_mode(void) +{ + struct xenpmu_data *xenpmu_data = per_cpu(xenpmu_shared, + smp_processor_id()); + return ((xenpmu_data->regs.cs & 3) == 3); +} + +static unsigned long xen_get_guest_ip(void) +{ + struct xenpmu_data *xenpmu_data = per_cpu(xenpmu_shared, + smp_processor_id()); + return xenpmu_data->regs.eip; +} + +static struct perf_guest_info_callbacks xen_guest_cbs = { + .is_in_guest= xen_is_in_guest, + .is_user_mode = xen_is_user_mode, + .get_guest_ip = xen_get_guest_ip, +}; + +/* Convert registers from Xen's format to Linux' */ +static void xen_convert_regs(struct cpu_user_regs *xen_regs, +struct pt_regs *regs) +{ + regs->ip = xen_regs->eip; + regs->cs = xen_regs->cs; +} + +irqreturn_t xen_pmu_irq_handler(int irq, void *dev_id) +{ + int ret = IRQ_NONE; + struct pt_regs regs; + struct xenpmu_data *xenpmu_data = per_cpu(xenpmu_shared, + smp_processor_id()); + + xen_convert_regs(&xenpmu_data->regs, ®s); + if (x86_pmu.handle_irq(®s)) + ret = IRQ_HANDLED; + + return ret; +} + +int xen_pmu_init(int cpu) +{ + int ret = 0; + struct xenpmu_params xp; + unsigned long pfn; + struct xenpmu_data *xenpmu_data; + + BUILD_BUG_ON(sizeof(struct xenpmu_data) > PAGE_SIZE); + xenpmu_data = vmalloc(PAGE_SIZE); + if (!xenpmu_data) { + ret = -ENOMEM; + goto fail; + } + pfn = vmalloc_to_pfn((char *)xenpmu_data); + + xp.mfn = pfn_to_mfn(pfn); + xp.vcpu = cpu; + xp.version.maj = XENPMU_VER_MAJ; + xp.version.min = XENPMU_VER_MIN; + ret = HYPERVISOR_xenpmu_op(XENPMU_init, &xp); + if (ret) + goto fail; + + per_cpu(xenpmu_shared, cpu) = xenpmu_data; + + if (cpu == 0) + perf_register_guest_info_callbacks(&xen_guest_cbs); + + return ret; + +fail: + vfree(xenpmu_data); + return ret; +} + +void xen_pmu_finish(int cpu) +{ + struct xenpmu_params xp; + + xp.vcpu = cpu; + xp.version.maj = XENPMU_VER_MAJ; + xp.version.min = XENPMU_VER_MIN; + 
+ (void)HYPERVISOR_xenpmu_op(XENPMU_finish, &xp); + + vfree(per_cpu(xenpmu_shared, cpu)); + per_cpu(xenpmu_shared, cpu) = NULL; +} diff --git a/arch/x86/xen/pmu.h b/arch/x86/xen/pmu.h new file mode 100644 index 000..51de7d2 --- /dev/null +++ b/arch/x86/xen/pmu.h @@ -0,0 +1,12 @@ +#ifndef __XEN_PMU_H +#define __XEN_PMU_H + +#include + +irqreturn_t xen_pmu_irq_handler(int irq, void *dev_id); +int xen_pmu_init(int cpu); +void xen_pmu_finish(int cpu);
[PATCH v1 4/5] xen/PMU: Add support for PMU registers on PV guests
PMU emulation code: MSR caching in PMU context and LVTPC APIC handling. (Portions of this code are taken from Xen's VPMU implementation) Signed-off-by: Boris Ostrovsky --- arch/x86/xen/enlighten.c | 27 +++- arch/x86/xen/pmu.c | 289 - arch/x86/xen/pmu.h | 3 + include/xen/interface/xenpmu.h | 5 + 4 files changed, 317 insertions(+), 7 deletions(-) diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c index 193097e..2512bd3 100644 --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c @@ -82,6 +82,7 @@ #include "mmu.h" #include "smp.h" #include "multicalls.h" +#include "pmu.h" EXPORT_SYMBOL_GPL(hypercall_page); @@ -960,6 +961,11 @@ static u32 xen_apic_read(u32 reg) static void xen_apic_write(u32 reg, u32 val) { + if (reg == APIC_LVTPC) { + (void)pmu_apic_update(reg); + return; + } + /* Warn to see if there's any stray references */ WARN_ON(1); } @@ -1064,11 +1070,20 @@ static inline void xen_write_cr8(unsigned long val) BUG_ON(val); } #endif -static int xen_write_msr_safe(unsigned int msr, unsigned low, unsigned high) + +static u64 xen_read_msr_safe(unsigned int msr, int *err) { - int ret; + u64 val; - ret = 0; + if (pmu_msr_read(msr, &val, err)) + return val; + + return native_read_msr_safe(msr, err); +} + +static int xen_write_msr_safe(unsigned int msr, unsigned low, unsigned high) +{ + int ret = 0; switch (msr) { #ifdef CONFIG_X86_64 @@ -1102,10 +1117,10 @@ static int xen_write_msr_safe(unsigned int msr, unsigned low, unsigned high) if (smp_processor_id() == 0) xen_set_pat(((u64)high << 32) | low); break; + } - default: + if (!pmu_msr_write(msr, low, high, &ret)) ret = native_write_msr_safe(msr, low, high); - } return ret; } @@ -1239,7 +1254,7 @@ static const struct pv_cpu_ops xen_cpu_ops __initconst = { .wbinvd = native_wbinvd, - .read_msr = native_read_msr_safe, + .read_msr = xen_read_msr_safe, .write_msr = xen_write_msr_safe, .read_tsc = native_read_tsc, diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c index da061d4..d8b059b 100644 --- a/arch/x86/xen/pmu.c +++ b/arch/x86/xen/pmu.c @@ -17,6 +17,291 @@ /* Shared page between hypervisor and domain */ DEFINE_PER_CPU(struct xenpmu_data *, xenpmu_shared); +/* PMU register caching */ + +/* AMD PMU */ +static __read_mostly uint32_t amd_counters_base; +static __read_mostly uint32_t amd_ctrls_base; +static __read_mostly int amd_msr_step; +static __read_mostly int k7_counters_mirrored; +static __read_mostly int amd_num_counters; + +/* Intel PMU */ +#define MSR_TYPE_COUNTER0 +#define MSR_TYPE_CTRL 1 +#define MSR_TYPE_GLOBAL 2 +#define MSR_TYPE_ARCH_COUNTER 3 +#define MSR_TYPE_ARCH_CTRL 4 + +#define PMU_GENERAL_NR_SHIFT8 /* Number of general pmu registers */ +#define PMU_GENERAL_NR_BITS 8 /* 8 bits 8..15 */ +#define PMU_GENERAL_NR_MASK (((1 << PMU_GENERAL_NR_BITS) - 1) \ +<< PMU_GENERAL_NR_SHIFT) + +static __read_mostly int intel_num_counters; + + +static void xen_pmu_arch_init(void) +{ + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) { + + switch (boot_cpu_data.x86) { + case 0x15: + amd_num_counters = F15H_NUM_COUNTERS; + amd_counters_base = MSR_F15H_PERF_CTR; + amd_ctrls_base = MSR_F15H_PERF_CTL; + amd_msr_step = 2; + k7_counters_mirrored = 1; + break; + case 0x10: + case 0x12: + case 0x14: + case 0x16: + default: + amd_num_counters = F10H_NUM_COUNTERS; + amd_counters_base = MSR_K7_PERFCTR0; + amd_ctrls_base = MSR_K7_EVNTSEL0; + amd_msr_step = 1; + k7_counters_mirrored = 0; + break; + } + } else { + uint32_t eax = cpuid_eax(0xa); + + intel_num_counters = (eax & PMU_GENERAL_NR_MASK) >> + PMU_GENERAL_NR_SHIFT; + } +} + +static 
inline uint32_t get_fam15h_addr(u32 addr) +{ + switch (addr) { + case MSR_K7_PERFCTR0: + case MSR_K7_PERFCTR1: + case MSR_K7_PERFCTR2: + case MSR_K7_PERFCTR3: + return MSR_F15H_PERF_CTR + (addr - MSR_K7_PERFCTR0); + case MSR_K7_EVNTSEL0: + case MSR_K7_EVNTSEL1: + case MSR_K7_EVNTSEL2: + case MSR_K7_EVNTSEL3: + retur
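The posting is cut off in the middle of get_fam15h_addr(). By symmetry with the counter branch that is visible above, the event-select MSRs presumably map onto MSR_F15H_PERF_CTL the same way; a sketch of the full helper under that assumption (GCC case ranges compress the four individual cases):

static inline u32 get_fam15h_addr_sketch(u32 addr)
{
	switch (addr) {
	case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3:
		return MSR_F15H_PERF_CTR + (addr - MSR_K7_PERFCTR0);
	case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL3:
		/* assumed symmetric with the counter branch above */
		return MSR_F15H_PERF_CTL + (addr - MSR_K7_EVNTSEL0);
	}
	return addr;	/* anything else passes through unchanged */
}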
Re: [Xen-devel] [PATCH v1 0/5] xen/PMU: PMU support for Xen PV guests
On 09/11/2013 05:33 AM, David Vrabel wrote: On 10/09/13 16:31, Boris Ostrovsky wrote: This is the Linux side of Xen PMU support for PV guests, including dom0. Only kernel changes are here, toolstack patch will be provided separately. Here is the description from the hypervisor patch submission that applies to this series as well: This version has the following limitations: * For accurate profiling of dom0/Xen, dom0 VCPUs should be pinned. * Hypervisor code is only profiled on processors that have running dom0 VCPUs on them. * No backtrace support. These are some pretty significant limitations. Is there a plan for how to remove them? I don't have a specific plan (other than do it after this stage is finished) but I do have a rough idea of what would be needed to address these. Hypervisor changes for all three should be pretty easy. Linux-wise, for the first one (pinned VCPUs) we will probably need to make a change in perf_sample_data. There is a reserved field there so perhaps we can use it to store the PCPU (and maybe domainID). Plus a way to actually write the data, probably a hook or something. Backtrace support should also not be too bad: we can pass the Xen IP stack in the shared data area to dom0 and then again have a hook or something to pass it to perf. (Note that backtracing is not supported for KVM either, so both may benefit here.) The second one is the most difficult: we need to be able somehow to access MSRs (and the APIC, I think) on non-dom0 CPUs. Creating an interface for such access is not a big deal but integrating it into the perf infrastructure will be a challenge. There are other alternatives but they have problems as well. -boris -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] xen: Do not enable spinlocks before jump_label_init() has executed
From: Konrad Rzeszutek Wilk xen_init_spinlocks() currently calls static_key_slow_inc() before jump_label_init() is invoked. When CONFIG_JUMP_LABEL is set (which usually is the case) the effect of this static_key_slow_inc() is deferred until after jump_label_init(). This is different from when CONFIG_JUMP_LABEL is not set, in which case the key is set immediately. Thus, depending on the value of the config option, we may observe different behavior. In addition, when we come to __jump_label_transform() from jump_label_init(), the key (paravirt_ticketlocks_enabled) is already enabled. On processors where ideal_nop is not the same as default_nop this will cause a BUG() since it is expected that before a key is enabled the latter is replaced by the former during initialization. To address this problem we need to move static_key_slow_inc(&paravirt_ticketlocks_enabled) so that it is called after jump_label_init(). We also need to make sure that this is done before other cpus start to boot. early_initcall appears to be a good place to do so. (Note that we cannot move the whole xen_init_spinlocks() there since pv_lock_ops need to be set before alternative_instructions() runs.) Signed-off-by: Konrad Rzeszutek Wilk Signed-off-by: Boris Ostrovsky --- arch/x86/xen/spinlock.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 253f63f..d90628d 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -267,11 +267,18 @@ void __init xen_init_spinlocks(void) return; } - static_key_slow_inc(&paravirt_ticketlocks_enabled); - pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning); pv_lock_ops.unlock_kick = xen_unlock_kick; } +static __init int xen_init_spinlocks_jump(void) +{ + if (!xen_pvspin) + return 0; + + static_key_slow_inc(&paravirt_ticketlocks_enabled); + return 0; +} +early_initcall(xen_init_spinlocks_jump); static __init int xen_parse_nopvspin(char *arg) { -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 10/11] xenbus: convert bus code to use dev_groups
On 10/07/2013 02:55 AM, Greg Kroah-Hartman wrote: The dev_attrs field of struct bus_type is going away soon, dev_groups should be used instead. This converts the xenbus code to use the correct field. Cc: Konrad Rzeszutek Wilk Cc: Boris Ostrovsky Cc: David Vrabel Cc: Signed-off-by: Greg Kroah-Hartman --- Konrad, I can take this through my driver-core tree if you like, just let me know what would be the easiest for you. Konrad is likely out this (and possibly next) week but given that you are taking a bunch of these patches it make sense that you take this one as well. (and just in case I gave it a quick test and it looked good). -boris drivers/xen/xenbus/xenbus_probe.c | 24 ++-- drivers/xen/xenbus/xenbus_probe.h | 2 +- drivers/xen/xenbus/xenbus_probe_backend.c | 2 +- drivers/xen/xenbus/xenbus_probe_frontend.c | 2 +- 4 files changed, 21 insertions(+), 9 deletions(-) diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c index 38e92b7..3c0a74b 100644 --- a/drivers/xen/xenbus/xenbus_probe.c +++ b/drivers/xen/xenbus/xenbus_probe.c @@ -384,12 +384,14 @@ static ssize_t nodename_show(struct device *dev, { return sprintf(buf, "%s\n", to_xenbus_device(dev)->nodename); } +static DEVICE_ATTR_RO(nodename); static ssize_t devtype_show(struct device *dev, struct device_attribute *attr, char *buf) { return sprintf(buf, "%s\n", to_xenbus_device(dev)->devicetype); } +static DEVICE_ATTR_RO(devtype); static ssize_t modalias_show(struct device *dev, struct device_attribute *attr, char *buf) @@ -397,14 +399,24 @@ static ssize_t modalias_show(struct device *dev, return sprintf(buf, "%s:%s\n", dev->bus->name, to_xenbus_device(dev)->devicetype); } +static DEVICE_ATTR_RO(modalias); -struct device_attribute xenbus_dev_attrs[] = { - __ATTR_RO(nodename), - __ATTR_RO(devtype), - __ATTR_RO(modalias), - __ATTR_NULL +static struct attribute *xenbus_dev_attrs[] = { + &dev_attr_nodename.attr, + &dev_attr_devtype.attr, + &dev_attr_modalias.attr, + NULL, }; -EXPORT_SYMBOL_GPL(xenbus_dev_attrs); + +static const struct attribute_group xenbus_dev_group = { + .attrs = xenbus_dev_attrs, +}; + +const struct attribute_group *xenbus_dev_groups[] = { + &xenbus_dev_group, + NULL, +}; +EXPORT_SYMBOL_GPL(xenbus_dev_groups); int xenbus_probe_node(struct xen_bus_type *bus, const char *type, diff --git a/drivers/xen/xenbus/xenbus_probe.h b/drivers/xen/xenbus/xenbus_probe.h index 146f857..1085ec2 100644 --- a/drivers/xen/xenbus/xenbus_probe.h +++ b/drivers/xen/xenbus/xenbus_probe.h @@ -54,7 +54,7 @@ enum xenstore_init { XS_LOCAL, }; -extern struct device_attribute xenbus_dev_attrs[]; +extern const struct attribute_group *xenbus_dev_groups[]; extern int xenbus_match(struct device *_dev, struct device_driver *_drv); extern int xenbus_dev_probe(struct device *_dev); diff --git a/drivers/xen/xenbus/xenbus_probe_backend.c b/drivers/xen/xenbus/xenbus_probe_backend.c index 998bbba..5125dce 100644 --- a/drivers/xen/xenbus/xenbus_probe_backend.c +++ b/drivers/xen/xenbus/xenbus_probe_backend.c @@ -200,7 +200,7 @@ static struct xen_bus_type xenbus_backend = { .probe = xenbus_dev_probe, .remove = xenbus_dev_remove, .shutdown = xenbus_dev_shutdown, - .dev_attrs = xenbus_dev_attrs, + .dev_groups = xenbus_dev_groups, }, }; diff --git a/drivers/xen/xenbus/xenbus_probe_frontend.c b/drivers/xen/xenbus/xenbus_probe_frontend.c index 34b20bf..129bf84 100644 --- a/drivers/xen/xenbus/xenbus_probe_frontend.c +++ b/drivers/xen/xenbus/xenbus_probe_frontend.c @@ -154,7 +154,7 @@ static struct xen_bus_type xenbus_frontend = { .probe = 
xenbus_frontend_dev_probe, .remove = xenbus_dev_remove, .shutdown = xenbus_dev_shutdown, - .dev_attrs = xenbus_dev_attrs, + .dev_groups = xenbus_dev_groups, .pm = &xenbus_pm_ops, }, -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v1 2/5] xen/PMU: Sysfs interface for setting Xen PMU mode
On 09/23/2013 09:17 AM, Konrad Rzeszutek Wilk wrote: On Tue, Sep 10, 2013 at 11:31:47AM -0400, Boris Ostrovsky wrote: + +/* Parameter structure for HYPERVISOR_xenpmu_op call */ +struct xenpmu_params { + union { + struct version { + uint8_t maj; + uint8_t min; + } version; + uint64_t pad; + }; + uint64_t control; +}; + +/* VPMU modes */ +#define VPMU_MODE_MASK 0xff +#define VPMU_OFF 0 +/* guests can profile themselves, (dom0 profiles itself and Xen) */ +#define VPMU_ON(1<<0) +/* + * Only dom0 has access to VPMU and it profiles everyone: itself, + * the hypervisor and the guests. + */ +#define VPMU_PRIV (1<<1) + +/* VPMU flags */ +#define VPMU_FLAGS_MASK((uint32_t)(~VPMU_MODE_MASK)) +#define VPMU_INTEL_BTS (1<<8) /* Ignored on AMD */ + +#endif /* __XEN_PUBLIC_XENPMU_H__ */ Looks OK to me if there are no changes to the hypervisor ABI. v2 patches for hypervisor have different layout for shared structures and parameters so this will have to change as well. I don't believe there are any logic changes to this series due to this v2, except for the first patch (getting Xen symbols into dom0). -boris -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
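For reference, the split between mode and flags in the control word under the masks quoted above can be exercised standalone (an illustration only; the control value is made up):

#include <stdint.h>
#include <stdio.h>

#define VPMU_MODE_MASK  0xff
#define VPMU_ON         (1 << 0)
#define VPMU_PRIV       (1 << 1)
#define VPMU_FLAGS_MASK ((uint32_t)(~VPMU_MODE_MASK))
#define VPMU_INTEL_BTS  (1 << 8)

int main(void)
{
	uint64_t control = VPMU_PRIV | VPMU_INTEL_BTS;	/* made-up example */
	uint32_t mode  = control & VPMU_MODE_MASK;	/* low byte: mode */
	uint32_t flags = control & VPMU_FLAGS_MASK;	/* the rest: flags */

	printf("mode=%#x flags=%#x\n", mode, flags);	/* mode=0x2 flags=0x100 */
	return 0;
}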
Re: [PATCH v1 3/5] xen/PMU: Initialization code for Xen PMU
On 09/23/2013 09:26 AM, Konrad Rzeszutek Wilk wrote: On Tue, Sep 10, 2013 at 11:31:48AM -0400, Boris Ostrovsky wrote: Map shared data structure that will hold CPU registers, VPMU context, VCPU/PCPI IDs of the VCPU interrupted by PMU interrupt. Hypervisor fills this information in its handler and passes it to the guest for further processing. Set up PMU VIRQ. Signed-off-by: Boris Ostrovsky --- arch/x86/xen/Makefile | 2 +- arch/x86/xen/pmu.c | 122 + arch/x86/xen/pmu.h | 12 arch/x86/xen/smp.c | 31 ++- include/xen/interface/xen.h| 1 + include/xen/interface/xenpmu.h | 77 ++ 6 files changed, 243 insertions(+), 2 deletions(-) create mode 100644 arch/x86/xen/pmu.c create mode 100644 arch/x86/xen/pmu.h diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile index 96ab2c0..b187df5 100644 --- a/arch/x86/xen/Makefile +++ b/arch/x86/xen/Makefile @@ -13,7 +13,7 @@ CFLAGS_mmu.o := $(nostackp) obj-y := enlighten.o setup.o multicalls.o mmu.o irq.o \ time.o xen-asm.o xen-asm_$(BITS).o \ grant-table.o suspend.o platform-pci-unplug.o \ - p2m.o + p2m.o pmu.o Perhaps guard the build of this based on CONFIG_PERF_EVENTS? That would of course mean you also have to create in xenpmu.h static inline empy functions for xen_pmu_finish and xen_pmu_init in case CONFIG_PERF_EVENTS is not set. This is interface header, am I allowed to do those sorts of things there? Also, *theoretically* other performance-measuring tools can use this framework, expect for perf-specific things like callbacks and the handler call. So maybe I should put CONFIG_PERF_EVENTS in pmu.c? obj-$(CONFIG_EVENT_TRACING) += trace.o diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c new file mode 100644 index 000..da061d4 --- /dev/null +++ b/arch/x86/xen/pmu.c @@ -0,0 +1,122 @@ +#include +#include + +#include +#include +#include +#include +#include + +#include "xen-ops.h" +#include "pmu.h" + +/* x86_pmu.handle_irq definition */ +#include <../kernel/cpu/perf_event.h> + + +/* Shared page between hypervisor and domain */ +DEFINE_PER_CPU(struct xenpmu_data *, xenpmu_shared); + +/* perf callbacks*/ +int xen_is_in_guest(void) +{ + struct xenpmu_data *xenpmu_data = per_cpu(xenpmu_shared, + smp_processor_id()); + + if (!xen_initial_domain() || + xenpmu_data->domain_id > DOMID_SELF || xenpmu_data->domain_id == 0) + return 0; + + return 1; +} + +static int xen_is_user_mode(void) +{ + struct xenpmu_data *xenpmu_data = per_cpu(xenpmu_shared, + smp_processor_id()); + return ((xenpmu_data->regs.cs & 3) == 3); +} + +static unsigned long xen_get_guest_ip(void) +{ + struct xenpmu_data *xenpmu_data = per_cpu(xenpmu_shared, + smp_processor_id()); + return xenpmu_data->regs.eip; +} + +static struct perf_guest_info_callbacks xen_guest_cbs = { + .is_in_guest= xen_is_in_guest, + .is_user_mode = xen_is_user_mode, + .get_guest_ip = xen_get_guest_ip, +}; + +/* Convert registers from Xen's format to Linux' */ +static void xen_convert_regs(struct cpu_user_regs *xen_regs, +struct pt_regs *regs) +{ + regs->ip = xen_regs->eip; + regs->cs = xen_regs->cs; +} + +irqreturn_t xen_pmu_irq_handler(int irq, void *dev_id) +{ + int ret = IRQ_NONE; + struct pt_regs regs; + struct xenpmu_data *xenpmu_data = per_cpu(xenpmu_shared, + smp_processor_id()); + + xen_convert_regs(&xenpmu_data->regs, ®s); + if (x86_pmu.handle_irq(®s)) + ret = IRQ_HANDLED; + + return ret; +} + +int xen_pmu_init(int cpu) +{ + int ret = 0; + struct xenpmu_params xp; + unsigned long pfn; + struct xenpmu_data *xenpmu_data; + + BUILD_BUG_ON(sizeof(struct xenpmu_data) > PAGE_SIZE); + xenpmu_data = vmalloc(PAGE_SIZE); + if 
(!xenpmu_data) { + ret = -ENOMEM; + goto fail; + } + pfn = vmalloc_to_pfn((char *)xenpmu_data); + + xp.mfn = pfn_to_mfn(pfn); + xp.vcpu = cpu; + xp.version.maj = XENPMU_VER_MAJ; + xp.version.min = XENPMU_VER_MIN; + ret = HYPERVISOR_xenpmu_op(XENPMU_init, &xp); + if (ret) + goto fail; + + per_cpu(xenpmu_shared, cpu) = xenpmu_data; + + if (cpu == 0) + perf_register_guest_info_callbacks(&xen_guest_cbs); + + return ret; + +fail: + vfree(xenpmu_data); + return ret; +} + +voi
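One way to resolve the question raised above, as a sketch: leave the public interface header untouched and put the CONFIG_PERF_EVENTS guard with empty inline stubs into the private arch/x86/xen/pmu.h (this placement is an assumption for illustration, not what the posted series does):

/* arch/x86/xen/pmu.h */
#ifdef CONFIG_PERF_EVENTS
irqreturn_t xen_pmu_irq_handler(int irq, void *dev_id);
int xen_pmu_init(int cpu);
void xen_pmu_finish(int cpu);
#else
static inline int xen_pmu_init(int cpu) { return 0; }
static inline void xen_pmu_finish(int cpu) {}
#endif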
Re: [Xen-devel] [PATCH v1 3/5] xen/PMU: Initialization code for Xen PMU
On 09/23/2013 10:18 AM, Boris Ostrovsky wrote: On 09/23/2013 09:26 AM, Konrad Rzeszutek Wilk wrote: On Tue, Sep 10, 2013 at 11:31:48AM -0400, Boris Ostrovsky wrote: Map shared data structure that will hold CPU registers, VPMU context, VCPU/PCPI IDs of the VCPU interrupted by PMU interrupt. Hypervisor fills this information in its handler and passes it to the guest for further processing. Set up PMU VIRQ. Signed-off-by: Boris Ostrovsky --- arch/x86/xen/Makefile | 2 +- arch/x86/xen/pmu.c | 122 + arch/x86/xen/pmu.h | 12 arch/x86/xen/smp.c | 31 ++- include/xen/interface/xen.h| 1 + include/xen/interface/xenpmu.h | 77 ++ 6 files changed, 243 insertions(+), 2 deletions(-) create mode 100644 arch/x86/xen/pmu.c create mode 100644 arch/x86/xen/pmu.h diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile index 96ab2c0..b187df5 100644 --- a/arch/x86/xen/Makefile +++ b/arch/x86/xen/Makefile @@ -13,7 +13,7 @@ CFLAGS_mmu.o:= $(nostackp) obj-y:= enlighten.o setup.o multicalls.o mmu.o irq.o \ time.o xen-asm.o xen-asm_$(BITS).o \ grant-table.o suspend.o platform-pci-unplug.o \ -p2m.o +p2m.o pmu.o Perhaps guard the build of this based on CONFIG_PERF_EVENTS? That would of course mean you also have to create in xenpmu.h static inline empy functions for xen_pmu_finish and xen_pmu_init in case CONFIG_PERF_EVENTS is not set. This is interface header, am I allowed to do those sorts of things there? Never mind that, I was thinking of a different file. Sorry. Also, *theoretically* other performance-measuring tools can use this framework, expect for perf-specific things like callbacks and the handler call. So maybe I should put CONFIG_PERF_EVENTS in pmu.c? -boris -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] xen: Do not enable spinlocks before jump_label_init() has executed
On 09/12/2013 10:47 PM, Steven Rostedt wrote: On Thu, 12 Sep 2013 22:29:44 -0400 Boris Ostrovsky wrote: From: Konrad Rzeszutek Wilk xen_init_spinlocks() currently calls static_key_slow_inc() before jump_label_init() is invoked. When CONFIG_JUMP_LABEL is set (which usually is the case) the effect of this static_key_slow_inc() is deferred until after jump_label_init(). This is different from when CONFIG_JUMP_LABEL is not set, in which case the key is set immediately. Thus, depending on the value of the config option, we may observe different behavior. In addition, when we come to __jump_label_transform() from jump_label_init(), the key (paravirt_ticketlocks_enabled) is already enabled. On processors where ideal_nop is not the same as default_nop this will cause a BUG() since it is expected that before a key is enabled the latter is replaced by the former during initialization. To address this problem we need to move static_key_slow_inc(&paravirt_ticketlocks_enabled) so that it is called after jump_label_init(). We also need to make sure that this is done before other cpus start to boot. early_initcall appears to be a good place to do so. (Note that we cannot move the whole xen_init_spinlocks() there since pv_lock_ops need to be set before alternative_instructions() runs.) Reviewed-by: Steven Rostedt Peter, This fixes a regression in 3.12 against xen. Please pull and push to Linus sometime soon. Peter, Are you planning on taking this patch? Thanks. -boris Thanks, -- Steve Signed-off-by: Konrad Rzeszutek Wilk Signed-off-by: Boris Ostrovsky --- arch/x86/xen/spinlock.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 253f63f..d90628d 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -267,11 +267,18 @@ void __init xen_init_spinlocks(void) return; } - static_key_slow_inc(&paravirt_ticketlocks_enabled); - pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning); pv_lock_ops.unlock_kick = xen_unlock_kick; } +static __init int xen_init_spinlocks_jump(void) +{ + if (!xen_pvspin) + return 0; + + static_key_slow_inc(&paravirt_ticketlocks_enabled); + return 0; +} +early_initcall(xen_init_spinlocks_jump); static __init int xen_parse_nopvspin(char *arg) { -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] x86,AMD: Power driver support for AMD's family 16h processors
Add family 16h PCI ID to AMD's power driver to allow it to report power consumption on these processors. Signed-off-by: Boris Ostrovsky --- drivers/hwmon/fam15h_power.c |1 + include/linux/pci_ids.h |1 + 2 files changed, 2 insertions(+) diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c index 4f41104..dda1113 100644 --- a/drivers/hwmon/fam15h_power.c +++ b/drivers/hwmon/fam15h_power.c @@ -248,6 +248,7 @@ static void __devexit fam15h_power_remove(struct pci_dev *pdev) static DEFINE_PCI_DEVICE_TABLE(fam15h_power_id_table) = { { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_15H_NB_F4) }, + { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_16H_NB_F4) }, {} }; MODULE_DEVICE_TABLE(pci, fam15h_power_id_table); diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h index 9d36b82..9c9a464 100644 --- a/include/linux/pci_ids.h +++ b/include/linux/pci_ids.h @@ -524,6 +524,7 @@ #define PCI_DEVICE_ID_AMD_15H_NB_F3 0x1603 #define PCI_DEVICE_ID_AMD_15H_NB_F4 0x1604 #define PCI_DEVICE_ID_AMD_15H_NB_F5 0x1605 +#define PCI_DEVICE_ID_AMD_16H_NB_F4 0x1534 #define PCI_DEVICE_ID_AMD_CNB17H_F3 0x1703 #define PCI_DEVICE_ID_AMD_LANCE 0x2000 #define PCI_DEVICE_ID_AMD_LANCE_HOME 0x2001 -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2] x86,AMD: Power driver support for AMD's family 16h processors
Add family 16h PCI ID to AMD's power driver to allow it to report power consumption on these processors. Signed-off-by: Boris Ostrovsky --- drivers/hwmon/fam15h_power.c |4 1 file changed, 4 insertions(+) diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c index 4f41104..34ab2a8 100644 --- a/drivers/hwmon/fam15h_power.c +++ b/drivers/hwmon/fam15h_power.c @@ -31,6 +31,9 @@ MODULE_DESCRIPTION("AMD Family 15h CPU processor power monitor"); MODULE_AUTHOR("Andreas Herrmann "); MODULE_LICENSE("GPL"); +/* Family 16h Northbridge's function 4 PCI ID */ +#define PCI_DEVICE_ID_AMD_16H_NB_F4 0x1534 + /* D18F3 */ #define REG_NORTHBRIDGE_CAP 0xe8 @@ -248,6 +251,7 @@ static void __devexit fam15h_power_remove(struct pci_dev *pdev) static DEFINE_PCI_DEVICE_TABLE(fam15h_power_id_table) = { { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_15H_NB_F4) }, + { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_16H_NB_F4) }, {} }; MODULE_DEVICE_TABLE(pci, fam15h_power_id_table); -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] x86, microcode, AMD: Add support for family 16h processors
Add valid patch size for family 16h processors Signed-off-by: Boris Ostrovsky --- arch/x86/kernel/microcode_amd.c |4 1 file changed, 4 insertions(+) diff --git a/arch/x86/kernel/microcode_amd.c b/arch/x86/kernel/microcode_amd.c index 7720ff5..58790e8 100644 --- a/arch/x86/kernel/microcode_amd.c +++ b/arch/x86/kernel/microcode_amd.c @@ -190,6 +190,7 @@ static unsigned int verify_patch_size(int cpu, u32 patch_size, #define F1XH_MPB_MAX_SIZE 2048 #define F14H_MPB_MAX_SIZE 1824 #define F15H_MPB_MAX_SIZE 4096 +#define F16H_MPB_MAX_SIZE 3458 switch (c->x86) { case 0x14: @@ -198,6 +199,9 @@ static unsigned int verify_patch_size(int cpu, u32 patch_size, case 0x15: max_size = F15H_MPB_MAX_SIZE; break; + case 0x16: + max_size = F16H_MPB_MAX_SIZE; + break; default: max_size = F1XH_MPB_MAX_SIZE; break; -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86, microcode, AMD: Add support for family 16h processors
On 11/15/2012 03:45 PM, Henrique de Moraes Holschuh wrote: On Thu, 15 Nov 2012, Boris Ostrovsky wrote: Add valid patch size for family 16h processors Signed-off-by: Boris Ostrovsky Is this something that needs to go to -stable ? #define F1XH_MPB_MAX_SIZE 2048 #define F14H_MPB_MAX_SIZE 1824 #define F15H_MPB_MAX_SIZE 4096 +#define F16H_MPB_MAX_SIZE 3458 switch (c->x86) { case 0x14: @@ -198,6 +199,9 @@ static unsigned int verify_patch_size(int cpu, u32 patch_size, case 0x15: max_size = F15H_MPB_MAX_SIZE; break; + case 0x16: + max_size = F16H_MPB_MAX_SIZE; + break; default: max_size = F1XH_MPB_MAX_SIZE; break; Because it looks like without this patch, some valid microcode updates would be rejected by the kernel... Right, patch loading will fail. I wasn't sure whether stable would be appropriate since this is support for new HW. OTOH since this would result in loss of functionality one could consider this a bug. -boris -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86, microcode, AMD: Add support for family 16h processors
On 11/15/2012 06:01 PM, Gene Heskett wrote: On Thursday 15 November 2012, Henrique de Moraes Holschuh wrote: On Thu, 15 Nov 2012, Boris Ostrovsky wrote: Add valid patch size for family 16h processors Signed-off-by: Boris Ostrovsky Is this something that needs to go to -stable ? IMO, and if I had an oar in this water, yes. Its been missing since the Intel folks started playing with it a couple years back up the log. I have the amd_ucode files in my /lib/firmware tree, root@coyote:/opt/os9# ls -l /lib/firmware/amd-ucode/ total 76 -rw-r--r-- 1 root root 642 2012-01-17 11:50 INSTALL -rw-r--r-- 1 root root 9987 2012-01-17 11:50 LICENSE -rw-r--r-- 1 root root 12404 2012-01-17 11:50 microcode_amd.bin -rw-r--r-- 1 root root 1526 2012-01-17 11:50 microcode_amd.bin.README -rw-r--r-- 1 root root 2644 2012-01-17 11:50 microcode_amd_fam15h.bin -rw-r--r-- 1 root root 510 2012-01-17 11:50 microcode_amd_fam15h.bin.README -rw-r--r-- 1 root root 2012 2009-01-20 04:48 microcode_amd.phenom-V83 -rw-r--r-- 1 root root 15020 2012-01-17 11:50 microcode_amd_solaris.bin -rw-r--r-- 1 root root 1685 2012-01-17 11:50 microcode_amd_solaris.bin.README -rw-r--r-- 1 root root 6227 2012-01-17 11:50 README but I can't recall the last time I saw the code sign in during dmesg. Old, slow 4 core phenom here. AMD was forgotten about when the loading of it was moved from the kernel options to /etc/init.d/microcode. For an AMD user, that was not a show stopper, but it wasn't a Good Thing(TM) either. One possibility is that BIOS already incorporated all patches (which typically is the case) and so the driver doesn't have to do anything. -boris #define F1XH_MPB_MAX_SIZE 2048 #define F14H_MPB_MAX_SIZE 1824 #define F15H_MPB_MAX_SIZE 4096 +#define F16H_MPB_MAX_SIZE 3458 switch (c->x86) { case 0x14: @@ -198,6 +199,9 @@ static unsigned int verify_patch_size(int cpu, u32 patch_size, case 0x15: max_size = F15H_MPB_MAX_SIZE; break; + case 0x16: + max_size = F16H_MPB_MAX_SIZE; + break; default: max_size = F1XH_MPB_MAX_SIZE; break; Because it looks like without this patch, some valid microcode updates would be rejected by the kernel... Cheers, Gene -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] MAINTAINERS: Add in two extra co-maintainers of the Xen tree.
On 08/05/2013 02:05 PM, Konrad Rzeszutek Wilk wrote: Both Boris and David have graciously volunteered to help in maintaining the Xen subsystem tree. Cementing this in the MAINTAINERS file so they are copied on Xen related patches. CC: Boris Ostrovsky CC: David Vrabel Signed-off-by: Konrad Rzeszutek Wilk Acked-by: Boris Ostrovsky --- MAINTAINERS | 2 ++ 1 file changed, 2 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 785f56a..0161dad 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9237,6 +9237,8 @@ F:drivers/media/tuners/tuner-xc2028.* XEN HYPERVISOR INTERFACE M:Konrad Rzeszutek Wilk +M: Boris Ostrovsky +M: David Vrabel L:xen-de...@lists.xenproject.org (moderated for non-subscribers) S:Supported F:arch/x86/xen/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [tip:x86/urgent] x86, mm: Patch out arch_flush_lazy_mmu_mode() when running on bare metal
On 04/10/2013 08:30 PM, tip-bot for Boris Ostrovsky wrote: Commit-ID: 511ba86e1d386f671084b5d0e6f110bb30b8eeb2 Gitweb: http://git.kernel.org/tip/511ba86e1d386f671084b5d0e6f110bb30b8eeb2 Author: Boris Ostrovsky AuthorDate: Sat, 23 Mar 2013 09:36:36 -0400 Committer: H. Peter Anvin CommitDate: Wed, 10 Apr 2013 11:25:10 -0700 x86, mm: Patch out arch_flush_lazy_mmu_mode() when running on bare metal Invoking arch_flush_lazy_mmu_mode() results in calls to preempt_enable()/disable() which may have performance impact. Since lazy MMU is not used on bare metal we can patch away arch_flush_lazy_mmu_mode() so that it is never called in such environment. [ hpa: the previous patch "Fix vmalloc_fault oops during lazy MMU updates" may cause a minor performance regression on bare metal. This patch resolves that performance regression. It is somewhat unclear to me if this is a good -stable candidate. ] I think this https://lkml.org/lkml/2013/2/26/420 was also part of lazy mmu set of patches but is missing in the latest batch of commits. -boris -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RESEND] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
When CONFIG_DEBUG_PAGEALLOC is set, page table updates made by kernel_map_pages() are not made visible (via TLB flush) immediately if lazy MMU is on. In environments that support lazy MMU (e.g. Xen) this may lead to fatal page faults, for example, when zap_pte_range() needs to allocate pages in __tlb_remove_page() -> tlb_next_batch(). Signed-off-by: Boris Ostrovsky --- arch/x86/mm/pageattr.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index 091934e..2ccbe0b 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -1413,6 +1413,8 @@ void kernel_map_pages(struct page *page, int numpages, int enable) * but that can deadlock->flush only current cpu: */ __flush_tlb_all(); + + arch_flush_lazy_mmu_mode(); } #ifdef CONFIG_HIBERNATION -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
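Laid out as a call chain, the failure mode described in the commit message looks roughly like this (a sketch of the scenario, not literal code):

/*
 * Xen PV guest, CONFIG_DEBUG_PAGEALLOC=y:
 *
 * zap_pte_range()
 *   arch_enter_lazy_mmu_mode()     <- PTE updates are batched from here
 *   __tlb_remove_page()
 *     tlb_next_batch()
 *       alloc_pages()              <- DEBUG_PAGEALLOC re-maps the new page
 *         kernel_map_pages()       <- mapping update queued, not applied
 *   ...first touch of the batch page faults fatally...
 *
 * With the patch, kernel_map_pages() ends in arch_flush_lazy_mmu_mode(),
 * so the queued update is committed before the page is used.
 */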
Re: 3.6.11 AMD-Vi: Completion-Wait loop timed out
On 01/20/2013 06:57 AM, Jörg Rödel wrote: On Sun, Jan 20, 2013 at 12:48:28PM +0100, Borislav Petkov wrote: On Sun, Jan 20, 2013 at 12:40:11PM +0100, Jörg Rödel wrote: Yes, the BIOS vendor can fix this issue. They need to disable NB clock gating for the IOMMU. Right, Udo, you can try Gigabyte first. Btw, can't we add a quirk to disable NB clock gating? Maybe Boris and Jacob could help. CCed. Are you talking about erratum 746? -boris -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Xen-devel] kernel 3.7+ cpufreq regression on AMD system running as dom0
On 01/18/2013 02:00 PM, Konrad Rzeszutek Wilk wrote: So something like this in the hypervisor maybe (not even tested): diff --git a/xen/arch/x86/acpi/cpufreq/powernow.c b/xen/arch/x86/acpi/cpufreq/powernow.c index a9b7792..54e7808 100644 --- a/xen/arch/x86/acpi/cpufreq/powernow.c +++ b/xen/arch/x86/acpi/cpufreq/powernow.c @@ -146,7 +146,40 @@ static int powernow_cpufreq_target(struct cpufreq_policy *policy, return 0; } +#define MSR_AMD_PSTATE_DEF_BASE 0xc0010064 +static void amd_fixup_frequency(struct xen_processor_px *px, int i) +{ + u32 hi, lo, fid, did; + int index = px->control & 0x0007; + + if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD) + return; + + if ((boot_cpu_data.x86 == 0x10 && boot_cpu_data.x86_model < 10) + || boot_cpu_data.x86 == 0x11) { + rdmsr(MSR_AMD_PSTATE_DEF_BASE + index, lo, hi); +/* Bit 63 indicates whether contents are valid */ +if (!(hi & 0x8000)) +return; + + fid = lo & 0x3f; + did = (lo >> 6) & 7; + if (boot_cpu_data.x86 == 0x10) + px->core_frequency = (100 * (fid + 0x10)) >> did; + else + px->core_frequency = (100 * (fid + 8)) >> did; + } +} + +static void amd_fixup_freq(struct processor_performance *perf) +{ +int i; + +for (i = 0; i < perf->state_count; i++) +amd_fixup_frequency(perf->states, i); amd_fixup_frequency(&perf->states[i]) will walk P-states. + +} static int powernow_cpufreq_verify(struct cpufreq_policy *policy) { struct acpi_cpufreq_data *data; @@ -158,6 +191,8 @@ static int powernow_cpufreq_verify(struct cpufreq_policy *policy) perf = &processor_pminfo[policy->cpu]->perf; +amd_fixup_freq(perf); Maybe move to powernow_cpufreq_cpu_init(), although this works too. -boris + cpufreq_verify_within_limits(policy, 0, perf->states[perf->platform_limit].core_frequency * 1000); ___ Xen-devel mailing list xen-de...@lists.xen.org http://lists.xen.org/xen-devel -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
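A sketch of amd_fixup_freq() with the indexing fix suggested inline above applied; note also that the bit-63 test on hi reads like a truncated 0x80000000, since hi holds the upper 32 bits of the MSR:

static void amd_fixup_freq(struct processor_performance *perf)
{
    int i;

    /* pass the i-th P-state, not the array base (which would fix up
     * states[0] on every iteration) */
    for (i = 0; i < perf->state_count; i++)
        amd_fixup_frequency(&perf->states[i], i);
}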
Re: 3.6.11 AMD-Vi: Completion-Wait loop timed out
On 01/22/2013 09:13 AM, Udo van den Heuvel wrote: Gigabyte demonstrate that using ESX 5i IOMMU works fine. (with pictures attached). There are no attachments to your message. I am not sure that 5i supports IOMMU (but I may well be wrong). What can we bring against that? How reproducible is the problem that you are seeing? -boris -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.6.11 AMD-Vi: Completion-Wait loop timed out
On 01/22/2013 10:27 AM, Udo van den Heuvel wrote: On 2013-01-22 15:36, Boris Ostrovsky wrote: Gigabyte demonstrate that using ESX 5i IOMMU works fine. (with pictures attached). There are no attachments to your message. Correct, Gigabyte did send them via their support web-interface. Do you need to see them? They just show IOMMU enabled or similar. No, I thought you ran this yourself. What can we bring against that? How reproducible is the problem that you are seeing? Seen once over here. Correlated with raid-check. Then the answer from Gigabyte doesn't prove anything. You can also boot Linux without seeing this problem in most cases. Your BIOS does not have the required erratum workaround. We will provide a patch to close that hole but since the problem is not easily reproducible (and the erratum is also not easy to trigger) it may be difficult to say whether it really helped with your problem. -boris -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Xen-devel] kernel 3.7+ cpufreq regression on AMD system running as dom0
On 01/18/2013 02:00 PM, Konrad Rzeszutek Wilk wrote: Right, that information is gathered from the MSRs. I think the Xen would need to do this since it can do the MSRs correctly and modify the P-states. So something like this in the hypervisor maybe (not even tested): Is there any harm in allowing dom0 read P-state registers? Something along these lines: diff -r 40881d58e991 xen/arch/x86/traps.c --- a/xen/arch/x86/traps.c Thu Jan 17 14:47:04 2013 -0500 +++ b/xen/arch/x86/traps.c Fri Jan 18 09:32:51 2013 -0500 @@ -2535,7 +2535,7 @@ static int emulate_privileged_op(struct case MSR_K8_PSTATE7: if ( boot_cpu_data.x86_vendor != X86_VENDOR_AMD ) goto fail; -if ( !is_cpufreq_controller(v->domain) ) +if ( d->domain_id != 0 ) { regs->eax = regs->edx = 0; break; (It does seem to fix the bug too) -boris diff --git a/xen/arch/x86/acpi/cpufreq/powernow.c b/xen/arch/x86/acpi/cpufreq/powernow.c index a9b7792..54e7808 100644 --- a/xen/arch/x86/acpi/cpufreq/powernow.c +++ b/xen/arch/x86/acpi/cpufreq/powernow.c @@ -146,7 +146,40 @@ static int powernow_cpufreq_target(struct cpufreq_policy *policy, return 0; } +#define MSR_AMD_PSTATE_DEF_BASE 0xc0010064 +static void amd_fixup_frequency(struct xen_processor_px *px, int i) +{ + u32 hi, lo, fid, did; + int index = px->control & 0x0007; + + if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD) + return; + + if ((boot_cpu_data.x86 == 0x10 && boot_cpu_data.x86_model < 10) + || boot_cpu_data.x86 == 0x11) { + rdmsr(MSR_AMD_PSTATE_DEF_BASE + index, lo, hi); +/* Bit 63 indicates whether contents are valid */ +if (!(hi & 0x8000)) +return; + + fid = lo & 0x3f; + did = (lo >> 6) & 7; + if (boot_cpu_data.x86 == 0x10) + px->core_frequency = (100 * (fid + 0x10)) >> did; + else + px->core_frequency = (100 * (fid + 8)) >> did; + } +} + +static void amd_fixup_freq(struct processor_performance *perf) +{ +int i; + +for (i = 0; i < perf->state_count; i++) +amd_fixup_frequency(perf->states, i); + +} static int powernow_cpufreq_verify(struct cpufreq_policy *policy) { struct acpi_cpufreq_data *data; @@ -158,6 +191,8 @@ static int powernow_cpufreq_verify(struct cpufreq_policy *policy) perf = &processor_pminfo[policy->cpu]->perf; +amd_fixup_freq(perf); + cpufreq_verify_within_limits(policy, 0, perf->states[perf->platform_limit].core_frequency * 1000); ___ Xen-devel mailing list xen-de...@lists.xen.org http://lists.xen.org/xen-devel -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] xen/pvhvm: Support more than 32 VCPUs when migrating (v3).
On 07/21/2016 10:14 AM, Konrad Rzeszutek Wilk wrote: > On Fri, Jul 10, 2015 at 02:57:51PM -0400, Konrad Rzeszutek Wilk wrote: >> On Fri, Jul 10, 2015 at 02:37:46PM -0400, Konrad Rzeszutek Wilk wrote: >>> When Xen migrates an HVM guest, by default its shared_info can >>> only hold up to 32 CPUs. As such the hypercall >>> VCPUOP_register_vcpu_info was introduced which allowed us to >>> setup per-page areas for VCPUs. This means we can boot PVHVM >>> guest with more than 32 VCPUs. During migration the per-cpu >>> structure is allocated freshly by the hypervisor (vcpu_info_mfn >>> is set to INVALID_MFN) so that the newly migrated guest >>> can make an VCPUOP_register_vcpu_info hypercall. >>> >>> Unfortunatly we end up triggering this condition in Xen: >>> /* Run this command on yourself or on other offline VCPUS. */ >>> if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) ) >>> >>> which means we are unable to setup the per-cpu VCPU structures >>> for running vCPUS. The Linux PV code paths make this work by >>> iterating over every vCPU with: >>> >>> 1) is target CPU up (VCPUOP_is_up hypercall?) >>> 2) if yes, then VCPUOP_down to pause it. >>> 3) VCPUOP_register_vcpu_info >>> 4) if it was down, then VCPUOP_up to bring it back up >>> >>> But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are >>> not allowed on HVM guests we can't do this. However with the >>> Xen git commit f80c5623a126afc31e6bb9382268d579f0324a7a >>> ("xen/x86: allow HVM guests to use hypercalls to bring up vCPUs"") >> I was in my local tree which was Roger's 'hvm_without_dm_v3' >> looking at patches and spotted this - and thought it was already in! >> >> Sorry about this patch - and please ignore it until the VCPU_op* >> can be used by HVM guests. > The corresponding patch is in Xen tree > (192df6f9122ddebc21d0a632c10da3453aeee1c2) > > Could folks take a look at the patch pls? > > Without it you can't migrate an Linux guest with more than 32 vCPUs. > >>> we can do this. As such first check if VCPUOP_is_up is actually >>> possible before trying this dance. >>> >>> As most of this dance code is done already in 'xen_setup_vcpu' >>> lets make it callable on both PV and HVM. This means moving one >>> of the checks out to 'xen_setup_runstate_info'. >>> >>> Signed-off-by: Konrad Rzeszutek Wilk >>> --- >>> arch/x86/xen/enlighten.c | 23 +-- >>> arch/x86/xen/suspend.c | 7 +-- >>> arch/x86/xen/time.c | 3 +++ >>> 3 files changed, 21 insertions(+), 12 deletions(-) >>> >>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c >>> index 46957ea..187dec6 100644 >>> --- a/arch/x86/xen/enlighten.c >>> +++ b/arch/x86/xen/enlighten.c >>> @@ -238,12 +238,23 @@ static void xen_vcpu_setup(int cpu) >>> void xen_vcpu_restore(void) >>> { >>> int cpu; >>> + bool vcpuops = true; >>> + const struct cpumask *mask; >>> >>> - for_each_possible_cpu(cpu) { >>> + mask = xen_pv_domain() ? cpu_possible_mask : cpu_online_mask; >>> + >>> + /* Only Xen 4.5 and higher supports this. */ >>> + if (HYPERVISOR_vcpu_op(VCPUOP_is_up, smp_processor_id(), NULL) == >>> -ENOSYS) >>> + vcpuops = false; >>> + >>> + for_each_cpu(cpu, mask) { >>> bool other_cpu = (cpu != smp_processor_id()); >>> - bool is_up = HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL); >>> + bool is_up = false; >>> >>> - if (other_cpu && is_up && >>> + if (vcpuops) >>> + is_up = HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL); You can just say is_up = (HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL) > 0); and then you won't need vcpuops bool. 
>>> + >>> + if (vcpuops && other_cpu && is_up && >>> HYPERVISOR_vcpu_op(VCPUOP_down, cpu, NULL)) >>> BUG(); >>> >>> @@ -252,7 +263,7 @@ void xen_vcpu_restore(void) >>> if (have_vcpu_info_placement) >>> xen_vcpu_setup(cpu); >>> >>> - if (other_cpu && is_up && >>> + if (vcpuops && other_cpu && is_up && >>> HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL)) >>> BUG(); >>> } >>> @@ -1704,8 +1715,8 @@ void __ref xen_hvm_init_shared_info(void) >>> * in that case multiple vcpus might be online. */ >>> for_each_online_cpu(cpu) { >>> /* Leave it to be NULL. */ >>> - if (cpu >= MAX_VIRT_CPUS) >>> - continue; >>> + if (cpu >= MAX_VIRT_CPUS && cpu <= NR_CPUS) >>> + per_cpu(xen_vcpu, cpu) = NULL; /* Triggers >>> xen_vcpu_setup.*/ >>> per_cpu(xen_vcpu, cpu) = >>> &HYPERVISOR_shared_info->vcpu_info[cpu]; I don't think I understand this change. Can you have cpu > NR_CPUS? And isn't per_cpu(xen_vcpu, cpu) NULL already (as the comment at the top suggests)? -boris
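Boris's suggested simplification, spelled out against the quoted loop (a sketch: VCPUOP_is_up returns a negative errno such as -ENOSYS where unsupported and 0/1 otherwise, so a single comparison folds the feature check into the state check):

	for_each_cpu(cpu, mask) {
		bool other_cpu = (cpu != smp_processor_id());
		/* <0 (e.g. -ENOSYS on pre-4.5 Xen) and 0 (down) both mean "not up" */
		bool is_up = (HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL) > 0);

		if (other_cpu && is_up &&
		    HYPERVISOR_vcpu_op(VCPUOP_down, cpu, NULL))
			BUG();

		/* ... xen_setup_runstate_info(cpu), xen_vcpu_setup(cpu) ... */

		if (other_cpu && is_up &&
		    HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL))
			BUG();
	}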
[PATCH] xen/PMU: Log VPMU initialization error at lower level
This will match how PMU errors are reported at check_hw_exists()'s msr_fail label, which is reached when VPMU initialization fails. Signed-off-by: Boris Ostrovsky --- arch/x86/xen/pmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c index 9466354..32bdc2c 100644 --- a/arch/x86/xen/pmu.c +++ b/arch/x86/xen/pmu.c @@ -547,7 +547,7 @@ void xen_pmu_init(int cpu) return; fail: - pr_warn_once("Could not initialize VPMU for cpu %d, error %d\n", + pr_info_once("Could not initialize VPMU for cpu %d, error %d\n", cpu, err); free_pages((unsigned long)xenpmu_data, 0); } -- 1.8.1.4
Re: [PATCH] xen/pciback: Fix conf_space read/write overlap check.
On 06/21/2016 10:37 AM, Andrey Grodzovsky wrote: > The current overlap check evaluates to false in a case where a filter field > is fully contained in (is a proper subset of) a r/w request. > This change applies the classical overlap check instead to cover > all the scenarios. > > Related to https://www.mail-archive.com/xen-devel@lists.xen.org/msg72174.html > > Cc: Jan Beulich > Cc: Boris Ostrovsky > Cc: sta...@vger.kernel.org > Signed-off-by: Andrey Grodzovsky + David and Juergen (maintainers) and kernel list. Reviewed-by: Boris Ostrovsky > --- > drivers/xen/xen-pciback/conf_space.c | 6 ++ > 1 file changed, 2 insertions(+), 4 deletions(-) > > diff --git a/drivers/xen/xen-pciback/conf_space.c > b/drivers/xen/xen-pciback/conf_space.c > index 8e67336..6a25533 100644 > --- a/drivers/xen/xen-pciback/conf_space.c > +++ b/drivers/xen/xen-pciback/conf_space.c > @@ -183,8 +183,7 @@ int xen_pcibk_config_read(struct pci_dev *dev, int > offset, int size, > field_start = OFFSET(cfg_entry); > field_end = OFFSET(cfg_entry) + field->size; > > - if ((req_start >= field_start && req_start < field_end) > - || (req_end > field_start && req_end <= field_end)) { > + if (req_end > field_start && field_end > req_start) { > err = conf_space_read(dev, cfg_entry, field_start, > &tmp_val); > if (err) > @@ -230,8 +229,7 @@ int xen_pcibk_config_write(struct pci_dev *dev, int > offset, int size, u32 value) > field_start = OFFSET(cfg_entry); > field_end = OFFSET(cfg_entry) + field->size; > > - if ((req_start >= field_start && req_start < field_end) > - || (req_end > field_start && req_end <= field_end)) { > + if (req_end > field_start && field_end > req_start) { > tmp_val = 0; > > err = xen_pcibk_config_read(dev, field_start,
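The classical check being applied here treats both ranges as half-open intervals: they overlap iff each interval starts before the other one ends. A standalone illustration (hypothetical values):

#include <stdbool.h>
#include <stdio.h>

/* [req_start, req_end) overlaps [field_start, field_end) iff each
 * interval starts before the other one ends */
static bool overlaps(int req_start, int req_end,
		     int field_start, int field_end)
{
	return req_end > field_start && field_end > req_start;
}

int main(void)
{
	/* the case the old check missed: field [4,6) inside request [0,8) */
	printf("%d\n", overlaps(0, 8, 4, 6));	/* prints 1 */
	/* adjacent ranges do not overlap */
	printf("%d\n", overlaps(0, 4, 4, 6));	/* prints 0 */
	return 0;
}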
Re: [PATCH v3 1/3] xen/pciback: simplify pcistub device handling
On 09/22/2016 04:45 AM, Juergen Gross wrote: > The Xen pciback driver maintains a list of all its seized devices. > There are two functions searching the list for a specific device with > basically the same semantics just returning different structures in > case of a match. > > Split out the search function. > > Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH v3 2/3] xen/pciback: avoid multiple entries in slot list
On 09/22/2016 04:45 AM, Juergen Gross wrote: > The Xen pciback driver has a list of all pci devices it is ready to > seize. There is no check whether a to be added entry already exists. > While this might be no problem in the common case it might confuse > those which consume the list via sysfs. > > Modify the handling of this list by not adding an entry which already > exists. As this will be needed later split out the list handling into > a separate function. > > Signed-off-by: Juergen Gross > --- > drivers/xen/xen-pciback/pci_stub.c | 39 > ++ > 1 file changed, 31 insertions(+), 8 deletions(-) > > diff --git a/drivers/xen/xen-pciback/pci_stub.c > b/drivers/xen/xen-pciback/pci_stub.c > index 79a9e4d..0179333 100644 > --- a/drivers/xen/xen-pciback/pci_stub.c > +++ b/drivers/xen/xen-pciback/pci_stub.c > @@ -478,6 +478,36 @@ static int __init pcistub_init_devices_late(void) > return 0; > } > > +static void pcistub_device_id_add_list(struct pcistub_device_id *new, > +int domain, int bus, unsigned int devfn) > +{ > + struct pcistub_device_id *pci_dev_id; > + unsigned long flags; > + int found = 0; > + > + spin_lock_irqsave(&device_ids_lock, flags); > + > + list_for_each_entry(pci_dev_id, &pcistub_device_ids, slot_list) { > + if (pci_dev_id->domain == domain && pci_dev_id->bus == bus && > + pci_dev_id->devfn == devfn) { > + found = 1; > + break; > + } > + } > + > + if (!found) { > + new->domain = domain; > + new->bus = bus; > + new->devfn = devfn; > + list_add_tail(&new->slot_list, &pcistub_device_ids); > + } > + > + spin_unlock_irqrestore(&device_ids_lock, flags); > + > + if (found) > + kfree(new); I'd rather free 'new' in the caller (who allocated it) and return something like -EEXIST if device is already on the list. -boris > +} > + > static int pcistub_seize(struct pci_dev *dev) > { > struct pcistub_device *psdev; > @@ -1012,7 +1042,6 @@ static inline int str_to_quirk(const char *buf, int > *domain, int *bus, int > static int pcistub_device_id_add(int domain, int bus, int slot, int func) > { > struct pcistub_device_id *pci_dev_id; > - unsigned long flags; > int rc = 0, devfn = PCI_DEVFN(slot, func); > > if (slot < 0) { > @@ -1042,16 +1071,10 @@ static int pcistub_device_id_add(int domain, int bus, > int slot, int func) > if (!pci_dev_id) > return -ENOMEM; > > - pci_dev_id->domain = domain; > - pci_dev_id->bus = bus; > - pci_dev_id->devfn = devfn; > - > pr_debug("wants to seize %04x:%02x:%02x.%d\n", >domain, bus, slot, func); > > - spin_lock_irqsave(&device_ids_lock, flags); > - list_add_tail(&pci_dev_id->slot_list, &pcistub_device_ids); > - spin_unlock_irqrestore(&device_ids_lock, flags); > + pcistub_device_id_add_list(pci_dev_id, domain, bus, devfn); > > return 0; > }
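A sketch of the variant suggested in the review above, with the duplicate reported as -EEXIST and the kmalloc()'d entry left for the caller to free (a hypothetical rework, not part of the posted series):

static int pcistub_device_id_add_list(struct pcistub_device_id *new,
				      int domain, int bus, unsigned int devfn)
{
	struct pcistub_device_id *pci_dev_id;
	unsigned long flags;

	spin_lock_irqsave(&device_ids_lock, flags);

	list_for_each_entry(pci_dev_id, &pcistub_device_ids, slot_list) {
		if (pci_dev_id->domain == domain && pci_dev_id->bus == bus &&
		    pci_dev_id->devfn == devfn) {
			spin_unlock_irqrestore(&device_ids_lock, flags);
			return -EEXIST;		/* caller frees 'new' */
		}
	}

	new->domain = domain;
	new->bus = bus;
	new->devfn = devfn;
	list_add_tail(&new->slot_list, &pcistub_device_ids);

	spin_unlock_irqrestore(&device_ids_lock, flags);

	return 0;
}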
Re: [PATCH v3 3/3] xen/pciback: support driver_override
On 09/22/2016 04:45 AM, Juergen Gross wrote: > Support the driver_override scheme introduced with commit 782a985d7af2 > ("PCI: Introduce new device binding path using pci_dev.driver_override") > > As pcistub_probe() is called for all devices (it has to check for a > match based on the slot address rather than device type) it has to > check for driver_override set to "pciback" itself. > > Up to now for assigning a pci device to pciback you need something like: > > echo :07:10.0 > /sys/bus/pci/devices/\:07\:10.0/driver/unbind > echo :07:10.0 > /sys/bus/pci/drivers/pciback/new_slot > echo :07:10.0 > /sys/bus/pci/drivers_probe > > while with the patch you can use the same mechanism as for similar > drivers like pci-stub and vfio-pci: > > echo pciback > /sys/bus/pci/devices/\:07\:10.0/driver_override > echo :07:10.0 > /sys/bus/pci/devices/\:07\:10.0/driver/unbind > echo :07:10.0 > /sys/bus/pci/drivers_probe > > So e.g. libvirt doesn't need special handling for pciback. > > Signed-off-by: Juergen Gross > --- > V3: Move override check up in order not to miss the PCI_HEADER_TYPE > check. > Add an assigned device to the slot list. > > V2: Removed now unused label > --- > drivers/xen/xen-pciback/pci_stub.c | 36 +--- > 1 file changed, 29 insertions(+), 7 deletions(-) > > diff --git a/drivers/xen/xen-pciback/pci_stub.c > b/drivers/xen/xen-pciback/pci_stub.c > index 0179333..6331a95 100644 > --- a/drivers/xen/xen-pciback/pci_stub.c > +++ b/drivers/xen/xen-pciback/pci_stub.c > @@ -25,6 +25,8 @@ > #include "conf_space.h" > #include "conf_space_quirks.h" > > +#define PCISTUB_DRIVER_NAME "pciback" > + > static char *pci_devs_to_hide; > wait_queue_head_t xen_pcibk_aer_wait_queue; > /*Add sem for sync AER handling and xen_pcibk remove/reconfigue ops, > @@ -508,15 +510,18 @@ static void pcistub_device_id_add_list(struct > pcistub_device_id *new, > kfree(new); > } > > -static int pcistub_seize(struct pci_dev *dev) > +static int pcistub_seize(struct pci_dev *dev, > + struct pcistub_device_id *pci_dev_id) > { > struct pcistub_device *psdev; > unsigned long flags; > int err = 0; > > psdev = pcistub_device_alloc(dev); > - if (!psdev) > + if (!psdev) { > + kfree(pci_dev_id); Here too --- I'd rather have it freed by whoever allocated it (i.e. the caller). In fact I think you can allocate this here and instead of passing pci_dev_id pointer pass a flag of some sort. -boris > return -ENOMEM; > + } > > spin_lock_irqsave(&pcistub_devices_lock, flags); > > @@ -537,8 +542,12 @@ static int pcistub_seize(struct pci_dev *dev) > > spin_unlock_irqrestore(&pcistub_devices_lock, flags); > > - if (err) > + if (err) { > + kfree(pci_dev_id); > pcistub_device_put(psdev); > + } else if (pci_dev_id) > + pcistub_device_id_add_list(pci_dev_id, pci_domain_nr(dev->bus), > +dev->bus->number, dev->devfn); > > return err; > } > @@ -547,11 +556,16 @@ static int pcistub_seize(struct pci_dev *dev) > * other functions that take the sysfs lock. 
*/ > static int pcistub_probe(struct pci_dev *dev, const struct pci_device_id *id) > { > - int err = 0; > + int err = 0, match; > + struct pcistub_device_id *pci_dev_id = NULL; > > dev_dbg(&dev->dev, "probing...\n"); > > - if (pcistub_match(dev)) { > + match = pcistub_match(dev); > + > + if ((dev->driver_override && > + !strcmp(dev->driver_override, PCISTUB_DRIVER_NAME)) || > + match) { > > if (dev->hdr_type != PCI_HEADER_TYPE_NORMAL > && dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) { > @@ -562,8 +576,16 @@ static int pcistub_probe(struct pci_dev *dev, const > struct pci_device_id *id) > goto out; > } > > + if (!match) { > + pci_dev_id = kmalloc(sizeof(*pci_dev_id), GFP_ATOMIC); > + if (!pci_dev_id) { > + err = -ENOMEM; > + goto out; > + } > + } > + > dev_info(&dev->dev, "seizing device\n"); > - err = pcistub_seize(dev); > + err = pcistub_seize(dev, pci_dev_id); > } else > /* Didn't find the device */ > err = -ENODEV; > @@ -975,7 +997,7 @@ static const struct pci_error_handlers > xen_pcibk_error_handler = { > static struct pci_driver xen_pcibk_pci_driver = { > /* The name should be xen_pciback, but until the tools are updated >* we will keep it as pciback. */ > - .name = "pciback", > + .name = PCISTUB_DRIVER_NAME, > .id_table = pcistub_ids, > .probe = pcistub_probe, > .remove = pcistub_remove,
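A sketch of the allocate-inside-pcistub_seize() variant Boris suggests, passing a flag instead of a pointer so the memory never changes owner (an assumed shape, not the final patch):

static int pcistub_seize(struct pci_dev *dev, bool add_id)
{
        struct pcistub_device_id *pci_dev_id = NULL;
        struct pcistub_device *psdev;
        int err = 0;

        if (add_id) {
                /* allocated, and on failure freed, entirely in this function */
                pci_dev_id = kmalloc(sizeof(*pci_dev_id), GFP_ATOMIC);
                if (!pci_dev_id)
                        return -ENOMEM;
        }

        psdev = pcistub_device_alloc(dev);
        if (!psdev) {
                kfree(pci_dev_id);
                return -ENOMEM;
        }

        /* ... unchanged body of pcistub_seize() ... */

        if (err) {
                kfree(pci_dev_id);
                pcistub_device_put(psdev);
        } else if (pci_dev_id)
                pcistub_device_id_add_list(pci_dev_id, pci_domain_nr(dev->bus),
                                           dev->bus->number, dev->devfn);

        return err;
}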
[PATCH] perf/bts: Don't try handling BTS interrupt if BTS is not enabled
Otherwise we will try to dereference ds which has not been allocated. Signed-off-by: Boris Ostrovsky --- arch/x86/events/intel/bts.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c index bdcd651..1f5657f 100644 --- a/arch/x86/events/intel/bts.c +++ b/arch/x86/events/intel/bts.c @@ -451,6 +451,9 @@ int intel_bts_interrupt(void) s64 old_head; int err = -ENOSPC, handled = 0; + if (!x86_pmu.bts) + return 0; + /* * The only surefire way of knowing if this NMI is ours is by checking * the write ptr against the PMI threshold. -- 1.8.3.1
Re: [tip:smp/hotplug 5/6] arch/x86/xen/enlighten.c:1522:2: error: implicit declaration of function 'xen_pvh_early_cpu_init'
On 07/28/2016 12:14 PM, kbuild test robot wrote: tree: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/hotplug head: c713c8cb2055f5f3b32ee4315be589177a2658cc commit: 854e9fa5a56a9771fad4701a427e4844d2cbade1 [5/6] xen/x86: Define stubs for xen_smp_intr_init/xen_smp_intr_free config: x86_64-randconfig-x011-201630 (attached as .config) compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705 reproduce: git checkout 854e9fa5a56a9771fad4701a427e4844d2cbade1 # save the attached .config to linux build tree make ARCH=x86_64 All errors (new ones prefixed by >>): arch/x86/xen/enlighten.c: In function 'xen_pvh_early_guest_init': arch/x86/xen/enlighten.c:1522:2: error: implicit declaration of function 'xen_pvh_early_cpu_init' [-Werror=implicit-function-declaration] xen_pvh_early_cpu_init(0, false); ^~ Thomas, So I clearly missed the PVH case and I see that the patches were reverted from the branch. Do you want me to take them through Xen tree so you won't have to deal with breakage? If yes I will assume your ACK for the second patch http://marc.info/?i=1458221613-21563-3-git-send-email-boris.ostrovsky%40oracle.com Sorry about that. -boris cc1: some warnings being treated as errors vim +/xen_pvh_early_cpu_init +1522 arch/x86/xen/enlighten.c d285d683 Mukesh Rathor 2013-12-13 1516 c9f6e997 Roger Pau Monne 2014-01-20 1517 if (!xen_feature(XENFEAT_hvm_callback_vector)) c9f6e997 Roger Pau Monne 2014-01-20 1518 return; c9f6e997 Roger Pau Monne 2014-01-20 1519 d285d683 Mukesh Rathor 2013-12-13 1520 xen_have_vector_callback = 1; a2ef5dc2 Mukesh Rathor 2014-09-10 1521 a2ef5dc2 Mukesh Rathor 2014-09-10 @1522 xen_pvh_early_cpu_init(0, false); c9f6e997 Roger Pau Monne 2014-01-20 1523 xen_pvh_set_cr_flags(0); d285d683 Mukesh Rathor 2013-12-13 1524 d285d683 Mukesh Rathor 2013-12-13 1525 #ifdef CONFIG_X86_32 :: The code at line 1522 was first introduced by commit :: a2ef5dc2c7cbedbeb4c847039845afaea5e63745 x86/xen: Set EFER.NX and EFER.SCE in PVH guests :: TO: Mukesh Rathor :: CC: David Vrabel --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
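For reference, the usual shape of the fix for this class of error is a stub that compiles away when the option is off, along the lines of (a sketch; the actual header and signature should match what arch/x86/xen code declares):

#ifdef CONFIG_XEN_PVH
void xen_pvh_early_cpu_init(int cpu, bool entry);
#else
static inline void xen_pvh_early_cpu_init(int cpu, bool entry)
{
}
#endif

That keeps the unconditional call site in xen_pvh_early_guest_init() building whether or not CONFIG_XEN_PVH is set.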
Re: [PATCH] xen: rename xen_pmu_init() in sys-hypervisor.c
On 08/01/2016 07:40 AM, Juergen Gross wrote: > There are two functions with name xen_pmu_init() in the kernel. Rename > the one in drivers/xen/sys-hypervisor.c to avoid shadowing the one in > arch/x86/xen/pmu.c > > Signed-off-by: Juergen Gross While at it, how about changing xen_properties_init to xen_sysfs_properties_init as well for consistency? -boris > --- > drivers/xen/sys-hypervisor.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/xen/sys-hypervisor.c b/drivers/xen/sys-hypervisor.c > index 6881b3c..17ea846 100644 > --- a/drivers/xen/sys-hypervisor.c > +++ b/drivers/xen/sys-hypervisor.c > @@ -455,7 +455,7 @@ static const struct attribute_group xen_pmu_group = { > .attrs = xen_pmu_attrs, > }; > > -static int __init xen_pmu_init(void) > +static int __init xen_sysfs_pmu_init(void) > { > return sysfs_create_group(hypervisor_kobj, &xen_pmu_group); > } > @@ -485,7 +485,7 @@ static int __init hyper_sysfs_init(void) > goto prop_out; > #ifdef CONFIG_XEN_HAVE_VPMU > if (xen_initial_domain()) { > - ret = xen_pmu_init(); > + ret = xen_sysfs_pmu_init(); > if (ret) { > sysfs_remove_group(hypervisor_kobj, > &xen_properties_group);
Re: [PATCH] xen: Make VPMU init message look less scary
On 08/01/2016 07:41 AM, Juergen Gross wrote: > The default for the Xen hypervisor is to not enable VPMU in order to > avoid security issues. In this case the Linux kernel will issue the > message "Could not initialize VPMU for cpu 0, error -95" which looks > more like an error than a normal state. > > Change the message to something less scary in case the hypervisor > returns EOPNOTSUPP when trying to activate VPMU. > > Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH] xen: rename xen_pmu_init() in sys-hypervisor.c
On 08/01/2016 11:44 AM, Juergen Gross wrote: > On 01/08/16 16:10, Boris Ostrovsky wrote: >> On 08/01/2016 07:40 AM, Juergen Gross wrote: >>> There are two functions with name xen_pmu_init() in the kernel. Rename >>> the one in drivers/xen/sys-hypervisor.c to avoid shadowing the one in >>> arch/x86/xen/pmu.c >>> >>> Signed-off-by: Juergen Gross >> While at it, how about changing xen_properties_init to >> xen_sysfs_properties_init as well for consistency? > I don't mind changing this name, too. Will send V2. Thanks. Actually, there is one more --- xen_compilation_init(). -boris
Re: [Xen-devel] [PATCH] xen: Make VPMU init message look less scary
On 08/01/2016 11:48 AM, Juergen Gross wrote:
> On 01/08/16 16:11, Konrad Rzeszutek Wilk wrote:
>> On Mon, Aug 01, 2016 at 01:41:20PM +0200, Juergen Gross wrote:
>>> The default for the Xen hypervisor is to not enable VPMU in order to
>>> avoid security issues. In this case the Linux kernel will issue the
>>> message "Could not initialize VPMU for cpu 0, error -95" which looks
>>> more like an error than a normal state.
>>>
>>> Change the message to something less scary in case the hypervisor
>>> returns EOPNOTSUPP when trying to activate VPMU.
>>>
>>> Signed-off-by: Juergen Gross
>>> ---
>>>  arch/x86/xen/pmu.c | 7 +++++--
>>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
>>> index 32bdc2c..e9d66fa 100644
>>> --- a/arch/x86/xen/pmu.c
>>> +++ b/arch/x86/xen/pmu.c
>>> @@ -547,8 +547,11 @@ void xen_pmu_init(int cpu)
>>>  	return;
>>>
>>>  fail:
>>> -	pr_info_once("Could not initialize VPMU for cpu %d, error %d\n",
>>> -		     cpu, err);
>>> +	if (err == -EOPNOTSUPP)
>>> +		pr_info_once("VPMU usage disabled due to Xen settings\n");
>> How about 'VPMU disabled by hypervisor.'
> Hmm, why not. Boris, are you okay with this message, too?

Sure.
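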
[PATCH v2 2/2] xen/events: Convert to hotplug state machine
From: Sebastian Andrzej Siewior Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Boris Ostrovsky --- drivers/xen/events/events_fifo.c | 34 -- include/linux/cpuhotplug.h | 1 + 2 files changed, 13 insertions(+), 22 deletions(-) diff --git a/drivers/xen/events/events_fifo.c b/drivers/xen/events/events_fifo.c index 266c2c7..7ef27c6 100644 --- a/drivers/xen/events/events_fifo.c +++ b/drivers/xen/events/events_fifo.c @@ -418,30 +418,18 @@ static int evtchn_fifo_alloc_control_block(unsigned cpu) return ret; } -static int evtchn_fifo_cpu_notification(struct notifier_block *self, - unsigned long action, - void *hcpu) +static int xen_evtchn_cpu_prepare(unsigned int cpu) { - int cpu = (long)hcpu; - int ret = 0; - - switch (action) { - case CPU_UP_PREPARE: - if (!per_cpu(cpu_control_block, cpu)) - ret = evtchn_fifo_alloc_control_block(cpu); - break; - case CPU_DEAD: - __evtchn_fifo_handle_events(cpu, true); - break; - default: - break; - } - return ret < 0 ? NOTIFY_BAD : NOTIFY_OK; + if (!per_cpu(cpu_control_block, cpu)) + return evtchn_fifo_alloc_control_block(cpu); + return 0; } -static struct notifier_block evtchn_fifo_cpu_notifier = { - .notifier_call = evtchn_fifo_cpu_notification, -}; +static int xen_evtchn_cpu_dead(unsigned int cpu) +{ + __evtchn_fifo_handle_events(cpu, true); + return 0; +} int __init xen_evtchn_fifo_init(void) { @@ -456,7 +444,9 @@ int __init xen_evtchn_fifo_init(void) evtchn_ops = &evtchn_ops_fifo; - register_cpu_notifier(&evtchn_fifo_cpu_notifier); + cpuhp_setup_state_nocalls(CPUHP_XEN_EVTCHN_PREPARE, + "CPUHP_XEN_EVTCHN_PREPARE", + xen_evtchn_cpu_prepare, xen_evtchn_cpu_dead); out: put_cpu(); return ret; diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index 33d352f..5f60316 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -22,6 +22,7 @@ enum cpuhp_state { CPUHP_SMPCFD_PREPARE, CPUHP_RCUTREE_PREP, CPUHP_XEN_PREPARE, + CPUHP_XEN_EVTCHN_PREPARE, CPUHP_NOTIFY_PREPARE, CPUHP_TIMERS_DEAD, CPUHP_BRINGUP_CPU, -- 2.7.4
[PATCH v2 1/2] xen/x86: Convert to hotplug state machine
Switch to new CPU hotplug infrastructure. Signed-off-by: Boris Ostrovsky Suggested-by: Sebastian Andrzej Siewior --- Changes in v2: * Replace xen_cpu_up_cancel with xen_cpu_dead * Use existing CPUHP_AP_ONLINE_DYN instead of introducing new state * Be more careful with return value of cpuhp_setup_state_nocalls() as it may return a positive (non-error) number. (Which suggests that comment on top of __cpuhp_setup_state() is probably incorrect) arch/x86/xen/enlighten.c | 115 + include/linux/cpuhotplug.h | 1 + 2 files changed, 66 insertions(+), 50 deletions(-) diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c index 78e1d06..e4c3399 100644 --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c @@ -140,7 +140,9 @@ RESERVE_BRK(shared_info_page_brk, PAGE_SIZE); __read_mostly int xen_have_vector_callback; EXPORT_SYMBOL_GPL(xen_have_vector_callback); -static struct notifier_block xen_cpu_notifier; +static int xen_cpu_up_prepare(unsigned int cpu); +static int xen_cpu_up_online(unsigned int cpu); +static int xen_cpu_dead(unsigned int cpu); /* * Point at some empty memory to start with. We map the real shared_info @@ -1541,6 +1543,24 @@ static void __init xen_dom0_set_legacy_features(void) x86_platform.legacy.rtc = 1; } +static int xen_cpuhp_setup(void) +{ + int rc; + + rc = cpuhp_setup_state_nocalls(CPUHP_XEN_PREPARE, + "XEN_HVM_GUEST_PREPARE", + xen_cpu_up_prepare, xen_cpu_dead); + if (rc >= 0) { + rc = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, + "XEN_HVM_GUEST_ONLINE", + xen_cpu_up_online, NULL); + if (rc < 0) + cpuhp_remove_state_nocalls(CPUHP_XEN_PREPARE); + } + + return rc >= 0 ? 0 : rc; +} + /* First C function to be called on Xen boot */ asmlinkage __visible void __init xen_start_kernel(void) { @@ -1629,7 +1649,7 @@ asmlinkage __visible void __init xen_start_kernel(void) xen_initial_gdt = &per_cpu(gdt_page, 0); xen_smp_init(); - register_cpu_notifier(&xen_cpu_notifier); + WARN_ON(xen_cpuhp_setup()); #ifdef CONFIG_ACPI_NUMA /* @@ -1823,63 +1843,58 @@ static void __init init_hvm_pv_info(void) xen_domain_type = XEN_HVM_DOMAIN; } -static int xen_cpu_notify(struct notifier_block *self, unsigned long action, -void *hcpu) +static int xen_cpu_up_prepare(unsigned int cpu) { - int cpu = (long)hcpu; int rc; - switch (action) { - case CPU_UP_PREPARE: - if (xen_hvm_domain()) { - /* -* This can happen if CPU was offlined earlier and -* offlining timed out in common_cpu_die(). -*/ - if (cpu_report_state(cpu) == CPU_DEAD_FROZEN) { - xen_smp_intr_free(cpu); - xen_uninit_lock_cpu(cpu); - } - - if (cpu_acpi_id(cpu) != U32_MAX) - per_cpu(xen_vcpu_id, cpu) = cpu_acpi_id(cpu); - else - per_cpu(xen_vcpu_id, cpu) = cpu; - xen_vcpu_setup(cpu); + if (xen_hvm_domain()) { + /* +* This can happen if CPU was offlined earlier and +* offlining timed out in common_cpu_die(). 
+*/ + if (cpu_report_state(cpu) == CPU_DEAD_FROZEN) { + xen_smp_intr_free(cpu); + xen_uninit_lock_cpu(cpu); } - if (xen_pv_domain() || - (xen_have_vector_callback && -xen_feature(XENFEAT_hvm_safe_pvclock))) - xen_setup_timer(cpu); + if (cpu_acpi_id(cpu) != U32_MAX) + per_cpu(xen_vcpu_id, cpu) = cpu_acpi_id(cpu); + else + per_cpu(xen_vcpu_id, cpu) = cpu; + xen_vcpu_setup(cpu); + } - rc = xen_smp_intr_init(cpu); - if (rc) { - WARN(1, "xen_smp_intr_init() for CPU %d failed: %d\n", -cpu, rc); - return NOTIFY_BAD; - } + if (xen_pv_domain() || + (xen_have_vector_callback && +xen_feature(XENFEAT_hvm_safe_pvclock))) + xen_setup_timer(cpu); - break; - case CPU_ONLINE: - xen_init_lock_cpu(cpu); - break; - case CPU_UP_CANCELED: - xen_smp_intr_free(cpu); - if (xen_pv_domain() || -
[PATCH v2 0/2] Convert to new CPU hotplug framework
A new CPU hotplug framework was introduced recently. These patches convert
the Xen CPU hotplug code to this infrastructure.

The patches (patch 1 specifically) will apply on top of
https://lists.xenproject.org/archives/html/xen-devel/2016-08/msg00562.html

v2: Changes in patch 1 suggested by Sebastian

Boris Ostrovsky (2):
  xen/x86: Convert to hotplug state machine
  xen/events: Convert to hotplug state machine

 arch/x86/xen/enlighten.c         | 115 ++-
 drivers/xen/events/events_fifo.c |  34
 include/linux/cpuhotplug.h       |   2 +
 3 files changed, 79 insertions(+), 72 deletions(-)

--
2.7.4
Re: [PATCH v2] xen/pciback: support driver_override
On 09/02/2016 08:30 AM, Juergen Gross wrote: > Support the driver_override scheme introduced with commit 782a985d7af2 > ("PCI: Introduce new device binding path using pci_dev.driver_override") > > As pcistub_probe() is called for all devices (it has to check for a > match based on the slot address rather than device type) it has to > check for driver_override set to "pciback" itself. > > Signed-off-by: Juergen Gross > --- > V2: removed now unused label > --- > drivers/xen/xen-pciback/pci_stub.c | 16 ++-- > 1 file changed, 10 insertions(+), 6 deletions(-) > > diff --git a/drivers/xen/xen-pciback/pci_stub.c > b/drivers/xen/xen-pciback/pci_stub.c > index 258b7c3..85c28f7 100644 > --- a/drivers/xen/xen-pciback/pci_stub.c > +++ b/drivers/xen/xen-pciback/pci_stub.c > @@ -25,6 +25,8 @@ > #include "conf_space.h" > #include "conf_space_quirks.h" > > +#define PCISTUB_DRIVER_NAME "pciback" > + > static char *pci_devs_to_hide; > wait_queue_head_t xen_pcibk_aer_wait_queue; > /*Add sem for sync AER handling and xen_pcibk remove/reconfigue ops, > @@ -529,16 +531,18 @@ static int pcistub_probe(struct pci_dev *dev, const > struct pci_device_id *id) > "don't have a normal (0) or bridge (1) " > "header type!\n"); > err = -ENODEV; > - goto out; > } > > + } else if (!dev->driver_override || > +strcmp(dev->driver_override, PCISTUB_DRIVER_NAME)) > + /* Didn't find the device */ > + err = -ENODEV; > + > + if (!err) { > dev_info(&dev->dev, "seizing device\n"); > err = pcistub_seize(dev); > - } else > - /* Didn't find the device */ > - err = -ENODEV; > + } Should devices with pciback override be displayed in /sys/bus/pci/drivers/pciback/slots? If they should then they need to be either added to pcistub_device_ids or kept on some other list. Also, do you think checking override might better be done first, before testing for ID match? -boris > > -out: > return err; > } > > @@ -945,7 +949,7 @@ static const struct pci_error_handlers > xen_pcibk_error_handler = { > static struct pci_driver xen_pcibk_pci_driver = { > /* The name should be xen_pciback, but until the tools are updated >* we will keep it as pciback. */ > - .name = "pciback", > + .name = PCISTUB_DRIVER_NAME, > .id_table = pcistub_ids, > .probe = pcistub_probe, > .remove = pcistub_remove,
Re: [PATCH] prctl,x86 Add PR_[GET|SET]_CPUID for controlling the CPUID instruction.
On 09/14/2016 02:52 PM, Andy Lutomirski wrote: > On Tue, Sep 13, 2016 at 11:13 PM, Kyle Huey wrote: >> On Mon, Sep 12, 2016 at 9:56 AM, Andy Lutomirski wrote: >>> You should explicitly check that, if the >>> feature is set under Xen PV, then the MSR actually works as >>> advertised. This may require talking to the Xen folks to make sure >>> you're testing the right configuration. >> This is interesting. When running under Xen PV the kernel is allowed >> to read the real value of MSR_PLATFORM_INFO and see that CPUID >> faulting is supported. But as you suggested, writing to >> MSR_MISC_FEATURES_ENABLES doesn't actually enable CPUID faulting, at >> least not in any way that works. >> >> It's not obvious to me how to test this, because when this feature >> works, CPUID only faults in userspace, not in the kernel. Is there >> existing code somewhere that runs tests like this in userspace? >> > Andrew, Boris: should we expect Xen PV to do anything sensible when we > write to MSR_PLATFORM_INFO to turn on CPUID faulting? Should the Xen > PV rdmsr hooks or perhaps the hypervisor mask out the feature if it > isn't going to be supported? The hypervisor uses CPUID faulting so we shouldn't advertise this feature to guests. -boris
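On the "how to test this from userspace" question: with the PR_SET_CPUID prctl from this series, one self-contained check is to disable CPUID for the task and verify that executing CPUID then raises SIGSEGV. A rough sketch (PR_SET_CPUID and its semantics are assumptions taken from the posted series, not a released kernel API):

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <sys/prctl.h>

static sigjmp_buf jmp;

static void handler(int sig)
{
        siglongjmp(jmp, 1);     /* CPUID faulted, as expected */
}

int main(void)
{
        signal(SIGSEGV, handler);

        if (prctl(PR_SET_CPUID, 0, 0, 0, 0)) { /* 0 == make CPUID fault */
                perror("prctl");
                return 1;
        }

        if (sigsetjmp(jmp, 1) == 0) {
                unsigned int eax = 0, ebx, ecx, edx;

                asm volatile("cpuid"
                             : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx));
                printf("FAIL: cpuid did not fault\n");
                return 1;
        }

        printf("OK: cpuid faulted\n");
        return 0;
}

Under Xen PV this would report FAIL even though MSR_PLATFORM_INFO advertises the feature, which is exactly the mismatch being discussed.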
Re: [Xen-devel] [PATCH v2 2/3] x86 Test and expose CPUID faulting capabilities in /proc/cpuinfo
On 09/15/2016 03:11 PM, Kyle Huey wrote:
> On Thu, Sep 15, 2016 at 3:25 AM, Jan Beulich wrote:
> On 15.09.16 at 12:05, wrote:
>>> On 14/09/16 22:01, Kyle Huey wrote:
>>>> Xen advertises the underlying support for CPUID faulting but does not
>>>> pass through writes to the relevant MSR, nor does it virtualize it,
>>>> so it does not actually work. For now mask off the relevant bit on
>>>> MSR_PLATFORM_INFO.
>>> Could you clarify in the commit message that it is PV guests that are
>>> affected.
>> What makes you think HVM ones aren't?
> Testing on EC2, HVM guests are affected as well. Not sure what to do
> about that.

You could clear the capability bit in xen_set_cpu_features() but of course
this assumes you will never again read MSR_PLATFORM_INFO.

-boris
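Concretely, that masking could live in the existing Xen CPU fixup hook, something like the following (a sketch; it assumes the series names the synthetic bit X86_FEATURE_CPUID_FAULT):

static void xen_set_cpu_features(struct cpuinfo_x86 *c)
{
        /* ... existing fixups ... */

        /*
         * Xen itself uses CPUID faulting, and guest writes to
         * MSR_MISC_FEATURES_ENABLES have no effect, so never
         * advertise the capability to guests.
         */
        clear_cpu_cap(c, X86_FEATURE_CPUID_FAULT);
}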
Re: [PATCH 1/2] xen/x86: Convert to hotplug state machine
On 08/31/2016 12:15 PM, Sebastian Andrzej Siewior wrote: > On 2016-08-26 15:37:38 [-0400], Boris Ostrovsky wrote: >>> If you do find the time, you might manage to rework the code to avoid >>> using the _nocalls() function. If see this right, you use >>> xen_setup_vcpu_info_placement() for the init in the first place. This >>> uses for_each_possible_cpu macro. The cpuhp_setup_state() function would >>> perform the init for all CPUs before they come up. >> I am not sure I see what this would buy us. >> >> Besides, cpuhp_setup_state() uses for_each_present_cpu(). > Correct. So you would avoid running the init code on CPUs which are > within the for_each_possible_cpu() set but not in for_each_present_cpu(). > > Assuming a NUMA box with two CPUs, 8 cores each gives you 32 CPUs in > Linux with hyper threading. BIOS may report 240 CPUs as the upper limit > (possible CPUs) but if you never deploy them you don't need to > initialize them… Should they be plugged physically then the > for_each_present_cpu() loop will cover them once they come up. > That's not going to help Xen guests: all possible CPUs are brought up right away and then those that are not in use are unplugged. -boris
Re: [PATCH] xen: cleanup xen.h
On 07/27/2017 11:44 AM, Juergen Gross wrote:
> On 27/07/17 17:37, Boris Ostrovsky wrote:
>> On 07/27/2017 11:11 AM, Juergen Gross wrote:
>>> The macros for testing domain types are more complicated than they
>>> need to be. Simplify them.
>>>
>>> Signed-off-by: Juergen Gross
>>> ---
>>>  include/xen/xen.h | 20 +---
>>>  1 file changed, 9 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/include/xen/xen.h b/include/xen/xen.h
>>> index 6e8b7fc79801..28c59ca529d7 100644
>>> --- a/include/xen/xen.h
>>> +++ b/include/xen/xen.h
>>> @@ -13,11 +13,16 @@ extern enum xen_domain_type xen_domain_type;
>>>  #define xen_domain_type	XEN_NATIVE
>>>  #endif
>>>
>>> +#ifdef CONFIG_XEN_PVH
>>> +extern bool xen_pvh;
>>> +#else
>>> +#define xen_pvh	0
>>> +#endif
>>> +
>>>  #define xen_domain()	(xen_domain_type != XEN_NATIVE)
>>> -#define xen_pv_domain()	(xen_domain() &&	\
>>> -			 xen_domain_type == XEN_PV_DOMAIN)
>>> -#define xen_hvm_domain()	(xen_domain() &&	\
>>> -			 xen_domain_type == XEN_HVM_DOMAIN)
>>> +#define xen_pv_domain()	(xen_domain_type == XEN_PV_DOMAIN)
>> Stray tab.
> No. This is just due to the '+' of the patch.

Applied to for-linus-4.14

-boris
Re: [PATCH] xen-platform: constify pci_device_id.
On 08/02/2017 06:36 PM, Boris Ostrovsky wrote: > On 08/02/2017 01:46 PM, Arvind Yadav wrote: >> pci_device_id are not supposed to change at runtime. All functions >> working with pci_device_id provided by work with >> const pci_device_id. So mark the non-const structs as const. >> >> Signed-off-by: Arvind Yadav > Reviewed-by: Boris Ostrovsky Applied to for-linus-4.14. -boris
Re: [PATCH 0/3] xen: do some cleanups
On 08/04/2017 07:36 AM, Juergen Gross wrote: > Remove stuff no longer needed. > > Juergen Gross (3): > xen: remove tests for pvh mode in pure pv paths > xen: remove unused function xen_set_domain_pte() > xen: remove not used trace functions > > arch/x86/include/asm/xen/page.h | 5 - > arch/x86/xen/mmu_pv.c | 20 > arch/x86/xen/p2m.c | 25 + > arch/x86/xen/setup.c| 5 + > include/trace/events/xen.h | 38 -- > 5 files changed, 2 insertions(+), 91 deletions(-) > Applied to for-linus-4.14. -boris
Re: [PATCH v4] xen: get rid of paravirt op adjust_exception_frame
On 08/11/2017 10:54 AM, Juergen Gross wrote: > When running as Xen pv-guest the exception frame on the stack contains > %r11 and %rcx additional to the other data pushed by the processor. > > Instead of having a paravirt op being called for each exception type > prepend the Xen specific code to each exception entry. When running as > Xen pv-guest just use the exception entry with prepended instructions, > otherwise use the entry without the Xen specific code. > > Signed-off-by: Juergen Gross > --- > arch/x86/entry/entry_64.S | 23 ++--- > arch/x86/entry/entry_64_compat.S | 1 - > arch/x86/include/asm/paravirt.h | 5 -- > arch/x86/include/asm/paravirt_types.h | 4 -- > arch/x86/include/asm/proto.h | 3 ++ > arch/x86/include/asm/traps.h | 33 ++-- > arch/x86/kernel/asm-offsets_64.c | 1 - > arch/x86/kernel/paravirt.c| 3 -- > arch/x86/xen/enlighten_pv.c | 96 > +++ > arch/x86/xen/irq.c| 3 -- > arch/x86/xen/xen-asm_64.S | 45 ++-- > arch/x86/xen/xen-ops.h| 1 - > 12 files changed, 140 insertions(+), 78 deletions(-) Reviewed-by: Boris Ostrovsky Applied to for-linus-4.14. -boris
Re: [PATCH] xen/pvcalls: use WARN_ON(1) instead of __WARN()
On 07/21/2017 03:26 PM, Stefano Stabellini wrote: > On Fri, 21 Jul 2017, Arnd Bergmann wrote: >> __WARN() is an internal helper that is only available on >> some architectures, but causes a build error e.g. on ARM64 >> in some configurations: >> >> drivers/xen/pvcalls-back.c: In function 'set_backend_state': >> drivers/xen/pvcalls-back.c:1097:5: error: implicit declaration of function >> '__WARN' [-Werror=implicit-function-declaration] >> >> Unfortunately, there is no equivalent of BUG() that takes no >> arguments, but WARN_ON(1) is commonly used in other drivers >> and works on all configurations. >> >> Fixes: 7160378206b2 ("xen/pvcalls: xenbus state handling") >> Signed-off-by: Arnd Bergmann > Reviewed-by: Stefano Stabellini > Applied to for-linus-4.14 -boris
Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush
On 08/16/2017 12:42 PM, Vitaly Kuznetsov wrote: > Vitaly Kuznetsov writes: > >> Peter Zijlstra writes: >> >>> On Fri, Aug 11, 2017 at 09:16:29AM -0700, Linus Torvalds wrote: On Fri, Aug 11, 2017 at 2:03 AM, Peter Zijlstra wrote: > I'm sure we talked about using HAVE_RCU_TABLE_FREE for x86 (and yes that > would make it work again), but this was some years ago and I cannot > readily find those emails. I think the only time we really talked about HAVE_RCU_TABLE_FREE for x86 (at least that I was cc'd on) was not because of RCU freeing, but because we just wanted to use the generic page table lookup code on x86 *despite* not using RCU freeing. And we just ended up renaming HAVE_GENERIC_RCU_GUP as HAVE_GENERIC_GUP. There was only passing mention of maybe making x86 use RCU, but the discussion was really about why the IF flag meant that x86 didn't need to, iirc. I don't recall us ever discussing *really* making x86 use RCU. >>> Google finds me this: >>> >>> https://lwn.net/Articles/500188/ >>> >>> Which includes: >>> >>> http://www.mail-archive.com/kvm@vger.kernel.org/msg72918.html >>> >>> which does as was suggested here, selects HAVE_RCU_TABLE_FREE for >>> PARAVIRT_TLB_FLUSH. >>> >>> But yes, this is very much virt specific nonsense, native would never >>> need this. >> In case we decide to go HAVE_RCU_TABLE_FREE for all PARAVIRT-enabled >> kernels (as it seems to be the easiest/fastest way to fix Xen PV) - what >> do you think about the required testing? Any suggestion for a >> specifically crafted micro benchmark in addition to standard >> ebizzy/kernbench/...? > In the meantime I tested HAVE_RCU_TABLE_FREE with kernbench (enablement > patch I used is attached; I know that it breaks other architectures) on > bare metal with PARAVIRT enabled in config. 
The results are:
>
> 6-CPU host:
>
> Average Half load -j 3 Run (std deviation):
> CURRENT                                HAVE_RCU_TABLE_FREE
> ===                                    ===
> Elapsed Time 400.498 (0.179679)        Elapsed Time 399.909 (0.162853)
> User Time 1098.72 (0.278536)           User Time 1097.59 (0.283894)
> System Time 100.301 (0.201629)         System Time 99.736 (0.196254)
> Percent CPU 299 (0)                    Percent CPU 299 (0)
> Context Switches 5774.1 (69.2121)      Context Switches 5744.4 (79.4162)
> Sleeps 87621.2 (78.1093)               Sleeps 87586.1 (99.7079)
>
> Average Optimal load -j 24 Run (std deviation):
> CURRENT                                HAVE_RCU_TABLE_FREE
> ===                                    ===
> Elapsed Time 219.03 (0.652534)         Elapsed Time 218.959 (0.598674)
> User Time 1119.51 (21.3284)            User Time 1118.81 (21.7793)
> System Time 100.499 (0.389308)         System Time 99.8335 (0.251423)
> Percent CPU 432.5 (136.974)            Percent CPU 432.45 (136.922)
> Context Switches 81827.4 (78029.5)     Context Switches 81818.5 (78051)
> Sleeps 97124.8 (9822.4)                Sleeps 97207.9 (9955.04)
>
> 16-CPU host:
>
> Average Half load -j 8 Run (std deviation):
> CURRENT                                HAVE_RCU_TABLE_FREE
> ===                                    ===
> Elapsed Time 213.538 (3.7891)          Elapsed Time 212.5 (3.10939)
> User Time 1306.4 (1.83399)             User Time 1307.65 (1.01364)
> System Time 194.59 (0.864378)          System Time 195.478 (0.794588)
> Percent CPU 702.6 (13.5388)            Percent CPU 707 (11.1131)
> Context Switches 21189.2 (1199.4)      Context Switches 21288.2 (552.388)
> Sleeps 89390.2 (482.325)               Sleeps 89677 (277.06)
>
> Average Optimal load -j 64 Run (std deviation):
> CURRENT                                HAVE_RCU_TABLE_FREE
> ===                                    ===
> Elapsed Time 137.866 (0.787928)        Elapsed Time 138.438 (0.218792)
> User Time 1488.92 (192.399)            User Time 1489.92 (192.135)
> System Time 234.981 (42.5806)          System Time 236.09 (42.8138)
> Percent CPU 1057.1 (373.826)           Percent CPU 1057.1 (369.114)
> Context Switches 187514 (175324)       Context Switches 187358 (175060)
> Sleeps 112633 (24535.5)                Sleeps 111743 (23297.6)
>
> As you can see, there's no notable difference. I'll think of a
> microbenchmark though.

FWIW, I was about to send a very similar patch (but with only Xen-PV
enabling RCU-based free by default) and saw similar results with
kernbench, both Xen PV and baremetal.

>> Additionally, I see another option for us: enable 'rcu table free' on
>> boot (e.g. by taking tlb_remove_table to pv_ops and doing boot-time
>> patching for it) so bare metal and other hypervisors are not affected
>> by the change.
> It seems there's no need for that and we can keep things simple...
>
> -- Vitaly
>
> 0001-x86-enable-RCU-based-table-free-w
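For readers without the attachment, the enablement being measured is essentially selecting HAVE_RCU_TABLE_FREE when PARAVIRT is on and switching the x86 page-table free paths from immediate frees to the batched RCU-style path; e.g. (a sketch of one of the touched spots, not the attached patch itself):

void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
{
        pgtable_page_dtor(pte);
        paravirt_release_pte(page_to_pfn(pte));
        /* was tlb_remove_page(): defer the free so lockless GUP
         * walkers cannot observe a freed page table */
        tlb_remove_table(tlb, pte);
}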
Re: [PATCH] xen: fix build failure related to removing adjust_exception_frame
On 08/17/2017 05:03 AM, Juergen Gross wrote: > A kernel configured with XEN_PV but without KVM_GUEST will fail to > build since the patch removing the adjust_exception_frame paravirt > op. > > Fix this failure. > > Reported-by: Sander Eikelenboom > Signed-off-by: Juergen Gross > --- > arch/x86/xen/xen-asm_64.S | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/x86/xen/xen-asm_64.S b/arch/x86/xen/xen-asm_64.S > index f72ff71cc897..62039a35f70e 100644 > --- a/arch/x86/xen/xen-asm_64.S > +++ b/arch/x86/xen/xen-asm_64.S > @@ -47,7 +47,6 @@ xen_pv_trap segment_not_present > xen_pv_trap stack_segment > xen_pv_trap general_protection > xen_pv_trap page_fault > -xen_pv_trap async_page_fault > xen_pv_trap spurious_interrupt_bug > xen_pv_trap coprocessor_error > xen_pv_trap alignment_check Applied to for-linus-4.14. -boris
Re: [PATCH] xen/events: events_fifo: Don't use {get,put}_cpu() in xen_evtchn_fifo_init()
On 08/17/2017 12:14 PM, Julien Grall wrote: > When booting Linux as Xen guest with CONFIG_DEBUG_ATOMIC, the following > splat appears: > > [0.002323] Mountpoint-cache hash table entries: 1024 (order: 1, 8192 > bytes) > [0.019717] ASID allocator initialised with 65536 entries > [0.020019] xen:grant_table: Grant tables using version 1 layout > [0.020051] Grant table initialized > [0.020069] BUG: sleeping function called from invalid context at > /data/src/linux/mm/page_alloc.c:4046 > [0.020100] in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: swapper/0 > [0.020123] no locks held by swapper/0/1. > [0.020143] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.13.0-rc5 #598 > [0.020166] Hardware name: FVP Base (DT) > [0.020182] Call trace: > [0.020199] [] dump_backtrace+0x0/0x270 > [0.020222] [] show_stack+0x24/0x30 > [0.020244] [] dump_stack+0xb8/0xf0 > [0.020267] [] ___might_sleep+0x1c8/0x1f8 > [0.020291] [] __might_sleep+0x58/0x90 > [0.020313] [] __alloc_pages_nodemask+0x1c0/0x12e8 > [0.020338] [] alloc_page_interleave+0x38/0x88 > [0.020363] [] alloc_pages_current+0xdc/0xf0 > [0.020387] [] __get_free_pages+0x28/0x50 > [0.020411] [] evtchn_fifo_alloc_control_block+0x2c/0xa0 > [0.020437] [] xen_evtchn_fifo_init+0x38/0xb4 > [0.020461] [] xen_init_IRQ+0x44/0xc8 > [0.020484] [] xen_guest_init+0x250/0x300 > [0.020507] [] do_one_initcall+0x44/0x130 > [0.020531] [] kernel_init_freeable+0x120/0x288 > [0.020556] [] kernel_init+0x18/0x110 > [0.020578] [] ret_from_fork+0x10/0x40 > [0.020606] xen:events: Using FIFO-based ABI > [0.020658] Xen: initializing cpu0 > [0.027727] Hierarchical SRCU implementation. > [0.036235] EFI services will not be available. > [0.043810] smp: Bringing up secondary CPUs ... > > This is because get_cpu() in xen_evtchn_fifo_init() will disable > preemption, but __get_free_page() might sleep (GFP_ATOMIC is not set). > > xen_evtchn_fifo_init() will always be called before SMP is initialized, > so {get,put}_cpu() could be replaced by a simple smp_processor_id(). On x86 this will be called out of init_IRQ(), which is already preceded by preempt_disable(). Reviewed-by: Boris Ostrovsky
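With that observation, the fix reduces to dropping the get_cpu()/put_cpu() pair; roughly (a sketch reconstructed around the allocation call shown in the splat):

int __init xen_evtchn_fifo_init(void)
{
        int cpu = smp_processor_id();   /* pre-SMP: the CPU cannot change */
        int ret;

        ret = evtchn_fifo_alloc_control_block(cpu); /* may now sleep safely */
        if (ret < 0)
                return ret;

        pr_info("Using FIFO-based ABI\n");

        evtchn_ops = &evtchn_ops_fifo;

        cpuhp_setup_state_nocalls(CPUHP_XEN_EVTCHN_PREPARE,
                                  "CPUHP_XEN_EVTCHN_PREPARE",
                                  xen_evtchn_cpu_prepare, xen_evtchn_cpu_dead);
        return ret;
}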
Re: [PATCH v4 08/18] xen/pvcalls: implement connect command
On 06/15/2017 03:09 PM, Stefano Stabellini wrote: > Allocate a socket. Keep track of socket <-> ring mappings with a new data > structure, called sock_mapping. Implement the connect command by calling > inet_stream_connect, and mapping the new indexes page and data ring. > Allocate a workqueue and a work_struct, called ioworker, to perform > reads and writes to the socket. > > When an active socket is closed (sk_state_change), set in_error to > -ENOTCONN and notify the other end, as specified by the protocol. > > sk_data_ready and pvcalls_back_ioworker will be implemented later. > > Signed-off-by: Stefano Stabellini > CC: boris.ostrov...@oracle.com > CC: jgr...@suse.com > --- > drivers/xen/pvcalls-back.c | 171 > + > 1 file changed, 171 insertions(+) > > diff --git a/drivers/xen/pvcalls-back.c b/drivers/xen/pvcalls-back.c > index 953458b..4eecd83 100644 > --- a/drivers/xen/pvcalls-back.c > +++ b/drivers/xen/pvcalls-back.c > @@ -56,6 +56,39 @@ struct pvcalls_fedata { > struct work_struct register_work; > }; > > +struct pvcalls_ioworker { > + struct work_struct register_work; > + struct workqueue_struct *wq; > +}; > + > +struct sock_mapping { > + struct list_head list; > + struct pvcalls_fedata *fedata; > + struct socket *sock; > + uint64_t id; > + grant_ref_t ref; > + struct pvcalls_data_intf *ring; > + void *bytes; > + struct pvcalls_data data; > + uint32_t ring_order; > + int irq; > + atomic_t read; > + atomic_t write; > + atomic_t io; > + atomic_t release; > + void (*saved_data_ready)(struct sock *sk); > + struct pvcalls_ioworker ioworker; > +}; > + > +static irqreturn_t pvcalls_back_conn_event(int irq, void *sock_map); > +static int pvcalls_back_release_active(struct xenbus_device *dev, > +struct pvcalls_fedata *fedata, > +struct sock_mapping *map); > + > +static void pvcalls_back_ioworker(struct work_struct *work) > +{ > +} > + > static int pvcalls_back_socket(struct xenbus_device *dev, > struct xen_pvcalls_request *req) > { > @@ -84,9 +117,142 @@ static int pvcalls_back_socket(struct xenbus_device *dev, > return 0; > } > > +static void pvcalls_sk_state_change(struct sock *sock) > +{ > + struct sock_mapping *map = sock->sk_user_data; > + struct pvcalls_data_intf *intf; > + > + if (map == NULL) > + return; > + > + intf = map->ring; > + intf->in_error = -ENOTCONN; > + notify_remote_via_irq(map->irq); > +} > + > +static void pvcalls_sk_data_ready(struct sock *sock) > +{ > +} > + > +static struct sock_mapping *pvcalls_new_active_socket( > + struct pvcalls_fedata *fedata, > + uint64_t id, > + grant_ref_t ref, > + uint32_t evtchn, > + struct socket *sock) > +{ > + int ret; > + struct sock_mapping *map; > + void *page; > + > + map = kzalloc(sizeof(*map), GFP_KERNEL); > + if (map == NULL) > + return NULL; > + > + map->fedata = fedata; > + map->sock = sock; > + map->id = id; > + map->ref = ref; > + > + ret = xenbus_map_ring_valloc(fedata->dev, &ref, 1, &page); > + if (ret < 0) > + goto out; > + map->ring = page; > + map->ring_order = map->ring->ring_order; > + /* first read the order, then map the data ring */ > + virt_rmb(); > + if (map->ring_order > MAX_RING_ORDER) { > + pr_warn("%s frontend requested ring_order %u, which is > MAX > (%u)\n", > + __func__, map->ring_order, MAX_RING_ORDER); > + goto out; > + } > + ret = xenbus_map_ring_valloc(fedata->dev, map->ring->ref, > + (1 << map->ring_order), &page); > + if (ret < 0) > + goto out; > + map->bytes = page; > + > + ret = bind_interdomain_evtchn_to_irqhandler(fedata->dev->otherend_id, > + evtchn, > + pvcalls_back_conn_event, > + 0, > + 
"pvcalls-backend", > + map); > + if (ret < 0) > + goto out; > + map->irq = ret; > + > + map->data.in = map->bytes; > + map->data.out = map->bytes + XEN_FLEX_RING_SIZE(map->ring_order); > + > + map->ioworker.wq = alloc_workqueue("pvcalls_io", WQ_UNBOUND, 1); > + if (!map->ioworker.wq) > + goto out; > + atomic_set(&map->io, 1); > + INIT_WORK(&map->ioworker.register_work, pvcalls_back_ioworker); > + > + down(&fedata->socket_lock); > + list_add_tail(&map->list, &fedata->socket_mappings); > + up(&fedata->socket_lock); > + > + write_lock_bh(&map->sock->sk->sk_callback_lock); > + map->saved_data_ready =
Re: [PATCH v4 08/18] xen/pvcalls: implement connect command
>> + >> static int pvcalls_back_connect(struct xenbus_device *dev, >> struct xen_pvcalls_request *req) >> { >> +struct pvcalls_fedata *fedata; >> +int ret = -EINVAL; >> +struct socket *sock; >> +struct sock_mapping *map; >> +struct xen_pvcalls_response *rsp; >> + >> +fedata = dev_get_drvdata(&dev->dev); >> + >> +ret = sock_create(AF_INET, SOCK_STREAM, 0, &sock); >> +if (ret < 0) >> +goto out; >> +ret = inet_stream_connect(sock, (struct sockaddr *)&req->u.connect.addr, >> + req->u.connect.len, req->u.connect.flags); >> +if (ret < 0) { >> +sock_release(sock); >> +goto out; >> +} >> + >> +map = pvcalls_new_active_socket(fedata, >> +req->u.connect.id, >> +req->u.connect.ref, >> +req->u.connect.evtchn, >> +sock); >> +if (!map) { >> +sock_release(map->sock); >> +goto out; > Unnecessary goto. Oh, and also ret needs to be set since it will be cleared by inet_stream_connect(). -boris > >> +} >> + >> +out: >> +rsp = RING_GET_RESPONSE(&fedata->ring, fedata->ring.rsp_prod_pvt++); >> +rsp->req_id = req->req_id; >> +rsp->cmd = req->cmd; >> +rsp->u.connect.id = req->u.connect.id; >> +rsp->ret = ret; >> + >> +return ret; >
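Folding both comments in, the failure path would need to look something like this (sketch):

	map = pvcalls_new_active_socket(fedata,
					req->u.connect.id,
					req->u.connect.ref,
					req->u.connect.evtchn,
					sock);
	if (!map) {
		ret = -ENOMEM;		/* ret still holds 0 from inet_stream_connect() */
		sock_release(sock);	/* not map->sock: map is NULL on this path */
	}

With the goto gone, control simply falls through to the response code at out:.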
Re: [PATCH v4 07/18] xen/pvcalls: implement socket command
On 06/15/2017 03:09 PM, Stefano Stabellini wrote: > Just reply with success to the other end for now. Delay the allocation > of the actual socket to bind and/or connect. > > Signed-off-by: Stefano Stabellini > CC: boris.ostrov...@oracle.com > CC: jgr...@suse.com Reviewed-by: Boris Ostrovsky
Re: [PATCH v4 10/18] xen/pvcalls: implement listen command
On 06/15/2017 03:09 PM, Stefano Stabellini wrote: > Call inet_listen to implement the listen command. > > Signed-off-by: Stefano Stabellini > CC: boris.ostrov...@oracle.com > CC: jgr...@suse.com Reviewed-by: Boris Ostrovsky > --- > drivers/xen/pvcalls-back.c | 19 +++ > 1 file changed, 19 insertions(+) > > diff --git a/drivers/xen/pvcalls-back.c b/drivers/xen/pvcalls-back.c > index c17d970..e5c535d 100644 > --- a/drivers/xen/pvcalls-back.c > +++ b/drivers/xen/pvcalls-back.c > @@ -358,6 +358,25 @@ static int pvcalls_back_bind(struct xenbus_device *dev, > static int pvcalls_back_listen(struct xenbus_device *dev, > struct xen_pvcalls_request *req) > { > + struct pvcalls_fedata *fedata; > + int ret = -EINVAL; > + struct sockpass_mapping *map; > + struct xen_pvcalls_response *rsp; > + > + fedata = dev_get_drvdata(&dev->dev); > + > + map = radix_tree_lookup(&fedata->socketpass_mappings, req->u.listen.id); > + if (map == NULL) > + goto out; > + > + ret = inet_listen(map->sock, req->u.listen.backlog); > + > +out: > + rsp = RING_GET_RESPONSE(&fedata->ring, fedata->ring.rsp_prod_pvt++); > + rsp->req_id = req->req_id; > + rsp->cmd = req->cmd; > + rsp->u.listen.id = req->u.listen.id; > + rsp->ret = ret; > return 0; newline before return would be useful (if you respin this) > } >
Re: [PATCH v4 11/18] xen/pvcalls: implement accept command
> static void __pvcalls_back_accept(struct work_struct *work) > { > + struct sockpass_mapping *mappass = container_of( > + work, struct sockpass_mapping, register_work); > + struct sock_mapping *map; > + struct pvcalls_ioworker *iow; > + struct pvcalls_fedata *fedata; > + struct socket *sock; > + struct xen_pvcalls_response *rsp; > + struct xen_pvcalls_request *req; > + int notify; > + int ret = -EINVAL; > + unsigned long flags; > + > + fedata = mappass->fedata; > + /* > + * __pvcalls_back_accept can race against pvcalls_back_accept. > + * We only need to check the value of "cmd" on read. It could be > + * done atomically, but to simplify the code on the write side, we > + * use a spinlock. > + */ > + spin_lock_irqsave(&mappass->copy_lock, flags); > + req = &mappass->reqcopy; > + if (req->cmd != PVCALLS_ACCEPT) { > + spin_unlock_irqrestore(&mappass->copy_lock, flags); > + return; > + } > + spin_unlock_irqrestore(&mappass->copy_lock, flags); > + > + sock = sock_alloc(); > + if (sock == NULL) > + goto out_error; > + sock->type = mappass->sock->type; > + sock->ops = mappass->sock->ops; > + > + ret = inet_accept(mappass->sock, sock, O_NONBLOCK, true); > + if (ret == -EAGAIN) { > + sock_release(sock); > + goto out_error; > + } > + > + map = pvcalls_new_active_socket(fedata, > + req->u.accept.id_new, > + req->u.accept.ref, > + req->u.accept.evtchn, > + sock); > + if (!map) { > + sock_release(sock); > + goto out_error; Again, lost ret value. > + } > + > + map->sockpass = mappass; > + iow = &map->ioworker; > + atomic_inc(&map->read); > + atomic_inc(&map->io); > + queue_work(iow->wq, &iow->register_work); > + > +out_error: > + rsp = RING_GET_RESPONSE(&fedata->ring, fedata->ring.rsp_prod_pvt++); > + rsp->req_id = req->req_id; > + rsp->cmd = req->cmd; > + rsp->u.accept.id = req->u.accept.id; > + rsp->ret = ret; > + RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&fedata->ring, notify); > + if (notify) > + notify_remote_via_irq(fedata->irq); > + > + mappass->reqcopy.cmd = 0; > } > > static void pvcalls_pass_sk_data_ready(struct sock *sock) > { > + struct sockpass_mapping *mappass = sock->sk_user_data; > + > + if (mappass == NULL) > + return; > + > + queue_work(mappass->wq, &mappass->register_work); > } > > static int pvcalls_back_bind(struct xenbus_device *dev, > @@ -383,6 +456,43 @@ static int pvcalls_back_listen(struct xenbus_device *dev, > static int pvcalls_back_accept(struct xenbus_device *dev, > struct xen_pvcalls_request *req) > { > + struct pvcalls_fedata *fedata; > + struct sockpass_mapping *mappass; > + int ret = -EINVAL; Unnecessary initialization. -boris > + struct xen_pvcalls_response *rsp; > + unsigned long flags; > + > + fedata = dev_get_drvdata(&dev->dev); > + > + mappass = radix_tree_lookup(&fedata->socketpass_mappings, > + req->u.accept.id); > + if (mappass == NULL) > + goto out_error; > + > + /* > + * Limitation of the current implementation: only support one > + * concurrent accept or poll call on one socket. 
> + */ > + spin_lock_irqsave(&mappass->copy_lock, flags); > + if (mappass->reqcopy.cmd != 0) { > + spin_unlock_irqrestore(&mappass->copy_lock, flags); > + ret = -EINTR; > + goto out_error; > + } > + > + mappass->reqcopy = *req; > + spin_unlock_irqrestore(&mappass->copy_lock, flags); > + queue_work(mappass->wq, &mappass->register_work); > + > + /* Tell the caller we don't need to send back a notification yet */ > + return -1; > + > +out_error: > + rsp = RING_GET_RESPONSE(&fedata->ring, fedata->ring.rsp_prod_pvt++); > + rsp->req_id = req->req_id; > + rsp->cmd = req->cmd; > + rsp->u.accept.id = req->u.accept.id; > + rsp->ret = ret; > return 0; > } >
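As in the connect path, the accept error path needs ret set explicitly once inet_accept() has succeeded (sketch):

	map = pvcalls_new_active_socket(fedata,
					req->u.accept.id_new,
					req->u.accept.ref,
					req->u.accept.evtchn,
					sock);
	if (!map) {
		ret = -ENOMEM;	/* ret was left at 0 by inet_accept() */
		sock_release(sock);
		goto out_error;
	}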
Re: [PATCH v4 12/18] xen/pvcalls: implement poll command
> @@ -499,6 +521,55 @@ static int pvcalls_back_accept(struct xenbus_device *dev, > static int pvcalls_back_poll(struct xenbus_device *dev, >struct xen_pvcalls_request *req) > { > + struct pvcalls_fedata *fedata; > + struct sockpass_mapping *mappass; > + struct xen_pvcalls_response *rsp; > + struct inet_connection_sock *icsk; > + struct request_sock_queue *queue; > + unsigned long flags; > + int ret; > + bool data; > + > + fedata = dev_get_drvdata(&dev->dev); > + > + mappass = radix_tree_lookup(&fedata->socketpass_mappings, > req->u.poll.id); > + if (mappass == NULL) > + return -EINVAL; > + > + /* > + * Limitation of the current implementation: only support one > + * concurrent accept or poll call on one socket. > + */ > + spin_lock_irqsave(&mappass->copy_lock, flags); > + if (mappass->reqcopy.cmd != 0) { > + ret = -EINTR; > + goto out; > + } > + > + mappass->reqcopy = *req; > + icsk = inet_csk(mappass->sock->sk); > + queue = &icsk->icsk_accept_queue; > + spin_lock(&queue->rskq_lock); > + data = queue->rskq_accept_head != NULL; > + spin_unlock(&queue->rskq_lock); What is the purpose of the queue lock here? -boris > + if (data) { > + mappass->reqcopy.cmd = 0; > + ret = 0; > + goto out; > + } > + spin_unlock_irqrestore(&mappass->copy_lock, flags); > + > + /* Tell the caller we don't need to send back a notification yet */ > + return -1; >
Re: [PATCH v4 13/18] xen/pvcalls: implement release command
> + > +static int pvcalls_back_release_passive(struct xenbus_device *dev, > + struct pvcalls_fedata *fedata, > + struct sockpass_mapping *mappass) > +{ > + if (mappass->sock->sk != NULL) { > + write_lock_bh(&mappass->sock->sk->sk_callback_lock); > + mappass->sock->sk->sk_user_data = NULL; > + mappass->sock->sk->sk_data_ready = mappass->saved_data_ready; > + write_unlock_bh(&mappass->sock->sk->sk_callback_lock); > + } > + down(&fedata->socket_lock); > + radix_tree_delete(&fedata->socketpass_mappings, mappass->id); > + sock_release(mappass->sock); > + flush_workqueue(mappass->wq); > + destroy_workqueue(mappass->wq); > + kfree(mappass); > + up(&fedata->socket_lock); Can you raise the semaphore earlier, once the mapping is deleted from the tree? Also, why are you not locking the tree in pvcalls_back_accept()? -boris
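The reordering being requested, with the semaphore covering only the shared radix tree (sketch):

static int pvcalls_back_release_passive(struct xenbus_device *dev,
					struct pvcalls_fedata *fedata,
					struct sockpass_mapping *mappass)
{
	if (mappass->sock->sk != NULL) {
		write_lock_bh(&mappass->sock->sk->sk_callback_lock);
		mappass->sock->sk->sk_user_data = NULL;
		mappass->sock->sk->sk_data_ready = mappass->saved_data_ready;
		write_unlock_bh(&mappass->sock->sk->sk_callback_lock);
	}

	down(&fedata->socket_lock);
	radix_tree_delete(&fedata->socketpass_mappings, mappass->id);
	up(&fedata->socket_lock);	/* nothing below touches shared state */

	sock_release(mappass->sock);
	flush_workqueue(mappass->wq);
	destroy_workqueue(mappass->wq);
	kfree(mappass);

	return 0;
}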
Re: [PATCH v4 15/18] xen/pvcalls: implement the ioworker functions
On 06/15/2017 03:09 PM, Stefano Stabellini wrote: > We have one ioworker per socket. Each ioworker goes through the list of > outstanding read/write requests. Once all requests have been dealt with, > it returns. > > We use one atomic counter per socket for "read" operations and one > for "write" operations to keep track of the reads/writes to do. > > We also use one atomic counter ("io") per ioworker to keep track of how > many outstanding requests we have in total assigned to the ioworker. The > ioworker finishes when there are none. > > Signed-off-by: Stefano Stabellini > CC: boris.ostrov...@oracle.com > CC: jgr...@suse.com Reviewed-by: Boris Ostrovsky
Re: [PATCH v4 12/18] xen/pvcalls: implement poll command
>>> +
>>> +	mappass->reqcopy = *req;
>>> +	icsk = inet_csk(mappass->sock->sk);
>>> +	queue = &icsk->icsk_accept_queue;
>>> +	spin_lock(&queue->rskq_lock);
>>> +	data = queue->rskq_accept_head != NULL;
>>> +	spin_unlock(&queue->rskq_lock);
>> What is the purpose of the queue lock here?
> It is only there to protect accesses to rskq_accept_head. Functions that
> change rskq_accept_head take this lock, see for example
> net/ipv4/inet_connection_sock.c:inet_csk_reqsk_queue_add. I'll add an
> in-code comment.

I am not sure I follow. You are not changing rskq_accept_head, you are
simply reading it under the lock. It may be set by others to NULL as soon
as you drop the lock, at which point the 'data' test below will be
obsolete. In inet_csk_reqsk_queue_add() it is read and then, based on the
read result, is written with a value, so a lock is indeed needed there.

-boris

>>> +	if (data) {
>>> +		mappass->reqcopy.cmd = 0;
>>> +		ret = 0;
>>> +		goto out;
>>> +	}
>>> +	spin_unlock_irqrestore(&mappass->copy_lock, flags);
>>> +
>>> +	/* Tell the caller we don't need to send back a notification yet */
>>> +	return -1;
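If the snapshot is inherently racy anyway, the lock can be dropped entirely and the intent documented with a marked read; e.g. (sketch):

	icsk = inet_csk(mappass->sock->sk);
	queue = &icsk->icsk_accept_queue;

	/*
	 * Racy by design: a connection may arrive, or be accepted,
	 * immediately after this read. Holding rskq_lock would not
	 * change that, so a single marked read is enough.
	 */
	data = READ_ONCE(queue->rskq_accept_head) != NULL;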
Re: [PATCH v4 16/18] xen/pvcalls: implement read
On 06/15/2017 03:09 PM, Stefano Stabellini wrote: > When an active socket has data available, increment the io and read > counters, and schedule the ioworker. > > Implement the read function by reading from the socket, writing the data > to the data ring. > > Set in_error on error. > > Signed-off-by: Stefano Stabellini > CC: boris.ostrov...@oracle.com > CC: jgr...@suse.com > --- > drivers/xen/pvcalls-back.c | 85 > ++ > 1 file changed, 85 insertions(+) > > diff --git a/drivers/xen/pvcalls-back.c b/drivers/xen/pvcalls-back.c > index b9a10b9..65d9eba 100644 > --- a/drivers/xen/pvcalls-back.c > +++ b/drivers/xen/pvcalls-back.c > @@ -100,6 +100,81 @@ static int pvcalls_back_release_active(struct > xenbus_device *dev, > > static void pvcalls_conn_back_read(void *opaque) > { > + struct sock_mapping *map = (struct sock_mapping *)opaque; > + struct msghdr msg; > + struct kvec vec[2]; > + RING_IDX cons, prod, size, wanted, array_size, masked_prod, masked_cons; > + int32_t error; > + struct pvcalls_data_intf *intf = map->ring; > + struct pvcalls_data *data = &map->data; > + unsigned long flags; > + int ret; > + > + array_size = XEN_FLEX_RING_SIZE(map->ring_order); I noticed that in the next patch you call this 'ring_size. Can you make those things consistent? (There may be more than just this variable and, in fact, perhaps some things can be factored out? There are code fragments that look similar) > + cons = intf->in_cons; > + prod = intf->in_prod; > + error = intf->in_error; > + /* read the indexes first, then deal with the data */ > + virt_mb(); > + > + if (error) > + return; > + > + size = pvcalls_queued(prod, cons, array_size); > + if (size >= array_size) > + return; > + spin_lock_irqsave(&map->sock->sk->sk_receive_queue.lock, flags); > + if (skb_queue_empty(&map->sock->sk->sk_receive_queue)) { > + atomic_set(&map->read, 0); > + spin_unlock_irqrestore(&map->sock->sk->sk_receive_queue.lock, > + flags); > + return; > + } > + spin_unlock_irqrestore(&map->sock->sk->sk_receive_queue.lock, flags); > + wanted = array_size - size; > + masked_prod = pvcalls_mask(prod, array_size); > + masked_cons = pvcalls_mask(cons, array_size); > + > + memset(&msg, 0, sizeof(msg)); > + msg.msg_iter.type = ITER_KVEC|WRITE; > + msg.msg_iter.count = wanted; > + if (masked_prod < masked_cons) { > + vec[0].iov_base = data->in + masked_prod; > + vec[0].iov_len = wanted; > + msg.msg_iter.kvec = vec; > + msg.msg_iter.nr_segs = 1; > + } else { > + vec[0].iov_base = data->in + masked_prod; > + vec[0].iov_len = array_size - masked_prod; > + vec[1].iov_base = data->in; > + vec[1].iov_len = wanted - vec[0].iov_len; > + msg.msg_iter.kvec = vec; > + msg.msg_iter.nr_segs = 2; > + } This is probably obvious to everyone but me but can you explain what is going on here? ;-) > + > + atomic_set(&map->read, 0); Is this not atomic_dec() by any chance? 
-boris > + ret = inet_recvmsg(map->sock, &msg, wanted, MSG_DONTWAIT); > + WARN_ON(ret > wanted); > + if (ret == -EAGAIN) /* shouldn't happen */ > + return; > + if (!ret) > + ret = -ENOTCONN; > + spin_lock_irqsave(&map->sock->sk->sk_receive_queue.lock, flags); > + if (ret > 0 && !skb_queue_empty(&map->sock->sk->sk_receive_queue)) > + atomic_inc(&map->read); > + spin_unlock_irqrestore(&map->sock->sk->sk_receive_queue.lock, flags); > + > + /* write the data, then modify the indexes */ > + virt_wmb(); > + if (ret < 0) > + intf->in_error = ret; > + else > + intf->in_prod = prod + ret; > + /* update the indexes, then notify the other end */ > + virt_wmb(); > + notify_remote_via_irq(map->irq); > + > + return; > } > > static int pvcalls_conn_back_write(struct sock_mapping *map) > @@ -172,6 +247,16 @@ static void pvcalls_sk_state_change(struct sock *sock) > > static void pvcalls_sk_data_ready(struct sock *sock) > { > + struct sock_mapping *map = sock->sk_user_data; > + struct pvcalls_ioworker *iow; > + > + if (map == NULL) > + return; > + > + iow = &map->ioworker; > + atomic_inc(&map->read); > + atomic_inc(&map->io); > + queue_work(iow->wq, &iow->register_work); > } > > static struct sock_mapping *pvcalls_new_active_socket(
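To spell out what the two-iovec setup in pvcalls_conn_back_read() is doing: the "in" array is a circular buffer of array_size bytes, prod is where the backend writes and cons is where the frontend reads, so free space is [prod, cons) modulo array_size. In comment form (a sketch matching the masked_prod/masked_cons code above):

	/*
	 * masked_prod < masked_cons:   [##P........C##]   ('#' = pending data)
	 *   free space is one contiguous span -> a single iovec of
	 *   'wanted' bytes starting at data->in + masked_prod.
	 *
	 * masked_prod >= masked_cons:  [...C#######P..]
	 *   free space wraps past the end of the buffer -> two iovecs:
	 *   (array_size - masked_prod) bytes at masked_prod, then the
	 *   remaining (wanted - vec[0].iov_len) bytes from data->in.
	 */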
Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
On 10/12/2017 03:53 PM, Boris Ostrovsky wrote:
> On 10/12/2017 03:27 PM, Andrew Cooper wrote:
>> On 12/10/17 20:11, Boris Ostrovsky wrote:
>>> There is also another problem:
>>>
>>> [1.312425] general protection fault: [#1] SMP
>>> [1.312901] Modules linked in:
>>> [1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6
>>> [1.313878] task: 88003e2c task.stack: c938c000
>>> [1.314360] RIP: 1e030:entry_SYSCALL_64_fastpath+0x1/0xa5
>>> [1.314854] RSP: e02b:c938ff50 EFLAGS: 00010046
>>> [1.315336] RAX: 000c RBX: 55f550168040 RCX: 7fcfc959f59a
>>> [1.315827] RDX: RSI: RDI:
>>> [1.316315] RBP: 000a R08: 037f R09: 0064
>>> [1.316805] R10: 1f89cbf5 R11: 88003e2c R12: 7fcfc958ad60
>>> [1.317300] R13: R14: 55f550185954 R15: 1000
>>> [1.317801] FS: () GS:88003f80() knlGS:
>>> [1.318267] CS: e033 DS: ES: CR0: 80050033
>>> [1.318750] CR2: 7fcfc97ab218 CR3: 3c88e000 CR4: 00042660
>>> [1.319235] Call Trace:
>>> [1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48
>>> 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00
>>> 00 50 15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5
>>> [1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: c938ff50
>>> [1.344255] ---[ end trace d7cb8cd6cd7c294c ]---
>>> [1.345009] Kernel panic - not syncing: Attempted to kill init!
>>> exitcode=0x000b
>>>
>>>
>>> All code
>>> 
>>>    0:	51                   	push   %rcx
>>>    1:	50                   	push   %rax
>>>    2:	57                   	push   %rdi
>>>    3:	56                   	push   %rsi
>>>    4:	52                   	push   %rdx
>>>    5:	51                   	push   %rcx
>>>    6:	6a da                	pushq  $0xffda
>>>    8:	41 50                	push   %r8
>>>    a:	41 51                	push   %r9
>>>    c:	41 52                	push   %r10
>>>    e:	41 53                	push   %r11
>>>   10:	48 83 ec 30          	sub    $0x30,%rsp
>>>   14:	65 4c 8b 1c 25 c0 d2 	mov    %gs:0xd2c0,%r11
>>>   1b:	00 00
>>>   1d:	41 f7 03 df 39 08 90 	testl  $0x900839df,(%r11)
>>>   24:	0f 85 a5 00 00 00    	jne    0xcf
>>>   2a:	50                   	push   %rax
>>>   2b:*	ff 15 9c 95 d0 ff    	callq  *-0x2f6a64(%rip)    # 0xffd095cd    <-- trapping instruction
>>>   31:	58                   	pop    %rax
>>>   32:	48 3d 4c 01 00 00    	cmp    $0x14c,%rax
>>>   38:	77 0f                	ja     0x49
>>>   3a:	4c 89 d1             	mov    %r10,%rcx
>>>   3d:	ff                   	.byte 0xff
>>>   3e:	14 c5                	adc    $0xc5,%al
>>>
>>>
>>> so the original 'cli' was replaced with the pv call but to me the offset
>>> looks a bit off, no? Shouldn't it always be positive?
>> callq takes a 32bit signed displacement, so jumping back by up to 2G is
>> perfectly legitimate.
> Yes, but
>
> ostr@workbase> nm vmlinux | grep entry_SYSCALL_64_fastpath
> 817365dd t entry_SYSCALL_64_fastpath
> ostr@workbase> nm vmlinux | grep " pv_irq_ops"
> 81c2dbc0 D pv_irq_ops
> ostr@workbase>
>
> so pv_irq_ops.irq_disable is about 5MB ahead of where we are now. (I
> didn't mean that x86 instruction set doesn't allow negative
> displacement, I was trying to say that pv_irq_ops always lives further down)

I believe the problem is this:

#define PV_INDIRECT(addr)	*addr(%rip)

The displacement that the linker computes will be relative to where this
instruction is placed at the time of linking, which is in
.pv_altinstructions (and not .text). So when we copy it into .text the
displacement becomes bogus.

Replacing the macro with

#define PV_INDIRECT(addr)	*addr	// well, it's not so much indirect anymore

makes things work. Or maybe it can be adjusted to be kept truly indirect.

-boris
Re: [PATCH] xen-netfront, xen-netback: Use correct minimum MTU values
On 10/16/2017 11:05 AM, Wei Liu wrote: > On Mon, Oct 16, 2017 at 03:20:32PM +0200, Mohammed Gamal wrote: >> RFC791 specifies the minimum MTU to be 68, while xen-net{front|back} >> drivers use a minimum value of 0. >> >> When set MTU to 0~67 with xen_net{front|back} driver, the network >> will become unreachable immediately, the guest can no longer be pinged. >> >> xen_net{front|back} should not allow the user to set this value which causes >> network problems. >> >> Reported-by: Chen Shi >> Signed-off-by: Mohammed Gamal > Acked-by: Wei Liu > > CC netfront maintainers Reviewed-by: Boris Ostrovsky I can take it via Xen tree unless there are objections. -boris
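For context: since the 4.10 core MTU range checks, drivers express limits by filling in the netdev min_mtu/max_mtu fields instead of rejecting values in ndo_change_mtu(), so the fix amounts to something like this (a sketch; the exact upper bound used in the patch may differ):

	netdev->min_mtu = ETH_MIN_MTU;	/* 68, per RFC 791 */
	netdev->max_mtu = XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER;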
Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
On 10/17/2017 01:24 AM, Josh Poimboeuf wrote: > On Mon, Oct 16, 2017 at 02:18:48PM -0400, Boris Ostrovsky wrote: >> On 10/12/2017 03:53 PM, Boris Ostrovsky wrote: >>> On 10/12/2017 03:27 PM, Andrew Cooper wrote: >>>> On 12/10/17 20:11, Boris Ostrovsky wrote: >>>>> There is also another problem: >>>>> >>>>> [1.312425] general protection fault: [#1] SMP >>>>> [1.312901] Modules linked in: >>>>> [1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6 >>>>> [1.313878] task: 88003e2c task.stack: c938c000 >>>>> [1.314360] RIP: 1e030:entry_SYSCALL_64_fastpath+0x1/0xa5 >>>>> [1.314854] RSP: e02b:c938ff50 EFLAGS: 00010046 >>>>> [1.315336] RAX: 000c RBX: 55f550168040 RCX: >>>>> 7fcfc959f59a >>>>> [1.315827] RDX: RSI: RDI: >>>>> >>>>> [1.316315] RBP: 000a R08: 037f R09: >>>>> 0064 >>>>> [1.316805] R10: 1f89cbf5 R11: 88003e2c R12: >>>>> 7fcfc958ad60 >>>>> [1.317300] R13: R14: 55f550185954 R15: >>>>> 1000 >>>>> [1.317801] FS: () GS:88003f80() >>>>> knlGS: >>>>> [1.318267] CS: e033 DS: ES: CR0: 80050033 >>>>> [1.318750] CR2: 7fcfc97ab218 CR3: 3c88e000 CR4: >>>>> 00042660 >>>>> [1.319235] Call Trace: >>>>> [1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48 >>>>> 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00 >>>>> 00 50 15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5 >>>>> [1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: >>>>> c938ff50 >>>>> [1.344255] ---[ end trace d7cb8cd6cd7c294c ]--- >>>>> [1.345009] Kernel panic - not syncing: Attempted to kill init! >>>>> exitcode=0x000b >>>>> >>>>> >>>>> All code >>>>> >>>>>0:51 push %rcx >>>>>1:50 push %rax >>>>>2:57 push %rdi >>>>>3:56 push %rsi >>>>>4:52 push %rdx >>>>>5:51 push %rcx >>>>>6:6a dapushq $0xffda >>>>>8:41 50push %r8 >>>>>a:41 51push %r9 >>>>>c:41 52push %r10 >>>>>e:41 53push %r11 >>>>> 10:48 83 ec 30 sub$0x30,%rsp >>>>> 14:65 4c 8b 1c 25 c0 d2 mov%gs:0xd2c0,%r11 >>>>> 1b:00 00 >>>>> 1d:41 f7 03 df 39 08 90 testl $0x900839df,(%r11) >>>>> 24:0f 85 a5 00 00 00jne0xcf >>>>> 2a:50 push %rax >>>>> 2b:*ff 15 9c 95 d0 ffcallq *-0x2f6a64(%rip)# >>>>> 0xffd095cd<-- trapping instruction >>>>> 31:58 pop%rax >>>>> 32:48 3d 4c 01 00 00cmp$0x14c,%rax >>>>> 38:77 0fja 0x49 >>>>> 3a:4c 89 d1 mov%r10,%rcx >>>>> 3d:ff .byte 0xff >>>>> 3e:14 c5adc$0xc5,%al >>>>> >>>>> >>>>> so the original 'cli' was replaced with the pv call but to me the offset >>>>> looks a bit off, no? Shouldn't it always be positive? >>>> callq takes a 32bit signed displacement, so jumping back by up to 2G is >>>> perfectly legitimate. >>> Yes, but >>> >>> ostr@workbase> nm vmlinux | grep entry_SYSCALL_64_fastpath >>> 817365dd t entry_SYSCALL_64_fastpath >>> ostr@workbase> nm vmlinux | grep " pv_irq_ops" >>> 81c2dbc0 D pv_irq_ops >>> ostr@workbase> >>> >>> so pv_
Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
On 10/17/2017 09:10 AM, Brian Gerst wrote: > On Mon, Oct 16, 2017 at 2:18 PM, Boris Ostrovsky > wrote: >> >> Replacing the macro with >> >> #define PV_INDIRECT(addr) *addr // well, it's not so much >> indirect anymore >> >> makes things work. Or maybe it can be adjusted to be kept truly indirect. > That is still an indirect call, just using absolute addressing for the > pointer instead of RIP-relative. Oh, right, I've got my terminology all wrong. -boris > Alternatives has very limited > relocation capabilities. It will only handle a single call or jmp > replacement. Using absolute addressing is slightly less efficient > (takes one extra byte to encode, and needs a relocation for KASLR), > but it works just as well. You could also relocate the instruction > manually by adding the delta between the original and replacement code > to the displacement.
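To make the two operand forms concrete, a small user-space sketch (names hypothetical; the inline asm is illustrative only and the absolute form needs gcc -no-pie, since it uses a 32-bit absolute relocation):

#include <stdio.h>

static void target(void) { puts("called"); }

/* Stand-in for a pv_ops slot: the CALL dereferences this pointer. */
void (*pv_slot)(void) = target;

int main(void)
{
	/* RIP-relative pointer load: ff 15 disp32, 6 bytes. */
	asm volatile("call *pv_slot(%%rip)" ::: "rax", "rcx", "rdx",
		     "rsi", "rdi", "r8", "r9", "r10", "r11", "memory");

	/* Absolute pointer address: ff 14 25 imm32, 7 bytes -- one byte
	 * longer and in need of a relocation under KASLR, but still an
	 * indirect call through the same pointer. */
	asm volatile("call *pv_slot" ::: "rax", "rcx", "rdx",
		     "rsi", "rdi", "r8", "r9", "r10", "r11", "memory");
	return 0;
}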
Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
On 10/17/2017 10:36 AM, Josh Poimboeuf wrote: > > Maybe we can add a new field to the alternatives entry struct which > specifies the offset to the CALL instruction, so apply_alternatives() > can find it. We'd also have to assume that the restore part of an alternative entry is the same size as the save part. Which is true now. -boris
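A sketch of what that could look like; the existing fields mirror struct alt_instr as it stands in this era, and call_offset is the guessed addition (name and placement are assumptions, not the eventual patch):

/* Kernel-style fragment (s32/u16/u8 from <linux/types.h>). */
struct alt_instr {
	s32 instr_offset;	/* original instruction */
	s32 repl_offset;	/* offset to replacement instruction */
	u16 cpuid;		/* cpuid bit set for replacement */
	u8  instrlen;		/* length of original instruction */
	u8  replacementlen;	/* length of new instruction */
	u8  padlen;		/* length of build-time padding */
	u8  call_offset;	/* NEW: where the CALL sits inside the
				 * replacement, so apply_alternatives()
				 * can find it without assuming the
				 * save/restore sizes match */
} __packed;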
Re: [PATCH v5 02/13] xen/pvcalls: implement frontend disconnect
On 10/06/2017 08:30 PM, Stefano Stabellini wrote: > Introduce a data structure named pvcalls_bedata. It contains pointers to > the command ring, the event channel, a list of active sockets and a list > of passive sockets. Lists accesses are protected by a spin_lock. > > Introduce a waitqueue to allow waiting for a response on commands sent > to the backend. > > Introduce an array of struct xen_pvcalls_response to store commands > responses. > > pvcalls_refcount is used to keep count of the outstanding pvcalls users. > Only remove connections once the refcount is zero. > > Implement pvcalls frontend removal function. Go through the list of > active and passive sockets and free them all, one at a time. > > Signed-off-by: Stefano Stabellini > CC: boris.ostrov...@oracle.com > CC: jgr...@suse.com > --- > drivers/xen/pvcalls-front.c | 67 > + > 1 file changed, 67 insertions(+) > > diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c > index a8d38c2..d8b7a04 100644 > --- a/drivers/xen/pvcalls-front.c > +++ b/drivers/xen/pvcalls-front.c > @@ -20,6 +20,46 @@ > #include > #include > > +#define PVCALLS_INVALID_ID UINT_MAX > +#define PVCALLS_RING_ORDER XENBUS_MAX_RING_GRANT_ORDER > +#define PVCALLS_NR_REQ_PER_RING __CONST_RING_SIZE(xen_pvcalls, XEN_PAGE_SIZE) > + > +struct pvcalls_bedata { > + struct xen_pvcalls_front_ring ring; > + grant_ref_t ref; > + int irq; > + > + struct list_head socket_mappings; > + struct list_head socketpass_mappings; > + spinlock_t socket_lock; > + > + wait_queue_head_t inflight_req; > + struct xen_pvcalls_response rsp[PVCALLS_NR_REQ_PER_RING]; Did you mean _REQ_ or _RSP_ in the macro name? > +}; > +/* Only one front/back connection supported. */ > +static struct xenbus_device *pvcalls_front_dev; > +static atomic_t pvcalls_refcount; > + > +/* first increment refcount, then proceed */ > +#define pvcalls_enter() { \ > + atomic_inc(&pvcalls_refcount); \ > +} > + > +/* first complete other operations, then decrement refcount */ > +#define pvcalls_exit() {\ > + atomic_dec(&pvcalls_refcount); \ > +} > + > +static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id) > +{ > + return IRQ_HANDLED; > +} > + > +static void pvcalls_front_free_map(struct pvcalls_bedata *bedata, > +struct sock_mapping *map) > +{ > +} > + > static const struct xenbus_device_id pvcalls_front_ids[] = { > { "pvcalls" }, > { "" } > @@ -27,6 +67,33 @@ > > static int pvcalls_front_remove(struct xenbus_device *dev) > { > + struct pvcalls_bedata *bedata; > + struct sock_mapping *map = NULL, *n; > + > + bedata = dev_get_drvdata(&pvcalls_front_dev->dev); > + dev_set_drvdata(&dev->dev, NULL); > + pvcalls_front_dev = NULL; > + if (bedata->irq >= 0) > + unbind_from_irqhandler(bedata->irq, dev); > + > + smp_mb(); > + while (atomic_read(&pvcalls_refcount) > 0) > + cpu_relax(); > + list_for_each_entry_safe(map, n, &bedata->socket_mappings, list) { > + pvcalls_front_free_map(bedata, map); > + kfree(map); > + } > + list_for_each_entry_safe(map, n, &bedata->socketpass_mappings, list) { > + spin_lock(&bedata->socket_lock); > + list_del_init(&map->list); > + spin_unlock(&bedata->socket_lock); > + kfree(map); Why do you re-init the entry if you are freeing it? And do you really need the locks around it? This looks similar to the case we've discussed for other patches --- if we are concerned that someone may grab this entry then something must be wrong. (Sorry, this must have been here in earlier versions but I only now noticed it.) 
-boris > + } > + if (bedata->ref >= 0) > + gnttab_end_foreign_access(bedata->ref, 0, 0); > + kfree(bedata->ring.sring); > + kfree(bedata); > + xenbus_switch_state(dev, XenbusStateClosed); > return 0; > } >
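For what it's worth, the simplification suggested above could look like this, a sketch under the assumption that the pvcalls_refcount drain really does exclude all other users by this point (helper name hypothetical):

/* Kernel-style fragment; needs <linux/list.h> and <linux/slab.h>. */
static void pvcalls_free_passive_mappings(struct pvcalls_bedata *bedata)
{
	struct sock_mapping *map, *n;

	/* The remove path is single-threaded once the refcount hits
	 * zero, so no socket_lock and no list_del_init() on entries
	 * that are about to be freed anyway. */
	list_for_each_entry_safe(map, n, &bedata->socketpass_mappings, list)
		kfree(map);
}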
Re: [PATCH v5 04/13] xen/pvcalls: implement socket command and handle events
On 10/06/2017 08:30 PM, Stefano Stabellini wrote: > Send a PVCALLS_SOCKET command to the backend, use the masked > req_prod_pvt as req_id. This way, req_id is guaranteed to be between 0 > and PVCALLS_NR_REQ_PER_RING. We already have a slot in the rsp array > ready for the response, and there cannot be two outstanding responses > with the same req_id. > > Wait for the response by waiting on the inflight_req waitqueue and > check for the req_id field in rsp[req_id]. Use atomic accesses and > barriers to read the field. Note that the barriers are simple smp > barriers (as opposed to virt barriers) because they are for internal > frontend synchronization, not frontend<->backend communication. > > Once a response is received, clear the corresponding rsp slot by setting > req_id to PVCALLS_INVALID_ID. Note that PVCALLS_INVALID_ID is invalid > only from the frontend point of view. It is not part of the PVCalls > protocol. > > pvcalls_front_event_handler is in charge of copying responses from the > ring to the appropriate rsp slot. It is done by copying the body of the > response first, then by copying req_id atomically. After the copies, > wake up anybody waiting on waitqueue. > > socket_lock protects accesses to the ring. > > Create a new struct sock_mapping and convert the pointer into an > uint64_t and use it as id for the new socket to pass to the backend. The > struct will be fully initialized later on connect or bind. In this patch > the struct sock_mapping is empty, the fields will be added by the next > patch. > > sock->sk->sk_send_head is not used for ip sockets: reuse the field to > store a pointer to the struct sock_mapping corresponding to the socket. > This way, we can easily get the struct sock_mapping from the struct > socket. > > Signed-off-by: Stefano Stabellini Reviewed-by: Boris Ostrovsky with one question: > + /* > + * PVCalls only supports domain AF_INET, > + * type SOCK_STREAM and protocol 0 sockets for now. > + * > + * Check socket type here, AF_INET and protocol checks are done > + * by the caller. > + */ > + if (sock->type != SOCK_STREAM) > + return -ENOTSUPP; > + Is this ENOTSUPP or EOPNOTSUPP? I didn't know the former even existed and include/linux/errno.h suggests that this is NFSv3-specific. -boris
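The distinction, for reference: EOPNOTSUPP (95) is the POSIX errno that reaches user space, while ENOTSUPP (524) is kernel-internal and shows up in applications as "Unknown error 524". A sketch of the check with the conventional errno (an assumption about the fix, not the posted code):

/* Kernel-style fragment, errnos from <linux/errno.h>. */
if (sock->type != SOCK_STREAM)
	return -EOPNOTSUPP;	/* user space sees "Operation not supported" */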
Re: [PATCH v5 06/13] xen/pvcalls: implement bind command
On 10/06/2017 08:30 PM, Stefano Stabellini wrote: > Send PVCALLS_BIND to the backend. Introduce a new structure, part of > struct sock_mapping, to store information specific to passive sockets. > > Introduce a status field to keep track of the status of the passive > socket. > > Signed-off-by: Stefano Stabellini > CC: boris.ostrov...@oracle.com > CC: jgr...@suse.com > --- > drivers/xen/pvcalls-front.c | 66 > + > drivers/xen/pvcalls-front.h | 3 +++ > 2 files changed, 69 insertions(+) > > diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c > index 7c9261b..4cafd9b 100644 > --- a/drivers/xen/pvcalls-front.c > +++ b/drivers/xen/pvcalls-front.c > @@ -71,6 +71,13 @@ struct sock_mapping { > > wait_queue_head_t inflight_conn_req; > } active; > + struct { > + /* Socket status */ > +#define PVCALLS_STATUS_UNINITALIZED 0 > +#define PVCALLS_STATUS_BIND 1 > +#define PVCALLS_STATUS_LISTEN2 > + uint8_t status; > + } passive; > }; > }; > > @@ -347,6 +354,65 @@ int pvcalls_front_connect(struct socket *sock, struct > sockaddr *addr, > return ret; > } > > +int pvcalls_front_bind(struct socket *sock, struct sockaddr *addr, int > addr_len) > +{ > + struct pvcalls_bedata *bedata; > + struct sock_mapping *map = NULL; > + struct xen_pvcalls_request *req; > + int notify, req_id, ret; > + > + if (addr->sa_family != AF_INET || sock->type != SOCK_STREAM) > + return -ENOTSUPP; > + > + pvcalls_enter(); > + if (!pvcalls_front_dev) { > + pvcalls_exit(); > + return -ENOTCONN; The connect patch returns -ENETUNREACH here. Is there a deliberate distinction between these cases? Other than that Reviewed-by: Boris Ostrovsky
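The inconsistency being flagged, side by side (sketch; both tests guard the same "frontend gone" condition):

/* connect: */
if (!pvcalls_front_dev)
	return -ENETUNREACH;

/* bind: */
if (!pvcalls_front_dev)
	return -ENOTCONN;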
Re: [PATCH v5 08/13] xen/pvcalls: implement accept command
On 10/06/2017 08:30 PM, Stefano Stabellini wrote: > Introduce a waitqueue to allow only one outstanding accept command at > any given time and to implement polling on the passive socket. Introduce > a flags field to keep track of in-flight accept and poll commands. > > Send PVCALLS_ACCEPT to the backend. Allocate a new active socket. Make > sure that only one accept command is executed at any given time by > setting PVCALLS_FLAG_ACCEPT_INFLIGHT and waiting on the > inflight_accept_req waitqueue. > > Convert the new struct sock_mapping pointer into an uint64_t and use it > as id for the new socket to pass to the backend. > > Check if the accept call is non-blocking: in that case after sending the > ACCEPT command to the backend store the sock_mapping pointer of the new > struct and the inflight req_id then return -EAGAIN (which will respond > only when there is something to accept). Next time accept is called, > we'll check if the ACCEPT command has been answered, if so we'll pick up > where we left off, otherwise we return -EAGAIN again. > > Note that, differently from the other commands, we can use > wait_event_interruptible (instead of wait_event) in the case of accept > as we are able to track the req_id of the ACCEPT response that we are > waiting. > > Signed-off-by: Stefano Stabellini > CC: boris.ostrov...@oracle.com > CC: jgr...@suse.com > --- > drivers/xen/pvcalls-front.c | 146 > > drivers/xen/pvcalls-front.h | 3 + > 2 files changed, 149 insertions(+) > > diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c > index 5433fae..8958e74 100644 > --- a/drivers/xen/pvcalls-front.c > +++ b/drivers/xen/pvcalls-front.c > @@ -77,6 +77,16 @@ struct sock_mapping { > #define PVCALLS_STATUS_BIND 1 > #define PVCALLS_STATUS_LISTEN2 > uint8_t status; > + /* > + * Internal state-machine flags. > + * Only one accept operation can be inflight for a socket. > + * Only one poll operation can be inflight for a given socket. 
> + */ > +#define PVCALLS_FLAG_ACCEPT_INFLIGHT 0 > + uint8_t flags; > + uint32_t inflight_req_id; > + struct sock_mapping *accept_map; > + wait_queue_head_t inflight_accept_req; > + } passive; > }; > }; > @@ -392,6 +402,8 @@ int pvcalls_front_bind(struct socket *sock, struct > sockaddr *addr, int addr_len) > memcpy(req->u.bind.addr, addr, sizeof(*addr)); > req->u.bind.len = addr_len; > > + init_waitqueue_head(&map->passive.inflight_accept_req); > + > map->active_socket = false; > > bedata->ring.req_prod_pvt++; > @@ -470,6 +482,140 @@ int pvcalls_front_listen(struct socket *sock, int > backlog) > return ret; > } > > +int pvcalls_front_accept(struct socket *sock, struct socket *newsock, int > flags) > +{ > + struct pvcalls_bedata *bedata; > + struct sock_mapping *map; > + struct sock_mapping *map2 = NULL; > + struct xen_pvcalls_request *req; > + int notify, req_id, ret, evtchn, nonblock; > + > + pvcalls_enter(); > + if (!pvcalls_front_dev) { > + pvcalls_exit(); > + return -ENOTCONN; > + } > + bedata = dev_get_drvdata(&pvcalls_front_dev->dev); > + > + map = (struct sock_mapping *) sock->sk->sk_send_head; > + if (!map) { > + pvcalls_exit(); > + return -ENOTSOCK; > + } > + > + if (map->passive.status != PVCALLS_STATUS_LISTEN) { > + pvcalls_exit(); > + return -EINVAL; > + } > + > + nonblock = flags & SOCK_NONBLOCK; > + /* > + * Backend only supports 1 inflight accept request, will return > + * errors for the others > + */ > + if (test_and_set_bit(PVCALLS_FLAG_ACCEPT_INFLIGHT, > + (void *)&map->passive.flags)) { > + req_id = READ_ONCE(map->passive.inflight_req_id); > + if (req_id != PVCALLS_INVALID_ID && > + READ_ONCE(bedata->rsp[req_id].req_id) == req_id) { READ_ONCE (especially the second one)? I know I may sound fixated on this but I really don't understand how the compiler may do anything wrong if straight reads were used. For the first case, I guess, theoretically the compiler may decide to re-fetch map->passive.inflight_req_id. But even if it did, would that be a problem? Both of these READ_ONCE targets are updated below before PVCALLS_FLAG_ACCEPT_INFLIGHT is cleared so there should not be any change between re-fetching, I think. (The only exception is the noblock case, which does a WRITE_ONCE that I don't understand either) > + map2 = map->passive.accept_map; > + goto received; > + } > + if (nonblock) { > + pvcalls_exit(); > + return -EAGAIN; > + } > +
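To spell out the hazard being debated, a stand-alone sketch of the re-fetch scenario (field names simplified; READ_ONCE is modeled with a volatile cast, as in the kernel). Whether the re-fetch can actually bite in this code path is exactly the open question above:

#include <stdint.h>

#define PVCALLS_INVALID_ID UINT32_MAX
#define READ_ONCE_U32(x) (*(volatile uint32_t *)&(x))

struct rsp { uint32_t req_id; };

/* With plain loads the compiler may legally re-read inflight_req_id
 * after the bounds check, so rsp[] could be indexed with a value that
 * was never compared against PVCALLS_INVALID_ID. The volatile access
 * forces a single snapshot. */
int response_ready(uint32_t *inflight_req_id, struct rsp *rsp)
{
	uint32_t req_id = READ_ONCE_U32(*inflight_req_id);

	return req_id != PVCALLS_INVALID_ID &&
	       READ_ONCE_U32(rsp[req_id].req_id) == req_id;
}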
Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
On 10/17/2017 04:17 PM, Josh Poimboeuf wrote: > On Tue, Oct 17, 2017 at 11:36:57AM -0400, Boris Ostrovsky wrote: >> On 10/17/2017 10:36 AM, Josh Poimboeuf wrote: >>> Maybe we can add a new field to the alternatives entry struct which >>> specifies the offset to the CALL instruction, so apply_alternatives() >>> can find it. >> We'd also have to assume that the restore part of an alternative entry >> is the same size as the save part. Which is true now. > Why? > Don't you need to know the size of the instruction without the save and restore parts? + if (a->replacementlen == 6 && *insnbuf == 0xff && *(insnbuf+1) == 0x15) Otherwise you'd need another field for the actual instruction length. -boris
Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
On 10/17/2017 04:50 PM, Josh Poimboeuf wrote: > On Tue, Oct 17, 2017 at 04:36:00PM -0400, Boris Ostrovsky wrote: >> On 10/17/2017 04:17 PM, Josh Poimboeuf wrote: >>> On Tue, Oct 17, 2017 at 11:36:57AM -0400, Boris Ostrovsky wrote: >>>> On 10/17/2017 10:36 AM, Josh Poimboeuf wrote: >>>>> Maybe we can add a new field to the alternatives entry struct which >>>>> specifies the offset to the CALL instruction, so apply_alternatives() >>>>> can find it. >>>> We'd also have to assume that the restore part of an alternative entry >>>> is the same size as the save part. Which is true now. >>> Why? >>> >> Don't you need to know the size of the instruction without save and >> restore part? >> >> + if (a->replacementlen == 6 && *insnbuf == 0xff && *(insnbuf+1) == 0x15) >> >> Otherwise you'd need another field for the actual instruction length. > If we know where the CALL instruction starts, and can verify that it > starts with "ff 15", then we know the instruction length: 6 bytes. > Right? > Oh, OK. Then you shouldn't need a->replacementlen test(s?) in apply_alternatives()? -boris
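Putting this subthread together, a sketch of the apply_alternatives() step once the CALL has been located (call_offset is the hypothetical new field from earlier; instr and replacement are the original and replacement addresses). The delta direction follows from keeping the call target fixed while the instruction moves:

/* Kernel-style fragment: insnbuf holds the replacement bytes about to
 * be copied over the original instruction at 'instr'. */
u8 *call = insnbuf + a->call_offset;

if (call[0] == 0xff && call[1] == 0x15) {	/* call *disp32(%rip) */
	s32 *disp = (s32 *)(call + 2);
	/*
	 * target = end_of_insn + disp. The instruction moves from the
	 * replacement site to the original site, so to keep the target:
	 * disp_new = disp_old + (replacement - instr).
	 */
	*disp += (s32)((unsigned long)replacement - (unsigned long)instr);
}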
Re: [PATCH v5 09/13] xen/pvcalls: implement sendmsg
> +static int __write_ring(struct pvcalls_data_intf *intf, > + struct pvcalls_data *data, > + struct iov_iter *msg_iter, > + int len) > +{ > + RING_IDX cons, prod, size, masked_prod, masked_cons; > + RING_IDX array_size = XEN_FLEX_RING_SIZE(PVCALLS_RING_ORDER); > + int32_t error; > + > + error = intf->out_error; > + if (error < 0) > + return error; > + cons = intf->out_cons; > + prod = intf->out_prod; > + /* read indexes before continuing */ > + virt_mb(); > + > + size = pvcalls_queued(prod, cons, array_size); > + if (size >= array_size) > + return 0; I thought you were going to return an error here? If this can only be due to someone messing up the indexes, is there a reason to continue trying to write? What are the chances that the index will get corrected? -boris > + if (len > array_size - size) > + len = array_size - size; > + > + masked_prod = pvcalls_mask(prod, array_size); > + masked_cons = pvcalls_mask(cons, array_size); > + > + if (masked_prod < masked_cons) { > + copy_from_iter(data->out + masked_prod, len, msg_iter); > + } else { > + if (len > array_size - masked_prod) { > + copy_from_iter(data->out + masked_prod, > +array_size - masked_prod, msg_iter); > + copy_from_iter(data->out, > +len - (array_size - masked_prod), > +msg_iter); > + } else { > + copy_from_iter(data->out + masked_prod, len, msg_iter); > + } > + } > + /* write to ring before updating pointer */ > + virt_wmb(); > + intf->out_prod += len; > + > + return len; > +}
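For reference, pvcalls_queued() and pvcalls_mask() are not in this hunk; the following is a guess at their usual power-of-two definitions. Under these, size == array_size is simply a full ring (a legitimate "try again later"), and only size > array_size would imply corrupted indices, which may be why the code returns 0 rather than an error:

/* Guessed definitions (free-running indices, power-of-two ring). */
typedef unsigned int RING_IDX;

static inline RING_IDX pvcalls_queued(RING_IDX prod, RING_IDX cons,
				      RING_IDX ring_size)
{
	return prod - cons;	/* unsigned wraparound = bytes in flight */
}

static inline RING_IDX pvcalls_mask(RING_IDX idx, RING_IDX ring_size)
{
	return idx & (ring_size - 1);	/* cheap modulo, size must be 2^n */
}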
Re: [PATCH v5 10/13] xen/pvcalls: implement recvmsg
> + > +int pvcalls_front_recvmsg(struct socket *sock, struct msghdr *msg, size_t > len, > + int flags) > +{ > + struct pvcalls_bedata *bedata; > + int ret; > + struct sock_mapping *map; > + > + if (flags & (MSG_CMSG_CLOEXEC|MSG_ERRQUEUE|MSG_OOB|MSG_TRUNC)) > + return -EOPNOTSUPP; > + > + pvcalls_enter(); > + if (!pvcalls_front_dev) { > + pvcalls_exit(); > + return -ENOTCONN; > + } > + bedata = dev_get_drvdata(&pvcalls_front_dev->dev); > + > + map = (struct sock_mapping *) sock->sk->sk_send_head; > + if (!map) { > + pvcalls_exit(); > + return -ENOTSOCK; > + } > + > + mutex_lock(&map->active.in_mutex); > + if (len > XEN_FLEX_RING_SIZE(PVCALLS_RING_ORDER)) > + len = XEN_FLEX_RING_SIZE(PVCALLS_RING_ORDER); > + > + while (!(flags & MSG_DONTWAIT) && !pvcalls_front_read_todo(map)) { > + wait_event_interruptible(map->active.inflight_conn_req, > + pvcalls_front_read_todo(map)); > + } > + ret = __read_ring(map->active.ring, &map->active.data, > + &msg->msg_iter, len, flags); > + > + if (ret > 0) > + notify_remote_via_irq(map->active.irq); > + if (ret == 0) > + ret = -EAGAIN; Why not 0? The manpage says: EAGAIN or EWOULDBLOCK The socket is marked nonblocking and the receive operation would block, or a receive timeout had been set and the timeout expired before data was received. POSIX.1 allows either error to be returned for this case, and does not require these constants to have the same value, so a portable application should check for both possibilities. I don't think either of these conditions is true here. (Again, should have noticed this earlier, sorry) -boris > + if (ret == -ENOTCONN) > + ret = 0; > + > + mutex_unlock(&map->active.in_mutex); > + pvcalls_exit(); > + return ret; > +}
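A sketch of the convention the man page text implies, as an assumption about the intended fix rather than the posted code: reserve -EAGAIN for the nonblocking path and let "nothing read" surface as 0:

/* Kernel-style fragment replacing the quoted tail of recvmsg(). */
if (ret == 0)
	ret = (flags & MSG_DONTWAIT) ? -EAGAIN : 0;
if (ret == -ENOTCONN)
	ret = 0;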
Re: [PATCH v5 11/13] xen/pvcalls: implement poll command
> > +static unsigned int pvcalls_front_poll_passive(struct file *file, > +struct pvcalls_bedata *bedata, > +struct sock_mapping *map, > +poll_table *wait) > +{ > + int notify, req_id, ret; > + struct xen_pvcalls_request *req; > + > + if (test_bit(PVCALLS_FLAG_ACCEPT_INFLIGHT, > + (void *)&map->passive.flags)) { > + uint32_t req_id = READ_ONCE(map->passive.inflight_req_id); > + > + if (req_id != PVCALLS_INVALID_ID && > + READ_ONCE(bedata->rsp[req_id].req_id) == req_id) > + return POLLIN | POLLRDNORM; Same READ_ONCE() question as for an earlier patch. -boris > + > + poll_wait(file, &map->passive.inflight_accept_req, wait); > + return 0; > + } > +