"Shutdown" not shutting down...
I was wondering if someone could explain something to me. I've got a dual-boot Red Hat 6.2/Win95B box, AMD K6-500, NVidia / SGS Thomson|Riva128 video card w/8MB memory, BTC-1817 sound card, 64MB RAM, and a Phoebe 56K modem. On the Win95 side, when I shutdown the system it goes through the shutdown procedure, and at the end it turns off the box. In Linux, "shutdown -h now" results in Linux going through the process, but at the end I get the following: Call trace: [] [] [] [] [] [] [] [] [] [] [] [] Code: <1>Unable to handle kernel paging request at virtual address 893d current->tss.cr3 = 03799000, %cr3 = 03799000 *pde = Oops: CPU: 0 EIP: 0010: [] EFLAGS: 00010046 eax:893d ebx: ecx:893e edx:0002 esi:0026 edi:c0346000 ebp:c400 esp:c0345d54 ds:0018 ex:0018 ss:0018 Process halt (pid 913, process nr:14, stackpage = c0345000) Stack: fee1dead 6789 0082 893e c480 c010a4c5 c0345da4 c01e3907 c01e39dc fdf8 c0345da4 c010a9c2 c01e39dc c0345da4 fdf8 c0344000 bfff8026 c010a119 c0345da4 fdf8 0001 0001 bfff8026 Call trace: [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] Code: 0f b6 0c 03 89 4c 24 14 51 68 ff 38 1e c0 e8 4e a3 00 00 83 (the above was all that I could see on the screen. if there was anything prior to the first Call trace, I'm afraid my tired old eyes missed it. this was also all manually copied, and manually entered, but I believe I've gotten it correctly) Anyone have any idea what I'm doing wrong here? Thx, Wolf - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC V2 0/5] Separate consigned (expected steal) from steal time.
On 10/22/2012 10:33 AM, Rik van Riel wrote: On 10/16/2012 10:23 PM, Michael Wolf wrote: In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. How do s390 and Power systems deal with reporting that kind of information? IMHO it would be good to see what those do, so we do not end up re-inventing the wheel, and confusing admins with yet another way of reporting the information... Sorry for the delay in the response. I'm assuming you are asking about s390 and Power lpars. In the case of lpar on POWER systems they simply report steal time and do not alter it in any way. They do however report how much processor is assigned to the partition and that information is in /proc/ppc64/lparcfg. Mike -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/5] Alter steal time reporting in KVM
In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. --- Michael Wolf (5): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. Add an ioctl to communicate the consign limit to the host. arch/x86/include/asm/kvm_host.h | 11 +++ arch/x86/include/asm/kvm_para.h |3 +- arch/x86/include/asm/paravirt.h |4 +-- arch/x86/include/asm/paravirt_types.h |2 + arch/x86/kernel/kvm.c |8 ++--- arch/x86/kernel/paravirt.c|4 +-- arch/x86/kvm/x86.c| 50 - fs/proc/stat.c|9 +- include/linux/kernel_stat.h |2 + include/linux/kvm_host.h |2 + include/uapi/linux/kvm.h |2 + kernel/sched/core.c | 10 ++- kernel/sched/cputime.c| 21 +- kernel/sched/sched.h |2 + virt/kvm/kvm_main.c |7 + 15 files changed, 120 insertions(+), 17 deletions(-) -- Signature -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/5] Alter the amount of steal time reported by the guest.
Modify the amount of stealtime that the kernel reports via the /proc interface. Steal time will now be broken down into steal_time and consigned_time. Consigned_time will represent the amount of time that is expected to be lost due to overcommitment of the physical cpu or by using cpu capping. The amount consigned_time will be passed in using an ioctl. The time will be expressed in the number of nanoseconds to be lost in during the fixed period. The fixed period is currently 1/10th of a second. Signed-off-by: Michael Wolf --- fs/proc/stat.c |9 +++-- include/linux/kernel_stat.h |1 + 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index e296572..cb7fe80 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v) int i, j; unsigned long jif; u64 user, nice, system, idle, iowait, irq, softirq, steal; - u64 guest, guest_nice; + u64 guest, guest_nice, consign; u64 sum = 0; u64 sum_softirq = 0; unsigned int per_softirq_sums[NR_SOFTIRQS] = {0}; @@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v) user = nice = system = idle = iowait = irq = softirq = steal = 0; - guest = guest_nice = 0; + guest = guest_nice = consign = 0; getboottime(&boottime); jif = boottime.tv_sec; + for_each_possible_cpu(i) { user += kcpustat_cpu(i).cpustat[CPUTIME_USER]; nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE]; @@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v) steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; sum += kstat_cpu_irqs_sum(i); sum += arch_irq_stat_cpu(i); @@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); for_each_online_cpu(i) { @@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v) steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; seq_printf(p, "cpu%d", i); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice)); @@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); } seq_printf(p, "intr %llu", (unsigned long long)sum); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 1865b1f..e5978b0 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -28,6 +28,7 @@ enum cpu_usage_stat { CPUTIME_STEAL, CPUTIME_GUEST, CPUTIME_GUEST_NICE, + CPUTIME_CONSIGN, NR_STATS, }; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/5] Add the code to send the consigned time from the host to the guest
Add the code to send the consigned time from the host to the guest. Signed-off-by: Michael Wolf --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/include/asm/kvm_para.h |3 ++- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/kernel/kvm.c |3 ++- arch/x86/kernel/paravirt.c |4 ++-- arch/x86/kvm/x86.c |2 ++ include/linux/kernel_stat.h |1 + kernel/sched/cputime.c | 21 +++-- kernel/sched/sched.h|2 ++ 9 files changed, 33 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b2e11f4..434d378 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -426,6 +426,7 @@ struct kvm_vcpu_arch { u64 msr_val; u64 last_steal; u64 accum_steal; + u64 accum_consigned; struct gfn_to_hva_cache stime; struct kvm_steal_time steal; } st; diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index eb3e9d8..1763369 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -42,9 +42,10 @@ struct kvm_steal_time { __u64 steal; + __u64 consigned; __u32 version; __u32 flags; - __u32 pad[12]; + __u32 pad[10]; }; #define KVM_STEAL_ALIGNMENT_BITS 5 diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a5f9f30..d39e8d0 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu, u64 *steal) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { - PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); + PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index ac357b3..4439a5c 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -372,7 +372,7 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static void kvm_steal_clock(int cpu, u64 *steal) +static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned) { struct kvm_steal_time *src; int version; @@ -382,6 +382,7 @@ static void kvm_steal_clock(int cpu, u64 *steal) version = src->version; rmb(); *steal = src->steal; + *consigned = src->consigned; rmb(); } while ((version & 1) || (version != src->version)); } diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 17fff18..3797683 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr) struct static_key paravirt_steal_enabled; struct static_key paravirt_steal_rq_enabled; -static u64 native_steal_clock(int cpu) +static void native_steal_clock(int cpu, u64 *steal, u64 *consigned) { - return 0; + *steal = *consigned = 0; } /* These are in entry.S */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1eefebe..683b531 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1565,8 +1565,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu) return; vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal; + vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned; vcpu->arch.st.steal.version += 2; vcpu->arch.st.accum_steal = 0; + vcpu->arch.st.accum_consigned = 0; kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime, &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index e5978b0..91afaa3 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct task_struct *); extern void account_user_time(struct task_struct *, cputime_t, cputime_t); extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t); extern void account_steal_time(cputime_t); +extern void account_consigned_time(cputime_t); extern void account_idle_time(cputime_t); extern void account_process_tick(struct task_struct *, int user); diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 593b647..53bd0be 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int hardirq_offset, } /* + * This accounts for the time that is split out of steal time. + *
[PATCH 4/5] Add a timer to allow the separation of consigned from steal time.
Add a timer to the host. This will define the period. During a period the first n ticks will go into the consigned bucket. Any other ticks that occur within the period will be placed in the stealtime bucket. Signed-off-by: Michael Wolf --- arch/x86/include/asm/kvm_host.h | 10 + arch/x86/include/asm/paravirt.h |2 +- arch/x86/kvm/x86.c | 42 ++- 3 files changed, 52 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 434d378..4794c95 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -41,6 +41,8 @@ #define KVM_PIO_PAGE_OFFSET 1 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2 +#define KVM_STEAL_TIMER_DELAY 1UL + #define CR0_RESERVED_BITS \ (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \ | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \ @@ -353,6 +355,14 @@ struct kvm_vcpu_arch { bool tpr_access_reporting; /* +* timer used to determine if the time should be counted as +* steal time or consigned time. +*/ + struct hrtimer steal_timer; + u64 current_consigned; + u64 consigned_limit; + + /* * Paging state of the vcpu * * If the vcpu runs in guest mode with two level paging this still saves diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index d39e8d0..6db79f9 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,7 +196,7 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) +static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 683b531..c91f4c9 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1546,13 +1546,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu) static void accumulate_steal_time(struct kvm_vcpu *vcpu) { u64 delta; + u64 steal_delta; + u64 consigned_delta; if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) return; delta = current->sched_info.run_delay - vcpu->arch.st.last_steal; vcpu->arch.st.last_steal = current->sched_info.run_delay; - vcpu->arch.st.accum_steal = delta; + + /* split the delta into steal and consigned */ + if (vcpu->arch.current_consigned < vcpu->arch.consigned_limit) { + vcpu->arch.current_consigned += delta; + if (vcpu->arch.current_consigned > vcpu->arch.consigned_limit) { + steal_delta = vcpu->arch.current_consigned + - vcpu->arch.consigned_limit; + consigned_delta = delta - steal_delta; + } else { + consigned_delta = delta; + steal_delta = 0; + } + } else { + consigned_delta = 0; + steal_delta = delta; + } + vcpu->arch.st.accum_steal = steal_delta; + vcpu->arch.st.accum_consigned = consigned_delta; } static void record_steal_time(struct kvm_vcpu *vcpu) @@ -6203,11 +6222,25 @@ bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu) struct static_key kvm_no_apic_vcpu __read_mostly; +enum hrtimer_restart steal_timer_fn(struct hrtimer *data) +{ + struct kvm_vcpu *vcpu; + ktime_t now; + + vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer); + vcpu->arch.current_consigned = 0; + now = ktime_get(); + hrtimer_forward(&vcpu->arch.steal_timer, now, + ktime_set(0, KVM_STEAL_TIMER_DELAY)); + return HRTIMER_RESTART; +} + int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) { struct page *page; struct kvm *kvm; int r; + ktime_t ktime; BUG_ON(vcpu->kvm == NULL); kvm = vcpu->kvm; @@ -6251,6 +6284,12 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) kvm_async_pf_hash_reset(vcpu); kvm_pmu_init(vcpu); + /* Initialize and start a timer to capture steal and consigned time */ + hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC, + HRTIMER_MODE_REL); + vcpu->arch.steal_timer.function = &steal_timer_fn; + ktime = ktime_set(0, KVM_STEAL_TIMER_DELAY); + hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL); return 0; fail_free_mce_banks: @@ -6269,6 +6308,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) { int idx; +
[PATCH 5/5] Add an ioctl to communicate the consign limit to the host.
Add an ioctl to communicate the consign limit to the host. Signed-off-by: Michael Wolf --- arch/x86/kvm/x86.c |6 ++ include/linux/kvm_host.h |2 ++ include/uapi/linux/kvm.h |2 ++ virt/kvm/kvm_main.c |7 +++ 4 files changed, 17 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c91f4c9..5d57469 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5938,6 +5938,12 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu, return 0; } +int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu, long entitlement) +{ + vcpu->arch.consigned_limit = entitlement; + return 0; +} + int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) { struct i387_fxsave_struct *fxsave = diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 0e2212f..de13648 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -590,6 +590,8 @@ void kvm_arch_hardware_unsetup(void); void kvm_arch_check_processor_compat(void *rtn); int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu); int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu); +int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu, + long entitlement); void kvm_free_physmem(struct kvm *kvm); diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 0a6d6ba..86f24bb 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -921,6 +921,8 @@ struct kvm_s390_ucas_mapping { #define KVM_SET_ONE_REG _IOW(KVMIO, 0xac, struct kvm_one_reg) /* VM is being stopped by host */ #define KVM_KVMCLOCK_CTRL_IO(KVMIO, 0xad) +/* Set the consignment limit which will be used to separete steal time */ +#define KVM_SET_ENTITLEMENT _IOW(KVMIO, 0xae, unsigned long) #define KVM_DEV_ASSIGN_ENABLE_IOMMU(1 << 0) #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index be70035..c712fe5 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2062,6 +2062,13 @@ out_free2: r = 0; break; } + case KVM_SET_ENTITLEMENT: { + r = kvm_arch_vcpu_ioctl_set_entitlement(vcpu, arg); + if (r) + goto out; + r = 0; + break; + } default: r = kvm_arch_vcpu_ioctl(filp, ioctl, arg); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/5] Expand the steal time msr to also contain the consigned time.
Add a consigned field. This field will hold the time lost due to capping or overcommit. The rest of the time will still show up in the steal-time field. Signed-off-by: Michael Wolf --- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/kvm.c |7 ++- kernel/sched/core.c | 10 +- kernel/sched/cputime.c|2 +- 5 files changed, 15 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a0facf3..a5f9f30 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal) { - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 142236e..5d4fc8b 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -95,7 +95,7 @@ struct pv_lazy_ops { struct pv_time_ops { unsigned long long (*sched_clock)(void); - unsigned long long (*steal_clock)(int cpu); + void (*steal_clock)(int cpu, unsigned long long *steal); unsigned long (*get_tsc_khz)(void); }; diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 4180a87..ac357b3 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -372,9 +372,8 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static u64 kvm_steal_clock(int cpu) +static void kvm_steal_clock(int cpu, u64 *steal) { - u64 steal; struct kvm_steal_time *src; int version; @@ -382,11 +381,9 @@ static u64 kvm_steal_clock(int cpu) do { version = src->version; rmb(); - steal = src->steal; + *steal = src->steal; rmb(); } while ((version & 1) || (version != src->version)); - - return steal; } void kvm_disable_steal_time(void) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c2e077c..b21d92d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -748,6 +748,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) */ #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) s64 steal = 0, irq_delta = 0; + u64 consigned = 0; #endif #ifdef CONFIG_IRQ_TIME_ACCOUNTING irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; @@ -776,8 +777,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING if (static_key_false((¶virt_steal_rq_enabled))) { u64 st; + u64 cs; - steal = paravirt_steal_clock(cpu_of(rq)); + paravirt_steal_clock(cpu_of(rq), &steal, &consigned); + /* +* since we are not assigning the steal time to cpustats +* here, just combine the steal and consigned times to +* do the rest of the calculations. +*/ + steal += consigned; steal -= rq->prev_steal_time_rq; if (unlikely(steal > delta)) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 8d859da..593b647 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void) if (static_key_false(¶virt_steal_enabled)) { u64 steal, st = 0; - steal = paravirt_steal_clock(smp_processor_id()); + paravirt_steal_clock(smp_processor_id(), &steal); steal -= this_rq()->prev_steal_time; st = steal_ticks(steal); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/5] Alter the amount of steal time reported by the guest.
Modify the amount of stealtime that the kernel reports via the /proc interface. Steal time will now be broken down into steal_time and consigned_time. Consigned_time will represent the amount of time that is expected to be lost due to overcommitment of the physical cpu or by using cpu capping. The amount consigned_time will be passed in using an ioctl. The time will be expressed in the number of nanoseconds to be lost in during the fixed period. The fixed period is currently 1/10th of a second. Signed-off-by: Michael Wolf --- fs/proc/stat.c |9 +++-- include/linux/kernel_stat.h |1 + 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index e296572..cb7fe80 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v) int i, j; unsigned long jif; u64 user, nice, system, idle, iowait, irq, softirq, steal; - u64 guest, guest_nice; + u64 guest, guest_nice, consign; u64 sum = 0; u64 sum_softirq = 0; unsigned int per_softirq_sums[NR_SOFTIRQS] = {0}; @@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v) user = nice = system = idle = iowait = irq = softirq = steal = 0; - guest = guest_nice = 0; + guest = guest_nice = consign = 0; getboottime(&boottime); jif = boottime.tv_sec; + for_each_possible_cpu(i) { user += kcpustat_cpu(i).cpustat[CPUTIME_USER]; nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE]; @@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v) steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; sum += kstat_cpu_irqs_sum(i); sum += arch_irq_stat_cpu(i); @@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); for_each_online_cpu(i) { @@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v) steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; seq_printf(p, "cpu%d", i); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice)); @@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); } seq_printf(p, "intr %llu", (unsigned long long)sum); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 1865b1f..e5978b0 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -28,6 +28,7 @@ enum cpu_usage_stat { CPUTIME_STEAL, CPUTIME_GUEST, CPUTIME_GUEST_NICE, + CPUTIME_CONSIGN, NR_STATS, }; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/5] Alter stealtime reporting in KVM
In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. --- Michael Wolf (5): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. Add an ioctl to communicate the consign limit to the host. CREDITS|5 Documentation/arm64/memory.txt | 12 Documentation/cgroups/memory.txt |4 .../devicetree/bindings/net/mdio-gpio.txt |9 Documentation/filesystems/proc.txt | 16 Documentation/hwmon/fam15h_power |2 Documentation/kernel-parameters.txt| 20 Documentation/networking/netdev-features.txt |2 Documentation/scheduler/numa-problem.txt | 20 MAINTAINERS| 87 + Makefile |2 arch/alpha/kernel/osf_sys.c|6 arch/arm/boot/Makefile | 10 arch/arm/boot/dts/tegra30.dtsi |4 arch/arm/include/asm/io.h |4 arch/arm/include/asm/sched_clock.h |2 arch/arm/include/asm/vfpmacros.h | 12 arch/arm/include/uapi/asm/hwcap.h |3 arch/arm/kernel/sched_clock.c | 18 arch/arm/mach-at91/at91rm9200_devices.c|2 arch/arm/mach-at91/at91sam9260_devices.c |2 arch/arm/mach-at91/at91sam9261_devices.c |2 arch/arm/mach-at91/at91sam9263_devices.c |2 arch/arm/mach-at91/at91sam9g45_devices.c | 12 arch/arm/mach-davinci/dm644x.c |3 arch/arm/mach-highbank/system.c|3 arch/arm/mach-imx/clk-gate2.c |2 arch/arm/mach-imx/ehci-imx25.c |2 arch/arm/mach-imx/ehci-imx35.c |2 arch/arm/mach-omap2/board-igep0020.c |5 arch/arm/mach-omap2/clockdomains44xx_data.c|2 arch/arm/mach-omap2/devices.c | 79 + arch/arm/mach-omap2/omap_hwmod.c | 63 + arch/arm/mach-omap2/omap_hwmod_44xx_data.c | 36 arch/arm/mach-omap2/twl-common.c |3 arch/arm/mach-omap2/vc.c |2 arch/arm/mach-pxa/hx4700.c |8 arch/arm/mach-pxa/spitz_pm.c |8 arch/arm/mm/alignment.c|2 arch/arm/plat-omap/include/plat/omap_hwmod.h |6 arch/arm/tools/Makefile|2 arch/arm/vfp/vfpmodule.c |9 arch/arm/xen/enlighten.c | 11 arch/arm/xen/hypercall.S | 14 arch/arm64/Kconfig |1 arch/arm64/include/asm/elf.h |5 arch/arm64/include/asm/fpsimd.h|5 arch/arm64/include/asm/io.h| 10 arch/arm64/include/asm/pgtable-hwdef.h |6 arch/arm64/include/asm/pgtable.h | 40 - arch/arm64/include/asm/processor.h |2 arch/arm64/include/asm/unistd.h|1 arch/arm64/kernel/perf_event.c | 10 arch/arm64/kernel/process.c| 18 arch/arm64/kernel/smp.c|3 arch/arm64/mm/init.c |2 arch/frv/Kconfig |1 arch/frv/boot/Makefile | 10 arch/frv/include/asm/unistd.h |1 arch/frv/kernel/entry.S| 28 arch/frv/kernel/process.c |5 arch/frv/mb93090-mb00/pci-dma-nommu.c |1 arch/h8300/include/asm/cache.h |3 arch/ia64/mm/init.c|1 arch/m68k/include/asm/signal.h |6 arch/mips/cavium-octeon/executive/cvmx-l2c.c | 900 arch/unicore32/include/asm/byteorder.h | 24 arch
[PATCH 2/5] Expand the steal time msr to also contain the consigned time.
Add a consigned field. This field will hold the time lost due to capping or overcommit. The rest of the time will still show up in the steal-time field. Signed-off-by: Michael Wolf --- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/kvm.c |7 ++- kernel/sched/core.c | 10 +- kernel/sched/cputime.c|2 +- 5 files changed, 15 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a0facf3..a5f9f30 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal) { - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 142236e..5d4fc8b 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -95,7 +95,7 @@ struct pv_lazy_ops { struct pv_time_ops { unsigned long long (*sched_clock)(void); - unsigned long long (*steal_clock)(int cpu); + void (*steal_clock)(int cpu, unsigned long long *steal); unsigned long (*get_tsc_khz)(void); }; diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 4180a87..ac357b3 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -372,9 +372,8 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static u64 kvm_steal_clock(int cpu) +static void kvm_steal_clock(int cpu, u64 *steal) { - u64 steal; struct kvm_steal_time *src; int version; @@ -382,11 +381,9 @@ static u64 kvm_steal_clock(int cpu) do { version = src->version; rmb(); - steal = src->steal; + *steal = src->steal; rmb(); } while ((version & 1) || (version != src->version)); - - return steal; } void kvm_disable_steal_time(void) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c2e077c..b21d92d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -748,6 +748,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) */ #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) s64 steal = 0, irq_delta = 0; + u64 consigned = 0; #endif #ifdef CONFIG_IRQ_TIME_ACCOUNTING irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; @@ -776,8 +777,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING if (static_key_false((¶virt_steal_rq_enabled))) { u64 st; + u64 cs; - steal = paravirt_steal_clock(cpu_of(rq)); + paravirt_steal_clock(cpu_of(rq), &steal, &consigned); + /* +* since we are not assigning the steal time to cpustats +* here, just combine the steal and consigned times to +* do the rest of the calculations. +*/ + steal += consigned; steal -= rq->prev_steal_time_rq; if (unlikely(steal > delta)) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 8d859da..593b647 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void) if (static_key_false(¶virt_steal_enabled)) { u64 steal, st = 0; - steal = paravirt_steal_clock(smp_processor_id()); + paravirt_steal_clock(smp_processor_id(), &steal); steal -= this_rq()->prev_steal_time; st = steal_ticks(steal); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/5] Add the code to send the consigned time from the host to the guest
Add the code to send the consigned time from the host to the guest. Signed-off-by: Michael Wolf --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/include/asm/kvm_para.h |3 ++- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/kernel/kvm.c |3 ++- arch/x86/kernel/paravirt.c |4 ++-- arch/x86/kvm/x86.c |2 ++ include/linux/kernel_stat.h |1 + kernel/sched/cputime.c | 21 +++-- kernel/sched/sched.h|2 ++ 9 files changed, 33 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b2e11f4..434d378 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -426,6 +426,7 @@ struct kvm_vcpu_arch { u64 msr_val; u64 last_steal; u64 accum_steal; + u64 accum_consigned; struct gfn_to_hva_cache stime; struct kvm_steal_time steal; } st; diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index eb3e9d8..1763369 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -42,9 +42,10 @@ struct kvm_steal_time { __u64 steal; + __u64 consigned; __u32 version; __u32 flags; - __u32 pad[12]; + __u32 pad[10]; }; #define KVM_STEAL_ALIGNMENT_BITS 5 diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a5f9f30..d39e8d0 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu, u64 *steal) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { - PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); + PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index ac357b3..4439a5c 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -372,7 +372,7 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static void kvm_steal_clock(int cpu, u64 *steal) +static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned) { struct kvm_steal_time *src; int version; @@ -382,6 +382,7 @@ static void kvm_steal_clock(int cpu, u64 *steal) version = src->version; rmb(); *steal = src->steal; + *consigned = src->consigned; rmb(); } while ((version & 1) || (version != src->version)); } diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 17fff18..3797683 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr) struct static_key paravirt_steal_enabled; struct static_key paravirt_steal_rq_enabled; -static u64 native_steal_clock(int cpu) +static void native_steal_clock(int cpu, u64 *steal, u64 *consigned) { - return 0; + *steal = *consigned = 0; } /* These are in entry.S */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1eefebe..683b531 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1565,8 +1565,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu) return; vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal; + vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned; vcpu->arch.st.steal.version += 2; vcpu->arch.st.accum_steal = 0; + vcpu->arch.st.accum_consigned = 0; kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime, &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index e5978b0..91afaa3 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct task_struct *); extern void account_user_time(struct task_struct *, cputime_t, cputime_t); extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t); extern void account_steal_time(cputime_t); +extern void account_consigned_time(cputime_t); extern void account_idle_time(cputime_t); extern void account_process_tick(struct task_struct *, int user); diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 593b647..53bd0be 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int hardirq_offset, } /* + * This accounts for the time that is split out of steal time. + *
[PATCH 4/5] Add a timer to allow the separation of consigned from steal time.
Add a timer to the host. This will define the period. During a period the first n ticks will go into the consigned bucket. Any other ticks that occur within the period will be placed in the stealtime bucket. Signed-off-by: Michael Wolf --- arch/x86/include/asm/kvm_host.h | 10 + arch/x86/include/asm/paravirt.h |2 +- arch/x86/kvm/x86.c | 42 ++- 3 files changed, 52 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 434d378..4794c95 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -41,6 +41,8 @@ #define KVM_PIO_PAGE_OFFSET 1 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2 +#define KVM_STEAL_TIMER_DELAY 1UL + #define CR0_RESERVED_BITS \ (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \ | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \ @@ -353,6 +355,14 @@ struct kvm_vcpu_arch { bool tpr_access_reporting; /* +* timer used to determine if the time should be counted as +* steal time or consigned time. +*/ + struct hrtimer steal_timer; + u64 current_consigned; + u64 consigned_limit; + + /* * Paging state of the vcpu * * If the vcpu runs in guest mode with two level paging this still saves diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index d39e8d0..6db79f9 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,7 +196,7 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) +static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 683b531..c91f4c9 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1546,13 +1546,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu) static void accumulate_steal_time(struct kvm_vcpu *vcpu) { u64 delta; + u64 steal_delta; + u64 consigned_delta; if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) return; delta = current->sched_info.run_delay - vcpu->arch.st.last_steal; vcpu->arch.st.last_steal = current->sched_info.run_delay; - vcpu->arch.st.accum_steal = delta; + + /* split the delta into steal and consigned */ + if (vcpu->arch.current_consigned < vcpu->arch.consigned_limit) { + vcpu->arch.current_consigned += delta; + if (vcpu->arch.current_consigned > vcpu->arch.consigned_limit) { + steal_delta = vcpu->arch.current_consigned + - vcpu->arch.consigned_limit; + consigned_delta = delta - steal_delta; + } else { + consigned_delta = delta; + steal_delta = 0; + } + } else { + consigned_delta = 0; + steal_delta = delta; + } + vcpu->arch.st.accum_steal = steal_delta; + vcpu->arch.st.accum_consigned = consigned_delta; } static void record_steal_time(struct kvm_vcpu *vcpu) @@ -6203,11 +6222,25 @@ bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu) struct static_key kvm_no_apic_vcpu __read_mostly; +enum hrtimer_restart steal_timer_fn(struct hrtimer *data) +{ + struct kvm_vcpu *vcpu; + ktime_t now; + + vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer); + vcpu->arch.current_consigned = 0; + now = ktime_get(); + hrtimer_forward(&vcpu->arch.steal_timer, now, + ktime_set(0, KVM_STEAL_TIMER_DELAY)); + return HRTIMER_RESTART; +} + int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) { struct page *page; struct kvm *kvm; int r; + ktime_t ktime; BUG_ON(vcpu->kvm == NULL); kvm = vcpu->kvm; @@ -6251,6 +6284,12 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) kvm_async_pf_hash_reset(vcpu); kvm_pmu_init(vcpu); + /* Initialize and start a timer to capture steal and consigned time */ + hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC, + HRTIMER_MODE_REL); + vcpu->arch.steal_timer.function = &steal_timer_fn; + ktime = ktime_set(0, KVM_STEAL_TIMER_DELAY); + hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL); return 0; fail_free_mce_banks: @@ -6269,6 +6308,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) { int idx; +
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/27/2012 02:48 AM, Glauber Costa wrote: Hi, On 11/27/2012 12:36 AM, Michael Wolf wrote: In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. If you submit this again, please include a version number in your series. Will do. The patchset was sent twice yesterday by mistake. Got an error the first time and didn't think the patches went out. This has been corrected. It would also be helpful to include a small changelog about what changed between last version and this version, so we could focus on that. yes, will do that. When I took the RFC off the patches I was looking at it as a new patchset which was a mistake. I will make sure to add a changelog when I submit again. As for the rest, I answered your previous two submissions saying I don't agree with the concept. If you hadn't changed anything, resending it won't change my mind. I could of course, be mistaken or misguided. But I had also not seen any wave of support in favor of this previously, so basically I have no new data to make me believe I should see it any differently. Let's try this again: * Rik asked you in your last submission how does ppc handle this. You said, and I quote: "In the case of lpar on POWER systems they simply report steal time and do not alter it in any way. They do however report how much processor is assigned to the partition and that information is in /proc/ppc64/lparcfg." Yes, but we still get questions from users asking what is steal time? why am I seeing this? Now, that is a *way* more sensible thing to do. Much more. "Confusing users" is something extremely subjective. This is specially true about concepts that are know for quite some time, like steal time. If you out of a sudden change the meaning of this, it is sure to confuse a lot more users than it would clarify. Something like this could certainly be done. But when I was submitting the patch set as an RFC then qemu was passing a cpu percentage that would be used by the guest kernel to adjust the steal time. This percentage was being stored on the guest as a sysctl value. Avi stated he didn't like that kind of coupling, and that the value could get out of sync. Anthony stated "The guest shouldn't need to know it's entitlement. Or at least, it's up to a management tool to report that in a way that's meaningful for the guest." So perhaps I misunderstood what they were suggesting, but I took it to mean that they did not want the guest to know what the entitlement was. That the host should take care of it and just report the already adjusted data to the guest. So in this version of the code the host would use a set period for a timer and be passed essentially a number of ticks of expected steal time. The host would then use the timer to break out the steal time into consigned and steal buckets which would be reported to the guest. Both the consigned and the steal would be reported via /proc/stat. So anyone needing to see total time away could add the two fields together. The user, however, when using tools like top or vmstat would see the usage based on what the guest is entitled to. Do you have suggestions for how I can build consensus around one of the two approaches? --- Michael Wolf (5): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. Add an ioctl to communicate the consign limit to the host. arch/x86/include/asm/kvm_host.h | 11 +++ arch/x86/include/asm/kvm_para.h |3 +- arch/x86/include/asm/paravirt.h |4 +-- arch/x86/include/asm/paravirt_types.h |2 + arch/x86/kernel/kvm.c |8 ++--- arch/x86/kernel/paravirt.c|4 +-- arch/x86/kvm/x86.c| 50 - fs/proc/stat.c|9 +- include/linux/kernel_stat.h |2 + include/linux/kvm_host.h |2 + include/uapi/linux/kvm.h |2 + kernel/sched/core.c | 10 ++- kernel/sched/cputime.c| 21 +- kernel/sched/sched.h |2 + virt/kvm/kvm_main.c
Re: [PATCH 2/5] Expand the steal time msr to also contain the consigned time.
On 11/27/2012 03:03 PM, Konrad Rzeszutek Wilk wrote: On Mon, Nov 26, 2012 at 02:36:45PM -0600, Michael Wolf wrote: Add a consigned field. This field will hold the time lost due to capping or overcommit. The rest of the time will still show up in the steal-time field. Signed-off-by: Michael Wolf --- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/kvm.c |7 ++- kernel/sched/core.c | 10 +- kernel/sched/cputime.c|2 +- 5 files changed, 15 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a0facf3..a5f9f30 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal) So its u64 here. { - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 142236e..5d4fc8b 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -95,7 +95,7 @@ struct pv_lazy_ops { struct pv_time_ops { unsigned long long (*sched_clock)(void); - unsigned long long (*steal_clock)(int cpu); + void (*steal_clock)(int cpu, unsigned long long *steal); But not u64 here? Any particular reason? It should be void everywhere, thanks for catching that I will change the code. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/27/2012 05:24 PM, Marcelo Tosatti wrote: On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote: In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. The definition of stolen time is 'time during which the virtual CPU is runnable to not running'. Overcommit is the main scenario which steal time helps to detect. Can you describe the 'capped' case? In the capped case, the time that the guest spends waiting due to it having used its full allottment of time shows up as steal time. The way my patchset currently stands is that you would set up the bandwidth control and you would have to pass it a matching value from qemu. In the future, it would be possible to have something parse the bandwidth setting and automatically adjust the setting in the host used for steal time reporting. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. --- Michael Wolf (5): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. Add an ioctl to communicate the consign limit to the host. arch/x86/include/asm/kvm_host.h | 11 +++ arch/x86/include/asm/kvm_para.h |3 +- arch/x86/include/asm/paravirt.h |4 +-- arch/x86/include/asm/paravirt_types.h |2 + arch/x86/kernel/kvm.c |8 ++--- arch/x86/kernel/paravirt.c|4 +-- arch/x86/kvm/x86.c| 50 - fs/proc/stat.c|9 +- include/linux/kernel_stat.h |2 + include/linux/kvm_host.h |2 + include/uapi/linux/kvm.h |2 + kernel/sched/core.c | 10 ++- kernel/sched/cputime.c| 21 +- kernel/sched/sched.h |2 + virt/kvm/kvm_main.c |7 + 15 files changed, 120 insertions(+), 17 deletions(-) -- Signature -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/28/2012 02:45 AM, Glauber Costa wrote: On 11/27/2012 07:10 PM, Michael Wolf wrote: On 11/27/2012 02:48 AM, Glauber Costa wrote: Hi, On 11/27/2012 12:36 AM, Michael Wolf wrote: In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. If you submit this again, please include a version number in your series. Will do. The patchset was sent twice yesterday by mistake. Got an error the first time and didn't think the patches went out. This has been corrected. It would also be helpful to include a small changelog about what changed between last version and this version, so we could focus on that. yes, will do that. When I took the RFC off the patches I was looking at it as a new patchset which was a mistake. I will make sure to add a changelog when I submit again. As for the rest, I answered your previous two submissions saying I don't agree with the concept. If you hadn't changed anything, resending it won't change my mind. I could of course, be mistaken or misguided. But I had also not seen any wave of support in favor of this previously, so basically I have no new data to make me believe I should see it any differently. Let's try this again: * Rik asked you in your last submission how does ppc handle this. You said, and I quote: "In the case of lpar on POWER systems they simply report steal time and do not alter it in any way. They do however report how much processor is assigned to the partition and that information is in /proc/ppc64/lparcfg." Yes, but we still get questions from users asking what is steal time? why am I seeing this? Now, that is a *way* more sensible thing to do. Much more. "Confusing users" is something extremely subjective. This is specially true about concepts that are know for quite some time, like steal time. If you out of a sudden change the meaning of this, it is sure to confuse a lot more users than it would clarify. Something like this could certainly be done. But when I was submitting the patch set as an RFC then qemu was passing a cpu percentage that would be used by the guest kernel to adjust the steal time. This percentage was being stored on the guest as a sysctl value. Avi stated he didn't like that kind of coupling, and that the value could get out of sync. Anthony stated "The guest shouldn't need to know it's entitlement. Or at least, it's up to a management tool to report that in a way that's meaningful for the guest." So perhaps I misunderstood what they were suggesting, but I took it to mean that they did not want the guest to know what the entitlement was. That the host should take care of it and just report the already adjusted data to the guest. So in this version of the code the host would use a set period for a timer and be passed essentially a number of ticks of expected steal time. The host would then use the timer to break out the steal time into consigned and steal buckets which would be reported to the guest. Both the consigned and the steal would be reported via /proc/stat. So anyone needing to see total time away could add the two fields together. The user, however, when using tools like top or vmstat would see the usage based on what the guest is entitled to. Do you have suggestions for how I can build consensus around one of the two approaches? Before I answer this, can you please detail which mechanism are you using to enforce the entitlement? Is it the cgroup cpu controller, or something else? It is setup using cpu overcommit. But the request was for something that would work in both the overcommit environment as well as when hard capping is being used. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/28/2012 02:55 PM, Glauber Costa wrote: On 11/28/2012 10:43 PM, Michael Wolf wrote: On 11/27/2012 05:24 PM, Marcelo Tosatti wrote: On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote: In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. The definition of stolen time is 'time during which the virtual CPU is runnable to not running'. Overcommit is the main scenario which steal time helps to detect. Can you describe the 'capped' case? In the capped case, the time that the guest spends waiting due to it having used its full allottment of time shows up as steal time. The way my patchset currently stands is that you would set up the bandwidth control and you would have to pass it a matching value from qemu. In the future, it would be possible to have something parse the bandwidth setting and automatically adjust the setting in the host used for steal time reporting. Ok, so correct me if I am wrong, but I believe you would be using something like the bandwidth capper in the cpu cgroup to set those entitlements, right? Yes, in the context above I'm referring to the cfs bandwidth control. Some time has passed since I last looked into it, but IIRC, after you get are out of your quota, you should be out of the runqueue. In the lovely world of KVM, we approximate steal time as runqueue time: arch/x86/kvm/x86.c: delta = current->sched_info.run_delay - vcpu->arch.st.last_steal; vcpu->arch.st.last_steal = current->sched_info.run_delay; vcpu->arch.st.accum_steal = delta; include/linux/sched.h: unsigned long long run_delay; /* time spent waiting on a runqueue */ So if you are out of the runqueue, you won't get steal time accounted, and then I truly fail to understand what you are doing. So I looked at something like this in the past. To make sure things haven't changed I set up a cgroup on my test server running a kernel built from the latest tip tree. [root]# cat cpu.cfs_quota_us 5 [root]# cat cpu.cfs_period_us 10 [root]# cat cpuset.cpus 1 [root]# cat cpuset.mems 0 Next I put the PID from the cpu thread into tasks. When I start a script that will hog the cpu I see the following in top on the guest Cpu(s): 1.9%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 48.3%hi, 0.0%si, 49.8%st So the steal time here is in line with the bandwidth control settings. In case I am wrong, and run_delay also includes the time you can't run because you are out of capacity, then maybe what we should do, is to just subtract it from run_delay in kvm/x86.c before we pass it on. In summary: About a year ago I was playing with this patch. It is out of date now but will give you an idea of what I was looking at. kernel/sched_fair.c |4 ++-- kernel/sched_stats.h |7 ++- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index 5c9e679..a837e4e 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -707,7 +707,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se) #ifdef CONFIG_FAIR_GROUP_SCHED /* we need this in update_cfs_load and load-balance functions below */ -static inline int throttled_hierarchy(struct cfs_rq *cfs_rq); +inline int throttled_hierarchy(struct cfs_rq *cfs_rq); # ifdef CONFIG_SMP static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq, int global_update) @@ -1420,7 +1420,7 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq) } /* check whether cfs_rq, or any parent, is throttled */ -static inline int throttled_hierarchy(struct cfs_rq *cfs_rq) +inline int throttled_hierarchy(struct cfs_rq *cfs_rq) { return cfs_rq->throttle_count; } diff --git a/kernel/sched_stats.h b/kernel/sched_stats.h index 87f9e36..e30ff26 100644 --- a/kernel/sched_stats.h +++ b/kernel/sched_stats.h @@ -213,14 +213,19 @@ static inline void sched_info_queued(struct task_struct *t) * sched_info_queued() to mark that it has now again started waiting on * the runqueue. */ +extern inline int throttled_hierarchy(struct cfs_rq *cfs_rq); static inline void sched_info_depart(struct task_struct *t) { +struct task_group *tg = task_group(t); +struct cfs_rq *cfs_rq; unsigned long long delta = task_rq(t)->clock - t->sched_info.last_arrival; +cfs_rq = tg->cfs_rq[smp_processor_id()]; rq_sched_info_depart(task_rq(t), delta); -if (t->state == TASK_RUNNING) + +if (t->state == TASK_RUNNING && !throttled_hierarchy(cfs_rq)) sched_info_queued(t); } So then the steal time did not show on the guest. You have no value that needs to be passed around. What I did not like about this approach was * only works for cfs bandwidth control. If another type of hard limit was added to the kernel the code wou
[PATCH 0/4] Alter steal-time reporting in the guest
In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. Tthe steal time will only be altered if hard limits (cfs bandwidth control) is used. The period and the quota used to separate the consigned time (expected steal) from the steal time are taken from the cfs bandwidth control settings. Any other steal time accruing during that period will show as the traditional steal time. Changes from V2: * Dropped the ioctl that allowed qemu to send the entitlement value to the guest. * Added code to get the entitlement period and quota from cfs bandwidth. Changes from V1: * Removed the steal time allowed percentage from the guest * Moved the separation of consigned (expected steal) and steal time to the host. * No longer include a sysctl interface. --- Michael Wolf (4): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. arch/x86/include/asm/kvm_host.h | 10 + arch/x86/include/asm/paravirt.h |4 +- arch/x86/include/asm/paravirt_types.h |2 + arch/x86/include/uapi/asm/kvm_para.h |3 +- arch/x86/kernel/kvm.c |8 ++-- arch/x86/kernel/paravirt.c|4 +- arch/x86/kvm/x86.c| 64 - fs/proc/stat.c|9 - include/linux/kernel_stat.h |2 + kernel/sched/core.c | 30 +++ kernel/sched/cputime.c| 21 ++- kernel/sched/sched.h |2 + 12 files changed, 142 insertions(+), 17 deletions(-) -- Signature -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/4] Alter the amount of steal time reported by the guest.
Modify the amount of stealtime that the kernel reports via the /proc interface. Steal time will now be broken down into steal_time and consigned_time. Consigned_time will represent the amount of time that is expected to be lost due to overcommitment of the physical cpu or by using cpu hard capping. Signed-off-by: Michael Wolf --- fs/proc/stat.c |9 +++-- include/linux/kernel_stat.h |1 + 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index e296572..cb7fe80 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v) int i, j; unsigned long jif; u64 user, nice, system, idle, iowait, irq, softirq, steal; - u64 guest, guest_nice; + u64 guest, guest_nice, consign; u64 sum = 0; u64 sum_softirq = 0; unsigned int per_softirq_sums[NR_SOFTIRQS] = {0}; @@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v) user = nice = system = idle = iowait = irq = softirq = steal = 0; - guest = guest_nice = 0; + guest = guest_nice = consign = 0; getboottime(&boottime); jif = boottime.tv_sec; + for_each_possible_cpu(i) { user += kcpustat_cpu(i).cpustat[CPUTIME_USER]; nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE]; @@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v) steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; sum += kstat_cpu_irqs_sum(i); sum += arch_irq_stat_cpu(i); @@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); for_each_online_cpu(i) { @@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v) steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; seq_printf(p, "cpu%d", i); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice)); @@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); } seq_printf(p, "intr %llu", (unsigned long long)sum); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 66b7078..e352052 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -28,6 +28,7 @@ enum cpu_usage_stat { CPUTIME_STEAL, CPUTIME_GUEST, CPUTIME_GUEST_NICE, + CPUTIME_CONSIGN, NR_STATS, }; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/4] Expand the steal time msr to also contain the consigned time.
Expand the steal time msr to also contain the consigned time. Signed-off-by: Michael Wolf --- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/kvm.c |7 ++- kernel/sched/core.c | 10 +- kernel/sched/cputime.c|2 +- 5 files changed, 15 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 5edd174..9b753ea 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu) +static inline void paravirt_steal_clock(int cpu, u64 *steal) { - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 142236e..5d4fc8b 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -95,7 +95,7 @@ struct pv_lazy_ops { struct pv_time_ops { unsigned long long (*sched_clock)(void); - unsigned long long (*steal_clock)(int cpu); + void (*steal_clock)(int cpu, unsigned long long *steal); unsigned long (*get_tsc_khz)(void); }; diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index fe75a28..89e5468 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -386,9 +386,8 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static u64 kvm_steal_clock(int cpu) +static void kvm_steal_clock(int cpu, u64 *steal) { - u64 steal; struct kvm_steal_time *src; int version; @@ -396,11 +395,9 @@ static u64 kvm_steal_clock(int cpu) do { version = src->version; rmb(); - steal = src->steal; + *steal = src->steal; rmb(); } while ((version & 1) || (version != src->version)); - - return steal; } void kvm_disable_steal_time(void) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 26058d0..efc2652 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -757,6 +757,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) */ #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) s64 steal = 0, irq_delta = 0; + u64 consigned = 0; #endif #ifdef CONFIG_IRQ_TIME_ACCOUNTING irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; @@ -785,8 +786,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING if (static_key_false((¶virt_steal_rq_enabled))) { u64 st; + u64 cs; - steal = paravirt_steal_clock(cpu_of(rq)); + paravirt_steal_clock(cpu_of(rq), &steal, &consigned); + /* +* since we are not assigning the steal time to cpustats +* here, just combine the steal and consigned times to +* do the rest of the calculations. +*/ + steal += consigned; steal -= rq->prev_steal_time_rq; if (unlikely(steal > delta)) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 825a956..0b4f1ec 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void) if (static_key_false(¶virt_steal_enabled)) { u64 steal, st = 0; - steal = paravirt_steal_clock(smp_processor_id()); + paravirt_steal_clock(smp_processor_id(), &steal); steal -= this_rq()->prev_steal_time; st = steal_ticks(steal); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/4] Add the code to send the consigned time from the host to the guest
Change the paravirt calls that retrieve the steal-time information from the host. Add to it getting the consigned value as well as the steal time. Signed-off-by: Michael Wolf --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/include/uapi/asm/kvm_para.h |3 ++- arch/x86/kernel/kvm.c|3 ++- arch/x86/kernel/paravirt.c |4 ++-- arch/x86/kvm/x86.c |2 ++ include/linux/kernel_stat.h |1 + kernel/sched/cputime.c | 21 +++-- kernel/sched/sched.h |2 ++ 9 files changed, 33 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index dc87b65..fe5a37b 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -428,6 +428,7 @@ struct kvm_vcpu_arch { u64 msr_val; u64 last_steal; u64 accum_steal; + u64 accum_consigned; struct gfn_to_hva_cache stime; struct kvm_steal_time steal; } st; diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 9b753ea..77f05e7 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline void paravirt_steal_clock(int cpu, u64 *steal) +static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { - PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); + PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 06fdbd9..55d617f 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -42,9 +42,10 @@ struct kvm_steal_time { __u64 steal; + __u64 consigned; __u32 version; __u32 flags; - __u32 pad[12]; + __u32 pad[10]; }; #define KVM_STEAL_ALIGNMENT_BITS 5 diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 89e5468..fb52f8a 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -386,7 +386,7 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static void kvm_steal_clock(int cpu, u64 *steal) +static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned) { struct kvm_steal_time *src; int version; @@ -396,6 +396,7 @@ static void kvm_steal_clock(int cpu, u64 *steal) version = src->version; rmb(); *steal = src->steal; + *consigned = src->consigned; rmb(); } while ((version & 1) || (version != src->version)); } diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 17fff18..3797683 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr) struct static_key paravirt_steal_enabled; struct static_key paravirt_steal_rq_enabled; -static u64 native_steal_clock(int cpu) +static void native_steal_clock(int cpu, u64 *steal, u64 *consigned) { - return 0; + *steal = *consigned = 0; } /* These are in entry.S */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c243b81..51b63d1 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1867,8 +1867,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu) return; vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal; + vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned; vcpu->arch.st.steal.version += 2; vcpu->arch.st.accum_steal = 0; + vcpu->arch.st.accum_consigned = 0; kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime, &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index e352052..f58ed0f 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct task_struct *); extern void account_user_time(struct task_struct *, cputime_t, cputime_t); extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t); extern void account_steal_time(cputime_t); +extern void account_consigned_time(cputime_t); extern void account_idle_time(cputime_t); #ifdef CONFIG_VIRT_CPU_ACCOUNTING diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 0b4f1ec..2a2d4be 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -244,6 +244,18 @@ void account_
[PATCH 4/4] Add a timer to allow the separation of consigned from steal time.
Add a helper routine to scheduler/core.c to allow the kvm module to retrieve the cpu hardlimit settings. The values will be used to set up a timer that is used to separate the consigned from the steal time. Signed-off-by: Michael Wolf --- arch/x86/include/asm/kvm_host.h |9 ++ arch/x86/kvm/x86.c | 62 ++- kernel/sched/core.c | 20 + 3 files changed, 90 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index fe5a37b..9518613 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -355,6 +355,15 @@ struct kvm_vcpu_arch { bool tpr_access_reporting; /* +* timer used to determine if the time should be counted as +* steal time or consigned time. +*/ + struct hrtimer steal_timer; + u64 current_consigned; + s64 consigned_quota; + s64 consigned_period; + + /* * Paging state of the vcpu * * If the vcpu runs in guest mode with two level paging this still saves diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 51b63d1..79d144d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1848,13 +1848,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu) static void accumulate_steal_time(struct kvm_vcpu *vcpu) { u64 delta; + u64 steal_delta; + u64 consigned_delta; if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) return; delta = current->sched_info.run_delay - vcpu->arch.st.last_steal; vcpu->arch.st.last_steal = current->sched_info.run_delay; - vcpu->arch.st.accum_steal = delta; + + /* split the delta into steal and consigned */ + if (vcpu->arch.current_consigned < vcpu->arch.consigned_quota) { + vcpu->arch.current_consigned += delta; + if (vcpu->arch.current_consigned > vcpu->arch.consigned_quota) { + steal_delta = vcpu->arch.current_consigned + - vcpu->arch.consigned_quota; + consigned_delta = delta - steal_delta; + } else { + consigned_delta = delta; + steal_delta = 0; + } + } else { + consigned_delta = 0; + steal_delta = delta; + } + vcpu->arch.st.accum_steal = steal_delta; + vcpu->arch.st.accum_consigned = consigned_delta; } static void record_steal_time(struct kvm_vcpu *vcpu) @@ -2629,8 +2648,35 @@ static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu) !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY); } +extern int sched_use_hard_capping(int cpuid, int num_vcpus, s64 *quota, + s64 *period); +enum hrtimer_restart steal_timer_fn(struct hrtimer *data) +{ + struct kvm_vcpu *vcpu; + struct kvm *kvm; + int num_vcpus; + ktime_t now; + + vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer); + kvm = vcpu->kvm; + num_vcpus = atomic_read(&kvm->online_vcpus); + sched_use_hard_capping(vcpu->cpu, num_vcpus, + &vcpu->arch.consigned_quota, + &vcpu->arch.consigned_period); + vcpu->arch.current_consigned = 0; + now = ktime_get(); + hrtimer_forward(&vcpu->arch.steal_timer, now, + ktime_set(0, vcpu->arch.consigned_period)); + + return HRTIMER_RESTART; +} + void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { + struct kvm *kvm; + int num_vcpus; + ktime_t ktime; + /* Address WBINVD may be executed by guest */ if (need_emulate_wbinvd(vcpu)) { if (kvm_x86_ops->has_wbinvd_exit()) @@ -2670,6 +2716,18 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) kvm_migrate_timers(vcpu); vcpu->cpu = cpu; } + /* Initialize and start a timer to capture steal and consigned time */ + kvm = vcpu->kvm; + num_vcpus = atomic_read(&kvm->online_vcpus); + num_vcpus = (num_vcpus == 0) ? 1 : num_vcpus; + sched_use_hard_capping(vcpu->cpu, num_vcpus, + &vcpu->arch.consigned_quota, + &vcpu->arch.consigned_period); + hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC, + HRTIMER_MODE_REL); + vcpu->arch.steal_timer.function = &steal_timer_fn; + ktime = ktime_set(0, vcpu->arch.consigned_period); + hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL); accumulate_steal_time(vcpu); kvm_make_request(KV
Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.
On 02/06/2013 08:36 AM, Glauber Costa wrote: On 02/06/2013 01:49 AM, Michael Wolf wrote: Add a helper routine to scheduler/core.c to allow the kvm module to retrieve the cpu hardlimit settings. The values will be used to set up a timer that is used to separate the consigned from the steal time. Sorry: What is the business of a timer in here? Whenever we read steal time, we know how much time has passed and with that information we can know the entitlement for the period. This breaks if we suspend, but we know that we suspended, so this is not a problem. I may be missing something, but how do we know how much time has passed? That is why I had the timer in there. I will go look again at the code but I thought the data was collected as ticks and passed at random times. The ticks are also accumulating so we are looking at the difference in the count between reads. Everything bigger the entitlement is steal time. I agree provided I know the amount of total time that the steal time was accumulated. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/4] Expand the steal time msr to also contain the consigned time.
On 02/06/2013 03:14 PM, Rik van Riel wrote: On 02/05/2013 04:49 PM, Michael Wolf wrote: Expand the steal time msr to also contain the consigned time. Signed-off-by: Michael Wolf --- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/kvm.c |7 ++- kernel/sched/core.c | 10 +- kernel/sched/cputime.c|2 +- 5 files changed, 15 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 5edd174..9b753ea 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu) +static inline void paravirt_steal_clock(int cpu, u64 *steal) { -return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); +PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); } This may be a stupid question, but what happens if a KVM guest with this change, runs on a kernel that still has the old steal time interface? What happens if the host has the new steal time interface, but the guest uses the old interface? Will both cases continue to work as expected with your patch series? If so, could you document (in the source code) why things continue to work? I will test the scenarios you suggest and will report back the results. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/4] Add the code to send the consigned time from the host to the guest
On 02/06/2013 03:18 PM, Rik van Riel wrote: On 02/05/2013 04:49 PM, Michael Wolf wrote: Change the paravirt calls that retrieve the steal-time information from the host. Add to it getting the consigned value as well as the steal time. Signed-off-by: Michael Wolf diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 06fdbd9..55d617f 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -42,9 +42,10 @@ struct kvm_steal_time { __u64 steal; +__u64 consigned; __u32 version; __u32 flags; -__u32 pad[12]; +__u32 pad[10]; }; The function kvm_register_steal_time passes the address of such a structure to the host kernel, which then does something with it. Could running a guest with the above patch, on top of a host with the old code, result in the values for "version" and "flags" being written into "consigned"? yes, good point. Could that result in confusing the guest kernel to no end, and generally breaking things? Ok I will move the consigned field to be after the flags. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.
On 02/07/2013 02:46 AM, Glauber Costa wrote: On 02/06/2013 10:07 PM, Michael Wolf wrote: On 02/06/2013 08:36 AM, Glauber Costa wrote: On 02/06/2013 01:49 AM, Michael Wolf wrote: Add a helper routine to scheduler/core.c to allow the kvm module to retrieve the cpu hardlimit settings. The values will be used to set up a timer that is used to separate the consigned from the steal time. Sorry: What is the business of a timer in here? Whenever we read steal time, we know how much time has passed and with that information we can know the entitlement for the period. This breaks if we suspend, but we know that we suspended, so this is not a problem. I may be missing something, but how do we know how much time has passed? That is why I had the timer in there. I will go look again at the code but I thought the data was collected as ticks and passed at random times. The ticks are also accumulating so we are looking at the difference in the count between reads. They can be collected at random times, but you can of course record the time in which it happened. ok. Let me add a previous_read field and take out the timer. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC V2 0/5] Separate consigned (expected steal) from steal time.
On Wed, 2012-10-17 at 21:14 +0400, Glauber Costa wrote: > On 10/17/2012 06:23 AM, Michael Wolf wrote: > > In the case of where you have a system that is running in a > > capped or overcommitted environment the user may see steal time > > being reported in accounting tools such as top or vmstat. This can > > cause confusion for the end user. To ease the confusion this patch set > > adds the idea of consigned (expected steal) time. The host will separate > > the consigned time from the steal time. The consignment limit passed to the > > host will be the amount of steal time expected within a fixed period of > > time. Any other steal time accruing during that period will show as the > > traditional steal time. > > > > TODO: > > * Change native_clock to take params and not return a value > > * Change update_rq_clock_task > > > > Changes from V1: > > * Removed the steal time allowed percentage from the guest > > * Moved the separation of consigned (expected steal) and steal time to the > > host. > > * No longer include a sysctl interface. > > > > You are showing this in the guest somewhere, but tools like top will > still not show it. So for quite a while, it achieves nothing. > > Of course this is a barrier that any new statistic has to go through. So > while annoying, this is per-se ultimately not a blocker. > > What I still fail to see, is how this is useful information to be shown > in the guest. Honestly, if I'm in a guest VM or container, any time > during which I am not running is time I lost. It doesn't matter if this > was expected or not. This still seems to me as a host-side problem, to > be solved entirely by tooling. > What tools like top and vmstat will show is altered. When I put time in the consign bucket it does not show up in steal. So now as long as the system is performing as expected the user will see 100% and 0% steal. I added the consign field to /proc/stat so that all time accrued in the period is accounted for and also for debugging purposes. The user wont care about consign and will not see it. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 12/05/2012 06:46 AM, Glauber Costa wrote: I am deeply sorry. I was busy first time I read this, so I postponed answering and ended up forgetting. Sorry include/linux/sched.h: unsigned long long run_delay; /* time spent waiting on a runqueue */ So if you are out of the runqueue, you won't get steal time accounted, and then I truly fail to understand what you are doing. So I looked at something like this in the past. To make sure things haven't changed I set up a cgroup on my test server running a kernel built from the latest tip tree. [root]# cat cpu.cfs_quota_us 5 [root]# cat cpu.cfs_period_us 10 [root]# cat cpuset.cpus 1 [root]# cat cpuset.mems 0 Next I put the PID from the cpu thread into tasks. When I start a script that will hog the cpu I see the following in top on the guest Cpu(s): 1.9%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 48.3%hi, 0.0%si, 49.8%st So the steal time here is in line with the bandwidth control settings. Ok. So I was wrong in my hunch that it would be outside the runqueue, therefore work automatically. Still, the host kernel has all the information in cgroups. So then the steal time did not show on the guest. You have no value that needs to be passed around. What I did not like about this approach was * only works for cfs bandwidth control. If another type of hard limit was added to the kernel the code would potentially need to change. This is true for almost everything we have in the kernel! It is *very* unlikely for other bandwidth control mechanism to ever appear. If it ever does, it's *their* burden to make sure it works for steal time (provided it is merged). Code in tree gets precedence. Ok, I will work on a patch that uses the cgroup information for bandwidth control to separate out the time. * This approach doesn't help if the limits are set by overcommitting the cpus. It is my understanding that this is a common approach. I can't say anything about commonality, but common or not, it is a *crazy* approach. When you simply overcommit, you have no way to differentiate between intended steal time and non-intended steal time. Moreover, when you overcommit, your cpu usage will vary over time. If two guests use the cpu to their full power, you will have 50 % each. But if one of them slows down, the other gets more. What is your entitlement value? How do you define this? And then after you define it, you end up using more than this, what is your cpu usage? 130 %? yes exactly you would ideally show a boosted amount of cpu. However to do that you would need to either create a new tool or modify the current accounting tools such as top. My understanding is that you are not capping in this case as much as you are guaranteeing a minimum level of performance. The only sane way to do it, is to communicate this value to the kernel somehow. The bandwidth controller is the interface we have for that. So everybody that wants to *intentionally* overcommit needs to communicate this to the controller. IOW: Any sane configuration should be explicit about your capping. Add an ioctl to communicate the consign limit to the host. This definitely should go away. More specifically, *whatever* way we use to cap the processor, the host system will have all the information at all times. I'm not understanding that comment. If you are capping by simply controlling the amount of overcommit on the host then wouldn't you still need some value to indicate the desired amount. No, that is just crazy, and I don't like it a single bit. So in the light of it: Whatever capping mechanism we have, we need to be explicit about the expected entitlement. At this point, the kernel already knows what it is, and needs no extra ioctls or anything like that. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.
Sorry for the delay in the response. I did not see your question. On Mon, 2013-02-18 at 20:57 -0300, Marcelo Tosatti wrote: > On Tue, Feb 05, 2013 at 03:49:41PM -0600, Michael Wolf wrote: > > Add a helper routine to scheduler/core.c to allow the kvm module > > to retrieve the cpu hardlimit settings. The values will be used > > to set up a timer that is used to separate the consigned from the > > steal time. > > 1) Can you please describe, in english, the mechanics of subtracting cpu > hardlimit values from steal time reported via run_delay supposed to > work? > > "The period and the quota used to separate the consigned time > (expected steal) from the steal time are taken > from the cfs bandwidth control settings. Any other steal time > accruing during that period will show as the traditional steal time." > > There is no "expected steal time" over a fixed period of real time. There is expected steal time in the sense that the administrator of the system sets up guests on the host so that there will be cpu overcommitment. The end user who is using the guest does not know this, they only know they have been guaranteed a certain level of performance. So if steal time shows up the end user typically thinks they are not getting their guaranteed performance. So this patchset is meant to allow top to show 100% utilization and ONLY show steal time if it is over the level of steal time that the host administrator setup. So take a simple example of a host with 1 cpu and two guest on it. If each guest is fully utilized a user will see 50% utilization and 50% steal in either of the guests. In this case the amount of steal time that the host administrator would expect to see is 50%. As long as the steal in the guest does not exceed 50% the guest is running as expected. If for some reason the steal increases to 60%, now something is wrong and the steal time needs to be reported and the end user will make inquiries? > > 2) From the description of patch 1: "In the case of where you have > a system that is running in a capped or overcommitted environment > the user may see steal time being reported in accounting tools > such as top or vmstat." > > This is outdated, right? Because overcommitted environment is exactly > what steal time should report. I hope I'm not missing your point here. But again this comes down to the point of view. The end user is guaranteed a capability/level of performance that may not be a whole cpu. So only show steal time if the amount of steal time exceeds what the host admin expected when the guest was set up. > > > Thanks thanks Mike Wolf -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] Alter steal-time reporting in the guest
Sorry for the delay in the response. I did not see the email right away. On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote: > On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote: > > 2013/2/5 Michael Wolf : > > > In the case of where you have a system that is running in a > > > capped or overcommitted environment the user may see steal time > > > being reported in accounting tools such as top or vmstat. This can > > > cause confusion for the end user. > > > > Sorry, I'm no expert in this area. But I don't really understand what > > is confusing for the end user here. > > I suppose that what is wanted is to subtract stolen time due to 'known > reasons' from steal time reporting. 'Known reasons' being, for example, > hard caps. So a vcpu executing instructions with no halt, but limited to > 80% of available bandwidth, would not have 20% of stolen time reported. Yes exactly and the end user many times did not set up the guest and is not aware of the capping. The end user is only aware of the performance level that they were told they would get with the guest. > > But yes, a description of the scenario that is being dealt with, with > details, is important. I will add more detail to the description next time I submit the patches. How about something like,"In a cloud environment the user of a kvm guest is not aware of the underlying hardware or how many other guests are running on it. The end user is only aware of a level of performance that they should see." or does that just muddy the picture more?? > > > > To ease the confusion this patch set > > > adds the idea of consigned (expected steal) time. The host will separate > > > the consigned time from the steal time. Tthe steal time will only be > > > altered > > > if hard limits (cfs bandwidth control) is used. The period and the quota > > > used > > > to separate the consigned time (expected steal) from the steal time are > > > taken > > > from the cfs bandwidth control settings. Any other steal time accruing > > > during > > > that period will show as the traditional steal time. > > > > I'm also a bit confused here. steal time will then only account the > > cpu time lost due to quotas from cfs bandwidth control? Also what do > > you exactly mean by "expected steal time"? Is it steal time due to > > overcommitting minus scheduler quotas? > > > > Thanks. > Thanks Mike Wolf -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] Alter steal-time reporting in the guest
On Tue, 2013-03-05 at 22:41 -0300, Marcelo Tosatti wrote: > On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote: > > Sorry for the delay in the response. I did not see the email > > right away. > > > > On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote: > > > On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote: > > > > 2013/2/5 Michael Wolf : > > > > > In the case of where you have a system that is running in a > > > > > capped or overcommitted environment the user may see steal time > > > > > being reported in accounting tools such as top or vmstat. This can > > > > > cause confusion for the end user. > > > > > > > > Sorry, I'm no expert in this area. But I don't really understand what > > > > is confusing for the end user here. > > > > > > I suppose that what is wanted is to subtract stolen time due to 'known > > > reasons' from steal time reporting. 'Known reasons' being, for example, > > > hard caps. So a vcpu executing instructions with no halt, but limited to > > > 80% of available bandwidth, would not have 20% of stolen time reported. > > > > Yes exactly and the end user many times did not set up the guest and is > > not aware of the capping. The end user is only aware of the performance > > level that they were told they would get with the guest. > > > But yes, a description of the scenario that is being dealt with, with > > > details, is important. > > > > I will add more detail to the description next time I submit the > > patches. How about something like,"In a cloud environment the user of a > > kvm guest is not aware of the underlying hardware or how many other > > guests are running on it. The end user is only aware of a level of > > performance that they should see." or does that just muddy the picture > > more?? > > So the feature aims for is to report stolen time relative to hard > capping. That is: stolen time should be counted as time stolen from > the guest _beyond_ hard capping. Yes? Yes, that is the goal. > > Probably don't need to report new data to the guest for that. Not sure I understand what you are saying here. Do you mean that I don't need to report the expected steal from the guest? If I don't do that then I'm not reporting all of the time and changing /proc/stat in a bigger way than adding another catagory. Also I thought I would need to provide the consigned time and the steal time for debugging purposes. Maybe I'm missing your point. > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] Alter steal-time reporting in the guest
On Wed, 2013-03-06 at 12:13 +0400, Glauber Costa wrote: > On 03/06/2013 05:41 AM, Marcelo Tosatti wrote: > > On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote: > >> Sorry for the delay in the response. I did not see the email > >> right away. > >> > >> On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote: > >>> On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote: > >>>> 2013/2/5 Michael Wolf : > >>>>> In the case of where you have a system that is running in a > >>>>> capped or overcommitted environment the user may see steal time > >>>>> being reported in accounting tools such as top or vmstat. This can > >>>>> cause confusion for the end user. > >>>> > >>>> Sorry, I'm no expert in this area. But I don't really understand what > >>>> is confusing for the end user here. > >>> > >>> I suppose that what is wanted is to subtract stolen time due to 'known > >>> reasons' from steal time reporting. 'Known reasons' being, for example, > >>> hard caps. So a vcpu executing instructions with no halt, but limited to > >>> 80% of available bandwidth, would not have 20% of stolen time reported. > >> > >> Yes exactly and the end user many times did not set up the guest and is > >> not aware of the capping. The end user is only aware of the performance > >> level that they were told they would get with the guest. > >>> But yes, a description of the scenario that is being dealt with, with > >>> details, is important. > >> > >> I will add more detail to the description next time I submit the > >> patches. How about something like,"In a cloud environment the user of a > >> kvm guest is not aware of the underlying hardware or how many other > >> guests are running on it. The end user is only aware of a level of > >> performance that they should see." or does that just muddy the picture > >> more?? > > > > So the feature aims for is to report stolen time relative to hard > > capping. That is: stolen time should be counted as time stolen from > > the guest _beyond_ hard capping. Yes? > > > > Probably don't need to report new data to the guest for that. > > > If we take into account that 1 second always have one second, I believe > that you can just subtract the consigned time from the steal time the > host passes to the guest. > > During each second, the numbers won't sum up to 100. The delta to 100 is > the consigned time, if anyone cares. > > Adopting this would simplify this a lot. All you need to do, really, is > to get your calculation right from the bandwidth given by the cpu > controller. Subtract it in the host, and voila. I looked at doing that once but was told that I was changing the interface in an unacceptable way, because now I was not reporting all of the elapsed time. I agree it would make things simpler. > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] Alter steal-time reporting in the guest
On Wed, 2013-03-06 at 14:34 +0100, Frederic Weisbecker wrote: > 2013/3/5 Michael Wolf : > > Sorry for the delay in the response. I did not see the email > > right away. > > > > On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote: > >> On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote: > >> > 2013/2/5 Michael Wolf : > >> > > In the case of where you have a system that is running in a > >> > > capped or overcommitted environment the user may see steal time > >> > > being reported in accounting tools such as top or vmstat. This can > >> > > cause confusion for the end user. > >> > > >> > Sorry, I'm no expert in this area. But I don't really understand what > >> > is confusing for the end user here. > >> > >> I suppose that what is wanted is to subtract stolen time due to 'known > >> reasons' from steal time reporting. 'Known reasons' being, for example, > >> hard caps. So a vcpu executing instructions with no halt, but limited to > >> 80% of available bandwidth, would not have 20% of stolen time reported. > > > > Yes exactly and the end user many times did not set up the guest and is > > not aware of the capping. The end user is only aware of the performance > > level that they were told they would get with the guest. > > > >> > >> But yes, a description of the scenario that is being dealt with, with > >> details, is important. > > > > I will add more detail to the description next time I submit the > > patches. How about something like,"In a cloud environment the user of a > > kvm guest is not aware of the underlying hardware or how many other > > guests are running on it. The end user is only aware of a level of > > performance that they should see." or does that just muddy the picture > > more?? > > That alone is probably not enough. But yeah, make sure you clearly > state the difference between expected (caps, sched bandwidth...) and > unexpected (overcommitting, competing load...) stolen time. Then add a > practical example as you made above that explains why it matters to > make that distinction and why you want to report it. > Ok, I understand what you are requesting. I will make sure to add it to the description the next time I submit the patches. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] Alter steal-time reporting in the guest
On Thu, 2013-03-07 at 14:11 +1100, Paul Mackerras wrote: > On Wed, Mar 06, 2013 at 09:52:16PM -0300, Marcelo Tosatti wrote: > > On Wed, Mar 06, 2013 at 10:29:12AM -0600, Michael Wolf wrote: > > > I looked at doing that once but was told that I was changing the > > > interface in an unacceptable way, because now I was not reporting all of > > > the elapsed time. I agree it would make things simpler. > > > > Pointer to that claim, please? > > Back in about 2004 or 2005 or so I was looking at changing how user > and system times were calculated (in the context of trying to find a > better way to report resources used by a thread in an SMT processor). > I found that utilities such as top expected the deltas in the > /proc/stat numbers to add up to elapsed time, and would report strange > and inconsistent results if that wasn't the case. Unfortunately at > this distance I don't recall the exact details. I don't know whether > the expectation that the deltas in the /proc/stat numbers over a > period of time add up to the elapsed real time is documented anywhere, > but I wouldn't be at all surprised if some programs depend on it, so > it's better to maintain that property. I will have to look at this again. When looking at the cpu data where steal time is reported there isn't a problem today. I will have to run it and see if there is anything incorrect with the time being reported for the individual processes. My real concern here was that in changing the /proc/stat interface am I going to mess private tools that look at that information. When I've looked at vmstat and top they report the cpu information fine, but I may end up creating problems for home grown scripts and tools. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] Alter steal-time reporting in the guest
On Wed, 2013-03-06 at 23:30 -0300, Marcelo Tosatti wrote: > On Wed, Mar 06, 2013 at 10:27:13AM -0600, Michael Wolf wrote: > > On Tue, 2013-03-05 at 22:41 -0300, Marcelo Tosatti wrote: > > > On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote: > > > > Sorry for the delay in the response. I did not see the email > > > > right away. > > > > > > > > On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote: > > > > > On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote: > > > > > > 2013/2/5 Michael Wolf : > > > > > > > In the case of where you have a system that is running in a > > > > > > > capped or overcommitted environment the user may see steal time > > > > > > > being reported in accounting tools such as top or vmstat. This > > > > > > > can > > > > > > > cause confusion for the end user. > > > > > > > > > > > > Sorry, I'm no expert in this area. But I don't really understand > > > > > > what > > > > > > is confusing for the end user here. > > > > > > > > > > I suppose that what is wanted is to subtract stolen time due to 'known > > > > > reasons' from steal time reporting. 'Known reasons' being, for > > > > > example, > > > > > hard caps. So a vcpu executing instructions with no halt, but limited > > > > > to > > > > > 80% of available bandwidth, would not have 20% of stolen time > > > > > reported. > > > > > > > > Yes exactly and the end user many times did not set up the guest and is > > > > not aware of the capping. The end user is only aware of the performance > > > > level that they were told they would get with the guest. > > > > > But yes, a description of the scenario that is being dealt with, with > > > > > details, is important. > > > > > > > > I will add more detail to the description next time I submit the > > > > patches. How about something like,"In a cloud environment the user of a > > > > kvm guest is not aware of the underlying hardware or how many other > > > > guests are running on it. The end user is only aware of a level of > > > > performance that they should see." or does that just muddy the picture > > > > more?? > > > > > > So the feature aims for is to report stolen time relative to hard > > > capping. That is: stolen time should be counted as time stolen from > > > the guest _beyond_ hard capping. Yes? > > Yes, that is the goal. > > > > > > Probably don't need to report new data to the guest for that. > > Not sure I understand what you are saying here. Do you mean that I don't > > need to report the expected steal from the guest? If I don't do that > > then I'm not reporting all of the time and changing /proc/stat in a > > bigger way than adding another catagory. Also I thought I would need to > > provide the consigned time and the steal time for debugging purposes. > > Maybe I'm missing your point. > > OK so the usefulness of steal time comes from the ability to measure > CPU cycles that the guest is being deprived of, relative to some unit > (implicitly the CPU frequency presented to the VM). That way, it becomes > easier to properly allocate resources. > > From top man page: > st : time stolen from this vm by the hypervisor > > Not only its a problem for the lender, it is also confusing for the user > (who has to subtract from the reported value himself), the hardcapping > from reported steal time. > > > The problem with the algorithm in the patchset is the following > (practical example): > > - Hard capping set to 80% of available CPU. > - vcpu does not exceed its threshold, say workload with 40% > CPU utilization. > - Under this scenario it is possible for vcpu to be deprived > of cycles (because out of the 40% that workload uses, only 30% of > actual CPU time are being provided). > - The algorithm in this patchset will not report any stolen time > because it assumes 20% of stolen time reported via 'run_delay' > is fixed at all times (which is false), therefore any valid > stolen time below 20% will not be reported. > > Makes sense? I understand the scenerio. I will have to go back and look at the CFS bandwidth code and run some tests. The question I have to look at is how is everything reported in your
Re: [PATCH 0/4] Alter steal-time reporting in the guest
On Wed, 2013-03-06 at 23:30 -0300, Marcelo Tosatti wrote: > On Wed, Mar 06, 2013 at 10:27:13AM -0600, Michael Wolf wrote: > > On Tue, 2013-03-05 at 22:41 -0300, Marcelo Tosatti wrote: > > > On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote: > > > > Sorry for the delay in the response. I did not see the email > > > > right away. > > > > > > > > On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote: > > > > > On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote: > > > > > > 2013/2/5 Michael Wolf : > > > > > > > In the case of where you have a system that is running in a > > > > > > > capped or overcommitted environment the user may see steal time > > > > > > > being reported in accounting tools such as top or vmstat. This > > > > > > > can > > > > > > > cause confusion for the end user. > > > > > > > > > > > > Sorry, I'm no expert in this area. But I don't really understand > > > > > > what > > > > > > is confusing for the end user here. > > > > > > > > > > I suppose that what is wanted is to subtract stolen time due to 'known > > > > > reasons' from steal time reporting. 'Known reasons' being, for > > > > > example, > > > > > hard caps. So a vcpu executing instructions with no halt, but limited > > > > > to > > > > > 80% of available bandwidth, would not have 20% of stolen time > > > > > reported. > > > > > > > > Yes exactly and the end user many times did not set up the guest and is > > > > not aware of the capping. The end user is only aware of the performance > > > > level that they were told they would get with the guest. > > > > > But yes, a description of the scenario that is being dealt with, with > > > > > details, is important. > > > > > > > > I will add more detail to the description next time I submit the > > > > patches. How about something like,"In a cloud environment the user of a > > > > kvm guest is not aware of the underlying hardware or how many other > > > > guests are running on it. The end user is only aware of a level of > > > > performance that they should see." or does that just muddy the picture > > > > more?? > > > > > > So the feature aims for is to report stolen time relative to hard > > > capping. That is: stolen time should be counted as time stolen from > > > the guest _beyond_ hard capping. Yes? > > Yes, that is the goal. > > > > > > Probably don't need to report new data to the guest for that. > > Not sure I understand what you are saying here. Do you mean that I don't > > need to report the expected steal from the guest? If I don't do that > > then I'm not reporting all of the time and changing /proc/stat in a > > bigger way than adding another catagory. Also I thought I would need to > > provide the consigned time and the steal time for debugging purposes. > > Maybe I'm missing your point. > > OK so the usefulness of steal time comes from the ability to measure > CPU cycles that the guest is being deprived of, relative to some unit > (implicitly the CPU frequency presented to the VM). That way, it becomes > easier to properly allocate resources. > > From top man page: > st : time stolen from this vm by the hypervisor > > Not only its a problem for the lender, it is also confusing for the user > (who has to subtract from the reported value himself), the hardcapping > from reported steal time. > > > The problem with the algorithm in the patchset is the following > (practical example): > > - Hard capping set to 80% of available CPU. > - vcpu does not exceed its threshold, say workload with 40% > CPU utilization. > - Under this scenario it is possible for vcpu to be deprived > of cycles (because out of the 40% that workload uses, only 30% of > actual CPU time are being provided). > - The algorithm in this patchset will not report any stolen time > because it assumes 20% of stolen time reported via 'run_delay' > is fixed at all times (which is false), therefore any valid > stolen time below 20% will not be reported. > > Makes sense? > > Not sure what the concrete way to report stolen time relative to hard > capping is (probably easier inside the scheduler, where run_delay is > calculated). >
Re: [PATCH 0/4] Alter steal-time reporting in the guest
On Thu, 2013-03-07 at 18:25 -0300, Marcelo Tosatti wrote: > On Thu, Mar 07, 2013 at 03:15:09PM -0600, Michael Wolf wrote: > > > > > > Makes sense? > > > > > > Not sure what the concrete way to report stolen time relative to hard > > > capping is (probably easier inside the scheduler, where run_delay is > > > calculated). > > > > > > Reporting the hard capping to the guest is a good idea (which saves the > > > user from having to measure it themselves), but better done separately > > > via new field. > > > > didnt respond to this in the previous response. I'm not sure I'm > > following you here. I thought this is what I was doing by having a > > consigned (expected steal) field add to the /proc/stat output. Are you > > looking for something else or a better naming convention? > > Expected steal is not a good measure to use (because as mentioned in the > previous email there is no expected steal over a fixed period of time). > > It is fine to report 'maximum percentage of underlying physical CPU' > (what percentage of the physical CPU time guest VM is allowed to make > use of). > > And then steal time is relative to maximum percentage of underlying > physical CPU time allowed. > So last August I had sent out an RFC set of patches to do this. That patchset was meant to handle the general overcommit case as well as the capping case by having qemu pass a percentage to the host that would then be passed onto the guest and used to adjust the steal time. Here is the link to the discussion http://lkml.indiana.edu/hypermail/linux/kernel/1208.3/01458.html As you will see there Avi didn't like the idea of a percentage down in the guest, among other reasons he was concerned about migration. Also in the email thread you will see that Anthony Liguori was opposed to the idea of just changing the steal time, he wanted it split out. What Glauber has suggested and I am working on implementing is taking out the timer and adding a last read field in the host. So in the host I can determine the total time that has passed and compute a percentage and apply that percentage to the steal time while the info is still on the host. Then pass the steal and consigned time to the guest. Does that address your concerns? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 0/3] Add guest cpu_entitlement reporting
This is an RFC regarding the reporting of stealtime. In the case of where you have a system that is running with partial processors such as KVM the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds a sysctl interface to set the cpu entitlement. This is the percentage of cpu that the guest system is expected to receive. As long as the steal time is within its expected range it will show up as 0 in /proc/stat. The user will then see in the accounting tools that they are getting a full utilization of the cpu resources assigned to them. This patchset is changing the contents/output of /proc/stat and could affect user tools. However the default setting is that the cpu is entitled to 100% so the code will act as before. Also another field could be added to the /proc/stat output and show the unaltered steal time. Since this additional field could cause more confusion than it would clear up I have left it out for now. Michael Wolf (3): Add a sysctl interface to control the cpu entitlement setting. Add a hypercall to retrieve the cpu entitlement value from the host. Modify the amount of stealtime that the kernel reports via the /proc --- Documentation/sysctl/kernel.txt | 14 +++ arch/x86/kvm/x86.c |8 fs/proc/stat.c | 80 ++- include/linux/kernel_stat.h |2 + include/linux/kvm.h |3 ++ include/linux/kvm_host.h|3 ++ include/linux/kvm_para.h|1 + kernel/sysctl.c | 10 + virt/kvm/kvm_main.c |7 9 files changed, 126 insertions(+), 2 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 1/3] Add a sysctl interface to control and report the cpu entitlement setting.
This setting will later be used to compute an expected steal time. Signed-off-by: Michael Wolf --- Documentation/sysctl/kernel.txt | 14 ++ fs/proc/stat.c |1 + kernel/sysctl.c | 10 ++ 3 files changed, 25 insertions(+) diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 6d78841..0f617dc 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -28,6 +28,7 @@ show up in /proc/sys/kernel: - core_pattern - core_pipe_limit - core_uses_pid +- cpu_entitlement - ctrl-alt-del - dmesg_restrict - domainname @@ -226,6 +227,19 @@ the filename. == +cpu_entitlement: + +The cpu_entitlement is the percentage of cpu utilization that +the system expects to receive. By default this is set to 100, +in a guest system this could be set to a value between 0 and 100. +This value is used to adjust the amount of steal time that +process accounting code in the guest will display. The end effect +will be is that steal time will only be reported if the +percentage of steal time is greater than 100 - cpu_entitlement +value. + +== + ctrl-alt-del: When the value in this file is 0, ctrl-alt-del is trapped and diff --git a/fs/proc/stat.c b/fs/proc/stat.c index 64c3b31..14e26c8 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -12,6 +12,7 @@ #include #include +int cpu_entitlement = 100; #ifndef arch_irq_stat_cpu #define arch_irq_stat_cpu(cpu) 0 #endif diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 87174ef..85efbc2 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -109,6 +109,7 @@ extern int percpu_pagelist_fraction; extern int compat_log; extern int latencytop_enabled; extern int sysctl_nr_open_min, sysctl_nr_open_max; +extern int cpu_entitlement; #ifndef CONFIG_MMU extern int sysctl_nr_trim_pages; #endif @@ -673,6 +674,15 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, + { + .procname = "cpu_entitlement", + .data = &cpu_entitlement, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &one_hundred, + }, #if defined CONFIG_PRINTK { .procname = "printk", -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 3/3] Modify the amount of stealtime that the kernel reports via the /proc interface.
Stealtime will be adjusted based on the cpu entitlement setting. The user will supply the cpu_entitlement which is the percentage of cpu the guest can expect to receive. The expected steal time is based on the expected steal percentage which is 100 - cpu_entitlement. If steal_time is less than the expected steal time that is reported steal_time is changed to 0 no other fields are changed. If the steal_time is greater than the expected_steal then the difference is reported. By default the cpu_entitlement will be 100% and the steal time will be reported without any modification. Signed-off-by: Michael Wolf --- fs/proc/stat.c | 70 ++- include/linux/kernel_stat.h |2 + 2 files changed, 70 insertions(+), 2 deletions(-) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index cf5..efbaa03 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -73,6 +73,68 @@ static u64 get_iowait_time(int cpu) #endif +/* + * This function will alter the steal time value that is written out + * to /proc/stat. The cpu_entitlement is set by the user/admin and is + * meant to reflect the percentage of the processor that is expected to + * be used. So as long as the amount of steal time is less than the + * expected steal time (based on cpu_entitlement) then report steal time + * as zero. + */ +static void kstat_adjust_steal_time(int currcpu) +{ + int j; + u64 cpustat_delta[NR_STATS]; + u64 total_elapsed_time; + int expected_steal_pct; + u64 expected_steal; + u64 *currstat, *prevstat; + + /* +* if cpu_entitlement = 100% then the expected steal time is 0 +* so we don't need to do any adjustments to the fields. +*/ + if (cpu_entitlement == 100) { + kcpustat_cpu(currcpu).cpustat[CPUTIME_ADJ_STEAL] = + kcpustat_cpu(currcpu).cpustat[CPUTIME_STEAL]; + return; + } + /* +* For the user it is more intuitive to think in terms of +* cpu entitlement. To do the calculations it is easier to +* think in terms of allowed steal time. So convert the percentage +* from cpu_entitlement to expected_steal_percent. +*/ + expected_steal_pct = 100 - cpu_entitlement; + + total_elapsed_time = 0; + /* determine the total time elapsed between calls */ + currstat = kcpustat_cpu(currcpu).cpustat; + prevstat = kcpustat_cpu(currcpu).prev_cpustat; + for (j = CPUTIME_USER; j < CPUTIME_GUEST; j++) { + cpustat_delta[j] = currstat[j] - prevstat[j]; + prevstat[j] = currstat[j]; + total_elapsed_time = total_elapsed_time + cpustat_delta[j]; + } + + /* +* calculate the amount of expected steal time. Add 5 as a +* rounding factor. +*/ + + expected_steal = (total_elapsed_time * expected_steal_pct + 5) / 100; + if (cpustat_delta[CPUTIME_STEAL] < expected_steal) + cpustat_delta[CPUTIME_STEAL] = 0; + else + cpustat_delta[CPUTIME_STEAL] -= expected_steal; + + /* Adjust the steal time accordingly */ + currstat[CPUTIME_ADJ_STEAL] = prevstat[CPUTIME_ADJ_STEAL] + + cpustat_delta[CPUTIME_STEAL]; + prevstat[CPUTIME_ADJ_STEAL] = currstat[CPUTIME_ADJ_STEAL]; +} + + static int show_stat(struct seq_file *p, void *v) { int i, j; @@ -90,7 +152,11 @@ static int show_stat(struct seq_file *p, void *v) getboottime(&boottime); jif = boottime.tv_sec; + for_each_possible_cpu(i) { + /* adjust the steal time based on the processor entitlement */ + kstat_adjust_steal_time(i); + user += kcpustat_cpu(i).cpustat[CPUTIME_USER]; nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE]; system += kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM]; @@ -98,7 +164,7 @@ static int show_stat(struct seq_file *p, void *v) iowait += get_iowait_time(i); irq += kcpustat_cpu(i).cpustat[CPUTIME_IRQ]; softirq += kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]; - steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; + steal += kcpustat_cpu(i).cpustat[CPUTIME_ADJ_STEAL]; guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; sum += kstat_cpu_irqs_sum(i); @@ -135,7 +201,7 @@ static int show_stat(struct seq_file *p, void *v) iowait = get_iowait_time(i); irq = kcpustat_cpu(i).cpustat[CPUTIME_IRQ]; softirq = kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]; - steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; + steal = kcpustat_cpu(i).cpustat[CPUTIME_ADJ_STEAL]; guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
[PATCH RFC 2/3] Add a hypercall to retrieve the cpu entitlement value from the host.
If the hypercall is not implemented on the host a default value of 100 will be used. This value will be stored in /proc/sys/kernel/cpu_entitlements. Signed-off-by: Michael Wolf --- arch/x86/kvm/x86.c |8 fs/proc/stat.c |9 + include/linux/kvm.h |3 +++ include/linux/kvm_host.h |3 +++ include/linux/kvm_para.h |1 + virt/kvm/kvm_main.c |7 +++ 6 files changed, 31 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 42bce48..734bc3d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5052,6 +5052,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) case KVM_HC_VAPIC_POLL_IRQ: ret = 0; break; + case KVM_HC_CPU_ENTITLEMENT: + ret = vcpu->kvm->vcpu_entitlement; + break; default: ret = -KVM_ENOSYS; break; @@ -5897,6 +5900,11 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu, return 0; } +int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm *kvm, long entitlement) +{ + kvm->vcpu_entitlement = (int) entitlement; + return 0; +} int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) { struct i387_fxsave_struct *fxsave = diff --git a/fs/proc/stat.c b/fs/proc/stat.c index 14e26c8..cf5 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -11,6 +11,7 @@ #include #include #include +#include int cpu_entitlement = 100; #ifndef arch_irq_stat_cpu @@ -213,6 +214,14 @@ static const struct file_operations proc_stat_operations = { static int __init proc_stat_init(void) { + long ret; + + if (kvm_para_available()) { + ret = kvm_hypercall0(KVM_HC_CPU_ENTITLEMENT); + if (ret > 0) + cpu_entitlement = (int) ret; + } + proc_create("stat", 0, NULL, &proc_stat_operations); return 0; } diff --git a/include/linux/kvm.h b/include/linux/kvm.h index 2ce09aa..fccd08e 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -904,6 +904,9 @@ struct kvm_s390_ucas_mapping { #define KVM_SET_ONE_REG _IOW(KVMIO, 0xac, struct kvm_one_reg) /* VM is being stopped by host */ #define KVM_KVMCLOCK_CTRL_IO(KVMIO, 0xad) +/* Set the cpu entitlement this will be used to adjust stealtime reporting */ +#define KVM_SET_ENTITLEMENT _IOW(KVMIO, 0xae, unsigned long) + #define KVM_DEV_ASSIGN_ENABLE_IOMMU(1 << 0) #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index b70b48b..71e3d73 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -276,6 +276,7 @@ struct kvm { struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; atomic_t online_vcpus; int last_boosted_vcpu; + int vcpu_entitlement; struct list_head vm_list; struct mutex lock; struct kvm_io_bus *buses[KVM_NR_BUSES]; @@ -538,6 +539,8 @@ void kvm_arch_hardware_unsetup(void); void kvm_arch_check_processor_compat(void *rtn); int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu); int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu); +int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm *kvm, + long entitlement); void kvm_free_physmem(struct kvm *kvm); diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h index ff476dd..95f3387 100644 --- a/include/linux/kvm_para.h +++ b/include/linux/kvm_para.h @@ -19,6 +19,7 @@ #define KVM_HC_MMU_OP 2 #define KVM_HC_FEATURES3 #define KVM_HC_PPC_MAP_MAGIC_PAGE 4 +#define KVM_HC_CPU_ENTITLEMENT 5 /* * hypercalls use architecture specific diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 2468523..a0a4939 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2093,6 +2093,13 @@ static long kvm_vm_ioctl(struct file *filp, break; } #endif + case KVM_SET_ENTITLEMENT: { + r = kvm_arch_vcpu_ioctl_set_entitlement(kvm, arg); + if (r) + goto out; + r = 0; + break; + } default: r = kvm_arch_vm_ioctl(filp, ioctl, arg); if (r == -ENOTTY) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting
On Fri, 2012-08-24 at 08:53 +0400, Glauber Costa wrote: > On 08/24/2012 03:14 AM, Michael Wolf wrote: > > This is an RFC regarding the reporting of stealtime. In the case of > > where you have a system that is running with partial processors such as > > KVM the user may see steal time being reported in accounting tools such > > as top or vmstat. This can cause confusion for the end user. To > > ease the confusion this patch set adds a sysctl interface to set the > > cpu entitlement. This is the percentage of cpu that the guest system is > > expected to receive. As long as the steal time is within its expected > > range it will show up as 0 in /proc/stat. The user will then see in the > > accounting tools that they are getting a full utilization of the cpu > > resources assigned to them. > > > > And how is such a knob not confusing? > > Steal time is pretty well defined in meaning and is shown in top for > ages. I really don't see the point for this. Currently you can see the steal time but you have no way of knowing if the cpu utilization you are seeing on the guest is the expected amount. I decided on making it a knob because a guest could be migrated to another system and it's entitlement could change because of hardware or load differences. It could simply be a /proc file and report the current entitlement if needed. As things are currently implemented I don't see how someone knows if the guest is running as expected or whether there is a problem. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting
On Sat, 2012-08-25 at 19:36 -0400, Glauber Costa wrote: > On 08/24/2012 11:11 AM, Michael Wolf wrote: > > On Fri, 2012-08-24 at 08:53 +0400, Glauber Costa wrote: > >> On 08/24/2012 03:14 AM, Michael Wolf wrote: > >>> This is an RFC regarding the reporting of stealtime. In the case of > >>> where you have a system that is running with partial processors such as > >>> KVM the user may see steal time being reported in accounting tools such > >>> as top or vmstat. This can cause confusion for the end user. To > >>> ease the confusion this patch set adds a sysctl interface to set the > >>> cpu entitlement. This is the percentage of cpu that the guest system is > >>> expected to receive. As long as the steal time is within its expected > >>> range it will show up as 0 in /proc/stat. The user will then see in the > >>> accounting tools that they are getting a full utilization of the cpu > >>> resources assigned to them. > >>> > >> > >> And how is such a knob not confusing? > >> > >> Steal time is pretty well defined in meaning and is shown in top for > >> ages. I really don't see the point for this. > > > > Currently you can see the steal time but you have no way of knowing if > > the cpu utilization you are seeing on the guest is the expected amount. > > I decided on making it a knob because a guest could be migrated to > > another system and it's entitlement could change because of hardware or > > load differences. It could simply be a /proc file and report the > > current entitlement if needed. As things are currently implemented I > > don't see how someone knows if the guest is running as expected or > > whether there is a problem. > > > > Turning off steal time display won't get even close to displaying the > information you want. What you probably want is a guest-visible way to > say how many miliseconds you are expected to run each second. Right? It is not clear to me how knowing how many milliseconds you are expecting to run will help the user. Currently the users will run top to see how well the guest is running. If they see _any_ steal time some users think they are not getting the full use of their processor entitlement. Maybe I'm missing what you are proposing, but even if you knew the milliseconds that you were expecting for your processor you would have to adjust the top output in your head so to speak. You would see the utilization and then say 'ok that matches the number of milliseconds I expected to run..." If we take away the steal time (as long as it is equal to or less than the expected amount of steal time) then the user running top will see the 100% utilization. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting
On Mon, 2012-08-27 at 11:50 -0700, Glauber Costa wrote: > On 08/27/2012 08:50 AM, Michael Wolf wrote: > > On Sat, 2012-08-25 at 19:36 -0400, Glauber Costa wrote: > >> On 08/24/2012 11:11 AM, Michael Wolf wrote: > >>> On Fri, 2012-08-24 at 08:53 +0400, Glauber Costa wrote: > >>>> On 08/24/2012 03:14 AM, Michael Wolf wrote: > >>>>> This is an RFC regarding the reporting of stealtime. In the case of > >>>>> where you have a system that is running with partial processors such as > >>>>> KVM the user may see steal time being reported in accounting tools such > >>>>> as top or vmstat. This can cause confusion for the end user. To > >>>>> ease the confusion this patch set adds a sysctl interface to set the > >>>>> cpu entitlement. This is the percentage of cpu that the guest system is > >>>>> expected to receive. As long as the steal time is within its expected > >>>>> range it will show up as 0 in /proc/stat. The user will then see in the > >>>>> accounting tools that they are getting a full utilization of the cpu > >>>>> resources assigned to them. > >>>>> > >>>> > >>>> And how is such a knob not confusing? > >>>> > >>>> Steal time is pretty well defined in meaning and is shown in top for > >>>> ages. I really don't see the point for this. > >>> > >>> Currently you can see the steal time but you have no way of knowing if > >>> the cpu utilization you are seeing on the guest is the expected amount. > >>> I decided on making it a knob because a guest could be migrated to > >>> another system and it's entitlement could change because of hardware or > >>> load differences. It could simply be a /proc file and report the > >>> current entitlement if needed. As things are currently implemented I > >>> don't see how someone knows if the guest is running as expected or > >>> whether there is a problem. > >>> > >> > >> Turning off steal time display won't get even close to displaying the > >> information you want. What you probably want is a guest-visible way to > >> say how many miliseconds you are expected to run each second. Right? > > > > It is not clear to me how knowing how many milliseconds you are > > expecting to run will help the user. Currently the users will run top > > to see how well the guest is running. If they see _any_ steal time some > > users think they are not getting the full use of their processor > > entitlement. > > > > And your plan is just to selectively lie about it, but disabling it with > a knob? It is about making it very obvious to the end user whether they are receiving their cpu entitlement. If there is more steal time than expected that will still show up. I have experimented, and it seems to work, to put the raw stealtime at the end of each cpu line in /proc/stat. That way the raw data is there as well. Do you have another suggestion to communicate to the user whether they are receiving their full entitlement? At the very least shouldn't the entitlement reside in a /proc file somewhere so that the user could look up the value and "do the math"? > > > Maybe I'm missing what you are proposing, but even if you knew the > > milliseconds that you were expecting for your processor you would have > > to adjust the top output in your head so to speak. You would see the > > utilization and then say 'ok that matches the number of milliseconds I > > expected to run..." If we take away the steal time (as long as it is > > equal to or less than the expected amount of steal time) then the user > > running top will see the 100% utilization. > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting
On Mon, 2012-08-27 at 11:55 -0700, Avi Kivity wrote: > On 08/23/2012 04:14 PM, Michael Wolf wrote: > > This is an RFC regarding the reporting of stealtime. In the case of > > where you have a system that is running with partial processors such as > > KVM the user may see steal time being reported in accounting tools such > > as top or vmstat. This can cause confusion for the end user. To > > ease the confusion this patch set adds a sysctl interface to set the > > cpu entitlement. This is the percentage of cpu that the guest system is > > expected to receive. As long as the steal time is within its expected > > range it will show up as 0 in /proc/stat. The user will then see in the > > accounting tools that they are getting a full utilization of the cpu > > resources assigned to them. > > > > This patchset is changing the contents/output of /proc/stat and could > > affect > > user tools. However the default setting is that the cpu is entitled to > > 100% > > so the code will act as before. Also another field could be added to the > > /proc/stat output and show the unaltered steal time. Since this additional > > field could cause more confusion than it would clear up I have left it out > > for now. > > > > How would a guest know what its entitlement is? > > Currently the Admin/management tool setting up the guests will put it on the qemu commandline. From this it is passed via an ioctl to the host. The guest will get the value from the host via a hypercall. In the future the host could try and do some of it automatically in some cases. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting
On Mon, 2012-08-27 at 13:31 -0700, Avi Kivity wrote: > On 08/27/2012 01:23 PM, Michael Wolf wrote: > > > > > > How would a guest know what its entitlement is? > > > > > > > > > > Currently the Admin/management tool setting up the guests will put it on > > the qemu commandline. From this it is passed via an ioctl to the host. > > The guest will get the value from the host via a hypercall. > > > > In the future the host could try and do some of it automatically in some > > cases. > > Seems to me it's a meaningless value for the guest. Suppose it is > migrated to a host that is more powerful, and as a result its relative > entitlement is reduced. The value needs to be adjusted. This is why I chose to manage the value from the sysctl interface rather than just have it stored as a value in /proc. Whatever tool was used to migrate the vm could hopefully adjust the sysctl value on the guest. > > This is best taken care of from the host side. Not sure what you are getting at here. If you are running in a cloud environment, you purchase a VM with the understanding that you are getting certain resources. As this type of user I don't believe you have any access to the host to see this type of information. So the user still wouldnt have a way to confirm that they are receiving what they should be in the way of processor resources. Would you please elaborate a little more on this? > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting
On Mon, 2012-08-27 at 14:41 -0700, Glauber Costa wrote: > On 08/27/2012 02:27 PM, Michael Wolf wrote: > > On Mon, 2012-08-27 at 13:31 -0700, Avi Kivity wrote: > >> On 08/27/2012 01:23 PM, Michael Wolf wrote: > >>>> > >>>> How would a guest know what its entitlement is? > >>>> > >>>> > >>> > >>> Currently the Admin/management tool setting up the guests will put it on > >>> the qemu commandline. From this it is passed via an ioctl to the host. > >>> The guest will get the value from the host via a hypercall. > >>> > >>> In the future the host could try and do some of it automatically in some > >>> cases. > >> > >> Seems to me it's a meaningless value for the guest. Suppose it is > >> migrated to a host that is more powerful, and as a result its relative > >> entitlement is reduced. The value needs to be adjusted. > > > > This is why I chose to manage the value from the sysctl interface rather > > than just have it stored as a value in /proc. Whatever tool was used to > > migrate the vm could hopefully adjust the sysctl value on the guest. > >> > >> This is best taken care of from the host side. > > > > Not sure what you are getting at here. If you are running in a cloud > > environment, you purchase a VM with the understanding that you are > > getting certain resources. As this type of user I don't believe you > > have any access to the host to see this type of information. So the > > user still wouldnt have a way to confirm that they are receiving what > > they should be in the way of processor resources. > > > > Would you please elaborate a little more on this? > > What do you mean they have no access to the host? > They have access to all sorts of tools that display information from the > host. Speaking of a view-only resource, those are strictly equivalent. > > > ok. I will go look at those resources. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC V2 0/5] Separate consigned (expected steal) from steal time.
In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. TODO: * Change native_clock to take params and not return a value * Change update_rq_clock_task Changes from V1: * Removed the steal time allowed percentage from the guest * Moved the separation of consigned (expected steal) and steal time to the host. * No longer include a sysctl interface. --- Michael Wolf (5): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. Add an ioctl to communicate the consign limit to the host. arch/x86/include/asm/kvm_host.h | 11 + arch/x86/include/asm/kvm_para.h |3 ++ arch/x86/include/asm/paravirt.h |4 ++- arch/x86/kernel/kvm.c |8 ++ arch/x86/kvm/x86.c | 50 ++- fs/proc/stat.c |9 +-- include/linux/kernel_stat.h |2 ++ include/linux/kvm.h |2 ++ include/linux/kvm_host.h|2 ++ kernel/sched/cputime.c | 21 +++- kernel/sched/sched.h|2 ++ virt/kvm/kvm_main.c |7 + 12 files changed, 108 insertions(+), 13 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC V2 1/5] Alter the amount of steal time reported by the guest.
Modify the amount of stealtime that the kernel reports via the /proc interface. Steal time will now be broken down into steal_time and consigned_time. Consigned_time will represent the amount of time that is expected to be lost due to overcommitment of the physical cpu or by using cpu capping. The amount consigned_time will be passed in using an ioctl. The time will be expressed in the number of nanoseconds to be lost in during the fixed period. The fixed period is currently 1/10th of a second. Signed-off-by: Michael Wolf --- fs/proc/stat.c |9 +++-- include/linux/kernel_stat.h |1 + 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index e296572..cb7fe80 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v) int i, j; unsigned long jif; u64 user, nice, system, idle, iowait, irq, softirq, steal; - u64 guest, guest_nice; + u64 guest, guest_nice, consign; u64 sum = 0; u64 sum_softirq = 0; unsigned int per_softirq_sums[NR_SOFTIRQS] = {0}; @@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v) user = nice = system = idle = iowait = irq = softirq = steal = 0; - guest = guest_nice = 0; + guest = guest_nice = consign = 0; getboottime(&boottime); jif = boottime.tv_sec; + for_each_possible_cpu(i) { user += kcpustat_cpu(i).cpustat[CPUTIME_USER]; nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE]; @@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v) steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; sum += kstat_cpu_irqs_sum(i); sum += arch_irq_stat_cpu(i); @@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); for_each_online_cpu(i) { @@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v) steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; seq_printf(p, "cpu%d", i); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice)); @@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); } seq_printf(p, "intr %llu", (unsigned long long)sum); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 36d12f0..c0b0095 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -27,6 +27,7 @@ enum cpu_usage_stat { CPUTIME_STEAL, CPUTIME_GUEST, CPUTIME_GUEST_NICE, + CPUTIME_CONSIGN, NR_STATS, }; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC V2 2/5] Expand the steal time msr to also contain the consigned time.
Add a consigned field. This field will hold the time lost due to capping or overcommit. The rest of the time will still show up in the steal-time field. Signed-off-by: Michael Wolf --- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/kernel/kvm.c |7 ++- kernel/sched/cputime.c |2 +- 3 files changed, 5 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a0facf3..a5f9f30 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal) { - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index c1d61ee..91b3b2a 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -368,9 +368,8 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static u64 kvm_steal_clock(int cpu) +static u64 kvm_steal_clock(int cpu, u64 *steal) { - u64 steal; struct kvm_steal_time *src; int version; @@ -378,11 +377,9 @@ static u64 kvm_steal_clock(int cpu) do { version = src->version; rmb(); - steal = src->steal; + *steal = src->steal; rmb(); } while ((version & 1) || (version != src->version)); - - return steal; } void kvm_disable_steal_time(void) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 81b763b..dd3fd46 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void) if (static_key_false(¶virt_steal_enabled)) { u64 steal, st = 0; - steal = paravirt_steal_clock(smp_processor_id()); + paravirt_steal_clock(smp_processor_id(), &steal); steal -= this_rq()->prev_steal_time; st = steal_ticks(steal); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC V2 3/5] Add the code to send the consigned time from the host to the guest
Add the code to send the consigned time from the host to the guest Signed-off-by: Michael Wolf --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/include/asm/kvm_para.h |3 ++- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/kernel/kvm.c |3 ++- arch/x86/kvm/x86.c |2 ++ include/linux/kernel_stat.h |1 + kernel/sched/cputime.c | 21 +++-- kernel/sched/sched.h|2 ++ 8 files changed, 31 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 1eaa6b0..bd4e412 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -409,6 +409,7 @@ struct kvm_vcpu_arch { u64 msr_val; u64 last_steal; u64 accum_steal; + u64 accum_consigned; struct gfn_to_hva_cache stime; struct kvm_steal_time steal; } st; diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index 2f7712e..debe72e 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -42,9 +42,10 @@ struct kvm_steal_time { __u64 steal; + __u64 consigned; __u32 version; __u32 flags; - __u32 pad[12]; + __u32 pad[10]; }; #define KVM_STEAL_ALIGNMENT_BITS 5 diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a5f9f30..d39e8d0 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu, u64 *steal) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { - PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); + PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 91b3b2a..4e5582a 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -368,7 +368,7 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static u64 kvm_steal_clock(int cpu, u64 *steal) +static u64 kvm_steal_clock(int cpu, u64 *steal, u64 *consigned) { struct kvm_steal_time *src; int version; @@ -378,6 +378,7 @@ static u64 kvm_steal_clock(int cpu, u64 *steal) version = src->version; rmb(); *steal = src->steal; + *consigned = src->consigned; rmb(); } while ((version & 1) || (version != src->version)); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1f09552..801cfa8 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1554,8 +1554,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu) return; vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal; + vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned; vcpu->arch.st.steal.version += 2; vcpu->arch.st.accum_steal = 0; + vcpu->arch.st.accum_consigned = 0; kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime, &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index c0b0095..253fdce 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -125,6 +125,7 @@ extern unsigned long long task_delta_exec(struct task_struct *); extern void account_user_time(struct task_struct *, cputime_t, cputime_t); extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t); extern void account_steal_time(cputime_t); +extern void account_consigned_time(cputime_t); extern void account_idle_time(cputime_t); extern void account_process_tick(struct task_struct *, int user); diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index dd3fd46..bf2025a 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int hardirq_offset, } /* + * This accounts for the time that is split out of steal time. + * Consigned time represents the amount of time that the cpu was + * expected to be somewhere else. + */ +void account_consigned_time(cputime_t cputime) +{ + u64 *cpustat = kcpustat_this_cpu->cpustat; + + cpustat[CPUTIME_CONSIGN] += (__force u64) cputime; +} + +/* * Account for involuntary wait time. * @cputime: the cpu time spent in involuntary wait */ @@ -274,15 +286,20 @@ static __always_inline bool steal_account_process_tick(void) #ifdef CONFIG_PARAVIRT if (static_key_false(¶virt_steal_enabled)) {
[PATCH RFC V2 4/5] Add a timer to allow the separation of consigned from steal time.
Add a timer to the host. This will define the period. During a period the first n ticks will go into the consigned bucket. Any other ticks that occur within the period will be placed in the stealtime bucket. Signed-off-by: Michael Wolf --- arch/x86/include/asm/kvm_host.h | 10 + arch/x86/include/asm/paravirt.h |2 +- arch/x86/kvm/x86.c | 42 ++- 3 files changed, 52 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index bd4e412..d700850 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -41,6 +41,8 @@ #define KVM_PIO_PAGE_OFFSET 1 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2 +#define KVM_STEAL_TIMER_DELAY 1UL + #define CR0_RESERVED_BITS \ (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \ | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \ @@ -339,6 +341,14 @@ struct kvm_vcpu_arch { bool tpr_access_reporting; /* +* timer used to determine if the time should be counted as +* steal time or consigned time. +*/ + struct hrtimer steal_timer; + u64 current_consigned; + u64 consigned_limit; + + /* * Paging state of the vcpu * * If the vcpu runs in guest mode with two level paging this still saves diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index d39e8d0..6db79f9 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,7 +196,7 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) +static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 801cfa8..469e748 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1535,13 +1535,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu) static void accumulate_steal_time(struct kvm_vcpu *vcpu) { u64 delta; + u64 steal_delta; + u64 consigned_delta; if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) return; delta = current->sched_info.run_delay - vcpu->arch.st.last_steal; vcpu->arch.st.last_steal = current->sched_info.run_delay; - vcpu->arch.st.accum_steal = delta; + + /* split the delta into steal and consigned */ + if (vcpu->arch.current_consigned < vcpu->arch.consigned_limit) { + vcpu->arch.current_consigned += delta; + if (vcpu->arch.current_consigned > vcpu->arch.consigned_limit) { + steal_delta = vcpu->arch.current_consigned + - vcpu->arch.consigned_limit; + consigned_delta = delta - steal_delta; + } else { + consigned_delta = delta; + steal_delta = 0; + } + } else { + consigned_delta = 0; + steal_delta = delta; + } + vcpu->arch.st.accum_steal = steal_delta; + vcpu->arch.st.accum_consigned = consigned_delta; } static void record_steal_time(struct kvm_vcpu *vcpu) @@ -6187,11 +6206,25 @@ bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu) return irqchip_in_kernel(vcpu->kvm) == (vcpu->arch.apic != NULL); } +enum hrtimer_restart steal_timer_fn(struct hrtimer *data) +{ + struct kvm_vcpu *vcpu; + ktime_t now; + + vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer); + vcpu->arch.current_consigned = 0; + now = ktime_get(); + hrtimer_forward(&vcpu->arch.steal_timer, now, + ktime_set(0, KVM_STEAL_TIMER_DELAY)); + return HRTIMER_RESTART; +} + int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) { struct page *page; struct kvm *kvm; int r; + ktime_t ktime; BUG_ON(vcpu->kvm == NULL); kvm = vcpu->kvm; @@ -6234,6 +6267,12 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) kvm_async_pf_hash_reset(vcpu); kvm_pmu_init(vcpu); + /* Initialize and start a timer to capture steal and consigned time */ + hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC, + HRTIMER_MODE_REL); + vcpu->arch.steal_timer.function = &steal_timer_fn; + ktime = ktime_set(0, KVM_STEAL_TIMER_DELAY); + hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL); return 0; fail_free_mce_banks: @@ -6252,6 +6291,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
[PATCH RFC V2 5/5] Add an ioctl to communicate the consign limit to the host.
Add an ioctl to communicate the consign limit to the host. Signed-off-by: Michael Wolf --- arch/x86/kvm/x86.c |6 ++ include/linux/kvm.h |2 ++ include/linux/kvm_host.h |2 ++ virt/kvm/kvm_main.c |7 +++ 4 files changed, 17 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 469e748..5a4b8db 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5928,6 +5928,12 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu, return 0; } +int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu, long entitlement) +{ + vcpu->arch.consigned_limit = entitlement; + return 0; +} + int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) { struct i387_fxsave_struct *fxsave = diff --git a/include/linux/kvm.h b/include/linux/kvm.h index 2ce09aa..6b211fb 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -904,6 +904,8 @@ struct kvm_s390_ucas_mapping { #define KVM_SET_ONE_REG _IOW(KVMIO, 0xac, struct kvm_one_reg) /* VM is being stopped by host */ #define KVM_KVMCLOCK_CTRL_IO(KVMIO, 0xad) +/* Set the consignment limit which will be used to separete steal time */ +#define KVM_SET_ENTITLEMENT _IOW(KVMIO, 0xae, unsigned long) #define KVM_DEV_ASSIGN_ENABLE_IOMMU(1 << 0) #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 8a59e0a..9a0a791 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -538,6 +538,8 @@ void kvm_arch_hardware_unsetup(void); void kvm_arch_check_processor_compat(void *rtn); int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu); int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu); +int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu, + long entitlement); void kvm_free_physmem(struct kvm *kvm); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index d617f69..ccf0aec 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1936,6 +1936,13 @@ out_free2: r = 0; break; } + case KVM_SET_ENTITLEMENT: { + r = kvm_arch_vcpu_ioctl_set_entitlement(vcpu, arg); + if (r) + goto out; + r = 0; + break; + } default: r = kvm_arch_vcpu_ioctl(filp, ioctl, arg); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
OT: ROCK Linux 1.3.11 and 2.4.0-test kernels
Hi, sorry for posting something offtopic - but this might be for interrest to everyone who would like to try a Linux Kernel 2.4.0 - based system. A new development release of the ROCK Linux Distribution came out yesterday. ROCK Linux 1.3.11 is based on the Linux Kernel 2.4.0-test9, GNU Libc 2.1.3, GCC 2.95.2 and Binutils 2.10.0.26. That means that every singe package has been built with this software components, on a system running this software components. We had some minor problems with the new kernel header files, but no real troubles. Here is the full announcement for those of you who are interested: ANNOUNCE: ROCK Linux 1.3.11 [ 2000-10-23 ] The ROCK Linux Developers are pleased to announce version 1.3.11 of the ROCK Linux Distribution - a Linux Distribution for Admins, Hackers, Geeks and skilled unix users. ROCK Linux 1.3.11 is a development release and not intended for production environments. It's based on the Linux Kernel 2.4.0-test9, GNU Libc 2.1.3, GCC 2.95.2 and Binutils 2.10.0.26. The 1.3.11 release supports Intel IA-32, Alpha AXP and PowerPC Systems. ROCK Linux uses DevFS, XFree 4.0.1, GNOME 1.2, Message Passing for Beowulf Clusters and other bleeding edge technology. ROCK Linux is based on a simple build system using shell scripts which allow you to re-build the entire distribution using full optimization for your CPU or some special adaptions for your needs. It also makes porting ROCK Linux to new architectures a simple task. More information on ROCK Linux can be found on the Homepage at [ http://www.rocklinux.org/ ] or the "ROCK Linux Guide" at [ http://www.rocklinux.org/projects/doc/GUIDE/ ] A complete CHANGELOG is available at [ http://www.rocklinux.org/changelog/ ] The ROCK Linux Sources are available for download on the ROCK Linux Homepage. The package sources used by ROCK Linux will show up on the ROCK Linux mirrors withing the next couple of days. A lot of fun and happy hacking, - Clifford Wolf (Project Leader) --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- Clifford Wolf www.clifford.at IRC: http://opirc.nu The ROCK Projects Workgroup .. www.rock-projects.com Tel: +43-699-10063494 The ROCK Linux Workgroup . www.rocklinux.org Fax: +43-2235-42788-4 The NTx Consulting Group . www.ntx.atemail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Scheduler behaviour
Arjan van de Ven wrote: On Wed, 05 Dec 2007 21:15:30 +0100 Holger Wolf <[EMAIL PROTECTED]> wrote: We discovered performance degradation with dbench when using kernel 2.6.23 compared to kernel 2.6.22. In our case we booted a Linux in a IBM System z9 LPAR with 256MB of ram with 4 CPU's. This system uses a striped LV with 16 disks on a Storage Server connected via 8 4GBit links. A dbench was started on that system performing I/O operations on the striped LV. dbench runs were performed with 1 to 62 processes. Measurements with a 2.6.22 kernel were compared to measurements with a 2.6.23 kernel. We saw a throughput degradation from 7.2 to 23.4 this is good news! dbench rewards unfair behavior... so higher dbench usually means a worse kernel ;) tests with 2.6.22 including CFS show the same results. This means the pressure on page cache is much higher when all processes run in parallel. We see this behavior as well with iozone when writing on many disks with many threads and just 256 MB memory. This means the scheduler schedules as it should - fair. regards Holger -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch] powerpc systbl.h broken
Hi, In current 2.6.23 (I have checked 2.6.23.12 and 2.6.23.9) the end of include/asm-powerpc/systbl.h reads: --snip-- SYSCALL_SPU(getcpu) COMPAT_SYS(epoll_pwait) COMPAT_SYS_SPU(utimensat) COMPAT_SYS(fallocate) COMPAT_SYS_SPU(signalfd) COMPAT_SYS_SPU(timerfd) SYSCALL_SPU(eventfd) COMPAT_SYS_SPU(sync_file_range2) --snap-- This obviously does not match the definitions in asm-powerpc/unistd.h: --snip-- #define __NR_getcpu 302 #define __NR_epoll_pwait303 #define __NR_utimensat 304 #define __NR_signalfd 305 #define __NR_timerfd306 #define __NR_eventfd307 #define __NR_sync_file_range2 308 #define __NR_fallocate 309 --snap-- which breaks the system calls 305 to 309 inclusive. I've attached the simple patch which fixes the problem. yours, - clifford -- "Any sufficiently advanced technology is indistinguishable from magic." - Arthur C. Clarke --- linux-2.6.23.12/include/asm-powerpc/systbl.h.orig 2008-01-02 15:09:04.0 +0100 +++ linux-2.6.23.12/include/asm-powerpc/systbl.h2008-01-02 15:09:29.0 +0100 @@ -308,8 +308,8 @@ SYSCALL_SPU(getcpu) COMPAT_SYS(epoll_pwait) COMPAT_SYS_SPU(utimensat) -COMPAT_SYS(fallocate) COMPAT_SYS_SPU(signalfd) COMPAT_SYS_SPU(timerfd) SYSCALL_SPU(eventfd) COMPAT_SYS_SPU(sync_file_range2) +COMPAT_SYS(fallocate)
Scheduler behaviour
1 10.96 3.84% 41.081.05 2.97% 81.081.07 1.10% 121.09 1.1 -1.30% 16 1.11.12 -1.90% 201.121.15 -2.61% 261.181.17 0.86% 321.241.36 -9.50% 40 1.31.52 -16.97% 461.29 1.6 -23.92% 501.371.64 -20.12% 541.281.66 -29.60% 62 1.5 1.69 -12.84% regards Holger Wolf -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] rlim in proc//status
Hi, because I needed it already twice in two different projects this week: the following patch adds rlim (ulimits) output to /proc//status. Please let me know if there is another (already existing) way of accessing this information easy (i.e. connecting with gdb to the process in question and 'injecting' a getrlimit() call does not count.. ;-). yours, - clifford Signed-off-by: Clifford Wolf <[EMAIL PROTECTED]> --- linux/fs/proc/array.c (revision 757) +++ linux/fs/proc/array.c (working copy) @@ -239,6 +239,55 @@ } } +static char *rlim_names[RLIM_NLIMITS] = { + [RLIMIT_CPU]= "CPU", + [RLIMIT_FSIZE] = "FSize", + [RLIMIT_DATA] = "Data", + [RLIMIT_STACK] = "Stack", + [RLIMIT_CORE] = "Core", + [RLIMIT_RSS]= "RSS", + [RLIMIT_NPROC] = "NProc", + [RLIMIT_NOFILE] = "NoFile", + [RLIMIT_MEMLOCK]= "MemLock", + [RLIMIT_AS] = "AddrSpace", + [RLIMIT_LOCKS] = "Locks", + [RLIMIT_SIGPENDING] = "SigPending", + [RLIMIT_MSGQUEUE] = "MsgQueue", + [RLIMIT_NICE] = "Nice", + [RLIMIT_RTPRIO] = "RTPrio" +}; + +static inline char *task_rlim(struct task_struct *p, char *buffer) +{ + unsigned long flags; + struct rlimit rlim[RLIM_NLIMITS]; + int i; + + rcu_read_lock(); + if (lock_task_sighand(p, &flags)) { + for (i=0; isignal->rlim[i]; + } + rcu_read_unlock(); + + for (i=0; ihttp://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: rlim in proc//status
Hi, On Tue, Jan 15, 2008 at 07:47:22PM +0900, KOSAKI Motohiro wrote: > sound good for me. > a few question please. > > > + for (i=0; i > + if (rlim_names[i]) > > + buffer += sprintf(buffer, "Rlim%s:\t", rlim_names[i]); > > + else > > + buffer += sprintf(buffer, "Rlim%d:\t", i); > > this else is really necessary? no. not with the current sources. maybe something like the following would be better: #if RLIM_NLIMITS != 15 # error New RLIM_NLIMITS add mising entries to rlim_names[] #endif > > + if (rlim[i].rlim_cur != ~0) > > + buffer += sprintf(buffer, "%lu\t", rlim[i].rlim_cur); > > + else > > + buffer += sprintf(buffer, "-\t"); > > + if (rlim[i].rlim_max != ~0) > > + buffer += sprintf(buffer, "%lu\n", rlim[i].rlim_max); > > + else > > + buffer += sprintf(buffer, "-\n"); > > Why do you don't use RLIM_INFINITY? because I'm blind and didn't see it... ;-) maybe it would also be better to output 'inf' instead of '-' in this case? yours, - clifford -- The number of the beast - vi vi vi. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] rlim in proc//status (2nd rev.)
Hi, On Tue, Jan 15, 2008 at 02:36:59PM -0600, [EMAIL PROTECTED] wrote: > > + rcu_read_lock(); > > + if (lock_task_sighand(p, &flags)) { > > + for (i=0; i > + rlim[i] = p->signal->rlim[i]; > > I'm confused - where do you unlock_task_sighand()? oh fsck! thanks for that pointer.. Here is a new version of the patch which solves this issue and the issues adressed earlier in this thread by kosaki. yours, - clifford Signed-off-by: Clifford Wolf <[EMAIL PROTECTED]> --- linux/fs/proc/array.c (revision 750) +++ linux/fs/proc/array.c (revision 764) @@ -239,6 +239,58 @@ } } +static char *rlim_names[RLIM_NLIMITS] = { + [RLIMIT_CPU]= "CPU", + [RLIMIT_FSIZE] = "FSize", + [RLIMIT_DATA] = "Data", + [RLIMIT_STACK] = "Stack", + [RLIMIT_CORE] = "Core", + [RLIMIT_RSS]= "RSS", + [RLIMIT_NPROC] = "NProc", + [RLIMIT_NOFILE] = "NoFile", + [RLIMIT_MEMLOCK]= "MemLock", + [RLIMIT_AS] = "AddrSpace", + [RLIMIT_LOCKS] = "Locks", + [RLIMIT_SIGPENDING] = "SigPending", + [RLIMIT_MSGQUEUE] = "MsgQueue", + [RLIMIT_NICE] = "Nice", + [RLIMIT_RTPRIO] = "RTPrio" +}; + +#if RLIM_NLIMITS != 15 +# error Value of RLIM_NLIMITS changed. \ + Please update rlim_names in fs/proc/array.c +#endif + +static inline char *task_rlim(struct task_struct *p, char *buffer) +{ + unsigned long flags; + struct rlimit rlim[RLIM_NLIMITS]; + int i; + + rcu_read_lock(); + if (lock_task_sighand(p, &flags)) { + for (i=0; isignal->rlim[i]; + unlock_task_sighand(p, &flags); + } + rcu_read_unlock(); + + for (i=0; ihttp://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] rlim in proc//status (2nd rev.)
Hi, On Wed, Jan 16, 2008 at 04:33:12PM +0900, KOSAKI Motohiro wrote: > Hi Clifford, > > > +static inline char *task_rlim(struct task_struct *p, char *buffer) > > +{ > > + unsigned long flags; > > + struct rlimit rlim[RLIM_NLIMITS]; > > + int i; > > + > > + rcu_read_lock(); > > + if (lock_task_sighand(p, &flags)) { > > + for (i=0; i > + rlim[i] = p->signal->rlim[i]; > > + unlock_task_sighand(p, &flags); > > + } > > lock_task_sighand is possible return NULL? > if so, rlim is uninitialized when NULL. hmm.. can p->sighand be NULL here? If so, there also is a problem in the task_sig() function in the same file. In fact that code is copy&paste from task_sig(). In fact I'm not even sure if I actually need to lock anything to access p->signal->rlim.. I've searched the kernel code a bit and it looks like most acesses to the rlimits are done unlocked, which is no problem imo given the typical change-pattern of this values.. anyone a clue on this issues? yours, - clifford -- 2B OR (NOT 2B) That is the question. The answer is FF. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Problem with SMC Etherpower II + kernel newer 2.4.2
Hi everybody, currently I experience some strange problems with every kernels newer than 2.4.2 and my SMC Etherpower II network card. While running such a kernel, the network hangs and I get lots of errors like these listed below: Jul 2 13:06:59 localhost kernel: eth0: Too much work at interrupt, IntrStatus=0x008d0004. Jul 2 13:07:06 localhost last message repeated 5 times Jul 2 13:07:20 localhost kernel: NETDEV WATCHDOG: eth0: transmit timed out Jul 2 13:07:20 localhost kernel: eth0: Transmit timeout using MII device, Tx status 4003. Jul 2 13:07:22 localhost kernel: eth0: Too much work at interrupt, IntrStatus=0x008d0004. The /proc/pci lists the following system components: PCI devices found: Bus 0, device 0, function 0: Host bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133] (rev 3). Master Capable. Latency=32. Prefetchable 32 bit memory at 0xd800 [0xdbff]. Bus 0, device 1, function 0: PCI bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133 AGP] (rev 0). Master Capable. No bursts. Min Gnt=12. Bus 0, device 7, function 0: ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South] (rev 34). Bus 0, device 7, function 1: IDE interface: VIA Technologies, Inc. Bus Master IDE (rev 16). Master Capable. Latency=32. I/O at 0xc000 [0xc00f]. Bus 0, device 7, function 2: USB Controller: VIA Technologies, Inc. UHCI USB (rev 16). IRQ 9. Master Capable. Latency=32. I/O at 0xc400 [0xc41f]. Bus 0, device 7, function 3: USB Controller: VIA Technologies, Inc. UHCI USB (#2) (rev 16). IRQ 9. Master Capable. Latency=32. I/O at 0xc800 [0xc81f]. Bus 0, device 7, function 4: Host bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] (rev 48). Bus 0, device 7, function 5: Multimedia audio controller: VIA Technologies, Inc. AC97 Audio Controller (rev 32). IRQ 11. I/O at 0xcc00 [0xccff]. I/O at 0xd000 [0xd003]. I/O at 0xd400 [0xd403]. Bus 0, device 9, function 0: Multimedia audio controller: Xilinx, Inc. RME Digi96/8 (rev 4). IRQ 10. Non-prefetchable 32 bit memory at 0xde00 [0xdeff]. Bus 0, device 10, function 0: Multimedia audio controller: Ensoniq 5880 AudioPCI (rev 2). IRQ 5. Master Capable. Latency=32. Min Gnt=12.Max Lat=128. I/O at 0xdc00 [0xdc3f]. Bus 0, device 11, function 0: Ethernet controller: Standard Microsystems Corp [SMC] 83C170QF (rev 9). IRQ 11. Master Capable. Latency=32. Min Gnt=8.Max Lat=28. I/O at 0xe000 [0xe0ff]. Non-prefetchable 32 bit memory at 0xe000 [0xefff]. Bus 1, device 0, function 0: VGA compatible controller: nVidia Corporation NV11 (rev 161). IRQ 10. Master Capable. Latency=32. Min Gnt=5.Max Lat=1. Non-prefetchable 32 bit memory at 0xdc00 [0xdcff]. Prefetchable 32 bit memory at 0xd000 [0xd7ff]. Does anybody else got these errors or knows about a solution for this ?? The 2.2.x kernels and all kernel versions below (including) 2.4.2 work fine on the same system and I did not find any entries in the changelogs for the SMC driver code. Thx Juergen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Problem with SMC Etherpower II + kernel newer 2.4.2
John Jasen wrote: > > a) bad cable? > b) not negotiating speed and duplex with the switch correctly? > c) bad card? > Hi, The same errors show up with the network cable plugged or unplugged on all computers with the SMC card around here. But all these computers are equipped with nearly the same hardware (see my first posting of the /proc/pci file). Also the 2.4.6 kernel does not solve the problem. > d) IRQ sharing causing a conflict? I dont think so, at least I dont get a IRQ conflict message and there is no other device shown as using the same interrupt. If I use the 2.4.2 kernel or a version below everything works fine on the same host. Another strange effect is, that if I wait for quite some time (5-10 Minutes) while trying to start up the eth0 device with "ifconfig eth0 up" I see messages like Jul 4 09:38:58 localhost kernel: eth0: Setting full-duplex based on MII #3 link partner capability of 41e1. Jul 4 09:39:00 localhost kernel: NETDEV WATCHDOG: eth0: transmit timed out Jul 4 09:39:00 localhost kernel: eth0: Transmit timeout using MII device, Tx status 000b. Jul 4 09:39:02 localhost kernel: eth0: Too much work at interrupt, IntrStatus=0x008d0004. Jul 4 09:40:55 localhost kernel: eth0: Setting half-duplex based on MII #3 link partner capability of 0001. in between hundreds of "too much work at interrupt" messages. This error also occures regardles of the network cable is plugged or unplugged. Regards Juergen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Problem with SMC Etherpower II + kernel newer 2.4.2
Francois Romieu wrote: > > Is X or something like a nvidia module enabled ? > Hi, the nvidia modul is not loaded or enabled but X is running sometimes. Anyways, it seems to happen if X is not running too. Luckily I got a very helpfull hint from Hans-Christian Armingeon in reply to my questions here on the list. The epic100.c from http://lrcwww.epfl.ch/~boch/sw/epic100.c.txt fixes the problem in all the affected kernel versions. Thanx for your help guys Juergen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Problem with SMC Etherpower II + kernel newer 2.4.2
Francois Romieu wrote: > > Could you try 2.4.6 with just this modification: ? > hm, looks like thats really the point. After applying your diff file everything works fine. Regards, Juergen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Problem with SMC Etherpower II + kernel newer 2.4.2
Jeff Garzik wrote: > > Does it work with this line? > > outl(0x12, ioaddr + MIICfg); > yes, works fine too Regards, Juergen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
DMA Engine vs. DMA Transfer API
Hi Shannon, hi lkml, I'm writing a driver for the built-in DMA controller of the Freescale MPC8349E PowerPC based microcontroller. Unfortunately the DMA-Engine API seams to be the totally wrong thing for that since this DMA controller has many features which are not available thru dmaengine afaics, such as: * Scatter-Gather on source and/or destination * Fifo modes for 8-, 16- and 32-bit (on source and dest) * Snooping mode, Multiple direct PCI read modes, ... Also this DMA controller has only 4 channels (I do not know how many channels I/OAT has). So I needed a spooling subsystem so many drivers can easily share the available dma channels. The api also supports more than one backend device and automatically finds out which backend device fits a dma request best. I've named the entire thing dmatransfer api. A current devel snapshot can be found at: http://www.clifford.at/priv/dmatransfer-20070711.diff (I am on vacation right now and don't have access to my test hardware. So the entire thing is 'frozen' right now until the second august week when I'm back in the office.) The smart infrasturcture adds some overhead: In a testcase copying 1GB of data from memory to memory in 64kB chunks (300MHz DDR SDRAM, 500MHz PowerPC CPU) this overhead is about 0.25% over controlling the dma controller directly without my framework. Afaics there is only one user of DMA Engine framework right now and only one driver: The network stack. any chances that we can migrate I/OAT to my more flexible DMA Engine framework and modify the network stack to use dmatransfer instead of dmaengine or in any other way integrate the APIs? yours, - clifford -- "In the future, airplanes will be flown by a dog and a pilot. And the dog's job will be to make sure that if the pilot tries to touch any of the buttons, the dog bites him." - Scott Adams (author of Dilbert). - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Mem-2-Mem DMA - Generalized API (MPC8349E)
Hi, I'm curently working on an embedded project using the Freescale MPC8349E microcontroller (PPC e300 core + almost everything one needs for communicating with the rest of the world on one ~100$ chip). One feature of that microcontroller is a generic memory-2-memory four channel dma controller. Unlike the PC 8237 DMA controller (which copies data between the memory and dedicated DMA channels) the memory-2-memory DMA controller of the MPC8349E can copy between any bus addresses and thus can be used for copying data from one memory area to another, between memory and periphery mapped on bus addresses as well as from one peripheral device on the bus to another. The MPC8349E has four DMA channels which (unlike the PC 8237 DMA) are not bound to be used exclusively for dma transfers from or to a dedicated peripheral device. I doubt that the MPC8349E is the only device out there with such a DMA controller and so I was looking for a generalized kernel API for working with such DMA controllers to add MPC8349E support to it (since grepping the kernel for the register addresses didn't show up any usefull results I assume that there is no existing implementaion). Has anyone thought already about an API fur such a generic kernel framework? I was thinking about a 'copy using DMA if possible' function which is first sleeping until a DMA channel gets free, then reprograms this channel and again sleeps until the DMA transfer has been finished. The MPC8349E has scatter-gather features for source and destination and can eighter increment the source/destination addresses during the transfer or run in (8, 16 or 32 bit) fifo mode. So it is a bit more complicated than a 'memcpy on bus addresses which may sleep'. Extending the request_dma() - free_dma() API looks like the wrong way to go on this issue. It is afaics designed for a compleatly diffrent kind of DMA controller. Any thoughts? Am I completely missing something? yours, - clifford -- For extra security, this message has been encrypted with double-ROT13. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: coding style
Hey, On Fri, Jun 15, 2007 at 11:16:08AM +0200, Jan Engelhardt wrote: > >> so which one is preferred for the kernel? > > err = very_long_function_name(lots_of_arguments, > less, > less, > less, > less, > even_more_arguments, > more_of_this, > more_of_that, > more, > more, > more); > > IMO, preferred: > > err = very_long_function_name(lots_of_arguments, less, less, less, less, > even_more_arguments, more_of_this, more_of_that, more, more, more); Looking at e.g. the fuction declarations in fs/namespace.c shows very well that there seams to be no 'preferred in the kernel source' for this question. I presonally prefer indenting the continuation of a line with TWO additional tabs so it is good to distinguish from a normally indented command block. E.g.: static int function_with_long_name(int and_many_arguments, int not_fitting_in_a_signle_line_anymore) { if (and_many_arguments > not_fitting_in_a_signle_line_anymore && not_fitting_in_a_signle_line_anymore > 0) and_many_arguments += not_fitting_in_a_signle_line_anymore; else not_fitting_in_a_signle_line_anymore *= not_fitting_in_a_signle_line_anymore; return and_many_arguments ^ not_fitting_in_a_signle_line_anymore; } maybe this won't win a design contest but it is a simple and non-ambiguous coding style and afaics does not conflict with the CodingStyle document. yours, - clifford -- When your hammer is C++, everything begins to look like a thumb. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] spi_mpc83xx.c underclocking hotfix
Hi kumar (spi_mpc83xx maintainer), Hi list, the MPC83xx SPI controller clock divider can divide the system clock by not more then 1024. The spi_mpc83xx driver does not check this and silently writes garbage to the SPI controller registers when asked to run at lower frequencies. I've tried to run the SPI on a 266MHz MPC8349E with 100kHz for debugging a bus problem and suddenly was confronted with a 2nd problem to debug.. ;-) The attached patch adds an additional check which avoids writing garbage to the SPI controller registers and warn the user about it. This might help others to avoid simmilar problems. yours, - clifford -- Hardware /nm./: the part of the computer that you can kick. Index: drivers/spi/spi_mpc83xx.c === --- drivers/spi/spi_mpc83xx.c (revision 170) +++ drivers/spi/spi_mpc83xx.c (working copy) @@ -158,6 +158,12 @@ if ((mpc83xx_spi->sysclk / spi->max_speed_hz) >= 64) { u8 pm = mpc83xx_spi->sysclk / (spi->max_speed_hz * 64); + if (pm > 0x0f) { + printk(KERN_WARNING "MPC83xx SPI: SPICLK can't be less then a SYSCLK/1024!\n" + "Requested SPICLK is %d Hz. Will use %d Hz instead.\n", + spi->max_speed_hz, mpc83xx_spi->sysclk / 1024); + pm = 0x0f; + } regval |= SPMODE_PM(pm) | SPMODE_DIV16; } else { u8 pm = mpc83xx_spi->sysclk / (spi->max_speed_hz * 4);
Interrupt-Handling in User Space
Hi, afaics there is no mechanism in the official kernel tree to handle interrupts in user space yet. I'd need that for an embedded project I'm woking on atm and are so far not sure if I'm going to implement such a generic interface or just write a simple driver that does the job for my application. So I'd like to find out what the chances are that such a feature will get accepted in the official kernel.. Rationale: I'm not a fried of 'device drivers in userland'. But sometimes one wants to talk to the hardware directly without anything like a 'device driver' at all (e.g. for hardwaretesting, prototyping and debugging). In my application I have my hardware block memory mmapped in the application using /dev/mem to avoid additional copying steps between kernel and userland. All I need now is an easy way to handle interrupts in my application. My plan is to extend drivers/char/mem.c to implement an additional /dev/irq device file with the following ICOTLs: IOCTL_IRQ_REGISTER Integer argument: Interrupt Number Must be the first ioctl issued after opening the device file. It registeres an interrupt handler which is associated with the fd and gets unregisteredwhen the fd is closed. IOCTL_IRQ_AUTHORITIVE Argument is ignored Usually the interrupt handler returns IRQ_NONE. After this ioctl has been called it returns IRQ_HANDLED instead. IOCTL_IRQ_FLUSH Argument is ignored. Returns the number of interrupts since the last IOCTL_IRQ_FLUSH, IOCTL_IRQ_WAIT or read() call. IOCTL_IRQ_WAIT Argument is ignored. Returns without blocking if there have been any interrupts since the last IOCTL_IRQ_FLUSH or IOCTL_IRQ_WAIT call or blocks until the next interrupt otherwise. Returns the number of interrupts since the last IOCTL_IRQ_FLUSH, IOCTL_IRQ_WAIT or read() call. Reading from the device file has the same effect as calling IOCTL_IRQ_WAIT when reading in blocking or IOCTL_IRQ_FLUSH when reading in non-blocking mode. When the read buffer len is 4 or more and there have been any interrupts the first 4 bytes are filled with the number of interrupts since the last IOCTL_IRQ_FLUSH, IOCTL_IRQ_WAIT or read() call and 4 is returned. Using select() or epoll() will also be possible. Reading returns zero (EOF) when there haven't been any interrupts and the read is non-blocking. What do you think? Any arguments against approach? Does anyone else also feel the need for something like that in the kernel? yours, - clifford -- Once you have something that grows faster than education grows, you're always going to get a pop culture. -- Alan Kay - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: staging: pi433: Possible bug in rf69.c
Hi Dan, I checked it on my local SVN. You are right. I submitted the code with '&'. Accodring to a check-in message on my SVN, there was a bugreport end of July and most probably a patch - either from me, you, Joseph Wright, Colin King or Julia Lawall, changing '&' to '|'. I guess the patch for some reason wasn't accepted, but fortunatley I introduced the change to my SVN. So from my point of view, we need a change from '&' to '|'. I could prepare such a patch, but I am still unsure, which repo to use. Shortly befor I fell ill, you proposed me to use Gregs staging for my further development. But Colin yesterday was working on a repo, called linux-next. Can you (or anyone else) please tell me, when (or for which kind of patches) to use the Gregs staging and wehen (or for which kind of patches) to use the linux-next? Sorry for not being familiar with that stuff! Thanks a lot, Marcus Am 10.11.2017 um 20:32 schrieb Dan Carpenter: > On Fri, Nov 10, 2017 at 06:23:32PM +0100, Marcus Wolf wrote: >> Hi everybody! >> >> Just comparing the master of Gregs statging of pi433 with my local SVN >> to review all changes, that were done the last monthes. >> >> I am not sure, but maybe we imported a bug in rf69.c lines 378 and >> following: >> >> Gregs repo: >> case automatic: return WRITE_REG(REG_LNA, ( (READ_REG(REG_LNA) & >> ~MASK_LNA_GAIN) & LNA_GAIN_AUTO) ); >> my repo: >> case automatic: return WRITE_REG(REG_LNA, ( (READ_REG(REG_LNA) & >> ~MASK_LNA_GAIN) | LNA_GAIN_AUTO) ); > > I edited the lines for clarity. The difference is that your repo does > a bitwise OR "| LNA_GAIN_AUTO" and the kernel.org code does a bitwise > "& LNA_GAIN_AUTO". > > The kernel repo hasn't changed since you sent us the driver in commit > 874bcba65f9a ('staging: pi433: New driver'). I agree that & doesn't > seem to make sense and I'm disapointed that it doesn't cause a Smatch > warning. > > But LNA_GAIN_AUTO is zero so maybe | BIT(LNA_GAIN_AUTO) was intended > instead of | LNA_GAIN_AUTO. I don't know... No one on this list knows > the answer probably. :/ > > regards, > dan caprenter >
Re: staging: pi433: Possible bug in rf69.c
Hi Dan, thanks fot the link. I can't remeber, why and what I wanted to redo. Maybe there was a complaint about the format of the patch... In that patch, we also have the topic with the '>> 3', we were discussing a few days ago! I'd suggest, not to invest the history any more. I'm ok with preparing a new patch/new patches, so we can import the fixes. I also have several improvements for the rf69.c, I'd like to offer. But I still need to know when to use staging and when to use linux-next. I don't want to prepare patches for the wrong tree. Cheers, Marcus Am 11.11.2017 um 10:45 schrieb Dan Carpenter: On Sat, Nov 11, 2017 at 08:55:30AM +0100, Marcus Wolf wrote: Hi Dan, I checked it on my local SVN. You are right. I submitted the code with '&'. Accodring to a check-in message on my SVN, there was a bugreport end of July and most probably a patch - either from me, you, Joseph Wright, Colin King or Julia Lawall, changing '&' to '|'. I guess the patch for some reason wasn't accepted, but fortunatley I introduced the change to my SVN. You sent the patch, but then talked about sending a new version so that's why it wasn't merged. Greg probably would have merged it as-is if it hadn't sounded like you were going to redo it. http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2017-July/108821.html regards, dan carpenter
Re: staging: pi433: Possible bug in rf69.c
Hi Greg, that's fine. Is this the right URL: git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git Is there already an aprox. date, when 4.15rc1 will be out and backintegration will be done? Thx, Marcus Am 11.11.2017 um 13:18 schrieb Greg Kroah-Hartman: On Sat, Nov 11, 2017 at 11:42:09AM +0200, Marcus Wolf wrote: Hi Dan, thanks fot the link. I can't remeber, why and what I wanted to redo. Maybe there was a complaint about the format of the patch... In that patch, we also have the topic with the '>> 3', we were discussing a few days ago! I'd suggest, not to invest the history any more. I'm ok with preparing a new patch/new patches, so we can import the fixes. I also have several improvements for the rf69.c, I'd like to offer. But I still need to know when to use staging and when to use linux-next. I don't want to prepare patches for the wrong tree. I recommend waiting for 4.15-rc1 to come out, all of the different trees will be merged, and then you can just work off of the staging-next tree and I can take the patches. thanks, greg k-h
Re: staging: pi433: Possible bug in rf69.c
Hi Greg, ok. I'll postpone all my work until then. Give me a hook, when I can start :-) Thanks, Marcus Am 11.11.2017 um 13:49 schrieb Greg Kroah-Hartman: On Sat, Nov 11, 2017 at 01:42:27PM +0200, Marcus Wolf wrote: Hi Greg, that's fine. Is this the right URL: git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git Yes. Is there already an aprox. date, when 4.15rc1 will be out and backintegration will be done? Should be 2 weeks from now. thanks, greg k-h
Re: staging: pi433: Possible bug in rf69.c
Hi Joe, thank you for your suggestion. The enums are necessary for the (old fashioned) ioctl interface, too. So the user space uses these enums in order to configure the driver. If we want to completely remove rf69_enum.h, we need to find a solution for that, too. From the optics/readability, I like your idea with the Macro for the cases. On the other hand, I have already prepared a patch, that uses setbit, resetbit and readmodifywrite inline fuctions instead of the macros WRITE_REG, ... That was an idea of Walter Harms in order to increase readability and reduce macros, because Walter prefers inline functions to macros. As discussed with Greg, I will provide the patch, as soon as 4.15rc1 is out. Maybe we should move the discussion to then, so you can have a look to that? Cheers, Marcus Am 11.11.2017 um 18:02 schrieb Joe Perches: On Fri, 2017-11-10 at 18:23 +0100, Marcus Wolf wrote: Hi everybody! Just comparing the master of Gregs statging of pi433 with my local SVN to review all changes, that were done the last monthes. I am not sure, but maybe we imported a bug in rf69.c lines 378 and following: Gregs repo: case automatic: return WRITE_REG(REG_LNA, ( (READ_REG(REG_LNA) & ~MASK_LNA_GAIN) & LNA_GAIN_AUTO) ); case max:return WRITE_REG(REG_LNA, ( (READ_REG(REG_LNA) & ~MASK_LNA_GAIN) & LNA_GAIN_MAX) ); case maxMinus6: return WRITE_REG(REG_LNA, ( (READ_REG(REG_LNA) & ~MASK_LNA_GAIN) & LNA_GAIN_MAX_MINUS_6) ); case maxMinus12: return WRITE_REG(REG_LNA, ( (READ_REG(REG_LNA) & ~MASK_LNA_GAIN) & LNA_GAIN_MAX_MINUS_12) ); case maxMinus24: return WRITE_REG(REG_LNA, ( (READ_REG(REG_LNA) & ~MASK_LNA_GAIN) & LNA_GAIN_MAX_MINUS_24) ); case maxMinus36: return WRITE_REG(REG_LNA, ( (READ_REG(REG_LNA) & ~MASK_LNA_GAIN) & LNA_GAIN_MAX_MINUS_36) ); case maxMinus48: return WRITE_REG(REG_LNA, ( (READ_REG(REG_LNA) & ~MASK_LNA_GAIN) & LNA_GAIN_MAX_MINUS_48) ); my repo: case automatic: return WRITE_REG(REG_LNA, ( (READ_REG(REG_LNA) & ~MASK_LNA_GAIN) | LNA_GAIN_AUTO) ); case max:return WRITE_REG(REG_LNA, ( (READ_REG(REG_LNA) & ~MASK_LNA_GAIN) | LNA_GAIN_MAX) ); case maxMinus6: return WRITE_REG(REG_LNA, ( (READ_REG(REG_LNA) & ~MASK_LNA_GAIN) | LNA_GAIN_MAX_MINUS_6) ); case maxMinus12: return WRITE_REG(REG_LNA, ( (READ_REG(REG_LNA) & ~MASK_LNA_GAIN) | LNA_GAIN_MAX_MINUS_12) ); case maxMinus24: return WRITE_REG(REG_LNA, ( (READ_REG(REG_LNA) & ~MASK_LNA_GAIN) | LNA_GAIN_MAX_MINUS_24) ); case maxMinus36: return WRITE_REG(REG_LNA, ( (READ_REG(REG_LNA) & ~MASK_LNA_GAIN) | LNA_GAIN_MAX_MINUS_36) ); case maxMinus48: return WRITE_REG(REG_LNA, ( (READ_REG(REG_LNA) & ~MASK_LNA_GAIN) | LNA_GAIN_MAX_MINUS_48) ); Up to my opinion, my (old) version is better then Gregs (new) version. If you agree, I'll prepare a patch, to revert the modification. There seems to be a lot of enum/#define duplication in this driver. For instance: drivers/staging/pi433/rf69_registers.h #define LNA_GAIN_AUTO 0x00 /* default */ #define LNA_GAIN_MAX 0x01 #define LNA_GA IN_MAX_MINUS_6 0x02 #define LNA_GAIN_MAX_MINUS_12 0x03 #define LNA_GAIN_MAX_MINUS_24 0x04 #define LNA_GAIN_MAX_MINUS_36 0x05 #d efine LNA_GAIN_MAX_MINUS_480x06 vs drivers/staging/pi433/rf69_enum.h enum lnaGain { automatic, max, maxMinus6, maxMinus12, maxM inus24, maxMinus36, maxMinus48, undefined }; My suggestion would be to remove drivers/staging/pi433/rf69_enum.h where possible and convert all these switch/case entries into macros like #define GAIN_CASE(type) \ case type: return WRITE_REG(REG_LNA,\ (READ_REG(REG_LNA) & ~MASK_LNA_GAIN) | (type)); so for example this switch becomes switch (lnaGain) { GAIN_CASE(LNA_GAIN_AUTO); ... }
Re: pi433: initialization of tx config in pi433_open()
Hi Hugo, thank you for all your work on Pi433 driver. For a better understanding some info about Pi433 and the ideas behind it. Pi433 was developed by me in order to have a simple to mount CE-compliant 433MHz shield for the Raspberry Pi. I wanted to put it on sale on the one side and develop a further product for home automation, using the Pi433. After three years of development and hard trying to find sales partner, I almost gave up, since after three years still earn on those topics by far do not cover the spendings. If nothing changes, I'll have to close my company at the end of this year. At a distinct point, the work on trying to sell exceeded the technical work, so no time was left for the driver development. And now I started over in freelancing to earn money. That's why all of you nearly hear nothing from me - very sad about that! Back to technics: There was already the idea of equipping the driver with default values. I see no benefit with setting the default values from the data sheet, since they do not lead to a good, maybe even not working setup of the rf69. Idea was to choose settings, that allow to use pipeoperators on the command sehll for transmitting and receiving telegrams. There was a longer discussion about one year ago with Marcin Ciupak about that topic. Most important point from my side was, that the defaults should be in a way, that CE recommandations are meat. You can find a lot of configurations, making Pi433 work in a way, that it isn't CE-compliant any more. The 4711 sound like just beeing a place holder. Since - like I told before - I inteded to use Pi433 mainly for OOK/ASK, my idea was to use an good OOK/ASK setup as default. I could provide you with a source code, I used the Pi433 with - but I think attachments are unwanted here. As far as I can remember, there were some more details to modify on the driver, to use it with pipes on command line. But I had a rough implementation. At least send was working... To long ago to remeber :-( Since it might happen, that Pi433 will go off the market in a few monthes (tears in my eyes), I think it's a good idea to modify the driver to become a generic hope-RF driver. I already did investigations on different hope-RF chips and asked hope-RF for a little support (e.g. partnership), but they were of opinion, that they don't need such a driver. It would be easy to cover up to five/six chips of them - even their brand new LoRaWan-chip. RFM-12 will be a little bit harder, since it works a little bit different. There were already preparations to support several chips - e. g. by capsulating the HW layer. But even hard discussions one year ago didn't help - according to kernel guide lines, it wasn't allowed to keep this capsulations. So the strict capsulation was broken and some of the basics for supporting more chips are lost now. Also on this topic I had several discussions with Marcin Ciupak, how to realise the support of the next chips. Maybe you can search the mailing list? If not, I can try to find the discussions in my inbox. I would love to help you with these extending topics, but since I am almost out of money, at the moment there is no margin for complimentary developments any more :-/ But if you like, I can discuss some stuff with you from time to time. Thank you so much for your interest in Pi433 and my driver!! If you need hradware for testing, let me know. Marcus Am 22.06.2018 um 04:47 schrieb Hugo Lefeuvre: > Hi Marcus, > > I'm currently working on the following TODO: > > 966 /* setup instance data*/ > 967 instance->device = device; > 968 instance->tx_cfg.bit_rate = 4711; > 969 // TODO: fill instance->tx_cfg; > > If a user calls write() right after open()-ing an instance, the driver > might try to setup the device with uninitialized garbage. In fact > nothing really bad will happen because the rf69 interface abstraction > will filter out wrong values, but this might be a confusing behavior > for the user. > > What do you think about initializing instance->tx_cfg with the default > values of the rf69 datasheet[0] ? > > Also, is there a specific reason why you chose 4711 as a default value > for the bit rate ? I couldn't find it anywhere in the datasheet nor on > the internet. > > Thanks ! > > Regards, > Hugo > > [0] http://www.hoperf.com/upload/rf/RFM69CW-V1.1.pdf >
Re: [PATCH v3] staging: pi433: replace simple switch statements
Reviewed-by: Marcus Wolf Thank you Valentin, very nice patch :-) Valentin Vidic schrieb am 24.06.2018 18:31: Use const array to map switch cases to resulting values. Signed-off-by: Valentin Vidic --- v2: use correct type for const arrays v3: add missing static keyword for af_map drivers/staging/pi433/rf69.c | 233 ++- 1 file changed, 93 insertions(+), 140 deletions(-) diff --git a/drivers/staging/pi433/rf69.c b/drivers/staging/pi433/rf69.c index 90280e9b006d..341d6901a801 100644 --- a/drivers/staging/pi433/rf69.c +++ b/drivers/staging/pi433/rf69.c @@ -111,27 +111,22 @@ static inline int rf69_read_mod_write(struct spi_device *spi, u8 reg, int rf69_set_mode(struct spi_device *spi, enum mode mode) { - switch (mode) { - case transmit: - return rf69_read_mod_write(spi, REG_OPMODE, MASK_OPMODE_MODE, - OPMODE_MODE_TRANSMIT); - case receive: - return rf69_read_mod_write(spi, REG_OPMODE, MASK_OPMODE_MODE, - OPMODE_MODE_RECEIVE); - case synthesizer: - return rf69_read_mod_write(spi, REG_OPMODE, MASK_OPMODE_MODE, - OPMODE_MODE_SYNTHESIZER); - case standby: - return rf69_read_mod_write(spi, REG_OPMODE, MASK_OPMODE_MODE, - OPMODE_MODE_STANDBY); - case mode_sleep: - return rf69_read_mod_write(spi, REG_OPMODE, MASK_OPMODE_MODE, - OPMODE_MODE_SLEEP); - default: + static const u8 mode_map[] = { + [transmit] = OPMODE_MODE_TRANSMIT, + [receive] = OPMODE_MODE_RECEIVE, + [synthesizer] = OPMODE_MODE_SYNTHESIZER, + [standby] = OPMODE_MODE_STANDBY, + [mode_sleep] = OPMODE_MODE_SLEEP, + }; + + if (unlikely(mode >= ARRAY_SIZE(mode_map))) { dev_dbg(&spi->dev, "set: illegal input param"); return -EINVAL; } + return rf69_read_mod_write(spi, REG_OPMODE, MASK_OPMODE_MODE, + mode_map[mode]); + // we are using packet mode, so this check is not really needed // but waiting for mode ready is necessary when going from sleep because the FIFO may not be immediately available from previous mode //while (_mode == RF69_MODE_SLEEP && (READ_REG(REG_IRQFLAGS1) & RF_IRQFLAGS1_MODEREADY) == 0x00); // Wait for ModeReady @@ -145,19 +140,19 @@ int rf69_set_data_mode(struct spi_device *spi, u8 data_mode) int rf69_set_modulation(struct spi_device *spi, enum modulation modulation) { - switch (modulation) { - case OOK: - return rf69_read_mod_write(spi, REG_DATAMODUL, - MASK_DATAMODUL_MODULATION_TYPE, - DATAMODUL_MODULATION_TYPE_OOK); - case FSK: - return rf69_read_mod_write(spi, REG_DATAMODUL, - MASK_DATAMODUL_MODULATION_TYPE, - DATAMODUL_MODULATION_TYPE_FSK); - default: + static const u8 modulation_map[] = { + [OOK] = DATAMODUL_MODULATION_TYPE_OOK, + [FSK] = DATAMODUL_MODULATION_TYPE_FSK, + }; + + if (unlikely(modulation >= ARRAY_SIZE(modulation_map))) { dev_dbg(&spi->dev, "set: illegal input param"); return -EINVAL; } + + return rf69_read_mod_write(spi, REG_DATAMODUL, + MASK_DATAMODUL_MODULATION_TYPE, + modulation_map[modulation]); } static enum modulation rf69_get_modulation(struct spi_device *spi) @@ -373,43 +368,30 @@ int rf69_set_output_power_level(struct spi_device *spi, u8 power_level) int rf69_set_pa_ramp(struct spi_device *spi, enum pa_ramp pa_ramp) { - switch (pa_ramp) { - case ramp3400: - return rf69_write_reg(spi, REG_PARAMP, PARAMP_3400); - case ramp2000: - return rf69_write_reg(spi, REG_PARAMP, PARAMP_2000); - case ramp1000: - return rf69_write_reg(spi, REG_PARAMP, PARAMP_1000); - case ramp500: - return rf69_write_reg(spi, REG_PARAMP, PARAMP_500); - case ramp250: - return rf69_write_reg(spi, REG_PARAMP, PARAMP_250); - case ramp125: - return rf69_write_reg(spi, REG_PARAMP, PARAMP_125); - case ramp100: - return rf69_write_reg(spi, REG_PARAMP, PARAMP_100); - case ramp62: - return rf69_write_reg(spi, REG_PARAMP, PARAMP_62); - case ramp50: - return rf69_write_reg(spi, REG_PARAMP, PARAMP_50); - case ramp40: - return rf69_write_reg(spi, REG_PARAMP, PAR
Re: [PATCH v3] staging: pi433: replace simple switch statements
Hi Dan, I'd like to mention once more, that the idea of the abstraction was to support multiple modules of Hope-RF. If the decision of the "team" of developer of this driver is, that it should be reduced to a Pi433 or RFM69CW driver only, I fully agree, that the abstraction layer isn't necessary (tough improving readability). But if the "team" wants to extend the driver - here I explicitly want to Name Marcin Ciupak and Hogo Lefeuvre, both were discussing this with me - I highly recommend to keep the abstraction layer. And once again, I have to announce, that - if noone appears, who wants to help me with selling Pi433 - I can't effort to let Pi433 on the market longer then end of this year. From this Point of view on long term it might be senseless to prepare a Pi433-only driver. Cheers, Marcus Dan Carpenter schrieb am 25.06.2018 14:35: > I'd still prefer if we just removed this abstraction entirely and used > OPMODE_MODE_TRANSMIT everywhere instead of bringing "transmit" into it. > > I know that every author thinks their abstraction will definitely be > useful in the future, but generally kernel style is to remove > abstractions. > > But I guess this code is an improvement over the original so the patch > is fine. > > regards, > dan carpenter > >
Re: [PATCH] staging: pi433: remove unnecessary parentheses
Joe Perches schrieb am 10.01.2018 10:05: > On Wed, 2018-01-10 at 09:44 +0100, Greg Kroah-Hartman wrote: > > On Tue, Jan 09, 2018 at 11:42:16AM -0800, Joe Perches wrote: > > > if (a == b && c == d) > > > is pretty trivial. > > > > But again, don't do that. > > We disagree. Life goes on. > > cheers, Joe > > For me the line above isn't obvious and easy to read. If I would be in doubt, whether it really performs correctly, I would have to ask the c-guide, to be absolutely shure. But to be honest: If I need to find a bug arround taht lines, I wouldn't ask the c-guide, but simply add some (). Then it would be 100% clear and no one would be in doubt any more. What's the disadvantage of () to emphasise waht is going on. An other Option for me would be, to spend a command line and write that info in form of a comment. Just my opinion and the way, I would go on if I am in doubt and need to find a bug. Cheers, Marcus
Re: [PATCH] staging: pi433: add descriptions for mutex locks
Hi Valentin, > But I have an idea to remove this kfifo_reset call in pi433_write > handling partial message writes: > > kfifo_reset(&device->tx_fifo); // TODO: maybe find a solution, not to > discard already stored, valid entries > > The writer could acquire the lock and than use kfifo_avail to check if > there is enough space to write the whole message. What do you think? Unfortunaly I can't find the time to have a closer look on the code this weekend - still busy with tax stuff :-( Idea sounds great. I'll try to look at the code and think about it during Easter hollidays. Cheers, Marcus -- Smarthome-Wolf UG (haftungsbeschränkt) Helene-Lange-Weg 23 80637 München Amtsgericht München, HRB 223529 Umastzsteuer-ID: DE304719911 Geschäftsführer: Marcus Wolf
Re: [PATCH] staging: pi433: add descriptions for mutex locks
Hi Valentin, I am not at home the next two weeks. So I can do a codereading on Easter, but testing will not take place earlier then mid/end of April :-( If you are interested, I can provide you an engineering sample of Pi433. Cheers, Marcus Am 25.03.2018 um 15:09 schrieb Valentin Vidic: > On Sun, Mar 25, 2018 at 03:00:09PM +0200, Marcus Wolf wrote: >> Unfortunaly I can't find the time to have a closer look on the code this >> weekend - still busy with tax stuff :-( >> >> Idea sounds great. I'll try to look at the code and think about it >> during Easter hollidays. > > No problem, there is no hurry. But please do test the patch I sent yesterday: > > [PATCH] staging: pi433: cleanup tx_fifo locking > > As I don't have the hardware this is just compile tested now :) > -- Smarthome-Wolf UG (haftungsbeschränkt) Helene-Lange-Weg 23 80637 München Amtsgericht München, HRB 223529 Umastzsteuer-ID: DE304719911 Geschäftsführer: Marcus Wolf
Re: [PATCH] staging: pi433: add descriptions for mutex locks
Hi Valentin, The hardware of Pi433 is working with every Raspberry Pi (on zero, you need to solder the GPIO-pins) and with several other fruits like banana pi. The only thing is, that you need different versions of the driver, according to the processor, mounted on your fruit. If you'd like to test more then ther is no hang up or crash, you will need a second party. You could use a 433MHz socket for testing TX or a 433 thermometer for testing RX. An example code for communication with a socket is available on the Pi433 productpage (www.pi433.de). The sample for the thermometer isn't published yet (as always lack of time). If you want to test more deeply (using different features of the Rf69 chip or even do some long time testing, you have more options, if you use to Pis with Pi433. Just let me know, what you'd like to do and what equipment, you need. Cheers, Marcus Am 25.03.2018 um 16:24 schrieb Valentin Vidic: > On Sun, Mar 25, 2018 at 03:12:52PM +0200, Marcus Wolf wrote: >> I am not at home the next two weeks. So I can do a codereading on >> Easter, but testing will not take place earlier then mid/end of April :-( >> >> If you are interested, I can provide you an engineering sample of Pi433. > > Sure, let me know which version of Rpi it supports and if I need two > to test communication. >
Re: [PATCH] staging: pi433: add descriptions for mutex locks
Hi Valentin, like promissed, I finally had a deeper look to your proposal with kfifo_avail and your patch from 24th of March. In principle, I think it is a very nice idea, and we should try to implement it. But there is a snag: There is no guarantee, that kfifo_in will only fail, if the fifo is full. If there will be any another reason for kfifo_in to fail, with the new implementation there will be no handling for that. But I think the chance of such an situation is low to impossible and the win in simplicity of the code is really great. Regarding your patch, I did not understand, why you did not remove the mutex_lock in pi433_write. Wasn't it the goal to remove it? Below find a proposal of pi433_write function, I wrote on base of my outdated (!), private repo. It is not compiled and not tested. Since there is no more handling in case of an error (as well in the propsal as in your patch), I removed the error handling completely. I only do a test to detect proplems while writing to the tx_fifo, but (like in your patch) do nothing for solving, just printing a line. If this unexpected situation will occur (most probably never), the tx_fifo will be (and stay) out of sync until driver gets unloaded. We have to decide, whether we can stay with that. Like written above, I thinkt the benefits are great, the chance of such kind of error very low. What do you think? It could be discussed, whether it is better to return EMSGSIZE or EAGAIN on the first check. On the one hand, there is a problem with the message size, on the other hand (if message isn't generally too big) after a while, there should be some more space available in fifo, so EAGAIN may be better choice. static ssize_t pi433_write(struct file *filp, const char __user *buf, size_t count, loff_t *f_pos) { struct pi433_instance *instance; struct pi433_device *device; intrequired, copied, retval; instance = filp->private_data; device = instance->device; /* check whether there is enough space available in tx_fifo */ required = sizeof(instance->tx_cfg) + sizeof(size_t) + count; if ( num_of_bytes_to_store_in_fifo > kfifo_avail(&device->tx_fifo) ) { dev_dbg(device->dev, "Not enough space in fifo. Need %d, but only have" , required , kfifo_avail(&device->tx_fifo) ); return -EMSGSIZE; } /* write the following sequence into fifo: * - tx_cfg * - size of message * - message */ retval = kfifo_in(&device->tx_fifo, &instance->tx_cfg, sizeof(instance->tx_cfg)); retval += kfifo_in (&device->tx_fifo, &count, sizeof(size_t)); retval += kfifo_from_user(&device->tx_fifo, buf, count, &copied); if (retval != required ) { dev_dbg(device->dev, "write to fifo failed, reason unknown, non recoverable."); return -EAGAIN; } /* start transfer */ wake_up_interruptible(&device->tx_wait_queue); dev_dbg(device->dev, "write: generated new msg with %d bytes.", copied); return 0; } Hope this helps :-) Marcus Am 25.03.2018 um 15:09 schrieb Valentin Vidic: > On Sun, Mar 25, 2018 at 03:00:09PM +0200, Marcus Wolf wrote: >> Unfortunaly I can't find the time to have a closer look on the code this >> weekend - still busy with tax stuff :-( >> >> Idea sounds great. I'll try to look at the code and think about it >> during Easter hollidays. > > No problem, there is no hurry. But please do test the patch I sent yesterday: > > [PATCH] staging: pi433: cleanup tx_fifo locking > > As I don't have the hardware this is just compile tested now :) >
Re: [PATCH] staging: pi433: add descriptions for mutex locks
Hi Valentin, could you please decribe in short words, why you think, that hte lock isn't obsolete? I wasn't sure, but close to remove the lock. That's why I putted the comment. Thanks, Marcus Am 23.03.2018 um 10:47 schrieb Valentin Vidic: > Removes TODO for tx_fifo_lock as tx_fifo is modified from > both pi433_tx_thread and pi433_write. > > Fixes checkpatch warning: > > CHECK: struct mutex definition without comment > > Signed-off-by: Valentin Vidic > --- > drivers/staging/pi433/pi433_if.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/staging/pi433/pi433_if.c > b/drivers/staging/pi433/pi433_if.c > index d1e0ddbc79ce..f6f106a3ff8e 100644 > --- a/drivers/staging/pi433/pi433_if.c > +++ b/drivers/staging/pi433/pi433_if.c > @@ -87,7 +87,7 @@ struct pi433_device { > > /* tx related values */ > STRUCT_KFIFO_REC_1(MSG_FIFO_SIZE) tx_fifo; > - struct mutextx_fifo_lock; // TODO: check, whether necessary > or obsolete > + struct mutextx_fifo_lock; /* serialize access to tx_fifo */ > struct task_struct *tx_task_struct; > wait_queue_head_t tx_wait_queue; > u8 free_in_fifo; > @@ -100,7 +100,7 @@ struct pi433_device { > u32 rx_bytes_to_drop; > u32 rx_bytes_dropped; > unsigned intrx_position; > - struct mutexrx_lock; > + struct mutexrx_lock; /* serialize read requests */ > wait_queue_head_t rx_wait_queue; > > /* fifo wait queue */ > -- Smarthome-Wolf UG (haftungsbeschränkt) Helene-Lange-Weg 23 80637 München Amtsgericht München, HRB 223529 Umastzsteuer-ID: DE304719911 Geschäftsführer: Marcus Wolf
Re: [PATCH] staging: pi433: add descriptions for mutex locks
Hi Valentin, I had no time to work on the code for monthes now and the memorisation of my thoughts when I was programming that (approx. one year ago) is quite pale. As far as I remember, I read something, that the fifo has an integrated protection, so no external protection is needed. But absolutely unsure. If I will find some time within the next days, I'll have a look at the code and try to recall. But the most important thing already took place: We started thinking about it :-) Thanks, Marcus Am 23.03.2018 um 12:42 schrieb Valentin Vidic: > On Fri, Mar 23, 2018 at 11:22:39AM +0100, Marcus Wolf wrote: >> could you please decribe in short words, why you think, that hte lock >> isn't obsolete? >> >> I wasn't sure, but close to remove the lock. That's why I putted the >> comment. > > Sure, if pi433_tx_thread runs on one CPU it might be possible > to call pi433_write concurrently on another CPU and they would > both modify tx_fifo. But maybe there is some other protection > in place that would prevent this? > -- Smarthome-Wolf UG (haftungsbeschränkt) Helene-Lange-Weg 23 80637 München Amtsgericht München, HRB 223529 Umastzsteuer-ID: DE304719911 Geschäftsführer: Marcus Wolf
Re: rf69_set_deviation in rf69.c (pi433 driver)
Hi Hugo, sorry for the late response and thank you for all that deep investigation in the pi433 driver! > According to the datasheet[0], the deviation should always be smaller > than 300kHz, and the following equation should be respected: > > (1) FDA + BRF/2 =< 500 kHz > > Why did you choose 500kHz as max for FDA, instead of 300kHz ? It looks like > a bug to me. My focus was always on OOK and ASK. PSK was only used for a few measurements in the laboratory, I engaged to check CE compliance. Most probably 500kHz was a value, that's common for PSK and I didn't pay any attention to the datasheet. So I think, you are right: This is a bug and could be revised. Never the less: While using it in the lab, the transmission was fine and the signals over air have been clean and fitted to the recommondations of the CE. > > Concerning the TODO, I can see two solutions currently: > > 1. Introduce a new rf69_get_bit_rate function which reads REG_BITRATE_MSB >and REG_BITRATE_LSB and returns reconstructed BRF. >We could use this function in rf69_set_deviation to retrieve the BRF. > > + clean > + intuitive > - heavy / slow Why not: I like clean and intuitive implementations. Since it's used during configuration, we shouldn't be that squeezed in time, that we need to hurry. > > 2. Store BRF somewhere in rf69.c, initialize it with the default value >(4.8 kb/s) and update it when rf69_set_bit_rate is called. > > + easy > - dirty, doesn't fit well with the design of rf69.c (do not store > anything, simply expose API) Up to my experience, storing reg values is always a bit problematic, since the value may be outdated. And if you update the value every time you want to use it to be sure, it's correct, there is no win in having the copy of the reg value. So this wouldn't be my favourite. > > I'd really prefer going for the first one, but I wanted to have your opinion > on this. Agree. > Thanks for your work ! Your welcome :-) Marcus
[PATCH 1/2] staging: fbtft: Fix line over 80 characters
Fix checkpatch line over 80 characters where it seemed appropriate Signed-off-by: Matthias Wolf Signed-off-by: Felix Siegel Signed-off-by: Tim Cofala --- drivers/staging/fbtft/fb_s6d02a1.c | 30 -- 1 file changed, 20 insertions(+), 10 deletions(-) diff --git a/drivers/staging/fbtft/fb_s6d02a1.c b/drivers/staging/fbtft/fb_s6d02a1.c index 7529576..55513a3 100644 --- a/drivers/staging/fbtft/fb_s6d02a1.c +++ b/drivers/staging/fbtft/fb_s6d02a1.c @@ -21,20 +21,27 @@ static const s16 default_init_sequence[] = { -1, 0xfc, 0x5a, 0x5a, - -1, 0xfa, 0x02, 0x1f, 0x00, 0x10, 0x22, 0x30, 0x38, 0x3A, 0x3A, 0x3A, 0x3A, 0x3A, 0x3d, 0x02, 0x01, + -1, 0xfa, 0x02, 0x1f, 0x00, 0x10, 0x22, 0x30, 0x38, + 0x3A, 0x3A, 0x3A, 0x3A, 0x3A, 0x3d, 0x02, 0x01, - -1, 0xfb, 0x21, 0x00, 0x02, 0x04, 0x07, 0x0a, 0x0b, 0x0c, 0x0c, 0x16, 0x1e, 0x30, 0x3f, 0x01, 0x02, + -1, 0xfb, 0x21, 0x00, 0x02, 0x04, 0x07, 0x0a, 0x0b, + 0x0c, 0x0c, 0x16, 0x1e, 0x30, 0x3f, 0x01, 0x02, /* power setting sequence */ - -1, 0xfd, 0x00, 0x00, 0x00, 0x17, 0x10, 0x00, 0x01, 0x01, 0x00, 0x1f, 0x1f, + -1, 0xfd, 0x00, 0x00, 0x00, 0x17, 0x10, 0x00, 0x01, + 0x01, 0x00, 0x1f, 0x1f, - -1, 0xf4, 0x00, 0x00, 0x00, 0x00, 0x00, 0x3f, 0x3f, 0x07, 0x00, 0x3C, 0x36, 0x00, 0x3C, 0x36, 0x00, + -1, 0xf4, 0x00, 0x00, 0x00, 0x00, 0x00, 0x3f, 0x3f, + 0x07, 0x00, 0x3C, 0x36, 0x00, 0x3C, 0x36, 0x00, - -1, 0xf5, 0x00, 0x70, 0x66, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x6d, 0x66, 0x06, + -1, 0xf5, 0x00, 0x70, 0x66, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x6d, 0x66, 0x06, - -1, 0xf6, 0x02, 0x00, 0x3f, 0x00, 0x00, 0x00, 0x02, 0x00, 0x06, 0x01, 0x00, + -1, 0xf6, 0x02, 0x00, 0x3f, 0x00, 0x00, 0x00, 0x02, + 0x00, 0x06, 0x01, 0x00, - -1, 0xf2, 0x00, 0x01, 0x03, 0x08, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x04, 0x08, 0x08, + -1, 0xf2, 0x00, 0x01, 0x03, 0x08, 0x08, 0x04, 0x00, + 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x04, 0x08, 0x08, -1, 0xf8, 0x11, @@ -54,7 +61,8 @@ static const s16 default_init_sequence[] = { -1, 0xf3, 0x00, 0x0f, -2, 50, - -1, 0xf4, 0x00, 0x04, 0x00, 0x00, 0x00, 0x3f, 0x3f, 0x07, 0x00, 0x3C, 0x36, 0x00, 0x3C, 0x36, 0x00, + -1, 0xf4, 0x00, 0x04, 0x00, 0x00, 0x00, 0x3f, 0x3f, + 0x07, 0x00, 0x3C, 0x36, 0x00, 0x3C, 0x36, 0x00, -2, 50, -1, 0xf3, 0x00, 0x1f, @@ -65,9 +73,11 @@ static const s16 default_init_sequence[] = { -1, 0xf3, 0x00, 0xff, -2, 50, - -1, 0xfd, 0x00, 0x00, 0x00, 0x17, 0x10, 0x00, 0x00, 0x01, 0x00, 0x16, 0x16, + -1, 0xfd, 0x00, 0x00, 0x00, 0x17, 0x10, 0x00, 0x00, + 0x01, 0x00, 0x16, 0x16, - -1, 0xf4, 0x00, 0x09, 0x00, 0x00, 0x00, 0x3f, 0x3f, 0x07, 0x00, 0x3C, 0x36, 0x00, 0x3C, 0x36, 0x00, + -1, 0xf4, 0x00, 0x09, 0x00, 0x00, 0x00, 0x3f, 0x3f, + 0x07, 0x00, 0x3C, 0x36, 0x00, 0x3C, 0x36, 0x00, /* initializing sequence */ -- 2.7.4
[PATCH 2/2] staging: fbtft: Fix line continuation
Fix checkpatch warning: avoid unnecessary line continuation to allow grepping of whole error message. Signed-off-by: Matthias Wolf Signed-off-by: Felix Siegel Signed-off-by: Tim Cofala --- drivers/staging/fbtft/fb_ssd1351.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/staging/fbtft/fb_ssd1351.c b/drivers/staging/fbtft/fb_ssd1351.c index b8ef75f..1b92691 100644 --- a/drivers/staging/fbtft/fb_ssd1351.c +++ b/drivers/staging/fbtft/fb_ssd1351.c @@ -126,16 +126,14 @@ static int set_gamma(struct fbtft_par *par, u32 *curves) for (i = 0; i < 63; i++) { if (i > 0 && curves[i] < 2) { dev_err(par->info->device, - "Illegal value in Grayscale Lookup Table at index %d. " \ - "Must be greater than 1\n", i); + "Illegal value in Grayscale Lookup Table at index %d. Must be greater than 1\n", i); return -EINVAL; } acc += curves[i]; tmp[i] = acc; if (acc > 180) { dev_err(par->info->device, - "Illegal value(s) in Grayscale Lookup Table. " \ - "At index=%d, the accumulated value has exceeded 180\n", i); + "Illegal value(s) in Grayscale Lookup Table. At index=%d, the accumulated value has exceeded 180\n", i); return -EINVAL; } } -- 2.7.4
Der Ausweg für Ihr Fin anz pro ble m ist ein Ausl andsk redi t
Hallo, sind Sie ansprechbar auf eine wirklich gute Nachricht? will Ihre Bank Ihnen auch keinen Kr ed it mehr geben? Lassen Sie sich das nicht bieten, denn es gibt auch kulante Banken. Auch bei laufenden Verpflichtungen und bereits von Ihrer eigenen Hausbank abgelehntem Antrag können wir Ihnen ein Darlehen vermitteln zu günstigen Konditionen (z.B. ab 3,2% für Konsumentenkredite, ab 1,8% Zinsen p.a. für Grundbesitz und ab 3,1% für Firmen). Durch die extrem langen Laufzeiten von 10 bzw. 20 Jahren belasten Sie die mtl. Zahlungen nicht großartig. Ihren Kredit bekommen Sie jetzt aus Lateinamerika ohne viel Formularkram, ohne Bankauskunft und schnell. Unsere Kooperationspartner sind mehrere Banken, und wir erledigen für Sie alle Formalitäten. Also ideal für Jungunternehmer, Angestellte und Rentner. Auf diese Weise lösen Sie Ihr Geldproblem. Alles weitere auf unserer Seite: RapidCredito und fügen Sie an co m Mit freundlichem Gruß Anton Fischer Kundenberater -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] staging: pi433: add descriptions for mutex locks
Am 12.04.2018 um 19:15 schrieb Valentin Vidic: > On Sun, Apr 08, 2018 at 04:15:30PM +0200, Marcus Wolf wrote: >> The hardware of Pi433 is working with every Raspberry Pi (on zero, you >> need to solder the GPIO-pins) and with several other fruits like banana >> pi. The only thing is, that you need different versions of the driver, >> according to the processor, mounted on your fruit. >> >> If you'd like to test more then ther is no hang up or crash, you will >> need a second party. You could use a 433MHz socket for testing TX or a >> 433 thermometer for testing RX. An example code for communication with a >> socket is available on the Pi433 productpage (www.pi433.de). The sample >> for the thermometer isn't published yet (as always lack of time). >> >> If you want to test more deeply (using different features of the Rf69 >> chip or even do some long time testing, you have more options, if you >> use to Pis with Pi433. >> >> Just let me know, what you'd like to do and what equipment, you need. > > At the moment I have Rpi2 but not any other 433MHz equipment, so I > could only do some basic testing unfortunately. In case I get the > new Rpi3 some time soon I would be able to do something better. > Hi Valentin, let me know, what you like to have. For sure with just one station and no other 433MHz equipment, options for testing are quite limited. Marcus -- Smarthome-Wolf UG (haftungsbeschränkt) Helene-Lange-Weg 23 80637 München Amtsgericht München, HRB 223529 Umastzsteuer-ID: DE304719911 Geschäftsführer: Marcus Wolf
Re: [PATCH] staging: pi433: add descriptions for mutex locks
Am 12.04.2018 um 20:46 schrieb Valentin Vidic: > On Sun, Apr 08, 2018 at 05:22:46PM +0200, Marcus Wolf wrote: >> Regarding your patch, I did not understand, why you did not remove >> the mutex_lock in pi433_write. Wasn't it the goal to remove it? > > Is it possible for more than one userspace program to open the pi433 > device and send messages? In that case more than one pi433_write > could be running and it needs to hold a mutex_lock before calling > kfifo_in. You are right. The write function is open for multiple userspace programs. So mutex_lock schould remain. >> Below find a proposal of pi433_write function, I wrote on base >> of my outdated (!), private repo. It is not compiled and not tested. >> Since there is no more handling in case of an error (as well in the >> propsal as in your patch), I removed the error handling completely. >> I only do a test to detect proplems while writing to the tx_fifo, >> but (like in your patch) do nothing for solving, just printing a line. >> If this unexpected situation will occur (most probably never), >> the tx_fifo will be (and stay) out of sync until driver gets unloaded. >> We have to decide, whether we can stay with that. Like written above, >> I thinkt the benefits are great, the chance of such kind of error >> very low. >> What do you think? > > Yes, if there is only one writer and it checks the available size, > kfifo_in should not fail. The only problem might be copy_from_user > but perhaps that is also quite unlikely. A workaround for that could > be to copy the data into a temporary kernel buffer first and than > start kfifo writes using only kernel memory. For my feeling, that's a safe solution but most probably oversized. >> It could be discussed, whether it is better to return EMSGSIZE or >> EAGAIN on the first check. On the one hand, there is a problem with >> the message size, on the other hand (if message isn't generally too >> big) after a while, there should be some more space available in >> fifo, so EAGAIN may be better choice. > > EAGAIN does seem better unless the message is too big to ever fit > in the kfifo. > >> if (retval != required ) { >> dev_dbg(device->dev, "write to fifo failed, reason unknown, non >> recoverable."); >> return -EAGAIN; >> } > > Maybe this should be dev_warn or even dev_crit if the driver is not > usable anymore when this happens? The error message should than also > be adjusted to EBADF or something similar. >From my point of veiew, the warning is (in combination of the size-check) the by far better solution then the fifo reset. So all in all, I think, we should go with your proposal. Unfortunaly still no time to setup my test environment with a current version of the driver in order to give it a try :-( Sorry, Marcus
Re: [PATCH] staging: pi433: add descriptions for mutex locks
Hi! So if you like, give me your address, and I'll send you two development samples of Pi433. Cheers, Marcus Am 19.04.2018 um 11:47 schrieb Valentin Vidic: > On Thu, Apr 19, 2018 at 11:25:19AM +0200, Marcus Wolf wrote: >> let me know, what you like to have. For sure with just one station and >> no other 433MHz equipment, options for testing are quite limited. > > I can get Rpi3 and with two shields test 433MHz communication between > Rpi2 and Rpi3. > -- Smarthome-Wolf UG (haftungsbeschränkt) Helene-Lange-Weg 23 80637 München Amtsgericht München, HRB 223529 Umastzsteuer-ID: DE304719911 Geschäftsführer: Marcus Wolf
Re: [PATCH v2] staging: pi433: cleanup tx_fifo locking
Reviewed-by: Marcus Wolf Am 19.04.2018 um 12:25 schrieb Valentin Vidic: > pi433_write requires locking due to multiple writers. After acquiring > the lock check if enough free space is available in the kfifo to write > the whole message. This check should prevent partial writes to tx_fifo > so kfifo_reset is not needed anymore. > > pi433_tx_thread is the only reader so it does not require locking > after kfifo_reset is removed. > > Signed-off-by: Valentin Vidic > --- > v2: print a warning if partial fifo write happens > > drivers/staging/pi433/pi433_if.c | 23 ++- > 1 file changed, 14 insertions(+), 9 deletions(-) > > diff --git a/drivers/staging/pi433/pi433_if.c > b/drivers/staging/pi433/pi433_if.c > index d1e0ddbc79ce..2a05eff88469 100644 > --- a/drivers/staging/pi433/pi433_if.c > +++ b/drivers/staging/pi433/pi433_if.c > @@ -87,7 +87,7 @@ struct pi433_device { > > /* tx related values */ > STRUCT_KFIFO_REC_1(MSG_FIFO_SIZE) tx_fifo; > - struct mutextx_fifo_lock; // TODO: check, whether necessary > or obsolete > + struct mutextx_fifo_lock; /* serialize userspace writers */ > struct task_struct *tx_task_struct; > wait_queue_head_t tx_wait_queue; > u8 free_in_fifo; > @@ -589,19 +589,15 @@ pi433_tx_thread(void *data) >* - size of message >* - message >*/ > - mutex_lock(&device->tx_fifo_lock); > - > retval = kfifo_out(&device->tx_fifo, &tx_cfg, sizeof(tx_cfg)); > if (retval != sizeof(tx_cfg)) { > dev_dbg(device->dev, "reading tx_cfg from fifo failed: > got %d byte(s), expected %d", retval, (unsigned int)sizeof(tx_cfg)); > - mutex_unlock(&device->tx_fifo_lock); > continue; > } > > retval = kfifo_out(&device->tx_fifo, &size, sizeof(size_t)); > if (retval != sizeof(size_t)) { > dev_dbg(device->dev, "reading msg size from fifo > failed: got %d, expected %d", retval, (unsigned int)sizeof(size_t)); > - mutex_unlock(&device->tx_fifo_lock); > continue; > } > > @@ -634,7 +630,6 @@ pi433_tx_thread(void *data) > sizeof(device->buffer) - position); > dev_dbg(device->dev, > "read %d message byte(s) from fifo queue.", retval); > - mutex_unlock(&device->tx_fifo_lock); > > /* if rx is active, we need to interrupt the waiting for >* incoming telegrams, to be able to send something. > @@ -818,7 +813,7 @@ pi433_write(struct file *filp, const char __user *buf, > struct pi433_instance *instance; > struct pi433_device *device; > int retval; > - unsigned intcopied; > + unsigned intrequired, available, copied; > > instance = filp->private_data; > device = instance->device; > @@ -833,6 +828,16 @@ pi433_write(struct file *filp, const char __user *buf, >* - message >*/ > mutex_lock(&device->tx_fifo_lock); > + > + required = sizeof(instance->tx_cfg) + sizeof(size_t) + count; > + available = kfifo_avail(&device->tx_fifo); > + if (required > available) { > + dev_dbg(device->dev, "write to fifo failed: %d bytes required > but %d available", > + required, available); > + mutex_unlock(&device->tx_fifo_lock); > + return -EAGAIN; > + } > + > retval = kfifo_in(&device->tx_fifo, &instance->tx_cfg, > sizeof(instance->tx_cfg)); > if (retval != sizeof(instance->tx_cfg)) > @@ -855,8 +860,8 @@ pi433_write(struct file *filp, const char __user *buf, > return copied; > > abort: > - dev_dbg(device->dev, "write to fifo failed: 0x%x", retval); > - kfifo_reset(&device->tx_fifo); // TODO: maybe find a solution, not to > discard already stored, valid entries > + dev_warn(device->dev, > + "write to fifo failed, non recoverable: 0x%x", retval); > mutex_unlock(&device->tx_fifo_lock); > return -EAGAIN; > } >
Re: Submit of a driver for Pi433 - a radio module for Raspberry Pi
Hi! It's me again. Seems like I also need help on sending the email :-/ I checked the whitespace/line wrap problem, but couldn't find any suspicious lines. What I did: * Looked into my outbox - the copy of my mail to you seems to be okay... * I sent the patch once again (just to me) - result: Seems to be fine, too. Maybe you can bounce back my mail? Maybe I'll get an idea what went wrong, if I see the smashed code... Cheers, Marcus > Greg KH hat am 15. Juli 2017 um 13:24 > geschrieben: > > > On Sat, Jul 15, 2017 at 01:15:43PM +0200, Marcus Wolf wrote: > > Hi Greg, > > > > thanks for your reply :-) > > > > Today I moved the documentation and header files to drivers/staging/pi433 > > and > > fromated it as a single patch. > > > > Cheers, > > > > Marcus > > I still need it in a format I can apply it in (i.e. proper subject, > changelog text, signed-off-by, etc.) Can you do that? > > Also, for staging drivers, I need a TODO file, much like the existing > drivers/staging/*/TODO files saying what needs to be done with the code > in order to get it out of staging. > > And finally, your patch seemed to have the whitespace corrupted and was > line-wrapped, making it impossible to apply even if I could have :( > > thanks, > > greg k-h >
Re: Submit of a driver for Pi433 - a radio module for Raspberry Pi
Hi! Ok. Thanks for the info. Didn't observe that. I'll give the old patch another try. I will send a new patch as soon, as I know. how to produce the poper format (with subjects, signing and so on). Cheers, Marcus diff --git a/drivers/staging/Kconfig b/drivers/staging/Kconfig index 268d4e6..fdf060c 100644 --- a/drivers/staging/Kconfig +++ b/drivers/staging/Kconfig @@ -110,4 +110,6 @@ source "drivers/staging/ccree/Kconfig" source "drivers/staging/typec/Kconfig" +source "drivers/staging/pi433/Kconfig" + endif # STAGING diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile index b93e6f5..998f644 100644 --- a/drivers/staging/Makefile +++ b/drivers/staging/Makefile @@ -44,3 +44,4 @@ obj-$(CONFIG_KS7010)+= ks7010/ obj-$(CONFIG_GREYBUS)+= greybus/ obj-$(CONFIG_BCM2835_VCHIQ)+= vc04_services/ obj-$(CONFIG_CRYPTO_DEV_CCREE)+= ccree/ +obj-$(CONFIG_PI433)+= pi433/ diff --git a/drivers/staging/pi433/Documentation/devicetree/pi433-overlay.dts b/drivers/staging/pi433/Documentation/devicetree/pi433-overlay.dts new file mode 100644 index 000..004b502 --- /dev/null +++ b/drivers/staging/pi433/Documentation/devicetree/pi433-overlay.dts @@ -0,0 +1,53 @@ +// Definitions for Pi433 +/dts-v1/; +/plugin/; + +/ { +compatible = "bcm,bcm2835", "bcm,bcm2708", "bcm,bcm2709"; + +fragment@0 { +target = <&spi0>; +__overlay__ { +status = "okay"; + +spidev@0{ +status = "disabled"; +}; + +spidev@1{ +status = "disabled"; +}; +}; +}; + +fragment@1 { +target = <&gpio>; +__overlay__ { +pi433_pins: pi433_pins { +brcm,pins = <7 25 24>; +brcm,function = <0 0 0>; // in in in +}; +}; +}; + +fragment@2 { +target = <&spi0>; +__overlay__ { +#address-cells = <1>; +#size-cells = <0>; +status = "okay"; + +pi433: pi433@0 { +compatible = "Smarthome-Wolf,pi433"; +reg = <0>; +spi-max-frequency = <1000>; +status = "okay"; + +pinctrl-0 = <&pi433_pins>; +DIO0-gpio = <&gpio 24 0>; +DIO1-gpio = <&gpio 25 0>; +DIO2-gpio = <&gpio 7 0>; +}; +}; +}; +}; diff --git a/drivers/staging/pi433/Documentation/devicetree/pi433.txt b/drivers/staging/pi433/Documentation/devicetree/pi433.txt new file mode 100644 index 000..14197c6 --- /dev/null +++ b/drivers/staging/pi433/Documentation/devicetree/pi433.txt @@ -0,0 +1,62 @@ +* Smarthome-Wolf Pi433 - a 433MHz radio module/shield for Raspberry Pi (see www.pi433.de) + +Required properties: +- compatible: must be "Smarthome-Wolf,pi433" +- reg: chip select of SPI Interface +- DIOx-gpio must be dedicated to the GPIO, connected with DIOx of the RFM69 module + + +Example: + +With the following lines in gpio-section, the gpio pins, connected with pi433 are +reserved/declared. + +&gpio{ +[...] + +pi433_pins: pi433_pins { +brcm,pins = <7 25 24>; +brcm,function = <0 0 0>; // in in in +}; + +[...] +} + +With the following lines in spi section, the device pi433 is declared. +It consists of the three gpio pins and an spi interface (here chip select 0) + +&spi0{ +[...] + +pi433: pi433@0 { +compatible = "Smarthome-Wolf,pi433"; +reg = <0>; /* CE 0 */ +#address-cells = <1>; +#size-cells = <0>; +spi-max-frequency = <1000>; + +pinctrl-0 = <&pi433_pins>; +DIO0-gpio = <&gpio 24 0>; +DIO1-gpio = <&gpio 25 0>; +DIO2-gpio = <&gpio 7 0>; +}; +} + + + +For Raspbian users only +=== +Since Raspbian supports device tree overlays, you may use and overlay, instead +of editing your boards device tree. +For using the overlay, you need to compile the file pi433-overlay.dts you can +find aside to this documentation. +The file needs to be compiled - either manually or by integration in your kernel +source tree. For a manual compile, you may use a command line like the following: +'linux/scripts/dtc/dtc -@ -I dts -O dtb -o pi433.dtbo pi433-overlay.dts' + +For compiling inside of the kernel tree, you need to copy pi433-overlay.dts to +arch/arm/boot/dts/overlays and you need to add the file to the list of files +in the Makefile over there. Execute '
Re: Submit of a driver for Pi433 - a radio module for Raspberry Pi
Am Sa, 15.07.2017, 15:47 schrieb Greg KH: > On Sat, Jul 15, 2017 at 03:40:30PM +0200, Marcus Wolf wrote: >> Hi Greg, >> Hi Greg, now I added a TODO file and did a manual patchwork with lines of the old patch (git format-patch master --stdout -p > pi433_patch) and the newer patch (git diff master > pi433_patch). Stil don't know how to retreive multiple commits in one single patch directly from git :-/ I did my best to meet a lot of rules of the link, you send me.If there is stil something essential wrong or missing, please excuse and let me know. Especially I was in doubt about the verbosity of the change log. If you'd like to see the list of features of the driver over there (please see my first mail), let me know and I'll add. Sorry for posting in HTML once again on last mail. Now I changed the frontend of my mailtool. Hope this works better. If not, excuse and let me know. I Think about moving to gits mail function for my next patches. Cheers, Marcus From: Marcus Wolf Date: Tue,15 Jul 2017 17:52:06 +0100 Subject: [PATCH 1/1] drivers/staging/pi433: New driver Added a driver for the pi433 radio module (see https://www.pi433.de/en.html for details). Signed-off-by: Marcus Wolf Tested-by: Marcus Wolf on Raspbian, running Kernel v4.12 --- diff --git a/drivers/staging/Kconfig b/drivers/staging/Kconfig index 268d4e6..fdf060c 100644 --- a/drivers/staging/Kconfig +++ b/drivers/staging/Kconfig @@ -110,4 +110,6 @@ source "drivers/staging/ccree/Kconfig" source "drivers/staging/typec/Kconfig" +source "drivers/staging/pi433/Kconfig" + endif # STAGING diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile index b93e6f5..998f644 100644 --- a/drivers/staging/Makefile +++ b/drivers/staging/Makefile @@ -44,3 +44,4 @@ obj-$(CONFIG_KS7010) += ks7010/ obj-$(CONFIG_GREYBUS) += greybus/ obj-$(CONFIG_BCM2835_VCHIQ)+= vc04_services/ obj-$(CONFIG_CRYPTO_DEV_CCREE) += ccree/ +obj-$(CONFIG_PI433)+= pi433/ diff --git a/drivers/staging/pi433/Documentation/devicetree/pi433-overlay.dts b/drivers/staging/pi433/Documentation/devicetree/pi433-overlay.dts new file mode 100644 index 000..004b502 --- /dev/null +++ b/drivers/staging/pi433/Documentation/devicetree/pi433-overlay.dts @@ -0,0 +1,53 @@ +// Definitions for Pi433 +/dts-v1/; +/plugin/; + +/ { +compatible = "bcm,bcm2835", "bcm,bcm2708", "bcm,bcm2709"; + +fragment@0 { +target = <&spi0>; +__overlay__ { +status = "okay"; + +spidev@0{ +status = "disabled"; +}; + +spidev@1{ +status = "disabled"; +}; +}; +}; + + fragment@1 { + target = <&gpio>; + __overlay__ { + pi433_pins: pi433_pins { + brcm,pins = <7 25 24>; + brcm,function = <0 0 0>; // in in in + }; + }; + }; + + fragment@2 { + target = <&spi0>; + __overlay__ { + #address-cells = <1>; + #size-cells = <0>; + status = "okay"; + + pi433: pi433@0 { + compatible = "Smarthome-Wolf,pi433"; + reg = <0>; + spi-max-frequency = <1000>; + status = "okay"; + + pinctrl-0 = <&pi433_pins>; + DIO0-gpio = <&gpio 24 0>; + DIO1-gpio = <&gpio 25 0>; + DIO2-gpio = <&gpio 7 0>; + }; + }; + }; +}; diff --git a/drivers/staging/pi433/Documentation/devicetree/pi433.txt b/drivers/staging/pi433/Documentation/devicetree/pi433.txt new file mode 100644 index 000..14197c6 --- /dev/null +++ b/drivers/staging/pi433/Documentation/devicetree/pi433.txt @@ -0,0 +1,62 @@ +* Smarthome-Wolf Pi433 - a 433MHz radio module/shield for Raspberry Pi (see www.pi433.de) + +Required properties: +- compatible: must be "Smarthome-Wolf,pi433" +- reg: chip select of SPI Interface +- DIOx-gpio must be dedicated to the GPIO, connected with DIOx of the RFM69 module + + +Example: + +With the following lines in gpio-section, the gpio pins, connected with pi433 are +reserved/declared. + +&gpio{ + [...] + + pi433_pins: pi433_pins { + brcm,pins = <7 25 24>; + brcm,function = <0 0 0>; // in i
Re: [PATCH 1/1] drivers/staging/pi433: New driver (fwd)
Hi Julia, thank you very much for your investigation! I totally agree - on line 894 device->dev must not be used. Any suggestions? Is there a kind of generic value, I can pass as first argument of dev_dbg() in such cases? Or switch to printk()? Regarding second bug: line 688 modified as follows: if (bytes_received > 0) and line 654 modified as follows: int bytes_received; The fix will already be included in my next try, to send a well formatted patch to Greg. the position of the brackets is wrong throughout the complete source. The source was developed against an other style guide. It's one of the todos to adjust the formatting that it fully complies with the kernel coding style guide. Cheers, Marcus > Julia Lawall hat am 15. Juli 2017 um 22:50 geschrieben: > > > Please check on lines 894 and 688 (not shown) for the issues mentioned > below. > > Note that the ifs and {s in the following code snippet don't always follow > the kernel coding style, eg on line 901. > > julia > > -- Forwarded message -- > Date: Sun, 16 Jul 2017 04:35:49 +0800 > From: kbuild test robot > To: kbu...@01.org > Cc: Julia Lawall > Subject: Re: [PATCH 1/1] drivers/staging/pi433: New driver > > Hi Wolf, > > [auto build test WARNING on staging/staging-testing] > [also build test WARNING on v4.12 next-20170714] > [if your patch is applied to the wrong git tree, please drop us a note to help > improve the system] > > url: > https://github.com/0day-ci/linux/commits/Wolf-Entwicklungen/drivers-staging-pi433-New-driver/20170716-021625 > :: branch date: 2 hours ago > :: commit date: 2 hours ago > > >> drivers/staging/pi433/pi433_if.c:894:18-21: ERROR: device is NULL but > >> dereferenced. > -- > >> drivers/staging/pi433/pi433_if.c:688:5-19: WARNING: Unsigned expression > >> compared with zero: bytes_received >= 0 > > git remote add linux-review https://github.com/0day-ci/linux > git remote update linux-review > git checkout 6b5d85fc273ec7c19addf7770155414da647de7e > vim +894 drivers/staging/pi433/pi433_if.c > > 6b5d85fc Wolf Entwicklungen 2017-07-15 883 > 6b5d85fc Wolf Entwicklungen 2017-07-15 884 static int pi433_open(struct inode > *inode, struct file *filp) > 6b5d85fc Wolf Entwicklungen 2017-07-15 885 { > 6b5d85fc Wolf Entwicklungen 2017-07-15 886 struct pi433_device *device; > 6b5d85fc Wolf Entwicklungen 2017-07-15 887 struct pi433_instance *instance; > 6b5d85fc Wolf Entwicklungen 2017-07-15 888 > 6b5d85fc Wolf Entwicklungen 2017-07-15 889 mutex_lock(&minor_lock); > 6b5d85fc Wolf Entwicklungen 2017-07-15 890 device = idr_find(&pi433_idr, > iminor(inode)); > 6b5d85fc Wolf Entwicklungen 2017-07-15 891 > 6b5d85fc Wolf Entwicklungen 2017-07-15 892 mutex_unlock(&minor_lock); > 6b5d85fc Wolf Entwicklungen 2017-07-15 893 if (!device) { > 6b5d85fc Wolf Entwicklungen 2017-07-15 @894 dev_dbg(device->dev, "device: > minor %d unknown.\n", iminor(inode)); > 6b5d85fc Wolf Entwicklungen 2017-07-15 895 return -ENODEV; > 6b5d85fc Wolf Entwicklungen 2017-07-15 896 } > 6b5d85fc Wolf Entwicklungen 2017-07-15 897 > 6b5d85fc Wolf Entwicklungen 2017-07-15 898 if (!device->rx_buffer) { > 6b5d85fc Wolf Entwicklungen 2017-07-15 899 device->rx_buffer = > kmalloc(MAX_MSG_SIZE, GFP_KERNEL); > 6b5d85fc Wolf Entwicklungen 2017-07-15 900 if (!device->rx_buffer) > 6b5d85fc Wolf Entwicklungen 2017-07-15 901 { > 6b5d85fc Wolf Entwicklungen 2017-07-15 902 dev_dbg(device->dev, > "open/ENOMEM\n"); > 6b5d85fc Wolf Entwicklungen 2017-07-15 903 return -ENOMEM; > 6b5d85fc Wolf Entwicklungen 2017-07-15 904 } > 6b5d85fc Wolf Entwicklungen 2017-07-15 905 } > 6b5d85fc Wolf Entwicklungen 2017-07-15 906 > 6b5d85fc Wolf Entwicklungen 2017-07-15 907 device->users++; > 6b5d85fc Wolf Entwicklungen 2017-07-15 908 instance = > kzalloc(sizeof(*instance), GFP_KERNEL); > 6b5d85fc Wolf Entwicklungen 2017-07-15 909 if (!instance) > 6b5d85fc Wolf Entwicklungen 2017-07-15 910 { > 6b5d85fc Wolf Entwicklungen 2017-07-15 911 kfree(device->rx_buffer); > 6b5d85fc Wolf Entwicklungen 2017-07-15 912 device->rx_buffer = NULL; > 6b5d85fc Wolf Entwicklungen 2017-07-15 913 return -ENOMEM; > 6b5d85fc Wolf Entwicklungen 2017-07-15 914 } > 6b5d85fc Wolf Entwicklungen 2017-07-15 915 > 6b5d85fc Wolf Entwicklungen 2017-07-15 916 /* setup instance data*/ > 6b5d85fc Wolf Entwicklungen 2017-07-15 917 instance->device = device; > 6b5d85fc Wolf Entwicklungen 2017-07-15 918 instance->tx_cfg.bit_rate = 4711; > 6b5d85fc Wolf Entwicklungen 2017-07-15 919 // TODO: fill instance->tx_cfg; > 6b5d85fc Wolf Entwicklungen 2017-07-15 920 > 6b5d85fc Wolf Entwicklungen 2017-07-15 921 /* instance data as
[PATCH 1/1] drivers/staging/pi433: New driver
From: Marcus Wolf Date: Tue,16 Jul 2017 11:52:06 +0100 Subject: [PATCH 1/1] drivers/staging/pi433: New driver Added a driver for the pi433 radio module (see https://www.pi433.de/en.html for details). Signed-off-by: Marcus Wolf --- diff --git a/drivers/staging/Kconfig b/drivers/staging/Kconfig index 268d4e6..fdf060c 100644 --- a/drivers/staging/Kconfig +++ b/drivers/staging/Kconfig @@ -110,4 +110,6 @@ source "drivers/staging/ccree/Kconfig" source "drivers/staging/typec/Kconfig" +source "drivers/staging/pi433/Kconfig" + endif # STAGING diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile index b93e6f5..998f644 100644 --- a/drivers/staging/Makefile +++ b/drivers/staging/Makefile @@ -44,3 +44,4 @@ obj-$(CONFIG_KS7010) += ks7010/ obj-$(CONFIG_GREYBUS) += greybus/ obj-$(CONFIG_BCM2835_VCHIQ)+= vc04_services/ obj-$(CONFIG_CRYPTO_DEV_CCREE) += ccree/ +obj-$(CONFIG_PI433)+= pi433/ diff --git a/drivers/staging/pi433/Documentation/devicetree/pi433-overlay.dts b/drivers/staging/pi433/Documentation/devicetree/pi433-overlay.dts new file mode 100644 index 000..004b502 --- /dev/null +++ b/drivers/staging/pi433/Documentation/devicetree/pi433-overlay.dts @@ -0,0 +1,53 @@ +// Definitions for Pi433 +/dts-v1/; +/plugin/; + +/ { +compatible = "bcm,bcm2835", "bcm,bcm2708", "bcm,bcm2709"; + +fragment@0 { +target = <&spi0>; +__overlay__ { +status = "okay"; + +spidev@0{ +status = "disabled"; +}; + +spidev@1{ +status = "disabled"; +}; +}; +}; + + fragment@1 { + target = <&gpio>; + __overlay__ { + pi433_pins: pi433_pins { + brcm,pins = <7 25 24>; + brcm,function = <0 0 0>; // in in in + }; + }; + }; + + fragment@2 { + target = <&spi0>; + __overlay__ { + #address-cells = <1>; + #size-cells = <0>; + status = "okay"; + + pi433: pi433@0 { + compatible = "Smarthome-Wolf,pi433"; + reg = <0>; + spi-max-frequency = <1000>; + status = "okay"; + + pinctrl-0 = <&pi433_pins>; + DIO0-gpio = <&gpio 24 0>; + DIO1-gpio = <&gpio 25 0>; + DIO2-gpio = <&gpio 7 0>; + }; + }; + }; +}; diff --git a/drivers/staging/pi433/Documentation/devicetree/pi433.txt b/drivers/staging/pi433/Documentation/devicetree/pi433.txt new file mode 100644 index 000..14197c6 --- /dev/null +++ b/drivers/staging/pi433/Documentation/devicetree/pi433.txt @@ -0,0 +1,62 @@ +* Smarthome-Wolf Pi433 - a 433MHz radio module/shield for Raspberry Pi (see www.pi433.de) + +Required properties: +- compatible: must be "Smarthome-Wolf,pi433" +- reg: chip select of SPI Interface +- DIOx-gpio must be dedicated to the GPIO, connected with DIOx of the RFM69 module + + +Example: + +With the following lines in gpio-section, the gpio pins, connected with pi433 are +reserved/declared. + +&gpio{ + [...] + + pi433_pins: pi433_pins { + brcm,pins = <7 25 24>; + brcm,function = <0 0 0>; // in in in + }; + + [...] +} + +With the following lines in spi section, the device pi433 is declared. +It consists of the three gpio pins and an spi interface (here chip select 0) + +&spi0{ + [...] + + pi433: pi433@0 { + compatible = "Smarthome-Wolf,pi433"; + reg = <0>; /* CE 0 */ + #address-cells = <1>; + #size-cells = <0>; + spi-max-frequency = <1000>; + + pinctrl-0 = <&pi433_pins>; + DIO0-gpio = <&gpio 24 0>; + DIO1-gpio = <&gpio 25 0>; + DIO2-gpio = <&gpio 7 0>; + }; +} + + + +For Raspbian users only +=== +Since Raspbian supports device tree overlays, you may use and overlay, instead +of editing your boards device tree. +For using the overlay, you need to compile the file pi433-overlay.dts you can +find aside to this documentation. +The file needs to be compiled - either manually or by integration in