On Wednesday, June 10, 2015 09:18:45 AM Prarit Bhargava wrote: > I looked into switching to div64_s64() instead of the 32-bit version in > div_fp(), however, this would result in sample_ratio and core_busy returning > 0 which is something we don't want. > > P. > > ---8<--- > > The kernel may delay interrupts for a long time which can result in timers > being delayed. If this occurs the intel_pstate driver will crash with > a divide by zero error: > > divide error: 0000 [#1] SMP > Modules linked in: btrfs zlib_deflate raid6_pq xor msdos ext4 mbcache jbd2 > binfmt_misc arc4 md4 nls_utf8 cifs dns_resolver tcp_lp bnep bluetooth rfkill > fuse dm_service_time iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi > nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ftp ip6t_rpfilter > ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc > ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 > nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter > ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat > nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter > ip_tables intel_powerclamp coretemp vfat fat kvm_intel iTCO_wdt > iTCO_vendor_support ipmi_devintf sr_mod kvm crct10dif_pclmul > crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel cdc_ether lrw > usbnet cdrom mii gf128mul glue_helper ablk_helper cryptd lpc_ich mfd_core > pcspkr sb_edac edac_core ipmi_si ipmi_msghandler ioatdma wmi shpchp acpi_pad > nfsd auth_rpcgss nfs_acl lockd uinput dm_multipath sunrpc xfs libcrc32c > usb_storage sd_mod crc_t10dif crct10dif_common ixgbe mgag200 syscopyarea > sysfillrect sysimgblt mdio drm_kms_helper ttm igb drm ptp pps_core dca > i2c_algo_bit megaraid_sas i2c_core dm_mirror dm_region_hash dm_log dm_mod > CPU: 113 PID: 0 Comm: swapper/113 Tainted: G W -------------- > 3.10.0-229.1.2.el7.x86_64 #1 > Hardware name: IBM x3950 X6 -[3837AC2]-/00FN827, BIOS -[A8E112BUS-1.00]- > 08/27/2014 > task: ffff880fe8abe660 ti: ffff880fe8ae4000 task.ti: ffff880fe8ae4000 > RIP: 0010:[<ffffffff814a9279>] [<ffffffff814a9279>] > intel_pstate_timer_func+0x179/0x3d0 > RSP: 0018:ffff883fff4e3db8 EFLAGS: 00010206 > RAX: 0000000027100000 RBX: ffff883fe6965100 RCX: 0000000000000000 > RDX: 0000000000000000 RSI: 0000000000000010 RDI: 000000002e53632d > RBP: ffff883fff4e3e20 R08: 000e6f69a5a125c0 R09: ffff883fe84ec001 > R10: 0000000000000002 R11: 0000000000000005 R12: 00000000000049f5 > R13: 0000000000271000 R14: 00000000000049f5 R15: 0000000000000246 > FS: 0000000000000000(0000) GS:ffff883fff4e0000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f7668601000 CR3: 000000000190a000 CR4: 00000000001407e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Stack: > ffff883fff4e3e58 ffffffff81099dc1 0000000000000086 0000000000000071 > ffff883fff4f3680 0000000000000071 fbdc8a965e33afee ffffffff810b69dd > ffff883fe84ec000 ffff883fe6965108 0000000000000100 ffffffff814a9100 > Call Trace: > <IRQ> > > [<ffffffff81099dc1>] ? run_posix_cpu_timers+0x51/0x840 > [<ffffffff810b69dd>] ? trigger_load_balance+0x5d/0x200 > [<ffffffff814a9100>] ? pid_param_set+0x130/0x130 > [<ffffffff8107df56>] call_timer_fn+0x36/0x110 > [<ffffffff814a9100>] ? pid_param_set+0x130/0x130 > [<ffffffff8107fdcf>] run_timer_softirq+0x21f/0x320 > [<ffffffff81077b2f>] __do_softirq+0xef/0x280 > [<ffffffff816156dc>] call_softirq+0x1c/0x30 > [<ffffffff81015d95>] do_softirq+0x65/0xa0 > [<ffffffff81077ec5>] irq_exit+0x115/0x120 > [<ffffffff81616355>] smp_apic_timer_interrupt+0x45/0x60 > [<ffffffff81614a1d>] apic_timer_interrupt+0x6d/0x80 > <EOI> > > [<ffffffff814a9c32>] ? cpuidle_enter_state+0x52/0xc0 > [<ffffffff814a9c28>] ? cpuidle_enter_state+0x48/0xc0 > [<ffffffff814a9d65>] cpuidle_idle_call+0xc5/0x200 > [<ffffffff8101d14e>] arch_cpu_idle+0xe/0x30 > [<ffffffff810c67c1>] cpu_startup_entry+0xf1/0x290 > [<ffffffff8104228a>] start_secondary+0x1ba/0x230 > Code: 42 0f 00 45 89 e6 48 01 c2 43 8d 44 6d 00 39 d0 73 26 49 c1 e5 08 89 d2 > 4d 63 f4 49 63 c5 48 c1 e2 08 48 c1 e0 08 48 63 ca 48 99 <48> f7 f9 48 98 4c > 0f af f0 49 c1 ee 08 8b 43 78 c1 e0 08 44 29 > RIP [<ffffffff814a9279>] intel_pstate_timer_func+0x179/0x3d0 > RSP <ffff883fff4e3db8> > > The kernel values for cpudata for CPU 113 were: > > struct cpudata { > cpu = 113, > timer = { > entry = { > next = 0x0, > prev = 0xdead000000200200 > }, > expires = 8357799745, > base = 0xffff883fe84ec001, > function = 0xffffffff814a9100 <intel_pstate_timer_func>, > data = 18446612406765768960, > <snip> > i_gain = 0, > d_gain = 0, > deadband = 0, > last_err = 22489 > }, > last_sample_time = { > tv64 = 4063132438017305 > }, > prev_aperf = 287326796397463, > prev_mperf = 251427432090198, > sample = { > core_pct_busy = 23081, > aperf = 2937407, > mperf = 3257884, > freq = 2524484, > time = { > tv64 = 4063149215234118 > } > } > } > > which results in the time between samples = last_sample_time - sample.time > = 4063149215234118 - 4063132438017305 = 16777216813 which is 16.777 seconds. > > The duration between reads of the APERF and MPERF registers overflowed a s32 > sized integer in intel_pstate_get_scaled_busy()'s call to div_fp(). The > result > is that int_tofp(duration_us) == 0, and the kernel attempts to divide by 0. > > While the kernel shouldn't be delaying for a long time, it can and does > happen, and the intel_pstate driver should not panic in this situation. This > patch checks for an overflow and ignores the current calculation cycle by > returning -EINVAL. Since intel_pstate_sample() has been called, subsequent > timer function calls will then again pick up the correct calculations and the > system will continue functioning properly. > > Cc: Kristen Carlson Accardi <kris...@linux.intel.com> > Cc: "Rafael J. Wysocki" <r...@rjwysocki.net> > Cc: Viresh Kumar <viresh.ku...@linaro.org> > Cc: linux...@vger.kernel.org > > Signed-off-by: Prarit Bhargava <pra...@redhat.com>
Looks OK to me, but I need Kristen to review it too. > --- > drivers/cpufreq/intel_pstate.c | 14 +++++++++++--- > 1 file changed, 11 insertions(+), 3 deletions(-) > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c > index 6414661..6a145e5 100644 > --- a/drivers/cpufreq/intel_pstate.c > +++ b/drivers/cpufreq/intel_pstate.c > @@ -823,6 +823,9 @@ static inline int32_t intel_pstate_get_scaled_busy(struct > cpudata *cpu) > sample_time = pid_params.sample_rate_ms * USEC_PER_MSEC; > duration_us = (u32) ktime_us_delta(cpu->sample.time, > cpu->last_sample_time); > + if (duration_us > S32_MAX) > + return -EINVAL; > + > if (duration_us > sample_time * 3) { > sample_ratio = div_fp(int_tofp(sample_time), > int_tofp(duration_us)); > @@ -832,7 +835,7 @@ static inline int32_t intel_pstate_get_scaled_busy(struct > cpudata *cpu) > return core_busy; > } > > -static inline void intel_pstate_adjust_busy_pstate(struct cpudata *cpu) > +static inline int32_t intel_pstate_adjust_busy_pstate(struct cpudata *cpu) > { > int32_t busy_scaled; > struct _pid *pid; > @@ -840,11 +843,15 @@ static inline void > intel_pstate_adjust_busy_pstate(struct cpudata *cpu) > > pid = &cpu->pid; > busy_scaled = intel_pstate_get_scaled_busy(cpu); > + if (busy_scaled < 0) > + return S32_MAX; > > ctl = pid_calc(pid, busy_scaled); > > /* Negative values of ctl increase the pstate and vice versa */ > intel_pstate_set_pstate(cpu, cpu->pstate.current_pstate - ctl); > + > + return busy_scaled; > } > > static void intel_hwp_timer_func(unsigned long __data) > @@ -859,15 +866,16 @@ static void intel_pstate_timer_func(unsigned long > __data) > { > struct cpudata *cpu = (struct cpudata *) __data; > struct sample *sample; > + int32_t busy_scaled; > > intel_pstate_sample(cpu); > > sample = &cpu->sample; > > - intel_pstate_adjust_busy_pstate(cpu); > + busy_scaled = intel_pstate_adjust_busy_pstate(cpu); > > trace_pstate_sample(fp_toint(sample->core_pct_busy), > - fp_toint(intel_pstate_get_scaled_busy(cpu)), > + fp_toint(busy_scaled), > cpu->pstate.current_pstate, > sample->mperf, > sample->aperf, > -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/