On Tue, Jan 22, 2019 at 04:42:19PM +0800, Su Yue wrote: > Thanks for your quick reply! Paul > > On 1/22/19 12:01 PM, Paul E. McKenney wrote: > >On Tue, Jan 22, 2019 at 11:40:53AM +0800, Su Yue wrote: > >>Hi, guys > >> While running rcutorture tests with "onoff_interval", some tests > >>failed and results show like: > >> > >>===================================================================== > >>[ 316.354501] srcud-torture:--- End of test: RCU_HOTPLUG: > >>nreaders=1 nfakewriters=4 stat_interval=60 verbose=2 > >>test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fq\ > >>s_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 > >>test_boost_interval=7 test_boost_duration=4 shutdown_secs=0 > >>stall_cpu=0 stall_cpu_holdoff=10 stall_cpu_irqsoff=0 n_ba\ > >>rrier_cbs=0 onoff_interval=3 onoff_holdoff=0 > >>==================================================================== > >> > >>I am wondering that meaning of "RCU_HOTPLUG". Is it expected because > >>cpu hotplug is enabled in the test? Or just represents another type of > >>failure? > > > >This says that at least one CPU hotplug operation failed, that is, > >the CPU didn't actually come online or go offline as requested. If you > >are introducing CPU hotplug to an architecture, this usually indicates > >that you have bugs in your CPU-hotplug code. Or it nmight be that > > It should hit the case since there is no RCU CPU stall warnings. > > >RCU grace periods failed to progress -- though this would normally > >also result in RCU CPU stall warnings. > > > >There should be lines containing "ver:" in your console output. What > >does one of the later one of these say? > > > > The line says: > ====================================================================== > [ 318.850175] busted_srcud-torture: rtc: (null) ver: > 27040 tfle: 0 rta: 27040 rtaf: 0 rtf: 27027 rtmbe: 0 rtbe: 0 rtbke: > 0 rtbre: 0 rtbf: 0 rtb: 0 \ > nt: 9497 onoff: 2639/2639:2640/5310 40,373:10,355 162868:67542 > (HZ=1000) barrier: 0/0:0
Yes, you have many more offline attempts than successes, which is why RCU_HOTPLUG was printed. > ===================================================================== > > And here are useful errors: > ===================================================================== > kern :info : [ 135.379693] KVM setup async PF for cpu 1 > kern :info : [ 135.381412] kvm-stealtime: cpu 1, msr 23fd16180 > kern :alert : [ 135.386897] busted_srcud-torture:torture_onoff Just so your know, busted_srcud can sometimes fail by design. Hence the "busted" in the name. But failure didn't happen this time. > task: onlined 1 > kern :alert : [ 135.408241] busted_srcud-torture:torture_onoff > task: offlining 1 > kern :info : [ 135.423310] Unregister pv shared memory for cpu 1 > kern :info : [ 135.427940] smpboot: CPU 1 is now offline > kern :alert : [ 135.430106] busted_srcud-torture:torture_onoff > task: offlined 1 > kern :alert : [ 135.436404] busted_srcud-torture:torture_onoff > task: offlining 0 > kern :alert : [ 135.446173] busted_srcud-torture:torture_onoff > task: offline 0 failed: errno -16 > kern :alert : [ 135.453076] busted_srcud-torture:torture_onoff > task: offlining 0 > kern :alert : [ 135.457461] busted_srcud-torture:torture_onoff > task: offline 0 failed: errno -16 > > > ===================================================================== > There are only two CPUs on the VM. Torture try to offline the last one > but -EBUSY occured. > > I spent time to understand kernel/torture.c. > There is torture_onoff(): > > 225 while (!torture_must_stop()) { > 226 cpu = (torture_random(&rand) >> 4) % (maxcpu + 1); > 227 if (!torture_offline(cpu, > 228 &n_offline_attempts, > &n_offline_successes, > 229 &sum_offline, &min_offline, > &max_offline)) > 230 torture_online(cpu, > 231 &n_online_attempts, > &n_online_successes, > 232 &sum_online, &min_online, > &max_online); > 233 schedule_timeout_interruptible(onoff_interval); > 234 } > 235 > > torture_offline() and torture_offline() don't pre judge if the current > cpu is only one usable. That does appear to be the case, and that would be a problem with the CONFIG_BOOTPARAM_HOTPLUG_CPU0 listed below. Good catch! > Our test machines are configured with CONFIG_BOOTPARAM_HOTPLUG_CPU0. If > there are only one oneline and hotplugable cpux, then > n_offline_successes != n_offline_attempts which caused "End of test: > RCU_HOTPLUG". > > Does I misunderstand something above? Feel free to correct me. Does the following patch help? Thanx, Paul ------------------------------------------------------------------------ diff --git a/kernel/torture.c b/kernel/torture.c index a03ff722352b..2b6700ca2a43 100644 --- a/kernel/torture.c +++ b/kernel/torture.c @@ -101,6 +101,8 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes, if (!cpu_online(cpu) || !cpu_is_hotpluggable(cpu)) return false; + if (num_online_cpus() <= 1) + return false; /* Can't offline the last CPU. */ if (verbose > 1) pr_alert("%s" TORTURE_FLAG