Re: rcutorture: meaning of "End of test: RCU_HOTPLUG"

Paul E. McKenney Tue, 22 Jan 2019 19:23:52 -0800

On Tue, Jan 22, 2019 at 04:42:19PM +0800, Su Yue wrote:
> Thanks for your quick reply! Paul
> 
> On 1/22/19 12:01 PM, Paul E. McKenney wrote:
> >On Tue, Jan 22, 2019 at 11:40:53AM +0800, Su Yue wrote:
> >>Hi, guys
> >>   While running rcutorture tests with "onoff_interval", some tests
> >>failed and results show like:
> >>
> >>=====================================================================
> >>[  316.354501] srcud-torture:--- End of test: RCU_HOTPLUG:
> >>nreaders=1 nfakewriters=4 stat_interval=60 verbose=2
> >>test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1
> >>fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0
> >>test_boost_interval=7 test_boost_duration=4 shutdown_secs=0
> >>stall_cpu=0 stall_cpu_holdoff=10 stall_cpu_irqsoff=0
> >>n_barrier_cbs=0 onoff_interval=3 onoff_holdoff=0
> >>====================================================================
> >>
> >>I am wondering about the meaning of "RCU_HOTPLUG". Is it expected
> >>because CPU hotplug is enabled in the test? Or does it represent
> >>another type of failure?
> >
> >This says that at least one CPU hotplug operation failed, that is,
> >the CPU didn't actually come online or go offline as requested.  If you
> >are introducing CPU hotplug to an architecture, this usually indicates
> >that you have bugs in your CPU-hotplug code.  Or it might be that
> 
> It should hit the case since there are no RCU CPU stall warnings.
> 
> >RCU grace periods failed to progress -- though this would normally
> >also result in RCU CPU stall warnings.
> >
> >There should be lines containing "ver:" in your console output.  What
> >does one of the later ones say?
> >
> 
> The line says:
> ======================================================================
> [  318.850175] busted_srcud-torture: rtc:           (null) ver: 27040
> tfle: 0 rta: 27040 rtaf: 0 rtf: 27027 rtmbe: 0 rtbe: 0 rtbke: 0
> rtbre: 0 rtbf: 0 rtb: 0 nt: 9497
> onoff: 2639/2639:2640/5310 40,373:10,355 162868:67542 (HZ=1000)
> barrier: 0/0:0


Yes, you have many more offline attempts (5310) than successes (2640),
which is why RCU_HOTPLUG was printed.
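
To make the comparison above concrete, here is a minimal userspace sketch
that pulls the four counters out of an "onoff:" field and checks whether
any hotplug attempt failed.  It assumes the field order is online
successes/attempts followed by offline successes/attempts, which matches
the numbers quoted above.  The struct and both function names are
hypothetical helpers for illustration, not part of kernel/torture.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Counters taken from the "onoff:" field of a torture statistics line. */
struct onoff_counts {
	long online_ok, online_tries;
	long offline_ok, offline_tries;
};

/*
 * Parse "onoff: a/b:c/d" out of a console line.
 * ASSUMPTION: a/b are online successes/attempts and c/d are offline
 * successes/attempts.
 * Returns 0 on success, -1 if the field is missing or malformed.
 */
int parse_onoff(const char *line, struct onoff_counts *c)
{
	const char *p;

	assert(line && c);
	p = strstr(line, "onoff: ");
	if (!p)
		return -1;
	if (sscanf(p, "onoff: %ld/%ld:%ld/%ld",
		   &c->online_ok, &c->online_tries,
		   &c->offline_ok, &c->offline_tries) != 4)
		return -1;
	return 0;
}

/* Nonzero if any online or offline attempt did not succeed. */
int hotplug_failed(const struct onoff_counts *c)
{
	return c->online_ok != c->online_tries ||
	       c->offline_ok != c->offline_tries;
}
```

Feeding it the line quoted above gives offline_ok = 2640 and
offline_tries = 5310, so hotplug_failed() returns nonzero.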

> =====================================================================
> 
> And here are useful errors:
> =====================================================================
> kern  :info  : [  135.379693] KVM setup async PF for cpu 1
> kern  :info  : [  135.381412] kvm-stealtime: cpu 1, msr 23fd16180
> kern  :alert : [  135.386897] busted_srcud-torture:torture_onoff

Just so you know, busted_srcud can sometimes fail by design.  Hence
the "busted" in the name.  But that failure didn't happen this time.

> task: onlined 1
> kern  :alert : [  135.408241] busted_srcud-torture:torture_onoff
> task: offlining 1
> kern  :info  : [  135.423310] Unregister pv shared memory for cpu 1
> kern  :info  : [  135.427940] smpboot: CPU 1 is now offline
> kern  :alert : [  135.430106] busted_srcud-torture:torture_onoff
> task: offlined 1
> kern  :alert : [  135.436404] busted_srcud-torture:torture_onoff
> task: offlining 0
> kern  :alert : [  135.446173] busted_srcud-torture:torture_onoff
> task: offline 0 failed: errno -16
> kern  :alert : [  135.453076] busted_srcud-torture:torture_onoff
> task: offlining 0
> kern  :alert : [  135.457461] busted_srcud-torture:torture_onoff
> task: offline 0 failed: errno -16
> 
> 
> =====================================================================
> There are only two CPUs on the VM. The torture test tries to offline
> the last one, but -EBUSY occurred.
> 
> I spent time to understand kernel/torture.c.
> There is torture_onoff():
> 
>         while (!torture_must_stop()) {
>                 cpu = (torture_random(&rand) >> 4) % (maxcpu + 1);
>                 if (!torture_offline(cpu,
>                                      &n_offline_attempts, &n_offline_successes,
>                                      &sum_offline, &min_offline, &max_offline))
>                         torture_online(cpu,
>                                        &n_online_attempts, &n_online_successes,
>                                        &sum_online, &min_online, &max_online);
>                 schedule_timeout_interruptible(onoff_interval);
>         }
> 
> torture_offline() and torture_online() don't check beforehand whether
> the current CPU is the only usable one.

That does appear to be the case, and that would be a problem with
the CONFIG_BOOTPARAM_HOTPLUG_CPU0 listed below.

Good catch!

> Our test machines are configured with CONFIG_BOOTPARAM_HOTPLUG_CPU0. If
> there is only one online and hotpluggable CPU, then
> n_offline_successes != n_offline_attempts, which causes "End of test:
> RCU_HOTPLUG".
> 
> Did I misunderstand something above? Feel free to correct me.

Does the following patch help?

                                                        Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/torture.c b/kernel/torture.c
index a03ff722352b..2b6700ca2a43 100644
--- a/kernel/torture.c
+++ b/kernel/torture.c
@@ -101,6 +101,8 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
 
        if (!cpu_online(cpu) || !cpu_is_hotpluggable(cpu))
                return false;
+       if (num_online_cpus() <= 1)
+               return false;  /* Can't offline the last CPU. */
 
        if (verbose > 1)
                pr_alert("%s" TORTURE_FLAG
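
For illustration only, the effect of the added check can be modeled in
userspace.  The sketch below is not the kernel code: struct sim,
sim_num_online(), sim_offline() and the guard flag are hypothetical names,
and the -EBUSY failure is modeled simply by refusing to offline the last
online CPU.  It assumes, as in the report above, a two-CPU system where
both CPUs are hotpluggable.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 2	/* assumption: two-CPU VM, as in the report */

/* Toy model of the state that torture_offline() looks at and updates. */
struct sim {
	bool online[NR_CPUS];
	long offline_tries, offline_ok;
};

/* Count the CPUs currently marked online in the model. */
static int sim_num_online(const struct sim *s)
{
	int cpu, n = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		n += s->online[cpu];
	return n;
}

/*
 * Model of an offline request.  With guard == true, a request to offline
 * the last online CPU is skipped before it is counted as an attempt.
 * Returns true if the CPU went offline.
 */
bool sim_offline(struct sim *s, int cpu, bool guard)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (!s->online[cpu])
		return false;
	if (guard && sim_num_online(s) <= 1)
		return false;	/* skipped: no attempt is recorded */
	s->offline_tries++;
	if (sim_num_online(s) <= 1)
		return false;	/* models the -EBUSY failure */
	s->online[cpu] = false;
	s->offline_ok++;
	return true;
}
```

Without the guard, offlining CPU 1 and then CPU 0 leaves two attempts and
one success in this model, which is the kind of mismatch that was
reported.  With the guard, the second request is skipped and the two
counters stay equal.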
