RE: [PATCH v3] test/service: fix spurious failures by extending timeout

Van Haaren, Harry Mon, 27 Feb 2023 00:42:00 -0800

> -----Original Message-----
> From: Thomas Monjalon <tho...@monjalon.net>
> Sent: Thursday, February 23, 2023 8:15 PM
> To: Van Haaren, Harry <harry.van.haa...@intel.com>
> Cc: David Marchand <david.march...@redhat.com>; dev@dpdk.org;
> dpdk...@iol.unh.edu; c...@dpdk.org; honnappa.nagaraha...@arm.com;
> mattias.ronnblom <mattias.ronnb...@ericsson.com>; Morten Brørup
> <m...@smartsharesystems.com>; Tyler Retzlaff <roret...@linux.microsoft.com>;
> Aaron Conole <acon...@redhat.com>; Richardson, Bruce
> <bruce.richard...@intel.com>
> Subject: Re: [PATCH v3] test/service: fix spurious failures by extending 
> timeout
<snip>
> > > We are talking about seconds.
> > > There are setups where scheduling a thread is taking seconds?
> >
> > Apparently so - otherwise these tests would always pass.
> >
> > They *only* fail at random runs in CI, and reliably pass everywhere else..
> > I've not had them fail locally, and that includes running in a loop for
> > hours with a busy system.. but not a low-priority CI VM in a busy datacenter.
> >
> >
> > [Bruce wrote in separate mail]
> 
> Bruce was not Cc'ed in this reply.


Correct, I missed that he wasn't on the thread already, thanks for adding him 
on CC.


> > >>> For me, the question is - why hasn't the service-core been scheduled?
> > >>> Can we use sched-yield or some other mechanism to force a wakeup of it?
> >
> > I'm not aware of a way to make *a specific other pthread* wake up.  We could
> > sacrifice the current lcore that's waiting for the service-lcore, with a
> > sched_yield() as you suggest.
> > It would potentially "churn" the scheduler enough to give the service core
> > some CPU?
> > It's a guess/gamble in the end, kind of like the timeouts we have today..
> >
> > > > Thoughts and input welcomed, I'm happy to make the code changes
> > > > themselves, it's a small effort for both option 1 & 2.
> > >
> > > For time-sensitive tests, yes they should be in perf tests category.
> > > As David said earlier, no timeout approach in functional tests.
> >
> > Ok, as before, option 1) is to while(1) and wait for "success". Then there's
> > no timeout in the test code, but our meson test runner will time-out/fail
> > after ~10sec IIRC.
> >
> > Or we move the tests to perf-tests, as per Option 2), and these simply won't
> > run in CI.
> >
> > I'm OK with all 3 (including testing with sched_yield() for a month or two
> > to see if that helps?)
> 
> Did you send a patch to go in a direction or another?
> If not, please move the test to perf-test as suggested before.
> We are still hitting the issues in the CI and it is *very* annoying.
> It is consuming time of a lot of people for a lot of patches,
> just to check it is again an issue with this test.
> 
> Please let's remove this test from the CI now.

Patch sent: 
http://patches.dpdk.org/project/dpdk/patch/20230224173637.243266-1-harry.van.haa...@intel.com/
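
For reference, the sched_yield() idea from earlier in the thread (option 1
with a yield in the wait loop) could look roughly like the sketch below. It
is separate from the patch linked above and does not use the DPDK service
API: service_ran, fake_service_core(), wait_for_service_yielding() and
run_yield_demo() are hypothetical names, and a plain pthread stands in for
the service lcore. The loop has no timeout of its own, so in a real test the
only limit would be the test runner's timeout.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for "the service lcore has run at least once". */
static atomic_int service_ran;

/* Hypothetical stand-in for the service-core thread (not DPDK code). */
static void *fake_service_core(void *arg)
{
	(void)arg;
	atomic_store_explicit(&service_ran, 1, memory_order_release);
	return NULL;
}

/*
 * Poll until the flag is set, giving up the CPU on every iteration.
 * sched_yield() only lets other runnable threads go first; it cannot
 * wake one specific thread, so this remains a best-effort nudge.
 * Returns how many times the caller yielded.
 */
static uint64_t wait_for_service_yielding(void)
{
	uint64_t yields = 0;

	while (!atomic_load_explicit(&service_ran, memory_order_acquire)) {
		sched_yield();
		yields++;
	}
	return yields;
}

/* Start the stand-in thread, wait for it by yielding, then join it. */
static int run_yield_demo(void)
{
	pthread_t tid;
	int rc;

	atomic_store(&service_ran, 0);
	rc = pthread_create(&tid, NULL, fake_service_core, NULL);
	assert(rc == 0);
	(void)wait_for_service_yielding();
	rc = pthread_join(tid, NULL);
	assert(rc == 0);
	return atomic_load(&service_ran);
}
```

Whether yielding actually gets the service core scheduled sooner on a loaded
CI VM is exactly the open question above; the sketch only shows the shape of
the wait loop.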
