> -----Original Message-----
> From: Thomas Monjalon <tho...@monjalon.net>
> Sent: Thursday, February 23, 2023 8:15 PM
> To: Van Haaren, Harry <harry.van.haa...@intel.com>
> Cc: David Marchand <david.march...@redhat.com>; dev@dpdk.org;
> dpdk...@iol.unh.edu; c...@dpdk.org; honnappa.nagaraha...@arm.com;
> mattias.ronnblom <mattias.ronnb...@ericsson.com>; Morten Brørup
> <m...@smartsharesystems.com>; Tyler Retzlaff <roret...@linux.microsoft.com>;
> Aaron Conole <acon...@redhat.com>; Richardson, Bruce
> <bruce.richard...@intel.com>
> Subject: Re: [PATCH v3] test/service: fix spurious failures by extending
> timeout

<snip>

> > > We are talking about seconds.
> > > There are setups where scheduling a thread is taking seconds?
> >
> > Apparently so - otherwise these tests would always pass.
> >
> > They *only* fail at random runs in CI, and reliably pass everywhere else..
> > I've not had them fail locally, and that includes running in a loop for
> > hours with a busy system.. but not a low-priority CI VM in a busy datacenter.
> >
> >
> > [Bruce wrote in separate mail]
>
> Bruce was not Cc'ed in this reply.

Correct, I missed that he wasn't on the thread already, thanks for adding him on CC.

> > >>> For me, the question is - why hasn't the service-core been scheduled?
> > >>> Can we use sched-yield or some other mechanism to force a wakeup of it?
> >
> > I'm not aware of a way to make *a specific other pthread* wakeup. We could
> > sacrifice the current lcore that's waiting for the service-lcore, with a
> > sched_yield() as you suggest.
> > It would potentially "churn" the scheduler enough to give the service core
> > some CPU?
> > It's a guess/gamble in the end, kind of like the timeouts we have today..
> >
> > > > Thoughts and input welcomed, I'm happy to make the code changes
> > > > themselves, its small effort
> > > > For both option 1 & 2.
> > >
> > > For time-sensitive tests, yes they should be in perf tests category.
> > > As David said earlier, no timeout approach in functional tests.
> >
> > Ok, as before, option 1) is to while(1) and wait for "success". Then there's
> > no timeout in the test code, but our meson test runner will time-out/fail
> > after ~10sec IIRC.
> >
> > Or we move the tests perf-tests, as per Option 2), and these simply won't
> > run in CI.
> >
> > I'm OK with all 3 (including testing with sched_yield() for a month or two
> > and if that helps?)
>
> Did you send a patch to go in a direction or another?
> If not, please move the test to perf-test as suggested before.
> We are still hitting the issues in the CI and it is *very* annoying.
> It is consuming time of a lot of people for a lot of patches,
> just to check it is again an issue with this test.
>
> Please let's remove this test from the CI now.

Patch sent: 
http://patches.dpdk.org/project/dpdk/patch/20230224173637.243266-1-harry.van.haa...@intel.com/
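
For reference, a rough sketch of what option 1 combined with the sched_yield() idea could look like. This is not the patch above and not the actual test code: it uses plain pthreads rather than the DPDK service-core API, and the names service_ran, service_thread, wait_for_service and run_once are made up for illustration. The waiter has no timeout of its own, so a stuck worker would only be caught by the external test runner.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for "the service has run at least once". */
static atomic_int service_ran;

/* Hypothetical stand-in for a service lcore: it just sets the flag. */
static void *service_thread(void *arg)
{
	(void)arg;
	atomic_store_explicit(&service_ran, 1, memory_order_release);
	return NULL;
}

/*
 * Option 1 style wait: loop until "success", with no timeout in the test.
 * Each iteration gives up the CPU with sched_yield(), so the scheduler
 * gets a chance to run the other thread. Returns how often we yielded.
 */
static uint64_t wait_for_service(void)
{
	uint64_t yields = 0;

	while (!atomic_load_explicit(&service_ran, memory_order_acquire)) {
		sched_yield();
		yields++;
	}
	return yields;
}

/* Start the worker, wait for it without a timeout; returns 0 on success. */
static int run_once(void)
{
	pthread_t tid;

	atomic_store(&service_ran, 0);
	if (pthread_create(&tid, NULL, service_thread, NULL) != 0)
		return -1;
	wait_for_service();
	pthread_join(tid, NULL);
	return atomic_load(&service_ran) == 1 ? 0 : -1;
}
```

Note that sched_yield() only offers the CPU to other runnable threads. It does not wake a specific thread, so this remains a guess/gamble as described above.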