03/02/2023 17:09, Van Haaren, Harry:
> From: Thomas Monjalon <tho...@monjalon.net>
> > 03/02/2023 16:03, Van Haaren, Harry:
> > > From: Van Haaren, Harry
> > > > > The timeout approach just does not have its place in a functional
> > > > > test.
> > > > > Either this test is rewritten, or it must go to the performance tests
> > > > > list so that we stop getting false positives.
> > > > > Can you work on this?
> > > >
> > > > I'll investigate various approaches on Thursday and reply here with
> > > > suggested next steps.
> > >
> > > I've identified 3 checks that fail in CI (from the above log outputs),
> > > all 3 cases have different delays: 100 ms delay, 200 ms delay and 1000ms.
> > > In the CI, the service-core just hasn't been scheduled (yet) and causes
> > > the "failure".
> > >
> > > Option 1)
> > > One option is to while(1) loop, waiting for the service-thread to be
> > > scheduled. This can be seen as "increasing the timeout", however in this
> > > case the test-case would be errored not in the test-code, but in the
> > > meson-test runner as a timeout (with a 10sec default?)
> > > The benefit here is that massively increasing (~1sec or less to 10 sec)
> > > will cover all/many of the CI timeouts.
> > >
> > > Option 2)
> > > Move to perf-tests, and not run these in a noisy-CI environment where the
> > > results are not consistent enough to have value. This would mean that the
> > > tests are not run in CI.
> > >
> > > The 3 checks in question are below, they all *require* the service core
> > > to be scheduled:
> > > service_attr_get() -> requires service core to run for service stats to
> > >   increment
> > > service_lcore_attr_get() -> requires service core to run for lcore stats
> > >   to increment
> > > service_lcore_start_stop() -> requires service to run to ensure
> > >   service-func itself executes.
> > >
> > > I don't see how we can "improve" option 2 to not require the
> > > service-thread to be scheduled by the OS..
> > > And the only way to make the OS schedule it in the CI more consistently
> > > is to give it more time?
> >
> > We are talking about seconds.
> > There are setups where scheduling a thread is taking seconds?
>
> Apparently so - otherwise these tests would always pass.
> They *only* fail at random runs in CI, and reliably pass everywhere else..
> I've not had them fail locally, and that includes running in a loop for
> hours with a busy system.. but not a low-priority CI VM in a busy datacenter.
>
> [Bruce wrote in separate mail]
Bruce was not Cc'ed in this reply.

> >>> For me, the question is - why hasn't the service-core been scheduled? Can
> >>> we use sched-yield or some other mechanism to force a wakeup of it?
>
> I'm not aware of a way to make *a specific other pthread* wakeup. We could
> sacrifice the current lcore that's waiting for the service-lcore, with a
> sched_yield() as you suggest.
> It would potentially "churn" the scheduler enough to give the service core
> some CPU?
> It's a guess/gamble in the end, kind of like the timeouts we have today..
>
> > > Thoughts and input welcomed, I'm happy to make the code changes
> > > themselves, it's small effort
> > > For both option 1 & 2.
> >
> > For time-sensitive tests, yes they should be in perf tests category.
> > As David said earlier, no timeout approach in functional tests.
>
> Ok, as before, option 1) is to while(1) and wait for "success". Then there's
> no timeout in the test code, but our meson test runner will time-out/fail
> after ~10sec IIRC.
>
> Or we move the tests to perf-tests, as per Option 2), and these simply won't
> run in CI.
>
> I'm OK with all 3 (including testing with sched_yield() for a month or two
> and if that helps?)

Did you send a patch to go in a direction or another?
If not, please move the test to perf-test as suggested before.

We are still hitting the issues in the CI and it is *very* annoying.
It is consuming time of a lot of people for a lot of patches,
just to check it is again an issue with this test.
Please let's remove this test from the CI now.
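
For reference, here is a minimal sketch of what Option 1 combined with the
sched_yield() idea could look like. It uses plain pthreads and C11 atomics
rather than the DPDK service API, so it only illustrates the waiting pattern.
The names service_calls, service_thread, wait_for_service_call and run_once
are hypothetical and are not taken from the test suite.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-service call counter. */
static _Atomic uint64_t service_calls;

/* Hypothetical stand-in for the service function run by a service core. */
void *service_thread(void *arg)
{
	(void)arg;
	atomic_fetch_add(&service_calls, 1);
	return NULL;
}

/*
 * Option 1: no deadline in the test code. Spin until the counter moves past
 * 'before'. If the worker is never scheduled this never returns, and the
 * timeout of the external test runner is what fails the test.
 */
uint64_t wait_for_service_call(uint64_t before)
{
	uint64_t now;

	while ((now = atomic_load(&service_calls)) == before)
		sched_yield(); /* give up the CPU; a hint, not a targeted wakeup */

	return now;
}

/* Start one worker, wait until it has run, return how many calls were seen. */
uint64_t run_once(void)
{
	pthread_t tid;
	uint64_t before = atomic_load(&service_calls);
	uint64_t after;
	int rc;

	rc = pthread_create(&tid, NULL, service_thread, NULL);
	assert(rc == 0);
	(void)rc;

	after = wait_for_service_call(before);
	pthread_join(tid, NULL);
	return after - before;
}
```

Note that sched_yield() only gives the scheduler a chance to run something
else; it does not wake a specific thread, so this stays a best-effort approach,
in line with the "guess/gamble" concern raised above.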