On 2022-10-05 22:52, Thomas Monjalon wrote:
> 05/10/2022 22:33, Mattias Rönnblom:
>> On 2022-10-05 21:14, David Marchand wrote:
>>> Hello,
>>>
>>> The service_autotest unit test has been failing randomly.
>>> This is not something new.
>>> We have been fixing this unit test and the service code, here and there.
>>> For some time we were "fine": the failures were rare.
>>>
>>> But recently (for the last two weeks at least), it started failing more
>>> frequently in UNH lab.
>>>
>>> The symptoms are linked to places where the unit test code is "waiting
>>> for some time":
>>>
>>> - service_lcore_attr_get:
>>> + TestCase [ 5] : service_lcore_attr_get failed
>>> EAL: Test assert service_lcore_attr_get line 422 failed: Service lcore
>>> not stopped after waiting.
>>>
>>>
>>> - service_may_be_active:
>>> + TestCase [15] : service_may_be_active failed
>>> ...
>>> EAL: Test assert service_may_be_active line 960 failed: Error: Service
>>> not stopped after 100ms
>>>
>>> Ideas?
>>>
>>>
>>> Thanks.
>>
>> Do you run the test suite in a controlled environment? I.e., one where
>> you can trust that the lcore threads aren't interrupted for long periods
>> of time.
>>
>> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
>> CPU with other threads.
>
> You mean the tests cannot be interrupted?

I just took a very quick look, but it seems the main thread can be
interrupted, whereas the worker lcore thread cannot be interrupted for
anything close to 100 ms, or you risk a test failure.

> Then it looks very fragile.

Tests like this are by their very nature racy. If a test thread sends a
request to another thread, there is no way for it to decide when a
non-response should result in a test failure, unless the scheduling
latency of the receiving thread has an upper bound.

If you grep for "sleep" or "delay" in app/test/test_*.c, you will get
a lot of matches. I bet there are more tests like the service core one,
but they allow for longer interruptions.

That said, 100 ms sounds very short. I don't see why this can't be a
lot longer.

...and that said, I would argue you still need a reasonably controlled
environment for the autotests. If you have a server that is arbitrarily
overloaded, maybe also with high memory pressure (and associated
instruction page faults and god-knows-what), the real-world worst-case
interruptions could be very long indeed. Seconds. Designing inherently
racy tests for that kind of environment will make them have very long
run times.

> Please could help making it more robust?
>

I can send a patch, if Harry can't.
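
To illustrate the "a lot longer" point, here is a minimal sketch of a more tolerant wait: poll a condition against a generous deadline, so that a well-behaved run still finishes quickly and only a genuinely stuck thread pays the full timeout. This is not the existing test code and not a patch. wait_until(), now_ms() and the countdown condition are hypothetical names made up for the example, and it uses plain POSIX calls rather than any DPDK API.

```c
#define _POSIX_C_SOURCE 200809L

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper (not a DPDK API): monotonic time in milliseconds. */
static uint64_t now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

/*
 * Hypothetical helper: poll cond(arg) roughly once per millisecond until
 * it returns true or timeout_ms has elapsed. Returns true if the
 * condition was observed, false on timeout.
 */
static bool wait_until(bool (*cond)(void *), void *arg, uint64_t timeout_ms)
{
	const struct timespec pause = { .tv_sec = 0, .tv_nsec = 1000000 };
	uint64_t deadline = now_ms() + timeout_ms;

	for (;;) {
		if (cond(arg))
			return true;
		if (now_ms() >= deadline)
			return false;
		nanosleep(&pause, NULL);
	}
}

/* Stand-in condition for the example: true once 'remaining' reaches 0. */
struct countdown {
	int remaining;
};

static bool countdown_done(void *arg)
{
	struct countdown *c = arg;

	if (c->remaining > 0) {
		c->remaining--;
		return false;
	}
	return true;
}
```

In a test, the condition would be whatever "service lcore stopped" check the test already makes, and the timeout could be seconds rather than 100 ms. The fast path costs nothing extra, since the loop returns as soon as the condition holds. It does not remove the race, it only moves the bound.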