On 2022-10-05 22:52, Thomas Monjalon wrote:
> 05/10/2022 22:33, Mattias Rönnblom:
>> On 2022-10-05 21:14, David Marchand wrote:
>>> Hello,
>>>
>>> The service_autotest unit test has been failing randomly.
>>> This is not something new.
>>> We have been fixing this unit test and the service code, here and there.
>>> For some time we were "fine": the failures were rare.
>>>
>>> But recently (for the last two weeks at least), it started failing more
>>> frequently in UNH lab.
>>>
>>> The symptoms are linked to places where the unit test code is "waiting
>>> for some time":
>>>
>>> -  service_lcore_attr_get:
>>> + TestCase [ 5] : service_lcore_attr_get failed
>>> EAL: Test assert service_lcore_attr_get line 422 failed: Service lcore
>>> not stopped after waiting.
>>>
>>>
>>> -  service_may_be_active:
>>> + TestCase [15] : service_may_be_active failed
>>> ...
>>> EAL: Test assert service_may_be_active line 960 failed: Error: Service
>>> not stopped after 100ms
>>>
>>> Ideas?
>>>
>>>
>>> Thanks.
>>
>> Do you run the test suite in a controlled environment? I.e., one where
>> you can trust that the lcore threads aren't interrupted for long periods
>> of time.
>>
>> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
>> CPU with other threads.
> 
> You mean the tests cannot be interrupted?

I just took a very quick look, but it seems like the main thread can,
but the worker lcore thread cannot be interrupted for anything close to
100 ms, or you risk a test failure.

> Then it looks very fragile.

Tests like this are by their very nature racy. If a test thread sends a
request to another thread, there is no way for it to decide when a 
non-response should result in a test failure, unless the scheduling 
latency of the receiving thread has an upper bound.

If you grep for "sleep", or "delay", in app/test/test_*.c, you will get 
a lot of matches. I bet there are more like the service core one, but they
allow for longer interruptions.

That said, 100 ms sounds very short. I don't see why this can't be a
lot longer.
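
To illustrate why a longer timeout need not slow down a passing run, here
is a minimal sketch (not the actual test code) of a wait that polls a
condition and returns as soon as it holds. The names now_ms(), wait_until()
and flag_is_set() are hypothetical, and the sketch uses plain POSIX calls
rather than DPDK's own timer API:

```c
#define _POSIX_C_SOURCE 200809L

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: monotonic clock reading in milliseconds. */
static uint64_t now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

/*
 * Hypothetical helper: poll cond(arg) about once per millisecond until it
 * returns true or timeout_ms has elapsed. Returns true if the condition
 * was observed, false on timeout. The condition is always checked once
 * more after the last sleep, so a late wake-up of the polling thread does
 * not by itself cause a false timeout.
 */
static bool wait_until(bool (*cond)(void *), void *arg, uint64_t timeout_ms)
{
	const struct timespec pause = { .tv_sec = 0, .tv_nsec = 1000000 };
	uint64_t deadline = now_ms() + timeout_ms;

	for (;;) {
		if (cond(arg))
			return true;
		if (now_ms() >= deadline)
			return false;
		nanosleep(&pause, NULL);
	}
}

/* Example condition: a flag that another thread is expected to set. */
static bool flag_is_set(void *arg)
{
	return atomic_load((atomic_bool *)arg);
}
```

With this shape, the timeout is only a worst-case bound: a test could pass
1000 or more instead of 100 and still complete almost immediately whenever
the other thread responds promptly.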

...and that said, I would argue you still need a reasonably controlled 
environment for the autotests. If you have a server that is arbitrarily
overloaded, maybe also with high memory pressure (and associated 
instruction page faults and god-knows-what), the real-world worst-case 
interruptions could be very long indeed. Seconds. Designing inherently 
racy tests for that kind of environment will make them have very long
run times.

> Could you please help make it more robust?
> 

I can send a patch, if Harry can't.
