On 2022-10-06 08:53, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com]
>> Sent: Wednesday, 5 October 2022 23.34
>>
>> On 2022-10-05 22:52, Thomas Monjalon wrote:
>>> 05/10/2022 22:33, Mattias Rönnblom:
>>>> On 2022-10-05 21:14, David Marchand wrote:
>>>>> Hello,
>>>>>
>>>>> The service_autotest unit test has been failing randomly.
>>>>> This is not something new.
> 
> [...]
> 
>>>>> EAL: Test assert service_may_be_active line 960 failed: Error: Service
>>>>> not stopped after 100ms
>>>>>
>>>>> Ideas?
>>>>>
>>>>>
>>>>> Thanks.
>>>>
>>>> Do you run the test suite in a controlled environment? I.e., one where
>>>> you can trust that the lcore threads aren't interrupted for long periods
>>>> of time.
>>>>
>>>> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
>>>> CPU with other threads.
>>>
>>> You mean the tests cannot be interrupted?
>>
>> I just took a very quick look, but it seems like the main thread can,
>> but the worker lcore thread cannot be interrupted for anything close to
>> 100 ms, or you risk a test failure.
>>
>>> Then it looks very fragile.
>>
>> Tests like this are by their very nature racy. If a test thread sends a
>> request to another thread, there is no way for it to decide when a
>> non-response should result in a test failure, unless the scheduling
>> latency of the receiving thread has an upper bound.
>>
>> If you grep for "sleep", or "delay", in app/test/test_*.c, you will get
>> a lot of matches. I bet there are more like the service core one, but
>> they allow for longer interruptions.
>>
>> That said, 100 ms sounds very short. I don't see why this can't be a
>> lot longer.
>>
>> ...and that said, I would argue you still need a reasonably controlled
>> environment for the autotests. If you have a server that is arbitrarily
>> overloaded, maybe also with high memory pressure (and associated
>> instruction page faults and god-knows-what), the real-world worst-case
>> interruptions could be very long indeed. Seconds. Designing inherently
>> racy tests for that kind of environment will make them have very long
>> run times.
> 
> Forgive me, if I am sidetracking a bit here... The issue discussed seems to 
> be related to some threads waiting for other threads, and my question is not 
> directly related to that.
> 
> I have been wondering how accurate the tests really are. Where can I see what 
> is being done to ensure that the EAL worker threads are fully isolated, and 
> never interrupted by the O/S scheduler or similar?
> 

There are kernel-level counters for how many times a thread has been 
involuntarily interrupted, and also, if I recall correctly, the amount 
of wall-clock time the thread has been runnable, but not running (i.e., 
waiting to be scheduled). The latter may require some scheduler debug 
kernel option being enabled on the kernel build.
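
As a rough sketch of how a thread could sample those counters on Linux: 
the code below assumes the per-thread "nonvoluntary_ctxt_switches" field 
in /proc/thread-self/status and the second field of 
/proc/thread-self/schedstat (run-queue wait time in ns, as far as I know 
only populated when scheduler info accounting is enabled). The function 
names are made up for illustration and are not DPDK APIs.

```c
#include <stdio.h>

/*
 * Involuntary context switches of the calling thread, or -1 if the
 * counter is unavailable. Assumes a Linux /proc with "thread-self".
 */
static long long involuntary_switches(void)
{
	char line[256];
	long long count = -1;
	FILE *f = fopen("/proc/thread-self/status", "r");

	if (f == NULL)
		return -1;

	while (fgets(line, sizeof(line), f) != NULL)
		if (sscanf(line, "nonvoluntary_ctxt_switches: %lld",
			   &count) == 1)
			break;

	fclose(f);
	return count;
}

/*
 * Time (ns) the calling thread has spent runnable but waiting on a run
 * queue. Returns 0 on success, -1 if the file is missing or malformed.
 */
static int runqueue_wait_ns(unsigned long long *wait_ns)
{
	unsigned long long run_ns, slices;
	FILE *f = fopen("/proc/thread-self/schedstat", "r");
	int n;

	if (f == NULL)
		return -1;

	n = fscanf(f, "%llu %llu %llu", &run_ns, wait_ns, &slices);
	fclose(f);

	return n == 3 ? 0 : -1;
}
```

A worker lcore could call both before and after the measured interval 
and report the deltas. A non-zero delta in either counter suggests the 
thread was not left alone during that interval.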

> For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If a NIC is 
> configured with 4096 Rx descriptors, packet loss will occur after ca. 70 us 
> (microseconds!) if not servicing the ingress queue when receiving at max 
> packet rate.
> 
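
For what it's worth, those figures can be reproduced with a few lines of 
C. The sketch below assumes minimum-size 64-byte frames plus 20 bytes of 
per-frame wire overhead (preamble, start-of-frame delimiter and 
inter-frame gap), which is my reading of where 59.52 M pkt/s comes from. 
The helper names are hypothetical.

```c
/* Per-frame wire overhead: 7 B preamble + 1 B SFD + 12 B inter-frame gap. */
#define WIRE_OVERHEAD_BYTES 20

/* Maximum packet rate (packets/s) on a link for a given frame size. */
static double max_pkt_rate(double link_bps, unsigned int frame_bytes)
{
	return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8.0);
}

/* Seconds until an unserviced Rx ring of n_desc descriptors is full. */
static double ring_fill_time_s(unsigned int n_desc, double pkt_rate)
{
	return n_desc / pkt_rate;
}
```

With max_pkt_rate(40e9, 64) this gives roughly 59.52 M pkt/s, and 
ring_fill_time_s(4096, ...) at that rate gives roughly 68.8 us, which 
matches the "ca. 70 us" above.
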
> I recently posted some code for monitoring the O/S noise in EAL worker 
> threads [1]. What should I do if I want to run that code in the automated 
> test environment? It would be for informational purposes only, i.e. I would 
> manually look at the test output to see the result.
> 
> I would write a test application that simply starts the O/S noise monitor 
> thread as an isolated EAL worker thread, the main thread would then wait for 
> 10 minutes (or some other duration), dump the result to the standard output, 
> and exit the application.
> 
> [1]: 
> https://protect2.fireeye.com/v1/url?k=31323334-501d5122-313273af-454445555731-2ff9afcfa197abb4&q=1&e=379fc3f1-046a-4ad8-a55d-5ebc6f63d4ff&u=http%3A%2F%2Finbox.dpdk.org%2Fdev%2F98CBD80474FA8B44BF855DF32C47DC35D87352%40smartserver.smartshare.dk%2F
> 
