Morten Brørup <m...@smartsharesystems.com> writes:

>> From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com]
>> Sent: Wednesday, 5 October 2022 23.34
>> 
>> On 2022-10-05 22:52, Thomas Monjalon wrote:
>> > 05/10/2022 22:33, Mattias Rönnblom:
>> >> On 2022-10-05 21:14, David Marchand wrote:
>> >>> Hello,
>> >>>
>> >>> The service_autotest unit test has been failing randomly.
>> >>> This is not something new.
>
> [...]
>
>> >>> EAL: Test assert service_may_be_active line 960 failed: Error: Service
>> >>> not stopped after 100ms
>> >>>
>> >>> Ideas?
>> >>>
>> >>>
>> >>> Thanks.
>> >>
>> >> Do you run the test suite in a controlled environment? I.e., one where
>> >> you can trust that the lcore threads aren't interrupted for long periods
>> >> of time.
>> >>
>> >> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
>> >> CPU with other threads.
>> >
>> > You mean the tests cannot be interrupted?
>> 
>> I just took a very quick look, but it seems like the main thread can,
>> but the worker lcore thread cannot be interrupted for anything close to
>> 100 ms, or you risk a test failure.
>> 
>> > Then it looks very fragile.
>> 
>> Tests like this are by their very nature racy. If a test thread sends a
>> request to another thread, there is no way for it to decide when a
>> non-response should result in a test failure, unless the scheduling
>> latency of the receiving thread has an upper bound.
>> 
>> If you grep for "sleep", or "delay", in app/test/test_*.c, you will get
>> a lot of matches. I bet there are more like the service core one, but
>> they allow for longer interruptions.
>> 
>> That said, 100 ms sounds very short. I don't see why this can't be a
>> lot longer.
>> 
>> ...and that said, I would argue you still need a reasonably controlled
>> environment for the autotests. If you have a server that is arbitrarily
>> overloaded, maybe also with high memory pressure (and associated
>> instruction page faults and god-knows-what), the real-world worst-case
>> interruptions could be very long indeed. Seconds. Designing inherently
>> racy tests for that kind of environment will make them have very long
>> run times.
>
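
The trade-off described above (a timeout long enough to absorb
scheduling delays, without slowing down the passing case) can be
sketched as a poll-with-deadline helper. This is only an illustration in
plain POSIX C, not the code in the service core test: wait_for_cond(),
poll_cond_fn and the countdown predicate are hypothetical names, and the
1 ms polling step is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical predicate type: returns true once the awaited state is seen. */
typedef bool (*poll_cond_fn)(void *arg);

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Poll cond(arg) about once per millisecond until it returns true or
 * timeout_ms has elapsed. Returns true if the condition was observed.
 * A passing run returns as soon as the condition holds, so a generous
 * timeout only costs time when the test is going to fail anyway.
 */
static bool wait_for_cond(poll_cond_fn cond, void *arg, unsigned int timeout_ms)
{
	const uint64_t deadline = now_ns() + (uint64_t)timeout_ms * 1000000ull;
	const struct timespec step = { .tv_sec = 0, .tv_nsec = 1000000 };

	for (;;) {
		if (cond(arg))
			return true;
		if (now_ns() >= deadline)
			return cond(arg); /* one last look after the deadline */
		nanosleep(&step, NULL);
	}
}

/* Example predicate: becomes true on the Nth call (stands in for a worker). */
struct countdown {
	int remaining;
};

static bool countdown_done(void *arg)
{
	struct countdown *c = arg;

	if (c->remaining > 0)
		c->remaining--;
	return c->remaining == 0;
}
```

With a helper like this, raising the limit from 100 ms to several
seconds would not lengthen a passing run; only a genuinely stuck worker
would pay the full timeout.
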
> Forgive me, if I am sidetracking a bit here... The issue discussed
> seems to be related to some threads waiting for other threads, and my
> question is not directly related to that.
>
> I have been wondering how accurate the tests really are. Where can I
> see what is being done to ensure that the EAL worker threads are fully
> isolated, and never interrupted by the O/S scheduler or similar?
>
> For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If a
> NIC is configured with 4096 Rx descriptors, packet loss will occur
> after ca. 70 us (microseconds!) if not servicing the ingress queue
> when receiving at max packet rate.
>
> I recently posted some code for monitoring the O/S noise in EAL worker
> threads [1]. What should I do if I want to run that code in the
> automated test environment? It would be for informational purposes
> only, i.e. I would manually look at the test output to see the result.
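
For reference, the figures quoted above can be reproduced with a few
lines of C. The sketch assumes minimum-size 64-byte Ethernet frames plus
the usual 20 bytes of per-frame overhead on the wire (8 bytes preamble
and start-of-frame delimiter, 12 bytes inter-frame gap); the function
names are made up for illustration.

```c
#include <stdint.h>

/* Bits on the wire per minimum-size frame: 64 B frame + 8 B preamble/SFD
 * + 12 B inter-frame gap = 84 B = 672 bits. */
#define MIN_FRAME_WIRE_BITS ((64 + 8 + 12) * 8)

/* Maximum packet rate (packets per second) for a given link speed. */
static double max_pkt_rate(double link_bps)
{
	return link_bps / MIN_FRAME_WIRE_BITS;
}

/* Time in microseconds until n_desc Rx descriptors are used up when
 * packets arrive at the maximum rate and the queue is not serviced. */
static double ring_fill_time_us(double link_bps, unsigned int n_desc)
{
	return n_desc / max_pkt_rate(link_bps) * 1e6;
}
```

max_pkt_rate(40e9) comes out at roughly 59.52 M pkt/s, and
ring_fill_time_us(40e9, 4096) at a little under 69 us, matching the
"ca. 70 us" above.
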

One hacky way is to post a patch that introduces your test case, clearly
marked as not to be merged, and then look at the logs.

> I would write a test application that simply starts the O/S noise
> monitor thread as an isolated EAL worker thread, the main thread would
> then wait for 10 minutes (or some other duration), dump the result to
> the standard output, and exit the application.
>
> [1]: http://inbox.dpdk.org/dev/98cbd80474fa8b44bf855df32c47dc35d87...@smartserver.smartshare.dk/
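
As a rough idea of what such a monitor measures, here is a minimal
sketch in plain POSIX C. It is not the code posted in [1]: it simply
spins on the monotonic clock and records the largest gap between two
consecutive readings, which on an otherwise idle, isolated core should
stay small. measure_noise() and struct noise_result are hypothetical
names, and running it pinned to an isolated worker lcore is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Result of one measurement run. */
struct noise_result {
	uint64_t iterations; /* number of clock readings taken in the loop */
	uint64_t max_gap_ns; /* largest gap between consecutive readings */
};

/* Monotonic clock reading in nanoseconds. */
static uint64_t mono_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Busy-loop for duration_ns and track the worst-case gap between two
 * back-to-back clock readings. A large gap means the thread was not
 * running for that long (preemption, interrupts, page faults, ...).
 */
static struct noise_result measure_noise(uint64_t duration_ns)
{
	struct noise_result r = { 0, 0 };
	const uint64_t start = mono_ns();
	uint64_t prev = start;

	for (;;) {
		const uint64_t now = mono_ns();
		const uint64_t gap = now - prev;

		if (gap > r.max_gap_ns)
			r.max_gap_ns = gap;
		r.iterations++;
		prev = now;
		if (now - start >= duration_ns)
			break;
	}
	return r;
}
```

A test application along the lines described above would call something
like measure_noise() from the worker for the chosen duration and print
max_gap_ns from the main thread before exiting.
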
