On 2022-10-06 08:53, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com]
>> Sent: Wednesday, 5 October 2022 23.34
>>
>> On 2022-10-05 22:52, Thomas Monjalon wrote:
>>> 05/10/2022 22:33, Mattias Rönnblom:
>>>> On 2022-10-05 21:14, David Marchand wrote:
>>>>> Hello,
>>>>>
>>>>> The service_autotest unit test has been failing randomly.
>>>>> This is not something new.
>
> [...]
>
>>>>> EAL: Test assert service_may_be_active line 960 failed: Error: Service
>>>>> not stopped after 100ms
>>>>>
>>>>> Ideas?
>>>>>
>>>>>
>>>>> Thanks.
>>>>
>>>> Do you run the test suite in a controlled environment? I.e., one where
>>>> you can trust that the lcore threads aren't interrupted for long periods
>>>> of time.
>>>>
>>>> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
>>>> CPU with other threads.
>>>
>>> You mean the tests cannot be interrupted?
>>
>> I just took a very quick look, but it seems like the main thread can,
>> but the worker lcore thread cannot be interrupted for anything close to
>> 100 ms, or you risk a test failure.
>>
>>> Then it looks very fragile.
>>
>> Tests like this are by their very nature racey. If a test thread sends a
>> request to another thread, there is no way for it to decide when a
>> non-response should result in a test failure, unless the scheduling
>> latency of the receiving thread has an upper bound.
>>
>> If you grep for "sleep", or "delay", in app/test/test_*.c, you will get
>> a lot of matches. I bet there are more like the service core one, but they
>> allow for longer interruptions.
>>
>> That said, 100 ms sounds very short. I don't see why this can't be a
>> lot longer.
>>
>> ...and that said, I would argue you still need a reasonably controlled
>> environment for the autotests.
>> If you have a server that is arbitrarily
>> overloaded, maybe also with high memory pressure (and associated
>> instruction page faults and god-knows-what), the real-world worst-case
>> interruptions could be very long indeed. Seconds. Designing inherently
>> racey tests for that kind of environment will make them have very long
>> run times.
>
> Forgive me, if I am sidetracking a bit here... The issue discussed seems to
> be related to some threads waiting for other threads, and my question is not
> directly related to that.
>
> I have been wondering how accurate the tests really are. Where can I see what
> is being done to ensure that the EAL worker threads are fully isolated, and
> never interrupted by the O/S scheduler or similar?
>

There are kernel-level counters for how many times a thread has been
involuntarily interrupted, and also, if I recall correctly, the amount
of wall-time the thread has been runnable, but not running (i.e.,
waiting to be scheduled). The latter may require some scheduler debug
kernel option to be enabled in the kernel build.

> For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If a NIC is
> configured with 4096 Rx descriptors, packet loss will occur after ca. 70 us
> (microseconds!) if not servicing the ingress queue when receiving at max
> packet rate.
>
> I recently posted some code for monitoring the O/S noise in EAL worker
> threads [1]. What should I do if I want to run that code in the automated
> test environment? It would be for informational purposes only, i.e. I would
> manually look at the test output to see the result.
>
> I would write a test application that simply starts the O/S noise monitor
> thread as an isolated EAL worker thread, the main thread would then wait for
> 10 minutes (or some other duration), dump the result to the standard output,
> and exit the application.
>
> [1]:
> https://protect2.fireeye.com/v1/url?k=31323334-501d5122-313273af-454445555731-2ff9afcfa197abb4&q=1&e=379fc3f1-046a-4ad8-a55d-5ebc6f63d4ff&u=http%3A%2F%2Finbox.dpdk.org%2Fdev%2F98CBD80474FA8B44BF855DF32C47DC35D87352%40smartserver.smartshare.dk%2F
>
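
For illustration, here is a minimal sketch of how a thread could sample
the kernel counters mentioned above for itself, assuming Linux. The
names struct sched_noise, parse_schedstat() and sched_noise_sample() are
hypothetical and not part of DPDK. The involuntary context switch count
comes from getrusage(RUSAGE_THREAD). The runnable-but-not-running time is
read from /proc/thread-self/schedstat, which I assume here to hold three
numbers (on-CPU time in ns, run-queue wait time in ns, timeslice count);
the file may be missing or report zeros depending on the kernel
configuration, so the sketch treats it as optional.

```c
#define _GNU_SOURCE /* for RUSAGE_THREAD */
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/resource.h>

/* Snapshot of per-thread scheduling counters (hypothetical helper type). */
struct sched_noise {
	long invol_switches; /* involuntary context switches so far */
	uint64_t run_ns;     /* time spent running on a CPU, in ns */
	uint64_t wait_ns;    /* time runnable but waiting for a CPU, in ns */
	int have_schedstat;  /* 0 if schedstat could not be read */
};

/* Parse one line assumed to look like "<run_ns> <wait_ns> <timeslices>". */
static int
parse_schedstat(const char *line, uint64_t *run_ns, uint64_t *wait_ns)
{
	uint64_t slices;

	assert(line != NULL);
	if (sscanf(line, "%" SCNu64 " %" SCNu64 " %" SCNu64,
		   run_ns, wait_ns, &slices) != 3)
		return -1;
	return 0;
}

/* Sample the counters of the calling thread. Returns 0 on success. */
static int
sched_noise_sample(struct sched_noise *s)
{
	struct rusage ru;
	char line[128];
	FILE *f;

	if (getrusage(RUSAGE_THREAD, &ru) != 0)
		return -1;
	s->invol_switches = ru.ru_nivcsw;

	/* schedstat is optional: absence is not treated as an error. */
	s->run_ns = 0;
	s->wait_ns = 0;
	s->have_schedstat = 0;
	f = fopen("/proc/thread-self/schedstat", "r");
	if (f != NULL) {
		if (fgets(line, sizeof(line), f) != NULL &&
		    parse_schedstat(line, &s->run_ns, &s->wait_ns) == 0)
			s->have_schedstat = 1;
		fclose(f);
	}
	return 0;
}
```

A worker lcore function could call sched_noise_sample() once when it
starts and once before it returns, and print the difference in
invol_switches and wait_ns. Nonzero deltas on a supposedly isolated core
would suggest the thread was preempted during the run.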