Morten Brørup <m...@smartsharesystems.com> writes:

>> From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com]
>> Sent: Wednesday, 5 October 2022 23.34
>>
>> On 2022-10-05 22:52, Thomas Monjalon wrote:
>> > 05/10/2022 22:33, Mattias Rönnblom:
>> >> On 2022-10-05 21:14, David Marchand wrote:
>> >>> Hello,
>> >>>
>> >>> The service_autotest unit test has been failing randomly.
>> >>> This is not something new.
>
> [...]
>
>> >>> EAL: Test assert service_may_be_active line 960 failed: Error:
>> >>> Service not stopped after 100ms
>> >>>
>> >>> Ideas?
>> >>>
>> >>> Thanks.
>> >>
>> >> Do you run the test suite in a controlled environment? I.e., one
>> >> where you can trust that the lcore threads aren't interrupted for
>> >> long periods of time.
>> >>
>> >> 100 ms is not a long time if a SCHED_OTHER lcore thread competes
>> >> for the CPU with other threads.
>> >
>> > You mean the tests cannot be interrupted?
>>
>> I just took a very quick look, but it seems like the main thread can,
>> while the worker lcore thread cannot be interrupted for anything
>> close to 100 ms without risking a test failure.
>>
>> > Then it looks very fragile.
>>
>> Tests like this are by their very nature racy. If a test thread sends
>> a request to another thread, there is no way for it to decide when a
>> non-response should result in a test failure, unless the scheduling
>> latency of the receiving thread has an upper bound.
>>
>> If you grep for "sleep" or "delay" in app/test/test_*.c, you will get
>> a lot of matches. I bet there are more tests like the service core
>> one, but they allow for longer interruptions.
>>
>> That said, 100 ms sounds very short. I don't see why this couldn't be
>> a lot longer.
>>
>> ...and that said, I would argue you still need a reasonably
>> controlled environment for the autotests.
>> If you have a server that is arbitrarily overloaded, maybe also with
>> high memory pressure (and the associated instruction page faults and
>> god-knows-what), the real-world worst-case interruptions could be
>> very long indeed. Seconds. Designing inherently racy tests for that
>> kind of environment will make them have very long run times.
>
> Forgive me if I am sidetracking a bit here... The issue discussed
> seems to be related to some threads waiting for other threads, and my
> question is not directly related to that.
>
> I have been wondering how accurate the tests really are. Where can I
> see what is being done to ensure that the EAL worker threads are
> fully isolated, and never interrupted by the O/S scheduler or
> similar?
>
> For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If
> a NIC is configured with 4096 Rx descriptors, packet loss will occur
> after ca. 70 us (microseconds!) if the ingress queue is not serviced
> while receiving at the max packet rate.
>
> I recently posted some code for monitoring the O/S noise in EAL
> worker threads [1]. What should I do if I want to run that code in
> the automated test environment? It would be for informational
> purposes only, i.e. I would manually look at the test output to see
> the result.
One hacky way is to post a PATCH, marked as one that should never be
merged, that introduces your test case, and then look at the logs.

> I would write a test application that simply starts the O/S noise
> monitor thread as an isolated EAL worker thread; the main thread
> would then wait for 10 minutes (or some other duration), dump the
> result to standard output, and exit the application.
>
> [1]: http://inbox.dpdk.org/dev/98cbd80474fa8b44bf855df32c47dc35d87...@smartserver.smartshare.dk/