On Thu, Oct 3, 2024 at 5:50 PM Van Haaren, Harry <harry.van.haa...@intel.com> wrote:
>
> > From: David Marchand <david.march...@redhat.com>
> > Sent: Thursday, October 3, 2024 10:13 AM
> > To: Mattias Rönnblom <mattias.ronnb...@ericsson.com>; Van Haaren, Harry
> > <harry.van.haa...@intel.com>
> > Cc: dev@dpdk.org <dev@dpdk.org>; step...@networkplumber.org
> > <step...@networkplumber.org>; suanmi...@nvidia.com <suanmi...@nvidia.com>;
> > tho...@monjalon.net <tho...@monjalon.net>; sta...@dpdk.org
> > <sta...@dpdk.org>; Tyler Retzlaff <roret...@linux.microsoft.com>; Aaron
> > Conole <acon...@redhat.com>
> > Subject: Re: [PATCH v2] service: fix deadlock on worker lcore exit
> >
> > On Thu, Oct 3, 2024 at 8:57 AM David Marchand <david.march...@redhat.com>
> > wrote:
> > >
> > > From: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
> > >
> > > Calling rte_exit() from a worker lcore thread causes a deadlock in
> > > rte_service_finalize().
> > >
> > > This patch makes rte_service_finalize() deadlock-free by avoiding the
> > > need to synchronize with service lcore threads, which in turn is
> > > achieved by moving service and per-lcore state from the heap to being
> > > statically allocated.
> > >
> > > The BSS segment increases with ~156 kB (on x86_64 with default
> > > RTE_MAX_LCORE and RTE_SERVICE_NUM_MAX).
> > >
> > > According to the service perf autotest, this change also results in a
> > > slight reduction of service framework overhead.
> > >
> > > Fixes: 33666b448f15 ("service: fix crash on exit")
> > > Cc: sta...@dpdk.org
> > >
> > > Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
> > > Acked-by: Tyler Retzlaff <roret...@linux.microsoft.com>
> > > ---
> > > Changes since v1:
> > > - rebased,
> >
> > I can't merge this patch in its current state.
> >
> > At the moment, two CI report a problem with the
> > eal_flags_file_prefix_autotest unit test.
> >
> > -------------------------------------stdout-------------------------------------
> > RTE>>eal_flags_file_prefix_autotest
> > Running binary with argv[]:'/home/zhoumin/gh_dpdk/build/app/dpdk-test'
> > '--proc-type=secondary' '-m' '18' '--file-prefix=memtest'
> > Running binary with argv[]:'/home/zhoumin/gh_dpdk/build/app/dpdk-test'
> > '-m' '18' '--file-prefix=memtest1'
> > Error - hugepage files for memtest1 were not deleted!
> > Test Failed
> > RTE>>
> >
> > Can you have a look?
>
> Not sure how the code change in question relates to the eal-flags
> failure, but I can reproduce the failure here.
> Reproducing the issue on *all* of the below tags; this indicates it's likely a
> board-config issue, and not a true issue (unless it's been there since
> 23.11??).
>
> Tested commits were all bad:
> b3485f4293 (HEAD, tag: v24.07) version: 24.07.0
> a9778aad62 (HEAD, tag: v24.03) version: 24.03.0
> eeb0605f11 (HEAD, tag: v23.11) version: 23.11.0
>
> So I'm pretty sure this is a board/runner config issue, with the error output
> as follows here:
> RTE>>eal_flags_file_prefix_autotest
> Running binary with argv[]:'./app/test/dpdk-test' '--proc-type=secondary'
> '-m' '18' '--file-prefix=memtest'
> EAL: Detected CPU lcores: 64
> EAL: Detected NUMA nodes: 2
> EAL: Detected static linkage of DPDK
> EAL: Cannot open '/var/run/dpdk/memtest/config' for rte_mem_config
> EAL: FATAL: Cannot init config
> EAL: Cannot init config
>
> FAIL:
> DPDK_TEST=eal_flags_file_prefix_autotest ./app/test/dpdk-test --no-pci
>
> PASS:
> DPDK_TEST=eal_flags_file_prefix_autotest ./app/test/dpdk-test
>
> So seems like the eal-flags test is NOT able to handle args like "--no-pci"?
> I tend to run tests in no PCI mode to speed up things :)

Well, speeding up, or hiding the issue, I guess.

> In short, this service-cores patch is not the root cause. Perhaps some of the
> CI folks can confirm if there are extra args passed to the runner?

To be clear, I can't merge this patch because of this (systematic) failure
in many CI environments (GHA, LoongArch, UNH).

Adding the CI mailing list to the loop.

--
David Marchand