> -----Original Message-----
> From: Aaron Conole <acon...@redhat.com>
> Sent: Tuesday, February 4, 2020 2:51 PM
> To: David Marchand <david.march...@redhat.com>
> Cc: Van Haaren, Harry <harry.van.haa...@intel.com>; dev <dev@dpdk.org>
> Subject: Re: [RFC] service: stop lcore threads before 'finalize'
>
> David Marchand <david.march...@redhat.com> writes:
>
> > On Fri, Jan 17, 2020 at 9:17 AM David Marchand
> > <david.march...@redhat.com> wrote:
> >>
> >> On Thu, Jan 16, 2020 at 8:50 PM Aaron Conole <acon...@redhat.com> wrote:
> >> >
> >> > I've noticed an occasional segfault from the build system in the
> >> > service_autotest and after talking with David (CC'd), it seems like it's
> >> > due to the rte_service_finalize deleting the lcore_states object while
> >> > active lcores are running.
> >> >
> >> > The below patch is an attempt to solve it by first reassigning all the
> >> > lcores back to ROLE_RTE before releasing the memory. There is probably
> >> > a larger question for DPDK proper about actually closing the pending
> >> > lcore threads, but that's a separate issue. I've been running with the
> >> > patch for a while, and haven't seen the crash anymore on my system.
> >> >
> >> > Thoughts? Is it acceptable as-is?
> >>
> >> Added this patch to my env, still reproducing the same issue after ~10-20 tries.
> >> I added a breakpoint to service_lcore_uninit that is indeed caught
> >> when exiting the test application (just wanted to make sure your
> >> change was in my binary).
> >
> > Harry,
> >
> > We need a fix for this issue.
>
> +1

Hi All,

> > Interestingly, Stephen's patch that joins all pthreads at
> > rte_eal_cleanup [1] makes this issue disappear.
> > So my understanding is that we are missing an API (well, I could not
> > find a way) to synchronously stop service lcores.
>
> Maybe we can take that patch as a fix. I hate to see this segfault
> in the field. I need to figure out what I missed in my cleanup
> (probably missed a synchronization point).

I haven't been able to reproduce this easily yet, so I'll look for a way
to reproduce it at close to a 100% rate. Then we can identify the root
cause and get a clean fix.

If you have pointers on how to reproduce it easily, please let me know.

-H

> > 1: https://patchwork.dpdk.org/patch/64201/
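
For reference, here is a minimal sketch of the ordering being discussed in this thread: ask the worker to stop, join it, and only then free the shared state. It uses plain pthreads rather than the DPDK service-core API, and every name in it (struct lcore_state, demo_start, demo_finalize, run_requested) is a hypothetical stand-in, not the actual rte_service code. It only illustrates why joining the thread before the free closes the window in which a worker can dereference freed memory.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-lcore state released at finalize time. */
struct lcore_state {
    atomic_int run_requested;   /* cleared by the main thread to request a stop */
    atomic_ulong loops;         /* touched by the worker on every iteration */
};

/* Shared heap object, playing the role of lcore_states in the discussion. */
static struct lcore_state *states;

/* Worker loop: keeps dereferencing the shared state until asked to stop. */
static void *service_worker(void *arg)
{
    (void)arg;
    while (atomic_load(&states->run_requested))
        atomic_fetch_add(&states->loops, 1);
    return NULL;
}

/* Allocate the shared state and start one worker thread. */
int demo_start(pthread_t *tid)
{
    states = calloc(1, sizeof(*states));
    if (states == NULL)
        return -1;
    atomic_init(&states->run_requested, 1);
    atomic_init(&states->loops, 0);
    if (pthread_create(tid, NULL, service_worker, NULL) != 0) {
        free(states);
        states = NULL;
        return -1;
    }
    return 0;
}

/*
 * Finalize in the safe order: request the stop, wait for the worker to
 * exit, then free. Freeing before the join is the use-after-free window.
 */
int demo_finalize(pthread_t tid)
{
    assert(states != NULL);
    atomic_store(&states->run_requested, 0);
    if (pthread_join(tid, NULL) != 0)
        return -1;
    free(states);
    states = NULL;
    return 0;
}
```

Calling demo_start() followed by demo_finalize() from a test's main thread exercises the whole sequence. Swapping the free() ahead of the pthread_join() in demo_finalize() would reintroduce the kind of race described above, although whether it actually crashes depends on timing.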