"Van Haaren, Harry" <harry.van.haa...@intel.com> writes:

> Hi Aaron,
>
>> -----Original Message-----
>> From: Aaron Conole <acon...@redhat.com>
>> Sent: Monday, November 25, 2019 10:54 PM
>> To: Thomas Monjalon <tho...@monjalon.net>
>> Cc: Van Haaren, Harry <harry.van.haa...@intel.com>; Amber, Kumar
>> <kumar.am...@intel.com>; dev@dpdk.org; Wang, Yipeng1
>> <yipeng1.w...@intel.com>; Yigit, Ferruh <ferruh.yi...@intel.com>; Thakur,
>> Sham Singh <sham.singh.tha...@intel.com>; David Marchand
>> <dmarc...@redhat.com>
>> Subject: Re: [dpdk-dev] [PATCH v3] hash: added a new API to hash to query
>> key id
>>
>> Aaron Conole <acon...@redhat.com> writes:
>>
>> > Thomas Monjalon <tho...@monjalon.net> writes:
>> >
>> >>> From: Aaron Conole <acon...@redhat.com>
>> >>> > -     if (!service_valid(id))
>> >>> > +     if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
>> >>
>> >> Why not adding this check in service_valid()?
>> >
>> > I think the best fix is to use SERVICE_VALID_GET_OR_ERR_RET() in these
>> > places. For this, I at least want to try and show that there aren't any
>> > further errors. And my test loop has been running for a while now
>> > without any more errors or segfaults, so I guess it's okay to build a
>> > proper patch.
>>
>> This popped up:
>>
>> EAL: Test assert service_lcore_en_dis_able line 487 failed: Ex-service core
>> function call had no effect.
>>
>> So I'll spend some time in this area, it seems.
>
>
> The below diff makes it 100% reproducible here, failing every time.
>
> It seems like the main thread is returning, before the service thread has
> returned.
>
> The rte_eal_mp_wait_lcore() call seems to not wait on the service-core, which
> allows the main thread to read the "service_remote_launch_flag" value as 0
> (before the service-thread writes it to 1).
>
> Adding the delay between the service launch and service write being performed
> makes this issue much much more likely to occur - so the above description I
> have confidence in.
>
> What I'm not clear on (yet) is why the eal_mp_wait_lcore() isn't waiting...

As I wrote in the other thread, it's because eal_mp_wait_lcore won't
look at lcores with ROLE_SERVICE.

> -H

I've been running something similar to the suggested patch for 24
minutes now with no failure. I've also removed the eal_mp_wait_lcore()
call in other areas throughout the test and switched to individual core
waiting "just in case."

I don't think it's the right fix, though.
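
For illustration only, here is a minimal sketch of the race described above. It uses plain pthreads, not DPDK, and every name in it (demo_role, demo_lcore, demo_wait_all, run_demo) is hypothetical. It assumes only what the thread states: a bulk wait that skips service-role threads lets the caller read the flag before the delayed thread has written it, while waiting on that one thread individually does not.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical roles, standing in for the lcore roles named in the thread. */
enum demo_role { DEMO_ROLE_WORKER, DEMO_ROLE_SERVICE };

struct demo_lcore {
    pthread_t thread;
    enum demo_role role;
    atomic_int launch_flag; /* plays the part of service_remote_launch_flag */
};

/* Thread body: sleep first so the reader can easily win the race. */
static void *demo_lcore_fn(void *arg)
{
    struct demo_lcore *lc = arg;

    usleep(100 * 1000);
    atomic_store(&lc->launch_flag, 1);
    return NULL;
}

/*
 * Bulk wait that skips service-role threads (an assumption modelled on the
 * behaviour described above). Returns 1 if joined, 0 if skipped.
 */
static int demo_wait_all(struct demo_lcore *lc)
{
    if (lc->role == DEMO_ROLE_SERVICE)
        return 0;
    pthread_join(lc->thread, NULL);
    return 1;
}

/* Launch one service-role thread, wait, and return the flag value as read. */
static int run_demo(int wait_individually)
{
    struct demo_lcore lc = { .role = DEMO_ROLE_SERVICE };
    int joined, seen;

    atomic_init(&lc.launch_flag, 0);
    if (pthread_create(&lc.thread, NULL, demo_lcore_fn, &lc) != 0)
        return -1;

    if (wait_individually) {
        pthread_join(lc.thread, NULL); /* wait on this specific thread */
        joined = 1;
    } else {
        joined = demo_wait_all(&lc);   /* skips the service-role thread */
    }

    seen = atomic_load(&lc.launch_flag);

    if (!joined)
        pthread_join(lc.thread, NULL); /* clean up before lc leaves scope */
    return seen;
}
```

Calling run_demo(1) always reads the flag after the join, so it sees the write. Calling run_demo(0) reads the flag while the thread is most likely still sleeping, which is the same window the added delay opens in the test. Build with -pthread.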