> -----Original Message----- > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Neil Horman > Sent: Friday, September 26, 2014 4:02 PM > To: Wodkowski, PawelX > Cc: dev at dpdk.org > Subject: Re: [dpdk-dev] [PATCH v2] Change alarm cancel function to > thread-safe: > > On Fri, Sep 26, 2014 at 02:01:05PM +0000, Wodkowski, PawelX wrote: > > > > > Maybe I don't see something obvious? :) > > > > > > I think you're missing the fact that your patch doesn't do what you > > > assert above > > > either :) > > > > Issue is not in setting alarms but canceling it. If you look closer to my > > patch you > > see that it address this issue (look at added *do { lock(); ....; unlock(); > > } while( )* > > part). > > > I get where the issue is, and I'm looking at your patch. I see that you did > some locking there. The issue I'm pointing out is that, if you call > rte_eal_alarm_cancel on an alarm callback, you will exit the alarm_cancel > function with, by definition, one alarm executing (the one you are currently > running). You're patch works perfectly for the case where another thread > calls > cancel, in that it waits until the executing alarm is complete, but it doesn't > work in the case where you are calling it from within the alarm callback.
Hm, and why do we need it from alarm callback? After cb_func() is finished given alarm entry will be removed anyway. > If you're goal is to guarantee that all the matching alarms are cancelled and > complete, you haven't done that, because the recursive state is still > unhandled. > > > > > > > First, lets address rte_alarm_set. There is no notion of "re-arming" in > > > this > > > alarm implementation, because theres no ability to refer to a specific > > > alarm > > > from the callers perspective. When you call rte_eal_alarm_set you get a > > > new > > > alarm every time. So I don't really see a race there. It might not be > > > exactly > > > the behavior you want, but its not a race, becuase you're not modifying an > > > alarm > > > in the middle of execution, you're just creating a new alarm, which is > > > safe. > > > > OK, it is safe, but this is not the case. > > > I don't know what you mean by this. We agree its safe, great. But it is the > case as I've described it, you can see it from the implementation, every call > to > rte_eal_alarm_set starts with a malloc of a new alarm structure. > > > > > > > There is a race in what you describe above, insofar as its possible that > > > you > > > might call rte_eal_alarm_cancel and return without having canceled all the > > > matching alarms. I don't see any clear documentation on what the > > > behavior is > > > supposed to be, but if you want to ensure that all matching alarms are > > > cancelled > > > or complete on return from rte_eal_alarm_cancel, thats perfectly fine (in > > > linux > > > API parlance, thats usually denoted as a cancel_sync operation). > > > > Again, look at the patch. I changed documentation to inform about this > > behavior. > > > > This is the documentation included in the patch: > Change alarm cancel function to thread-safe. > It eliminates a race between threads using rte_alarm_cancel and > rte_alarm_set. > > neither have you compeltely described the race condition (though you now have > previously in this thread), nor have you completely addressed it (calling > rte_eal_alarm_cancel and rte_eal_alarm_set still behaves exactly as it did > previously with a 2nd thread). > > > > > > > For that race condition, you're correct, my patch doesn't address it, I > > > see that > > > now. Though your patch doesn't either. If you call rte_eal_alarm_cancel > > > from > > > within a callback function, then, by definition, you can't wait on the > > > completion of the active alarm, because thats a deadlock. Its a > > > necessecary > > > evil, I grant you, but it means that you can't be guaranteed the > > > cancelled and > > > complete (cancel_sync) behavior that you want, at least not with the > > > current > > > api. If you want that behavior, you need to do one of two things: > > > > This patch does not break any API. It only removes undefined behavior. > > > I never said it did break ABI. I said that to completely fix it you would > have > to break ABI. And it doesn't really remove undefined behavior, because you > still have the old behavior in the recursive case (which you may be ok with, I > don't know, but if you really want to address the behavior, you should address > this aspect of it). > > > > > > > 1) Modify the api to allow callers to individually reference timer > > > instances, so > > > that when cancelling, we can return an appropriate return code to > > > indicate to > > > the caller that this alarm is in-progress. That way you can guarantee the > > > caller that the specific alarm that you cancelled is either complete and > > > cancelled > > > or currently executing. Add an api to expicitly wait on a referenced > > > alarm as > > > well. This allows developers to know that, when executing an alarm > > > callback, an > > > -ECURRENTLYEXECUTING return code is ok, because they are in the currently > > > executing context. > > > > This would brake API for sure. > Yes, it would. Bruce Richardson just made a major ABI break with his mbuf > cleanup set. If there was a time to change ABI here, now would be the time I > think. Ok, too many words for me, to be honest :) Can I summarise: As I remember the purpose of the patch was to fix the race condition inside rte_alarm library. I believe that the patch provided by Michal & Pawel fixes the issues you discovered. If you think, that is not the case, could you please provide a list of remaining issues? Excluding ones that you just don't like it, and you are not happy with rte_alarm API in total? If you have any concerns about mbuf reorg/expansion - probably better to contact Bruce and express them. Not to use it as an argument for breaking any existing API without really good reason behind. Konstantin