On 2020-02-28 06:57, Venky Venkatesh wrote:
> Hi,
> This is concerning the DSW PMD for eventdev. In our application we put
> some debugs to check that the same ATOMIC flow isn't scheduled to 2
> different cores. Strangely enough, we hit it. Not sure if we are
> missing something or if it is a bug.
>
> We put some instrumentation code inside DSW and found the following:
>
> * Time T1: flow1 is on core1
> * Time T2: core1 decides to migrate flow1 to core2 (using
>   seen_events etc.)
> * Time T3: core1 completes migration of flow1 to core2
> * Time T4: core3 sends flow1 pkts (PKT1) to core2
> * Time T5: core1 wants to do another migration and again picks
>   flow1!!! flow1 is owned by core2 at this time
> * Time T6-T7: core1 migrates flow1 to core4. core2 happily
>   acknowledges core1's attempt to migrate flow1, since it just has to
>   pause, unpause and flush, while PKT1 is still in core2. There is no
>   error check to see if someone else is migrating my flow :)
> * Time T8: core3 sends flow1 pkt (PKT2) -- this time to core4. Now
>   we have the problem of PKT1 in core2 and PKT2 in core4
>
> I think what went wrong was that core1 was using a stale seen_events
> which had the old entries from the previous migration. During
> dsw_port_end_migration we only reset seen_events_len, but not
> seen_events_idx. So while the latest events (since the last migration)
> are towards the end of port->seen_events (since seen_events_idx is
> pointing there), all the candidate flow finding may end up looking at
> 0 - seen_events_len: thus using stale information from a past
> migration.

This indeed looks like a bug, although I would propose another fix. The
value of the idx doesn't really matter (it's just a circular buffer), as
long as you do not initiate a new migration until you have seen/recorded
enough events. In fact, that's how I thought it worked already, but I
can't find anything in dsw_port_consider_migration() that aborts the
migration in case not enough events have been seen.
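To show the kind of check I mean, here is a minimal, self-contained
sketch. It only mimics the relevant port fields (seen_events,
seen_events_idx, seen_events_len); the buffer size constant and the
function names are made up for the example, so treat it as an
illustration rather than the actual driver code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_EVENTS_RECORDED 128 /* stand-in for the real record size */

struct toy_port {
	uint32_t seen_events[MAX_EVENTS_RECORDED]; /* circular buffer of flow hashes */
	uint16_t seen_events_idx; /* next slot to write */
	uint16_t seen_events_len; /* valid entries since the last migration */
};

/* Record one seen event; idx wraps around, len saturates. */
static void
toy_record_seen_event(struct toy_port *p, uint32_t flow_hash)
{
	p->seen_events[p->seen_events_idx] = flow_hash;
	p->seen_events_idx = (p->seen_events_idx + 1) % MAX_EVENTS_RECORDED;
	if (p->seen_events_len < MAX_EVENTS_RECORDED)
		p->seen_events_len++;
}

/* The missing guard: only consider a migration once a full window of
 * events has been recorded since the previous migration ended, so the
 * candidate selection never operates on stale (pre-migration) entries.
 */
static bool
toy_may_consider_migration(const struct toy_port *p)
{
	return p->seen_events_len == MAX_EVENTS_RECORDED;
}

int
main(void)
{
	/* As if a migration just ended: len reset, idx left mid-buffer. */
	struct toy_port p = { .seen_events_idx = 100 };
	uint32_t i;

	for (i = 0; i < 3; i++)
		toy_record_seen_event(&p, i);

	/* Only 3 events recorded since the reset -> no migration yet. */
	printf("may consider migration: %s\n",
	       toy_may_consider_migration(&p) ? "yes" : "no");

	return 0;
}

With a check like that in place, the leftover value of seen_events_idx
is harmless, since the whole window is guaranteed to have been rewritten
before it is looked at again.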
That said, you must have pretty expensive computation if you recorded
fewer than 128 events in one ms (which is the maximum DSW migration
rate) and still need to migrate flows away (since the load is high).
That's probably why we haven't triggered the bug.

I'll make a patch. Thanks!

> Pls confirm if the above findings are correct.
>
> I tried the following and it seems to fix the problem. Pls let me know
> if this is the way to fix the issue:
>
> diff -up ./drivers/event/dsw/dsw_event.c.original ./drivers/event/dsw/dsw_event.c
> --- ./drivers/event/dsw/dsw_event.c.original	2019-09-05 16:39:52.662976000 -0700
> +++ ./drivers/event/dsw/dsw_event.c	2020-02-27 14:49:14.320156000 -0800
> @@ -627,6 +641,7 @@ dsw_port_end_migration(struct dsw_evdev
>  	port->migration_state = DSW_MIGRATION_STATE_IDLE;
>  	port->seen_events_len = 0;
> +	port->seen_events_idx = 0;
>  	dsw_port_migration_stats(port);
>
> Thanks
> -Venky
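As a footnote to the diff above: in the toy model from my earlier
sketch, the difference between the two possible fixes can be summed up
as below. The function names are again invented for illustration; the
real dsw_port_end_migration() and the candidate selection in
dsw_event.c are of course more involved:

/* Candidate lookup as described in the report: it only ever looks at
 * slots [0, seen_events_len), i.e. the front of the buffer.
 */
static uint32_t
toy_first_candidate(const struct toy_port *p)
{
	return p->seen_events_len > 0 ? p->seen_events[0] : 0;
}

/* End-of-migration reset. Clearing only seen_events_len leaves
 * seen_events_idx pointing into the middle of the buffer, so new
 * events land at [idx, ...) while the lookup above still reads the
 * stale entries at the front. Clearing seen_events_idx as well (the
 * one-line diff above) makes the write position and the lookup window
 * agree again -- as does refusing to migrate until the window is full.
 */
static void
toy_end_migration(struct toy_port *p, bool also_reset_idx)
{
	p->seen_events_len = 0;
	if (also_reset_idx)
		p->seen_events_idx = 0;
}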