> >>> I think the bigger issues as you've pointed out are the cost of > >>> the additional spin lock and should the additional state be > >>> stored in-band (fewer cache lines) or out-of band (less risk of > >>> breaking due to unpredictable application behavior). > >> > >> We don't need the spinlock if clearing the shadow byte after > >> setting the status to user. > >> > >> Worst case, user will set it back to kernel while the shadow > >> byte is not cleared yet and the next producer will drop a packet. > >> But next producers will make progress, so there is no deadlock > >> or corruption. > > > > I thought so too for a while but after spending more time than I > > care to admit I relized the following sequence was occuring: > > > > Core A Core B > > ------ ------ > > - Enter spin_lock > > - Get tp_status of head (X) > > tp_status == 0 > > - Check inuse > > inuse == 0 > > - Allocate entry X > > advance head (X+1) > > set inuse=1 > > - Exit spin_lock > > > > <very long delay> > > > > <allocate N-1 entries > > where N = size of ring> > > > > - Enter spin_lock > > - get tp_status of head (X+N) > > tp_status == 0 (but slot > > in use for X on core A) > > > > - write tp_status of <--- trouble! > > X = TP_STATUS_USER <--- trouble! > > - write inuse=0 <--- trouble! > > > > - Check inuse > > inuse == 0 > > - Allocate entry X+N > > advance head (X+N+1) > > set inuse=1 > > - Exit spin_lock > > > > > > At this point Core A just passed slot X to userspace with a > > packet and Core B has just been assigned slot X+N (same slot as > > X) for it's new packet. Both cores A and B end up filling in that > > slot. Tracking ths donw was one of the reasons it took me a > > while to produce these updated diffs. > > Is this not just an ordering issue? Since inuse is set after tp_status, > it has to be tested first (and barriers are needed to avoid reordering).
I changed the code as you suggest to do the inuse check first and removed the extra added spin_lock/unlock and it seems to be working. I was able to run through the night without an issue (normally I would hit the ring corruption in 1 to 2 hours). Thanks for pointing that out, I should have caught that myself. Next I'll look at your suggestion for where to put the shadow ring.