Hi Thomas,

> Please, could you re-send this series after having added the description
> of each patch in the commit messages?
Yes, I will move the paragraphs that begin with "Patch n" from patch 0 to
their respective patches.

> It seems you fix 2 bugs in the first patch. It may be clearer to split it
> in 2 patches with separate explanations.

No, I respectfully disagree. The only bug that we address in patch 1 is
that the slaves become out of sync with the master. The section that
begins with "Description of rte_timer_manage() race condition bug" is a
general description to give background info for both patches 2 and 3. I
would like to leave that section in patch 0, if it's OK with you.

--
Thanks,
Robert


> 2015-07-23 18:42, rsanford2 at gmail.com:
>> From: Robert Sanford <rsanford at akamai.com>
>>
>> This patchset fixes a bug in timer stress test 2, adds a new stress test
>> to expose a race condition bug in API rte_timer_manage(), and then fixes
>> the rte_timer_manage() bug.
>> --
>>
>> Patch 1, app/test timer stress test 2: Sometimes this test fails and
>> seg-faults because the slave lcores get out of phase with the master.
>> The master uses a single int, 'ready', to synchronize multiple slave
>> lcores through multiple phases of the test.
>>
>> To resolve this, we construct simple synchronization primitives that
>> use one atomic-int state variable per slave. The master tells the
>> slaves when to start, and then waits for all of them to finish. Each
>> slave waits for the master to tell it to start, and then tells the
>> master when it has finished.
>> --
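[Editor's note: as a rough illustration of the per-slave handshake
described above, here is a minimal sketch using C11 atomics. It is not
the code from the actual patch; all names are hypothetical.]

/* Minimal sketch (hypothetical names): one state variable per slave,
 * so no slave can run ahead of the master's commands. */
#include <stdatomic.h>

enum slave_state { SLAVE_WAITING, SLAVE_RUNNING, SLAVE_FINISHED };

struct slave_sync {
	atomic_int state;	/* one per slave lcore */
};

/* master side: command one slave to start its next phase */
static void master_start_slave(struct slave_sync *s)
{
	atomic_store_explicit(&s->state, SLAVE_RUNNING, memory_order_release);
}

/* master side: wait for one slave to report completion, then re-arm it */
static void master_wait_for_slave(struct slave_sync *s)
{
	while (atomic_load_explicit(&s->state, memory_order_acquire)
			!= SLAVE_FINISHED)
		;
	atomic_store_explicit(&s->state, SLAVE_WAITING, memory_order_relaxed);
}

/* slave side: block until the master says go */
static void slave_wait_to_start(struct slave_sync *s)
{
	while (atomic_load_explicit(&s->state, memory_order_acquire)
			!= SLAVE_RUNNING)
		;
}

/* slave side: report that this phase is done */
static void slave_finish(struct slave_sync *s)
{
	atomic_store_explicit(&s->state, SLAVE_FINISHED, memory_order_release);
}

Per test phase, the master would call master_start_slave() for each
slave and then master_wait_for_slave() for each slave, so every phase
has a clean start and end on all lcores.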
>> Description of rte_timer_manage() race condition bug: Through code
>> inspection, we noticed a potential problem in rte_timer_manage() that
>> leads to corruption of per-lcore pending-lists (implemented as
>> skip-lists). The race condition occurs when rte_timer_manage() expires
>> multiple timers on lcore A, while lcore B simultaneously invokes
>> rte_timer_reset() for one of the expiring timers (other than the first
>> one).
>>
>> Lcore A splits its pending-list, creating a local list of expired
>> timers linked through their sl_next[0] pointers, and sets the first
>> expired timer to the RUNNING state, all during one list-lock round
>> trip. Lcore A then unlocks the list-lock to run the first callback,
>> and that is when lcore B can misinterpret the remaining expired
>> timers' true states. Lcore B sees an expired timer still in the
>> PENDING state, locks A's list-lock, and reinserts the timer into A's
>> pending-list. The same sl_next[0] pointer is now expected to belong
>> to two lists at once!
>>
>> One solution is to remove timers from the pending-list and set them to
>> the RUNNING state in one atomic step, i.e., to perform these two
>> actions within one ownership of the list-lock.
>> --
>>
>> Patch 2, new timer-manage race-condition test: We wrote a test to
>> confirm our suspicion that we could crash rte_timer_manage() (RTM)
>> under the right circumstances. We repeatedly set several timers to
>> expire at roughly the same time on the master core. The master lcore
>> just delays and runs RTM about ten times per second. The slave lcores
>> all watch the first timer (timer-0) to see when RTM is running on the
>> master, i.e., when timer-0's state is no longer PENDING. At that
>> point, each slave attempts to reset a subset of the timers to a later
>> expiration time. The goal is to have the slaves move most of the
>> timers to different places in the master's pending-list while the
>> master is still working with the same next-pointers and running
>> callback functions. This eventually results in the master traversing
>> a corrupted linked-list. In our observations, it results in an
>> infinite loop.
>> --
>>
>> Patch 3, eliminate race condition in rte_timer_manage(): After
>> splitting the pending-list at the current time point, we try to set
>> all expired timers to the RUNNING state, and put back into the
>> pending-list any timers that we fail to set to the RUNNING state, all
>> during one list-lock round trip. It is then safe to run the callback
>> functions for all expired timers that remain on our local list,
>> without the lock.

> Please, could you re-send this series after having added the description
> of each patch in the commit messages?
> It seems you fix 2 bugs in the first patch. It may be clearer to split it
> in 2 patches with separate explanations.
>
> Thanks
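[Editor's note: to make the one-lock-round-trip approach described under
"Patch 3" above concrete, here is a simplified, self-contained model of
the idea. It is not DPDK's actual implementation: a plain mutex stands in
for the per-lcore list lock, a singly linked list stands in for the
skip-list, and all names are hypothetical. Each expired timer is claimed
PENDING -> RUNNING with a compare-and-swap while the lock is held once;
timers that lose the race are put back, and callbacks run lock-free only
on the timers we claimed.]

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

enum tim_state { TIM_STOPPED, TIM_PENDING, TIM_RUNNING };

struct tim {
	_Atomic enum tim_state state;
	struct tim *next;		/* stands in for sl_next[0] */
	void (*cb)(struct tim *);
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct tim *pending_head;	/* per-lcore pending list, simplified */

/* Simplified model of the fixed expiry path (expiry-time check omitted). */
static void manage_expired(void)
{
	struct tim *run_head = NULL, **run_tail = &run_head;
	struct tim *t, *next;

	pthread_mutex_lock(&list_lock);
	t = pending_head;
	pending_head = NULL;
	for (; t != NULL; t = next) {
		next = t->next;
		enum tim_state expected = TIM_PENDING;

		if (atomic_compare_exchange_strong(&t->state, &expected,
				TIM_RUNNING)) {
			/* claimed: append to the local run list */
			t->next = NULL;
			*run_tail = t;
			run_tail = &t->next;
		} else {
			/* lost a race with a concurrent reset: put it back */
			t->next = pending_head;
			pending_head = t;
		}
	}
	pthread_mutex_unlock(&list_lock);

	/* callbacks run without the lock, only on timers we now own */
	for (t = run_head; t != NULL; t = next) {
		next = t->next;
		t->cb(t);
	}
}

Because the removal from the pending list and the PENDING -> RUNNING
transition happen under a single hold of the lock, another lcore can no
longer observe an expired-but-detached timer as PENDING and reinsert it,
which is the corruption scenario described above.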