On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote: > Hi Ira, > > Thanks for your review. > > After second thought, I think your scenario may not occur. > Because the cookie 20 we query must be returned by fsl_dma_tx_submit(...) in > practice. > We never query a cookie not returned by fsl_dma_tx_submit(...). >
I agree about this part. > When we call fsl_tx_status(20), the chan->common.cookie is definitely wrote > as 20 and cpu2 could not read as 19. > This is what I don't agree about. However, I'm not an expert on CPU cache vs. memory accesses in an multi-processor system. The section titled "CACHE COHERENCY" in Documentation/memory-barriers.txt leads me to believe that the scenario I described is possible. What happens if CPU1's write of chan->common.cookie only goes into CPU1's cache. It never makes it to main memory before CPU2 fetches the old value of 19. I don't think you should see any performance impact from the smp_mb() operation. Thanks, Ira > -----Original Message----- > From: Ira W. Snyder [mailto:i...@ovro.caltech.edu] > Sent: 2011年11月23日 2:59 > To: Shi Xuelin-B29237 > Cc: dan.j.willi...@intel.com; Li Yang-R58472; z...@zh-kernel.org; > vinod.k...@intel.com; linuxppc-dev@lists.ozlabs.org; > linux-ker...@vger.kernel.org > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing > spinlock use. > > On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29...@freescale.com wrote: > > From: Forrest Shi <b29...@freescale.com> > > > > dma status check function fsl_tx_status is heavily called in > > a tight loop and the desc lock in fsl_tx_status contended by > > the dma status update function. this caused the dma performance > > degrades much. > > > > this patch releases the lock in the fsl_tx_status function. > > I believe it has no neglect impact on the following call of > > dma_async_is_complete(...). > > > > we can see below three conditions will be identified as success > > a) x < complete < use > > b) x < complete+N < use+N > > c) x < complete < use+N > > here complete is the completed_cookie, use is the last_used > > cookie, x is the querying cookie, N is MAX cookie > > > > when chan->completed_cookie is being read, the last_used may > > be incresed. Anyway it has no neglect impact on the dma status > > decision. > > > > Signed-off-by: Forrest Shi <xuelin....@freescale.com> > > --- > > drivers/dma/fsldma.c | 5 ----- > > 1 files changed, 0 insertions(+), 5 deletions(-) > > > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index > > 8a78154..1dca56f 100644 > > --- a/drivers/dma/fsldma.c > > +++ b/drivers/dma/fsldma.c > > @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan > > *dchan, > > struct fsldma_chan *chan = to_fsl_chan(dchan); > > dma_cookie_t last_complete; > > dma_cookie_t last_used; > > - unsigned long flags; > > - > > - spin_lock_irqsave(&chan->desc_lock, flags); > > > > This will cause a bug. See below for a detailed explanation. You need this > instead: > > /* > * On an SMP system, we must ensure that this CPU has seen the > * memory accesses performed by another CPU under the > * chan->desc_lock spinlock. > */ > smp_mb(); > > last_complete = chan->completed_cookie; > > last_used = dchan->cookie; > > > > - spin_unlock_irqrestore(&chan->desc_lock, flags); > > - > > dma_set_tx_state(txstate, last_complete, last_used, 0); > > return dma_async_is_complete(cookie, last_complete, last_used); } > > Facts: > - dchan->cookie is the same member as chan->common.cookie (same memory > location) > - chan->common.cookie is the "last allocated cookie for a pending transaction" > - chan->completed_cookie is the "last completed transaction" > > I have replaced "dchan->cookie" with "chan->common.cookie" in the below > explanation, to keep everything referenced from the same structure. > > Variable usage before your change. Everything is used locked. > - RW chan->common.cookie (fsl_dma_tx_submit) > - R chan->common.cookie (fsl_tx_status) > - R chan->completed_cookie (fsl_tx_status) > - W chan->completed_cookie (dma_do_tasklet) > > Variable usage after your change: > - RW chan->common.cookie LOCKED > - R chan->common.cookie NO LOCK > - R chan->completed_cookie NO LOCK > - W chan->completed_cookie LOCKED > > What if we assume that you have a 2 CPU system (such as a P2020). After your > changes, one possible sequence is: > > === CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === > spin_lock_irqsave > descriptor->cookie = 20 (x in your example) > chan->common.cookie = 20 (used in your example) > spin_unlock_irqrestore > > === CPU2 - immediately calls fsl_tx_status() === > chan->common.cookie == 19 > chan->completed_cookie == 19 > descriptor->cookie == 20 > > Since we don't have locks anymore, CPU2 may not have seen the write to > chan->common.cookie yet. > > Also assume that the DMA hardware has not started processing the transaction > yet. Therefore dma_do_tasklet() has not been called, and > chan->completed_cookie has not been updated. > > In this case, dma_async_is_complete() (on CPU2) returns DMA_SUCCESS, even > though the DMA operation has not succeeded. The DMA operation has not even > started yet! > > The smp_mb() fixes this, since it forces CPU2 to have seen all memory > operations that happened before CPU1 released the spinlock. Spinlocks are > implicit SMP memory barriers. > > Therefore, the above example becomes: > smp_mb(); > chan->common.cookie == 20 > chan->completed_cookie == 19 > descriptor->cookie == 20 > > Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is correct. > > Thanks, > Ira > > > _______________________________________________ > Linuxppc-dev mailing list > Linuxppc-dev@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/linuxppc-dev _______________________________________________ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev