On Wed, May 16, 2012 at 02:35:38PM +1000, Benjamin Herrenschmidt wrote:
> > > > +	/* HACK: full memory barrier here */
> > > > +	__sync_synchronize();
> > >
> > > I thought you were going to limit this to the TCE iommu?
> >
> > So, it wasn't my intention to send this one with the rest, but I
> > forgot to explain that to Ben when he resent.  As the comment
> > suggests, this is a hack we have been using internally to see if
> > certain ordering problems were what we thought they were.  If that
> > turned out to be the case (and it now looks like it is), we need to
> > work out where to correctly place this barrier.  As Ben says, this
> > should probably really be in the PCI accessors, and we should use
> > the finer grained primitives from qemu-barrier.h rather than the
> > brute force __sync_synchronize().
>
> Well, I knew you didn't intend to send them but I still think that's
> the right patch for now :-)
>
> So we -could- put it in the PCI accessors ... but that would mean
> fixing all drivers to actually use them.  For example, ide/ahci or
> usb/ohci don't and they aren't the only one.
Uh, right.  So, in fact, I didn't mean the PCI accessors precisely; I
think this should go in the general DMA accessors.  However, it should
go in the wrappers in the header file - we want the barrier even in
the non-IOMMU case.  And we should use the finer grained (and arch
tailored) barriers from qemu-barrier.h rather than
__sync_synchronize() (see the first sketch at the end of this mail).

Actually, other patches in my DMA series should fix AHCI and OHCI to
use the PCI accessors.

> In the end, I don't think there's anything we care about which would
> not benefit from ensuring that the DMAs it does appear in the order
> they were issued to the guest kernel.  Most busses provide that
> guarantee to some extent, and while some busses do have the ability
> to explicitly request relaxed ordering, I don't think this is the
> case with anything we care about emulating at this stage (and we can
> always make that separate accessors, or flags to add to the
> direction, for example).
>
> So by putting the barrier right in the dma_* accessor we kill all
> the birds with one stone without having to audit all drivers for use
> of the right accessors and all bus types.
>
> Also, while the goal of using more targeted barriers might be
> worthwhile in the long run, it's not totally trivial because we do
> want to order stores vs. subsequent loads in all cases and loads
> vs. loads, and we don't want to have to keep track of what the
> previous access was, so at this stage it's simply easier to just use
> a full barrier.

True, we need more than a wmb() here.  But it's not necessarily as
strong as __sync_synchronize() (e.g. on POWER I think we only need
'eieio', not 'sync').  More importantly, I believe qemu does build
for some platforms with crappy old gcc versions which don't have
__sync_synchronize(), so we want the ifdefs in qemu-barrier.h for
that reason (in many cases mb() from there will translate into
__sync_synchronize() anyway).  The second sketch at the end of this
mail shows the rough shape I have in mind.

> So my suggestion is to see if that patch introduces a measurable
> performance regression anywhere we care about (ie on x86) and if
> not, just go for it.  It will solve a very real problem and we can
> ponder ways to do it better as a second step if it's worthwhile.
>
> Anthony, how do you usually benchmark these things?  Any chance you
> can run a few tests to see if there's any visible loss?
>
> Cheers,
> Ben.
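A minimal sketch of the first idea - names are placeholders, not the
final patch: DMAContext, dma_addr_t and DMADirection are assumed to
come from the DMA series' dma.h, dma_memory_rw_internal() is a
hypothetical name for the out-of-line implementation, and smp_mb()
stands in for whatever qemu-barrier.h primitive we settle on:

/* Inline wrapper in the header: every device going through the dma_*
 * accessors gets its DMAs ordered as issued, IOMMU or not. */
static inline int dma_memory_rw(DMAContext *dma, dma_addr_t addr,
                                void *buf, dma_addr_t len,
                                DMADirection dir)
{
    /* Full barrier: order this access against all previous loads
     * and stores, matching what real busses guarantee. */
    smp_mb();

    return dma_memory_rw_internal(dma, addr, buf, len, dir);
}

Because the wrapper is inline in the header, the barrier also covers
the fast path where no IOMMU translation happens at all.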
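And the second sketch, the rough shape of the compile-time selection:
the macro name dma_mb() and the exact instruction choices (eieio
vs. sync on POWER in particular) are tentative and still need
checking:

/* Arch-tailored full barriers, with __sync_synchronize() only as the
 * generic fallback. */
#if defined(_ARCH_PPC)
/* Whether eieio suffices here, rather than a full sync, is exactly
 * the open question above. */
#define dma_mb()    asm volatile("eieio" ::: "memory")
#elif defined(__i386__)
/* A locked RMW is a full barrier and doesn't require SSE2. */
#define dma_mb()    asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
#elif defined(__x86_64__)
#define dma_mb()    asm volatile("mfence" ::: "memory")
#else
/* Generic fallback for compilers recent enough to provide it. */
#define dma_mb()    __sync_synchronize()
#endif

On x86 these should be cheap next to the cost of the DMA emulation
itself, which is what makes the "measure it, then just go for it"
plan plausible.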
-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson