Re: TTM placement & caching issue/questions

2014-09-04 Thread Thomas Hellstrom
On 09/04/2014 10:06 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2014-09-04 at 09:44 +0200, Thomas Hellstrom wrote:
>
>>> This will, from what I can tell, try to use the same caching mode as the
>>> original object:
>>>
>>>     if ((cur_placement & caching) != 0)
>>>             result |= (cur_placement & caching);
>>>
>>> And cur_placement comes from bo->mem.placement which as far as I can
>>> tell is based on the placement array which the drivers set up.
>> This originates from the fact that when evicting GTT memory, on x86 it's
>> unnecessary and undesirable to switch caching mode when going to system.
> But that's what I don't quite understand. We have two different mappings
> here. The VRAM and the memory object. We wouldn't be "switching"... we
> are creating a temporary mapping for the memory object in order to do
> the memcpy, but we seem to be doing it by using the caching attributes
> of the VRAM object or am I missing something ? I don't see how that
> makes sense so I suppose I'm missing something here :-)

Well, the intention when TTM was written was that the driver writer
should be smart enough that when he wanted a move from uncached VRAM to
system, he'd request cached system in the placement flags in the first
place. If TTM somehow overrides such a request, that's a bug in TTM.

If the move, for example, is the result of an eviction, then the driver's
evict_flags() function should ideally look at the current placement and
decide on a suitable placement based on that: vram-to-system moves
should generally request cacheable memory if the next access is expected
to be by the CPU, and probably write-combined otherwise.
If the move is the result of a TTM swapout, TTM will automatically
select cacheable system memory, and for most other moves, I think the
driver writer is in full control.

>
>> Last time I tested, (and it seems like Michel is on the same track),
>> writing with the CPU to write-combined memory was substantially faster
>> than writing to cached memory, with the additional side-effect that CPU
>> caches are left unpolluted.
> That's very strange indeed. It's certainly an x86-specific artifact;
> even if we were allowed by our hypervisor to map memory non-cachable
> (the HW somewhat can), we tend to get higher throughput by going
> cachable, but that could be due to the way the PowerBus works (it's
> basically very biased toward cachable transactions).
>
>> I dislike the approach of rewriting placements. In some cases I think it
>> won't even work, because placements are declared 'static const'.
>>
>> What I'd suggest instead is to intercept the driver response from
>> init_mem_type() and filter out undesired caching modes from
>> available_caching and default_caching,
> This was my original intent but Jerome seems to have different ideas
> (see his proposed patches). I'm happy to revive mine as well and post it
> as an alternative after I've tested it a bit more (tomorrow).
>
>> perhaps also looking at whether
>> the memory type is mappable or not. This should have the additional
>> benefit of working everywhere, and if a caching mode is selected that's
>> not available on the platform, you'll simply get an error. (I guess?)
> You mean that if it's not mappable we don't bother filtering?
>
> The rule is really pretty simple for me:
>
> - If it's system memory (PL_SYSTEM/PL_TT), it MUST be cachable.
>
> - If it's PCIe memory space (VRAM, registers, ...), it MUST be
>   non-cachable.

Yes, something along these lines. I guess checking for VRAM or
TTM_MEMTYPE_FLAG_FIXED would perhaps do the trick.
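
Something along these lines in ttm_bo_init_mm(), right after the
driver's init_mem_type() hook has run (an untested sketch; the
CONFIG_PPC64 test is just a placeholder for whatever "coherent
platform" check we end up with):

    ret = bdev->driver->init_mem_type(bdev, type, man);
    if (ret)
        return ret;

    if (IS_ENABLED(CONFIG_PPC64) &&
        !(man->flags & TTM_MEMTYPE_FLAG_FIXED)) {
        /* Fully coherent platform: memory types backed by system
         * pages must only ever be mapped cached, so filter out WC
         * and UNCACHED before this manager is used. */
        man->available_caching &= TTM_PL_FLAG_CACHED;
        man->default_caching = TTM_PL_FLAG_CACHED;
    }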

/Thomas

>
> Cheers,
> Ben.
>
>> /Thomas
>>
>>
>>> Cheers,
>>> Ben.

Re: TTM placement & caching issue/questions

2014-09-04 Thread Thomas Hellstrom
Hi!

Let me try to bring some clarity to this, and make a few suggestions.

On 09/04/2014 02:12 AM, Benjamin Herrenschmidt wrote:
> Hi folks !
>
> I've been tracking down some problems with the recent DRI on powerpc and
> stumbled upon something that doesn't look right, and not necessarily
> only for us.
>
> Now it's possible that I haven't fully understood the code here, and I
> also don't know to what extent some of this behaviour is necessary for
> some platforms, such as the Intel GTT bits.
>
> What I've observed with a simple/dumb (no DMA) driver like AST (but this
> probably happens more generally) is that when evicting a BO from VRAM
> into System memory, the TTM tries to preserve the existing caching
> attributes of the VRAM object.
>
> From what I can tell, we end up going from the VRAM to the System
> memory type, and we eventually call ttm_bo_select_caching() to select
> the caching option for the target.
>
> This will, from what I can tell, try to use the same caching mode as the
> original object:
>
>   if ((cur_placement & caching) != 0)
>           result |= (cur_placement & caching);
>
> And cur_placement comes from bo->mem.placement which as far as I can
> tell is based on the placement array which the drivers set up.

This originates from the fact that when evicting GTT memory, on x86 it's
unnecessary and undesirable to switch caching mode when going to system.
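
For reference, the full selection logic in ttm_bo_select_caching()
looks roughly like this (paraphrased from the 3.16-era
drivers/gpu/drm/ttm/ttm_bo.c; check the tree for the exact code):

    static int ttm_bo_select_caching(struct ttm_mem_type_manager *man,
                                     uint32_t cur_placement,
                                     uint32_t proposed_placement)
    {
        uint32_t caching = proposed_placement & TTM_PL_MASK_CACHING;
        uint32_t result = proposed_placement & ~TTM_PL_MASK_CACHING;

        /* Keep the current caching mode if at all possible. */
        if ((cur_placement & caching) != 0)
            result |= (cur_placement & caching);
        else if ((man->default_caching & caching) != 0)
            result |= man->default_caching;
        else if ((TTM_PL_FLAG_CACHED & caching) != 0)
            result |= TTM_PL_FLAG_CACHED;
        else if ((TTM_PL_FLAG_WC & caching) != 0)
            result |= TTM_PL_FLAG_WC;
        else if ((TTM_PL_FLAG_UNCACHED & caching) != 0)
            result |= TTM_PL_FLAG_UNCACHED;

        return result;
    }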

>
> Now they tend to uniformly set up the placement for System memory as
> TTM_PL_MASK_CACHING, which enables all caching modes.
>
> So I end up with, for example, my System memory BOs having
> TTM_PL_FLAG_CACHED not set (though they also don't have
> TTM_PL_FLAG_UNCACHED) but TTM_PL_FLAG_WC set.
>
> We don't seem to use the man->default_caching (which will have
> TTM_PL_FLAG_CACHED) unless there is no matching bit at all between the
> proposed placement and the existing caching mode.
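
Concretely, with a BO currently in write-combined VRAM and a proposed
System placement of TTM_PL_MASK_CACHING, the selection works out like
this (an illustrative trace, not actual kernel code):

    cur_placement = TTM_PL_FLAG_VRAM | TTM_PL_FLAG_WC;
    caching = proposed & TTM_PL_MASK_CACHING; /* CACHED|UNCACHED|WC */

    /* (cur_placement & caching) != 0 because the WC bit matches, so
     * the result keeps TTM_PL_FLAG_WC and man->default_caching
     * (TTM_PL_FLAG_CACHED) is never consulted. */
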
>
> Now this is a problem for several reasons that I can think of:
>
> - On a number of powerpc platforms, such as all our 64-bit server ones
> for example, it's actually illegal to map system memory non-cached. The
> system is fully cache coherent for all possible DMA originators (that we
> care about at least), and mapping memory non-cachable while it's mapped
> cachable in the linear mapping can cause a nasty cache paradox which,
> when detected by HW, can checkstop the system.
>
> - A similar issue exists, AFAIK, on ARM >= v7, so anything mapped
> non-cachable must be removed from the linear mapping explicitly, since
> otherwise it can be speculatively prefetched into the cache.
>
> - I don't know about x86, but even there, it looks quite sub-optimal
> to map the memory backing the BOs and access it using a WC rather than
> a cachable mapping attribute.

Last time I tested, (and it seems like Michel is on the same track),
writing with the CPU to write-combined memory was substantially faster
than writing to cached memory, with the additional side-effect that CPU
caches are left unpolluted.

Moreover (although only tested on Intel's embedded chipsets), texturing
from CPU-cache-coherent PCI memory was a real GPU performance hog
compared to texturing from non-snooped memory. Hence, whenever a buffer
could be classified as GPU-read-only (or nearly so), it should be
placed in write-combined memory.

>
> Now, some folks on IRC mentioned that there might be reasons for the
> current behaviour, namely not changing the caching attributes when
> going in/out of the GTT on Intel. I don't know how that relates or how
> it works, but maybe that should be enforced by having a different
> placement mask specifically on those chipsets.
>
> Dave, should we change the various PCI drivers for generally coherent
> devices such that the System memory type doesn't allow placements
> without the CACHED attribute? Or at least on coherent platforms? How do
> we detect that? Should we have a TTM helper that "normal PCI" drivers
> call to establish the default memory placement attributes, so we can
> have all the necessary arch ifdefs in one single place, at least for
> "classic PCI/PCIe" stuff (AGP might need additional tweaks)?
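
(To make the helper idea concrete: it could be as small as the sketch
below. The name and the arch test are made up, purely illustrative.)

    /* Hypothetical helper: caching modes allowed for system memory
     * behind a cache-coherent "classic PCI/PCIe" device. */
    static inline uint32_t ttm_pci_system_caching(void)
    {
    #if defined(CONFIG_PPC64) || defined(CONFIG_ARM)
        /* Coherent linear mapping: system pages must stay cached. */
        return TTM_PL_FLAG_CACHED;
    #else
        /* x86 and friends: any caching mode is safe here. */
        return TTM_PL_MASK_CACHING;
    #endif
    }
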
>
> Non-PCI and "special" drivers like Intel can use a different set of
> placement attributes to represent the requirements of those specific
> platforms (mostly thinking of embedded ARM here, which under some
> circumstances might actually require non-cached mappings).
> Or am I missing another part of the puzzle?
>
> As it is, things are broken for me even with dumb drivers, and I
> suspect to a large extent with radeon and nouveau too, though in some
> cases we might get away with it most of the time ... until the machine
> locks up for some unexplainable reason. This might even cause problems
> on existing distros such as RHEL7 with our radeon adapters.
>
> Any suggestions on the best approach to fix it? I'm happy to produce
> the patches, but I'm not that familiar with the TTM so I would like to
> make sure I'm on the right track first :-)

Re: TTM placement & caching issue/questions

2014-09-04 Thread Thomas Hellstrom
On 09/04/2014 11:43 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2014-09-04 at 11:34 +0200, Daniel Vetter wrote:
>> On Thu, Sep 04, 2014 at 09:44:04AM +0200, Thomas Hellstrom wrote:
>>> Last time I tested, (and it seems like Michel is on the same track),
>>> writing with the CPU to write-combined memory was substantially faster
>>> than writing to cached memory, with the additional side-effect that CPU
>>> caches are left unpolluted.
>>>
>>> Moreover (although only tested on Intel's embedded chipsets), texturing
>>> from cpu-cache-coherent PCI memory was a real GPU performance hog
>>> compared to texturing from non-snooped memory. Hence, whenever a buffer
>>> could be classified as GPU-read-only (or almost at least), it should be
>>> placed in write-combined memory.
>> Just a quick comment since this explicitly refers to Intel chips: On
>> desktop/laptop chips with the big shared L3/L4 caches it's the other
>> way round. Cached uploads are substantially faster than WC, and not
>> using coherent access is a severe perf hit for texturing. I guess the
>> hw guys worked really hard to hide the snooping costs so that the gpu
>> can benefit from the massive bandwidth these caches can provide.
> This is similar to modern POWER chips as well. We have pretty big L3s
> (though not technically shared, they are in a separate quadrant, and we
> have a shared L4 in the memory buffer) and our fabric is generally
> optimized for cachable/coherent access performance. In fact, we only
> have so many credits for NC accesses on the bus...
>

Thanks to both of you for the update. I haven't dealt with real
hardware for a while...

/Thomas
