This is the first part of a series meant to improve our usage of the L3 cache. Currently it's far from ideal, since the following objects aren't taking any advantage of it:

 - Pull constants (i.e. UBOs and demoted uniforms)
 - Buffer textures
 - Shader scratch space (i.e. register spills and fills)
 - Atomic counters
 - (Soon) Images

This first series addresses the first two issues. Fixing the last three is going to be a bit more difficult because we need to modify the partitioning of the L3 cache in order to increase the number of ways assigned to the DC, which happens to be zero on boot until Gen8. That's likely to require kernel changes because we don't have any really satisfactory API to change that from userspace right now.

The first patch in the series sets the MOCS L3 cacheability bit in the surface state structure for buffers, so the memory objects mentioned above (except the shader scratch space, which gets its MOCS from elsewhere) have a chance of getting cached in L3.

The fourth patch in the series switches to using the constant cache for uniform pull constant loads. Unlike the data cache, which we used years ago before we started using the sampler, the constant cache is cached in L3 with the default partitioning on all gens. The overall performance numbers I've collected are included in the commit message of the same patch for future reference. Most of them point at the constant cache being faster than the sampler in a number of cases (assuming the L3 caching settings are correct). It's also likely to alleviate some of the cache thrashing caused by the competition with textures for the L1/L2 sampler caches, and it allows fetching up to eight consecutive owords (128B) with just one message.

The sixth patch enables 4 oword loads because they're basically free and they avoid some of the shortcomings of the 1 and 2 oword messages (see the commit message for more details). I'll have a look into enabling 8 oword loads, but that's going to require an analysis pass to avoid wasting bandwidth and increasing the register pressure unnecessarily when the shader doesn't actually need that many constants.

We could do something similar for non-uniform offset pull constant loads, and for both kinds of pull constant loads on the vec4 back-end, but I don't have enough performance data to support that yet.
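
To make the oword sizes mentioned above a bit more concrete, here's a rough sketch of the arithmetic involved. It is not code from the series: the helper names are made up for illustration, and it assumes that block reads are aligned to their own size, which the hardware may or may not require. The only facts taken from the text above are that eight owords are 128B (so one oword is 16B) and that a 4 oword load covers one cacheline.

```c
#include <assert.h>
#include <stdint.h>

/* Eight owords are 128 bytes, so a single oword is 16 bytes. */
#define OWORD_SIZE 16u

/* Hypothetical helper: size in bytes of an N-oword block read. */
static uint32_t
block_size_bytes(uint32_t owords)
{
   return owords * OWORD_SIZE;
}

/* Hypothetical helper: start of the block containing byte_offset.
 * Assumes (for this sketch only) that blocks are aligned to their own
 * size, which requires the block size to be a power of two.
 */
static uint32_t
block_base(uint32_t byte_offset, uint32_t owords)
{
   const uint32_t size = block_size_bytes(owords);

   assert(size != 0 && (size & (size - 1u)) == 0);
   return byte_offset & ~(size - 1u);
}

/* Hypothetical helper: number of block reads needed to cover every
 * constant between byte offsets first and last (inclusive, first <= last).
 */
static uint32_t
blocks_needed(uint32_t first, uint32_t last, uint32_t owords)
{
   const uint32_t size = block_size_bytes(owords);

   assert(first <= last);
   return (block_base(last, owords) - block_base(first, owords)) / size + 1u;
}
```

Under those assumptions, a shader that reads constants at byte offsets 60 and 70 would need two 4 oword fetches (the offsets fall in the blocks starting at 0 and 64), but only one 8 oword fetch. The flip side is the one mentioned above: the wider fetch also pulls in 128B worth of registers even if the shader only uses a handful of them.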

[PATCH 1/7] i965: Enable L3 caching of buffer surfaces.
[PATCH 2/7] i965: Remove the create_raw_surface vtbl hook.
[PATCH 3/7] i965: Let the caller of brw_set_dp_write/read_message control the target cache.
[PATCH 4/7] i965/fs: Switch to the constant cache for uniform pull constants.
[PATCH 5/7] i965/fs: Less broken handling of force_writemask_all in lower_load_payload().
[PATCH 6/7] i965/fs: Fetch one cacheline of pull constants at a time.
[PATCH 7/7] i965/fs: Remove the FS_OPCODE_SET_SIMD4X2_OFFSET virtual opcode.
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev