On Saturday, December 10, 2016 12:37:16 PM PST Matt Turner wrote: > On Fri, Dec 9, 2016 at 8:28 PM, Kenneth Graunke <kenn...@whitecape.org> wrote: > > A number of games have large arrays of constants, which we promote to > > uniforms. This introduces copies from the uniform array to the original > > temporary array. Normally, copy propagation eliminates those copies, > > making everything refer to the uniform array directly. > > > > A number of shaders in "Deus Ex: Mankind Divided" recently exposed a > > limitation of copy propagation - if we had any intrinsics (i.e. image > > access in a compute shader), we weren't able to get rid of these copies. > > > > That meant that any variable indexing remained on the temporary array > > rather being moved to the uniform array. i965's scalar backend > > currently doesn't support indirect addressing of temporary arrays, > > which meant lowering it to if-ladders. This was horrible. > > > > On Skylake: > > > > total instructions in shared programs: 13700090 -> 13654519 (-0.33%) > > instructions in affected programs: 56438 -> 10867 (-80.75%) > > Wow! > > > helped: 14 > > HURT: 0 > > > > total cycles in shared programs: 288879704 -> 291270232 (0.83%) > > cycles in affected programs: 12758080 -> 15148608 (18.74%) > > ... that seems nuts? > > Any idea what's going on with the cycle counts?
Good point...I glossed over the cycle counts when I saw the -80% reduction in instructions with 0 shaders hurt. But they do look pretty bad, so let's take a closer look... There are two nearly identical shaders that are the worst offenders: shaders/closed/steam/deus-ex-mankind-divided/256.shader_test CS SIMD16: instructions: 2770 -> 253 (-2,517 instructions or -90.87%) spills: 25 -> 0 fills: 29 -> 0 cycles: 923266 -> 1420534 (+497,268 cycles or +53.86%) compile time: 2.73 seconds -> 0.17 seconds There are three loops in the program, each of which contains two indirect reads of the uvec4[98] constant array. Before this patch, there were: - 67 UNIFORM_PULL_CONSTANT_LOADs at the top of the program - 1 UNIFORM_PULL_CONSTANT_LOAD in the first (cheap) loop - 1 UNIFORM_PULL_CONSTANT_LOAD in the second (expensive) loop - 1 UNIFORM_PULL_CONSTANT_LOAD in the third (very expensive) loop After this patch, there are: - 0 loads at the top of the program - 1 VARYING_PULL_CONSTANT_LOAD in the first (cheap) loop - 2 VARYING_PULL_CONSTANT_LOAD in the second (expensive) loop - 2 VARYING_PULL_CONSTANT_LOAD in the third (very expensive) loop The array indexes in the expensive loop are a[foo] and a[foo + 1]. foo is modified in the loop, so they can't be hoisted out. I don't think we can determine the number of loop iterations. The two expensive loops look to be twice as expensive after this patch. The numbers aren't quite adding up for me - it looks like we should spend 200 more cycles per loop iteration, but the loops are like 40,000 -> 90,000 cycles. I'm not sure what to do with this information. Eliminating 90% of the instructions seems good. Requiring no scratch access seems good. Eliminating the 67 memory loads outside of the loops seems good. Doing two memory loads per loop doesn't seem too crazy, given that it matches the GLSL source code. Burning 49 registers to store the entire array for the lifetime of the program seems pretty crazy... --Ken
signature.asc
Description: This is a digitally signed message part.
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev