On Sun, 2016-12-11 at 00:00 -0800, Kenneth Graunke wrote: > On Saturday, December 10, 2016 12:37:16 PM PST Matt Turner wrote: > > > > On Fri, Dec 9, 2016 at 8:28 PM, Kenneth Graunke <kenneth@whitecape. > > org> wrote: > > > > > > A number of games have large arrays of constants, which we > > > promote to > > > uniforms. This introduces copies from the uniform array to the > > > original > > > temporary array. Normally, copy propagation eliminates those > > > copies, > > > making everything refer to the uniform array directly. > > > > > > A number of shaders in "Deus Ex: Mankind Divided" recently > > > exposed a > > > limitation of copy propagation - if we had any intrinsics (i.e. > > > image > > > access in a compute shader), we weren't able to get rid of these > > > copies. > > > > > > That meant that any variable indexing remained on the temporary > > > array > > > rather being moved to the uniform array. i965's scalar backend > > > currently doesn't support indirect addressing of temporary > > > arrays, > > > which meant lowering it to if-ladders. This was horrible. > > > > > > On Skylake: > > > > > > total instructions in shared programs: 13700090 -> 13654519 (- > > > 0.33%) > > > instructions in affected programs: 56438 -> 10867 (-80.75%) > > > > Wow! > > > > > > > > helped: 14 > > > HURT: 0 > > > > > > total cycles in shared programs: 288879704 -> 291270232 (0.83%) > > > cycles in affected programs: 12758080 -> 15148608 (18.74%) > > > > ... that seems nuts? > > > > Any idea what's going on with the cycle counts? > > Good point...I glossed over the cycle counts when I saw the -80% > reduction in instructions with 0 shaders hurt. But they do look > pretty bad, so let's take a closer look... > > There are two nearly identical shaders that are the worst offenders: > > shaders/closed/steam/deus-ex-mankind-divided/256.shader_test CS > SIMD16: > > instructions: 2770 -> 253 (-2,517 instructions or -90.87%) > spills: 25 -> 0 > fills: 29 -> 0 > cycles: 923266 -> 1420534 (+497,268 cycles or +53.86%) > compile time: 2.73 seconds -> 0.17 seconds > > There are three loops in the program, each of which contains two > indirect reads of the uvec4[98] constant array. > > Before this patch, there were: > - 67 UNIFORM_PULL_CONSTANT_LOADs at the top of the program > - 1 UNIFORM_PULL_CONSTANT_LOAD in the first (cheap) loop > - 1 UNIFORM_PULL_CONSTANT_LOAD in the second (expensive) loop > - 1 UNIFORM_PULL_CONSTANT_LOAD in the third (very expensive) loop > > After this patch, there are: > - 0 loads at the top of the program > - 1 VARYING_PULL_CONSTANT_LOAD in the first (cheap) loop > - 2 VARYING_PULL_CONSTANT_LOAD in the second (expensive) loop > - 2 VARYING_PULL_CONSTANT_LOAD in the third (very expensive) loop > > The array indexes in the expensive loop are a[foo] and a[foo + 1]. > foo is modified in the loop, so they can't be hoisted out. I don't > think we can determine the number of loop iterations.
The first loop has a trip count of 49. If I force it to unroll I get: CS SIMD16 shader: 682 inst, 2 loops, 1009856 cycles. The current unrolling rules are: static bool is_loop_small_enough_to_unroll(nir_shader *shader, nir_loop_info *li) { unsigned max_iter = shader->options->max_unroll_iterations; if (li->trip_count > max_iter) return false; if (li->force_unroll) // this is for consecutive indirect array access return true; bool loop_not_too_large = li->num_instructions * li->trip_count <= max_iter * 25; return loop_not_too_large; } Maybe we should just drop the first trip count check? Checking the number of intructions * trip count makes a lot more sense. Dropping the check allows the first loop to unroll. > > The two expensive loops look to be twice as expensive after this > patch. > The numbers aren't quite adding up for me - it looks like we should > spend 200 more cycles per loop iteration, but the loops are like > 40,000 > -> 90,000 cycles. > > I'm not sure what to do with this information. Eliminating 90% of > the > instructions seems good. Requiring no scratch access seems good. > Eliminating the 67 memory loads outside of the loops seems good. > Doing two memory loads per loop doesn't seem too crazy, given that > it matches the GLSL source code. Burning 49 registers to store the > entire array for the lifetime of the program seems pretty crazy... > > --Ken > _______________________________________________ > mesa-dev mailing list > mesa-dev@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/mesa-dev _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev