On Mon, Dec 12, 2016 at 6:28 AM, Timothy Arceri <timothy.arc...@collabora.com> wrote: > On Mon, 2016-12-12 at 11:35 +1100, Timothy Arceri wrote: >> On Sun, 2016-12-11 at 00:00 -0800, Kenneth Graunke wrote: >> > >> > On Saturday, December 10, 2016 12:37:16 PM PST Matt Turner wrote: >> > > >> > > >> > > On Fri, Dec 9, 2016 at 8:28 PM, Kenneth Graunke <kenneth@whitecap >> > > e. >> > > org> wrote: >> > > > >> > > > >> > > > A number of games have large arrays of constants, which we >> > > > promote to >> > > > uniforms. This introduces copies from the uniform array to the >> > > > original >> > > > temporary array. Normally, copy propagation eliminates those >> > > > copies, >> > > > making everything refer to the uniform array directly. >> > > > >> > > > A number of shaders in "Deus Ex: Mankind Divided" recently >> > > > exposed a >> > > > limitation of copy propagation - if we had any intrinsics (i.e. >> > > > image >> > > > access in a compute shader), we weren't able to get rid of >> > > > these >> > > > copies. >> > > > >> > > > That meant that any variable indexing remained on the temporary >> > > > array >> > > > rather being moved to the uniform array. i965's scalar backend >> > > > currently doesn't support indirect addressing of temporary >> > > > arrays, >> > > > which meant lowering it to if-ladders. This was horrible. >> > > > >> > > > On Skylake: >> > > > >> > > > total instructions in shared programs: 13700090 -> 13654519 (- >> > > > 0.33%) >> > > > instructions in affected programs: 56438 -> 10867 (-80.75%) >> > > >> > > Wow! >> > > >> > > > >> > > > >> > > > helped: 14 >> > > > HURT: 0 >> > > > >> > > > total cycles in shared programs: 288879704 -> 291270232 (0.83%) >> > > > cycles in affected programs: 12758080 -> 15148608 (18.74%) >> > > >> > > ... that seems nuts? >> > > >> > > Any idea what's going on with the cycle counts? >> > >> > Good point...I glossed over the cycle counts when I saw the -80% >> > reduction in instructions with 0 shaders hurt. But they do look >> > pretty bad, so let's take a closer look... >> > >> > There are two nearly identical shaders that are the worst >> > offenders: >> > >> > shaders/closed/steam/deus-ex-mankind-divided/256.shader_test CS >> > SIMD16: >> > >> > instructions: 2770 -> 253 (-2,517 instructions or -90.87%) >> > spills: 25 -> 0 >> > fills: 29 -> 0 >> > cycles: 923266 -> 1420534 (+497,268 cycles or +53.86%) >> > compile time: 2.73 seconds -> 0.17 seconds >> > >> > There are three loops in the program, each of which contains two >> > indirect reads of the uvec4[98] constant array. >> > >> > Before this patch, there were: >> > - 67 UNIFORM_PULL_CONSTANT_LOADs at the top of the program >> > - 1 UNIFORM_PULL_CONSTANT_LOAD in the first (cheap) loop >> > - 1 UNIFORM_PULL_CONSTANT_LOAD in the second (expensive) loop >> > - 1 UNIFORM_PULL_CONSTANT_LOAD in the third (very expensive) loop >> > >> > After this patch, there are: >> > - 0 loads at the top of the program >> > - 1 VARYING_PULL_CONSTANT_LOAD in the first (cheap) loop >> > - 2 VARYING_PULL_CONSTANT_LOAD in the second (expensive) loop >> > - 2 VARYING_PULL_CONSTANT_LOAD in the third (very expensive) loop >> > >> > The array indexes in the expensive loop are a[foo] and a[foo + 1]. >> > foo is modified in the loop, so they can't be hoisted out. I don't >> > think we can determine the number of loop iterations. >> >> The first loop has a trip count of 49. If I force it to unroll I get: >> >> CS SIMD16 shader: 682 inst, 2 loops, 1009856 cycles. >> >> The current unrolling rules are: >> >> static bool >> is_loop_small_enough_to_unroll(nir_shader *shader, nir_loop_info *li) >> { >> unsigned max_iter = shader->options->max_unroll_iterations; >> >> if (li->trip_count > max_iter) >> return false; >> >> if (li->force_unroll) // this is for consecutive indirect array >> access >> return true; >> >> bool loop_not_too_large = >> li->num_instructions * li->trip_count <= max_iter * 25; >> >> return loop_not_too_large; >> } >> >> >> Maybe we should just drop the first trip count check? Checking the >> number of intructions * trip count makes a lot more sense. Dropping >> the >> check allows the first loop to unroll. > > I tried to get this working for GLSL IR loop unrolling and fixed a > couple of the other problems, panicing at intrinsics (partialy fixed > -well I just ignore them for now), and nequal condition. > > But I also ran into futher issues: > > 1) Not being able to detect an induction variable because we end up > with: > > (assign (x) (var_ref compiler_temp@19) (expression uint bitcast_f2u > (swiz y (var_ref r1) )) ) > (if (expression bool != (var_ref compiler_temp@19) > (constant uint (0)) ) ( > break > )
Yeah, the GLSL compiler can't really do anything if it sees a bitcast. I tried to write a pass that removes bitcasts, but it's a PITA because you can have a vec4 where xzw are used as float and y as int. I just gave up and figured that unrolling in LLVM IR would be simpler. Marek _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev