Hi! I'm not all too familiar with the nvptx back end, and I keep forgetting (and then later re-learning) a lot of PTX details, so please bear with me... I'd like to discuss/gather some ideas about how to improve (whatever that may mean exactly) code generation in the nvptx back end.
We're currently looking into updating OpenACC "privatization"/"state propagation" (between OpenACC gang, worker, and vector parallel regions) according to how that got clarified in the OpenACC 2.5 standard, so we're not considering otherwise touching all this machinery until that task is resolved.

Obviously, we can generally update the back end to generate code for newer PTX/CC versions, adding new instructions, and so on. On <https://gcc.gnu.org/wiki/nvptx>/<https://gcc.gnu.org/wiki/Offloading> we're arguing that "as these would be difficult to implement due to the constraints set by PTX itself, the GCC nvptx back end doesn't support setjmp/longjmp, exceptions (?), alloca, computed goto, non-local goto, for example". We could improve on that, but that's probably not too useful, given the desired use case for nvptx code generation -- OpenACC/OpenMP offloaded regions -- which typically don't make use of such functionality.

The PTX code we generate will later be "JIT"-compiled by the CUDA driver, so we're expecting that one to "clean up" a lot of stuff for us. For example, PTX itself doesn't bound the number of registers, so we're not currently doing any register allocation (and instead just emit all "virtual" registers); the PTX "JIT" compiler will then do the register allocation, according to the actual target hardware's capabilities. Of course, it remains a valid question whether GCC could do better register allocation itself (because it has better knowledge of the code structure, and doesn't have to reconstruct it), or whether that would in fact produce worse code/worse performance, because the PTX "JIT" compiler might then no longer understand that code. I had the idea to actually try this out, using some benchmarking code, without and with (manual) register allocation (that is, basically, re-using existing "dead" registers instead of allocating new ones).

Looking at some actual code.
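(To illustrate the kind of manual change I have in mind -- a hypothetical fragment, with invented register names, not actual GCC output. Where we currently declare a fresh ".reg" for every new value:

```
	.reg.u32 %r10;
	.reg.u32 %r11;
	mov.u32	%r10, %ar0;
	add.u32	%r11, %r10, 1;	// Last use of %r10.
	st.u32	[%frame], %r11;
```

..., the experiment would be to notice that "%r10" is dead after its last use, and re-use it for the next value:

```
	.reg.u32 %r10;
	mov.u32	%r10, %ar0;
	add.u32	%r10, %r10, 1;	// "%r10" is dead here, so re-use it...
	st.u32	[%frame], %r10;	// ..., instead of declaring a "%r11".
```

That is exactly what a real register allocator does systematically; the question is only whether doing it in GCC helps or hinders the PTX "JIT" compiler.)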
Given:

$ cat < s.c
struct S
{
  double d;
  int y;
};

float f(int, struct S) __attribute__((noinline));

float f(int x, struct S s)
{
  if (x == s.y)
    s.d = 0.;
  return s.d;
}

int main()
{
  struct S s;
  s.d = 1.;
  s.y = 2;
  if (f(2, s) != 0.)
    __builtin_trap();
  if (f(1, s) != 1.)
    __builtin_trap();
  return 0;
}

..., we currently produce the following "-O2" code:

$ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ --sysroot=install/nvptx-none -Wall -Wextra s.c -O2 -mmainkernel
$ install/bin/nvptx-none-run a.out # launches, and completes normally
$ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ --sysroot=install/nvptx-none -Wall -Wextra s.c -O2 -S
$ cat -n < s.s
     1	// BEGIN PREAMBLE
     2		.version	3.1
     3		.target	sm_30
     4		.address_size 64
     5	// END PREAMBLE
     6	
     7	
     8	// BEGIN GLOBAL FUNCTION DECL: f
     9	.visible .func (.param.f32 %value_out) f (.param.u32 %in_ar0, .param.u64 %in_ar1);
    10	
    11	// BEGIN GLOBAL FUNCTION DEF: f
    12	.visible .func (.param.f32 %value_out) f (.param.u32 %in_ar0, .param.u64 %in_ar1)
    13	{
    14		.reg.f32 %value;
    15		.reg.u32 %ar0;
    16		ld.param.u32 %ar0, [%in_ar0];
    17		.reg.u64 %ar1;
    18		ld.param.u64 %ar1, [%in_ar1];
    19		.reg.f64	%r23;
    20		.reg.f32	%r24;
    21		.reg.u32	%r25;
    22		.reg.u64	%r26;
    23		.reg.u32	%r27;
    24		.reg.pred	%r28;
    25		mov.u32	%r25, %ar0;
    26		mov.u64	%r26, %ar1;
    27		ld.f64	%r23, [%r26];
    28		ld.u32	%r27, [%r26+8];
    29		setp.eq.u32	%r28, %r27, %r25;
    30	@%r28	bra	$L3;
    31		cvt.rn.f32.f64	%r24, %r23;
    32		bra	$L1;
    33	$L3:
    34		mov.f32	%r24, 0f00000000;
    35	$L1:
    36		mov.f32	%value, %r24;
    37		st.param.f32	[%value_out], %value;
    38		ret;
    39	}
    40	
    41	// BEGIN GLOBAL FUNCTION DECL: main
    42	.visible .func (.param.u32 %value_out) main (.param.u32 %in_ar0, .param.u64 %in_ar1);
    43	
    44	// BEGIN GLOBAL FUNCTION DEF: main
    45	.visible .func (.param.u32 %value_out) main (.param.u32 %in_ar0, .param.u64 %in_ar1)
    46	{
    47		.reg.u32 %value;
    48		.local .align 8 .b8 %frame_ar[32];
    49		.reg.u64 %frame;
    50		cvta.local.u64 %frame, %frame_ar;
    51		.reg.f64	%r25;
    52		.reg.u32	%r26;
    53		.reg.u64	%r28;
    54		.reg.u64	%r29;
    55		.reg.u64	%r31;
    56		.reg.f32	%r32;
    57		.reg.pred	%r33;
    58		.reg.u32	%r34;
    59		.reg.u64	%r35;
    60		.reg.u64	%r36;
    61		.reg.f32	%r39;
    62		.reg.pred	%r40;
    63		mov.f64	%r25, 0d3ff0000000000000;
    64		st.f64	[%frame], %r25;
    65		mov.u32	%r26, 2;
    66		st.u32	[%frame+8], %r26;
    67		mov.u64	%r28, 4607182418800017408;
    68		st.u64	[%frame+16], %r28;
    69		ld.u64	%r29, [%frame+8];
    70		st.u64	[%frame+24], %r29;
    71		add.u64	%r31, %frame, 16;
    72	{
    73		.param.f32 %value_in;
    74		.param.u32 %out_arg1;
    75		st.param.u32 [%out_arg1], %r26;
    76		.param.u64 %out_arg2;
    77		st.param.u64 [%out_arg2], %r31;
    78		call (%value_in), f, (%out_arg1, %out_arg2);
    79		ld.param.f32	%r32, [%value_in];
    80	}
    81		setp.eq.f32	%r33, %r32, 0f00000000;
    82	@%r33	bra	$L5;
    83	$L6:
    84		trap;
    85	$L5:
    86		ld.u64	%r35, [%frame];
    87		st.u64	[%frame+16], %r35;
    88		ld.u64	%r36, [%frame+8];
    89		st.u64	[%frame+24], %r36;
    90		mov.u32	%r34, 1;
    91	{
    92		.param.f32 %value_in;
    93		.param.u32 %out_arg1;
    94		st.param.u32 [%out_arg1], %r34;
    95		.param.u64 %out_arg2;
    96		st.param.u64 [%out_arg2], %r31;
    97		call (%value_in), f, (%out_arg1, %out_arg2);
    98		ld.param.f32	%r39, [%value_in];
    99	}
   100		setp.neu.f32	%r40, %r39, 0f3f800000;
   101	@%r40	bra	$L6;
   102		mov.u32	%value, 0;
   103		st.param.u32	[%value_out], %value;
   104		ret;
   105	}

A few ideas:

Lines 12, 15 - 18, 37, and 73 - 79, 92 - 98. The following doesn't apply to ".entry" kernels, but for "normal" functions there is no reason to use the ".param" space for passing arguments into and out of functions. We can then get rid of the boilerplate code that moves ".param %in_ar*" into ".reg %ar*", and the other way round for "%value_out"/"%value". This will then also simplify the call sites, where all that code "evaporates". That's actually something I started to look into many months ago; I've now just dug out those changes, and will post them later. (Very likely, the PTX "JIT" compiler will do the very same thing without difficulty, but why not directly generate code that is less verbose to read?)

Lines 29 - 35.
Instead of predicating the branch instruction, what would it take to directly predicate the guarded instructions, that is, turn this region into:

    29		setp.eq.u32	%r28, %r27, %r25;
    30	
    31	@!%r28	cvt.rn.f32.f64	%r24, %r23;
    32	
    33	
    34	@%r28	mov.f32	%r24, 0f00000000;
    35	

(Of course, again, we don't know what the PTX "JIT" compiler is doing, so we can't tell whether that would be an actual improvement, but it's certainly easier to read.)

In lines 48 - 50, a frame in ".local" space is allocated for two "struct S s" objects. The first of these is initialized in lines 63 - 71, and then duplicated: "%r31" is made a pointer to the second instance, as the object is to be copied by value for the first call of "f", and in lines 86 - 89 it is then "reloaded", for use by the second by-value call of "f". Now, as the address of the actual "s" object doesn't escape, that one could also be held in ".reg"s instead of ".local" memory. As far as I remember, there are passes in GCC to do such things -- do we just have to enable them, or is that not beneficial, or not possible for some reason?

Then, <http://docs.nvidia.com/cuda/parallel-thread-execution/#device-function-parameters> describes how the ".param" space can be used "for passing objects by value that do not fit within a PTX register, such as C structures larger than 8 bytes". That would avoid all this "%frame" indirection when calling "f"? The same would apply to function return values, I would think.

Regards,
 Thomas