Hi! I'm not all too familiar with the nvptx back end, and I keep forgetting (and then later re-learning) a lot of PTX details, so please bear with me... I'd like to discuss/gather some ideas about how to improve (whatever that may mean exactly) code generation in the nvptx back end.
We're currently looking into updating OpenACC "privatization"/"state propagation" (between OpenACC gang, worker, and vector parallel regions) according to how that got clarified in the OpenACC 2.5 standard, so we're not considering otherwise touching all this machinery until that task is resolved.

Obviously, we can generally update the back end to generate code for newer PTX/CC versions, adding new instructions, and so on. On <https://gcc.gnu.org/wiki/nvptx>/<https://gcc.gnu.org/wiki/Offloading> we're arguing that "as these would be difficult to implement due to the constraints set by PTX itself, the GCC nvptx back end doesn't support setjmp/longjmp, exceptions (?), alloca, computed goto, non-local goto, for example". We could improve on that, but that's probably not too useful, given the desired use case for nvptx code generation -- OpenACC/OpenMP offloaded regions -- which typically don't make use of such functionality.

The PTX code we generate will later be "JIT"-compiled by the CUDA driver, so we're expecting that one to "clean up" a lot of stuff for us. For example, PTX itself doesn't bound the number of registers, so we're not currently doing any register allocation (and instead just emit all "virtual" registers); the PTX "JIT" compiler will then do the register allocation, according to the actual target hardware's capabilities. Of course, it remains a valid question whether GCC could do better register allocation itself (because it has better knowledge of the code structure, and doesn't have to reconstruct it), or whether that would in fact produce worse code/worse performance, because the PTX "JIT" compiler might then no longer understand that code. I had the idea to actually try this out, using some benchmarking code, without and with (manual) register allocation (that is, basically, re-using existing "dead" registers instead of allocating new ones).

Looking at some actual code.
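(To illustrate the kind of manual change I have in mind -- a hypothetical fragment, with invented register names, not actual GCC output. Where we currently declare a fresh ".reg" for every new value:

```
	.reg.u32 %r10;
	.reg.u32 %r11;
	mov.u32	%r10, %ar0;
	add.u32	%r11, %r10, 1;	// Last use of %r10.
	st.u32	[%frame], %r11;
```

..., the experiment would be to notice that "%r10" is dead after its last use, and re-use it for the next value:

```
	.reg.u32 %r10;
	mov.u32	%r10, %ar0;
	add.u32	%r10, %r10, 1;	// "%r10" is dead here, so re-use it...
	st.u32	[%frame], %r10;	// ..., instead of declaring a "%r11".
```

That is exactly what a real register allocator does systematically; the question is only whether doing it in GCC helps or hinders the PTX "JIT" compiler.)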
Given:

$ cat < s.c
struct S
{
  double d;
  int y;
};

float f(int, struct S) __attribute__((noinline));

float f(int x, struct S s)
{
  if (x == s.y)
    s.d = 0.;
  return s.d;
}

int main()
{
  struct S s;
  s.d = 1.;
  s.y = 2;
  if (f(2, s) != 0.)
    __builtin_trap();
  if (f(1, s) != 1.)
    __builtin_trap();
  return 0;
}

..., we currently produce the following "-O2" code:

$ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ --sysroot=install/nvptx-none -Wall -Wextra s.c -O2 -mmainkernel
$ install/bin/nvptx-none-run a.out # launches, and completes normally
$ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ --sysroot=install/nvptx-none -Wall -Wextra s.c -O2 -S
$ cat -n < s.s
     1	// BEGIN PREAMBLE
     2		.version	3.1
     3		.target	sm_30
     4		.address_size 64
     5	// END PREAMBLE
     6	
     7	
     8	// BEGIN GLOBAL FUNCTION DECL: f
     9	.visible .func (.param.f32 %value_out) f (.param.u32 %in_ar0, .param.u64 %in_ar1);
    10	
    11	// BEGIN GLOBAL FUNCTION DEF: f
    12	.visible .func (.param.f32 %value_out) f (.param.u32 %in_ar0, .param.u64 %in_ar1)
    13	{
    14		.reg.f32 %value;
    15		.reg.u32 %ar0;
    16		ld.param.u32 %ar0, [%in_ar0];
    17		.reg.u64 %ar1;
    18		ld.param.u64 %ar1, [%in_ar1];
    19		.reg.f64	%r23;
    20		.reg.f32	%r24;
    21		.reg.u32	%r25;
    22		.reg.u64	%r26;
    23		.reg.u32	%r27;
    24		.reg.pred	%r28;
    25		mov.u32	%r25, %ar0;
    26		mov.u64	%r26, %ar1;
    27		ld.f64	%r23, [%r26];
    28		ld.u32	%r27, [%r26+8];
    29		setp.eq.u32	%r28, %r27, %r25;
    30	@%r28	bra	$L3;
    31		cvt.rn.f32.f64	%r24, %r23;
    32		bra	$L1;
    33	$L3:
    34		mov.f32	%r24, 0f00000000;
    35	$L1:
    36		mov.f32	%value, %r24;
    37		st.param.f32	[%value_out], %value;
    38		ret;
    39	}
    40	
    41	// BEGIN GLOBAL FUNCTION DECL: main
    42	.visible .func (.param.u32 %value_out) main (.param.u32 %in_ar0, .param.u64 %in_ar1);
    43	
    44	// BEGIN GLOBAL FUNCTION DEF: main
    45	.visible .func (.param.u32 %value_out) main (.param.u32 %in_ar0, .param.u64 %in_ar1)
    46	{
    47		.reg.u32 %value;
    48		.local .align 8 .b8 %frame_ar[32];
    49		.reg.u64 %frame;
    50		cvta.local.u64 %frame, %frame_ar;
    51		.reg.f64	%r25;
    52		.reg.u32	%r26;
    53		.reg.u64	%r28;
    54		.reg.u64	%r29;
    55		.reg.u64	%r31;
    56		.reg.f32	%r32;
    57		.reg.pred	%r33;
    58		.reg.u32	%r34;
    59		.reg.u64	%r35;
    60		.reg.u64	%r36;
    61		.reg.f32	%r39;
    62		.reg.pred	%r40;
    63		mov.f64	%r25, 0d3ff0000000000000;
    64		st.f64	[%frame], %r25;
    65		mov.u32	%r26, 2;
    66		st.u32	[%frame+8], %r26;
    67		mov.u64	%r28, 4607182418800017408;
    68		st.u64	[%frame+16], %r28;
    69		ld.u64	%r29, [%frame+8];
    70		st.u64	[%frame+24], %r29;
    71		add.u64	%r31, %frame, 16;
    72	{
    73		.param.f32 %value_in;
    74		.param.u32 %out_arg1;
    75		st.param.u32 [%out_arg1], %r26;
    76		.param.u64 %out_arg2;
    77		st.param.u64 [%out_arg2], %r31;
    78		call (%value_in), f, (%out_arg1, %out_arg2);
    79		ld.param.f32	%r32, [%value_in];
    80	}
    81		setp.eq.f32	%r33, %r32, 0f00000000;
    82	@%r33	bra	$L5;
    83	$L6:
    84		trap;
    85	$L5:
    86		ld.u64	%r35, [%frame];
    87		st.u64	[%frame+16], %r35;
    88		ld.u64	%r36, [%frame+8];
    89		st.u64	[%frame+24], %r36;
    90		mov.u32	%r34, 1;
    91	{
    92		.param.f32 %value_in;
    93		.param.u32 %out_arg1;
    94		st.param.u32 [%out_arg1], %r34;
    95		.param.u64 %out_arg2;
    96		st.param.u64 [%out_arg2], %r31;
    97		call (%value_in), f, (%out_arg1, %out_arg2);
    98		ld.param.f32	%r39, [%value_in];
    99	}
   100		setp.neu.f32	%r40, %r39, 0f3f800000;
   101	@%r40	bra	$L6;
   102		mov.u32	%value, 0;
   103		st.param.u32	[%value_out], %value;
   104		ret;
   105	}

A few ideas:

Lines 12, 15 - 18, 37, and 73 - 79, 92 - 98. The following doesn't apply to ".entry" kernels, but for "normal" functions there is no reason to use the ".param" space for passing arguments into and out of functions. We can then get rid of the boilerplate code that moves ".param %in_ar*" into ".reg %ar*", and the other way round for "%value_out"/"%value". This will then also simplify the call sites, where all that code "evaporates". That's actually something I started to look into many months ago; I've now just dug out those changes, and will post them later. (Very likely, the PTX "JIT" compiler will do the very same thing without difficulty, but why not directly generate code that is less verbose to read?)

Lines 29 - 35.
Instead of predicating the branch instruction, what would it take to directly predicate the guarded instructions, that is, turn this region into:

    29		setp.eq.u32	%r28, %r27, %r25;
    30	
    31	@!%r28	cvt.rn.f32.f64	%r24, %r23;
    32	
    33	
    34	@%r28	mov.f32	%r24, 0f00000000;
    35	

(Of course, again, we don't know what the PTX "JIT" compiler is doing, so we can't tell whether that would be an actual improvement, but it's certainly easier to read.)

In lines 48 - 50, a frame in ".local" space is allocated for two "struct S s" objects. The first of these is initialized in lines 63 - 71, and then duplicated: "%r31" is made a pointer to the second instance, as the object is to be copied by value for the first call of "f", and in lines 86 - 89 it is then "reloaded", for use by the second by-value call of "f". Now, as the address of the actual "s" object doesn't escape, that one could also be held in ".reg"s instead of ".local" memory. As far as I remember, there are passes in GCC to do such things -- do we just have to enable them, or is that not beneficial, or not possible for some reason?

Then, <http://docs.nvidia.com/cuda/parallel-thread-execution/#device-function-parameters> describes how the ".param" space can be used "for passing objects by value that do not fit within a PTX register, such as C structures larger than 8 bytes". That would avoid all this "%frame" indirection when calling "f"? The same would apply to function return values, I would think.

Regards,
 Thomas