On Mon, Nov 02, 2015 at 02:23:19PM -0500, Nathan Sidwell wrote:
> +#pragma acc routine gang
> +void __attribute__ ((noinline)) gang (int ary[N])
> +{
> +#pragma acc loop gang
> +    for (unsigned ix = 0; ix < N; ix++)
> +      {
> +	if (__builtin_acc_on_device (5))
> +	  {
> +	    int g = 0, w = 0, v = 0;
> +
> +	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
> +	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
> +	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
> +	    ary[ix] = (g << 16) | (w << 8) | v;
> +	  }
> +	else
> +	  ary[ix] = ix;

Does this work even with -O0?  I mean, the assembler is invalid for any
target other than PTX, so you are relying on aggressively folding this away.

	Jakub
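
To illustrate the concern: one way to avoid depending on the optimizer at all is to keep the PTX-only asm out of any translation unit that is not compiled for PTX. The following is only a rough sketch, not a proposal for the testcase. The function name fill_ids and the value of N are made up, and it assumes the __nvptx__ predefined macro of GCC's nvptx port. It also assumes the file is preprocessed for the PTX target directly. My understanding is that offloaded OpenACC code is preprocessed by the host compiler, so a guard like this would probably not be a drop-in replacement for the __builtin_acc_on_device check.

```c
#define N 8	/* Illustrative size only; the patch defines its own N.  */

/* Hypothetical helper, not part of the patch.  The PTX-specific asm is
   removed by the preprocessor on non-PTX targets, so it never reaches
   the compiler there, whatever the optimization level.  */
void
fill_ids (int ary[N])
{
  for (unsigned ix = 0; ix < N; ix++)
    {
#if defined (__nvptx__)
      /* Assumed to be seen only by a compiler targeting PTX.  */
      int g = 0, w = 0, v = 0;

      __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
      __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
      __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
      ary[ix] = (g << 16) | (w << 8) | v;
#else
      /* Every other target just stores the index.  */
      ary[ix] = ix;
#endif
    }
}
```

On a host build this compiles at -O0 because the asm statements are never seen, and each element simply ends up holding its own index.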