On 03/22/2018 07:44 AM, Tom de Vries wrote: > On 03/02/2018 05:55 PM, Cesar Philippidis wrote: >> The attached patch generalizes the worker state propagation and >> synchronization code to handle large vectors. When the vector_length is >> larger than a CUDA warp, the nvptx BE will now use shared-memory to >> spill-and-fill vector state when transitioning from vector-single mode >> to vector partitioned. > > I've compiled this test-case: > ... > int > main (void) > { > int a[10]; > #pragma acc parallel loop worker > for (int i = 0; i < 10; i++) > a[i] = i; > > return 0; > } > ... > > without and with the patch series, and observed the following difference > in generated ptx: > ... > -.shared .align 8 .u8 __oacc_bcast[8]; > +.shared .align 8 .u8 __oacc_bcast[264]; > ... > > Why is the example using 33 times more shared memory space with the > patch series applied?
Because the nvptx BE wasn't taking into account that vector_length = 32 doesn't need to use shared-memory to broadcast variables. That magic value of 33 was derived from nvptx_mach_max_workers () + 1. When vector_length > 32, there needs to be nvptx_mach_max_workers () partitions for vector state propagation. There also needs to be a shared-memory buffer for worker-state propagation, because I found situations where some threads where still spilling and filling workers before vector 0 transitioned vector-partitioned mode. The attached, untested, patch should resolve that issue. Cesar
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 3102c79bf96..f81fb0113d5 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -4061,9 +4061,14 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block, if (oacc_bcast_partition < data.offset) { int psize = data.offset; + int pnum = 1; + + if (nvptx_mach_vector_length () > PTX_WARP_SIZE) + pnum = nvptx_mach_max_workers () + 1; + psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1); oacc_bcast_partition = psize; - oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1); + oacc_bcast_size = psize * pnum; } } return empty; @@ -4348,9 +4353,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) if (oacc_bcast_partition < size) { int psize = size; + int pnum = 1; + + if (nvptx_mach_vector_length () > PTX_WARP_SIZE) + pnum = nvptx_mach_max_workers () + 1; + psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1); oacc_bcast_partition = psize; - oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1); + oacc_bcast_size = psize * pnum; } data.offset = 0;