On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
> In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn
> have been extended to take a barrier ID and a thread count. The idea
> here is to assign one barrier for each logical vector. Worker-single
> synchronization is controlled by barrier 0. Therefore, the vector
> barrier ID is set to tid.y+1 (because there's one vector unit per
> worker) in nvptx_init_oacc_workers and placed into a register stored in
> cfun->machine->sync_bar. If no workers are present, then the barrier ID
> falls back to 0.
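
To make the numbering scheme described above concrete, here is a minimal
sketch in plain C. It is only a model of the description, not code from
the patch: vector_barrier_id and barrier_thread_count are hypothetical
helper names, and passing the vector length as the thread count operand
is an assumption. The warp-size check reflects the PTX rule that the
thread count given to bar.sync must be a multiple of the warp size.

```c
#include <assert.h>

#define WARP_SIZE 32 /* threads per warp on NVIDIA GPUs */

/* Hypothetical helper (not a GCC function): barrier ID used by the
   vector that belongs to worker TID_Y.  Barrier 0 is reserved for
   worker-single synchronization, so vectors start at barrier 1.
   Without workers, the ID falls back to 0.  */
static unsigned
vector_barrier_id (unsigned tid_y, int have_workers)
{
  return have_workers ? tid_y + 1 : 0;
}

/* Hypothetical helper: thread count operand for a vector barrier.
   Assumption: the count equals the vector length, which must be a
   multiple of the warp size.  */
static unsigned
barrier_thread_count (unsigned vector_length)
{
  assert (vector_length % WARP_SIZE == 0);
  return vector_length;
}
```

With this model, worker 0's vector uses barrier 1, worker 3's vector
uses barrier 4, and a worker-level join uses barrier 0.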
I compiled a worker loop before and after the patch series, and observed
this change:
...
@@ -70,7 +71,7 @@
$L2:
// joining 2;
$L5:
- bar.sync 1;
+ bar.sync 0;
// join 2;
ret;
}
...
AFAICT from your explanation above, that change is intentional.
Changing the code generation scheme for workers is fine, but obviously
that should be a minimal, separate patch that we can bisect back to.
Thanks,
- Tom