Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-07 Thread Nathan Sidwell
On 12/01/15 11:01, Bernd Schmidt wrote: On 12/01/2015 04:28 PM, Alexander Monakov wrote: I'm taking a different approach. I want to execute all insns in all warp members, while ensuring that the effect (on global and local state) is the same as if any single thread was executing that instruction.

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-03 Thread Alexander Monakov
On Wed, 2 Dec 2015, Nathan Sidwell wrote: > On 12/02/15 12:09, Alexander Monakov wrote: > > > I meant the PTX linked (post PTX-JIT link) image, so regardless of support, > > it's not an issue. E.g. check early in gomp_nvptx_main if .weak > > __nvptx_has_simd != 0. It would only break if there wa

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 12:09, Alexander Monakov wrote: I meant the PTX linked (post PTX-JIT link) image, so regardless of support, it's not an issue. E.g. check early in gomp_nvptx_main if .weak __nvptx_has_simd != 0. It would only break if there was dlopen on PTX. Note I found a bug in .weak support.

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Alexander Monakov
On Wed, 2 Dec 2015, Jakub Jelinek wrote: > > It's easy to address: just terminate threads 1-31 if the linked image has > > no SIMD regions, like my pre-simd libgomp was doing. > > Well, can't the linked image in one shared library call a function > in another linked image in another shared lib

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 11:35, Jakub Jelinek wrote: On Wed, Dec 02, 2015 at 06:44:11PM +0300, Alexander Monakov wrote: But you never know if people actually use #pragma omp simd regions or not, sometimes they will, sometimes they won't, and if the uniform SIMT increases power consumption, it might not be

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Jakub Jelinek
On Wed, Dec 02, 2015 at 06:44:11PM +0300, Alexander Monakov wrote: > > But you never know if people actually use #pragma omp simd regions or not, > > sometimes they will, sometimes they won't, and if the uniform SIMT > increases > > power consumption, it might not be desirable. > > It's easy to ad

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 10:12, Jakub Jelinek wrote: If we have a reasonable IPA pass to discover which addressable variables can be shared by multiple threads and which can't, then we could use soft-stack for those that can be shared by multiple PTX threads (different warps, or same warp, different threads

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Jakub Jelinek
On Wed, Dec 02, 2015 at 05:54:51PM +0300, Alexander Monakov wrote: > On Wed, 2 Dec 2015, Jakub Jelinek wrote: > > > On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote: > > > On 12/02/15 05:40, Jakub Jelinek wrote: > > > > Don't know the HW well enough, is there any power consumption, h

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Alexander Monakov
On Wed, 2 Dec 2015, Jakub Jelinek wrote: > On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote: > > On 12/02/15 05:40, Jakub Jelinek wrote: > > > Don't know the HW well enough, is there any power consumption, heat etc. > > >difference between the two approaches? I mean does the HW cons

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 09:41, Alexander Monakov wrote: On Wed, 2 Dec 2015, Nathan Sidwell wrote: On 12/02/15 05:40, Jakub Jelinek wrote: Don't know the HW well enough, is there any power consumption, heat etc. difference between the two approaches? I mean does the HW consume a different amount of power if

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Alexander Monakov
On Wed, 2 Dec 2015, Nathan Sidwell wrote: > On 12/02/15 05:40, Jakub Jelinek wrote: > > Don't know the HW well enough, is there any power consumption, heat etc. > > difference between the two approaches? I mean does the HW consume a different > > amount of power if only one thread in a warp execute

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 09:24, Jakub Jelinek wrote: On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote: On 12/02/15 09:22, Jakub Jelinek wrote: I believe Alex' testing revealed that if you take address of the same .local objects in several threads, the addresses are the same, and therefore you

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Alexander Monakov
On Wed, 2 Dec 2015, Jakub Jelinek wrote: > On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote: > > On 12/02/15 09:22, Jakub Jelinek wrote: > > > > >I believe Alex' testing revealed that if you take address of the same > > >.local > > >objects in several threads, the addresses are t

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Jakub Jelinek
On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote: > On 12/02/15 09:22, Jakub Jelinek wrote: > > >I believe Alex' testing revealed that if you take address of the same .local > >objects in several threads, the addresses are the same, and therefore you > >refer to your own .local space

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 09:22, Jakub Jelinek wrote: I believe Alex' testing revealed that if you take address of the same .local objects in several threads, the addresses are the same, and therefore you refer to your own .local space rather than the other thread's. Before or after applying cvta? nathan

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Jakub Jelinek
On Wed, Dec 02, 2015 at 09:14:03AM -0500, Nathan Sidwell wrote: > On 12/02/15 08:46, Jakub Jelinek wrote: > > >Or does the OpenACC execution model not allow anything like that, i.e. > >have some function with an automatic variable pass the address of that > >variable to some other function and tha

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 08:46, Jakub Jelinek wrote: Or does the OpenACC execution model not allow anything like that, i.e. have some function with an automatic variable pass the address of that variable to some other function and that other function use #acc loop kind that expects the caller to be at the wo

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Bernd Schmidt
On 12/02/2015 02:46 PM, Jakub Jelinek wrote: Or does the OpenACC execution model not allow anything like that, i.e. have some function with an automatic variable pass the address of that variable to some other function and that other function use #acc loop kind that expects the caller to be at th

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Jakub Jelinek
On Wed, Dec 02, 2015 at 08:38:56AM -0500, Nathan Sidwell wrote: > On 12/02/15 08:10, Jakub Jelinek wrote: > >On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote: > > >Always the whole stack, from the current stack pointer up to top of the > >stack, so sometimes a few bytes, sometimes a

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 08:10, Jakub Jelinek wrote: On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote: Always the whole stack, from the current stack pointer up to top of the stack, so sometimes a few bytes, sometimes a few kilobytes or more each time? The frame of the current function. No

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Jakub Jelinek
On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote: > On 12/02/15 05:40, Jakub Jelinek wrote: > > Don't know the HW well enough, is there any power consumption, heat etc. > >difference between the two approaches? I mean does the HW consume a different > >amount of power if only one threa

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 05:40, Jakub Jelinek wrote: Don't know the HW well enough, is there any power consumption, heat etc. difference between the two approaches? I mean does the HW consume a different amount of power if only one thread in a warp executes code and the other threads in the same warp just jum

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Jakub Jelinek
On Tue, Dec 01, 2015 at 06:28:20PM +0300, Alexander Monakov wrote: > The approach in OpenACC is to, outside of "vector" loops, 1) make threads 1-31 > "slaves" which just follow branches without any computation -- that requires > extra jumps and broadcasting branch predicates, -- and 2) broadcast re

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-01 Thread Alexander Monakov
On Tue, 1 Dec 2015, Bernd Schmidt wrote: > > Didn't we also conclude that address-taking (let's say for stack addresses) is > also an operation that does not result in the same state? This is intended to be used with soft-stacks in OpenMP offloading, and soft-stacks are per-warp outside of SIMD r

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-01 Thread Bernd Schmidt
On 12/01/2015 04:28 PM, Alexander Monakov wrote: I'm taking a different approach. I want to execute all insns in all warp members, while ensuring that the effect (on global and local state) is the same as if any single thread was executing that instruction. Most instructions automatically satisfy

[gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-01 Thread Alexander Monakov
This patch introduces a code generation variant for NVPTX that I'm using for SIMD work in OpenMP offloading. Let me try to explain the idea behind it... In place of SIMD vectorization, NVPTX uses SIMT (single instruction/multiple threads) execution: groups of 32 threads execute the same instr