On 12/01/15 11:01, Bernd Schmidt wrote:
On 12/01/2015 04:28 PM, Alexander Monakov wrote:
I'm taking a different approach. I want to execute all insns in all warp
members, while ensuring that effect (on global and local state) is the same
as if any single thread was executing that instruction.
On Wed, 2 Dec 2015, Nathan Sidwell wrote:
> On 12/02/15 12:09, Alexander Monakov wrote:
>
> > I meant the PTX linked (post PTX-JIT link) image, so regardless of support,
> > it's not an issue. E.g. check early in gomp_nvptx_main if .weak
> > __nvptx_has_simd != 0. It would only break if there was dlopen on PTX.
On 12/02/15 12:09, Alexander Monakov wrote:
I meant the PTX linked (post PTX-JIT link) image, so regardless of support,
it's not an issue. E.g. check early in gomp_nvptx_main if .weak
__nvptx_has_simd != 0. It would only break if there was dlopen on PTX.
Note I found a bug in .weak support.
On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> > It's easy to address: just terminate threads 1-31 if the linked image has
> > no SIMD regions, like my pre-simd libgomp was doing.
>
> Well, can't the linked image in one shared library call a function
> in another linked image in another shared library?
On 12/02/15 11:35, Jakub Jelinek wrote:
On Wed, Dec 02, 2015 at 06:44:11PM +0300, Alexander Monakov wrote:
But you never know if people actually use #pragma omp simd regions or not,
sometimes they will, sometimes they won't, and if the uniform SIMT increases
power consumption, it might not be desirable.
On Wed, Dec 02, 2015 at 06:44:11PM +0300, Alexander Monakov wrote:
> > But you never know if people actually use #pragma omp simd regions or not,
> > sometimes they will, sometimes they won't, and if the uniform SIMT
> > increases power consumption, it might not be desirable.
>
> It's easy to address: just terminate threads 1-31 if the linked image has
> no SIMD regions, like my pre-simd libgomp was doing.
On 12/02/15 10:12, Jakub Jelinek wrote:
If we have a reasonable IPA pass to discover which addressable variables can
be shared by multiple threads and which can't, then we could use soft-stack
for those that can be shared by multiple PTX threads (different warps, or
same warp, different threads).
On Wed, Dec 02, 2015 at 05:54:51PM +0300, Alexander Monakov wrote:
> On Wed, 2 Dec 2015, Jakub Jelinek wrote:
>
> > On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> > > On 12/02/15 05:40, Jakub Jelinek wrote:
> > > > Don't know the HW good enough, is there any power consumption, heat etc.
> > > > difference between the two approaches?
On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> > On 12/02/15 05:40, Jakub Jelinek wrote:
> > Don't know the HW good enough, is there any power consumption, heat etc.
> > difference between the two approaches? I mean does the HW consume different
> > amount of power if only one thread in a warp executes code and the other
> > threads in the same warp just jump
On 12/02/15 09:41, Alexander Monakov wrote:
On Wed, 2 Dec 2015, Nathan Sidwell wrote:
On 12/02/15 05:40, Jakub Jelinek wrote:
Don't know the HW good enough, is there any power consumption, heat etc.
difference between the two approaches? I mean does the HW consume different
amount of power if only one thread in a warp executes code and the other
threads in the same warp just jump
On Wed, 2 Dec 2015, Nathan Sidwell wrote:
> On 12/02/15 05:40, Jakub Jelinek wrote:
> > Don't know the HW good enough, is there any power consumption, heat etc.
> > difference between the two approaches? I mean does the HW consume different
> > amount of power if only one thread in a warp executes code and the other
> > threads in the same warp just jump
On 12/02/15 09:24, Jakub Jelinek wrote:
On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote:
On 12/02/15 09:22, Jakub Jelinek wrote:
I believe Alex' testing revealed that if you take address of the same .local
objects in several threads, the addresses are the same, and therefore you
refer to your own .local space rather than the other thread's.
On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote:
> > On 12/02/15 09:22, Jakub Jelinek wrote:
> >
> > > I believe Alex' testing revealed that if you take address of the same
> > > .local objects in several threads, the addresses are the same, and
> > > therefore you refer to your own .local space rather than the other
> > > thread's.
On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote:
> On 12/02/15 09:22, Jakub Jelinek wrote:
>
> >I believe Alex' testing revealed that if you take address of the same .local
> >objects in several threads, the addresses are the same, and therefore you
> >refer to your own .local space rather than the other thread's.
On 12/02/15 09:22, Jakub Jelinek wrote:
I believe Alex' testing revealed that if you take address of the same .local
objects in several threads, the addresses are the same, and therefore you
refer to your own .local space rather than the other thread's.
Before or after applying cvta?
nathan
On Wed, Dec 02, 2015 at 09:14:03AM -0500, Nathan Sidwell wrote:
> On 12/02/15 08:46, Jakub Jelinek wrote:
>
> >Or does the OpenACC execution model not allow anything like that, i.e.
> >have some function with an automatic variable pass the address of that
> >variable to some other function and that other function use #acc loop kind
> >that expects the caller to be at the worker
On 12/02/15 08:46, Jakub Jelinek wrote:
Or does the OpenACC execution model not allow anything like that, i.e.
have some function with an automatic variable pass the address of that
variable to some other function and that other function use #acc loop kind
that expects the caller to be at the worker
On 12/02/2015 02:46 PM, Jakub Jelinek wrote:
Or does the OpenACC execution model not allow anything like that, i.e.
have some function with an automatic variable pass the address of that
variable to some other function and that other function use #acc loop kind
that expects the caller to be at the worker
On Wed, Dec 02, 2015 at 08:38:56AM -0500, Nathan Sidwell wrote:
> On 12/02/15 08:10, Jakub Jelinek wrote:
> >On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
>
> >Always the whole stack, from the current stack pointer up to top of the
> >stack, so sometimes a few bytes, sometimes a few kilobytes or more each time?
On 12/02/15 08:10, Jakub Jelinek wrote:
On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
Always the whole stack, from the current stack pointer up to top of the
stack, so sometimes a few bytes, sometimes a few kilobytes or more each time?
The frame of the current function. No
On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> On 12/02/15 05:40, Jakub Jelinek wrote:
> > Don't know the HW good enough, is there any power consumption, heat etc.
> >difference between the two approaches? I mean does the HW consume different
> >amount of power if only one thread in a warp executes code and the other
> >threads in the same warp just jump
On 12/02/15 05:40, Jakub Jelinek wrote:
Don't know the HW good enough, is there any power consumption, heat etc.
difference between the two approaches? I mean does the HW consume different
amount of power if only one thread in a warp executes code and the other
threads in the same warp just jump
On Tue, Dec 01, 2015 at 06:28:20PM +0300, Alexander Monakov wrote:
> The approach in OpenACC is to, outside of "vector" loops, 1) make threads 1-31
> "slaves" which just follow branches without any computation -- that requires
> extra jumps and broadcasting branch predicates, -- and 2) broadcast re
On Tue, 1 Dec 2015, Bernd Schmidt wrote:
>
> Didn't we also conclude that address-taking (let's say for stack addresses) is
> also an operation that does not result in the same state?
This is intended to be used with soft-stacks in OpenMP offloading, and
soft-stacks are per-warp outside of SIMD regions.
On 12/01/2015 04:28 PM, Alexander Monakov wrote:
I'm taking a different approach. I want to execute all insns in all warp
members, while ensuring that effect (on global and local state) is the same
as if any single thread was executing that instruction. Most instructions
automatically satisfy
This patch introduces a code generation variant for NVPTX that I'm using for
SIMD work in OpenMP offloading. Let me try to explain the idea behind it...
In place of SIMD vectorization, NVPTX is using SIMT (single
instruction/multiple threads) execution: groups of 32 threads execute the same
instruction