Re: [OpenMP/offloading][RFC] How to handle target/device-specifics with C pre-processor (in general, inside 'omp declare variant')

Tobias Burnus Tue, 22 Aug 2023 01:44:13 -0700

On 22.08.23 09:25, Richard Biener wrote:

On Mon, Aug 21, 2023 at 6:23 PM Tobias Burnus <tob...@codesourcery.com> wrote:

...

Err, so the OMP standard doesn't put any constraints on what to allow inside the
variants?  Is declare variant always at the toplevel?


Actually, the OpenMP specification only states the following, which is less
than I claimed:

"If the context selector of a begin declare variant directive contains traits
in the device or implementation set that are known never to be compatible with
an OpenMP context during the current compilation, the preprocessed code that
follows the begin declare variant directive up to its paired end directive is
elided."

With once-per-target parsing, as with Clang, code like:

#pragma omp begin declare variant ... arch={...}
 #define FOO 5
 ...
#pragma omp end declare variant ... arch={}

could be effectively replaced by:

#ifdef __nvptx__
 #define FOO 5
 #pragma omp begin declare variant ... arch={...}
   ...
 #pragma omp end declare variant ... arch={}
#endif

such that only code matching the architecture and ISA remains,
while for all other selectors, as in

#pragma omp begin declare variant ... construct={teams,parallel,for})
  #define BAR 1
  ...
#pragma omp end declare variant
#pragma omp begin declare variant ... construct={distribute})
  #define FOOBAR 1
#pragma omp end declare variant

the two defines would remain, visible in the whole TU.

* * *

Thus, GCC not eliding anything (because the code might get used in later
processing) would be a conforming implementation.

However, I fear that users expect code like the example shown above to
work, i.e. that at least 'arch' (for us: host + nvptx, amdgcn) and possibly
'isa' (host ISA + gfx906, sm_80 etc.) are honored and the preprocessed
code is elided.

As such support comes for free with Clang (and with most or all other
compilers, as most of them are based on Clang), it will work there,
increasing the chance that users want to use it.

* * *

Regarding top level or not, the spec does not really say, except that
the directive has to be used in a declarative context.

In practical terms, I assume that code elision will (nearly) only be
used at top level and via #include, with the idea that this brings in
function declarations (but is likely to bring in #defines as a side
effect).

In terms of the spec, more is permitted, including using it inside C++
classes, albeit not for constructors/destructors or for virtual, defaulted
and deleted functions. But that seems to be an odd place for adding an
#include or #define or some other code.

Likewise for inside a function or some scope inside a function.

Thus, IMHO, not supporting non-toplevel elision would be fine.

[Thinking of it, the problem is not only conflicting function
declarations and #define but also conflicting typedef and enum/struct.]

...

But does that really help?  Consider

#ifdef _OPENMP
#pragma omp begin declare variant match(device={arch=NVPTX})
#include "cuda/math.h"
...

#pragma omp begin declare variant match(device={arch=NVPTX})
#include "conflicting with cuda/math.h"
...

I think this will produce a conflict (when the compiler accepts
arch=nvptx) as both have the same context selector; that is independent
of whether they are nested (begin / begin ... end / end) or sequential
(begin ... end; begin ... end).

or is there a constraint that "un-varianting" same-match variants need
to produce a valid translation unit?  That is, don't you get combinatorial
explosion with sequenced variants?


You do get a combinatorial explosion, but handling only arch + isa
currently leaves host (+ enabled ISA) plus devices (each: + enabled ISA),
such that with current GCC support only up to 3 combinations remain
(host, amdgcn, nvptx).

And in the case of Clang or any once-per-device parsing, only a single
combination remains. The spec permits handling other traits as well, but
arch/isa seems to be the most useful and, for a multi-parse compiler, the
simplest.

(Side remark: It would be useful if we could support multiple ISA per
offload target, e.g. compiling for gfx908 *and* gfx90a, but that's
currently not possible with GCC (but it is with Clang).)

Does the OMP standard at all think of how the resulting C/C++ translation
unit is formed or does it simply take each variant as "finishing" a TU after
omp end declare variant?  Thus do declarations leak out of the "active"
variant into the following parts of the C/C++ TU?


I think it does not really think of finishing but of eliding before it
is processed – such that the remaining code simply applies to the TU as
if there were no surrounding begin/end declare variant.

And it assumes a multi-parse setup.

To me it really looks like a very badly designed feature, not to mention
that it involves the preprocessor ...


Yes, it is an odd combination of preprocessor and code gen. If both are
done in one step, I think it works: instruct the extended processor
to skip until 'end declare variant' if the context selector for 'begin
declare variant' does not match.

But if one splits it into two parts: 'cpp' and only then 'compiling',
the parser has to skip over all code, including code it potentially does
not handle, until the #pragma omp end declare variant.

I am not sure whether both approaches should be supported, but
supporting only the former seems to be more important and sufficient.

* * *

Any suggestion?

Something like you propose.  I'd even do it "harder", inventing a new
omppd (openmp preprocessor driver) which will pre-parse a TU and
invoke several compiler instances (GCC drivers) with -fomp-variant=X
making only variants "X" active.

Hmm. That would kind of undo the parse-once approach we currently have,
but it would solve the issue.

   Doesn't really solve the issue with
sequenced variants unless there are constraints in the OMP spec
making that work.  It should be possible to have the separate compilers
produce LTO bytecode (for the offload target then) from the "same"
C TU and combine them at WPA time.  All the offload table handling
might need to improve here of course, but the omppd might produce
enough meta data to help here.

That said, I really wouldn't try to fiddle "omppd" into the host
compiler parts, that doesn't sound fun for maintenance purposes.


Thanks for your comments!

Tobias

