Re: Version numbers question

2015-06-22 Thread Ilya Verbin
On Mon, Jun 22, 2015 at 08:55:03 -0500, JohnT wrote:
> I am wondering why it appears that GCC has started drastically raising its
> major version number for minor changes, instead of spending several years
> on version 3 and 4. 4.0.1, 4.1.1 and 4.1.2, 4.2.3, 4.3.2, 4.4.5, up through
> 4.7.0, 4.7.1, 4.7.2, the 4.8 and 4.9 releases, then version 5.1 and
> talking about version 6. Little changes should be reflected in minor
> version and bugfix numbers, not major version jumps.

A part of the discussion (~80 mails) is here:
https://gcc.gnu.org/ml/gcc/2014-07/msg00196.html

  -- Ilya


Re: Does gcc cilk plus support include offloading to graphics hardware?

2016-04-21 Thread Ilya Verbin
2016-04-21 7:09 GMT+03:00 Hal Ashburner :
> Another cilk plus question:
> Is op_ostream also considered to be outside of cilk plus?
> https://www.cilkplus.org/docs/doxygen/include-dir/group___reducers_ostream.html
> I am trying to compile the basic "Cilk Plus Tutorial Sources" code as
> supplied at http://cilkplus.org/download
> reducer-ostream-demo.cpp, reducer-string-demo.cpp and
> reducer-wstring-demo.cpp I am unable to get to compile.

The tutorial samples require the latest Cilk runtime (not in GCC yet).
The new runtime will be merged into mainline soon.

  -- Ilya


Re: (Problems with) coexistence of target and offloading compiler installations

2016-06-14 Thread Ilya Verbin
On Fri, Jun 10, 2016 at 11:31:33 +0200, Jakub Jelinek wrote:
> On Fri, Jun 10, 2016 at 09:39:02AM +0200, Thomas Schwinge wrote:
> > But I'm actually confused as to seeing libgomp.so in that list -- given
> > the conflict of which compiler installations' libgomp.so "wins", I wonder
> > how it can be working that some of the functions in there are supposed to
> > behave differently on/are compiled differently for target vs. offloading
> > target?  Or did I do/understand something wrong?  For a lot of other
> 
> For intelmic offloading, I believe all the libraries should be the same
> (unless one chooses e.g. different tuning or ISA in between the two compiler
> installations), including libgomp, so one should be able to just use the
> libraries from the primary compiler.  At least that has been the goal,
> omp_is_initial_device should be handled by overriding the symbol in the
> magic executable.

Right, currently there is no difference between host and mic libraries in gcc.

> For emul certainly, for XeonPhi KNL PCIe HW, I haven't had a possibility to see
> it in action yet, so I don't know how exactly is the filesystem typically
> handled, if the offloading device has e.g. NFS mount of the host's
> filesystem, or if all the libraries are always copied over on demand over
> the bus, whatever.

Some libraries are copied during the boot of the card (e.g. libc.so), others are
copied during the first offload from the app (e.g. libgomp.so).

  -- Ilya


Re: [gomp4] Questions about "declare target" and "target update" pragmas

2014-07-08 Thread Ilya Verbin
Hi Jakub,

I discovered an issue related to global variables.
In this testcase the 'omp target' child fn uses the local copy of glob_var.
But the 'omp parallel' child fn tries to use the glob_var directly and therefore
crashes.

int glob_var;

void
foo (void)
{
  glob_var = 1;
  #pragma omp target map(to: glob_var)
    {
      glob_var = 2;
      #pragma omp parallel
        {
          glob_var = 3;
        }
    }
}

In the spec I found only that if a variable is not present in the enclosing
device data environment, then it is allocated in the device data environment
associated with the construct.  Effectively this means glob_var becomes a
non-global variable within the 'omp target' construct.  It is then not quite
clear what kind of variable glob_var should be when no target device is present
and we fall back to host execution.

Should we forbid 'omp target' to map global vars that are not defined as target?
Or force 'omp parallel' to use local copies inside the target regions?
(ICC reports an error about mapping glob_var for this testcase)

  -- Ilya
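
One way to sidestep the ambiguity, assuming the variable may live on the device for the whole program, is to mark it "declare target" so that the target region and the nested parallel region both refer to the same device copy. The following is only a sketch of that option, not the resolution adopted in GCC; foo_declared and result are names introduced here for illustration.

```c
/* Sketch only: glob_var gets a device copy via "declare target",
   so no per-region copy is created by a map clause.  */
#pragma omp declare target
int glob_var;
#pragma omp end declare target

int
foo_declared (void)
{
  int result = 0;

  glob_var = 1;
  /* Push the host value to the device copy.  */
  #pragma omp target update to(glob_var)

  #pragma omp target map(from: result)
    {
      glob_var = 2;
      #pragma omp parallel
        {
          /* Every thread stores the same value; atomic avoids a data race.  */
          #pragma omp atomic write
          glob_var = 3;
        }
      result = glob_var;
    }
  return result;
}
```

Built without -fopenmp the pragmas are ignored and the function simply runs on the host, which is the fallback case the question is about.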


Re: [RFC] Offloading Support in libgomp

2014-07-17 Thread Ilya Verbin
2014-07-17 11:51 GMT+04:00 Thomas Schwinge :
>> +  plugin_path = getenv ("LIBGOMP_PLUGIN_PATH");
>
> What is the benefit of making this an environment variable that the user
> has to set, LIBGOMP_PLUGIN_PATH, as opposed to hard-coding it to
> somewhere inside the GCC installation directory ([...]/lib/libgomp/*.so)?
> (There, it can still be overridden; dlopen obeys DT_RPATH/DT_RUNPATH, and
> LD_LIBRARY_PATH.)  Hard-coding it would make libgomp testing a bit
> easier, and it generally seems to make sense to me that the compiler
> (libgomp) should be able to locate its own resources, and I don't think
> the plugin search path is something that a user generally would want to
> override -- or is your use case indeed that the plugin is not built as
> part of libgomp's build process?  (But, in this case you still could use
> LD_LIBRARY_PATH to have it found.)

Hi,

We invented this environment variable almost a year ago, when we
didn’t fully understand how all the parts would work together. So for
now your proposal is probably better.
Jakub, what do you think?

  -- Ilya

P.S. Michael is no longer working on this, I'm continuing his work.
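
The two positions in this exchange can be combined: a built-in directory as the default, with the environment variable kept only as an optional override. A minimal sketch under that assumption follows; plugin_search_dir and its builtin_dir parameter are hypothetical and not part of libgomp.

```c
#include <stdlib.h>

/* Hypothetical helper: return the directory to search for libgomp
   plugins.  A non-empty LIBGOMP_PLUGIN_PATH wins; otherwise fall back
   to a directory fixed at build time (passed in by the caller here).  */
const char *
plugin_search_dir (const char *builtin_dir)
{
  const char *env = getenv ("LIBGOMP_PLUGIN_PATH");

  if (env != NULL && env[0] != '\0')
    return env;
  return builtin_dir;
}
```

With a fixed default the testsuite needs no extra environment setup, which is the benefit Thomas points out.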


[gomp4] Offloading wiki page

2014-07-21 Thread Ilya Verbin
Hi,

I've created a wiki page about offloading.  Any improvements are welcome.

https://gcc.gnu.org/wiki/Offloading

Thanks,
  -- Ilya


Re: "Parallel" mode iterators

2014-08-21 Thread Ilya Verbin
2014-08-21 11:39 GMT+04:00 Dominik Vogt :
> One can define mode iterators for
>
>   (define_mode_iterator ITER1 [DI SI HI])
>   (define_mode_iterator ITER2 [SI HI QI])
>
> Is it possible to write something like this:
>
>   (define_insn "foo"
> [(set (match_operand:ITER1 0 ...)
>  ...
> [(match_operand:ITER1 1 ...)
>  (match_operand:ITER2 2 ...)]
>  ...
>
> so that the pattern is copied only for the combinations DI-SI,
> SI-HI and HI-QI, not for all nine combinations of the two
> iterators?  (Or is there another way to get mode of the second
> argument depending on the first argument?)

Look at ssehalfvecmode in i386/sse.md:

(define_mode_attr ssehalfvecmode
  [(V64QI "V32QI") (V32HI "V16HI") (V16SI "V8SI") (V8DI "V4DI")
   (V32QI "V16QI") (V16HI  "V8HI") (V8SI  "V4SI") (V4DI "V2DI")
   (V16QI  "V8QI") (V8HI   "V4HI") (V4SI  "V2SI")
   (V16SF "V8SF") (V8DF "V4DF")
   (V8SF  "V4SF") (V4DF "V2DF")
   (V4SF  "V2SF")])

(define_expand "avx_vextractf128<mode>"
  [(match_operand:<ssehalfvecmode> 0 "nonimmediate_operand")
   (match_operand:V_256 1 "register_operand")
   (match_operand:SI 2 "const_0_to_1_operand")]

  -- Ilya


Re: Offloading not relocatable

2014-09-17 Thread Ilya Verbin
Yeah, I got that all these prefixes are not working with modified
DESTDIR.  I’ll fix mkoffload.

2014-09-17 20:30 GMT+04:00 Bernd Schmidt :
> That's also a solved problem in nvptx mkoffload - you do need to unset these
> environment variables when invoking the target compiler. I've posted the
> source a few times but here it is again.

I see there:
  unsetenv ("GCC_EXEC_PREFIX");
  unsetenv ("COMPILER_PATH");
  unsetenv ("LIBRARY_PATH");

Or do you mean, that there is no need to set them to the new values
before invoking the target compiler?

Thanks,
  -- Ilya
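
For reference, the three unsetenv calls quoted above can be wrapped in a small helper that runs before the target compiler is spawned. This is a sketch only; clear_host_driver_env and the host_driver_vars table are names made up here, and only the three variable names come from the thread.

```c
#include <stdlib.h>

/* Variables named in the thread that the host driver sets and that would
   mislead a differently configured target compiler.  */
static const char *const host_driver_vars[] = {
  "GCC_EXEC_PREFIX", "COMPILER_PATH", "LIBRARY_PATH"
};

/* Hypothetical helper: remove the host driver's variables from the
   environment.  Returns 0 on success, -1 if any unsetenv call failed.  */
int
clear_host_driver_env (void)
{
  int status = 0;
  size_t i;

  for (i = 0; i < sizeof host_driver_vars / sizeof host_driver_vars[0]; i++)
    if (unsetenv (host_driver_vars[i]) != 0)
      status = -1;
  return status;
}
```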


Re: [PATCH 0/4] OpenMP 4.0 offloading to Intel MIC

2014-11-13 Thread Ilya Verbin
On 13 Nov 09:17, H.J. Lu wrote:
> I noticed many libgomp test failures:
> 
> https://gcc.gnu.org/ml/gcc-regression/2014-11/msg00309.html
> 
> Have you seen them?

Hi H.J.,

I do not see these regressions on i686-linux and x86_64-linux.
Could you please provide more details? (configure options, error log)

Thanks,
  -- Ilya


Re: [PATCH 0/4] OpenMP 4.0 offloading to Intel MIC

2014-11-13 Thread Ilya Verbin
On 13 Nov 10:48, H.J. Lu wrote:
> /usr/local/bin/ld: /tmp/ccA8cExp.o: plugin needed to handle lto object

Looks like we should set flag_fat_lto_objects when compiling with offloading.
I'll investigate this issue tomorrow.

Could you please also show the version and configure options of ld?

Thanks,
  -- Ilya


Re: [PATCH 0/4] OpenMP 4.0 offloading to Intel MIC

2014-11-13 Thread Ilya Verbin
On 13 Nov 2014, at 23:11, H.J. Lu  wrote:
> 
> Section Headers:
>  [Nr] Name Type Address Off Size ES Flg Lk Inf Al
>  [ 0] NULL 00 00 00 0 0 0
>  [ 1] .text PROGBITS 40 000204 00 AX 0 0 16
>  [ 2] .rela.text RELA 001a60 d8 18 I 29 1 8
>  [ 3] .data PROGBITS 000260 40 00 WA 0 0 32
>  [ 4] .bss NOBITS 0002a0 00 00 WA 0 0 1
>  [ 5] .gnu.offload_lto_.profile.50035f9931394ed4 PROGBITS 0002a0 13 00 E 0 0 1
>  [ 6] .gnu.offload_lto_.icf.50035f9931394ed4 PROGBITS 0002b3 1e 00 E 0 0 1
>  [ 7] .gnu.offload_lto_.jmpfuncs.50035f9931394ed4 PROGBITS 0002d1 19 00 E 0 0 1
>  [ 8] .gnu.offload_lto_.inline.50035f9931394ed4 PROGBITS 0002ea 6c 00 E 0 0 1
>  [ 9] .gnu.offload_lto_.pureconst.50035f9931394ed4 PROGBITS 000356 13 00 E 0 0 1
>  [10] .gnu.offload_lto_vec_mult._omp_fn.1.50035f9931394ed4 PROGBITS 000369 0004ab 00 E 0 0 1
>  [11] .gnu.offload_lto_vec_mult._omp_fn.0.50035f9931394ed4 PROGBITS 000814 00035d 00 E 0 0 1
>  [12] .gnu.offload_lto_.symbol_nodes.50035f9931394ed4 PROGBITS 000b71 55 00 E 0 0 1
>  [13] .gnu.offload_lto_.refs.50035f9931394ed4 PROGBITS 000bc6 14 00 E 0 0 1
>  [14] .gnu.offload_lto_.offload_table.50035f9931394ed4 PROGBITS 000bda 11 00 E 0 0 1
>  [15] .gnu.offload_lto_.decls.50035f9931394ed4 PROGBITS 000beb 00043d 00 E 0 0 1
>  [16] .gnu.offload_lto_.symtab.50035f9931394ed4 PROGBITS 001028 00 00 E 0 0 1
>  [17] .gnu.offload_lto_.opts PROGBITS 001028 a9 00 E 0 0 1
> 
> Don't you need another plugin to claim those offload IR sections?

No, the plan was that a regular plugin would just ignore offload IR sections by
default.  In your configuration ld detects a __gnu_lto_slim symbol and decides
that the object file contains only LTO IR without asm.  I am going to
investigate where my configuration differs and fix the bug.

  -- Ilya

[PATCH] Fix regressions in libgomp testsuite: set flag_fat_lto_objects for offload

2014-11-14 Thread Ilya Verbin
Hi,

This patch fixes recent regressions in libgomp testsuite:
https://gcc.gnu.org/ml/gcc-regression/2014-11/msg00343.html
They are reproducible only with ld from trunk, ld 2.24 works fine.

When GCC emits sections with offload IR, it should not emit the "__gnu_lto_slim"
symbol; otherwise the linker plugin tries to compile LTO IR, which is not present.

Bootstrap and regtesting on x86_64-linux using binutils 20141114 in progress.
OK for trunk when finished?

Thanks,
  -- Ilya


gcc/
	* cgraphunit.c (symbol_table::compile): Set flag_fat_lto_objects
	in case of g->have_offload.


diff --git a/gcc/cgraphunit.c b/gcc/cgraphunit.c
index 534c613..584a84e 100644
--- a/gcc/cgraphunit.c
+++ b/gcc/cgraphunit.c
@@ -2178,7 +2178,10 @@ symbol_table::compile (void)
 
   /* Offloading requires LTO infrastructure.  */
   if (!in_lto_p && g->have_offload)
-    flag_generate_lto = 1;
+    {
+      flag_generate_lto = 1;
+      flag_fat_lto_objects = 1;
+    }
 
   /* If LTO is enabled, initialize the streamer hooks needed by GIMPLE.  */
   if (flag_generate_lto)
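
The effect of the hunk can be shown with a stand-alone toy model: whenever offload IR is streamed outside of LTO mode, fat objects are forced so that the regular machine code is kept next to the IR. The struct and function below are invented for this sketch and are not GCC internals; only the condition and the two flag names mirror the patch.

```c
/* Toy model of the two flags touched by the patch.  */
struct lto_flags
{
  int generate_lto;     /* stream IR into the object file */
  int fat_lto_objects;  /* also keep regular machine code */
};

/* Mirrors the patched condition: outside of LTO mode, a translation unit
   with offload regions needs IR streaming and a fat object.  */
void
enable_offload_streaming (struct lto_flags *flags, int in_lto_p,
                          int have_offload)
{
  if (!in_lto_p && have_offload)
    {
      flags->generate_lto = 1;
      flags->fat_lto_objects = 1;
    }
}
```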


Re: GCC 5.0 and OpenMP 4.0 accelerator : Adapteva/Parallella board

2015-02-12 Thread Ilya Verbin
Hi,

On Wed, Feb 11, 2015 at 21:33:47 -0800, Nicholas Yue wrote:
> I would like to find out if this is the correct forum to
> ask/discuss about GCC 5's OpenMP 4.0 implementation, in particular
> the new accelerator feature which from what I understand, allows the
> compute to be offloaded to external GPU/accelerator.
> 
> I have a Parallella board (ARM dual core) which has an Adapteva
> chip (16 cores) and I would like to build a GCC 5 version for it.
> 
> I recall that the Adapteva is a supported CPU with GCC.

Currently offloading to Epiphany targets is not supported by GCC.

To support it, one needs to implement at least 2 things:

1. mkoffload tool, like gcc/config/i386/intelmic-mkoffload.c or
gcc/config/nvptx/mkoffload.c

2. libgomp plugin, like liboffloadmic/plugin/libgomp-plugin-intelmic.cpp or
libgomp/plugin/plugin-nvptx.c

  -- Ilya


A bug (?) with inline functions at O0: undefined reference

2015-03-06 Thread Ilya Verbin
Hi All,

I've discovered a strange behaviour on trunk gcc, here is the reproducer:

inline int foo ()
{
  return 0;
}

int main ()
{
  return foo ();
}

$ gcc main.c
/tmp/ccD1LeXo.o: In function `main':
main.c:(.text+0xa): undefined reference to `foo'
collect2: error: ld returned 1 exit status

Is this a bug?  If yes, is it known?
GCC 4.8.3 works fine though.

Thanks,
  -- Ilya
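
A likely explanation is the switch of GCC 5's default C dialect from gnu89 to gnu11. Under C99 and later rules a plain "inline" definition does not emit an external symbol, so at -O0, where the call is not inlined, the linker finds no definition of foo. A minimal sketch of one common fix follows; call_foo is a name introduced here in place of main.

```c
/* "static inline" gives this file its own callable copy of foo, so the
   call links even when nothing is inlined at -O0.  */
static inline int
foo (void)
{
  return 0;
}

int
call_foo (void)
{
  return foo ();
}
```

Alternatives are to add an "extern inline int foo (void);" declaration in exactly one translation unit, or to build with -std=gnu89 or -fgnu89-inline to get the old behaviour.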


Re: [gomp4] Questions about "declare target" and "target update" pragmas

2015-03-10 Thread Ilya Verbin
Hi Jakub,

I have one more question :)
This testcase seems to be correct... or not?

#pragma omp declare target
extern int G;
#pragma omp end declare target

int G;

int main ()
{
  #pragma omp target update to(G)

  return 0;
}

If yes, then we have a problem that the decl of G in varpool_node::get_create
doesn't have "omp declare target" attribute.

Thanks,
  -- Ilya


Re: [gomp4] Questions about "declare target" and "target update" pragmas

2015-03-16 Thread Ilya Verbin
On Tue, Mar 10, 2015 at 19:52:52 +0300, Ilya Verbin wrote:
> Hi Jakub,
> 
> I have one more question :)
> This testcase seems to be correct... or not?
> 
> #pragma omp declare target
> extern int G;
> #pragma omp end declare target
> 
> int G;
> 
> int main ()
> {
>   #pragma omp target update to(G)
> 
>   return 0;
> }
> 
> If yes, then we have a problem that the decl of G in varpool_node::get_create
> doesn't have "omp declare target" attribute.

Ping?

I am investigating run-fails on some benchmark, and have found a second
questionable place, where a function argument overrides a global array.
Just to be sure, is this a bug in the test?

#pragma omp declare target
int a1[50], a2[50];
#pragma omp end declare target

void foo (int a1[])
{
  #pragma omp target
    {
      a1[10]++;
      a2[10]++;
    }
}

int main ()
{
  a1[10] = a2[10] = 10;

  #pragma omp target update to(a1, a2)
  foo (a1);
  #pragma omp target update from(a1, a2)

  if (a1[10] != a2[10])
    abort ();
  return 0;
}

Thanks,
  -- Ilya


Re: [gomp4] Questions about "declare target" and "target update" pragmas

2015-03-19 Thread Ilya Verbin
On Thu, Mar 19, 2015 at 14:47:44 +0100, Jakub Jelinek wrote:
> Here is untested patch.  I'm going to check it in after bootstrap/regtest.

Thanks.

> > I am investigating run-fails on some benchmark, and have found a second
> > questionable place, where a function argument overrides a global array.
> > Just to be sure, is this a bug in the test?
> > 
> > #pragma omp declare target
> > int a1[50], a2[50];
> > #pragma omp end declare target
> > 
> > void foo (int a1[])
> > {
> >   #pragma omp target
> >     {
> >       a1[10]++;
> >       a2[10]++;
> >     }
> > }
> 
> That is a buggy test.  int a1[] function argument is changed
> into int *a1, so it is actually
> #pragma omp target map(tofrom:a1, a2)

Actually, it copies only a1 pointer, since a2 points to the global array.

> {
>   a1[10]++;
>   a2[10]++;
> }
> which copies the a1 pointer to the device by value (no pointer
> transformation).
> Perhaps the testcase writer meant to use #pragma omp target map(a1[10])
> instead (or map(a1[0:50])?

If I understand correctly, it's not allowed to map global target arrays this
way, since it's already present in the initial device data environment:

2.9.4 declare target Directive
If a list item is a variable then the original variable is mapped to a
corresponding variable in the initial device data environment for all devices.

2.14.5 map Clause
If a corresponding list item of the original list item is in the enclosing
device data environment, the new device data environment uses the corresponding
list item from the enclosing device data environment. No additional storage is
allocated in the new device data environment and neither initialization nor
assignment is performed, regardless of the map-type that is specified.

So, to fix this testcase I can just remove the "int a1[]" function argument, and
add some "#pragma omp target update" where needed.

  -- Ilya
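
The fix described in the last paragraph could look roughly as follows. This is a sketch of the idea rather than the actual benchmark source; bump_both and run_fixed_test are names chosen here, and the abort-based check is replaced by a return value.

```c
#pragma omp declare target
int a1[50], a2[50];
#pragma omp end declare target

/* No parameter named a1 any more, so both names refer to the
   declare-target globals inside the region.  */
void
bump_both (void)
{
  #pragma omp target
    {
      a1[10]++;
      a2[10]++;
    }
}

/* Returns 1 when both elements were incremented consistently.  */
int
run_fixed_test (void)
{
  a1[10] = a2[10] = 10;

  #pragma omp target update to(a1, a2)
  bump_both ();
  #pragma omp target update from(a1, a2)

  return a1[10] == 11 && a2[10] == 11;
}
```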


Re: [gomp4] Questions about "declare target" and "target update" pragmas

2015-03-19 Thread Ilya Verbin
On Thu, Mar 19, 2015 at 15:57:10 +0100, Jakub Jelinek wrote:
> On Thu, Mar 19, 2015 at 05:49:47PM +0300, Ilya Verbin wrote:
> > If I understand correctly, it's not allowed to map global target arrays this
> > way, since it's already present in the initial device data environment:
> 
> It of course is allowed.  It just means that it doesn't allocate new memory
> (sizeof(int) large in the map(a1[10]) case, sizeof(int)*50 large in the
> map(a1[0:50]) case), nor copy the bytes around, all it will do is allocate
> memory for the target copy of the a1 pointer, and do pointer transformation
> such that the a1 pointer on the target will point to the global target a1
> array.
> Without the map(a1[10]) or map(a1[0:50]) clauses (i.e. implicit
> map(tofrom:a1)) it does similar thing, except it copies the pointer value to
> the target (and back at the end of the region) instead, which is not what you
> want...

Ok, got it.

And what about global allocatable fortran arrays?  I didn't find any
restrictions in the specification.  Here is a reduced testcase:

module test
  integer, allocatable, target :: x(:)
  !$omp declare target(x)
end module test
  use test
  integer :: n = 1000
  allocate (x(n))
  !$omp target map(x(1:n))
 x(123) = 456
  !$omp end target
  deallocate (x)
end

It crashes on target with NULL-pointer access, however the memory for x(1:n) is
allocated on target.  Looks like there's something wrong with pointer
transformation.  Is this a wrong testcase or a bug in gcc?

Thanks,
  -- Ilya


Re: [9/10 Regression] [PR87833] Intel MIC (emulated) offloading still broken (was: GCC 9.0.1 Status Report (2019-04-25))

2019-05-09 Thread Ilya Verbin
Hi Hongtao,

I left Intel 3 years ago. If you have any questions regarding MIC
offloading, you can reach me at iver...@gmail.com

Hongtao Liu :
> I don't know this guy ilya.ver...@intel.com.
> Do you know him/her, H.J?
>
> --
> BR,
> Hongtao

  -- Ilya


Re: Intel Phi co-processor support

2017-02-03 Thread Ilya Verbin
2017-02-03 16:00 GMT+03:00 Jakub Jelinek :
>
> On Fri, Feb 03, 2017 at 02:50:37PM +0200, Angel Dimitrov wrote:
> >  Can I compile on Linux with gfortran code and to run it on Phi
> > co-processor? Or it is better to use Intel FORTRAN compiler?
>
> Depends on which XeonPhi do you have.  GCC doesn't support Knights Ferry
> or Knights Corner, does support Knights Landing.
> That said, for KNL I've only seen so far standalone KNL processors for
> which I'm not sure if offloading is possible or desirable; IMHO if

It is possible using so called "offload over fabric".
Here is a how-to [1], which can be adapted just by replacing "icc
-qopenmp" with "gcc -fopenmp", I guess.

[1] https://software.intel.com/en-us/articles/how-to-use-offload-over-fabric-with-knights-landing-intel-xeon-phi-processor

> KNL is the main processor in the computer, then everything is host
> for you and thus just using non-target OpenMP code should be sufficient,
> so the KNL offloading should be (mainly or solely) for the case when
> KNL is a coprocessor, does such thing really exist or is planned?
> Can somebody from Intel please clarify?
>
> Jakub

  -- Ilya


Questions about LTO infrastructure and pragma omp target

2013-08-15 Thread Ilya Verbin
Hi All,

I'm trying to figure out how LTO infrastructure works on a high level.
I want to make sure that I understand this correctly.  Could you please
help me with that?

1.  Execution flow.  As far as I understood, there are 2 modes of
operation - with/without LTO plugin.  Below are the execution flows
for each mode.

Without LTO plugin:

gcc -flto                 # Call GCC driver
 |_ cc1                   # Compile first source file into asm + intermediate language
 |_ as                    # Assemble these asm + IL into temporary object file
 |_ ...                   # Compile and assemble all remaining source files
 |_ collect2              # Call linker driver
     |_ lto-wrapper       # Call lto-wrapper directly from collect2
     |   |_ gcc           # Driver
     |   |   |_ lto1      # Perform WPA and split into partitions
     |   |_ gcc           # Driver
     |   |   |_ lto1      # Perform LTRANS for the first partition
     |   |   |_ as        # Assemble this partition into final object file
     |   |_ ...           # Perform LTRANS for each partition
     |_ collect-ld        # Simple wrapper over ld
         |_ ld            # Perform linking

Using LTO plugin:

gcc -flto                                 # Call GCC driver
 |_ cc1                                   # Compile first source file into asm + intermediate language
 |_ as                                    # Assemble these asm + IL into temporary object file
 |_ ...                                   # Compile and assemble all remaining source files
 |_ collect2                              # Call linker driver
     |_ collect-ld                        # Simple wrapper over ld
         |_ ld with liblto_plugin.so      # Perform LTO and linking
             |_ lto-wrapper               # Is called from liblto_plugin.so
                 |_ gcc                   # Driver
                 |   |_ lto1              # Perform WPA and split into partitions
                 |_ gcc                   # Driver
                 |   |_ lto1              # Perform LTRANS for the first partition
                 |   |_ as                # Assemble this partition into final object file
                 |_ ...                   # Perform LTRANS for each partition

Are they correct?

2.  The second question, regarding #pragma omp target implementation.
I'm going to reuse LTO approach in a prototype, that will produce 2
binaries - for host and target architectures.  Target binary will contain
functions outlined from omp target region and some infrastructure to run
them.
To produce 2 binaries we need to run gcc and ld twice.  At the first run
gcc will generate object file, that contains optimized code for host and
GIMPLE for target.  At the second run gcc will read the GIMPLE and
generate optimized code for target.

So, the question is - what is the right place for the second run of gcc
and ld?  Should I insert them into liblto_plugin.so?  Or should I create
entirely new plugin, that will only call gcc and ld for target, without
performing any LTO optimizations for host?
Suggestions?

----
Thanks,
Ilya Verbin,
Software Engineer
Intel Corporation


Re: Questions about LTO infrastructure and pragma omp target

2013-08-23 Thread Ilya Verbin
Jakub, Richard, Uday,
Thanks for your answers.

On 15 Aug 20:59, Richard Biener wrote:
> Alternatively you make lto-wrapper aware of this which means that WPA stage 
> would emit extra partitions that it marks for lto-wrapper.
> 
> That sounds better than another plugin to me.  Of course WPA time might be 
> too limiting. Otoh the idea of multiple WPA stages, aka iterating lto could 
> be picked up to have a late WPA stage.
> 
> Richard.

I'm trying to implement the approach with modified lto-wrapper.
Suppose we have the bytecode of the routine foo, streamed during the ompexp pass
into some section, say .gnu.omptarget_foo.
In function lto.c:do_whole_program_analysis() an extra partition should be
created that will contain the bytecode from .gnu.omptarget_foo, right?
As far as I understand, in addition to the bytecode of foo, we should also
stream extra symtab_nodes and read them somewhere in
lto-cgraph.c:input_symtab().
Does this mean we should maintain 2 symtabs inside the WPA stage: the original
one for the host and a new one for the target?

Richard, also what do you mean by "WPA time might be too limiting"?

Thanks,
-- Ilya


Re: Questions about LTO infrastructure and pragma omp target

2013-08-23 Thread Ilya Verbin
On 23 Aug 13:17, Jakub Jelinek wrote:
> I don't think we should stream into more than one target section.
> There should be just .gnu.target_lto section (or whatever other suitable
> name) and should stream into it:
> 1) all functions and variables with "omp declare target" attribute
> 2) the outlined bodies of #pragma omp target turned into *.ompfn functions
> 3) all the types, symtab etc. needed for that

Why is having one target section preferable to having a separate section for
each function body?

> Then the question is what the plugin should perform with these sections,
> whether it will compile each input .gnu.target_lto section hunk separately
> (as in non-LTO mode), or with -flto also LTO them together.

Yes, it is an important question...  To get started it is easier to implement
"non-target-lto" mode, however this approach should be general enough to extend
it to "target-lto" mode.  Does anyone need it?


On 23 Aug 14:36, Jakub Jelinek wrote:
> On Fri, Aug 23, 2013 at 02:24:42PM +0200, Richard Biener wrote:
> > No, as you will refer to the symbol with the target code from the host
> > code you need a single unified symtab.
> 
> I really think you want two symtabs rather than a unified symtab,
> or just stream a subset of the host symtab into the .gnu.target_lto
> section.  The thing is, the target code (functions, vars, outlined bodies)
> is a strict subset of the host code (because as a fallback, everything
> needs to be able to run on the host), but when not compiling originally with
> -flto, we IMHO should stream just the target subset, not everything
> (and for -flto stream both the target subset into one section and everything
> (host code) as we do right now, either with fat or slim lto objects).

I also think that having two symtabs looks better.  There are no direct refs to
the target symbols from the host code.  And (as far as I see it) a unified
symtab will lead to a mess in places where host and target symbols should be
handled differently.

Thanks,
-- Ilya


Re: Questions about LTO infrastructure and pragma omp target

2013-09-16 Thread Ilya Verbin
Hi Richard,

On 23 Aug 14:24, Richard Biener wrote:
> Ilya Verbin  wrote:
> >I'm trying to implement the approach with modified lto-wrapper.
> >Suppose we have a bytecode of the routine foo, streamed during ompexp
> >pass into some section, say .gnu.omptarget_foo.
> >In function lto.c:do_whole_program_analysis() an extra partition should
> >be created, that will contain bytecode from .gnu.omptarget_foo, right?
> 
> Right.
> 
> Richard.

What if we leave WPA stage unchanged?
Here is a patch that passes "fat" object files (with host-side .gnu.lto_ and
target-side .gnu.target_lto_ sections) directly to the target-compiler.
(Currently it works only with -flto enabled.)  Then target-compiler reads
bytecode from .gnu.target_lto_ and produces target-side object file.
At the moment lto-wrapper uses collect_gcc as a target-compiler.  Also it
doesn't properly handle the command-line args.
This looks simpler than emitting extra partitions during WPA.  What do you think?
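
The core of the idea is that host and target IR differ only in the section-name prefix, so the target compiler can pick out its own sections from the same fat object. A stand-alone sketch of that naming rule follows; the two prefixes are the ones used in the patch, while format_section_name is a hypothetical helper and not the real lto_get_section_name.

```c
#include <stdio.h>

#define LTO_SECTION_NAME_PREFIX ".gnu.lto_"
#define OMP_SECTION_NAME_PREFIX ".gnu.target_lto_"

/* Hypothetical helper: write "<prefix><name>" into buf.  Returns 0 on
   success, -1 if buf is too small or formatting failed.  */
int
format_section_name (char *buf, size_t size, const char *name,
                     int for_target)
{
  const char *prefix = for_target ? OMP_SECTION_NAME_PREFIX
                                  : LTO_SECTION_NAME_PREFIX;
  int n = snprintf (buf, size, "%s%s", prefix, name);

  return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```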


---
 gcc/lto-streamer.c   |  8 ++--
 gcc/lto-streamer.h   |  1 +
 gcc/lto-wrapper.c| 22 +-
 gcc/lto/lang.opt |  4 
 gcc/lto/lto-object.c |  5 +++--
 gcc/lto/lto.c|  5 -
 6 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/gcc/lto-streamer.c b/gcc/lto-streamer.c
index e7b66c1..9e19060 100644
--- a/gcc/lto-streamer.c
+++ b/gcc/lto-streamer.c
@@ -145,6 +145,7 @@ lto_get_section_name (int section_type, const char *name, struct lto_file_decl_d
   const char *add;
   char post[32];
   const char *sep;
+  const char *prefix;
 
   if (section_type == LTO_section_function_body)
     {
@@ -172,8 +173,11 @@ lto_get_section_name (int section_type, const char *name, struct lto_file_decl_d
   else if (f != NULL) 
     sprintf (post, "." HOST_WIDE_INT_PRINT_HEX_PURE, f->id);
   else
-    sprintf (post, "." HOST_WIDE_INT_PRINT_HEX_PURE, get_random_seed (false)); 
-  return concat (LTO_SECTION_NAME_PREFIX, sep, add, post, NULL);
+    sprintf (post, "." HOST_WIDE_INT_PRINT_HEX_PURE, get_random_seed (false));
+
+  prefix = flag_openmp_target ? OMP_SECTION_NAME_PREFIX
+                              : LTO_SECTION_NAME_PREFIX;
+  return concat (prefix, sep, add, post, NULL);
 }
 
 
diff --git a/gcc/lto-streamer.h b/gcc/lto-streamer.h
index e7c89f1..df72e16 100644
--- a/gcc/lto-streamer.h
+++ b/gcc/lto-streamer.h
@@ -141,6 +141,7 @@ along with GCC; see the file COPYING3.  If not see
name for the functions and static_initializers.  For other types of
sections a '.' and the section type are appended.  */
 #define LTO_SECTION_NAME_PREFIX ".gnu.lto_"
+#define OMP_SECTION_NAME_PREFIX ".gnu.target_lto_"
 
 #define LTO_major_version 2
 #define LTO_minor_version 2
diff --git a/gcc/lto-wrapper.c b/gcc/lto-wrapper.c
index 15a34dd..f3b44ff 100644
--- a/gcc/lto-wrapper.c
+++ b/gcc/lto-wrapper.c
@@ -442,6 +442,7 @@ run_gcc (unsigned argc, char *argv[])
   unsigned i, j;
   const char **new_argv;
   const char **argv_ptr;
+  const char **target_argv;
   char *list_option_full = NULL;
   const char *linker_output = NULL;
   const char *collect_gcc, *collect_gcc_options;
@@ -452,7 +453,7 @@ run_gcc (unsigned argc, char *argv[])
   unsigned int fdecoded_options_count = 0;
   struct cl_decoded_option *decoded_options;
   unsigned int decoded_options_count;
-  struct obstack argv_obstack;
+  struct obstack argv_obstack, target_argv_obstack;
   int new_head_argc;
 
   /* Get the driver and options.  */
@@ -902,6 +903,25 @@ cont:
   free (input_names);
   free (list_option_full);
   obstack_free (&env_obstack, NULL);
+
+  /* Run gcc for target.  */
+  obstack_init (&target_argv_obstack);
+  obstack_ptr_grow (&target_argv_obstack, collect_gcc);
+  obstack_ptr_grow (&target_argv_obstack, "-xlto");
+  obstack_ptr_grow (&target_argv_obstack, "-fopenmp_target");
+  obstack_ptr_grow (&target_argv_obstack, "-c");
+  obstack_ptr_grow (&target_argv_obstack, "-o");
+  obstack_ptr_grow (&target_argv_obstack, "target.o");
+
+  /* Append the input objects.  */
+  for (i = 1; i < argc; ++i)
+    if (strncmp (argv[i], "-fresolution=", sizeof ("-fresolution=") - 1))
+      obstack_ptr_grow (&target_argv_obstack, argv[i]);
+  obstack_ptr_grow (&target_argv_obstack, NULL);
+
+  target_argv = XOBFINISH (&target_argv_obstack, const char **);
+  fork_execute (CONST_CAST (char **, target_argv));
+  obstack_free (&target_argv_obstack, NULL);
 }
 
   obstack_free (&argv_obstack, NULL);
diff --git a/gcc/lto/lang.opt b/gcc/lto/lang.opt
index 7a9aede..cd0098c 100644
--- a/gcc/lto/lang.opt
+++ b/gcc/lto/lang.opt
@@ -40,4 +40,8 @@ fresolution=
 LTO Joined
 The resolution file
 
+fopenmp_target
+LTO Var(flag_openmp_t

Re: Questions about LTO infrastructure and pragma omp target

2013-09-17 Thread Ilya Verbin
On 17 Sep 10:12, Richard Biener wrote:
> It looks more like a hack ;)  It certainly doesn't look scalable to multiple
> target ISAs.  You also unconditionally invoke the target compiler (well, you
> invoke the same compiler ...)

Yes, I currently call the "target" compiler unconditionally, but it can be
placed under a flag/env var/etc.  When we have multiple target ISAs, multiple
target compilers will be invoked.  Each of them will read the same IL from
.gnu.target_lto_ and produce an executable for its target.
Why is this approach not scalable?

> As far as I understand your patch the target IL is already produced by
> the compile stage (always? what about possible target IL emit from
> -ftree-parallelize-loops?)?

Yes, I assume that IL is already produced, somehow like this:
http://gcc.gnu.org/ml/gcc/2013-09/msg00123.html
Probably the compile stage should somehow inform the lto-wrapper, that target
compilers should be called.

> As I understand Jakub he prefers things to work without -flto as well, so
> target IL has to be handled by a different linker plugin and LTO would merely
> be required to pass the target IL sections through the LTO pipeline and re-emit
> it during LTRANS?

If we want to reuse the "LTO pipeline", the LTO infrastructure should be turned
on (i.e. lto-wrapper should be called).
With -flto, lto-wrapper will perform all the usual things (WPA+LTRANS) and
invoke all necessary target compilers.
Without -flto it will merely invoke the target compilers.
I do not see how a different linker plugin can help.  Wouldn't it run
lto-wrapper the same way as the current plugin?

Thanks,
  -- Ilya
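
Invoking a target compiler from lto-wrapper mostly amounts to building a command line from the objects it received. A minimal sketch of the filtering step, modelled on the loop in the earlier patch in this thread, follows; filter_target_args is a hypothetical helper, and the real code uses an obstack rather than a caller-supplied array.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy argv[1..argc-1] into out[], skipping the
   -fresolution= option, which only makes sense for the host link.
   out must have room for argc entries; it is NULL-terminated.
   Returns the number of arguments kept.  */
size_t
filter_target_args (int argc, char *argv[], const char *out[])
{
  static const char skip[] = "-fresolution=";
  size_t kept = 0;
  int i;

  for (i = 1; i < argc; i++)
    if (strncmp (argv[i], skip, sizeof skip - 1) != 0)
      out[kept++] = argv[i];
  out[kept] = NULL;
  return kept;
}
```

With several target ISAs, the same filtered list could be handed to each target compiler in turn.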


Re: [gomp4] GOMP_target fall back execution

2013-09-18 Thread Ilya Verbin
On 18 Sep 10:38, Jakub Jelinek wrote:
> and what test5.c should print I have no idea (does ICC already support this
> and can you see what it prints?).
> 
> test5.c:
> #include <stdio.h>
> #include <omp.h>
> 
> int
> main ()
> {
>   omp_set_dynamic (0);
>   omp_set_nested (1);
>   #pragma omp parallel num_threads (3)
> if (omp_get_thread_num () == 2)
>   {
>   #pragma omp parallel num_threads (3)
> if (omp_get_thread_num () == 1)
>   {
> #pragma omp target if (0)
>   {
> printf ("inp %d\n", omp_in_parallel ());
> #pragma omp parallel num_threads (2)
>   printf ("%d %d %d %d %d\n", omp_get_level (),
>   omp_get_ancestor_thread_num (0),
>   omp_get_ancestor_thread_num (1),
>   omp_get_ancestor_thread_num (2),
>   omp_get_ancestor_thread_num (3));
>   }
>   }
>   }
>   return 0;
> }
> 
>   Jakub

ICC prints:

inp 1
3 0 2 1 0
3 0 2 1 1

  -- Ilya


Re: Questions about LTO infrastructure and pragma omp target

2013-09-19 Thread Ilya Verbin
On 17 Sep 14:12, Jakub Jelinek wrote:
> On Tue, Sep 17, 2013 at 01:56:39PM +0200, Richard Biener wrote:
> > 
> > Are you sure we have the same IL for all targets and the same targets
> > for all functions?  That would certainly simplify things, but you still need
> > a way to tell the target compiler which symbol to emit the function on
> > as the compile-stage will already necessarily refer to all target
> > variant symbols.
> 
> This has been discussed to some extent during Cauldron.
> Yes, there are various target dependencies in the GIMPLE IL, many of them
> very early.
> Some of the dependencies are there already during preprocessing, there is
> nothing to do about those.
> For some things we will just rely on the host and target having the same
> properties, stuff like BITS_PER_UNIT, type layout/alignment, endianness,
> the OpenMP (and I believe OpenACC too) model effectively requires that,
> while you don't need to have shared address space between host and target
> (but can have that), for the mapping/unmapping it is assumed that you can
> simply take host portions of memory and copy them over to the target device
> or back, as sequence of bytes, there is no form of RPC or similar that would
> tweak endianness, differently sized types, padding, etc.
> While you can say have 64-bit host and 32-bit target or vice versa, the
> target IL will simply contain precision info, alignment, structure layout
> etc. and just will have to generate right code for that (something that is
> native long on the host can be native long long on the target or vice versa
> etc.).
> Then there are dependencies we'd ideally get rid of, at least pre-IPA,
> stuff like BRANCH_COST, but generally that is just an optimization issue and
> thus not that big deal.
> Bigger issue are target specific builtins, I guess we'll either have to just
> sorry on those, or have some helper targhook that will translate a subset of
> md builtins from selected hosts to selected targets.
> Preferably, before IPA we'd introduce as few target dependencies into the
> IL as possible, and gradually towards RTL can add more dependencies (e.g.
> the vectorizer adds so many target dependencies that at that point trying to
> use the IL for a different target is practically impossible).
> 
>   Jakub

Do I understand correctly that GIMPLE IL is target dependent, but we will emit
the same IL for all targets?

  -- Ilya


Re: [RFC] Offloading Support in libgomp

2013-10-28 Thread Ilya Verbin
Hi Jakub,

We have a MIC offload runtime library (liboffload), which is an abstraction over
COI.  Currently it is a part of ICC, but there are plans to open-source it.
However, liboffload requires somewhat different tables compared to what we have
agreed on.  The liboffload tables serve to associate host functions with target
functions.  They should be inserted at compile time into special sections of
every executable or DSO with #pragma omp target.  The tables contain pairs of
{ char *name, void *host_addr } for host binaries, and { char *name, void
*target_addr } for target binaries.  The "name" might not be the actual function
name, but just a key for the host->target mapping.
So, in this approach, GOMP_target will take host_addr as input, then the MIC
plugin will convert it into the "name" using the host-side table, and call the
function on MIC using the liboffload interface.  Perhaps an additional table
will be created by the MIC plugin to speed up the name lookup.  This should also
eliminate problems with function re-ordering at LTO, where address tables from
different objects will be mixed into one in the executable/shared library.
What do you think, is it ok to save this additional data in the tables?

Thanks,
  -- Ilya


[gomp4] Questions about "declare target" and "target update" pragmas

2014-01-22 Thread Ilya Verbin
Hi Jakub,

I have 2 questions concerning the OpenMP 4.0 specification.


1.  Do I understand correctly that every "declare target" directive should be
closed with "end declare target"?  E.g. in this example GCC marks both foo1 and
foo2 with the "omp declare target" attribute:

#pragma omp declare target
int foo1 () { return 1; }
int foo2 () { return 2; }
/* EOF */

Shouldn't the frontend issue an error message saying that there is a "declare
target" without a corresponding "end declare target"?


2.  Do I understand correctly that the "target update" directive can be used
outside the "target" or "target data" regions?  E.g. this example should
print '2' (and it prints '2' when built with icc):

#include <stdio.h>

#pragma omp declare target
int G = 1;
#pragma omp end declare target

int main ()
{
  G = 2;
  #pragma omp target update to(G)
  G = 3;
  int x = 0;
  #pragma omp target
    {
      x = G;
    }
  printf ("%d\n", x);
}

If it is acceptable, then GOMP_target_update should also map variables that
weren't mapped.

Thanks,
-- Ilya


Re: [gomp4] Questions about "declare target" and "target update" pragmas

2014-01-28 Thread Ilya Verbin
2014/1/22 Jakub Jelinek :
> This can print 3 (if doing host fallback or device shares address space
> with host), or 2 (otherwise).  It shouldn't print 1 ever, and yes,
> the target update is then well defined.  All variables from omp declare
> target are allocated on the device sometime before
> the first target data/target update/target region; given that they will
> be allocated in the data section of the target DSO, they actually just need
> to be registered with the mapping data structure when the DSO is loaded.
>
> No, the target DSO initialization should use the tables we've talked about
> to initialize the mapping.
>
> Jakub

Yes, when G is a global variable marked with 'declare target', everything works
fine.  But this testcase crashes at runtime in GOMP_target_update:

#include <stdio.h>

int main ()
{
  int G = 2;
  #pragma omp target update to(G)
  G = 3;
  int x = 0;
  #pragma omp target
    {
      x = G;
    }
  printf ("%d\n", x);
}

Is it right that such usage of 'target update' is not allowed by the OpenMP
specification?

  -- Ilya


Fwd: [gomp4] Questions about "declare target" and "target update" pragmas

2014-01-30 Thread Ilya Verbin
One more question.  Is it valid to use arr[MAX/2..MAX] on the target?

#define MAX 20
void foo ()
{
  int arr[MAX];
  #pragma omp target map(from: arr[0:MAX/2])
    {
      int i;
      for (i = 0; i < MAX; i++)
        arr[i] = i;
    }
}

In this case GOMP_target gets sizes[0]==40 as input.  Due to this,
gomp_map_vars allocates 40 bytes of memory on the target for 'arr',
instead of 80 bytes.

  -- Ilya


Re: [RFC] Offloading Support in libgomp

2014-01-31 Thread Ilya Verbin
Looks like there is a bug (in GOMP_target lowering? or in
gomp_map_vars_existing?)
The reproducer:

#include <stdio.h>
#include <stdlib.h>

#define N 1000

void foo ()
{
  int *a = malloc (N * sizeof (int));
  printf ("1: %p\n", (void *) a);
  #pragma omp target data map(tofrom: a[0:N])
  {
    printf ("2: %p\n", (void *) a);
    #pragma omp target
    {
      int i;
      for (i = 0; i < N; i++)
        a[i] = i;
    }
    printf ("3: %p\n", (void *) a);
  }
  printf ("4: %p\n", (void *) a);
  free (a);
}

Here GOMP_target believes that the pointer 'a' has map-type TOFROM, so
it sets copy_from to true for the existing mapping of the pointer 'a',
which was created in GOMP_target_data.  Therefore the output is
incorrect:

1: [host addr]
2: [host addr]
3: [host addr]
4: [target addr]

  -- Ilya


Re: [RFC] Offloading Support in libgomp

2014-01-31 Thread Ilya Verbin
2014-01-31 Jakub Jelinek :
> I'd suggest just using map(tofrom: a[0:N]) also on the #pragma omp target,
> then it is clear what should happen.
>
> Jakub

I agree that this will be clearer.  But there is an example #49.1 in
the document [1] with the same case.  And it crashes because the
pointer 'p' is overwritten after the omp target data region.

[1] http://openmp.org/mp-documents/OpenMP4.0.0.Examples.pdf

  -- Ilya


Re: [RFC] Offloading Support in libgomp

2014-02-14 Thread Ilya Verbin
2014-01-31 22:03 GMT+04:00 Jakub Jelinek :
> Implicit map(tofrom: a) on #pragma omp target is what the standard
> requires, so I don't see a bug on the compiler side.
> Jakub

There is an exception in the standard (page 177, lines 17-21):

> If a corresponding list item of the original list item is in the enclosing
> device data environment, the new device data environment uses the
> corresponding list item from the enclosing device data environment. No
> additional storage is allocated in the new device data environment and
> neither initialization nor assignment is performed, regardless of the
> map-type that is specified.

So, the pointer 'a' should inherit map-type ALLOC from the enclosing
device data environment.

  -- Ilya


Re: [RFC] Offloading Support in libgomp

2014-02-17 Thread Ilya Verbin
On 14 Feb 16:43, Jakub Jelinek wrote:
> So, perhaps we should just stop for now oring the copyfrom in and just use
> the copyfrom from the very first mapping only, and wait for what the committee
> actually agrees on.
> 
>   Jakub

Like this?

@@ -171,11 +171,16 @@ gomp_map_vars_existing (splay_tree_key oldn, splay_tree_key newn,
               "[%p..%p) is already mapped",
               (void *) newn->host_start, (void *) newn->host_end,
               (void *) oldn->host_start, (void *) oldn->host_end);
+#if 0
+  /* FIXME: Remove this when OpenMP 4.0 will be standardized.  Currently it's
+     unclear regarding overwriting copy_from for the existing mapping.
+     See http://gcc.gnu.org/ml/gcc/2014-02/msg00208.html for details.  */
   if (((kind & 7) == 2 || (kind & 7) == 3)
   && !oldn->copy_from
   && oldn->host_start == newn->host_start
   && oldn->host_end == newn->host_end)
 oldn->copy_from = true;
+#endif
   oldn->refcount++;
 }

  -- Ilya


SPEC2006 436.cactusADM performance depends on the length of $LD_LIBRARY_PATH

2013-02-19 Thread Ilya Verbin
Hi All,

I discovered a strange behavior of the SPEC CPU2006 436.cactusADM
benchmark. Its performance depends on the length of the $LD_LIBRARY_PATH
variable.
The benchmark was compiled with "-O3 -funroll-loops -ffast-math
-march=core-avx2 -mtune=core-avx2" using gcc version 4.8.0 20130218.
I used Intel Software Development Emulator 5.38.0 to run AVX2 code:

$ export LD_LIBRARY_PATH=''
$ /sde-bdw-external-5.38.0-2013-01-03-lin/sde -icount -- ./cactusADM
benchADM.par > benchADM.out 2> benchADM.err
ICOUNT: 2593690591

$ export LD_LIBRARY_PATH=''
$ /sde-bdw-external-5.38.0-2013-01-03-lin/sde -icount -- ./cactusADM
benchADM.par > benchADM.out 2> benchADM.err
ICOUNT: 2100709724

In the second case 19% fewer instructions are executed (the first run executes
23% more)!

The difference is caused by code that checks the alignment of
some pointer in the benchmark. It seems that the environment variables
(including $LD_LIBRARY_PATH) are placed in the app’s memory, which
leads to a different alignment and therefore to different execution paths.

Is this a known issue? Is there anything GCC can do to make the
alignment more consistent?

Thanks,
Ilya