[OMPI users] MPI_ERR_INTERN with openmpi-dev-4691-g277c319 on SuSE Linux

2016-08-28 Thread Siegmar Gross

Hi,

I have installed openmpi-dev-4691-g277c319 on my "SUSE Linux
Enterprise Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.1.0.
Unfortunately I get an internal error for all my spawn programs.
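
The programs are minimal dynamic-process tests. For reference, here is a
sketch of what spawn_master does (a reconstruction, not the exact source;
the name of the slave binary is an assumption):

/* spawn_master sketch: parent spawns 4 slaves via MPI_Comm_spawn.
 * Assumed reconstruction of the test, not the original source.
 * Build: mpicc spawn_sketch.c -o spawn_sketch
 */
#include <stdio.h>
#include <mpi.h>

#define NUM_SLAVES 4

int main(int argc, char *argv[])
{
    MPI_Comm child_comm;
    int rank, local_size, remote_size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        printf("I create %d slave processes\n", NUM_SLAVES);
    }

    /* This is the call that aborts with MPI_ERR_INTERN below.
     * The slaves themselves would call MPI_Comm_get_parent(). */
    MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, NUM_SLAVES,
                   MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                   &child_comm, MPI_ERRCODES_IGNORE);

    MPI_Comm_size(child_comm, &local_size);         /* parent group  */
    MPI_Comm_remote_size(child_comm, &remote_size); /* spawned group */
    if (rank == 0) {
        printf("local group: %d, remote group: %d\n",
               local_size, remote_size);
    }

    MPI_Comm_disconnect(&child_comm);
    MPI_Finalize();
    return 0;
}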


loki spawn 147 ompi_info | grep -e "Open MPI repo revision" -e "C compiler absolute"
  Open MPI repo revision: dev-4691-g277c319
     C compiler absolute: /opt/solstudio12.5b/bin/cc
loki spawn 148


loki spawn 151 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

[loki:10461] [[46948,1],0] ORTE_ERROR_LOG: Unreachable in file ../../openmpi-dev-4691-g277c319/ompi/dpm/dpm.c at line 426
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[46948,1],0]) is on host: loki
  Process 2 ([[46948,2],0]) is on host: loki
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[loki:10461] *** An error occurred in MPI_Comm_spawn
[loki:10461] *** reported by process [3076784129,0]
[loki:10461] *** on communicator MPI_COMM_WORLD
[loki:10461] *** MPI_ERR_INTERN: internal error
[loki:10461] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[loki:10461] ***    and potentially your MPI job)




loki spawn 152 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_multiple_master


Parent process 0 running on loki
  I create 3 slave processes.

[loki:10482] [[46929,1],0] ORTE_ERROR_LOG: Unreachable in file ../../openmpi-dev-4691-g277c319/ompi/dpm/dpm.c at line 426
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[46929,1],0]) is on host: loki
  Process 2 ([[46929,2],0]) is on host: loki
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[loki:10482] *** An error occurred in MPI_Comm_spawn_multiple
[loki:10482] *** reported by process [3075538945,0]
[loki:10482] *** on communicator MPI_COMM_WORLD
[loki:10482] *** MPI_ERR_INTERN: internal error
[loki:10482] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[loki:10482] ***    and potentially your MPI job)




loki spawn 153 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_intra_comm
Parent process 0: I create 2 slave processes
[loki:10500] [[46915,1],0] ORTE_ERROR_LOG: Unreachable in file ../../openmpi-dev-4691-g277c319/ompi/dpm/dpm.c at line 426
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[46915,1],0]) is on host: loki
  Process 2 ([[46915,2],0]) is on host: loki
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[loki:10500] *** An error occurred in MPI_Comm_spawn
[loki:10500] *** reported by process [3074621441,0]
[loki:10500] *** on communicator MPI_COMM_WORLD
[loki:10500] *** MPI_ERR_INTERN: internal error
[loki:10500] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[loki:10500] ***    and potentially your MPI job)
loki spawn 154
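
For completeness, spawn_intra_comm presumably merges the parent/child
inter-communicator into one intra-communicator. A sketch of that pattern
(assumed structure, not the original program):

/* spawn_intra_comm sketch: spawn 2 slaves, then merge parent and
 * children into a single intra-communicator (assumed reconstruction). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm, intracomm;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent side: this MPI_Comm_spawn is where the
         * "Unreachable" error above is raised. */
        printf("Parent process 0: I create 2 slave processes\n");
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm,
                       MPI_ERRCODES_IGNORE);
    } else {
        intercomm = parent;   /* child side: use the parent intercomm */
    }

    /* high = 1 on the child side orders children after the parent. */
    MPI_Intercomm_merge(intercomm, parent != MPI_COMM_NULL, &intracomm);
    MPI_Comm_rank(intracomm, &rank);
    MPI_Comm_size(intracomm, &size);
    printf("rank %d of %d in merged intra-communicator\n", rank, size);

    MPI_Comm_free(&intracomm);
    MPI_Finalize();
    return 0;
}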


I would be grateful if somebody could fix the problem. Thank you
very much in advance for any help.


Kind regards

Siegmar


[OMPI users] Segmentation fault for openmpi-v2.0.0-233-gb5f0a4f with SuSE Linux

2016-08-28 Thread Siegmar Gross

Hi,

I have installed openmpi-v2.0.0-233-gb5f0a4f on my "SUSE Linux
Enterprise Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.1.0.
Unfortunately I have a problem with my program "spawn_master".
It hangs if I run it on my local machine, and I get a segmentation
fault if I run it on a remote machine. Both machines use the same
operating system. Everything works as expected if I use the same
hostname five times in "--host" instead of a combination of "--host"
and "--slot-list". Everything also works as expected if I use my
program "spawn_multiple_master" instead of "spawn_master".


loki hello_2 151 ompi_info | grep -e "Open MPI repo revision" -e "C compiler absolute"
  Open MPI repo revision: v2.0.0-233-gb5f0a4f
     C compiler absolute: /opt/solstudio12.5b/bin/cc


loki spawn 152 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

^C
loki spawn 153 mpiexec -np 1 --host nfs1 --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on nfs1
  I create 4 slave processes

[nfs1:09963] *** Process received signal ***
[nfs1:09963] Signal: Segmentation fault (11)
[nfs1:09963] Signal code: Address not mapped (1)
[nfs1:09963] Failing at address: 0x64
[nfs1:09963] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f6f55794870]
[nfs1:09963] [ 1] /usr/local/openmpi-2.0.1_64_cc/lib64/openmpi/mca_state_orted.so(+0x1055a)[0x7f6f5478155a]
[nfs1:09963] [ 2] /usr/local/openmpi-2.0.1_64_cc/lib64/libopen-pal.so.20(+0x2306a4)[0x7f6f566f46a4]
[nfs1:09963] [ 3] /usr/local/openmpi-2.0.1_64_cc/lib64/libopen-pal.so.20(+0x230a2a)[0x7f6f566f4a2a]
[nfs1:09963] [ 4] /usr/local/openmpi-2.0.1_64_cc/lib64/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x2d9)[0x7f6f566f5379]
[nfs1:09963] [ 5] /usr/local/openmpi-2.0.1_64_cc/lib64/libopen-rte.so.20(orte_daemon+0x2b66)[0x7f6f56cf63c6]
[nfs1:09963] [ 6] orted[0x407575]
[nfs1:09963] [ 7] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6f553feb25]
[nfs1:09963] [ 8] orted[0x401832]
[nfs1:09963] *** End of error message ***
Segmentation fault
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

  hostname:  nfs1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
loki spawn 154




loki spawn 144 mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master

Parent process 0 running on loki
  I create 4 slave processes

Slave process 0 of 4 running on loki
spawn_slave 0: argv[0]: spawn_slave
Slave process 1 of 4 running on loki
spawn_slave 1: argv[0]: spawn_slave
Slave process 2 of 4 running on loki
spawn_slave 2: argv[0]: spawn_slave
Slave process 3 of 4 running on loki
spawn_slave 3: argv[0]: spawn_slave
Parent process 0: tasks in MPI_COMM_WORLD:1
  tasks in COMM_CHILD_PROCESSES local group:  1
  tasks in COMM_CHILD_PROCESSES remote group: 4

loki spawn 145



loki spawn 106 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_multiple_master


Parent process 0 running on loki
  I create 3 slave processes.

Slave process 0 of 2 running on loki
Slave process 1 of 2 running on loki
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 1: argv[1]: program type 2
spawn_slave 1: argv[2]: another parameter
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 0: argv[1]: program type 1
Parent process 0: tasks in MPI_COMM_WORLD:1
  tasks in COMM_CHILD_PROCESSES local group:  1
  tasks in COMM_CHILD_PROCESSES remote group: 2


loki spawn 107 mpiexec -np 1 --host nfs1 --slot-list 0:0-5,1:0-5 spawn_multiple_master


Parent process 0 running on nfs1
  I create 3 slave processes.

Slave process 0 of 2 running on nfs1
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 0: argv[1]: program type 1
Slave process 1 of 2 running on nfs1
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 1: argv[1]: program type 2
spawn_slave 1: argv[2]: another parameter
Parent process 0: tasks in MPI_COMM_WORLD:1
  tasks in COMM_CHILD_PROCESSES local group:  1
  tasks in COMM_CHILD_PROCESSES remote group: 2

loki spawn 108
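
The argv listings above are consistent with an MPI_Comm_spawn_multiple
call of roughly this shape (a sketch; the actual spawn_multiple_master
source is not shown, so names and counts are assumptions):

/* spawn_multiple sketch: launch the same slave binary twice with
 * different argument vectors (assumed reconstruction). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char *cmds[2]      = { "spawn_slave", "spawn_slave" };
    char *argv0[]      = { "program type 1", NULL };
    char *argv1[]      = { "program type 2", "another parameter", NULL };
    char **cmd_argv[2] = { argv0, argv1 };
    int maxprocs[2]    = { 1, 1 };
    MPI_Info infos[2]  = { MPI_INFO_NULL, MPI_INFO_NULL };
    MPI_Comm child_comm;
    int remote_size;

    MPI_Init(&argc, &argv);

    /* One slave per command, each with its own argv, as in the
     * "spawn_slave N: argv[...]" output above. */
    MPI_Comm_spawn_multiple(2, cmds, cmd_argv, maxprocs, infos, 0,
                            MPI_COMM_WORLD, &child_comm,
                            MPI_ERRCODES_IGNORE);

    MPI_Comm_remote_size(child_comm, &remote_size);
    printf("tasks in remote group: %d\n", remote_size);

    MPI_Comm_disconnect(&child_comm);
    MPI_Finalize();
    return 0;
}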



I would be grateful if somebody could fix the problem. Thank you
very much in advance for any help.


Kind regards

Siegmar


Re: [OMPI users] MPI_ERR_INTERN with openmpi-dev-4691-g277c319 on SuSE Linux

2016-08-28 Thread Gilles Gouaillardet
Siegmar,

this is a known issue that is tracked at
https://github.com/open-mpi/ompi/issues/1998

Cheers,

Gilles

On Sunday, August 28, 2016, Siegmar Gross
<siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi,
>
> I have installed openmpi-dev-4691-g277c319 on my "SUSE Linux
> Enterprise Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.1.0.
> Unfortunately I get an internal error for all my spawn programs.
>
> [...]

[OMPI users] problem with exceptions in Java interface

2016-08-28 Thread Siegmar Gross

Hi,

I have installed openmpi-v1.10.3-31-g35ba6a1, openmpi-v2.0.0-233-gb5f0a4f,
and openmpi-dev-4691-g277c319 on my "SUSE Linux Enterprise Server
12 (x86_64)" with Sun C 5.14 beta and gcc-6.1.0. In May I reported
a problem with Java exceptions (issue 1698), which was solved in
June (PR 1803).

https://github.com/open-mpi/ompi/issues/1698
https://github.com/open-mpi/ompi/pull/1803

Unfortunately the problem still exists, or has reappeared, in all
three branches.
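
For comparison, this is the behavior the Java binding is expected to
mirror: in C, once MPI_ERRORS_RETURN is installed, an erroneous call
returns an error code instead of aborting, and the Java binding should
turn that into an MPIException rather than letting a raw
ArrayIndexOutOfBoundsException escape. A minimal C sketch of the
equivalent test (not the original Exception_2_Main):

/* C analogue of the Java test: install MPI_ERRORS_RETURN, then make
 * a deliberately invalid MPI_Bcast call and inspect the return code
 * (minimal sketch under those assumptions). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int buf[1] = { 0 };
    char msg[MPI_MAX_ERROR_STRING];
    int rc, len;

    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Bcast(buf, -1, MPI_INT, 0, MPI_COMM_WORLD); /* invalid count */
    if (rc != MPI_SUCCESS) {
        MPI_Error_string(rc, msg, &len);
        printf("MPI_Bcast returned an error as expected: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}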


loki fd1026 112 ompi_info | grep -e "Open MPI repo revision" -e "C compiler absolute"
  Open MPI repo revision: dev-4691-g277c319
     C compiler absolute: /opt/solstudio12.5b/bin/cc
loki fd1026 112 mpijavac Exception_2_Main.java
warning: [path] bad path element "/usr/local/openmpi-master_64_cc/lib64/shmem.jar": no such file or directory
1 warning
loki fd1026 113 mpiexec -np 1 java Exception_2_Main
Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
Call "bcast" with index out-of bounds.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
at mpi.Comm.bcast(Native Method)
at mpi.Comm.bcast(Comm.java:1252)
at Exception_2_Main.main(Exception_2_Main.java:22)
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[58548,1],0]
  Exit code:1
--------------------------------------------------------------------------
loki fd1026 114 exit



loki fd1026 116 ompi_info | grep -e "Open MPI repo revision" -e "C compiler absolute"
  Open MPI repo revision: v2.0.0-233-gb5f0a4f
     C compiler absolute: /opt/solstudio12.5b/bin/cc
loki fd1026 117 mpijavac Exception_2_Main.java
warning: [path] bad path element "/usr/local/openmpi-2.0.1_64_cc/lib64/shmem.jar": no such file or directory
1 warning
loki fd1026 118 mpiexec -np 1 java Exception_2_Main
Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
Call "bcast" with index out-of bounds.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
at mpi.Comm.bcast(Native Method)
at mpi.Comm.bcast(Comm.java:1252)
at Exception_2_Main.main(Exception_2_Main.java:22)
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[58485,1],0]
  Exit code:1
--------------------------------------------------------------------------
loki fd1026 119 exit



loki fd1026 107 ompi_info | grep -e "Open MPI repo revision" -e "C compiler absolute"
  Open MPI repo revision: v1.10.3-31-g35ba6a1
     C compiler absolute: /opt/solstudio12.5b/bin/cc
loki fd1026 107 mpijavac Exception_2_Main.java
loki fd1026 108 mpiexec -np 1 java Exception_2_Main
Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
Call "bcast" with index out-of bounds.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
at mpi.Comm.bcast(Native Method)
at mpi.Comm.bcast(Comm.java:1231)
at Exception_2_Main.main(Exception_2_Main.java:22)
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[34400,1],0]
  Exit code:1
--------------------------------------------------------------------------
loki fd1026 109 exit




I would be grateful if somebody could fix the problem. Thank you
very much in advance for any help.


Kind regards

Siegmar