[OMPI users] MPI_ERR_INTERN with openmpi-dev-4691-g277c319 on SuSE Linux
Hi,

I have installed openmpi-dev-4691-g277c319 on my "SUSE Linux Enterprise Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.1.0. Unfortunately I get an internal error for all my spawn programs.

loki spawn 147 ompi_info | grep -e "Open MPI repo revision" -e "C compiler absolute"
Open MPI repo revision: dev-4691-g277c319
C compiler absolute: /opt/solstudio12.5b/bin/cc
loki spawn 148

loki spawn 151 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
Parent process 0 running on loki
I create 4 slave processes
[loki:10461] [[46948,1],0] ORTE_ERROR_LOG: Unreachable in file ../../openmpi-dev-4691-g277c319/ompi/dpm/dpm.c at line 426
--
At least one pair of MPI processes are unable to reach each other for MPI communications. This means that no Open MPI device has indicated that it can be used to communicate between these processes. This is an error; Open MPI requires that all MPI processes be able to reach each other. This error can sometimes be the result of forgetting to specify the "self" BTL.

Process 1 ([[46948,1],0]) is on host: loki
Process 2 ([[46948,2],0]) is on host: loki
BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--
[loki:10461] *** An error occurred in MPI_Comm_spawn
[loki:10461] *** reported by process [3076784129,0]
[loki:10461] *** on communicator MPI_COMM_WORLD
[loki:10461] *** MPI_ERR_INTERN: internal error
[loki:10461] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[loki:10461] *** and potentially your MPI job)

loki spawn 152 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_multiple_master
Parent process 0 running on loki
I create 3 slave processes.
[loki:10482] [[46929,1],0] ORTE_ERROR_LOG: Unreachable in file ../../openmpi-dev-4691-g277c319/ompi/dpm/dpm.c at line 426
--
At least one pair of MPI processes are unable to reach each other for MPI communications. This means that no Open MPI device has indicated that it can be used to communicate between these processes. This is an error; Open MPI requires that all MPI processes be able to reach each other. This error can sometimes be the result of forgetting to specify the "self" BTL.

Process 1 ([[46929,1],0]) is on host: loki
Process 2 ([[46929,2],0]) is on host: loki
BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--
[loki:10482] *** An error occurred in MPI_Comm_spawn_multiple
[loki:10482] *** reported by process [3075538945,0]
[loki:10482] *** on communicator MPI_COMM_WORLD
[loki:10482] *** MPI_ERR_INTERN: internal error
[loki:10482] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[loki:10482] *** and potentially your MPI job)

loki spawn 153 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_intra_comm
Parent process 0: I create 2 slave processes
[loki:10500] [[46915,1],0] ORTE_ERROR_LOG: Unreachable in file ../../openmpi-dev-4691-g277c319/ompi/dpm/dpm.c at line 426
--
At least one pair of MPI processes are unable to reach each other for MPI communications. This means that no Open MPI device has indicated that it can be used to communicate between these processes. This is an error; Open MPI requires that all MPI processes be able to reach each other. This error can sometimes be the result of forgetting to specify the "self" BTL.

Process 1 ([[46915,1],0]) is on host: loki
Process 2 ([[46915,2],0]) is on host: loki
BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--
[loki:10500] *** An error occurred in MPI_Comm_spawn
[loki:10500] *** reported by process [3074621441,0]
[loki:10500] *** on communicator MPI_COMM_WORLD
[loki:10500] *** MPI_ERR_INTERN: internal error
[loki:10500] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[loki:10500] *** and potentially your MPI job)
loki spawn 154

I would be grateful if somebody could fix the problem. Thank you very much for any help in advance.

Kind regards

Siegmar
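For reference, the failing call in all three runs is MPI_Comm_spawn (or MPI_Comm_spawn_multiple), raised from ompi/dpm/dpm.c. The following is only a minimal sketch of a spawn test in this spirit; it is not Siegmar's actual spawn_master source, and the slave executable name, process count, and argument handling are assumptions.

/* spawn_sketch.c - minimal, hypothetical MPI_Comm_spawn example
 * (not the original spawn_master).
 * Build: mpicc spawn_sketch.c -o spawn_sketch
 * Run:   mpiexec -np 1 ./spawn_sketch
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, errcodes[4];
    MPI_Comm intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        printf("Parent process %d: I create 4 slave processes\n", rank);

    /* Spawn 4 copies of a slave program; parent and children are connected
     * through the returned inter-communicator.  The MPI_ERR_INTERN reported
     * above is raised from inside this call. */
    MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, errcodes);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}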
[OMPI users] Segmentation fault for openmpi-v2.0.0-233-gb5f0a4f with SuSE Linux
Hi,

I have installed openmpi-v2.0.0-233-gb5f0a4f on my "SUSE Linux Enterprise Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.1.0. Unfortunately I have a problem with my program "spawn_master". It hangs if I run it on my local machine, and I get a segmentation fault if I run it on a remote machine. Both machines use the same operating system. Everything works as expected if I use the same hostname five times in "--host" instead of a combination of "--host" and "--slot-list". Everything also works as expected if I use my program "spawn_multiple_master" instead of "spawn_master".

loki hello_2 151 ompi_info | grep -e "Open MPI repo revision" -e "C compiler absolute"
Open MPI repo revision: v2.0.0-233-gb5f0a4f
C compiler absolute: /opt/solstudio12.5b/bin/cc

loki spawn 152 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
Parent process 0 running on loki
I create 4 slave processes
^C

loki spawn 153 mpiexec -np 1 --host nfs1 --slot-list 0:0-5,1:0-5 spawn_master
Parent process 0 running on nfs1
I create 4 slave processes
[nfs1:09963] *** Process received signal ***
[nfs1:09963] Signal: Segmentation fault (11)
[nfs1:09963] Signal code: Address not mapped (1)
[nfs1:09963] Failing at address: 0x64
[nfs1:09963] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f6f55794870]
[nfs1:09963] [ 1] /usr/local/openmpi-2.0.1_64_cc/lib64/openmpi/mca_state_orted.so(+0x1055a)[0x7f6f5478155a]
[nfs1:09963] [ 2] /usr/local/openmpi-2.0.1_64_cc/lib64/libopen-pal.so.20(+0x2306a4)[0x7f6f566f46a4]
[nfs1:09963] [ 3] /usr/local/openmpi-2.0.1_64_cc/lib64/libopen-pal.so.20(+0x230a2a)[0x7f6f566f4a2a]
[nfs1:09963] [ 4] /usr/local/openmpi-2.0.1_64_cc/lib64/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x2d9)[0x7f6f566f5379]
[nfs1:09963] [ 5] /usr/local/openmpi-2.0.1_64_cc/lib64/libopen-rte.so.20(orte_daemon+0x2b66)[0x7f6f56cf63c6]
[nfs1:09963] [ 6] orted[0x407575]
[nfs1:09963] [ 7] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6f553feb25]
[nfs1:09963] [ 8] orted[0x401832]
[nfs1:09963] *** End of error message ***
Segmentation fault
--
ORTE has lost communication with its daemon located on node:

hostname: nfs1

This is usually due to either a failure of the TCP network connection to the node, or possibly an internal failure of the daemon itself. We cannot recover from this failure, and therefore will terminate the job.
--
loki spawn 154

loki spawn 144 mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master
Parent process 0 running on loki
I create 4 slave processes
Slave process 0 of 4 running on loki
spawn_slave 0: argv[0]: spawn_slave
Slave process 1 of 4 running on loki
spawn_slave 1: argv[0]: spawn_slave
Slave process 2 of 4 running on loki
spawn_slave 2: argv[0]: spawn_slave
Slave process 3 of 4 running on loki
spawn_slave 3: argv[0]: spawn_slave
Parent process 0: tasks in MPI_COMM_WORLD: 1
tasks in COMM_CHILD_PROCESSES local group: 1
tasks in COMM_CHILD_PROCESSES remote group: 4
loki spawn 145

loki spawn 106 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_multiple_master
Parent process 0 running on loki
I create 3 slave processes.
Slave process 0 of 2 running on loki
Slave process 1 of 2 running on loki
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 1: argv[1]: program type 2
spawn_slave 1: argv[2]: another parameter
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 0: argv[1]: program type 1
Parent process 0: tasks in MPI_COMM_WORLD: 1
tasks in COMM_CHILD_PROCESSES local group: 1
tasks in COMM_CHILD_PROCESSES remote group: 2

loki spawn 107 mpiexec -np 1 --host nfs1 --slot-list 0:0-5,1:0-5 spawn_multiple_master
Parent process 0 running on nfs1
I create 3 slave processes.
Slave process 0 of 2 running on nfs1
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 0: argv[1]: program type 1
Slave process 1 of 2 running on nfs1
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 1: argv[1]: program type 2
spawn_slave 1: argv[2]: another parameter
Parent process 0: tasks in MPI_COMM_WORLD: 1
tasks in COMM_CHILD_PROCESSES local group: 1
tasks in COMM_CHILD_PROCESSES remote group: 2
loki spawn 108

I would be grateful if somebody could fix the problem. Thank you very much for any help in advance.

Kind regards

Siegmar
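For reference, the spawn_multiple_master output above is consistent with an MPI_Comm_spawn_multiple call that starts two slaves with different argument vectors. The following is only a sketch under that assumption; the commands, maxprocs values, and argument lists are guesses, not the actual source.

/* spawn_multiple_sketch.c - hypothetical MPI_Comm_spawn_multiple example
 * (not the original spawn_multiple_master). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    /* Two command entries started with different argument vectors, as the
     * "program type 1" / "program type 2" output above suggests. */
    char *commands[2]    = { "spawn_slave", "spawn_slave" };
    char *argv1[]        = { "program type 1", NULL };
    char *argv2[]        = { "program type 2", "another parameter", NULL };
    char **spawn_argv[2] = { argv1, argv2 };
    int maxprocs[2]      = { 1, 1 };
    MPI_Info infos[2]    = { MPI_INFO_NULL, MPI_INFO_NULL };
    MPI_Comm intercomm;

    MPI_Init(&argc, &argv);

    /* One collective call starts both slaves and returns a single
     * inter-communicator connecting the parent to all children. */
    MPI_Comm_spawn_multiple(2, commands, spawn_argv, maxprocs, infos,
                            0, MPI_COMM_WORLD, &intercomm,
                            MPI_ERRCODES_IGNORE);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}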
Re: [OMPI users] MPI_ERR_INTERN with openmpi-dev-4691-g277c319 on SuSE Linux
Siegmar,

this is a known issue that is tracked at https://github.com/open-mpi/ompi/issues/1998

Cheers,

Gilles

On Sunday, August 28, 2016, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
> Hi,
>
> I have installed openmpi-dev-4691-g277c319 on my "SUSE Linux
> Enterprise Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.1.0.
> Unfortunately I get an internal error for all my spawn programs.
[OMPI users] problem with exceptions in Java interface
Hi,

I have installed v1.10.3-31-g35ba6a1, openmpi-v2.0.0-233-gb5f0a4f, and openmpi-dev-4691-g277c319 on my "SUSE Linux Enterprise Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.1.0. In May I had reported a problem with Java exceptions (issue 1698), which was solved in June (PR 1803).

https://github.com/open-mpi/ompi/issues/1698
https://github.com/open-mpi/ompi/pull/1803

Unfortunately the problem still exists, or has reappeared, in all three branches.

loki fd1026 112 ompi_info | grep -e "Open MPI repo revision" -e "C compiler absolute"
Open MPI repo revision: dev-4691-g277c319
C compiler absolute: /opt/solstudio12.5b/bin/cc
loki fd1026 112 mpijavac Exception_2_Main.java
warning: [path] bad path element "/usr/local/openmpi-master_64_cc/lib64/shmem.jar": no such file or directory
1 warning
loki fd1026 113 mpiexec -np 1 java Exception_2_Main
Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
Call "bcast" with index out-of bounds.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
at mpi.Comm.bcast(Native Method)
at mpi.Comm.bcast(Comm.java:1252)
at Exception_2_Main.main(Exception_2_Main.java:22)
---
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
---
--
mpiexec detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[58548,1],0]
Exit code: 1
--
loki fd1026 114 exit

loki fd1026 116 ompi_info | grep -e "Open MPI repo revision" -e "C compiler absolute"
Open MPI repo revision: v2.0.0-233-gb5f0a4f
C compiler absolute: /opt/solstudio12.5b/bin/cc
loki fd1026 117 mpijavac Exception_2_Main.java
warning: [path] bad path element "/usr/local/openmpi-2.0.1_64_cc/lib64/shmem.jar": no such file or directory
1 warning
loki fd1026 118 mpiexec -np 1 java Exception_2_Main
Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
Call "bcast" with index out-of bounds.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
at mpi.Comm.bcast(Native Method)
at mpi.Comm.bcast(Comm.java:1252)
at Exception_2_Main.main(Exception_2_Main.java:22)
---
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
---
--
mpiexec detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[58485,1],0]
Exit code: 1
--
loki fd1026 119 exit

loki fd1026 107 ompi_info | grep -e "Open MPI repo revision" -e "C compiler absolute"
Open MPI repo revision: v1.10.3-31-g35ba6a1
C compiler absolute: /opt/solstudio12.5b/bin/cc
loki fd1026 107 mpijavac Exception_2_Main.java
loki fd1026 108 mpiexec -np 1 java Exception_2_Main
Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
Call "bcast" with index out-of bounds.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
at mpi.Comm.bcast(Native Method)
at mpi.Comm.bcast(Comm.java:1231)
at Exception_2_Main.main(Exception_2_Main.java:22)
---
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
---
--
mpiexec detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[34400,1],0]
Exit code: 1
--
loki fd1026 109 exit

I would be grateful if somebody could fix the problem. Thank you very much for any help in advance.
Kind regards

Siegmar
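For reference, the following is a minimal Java test in the spirit of Exception_2_Main; it is a hypothetical reconstruction, not the actual source, and the buffer size and the deliberately too-large count are assumptions. With MPI.ERRORS_RETURN set on MPI.COMM_WORLD, the out-of-bounds bcast would be expected to surface as an mpi.MPIException that the program can catch, rather than the unhandled java.lang.ArrayIndexOutOfBoundsException shown in the runs above.

// ExceptionSketch.java - hypothetical reconstruction of a test like
// Exception_2_Main (not the original source).
// Build: mpijavac ExceptionSketch.java
// Run:   mpiexec -np 1 java ExceptionSketch
import mpi.MPI;
import mpi.MPIException;

public class ExceptionSketch {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);

        // Errors should be returned (reported as MPIException in Java)
        // instead of aborting the job.
        MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN);
        System.out.println("Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.");

        int[] buf = new int[1];
        try {
            System.out.println("Call \"bcast\" with index out-of bounds.");
            // count = 2 exceeds the buffer length of 1 on purpose; the
            // expectation is an MPIException, not ArrayIndexOutOfBoundsException.
            MPI.COMM_WORLD.bcast(buf, 2, MPI.INT, 0);
        } catch (MPIException ex) {
            System.out.println("Caught MPIException: " + ex.getMessage());
        }

        MPI.Finalize();
    }
}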