Hi,

I'm still using "Open MPI: 1.9a1r27668" and Java 1.7.0_07. Today I implemented a few programs that broadcast an int, int[], double, or double[]. All four programs compile without problems, so passing a basic datatype as the "Object buf" parameter of "MPI.COMM_WORLD.Bcast" is accepted by the compiler. Unfortunately, I only get the expected result for arrays of a basic datatype.
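One possible explanation for the scalar case, not verified against the Open MPI sources, is plain Java semantics: an int passed where an "Object" is expected is autoboxed into a temporary, immutable Integer, so a callee has nothing it can write the received value into, whereas an array is passed by reference. The following MPI-free sketch illustrates only that language behaviour. The class "BufferSketch" and the method "fill" are hypothetical stand-ins for illustration and are not part of the Open MPI Java bindings.

```java
public class BufferSketch {

    // Hypothetical stand-in for a receive into an "Object buf" parameter.
    // This is NOT an Open MPI call; it only mimics "write into the buffer".
    static void fill(Object buf, int value) {
        if (buf instanceof int[]) {
            // The array reference points at the caller's array,
            // so this write is visible to the caller.
            ((int[]) buf)[0] = value;
        }
        // If buf is a boxed Integer, it is immutable and only a temporary
        // copy of the caller's int: nothing can be written back.
    }

    public static void main(String[] args) {
        int intValue = 0;
        fill(intValue, 1234567);      // autoboxed to a temporary Integer

        int[] intValues = new int[1];
        fill(intValues, 1234567);     // the caller's array is modified

        System.out.println("intValue: " + intValue);
        System.out.println("intValues[0]: " + intValues[0]);
    }
}
```

If this is indeed the cause, wrapping a scalar in a one-element array before broadcasting it would be the natural workaround. It would not explain the MPI_Comm_dup and opal_init failures, which look like separate problems.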
Process 1 doesn't receive an int value (both processes run on Solaris 10 Sparc).

tyr java 159 mpiexec -np 2 java BcastIntMain
Process 1 running on tyr.informatik.hs-fulda.de.
  intValue: 0
Process 0 running on tyr.informatik.hs-fulda.de.
  intValue: 1234567

Process 1 receives all values from an int array.

tyr java 160 mpiexec -np 2 java BcastIntArrayMain
Process 0 running on tyr.informatik.hs-fulda.de.
  intValues[0]: 1234567
  intValues[1]: 7654321
Process 1 running on tyr.informatik.hs-fulda.de.
  intValues[0]: 1234567
  intValues[1]: 7654321

The program breaks if I use one little endian and one big endian machine.

tyr java 161 mpiexec -np 2 -host sunpc0,tyr java BcastIntMain
[tyr:7657] *** An error occurred in MPI_Comm_dup
[tyr:7657] *** reported by process [3150053377,1]
[tyr:7657] *** on communicator MPI_COMM_WORLD
[tyr:7657] *** MPI_ERR_INTERN: internal error
[tyr:7657] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[tyr:7657] ***    and potentially your MPI job)

The program works if I use two "Solaris 10 x86_64" machines.

tyr java 163 mpiexec -np 2 -host sunpc0,sunpc1 java BcastIntArrayMain
Process 1 running on sunpc1.
  intValues[0]: 1234567
  intValues[1]: 7654321
Process 0 running on sunpc0.
  intValues[0]: 1234567
  intValues[1]: 7654321

The program breaks if I use two Linux.x86_64 machines (Open Suse 12.1).

linpc1 etc 101 mpiexec -np 2 -host linpc0,linpc1 java BcastIntArrayMain
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  mca_base_open failed
  --> Returned value -2 instead of OPAL_SUCCESS
...
  ompi_mpi_init: orte_init failed
  --> Returned "Out of resource" (-2) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[(null):10586] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[16706,1],1]
  Exit code:    1
--------------------------------------------------------------------------

I use a valid environment on all machines. The problem also occurs when I compile and run the program directly on the Linux system.

linpc1 java 101 mpijavac BcastIntMain.java
linpc1 java 102 mpiexec -np 2 -host linpc0,linpc1 java -cp `pwd` BcastIntMain
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  mca_base_open failed
  --> Returned value -2 instead of OPAL_SUCCESS

I get the same errors for the programs with double values. Does anybody have any suggestions on how to solve these problems?
Thank you very much for any help in advance.

Kind regards

Siegmar
Attachments:
BcastIntMain.java
BcastIntArrayMain.java
BcastDoubleMain.java
BcastDoubleArrayMain.java