Hi

> I can't speak to the other issues, but for these - it looks like
> something isn't right in the system. Could be an incompatibility
> with Suse 12.1.
>
> What the errors are saying is that malloc is failing when used at
> a very early stage in starting the process. Can you run even a
> C-based MPI "hello" program?
Yes. I have implemented more or less the same program in C and Java.

tyr hello_1 131 mpiexec -np 2 -host linpc0,linpc1 hello_1_mpi
Process 0 of 2 running on linpc0
Process 1 of 2 running on linpc1
Now 1 slave tasks are sending greetings.
Greetings from task 1:
  message type:        3
  msg length:          132 characters
  message:
    hostname:          linpc1
    operating system:  Linux
    release:           3.1.10-1.16-desktop
    processor:         x86_64

tyr hello_1 132 mpiexec -np 2 -host linpc0,linpc1 java HelloMainWithBarrier
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  mca_base_open failed
  --> Returned value -2 instead of OPAL_SUCCESS
...

Thank you very much for any help in advance.

Kind regards

Siegmar


> On Dec 21, 2012, at 1:41 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>
> > The program breaks if I use two Linux.x86_64 machines (Open Suse 12.1).
> >
> > linpc1 etc 101 mpiexec -np 2 -host linpc0,linpc1 java BcastIntArrayMain
> > --------------------------------------------------------------------------
> > It looks like opal_init failed for some reason; your parallel process is
> > likely to abort. There are many reasons that a parallel process can
> > fail during opal_init; some of which are due to configuration or
> > environment problems. This failure appears to be an internal failure;
> > here's some additional information (which may only be relevant to an
> > Open MPI developer):
> >
> >   mca_base_open failed
> >   --> Returned value -2 instead of OPAL_SUCCESS
> > ...
> >   ompi_mpi_init: orte_init failed
> >   --> Returned "Out of resource" (-2) instead of "Success" (0)
> > --------------------------------------------------------------------------
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > ***    and potentially your MPI job)
> > [(null):10586] Local abort before MPI_INIT completed successfully; not able to
> > aggregate error messages, and not able to guarantee that all other processes
> > were killed!
> > -------------------------------------------------------
> > Primary job terminated normally, but 1 process returned
> > a non-zero exit code.. Per user-direction, the job has been aborted.
> > -------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpiexec detected that one or more processes exited with non-zero status,
> > thus causing the job to be terminated. The first process to do so was:
> >
> >   Process name: [[16706,1],1]
> >   Exit code:    1
> > --------------------------------------------------------------------------
> >
> >
> > I use a valid environment on all machines. The problem occurs as well
> > when I compile and run the program directly on the Linux system.
> >
> > linpc1 java 101 mpijavac BcastIntMain.java
> > linpc1 java 102 mpiexec -np 2 -host linpc0,linpc1 java -cp `pwd` BcastIntMain
> > --------------------------------------------------------------------------
> > It looks like opal_init failed for some reason; your parallel process is
> > likely to abort. There are many reasons that a parallel process can
> > fail during opal_init; some of which are due to configuration or
> > environment problems. This failure appears to be an internal failure;
> > here's some additional information (which may only be relevant to an
> > Open MPI developer):
> >
> >   mca_base_open failed
> >   --> Returned value -2 instead of OPAL_SUCCESS
>
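
For reference, a minimal Java counterpart of the C hello program could look like the sketch below. This is only an illustration (not the actual HelloMainWithBarrier or BcastIntMain source) and it assumes the mpiJava-style API of the Open MPI Java bindings from that time; later releases renamed methods such as Rank(), Size() and Barrier() to getRank(), getSize() and barrier().

import mpi.*;

public class HelloMainWithBarrier
{
  public static void main (String args[]) throws MPIException
  {
    MPI.Init (args);                    /* the opal_init/mca_base_open error above is raised here */
    int rank = MPI.COMM_WORLD.Rank ();  /* mpiJava-style; newer bindings: getRank() */
    int size = MPI.COMM_WORLD.Size ();  /* mpiJava-style; newer bindings: getSize() */
    System.out.println ("Process " + rank + " of " + size + " says hello");
    MPI.COMM_WORLD.Barrier ();          /* newer bindings: barrier() */
    MPI.Finalize ();
  }
}

It is compiled with "mpijavac HelloMainWithBarrier.java" and started with the same mpiexec command as above. Because MPI.Init is the very first statement, the mca_base_open failure is reported before any user code has run, which would be consistent with malloc already failing at a very early stage of process startup.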