Hi all, I asked for help with a code problem here a few days ago ( http://www.open-mpi.org/community/lists/users/2011/02/15656.php ). I then found that the code runs without any issue on another cluster, so I suspected that something might be wrong with my cluster environment configuration. I reconfigured NFS, SSH, and other related things and reinstalled the Open MPI library. The cluster consists of two desktops connected by a crossover cable. Both desktops have an Intel Core 2 Duo CPU and run Ubuntu 10.04 LTS, and the version of Open MPI installed on the NFS share (located on the master node) is 1.4.3.
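In case it helps to rule out the basics, these are the sanity checks that seem relevant for this setup (the slave hostname below is a placeholder; only "kongdragon-master" appears in my logs):

```shell
# Sanity checks for a two-node Open MPI + NFS setup (hostnames are placeholders).

# Passwordless SSH must work from the master to the slave (and back):
ssh kongdragon-slave hostname

# Both nodes must see the same Open MPI installation; the reported
# version should match the 1.4.3 installed on the NFS share:
ompi_info | grep "Open MPI:"
ssh kongdragon-slave 'ompi_info | grep "Open MPI:"'

# Each node should resolve the other's hostname to the crossover-cable
# address, not 127.0.0.1 (a common /etc/hosts pitfall on Ubuntu):
getent hosts kongdragon-master kongdragon-slave
```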
Now things seem to be getting worse. I can't successfully run any code more complicated than "MPI hello world". But if all of the processes are launched on the same node, the code runs without any issue. For example, the following code (only one line added to "MPI hello world") crashes at the MPI_Barrier. However, if I delete the MPI_Barrier line, the code runs successfully.
****************************************************************************************************
#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv)
{
    int myrank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("First hello from processor %d of %d\n", myrank, nprocs);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("Second hello from processor %d of %d\n", myrank, nprocs);

    MPI_Finalize();
    return 0;
}
****************************************************************************************************
The output of the above code is:
****************************************************************************************************
[kongdragon-master:16119] *** An error occurred in MPI_Barrier
[kongdragon-master:16119] *** on communicator MPI_COMM_WORLD
[kongdragon-master:16119] *** MPI_ERR_IN_STATUS: error code in status
[kongdragon-master:16119] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
First hello from processor 0 of 2
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 16119 on node kongdragon-master exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
First hello from processor 1 of 2
****************************************************************************************************
Can anyone help point out why things didn't work? Thanks!

Kong
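In case the launch command matters, here is a minimal sketch of how such a job is launched across the two nodes. The hostfile contents, the executable name, and the interface name eth0 are placeholders; btl_tcp_if_include is the Open MPI MCA parameter for restricting the TCP BTL to a specific interface, which can matter when a node has more than one interface:

```shell
# Hostfile listing both nodes (slave hostname is a placeholder):
#   kongdragon-master slots=1
#   kongdragon-slave  slots=1

# Launch two processes, one per node:
mpirun -np 2 --hostfile hostfile ./hello

# If either node has more than one network interface, restricting the
# TCP BTL to the crossover-cable interface (eth0 here is an assumption)
# rules out Open MPI picking an interface the other node cannot reach:
mpirun -np 2 --hostfile hostfile --mca btl_tcp_if_include eth0 ./hello
```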