I have an MPI program that is fairly straightforward: essentially "initialize, 2 sends from master to slaves, 2 receives on the slaves, do a bunch of system calls for copying/pasting and then running a serial code on each MPI task, tidy up and call MPI_FINALIZE". This seems simple enough, but I'm not getting MPI_FINALIZE to work correctly. Below is a condensed version of the program, with all of the system copy/paste/call-external-code sections rolled up into "do codish stuff" comments.

```fortran
program mpi_finalize_break
    !<variable declarations>
    call MPI_INIT(ierr)
    icomm = MPI_COMM_WORLD
    call MPI_COMM_SIZE(icomm,nproc,ierr)
    call MPI_COMM_RANK(icomm,rank,ierr)

    !<do codish stuff for a while>
    if (rank == 0) then
        !<set up some stuff then call MPI_SEND in a loop over number of slaves>
        call MPI_SEND(numat,1,MPI_INTEGER,n,0,icomm,ierr)
        call MPI_SEND(n_to_add,1,MPI_INTEGER,n,0,icomm,ierr)
    else
        call MPI_Recv(begin_mat,1,MPI_INTEGER,0,0,icomm,status,ierr)
        call MPI_Recv(nrepeat,1,MPI_INTEGER,0,0,icomm,status,ierr)
        !<do codish stuff for a while>
    endif

    print*, "got here4", rank
    call MPI_BARRIER(icomm,ierr)
    print*, "got here5", rank, ierr
    call MPI_FINALIZE(ierr)
    print*, "got here6"

end program mpi_finalize_break
```
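For context, each elided "do codish stuff" section boils down to staging some files and shelling out to a serial executable, roughly along the lines below. This is only a hypothetical sketch: the file names, the per-rank directory scheme, and `run_some_code` are placeholders, and whether the real code uses `call system(...)` or `execute_command_line()` is not shown here.

```fortran
! Hypothetical sketch of one elided "do codish stuff" section.
! "rank" is the MPI rank from the enclosing program; the paths and the
! run_some_code executable are placeholders, not the actual names.
character(len=32) :: workdir
integer           :: cmdstat

! each rank works in its own scratch directory
write(workdir,'(a,i0)') "task_", rank

! stage the input, run the serial code, collect the output
call execute_command_line("cp template.inp "//trim(workdir)//"/input.inp", &
                          wait=.true., exitstat=cmdstat)
call execute_command_line("cd "//trim(workdir)//" && ./run_some_code", &
                          wait=.true., exitstat=cmdstat)
call execute_command_line("mv "//trim(workdir)//"/output.dat results_"// &
                          trim(workdir), wait=.true., exitstat=cmdstat)
```

In the real program, commands of this sort run on each MPI task before the final barrier.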
Now the problem I am seeing occurs around the "got here4", "got here5", and "got here6" statements. I get the appropriate number of print statements, with the corresponding ranks, for "got here4" as well as "got here5". That is, the master and all the slaves (rank 0 and all other ranks) reach the barrier, pass through it, and reach MPI_FINALIZE, all reporting ierr = 0. However, after MPI_FINALIZE, at "got here6", I get all kinds of weird behavior: sometimes I get one fewer "got here6" than I expect, sometimes eight fewer (it varies), and the program hangs forever, never exiting and leaving an orphaned process on one (or more) of the compute nodes.

I am running this on a machine with an InfiniBand backbone, with the NFS server shared over InfiniBand (NFS/RDMA). I'm trying to work out how the MPI_BARRIER call can work fine, yet MPI_FINALIZE ends up with random orphaned processes (not the same node, nor the same number of orphans, each time). My guess is that it is related to the various system calls to cp, mv, ./run_some_code, cp, mv, but I wasn't sure whether it might also be related to the speed of InfiniBand, since all of this happens fairly quickly. I could have the wrong intuition as well. Does anybody have thoughts? I can post the whole code if that would help, but I believe this condensed version captures the issue.

I'm running Open MPI 1.8.4 compiled against ifort 15.0.2, with Mellanox adapters running firmware 2.9.1000. This is the Mellanox firmware available through yum on CentOS 6.5, kernel 2.6.32-504.8.1.el6.x86_64.

```
ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:192.168.6.254  Bcast:192.168.6.255  Mask:255.255.255.0
          inet6 addr: fe80::202:c903:57:e7fd/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:10952 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9805 errors:0 dropped:625413 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:830040 (810.5 KiB)  TX bytes:643212 (628.1 KiB)
```

```
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.9.1000
        node_guid:                      0002:c903:0057:e7fc
        sys_image_guid:                 0002:c903:0057:e7ff
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xB0
        board_id:                       MT_0D90110009
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               2
                        port_lmc:               0x00
                        link_layer:             InfiniBand
```

This problem occurs only in this simple implementation, hence my thinking that it is tied to the system calls. I run several other, much larger, much more robust MPI codes on this machine without issue.

Thanks for the help.

--Jack