Dear list readers,

I have a problem with OpenMPI 3.1.1. With some node combinations, I get the following error:

libibverbs: GRH is mandatory For RoCE address handle
*** Error in `/apps/brussel/CO7/ivybridge-ib/software/OpenMPI/3.1.1-GCC-7.3.0-2.30/bin/orted': double free or corruption (out): 0x00002aaab4001680 ***

This happens even with the simplest run, such as

mpirun -host nic114,nic151 hostname

The file 114_151.out.bz2 contains the full output when I run the command from nic114. If I run the same command from nic151, it simply prints the hostnames, without any errors.
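Since the libibverbs message refers to RoCE, which runs over ports with an Ethernet link layer, it may help to list the link layer of each HCA port on both nodes. The following is only a minimal sketch: the helper name list_link_layers is hypothetical, and it assumes the usual ibv_devinfo output layout with "hca_id:", "port:" and "link_layer:" fields.

```shell
#!/bin/sh
# Hypothetical helper: read ibv_devinfo output on stdin and print one line
# per port in the form "<hca_id> <port> <link_layer>".
# Assumes the usual "hca_id:", "port:" and "link_layer:" fields.
list_link_layers() {
  awk '
    $1 == "hca_id:"     { hca = $2 }            # remember the current device
    $1 == "port:"       { port = $2 }           # remember the current port
    $1 == "link_layer:" { print hca, port, $2 } # emit device, port, link layer
  '
}

# Intended use on each node (not run here):
#   ibv_devinfo | list_link_layers
```

If a port on one node reports Ethernet while the corresponding port on the other reports InfiniBand, that difference could be a starting point. One option that might be worth trying in that case is restricting the openib BTL to the InfiniBand device with the btl_openib_if_include MCA parameter, using the device name taken from the output above.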
I have also enclosed the output of ompi_info --all --parsable from nic114 (see ompi.nic114.bz2; the output from nic151 is identical). I do not have the config.log file, although I still have the config output (see config.out.bz2). The nodes run identical operating systems (we use the same image), and OpenMPI is loaded from a central directory shared among the nodes. We have an InfiniBand network (with IP over IB) and an Ethernet network. Intel MPI works without a problem, and I confirmed that the network is IB when I use Intel MPI.

It is not clear whether the orted error is a consequence of the libibverbs error, nor is it clear why OpenMPI wants to use RoCE at all. (The ibv_devinfo output is also attached. We do have a somewhat creative InfiniBand topology, based on fat-tree, but changing the topology did not solve the problem.) The /tmp directory is writable and not full.

I get the same error with OpenMPI 2.0.2 and 2.1.1, but not with OpenMPI 1.10.2 and 1.10.3.

Does anyone have any thoughts about this issue?

Regards,
Balazs Hajgato
Attachments:
ibv_dev.nic114
ibv_dev.nic151
114_151.out.bz2
config.out.bz2
ompi.nic114.bz2
_______________________________________________ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users