Hi all,
We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3. Both builds show odd behavior when reading from standard input. For example, if we start LAMMPS across 4 nodes, 16 cores per node, connected by Intel QDR InfiniBand, mpirun works fine the first time but always gets stuck within a few seconds on subsequent runs. The command is

  mpirun ./lmp_ompi_g++ < in.snr

where in.snr is the LAMMPS input file (compiler gcc/6.1). If we instead pass the input file on the command line,

  mpirun ./lmp_ompi_g++ -in in.snr

it works 100% of the time.

Some odd behaviors we have gathered so far:
1. For single-node jobs, stdin always works.
2. For multi-node jobs, stdin is unstable when the number of cores per node is relatively small: with 2/3/4 nodes at 8 cores per node, mpirun works most of the time. With more than 8 cores per node, mpirun works the first time and then always gets stuck. There seems to be a magic number at which it stops working.
3. We tested Quantum ESPRESSO built with intel/13 and hit the same issue.

We used gdb to debug, and found that when mpirun was stuck, the rest of the processes were all waiting on an MPI broadcast from the master rank. The LAMMPS binary, input file, and gdb core files (example.tar.bz2) can be downloaded from this link:
https://drive.google.com/open?id=0B3Yj4QkZpI-dVWZtWmJ3ZXNVRGc

Extra information:
1. The job scheduler is Slurm.
2. configure setup:
  ./configure --prefix=$PREFIX \
    --with-hwloc=internal \
    --enable-mpirun-prefix-by-default \
    --with-slurm \
    --with-verbs \
    --with-psm \
    --disable-openib-connectx-xrc \
    --with-knem=/opt/knem-1.1.2.90mlnx1 \
    --with-cma
3. openmpi-mca-params.conf file:
  orte_hetero_nodes=1
  hwloc_base_binding_policy=core
  rmaps_base_mapping_policy=core
  opal_cuda_support=0
  btl_openib_use_eager_rdma=0
  btl_openib_max_eager_rdma=0
  btl_openib_flags=1

Thanks,
Jingchao

Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400
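
P.S. For anyone who wants to try to reproduce this without LAMMPS: below is a minimal C sketch of the same pattern gdb showed us, i.e. rank 0 reading stdin and broadcasting to the other ranks, which then block in MPI_Bcast. This is our own stand-in, not LAMMPS source; the file name stdin_bcast.c and the 1024-byte line buffer are arbitrary choices.

  /* stdin_bcast.c - minimal stand-in for the rank-0-reads-stdin pattern.
   * Rank 0 reads lines from stdin and broadcasts them; all other ranks
   * block in MPI_Bcast, matching what gdb showed on the stuck runs. */
  #include <mpi.h>
  #include <stdio.h>
  #include <string.h>

  int main(int argc, char **argv)
  {
      char line[1024];
      int rank, len;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      for (;;) {
          if (rank == 0) {
              /* Only rank 0 touches stdin; mpirun must forward it. */
              len = fgets(line, sizeof(line), stdin) ? (int)strlen(line) + 1 : 0;
          }
          /* Everyone learns the length first; len == 0 signals EOF. */
          MPI_Bcast(&len, 1, MPI_INT, 0, MPI_COMM_WORLD);
          if (len == 0)
              break;
          MPI_Bcast(line, len, MPI_CHAR, 0, MPI_COMM_WORLD);
          if (rank == 0)
              printf("read %d bytes\n", len);
      }

      MPI_Finalize();
      return 0;
  }

Build with mpicc stdin_bcast.c -o stdin_bcast and run it both ways (mpirun ./stdin_bcast < in.snr vs. feeding the file by other means). If the redirect form hangs on repeat runs the same way LAMMPS does, that would point at mpirun's stdin forwarding rather than anything application-specific.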