Hi all,
We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3. Both builds show odd behavior when reading from standard input. For example, if we start LAMMPS across 4 nodes, 16 cores per node, connected by Intel QDR InfiniBand, mpirun works fine the first time but always gets stuck within a few seconds on subsequent runs. The command is

  mpirun ./lmp_ompi_g++ < in.snr

where in.snr is the LAMMPS input file (compiler gcc/6.1). If we instead pass the input file on the command line,

  mpirun ./lmp_ompi_g++ -in in.snr

it works 100% of the time.

Some odd behaviors we have gathered so far:
1. For single-node jobs, stdin always works.
2. For multi-node jobs, stdin is unstable when the number of cores per node is relatively small: with 2/3/4 nodes at 8 cores per node, mpirun works most of the time. With more than 8 cores per node, mpirun works the first time and then always gets stuck. There seems to be a magic number at which it stops working.
3. We tested Quantum ESPRESSO built with intel/13 and hit the same issue.

We used gdb to debug, and found that when mpirun was stuck, the rest of the processes were all waiting on an MPI broadcast from the master rank. The LAMMPS binary, input file, and gdb core files (example.tar.bz2) can be downloaded from this link:
https://drive.google.com/open?id=0B3Yj4QkZpI-dVWZtWmJ3ZXNVRGc

Extra information:
1. The job scheduler is Slurm.
2. configure setup:
  ./configure --prefix=$PREFIX \
    --with-hwloc=internal \
    --enable-mpirun-prefix-by-default \
    --with-slurm \
    --with-verbs \
    --with-psm \
    --disable-openib-connectx-xrc \
    --with-knem=/opt/knem-1.1.2.90mlnx1 \
    --with-cma
3. openmpi-mca-params.conf file:
  orte_hetero_nodes=1
  hwloc_base_binding_policy=core
  rmaps_base_mapping_policy=core
  opal_cuda_support=0
  btl_openib_use_eager_rdma=0
  btl_openib_max_eager_rdma=0
  btl_openib_flags=1

Thanks,
Jingchao

Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400
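
P.S. For anyone who wants to try to reproduce this without LAMMPS: below is a minimal C sketch of the same pattern gdb showed us, i.e. rank 0 reading stdin and broadcasting to the other ranks, which then block in MPI_Bcast. This is our own stand-in, not LAMMPS source; the file name stdin_bcast.c and the 1024-byte line buffer are arbitrary choices.

  /* stdin_bcast.c - minimal stand-in for the rank-0-reads-stdin pattern.
   * Rank 0 reads lines from stdin and broadcasts them; all other ranks
   * block in MPI_Bcast, matching what gdb showed on the stuck runs. */
  #include <mpi.h>
  #include <stdio.h>
  #include <string.h>

  int main(int argc, char **argv)
  {
      char line[1024];
      int rank, len;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      for (;;) {
          if (rank == 0) {
              /* Only rank 0 touches stdin; mpirun must forward it. */
              len = fgets(line, sizeof(line), stdin) ? (int)strlen(line) + 1 : 0;
          }
          /* Everyone learns the length first; len == 0 signals EOF. */
          MPI_Bcast(&len, 1, MPI_INT, 0, MPI_COMM_WORLD);
          if (len == 0)
              break;
          MPI_Bcast(line, len, MPI_CHAR, 0, MPI_COMM_WORLD);
          if (rank == 0)
              printf("read %d bytes\n", len);
      }

      MPI_Finalize();
      return 0;
  }

Build with mpicc stdin_bcast.c -o stdin_bcast and run it both ways (mpirun ./stdin_bcast < in.snr vs. feeding the file by other means). If the redirect form hangs on repeat runs the same way LAMMPS does, that would point at mpirun's stdin forwarding rather than anything application-specific.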