Well, I can try to find time to take a look. However, I will reiterate what Jeff H said: it is very unwise to rely on IO forwarding. It is much better to just read the file directly, unless that file is simply unavailable on the node where rank=0 is running.
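To make "read the file directly" concrete: if the input file is reachable from the node where rank=0 runs (e.g., on a shared filesystem), the application can open it itself and broadcast the contents to the other ranks, so mpirun's stdin forwarding is never involved. Below is only a minimal sketch of that idea; read_and_bcast is a hypothetical helper, not part of LAMMPS or Open MPI.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical helper: rank 0 reads the whole input file itself and
       ships it to every other rank, so no stdin forwarding is involved. */
    static char *read_and_bcast(const char *path, MPI_Comm comm)
    {
        int rank;
        long size = 0;
        char *buf = NULL;

        MPI_Comm_rank(comm, &rank);
        if (rank == 0) {
            FILE *fp = fopen(path, "r");   /* direct read on rank 0's node */
            if (fp == NULL) MPI_Abort(comm, 1);
            fseek(fp, 0, SEEK_END);
            size = ftell(fp);
            rewind(fp);
            buf = malloc(size + 1);
            if (fread(buf, 1, size, fp) != (size_t)size) MPI_Abort(comm, 1);
            buf[size] = '\0';
            fclose(fp);
        }
        /* everyone learns the size, then receives the full script */
        MPI_Bcast(&size, 1, MPI_LONG, 0, comm);
        if (rank != 0) buf = malloc(size + 1);
        MPI_Bcast(buf, (int)(size + 1), MPI_CHAR, 0, comm);
        return buf;   /* caller frees */
    }

This is presumably also why the "mpirun ./lmp_ompi_g++ -in in.snr" form mentioned further down in the thread works reliably: the input file is then opened directly rather than being piped through stdin.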
> On Aug 22, 2016, at 1:55 PM, Jingchao Zhang <zh...@unl.edu> wrote:
> 
> Here you can find the source code for the lammps input reader:
> https://github.com/lammps/lammps/blob/r13864/src/input.cpp
> Based on the gdb output, rank 0 is stuck at line 167,
>     if (fgets(&line[m],maxline-m,infile) == NULL)
> and the rest of the ranks are stuck at line 203,
>     MPI_Bcast(&n,1,MPI_INT,0,world);
> 
> So rank 0 possibly hangs in the fgets() call (a sketch of this
> read-then-broadcast pattern appears at the end of this message).
> 
> Here is the full backtrace information:
> $ cat master.backtrace worker.backtrace
> #0  0x0000003c37cdb68d in read () from /lib64/libc.so.6
> #1  0x0000003c37c71ca8 in _IO_new_file_underflow () from /lib64/libc.so.6
> #2  0x0000003c37c737ae in _IO_default_uflow_internal () from /lib64/libc.so.6
> #3  0x0000003c37c67e8a in _IO_getline_info_internal () from /lib64/libc.so.6
> #4  0x0000003c37c66ce9 in fgets () from /lib64/libc.so.6
> #5  0x00000000005c5a43 in LAMMPS_NS::Input::file() () at ../input.cpp:167
> #6  0x00000000005d4236 in main () at ../main.cpp:31
> #0  0x00002b1635d2ace2 in poll_dispatch () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
> #1  0x00002b1635d1fa71 in opal_libevent2022_event_base_loop () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
> #2  0x00002b1635ce4634 in opal_progress () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
> #3  0x00002b16351b8fad in ompi_request_default_wait () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
> #4  0x00002b16351fcb40 in ompi_coll_base_bcast_intra_generic () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
> #5  0x00002b16351fd0c2 in ompi_coll_base_bcast_intra_binomial () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
> #6  0x00002b1644fa6d9b in ompi_coll_tuned_bcast_intra_dec_fixed () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/openmpi/mca_coll_tuned.so
> #7  0x00002b16351cb4fb in PMPI_Bcast () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
> #8  0x00000000005c5b5d in LAMMPS_NS::Input::file() () at ../input.cpp:203
> #9  0x00000000005d4236 in main () at ../main.cpp:31
> 
> Thanks,
> 
> Dr. Jingchao Zhang
> Holland Computing Center
> University of Nebraska-Lincoln
> 402-472-6400
> 
> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
> Sent: Monday, August 22, 2016 2:17:10 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
> 
> Hmmm...perhaps we can break this out a bit? The stdin will be going to your rank=0 proc. It sounds like you have some subsequent step that calls MPI_Bcast?
> 
> Can you first verify that the input is being correctly delivered to rank=0? This will help us isolate whether the problem is in the IO forwarding or in the subsequent Bcast.
> 
>> On Aug 22, 2016, at 1:11 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>> 
>> Hi all,
>> 
>> We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3. Both builds show odd behavior when trying to read from standard input.
>> 
>> For example, if we start the application lammps across 4 nodes, each node 16 cores, connected by Intel QDR Infiniband, mpirun works fine the 1st time, but always gets stuck within a few seconds thereafter.
>> Command:
>>     mpirun ./lmp_ompi_g++ < in.snr
>> in.snr is the lammps input file; the compiler is gcc/6.1.
>> 
>> Instead, if we use
>>     mpirun ./lmp_ompi_g++ -in in.snr
>> it works 100% of the time.
>> 
>> Some odd behaviors we have gathered so far:
>> 1. For a 1-node job, stdin always works.
>> 2. For multiple nodes, stdin works, though not reliably, when the number of
>> cores per node is relatively small. For example, with 2/3/4 nodes and 8
>> cores per node, mpirun works most of the time. But with more than 8 cores
>> per node, mpirun works the 1st time and then always gets stuck. There seems
>> to be a magic number beyond which it stops working.
>> 3. We tested Quantum Espresso with the intel/13 compiler and had the same issue.
>> 
>> We used gdb to debug and found that when mpirun was stuck, all the other
>> processes were waiting on an MPI broadcast from rank 0. The lammps binary,
>> input file and gdb core files (example.tar.bz2) can be downloaded from this link:
>> https://drive.google.com/open?id=0B3Yj4QkZpI-dVWZtWmJ3ZXNVRGc
>> 
>> Extra information:
>> 1. The job scheduler is slurm.
>> 2. configure setup:
>> ./configure --prefix=$PREFIX \
>>     --with-hwloc=internal \
>>     --enable-mpirun-prefix-by-default \
>>     --with-slurm \
>>     --with-verbs \
>>     --with-psm \
>>     --disable-openib-connectx-xrc \
>>     --with-knem=/opt/knem-1.1.2.90mlnx1 \
>>     --with-cma
>> 3. openmpi-mca-params.conf file:
>> orte_hetero_nodes=1
>> hwloc_base_binding_policy=core
>> rmaps_base_mapping_policy=core
>> opal_cuda_support=0
>> btl_openib_use_eager_rdma=0
>> btl_openib_max_eager_rdma=0
>> btl_openib_flags=1
>> 
>> Thanks,
>> Jingchao
>> 
>> Dr. Jingchao Zhang
>> Holland Computing Center
>> University of Nebraska-Lincoln
>> 402-472-6400
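The two stacks quoted above fit the usual read-on-rank-0-then-broadcast input loop. The following is only an illustrative sketch of that pattern, not the actual LAMMPS code (see input.cpp lines 167 and 203 in the link above for the real thing); the -in argument handling is simplified for illustration.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define MAXLINE 2048

    int main(int argc, char **argv)
    {
        int rank, n;
        char line[MAXLINE];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* With "mpirun ./app < in.snr" rank 0 reads the forwarded stdin;
           with "mpirun ./app -in in.snr" it reads the file directly.     */
        FILE *infile = stdin;
        if (rank == 0 && argc > 2 && strcmp(argv[1], "-in") == 0) {
            infile = fopen(argv[2], "r");
            if (infile == NULL) MPI_Abort(MPI_COMM_WORLD, 1);
        }

        while (1) {
            if (rank == 0) {
                /* rank 0 blocks here if forwarded stdin stalls (cf. input.cpp:167) */
                if (fgets(line, MAXLINE, infile) == NULL) n = 0;
                else n = (int)strlen(line) + 1;
            }
            /* every other rank waits here inside PMPI_Bcast (cf. input.cpp:203) */
            MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
            if (n == 0) break;
            MPI_Bcast(line, n, MPI_CHAR, 0, MPI_COMM_WORLD);
            /* ... parse and execute the command in 'line' ... */
        }

        MPI_Finalize();
        return 0;
    }

If stdin forwarding stalls after the first run, as described above, rank 0 never returns from fgets(), so every other rank stays parked in the first MPI_Bcast, which is exactly what the two backtraces show.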
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users