Well, I can try to find time to take a look. However, I will reiterate what Jeff H said: it is very unwise to rely on IO forwarding. It is much better to just read the file directly, unless that file is simply unavailable on the node where rank=0 is running.
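To make "read the file directly" concrete: if the input file is reachable from the node where rank=0 runs (e.g., on a shared filesystem), the application can open it itself and broadcast the contents to the other ranks, so mpirun's stdin forwarding is never involved. Below is only a minimal sketch of that idea; read_and_bcast is a hypothetical helper, not part of LAMMPS or Open MPI.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical helper: rank 0 reads the whole input file itself and
       ships it to every other rank, so no stdin forwarding is involved. */
    static char *read_and_bcast(const char *path, MPI_Comm comm)
    {
        int rank;
        long size = 0;
        char *buf = NULL;

        MPI_Comm_rank(comm, &rank);
        if (rank == 0) {
            FILE *fp = fopen(path, "r");   /* direct read on rank 0's node */
            if (fp == NULL) MPI_Abort(comm, 1);
            fseek(fp, 0, SEEK_END);
            size = ftell(fp);
            rewind(fp);
            buf = malloc(size + 1);
            if (fread(buf, 1, size, fp) != (size_t)size) MPI_Abort(comm, 1);
            buf[size] = '\0';
            fclose(fp);
        }
        /* everyone learns the size, then receives the full script */
        MPI_Bcast(&size, 1, MPI_LONG, 0, comm);
        if (rank != 0) buf = malloc(size + 1);
        MPI_Bcast(buf, (int)(size + 1), MPI_CHAR, 0, comm);
        return buf;   /* caller frees */
    }

This is presumably also why the "mpirun ./lmp_ompi_g++ -in in.snr" form mentioned further down in the thread works reliably: the input file is then opened directly rather than being piped through stdin.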
> On Aug 22, 2016, at 1:55 PM, Jingchao Zhang <zh...@unl.edu> wrote:
> 
> Here you can find the source code for the lammps input reader:
> https://github.com/lammps/lammps/blob/r13864/src/input.cpp
> Based on the gdb output, rank 0 is stuck at line 167,
>     if (fgets(&line[m],maxline-m,infile) == NULL)
> and the rest of the ranks are stuck at line 203,
>     MPI_Bcast(&n,1,MPI_INT,0,world);
> 
> So rank 0 possibly hangs in the fgets() call (a sketch of this
> read-then-broadcast pattern appears at the end of this message).
> 
> Here is the full backtrace information:
> $ cat master.backtrace worker.backtrace
> #0  0x0000003c37cdb68d in read () from /lib64/libc.so.6
> #1  0x0000003c37c71ca8 in _IO_new_file_underflow () from /lib64/libc.so.6
> #2  0x0000003c37c737ae in _IO_default_uflow_internal () from /lib64/libc.so.6
> #3  0x0000003c37c67e8a in _IO_getline_info_internal () from /lib64/libc.so.6
> #4  0x0000003c37c66ce9 in fgets () from /lib64/libc.so.6
> #5  0x00000000005c5a43 in LAMMPS_NS::Input::file() () at ../input.cpp:167
> #6  0x00000000005d4236 in main () at ../main.cpp:31
> #0  0x00002b1635d2ace2 in poll_dispatch () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
> #1  0x00002b1635d1fa71 in opal_libevent2022_event_base_loop () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
> #2  0x00002b1635ce4634 in opal_progress () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
> #3  0x00002b16351b8fad in ompi_request_default_wait () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
> #4  0x00002b16351fcb40 in ompi_coll_base_bcast_intra_generic () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
> #5  0x00002b16351fd0c2 in ompi_coll_base_bcast_intra_binomial () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
> #6  0x00002b1644fa6d9b in ompi_coll_tuned_bcast_intra_dec_fixed () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/openmpi/mca_coll_tuned.so
> #7  0x00002b16351cb4fb in PMPI_Bcast () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
> #8  0x00000000005c5b5d in LAMMPS_NS::Input::file() () at ../input.cpp:203
> #9  0x00000000005d4236 in main () at ../main.cpp:31
> 
> Thanks,
> 
> Dr. Jingchao Zhang
> Holland Computing Center
> University of Nebraska-Lincoln
> 402-472-6400
> 
> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
> Sent: Monday, August 22, 2016 2:17:10 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
> 
> Hmmm...perhaps we can break this out a bit? The stdin will be going to your rank=0 proc. It sounds like you have some subsequent step that calls MPI_Bcast?
> 
> Can you first verify that the input is being correctly delivered to rank=0? This will help us isolate whether the problem is in the IO forwarding or in the subsequent Bcast.
> 
>> On Aug 22, 2016, at 1:11 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>> 
>> Hi all,
>> 
>> We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3. Both builds show odd behavior when trying to read from standard input.
>> 
>> For example, if we start the application lammps across 4 nodes, each node 16 cores, connected by Intel QDR Infiniband, mpirun works fine the 1st time, but always gets stuck within a few seconds thereafter.
>> Command:
>>     mpirun ./lmp_ompi_g++ < in.snr
>> in.snr is the lammps input file; the compiler is gcc/6.1.
>> 
>> Instead, if we use
>>     mpirun ./lmp_ompi_g++ -in in.snr
>> it works 100% of the time.
>> 
>> Some odd behaviors we have gathered so far:
>> 1. For a 1-node job, stdin always works.
>> 2. For multiple nodes, stdin works, though not reliably, when the number of
>> cores per node is relatively small. For example, with 2/3/4 nodes and 8
>> cores per node, mpirun works most of the time. But with more than 8 cores
>> per node, mpirun works the 1st time and then always gets stuck. There seems
>> to be a magic number beyond which it stops working.
>> 3. We tested Quantum Espresso with the intel/13 compiler and had the same issue.
>> 
>> We used gdb to debug and found that when mpirun was stuck, all the other
>> processes were waiting on an MPI broadcast from rank 0. The lammps binary,
>> input file and gdb core files (example.tar.bz2) can be downloaded from this link:
>> https://drive.google.com/open?id=0B3Yj4QkZpI-dVWZtWmJ3ZXNVRGc
>> 
>> Extra information:
>> 1. The job scheduler is slurm.
>> 2. configure setup:
>> ./configure --prefix=$PREFIX \
>>     --with-hwloc=internal \
>>     --enable-mpirun-prefix-by-default \
>>     --with-slurm \
>>     --with-verbs \
>>     --with-psm \
>>     --disable-openib-connectx-xrc \
>>     --with-knem=/opt/knem-1.1.2.90mlnx1 \
>>     --with-cma
>> 3. openmpi-mca-params.conf file:
>> orte_hetero_nodes=1
>> hwloc_base_binding_policy=core
>> rmaps_base_mapping_policy=core
>> opal_cuda_support=0
>> btl_openib_use_eager_rdma=0
>> btl_openib_max_eager_rdma=0
>> btl_openib_flags=1
>> 
>> Thanks,
>> Jingchao
>> 
>> Dr. Jingchao Zhang
>> Holland Computing Center
>> University of Nebraska-Lincoln
>> 402-472-6400
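The two stacks quoted above fit the usual read-on-rank-0-then-broadcast input loop. The following is only an illustrative sketch of that pattern, not the actual LAMMPS code (see input.cpp lines 167 and 203 in the link above for the real thing); the -in argument handling is simplified for illustration.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define MAXLINE 2048

    int main(int argc, char **argv)
    {
        int rank, n;
        char line[MAXLINE];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* With "mpirun ./app < in.snr" rank 0 reads the forwarded stdin;
           with "mpirun ./app -in in.snr" it reads the file directly.     */
        FILE *infile = stdin;
        if (rank == 0 && argc > 2 && strcmp(argv[1], "-in") == 0) {
            infile = fopen(argv[2], "r");
            if (infile == NULL) MPI_Abort(MPI_COMM_WORLD, 1);
        }

        while (1) {
            if (rank == 0) {
                /* rank 0 blocks here if forwarded stdin stalls (cf. input.cpp:167) */
                if (fgets(line, MAXLINE, infile) == NULL) n = 0;
                else n = (int)strlen(line) + 1;
            }
            /* every other rank waits here inside PMPI_Bcast (cf. input.cpp:203) */
            MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
            if (n == 0) break;
            MPI_Bcast(line, n, MPI_CHAR, 0, MPI_COMM_WORLD);
            /* ... parse and execute the command in 'line' ... */
        }

        MPI_Finalize();
        return 0;
    }

If stdin forwarding stalls after the first run, as described above, rank 0 never returns from fgets(), so every other rank stays parked in the first MPI_Bcast, which is exactly what the two backtraces show.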
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users