Oh my - that indeed illustrated the problem! It is a race condition on the backend orted. I'll try to fix it; I'll probably have to send you a patch to test.
> On Aug 30, 2016, at 1:04 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>
> $mpirun -mca state_base_verbose 5 ./a.out < test.in
>
> Please see attached for the outputs.
>
> Thank you Ralph. I am willing to provide whatever information you need.
>
> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
> Sent: Tuesday, August 30, 2016 1:45:45 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>
> Well, that helped a bit. For some reason, your system is skipping a step in
> the launch state machine, and so we never hit the step where we setup the IO
> forwarding system.
>
> Sorry to keep poking, but I haven't seen this behavior anywhere else, and so
> I have no way to replicate it. Must be a subtle race condition.
>
> Can you replace "plm" with "state" and try to hit a "bad" run again?
>
>> On Aug 30, 2016, at 12:30 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>
>> Yes, all procs were launched properly. I added "-mca plm_base_verbose 5" to
>> the mpirun command. Please see attached for the results.
>>
>> $mpirun -mca plm_base_verbose 5 ./a.out < test.in
>>
>> I mentioned in my initial post that the test job can run properly for the
>> 1st time. But if I kill the job and resubmit, then it hangs. It happened
>> with the job above as well. Very odd.
>>
>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>> Sent: Tuesday, August 30, 2016 12:56:33 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>
>> Hmmm...well, the problem appears to be that we aren't setting up the input
>> channel to read stdin. This happens immediately after the application is
>> launched - there is no "if" clause or anything else in front of it. The only
>> way it wouldn't get called is if all the procs weren't launched, but that
>> appears to be happening, yes?
>>
>> Hence my confusion - there is no test in front of that print statement now,
>> and yet we aren't seeing the code being called.
>>
>> Could you please add "-mca plm_base_verbose 5" to your cmd line? We should
>> see a debug statement print that contains "plm:base:launch wiring up iof for
>> job"
>>
>>> On Aug 30, 2016, at 11:40 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>
>>> I checked again and as far as I can tell, everything was setup correctly. I
>>> added "HCC debug" to the output message to make sure it's the correct
>>> plugin.
>>>
>>> The updated outputs:
>>> $ mpirun ./a.out < test.in
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 35 for process [[26513,1],0]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 41 for process [[26513,1],0]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 43 for process [[26513,1],0]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 37 for process [[26513,1],1]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 46 for process [[26513,1],1]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 49 for process [[26513,1],1]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 38 for process [[26513,1],2]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 50 for process [[26513,1],2]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 52 for process [[26513,1],2]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 42 for process [[26513,1],3]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 53 for process [[26513,1],3]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 55 for process [[26513,1],3]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 45 for process [[26513,1],4]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 56 for process [[26513,1],4]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 58 for process [[26513,1],4]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 47 for process [[26513,1],5]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 59 for process [[26513,1],5]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 61 for process [[26513,1],5]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 51 for process [[26513,1],6]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 62 for process [[26513,1],6]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 64 for process [[26513,1],6]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 57 for process [[26513,1],7]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 66 for process [[26513,1],7]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 68 for process [[26513,1],7]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 63 for process [[26513,1],8]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 70 for process [[26513,1],8]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 72 for process [[26513,1],8]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 67 for process [[26513,1],9]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 74 for process [[26513,1],9]
>>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 76 for process [[26513,1],9]
>>> Rank 1 has cleared MPI_Init
>>> Rank 3 has cleared MPI_Init
>>> Rank 4 has cleared MPI_Init
>>> Rank 5 has cleared MPI_Init
>>> Rank 6 has cleared MPI_Init
>>> Rank 7 has cleared MPI_Init
>>> Rank 0 has cleared MPI_Init
>>> Rank 2 has cleared MPI_Init
>>> Rank 8 has cleared MPI_Init
>>> Rank 9 has cleared MPI_Init
>>> Rank 10 has cleared MPI_Init
>>> Rank 11 has cleared MPI_Init
>>> Rank 12 has cleared MPI_Init
>>> Rank 13 has cleared MPI_Init
>>> Rank 16 has cleared MPI_Init
>>> Rank 17 has cleared MPI_Init
>>> Rank 18 has cleared MPI_Init
>>> Rank 14 has cleared MPI_Init
>>> Rank 15 has cleared MPI_Init
>>> Rank 19 has cleared MPI_Init
>>>
>>> The part of code I changed in file ./orte/mca/iof/hnp/iof_hnp.c
>>>
>>>     opal_output(0,
>>>                 "HCC debug: %s iof:hnp pushing fd %d for process %s",
>>>                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>>                 fd, ORTE_NAME_PRINT(dst_name));
>>>
>>>     /* don't do this if the dst vpid is invalid or the fd is negative! */
>>>     if (ORTE_VPID_INVALID == dst_name->vpid || fd < 0) {
>>>         return ORTE_SUCCESS;
>>>     }
>>>
>>>     /* OPAL_OUTPUT_VERBOSE((1, orte_iof_base_framework.framework_output,
>>>                             "%s iof:hnp pushing fd %d for process %s",
>>>                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>>                             fd, ORTE_NAME_PRINT(dst_name)));
>>>     */
>>>
>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>> Sent: Monday, August 29, 2016 11:42:00 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>
>>> I'm sorry, but something is simply very wrong here. Are you sure you are
>>> pointed at the correct LD_LIBRARY_PATH? Perhaps add a "BOO" or something at
>>> the front of the output message to ensure we are using the correct plugin?
>>>
>>> This looks to me like you must be picking up a stale library somewhere.
>>>
>>>> On Aug 29, 2016, at 10:29 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>
>>>> Hi Ralph,
>>>>
>>>> I used the tarball from Aug 26 and added the patch. Tested with 2 nodes,
>>>> 10 cores/node.
>>>> Please see the results below:
>>>>
>>>> $ mpirun ./a.out < test.in
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 35 for process [[43954,1],0]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 41 for process [[43954,1],0]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 43 for process [[43954,1],0]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 37 for process [[43954,1],1]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 46 for process [[43954,1],1]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 49 for process [[43954,1],1]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 38 for process [[43954,1],2]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 50 for process [[43954,1],2]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 52 for process [[43954,1],2]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 42 for process [[43954,1],3]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 53 for process [[43954,1],3]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 55 for process [[43954,1],3]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 45 for process [[43954,1],4]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 56 for process [[43954,1],4]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 58 for process [[43954,1],4]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 47 for process [[43954,1],5]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 59 for process [[43954,1],5]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 61 for process [[43954,1],5]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 57 for process [[43954,1],6]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 64 for process [[43954,1],6]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 66 for process [[43954,1],6]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 62 for process [[43954,1],7]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 68 for process [[43954,1],7]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 70 for process [[43954,1],7]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 65 for process [[43954,1],8]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 72 for process [[43954,1],8]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 74 for process [[43954,1],8]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 75 for process [[43954,1],9]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 79 for process [[43954,1],9]
>>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 81 for process [[43954,1],9]
>>>> Rank 5 has cleared MPI_Init
>>>> Rank 9 has cleared MPI_Init
>>>> Rank 1 has cleared MPI_Init
>>>> Rank 2 has cleared MPI_Init
>>>> Rank 3 has cleared MPI_Init
>>>> Rank 4 has cleared MPI_Init
>>>> Rank 8 has cleared MPI_Init
>>>> Rank 0 has cleared MPI_Init
>>>> Rank 6 has cleared MPI_Init
>>>> Rank 7 has cleared MPI_Init
>>>> Rank 14 has cleared MPI_Init
>>>> Rank 15 has cleared MPI_Init
>>>> Rank 16 has cleared MPI_Init
>>>> Rank 18 has cleared MPI_Init
>>>> Rank 10 has cleared MPI_Init
>>>> Rank 11 has cleared MPI_Init
>>>> Rank 12 has cleared MPI_Init
>>>> Rank 13 has cleared MPI_Init
>>>> Rank 17 has cleared MPI_Init
>>>> Rank 19 has cleared MPI_Init
>>>>
>>>> Thanks,
>>>>
>>>> Dr. Jingchao Zhang
>>>> Holland Computing Center
>>>> University of Nebraska-Lincoln
>>>> 402-472-6400
>>>>
>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>> Sent: Saturday, August 27, 2016 12:31:53 PM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>
>>>> I am finding this impossible to replicate, so something odd must be going
>>>> on. Can you please (a) pull down the latest v2.0.1 nightly tarball, and
>>>> (b) add this patch to it?
>>>>
>>>> diff --git a/orte/mca/iof/hnp/iof_hnp.c b/orte/mca/iof/hnp/iof_hnp.c
>>>> old mode 100644
>>>> new mode 100755
>>>> index 512fcdb..362ff46
>>>> --- a/orte/mca/iof/hnp/iof_hnp.c
>>>> +++ b/orte/mca/iof/hnp/iof_hnp.c
>>>> @@ -143,16 +143,17 @@ static int hnp_push(const orte_process_name_t* dst_name, orte_iof_tag_t src_tag,
>>>>      int np, numdigs;
>>>>      orte_ns_cmp_bitmask_t mask;
>>>>  
>>>> +    opal_output(0,
>>>> +                "%s iof:hnp pushing fd %d for process %s",
>>>> +                ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>>> +                fd, ORTE_NAME_PRINT(dst_name));
>>>> +
>>>>      /* don't do this if the dst vpid is invalid or the fd is negative! */
>>>>      if (ORTE_VPID_INVALID == dst_name->vpid || fd < 0) {
>>>>          return ORTE_SUCCESS;
>>>>      }
>>>>  
>>>> -    OPAL_OUTPUT_VERBOSE((1, orte_iof_base_framework.framework_output,
>>>> -                         "%s iof:hnp pushing fd %d for process %s",
>>>> -                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>>> -                         fd, ORTE_NAME_PRINT(dst_name)));
>>>> -
>>>>      if (!(src_tag & ORTE_IOF_STDIN)) {
>>>>          /* set the file descriptor to non-blocking - do this before we setup
>>>>           * and activate the read event in case it fires right away
>>>>
>>>> You can then run the test again without the "--mca iof_base_verbose 100"
>>>> flag to reduce the chatter - this print statement will tell me what I need
>>>> to know.
>>>>
>>>> Thanks!
>>>> Ralph
>>>>
>>>>> On Aug 25, 2016, at 8:19 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>
>>>>> The IOF fix PR for v2.0.1 was literally just merged a few minutes ago; it
>>>>> wasn't in last night's tarball.
>>>>>
>>>>>> On Aug 25, 2016, at 10:59 AM, r...@open-mpi.org wrote:
>>>>>>
>>>>>> ??? Weird - can you send me an updated output of that last test we ran?
>>>>>>
>>>>>>> On Aug 25, 2016, at 7:51 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>
>>>>>>> Hi Ralph,
>>>>>>>
>>>>>>> I saw the pull request and did a test with v2.0.1rc1, but the problem
>>>>>>> persists. Any ideas?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Dr. Jingchao Zhang
>>>>>>> Holland Computing Center
>>>>>>> University of Nebraska-Lincoln
>>>>>>> 402-472-6400
>>>>>>>
>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>> Sent: Wednesday, August 24, 2016 1:27:28 PM
>>>>>>> To: Open MPI Users
>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>
>>>>>>> Bingo - found it, fix submitted and hope to get it into 2.0.1
>>>>>>>
>>>>>>> Thanks for the assist!
>>>>>>> Ralph
>>>>>>>
>>>>>>>> On Aug 24, 2016, at 12:15 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>
>>>>>>>> I configured v2.0.1rc1 with --enable-debug and ran the test with --mca
>>>>>>>> iof_base_verbose 100. I also added -display-devel-map in case it
>>>>>>>> provides some useful information.
>>>>>>>>
>>>>>>>> Test job has 2 nodes, each node 10 cores. Rank 0 and mpirun command on
>>>>>>>> the same node.
>>>>>>>> $ mpirun -display-devel-map --mca iof_base_verbose 100 ./a.out < test.in &> debug_info.txt
>>>>>>>>
>>>>>>>> The debug_info.txt is attached.
>>>>>>>>
>>>>>>>> Dr. Jingchao Zhang
>>>>>>>> Holland Computing Center
>>>>>>>> University of Nebraska-Lincoln
>>>>>>>> 402-472-6400
>>>>>>>>
>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>> Sent: Wednesday, August 24, 2016 12:14:26 PM
>>>>>>>> To: Open MPI Users
>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>
>>>>>>>> Afraid I can't replicate a problem at all, whether rank=0 is local or
>>>>>>>> not. I'm also using bash, but on CentOS-7, so I suspect the OS is the
>>>>>>>> difference.
>>>>>>>>
>>>>>>>> Can you configure OMPI with --enable-debug, and then run the test
>>>>>>>> again with --mca iof_base_verbose 100? It will hopefully tell us
>>>>>>>> something about why the IO subsystem is stuck.
>>>>>>>>
>>>>>>>>> On Aug 24, 2016, at 8:46 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>
>>>>>>>>> Hi Ralph,
>>>>>>>>>
>>>>>>>>> For our tests, rank 0 is always on the same node with mpirun. I just
>>>>>>>>> tested mpirun with -nolocal and it still hangs.
>>>>>>>>>
>>>>>>>>> Information on shell and OS
>>>>>>>>> $ echo $0
>>>>>>>>> -bash
>>>>>>>>>
>>>>>>>>> $ lsb_release -a
>>>>>>>>> LSB Version:    :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
>>>>>>>>> Distributor ID: Scientific
>>>>>>>>> Description:    Scientific Linux release 6.8 (Carbon)
>>>>>>>>> Release:        6.8
>>>>>>>>> Codename:       Carbon
>>>>>>>>>
>>>>>>>>> $ uname -a
>>>>>>>>> Linux login.crane.hcc.unl.edu 2.6.32-642.3.1.el6.x86_64 #1 SMP Tue Jul 12 11:25:51 CDT 2016 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>>>
>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>> Holland Computing Center
>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>> 402-472-6400
>>>>>>>>>
>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>> Sent: Tuesday, August 23, 2016 8:14:48 PM
>>>>>>>>> To: Open MPI Users
>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>
>>>>>>>>> Hmmm...that's a good point. Rank 0 and mpirun are always on the same
>>>>>>>>> node on my cluster. I'll give it a try.
>>>>>>>>>
>>>>>>>>> Jingchao: is rank 0 on the node with mpirun, or on a remote node?
>>>>>>>>>
>>>>>>>>>> On Aug 23, 2016, at 5:58 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>>>>>>>
>>>>>>>>>> Ralph,
>>>>>>>>>>
>>>>>>>>>> did you run task 0 and mpirun on different nodes ?
>>>>>>>>>>
>>>>>>>>>> i observed some random hangs, though i cannot blame openmpi 100% yet
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Gilles
>>>>>>>>>>
>>>>>>>>>> On 8/24/2016 9:41 AM, r...@open-mpi.org wrote:
>>>>>>>>>>> Very strange. I cannot reproduce it as I'm able to run any number
>>>>>>>>>>> of nodes and procs, pushing over 100Mbytes thru without any problem.
>>>>>>>>>>>
>>>>>>>>>>> Which leads me to suspect that the issue here is with the tty
>>>>>>>>>>> interface. Can you tell me what shell and OS you are running?
>>>>>>>>>>>
>>>>>>>>>>>> On Aug 23, 2016, at 3:25 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Everything stuck at MPI_Init. For a test job with 2 nodes and 10
>>>>>>>>>>>> cores each node, I got the following
>>>>>>>>>>>>
>>>>>>>>>>>> $ mpirun ./a.out < test.in
>>>>>>>>>>>> Rank 2 has cleared MPI_Init
>>>>>>>>>>>> Rank 4 has cleared MPI_Init
>>>>>>>>>>>> Rank 7 has cleared MPI_Init
>>>>>>>>>>>> Rank 8 has cleared MPI_Init
>>>>>>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>>>>>>> Rank 5 has cleared MPI_Init
>>>>>>>>>>>> Rank 6 has cleared MPI_Init
>>>>>>>>>>>> Rank 9 has cleared MPI_Init
>>>>>>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>>>>>>> Rank 16 has cleared MPI_Init
>>>>>>>>>>>> Rank 19 has cleared MPI_Init
>>>>>>>>>>>> Rank 10 has cleared MPI_Init
>>>>>>>>>>>> Rank 11 has cleared MPI_Init
>>>>>>>>>>>> Rank 12 has cleared MPI_Init
>>>>>>>>>>>> Rank 13 has cleared MPI_Init
>>>>>>>>>>>> Rank 14 has cleared MPI_Init
>>>>>>>>>>>> Rank 15 has cleared MPI_Init
>>>>>>>>>>>> Rank 17 has cleared MPI_Init
>>>>>>>>>>>> Rank 18 has cleared MPI_Init
>>>>>>>>>>>> Rank 3 has cleared MPI_Init
>>>>>>>>>>>>
>>>>>>>>>>>> then it just hanged.
>>>>>>>>>>>>
>>>>>>>>>>>> --Jingchao
>>>>>>>>>>>>
>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>> 402-472-6400
>>>>>>>>>>>>
>>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>>> Sent: Tuesday, August 23, 2016 4:03:07 PM
>>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>>>
>>>>>>>>>>>> The IO forwarding messages all flow over the Ethernet, so the type
>>>>>>>>>>>> of fabric is irrelevant. The number of procs involved would
>>>>>>>>>>>> definitely have an impact, but that might not be due to the IO
>>>>>>>>>>>> forwarding subsystem. We know we have flow control issues with
>>>>>>>>>>>> collectives like Bcast that don't have built-in synchronization
>>>>>>>>>>>> points. How many reads were you able to do before it hung?
>>>>>>>>>>>>
>>>>>>>>>>>> I was running it on my little test setup (2 nodes, using only a
>>>>>>>>>>>> few procs), but I'll try scaling up and see what happens. I'll
>>>>>>>>>>>> also try introducing some forced "syncs" on the Bcast and see if
>>>>>>>>>>>> that solves the issue.
>>>>>>>>>>>>
>>>>>>>>>>>> Ralph
>>>>>>>>>>>>
>>>>>>>>>>>>> On Aug 23, 2016, at 2:30 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tested v2.0.1rc1 with your code but has the same issue. I also
>>>>>>>>>>>>> installed v2.0.1rc1 on a different cluster which has Mellanox QDR
>>>>>>>>>>>>> Infiniband and get the same result. For the tests you have done,
>>>>>>>>>>>>> how many cores and nodes did you use? I can trigger the problem
>>>>>>>>>>>>> by using multiple nodes and each node with more than 10 cores.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for looking into this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>>> 402-472-6400
>>>>>>>>>>>>>
>>>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>>>> Sent: Monday, August 22, 2016 10:23:42 PM
>>>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>>>>
>>>>>>>>>>>>> FWIW: I just tested forwarding up to 100MBytes via stdin using
>>>>>>>>>>>>> the simple test shown below with OMPI v2.0.1rc1, and it worked
>>>>>>>>>>>>> fine. So I'd suggest upgrading when the official release comes
>>>>>>>>>>>>> out, or going ahead and at least testing 2.0.1rc1 on your
>>>>>>>>>>>>> machine. Or you can test this program with some input file and
>>>>>>>>>>>>> let me know if it works for you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>
>>>>>>>>>>>>> #include <stdlib.h>
>>>>>>>>>>>>> #include <stdio.h>
>>>>>>>>>>>>> #include <string.h>
>>>>>>>>>>>>> #include <stdbool.h>
>>>>>>>>>>>>> #include <unistd.h>
>>>>>>>>>>>>> #include <mpi.h>
>>>>>>>>>>>>>
>>>>>>>>>>>>> #define ORTE_IOF_BASE_MSG_MAX 2048
>>>>>>>>>>>>>
>>>>>>>>>>>>> int main(int argc, char *argv[])
>>>>>>>>>>>>> {
>>>>>>>>>>>>>     int i, rank, size, next, prev, tag = 201;
>>>>>>>>>>>>>     int pos, msgsize, nbytes;
>>>>>>>>>>>>>     bool done;
>>>>>>>>>>>>>     char *msg;
>>>>>>>>>>>>>
>>>>>>>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>>>>>>>>
>>>>>>>>>>>>>     fprintf(stderr, "Rank %d has cleared MPI_Init\n", rank);
>>>>>>>>>>>>>
>>>>>>>>>>>>>     next = (rank + 1) % size;
>>>>>>>>>>>>>     prev = (rank + size - 1) % size;
>>>>>>>>>>>>>     msg = malloc(ORTE_IOF_BASE_MSG_MAX);
>>>>>>>>>>>>>     pos = 0;
>>>>>>>>>>>>>     nbytes = 0;
>>>>>>>>>>>>>
>>>>>>>>>>>>>     if (0 == rank) {
>>>>>>>>>>>>>         while (0 != (msgsize = read(0, msg, ORTE_IOF_BASE_MSG_MAX))) {
>>>>>>>>>>>>>             fprintf(stderr, "Rank %d: sending blob %d\n", rank, pos);
>>>>>>>>>>>>>             if (msgsize > 0) {
>>>>>>>>>>>>>                 MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>>>>>>>>>>>             }
>>>>>>>>>>>>>             ++pos;
>>>>>>>>>>>>>             nbytes += msgsize;
>>>>>>>>>>>>>         }
>>>>>>>>>>>>>         fprintf(stderr, "Rank %d: sending termination blob %d\n", rank, pos);
>>>>>>>>>>>>>         memset(msg, 0, ORTE_IOF_BASE_MSG_MAX);
>>>>>>>>>>>>>         MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>>>>>>>>>>>         MPI_Barrier(MPI_COMM_WORLD);
>>>>>>>>>>>>>     } else {
>>>>>>>>>>>>>         while (1) {
>>>>>>>>>>>>>             MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>>>>>>>>>>>             fprintf(stderr, "Rank %d: recvd blob %d\n", rank, pos);
>>>>>>>>>>>>>             ++pos;
>>>>>>>>>>>>>             done = true;
>>>>>>>>>>>>>             for (i=0; i < ORTE_IOF_BASE_MSG_MAX; i++) {
>>>>>>>>>>>>>                 if (0 != msg[i]) {
>>>>>>>>>>>>>                     done = false;
>>>>>>>>>>>>>                     break;
>>>>>>>>>>>>>                 }
>>>>>>>>>>>>>             }
>>>>>>>>>>>>>             if (done) {
>>>>>>>>>>>>>                 break;
>>>>>>>>>>>>>             }
>>>>>>>>>>>>>         }
>>>>>>>>>>>>>         fprintf(stderr, "Rank %d: recv done\n", rank);
>>>>>>>>>>>>>         MPI_Barrier(MPI_COMM_WORLD);
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>
>>>>>>>>>>>>>     fprintf(stderr, "Rank %d has completed bcast\n", rank);
>>>>>>>>>>>>>     MPI_Finalize();
>>>>>>>>>>>>>     return 0;
>>>>>>>>>>>>> }
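Earlier in the thread Ralph says he will "try introducing some forced 'syncs' on the Bcast". Purely as an illustration of what such a sync could look like - this is not a patch from the thread, and the chunk size and barrier interval below are arbitrary choices for the sketch - here is a pared-down variant of the test program just quoted that adds an MPI_Barrier after every SYNC_INTERVAL broadcast chunks:

    /* Illustrative only: stdin forwarding via Bcast with a periodic
     * "forced sync".  CHUNK matches the 2048-byte blobs used above;
     * SYNC_INTERVAL is an arbitrary value chosen for the sketch. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <mpi.h>

    #define CHUNK 2048
    #define SYNC_INTERVAL 32

    int main(int argc, char *argv[])
    {
        int rank, pos = 0;
        char buf[CHUNK];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        while (1) {
            if (0 == rank) {
                memset(buf, 1, CHUNK);            /* avoid stale zeros on short reads */
                ssize_t n = read(0, buf, CHUNK);  /* rank 0 drains the forwarded stdin */
                if (n <= 0) {
                    memset(buf, 0, CHUNK);        /* all-zero chunk = terminator, as above */
                }
            }
            MPI_Bcast(buf, CHUNK, MPI_BYTE, 0, MPI_COMM_WORLD);

            /* every rank checks for the all-zero terminator, as in the test above */
            bool done = true;
            for (int i = 0; i < CHUNK; i++) {
                if (buf[i] != 0) { done = false; break; }
            }

            /* the "forced sync": a barrier every SYNC_INTERVAL chunks (and at the
             * end) so rank 0 cannot run arbitrarily far ahead of the receivers */
            if (done || (++pos % SYNC_INTERVAL) == 0) {
                MPI_Barrier(MPI_COMM_WORLD);
            }
            if (done) {
                break;
            }
        }

        if (0 == rank) {
            fprintf(stderr, "forwarded %d chunk(s) from stdin\n", pos);
        }
        MPI_Finalize();
        return 0;
    }

The barrier does not change what gets forwarded; it only keeps the root from queueing an unbounded number of broadcasts ahead of slower receivers.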
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Aug 22, 2016, at 3:40 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This might be a thin argument but we have many users running
>>>>>>>>>>>>>> mpirun in this way for years with no problem until this recent
>>>>>>>>>>>>>> upgrade. And some home-brewed mpi codes do not even have a
>>>>>>>>>>>>>> standard way to read the input files. Last time I checked, the
>>>>>>>>>>>>>> openmpi manual still claims it supports stdin
>>>>>>>>>>>>>> (https://www.open-mpi.org/doc/v2.0/man1/mpirun.1.php#sect14).
>>>>>>>>>>>>>> Maybe I missed it but the v2.0 release notes did not mention any
>>>>>>>>>>>>>> changes to the behaviors of stdin as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We can tell our users to run mpirun in the suggested way, but I
>>>>>>>>>>>>>> do hope someone can look into the issue and fix it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>>>> 402-472-6400
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>>>>> Sent: Monday, August 22, 2016 3:04:50 PM
>>>>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Well, I can try to find time to take a look. However, I will
>>>>>>>>>>>>>> reiterate what Jeff H said - it is very unwise to rely on IO
>>>>>>>>>>>>>> forwarding. Much better to just directly read the file unless
>>>>>>>>>>>>>> that file is simply unavailable on the node where rank=0 is
>>>>>>>>>>>>>> running.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Aug 22, 2016, at 1:55 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here you can find the source code for lammps input
>>>>>>>>>>>>>>> https://github.com/lammps/lammps/blob/r13864/src/input.cpp
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Based on the gdb output, rank 0 stuck at line 167
>>>>>>>>>>>>>>>     if (fgets(&line[m],maxline-m,infile) == NULL)
>>>>>>>>>>>>>>> and the rest threads stuck at line 203
>>>>>>>>>>>>>>>     MPI_Bcast(&n,1,MPI_INT,0,world);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So rank 0 possibly hangs on the fgets() function.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here are the whole backtrace information:
>>>>>>>>>>>>>>> $ cat master.backtrace worker.backtrace
>>>>>>>>>>>>>>> #0  0x0000003c37cdb68d in read () from /lib64/libc.so.6
>>>>>>>>>>>>>>> #1  0x0000003c37c71ca8 in _IO_new_file_underflow () from /lib64/libc.so.6
>>>>>>>>>>>>>>> #2  0x0000003c37c737ae in _IO_default_uflow_internal () from /lib64/libc.so.6
>>>>>>>>>>>>>>> #3  0x0000003c37c67e8a in _IO_getline_info_internal () from /lib64/libc.so.6
>>>>>>>>>>>>>>> #4  0x0000003c37c66ce9 in fgets () from /lib64/libc.so.6
>>>>>>>>>>>>>>> #5  0x00000000005c5a43 in LAMMPS_NS::Input::file() () at ../input.cpp:167
>>>>>>>>>>>>>>> #6  0x00000000005d4236 in main () at ../main.cpp:31
>>>>>>>>>>>>>>> #0  0x00002b1635d2ace2 in poll_dispatch () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>>>>>>>>>>>>> #1  0x00002b1635d1fa71 in opal_libevent2022_event_base_loop () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>>>>>>>>>>>>> #2  0x00002b1635ce4634 in opal_progress () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>>>>>>>>>>>>> #3  0x00002b16351b8fad in ompi_request_default_wait () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>>>>> #4  0x00002b16351fcb40 in ompi_coll_base_bcast_intra_generic () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>>>>> #5  0x00002b16351fd0c2 in ompi_coll_base_bcast_intra_binomial () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>>>>> #6  0x00002b1644fa6d9b in ompi_coll_tuned_bcast_intra_dec_fixed () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/openmpi/mca_coll_tuned.so
>>>>>>>>>>>>>>> #7  0x00002b16351cb4fb in PMPI_Bcast () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>>>>> #8  0x00000000005c5b5d in LAMMPS_NS::Input::file() () at ../input.cpp:203
>>>>>>>>>>>>>>> #9  0x00000000005d4236 in main () at ../main.cpp:31
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>>>>> 402-472-6400
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>>>>>> Sent: Monday, August 22, 2016 2:17:10 PM
>>>>>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hmmm...perhaps we can break this out a bit? The stdin will be
>>>>>>>>>>>>>>> going to your rank=0 proc. It sounds like you have some
>>>>>>>>>>>>>>> subsequent step that calls MPI_Bcast?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you first verify that the input is being correctly
>>>>>>>>>>>>>>> delivered to rank=0? This will help us isolate if the problem
>>>>>>>>>>>>>>> is in the IO forwarding, or in the subsequent Bcast.
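One way to do the isolation Ralph asks for here - checking that forwarded stdin actually reaches rank 0 before any collective is involved - is a tiny reader with no Bcast at all. This is only a sketch (the buffer size and messages are arbitrary), not something taken from the thread:

    /* Minimal check that forwarded stdin reaches rank 0: no Bcast anywhere,
     * so a hang in this program points at IO forwarding, not the collective. */
    #include <stdio.h>
    #include <unistd.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;
        char buf[2048];
        long total = 0;
        ssize_t n;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (0 == rank) {
            while ((n = read(0, buf, sizeof(buf))) > 0) {
                total += n;
                fprintf(stderr, "rank 0: %ld bytes of stdin so far\n", total);
            }
            fprintf(stderr, "rank 0: EOF after %ld bytes\n", total);
        }

        MPI_Barrier(MPI_COMM_WORLD);   /* keep the other ranks alive until rank 0 finishes */
        MPI_Finalize();
        return 0;
    }

Run it the same way as the other tests, e.g. mpirun ./a.out < test.in: if the byte count stops growing before EOF, the hang is in the IO forwarding; if EOF is reached, the problem is downstream in the Bcast.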
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Aug 22, 2016, at 1:11 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3.
>>>>>>>>>>>>>>>> Both of them have odd behaviors when trying to read from
>>>>>>>>>>>>>>>> standard input.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For example, if we start the application lammps across 4
>>>>>>>>>>>>>>>> nodes, each node 16 cores, connected by Intel QDR Infiniband,
>>>>>>>>>>>>>>>> mpirun works fine for the 1st time, but always stuck in a few
>>>>>>>>>>>>>>>> seconds thereafter.
>>>>>>>>>>>>>>>> Command:
>>>>>>>>>>>>>>>> mpirun ./lmp_ompi_g++ < in.snr
>>>>>>>>>>>>>>>> in.snr is the Lammps input file. compiler is gcc/6.1.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Instead, if we use
>>>>>>>>>>>>>>>> mpirun ./lmp_ompi_g++ -in in.snr
>>>>>>>>>>>>>>>> it works 100%.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Some odd behaviors we gathered so far.
>>>>>>>>>>>>>>>> 1. For 1 node job, stdin always works.
>>>>>>>>>>>>>>>> 2. For multiple nodes, stdin works unstably when the number of
>>>>>>>>>>>>>>>> cores per node are relatively small. For example, for 2/3/4
>>>>>>>>>>>>>>>> nodes, each node 8 cores, mpirun works most of the time. But
>>>>>>>>>>>>>>>> for each node with >8 cores, mpirun works the 1st time, then
>>>>>>>>>>>>>>>> always stuck. There seems to be a magic number when it stops
>>>>>>>>>>>>>>>> working.
>>>>>>>>>>>>>>>> 3. We tested Quantum Expresso with compiler intel/13 and had
>>>>>>>>>>>>>>>> the same issue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We used gdb to debug and found when mpirun was stuck, the rest
>>>>>>>>>>>>>>>> of the processes were all waiting on mpi broadcast from the
>>>>>>>>>>>>>>>> master thread. The lammps binary, input file and gdb core
>>>>>>>>>>>>>>>> files (example.tar.bz2) can be downloaded from this link
>>>>>>>>>>>>>>>> https://drive.google.com/open?id=0B3Yj4QkZpI-dVWZtWmJ3ZXNVRGc
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Extra information:
>>>>>>>>>>>>>>>> 1. Job scheduler is slurm.
>>>>>>>>>>>>>>>> 2. configure setup:
>>>>>>>>>>>>>>>> ./configure --prefix=$PREFIX \
>>>>>>>>>>>>>>>>             --with-hwloc=internal \
>>>>>>>>>>>>>>>>             --enable-mpirun-prefix-by-default \
>>>>>>>>>>>>>>>>             --with-slurm \
>>>>>>>>>>>>>>>>             --with-verbs \
>>>>>>>>>>>>>>>>             --with-psm \
>>>>>>>>>>>>>>>>             --disable-openib-connectx-xrc \
>>>>>>>>>>>>>>>>             --with-knem=/opt/knem-1.1.2.90mlnx1 \
>>>>>>>>>>>>>>>>             --with-cma
>>>>>>>>>>>>>>>> 3. openmpi-mca-params.conf file
>>>>>>>>>>>>>>>> orte_hetero_nodes=1
>>>>>>>>>>>>>>>> hwloc_base_binding_policy=core
>>>>>>>>>>>>>>>> rmaps_base_mapping_policy=core
>>>>>>>>>>>>>>>> opal_cuda_support=0
>>>>>>>>>>>>>>>> btl_openib_use_eager_rdma=0
>>>>>>>>>>>>>>>> btl_openib_max_eager_rdma=0
>>>>>>>>>>>>>>>> btl_openib_flags=1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Jingchao
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>>>>>> 402-472-6400
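The workaround recommended in the thread is to pass the input file on the command line (as "-in in.snr" does for LAMMPS) and have rank 0 read it directly, broadcasting each line to the other ranks. The sketch below shows only that pattern, mirroring the fgets/MPI_Bcast sequence visible in the backtraces above; it is not LAMMPS code, and the buffer size, error handling and usage message are made up for the example:

    /* Sketch of the "read the file directly" workaround: the input file name
     * comes from argv instead of redirected stdin, so mpirun's IO forwarding
     * is never involved. */
    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    #define MAXLINE 1024

    int main(int argc, char *argv[])
    {
        int rank, len;
        char line[MAXLINE];
        FILE *infile = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (0 == rank) {
            if (argc < 2 || NULL == (infile = fopen(argv[1], "r"))) {
                fprintf(stderr, "usage: a.out <input file>\n");
                MPI_Abort(MPI_COMM_WORLD, 1);
            }
        }

        while (1) {
            if (0 == rank) {
                /* len < 0 tells the other ranks that the input is exhausted */
                len = (NULL == fgets(line, MAXLINE, infile)) ? -1 : (int)strlen(line) + 1;
            }
            MPI_Bcast(&len, 1, MPI_INT, 0, MPI_COMM_WORLD);
            if (len < 0) {
                break;
            }
            MPI_Bcast(line, len, MPI_CHAR, 0, MPI_COMM_WORLD);
            /* ... every rank now has the same input line to act on ... */
        }

        if (0 == rank) {
            fclose(infile);
        }
        MPI_Finalize();
        return 0;
    }

Because nothing here depends on forwarded stdin, this path corresponds to the invocation the original report says works 100%.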
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users