Well, that helped a bit. For some reason, your system is skipping a step in the 
launch state machine, and so we never hit the step where we set up the IO 
forwarding system.

Sorry to keep poking, but I haven’t seen this behavior anywhere else, and so I 
have no way to replicate it. Must be a subtle race condition.

Can you replace “plm” with “state” in that verbose setting and try to hit a “bad” run again?
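
For example, keeping the rest of your command the same as before, that should 
look something like:

$ mpirun -mca state_base_verbose 5 ./a.out < test.in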


> On Aug 30, 2016, at 12:30 PM, Jingchao Zhang <zh...@unl.edu> wrote:
> 
> Yes, all procs were launched properly. I added “-mca plm_base_verbose 5” to 
> the mpirun command. Please see attached for the results.
> 
> $ mpirun -mca plm_base_verbose 5 ./a.out < test.in
> 
> I mentioned in my initial post that the test job can run properly the 1st 
> time. But if I kill the job and resubmit, then it hangs. That happened with the 
> job above as well. Very odd. 
> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
> Sent: Tuesday, August 30, 2016 12:56:33 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>  
> Hmmm...well, the problem appears to be that we aren’t setting up the input 
> channel to read stdin. This happens immediately after the application is 
> launched - there is no “if” clause or anything else in front of it. The only 
> way it wouldn’t get called is if the procs weren’t all launched, but they do 
> appear to be getting launched, yes?
> 
> Hence my confusion - there is no test in front of that print statement now, 
> and yet we aren’t seeing the code being called.
> 
> Could you please add “-mca plm_base_verbose 5” to your command line? We 
> should then see a debug statement printed that contains “plm:base:launch 
> wiring up iof for job”.
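> 
> For reference, that message comes from a verbose-output call of roughly this 
> form in the plm base launch code (just a sketch - the exact location, framework 
> handle, and job argument in the source may differ):
> 
>     OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>                          "%s plm:base:launch wiring up iof for job %s",
>                          ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>                          ORTE_JOBID_PRINT(jdata->jobid)));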
> 
> 
> 
>> On Aug 30, 2016, at 11:40 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>> 
>> I checked again and, as far as I can tell, everything was set up correctly. I 
>> added "HCC debug" to the output message to make sure it's the correct 
>> plugin. 
>> 
>> The updated outputs:
>> $ mpirun ./a.out < test.in
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 35 for process [[26513,1],0]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 41 for process [[26513,1],0]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 43 for process [[26513,1],0]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 37 for process [[26513,1],1]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 46 for process [[26513,1],1]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 49 for process [[26513,1],1]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 38 for process [[26513,1],2]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 50 for process [[26513,1],2]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 52 for process [[26513,1],2]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 42 for process [[26513,1],3]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 53 for process [[26513,1],3]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 55 for process [[26513,1],3]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 45 for process [[26513,1],4]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 56 for process [[26513,1],4]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 58 for process [[26513,1],4]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 47 for process [[26513,1],5]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 59 for process [[26513,1],5]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 61 for process [[26513,1],5]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 51 for process [[26513,1],6]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 62 for process [[26513,1],6]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 64 for process [[26513,1],6]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 57 for process [[26513,1],7]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 66 for process [[26513,1],7]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 68 for process [[26513,1],7]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 63 for process [[26513,1],8]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 70 for process [[26513,1],8]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 72 for process [[26513,1],8]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 67 for process [[26513,1],9]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 74 for process [[26513,1],9]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 76 for process [[26513,1],9]
>> Rank 1 has cleared MPI_Init
>> Rank 3 has cleared MPI_Init
>> Rank 4 has cleared MPI_Init
>> Rank 5 has cleared MPI_Init
>> Rank 6 has cleared MPI_Init
>> Rank 7 has cleared MPI_Init
>> Rank 0 has cleared MPI_Init
>> Rank 2 has cleared MPI_Init
>> Rank 8 has cleared MPI_Init
>> Rank 9 has cleared MPI_Init
>> Rank 10 has cleared MPI_Init
>> Rank 11 has cleared MPI_Init
>> Rank 12 has cleared MPI_Init
>> Rank 13 has cleared MPI_Init
>> Rank 16 has cleared MPI_Init
>> Rank 17 has cleared MPI_Init
>> Rank 18 has cleared MPI_Init
>> Rank 14 has cleared MPI_Init
>> Rank 15 has cleared MPI_Init
>> Rank 19 has cleared MPI_Init
>> 
>> 
>> The part of the code I changed in file ./orte/mca/iof/hnp/iof_hnp.c:
>> 
>>     opal_output(0,
>>                 "HCC debug: %s iof:hnp pushing fd %d for process %s",
>>                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>                 fd, ORTE_NAME_PRINT(dst_name));
>> 
>>     /* don't do this if the dst vpid is invalid or the fd is negative! */
>>     if (ORTE_VPID_INVALID == dst_name->vpid || fd < 0) {
>>         return ORTE_SUCCESS;
>>     }
>> 
>> /*    OPAL_OUTPUT_VERBOSE((1, orte_iof_base_framework.framework_output,
>>                          "%s iof:hnp pushing fd %d for process %s",
>>                          ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>                          fd, ORTE_NAME_PRINT(dst_name)));
>> */
>> 
>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>> Sent: Monday, August 29, 2016 11:42:00 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>  
>> I’m sorry, but something is simply very wrong here. Are you sure you are 
>> pointed at the correct LD_LIBRARY_PATH? Perhaps add a “BOO” or something at 
>> the front of the output message to ensure we are using the correct plugin?
>> 
>> This looks to me like you must be picking up a stale library somewhere.
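>> 
>> For example, something along these lines (standard commands, run from the same 
>> job environment) should show which install is actually being picked up:
>> 
>> $ which mpirun
>> $ echo $LD_LIBRARY_PATH
>> $ ldd ./a.out | grep -i mpi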
>> 
>>> On Aug 29, 2016, at 10:29 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>>> 
>>> Hi Ralph,
>>> 
>>> I used the tarball from Aug 26 and added the patch. Tested with 2 nodes, 10 
>>> cores/node. Please see the results below:
>>> 
>>> $ mpirun ./a.out < test.in
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 35 for process [[43954,1],0]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 41 for process [[43954,1],0]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 43 for process [[43954,1],0]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 37 for process [[43954,1],1]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 46 for process [[43954,1],1]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 49 for process [[43954,1],1]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 38 for process [[43954,1],2]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 50 for process [[43954,1],2]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 52 for process [[43954,1],2]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 42 for process [[43954,1],3]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 53 for process [[43954,1],3]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 55 for process [[43954,1],3]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 45 for process [[43954,1],4]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 56 for process [[43954,1],4]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 58 for process [[43954,1],4]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 47 for process [[43954,1],5]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 59 for process [[43954,1],5]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 61 for process [[43954,1],5]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 57 for process [[43954,1],6]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 64 for process [[43954,1],6]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 66 for process [[43954,1],6]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 62 for process [[43954,1],7]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 68 for process [[43954,1],7]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 70 for process [[43954,1],7]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 65 for process [[43954,1],8]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 72 for process [[43954,1],8]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 74 for process [[43954,1],8]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 75 for process [[43954,1],9]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 79 for process [[43954,1],9]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 81 for process [[43954,1],9]
>>> Rank 5 has cleared MPI_Init
>>> Rank 9 has cleared MPI_Init
>>> Rank 1 has cleared MPI_Init
>>> Rank 2 has cleared MPI_Init
>>> Rank 3 has cleared MPI_Init
>>> Rank 4 has cleared MPI_Init
>>> Rank 8 has cleared MPI_Init
>>> Rank 0 has cleared MPI_Init
>>> Rank 6 has cleared MPI_Init
>>> Rank 7 has cleared MPI_Init
>>> Rank 14 has cleared MPI_Init
>>> Rank 15 has cleared MPI_Init
>>> Rank 16 has cleared MPI_Init
>>> Rank 18 has cleared MPI_Init
>>> Rank 10 has cleared MPI_Init
>>> Rank 11 has cleared MPI_Init
>>> Rank 12 has cleared MPI_Init
>>> Rank 13 has cleared MPI_Init
>>> Rank 17 has cleared MPI_Init
>>> Rank 19 has cleared MPI_Init
>>> 
>>> Thanks,
>>> 
>>> Dr. Jingchao Zhang
>>> Holland Computing Center
>>> University of Nebraska-Lincoln
>>> 402-472-6400
>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>> Sent: Saturday, August 27, 2016 12:31:53 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>  
>>> I am finding this impossible to replicate, so something odd must be going 
>>> on. Can you please (a) pull down the latest v2.0.1 nightly tarball, and (b) 
>>> add this patch to it?
>>> 
>>> diff --git a/orte/mca/iof/hnp/iof_hnp.c b/orte/mca/iof/hnp/iof_hnp.c
>>> old mode 100644
>>> new mode 100755
>>> index 512fcdb..362ff46
>>> --- a/orte/mca/iof/hnp/iof_hnp.c
>>> +++ b/orte/mca/iof/hnp/iof_hnp.c
>>> @@ -143,16 +143,17 @@ static int hnp_push(const orte_process_name_t* dst_name, orte_iof_tag_t src_tag,
>>>      int np, numdigs;
>>>      orte_ns_cmp_bitmask_t mask;
>>>  
>>> +    opal_output(0,
>>> +                         "%s iof:hnp pushing fd %d for process %s",
>>> +                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>> +                         fd, ORTE_NAME_PRINT(dst_name));
>>> +
>>>      /* don't do this if the dst vpid is invalid or the fd is negative! */
>>>      if (ORTE_VPID_INVALID == dst_name->vpid || fd < 0) {
>>>          return ORTE_SUCCESS;
>>>      }
>>>  
>>> -    OPAL_OUTPUT_VERBOSE((1, orte_iof_base_framework.framework_output,
>>> -                         "%s iof:hnp pushing fd %d for process %s",
>>> -                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>> -                         fd, ORTE_NAME_PRINT(dst_name)));
>>> -
>>>      if (!(src_tag & ORTE_IOF_STDIN)) {
>>>          /* set the file descriptor to non-blocking - do this before we setup
>>>           * and activate the read event in case it fires right away
>>> 
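>>> One way to apply it, assuming you have already unpacked and configured the 
>>> nightly tarball (the directory and patch file names here are just 
>>> illustrative):
>>> 
>>> $ cd openmpi-v2.0.1-nightly
>>> $ patch -p1 < iof-debug.patch
>>> $ make && make install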
>>> 
>>> You can then run the test again without the "--mca iof_base_verbose 100” 
>>> flag to reduce the chatter - this print statement will tell me what I need 
>>> to know.
>>> 
>>> Thanks!
>>> Ralph
>>> 
>>> 
>>>> On Aug 25, 2016, at 8:19 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>> 
>>>> The IOF fix PR for v2.0.1 was literally just merged a few minutes ago; it 
>>>> wasn't in last night's tarball.
>>>> 
>>>> 
>>>> 
>>>>> On Aug 25, 2016, at 10:59 AM, r...@open-mpi.org wrote:
>>>>> 
>>>>> ??? Weird - can you send me an updated output of that last test we ran?
>>>>> 
>>>>>> On Aug 25, 2016, at 7:51 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>> 
>>>>>> Hi Ralph,
>>>>>> 
>>>>>> I saw the pull request and did a test with v2.0.1rc1, but the problem 
>>>>>> persists. Any ideas?
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Dr. Jingchao Zhang
>>>>>> Holland Computing Center
>>>>>> University of Nebraska-Lincoln
>>>>>> 402-472-6400
>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>> Sent: Wednesday, August 24, 2016 1:27:28 PM
>>>>>> To: Open MPI Users
>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>> 
>>>>>> Bingo - found it, fix submitted and hope to get it into 2.0.1
>>>>>> 
>>>>>> Thanks for the assist!
>>>>>> Ralph
>>>>>> 
>>>>>> 
>>>>>>> On Aug 24, 2016, at 12:15 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>> 
>>>>>>> I configured v2.0.1rc1 with --enable-debug and ran the test with --mca 
>>>>>>> iof_base_verbose 100. I also added -display-devel-map in case it 
>>>>>>> provides some useful information.
>>>>>>> 
>>>>>>> The test job has 2 nodes, 10 cores per node. Rank 0 and the mpirun 
>>>>>>> command are on the same node.
>>>>>>> $ mpirun -display-devel-map --mca iof_base_verbose 100 ./a.out < test.in &> debug_info.txt
>>>>>>> 
>>>>>>> The debug_info.txt is attached. 
>>>>>>> 
>>>>>>> Dr. Jingchao Zhang
>>>>>>> Holland Computing Center
>>>>>>> University of Nebraska-Lincoln
>>>>>>> 402-472-6400
>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>> Sent: Wednesday, August 24, 2016 12:14:26 PM
>>>>>>> To: Open MPI Users
>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>> 
>>>>>>> Afraid I can’t replicate a problem at all, whether rank=0 is local or 
>>>>>>> not. I’m also using bash, but on CentOS-7, so I suspect the OS is the 
>>>>>>> difference.
>>>>>>> 
>>>>>>> Can you configure OMPI with --enable-debug, and then run the test again 
>>>>>>> with --mca iof_base_verbose 100? It will hopefully tell us something 
>>>>>>> about why the IO subsystem is stuck.
>>>>>>> 
>>>>>>> 
>>>>>>>> On Aug 24, 2016, at 8:46 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>> 
>>>>>>>> Hi Ralph,
>>>>>>>> 
>>>>>>>> For our tests, rank 0 is always on the same node as mpirun. I just 
>>>>>>>> tested mpirun with -nolocal and it still hangs. 
>>>>>>>> 
>>>>>>>> Information on shell and OS
>>>>>>>> $ echo $0
>>>>>>>> -bash
>>>>>>>> 
>>>>>>>> $ lsb_release -a
>>>>>>>> LSB Version:    
>>>>>>>> :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
>>>>>>>> Distributor ID: Scientific
>>>>>>>> Description:    Scientific Linux release 6.8 (Carbon)
>>>>>>>> Release:        6.8
>>>>>>>> Codename:       Carbon
>>>>>>>> 
>>>>>>>> $ uname -a
>>>>>>>> Linux login.crane.hcc.unl.edu 2.6.32-642.3.1.el6.x86_64 #1 SMP Tue Jul 12 11:25:51 CDT 2016 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Dr. Jingchao Zhang
>>>>>>>> Holland Computing Center
>>>>>>>> University of Nebraska-Lincoln
>>>>>>>> 402-472-6400
>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>> Sent: Tuesday, August 23, 2016 8:14:48 PM
>>>>>>>> To: Open MPI Users
>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>> 
>>>>>>>> Hmmm...that’s a good point. Rank 0 and mpirun are always on the same 
>>>>>>>> node on my cluster. I’ll give it a try.
>>>>>>>> 
>>>>>>>> Jingchao: is rank 0 on the node with mpirun, or on a remote node?
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Aug 23, 2016, at 5:58 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>>>>>> 
>>>>>>>>> Ralph,
>>>>>>>>> 
>>>>>>>>> Did you run task 0 and mpirun on different nodes?
>>>>>>>>> 
>>>>>>>>> I observed some random hangs, though I cannot blame Open MPI 100% yet.
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> 
>>>>>>>>> Gilles
>>>>>>>>> 
>>>>>>>>> On 8/24/2016 9:41 AM, r...@open-mpi.org wrote:
>>>>>>>>>> Very strange. I cannot reproduce it, as I’m able to run any number of 
>>>>>>>>>> nodes and procs, pushing over 100 MBytes through without any problem.
>>>>>>>>>> 
>>>>>>>>>> Which leads me to suspect that the issue here is with the tty 
>>>>>>>>>> interface. Can you tell me what shell and OS you are running?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Aug 23, 2016, at 3:25 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Everything is stuck at MPI_Init. For a test job with 2 nodes and 10 
>>>>>>>>>>> cores per node, I got the following:
>>>>>>>>>>> 
>>>>>>>>>>> $ mpirun ./a.out < test.in
>>>>>>>>>>> Rank 2 has cleared MPI_Init
>>>>>>>>>>> Rank 4 has cleared MPI_Init
>>>>>>>>>>> Rank 7 has cleared MPI_Init
>>>>>>>>>>> Rank 8 has cleared MPI_Init
>>>>>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>>>>>> Rank 5 has cleared MPI_Init
>>>>>>>>>>> Rank 6 has cleared MPI_Init
>>>>>>>>>>> Rank 9 has cleared MPI_Init
>>>>>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>>>>>> Rank 16 has cleared MPI_Init
>>>>>>>>>>> Rank 19 has cleared MPI_Init
>>>>>>>>>>> Rank 10 has cleared MPI_Init
>>>>>>>>>>> Rank 11 has cleared MPI_Init
>>>>>>>>>>> Rank 12 has cleared MPI_Init
>>>>>>>>>>> Rank 13 has cleared MPI_Init
>>>>>>>>>>> Rank 14 has cleared MPI_Init
>>>>>>>>>>> Rank 15 has cleared MPI_Init
>>>>>>>>>>> Rank 17 has cleared MPI_Init
>>>>>>>>>>> Rank 18 has cleared MPI_Init
>>>>>>>>>>> Rank 3 has cleared MPI_Init
>>>>>>>>>>> 
>>>>>>>>>>> then it just hung.
>>>>>>>>>>> 
>>>>>>>>>>> --Jingchao
>>>>>>>>>>> 
>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>> 402-472-6400
>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>> Sent: Tuesday, August 23, 2016 4:03:07 PM
>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>> 
>>>>>>>>>>> The IO forwarding messages all flow over the Ethernet, so the type 
>>>>>>>>>>> of fabric is irrelevant. The number of procs involved would 
>>>>>>>>>>> definitely have an impact, but that might not be due to the IO 
>>>>>>>>>>> forwarding subsystem. We know we have flow control issues with 
>>>>>>>>>>> collectives like Bcast that don’t have built-in synchronization 
>>>>>>>>>>> points. How many reads were you able to do before it hung?
>>>>>>>>>>> 
>>>>>>>>>>> I was running it on my little test setup (2 nodes, using only a few 
>>>>>>>>>>> procs), but I’ll try scaling up and see what happens. I’ll also try 
>>>>>>>>>>> introducing some forced “syncs” on the Bcast and see if that solves 
>>>>>>>>>>> the issue.
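>>>>>>>>>>> 
>>>>>>>>>>> As a sketch of what I mean by a forced “sync”: in the simple test 
>>>>>>>>>>> program quoted further down, one could follow every Bcast - in both 
>>>>>>>>>>> the rank-0 loop and the non-root loop - with a barrier, e.g.:
>>>>>>>>>>> 
>>>>>>>>>>>     MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>>>>>>>>>     MPI_Barrier(MPI_COMM_WORLD);  /* forced sync: root cannot run ahead */
>>>>>>>>>>> 
>>>>>>>>>>> That serializes the pipeline, so it is purely a diagnostic, but it 
>>>>>>>>>>> would rule out the root outrunning the other ranks.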
>>>>>>>>>>> 
>>>>>>>>>>> Ralph
>>>>>>>>>>> 
>>>>>>>>>>>> On Aug 23, 2016, at 2:30 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>> 
>>>>>>>>>>>> I tested v2.0.1rc1 with your code but it has the same issue. I also 
>>>>>>>>>>>> installed v2.0.1rc1 on a different cluster, which has Mellanox QDR 
>>>>>>>>>>>> Infiniband, and got the same result. For the tests you have done, 
>>>>>>>>>>>> how many cores and nodes did you use? I can trigger the problem by 
>>>>>>>>>>>> using multiple nodes with more than 10 cores per node. 
>>>>>>>>>>>> 
>>>>>>>>>>>> Thank you for looking into this.
>>>>>>>>>>>> 
>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>> 402-472-6400
>>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>>> Sent: Monday, August 22, 2016 10:23:42 PM
>>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>>> 
>>>>>>>>>>>> FWIW: I just tested forwarding up to 100MBytes via stdin using the 
>>>>>>>>>>>> simple test shown below with OMPI v2.0.1rc1, and it worked fine. 
>>>>>>>>>>>> So I’d suggest upgrading when the official release comes out, or 
>>>>>>>>>>>> going ahead and at least testing 2.0.1rc1 on your machine. Or you 
>>>>>>>>>>>> can test this program with some input file and let me know if it 
>>>>>>>>>>>> works for you.
>>>>>>>>>>>> 
>>>>>>>>>>>> Ralph
>>>>>>>>>>>> 
>>>>>>>>>>>> #include <stdlib.h>
>>>>>>>>>>>> #include <stdio.h>
>>>>>>>>>>>> #include <string.h>
>>>>>>>>>>>> #include <stdbool.h>
>>>>>>>>>>>> #include <unistd.h>
>>>>>>>>>>>> #include <mpi.h>
>>>>>>>>>>>> 
>>>>>>>>>>>> #define ORTE_IOF_BASE_MSG_MAX   2048
>>>>>>>>>>>> 
>>>>>>>>>>>> int main(int argc, char *argv[])
>>>>>>>>>>>> {
>>>>>>>>>>>>    int i, rank, size, next, prev, tag = 201;
>>>>>>>>>>>>    int pos, msgsize, nbytes;
>>>>>>>>>>>>    bool done;
>>>>>>>>>>>>    char *msg;
>>>>>>>>>>>> 
>>>>>>>>>>>>    MPI_Init(&argc, &argv);
>>>>>>>>>>>>    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>>>>>    MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>>>>>>> 
>>>>>>>>>>>>    fprintf(stderr, "Rank %d has cleared MPI_Init\n", rank);
>>>>>>>>>>>> 
>>>>>>>>>>>>    next = (rank + 1) % size;
>>>>>>>>>>>>    prev = (rank + size - 1) % size;
>>>>>>>>>>>>    msg = malloc(ORTE_IOF_BASE_MSG_MAX);
>>>>>>>>>>>>    pos = 0;
>>>>>>>>>>>>    nbytes = 0;
>>>>>>>>>>>> 
>>>>>>>>>>>>    if (0 == rank) {
>>>>>>>>>>>>        while (0 != (msgsize = read(0, msg, ORTE_IOF_BASE_MSG_MAX))) {
>>>>>>>>>>>>            fprintf(stderr, "Rank %d: sending blob %d\n", rank, pos);
>>>>>>>>>>>>            if (msgsize > 0) {
>>>>>>>>>>>>                MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>>>>>>>>>>            }
>>>>>>>>>>>>            ++pos;
>>>>>>>>>>>>            nbytes += msgsize;
>>>>>>>>>>>>        }
>>>>>>>>>>>>        fprintf(stderr, "Rank %d: sending termination blob %d\n", rank, pos);
>>>>>>>>>>>>        memset(msg, 0, ORTE_IOF_BASE_MSG_MAX);
>>>>>>>>>>>>        MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>>>>>>>>>>        MPI_Barrier(MPI_COMM_WORLD);
>>>>>>>>>>>>    } else {
>>>>>>>>>>>>        while (1) {
>>>>>>>>>>>>            MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>>>>>>>>>>            fprintf(stderr, "Rank %d: recvd blob %d\n", rank, pos);
>>>>>>>>>>>>            ++pos;
>>>>>>>>>>>>            done = true;
>>>>>>>>>>>>            for (i=0; i < ORTE_IOF_BASE_MSG_MAX; i++) {
>>>>>>>>>>>>                if (0 != msg[i]) {
>>>>>>>>>>>>                    done = false;
>>>>>>>>>>>>                    break;
>>>>>>>>>>>>                }
>>>>>>>>>>>>            }
>>>>>>>>>>>>            if (done) {
>>>>>>>>>>>>                break;
>>>>>>>>>>>>            }
>>>>>>>>>>>>        }
>>>>>>>>>>>>        fprintf(stderr, "Rank %d: recv done\n", rank);
>>>>>>>>>>>>        MPI_Barrier(MPI_COMM_WORLD);
>>>>>>>>>>>>    }
>>>>>>>>>>>> 
>>>>>>>>>>>>    fprintf(stderr, "Rank %d has completed bcast\n", rank);
>>>>>>>>>>>>    MPI_Finalize();
>>>>>>>>>>>>    return 0;
>>>>>>>>>>>> }
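>>>>>>>>>>>> 
>>>>>>>>>>>> To build and run it (the source file name here is just illustrative), 
>>>>>>>>>>>> the usual wrapper compiler is enough:
>>>>>>>>>>>> 
>>>>>>>>>>>> $ mpicc stdin_test.c -o a.out
>>>>>>>>>>>> $ mpirun ./a.out < test.in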
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Aug 22, 2016, at 3:40 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This might be a thin argument, but we have had many users running 
>>>>>>>>>>>>> mpirun in this way for years with no problem until this recent 
>>>>>>>>>>>>> upgrade. And some home-brewed MPI codes do not even have a 
>>>>>>>>>>>>> standard way to read input files. Last time I checked, the 
>>>>>>>>>>>>> Open MPI manual still claims it supports stdin 
>>>>>>>>>>>>> (https://www.open-mpi.org/doc/v2.0/man1/mpirun.1.php#sect14). 
>>>>>>>>>>>>> Maybe I missed it, but the v2.0 release notes did not mention any 
>>>>>>>>>>>>> changes to the behavior of stdin either.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We can tell our users to run mpirun in the suggested way, but I 
>>>>>>>>>>>>> do hope someone can look into the issue and fix it.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>>> 402-472-6400
>>>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>>>> Sent: Monday, August 22, 2016 3:04:50 PM
>>>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Well, I can try to find time to take a look. However, I will 
>>>>>>>>>>>>> reiterate what Jeff H said - it is very unwise to rely on IO 
>>>>>>>>>>>>> forwarding. Much better to just directly read the file unless 
>>>>>>>>>>>>> that file is simply unavailable on the node where rank=0 is 
>>>>>>>>>>>>> running.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Aug 22, 2016, at 1:55 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Here you can find the source code for the lammps input handling:
>>>>>>>>>>>>>> https://github.com/lammps/lammps/blob/r13864/src/input.cpp
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Based on the gdb output, rank 0 is stuck at line 167,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>     if (fgets(&line[m],maxline-m,infile) == NULL)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> and the rest of the threads are stuck at line 203,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>     MPI_Bcast(&n,1,MPI_INT,0,world);
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So rank 0 possibly hangs in the fgets() call.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Here is the full backtrace information:
>>>>>>>>>>>>>> $ cat master.backtrace worker.backtrace
>>>>>>>>>>>>>> #0  0x0000003c37cdb68d in read () from /lib64/libc.so.6
>>>>>>>>>>>>>> #1  0x0000003c37c71ca8 in _IO_new_file_underflow () from /lib64/libc.so.6
>>>>>>>>>>>>>> #2  0x0000003c37c737ae in _IO_default_uflow_internal () from /lib64/libc.so.6
>>>>>>>>>>>>>> #3  0x0000003c37c67e8a in _IO_getline_info_internal () from /lib64/libc.so.6
>>>>>>>>>>>>>> #4  0x0000003c37c66ce9 in fgets () from /lib64/libc.so.6
>>>>>>>>>>>>>> #5  0x00000000005c5a43 in LAMMPS_NS::Input::file() () at ../input.cpp:167
>>>>>>>>>>>>>> #6  0x00000000005d4236 in main () at ../main.cpp:31
>>>>>>>>>>>>>> #0  0x00002b1635d2ace2 in poll_dispatch () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>>>>>>>>>>>> #1  0x00002b1635d1fa71 in opal_libevent2022_event_base_loop () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>>>>>>>>>>>> #2  0x00002b1635ce4634 in opal_progress () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>>>>>>>>>>>> #3  0x00002b16351b8fad in ompi_request_default_wait () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>>>> #4  0x00002b16351fcb40 in ompi_coll_base_bcast_intra_generic () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>>>> #5  0x00002b16351fd0c2 in ompi_coll_base_bcast_intra_binomial () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>>>> #6  0x00002b1644fa6d9b in ompi_coll_tuned_bcast_intra_dec_fixed () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/openmpi/mca_coll_tuned.so
>>>>>>>>>>>>>> #7  0x00002b16351cb4fb in PMPI_Bcast () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>>>> #8  0x00000000005c5b5d in LAMMPS_NS::Input::file() () at ../input.cpp:203
>>>>>>>>>>>>>> #9  0x00000000005d4236 in main () at ../main.cpp:31
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>>>> 402-472-6400
>>>>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>>>>> Sent: Monday, August 22, 2016 2:17:10 PM
>>>>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hmmm...perhaps we can break this out a bit? The stdin will be 
>>>>>>>>>>>>>> going to your rank=0 proc. It sounds like you have some 
>>>>>>>>>>>>>> subsequent step that calls MPI_Bcast?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Can you first verify that the input is being correctly delivered 
>>>>>>>>>>>>>> to rank=0? This will help us isolate if the problem is in the IO 
>>>>>>>>>>>>>> forwarding, or in the subsequent Bcast.
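>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> For example, a minimal check (file name and buffer size are just 
>>>>>>>>>>>>>> illustrative) that takes the Bcast out of the picture entirely - 
>>>>>>>>>>>>>> rank 0 simply reads stdin and echoes it to stderr:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> #include <stdio.h>
>>>>>>>>>>>>>> #include <mpi.h>
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> int main(int argc, char **argv)
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>     int rank;
>>>>>>>>>>>>>>     char buf[256];
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>>>>>>>     if (0 == rank) {
>>>>>>>>>>>>>>         /* echo whatever mpirun forwards on stdin */
>>>>>>>>>>>>>>         while (NULL != fgets(buf, sizeof(buf), stdin)) {
>>>>>>>>>>>>>>             fprintf(stderr, "rank 0 got: %s", buf);
>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>     MPI_Finalize();
>>>>>>>>>>>>>>     return 0;
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If that reliably echoes the whole input file, it would point at the 
>>>>>>>>>>>>>> subsequent Bcast rather than the IO forwarding.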
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Aug 22, 2016, at 1:11 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3. Both 
>>>>>>>>>>>>>>> of them have odd behaviors when trying to read from standard 
>>>>>>>>>>>>>>> input.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> For example, if we start the application lammps across 4 nodes, 
>>>>>>>>>>>>>>> each node with 16 cores, connected by Intel QDR Infiniband, mpirun 
>>>>>>>>>>>>>>> works fine the 1st time but always gets stuck within a few seconds 
>>>>>>>>>>>>>>> thereafter.
>>>>>>>>>>>>>>> Command:
>>>>>>>>>>>>>>> mpirun ./lmp_ompi_g++ < in.snr
>>>>>>>>>>>>>>> in.snr is the Lammps input file. The compiler is gcc/6.1.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Instead, if we use
>>>>>>>>>>>>>>> mpirun ./lmp_ompi_g++ -in in.snr
>>>>>>>>>>>>>>> it works 100%.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Some odd behaviors we have gathered so far: 
>>>>>>>>>>>>>>> 1. For a 1-node job, stdin always works.
>>>>>>>>>>>>>>> 2. For multiple nodes, stdin works, though not reliably, when the 
>>>>>>>>>>>>>>> number of cores per node is relatively small. For example, for 
>>>>>>>>>>>>>>> 2/3/4 nodes with 8 cores per node, mpirun works most of the time. 
>>>>>>>>>>>>>>> But with more than 8 cores per node, mpirun works the 1st time, 
>>>>>>>>>>>>>>> then always gets stuck. There seems to be a magic number at which 
>>>>>>>>>>>>>>> it stops working.
>>>>>>>>>>>>>>> 3. We tested Quantum ESPRESSO with compiler intel/13 and had 
>>>>>>>>>>>>>>> the same issue. 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We used gdb to debug and found that when mpirun was stuck, the 
>>>>>>>>>>>>>>> rest of the processes were all waiting on an MPI broadcast from 
>>>>>>>>>>>>>>> the master thread. The lammps binary, input file and gdb core 
>>>>>>>>>>>>>>> files (example.tar.bz2) can be downloaded from this link: 
>>>>>>>>>>>>>>> https://drive.google.com/open?id=0B3Yj4QkZpI-dVWZtWmJ3ZXNVRGc
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Extra information:
>>>>>>>>>>>>>>> 1. Job scheduler is slurm.
>>>>>>>>>>>>>>> 2. configure setup:
>>>>>>>>>>>>>>> ./configure     --prefix=$PREFIX \
>>>>>>>>>>>>>>>                --with-hwloc=internal \
>>>>>>>>>>>>>>>                --enable-mpirun-prefix-by-default \
>>>>>>>>>>>>>>>                --with-slurm \
>>>>>>>>>>>>>>>                --with-verbs \
>>>>>>>>>>>>>>>                --with-psm \
>>>>>>>>>>>>>>>                --disable-openib-connectx-xrc \
>>>>>>>>>>>>>>>                --with-knem=/opt/knem-1.1.2.90mlnx1 \
>>>>>>>>>>>>>>>                --with-cma
>>>>>>>>>>>>>>> 3. openmpi-mca-params.conf file 
>>>>>>>>>>>>>>> orte_hetero_nodes=1
>>>>>>>>>>>>>>> hwloc_base_binding_policy=core
>>>>>>>>>>>>>>> rmaps_base_mapping_policy=core
>>>>>>>>>>>>>>> opal_cuda_support=0
>>>>>>>>>>>>>>> btl_openib_use_eager_rdma=0
>>>>>>>>>>>>>>> btl_openib_max_eager_rdma=0
>>>>>>>>>>>>>>> btl_openib_flags=1
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Jingchao 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>>>>> 402-472-6400
>>>>>>> <debug_info.txt>
>>>> -- 
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to: 
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
> <debug_info.txt>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
