Well, that helped a bit. For some reason, your system is skipping a step in the launch state machine, and so we never hit the step where we set up the IO forwarding system.

Sorry to keep poking, but I haven't seen this behavior anywhere else, and so I have no way to replicate it. It must be a subtle race condition. Can you replace "plm" with "state" and try to hit a "bad" run again?
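Concretely, assuming the substitution is meant for the verbosity flag used earlier in this thread (the thread itself only names the "plm" to "state" swap), the requested re-run would be:

  $ mpirun -mca state_base_verbose 5 ./a.out < test.in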
> On Aug 30, 2016, at 12:30 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>
> Yes, all procs were launched properly. I added "-mca plm_base_verbose 5" to the mpirun command. Please see attached for the results.
>
> $ mpirun -mca plm_base_verbose 5 ./a.out < test.in
>
> I mentioned in my initial post that the test job can run properly the 1st time. But if I kill the job and resubmit, then it hangs. It happened with the job above as well. Very odd.
>
> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
> Sent: Tuesday, August 30, 2016 12:56:33 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>
> Hmmm...well, the problem appears to be that we aren't setting up the input channel to read stdin. This happens immediately after the application is launched - there is no "if" clause or anything else in front of it. The only way it wouldn't get called is if all the procs weren't launched, but that appears to be happening, yes?
>
> Hence my confusion - there is no test in front of that print statement now, and yet we aren't seeing the code being called.
>
> Could you please add "-mca plm_base_verbose 5" to your cmd line? We should see a debug statement print that contains "plm:base:launch wiring up iof for job"
>
>> On Aug 30, 2016, at 11:40 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>>
>> I checked again and as far as I can tell, everything was set up correctly. I added "HCC debug" to the output message to make sure it's the correct plugin.
>>
>> The updated outputs:
>> $ mpirun ./a.out < test.in
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 35 for process [[26513,1],0]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 41 for process [[26513,1],0]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 43 for process [[26513,1],0]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 37 for process [[26513,1],1]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 46 for process [[26513,1],1]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 49 for process [[26513,1],1]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 38 for process [[26513,1],2]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 50 for process [[26513,1],2]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 52 for process [[26513,1],2]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 42 for process [[26513,1],3]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 53 for process [[26513,1],3]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 55 for process [[26513,1],3]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 45 for process [[26513,1],4]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 56 for process [[26513,1],4]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 58 for process [[26513,1],4]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 47 for process [[26513,1],5]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 59 for process [[26513,1],5]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 61 for process [[26513,1],5]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 51 for process [[26513,1],6]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 62 for process [[26513,1],6]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 64 for process [[26513,1],6]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 57 for process [[26513,1],7]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 66 for process [[26513,1],7]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 68 for process [[26513,1],7]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 63 for process [[26513,1],8]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 70 for process [[26513,1],8]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 72 for process [[26513,1],8]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 67 for process [[26513,1],9]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 74 for process [[26513,1],9]
>> [c1725.crane.hcc.unl.edu:218844] HCC debug: [[26513,0],0] iof:hnp pushing fd 76 for process [[26513,1],9]
>> Rank 1 has cleared MPI_Init
>> Rank 3 has cleared MPI_Init
>> Rank 4 has cleared MPI_Init
>> Rank 5 has cleared MPI_Init
>> Rank 6 has cleared MPI_Init
>> Rank 7 has cleared MPI_Init
>> Rank 0 has cleared MPI_Init
>> Rank 2 has cleared MPI_Init
>> Rank 8 has cleared MPI_Init
>> Rank 9 has cleared MPI_Init
>> Rank 10 has cleared MPI_Init
>> Rank 11 has cleared MPI_Init
>> Rank 12 has cleared MPI_Init
>> Rank 13 has cleared MPI_Init
>> Rank 16 has cleared MPI_Init
>> Rank 17 has cleared MPI_Init
>> Rank 18 has cleared MPI_Init
>> Rank 14 has cleared MPI_Init
>> Rank 15 has cleared MPI_Init
>> Rank 19 has cleared MPI_Init
>>
>> The part of the code I changed in file ./orte/mca/iof/hnp/iof_hnp.c:
>>
>>     opal_output(0,
>>                 "HCC debug: %s iof:hnp pushing fd %d for process %s",
>>                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>                 fd, ORTE_NAME_PRINT(dst_name));
>>
>>     /* don't do this if the dst vpid is invalid or the fd is negative! */
>>     if (ORTE_VPID_INVALID == dst_name->vpid || fd < 0) {
>>         return ORTE_SUCCESS;
>>     }
>>
>>     /* OPAL_OUTPUT_VERBOSE((1, orte_iof_base_framework.framework_output,
>>                          "%s iof:hnp pushing fd %d for process %s",
>>                          ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>                          fd, ORTE_NAME_PRINT(dst_name))); */
>>
>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>> Sent: Monday, August 29, 2016 11:42:00 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>
>> I'm sorry, but something is simply very wrong here. Are you sure you are pointed at the correct LD_LIBRARY_PATH? Perhaps add a "BOO" or something at the front of the output message to ensure we are using the correct plugin?
>>
>> This looks to me like you must be picking up a stale library somewhere.
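A quick way to check for that kind of stale-library pickup, standard practice rather than something prescribed in the thread, is to confirm what the binary and the launcher actually resolve:

  $ which mpirun
  $ echo $LD_LIBRARY_PATH
  $ ldd ./a.out | grep -i -e mpi -e open-pal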
>>> On Aug 29, 2016, at 10:29 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>
>>> Hi Ralph,
>>>
>>> I used the tarball from Aug 26 and added the patch. Tested with 2 nodes, 10 cores/node. Please see the results below:
>>>
>>> $ mpirun ./a.out < test.in
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 35 for process [[43954,1],0]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 41 for process [[43954,1],0]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 43 for process [[43954,1],0]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 37 for process [[43954,1],1]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 46 for process [[43954,1],1]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 49 for process [[43954,1],1]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 38 for process [[43954,1],2]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 50 for process [[43954,1],2]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 52 for process [[43954,1],2]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 42 for process [[43954,1],3]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 53 for process [[43954,1],3]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 55 for process [[43954,1],3]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 45 for process [[43954,1],4]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 56 for process [[43954,1],4]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 58 for process [[43954,1],4]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 47 for process [[43954,1],5]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 59 for process [[43954,1],5]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 61 for process [[43954,1],5]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 57 for process [[43954,1],6]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 64 for process [[43954,1],6]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 66 for process [[43954,1],6]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 62 for process [[43954,1],7]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 68 for process [[43954,1],7]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 70 for process [[43954,1],7]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 65 for process [[43954,1],8]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 72 for process [[43954,1],8]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 74 for process [[43954,1],8]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 75 for process [[43954,1],9]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 79 for process [[43954,1],9]
>>> [c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 81 for process [[43954,1],9]
>>> Rank 5 has cleared MPI_Init
>>> Rank 9 has cleared MPI_Init
>>> Rank 1 has cleared MPI_Init
>>> Rank 2 has cleared MPI_Init
>>> Rank 3 has cleared MPI_Init
>>> Rank 4 has cleared MPI_Init
>>> Rank 8 has cleared MPI_Init
>>> Rank 0 has cleared MPI_Init
>>> Rank 6 has cleared MPI_Init
>>> Rank 7 has cleared MPI_Init
>>> Rank 14 has cleared MPI_Init
>>> Rank 15 has cleared MPI_Init
>>> Rank 16 has cleared MPI_Init
>>> Rank 18 has cleared MPI_Init
>>> Rank 10 has cleared MPI_Init
>>> Rank 11 has cleared MPI_Init
>>> Rank 12 has cleared MPI_Init
>>> Rank 13 has cleared MPI_Init
>>> Rank 17 has cleared MPI_Init
>>> Rank 19 has cleared MPI_Init
>>>
>>> Thanks,
>>>
>>> Dr. Jingchao Zhang
>>> Holland Computing Center
>>> University of Nebraska-Lincoln
>>> 402-472-6400
>>>
>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>> Sent: Saturday, August 27, 2016 12:31:53 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>
>>> I am finding this impossible to replicate, so something odd must be going on. Can you please (a) pull down the latest v2.0.1 nightly tarball, and (b) add this patch to it?
>>>
>>> diff --git a/orte/mca/iof/hnp/iof_hnp.c b/orte/mca/iof/hnp/iof_hnp.c
>>> old mode 100644
>>> new mode 100755
>>> index 512fcdb..362ff46
>>> --- a/orte/mca/iof/hnp/iof_hnp.c
>>> +++ b/orte/mca/iof/hnp/iof_hnp.c
>>> @@ -143,16 +143,17 @@ static int hnp_push(const orte_process_name_t* dst_name, orte_iof_tag_t src_tag,
>>>      int np, numdigs;
>>>      orte_ns_cmp_bitmask_t mask;
>>>
>>> +    opal_output(0,
>>> +                "%s iof:hnp pushing fd %d for process %s",
>>> +                ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>> +                fd, ORTE_NAME_PRINT(dst_name));
>>> +
>>>      /* don't do this if the dst vpid is invalid or the fd is negative! */
>>>      if (ORTE_VPID_INVALID == dst_name->vpid || fd < 0) {
>>>          return ORTE_SUCCESS;
>>>      }
>>>
>>> -    OPAL_OUTPUT_VERBOSE((1, orte_iof_base_framework.framework_output,
>>> -                         "%s iof:hnp pushing fd %d for process %s",
>>> -                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>> -                         fd, ORTE_NAME_PRINT(dst_name)));
>>> -
>>>      if (!(src_tag & ORTE_IOF_STDIN)) {
>>>          /* set the file descriptor to non-blocking - do this before we setup
>>>           * and activate the read event in case it fires right away
>>>
>>> You can then run the test again without the "--mca iof_base_verbose 100" flag to reduce the chatter - this print statement will tell me what I need to know.
>>>
>>> Thanks!
>>> Ralph
>>>
>>>> On Aug 25, 2016, at 8:19 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>
>>>> The IOF fix PR for v2.0.1 was literally just merged a few minutes ago; it wasn't in last night's tarball.
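For anyone reproducing this, applying a unified diff like the one above to an extracted tarball typically looks as follows; the directory and patch file names here are placeholders, not names from the thread:

  $ cd openmpi-v2.0.1-nightly        # extracted nightly tarball (placeholder name)
  $ patch -p1 < iof-debug.patch      # -p1 strips the a/ and b/ path prefixes
  $ ./configure --prefix=$PREFIX && make -j8 && make install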
>>>> >>>> >>>> >>>>> On Aug 25, 2016, at 10:59 AM, r...@open-mpi.org >>>>> <mailto:r...@open-mpi.org> wrote: >>>>> >>>>> ??? Weird - can you send me an updated output of that last test we ran? >>>>> >>>>>> On Aug 25, 2016, at 7:51 AM, Jingchao Zhang <zh...@unl.edu >>>>>> <mailto:zh...@unl.edu>> wrote: >>>>>> >>>>>> Hi Ralph, >>>>>> >>>>>> I saw the pull request and did a test with v2.0.1rc1, but the problem >>>>>> persists. Any ideas? >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Dr. Jingchao Zhang >>>>>> Holland Computing Center >>>>>> University of Nebraska-Lincoln >>>>>> 402-472-6400 >>>>>> From: users <users-boun...@lists.open-mpi.org >>>>>> <mailto:users-boun...@lists.open-mpi.org>> on behalf of >>>>>> r...@open-mpi.org <mailto:r...@open-mpi.org> <r...@open-mpi.org >>>>>> <mailto:r...@open-mpi.org>> >>>>>> Sent: Wednesday, August 24, 2016 1:27:28 PM >>>>>> To: Open MPI Users >>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0 >>>>>> >>>>>> Bingo - found it, fix submitted and hope to get it into 2.0.1 >>>>>> >>>>>> Thanks for the assist! >>>>>> Ralph >>>>>> >>>>>> >>>>>>> On Aug 24, 2016, at 12:15 PM, Jingchao Zhang <zh...@unl.edu >>>>>>> <mailto:zh...@unl.edu>> wrote: >>>>>>> >>>>>>> I configured v2.0.1rc1 with --enable-debug and ran the test with --mca >>>>>>> iof_base_verbose 100. I also added -display-devel-map in case it >>>>>>> provides some useful information. >>>>>>> >>>>>>> Test job has 2 nodes, each node 10 cores. Rank 0 and mpirun command on >>>>>>> the same node. >>>>>>> $ mpirun -display-devel-map --mca iof_base_verbose 100 ./a.out < >>>>>>> test.in &> debug_info.txt >>>>>>> >>>>>>> The debug_info.txt is attached. >>>>>>> >>>>>>> Dr. Jingchao Zhang >>>>>>> Holland Computing Center >>>>>>> University of Nebraska-Lincoln >>>>>>> 402-472-6400 >>>>>>> From: users <users-boun...@lists.open-mpi.org >>>>>>> <mailto:users-boun...@lists.open-mpi.org>> on behalf of >>>>>>> r...@open-mpi.org <mailto:r...@open-mpi.org> <r...@open-mpi.org >>>>>>> <mailto:r...@open-mpi.org>> >>>>>>> Sent: Wednesday, August 24, 2016 12:14:26 PM >>>>>>> To: Open MPI Users >>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0 >>>>>>> >>>>>>> Afraid I can’t replicate a problem at all, whether rank=0 is local or >>>>>>> not. I’m also using bash, but on CentOS-7, so I suspect the OS is the >>>>>>> difference. >>>>>>> >>>>>>> Can you configure OMPI with --enable-debug, and then run the test again >>>>>>> with --mca iof_base_verbose 100? It will hopefully tell us something >>>>>>> about why the IO subsystem is stuck. >>>>>>> >>>>>>> >>>>>>>> On Aug 24, 2016, at 8:46 AM, Jingchao Zhang <zh...@unl.edu >>>>>>>> <mailto:zh...@unl.edu>> wrote: >>>>>>>> >>>>>>>> Hi Ralph, >>>>>>>> >>>>>>>> For our tests, rank 0 is always on the same node with mpirun. I just >>>>>>>> tested mpirun with -nolocal and it still hangs. >>>>>>>> >>>>>>>> Information on shell and OS >>>>>>>> $ echo $0 >>>>>>>> -bash >>>>>>>> >>>>>>>> $ lsb_release -a >>>>>>>> LSB Version: >>>>>>>> :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch >>>>>>>> Distributor ID: Scientific >>>>>>>> Description: Scientific Linux release 6.8 (Carbon) >>>>>>>> Release: 6.8 >>>>>>>> Codename: Carbon >>>>>>>> >>>>>>>> $ uname -a >>>>>>>> Linux login.crane.hcc.unl.edu <http://login.crane.hcc.unl.edu/> >>>>>>>> 2.6.32-642.3.1.el6.x86_64 #1 SMP Tue Jul 12 11:25:51 CDT 2016 x86_64 >>>>>>>> x86_64 x86_64 GNU/Linux >>>>>>>> >>>>>>>> >>>>>>>> Dr. 
>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>> Sent: Tuesday, August 23, 2016 8:14:48 PM
>>>>>>>> To: Open MPI Users
>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>
>>>>>>>> Hmmm...that's a good point. Rank 0 and mpirun are always on the same node on my cluster. I'll give it a try.
>>>>>>>>
>>>>>>>> Jingchao: is rank 0 on the node with mpirun, or on a remote node?
>>>>>>>>
>>>>>>>>> On Aug 23, 2016, at 5:58 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>>>>>>
>>>>>>>>> Ralph,
>>>>>>>>>
>>>>>>>>> did you run task 0 and mpirun on different nodes?
>>>>>>>>>
>>>>>>>>> i observed some random hangs, though i cannot blame openmpi 100% yet
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Gilles
>>>>>>>>>
>>>>>>>>> On 8/24/2016 9:41 AM, r...@open-mpi.org wrote:
>>>>>>>>>> Very strange. I cannot reproduce it, as I'm able to run any number of nodes and procs, pushing over 100 Mbytes through without any problem.
>>>>>>>>>>
>>>>>>>>>> Which leads me to suspect that the issue here is with the tty interface. Can you tell me what shell and OS you are running?
>>>>>>>>>>
>>>>>>>>>>> On Aug 23, 2016, at 3:25 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Everything stuck at MPI_Init. For a test job with 2 nodes and 10 cores each node, I got the following:
>>>>>>>>>>>
>>>>>>>>>>> $ mpirun ./a.out < test.in
>>>>>>>>>>> Rank 2 has cleared MPI_Init
>>>>>>>>>>> Rank 4 has cleared MPI_Init
>>>>>>>>>>> Rank 7 has cleared MPI_Init
>>>>>>>>>>> Rank 8 has cleared MPI_Init
>>>>>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>>>>>> Rank 5 has cleared MPI_Init
>>>>>>>>>>> Rank 6 has cleared MPI_Init
>>>>>>>>>>> Rank 9 has cleared MPI_Init
>>>>>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>>>>>> Rank 16 has cleared MPI_Init
>>>>>>>>>>> Rank 19 has cleared MPI_Init
>>>>>>>>>>> Rank 10 has cleared MPI_Init
>>>>>>>>>>> Rank 11 has cleared MPI_Init
>>>>>>>>>>> Rank 12 has cleared MPI_Init
>>>>>>>>>>> Rank 13 has cleared MPI_Init
>>>>>>>>>>> Rank 14 has cleared MPI_Init
>>>>>>>>>>> Rank 15 has cleared MPI_Init
>>>>>>>>>>> Rank 17 has cleared MPI_Init
>>>>>>>>>>> Rank 18 has cleared MPI_Init
>>>>>>>>>>> Rank 3 has cleared MPI_Init
>>>>>>>>>>>
>>>>>>>>>>> then it just hung.
>>>>>>>>>>>
>>>>>>>>>>> --Jingchao
>>>>>>>>>>>
>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>> 402-472-6400
>>>>>>>>>>>
>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>> Sent: Tuesday, August 23, 2016 4:03:07 PM
>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>>
>>>>>>>>>>> The IO forwarding messages all flow over the Ethernet, so the type of fabric is irrelevant. The number of procs involved would definitely have an impact, but that might not be due to the IO forwarding subsystem. We know we have flow control issues with collectives like Bcast that don't have built-in synchronization points.
>>>>>>>>>>> How many reads were you able to do before it hung?
>>>>>>>>>>>
>>>>>>>>>>> I was running it on my little test setup (2 nodes, using only a few procs), but I'll try scaling up and see what happens. I'll also try introducing some forced "syncs" on the Bcast and see if that solves the issue.
>>>>>>>>>>>
>>>>>>>>>>> Ralph
>>>>>>>>>>>
>>>>>>>>>>>> On Aug 23, 2016, at 2:30 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>
>>>>>>>>>>>> I tested v2.0.1rc1 with your code but it has the same issue. I also installed v2.0.1rc1 on a different cluster which has Mellanox QDR Infiniband and got the same result. For the tests you have done, how many cores and nodes did you use? I can trigger the problem by using multiple nodes, each node with more than 10 cores.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you for looking into this.
>>>>>>>>>>>>
>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>> 402-472-6400
>>>>>>>>>>>>
>>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>>> Sent: Monday, August 22, 2016 10:23:42 PM
>>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>>>
>>>>>>>>>>>> FWIW: I just tested forwarding up to 100 MBytes via stdin using the simple test shown below with OMPI v2.0.1rc1, and it worked fine. So I'd suggest upgrading when the official release comes out, or going ahead and at least testing 2.0.1rc1 on your machine. Or you can test this program with some input file and let me know if it works for you.
>>>>>>>>>>>>
>>>>>>>>>>>> Ralph
>>>>>>>>>>>>
>>>>>>>>>>>> #include <stdlib.h>
>>>>>>>>>>>> #include <stdio.h>
>>>>>>>>>>>> #include <string.h>
>>>>>>>>>>>> #include <stdbool.h>
>>>>>>>>>>>> #include <unistd.h>
>>>>>>>>>>>> #include <mpi.h>
>>>>>>>>>>>>
>>>>>>>>>>>> #define ORTE_IOF_BASE_MSG_MAX 2048
>>>>>>>>>>>>
>>>>>>>>>>>> int main(int argc, char *argv[])
>>>>>>>>>>>> {
>>>>>>>>>>>>     int i, rank, size, next, prev, tag = 201;
>>>>>>>>>>>>     int pos, msgsize, nbytes;
>>>>>>>>>>>>     bool done;
>>>>>>>>>>>>     char *msg;
>>>>>>>>>>>>
>>>>>>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>>>>>>>
>>>>>>>>>>>>     fprintf(stderr, "Rank %d has cleared MPI_Init\n", rank);
>>>>>>>>>>>>
>>>>>>>>>>>>     next = (rank + 1) % size;
>>>>>>>>>>>>     prev = (rank + size - 1) % size;
>>>>>>>>>>>>     msg = malloc(ORTE_IOF_BASE_MSG_MAX);
>>>>>>>>>>>>     pos = 0;
>>>>>>>>>>>>     nbytes = 0;
>>>>>>>>>>>>
>>>>>>>>>>>>     if (0 == rank) {
>>>>>>>>>>>>         while (0 != (msgsize = read(0, msg, ORTE_IOF_BASE_MSG_MAX))) {
>>>>>>>>>>>>             fprintf(stderr, "Rank %d: sending blob %d\n", rank, pos);
>>>>>>>>>>>>             if (msgsize > 0) {
>>>>>>>>>>>>                 MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>>>>>>>>>>             }
>>>>>>>>>>>>             ++pos;
>>>>>>>>>>>>             nbytes += msgsize;
>>>>>>>>>>>>         }
>>>>>>>>>>>>         fprintf(stderr, "Rank %d: sending termination blob %d\n", rank, pos);
>>>>>>>>>>>>         memset(msg, 0, ORTE_IOF_BASE_MSG_MAX);
>>>>>>>>>>>>         MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>>>>>>>>>>         MPI_Barrier(MPI_COMM_WORLD);
>>>>>>>>>>>>     } else {
>>>>>>>>>>>>         while (1) {
>>>>>>>>>>>>             MPI_Bcast(msg, ORTE_IOF_BASE_MSG_MAX, MPI_BYTE, 0, MPI_COMM_WORLD);
>>>>>>>>>>>>             fprintf(stderr, "Rank %d: recvd blob %d\n", rank, pos);
>>>>>>>>>>>>             ++pos;
>>>>>>>>>>>>             done = true;
>>>>>>>>>>>>             for (i = 0; i < ORTE_IOF_BASE_MSG_MAX; i++) {
>>>>>>>>>>>>                 if (0 != msg[i]) {
>>>>>>>>>>>>                     done = false;
>>>>>>>>>>>>                     break;
>>>>>>>>>>>>                 }
>>>>>>>>>>>>             }
>>>>>>>>>>>>             if (done) {
>>>>>>>>>>>>                 break;
>>>>>>>>>>>>             }
>>>>>>>>>>>>         }
>>>>>>>>>>>>         fprintf(stderr, "Rank %d: recv done\n", rank);
>>>>>>>>>>>>         MPI_Barrier(MPI_COMM_WORLD);
>>>>>>>>>>>>     }
>>>>>>>>>>>>
>>>>>>>>>>>>     fprintf(stderr, "Rank %d has completed bcast\n", rank);
>>>>>>>>>>>>     MPI_Finalize();
>>>>>>>>>>>>     return 0;
>>>>>>>>>>>> }
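The forced "syncs" Ralph mentions a few messages up could be tried against this test program roughly as follows. This is a sketch of the idea, not a posted fix: SYNC_INTERVAL is a made-up name, and each snippet replaces the bare ++pos in its loop so that every rank counts data blobs identically and the barriers pair up:

  #define SYNC_INTERVAL 32   /* hypothetical: rendezvous every 32 blobs */

  /* rank 0, at the bottom of the read/Bcast loop: */
  if (0 == (++pos % SYNC_INTERVAL)) {
      MPI_Barrier(MPI_COMM_WORLD);   /* keep the root from racing ahead of slow receivers */
  }

  /* other ranks: move the increment after the all-zeros check,
   * so only data blobs (not the termination blob) are counted: */
  if (done) {
      break;
  }
  if (0 == (++pos % SYNC_INTERVAL)) {
      MPI_Barrier(MPI_COMM_WORLD);   /* matching barrier on the receive side */
  }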
>>>>>>>>>>>>> On Aug 22, 2016, at 3:40 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> This might be a thin argument, but we have had many users running mpirun this way for years with no problem until this recent upgrade. And some home-brewed mpi codes do not even have a standard way to read input files. Last time I checked, the openmpi manual still claims it supports stdin (https://www.open-mpi.org/doc/v2.0/man1/mpirun.1.php#sect14). Maybe I missed it, but the v2.0 release notes did not mention any changes to the behavior of stdin either.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We can tell our users to run mpirun in the suggested way, but I do hope someone can look into the issue and fix it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>>> 402-472-6400
>>>>>>>>>>>>>
>>>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>>>> Sent: Monday, August 22, 2016 3:04:50 PM
>>>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>>>>
>>>>>>>>>>>>> Well, I can try to find time to take a look. However, I will reiterate what Jeff H said - it is very unwise to rely on IO forwarding. Much better to just directly read the file unless that file is simply unavailable on the node where rank=0 is running.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Aug 22, 2016, at 1:55 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here you can find the source code for the lammps input: https://github.com/lammps/lammps/blob/r13864/src/input.cpp
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Based on the gdb output, rank 0 is stuck at line 167:
>>>>>>>>>>>>>>     if (fgets(&line[m],maxline-m,infile) == NULL)
>>>>>>>>>>>>>> and the rest of the processes are stuck at line 203:
>>>>>>>>>>>>>>     MPI_Bcast(&n,1,MPI_INT,0,world);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So rank 0 possibly hangs in the fgets() function.
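For context, the control flow around those two lines in input.cpp is a root-reads-then-broadcasts loop. The snippet below is a paraphrase of the linked source, simplified but keeping its variable names, to show why a stall in rank 0's fgets() parks every other rank inside MPI_Bcast:

  /* paraphrased from LAMMPS Input::file(), not a verbatim copy */
  while (1) {
      if (me == 0) {
          if (fgets(&line[m], maxline - m, infile) == NULL) n = 0;  /* line 167 */
          else n = strlen(line) + 1;
      }
      MPI_Bcast(&n, 1, MPI_INT, 0, world);   /* line 203: non-root ranks block here */
      if (n == 0) break;                     /* all ranks agree the input is exhausted */
      MPI_Bcast(line, n, MPI_CHAR, 0, world);
      /* ... parse and execute the command ... */
  }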
>>>>>>>>>>>>>> Here is the full backtrace information:
>>>>>>>>>>>>>> $ cat master.backtrace worker.backtrace
>>>>>>>>>>>>>> #0 0x0000003c37cdb68d in read () from /lib64/libc.so.6
>>>>>>>>>>>>>> #1 0x0000003c37c71ca8 in _IO_new_file_underflow () from /lib64/libc.so.6
>>>>>>>>>>>>>> #2 0x0000003c37c737ae in _IO_default_uflow_internal () from /lib64/libc.so.6
>>>>>>>>>>>>>> #3 0x0000003c37c67e8a in _IO_getline_info_internal () from /lib64/libc.so.6
>>>>>>>>>>>>>> #4 0x0000003c37c66ce9 in fgets () from /lib64/libc.so.6
>>>>>>>>>>>>>> #5 0x00000000005c5a43 in LAMMPS_NS::Input::file() () at ../input.cpp:167
>>>>>>>>>>>>>> #6 0x00000000005d4236 in main () at ../main.cpp:31
>>>>>>>>>>>>>> #0 0x00002b1635d2ace2 in poll_dispatch () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>>>>>>>>>>>> #1 0x00002b1635d1fa71 in opal_libevent2022_event_base_loop () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>>>>>>>>>>>> #2 0x00002b1635ce4634 in opal_progress () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libopen-pal.so.20
>>>>>>>>>>>>>> #3 0x00002b16351b8fad in ompi_request_default_wait () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>>>> #4 0x00002b16351fcb40 in ompi_coll_base_bcast_intra_generic () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>>>> #5 0x00002b16351fd0c2 in ompi_coll_base_bcast_intra_binomial () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>>>> #6 0x00002b1644fa6d9b in ompi_coll_tuned_bcast_intra_dec_fixed () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/openmpi/mca_coll_tuned.so
>>>>>>>>>>>>>> #7 0x00002b16351cb4fb in PMPI_Bcast () from /util/opt/openmpi/2.0.0/gcc/6.1.0/lib/libmpi.so.20
>>>>>>>>>>>>>> #8 0x00000000005c5b5d in LAMMPS_NS::Input::file() () at ../input.cpp:203
>>>>>>>>>>>>>> #9 0x00000000005d4236 in main () at ../main.cpp:31
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>>>> 402-472-6400
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>>>>>>>>>>>>>> Sent: Monday, August 22, 2016 2:17:10 PM
>>>>>>>>>>>>>> To: Open MPI Users
>>>>>>>>>>>>>> Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hmmm...perhaps we can break this out a bit? The stdin will be going to your rank=0 proc. It sounds like you have some subsequent step that calls MPI_Bcast?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you first verify that the input is being correctly delivered to rank=0? This will help us isolate whether the problem is in the IO forwarding or in the subsequent Bcast.
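A minimal version of that first check, whether stdin actually reaches rank 0, could look like the sketch below (my own, not a program posted in the thread). It reads stdin on rank 0 only and never calls Bcast, so a hang here would point at the IO forwarding rather than the collective:

  #include <stdio.h>
  #include <unistd.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int rank;
      char buf[2048];
      ssize_t n, total = 0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (0 == rank) {
          /* drain stdin and report how much arrived */
          while ((n = read(0, buf, sizeof(buf))) > 0) {
              total += n;
          }
          fprintf(stderr, "rank 0 read %ld bytes from stdin\n", (long)total);
      }
      MPI_Barrier(MPI_COMM_WORLD);   /* keep the other ranks alive until rank 0 finishes */
      MPI_Finalize();
      return 0;
  }

Run it the same way as the failing jobs, e.g. mpirun ./stdin_check < test.in across two nodes.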
>>>>>>>>>>>>>>> On Aug 22, 2016, at 1:11 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We compiled openmpi/2.0.0 with gcc/6.1.0 and intel/13.1.3. Both of them show odd behaviors when trying to read from standard input.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For example, if we start the application lammps across 4 nodes, each node 16 cores, connected by Intel QDR Infiniband, mpirun works fine the 1st time, but always gets stuck within a few seconds thereafter.
>>>>>>>>>>>>>>> Command:
>>>>>>>>>>>>>>> mpirun ./lmp_ompi_g++ < in.snr
>>>>>>>>>>>>>>> in.snr is the lammps input file. The compiler is gcc/6.1.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Instead, if we use
>>>>>>>>>>>>>>> mpirun ./lmp_ompi_g++ -in in.snr
>>>>>>>>>>>>>>> it works 100% of the time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Some odd behaviors we have gathered so far:
>>>>>>>>>>>>>>> 1. For a 1-node job, stdin always works.
>>>>>>>>>>>>>>> 2. For multiple nodes, stdin works unstably when the number of cores per node is relatively small. For example, for 2/3/4 nodes with 8 cores each, mpirun works most of the time. But with more than 8 cores per node, mpirun works the 1st time, then always gets stuck. There seems to be a magic number where it stops working.
>>>>>>>>>>>>>>> 3. We tested Quantum Espresso with the intel/13 compiler and had the same issue.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We used gdb to debug and found that when mpirun was stuck, the rest of the processes were all waiting on an mpi broadcast from the master thread. The lammps binary, input file and gdb core files (example.tar.bz2) can be downloaded from this link: https://drive.google.com/open?id=0B3Yj4QkZpI-dVWZtWmJ3ZXNVRGc
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Extra information:
>>>>>>>>>>>>>>> 1. The job scheduler is slurm.
>>>>>>>>>>>>>>> 2. configure setup:
>>>>>>>>>>>>>>> ./configure --prefix=$PREFIX \
>>>>>>>>>>>>>>>     --with-hwloc=internal \
>>>>>>>>>>>>>>>     --enable-mpirun-prefix-by-default \
>>>>>>>>>>>>>>>     --with-slurm \
>>>>>>>>>>>>>>>     --with-verbs \
>>>>>>>>>>>>>>>     --with-psm \
>>>>>>>>>>>>>>>     --disable-openib-connectx-xrc \
>>>>>>>>>>>>>>>     --with-knem=/opt/knem-1.1.2.90mlnx1 \
>>>>>>>>>>>>>>>     --with-cma
>>>>>>>>>>>>>>> 3. openmpi-mca-params.conf file:
>>>>>>>>>>>>>>> orte_hetero_nodes=1
>>>>>>>>>>>>>>> hwloc_base_binding_policy=core
>>>>>>>>>>>>>>> rmaps_base_mapping_policy=core
>>>>>>>>>>>>>>> opal_cuda_support=0
>>>>>>>>>>>>>>> btl_openib_use_eager_rdma=0
>>>>>>>>>>>>>>> btl_openib_max_eager_rdma=0
>>>>>>>>>>>>>>> btl_openib_flags=1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Jingchao
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dr. Jingchao Zhang
>>>>>>>>>>>>>>> Holland Computing Center
>>>>>>>>>>>>>>> University of Nebraska-Lincoln
>>>>>>>>>>>>>>> 402-472-6400
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users