No, I installed this version of Open MPI only once, and with intel14.

Vahid
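For reference, a quick way to confirm that a single Open MPI installation is the one actually being picked up (a minimal sketch; the mpiexec and pw.x paths are taken from the commands quoted later in this thread):

    which mpiexec                                   # should resolve into the intended prefix if it is on the PATH
    ~/bin/openmpi-v3.0/bin/mpiexec --version
    ldd /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x | grep -i libmpi   # shows which Open MPI library the binary resolves to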
> On Jan 30, 2018, at 4:41 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> Did you install one version of Open MPI over another version?
>
> https://www.open-mpi.org/faq/?category=building#install-overwrite
>
>
>> On Jan 30, 2018, at 2:09 PM, Vahid Askarpour <vh261...@dal.ca> wrote:
>>
>> This is just an update on how things turned out with openmpi-3.0.x.
>>
>> I compiled both EPW and openmpi with intel14. In the past, EPW crashed for both intel16 and 17. However, with intel14 and openmpi/1.8.8, I have been getting results consistently.
>>
>> The nscf.in worked with the -i argument. However, when I ran EPW with intel14/openmpi-3.0.x, I got the following error:
>>
>> mca_base_component_repository_open: unable to open mca_io_romio314: libgpfs.so: cannot open shared object file: No such file or directory (ignored)
>>
>> What is interesting is that this error occurs in the middle of a long loop. Since the loop repeats over different coordinates, the error may not be coming from the gpfs library.
>>
>> Cheers,
>>
>> Vahid
>>
>>> On Jan 23, 2018, at 9:52 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>
>>> Fair enough,
>>>
>>> To be on the safe side, I encourage you to use the latest Intel compilers.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> Vahid Askarpour <vh261...@dal.ca> wrote:
>>> Gilles,
>>>
>>> I have not tried compiling the latest openmpi with GCC. I am waiting to see how the intel version turns out before attempting GCC.
>>>
>>> Cheers,
>>>
>>> Vahid
>>>
>>>> On Jan 23, 2018, at 9:33 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>>
>>>> Vahid,
>>>>
>>>> There used to be a bug in the IOF part, but I am pretty sure this has already been fixed.
>>>>
>>>> Does the issue also occur with GNU compilers?
>>>> There used to be an issue with the Intel Fortran runtime (short reads/writes were silently ignored), and that was also fixed some time ago.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> Vahid Askarpour <vh261...@dal.ca> wrote:
>>>> This would work for Quantum Espresso input. I am waiting to see what happens to EPW. I don't think EPW accepts the -i argument. I will report back once the EPW job is done.
>>>>
>>>> Cheers,
>>>>
>>>> Vahid
>>>>
>>>>> On Jan 22, 2018, at 6:05 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:
>>>>>
>>>>> Well, my final comment on this topic: as somebody suggested earlier in this email chain, if you provide the input with the -i argument instead of piping from standard input, things seem to work as far as I can see (disclaimer: I do not know what the final outcome should be; I just see that the application does not complain about the 'end of file while reading crystal k points'). So maybe that is the simplest solution.
>>>>>
>>>>> Thanks
>>>>> Edgar
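For reference, the two launch styles being compared here, sketched with the command line used further down in the thread (exact paths and process counts will differ per system):

    # stdin redirection: the Open MPI runtime forwards nscf.in to rank 0's standard input
    mpirun -np 64 pw.x -npool 64 < nscf.in > nscf.out

    # -i argument: pw.x opens nscf.in itself, so no stdin forwarding is involved
    mpirun -np 64 pw.x -npool 64 -i nscf.in > nscf.out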
>>>>> On 1/22/2018 1:17 PM, Edgar Gabriel wrote:
>>>>>> after some further investigation, I am fairly confident that this is not an MPI I/O problem.
>>>>>> The input file input_tmp.in is generated in this sequence of instructions (which is in Modules/open_close_input_file.f90):
>>>>>> ---
>>>>>> IF ( TRIM(input_file_) /= ' ' ) THEN
>>>>>>    !
>>>>>>    ! copy file to be opened into input_file
>>>>>>    !
>>>>>>    input_file = input_file_
>>>>>>    !
>>>>>> ELSE
>>>>>>    !
>>>>>>    ! if no file specified then copy from standard input
>>>>>>    !
>>>>>>    input_file = "input_tmp.in"
>>>>>>    OPEN(UNIT = stdtmp, FILE=trim(input_file), FORM='formatted', &
>>>>>>         STATUS='unknown', IOSTAT = ierr )
>>>>>>    IF ( ierr > 0 ) GO TO 30
>>>>>>    !
>>>>>>    dummy = ' '
>>>>>>    WRITE(stdout, '(5x,a)') "Waiting for input..."
>>>>>>    DO WHILE ( TRIM(dummy) .NE. "MAGICALME" )
>>>>>>       READ (stdin, fmt='(A512)', END=20) dummy
>>>>>>       WRITE (stdtmp, '(A)') trim(dummy)
>>>>>>    END DO
>>>>>>    !
>>>>>> 20 CLOSE ( UNIT=stdtmp, STATUS='keep' )
>>>>>> ----
>>>>>>
>>>>>> Basically, if no input file has been provided, the input file is generated by reading from standard input. Since the application is being launched e.g. with
>>>>>>
>>>>>> mpirun -np 64 ../bin/pw.x -npool 64 < nscf.in > nscf.out
>>>>>>
>>>>>> the data comes from nscf.in. I simply do not know enough about I/O forwarding to be able to tell why we do not see the entire file, but one interesting detail is that if I run it in the debugger, input_tmp.in is created correctly. However, if I run it using mpirun as shown above, the file is truncated, which leads to the error message mentioned in this email chain.
>>>>>> Anyway, I would probably need some help here from somebody who knows the runtime better than me on what could go wrong at this point.
>>>>>> Thanks
>>>>>>
>>>>>> Edgar
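A crude way to exercise the stdin-forwarding path in isolation, independent of Quantum Espresso (a sketch only; nscf.in stands for the same input file discussed above):

    mpirun -np 1 cat < nscf.in > stdin_copy.txt
    cmp nscf.in stdin_copy.txt && echo "stdin forwarded completely" || echo "stdin was truncated"

If the copy comes back short, the truncation happens in the runtime's stdin forwarding rather than anywhere inside pw.x.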
>>>>>> On 1/19/2018 1:22 PM, Vahid Askarpour wrote:
>>>>>>> Concerning the following error
>>>>>>>
>>>>>>> from pw_readschemafile : error # 1
>>>>>>> xml data file not found
>>>>>>>
>>>>>>> The nscf run uses files generated by the scf.in run. So I first run scf.in and, when it finishes, I run nscf.in. If you have done this and still get the above error, then this could be another bug. It does not happen for me with intel14/openmpi-1.8.8.
>>>>>>>
>>>>>>> Thanks for the update,
>>>>>>>
>>>>>>> Vahid
>>>>>>>
>>>>>>> On Jan 19, 2018, at 3:08 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:
>>>>>>>>
>>>>>>>> OK, here is what I found out so far; I will have to stop for today, however:
>>>>>>>>
>>>>>>>> 1. I can in fact reproduce your bug on my systems.
>>>>>>>>
>>>>>>>> 2. I can confirm that the problem occurs with both romio314 and ompio. I *think* the issue is that the input_tmp.in file is incomplete. In both cases (ompio and romio) the end of the file looks as follows (and it is exactly the same for both libraries):
>>>>>>>>
>>>>>>>> gabriel@crill-002:/tmp/gabriel/qe-6.2.1/QE_input_files> tail -10 input_tmp.in
>>>>>>>>   0.66666667  0.50000000  0.83333333  5.787037e-04
>>>>>>>>   0.66666667  0.50000000  0.91666667  5.787037e-04
>>>>>>>>   0.66666667  0.58333333  0.00000000  5.787037e-04
>>>>>>>>   0.66666667  0.58333333  0.08333333  5.787037e-04
>>>>>>>>   0.66666667  0.58333333  0.16666667  5.787037e-04
>>>>>>>>   0.66666667  0.58333333  0.25000000  5.787037e-04
>>>>>>>>   0.66666667  0.58333333  0.33333333  5.787037e-04
>>>>>>>>   0.66666667  0.58333333  0.41666667  5.787037e-04
>>>>>>>>   0.66666667  0.58333333  0.50000000  5.787037e-04
>>>>>>>>   0.66666667  0.58333333  0.58333333  5
>>>>>>>>
>>>>>>>> which is what I *think* causes the problem.
>>>>>>>>
>>>>>>>> 3. I tried to find where input_tmp.in is generated, but haven't completely identified the location. However, I could not find MPI_File_write(_all) operations anywhere in the code, although there are some MPI_File_read(_all) operations.
>>>>>>>>
>>>>>>>> 4. I can confirm that the behavior with Open MPI 1.8.x is different: input_tmp.in looks more complete (at least it doesn't end in the middle of a line). The simulation still does not finish for me, but the error reported is slightly different; I might just be missing a file or something:
>>>>>>>>
>>>>>>>> from pw_readschemafile : error # 1
>>>>>>>> xml data file not found
>>>>>>>>
>>>>>>>> Since I think input_tmp.in is generated from data that is provided in nscf.in, it might very well be something in the MPI_File_read(_all) operation that causes the issue, but since both ompio and romio are affected, there is a good chance that something outside the control of the io components is causing the trouble (maybe a datatype issue that has changed from the 1.8.x series to 3.0.x).
>>>>>>>>
>>>>>>>> 5. Last but not least, I also wanted to mention that I ran all the parallel tests that I found in the test suite (run-tests-cp-parallel, run-tests-pw-parallel, run-tests-ph-parallel, run-tests-epw-parallel), and they all passed with ompio (and with romio314, although I only ran a subset of the tests there).
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Edgar
>>>>>>>>
>>>>>>>> On 01/19/2018 11:44 AM, Vahid Askarpour wrote:
>>>>>>>>> Hi Edgar,
>>>>>>>>>
>>>>>>>>> Just to let you know that the nscf run with --mca io ompio crashed like the other two runs.
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>>
>>>>>>>>> Vahid
>>>>>>>>>
>>>>>>>>> On Jan 19, 2018, at 12:46 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:
>>>>>>>>>>
>>>>>>>>>> OK, thank you for the information. Two short questions and requests. I have qe-6.2.1 compiled and running on my system (although with gcc-6.4 instead of the intel compiler), and I am currently running the parallel test suite. So far, all the tests have passed, although it is still running.
>>>>>>>>>>
>>>>>>>>>> My first question: would it be possible for you to give me access to exactly the same data set that you are using? You could upload it to a webpage or similar and just send me the link.
>>>>>>>>>> The second question/request: could you rerun your tests one more time, this time forcing ompio, e.g. with --mca io ompio?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Edgar
>>>>>>>>>>
>>>>>>>>>> On 1/19/2018 10:32 AM, Vahid Askarpour wrote:
>>>>>>>>>>> To run EPW, the command for running the preliminary nscf run is (http://epw.org.uk/Documentation/B-dopedDiamond):
>>>>>>>>>>>
>>>>>>>>>>> ~/bin/openmpi-v3.0/bin/mpiexec -np 64 /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out
>>>>>>>>>>>
>>>>>>>>>>> So I submitted it with the following command:
>>>>>>>>>>>
>>>>>>>>>>> ~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64 /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out
>>>>>>>>>>>
>>>>>>>>>>> And it crashed like the first time.
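As a side check, the io components actually compiled into a given Open MPI installation can be listed with ompi_info (a minimal sketch; the prefix is the one used in the commands above):

    ~/bin/openmpi-v3.0/bin/ompi_info | grep "MCA io"   # expected to list the available io components, e.g. ompio and romio314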
>>>>>>>>>>> It is interesting that the preliminary scf run works fine. The scf run requires Quantum Espresso to generate the k points automatically, as shown below:
>>>>>>>>>>>
>>>>>>>>>>> K_POINTS (automatic)
>>>>>>>>>>> 12 12 12 0 0 0
>>>>>>>>>>>
>>>>>>>>>>> The nscf run which crashes includes a list of k points (1728 in this case), as seen below:
>>>>>>>>>>>
>>>>>>>>>>> K_POINTS (crystal)
>>>>>>>>>>> 1728
>>>>>>>>>>>   0.00000000  0.00000000  0.00000000  5.787037e-04
>>>>>>>>>>>   0.00000000  0.00000000  0.08333333  5.787037e-04
>>>>>>>>>>>   0.00000000  0.00000000  0.16666667  5.787037e-04
>>>>>>>>>>>   0.00000000  0.00000000  0.25000000  5.787037e-04
>>>>>>>>>>>   0.00000000  0.00000000  0.33333333  5.787037e-04
>>>>>>>>>>>   0.00000000  0.00000000  0.41666667  5.787037e-04
>>>>>>>>>>>   0.00000000  0.00000000  0.50000000  5.787037e-04
>>>>>>>>>>>   0.00000000  0.00000000  0.58333333  5.787037e-04
>>>>>>>>>>>   0.00000000  0.00000000  0.66666667  5.787037e-04
>>>>>>>>>>>   0.00000000  0.00000000  0.75000000  5.787037e-04
>>>>>>>>>>>   …….
>>>>>>>>>>>   …….
>>>>>>>>>>>
>>>>>>>>>>> To build openmpi (either 1.10.7 or 3.0.x), I loaded the fortran compiler module, configured with only "--prefix=", and then ran "make all install". I did not enable or disable any other options.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> Vahid
>>>>>>>>>>>
>>>>>>>>>>> On Jan 19, 2018, at 10:23 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, that is interesting. Since /scratch is a Lustre file system, Open MPI should actually use romio314 for that anyway, not ompio. What I have seen happen on at least one occasion, however, is that ompio was still used because (I suspect) romio314 did not pick up the configuration options correctly. It is a little bit of a mess from that perspective that we have to pass the romio arguments with different flags/options than for ompio, e.g.
>>>>>>>>>>>>
>>>>>>>>>>>> --with-lustre=/path/to/lustre/
>>>>>>>>>>>> --with-io-romio-flags="--with-file-system=ufs+nfs+lustre --with-lustre=/path/to/lustre"
>>>>>>>>>>>>
>>>>>>>>>>>> ompio should pick up the Lustre options correctly if the Lustre headers/libraries are found at the default location, even if the user did not pass the --with-lustre option. I am not entirely sure what happens in romio if the user did not pass --with-file-system=ufs+nfs+lustre but the Lustre headers/libraries are found at the default location, i.e. whether the Lustre adio component is still compiled or not.
>>>>>>>>>>>>
>>>>>>>>>>>> Anyway, let's wait for the outcome of your run forcing the romio314 component, and I will still try to reproduce your problem on my system.
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Edgar
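Putting the flags Edgar mentions together, a full build along those lines might look like the sketch below (an illustration only; /path/to/lustre is a placeholder, and the prefix matches the one used earlier in the thread):

    ./configure --prefix=$HOME/bin/openmpi-v3.0 \
                --with-lustre=/path/to/lustre \
                --with-io-romio-flags="--with-file-system=ufs+nfs+lustre --with-lustre=/path/to/lustre"
    make all install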
>>>>>>>>>>>> On 1/19/2018 7:15 AM, Vahid Askarpour wrote:
>>>>>>>>>>>>> Gilles,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have submitted that job with --mca io romio314. If it finishes, I will let you know. It is sitting in Conte's queue at Purdue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As to Edgar's question about the file system, here is the output of df -Th:
>>>>>>>>>>>>>
>>>>>>>>>>>>> vaskarpo@conte-fe00:~ $ df -Th
>>>>>>>>>>>>> Filesystem                                 Type    Size  Used  Avail Use% Mounted on
>>>>>>>>>>>>> /dev/sda1                                  ext4    435G   16G   398G   4% /
>>>>>>>>>>>>> tmpfs                                      tmpfs    16G  1.4M    16G   1% /dev/shm
>>>>>>>>>>>>> persistent-nfs.rcac.purdue.edu:/persistent/home
>>>>>>>>>>>>>                                            nfs      80T   64T    17T  80% /home
>>>>>>>>>>>>> persistent-nfs.rcac.purdue.edu:/persistent/apps
>>>>>>>>>>>>>                                            nfs     8.0T  4.0T   4.1T  49% /apps
>>>>>>>>>>>>> mds-d01-ib.rcac.purdue.edu@o2ib1:mds-d02-ib.rcac.purdue.edu@o2ib1:/lustreD
>>>>>>>>>>>>>                                            lustre  1.4P  994T   347T  75% /scratch/conte
>>>>>>>>>>>>> depotint-nfs.rcac.purdue.edu:/depot        nfs     4.5P  3.0P   1.6P  66% /depot
>>>>>>>>>>>>> 172.18.84.186:/persistent/fsadmin          nfs     200G  130G    71G  65% /usr/rmt_share/fsadmin
>>>>>>>>>>>>>
>>>>>>>>>>>>> The code is compiled in my $HOME and is run on the scratch file system.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Vahid
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jan 18, 2018, at 10:14 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Vahid,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the v1.10 series, the default MPI-IO component was ROMIO-based; in the v3 series, it is now ompio. You can force the latest Open MPI to use the ROMIO-based component with
>>>>>>>>>>>>>> mpirun --mca io romio314 ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That being said, your description (e.g. a hand-edited file) suggests that the I/O is not performed with MPI-IO, which makes me very puzzled as to why the latest Open MPI is crashing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Gilles
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Jan 19, 2018 at 10:55 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I will try to reproduce this problem with 3.0.x, but it might take me a couple of days to get to it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Since it seems to have worked with 2.0.x (except for the problem of running out of file handles), there is the suspicion that one of the fixes we introduced since then is the problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What file system did you run it on? NFS?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Edgar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 1/18/2018 5:17 PM, Jeff Squyres (jsquyres) wrote:
>>>>>>>>>>>>>>>> On Jan 18, 2018, at 5:53 PM, Vahid Askarpour <vh261...@dal.ca> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> My openmpi3.0.x run (called the nscf run) was reading data from a routine Quantum Espresso input file edited by hand. The preliminary run (called the scf run) was done with openmpi3.0.x on a similar input file, also edited by hand.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Gotcha.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Well, that's a little disappointing.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It would be good to understand why it is crashing -- is the app doing something that is accidentally not standard? Is there a bug in (soon to be released) Open MPI 3.0.1? ...?
>
> --
> Jeff Squyres
> jsquy...@cisco.com

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users