Did you install one version of Open MPI over another version? https://www.open-mpi.org/faq/?category=building#install-overwrite
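A quick way to check whether an older installation is being picked up (the application path below is just a placeholder) is to compare what the shell and the binary actually resolve to:

    which mpirun
    mpirun --version
    ompi_info | head
    ldd /path/to/pw.x | grep libmpi

If these point at a different prefix or version than the one just built, the overwrite scenario described in that FAQ entry is the likely cause.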
On Jan 30, 2018, at 2:09 PM, Vahid Askarpour <vh261...@dal.ca> wrote:

This is just an update on how things turned out with openmpi-3.0.x.

I compiled both EPW and Open MPI with intel14. In the past, EPW crashed with both intel16 and intel17. However, with intel14 and openmpi/1.8.8 I have been getting results consistently.

The nscf.in run worked with the -i argument. However, when I run EPW with intel14/openmpi-3.0.x, I get the following error:

    mca_base_component_repository_open: unable to open mca_io_romio314:
    libgpfs.so: cannot open shared object file: No such file or directory (ignored)

What is interesting is that this error occurs in the middle of a long loop. Since the loop repeats over different coordinates, the error may not be coming from the gpfs library.

Cheers,
Vahid

On Jan 23, 2018, at 9:52 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

Fair enough.

To be on the safe side, I encourage you to use the latest Intel compilers.

Cheers,
Gilles

Vahid Askarpour <vh261...@dal.ca> wrote:

Gilles,

I have not tried compiling the latest Open MPI with GCC. I am waiting to see how the Intel version turns out before attempting GCC.

Cheers,
Vahid

On Jan 23, 2018, at 9:33 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

Vahid,

There used to be a bug in the IOF part, but I am pretty sure this has already been fixed.

Does the issue also occur with GNU compilers? There used to be an issue with the Intel Fortran runtime (short reads/writes were silently ignored), and that was also fixed some time ago.

Cheers,
Gilles

Vahid Askarpour <vh261...@dal.ca> wrote:

This would work for Quantum Espresso input. I am waiting to see what happens with EPW; I don't think EPW accepts the -i argument. I will report back once the EPW job is done.

Cheers,
Vahid

On Jan 22, 2018, at 6:05 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:

Well, my final comment on this topic: as somebody suggested earlier in this email chain, if you provide the input with the -i argument instead of piping it from standard input, things seem to work as far as I can see (disclaimer: I do not know what the final outcome should be; I just see that the application does not complain about the 'end of file while reading crystal k points'). So maybe that is the simplest solution.

Thanks,
Edgar
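For concreteness, the two launch styles being compared look roughly like this (executable path, process count, and -npool value are taken from the commands quoted later in the thread; treat them as placeholders):

    # piping the input through standard input (the case that ends up with a truncated input_tmp.in)
    mpirun -np 64 pw.x -npool 64 < nscf.in > nscf.out

    # passing the input file directly to pw.x with -i, bypassing stdin forwarding
    mpirun -np 64 pw.x -npool 64 -i nscf.in > nscf.out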
On 1/22/2018 1:17 PM, Edgar Gabriel wrote:

After some further investigation, I am fairly confident that this is not an MPI I/O problem. The input file input_tmp.in is generated in this sequence of instructions (which is in Modules/open_close_input_file.f90):

---
IF ( TRIM(input_file_) /= ' ' ) THEN
   !
   ! copy file to be opened into input_file
   !
   input_file = input_file_
   !
ELSE
   !
   ! if no file specified then copy from standard input
   !
   input_file="input_tmp.in"
   OPEN(UNIT = stdtmp, FILE=trim(input_file), FORM='formatted', &
        STATUS='unknown', IOSTAT = ierr )
   IF ( ierr > 0 ) GO TO 30
   !
   dummy=' '
   WRITE(stdout, '(5x,a)') "Waiting for input..."
   DO WHILE ( TRIM(dummy) .NE. "MAGICALME" )
      READ (stdin,fmt='(A512)',END=20) dummy
      WRITE (stdtmp,'(A)') trim(dummy)
   END DO
   !
20 CLOSE ( UNIT=stdtmp, STATUS='keep' )
---

Basically, if no input file has been provided, the input file is generated by reading from standard input. Since the application is being launched with, e.g.,

    mpirun -np 64 ../bin/pw.x -npool 64 <nscf.in >nscf.out

the data comes from nscf.in. I simply do not know enough about I/O forwarding to be able to tell why we do not see the entire file, but one interesting detail is that if I run it in the debugger, input_tmp.in is created correctly. However, if I run it using mpirun as shown above, the file is cropped, which leads to the error message mentioned in this email chain. Anyway, I would probably need some help here from somebody who knows the runtime better than I do on what could go wrong at this point.

Thanks,
Edgar
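Since input_tmp.in is meant to be a line-by-line copy of whatever is piped in (modulo the trim() in the loop above), one quick sanity check, assuming nscf.in is the file being redirected, is to compare the two directly:

    wc -c nscf.in input_tmp.in
    diff nscf.in input_tmp.in | tail

A complete copy should show at most trailing-whitespace differences; the truncation described here shows up immediately at the end of the file.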
On 1/19/2018 1:22 PM, Vahid Askarpour wrote:

Concerning the following error:

    from pw_readschemafile : error # 1
    xml data file not found

The nscf run uses files generated by the scf.in run, so I first run scf.in and, when it finishes, I run nscf.in. If you have done this and still get the above error, then this could be another bug. It does not happen for me with intel14/openmpi-1.8.8.

Thanks for the update,
Vahid

On Jan 19, 2018, at 3:08 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:

OK, here is what I found out so far; I will have to stop here for today, however:

1. I can in fact reproduce your bug on my systems.

2. I can confirm that the problem occurs both with romio314 and ompio. I *think* the issue is that the input_tmp.in file is incomplete. In both cases (ompio and romio) the end of the file looks as follows (and it is exactly the same for both libraries):

    gabriel@crill-002:/tmp/gabriel/qe-6.2.1/QE_input_files> tail -10 input_tmp.in
    0.66666667  0.50000000  0.83333333  5.787037e-04
    0.66666667  0.50000000  0.91666667  5.787037e-04
    0.66666667  0.58333333  0.00000000  5.787037e-04
    0.66666667  0.58333333  0.08333333  5.787037e-04
    0.66666667  0.58333333  0.16666667  5.787037e-04
    0.66666667  0.58333333  0.25000000  5.787037e-04
    0.66666667  0.58333333  0.33333333  5.787037e-04
    0.66666667  0.58333333  0.41666667  5.787037e-04
    0.66666667  0.58333333  0.50000000  5.787037e-04
    0.66666667  0.58333333  0.58333333  5

which is what I *think* causes the problem.

3. I tried to find where input_tmp.in is generated, but I have not completely identified the location. However, I could not find MPI_File_write(_all) operations anywhere in the code, although there are some MPI_File_read(_all) operations.

4. I can confirm that the behavior with Open MPI 1.8.x is different: input_tmp.in looks more complete (at least it does not end in the middle of a line). The simulation still does not finish for me, but the bug reported is slightly different; I might just be missing a file or something:

    from pw_readschemafile : error # 1
    xml data file not found

Since I think input_tmp.in is generated from data that is provided in nscf.in, it might very well be something in the MPI_File_read(_all) operation that causes the issue, but since both ompio and romio are affected, there is a good chance that something outside the control of the io components is causing the trouble (maybe a datatype issue that has changed from the 1.8.x series to 3.0.x).

5. Last but not least, I also wanted to mention that I ran all the parallel tests that I found in the test suite (run-tests-cp-parallel, run-tests-pw-parallel, run-tests-ph-parallel, run-tests-epw-parallel), and they all passed with ompio (and romio314, although I only ran a subset of the tests with romio314).

Thanks,
Edgar

On 01/19/2018 11:44 AM, Vahid Askarpour wrote:

Hi Edgar,

Just to let you know that the nscf run with --mca io ompio crashed like the other two runs.

Thank you,
Vahid

On Jan 19, 2018, at 12:46 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:

OK, thank you for the information. Two short questions and requests. I have qe-6.2.1 compiled and running on my system (although with gcc-6.4 instead of the Intel compiler), and I am currently running the parallel test suite. So far all the tests have passed, although it is still running.

My first question: would it be possible for you to give me access to exactly the same data set that you are using? You could upload it to a web page or similar and just send me the link. The second question/request: could you rerun your tests one more time, this time forcing ompio, e.g. with --mca io ompio?

Thanks,
Edgar
On 1/19/2018 10:32 AM, Vahid Askarpour wrote:

To run EPW, the command for the preliminary nscf run is (http://epw.org.uk/Documentation/B-dopedDiamond):

    ~/bin/openmpi-v3.0/bin/mpiexec -np 64 /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out

So I submitted it with the following command:

    ~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64 /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out

and it crashed like the first time.

It is interesting that the preliminary scf run works fine. The scf run requires Quantum Espresso to generate the k points automatically, as shown below:

    K_POINTS (automatic)
    12 12 12 0 0 0

The nscf run which crashes includes a list of k points (1728 in this case), as seen below:

    K_POINTS (crystal)
    1728
    0.00000000  0.00000000  0.00000000  5.787037e-04
    0.00000000  0.00000000  0.08333333  5.787037e-04
    0.00000000  0.00000000  0.16666667  5.787037e-04
    0.00000000  0.00000000  0.25000000  5.787037e-04
    0.00000000  0.00000000  0.33333333  5.787037e-04
    0.00000000  0.00000000  0.41666667  5.787037e-04
    0.00000000  0.00000000  0.50000000  5.787037e-04
    0.00000000  0.00000000  0.58333333  5.787037e-04
    0.00000000  0.00000000  0.66666667  5.787037e-04
    0.00000000  0.00000000  0.75000000  5.787037e-04
    …….
    …….

To build Open MPI (either 1.10.7 or 3.0.x), I loaded the Fortran compiler module, configured with only "--prefix=", and then ran "make all install". I did not enable or disable any other options.

Cheers,
Vahid

On Jan 19, 2018, at 10:23 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote:

Thanks, that is interesting. Since /scratch is a Lustre file system, Open MPI should actually use romio314 for that anyway, not ompio. What I have seen happen on at least one occasion, however, is that ompio was still used because (I suspect) romio314 did not pick up the configuration options correctly. It is a bit of a mess from that perspective that we have to pass the ROMIO arguments with different flags/options than for ompio, e.g.

    --with-lustre=/path/to/lustre/
    --with-io-romio-flags="--with-file-system=ufs+nfs+lustre --with-lustre=/path/to/lustre"

ompio should pick up the Lustre options correctly if the Lustre headers/libraries are found at the default location, even if the user did not pass the --with-lustre option. I am not entirely sure what happens in ROMIO if the user did not pass --with-file-system=ufs+nfs+lustre but the Lustre headers/libraries are found at the default location, i.e. whether the Lustre ADIO component is still compiled or not.

Anyway, let's wait for the outcome of your run forcing the romio314 component, and I will still try to reproduce your problem on my system.

Thanks,
Edgar
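Putting Edgar's flags together with the build procedure Vahid describes, a Lustre-aware configure line might look roughly like this (the installation prefix and the Lustre path are placeholders; Vahid's actual builds used only --prefix):

    ./configure --prefix=$HOME/bin/openmpi-v3.0 \
        --with-lustre=/path/to/lustre \
        --with-io-romio-flags="--with-file-system=ufs+nfs+lustre --with-lustre=/path/to/lustre"
    make all install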
On 1/19/2018 7:15 AM, Vahid Askarpour wrote:

Gilles,

I have submitted that job with --mca io romio314. If it finishes, I will let you know. It is sitting in Conte's queue at Purdue.

As to Edgar's question about the file system, here is the output of df -Th:

    vaskarpo@conte-fe00:~ $ df -Th
    Filesystem                                       Type    Size  Used  Avail  Use%  Mounted on
    /dev/sda1                                        ext4    435G   16G   398G    4%  /
    tmpfs                                            tmpfs    16G  1.4M    16G    1%  /dev/shm
    persistent-nfs.rcac.purdue.edu:/persistent/home  nfs      80T   64T    17T   80%  /home
    persistent-nfs.rcac.purdue.edu:/persistent/apps  nfs     8.0T  4.0T   4.1T   49%  /apps
    mds-d01-ib.rcac.purdue.edu@o2ib1:mds-d02-ib.rcac.purdue.edu@o2ib1:/lustreD
                                                     lustre  1.4P  994T   347T   75%  /scratch/conte
    depotint-nfs.rcac.purdue.edu:/depot              nfs     4.5P  3.0P   1.6P   66%  /depot
    172.18.84.186:/persistent/fsadmin                nfs     200G  130G    71G   65%  /usr/rmt_share/fsadmin

The code is compiled in my $HOME and is run on the scratch file system.

Cheers,
Vahid

On Jan 18, 2018, at 10:14 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

Vahid,

In the v1.10 series, the default MPI-IO component was ROMIO based; in the v3 series, it is now ompio. You can force the latest Open MPI to use the ROMIO-based component with

    mpirun --mca io romio314 ...

That being said, your description (e.g. a hand-edited file) suggests that I/O is not performed with MPI-IO, which makes me very puzzled about why the latest Open MPI is crashing.

Cheers,
Gilles
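To see which MPI-IO components a given installation actually provides (and therefore which names are valid after --mca io), ompi_info can be queried, for example:

    ompi_info | grep "MCA io"

On a 3.0.x build this would normally list ompio and, if it was built, romio314.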
On Fri, Jan 19, 2018 at 10:55 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote:

I will try to reproduce this problem with 3.0.x, but it might take me a couple of days to get to it.

Since it seemed to have worked with 2.0.x (except for the running-out-of-file-handles problem), there is the suspicion that one of the fixes we introduced since then is the problem.

What file system did you run it on? NFS?

Thanks,
Edgar

On 1/18/2018 5:17 PM, Jeff Squyres (jsquyres) wrote:

> On Jan 18, 2018, at 5:53 PM, Vahid Askarpour <vh261...@dal.ca> wrote:
>
> My openmpi3.0.x run (called the nscf run) was reading data from a routine Quantum Espresso input file edited by hand. The preliminary run (called the scf run) was done with openmpi3.0.x on a similar input file, also edited by hand.

Gotcha.

Well, that's a little disappointing.

It would be good to understand why it is crashing -- is the app doing something that is accidentally not standard? Is there a bug in (soon to be released) Open MPI 3.0.1? ...?

--
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users