Fair enough. To be on the safe side, I encourage you to use the latest Intel compilers.
Cheers,
Gilles

Vahid Askarpour <vh261...@dal.ca> wrote:

Gilles,

I have not tried compiling the latest Open MPI with GCC. I am waiting to see how the Intel version turns out before attempting GCC.

Cheers,
Vahid

On Jan 23, 2018, at 9:33 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

Vahid,

There used to be a bug in the IOF part, but I am pretty sure this has already been fixed.

Does the issue also occur with GNU compilers? There used to be an issue with the Intel Fortran runtime (short reads/writes were silently ignored), and that was also fixed some time ago.

Cheers,
Gilles

Vahid Askarpour <vh261...@dal.ca> wrote:

This would work for Quantum Espresso input. I am waiting to see what happens with EPW; I don't think EPW accepts the -i argument. I will report back once the EPW job is done.

Cheers,
Vahid

On Jan 22, 2018, at 6:05 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:

Well, my final comment on this topic: as somebody suggested earlier in this email chain, if you provide the input with the -i argument instead of piping it from standard input, things seem to work as far as I can see (disclaimer: I do not know what the final outcome should be; I just see that the application does not complain about the 'end of file while reading crystal k points'). So maybe that is the simplest solution.

Thanks,
Edgar

On 1/22/2018 1:17 PM, Edgar Gabriel wrote:

After some further investigation, I am fairly confident that this is not an MPI I/O problem.

The input file input_tmp.in is generated in this sequence of instructions (in Modules/open_close_input_file.f90):

---
  IF ( TRIM(input_file_) /= ' ' ) THEN
     !
     ! copy file to be opened into input_file
     !
     input_file = input_file_
     !
  ELSE
     !
     ! if no file specified then copy from standard input
     !
     input_file = "input_tmp.in"
     OPEN(UNIT = stdtmp, FILE = trim(input_file), FORM = 'formatted', &
          STATUS = 'unknown', IOSTAT = ierr )
     IF ( ierr > 0 ) GO TO 30
     !
     dummy = ' '
     WRITE(stdout, '(5x,a)') "Waiting for input..."
     DO WHILE ( TRIM(dummy) .NE. "MAGICALME" )
        READ (stdin, fmt='(A512)', END=20) dummy
        WRITE (stdtmp, '(A)') trim(dummy)
     END DO
     !
20   CLOSE ( UNIT = stdtmp, STATUS = 'keep' )
---

Basically, if no input file has been provided, the input file is generated by reading from standard input. Since the application is being launched e.g. with

mpirun -np 64 ../bin/pw.x -npool 64 < nscf.in > nscf.out

the data comes from nscf.in. I simply do not know enough about I/O forwarding to be able to tell why we do not see the entire file, but one interesting detail is that if I run it in the debugger, input_tmp.in is created correctly. However, if I run it using mpirun as shown above, the file is cropped, which leads to the error message mentioned in this email chain.

Anyway, I would probably need some help here from somebody who knows the runtime better than I do on what could go wrong at this point.

Thanks,
Edgar
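One way to check whether the truncation comes from mpirun's stdin forwarding rather than from the Fortran copy loop quoted above is to push the same input file through a trivial command under mpirun and compare the result. This is only a sketch: mpirun is assumed to be the Open MPI launcher on your PATH, nscf_copy.in is an arbitrary scratch name, and by default Open MPI forwards stdin only to rank 0.

# forward nscf.in through a single rank and compare against the original
mpirun -np 1 cat < nscf.in > nscf_copy.in
cmp nscf.in nscf_copy.in && echo "stdin arrived intact" || echo "stdin was truncated"

# repeat at the job size used for pw.x; stdin still goes only to rank 0
mpirun -np 64 cat < nscf.in > nscf_copy.in
cmp nscf.in nscf_copy.in

If the copy already comes out short here, the problem sits in the runtime's stdin forwarding and has nothing to do with Quantum Espresso or MPI-IO.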
On 1/19/2018 1:22 PM, Vahid Askarpour wrote:

Concerning the following error:

    from pw_readschemafile : error #     1
    xml data file not found

The nscf run uses files generated by the scf.in run, so I first run scf.in and, when it finishes, I run nscf.in. If you have done this and still get the above error, then this could be another bug. It does not happen for me with intel14/openmpi-1.8.8.

Thanks for the update,
Vahid

On Jan 19, 2018, at 3:08 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:

OK, here is what I have found out so far; I will have to stop here for today, however:

1. I can in fact reproduce your bug on my systems.

2. I can confirm that the problem occurs both with romio314 and ompio. I *think* the issue is that the input_tmp.in file is incomplete. In both cases (ompio and romio) the end of the file looks as follows (and it is exactly the same for both libraries):

gabriel@crill-002:/tmp/gabriel/qe-6.2.1/QE_input_files> tail -10 input_tmp.in
  0.66666667  0.50000000  0.83333333  5.787037e-04
  0.66666667  0.50000000  0.91666667  5.787037e-04
  0.66666667  0.58333333  0.00000000  5.787037e-04
  0.66666667  0.58333333  0.08333333  5.787037e-04
  0.66666667  0.58333333  0.16666667  5.787037e-04
  0.66666667  0.58333333  0.25000000  5.787037e-04
  0.66666667  0.58333333  0.33333333  5.787037e-04
  0.66666667  0.58333333  0.41666667  5.787037e-04
  0.66666667  0.58333333  0.50000000  5.787037e-04
  0.66666667  0.58333333  0.58333333  5

which is what I *think* causes the problem.

3. I tried to find where input_tmp.in is generated, but have not completely identified the location. However, I could not find MPI file_write(_all) operations anywhere in the code, although there are some MPI_file_read(_all) operations.

4. I can confirm that the behavior with Open MPI 1.8.x is different: input_tmp.in looks more complete (at least it does not end in the middle of a line). The simulation still does not finish for me, but the error reported is slightly different; I might just be missing a file or something:

    from pw_readschemafile : error #     1
    xml data file not found

Since I think input_tmp.in is generated from data that is provided in nscf.in, it might very well be something in the MPI_File_read(_all) operation that causes the issue, but since both ompio and romio are affected, there is a good chance that something outside the control of the io components is causing the trouble (maybe a datatype issue that has changed from the 1.8.x series to 3.0.x).

5. Last but not least, I also wanted to mention that I ran all the parallel tests that I found in the test suite (run-tests-cp-parallel, run-tests-pw-parallel, run-tests-ph-parallel, run-tests-epw-parallel), and they all passed with ompio (and with romio314, although I only ran a subset of the tests there).

Thanks,
Edgar

On 01/19/2018 11:44 AM, Vahid Askarpour wrote:

Hi Edgar,

Just to let you know that the nscf run with --mca io ompio crashed like the other two runs.

Thank you,
Vahid

On Jan 19, 2018, at 12:46 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:

OK, thank you for the information. Two short questions and requests. I have qe-6.2.1 compiled and running on my system (although with gcc-6.4 instead of the Intel compiler), and I am currently running the parallel test suite. So far all the tests have passed, although it is still running.

My first question: would it be possible for you to give me access to exactly the same data set that you are using? You could upload it to a web page or similar and just send me the link.

The second question/request: could you rerun your tests one more time, this time forcing ompio? e.g.

--mca io ompio

Thanks,
Edgar
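For reference, the io framework can be pinned to either implementation on the mpirun command line, and ompi_info shows which io components a given installation actually contains. These command lines are illustrative only; adjust the binary paths, process counts, and input files to your own setup.

# force OMPIO for this run
mpirun --mca io ompio -np 64 pw.x -npool 64 < nscf.in > nscf.out

# force the ROMIO-based component instead
mpirun --mca io romio314 -np 64 pw.x -npool 64 < nscf.in > nscf.out

# list the io components built into this Open MPI installation
ompi_info | grep "MCA io"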
On 1/19/2018 10:32 AM, Vahid Askarpour wrote:

To run EPW, the command for the preliminary nscf run is (http://epw.org.uk/Documentation/B-dopedDiamond):

~/bin/openmpi-v3.0/bin/mpiexec -np 64 /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out

So I submitted it with the following command:

~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64 /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out

And it crashed like the first time.

It is interesting that the preliminary scf run works fine. The scf run requires Quantum Espresso to generate the k points automatically, as shown below:

K_POINTS (automatic)
12 12 12 0 0 0

The nscf run, which crashes, includes a list of k points (1728 in this case), as seen below:

K_POINTS (crystal)
1728
  0.00000000  0.00000000  0.00000000  5.787037e-04
  0.00000000  0.00000000  0.08333333  5.787037e-04
  0.00000000  0.00000000  0.16666667  5.787037e-04
  0.00000000  0.00000000  0.25000000  5.787037e-04
  0.00000000  0.00000000  0.33333333  5.787037e-04
  0.00000000  0.00000000  0.41666667  5.787037e-04
  0.00000000  0.00000000  0.50000000  5.787037e-04
  0.00000000  0.00000000  0.58333333  5.787037e-04
  0.00000000  0.00000000  0.66666667  5.787037e-04
  0.00000000  0.00000000  0.75000000  5.787037e-04
  .......
  .......

To build Open MPI (either 1.10.7 or 3.0.x), I loaded the Fortran compiler module, configured with only "--prefix=", and then ran "make all install". I did not enable or disable any other options.

Cheers,
Vahid

On Jan 19, 2018, at 10:23 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote:

Thanks, that is interesting. Since /scratch is a Lustre file system, Open MPI should actually use romio314 for that anyway, not ompio. What I have seen happen on at least one occasion, however, is that ompio was still used because (I suspect) romio314 did not pick up the configuration options correctly. It is a bit of a mess from that perspective that we have to pass the ROMIO arguments with different flags/options than for ompio, e.g.

--with-lustre=/path/to/lustre/
--with-io-romio-flags="--with-file-system=ufs+nfs+lustre --with-lustre=/path/to/lustre"

ompio should pick up the Lustre options correctly if the Lustre headers/libraries are found at the default location, even if the user did not pass the --with-lustre option. I am not entirely sure what happens in ROMIO if the user did not pass --with-file-system=ufs+nfs+lustre but the Lustre headers/libraries are found at the default location, i.e. whether the Lustre ADIO component is still compiled or not.

Anyway, let's wait for the outcome of your run enforcing the romio314 component, and I will still try to reproduce your problem on my system.

Thanks,
Edgar
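As a concrete illustration of the flags Edgar lists, a configure invocation that hands the Lustre location to both ompio and ROMIO could look like the following sketch. The installation prefix mirrors the one used in this thread, and /path/to/lustre is a placeholder to be replaced with the actual Lustre install location.

./configure --prefix=$HOME/bin/openmpi-v3.0 \
            --with-lustre=/path/to/lustre \
            --with-io-romio-flags="--with-file-system=ufs+nfs+lustre --with-lustre=/path/to/lustre"
make all install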
On 1/19/2018 7:15 AM, Vahid Askarpour wrote:

Gilles,

I have submitted that job with --mca io romio314. If it finishes, I will let you know. It is sitting in Conte's queue at Purdue.

As to Edgar's question about the file system, here is the output of df -Th:

vaskarpo@conte-fe00:~ $ df -Th
Filesystem                                                                  Type    Size  Used  Avail Use% Mounted on
/dev/sda1                                                                   ext4    435G   16G  398G    4% /
tmpfs                                                                       tmpfs    16G  1.4M   16G    1% /dev/shm
persistent-nfs.rcac.purdue.edu:/persistent/home                             nfs      80T   64T   17T   80% /home
persistent-nfs.rcac.purdue.edu:/persistent/apps                             nfs     8.0T  4.0T  4.1T   49% /apps
mds-d01-ib.rcac.purdue.edu@o2ib1:mds-d02-ib.rcac.purdue.edu@o2ib1:/lustreD  lustre  1.4P  994T  347T   75% /scratch/conte
depotint-nfs.rcac.purdue.edu:/depot                                         nfs     4.5P  3.0P  1.6P   66% /depot
172.18.84.186:/persistent/fsadmin                                           nfs     200G  130G   71G   65% /usr/rmt_share/fsadmin

The code is compiled in my $HOME and is run on the scratch.

Cheers,
Vahid

On Jan 18, 2018, at 10:14 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

Vahid,

In the v1.10 series, the default MPI-IO component was ROMIO based; in the v3 series, it is now ompio. You can force the latest Open MPI to use the ROMIO-based component with

mpirun --mca io romio314 ...

That being said, your description (e.g. a hand-edited file) suggests that I/O is not performed with MPI-IO, which makes me very puzzled about why the latest Open MPI is crashing.

Cheers,
Gilles

On Fri, Jan 19, 2018 at 10:55 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote:

I will try to reproduce this problem with 3.0.x, but it might take me a couple of days to get to it. Since it seemed to have worked with 2.0.x (except for the running-out-of-file-handles problem), there is the suspicion that one of the fixes we introduced since then is the problem. What file system did you run it on? NFS?

Thanks,
Edgar

On 1/18/2018 5:17 PM, Jeff Squyres (jsquyres) wrote:

> On Jan 18, 2018, at 5:53 PM, Vahid Askarpour <vh261...@dal.ca> wrote:
>
> My openmpi3.0.x run (called the nscf run) was reading data from a routine Quantum Espresso input file edited by hand. The preliminary run (called the scf run) was done with openmpi3.0.x on a similar input file, also edited by hand.

Gotcha. Well, that's a little disappointing. It would be good to understand why it is crashing -- is the app doing something that is accidentally not standard? Is there a bug in (soon to be released) Open MPI 3.0.1? ...?
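Should forcing romio314 turn out to be the workaround, Gilles' --mca io romio314 setting can also be applied without editing every mpirun line, using Open MPI's standard environment-variable and per-user MCA parameter file mechanisms. The sketch below assumes pw.x accepts the -i argument Edgar mentioned; the file names are the ones used in this thread.

# per job script: set the MCA parameter through the environment
export OMPI_MCA_io=romio314
mpirun -np 64 pw.x -npool 64 -i nscf.in > nscf.out

# or make it persistent for the user in $HOME/.openmpi/mca-params.conf
mkdir -p $HOME/.openmpi
echo "io = romio314" >> $HOME/.openmpi/mca-params.conf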