Vahid,

There used to be a bug in the IOF part, but I am pretty sure this has already 
been fixed.

Does the issue also occur with GNU compilers?
There used to be an issue with the Intel Fortran runtime (short reads/writes were 
silently ignored), and that was also fixed some time ago.

Cheers,

Gilles

Vahid Askarpour <vh261...@dal.ca> wrote:
>This would work for Quantum Espresso input. I am waiting to see what happens 
>to EPW. I don’t think EPW accepts the -i argument. I will report back once the 
>EPW job is done. 
>
>
>Cheers,
>
>
>Vahid
>
> 
>
>On Jan 22, 2018, at 6:05 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:
>
>
>well, my final comment on this topic: as somebody suggested earlier in this 
>email chain, if you provide the input with the -i argument instead of piping it 
>from standard input, things seem to work as far as I can see (disclaimer: I do 
>not know what the final outcome should be; I just see that the application 
>does not complain about the 'end of file while reading crystal k points'). So 
>maybe that is the simplest solution.
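>
>For example, something along these lines (untested sketch; the paths and core 
>count just mirror the command quoted further down in this thread and may need 
>adjusting for your setup):
>
>mpirun -np 64 ../bin/pw.x -npool 64 -i nscf.in > nscf.out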
>
>Thanks
>
>Edgar
>
>
>On 1/22/2018 1:17 PM, Edgar Gabriel wrote:
>
>after some further investigation, I am fairly confident that this is not an 
>MPI I/O problem. 
>
>The input file input_tmp.in is generated by the following sequence of 
>instructions (in Modules/open_close_input_file.f90):
>
>---
>
>  IF ( TRIM(input_file_) /= ' ' ) THEN
>     !
>     ! copy file to be opened into input_file
>     !
>     input_file = input_file_
>     !
>  ELSE
>     !
>     ! if no file specified then copy from standard input
>     !
>     input_file="input_tmp.in"
>     OPEN(UNIT = stdtmp, FILE=trim(input_file), FORM='formatted', &
>          STATUS='unknown', IOSTAT = ierr )
>     IF ( ierr > 0 ) GO TO 30
>     !
>     dummy=' '
>     WRITE(stdout, '(5x,a)') "Waiting for input..."
>     DO WHILE ( TRIM(dummy) .NE. "MAGICALME" )
>        READ (stdin,fmt='(A512)',END=20) dummy
>        WRITE (stdtmp,'(A)') trim(dummy)
>     END DO
>     !
>20   CLOSE ( UNIT=stdtmp, STATUS='keep' )
>
>----
>
>Basically, if no input file has been provided, the input file is generated by 
>reading from standard input. Since the application is being launched e.g. with
>
>mpirun -np 64 ../bin/pw.x -npool 64 <nscf.in >nscf.out 
>
>the data comes from nscf.in. I simply do not know enough about I/O forwarding 
>to be able to tell why we do not see the entire file, but one interesting 
>detail is that if I run it in the debugger, input_tmp.in is created 
>correctly. However, if I run it using mpirun as shown above, the file is 
>truncated, which leads to the error message mentioned in this email 
>chain. 
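>
>(One way to narrow this down, just as a suggestion and relying on the default 
>behavior that mpirun forwards stdin to rank 0, would be to take the application 
>out of the picture and check whether the forwarded stream itself is already 
>truncated:
>
>mpirun -np 64 cat < nscf.in > forwarded.out
>wc -c nscf.in forwarded.out
>
>If the byte counts already differ here, the truncation happens in the I/O 
>forwarding layer rather than in the application.)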
>
>Anyway, I will probably need some help here from somebody who knows the 
>runtime better than I do to figure out what could go wrong at this point. 
>
>Thanks
>
>Edgar
>
>
>
>
>On 1/19/2018 1:22 PM, Vahid Askarpour wrote:
>
>Concerning the following error 
>
>
>     from pw_readschemafile : error #         1
>     xml data file not found
>
>
>The nscf run uses files generated by the scf.in run. So I first run scf.in and 
>when it finishes, I run nscf.in. If you have done this and still get the above 
>error, then this could be another bug. It does not happen for me with 
>intel14/openmpi-1.8.8.
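>
>For reference, the sequence I use looks roughly like this (sketch only; the 
>executable path, core count and -npool setting are placeholders taken from the 
>commands elsewhere in this thread):
>
>mpiexec -np 64 pw.x -npool 64 < scf.in > scf.out
>mpiexec -np 64 pw.x -npool 64 < nscf.in > nscf.out   # only after scf has finished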
>
>
>Thanks for the update,
>
>
>Vahid
>
>
>On Jan 19, 2018, at 3:08 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:
>
>
>ok, here is what I found out so far; I will have to stop here for today, however:
>
> 1. I can in fact reproduce your bug on my systems.
>
> 2. I can confirm that the problem occurs both with romio314 and ompio. I 
>*think* the issue is that the input_tmp.in file is incomplete. In both cases 
>(ompio and romio) the end of the file looks as follows (and it's exactly the 
>same for both libraries):
>
>gabriel@crill-002:/tmp/gabriel/qe-6.2.1/QE_input_files> tail -10 input_tmp.in 
>  0.66666667  0.50000000  0.83333333  5.787037e-04
>  0.66666667  0.50000000  0.91666667  5.787037e-04
>  0.66666667  0.58333333  0.00000000  5.787037e-04
>  0.66666667  0.58333333  0.08333333  5.787037e-04
>  0.66666667  0.58333333  0.16666667  5.787037e-04
>  0.66666667  0.58333333  0.25000000  5.787037e-04
>  0.66666667  0.58333333  0.33333333  5.787037e-04
>  0.66666667  0.58333333  0.41666667  5.787037e-04
>  0.66666667  0.58333333  0.50000000  5.787037e-04
>  0.66666667  0.58333333  0.58333333  5
>
>which is what I *think* causes the problem.
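>
>(A quick sanity check, assuming input_tmp.in ends up in the working directory, 
>would be to compare the tails of the two files:
>
>tail -3 nscf.in
>tail -3 input_tmp.in
>
>Apart from trailing whitespace, the last lines should match if the copy from 
>standard input completed.)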
>
> 3. I tried to find where input_tmp.in is generated, but haven't completely 
>identified the location. However, I could not find any MPI_File_write(_all) 
>operations anywhere in the code, although there are some MPI_File_read(_all) 
>operations.
>
> 4. I can confirm that the behavior with Open MPI 1.8.x is different. 
>input_tmp.in looks more complete (at least it doesn't end in the middle of a 
>line). The simulation still does not finish for me, but the error reported is 
>slightly different; I might just be missing a file or something:
>
>
>     from pw_readschemafile : error #         1
>     xml data file not found
>
>Since I think input_tmp.in is generated from data that is provided in nscf.in, 
>it might very well be something in the MPI_File_read(_all) operation that 
>causes the issue, but since both ompio and romio are affected, there is a good 
>chance that something outside of the control of the I/O components is causing 
>the trouble (maybe a datatype issue that has changed from the 1.8.x series to 
>3.0.x).
>
> 5. Last but not least, I also wanted to mention that I ran all parallel tests 
>that I found in the test suite (run-tests-cp-parallel, run-tests-pw-parallel, 
>run-tests-ph-parallel, run-tests-epw-parallel), and they all passed with 
>ompio (and with romio314 as well, although I only ran a subset of the tests 
>there).
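>
>(For the record, I invoked these roughly as follows; I am assuming here that 
>your copy of the QE test suite is driven by make targets of the same names, so 
>adjust as needed:
>
>cd test-suite
>make run-tests-pw-parallel
>make run-tests-epw-parallel
>)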
>
>Thanks
>
>Edgar
>
>
>
>
>
>On 01/19/2018 11:44 AM, Vahid Askarpour wrote:
>
>Hi Edgar,
>
>
>Just to let you know that the nscf run with --mca io ompio crashed like the 
>other two runs. 
>
>
>Thank you,
>
>
>Vahid
>
>
>On Jan 19, 2018, at 12:46 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:
>
>
>ok, thank you for the information. Two short questions and requests. I have 
>qe-6.2.1 compiled and running on my system (although with gcc-6.4 instead of 
>the Intel compiler), and I am currently running the parallel test suite. So 
>far, all the tests have passed, although the suite is still running.
>
>My first question: would it be possible for you to give me access to exactly 
>the same data set that you are using? You could upload it to a webpage or 
>similar and just send me the link. 
>
>The second question/request: could you rerun your tests one more time, this 
>time forcing the use of ompio, e.g. with --mca io ompio?
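>
>For example, mirroring the mpiexec command from your earlier message (the 
>paths and -npool setting are yours, not something I have verified):
>
>~/bin/openmpi-v3.0/bin/mpiexec --mca io ompio -np 64 
>/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out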
>
>Thanks
>
>Edgar
>
>
>On 1/19/2018 10:32 AM, Vahid Askarpour wrote:
>
>To run EPW, the command for running the preliminary nscf run is 
>(http://epw.org.uk/Documentation/B-dopedDiamond): 
>
>
>~/bin/openmpi-v3.0/bin/mpiexec -np 64 
>/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out
>
>
>So I submitted it with the following command:
>
>
>~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64 
>/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out
>
>
>And it crashed like the first time. 
>
>
>It is interesting that the preliminary scf run works fine. The scf run 
>requires Quantum Espresso to generate the k points automatically as shown 
>below:
>
>
>K_POINTS (automatic)
>12 12 12 0 0 0
>
>
>The nscf run, which crashes, includes a list of k points (1728 in this case), 
>as seen below:
>
>
>K_POINTS (crystal)
>1728
>  0.00000000  0.00000000  0.00000000  5.787037e-04 
>  0.00000000  0.00000000  0.08333333  5.787037e-04 
>  0.00000000  0.00000000  0.16666667  5.787037e-04 
>  0.00000000  0.00000000  0.25000000  5.787037e-04 
>  0.00000000  0.00000000  0.33333333  5.787037e-04 
>  0.00000000  0.00000000  0.41666667  5.787037e-04 
>  0.00000000  0.00000000  0.50000000  5.787037e-04 
>  0.00000000  0.00000000  0.58333333  5.787037e-04 
>  0.00000000  0.00000000  0.66666667  5.787037e-04 
>  0.00000000  0.00000000  0.75000000  5.787037e-04 
>
>…….
>
>…….
>
>
>To build Open MPI (either 1.10.7 or 3.0.x), I loaded the Fortran compiler 
>module, configured with only the "--prefix=" option, and then ran "make all 
>install". I did not enable or disable any other options.
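>
>In other words, roughly the following (the prefix path here is just an example, 
>not necessarily the one I used):
>
>./configure --prefix=$HOME/bin/openmpi-v3.0
>make all install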
>
>
>Cheers,
>
>
>Vahid
>
>
>
>On Jan 19, 2018, at 10:23 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote:
>
>
>thanks, that is interesting. Since /scratch is a Lustre file system, Open MPI 
>should actually utilize romio314 for that anyway, not ompio. What I have seen 
>happen on at least one occasion, however, is that ompio was still used because 
>(I suspect) romio314 didn't pick up the configuration options correctly. It is 
>a bit of a mess from that perspective that we have to pass the ROMIO arguments 
>with different flags/options than for ompio, e.g.
>
>--with-lustre=/path/to/lustre/ 
>--with-io-romio-flags="--with-file-system=ufs+nfs+lustre 
>--with-lustre=/path/to/lustre"
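>
>(On a full configure line that would look something like the following sketch; 
>the prefix and the Lustre path are placeholders:
>
>./configure --prefix=$HOME/opt/openmpi --with-lustre=/path/to/lustre \
>  --with-io-romio-flags="--with-file-system=ufs+nfs+lustre --with-lustre=/path/to/lustre"
>)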
>
>ompio should pick up the Lustre options correctly if the Lustre 
>headers/libraries are found at the default location, even if the user did not 
>pass the --with-lustre option. I am not entirely sure what happens in ROMIO if 
>the user did not pass --with-file-system=ufs+nfs+lustre but the Lustre 
>headers/libraries are found at the default location, i.e. whether the Lustre 
>ADIO component is still compiled or not.
>
>Anyway, let's wait for the outcome of your run forcing the romio314 
>component, and I will still try to reproduce your problem on my system.
>
>Thanks
>Edgar
>
>On 1/19/2018 7:15 AM, Vahid Askarpour wrote:
>
>Gilles, I have submitted that job with --mca io romio314. If it finishes, I 
>will let you know. It is sitting in Conte’s queue at Purdue. As to Edgar’s 
>question about the file system, here is the output of df -Th: 
>vaskarpo@conte-fe00:~ $ df -Th
>Filesystem                                        Type    Size  Used Avail Use% Mounted on
>/dev/sda1                                         ext4    435G   16G  398G   4% /
>tmpfs                                             tmpfs    16G  1.4M   16G   1% /dev/shm
>persistent-nfs.rcac.purdue.edu:/persistent/home   nfs      80T   64T   17T  80% /home
>persistent-nfs.rcac.purdue.edu:/persistent/apps   nfs     8.0T  4.0T  4.1T  49% /apps
>mds-d01-ib.rcac.purdue.edu@o2ib1:mds-d02-ib.rcac.purdue.edu@o2ib1:/lustreD
>                                                  lustre  1.4P  994T  347T  75% /scratch/conte
>depotint-nfs.rcac.purdue.edu:/depot               nfs     4.5P  3.0P  1.6P  66% /depot
>172.18.84.186:/persistent/fsadmin                 nfs     200G  130G   71G  65% /usr/rmt_share/fsadmin
>
>The code is compiled in my $HOME and is run on the scratch.
>
>Cheers,
>
>Vahid
>
>On Jan 18, 2018, at 10:14 PM, Gilles Gouaillardet 
><gilles.gouaillar...@gmail.com> wrote:
>
>Vahid,
>
>In the v1.10 series, the default MPI-IO component was ROMIO based, and in the 
>v3 series, it is now ompio. You can force the latest Open MPI to use the 
>ROMIO based component with
>
>mpirun --mca io romio314 ...
>
>That being said, your description (e.g. a hand-edited file) suggests that I/O 
>is not performed with MPI-IO, which makes me very puzzled about why the latest 
>Open MPI is crashing.
>
>Cheers,
>
>Gilles
>
>On Fri, Jan 19, 2018 at 10:55 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote: 
>
>I will try to reproduce this problem with 3.0.x, but it might take me a couple 
>of days to get to it. Since it seemed to have worked with 2.0.x (except for 
>the problem of running out of file handles), there is the suspicion that one 
>of the fixes that we introduced since then is the problem. What file system 
>did you run it on? NFS?
>
>Thanks
>
>Edgar
>
>On 1/18/2018 5:17 PM, Jeff Squyres (jsquyres) wrote: 
>
>On Jan 18, 2018, at 5:53 PM, Vahid Askarpour <vh261...@dal.ca> wrote: 
>
>My openmpi3.0.x run (called nscf run) was reading data from a routine Quantum 
>Espresso input file edited by hand. The preliminary run (called scf run) was 
>done with openmpi3.0.x on a similar input file also edited by hand. 
>
>Gotcha. Well, that's a little disappointing. It would be good to understand 
>why it is crashing -- is the app doing something that is accidentally not 
>standard? Is there a bug in (soon to be released) Open MPI 3.0.1? ...? 
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
