after some further investigation, I am fairly confident that this is not
an MPI I/O problem.
The input file input_tmp.in is generated by this sequence of
instructions (in Modules/open_close_input_file.f90):
---
IF ( TRIM(input_file_) /= ' ' ) THEN
   !
   ! copy file to be opened into input_file
   !
   input_file = input_file_
   !
ELSE
   !
   ! if no file specified then copy from standard input
   !
   input_file = "input_tmp.in"
   OPEN ( UNIT = stdtmp, FILE = trim(input_file), FORM = 'formatted', &
          STATUS = 'unknown', IOSTAT = ierr )
   IF ( ierr > 0 ) GO TO 30
   !
   dummy = ' '
   WRITE(stdout, '(5x,a)') "Waiting for input..."
   DO WHILE ( TRIM(dummy) .NE. "MAGICALME" )
      READ (stdin, fmt='(A512)', END=20) dummy
      WRITE (stdtmp, '(A)') trim(dummy)
   END DO
   !
20 CLOSE ( UNIT = stdtmp, STATUS = 'keep' )
---
Basically, if no input file has been provided, the input file is
generated by reading from standard input. Since the application is being
launched e.g. with
mpirun -np 64 ../bin/pw.x -npool 64 <nscf.in >nscf.out
the data comes from nscf.in. I simply do not know enough about I/O
forwarding to be able to tell why we do not see the entire file, but one
interesting detail is that if I run it in the debugger, input_tmp.in
is created correctly. However, if I run it using mpirun as shown above,
the file is truncated, which leads to the error message
mentioned in this email chain.
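If it would help, a minimal reproducer along the lines of the loop
above (a sketch, untested, and without the MAGICALME terminator) could
separate mpirun's stdin forwarding from the rest of QE:
---
program stdin_copy
   use mpi
   implicit none
   integer :: ierr, rank, ios
   character(len=512) :: line
   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   ! only rank 0 copies stdin to a file, mirroring what pw.x does
   if ( rank == 0 ) then
      open (unit=99, file='stdin_copy.out', form='formatted', &
            status='unknown')
      do
         read (*, fmt='(A512)', iostat=ios) line
         if ( ios /= 0 ) exit
         write (99, '(A)') trim(line)
      end do
      close (99)
   end if
   call MPI_Finalize(ierr)
end program stdin_copy
---
Running it as mpirun -np 64 ./stdin_copy <nscf.in and diffing
stdin_copy.out against nscf.in should show whether the truncation
happens in the forwarding itself.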
Anyway, I would probably need some help here from somebody who knows the
runtime better than me on what could go wrong at this point.
Thanks
Edgar
On 1/19/2018 1:22 PM, Vahid Askarpour wrote:
Concerning the following error
from pw_readschemafile : error # 1
xml data file not found
The nscf run uses files generated by the scf run. So I first run
scf.in and, when it finishes, I run nscf.in. If you have done this and
still get the above error, then this could be another bug. It does not
happen for me with intel14/openmpi-1.8.8.
Thanks for the update,
Vahid
On Jan 19, 2018, at 3:08 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:
ok, here is what I found out so far; I will have to stop here for
today:
1. I can in fact reproduce your bug on my systems.
2. I can confirm that the problem occurs both with romio314 and
ompio. I *think* the issue is that the input_tmp.in file is
incomplete. In both cases (ompio and romio) the end of the file looks
as follows (and it's exactly the same for both libraries):
gabriel@crill-002:/tmp/gabriel/qe-6.2.1/QE_input_files> tail -10 input_tmp.in
0.66666667 0.50000000 0.83333333 5.787037e-04
0.66666667 0.50000000 0.91666667 5.787037e-04
0.66666667 0.58333333 0.00000000 5.787037e-04
0.66666667 0.58333333 0.08333333 5.787037e-04
0.66666667 0.58333333 0.16666667 5.787037e-04
0.66666667 0.58333333 0.25000000 5.787037e-04
0.66666667 0.58333333 0.33333333 5.787037e-04
0.66666667 0.58333333 0.41666667 5.787037e-04
0.66666667 0.58333333 0.50000000 5.787037e-04
0.66666667 0.58333333 0.58333333 5
which is what I *think* causes the problem.
3. I tried to find where input_tmp.in is generated, but haven't
completely identified the location yet. However, I could not find any
MPI_File_write(_all) operations anywhere in the code, although there
are some MPI_File_read(_all) operations.
4. I can confirm that the behavior with Open MPI 1.8.x is different.
input_tmp.in looks more complete (at least it doesn't end in the
middle of a line). The simulation still does not finish for me, but
the reported error is slightly different; I might just be missing a
file or something:
from pw_readschemafile : error # 1
xml data file not found
Since I think input_tmp.in is generated from data that is provided in
nscf.in, it might very well be something in the MPI_File_read(_all)
operation that causes the issue, but since both ompio and romio are
affected, there is a good chance that something outside of the control
of the io components is causing the trouble (maybe a datatype issue
that has changed from the 1.8.x series to 3.0.x).
5. Last but not least, I also wanted to mention that I ran all the
parallel tests that I found in the testsuite (run-tests-cp-parallel,
run-tests-pw-parallel, run-tests-ph-parallel, run-tests-epw-parallel),
and they all passed with ompio (and with romio314, although I only ran
a subset of the tests in that case).
Thanks
Edgar
On 01/19/2018 11:44 AM, Vahid Askarpour wrote:
Hi Edgar,
Just to let you know that the nscf run with --mca io ompio crashed
like the other two runs.
Thank you,
Vahid
On Jan 19, 2018, at 12:46 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:
ok, thank you for the information. Two short questions/requests. I
have qe-6.2.1 compiled and running on my system (although with
gcc-6.4 instead of the Intel compiler), and I am currently running
the parallel test suite. So far, all the tests have passed, although
it is still running.
My first question: would it be possible for you to give me access
to exactly the same data set that you are using? You could upload it
to a webpage or similar and just send me the link.
The second question/request: could you rerun your test one more
time, this time forcing the use of ompio? e.g. --mca io ompio
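With the command line from your earlier message, that would be e.g.
~/bin/openmpi-v3.0/bin/mpiexec --mca io ompio -np 64
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in
> nscf.out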
Thanks
Edgar
On 1/19/2018 10:32 AM, Vahid Askarpour wrote:
To run EPW, the command for running the preliminary nscf run is
(http://epw.org.uk/Documentation/B-dopedDiamond):
~/bin/openmpi-v3.0/bin/mpiexec -np 64
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in
> nscf.out
So I submitted it with the following command:
~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in
> nscf.out
And it crashed like the first time.
It is interesting that the preliminary scf run works fine. The scf
run requires Quantum Espresso to generate the k points
automatically as shown below:
K_POINTS (automatic)
12 12 12 0 0 0
The nscf run, which crashes, includes a list of k points (1728 in
this case) as seen below:
K_POINTS (crystal)
1728
0.00000000 0.00000000 0.00000000 5.787037e-04
0.00000000 0.00000000 0.08333333 5.787037e-04
0.00000000 0.00000000 0.16666667 5.787037e-04
0.00000000 0.00000000 0.25000000 5.787037e-04
0.00000000 0.00000000 0.33333333 5.787037e-04
0.00000000 0.00000000 0.41666667 5.787037e-04
0.00000000 0.00000000 0.50000000 5.787037e-04
0.00000000 0.00000000 0.58333333 5.787037e-04
0.00000000 0.00000000 0.66666667 5.787037e-04
0.00000000 0.00000000 0.75000000 5.787037e-04
…….
…….
To build Open MPI (either 1.10.7 or 3.0.x), I loaded the Fortran
compiler module, configured with only “--prefix=”, and then ran “make
all install”. I did not enable or disable any other options.
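In other words, the build was essentially just:
./configure --prefix=$HOME/bin/openmpi-v3.0
make all install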
Cheers,
Vahid
On Jan 19, 2018, at 10:23 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote:
thanks, that is interesting. Since /scratch is a lustre file
system, Open MPI should actually utilize romio314 for that anyway,
not ompio. What I have seen happen on at least one occasion, however,
is that ompio was still used because (I suspect) romio314 did not
pick up the configuration options correctly. It is a bit of a mess
from that perspective that we have to pass the romio arguments with
different flags/options than for ompio, e.g.
--with-lustre=/path/to/lustre/
--with-io-romio-flags="--with-file-system=ufs+nfs+lustre
--with-lustre=/path/to/lustre"
ompio should pick up the lustre options correctly if lustre
headers/libraries are found at the default location, even if the
user did not pass the --with-lustre option. I am not entirely
sure what happens in romio if the user did not pass the
--with-file-system=ufs+nfs+lustre but the lustre
headers/libraries are found at the default location, i.e. whether
the lustre adio component is still compiled or not.
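For reference, a full configure line combining both sets of options
would look something like this (with the paths standing in for your
actual install prefix and lustre location):
./configure --prefix=/path/to/install \
    --with-lustre=/path/to/lustre \
    --with-io-romio-flags="--with-file-system=ufs+nfs+lustre --with-lustre=/path/to/lustre"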
Anyway, let's wait for the outcome of your run forcing the romio314
component, and I will still try to reproduce your problem on my
system.
Thanks
Edgar
On 1/19/2018 7:15 AM, Vahid Askarpour wrote:
Gilles,
I have submitted that job with --mca io romio314. If it finishes, I will let
you know. It is sitting in Conte’s queue at Purdue.
As to Edgar’s question about the file system, here is the output of df -Th:
vaskarpo@conte-fe00:~ $ df -Th
Filesystem                     Type    Size  Used Avail Use% Mounted on
/dev/sda1                      ext4    435G   16G  398G   4% /
tmpfs                          tmpfs    16G  1.4M   16G   1% /dev/shm
persistent-nfs.rcac.purdue.edu:/persistent/home
                               nfs      80T   64T   17T  80% /home
persistent-nfs.rcac.purdue.edu:/persistent/apps
                               nfs     8.0T  4.0T  4.1T  49% /apps
mds-d01-ib.rcac.purdue.edu@o2ib1:mds-d02-ib.rcac.purdue.edu@o2ib1:/lustreD
                               lustre  1.4P  994T  347T  75% /scratch/conte
depotint-nfs.rcac.purdue.edu:/depot
                               nfs     4.5P  3.0P  1.6P  66% /depot
172.18.84.186:/persistent/fsadmin
                               nfs     200G  130G   71G  65% /usr/rmt_share/fsadmin
The code is compiled in my $HOME and runs on /scratch.
Cheers,
Vahid
On Jan 18, 2018, at 10:14 PM, Gilles Gouaillardet
<gilles.gouaillar...@gmail.com> wrote:
Vahid,
In the v1.10 series, the default MPI-IO component was ROMIO-based;
in the v3 series, it is now ompio.
You can force the latest Open MPI to use the ROMIO-based component with
mpirun --mca io romio314 ...
That being said, your description (e.g. a hand-edited file) suggests
that the I/O is not performed with MPI-IO,
which makes me very puzzled as to why the latest Open MPI is crashing.
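As a sanity check, you can also verify which io component gets
selected at runtime:
ompi_info | grep "MCA io"
lists the io components that were built, and running with
mpirun --mca io_base_verbose 100 ...
should print the component selection (I am writing these from memory,
so please double check the parameter name with ompi_info --all).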
Cheers,
Gilles
On Fri, Jan 19, 2018 at 10:55 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote:
I will try to reproduce this problem with 3.0.x, but it might take me a
couple of days to get to it.
Since it seemed to have worked with 2.0.x (except for the
running-out-of-file-handles problem), the suspicion is that one of
the fixes we have introduced since then is the problem.
What file system did you run it on? NFS?
Thanks
Edgar
On 1/18/2018 5:17 PM, Jeff Squyres (jsquyres) wrote:
On Jan 18, 2018, at 5:53 PM, Vahid Askarpour <vh261...@dal.ca> wrote:
My Open MPI 3.0.x run (the nscf run) was reading data from a routine
Quantum Espresso input file edited by hand. The preliminary run (the
scf run) was done with Open MPI 3.0.x on a similar input file, also
edited by hand.
Gotcha.
Well, that's a little disappointing.
It would be good to understand why it is crashing -- is the app doing
something that is accidentally not standard? Is there a bug in the
(soon-to-be-released) Open MPI 3.0.1? ...?
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users