This is just an update on how things turned out with openmpi-3.0.x. I compiled both EPW and openmpi with intel14. In the past, EPW crashed with both intel16 and intel17. However, with intel14 and openmpi/1.8.8, I have been getting consistent results.
The nscf.in run worked with the -i argument. However, when I ran EPW with intel14/openmpi-3.0.x, I got the following error:

mca_base_component_repository_open: unable to open mca_io_romio314: libgpfs.so: cannot open shared object file: No such file or directory (ignored)

What is interesting is that this error occurs in the middle of a long loop. Since the loop repeats over different coordinates, the error may not be coming from the gpfs library.

Cheers,
Vahid

On Jan 23, 2018, at 9:52 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

Fair enough. To be on the safe side, I encourage you to use the latest Intel compilers.

Cheers,
Gilles

Vahid Askarpour <vh261...@dal.ca> wrote:

Gilles,

I have not tried compiling the latest openmpi with GCC. I am waiting to see how the intel version turns out before attempting GCC.

Cheers,
Vahid

On Jan 23, 2018, at 9:33 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

Vahid,

There used to be a bug in the IOF part, but I am pretty sure that has already been fixed. Does the issue also occur with the GNU compilers? There used to be an issue with the Intel Fortran runtime (short reads/writes were silently ignored), and that was also fixed some time ago.

Cheers,
Gilles

Vahid Askarpour <vh261...@dal.ca> wrote:

This would work for Quantum Espresso input. I am waiting to see what happens with EPW; I don't think EPW accepts the -i argument. I will report back once the EPW job is done.

Cheers,
Vahid

On Jan 22, 2018, at 6:05 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:

Well, my final comment on this topic: as somebody suggested earlier in this email chain, if you provide the input with the -i argument instead of piping from standard input, things seem to work as far as I can see (disclaimer: I do not know what the final outcome should be; I just see that the application does not complain about the 'end of file while reading crystal k points'). So maybe that is the simplest solution.

Thanks
Edgar

On 1/22/2018 1:17 PM, Edgar Gabriel wrote:

After some further investigation, I am fairly confident that this is not an MPI I/O problem. The input file input_tmp.in is generated in this sequence of instructions (in Modules/open_close_input_file.f90):

---
IF ( TRIM(input_file_) /= ' ' ) THEN
   !
   ! copy file to be opened into input_file
   !
   input_file = input_file_
   !
ELSE
   !
   ! if no file specified then copy from standard input
   !
   input_file="input_tmp.in"
   OPEN(UNIT = stdtmp, FILE=trim(input_file), FORM='formatted', &
        STATUS='unknown', IOSTAT = ierr )
   IF ( ierr > 0 ) GO TO 30
   !
   dummy=' '
   WRITE(stdout, '(5x,a)') "Waiting for input..."
   DO WHILE ( TRIM(dummy) .NE. "MAGICALME" )
      READ (stdin,fmt='(A512)',END=20) dummy
      WRITE (stdtmp,'(A)') trim(dummy)
   END DO
   !
20 CLOSE ( UNIT=stdtmp, STATUS='keep' )
----

Basically, if no input file has been provided, the input file is generated by reading from standard input. Since the application is being launched e.g. with

mpirun -np 64 ../bin/pw.x -npool 64 <nscf.in >nscf.out

the data comes from nscf.in. I simply do not know enough about I/O forwarding to be able to tell why we do not see the entire file, but one interesting detail is that if I run it in the debugger, input_tmp.in is created correctly. However, if I run it using mpirun as shown above, the file is cropped incorrectly, which leads to the error message mentioned in this email chain.
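A quick way to confirm the cropping described above (a minimal sketch, assuming the run directory still contains both the original nscf.in and the generated input_tmp.in, i.e. the file names used in this thread):

wc -c nscf.in input_tmp.in    # a cropped copy will be noticeably shorter than the original
tail -3 nscf.in               # the original ends with the last complete k-point line
tail -3 input_tmp.in          # the truncated copy ends in the middle of a k-point line instead

This is only a sketch of the check, not output from an actual run.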
Anyway, I would probably need some help here from somebody who knows the runtime better than me on what could go wrong at this point.

Thanks
Edgar

On 1/19/2018 1:22 PM, Vahid Askarpour wrote:

Concerning the following error:

from pw_readschemafile : error # 1
xml data file not found

The nscf run uses files generated by the scf.in run, so I first run scf.in and, when it finishes, I run nscf.in. If you have done this and still get the above error, then this could be another bug. It does not happen for me with intel14/openmpi-1.8.8.

Thanks for the update,
Vahid

On Jan 19, 2018, at 3:08 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:

ok, here is what I found out so far; I will have to stop for today, however:

1. I can in fact reproduce your bug on my systems.

2. I can confirm that the problem occurs both with romio314 and ompio. I *think* the issue is that the input_tmp.in file is incomplete. In both cases (ompio and romio) the end of the file looks as follows (and it is exactly the same for both libraries):

gabriel@crill-002:/tmp/gabriel/qe-6.2.1/QE_input_files> tail -10 input_tmp.in
0.66666667 0.50000000 0.83333333 5.787037e-04
0.66666667 0.50000000 0.91666667 5.787037e-04
0.66666667 0.58333333 0.00000000 5.787037e-04
0.66666667 0.58333333 0.08333333 5.787037e-04
0.66666667 0.58333333 0.16666667 5.787037e-04
0.66666667 0.58333333 0.25000000 5.787037e-04
0.66666667 0.58333333 0.33333333 5.787037e-04
0.66666667 0.58333333 0.41666667 5.787037e-04
0.66666667 0.58333333 0.50000000 5.787037e-04
0.66666667 0.58333333 0.58333333 5

which is what I *think* causes the problem.

3. I tried to find where input_tmp.in is generated, but I haven't completely identified the location. However, I could not find MPI file_write(_all) operations anywhere in the code, although there are some MPI_file_read(_all) operations.

4. I can confirm that the behavior with Open MPI 1.8.x is different. input_tmp.in looks more complete (at least it doesn't end in the middle of a line). The simulation still does not finish for me, but the reported error is slightly different; I might just be missing a file or something:

from pw_readschemafile : error # 1
xml data file not found

Since I think input_tmp.in is generated from data that is provided in nscf.in, it might very well be something in the MPI_File_read(_all) operation that causes the issue, but since both ompio and romio are affected, there is a good chance that something outside of the control of the io components is causing the trouble (maybe a datatype issue that has changed from the 1.8.x series to 3.0.x).

5. Last but not least, I also wanted to mention that I ran all the parallel tests that I found in the testsuite (run-tests-cp-parallel, run-tests-pw-parallel, run-tests-ph-parallel, run-tests-epw-parallel), and they all passed with ompio (and romio314, although I only ran a subset of the tests with romio314).

Thanks
Edgar

On 01/19/2018 11:44 AM, Vahid Askarpour wrote:

Hi Edgar,

Just to let you know that the nscf run with --mca io ompio crashed like the other two runs.

Thank you,
Vahid

On Jan 19, 2018, at 12:46 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:

ok, thank you for the information. Two short questions and requests. I have qe-6.2.1 compiled and running on my system (although with gcc-6.4 instead of the intel compiler), and I am currently running the parallel test suite. So far, all the tests have passed, although it is still running.
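Regarding point 3 in Edgar's list above (locating the MPI-IO calls and the place where input_tmp.in is created): a quick scan of the QE source tree along these lines should be enough. This is just a sketch, run from the top-level qe-6.2.1 directory; add *.F90 to the include pattern if needed, and the exact matches will depend on the QE version:

grep -ril --include='*.f90' 'mpi_file_write' .   # Edgar reports finding no write calls
grep -ril --include='*.f90' 'mpi_file_read' .    # lists the files containing read calls
grep -ril --include='*.f90' 'input_tmp' .        # should point at Modules/open_close_input_file.f90 quoted earlier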
My question is now: would it be possible for you to give me access to exactly the same data set that you are using? You could upload it to a webpage or similar and just send me the link. The second question/request: could you rerun your tests one more time, this time forcing ompio, e.g. --mca io ompio?

Thanks
Edgar

On 1/19/2018 10:32 AM, Vahid Askarpour wrote:

To run EPW, the command for the preliminary nscf run is (http://epw.org.uk/Documentation/B-dopedDiamond):

~/bin/openmpi-v3.0/bin/mpiexec -np 64 /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out

So I submitted it with the following command:

~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64 /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out

and it crashed like the first time. It is interesting that the preliminary scf run works fine. The scf run requires Quantum Espresso to generate the k points automatically, as shown below:

K_POINTS (automatic)
12 12 12 0 0 0

The nscf run, which crashes, includes a list of k points (1728 in this case), as seen below:

K_POINTS (crystal)
1728
0.00000000 0.00000000 0.00000000 5.787037e-04
0.00000000 0.00000000 0.08333333 5.787037e-04
0.00000000 0.00000000 0.16666667 5.787037e-04
0.00000000 0.00000000 0.25000000 5.787037e-04
0.00000000 0.00000000 0.33333333 5.787037e-04
0.00000000 0.00000000 0.41666667 5.787037e-04
0.00000000 0.00000000 0.50000000 5.787037e-04
0.00000000 0.00000000 0.58333333 5.787037e-04
0.00000000 0.00000000 0.66666667 5.787037e-04
0.00000000 0.00000000 0.75000000 5.787037e-04
…….
…….

To build openmpi (either 1.10.7 or 3.0.x), I loaded the Fortran compiler module, configured with only "--prefix=", and then ran "make all install". I did not enable or disable any other options.

Cheers,
Vahid

On Jan 19, 2018, at 10:23 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote:

Thanks, that is interesting. Since /scratch is a Lustre file system, Open MPI should actually utilize romio314 for that anyway, not ompio. What I have seen happen on at least one occasion, however, is that ompio was still used because (I suspect) romio314 didn't pick up the configuration options correctly. It is a little bit of a mess from that perspective that we have to pass the romio arguments with different flags/options than for ompio, e.g.

--with-lustre=/path/to/lustre/ --with-io-romio-flags="--with-file-system=ufs+nfs+lustre --with-lustre=/path/to/lustre"

ompio should pick up the Lustre options correctly if the Lustre headers/libraries are found at the default location, even if the user did not pass the --with-lustre option. I am not entirely sure what happens in romio if the user did not pass --with-file-system=ufs+nfs+lustre but the Lustre headers/libraries are found at the default location, i.e. whether the lustre adio component is still compiled or not. Anyway, let's wait for the outcome of your run enforcing the romio314 component, and I will still try to reproduce your problem on my system.

Thanks
Edgar

On 1/19/2018 7:15 AM, Vahid Askarpour wrote:

Gilles,

I have submitted that job with --mca io romio314. If it finishes, I will let you know. It is sitting in Conte's queue at Purdue.
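To make Edgar's configure suggestion above concrete: added to the minimal configure Vahid describes, the full configure line would look roughly as follows. This is only a sketch; the --prefix is taken from the mpiexec path quoted above, and /path/to/lustre is a placeholder for wherever the Lustre headers/libraries are installed on the cluster:

./configure --prefix=$HOME/bin/openmpi-v3.0 \
            --with-lustre=/path/to/lustre \
            --with-io-romio-flags="--with-file-system=ufs+nfs+lustre --with-lustre=/path/to/lustre"
make all install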
As to Edgar's question about the file system, here is the output of df -Th:

vaskarpo@conte-fe00:~ $ df -Th
Filesystem                                                                  Type    Size  Used  Avail Use% Mounted on
/dev/sda1                                                                   ext4    435G  16G   398G   4% /
tmpfs                                                                       tmpfs   16G   1.4M  16G    1% /dev/shm
persistent-nfs.rcac.purdue.edu:/persistent/home                             nfs     80T   64T   17T   80% /home
persistent-nfs.rcac.purdue.edu:/persistent/apps                             nfs     8.0T  4.0T  4.1T  49% /apps
mds-d01-ib.rcac.purdue.edu@o2ib1:mds-d02-ib.rcac.purdue.edu@o2ib1:/lustreD  lustre  1.4P  994T  347T  75% /scratch/conte
depotint-nfs.rcac.purdue.edu:/depot                                         nfs     4.5P  3.0P  1.6P  66% /depot
172.18.84.186:/persistent/fsadmin                                           nfs     200G  130G  71G   65% /usr/rmt_share/fsadmin

The code is compiled in my $HOME and is run on the scratch.

Cheers,
Vahid

On Jan 18, 2018, at 10:14 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

Vahid,

In the v1.10 series, the default MPI-IO component was ROMIO based; in the v3 series, it is now ompio. You can force the latest Open MPI to use the ROMIO-based component with

mpirun --mca io romio314 ...

That being said, your description (e.g. a hand-edited file) suggests that I/O is not performed with MPI-IO, which makes me very puzzled about why the latest Open MPI is crashing.

Cheers,
Gilles

On Fri, Jan 19, 2018 at 10:55 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote:

I will try to reproduce this problem with 3.0.x, but it might take me a couple of days to get to it. Since it seemed to have worked with 2.0.x (except for the running-out-of-file-handles problem), there is the suspicion that one of the fixes we have introduced since then is the problem. What file system did you run it on? NFS?

Thanks
Edgar

On 1/18/2018 5:17 PM, Jeff Squyres (jsquyres) wrote:

On Jan 18, 2018, at 5:53 PM, Vahid Askarpour <vh261...@dal.ca> wrote:

My openmpi3.0.x run (the nscf run) was reading data from a routine Quantum Espresso input file edited by hand. The preliminary run (the scf run) was done with openmpi3.0.x on a similar input file, also edited by hand.

Gotcha. Well, that's a little disappointing. It would be good to understand why it is crashing -- is the app doing something that is accidentally not standard? Is there a bug in (soon to be released) Open MPI 3.0.1? ...?
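As a follow-up to Gilles' note about the io component defaults: one way to see which io components a given Open MPI install provides, and which one a run actually selects, is sketched below. ompi_info is standard; io_base_verbose is the usual per-framework MCA verbosity knob and is assumed here to be available in this release, so treat the second line as a sketch rather than a verified recipe:

ompi_info | grep 'MCA io'          # should list both ompio and romio314 if both were built
mpirun --mca io_base_verbose 100 --mca io romio314 -np 64 pw.x -npool 64 -i nscf.in > nscf.out

With the verbosity raised, the runtime output should show which io component is queried and selected when the application opens files.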
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users