Hi Edgar,

Just to let you know that the nscf run with --mca io ompio crashed like the 
other two runs.

Thank you,

Vahid

On Jan 19, 2018, at 12:46 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:


ok, thank you for the information. Two short questions and requests. I have 
qe-6.2.1 compiled and running on my system (although it is with gcc-6.4 instead 
of the intel compiler), and I am currently running the parallel test suite. So 
far, all the tests passed, although it is still running.

My first question: would it be possible for you to give me access to exactly 
the same data set that you are using? You could upload it to a webpage or 
similar and just send me the link.

The second question/request: could you rerun your tests one more time, this 
time forcing the use of ompio, e.g. --mca io ompio?
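One way to confirm which io component actually gets selected (a sketch only; the exact verbose output format varies across Open MPI versions) is to raise the io framework's verbosity alongside the forced component, and to check which io components the build ships with:

```shell
# Force ompio and ask Open MPI to report io component selection;
# io_base_verbose makes the io framework print its selection decisions.
~/bin/openmpi-v3.0/bin/mpiexec --mca io ompio --mca io_base_verbose 100 \
    -np 64 pw.x -npool 64 < nscf.in > nscf.out

# List the io components this Open MPI build was compiled with:
ompi_info | grep "MCA io"
```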

Thanks

Edgar

On 1/19/2018 10:32 AM, Vahid Askarpour wrote:
To run EPW, the command for running the preliminary nscf run is 
(http://epw.org.uk/Documentation/B-dopedDiamond):

~/bin/openmpi-v3.0/bin/mpiexec -np 64 
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out

So I submitted it with the following command:

~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64 
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out

And it crashed like the first time.

It is interesting that the preliminary scf run works fine. The scf run requires 
Quantum Espresso to generate the k points automatically as shown below:

K_POINTS (automatic)
12 12 12 0 0 0

The nscf run which crashes includes a list of k points (1728 in this case) as 
seen below:

K_POINTS (crystal)
1728
  0.00000000  0.00000000  0.00000000  5.787037e-04
  0.00000000  0.00000000  0.08333333  5.787037e-04
  0.00000000  0.00000000  0.16666667  5.787037e-04
  0.00000000  0.00000000  0.25000000  5.787037e-04
  0.00000000  0.00000000  0.33333333  5.787037e-04
  0.00000000  0.00000000  0.41666667  5.787037e-04
  0.00000000  0.00000000  0.50000000  5.787037e-04
  0.00000000  0.00000000  0.58333333  5.787037e-04
  0.00000000  0.00000000  0.66666667  5.787037e-04
  0.00000000  0.00000000  0.75000000  5.787037e-04
…….
…….
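For reference, the weights in this list are simply 1/1728 for a uniform 12x12x12 grid. A short Python sketch (purely illustrative, not part of Quantum Espresso or EPW) that reproduces the start of this K_POINTS card:

```python
# Generate a uniform n x n x n k-point grid in crystal coordinates with
# equal weights 1/n^3, formatted like the nscf input above.
n = 12
weight = 1.0 / n**3  # 1/1728 ~ 5.787037e-04

lines = ["K_POINTS (crystal)", str(n**3)]
for i in range(n):
    for j in range(n):
        for k in range(n):
            lines.append(f"  {i/n:.8f}  {j/n:.8f}  {k/n:.8f}  {weight:.6e}")

# Print the header and the first two k points.
print("\n".join(lines[:4]))
```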

To build openmpi (either 1.10.7 or 3.0.x), I loaded the Fortran compiler 
module, configured with only the "--prefix=" option, and then ran "make all 
install". I did not enable or disable any other options.

Cheers,

Vahid


On Jan 19, 2018, at 10:23 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote:


thanks, that is interesting. Since /scratch is a lustre file system, Open MPI 
should actually utilize romio314 for that anyway, not ompio. What I have seen 
happen on at least one occasion, however, is that ompio was still used because 
(I suspect) romio314 did not correctly pick up the configuration options. It is 
a bit of a mess from that perspective that we have to pass the romio arguments 
with different flags/options than for ompio, e.g.

--with-lustre=/path/to/lustre/ 
--with-io-romio-flags="--with-file-system=ufs+nfs+lustre 
--with-lustre=/path/to/lustre"

ompio should pick up the lustre options correctly if lustre headers/libraries 
are found at the default location, even if the user did not pass the 
--with-lustre option. I am not entirely sure what happens in romio if the user 
did not pass the --with-file-system=ufs+nfs+lustre but the lustre 
headers/libraries are found at the default location, i.e. whether the lustre 
adio component is still compiled or not.
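Putting both sets of flags together, a full configure line would look roughly like the sketch below (the install prefix and the Lustre path are placeholders to adjust for your system):

```shell
# Sketch: configure Open MPI so that both ompio and the bundled romio314
# get explicit Lustre support, then build and install.
./configure --prefix=$HOME/bin/openmpi-v3.0 \
    --with-lustre=/path/to/lustre \
    --with-io-romio-flags="--with-file-system=ufs+nfs+lustre --with-lustre=/path/to/lustre"
make all install
```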

Anyway, let's wait for the outcome of your run forcing the romio314 component, 
and I will still try to reproduce your problem on my system.

Thanks
Edgar

On 1/19/2018 7:15 AM, Vahid Askarpour wrote:

Gilles,

I have submitted that job with --mca io romio314. If it finishes, I will let 
you know. It is sitting in Conte’s queue at Purdue.

As to Edgar’s question about the file system, here is the output of df -Th:

vaskarpo@conte-fe00:~ $ df -Th
Filesystem           Type    Size  Used Avail Use% Mounted on
/dev/sda1            ext4    435G   16G  398G   4% /
tmpfs                tmpfs    16G  1.4M   16G   1% /dev/shm
persistent-nfs.rcac.purdue.edu:/persistent/home
                     nfs      80T   64T   17T  80% /home
persistent-nfs.rcac.purdue.edu:/persistent/apps
                     nfs     8.0T  4.0T  4.1T  49% /apps
mds-d01-ib.rcac.purdue.edu@o2ib1:mds-d02-ib.rcac.purdue.edu@o2ib1:/lustreD
                     lustre  1.4P  994T  347T  75% /scratch/conte
depotint-nfs.rcac.purdue.edu:/depot
                     nfs     4.5P  3.0P  1.6P  66% /depot
172.18.84.186:/persistent/fsadmin
                     nfs     200G  130G   71G  65% /usr/rmt_share/fsadmin

The code is compiled in my $HOME and is run on the scratch file system.

Cheers,

Vahid



On Jan 18, 2018, at 10:14 PM, Gilles Gouaillardet 
<gilles.gouaillar...@gmail.com> wrote:

Vahid,

In the v1.10 series, the default MPI-IO component was ROMIO based, and
in the v3 series, it is now ompio.
You can force the latest Open MPI to use the ROMIO-based component with
mpirun --mca io romio314 ...

That being said, your description (e.g. a hand-edited file) suggests
that I/O is not performed with MPI-IO,
which makes me very puzzled as to why the latest Open MPI is crashing.

Cheers,

Gilles

On Fri, Jan 19, 2018 at 10:55 AM, Edgar Gabriel 
<egabr...@central.uh.edu> wrote:


I will try to reproduce this problem with 3.0.x, but it might take me a
couple of days to get to it.

Since it seemed to have worked with 2.0.x (except for the running-out-of-file-
handles problem), there is the suspicion that one of the fixes we introduced
since then is the problem.

What file system did you run it on? NFS?

Thanks

Edgar


On 1/18/2018 5:17 PM, Jeff Squyres (jsquyres) wrote:


On Jan 18, 2018, at 5:53 PM, Vahid Askarpour <vh261...@dal.ca> wrote:


My openmpi3.0.x run (called nscf run) was reading data from a routine
Quantum Espresso input file edited by hand. The preliminary run (called scf
run) was done with openmpi3.0.x on a similar input file also edited by hand.


Gotcha.

Well, that's a little disappointing.

It would be good to understand why it is crashing -- is the app doing
something that is accidentally not standard?  Is there a bug in (soon to be
released) Open MPI 3.0.1?  ...?



_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users






--
Edgar Gabriel
Associate Professor
Department of Computer Science

Associate Director
Center for Advanced Computing and Data Science (CACDS)

University of Houston
Philip G. Hoffman Hall, Room 228, Houston, TX 77204, USA
Tel: +1 (713) 743-3857   Fax: +1 (713) 743-3335
--

