Re: [OMPI users] Installation of openmpi-1.10.7 fails

2018-01-19 Thread Vahid Askarpour
Gilles,

I have submitted that job with --mca io romio314. If it finishes, I will let 
you know. It is sitting in Conte’s queue at Purdue.

As to Edgar’s question about the file system, here is the output of df -Th:

vaskarpo@conte-fe00:~ $ df -Th
Filesystem                     Type    Size  Used Avail Use% Mounted on
/dev/sda1                      ext4    435G   16G  398G   4% /
tmpfs                          tmpfs    16G  1.4M   16G   1% /dev/shm
persistent-nfs.rcac.purdue.edu:/persistent/home
 nfs  80T   64T   17T  80% /home
persistent-nfs.rcac.purdue.edu:/persistent/apps
 nfs 8.0T  4.0T  4.1T  49% /apps
mds-d01-ib.rcac.purdue.edu@o2ib1:mds-d02-ib.rcac.purdue.edu@o2ib1:/lustreD
 lustre  1.4P  994T  347T  75% /scratch/conte
depotint-nfs.rcac.purdue.edu:/depot
 nfs 4.5P  3.0P  1.6P  66% /depot
172.18.84.186:/persistent/fsadmin
 nfs 200G  130G   71G  65% /usr/rmt_share/fsadmin

The code is compiled in my $HOME and is run on the scratch.
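
In case it is useful, one way to double-check which MPI-IO components this
Open MPI install actually provides is ompi_info (a sketch; assuming ompi_info
sits next to the mpiexec used for these runs):

# list the io components (ompio, romio314, ...) built into the install
~/bin/openmpi-v3.0/bin/ompi_info | grep "MCA io"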

Cheers,

Vahid

> On Jan 18, 2018, at 10:14 PM, Gilles Gouaillardet 
>  wrote:
> 
> Vahid,
> 
> in the v1.10 series, the default MPI-IO component was ROMIO based, and
> in the v3 series, it is now ompio.
> You can force the latest Open MPI to use the ROMIO based component with
> mpirun --mca io romio314 ...
> 
> That being said, your description (e.g. a hand edited file) suggests
> that I/O is not performed with MPI-IO,
> which makes me very puzzled as to why the latest Open MPI is crashing.
> 
> Cheers,
> 
> Gilles
> 
> On Fri, Jan 19, 2018 at 10:55 AM, Edgar Gabriel  
> wrote:
>> I will try to reproduce this problem with 3.0.x, but it might take me a
>> couple of days to get to it.
>> 
>> Since it seemed to have worked with 2.0.x (except for the running out of
>> file handles problem), there is the suspicion that one of the fixes that we
>> introduced since then is the problem.
>> 
>> What file system did you run it on? NFS?
>> 
>> Thanks
>> 
>> Edgar
>> 
>> 
>> On 1/18/2018 5:17 PM, Jeff Squyres (jsquyres) wrote:
>>> 
>>> On Jan 18, 2018, at 5:53 PM, Vahid Askarpour  wrote:
 
 My openmpi3.0.x run (called nscf run) was reading data from a routine
 Quantum Espresso input file edited by hand. The preliminary run (called scf
 run) was done with openmpi3.0.x on a similar input file also edited by 
 hand.
>>> 
>>> Gotcha.
>>> 
>>> Well, that's a little disappointing.
>>> 
>>> It would be good to understand why it is crashing -- is the app doing
>>> something that is accidentally not standard?  Is there a bug in (soon to be
>>> released) Open MPI 3.0.1?  ...?
>>> 
>> 
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Installation of openmpi-1.10.7 fails

2018-01-19 Thread Vinson, John (Fed)
Hi Vahid,

This may be a red herring, but are you using a redirect or -i for the QE input? 
If you are running "pw.x < input", try running with "pw.x -i input". 
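
For illustration, the two launch styles look like this (a sketch only; the
process count, -npool value, and file names are taken from commands quoted
elsewhere in this thread, and the mpiexec path is abbreviated):

# stdin redirect: mpirun/mpiexec normally delivers stdin only to rank 0
mpiexec -np 64 pw.x -npool 64 < nscf.in > nscf.out
# QE's -i flag: pw.x opens the named input file itself, no stdin involved
mpiexec -np 64 pw.x -npool 64 -i nscf.in > nscf.out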

John

-Original Message-
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Vahid 
Askarpour
Sent: Friday, January 19, 2018 8:15 AM
To: Open MPI Users 
Subject: Re: [OMPI users] Installation of openmpi-1.10.7 fails

Gilles,

I have submitted that job with --mca io romio314. If it finishes, I will let 
you know. It is sitting in Conte’s queue at Purdue.

As to Edgar’s question about the file system, here is the output of df -Th:

vaskarpo@conte-fe00:~ $ df -Th
Filesystem                     Type    Size  Used Avail Use% Mounted on
/dev/sda1                      ext4    435G   16G  398G   4% /
tmpfs                          tmpfs    16G  1.4M   16G   1% /dev/shm
persistent-nfs.rcac.purdue.edu:/persistent/home
 nfs  80T   64T   17T  80% /home
persistent-nfs.rcac.purdue.edu:/persistent/apps
 nfs 8.0T  4.0T  4.1T  49% /apps
mds-d01-ib.rcac.purdue.edu@o2ib1:mds-d02-ib.rcac.purdue.edu@o2ib1:/lustreD
 lustre  1.4P  994T  347T  75% /scratch/conte 
depotint-nfs.rcac.purdue.edu:/depot
 nfs 4.5P  3.0P  1.6P  66% /depot
172.18.84.186:/persistent/fsadmin
 nfs 200G  130G   71G  65% /usr/rmt_share/fsadmin

The code is compiled in my $HOME and is run on the scratch.

Cheers,

Vahid

> On Jan 18, 2018, at 10:14 PM, Gilles Gouaillardet 
>  wrote:
> 
> Vahid,
> 
> in the v1.10 series, the default MPI-IO component was ROMIO based, and 
> in the v3 series, it is now ompio.
> You can force the latest Open MPI to use the ROMIO based component 
> with mpirun --mca io romio314 ...
> 
> That being said, your description (e.g. a hand edited file) suggests 
> that I/O is not performed with MPI-IO, which makes me very puzzled as to 
> why the latest Open MPI is crashing.
> 
> Cheers,
> 
> Gilles
> 
> On Fri, Jan 19, 2018 at 10:55 AM, Edgar Gabriel  
> wrote:
>> I will try to reproduce this problem with 3.0.x, but it might take me 
>> a couple of days to get to it.
>> 
>> Since it seemed to have worked with 2.0.x (except for the running out of 
>> file handles problem), there is the suspicion that one of the fixes 
>> that we introduced since then is the problem.
>> 
>> What file system did you run it on? NFS?
>> 
>> Thanks
>> 
>> Edgar
>> 
>> 
>> On 1/18/2018 5:17 PM, Jeff Squyres (jsquyres) wrote:
>>> 
>>> On Jan 18, 2018, at 5:53 PM, Vahid Askarpour  wrote:
 
 My openmpi3.0.x run (called nscf run) was reading data from a 
 routine Quantum Espresso input file edited by hand. The preliminary 
 run (called scf
 run) was done with openmpi3.0.x on a similar input file also edited by 
 hand.
>>> 
>>> Gotcha.
>>> 
>>> Well, that's a little disappointing.
>>> 
>>> It would be good to understand why it is crashing -- is the app 
>>> doing something that is accidentally not standard?  Is there a bug 
>>> in (soon to be
>>> released) Open MPI 3.0.1?  ...?
>>> 
>> 
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Installation of openmpi-1.10.7 fails

2018-01-19 Thread Edgar Gabriel
thanks, that is interesting. Since /scratch is a lustre file system, 
Open MPI should actually utilize romio314 for that anyway, not ompio. 
What I have seen happen on at least one occasion, however, is that ompio 
was still used because (I suspect) romio314 didn't pick up the 
configuration options correctly. It is a little bit of a mess from that 
perspective that we have to pass the romio arguments with different 
flags/options than for ompio, e.g.


--with-lustre=/path/to/lustre/ 
--with-io-romio-flags="--with-file-system=ufs+nfs+lustre 
--with-lustre=/path/to/lustre"


ompio should pick up the lustre options correctly if lustre 
headers/libraries are found at the default location, even if the user 
did not pass the --with-lustre option. I am not entirely sure what 
happens in romio if the user did not pass the 
--with-file-system=ufs+nfs+lustre but the lustre headers/libraries are 
found at the default location, i.e. whether the lustre adio component is 
still compiled or not.
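
Put together, a configure invocation along those lines might look like the
following (a sketch only; the prefix is a placeholder and /path/to/lustre
stands for the actual Lustre install location):

./configure --prefix=$HOME/bin/openmpi-v3.0 \
    --with-lustre=/path/to/lustre \
    --with-io-romio-flags="--with-file-system=ufs+nfs+lustre --with-lustre=/path/to/lustre"
make all install

Afterwards, something like "ompi_info | grep -i lustre" should show whether
the Lustre support was actually built in.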


Anyway, let's wait for the outcome of your run enforcing the use of the 
romio314 component, and I will still try to reproduce your problem on my 
system.


Thanks
Edgar

On 1/19/2018 7:15 AM, Vahid Askarpour wrote:

Gilles,

I have submitted that job with --mca io romio314. If it finishes, I will let 
you know. It is sitting in Conte’s queue at Purdue.

As to Edgar’s question about the file system, here is the output of df -Th:

vaskarpo@conte-fe00:~ $ df -Th
Filesystem                     Type    Size  Used Avail Use% Mounted on
/dev/sda1                      ext4    435G   16G  398G   4% /
tmpfs                          tmpfs    16G  1.4M   16G   1% /dev/shm
persistent-nfs.rcac.purdue.edu:/persistent/home
  nfs  80T   64T   17T  80% /home
persistent-nfs.rcac.purdue.edu:/persistent/apps
  nfs 8.0T  4.0T  4.1T  49% /apps
mds-d01-ib.rcac.purdue.edu@o2ib1:mds-d02-ib.rcac.purdue.edu@o2ib1:/lustreD
  lustre  1.4P  994T  347T  75% /scratch/conte
depotint-nfs.rcac.purdue.edu:/depot
  nfs 4.5P  3.0P  1.6P  66% /depot
172.18.84.186:/persistent/fsadmin
  nfs 200G  130G   71G  65% /usr/rmt_share/fsadmin

The code is compiled in my $HOME and is run on the scratch.

Cheers,

Vahid


On Jan 18, 2018, at 10:14 PM, Gilles 
Gouaillardet  wrote:

Vahid,

in the v1.10 series, the default MPI-IO component was ROMIO based, and
in the v3 series, it is now ompio.
You can force the latest Open MPI to use the ROMIO based component with
mpirun --mca io romio314 ...

That being said, your description (e.g. a hand edited file) suggests
that I/O is not performed with MPI-IO,
which makes me very puzzled as to why the latest Open MPI is crashing.

Cheers,

Gilles

On Fri, Jan 19, 2018 at 10:55 AM, Edgar Gabriel  wrote:

I will try to reproduce this problem with 3.0.x, but it might take me a
couple of days to get to it.

Since it seemed to have worked with 2.0.x (except for the running out of file
handles problem), there is the suspicion that one of the fixes that we
introduced since then is the problem.

What file system did you run it on? NFS?

Thanks

Edgar


On 1/18/2018 5:17 PM, Jeff Squyres (jsquyres) wrote:

On Jan 18, 2018, at 5:53 PM, Vahid Askarpour  wrote:

My openmpi3.0.x run (called nscf run) was reading data from a routine
Quantum Espresso input file edited by hand. The preliminary run (called scf
run) was done with openmpi3.0.x on a similar input file also edited by hand.

Gotcha.

Well, that's a little disappointing.

It would be good to understand why it is crashing -- is the app doing
something that is accidentally not standard?  Is there a bug in (soon to be
released) Open MPI 3.0.1?  ...?


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Installation of openmpi-1.10.7 fails

2018-01-19 Thread Vahid Askarpour
To run EPW, the command for running the preliminary nscf run is 
(http://epw.org.uk/Documentation/B-dopedDiamond):

~/bin/openmpi-v3.0/bin/mpiexec -np 64 
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out

So I submitted it with the following command:

~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64 
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out

And it crashed like the first time.

It is interesting that the preliminary scf run works fine. The scf run requires 
Quantum Espresso to generate the k points automatically as shown below:

K_POINTS (automatic)
12 12 12 0 0 0

The nscf run which crashes includes a list of k points (1728 in this case) as 
seen below:

K_POINTS (crystal)
1728
  0.  0.  0.  5.787037e-04
  0.  0.  0.0833  5.787037e-04
  0.  0.  0.1667  5.787037e-04
  0.  0.  0.2500  5.787037e-04
  0.  0.  0.  5.787037e-04
  0.  0.  0.4167  5.787037e-04
  0.  0.  0.5000  5.787037e-04
  0.  0.  0.5833  5.787037e-04
  0.  0.  0.6667  5.787037e-04
  0.  0.  0.7500  5.787037e-04
…….
…….

To build openmpi (either 1.10.7 or 3.0.x), I loaded the Fortran compiler 
module, configured with only the “--prefix=” option, and then ran “make all 
install”. I did not enable or disable any other options.
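
Spelled out, that build amounts to roughly the following (a reconstruction of
the steps described above; the module name is a guess and the prefix is
abbreviated):

module load intel        # the Fortran compiler module (exact name is a guess)
./configure --prefix=$HOME/bin/openmpi-v3.0
make all install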

Cheers,

Vahid


On Jan 19, 2018, at 10:23 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote:


thanks, that is interesting. Since /scratch is a lustre file system, Open MPI 
should actually utilize romio314 for that anyway, not ompio. What I have seen 
happen on at least one occasion, however, is that ompio was still used because 
(I suspect) romio314 didn't pick up the configuration options correctly. It is 
a little bit of a mess from that perspective that we have to pass the romio 
arguments with different flags/options than for ompio, e.g.

--with-lustre=/path/to/lustre/ 
--with-io-romio-flags="--with-file-system=ufs+nfs+lustre 
--with-lustre=/path/to/lustre"

ompio should pick up the lustre options correctly if lustre headers/libraries 
are found at the default location, even if the user did not pass the 
--with-lustre option. I am not entirely sure what happens in romio if the user 
did not pass the --with-file-system=ufs+nfs+lustre but the lustre 
headers/libraries are found at the default location, i.e. whether the lustre 
adio component is still compiled or not.

Anyway, let's wait for the outcome of your run enforcing the use of the romio314 
component, and I will still try to reproduce your problem on my system.

Thanks
Edgar

On 1/19/2018 7:15 AM, Vahid Askarpour wrote:

Gilles,

I have submitted that job with --mca io romio314. If it finishes, I will let 
you know. It is sitting in Conte’s queue at Purdue.

As to Edgar’s question about the file system, here is the output of df -Th:

vaskarpo@conte-fe00:~ $ df -Th
Filesystem                     Type    Size  Used Avail Use% Mounted on
/dev/sda1                      ext4    435G   16G  398G   4% /
tmpfs                          tmpfs    16G  1.4M   16G   1% /dev/shm
persistent-nfs.rcac.purdue.edu:/persistent/home
 nfs  80T   64T   17T  80% /home
persistent-nfs.rcac.purdue.edu:/persistent/apps
 nfs 8.0T  4.0T  4.1T  49% /apps
mds-d01-ib.rcac.purdue.edu@o2ib1:mds-d02-ib.rcac.purdue.edu@o2ib1:/lustreD
 lustre  1.4P  994T  347T  75% /scratch/conte
depotint-nfs.rcac.purdue.edu:/depot
 nfs 4.5P  3.0P  1.6P  66% /depot
172.18.84.186:/persistent/fsadmin
 nfs 200G  130G   71G  65% /usr/rmt_share/fsadmin

The code is compiled in my $HOME and is run on the scratch.

Cheers,

Vahid



On Jan 18, 2018, at 10:14 PM, Gilles Gouaillardet 
 wrote:

Vahid,

in the v1.10 series, the default MPI-IO component was ROMIO based, and
in the v3 series, it is now ompio.
You can force the latest Open MPI to use the ROMIO based component with
mpirun --mca io romio314 ...

That being said, your description (e.g. a hand edited file) suggests
that I/O is not performed with MPI-IO,
which makes me very puzzled as to why the latest Open MPI is crashing.

Cheers,

Gilles

On Fri, Jan 19, 2018 at 10:55 AM, Edgar Gabriel 
 wrote:


I will try to reproduce this problem with 3.0.x, but it might take me a
couple of days to get to it.

Since it seemed to have worked with 2.0.x (except for the running out of file
handles problem), there is the suspicion that one of the fixes that we
introduced since then is the problem.

What file system did you run it on? NFS?

Thanks

Edgar


On 1/18/2018 5:17 PM

Re: [OMPI users] Installation of openmpi-1.10.7 fails

2018-01-19 Thread Edgar Gabriel
ok, thank you for the information. Two short questions and requests. I 
have qe-6.2.1 compiled and running on my system (although it is with 
gcc-6.4 instead of the intel compiler), and I am currently running the 
parallel test suite. So far, all the tests passed, although it is still 
running.


My question is now, would it be possible for you to give me access to 
exactly the same data set that you are using?  You could upload to a 
webpage or similar and just send me the link.


The second question/request: could you rerun your tests one more time, 
this time forcing the use of ompio? e.g. --mca io ompio
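
Concretely, that would be the earlier mpiexec line with the io selection
switched (same paths and options as in your previous message):

~/bin/openmpi-v3.0/bin/mpiexec --mca io ompio -np 64 \
    /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out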


Thanks

Edgar


On 1/19/2018 10:32 AM, Vahid Askarpour wrote:
To run EPW, the command for running the preliminary nscf run is 
(http://epw.org.uk/Documentation/B-dopedDiamond):


~/bin/openmpi-v3.0/bin/mpiexec -np 64 
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > 
nscf.out


So I submitted it with the following command:

~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64 
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > 
nscf.out


And it crashed like the first time.

It is interesting that the preliminary scf run works fine. The scf run 
requires Quantum Espresso to generate the k points automatically as 
shown below:


K_POINTS (automatic)
12 12 12 0 0 0

The nscf run which crashes includes a list of k points (1728 in this 
case) as seen below:


K_POINTS (crystal)
1728
  0.  0.  0.  5.787037e-04
  0.  0.  0.0833  5.787037e-04
  0.  0.  0.1667  5.787037e-04
  0.  0.  0.2500  5.787037e-04
  0.  0.  0.  5.787037e-04
  0.  0.  0.4167  5.787037e-04
  0.  0.  0.5000  5.787037e-04
  0.  0.  0.5833  5.787037e-04
  0.  0.  0.6667  5.787037e-04
  0.  0.  0.7500  5.787037e-04
…….
…….

To build openmpi (either 1.10.7 or 3.0.x), I loaded the Fortran compiler 
module, configured with only the “--prefix=” option, and then ran “make all 
install”. I did not enable or disable any other options.


Cheers,

Vahid


On Jan 19, 2018, at 10:23 AM, Edgar Gabriel wrote:


thanks, that is interesting. Since /scratch is a lustre file system, 
Open MPI should actually utilize romio314 for that anyway, not ompio. 
What I have seen happen on at least one occasion, however, is that 
ompio was still used because (I suspect) romio314 didn't pick up 
the configuration options correctly. It is a little bit of a mess 
from that perspective that we have to pass the romio arguments with 
different flags/options than for ompio, e.g.


--with-lustre=/path/to/lustre/ 
--with-io-romio-flags="--with-file-system=ufs+nfs+lustre 
--with-lustre=/path/to/lustre"


ompio should pick up the lustre options correctly if lustre 
headers/libraries are found at the default location, even if the user 
did not pass the --with-lustre option. I am not entirely sure what 
happens in romio if the user did not pass the 
--with-file-system=ufs+nfs+lustre but the lustre headers/libraries 
are found at the default location, i.e. whether the lustre adio 
component is still compiled or not.


Anyway, let's wait for the outcome of your run enforcing the use of the 
romio314 component, and I will still try to reproduce your problem on 
my system.


Thanks
Edgar

On 1/19/2018 7:15 AM, Vahid Askarpour wrote:

Gilles,

I have submitted that job with --mca io romio314. If it finishes, I will let 
you know. It is sitting in Conte’s queue at Purdue.

As to Edgar’s question about the file system, here is the output of df -Th:

vaskarpo@conte-fe00:~ $ df -Th
Filesystem                     Type    Size  Used Avail Use% Mounted on
/dev/sda1                      ext4    435G   16G  398G   4% /
tmpfs                          tmpfs    16G  1.4M   16G   1% /dev/shm
persistent-nfs.rcac.purdue.edu:/persistent/home
  nfs  80T   64T   17T  80% /home
persistent-nfs.rcac.purdue.edu:/persistent/apps
  nfs 8.0T  4.0T  4.1T  49% /apps
mds-d01-ib.rcac.purdue.edu@o2ib1:mds-d02-ib.rcac.purdue.edu@o2ib1:/lustreD
  lustre  1.4P  994T  347T  75% /scratch/conte
depotint-nfs.rcac.purdue.edu:/depot
  nfs 4.5P  3.0P  1.6P  66% /depot
172.18.84.186:/persistent/fsadmin
  nfs 200G  130G   71G  65% /usr/rmt_share/fsadmin

The code is compiled in my $HOME and is run on the scratch.

Cheers,

Vahid


On Jan 18, 2018, at 10:14 PM, Gilles 
Gouaillardet  wrote:

Vahid,

in the v1.10 series, the default MPI-IO component was ROMIO based, and
in the v3 series, it is now ompio.
You can force the latest Open MPI to use the ROMIO based component with
mpirun --mca io romio314 ...

That being said, your description (e.g. a hand edited file) suggests
tha

Re: [OMPI users] Installation of openmpi-1.10.7 fails

2018-01-19 Thread Vahid Askarpour
Hi Edgar,

Just to let you know that the nscf run with --mca io ompio crashed like the 
other two runs.

Thank you,

Vahid

On Jan 19, 2018, at 12:46 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:


ok, thank you for the information. Two short questions and requests. I have 
qe-6.2.1 compiled and running on my system (although it is with gcc-6.4 instead 
of the intel compiler), and I am currently running the parallel test suite. So 
far, all the tests passed, although it is still running.

My question is now, would it be possible for you to give me access to exactly 
the same data set that you are using?  You could upload to a webpage or similar 
and just send me the link.

The second question/request: could you rerun your tests one more time, this 
time forcing the use of ompio? e.g. --mca io ompio

Thanks

Edgar

On 1/19/2018 10:32 AM, Vahid Askarpour wrote:
To run EPW, the command for running the preliminary nscf run is 
(http://epw.org.uk/Documentation/B-dopedDiamond):

~/bin/openmpi-v3.0/bin/mpiexec -np 64 
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out

So I submitted it with the following command:

~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64 
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out

And it crashed like the first time.

It is interesting that the preliminary scf run works fine. The scf run requires 
Quantum Espresso to generate the k points automatically as shown below:

K_POINTS (automatic)
12 12 12 0 0 0

The nscf run which crashes includes a list of k points (1728 in this case) as 
seen below:

K_POINTS (crystal)
1728
  0.  0.  0.  5.787037e-04
  0.  0.  0.0833  5.787037e-04
  0.  0.  0.1667  5.787037e-04
  0.  0.  0.2500  5.787037e-04
  0.  0.  0.  5.787037e-04
  0.  0.  0.4167  5.787037e-04
  0.  0.  0.5000  5.787037e-04
  0.  0.  0.5833  5.787037e-04
  0.  0.  0.6667  5.787037e-04
  0.  0.  0.7500  5.787037e-04
…….
…….

To build openmpi (either 1.10.7 or 3.0.x), I loaded the Fortran compiler 
module, configured with only the “--prefix=” option, and then ran “make all 
install”. I did not enable or disable any other options.

Cheers,

Vahid


On Jan 19, 2018, at 10:23 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote:


thanks, that is interesting. Since /scratch is a lustre file system, Open MPI 
should actually utilize romio314 for that anyway, not ompio. What I have seen 
happen on at least one occasion, however, is that ompio was still used because 
(I suspect) romio314 didn't pick up the configuration options correctly. It is 
a little bit of a mess from that perspective that we have to pass the romio 
arguments with different flags/options than for ompio, e.g.

--with-lustre=/path/to/lustre/ 
--with-io-romio-flags="--with-file-system=ufs+nfs+lustre 
--with-lustre=/path/to/lustre"

ompio should pick up the lustre options correctly if lustre headers/libraries 
are found at the default location, even if the user did not pass the 
--with-lustre option. I am not entirely sure what happens in romio if the user 
did not pass the --with-file-system=ufs+nfs+lustre but the lustre 
headers/libraries are found at the default location, i.e. whether the lustre 
adio component is still compiled or not.

Anyway, let's wait for the outcome of your run enforcing the use of the romio314 
component, and I will still try to reproduce your problem on my system.

Thanks
Edgar

On 1/19/2018 7:15 AM, Vahid Askarpour wrote:

Gilles,

I have submitted that job with --mca io romio314. If it finishes, I will let 
you know. It is sitting in Conte’s queue at Purdue.

As to Edgar’s question about the file system, here is the output of df -Th:

vaskarpo@conte-fe00:~ $ df -Th
Filesystem                     Type    Size  Used Avail Use% Mounted on
/dev/sda1                      ext4    435G   16G  398G   4% /
tmpfs                          tmpfs    16G  1.4M   16G   1% /dev/shm
persistent-nfs.rcac.purdue.edu:/persistent/home
 nfs  80T   64T   17T  80% /home
persistent-nfs.rcac.purdue.edu:/persistent/apps
 nfs 8.0T  4.0T  4.1T  49% /apps
mds-d01-ib.rcac.purdue.edu@o2ib1:mds-d02-ib.rcac.purdue.edu@o2ib1:/lustreD
 lustre  1.4P  994T  347T  75% /scratch/conte
depotint-nfs.rcac.purdue.edu:/depot
 nfs 4.5P  3.0P  1.6P  66% /depot
172.18.84.186:/persistent/fsadmin
 nfs 200G  130G   71G  65% /usr/rmt_share/fsadmin

The code is compiled in my $HOME and is run on the scratch.

Cheers,

Vahid



On Jan 18, 2018, at 10:14 PM, Gilles Gouaillardet 


Re: [OMPI users] Installation of openmpi-1.10.7 fails

2018-01-19 Thread Edgar Gabriel
OK, here is what I found out so far; I will have to stop for today, however:


 1. I can in fact reproduce your bug on my systems.

 2. I can confirm that the problem occurs both with romio314 and ompio. 
I *think* the issue is that the input_tmp.in file is incomplete. In both 
cases (ompio and romio) the end of the file looks as follows (and it's 
exactly the same for both libraries):


gabriel@crill-002:/tmp/gabriel/qe-6.2.1/QE_input_files> tail -10 input_tmp.in

  0.6667  0.5000  0.8333  5.787037e-04
  0.6667  0.5000  0.9167  5.787037e-04
  0.6667  0.5833  0.  5.787037e-04
  0.6667  0.5833  0.0833  5.787037e-04
  0.6667  0.5833  0.1667  5.787037e-04
  0.6667  0.5833  0.2500  5.787037e-04
  0.6667  0.5833  0.  5.787037e-04
  0.6667  0.5833  0.4167  5.787037e-04
  0.6667  0.5833  0.5000  5.787037e-04
  0.6667  0.5833  0.5833  5

which is what I *think* causes the problem.

 3. I tried to find where input_tmp.in is generated, but haven't 
completely identified the location. However, I could not find MPI 
file_write(_all) operations anywhere in the code, although there are 
some MPI_file_read(_all) operations.


 4. I can confirm that the behavior with Open MPI 1.8.x is different. 
input_tmp.in looks more complete (at least it doesn't end in the middle 
of the line). The simulation still does not finish for me, but the bug 
reported is slightly different; I might just be missing a file or something:



 from pw_readschemafile : error # 1
 xml data file not found

Since I think input_tmp.in is generated from data that is provided in 
nscf.in, it might very well be something in the MPI_File_read(_all) 
operation that causes the issue, but since both ompio and romio are 
affected, there is a good chance that something outside of the control of 
the io components is causing the trouble (maybe a datatype issue that has 
changed from the 1.8.x series to 3.0.x).


 5. Last but not least, I also wanted to mention that I ran all 
parallel tests that I found in the testsuite (run-tests-cp-parallel, 
run-tests-pw-parallel, run-tests-ph-parallel, run-tests-epw-parallel ), 
and they all passed with ompio (and romio314 although I only ran a 
subset of the tests with romio314).
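
Two quick follow-up checks related to points 2 and 3 above (sketches only;
GNU grep/coreutils assumed, file names as used earlier in this message):

# point 2: nscf.in declares 1728 k-points, so comparing line counts of the
# original input with the generated copy gives a quick indication of whether
# input_tmp.in really was cut short
wc -l nscf.in input_tmp.in
tail -1 input_tmp.in

# point 3: search the QE source tree for MPI-IO calls (Fortran sources)
grep -rin "mpi_file_write" --include="*.f90" .
grep -rin "mpi_file_read" --include="*.f90" .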


Thanks

Edgar

-




On 01/19/2018 11:44 AM, Vahid Askarpour wrote:

Hi Edgar,

Just to let you know that the nscf run with --mca io ompio crashed 
like the other two runs.


Thank you,

Vahid

On Jan 19, 2018, at 12:46 PM, Edgar Gabriel wrote:


ok, thank you for the information. Two short questions and requests. 
I have qe-6.2.1 compiled and running on my system (although it is 
with gcc-6.4 instead of the intel compiler), and I am currently 
running the parallel test suite. So far, all the tests passed, 
although it is still running.


My question is now, would it be possible for you to give me access to 
exactly the same data set that you are using?  You could upload to a 
webpage or similar and just send me the link.


The second question/request: could you rerun your tests one more 
time, this time forcing the use of ompio? e.g. --mca io ompio


Thanks

Edgar


On 1/19/2018 10:32 AM, Vahid Askarpour wrote:
To run EPW, the command for running the preliminary nscf run is 
(http://epw.org.uk/Documentation/B-dopedDiamond):


~/bin/openmpi-v3.0/bin/mpiexec -np 64 
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > 
nscf.out


So I submitted it with the following command:

~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64 
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > 
nscf.out


And it crashed like the first time.

It is interesting that the preliminary scf run works fine. The scf 
run requires Quantum Espresso to generate the k points automatically 
as shown below:


K_POINTS (automatic)
12 12 12 0 0 0

The nscf run which crashes includes a list of k points (1728 in this 
case) as seen below:


K_POINTS (crystal)
1728
  0.  0.  0.  5.787037e-04
  0.  0.  0.0833  5.787037e-04
  0.  0.  0.1667  5.787037e-04
  0.  0.  0.2500  5.787037e-04
  0.  0.  0.  5.787037e-04
  0.  0.  0.4167  5.787037e-04
  0.  0.  0.5000  5.787037e-04
  0.  0.  0.5833  5.787037e-04
  0.  0.  0.6667  5.787037e-04
  0.  0.  0.7500  5.787037e-04
…….
…….

To build openmpi (either 1.10.7 or 3.0.x), I loaded the Fortran compiler 
module, configured with only the “--prefix=” option, and then ran 
“make all install”. I did not enable or disable any other options.


Cheers,

Vahid


On Jan 19, 2018, at 10:23 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote:


thanks, that is interesting. Since /scratch is a lustre file 
system, Open MPI should actually utilize romio314 f

Re: [OMPI users] Installation of openmpi-1.10.7 fails

2018-01-19 Thread Vahid Askarpour
Concerning the following error

 from pw_readschemafile : error # 1
 xml data file not found

The nscf run uses files generated by the scf.in run. So I first run scf.in and 
when it finishes, I run nscf.in. If you have done this and still get the above 
error, then this could be another bug. It does not happen for me with 
intel14/openmpi-1.8.8.
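
Spelled out, the sequence is (a sketch; the mpiexec and pw.x paths are
abbreviated from the commands earlier in this thread):

# step 1: the self-consistent (scf) run generates the data files nscf needs
mpiexec -np 64 pw.x -npool 64 < scf.in > scf.out
# step 2: only after step 1 has finished, run the non-self-consistent step
mpiexec -np 64 pw.x -npool 64 < nscf.in > nscf.out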

Thanks for the update,

Vahid

On Jan 19, 2018, at 3:08 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:


OK, here is what I found out so far; I will have to stop for today, however:

 1. I can in fact reproduce your bug on my systems.

 2. I can confirm that the problem occurs both with romio314 and ompio. I 
*think* the issue is that the input_tmp.in file is incomplete. In both cases 
(ompio and romio) the end of the file looks as follows (and it's exactly the 
same for both libraries):

gabriel@crill-002:/tmp/gabriel/qe-6.2.1/QE_input_files> tail -10 input_tmp.in
  0.6667  0.5000  0.8333  5.787037e-04
  0.6667  0.5000  0.9167  5.787037e-04
  0.6667  0.5833  0.  5.787037e-04
  0.6667  0.5833  0.0833  5.787037e-04
  0.6667  0.5833  0.1667  5.787037e-04
  0.6667  0.5833  0.2500  5.787037e-04
  0.6667  0.5833  0.  5.787037e-04
  0.6667  0.5833  0.4167  5.787037e-04
  0.6667  0.5833  0.5000  5.787037e-04
  0.6667  0.5833  0.5833  5

which is what I *think* causes the problem.

 3. I tried to find where input_tmp.in is generated, but haven't completely 
identified the location. However, I could not find MPI file_write(_all) 
operations anywhere in the code, although there are some MPI_file_read(_all) 
operations.

 4. I can confirm that the behavior with Open MPI 1.8.x is different. 
input_tmp.in looks more complete (at least it doesn't end in the middle of the 
line). The simulation still does not finish for me, but the bug reported is 
slightly different; I might just be missing a file or something:

 from pw_readschemafile : error # 1
 xml data file not found

Since I think input_tmp.in is generated from data that is provided in nscf.in, 
it might very well be something in the MPI_File_read(_all) operation that 
causes the issue, but since both ompio and romio are affected, there is a good 
chance that something outside of the control of the io components is causing 
the trouble (maybe a datatype issue that has changed from the 1.8.x series to 
3.0.x).

 5. Last but not least, I also wanted to mention that I ran all parallel tests 
that I found in the testsuite  (run-tests-cp-parallel, run-tests-pw-parallel, 
run-tests-ph-parallel, run-tests-epw-parallel ), and they all passed with ompio 
(and romio314 although I only ran a subset of the tests with romio314).


Thanks

Edgar

-



On 01/19/2018 11:44 AM, Vahid Askarpour wrote:
Hi Edgar,

Just to let you know that the nscf run with --mca io ompio crashed like the 
other two runs.

Thank you,

Vahid

On Jan 19, 2018, at 12:46 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:


ok, thank you for the information. Two short questions and requests. I have 
qe-6.2.1 compiled and running on my system (although it is with gcc-6.4 instead 
of the intel compiler), and I am currently running the parallel test suite. So 
far, all the tests passed, although it is still running.

My question is now, would it be possible for you to give me access to exactly 
the same data set that you are using?  You could upload to a webpage or similar 
and just send me the link.

The second question/request: could you rerun your tests one more time, this 
time forcing the use of ompio? e.g. --mca io ompio

Thanks

Edgar

On 1/19/2018 10:32 AM, Vahid Askarpour wrote:
To run EPW, the command for running the preliminary nscf run is 
(http://epw.org.uk/Documentation/B-dopedDiamond):

~/bin/openmpi-v3.0/bin/mpiexec -np 64 
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out

So I submitted it with the following command:

~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64 
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out

And it crashed like the first time.

It is interesting that the preliminary scf run works fine. The scf run requires 
Quantum Espresso to generate the k points automatically as shown below:

K_POINTS (automatic)
12 12 12 0 0 0

The nscf run which crashes includes a list of k points (1728 in this case) as 
seen below:

K_POINTS (crystal)
1728
  0.  0.  0.  5.787037e-04
  0.  0.  0.0833  5.787037e-04
  0.  0.  0.1667  5.787037e-04
  0.  0.  0.2500  5.787037e-04
  0.  0.  0.  5.787037e-04
  0.  0.  0.4167  5.787037e-04
  0.  0.  0.5000  5.787037e-04
  0.  0.  0.5833  5.787037e-04
  0.000

Re: [OMPI users] Installation of openmpi-1.10.7 fails

2018-01-19 Thread Stephen Guzik
Not sure if this is related, and I have not had time to investigate it
much or reduce it, but I am also having issues with 3.0.x. There are a
couple of layers of CGNS and HDF5, but I am seeing:

mpirun --mca io romio314 --mca btl self,vader,openib...
-- works perfectly

mpirun --mca btl self,vader,openib...
cgio_open_file:H5Dwrite:write to node data failed

The file system is NFS, and the build is openmpi-v3.0.x-201711220306-2399e85.
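
In case it helps as a stopgap, the romio314 selection does not have to be
typed on every command line; the standard Open MPI MCA mechanisms can set it
globally (a sketch):

# per shell, via the MCA environment variable
export OMPI_MCA_io=romio314
mpirun --mca btl self,vader,openib ...

# or persistently, by adding the line "io = romio314" to
# $HOME/.openmpi/mca-params.conf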

Stephen

Stephen Guzik, Ph.D.
Assistant Professor, Department of Mechanical Engineering
Colorado State University

On 01/18/2018 04:17 PM, Jeff Squyres (jsquyres) wrote:
> On Jan 18, 2018, at 5:53 PM, Vahid Askarpour  wrote:
>> My openmpi3.0.x run (called nscf run) was reading data from a routine 
>> Quantum Espresso input file edited by hand. The preliminary run (called scf 
>> run) was done with openmpi3.0.x on a similar input file also edited by hand. 
> Gotcha.
>
> Well, that's a little disappointing.
>
> It would be good to understand why it is crashing -- is the app doing 
> something that is accidentally not standard?  Is there a bug in (soon to be 
> released) Open MPI 3.0.1?  ...?
>

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Installation of openmpi-1.10.7 fails

2018-01-19 Thread Edgar Gabriel
This is most likely a different issue. The bug in the original case also 
appears on a local file system/disk; it doesn't have to be NFS.


That being said, I would urge you to submit a new issue (or start a new email 
thread). I would be more than happy to look into your problem as well, since 
we submitted a number of patches to the 3.0.x branch specifically for NFS.


Thanks
Edgar

On 1/19/2018 2:42 PM, Stephen Guzik wrote:

Not sure if this is related, and I have not had time to investigate it
much or reduce it, but I am also having issues with 3.0.x. There are a
couple of layers of CGNS and HDF5, but I am seeing:

mpirun --mca io romio314 --mca btl self,vader,openib...
-- works perfectly

mpirun --mca btl self,vader,openib...
cgio_open_file:H5Dwrite:write to node data failed

The file system is NFS, and the build is openmpi-v3.0.x-201711220306-2399e85.

Stephen

Stephen Guzik, Ph.D.
Assistant Professor, Department of Mechanical Engineering
Colorado State University

On 01/18/2018 04:17 PM, Jeff Squyres (jsquyres) wrote:

On Jan 18, 2018, at 5:53 PM, Vahid Askarpour  wrote:

My openmpi3.0.x run (called nscf run) was reading data from a routine Quantum 
Espresso input file edited by hand. The preliminary run (called scf run) was 
done with openmpi3.0.x on a similar input file also edited by hand.

Gotcha.

Well, that's a little disappointing.

It would be good to understand why it is crashing -- is the app doing something 
that is accidentally not standard?  Is there a bug in (soon to be released) 
Open MPI 3.0.1?  ...?


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users