Did you install one version of Open MPI over another version?

    https://www.open-mpi.org/faq/?category=building#install-overwrite
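
If the 3.0.x build went into a prefix that previously held another Open MPI
version, stale components from the old install can linger and cause exactly
the kind of component-load problems described below. A minimal check/cleanup
sketch (the source directory is a placeholder; the prefix matches the one
used in the commands quoted further down):

    # check which mpirun and libraries are actually being picked up
    which mpirun
    mpirun --version

    # if the prefix was reused, wipe it and reinstall from a clean build tree
    rm -rf ~/bin/openmpi-v3.0
    cd /path/to/openmpi-3.0.x-source
    ./configure --prefix=$HOME/bin/openmpi-v3.0
    make all install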


> On Jan 30, 2018, at 2:09 PM, Vahid Askarpour <vh261...@dal.ca> wrote:
> 
> This is just an update on how things turned out with openmpi-3.0.x.
> 
> I compiled both EPW and openmpi with intel14. In the past, EPW crashed for 
> both intel16 and 17. However, with intel14 and openmpi/1.8.8, I have been 
> getting results consistently.
> 
> The nscf.in worked with the -i argument. However, when I ran EPW with 
> intel14/openmpi-3.0.x, I got the following error:
> 
> mca_base_component_repository_open: unable to open mca_io_romio314: 
> libgpfs.so: cannot open shared object file: No such file or directory 
> (ignored)
> 
> What is interesting is that this error occurs in the middle of a long loop. 
> Since the loop repeats over different coordinates, the error may not be 
> coming from the gpfs library.
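> 
> One quick way to see why that component fails to load is to inspect its
> shared-library dependencies directly; a sketch, assuming the
> ~/bin/openmpi-v3.0 prefix used elsewhere in this thread (the exact
> component path may differ on your install):
> 
>     # show which dependencies of the romio314 component cannot be resolved
>     ldd ~/bin/openmpi-v3.0/lib/openmpi/mca_io_romio314.so | grep "not found"
> 
>     # if GPFS really is absent on the compute nodes, the message is harmless;
>     # it can be silenced, or the component excluded outright
>     mpirun --mca mca_base_component_show_load_errors 0 ...
>     mpirun --mca io ^romio314 ...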
> 
> Cheers,
> 
> Vahid
> 
>> On Jan 23, 2018, at 9:52 AM, Gilles Gouaillardet 
>> <gilles.gouaillar...@gmail.com> wrote:
>> 
>> Fair enough,
>> 
>> To be on the safe side, I encourage you to use the latest Intel compilers
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> Vahid Askarpour <vh261...@dal.ca> wrote:
>> Gilles,
>> 
>> I have not tried compiling the latest openmpi with GCC. I am waiting to see 
>> how the intel version turns out before attempting GCC.
>> 
>> Cheers,
>> 
>> Vahid
>> 
>>> On Jan 23, 2018, at 9:33 AM, Gilles Gouaillardet 
>>> <gilles.gouaillar...@gmail.com> wrote:
>>> 
>>> Vahid,
>>> 
>>> There used to be a bug in the IOF part, but I am pretty sure this has 
>>> already been fixed.
>>> 
>>> Does the issue also occur with GNU compilers?
>>> There used to be an issue with the Intel Fortran runtime (short reads/writes 
>>> were silently ignored), and that was also fixed some time ago.
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> Vahid Askarpour <vh261...@dal.ca> wrote:
>>> This would work for Quantum Espresso input. I am waiting to see what 
>>> happens to EPW. I don’t think EPW accepts the -i argument. I will report 
>>> back once the EPW job is done.
>>> 
>>> Cheers,
>>> 
>>> Vahid
>>>  
>>>> On Jan 22, 2018, at 6:05 PM, Edgar Gabriel <egabr...@central.uh.edu> wrote:
>>>> 
>>>> Well, my final comment on this topic: as somebody suggested earlier in 
>>>> this email chain, if you provide the input with the -i argument instead of 
>>>> piping from standard input, things seem to work as far as I can see 
>>>> (disclaimer: I do not know what the final outcome should be. I just see 
>>>> that the application does not complain about the 'end of file while 
>>>> reading crystal k points'). So maybe that is the simplest solution.
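>>>> 
>>>> For reference, the -i form of the launch line quoted further down would
>>>> look roughly like this (a sketch; same binary, pool count, and input
>>>> file as in the original command):
>>>> 
>>>>     mpirun -np 64 ../bin/pw.x -npool 64 -i nscf.in > nscf.out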
>>>> 
>>>> Thanks
>>>> Edgar
>>>> 
>>>> On 1/22/2018 1:17 PM, Edgar Gabriel wrote:
>>>>> after some further investigation, I am fairly confident that this is not 
>>>>> an MPI I/O problem. 
>>>>> The input file input_tmp.in is generated in this sequence of instructions 
>>>>> (which is in Modules/open_close_input_file.f90)
>>>>> ---
>>>>>   IF ( TRIM(input_file_) /= ' ' ) THEN
>>>>>      !
>>>>>      ! copy file to be opened into input_file
>>>>>      !
>>>>>      input_file = input_file_
>>>>>      !
>>>>>   ELSE
>>>>>      !
>>>>>      ! if no file specified then copy from standard input
>>>>>      !
>>>>>      input_file="input_tmp.in"
>>>>>      OPEN(UNIT = stdtmp, FILE=trim(input_file), FORM='formatted', &
>>>>>           STATUS='unknown', IOSTAT = ierr )
>>>>>      IF ( ierr > 0 ) GO TO 30
>>>>>      !
>>>>>      dummy=' '
>>>>>      WRITE(stdout, '(5x,a)') "Waiting for input..."
>>>>>      DO WHILE ( TRIM(dummy) .NE. "MAGICALME" )
>>>>>         READ (stdin,fmt='(A512)',END=20) dummy
>>>>>         WRITE (stdtmp,'(A)') trim(dummy)
>>>>>      END DO
>>>>>      !
>>>>> 20   CLOSE ( UNIT=stdtmp, STATUS='keep' )
>>>>> 
>>>>> ----
>>>>> 
>>>>> Basically, if no input file has been provided, the input file is 
>>>>> generated by reading from standard input. Since the application is being 
>>>>> launched e.g. with
>>>>> mpirun -np 64 ../bin/pw.x -npool 64 <nscf.in >nscf.out
>>>>> 
>>>>> the data comes from nscf.in. I simply do not know enough about IO 
>>>>> forwarding to be able to tell why we do not see the entire file, but one 
>>>>> interesting detail is that if I run it in the debugger, the input_tmp.in 
>>>>> is created correctly. However, if I run it using mpirun as shown above, 
>>>>> the file is cropped incorrectly, which leads to the error message 
>>>>> mentioned in this email chain. 
>>>>> Anyway, I would probably need some help here from somebody who knows the 
>>>>> runtime better than me on what could go wrong at this point. 
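>>>>> 
>>>>> One way to narrow down whether the stdin forwarding itself drops data,
>>>>> independent of QE, would be a round-trip test along these lines (a
>>>>> sketch; mpirun forwards stdin to rank 0 only, so a plain cat should
>>>>> reproduce the file exactly):
>>>>> 
>>>>>     mpirun -np 64 cat < nscf.in > copy.in
>>>>>     diff nscf.in copy.in    # any difference points at the stdin forwarding path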
>>>>> Thanks
>>>>> 
>>>>> Edgar
>>>>> 
>>>>> 
>>>>> 
>>>>> On 1/19/2018 1:22 PM, Vahid Askarpour wrote:
>>>>>> Concerning the following error
>>>>>> 
>>>>>>      from pw_readschemafile : error #         1
>>>>>>      xml data file not found
>>>>>> 
>>>>>> The nscf run uses files generated by the scf.in run. So I first run 
>>>>>> scf.in and when it finishes, I run nscf.in. If you have done this and 
>>>>>> still get the above error, then this could be another bug. It does not 
>>>>>> happen for me with intel14/openmpi-1.8.8.
>>>>>> 
>>>>>> Thanks for the update,
>>>>>> 
>>>>>> Vahid
>>>>>> 
>>>>>>> On Jan 19, 2018, at 3:08 PM, Edgar Gabriel <egabr...@central.uh.edu> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>> ok, here is what I found out so far; I will, however, have to stop here for today:
>>>>>>> 
>>>>>>>  1. I can in fact reproduce your bug on my systems.
>>>>>>> 
>>>>>>>  2. I can confirm that the problem occurs both with romio314 and ompio. 
>>>>>>> I *think* the issue is that the input_tmp.in file is incomplete. In 
>>>>>>> both cases (ompio and romio) the end of the file looks as follows (and 
>>>>>>> it's exactly the same for both libraries):
>>>>>>> 
>>>>>>> gabriel@crill-002:/tmp/gabriel/qe-6.2.1/QE_input_files> tail -10 
>>>>>>> input_tmp.in 
>>>>>>>   0.66666667  0.50000000  0.83333333  5.787037e-04
>>>>>>>   0.66666667  0.50000000  0.91666667  5.787037e-04
>>>>>>>   0.66666667  0.58333333  0.00000000  5.787037e-04
>>>>>>>   0.66666667  0.58333333  0.08333333  5.787037e-04
>>>>>>>   0.66666667  0.58333333  0.16666667  5.787037e-04
>>>>>>>   0.66666667  0.58333333  0.25000000  5.787037e-04
>>>>>>>   0.66666667  0.58333333  0.33333333  5.787037e-04
>>>>>>>   0.66666667  0.58333333  0.41666667  5.787037e-04
>>>>>>>   0.66666667  0.58333333  0.50000000  5.787037e-04
>>>>>>>   0.66666667  0.58333333  0.58333333  5
>>>>>>> which is what I *think* causes the problem.
>>>>>>> 
>>>>>>>  3. I tried to find where input_tmp.in is generated, but haven't 
>>>>>>> completely identified the location. However, I could not find MPI 
>>>>>>> file_write(_all) operations anywhere in the code, although there are 
>>>>>>> some MPI_file_read(_all) operations.
>>>>>>> 
>>>>>>>  4. I can confirm that the behavior with Open MPI 1.8.x is different. 
>>>>>>> input_tmp.in looks more complete (at least it doesn't end in the middle 
>>>>>>> of the line). The simulation still does not finish for me, but the bug 
>>>>>>> reported is slightly different; I might just be missing a file or 
>>>>>>> something.
>>>>>>> 
>>>>>>> 
>>>>>>>      from pw_readschemafile : error #         1
>>>>>>>      xml data file not found
>>>>>>> 
>>>>>>> Since I think input_tmp.in is generated from data that is provided in 
>>>>>>> nscf.in, it might very well be something in the MPI_File_read(_all) 
>>>>>>> operation that causes the issue, but since both ompio and romio are 
>>>>>>> affected, there is a good chance that something outside of the control of 
>>>>>>> io components is causing the trouble (maybe a datatype issue that has 
>>>>>>> changed from 1.8.x series to 3.0.x).
>>>>>>> 
>>>>>>>  5. Last but not least, I also wanted to mention that I ran all 
>>>>>>> parallel tests that I found in the testsuite (run-tests-cp-parallel, 
>>>>>>> run-tests-pw-parallel, run-tests-ph-parallel, run-tests-epw-parallel), 
>>>>>>> and they all passed with ompio (and romio314, although I only ran a 
>>>>>>> subset of the tests with romio314).
>>>>>>> 
>>>>>>> Thanks
>>>>>>> 
>>>>>>> Edgar
>>>>>>> -
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 01/19/2018 11:44 AM, Vahid Askarpour wrote:
>>>>>>>> Hi Edgar,
>>>>>>>> 
>>>>>>>> Just to let you know that the nscf run with --mca io ompio crashed 
>>>>>>>> like the other two runs.
>>>>>>>> 
>>>>>>>> Thank you,
>>>>>>>> 
>>>>>>>> Vahid
>>>>>>>> 
>>>>>>>>> On Jan 19, 2018, at 12:46 PM, Edgar Gabriel <egabr...@central.uh.edu> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> ok, thank you for the information. Two short questions and requests. 
>>>>>>>>> I have qe-6.2.1 compiled and running on my system (although it is 
>>>>>>>>> with gcc-6.4 instead of the intel compiler), and I am currently 
>>>>>>>>> running the parallel test suite. So far, all the tests passed, 
>>>>>>>>> although it is still running.
>>>>>>>>> 
>>>>>>>>> My question is now, would it be possible for you to give me access to 
>>>>>>>>> exactly the same data set that you are using? You could upload it to a 
>>>>>>>>> webpage or similar and just send me the link. 
>>>>>>>>> The second question/request: could you rerun your tests one more 
>>>>>>>>> time, this time forcing ompio, e.g. --mca io ompio?
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> 
>>>>>>>>> Edgar
>>>>>>>>> 
>>>>>>>>> On 1/19/2018 10:32 AM, Vahid Askarpour wrote:
>>>>>>>>>> To run EPW, the command for running the preliminary nscf run is 
>>>>>>>>>> (http://epw.org.uk/Documentation/B-dopedDiamond):
>>>>>>>>>> 
>>>>>>>>>> ~/bin/openmpi-v3.0/bin/mpiexec -np 64 
>>>>>>>>>> /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > 
>>>>>>>>>> nscf.out
>>>>>>>>>> 
>>>>>>>>>> So I submitted it with the following command:
>>>>>>>>>> 
>>>>>>>>>> ~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64 
>>>>>>>>>> /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > 
>>>>>>>>>> nscf.out
>>>>>>>>>> 
>>>>>>>>>> And it crashed like the first time. 
>>>>>>>>>> 
>>>>>>>>>> It is interesting that the preliminary scf run works fine. The scf 
>>>>>>>>>> run requires Quantum Espresso to generate the k points automatically 
>>>>>>>>>> as shown below:
>>>>>>>>>> 
>>>>>>>>>> K_POINTS (automatic)
>>>>>>>>>> 12 12 12 0 0 0
>>>>>>>>>> 
>>>>>>>>>> The nscf run which crashes includes a list of k points (1728 in this 
>>>>>>>>>> case) as seen below:
>>>>>>>>>> 
>>>>>>>>>> K_POINTS (crystal)
>>>>>>>>>> 1728
>>>>>>>>>>   0.00000000  0.00000000  0.00000000  5.787037e-04 
>>>>>>>>>>   0.00000000  0.00000000  0.08333333  5.787037e-04 
>>>>>>>>>>   0.00000000  0.00000000  0.16666667  5.787037e-04 
>>>>>>>>>>   0.00000000  0.00000000  0.25000000  5.787037e-04 
>>>>>>>>>>   0.00000000  0.00000000  0.33333333  5.787037e-04 
>>>>>>>>>>   0.00000000  0.00000000  0.41666667  5.787037e-04 
>>>>>>>>>>   0.00000000  0.00000000  0.50000000  5.787037e-04 
>>>>>>>>>>   0.00000000  0.00000000  0.58333333  5.787037e-04 
>>>>>>>>>>   0.00000000  0.00000000  0.66666667  5.787037e-04 
>>>>>>>>>>   0.00000000  0.00000000  0.75000000  5.787037e-04 
>>>>>>>>>> …….
>>>>>>>>>> …….
>>>>>>>>>> 
>>>>>>>>>> To build openmpi (either 1.10.7 or 3.0.x), I loaded the Fortran 
>>>>>>>>>> compiler module, configured with only the "--prefix=" option, and then 
>>>>>>>>>> ran "make all install". I did not enable or disable any other options.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> 
>>>>>>>>>> Vahid
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Jan 19, 2018, at 10:23 AM, Edgar Gabriel 
>>>>>>>>>>> <egabr...@central.uh.edu> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> thanks, that is interesting. Since /scratch is a lustre file 
>>>>>>>>>>> system, Open MPI should actually utilize romio314 for that anyway, 
>>>>>>>>>>> not ompio. What I have seen happen on at least one occasion, however, 
>>>>>>>>>>> is that ompio was still used since (I suspect) romio314 
>>>>>>>>>>> didn't pick up the configuration options correctly. It is a little 
>>>>>>>>>>> bit of a mess from that perspective that we have to pass the romio 
>>>>>>>>>>> arguments with different flags/options than for ompio, e.g.
>>>>>>>>>>> 
>>>>>>>>>>> --with-lustre=/path/to/lustre/ 
>>>>>>>>>>> --with-io-romio-flags="--with-file-system=ufs+nfs+lustre 
>>>>>>>>>>> --with-lustre=/path/to/lustre"
>>>>>>>>>>> 
>>>>>>>>>>> ompio should pick up the lustre options correctly if lustre 
>>>>>>>>>>> headers/libraries are found at the default location, even if the 
>>>>>>>>>>> user did not pass the --with-lustre option. I am not entirely sure 
>>>>>>>>>>> what happens in romio if the user did not pass the 
>>>>>>>>>>> --with-file-system=ufs+nfs+lustre but the lustre headers/libraries 
>>>>>>>>>>> are found at the default location, i.e. whether the lustre adio 
>>>>>>>>>>> component is still compiled or not.
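>>>>>>>>>>> 
>>>>>>>>>>> Put together, a configure invocation along those lines might look like
>>>>>>>>>>> this (a sketch only; the lustre path is a placeholder for wherever the
>>>>>>>>>>> headers/libraries actually live on the system):
>>>>>>>>>>> 
>>>>>>>>>>>     ./configure --prefix=$HOME/bin/openmpi-v3.0 \
>>>>>>>>>>>         --with-lustre=/usr \
>>>>>>>>>>>         --with-io-romio-flags="--with-file-system=ufs+nfs+lustre --with-lustre=/usr"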
>>>>>>>>>>> 
>>>>>>>>>>> Anyway, let's wait for the outcome of your run forcing the 
>>>>>>>>>>> romio314 component, and I will still try to reproduce your problem 
>>>>>>>>>>> on my system.
>>>>>>>>>>> Thanks
>>>>>>>>>>> Edgar
>>>>>>>>>>> 
>>>>>>>>>>> On 1/19/2018 7:15 AM, Vahid Askarpour wrote:
>>>>>>>>>>>> Gilles,
>>>>>>>>>>>> 
>>>>>>>>>>>> I have submitted that job with --mca io romio314. If it finishes, 
>>>>>>>>>>>> I will let you know. It is sitting in Conte’s queue at Purdue.
>>>>>>>>>>>> 
>>>>>>>>>>>> As to Edgar’s question about the file system, here is the output 
>>>>>>>>>>>> of df -Th:
>>>>>>>>>>>> 
>>>>>>>>>>>> vaskarpo@conte-fe00:~ $ df -Th
>>>>>>>>>>>> Filesystem                                          Type    Size  Used Avail Use% Mounted on
>>>>>>>>>>>> /dev/sda1                                           ext4    435G   16G  398G   4% /
>>>>>>>>>>>> tmpfs                                               tmpfs    16G  1.4M   16G   1% /dev/shm
>>>>>>>>>>>> persistent-nfs.rcac.purdue.edu:/persistent/home     nfs      80T   64T   17T  80% /home
>>>>>>>>>>>> persistent-nfs.rcac.purdue.edu:/persistent/apps     nfs     8.0T  4.0T  4.1T  49% /apps
>>>>>>>>>>>> mds-d01-ib.rcac.purdue.edu@o2ib1:mds-d02-ib.rcac.purdue.edu@o2ib1:/lustreD  lustre  1.4P  994T  347T  75% /scratch/conte
>>>>>>>>>>>> depotint-nfs.rcac.purdue.edu:/depot                 nfs     4.5P  3.0P  1.6P  66% /depot
>>>>>>>>>>>> 172.18.84.186:/persistent/fsadmin                   nfs     200G  130G   71G  65% /usr/rmt_share/fsadmin
>>>>>>>>>>>> 
>>>>>>>>>>>> The code is compiled in my $HOME and is run on the scratch.
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> 
>>>>>>>>>>>> Vahid
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Jan 18, 2018, at 10:14 PM, Gilles Gouaillardet 
>>>>>>>>>>>>> <gilles.gouaillar...@gmail.com>
>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Vahid,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In the v1.10 series, the default MPI-IO component was ROMIO-based, 
>>>>>>>>>>>>> and
>>>>>>>>>>>>> in the v3 series, it is now ompio.
>>>>>>>>>>>>> You can force the latest Open MPI to use the ROMIO based 
>>>>>>>>>>>>> component with
>>>>>>>>>>>>> mpirun --mca io romio314 ...
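>>>>>>>>>>>>> 
>>>>>>>>>>>>> To double check which io component actually gets selected at runtime,
>>>>>>>>>>>>> something like this should work (a sketch; ./a.out stands in for the
>>>>>>>>>>>>> real binary, and io_base_verbose follows the usual framework-verbosity
>>>>>>>>>>>>> convention):
>>>>>>>>>>>>> 
>>>>>>>>>>>>>     ompi_info | grep "MCA io"    # list the io components that were built
>>>>>>>>>>>>>     mpirun --mca io romio314 --mca io_base_verbose 100 -np 2 ./a.out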
>>>>>>>>>>>>> 
>>>>>>>>>>>>> That being said, your description (e.g. a hand-edited file) suggests
>>>>>>>>>>>>> that I/O is not performed with MPI-IO,
>>>>>>>>>>>>> which makes me very puzzled as to why the latest Open MPI is
>>>>>>>>>>>>> crashing.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Gilles
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Fri, Jan 19, 2018 at 10:55 AM, Edgar Gabriel 
>>>>>>>>>>>>> <egabr...@central.uh.edu>
>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I will try to reproduce this problem with 3.0.x, but it might 
>>>>>>>>>>>>>> take me a
>>>>>>>>>>>>>> couple of days to get to it.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Since it seemed to have worked with 2.0.x (except for the 
>>>>>>>>>>>>>> running out of file
>>>>>>>>>>>>>> handles problem), there is the suspicion that one of the fixes 
>>>>>>>>>>>>>> that we
>>>>>>>>>>>>>> introduced since then is the problem.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> What file system did you run it on? NFS?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Edgar
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 1/18/2018 5:17 PM, Jeff Squyres (jsquyres) wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Jan 18, 2018, at 5:53 PM, Vahid Askarpour <vh261...@dal.ca>
>>>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> My openmpi3.0.x run (called nscf run) was reading data from a 
>>>>>>>>>>>>>>>> routine
>>>>>>>>>>>>>>>> Quantum Espresso input file edited by hand. The preliminary 
>>>>>>>>>>>>>>>> run (called scf
>>>>>>>>>>>>>>>> run) was done with openmpi3.0.x on a similar input file also 
>>>>>>>>>>>>>>>> edited by hand.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Gotcha.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Well, that's a little disappointing.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It would be good to understand why it is crashing -- is the app 
>>>>>>>>>>>>>>> doing
>>>>>>>>>>>>>>> something that is accidentally not standard?  Is there a bug in 
>>>>>>>>>>>>>>> (soon to be
>>>>>>>>>>>>>>> released) Open MPI 3.0.1?  ...?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 


-- 
Jeff Squyres
jsquy...@cisco.com



_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
