Hi all,

I'm seeing a few isolated test failures in the test-suite, as well as a general 
OpenFabrics initialization warning, and would like to understand why they occur 
and whether they are harmless. With a serial build using gfortran 13.1.0, every 
test that isn't skipped passes. The failures appear only when I switch to a 
parallel build with Open MPI 4.1.6. Can anyone point me toward a robust 
parallel build? Thanks in advance!

Some details on my configuration:
GCC/Gfortran 13.1.0
QE 7.4.1
Open MPI 4.1.6
Tests run with: make run-tests NPROCS=12
Red Hat Enterprise Linux 8
Using QE internal BLAS & LAPACK

Many of the tests print messages like the following, even when they pass:

--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   pn5657
  Local device: mlx5_0
--------------------------------------------------------------------------
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
[pn5657:3197139] 11 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[pn5657:3197139] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Here are the tests that are failing:


  1.  pw_plugins - plugin-pw2casino_1.in (arg(s): 1): **FAILED**.

Different sets of data extracted from benchmark and test.

    Data only in benchmark: p1.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

     Error in routine pw2casino (1):

     pool/band/image parallelization not (yet) implemented

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%



     stopping ...



  2.  pw_vdw - xdm.in: **FAILED**.

ef1

    ERROR: absolute error 5.62e-01 greater than 8.00e-02. (Test: 10.7872.  
Benchmark: 10.2253.)

    ERROR: relative error 5.50e-02 greater than 2.00e-02. (Test: 10.7872.  
Benchmark: 10.2253.)



  3.  cp_al_edft - Al.uspp.in: **FAILED**.

t1

    ERROR: absolute error 1.75e-02 greater than 6.00e-03. (Test: 159.46581.  
Benchmark: 159.44833.)


  4.  ph_1d - ch4.scf.in (arg(s): 1): **FAILED**.

n1

    ERROR: absolute error 6.00e+00 greater than 5.00e+00. (Test: 32.0.  
Benchmark: 26.0.)



  5.  hp_metal_paw_magn - Fe.scf.in (arg(s): 2): **FAILED**.

/hpc/data/idunn/qe/7.4.1/test-suite/..//test-suite/run-hp.sh 2 Fe.scf.in test.out.070425-2.inp=Fe.scf.in.args=2 test.err.070425-2.inp=Fe.scf.in.args=2

Running PW ...

mpirun -np 12 /hpc/data/idunn/qe/7.4.1/test-suite/..//bin/pw.x < Fe.scf.in > test.out.070425-2.inp=Fe.scf.in.args=2 2> test.err.070425-2.inp=Fe.scf.in.args=2

n1

    ERROR: absolute error 6.00e+00 greater than 5.00e+00. (Test: 31.0.  
Benchmark: 25.0.)



  6.  hp_soc_UV_paw_magn - bn.hp.in (arg(s): 4): **FAILED**.

/hpc/data/idunn/qe/7.4.1/test-suite/..//test-suite/run-hp.sh 4 bn.hp.in test.out.070425-2.inp=bn.hp.in.args=4 test.err.070425-2.inp=bn.hp.in.args=4

Running HP ...

mpirun -np 12 /hpc/data/idunn/qe/7.4.1/test-suite/..//bin/hp.x < bn.hp.in > test.out.070425-2.inp=bn.hp.in.args=4 2> test.err.070425-2.inp=bn.hp.in.args=4

v2

    ERROR: absolute error 1.37e-02 greater than 1.50e-03. (Test: -0.1254.  
Benchmark: -0.1117.)

    ERROR: relative error 1.23e-01 greater than 1.80e-04. (Test: -0.1254.  
Benchmark: -0.1117.)

v1

    ERROR: absolute error 1.72e+00 greater than 1.50e-03. (Test: 6.4294.  
Benchmark: 4.7069.)

    ERROR: relative error 3.66e-01 greater than 1.20e-04. (Test: 6.4294.  
Benchmark: 4.7069.)

u

    ERROR: absolute error 1.72e+00 greater than 1.50e-03. (Test: 6.4294.  
Benchmark: 4.7069.)

    ERROR: relative error 3.66e-01 greater than 1.20e-04. (Test: 6.4294.  
Benchmark: 4.7069.)



  7.  All the KCW tests that need the kcw.x executable appear to fail with 
error messages like:



mpirun was unable to launch the specified application as it could not access

or execute an executable:



Executable: /hpc/data/sm-euv_rs/idunn/qe/7.4.1/test-suite/..//bin/kcw.x

Node: pn5657



while attempting to start process rank 0.



I'm not sure why kcw.x isn't in the bin folder.
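
As a sanity check on the numeric failures above, the reported errors look like a plain absolute difference and that difference divided by the benchmark value. Below is a minimal sketch of that arithmetic, assuming this is how the test harness compares values. The function name `abs_rel_error` is my own and hypothetical, not taken from the test-suite. The values and tolerances are copied from the log.

```python
def abs_rel_error(test, benchmark):
    """Return (absolute, relative) error, with the relative error taken
    with respect to the benchmark value (assumed convention)."""
    abs_err = abs(test - benchmark)
    rel_err = abs_err / abs(benchmark)
    return abs_err, rel_err


# (test, benchmark, absolute tolerance, relative tolerance) from the log above
cases = {
    "pw_vdw xdm.in ef1": (10.7872, 10.2253, 8.00e-02, 2.00e-02),
    "hp_soc_UV_paw_magn bn.hp.in v2": (-0.1254, -0.1117, 1.50e-03, 1.80e-04),
    "hp_soc_UV_paw_magn bn.hp.in v1": (6.4294, 4.7069, 1.50e-03, 1.20e-04),
}

for name, (test, bench, abs_tol, rel_tol) in cases.items():
    abs_err, rel_err = abs_rel_error(test, bench)
    print(f"{name}: abs {abs_err:.2e} (tol {abs_tol:.2e}), "
          f"rel {rel_err:.2e} (tol {rel_tol:.2e})")
```

The printed errors reproduce the figures in the log (for example 5.62e-01 and 5.50e-02 for xdm.in), so these are real differences in the computed quantities, well above the tolerances, rather than formatting or parsing issues.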





Best regards,
Ian Dunn (he/him)
ASML Wilton MDEV Analysis Architect

Quantum ESPRESSO is supported by MaX (www.max-centre.eu)
users mailing list users@lists.quantum-espresso.org
https://lists.quantum-espresso.org/mailman/listinfo/users
