Re: [OMPI users] ScaLapack tester fails with 2.0.1, works with 1.10.4; Intel Omni-Path

2016-11-19 Thread Christof Köhler

Hello,

I tried

mpirun -n 4 --mca pml ob1 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \
    -mca oob_tcp_if_include eth0,team0 \
    -host node009,node009,node009,node009 ./xdsyevr

mpirun -n 4 --mca pml ob1 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \
    -mca oob_tcp_if_include eth0,team0 \
    -host node009,node010,node009,node010 ./xdsyevr


This does not change anything.


I made an attempt to narrow down what happens. Sorry, this gets a bit
long. A stack trace is also included below.


Looking at the actual numbers (see at the very bottom), I notice that
the CHK and QTQ columns (9th and 10th columns, the maximum over all
eigentests) are similar between the two Open MPI versions. What changes
is the "IL, IU, VL or VU altered by PDSYEVR" line, which appears only in
the 2.0.1 output, not in the 1.10.4 output. Looking at pdseprsubtst.f,
comment line 751, I see that this is (as far as I understand it) a
sanity check.


Inserting my own print statements into pdseprsubtst.f (and changing the
optimization flags to "-O0 -g"), i.e.


      IF( IL.NE.OLDIL .OR. IU.NE.OLDIU .OR. VL.NE.OLDVL .OR. VU.NE.
     $    OLDVU ) THEN
         IF( IAM.EQ.0 ) THEN
            WRITE( NOUT, FMT = 9982 )
            WRITE( NOUT, '(F8.3,F8.3,F8.3,F8.3)' ) VL, VU, OLDVL, OLDVU
            WRITE( NOUT, '(I10,I10,I10,I10)' ) IL, IU, OLDIL, OLDIU
         END IF
         RESULT = 1
      END IF

The result with 2.0.1 is

   500   2   2   2   8   Y 0.08-1.00  0.81E-03   3.3 PASSED   EVR
IL, IU, VL or VU altered by PDSYEVR
 NaN   0.000 NaN   0.000
-1 132733856-1 132733856
   500   4   1   4   8   Y 0.18-1.00  0.84E-03   3.5 FAILED
   500   4   4   1   8   Y 0.17-1.00  0.78E-03   2.9 PASSED   EVR

The values OLDVL and OLDVU are the saved values of VL and VU on entry
to pdseprsubtst (lines 253 and 254), i.e. _before_ the actual
eigensolver pdsyevr is called.
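
For context, that save on entry is presumably nothing more than a plain
copy of the scalar arguments into local variables, along these lines
(paraphrased for illustration, not copied verbatim from pdseprsubtst.f):

*     Remember the scalar input arguments so they can be compared
*     against after PDSYEVR has returned (paraphrase of the save
*     around the lines referenced above).
      OLDIL = IL
      OLDIU = IU
      OLDVL = VL
      OLDVU = VU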


Working upwards in the call tree, I additionally inserted

      IF( IAM.EQ.0 ) THEN
         WRITE( NOUT, '(F8.3,F8.3)' ) VL, VU
      END IF

right before each call to PDSEPRSUBTST in pdseprtst.f. With 2.0.1 this gives


   500   2   2   2   8   Y 0.07-1.00  0.81E-03   3.3 PASSED   EVR
 NaN   0.000
IL, IU, VL or VU altered by PDSYEVR
 NaN   0.000 NaN   0.000
-1 128725600-1 128725600
   500   4   1   4   8   Y 0.16-1.00  0.84E-03   3.5 FAILED
   0.000   0.000
   0.000   0.000
   0.000   0.000
   0.000   0.000
   0.343   0.377
  -0.697   0.104
   500   4   4   1   8   Y 0.17-1.00  0.76E-03   3.1 PASSED   EVR

With 1.10.4

   500   2   2   2   8   Y 0.07-1.00  0.80E-03   4.4 PASSED   EVR
   0.000   0.000
   0.000   0.000
   0.000   0.000
   0.000   0.000
   0.435   0.884
  -0.804   0.699
   500   4   1   4   8   Y 0.08-1.00  0.91E-03   3.3 PASSED   EVR
   0.000   0.000
   0.000   0.000
   0.000   0.000
   0.000   0.000
  -0.437   0.253
  -0.603   0.220
   500   4   4   1   8   Y 0.17-1.00  0.83E-03   3.7 PASSED   EVR


So something goes wrong early and it is probably not related to numerics.
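
A minimal stand-alone check along the following lines (a sketch only;
the program name and test values are made up for illustration) would
separate an MPI-level problem from a problem in the tester itself, by
verifying that a plain broadcast of a DOUBLE PRECISION and an INTEGER
still arrives intact:

program bcast_check
! Sketch of a stand-alone MPI sanity check (hypothetical, not part of
! the ScaLAPACK tester): broadcast one DOUBLE PRECISION and one INTEGER
! from rank 0 and verify that they arrive unchanged on every rank.
   use mpi
   implicit none
   integer :: rank, ierr, ival
   double precision :: dval

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   ! Rank 0 sets the reference values, the other ranks start with junk.
   if (rank == 0) then
      dval = 0.5d0
      ival = 4
   else
      dval = -1.0d30
      ival = -1
   end if

   call MPI_Bcast(dval, 1, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
   call MPI_Bcast(ival, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

   ! Every rank reports what it actually received.
   if (dval /= 0.5d0 .or. ival /= 4) then
      write(*,'(A,I4,A,ES12.4,I12)') 'rank', rank, ' corrupted:', dval, ival
   else
      write(*,'(A,I4,A)') 'rank', rank, ' OK'
   end if

   call MPI_Finalize(ierr)
end program bcast_check

Built with the same mpif90 and launched with the same mpirun options as
the tester (e.g. --mca pml ob1 across node009/node010), this would show
quickly whether scalar values are already mangled at the MPI level.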

I also set -ffpe-trap=invalid,zero,overflow in FCFLAGS (and NOOPT);
this of course does nothing for the BLACS and C routines, although the
stack trace below ends in a C routine (which might be spurious).


login 14:04 ~/src/scalapack/TESTING % mpirun -n 4 --mca pml ob1 -x PATH \
    -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \
    -mca oob_tcp_if_include eth0,team0 \
    -host node009,node010,node009,node010 ./xdsyevr

Check if overflow is handled in ieee default manner.
If this is the last output you see, you should assume
that overflow caused a floating point exception.

Program received signal SIGFPE: Floating-point exception - erroneous  
arithmetic operation.


Backtrace for this error:

Program received signal SIGFPE: Floating-point exception - erroneous  
arithmetic operation.


Backtrace for this error:

Program received signal SIGFPE: Floating-point exception - erroneous  
arithmetic operation.


Backtrace for this error:

Program received signal SIGFPE: Floating-point exception - erroneous  
arithmetic operation.


Backtrace for this error:
#0  0x2b921971266f in ???
#0  0x2ade83c4966f in ???
#1  0x4316fd in pdlachkieee_
at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
#2  0x40457b in pdseprdriver
#1  0x4316fd in pdlachkieee_
at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
#3  0x405828 in main
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:257
#2  0x40457b in pdseprdriver
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
#3  0x405828 in main
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:257
#0  0x2b414549566f in ???
#1  0x4316fd in pdlachkieee_
at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
#2  0x40457b in pdseprdriver
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
#3  0x405828 in main
at /home1/ckoe/src/scal

Re: [OMPI users] ScaLapack tester fails with 2.0.1, works with 1.10.4; Intel Omni-Path

2016-11-19 Thread Christof Köhler

Hello again,

please ignore the stack trace contained in my previous mail. It fails
with 1.10.4 at the same point; apparently the check for IEEE arithmetic
is a red herring!


Best Regards

Christof


[OMPI users] Error bash: /usr/mpi/gcc/openmpi-1.8.8/bin/orted: No such file or directory

2016-11-19 Thread Sebastian Antunez N.
Hello Guys

I have an HPC cluster and I updated OFED, the firmware, etc.

After the reboot, running "mpirun -machinefile nodes8 -n 128
/home/HPL/run_hpl/xhpl" shows the following error:

bash: /usr/mpi/gcc/openmpi-1.8.8/bin/orted: No such file or directory
bash: /usr/mpi/gcc/openmpi-1.8.8/bin/orted: No such file or directory
bash: /usr/mpi/gcc/openmpi-1.8.8/bin/orted: No such file or directory
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).



Before the update I had version 1.6.4 and the cluster did not show any
errors when I ran mpirun.

I changed the environment variables, but the error persists.

Could anyone who has resolved this issue comment?

Regards

Sebastian Antunez