Re: [OMPI users] ScaLapack tester fails with 2.0.1, works with 1.10.4; Intel Omni-Path
Hello,

I tried

    mpirun -n 4 --mca pml ob1 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009 ./xdsyevr

    mpirun -n 4 --mca pml ob1 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010 ./xdsyevr

This does not change anything.

I made an attempt to narrow down what happens. Sorry, but this is a bit
longer. A stack trace is also below.

Looking at the actual numbers (see below), I notice that the CHK and QTQ
columns (9th and 10th column, maximum over all eigentests) are similar
between the two Open MPI versions. What changes is the "IL, IU, VL or VU
altered by PDSYEVR" line, which appears only in the 2.0.1 output, not with
1.10.4. Looking at pdseprsubtst.f, comment line 751, this is (as far as I
understand it) a sanity check.

I inserted my own print statements in pdseprsubtst.f (and changed the
optimization to "-O0 -g"), i.e.

      IF( IL.NE.OLDIL .OR. IU.NE.OLDIU .OR. VL.NE.OLDVL .OR. VU.NE.
     $    OLDVU ) THEN
         IF( IAM.EQ.0 ) THEN
            WRITE( NOUT, FMT = 9982 )
            WRITE( NOUT, '(F8.3,F8.3,F8.3,F8.3)') VL, VU, OLDVL, OLDVU
            WRITE( NOUT, '(I10,I10,I10,I10)') IL, IU, OLDIL, OLDIU
         END IF
         RESULT = 1
      END IF

The result with 2.0.1 is

    500 2 2 2 8 Y 0.08-1.00 0.81E-03 3.3 PASSED EVR
    IL, IU, VL or VU altered by PDSYEVR
    NaN 0.000 NaN 0.000
    -1 132733856 -1 132733856
    500 4 1 4 8 Y 0.18-1.00 0.84E-03 3.5 FAILED
    500 4 4 1 8 Y 0.17-1.00 0.78E-03 2.9 PASSED EVR

The values OLDVL and OLDVU are the saved values of VL and VU on entry to
pdseprsubtst (lines 253 and 254), i.e. _before_ the actual eigensolver
pdsyevr is called.

Working upwards in the call tree, I additionally inserted

      IF (IAM.EQ.0) THEN
         WRITE(NOUT,'(F8.3,F8.3)') VL, VU
      ENDIF

right before each call to PDSEPRSUBTST in pdseprtst.f. With 2.0.1 this gives

    500 2 2 2 8 Y 0.07-1.00 0.81E-03 3.3 PASSED EVR
    NaN 0.000
    IL, IU, VL or VU altered by PDSYEVR
    NaN 0.000 NaN 0.000
    -1 128725600 -1 128725600
    500 4 1 4 8 Y 0.16-1.00 0.84E-03 3.5 FAILED
    0.000 0.000
    0.000 0.000
    0.000 0.000
    0.000 0.000
    0.343 0.377
    -0.697 0.104
    500 4 4 1 8 Y 0.17-1.00 0.76E-03 3.1 PASSED EVR

With 1.10.4:

    500 2 2 2 8 Y 0.07-1.00 0.80E-03 4.4 PASSED EVR
    0.000 0.000
    0.000 0.000
    0.000 0.000
    0.000 0.000
    0.435 0.884
    -0.804 0.699
    500 4 1 4 8 Y 0.08-1.00 0.91E-03 3.3 PASSED EVR
    0.000 0.000
    0.000 0.000
    0.000 0.000
    0.000 0.000
    -0.437 0.253
    -0.603 0.220
    500 4 4 1 8 Y 0.17-1.00 0.83E-03 3.7 PASSED EVR

So something goes wrong early, and it is probably not related to numerics.

Setting -ffpe-trap=invalid,zero,overflow in FCFLAGS (and NOOPT) gives the
stack trace below. Of course this does nothing for the BLACS and C routines,
although the trace ends in a C routine (which might be spurious). The output
of the four ranks is interleaved:

    login 14:04 ~/src/scalapack/TESTING % mpirun -n 4 --mca pml ob1 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010 ./xdsyevr
    Check if overflow is handled in ieee default manner.
    If this is the last output you see, you should assume that overflow caused a floating point exception.

    Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
    Backtrace for this error:
    Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
    Backtrace for this error:
    Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
    Backtrace for this error:
    Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
    Backtrace for this error:
    #0  0x2b921971266f in ???
    #0  0x2ade83c4966f in ???
    #1  0x4316fd in pdlachkieee_
            at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
    #2  0x40457b in pdseprdriver
    #1  0x4316fd in pdlachkieee_
            at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
            at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
    #3  0x405828 in main
            at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:257
    #2  0x40457b in pdseprdriver
            at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
    #3  0x405828 in main
            at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:257
    #0  0x2b414549566f in ???
    #1  0x4316fd in pdlachkieee_
            at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
    #2  0x40457b in pdseprdriver
            at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
    #3  0x405828 in main
            at /home1/ckoe/src/scal
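A hedged aside for anyone trying to reproduce this (not part of the original
mail): one way to double-check that the ob1 PML is really the one selected,
rather than the Omni-Path cm/PSM2 path, is to turn up the PML framework
verbosity. The sketch below assumes the standard pml_base_verbose MCA
parameter; the exact wording of the selection message differs between
Open MPI versions.

    # Sketch only: list the PML components this Open MPI install knows about
    ompi_info | grep -i "MCA pml"

    # Sketch only: run with PML selection debugging enabled and grep the
    # result; the verbose output goes to stderr, hence the redirect
    mpirun -n 4 --mca pml ob1 --mca pml_base_verbose 10 \
        -host node009,node010,node009,node010 ./xdsyevr 2>&1 | grep -i pml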
Re: [OMPI users] ScaLapack tester fails with 2.0.1, works with 1.10.4; Intel Omni-Path
Hello again,

please ignore the stack trace contained in my previous mail. It fails with
1.10.4 at the same point; apparently the check for IEEE arithmetic is a red
herring!

Best Regards

Christof

- Message from Christof Köhler -
    Date: Sat, 19 Nov 2016 14:10:55 +0100
    From: Christof Köhler
Reply-To: christof.koeh...@bccms.uni-bremen.de
 Subject: Re: [OMPI users] ScaLapack tester fails with 2.0.1, works with 1.10.4; Intel Omni-Path
      To: Howard Pritchard
      Cc: Open MPI Users

[Quoted text clipped; it repeats the mail above verbatim.]
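A note on why the trap fires there, with a hedged sketch (not taken from the
thread): the tester's pdlachkieee routine deliberately provokes an overflow
to check whether IEEE semantics are in effect, which is what the "Check if
overflow is handled in ieee default manner" banner refers to, so trapping on
overflow via -ffpe-trap aborts the run under either Open MPI version. One
could rebuild the testers with only the invalid and zero traps; the "exe"
make target and the command-line override of the FCFLAGS/NOOPT variables
from SLmake.inc are assumptions about a stock ScaLAPACK tree.

    # Sketch only: rebuild the testers with the overflow trap removed, so
    # the deliberate overflow probe in pdlachkieee (pdlaiect.c) no longer
    # raises SIGFPE, while invalid-operation and divide-by-zero traps stay on
    cd ~/src/scalapack
    make clean
    make exe FCFLAGS="-O0 -g -ffpe-trap=invalid,zero" \
             NOOPT="-O0 -g -ffpe-trap=invalid,zero"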
[OMPI users] Error bash: /usr/mpi/gcc/openmpi-1.8.8/bin/orted: No such file or directory
Hello guys,

I have an HPC cluster and I updated OFED, firmware, etc. After rebooting,
running

    mpirun -machinefile nodes8 -n 128 /home/HPL/run_hpl/xhpl

shows the following error:

    bash: /usr/mpi/gcc/openmpi-1.8.8/bin/orted: No such file or directory
    bash: /usr/mpi/gcc/openmpi-1.8.8/bin/orted: No such file or directory
    bash: /usr/mpi/gcc/openmpi-1.8.8/bin/orted: No such file or directory
    --
    ORTE was unable to reliably start one or more daemons.
    This usually is caused by:

    * not finding the required libraries and/or binaries on one or more
      nodes. Please check your PATH and LD_LIBRARY_PATH settings, or
      configure OMPI with --enable-orterun-prefix-by-default

    * lack of authority to execute on one or more specified nodes.
      Please verify your allocation and authorities.

    * the inability to write startup files into /tmp
      (--tmpdir/orte_tmpdir_base). Please check with your sys admin to
      determine the correct location to use.

    * compilation of the orted with dynamic libraries when static are
      required (e.g., on Cray). Please check your configure cmd line and
      consider using one of the contrib/platform definitions for your
      system type.

    * an inability to create a connection back to mpirun due to a lack of
      common network interfaces and/or no route found between them. Please
      check network connectivity (including firewalls and network routing
      requirements).

Before the update I had version 1.6.4, and the cluster showed no errors when
I ran mpirun. I changed the environment variables, but the error persists.
Could you comment on how to resolve this issue?

Regards
Sebastian Antunez
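Not an answer from the thread, but a hedged sketch of two standard checks
for this symptom, following only the hints in the error message itself. The
paths are the ones from the output above; the library directory may be lib64
rather than lib depending on how the package was built.

    # Sketch only: first confirm the old install still exists on every node;
    # an OFED update often removes or relocates the bundled Open MPI
    ls -l /usr/mpi/gcc/openmpi-1.8.8/bin/orted

    # Option 1: pass the installation prefix explicitly to mpirun
    # (the same path must exist on all nodes)
    mpirun --prefix /usr/mpi/gcc/openmpi-1.8.8 -machinefile nodes8 -n 128 \
        /home/HPL/run_hpl/xhpl

    # Option 2: make sure non-interactive shells on the compute nodes pick
    # up the install, e.g. in ~/.bashrc on every node
    export PATH=/usr/mpi/gcc/openmpi-1.8.8/bin:$PATH
    export LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-1.8.8/lib:$LD_LIBRARY_PATH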