Jeff, thank you for your suggestion. I am sure that the correct mpif.h is being included. One thing that I did not do in my original message was submit the job to SGE. I have now done that, and the program still failed with the same segmentation fault messages.
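For anyone who wants to repeat the mpif.h check, here is a minimal sketch. It assumes the wrapper's full compile line is available as a string (for example from `mpif90 -showme`) and that include directories are given in the attached `-I/dir` form. The helper name `find_mpif_headers` is made up for illustration.

```shell
# Hypothetical helper: print every mpif.h found in the -I directories of a
# compiler command line, so a stray copy from another MPI is easy to spot.
find_mpif_headers() {
    # $1: the full compile line, e.g. the output of `mpif90 -showme`
    # (left unquoted below on purpose so the shell splits it into words)
    for flag in $1; do
        case "$flag" in
            -I?*)
                dir=${flag#-I}
                if [ -f "$dir/mpif.h" ]; then
                    printf '%s\n' "$dir/mpif.h"
                fi
                ;;
        esac
    done
}

# Example use (assumes the Open MPI wrapper is the first mpif90 in PATH):
#   find_mpif_headers "$(mpif90 -showme)"
```

More than one line of output would suggest that a second MPI installation is visible on the include path. The sketch does not handle the separated `-I dir` form or paths containing spaces.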
Below is the output of the job submitted to SGE.

<<< example1.output >>>
[compute-0-1:19367] *** Process received signal ***
[compute-0-5:19650] *** Process received signal ***
[compute-0-3:17571] *** Process received signal ***
[compute-0-1:19366] *** Process received signal ***
[compute-0-1:19366] Signal: Segmentation fault (11)
[compute-0-1:19366] Signal code: Address not mapped (1)
[compute-0-1:19366] Failing at address: 0x44000070
[compute-0-1:19366] *** End of error message ***
[compute-0-5:19650] Signal: Segmentation fault (11)
[compute-0-5:19650] Signal code: Address not mapped (1)
[compute-0-5:19650] Failing at address: 0x44000070
[compute-0-5:19650] *** End of error message ***
[compute-0-3:17571] Signal: Segmentation fault (11)
[compute-0-3:17571] Signal code: Address not mapped (1)
[compute-0-3:17571] Failing at address: 0x44000070
[compute-0-3:17571] *** End of error message ***
[compute-0-1:19367] Signal: Segmentation fault (11)
[compute-0-1:19367] Signal code: Address not mapped (1)
[compute-0-1:19367] Failing at address: 0x44000070
[compute-0-1:19367] *** End of error message ***
[compute-0-5:19651] *** Process received signal ***
[compute-0-5:19651] Signal: Segmentation fault (11)
[compute-0-5:19651] Signal code: Address not mapped (1)
[compute-0-5:19651] Failing at address: 0x44000070
[compute-0-5:19651] *** End of error message ***
[compute-0-3:17572] *** Process received signal ***
[compute-0-3:17572] Signal: Segmentation fault (11)
[compute-0-3:17572] Signal code: Address not mapped (1)
[compute-0-3:17572] Failing at address: 0x44000070
[compute-0-3:17572] *** End of error message ***
[compute-0-1.local:19292] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[compute-0-1.local:19292] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_gridengine_module.c at line 791
[compute-0-1.local:19292] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
mpirun noticed that job rank 2 with PID 19650 on node compute-0-5.local exited on signal 11 (Segmentation fault).
*** glibc detected *** free(): invalid pointer: 0x0000000000606b80 ***
[compute-0-1.local:19292] ERROR: A daemon on node compute-0-5.local failed to start as expected.
[compute-0-1.local:19292] ERROR: There may be more information available from
[compute-0-1.local:19292] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[compute-0-1.local:19292] ERROR: If the problem persists, please restart the
[compute-0-1.local:19292] ERROR: Grid Engine PE job
[compute-0-1.local:19292] The daemon received a signal 6 (with core).
[compute-0-1.local:19292] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[compute-0-1.local:19292] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_gridengine_module.c at line 826
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
[compute-0-1.local:19365] OOB: Connection to HNP lost
<<< END example1.output >>>

Is it possible that the ACML libraries are incompatible with my version of OMPI? Or, like Jeff said, maybe it is just a Pathscale bug. I hope not.

Daniel

-----Original Message-----
From: users-boun...@open-mpi.org on behalf of Backlund, Daniel
Sent: Tue 1/22/2008 3:06 PM
To: us...@open-mpi.org
Subject: [OMPI users] SCALAPACK: Segmentation Fault (11) and Signal code:Address not mapped (1)

Hello all,

I am using OMPI 1.2.4 on a Linux cluster (Rocks 4.2). OMPI was configured to use the Pathscale Compiler Suite, which is installed in /home/PROGRAMS/pathscale (NFS-mounted on the nodes). I am trying to compile and run the example1.f that comes with the ACML package from AMD, and I am unable to get it to run. All nodes have the same Opteron processors and 2 GB of RAM per core. OMPI was configured as below.
export CC=pathcc
export CXX=pathCC
export FC=pathf90
export F77=pathf90
./configure --prefix=/opt/openmpi/1.2.4 --enable-static --without-threads --without-memory-manager \
    --without-libnuma --disable-mpi-threads

The configuration was successful, the install was successful, I can even run a sample mpihello.f90 program. I would eventually like to link the ACML SCALAPACK and BLACS libraries to our code, but I need some help. The ACML version is 3.1.0 for pathscale64. I go into the scalapack_examples directory, modify GNUmakefile to the correct values, and compile successfully. I have made openmpi into an rpm and pushed it to the nodes, modified LD_LIBRARY_PATH and PATH, and made sure I can see it on all nodes.

When I try to run the example1.exe which is generated, using

/opt/openmpi/1.2.4/bin/mpirun -np 6 example1.exe

I get the following output:

<<<< example1.res >>>>
[XXXXXXX:31295] *** Process received signal ***
[XXXXXXX:31295] Signal: Segmentation fault (11)
[XXXXXXX:31295] Signal code: Address not mapped (1)
[XXXXXXX:31295] Failing at address: 0x44000070
[XXXXXXX:31295] *** End of error message ***
[XXXXXXX:31298] *** Process received signal ***
[XXXXXXX:31298] Signal: Segmentation fault (11)
[XXXXXXX:31298] Signal code: Address not mapped (1)
[XXXXXXX:31298] Failing at address: 0x44000070
[XXXXXXX:31298] *** End of error message ***
[XXXXXXX:31299] *** Process received signal ***
[XXXXXXX:31299] Signal: Segmentation fault (11)
[XXXXXXX:31299] Signal code: Address not mapped (1)
[XXXXXXX:31299] Failing at address: 0x44000070
[XXXXXXX:31299] *** End of error message ***
[XXXXXXX:31300] *** Process received signal ***
[XXXXXXX:31300] Signal: Segmentation fault (11)
[XXXXXXX:31300] Signal code: Address not mapped (1)
[XXXXXXX:31300] Failing at address: 0x44000070
[XXXXXXX:31300] *** End of error message ***
[XXXXXXX:31296] *** Process received signal ***
[XXXXXXX:31296] Signal: Segmentation fault (11)
[XXXXXXX:31296] Signal code: Address not mapped (1)
[XXXXXXX:31296] Failing at address: 0x44000070
[XXXXXXX:31296] *** End of error message ***
[XXXXXXX:31297] *** Process received signal ***
[XXXXXXX:31297] Signal: Segmentation fault (11)
[XXXXXXX:31297] Signal code: Address not mapped (1)
[XXXXXXX:31297] Failing at address: 0x44000070
[XXXXXXX:31297] *** End of error message ***
mpirun noticed that job rank 0 with PID 31295 on node XXXXXXX.ourdomain.com exited on signal 11 (Segmentation fault).
5 additional processes aborted (not shown)
<<<< end example1.res >>>>

Here is the result of ldd example1.exe

<<<< ldd example1.exe >>>>
libmpi_f90.so.0 => /opt/openmpi/1.2.4/lib/libmpi_f90.so.0 (0x0000002a9557d000)
libmpi_f77.so.0 => /opt/openmpi/1.2.4/lib/libmpi_f77.so.0 (0x0000002a95681000)
libmpi.so.0 => /opt/openmpi/1.2.4/lib/libmpi.so.0 (0x0000002a957b3000)
libopen-rte.so.0 => /opt/openmpi/1.2.4/lib/libopen-rte.so.0 (0x0000002a959fb000)
libopen-pal.so.0 => /opt/openmpi/1.2.4/lib/libopen-pal.so.0 (0x0000002a95be7000)
librt.so.1 => /lib64/tls/librt.so.1 (0x0000003e7cd00000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003e7c200000)
libutil.so.1 => /lib64/libutil.so.1 (0x0000003e79e00000)
libmv.so.1 => /home/PROGRAMS/pathscale/lib/3.0/libmv.so.1 (0x0000002a95d4d000)
libmpath.so.1 => /home/PROGRAMS/pathscale/lib/3.0/libmpath.so.1 (0x0000002a95e76000)
libm.so.6 => /lib64/tls/libm.so.6 (0x0000003e77a00000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003e77c00000)
libpathfortran.so.1 => /home/PROGRAMS/pathscale/lib/3.0/libpathfortran.so.1 (0x0000002a95f97000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003e77700000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003e78200000)
/lib64/ld-linux-x86-64.so.2 (0x0000003e76800000)
<<<< end ldd >>>>

Like I said, the compilation of the example program yields no errors, it just will not run. Does anybody have any suggestions? Am I doing something wrong?

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
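
The ldd listing above can be checked mechanically for a mixed MPI installation. The following is a minimal sketch, assuming the usual `name => path (address)` ldd line format. The helper name `mpi_lib_dirs` is invented for illustration, and the library-name prefixes it matches (`libmpi`, `libopen-`) are taken from the listing in this message.

```shell
# Hypothetical helper: read `ldd` output on stdin and list the distinct
# directories that the libmpi*/libopen-* shared libraries resolve to.
mpi_lib_dirs() {
    awk '$1 ~ /^(libmpi|libopen-)/ && $2 == "=>" {
             path = $3
             sub("/[^/]*$", "", path)   # strip the file name, keep the directory
             print path
         }' | sort -u
}

# Example use:
#   ldd example1.exe | mpi_lib_dirs
```

A single directory in the output means every MPI shared library comes from one installation. Note that this covers shared libraries only: anything linked statically does not appear in ldd output at all, and the listing above contains no ACML, SCALAPACK or BLACS entries, so this check says nothing about which MPI those libraries were built against.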