Hi,

I have a SEGV problem with ScaLAPACK. The same configuration works fine with MPICH, but I seem to get much better performance with Open MPI on this machine. I have attached the log and the slmake.inc I am using. I see the same problem in my other programs that call the routine xcdblu uses. It seems to occur when the number of processors doesn't match the number of diagonals, for the case of bwl = 15. If I choose -np 15 it just seems to hang; however, if I use mpirun --mca mpi_paffinity_alone 1 -np 15 xcdblu it crashes too.
Any help would be appreciated.

Regards,
Kevin

> mpirun -np 6 xcdblu
SCALAPACK banded linear systems.
'MPI machine'

Tests of the parallel complex single precision band matrix solve
The following scaled residual checks will be computed:
 Solve residual         = ||Ax - b|| / (||x|| * ||A|| * eps * N)
 Factorization residual = ||A - LU|| / (||A|| * eps * N)
The matrix A is randomly generated for each test.

An explanation of the input/output parameters follows:
TIME     : Indicates whether WALL or CPU time was used.
N        : The number of rows and columns in the matrix A.
bwl, bwu : The number of diagonals in the matrix A.
NB       : The size of the column panels the matrix A is split into. [-1 for default]
NRHS     : The total number of RHS to solve for.
NBRHS    : The number of RHS to be put on a column of processes before going on to the next column of processes.
P        : The number of process rows.
Q        : The number of process columns.
THRESH   : If a residual value is less than THRESH, CHECK is flagged as PASSED
Fact time: Time in seconds to factor the matrix
Sol Time : Time in seconds to solve the system.
MFLOPS   : Rate of execution for factor and solve using sequential operation count.
MFLOP2   : Rough estimate of speed using actual op count (accurate big P,N).

The following parameter values will be used:
 N    : 3 5 17
 bwl  : 1 3 15
 bwu  : 1 1 4
 NB   : -1
 NRHS : 4
 NBRHS: 1
 P    : 1 1 1 1
 Q    : 1 2 3 4

Relative machine precision (eps) is taken to be 0.596046E-07
Routines pass computational tests if scaled residual is less than 3.0000

TIME TR      N BWL BWU   NB  NRHS    P    Q L*U Time Slv Time   MFLOPS   MFLOP2  CHECK
---- -- ------ --- --- ---- ----- ---- ---- -------- -------- -------- -------- ------
WALL N       3   1   1    3     4    1    1    0.000   0.0000     0.00     0.00 PASSED
WALL N       5   1   1    5     4    1    1    0.000   0.0000     0.00     0.00 PASSED
WALL N       5   3   1    5     4    1    1    0.000   0.0000     0.00     0.00 PASSED
WALL N      17   1   1   17     4    1    1    0.000   0.0000     0.00     0.00 PASSED
WALL N      17   3   1   17     4    1    1    0.000   0.0000     0.00     0.00 PASSED
WALL N      17  15   4   17     4    1    1    0.000   0.0000     0.00     0.00 PASSED
WALL N       3   1   1    2     4    1    2    0.000   0.0000     0.00     0.00 PASSED
WALL N       5   1   1    3     4    1    2    0.000   0.0000     0.00     0.00 PASSED
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x10
[0] func:/usr/local/lib/libopal.so.0 [0x2b0fdb4ee1c0]
[1] func:/lib64/libpthread.so.0 [0x2b0fdbe0d140]
[2] func:/usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_match+0x2ff) [0x2b0fde2a4d9f]
[3] func:/usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback+0xaf) [0x2b0fde2a5d8f]
[4] func:/usr/local/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x8c9) [0x2b0fde5b9e39]
[5] func:/usr/local/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x21) [0x2b0fde3aeff1]
[6] func:/usr/local/lib/libopal.so.0(opal_progress+0x4a) [0x2b0fdb4d9bfa]
[7] func:/usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv+0x265) [0x2b0fde2a2c75]
[8] func:/usr/local/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_basic_linear+0x10b) [0x2b0fdebe544b]
[9] func:/usr/local/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_nonoverlapping+0x4d) [0x2b0fdebe25bd]
[10] func:/usr/local/lib/libmpi.so.0(ompi_comm_nextcid+0x209) [0x2b0fdb207c59]
[11] func:/usr/local/lib/libmpi.so.0(ompi_comm_create+0x8c) [0x2b0fdb206bcc]
[12] func:/usr/local/lib/libmpi.so.0(MPI_Comm_create+0x90) [0x2b0fdb22d890]
[13] func:/usr/local/lib/libmpi.so.0(pmpi_comm_create__+0x42) [0x2b0fdb2491b2]
[14] func:xcdblu(BI_TransUserComm+0xef) [0x46797f]
[15] func:xcdblu(Cblacs_gridmap+0x13a) [0x463e3a]
[16] func:xcdblu(Creshape+0x17c) [0x42365c]
[17] func:xcdblu(pcdbtrf_+0x5d9) [0x42df35]
[18] func:xcdblu(MAIN__+0x190c) [0x417a0c]
[19] func:xcdblu(main+0x32) [0x4160ea]
[20] func:/lib64/libc.so.6(__libc_start_main+0xf4) [0x2b0fdbf34154]
[21] func:xcdblu [0x416029]
*** End of error message ***
1 additional process aborted (not shown)
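For anyone reading the CHECK column in the output above, here is a rough sketch of how the "solve residual" formula from the test header could be evaluated. It is only an illustration in plain Python, not the tester's actual code: the choice of the 1-norm, the helper names, and the small 3x3 complex system are all my own assumptions. The only values taken from the log are the formula itself, the threshold of 3.0, and eps, which equals 2**-24 (printed as 0.596046E-07).

```python
# Sketch of the "solve residual" check described in the test header:
#   ||Ax - b|| / (||x|| * ||A|| * eps * N)
# Assumptions (not from the log): the 1-norm is used for every norm, and
# the helper names and the small test system below are purely illustrative.

EPS = 2.0 ** -24   # single precision eps; the log prints it as 0.596046E-07
THRESH = 3.0       # pass threshold reported in the log


def vec_norm1(v):
    """1-norm of a (complex) vector."""
    return sum(abs(c) for c in v)


def mat_norm1(a):
    """1-norm of a dense matrix: the maximum absolute column sum."""
    n_cols = len(a[0])
    return max(sum(abs(row[j]) for row in a) for j in range(n_cols))


def matvec(a, x):
    """Dense matrix-vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def solve_residual(a, x, b, eps=EPS):
    """Scaled residual ||Ax - b|| / (||x|| * ||A|| * eps * N)."""
    n = len(a)
    r = [axi - bi for axi, bi in zip(matvec(a, x), b)]
    return vec_norm1(r) / (vec_norm1(x) * mat_norm1(a) * eps * n)


# Hypothetical 3x3 complex tridiagonal system (bwl = bwu = 1).
a = [[4 + 1j, 1 + 0j, 0j],
     [1 + 0j, 4 - 1j, 1 + 0j],
     [0j, 1 + 0j, 4 + 0j]]
x_exact = [1 + 0j, 2 - 1j, 0.5j]
b = matvec(a, x_exact)

# A deliberately inaccurate "solution" to show what a failing check looks like.
x_perturbed = [xi * (1 + 1e-3) for xi in x_exact]

exact_res = solve_residual(a, x_exact, b)
perturbed_res = solve_residual(a, x_perturbed, b)

for label, res in (("exact", exact_res), ("perturbed", perturbed_res)):
    print(label, res, "PASSED" if res < THRESH else "FAILED")
```

The division by eps * N is what makes the residual "scaled": a solution accurate to roughly single-precision rounding gives a value of order 1, which is why a fixed threshold such as 3.0 can be used across matrix sizes.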
config.log.tar.gz
Description: GNU Zip compressed data