On Fri, Jan 19, 2007 at 05:51:49PM +0000, Arif Ali wrote:
> >>I tried the nightly snapshot of OpenMPI-1.2b4r13137, which failed
> >>miserably.
> >
> >Can you describe what happened there? Is it failing in a different way?
>
> Here's the output
>
> #---------------------------------------------------
> # Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
> #---------------------------------------------------
> # Date    : Fri Jan 19 17:33:52 2007
> # Machine : ppc64
> # System  : Linux
> # Release : 2.6.16.21-0.8-ppc64
> # Version : #1 SMP Mon Jul 3 18:25:39 UTC 2006
>
> #
> # Minimum message length in bytes: 0
> # Maximum message length in bytes: 4194304
> #
> # MPI_Datatype                 : MPI_BYTE
> # MPI_Datatype for reductions  : MPI_FLOAT
> # MPI_Op                       : MPI_SUM
> #
> #
>
> # List of Benchmarks to run:
>
> # PingPong
> # PingPing
> # Sendrecv
> # Exchange
> # Allreduce
> # Reduce
> # Reduce_scatter
> # Allgather
> # Allgatherv
> # Alltoall
> # Bcast
> # Barrier
>
> #---------------------------------------------------
> # Benchmarking PingPong
> # #processes = 2
> # ( 58 additional processes waiting in MPI_Barrier)
> #---------------------------------------------------
>  #bytes  #repetitions  t[usec]  Mbytes/sec
>       0          1000     1.76        0.00
>       1          1000     1.88        0.51
>       2          1000     1.89        1.01
>       4          1000     1.91        2.00
>       8          1000     1.88        4.05
>      16          1000     2.02        7.55
>      32          1000     2.05       14.88
> [0,1,4][btl_openib_component.c:1153:btl_openib_component_progress] from
> node03 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR
> status number 10 for wr_id 268969528 opcode 128
> [0,1,28][btl_openib_component.c:1153:btl_openib_component_progress] from
> node09 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR
> status number 10 for wr_id 268906808 opcode 128
> [0,1,58][btl_openib_component.c:1153:btl_openib_component_progress] from
> node16 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR
> status number 10 for wr_id 268919352 opcode 256614836
> [0,1,0][btl_openib_component.c:1153:btl_openib_component_progress] from
> node02 to: node03 error polling HP CQ with status WORK REQUEST FLUSHED
> ERROR status number 5 for wr_id 276070200 opcode 0
> [0,1,59][btl_openib_component.c:1153:btl_openib_component_progress] from
> node16 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR
> status number 10 for wr_id 268919352 opcode 256614836
> mpirun noticed that job rank 0 with PID 0 on node node02 exited on
> signal 15 (Terminated).
> 55 additional processes aborted (not shown)

Does this happen with btl_openib_flags=1? Does it also happen without that setting? And this doesn't happen with OpenMPI-1.2b3, right?
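For reference, a sketch of how the suggested run could be launched; the host file name, process count, and benchmark binary path are placeholders, not taken from the thread. Setting btl_openib_flags=1 restricts the openib BTL to send/receive (disabling RDMA put/get), which helps isolate whether the REMOTE ACCESS ERROR comes from the RDMA path:

```shell
# Hypothetical invocation -- hostfile, -np, and IMB binary path are assumptions.
# btl_openib_flags is a bitmask (1 = send/recv only); passing 1 disables RDMA.
mpirun --mca btl_openib_flags 1 -np 60 -hostfile hosts ./IMB-MPI1 PingPong

# The same parameter can also be set per-user in the MCA parameter file:
echo "btl_openib_flags = 1" >> ~/.openmpi/mca-params.conf
```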
-- Gleb.