I have a user whose code, at scale, reliably dies with the errors below
(new hosts each time):

For this code we have been using:
-mca btl_openib_receive_queues X,4096,128:X,12288,128:X,65536,12

Without that option it reliably dies with an out-of-memory message.
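
For context, the full launch line looks roughly like this (the process count,
binary name, and btl list here are placeholders, not the user's actual
command):

    mpirun -np 1024 \
        --mca btl openib,sm,self \
        --mca btl_openib_receive_queues X,4096,128:X,12288,128:X,65536,12 \
        ./app

As I understand the btl_openib_receive_queues syntax, each X,size,num entry
requests an XRC receive queue of num buffers of size bytes, with colons
separating the individual queue specs.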

Note that this code runs fine at the same scale on Pleiades (a NASA SGI box)
using MPT.

Are we running out of QPs?  Is that possible?
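
In case it helps narrow that down, a quick sanity check I would run on a
compute node (assuming the ibverbs userspace tools are installed there; the
grep just pulls out the device-wide QP ceiling):

    # QP limit reported by the HCA
    ibv_devinfo -v | grep -w max_qp

Comparing that against a rough estimate of procs-per-node times remote peers
times queues-per-connection should give some idea of whether we could be
bumping into it.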

--------------------------------------------------------------------------
The OpenFabrics stack has reported a network error event.  Open MPI
will try to continue, but your job may end up failing.

  Local host:        nyx5608.engin.umich.edu
  MPI process PID:   42036
  Error number:      3 (IBV_EVENT_QP_ACCESS_ERR)

This error may indicate connectivity problems within the fabric;
please contact your system administrator.
--------------------------------------------------------------------------
[[9462,1],3][../../../../../openmpi-1.6/ompi/mca/btl/openib/btl_openib_component.c:3394:handle_wc]
 from nyx5608.engin.umich.edu to: nyx5022 error polling LP CQ with status 
INVALID REQUEST ERROR status number 9 for wr_id 14d6d00 opcode 0  vendor error 
138 qp_idx 0
--------------------------------------------------------------------------
The OpenFabrics stack has reported a network error event.  Open MPI
will try to continue, but your job may end up failing.

  Local host:        (null)
  MPI process PID:   42038
  Error number:      3 (IBV_EVENT_QP_ACCESS_ERR)

This error may indicate connectivity problems within the fabric;
please contact your system administrator.


Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985



