We have a user whose code at scale dies reliably with the errors below (new hosts each time):
We have been running this code with:

  -mca btl_openib_receive_queues X,4096,128:X,12288,128:X,65536,12

Without that option it reliably dies with an out-of-memory message. Note that this code runs fine at the same scale on Pleiades (the NASA SGI box) using MPT. Are we running out of QPs? Is that possible?

--------------------------------------------------------------------------
The OpenFabrics stack has reported a network error event. Open MPI will
try to continue, but your job may end up failing.

  Local host:        nyx5608.engin.umich.edu
  MPI process PID:   42036
  Error number:      3 (IBV_EVENT_QP_ACCESS_ERR)

This error may indicate connectivity problems within the fabric;
please contact your system administrator.
--------------------------------------------------------------------------
[[9462,1],3][../../../../../openmpi-1.6/ompi/mca/btl/openib/btl_openib_component.c:3394:handle_wc] from nyx5608.engin.umich.edu to: nyx5022 error polling LP CQ with status INVALID REQUEST ERROR status number 9 for wr_id 14d6d00 opcode 0 vendor error 138 qp_idx 0
--------------------------------------------------------------------------
The OpenFabrics stack has reported a network error event. Open MPI will
try to continue, but your job may end up failing.

  Local host:        (null)
  MPI process PID:   42038
  Error number:      3 (IBV_EVENT_QP_ACCESS_ERR)

This error may indicate connectivity problems within the fabric;
please contact your system administrator.
--------------------------------------------------------------------------

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985
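For what it's worth, a back-of-envelope estimate of QP consumption can help sanity-check the "running out of QPs" theory. The sketch below is illustrative only: the per-node process count and node count are hypothetical, and the queue behavior is assumed from the three entries in the btl_openib_receive_queues string above (per-peer P/S queues need one QP per queue per remote process, while XRC "X" queues share QPs per destination node, which is why XRC is used at scale in the first place).

```python
# Rough QP-count estimate per node for the openib BTL (a sketch; all
# job-size numbers below are assumptions, not values from this cluster).
queues = 3            # entries in the btl_openib_receive_queues string
procs_per_node = 8    # hypothetical ranks per node
nodes = 512           # hypothetical node count

# Per-peer (P/S) queues: one QP per queue, per local rank, per remote rank.
per_peer_qps = queues * procs_per_node * (nodes - 1) * procs_per_node

# XRC (X) queues: QPs are shared per destination node, so roughly one QP
# per queue, per local rank, per remote *node*.
xrc_qps = queues * procs_per_node * (nodes - 1)

print(per_peer_qps)  # compare against max_qp reported by `ibv_devinfo -v`
print(xrc_qps)
```

If the per-peer figure approaches the HCA's max_qp (visible in `ibv_devinfo -v` output), QP exhaustion at scale is plausible, and XRC or a larger-SRQ configuration is the usual mitigation.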