Dear SciNet: I have resubmitted, just letting you know.
Thank you, Chris.

starting mdrun 'title'
1000000000 steps, 2000000.0 ps (continuing from step 101920720, 203841.4 ps).

[[3105,1],27][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3238:handle_wc] from gpc-f106n004 to: gpc-f106n003-ib0 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 10982784 opcode 32767 vendor error 129 qp_idx 0
[[3105,1],25][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3238:handle_wc] from gpc-f106n004 to: gpc-f106n003-ib0 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 10982016 opcode 32767 vendor error 129 qp_idx 0
[[3105,1],28][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3238:handle_wc] from gpc-f106n004 to: gpc-f106n003-ib0 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 11032832 opcode 32767 vendor error 129 qp_idx 2
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been exceeded.
"Retry count" is defined in the InfiniBand spec 1.2 (section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this error
has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 20). The actual timeout value used is calculated as:

    4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.

Below is some information about the host that raised the error and the
peer to which it was connected:

  Local host:   gpc-f106n004
  Local device: mlx4_0
  Peer host:    gpc-f106n003-ib0

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 28 with PID 6738 on node
gpc-f106n004-ib0 exiting without calling "finalize". This may have caused
other processes in the application to be terminated by signals sent by
mpirun (as reported here).
--------------------------------------------------------------------------
[[3105,1],5][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3238:handle_wc] from gpc-f106n001 to: gpc-f106n003-ib0 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 12949632 opcode 32767 vendor error 129 qp_idx 0
[[3105,1],7][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3238:handle_wc] from gpc-f106n001 to: gpc-f106n003-ib0 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 11033216 opcode 32767 vendor error 129 qp_idx 2
[gpc-f106n001:06928] 4 more processes have sent help message help-mpi-btl-openib.txt / pp retry exceeded
[gpc-f106n001:06928] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[gpc-f106n001:06928] [[3105,0],0]-[[3105,0],2] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
=>> PBS: job killed: node 2 (gpc-f106n003-ib0) requested job terminate, 'EOF' (code 1099) - received SISTER_EOF attempting to communicate with sister MOM's
Terminated
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
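
(In case it helps while the fabric issue is being looked at: the two MCA parameters named in the help text above can be passed straight to mpirun, or exported as OMPI_MCA_* environment variables. Note that btl_openib_ib_retry_count already defaults to its maximum of 7, so only the timeout can usefully be raised; with the default of 20 the ACK timeout works out to 4.096 us * 2^20, roughly 4.3 s, and each increment doubles it, so 24 would give roughly 69 s. The command below is only a sketch, and the mdrun invocation at the end is a placeholder for whatever the job actually runs.)

    # untested sketch: raise the openib ACK timeout; mdrun line is a placeholder
    mpirun --mca btl_openib_ib_retry_count 7 \
           --mca btl_openib_ib_timeout 24 \
           mdrun_mpi -deffnm md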