BTW, another cause for retransmission is the lack of posted receive buffers.
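If it helps, that can be checked and tuned from the command line without touching the source. The parameter names below are the ones I believe the openib BTL in Open MPI 2.0.x exposes (ompi_info will confirm); the queue specification values and the './my_app' binary are placeholders for illustration, not recommended settings:

    # show the current receive-queue configuration of the openib BTL
    ompi_info --param btl openib --level 9 | grep receive_queues

    # illustrative only: pre-post more receive buffers per QP size class
    # (the exact numbers are examples; './my_app' stands for your binary)
    mpirun -np 64 \
        --mca btl_openib_receive_queues "P,128,512,448,128:S,2048,2048,2016,64:S,12288,2048,2016,64:S,65536,2048,2016,64" \
        ./my_app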
Rich

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ludovic Raess
Sent: Thursday, September 28, 2017 5:19 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Open MPI internal error

Dear John, George, Rich,

thank you for the suggestions and the potential paths towards understanding the reason for the observed freeze.

Although a HW issue is possible, it seems unlikely, since the error appears only after long runs and not randomly. Also, it is more or less fixed after a reboot, until another long run starts again. Cables and connections seem OK; we have already reseated all of them.

Currently, we are investigating two paths towards a fix. We implemented a slightly modified version of our MPI point-to-point communication routine, to see whether there is still a hidden programming issue. Additionally, I am running the problematic setup with MVAPICH to see whether it is related to Open MPI in particular, thus excluding a HW or implementation issue. In both cases, I will run 'ibdiagnet' if the freeze occurs again, as suggested. Lastly, we could try to set the retransmit count to 0, as suggested by Rich.

Thanks for the suggestions; I'll write again if I have new hints (it may take several days for the runs to freeze).

Ludovic

________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Richard Graham <richa...@mellanox.com>
Sent: Thursday, September 28, 2017 18:09
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI internal error

I just talked with George, who brought me up to speed on this particular problem.

I would suggest a couple of things:
- Look at the HW error counters and see if you have many retransmits. This would indicate a potential issue with the particular HW in use, such as a cable that is not seated well, or some similar problem.
- If you have the ability, reseat your cables from the HCA to the switch and see if this addresses the problem.

Also, if you have the ability (e.g., can modify the Open MPI source code), set the retransmit count to 0 and see if you hit the same issue. This would just speed up reaching the problem, if this is indeed the cause.

Rich

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of George Bosilca
Sent: Thursday, September 28, 2017 11:04 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Open MPI internal error

John,

On the ULFM mailing list you pointed out, we converged toward a hardware issue. Resources associated with the dead process were not correctly freed, and follow-up processes on the same setup would inherit issues related to these lingering messages. However, keep in mind that the setup was different, as we were talking about losing a process. The proposed solution of forcing the timeout to a large value did not fix the problem; it just delayed it enough for the application to run to completion.

George.

On Thu, Sep 28, 2017 at 5:17 AM, John Hearns via users <users@lists.open-mpi.org> wrote:
ps. Before you do the reboot of a compute node, have you run 'ibdiagnet'?
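For the record, here is roughly how Rich's two suggestions can be checked from the command line. The tools are the standard infiniband-diags ones; the device/port names (mlx4_0, port 1) are assumptions for a ConnectX FDR setup (ibstat shows the real names), and btl_openib_ib_retry_count is, as far as I know, the openib BTL parameter that controls the retry count, so no source change should be needed:

    # per-port error counters on the local HCA, and across the whole fabric
    perfquery -x
    ibqueryerrors

    # the same counters are also exposed in sysfs (device/port names assumed)
    cat /sys/class/infiniband/mlx4_0/ports/1/counters/link_error_recovery
    cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_errors

    # fail fast instead of retransmitting (0 = no retries; './my_app' is a placeholder)
    mpirun -np 64 --mca btl_openib_ib_retry_count 0 ./my_app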
On 28 September 2017 at 11:17, John Hearns <hear...@googlemail.com> wrote:
Google turns this up: https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls

On 28 September 2017 at 01:26, Ludovic Raess <ludovic.ra...@unil.ch> wrote:
Hi,

we have an issue on our 32-node Linux cluster regarding the usage of Open MPI in an InfiniBand dual-rail configuration (2 IB ConnectX FDR single-port HCAs, CentOS 6.6, OFED 3.1, Open MPI 2.0.0, gcc 5.4, CUDA 7).

On long runs (over ~10 days) involving more than 1 node (usually 64 MPI processes distributed over 16 nodes [node01-node16]), we observe a freeze of the simulation due to an internal error displaying:

"error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1 vendor error 136 qp_idx 0"

(see attached file for full output).

The job hangs: no computation or communication occurs anymore, but the job neither exits nor releases the nodes. The job can be killed normally, but then the nodes concerned do not fully recover. A relaunch of the simulation usually sustains a couple of iterations (a few minutes of runtime), and then the job hangs again for similar reasons. The only workaround so far is to reboot the involved nodes.

Since we didn't find any hints on the web regarding this strange behaviour, I am wondering if this is a known issue. We actually don't know what causes this to happen and why, so any hints on where to start investigating, or possible reasons for this to happen, are welcome.

Ludovic
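Given the dual-rail setup described above, one sketch of a possible isolation step: restrict a run to a single HCA and see whether the freeze follows one rail. The device names mlx4_0/mlx4_1 are assumptions (ibstat shows the real ones) and './my_app' stands for the application binary:

    # list the HCAs and their port state
    ibstat

    # run on the first rail only, then repeat the test on the second one
    mpirun -np 64 --mca btl_openib_if_include mlx4_0 ./my_app
    mpirun -np 64 --mca btl_openib_if_include mlx4_1 ./my_app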