BTW, another cause for retransmission is the lack of posted receive buffers.
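If it helps, that can be checked and tuned from the command line without touching the source. The parameter names below are the ones I believe the openib BTL in Open MPI 2.0.x exposes (ompi_info will confirm); the queue specification values and the './my_app' binary are placeholders for illustration, not recommended settings:

    # show the current receive-queue configuration of the openib BTL
    ompi_info --param btl openib --level 9 | grep receive_queues

    # illustrative only: pre-post more receive buffers per QP size class
    # (the exact numbers are examples; './my_app' stands for your binary)
    mpirun -np 64 \
        --mca btl_openib_receive_queues "P,128,512,448,128:S,2048,2048,2016,64:S,12288,2048,2016,64:S,65536,2048,2016,64" \
        ./my_app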
Rich

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ludovic Raess
Sent: Thursday, September 28, 2017 5:19 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Open MPI internal error

Dear John, George, Rich,

thank you for the suggestions and the potential paths towards understanding the reason for the observed freeze.

Although a HW issue is possible, it seems unlikely, since the error appears only after long runs and not randomly. Also, it is more or less fixed after a reboot, until another long run starts again. Cables and connections seem OK; we have already reseated all of them.

Currently, we are investigating two paths towards a fix. We implemented a slightly modified version of our MPI point-to-point communication routine, to see whether there is still a hidden programming issue. Additionally, I am running the problematic setup with MVAPICH to see whether it is related to Open MPI in particular, thus excluding a HW or implementation issue. In both cases, I will run 'ibdiagnet' if the freeze occurs again, as suggested. Lastly, we could try to set the retransmit count to 0, as suggested by Rich.

Thanks for the suggestions; I'll write again if I have new hints (it may take several days for the runs to freeze).

Ludovic

________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Richard Graham <richa...@mellanox.com>
Sent: Thursday, September 28, 2017 18:09
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI internal error

I just talked with George, who brought me up to speed on this particular problem.

I would suggest a couple of things:
- Look at the HW error counters and see if you have many retransmits. This would indicate a potential issue with the particular HW in use, such as a cable that is not seated well, or some similar problem.
- If you have the ability, reseat your cables from the HCA to the switch and see if this addresses the problem.

Also, if you have the ability (e.g., can modify the Open MPI source code), set the retransmit count to 0 and see if you hit the same issue. This would just speed up reaching the problem, if this is indeed the cause.

Rich

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of George Bosilca
Sent: Thursday, September 28, 2017 11:04 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Open MPI internal error

John,

On the ULFM mailing list you pointed out, we converged toward a hardware issue. Resources associated with the dead process were not correctly freed, and follow-up processes on the same setup would inherit issues related to these lingering messages. However, keep in mind that the setup was different, as we were talking about losing a process. The proposed solution of forcing the timeout to a large value did not fix the problem; it just delayed it enough for the application to run to completion.

George.

On Thu, Sep 28, 2017 at 5:17 AM, John Hearns via users <users@lists.open-mpi.org> wrote:
ps. Before you do the reboot of a compute node, have you run 'ibdiagnet'?
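For the record, here is roughly how Rich's two suggestions can be checked from the command line. The tools are the standard infiniband-diags ones; the device/port names (mlx4_0, port 1) are assumptions for a ConnectX FDR setup (ibstat shows the real names), and btl_openib_ib_retry_count is, as far as I know, the openib BTL parameter that controls the retry count, so no source change should be needed:

    # per-port error counters on the local HCA, and across the whole fabric
    perfquery -x
    ibqueryerrors

    # the same counters are also exposed in sysfs (device/port names assumed)
    cat /sys/class/infiniband/mlx4_0/ports/1/counters/link_error_recovery
    cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_errors

    # fail fast instead of retransmitting (0 = no retries; './my_app' is a placeholder)
    mpirun -np 64 --mca btl_openib_ib_retry_count 0 ./my_app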
On 28 September 2017 at 11:17, John Hearns <hear...@googlemail.com> wrote:
Google turns this up: https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls

On 28 September 2017 at 01:26, Ludovic Raess <ludovic.ra...@unil.ch> wrote:
Hi,

we have an issue on our 32-node Linux cluster regarding the usage of Open MPI in an InfiniBand dual-rail configuration (2 IB ConnectX FDR single-port HCAs, CentOS 6.6, OFED 3.1, Open MPI 2.0.0, gcc 5.4, CUDA 7).

On long runs (over ~10 days) involving more than 1 node (usually 64 MPI processes distributed over 16 nodes [node01-node16]), we observe a freeze of the simulation due to an internal error displaying:

"error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1 vendor error 136 qp_idx 0"

(see attached file for full output).

The job hangs: no computation or communication occurs anymore, but the job neither exits nor releases the nodes. The job can be killed normally, but then the nodes concerned do not fully recover. A relaunch of the simulation usually sustains a couple of iterations (a few minutes of runtime), and then the job hangs again for similar reasons. The only workaround so far is to reboot the involved nodes.

Since we didn't find any hints on the web regarding this strange behaviour, I am wondering if this is a known issue. We actually don't know what causes this to happen and why, so any hints on where to start investigating, or possible reasons for this to happen, are welcome.

Ludovic
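Given the dual-rail setup described above, one sketch of a possible isolation step: restrict a run to a single HCA and see whether the freeze follows one rail. The device names mlx4_0/mlx4_1 are assumptions (ibstat shows the real ones) and './my_app' stands for the application binary:

    # list the HCAs and their port state
    ibstat

    # run on the first rail only, then repeat the test on the second one
    mpirun -np 64 --mca btl_openib_if_include mlx4_0 ./my_app
    mpirun -np 64 --mca btl_openib_if_include mlx4_1 ./my_app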