Re: [OMPI users] Open MPI internal error

2017-09-28 Thread John Hearns via users
Google turns this up: https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls

On 28 September 2017 at 01:26, Ludovic Raess wrote:
> Hi,
>
> we have an issue on our 32-node Linux cluster regarding the usage of Open MPI in an Infiniband dual-rail configuration (2 IB ConnectX FDR single por…
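For context, steering Open MPI 1.10's openib BTL onto both rails is normally done through MCA parameters. A minimal sketch, assuming Mellanox HCAs that show up as mlx4_0 and mlx4_1 and a placeholder binary ./my_app (check your actual device names with ibstat):

    # illustrative device:port names; list yours with ibstat
    mpirun --mca btl openib,self,sm \
           --mca btl_openib_if_include mlx4_0:1,mlx4_1:1 \
           -np 64 ./my_app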

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread John Hearns via users
P.S. Before you reboot a compute node, have you run 'ibdiagnet'?

On 28 September 2017 at 11:17, John Hearns wrote:
> Google turns this up: https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls
>
> On 28 September 2017 at 01:26, Ludovic Raess wrote:
>> Hi,
>>
>> we have…
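For anyone following along, the usual fabric-health checks from the infiniband-diags tooling look roughly like this (a sketch, not output from this thread; report locations vary with the ibdiagnet version):

    ibdiagnet        # full fabric sweep, writes report files to a temp directory
    ibqueryerrors    # summarize per-port error counters across the fabric
    perfquery        # dump the local HCA's port counters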

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread George Bosilca
John, on the ULFM mailing list thread you pointed to, we converged toward a hardware issue. Resources associated with the dead process were not correctly freed, and follow-up processes on the same setup would inherit issues related to these lingering messages. However, keep in mind that the setup was di…

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread Richard Graham
I just talked with George, who brought me up to speed on this particular problem. I would suggest a couple of things:

- Look at the HW error counters, and see if you have many retransmits. This would indicate a potential issue with the particular HW in use, such as a cable that is n…
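As a concrete way to read those counters, the kernel IB stack exposes them under sysfs. A sketch, assuming a ConnectX HCA registered as mlx4_0, port 1 (device names differ per system):

    # per-port error counters; steadily climbing symbol_error or
    # link_error_recovery on one port is a classic sign of a marginal cable
    cat /sys/class/infiniband/mlx4_0/ports/1/counters/symbol_error
    cat /sys/class/infiniband/mlx4_0/ports/1/counters/link_error_recovery
    cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_errors
    cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_xmit_discards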

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread Ludovic Raess
Dear John, George, Rich,

thank you for the suggestions and the potential paths towards understanding the reason for the observed freeze. Although a hardware issue is possible, it sounds unlikely, since the error appears only after long runs and not randomly. Also, it is kind of fixed after a reboo…

Re: [OMPI users] Fwd: OpenMPI does not obey hostfile

2017-09-28 Thread Anthony Thyssen
Thank you, Gilles, for the pointer. However, that package "openmpi-gnu-ohpc-1.10.6-23.1.x86_64.rpm" has other dependencies from OpenHPC; it is basically strongly tied to the whole OpenHPC ecosystem. I did, however, follow your suggestion and rebuilt the OpenMPI RPM package from Red Hat, adding the…
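The rebuild Anthony describes would go roughly like this (a sketch assuming the standard RHEL SRPM workflow; exact file names and the spec layout may differ):

    rpm -i openmpi-1.10.6-*.src.rpm     # unpacks into ~/rpmbuild/SOURCES and SPECS
    # edit ~/rpmbuild/SPECS/openmpi.spec and add --with-tm to the %configure invocation
    rpmbuild -bb ~/rpmbuild/SPECS/openmpi.spec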

[OMPI users] OpenMPI with-tm is not obeying torque

2017-09-28 Thread Anthony Thyssen
I recompiled the RHEL OpenMPI package to include the configuration option --with-tm, and it compiled and is working fine.

# mpirun -V
mpirun (Open MPI) 1.10.6

# ompi_info | grep ras
MCA ras: gridengine (MCA v2.0.0, API v2.0.0, Component v1.10.6)
MCA ras: loadleveler (MCA v2.0.0,…
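A quick way to confirm that Torque support actually made it into such a build is to grep ompi_info for the tm components (a sketch; the version strings will match your own build):

    ompi_info | grep -i " tm "
    # expect ras and plm entries along the lines of:
    #   MCA ras: tm (MCA v2.0.0, API v2.0.0, Component v1.10.6)
    #   MCA plm: tm (MCA v2.0.0, API v2.0.0, Component v1.10.6)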