Google turns this up:
https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls
On 28 September 2017 at 01:26, Ludovic Raess wrote:
> Hi,
>
>
> we have an issue on our 32-node Linux cluster regarding the usage of Open
> MPI in an InfiniBand dual-rail configuration (2 IB ConnectX FDR single
> port [...]
PS. Before you reboot a compute node, have you run 'ibdiagnet'?
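For reference, a basic fabric sweep is just running ibdiagnet with no arguments;
something along these lines is a reasonable sketch (flag names differ between the
classic ibutils ibdiagnet and the newer Mellanox ibdiagnet2):

  # ibdiagnet -pc              clear the fabric port counters first, so stale errors don't mask new ones
  # ibdiagnet -o /tmp/ibdiag   full sweep; the reports written there list links with errors or speed/width mismatches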
John,
On the ULFM mailing list thread you pointed out, we converged toward a hardware
issue. Resources associated with the dead process were not correctly freed,
and follow-up processes on the same setup would inherit issues related to
these lingering messages. However, keep in mind that the setup was
di[...]
I just talked with George, who brought me up to speed on this particular
problem.
I would suggest a couple of things:
- Look at the HW error counters, and see if you have many retransmits.
This would indicate a potential issue with the particular HW in use, such as a
cable that is n[...]
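A quick way to look at those counters, assuming the standard infiniband-diags
tools from OFED are installed (a sketch, not the exact commands used on this
cluster):

  $ perfquery          dump the local HCA port counters (SymbolErrorCounter, LinkDownedCounter,
                       PortXmitDiscards, ...); values that keep growing point at a flaky link
  $ ibqueryerrors      sweep the whole fabric and report ports whose error counters exceed thresholds
  $ perfquery -R       reset the local counters so the next check only shows fresh errors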
Dear John, George, Rich,
thank you for the suggestions and potential paths towards understanding the
reason for the observed freeze. Although a hardware issue might be possible, it
sounds unlikely since the error appears only after long runs and not randomly.
Also, it is kind of fixed after a reboot [...]
Thank you Gilles for the pointer.
However, that package "openmpi-gnu-ohpc-1.10.6-23.1.x86_64.rpm" has other
dependencies from OpenHPC. Basically it is strongly tied to the whole
OpenHPC stack.
I did, however, follow your suggestion and rebuilt the OpenMPI RPM package
from Red Hat, adding the --with-tm option [...]
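For anyone following along, the rebuild is roughly the following (package
versions, spec file details and paths will differ per system; this is only a
sketch, and it needs the Torque/PBS development headers, e.g. torque-devel,
so configure can find tm.h):

  $ yumdownloader --source openmpi                 fetch the distribution source RPM
  $ rpm -ivh openmpi-1.10.6-*.src.rpm              unpack into ~/rpmbuild
    (edit ~/rpmbuild/SPECS/openmpi.spec and append --with-tm to the %configure invocation)
  $ rpmbuild -ba ~/rpmbuild/SPECS/openmpi.spec
  # rpm -Uvh ~/rpmbuild/RPMS/x86_64/openmpi-*.rpm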
I recompiled the RHEL OpenMPI package to include the configuration option
--with-tm
and it compiled and is working fine.
# mpirun -V
mpirun (Open MPI) 1.10.6

# ompi_info | grep ras
                 MCA ras: gridengine (MCA v2.0.0, API v2.0.0, Component v1.10.6)
                 MCA ras: loadleveler (MCA v2.0.0, [...]
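If the --with-tm rebuild took effect, the full (truncated above) list should
also show a tm component in both the ras and plm frameworks, i.e. lines of the
form (illustrative output for a 1.10.x build):

                 MCA ras: tm (MCA v2.0.0, API v2.0.0, Component v1.10.6)
                 MCA plm: tm (MCA v2.0.0, API v2.0.0, Component v1.10.6)

which is what lets mpirun pick up the Torque/PBS node allocation and launch
through the TM interface.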