Dear All,

I have an application which is run using Open MPI with InfiniBand flags. The application is a forecast model simulation. A frequent problem is that the InfiniBand mezzanine cards in the servers become faulty (I don't know why this happens so often), and the model simulation then becomes very slow or even gets stuck. I have to manually remove nodes from the host list one by one to find out which nodes have faulty InfiniBand, so that I can run the model on the remaining nodes. Is there any way to check, while the job is running, which node is having a communication problem over InfiniBand or is delaying the application?

Thanks! Ahsan