openmpi-2.0.2 running on RHEL 7.4 with QLogic QDR InfiniBand switches/adapters, also using Slurm.
I have a user running a job that spans multiple days. Unfortunately, after a few days the job will seemingly hang at random. The latest instance was caused by an InfiniBand adapter that went offline and came back online several times. The card is in a semi-working state at the moment: it is passing traffic, but I suspect some of the IB messages were lost during the job run, and the job now appears to be hung.

Is there some mechanism I can put in place to detect this condition, either in the code or on the system? It is causing two problems at the moment. First and foremost, the user has no idea that the job has hung or why. Second, it is wasting system time.

I'm sure other people have come across wonky IB cards, and I'm curious how everyone else is detecting this condition and dealing with it.
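
To make the "on the system" half of the question concrete, here is a rough sketch of the kind of check I have in mind. It assumes the usual Linux sysfs layout under /sys/class/infiniband, where each port has a state file (for example "4: ACTIVE") and a counters/link_downed file. The function names and the baseline idea are made up for illustration, and I have not verified this against the QLogic driver specifically.

```python
import os


def parse_port_state(text):
    """Turn a sysfs state string such as '4: ACTIVE' into (4, 'ACTIVE')."""
    number, _, name = text.strip().partition(":")
    return int(number), name.strip()


def _read(path):
    """Return the stripped contents of a file, or None if it is unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def check_ports(sysfs_root="/sys/class/infiniband", baseline=None):
    """Report ports that are not ACTIVE or whose link_downed counter grew.

    baseline maps (device, port) to the link_downed value recorded earlier,
    for example at job start. Returns (problems, current_counts).
    """
    baseline = baseline or {}
    problems, counts = [], {}
    if not os.path.isdir(sysfs_root):
        return problems, counts  # no IB devices visible on this host
    for dev in sorted(os.listdir(sysfs_root)):
        ports_dir = os.path.join(sysfs_root, dev, "ports")
        if not os.path.isdir(ports_dir):
            continue
        for port in sorted(os.listdir(ports_dir)):
            port_dir = os.path.join(ports_dir, port)
            state = _read(os.path.join(port_dir, "state"))
            if state is not None:
                try:
                    _, name = parse_port_state(state)
                except ValueError:
                    name = "UNKNOWN"
                if name != "ACTIVE":
                    problems.append(f"{dev} port {port}: state {name}")
            downed = _read(os.path.join(port_dir, "counters", "link_downed"))
            if downed is not None and downed.isdigit():
                counts[(dev, port)] = int(downed)
                before = baseline.get((dev, port))
                if before is not None and counts[(dev, port)] > before:
                    problems.append(
                        f"{dev} port {port}: link_downed {before} -> {downed}"
                    )
    return problems, counts


if __name__ == "__main__":
    # First call records the counters; a later call compares against them.
    _, start_counts = check_ports()
    problems, _ = check_ports(baseline=start_counts)
    for line in problems:
        print(line)
```

The idea would be to record the counters when the job starts and compare again periodically, for instance from cron or from a node health check script that Slurm runs, then drain the node or notify the user when the link has flapped. This only flags a suspicious adapter; it does not tell you whether the MPI job itself has stopped making progress.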