openmpi-2.0.2 running on rhel 7.4 with qlogic QDR infiniband
switches/adapters, also using slurm

i have a user who's running a job over multiple days.  unfortunately,
at some random point after a few days, the job seemingly hangs.  the
latest instance was caused by an infiniband adapter that went offline
and came back online several times.

the card is in a semi-working state at the moment: it's passing
traffic, but i suspect some of the IB messages were lost during the
job run, and now the job is seemingly hung.
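
for the system side, one rough idea is to watch the port state and the
link_downed counter that the kernel exposes under /sys/class/infiniband.
this is a minimal python sketch, assuming the usual sysfs layout
(ports/<n>/state and ports/<n>/counters/link_downed).  i haven't
confirmed that the qlogic driver fills these in the same way, and the
function names read_ports and find_problems are made up for
illustration.

```python
import os


def _read(path):
    """Return the stripped contents of a file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def read_ports(root="/sys/class/infiniband"):
    """Snapshot every port under a sysfs-style tree.

    Returns {(device, port): (state_string, link_downed_or_None)}.
    The directory layout is an assumption about the local driver.
    """
    result = {}
    if not os.path.isdir(root):
        return result
    for dev in sorted(os.listdir(root)):
        ports_dir = os.path.join(root, dev, "ports")
        if not os.path.isdir(ports_dir):
            continue
        for port in sorted(os.listdir(ports_dir)):
            base = os.path.join(ports_dir, port)
            state = _read(os.path.join(base, "state"))
            raw = _read(os.path.join(base, "counters", "link_downed"))
            downed = int(raw) if raw is not None and raw.isdigit() else None
            result[(dev, port)] = (state, downed)
    return result


def find_problems(before, after):
    """Compare two snapshots and list ports that look unhealthy.

    A port is flagged if it is not ACTIVE now, or if its link_downed
    counter increased since the earlier snapshot (the link flapped).
    """
    problems = []
    for key, (state, downed) in after.items():
        if state is None or "ACTIVE" not in state:
            problems.append((key, f"port state is {state!r}"))
        prev = before.get(key)
        if prev and prev[1] is not None and downed is not None and downed > prev[1]:
            problems.append((key, f"link_downed rose from {prev[1]} to {downed}"))
    return problems
```

the idea would be to take one snapshot when the job starts (e.g. from a
slurm prolog), take another periodically, and log or alert on anything
find_problems returns.  a flapping link would show up as an increased
counter even if the port is ACTIVE again by the time anyone looks.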

is there some mechanism i can put in place to detect this condition,
either in the code or on the system?  it's causing two problems at the
moment.  first and foremost, the user has no idea that the job hung or
why.  second, it's wasting system time.
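
for the code side, the simplest thing i can think of is a heartbeat
file: the application touches a file every so often (say, rank 0 once
per iteration), and something outside the job checks how old that file
is.  a minimal python sketch follows, where the function names and the
heartbeat path are hypothetical, and the application would need its own
equivalent of touch_heartbeat if it isn't written in python.

```python
import os
import time


def touch_heartbeat(path):
    """Create the heartbeat file if needed and set its mtime to now.

    Meant to be called by the application whenever it makes progress.
    """
    with open(path, "a"):
        os.utime(path, None)


def is_stale(path, max_age_seconds, now=None):
    """Return True if the heartbeat is missing or older than max_age_seconds.

    `now` can be passed in explicitly (seconds since the epoch) to make
    the check easy to test; by default the current time is used.
    """
    if now is None:
        now = time.time()
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # No heartbeat file at all is treated the same as a stale one.
        return True
    return age > max_age_seconds
```

a wrapper in the batch script could call is_stale every few minutes and,
if it returns True, mail the user and run scancel on $SLURM_JOB_ID.  the
threshold would have to be picked per job, since it depends on how long
one iteration normally takes.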

i'm sure other people have come across wonky IB cards, i'm curious how
everyone else is detecting this condition and dealing with it.