>> Do you know what StarCCM is doing when it hangs? I.e., is it in an MPI call?
I have set FI_LOG_LEVEL="debug", and the excerpt below shows the point where it hangs on usdf_cq_readerr, right after the last usdf_am_insert_async. I am defining a hang as 5 minutes; it might hang for longer. With Intel MPI and the usNIC or TCP BTL there is no hang, and the batch job starts running almost immediately.

libfabric-cisco:usnic:domain:usdf_am_get_distance():219<trace>
libfabric-cisco:usnic:av:usdf_am_insert_async():317<trace>
libfabric-cisco:usnic:cq:usdf_cq_readerr():93<trace>
libfabric-cisco:usnic:cq:usdf_cq_readerr():93<trace>
libfabric-cisco:usnic:cq:usdf_cq_readerr():93<trace>
libfabric-cisco:usnic:cq:usdf_cq_readerr():93<trace>

(The readerr lines above keep generating rapidly forever.)

On the large-core runs it happens during the first stages of MPI init, and the job never gets past "Starting STAR-CCM+ parallel server". It does not reach the CPU Affinity Report (I pass the -cpubind bandwidth,v flag to STAR-CCM+). Perhaps this is lower level than MPI, possibly in libfabric-cisco, or, as you point out, in STAR-CCM+ itself.

Interestingly, with a small number of cores selected the job does complete, but we still see the libfabric-cisco:usnic:cq:usdf_cq_readerr():93 messages above.

I will try to run some other app through mpirun and see if I can replicate. I briefly used fi_pingpong and can't replicate the cq_readerr, though I did get plenty of other provider-related errors.

-Logan
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
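For reference, the fi_pingpong reproduction attempt can be sketched roughly like this (fi_pingpong and its -p provider flag ship with libfabric; the hostname node01 is a placeholder for whatever the server node is actually called):

```shell
# Server side: run fi_pingpong against the usnic provider with
# full libfabric debug logging enabled.
FI_LOG_LEVEL=debug fi_pingpong -p usnic

# Client side: same options, pointing at the server node
# (node01 is a placeholder hostname).
FI_LOG_LEVEL=debug fi_pingpong -p usnic node01
```

If the cq_readerr flood shows up here too, that would point at libfabric-cisco rather than Open MPI or STAR-CCM+; this is a command-line invocation sketch, not something tested on this cluster.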