>> Do you know what StarCCM is doing when it hangs?  I.e., is it in an MPI call?

I have set FI_LOG_LEVEL="debug", and the excerpt below shows the point where it 
hangs in usdf_cq_readerr, right after the last usdf_am_insert_async. I am 
defining a hang as 5 minutes; it may well hang for longer. With Intel MPI, or 
with the USNIC or TCP BTL, there is no "hang" and the batch job starts running 
almost immediately.

libfabric-cisco:usnic:domain:usdf_am_get_distance():219<trace> 
libfabric-cisco:usnic:av:usdf_am_insert_async():317<trace>
libfabric-cisco:usnic:cq:usdf_cq_readerr():93<trace>
libfabric-cisco:usnic:cq:usdf_cq_readerr():93<trace>
libfabric-cisco:usnic:cq:usdf_cq_readerr():93<trace>
libfabric-cisco:usnic:cq:usdf_cq_readerr():93<trace>
(the readerr lines above repeat rapidly, seemingly forever.)
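
For context, this is roughly how the logging variable gets pushed down to the 
MPI processes in my tests. The actual STAR-CCM+ launch line is different, since 
STAR-CCM+ drives mpirun itself; the hostfile name, rank count, and application 
here are just placeholders:

    # export on the launch node, then have Open MPI forward it to every rank
    export FI_LOG_LEVEL=debug
    mpirun -np 64 --hostfile ./hosts -x FI_LOG_LEVEL ./my_mpi_app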

On the large-core runs it happens during the first stages of MPI init, and it 
never gets past "Starting STAR-CCM+ parallel server". It does not reach the 
CPU Affinity Report (I have the -cpubind bandwidth,v flag set in STAR).

It is possible this is lower level than MPI, perhaps in libfabric-cisco, or, 
as you point out, in StarCCM itself.
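
To answer the question of whether it is sitting in an MPI call, the next thing 
I can try is attaching gdb to one of the hung ranks and dumping a backtrace 
(the PID below is a placeholder for an actual hung STAR-CCM+ rank):

    # attach to a hung rank and print the stack of every thread
    gdb -p <pid> -batch -ex "thread apply all bt"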

Interestingly, with a small number of cores selected, the job does complete; 
however, we still see the libfabric-cisco:usnic:cq:usdf_cq_readerr():93 errors 
shown above.

I will try to run some other app through mpirun and see if I can replicate it. 
I briefly used fi_pingpong and could not replicate the cq_readerr, but I did 
get plenty of other errors related to the provider.
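
Roughly what I ran for the fi_pingpong check (fi_pingpong is the fabtests 
utility; the node name is a placeholder, and I am assuming -p selects the 
provider as in the fabtests man page):

    # on the server node
    fi_pingpong -p usnic

    # on the client node, pointing at the server
    fi_pingpong -p usnic server-node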

-Logan
