I am working with STAR-CCM+ 2019.1.1 Build 14.02.012
CentOS 7.6, kernel 3.10.0-957.21.3.el7.x86_64
Intel MPI Version 2018 Update 5 Build 20190404 (this is the version shipped with STAR-CCM+)
Also trying to make Open MPI work (more on that later)
Cisco UCS B200 and C240 cluster using the usNIC fabric over 10 GbE
Intel(R) Xeon(R) CPU E5-2698
7 nodes, 280 total cores
enic RPM version kmod-enic-3.2.210.22-738.18.centos7u7.x86_64 installed
usnic RPM kmod-usnic_verbs-3.2.158.15-738.18.rhel7u6.x86_64 installed
enic modinfo version: 3.2.210.22
enic loaded module version: 3.2.210.22
usnic_verbs modinfo version: 3.2.158.15
usnic_verbs loaded module version: 3.2.158.15
libdaplusnic RPM version 2.0.39cisco3.2.112.8 installed
libfabric RPM version 1.6.0cisco3.2.112.9.rhel7u6 installed
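For reference, the versions above were gathered with the usual tools; a quick sketch of the commands (package names as they appear on my nodes):

  rpm -q kmod-enic kmod-usnic_verbs libdaplusnic libfabric
  modinfo enic | grep -i ^version
  modinfo usnic_verbs | grep -i ^version
  # versions of the currently loaded modules
  cat /sys/module/enic/version /sys/module/usnic_verbs/version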
On batch runs of less than 5 hours everything works flawlessly: the jobs complete without error and are quite fast with DAPL, especially compared to the TCP btl.
However, when running with n-1 cores per node (273 total), the longer jobs die at or around 5 hours into the run with a STAR-CCM+ floating point exception. The same job completes fine with no more than 210 cores (30 cores on each of 7 nodes). I would like to be able to use the additional 63 cores. I am using PBS Pro with a 99 hour wall time.
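For context, the jobs go in through PBS Pro with a script along these lines (a minimal sketch, not my production script; the select/ncpus numbers match the 273-core case, and run.sim is a placeholder):

  #!/bin/bash
  #PBS -l select=7:ncpus=39:mpiprocs=39
  #PBS -l walltime=99:00:00
  cd $PBS_O_WORKDIR
  # 7 nodes x 39 ranks = 273 cores, leaving one core free per node
  starccm+ -batch -np 273 -machinefile $PBS_NODEFILE run.sim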
Here is the overflow error:
------------------
Turbulent viscosity limited on 56 cells in Region
A floating point exception has occurred: floating point exception [Overflow]. The specific cause cannot be identified. Please refer to the troubleshooting section of the User's Guide.
Context: star.coupledflow.CoupledImplicitSolver
Command: Automation.Run
error: Server Error
------------------
I have not ruled out that I am missing some parameters or tuning with Intel MPI, as this is a new cluster.
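In case it helps, the fabric selection can at least be confirmed by turning up Intel MPI's debug output; these are generic Intel MPI 2018 settings, nothing STAR-CCM+-specific:

  export I_MPI_DEBUG=5            # prints the chosen fabric at startup
  export I_MPI_FABRICS=shm:dapl   # shared memory intra-node, DAPL (usNIC) inter-node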
I am also trying to make Open MPI work. I have Open MPI compiled, it runs, and I can see it is using the usNIC fabric; however, it only runs with a very small number of CPUs. With anything over about 2 cores per node it hangs indefinitely, right after the job starts.
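When it hangs I can at least see which BTL is in use by turning up the usnic BTL's verbosity; a minimal sketch, where the application and rank count are placeholders:

  mpirun --mca btl usnic,vader,self \
         --mca btl_base_verbose 100 \
         -np 16 ./app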
I have compiled Open MPI 3.1.3 from https://www.open-mpi.org/ because that is the version the STAR-CCM+ release I am running supports. I am telling STAR-CCM+ to use the Open MPI that I installed so it can support the Cisco usNIC fabric, which I can verify using Cisco native tools (STAR-CCM+ ships with its own Open MPI, by the way, but I'm not using it).
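For completeness, the build was configured roughly as follows; the install prefix and the Cisco libfabric path are assumptions based on my layout, while --with-usnic and --with-libfabric are the standard Open MPI configure options for usNIC support:

  ./configure --prefix=/opt/openmpi-3.1.3 \
              --with-usnic \
              --with-libfabric=/opt/cisco/libfabric
  make -j 16 && make install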
I am thinking that I need to tune Open MPI, which was also required with Intel MPI in order to run without an indefinite hang.
With Intel MPI prior to tuning, jobs with more than about 100 cores would hang
forever until I added these parameters:
reference: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/542591
reference: https://software.intel.com/en-us/articles/tuning-the-intel-mpi-library-advanced-techniques

export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
export I_MPI_DAPL_UD_RNDV_EP_NUM=2
export I_MPI_DAPL_UD_REQ_EVD_SIZE=2000
export I_MPI_DAPL_UD_MAX_MSG_SIZE=4096
export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=2147483647
After adding these parameters I can scale to 273 cores and it runs very fast, up until the point where it gets the floating point exception about 5 hours into the job.
I am struggling to find equivalent tuning parameters for Open MPI. I have listed all of the MCA parameters available with Open MPI and have tried setting the following, with no success. I may not have found the true equivalents, but this is what I have tried:
btl_max_send_size = 4096
btl_usnic_eager_limit = 2147483647
btl_usnic_rndv_eager_limit = 2147483647
btl_usnic_sd_num = 8208
btl_usnic_rd_num = 8208
btl_usnic_prio_sd_num = 8704
btl_usnic_prio_rd_num = 8704
btl_usnic_pack_lazy_threshold = -1
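I set these in the per-user MCA parameter file that Open MPI reads at startup ($HOME/.openmpi/mca-params.conf, one "name = value" per line as above); the same values can also be passed directly to mpirun, e.g.:

  # equivalent to the mca-params.conf entries; remaining args as usual
  mpirun --mca btl_usnic_sd_num 8208 --mca btl_usnic_rd_num 8208 ...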
Does anyone have any advice or ideas on:
1.) the floating point overflow issue, and
2.) equivalent tuning parameters for Open MPI?
Many thanks in advance!
-Logan