On Sep 6, 2019, at 2:17 PM, Logan Stonebraker via users <users@lists.open-mpi.org> wrote:
>
> I am working with star ccm+ 2019.1.1 Build 14.02.012
>
> CentOS 7.6 kernel 3.10.0-957.21.3.el7.x86_64
>
> Intel MPI Version 2018 Update 5 Build 20190404 (this is the version shipped with star ccm+)
>
> Also trying to make openmpi work (more on that later)
Greetings Logan.  I would definitely recommend Open MPI vs. DAPL/Intel MPI.

> Cisco UCS b200 and c240 cluster using USNIC fabric over 10gbe
> Intel(R) Xeon(R) CPU E5-2698
> 7 nodes
> 280 total cores
>
> enic RPM version kmod-enic-3.2.210.22-738.18.centos7u7.x86_64 installed
> usnic RPM kmod-usnic_verbs-3.2.158.15-738.18.rhel7u6.x86_64 installed
> enic modinfo version: 3.2.210.22
> enic loaded module version: 3.2.210.22
> usnic_verbs modinfo version: 3.2.158.15
> usnic_verbs loaded module version: 3.2.158.15
> libdaplusnic RPM version 2.0.39cisco3.2.112.8 installed
> libfabric RPM version 1.6.0cisco3.2.112.9.rhel7u6 installed
>
> On batch runs less than 5 hours, everything works flawlessly; the jobs complete without error, and it is quite fast with dapl, especially compared to the TCP btl.
>
> However, when running with n-1 (273 total cores), at or around 5 hours into a job, the longer jobs die with a star ccm floating point exception.
> The same job completes fine with no more than 210 cores (30 cores on each of 7 nodes). I would like to be able to use the extra 60 cores.
> I am using PBS Pro with a 99 hour wall time.
>
> Here is the overflow error.
> ------------------
> Turbulent viscosity limited on 56 cells in Region
> A floating point exception has occurred: floating point exception [Overflow]. The specific cause cannot be identified. Please refer to the troubleshooting section of the User's Guide.
> Context: star.coupledflow.CoupledImplicitSolver
> Command: Automation.Run
> error: Server Error
> ------------------
>
> I have not ruled out that I am missing some parameters or tuning with Intel MPI, as this is a new cluster.

That's odd.  That type of error is *usually* not the MPI's fault.

> I am also trying to make Open MPI work. I have openmpi compiled and it runs, and I can see it is using the usnic fabric; however it only runs with a very small number of CPUs. With anything over about 2 cores per node it hangs indefinitely, right after the job starts.

That's also quite odd; it shouldn't *hang*.

> I have compiled Open MPI 3.1.3 from https://www.open-mpi.org/ because this is what the Star CCM version I am running supports. I am telling star to use the Open MPI that I installed so it can support the Cisco USNIC fabric, which I can verify using Cisco native tools (star ships with openmpi btw, however I'm not using it).
>
> I am thinking that I need to tune Open MPI, which was also required with Intel MPI in order to run without an indefinite hang.
>
> With Intel MPI prior to tuning, jobs with more than about 100 cores would hang forever until I added these parameters:
>
> reference: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/542591
> reference: https://software.intel.com/en-us/articles/tuning-the-intel-mpi-library-advanced-techniques
>
> export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
> export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
> export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
> export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
> export I_MPI_DAPL_UD_RNDV_EP_NUM=2
> export I_MPI_DAPL_UD_REQ_EVD_SIZE=2000
> export I_MPI_DAPL_UD_MAX_MSG_SIZE=4096
> export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=2147483647
>
> After adding these params I can scale to 273 cores and it runs very fast, up until the point where it gets the floating point exception about 5 hours into the job.
>
> I am struggling to find equivalent tuning params for Open MPI.

FWIW, you shouldn't need any tuning params -- it should "just work".
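If you do want to experiment anyway, here's a quick sketch of the standard ways to pass MCA parameters to Open MPI; the parameter names and values below are just the ones from your own list, not a tuning recommendation from me.

    # As environment variables: prefix the MCA parameter name with OMPI_MCA_
    export OMPI_MCA_btl_usnic_sd_num=8208
    export OMPI_MCA_btl_usnic_rd_num=8208

    # Or on the mpirun command line ("./your_app" is just a placeholder):
    mpirun --mca btl_usnic_sd_num 8208 --mca btl_usnic_rd_num 8208 ./your_app

    # Or in an MCA parameter file read at startup, one "name = value" per line:
    #   $HOME/.openmpi/mca-params.conf              (per user)
    #   <install prefix>/etc/openmpi-mca-params.conf (system wide)
    btl_usnic_sd_num = 8208
    btl_usnic_rd_num = 8208

Since StarCCM invokes mpirun for you, the environment-variable or mca-params.conf routes are usually the least intrusive.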
> I have listed all the MCA parameters available with Open MPI, and have tried setting these params with no success. I may not have the equivalent params listed here, but this is what I have tried:
>
> btl_max_send_size = 4096
> btl_usnic_eager_limit = 2147483647
> btl_usnic_rndv_eager_limit = 2147483647
> btl_usnic_sd_num = 8208
> btl_usnic_rd_num = 8208
> btl_usnic_prio_sd_num = 8704
> btl_usnic_prio_rd_num = 8704
> btl_usnic_pack_lazy_threshold = -1

All those look reasonable.  Do you know what StarCCM is doing when it hangs?  I.e., is it in an MPI call?

--
Jeff Squyres
jsquy...@cisco.com
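P.S. One way to answer the "is it in an MPI call?" question, assuming gdb is installed on the compute nodes: attach to one of the hung ranks and grab a backtrace (the PID below is hypothetical).

    # Find a hung StarCCM/MPI rank process on one of the compute nodes
    ps -ef | grep star

    # Attach, dump a backtrace from every thread, then detach
    gdb -batch -ex 'thread apply all bt' -p 12345

If the top frames are in libmpi or the usnic BTL, it's stuck inside MPI; if they're in StarCCM's own code, MPI is probably not the culprit.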