On Sep 6, 2019, at 2:17 PM, Logan Stonebraker via users <users@lists.open-mpi.org> wrote:
>
> I am working with star ccm+ 2019.1.1 Build 14.02.012
>
> CentOS 7.6 kernel 3.10.0-957.21.3.el7.x86_64
>
> Intel MPI Version 2018 Update 5 Build 20190404 (this is the version shipped with star ccm+)
>
> Also trying to make openmpi work (more on that later)
Greetings Logan.  I would definitely recommend Open MPI vs. DAPL/Intel MPI.

> Cisco UCS b200 and c240 cluster using USNIC fabric over 10gbe
> Intel(R) Xeon(R) CPU E5-2698
> 7 nodes
> 280 total cores
>
> enic RPM version kmod-enic-3.2.210.22-738.18.centos7u7.x86_64 installed
> usnic RPM kmod-usnic_verbs-3.2.158.15-738.18.rhel7u6.x86_64 installed
> enic modinfo version: 3.2.210.22
> enic loaded module version: 3.2.210.22
> usnic_verbs modinfo version: 3.2.158.15
> usnic_verbs loaded module version: 3.2.158.15
> libdaplusnic RPM version 2.0.39cisco3.2.112.8 installed
> libfabric RPM version 1.6.0cisco3.2.112.9.rhel7u6 installed
>
> On batch runs less than 5 hours, everything works flawlessly; the jobs complete without error, and it is quite fast with dapl, especially compared to the TCP btl.
>
> However, when running with n-1 (273 total cores), at or around 5 hours into a job, the longer jobs die with a star ccm floating point exception.
> The same job completes fine with no more than 210 cores (30 cores on each of 7 nodes). I would like to be able to use the extra 60 cores.
> I am using PBS Pro with a 99 hour wall time.
>
> Here is the overflow error.
> ------------------
> Turbulent viscosity limited on 56 cells in Region
> A floating point exception has occurred: floating point exception [Overflow]. The specific cause cannot be identified. Please refer to the troubleshooting section of the User's Guide.
> Context: star.coupledflow.CoupledImplicitSolver
> Command: Automation.Run
> error: Server Error
> ------------------
>
> I have not ruled out that I am missing some parameters or tuning with Intel MPI, as this is a new cluster.

That's odd.  That type of error is *usually* not the MPI's fault.

> I am also trying to make Open MPI work. I have openmpi compiled and it runs, and I can see it is using the usnic fabric; however it only runs with a very small number of CPUs. With anything over about 2 cores per node it hangs indefinitely, right after the job starts.

That's also quite odd; it shouldn't *hang*.

> I have compiled Open MPI 3.1.3 from https://www.open-mpi.org/ because this is what the Star CCM version I am running supports. I am telling star to use the Open MPI that I installed so it can support the Cisco USNIC fabric, which I can verify using Cisco native tools (star ships with openmpi btw, however I'm not using it).
>
> I am thinking that I need to tune Open MPI, which was also required with Intel MPI in order to run without an indefinite hang.
>
> With Intel MPI prior to tuning, jobs with more than about 100 cores would hang forever until I added these parameters:
>
> reference: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/542591
> reference: https://software.intel.com/en-us/articles/tuning-the-intel-mpi-library-advanced-techniques
>
> export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
> export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
> export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
> export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
> export I_MPI_DAPL_UD_RNDV_EP_NUM=2
> export I_MPI_DAPL_UD_REQ_EVD_SIZE=2000
> export I_MPI_DAPL_UD_MAX_MSG_SIZE=4096
> export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=2147483647
>
> After adding these params I can scale to 273 cores and it runs very fast, up until the point where it gets the floating point exception about 5 hours into the job.
>
> I am struggling to find equivalent tuning params for Open MPI.

FWIW, you shouldn't need any tuning params -- it should "just work".
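If you do want to experiment anyway, here's a quick sketch of the standard ways to pass MCA parameters to Open MPI; the parameter names and values below are just the ones from your own list, not a tuning recommendation from me.

    # As environment variables: prefix the MCA parameter name with OMPI_MCA_
    export OMPI_MCA_btl_usnic_sd_num=8208
    export OMPI_MCA_btl_usnic_rd_num=8208

    # Or on the mpirun command line ("./your_app" is just a placeholder):
    mpirun --mca btl_usnic_sd_num 8208 --mca btl_usnic_rd_num 8208 ./your_app

    # Or in an MCA parameter file read at startup, one "name = value" per line:
    #   $HOME/.openmpi/mca-params.conf              (per user)
    #   <install prefix>/etc/openmpi-mca-params.conf (system wide)
    btl_usnic_sd_num = 8208
    btl_usnic_rd_num = 8208

Since StarCCM invokes mpirun for you, the environment-variable or mca-params.conf routes are usually the least intrusive.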
> I have listed all the MCA parameters available with Open MPI, and have tried setting these params with no success. I may not have the equivalent params listed here, but this is what I have tried:
>
> btl_max_send_size = 4096
> btl_usnic_eager_limit = 2147483647
> btl_usnic_rndv_eager_limit = 2147483647
> btl_usnic_sd_num = 8208
> btl_usnic_rd_num = 8208
> btl_usnic_prio_sd_num = 8704
> btl_usnic_prio_rd_num = 8704
> btl_usnic_pack_lazy_threshold = -1

All those look reasonable.  Do you know what StarCCM is doing when it hangs?  I.e., is it in an MPI call?

--
Jeff Squyres
jsquy...@cisco.com
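P.S. One way to answer the "is it in an MPI call?" question, assuming gdb is installed on the compute nodes: attach to one of the hung ranks and grab a backtrace (the PID below is hypothetical).

    # Find a hung StarCCM/MPI rank process on one of the compute nodes
    ps -ef | grep star

    # Attach, dump a backtrace from every thread, then detach
    gdb -batch -ex 'thread apply all bt' -p 12345

If the top frames are in libmpi or the usnic BTL, it's stuck inside MPI; if they're in StarCCM's own code, MPI is probably not the culprit.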