I am working with STAR-CCM+ 2019.1.1 Build 14.02.012
CentOS 7.6, kernel 3.10.0-957.21.3.el7.x86_64
Intel MPI Version 2018 Update 5 Build 20190404 (this is the version shipped with STAR-CCM+)
Also trying to make Open MPI work (more on that later)
Cisco UCS B200 and C240 cluster using the usNIC fabric over 10 GbE
Intel(R) Xeon(R) CPU E5-2698
7 nodes, 280 total cores
enic RPM version kmod-enic-3.2.210.22-738.18.centos7u7.x86_64 installed
usnic RPM kmod-usnic_verbs-3.2.158.15-738.18.rhel7u6.x86_64 installed
enic modinfo version: 3.2.210.22
enic loaded module version: 3.2.210.22
usnic_verbs modinfo version: 3.2.158.15
usnic_verbs loaded module version: 3.2.158.15
libdaplusnic RPM version 2.0.39cisco3.2.112.8 installed
libfabric RPM version 1.6.0cisco3.2.112.9.rhel7u6 installed
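For reference, the versions above were gathered with the usual tools; a quick sketch of the commands (package names as they appear on my nodes):

  rpm -q kmod-enic kmod-usnic_verbs libdaplusnic libfabric
  modinfo enic | grep -i ^version
  modinfo usnic_verbs | grep -i ^version
  # versions of the currently loaded modules
  cat /sys/module/enic/version /sys/module/usnic_verbs/version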
On batch runs of less than 5 hours everything works flawlessly: the jobs complete without error and are quite fast with DAPL, especially compared to the TCP btl.
However, when running with n-1 cores per node (273 total), the longer jobs die at or around 5 hours into the run with a STAR-CCM+ floating point exception. The same job completes fine with no more than 210 cores (30 cores on each of 7 nodes). I would like to be able to use the additional 63 cores. I am using PBS Pro with a 99 hour wall time.
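For context, the jobs go in through PBS Pro with a script along these lines (a minimal sketch, not my production script; the select/ncpus numbers match the 273-core case, and run.sim is a placeholder):

  #!/bin/bash
  #PBS -l select=7:ncpus=39:mpiprocs=39
  #PBS -l walltime=99:00:00
  cd $PBS_O_WORKDIR
  # 7 nodes x 39 ranks = 273 cores, leaving one core free per node
  starccm+ -batch -np 273 -machinefile $PBS_NODEFILE run.sim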
Here is the overflow error:
------------------
Turbulent viscosity limited on 56 cells in Region
A floating point exception has occurred: floating point exception [Overflow]. The specific cause cannot be identified. Please refer to the troubleshooting section of the User's Guide.
Context: star.coupledflow.CoupledImplicitSolver
Command: Automation.Run
error: Server Error
------------------
I have not ruled out that I am missing some parameters or tuning with Intel MPI, as this is a new cluster.
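In case it helps, the fabric selection can at least be confirmed by turning up Intel MPI's debug output; these are generic Intel MPI 2018 settings, nothing STAR-CCM+-specific:

  export I_MPI_DEBUG=5            # prints the chosen fabric at startup
  export I_MPI_FABRICS=shm:dapl   # shared memory intra-node, DAPL (usNIC) inter-node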
I am also trying to make Open MPI work. I have Open MPI compiled, it runs, and I can see it is using the usNIC fabric; however, it only runs with a very small number of CPUs. With anything over about 2 cores per node it hangs indefinitely, right after the job starts.
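When it hangs I can at least see which BTL is in use by turning up the usnic BTL's verbosity; a minimal sketch, where the application and rank count are placeholders:

  mpirun --mca btl usnic,vader,self \
         --mca btl_base_verbose 100 \
         -np 16 ./app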
I have compiled Open MPI 3.1.3 from https://www.open-mpi.org/ because that is the version the STAR-CCM+ release I am running supports. I am telling STAR-CCM+ to use the Open MPI that I installed so it can support the Cisco usNIC fabric, which I can verify using Cisco native tools (STAR-CCM+ ships with its own Open MPI, by the way, but I'm not using it).
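For completeness, the build was configured roughly as follows; the install prefix and the Cisco libfabric path are assumptions based on my layout, while --with-usnic and --with-libfabric are the standard Open MPI configure options for usNIC support:

  ./configure --prefix=/opt/openmpi-3.1.3 \
              --with-usnic \
              --with-libfabric=/opt/cisco/libfabric
  make -j 16 && make install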
I am thinking that I need to tune Open MPI, which was also required with Intel MPI in order to run without an indefinite hang.
With Intel MPI prior to tuning, jobs with more than about 100 cores would hang
forever until I added these parameters:
reference: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/542591
reference: https://software.intel.com/en-us/articles/tuning-the-intel-mpi-library-advanced-techniques

export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
export I_MPI_DAPL_UD_RNDV_EP_NUM=2
export I_MPI_DAPL_UD_REQ_EVD_SIZE=2000
export I_MPI_DAPL_UD_MAX_MSG_SIZE=4096
export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=2147483647
After adding these parameters I can scale to 273 cores and it runs very fast, up until the point where it gets the floating point exception about 5 hours into the job.
I am struggling to find equivalent tuning parameters for Open MPI. I have listed all of the MCA parameters available with Open MPI and have tried setting the following, with no success. I may not have found the true equivalents, but this is what I have tried:
btl_max_send_size = 4096
btl_usnic_eager_limit = 2147483647
btl_usnic_rndv_eager_limit = 2147483647
btl_usnic_sd_num = 8208
btl_usnic_rd_num = 8208
btl_usnic_prio_sd_num = 8704
btl_usnic_prio_rd_num = 8704
btl_usnic_pack_lazy_threshold = -1
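I set these in the per-user MCA parameter file that Open MPI reads at startup ($HOME/.openmpi/mca-params.conf, one "name = value" per line as above); the same values can also be passed directly to mpirun, e.g.:

  # equivalent to the mca-params.conf entries; remaining args as usual
  mpirun --mca btl_usnic_sd_num 8208 --mca btl_usnic_rd_num 8208 ...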
Does anyone have any advice or ideas on:
1.) the floating point overflow issue, and
2.) equivalent tuning parameters for Open MPI?
Many thanks in advance!
-Logan