Re: [OMPI users] Unable to run a python code on cluster with mpirun in parallel

2019-09-09 Thread Ralph Castain via users
Take a look at "man orte_hosts" for a full explanation of how to use a hostfile; 
/etc/hosts is not a properly formatted hostfile.

You really just want a file that lists the names of the hosts, one per line, as 
that is the simplest hostfile.
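
For example, a minimal hostfile and launch might look like this (the host names 
and slot counts below are just placeholders):

  # my_hosts -- one host per line; "slots" is optional and caps procs per host
  node01 slots=4
  node02 slots=4

  $ mpirun -n 5 -hostfile my_hosts python parallel_simulation.py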

> On Sep 7, 2019, at 4:23 AM, Sepinoud Azimi via users 
>  wrote:
> 
> Hi,
> 
> I have a parallelized code that works fine on my local computer with the 
> command 
> 
>   $mpirun -n 5 python parallel_simulation.py
> but when I try the same code on the cluster I only get one process and it does 
> not run in parallel. I have both MPICH and Open MPI loaded on the cluster. 
> 
> I tried to use 
> 
> $mpirun -n 5 -hostfile /etc/hosts python parallel_simulation.py
> which gives me this error: 
> 
> Open RTE detected a parse error in the hostfile:
> 
>   /etc/hosts
> 
> It occured on line number 39 on token 12.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> An internal error has occurred in ORTE:
> 
>   [[57450,0],0] FORCE-TERMINATE AT (null):1 - error base/ras_base_allocate.c(302)
> 
> This is something that should be reported to the developers.
> This is what I get when I run:
> 
> $cat /etc/hosts
> 
> 
> # Ansible managed file, do not edit directly
> 
> 127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
> 
> ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
> 
> 
> # Hosts from ansible hosts file
> 10.1.1.2 titan-install.int.utu.fi titan-install
> 10.1.1.1 titan-admin.int.utu.fi titan-admin
> 10.1.1.3 titan-grid.int.utu.fi titan-grid
> 10.1.100.1 ti1.int.utu.fi ti1
> 10.2.100.1 ti1-ib.int.utu.fi ti1-ib
> 10.1.100.2 ti2.int.utu.fi ti2
> 
> 
> I would be very grateful if someone could suggest a solution. I am very new 
> to this and I am not sure how to solve the problem.


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Floating point overflow and tuning

2019-09-09 Thread Jeff Squyres (jsquyres) via users
On Sep 6, 2019, at 2:17 PM, Logan Stonebraker via users 
 wrote:
> 
> I am working with star ccm+ 2019.1.1 Build 14.02.012
> 
> CentOS 7.6 kernel 3.10.0-957.21.3.el7.x86_64
> 
> Intel MPI Version 2018 Update 5 Build 20190404 (this is version shipped with 
> star ccm+)
> 
> Also trying to make Open MPI work (more on that later)

Greetings Logan.

I would definitely recommend Open MPI vs. DAPL/Intel MPI.

> Cisco UCS b200 and c240 cluster using USNIC fabric over 10gbe
> Intel(R) Xeon(R) CPU E5-2698
> 7 nodes
> 280 total cores
> 
> enic RPM version kmod-enic-3.2.210.22-738.18.centos7u7.x86_64 installed
> usnic RPM kmod-usnic_verbs-3.2.158.15-738.18.rhel7u6.x86_64 installed
> enic modinfo version: 3.2.210.22
> enic loaded module version: 3.2.210.22
> usnic_verbs modinfo version: 3.2.158.15
> usnic_verbs loaded module version: 3.2.158.15
> libdaplusnic RPM version 2.0.39cisco3.2.112.8 installed
> libfabric RPM version 1.6.0cisco3.2.112.9.rhel7u6 installed
> 
> On batch runs of less than 5 hours everything works flawlessly: the jobs 
> complete without error and run quite fast with DAPL, especially compared to 
> the TCP BTL.
> 
> However, when running with n-1 (273 total) cores, the longer jobs die at or 
> around 5 hours into a job with a STAR-CCM+ floating point exception.
> The same job completes fine with no more than 210 cores (30 cores on each of 
> 7 nodes).  I would like to be able to use the 60 additional cores.
> I am using PBS Pro with a 99 hour wall time.
> 
> Here is the overflow error:
> --------------------------------------------------------------------------
> Turbulent viscosity limited on 56 cells in Region
> A floating point exception has occurred: floating point exception [Overflow]. 
>  The specific cause cannot be identified.  Please refer to the 
> troubleshooting section of the User's Guide.
> Context: star.coupledflow.CoupledImplicitSolver
> Command: Automation.Run
>    error: Server Error
> --------------------------------------------------------------------------
> 
> I have not ruled out that I am missing some parameters or tuning with Intel 
> MPI as this is a new cluster.

That's odd.  That type of error is *usually* not the MPI's fault.

> I am also trying to make Open MPI work.  I have Open MPI compiled and it runs, 
> and I can see it is using the usNIC fabric; however, it only runs with a very 
> small number of CPUs.  With anything over about 2 cores per node it hangs 
> indefinitely, right after the job starts.

That's also quite odd; it shouldn't *hang*.

> I have compiled Open MPI 3.1.3 from https://www.open-mpi.org/ because this is 
> the version that the STAR-CCM+ release I am running supports.  I am telling 
> STAR-CCM+ to use the Open MPI that I installed so it can support the Cisco 
> usNIC fabric, which I can verify using Cisco native tools (STAR-CCM+ ships 
> with its own Open MPI, btw, but I'm not using it).
> 
> I am thinking that I need to tune Open MPI, which was also required with Intel 
> MPI in order to run without an indefinite hang.
> 
> With Intel MPI prior to tuning, jobs with more than about 100 cores would 
> hang forever until I added these parameters:
> 
> reference: 
> https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/542591
> reference: 
> https://software.intel.com/en-us/articles/tuning-the-intel-mpi-library-advanced-techniques
> 
> export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
> export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
> export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
> export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
> export I_MPI_DAPL_UD_RNDV_EP_NUM=2
> export I_MPI_DAPL_UD_REQ_EVD_SIZE=2000
> export I_MPI_DAPL_UD_MAX_MSG_SIZE=4096
> export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=2147483647
> 
> After adding these parms I can scale to 273 cores and it runs very fast, up 
> until the point where it gets the floating point exception about 5 hours into 
> the job.
> 
> I am struggling to find equivalent tuning params for Open MPI.

FWIW, you shouldn't need any tuning params -- it should "just work".

> I have listed all the MCA parameters available with Open MPI, and have tried 
> setting these params with no success.  I may not have the equivalent params 
> listed here; this is what I have tried:
> 
> btl_max_send_size = 4096
> btl_usnic_eager_limit = 2147483647
> btl_usnic_rndv_eager_limit = 2147483647
> btl_usnic_sd_num = 8208
> btl_usnic_rd_num = 8208
> btl_usnic_prio_sd_num = 8704
> btl_usnic_prio_rd_num = 8704
> btl_usnic_pack_lazy_threshold = -1

All those look reasonable.
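
For reference, here is a rough sketch of how such values can be applied, either 
per-job or persistently (the launch command below is a placeholder, and the 
values are simply the ones you quoted, not recommendations):

  # List the usnic BTL parameters that are actually available:
  $ ompi_info --param btl usnic --level 9

  # Set parameters for a single run on the mpirun command line:
  $ mpirun --mca btl usnic,self,vader \
           --mca btl_usnic_sd_num 8208 \
           --mca btl_usnic_rd_num 8208 \
           -n 273 ./launch_starccm

  # ...or set them persistently in $HOME/.openmpi/mca-params.conf:
  btl_usnic_sd_num = 8208
  btl_usnic_rd_num = 8208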

Do you know what StarCCM is doing when it hangs?  I.e., is it in an MPI call?

-- 
Jeff Squyres
jsquy...@cisco.com

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Floating point overflow and tuning

2019-09-09 Thread Logan Stonebraker via users
>> Do you know what StarCCM is doing when it hangs?  I.e., is it in an MPI call?

I have set FI_LOG_LEVEL="debug", and below is an excerpt of the point where it 
hangs on usdf_cq_readerr, right after the last usdf_am_insert_async.  I am 
defining a hang as 5 minutes; it might hang for longer.  With Intel MPI and 
usNIC or the TCP BTL, there is no "hang" and it starts happily running the 
batch job almost immediately.

libfabric-cisco:usnic:domain:usdf_am_get_distance():219 
libfabric-cisco:usnic:av:usdf_am_insert_async():317
libfabric-cisco:usnic:cq:usdf_cq_readerr():93
libfabric-cisco:usnic:cq:usdf_cq_readerr():93
libfabric-cisco:usnic:cq:usdf_cq_readerr():93
libfabric-cisco:usnic:cq:usdf_cq_readerr():93
(the readerr lines above are generated rapidly, seemingly forever...)
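
(For reference: one way to make sure such an environment variable reaches every 
rank with Open MPI's mpirun is the -x flag; a rough sketch, where the hostfile 
and launch command are placeholders:)

  # Forward the libfabric debug setting to all ranks
  $ mpirun -x FI_LOG_LEVEL=debug -n 273 -hostfile my_hosts ./launch_starccm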

On the large core runs it happens during the first stages of MPI init, and it 
never gets past "Starting STAR-CCM+ parallel server".  It does not reach the CPU 
Affinity Report (I have the -cpubind bandwidth,v flag in STAR).  

Perhaps this is at a lower level than MPI, possibly in libfabric-cisco, or, as 
you point out, in StarCCM.

Interestingly, with a small number of cores selected the job does complete; 
however, we still see the libfabric-cisco:usnic:cq:usdf_cq_readerr():93 errors 
shown above.

I will try to run some other app through mpirun and see if I can replicate this.
I briefly used fi_pingpong and can't replicate the cq_readerr; however, I did 
get plenty of other errors related to the provider.
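
(For anyone reproducing: a rough sketch of pointing the libfabric pingpong test 
at the usnic provider; the server host name below is a placeholder:)

  # On the server node:
  $ fi_pingpong -p usnic
  # On the client node, connecting to the server:
  $ fi_pingpong -p usnic server-node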

-Logan

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users