Don't forget that MPT has some optimizations Open MPI may not have, such as "overriding" free(). Because of that, MPT can get a huge performance boost if you allocate and free memory a lot, and the same goes if you communicate often.
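For illustration, here is a rough sketch of the kind of free() interposition I mean (this is not MPT's or Open MPI's actual code; the pinned_cache table and cache_mark_pinned() helper are made-up stand-ins for a real RDMA registration cache built on ibv_reg_mr() and friends). Buffers that are still registered with the HCA are kept in a cache instead of being handed back to the allocator, so the next transfer from the same address does not pay for re-registration. You would build it as a shared object and LD_PRELOAD it:

    /* Rough sketch of a free() interposer with a registration cache.
     * The pinned_cache table and cache_mark_pinned() are hypothetical
     * stand-ins for a real RDMA registration cache.
     * Build: gcc -std=c99 -shared -fPIC free_cache.c -o libfreecache.so -ldl */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stddef.h>

    #define CACHE_SLOTS 64

    static void *pinned_cache[CACHE_SLOTS];   /* toy table of pinned buffers */
    static void (*real_free)(void *) = NULL;  /* the C library's free()      */

    /* Called by the (hypothetical) allocation path once a buffer has been
     * registered with the HCA. */
    void cache_mark_pinned(void *ptr)
    {
        for (int i = 0; i < CACHE_SLOTS; ++i) {
            if (pinned_cache[i] == NULL) {
                pinned_cache[i] = ptr;
                return;
            }
        }
    }

    static int is_pinned(void *ptr)
    {
        for (int i = 0; i < CACHE_SLOTS; ++i)
            if (pinned_cache[i] == ptr)
                return 1;
        return 0;
    }

    /* Interposed free(): pinned buffers stay cached, everything else goes
     * back to the allocator as usual. */
    void free(void *ptr)
    {
        if (real_free == NULL)
            real_free = (void (*)(void *)) dlsym(RTLD_NEXT, "free");

        if (ptr != NULL && is_pinned(ptr))
            return;           /* keep it registered for the next transfer */

        real_free(ptr);
    }

As far as I know, Open MPI's mpi_leave_pinned option (which is already set to 1 in your command line below) relies on a similar idea of keeping registrations alive across free().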
Matthieu

2010/12/21 Gilbert Grosdidier <gilbert.grosdid...@cern.ch>:
> Hi George,
>
> Thanks for your help. The bottom line is that the processes are neatly placed on the nodes/cores, as far as I can tell from the map:
>
> [...]
> Process OMPI jobid: [33285,1] Process rank: 4
> Process OMPI jobid: [33285,1] Process rank: 5
> Process OMPI jobid: [33285,1] Process rank: 6
> Process OMPI jobid: [33285,1] Process rank: 7
>
> Data for node: Name: r34i0n1  Num procs: 8
> Process OMPI jobid: [33285,1] Process rank: 8
> Process OMPI jobid: [33285,1] Process rank: 9
> Process OMPI jobid: [33285,1] Process rank: 10
> Process OMPI jobid: [33285,1] Process rank: 11
> Process OMPI jobid: [33285,1] Process rank: 12
> Process OMPI jobid: [33285,1] Process rank: 13
> Process OMPI jobid: [33285,1] Process rank: 14
> Process OMPI jobid: [33285,1] Process rank: 15
>
> Data for node: Name: r34i0n2  Num procs: 8
> Process OMPI jobid: [33285,1] Process rank: 16
> Process OMPI jobid: [33285,1] Process rank: 17
> Process OMPI jobid: [33285,1] Process rank: 18
> Process OMPI jobid: [33285,1] Process rank: 19
> Process OMPI jobid: [33285,1] Process rank: 20
> [...]
>
> But the performances are still very low ;-(
>
> Best, G.
>
> On Dec 20, 2010, at 22:27, George Bosilca wrote:
>
> That's a first step. My question was more related to the process overlay on the cores. If the MPI implementation places one process per node, then rank k and rank k+1 will always be on separate nodes, and the communications will have to go over IB. In the opposite case, if the MPI implementation places the processes per core, then rank k and k+1 will [mostly] be on the same node and the communications will go over shared memory. Depending on how the processes are placed and how you create the neighborhoods, the performance can be drastically impacted.
>
> There is a pretty good description of the problem at:
> http://www.hpccommunity.org/f55/behind-scenes-mpi-process-placement-640/
>
> Some hints at http://www.open-mpi.org/faq/?category=running#mpirun-scheduling. I suggest you play with the --byslot / --bynode options to see how this affects the performance of your application.
>
> For the hardcore cases we provide a rankfile feature. More info at:
> http://www.open-mpi.org/faq/?category=tuning#using-paffinity
>
> Enjoy,
> george.
>
> On Dec 20, 2010, at 15:45, Gilbert Grosdidier wrote:
>
> Yes, there is definitely only 1 process per core with both MPI implementations.
>
> Thanks, G.
>
> On 20/12/2010 at 20:39, George Bosilca wrote:
>
> Are your processes placed the same way with the two MPI implementations? Per-node vs. per-core?
>
> george.
>
> On Dec 20, 2010, at 11:14, Gilbert Grosdidier wrote:
>
> Hello,
>
> I am now at a loss with my running of OpenMPI (namely 1.4.3) on an SGI Altix cluster with 2048 or 4096 cores, running over Infiniband. After fixing several rather obvious failures with Ralph's, Jeff's and John's help, I am now facing the bottom of this story, since:
> - there are no more obvious failures with messages
> - compared to running the application with SGI-MPT, the CPU performances I get are very low, decreasing when the number of cores increases (cf. below)
> - these performances are highly reproducible
> - I tried a very high number of -mca parameters, to no avail
>
> If I take as a reference the MPT CPU speed performance, it is of about 900 (in some arbitrary unit), whatever the number of cores I used (up to 8192).
> But, when running with OMPI, I get:
> - 700 with 1024 cores (which is already rather low)
> - 300 with 2048 cores
> - 60 with 4096 cores.
>
> The computing loop, over which the above CPU performance is evaluated, includes a stack of MPI exchanges [per core: 8 x (MPI_Isend + MPI_Irecv) + MPI_Waitall]. The application is of the 'domain partition' type, and the performances, together with the memory footprint, are practically identical on all cores. The memory footprint is twice as high in the OMPI case (1.5 GB/core) as in the MPT case (0.7 GB/core).
>
> What could be wrong with all this, please?
>
> I provided (in attachment) the 'ompi_info -all' output. The config.log is in attachment as well. I compiled OMPI with icc. I checked that NUMA and affinity are OK.
>
> I use the following command to run my OMPI app:
>
> "mpiexec -mca btl_openib_rdma_pipeline_send_length 65536 \
>     -mca btl_openib_rdma_pipeline_frag_size 65536 \
>     -mca btl_openib_min_rdma_pipeline_size 65536 \
>     -mca btl_self_rdma_pipeline_send_length 262144 \
>     -mca btl_self_rdma_pipeline_frag_size 262144 \
>     -mca plm_rsh_num_concurrent 4096 -mca mpi_paffinity_alone 1 \
>     -mca mpi_leave_pinned 1 -mca btl_sm_max_send_size 128 \
>     -mca coll_tuned_pre_allocate_memory_comm_size_limit 128 \
>     -mca btl_openib_cq_size 128 -mca btl_ofud_rd_num 128 \
>     -mca mpool_rdma_rcache_size_limit 131072 -mca mpi_preconnect_mpi 0 \
>     -mca mpool_sm_min_size 131072 -mca mpi_abort_print_stack 1 \
>     -mca btl sm,openib,self -mca btl_openib_want_fork_support 0 \
>     -mca opal_set_max_sys_limits 1 -mca osc_pt2pt_no_locks 1 \
>     -mca osc_rdma_no_locks 1 \
>     $PBS_JOBDIR/phmc_tm_p2.$PBS_JOBID -v -f $Jinput".
>
> OpenIB info:
>
> 1) OFED-1.4.1, installed by SGI
> 2) Linux xxxxxx 2.6.16.60-0.42.10-smp #1 SMP Tue Apr 27 05:11:27 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
>    OS: SGI ProPack 6SP5 for Linux, Build 605r1.sles10-0909302200
> 3) Running most probably an SGI subnet manager
> 4) ibv_devinfo (on a worker node):
>
>    hca_id: mlx4_0
>        fw_ver:            2.7.000
>        node_guid:         0030:48ff:ffcc:4c44
>        sys_image_guid:    0030:48ff:ffcc:4c47
>        vendor_id:         0x02c9
>        vendor_part_id:    26418
>        hw_ver:            0xA0
>        board_id:          SM_2071000001000
>        phys_port_cnt:     2
>            port: 1
>                state:        PORT_ACTIVE (4)
>                max_mtu:      2048 (4)
>                active_mtu:   2048 (4)
>                sm_lid:       1
>                port_lid:     6009
>                port_lmc:     0x00
>            port: 2
>                state:        PORT_ACTIVE (4)
>                max_mtu:      2048 (4)
>                active_mtu:   2048 (4)
>                sm_lid:       1
>                port_lid:     6010
>                port_lmc:     0x00
>
> 5) ifconfig -a (on a worker node):
>
>    eth0    Link encap:Ethernet  HWaddr 00:30:48:CE:73:30
>            inet addr:192.168.159.10  Bcast:192.168.159.255  Mask:255.255.255.0
>            inet6 addr: fe80::230:48ff:fece:7330/64 Scope:Link
>            UP BROADCAST NOTRAILERS RUNNING MULTICAST  MTU:1500  Metric:1
>            RX packets:32337499 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:34733462 errors:0 dropped:0 overruns:0 carrier:0
>            collisions:0 txqueuelen:1000
>            RX bytes:11486224753 (10954.1 Mb)  TX bytes:16450996864 (15688.8 Mb)
>            Memory:fbc60000-fbc80000
>
>    eth1    Link encap:Ethernet  HWaddr 00:30:48:CE:73:31
>            BROADCAST MULTICAST  MTU:1500  Metric:1
>            RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>            collisions:0 txqueuelen:1000
>            RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>            Memory:fbce0000-fbd00000
>
>    ib0     Link encap:UNSPEC  HWaddr 80-00-00-48-FE-C0-00-00-00-00-00-00-00-00-00-00
>            inet addr:10.148.9.198  Bcast:10.148.255.255  Mask:255.255.0.0
>            inet6 addr: fe80::230:48ff:ffcc:4c45/64 Scope:Link
>            UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>            RX packets:115055101 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:5390843 errors:0 dropped:182 overruns:0 carrier:0
>            collisions:0 txqueuelen:256
>            RX bytes:49592870352 (47295.4 Mb)  TX bytes:43566897620 (41548.6 Mb)
>
>    ib1     Link encap:UNSPEC  HWaddr 80-00-00-49-FE-C0-00-00-00-00-00-00-00-00-00-00
>            inet addr:10.149.9.198  Bcast:10.149.255.255  Mask:255.255.0.0
>            inet6 addr: fe80::230:48ff:ffcc:4c46/64 Scope:Link
>            UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>            RX packets:673448 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:187 errors:0 dropped:5 overruns:0 carrier:0
>            collisions:0 txqueuelen:256
>            RX bytes:37713088 (35.9 Mb)  TX bytes:11228 (10.9 Kb)
>
>    lo      Link encap:Local Loopback
>            inet addr:127.0.0.1  Mask:255.0.0.0
>            inet6 addr: ::1/128 Scope:Host
>            UP LOOPBACK RUNNING  MTU:16436  Metric:1
>            RX packets:33504149 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:33504149 errors:0 dropped:0 overruns:0 carrier:0
>            collisions:0 txqueuelen:0
>            RX bytes:5100850397 (4864.5 Mb)  TX bytes:5100850397 (4864.5 Mb)
>
>    sit0    Link encap:IPv6-in-IPv4
>            NOARP  MTU:1480  Metric:1
>            RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>            collisions:0 txqueuelen:0
>            RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>
> 6) limit (on a worker node):
>
>    cputime         unlimited
>    filesize        unlimited
>    datasize        unlimited
>    stacksize       300000 kbytes
>    coredumpsize    0 kbytes
>    memoryuse       unlimited
>    vmemoryuse      unlimited
>    descriptors     16384
>    memorylocked    unlimited
>    maxproc         303104
>
> If some info is still missing despite all my efforts, please ask.
>
> Thanks in advance for any hints. Best, G.
>
> <config.log.gz> <ompi_info-all.001.gz>

--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher