Don't forget that MPT has some optimizations Open MPI may not have, such as "overriding" free(). Because of that, MPT can get a huge performance boost if you allocate and free memory a lot, and the same goes if you communicate often.
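For illustration, here is a rough sketch of the kind of free() interposition I mean (this is not MPT's or Open MPI's actual code; the pinned_cache table and cache_mark_pinned() helper are made-up stand-ins for a real RDMA registration cache built on ibv_reg_mr() and friends). Buffers that are still registered with the HCA are kept in a cache instead of being handed back to the allocator, so the next transfer from the same address does not pay for re-registration. You would build it as a shared object and LD_PRELOAD it:

    /* Rough sketch of a free() interposer with a registration cache.
     * The pinned_cache table and cache_mark_pinned() are hypothetical
     * stand-ins for a real RDMA registration cache.
     * Build: gcc -std=c99 -shared -fPIC free_cache.c -o libfreecache.so -ldl */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stddef.h>

    #define CACHE_SLOTS 64

    static void *pinned_cache[CACHE_SLOTS];   /* toy table of pinned buffers */
    static void (*real_free)(void *) = NULL;  /* the C library's free()      */

    /* Called by the (hypothetical) allocation path once a buffer has been
     * registered with the HCA. */
    void cache_mark_pinned(void *ptr)
    {
        for (int i = 0; i < CACHE_SLOTS; ++i) {
            if (pinned_cache[i] == NULL) {
                pinned_cache[i] = ptr;
                return;
            }
        }
    }

    static int is_pinned(void *ptr)
    {
        for (int i = 0; i < CACHE_SLOTS; ++i)
            if (pinned_cache[i] == ptr)
                return 1;
        return 0;
    }

    /* Interposed free(): pinned buffers stay cached, everything else goes
     * back to the allocator as usual. */
    void free(void *ptr)
    {
        if (real_free == NULL)
            real_free = (void (*)(void *)) dlsym(RTLD_NEXT, "free");

        if (ptr != NULL && is_pinned(ptr))
            return;           /* keep it registered for the next transfer */

        real_free(ptr);
    }

As far as I know, Open MPI's mpi_leave_pinned option (which is already set to 1 in your command line below) relies on a similar idea of keeping registrations alive across free().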
Matthieu

2010/12/21 Gilbert Grosdidier <gilbert.grosdid...@cern.ch>:
> Hi George,
>
> Thanks for your help. The bottom line is that the processes are neatly placed on the nodes/cores, as far as I can tell from the map:
>
> [...]
> Process OMPI jobid: [33285,1] Process rank: 4
> Process OMPI jobid: [33285,1] Process rank: 5
> Process OMPI jobid: [33285,1] Process rank: 6
> Process OMPI jobid: [33285,1] Process rank: 7
>
> Data for node: Name: r34i0n1  Num procs: 8
> Process OMPI jobid: [33285,1] Process rank: 8
> Process OMPI jobid: [33285,1] Process rank: 9
> Process OMPI jobid: [33285,1] Process rank: 10
> Process OMPI jobid: [33285,1] Process rank: 11
> Process OMPI jobid: [33285,1] Process rank: 12
> Process OMPI jobid: [33285,1] Process rank: 13
> Process OMPI jobid: [33285,1] Process rank: 14
> Process OMPI jobid: [33285,1] Process rank: 15
>
> Data for node: Name: r34i0n2  Num procs: 8
> Process OMPI jobid: [33285,1] Process rank: 16
> Process OMPI jobid: [33285,1] Process rank: 17
> Process OMPI jobid: [33285,1] Process rank: 18
> Process OMPI jobid: [33285,1] Process rank: 19
> Process OMPI jobid: [33285,1] Process rank: 20
> [...]
>
> But the performances are still very low ;-(
>
> Best, G.
>
> On Dec 20, 2010, at 22:27, George Bosilca wrote:
>
> That's a first step. My question was more related to the process overlay on the cores. If the MPI implementation places one process per node, then rank k and rank k+1 will always be on separate nodes, and the communications will have to go over IB. In the opposite case, if the MPI implementation places the processes per core, then rank k and k+1 will [mostly] be on the same node and the communications will go over shared memory. Depending on how the processes are placed and how you create the neighborhoods, the performance can be drastically impacted.
>
> There is a pretty good description of the problem at:
> http://www.hpccommunity.org/f55/behind-scenes-mpi-process-placement-640/
>
> Some hints at http://www.open-mpi.org/faq/?category=running#mpirun-scheduling. I suggest you play with the --byslot / --bynode options to see how this affects the performance of your application.
>
> For the hardcore cases we provide a rankfile feature. More info at:
> http://www.open-mpi.org/faq/?category=tuning#using-paffinity
>
> Enjoy,
> george.
>
> On Dec 20, 2010, at 15:45, Gilbert Grosdidier wrote:
>
> Yes, there is definitely only 1 process per core with both MPI implementations.
>
> Thanks, G.
>
> On 20/12/2010 at 20:39, George Bosilca wrote:
>
> Are your processes placed the same way with the two MPI implementations? Per-node vs. per-core?
>
> george.
>
> On Dec 20, 2010, at 11:14, Gilbert Grosdidier wrote:
>
> Hello,
>
> I am now at a loss with my running of OpenMPI (namely 1.4.3) on an SGI Altix cluster with 2048 or 4096 cores, running over Infiniband. After fixing several rather obvious failures with Ralph's, Jeff's and John's help, I am now facing the bottom of this story, since:
> - there are no more obvious failures with messages
> - compared to running the application with SGI-MPT, the CPU performances I get are very low, decreasing when the number of cores increases (cf. below)
> - these performances are highly reproducible
> - I tried a very high number of -mca parameters, to no avail
>
> If I take as a reference the MPT CPU speed performance, it is of about 900 (in some arbitrary unit), whatever the number of cores I used (up to 8192).
> But, when running with OMPI, I get:
> - 700 with 1024 cores (which is already rather low)
> - 300 with 2048 cores
> - 60 with 4096 cores.
>
> The computing loop, over which the above CPU performance is evaluated, includes a stack of MPI exchanges [per core: 8 x (MPI_Isend + MPI_Irecv) + MPI_Waitall]. The application is of the 'domain partition' type, and the performances, together with the memory footprint, are practically identical on all cores. The memory footprint is twice as high in the OMPI case (1.5 GB/core) as in the MPT case (0.7 GB/core).
>
> What could be wrong with all this, please?
>
> I provided (in attachment) the 'ompi_info -all' output. The config.log is in attachment as well. I compiled OMPI with icc. I checked that NUMA and affinity are OK.
>
> I use the following command to run my OMPI app:
>
> "mpiexec -mca btl_openib_rdma_pipeline_send_length 65536 \
>     -mca btl_openib_rdma_pipeline_frag_size 65536 \
>     -mca btl_openib_min_rdma_pipeline_size 65536 \
>     -mca btl_self_rdma_pipeline_send_length 262144 \
>     -mca btl_self_rdma_pipeline_frag_size 262144 \
>     -mca plm_rsh_num_concurrent 4096 -mca mpi_paffinity_alone 1 \
>     -mca mpi_leave_pinned 1 -mca btl_sm_max_send_size 128 \
>     -mca coll_tuned_pre_allocate_memory_comm_size_limit 128 \
>     -mca btl_openib_cq_size 128 -mca btl_ofud_rd_num 128 \
>     -mca mpool_rdma_rcache_size_limit 131072 -mca mpi_preconnect_mpi 0 \
>     -mca mpool_sm_min_size 131072 -mca mpi_abort_print_stack 1 \
>     -mca btl sm,openib,self -mca btl_openib_want_fork_support 0 \
>     -mca opal_set_max_sys_limits 1 -mca osc_pt2pt_no_locks 1 \
>     -mca osc_rdma_no_locks 1 \
>     $PBS_JOBDIR/phmc_tm_p2.$PBS_JOBID -v -f $Jinput".
>
> OpenIB info:
>
> 1) OFED-1.4.1, installed by SGI
> 2) Linux xxxxxx 2.6.16.60-0.42.10-smp #1 SMP Tue Apr 27 05:11:27 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
>    OS: SGI ProPack 6SP5 for Linux, Build 605r1.sles10-0909302200
> 3) Running most probably an SGI subnet manager
> 4) ibv_devinfo (on a worker node):
>
>    hca_id: mlx4_0
>        fw_ver:            2.7.000
>        node_guid:         0030:48ff:ffcc:4c44
>        sys_image_guid:    0030:48ff:ffcc:4c47
>        vendor_id:         0x02c9
>        vendor_part_id:    26418
>        hw_ver:            0xA0
>        board_id:          SM_2071000001000
>        phys_port_cnt:     2
>            port: 1
>                state:        PORT_ACTIVE (4)
>                max_mtu:      2048 (4)
>                active_mtu:   2048 (4)
>                sm_lid:       1
>                port_lid:     6009
>                port_lmc:     0x00
>            port: 2
>                state:        PORT_ACTIVE (4)
>                max_mtu:      2048 (4)
>                active_mtu:   2048 (4)
>                sm_lid:       1
>                port_lid:     6010
>                port_lmc:     0x00
>
> 5) ifconfig -a (on a worker node):
>
>    eth0    Link encap:Ethernet  HWaddr 00:30:48:CE:73:30
>            inet addr:192.168.159.10  Bcast:192.168.159.255  Mask:255.255.255.0
>            inet6 addr: fe80::230:48ff:fece:7330/64 Scope:Link
>            UP BROADCAST NOTRAILERS RUNNING MULTICAST  MTU:1500  Metric:1
>            RX packets:32337499 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:34733462 errors:0 dropped:0 overruns:0 carrier:0
>            collisions:0 txqueuelen:1000
>            RX bytes:11486224753 (10954.1 Mb)  TX bytes:16450996864 (15688.8 Mb)
>            Memory:fbc60000-fbc80000
>
>    eth1    Link encap:Ethernet  HWaddr 00:30:48:CE:73:31
>            BROADCAST MULTICAST  MTU:1500  Metric:1
>            RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>            collisions:0 txqueuelen:1000
>            RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>            Memory:fbce0000-fbd00000
>
>    ib0     Link encap:UNSPEC  HWaddr 80-00-00-48-FE-C0-00-00-00-00-00-00-00-00-00-00
>            inet addr:10.148.9.198  Bcast:10.148.255.255  Mask:255.255.0.0
>            inet6 addr: fe80::230:48ff:ffcc:4c45/64 Scope:Link
>            UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>            RX packets:115055101 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:5390843 errors:0 dropped:182 overruns:0 carrier:0
>            collisions:0 txqueuelen:256
>            RX bytes:49592870352 (47295.4 Mb)  TX bytes:43566897620 (41548.6 Mb)
>
>    ib1     Link encap:UNSPEC  HWaddr 80-00-00-49-FE-C0-00-00-00-00-00-00-00-00-00-00
>            inet addr:10.149.9.198  Bcast:10.149.255.255  Mask:255.255.0.0
>            inet6 addr: fe80::230:48ff:ffcc:4c46/64 Scope:Link
>            UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>            RX packets:673448 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:187 errors:0 dropped:5 overruns:0 carrier:0
>            collisions:0 txqueuelen:256
>            RX bytes:37713088 (35.9 Mb)  TX bytes:11228 (10.9 Kb)
>
>    lo      Link encap:Local Loopback
>            inet addr:127.0.0.1  Mask:255.0.0.0
>            inet6 addr: ::1/128 Scope:Host
>            UP LOOPBACK RUNNING  MTU:16436  Metric:1
>            RX packets:33504149 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:33504149 errors:0 dropped:0 overruns:0 carrier:0
>            collisions:0 txqueuelen:0
>            RX bytes:5100850397 (4864.5 Mb)  TX bytes:5100850397 (4864.5 Mb)
>
>    sit0    Link encap:IPv6-in-IPv4
>            NOARP  MTU:1480  Metric:1
>            RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>            collisions:0 txqueuelen:0
>            RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>
> 6) limit (on a worker node):
>
>    cputime         unlimited
>    filesize        unlimited
>    datasize        unlimited
>    stacksize       300000 kbytes
>    coredumpsize    0 kbytes
>    memoryuse       unlimited
>    vmemoryuse      unlimited
>    descriptors     16384
>    memorylocked    unlimited
>    maxproc         303104
>
> If some info is still missing despite all my efforts, please ask.
>
> Thanks in advance for any hints. Best, G.
>
> <config.log.gz> <ompi_info-all.001.gz>

--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher