[OMPI users] How does OpenMPI decide which algorithm to use in MPI_Bcast?
Hi, I had a glance at the OpenMPI source code and there are several algorithms for the MPI_Bcast function. My question is: how is the algorithm chosen for a given MPI_Bcast call? By message size? Can anyone give me a little detailed information on this? Thanks a lot. Axida
[OMPI users] SHARED Memory
Hi, Does anybody know how to make use of shared memory in the OpenMPI implementation? Thanks
Re: [OMPI users] SHARED Memory
Hi, What I am asking is: if I use MPI_Send and MPI_Recv between processes on one node, does that mean shared memory is used or not? If not, how do I use shared memory among processes that are running on the same node? Thank you!

From: Eugene Loh To: Open MPI Users Sent: Thursday, April 23, 2009 1:20:05 PM Subject: Re: [OMPI users] SHARED Memory

Just to clarify (since "send to self" strikes me as confusing)... If you're talking about using shared memory for point-to-point MPI message passing, OMPI typically uses it automatically between two processes on the same node. It is *not* used for a process sending to itself. There is a well-written FAQ (in my arrogant opinion!) at http://www.open-mpi.org/faq/?category=sm -- e.g., http://www.open-mpi.org/faq/?category=sm#sm-btl . If you're talking about some other use of shared memory, let us know what you had in mind.

Elvedin Trnjanin wrote:
> Shared memory is used for send-to-self scenarios such as if you're making use of multiple slots on the same machine.

shan axida wrote:
> Any body know how to make use of shared memory in OpenMPI implementation?
Re: [OMPI users] SHARED Memory
Hi, I have read that FAQ. Does it mean that shared memory communication is used by default when sending messages between processes on the same node? Are no options or configuration needed for OpenMPI shared memory? THANK YOU!

From: Eugene Loh To: Open MPI Users Sent: Thursday, April 23, 2009 2:08:33 PM Subject: Re: [OMPI users] SHARED Memory

shan axida wrote:
> What I am asking is if I use MPI_Send and MPI_Recv between processes in a node, does it mean using shared memory or not?

It (typically) does. (Some edge cases could occur.) Your question is addressed by the FAQ I mentioned.

> if not, how to use shared memory among processes which are running in a node?
[OMPI users] MPI_Bcast from OpenMPI
Hi, One more question: I executed MPI_Bcast() with 64 processes on a 16-node Ethernet cluster with multiple links. The results are shown in the file attached to this e-mail. What is going on at the 131072 double message size? I have run it many times but the result is always the same. THANK YOU!
Attachment: openmpi.pdf (Adobe PDF document)
Re: [OMPI users] MPI_Bcast from OpenMPI
Hi, Hardware setup:
+ We have 4 NICs in each node of our cluster. That is why I called it 4 links.
+ All nodes are connected by 4 switches (1Gb switches).
+ 4GB memory in each node.
How can I check whether memory access is NUMA or UMA? Thank you!

From: Jeff Squyres To: Open MPI Users Sent: Thursday, April 23, 2009 8:23:52 PM Subject: Re: [OMPI users] MPI_Bcast from OpenMPI

Very strange; 6 seconds for a 1MB broadcast over 64 processes is *way* too long. Even 2.5 sec at 2MB seems too long -- what is your network speed? I'm not entirely sure what you mean by "4 link" on your graph.

Without more information, I would first check your hardware setup to see if there's some kind of network buffering / congestion issue occurring. Here's a total guess: your ethernet switch(es) are low quality (from an HPC perspective, at least) such that you're incurring congestion and/or retransmission at that size for some reason.

You could also be running up against memory bus congestion (I assume you mean 4 cores per node; are they NUMA or UMA?). But that wouldn't account for the huge spike at 1MB.

On Apr 23, 2009, at 1:32 AM, shan axida wrote:
> What is going on at 131072 double message size?
> I have executed it many times but the result is still the same.

--Jeff Squyres
Cisco Systems
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] MPI_Bcast from OpenMPI
Sorry, I made a mistake in the calculation. It is not 131072 doubles but 131072 KB, which is around 128 MB.

From: Jeff Squyres To: Open MPI Users Sent: Thursday, April 23, 2009 8:23:52 PM Subject: Re: [OMPI users] MPI_Bcast from OpenMPI

> Very strange; 6 seconds for a 1MB broadcast over 64 processes is *way* too long. Even 2.5 sec at 2MB seems too long
Re: [OMPI users] MPI_Bcast from OpenMPI
But exactly the same program gets a different result on another cluster. I mean the result doesn't have any spike at all. The second cluster has almost the same features as the first one, except for slightly less memory and a slightly lower clock frequency.
First cluster: 3.0 GHz Intel Xeon, 4GB memory, CentOS 4.6
Second cluster: 2.8 GHz Intel Xeon, 3GB memory, Fedora Core 5
OpenMPI 1.3 is used on both clusters.

From: Eugene Loh To: Open MPI Users Sent: Friday, April 24, 2009 1:26:14 AM Subject: Re: [OMPI users] MPI_Bcast from OpenMPI

Okay. So, going back to Jeff's second surprise, we have 256 Mbyte/2.5 sec = 100 Mbyte/sec = 1 Gbit/sec (sloppy math). So, without getting into details of what we're measuring/reporting here, there doesn't on the face of it appear to be anything wrong with the baseline performance. Jeff was right that 256K doubles should have been faster, but 256 Mbyte... seems reasonable. So, the remaining mystery is the 6x or so spike at 128 Mbyte. Dunno. How important is it to resolve that mystery?

shan axida wrote:
> Sorry, I had a mistake in calculation. Not 131072 (double) but 131072 KB. It means around 128 MB.
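The "sloppy math" in Eugene's reply can be written out as a small C helper. This is only a sketch: the function name `throughput_mbit_per_s` is made up for illustration, and it assumes decimal units (1 Mbyte = 10^6 bytes), which the thread does not specify.

```c
#include <assert.h>

/* Hypothetical helper: convert "mbytes transferred in seconds" into Mbit/s.
 * Assumes decimal units: 1 Mbyte = 10^6 bytes = 8 * 10^6 bits. */
double throughput_mbit_per_s(double mbytes, double seconds)
{
    assert(seconds > 0.0);          /* a zero or negative duration is meaningless */
    return mbytes * 8.0 / seconds;  /* Mbyte -> Mbit, then divide by the time */
}
```

For the 256 Mbyte point at 2.5 sec this gives about 819 Mbit/s (roughly 100 Mbyte/sec), which is close to the 1 Gbit/sec line rate of a single GigE link and is why the baseline looks reasonable.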
Re: [OMPI users] MPI_Bcast from OpenMPI
Thank you, Eugene Loh. It is very important for me to explain the spike in the figure! But I don't know how to hunt down the reason or how to check it. Would you please give me some more practical help? Thank you again.

From: Eugene Loh To: Open MPI Users Sent: Friday, April 24, 2009 2:16:22 PM Subject: Re: [OMPI users] MPI_Bcast from OpenMPI

Right. So, baseline performance seems reasonable, but there is an odd spike that seems difficult to explain. This is annoying, but again: how important is it to resolve that mystery? You can spend a few days trying to hunt this down, only to find that it's some oddity that has no general relevance. I don't know if that's really the case, but I'm just suggesting that it may make most sense just to let this one go.

shan axida wrote:
> But, exactly the same program gets different result in another cluster. I mean the result doesn't have any spike at all.
[OMPI users] OpenMPI MPI_Bcast Algorithms
Hi all, I think there are several algorithms used in MPI_Bcast. I am wondering how the one to be executed is chosen. I mean, how is it decided which algorithm will be used? Does it depend on the message size or something else? Would someone help me? Thank you!
[OMPI users] How to configure NIS and MPI on spread NICs?
Hello all, I want to configure NIS and MPI to use different networks. For example, NIS uses eth0 and MPI uses eth1, something like that. How can I do that? Axida
[OMPI users] How to use Multiple links with OpenMPI?
Hi everyone, I want to ask how to use multiple links (multiple NICs) with OpenMPI. For example, how can I assign a link to each process if there are 4 links and 4 processors on each node in our cluster? Is this the correct way?
hostfile:
--
host1-eth0 slots=1
host1-eth1 slots=1
host1-eth2 slots=1
host1-eth3 slots=1
host2-eth0 slots=1
host2-eth1 slots=1
host2-eth2 slots=1
host2-eth3 slots=1
... ...
... ...
host16-eth0 slots=1
host16-eth1 slots=1
host16-eth2 slots=1
host16-eth3 slots=1
Re: [OMPI users] How to use Multiple links with OpenMPI?
Thank you, Mr. Jeff Squyres! I have conducted a simple MPI_Bcast experiment on our cluster. The results are shown in the file attached to this e-mail. The hostfile is:
-
hostname1 slots=4
hostname2 slots=4
hostname3 slots=4
hostname16 slots=4
-
As we can see in the figure, using 2, 3, or 4 links between nodes is a little faster than a single link. My question is: what would be the reason for getting almost the same performance with 2, 3, and 4 links? Thank you! Axida
Attachment: MPI_Bcast-ypc05xx.pdf (Adobe PDF document)

From: Jeff Squyres To: Open MPI Users Sent: Wednesday, May 27, 2009 11:28:42 PM Subject: Re: [OMPI users] How to use Multiple links with OpenMPI??

Open MPI considers hosts differently than network links. So you should only list the actual hostname in the hostfile, with slots equal to the number of processors (4 in your case, I think?).

Once the MPI processes are launched, they each look around on the host that they're running and find network paths to each of their peers. If there are multiple paths between pairs of peers, Open MPI will round-robin stripe messages across each of the links. We don't really have an easy setting for each peer pair only using 1 link. Indeed, since connectivity is bidirectional, the traffic patterns become less obvious if you want MPI_COMM_WORLD rank X to only use link Y -- what does that mean to the other 4 MPI processes on the other host (with whom you have assumedly assigned their own individual links as well)?

On May 26, 2009, at 12:24 AM, shan axida wrote:
> I want to ask how to use multiple links (multiple NICs) with OpenMPI.
> For example, how can I assign a link to each process, if there are 4 links
> and 4 processors on each node in our cluster?

--Jeff Squyres
Cisco Systems
Re: [OMPI users] How to use Multiple links with OpenMPI?
Hi Mr. Jeff Squyres, Is it possible to use bidirectional communication with MPI on an Ethernet cluster? I tried it once (I thought it would be possible because of the full-duplex switches). However, I could not get the bandwidth improvement I was expecting. If your answer is YES, would you please show me pseudocode for bidirectional communication? Thank you. Axida

From: Jeff Squyres To: Open MPI Users Sent: Wednesday, May 27, 2009 11:28:42 PM Subject: Re: [OMPI users] How to use Multiple links with OpenMPI??

> Indeed, since connectivity is bidirectional, the traffic patterns become less obvious if you want MPI_COMM_WORLD rank X to only use link Y
Re: [OMPI users] How to use Multiple links with OpenMPI?
Hi Jeff Squyres, We have Dell PowerConnect 2724 Gigabit switches connecting the nodes in our cluster. As you said, maybe the speed of the PCI bus is a bottleneck. How can I check that in practice? What is your suggestion for this problem? Thank you! Axida

From: Jeff Squyres To: Open MPI Users Sent: Tuesday, June 2, 2009 10:15:39 AM Subject: Re: [OMPI users] How to use Multiple links with OpenMPI??

Note that striping doesn't really help you much until data sizes get large. For example, networks tend to have an elbow in the graph where the size of the message starts to matter (clearly evident on your graphs). Additionally, you have your network marked as with "hubs" not "switches" -- if you really do have hubs and not switches, you may run into serious contention issues if you start loading up the network.

With both of these factors, even though you have 4 links, you likely aren't going to see much of a performance benefit until you send large messages (which will be limited by your bus speeds -- can you feed all 4 of your links from a single machine at line rate, or will you be limited by PCI bus speeds and contention?), and you may run into secondary performance issues due to contention on your hubs.

On May 28, 2009, at 11:06 PM, shan axida wrote:
> As we can see in the figure, it is little faster than single link
> when we use 2,3,4 links between nodes.
> My question is what would be the reason to make almost the same
> performance when we use 2,3,4 links ?

--Jeff Squyres
Cisco Systems
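Jeff's two points (striping only pays off for large messages, and the host bus may cap what the NICs can be fed) can be illustrated with a deliberately simple latency-plus-bandwidth model. This is a sketch under stated assumptions, not a description of how Open MPI behaves internally: the function name `striped_transfer_seconds`, the model itself, and any numbers plugged into it are hypothetical.

```c
#include <assert.h>

/* Hypothetical model: time = latency + bytes / usable bandwidth, where the
 * usable bandwidth is the sum of the link rates, capped by what the host
 * bus can deliver. Ignores contention, protocol overhead, and switching. */
double striped_transfer_seconds(double bytes, int links,
                                double link_bytes_per_s,
                                double bus_bytes_per_s,
                                double latency_s)
{
    assert(links > 0 && link_bytes_per_s > 0.0 && bus_bytes_per_s > 0.0);
    double aggregate = links * link_bytes_per_s;
    if (aggregate > bus_bytes_per_s)
        aggregate = bus_bytes_per_s;  /* the bus limits how fast the NICs are fed */
    return latency_s + bytes / aggregate;
}
```

With 125e6 bytes/s per GigE link and an assumed bus limit of 250e6 bytes/s, a 1e9-byte transfer takes 8 s on one link and 4 s on two, and stays at 4 s on four links because the bus cap has been reached. For a 1000-byte message with 50 microseconds of latency, the latency term dominates and extra links change the total by only a few microseconds.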
Re: [OMPI users] How to use Multiple links with OpenMPI?
Hi, Yes, we have 2 NICs on the same bus and the other 2 are embedded. We ran the netperf experiment on our cluster and could not get full bandwidth using 4 pairs of copies on two nodes. The bandwidth increases when the number of NICs goes to 2, but there is no big increase when it becomes 3 or 4. Thank you! Axida.

From: Jeff Squyres To: Open MPI Users Sent: Friday, June 5, 2009 11:19:02 PM Subject: Re: [OMPI users] How to use Multiple links with OpenMPI??

On Jun 4, 2009, at 3:42 AM, shan axida wrote:
> We have Dell powerconnect 2724 Gigabit switches to connect the nodes in our
> cluster.
> As you said, may be the speed of PCI bus is a bottleneck.
> How can check it in practical?

Are all your gige nics on the same bus? You might want to try running multiple copies of TCP pt2pt benchmarks simultaneously on your machine to see what kind of performance you get. E.g., run 4 copies of netperf on node A talking to 4 corresponding copies of netperf on node B. Do you get full bandwidth out of all 4 copies?

--Jeff Squyres
Cisco Systems
Re: [OMPI users] "Re: Best way to overlap computation and transfer using MPI over TCP/Ethernet?"
Hi, Would you please tell me in a little more detail how you did the experiment with MPI_Test? Thanks!

From: Lars Andersson To: us...@open-mpi.org Sent: Tuesday, June 9, 2009 6:11:11 AM Subject: Re: [OMPI users] "Re: Best way to overlap computation and transfer using MPI over TCP/Ethernet?"

On Mon, Jun 8, 2009 at 11:07 PM, Lars Andersson wrote:
> I'd say that your own workaround here is to intersperse MPI_TEST's
> periodically. This will trigger OMPI's pipelined protocol for large
> messages, and should allow partial bursts of progress while you're
> assumedly off doing useful work. If this is difficult because the
> work is being done in library code that you can't change, then perhaps
> a pre-spawned "work" thread could be used to call MPI_TEST
> periodically. That way, it won't steal huge amounts of CPU cycles
> (like MPI_WAIT would). You still might get some cache thrashing,
> context switching, etc. -- YMMV.

Thanks Jeff, it's good to hear that this is a valid workaround. I've done a few small experiments, and by calling MPI_Test in a while loop with a usleep(1000) I'm able to get almost full bandwidth for large messages with less than 5% CPU utilization.

/Lars