On Oct 29, 2012, at 7:33 AM, Mahmood Naderan <nt_mahm...@yahoo.com> wrote:

> Thanks again for your answer. The reason I had a negative view of the 
> shared memory feature was that we had been debugging the system (our program, 
> openmpi, cluster settings, ...) for nearly a week. To avoid any confusion, I 
> will use "node". Here we have:
> 1- Node 'A', which has some physical disks and 32GB of memory.
> 2- Node 'B', which has 64GB of memory but no disks. It boots from an image 
> that resides on 'A'.
> 3- There is no tmpfs.
> 4- We installed openmpi with *default* options.
> 5- We run the command "mpirun -np 4 ...." on 'B'.
> 
> 
> So 4 processes are running on 'B'. Assume P1 is trying to send something 
> to P2. This is my understanding (please correct me if I am wrong):
> 1- P1 creates a packet.
> 2- P1 sends the packet to the network interface.
> 3- The packet is transferred from 'B' to 'A'.
> 4- While on 'A', the packet goes to the disk and something is done with it.
> 5- The packet is then on its way back from 'A' to 'B'.
> 6- P2 on 'B' receives the packet.

Wow, that would make no sense at all. If P1 and P2 are on the same node, then 
we will use shared memory to do the transfer, as Jeff described. However, if 
you disable shared memory, as you indicated you were doing in a previous 
message (by adding -mca btl ^sm), then we would use a loopback device if 
available - i.e., the packet would be handed to the network stack, which would 
then return it to P2 without it ever leaving the node.

If there is no loopback device and you disable shared memory, then we would 
abort the job with an error, as there is no way for P1 to communicate with P2.

We would never do what you describe.
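
For reference, the transport selection can be made explicit on the mpirun 
command line (a sketch; the available BTL component names depend on your 
Open MPI version and build):

    # default: on-node peers use the shared memory (sm) BTL
    mpirun -np 4 ./a.out

    # disable shared memory; on-node traffic falls back to TCP loopback
    mpirun -np 4 -mca btl ^sm ./a.out

    # restrict to TCP plus "self" (a process talking to itself)
    mpirun -np 4 -mca btl tcp,self ./a.out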

> 
> That is clearly inefficient communication.
> 
> What I understand from your replies is that if there is a tmpfs, then P1 and 
> P2 can communicate through the memory on 'B', which is fine. But I think there 
> should be more documentation on that. 
>  
> Regards,
> Mahmood
> 
> From: Jeff Squyres <jsquy...@cisco.com>
> To: Mahmood Naderan <nt_mahm...@yahoo.com>; Open MPI Users 
> <us...@open-mpi.org> 
> Sent: Monday, October 29, 2012 1:28 PM
> Subject: Re: [OMPI users] openmpi shared memory feature
> 
> Your original question stuck in my brain over the weekend, and I *think* you 
> may have been asking a different question than I originally answered.  Even 
> though you say we answered your question, I'm going to post my ruminations 
> here anyway.  :-)
> 
> You might have been asking about how a shared memory *file* on a diskless 
> machine -- where the majority of the filesystem is presumably on a network 
> mount -- could be efficient.  
> 
> If you look at the shared memory as a "file" on a filesystem (particularly if 
> it's a network filesystem), then you're right: all file reads and writes turn 
> into network communications.  Therefore, communication through "files" would 
> actually be quite inefficient: reads and writes to such files would be pumped 
> through the network.
> 
> The reality is that shared memory "files" are special kinds of files.  
> They're just rendezvous points for multiple processes to find the shared 
> memory.  Once a process mmaps a shared memory "file", reads and writes to 
> that mapped region no longer go through the underlying filesystem.  
> Instead, they go directly to the shared memory (which is kinda the 
> point).  
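> 
> As an illustration of the rendezvous pattern (a minimal sketch, not Open 
> MPI's actual code; the path "/tmp/sm_demo" is hypothetical, and the flag 
> handshake is simplified -- real code needs proper synchronization):
> 
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <string.h>
>     #include <sys/mman.h>
>     #include <sys/wait.h>
>     #include <unistd.h>
> 
>     int main(void) {
>         /* The file is only a rendezvous point for the mapping;
>            error checks are omitted for brevity. */
>         int fd = open("/tmp/sm_demo", O_CREAT | O_RDWR, 0600);
>         ftruncate(fd, 4096);
>         volatile char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
>                                   MAP_SHARED, fd, 0);
>         close(fd);              /* the mapping survives the close() */
> 
>         if (fork() == 0) {      /* child: wait for the flag byte */
>             while (buf[0] == 0)
>                 ;               /* busy-wait; a real transport is smarter */
>             printf("child read: %s\n", (char *)buf + 1);
>             return 0;
>         }
>         strcpy((char *)buf + 1, "hello through shared memory");
>         buf[0] = 1;             /* a plain store -- no filesystem I/O */
>         wait(NULL);
>         unlink("/tmp/sm_demo");
>         return 0;
>     }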
> 
> There are some corner cases where the contents of the shared memory can be 
> written out to the filesystem (which, in the case of the network filesystem, 
> would result in network communications to the file server), but Open MPI 
> avoids those cases.
> 
> Hope that helps.
> 
> 
> 
> 
> On Oct 27, 2012, at 2:17 PM, Mahmood Naderan wrote:
> 
> > Thanks all. It is now clear.
> > 
> > Regards,
> > Mahmood
> > 
> > From: Damien <dam...@khubla.com>
> > To: Open MPI Users <us...@open-mpi.org> 
> > Sent: Saturday, October 27, 2012 7:25 PM
> > Subject: Re: [OMPI users] openmpi shared memory feature
> > 
> > Mahmood,
> > 
> > To build on what Jeff said, here's a short summary of how diskless clusters 
> > work:
> > 
> > A diskless node gets its operating system through a physical network (say 
> > gig-E), including the HPC applications and the MPI runtimes, from a master 
> > server.  That master server isn't the MPI head node; it's a separate 
> > OS/network-boot server, and it's completely separate from how the MPI 
> > applications run.  The MPI-based HPC applications on the nodes communicate 
> > through a dedicated, faster physical network (say InfiniBand).  There are 
> > two separate networks: one for booting and running nodes, and one for doing 
> > HPC work.  On the same node, MPI processes use shared memory to communicate, 
> > regardless of whether the node is diskless; that's just part of MPI.  Between 
> > nodes, MPI processes use that faster, dedicated network, again regardless of 
> > whether the nodes are diskless.  The networks are kept separate because it's 
> > more efficient.
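> > 
> > As a sketch (BTL component names vary by Open MPI version; "openib" was 
> > the InfiniBand component at the time), the two roles can be pinned 
> > explicitly:
> > 
> >     # shared memory on-node, InfiniBand between nodes
> >     mpirun -np 16 -mca btl self,sm,openib ./a.out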
> > 
> > Damien
> > 
> > On 27/10/2012 11:00 AM, Jeff Squyres wrote:
> > > On Oct 27, 2012, at 12:47 PM, Mahmood Naderan wrote:
> > > 
> > >>> Because communicating through shared memory when sending messages 
> > >>> between processes on the same server is far faster than going through a 
> > >>> network stack.
> > >>  I see... But that is not good for diskless clusters. Am I right? Assume 
> > >> processes are on a node (which has no disk). In this case, their 
> > >> communication goes through the network (from computing node to server), 
> > >> then I/O, and then the network again (from server to computing node).
> > > I don't quite understand what you're saying -- what exactly is your 
> > > distinction between "server" and "computing node"?
> > > 
> > > For the purposes of my reply, I use the word "server" to mean "one 
> > > computational server, possibly containing multiple processors, a bunch of 
> > > RAM, and possibly one or more disks."  For example, a 1U "pizza box" 
> > > style rack enclosure containing the guts of a typical x86-based system.
> > > 
> > > You seem to be relating two orthogonal things: whether a server has a 
> > > disk and how MPI messages flow from one process to another.
> > > 
> > > When using shared memory, the message starts in one process, gets copied 
> > > to shared memory, and then gets copied to the other process.  If you use 
> > > the knem Linux kernel module, we can avoid shared memory in some cases 
> > > and copy the message directly from one process' memory to the other.
> > > 
> > > It's irrelevant as to whether there is a disk or not.
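> > > 
> > > To make that concrete, here is a minimal sketch (standard MPI calls 
> > > only): two ranks exchange an int.  Launched on a single server (e.g., 
> > > "mpirun -np 2 ./a.out"), the transfer goes through shared memory by 
> > > default, disk or no disk:
> > > 
> > >     #include <mpi.h>
> > >     #include <stdio.h>
> > > 
> > >     int main(int argc, char **argv) {
> > >         int rank, msg = 0;
> > >         MPI_Init(&argc, &argv);
> > >         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > >         if (rank == 0) {
> > >             msg = 42;
> > >             /* copied into shared memory by the sm BTL... */
> > >             MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
> > >         } else if (rank == 1) {
> > >             /* ...and copied back out of it here */
> > >             MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
> > >                      MPI_STATUS_IGNORE);
> > >             printf("rank 1 received %d\n", msg);
> > >         }
> > >         MPI_Finalize();
> > >         return 0;
> > >     }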
> > > 
> > 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
