Re: [OMPI users] Loopback Communication
Loopback communication is a shared-memory copy, of course. Even if it were implemented with socket() and other network syscalls, the kernel would ultimately just be copying memory.

On 01 Mar 2008, at 00:25, Elvedin Trnjanin wrote:

> I'm using a "ping pong" program to approximate the bandwidth and latency of various message sizes, and I notice when doing various transfers (e.g. async) that the maximum bandwidth isn't the system's maximum bandwidth. I've looked through the FAQ and haven't seen this covered: how does Open MPI handle loopback communication? Is it still over a network interconnect, or some sort of shared-memory copy?
>
> - Elvedin

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Loopback Communication
On Feb 29, 2008, at 6:25 PM, Elvedin Trnjanin wrote:

> I'm using a "ping pong" program to approximate the bandwidth and latency of various message sizes, and I notice when doing various transfers (e.g. async) that the maximum bandwidth isn't the system's maximum bandwidth. I've looked through the FAQ and haven't seen this covered: how does Open MPI handle loopback communication? Is it still over a network interconnect, or some sort of shared-memory copy?

There are two kinds of loopback:

1. Messages exchanged between two MPI processes on the same host. This can be handled by most of OMPI's devices, but the best/fastest is usually shared memory (i.e., the "sm" BTL).

2. Messages exchanged within a single MPI process (i.e., a process sending to itself). This is handled by the "self" OMPI device, because it's just a memcpy within a single process.

So you'd typically want to run (assuming you have an IB network):

    mpirun --mca btl openib,self,sm

That being said, OMPI should usually pick the relevant BTL modules for you (including self and sm).

--
Jeff Squyres
Cisco Systems
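The two launch styles described above can be sketched as follows. This is illustrative only: `./pingpong` is a hypothetical benchmark binary, and `RUN` defaults to `echo` so the lines can be displayed without an MPI installation (drop the `$RUN` prefix to actually launch).

```shell
#!/bin/sh
# Sketch of the BTL selection discussed above (Open MPI 1.2-era syntax).
# "echo" stands in for real execution; ./pingpong is a hypothetical binary.
RUN=${RUN:-echo}

# Single host: "sm" carries messages between the two processes,
# "self" carries each process's sends to itself.
$RUN mpirun --mca btl self,sm -np 2 ./pingpong

# With an InfiniBand network, add the "openib" BTL for off-host traffic:
$RUN mpirun --mca btl openib,self,sm -np 2 ./pingpong
```

As the reply notes, explicitly listing BTLs is usually unnecessary: Open MPI's default selection already includes self and sm where they apply.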
Re: [OMPI users] Openmpi with SGE
Hi,

On 19 Feb 2008, at 12:49, Neeraj Chourasia wrote:

> I am facing a problem while calling mpirun in a loop when using it with SGE. My SGE version is SGE6.1AR_snapshot3. The script I am submitting via SGE is:
>
> xx xxx
> let i=0
> while [ $i -lt 100 ]
> do
>     echo "####"
>     echo "Iteration :$i"
>     /usr/local/openmpi-1.2.4/bin/mpirun -np $NP -hostfile $TMP/machines send
>     let "i+=1"
>     echo "####"
> done
> xx xx
>
> Now the above script runs well for 15-20 iterations and then fails with the following message

In a Tight Integration with SGE, the qrsh on the slave node has to perform some housekeeping, e.g. removing $TMPDIR. (It would be an RFE to do this only once at the end of the job: a) it saves time, and b) some parallel jobs need a persistent $TMPDIR on the slave nodes, which for now must be implemented by hand.)

Could you put a wait of some seconds after the mpirun, before the next iteration? Maybe it will help.

-- Reuti

> ---Error Message---
> error: executing task of job 3869 failed: execution daemon on host "n101" didn't accept task
> [n199:11989] ERROR: A daemon on node n101 failed to start as expected.
> [n199:11989] ERROR: There may be more information available from
> [n199:11989] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [n199:11989] ERROR: If the problem persists, please restart the
> [n199:11989] ERROR: Grid Engine PE job
> [n199:11989] ERROR: The daemon exited unexpectedly with status 1.
>
> When I do ssh to n101, there is no orted or qrsh_starter running. While checking its spool file, I came across the following message:
>
> ---Execd spool Error Message---
> |execd|n101|E|no free queue for job 3869 of user neeraj@n199 (localhost = n101)
>
> What could be the reason for it? While checking the mailing list, I came across
> http://www.open-mpi.org/community/lists/users/2007/03/2771.php
> but I don't think it's the same problem. Any help is appreciated.
>
> Regards,
> Neeraj
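Reuti's workaround can be sketched as below. This is a minimal illustration, not the original job script: `MPIRUN` is a placeholder for the real launch line (`/usr/local/openmpi-1.2.4/bin/mpirun -np $NP -hostfile $TMP/machines send`), and the iteration count and wait are small demo values (the original job looped 100 times).

```shell
#!/bin/sh
# Sketch of adding a wait between mpirun iterations so qrsh's housekeeping
# on the slave nodes (e.g. removing $TMPDIR) can finish before the next
# daemon start is requested. MPIRUN is a stand-in for the real launch line.
MPIRUN=${MPIRUN:-"echo mpirun"}   # placeholder so the sketch runs anywhere
ITERATIONS=${ITERATIONS:-3}       # the original job used 100
WAIT=${WAIT:-1}                   # seconds to wait between iterations

i=0
while [ "$i" -lt "$ITERATIONS" ]; do
    echo "Iteration: $i"
    $MPIRUN
    sleep "$WAIT"                 # let the execution daemons clean up
    i=$((i + 1))
done
```

The pause gives the execution daemon on each slave node time to finish its per-task teardown before the next iteration asks it to start a new task, which is what the "didn't accept task" error suggests is failing.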