Re: [OMPI users] Loopback Communication

2008-03-01 Thread Giovanni Davide Vergine

Loopback communication is a shared-memory copy, of course. Even if it
were implemented with socket() and other network syscalls, the kernel
would ultimately just be copying memory.


On 01 Mar 2008, at 00:25, Elvedin Trnjanin wrote:

I'm using a "ping pong" program to approximate the bandwidth and
latency of various message sizes, and I notice when doing various
transfers (e.g. async) that the maximum bandwidth isn't the system's
maximum bandwidth. I've looked through the FAQ and haven't noticed
this being covered, but how does Open MPI handle loopback
communication? Is it still over a network interconnect or some sort
of shared memory copy?

- Elvedin




Re: [OMPI users] Loopback Communication

2008-03-01 Thread Jeff Squyres

On Feb 29, 2008, at 6:25 PM, Elvedin Trnjanin wrote:

I'm using a "ping pong" program to approximate the bandwidth and
latency of various message sizes, and I notice when doing various
transfers (e.g. async) that the maximum bandwidth isn't the system's
maximum bandwidth. I've looked through the FAQ and haven't noticed
this being covered, but how does Open MPI handle loopback
communication? Is it still over a network interconnect or some sort
of shared memory copy?



There are two kinds of loopback:

1. Messages exchanged between two MPI processes on the same host.
This can be handled by most of OMPI's devices, but the best/fastest
is usually shared memory (i.e., the "sm" BTL).

2. Messages exchanged within a single MPI process. This is handled
by the "self" OMPI device, because it's just a memcpy within a single
process.


So you'd typically want to run (assuming you have an IB network):

mpirun --mca btl openib,self,sm 

That being said, OMPI should usually pick the relevant BTL modules
for you (to include self and sm).
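
For anyone who wants to reproduce this, here is a minimal two-rank
ping-pong sketch in C along the lines of what Elvedin describes. This
is not code from this thread; the file name, the 1 MiB message size,
and the 1000-iteration count are just illustrative choices.

/* pingpong.c -- minimal two-rank ping-pong sketch (illustrative only) */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    const int size = 1 << 20;          /* 1 MiB per message */
    int rank;
    char *buf = calloc(size, 1);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {               /* rank 0: send, then wait for the echo */
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {        /* rank 1: echo everything back */
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        /* each iteration moves 2 * size bytes (out and back) */
        double gbps = 2.0 * iters * size / (t1 - t0) / 1e9;
        printf("message size %d bytes: ~%.2f GB/s round-trip\n", size, gbps);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run on a single host as
"mpirun --mca btl sm,self -np 2 ./pingpong", it exercises the
shared-memory path; swapping in a different BTL list (e.g. tcp,self)
lets you compare against network loopback.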


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Openmpi with SGE

2008-03-01 Thread Reuti

Hi,

Am 19.02.2008 um 12:49 schrieb Neeraj Chourasia:

I am facing a problem while calling mpirun in a loop when using it
with SGE. My SGE version is SGE6.1AR_snapshot3. The script I am
submitting via SGE is:



let i=0

while [ $i -lt 100 ]
do
    echo "######################################################"
    echo "Iteration :$i"
    /usr/local/openmpi-1.2.4/bin/mpirun -np $NP -hostfile $TMP/machines send
    let "i+=1"
    echo "######################################################"
done


Now the above script runs well for 15-20 iterations and then fails
with the following message:


In a Tight Integration with SGE, the qrsh on the slave node has to
perform some housekeeping, e.g. removing the $TMPDIR. (It would be an
RFE to do this only once at the end of the job: a) it saves time, and
b) some parallel jobs need a persistent $TMPDIR on the slave nodes,
which for now must be implemented by hand.)


Could you put a wait of a few seconds (e.g. a sleep) after the mpirun,
before the next iteration? Maybe it will help.


-- Reuti

---------------Error Message---------------

error: executing task of job 3869 failed: execution daemon on host
"n101" didn't accept task

[n199:11989] ERROR: A daemon on node n101 failed to start as expected.
[n199:11989] ERROR: There may be more information available from
[n199:11989] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[n199:11989] ERROR: If the problem persists, please restart the
[n199:11989] ERROR: Grid Engine PE job
[n199:11989] ERROR: The daemon exited unexpectedly with status 1.
--------------------------------------------


When I ssh to n101, there is no orted or qrsh_starter running. While
checking its spool file, I came across the following message:
---------Execd spool Error Message---------
|execd|n101|E|no free queue for job 3869 of user neeraj@n199 (localhost = n101)
--------------------------------------------


What could be the reason for it? While checking the mailing list, I
came across the following link:
http://www.open-mpi.org/community/lists/users/2007/03/2771.php
but I don't think it's the same problem. Any help is appreciated.

Regards
Neeraj




