Re: [OMPI users] MPI_Comm_spawn multiple bproc support

2006-11-02 Thread hpe...@infonie.fr
Thank you for your support, Ralph, I really appreciate it.

I have now a better understanding of your very first answer asking if I had a 
NODES environment variable.
It was related to the fact that your platform is configured with LSF. 
I have read some tutorials about LSF and it seems that LSF provides a "llogin" 
command that creates an environment where the NODES variable is permanently 
defined.

Then, under this "llogin" environment, all jobs are automatically allocated to 
the nodes defined with NODES.

This is why, I think, the spawning works fine under these conditions.

Unfortunately, LSF is commercial, so I am not able to install it on my 
platform.
I am afraid there is not much more I can do on my side now.

You proposed to concoct something over the next few days. I look forward to 
hearing from you.

Regards.

Herve



Date: Tue, 31 Oct 2006 06:53:53 -0700
From: Ralph H Castain 
Subject: Re: [OMPI users] MPI_Comm_spawn multiple bproc support
To: "Open MPI Users " 
Message-ID: 
Content-Type: text/plain; charset="ISO-8859-1"

Aha! Thanks for your detailed information - that helps identify the problem.

See some thoughts below.
Ralph


On 10/31/06 3:49 AM, "hpe...@infonie.fr"  wrote:

> Thank you for your quick reply, Ralph,
> 
> As far as I know, the NODES environment variable is created when a job is
> submitted to the bjs scheduler.
> The only way I know (but I am a bproc newbie) is to use the bjssub command.

That is correct. However, Open MPI requires that ALL of the nodes you are
going to use must be allocated in advance. In other words, you have to get
an allocation large enough to run your entire job - both the initial
application and anything you comm_spawn.

I wish I could help you with the proper bjs commands to get an allocation,
but I am not familiar with bjs and (even after multiple Google searches)
cannot find any documentation on that code. Try doing a "bjs --help" and see
what it says.

> 
> Then, I have retried my test with the following running command: "bjssub -i
> mpirun -np 1 main_exe".
> 


> 
> I guess, this problem comes from the way I set the parameters to the spawned
> program. Instead of giving instructions to spawn the program on a specific
> host, I should set parameters to spawn the program on a specific node.
> But I do not know how to do it.
> 

What you did was fine. "host" is the correct field to set. I suspect two
possible issues:

1. The specified host may not be in the allocation. In the case you showed
here, I would expect it to be since you specified the same host we are
already on. However, you might try running mpirun with the "--nolocal"
option - this will force mpirun to launch the processes on a machine other
than the one you are on (typically you are on the head node; in many bproc
machines, this node is not included in an allocation, as the system admins
don't want you running MPI jobs on it).

2. We may have something wrong in our code for this case. I'm not sure how
well that has been tested, especially in the 1.1 code branch.

> Then, I have a bunch of questions:
> - when mpi is used together with bproc, is it necessary to use bjssub or bjs
> in general ?

You have to use some kind of resource manager to obtain a node allocation
for your use. At our site, we use LSF - other people use bjs. Anything that
sets the NODES variable is fine.

> - I was wondering if I had to submit to bjs the spawned program ? i.e do I
> have to add 'bjssub' to the commands parameter of the MPI_Comm_spawn_multiple
> call ?

You shouldn't have to do so. I suspect, however, that bjssub is not getting
a large enough allocation for your combined mpirun + spawned job. I'm not
familiar enough with bjs to know for certain.
> 
> As you can see, I am still not able to spawn a program and need some more
> help.
> Do you have some examples describing how to do it?

Unfortunately, not in the 1.1 branch, nor do I have one for
comm_spawn_multiple that uses the "host" field. I can try to concoct
something over the next few days, though, and verify that our code is
working correctly.
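
For reference, a minimal sketch of such a call might look like the following.
This is not Open MPI's own example; the spawned executable name is a
placeholder, and "machine14" stands for a host that is actually part of the
current allocation:

/* Sketch of MPI_Comm_spawn_multiple using the "host" info key discussed
 * above.  "spawned_exe" is a placeholder program name and "machine14"
 * stands for a host that is part of the current allocation. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    MPI_Info info[1];
    char    *cmds[1]     = { "spawned_exe" };
    int      maxprocs[1] = { 1 };
    int      errcodes[1];

    MPI_Init(&argc, &argv);

    /* Ask for the child to be started on a specific, already-allocated host. */
    MPI_Info_create(&info[0]);
    MPI_Info_set(info[0], "host", "machine14");

    MPI_Comm_spawn_multiple(1, cmds, MPI_ARGVS_NULL, maxprocs, info,
                            0, MPI_COMM_WORLD, &intercomm, errcodes);

    MPI_Info_free(&info[0]);
    MPI_Finalize();
    return 0;
}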








Re: [OMPI users] MPI_Comm_spawn multiple bproc support

2006-11-02 Thread Ralph Castain
I gather you have access to bjs? Could you use bjs to get a node allocation,
and then send me a printout of the environment? All I need to see is what
your environment looks like - how does the system tell you what nodes you
have been allocated?

Then we can make something that will solve your problem.
Ralph



On 11/2/06 1:10 AM, "hpe...@infonie.fr"  wrote:

> [full quote of the preceding message trimmed]

[OMPI users] openmpi problem

2006-11-02 Thread calin pal

Sir,
   On four machines of our college I have installed it in the way that I am
sending you.
I started the four machines as root,
then I installed openmpi-1.1.1.tar.gz using these commands:

tar -xvzf openmpi-1.1.1.tar.gz
cd openmpi-1.1.1
./configure --prefix=/usr/local
make
make all install
ompi_info

That I did as root.

Then, according to your suggestion, I went to the user account (where I have my
program jacobi.c),
gave the password,
and then I ran:

cd .bashrc
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
source .bashrc
mpicc mpihello.c -o mpihello
mpirun -np 4 mpihello


After doing all this I am getting a problem with the libmpi.so file:
"mpihello" is not working.

What am I supposed to do?

Should I have to install again?

Is anything wrong in the installation? Sir, I cannot understand from the FAQ
what you have suggested I look at; that is why I am asking again. Please tell
me whether what I have done on our computers is okay, or whether I have to
change anything in the commands I have written above. Please check it out,
sir, and tell me what is wrong.
Please also read the commands I used, as root and as the user, for installing
and running openmpi-1.1.1.tar.gz; please see them.

calin pal
fergusson college
india
msc.tech(maths and comp.sc)
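
For reference, a minimal MPI hello-world of the kind being compiled above
might look like the following. This is only a sketch; the poster's actual
mpihello.c is not shown in the message:

/* Minimal MPI "hello world" sketch; illustrative only, not the poster's
 * actual mpihello.c. Build with: mpicc mpihello.c -o mpihello */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}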


[OMPI users] Scalapack testing fails with openmpi

2006-11-02 Thread Kevin Radican
Hi,

I have a SEGV problem with Scalapack.  The same configuration works fine with
MPICH, but I seem to get much better performance with Open MPI on this machine.
I have attached the log and the slmake.inc I am using.  I have the same problem
with programs that call the routine that xcdblu uses.  It seems to occur when
the number of processors doesn't match the number of diagonals, for the case of
bwl = 15.  If I choose -np 15 it just seems to hang; however, if I use
mpirun --mca mpi_paffinity_alone 1 -np 15 xcdblu it crashes too.

Any help would be appreciated.

Regards,
Kevin


> mpirun -np 6 xcdblu
SCALAPACK banded linear systems.
'MPI machine'

Tests of the parallel complex single precision band matrix solve
The following scaled residual checks will be computed:
 Solve residual = ||Ax - b|| / (||x|| * ||A|| * eps * N)
 Factorization residual = ||A - LU|| / (||A|| * eps * N)
The matrix A is randomly generated for each test.

An explanation of the input/output parameters follows:
TIME: Indicates whether WALL or CPU time was used.
N   : The number of rows and columns in the matrix A.
bwl, bwu  : The number of diagonals in the matrix A.
NB  : The size of the column panels the matrix A is split into. [-1 for
default]
NRHS: The total number of RHS to solve for.
NBRHS   : The number of RHS to be put on a column of processes before going
  on to the next column of processes.
P   : The number of process rows.
Q   : The number of process columns.
THRESH  : If a residual value is less than THRESH, CHECK is flagged as PASSED
Fact time: Time in seconds to factor the matrix
Sol Time: Time in seconds to solve the system.
MFLOPS  : Rate of execution for factor and solve using sequential operation
count.
MFLOP2  : Rough estimate of speed using actual op count (accurate big P,N).

The following parameter values will be used:
  N    : 3 5 17
  bwl  : 1 3 15
  bwu  : 1 1 4
  NB   : -1
  NRHS : 4
  NBRHS: 1
  P    : 1 1 1 1
  Q    : 1 2 3 4

Relative machine precision (eps) is taken to be   0.596046E-07
Routines pass computational tests if scaled residual is less than   3.

TIME TR    N  BWL  BWU   NB  NRHS   P   Q  L*U Time  Slv Time   MFLOPS   MFLOP2  CHECK
---- --  --- ---- ----  ---  ----  --  --  --------  --------  -------  -------  ------

WALL N     3    1    1    3     4   1   1     0.000        0.     0.00     0.00  PASSED
WALL N     5    1    1    5     4   1   1     0.000        0.     0.00     0.00  PASSED
WALL N     5    3    1    5     4   1   1     0.000        0.     0.00     0.00  PASSED
WALL N    17    1    1   17     4   1   1     0.000        0.     0.00     0.00  PASSED
WALL N    17    3    1   17     4   1   1     0.000        0.     0.00     0.00  PASSED
WALL N    17   15    4   17     4   1   1     0.000        0.     0.00     0.00  PASSED
WALL N     3    1    1    2     4   1   2     0.000        0.     0.00     0.00  PASSED
WALL N     5    1    1    3     4   1   2     0.000        0.     0.00     0.00  PASSED
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x10
[0] func:/usr/local/lib/libopal.so.0 [0x2b0fdb4ee1c0]
[1] func:/lib64/libpthread.so.0 [0x2b0fdbe0d140]
[2]
func:/usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_match+0x2ff)
[0x2b0fde2a4d9f]
[3]
func:/usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback+0xaf)
[0x2b0fde2a5d8f]
[4]
func:/usr/local/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x8c9)
[0x2b0fde5b9e39]
[5] func:/usr/local/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x21)
[0x2b0fde3aeff1]
[6] func:/usr/local/lib/libopal.so.0(opal_progress+0x4a) [0x2b0fdb4d9bfa]
[7] func:/usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv+0x265)
[0x2b0fde2a2c75]
[8]
func:/usr/local/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_basic_linear+0x10b)
[0x2b0fdebe544b]
[9]
func:/usr/local/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_nonoverlapping+0x4d)
[0x2b0fdebe25bd]
[10] func:/usr/local/lib/libmpi.so.0(ompi_comm_nextcid+0x209) [0x2b0fdb207c59]
[11] func:/usr/local/lib/libmpi.so.0(ompi_comm_create+0x8c) [0x2b0fdb206bcc]
[12] func:/usr/local/lib/libmpi.so.0(MPI_Comm_create+0x90) [0x2b0fdb22d890]
[13] func:/usr/local/lib/libmpi.so.0(pmpi_comm_create__+0x42) [0x2b0fdb2491b2]
[14] func:xcdblu(BI_TransUserComm+0xef) [0x46797f]
[15] func:xcdblu(Cblacs_gridmap+0x13a) [0x463e3a]
[16] func:xcdblu(Creshape+0x17c) [0x42365c]
[17] func:xcdblu(pcdbtrf_+0x5d9) [0x42df35]
[18] func:xcdblu(MAIN__+0x190c) [0x417a0c]
[19] func:xcdblu(main+0x32) [0x4160ea]
[20] func:/lib64/libc.so.6(__libc_start_main+0xf4) [0x2b0fdbf34154]
[21] func:xcdblu [0x416029]
*** End of error message ***
1 additional process aborted (not shown)

config.log.tar.gz
Description: GNU Zip compressed data


[OMPI users] Re: Re: Re: Re: Re:MPI_Comm_spawn multiple bproc support

2006-11-02 Thread hpe...@infonie.fr
Hi again, Ralph,

>I gather you have access to bjs? Could you use bjs to get a node allocation,
>and then send me a printout of the environment? 

I have slightly changed my cluster configuration to something like:
the master is running on a machine called machine10
node 0 is running on a machine called machine10 (the same as the master, then)
node 1 is running on a machine called machine14

node 0 and 1 are up

My bjs configuration allocates nodes 0 and 1 to the default pool
<--->
pool default
  policy simple
  nodes 0-1
<->

By default, when I run "env" in a terminal, the NODES variable is not present.
If I run env under a job submission command like "bjssub -i env", then I can see 
the following new environment variables:
NODES=0
JOBID=27 (for instance)
BPROC_RANK=000
BPROC_PROGNAME=/usr/bin/env

When the command is over, NODES is unset again.

What is strange is that I would have expected NODES=0,1. I do not know if 
other bjs users see the same behaviour.

Hopefully, it is the kind of information you were expecting.

Regards.

Herve
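
For what it is worth, the variables listed above can be dumped from inside a
submitted job with a small program along these lines; this is a plain-C
sketch, and the variable names are simply the ones reported above:

/* Sketch: print the allocation-related variables reported above
 * (NODES, JOBID, BPROC_RANK, BPROC_PROGNAME) from inside a submitted job. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *vars[] = { "NODES", "JOBID", "BPROC_RANK", "BPROC_PROGNAME" };
    for (int i = 0; i < 4; i++) {
        const char *val = getenv(vars[i]);
        printf("%s=%s\n", vars[i], val ? val : "(unset)");
    }
    return 0;
}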









Re: [OMPI users] MPI_Comm_spawn multiple bproc support

2006-11-02 Thread Ralph Castain
I truly appreciate your patience. Let me talk to some of our Bproc folks and
see if they can tell me what is going on. I agree - I would have expected
the NODES to be 0,1. The fact that you are getting just 0 explains the
behavior you are seeing with Open MPI.

I also know (though I don't know the command syntax) that you can get a long-term
allocation from bjs (i.e., one that continues until you logout). Let me dig
a little and see how that is done.

Again, I appreciate your patience.
Ralph


On 11/2/06 6:32 AM, "hpe...@infonie.fr"  wrote:

> [full quote of the preceding message trimmed]





Re: [OMPI users] OMPI Collectives

2006-11-02 Thread Pierre Valiron




Tony,

What do you mean by TCP? Are you using an ethernet interconnect?

I have noticed a similar slowdown using LAM/MPI and the MPI_Alltoall
primitive on our Solaris 10 cluster using gigabit ethernet and TCP. For
a large number of nodes I could even come to a complete hangup. Part of
the problem lay in the ethernet network itself, and setting hardware
flow control on the ethernet interfaces and switch led to considerable
improvement. With flow control, I could approach the full duplex
bandwidth (200 MB/s) for large buffer sizes. I could achieve additional
improvement by using optimized algorithms (thanks to George and others
on this point), especially for smaller buffer sizes in the same range
as yours. I did not study the MPI_Reduce case, but I suspect it would
be similar.

If this is relevant to you, you may find this discussion hanging around
somewhere, most probably on the LAM/MPI list, starting August or
September 2005. I had not experimented with Open MPI at that time due to
portability problems on Solaris 10 Opteron platforms. Now these
problems have been solved, and Open MPI is generally faster on our
applications than LAM/MPI and MPICH.

Pierre.



George Bosilca wrote:

  On Oct 28, 2006, at 6:51 PM, Tony Ladd wrote:

  
  
George

Thanks for the references. However, I was not able to figure out if what
I am asking is so trivial it is simply passed over or so subtle that it's
been overlooked (I suspect the former).

  
  
No. The answer to your question was in the articles. We have more
than just the Rabenseifner reduce and all-reduce algorithms. Some of
the most common collective communication calls have up to 15
different implementations in Open MPI. Of course, each of these
implementations gives the best performance under some particular
conditions. Unfortunately, there is no unique algorithm that gives
the best performance in all cases. As we have to deal with multiple
algorithms for each collective, we have to figure out which one is
better and where. This usually depends on the number of nodes in the
communicator, the message size, as well as the network properties. In
a few words, it's difficult to choose the best one without having prior
knowledge about the networks you're trying to use. This is something
we're working on right now in Open MPI. Until then ... it might
happen that for some particular points the collective communications
will not show the best possible performance. However, a slow-down of a
factor of 10 is quite unbelievable. There might be something else going
on there...

   Thanks,
 george.

PS: BTW which version of Open MPI are you using? The one that delivers
the best performance for the collective communications (at least on
high performance networks) is the nightly release of the 1.2 branch.

  
  
The binary tree algorithm in
MPI_Allreduce takes a time proportional to 2*N*log_2(M), where N is the vector
length and M is the number of processes. There is a divide and conquer
strategy
(http://www.hlrs.de/organization/par/services/models/mpi/myreduce.html) that
mpich uses to do an MPI_Reduce in a time proportional to N. Is this algorithm,
or something equivalent, in OpenMPI at present? If so, how do I turn it on?

I also found that OpenMPI is sometimes very slow on MPI_Allreduce  
using TCP.
Things are OK up to 16 processes but at 24 the rates (Message  
length divided
by time) are as follows:

Message size (Kbytes)       Throughput (Mbytes/sec)
                          M=24      M=32      M=48
         1                1.38      1.30      1.09
         2                2.28      1.94      1.50
         4                2.92      2.35      1.73
         8                3.56      2.81      1.99
        16                3.97      1.94      0.12
        32                0.34      0.24      0.13
        64                3.07      2.33      1.57
       128                3.70      2.80      1.89
       256                4.10      3.10      2.08
       512                4.19      3.28      2.08
      1024                4.36      3.36      2.17

Around 16-32 KBytes there is a pronounced slowdown - roughly a factor of 10 -
which seems too much. Any idea what's going on?

Tony

---
Tony Ladd
Chemical Engineering
University of Florida
PO Box 116005
Gainesville, FL 32611-6005

Tel: 352-392-6509
FAX: 352-392-9513
Email: tl...@che.ufl.edu
Web: http://ladd.che.ufl.edu





-- 
Soutenez le mouvement SAUVONS LA RECHERCHE :
http://recherche-en-danger.apinc.org/

   _/_/_/_/_/   _/   Dr. Pierre VALIRON
  _/ _/   _/  _/   Laboratoire d'Astrophysique
 _/ _/   _/ _/Observatoire de Grenoble / UJF
_/_/_/_/_/_/BP 53  F-38041 Grenoble Cedex 9 (France)
   _/  _/   _/http://www-laog.obs.ujf-grenoble.fr/~valiron/
  _/  _/  _/ Mail: pierre.vali...@obs.ujf-grenoble.fr
 _/  _/ _/  Phone: +33 4 7651 4787  Fax: +33 4 7644 8821
_/  _/
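
For context, throughput figures like the ones in the table quoted above can be
produced with a loop of roughly this shape. This is only a sketch, not Tony's
actual benchmark; the buffer sizes and repetition count are illustrative:

/* Sketch of an MPI_Allreduce throughput test: time the call on a float
 * vector and report message-size / time, as in the table quoted above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int kb = 1; kb <= 1024; kb *= 2) {
        int count = kb * 1024 / sizeof(float);
        float *in  = malloc(count * sizeof(float));
        float *out = malloc(count * sizeof(float));
        for (int i = 0; i < count; i++) in[i] = 1.0f;

        const int reps = 50;                     /* illustrative */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++)
            MPI_Allreduce(in, out, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        double t = (MPI_Wtime() - t0) / reps;

        if (rank == 0)
            printf("%5d KB  %8.2f MB/s\n", kb, (kb / 1024.0) / t);
        free(in); free(out);
    }
    MPI_Finalize();
    return 0;
}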

Re: [OMPI users] tickets 39 & 55

2006-11-02 Thread Jeff Squyres

Adding Craig Rasmussen from LANL into the CC list...

On Oct 31, 2006, at 10:26 AM, Michael Kluskens wrote:

OpenMPI tickets 39 & 55 deal with problems with the Fortran 90  
large interface with regards to:


#39: MPI_IN_PLACE in MPI_REDUCE
#55: MPI_GATHER with arrays of different dimensions <https://svn.open-mpi.org/trac/ompi/ticket/55>


Attached is a patch to deal with these two issues as applied  
against OpenMPI-1.3a1r12364.


Thanks for the patch!  Before committing this, though, I think more  
needs to be done and I want to understand it before doing so (part of  
this is me thinking it out while I write this e-mail...).  Also, be  
aware that SC is 1.5 weeks away, so I may not be able to get to  
address this issue before then (SC tends to be all-consuming).


1. The "same type" heuristic for the "large" F90 module was not  
intended to cover all possible scenarios.  You're absolutely right  
that assuming the same time makes no sense for some of the  
interfaces.  The problem is that the obvious alternative (all  
possible scenarios) creates an exponential number of interfaces (in  
the millions).  So "large" was an attempt to provide *some* of the  
interfaces -- but [your] experience has shown that this can do more  
harm than good (i.e., make some legal MPI applications uncompilable  
because we provide *some* interfaces to MPI_GATHER, but not all).


1a. It gets worse because of MPI's semantics for MPI_GATHER.  You  
pointed out one scenario -- it doesn't make sense to supply "integer"  
for both the sendbuf and recvbuf because the root will need an  
integer array to receive all the values (similar logic applies to  
MPI_SCATTER and other collectives -- so what you did for MPI_GATHER  
would need to be applied to several others as well).


1b. But even worse than that is the fact that, for MPI_GATHER, the  
receive buffer is not relevant on non-root processes.  So it's valid  
for *any* type to be passed for non-root processes (leading to the  
exponential interface explosion described above).


So having *some* interfaces for MPI_GATHER can be a problem for both  
1a and 1b -- perfectly valid/legal MPI apps will fail to compile.


I'm not sure what the right balance is here -- how do we allow for  
both 1a and 1b without creating millions of interfaces?  Your patch  
created MPI_GATHER interfaces for all the same types, but allowing  
any dimension mix.  With the default max dimension level of 4 in  
OMPI's interfaces, this created 90 new interfaces for MPI_GATHER,  
calculated (and verified with some grep/wc'ing):


For src buffer of dimension:0   1   2   3   4
Create this many recvbuf types: 4 + 4 + 3 + 2 + 1 = 14

For each src/recvbuf combination, create this many interfaces:

(char + logical + (integer * 4) + (real * 2) + (complex * 2)) = 10

Where 4, 2, and 2 are the number of integer, real, and complex types  
supported by the compiler on my machines (e.g., gfortran on OSX/intel  
and Linux/EM64T).


So this created 14 * 10 = 140 interfaces, as opposed to the 50 that  
were there before the patch (5 dimensions of src/recvbuf * 10 types =  
50), resulting in 90 new interfaces.


This effort will need to be duplicated by several other collectives:

- allgather, allgatherv
- alltoall, alltoallv, alltoallw
- gather, gatherv
- scatter, scatterv

So an increase of 9 * 90 = 810 new interfaces.  Not too bad,  
considering the alternative (exponential).  But consider that the  
"large" interface only has (by my count via egrep/wc) 4013  
interfaces.  This would be increasing its size by about 20%.  This is  
certainly not a show-stopper, but something to consider.


Note that if you go higher than OMPI's default 4 dimensions, the  
number of new interfaces gets considerably larger (e.g., for 7  
dimensions you get 35 send/recv type combinations instead of 14, so  
35 * 10 * 9 = 3150 total interfaces just for the collectives, if  
I did my math right).


2. You also identified another scenario that needs to be fixed --  
support for MPI_IN_PLACE in certain collectives (MPI_REDUCE is not  
the only collective that supports it).  It doesn't seem to be a Good  
Idea to allow the INTEGER type to be mixed with any other type for  
send/recvbuf combinations, just to allow MPI_IN_PLACE.  This  
potentially adds in send/recvbuf signatures that we want to disallow  
(even though they are potentially valid MPI applications!) -- e.g.,  
INTEGER and FLOAT.  What if a user accidentally supplied an INTEGER  
for the sendbuf that wasn't MPI_IN_PLACE?  That's what the type  
system is supposed to be preventing.


I don't know enough about the type system of F90, but it strikes me  
that we should be able to create a unique type for MPI_IN_PLACE  
(don't know why I didn't think of this before for some of the MPI  
sentinel values... :-\ ) and therefore have a safe mechanism for this  
sentinel value.


This would add 10 interfaces for every function that supports  

[OMPI users] dma using infiniband protocol

2006-11-02 Thread Brian Budge

Hi all -

I'm wondering how DMA is handled in OpenMPI when using the infiniband
protocol.  In particular, will I get a speed gain if my read/write buffers
are already pinned via mlock?

Thanks,
 Brian


Re: [OMPI users] dma using infiniband protocol

2006-11-02 Thread Jeff Squyres

This paper explains it pretty well:

http://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/



On Nov 2, 2006, at 1:37 PM, Brian Budge wrote:


Hi all -

I'm wondering how DMA is handled in OpenMPI when using the  
infiniband protocol.  In particular, will I get a speed gain if my  
read/write buffers are already pinned via mlock?


Thanks,
  Brian



--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Re: [OMPI users] dma using infiniband protocol

2006-11-02 Thread Brian Budge

Thanks for the pointer, it was a very interesting read.

It seems that by default OpenMPI uses the nifty pipelining trick with
pinning pages while transfer is happening.  Also the pinning can be
(somewhat) permanent and the state is cached so that the next usage requires no
registration.  I guess it is possible to use pre-pinned memory, but do I
need to do anything special to do so?  I will already have some buffers
pinned to allow DMAs to devices across PCI-Express, so it makes sense to use
one pinned buffer so that I can avoid memcpys.

Are there any HOWTO tutorials or anything?  I've searched around, but it's
possible I just used the wrong search terms.

Thanks,
 Brian



On 11/2/06, Jeff Squyres  wrote:


This paper explains it pretty well:

 http://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/






Re: [OMPI users] dma using infiniband protocol

2006-11-02 Thread Brian W Barrett
Locking a page with mlock() is not all that is required for RDMA  
using InfiniBand (or Myrinet, for that matter).  You have to call  
that device's registration function first.  In Open MPI, that can be  
done implicitly with the mpi_leave_pinned option, which will pin  
memory as needed and then leave it pinned for the life of the  
buffer.  Or it can be done ahead of time by calling MPI_ALLOC_MEM.


Because the amount of memory a NIC can have pinned at any time may  
not directly match the total amount of memory that can be mlock()ed  
at any given time, it's also not a safe assumption that a buffer  
allocated with MPI_ALLOC_MEM or used with an RDMA transfer from MPI  
is going to be mlock()ed as a side effect of NIC registration.  Open  
MPI internally might unregister that memory with the NIC in order to  
register a different memory segment for another memory transfer.


Brian


On Nov 2, 2006, at 12:22 PM, Brian Budge wrote:


[full quote of the preceding message trimmed]




Re: [OMPI users] dma using infiniband protocol

2006-11-02 Thread Gleb Natapov
On Thu, Nov 02, 2006 at 10:37:24AM -0800, Brian Budge wrote:
> Hi all -
> 
> I'm wondering how DMA is handled in OpenMPI when using the infiniband
> protocol.  In particular, will I get a speed gain if my read/write buffers
> are already pinned via mlock?
> 
No, you will not. mlock() has nothing to do with the memory registration that
is needed for RDMA. If you allocate your read/write buffers with
MPI_Alloc_mem() that will help, because this function registers the memory
for you.

--
Gleb.
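
A minimal sketch of what is being suggested here, with the communication
buffer obtained from MPI_Alloc_mem so the library can register it up front;
the 1 MB size and the rank 0/1 exchange are purely illustrative:

/* Sketch: allocate a communication buffer with MPI_Alloc_mem, as
 * suggested above, so the MPI library can register it for RDMA.
 * Buffer size and the simple send/recv are illustrative only. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    void *buf;
    MPI_Aint size = 1 << 20;                      /* 1 MB, illustrative */
    MPI_Alloc_mem(size, MPI_INFO_NULL, &buf);

    if (rank == 0)
        MPI_Send(buf, (int)size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, (int)size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}

The mpi_leave_pinned option mentioned earlier in the thread complements this
by keeping such registrations cached for the life of the buffer.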


Re: [OMPI users] dma using infiniband protocol

2006-11-02 Thread Brian Budge

Thanks for the help guys.

In my case the memory will be allocated and pinned by my other device
driver.  Is it safe to simply use that memory?  My pages won't be unpinned
as a result?

As far as registration, I am sure that OpenMPI will do a better job of that
than I could, so I won't even attempt to futz with that.

Thanks,
 Brian

On 11/2/06, Brian W Barrett  wrote:


[full quote of the preceding message trimmed]



Re: [OMPI users] tickets 39 & 55

2006-11-02 Thread Michael Kluskens

On Nov 2, 2006, at 11:53 AM, Jeff Squyres wrote:


Adding Craig Rasmussen from LANL into the CC list...

On Oct 31, 2006, at 10:26 AM, Michael Kluskens wrote:


OpenMPI tickets 39 & 55 deal with problems with the Fortran 90
large interface with regards to:

#39: MPI_IN_PLACE in MPI_REDUCE 
#55: MPI_GATHER with arrays of different dimensions 

Attached is a patch to deal with these two issues as applied
against OpenMPI-1.3a1r12364.


Thanks for the patch!  Before committing this, though, I think more
needs to be done and I want to understand it before doing so (part of
this is me thinking it out while I write this e-mail...).  Also, be
aware that SC is 1.5 weeks away, so I may not be able to get to
address this issue before then (SC tends to be all-consuming).


Understood, just didn't wish to see this die or get worse.


1. The "same type" heuristic for the "large" F90 module was not
intended to cover all possible scenarios.  You're absolutely right
that assuming the same

dimension (sp)

makes no sense for some of the
interfaces.  The problem is that the obvious alternative (all
possible scenarios) creates an exponential number of interfaces (in
the millions).


I think it can be limited by including reasonable scenarios.  As is,  
it's not very useful, but it can at least be patched by the end- 
builder.



  So "large" was an attempt to provide *some* of the
interfaces -- but [your] experience has shown that this can do more
harm than good (i.e., make some legal MPI applications uncompilable
because we provide *some* interfaces to MPI_GATHER, but not all).


This is a serious issue in my opinion.  I suspect that virtually  
every use of MPI_GATHER and the others would fail with the large  
interfaces as is, thereby making sure no one would be able to use  
the large interfaces on a multiuser system.



1a. It gets worse because of MPI's semantics for MPI_GATHER.  You
pointed out one scenario -- it doesn't make sense to supply "integer"
for both the sendbuf and recvbuf because the root will need an
integer array to receive all the values (similar logic applies to
MPI_SCATTER and other collectives -- so what you did for MPI_GATHER
would need to be applied to several others as well).


Agreed.  I limited my patch to that which I could test with working  
code and could justify work time wise.



1b. But even worse than that is the fact that, for MPI_GATHER, the
receive buffer is not relevant on non-root processes.  So it's valid
for *any* type to be passed for non-root processes (leading to the
exponential interface explosion described above).


I would consider this to be very bad programming practice and not a  
good idea to support in the large interface regardless of the cost.


One issue is that derived datatypes will never (?) work with the  
large interfaces; for that matter, I would guess that derived  
datatypes probably don't work with the medium and possibly small  
interfaces.  I don't know if there is a way around that issue at all  
in F90/F95; some places may have to do two installations.  I don't  
think giving up on all interfaces that conflict with derived types is  
a good solution.



So having *some* interfaces for MPI_GATHER can be a problem for both
1a and 1b -- perfectly valid/legal MPI apps will fail to compile.

I'm not sure what the right balance is here -- how do we allow for
both 1a and 1b without creating millions of interfaces?  Your patch
created MPI_GATHER interfaces for all the same types, but allowing
any dimension mix.  With the default max dimension level of 4 in
OMPI's interfaces, this created 90 new interfaces for MPI_GATHER,
calculated (and verified with some grep/wc'ing):

For src buffer of dimension:0   1   2   3   4
Create this many recvbuf types: 4 + 4 + 3 + 2 + 1 = 14


An alternative would be to allow same and one less dimension for  
large (called dim+1 below), and make all dimensions be optional some  
way.  I know that having these extra interfaces allowed me to find  
serious oversights on my part by permitting me to compile with the  
large interfaces.



For each src/recvbuf combination, create this many interfaces:

(char + logical + (integer * 4) + (real * 2) + (complex * 2)) = 10

Where 4, 2, and 2 are the number of integer, real, and complex types
supported by the compiler on my machines (e.g., gfortran on OSX/intel
and Linux/EM64T).

So this created 14 * 10 = 140 interfaces, as opposed to the 50 that
were there before the patch (5 dimensions of src/recvbuf * 10 types =
50), resulting in 90 new interfaces.

This effort will need to be duplicated by several other collectives:

- allgather, allgatherv
- alltoall, alltoallv, alltoallw
- gather, gatherv
- scatter, scatterv

So an increase of 9 * 90 = 810 new interfaces.  Not too bad,
considering the alternative (exponential).  But consider that the
"large" interface only has (by my count via egrep/wc) 4013
interfaces.  This would be increasing its size by

Re: [OMPI users] dma using infiniband protocol

2006-11-02 Thread Gleb Natapov
On Thu, Nov 02, 2006 at 11:57:16AM -0800, Brian Budge wrote:
> Thanks for the help guys.
> 
> In my case the memory will be allocated and pinned by my other device
> driver.  Is it safe to simply use that memory?  My pages won't be unpinned
> as a result?
> 
If your driver plays nicely with the openib driver, everything should be OK.
If by pinned you mean mlock(), then you are safe, since openib doesn't
use mlock().

> As far as registration, I am sure that OpenMPI will do a better job of that
> than I could, so I won't even attempt to futz with that.
> 
> Thanks,
>  Brian
> 
> [earlier quoted messages trimmed]

--
Gleb.


Re: [OMPI users] tickets 39 & 55

2006-11-02 Thread Pierre Valiron




All this seems a terrific effort. 
Is it really justified, especially if it can't cope with the diversity
of real-world applications ?

I suspect that people who are clever enough to code complex parallel
codes involving collective primitives might be able to check arguments.

If possible, I suggest that "easy" arguments (codes) should be checked,
and that multidimensional ones (buffers) should not. And of course the
doc should make the point clear enough.

If the F03 standard allows a full argument check, let it be for an F03
interface only.

Pierre V. 


Michael Kluskens wrote:

  [full quote of Michael Kluskens's message above trimmed]

Re: [OMPI users] dma using infiniband protocol

2006-11-02 Thread Jeff Squyres
It depends on what you're trying to do.  Are you writing new  
components internal to Open MPI, or are you just trying to leverage  
OMPI's PML for some other project?  Or are you writing MPI  
applications?  Or ...?



On Nov 2, 2006, at 2:22 PM, Brian Budge wrote:


[full quote of the preceding message trimmed]



--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Re: [OMPI users] tickets 39 & 55

2006-11-02 Thread Jeff Squyres

On Nov 2, 2006, at 3:18 PM, Michael Kluskens wrote:


  So "large" was an attempt to provide *some* of the
interfaces -- but [your] experience has shown that this can do more
harm than good (i.e., make some legal MPI applications uncompilable
because we provide *some* interfaces to MPI_GATHER, but not all).


This is a serious issue in my opinion.  I suspect that virtually
every use of MPI_GATHER and the others would fail with the large
interfaces as is, there by making sure no one would be able to use
the large interfaces on a multiuser system.


I agree -- the "large" interface is pretty unusable at this point.


1b. But even worse than that is the fact that, for MPI_GATHER, the
receive buffer is not relevant on non-root processes.  So it's valid
for *any* type to be passed for non-root processes (leading to the
exponential interface explosion described above).


I would consider this to be very bad programming practice and not a
good idea to support in the large interface regardless of the cost.


I admit to not knowing what good practices are here; in C, you can  
just pass NULL for non-root processes and be done with it.  For MPMD  
codes in Fortran (e.g., where non-root processes have a different  
code path than the root processes), I can imagine passing in any old  
buffer that you've got handy since MPI guarantees to ignore it.



One issue is that derived datatypes will never (?) work with the
large interfaces, for that matter I would guess that derived
datatypes probably don't work with medium and possibly small
interfaces.  I don't know if there is away around that issue at all
in F90/F95, some places may have to do two installations.  I don't
think giving up on all interfaces that conflict with derived types is
a good solution.


Very true; derived data types are always going to be a problem for  
F90/F95 (as I understand those languages).  The proposed F03 bindings  
don't have this problem because (again, as I understand the language  
-- and I am *not* a Fortran expert!) they have the equivalent of  
(void*) that we can use for choice buffers.



So there are multiple options here:

1. Keep chasing a "good" definition of "large" such that most/all
current MPI F90 apps can compile.  The problem is that this target
can change over time, and keep requiring maintenance.

2. Stop pursuing "large" because of the problems mentioned above.
This has the potential problem of not providing type safety to F90
MPI apps for the MPI collective interfaces, but at least all apps can
compile, and there's only a small number of 2-choice-buffer functions
that do not get the type safety from F90 (i.e., several MPI
collective functions).

3. Start implementing the proposed F03 MPI interfaces that don't have
the same problems as the F90 MPI interfaces.

I have to admit that I'm leaning more towards #2 (and I wish that
someone who has the time would do #3!) and discarding #1...


I dislike #2 intensely because then I and others couldn't at least
patch the interface scripts before building OpenMPI.


Don't misunderstand me -- by #2, I don't mean ripping the code out of  
Open MPI.  I simply mean not progressing it any further.



#1 is preferred and just give the users/builders clear notice they
may not cover everything and perhaps a hint as to what directory has
the files to be patched to extend the large interface a bit further.


I suppose.  I'd be willing to accept a patch for all the things we  
talked about in this thread (e.g., the stuff you did for GATHER  
extrapolated for all the other collectives that need it, and either  
what you did for REDUCE to allow IN_PLACE or expanding IN_PLACE to be  
a unique datatype as we discussed).  More specifically, I'd rather  
fix *all* the collectives rather than just GATHER/dimensions and  
REDUCE/IN_PLACE.  I unfortunately do not have the cycles to do this  
work myself.  :-\



#3 would be nice but I don't see enough F03 support in enough
compilers at this time.  I don't even have a book on the F03 changes
and I program Fortran most of the day virtually every weekday.  It
took our group till about 2000 to start using Fortran 90 and almost
everything we do is in Fortran.


Craig -- you probably have better visibility on the state of F03  
compilers than I do.  What's the view from that perspective?


--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Re: [OMPI users] dma using infiniband protocol

2006-11-02 Thread Brian Budge

Ha, yeah, I should have been more clear there.  I'm simply writing an MPI
application.

Thanks,
 Brian

On 11/2/06, Jeff Squyres  wrote:


It depends on what you're trying to do.  Are you writing new
components internal to Open MPI, or are you just trying to leverage
OMPI's PML for some other project?  Or are you writing MPI
applications?  Or ...?


[earlier quoted messages trimmed]



[OMPI users] OMPI collectives

2006-11-02 Thread Tony Ladd
George

I found the info I think you were referring to. Thanks. I then experimented
essentially randomly with different algorithms for all reduce. But the issue
with really bad performance for certain message sizes persisted with v1.1.
The good news is that the upgrade to 1.2 fixed my worst problem. Now the
performance is reasonable for all message sizes. I will test the tuned
algorithms again asap.

I had a couple of questions

1) ompi_info lists only 3 or 4 algorithms for allreduce and reduce and about
5 for bcast. But you can use higher numbers as well. Are these additional
undocumented algorithms (you mentioned a number like 15), or is it ignoring
out-of-range parameters?
2) It seems for allreduce you can select a tuned reduce and tuned bcast
instead of the binary tree. But there is a faster allreduce which is order
2N rather than 4N for Reduce + Bcast (N is the message size). It segments the
vector and distributes the root among the nodes; in an allreduce there is no
need to gather the root vector to one processor and then scatter it again. I
wrote a simple version for powers of 2 (MPI_SUM) - any chance of it being
implemented in OMPI?

Tony
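
Regarding question 2, a rough sketch of the reduce-scatter-plus-allgather idea
being described (here in a ring formulation that assumes the vector length is
divisible by the number of ranks; this is not Open MPI's actual
implementation):

/* Sketch of a 2N-cost allreduce: ring reduce-scatter followed by ring
 * allgather.  Assumes count % size == 0 and MPI_SUM on floats. */
#include <mpi.h>
#include <stdlib.h>

void ring_allreduce_sum(float *buf, int count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int seg   = count / size;               /* segment owned per step */
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;
    float *tmp = malloc(seg * sizeof(float));

    /* Reduce-scatter phase: after size-1 steps, segment (rank+1)%size on
     * each rank holds the fully reduced values for that segment. */
    for (int step = 0; step < size - 1; step++) {
        int send_seg = (rank - step + size) % size;
        int recv_seg = (rank - step - 1 + size) % size;
        MPI_Sendrecv(buf + send_seg * seg, seg, MPI_FLOAT, right, 0,
                     tmp, seg, MPI_FLOAT, left, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < seg; i++)
            buf[recv_seg * seg + i] += tmp[i];
    }

    /* Allgather phase: circulate the reduced segments around the ring. */
    for (int step = 0; step < size - 1; step++) {
        int send_seg = (rank - step + 1 + size) % size;
        int recv_seg = (rank - step + size) % size;
        MPI_Sendrecv(buf + send_seg * seg, seg, MPI_FLOAT, right, 0,
                     buf + recv_seg * seg, seg, MPI_FLOAT, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }
    free(tmp);
}

Each element of the vector is sent and received roughly twice, independent of
the number of processes, which is where the 2N figure comes from.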