Re: [OMPI users] How to use a wrapper for ssh?

2011-07-13 Thread Paul Kapinos

Hi Ralph,

2. use MCA parameters described in
http://www.open-mpi.org/faq/?category=rsh#rsh-not-ssh
to bend the call to my wrapper, e.g.
export OMPI_MCA_plm_rsh_agent=WrapPer
export OMPI_MCA_orte_rsh_agent=WrapPer

The odd thing is that the OMPI_MCA_orte_rsh_agent envvar seems to have no
effect, whereas OMPI_MCA_plm_rsh_agent works.
Why do I believe so?


orte_rsh_agent doesn't exist in the 1.4 series :-)
Only plm_rsh_agent is available in 1.4. "ompi_info --param orte all" and "ompi_info 
--param plm rsh" will confirm that fact.
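For illustration, such a wrapper can be as small as the following sketch. It is written as a shell function so it can be exercised safely; the SSH_REAL variable and the ssh_wrapper name are illustrative assumptions of this sketch, not Open MPI conventions:

```shell
# Sketch of an rsh/ssh agent wrapper: do something extra (here: log the
# target host to stderr), then hand the launch off to the real ssh.
# SSH_REAL parameterizes the real launcher so the sketch can be tested
# without contacting a remote node.
SSH_REAL="${SSH_REAL:-ssh}"

ssh_wrapper() {
    echo "wrapper: launching on $1" >&2
    "$SSH_REAL" "$@"
}
```

Installed as an executable script, it would be selected on the 1.4 series via `export OMPI_MCA_plm_rsh_agent=/path/to/wrapper`; running "ompi_info --param plm rsh" first confirms whether that parameter exists in a given build.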


If so, then the Wiki is not correct. Maybe someone can correct it? This 
would save some time for people like me...


Best wishes
Paul Kapinos




--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915




Re: [OMPI users] How to use a wrapper for ssh?

2011-07-13 Thread Jeff Squyres
Yes, I guess it looks like 
http://www.open-mpi.org/faq/?category=rsh#rsh-not-ssh is a little out of date.

Thanks for the heads-up...



On Jul 13, 2011, at 4:35 AM, Paul Kapinos wrote:

> Hi Ralph,
>>> 2. use MCA parameters described in
>>> http://www.open-mpi.org/faq/?category=rsh#rsh-not-ssh
>>> to bend the call to my wrapper, e.g.
>>> export OMPI_MCA_plm_rsh_agent=WrapPer
>>> export OMPI_MCA_orte_rsh_agent=WrapPer
>>> 
>>> The odd thing is that the OMPI_MCA_orte_rsh_agent envvar seems to have
>>> no effect, whereas OMPI_MCA_plm_rsh_agent works.
>>> Why do I believe so?
>> orte_rsh_agent doesn't exist in the 1.4 series :-)
>> Only plm_rsh_agent is available in 1.4. "ompi_info --param orte all" and 
>> "ompi_info --param plm rsh" will confirm that fact.
> 
> If so, then the Wiki is not correct. Maybe someone can correct it? This would 
> save some time for people like me...
> 
> Best wishes
> Paul Kapinos
> 
> 
> 
> 
> -- 
> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23,  D 52074  Aachen (Germany)
> Tel: +49 241/80-24915
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] tcp communication problems with 1.4.3 and 1.4.4 rc2 on FreeBSD

2011-07-13 Thread Jeff Squyres
On Jul 12, 2011, at 1:37 PM, Steve Kargl wrote:

> (many lines removed)
> checking prefix for function in .type... @
> checking if .size is needed... yes
> checking if .align directive takes logarithmic value... no
> configure: error: No atomic primitives available for amd64-unknown-freebsd9.0

Hmm; this is quite odd.  This worked in v1.4, but didn't work in trunk?

There are a bunch of changes to our configure assembly tests between v1.4 and 
the trunk, but I don't see any that should affect AMD vs. Intel.  Weird. 

I wonder if this has to do with versions of config.* scripts.  What does 
config/config.guess report from the trunk tarball, and what does it report from 
the v1.4 tarball?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Mpirun only works when n< 3

2011-07-13 Thread Randolph Pullen
Got it.  Building a new Open MPI solved it.

I don't know whether the standard Ubuntu install was the problem or whether it 
just didn't like the slightly later kernel.
That seems reason enough to be suspicious of Ubuntu 10.10 Open MPI builds if 
you have anything unusual in your system.
Thanks.
--- On Tue, 12/7/11, Jeff Squyres  wrote:

From: Jeff Squyres 
Subject: Re: [OMPI users] Mpirun only works when n< 3
To: randolph_pul...@yahoo.com.au
Cc: "Open MPI Users" 
Received: Tuesday, 12 July, 2011, 10:29 PM

On Jul 11, 2011, at 11:31 AM, Randolph Pullen wrote:

> There are no firewalls by default.  I can ssh between both nodes without a 
> password so I assumed that all is good with the comms.

FWIW, ssh'ing is different than "comms" (by which I assume you mean opening 
arbitrary TCP sockets between the two servers).
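To actually test that, one rough probe (assuming `nc` is installed on both nodes; flag syntax varies between nc/netcat variants, and port 5555 here is an arbitrary choice) is:

```shell
# On node B, listen on an arbitrary high TCP port:
#   nc -l 5555
# Then from node A, check that the port is reachable within 2 seconds:
nc -z -w 2 B 5555 && echo "TCP ok" || echo "TCP blocked"
```

Open MPI opens connections on ephemeral ports in both directions, so the probe is worth repeating from B back to A as well.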

> I can also get both nodes to participate in the ring program at the same time.
> It's just that I am limited to only 2 processes if they are split between the
> nodes, i.e.:
> mpirun -H A,B ring              (works)
> mpirun -H A,A,A,A,A,A,A ring    (works)
> mpirun -H B,B,B,B ring          (works)
> mpirun -H A,B,A ring            (hangs)

It is odd that A,B works and A,B,A does not.

> I have discovered slightly more information:
> When I replace node 'B' from the new cluster with node 'C' from the old
> cluster, I get similar behavior but with an error message:
> mpirun -H A,A,A,A,A,A,A ring    (works from either node)
> mpirun -H C,C,C ring            (works from either node)
> mpirun -H A,C ring              (fails from either node:)
> Process 0 sending 10 to 1, tag 201 (3 processes in ring)
> [C:23465] *** An error occurred in MPI_Recv
> [C:23465] *** on communicator MPI_COMM_WORLD
> [C:23465] *** MPI_ERRORS_ARE_FATAL (your job will now abort)
> Process 0 sent to 1
> --
> Running this on either node A or C produces the same result.
> Node C runs Open MPI 1.4.1 and is an ordinary dual core on FC10, not an i5
> 2400 like the others.
> All the binaries are compiled on FC10 with gcc 4.3.2.


Are you sure that all the versions of Open MPI being used on all nodes are 
exactly the same?  I.e., are you finding/using Open MPI v1.4.1 on all nodes?
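A quick way to check that is to ask every node which mpirun it finds and which version it reports. This is a sketch assuming passwordless ssh; the SSH_CMD variable and check_nodes name are inventions of the sketch so it can be exercised locally:

```shell
# Print which mpirun each node resolves and the version it reports, so
# mismatched installs stand out. SSH_CMD defaults to ssh but can be
# overridden for testing.
SSH_CMD="${SSH_CMD:-ssh}"

check_nodes() {
    for n in "$@"; do
        echo "== $n =="
        $SSH_CMD "$n" 'command -v mpirun; mpirun --version 2>&1 | head -1'
    done
}

# e.g.: check_nodes A B C
```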

Are the nodes homogeneous in terms of software?  If they're heterogeneous in 
terms of hardware, you *might* need to have separate OMPI installations on each 
machine (vs., for example, a network-filesystem-based install shared to all 3) 
because the compiler's optimizer may produce code tailored for one of the 
machines, and it may therefore fail in unexpected ways on the other(s).  The 
same is true for your executable.

See this FAQ entry about heterogeneous setups:

    http://www.open-mpi.org/faq/?category=building#where-to-install

...hmm.  I could have sworn we had more on the FAQ about heterogeneity, but 
perhaps not.  The old LAM/MPI FAQ on heterogeneity is somewhat outdated, but 
most of its concepts are directly relevant to Open MPI as well:

    http://www.lam-mpi.org/faq/category11.php3

I should probably copy most of that LAM/MPI heterogeneous FAQ to the Open MPI 
FAQ, but it'll be waaay down on my priority list.  :-(  If anyone could help 
out here, I'd be happy to point them in the right direction to convert the 
LAM/MPI FAQ PHP to Open MPI FAQ PHP...  

To be clear: the PHP conversion will be pretty trivial; I stole heavily from 
the LAM/MPI FAQ PHP to create the Open MPI FAQ PHP -- but there are points 
where the LAM/MPI heterogeneity text needs to be updated; that'll take an hour 
or two to update all that content.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI users] Running your MPI application on a Computer Cluster in the Cloud - cloudnumbers.com

2011-07-13 Thread Markus Schmidberger
Dear MPI users and experts,

cloudnumbers.com provides researchers and companies with the resources
to perform high performance calculations in the cloud. As
cloudnumbers.com's community manager I may invite you to register and
test your MPI application on a computer cluster in the cloud for free:
http://my.cloudnumbers.com/register

Our aim is to change the way research collaboration is done today by
bringing together scientists and businesses from all over the world on a
single platform. cloudnumbers.com is a Berlin (Germany) based
international high-tech startup striving to enable everyone to benefit
from the High Performance Computing advantages of the cloud. We provide
easy access to applications running on any kind of computer hardware:
from single-core high-memory machines up to 1000-core computer clusters.

Our platform provides several advantages:

* Turn fixed costs into variable costs and pay only for the capacity you need.
Watch our latest video on saving costs with cloudnumbers.com:
http://www.youtube.com/watch?v=ln_BSVigUhg&feature=player_embedded

* Enter the cloud using an intuitive and user friendly platform. Watch
our latest cloudnumbers.com in a nutshell video:
http://www.youtube.com/watch?v=0ZNEpR_ElV0&feature=player_embedded

* Be freed from ongoing technological obsolescence and continuous
maintenance costs (e.g. linking to libraries or system dependencies).

* Accelerate your C, C++, Fortran, R, Python, ... calculations through
parallel processing and great computing capacity: more than 1000 cores
are available, and GPUs are coming soon.

* Share your results worldwide (coming soon).

* Get high speed access to public databases (please let us know, if your
favorite database is missing!).

* We have developed a security architecture that meets high requirements
of data security and privacy. Read our security white paper:
http://d1372nki7bx5yg.cloudfront.net/wp-content/uploads/2011/06/cloudnumberscom-security.whitepaper.pdf


This is only a selection of our top features. To get more information
check out our web-page (http://www.cloudnumbers.com/) or follow our blog
about cloud computing, HPC and HPC applications:
http://cloudnumbers.com/blog

Register and test for free now at cloudnumbers.com:
http://my.cloudnumbers.com/register

We look forward to getting your feedback and consumer insights. Take the
chance to have an impact on the development of a new cloud computing
platform.

Best
Markus


-- 
Dr. rer. nat. Markus Schmidberger 
Senior Community Manager 

Cloudnumbers.com GmbH
Chausseestraße 6
10119 Berlin 

www.cloudnumbers.com 
E-Mail: markus.schmidber...@cloudnumbers.com 


* 
Amtsgericht München, HRB 191138 
Geschäftsführer: Erik Muttersbach, Markus Fensterer, Moritz v. 
Petersdorff-Campen 



Re: [OMPI users] tcp communication problems with 1.4.3 and 1.4.4 rc2 on FreeBSD

2011-07-13 Thread Jeff Squyres
On Jul 12, 2011, at 3:26 PM, Steve Kargl wrote:

> % /usr/local/ompi/bin/mpiexec -machinefile mf --mca btl self,tcp \
>  --mca btl_base_verbose 30 ./z
> 
> with mf containing 
> 
> node11 slots=1   (node11 contains a single bge0=192.168.0.11)
> node16 slots=1   (node16 contains a single bge0=192.168.0.16)
> 
> or
> 
> node11 slots=2   (communication on memory bus)
> 
> However, if mf contains
> 
> node10 slots=1   (node10 contains bge0=10.208.xx and bge1=192.168.0.10)
> node16 slots=1   (node16 contains a single bge0=192.168.0.16)
> 
> I see the same problem where node10 cannot communicate with node16.

If you ever get the time to check into the code to see why this is happening, 
I'd be curious to hear what you find (per my explanation of the TCP BTL here: 
http://www.open-mpi.org/community/lists/users/2011/07/16872.php).

> Good News:
> 
> Adding 'btl_tcp_if_include=192.168.0.0/16' to my ~/.openmpi/mca-params.conf
> file seems to cure the communication problem.

Good.
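For reference, the same restriction can also be applied per-run on the command line. This sketch reuses the machinefile (mf) and test binary (./z) names from earlier in the thread:

```shell
# One-off equivalent of the mca-params.conf line: restrict the TCP BTL
# to interfaces on the private 192.168.0.0/16 subnet for this run only.
mpirun --mca btl_tcp_if_include 192.168.0.0/16 \
       --mca btl_base_verbose 30 -machinefile mf ./z
```

The verbose flag (already used earlier in the thread) shows which interfaces the TCP BTL actually selects, which makes it easy to confirm the restriction took effect.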

> Thanks for the help.  If I run into any other problems with trunk,
> I'll report those here.

Keep in mind the usual disclaimers with development trunks -- it's *usually* 
stable, but sometimes it does break.  

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Mixed Mellanox and Qlogic problems

2011-07-13 Thread David Warren
I finally got access to the systems again (the original ones are part of 
our real time system). I thought I would try one other test I had set up 
first.  I went to OFED 1.6 and it started running with no errors. It 
must have been an OFED bug. Now I just have the speed problem. Anyone 
have a way to make the mixture of mlx4 and qlogic work together without 
slowing down?


On 07/07/11 17:19, Jeff Squyres wrote:

Huh; wonky.

Can you set the MCA parameter "mpi_abort_delay" to -1 and run your job again? 
This will prevent all the processes from dying when MPI_ABORT is invoked.  Then attach a 
debugger to one of the still-live processes after the error message is printed.  Can you 
send the stack trace?  It would be interesting to know what is going on here -- I can't 
think of a reason that would happen offhand.
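Concretely, that debugging recipe might look like the following sketch (gdb assumed; the application name, process count, and PID are illustrative):

```shell
# Keep aborting processes alive so they can be inspected after MPI_ABORT:
mpirun --mca mpi_abort_delay -1 -np 24 -machinefile dwhosts ./my_app

# After the error message prints, on a node with a surviving rank:
#   gdb -p <pid-of-stuck-process>
#   (gdb) bt        # capture the stack trace to post to the list
```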


On Jun 30, 2011, at 5:03 PM, David Warren wrote:

I have a cluster with mostly Mellanox ConnectX hardware and a few with Qlogic 
QLE7340's. After looking through the web, FAQs etc. I built openmpi-1.5.3 with 
psm and openib. If I run within the same hardware it is fast and works fine. 
If I run between them without specifying an MTL (e.g. mpirun -np 24 
-machinefile dwhosts --byslot --bind-to-core --mca btl ^tcp ...) it dies with

*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[n16:9438] Abort before MPI_INIT completed successfully; not able to
guarantee that all other processes were killed!
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.

...
I can make it run by giving a bad mtl, e.g. -mca mtl psm,none. All the 
processes run after complaining that mtl none does not exist. However, they run 
just as slowly (about 10% slower than either set alone).
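One workaround sometimes suggested for mixed PSM/verbs fabrics (a sketch, not something confirmed in this thread) is to force the ob1 PML so both node types talk over the openib BTL instead of the PSM MTL:

```shell
# Force the verbs path everywhere. QLogic nodes then bypass their native
# PSM layer, so some QLogic-side performance loss is expected -- but all
# nodes at least negotiate the same transport.
mpirun --mca pml ob1 --mca btl openib,sm,self \
       -np 24 -machinefile dwhosts ./my_app
```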

Pertinent info:
On the Qlogic Nodes:
OFED: QLogic-OFED.SLES11-x86_64.1.5.3.0.22
On the Mellanox Nodes:
OFED-1.5.2.1-20101105-0600

All:
debian lenny kernel 2.6.32.41
OpenSM
limit | grep memorylocked gives unlimited on all nodes.

Configure line:
./configure --with-libnuma --with-openib --prefix=/usr/local/openmpi-1.5.3 
--with-psm=/usr --enable-btl-openib-failover --enable-openib-connectx-xrc 
--enable-openib-rdmacm

I thought that with 1.5.3 I am supposed to be able to do this. Am I just wrong? 
Does anyone see what I am doing wrong?

Thanks