Re: [OMPI users] Windows: MPI_Allreduce() crashes when using MPI_DOUBLE_PRECISION
Hi Jeff,

> Can you send the info listed on the help page?

From the HELP page...

***For run-time problems:
1) Check the FAQ first. Really. This can save you a lot of time; many common problems and solutions are listed there.
   I couldn't find a relevant reference in the FAQ.
2) The version of Open MPI that you're using.
   I am using the pre-built openmpi-1.5.3 64-bit and 32-bit binaries on Windows 7. I also tried a locally built openmpi-1.5.2 using the Visual Studio 2008 32-bit compilers. I tried various compilers: VS-9 32-bit and VS-10 64-bit and the corresponding Intel ifort compiler.
3) The config.log file from the top-level Open MPI directory, if available (please compress!).
   Don't have one.
4) The output of the "ompi_info --all" command from the node where you're invoking mpirun.
   See the output of pre-built openmpi-1.5.3_x64/bin/ompi_info --all in the attachments.
5) If running on more than one node --
   I am running the test program on a single node.
6) A detailed description of what is failing.
   Already described in this post.
7) Please include information about your network:
   As I am running the test program on a local, single machine, this should not be required.

> You forgot ierr in the call to MPI_Finalize. You also paired
> DOUBLE_PRECISION data with MPI_INTEGER in the call to allreduce. And you
> mixed sndbuf and rcvbuf in the call to allreduce, meaning that when you
> print rcvbuf afterwards, it'll always still be 0.

As I am not a Fortran programmer, these were my mistakes!

> program Test_MPI
>     use mpi
>     implicit none
>
>     DOUBLE PRECISION rcvbuf(5), sndbuf(5)
>     INTEGER nproc, rank, ierr, n, i, ret
>
>     n = 5
>     do i = 1, n
>         sndbuf(i) = 2.0
>         rcvbuf(i) = 0.0
>     end do
>
>     call MPI_INIT(ierr)
>     call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
>     call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
>     write(*,*) "size=", nproc, ", rank=", rank
>     write(*,*) "start --, rcvbuf=", rcvbuf
>     CALL MPI_ALLREDUCE(sndbuf, rcvbuf, n,
>      &   MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
>     write(*,*) "end --, rcvbuf=", rcvbuf
>
>     CALL MPI_Finalize(ierr)
> end
>
> (you could use "include 'mpif.h'", too -- I tried both)
>
> This program works fine for me.

I am observing the same crash as described in this thread (when executing "mpirun -np 2 mar_f_dp.exe"), even with the above correct and simple test program. I commented out 'use mpi' because it gave me an "Error in compiled module file" error, so I used the 'include "mpif.h"' statement instead (see attachment). It seems to be a Windows-specific issue (I can run this test program on Linux with openmpi-1.5.1). Can anybody try this test program on Windows?

Thank you in advance.
-Hiral

Package: Open MPI hpcfan@VISCLUSTER25 Distribution
Open MPI: 1.5.3
Open MPI SVN revision: r24532
Open MPI release date: Mar 16, 2011
Open RTE: 1.5.3
Open RTE SVN revision: r24532
Open RTE release date: Mar 16, 2011
OPAL: 1.5.3
OPAL SVN revision: r24532
OPAL release date: Mar 16, 2011
Ident string: 1.5.3
MCA backtrace: none (MCA v2.0, API v2.0, Component v1.5.3)
MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.5.3)
MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.5.3)
MCA timer: windows (MCA v2.0, API v2.0, Component v1.5.3)
MCA installdirs: env (MCA v2.0, API v2.0, Component v1.5.3)
MCA installdirs: windows (MCA v2.0, API v2.0, Component v1.5.3)
MCA installdirs: config (MCA v2.0, API v2.0, Component v1.5.3)
MCA dpm: orte (MCA v2.0, API v2.0, Component v1.5.3)
MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.5.3)
MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.5.3)
MCA allocator: basic (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: self (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.5.3)
MCA mpool: sm (MCA v2.0, API v2.0, Component v1.5.3)
MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.5.3)
MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.5.3)
MCA bml: r2 (MCA v2.0, API v2.0, Component v1.5.3)
MCA btl: tcp (MCA v2.0, API v2.0, Component v1.5.3)
MCA btl: sm (MCA v2.0, API v2.0, Component v1.5.3)
MCA btl: self (MCA v2.0, API v2.0, Component v1.5.3)
MCA topo: unity (MCA v2.0, API v2.0, Component v1.5.3)
MCA osc: rdma (MCA v2.0, API v2.0, Comp
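For readers more at home in C, here is a minimal C equivalent of the corrected test above (my own sketch, not code from the original poster; the buffer size and values mirror the Fortran program):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        double sndbuf[5], rcvbuf[5];
        int nproc, rank, i, n = 5;

        for (i = 0; i < n; ++i) {
            sndbuf[i] = 2.0;   /* every rank contributes 2.0 per element */
            rcvbuf[i] = 0.0;
        }

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* MPI_DOUBLE matches the double buffers here, analogous to
           MPI_DOUBLE_PRECISION matching DOUBLE PRECISION in Fortran. */
        MPI_Allreduce(sndbuf, rcvbuf, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d of %d: rcvbuf[0] = %f\n", rank, nproc, rcvbuf[0]);

        MPI_Finalize();
        return 0;
    }

If the Fortran version crashes under the Windows build while a C version like this runs cleanly, that would point toward the Fortran bindings rather than the underlying MPI_Allreduce implementation.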
[OMPI users] Sorry! You were supposed to get help about: But couldn't open help-orterun.txt
Hi,

I am observing the following message on the Windows platform...

c:\Users\oza\Desktop\test>mpirun
--
orterun:executable-not-specified
But I couldn't open the help file:
    C:\Users\hpcfan\Documents\OpenMPI\openmpi-1.5.3\installed-64\share\openmpi\help-orterun.txt: No such file or directory.  Sorry!
--

I copied the pre-built, installed "OpenMPI_v1.5.3-x64" directory from one Windows machine to another Windows machine. As discussed in some mailing-list threads, I also tried to set OPAL_PKGDATA and other OPAL_* environment variables, but the above message still persists.

Please suggest.

Thank you in advance.
-Hiral
Re: [OMPI users] Sorry! You were supposed to get help about: But couldn't open help-orterun.txt
It's my mistake; it should be the OPAL_PKGDATADIR env var instead of OPAL_DATADIR. With this it is working fine.

Thank you.
-Hiral
Re: [OMPI users] Sorry! You were supposed to get help about: But couldn't open help-orterun.txt
After setting OPAL_PKGDATADIR, "mpirun" gives the proper help message.

But when executing a simple test program which calls MPI_ALLREDUCE(), it gives the following errors on the console...

c:\ompi_tests\win64>mpirun mar_f_i_op.exe
[nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file ..\..\..\openmpi-1.5.3\orte\mca\ras\base\ras_base_allocate.c at line 147
[nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file ..\..\..\openmpi-1.5.3\orte\mca\plm\base\plm_base_launch_support.c at line 99
[nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file ..\..\..\openmpi-1.5.3\orte\mca\plm\ccp\plm_ccp_module.c at line 186

Any idea on these errors?

Clarification: I installed the pre-built OpenMPI_v1.5.3-x64 on Windows 7 and copied this directory onto a Windows Server 2008 machine.

Thank you in advance.
-Hiral
Re: [OMPI users] Sorry! You were supposed to get help about: But couldn't open help-orterun.txt
I don't know a lot about the Windows port, but that error means that mpirun got an error when trying to discover the allocated nodes.

On May 11, 2011, at 6:10 AM, hi wrote:

> After setting OPAL_PKGDATADIR, "mpirun" gives proper help message.
>
> But when executing simple test program which calls MPI_ALLREDUCE() it
> gives following errors onto the console...
>
> c:\ompi_tests\win64>mpirun mar_f_i_op.exe
> [nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file
> ..\..\..\openmpi-1.5.3\orte\mca\ras\base\ras_base_allocate.c at line 147
> [nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file
> ..\..\..\openmpi-1.5.3\orte\mca\plm\base\plm_base_launch_support.c at line 99
> [nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file
> ..\..\..\openmpi-1.5.3\orte\mca\plm\ccp\plm_ccp_module.c at line 186
>
> Any idea on these errors???
>
> Clarification: I installed pre-built OpenMPI_v1.5.3-x64 on Windows 7
> and copied this directory into Windows Server 2008.
>
> Thank you in advance.
> -Hiral
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] is there an equiv of iprove for bcast?
I'm not so much worried about the "load" that N pending ibcasts would cause; the "load" will be zero until the broadcast actually fires. But I'm concerned about the pending resource usage (i.e., how many internal network and collective resources will be slurped up into hundreds or thousands of pending broadcasts).

You might want to have a tiered system, instead. Have a tree-based communication pattern where each worker has a "parent" who does the actual broadcasting; each broadcaster can have tens of children (for example). You could even have an N-level tree, perhaps grouping your children by server rack and/or network topology. That way, you can have a small number of processes at the top of the tree that do an actual broadcast. The rest can use a (relatively) small number of non-blocking sends and receives.

Or, when non-blocking collectives become available, you can have everyone in pending ibcasts with the small number of broadcasters (i.e., N broadcasters for M processes, where N << M), which wouldn't be nearly as resource-heavy as M pending ibcasts.

Or something like that... just throwing some ideas out there for you...

On May 10, 2011, at 7:14 PM, Randolph Pullen wrote:

> Thanks,
>
> The messages are small and frequent (they flash metadata across the cluster).
> The current approach works fine for small to medium clusters but I want it
> to be able to go big. Maybe up to several hundred or even a thousand nodes.
>
> It's these larger deployments that concern me. The current scheme may see the
> clearinghouse become overloaded in a very large cluster.
>
> From what you have said, a possible strategy may be to combine the listener
> and worker into a single process, using the non-blocking bcast just for that
> group, while each worker scanned its own port for an incoming request, which
> it would in turn bcast to its peers.
>
> As you have indicated though, this would depend on the load the non-blocking
> bcast would cause. - At least the load would be fairly even over the cluster.
>
>
> --- On Mon, 9/5/11, Jeff Squyres wrote:
>
> From: Jeff Squyres
> Subject: Re: [OMPI users] is there an equiv of iprove for bcast?
> To: randolph_pul...@yahoo.com.au
> Cc: "Open MPI Users"
> Received: Monday, 9 May, 2011, 11:27 PM
>
> On May 3, 2011, at 8:20 PM, Randolph Pullen wrote:
>
> > Sorry, I meant to say:
> > - on each node there is 1 listener and 1 worker.
> > - all workers act together when any of the listeners send them a request.
> > - currently I must use an extra clearinghouse process to receive from any
> > of the listeners and bcast to workers; this is unfortunate because of the
> > potential scaling issues.
> >
> > I think you have answered this in that I must wait for MPI-3's non-blocking
> > collectives.
>
> Yes and no. If each worker starts N non-blocking broadcasts just to be able
> to test for completion of any of them, you might end up consuming a bunch of
> resources for them (I'm *anticipating* that pending non-blocking collective
> requests may be more heavyweight than pending non-blocking point-to-point
> requests).
>
> But then again, if N is small, it might not matter.
>
> > Can anyone suggest another way? I don't like the serial clearinghouse
> > approach.
>
> If you only have a few workers and/or the broadcast message is small and/or
> the broadcasts aren't frequent, then MPI's built-in broadcast algorithms
> might not offer much more optimization than doing your own with
> point-to-point mechanisms. I don't usually recommend this, but it may be
> possible for your case.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
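To make the tiered idea concrete, here is a rough C sketch (my own illustration, not code from this thread) that splits MPI_COMM_WORLD into a small communicator of "parent" broadcasters plus one sub-communicator per parent. GROUP_SIZE and the use of blocking MPI_Bcast are assumptions for illustration; the thread itself is about non-blocking collectives, which are not yet available in these releases.

    #include <stdio.h>
    #include <mpi.h>

    #define GROUP_SIZE 32   /* hypothetical number of ranks per broadcaster */

    int main(int argc, char **argv)
    {
        int world_rank, world_size;
        MPI_Comm group_comm;    /* one broadcaster plus its children */
        MPI_Comm parents_comm;  /* only the broadcasters             */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* Ranks 0, GROUP_SIZE, 2*GROUP_SIZE, ... act as "parents". */
        int group_id  = world_rank / GROUP_SIZE;
        int is_parent = (world_rank % GROUP_SIZE == 0);

        MPI_Comm_split(MPI_COMM_WORLD, group_id, world_rank, &group_comm);
        MPI_Comm_split(MPI_COMM_WORLD, is_parent ? 0 : MPI_UNDEFINED,
                       world_rank, &parents_comm);

        int payload = 0;
        if (world_rank == 0)
            payload = 42;                   /* the metadata to distribute */

        /* Stage 1: broadcast among the (few) parents only. */
        if (is_parent)
            MPI_Bcast(&payload, 1, MPI_INT, 0, parents_comm);

        /* Stage 2: each parent re-broadcasts to its own children;
           the parent is rank 0 within its group_comm. */
        MPI_Bcast(&payload, 1, MPI_INT, 0, group_comm);

        printf("rank %d got %d\n", world_rank, payload);

        if (parents_comm != MPI_COMM_NULL)
            MPI_Comm_free(&parents_comm);
        MPI_Comm_free(&group_comm);
        MPI_Finalize();
        return 0;
    }

Each process then participates in at most two small collectives (one among N parents, one within its own group), which keeps the number of pending requests per process small compared to M simultaneous broadcasts across the whole job.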
Re: [OMPI users] Sorry! You were supposed to get help about: But couldn't open help-orterun.txt
On May 11, 2011, at 5:50 AM, Ralph Castain wrote:

>> Clarification: I installed pre-built OpenMPI_v1.5.3-x64 on Windows 7
>> and copied this directory into Windows Server 2008.

Did you copy OMPI to the same directory tree in which you built it? OMPI hard-codes some directory names when it builds, and it expects to find that directory structure when it runs. If you build OMPI with a --prefix of /foo, but then move it to /bar, various things may not work (like finding help messages, etc.) unless you set the OMPI/OPAL environment variables that tell OMPI where the files are actually located.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
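For a relocated Windows install like the one described in this thread, that might look roughly like the following before invoking mpirun (a sketch only: C:\OpenMPI_v1.5.3-x64 is a hypothetical location for the copied tree, and pointing OPAL_PREFIX at the new root is usually the simplest way to redirect the hard-coded paths):

    C:\> set OPAL_PREFIX=C:\OpenMPI_v1.5.3-x64
    C:\> set OPAL_PKGDATADIR=C:\OpenMPI_v1.5.3-x64\share\openmpi
    C:\> mpirun -np 2 mar_f_dp.exe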
[OMPI users] error with checkpoint in openmpi
Hi,

I am working with MPI. I have installed openmpi 1.4.3 with BLCR included. I ran a simple MPI application using a hostfile:

pc1 slots=2 max-slots=2
pc2 slots=2 max-slots=2

And I ran this command to run it with checkpoint support:

# mpirun --hostfile myhost -np 4 --am ft-enable-cr ./mpi_app

When I checkpointed, I got an error:

[pc1:04836] Error: expected_component: PID information unavailable!
--
Error: The local checkpoint contains invalid or incomplete metadata for Process 3411083265.2. This usually indicates that the local checkpoint is invalid. Check the metadata file (snapshot_meta.data) in the following directory:
/root/ompi_global_snapshot_4836.ckpt/0/opal_snapshot_2.ckpt
--
[pc1:04836] [[52049,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054

I'm glad if anyone can help me.
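For context on how the checkpoint itself is usually taken in a BLCR-enabled setup like this, a rough sketch of the commands (4836 is the mpirun PID visible in the error output above; the exact global snapshot name on your system may differ):

    # from another terminal, checkpoint the running mpirun
    ompi-checkpoint 4836

    # later, restart from the resulting global snapshot
    ompi-restart ompi_global_snapshot_4836.ckpt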
[OMPI users] TotalView Memory debugging and OpenMPI
We've gotten a few reports of problems with memory debugging when using OpenMPI under TotalView. Usually, TotalView will attach to the processes started after an MPI_Init. However, in the case where memory debugging is enabled, things seemed to run away or fail. My analysis showed that we had a number of core files left over from the attempt, and all were mpirun (or orterun) cores. It seemed to be a regression on our part, since testing seemed to indicate this worked okay before TotalView 8.9.0-0, so I filed an internal bug and passed it to engineering. After giving our engineer a brief tutorial on how to build a debug version of OpenMPI, he found what appears to be a problem in the code for orterun.c. He's made a slight change that fixes the issue in 1.4.2, 1.4.3, 1.4.4rc2 and 1.5.3, those being the versions he's tested with so far. He doesn't subscribe to this list that I know of, so I offered to pass this by the group. Of course, I'm not sure if this is exactly the right place to submit patches, but I'm sure you'd tell me where to put it if I'm in the wrong here. It's a short patch, so I'll cut and paste it, and attach as well, since cut and paste can do weird things to formatting.

Credit goes to Ariel Burton for this patch. Of course he used TotalView to find this ;-) It shows up if you do 'mpirun -tv -np 4 ./foo' or 'totalview mpirun -a -np 4 ./foo'

Cheers,
PeterT

more ~/patches/anbs-patch
*** orte/tools/orterun/orterun.c    2010-04-13 13:30:34.0 -0400
--- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c    2011-05-09 20:28:16.588183000 -0400
***
*** 1578,1588
      }

      if (NULL != env) {
          size1 = opal_argv_count(env);
          for (j = 0; j < size1; ++j) {
!             putenv(env[j]);
          }
      }

      /* All done */

--- 1578,1600
      }

      if (NULL != env) {
          size1 = opal_argv_count(env);
          for (j = 0; j < size1; ++j) {
!             /* Use-after-Free error possible here.  putenv does not copy
!                the string passed to it, and instead stores only the pointer.
!                env[j] may be freed later, in which case the pointer
!                in environ will now be left dangling into a deallocated
!                region.
!                So we make a copy of the variable.
!              */
!             char *s = strdup(env[j]);
!
!             if (NULL == s) {
!                 return OPAL_ERR_OUT_OF_RESOURCE;
!             }
!             putenv(s);
          }
      }

      /* All done */
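As a standalone illustration of the bug class the patch addresses (my own minimal sketch, not part of the patch; the helper name and the GREETING variable are invented for the example): putenv() stores the caller's pointer rather than a copy, so the orterun fix duplicates the string before handing it over.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void set_env_copy(const char *assignment)
    {
        /* putenv() keeps the pointer it is given, not a copy, so the
           caller's buffer must stay valid for the lifetime of the
           environment entry.  Duplicating first (as the orterun patch
           does) avoids a dangling pointer if the original buffer is
           later freed or reused. */
        char *s = strdup(assignment);
        if (NULL == s) {
            fprintf(stderr, "out of memory\n");
            exit(1);
        }
        putenv(s);   /* s is intentionally never freed */
    }

    int main(void)
    {
        /* A local buffer like this must not be passed to putenv()
           directly, because putenv() would keep this exact pointer. */
        char assignment[] = "GREETING=hello";

        set_env_copy(assignment);
        printf("GREETING=%s\n", getenv("GREETING"));
        return 0;
    }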
Re: [OMPI users] btl_openib_cpc_include rdmacm questions
On May 9, 2011, at 9:31 AM, Jeff Squyres wrote:

> On May 3, 2011, at 6:42 AM, Dave Love wrote:
>
>>> We managed to have another user hit the bug that causes collectives (this
>>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>>>
>>> btl_openib_cpc_include rdmacm
>>
>> Could someone explain this? We also have problems with collective hangs
>> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
>> see any relevant issues filed. However, rdmacm isn't an available value
>> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
>> that I understand what these things are...).
>
> Sorry for the delay -- perhaps an IB vendor can reply here with more detail...
>
> We had a user-reported issue of some hangs that the IB vendors have been
> unable to replicate in their respective labs. We *suspect* that it may be an
> issue with the oob openib CPC, but that code is pretty old and pretty mature,
> so all of us would be at least somewhat surprised if that were the case. If
> anyone can reliably reproduce this error, please let us know and/or give us
> access to your machines -- we have not closed this issue, but are unable to
> move forward because the customers who reported this issue switched to rdmacm
> and moved on (i.e., we don't have access to their machines to test any more).

An update: we set all our ib0 interfaces to have IPs on a 172. network. This allowed the use of rdmacm to work and get the latencies that we would expect. That said, we are still getting hangs. I can very reliably reproduce it using IMB with a specific core count on a specific test case.

Just an update. Has anyone else had luck fixing the lockup issues on the openib BTL for collectives in some cases? Thanks!

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
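For anyone who wants to try the rdmacm workaround being discussed here, the connection manager can be selected at run time; a sketch of the two usual places to set it (the executable name and process count are just placeholders):

    # on the mpirun command line
    mpirun --mca btl_openib_cpc_include rdmacm -np 64 ./IMB-MPI1

    # or persistently, in $HOME/.openmpi/mca-params.conf
    btl_openib_cpc_include = rdmacm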
Re: [OMPI users] btl_openib_cpc_include rdmacm questions
Sent from my iPad

On May 11, 2011, at 2:05 PM, Brock Palen wrote:

> On May 9, 2011, at 9:31 AM, Jeff Squyres wrote:
>
>> On May 3, 2011, at 6:42 AM, Dave Love wrote:
>>
>>>> We managed to have another user hit the bug that causes collectives (this
>>>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>>>>
>>>> btl_openib_cpc_include rdmacm
>>>
>>> Could someone explain this? We also have problems with collective hangs
>>> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
>>> see any relevant issues filed. However, rdmacm isn't an available value
>>> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
>>> that I understand what these things are...).
>>
>> Sorry for the delay -- perhaps an IB vendor can reply here with more
>> detail...
>>
>> We had a user-reported issue of some hangs that the IB vendors have been
>> unable to replicate in their respective labs. We *suspect* that it may be
>> an issue with the oob openib CPC, but that code is pretty old and pretty
>> mature, so all of us would be at least somewhat surprised if that were the
>> case. If anyone can reliably reproduce this error, please let us know
>> and/or give us access to your machines -- we have not closed this issue,
>> but are unable to move forward because the customers who reported this
>> issue switched to rdmacm and moved on (i.e., we don't have access to their
>> machines to test any more).
>
> An update: we set all our ib0 interfaces to have IPs on a 172. network. This
> allowed the use of rdmacm to work and get latencies that we would expect.
> That said we are still getting hangs. I can very reliably reproduce it using
> IMB with a specific core count on a specific test case.
>
> Just an update. Has anyone else had luck fixing the lockup issues on openib
> BTL for collectives in some cases? Thanks!

I'll go back to my earlier comments. Users always claim that their code doesn't have the sync issue, but it has proved to help more often than not, and costs nothing to try. My $.0002

> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] TotalView Memory debugging and OpenMPI
That would be a problem, I fear. We need to push those envars into the environment. Is there some particular problem causing what you see? We have no other reports of this issue, and orterun has had that code forever.

Sent from my iPad

On May 11, 2011, at 2:05 PM, Peter Thompson wrote:

> We've gotten a few reports of problems with memory debugging when using
> OpenMPI under TotalView. Usually, TotalView will attach to the processes
> started after an MPI_Init. However, in the case where memory debugging is
> enabled, things seemed to run away or fail. My analysis showed that we had
> a number of core files left over from the attempt, and all were mpirun (or
> orterun) cores. It seemed to be a regression on our part, since testing
> seemed to indicate this worked okay before TotalView 8.9.0-0, so I filed an
> internal bug and passed it to engineering. After giving our engineer a
> brief tutorial on how to build a debug version of OpenMPI, he found what
> appears to be a problem in the code for orterun.c. He's made a slight
> change that fixes the issue in 1.4.2, 1.4.3, 1.4.4rc2 and 1.5.3, those being
> the versions he's tested with so far. He doesn't subscribe to this list
> that I know of, so I offered to pass this by the group. Of course, I'm not
> sure if this is exactly the right place to submit patches, but I'm sure you'd
> tell me where to put it if I'm in the wrong here. It's a short patch, so
> I'll cut and paste it, and attach as well, since cut and paste can do weird
> things to formatting.
>
> Credit goes to Ariel Burton for this patch. Of course he used TotalView to
> find this ;-) It shows up if you do 'mpirun -tv -np 4 ./foo' or 'totalview
> mpirun -a -np 4 ./foo'
>
> Cheers,
> PeterT
>
> more ~/patches/anbs-patch
> *** orte/tools/orterun/orterun.c    2010-04-13 13:30:34.0 -0400
> --- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c    2011-05-09 20:28:16.588183000 -0400
> ***
> *** 1578,1588
>       }
>
>       if (NULL != env) {
>           size1 = opal_argv_count(env);
>           for (j = 0; j < size1; ++j) {
> !             putenv(env[j]);
>           }
>       }
>
>       /* All done */
>
> --- 1578,1600
>       }
>
>       if (NULL != env) {
>           size1 = opal_argv_count(env);
>           for (j = 0; j < size1; ++j) {
> !             /* Use-after-Free error possible here.  putenv does not copy
> !                the string passed to it, and instead stores only the pointer.
> !                env[j] may be freed later, in which case the pointer
> !                in environ will now be left dangling into a deallocated
> !                region.
> !                So we make a copy of the variable.
> !              */
> !             char *s = strdup(env[j]);
> !
> !             if (NULL == s) {
> !                 return OPAL_ERR_OUT_OF_RESOURCE;
> !             }
> !             putenv(s);
>           }
>       }
>
>       /* All done */
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0)
I've seen the same thing when I build openmpi 1.4.3 with Intel 12, but only when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1 then the collectives hangs go away. I don't know what, if anything, the higher optimization buys you when compiling openmpi, so I'm not sure if that's an acceptable workaround or not.

My system is similar to yours - Intel X5570 with QDR Mellanox IB running RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm using IMB 3.2.2 with a single iteration of Barrier to reproduce the hang, and it happens 100% of the time for me when I invoke it like this:

# salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier

The hang happens on the first Barrier (64 ranks), and each of the participating ranks has this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_recursivedoubling () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
IMB_barrier ()
IMB_init_buffers_iter ()
main ()

The one non-participating rank has this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
main ()

If I use more nodes I can get it to hang with 1ppn, so that seems to rule out the sm btl (or interactions with it) as a culprit at least. I can't reproduce this with openmpi 1.5.3, interestingly.

-Marcus

On 05/10/2011 03:37 AM, Salvatore Podda wrote:
> Dear all,
>
> we succeeded in building several versions of openmpi from 1.2.8 to 1.4.3
> with Intel composer XE 2011 (aka 12.0).
> However we found a threshold in the number of cores (depending on the
> application: IMB, xhpl or user applications, and on the number of required
> cores) above which the application hangs (a sort of deadlock).
> Building openmpi with 'gcc' and 'pgi' does not show the same limits.
> Are there any known incompatibilities of openmpi with this version of
> the Intel compilers?
>
> The characteristics of our computational infrastructure are:
>
> Intel processors E7330, E5345, E5530 e E5620
>
> CentOS 5.3, CentOS 5.5.
>
> Intel composer XE 2011
> gcc 4.1.2
> pgi 10.2-1
>
> Regards
>
> Salvatore Podda
>
> ENEA UTICT-HPC
> Department for Computer Science Development and ICT
> Facilities Laboratory for Science and High Performace Computing
> C.R. Frascati
> Via E. Fermi, 45
> PoBox 65
> 00044 Frascati (Rome)
> Italy
>
> Tel: +39 06 9400 5342
> Fax: +39 06 9400 5551
> Fax: +39 06 9400 5735
> E-mail: salvatore.po...@enea.it
> Home Page: www.cresco.enea.it
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
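For anyone who wants to try the lower-optimization workaround Marcus describes, the flags are passed at configure time; a rough sketch only (the install prefix and the choice of Intel compiler wrappers are assumptions about your environment):

    ./configure CC=icc CXX=icpc F77=ifort FC=ifort \
                CFLAGS=-O1 CXXFLAGS=-O1 \
                --prefix=/opt/openmpi-1.4.3-intel12
    make all install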
Re: [OMPI users] Issue with Open MPI 1.5.3 Windows binary builds
Answer to my own question. It was simply a noob problem - not using mpiexec to run my 'application'. Once I did this, everything is running as expected. My bad for not reading more before jumping in. Later, Tyler
Re: [OMPI users] btl_openib_cpc_include rdmacm questions
Jeff Squyres writes:

> We had a user-reported issue of some hangs that the IB vendors have
> been unable to replicate in their respective labs. We *suspect* that
> it may be an issue with the oob openib CPC, but that code is pretty
> old and pretty mature, so all of us would be at least somewhat
> surprised if that were the case. If anyone can reliably reproduce
> this error, please let us know and/or give us access to your machines

We can reproduce it with IMB. We could provide access, but we'd have to negotiate with the owners of the relevant nodes to give you interactive access to them. Maybe Brock's would be more accessible? (If you contact me, I may not be able to respond for a few days.)

> -- we have not closed this issue,

Which issue? I couldn't find a relevant-looking one.

> but are unable to move forward
> because the customers who reported this issue switched to rdmacm and
> moved on (i.e., we don't have access to their machines to test any
> more).

For what it's worth, I figured out why I couldn't see rdmacm, but adding ipoib would be a bit of a pain.

--
Excuse the typping -- I have a broken wrist
Re: [OMPI users] btl_openib_cpc_include rdmacm questions
Ralph Castain writes:

> I'll go back to my earlier comments. Users always claim that their
> code doesn't have the sync issue, but it has proved to help more often
> than not, and costs nothing to try,

Could you point to that post, or tell us what to try exactly, given we're running IMB? Thanks. (As far as I know, this isn't happening with real codes, just IMB, but only a few have been in use.)

--
Excuse the typping -- I have a broken wrist