Re: [OMPI users] Mac OS X 10.4.5 and XGrid, Open-MPI V1.0.1

2006-03-20 Thread Frank

Hi Brian,

this is the full -d output I got when mpi-running vhone on the 
xgrid. The truncation is due to the reported "hang".


[powerbook:/usr/local/MVH-1] admin% mpirun -d -np 4 ./vhone
[powerbook:03138] procdir: (null)
[powerbook:03138] jobdir: (null)
[powerbook:03138] unidir: /tmp/openmpi-sessions-admin@powerbook_0/default-universe
[powerbook:03138] top: openmpi-sessions-admin@powerbook_0
[powerbook:03138] tmp: /tmp
[powerbook:03138] connect_uni: contact info read
[powerbook:03138] connect_uni: connection not allowed
[powerbook:03138] [0,0,0] setting up session dir with
[powerbook:03138]   tmpdir /tmp
[powerbook:03138]   universe default-universe-3138
[powerbook:03138]   user admin
[powerbook:03138]   host powerbook
[powerbook:03138]   jobid 0
[powerbook:03138]   procid 0
[powerbook:03138] procdir: /tmp/openmpi-sessions-admin@powerbook_0/default-universe-3138/0/0
[powerbook:03138] jobdir: /tmp/openmpi-sessions-admin@powerbook_0/default-universe-3138/0
[powerbook:03138] unidir: /tmp/openmpi-sessions-admin@powerbook_0/default-universe-3138
[powerbook:03138] top: openmpi-sessions-admin@powerbook_0
[powerbook:03138] tmp: /tmp
[powerbook:03138] [0,0,0] contact_file /tmp/openmpi-sessions-admin@powerbook_0/default-universe-3138/universe-setup.txt
[powerbook:03138] [0,0,0] wrote setup file
[powerbook:03138] spawn: in job_state_callback(jobid = 1, state = 0x1)
[ibi:00717] [0,1,2] setting up session dir with
[ibi:00717]   universe default-universe
[ibi:00717]   user nobody
[ibi:00717]   host xgrid-node-2
[ibi:00717]   jobid 1
[ibi:00717]   procid 2
[ibi:00717] procdir: /tmp/openmpi-sessions-nobody@xgrid-node-2_0/default-universe/1/2
[ibi:00717] jobdir: /tmp/openmpi-sessions-nobody@xgrid-node-2_0/default-universe/1
[ibi:00717] unidir: /tmp/openmpi-sessions-nobody@xgrid-node-2_0/default-universe
[ibi:00717] top: openmpi-sessions-nobody@xgrid-node-2_0
[ibi:00717] tmp: /tmp
[powerbook:03147] [0,1,0] setting up session dir with
[powerbook:03147]   universe default-universe
[powerbook:03147]   user nobody
[powerbook:03147]   host xgrid-node-0
[powerbook:03147]   jobid 1
[powerbook:03147]   procid 0
[powerbook:03147] procdir: /tmp/openmpi-sessions-nobody@xgrid-node-0_0/default-universe/1/0
[powerbook:03147] jobdir: /tmp/openmpi-sessions-nobody@xgrid-node-0_0/default-universe/1
[powerbook:03147] unidir: /tmp/openmpi-sessions-nobody@xgrid-node-0_0/default-universe
[powerbook:03147] top: openmpi-sessions-nobody@xgrid-node-0_0
[powerbook:03147] tmp: /tmp
^Z
Suspended
[powerbook:/usr/local/MVH-1] admin%

I waited quite a while before canceling the jobs, so this is not due 
to low priority of the jobs submitted to the xgrid (xgrid is told to 
always accept jobs and run them). Comparing this with the output I 
get from a non-xgrid mpirun (jobs submitted via ssh), the next lines 
of -d output I was waiting for are another spawn message and then the 
message that ompi_mpi_init has completed. While the run "hangs", 
adding or removing an xgrid node is still recognized, but 
initialization never finishes.


For comparison, here's the -d output I get when submitting the 
same job via ssh:


[powerbook:/usr/local/MVH-1] admin% mpirun -d -hostfile machinefile -np 4 ./vhone
[powerbook:03270] procdir: (null)
[powerbook:03270] jobdir: (null)
[powerbook:03270] unidir: /tmp/openmpi-sessions-admin@powerbook_0/default-universe
[powerbook:03270] top: openmpi-sessions-admin@powerbook_0
[powerbook:03270] tmp: /tmp
[powerbook:03270] connect_uni: contact info read
[powerbook:03270] connect_uni: connection not allowed
[powerbook:03270] [0,0,0] setting up session dir with
[powerbook:03270]   tmpdir /tmp
[powerbook:03270]   universe default-universe-3270
[powerbook:03270]   user admin
[powerbook:03270]   host powerbook
[powerbook:03270]   jobid 0
[powerbook:03270]   procid 0
[powerbook:03270] procdir: /tmp/openmpi-sessions-admin@powerbook_0/default-universe-3270/0/0
[powerbook:03270] jobdir: /tmp/openmpi-sessions-admin@powerbook_0/default-universe-3270/0
[powerbook:03270] unidir: /tmp/openmpi-sessions-admin@powerbook_0/default-universe-3270
[powerbook:03270] top: openmpi-sessions-admin@powerbook_0
[powerbook:03270] tmp: /tmp
[powerbook:03270] [0,0,0] contact_file /tmp/openmpi-sessions-admin@powerbook_0/default-universe-3270/universe-setup.txt
[powerbook:03270] [0,0,0] wrote setup file
[powerbook:03270] spawn: in job_state_callback(jobid = 1, state = 0x1)
[powerbook:03270] pls:rsh: local csh: 1, local bash: 0
[powerbook:03270] pls:rsh: assuming same remote shell as local shell
[powerbook:03270] pls:rsh: remote csh: 1, remote bash: 0
[powerbook:03270] pls:rsh: final template argv:
[powerbook:03270] pls:rsh: ssh  orted --debug --bootproxy 1 --name  --num_procs 3 --vpid_start 0 --nodename  --universe admin@powerbook:default-universe-3270 --nsreplica "0.0.0;tcp://192.168.178.23:50205" --gprreplica "0.0.

[OMPI users] mpif90 broken in recent tarballs of 1.1a1

2006-03-20 Thread Michael Kluskens

Building Open MPI 1.1a1r9xxx on a PowerMac G4 running OS X 10.4.5 using
1) Apple gnu compilers from Xcode 2.2.1
2) fink-installed g95

setenv F77 g95 ; setenv FC g95 ; ./configure ; make all ; sudo make install


r9212 (built about a week ago) worked, but I was having some issues 
and wished to try a newer 1.1


r9275 (built Thursday) and r9336 (built today) do not work; they 
appear to compile just fine, but:


mpif90 mpitest.f90 -o mpitest

does nothing, just returns.  No obvious errors in config.log.
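
(In case it is useful for debugging: the exact command the wrapper 
would run can be inspected with Open MPI's -showme option, e.g. 
"mpif90 -showme mpitest.f90 -o mpitest", assuming the new install is 
the first one in the PATH.)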

1.0.2a10 (built today) does not have this problem.

I use "sudo make uninstall" to remove the previous installation  
before installing a new version.


Michael

ps. I've had to use 1.1 because of bugs in the 1.0.x series that will  
not be fixed.


On Mar 4, 2006, at 9:29 AM, Jeff Squyres wrote:

I'm hesitant to put these fixes in the 1.0.x series simply because
we're trying to finish that series and advance towards 1.1.




Re: [OMPI users] mpif90 broken in recent tarballs of 1.1a1

2006-03-20 Thread Michael Kluskens

I have identified what I think is the issue described below.

Even though the default prefix is /usr/local, r9336 only works for me  
if I use


./configure --prefix=/usr/local

Michael

On Mar 20, 2006, at 11:49 AM, Michael Kluskens wrote:

Building Open MPI 1.1a1r9xxx on a PowerMac G4 running OS X 10.4.5  
using

1) Apple gnu compilers from Xcode 2.2.1
2) fink-installed g95

setenv F77 g95 ; setenv FC g95 ; ./configure ; make all ; sudo make install

r9212 (built about a week ago) worked, but I was having some issues
and wished to try a newer 1.1

r9275 (built Thursday) and r9336 (built today) do not work, meaning
they appear to compile just fine, but:

mpif90 mpitest.f90 -o mpitest

does nothing, just returns.  No obvious errors in config.log.

1.0.2a10 (built today) does not have this problem.

I use "sudo make uninstall" to remove the previous installation
before installing a new version.

Michael

ps. I've had to use 1.1 because of bugs in the 1.0.x series that will
not be fixed.

On Mar 4, 2006, at 9:29 AM, Jeff Squyres wrote:

I'm hesitant to put these fixes in the 1.0.x series simply because
we're trying to finish that series and advance towards 1.1.







[OMPI users] Sample code demonstrating issues with multiple versions of OpenMPI

2006-03-20 Thread Michael Kluskens
The sample code at the end of this message demonstrates issues with  
multiple versions of OpenMPI.


OpenMPI 1.0.2a10 compiles the code, but it crashes because of the 
interface issues previously discussed. This happens both with 
"USE MPI" and with "include 'mpif.h'".


OpenMPI 1.1a1r9336 generates the following output (generated on OS X 
with g95; the same errors were previously documented on Debian Linux 
with pgf90 version 6.1):



 >spawn
How many processes total?
2
alpha 0  of  1
master receiving
alpha 0 receiving 17 from master
alpha 0  sending -1 0
answer= -1 0  from alpha 0 0
[x:14559] [0,0,0] ORTE_ERROR_LOG: GPR data corruption in file base/soh_base_get_proc_soh.c at line 100
[x:14559] [0,0,0] ORTE_ERROR_LOG: GPR data corruption in file base/oob_base_xcast.c at line 108
[x:14559] [0,0,0] ORTE_ERROR_LOG: GPR data corruption in file base/rmgr_base_stage_gate.c at line 276
[x:14559] [0,0,0] ORTE_ERROR_LOG: GPR data corruption in file base/soh_base_get_proc_soh.c at line 100
[x:14559] [0,0,0] ORTE_ERROR_LOG: GPR data corruption in file base/oob_base_xcast.c at line 108
[x:14559] [0,0,0] ORTE_ERROR_LOG: GPR data corruption in file base/rmgr_base_stage_gate.c at line 276


Michael

---------- spawn.f90 ----------

program main
  USE MPI
  implicit none
!  include 'mpif.h'
  integer :: ierr,size,rank,child
  integer  (kind=MPI_ADDRESS_KIND) :: universe_size
  integer :: status(MPI_STATUS_SIZE)
  logical :: flag
  integer :: ans(0:2),btest
  integer :: k, subprocesses
  real:: ts(4)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,size,ierr)

  if ( size /= 1 ) then
if ( rank == 0 ) then
  write(*,*) 'Only one master process permitted'
  write(*,*) 'Terminating all but root process'
else
  call MPI_FINALIZE(ierr)
  stop
end if
  end if

  call MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &
                         universe_size, flag, ierr)

  if ( .not. flag ) then
write(*,*) 'This MPI does not support UNIVERSE_SIZE.'
write(*,*) 'How many processes total?'
read(*,*) universe_size
  else if ( universe_size < 2 ) then
write(*,*) 'How many processes total?'
read(*,*) universe_size
  end if
  subprocesses = universe_size-1
  call MPI_Comm_spawn('subprocess', MPI_ARGV_NULL, subprocesses, &
                      MPI_INFO_NULL, 0, MPI_COMM_WORLD, child, &
                      MPI_ERRCODES_IGNORE, ierr )

  btest = 17
  call MPI_BCAST( btest, 1, MPI_INTEGER, MPI_ROOT, child, ierr )
  call MPI_BCAST( ts,4   ,MPI_REAL   ,MPI_ROOT,child,ierr)

  do k = 1, universe_size-1
write(*,*) 'master receiving'
ans = 0
    call MPI_RECV( ans, 2, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, &
                   child, status, ierr )
    write(*,*) 'answer=',ans(0:1),' from alpha', &
               status(MPI_SOURCE),status(MPI_TAG)

  end do

  call MPI_COMM_FREE(child,ierr)

  call MPI_FINALIZE(ierr)
end

---------- subprocess.f90 ----------
program alpha
  USE MPI
  implicit none
!  include 'mpif.h'
  integer :: ierr,size,rank,parent,rsize
  integer :: ans(0:2), btest
  real:: ts(4)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,size,ierr)
  write(*,*) 'alpha',rank,' of ',size
  call MPI_Comm_get_parent(parent,ierr)

  call MPI_BCAST( btest, 1, MPI_INTEGER, 0, parent, ierr )
  call MPI_BCAST(ts,4,MPI_REAL,0,parent,ierr)
  write(*,*) 'alpha',rank,'receiving',btest,'from master'
  ans(0) = rank-1
  ans(1) = rank
  ans(2) = rank+1
  write(*,*) 'alpha',rank,' sending',ans(0:1)
  call MPI_SSEND( ans, 2, MPI_INTEGER, 0, rank, parent, ierr)

  call MPI_FINALIZE(ierr)
end program alpha
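
For reference, a minimal build-and-run sketch for these two programs 
(assuming both executables end up in the working directory):

  mpif90 spawn.f90 -o spawn
  mpif90 subprocess.f90 -o subprocess
  mpirun -np 1 ./spawn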



Re: [OMPI users] mpif90 broken in recent tarballs of 1.1a1

2006-03-20 Thread Brian Barrett

On Mar 20, 2006, at 6:10 PM, Michael Kluskens wrote:


I have identified what I think is the issue described below.

Even though the default prefix is /usr/local, r9336 only works for me
if I use

./configure --prefix=/usr/local


Thank you for the bug report.  I was able to pin down the problem to  
a change I made last week to fix a recompilation issue.  The bug has  
been fixed in r9346 on the trunk.


Thanks,

Brian



On Mar 20, 2006, at 11:49 AM, Michael Kluskens wrote:


Building Open MPI 1.1a1r9xxx on a PowerMac G4 running OS X 10.4.5
using
1) Apple gnu compilers from Xcode 2.2.1
2) fink-installed g95

setenv F77 g95 ; setenv FC g95 ; ./configure ; make all ; sudo make install

r9212 (built about a week ago) worked, but I was having some issues
and wished to try a newer 1.1

r9275 (built Thursday) and r9336 (built today) do not work, meaning
they appear to compile just fine, but:

mpif90 mpitest.f90 -o mpitest

does nothing, just returns.  No obvious errors in config.log.

1.0.2a10 (built today) does not have this problem.

I use "sudo make uninstall" to remove the previous installation
before installing a new version.

Michael

ps. I've had to use 1.1 because of bugs in the 1.0.x series that will
not be fixed.

On Mar 4, 2006, at 9:29 AM, Jeff Squyres wrote:

I'm hesitant to put these fixes in the 1.0.x series simply because
we're trying to finish that series and advance towards 1.1.







--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] problems with OpenMPI-1.0.1 on SunOS 5.9; problems on heterogeneous cluster

2006-03-20 Thread Brian Barrett

On Mar 16, 2006, at 1:32 AM, Ravi Manumachu wrote:

I have installed OpenMPI-1.1a1r9260 on my SunOS machines. It has 
solved the problems. However, there is one more issue that I found 
in my testing and failed to report. It concerns Linux machines too.





csultra06$ mpirun --hostfile hosts.txt --app mpiinit_appfile
csultra02:Hello world from 5
csultra06:Hello world from 0
csultra06:Hello world from 4
csultra02:Hello world from 1
csultra08:Hello world from 3
csultra05:Hello world from 2

The following two statements are not printed:

csultra05:Hello world from 6
csultra08:Hello world from 7

I observed this behavior on my Linux cluster too.


Hi Ravi -

Thanks for the bug report.  We've determined that there is definitely  
a problem with starting applications from an app context file on the  
trunk.  The issue appears to be a regression that slipped into our  
development trunk but is not in our release branch.  I've passed the  
bug report on to the author of that code and he is looking into the  
issue, but I don't have a timeline for having it resolved.


The good news is that the Solaris issue has been resolved in the v1.0  
release branch, and the app context bug does not exist there.  So the  
upcoming Open MPI 1.0.2 release (and the currently available Open MPI  
1.0.2a10 alpha release) should work properly in your environment.  
Now that we have some Solaris users regularly testing Open MPI, your  
experience with the release branches on Solaris should hopefully be  
much more stable :).


Brian


--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




[OMPI users] Mac OS X 10.4.5 and XGrid, Open-MPI V1.0.1

2006-03-20 Thread Warner Yuen

Hi Frank,

I've used OMPI 1.0.1 with Xgrid. I don't think I ran into the same  
problem as you with the job hanging, but I'll describe what I saw in  
case it helps you or someone else. The one thing I noticed was that  
Xgrid/OMPI does not allow an MPI application to write out a file;  
output can only go to standard output.


In my example, running HP Linpack over an Xgrid-enabled OMPI,  
everything runs fine if I execute mpirun with HPL just outputting to  
the screen. However, if I set my hpl.dat file to write the results  
out to a file, I get an error:


With 'hpl.dat' set to write to an output file called 'HPL.out',  
after executing: mpirun -d -hostfile myhosts -np 4 ./xhpl


[portal.private:00545] [0,1,0] ompi_mpi_init completed
HPL ERROR from process # 0, on line 318 of function HPL_pdinfo:
>>> cannot open file HPL.out. <<<

I've tested this with a couple of other applications as well. For  
now, the only way I can work around it is to allow user nobody to  
write to my working directory.
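
A rough sketch of that workaround (the path here is just an example  
for wherever the MPI job actually runs):

  chmod o+w /path/to/working/directory

Hope this helps.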


-Warner

Warner Yuen
Apple Computer
email: wy...@apple.com
Tel: 408.718.2859
Fax: 408.715.0133


On Mar 20, 2006, at 9:00 AM, users-requ...@open-mpi.org wrote:


Message: 1
Date: Mon, 20 Mar 2006 08:11:32 +0100
From: Frank 
Subject: Re: [OMPI users] Mac OS X 10.4.5 and XGrid, Open-MPI V1.0.1
To: us...@open-mpi.org
Message-ID: <441e55a4.6090...@fraka-mp.de>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi Brian,

this is the full -d output I got when mpi-running vhone on the
xgrid. The truncation is due to the reported "hang".