Re: [OMPI users] Mac OS X 10.4.5 and XGrid, Open-MPI V1.0.1
Hi Brian,

this is the full -d output I get when mpi-running vhone on the Xgrid. The truncation is due to the reported "hang".

[powerbook:/usr/local/MVH-1] admin% mpirun -d -np 4 ./vhone
[powerbook:03138] procdir: (null)
[powerbook:03138] jobdir: (null)
[powerbook:03138] unidir: /tmp/openmpi-sessions-admin@powerbook_0/default-universe
[powerbook:03138] top: openmpi-sessions-admin@powerbook_0
[powerbook:03138] tmp: /tmp
[powerbook:03138] connect_uni: contact info read
[powerbook:03138] connect_uni: connection not allowed
[powerbook:03138] [0,0,0] setting up session dir with
[powerbook:03138]     tmpdir /tmp
[powerbook:03138]     universe default-universe-3138
[powerbook:03138]     user admin
[powerbook:03138]     host powerbook
[powerbook:03138]     jobid 0
[powerbook:03138]     procid 0
[powerbook:03138] procdir: /tmp/openmpi-sessions-admin@powerbook_0/default-universe-3138/0/0
[powerbook:03138] jobdir: /tmp/openmpi-sessions-admin@powerbook_0/default-universe-3138/0
[powerbook:03138] unidir: /tmp/openmpi-sessions-admin@powerbook_0/default-universe-3138
[powerbook:03138] top: openmpi-sessions-admin@powerbook_0
[powerbook:03138] tmp: /tmp
[powerbook:03138] [0,0,0] contact_file /tmp/openmpi-sessions-admin@powerbook_0/default-universe-3138/universe-setup.txt
[powerbook:03138] [0,0,0] wrote setup file
[powerbook:03138] spawn: in job_state_callback(jobid = 1, state = 0x1)
[ibi:00717] [0,1,2] setting up session dir with
[ibi:00717]     universe default-universe
[ibi:00717]     user nobody
[ibi:00717]     host xgrid-node-2
[ibi:00717]     jobid 1
[ibi:00717]     procid 2
[ibi:00717] procdir: /tmp/openmpi-sessions-nobody@xgrid-node-2_0/default-universe/1/2
[ibi:00717] jobdir: /tmp/openmpi-sessions-nobody@xgrid-node-2_0/default-universe/1
[ibi:00717] unidir: /tmp/openmpi-sessions-nobody@xgrid-node-2_0/default-universe
[ibi:00717] top: openmpi-sessions-nobody@xgrid-node-2_0
[ibi:00717] tmp: /tmp
[powerbook:03147] [0,1,0] setting up session dir with
[powerbook:03147]     universe default-universe
[powerbook:03147]     user nobody
[powerbook:03147]     host xgrid-node-0
[powerbook:03147]     jobid 1
[powerbook:03147]     procid 0
[powerbook:03147] procdir: /tmp/openmpi-sessions-nobody@xgrid-node-0_0/default-universe/1/0
[powerbook:03147] jobdir: /tmp/openmpi-sessions-nobody@xgrid-node-0_0/default-universe/1
[powerbook:03147] unidir: /tmp/openmpi-sessions-nobody@xgrid-node-0_0/default-universe
[powerbook:03147] top: openmpi-sessions-nobody@xgrid-node-0_0
[powerbook:03147] tmp: /tmp
^Z
Suspended
[powerbook:/usr/local/MVH-1] admin%

I waited quite a while before cancelling the job, so this is not due to a low priority of the jobs submitted to Xgrid (Xgrid is told to always accept jobs and run them). Comparing this with the output I get from a non-Xgrid mpirun (jobs submitted via ssh), the next lines of -d output I have been waiting for are another spawn message and, after that, the message that ompi_mpi_init has completed. While the job "hangs", adding or removing an Xgrid node is still recognized, but initialization never finishes.
Just to compare with, here's the -d output I get from submitting the same job via ssh:

[powerbook:/usr/local/MVH-1] admin% mpirun -d -hostfile machinefile -np 4 ./vhone
[powerbook:03270] procdir: (null)
[powerbook:03270] jobdir: (null)
[powerbook:03270] unidir: /tmp/openmpi-sessions-admin@powerbook_0/default-universe
[powerbook:03270] top: openmpi-sessions-admin@powerbook_0
[powerbook:03270] tmp: /tmp
[powerbook:03270] connect_uni: contact info read
[powerbook:03270] connect_uni: connection not allowed
[powerbook:03270] [0,0,0] setting up session dir with
[powerbook:03270]     tmpdir /tmp
[powerbook:03270]     universe default-universe-3270
[powerbook:03270]     user admin
[powerbook:03270]     host powerbook
[powerbook:03270]     jobid 0
[powerbook:03270]     procid 0
[powerbook:03270] procdir: /tmp/openmpi-sessions-admin@powerbook_0/default-universe-3270/0/0
[powerbook:03270] jobdir: /tmp/openmpi-sessions-admin@powerbook_0/default-universe-3270/0
[powerbook:03270] unidir: /tmp/openmpi-sessions-admin@powerbook_0/default-universe-3270
[powerbook:03270] top: openmpi-sessions-admin@powerbook_0
[powerbook:03270] tmp: /tmp
[powerbook:03270] [0,0,0] contact_file /tmp/openmpi-sessions-admin@powerbook_0/default-universe-3270/universe-setup.txt
[powerbook:03270] [0,0,0] wrote setup file
[powerbook:03270] spawn: in job_state_callback(jobid = 1, state = 0x1)
[powerbook:03270] pls:rsh: local csh: 1, local bash: 0
[powerbook:03270] pls:rsh: assuming same remote shell as local shell
[powerbook:03270] pls:rsh: remote csh: 1, remote bash: 0
[powerbook:03270] pls:rsh: final template argv:
[powerbook:03270] pls:rsh: ssh orted --debug --bootproxy 1 --name --num_procs 3 --vpid_start 0 --nodename --universe admin@powerbook:default-universe-3270 --nsreplica "0.0.0;tcp://192.168.178.23:50205" --gprreplica "0.0.
[OMPI users] mpif90 broken in recent tarballs of 1.1a1
Building Open MPI 1.1a1r9xxx on a PowerMac G4 running OS X 10.4.5 using
1) Apple gnu compilers from Xcode 2.2.1
2) fink-installed g95

setenv F77 g95 ; setenv FC g95 ; ./configure ; make all ; sudo make install

r9212 (built about a week ago) worked, but I was having some issues and wished to try a newer 1.1 build. r9275 (built Thursday) and r9336 (built today) do not work, meaning they appear to compile just fine, but

mpif90 mpitest.f90 -o mpitest

does nothing, it just returns. There are no obvious errors in config.log. 1.0.2a10 (built today) does not have this problem. I use "sudo make uninstall" to remove the previous installation before installing a new version.

Michael

ps. I've had to use 1.1 because of bugs in the 1.0.x series that will not be fixed.

On Mar 4, 2006, at 9:29 AM, Jeff Squyres wrote:
> I'm hesitant to put these fixes in the 1.0.x series simply because we're trying to finish that series and advance towards 1.1.
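[Editor's note: the mpitest.f90 mentioned above is not included in the report. A minimal test program along these lines (hypothetical contents; any valid MPI Fortran program would do) is enough to exercise the mpif90 wrapper, since the reported failure occurs in the wrapper itself before any user code is compiled:]

! mpitest.f90: minimal MPI test (assumed contents, not the original attachment)
program mpitest
  USE MPI                               ! Fortran 90 MPI bindings, as in the reports above
  implicit none
  integer :: ierr, rank, nprocs
  call MPI_INIT(ierr)                   ! start MPI
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  write(*,*) 'Hello from rank', rank, 'of', nprocs
  call MPI_FINALIZE(ierr)               ! shut MPI down
end program mpitest

[With a working installation this should compile with "mpif90 mpitest.f90 -o mpitest" and run with "mpirun -np 2 ./mpitest"; with the broken r9275/r9336 builds, mpif90 reportedly just returns, as described above.]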
Re: [OMPI users] mpif90 broken in recent tarballs of 1.1a1
I have identified what I think is the issue described below. Even though the default prefix is /usr/local, r9336 only works for me if I use

./configure --prefix=/usr/local

Michael

On Mar 20, 2006, at 11:49 AM, Michael Kluskens wrote:
> Building Open MPI 1.1a1r9xxx on a PowerMac G4 running OS X 10.4.5 using
> 1) Apple gnu compilers from Xcode 2.2.1
> 2) fink-installed g95
> setenv F77 g95 ; setenv FC g95 ; ./configure ; make all ; sudo make install
> r9212 (built about a week ago) worked, but I was having some issues and wished to try a newer 1.1 build. r9275 (built Thursday) and r9336 (built today) do not work, meaning they appear to compile just fine, but "mpif90 mpitest.f90 -o mpitest" does nothing, it just returns. There are no obvious errors in config.log. 1.0.2a10 (built today) does not have this problem. I use "sudo make uninstall" to remove the previous installation before installing a new version.
> Michael
> ps. I've had to use 1.1 because of bugs in the 1.0.x series that will not be fixed.
> On Mar 4, 2006, at 9:29 AM, Jeff Squyres wrote:
>> I'm hesitant to put these fixes in the 1.0.x series simply because we're trying to finish that series and advance towards 1.1.
[OMPI users] Sample code demonstrating issues with multiple versions of OpenMPI
The sample code at the end of this message demonstrates issues with multiple versions of OpenMPI.

OpenMPI 1.0.2a10 compiles the code but crashes because of the interface issues previously discussed. This happens both when using "USE MPI" and when using "include 'mpif.h'".

OpenMPI 1.1a1r9336 generates the following output (generated on OS X with g95; the same errors were previously documented on Debian Linux with pgf90 version 6.1):

>spawn
How many processes total?
2
alpha 0 of 1
master receiving
alpha 0 receiving 17 from master
alpha 0 sending -1 0
answer= -1 0 from alpha 0 0
[x:14559] [0,0,0] ORTE_ERROR_LOG: GPR data corruption in file base/soh_base_get_proc_soh.c at line 100
[x:14559] [0,0,0] ORTE_ERROR_LOG: GPR data corruption in file base/oob_base_xcast.c at line 108
[x:14559] [0,0,0] ORTE_ERROR_LOG: GPR data corruption in file base/rmgr_base_stage_gate.c at line 276
[x:14559] [0,0,0] ORTE_ERROR_LOG: GPR data corruption in file base/soh_base_get_proc_soh.c at line 100
[x:14559] [0,0,0] ORTE_ERROR_LOG: GPR data corruption in file base/oob_base_xcast.c at line 108
[x:14559] [0,0,0] ORTE_ERROR_LOG: GPR data corruption in file base/rmgr_base_stage_gate.c at line 276

Michael

spawn.f90
---
program main
  USE MPI
  implicit none
  ! include 'mpif.h'
  integer :: ierr, size, rank, child
  integer (kind=MPI_ADDRESS_KIND) :: universe_size
  integer :: status(MPI_STATUS_SIZE)
  logical :: flag
  integer :: ans(0:2), btest
  integer :: k, subprocesses
  real :: ts(4)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
  if ( size /= 1 ) then
    if ( rank == 0 ) then
      write(*,*) 'Only one master process permitted'
      write(*,*) 'Terminating all but root process'
    else
      call MPI_FINALIZE(ierr)
      stop
    end if
  end if
  call MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, universe_size, flag, ierr)
  if ( .not. flag ) then
    write(*,*) 'This MPI does not support UNIVERSE_SIZE.'
    write(*,*) 'How many processes total?'
    read(*,*) universe_size
  else if ( universe_size < 2 ) then
    write(*,*) 'How many processes total?'
    read(*,*) universe_size
  end if
  subprocesses = universe_size-1
  call MPI_Comm_spawn('subprocess', MPI_ARGV_NULL, subprocesses, MPI_INFO_NULL, 0, &
       MPI_COMM_WORLD, child, MPI_ERRCODES_IGNORE, ierr )
  btest = 17
  call MPI_BCAST( btest, 1, MPI_INTEGER, MPI_ROOT, child, ierr )
  call MPI_BCAST( ts, 4, MPI_REAL, MPI_ROOT, child, ierr )
  do k = 1, universe_size-1
    write(*,*) 'master receiving'
    ans = 0
    call MPI_RECV( ans, 2, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, child, status, ierr )
    write(*,*) 'answer=', ans(0:1), ' from alpha', status(MPI_SOURCE), status(MPI_TAG)
  end do
  call MPI_COMM_FREE(child, ierr)
  call MPI_FINALIZE(ierr)
end
---
subprocess.f90
---
program alpha
  USE MPI
  implicit none
  ! include 'mpif.h'
  integer :: ierr, size, rank, parent, rsize
  integer :: ans(0:2), btest
  real :: ts(4)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
  write(*,*) 'alpha', rank, ' of ', size
  call MPI_Comm_get_parent(parent, ierr)
  call MPI_BCAST( btest, 1, MPI_INTEGER, 0, parent, ierr )
  call MPI_BCAST( ts, 4, MPI_REAL, 0, parent, ierr )
  write(*,*) 'alpha', rank, 'receiving', btest, 'from master'
  ans(0) = rank-1
  ans(1) = rank
  ans(2) = rank+1
  write(*,*) 'alpha', rank, ' sending', ans(0:1)
  call MPI_SSEND( ans, 2, MPI_INTEGER, 0, rank, parent, ierr )
  call MPI_FINALIZE(ierr)
end program alpha
Re: [OMPI users] mpif90 broken in recent tarballs of 1.1a1
On Mar 20, 2006, at 6:10 PM, Michael Kluskens wrote:
> I have identified what I think is the issue described below. Even though the default prefix is /usr/local, r9336 only works for me if I use ./configure --prefix=/usr/local

Thank you for the bug report. I was able to pin down the problem to a change I made last week to fix a recompilation issue. The bug has been fixed in r9346 on the trunk.

Thanks,

Brian

> On Mar 20, 2006, at 11:49 AM, Michael Kluskens wrote:
>> Building Open MPI 1.1a1r9xxx on a PowerMac G4 running OS X 10.4.5 using
>> 1) Apple gnu compilers from Xcode 2.2.1
>> 2) fink-installed g95
>> setenv F77 g95 ; setenv FC g95 ; ./configure ; make all ; sudo make install
>> r9212 (built about a week ago) worked, but I was having some issues and wished to try a newer 1.1 build. r9275 (built Thursday) and r9336 (built today) do not work, meaning they appear to compile just fine, but "mpif90 mpitest.f90 -o mpitest" does nothing, it just returns. There are no obvious errors in config.log. 1.0.2a10 (built today) does not have this problem. I use "sudo make uninstall" to remove the previous installation before installing a new version.
>> Michael
>> ps. I've had to use 1.1 because of bugs in the 1.0.x series that will not be fixed.
>> On Mar 4, 2006, at 9:29 AM, Jeff Squyres wrote:
>>> I'm hesitant to put these fixes in the 1.0.x series simply because we're trying to finish that series and advance towards 1.1.

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
Re: [OMPI users] problems with OpenMPI-1.0.1 on SunOS 5.9; problems on heterogeneous cluster
On Mar 16, 2006, at 1:32 AM, Ravi Manumachu wrote:
> I have installed OpenMPI-1.1a1r9260 on my SunOS machines. It has solved the problems. However, there is one more issue that I found in my testing and failed to report. It concerns Linux machines too.
>
> csultra06$ mpirun --hostfile hosts.txt --app mpiinit_appfile
> csultra02:Hello world from 5
> csultra06:Hello world from 0
> csultra06:Hello world from 4
> csultra02:Hello world from 1
> csultra08:Hello world from 3
> csultra05:Hello world from 2
>
> The following two statements are not printed:
>
> csultra05:Hello world from 6
> csultra08:Hello world from 7
>
> I observed this behavior on my Linux cluster too.

Hi Ravi -

Thanks for the bug report. We've determined that there is definitely a problem with starting applications from an app context file on the trunk. The issue appears to be a regression that slipped into our development trunk, but it is not in our release branch. I've passed the bug report on to the author of that code and he is looking into the issue, but I don't have a timeline for having it resolved.

The good news is that the issue with Solaris has been resolved in the v1.0 release branch, and the app context bug does not exist there. So the upcoming Open MPI 1.0.2 release (and the currently available Open MPI 1.0.2a10 alpha release) should work properly for your environment.

Hopefully, now that we have some Solaris users regularly testing Open MPI, your experiences with the release branches on Solaris should be much more stable :).

Brian

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
[OMPI users] Mac OS X 10.4.5 and XGrid, Open-MPI V1.0.1
Hi Frank,

I've used OMPI 1.0.1 with Xgrid. I don't think I ran into the same problem as you with the job hanging, but I'll continue just in case it helps you or someone else.

The one thing I noticed is that Xgrid/OMPI does not allow an MPI application to write out a file other than to standard output. In my example, running HP Linpack over an Xgrid-enabled OMPI, if I execute mpirun with HPL just outputting to the screen, everything runs fine. However, if I set my hpl.dat file to write the results to a file, I get an error.

With 'hpl.dat' set to write to an output file called 'HPL.out', after executing

mpirun -d -hostfile myhosts -np 4 ./xhpl

[portal.private:00545] [0,1,0] ompi_mpi_init completed
HPL ERROR from process # 0, on line 318 of function HPL_pdinfo:
>>> cannot open file HPL.out. <<<

I've tested this with a couple of other applications as well. For now, the only way I can solve it is to set the permissions on my working directory so that the user nobody can write to it.

Hope this helps.

-Warner

Warner Yuen
Apple Computer
email: wy...@apple.com
Tel: 408.718.2859
Fax: 408.715.0133

On Mar 20, 2006, at 9:00 AM, users-requ...@open-mpi.org wrote:
> Message: 1
> Date: Mon, 20 Mar 2006 08:11:32 +0100
> From: Frank
> Subject: Re: [OMPI users] Mac OS X 10.4.5 and XGrid, Open-MPI V1.0.1
> To: us...@open-mpi.org
> Message-ID: <441e55a4.6090...@fraka-mp.de>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hi Brian, this is the full -d option output I've got mpi-running vhone on the xgrid. The truncation is due to the reported "hang".