I'm pretty sure that this particular VT compile issue has already been
fixed in the 1.3 series.
Lenny -- can you try the latest OMPI 1.3.1 nightly tarball to verify?
On Mar 1, 2009, at 4:54 PM, Lenny Verkhovsky wrote:
We saw the same compilation problem; the workaround for us was
configuring without VT (see ./configure --help for the relevant option).
I hope the VT folks will fix it at some point.
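If it helps, I believe the option we used was the one below, but check
./configure --help to confirm, since I am quoting it from memory:

  ./configure --enable-contrib-no-build=vt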
Lenny.
On Mon, Feb 23, 2009 at 11:48 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
It would be interesting to see what happens with the 1.3 build.
It's hard to interpret the output of your user's test program without
knowing exactly what that printf means...
On Feb 23, 2009, at 4:44 PM, Jim Kusznir wrote:
I haven't had time to do the Open MPI build from the nightly yet, but
my user has run some more tests and now has a simple program and
algorithm to "break" Open MPI. His notes:
Hey, just FYI: I can reproduce the error readily in a simple test case.
My "way to break MPI" is as follows: the master proc runs MPI_Send 1000
times to each child, then waits for an "I got it" ack from each child.
Each child receives 1000 numbers from the master, then sends "I got it"
back to the master. (A minimal sketch of this pattern is below.)
Running this on 25 nodes causes it to break about 60% of the time.
Interestingly, it usually breaks on the same process number each time.
Ah. It looks like if I let it sit for about five minutes, it will
sometimes work. From my log:
rank: 23 Mon Feb 23 13:29:44 2009 received 816
rank: 23 Mon Feb 23 13:29:44 2009 received 817
rank: 23 Mon Feb 23 13:29:44 2009 received 818
rank: 23 Mon Feb 23 13:33:08 2009 received 819
rank: 23 Mon Feb 23 13:33:08 2009 received 820
Any thoughts on this problem?
(This is the only reason I'm currently working on upgrading Open MPI.)
--Jim
On Fri, Feb 20, 2009 at 1:59 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
There won't be an official SRPM until 1.3.1 is released.
But to test whether 1.3.1 is on track to deliver a proper solution for
you, can you try a nightly tarball, perhaps in conjunction with our
"buildrpm.sh" script?
https://svn.open-mpi.org/source/xref/ompi_1.3/contrib/dist/linux/buildrpm.sh
It should build a trivial SRPM for you from the tarball. You'll likely
need to get the specfile, too, and put it in the same directory as
buildrpm.sh. The specfile is in the same SVN directory:
https://svn.open-mpi.org/source/xref/ompi_1.3/contrib/dist/linux/openmpi.spec
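Roughly, the steps would be as follows. The tarball filename here is
only an example (grab a real one from the nightly download area), and I
am quoting from memory that buildrpm.sh takes the tarball as its first
argument:

  # Put buildrpm.sh and openmpi.spec (from the two URLs above) in the
  # same directory as the nightly tarball, then:
  chmod +x buildrpm.sh
  ./buildrpm.sh openmpi-1.3.1rc1.tar.gz    # example filename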
On Feb 20, 2009, at 3:51 PM, Jim Kusznir wrote:
As long as I can still build the RPM for it and install it via rpm,
that's fine. I'm running it on a ROCKS cluster, so it needs to be an
RPM to get pushed out to the compute nodes.
--Jim
On Fri, Feb 20, 2009 at 11:30 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
On Feb 20, 2009, at 2:20 PM, Jim Kusznir wrote:
I just went to www.open-mpi.org, went to Download, then the source RPM.
Looks like it was actually 1.3-1. Here's the src.rpm that I pulled in:
http://www.open-mpi.org/software/ompi/v1.3/downloads/openmpi-1.3-1.src.rpm
Ah, gotcha. Yes, that's 1.3.0, SRPM version 1. We didn't make up this
nomenclature. :-(
The reason for this upgrade is that a user seems to have found a bug,
possibly in the Open MPI code, that occasionally results in an
MPI_Send() message getting lost. He's managed to reproduce it multiple
times, and we can't find anything in his code that could cause it. He's
got logs of MPI_Send() going out but the matching MPI_Recv() never
receiving anything, thus killing his code. We're currently running
1.2.8 with OFED support (we haven't tried turning off OFED, etc. yet).
OK. 1.3.x is much mo' betta' than 1.2 in many ways. We could probably
help track down the problem, but if you're willing to upgrade to 1.3.x,
it'll hopefully just make the problem go away.
Can you try a 1.3.1 nightly tarball?
--
Jeff Squyres
Cisco Systems