We saw the same problem with compilation; the workaround for us was configuring without VT (see ./configure --help for the relevant option). I hope the VT guys will fix it at some point.
Lenny.

On Mon, Feb 23, 2009 at 11:48 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
> It would be interesting to see what happens with the 1.3 build.
>
> It's hard to interpret the output of your user's test program without
> knowing exactly what that printf means...
>
>
> On Feb 23, 2009, at 4:44 PM, Jim Kusznir wrote:
>
>> I haven't had time to do the openmpi build from the nightly yet, but
>> my user has run some more tests and now has a simple program and
>> algorithm to "break" openmpi. His notes:
>>
>> hey, just fyi, I can reproduce the error readily in a simple test case
>> my "way to break mpi" is as follows: Master proc runs MPI_Send 1000
>> times to each child, then waits for a "I got it" ack from each child.
>> Each child receives 1000 numbers from the Master, then sends "I got
>> it" to the master
>> running this on 25 nodes causes it to break about 60% of the time
>> interestingly, it usually breaks on the same process number each time
>>
>> ah. It looks like if I let it sit for about 5 minutes, sometimes it
>> will work. From my log
>> rank: 23 Mon Feb 23 13:29:44 2009 recieved 816
>> rank: 23 Mon Feb 23 13:29:44 2009 recieved 817
>> rank: 23 Mon Feb 23 13:29:44 2009 recieved 818
>> rank: 23 Mon Feb 23 13:33:08 2009 recieved 819
>> rank: 23 Mon Feb 23 13:33:08 2009 recieved 820
>>
>> Any thoughts on this problem?
>> (this is the only reason I'm currently working on upgrading openmpi)
>>
>> --Jim
>>
>> On Fri, Feb 20, 2009 at 1:59 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
>>>
>>> There won't be an official SRPM until 1.3.1 is released.
>>>
>>> But to test if 1.3.1 is on-track to deliver a proper solution to you, can
>>> you try a nightly tarball, perhaps in conjunction with our "buildrpm.sh"
>>> script?
>>>
>>> https://svn.open-mpi.org/source/xref/ompi_1.3/contrib/dist/linux/buildrpm.sh
>>>
>>> It should build a trivial SRPM for you from the tarball. You'll likely
>>> need to get the specfile, too, and put it in the same dir as buildrpm.sh.
>>> The specfile is in the same SVN directory:
>>>
>>> https://svn.open-mpi.org/source/xref/ompi_1.3/contrib/dist/linux/openmpi.spec
>>>
>>>
>>> On Feb 20, 2009, at 3:51 PM, Jim Kusznir wrote:
>>>
>>>> As long as I can still build the rpm for it and install it via rpm.
>>>> I'm running it on a ROCKS cluster, so it needs to be an RPM to get
>>>> pushed out to the compute nodes.
>>>>
>>>> --Jim
>>>>
>>>> On Fri, Feb 20, 2009 at 11:30 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
>>>>>
>>>>> On Feb 20, 2009, at 2:20 PM, Jim Kusznir wrote:
>>>>>
>>>>>> I just went to www.open-mpi.org, went to download, then source rpm.
>>>>>> Looks like it was actually 1.3-1. Here's the src.rpm that I pulled in:
>>>>>>
>>>>>> http://www.open-mpi.org/software/ompi/v1.3/downloads/openmpi-1.3-1.src.rpm
>>>>>
>>>>> Ah, gotcha. Yes, that's 1.3.0, SRPM version 1. We didn't make up this
>>>>> nomenclature. :-(
>>>>>
>>>>>> The reason for this upgrade is it seems a user found some bug that may
>>>>>> be in the OpenMPI code that results in occasionally an MPI_Send()
>>>>>> message getting lost. He's managed to reproduce it multiple times,
>>>>>> and we can't find anything in his code that can cause it... He's got
>>>>>> logs of mpi_send() going out, but the matching mpi_receive() never
>>>>>> getting anything, thus killing his code. We're currently running
>>>>>> 1.2.8 with ofed support (Haven't tried turning off ofed, etc. yet).
>>>>>
>>>>> Ok. 1.3.x is much mo' betta' then 1.2 in many ways. We could probably
>>>>> help track down the problem, but if you're willing to upgrade to 1.3.x,
>>>>> it'll hopefully just make the problem go away.
>>>>>
>>>>> Can you try a 1.3.1 nightly tarball?
>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> Cisco Systems
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>
> --
> Jeff Squyres
> Cisco Systems
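
For reference, below is a minimal sketch of the test case described in the quoted notes (master does 1000 blocking sends to each child, each child receives them and sends back a one-int "I got it" ack). The user's actual program was never posted, so the tags, datatypes, counts, and the timestamped printf here are assumptions reconstructed from the notes and the log lines; this only illustrates the send/ack pattern, it is not the original reproducer.

/*
 * Sketch of the reported master/child send-ack test.
 * Assumed: int payloads, two tags, blocking MPI_Send/MPI_Recv.
 */
#include <mpi.h>
#include <stdio.h>
#include <time.h>

#define NMSGS    1000
#define TAG_DATA 1
#define TAG_ACK  2

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Master: 1000 blocking sends to each child, then collect the acks. */
        for (int dst = 1; dst < size; ++dst) {
            for (int i = 0; i < NMSGS; ++i) {
                MPI_Send(&i, 1, MPI_INT, dst, TAG_DATA, MPI_COMM_WORLD);
            }
        }
        for (int src = 1; src < size; ++src) {
            int ack;
            MPI_Recv(&ack, 1, MPI_INT, src, TAG_ACK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        printf("master: all acks received\n");
    } else {
        /* Child: receive 1000 numbers, log each one, then ack the master. */
        for (int i = 0; i < NMSGS; ++i) {
            int val;
            time_t now;
            MPI_Recv(&val, 1, MPI_INT, 0, TAG_DATA, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            now = time(NULL);
            /* %.24s drops ctime()'s trailing newline. */
            printf("rank: %d %.24s received %d\n", rank, ctime(&now), val);
        }
        MPI_Send(&rank, 1, MPI_INT, 0, TAG_ACK, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Compiling with mpicc and launching across the 25 nodes mentioned above should approximate the reported setup; whether it actually reproduces the hang will of course depend on the interconnect and the Open MPI version in use.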