[OMPI users] MPI over IB not working : "Connection event not handled: 16391"

2007-06-29 Thread Glenn Carver
Hi, I'm trying to set up a new small cluster. It's based on Sun's X4100 servers running Solaris 10_x86. I have Open MPI that comes within Clustertools 7. In addition, I have an Infiniband network between the nodes. I can run parallel jobs fine if processes remain on one node (each node has

Re: [OMPI users] MPI over IB not working : "Connection event not handled: 16391"

2007-06-30 Thread Glenn Carver
Further to my email below regarding problems with uDAPL across IB, I found a bug report lodged with Sun (also reported with Opensolaris at: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6545187). I will lodge a support call with Sun first thing Monday though it might not get me very

Re: [OMPI users] MPI over IB not working : "Connection event not handled: 16391"

2007-06-30 Thread Glenn Carver
me to reply at the weekend! Much appreciated. Glenn Glenn, Are you running with Solaris 10 Update 3 (11/06) and with this patch 125793-01? It is required for running with udapl btl. http://www.sun.com/products-n-solutions/hardware/docs/html/819-7478-11/body.html#93180 Glenn Carver

[OMPI users] Connection to HNP lost

2007-07-10 Thread Glenn Carver
Hi, I'd be grateful if someone could explain the meaning of this error message to me and whether it indicates a hardware problem or application software issue: [node2:11881] OOB: Connection to HNP lost [node1:09876] OOB: Connection to HNP lost I have a small cluster which until last week was

Re: [OMPI users] Connection to HNP lost

2007-07-10 Thread Glenn Carver
On Jul 10, 2007, at 11:32 AM, Ralph H Castain wrote: On 7/10/07 11:08 AM, "Glenn Carver" wrote: Hi, I'd be grateful if someone could explain the meaning of this error message to me and whether it indicates a hardware problem or application software issue: [node2:118

[OMPI users] values of mca parameters whilst running program

2007-08-02 Thread Glenn Carver
Hopefully an easy question to answer... is it possible to get at the values of mca parameters whilst a program is running? What I had in mind was either an open-mpi function to call which would print the current values of mca parameters or a function to call for specific mca parameters. I don

Re: [OMPI users] values of mca parameters whilst running program

2007-08-03 Thread Glenn Carver
e. Of course you do take the hit of wireup time for all connections at MPI_Init. That's a useful tip and may apply in our case as the code configuration giving us trouble writes a lot of data to process 0 for disk output. Thanks, Glenn -DON Brian Barrett wrote: On Aug

[OMPI users] memory leaks on solaris

2007-08-05 Thread Glenn Carver
I'd appreciate some advice and help on this one. We're having serious problems running parallel applications on our cluster. After each batch job finishes, we lose a certain amount of available memory. Additional jobs cause free memory to gradually go down until the machine starts swapping an

Re: [OMPI users] memory leaks on solaris

2007-08-06 Thread Glenn Carver
-DON p.s. orte-clean does not exist in the ompi v1.2 branch, it is in the trunk but I think there is an issue with it currently Ralph H Castain wrote: On 8/5/07 6:35 PM, "Glenn Carver" wrote: I'd appreciate some advice and help on this one. We

Re: [OMPI users] memory leaks on solaris

2007-08-07 Thread Glenn Carver
"--mca btl self,tcp" If this is successful, i.e. frees memory as expected, the next step would be to run including shared memory, "--mca btl self,sm,tcp". If this is successful the last step would be to add in udapl, "--mca btl self,sm,udapl". -DON Glenn Carver wrote:

Re: [OMPI users] memory leaks on solaris

2007-09-03 Thread Glenn Carver
mber of MPI jobs running simultaneously? Size of the job(s)? Is your code something you can share? Reproducing what you are seeing is my intent. -DON p.s. I will not be checking email or working on this again until the week of August 27 as I am taking a little vacation. Glenn Carver wrote: Don, Fol