Re: [OMPI users] problem with rankfile in openmpi-1.8.2rc3

2014-08-07 Thread Siegmar Gross
Hello Ralph, > Try replacing --report-bindings with -mca hwloc_base_report_bindings 1 > and see if that works I get even more warnings with the new option. It seems that I always get the bindings only for the local machine. I used Solaris Sparc (tyr), Solaris x86_64 (sunpc1), and Linux x86_64 (li

Re: [OMPI users] problem with rankfile in openmpi-1.8.2rc3

2014-08-07 Thread Ralph Castain
Try replacing --report-bindings with -mca hwloc_base_report_bindings 1 and see if that works On Aug 7, 2014, at 4:04 AM, Siegmar Gross wrote: > Hi, > >> I can't replicate - this worked fine for me. I'm at a loss as >> to how you got that error as it would require some strange >> error in the

Re: [OMPI users] problem with rankfile in openmpi-1.8.2rc3

2014-08-07 Thread Siegmar Gross
Hi, > I can't replicate - this worked fine for me. I'm at a loss as > to how you got that error as it would require some strange > error in the report-bindings option. If you remove that option > from your cmd line, does the problem go away? Yes. tyr openmpi_1.7.x_or_newer 468 mpiexec -np 4 -rf r

Re: [OMPI users] problem with rankfile in openmpi-1.8.2rc3

2014-08-05 Thread Ralph Castain
I can't replicate - this worked fine for me. I'm at a loss as to how you got that error as it would require some strange error in the report-bindings option. If you remove that option from your cmd line, does the problem go away? On Aug 5, 2014, at 12:56 AM, Siegmar Gross wrote: > Hi, > > ye

Re: [OMPI users] problem with rankfile in openmpi-1.7.4rc2r30323

2014-01-23 Thread Ralph Castain
Okay, so this is a Sparc issue, not a rankfile one. I'm afraid my lack of time and access to that platform will mean this won't get fixed for 1.7.4, but I'll try to take a look at it when time permits. On Jan 22, 2014, at 10:52 PM, Siegmar Gross wrote: > Dear Ralph, > > the same problems oc

Re: [OMPI users] problem with rankfile in openmpi-1.7.4rc2r30323

2014-01-23 Thread Siegmar Gross
Dear Ralph, the same problems occur without rankfiles. tyr fd1026 102 which mpicc /usr/local/openmpi-1.7.4_64_cc/bin/mpicc tyr fd1026 103 mpiexec --report-bindings -np 2 \ -host tyr,sunpc1 hostname tyr fd1026 104 /opt/solstudio12.3/bin/sparcv9/dbx \ /usr/local/openmpi-1.7.4_64_cc/bin/mpiexe

Re: [OMPI users] problem with rankfile in openmpi-1.7.4rc2r30323

2014-01-22 Thread Ralph Castain
Hard to know how to address all that, Siegmar, but I'll give it a shot. See below. On Jan 22, 2014, at 5:34 AM, Siegmar Gross wrote: > Hi, > > yesterday I installed openmpi-1.7.4rc2r30323 on our machines > ("Solaris 10 x86_64", "Solaris 10 Sparc", and "openSUSE Linux > 12.1 x86_64" with Sun C

Re: [OMPI users] problem with rankfile and openmpi-1.6.4rc3r27923

2013-01-29 Thread Ralph Castain
Aha - I'm able to replicate it, will fix. On Jan 29, 2013, at 11:57 AM, Ralph Castain wrote: > Using an svn checkout of the current 1.6 branch, it works fine for me: > > [rhc@odin ~/v1.6]$ cat rf > rank 0=odin127 slot=0:0-1,1:0-1 > rank 1=odin128 slot=1 > > [rhc@odin ~/v1.6]$ mpirun -n 2 -rf .

Re: [OMPI users] problem with rankfile and openmpi-1.6.4rc3r27923

2013-01-29 Thread Ralph Castain
Using an svn checkout of the current 1.6 branch, it works fine for me: [rhc@odin ~/v1.6]$ cat rf rank 0=odin127 slot=0:0-1,1:0-1 rank 1=odin128 slot=1 [rhc@odin ~/v1.6]$ mpirun -n 2 -rf ./rf --report-bindings hostname [odin127.cs.indiana.edu:12078] MCW rank 0 bound to socket 0[core 0-1] socket 1

Re: [OMPI users] problem with rankfile in openmpi-1.6.4rc2

2013-01-25 Thread Ralph Castain
Found it! A trivial error (missing a break in a switch statement) that only impacts things if multiple sockets are specified in the slot_list. CMR filed to include the fix in 1.6.4 Thanks for your patience Ralph On Jan 24, 2013, at 7:50 PM, Ralph Castain wrote: > I built the current 1.6 branc

Re: [OMPI users] problem with rankfile in openmpi-1.6.4rc2

2013-01-24 Thread Ralph Castain
I built the current 1.6 branch (which hasn't seen any changes that would impact this function) and was able to execute it just fine on a single socket machine. I then gave it your slot-list, which of course failed as I don't have two active sockets (one is empty), but it appeared to parse the li

Re: [OMPI users] problem with rankfile and openmpi-1.6.2

2012-10-03 Thread Ralph Castain
I filed a bug fix for this one. One thing you should note, however: if you fail to provide a "-np N" argument to mpiexec, we assume you want ALL available slots filled. The rankfile will contain only those procs that you want specifically bound. The remaining procs will be unbound. So with
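The rule described in this message can be modelled in a few lines of Python. This is only a sketch of the stated behavior, not Open MPI code: the function name `plan_ranks` and its parameters are hypothetical, and treating an explicit `-np N` as simply launching N ranks is an assumption.

```python
def plan_ranks(available_slots, rankfile_ranks, np=None):
    """Map each launched rank to 'bound' or 'unbound'.

    Hypothetical model of the behavior described in the message,
    not taken from the Open MPI sources.
    """
    # Without -np, every available slot is assumed to be filled.
    total = available_slots if np is None else np
    listed = set(rankfile_ranks)
    # Only ranks named in the rankfile are bound; the rest run unbound.
    return {
        rank: ("bound" if rank in listed else "unbound")
        for rank in range(total)
    }
```

With 4 available slots and a rankfile naming only ranks 0 and 1, `plan_ranks(4, [0, 1])` reports ranks 2 and 3 as unbound, while `plan_ranks(4, [0, 1], np=2)` launches only the two bound ranks.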

Re: [OMPI users] problem with rankfile and openmpi-1.6.2

2012-10-03 Thread Ralph Castain
I saw your earlier note about this too. Just a little busy right now, but hope to look at it soon. Your rankfile looks fine, so undoubtedly a bug has crept into this rarely-used code path. On Oct 3, 2012, at 3:03 AM, Siegmar Gross wrote: > Hi, > > I want to test process bindings with a ran

Re: [OMPI users] problem with rankfile

2012-09-10 Thread Jeff Squyres
We actually include hwloc v1.3.2 in the OMPI v1.6 series. Can you download and try that on your machines? http://www.open-mpi.org/software/hwloc/v1.3/ In particular try the hwloc-bind executable (outside of OMPI), and see if binding works properly on your machines. I typically run a te

Re: [OMPI users] problem with rankfile

2012-09-10 Thread Ralph Castain
Hmmm...well, let's try to isolate this a little. Would you mind installing a copy of the current trunk on this machine and trying it? I ask because I'd like to better understand if the problem is in the actual binding mechanism (i.e., hwloc), or in the code that computes where to bind the proce

Re: [OMPI users] problem with rankfile

2012-09-10 Thread Siegmar Gross
Hi, > > are the following outputs helpful to find the error with > > a rankfile on Solaris? > > If you can't bind on the new Solaris machine, then the rankfile > won't do you any good. It looks like we are getting the incorrect > number of cores on that machine - is it possible that it has > hard

Re: [OMPI users] problem with rankfile

2012-09-07 Thread Ralph Castain
On Sep 7, 2012, at 5:41 AM, Siegmar Gross wrote: > Hi, > > are the following outputs helpful to find the error with > a rankfile on Solaris? If you can't bind on the new Solaris machine, then the rankfile won't do you any good. It looks like we are getting the incorrect number of cores on th

Re: [OMPI users] problem with rankfile

2012-09-07 Thread Siegmar Gross
Hi, are the following outputs helpful to find the error with a rankfile on Solaris? I wrapped long lines so that they are easier to read. Have you had time to look at the segmentation fault with a rankfile which I reported in my last email (see below)? "tyr" is a two processor single core machine

Re: [OMPI users] problem with rankfile

2012-09-05 Thread Ralph Castain
I couldn't really say for certain - I don't see anything obviously wrong with your syntax, and the code appears to be working or else it would fail on the other nodes as well. The fact that it fails solely on that machine seems suspect. Set aside the rankfile for the moment and try to just bind

Re: [OMPI users] problem with rankfile

2012-09-05 Thread Siegmar Gross
Hi, I'm new to rankfiles, so I played a little bit with different options. I thought that the following entry would be similar to an entry in an appfile and that MPI could place the process with rank 0 on any core of any processor. rank 0=tyr.informatik.hs-fulda.de Unfortunately it's not all

Re: [OMPI users] problem with rankfile

2012-09-04 Thread Siegmar Gross
Hi, > Are *all* the machines Sparc? Or just the 3rd one (rs0)? Yes, both machines are Sparc. I tried first in a homogeneous environment. tyr fd1026 106 psrinfo -v Status of virtual processor 0 as of: 09/04/2012 07:32:14 on-line since 08/31/2012 15:44:42. The sparcv9 processor operates at 160

Re: [OMPI users] problem with rankfile

2012-09-03 Thread Ralph Castain
Are *all* the machines Sparc? Or just the 3rd one (rs0)? On Sep 3, 2012, at 12:43 PM, Siegmar Gross wrote: > Hi, > > the man page for "mpiexec" shows the following: > > cat myrankfile > rank 0=aa slot=1:0-2 > rank 1=bb slot=0:0,1 > rank 2=cc slot=1-2 >
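The rankfile lines quoted in this thread (`slot=1:0-2`, `slot=0:0,1`, `slot=1-2`, `slot=0:0-1,1:0-1`, and a bare `rank 0=host`) can be read mechanically. The following Python sketch shows one way to do so. It is not the Open MPI parser: the names `parse_rank_line`, `parse_slot`, and `expand` are hypothetical, and treating a comma-separated token without a colon as a continuation of the preceding core list is an assumption made to cover the quoted examples.

```python
import re

# "rank <N>=<host>" with an optional "slot=<list>" part.
RANK_LINE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)(?:\s+slot\s*=\s*(\S+))?$")


def expand(spec):
    """Expand '0-2' to [0, 1, 2] and '1' to [1]."""
    if "-" in spec:
        lo, hi = spec.split("-")
        return list(range(int(lo), int(hi) + 1))
    return [int(spec)]


def parse_slot(slot):
    """Return a list of (socket, cores) pairs; socket is None if not given."""
    result = []
    for token in slot.split(","):
        if ":" in token:
            # "socket:cores" starts a new socket entry.
            sock, cores = token.split(":")
            result.append((int(sock), expand(cores)))
        elif result:
            # Assumption: no colon means more cores for the previous entry.
            result[-1][1].extend(expand(token))
        else:
            result.append((None, expand(token)))
    return result


def parse_rank_line(line):
    """Parse one rankfile line into (rank, host, slot list)."""
    match = RANK_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a rankfile line: %r" % line)
    rank, host, slot = match.groups()
    return int(rank), host, parse_slot(slot) if slot else []
```

For example, `parse_rank_line("rank 0=odin127 slot=0:0-1,1:0-1")` yields rank 0 on `odin127` with cores 0-1 on both socket 0 and socket 1, which matches the two-socket binding reported in the thread.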