Umm...actually, I said there isn't a bug to fix :-) I don't think there is a bug. I think it is doing what it should do.

Note that Geoffroy and I are specifically *not* talking about 1.3.1. We know that there are bugs in that release (specifically relating to multiple app_contexts, though there may be others), and in 1.3.2. We have been working on the OMPI trunk to fix the problems, and appear to have done so. Geoffroy's remaining observations are most likely due to building on one RHEL version and attempting to run on another.

You might try it again with the latest trunk tarball.

As for the nomenclature - that was decided by the folks who originally wrote that code. I don't have a personal stake in it, nor much of an opinion. However, note that we do differentiate between physical and logical CPUs. Your definitions correlate to our "physical" ones, while the rankfile mapping (in the absence of the 'P' qualifier) defaults to logical definitions. This may be the source of your confusion.
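
As a rough illustration of the physical vs. logical distinction (this is not Open MPI's actual code): assuming logical socket numbers are simply the positions of the physical IDs in sorted order, the two schemes on Geoffroy's machine, where PLPA reports "Socket 0 (ID 0)" and "Socket 1 (ID 3)", would relate as in this Python sketch. The function name is made up for the example.

```python
def logical_to_physical(physical_ids):
    """Map logical indices (0, 1, 2, ...) to the distinct physical IDs.

    Assumption for illustration only: logical numbers are handed out
    in increasing order of physical ID.
    """
    return dict(enumerate(sorted(set(physical_ids))))


# "physical id" values of processors 0-3 in Geoffroy's /proc/cpuinfo
sockets = logical_to_physical([0, 3, 0, 3])
print(sockets)  # prints {0: 0, 1: 3}
```

Under that assumption, logical socket 1 is the socket whose physical ID is 3, which is why a rankfile written with physical IDs but read as logical ones can point at a socket that does not exist.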

You might look at the paffinity documentation for a better explanation of physical vs logical numbering. If it isn't there or is inadequate, we can try to add more words - Jeff is particularly adept at doing so! :-)

HTH
Ralph


On May 4, 2009, at 7:49 PM, Gus Correa wrote:

Hi Ralph and Geoffroy

I've been following this thread with a lot of interest.
Setting processor affinity and pinning the processes to cores
was next on my "TODO" list, and I just started on it.

I tried to use three different versions of rankfile,
with OpenMPI 1.3.1 on a dual-socket quad-core
Opteron machine.
In all cases I've got errors similar to Geoffroy's.

The mpiexec command line is:

${MPIEXEC} \
        -prefix ${PREFIX} \
        -np ${NP} \
        -rf my_rankfile \
        -mca btl openib,sm,self \
        -mca mpi_leave_pinned 0 \
        -mca paffinity_base_verbose 5 \
        xhpl


I use Torque, and I generate the rankfile programmatically based
on the $PBS_NODEFILE.
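
A minimal sketch of that kind of generation, in Python. It assumes the nodefile lists each hostname once per allocated core (the usual Torque layout) and that the slot numbers are plain per-host CPU numbers, as in rankfile #2 below. The function name is hypothetical and this is not the script actually used here.

```python
from collections import defaultdict


def rankfile_from_nodefile(nodefile_lines):
    """Build rankfile lines: one rank per nodefile entry, slots counted per host."""
    next_slot = defaultdict(int)  # next unused slot number on each host
    hosts = (line.strip() for line in nodefile_lines if line.strip())
    lines = []
    for rank, host in enumerate(hosts):
        lines.append(f"rank {rank}={host} slot={next_slot[host]}")
        next_slot[host] += 1
    return lines


# Stand-in for the contents of $PBS_NODEFILE (hostnames are examples)
example_nodefile = ["node24\n", "node24\n", "node25\n"]
for line in rankfile_from_nodefile(example_nodefile):
    print(line)
```

In a real job the lines would come from the file named by $PBS_NODEFILE instead of the inline list.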

Here are three rank files I used:

#1 rankfile (trying to associate slot=physical_id:core_id from /proc/cpuinfo)
[gus@monk hpl]$ more my_rankfile
rank       0=node24      slot=0:0
rank       1=node24      slot=0:1
rank       2=node24      slot=0:2
rank       3=node24      slot=0:3
rank       4=node24      slot=1:0
rank       5=node24      slot=1:1
rank       6=node24      slot=1:2
rank       7=node24      slot=1:3


#2 rankfile (trying to associate slot=processor from /proc/cpuinfo)
[gus@monk hpl]$ more my_rankfile
rank       0=node24      slot=0
rank       1=node24      slot=1
rank       2=node24      slot=2
rank       3=node24      slot=3
rank       4=node24      slot=4
rank       5=node24      slot=5
rank       6=node24      slot=6
rank       7=node24      slot=7


#3 rankfile (Similar to #1 but with "p" that the FAQs say stands for "physical")
[gus@monk hpl]$ more my_rankfile
rank       0=node24      slot=p0:0
rank       1=node24      slot=p0:1
rank       2=node24      slot=p0:2
rank       3=node24      slot=p0:3
rank       4=node24      slot=p1:0
rank       5=node24      slot=p1:1
rank       6=node24      slot=p1:2
rank       7=node24      slot=p1:3

***

In all cases I get this error (just like Geoffroy):

******

Rankfile claimed host node24 that was not allocated or oversubscribed it's slots:

--------------------------------------------------------------------------
[node24.cluster:23762] [[59468,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 108
[node24.cluster:23762] [[59468,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 87
[node24.cluster:23762] [[59468,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 77
[node24.cluster:23762] [[59468,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/plm/tm/plm_tm_module.c at line 167
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.

********

I confess I am a bit confused about the nomenclature.

What do you call CPU in the rankfile context?
How about slot, core, and socket?

Linux keeps information about these items in /proc/cpuinfo,
in /sys/devices/system/cpu,
and in /sys/devices/system/node.
However, the nomenclature is different from OpenMPI's.

How can one use that information to build a correct rankfile?
I read the mpiexec man page and the FAQs but I am still confused.

Questions:
1) In the rankfile notation slot=cpu_num, is cpu_num the same as "processor" in /proc/cpuinfo, or is it the same as "physical id" in /proc/cpuinfo?

2) In the rankfile notation slot=socket_num:core_num, is socket_num the
same as "physical id" in /proc/cpuinfo, or something else?

3) Is core_num in the rankfile notation the same as "core id" or the same as "processor" in /proc/cpuinfo?
Or is it yet another thing?

Geoffroy sent the /proc/cpuinfo of his
Intel dual-socket dual-core machine.
I enclose the one from my AMD dual-socket quad-core below.
The architectures (non-NUMA vs. NUMA) are different and so are the
numbering schemes:

Geoffroy's numbers go like this (each column matches a single core):
processor---0-1-2-3
physical-id-0-3-0-3  (alternating physical IDs)
core--------0-0-1-1

Whereas my numbers go like this:
processor---0-1-2-3-4-5-6-7
physical-id-0-0-0-0-1-1-1-1 (physical IDs don't alternate)
core--------0-1-2-3-0-1-2-3


So, first I think a clarification about the nomenclature would
really help us build meaningful rankfiles.
I suggest relating the names in the rankfile to those in /proc/cpuinfo,
if possible (or to /sys/devices/system/cpu or /sys/devices/system/node).
(Other OSs may use different names though.)
The tables above show that things can get confusing to the user
if the connection between the two is not made.
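
For reference, tables like the ones above can be produced with a short Python sketch such as the following. It only reads the three /proc/cpuinfo fields discussed here ("processor", "physical id", "core id") and makes no claim about how OpenMPI itself numbers things. The function name and the abbreviated sample text are illustrative.

```python
def cpu_topology(cpuinfo_text):
    """Return (processor, physical id, core id) tuples from /proc/cpuinfo text."""
    wanted = ("processor", "physical id", "core id")
    rows = []
    current = {}
    # A trailing empty line makes sure the last record is flushed.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():  # blank line ends one processor record
            if "processor" in current:
                rows.append(tuple(current.get(key) for key in wanted))
            current = {}
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key in wanted:
            current[key] = int(value.strip())
    return rows


# Abbreviated records in the style of the AMD listing below
sample = """processor : 0
physical id : 0
core id : 0

processor : 4
physical id : 1
core id : 0
"""
for proc, socket, core in cpu_topology(sample):
    print(f"processor {proc}: physical id {socket}, core id {core}")
```

On a live system the text would come from reading /proc/cpuinfo instead of the inline sample.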

Second, as Ralph pointed out, there may be a bug to fix as well.

****

It would be great to have the rankfile functionality working.
However, the good news is that just setting processor affinity
works fine.
This is OK for now, since I am using the whole node.
The mpirun command line I used is:

${MPIEXEC} \
        -prefix ${PREFIX} \
        -np ${NP} \
        -mca mpi_paffinity_alone 1 \
        -mca btl openib,sm,self \
        -mca mpi_leave_pinned 0 \
        -mca paffinity_base_verbose 5 \
        xhpl


Thank you,
Gus Correa

################

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 4
model name      : Quad-Core AMD Opteron(tm) Processor 2376
stepping        : 2
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips        : 4625.83
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 4
model name      : Quad-Core AMD Opteron(tm) Processor 2376
stepping        : 2
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips        : 4623.16
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor       : 2
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 4
model name      : Quad-Core AMD Opteron(tm) Processor 2376
stepping        : 2
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips        : 4623.16
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor       : 3
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 4
model name      : Quad-Core AMD Opteron(tm) Processor 2376
stepping        : 2
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips        : 4622.82
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor       : 4
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 4
model name      : Quad-Core AMD Opteron(tm) Processor 2376
stepping        : 2
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 1
siblings        : 4
core id         : 0
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips        : 4623.15
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor       : 5
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 4
model name      : Quad-Core AMD Opteron(tm) Processor 2376
stepping        : 2
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 1
siblings        : 4
core id         : 1
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips        : 4623.16
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor       : 6
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 4
model name      : Quad-Core AMD Opteron(tm) Processor 2376
stepping        : 2
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 1
siblings        : 4
core id         : 2
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips        : 4622.83
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor       : 7
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 4
model name      : Quad-Core AMD Opteron(tm) Processor 2376
stepping        : 2
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 1
siblings        : 4
core id         : 3
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips        : 4623.16
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]




Geoffroy Pignot wrote:
Hi Ralph
Thanks for your extra tests. Before leaving, I just noticed a problem coming from running plpa across different RH distributions (<=> different Linux kernels). Indeed, I configure and compile openmpi on rhel4, then I run on rhel5. I think my problem comes from this approximation. I'll do a few more tests tomorrow morning (France) and keep you informed.
Regards
Geoffroy
   Message: 2
   Date: Mon, 4 May 2009 13:34:40 -0600
   From: Ralph Castain <r...@open-mpi.org>
   Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
   To: Open MPI Users <us...@open-mpi.org>
   Message-ID: <71d2d8cc0905041234m76eb5a9dx57a773997779d...@mail.gmail.com>
   Content-Type: text/plain; charset="iso-8859-1"
   Hmmm...I'm afraid I can't replicate the problem. All seems to be working
   just fine on the RHEL systems available to me. The procs indeed bind to the
   specified processors in every case.
   [rhc@odin ~/trunk]$ cat rankfile
   rank 0=odin001 slot=0
   rank 1=odin002 slot=1
   [rhc@odin mpi]$ mpirun -rf ../../../rankfile -n 2 --leave-session-attached -mca paffinity_base_verbose 5 ./mpi_spin
   [odin001.cs.indiana.edu:09297] paffinity slot assignment: slot_list == 0
   [odin001.cs.indiana.edu:09297] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
   [odin002.cs.indiana.edu:13566] paffinity slot assignment: slot_list == 1
   [odin002.cs.indiana.edu:13566] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
   Suspended
   [rhc@odin mpi]$ ssh odin001
   [rhc@odin001 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
   S    rhc        0  9296  0.0 orted
   RLl  rhc        0  9297  100 mpi_spin
   [rhc@odin mpi]$ ssh odin002
   [rhc@odin002 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
   S    rhc        0 13562  0.0 orted
   RLl  rhc        1 13566  102 mpi_spin
   Not sure where to go from here...perhaps someone else can spot the problem?
   Ralph
   On Mon, May 4, 2009 at 8:28 AM, Ralph Castain <r...@open-mpi.org> wrote:
    > Unfortunately, I didn't write any of that code - I was just fixing the
    > mapper so it would properly map the procs. From what I can tell, the proper
    > things are happening there.
    >
    > I'll have to dig into the code that specifically deals with parsing the
    > results to bind the processes. Afraid that will take awhile longer - pretty
    > dark in that hole.
    >
    >
    >
    > On Mon, May 4, 2009 at 8:04 AM, Geoffroy Pignot <geopig...@gmail.com> wrote:
    >
    >> Hi,
    >>
    >> So, there are no more crashes with my "crazy" mpirun command. But the
    >> paffinity feature seems to be broken. Indeed I am not able to pin my
    >> processes.
    >>
    >> Simple test with a program using your plpa library :
    >>
    >> r011n006% cat hostf
    >> r011n006 slots=4
    >>
    >> r011n006% cat rankf
    >> rank 0=r011n006 slot=0   ----> bind to CPU 0 , exact ?
    >>
    >> r011n006% /tmp/HALMPI/openmpi-1.4a/bin/mpirun --hostfile hostf --rankfile rankf --wdir /tmp -n 1 a.out
    >>  >>> PLPA Number of processors online: 4
    >>  >>> PLPA Number of processor sockets: 2
    >>  >>> PLPA Socket 0 (ID 0): 2 cores
    >>  >>> PLPA Socket 1 (ID 3): 2 cores
    >>
    >> Ctrl+Z
    >> r011n006%bg
    >>
    >> r011n006% ps axo stat,user,psr,pid,pcpu,comm | grep gpignot
    >> R+   gpignot    3  9271 97.8 a.out
    >>
    >> In fact, whatever slot number I put in my rankfile, a.out always runs
    >> on CPU 3. I was expecting it on CPU 0 according to my cpuinfo file
    >> (see below).
    >> The result is the same if I try another syntax (rank 0=r011n006 slot=0:0
    >> bind to socket 0 - core 0 , exact ? )
    >>
    >> Thanks in advance
    >>
    >> Geoffroy
    >>
    >> PS: I run on rhel5
    >>
    >> r011n006% uname -a
    >> Linux r011n006 2.6.18-92.1.1NOMAP32.el5 #1 SMP Sat Mar 15 01:46:39 CDT 2008 x86_64 x86_64 x86_64 GNU/Linux
    >>
    >> My configure is :
    >>  ./configure --prefix=/tmp/openmpi-1.4a --libdir='${exec_prefix}/lib64' --disable-dlopen --disable-mpi-cxx --enable-heterogeneous
    >>
    >>
    >> r011n006% cat /proc/cpuinfo
    >> processor       : 0
    >> vendor_id       : GenuineIntel
    >> cpu family      : 6
    >> model           : 15
    >> model name      : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
    >> stepping        : 6
    >> cpu MHz         : 2660.007
    >> cache size      : 4096 KB
    >> physical id     : 0
    >> siblings        : 2
    >> core id         : 0
    >> cpu cores       : 2
    >> fpu             : yes
    >> fpu_exception   : yes
    >> cpuid level     : 10
    >> wp              : yes
    >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
    >> bogomips        : 5323.68
    >> clflush size    : 64
    >> cache_alignment : 64
    >> address sizes   : 36 bits physical, 48 bits virtual
    >> power management:
    >>
    >> processor       : 1
    >> vendor_id       : GenuineIntel
    >> cpu family      : 6
    >> model           : 15
    >> model name      : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
    >> stepping        : 6
    >> cpu MHz         : 2660.007
    >> cache size      : 4096 KB
    >> physical id     : 3
    >> siblings        : 2
    >> core id         : 0
    >> cpu cores       : 2
    >> fpu             : yes
    >> fpu_exception   : yes
    >> cpuid level     : 10
    >> wp              : yes
    >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
    >> bogomips        : 5320.03
    >> clflush size    : 64
    >> cache_alignment : 64
    >> address sizes   : 36 bits physical, 48 bits virtual
    >> power management:
    >>
    >> processor       : 2
    >> vendor_id       : GenuineIntel
    >> cpu family      : 6
    >> model           : 15
    >> model name      : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
    >> stepping        : 6
    >> cpu MHz         : 2660.007
    >> cache size      : 4096 KB
    >> physical id     : 0
    >> siblings        : 2
    >> core id         : 1
    >> cpu cores       : 2
    >> fpu             : yes
    >> fpu_exception   : yes
    >> cpuid level     : 10
    >> wp              : yes
    >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
    >> bogomips        : 5319.39
    >> clflush size    : 64
    >> cache_alignment : 64
    >> address sizes   : 36 bits physical, 48 bits virtual
    >> power management:
    >>
    >> processor       : 3
    >> vendor_id       : GenuineIntel
    >> cpu family      : 6
    >> model           : 15
    >> model name      : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
    >> stepping        : 6
    >> cpu MHz         : 2660.007
    >> cache size      : 4096 KB
    >> physical id     : 3
    >> siblings        : 2
    >> core id         : 1
    >> cpu cores       : 2
    >> fpu             : yes
    >> fpu_exception   : yes
    >> cpuid level     : 10
    >> wp              : yes
    >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
    >> bogomips        : 5320.03
    >> clflush size    : 64
    >> cache_alignment : 64
    >> address sizes   : 36 bits physical, 48 bits virtual
    >> power management:
    >>
    >>
    >>> ------------------------------
    >>>
    >>> Message: 2
    >>> Date: Mon, 4 May 2009 04:45:57 -0600
    >>> From: Ralph Castain <r...@open-mpi.org>
    >>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
    >>> To: Open MPI Users <us...@open-mpi.org>
    >>> Message-ID: <D01D7B16-4B47-46F3-AD41-D1A90B2E4927@open-mpi.org>
    >>>
    >>> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
    >>>        DelSp="yes"
    >>>
    >>> My apologies - I wasn't clear enough. You need a tarball from r21111
    >>> or greater...such as:
    >>>
    >>> http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r21142.tar.gz
    >>>
    >>> HTH
    >>> Ralph
    >>>
    >>>
    >>> On May 4, 2009, at 2:14 AM, Geoffroy Pignot wrote:
    >>>
    >>> > Hi ,
    >>> >
    >>> > I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my
    >>> > command doesn't work
    >>> >
    >>> > cat rankf:
    >>> > rank 0=node1 slot=*
    >>> > rank 1=node2 slot=*
    >>> >
    >>> > cat hostf:
    >>> > node1 slots=2
    >>> > node2 slots=2
    >>> >
    >>> > mpirun --rankfile rankf --hostfile hostf --host node1 -n 1
    >>> > hostname : --host node2 -n 1 hostname
    >>> >
    >>> > Error, invalid rank (1) in the rankfile (rankf)
    >>> >
    >>> >
    >>> > --------------------------------------------------------------------------
    >>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 403
    >>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 86
    >>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 86
    >>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 1016
    >>> >
    >>> >
    >>> > Ralph, could you tell me if my command syntax is correct or not? If
    >>> > not, could you give me the expected one?
    >>> >
    >>> > Regards
    >>> >
    >>> > Geoffroy
    >>> >
    >>> >
    >>> >
    >>> >
    >>> > 2009/4/30 Geoffroy Pignot <geopig...@gmail.com>
    >>> > Immediately Sir !!! :)
    >>> >
    >>> > Thanks again Ralph
    >>> >
    >>> > Geoffroy
    >>> >
    >>> >
    >>> >
    >>> >
    >>> >
    >>> > ------------------------------
    >>> >
    >>> > Message: 2
    >>> > Date: Thu, 30 Apr 2009 06:45:39 -0600
    >>> > From: Ralph Castain <r...@open-mpi.org>
    >>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
    >>> > To: Open MPI Users <us...@open-mpi.org>
    >>> > Message-ID: <71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com>
    >>> > Content-Type: text/plain; charset="iso-8859-1"
    >>> >
    >>> > I believe this is fixed now in our development trunk - you can download any
    >>> > tarball starting from last night and give it a try, if you like. Any
    >>> > feedback would be appreciated.
    >>> >
    >>> > Ralph
    >>> >
    >>> >
    >>> > On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
    >>> >
    >>> > Ah now, I didn't say it -worked-, did I? :-)
    >>> >
    >>> > Clearly a bug exists in the program. I'll try to take a look at it
    >>> > (if Lenny doesn't get to it first), but it won't be until later in the
    >>> > week.
    >>> >
    >>> > On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
    >>> >
    >>> > I agree with you Ralph, and that's what I expect from openmpi, but my
    >>> > second example shows that it's not working
    >>> >
    >>> > cat hostfile.0
    >>> >   r011n002 slots=4
    >>> >   r011n003 slots=4
    >>> >
    >>> >  cat rankfile.0
    >>> >    rank 0=r011n002 slot=0
    >>> >    rank 1=r011n003 slot=1
    >>> >
    >>> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
    >>> > ### CRASHED
    >>> >
    >>> > > > Error, invalid rank (1) in the rankfile (rankfile.0)
    >>> > > >
    >>> > > > --------------------------------------------------------------------------
    >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 404
    >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
    >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
    >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 985
    >>> > > > --------------------------------------------------------------------------
    >>> > > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
    >>> > > > launch so we are aborting.
    >>> > > >
    >>> > > > There may be more information reported by the environment (see above).
    >>> > > >
    >>> > > > This may be because the daemon was unable to find all the needed shared
    >>> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
    >>> > > > location of the shared libraries on the remote nodes and this will
    >>> > > > automatically be forwarded to the remote nodes.
    >>> > > > --------------------------------------------------------------------------
    >>> > > > --------------------------------------------------------------------------
    >>> > > > orterun noticed that the job aborted, but has no info as to the process
    >>> > > > that caused that situation.
    >>> > > > --------------------------------------------------------------------------
    >>> > > > orterun: clean termination accomplished
    >>> >
    >>> >
    >>> >
    >>> > Message: 4
    >>> > Date: Tue, 14 Apr 2009 06:55:58 -0600
    >>> > From: Ralph Castain <r...@lanl.gov>
    >>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
    >>> > To: Open MPI Users <us...@open-mpi.org>
    >>> > Message-ID: <f6290ada-a196-43f0-a853-cbcb802d8...@lanl.gov>
    >>> > Content-Type: text/plain; charset="us-ascii"; Format="flowed";
    >>> >       DelSp="yes"
    >>> >
    >>> > The rankfile cuts across the entire job - it isn't applied on an
    >>> > app_context basis. So the ranks in your rankfile must correspond to
    >>> > the eventual rank of each process in the cmd line.
    >>> >
    >>> > Unfortunately, that means you have to count ranks. In your case, you
    >>> > only have four, so that makes life easier. Your rankfile would look
    >>> > something like this:
    >>> >
    >>> > rank 0=r001n001 slot=0
    >>> > rank 1=r001n002 slot=1
    >>> > rank 2=r001n001 slot=1
    >>> > rank 3=r001n002 slot=2
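
The rank counting described above can be sketched in Python as follows. The assumption, labeled as such, is that ranks are numbered consecutively across app contexts in the order they appear on the command line; the function name is made up, and the slot for each rank is left for the user to choose.

```python
def global_ranks(app_contexts):
    """app_contexts: list of (num_procs, host) in command-line order.

    Returns (rank, host) pairs, assuming ranks are numbered consecutively
    across app contexts in the order they appear on the command line.
    """
    pairs = []
    rank = 0
    for num_procs, host in app_contexts:
        for _ in range(num_procs):
            pairs.append((rank, host))
            rank += 1
    return pairs


# The four "-n 1 -host ..." app contexts of the master/slave command
contexts = [(1, "r001n001"), (1, "r001n002"), (1, "r001n001"), (1, "r001n002")]
for rank, host in global_ranks(contexts):
    print(f"rank {rank}={host}")
```

The host order this prints matches the four-line rankfile given above.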
    >>> >
    >>> > HTH
    >>> > Ralph
    >>> >
    >>> > On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
    >>> >
    >>> > > Hi,
    >>> > >
    >>> > > I agree that my examples are not very clear. What I want to do is to
    >>> > > launch a multi-executable application (masters-slaves) and benefit from the
    >>> > > processor affinity.
    >>> > > Could you show me how to convert this command, using the -rf option
    >>> > > (whatever the affinity is)?
    >>> > >
    >>> > > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002
    >>> > > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1
    >>> > > -host r001n002 slave.x options4
    >>> > >
    >>> > > Thanks for your help
    >>> > >
    >>> > > Geoffroy
    >>> > >
    >>> > >
    >>> > >
    >>> > >
    >>> > >
    >>> > > Message: 2
    >>> > > Date: Sun, 12 Apr 2009 18:26:35 +0300
    >>> > > From: Lenny Verkhovsky <lenny.verkhov...@gmail.com>
    >>> > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
    >>> > > To: Open MPI Users <us...@open-mpi.org>
    >>> > > Message-ID: <453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com>
    >>> > > Content-Type: text/plain; charset="iso-8859-1"
    >>> > >
    >>> > > Hi,
    >>> > >
    >>> > > The first "crash" is OK, since your rankfile has ranks 0 and 1 defined,
    >>> > > while n=1, which means only rank 0 is present and can be allocated.
    >>> > >
    >>> > > NP must be greater than the largest rank in the rankfile (ranks start at 0).
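
Since ranks are numbered from 0, a job of NP processes only has ranks 0 to NP-1, so this can be checked before launching. A small Python sketch follows; the regular expression covers only the "rank N=host slot=X" form used in this thread, and the function name is hypothetical.

```python
import re

# Matches lines such as "rank 0=r011n002 slot=0" or "rank 4=node24 slot=1:0"
RANK_LINE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot=(\S+)$")


def ranks_out_of_range(rankfile_lines, np):
    """Return the ranks in the rankfile that cannot exist in a job of np processes."""
    bad = []
    for line in rankfile_lines:
        match = RANK_LINE.match(line.strip())
        if match and int(match.group(1)) >= np:
            bad.append(int(match.group(1)))
    return bad


rankfile = ["rank 0=r011n002 slot=0", "rank 1=r011n003 slot=1"]
print(ranks_out_of_range(rankfile, np=1))  # prints [1]
```

With np=2 the same rankfile yields an empty list, i.e. every rank it names exists in the job.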
    >>> > >
    >>> > > What exactly are you trying to do ?
    >>> > >
    >>> > > I tried to recreate your seqv but all I got was
    >>> > >
    >>> > > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
    >>> > > [witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux"
    >>> > > uses an MCA interface that is not recognized (component MCA v1.0.0 !=
    >>> > > supported MCA v2.0.0) -- ignored
    >>> > > --------------------------------------------------------------------------
    >>> > > It looks like opal_init failed for some reason; your parallel process is
    >>> > > likely to abort. There are many reasons that a parallel process can
    >>> > > fail during opal_init; some of which are due to configuration or
    >>> > > environment problems. This failure appears to be an internal failure;
    >>> > > here's some additional information (which may only be relevant to an
    >>> > > Open MPI developer):
    >>> > >
    >>> > >  opal_carto_base_select failed
    >>> > >  --> Returned value -13 instead of OPAL_SUCCESS
    >>> > > --------------------------------------------------------------------------
    >>> > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../orte/runtime/orte_init.c at line 78
    >>> > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../orte/orted/orted_main.c at line 344
    >>> > > --------------------------------------------------------------------------
    >>> > > A daemon (pid 11629) died unexpectedly with status 243 while attempting
    >>> > > to launch so we are aborting.
    >>> > >
    >>> > > There may be more information reported by the environment (see above).
    >>> > >
    >>> > > This may be because the daemon was unable to find all the needed shared
    >>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
    >>> > > location of the shared libraries on the remote nodes and this will
    >>> > > automatically be forwarded to the remote nodes.
    >>> > > --------------------------------------------------------------------------
    >>> > > --------------------------------------------------------------------------
    >>> > > mpirun noticed that the job aborted, but has no info as to the process
    >>> > > that caused that situation.
    >>> > > --------------------------------------------------------------------------
    >>> > > mpirun: clean termination accomplished
    >>> > >
    >>> > >
    >>> > > Lenny.
    >>> > >
    >>> > >
    >>> > > On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote:
    >>> > > >
    >>> > > > Hi ,
    >>> > > >
    >>> > > > I am currently testing the process affinity capabilities of
    >>> > > > Open MPI and I would like to know whether the rankfile behaviour
    >>> > > > I describe below is normal or not.
    >>> > > >
    >>> > > > cat hostfile.0
    >>> > > > r011n002 slots=4
    >>> > > > r011n003 slots=4
    >>> > > >
    >>> > > > cat rankfile.0
    >>> > > > rank 0=r011n002 slot=0
    >>> > > > rank 1=r011n003 slot=1
    >>> > > >
    >>> > > >
    >>> > > >
    >>> > > > ##################################################################################
    >>> > > >
    >>> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname ### OK
    >>> > > > r011n002
    >>> > > > r011n003
    >>> > > >
    >>> > > > ##################################################################################
    >>> > > > but
    >>> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
    >>> > > > ### CRASHED
    >>> > > > *
    >>> > > > --------------------------------------------------------------------------
    >>> > > > Error, invalid rank (1) in the rankfile (rankfile.0)
    >>> > > > --------------------------------------------------------------------------
    >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
    >>> > > > rmaps_rank_file.c at line 404
    >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
    >>> > > > base/rmaps_base_map_job.c at line 87
    >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
    >>> > > > base/plm_base_launch_support.c at line 77
    >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
    >>> > > > plm_rsh_module.c at line 985
    >>> > > > --------------------------------------------------------------------------
    >>> > > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
    >>> > > > launch so we are aborting.
    >>> > > >
    >>> > > > There may be more information reported by the environment (see above).
    >>> > > >
    >>> > > > This may be because the daemon was unable to find all the needed shared
    >>> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
    >>> > > > location of the shared libraries on the remote nodes and this will
    >>> > > > automatically be forwarded to the remote nodes.
    >>> > > > --------------------------------------------------------------------------
    >>> > > > --------------------------------------------------------------------------
    >>> > > > orterun noticed that the job aborted, but has no info as to the process
    >>> > > > that caused that situation.
    >>> > > > --------------------------------------------------------------------------
    >>> > > > orterun: clean termination accomplished
    >>> > > > *
    >>> > > > It seems that the rankfile option is not propagated to the second
    >>> > > > command line; there is no global understanding of the ranking inside
    >>> > > > an mpirun command.
    >>> > > >
    >>> > > > ##################################################################################
    >>> > > >
    >>> > > > Assuming that, I tried to provide a rankfile to each command line:
    >>> > > >
    >>> > > > cat rankfile.0
    >>> > > > rank 0=r011n002 slot=0
    >>> > > >
    >>> > > > cat rankfile.1
    >>> > > > rank 0=r011n003 slot=1
    >>> > > >
    >>> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1
    >>> > > > -n 1 hostname ### CRASHED
    >>> > > > *[r011n002:28778] *** Process received signal ***
    >>> > > > [r011n002:28778] Signal: Segmentation fault (11)
    >>> > > > [r011n002:28778] Signal code: Address not mapped (1)
    >>> > > > [r011n002:28778] Failing at address: 0x34
    >>> > > > [r011n002:28778] [ 0] [0xffffe600]
    >>> > > > [r011n002:28778] [ 1] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x55d) [0x5557decd]
    >>> > > > [r011n002:28778] [ 2] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x117) [0x555842a7]
    >>> > > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so [0x556098c0]
    >>> > > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
    >>> > > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
    >>> > > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
    >>> > > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
    >>> > > > [r011n002:28778] *** End of error message ***
    >>> > > > Segmentation fault (core dumped)*
    >>> > > >
    >>> > > >
    >>> > > >
    >>> > > > I hope that I've found a bug, because it would be very important for
    >>> > > > me to have this kind of capability: launching a multi-executable
    >>> > > > mpirun command line and being able to bind my executables and sockets
    >>> > > > together.
    >>> > > >
    >>> > > > Thanks in advance for your help
    >>> > > >
    >>> > > > Geoffroy
    >>> > > _______________________________________________
    >>> > > users mailing list
    >>> > > us...@open-mpi.org
    >>> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
------------------------------------------------------------------------
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
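
For readers following the thread, the rank check discussed above can be illustrated with a small sketch. It is not Open MPI's implementation; it only assumes rankfile lines of the form "rank N=host slot=S" as quoted in the thread, and that a job started with -n NP has ranks 0 through NP-1. The names parse_rankfile and ranks_out_of_range are hypothetical helpers made up for this illustration.

```python
import re

# Matches lines such as "rank 1=r011n003 slot=1" (the form quoted in the thread).
RANK_LINE = re.compile(r"^\s*rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)\s*$")


def parse_rankfile(text):
    """Return {rank: (host, slot)} for every 'rank N=host slot=S' line.

    Lines that do not match this simple form are ignored (an assumption of
    this sketch, not a statement about Open MPI's own parser).
    """
    entries = {}
    for line in text.splitlines():
        match = RANK_LINE.match(line)
        if match:
            entries[int(match.group(1))] = (match.group(2), match.group(3))
    return entries


def ranks_out_of_range(text, np):
    """List the ranks in the rankfile that cannot exist in a job of np processes."""
    return sorted(rank for rank in parse_rankfile(text) if rank >= np)


# The first rankfile.0 from the thread.
RANKFILE_0 = """\
rank 0=r011n002 slot=0
rank 1=r011n003 slot=1
"""

print(ranks_out_of_range(RANKFILE_0, 2))  # -n 2: prints []
print(ranks_out_of_range(RANKFILE_0, 1))  # -n 1: prints [1]
```

With -n 1 the sketch flags rank 1, which corresponds to the "invalid rank (1) in the rankfile" message reported for the first crash.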
