Re: [OMPI users] Potential bug in creating MPI_GROUP_EMPTY handling

2011-03-18 Thread Jack Bryan



> Date: Thu, 17 Mar 2011 23:40:31 +0100
> From: dominik.goedd...@math.tu-dortmund.de
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Potential bug in creating MPI_GROUP_EMPTY handling
> 
> glad we could help and the two hours of stripping things down were 
> effectively not wasted. Also good to hear (implicitly) that we were not 
> too stupid to understand the MPI standard...
> 
> Since to the best of my understanding, our workaround is practically 
> overhead-free, we went ahead and coded everything up analogously to the 
> workaround, i.e. we don't rely on / wait for an immediate fix.
> 
> Please let us know if further information is needed.
> 
> Thanks,
> 
> dom
> 
> On 03/17/2011 05:10 PM, Jeff Squyres wrote:
> > Sorry for the late reply, but many thanks for the bug report and reliable 
> > reproducer.
> >
> > I've confirmed the problem and filed a bug about this:
> >
> >   https://svn.open-mpi.org/trac/ompi/ticket/2752
> >
> >
> > On Mar 6, 2011, at 6:12 PM, Dominik Goeddeke wrote:
> >
> >> The attached example code (stripped down from a bigger app) demonstrates a 
> >> way to trigger a severe crash in all recent ompi releases but not in a 
> >> bunch of latest MPICH2 releases. The code is minimalistic and boils down 
> >> to the call
> >>
> >> MPI_Comm_create(MPI_COMM_WORLD, MPI_GROUP_EMPTY, &dummy_comm);
> >>
> >> which isn't supposed to be illegal. Please refer to the (well-documented) 
> >> code for details on the high-dimensional cross product I tested (on ubuntu 
> >> 10.04 LTS), a potential workaround (which isn't supposed to be necessary I 
> >> think) and an exemplary stack trace.
> >>
> >> Instructions: mpicc test.c -Wall -O0 && mpirun -np 2 ./a.out
> >>
> >> Thanks!
> >>
> >> dom
> >>
> >>
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> 
> 
> -- 
> Dr. Dominik Göddeke
> Institut für Angewandte Mathematik
> Technische Universität Dortmund
> http://www.mathematik.tu-dortmund.de/~goeddeke
> Tel. +49-(0)231-755-7218  Fax +49-(0)231-755-5933
> 
> 
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
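For readers without the attachment, the failing call boils down to something like the following minimal sketch (illustrative only; the original test.c exercises more combinations). Per the MPI standard, ranks that pass a group they are not a member of -- including MPI_GROUP_EMPTY -- should simply receive MPI_COMM_NULL back:

```cpp
// Minimal reproducer sketch (not the attached test.c).
// The legal call below crashed the affected Open MPI releases (ticket #2752);
// the standard-conforming result is MPI_COMM_NULL on every rank.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm dummy_comm = MPI_COMM_NULL;
    MPI_Comm_create(MPI_COMM_WORLD, MPI_GROUP_EMPTY, &dummy_comm);

    if (dummy_comm == MPI_COMM_NULL)
        std::printf("got MPI_COMM_NULL, as the standard requires\n");

    MPI_Finalize();
    return 0;
}
```

Built with mpicc/mpic++ and launched with `mpirun -np 2`, every rank should print the message rather than crash; the practically overhead-free workaround mentioned above amounts to skipping the call for empty groups and treating the result as MPI_COMM_NULL.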
  

[OMPI users] OMPI free() error

2011-03-18 Thread Jack Bryan

Hi, 
I am running a C++ program with OMPI. I got this error:

*** glibc detected *** /nsga2b: free(): invalid next size (fast): 
0x01817a90 ***

I used GDB: 

=== Backtrace: =
Program received signal SIGABRT, Aborted.
0x0038b8830265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0038b8830265 in raise () from /lib64/libc.so.6
#1  0x0038b8831d10 in abort () from /lib64/libc.so.6
#2  0x0038b886a99b in __libc_message () from /lib64/libc.so.6
#3  0x0038b887245f in _int_free () from /lib64/libc.so.6
#4  0x0038b88728bb in free () from /lib64/libc.so.6
#5  0x0044a4e3 in workerRunTask (message_to_master_type=0x38c06efe18, 
    nodeSize=2, myRank=1, xVSize=84, objSize=7, 
    xdata_to_workers_type=0x1206350, 
    recvXDataVec=std::vector of length 0, capacity 84, myNsga2=..., 
    Mpara_to_workers_type=0x1205390, events=0x7fffb1f0, netplan=...)
    at src/nsga2/workerRunTask.cpp:447
#6  0x004514d9 in main (argc=1, argv=0x7fffcb48)
    at src/nsga2/main-parallel2.cpp:425
-

In valgrind, there are some invalid reads and writes, but no errors about 
this free(): invalid next size.
---
(populp.ind)->xreal  = new double[nreal];
(populp.ind)->obj    = new double[nobj];
(populp.ind)->constr = new double[ncon];
(populp.ind)->xbin   = new double[nbin];
if ((populp.ind)->xreal == NULL || (populp.ind)->obj == NULL ||
    (populp.ind)->constr == NULL || (populp.ind)->xbin == NULL )
{
#ifdef DEBUG_workerRunTask
    cout << "In workerRunTask(), I am rank " << myRank << " (populp.ind)->xreal "
            "or (populp.ind)->obj or (populp.ind)->constr or (populp.ind)->xbin "
            "is NULL.\n\n" << endl;
#endif
}

delete [] (populp.ind)->xreal;
delete [] (populp.ind)->xbin;
delete [] (populp.ind)->obj;
delete [] (populp.ind)->constr;
delete [] sendResultArrayPr;

thanks
Any help is really appreciated. 
  

Re: [OMPI users] OMPI seg fault by a class with weird address.

2011-03-18 Thread Jack Bryan

thanks, 
I forgot to set up storage capacity for a vector before using the [] operator 
on it. 
thanks

> Subject: Re: [OMPI users] OMPI seg fault by a class with weird address.
> From: jsquy...@cisco.com
> Date: Wed, 16 Mar 2011 20:20:20 -0400
> CC: us...@open-mpi.org
> To: dtustud...@hotmail.com
> 
> Make sure you have the latest version of valgrind.
> 
> But it definitely does highlight what could be real problems if you read down 
> far enough in the output.
> 
> > ==18729== Invalid write of size 8
> > ==18729==at 0x443BEF: initPopPara(population*, 
> > std::vector > std::allocator >&, initParaType&, int, int, 
> > std::vector >&) (main-parallel2.cpp:552)
> > ==18729==by 0x44F12E: main (main-parallel2.cpp:204)
> > ==18729==  Address 0x62c9da0 is 0 bytes after a block of size 0 alloc'd
> > ==18729==at 0x4A0666E: operator new(unsigned long) 
> > (vg_replace_malloc.c:220)
> > ==18729==by 0x4573E4: void 
> > std::__uninitialized_fill_n_aux > message_para_to_workersT>(message_para_to_workersT*, unsigned long, 
> > message_para_to_workersT const&, __false_type) (new_allocator.h:88)
> > ==18729==by 0x4576CF: void 
> > std::__uninitialized_fill_n_a > message_para_to_workersT, 
> > message_para_to_workersT>(message_para_to_workersT*, unsigned long, 
> > message_para_to_workersT const&, std::allocator) 
> > (stl_uninitialized.h:218)
> > ==18729==by 0x44EE2E: main (stl_vector.h:218)
> 
> The above is an invalid write of size 8 -- you're essentially writing 
> outside of an array. 
> 
> Valgrind is showing you the call stack to how it got there.  Looks like you 
> new'ed or malloc'ed a block of size 0 and then tried to write something to 
> it.  Writing to memory that you don't own is a no-no; it can cause Very Bad 
> Things to happen.
> 
> You should probably investigate this, and the other issues that it is 
> reporting (e.g., the next invalid read of size 8).
> 
> > ==18729==
> > ==18729== Invalid read of size 8
> > ==18729==at 0x44F13A: main (main-parallel2.cpp:208)
> > ==18729==  Address 0x62c9d60 is 0 bytes after a block of size 0 alloc'd
> > ==18729==at 0x4A0666E: operator new(unsigned long) 
> > (vg_replace_malloc.c:220)
> > ==18729==by 0x45733D: void 
> > std::__uninitialized_fill_n_aux > message_para_to_workersT>(message_para_to_workersT*, unsigned long, 
> > message_para_to_workersT const&, __false_type) (new_allocator.h:88)
> > ==18729==by 0x4576CF: void 
> > std::__uninitialized_fill_n_a > message_para_to_workersT, 
> > message_para_to_workersT>(message_para_to_workersT*, unsigned long, 
> > message_para_to_workersT const&, std::allocator) 
> > (stl_uninitialized.h:218)
> > ==18729==by 0x44EE2E: main (stl_vector.h:218)
> > ==18729==
> > 
> > valgrind: m_mallocfree.c:225 (mk_plain_bszB): Assertion 'bszB != 0' failed.
> > valgrind: This is probably caused by your program erroneously writing past 
> > the
> > end of a heap block and corrupting heap metadata.  If you fix any
> > invalid writes reported by Memcheck, this assertion failure will
> > 
> > probably go away.  Please try that before reporting this as a bug.
> > 
> > ==18729==at 0x38029D5C: report_and_quit (m_libcassert.c:145)
> > ==18729==by 0x3802A032: vgPlain_assert_fail (m_libcassert.c:217)
> > ==18729==by 0x38035645: vgPlain_arena_malloc (m_mallocfree.c:225)
> > ==18729==by 0x38002BB5: vgMemCheck_new_block (mc_malloc_wrappers.c:199)
> > ==18729==by 0x38002F6B: vgMemCheck___builtin_new 
> > (mc_malloc_wrappers.c:246)
> > ==18729==by 0x3806070C: do_client_request (scheduler.c:1362)
> > ==18729==by 0x38061D30: vgPlain_scheduler (scheduler.c:1061)
> > ==18729==by 0x38085E6E: run_a_thread_NORETURN (syswrap-linux.c:91)
> > 
> > sched status:
> >   running_tid=1
> > 
> > Thread 1: status = VgTs_Runnable
> > ==18729==at 0x4A0666E: operator new(unsigned long) 
> > (vg_replace_malloc.c:220)
> > ==18729==by 0x464506: __gnu_cxx::new_allocator::allocate(unsigned 
> > long, void const*) (new_allocator.h:88)
> > ==18729==by 0x46452E: std::_Vector_base 
> > >::_M_allocate(unsigned long) (stl_vector.h:127)
> > ==18729==by 0x464560: std::_Vector_base 
> > >::_Vector_base(unsigned long, std::allocator const&) 
> > (stl_vector.h:113)
> > ==18729==by 0x464B6A: std::vector 
> > >::vector(unsigned long, int const&, std::allocator const&) 
> > (stl_vector.h:216)
> > ==18729==by 0x488F62: Index::Index() (index.cpp:20)
> > ==18729==by 0x489147: ReadFile(char const*) (index.cpp:86)
> > ==18729==by 0x48941C: ImportIndices() (index.cpp:121)
> > ==18729==by 0x445D00: myNeplanTaskScheduler(CNSGA2*, int, int, int, 
> > population*, char, int, std::vector > std::allocator >&, ompi_datatype_t*, int&, int&, 
> > std::vector >, 
> > std::allocator > > >&, 
> > std::vector >, 
> > std::allocator > > >&, 
> > std::vector >&, int, 
> > std::vector >, 
> > std::allocator > > >&, 
> > ompi_datatype_t*, int, ompi_datatype_t*, int) (myNetplanScheduler.cpp:109)
> >

Re: [OMPI users] OMPI free() error

2011-03-18 Thread Ashley Pittman

On 18 Mar 2011, at 06:07, Jack Bryan wrote:

> Hi, 
> 
> I am running a C++ program with OMPI.
> I got error: 
> 
> *** glibc detected *** /nsga2b: free(): invalid next size (fast): 
> 0x01817a90 ***

This error indicates that when glibc tried to free some memory the internal 
data structures it uses were corrupt.

> In valgrind, 
> 
> there are some invalid read and write butno errors about this 
>  free(): invalid next size .

You need to fix the invalid write errors; the error above is almost certainly a 
symptom of them.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] OMPI free() error

2011-03-18 Thread Jeff Squyres
Getting deeper into valgrind- and debugger-identified errors is somewhat 
outside the scope of this mailing list -- we're really here to talk about Open 
MPI-related things.

I suggest you read the valgrind documentation and/or google around for other 
memory debugging resources.

Good luck.


On Mar 18, 2011, at 2:07 AM, Jack Bryan wrote:

> Hi, 
> 
> I am running a C++ program with OMPI.
> I got error: 
> 
> *** glibc detected *** /nsga2b: free(): invalid next size (fast): 
> 0x01817a90 ***
> 
> I used GDB: 
> 
> === Backtrace: =
> Program received signal SIGABRT, Aborted.
> 0x0038b8830265 in raise () from /lib64/libc.so.6
> (gdb) bt
> #0  0x0038b8830265 in raise () from /lib64/libc.so.6
> #1  0x0038b8831d10 in abort () from /lib64/libc.so.6
> #2  0x0038b886a99b in __libc_message () from /lib64/libc.so.6
> #3  0x0038b887245f in _int_free () from /lib64/libc.so.6
> #4  0x0038b88728bb in free () from /lib64/libc.so.6
> #5  0x0044a4e3 in workerRunTask (message_to_master_type=0x38c06efe18, 
> nodeSize=2, myRank=1, xVSize=84, objSize=7, 
> xdata_to_workers_type=0x1206350, 
> recvXDataVec=std::vector of length 0, capacity 84, myNsga2=..., 
> Mpara_to_workers_type=0x1205390, events=0x7fffb1f0, netplan=...)
> at src/nsga2/workerRunTask.cpp:447
> #6  0x004514d9 in main (argc=1, argv=0x7fffcb48)
> at src/nsga2/main-parallel2.cpp:425
> -
> 
> In valgrind, 
> 
> there are some invalid read and write butno errors about this 
>  free(): invalid next size .
> 
> ---
> (populp.ind)->xreal   = new double[nreal];
>   (populp.ind)->obj   = new double[nobj];
>   (populp.ind)->constr= new double[ncon];
>   (populp.ind)->xbin  = new double[nbin];
>   if ((populp.ind)->xreal == NULL || (populp.ind)->obj == NULL || 
> (populp.ind)->constr == NULL || (populp.ind)->xbin == NULL )
>   {
>   #ifdef DEBUG_workerRunTask
>   cout << "In workerRunTask(), I am rank "<< myRank << " 
> (populp.ind)->xreal or (populp.ind)->obj or (populp.ind)->constr or 
> (populp.ind)->xbin is NULL .\n\n" << endl;  
>   #endif
>   }   
> 
> delete [] (populp.ind)->xreal ;
>   delete [] (populp.ind)->xbin ;
>   delete [] (populp.ind)->obj ;
>   delete [] (populp.ind)->constr ;
>   delete [] sendResultArrayPr;
> 
> 
> 
> thanks
> 
> Any help is really appreciated. 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] OpenMPI 1.2.x segfault as regular user

2011-03-18 Thread Prentice Bisbal
It's not hard to test whether or not SELinux is the problem. You can
turn SELinux off on the command-line with this command:

setenforce 0

Of course, you need to be root in order to do this.

After turning SELinux off, you can try reproducing the error. If it
still occurs, SELinux is not the cause and the problem is elsewhere; if it
no longer occurs, SELinux is the culprit. When you're done, you can
re-enable SELinux with

setenforce 1

If you're running your job across multiple nodes, you should disable
SELinux on all of them for testing.

Did you compile/install Open MPI yourself? If so, I suspect that the
SELinux context labels on your MPI binaries are incorrect.

If you use the method above to determine that SELinux is the problem,
please post your results here and I may be able to help you set things
right. I have some experience with SELinux problems like this, but I'm
not exactly an expert.

--
Prentice
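If the labels do turn out to be wrong, the usual way to inspect and repair them is with the standard SELinux tooling (the install prefix below is an assumption, substitute your own):

```shell
# Inspect the SELinux context labels on the Open MPI binaries
# (/opt/openmpi is an assumed install location).
ls -Z /opt/openmpi/bin/mpirun

# Relabel the tree back to the system's default contexts.
restorecon -Rv /opt/openmpi
```

restorecon needs root and only helps if the file-context policy actually covers the install path; otherwise semanage fcontext can add a rule first.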


On 03/17/2011 11:01 AM, Jeff Squyres wrote:
> Sorry for the delayed reply.
> 
> I'm afraid I haven't done much with SE Linux -- I don't know if there are any 
> "gotchas" that would show up there.  SE Linux support is not something we've 
> gotten a lot of request for.  I doubt that anyone in the community has done 
> much testing in this area.  :-\
> 
> I suspect that Open MPI is trying to access something that your user (under 
> SE Linux) doesn't have permission to.  
> 
> So I'm afraid I don't have much of an answer for you -- sorry!  If you do 
> figure it out, though, if a fix is not too intrusive, we can probably 
> incorporate it upstream.
> 
> 
> On Mar 4, 2011, at 7:31 AM, Youri LACAN-BARTLEY wrote:
> 
>> Hi,
>>  
>> This is my first post to this mailing-list so I apologize for maybe being a 
>> little rough on the edges.
>> I’ve been digging into OpenMPI for a little while now and have come across 
>> one issue that I just can’t explain and I’m sincerely hoping someone can put 
>> me on the right track here.
>>  
>> I’m using a fresh install of openmpi-1.2.7 and I systematically get a 
>> segmentation fault at the end of my mpirun calls if I’m logged in as a 
>> regular user.
>> However, as soon as I switch to the root account, the segfault does not 
>> appear.
>> The jobs actually run to their term but I just can’t find a good reason for 
>> this to be happening and I haven’t been able to reproduce the problem on 
>> another machine.
>>  
>> Any help or tips would be greatly appreciated.
>>  
>> Thanks,
>>  
>> Youri LACAN-BARTLEY
>>  
>> Here’s an example running osu_latency locally (I’ve “blacklisted” openib to 
>> make sure it’s not to blame):
>>  
>> [user@server ~]$ mpirun --mca btl ^openib  -np 2 
>> /opt/scripts/osu_latency-openmpi-1.2.7
>> # OSU MPI Latency Test v3.3
>> # Size       Latency (us)
>> 0            0.76
>> 1            0.89
>> 2            0.89
>> 4            0.89
>> 8            0.89
>> 16           0.91
>> 32           0.91
>> 64           0.92
>> 128          0.96
>> 256          1.13
>> 512          1.31
>> 1024         1.69
>> 2048         2.51
>> 4096         5.34
>> 8192         9.16
>> 16384        17.47
>> 32768        31.79
>> 65536        51.10
>> 131072       92.41
>> 262144       181.74
>> 524288       512.26
>> 1048576      1238.21
>> 2097152      2280.28
>> 4194304      4616.67
>> [server:15586] *** Process received signal ***
>> [server:15586] Signal: Segmentation fault (11)
>> [server:15586] Signal code: Address not mapped (1)
>> [server:15586] Failing at address: (nil)
>> [server:15586] [ 0] /lib64/libpthread.so.0 [0x3cd1e0eb10]
>> [server:15586] [ 1] /lib64/libc.so.6 [0x3cd166fdc9]
>> [server:15586] [ 2] /lib64/libc.so.6(__libc_malloc+0x167) [0x3cd1674dd7]
>> [server:15586] [ 3] /lib64/ld-linux-x86-64.so.2(__tls_get_addr+0xb1) 
>> [0x3cd120fe61]
>> [server:15586] [ 4] /lib64/libselinux.so.1 [0x3cd320f5cc]
>> [server:15586] [ 5] /lib64/libselinux.so.1 [0x3cd32045df]
>> [server:15586] *** End of error message ***
>> [server:15587] *** Process received signal ***
>> [server:15587] Signal: Segmentation fault (11)
>> [server:15587] Signal code: Address not mapped (1)
>> [server:15587] Failing at address: (nil)
>> [server:15587] [ 0] /lib64/libpthread.so.0 [0x3cd1e0eb10]
>> [server:15587] [ 1] /lib64/libc.so.6 [0x3cd166fdc9]
>> [server:15587] [ 2] /lib64/libc.so.6(__libc_malloc+0x167) [0x3cd1674dd7]
>> [server:15587] [ 3] /lib64/ld-linux-x86-64.so.2(__tls_get_addr+0xb1) 
>> [0x3cd120fe61]
>> [server:15587] [ 4] /lib64/libselinux.so.1 [0x3cd320f5cc]
>> [server:15587] [ 5] /lib64/libselinux.so.1 [0x3cd32045df]
>> [server:15587] *** End of error message ***
>> mpirun noticed that job rank 0 with PID 15586 on node server exited on 
>> signal 11 (Segmentation fault).
>> 1 additional process aborted (not s

Re: [OMPI users] Error in Binding MPI Process to a socket

2011-03-18 Thread Terry Dontje

On 03/17/2011 03:31 PM, vaibhav dutt wrote:

Hi,

Thanks for your reply. I tried to execute first a process by using

mpirun -machinefile hostfile.txt  --slot-list 0:1   -np 1

but it gives the same as error as mentioned previously.

Then, I created a rankfile with contents:

rank 0=t1.tools.xxx  slot=0:0
rank 1=t1.tools.xxx  slot=1:0

and the  used command

mpirun -machinefile hostfile.txt --rankfile my_rankfile.txt   -np 2

but ended  up getting same error. Is there any patch that I can 
install in my system to make it

topology aware?


You may want to check that you have numa turned on.

If you look in your /etc/grub.conf file does the kernel line have 
"numa=on" in it.  If not I would suggest making a new boot line and 
appending numa=on at the end.  That way if the new boot line doesn't 
work you'll be able to go back to the old one.  Anyway, my boot line 
that turns on numa looks like the following:


title Red Hat Enterprise Linux AS-up (2.6.9-67.EL)
root (hd0,0)
kernel /vmlinuz-2.6.9-67.EL ro root=LABEL=/ console=tty0 
console=ttyS0,9600 rhgb quiet numa=on


And of course once you've saved the changes you'll need to reboot and 
select the new boot line at the grub menu.


--td


Thanks


On Thu, Mar 17, 2011 at 2:05 PM, Ralph Castain wrote:


The error is telling you that your OS doesn't support queries
telling us what cores are on which sockets, so we can't perform a
"bind to socket" operation. You can probably still "bind to core",
so if you know what cores are in which sockets, then you could use
the rank_file mapper to assign processes to groups of cores in a
socket.

It's just that we can't do it automatically because the OS won't
give us the required info.

See "mpirun -h" for more info on slot lists.

On Mar 17, 2011, at 11:26 AM, vaibhav dutt wrote:

> Hi,
>
> I am trying to perform an experiment in which I can spawn 2 MPI
processes, one on each socket in a 4 core node
> having 2 dual cores. I used the option  "bind to socket" which
mpirun for that but I am getting an error like:
>
> An attempt was made to bind a process to a specific hardware
topology
> mapping (e.g., binding to a socket) but the operating system
does not
> support such topology-aware actions.  Talk to your local system
> administrator to find out if your system can support topology-aware
> functionality (e.g., Linux Kernels newer than v2.6.18).
>
> Systems that do not support processor topology-aware
functionality cannot
> use "bind to socket" and other related functionality.
>
>
> Can anybody please tell me what is this error about. Is there
any other option than "bind to socket"
> that I can use.
>
> Thanks.
> ___
> users mailing list
> us...@open-mpi.org 
> http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org 
http://www.open-mpi.org/mailman/listinfo.cgi/users



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 




