Re: [OMPI users] Running application with MPI_Comm_spawn() in multithreaded environment

2008-10-06 Thread Ralph Castain

Hi Roberto

My time is somewhat limited, so I couldn't review the code in detail.  
However, I think I got the gist of it.


A few observations:

1. The code is rather inefficient if all you want to do is spawn a  
pattern of slave processes based on a file. Unless there is some  
overriding reason for doing this one comm_spawn at a time, it would be  
far faster to issue a single comm_spawn and just provide the hostfile  
to us. You could use either the seq or rank_file mapper - both would  
take the file and provide the outcome you seek (a sketch of this  
follows after these observations). The only difference would be that  
the child procs would all be in the same comm_world - I don't know if  
that is an issue or not.


2. OMPI definitely cannot handle the threaded version of this code at  
this time - not sure when we will get to it.


3. If you serialize the code, we -should- be able to handle it.  
However, I'm not entirely sure your current method actually does that.  
It looks like you call comm_spawn, and then create a new thread which  
then calls comm_spawn. I'm afraid I can't quite figure out how the  
thread locking would occur to prevent multiple threads from continuing  
to call comm_spawn - you might want to check it again and ensure it is  
correct. Frankly, I'm not entirely sure what the thread creation is  
gaining you - as I said, we can only call comm_spawn serially, so  
having multiple threads would seem to be unnecessary...unless this  
code is incomplete and you need the threads for some other purpose.
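
A minimal sketch of the single-comm_spawn approach from observation 1 - the
slave binary name, the hostfile name and the "hostfile" info key are
assumptions for illustration only, not taken from Roberto's code:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        /* Let the runtime map all slaves from a file instead of issuing
         * one comm_spawn per slave ("hostfile" key and file name assumed). */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "hostfile", "slave_hosts");

        MPI_Comm children;
        MPI_Comm_spawn("testslave", MPI_ARGV_NULL, 8, info,
                       0, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);

        /* ... all 8 slaves now share one comm_world and one intercommunicator ... */

        MPI_Info_free(&info);
        MPI_Comm_disconnect(&children);
        MPI_Finalize();
        return 0;
    }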


Again, you might look at that loop_spawn code I mentioned before to  
see a working example. Alternatively, if your code works under HP MPI,  
you might want to stick with it for now until we get the threading  
support up to your required level.


Hope that helps
Ralph

On Oct 3, 2008, at 10:36 AM, Roberto Fichera wrote:


Ralph Castain wrote:

Interesting. I ran a loop calling comm_spawn 1000 times without a
problem. I suspect it is the threading that is causing the trouble  
here.
I think so! My guess is that at a low level there is some trouble when
handling *concurrent* orted spawning. Maybe
You are welcome to send me the code. You can find my loop code in your
code distribution under orte/test/mpi - look for loop_spawn and
loop_child.
In the attached code the spawning logic is currently inside a loop in
the main of the testmaster, so it's completely unthreaded at least
until MPI_Comm_spawn() finishes its work. If you would like to test
multithreaded spawning, you can comment out NodeThread_spawnSlave() in
the main loop and uncomment the same function in
NodeThread_threadMain(). Finally, if you want multithreaded spawning
but serialized against a mutex, then uncomment the
pthread_mutex_lock/unlock() calls in NodeThread_threadMain().

This code runs *without* any trouble in the HP MPI implementation. It
does not work as well with the mpich2 trunk version, due to two
problems: a limit of ~24.4K context ids and/or a race in poll()
while waiting for termination under MPI_Comm_disconnect()
concurrently with an MPI_Comm_spawn().



Ralph

On Oct 3, 2008, at 9:11 AM, Roberto Fichera wrote:


Ralph Castain wrote:


On Oct 3, 2008, at 7:14 AM, Roberto Fichera wrote:


Ralph Castain wrote:
I committed something to the trunk yesterday. Given the complexity of
the fix, I don't plan to bring it over to the 1.3 branch until
sometime mid-to-end next week so it can be adequately tested.
Ok! So it means that I can check out from the SVN trunk to get your fix,
right?


Yes, though note that I don't claim it is fully correct yet. Still
needs testing. However, I have tested it a fair amount and it seems
okay.

If you do test it, please let me know how it goes.

I execute my test on the svn/trunk below

  Open MPI: 1.4a1r19677
 Open MPI SVN revision: r19677
 Open MPI release date: Unreleased developer copy
  Open RTE: 1.4a1r19677
 Open RTE SVN revision: r19677
 Open RTE release date: Unreleased developer copy
  OPAL: 1.4a1r19677
 OPAL SVN revision: r19677
 OPAL release date: Unreleased developer copy
  Ident string: 1.4a1r19677

below is the output which seems to freeze just after the second  
spawn.


[roberto@master TestOpenMPI]$ mpirun --verbose --debug-daemons
--hostfile $PBS_NODEFILE -wdir "`pwd`" -np 1 testmaster 10
$PBS_NODEFILE
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received add_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0 arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
Initializing MPI ...
[master.tekno-soft.it:30063] [[19516,0],0] orted_recv: received sync+nidmap fro

Re: [OMPI users] Running application with MPI_Comm_spawn() in multithreaded environment

2008-10-06 Thread Roberto Fichera
Ralph Castain wrote:
> Hi Roberto
>
> My time is somewhat limited, so I couldn't review the code in detail.
> However, I think I got the gist of it.
>
> A few observations:
>
> 1. the code is rather inefficient, if all you want to do is spawn a
> pattern of slave processes based on a file. Unless there is some
> overriding reason for doing this one comm_spawn at a time, it would be
> far faster to issue a single comm_spawn and just provide the hostfile
> to us. You could use either the seq or rank_file mapper - both would
> take the file and provide the outcome you seek. The only difference
> would be that the child procs would all be in the same comm_world -
> don't know if that is an issue or not.
I would agree with you if all the spawned slaves had to communicate in
the same comm_world, but since that's not the case, as I already told
you, each slave will crunch different things, maybe even completely
different data from the other slaves.
The job distribution will look like a tree or multi-tree. The main
problem is that we don't know in advance how to associate the slaves
with the nodes, so we need a very dynamic distribution of the jobs
while "unrolling" the algorithm; basically, the application needs to
decide which is the best slave to run so that it can converge locally
on the solution as well as it can. That's why our distribution is
quite *unusual* ... but, I would say, legal in MPI-2 terms.
> 2. OMPI definitely cannot handle the threaded version of this code at
> this time - not sure when we will get to it.
Are we talking about only MPI_Comm_spawn(), or is the whole of OpenMPI
not thread safe at the moment?
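
(A generic way to see how much thread support the library actually grants -
a minimal sketch, independent of the code discussed in this thread - is to
request MPI_THREAD_MULTIPLE at startup and check what comes back:)

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided;

        /* Request full multi-threaded support and check what is granted. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (provided < MPI_THREAD_MULTIPLE)
            printf("only thread level %d provided; MPI calls such as "
                   "MPI_Comm_spawn() must be serialized by the application\n",
                   provided);

        MPI_Finalize();
        return 0;
    }
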
> 3. if you serialize the code, we -should- be able to handle it.
> However, I'm not entirely sure your current method actually does that.
> It looks like you call comm_spawn, and then create a new thread which
> then calls comm_spawn. I'm afraid I can't quite figure out how the
> thread locking would occur to prevent multiple threads continuing to
> call comm_spawn - you might want to check it again and ensure it is
> correct. Frankly, I'm not entirely sure what the thread creation is
> gaining you - as I said, we can only call comm_spawn serially, so
> having multiple threads would seem to be unnecessary...unless this
> code is incomplete and you need the threads for some other purpose.
Could you explain which part I have to serialize in order to meet the
OpenMPI expectation? Can I send/receive in a multithreaded fashion,
for example?

I need threading because each thread will handle the communication with
one slave. In my code, comm_spawn() is called in a thread, and in the
same thread I'll drive the MPI communication; when the slave's
computation is finished, the thread will terminate accordingly.
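
A rough sketch of such a per-slave thread body, with the spawn serialized
behind a mutex as Ralph suggests - the slave binary name, the "host" info
key and the message tags are illustrative assumptions, not Roberto's actual
code:

    #include <mpi.h>
    #include <pthread.h>

    /* One global mutex so only one thread at a time enters MPI_Comm_spawn(). */
    static pthread_mutex_t spawn_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Thread body: spawn one slave on a chosen node, drive it over its own
     * intercommunicator, then disconnect when its computation is done. */
    void *slave_thread(void *arg)
    {
        char *node = (char *)arg;            /* node picked by the master */
        MPI_Info info;
        MPI_Comm slave;
        double work = 1.0, result = 0.0;

        MPI_Info_create(&info);
        MPI_Info_set(info, "host", node);    /* place the slave on that node */

        pthread_mutex_lock(&spawn_lock);     /* serialize the spawns */
        MPI_Comm_spawn("testslave", MPI_ARGV_NULL, 1, info,
                       0, MPI_COMM_SELF, &slave, MPI_ERRCODES_IGNORE);
        pthread_mutex_unlock(&spawn_lock);
        MPI_Info_free(&info);

        /* The spawned slave is rank 0 of the remote group of the intercomm. */
        MPI_Send(&work, 1, MPI_DOUBLE, 0, 1, slave);
        MPI_Recv(&result, 1, MPI_DOUBLE, 0, 2, slave, MPI_STATUS_IGNORE);

        MPI_Comm_disconnect(&slave);         /* slave finished, release it */
        return NULL;
    }

The mutex only guarantees that the MPI_Comm_spawn() calls themselves never
overlap; whether the per-slave send/receive traffic can then proceed from
multiple threads is exactly the open question in this thread.
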
> Again, you might look at that loop_spawn code I mentioned before to
> see a working example. Alternatively, if your code works under HP MPI,
> you might want to stick with it for now until we get the threading
> support up to your required level.
About your example, I see that it does allow merging the slaves into
one intercommunicator, but what happens in the intercommunicator if one
or more slaves complete their work and I want to reuse them for doing
other things? Basically, I need to pair each slave with the data to
send it for its related computation; that's why we create a different
intercommunicator for each slave, because the slaves aren't related -
or at least we decide locally whether we need more than one slave for
crunching data, in which case the spawn will be instrumented to spawn,
say, 10 nodes for a single computation. Only in that case do we "fall
back" to the "standard usage" ;-)!
>
> Hope that helps
> Ralph
>
> On Oct 3, 2008, at 10:36 AM, Roberto Fichera wrote:
>
>> Ralph Castain wrote:
>>> Interesting. I ran a loop calling comm_spawn 1000 times without a
>>> problem. I suspect it is the threading that is causing the trouble
>>> here.
>> I think so! My guess is that at a low level there is some trouble when
>> handling *concurrent* orted spawning. Maybe
>>> You are welcome to send me the code. You can find my loop code in your
>>> code distribution under orte/test/mpi - look for loop_spawn and
>>> loop_child.
>> In the attached code the spawning logic is currently inside a loop in
>> the main of the testmaster, so it's completely unthreaded at least
>> until MPI_Comm_spawn() finishes its work. If you would like to test
>> multithreaded spawning, you can comment out NodeThread_spawnSlave() in
>> the main loop and uncomment the same function in
>> NodeThread_threadMain(). Finally, if you want multithreaded spawning
>> but serialized against a mutex, then uncomment the
>> pthread_mutex_lock/unlock() calls in NodeThread_threadMain().
>>
>> This code runs *without* any trouble in the HP MPI implementation. It
>> does not work as well with the mpich2 trunk version, due to two
>> problems: a limit of ~24.4K context ids and/or a race in poll()
>> while waiting for termination under MPI_Comm_disconnect()
>> co

Re: [OMPI users] Problem building OpenMPi with SunStudio compilers

2008-10-06 Thread Ethan Mallove
On Sat, Oct/04/2008 11:21:27AM, Raymond Muno wrote:
> Raymond Muno wrote:
>> Raymond Muno wrote:
>>> We are implementing a new cluster that is InfiniBand based.  I am working 
>>> on getting OpenMPI built for our various compile environments. So far it 
>>> is working for PGI 7.2 and PathScale 3.1.  I found some workarounds for 
>>> issues with the Pathscale compilers (seg faults) in the OpenMPI FAQ.
>>>
>>> When I try to build with SunStudio, I cannot even get past the configure 
>>> stage. It dies in the stage that checks for C++.
>>
>> It looks like the problem is with SunStudio itself. Even a simple CC 
>> program fails to compile.
>>>
>>> /usr/lib64/libm.so: file not recognized: File format not recognized
> OK, I took care of the linker issue for C++ as recommended on Sun's support 
> site (replace the Sun-supplied ld with /usr/bin/ld)
>
> Now I get farther along but the build fails at (small excerpt)
>
> mutex.c:(.text+0x30): multiple definition of `opal_atomic_cmpset_32'
> asm/.libs/libasm.a(asm.o):asm.c:(.text+0x30): first defined here
> threads/.libs/mutex.o: In function `opal_atomic_cmpset_64':
> mutex.c:(.text+0x50): multiple definition of `opal_atomic_cmpset_64'
> asm/.libs/libasm.a(asm.o):asm.c:(.text+0x50): first defined here
> make[2]: *** [libopen-pal.la] Error 1
> make[2]: Leaving directory `/home/muno/OpenMPI/SunStudio/openmpi-1.2.7/opal'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/home/muno/OpenMPI/SunStudio/openmpi-1.2.7/opal'
> make: *** [all-recursive] Error 1
>
> I based the configure on what was found in the FAQ here.
>
> http://www.open-mpi.org/faq/?category=building#build-sun-compilers
>
> Perhaps this is much more specific to our platform/OS.
>
> The environment is AMD Opteron, Barcelona running Centos 5
> (Rocks 5.03) with SunStudio 12 compilers.
>

Unfortunately I haven't seen the above issue, so I don't
have a workaround to propose. There are some issues that
have been fixed with GCC-style inline assembly in the latest
Sun Studio Express build. Could you try it out?

  http://developers.sun.com/sunstudio/downloads/express/index.jsp

-Ethan


> Does anyone have any insight as to how to successfully
> build OpenMPI for this OS/compiler selection?  As I said
> in the first post, we have it built for Pathscale 3.1 and
> PGI 7.2. 
> -Ray Muno
> University of Minnesota, Aerospace Engineering
>


Re: [OMPI users] does openmpi have C++ bindings?

2008-10-06 Thread Jeff Squyres
Yes, OMPI's C++ bindings are built by default if you have a valid C++  
compiler.  ompi_info should indicate whether you have the C++ bindings  
built or not.


But the C++ bindings don't allow sending/receiving STL containers via  
MPI calls.  For that, as someone else suggested, have a look at  
Boost.MPI.  The boost group built a nice C++ class library on top of  
MPI.
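
For example, with Boost.MPI an STL container can be sent directly because
the library serializes it behind the scenes - a minimal sketch (ranks and
tag values are arbitrary):

    #include <boost/mpi.hpp>
    #include <boost/serialization/vector.hpp>
    #include <vector>
    namespace mpi = boost::mpi;

    int main(int argc, char *argv[])
    {
        mpi::environment env(argc, argv);
        mpi::communicator world;

        if (world.rank() == 0) {
            std::vector<int> data(100, 42);
            world.send(1, 0, data);          // Boost serializes the vector
        } else if (world.rank() == 1) {
            std::vector<int> data;
            world.recv(0, 0, data);          // resized and filled on receipt
        }
        return 0;
    }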



On Oct 2, 2008, at 5:25 PM, Shafagh Jafer wrote:

Does openmpi have C++ bindings? or I need to install this package?if  
yes from where and how?

Thanks.




--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Problem building OpenMPi with SunStudio compilers

2008-10-06 Thread Ray Muno
Ethan Mallove wrote:

>> Now I get farther along but the build fails at (small excerpt)
>>
>> mutex.c:(.text+0x30): multiple definition of `opal_atomic_cmpset_32'
>> asm/.libs/libasm.a(asm.o):asm.c:(.text+0x30): first defined here
>> threads/.libs/mutex.o: In function `opal_atomic_cmpset_64':
>> mutex.c:(.text+0x50): multiple definition of `opal_atomic_cmpset_64'
>> asm/.libs/libasm.a(asm.o):asm.c:(.text+0x50): first defined here
>> make[2]: *** [libopen-pal.la] Error 1
>> make[2]: Leaving directory `/home/muno/OpenMPI/SunStudio/openmpi-1.2.7/opal'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory `/home/muno/OpenMPI/SunStudio/openmpi-1.2.7/opal'
>> make: *** [all-recursive] Error 1
>>
>> I based the configure on what was found in the FAQ here.
>>
>> http://www.open-mpi.org/faq/?category=building#build-sun-compilers
>>
>> Perhaps this is much more specific to our platform/OS.
>>
>> The environment is AMD Opteron, Barcelona running Centos 5
>> (Rocks 5.03) with SunStudio 12 compilers.
>>
> 
> Unfortunately I haven't seen the above issue, so I don't
> have a workaround to propose. There are some issues that
> have been fixed with GCC-style inline assembly in the latest
> Sun Studio Express build. Could you try it out?
> 
>   http://developers.sun.com/sunstudio/downloads/express/index.jsp
> 
> -Ethan
> 
> 

Looks like it dies at the exact same spot. I have the C++ failure as
well (supplied ld does not work).

-- 

 Ray Muno   http://www.aem.umn.edu/people/staff/muno
 University of Minnesota   e-mail:   m...@aem.umn.edu
 Aerospace Engineering and MechanicsPhone: (612) 625-9531
 110 Union St. S.E.   FAX: (612) 626-1558
 Minneapolis, Mn 55455  


Re: [OMPI users] OpenMPI with openib partitions

2008-10-06 Thread Jeff Squyres

On Oct 5, 2008, at 1:22 PM, Lenny Verkhovsky wrote:

you should probably use -mca tcp,self  -mca btl_openib_if_include  
ib0.8109




Really?  I thought we only took OpenFabrics device names in the  
openib_if_include MCA param...?  It looks like ib0.8109 is an IPoIB  
device name.




Lenny.


On 10/3/08, Matt Burgess  wrote:
Hi,


I'm trying to get openmpi working over openib partitions. On this  
cluster, the partition number is 0x109. The ib interfaces are  
pingable over the appropriate ib0.8109 interface:


d2:/opt/openmpi-ib # ifconfig ib0.8109
ib0.8109  Link encap:UNSPEC  HWaddr 80-00-00-4A- 
FE-80-00-00-00-00-00-00-00-00-00-00

  inet addr:10.21.48.2  Bcast:10.21.255.255  Mask:255.255.0.0
  inet6 addr: fe80::202:c902:26:ca01/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
  RX packets:16811 errors:0 dropped:0 overruns:0 frame:0
  TX packets:15848 errors:0 dropped:1 overruns:0 carrier:0
  collisions:0 txqueuelen:256
  RX bytes:102229428 (97.4 Mb)  TX bytes:102324172 (97.5 Mb)


I have tried the following:

/opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -machinefile machinefile -mca  
btl openib,self -mca btl_openib_max_btls 1 -mca  
btl_openib_ib_pkey_val 0x8109 -mca btl_openib_ib_pkey_ix 1 /cluster/ 
pallas/x86_64-ib/IMB-MPI1


but I just get a RETRY EXCEEDED ERROR. Is there a MCA parameter I am  
missing?


I was successful using tcp only:

/opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -machinefile machinefile -mca  
btl tcp,self -mca btl_openib_max_btls 1 -mca btl_openib_ib_pkey_val  
0x8109 /cluster/pallas/x86_64-ib/IMB-MPI1




Thanks,
Matt Burgess




--
Jeff Squyres
Cisco Systems



Re: [OMPI users] segfault issue - possible bug in openmpi

2008-10-06 Thread Jeff Squyres
Yes, there could still be a dependence on the number of processors and  
on using threads.  But it's not clear from the stack trace whether this  
is a threading problem or not (and it is correct that OMPI v1.2's thread  
support is non-functional).


As for more information that would help diagnose the problem, please see

http://www.open-mpi.org/community/help/

Thanks.
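
One common way to get gdb onto a single suspect rank - a generic debugging
trick, not specific to Open MPI or to Daniel's code; the rank choice and
sleep interval are arbitrary - is to park that rank in a loop until a
debugger attaches and clears the flag:

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        int rank;
        volatile int wait_for_debugger = 1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                     /* pick the rank you suspect */
            printf("rank %d waiting, attach with: gdb -p %d\n",
                   rank, (int)getpid());
            fflush(stdout);
            while (wait_for_debugger)        /* in gdb: set var wait_for_debugger = 0 */
                sleep(5);
        }

        /* ... the rest of the application runs once the flag is cleared ... */

        MPI_Finalize();
        return 0;
    }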


On Oct 4, 2008, at 9:28 PM, Doug Reeder wrote:


Shafagh,

I missed the dependence on the number of processors. Apparently  
there is some thread support.


Doug
On Oct 4, 2008, at 5:29 PM, Shafagh Jafer wrote:


Doug Reeder,
Daniel is saying that the problem only occurs in openmpi when  
running more than 16 processes. So could the cause still be that  
openmpi does not support threads??!!


--- On Fri, 10/3/08, Doug Reeder  wrote:
From: Doug Reeder 
Subject: Re: [OMPI users] segfault issue - possible bug in openmpi
To: "Open MPI Users" 
Date: Friday, October 3, 2008, 2:40 PM

Daniel,

Are you using threads? I don't think openmpi-1.2.x works with  
threads.


Doug Reeder
On Oct 3, 2008, at 2:30 PM, Daniel Hansen wrote:


Oh, by the way, here is the segfault:

[m4b-1-8:11481] *** Process received signal ***
[m4b-1-8:11481] Signal: Segmentation fault (11)
[m4b-1-8:11481] Signal code: Address not mapped (1)
[m4b-1-8:11481] Failing at address: 0x2b91c69eed
[m4b-1-8:11483] [ 0] /lib64/libpthread.so.0 [0x33e8c0de70]
[m4b-1-8:11483] [ 1] /fslhome/dhansen7/openmpi/lib/libmpi.so.0 [0x2abea7c0]
[m4b-1-8:11483] [ 2] /fslhome/dhansen7/openmpi/lib/libmpi.so.0 [0x2abea675]
[m4b-1-8:11483] [ 3] /fslhome/dhansen7/openmpi/lib/libmpi.so.0(mca_pml_ob1_send+0x2da) [0x2abeaf55]
[m4b-1-8:11483] [ 4] /fslhome/dhansen7/openmpi/lib/libmpi.so.0(MPI_Send+0x28e) [0x2ab52c5a]
[m4b-1-8:11483] [ 5] /fslhome/dhansen7/compute/for_DanielHansen/replica_mpi_marylou2/Openmpi_md_twham(twham_init+0x708) [0x42a8a8]
[m4b-1-8:11483] [ 6] /fslhome/dhansen7/compute/for_DanielHansen/replica_mpi_marylou2/Openmpi_md_twham(repexch+0x73c) [0x425d5c]
[m4b-1-8:11483] [ 7] /fslhome/dhansen7/compute/for_DanielHansen/replica_mpi_marylou2/Openmpi_md_twham(main+0x855) [0x4133a5]
[m4b-1-8:11483] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x33e841d8a4]
[m4b-1-8:11483] [ 9] /fslhome/dhansen7/compute/for_DanielHansen/replica_mpi_marylou2/Openmpi_md_twham [0x4040b9]

[m4b-1-8:11483] *** End of error message ***



On Fri, Oct 3, 2008 at 3:20 PM, Daniel Hansen   
wrote:
I have been testing some code against openmpi lately that always  
causes it to crash during certain mpi function calls.  The code  
does not seem to be the problem, as it runs just fine against  
mpich.  I have tested it against openmpi 1.2.5, 1.2.6, and 1.2.7  
and they all exhibit the same problem.  Also, the problem only  
occurs in openmpi when running more than 16 processes.  I have  
posted this stack trace to the list before, but I am submitting it  
now as a potential bug report.  I need some help debugging it and  
finding out exactly what is going on in openmpi when the segfault  
occurs.  Are there any suggestions on how best to do this?  Is  
there an easy way to attach gdb to one of the processes or  
something??  I have already compiled openmpi with debugging,  
memory profiling, etc.  How can I best take advantage of these  
features?


Thanks,
Daniel Hansen
Systems Administrator
BYU Fulton Supercomputing Lab




--
Jeff Squyres
Cisco Systems



[OMPI users] ompi-restart issue : ompi-restart doesn't work across nodes - possible installation problem or environment setting problem??

2008-10-06 Thread arun dhakne
Hi all,

This is the procedure I have followed to install openmpi. Is there
some installation or environment setting problem here?
An openmpi program with 4 processes is run across 2 dual-core Intel
machines, with 2 processes running on each machine.

ompi-checkpoint is successful but ompi-restart fails with the following error


$:> ompi-restart ompi_global_snapshot_6045.ckpt
--
mpirun noticed that process rank 0 with PID 6372 on node
acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
fault).
--

Open-mpi installation steps:
./configure --prefix=/home/csgrad/audhakne/.openmpi --with-ft=cr
--with-blcr=/usr/lib64 --enable-debug
make
make install



export LD_LIBRARY_PATH=$HOME/.openmpi/lib/:$HOME/.openmpi/lib/openmpi:/usr/lib64
export PATH=$HOME/.openmpi/bin:$PATH

NOTE: blcr is installed as a module
$:> lsmod | grep blcr

blcr  117892  0
blcr_vmadump   58264  1 blcr
blcr_imports   46080  2 blcr,blcr_vmadump

Please let me know if there is a problem with the above procedure; thanks a
lot for your time.

Best.

-- Forwarded message --
From: arun dhakne 
Date: Tue, Sep 30, 2008 at 12:52 AM
Subject: ompi-restart issue : ompi-restart doesn't work across nodes
To: Open MPI Users 


Hi all,

I had gone through some previous ompi-restart issues but I couldn't
find anything similar to this problem.

I have installed blcr, and configured open-mpi 'openmpi-1.3a1r19645'

i) If the sample MPI program (say np 4 on a single machine, that is,
without any hostfile) is run and I try to checkpoint it, it happens
successfully and even ompi-restart works in this case.

ii) If the sample MPI program is run across, say, 2 different nodes and
the checkpoint happens successfully, BUT ompi-restart throws the
following error:

$ ompi-restart ompi_global_snapshot_7604.ckpt
--
mpirun noticed that process rank 3 with PID 9590 on node
acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
fault).
--

Please let me know if more information is needed.

--
Thanks and Regards,
Arun U. Dhakne


[OMPI users] Problem launching onto Bourne shell

2008-10-06 Thread Hahn Kim

Hi,

I'm having difficulty launching an Open MPI job onto a machine that is  
running the Bourne shell.


Here's my basic setup.  I have two machines, one is an x86-based  
machine running bash and the other is a Cell-based machine running  
Bourne shell.  I'm running mpirun from the x86 machine, which launches  
a C++ MPI application onto the Cell machine.  I get the following error:


   error while loading shared libraries: libstdc++.so.6: cannot open  
shared object file: No such file or directory


The basic problem is that LD_LIBRARY_PATH needs to be set to the  
directory that contains libstdc++.so.6 for the Cell.  I set the  
following line in .profile:


   export LD_LIBRARY_PATH=/opt/cell/toolchain/lib/gcc/ppu/4.1.1/32

which is the path to the PPC libraries for Cell.

Now if I log directly into the Cell machine and run the program  
directly from the command line, I don't get the above error.  But  
mpirun still fails, even after setting LD_LIBRARY_PATH in .profile.


As a sanity check, I did the following.  I ran the following command  
from the x86 machine:


   mpirun -np 1 --host cab0 env

which, among other things, shows me the following value:

   LD_LIBRARY_PATH=/tools/openmpi-1.2.5/lib:

If I log into the Cell machine and run env directly from the command  
line, I get the following value:


   LD_LIBRARY_PATH=/opt/cell/toolchain/lib/gcc/ppu/4.1.1/32

So it appears that .profile gets sourced when I log in but not when  
mpirun runs.


However, according to the OpenMPI FAQ
(http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path),
mpirun is supposed to source .profile directly, since the Bourne shell  
doesn't automatically source it for non-interactive shells.


Does anyone have any insight as to why my environment isn't being set  
properly?  Thanks!


Hahn

--
Hahn Kim, h...@ll.mit.edu
MIT Lincoln Laboratory
244 Wood St., Lexington, MA 02420
Tel: 781-981-0940, Fax: 781-981-5255








Re: [OMPI users] Problem launching onto Bourne shell

2008-10-06 Thread Aurélien Bouteiller
You can forward your local env with mpirun -x LD_LIBRARY_PATH. As an  
alternative you can set specific values with mpirun -x  
LD_LIBRARY_PATH=/some/where:/some/where/else . More information with  
mpirun --help (or man mpirun).


Aurelien



On 6 Oct 2008, at 16:06, Hahn Kim wrote:


Hi,

I'm having difficulty launching an Open MPI job onto a machine that  
is running the Bourne shell.


Here's my basic setup.  I have two machines, one is an x86-based  
machine running bash and the other is a Cell-based machine running  
Bourne shell.  I'm running mpirun from the x86 machine, which  
launches a C++ MPI application onto the Cell machine.  I get the  
following error:


  error while loading shared libraries: libstdc++.so.6: cannot open  
shared object file: No such file or directory


The basic problem is that LD_LIBRARY_PATH needs to be set to the  
directory that contains libstdc++.so.6 for the Cell.  I set the  
following line in .profile:


  export LD_LIBRARY_PATH=/opt/cell/toolchain/lib/gcc/ppu/4.1.1/32

which is the path to the PPC libraries for Cell.

Now if I log directly into the Cell machine and run the program  
directly from the command line, I don't get the above error.  But  
mpirun still fails, even after setting LD_LIBRARY_PATH in .profile.


As a sanity check, I did the following.  I ran the following command  
from the x86 machine:


  mpirun -np 1 --host cab0 env

which, among other things, shows me the following value:

  LD_LIBRARY_PATH=/tools/openmpi-1.2.5/lib:

If I log into the Cell machine and run env directly from the command  
line, I get the following value:


  LD_LIBRARY_PATH=/opt/cell/toolchain/lib/gcc/ppu/4.1.1/32

So it appears that .profile gets sourced when I log in but not when  
mpirun runs.


However, according to the OpenMPI FAQ
(http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path),
mpirun is supposed to source .profile directly, since the Bourne shell  
doesn't automatically source it for non-interactive shells.


Does anyone have any insight as to why my environment isn't being  
set properly?  Thanks!


Hahn

--
Hahn Kim, h...@ll.mit.edu
MIT Lincoln Laboratory
244 Wood St., Lexington, MA 02420
Tel: 781-981-0940, Fax: 781-981-5255










--
* Dr. Aurélien Bouteiller
* Sr. Research Associate at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 350
* Knoxville, TN 37996
* 865 974 6321







Re: [OMPI users] ompi-restart issue : ompi-restart doesn't work across nodes - possible installation problem or environment setting problem??

2008-10-06 Thread Josh Hursey
The installation looks ok, though I'm not sure what is causing the  
segfault of the restarted process. Two things to try. First, can you  
send me a backtrace from the core file that is generated by the  
segmentation fault? That will provide insight into what is causing it.


Second, you may try to enable the C/R thread, which allows a  
checkpoint to progress when an application is in a computation loop  
instead of only when it is in the MPI library. To do so, configure with  
these additional flags:

  --enable-ft-thread --enable-mpi-threads

What version of Open MPI are you using? What version of BLCR?

Best,
Josh

On Oct 6, 2008, at 3:55 PM, arun dhakne wrote:


Hi all,

This is the procedure I have followed to install openmpi. Is there
some installation or environment setting problem here?
An openmpi program with 4 processes is run across 2 dual-core Intel
machines, with 2 processes running on each machine.

ompi-checkpoint is successful but ompi-restart fails with the following  
error



$:> ompi-restart ompi_global_snapshot_6045.ckpt
--
mpirun noticed that process rank 0 with PID 6372 on node
acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
fault).
--

Open-mpi installation steps:
./configure --prefix=/home/csgrad/audhakne/.openmpi --with-ft=cr
--with-blcr=/usr/lib64 --enable-debug
make
make install



export LD_LIBRARY_PATH=$HOME/.openmpi/lib/:$HOME/.openmpi/lib/openmpi:/usr/lib64

export PATH=$HOME/.openmpi/bin:$PATH

NOTE: blcr is installed as a module
$:> lsmod | grep blcr

blcr  117892  0
blcr_vmadump   58264  1 blcr
blcr_imports   46080  2 blcr,blcr_vmadump

Please let me know if there is a problem with the above procedure; thanks a
lot for your time.

Best.

-- Forwarded message --
From: arun dhakne 
Date: Tue, Sep 30, 2008 at 12:52 AM
Subject: ompi-restart issue : ompi-restart doesn't work across nodes
To: Open MPI Users 


Hi all,

I had gone through some previous ompi-restart issues but I couldn't
find anything similar to this problem.

I have installed blcr, and configured open-mpi 'openmpi-1.3a1r19645'

i) If the sample MPI program (say np 4 on a single machine, that is,
without any hostfile) is run and I try to checkpoint it, it happens
successfully and even ompi-restart works in this case.

ii) If the sample MPI program is run across, say, 2 different nodes and
the checkpoint happens successfully, BUT ompi-restart throws the
following error:

$ ompi-restart ompi_global_snapshot_7604.ckpt
--
mpirun noticed that process rank 3 with PID 9590 on node
acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
fault).
--

Please let me know if more information is needed.

--
Thanks and Regards,
Arun U. Dhakne




Re: [OMPI users] Problem launching onto Bourne shell

2008-10-06 Thread Hahn Kim
Great, that worked, thanks!  However, it still concerns me that the  
FAQ page says mpirun will source .profile, which doesn't seem to  
work for me.  Are there any configuration issues that could possibly  
be preventing mpirun from doing this?  It would certainly be more  
convenient if I could maintain my environment in a single .profile  
file instead of adding what could potentially be a lot of -x  
arguments to my mpirun command.


Hahn

On Oct 6, 2008, at 5:44 PM, Aurélien Bouteiller wrote:


You can forward your local env with mpirun -x LD_LIBRARY_PATH. As an
alternative you can set specific values with mpirun -x
LD_LIBRARY_PATH=/some/where:/some/where/else . More information with
mpirun --help (or man mpirun).

Aurelien



On 6 Oct 2008, at 16:06, Hahn Kim wrote:


Hi,

I'm having difficulty launching an Open MPI job onto a machine that
is running the Bourne shell.

Here's my basic setup.  I have two machines, one is an x86-based
machine running bash and the other is a Cell-based machine running
Bourne shell.  I'm running mpirun from the x86 machine, which
launches a C++ MPI application onto the Cell machine.  I get the
following error:

  error while loading shared libraries: libstdc++.so.6: cannot open
shared object file: No such file or directory

The basic problem is that LD_LIBRARY_PATH needs to be set to the
directory that contains libstdc++.so.6 for the Cell.  I set the
following line in .profile:

  export LD_LIBRARY_PATH=/opt/cell/toolchain/lib/gcc/ppu/4.1.1/32

which is the path to the PPC libraries for Cell.

Now if I log directly into the Cell machine and run the program
directly from the command line, I don't get the above error.  But
mpirun still fails, even after setting LD_LIBRARY_PATH in .profile.

As a sanity check, I did the following.  I ran the following command
from the x86 machine:

  mpirun -np 1 --host cab0 env

which, among other things, shows me the following value:

  LD_LIBRARY_PATH=/tools/openmpi-1.2.5/lib:

If I log into the Cell machine and run env directly from the command
line, I get the following value:

  LD_LIBRARY_PATH=/opt/cell/toolchain/lib/gcc/ppu/4.1.1/32

So it appears that .profile gets sourced when I log in but not when
mpirun runs.

However, according to the OpenMPI FAQ
(http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path),
mpirun is supposed to source .profile directly, since the Bourne shell
doesn't automatically source it for non-interactive shells.

Does anyone have any insight as to why my environment isn't being
set properly?  Thanks!

Hahn

--
Hahn Kim, h...@ll.mit.edu
MIT Lincoln Laboratory
244 Wood St., Lexington, MA 02420
Tel: 781-981-0940, Fax: 781-981-5255










--
* Dr. Aurélien Bouteiller
* Sr. Research Associate at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 350
* Knoxville, TN 37996
* 865 974 6321









--
Hahn Kim
MIT Lincoln Laboratory   Phone: (781) 981-0940
244 Wood Street, S2-252  Fax: (781) 981-5255
Lexington, MA 02420  E-mail: h...@ll.mit.edu