[OMPI users] Strange rank 0 behavior on Mac OS

2015-03-04 Thread Oliver
hi all,

I have openmpi 1.8.4 installed on a Mac laptop.

When running an MPI job on localhost for testing, I've noticed that when -np
is an odd number (3, 5, etc.), it sometimes runs into a situation where only
rank 0 is progressing and the rest of the ranks don't seem to run at all ...
I've examined the code many times over and couldn't see why. I am wondering
whether this is something else entirely, and whether anyone has run into a
similar problem before?
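
Hard to say without the code, but one thing worth checking (an assumption,
not a diagnosis): with more ranks than physical cores, busy-waiting ranks can
starve each other, which looks exactly like only rank 0 making progress. A
minimal sketch, with ./your_app as a placeholder program name:

  # show where each rank is bound, and ask idle ranks to yield the CPU
  # (mpi_yield_when_idle should be available in the 1.8 series)
  mpirun -np 3 --report-bindings --mca mpi_yield_when_idle 1 ./your_app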

TIA

Oliver


[OMPI users] Rebuild RPM for CentOS 7.1

2015-10-31 Thread Oliver
hi all

I am trying to rebuild the 1.10 RPM from the src RPM on CentOS 7. The build
process went fine. While trying to install the RPM, I encountered the
following error:


Examining openmpi-1.10.0-1.x86_64.rpm: openmpi-1.10.0-1.x86_64
Marking openmpi-1.10.0-1.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package openmpi.x86_64 0:1.10.0-1 will be installed
--> Finished Dependency Resolution

...

Transaction check error:
  file /usr/bin from install of openmpi-1.10.0-1.x86_64 conflicts with file
from package filesystem-3.2-18.el7.x86_64
  file /usr/lib64 from install of openmpi-1.10.0-1.x86_64 conflicts with
file from package filesystem-3.2-18.el7.x86_64

What am I missing? Is there a fix?
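
The error itself says the package is trying to own the /usr/bin and /usr/lib64
directories, which already belong to the base "filesystem" package. Two quick
checks (a sketch) that make the conflict visible:

  rpm -qf /usr/bin /usr/lib64            # both owned by filesystem-3.2-18.el7
  rpm -qlp openmpi-1.10.0-1.x86_64.rpm | grep -E '^/usr(/bin|/lib64)?$'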

TIA

-- 
Oliver


Re: [OMPI users] Rebuild RPM for CentOS 7.1

2015-11-04 Thread Oliver
hi Gilles,

Yes, I got the src RPM from the official OMPI site. I am not sure what you
meant by "related x86_64.rpm"; as far as I can tell, only the src RPM is
distributed: http://www.open-mpi.org/software/ompi/v1.10/

rpm -qlp of the newly built RPM does show files in /usr/bin:

/etc
/etc/openmpi-default-hostfile
/etc/openmpi-mca-params.conf
/etc/openmpi-totalview.tcl
/etc/vtsetup-config.dtd
/etc/vtsetup-config.xml
/usr
/usr/bin
/usr/bin/mpiCC
/usr/bin/mpiCC-vt
/usr/bin/mpic++
....

Best,

Oliver



On Sun, Nov 1, 2015 at 8:20 PM, Gilles Gouaillardet 
wrote:

> Olivier,
>
> where did you get the src.rpm from ?
>
> assuming you downloaded it, can you also download the related x86_64.rpm,
> run rpm -qlp open-mpi-xxx.x86_64.rpm and check if there are files in
> /usr/bin and /usr/lib64 ?
>
> Cheers,
>
> Gilles



-- 
Oliver


Re: [OMPI users] Rebuild RPM for CentOS 7.1

2015-11-04 Thread Oliver
Gilles,

The new spec file has some issues:


f7b@ct0 ~/rpmbuild/SPECS$ rpmbuild -ba openmpi-1.10.0.spec
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.gDGUNx
+ umask 022
+ cd /home/f7b/rpmbuild/BUILD
+ rm -rf $'/home/f7b/rpmbuild/BUILDROOT/openmpi-1.10.0-1.x86_64\r'
+ $'\r'
/var/tmp/rpm-tmp.gDGUNx: line 33: $'\r': command not found
error: Bad exit status from /var/tmp/rpm-tmp.gDGUNx (%prep)


RPM build errors:
Bad exit status from /var/tmp/rpm-tmp.gDGUNx (%prep)

On Mon, Nov 2, 2015 at 3:16 AM, Gilles Gouaillardet 
wrote:

> Olivier,
>
> here is an updated spec file (attached) that works on CentOS 7
>
> I will think it over some more before committing a permanent fix
>
> Cheers,
>
> Gilles
>



-- 
Oliver


Re: [OMPI users] Rebuild RPM for CentOS 7.1

2015-11-04 Thread Oliver
Gilles,

Upon closer look, the previous spec-file errors are caused by CRLF line
terminators (it seems the file was prepared on Windows?). Once converted to
Unix line endings, everything seems to be fine.
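
For reference, a quick way to spot and strip the CRLF terminators (a sketch;
either tool works if it is installed):

  file openmpi-1.10.0.spec      # reports "... with CRLF line terminators" if affected
  dos2unix openmpi-1.10.0.spec  # convert in place, or equivalently:
  sed -i 's/\r$//' openmpi-1.10.0.spec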

Thanks for your spec.

Oliver


On Wed, Nov 4, 2015 at 7:24 PM, Gilles Gouaillardet 
wrote:

> Olivier,
>
> I just forgot OMPI was officially shipping the .src.rpm (and there is no
> binary x86_64.rpm)
>
> please use the .spec I sent in a previous email (assuming you want OMPI in
> /usr)
> another option is to
> rpmbuild -ba --define 'install_in_opt 1' SPECS/openmpi-1.10.0.spec
> and OMPI will be installed in /opt
>
> Cheers,
>
> Gilles
>
>
>
>
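
A sketch of the /opt rebuild flow Gilles describes above, assuming the spec
file honors the install_in_opt define (per his message) and the standard
~/rpmbuild layout:

  # rebuild into /opt so the package does not claim directories owned by "filesystem"
  rpmbuild -ba --define 'install_in_opt 1' ~/rpmbuild/SPECS/openmpi-1.10.0.spec
  # install the resulting binary RPM (path assumed from the default rpmbuild layout)
  sudo yum install ~/rpmbuild/RPMS/x86_64/openmpi-1.10.0-1.x86_64.rpm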


[OMPI users] segfault (shared memory initialization) after program ended

2018-09-25 Thread Oliver
hi -

I have an application that consistently segfaults when I do
"mpirun --oversubscribe", and the following message came AFTER the application
had run. My environment: macOS with openmpi 3.1.2.

Is this a problem with my application, or with my environment? Any help?

thanks

Oliver

--
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  pi.local
  System call: unlink(2)
/var/folders/h2/ph7pgd4n3_z9v2pd0hk5nc6wgn/T//ompi.pi.501/pid.45364/1/vader_segment.pi.c1c1.7
  Error:   No such file or directory (errno 2)
--
mpirun(45364,0x7e1c9000) malloc: *** mach_vm_map(size=1125899906846720)
failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
[pi:45364] *** Process received signal ***
[pi:45364] Signal: Segmentation fault: 11 (11)
[pi:45364] Signal code: Address not mapped (1)
[pi:45364] Failing at address: 0x0
[pi:45364] [ 0] 0   libsystem_platform.dylib0x7fff7d999f5a
_sigtramp + 26
[pi:45364] [ 1] 0   ??? 0x2d595060
0x0 + 760828000
[pi:45364] [ 2] 0   mca_rml_oob.so  0x000103aeadaf
orte_rml_oob_send_buffer_nb + 956
[pi:45364] [ 3] 0   libopen-rte.40.dylib0x00010357d0fa
pmix_server_log_fn + 449
[pi:45364] [ 4] 0   mca_pmix_pmix2x.so  0x00010394f6d6
server_log + 857
[pi:45364] [ 5] 0   mca_pmix_pmix2x.so  0x000103982d42
pmix_server_log + 1257
[pi:45364] [ 6] 0   mca_pmix_pmix2x.so  0x0001039731e0
server_message_handler + 5032
[pi:45364] [ 7] 0   mca_pmix_pmix2x.so  0x0001039a9822
pmix_ptl_base_process_msg + 723
[pi:45364] [ 8] 0   libevent-2.1.6.dylib0x0001036b6719
event_process_active_single_queue + 376
[pi:45364] [ 9] 0   libevent-2.1.6.dylib0x0001036b3cb3
event_base_loop + 1074
[pi:45364] [10] 0   mca_pmix_pmix2x.so  0x000103988ce7
progress_engine + 26
[pi:45364] [11] 0   libsystem_pthread.dylib 0x7fff7d9a3661
_pthread_body + 340
[pi:45364] [12] 0   libsystem_pthread.dylib 0x7fff7d9a350d
_pthread_body + 0
[pi:45364] [13] 0   libsystem_pthread.dylib 0x7fff7d9a2bf9
thread_start + 13
[pi:45364] *** End of error message ***
Segmentation fault: 11

-- 
Oliver
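
Not a confirmed diagnosis, but two things commonly worth trying for
shared-memory (vader) trouble on macOS, where the default per-user TMPDIR
path is very long: point the session directory at a short path, or take the
shared-memory BTL out of the picture to see whether the problem follows it.
A sketch (the program name and rank count are placeholders):

  # 1) use a short session-directory location
  TMPDIR=/tmp mpirun --oversubscribe -np 4 ./your_app

  # 2) or force TCP + self only, bypassing shared memory
  mpirun --oversubscribe --mca btl self,tcp -np 4 ./your_app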

[OMPI users] process binding to NUMA node on Opteron 6xxx series CPUs?

2013-02-14 Thread Oliver Weihe

Hi,

is it possible to bind MPI processes to a NUMA node somehow on Opteron 
6xxx series CPUs (e.g. --bind-to-NUMAnode) *without* using a rankfile?
Opteron 6xxx CPUs have two NUMA nodes per CPU(-socket), so --bind-to-socket 
doesn't work the way I want.


This is a 4-socket Opteron 6344 system (12 cores per CPU(-socket)):

root@node01:~> numactl --hardware | grep cpus
node 0 cpus: 0 1 2 3 4 5
node 1 cpus: 6 7 8 9 10 11
node 2 cpus: 12 13 14 15 16 17
node 3 cpus: 18 19 20 21 22 23
node 4 cpus: 24 25 26 27 28 29
node 5 cpus: 30 31 32 33 34 35
node 6 cpus: 36 37 38 39 40 41
node 7 cpus: 42 43 44 45 46 47

root@node01:~> /opt/openmpi/1.6.3/gcc/bin/mpirun --report-bindings -np 8 
--bind-to-socket --bysocket sleep 1s
[node01.cluster:21446] MCW rank 1 bound to socket 1[core 0-11]: [. . . . 
. . . . . . . .][B B B B B B B B B B B B][. . . . . . . . . . . .][. . . 
. . . . . . . . .]
[node01.cluster:21446] MCW rank 2 bound to socket 2[core 0-11]: [. . . . 
. . . . . . . .][. . . . . . . . . . . .][B B B B B B B B B B B B][. . . 
. . . . . . . . .]
[node01.cluster:21446] MCW rank 3 bound to socket 3[core 0-11]: [. . . . 
. . . . . . . .][. . . . . . . . . . . .][. . . . . . . . . . . .][B B B 
B B B B B B B B B]
[node01.cluster:21446] MCW rank 4 bound to socket 0[core 0-11]: [B B B B 
B B B B B B B B][. . . . . . . . . . . .][. . . . . . . . . . . .][. . . 
. . . . . . . . .]
[node01.cluster:21446] MCW rank 5 bound to socket 1[core 0-11]: [. . . . 
. . . . . . . .][B B B B B B B B B B B B][. . . . . . . . . . . .][. . . 
. . . . . . . . .]
[node01.cluster:21446] MCW rank 6 bound to socket 2[core 0-11]: [. . . . 
. . . . . . . .][. . . . . . . . . . . .][B B B B B B B B B B B B][. . . 
. . . . . . . . .]
[node01.cluster:21446] MCW rank 7 bound to socket 3[core 0-11]: [. . . . 
. . . . . . . .][. . . . . . . . . . . .][. . . . . . . . . . . .][B B B 
B B B B B B B B B]
[node01.cluster:21446] MCW rank 0 bound to socket 0[core 0-11]: [B B B B 
B B B B B B B B][. . . . . . . . . . . .][. . . . . . . . . . . .][. . . 
. . . . . . . . .]


So each process is bound to *two* NUMA nodes, but I want to bind to 
*one* NUMA node.


What I want is more like this:
root@node01:~> cat rankfile
rank 0=localhost slot=0-5
rank 1=localhost slot=6-11
rank 2=localhost slot=12-17
rank 3=localhost slot=18-23
rank 4=localhost slot=24-29
rank 5=localhost slot=30-35
rank 6=localhost slot=36-41
rank 7=localhost slot=42-47
root@node01:~> /opt/openmpi/1.6.3/gcc/bin/mpirun --report-bindings -np 8 
--rankfile rankfile sleep 1s
[node01.cluster:21505] MCW rank 1 bound to socket 0[core 6-11]: [. . . . 
. . B B B B B B][. . . . . . . . . . . .][. . . . . . . . . . . .][. . . 
. . . . . . . . .] (slot list 6-11)
[node01.cluster:21505] MCW rank 2 bound to socket 1[core 0-5]: [. . . . 
. . . . . . . .][B B B B B B . . . . . .][. . . . . . . . . . . .][. . . 
. . . . . . . . .] (slot list 12-17)
[node01.cluster:21505] MCW rank 3 bound to socket 1[core 6-11]: [. . . . 
. . . . . . . .][. . . . . . B B B B B B][. . . . . . . . . . . .][. . . 
. . . . . . . . .] (slot list 18-23)
[node01.cluster:21505] MCW rank 4 bound to socket 2[core 0-5]: [. . . . 
. . . . . . . .][. . . . . . . . . . . .][B B B B B B . . . . . .][. . . 
. . . . . . . . .] (slot list 24-29)
[node01.cluster:21505] MCW rank 5 bound to socket 2[core 6-11]: [. . . . 
. . . . . . . .][. . . . . . . . . . . .][. . . . . . B B B B B B][. . . 
. . . . . . . . .] (slot list 30-35)
[node01.cluster:21505] MCW rank 6 bound to socket 3[core 0-5]: [. . . . 
. . . . . . . .][. . . . . . . . . . . .][. . . . . . . . . . . .][B B B 
B B B . . . . . .] (slot list 36-41)
[node01.cluster:21505] MCW rank 7 bound to socket 3[core 6-11]: [. . . . 
. . . . . . . .][. . . . . . . . . . . .][. . . . . . . . . . . .][. . . 
. . . B B B B B B] (slot list 42-47)
[node01.cluster:21505] MCW rank 0 bound to socket 0[core 0-5]: [B B B B 
B B . . . . . .][. . . . . . . . . . . .][. . . . . . . . . . . .][. . . 
. . . . . . . . .] (slot list 0-5)



Actually I'm dreaming of
mpirun --bind-to-NUMAnode --bycore ...
or
mpirun --bind-to-NUMAnode --byNUMAnode ...

Is there any workaround except rankfiles for this?
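
Two hedged options. The rankfile above can at least be generated instead of
written by hand, and later Open MPI releases (1.7 and newer) added NUMA-level
mapping/binding directly, which is roughly what is being asked for here
(check the mpirun man page of your version):

  # generate the 8-rank, 6-cores-per-NUMA-node rankfile shown above
  : > rankfile
  for r in $(seq 0 7); do
    lo=$(( r * 6 )); hi=$(( lo + 5 ))
    echo "rank ${r}=localhost slot=${lo}-${hi}" >> rankfile
  done
  mpirun --report-bindings -np 8 --rankfile rankfile ./your_app   # 1.6.x

  # newer releases (assumption: 1.7+ syntax)
  # mpirun --map-by numa --bind-to numa -np 8 ./your_app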

Regards,
 Oliver Weihe


[OMPI users] Segfault in mca_odls_default.so with > ~100 process.

2010-02-26 Thread Oliver Ford


I am trying to run an MPI code across 136 processes using an appfile
(attached), since every process needs to be run with a host- and
process-dependent parameter.

This whole system works wonderfully for up to around 100 processes, but it
usually fails with a segfault, apparently in mca_odls_default.so,
during initialization.
The attached appfile is an attempt at 136 processes. If I split the
appfile into two, both halves will initialize OK and successfully pass
an MPI_Barrier() (the program won't actually work without all 136 nodes,
but I'm happy MPI is doing its job). Because both halves work, I think
it has to be related to the number of processes - not a problem with a
specific appfile entry or machine.

The cluster I am running it on has openmpi-1.3.3, but I have also
installed 1.4.1 from the website in my home dir and that does the same
(the attached data comes from that build).

The actual segfault is:
[jac-11:12300] *** Process received signal ***
[jac-11:12300] Signal: Segmentation fault (11)
[jac-11:12300] Signal code: Address not mapped (1)
[jac-11:12300] Failing at address: 0x40
[jac-11:12300] [ 0] [0x74640c]
[jac-11:12300] [ 1] /home/oford/openmpi/lib/openmpi/mca_odls_default.so
[0x8863d4]
[jac-11:12300] [ 2] /home/oford/openmpi/lib/libopen-rte.so.0 [0x76ffe9]
[jac-11:12300] [ 3]
/home/oford/openmpi/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x2f6)
[0x771b86]
[jac-11:12300] [ 4] /home/oford/openmpi/lib/libopen-pal.so.0 [0x5d6ba8]
[jac-11:12300] [ 5]
/home/oford/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x27) [0x5d6e47]
[jac-11:12300] [ 6]
/home/oford/openmpi/lib/libopen-pal.so.0(opal_progress+0xce) [0x5ca00e]
[jac-11:12300] [ 7]
/home/oford/openmpi/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x355)
[0x7815f5]
[jac-11:12300] [ 8] /home/oford/openmpi/lib/openmpi/mca_plm_rsh.so
[0xc73d1b]
[jac-11:12300] [ 9] mpirun [0x804a8f0]
[jac-11:12300] [10] mpirun [0x8049ef6]
[jac-11:12300] [11] /lib/libc.so.6(__libc_start_main+0xe5) [0x1406e5]
[jac-11:12300] [12] mpirun [0x8049e41]
[jac-11:12300] *** End of error message ***
Segmentation fault


The full output with '-d' and the config.log from the build of 1.4.1 are
also attached.

I don't know the exact setup of the network, but I can ask our sysadmin
anything else that might help.

Thanks in advance,


Oliver Ford

Culham Centre for Fusion Energy
Oxford, UK




-np 1 --host jac-11 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-39-80115 Y 11 11 133 debug
-np 1 --host jac-5 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-26-81244 N 11 11 133 debug
-np 1 --host batch-020 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-122-75993 N 11 11 133 debug
-np 1 --host batch-037 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-157-15286 N 11 11 133 debug
-np 1 --host batch-042 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-114-89529 N 11 11 133 debug
-np 1 --host jac-9 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-35-90257 N 11 11 133 debug
-np 1 --host batch-020 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-151-56062 N 11 11 133 debug
-np 1 --host batch-004 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-16-2723 N 11 11 133 debug
-np 1 --host batch-003 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-156-65790 N 11 11 133 debug
-np 1 --host jac-11 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-198-63239 N 11 11 133 debug
-np 1 --host batch-046 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-105-12753 N 11 11 133 debug
-np 1 --host batch-015 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-12-25631 N 11 11 133 debug
-np 1 --host jac-12 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-196-35421 N 11 11 133 debug
-np 1 --host batch-045 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-103-98246 N 11 11 133 debug
-np 1 --host batch-006 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-142-44009 N 11 11 133 debug
-np 1 --host batch-044 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-117-30325 N 11 11 133 debug
-np 1 --host batch-003 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-143-21739 N 11 11 133 debug
-np 1 --host batch-042 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-112-64293 N 11 11 133 debug
-np 1 --host batch-041 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-57-11238 N 11 11 133 debug
-np 1 --host batch-025 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiStaging/mats-94280x5887-170-80831 N 11 11 133 debug
-np 1 --host jac-6 /home/oford/java/mcServer/lgidmath/lgidmath 
/tmp/lgiS

Re: [OMPI users] Segfault in mca_odls_default.so with > ~100 process.

2010-02-27 Thread Oliver Ford

Ralph Castain wrote:

Yeah, the system won't like this. Your approach makes it look like you are 
launching 136 app_contexts. We currently only support up to 128 app_contexts. I 
don't think anyone anticipated somebody trying to use the system this way.

I can expand the number to something larger. Will have to see how big a change 
it requires (mostly a question of how many places are touched) before we know 
what release this might show up in.


  

I see.

Is there a better way that I should be doing this, i.e. running the programs 
on specific hosts with specific args?
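
One possible workaround (a sketch, not the project's recommendation): launch
a single app_context everywhere and let each rank look up its own arguments
via the OMPI_COMM_WORLD_RANK environment variable that Open MPI sets for
launched processes. The wrapper name and the args.txt file (one argument line
per rank) are hypothetical:

  #!/bin/sh
  # wrapper.sh - pick this rank's arguments out of args.txt
  ARGS=$(sed -n "$((OMPI_COMM_WORLD_RANK + 1))p" args.txt)
  exec /home/oford/java/mcServer/lgidmath/lgidmath $ARGS

launched as "mpirun -np 136 --hostfile hosts ./wrapper.sh". If a given
argument must run on a specific host (as in the appfile), the lookup would
have to be keyed on the hostname instead of the rank.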


Alternatively, if you point me at the appropriate piece of code, I'll 
have a go at making the number a #define or something, and putting some 
checks in so it doesn't just crash.


Oliver


Re: [OMPI users] Segfault in mca_odls_default.so with > ~100 process.

2010-02-27 Thread Oliver Ford

Ralph Castain wrote:

Yeah, the system won't like this. Your approach makes it look like you are 
launching 136 app_contexts. We currently only support up to 128 app_contexts. I 
don't think anyone anticipated somebody trying to use the system this way.

I can expand the number to something larger. Will have to see how big a change 
it requires (mostly a question of how many places are touched) before we know 
what release this might show up in.

  
The app_context allocation is all dynamic, so that is fine; the problem is that 
'app_idx' (in various structures and code), which appears to be some kind of 
index mapping, is defined as int8_t, so everything goes negative after 
128 - hence the segfault.


Attached is a patch against the openmpi-1.4.1 tarball on the website to make 
it all int32_t, which I've tested and works fine.


I've also attached a patch for the current SVN head, which compiles but 
I can't test it because the current SVN head doesn't work for me at all 
at present (for an appfile with less than 128 entries).


Sorry to send this here rather than the dev list, but I don't really 
have the time to sign up and get involved at the moment.



Hope that helps a bit,
Oliver
diff -ur openmpi-1.4.1/orte/mca/odls/base/odls_base_default_fns.c openmpi-1.4.1-new/orte/mca/odls/base/odls_base_default_fns.c
--- openmpi-1.4.1/orte/mca/odls/base/odls_base_default_fns.c	2009-12-08 20:36:37.0 +
+++ openmpi-1.4.1-new/orte/mca/odls/base/odls_base_default_fns.c	2010-02-27 12:21:14.0 +
@@ -74,7 +74,7 @@
 #include "orte/mca/odls/base/base.h"
 #include "orte/mca/odls/base/odls_private.h"

-static int8_t *app_idx;
+static int32_t *app_idx;

 /* IT IS CRITICAL THAT ANY CHANGE IN THE ORDER OF THE INFO PACKED IN
  * THIS FUNCTION BE REFLECTED IN THE CONSTRUCT_CHILD_LIST PARSER BELOW
@@ -1555,7 +1577,7 @@
 nrank = 0;
 opal_dss.pack(&buffer, &nrank, 1, ORTE_NODE_RANK);  /* node rank */
 one8 = 0;
-opal_dss.pack(&buffer, &one8, 1, OPAL_INT8);  /* app_idx */
+opal_dss.pack(&buffer, &one32, 1, OPAL_INT32);  /* app_idx */
 jobdat->pmap = (opal_byte_object_t*)malloc(sizeof(opal_byte_object_t));
 opal_dss.unload(&buffer, (void**)&jobdat->pmap->bytes, &jobdat->pmap->size);
 OBJ_DESTRUCT(&buffer);
diff -ur openmpi-1.4.1/orte/runtime/orte_globals.h openmpi-1.4.1-new/orte/runtime/orte_globals.h
--- openmpi-1.4.1/orte/runtime/orte_globals.h	2009-12-08 20:36:44.0 +
+++ openmpi-1.4.1-new/orte/runtime/orte_globals.h	2010-02-27 12:30:20.0 +
@@ -137,7 +137,7 @@
 /** Parent object */
 opal_object_t super;
 /** Unique index when multiple apps per job */
-int8_t idx;
+int32_t idx;
 /** Absolute pathname of argv[0] */
 char   *app;
 /** Number of copies of this process that are to be launched */
@@ -382,7 +382,7 @@
 /* exit code */
 orte_exit_code_t exit_code;
 /* the app_context that generated this proc */
-int8_t app_idx;
+int32_t app_idx;
 /* a cpu list, if specified by the user */
 char *slot_list;
 /* pointer to the node where this proc is executing */
diff -ur openmpi-1.4.1/orte/util/nidmap.c openmpi-1.4.1-new/orte/util/nidmap.c
--- openmpi-1.4.1/orte/util/nidmap.c	2009-12-08 20:36:44.0 +
+++ openmpi-1.4.1-new/orte/util/nidmap.c	2010-02-27 12:23:18.0 +
@@ -589,7 +589,7 @@
 int32_t *nodes;
 orte_proc_t **procs;
 orte_vpid_t i;
-int8_t *tmp;
+int32_t *tmp;
 opal_buffer_t buf;
 orte_local_rank_t *lrank;
 orte_node_rank_t *nrank;
@@ -645,11 +645,11 @@
 free(nrank);

 /* transfer and pack the app_idx in one pack */
-tmp = (int8_t*)malloc(jdata->num_procs);
+tmp = (int32_t*)malloc(jdata->num_procs * sizeof(int32_t));
 for (i=0; i < jdata->num_procs; i++) {
 tmp[i] = procs[i]->app_idx;
 }
-if (ORTE_SUCCESS != (rc = opal_dss.pack(&buf, tmp, jdata->num_procs, OPAL_INT8))) {
+if (ORTE_SUCCESS != (rc = opal_dss.pack(&buf, tmp, jdata->num_procs, OPAL_INT32))) {
 ORTE_ERROR_LOG(rc);
 return rc;
 }
@@ -664,7 +665,7 @@


 int orte_util_decode_pidmap(opal_byte_object_t *bo, orte_vpid_t *nprocs,
-opal_value_array_t *procs, int8_t **app_idx,
+opal_value_array_t *procs, int32_t **app_idx,
 char ***slot_str)
 {
 orte_vpid_t i, num_procs;
@@ -672,7 +673,7 @@
 int32_t *nodes;
 orte_local_rank_t *local_rank;
 orte_node_rank_t *node_rank;
-int8_t *idx;
+int32_t *idx;
 orte_std_cntr_t n;
 opal_buffer_t buf;
 int rc;
@@ -746,10 +747,10 @@
 }

 /* allocate memory for app_idx */
-idx = (int8_t*)malloc(num_procs);
+idx = (int32_t*)malloc(num_procs * sizeof(int32_t));
 /* unpack app_idx in one shot */
 n=num_procs;

Re: [OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-03-31 Thread Oliver Geisler
I have tried up to kernel 2.6.33.1 on both architectures (Core2 Duo and
i5) with the same results. The "slow" results also appear when the
processes are distributed across the 4 cores of one single node.
We use
btl = self,sm,tcp
in
/etc/openmpi/openmpi-mca-params.conf
Distributing several processes, one per core, across several machines is
fast and has "normal" communication times. So I guess TCP communication
shouldn't be the problem.
Also, multiple instances of the program, started on one "master" node,
with each instance distributing several processes to one core of the "slave"
nodes, don't seem to be a problem. In effect, 4 instances of the program
occupy all 4 cores on each node, which doesn't influence communication
and overall calculation time much.
But running 4 processes from the same "master" instance on 4 cores of
the same node does.


Do you have some more ideas about what I can test for? I tried to run
connectivity_c from the openmpi examples on 8 nodes/32 processes. It is hard
to get reliable/consistent figures from 'top' since the program
terminates quite fast and the interesting usage is very short. But these are
some snapshots of 'top' (master and slave nodes show similar pictures):

System and/or Wait Time are up.

sh-3.2$ mpirun -np 4 -host cluster-05 connectivity_c : -np 28 -host
cluster-06,cluster-07,cluster-08,cluster-09,cluster-10,cluster-11,cluster-12
connectivity_c
Connectivity test on 32 processes PASSED.


Cpu(s): 37.5%us, 46.6%sy,  0.0%ni,  0.0%id, 15.9%wa,  0.0%hi,  0.0%si,
0.0%st
Mem:   8181236k total,   168200k used,  8013036k free,0k buffers
Swap:0k total,0k used,0k free,   132092k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  P COMMAND
25179 oli   20   0  143m 3436 2196 R   43  0.0   0:00.57 0
25180 oli   20   0  142m 3392 2180 R  100  0.0   0:00.85 3
25182 oli   20   0  142m 3312 2172 R  100  0.0   0:00.93 2
25181 oli   20   0  134m 3052 2172 R  100  0.0   0:00.93 1

Cpu(s): 10.3%us,  8.7%sy,  0.0%ni, 21.4%id, 58.7%wa,  0.8%hi,  0.0%si,
0.0%st
Mem:   8181236k total,   171352k used,  8009884k free,0k buffers
Swap:0k total,0k used,0k free,   130572k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  P COMMAND
29496 oli   20   0  142m 3300 2176 D   33  0.0   0:00.21 2
29497 oli   20   0  142m 3280 2160 R   25  0.0   0:00.17 0
29494 oli   20   0  134m 3044 2180 D0  0.0   0:00.01 1
29495 oli   20   0  134m 3036 2172 R   16  0.0   0:00.11 3

Cpu(s): 18.3%us, 36.3%sy,  0.0%ni, 38.0%id,  6.3%wa,  1.1%hi,  0.0%si,
0.0%st
Mem:   8181236k total,   141704k used,  8039532k free,0k buffers
Swap:0k total,0k used,0k free,99828k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  P COMMAND
29452 oli   20   0  143m 3452 2212 R   52  0.0   0:00.37 1
29455 oli   20   0  143m 3452 2212 S   57  0.0   0:00.41 3
29453 oli   20   0  143m 3440 2200 S   55  0.0   0:00.39 0
29454 oli   20   0  143m 3440 2200 R   55  0.0   0:00.39 2


Thanks for your thoughts, each input is appreciated.

Oli




On 3/31/2010 8:38 AM, Jeff Squyres wrote:
> I have a very dim recollection of some kernel TCP issues back in some older 
> kernel versions -- such issues affected all TCP communications, not just MPI. 
>  Can you try a newer kernel, perchance?
> 
> 
> On Mar 30, 2010, at 1:26 PM,   wrote:
> 
>> Hello List,
>>
>> I hope you can help us out on this one, as we have been trying to figure
>> it out for weeks.
>>
>> The situation: We have a program capable of splitting into several
>> processes to be shared across nodes within a cluster network using openmpi.
>> We were running that system on "older" cluster hardware (Intel Core2 Duo
>> based, 2GB RAM) using an "older" kernel (2.6.18.6). All nodes are
>> diskless network booting. Recently we upgraded the hardware (Intel i5,
>> 8GB RAM) which also required an upgrade to a recent kernel version
>> (2.6.26+).
>>
>> Here is the problem: We experience overall performance loss on the new
>> hardware and think we can break it down to a communication issue
>> between the processes.
>>
>> Also, we found out the issue arises in the transition from kernel
>> 2.6.23 to 2.6.24 (tested on the Core2 Duo system).
>>
>> Here is an output from our program:
>>
>> 2.6.23.17 (64bit), MPI 1.2.7
>> 5 Iterationen (Core2 Duo) 6 CPU:
>> 93.33 seconds per iteration.
>>  Node   0 communication/computation time:  6.83 /647.64 seconds.
>>  Node   1 communication/computation time: 10.09 /644.36 seconds.
>>  Node   2 communication/computation time:  7.27 /645.03 seconds.
>>  Node   3 communication/computation time:165.02 /485.52 seconds.
>>  Node   4 communication/computation time:  6.50 /643.82 seconds.
>>  Node   5 communication/computation time:  7.80 /627.63 seconds.
>>  Computation time:897.00 seconds.
>>
>> 2.6.24.7 (64bit) .. re-evaluated, MPI 1.2.7
>> 5 Iterationen

Re: [OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-01 Thread Oliver Geisler
Does anyone know a benchmark program I could use for testing?





Re: [OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-01 Thread Oliver Geisler

> However, reading through your initial description on Tuesday, none of these 
> fit: You want to actually measure the kernel time on TCP communication costs.
> 
Since the problem also occurs in a node-only configuration and the MCA option
btl = self,sm,tcp is used, I doubt it has to do with TCP communication.
But I will anyway keep it in the back of my mind.

> So, have You tried attaching "strace -c -f -p PID" to the actual application 
> processes?
> 
> As a starter You may invoke the benchmark using:
>mpirun -np 4 strace -c -f ./benchmark
> (which however includes initialization and all other system calls)...
> 
I ran it as you suggested (node-only, no network distribution).
I am not really fond of analyzing this in detail, but maybe it rings a
bell for one of you:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 37.97    0.000508           0    119856           rt_sigaction
 33.78    0.000452           0     59925           poll
 21.00    0.000281           0    179776           rt_sigprocmask
  7.25    0.000097           0    121297           gettimeofday
  0.00    0.000000           0        85           read
  0.00    0.000000           0         3           write
  0.00    0.000000           0       324       203 open
  0.00    0.000000           0       129           close
  0.00    0.000000           0         3         3 unlink
[...]
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 34.64    0.000194           0     92934           gettimeofday
 28.75    0.000161           0    137227           rt_sigprocmask
 26.25    0.000147           0     45742           poll
[...]

I can provide the whole output, if you like.




Re: [OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-06 Thread Oliver Geisler
On 4/1/2010 12:49 PM, Rainer Keller wrote:

> On Thursday 01 April 2010 12:16:25 pm Oliver Geisler wrote:
>> Does anyone know a benchmark program, I could use for testing?
> There's an abundance of benchmarks (IMB, netpipe, SkaMPI...) and performance 
> analysis tools (Scalasca, Vampir, Paraver, Opt, Jumpshot).
> 

I used SkaMPI to test communication; the third column, showing the
communication time, is the most important one. Same effect: kernels below
2.6.24 show fast communication, while higher kernel versions are slower by
a factor of thousands.

Hm. The issue seems not to be linked to the application. The kernel
configuration was carried forward from the working kernel 2.6.18 through to
2.6.33, mostly using defaults for new features.

Any ideas what to look for? What other tests could I make to give you
guys more information?

Thanks so far,

oli



Tested on Intel Core2 Duo with openmpi 1.4.1

"skampi_coll"-test

kernel 2.6.18.6:
# begin result "MPI_Bcast-length"
count=      1        4       1.0     0.0   16       0.1       1.0
count=      2        8       1.0     0.0    8       0.0       1.0
count=      3       12       1.0     0.0   16       0.0       1.0
count=      4       16       1.3     0.1   32       0.0       1.3
count=      6       24       1.0     0.0    8       0.2       1.0
count=      8       32       1.0     0.0   32       0.1       1.0
{...}
count= 370728  1482912    1023.8    42.3    8    1023.8    1023.1
count= 524288  2097152    1440.3     3.7    8    1440.3    1439.5
# end result "MPI_Bcast-length"
# duration = 0.09 sec

kernel 2.6.33.1:
# begin result "MPI_Bcast-length"
count=      1        4    1786.5   131.2   34    1095.3    1786.5
count=      2        8    1504.9    77.1   34     759.3    1504.9
count=      3       12    1852.4   139.2   35    1027.9    1852.4
count=      4       16    2430.5   152.0   38    1200.5    2430.5
count=      6       24    1898.7    69.5   35     807.6    1898.7
count=      8       32    1769.1    16.3   34     763.3    1769.1
{...}
count= 370728  1482912  216145.9  3011.6   29  216145.9  214898.1
count= 524288  2097152  274813.7  1519.5   12  274813.7  274087.4
# end result "MPI_Bcast-length"
# duration = 140.64 sec




Re: [OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-06 Thread Oliver Geisler
On 4/6/2010 2:53 PM, Jeff Squyres wrote:

> 
> Try NetPIPE -- it has both MPI communication benchmarking and TCP 
> benchmarking.  Then you can see if there is a noticable difference between 
> TCP and MPI (there shouldn't be).  There's also a "memcpy" mode in netpipe, 
> but it's not quite the same thing as shared memory message passing.
> 

Using NetPIPE and comparing TCP and MPI communication, I get the
following results:

TCP is much faster than MPI, approx. by a factor of 12,
e.g. a packet size of 4096 bytes is delivered in
97.11 usec with NPtcp and
15338.98 usec with NPmpi
or
packet size 262 kB
0.05268801 sec NPtcp
0.00254560 sec NPmpi

Further, our benchmark started with "--mca btl tcp,self" runs with short
communication times, even with kernel 2.6.33.1.

Is there a way to see what type of communication is actually selected?

Can anybody imagine why shared memory leads to these problems?
Kernel configuration?
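
One way to see which BTL components exist in the build and which one actually
carries the traffic (a sketch; the verbosity value is arbitrary, anything
above 0 prints the selection):

  ompi_info | grep btl                                 # BTLs compiled into this build
  mpirun -np 4 --mca btl self,sm,tcp \
         --mca btl_base_verbose 30 ./benchmark         # logs which BTL gets selected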


Thanks, Jeff, for insisting upon testing network performance.
Thanks all others, too ;-)

oli





Re: [OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-22 Thread Oliver Geisler
To keep this thread updated:

After I posted to the developers list, the community was able to guide
to a solution to the problem:
http://www.open-mpi.org/community/lists/devel/2010/04/7698.php

To sum up:

The extended communication times when using shared memory communication
between openmpi processes are caused by the openmpi session directory lying
on the network via NFS.

The problem is resolved by establishing a ramdisk on each diskless node, or
by mounting a tmpfs. By setting the MCA parameter orte_tmpdir_base to point
to the corresponding mountpoint, shared memory communication and its
files are kept local, thus decreasing the communication times by orders of
magnitude.
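
A minimal sketch of that fix on a diskless node (the mountpoint name is an
assumption):

  # keep the Open MPI session directory (and its shared-memory backing files) local
  mkdir -p /local-scratch
  mount -t tmpfs -o size=512m tmpfs /local-scratch
  mpirun --mca orte_tmpdir_base /local-scratch -np 4 ./benchmark

  # or set it once in /etc/openmpi/openmpi-mca-params.conf:
  #   orte_tmpdir_base = /local-scratch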

The relation of the problem to the kernel version is not really
resolved, but maybe that is not really "the problem" in this respect.
My benchmark is now running fine on a single node with 4 CPUs, kernel
2.6.33.1 and openmpi 1.4.1.
Running on multiple nodes I still experience higher (TCP) communication
times than I would expect. But that requires some deeper research on my
side (e.g. collisions on the network) and should probably be posted to a
new thread.

Thank you guys for your help.

oli






[OMPI users] Processes always rank 0

2010-07-08 Thread Oliver Stolpe

Hello there,

I have a problem setting up MPI/LAM. Here we go:

I start lam with the lamboot command successfully:

$ lamboot -v hostnames

LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University

n-1<11960> ssi:boot:base:linear: booting n0 (frost)
n-1<11960> ssi:boot:base:linear: booting n1 (hurricane)
n-1<11960> ssi:boot:base:linear: booting n2 (hail)
n-1<11960> ssi:boot:base:linear: booting n3 (fog)
n-1<11960> ssi:boot:base:linear: booting n4 (rain)
n-1<11960> ssi:boot:base:linear: booting n5 (thunder)
n-1<11960> ssi:boot:base:linear: finished

Ok, all is fine. I test a command (hostname in this case):

$ mpirun -v --hostfile hostnames hostname
thunder
rain
frost
fog
hurricane
hail

Works. I write a hello world program for testing:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
   unsigned int rank;
   unsigned int size;
   MPI_Init(&argc, &argv);

   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);

   printf("Hello, World. I am %u of %u\n", rank, size);

   MPI_Finalize();
   return 0;
}

I compile and run it:

$ mpicc -o mpitest mpitest.c && mpirun -v --hostfile hostnames ./mpitest
Hello, World. I am 0 of 1
Hello, World. I am 0 of 1
Hello, World. I am 0 of 1
Hello, World. I am 0 of 1
Hello, World. I am 0 of 1
Hello, World. I am 0 of 1

And I don't get why every process has rank 0 and the size is only 1. I 
followed many tutorials and checked it many times. Does anyone 
have an idea?
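
One quick way to check which MPI implementation the wrapper compiler and
launcher actually belong to (a sketch; --showme is the Open MPI wrapper
option, LAM's wrappers use -showme):

  which mpicc mpirun                              # where do the wrappers live?
  mpicc --showme 2>/dev/null || mpicc -showme     # Open MPI vs. LAM wrapper
  mpirun --version                                # banner names the implementation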


Thanks in advance!

Oliver

Some infos:

$ lamboot -v

LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University

n-1<12088> ssi:boot:base:linear: booting n0 (localhost)
n-1<12088> ssi:boot:base:linear: finished
ocs@frost:~$ lamboot -V

LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University

   Arch:x86_64-pc-linux-gnu
   Prefix:/usr/lib/lam
   Configured by:buildd
   Configured on:Sun Apr  6 01:43:15 UTC 2008
   Configure host:excelsior
   SSI rpi:crtcp lamd sysv tcp usysv

$ mpirun -V
mpirun (Open MPI) 1.2.7rc2

Report bugs to http://www.open-mpi.org/community/help/

$ mpicc -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 
4.3.2-1.1' --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs 
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr 
--enable-shared --with-system-zlib --libexecdir=/usr/lib 
--without-included-gettext --enable-threads=posix --enable-nls 
--with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3 
--enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc 
--enable-mpfr --enable-cld --enable-checking=release 
--build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu

Thread model: posix
gcc version 4.3.2 (Debian 4.3.2-1.1)


Re: [OMPI users] Processes always rank 0

2010-07-08 Thread Oliver Stolpe
I thought OpenMPI was what I was using. I do not have permission to 
install anything, except in my home directory. All the tutorials I found 
started the environment with the lamboot command. What's the difference 
when using only OpenMPI?


$ whereis openmpi
openmpi: /etc/openmpi /usr/lib/openmpi /usr/lib64/openmpi /usr/share/openmpi

$ echo $LD_LIBRARY_PATH
:/usr/lib/openmpi/lib:/usr/lib64/openmpi/lib:

$ whereis mpirun
mpirun: /usr/bin/mpirun.mpich /usr/bin/mpirun /usr/bin/mpirun.lam 
/usr/bin/mpirun.openmpi


$ ll /usr/bin/mpirun
lrwxrwxrwx 1 root root 24 14. Aug 2008  /usr/bin/mpirun -> /usr/bin/orterun

$ ll /usr/bin/orterun
-rwxr-xr-x 1 root root 39280 25. Aug 2008  /usr/bin/orterun

$ ll /usr/bin/mpirun.openmpi
lrwxrwxrwx 1 root root 7  5. Sep 2008  /usr/bin/mpirun.openmpi -> orterun

When I run mpirun without starting the environment by using lamboot, it 
says:


ocs@frost:~$ mpicc -o mpitest mpitest.c && mpirun -np 1 -machinefile 
machines ./mpitest

-

It seems that there is no lamd running on the host frost.

This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for MPI programs to run
(the MPI program tried to invoke the "MPI_Init" function).

Please run the "lamboot" command to start the LAM/MPI runtime
environment.  See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----

Thanks in advance,
Oliver


Jeff Squyres wrote:

If you're just starting with MPI, is there any chance you can upgrade to Open 
MPI instead of LAM/MPI?  All of the LAM/MPI developers moved to Open MPI years 
ago.





  




Re: [OMPI users] Processes always rank 0

2010-07-08 Thread Oliver Stolpe
You were right, it was linked to the LAM compiler. I didn't find the 
Open MPI compiler on the system, though.
So I downloaded and compiled the current stable version of Open MPI. 
That worked. I had to make symbolic links to the executables so that the 
system won't get confused with the old MPI install, which I don't have 
permission to uninstall.


Now, when I use only my current host as the machine, it works:

$ mympicc mpitest.c -o mpitest && mympirun -np 3 -machinefile machines 
mpitest

Hello, World. I am 2 of 3
Hello, World. I am 0 of 3
Hello, World. I am 1 of 3

To get it to run on multiple machines, I had to give an absolute path to 
the mpirun binary:


ocs@frost:~$ /home/bude/ocs/openmpi/bin/mpirun -np 8 -machinefile 
machines mpitest

Hello, World. I am 4 of 8
Hello, World. I am 0 of 8
Hello, World. I am 7 of 8
Hello, World. I am 5 of 8
Hello, World. I am 1 of 8
Hello, World. I am 3 of 8
Hello, World. I am 2 of 8
Hello, World. I am 6 of 8

Yay! Works as it should (I think so; is there a way to verify that the 
instances really execute on the remote machines?)! Thanks, everybody. This 
is not the most elegant solution, but it works eventually.
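
To verify that the ranks really land on the remote machines, one simple
check (a sketch) is to launch a command whose output identifies the host, or
have the test program print it:

  /home/bude/ocs/openmpi/bin/mpirun -np 8 -machinefile machines hostname

or add a call to MPI_Get_processor_name() in mpitest.c and print the
returned name next to the rank.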


Best regards,
Oliver

Jeff Squyres (jsquyres) wrote:
Lam and open mpi are two different mpi implementations. Lam came before open mpi; we stopped developing lam years ago. 

Lamboot is a lam-specific command. It has no analogue in open mpi. 

Orterun is open mpi's mpirun. 

From a quick look at your paths and whatnot, it's not immediately obvious how you are mixing lam and open mpi, but somehow you are. You need to disentangle them and entirely use open mpi. 


Perhaps your mpicc is sym linked to the lam mpicc (instead of the open mpi 
mpicc)...?

-jms
Sent from my PDA.  No type good.
