[OMPI users] Strange rank 0 behavior on Mac OS
hi all, I have Open MPI 1.8.4 installed on a Mac laptop. When running an MPI job on localhost for testing, I've noticed that when -np is an odd number (3, 5, etc.) it sometimes runs into a situation where only rank 0 is progressing and the rest of the ranks don't seem to get to run at all ... I've examined the code many times over and couldn't see why. I am wondering if this is something else entirely; has anyone run into a similar problem before? TIA Oliver
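(A minimal sanity check, not part of the original post, that can help separate an application bug from a launch/scheduling problem: every rank prints immediately and then waits at a barrier, so a rank that never prints was never scheduled at all.)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d is alive\n", rank, size);  /* every rank should print this */
    fflush(stdout);
    MPI_Barrier(MPI_COMM_WORLD);   /* hangs here if some ranks never get to run */
    if (rank == 0) {
        printf("all %d ranks reached the barrier\n", size);
    }
    MPI_Finalize();
    return 0;
}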
[OMPI users] Rebuild RPM for CentOS 7.1
hi all I am trying to rebuild the 1.10 RPM from the src rpm on CentOS 7. The build process went fine without problems. While trying to install the rpm, I encountered the following error:

Examining openmpi-1.10.0-1.x86_64.rpm: openmpi-1.10.0-1.x86_64
Marking openmpi-1.10.0-1.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package openmpi.x86_64 0:1.10.0-1 will be installed
--> Finished Dependency Resolution

...

Transaction check error:
  file /usr/bin from install of openmpi-1.10.0-1.x86_64 conflicts with file from package filesystem-3.2-18.el7.x86_64
  file /usr/lib64 from install of openmpi-1.10.0-1.x86_64 conflicts with file from package filesystem-3.2-18.el7.x86_64

What am I missing? Is there a fix? TIA -- Oliver
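(For context, the rebuild-and-install sequence described above would typically look something like the following; the exact src.rpm file name and the default ~/rpmbuild output path are assumptions, not quoted from the post:

rpmbuild --rebuild openmpi-1.10.0-1.src.rpm
yum install ~/rpmbuild/RPMS/x86_64/openmpi-1.10.0-1.x86_64.rpm

The transaction error reported above then appears at the yum install step.)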
Re: [OMPI users] Rebuild RPM for CentOS 7.1
hi Gilles, Yes, I got the src rpm from the official OMPI site. I am not sure what you meant by "related x86_64.rpm"; as far as I can tell, only the src rpm is distributed: http://www.open-mpi.org/software/ompi/v1.10/

rpm -qlp of the newly built rpm does indeed show files in /usr/bin:

etc
/etc/openmpi-default-hostfile
/etc/openmpi-mca-params.conf
/etc/openmpi-totalview.tcl
/etc/vtsetup-config.dtd
/etc/vtsetup-config.xml
/usr
/usr/bin
/usr/bin/mpiCC
/usr/bin/mpiCC-vt
/usr/bin/mpic++
....

Best, Oliver

On Sun, Nov 1, 2015 at 8:20 PM, Gilles Gouaillardet wrote:
> Olivier,
>
> where did you get the src.rpm from ?
>
> assuming you downloaded it, can you also download the related x86_64.rpm,
> run rpm -qlp open-mpi-xxx.x86_64.rpm and check if there are files in
> /usr/bin and /usr/lib64 ?
>
> Cheers,
>
> Gilles

-- Oliver
Re: [OMPI users] Rebuild RPM for CentOS 7.1
Gilles, The new spec file has some issues:

f7b@ct0 ~/rpmbuild/SPECS$ rpmbuild -ba openmpi-1.10.0.spec
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.gDGUNx
+ umask 022
+ cd /home/f7b/rpmbuild/BUILD
+ rm -rf $'/home/f7b/rpmbuild/BUILDROOT/openmpi-1.10.0-1.x86_64\r'
+ $'\r'
/var/tmp/rpm-tmp.gDGUNx: line 33: $'\r': command not found
error: Bad exit status from /var/tmp/rpm-tmp.gDGUNx (%prep)

RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.gDGUNx (%prep)

On Mon, Nov 2, 2015 at 3:16 AM, Gilles Gouaillardet wrote:
> Olivier,
>
> here is attached an updated spec file that works on Cent OS 7
>
> i will double think about it before a permanent fix
>
> Cheers,
>
> Gilles

-- Oliver
Re: [OMPI users] Rebuild RPM for CentOS 7.1
Gilles, On closer inspection, the earlier spec file errors were caused by CRLF line terminators (it seems the file was prepared on Windows?). Once the file is converted to Unix line endings, everything seems to be fine. Thanks for your spec. Oliver

On Wed, Nov 4, 2015 at 7:24 PM, Gilles Gouaillardet wrote:
> Olivier,
>
> i just forgot ompi was officially shipping .src.rpm (and there is no
> binary x86_64.rpm)
>
> please use the .spec i sent in a previous email (assuming you want ompi in /usr)
> another option is to
> rpmbuild -ba --define 'install_in_opt 1' SPECS/openmpi-1.10.0.spec
> and ompi will be installed in /opt
>
> Cheers,
>
> Gilles
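(For anyone hitting the same CRLF problem: the conversion mentioned above is a one-liner, for example

dos2unix openmpi-1.10.0.spec

or, if dos2unix is not installed,

sed -i 's/\r$//' openmpi-1.10.0.spec

The spec file name here is assumed to match the one used in the rpmbuild command above.)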
[OMPI users] segfault (shared memory initialization) after program ended
hi - I have an application that consistently segfaults when I run it with "mpirun --oversubscribe"; the following message comes AFTER the application has finished running. My environment: macOS with Open MPI 3.1.2. Is this a problem with my application, or with my environment? Any help? thanks Oliver

--
A system call failed during shared memory initialization that should not have. It is likely that your MPI job will now either abort or experience performance degradation.

Local host: pi.local
System call: unlink(2) /var/folders/h2/ph7pgd4n3_z9v2pd0hk5nc6wgn/T//ompi.pi.501/pid.45364/1/vader_segment.pi.c1c1.7
Error: No such file or directory (errno 2)
--
mpirun(45364,0x7e1c9000) malloc: *** mach_vm_map(size=1125899906846720) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
[pi:45364] *** Process received signal ***
[pi:45364] Signal: Segmentation fault: 11 (11)
[pi:45364] Signal code: Address not mapped (1)
[pi:45364] Failing at address: 0x0
[pi:45364] [ 0] 0 libsystem_platform.dylib 0x7fff7d999f5a _sigtramp + 26
[pi:45364] [ 1] 0 ??? 0x2d595060 0x0 + 760828000
[pi:45364] [ 2] 0 mca_rml_oob.so 0x000103aeadaf orte_rml_oob_send_buffer_nb + 956
[pi:45364] [ 3] 0 libopen-rte.40.dylib 0x00010357d0fa pmix_server_log_fn + 449
[pi:45364] [ 4] 0 mca_pmix_pmix2x.so 0x00010394f6d6 server_log + 857
[pi:45364] [ 5] 0 mca_pmix_pmix2x.so 0x000103982d42 pmix_server_log + 1257
[pi:45364] [ 6] 0 mca_pmix_pmix2x.so 0x0001039731e0 server_message_handler + 5032
[pi:45364] [ 7] 0 mca_pmix_pmix2x.so 0x0001039a9822 pmix_ptl_base_process_msg + 723
[pi:45364] [ 8] 0 libevent-2.1.6.dylib 0x0001036b6719 event_process_active_single_queue + 376
[pi:45364] [ 9] 0 libevent-2.1.6.dylib 0x0001036b3cb3 event_base_loop + 1074
[pi:45364] [10] 0 mca_pmix_pmix2x.so 0x000103988ce7 progress_engine + 26
[pi:45364] [11] 0 libsystem_pthread.dylib 0x7fff7d9a3661 _pthread_body + 340
[pi:45364] [12] 0 libsystem_pthread.dylib 0x7fff7d9a350d _pthread_body + 0
[pi:45364] [13] 0 libsystem_pthread.dylib 0x7fff7d9a2bf9 thread_start + 13
[pi:45364] *** End of error message ***
Segmentation fault: 11

-- Oliver
[OMPI users] process binding to NUMA node on Opteron 6xxx series CPUs?
Hi, is it possible to bind MPI processes to a NUMA node somehow on Opteron 6xxx series CPUs (e.g. --bind-to-NUMAnode) *without* the use of a rankfile? Opteron 6xxx CPUs have two NUMA nodes per CPU(-socket), so --bind-to-socket doesn't do what I want. This is a 4-socket Opteron 6344 system (12 cores per CPU(-socket)):

root@node01:~> numactl --hardware | grep cpus
node 0 cpus: 0 1 2 3 4 5
node 1 cpus: 6 7 8 9 10 11
node 2 cpus: 12 13 14 15 16 17
node 3 cpus: 18 19 20 21 22 23
node 4 cpus: 24 25 26 27 28 29
node 5 cpus: 30 31 32 33 34 35
node 6 cpus: 36 37 38 39 40 41
node 7 cpus: 42 43 44 45 46 47

root@node01:~> /opt/openmpi/1.6.3/gcc/bin/mpirun --report-bindings -np 8 --bind-to-socket --bysocket sleep 1s
[node01.cluster:21446] MCW rank 1 bound to socket 1[core 0-11]: [. . . . . . . . . . . .][B B B B B B B B B B B B][. . . . . . . . . . . .][. . . . . . . . . . . .]
[node01.cluster:21446] MCW rank 2 bound to socket 2[core 0-11]: [. . . . . . . . . . . .][. . . . . . . . . . . .][B B B B B B B B B B B B][. . . . . . . . . . . .]
[node01.cluster:21446] MCW rank 3 bound to socket 3[core 0-11]: [. . . . . . . . . . . .][. . . . . . . . . . . .][. . . . . . . . . . . .][B B B B B B B B B B B B]
[node01.cluster:21446] MCW rank 4 bound to socket 0[core 0-11]: [B B B B B B B B B B B B][. . . . . . . . . . . .][. . . . . . . . . . . .][. . . . . . . . . . . .]
[node01.cluster:21446] MCW rank 5 bound to socket 1[core 0-11]: [. . . . . . . . . . . .][B B B B B B B B B B B B][. . . . . . . . . . . .][. . . . . . . . . . . .]
[node01.cluster:21446] MCW rank 6 bound to socket 2[core 0-11]: [. . . . . . . . . . . .][. . . . . . . . . . . .][B B B B B B B B B B B B][. . . . . . . . . . . .]
[node01.cluster:21446] MCW rank 7 bound to socket 3[core 0-11]: [. . . . . . . . . . . .][. . . . . . . . . . . .][. . . . . . . . . . . .][B B B B B B B B B B B B]
[node01.cluster:21446] MCW rank 0 bound to socket 0[core 0-11]: [B B B B B B B B B B B B][. . . . . . . . . . . .][. . . . . . . . . . . .][. . . . . . . . . . . .]

So each process is bound to *two* NUMA nodes, but I want to bind to *one* NUMA node. What I want is more like this:

root@node01:~> cat rankfile
rank 0=localhost slot=0-5
rank 1=localhost slot=6-11
rank 2=localhost slot=12-17
rank 3=localhost slot=18-23
rank 4=localhost slot=24-29
rank 5=localhost slot=30-35
rank 6=localhost slot=36-41
rank 7=localhost slot=42-47

root@node01:~> /opt/openmpi/1.6.3/gcc/bin/mpirun --report-bindings -np 8 --rankfile rankfile sleep 1s
[node01.cluster:21505] MCW rank 1 bound to socket 0[core 6-11]: [. . . . . . B B B B B B][. . . . . . . . . . . .][. . . . . . . . . . . .][. . . . . . . . . . . .] (slot list 6-11)
[node01.cluster:21505] MCW rank 2 bound to socket 1[core 0-5]: [. . . . . . . . . . . .][B B B B B B . . . . . .][. . . . . . . . . . . .][. . . . . . . . . . . .] (slot list 12-17)
[node01.cluster:21505] MCW rank 3 bound to socket 1[core 6-11]: [. . . . . . . . . . . .][. . . . . . B B B B B B][. . . . . . . . . . . .][. . . . . . . . . . . .] (slot list 18-23)
[node01.cluster:21505] MCW rank 4 bound to socket 2[core 0-5]: [. . . . . . . . . . . .][. . . . . . . . . . . .][B B B B B B . . . . . .][. . . . . . . . . . . .] (slot list 24-29)
[node01.cluster:21505] MCW rank 5 bound to socket 2[core 6-11]: [. . . . . . . . . . . .][. . . . . . . . . . . .][. . . . . . B B B B B B][. . . . . . . . . . . .] (slot list 30-35)
[node01.cluster:21505] MCW rank 6 bound to socket 3[core 0-5]: [. . . . . . . . . . . .][. . . . . . . . . . . .][. . . . . . . . . . . .][B B B B B B . . . . . .] (slot list 36-41)
[node01.cluster:21505] MCW rank 7 bound to socket 3[core 6-11]: [. . . . . . . . . . . .][. . . . . . . . . . . .][. . . . . . . . . . . .][. . . . . . B B B B B B] (slot list 42-47)
[node01.cluster:21505] MCW rank 0 bound to socket 0[core 0-5]: [B B B B B B . . . . . .][. . . . . . . . . . . .][. . . . . . . . . . . .][. . . . . . . . . . . .] (slot list 0-5)

Actually I'm dreaming of
mpirun --bind-to-NUMAnode --bycore ...
or
mpirun --bind-to-NUMAnode --byNUMAnode ...

Is there any workaround except rankfiles for this?

Regards, Oliver Weihe
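(Note for later readers, not part of the original question: newer Open MPI releases (1.7.4 and the 1.8 series onwards) expose NUMA-level mapping and binding directly, so on those versions something along the lines of the command below is expected to do what is asked for here. It does not apply to the 1.6.3 installation shown above.

mpirun --report-bindings -np 8 --map-by numa --bind-to numa sleep 1s
)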
[OMPI users] Segfault in mca_odls_default.so with > ~100 processes.
I am trying to run an MPI code across 136 processes using an appfile (attached), since every process needs to be run with a host/process dependent parameter. This whole system works wonderfully for up to around 100 processes but usually fails with a segfault, apparently in mca_odls_default.so, during initialization. The attached appfile is an attempt at 136 processes. If I split the appfile into two, both halves will initialize OK and successfully pass an MPI_Barrier() (the program won't actually work without all 136 nodes, but I'm happy MPI is doing its job). Because both halves work, I think it has to be related to the number of processes - not a problem with a specific appfile entry or machine. The cluster I am running it on has openmpi-1.3.3, but I have also installed 1.4.1 from the website in my home dir and that does the same (and is where the attached data comes from). The actual segfault is:

[jac-11:12300] *** Process received signal ***
[jac-11:12300] Signal: Segmentation fault (11)
[jac-11:12300] Signal code: Address not mapped (1)
[jac-11:12300] Failing at address: 0x40
[jac-11:12300] [ 0] [0x74640c]
[jac-11:12300] [ 1] /home/oford/openmpi/lib/openmpi/mca_odls_default.so [0x8863d4]
[jac-11:12300] [ 2] /home/oford/openmpi/lib/libopen-rte.so.0 [0x76ffe9]
[jac-11:12300] [ 3] /home/oford/openmpi/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x2f6) [0x771b86]
[jac-11:12300] [ 4] /home/oford/openmpi/lib/libopen-pal.so.0 [0x5d6ba8]
[jac-11:12300] [ 5] /home/oford/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x27) [0x5d6e47]
[jac-11:12300] [ 6] /home/oford/openmpi/lib/libopen-pal.so.0(opal_progress+0xce) [0x5ca00e]
[jac-11:12300] [ 7] /home/oford/openmpi/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x355) [0x7815f5]
[jac-11:12300] [ 8] /home/oford/openmpi/lib/openmpi/mca_plm_rsh.so [0xc73d1b]
[jac-11:12300] [ 9] mpirun [0x804a8f0]
[jac-11:12300] [10] mpirun [0x8049ef6]
[jac-11:12300] [11] /lib/libc.so.6(__libc_start_main+0xe5) [0x1406e5]
[jac-11:12300] [12] mpirun [0x8049e41]
[jac-11:12300] *** End of error message ***
Segmentation fault

The full output with '-d' and the config.log from the build of 1.4.1 are also attached. I don't know the exact setup of the network, but I can ask our sysadmin anything else that might help.
Thanks in advance,
Oliver Ford
Culham Centre for Fusion Energy
Oxford, UK

-np 1 --host jac-11 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-39-80115 Y 11 11 133 debug
-np 1 --host jac-5 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-26-81244 N 11 11 133 debug
-np 1 --host batch-020 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-122-75993 N 11 11 133 debug
-np 1 --host batch-037 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-157-15286 N 11 11 133 debug
-np 1 --host batch-042 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-114-89529 N 11 11 133 debug
-np 1 --host jac-9 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-35-90257 N 11 11 133 debug
-np 1 --host batch-020 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-151-56062 N 11 11 133 debug
-np 1 --host batch-004 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-16-2723 N 11 11 133 debug
-np 1 --host batch-003 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-156-65790 N 11 11 133 debug
-np 1 --host jac-11 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-198-63239 N 11 11 133 debug
-np 1 --host batch-046 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-105-12753 N 11 11 133 debug
-np 1 --host batch-015 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-12-25631 N 11 11 133 debug
-np 1 --host jac-12 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-196-35421 N 11 11 133 debug
-np 1 --host batch-045 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-103-98246 N 11 11 133 debug
-np 1 --host batch-006 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-142-44009 N 11 11 133 debug
-np 1 --host batch-044 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-117-30325 N 11 11 133 debug
-np 1 --host batch-003 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-143-21739 N 11 11 133 debug
-np 1 --host batch-042 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-112-64293 N 11 11 133 debug
-np 1 --host batch-041 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-57-11238 N 11 11 133 debug
-np 1 --host batch-025 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-170-80831 N 11 11 133 debug
-np 1 --host jac-6 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiS
Re: [OMPI users] Segfault in mca_odls_default.so with > ~100 processes.
Ralph Castain wrote:
> Yeah, the system won't like this. Your approach makes it look like you are launching 136 app_contexts. We currently only support up to 128 app_contexts. I don't think anyone anticipated somebody trying to use the system this way. I can expand the number to something larger. Will have to see how big a change it requires (mostly a question of how many places are touched) before we know what release this might show up in.

I see. Is there a better way that I should be doing this - to run the programs on specific hosts with specific args? Alternatively, if you point me at the appropriate piece of code, I'll have a go at making the number a #define or something, and putting some checks in so it doesn't just crash. Oliver
Re: [OMPI users] Segfault in mca_odls_default.so with > ~100 processes.
Ralph Castain wrote:
> Yeah, the system won't like this. Your approach makes it look like you are launching 136 app_contexts. We currently only support up to 128 app_contexts. I don't think anyone anticipated somebody trying to use the system this way. I can expand the number to something larger. Will have to see how big a change it requires (mostly a question of how many places are touched) before we know what release this might show up in.

The app_context allocation is all dynamic, so that is fine; the problem is that 'app_idx' (various structures and code), which appears to be some kind of index mapping, is defined as int8_t, so everything goes negative after 128 - hence the segfault. Attached is a patch against the openmpi-1.4.1 tarball on the website to make it all int32_t, which I've tested and which works fine. I've also attached a patch for the current SVN head, which compiles but I can't test it because the current SVN head doesn't work for me at all at present (for an appfile with fewer than 128 entries). Sorry to send this here rather than the dev list, but I don't really have the time to sign up and get involved at the moment. Hope that helps a bit, Oliver

diff -ur openmpi-1.4.1/orte/mca/odls/base/odls_base_default_fns.c openmpi-1.4.1-new/orte/mca/odls/base/odls_base_default_fns.c
--- openmpi-1.4.1/orte/mca/odls/base/odls_base_default_fns.c 2009-12-08 20:36:37.0 +
+++ openmpi-1.4.1-new/orte/mca/odls/base/odls_base_default_fns.c 2010-02-27 12:21:14.0 +
@@ -74,7 +74,7 @@
 #include "orte/mca/odls/base/base.h"
 #include "orte/mca/odls/base/odls_private.h"
-static int8_t *app_idx;
+static int32_t *app_idx;
 /* IT IS CRITICAL THAT ANY CHANGE IN THE ORDER OF THE INFO PACKED IN
  * THIS FUNCTION BE REFLECTED IN THE CONSTRUCT_CHILD_LIST PARSER BELOW
@@ -1555,7 +1577,7 @@
 nrank = 0;
 opal_dss.pack(&buffer, &nrank, 1, ORTE_NODE_RANK); /* node rank */
 one8 = 0;
-opal_dss.pack(&buffer, &one8, 1, OPAL_INT8); /* app_idx */
+opal_dss.pack(&buffer, &one32, 1, OPAL_INT32); /* app_idx */
 jobdat->pmap = (opal_byte_object_t*)malloc(sizeof(opal_byte_object_t));
 opal_dss.unload(&buffer, (void**)&jobdat->pmap->bytes, &jobdat->pmap->size);
 OBJ_DESTRUCT(&buffer);
diff -ur openmpi-1.4.1/orte/runtime/orte_globals.h openmpi-1.4.1-new/orte/runtime/orte_globals.h
--- openmpi-1.4.1/orte/runtime/orte_globals.h 2009-12-08 20:36:44.0 +
+++ openmpi-1.4.1-new/orte/runtime/orte_globals.h 2010-02-27 12:30:20.0 +
@@ -137,7 +137,7 @@
 /** Parent object */
 opal_object_t super;
 /** Unique index when multiple apps per job */
-int8_t idx;
+int32_t idx;
 /** Absolute pathname of argv[0] */
 char *app;
 /** Number of copies of this process that are to be launched */
@@ -382,7 +382,7 @@
 /* exit code */
 orte_exit_code_t exit_code;
 /* the app_context that generated this proc */
-int8_t app_idx;
+int32_t app_idx;
 /* a cpu list, if specified by the user */
 char *slot_list;
 /* pointer to the node where this proc is executing */
diff -ur openmpi-1.4.1/orte/util/nidmap.c openmpi-1.4.1-new/orte/util/nidmap.c
--- openmpi-1.4.1/orte/util/nidmap.c 2009-12-08 20:36:44.0 +
+++ openmpi-1.4.1-new/orte/util/nidmap.c 2010-02-27 12:23:18.0 +
@@ -589,7 +589,7 @@
 int32_t *nodes;
 orte_proc_t **procs;
 orte_vpid_t i;
-int8_t *tmp;
+int32_t *tmp;
 opal_buffer_t buf;
 orte_local_rank_t *lrank;
 orte_node_rank_t *nrank;
@@ -645,11 +645,11 @@
 free(nrank);
 /* transfer and pack the app_idx in one pack */
-tmp = (int8_t*)malloc(jdata->num_procs);
+tmp = (int32_t*)malloc(jdata->num_procs * sizeof(int32_t));
 for (i=0; i < jdata->num_procs; i++) {
 tmp[i] = procs[i]->app_idx;
 }
-if (ORTE_SUCCESS != (rc = opal_dss.pack(&buf, tmp, jdata->num_procs, OPAL_INT8))) {
+if (ORTE_SUCCESS != (rc = opal_dss.pack(&buf, tmp, jdata->num_procs, OPAL_INT32))) {
 ORTE_ERROR_LOG(rc);
 return rc;
 }
@@ -664,7 +665,7 @@
 int orte_util_decode_pidmap(opal_byte_object_t *bo, orte_vpid_t *nprocs,
-opal_value_array_t *procs, int8_t **app_idx,
+opal_value_array_t *procs, int32_t **app_idx,
 char ***slot_str)
 {
 orte_vpid_t i, num_procs;
@@ -672,7 +673,7 @@
 int32_t *nodes;
 orte_local_rank_t *local_rank;
 orte_node_rank_t *node_rank;
-int8_t *idx;
+int32_t *idx;
 orte_std_cntr_t n;
 opal_buffer_t buf;
 int rc;
@@ -746,10 +747,10 @@
 }
 /* allocate memory for app_idx */
-idx = (int8_t*)malloc(num_procs);
+idx = (int32_t*)malloc(num_procs * sizeof(int32_t));
 /* unpack app_idx in one shot */
 n=num_procs;
Re: [OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times
I have tried up to kernel 2.6.33.1 on both architectures (Core2 Duo and i5) with the same results. The "slow" results also appear when the processes are distributed across the 4 cores of one single node. We use btl = self,sm,tcp in /etc/openmpi/openmpi-mca-params.conf. Distributing several processes, one per core, across several machines is fast and has "normal" communication times, so I guess TCP communication shouldn't be the problem. Multiple instances of the program, started on one "master" node, with each instance distributing several processes to one core of the "slave" nodes, don't seem to be a problem either: in effect 4 instances of the program occupy all 4 cores on each node, which doesn't influence communication and overall calculation time much. But running 4 processes from the same "master" instance on 4 cores of the same node does. Do you have more ideas on what I can test for?

I tried to test connectivity_c from the Open MPI examples on 8 nodes/32 processes. It is hard to get reliable/consistent figures from 'top' since the program terminates quite fast and the interesting usage is very short. But these are some snapshots of 'top' (master and slave nodes show similar images); system and/or wait time are up.

sh-3.2$ mpirun -np 4 -host cluster-05 connectivity_c : -np 28 -host cluster-06,cluster-07,cluster-08,cluster-09,cluster-10,cluster-11,cluster-12 connectivity_c
Connectivity test on 32 processes PASSED.

Cpu(s): 37.5%us, 46.6%sy, 0.0%ni, 0.0%id, 15.9%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8181236k total, 168200k used, 8013036k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 132092k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
25179 oli 20 0 143m 3436 2196 R 43 0.0 0:00.57 0
25180 oli 20 0 142m 3392 2180 R 100 0.0 0:00.85 3
25182 oli 20 0 142m 3312 2172 R 100 0.0 0:00.93 2
25181 oli 20 0 134m 3052 2172 R 100 0.0 0:00.93 1

Cpu(s): 10.3%us, 8.7%sy, 0.0%ni, 21.4%id, 58.7%wa, 0.8%hi, 0.0%si, 0.0%st
Mem: 8181236k total, 171352k used, 8009884k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 130572k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
29496 oli 20 0 142m 3300 2176 D 33 0.0 0:00.21 2
29497 oli 20 0 142m 3280 2160 R 25 0.0 0:00.17 0
29494 oli 20 0 134m 3044 2180 D 0 0.0 0:00.01 1
29495 oli 20 0 134m 3036 2172 R 16 0.0 0:00.11 3

Cpu(s): 18.3%us, 36.3%sy, 0.0%ni, 38.0%id, 6.3%wa, 1.1%hi, 0.0%si, 0.0%st
Mem: 8181236k total, 141704k used, 8039532k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 99828k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
29452 oli 20 0 143m 3452 2212 R 52 0.0 0:00.37 1
29455 oli 20 0 143m 3452 2212 S 57 0.0 0:00.41 3
29453 oli 20 0 143m 3440 2200 S 55 0.0 0:00.39 0
29454 oli 20 0 143m 3440 2200 R 55 0.0 0:00.39 2

Thanks for your thoughts, each input is appreciated. Oli

On 3/31/2010 8:38 AM, Jeff Squyres wrote:
> I have a very dim recollection of some kernel TCP issues back in some older
> kernel versions -- such issues affected all TCP communications, not just MPI.
> Can you try a newer kernel, perchance?
>
> On Mar 30, 2010, at 1:26 PM, wrote:
>
>> Hello List,
>>
>> I hope you can help us out on that one, as we are trying to figure out
>> since weeks.
>>
>> The situation: We have a program being capable of slitting to several
>> processes to be shared on nodes within a cluster network using openmpi.
>> We were running that system on "older" cluster hardware (Intel Core2 Duo
>> based, 2GB RAM) using an "older" kernel (2.6.18.6). All nodes are
>> diskless network booting. Recently we upgraded the hardware (Intel i5,
>> 8GB RAM) which also required an upgrade to a recent kernel version
>> (2.6.26+).
>>
>> Here is the problem: We experience overall performance loss on the new
>> hardware and think, we can break it down to a communication issue
>> inbetween the processes.
>>
>> Also, we found out, the issue araises in the transition from kernel
>> 2.6.23 to 2.6.24 (tested on the Core2 Duo system).
>>
>> Here is an output from our programm:
>>
>> 2.6.23.17 (64bit), MPI 1.2.7
>> 5 Iterationen (Core2 Duo) 6 CPU:
>> 93.33 seconds per iteration.
>> Node 0 communication/computation time: 6.83 /647.64 seconds.
>> Node 1 communication/computation time: 10.09 /644.36 seconds.
>> Node 2 communication/computation time: 7.27 /645.03 seconds.
>> Node 3 communication/computation time: 165.02 /485.52 seconds.
>> Node 4 communication/computation time: 6.50 /643.82 seconds.
>> Node 5 communication/computation time: 7.80 /627.63 seconds.
>> Computation time: 897.00 seconds.
>>
>> 2.6.24.7 (64bit) .. re-evaluated, MPI 1.2.7
>> 5 Iterationen
Re: [OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times
Does anyone know a benchmark program I could use for testing?
Re: [OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times
> However, reading through your initial description on Tuesday, none of these
> fit: You want to actually measure the kernel time on TCP communication costs.

Since the problem also occurs in a node-only configuration and the MCA option btl = self,sm,tcp is used, I doubt it has to do with TCP communication. But anyway, I will keep it in the back of my mind.

> So, have You tried attaching "strace -c -f -p PID" to the actual application
> processes?
>
> As a starter You may invoke the benchmark using:
>    mpirun -np 4 strace -c -f ./benchmark
> (which however includes initialization and all other system calls)...

I ran it as you suggested (node-only, no network distribution). I am not really fond of analyzing this in detail, but maybe this rings a bell for one of you:

% time     seconds  usecs/call     calls    errors syscall
-- --- --- - -
37.97 0.000508 0 119856 rt_sigaction
33.78 0.000452 0 59925 poll
21.00 0.000281 0 179776 rt_sigprocmask
7.25 0.97 0 121297 gettimeofday
0.00 0.00 0 85 read
0.00 0.00 0 3 write
0.00 0.00 0 324 203 open
0.00 0.00 0 129 close
0.00 0.00 0 3 3 unlink
[...]

% time     seconds  usecs/call     calls    errors syscall
-- --- --- - -
34.64 0.000194 0 92934 gettimeofday
28.75 0.000161 0 137227 rt_sigprocmask
26.25 0.000147 0 45742 poll
[...]

I can provide the whole output, if you like.
Re: [OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times
On 4/1/2010 12:49 PM, Rainer Keller wrote:
> On Thursday 01 April 2010 12:16:25 pm Oliver Geisler wrote:
>> Does anyone know a benchmark program I could use for testing?
> There's an abundance of benchmarks (IMB, netpipe, SkaMPI...) and performance
> analysis tools (Scalasca, Vampir, Paraver, Opt, Jumpshot).

I used SkaMPI to test communication; most important is the third column, showing the communication time. Same effect: kernels below 2.6.24 show fast communication, higher kernel versions show slow communication (by a factor of thousands). Hm. The issue does not seem to be linked to the application. The kernel configuration was carried forward from the working kernel 2.6.18 through to 2.6.33, mostly using defaults for new features. Any ideas what to look for? What other tests could I make to give you guys more information? Thanks so far, oli

Tested on Intel Core2 Duo with openmpi 1.4.1, "skampi_coll" test

kernel 2.6.18.6:
# begin result "MPI_Bcast-length"
count= 14 1.0 0.0 16 0.1 1.0
count= 28 1.0 0.08 0.0 1.0
count= 3 12 1.0 0.0 16 0.0 1.0
count= 4 16 1.3 0.1 32 0.0 1.3
count= 6 24 1.0 0.08 0.2 1.0
count= 8 32 1.0 0.0 32 0.1 1.0
{...}
count= 370728 14829121023.8 42.381023.81023.1
count= 524288 20971521440.3 3.781440.31439.5
# end result "MPI_Bcast-length"
# duration = 0.09 sec

kernel 2.6.33.1:
# begin result "MPI_Bcast-length"
count= 141786.5 131.2 341095.31786.5
count= 281504.9 77.1 34 759.31504.9
count= 3 121852.4 139.2 351027.91852.4
count= 4 162430.5 152.0 381200.52430.5
count= 6 241898.7 69.5 35 807.61898.7
count= 8 321769.1 16.3 34 763.31769.1
{...}
count= 370728 1482912 216145.93011.6 29 216145.9 214898.1
count= 524288 2097152 274813.71519.5 12 274813.7 274087.4
# end result "MPI_Bcast-length"
# duration = 140.64 sec
Re: [OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times
On 4/6/2010 2:53 PM, Jeff Squyres wrote:
> Try NetPIPE -- it has both MPI communication benchmarking and TCP
> benchmarking. Then you can see if there is a noticable difference between
> TCP and MPI (there shouldn't be). There's also a "memcpy" mode in netpipe,
> but it's not quite the same thing as shared memory message passing.

Using NetPIPE and comparing TCP and MPI communication I get the following results: TCP is much faster than MPI, by approx. a factor of 12. E.g. a packet size of 4096 bytes is delivered in 97.11 usec with NPtcp and 15338.98 usec with NPmpi, or for a packet size of 262kb, 0.05268801 sec with NPtcp and 0.00254560 sec with NPmpi.

Furthermore, our benchmark runs with short communication times when started with "--mca btl tcp,self", even using kernel 2.6.33.1.

Is there a way to see what type of communication is actually selected? Can anybody imagine why shared memory leads to these problems? Kernel configuration?

Thanks, Jeff, for insisting upon testing network performance. Thanks all others, too ;-) oli
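(Side note, not from the original thread: one way to see which BTL components Open MPI actually selects for a run is to raise the BTL verbosity, for example

mpirun --mca btl_base_verbose 30 -np 4 ./benchmark

which logs the components considered and chosen; the executable name here is just a placeholder.)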
Re: [OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times
To keep this thread updated: After I posted to the developers list, the community was able to guide me to a solution to the problem: http://www.open-mpi.org/community/lists/devel/2010/04/7698.php

To sum up: The extended communication times seen with shared memory communication between Open MPI processes were caused by the Open MPI session directory living on the network via NFS. The problem is resolved by setting up a ramdisk or mounting a tmpfs on each diskless node. By setting the MCA parameter orte_tmpdir_base to point to that mountpoint, shared memory communication and its backing files are kept local, decreasing the communication times by orders of magnitude.

The relation of the problem to the kernel version is not really resolved, but that is maybe not "the problem" in this respect. My benchmark is now running fine on a single node with 4 CPUs, kernel 2.6.33.1 and openmpi 1.4.1. Running on multiple nodes I still experience higher (TCP) communication times than I would expect, but that requires some deeper research (e.g. collisions on the network) and should probably be posted to a new thread.

Thank you guys for your help. oli
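(For readers looking for the concrete knob: the fix described above amounts to pointing the session directory at a node-local filesystem, e.g. on the command line

mpirun --mca orte_tmpdir_base /local/scratch -np 4 ./benchmark

or persistently in /etc/openmpi/openmpi-mca-params.conf:

orte_tmpdir_base = /local/scratch

The /local/scratch path is only an example; it stands for whatever local ramdisk or tmpfs mountpoint is set up on the node.)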
[OMPI users] Processes always rank 0
Hello there, I have a problem setting up MPI/LAM. Here we go: I start LAM with the lamboot command successfully:

$ lamboot -v hostnames
LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University
n-1<11960> ssi:boot:base:linear: booting n0 (frost)
n-1<11960> ssi:boot:base:linear: booting n1 (hurricane)
n-1<11960> ssi:boot:base:linear: booting n2 (hail)
n-1<11960> ssi:boot:base:linear: booting n3 (fog)
n-1<11960> ssi:boot:base:linear: booting n4 (rain)
n-1<11960> ssi:boot:base:linear: booting n5 (thunder)
n-1<11960> ssi:boot:base:linear: finished

Ok, all is fine. I test a command (hostname in this case):

$ mpirun -v --hostfile hostnames hostname
thunder
rain
frost
fog
hurricane
hail

Works. I write a hello world program for testing:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    unsigned int rank;
    unsigned int size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, World. I am %u of %u\n", rank, size);
    MPI_Finalize();
    return 0;
}

I compile and run it:

$ mpicc -o mpitest mpitest.c && mpirun -v --hostfile hostnames ./mpitest
Hello, World. I am 0 of 1
Hello, World. I am 0 of 1
Hello, World. I am 0 of 1
Hello, World. I am 0 of 1
Hello, World. I am 0 of 1
Hello, World. I am 0 of 1

And I don't get why every process has rank 0 and the size is only 1. I followed many tutorials and have checked this against them many times. Does anyone have an idea? Thanks in advance! Oliver

Some infos:

$ lamboot -v
LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University
n-1<12088> ssi:boot:base:linear: booting n0 (localhost)
n-1<12088> ssi:boot:base:linear: finished

ocs@frost:~$ lamboot -V
LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University
Arch: x86_64-pc-linux-gnu
Prefix: /usr/lib/lam
Configured by: buildd
Configured on: Sun Apr 6 01:43:15 UTC 2008
Configure host: excelsior
SSI rpi: crtcp lamd sysv tcp usysv

$ mpirun -V
mpirun (Open MPI) 1.2.7rc2
Report bugs to http://www.open-mpi.org/community/help/

$ mpicc -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.3.2-1.1' --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --enable-mpfr --enable-cld --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.3.2 (Debian 4.3.2-1.1)
Re: [OMPI users] Processes always rank 0
I thought it was Open MPI that I was using. I do not have permission to install anything, only in my home directory. All the tutorials I found started the environment with the lamboot command. What's the difference when using only Open MPI?

$ whereis openmpi
openmpi: /etc/openmpi /usr/lib/openmpi /usr/lib64/openmpi /usr/share/openmpi
$ echo $LD_LIBRARY_PATH
:/usr/lib/openmpi/lib:/usr/lib64/openmpi/lib:
$ whereis mpirun
mpirun: /usr/bin/mpirun.mpich /usr/bin/mpirun /usr/bin/mpirun.lam /usr/bin/mpirun.openmpi
$ ll /usr/bin/mpirun
lrwxrwxrwx 1 root root 24 14. Aug 2008 /usr/bin/mpirun -> /usr/bin/orterun
$ ll /usr/bin/orterun
-rwxr-xr-x 1 root root 39280 25. Aug 2008 /usr/bin/orterun
$ ll /usr/bin/mpirun.openmpi
lrwxrwxrwx 1 root root 7 5. Sep 2008 /usr/bin/mpirun.openmpi -> orterun

When I run mpirun without starting the environment with lamboot, it says:

ocs@frost:~$ mpicc -o mpitest mpitest.c && mpirun -np 1 -machinefile machines ./mpitest
-
It seems that there is no lamd running on the host frost.

This indicates that the LAM/MPI runtime environment is not operating. The LAM/MPI runtime environment is necessary for MPI programs to run (the MPI program tried to invoke the "MPI_Init" function).

Please run the "lamboot" command to start the LAM/MPI runtime environment. See the LAM/MPI documentation for how to invoke "lamboot" across multiple machines.
-----

Thanks in advance, Oliver

Jeff Squyres wrote:
> If you're just starting with MPI, is there any chance you can upgrade to Open MPI
> instead of LAM/MPI? All of the LAM/MPI developers moved to Open MPI years ago.
Re: [OMPI users] Processes always rank 0
You were right, it was linked to the LAM compiler. I didn't find the Open MPI compiler on the system, though. So I downloaded and compiled the current stable version of Open MPI. That worked. I had to make symbolic links to the executables so that the system wouldn't get confused with the old MPI install that I can't access to deinstall. Now when I use only my current host as the machine, it works:

$ mympicc mpitest.c -o mpitest && mympirun -np 3 -machinefile machines mpitest
Hello, World. I am 2 of 3
Hello, World. I am 0 of 3
Hello, World. I am 1 of 3

To get it to run on multiple machines, I had to give an absolute path to the binary:

ocs@frost:~$ /home/bude/ocs/openmpi/bin/mpirun -np 8 -machinefile machines mpitest
Hello, World. I am 4 of 8
Hello, World. I am 0 of 8
Hello, World. I am 7 of 8
Hello, World. I am 5 of 8
Hello, World. I am 1 of 8
Hello, World. I am 3 of 8
Hello, World. I am 2 of 8
Hello, World. I am 6 of 8

Yay! Works as it should (I think so; can I verify somewhere that it really executes the instances on those machines?)! Thanks everybody. This is not the most elegant solution, but in the end it works. Best regards, Oliver

Jeff Squyres (jsquyres) wrote:
> Lam and open mpi are two different mpi implementations. Lam came before open mpi;
> we stopped developing lam years ago. Lamboot is a lam-specific command. It has no
> analogue in open mpi. Orterun is open mpi's mpirun.
>
> From a quick look at your paths and whatnot, its not immediately obvious how you
> are mixing lam and open mpi, but somehow you are. You need to disentangle them and
> entirely use open mpi. Perhaps your mpicc is sym linked to the lam mpicc (instead
> of the open mpi mpicc)...?
>
> -jms
> Sent from my PDA. No type good.
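(To answer the verification question above in code terms: a small, generic extension of the hello-world test, not part of the original posts, is to have each rank also report the host it runs on via MPI_Get_processor_name; if the hostnames printed match the machinefile, the ranks really did land on those machines.)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);   /* name of the node this rank is running on */
    printf("Hello, World. I am %d of %d on %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}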