Send users mailing list submissions to
us...@open-mpi.org
To subscribe or unsubscribe via the World Wide Web, visit
http://www.open-mpi.org/mailman/listinfo.cgi/users
or, via email, send a message with subject or body 'help' to
users-requ...@open-mpi.org
You can reach the person managing the list at
users-ow...@open-mpi.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of users digest..."
Today's Topics:
1. Re: "Address not mapped" error on user defined MPI_OP
function (Eric Thibodeau)
2. MPI 1.2 stuck in pthread_condition_wait ( hpe...@infonie.fr )
3. Re: "Address not mapped" error on user defined MPI_OP
function (Eric Thibodeau)
4. Re: problem with MPI_Bcast over ethernet (Jeff Squyres)
5. Re: btl_tcp_endpoint errors (Jeff Squyres)
6. Re: problems with profile.d scripts generated using
openmpi.spec (Jeff Squyres)
----------------------------------------------------------------------
Message: 1
Date: Wed, 4 Apr 2007 12:31:46 -0400
From: Eric Thibodeau <ky...@neuralbs.com>
Subject: Re: [OMPI users] "Address not mapped" error on user defined
MPI_OP function
To: us...@open-mpi.org
Message-ID: <200704041231.46356.ky...@neuralbs.com>
Content-Type: text/plain; charset="iso-8859-1"
I completely forgot to mention which version of Open MPI I am using. I'll gladly
post additional info if required:
kyron@kyron ~/openmpi-1.2 $ ompi_info |head
Open MPI: 1.2
Open MPI SVN revision: r14027
Open RTE: 1.2
Open RTE SVN revision: r14027
OPAL: 1.2
OPAL SVN revision: r14027
Prefix: /home/kyron/openmpi_i686
Configured architecture: i686-pc-linux-gnu
Configured by: kyron
Configured on: Wed Apr 4 10:21:34 EDT 2007
On Wednesday 4 April 2007 11:47, Eric Thibodeau wrote:
> Hello all,
>
> First off, please excuse the attached code as I may be naïve in my
> attempts to implement my own MPI_OP.
>
> I am attempting to create my own MPI_OP to use with MPI_Allreduce. I have been able to find
> very few examples on the net of creating MPI_OPs. My present references are "MPI: The Complete
> Reference, Volume 1, 2nd edition" and some rather good slides I found at
> http://www.mpi-hd.mpg.de/personalhomes/stiff/MPI/ . I am attaching my "proof of concept"
> code which fails with:
>
> [kyron:14074] *** Process received signal ***
> [kyron:14074] Signal: Segmentation fault (11)
> [kyron:14074] Signal code: Address not mapped (1)
> [kyron:14074] Failing at address: 0x801da600
> [kyron:14074] [ 0] [0x6ffa6440]
> [kyron:14074] [ 1] /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_recursivedoubling+0x700) [0x6fbb0dd0]
> [kyron:14074] [ 2] /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0xb2) [0x6fbae9a2]
> [kyron:14074] [ 3] /home/kyron/openmpi_i686/lib/libmpi.so.0(PMPI_Allreduce+0x1a6) [0x6ff61e86]
> [kyron:14074] [ 4] AllReduceTest(main+0x180) [0x8048ee8]
> [kyron:14074] [ 5] /lib/libc.so.6(__libc_start_main+0xe3) [0x6fcbd823]
> [kyron:14074] *** End of error message ***
>
>
> Eric Thibodeau
------------------------------
Message: 2
Date: Wed, 4 Apr 2007 18:50:38 +0200
From: " hpe...@infonie.fr " <hpe...@infonie.fr>
Subject: [OMPI users] MPI 1.2 stuck in pthread_condition_wait
To: " users " <us...@open-mpi.org>
Message-ID: <JFZG4E$41584250c17e66d5afe2eafa16558...@aliceadsl.fr>
Content-Type: text/plain; charset=iso-8859-1
Hi,
I have a problem with Open MPI 1.2.0rc getting stuck in a pthread_cond_wait() call.
This happens with any application when Open MPI has been compiled with
multi-thread support.
The full "configure" options are
"./configure --prefix=/usr/local/Mpi/openmpi-1.2 --enable-mpi-threads
--enable-progress-threads --with-threads=posix --enable-smp-lock"
An example GDB session is shown below:
-------------------------------------------------------------------------------------------------------------
>GNU gdb 6.3-debian
>Copyright 2004 Free Software Foundation, Inc.
>GDB is free software, covered by the GNU General Public License, and
>you are welcome to change it and/or distribute copies of it under certain
>conditions.
>Type "show copying" to see the conditions.
>There is absolutely no warranty for GDB. Type "show warranty" for
>details.
>This GDB was configured as "i386-linux"...Using host libthread_db
>library "/lib/tls/libthread_db.so.1".
>
>(gdb) run -np 1 spawn6
>Starting program: /usr/local/openmpi-1.2.0/bin/mpirun -np 1 spawn6
>[Thread debugging using libthread_db enabled]
>[New Thread 1076191360 (LWP 29006)]
>[New Thread 1084808112 (LWP 29009)]
>main*******************************
>main : Lancement MPI*
>
>Program received signal SIGINT, Interrupt.
>[Switching to Thread 1084808112 (LWP 29009)]
>0x401f0523 in poll () from /lib/tls/libc.so.6
>(gdb) where
>#0 0x401f0523 in poll () from /lib/tls/libc.so.6
>#1 0x40081c7c in opal_poll_dispatch () from
>/usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
>#2 0x4007e4f1 in opal_event_base_loop () from
>/usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
>#3 0x4007e36b in opal_event_loop () from
>/usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
>#4 0x4007f423 in opal_event_run () from
>/usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
>#5 0x40115b63 in start_thread () from /lib/tls/libpthread.so.0
>#6 0x401f918a in clone () from /lib/tls/libc.so.6
>(gdb) bt
>#0 0x401f0523 in poll () from /lib/tls/libc.so.6
>#1 0x40081c7c in opal_poll_dispatch () from
>/usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
>#2 0x4007e4f1 in opal_event_base_loop () from
>/usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
>#3 0x4007e36b in opal_event_loop () from
>/usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
>#4 0x4007f423 in opal_event_run () from
>/usr/local/openmpi-1.2.0/lib/libopen-pal.so.0
>#5 0x40115b63 in start_thread () from /lib/tls/libpthread.so.0
>#6 0x401f918a in clone () from /lib/tls/libc.so.6
>(gdb) info threads
>* 2 Thread 1084808112 (LWP 29009) 0x401f0523 in poll () from
>/lib/tls/libc.so.6
> 1 Thread 1076191360 (LWP 29006) 0x40118295 in
>pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
>(gdb) thread 1
>[Switching to thread 1 (Thread 1076191360 (LWP 29006))]#0 0x40118295
>in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
>(gdb) bt
>#0 0x40118295 in pthread_cond_wait@@GLIBC_2.3.2 () from
>/lib/tls/libpthread.so.0
>#1 0x0804cb68 in opal_condition_wait (c=0x8050e4c, m=0x8050e28) at
>condition.h:64
>#2 0x0804a4fe in orterun (argc=4, argv=0xbffff844) at orterun.c:436
>#3 0x0804a046 in main (argc=4, argv=0xbffff844) at main.c:13
>(gdb) where
>#0 0x40118295 in pthread_cond_wait@@GLIBC_2.3.2 () from
>/lib/tls/libpthread.so.0
>#1 0x0804cb68 in opal_condition_wait (c=0x8050e4c, m=0x8050e28) at
>condition.h:64
>#2 0x0804a4fe in orterun (argc=4, argv=0xbffff844) at orterun.c:436
>#3 0x0804a046 in main (argc=4, argv=0xbffff844) at main.c:13
-------------------------------------------------------------------------------------------------------------
I have read the other threads related to multi-thread support, and I
understand that multi-thread support will not be a priority before the end of
the year.
However, this locking problem has only appeared since the 1.1.2 Open MPI
release, and since it is a locking problem, I was wondering if you could make an
exception and try to analyse this one before the end of the year.
Thanks,
Herve
P.S.: My OS is Debian sarge.
------------------------------
Message: 3
Date: Wed, 4 Apr 2007 13:32:15 -0400
From: Eric Thibodeau <ky...@neuralbs.com>
Subject: Re: [OMPI users] "Address not mapped" error on user defined
MPI_OP function
To: us...@open-mpi.org
Message-ID: <200704041332.15575.ky...@neuralbs.com>
Content-Type: text/plain; charset="iso-8859-1"
Hehe... don't we all love it when a problem "fixes" itself? I was missing a line
in my type creation to realign the elements correctly:
// Displacement is RELATIVE to its first structure element!
for(i=2; i >= 0; i--) Displacement[i] -= Displacement[0];
I'm attaching the functional code so that others can maybe use it as an
example ;)
On Wednesday 4 April 2007 11:47, Eric Thibodeau wrote:
> Hello all,
>
> First off, please excuse the attached code as I may be naïve in my
> attempts to implement my own MPI_OP.
>
> I am attempting to create my own MPI_OP to use with MPI_Allreduce. I have been able to find
> very few examples on the net of creating MPI_OPs. My present references are "MPI: The Complete
> Reference, Volume 1, 2nd edition" and some rather good slides I found at
> http://www.mpi-hd.mpg.de/personalhomes/stiff/MPI/ . I am attaching my "proof of concept"
> code which fails with:
>
> [kyron:14074] *** Process received signal ***
> [kyron:14074] Signal: Segmentation fault (11)
> [kyron:14074] Signal code: Address not mapped (1)
> [kyron:14074] Failing at address: 0x801da600
> [kyron:14074] [ 0] [0x6ffa6440]
> [kyron:14074] [ 1] /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_recursivedoubling+0x700) [0x6fbb0dd0]
> [kyron:14074] [ 2] /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0xb2) [0x6fbae9a2]
> [kyron:14074] [ 3] /home/kyron/openmpi_i686/lib/libmpi.so.0(PMPI_Allreduce+0x1a6) [0x6ff61e86]
> [kyron:14074] [ 4] AllReduceTest(main+0x180) [0x8048ee8]
> [kyron:14074] [ 5] /lib/libc.so.6(__libc_start_main+0xe3) [0x6fcbd823]
> [kyron:14074] *** End of error message ***
>
>
> Eric Thibodeau
>
--
Eric Thibodeau
Neural Bucket Solutions Inc.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AllReduceTest.c
Type: text/x-csrc
Size: 3170 bytes
Desc: not available
Url :
http://www.open-mpi.org/MailArchives/users/attachments/20070404/69383002/attachment.bin
------------------------------
Message: 4
Date: Wed, 4 Apr 2007 15:16:56 -0400
From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] problem with MPI_Bcast over ethernet
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <eca5445b-727d-4e68-9917-bf9fbf323...@cisco.com>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
There is no known issue in the current release (1.2) that would cause
this. What version are you using?
On Apr 2, 2007, at 4:34 PM, Jeff Stuart wrote:
> for some reason, i am getting intermittent process crashes in
> MPI_Bcast. i run my program, which distributes some data via lots
> (thousands or more) of 64k MPI_Bcast calls. the program that is
> crashing is fairly big, and it would take some time to whittle it down to a
> small example program. i *am* willing to do this, i just wanted to
> make sure there wasn't an already known problem about this first.
>
> thanks in advance,
> -jeff
--
Jeff Squyres
Cisco Systems
------------------------------
Message: 5
Date: Wed, 4 Apr 2007 15:28:14 -0400
From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] btl_tcp_endpoint errors
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <bc6c67e2-1172-4b00-83a5-f5c9c3e0f...@cisco.com>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
On Apr 3, 2007, at 1:22 PM, Heywood, Todd wrote:
> ssh: connect to host blade45 port 22: No route to host
> [blade1:05832] ERROR: A daemon on node blade45 failed to start as
> expected.
> [blade1:05832] ERROR: There may be more information available from
> [blade1:05832] ERROR: the remote shell (see above).
> [blade1:05832] ERROR: The daemon exited unexpectedly with status 1.
> [blade1:05832] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> ../../../../orte/mca/pls/base/pls_base_orted_cmds.c at line 188
> [blade1:05832] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> ../../../../../orte/mca/pls/rsh/pls_rsh_module.c at line 1187
>
> I can understand this arising from an ssh bottleneck, with a timeout.
> So, a question to the OMPI folks: could the "no route to host" (113)
> error in btl_tcp_endpoint.c:572 also result from a timeout?
I think it *could*, but it's really an OS-level question. OMPI is
simply reporting what errno is giving us back from a failed TCP
connect() API call.
The timeout shown in the error message above is really an ORTE
timeout, meaning that we waited for a daemon to start that didn't, so
we timed out and gave up. It's on the "to do" list to recognize that
an ssh failed (or that any of the other starters failed; SLURM/srun
failures behave similarly to ssh failures right now) faster than by
waiting for a timeout, but probably not until at least the 1.3
timeframe.
--
Jeff Squyres
Cisco Systems
------------------------------
Message: 6
Date: Wed, 4 Apr 2007 17:39:57 -0400
From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] problems with profile.d scripts generated
using openmpi.spec
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <ee226b9a-fbde-41ea-b9f5-71ddab9fc...@cisco.com>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
On Apr 4, 2007, at 8:44 AM, Marcin Dulak wrote:
> Thank you for your comments.
> 1) I am using
> GNU bash, version 3.00.15(1)-release (i686-redhat-linux-gnu)
> To see the problem with the original
> eval "set %{configure_options}"
> I start the configure_options with -- in buildrpm.sh, like this:
> configure_options="--with-tm=/usr/local FC=pgf90 F77=pgf90 CC=pgcc CXX=pgCC CFLAGS=-Msignextend CXXFLAGS=-Msignextend --with-wrapper-cflags=-Msignextend --with-wrapper-cxxflags=-Msignextend FFLAGS=-Msignextend FCFLAGS=-Msignextend --with-wrapper-fflags=-Msignextend --with-wrapper-fcflags=-Msignextend"
> Or to see the problem directly, I go to the shell:
> sh; set --w
> sh: set: --: invalid option
> set: usage: set [--abefhkmnptuvxBCHP] [-o option] [arg ...]
(wow, my mail client really munged your formatting... :-\ )
I see why I didn't run into this before. I did all my testing within
the context of the OFED 1.2 installer, and we always pass in
configure_options that start with a token that does not start with
--. Hence, "set" knew to ignore the -- prefixed options.
So it looks like a slightly less intrusive fix would actually be to
use the following:
eval "set -- %{configure_options}"
> 2) if ("\$LD_LIBRARY_PATH" !~ *%{_libdir}*) then
> is the only possibility which works for me. I am using
> tcsh 6.13.00 (Astron) 2004-05-19 (i386-intel-linux) options 8b,nls,dl,al,kan,rh,color,dspm,filec
> If I use "%{_libdir}", then every time I source
> /opt/openmpi/1.2/bin/mpivars-1.2.csh a new entry of openmpi is
> prepended, so the LD_LIBRARY_PATH keeps growing. The same happens if I
> use "*%{_libdir}*"; it seems that with the double quotes the shell
> uses literal matching, despite the pattern comparison requested by !~.
I just went and read the man page on this (should have done this
before): it says that the =~ and !~ operators are glob-style
matching. So the * prefix and suffix is correct -- thanks for
pointing that out.
I was trying to use "" to protect multi-word strings, but I can't
seem to find a syntax that works for multi-word strings on the right
hand side. Oh well; there's probably other stuff in OMPI that will
break if you use spaces in the prefix -- I'm ok with this for now.
I'll fix these up in SVN.
> 3) using setenv MANPATH %{_mandir}: (with the colon (:) included),
> if I start from empty MANPATH
>
> unsetenv MANPATH
>
> and run
> source /opt/openmpi/1.2/bin/mpivars-1.2.csh
> I get
> echo $MANPATH
>
> /opt/openmpi/1.2/man:
Right.
> I tried to google for something like
> "also include the default MANPATH" but I cannot find anything. What is
> the meaning of this colon at the end?
I believe that I found this option long ago by trial and error in the
OSCAR project. I just trolled through the man documentation right
now and [still] can't find it documented anywhere. :-\
The trailing : means "put all the options listed in man.conf here".
If you don't do that, then the contents of MANPATH wholly replace
what is listed in man.conf. For example (I'm a C shell kind of guy):
# With no $MANPATH
shell% man ls
...get ls man page...
# Set MANPATH to a directory with no trailing :
shell% setenv MANPATH /opt/intel/9.1/man
shell% man icc
...get icc man page...
shell% man ls
No manual entry for ls
# Set MANPATH to a directory with a trailing :
shell% setenv MANPATH /opt/intel/9.1/man:
shell% man icc
...get icc man page...
shell% man ls
...get ls man page...
Thanks for the bug reports and your persistence!
--
Jeff Squyres
Cisco Systems
------------------------------
End of users Digest, Vol 550, Issue 5
*************************************