[OMPI users] incorrect configure code (1.2.4 and earlier)
Hi!

There are a couple of bugs in the configure scripts regarding the threads checking.

In ompi_check_pthread_pids.m4 the actual test code is wrong, and the macro is also missing a CFLAGS save / add THREAD_CFLAGS / restore step, so the link always fails for the -pthread test with gcc. config.log looks like this:

=
configure:50353: checking if threads have different pids (pthreads on linux)
configure:50409: gcc -o conftest -DNDEBUG -march=k8 -O3 -msse -msse2 -maccumulate-outgoing-args -finline-functions -fno-strict-aliasing -fexceptions conftest.c -lnsl -lutil -lm >&5
conftest.c: In function 'checkpid':
conftest.c:327: warning: cast to pointer from integer of different size
/tmp/ccqUaAns.o: In function `main':
conftest.c:(.text+0x1f): undefined reference to `pthread_create'
:conftest.c:(.text+0x2e): undefined reference to `pthread_join'
collect2: ld returned 1 exit status
configure:50412: $? = 1
configure: program exited with status 1
=

Adding the CFLAGS save/add/restore makes the code return the right answer both on systems with the old pthreads implementation and on NPTL-based systems. BUT, the code as it stands is technically incorrect; the patch has a corrected version.

There are also two bugs in ompi_config_pthreads.m4. In OMPI_INTL_POSIX_THREADS_LIBS_CXX it incorrectly sets PTHREAD_LIBS to $pl in the then-part of the second if-statement, where $pl is not set yet, and it forgets to reset LIBS on failure in the bottom-most if-else case in the "for pl" loop. In OMPI_INTL_POSIX_THREADS_LIBS_FC it resets LIBS whether successful or not, so -lpthread is missing when checking for PTHREAD_MUTEX_ERRORCHECK_NP, at least for some versions of pgi (6.1 and older fail; 7.0 seems to always add -lpthread with pgf77 as the linker). The output from configure in such a case looks like this:

checking if C compiler and POSIX threads work with -lpthread... yes
checking if C++ compiler and POSIX threads work with -lpthread... yes
checking if F77 compiler and POSIX threads work with -lpthread... yes
checking for PTHREAD_MUTEX_ERRORCHECK_NP... no
checking for PTHREAD_MUTEX_ERRORCHECK... no

(OS: Ubuntu Dapper, Compiler: pgi 6.1)

There is also a problem in the F90 modules include flag search. The test currently does:

$FC -c conftest-module.f90
$FC conftest.f90

This doesn't work if one has set FCFLAGS=-g in the environment, at least not with pgf90, since it needs the debug symbols from conftest-module.o to be able to link. You have to either add conftest-module.o to the compile line of conftest or make it a three-stager:

$FC -c conftest-module.f90; $FC -c conftest.f90; $FC conftest.o conftest-module.o

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134   Fax: +46 90 7866126
Mobile: +46 70 7716134   WWW: http://www.hpc2n.umu.se

diff -rU 10 site/config/ompi_config_pthreads.m4 p3/config/ompi_config_pthreads.m4
--- site/config/ompi_config_pthreads.m4   2006-08-15 22:14:05.0 +0200
+++ p3/config/ompi_config_pthreads.m4     2007-09-27 09:10:21.0 +0200
@@ -473,24 +473,24 @@
          CXXCPPFLAGS="$CXXCPPFLAGS $PTHREAD_CXXCPPFLAGS"
        fi ;;
   esac
   LIBS="$orig_LIBS $PTHREAD_LIBS"
   AC_LANG_PUSH(C++)
   OMPI_INTL_PTHREAD_TRY_LINK(ompi_pthread_cxx_success=1,
                              ompi_pthread_cxx_success=0)
   AC_LANG_POP(C++)
   if test "$ompi_pthread_cxx_success" = "1"; then
-    PTHREAD_LIBS="$pl"
     AC_MSG_RESULT([yes])
   else
     CXXCPPFLAGS="$orig_CXXCPPFLAGS"
+    LIBS="$orig_LIBS"
     AC_MSG_RESULT([no])
     AC_MSG_ERROR([Can not find working threads configuration.
 aborting])
   fi
 else
   for pl in $plibs; do
     AC_MSG_CHECKING([if C++ compiler and POSIX threads work with $pl])
     case "${host_cpu}-${host_os}" in
     *-aix* | *-freebsd*)
       if test "`echo $CXXCPPFLAGS | grep 'D_THREAD_SAFE'`" = ""; then
         PTRHEAD_CXXCPPFLAGS="-D_THREAD_SAFE"
@@ -508,61 +508,62 @@
   AC_LANG_PUSH(C++)
   OMPI_INTL_PTHREAD_TRY_LINK(ompi_pthread_cxx_success=1,
                              ompi_pthread_cxx_success=0)
   AC_LANG_POP(C++)
   if test "$ompi_pthread_cxx_success" = "1"; then
     PTHREAD_LIBS="$pl"
     AC_MSG_RESULT([yes])
   else
     PTHREAD_CXXCPPFLAGS=
     CXXCPPFLAGS="$orig_CXXCPPFLAGS"
+    LIBS="$orig_LIBS"
     AC_MSG_RESULT([no])
   fi
   done
 fi
 fi
 ])dnl

 AC_DEFUN([OMPI_INTL_POSIX_THREADS_LIBS_FC],[
 #
 # Fortran compiler
 #
 if test "$ompi_pthread_f77_success" = "0" -a "$OMPI_WANT_F77_BINDINGS" = "1"; then
   if test ! "$ompi_pthread_c_success" = "0" -a ! "$PTHREAD_LIBS" = "" ; then
     AC_MSG_CHECKING([if F77 compiler and POSIX threads work with $PTHREAD_LIBS])
     LIBS="$orig_LIBS $PTHREAD_LIBS"
     AC_LANG_PUSH(C)
     OMPI_INTL_PTHREAD_TRY_LINK_F77(ompi_pthread_f77_success=1,
                                    ompi_pthread_f77_success=0)
     AC_LANG_POP(C)
-
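As an aside on the first item above (the missing CFLAGS handling): the save/add/restore being referred to is the usual configure idiom. The following is only an illustrative shell sketch, not the actual patch; it assumes THREAD_CFLAGS holds -pthread as in the gcc case above, and that $CC, conftest.c and $LIBS are the usual configure-time values:

    CFLAGS_save="$CFLAGS"
    CFLAGS="$CFLAGS $THREAD_CFLAGS"            # e.g. THREAD_CFLAGS=-pthread for gcc
    $CC $CFLAGS -o conftest conftest.c $LIBS   # link of the pid-check test program
    ompi_result=$?
    CFLAGS="$CFLAGS_save"                      # restore so later tests are unaffected

Without adding THREAD_CFLAGS before the link, the test program is built without -pthread, which is exactly the undefined pthread_create/pthread_join reference shown in the config.log excerpt above.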
Re: [OMPI users] [Open MPI Announce] Open MPI v1.2.4 released
We have a working version of Open MPI on Windows. However, it's manually built, and the whole compilation process, as well as maintaining the project file, is a nightmare. That's why the Windows project files are not committed into the trunk. If you want, I can provide you with the solution and project files.

  george.

On Sep 26, 2007, at 10:33 AM, Damien Hocking wrote:

Is there a timeline for the Windows version yet?

Damien

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] incorrect configure code (1.2.4 and earlier)
Hi Ake, Looking at the svn logs it looks like you reported the problems with these checks quite a while ago and we fixed them (in r13773 https://svn.open-mpi.org/trac/ompi/changeset/13773), but we never moved them to the 1.2 branch. I will ask for this to be moved to the 1.2 branch. However, the changes made for ompi_config_pthreads.m4 are different than you are suggesting now. Is this changeset good enough, or are there other changes you think should be made? Thanks, Tim Åke Sandgren wrote: Hi! There are a couple of bugs in the configure scripts regarding threads checking. In ompi_check_pthread_pids.m4 the actual code for testing is wrong and is also missing a CFLAG save/add-THREAD_CFLAGS/restore resulting in the linking always failing for the -pthread test with gcc. config.log looks like this. = configure:50353: checking if threads have different pids (pthreads on linux) configure:50409: gcc -o conftest -DNDEBUG -march=k8 -O3 -msse -msse2 -maccumulate-outgoing-args -finline-functions -fno-strict-aliasing -fexceptions conftest.c -lnsl -lutil -lm >&5 conftest.c: In function 'checkpid': conftest.c:327: warning: cast to pointer from integer of different size /tmp/ccqUaAns.o: In function `main':conftest.c:(.text+0x1f): undefined reference to `pthread_create' :conftest.c:(.text+0x2e): undefined reference to `pthread_join' collect2: ld returned 1 exit status configure:50412: $? = 1 configure: program exited with status 1 = Adding the CFLAGS save/add/restore make the code return the right answer both on systems with the old pthreads implementation and NPTL based systems. BUT, the code as it stands is technically incorrect. The patch have a corrected version. There is also two bugs in ompi_config_pthreads.m4. In OMPI_INTL_POSIX_THREADS_LIBS_CXX it is incorrectly setting PTHREAD_LIBS to $pl, in the then-part of the second if-statement, which at the time isn't set yet and forgetting to reset LIBS on failure in the bottom most if-else case in the for pl loop. In OMPI_INTL_POSIX_THREADS_LIBS_FC it is resetting LIBS whether succesfull or not resulting in -lpthread missing when checking for PTHREAD_MUTEX_ERRORCHECK_NP at least for some versions of pgi, (6.1 and older fails, 7.0 seems to always add -lpthread with pgf77 as linker) The output from configure in such a case looks like this: checking if C compiler and POSIX threads work with -lpthread... yes checking if C++ compiler and POSIX threads work with -lpthread... yes checking if F77 compiler and POSIX threads work with -lpthread... yes checking for PTHREAD_MUTEX_ERRORCHECK_NP... no checking for PTHREAD_MUTEX_ERRORCHECK... no (OS: Ubuntu Dapper, Compiler: pgi 6.1) There is also a problem in the F90 modules include flag search. The test currently does: $FC -c conftest-module.f90 $FC conftest.f90 This doesn't work if one has set FCFLAGS=-g in the environment. At least not with pgf90 since it needs the debug symbols from conftest-module.o to be able to link. You have to either add conftest-module.o to the compile line of conftest or make it a three-stager, $FC -c conftest-module.f90; $FC -c conftest.f90; $FC conftest.o conftest-module.o ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] incorrect configure code (1.2.4 and earlier)
On Thu, 2007-09-27 at 09:09 -0400, Tim Prins wrote:
> Hi Ake,
>
> Looking at the svn logs it looks like you reported the problems with
> these checks quite a while ago and we fixed them (in r13773
> https://svn.open-mpi.org/trac/ompi/changeset/13773), but we never moved
> them to the 1.2 branch.

Yes, it's the same. Since I never saw it in the source, I tried once more with some more explanations, just in case :-)

> I will ask for this to be moved to the 1.2 branch.

Good.

> However, the changes made for ompi_config_pthreads.m4 are different than
> you are suggesting now. Is this changeset good enough, or are there
> other changes you think should be made?

The ones I sent today are slightly more correct. There were two missing LIBS="$orig_LIBS" in the failure cases. If you compare the resulting file after patching you will see the difference. They are in the "Can not find working threads configuration" portions.

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134   Fax: +46 90 7866126
Mobile: +46 70 7716134   WWW: http://www.hpc2n.umu.se
[OMPI users] Bundling OpenMPI
Hi,

We would like to distribute OpenMPI along with our software to customers; is there any legal issue we need to know about?

We can successfully build OpenMPI using

./configure --prefix=/some_path; make; make install

However, if we do cp -r /some_path /other_path and try to run /other_path/bin/orterun, the error message below is thrown:

--
Sorry!  You were supposed to get help about:
    orterun:usage
from the file:
    help-orterun.txt
But I couldn't find any file matching that name.  Sorry!
--

Apparently, the path is hard-coded in the executable. Is there any way to fix it (such as using an environment variable etc.)?

Thanks,
Teng
[OMPI users] mpirun ERROR: The daemon exited unexpectedly with status 255.
Hi,

I have a problem running a simple program, mpihello.cpp.

Here is an excerpt of the error and the command:

root@sun:~# mpirun -H sun,saturn main
[sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:25213] ERROR: A daemon on node saturn failed to start as expected.
[sun:25213] ERROR: There may be more information available from
[sun:25213] ERROR: the remote shell (see above).
[sun:25213] ERROR: The daemon exited unexpectedly with status 255.
[sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--
mpirun was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.
--

The program is runnable from each node alone (mpirun -np 2 main).

My PathVariables:
$PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
echo $LD_LIBRARY_PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib

Passwordless ssh is up 'n running.

I walked through the FAQ and mailing lists but couldn't find any solution for my problem.

Thanks,
Dino R.
Re: [OMPI users] mpirun ERROR: The daemon exited unexpectedly with status 255.
Hi Dino Try ssh saturn printenv | grep PATH from your host sun to see what your environment variables are when ssh is run without a shell. On 9/27/07, Dino Rossegger wrote: > Hi, > > I have a problem running a simple programm mpihello.cpp. > > Here is a excerp of the error and the command > root@sun:~# mpirun -H sun,saturn main > [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file > base/pls_base_orted_cmds.c at line 275 > [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at > line 1164 > [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90 > [sun:25213] ERROR: A daemon on node saturn failed to start as expected. > [sun:25213] ERROR: There may be more information available from > [sun:25213] ERROR: the remote shell (see above). > [sun:25213] ERROR: The daemon exited unexpectedly with status 255. > [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file > base/pls_base_orted_cmds.c at line 188 > [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at > line 1196 > -- > mpirun was unable to cleanly terminate the daemons for this job. > Returned value Timeout instead of ORTE_SUCCESS. > > -- > > The program is runable from each node alone (mpirun -np2 main) > > My PathVariables: > $PATH > /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/libecho > $LD_LIBRARY_PATH > /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib > > Passwordless ssh is up 'n running > > I walked through the FAQ and Mailing Lists but couldn't find any > solution for my problem. > > Thanks > Dino R. > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
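To expand on jody's check: what matters is the environment a non-interactive ssh session gets on the remote node, not the one in your interactive login shell. A sketch of the check and a typical remedy follows; the /usr/local prefix is only an assumed example (use whatever --prefix Open MPI was installed with), and ~/.bashrc must be a file your shell actually reads for non-interactive logins:

    # from sun: what does a non-interactive shell on saturn see?
    ssh saturn printenv | grep PATH

    # on saturn: prepend the Open MPI install directories for such shells
    echo 'export PATH=/usr/local/bin:$PATH' >> ~/.bashrc
    echo 'export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH' >> ~/.bashrc

Alternatively, running mpirun with --prefix, or configuring Open MPI with --enable-mpirun-prefix-by-default, avoids depending on the remote shell's startup files.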
[OMPI users] SIGSEG in ompi_comm_start_processes
Hi all,

I think I found a bug and a fix for it. Could someone verify the rationale behind this bug, as I get this SIGSEG on only one of two machines, and I don't quite see why it doesn't occur always (same test program, equally compiled OpenMPI 1.2.4). The fix does prevent the segmentation fault, though. :)

Thanks,
Murat


Where:

Bug: free() crashes when trying to free stack memory

ompi/communicator/comm_dyn.c:630
    OBJ_RELEASE(apps[i]);

SIGSEG: orte/mca/rmgr/rmgr_types.h:113
    free (app_context->cwd);

There are two ways that apps[i]->cwd is filled:

1. dynamically allocated memory

548    if ( !have_wdir ) {
           getcwd(cwd, OMPI_PATH_MAX);
           apps[i]->cwd = strdup(cwd);   // <--
       }

2. stack

354    char cwd[OMPI_PATH_MAX];
       // ...
516    /* check for 'wdir' */
       ompi_info_get (array_of_info[i], "wdir", valuelen, cwd, &flag);
       if ( flag ) {
           apps[i]->cwd = cwd;           // <--
           have_wdir = 1;
       }

Fix: Always allocate cwd manually and make sure it is freed afterwards.

1.
       char *cwd = (char*)malloc(OMPI_PATH_MAX);

2. And on cleanup (somewhere below line 624)

       if ( !have_wdir ) {
           getcwd(cwd, OMPI_PATH_MAX);
           apps[i]->cwd = strdup(cwd);
       }
Re: [OMPI users] SIGSEG in ompi_comm_start_processes
Copy-and-paste-error: The second part of the fix ought to be: if ( !have_wdir ) { free(cwd); } Murat Murat Knecht schrieb: > Hi all, > > I think, I found a bug and a fix for it. > Could someone verify the rationale behind this bug, as I have this > SIGSEG on only one of two machines, and I don't quite see why it doesn't > occur always. (Same testprogram, equally compiled 1.2.4 OpenMPI). > Though the fix does prevent the segmentation fault. :) > > Thanks, > Murat > > > > > > Where: > Bug: > free() crashes when trying to free stack memory > ompi/communicator/comm_dyn.c:630 > > OBJ_RELEASE(apps[i]); > > > SIGSEG: > orte/mca/rmgr/rmgr_types.h:113 > > free (app_context->cwd); > > > > There are two ways that apps[i]->cwd is filled: > 1. dynamically allocated memory > 548 if ( !have_wdir ) { > getcwd(cwd, OMPI_PATH_MAX); > apps[i]->cwd = strdup(cwd);// <-- > } > > 2. stack > 354char cwd[OMPI_PATH_MAX]; > // ... > 516 /* check for 'wdir' */ > ompi_info_get (array_of_info[i], "wdir", valuelen, cwd, &flag); > if ( flag ) { > apps[i]->cwd = cwd; // <-- > have_wdir = 1; > } > > > > Fix: Allocate cwd always manually and make sure, it is deleted afterwards. > > 1. >--- > >>char *cwd = (char*)malloc(OMPI_PATH_MAX); >> > > 2. And on cleanup (somewhere below line 624) > > >>if ( !have_wdir ) { >>getcwd(cwd, OMPI_PATH_MAX); >>apps[i]->cwd = strdup(cwd); >>} >> > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] incorrect configure code (1.2.4 and earlier)
Åke Sandgren wrote: On Thu, 2007-09-27 at 09:09 -0400, Tim Prins wrote: Hi Ake, Looking at the svn logs it looks like you reported the problems with these checks quite a while ago and we fixed them (in r13773 https://svn.open-mpi.org/trac/ompi/changeset/13773), but we never moved them to the 1.2 branch. Yes, it's the same. Since i never saw it in the source i tried once more with some more explanations just in case :-) I will ask for this to be moved to the 1.2 branch. Good. However, the changes made for ompi_config_pthreads.m4 are different than you are suggesting now. Is this changeset good enough, or are there other changes you think should be made? The ones i sent today are slightly more correct. There where 2 missing LIBS="$orig_LIBS" in the failure cases. But do we really need these? It looks like configure aborts in these cases (I am not a autoconf wizard, so I could be completely wrong here). Tim If you compare the resulting file after patching you will see the difference. They are in the "Can not find working threads configuration" portions.
Re: [OMPI users] incorrect configure code (1.2.4 and earlier)
On Thu, 2007-09-27 at 14:18 -0400, Tim Prins wrote: > Åke Sandgren wrote: > > On Thu, 2007-09-27 at 09:09 -0400, Tim Prins wrote: > >> Hi Ake, > >> > >> Looking at the svn logs it looks like you reported the problems with > >> these checks quite a while ago and we fixed them (in r13773 > >> https://svn.open-mpi.org/trac/ompi/changeset/13773), but we never moved > >> them to the 1.2 branch. > > > > Yes, it's the same. Since i never saw it in the source i tried once more > > with some more explanations just in case :-) > > > >> I will ask for this to be moved to the 1.2 branch. > > > > Good. > > > >> However, the changes made for ompi_config_pthreads.m4 are different than > >> you are suggesting now. Is this changeset good enough, or are there > >> other changes you think should be made? > > > > The ones i sent today are slightly more correct. There where 2 missing > > LIBS="$orig_LIBS" in the failure cases. > But do we really need these? It looks like configure aborts in these > cases (I am not a autoconf wizard, so I could be completely wrong here). I don't know. I just put them in since it was the right thing to do. And there where other variables that was reset in those places. -- Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden Internet: a...@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90 7866126 Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
Re: [OMPI users] Application using OpenMPI 1.2.3 hangs, error messages in mca_btl_tcp_frag_recv
Here's some more info on the problem I've been struggling with; my apologies for the lengthy posts, but I'm a little desperate here :-) I was able to reduce the size of the experiment that reproduces the problem, both in terms of input data size and the number of slots in the cluster. The cluster now consists of 6 slots (5 clients), with two of the clients running on the same node as the server and three others on another node. This allowed me to follow Brian's advice and run the server and all the clients under gdb and make sure none of the processes terminates (normally or abnormally) when the server reports the "readv failed" errors; this is indeed the case. I then followed Jeff's advice and added a debug loop just prior to the server calling MPI_Waitany(), identifying the entries in the requests array which are not MPI_REQUEST_NULL, and then tracing back these requests. What I found was the following: At some point during the run, the server calls MPI_Waitany() on an array of requests consisting of 96 elements, and gets stuck in it forever; the only thing that happens at some point thereafter is that the server reports a couple of "readv failed" errors: [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110 [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110 According to my debug prints, just before that last call to MPI_Waitany() the array requests[] contains 38 entries which are not MPI_REQUEST_NULL. Half of these entries correspond to calls to Isend(), half to Irecv(). Specifically, for example, entries 4,14,24,34,44,54,64,74,84,94 are used for Isend()'s from server to client #3 (of 5), and entries 5,15,...,95 are used for Irecv() for the same client. I traced back what's going on, for instance, with requests[4]. As I mentioned, it corresponds to a call to MPI_Isend() initiated by the server to client #3 (of 5). By the time the server gets stuck in Waitany(), this client has already correctly processed the first Isend() from master in requests[4], returned its response in requests[5], and the server received this response properly. After receiving this response, the server Isend()'s the next task to this client in requests[4], and this is correctly reflected in "requests[4] != MPI_REQUESTS_NULL" just before the last call to Waitany(), but for some reason this send doesn't seem to go any further. Looking at all other requests[] corresponding to Isend()'s initiated by the server to the same client (14,24,...,94), they're all also not MPI_REQUEST_NULL, and are not going any further either. One thing that might be important is that the messages the server is sending to the clients in my experiment are quite large, ranging from hundreds of Kbytes to several Mbytes, the largest being around 9 Mbytes. The largest messages take place at the beginning of the run and are processed correctly though. Also, I ran the same experiment on another cluster that uses slightly different hardware and network infrastructure, and could not reproduce the problem. Hope at least some of the above makes some sense. Any additional advice would be greatly appreciated! Many thanks, Daniel Daniel Rozenbaum wrote: I'm now running the same experiment under valgrind. 
It's probably going to run for a few days, but interestingly what I'm seeing now is that while running under valgrind's memcheck, the app has been reporting much more of these "recv failed" errors, and not only on the server node: [host1][0,1,0] [host4][0,1,13] [host5][0,1,18] [host8][0,1,30] [host10][0,1,36] [host12][0,1,46] If in the original run I got 3 such messages, in the valgrind'ed run I got about 45 so far, and the app still has about 75% of the work left. I'm checking while all this is happening, and all the client processes are still running, none exited early. I've been analyzing the debug output in my original experiment, and it does look like the server never receives any new messages from two of the clients after the "recv failed" messages appear. If my analysis is correct, these two clients ran on the same host. It might be the case then that the messages with the next tasks to execute that the server attempted to send to these two clients never reached them, or were never sent. Interesting though that there were two additional clients on the same host, and those seem to have kept working all along, until the app got stuck. Once this valgrind experiment is over, I'll proceed to your other suggestion about the debug loop on the server side checking for any of the requests the app is waiting for being MPI_REQUEST_NULL. Many thanks, Daniel Jeff Squyres wrote: On Sep 17, 2007, at 11:26 AM, Daniel Rozenbaum wrote: What seems to be happening is this: the code of the server is written in such a manner that the server knows how many "responses" it's supposed to receive from all the c
Re: [OMPI users] mpirun ERROR: The daemon exited unexpectedly with status 255.
Hi Jody, Thanks for your help, it really is the case that either in PATH nor in LD_LIBRARY_PATH the path to the libs is set correctly. I'll try out, hope it works. jody schrieb: > Hi Dino > > Try > ssh saturn printenv | grep PATH >>from your host sun to see what your environment variables are when > ssh is run without a shell. > > > On 9/27/07, Dino Rossegger wrote: >> Hi, >> >> I have a problem running a simple programm mpihello.cpp. >> >> Here is a excerp of the error and the command >> root@sun:~# mpirun -H sun,saturn main >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file >> base/pls_base_orted_cmds.c at line 275 >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at >> line 1164 >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90 >> [sun:25213] ERROR: A daemon on node saturn failed to start as expected. >> [sun:25213] ERROR: There may be more information available from >> [sun:25213] ERROR: the remote shell (see above). >> [sun:25213] ERROR: The daemon exited unexpectedly with status 255. >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file >> base/pls_base_orted_cmds.c at line 188 >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at >> line 1196 >> -- >> mpirun was unable to cleanly terminate the daemons for this job. >> Returned value Timeout instead of ORTE_SUCCESS. >> >> -- >> >> The program is runable from each node alone (mpirun -np2 main) >> >> My PathVariables: >> $PATH >> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/libecho >> $LD_LIBRARY_PATH >> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib >> >> Passwordless ssh is up 'n running >> >> I walked through the FAQ and Mailing Lists but couldn't find any >> solution for my problem. >> >> Thanks >> Dino R. >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] mpirun ERROR: The daemon exited unexpectedly with status 255.
Note that you may be able to get some more error output by adding --debug-daemons to the mpirun command line. Tim On Thursday 27 September 2007 05:12:53 pm Dino Rossegger wrote: > Hi Jody, > > Thanks for your help, it really is the case that either in PATH nor in > LD_LIBRARY_PATH the path to the libs is set correctly. I'll try out, > hope it works. > > jody schrieb: > > Hi Dino > > > > Try > > ssh saturn printenv | grep PATH > > > >>from your host sun to see what your environment variables are when > > > > ssh is run without a shell. > > > > On 9/27/07, Dino Rossegger wrote: > >> Hi, > >> > >> I have a problem running a simple programm mpihello.cpp. > >> > >> Here is a excerp of the error and the command > >> root@sun:~# mpirun -H sun,saturn main > >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file > >> base/pls_base_orted_cmds.c at line 275 > >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at > >> line 1164 > >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line > >> 90 [sun:25213] ERROR: A daemon on node saturn failed to start as > >> expected. [sun:25213] ERROR: There may be more information available > >> from [sun:25213] ERROR: the remote shell (see above). > >> [sun:25213] ERROR: The daemon exited unexpectedly with status 255. > >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file > >> base/pls_base_orted_cmds.c at line 188 > >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at > >> line 1196 > >> > >>-- mpirun was unable to cleanly terminate the daemons for this job. > >> Returned value Timeout instead of ORTE_SUCCESS. > >> > >> > >>-- > >> > >> The program is runable from each node alone (mpirun -np2 main) > >> > >> My PathVariables: > >> $PATH > >> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/: > >>/usr/lib:/usr/local/libecho $LD_LIBRARY_PATH > >> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/: > >>/usr/lib:/usr/local/lib > >> > >> Passwordless ssh is up 'n running > >> > >> I walked through the FAQ and Mailing Lists but couldn't find any > >> solution for my problem. > >> > >> Thanks > >> Dino R. > >> > >> > >> ___ > >> users mailing list > >> us...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
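To illustrate Tim's suggestion with the command from the original report (a sketch only, using the option name given above):

    mpirun --debug-daemons -H sun,saturn main

This should make the remote orted daemons report what they are doing, which usually reveals whether the failure is a PATH/LD_LIBRARY_PATH problem on the remote node or something else.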
Re: [OMPI users] Bundling OpenMPI
Hi Teng, Teng Lin wrote: Hi, We would like to distribute OpenMPI along with our software to customers, is there any legal issue we need to know about? Not that I know of (disclaimer: IANAL). Open MPI is licensed under the new BSD license. Open MPI's license is here: http://www.open-mpi.org/community/license.php We can successfully build OpenMPI using ./configure --prefix=/some_path;make;make install However, if we do cp -r /some_path /other_path and try to run /other_path/bin/orterun, below error message is thrown: -- Sorry! You were supposed to get help about: orterun:usage from the file: help-orterun.txt But I couldn't find any file matching that name. Sorry! -- Apparently, the path is hard-coded in the executable. Is there any way to fix it (such as using an environment variable etc)? There is. See: http://www.open-mpi.org/faq/?category=building#installdirs Hope this helps, Tim
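For the relocation question, the FAQ entry boils down to overriding the installation prefix at run time. A minimal sketch for the case described above, assuming your Open MPI version honors OPAL_PREFIX (which is what that FAQ entry describes), with /other_path being the copied tree from the original mail:

    export OPAL_PREFIX=/other_path
    export PATH=/other_path/bin:$PATH
    export LD_LIBRARY_PATH=/other_path/lib:$LD_LIBRARY_PATH
    /other_path/bin/orterun --help

With OPAL_PREFIX set, orterun/mpirun should look for its help files and components under /other_path instead of the prefix compiled in at configure time.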
Re: [OMPI users] SIGSEG in ompi_comm_start_processes
Murat, Thanks for the bug report. I have fixed (slightly differently than you suggested) this in the Open MPI trunk in r16265 and it should be available in the nightly trunk tarball tonight. I will ask to have this moved into the next release of Open MPI. Thanks, Tim Murat Knecht wrote: Copy-and-paste-error: The second part of the fix ought to be: if ( !have_wdir ) { free(cwd); } Murat Murat Knecht schrieb: Hi all, I think, I found a bug and a fix for it. Could someone verify the rationale behind this bug, as I have this SIGSEG on only one of two machines, and I don't quite see why it doesn't occur always. (Same testprogram, equally compiled 1.2.4 OpenMPI). Though the fix does prevent the segmentation fault. :) Thanks, Murat Where: Bug: free() crashes when trying to free stack memory ompi/communicator/comm_dyn.c:630 OBJ_RELEASE(apps[i]); SIGSEG: orte/mca/rmgr/rmgr_types.h:113 free (app_context->cwd); There are two ways that apps[i]->cwd is filled: 1. dynamically allocated memory 548 if ( !have_wdir ) { getcwd(cwd, OMPI_PATH_MAX); apps[i]->cwd = strdup(cwd);// <-- } 2. stack 354char cwd[OMPI_PATH_MAX]; // ... 516 /* check for 'wdir' */ ompi_info_get (array_of_info[i], "wdir", valuelen, cwd, &flag); if ( flag ) { apps[i]->cwd = cwd; // <-- have_wdir = 1; } Fix: Allocate cwd always manually and make sure, it is deleted afterwards. 1.cwd = strdup(cwd); } ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] --enable-mca-no-build broken or bad docs?
I see docs for this like:

--enable-mca-no-build=btl:mvapi,btl:openib,btl:gm,btl:mx,mtl:psm

however, the code in a generated configure that parses this looks like:

...
    ifs_save="$IFS"
    IFS="${IFS}$PATH_SEPARATOR,"
    msg=
    for item in $enable_mca_no_build; do
        type="`echo $item | cut -s -f1 -d-`"
        comp="`echo $item | cut -s -f2- -d-`"
        if test -z $type -o -z $comp ; then
...

So this actually expects "-" and not ":" as a delimiter, and

--enable-mca-no-build=btl-mvapi,btl-openib,btl-gm,btl-mx,mtl-psm

would parse. So, which is it? The docs or the code above? From an SVN checkout of today.

Regards,
Mostyn Lewis
[OMPI users] aclocal.m4 booboo?
Today's SVN. A generated configure has this in it:

...
###
# Libtool: part two
# (after C compiler setup)
ompi_show_subtitle "Libtool configuration"

_LT_SHELL_INIT(lt_ltdl_dir='opal/libltdl')

case $enable_ltdl_convenience in
  no) { { echo "$as_me:$LINENO: error: this package needs a convenience libltdl" >&5
echo "$as_me: error: this package needs a convenience libltdl" >&2;}
...

I guess this comes from aclocal.m4:

...
])# LT_CONFIG_LTDL_DIR

# We break this out into a separate macro, so that we can call it safely
# internally without being caught accidentally by the sed scan in libtoolize.
m4_defun([_LT_CONFIG_LTDL_DIR],
[dnl remove trailing slashes
m4_pushdef([_ARG_DIR], m4_bpatsubst([$1], [/*$]))
m4_case(_LTDL_DIR,
        [], [dnl only set lt_ltdl_dir if _ARG_DIR is not simply `.'
             m4_if(_ARG_DIR, [.],
                   [],
                   [m4_define([_LTDL_DIR], _ARG_DIR)
                    _LT_SHELL_INIT([lt_ltdl_dir=']_ARG_DIR['])])],
        [m4_if(_ARG_DIR, _LTDL_DIR,
               [],
               [m4_fatal([multiple libltdl directories: `]_LTDL_DIR[', `]_ARG_DIR['])])])
m4_popdef([_ARG_DIR])
dnl If not otherwise defined, default to the 1.5.x compatible subproject mode:
m4_if(_LTDL_MODE, [],
      [m4_define([_LTDL_MODE], m4_default([$2], [subproject]))
       m4_if([-1], [m4_bregexp(_LTDL_MODE, [\(subproject\|\(non\)?recursive\)])],
             [m4_fatal([unknown libltdl mode: ]_LTDL_MODE)])])
])# LT_CONFIG_LTDL_DIR

# Initialise:
m4_define([_LTDL_DIR], [])
m4_define([_LTDL_MODE], [])

# LTDL_CONVENIENCE
# ...

GNU tools used:
autoconf 2.61
automake 1.10
libtool 2.1a_CVS.092407 (libtool from CVS 3 days ago)

Regards,
Mostyn Lewis
Re: [OMPI users] --enable-mca-no-build broken or bad docs?
Mostyn, It looks like the documentation is wrong (and has been wrong for years). I assume you were looking at the FAQ? I will update it tonight or tomorrow. Thanks for the report! Tim Mostyn Lewis wrote: I see docs for this like: --enable-mca-no-build=btl:mvapi,btl:openib,btl:gm,btl:mx,mtl:psm however, the code in a generated configure that parse this looks like: ... ifs_save="$IFS" IFS="${IFS}$PATH_SEPARATOR," msg= for item in $enable_mca_no_build; do type="`echo $item | cut -s -f1 -d-`" comp="`echo $item | cut -s -f2- -d-`" if test -z $type -o -z $comp ; then ... So this actually expects "-" and not ":" as a delimiter and --enable-mca-no-build=btl-mvapi,btl-openib,btl-gm,btl-mx,mtl-psm would parse. So, which is it? The docs or the last above? From a SVN of today. Regards, Mostyn Lewis ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Open MPI v1.2.4 released
On 26 September 2007 at 13:37, Francesco Pietra wrote:
| Are any detailed directions for upgrading (for common guys, not experts, I
| mean)? My 1.2.3 version on Debian Linux amd64 runs perfectly.

How about

sudo apt-get update; sudo apt-get dist-upgrade

provided you point to Debian unstable, which got 1.2.4 yesterday; ports for alpha, amd64, ia64, powerpc are already available too.

Dirk
part of Debian's pkg-openmpi team

--
Three out of two people have difficulties with fractions.