[OMPI users] incorrect configure code (1.2.4 and earlier)
Hi!

There are a couple of bugs in the configure scripts regarding the threads checking.

In ompi_check_pthread_pids.m4 the actual test code is wrong, and the macro is also missing a CFLAGS save / add THREAD_CFLAGS / restore step, so the link always fails for the -pthread test with gcc. config.log looks like this:

=
configure:50353: checking if threads have different pids (pthreads on linux)
configure:50409: gcc -o conftest -DNDEBUG -march=k8 -O3 -msse -msse2 -maccumulate-outgoing-args -finline-functions -fno-strict-aliasing -fexceptions conftest.c -lnsl -lutil -lm >&5
conftest.c: In function 'checkpid':
conftest.c:327: warning: cast to pointer from integer of different size
/tmp/ccqUaAns.o: In function `main':
conftest.c:(.text+0x1f): undefined reference to `pthread_create'
:conftest.c:(.text+0x2e): undefined reference to `pthread_join'
collect2: ld returned 1 exit status
configure:50412: $? = 1
configure: program exited with status 1
=

Adding the CFLAGS save/add/restore makes the code return the right answer both on systems with the old pthreads implementation and on NPTL-based systems. BUT, the code as it stands is technically incorrect; the patch has a corrected version.

There are also two bugs in ompi_config_pthreads.m4. In OMPI_INTL_POSIX_THREADS_LIBS_CXX it incorrectly sets PTHREAD_LIBS to $pl in the then-part of the second if-statement, where $pl is not set yet, and it forgets to reset LIBS on failure in the bottom-most if-else case in the "for pl" loop. In OMPI_INTL_POSIX_THREADS_LIBS_FC it resets LIBS whether successful or not, so -lpthread is missing when checking for PTHREAD_MUTEX_ERRORCHECK_NP, at least for some versions of pgi (6.1 and older fail; 7.0 seems to always add -lpthread with pgf77 as the linker). The output from configure in such a case looks like this:

checking if C compiler and POSIX threads work with -lpthread... yes
checking if C++ compiler and POSIX threads work with -lpthread... yes
checking if F77 compiler and POSIX threads work with -lpthread... yes
checking for PTHREAD_MUTEX_ERRORCHECK_NP... no
checking for PTHREAD_MUTEX_ERRORCHECK... no

(OS: Ubuntu Dapper, Compiler: pgi 6.1)

There is also a problem in the F90 modules include flag search. The test currently does:

$FC -c conftest-module.f90
$FC conftest.f90

This doesn't work if one has set FCFLAGS=-g in the environment, at least not with pgf90, since it needs the debug symbols from conftest-module.o to be able to link. You have to either add conftest-module.o to the compile line of conftest or make it a three-stager:

$FC -c conftest-module.f90; $FC -c conftest.f90; $FC conftest.o conftest-module.o

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134   Fax: +46 90 7866126
Mobile: +46 70 7716134   WWW: http://www.hpc2n.umu.se

diff -rU 10 site/config/ompi_config_pthreads.m4 p3/config/ompi_config_pthreads.m4
--- site/config/ompi_config_pthreads.m4   2006-08-15 22:14:05.0 +0200
+++ p3/config/ompi_config_pthreads.m4     2007-09-27 09:10:21.0 +0200
@@ -473,24 +473,24 @@
          CXXCPPFLAGS="$CXXCPPFLAGS $PTHREAD_CXXCPPFLAGS"
        fi ;;
   esac
   LIBS="$orig_LIBS $PTHREAD_LIBS"
   AC_LANG_PUSH(C++)
   OMPI_INTL_PTHREAD_TRY_LINK(ompi_pthread_cxx_success=1,
                              ompi_pthread_cxx_success=0)
   AC_LANG_POP(C++)
   if test "$ompi_pthread_cxx_success" = "1"; then
-    PTHREAD_LIBS="$pl"
     AC_MSG_RESULT([yes])
   else
     CXXCPPFLAGS="$orig_CXXCPPFLAGS"
+    LIBS="$orig_LIBS"
     AC_MSG_RESULT([no])
     AC_MSG_ERROR([Can not find working threads configuration.
 aborting])
   fi
 else
   for pl in $plibs; do
     AC_MSG_CHECKING([if C++ compiler and POSIX threads work with $pl])
     case "${host_cpu}-${host_os}" in
     *-aix* | *-freebsd*)
       if test "`echo $CXXCPPFLAGS | grep 'D_THREAD_SAFE'`" = ""; then
         PTRHEAD_CXXCPPFLAGS="-D_THREAD_SAFE"
@@ -508,61 +508,62 @@
   AC_LANG_PUSH(C++)
   OMPI_INTL_PTHREAD_TRY_LINK(ompi_pthread_cxx_success=1,
                              ompi_pthread_cxx_success=0)
   AC_LANG_POP(C++)
   if test "$ompi_pthread_cxx_success" = "1"; then
     PTHREAD_LIBS="$pl"
     AC_MSG_RESULT([yes])
   else
     PTHREAD_CXXCPPFLAGS=
     CXXCPPFLAGS="$orig_CXXCPPFLAGS"
+    LIBS="$orig_LIBS"
     AC_MSG_RESULT([no])
   fi
   done
 fi
 fi
 ])dnl

 AC_DEFUN([OMPI_INTL_POSIX_THREADS_LIBS_FC],[
 #
 # Fortran compiler
 #
 if test "$ompi_pthread_f77_success" = "0" -a "$OMPI_WANT_F77_BINDINGS" = "1"; then
   if test ! "$ompi_pthread_c_success" = "0" -a ! "$PTHREAD_LIBS" = "" ; then
     AC_MSG_CHECKING([if F77 compiler and POSIX threads work with $PTHREAD_LIBS])
     LIBS="$orig_LIBS $PTHREAD_LIBS"
     AC_LANG_PUSH(C)
     OMPI_INTL_PTHREAD_TRY_LINK_F77(ompi_pthread_f77_success=1,
                                    ompi_pthread_f77_success=0)
     AC_LANG_POP(C)
-
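As an aside on the first item above (the missing CFLAGS handling): the save/add/restore being referred to is the usual configure idiom. The following is only an illustrative shell sketch, not the actual patch; it assumes THREAD_CFLAGS holds -pthread as in the gcc case above, and that $CC, conftest.c and $LIBS are the usual configure-time values:

    CFLAGS_save="$CFLAGS"
    CFLAGS="$CFLAGS $THREAD_CFLAGS"            # e.g. THREAD_CFLAGS=-pthread for gcc
    $CC $CFLAGS -o conftest conftest.c $LIBS   # link of the pid-check test program
    ompi_result=$?
    CFLAGS="$CFLAGS_save"                      # restore so later tests are unaffected

Without adding THREAD_CFLAGS before the link, the test program is built without -pthread, which is exactly the undefined pthread_create/pthread_join reference shown in the config.log excerpt above.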
Re: [OMPI users] [Open MPI Announce] Open MPI v1.2.4 released
We have a working version of Open MPI on Windows. However, it's manually built, and the whole compilation process, as well as maintaining the project file, is a nightmare. That's why the Windows project files are not committed into the trunk. If you want, I can provide you with the solution and project files.

  george.

On Sep 26, 2007, at 10:33 AM, Damien Hocking wrote:

Is there a timeline for the Windows version yet?

Damien

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] incorrect configure code (1.2.4 and earlier)
Hi Ake, Looking at the svn logs it looks like you reported the problems with these checks quite a while ago and we fixed them (in r13773 https://svn.open-mpi.org/trac/ompi/changeset/13773), but we never moved them to the 1.2 branch. I will ask for this to be moved to the 1.2 branch. However, the changes made for ompi_config_pthreads.m4 are different than you are suggesting now. Is this changeset good enough, or are there other changes you think should be made? Thanks, Tim Åke Sandgren wrote: Hi! There are a couple of bugs in the configure scripts regarding threads checking. In ompi_check_pthread_pids.m4 the actual code for testing is wrong and is also missing a CFLAG save/add-THREAD_CFLAGS/restore resulting in the linking always failing for the -pthread test with gcc. config.log looks like this. = configure:50353: checking if threads have different pids (pthreads on linux) configure:50409: gcc -o conftest -DNDEBUG -march=k8 -O3 -msse -msse2 -maccumulate-outgoing-args -finline-functions -fno-strict-aliasing -fexceptions conftest.c -lnsl -lutil -lm >&5 conftest.c: In function 'checkpid': conftest.c:327: warning: cast to pointer from integer of different size /tmp/ccqUaAns.o: In function `main':conftest.c:(.text+0x1f): undefined reference to `pthread_create' :conftest.c:(.text+0x2e): undefined reference to `pthread_join' collect2: ld returned 1 exit status configure:50412: $? = 1 configure: program exited with status 1 = Adding the CFLAGS save/add/restore make the code return the right answer both on systems with the old pthreads implementation and NPTL based systems. BUT, the code as it stands is technically incorrect. The patch have a corrected version. There is also two bugs in ompi_config_pthreads.m4. In OMPI_INTL_POSIX_THREADS_LIBS_CXX it is incorrectly setting PTHREAD_LIBS to $pl, in the then-part of the second if-statement, which at the time isn't set yet and forgetting to reset LIBS on failure in the bottom most if-else case in the for pl loop. In OMPI_INTL_POSIX_THREADS_LIBS_FC it is resetting LIBS whether succesfull or not resulting in -lpthread missing when checking for PTHREAD_MUTEX_ERRORCHECK_NP at least for some versions of pgi, (6.1 and older fails, 7.0 seems to always add -lpthread with pgf77 as linker) The output from configure in such a case looks like this: checking if C compiler and POSIX threads work with -lpthread... yes checking if C++ compiler and POSIX threads work with -lpthread... yes checking if F77 compiler and POSIX threads work with -lpthread... yes checking for PTHREAD_MUTEX_ERRORCHECK_NP... no checking for PTHREAD_MUTEX_ERRORCHECK... no (OS: Ubuntu Dapper, Compiler: pgi 6.1) There is also a problem in the F90 modules include flag search. The test currently does: $FC -c conftest-module.f90 $FC conftest.f90 This doesn't work if one has set FCFLAGS=-g in the environment. At least not with pgf90 since it needs the debug symbols from conftest-module.o to be able to link. You have to either add conftest-module.o to the compile line of conftest or make it a three-stager, $FC -c conftest-module.f90; $FC -c conftest.f90; $FC conftest.o conftest-module.o ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] incorrect configure code (1.2.4 and earlier)
On Thu, 2007-09-27 at 09:09 -0400, Tim Prins wrote:
> Hi Ake,
>
> Looking at the svn logs it looks like you reported the problems with
> these checks quite a while ago and we fixed them (in r13773
> https://svn.open-mpi.org/trac/ompi/changeset/13773), but we never moved
> them to the 1.2 branch.

Yes, it's the same. Since I never saw it in the source, I tried once more with some more explanations, just in case :-)

> I will ask for this to be moved to the 1.2 branch.

Good.

> However, the changes made for ompi_config_pthreads.m4 are different than
> you are suggesting now. Is this changeset good enough, or are there
> other changes you think should be made?

The ones I sent today are slightly more correct. There were two missing LIBS="$orig_LIBS" in the failure cases. If you compare the resulting file after patching you will see the difference. They are in the "Can not find working threads configuration" portions.

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134   Fax: +46 90 7866126
Mobile: +46 70 7716134   WWW: http://www.hpc2n.umu.se
[OMPI users] Bundling OpenMPI
Hi,

We would like to distribute OpenMPI along with our software to customers; is there any legal issue we need to know about?

We can successfully build OpenMPI using

./configure --prefix=/some_path; make; make install

However, if we do cp -r /some_path /other_path and try to run /other_path/bin/orterun, the error message below is thrown:

--
Sorry!  You were supposed to get help about:
    orterun:usage
from the file:
    help-orterun.txt
But I couldn't find any file matching that name.  Sorry!
--

Apparently, the path is hard-coded in the executable. Is there any way to fix it (such as using an environment variable etc.)?

Thanks,
Teng
[OMPI users] mpirun ERROR: The daemon exited unexpectedly with status 255.
Hi,

I have a problem running a simple program, mpihello.cpp.

Here is an excerpt of the error and the command:

root@sun:~# mpirun -H sun,saturn main
[sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:25213] ERROR: A daemon on node saturn failed to start as expected.
[sun:25213] ERROR: There may be more information available from
[sun:25213] ERROR: the remote shell (see above).
[sun:25213] ERROR: The daemon exited unexpectedly with status 255.
[sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--
mpirun was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.
--

The program is runnable from each node alone (mpirun -np 2 main).

My PathVariables:
$PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
echo $LD_LIBRARY_PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib

Passwordless ssh is up 'n running.

I walked through the FAQ and mailing lists but couldn't find any solution for my problem.

Thanks,
Dino R.
Re: [OMPI users] mpirun ERROR: The daemon exited unexpectedly with status 255.
Hi Dino Try ssh saturn printenv | grep PATH from your host sun to see what your environment variables are when ssh is run without a shell. On 9/27/07, Dino Rossegger wrote: > Hi, > > I have a problem running a simple programm mpihello.cpp. > > Here is a excerp of the error and the command > root@sun:~# mpirun -H sun,saturn main > [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file > base/pls_base_orted_cmds.c at line 275 > [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at > line 1164 > [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90 > [sun:25213] ERROR: A daemon on node saturn failed to start as expected. > [sun:25213] ERROR: There may be more information available from > [sun:25213] ERROR: the remote shell (see above). > [sun:25213] ERROR: The daemon exited unexpectedly with status 255. > [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file > base/pls_base_orted_cmds.c at line 188 > [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at > line 1196 > -- > mpirun was unable to cleanly terminate the daemons for this job. > Returned value Timeout instead of ORTE_SUCCESS. > > -- > > The program is runable from each node alone (mpirun -np2 main) > > My PathVariables: > $PATH > /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/libecho > $LD_LIBRARY_PATH > /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib > > Passwordless ssh is up 'n running > > I walked through the FAQ and Mailing Lists but couldn't find any > solution for my problem. > > Thanks > Dino R. > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
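To expand on jody's check: what matters is the environment a non-interactive ssh session gets on the remote node, not the one in your interactive login shell. A sketch of the check and a typical remedy follows; the /usr/local prefix is only an assumed example (use whatever --prefix Open MPI was installed with), and ~/.bashrc must be a file your shell actually reads for non-interactive logins:

    # from sun: what does a non-interactive shell on saturn see?
    ssh saturn printenv | grep PATH

    # on saturn: prepend the Open MPI install directories for such shells
    echo 'export PATH=/usr/local/bin:$PATH' >> ~/.bashrc
    echo 'export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH' >> ~/.bashrc

Alternatively, running mpirun with --prefix, or configuring Open MPI with --enable-mpirun-prefix-by-default, avoids depending on the remote shell's startup files.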
[OMPI users] SIGSEG in ompi_comm_start_processes
Hi all,

I think I found a bug and a fix for it. Could someone verify the rationale behind this bug, as I get this SIGSEG on only one of two machines, and I don't quite see why it doesn't occur always (same test program, equally compiled OpenMPI 1.2.4). The fix does prevent the segmentation fault, though. :)

Thanks,
Murat


Where:

Bug: free() crashes when trying to free stack memory

ompi/communicator/comm_dyn.c:630
    OBJ_RELEASE(apps[i]);

SIGSEG: orte/mca/rmgr/rmgr_types.h:113
    free (app_context->cwd);

There are two ways that apps[i]->cwd is filled:

1. dynamically allocated memory

548    if ( !have_wdir ) {
           getcwd(cwd, OMPI_PATH_MAX);
           apps[i]->cwd = strdup(cwd);   // <--
       }

2. stack

354    char cwd[OMPI_PATH_MAX];
       // ...
516    /* check for 'wdir' */
       ompi_info_get (array_of_info[i], "wdir", valuelen, cwd, &flag);
       if ( flag ) {
           apps[i]->cwd = cwd;           // <--
           have_wdir = 1;
       }

Fix: Always allocate cwd manually and make sure it is freed afterwards.

1.
       char *cwd = (char*)malloc(OMPI_PATH_MAX);

2. And on cleanup (somewhere below line 624)

       if ( !have_wdir ) {
           getcwd(cwd, OMPI_PATH_MAX);
           apps[i]->cwd = strdup(cwd);
       }
Re: [OMPI users] SIGSEG in ompi_comm_start_processes
Copy-and-paste-error: The second part of the fix ought to be: if ( !have_wdir ) { free(cwd); } Murat Murat Knecht schrieb: > Hi all, > > I think, I found a bug and a fix for it. > Could someone verify the rationale behind this bug, as I have this > SIGSEG on only one of two machines, and I don't quite see why it doesn't > occur always. (Same testprogram, equally compiled 1.2.4 OpenMPI). > Though the fix does prevent the segmentation fault. :) > > Thanks, > Murat > > > > > > Where: > Bug: > free() crashes when trying to free stack memory > ompi/communicator/comm_dyn.c:630 > > OBJ_RELEASE(apps[i]); > > > SIGSEG: > orte/mca/rmgr/rmgr_types.h:113 > > free (app_context->cwd); > > > > There are two ways that apps[i]->cwd is filled: > 1. dynamically allocated memory > 548 if ( !have_wdir ) { > getcwd(cwd, OMPI_PATH_MAX); > apps[i]->cwd = strdup(cwd);// <-- > } > > 2. stack > 354char cwd[OMPI_PATH_MAX]; > // ... > 516 /* check for 'wdir' */ > ompi_info_get (array_of_info[i], "wdir", valuelen, cwd, &flag); > if ( flag ) { > apps[i]->cwd = cwd; // <-- > have_wdir = 1; > } > > > > Fix: Allocate cwd always manually and make sure, it is deleted afterwards. > > 1. >--- > >>char *cwd = (char*)malloc(OMPI_PATH_MAX); >> > > 2. And on cleanup (somewhere below line 624) > > >>if ( !have_wdir ) { >>getcwd(cwd, OMPI_PATH_MAX); >>apps[i]->cwd = strdup(cwd); >>} >> > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] incorrect configure code (1.2.4 and earlier)
Åke Sandgren wrote: On Thu, 2007-09-27 at 09:09 -0400, Tim Prins wrote: Hi Ake, Looking at the svn logs it looks like you reported the problems with these checks quite a while ago and we fixed them (in r13773 https://svn.open-mpi.org/trac/ompi/changeset/13773), but we never moved them to the 1.2 branch. Yes, it's the same. Since i never saw it in the source i tried once more with some more explanations just in case :-) I will ask for this to be moved to the 1.2 branch. Good. However, the changes made for ompi_config_pthreads.m4 are different than you are suggesting now. Is this changeset good enough, or are there other changes you think should be made? The ones i sent today are slightly more correct. There where 2 missing LIBS="$orig_LIBS" in the failure cases. But do we really need these? It looks like configure aborts in these cases (I am not a autoconf wizard, so I could be completely wrong here). Tim If you compare the resulting file after patching you will see the difference. They are in the "Can not find working threads configuration" portions.
Re: [OMPI users] incorrect configure code (1.2.4 and earlier)
On Thu, 2007-09-27 at 14:18 -0400, Tim Prins wrote: > Åke Sandgren wrote: > > On Thu, 2007-09-27 at 09:09 -0400, Tim Prins wrote: > >> Hi Ake, > >> > >> Looking at the svn logs it looks like you reported the problems with > >> these checks quite a while ago and we fixed them (in r13773 > >> https://svn.open-mpi.org/trac/ompi/changeset/13773), but we never moved > >> them to the 1.2 branch. > > > > Yes, it's the same. Since i never saw it in the source i tried once more > > with some more explanations just in case :-) > > > >> I will ask for this to be moved to the 1.2 branch. > > > > Good. > > > >> However, the changes made for ompi_config_pthreads.m4 are different than > >> you are suggesting now. Is this changeset good enough, or are there > >> other changes you think should be made? > > > > The ones i sent today are slightly more correct. There where 2 missing > > LIBS="$orig_LIBS" in the failure cases. > But do we really need these? It looks like configure aborts in these > cases (I am not a autoconf wizard, so I could be completely wrong here). I don't know. I just put them in since it was the right thing to do. And there where other variables that was reset in those places. -- Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden Internet: a...@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90 7866126 Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
Re: [OMPI users] Application using OpenMPI 1.2.3 hangs, error messages in mca_btl_tcp_frag_recv
Here's some more info on the problem I've been struggling with; my apologies for the lengthy posts, but I'm a little desperate here :-) I was able to reduce the size of the experiment that reproduces the problem, both in terms of input data size and the number of slots in the cluster. The cluster now consists of 6 slots (5 clients), with two of the clients running on the same node as the server and three others on another node. This allowed me to follow Brian's advice and run the server and all the clients under gdb and make sure none of the processes terminates (normally or abnormally) when the server reports the "readv failed" errors; this is indeed the case. I then followed Jeff's advice and added a debug loop just prior to the server calling MPI_Waitany(), identifying the entries in the requests array which are not MPI_REQUEST_NULL, and then tracing back these requests. What I found was the following: At some point during the run, the server calls MPI_Waitany() on an array of requests consisting of 96 elements, and gets stuck in it forever; the only thing that happens at some point thereafter is that the server reports a couple of "readv failed" errors: [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110 [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110 According to my debug prints, just before that last call to MPI_Waitany() the array requests[] contains 38 entries which are not MPI_REQUEST_NULL. Half of these entries correspond to calls to Isend(), half to Irecv(). Specifically, for example, entries 4,14,24,34,44,54,64,74,84,94 are used for Isend()'s from server to client #3 (of 5), and entries 5,15,...,95 are used for Irecv() for the same client. I traced back what's going on, for instance, with requests[4]. As I mentioned, it corresponds to a call to MPI_Isend() initiated by the server to client #3 (of 5). By the time the server gets stuck in Waitany(), this client has already correctly processed the first Isend() from master in requests[4], returned its response in requests[5], and the server received this response properly. After receiving this response, the server Isend()'s the next task to this client in requests[4], and this is correctly reflected in "requests[4] != MPI_REQUESTS_NULL" just before the last call to Waitany(), but for some reason this send doesn't seem to go any further. Looking at all other requests[] corresponding to Isend()'s initiated by the server to the same client (14,24,...,94), they're all also not MPI_REQUEST_NULL, and are not going any further either. One thing that might be important is that the messages the server is sending to the clients in my experiment are quite large, ranging from hundreds of Kbytes to several Mbytes, the largest being around 9 Mbytes. The largest messages take place at the beginning of the run and are processed correctly though. Also, I ran the same experiment on another cluster that uses slightly different hardware and network infrastructure, and could not reproduce the problem. Hope at least some of the above makes some sense. Any additional advice would be greatly appreciated! Many thanks, Daniel Daniel Rozenbaum wrote: I'm now running the same experiment under valgrind. 
It's probably going to run for a few days, but interestingly what I'm seeing now is that while running under valgrind's memcheck, the app has been reporting much more of these "recv failed" errors, and not only on the server node: [host1][0,1,0] [host4][0,1,13] [host5][0,1,18] [host8][0,1,30] [host10][0,1,36] [host12][0,1,46] If in the original run I got 3 such messages, in the valgrind'ed run I got about 45 so far, and the app still has about 75% of the work left. I'm checking while all this is happening, and all the client processes are still running, none exited early. I've been analyzing the debug output in my original experiment, and it does look like the server never receives any new messages from two of the clients after the "recv failed" messages appear. If my analysis is correct, these two clients ran on the same host. It might be the case then that the messages with the next tasks to execute that the server attempted to send to these two clients never reached them, or were never sent. Interesting though that there were two additional clients on the same host, and those seem to have kept working all along, until the app got stuck. Once this valgrind experiment is over, I'll proceed to your other suggestion about the debug loop on the server side checking for any of the requests the app is waiting for being MPI_REQUEST_NULL. Many thanks, Daniel Jeff Squyres wrote: On Sep 17, 2007, at 11:26 AM, Daniel Rozenbaum wrote: What seems to be happening is this: the code of the server is written in such a manner that the server knows how many "responses" it's supposed to receive from all the c
Re: [OMPI users] mpirun ERROR: The daemon exited unexpectedly with status 255.
Hi Jody, Thanks for your help, it really is the case that either in PATH nor in LD_LIBRARY_PATH the path to the libs is set correctly. I'll try out, hope it works. jody schrieb: > Hi Dino > > Try > ssh saturn printenv | grep PATH >>from your host sun to see what your environment variables are when > ssh is run without a shell. > > > On 9/27/07, Dino Rossegger wrote: >> Hi, >> >> I have a problem running a simple programm mpihello.cpp. >> >> Here is a excerp of the error and the command >> root@sun:~# mpirun -H sun,saturn main >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file >> base/pls_base_orted_cmds.c at line 275 >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at >> line 1164 >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90 >> [sun:25213] ERROR: A daemon on node saturn failed to start as expected. >> [sun:25213] ERROR: There may be more information available from >> [sun:25213] ERROR: the remote shell (see above). >> [sun:25213] ERROR: The daemon exited unexpectedly with status 255. >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file >> base/pls_base_orted_cmds.c at line 188 >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at >> line 1196 >> -- >> mpirun was unable to cleanly terminate the daemons for this job. >> Returned value Timeout instead of ORTE_SUCCESS. >> >> -- >> >> The program is runable from each node alone (mpirun -np2 main) >> >> My PathVariables: >> $PATH >> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/libecho >> $LD_LIBRARY_PATH >> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib >> >> Passwordless ssh is up 'n running >> >> I walked through the FAQ and Mailing Lists but couldn't find any >> solution for my problem. >> >> Thanks >> Dino R. >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] mpirun ERROR: The daemon exited unexpectedly with status 255.
Note that you may be able to get some more error output by adding --debug-daemons to the mpirun command line. Tim On Thursday 27 September 2007 05:12:53 pm Dino Rossegger wrote: > Hi Jody, > > Thanks for your help, it really is the case that either in PATH nor in > LD_LIBRARY_PATH the path to the libs is set correctly. I'll try out, > hope it works. > > jody schrieb: > > Hi Dino > > > > Try > > ssh saturn printenv | grep PATH > > > >>from your host sun to see what your environment variables are when > > > > ssh is run without a shell. > > > > On 9/27/07, Dino Rossegger wrote: > >> Hi, > >> > >> I have a problem running a simple programm mpihello.cpp. > >> > >> Here is a excerp of the error and the command > >> root@sun:~# mpirun -H sun,saturn main > >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file > >> base/pls_base_orted_cmds.c at line 275 > >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at > >> line 1164 > >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line > >> 90 [sun:25213] ERROR: A daemon on node saturn failed to start as > >> expected. [sun:25213] ERROR: There may be more information available > >> from [sun:25213] ERROR: the remote shell (see above). > >> [sun:25213] ERROR: The daemon exited unexpectedly with status 255. > >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file > >> base/pls_base_orted_cmds.c at line 188 > >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at > >> line 1196 > >> > >>-- mpirun was unable to cleanly terminate the daemons for this job. > >> Returned value Timeout instead of ORTE_SUCCESS. > >> > >> > >>-- > >> > >> The program is runable from each node alone (mpirun -np2 main) > >> > >> My PathVariables: > >> $PATH > >> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/: > >>/usr/lib:/usr/local/libecho $LD_LIBRARY_PATH > >> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/: > >>/usr/lib:/usr/local/lib > >> > >> Passwordless ssh is up 'n running > >> > >> I walked through the FAQ and Mailing Lists but couldn't find any > >> solution for my problem. > >> > >> Thanks > >> Dino R. > >> > >> > >> ___ > >> users mailing list > >> us...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
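To illustrate Tim's suggestion with the command from the original report (a sketch only, using the option name given above):

    mpirun --debug-daemons -H sun,saturn main

This should make the remote orted daemons report what they are doing, which usually reveals whether the failure is a PATH/LD_LIBRARY_PATH problem on the remote node or something else.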
Re: [OMPI users] Bundling OpenMPI
Hi Teng, Teng Lin wrote: Hi, We would like to distribute OpenMPI along with our software to customers, is there any legal issue we need to know about? Not that I know of (disclaimer: IANAL). Open MPI is licensed under the new BSD license. Open MPI's license is here: http://www.open-mpi.org/community/license.php We can successfully build OpenMPI using ./configure --prefix=/some_path;make;make install However, if we do cp -r /some_path /other_path and try to run /other_path/bin/orterun, below error message is thrown: -- Sorry! You were supposed to get help about: orterun:usage from the file: help-orterun.txt But I couldn't find any file matching that name. Sorry! -- Apparently, the path is hard-coded in the executable. Is there any way to fix it (such as using an environment variable etc)? There is. See: http://www.open-mpi.org/faq/?category=building#installdirs Hope this helps, Tim
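For the relocation question, the FAQ entry boils down to overriding the installation prefix at run time. A minimal sketch for the case described above, assuming your Open MPI version honors OPAL_PREFIX (which is what that FAQ entry describes), with /other_path being the copied tree from the original mail:

    export OPAL_PREFIX=/other_path
    export PATH=/other_path/bin:$PATH
    export LD_LIBRARY_PATH=/other_path/lib:$LD_LIBRARY_PATH
    /other_path/bin/orterun --help

With OPAL_PREFIX set, orterun/mpirun should look for its help files and components under /other_path instead of the prefix compiled in at configure time.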
Re: [OMPI users] SIGSEG in ompi_comm_start_processes
Murat, Thanks for the bug report. I have fixed (slightly differently than you suggested) this in the Open MPI trunk in r16265 and it should be available in the nightly trunk tarball tonight. I will ask to have this moved into the next release of Open MPI. Thanks, Tim Murat Knecht wrote: Copy-and-paste-error: The second part of the fix ought to be: if ( !have_wdir ) { free(cwd); } Murat Murat Knecht schrieb: Hi all, I think, I found a bug and a fix for it. Could someone verify the rationale behind this bug, as I have this SIGSEG on only one of two machines, and I don't quite see why it doesn't occur always. (Same testprogram, equally compiled 1.2.4 OpenMPI). Though the fix does prevent the segmentation fault. :) Thanks, Murat Where: Bug: free() crashes when trying to free stack memory ompi/communicator/comm_dyn.c:630 OBJ_RELEASE(apps[i]); SIGSEG: orte/mca/rmgr/rmgr_types.h:113 free (app_context->cwd); There are two ways that apps[i]->cwd is filled: 1. dynamically allocated memory 548 if ( !have_wdir ) { getcwd(cwd, OMPI_PATH_MAX); apps[i]->cwd = strdup(cwd);// <-- } 2. stack 354char cwd[OMPI_PATH_MAX]; // ... 516 /* check for 'wdir' */ ompi_info_get (array_of_info[i], "wdir", valuelen, cwd, &flag); if ( flag ) { apps[i]->cwd = cwd; // <-- have_wdir = 1; } Fix: Allocate cwd always manually and make sure, it is deleted afterwards. 1.cwd = strdup(cwd); } ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] --enable-mca-no-build broken or bad docs?
I see docs for this like:

--enable-mca-no-build=btl:mvapi,btl:openib,btl:gm,btl:mx,mtl:psm

however, the code in a generated configure that parses this looks like:

...
    ifs_save="$IFS"
    IFS="${IFS}$PATH_SEPARATOR,"
    msg=
    for item in $enable_mca_no_build; do
        type="`echo $item | cut -s -f1 -d-`"
        comp="`echo $item | cut -s -f2- -d-`"
        if test -z $type -o -z $comp ; then
...

So this actually expects "-" and not ":" as a delimiter, and

--enable-mca-no-build=btl-mvapi,btl-openib,btl-gm,btl-mx,mtl-psm

would parse. So, which is it? The docs or the code above? From an SVN checkout of today.

Regards,
Mostyn Lewis
[OMPI users] aclocal.m4 booboo?
Today's SVN. A generated configure has this in it:

...
###
# Libtool: part two
# (after C compiler setup)
ompi_show_subtitle "Libtool configuration"

_LT_SHELL_INIT(lt_ltdl_dir='opal/libltdl')

case $enable_ltdl_convenience in
  no) { { echo "$as_me:$LINENO: error: this package needs a convenience libltdl" >&5
echo "$as_me: error: this package needs a convenience libltdl" >&2;}
...

I guess this comes from aclocal.m4:

...
])# LT_CONFIG_LTDL_DIR

# We break this out into a separate macro, so that we can call it safely
# internally without being caught accidentally by the sed scan in libtoolize.
m4_defun([_LT_CONFIG_LTDL_DIR],
[dnl remove trailing slashes
m4_pushdef([_ARG_DIR], m4_bpatsubst([$1], [/*$]))
m4_case(_LTDL_DIR,
        [], [dnl only set lt_ltdl_dir if _ARG_DIR is not simply `.'
             m4_if(_ARG_DIR, [.],
                   [],
                   [m4_define([_LTDL_DIR], _ARG_DIR)
                    _LT_SHELL_INIT([lt_ltdl_dir=']_ARG_DIR['])])],
        [m4_if(_ARG_DIR, _LTDL_DIR,
               [],
               [m4_fatal([multiple libltdl directories: `]_LTDL_DIR[', `]_ARG_DIR['])])])
m4_popdef([_ARG_DIR])
dnl If not otherwise defined, default to the 1.5.x compatible subproject mode:
m4_if(_LTDL_MODE, [],
      [m4_define([_LTDL_MODE], m4_default([$2], [subproject]))
       m4_if([-1], [m4_bregexp(_LTDL_MODE, [\(subproject\|\(non\)?recursive\)])],
             [m4_fatal([unknown libltdl mode: ]_LTDL_MODE)])])
])# LT_CONFIG_LTDL_DIR

# Initialise:
m4_define([_LTDL_DIR], [])
m4_define([_LTDL_MODE], [])

# LTDL_CONVENIENCE
# ...

GNU tools used:
autoconf 2.61
automake 1.10
libtool 2.1a_CVS.092407 (libtool from CVS 3 days ago)

Regards,
Mostyn Lewis
Re: [OMPI users] --enable-mca-no-build broken or bad docs?
Mostyn, It looks like the documentation is wrong (and has been wrong for years). I assume you were looking at the FAQ? I will update it tonight or tomorrow. Thanks for the report! Tim Mostyn Lewis wrote: I see docs for this like: --enable-mca-no-build=btl:mvapi,btl:openib,btl:gm,btl:mx,mtl:psm however, the code in a generated configure that parse this looks like: ... ifs_save="$IFS" IFS="${IFS}$PATH_SEPARATOR," msg= for item in $enable_mca_no_build; do type="`echo $item | cut -s -f1 -d-`" comp="`echo $item | cut -s -f2- -d-`" if test -z $type -o -z $comp ; then ... So this actually expects "-" and not ":" as a delimiter and --enable-mca-no-build=btl-mvapi,btl-openib,btl-gm,btl-mx,mtl-psm would parse. So, which is it? The docs or the last above? From a SVN of today. Regards, Mostyn Lewis ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Open MPI v1.2.4 released
On 26 September 2007 at 13:37, Francesco Pietra wrote:
| Are any detailed directions for upgrading (for common guys, not experts, I
| mean)? My 1.2.3 version on Debian Linux amd64 runs perfectly.

How about

sudo apt-get update; sudo apt-get dist-upgrade

provided you point to Debian unstable, which got 1.2.4 yesterday; ports for alpha, amd64, ia64, powerpc are already available too.

Dirk
part of Debian's pkg-openmpi team

--
Three out of two people have difficulties with fractions.