I believe I know what is happening here. My availability in the next week is
pretty limited due to a family emergency, but I'll take a look when I get
back. In brief, this is a resource starvation issue where the system thinks
your node is unable to support any further processes and so it blocks.
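
In the meantime, one thing you might try is releasing the communicators you
create on each pass through the loop instead of holding them all until the
end. A minimal, untested sketch (the child path is just a placeholder for
your Exe):

#include "mpi.h"

int main( int argc, char **argv ) {
    MPI_Comm lIntercom, lMerged;
    int lProvided, lErrcode, lIter;

    MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &lProvided );

    for (lIter = 0; lIter < 1000; lIter++) {
        /* spawn one child and merge, as in your reproducer */
        MPI_Comm_spawn( "/path/to/Exe", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                        0, MPI_COMM_WORLD, &lIntercom, &lErrcode );
        MPI_Intercomm_merge( lIntercom, 0, &lMerged );

        /* ...do whatever work is needed with the child here... */

        /* then give both communicators back before the next spawn */
        MPI_Comm_free( &lMerged );
        MPI_Comm_free( &lIntercom );
    }

    MPI_Finalize( );
    return 0;
}

I don't know whether that avoids whatever breaks at spawn #31, but it at
least keeps the number of live communicators from growing each iteration.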

On a separate note, I never use threaded configurations due to the lack of
any real thread-safety review or testing on Open MPI to date (per Tim's
earlier comment). My "standard" configuration for development and testing is
with --disable-progress-threads --without-threads.
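
For reference, that corresponds to configuring roughly as follows (the install
prefix here is only an example):

./configure --prefix=/opt/openmpi-nothreads \
    --disable-progress-threads --without-threads
make all install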

I'll post something back to the list when I get it resolved.

Thanks
Ralph


On 3/6/07 9:00 AM, "rozzen.vinc...@fr.thalesgroup.com"
<rozzen.vinc...@fr.thalesgroup.com> wrote:

> Hi Tim, getting back to you:
> 
> "What kind of system is it?"
> =>The system is a "Debian Sarge".
> "How many nodes are you running on?"
> => There is no cluster configured, so I guess I work with no node
> environnement.
> "Have you been able to try a more recent version of Open MPI?"
> =>Today, I tried with version 1.1.4, but the results are not better.
> I tested two cases:
> Test 1: with the same configuration options (./configure
> --enable-mpi-threads --enable-progress-threads --with-threads=posix
> --enable-smp-locks)
> The program stopped in MPI_Init_thread, in __lll_mutex_lock_wait () from
> /lib/tls/libpthread.so.0
> 
> Test 2: with the default configuration options (./configure
> --prefix=/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread)
> The program stopped on the "node allocation" after spawn #31.
> Maybe the problem comes from the lack of a node definition?
> Thanks for your help.
> 
> Below are the log files from the two tests:
> 
> /******************************TEST 1*******************************/
> GNU gdb 6.3-debian
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "i386-linux"...Using host libthread_db library
> "/lib/tls/libthread_db.so.1".
> 
> (gdb) run
> Starting program: /home/workspace/test_spaw1/src/spawn
> [Thread debugging using libthread_db enabled]
> [New Thread 1076646560 (LWP 5178)]
> main*******************************
> main : Lancement MPI*
> [New Thread 1085225904 (LWP 5181)]
> [New Thread 1094495152 (LWP 5182)]
> 
> Program received signal SIGINT, Interrupt.
> [Switching to Thread 1076646560 (LWP 5178)]
> 0x4018a436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> (gdb) where
> #0  0x4018a436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> #1  0x40187893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
> #2  0xbffff508 in ?? ()
> #3  0x4000bcd0 in _dl_map_object_deps () from /lib/ld-linux.so.2
> #4  0x40b9f8cb in mca_btl_tcp_component_create_listen () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_btl_tcp.so
> #5  0x40b9f8cb in mca_btl_tcp_component_create_listen () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_btl_tcp.so
> #6  0x40b9eef4 in mca_btl_tcp_component_init () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_btl_tcp.so
> #7  0x4008c652 in mca_btl_base_select () from
> /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #8  0x40b8dd28 in mca_bml_r2_component_init () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_bml_r2.so
> #9  0x4008bf54 in mca_bml_base_init () from
> /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #10 0x40b7e5c9 in mca_pml_ob1_component_init () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_pml_ob1.so
> #11 0x40094192 in mca_pml_base_select () from
> /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #12 0x4005742c in ompi_mpi_init () from
> /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #13 0x4007c182 in PMPI_Init_thread () from
> /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #14 0x080489f3 in main (argc=1, argv=0xbffff8a4) at spawn6.c:33
> 
> 
> 
> /******************************TEST 2*******************************/
> 
> GNU gdb 6.3-debian
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "i386-linux"...Using host libthread_db library
> "/lib/tls/libthread_db.so.1".
> 
> (gdb) run -np 1 --host myhost spawn6
> Starting program: /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/bin/mpirun -np
> 1 --host myhost spawn6
> [Thread debugging using libthread_db enabled]
> [New Thread 1076121728 (LWP 4022)]
> main*******************************
> main : Lancement MPI*
> Exe : Lance
> Exe: lRankExe  = 1   lRankMain  = 0
> 1 main***MPI_Comm_spawn return : 0
> 1 main***Rang main : 0   Rang exe : 1
> Exe : Lance
> Exe: Fin.
> 
> 
> Exe: lRankExe  = 1   lRankMain  = 0
> 2 main***MPI_Comm_spawn return : 0
> 2 main***Rang main : 0   Rang exe : 1
> Exe : Lance
> Exe: Fin.
> 
> ...
> 
> Exe: lRankExe  = 1   lRankMain  = 0
> 30 main***MPI_Comm_spawn return : 0
> 30 main***Rang main : 0   Rang exe : 1
> Exe : Lance
> Exe: Fin.
> 
> Exe: lRankExe  = 1   lRankMain  = 0
> 31 main***MPI_Comm_spawn return : 0
> 31 main***Rang main : 0   Rang exe : 1
> 
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1076121728 (LWP 4022)]
> 0x4018833b in strlen () from /lib/tls/libc.so.6
> (gdb) where
> #0  0x4018833b in strlen () from /lib/tls/libc.so.6
> #1  0x40297c5e in orte_gpr_replica_create_itag () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
> #2  0x4029d2df in orte_gpr_replica_put_fn () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
> #3  0x40297281 in orte_gpr_replica_put () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
> #4  0x40048287 in orte_ras_base_node_assign () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #5  0x400463e1 in orte_ras_base_allocate_nodes () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #6  0x402c2bb8 in orte_ras_hostfile_allocate () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_ras_hostfile.so
> #7  0x400464e0 in orte_ras_base_allocate () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #8  0x402b063f in orte_rmgr_urm_allocate () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_rmgr_urm.so
> #9  0x4004f277 in orte_rmgr_base_cmd_dispatch () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #10 0x402b10ae in orte_rmgr_urm_recv () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_rmgr_urm.so
> #11 0x4004301e in mca_oob_recv_callback () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #12 0x4027a748 in mca_oob_tcp_msg_data () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_oob_tcp.so
> #13 0x4027bb12 in mca_oob_tcp_peer_recv_handler () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_oob_tcp.so
> #14 0x400703f9 in opal_event_loop () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/libopal.so.0
> #15 0x4006adfa in opal_progress () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/libopal.so.0
> #16 0x0804c7a1 in opal_condition_wait (c=0x804fbcc, m=0x804fba8) at
> condition.h:81
> #17 0x0804a4c8 in orterun (argc=6, argv=0xbffff854) at orterun.c:427
> #18 0x08049dd6 in main (argc=6, argv=0xbffff854) at main.c:13
> (gdb)
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf
> Of Tim Prins
> Sent: Monday, 5 March 2007, 22:34
> To: Open MPI Users
> Subject: Re: [OMPI users] MPI_Comm_Spawn
> 
> 
> Never mind, I was just able to replicate it. I'll look into it.
> 
> Tim
> 
> On Mar 5, 2007, at 4:26 PM, Tim Prins wrote:
> 
>> That is possible. Threading support is VERY lightly tested, but I
>> doubt it is the problem since it always fails after 31 spawns.
>> 
>> Again, I have tried with these configure options and the same version
>> of Open MPI and have still not been able to replicate this (after
>> letting it spawn over 500 times). Have you been able to try a more
>> recent version of Open MPI? What kind of system is it? How many nodes
>> are you running on?
>> 
>> Tim
>> 
>> On Mar 5, 2007, at 1:21 PM, rozzen.vinc...@fr.thalesgroup.com wrote:
>> 
>>> 
>>> Maybe the problem comes from the configuration options.
>>> The configuration options used are:
>>> ./configure --enable-mpi-threads --enable-progress-threads
>>> --with-threads=posix --enable-smp-locks
>>> Could you give me your point of view about that, please?
>>> Thanks
>>> 
>>> -----Original Message-----
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>>> On Behalf Of Ralph H Castain
>>> Sent: Tuesday, 27 February 2007, 16:26
>>> To: Open MPI Users <us...@open-mpi.org>
>>> Subject: Re: [OMPI users] MPI_Comm_Spawn
>>> 
>>> 
>>> Now that's interesting! There shouldn't be a limit, but to be
>>> honest, I've
>>> never tested that mode of operation - let me look into it and see.
>>> It sounds
>>> like there is some counter that is overflowing, but I'll look.
>>> 
>>> Thanks
>>> Ralph
>>> 
>>> 
>>> On 2/27/07 8:15 AM, "rozzen.vinc...@fr.thalesgroup.com"
>>> <rozzen.vinc...@fr.thalesgroup.com> wrote:
>>> 
>>>> Do you know if there is a limit to the number of MPI_Comm_spawn calls we
>>>> can use to launch a program?
>>>> I want to start and stop a program several times (with the function
>>>> MPI_Comm_spawn), but every time, after 31 MPI_Comm_spawn calls, I get a
>>>> "segmentation fault".
>>>> Could you give me your point of view on how to solve this problem?
>>>> Thanks
>>>> 
>>>> /* main .c file: spawns the program Exe repeatedly */
>>>> #include <stdio.h>
>>>> #include <malloc.h>
>>>> #include <unistd.h>
>>>> #include "mpi.h"
>>>> #include <pthread.h>
>>>> #include <signal.h>
>>>> #include <sys/time.h>
>>>> #include <errno.h>
>>>> #define EXE_TEST "/home/workspace/test_spaw1/src/Exe"
>>>> 
>>>> int main( int argc, char **argv ) {
>>>> 
>>>>     long *lpBufferMpi;
>>>>     MPI_Comm lIntercom;
>>>>     int lErrcode;
>>>>     MPI_Comm lCommunicateur;
>>>>     int lRangMain, lRangExe, lMessageEnvoi, lIter;
>>>>     int NiveauThreadVoulu, NiveauThreadObtenu, lTailleBuffer;
>>>>     int *lpMessageEnvoi = &lMessageEnvoi;
>>>>     MPI_Status lStatus;             /* receive status */
>>>> 
>>>>     lIter = 0;
>>>> 
>>>>     /* Set up the MPI environment */
>>>>     printf("main*******************************\n");
>>>>     printf("main : Lancement MPI*\n");
>>>> 
>>>>     NiveauThreadVoulu = MPI_THREAD_MULTIPLE;
>>>>     MPI_Init_thread( &argc, &argv, NiveauThreadVoulu, &NiveauThreadObtenu );
>>>>     lpBufferMpi = calloc( 10000, sizeof(long) );
>>>>     MPI_Buffer_attach( (void*)lpBufferMpi, 10000 * sizeof(long) );
>>>> 
>>>>     while (lIter < 1000) {
>>>>         lIter++;
>>>>         lIntercom = (MPI_Comm)-1;
>>>> 
>>>>         /* Spawn one child process and merge the intercommunicator */
>>>>         MPI_Comm_spawn( EXE_TEST, NULL, 1, MPI_INFO_NULL,
>>>>                         0, MPI_COMM_WORLD, &lIntercom, &lErrcode );
>>>>         printf( "%i main***MPI_Comm_spawn return : %d\n", lIter, lErrcode );
>>>> 
>>>>         if (lIntercom == (MPI_Comm)-1) {
>>>>             printf("%i Intercom null\n", lIter);
>>>>             return 0;
>>>>         }
>>>>         MPI_Intercomm_merge( lIntercom, 0, &lCommunicateur );
>>>>         MPI_Comm_rank( lCommunicateur, &lRangMain );
>>>>         lRangExe = 1 - lRangMain;
>>>> 
>>>>         printf("%i main***Rang main : %i   Rang exe : %i\n",
>>>>                lIter, (int)lRangMain, (int)lRangExe);
>>>>         sleep(2);
>>>>     }
>>>> 
>>>>     /* Shut down the MPI environment */
>>>>     lTailleBuffer = 10000 * sizeof(long);
>>>>     MPI_Buffer_detach( (void*)lpBufferMpi, &lTailleBuffer );
>>>>     MPI_Comm_free( &lCommunicateur );
>>>>     MPI_Finalize( );
>>>>     free( lpBufferMpi );
>>>> 
>>>>     printf( "Main = End .\n" );
>>>>     return 0;
>>>> }
>>>> /*********************************************************************/
>>>> Exe:
>>>> #include <string.h>
>>>> #include <stdlib.h>
>>>> #include <stdio.h>
>>>> #include <malloc.h>
>>>> #include <unistd.h>     /* for sleep() */
>>>> #include <pthread.h>
>>>> #include <semaphore.h>
>>>> #include "mpi.h"
>>>> 
>>>> int main( int argc, char **argv ) {
>>>>     /* 1) For MPI communication */
>>>>     MPI_Comm lCommunicateur;        /* communicator of this process */
>>>>     MPI_Comm CommParent;            /* parent communicator to retrieve */
>>>>     int lRank;                      /* rank of this process in the communicator */
>>>>     int lRangMain;                  /* rank of the sequencer when launched in normal mode */
>>>>     int lTailleCommunicateur;       /* size of the communicator */
>>>>     long *lpBufferMpi;              /* message buffer */
>>>>     int lBufferSize;                /* buffer size */
>>>> 
>>>>     /* 2) For the threads */
>>>>     int NiveauThreadVoulu, NiveauThreadObtenu;
>>>> 
>>>>     lCommunicateur    = (MPI_Comm)-1;
>>>>     NiveauThreadVoulu = MPI_THREAD_MULTIPLE;
>>>>     int erreur = MPI_Init_thread( &argc, &argv, NiveauThreadVoulu,
>>>>                                   &NiveauThreadObtenu );
>>>> 
>>>>     if (erreur != 0) {
>>>>         printf("erreur\n");
>>>>         return -1;
>>>>     }
>>>> 
>>>>     /* 3) Attach a buffer for the message */
>>>>     lBufferSize = 10000 * sizeof(long);
>>>>     lpBufferMpi = calloc( 10000, sizeof(long) );
>>>>     erreur = MPI_Buffer_attach( (void*)lpBufferMpi, lBufferSize );
>>>> 
>>>>     if (erreur != 0) {
>>>>         printf("erreur\n");
>>>>         free( lpBufferMpi );
>>>>         return -1;
>>>>     }
>>>> 
>>>>     printf( "Exe : Lance \n" );
>>>>     MPI_Comm_get_parent( &CommParent );
>>>>     MPI_Intercomm_merge( CommParent, 1, &lCommunicateur );
>>>>     MPI_Comm_rank( lCommunicateur, &lRank );
>>>>     MPI_Comm_size( lCommunicateur, &lTailleCommunicateur );
>>>>     lRangMain = 1 - lRank;
>>>>     printf( "Exe: lRankExe  = %d   lRankMain  = %d\n", lRank, lRangMain );
>>>> 
>>>>     sleep(1);
>>>>     MPI_Buffer_detach( (void*)lpBufferMpi, &lBufferSize );
>>>>     MPI_Comm_free( &lCommunicateur );
>>>>     MPI_Finalize( );
>>>>     free( lpBufferMpi );
>>>>     printf( "Exe: Fin.\n\n\n" );
>>>>     return 0;
>>>> }
>>>> 
>>>> /*********************************************************************/
>>>> Result:
>>>> main*******************************
>>>> main : Lancement MPI*
>>>> 1 main***MPI_Comm_spawn return : 0
>>>> Exe : Lance
>>>> 1 main***Rang main : 0   Rang exe : 1
>>>> Exe: lRankExe  = 1   lRankMain  = 0
>>>> Exe: Fin.
>>>> 
>>>> 
>>>> 2 main***MPI_Comm_spawn return : 0
>>>> Exe : Lance
>>>> 2 main***Rang main : 0   Rang exe : 1
>>>> Exe: lRankExe  = 1   lRankMain  = 0
>>>> Exe: Fin.
>>>> 
>>>> 
>>>> 3 main***MPI_Comm_spawn return : 0
>>>> Exe : Lance
>>>> 3 main***Rang main : 0   Rang exe : 1
>>>> Exe: lRankExe  = 1   lRankMain  = 0
>>>> Exe: Fin.
>>>> 
>>>> ....
>>>> 
>>>> 30 main***MPI_Comm_spawn return : 0
>>>> Exe : Lance
>>>> 30 main***Rang main : 0   Rang exe : 1
>>>> Exe: lRankExe  = 1   lRankMain  = 0
>>>> Exe: Fin.
>>>> 
>>>> 
>>>> 31 main***MPI_Comm_spawn return : 0
>>>> Exe : Lance
>>>> 31 main***Rang main : 0   Rang exe : 1
>>>> Exe: lRankExe  = 1   lRankMain  = 0
>>>> Erreur de segmentation
>>>> 
>>>> 
>>>> 