I believe I know what is happening here. My availability in the next week is pretty limited due to a family emergency, but I'll take a look when I get back. In brief, this is a resource starvation issue where the system thinks your node is unable to support any further processes and so it blocks.
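In the meantime, one thing that might be worth trying — this is only a sketch on my part, not something I've run against your reproducer — is to release the communicators inside the loop instead of once at the end. As written, your test program never frees the intercommunicator and only frees the merged communicator after the loop exits, so every spawn holds on to its resources for the life of the run:

```c
/* Untested sketch of the spawn loop with per-iteration cleanup.
 * The EXE_TEST path and variable names are taken from the original test program. */
#include <stdio.h>
#include <mpi.h>

#define EXE_TEST "/home/workspace/test_spaw1/src/Exe"

int main(int argc, char **argv)
{
    MPI_Comm lIntercom, lCommunicateur;
    int lErrcode, lRangMain, lIter;

    MPI_Init(&argc, &argv);

    for (lIter = 1; lIter <= 1000; lIter++) {
        MPI_Comm_spawn(EXE_TEST, MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &lIntercom, &lErrcode);
        MPI_Intercomm_merge(lIntercom, 0, &lCommunicateur);
        MPI_Comm_rank(lCommunicateur, &lRangMain);
        printf("%i main***MPI_Comm_spawn return : %d\n", lIter, lErrcode);

        MPI_Comm_free(&lCommunicateur);   /* release the merged intracommunicator */
        MPI_Comm_disconnect(&lIntercom);  /* wait for the child, then drop the link */
    }

    MPI_Finalize();
    return 0;
}
```

Whether or not that changes the crash at spawn 31, freeing as you go keeps the resource footprint flat instead of growing with every iteration.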
On a separate note, I never use threaded configurations due to the lack of any real thread-safety review or testing on Open MPI to date (per Tim's earlier comment). My "standard" configuration for development and testing is with --disable-progress-threads --without-threads.

I'll post something back to the list when I get it resolved.

Thanks
Ralph


On 3/6/07 9:00 AM, "rozzen.vinc...@fr.thalesgroup.com" <rozzen.vinc...@fr.thalesgroup.com> wrote:

> Hi Tim, I'm getting back to you.
>
> "What kind of system is it?"
> => The system is Debian Sarge.
>
> "How many nodes are you running on?"
> => There is no cluster configured, so I guess I am working without a node environment.
>
> "Have you been able to try a more recent version of Open MPI?"
> => Today I tried version 1.1.4, but the results are no better.
>
> I tested two cases:
>
> Test 1: with the same configuration options (./configure --enable-mpi-threads --enable-progress-threads --with-threads=posix --enable-smp-locks).
> The program stopped in MPI_Init_thread, in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0.
>
> Test 2: with the default configuration options (./configure --prefix=/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread).
> The program stopped during "node allocation" after spawn number 31.
> Maybe the problem comes from the lack of node definition?
>
> Thanks for your help.
>
> Below are the log files for the two tests.
>
> /******************************TEST 1*******************************/
> GNU gdb 6.3-debian
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as "i386-linux"...Using host libthread_db library
> "/lib/tls/libthread_db.so.1".
>
> (gdb) run
> Starting program: /home/workspace/test_spaw1/src/spawn
> [Thread debugging using libthread_db enabled]
> [New Thread 1076646560 (LWP 5178)]
> main*******************************
> main : Lancement MPI*
> [New Thread 1085225904 (LWP 5181)]
> [New Thread 1094495152 (LWP 5182)]
>
> Program received signal SIGINT, Interrupt.
> [Switching to Thread 1076646560 (LWP 5178)]
> 0x4018a436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> (gdb) where
> #0  0x4018a436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> #1  0x40187893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
> #2  0xbffff508 in ?? ()
> #3  0x4000bcd0 in _dl_map_object_deps () from /lib/ld-linux.so.2
> #4  0x40b9f8cb in mca_btl_tcp_component_create_listen () from /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_btl_tcp.so
> #5  0x40b9f8cb in mca_btl_tcp_component_create_listen () from /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_btl_tcp.so
> #6  0x40b9eef4 in mca_btl_tcp_component_init () from /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_btl_tcp.so
> #7  0x4008c652 in mca_btl_base_select () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #8  0x40b8dd28 in mca_bml_r2_component_init () from /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_bml_r2.so
> #9  0x4008bf54 in mca_bml_base_init () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #10 0x40b7e5c9 in mca_pml_ob1_component_init () from /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_pml_ob1.so
> #11 0x40094192 in mca_pml_base_select () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #12 0x4005742c in ompi_mpi_init () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #13 0x4007c182 in PMPI_Init_thread () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #14 0x080489f3 in main (argc=1, argv=0xbffff8a4) at spawn6.c:33
>
>
> /******************************TEST 2*******************************/
>
> GNU gdb 6.3-debian
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as "i386-linux"...Using host libthread_db library
> "/lib/tls/libthread_db.so.1".
>
> (gdb) run -np 1 --host myhost spawn6
> Starting program: /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/bin/mpirun -np 1 --host myhost spawn6
> [Thread debugging using libthread_db enabled]
> [New Thread 1076121728 (LWP 4022)]
> main*******************************
> main : Lancement MPI*
> Exe : Lance
> Exe: lRankExe = 1 lRankMain = 0
> 1 main***MPI_Comm_spawn return : 0
> 1 main***Rang main : 0 Rang exe : 1
> Exe : Lance
> Exe: Fin.
>
>
> Exe: lRankExe = 1 lRankMain = 0
> 2 main***MPI_Comm_spawn return : 0
> 2 main***Rang main : 0 Rang exe : 1
> Exe : Lance
> Exe: Fin.
>
> ...
>
> Exe: lRankExe = 1 lRankMain = 0
> 30 main***MPI_Comm_spawn return : 0
> 30 main***Rang main : 0 Rang exe : 1
> Exe : Lance
> Exe: Fin.
>
> Exe: lRankExe = 1 lRankMain = 0
> 31 main***MPI_Comm_spawn return : 0
> 31 main***Rang main : 0 Rang exe : 1
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1076121728 (LWP 4022)]
> 0x4018833b in strlen () from /lib/tls/libc.so.6
> (gdb) where
> #0  0x4018833b in strlen () from /lib/tls/libc.so.6
> #1  0x40297c5e in orte_gpr_replica_create_itag () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
> #2  0x4029d2df in orte_gpr_replica_put_fn () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
> #3  0x40297281 in orte_gpr_replica_put () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
> #4  0x40048287 in orte_ras_base_node_assign () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #5  0x400463e1 in orte_ras_base_allocate_nodes () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #6  0x402c2bb8 in orte_ras_hostfile_allocate () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_ras_hostfile.so
> #7  0x400464e0 in orte_ras_base_allocate () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #8  0x402b063f in orte_rmgr_urm_allocate () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_rmgr_urm.so
> #9  0x4004f277 in orte_rmgr_base_cmd_dispatch () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #10 0x402b10ae in orte_rmgr_urm_recv () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_rmgr_urm.so
> #11 0x4004301e in mca_oob_recv_callback () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #12 0x4027a748 in mca_oob_tcp_msg_data () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_oob_tcp.so
> #13 0x4027bb12 in mca_oob_tcp_peer_recv_handler () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_oob_tcp.so
> #14 0x400703f9 in opal_event_loop () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/libopal.so.0
> #15 0x4006adfa in opal_progress () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/libopal.so.0
> #16 0x0804c7a1 in opal_condition_wait (c=0x804fbcc, m=0x804fba8) at condition.h:81
> #17 0x0804a4c8 in orterun (argc=6, argv=0xbffff854) at orterun.c:427
> #18 0x08049dd6 in main (argc=6, argv=0xbffff854) at main.c:13
> (gdb)
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Tim Prins
> Sent: Monday, 5 March 2007, 22:34
> To: Open MPI Users
> Subject: Re: [OMPI users] MPI_Comm_Spawn
>
>
> Never mind, I was just able to replicate it. I'll look into it.
>
> Tim
>
> On Mar 5, 2007, at 4:26 PM, Tim Prins wrote:
>
>> That is possible. Threading support is VERY lightly tested, but I
>> doubt it is the problem since it always fails after 31 spawns.
>>
>> Again, I have tried with these configure options and the same version
>> of Open MPI and still have not been able to replicate this (after
>> letting it spawn over 500 times). Have you been able to try a more
>> recent version of Open MPI? What kind of system is it? How many nodes
>> are you running on?
>>
>> Tim
>>
>> On Mar 5, 2007, at 1:21 PM, rozzen.vinc...@fr.thalesgroup.com wrote:
>>
>>>
>>> Maybe the problem comes from the configuration options.
>>> The configuration options used are:
>>> ./configure --enable-mpi-threads --enable-progress-threads --with-threads=posix --enable-smp-locks
>>> Could you give me your point of view about that please?
>>> Thanks
>>>
>>> -----Original Message-----
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph H Castain
>>> Sent: Tuesday, 27 February 2007, 16:26
>>> To: Open MPI Users <us...@open-mpi.org>
>>> Subject: Re: [OMPI users] MPI_Comm_Spawn
>>>
>>>
>>> Now that's interesting! There shouldn't be a limit, but to be honest, I've
>>> never tested that mode of operation - let me look into it and see. It sounds
>>> like there is some counter that is overflowing, but I'll look.
>>>
>>> Thanks
>>> Ralph
>>>
>>>
>>> On 2/27/07 8:15 AM, "rozzen.vinc...@fr.thalesgroup.com"
>>> <rozzen.vinc...@fr.thalesgroup.com> wrote:
>>>
>>>> Do you know if there is a limit to the number of MPI_Comm_spawn calls we
>>>> can use to launch a program?
>>>> I want to start and stop a program several times (with the function
>>>> MPI_Comm_spawn), but every time, after 31 MPI_Comm_spawn calls, I get a
>>>> "segmentation fault".
>>>> Could you give me your point of view on how to solve this problem?
>>>> Thanks
>>>>
>>>> /* .c file: spawns the program Exe */
>>>> #include <stdio.h>
>>>> #include <malloc.h>
>>>> #include <unistd.h>
>>>> #include "mpi.h"
>>>> #include <pthread.h>
>>>> #include <signal.h>
>>>> #include <sys/time.h>
>>>> #include <errno.h>
>>>> #define EXE_TEST "/home/workspace/test_spaw1/src/Exe"
>>>>
>>>>
>>>> int main( int argc, char **argv ) {
>>>>
>>>>     long *lpBufferMpi;
>>>>     MPI_Comm lIntercom;
>>>>     int lErrcode;
>>>>     MPI_Comm lCommunicateur;
>>>>     int lRangMain, lRangExe, lMessageEnvoi, lIter, NiveauThreadVoulu,
>>>>         NiveauThreadObtenu, lTailleBuffer;
>>>>     int *lpMessageEnvoi = &lMessageEnvoi;
>>>>     MPI_Status lStatus;              /* receive status */
>>>>
>>>>     lIter = 0;
>>>>
>>>>     /* MPI environment */
>>>>     printf("main*******************************\n");
>>>>     printf("main : Lancement MPI*\n");
>>>>
>>>>     NiveauThreadVoulu = MPI_THREAD_MULTIPLE;
>>>>     MPI_Init_thread( &argc, &argv, NiveauThreadVoulu, &NiveauThreadObtenu );
>>>>     lpBufferMpi = calloc( 10000, sizeof(long) );
>>>>     MPI_Buffer_attach( (void*)lpBufferMpi, 10000 * sizeof(long) );
>>>>
>>>>     while (lIter < 1000) {
>>>>         lIter++;
>>>>         lIntercom = (MPI_Comm)-1;
>>>>
>>>>         MPI_Comm_spawn( EXE_TEST, NULL, 1, MPI_INFO_NULL,
>>>>                         0, MPI_COMM_WORLD, &lIntercom, &lErrcode );
>>>>         printf( "%i main***MPI_Comm_spawn return : %d\n", lIter, lErrcode );
>>>>
>>>>         if (lIntercom == (MPI_Comm)-1) {
>>>>             printf("%i Intercom null\n", lIter);
>>>>             return 0;
>>>>         }
>>>>         MPI_Intercomm_merge( lIntercom, 0, &lCommunicateur );
>>>>         MPI_Comm_rank( lCommunicateur, &lRangMain );
>>>>         lRangExe = 1 - lRangMain;
>>>>
>>>>         printf("%i main***Rang main : %i Rang exe : %i\n",
>>>>                lIter, (int)lRangMain, (int)lRangExe);
>>>>         sleep(2);
>>>>     }
>>>>
>>>>
>>>>     /* Shut down the MPI environment */
>>>>     lTailleBuffer = 10000 * sizeof(long);
>>>>     MPI_Buffer_detach( (void*)lpBufferMpi, &lTailleBuffer );
>>>>     MPI_Comm_free( &lCommunicateur );
>>>>     MPI_Finalize( );
>>>>     free( lpBufferMpi );
>>>>
>>>>     printf( "Main = End .\n" );
>>>>     return 0;
>>>>
>>>> }
>>>> /*****************************************************************************/
>>>> Exe:
>>>> #include <string.h>
>>>> #include <stdlib.h>
>>>> #include <stdio.h>
>>>> #include <malloc.h>
>>>> #include <unistd.h>      /* for sleep() */
>>>> #include <pthread.h>
>>>> #include <semaphore.h>
>>>> #include "mpi.h"
>>>>
>>>> int main( int argc, char **argv ) {
>>>>     /* 1) for MPI communication */
>>>>     MPI_Comm lCommunicateur;        /* this process's communicator */
>>>>     MPI_Comm CommParent;            /* parent communicator to retrieve */
>>>>     int lRank;                      /* this process's rank in the communicator */
>>>>     int lRangMain;                  /* rank of the sequencer when launched in normal mode */
>>>>     int lTailleCommunicateur;       /* communicator size */
>>>>     long *lpBufferMpi;              /* message buffer */
>>>>     int lBufferSize;                /* buffer size */
>>>>
>>>>     /* 2) for the threads */
>>>>     int NiveauThreadVoulu, NiveauThreadObtenu;
>>>>
>>>>
>>>>     lCommunicateur = (MPI_Comm)-1;
>>>>     NiveauThreadVoulu = MPI_THREAD_MULTIPLE;
>>>>     int erreur = MPI_Init_thread( &argc, &argv, NiveauThreadVoulu, &NiveauThreadObtenu );
>>>>
>>>>     if (erreur != 0) {
>>>>         printf("erreur\n");
>>>>         return -1;
>>>>     }
>>>>
>>>>     /* 2) Attach a buffer for the message */
>>>>     lBufferSize = 10000 * sizeof(long);
>>>>     lpBufferMpi = calloc( 10000, sizeof(long) );
>>>>     erreur = MPI_Buffer_attach( (void*)lpBufferMpi, lBufferSize );
>>>>
>>>>     if (erreur != 0) {
>>>>         printf("erreur\n");
>>>>         free( lpBufferMpi );
>>>>         return -1;
>>>>     }
>>>>
>>>>     printf( "Exe : Lance \n" );
>>>>     MPI_Comm_get_parent( &CommParent );
>>>>     MPI_Intercomm_merge( CommParent, 1, &lCommunicateur );
>>>>     MPI_Comm_rank( lCommunicateur, &lRank );
>>>>     MPI_Comm_size( lCommunicateur, &lTailleCommunicateur );
>>>>     lRangMain = 1 - lRank;
>>>>     printf( "Exe: lRankExe = %d lRankMain = %d\n", lRank, lRangMain );
>>>>
>>>>     sleep(1);
>>>>     MPI_Buffer_detach( (void*)lpBufferMpi, &lBufferSize );
>>>>     MPI_Comm_free( &lCommunicateur );
>>>>     MPI_Finalize( );
>>>>     free( lpBufferMpi );
>>>>     printf( "Exe: Fin.\n\n\n" );
>>>>     return 0;
>>>> }
>>>>
>>>>
>>>> /*****************************************************************************/
>>>> result:
>>>> main*******************************
>>>> main : Lancement MPI*
>>>> 1 main***MPI_Comm_spawn return : 0
>>>> Exe : Lance
>>>> 1 main***Rang main : 0 Rang exe : 1
>>>> Exe: lRankExe = 1 lRankMain = 0
>>>> Exe: Fin.
>>>>
>>>>
>>>> 2 main***MPI_Comm_spawn return : 0
>>>> Exe : Lance
>>>> 2 main***Rang main : 0 Rang exe : 1
>>>> Exe: lRankExe = 1 lRankMain = 0
>>>> Exe: Fin.
>>>>
>>>>
>>>> 3 main***MPI_Comm_spawn return : 0
>>>> Exe : Lance
>>>> 3 main***Rang main : 0 Rang exe : 1
>>>> Exe: lRankExe = 1 lRankMain = 0
>>>> Exe: Fin.
>>>>
>>>> ....
>>>>
>>>> 30 main***MPI_Comm_spawn return : 0
>>>> Exe : Lance
>>>> 30 main***Rang main : 0 Rang exe : 1
>>>> Exe: lRankExe = 1 lRankMain = 0
>>>> Exe: Fin.
>>>>
>>>>
>>>> 31 main***MPI_Comm_spawn return : 0
>>>> Exe : Lance
>>>> 31 main***Rang main : 0 Rang exe : 1
>>>> Exe: lRankExe = 1 lRankMain = 0
>>>> Erreur de segmentation
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
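P.S. One more thought for when you rebuild without thread support as suggested above: MPI_Init_thread will then hand back a "provided" level below MPI_THREAD_MULTIPLE, so it's worth checking that value in your test programs so a downgraded thread level doesn't get mixed up with the spawn problem. A minimal check might look like this (a sketch on my part, not taken from your code):

```c
/* Sketch: verify the thread level the library actually granted.
 * With a build configured --without-threads, "provided" will come back
 * lower than MPI_THREAD_MULTIPLE even though the code asks for it. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int requested = MPI_THREAD_MULTIPLE;
    int provided;

    MPI_Init_thread(&argc, &argv, requested, &provided);
    if (provided < requested)
        printf("warning: requested thread level %d, but only %d is provided\n",
               requested, provided);

    MPI_Finalize();
    return 0;
}
```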