Hmmm...well, nothing definitive there, I'm afraid. All I can suggest is to remove/reduce the threading. Like I said, we aren't terribly thread safe at this time. I suspect you're stepping into one of those non-safe areas here.
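If it helps as a first diagnostic step, the crudest way to "reduce the threading" without restructuring your code is to serialize every MPI call behind a single lock, so the library never sees concurrent entry from your send / probe-recv / spawn threads. A minimal sketch (untested, pthreads assumed, and the wrapper names are made up for the example):

/* Sketch only: funnel all MPI traffic through one lock so the library never
 * sees concurrent calls from the send / probe-recv / spawn threads.
 * "mpi_lock" and the locked_* wrappers are hypothetical names. */
#include <mpi.h>
#include <pthread.h>

static pthread_mutex_t mpi_lock = PTHREAD_MUTEX_INITIALIZER;

/* Wrap every MPI call made from a worker thread in the same lock. */
static int locked_send(void *buf, int count, MPI_Datatype type,
                       int dest, int tag, MPI_Comm comm)
{
    pthread_mutex_lock(&mpi_lock);
    int rc = MPI_Send(buf, count, type, dest, tag, comm);
    pthread_mutex_unlock(&mpi_lock);
    return rc;
}

static int locked_iprobe(int src, int tag, MPI_Comm comm,
                         int *flag, MPI_Status *status)
{
    pthread_mutex_lock(&mpi_lock);
    int rc = MPI_Iprobe(src, tag, comm, flag, status);
    pthread_mutex_unlock(&mpi_lock);
    return rc;
}

The obvious catch is that a blocking MPI_Recv held under that lock will stall the other two threads, so treat this as a diagnostic (does the crash disappear once MPI calls are serialized?) rather than a fix.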
Hopefully we will do better in later releases.

On Sep 6, 2011, at 1:20 PM, Simone Pellegrini wrote:

> On 09/06/2011 04:58 PM, Ralph Castain wrote:
>> On Sep 6, 2011, at 12:49 PM, Simone Pellegrini wrote:
>>
>>> On 09/06/2011 02:57 PM, Ralph Castain wrote:
>>>> Hi Simone
>>>>
>>>> Just to clarify: is your application threaded? Could you please send the
>>>> OMPI configure cmd you used?
>>> Yes, it is threaded. There are basically 3 threads: one for outgoing
>>> messages (MPI_Send), one for incoming messages (MPI_Iprobe / MPI_Recv),
>>> and one for spawning.
>>>
>>> I am not sure what you mean by the OMPI configure cmd I used... I simply do
>>> mpirun --np 1 ./executable
>> How was OMPI configured when it was installed? If you didn't install it,
>> then provide the output of ompi_info - it will tell us.
> [@arch-moto tasksys]$ ompi_info
> Package: Open MPI nobody@alderaan Distribution
> Open MPI: 1.5.3
> Open MPI SVN revision: r24532
> Open MPI release date: Mar 16, 2011
> Open RTE: 1.5.3
> Open RTE SVN revision: r24532
> Open RTE release date: Mar 16, 2011
> OPAL: 1.5.3
> OPAL SVN revision: r24532
> OPAL release date: Mar 16, 2011
> Ident string: 1.5.3
> Prefix: /usr
> Configured architecture: x86_64-unknown-linux-gnu
> Configure host: alderaan
> Configured by: nobody
> Configured on: Thu Jul 7 13:21:35 UTC 2011
> Configure host: alderaan
> Built by: nobody
> Built on: Thu Jul 7 13:27:08 UTC 2011
> Built host: alderaan
> C bindings: yes
> C++ bindings: yes
> Fortran77 bindings: yes (all)
> Fortran90 bindings: yes
> Fortran90 bindings size: small
> C compiler: gcc
> C compiler absolute: /usr/bin/gcc
> C compiler family name: GNU
> C compiler version: 4.6.1
> C++ compiler: g++
> C++ compiler absolute: /usr/bin/g++
> Fortran77 compiler: gfortran
> Fortran77 compiler abs: /usr/bin/gfortran
> Fortran90 compiler: /usr/bin/gfortran
> Fortran90 compiler abs:
> C profiling: yes
> C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: yes
> C++ exceptions: no
> Thread support: posix (mpi: yes, progress: no)
> Sparse Groups: no
> Internal debug support: yes
> MPI interface warnings: no
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: yes
> Heterogeneous support: no
> mpirun default --prefix: no
> MPI I/O support: yes
> MPI_WTIME support: gettimeofday
> Symbol vis. support: yes
> MPI extensions: affinity example
> FT Checkpoint support: no (checkpoint thread: no)
> MPI_MAX_PROCESSOR_NAME: 256
> MPI_MAX_ERROR_STRING: 256
> MPI_MAX_OBJECT_NAME: 64
> MPI_MAX_INFO_KEY: 36
> MPI_MAX_INFO_VAL: 256
> MPI_MAX_PORT_NAME: 1024
> MPI_MAX_DATAREP_STRING: 128
> MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.5.3)
> MCA memchecker: valgrind (MCA v2.0, API v2.0, Component v1.5.3)
> MCA memory: linux (MCA v2.0, API v2.0, Component v1.5.3)
> MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.5.3)
> MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.5.3)
> MCA carto: file (MCA v2.0, API v2.0, Component v1.5.3)
> MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.5.3)
> MCA timer: linux (MCA v2.0, API v2.0, Component v1.5.3)
> MCA installdirs: env (MCA v2.0, API v2.0, Component v1.5.3)
> MCA installdirs: config (MCA v2.0, API v2.0, Component v1.5.3)
> MCA dpm: orte (MCA v2.0, API v2.0, Component v1.5.3)
> MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.5.3)
> MCA allocator: basic (MCA v2.0, API v2.0, Component v1.5.3)
> MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.5.3)
> MCA coll: basic (MCA v2.0, API v2.0, Component v1.5.3)
> MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.5.3)
> MCA coll: inter (MCA v2.0, API v2.0, Component v1.5.3)
> MCA coll: self (MCA v2.0, API v2.0, Component v1.5.3)
> MCA coll: sm (MCA v2.0, API v2.0, Component v1.5.3)
> MCA coll: sync (MCA v2.0, API v2.0, Component v1.5.3)
> MCA coll: tuned (MCA v2.0, API v2.0, Component v1.5.3)
> MCA io: romio (MCA v2.0, API v2.0, Component v1.5.3)
> MCA mpool: fake (MCA v2.0, API v2.0, Component v1.5.3)
> MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.5.3)
> MCA mpool: sm (MCA v2.0, API v2.0, Component v1.5.3)
> MCA pml: bfo (MCA v2.0, API v2.0, Component v1.5.3)
> MCA pml: csum (MCA v2.0, API v2.0, Component v1.5.3)
> MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.5.3)
> MCA pml: v (MCA v2.0, API v2.0, Component v1.5.3)
> MCA bml: r2 (MCA v2.0, API v2.0, Component v1.5.3)
> MCA rcache: vma (MCA v2.0, API v2.0, Component v1.5.3)
> MCA btl: self (MCA v2.0, API v2.0, Component v1.5.3)
> MCA btl: sm (MCA v2.0, API v2.0, Component v1.5.3)
> MCA btl: tcp (MCA v2.0, API v2.0, Component v1.5.3)
> MCA topo: unity (MCA v2.0, API v2.0, Component v1.5.3)
> MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.5.3)
> MCA osc: rdma (MCA v2.0, API v2.0, Component v1.5.3)
> MCA iof: hnp (MCA v2.0, API v2.0, Component v1.5.3)
> MCA iof: orted (MCA v2.0, API v2.0, Component v1.5.3)
> MCA iof: tool (MCA v2.0, API v2.0, Component v1.5.3)
> MCA oob: tcp (MCA v2.0, API v2.0, Component v1.5.3)
> MCA odls: default (MCA v2.0, API v2.0, Component v1.5.3)
> MCA ras: cm (MCA v2.0, API v2.0, Component v1.5.3)
> MCA rmaps: load_balance (MCA v2.0, API v2.0, Component v1.5.3)
> MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.5.3)
> MCA rmaps: resilient (MCA v2.0, API v2.0, Component v1.5.3)
> MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.5.3)
> MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.5.3)
> MCA rmaps: topo (MCA v2.0, API v2.0, Component v1.5.3)
> MCA rml: oob (MCA v2.0, API v2.0, Component v1.5.3)
> MCA routed: binomial (MCA v2.0, API v2.0, Component v1.5.3)
> MCA routed: cm (MCA v2.0, API v2.0, Component v1.5.3)
> MCA routed: direct (MCA v2.0, API v2.0, Component v1.5.3)
> MCA routed: linear (MCA v2.0, API v2.0, Component v1.5.3)
> MCA routed: radix (MCA v2.0, API v2.0, Component v1.5.3)
> MCA routed: slave (MCA v2.0, API v2.0, Component v1.5.3)
> MCA plm: rsh (MCA v2.0, API v2.0, Component v1.5.3)
> MCA plm: rshd (MCA v2.0, API v2.0, Component v1.5.3)
> MCA filem: rsh (MCA v2.0, API v2.0, Component v1.5.3)
> MCA errmgr: default (MCA v2.0, API v2.0, Component v1.5.3)
> MCA ess: env (MCA v2.0, API v2.0, Component v1.5.3)
> MCA ess: hnp (MCA v2.0, API v2.0, Component v1.5.3)
> MCA ess: singleton (MCA v2.0, API v2.0, Component v1.5.3)
> MCA ess: slave (MCA v2.0, API v2.0, Component v1.5.3)
> MCA ess: tool (MCA v2.0, API v2.0, Component v1.5.3)
> MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.5.3)
> MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.5.3)
> MCA grpcomm: hier (MCA v2.0, API v2.0, Component v1.5.3)
> MCA notifier: command (MCA v2.0, API v1.0, Component v1.5.3)
> MCA notifier: syslog (MCA v2.0, API v1.0, Component v1.5.3)
>
>
>>
>>>> Adding the debug flags just changes the race condition. Interestingly,
>>>> those values only impact the behavior of mpirun, so it looks like the race
>>>> condition is occurring there.
>>> The problem is that the error is totally nondeterministic. Sometimes it
>>> happens, sometimes not, but the error message gives me no clue where the
>>> error is coming from. Is it a problem of my code or something internal to MPI?
>> Can't tell, but it is likely an impact of threading. Race conditions within
>> threaded environments are common, and OMPI isn't particularly thread safe,
>> especially when it comes to comm_spawn.
>>
>>> cheers, Simone
>>>>
>>>> On Sep 6, 2011, at 3:01 AM, Simone Pellegrini wrote:
>>>>
>>>>> Dear all,
>>>>> I am developing an MPI application which heavily uses MPI_Comm_spawn.
>>>>> Usually everything works fine for the first hundred spawns, but after a
>>>>> while the application exits with a curious message:
>>>>>
>>>>> [arch-top:27712] [[36904,165],0] ORTE_ERROR_LOG: Data unpack would read
>>>>> past end of buffer in file base/grpcomm_base_modex.c at line 349
>>>>> [arch-top:27712] [[36904,165],0] ORTE_ERROR_LOG: Data unpack would read
>>>>> past end of buffer in file grpcomm_bad_module.c at line 518
>>>>> --------------------------------------------------------------------------
>>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>>> likely to abort. There are many reasons that a parallel process can
>>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>>> problems. This failure appears to be an internal failure; here's some
>>>>> additional information (which may only be relevant to an Open MPI
>>>>> developer):
>>>>>
>>>>> ompi_proc_set_arch failed
>>>>> --> Returned "Data unpack would read past end of buffer" (-26) instead
>>>>> of "Success" (0)
>>>>> --------------------------------------------------------------------------
>>>>> *** The MPI_Init_thread() function was called before MPI_INIT was invoked.
>>>>> *** This is disallowed by the MPI standard.
>>>>> *** Your MPI job will now abort.
>>>>> [arch-top:27712] Abort before MPI_INIT completed successfully; not able
>>>>> to guarantee that all other processes were killed!
>>>>> [arch-top:27714] [[36904,165],0] ORTE_ERROR_LOG: Data unpack would read
>>>>> past end of buffer in file base/grpcomm_base_modex.c at line 349
>>>>> [arch-top:27714] [[36904,165],0] ORTE_ERROR_LOG: Data unpack would read
>>>>> past end of buffer in file grpcomm_bad_module.c at line 518
>>>>> *** The MPI_Init_thread() function was called before MPI_INIT was invoked.
>>>>> *** This is disallowed by the MPI standard.
>>>>> *** Your MPI job will now abort.
>>>>> [arch-top:27714] Abort before MPI_INIT completed successfully; not able
>>>>> to guarantee that all other processes were killed!
>>>>> [arch-top:27226] 1 more process has sent help message help-mpi-runtime /
>>>>> mpi_init:startup:internal-failure
>>>>> [arch-top:27226] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>>>> all help / error messages
>>>>>
>>>>> Also, using MPI_Init instead of MPI_Init_thread does not help; the same
>>>>> error occurs.
>>>>>
>>>>> Strangely, the error does not occur if I run the code with debugging
>>>>> enabled (-mca plm_base_verbose 5 -mca rmaps_base_verbose 5).
>>>>>
>>>>> I am using Open MPI 1.5.3.
>>>>>
>>>>> cheers, Simone
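One more thought on the three-thread setup described above: before starting the sender/prober/spawner threads, it is worth verifying the thread level Open MPI actually grants. A minimal sketch only (the thread bodies are placeholders, not your code):

/* Sketch: request full thread support up front and refuse to start the
 * worker threads unless MPI really grants MPI_THREAD_MULTIPLE.
 * sender/prober/spawner are hypothetical stand-ins for the threads above. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *sender(void *arg)  { /* MPI_Send loop would go here       */ return NULL; }
static void *prober(void *arg)  { /* MPI_Iprobe / MPI_Recv loop here   */ return NULL; }
static void *spawner(void *arg) { /* MPI_Comm_spawn loop would go here */ return NULL; }

int main(int argc, char **argv)
{
    int provided = MPI_THREAD_SINGLE;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI granted thread level %d; "
                        "concurrent MPI calls are not safe here\n", provided);
        MPI_Finalize();
        return EXIT_FAILURE;
    }

    pthread_t t[3];
    pthread_create(&t[0], NULL, sender,  NULL);
    pthread_create(&t[1], NULL, prober,  NULL);
    pthread_create(&t[2], NULL, spawner, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);

    MPI_Finalize();
    return 0;
}

Even when MPI_THREAD_MULTIPLE is granted, the 1.5 series is shaky around comm_spawn under concurrency, so this check narrows the problem down rather than guaranteeing a fix.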