Ralph Castain wrote:
> Afraid I am somewhat at a loss. The logs indicate that mpirun itself
> is having problems, likely caused by the threading. Only thing I can
> suggest is that you "unthread" the spawning loop and try it that way
> first so we can see if some underlying problem exists.
>
> FWIW: I have run a loop over calls to comm_spawn without problems.
> However, there are system limits to the number of child processes an
> orted can create. You may hit those at some point - we try to report
> that as a separate error when we see it, but it isn't always easy to
> catch.
How does it work? Does it spawn and disconnect the slave in a loop? I
guess you don't perform any multithreaded MPI_Send()/Recv(), or did you?
>
> Like I said, we really don't support threaded operations like this
> right now, so I have no idea what your app may be triggering. I would
> definitely try it "unthreaded" if possible.
My approach uses one thread for each node allocated to the slaves, in
order to overlap the communication and let it progress concurrently,
depending on how quickly each slave converges on its solution. When a
slave terminates and returns its results, I assign another job to that
node, and so on until my work queue is complete.
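
The thread-per-node work queue described above can be sketched roughly
as follows. This is only an illustration in Python, not the actual
application: the names run_job_on_node, worker and run_workqueue are
hypothetical, and the MPI step (spawn the slave, exchange data,
disconnect) is replaced by a placeholder computation.

```python
import queue
import threading


def run_job_on_node(node, job):
    # Hypothetical placeholder. In the real application this is where the
    # slave would be spawned on `node`, given its input, and its result
    # collected before disconnecting.
    return (node, job * job)


def worker(node, jobs, results, lock):
    # One worker thread per node: keep taking jobs until the queue is empty.
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        result = run_job_on_node(node, job)
        with lock:
            results.append(result)


def run_workqueue(nodes, job_list):
    # Fill the shared queue, then start one thread per allocated node.
    jobs = queue.Queue()
    for job in job_list:
        jobs.put(job)
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(node, jobs, results, lock))
        for node in nodes
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A node whose slave finishes early simply picks up the next job, so the
faster nodes end up processing more of the queue.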
>
> Ralph
>