[OMPI users] Binary distribution of my program possible using OpenMPI+Torque?

2014-09-26 Thread Amos Anderson
Hello OpenMPI User List --

I'm looking into making a binary distribution of my software. On my system, I 
link it against OpenMPI compiled with `--with-tm=/opt/torque/` to get Torque 
integration, as suggested [here][1]. How can I accomplish the same in a binary 
distribution of my program intended for an arbitrary user's environment? Some 
options I'm considering:

 - I can distribute my binary files as compiled on my system, but I worry I'll 
get link errors if the version of OpenMPI my code was linked against at compile 
time doesn't match the OpenMPI on the user's system (see the sketch after this 
list).
 - I can bundle OpenMPI with my software, compiling the bundled OpenMPI 
**with** `--with-tm` for Torque support. This way, my program will link just 
fine against the bundled OpenMPI on the user's system; but will it work with 
their Torque environment (if they use Torque)? It looks like the tm integration 
may only depend on the file `include/tm.h`, which seems stable between versions 
of Torque. But if it were that straightforward, I'd have thought the OpenMPI 
developers would have incorporated it into OpenMPI itself without my help.
 - I can bundle OpenMPI with my software, compiling the bundled OpenMPI 
**without** Torque support. I don't think there would be a problem running on 
the user's system, but would they be able to get Torque support with a 
`--hostfile` parameter to `mpirun`, as described [here][2]? That link doesn't 
confirm that `--hostfile` alone works.
 - I haven't researched this at all, but is it reasonable to bundle Torque with 
my program? I'd think not...
 - I could distribute the source of my code so users can integrate whatever 
compatibility they desire. I could also compile without MPI support at all 
which would disable some of my features.
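
For the bundling options, here is the sanity check I have in mind on a user's 
machine (a rough sketch; `myapp.exe` stands in for my real binary):

    # which MPI libraries does the binary actually resolve at run time?
    ldd myapp.exe | grep -E 'libmpi|libopen-'

    # which OpenMPI does the target system itself provide?
    mpirun --version

    # if I bundle OpenMPI, I'd point the binary at the bundled tree at
    # link time via rpath (the relative path is an assumption about my
    # eventual layout):
    #   -Wl,-rpath,'$ORIGIN/../tools/openmpi/lib'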

Maybe some of the link errors I'm worried about are unlikely with recent 
versions of these dependencies? Either way, I'd rather not discover which 
systems my binary distribution works on only *after* releasing it.

Thanks for any suggestions!

Amos.


  [1]: http://www.open-mpi.org/faq/?category=building#build-rte-tm
  [2]: https://www.open-mpi.org/faq/?category=tm#tm-use-hostfile

Re: [OMPI users] Application hangs in 1.8.1 related to collective operations

2014-09-26 Thread Howard Pritchard
Hello Ed,

Could you post the output of ompi_info? It would also help to know which 
variant of the collective ops you're using. If you could post the output when 
you run with

mpirun --mca coll_base_verbose 10 <other mpirun args you've been using>

that would be great.

It would also help to know the sizes (number of elements) involved in the 
reduce and allreduce operations.
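
Concretely, something like this should capture everything (a sketch; -np 208 
and ./your_app are just placeholders for however you normally launch):

ompi_info > ompi_info.out
ompi_info --param coll all >> ompi_info.out
mpirun --mca coll_base_verbose 10 -np 208 ./your_app 2>&1 | tee coll_verbose.log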

Thanks,

Howard


2014-09-25 3:34 GMT-06:00 Blosch, Edwin L :

>  I had an application suddenly stop making progress.  By killing the last
> process out of 208 processes, then looking at the stack trace, I found 3 of
> 208 processes were in an MPI_REDUCE call.  The other 205 had progressed in
> their execution to another routine, where they were waiting in an unrelated
> MPI_ALLREDUCE call.
>
>
>
> The code structure is such that each process calls MPI_REDUCE 5 times
> for different variables, then some work is done, then the MPI_ALLREDUCE
> call happens early in the next iteration of the solution procedure.  I
> thought it was also noteworthy that the 3 processes stuck at MPI_REDUCE,
> were actually stuck on the 4th of 5 MPI_REDUCE calls, not the 5th call.
>
>
>
> No issues with MVAPICH.  Problem easily solved by adding MPI_BARRIER after
> the section of MPI_REDUCE calls.
>
>
>
> It seems like MPI_REDUCE has some kind of non-blocking implementation, and
> it was not safe to enter the MPI_ALLREDUCE while those MPI_REDUCE calls had
> not yet completed for other processes.
>
>
>
> This was in OpenMPI 1.8.1.  Same problem seen on 3 slightly different
> systems, all QDR Infiniband, Mellanox HCAs, using a Mellanox OFED stack
> (slightly different versions on each cluster).  Intel compilers, again
> slightly different versions on each of the 3 systems.
>
>
>
> Has anyone encountered anything similar?  While I have a workaround, I
> want to make sure the root cause of the deadlock gets fixed.  Please let me
> know what I can do to help.
>
>
>
> Thanks,
>
>
>
> Ed
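
P.S. One way to collect the kind of per-rank stack traces Ed describes is to 
attach to a hung rank non-interactively (a rough sketch; assumes gdb is 
installed on the compute node and <pid> is the stuck process id):

gdb -batch -p <pid> -ex 'thread apply all bt'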


[OMPI users] OpenMPI 1.8.2 segfaults while 1.6.5 works?

2014-09-26 Thread Amos Anderson
Hello all --

I'm trying to get a working configuration for my application: OpenMPI 1.6.5 
works, while OpenMPI 1.8.2 segfaults.


Here's how I compile OpenMPI:

OPENMPI = openmpi-1.8.2
FLAGS = --enable-static
cd $(OPENMPI) ; ./configure $(FLAGS) --with-tm=/opt/torque-2.5.9/ --prefix=$(CURDIR)
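
As a sanity check after installing, I confirm the Torque integration actually 
got built; if tm support is in, I'd expect the tm components (e.g. the plm and 
ras plugins) to show up in something like:

./bin/ompi_info | grep tm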


I'm able to compile OpenMPI successfully, and I use a bjam directive like this 
to compile my program (which uses Boost.Python from boost_1_55_0):

using mpi : ../tools/openmpi/bin/mpic++ ;

and I run my program from a Torque PBS script like this:

# build a hostfile listing each unique node with its slot count
/bin/rm -rf jobname.nodes
for i in `cat ${PBS_NODEFILE} | sort -u`
do
 echo $i slots \= `grep $i ${PBS_NODEFILE} | wc -l` >> jobname.nodes
done
/home/user/myapp/tools/openmpi/bin/mpirun -np 2 -hostfile jobname.nodes \
    /home/user/myapp/myapp.exe



The program itself compiles just fine, but when I run it I get the segfault 
printed below. When I switch to:

OPENMPI = openmpi-1.6.5

then everything works as expected. (As a side question: do I need both 
-hostfile and --with-tm? I asked this earlier today on this list.) That is, I 
believe I'm using the exact same setup in both cases, and 1.6.5 works while 
1.8.2 fails. Any suggestions as to what I might be doing wrong?
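
Regarding that side question: my reading of the FAQ is that with a tm-enabled 
build, an mpirun launched inside a PBS job picks up the node allocation from 
Torque directly, so the hostfile step above may be redundant. Something like 
this sketch (untested) should then be equivalent inside the job script:

/home/user/myapp/tools/openmpi/bin/mpirun -np 2 /home/user/myapp/myapp.exe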

I suppose that since I have a working setup I could stop here, even if it's 
with an older version... but this could be evidence of something I'll have to 
confront eventually.

Thanks for any advice!
Amos.




[local:27921] *** Process received signal ***
[local:27921] Signal: Segmentation fault (11)
[local:27921] Signal code: Address not mapped (1)
[local:27921] Failing at address: 0x40
[local:27921] [ 0] /lib64/libpthread.so.0[0x322180e4c0]
[local:27921] [ 1] /lib64/libc.so.6(strlen+0x30)[0x3220c78d80]
[local:27921] [ 2] /home/user/myapp/tools/openmpi/lib/libopen-pal.so.6(opal_argv_join+0x95)[0x2b87f5c4e175]
[local:27921] [ 3] /home/user/myapp/tools/openmpi/lib/libmpi.so.1(ompi_mpi_init+0x82d)[0x2b87f3c9ec0d]
[local:27921] [ 4] /home/user/myapp/tools/openmpi/lib/libmpi.so.1(MPI_Init+0xf0)[0x2b87f3cbc310]
[local:27921] [ 5] /home/user/myapp/lib/libboost_mpi.so.1.55.0(_ZN5boost3mpi11environmentC1ERiRPPcb+0x36)[0x2b87f3795826]
[local:27921] [ 6] /home/user/myapp/lib/mpi.so(_ZN5boost3mpi6python8mpi_initENS_6python4listEb+0x314)[0x2b87f30bc7b4]
[local:27921] [ 7] /home/user/myapp/lib/mpi.so(_ZN5boost3mpi6python18export_environmentEv+0xcc6)[0x2b87f30bd5f6]
[local:27921] [ 8] /home/user/myapp/lib/mpi.so(_ZN5boost3mpi6python15init_module_mpiEv+0x547)[0x2b87f30d4967]
[local:27921] [ 9] /home/user/myapp/lib/libboost_python.so.1.55.0(_ZN5boost6python21handle_exception_implENS_9function0IvEE+0x530)[0x2b87f3558430]
[local:27921] [10] /home/user/myapp/lib/libboost_python.so.1.55.0(_ZN5boost6python16handle_exceptionIPFvvEEEbT_+0x38)[0x2b87f3559798]
[local:27921] [11] /home/user/myapp/lib/libboost_python.so.1.55.0(_ZN5boost6python6detail11init_moduleEPKcPFvvE+0x63)[0x2b87f3559463]
[local:27921] [12] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0(_PyImport_LoadDynamicModule+0xc2)[0x2b87e8c79282]
[local:27921] [13] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0[0x2b87e8c771a9]
[local:27921] [14] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0[0x2b87e8c776c1]
[local:27921] [15] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0(PyImport_ImportModuleLevel+0x1b7)[0x2b87e8c77977]
[local:27921] [16] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0[0x2b87e8c57bcd]
[local:27921] [17] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0(PyObject_Call+0x68)[0x2b87e8bb7ae8]
[local:27921] [18] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x56)[0x2b87e8c58216]
[local:27921] [19] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x381c)[0x2b87e8c5c79c]
[local:27921] [20] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x8c9)[0x2b87e8c60c89]
[local:27921] [21] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x32)[0x2b87e8c60d02]
[local:27921] [22] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0(PyImport_ExecCodeModuleEx+0xc2)[0x2b87e8c74432]
[local:27921] [23] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0[0x2b87e8c769f0]
[local:27921] [24] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0[0x2b87e8c771a9]
[local:27921] [25] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0[0x2b87e8c77642]
[local:27921] [26] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0(PyImport_ImportModuleLevel+0x1b7)[0x2b87e8c77977]
[local:27921] [27] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0[0x2b87e8c57bcd]
[local:27921] [28] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0(PyObject_Call+0x68)[0x2b87e8bb7ae8]
[local:27921] [29] /home/user/myapp/tools/python/lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x56)[0x2b87e8c58216]
[local:27921] *** End of error message ***