The most common cause of this problem is a firewall between the nodes - you can
ssh across, but the MPI processes cannot connect to each other. Have you checked
that the firewall is turned off?
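As a quick check, on a RHEL/CentOS-style node something along these lines is usually enough (the commands below are illustrative for that platform, not taken from this thread):

  sudo iptables -L -n          # list the active firewall rules on each node
  sudo service iptables stop   # temporarily disable the firewall for a test run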
On Jan 17, 2014, at 4:59 PM, Doug Roberts wrote:
>
> 1) When openmpi programs run across multiple nodes they hang
> rather
1) When openmpi programs run across multiple nodes they hang
rather quickly as shown in the mpi_test example below. Note
that I am assuming the initial topology error message is a
separate issue since single node openmpi jobs run just fine.
[roberpj@bro127:~/samples/mpi_test]
/opt/sharcnet/op
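For context, this kind of cross-node hang can usually be reproduced with any program that forces off-node traffic; the sketch below is a generic two-rank ping in C, not the SHARCNET mpi_test sample whose listing is truncated above:

#include <mpi.h>
#include <stdio.h>

/* Generic two-rank ping: rank 0 sends one int to rank 1 and waits for
   the reply.  If a firewall blocks the MPI traffic between nodes, the
   job hangs here even though the processes launched fine over ssh. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, token = 42;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("round trip completed\n");
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}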
Thank you and this makes sense. In fact we've been trying to avoid
serialization as much as possible because we found it to be a bottleneck.
Anyway, I wonder if there are any samples illustrating the use of complex
structures in Open MPI.
Thank you,
Saliya
On Jan 17, 2014 5:20 PM, "Oscar Vega-Gisbe
BAH,
The error persisted when doing the test to /tmp/ (local disk)
I rebuilt the library with the same compiler and all is well now.
Sorry for the false alarm. Thanks for the help and ideas Jeff.
Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(7
MPI.OBJECT is no longer supported because it was based on
serialization, which made the Java bindings more complicated. It
brought more problems than benefits; for example, a shadow
communicator was necessary...
You can define complex struct data using direct buffers, avoiding
serialization.
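The Java direct-buffer samples themselves are not reproduced in this digest. For readers who think in C, the closest analogue to sending a complex struct without any serialization is an MPI derived datatype; the sketch below uses a hypothetical particle struct (plain C, not the Java bindings API Oscar describes):

#include <mpi.h>
#include <stddef.h>

/* Hypothetical struct sent as a single message with no serialization:
   the layout is described to MPI once via MPI_Type_create_struct. */
typedef struct {
    int    id;
    double pos[3];
} particle_t;

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int          blocklens[2] = { 1, 3 };
    MPI_Aint     displs[2]    = { offsetof(particle_t, id),
                                  offsetof(particle_t, pos) };
    MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };
    MPI_Datatype particle_type;
    MPI_Type_create_struct(2, blocklens, displs, types, &particle_type);
    MPI_Type_commit(&particle_type);

    particle_t p = { rank, { 1.0, 2.0, 3.0 } };
    if (rank == 0)
        MPI_Send(&p, 1, particle_type, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&p, 1, particle_type, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    MPI_Type_free(&particle_type);
    MPI_Finalize();
    return 0;
}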
Sorry for the delay - I understood and was just occupied with something else for a
while. Thanks for the follow-up. I'm looking at the issue and trying to
decipher the right solution.
On Jan 17, 2014, at 2:00 PM, tmish...@jcity.maeda.co.jp wrote:
>
>
> Hi Ralph,
>
> I'm sorry that my explanatio
Hi Ralph,
I'm sorry that my explanation was not enough ...
This is the summary of my situation:
1. I create a hostfile as shown below manually.
2. I use mpirun to start the job without Torque, which means I'm running in
an un-managed environment.
3. Firstly, ORTE detects 8 slots on each host(
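The hostfile itself is cut off in this digest; for illustration only, an un-managed hostfile of the kind being described usually looks like the following (hostnames and slot counts are made up, not Tetsuya's actual file):

node01 slots=8
node02 slots=8
node03 slots=8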
Thanks.
For what it is worth, it looks like I now have a successful build of
open-mpi plus hdf5, with the caveat (see the pasted note below from the HDF5
support desk) about make check-p from hdf5.
For anyone else trying to get hdf5 going with openmpi on mavericks, here
are the configure combinatio
Brock and I chatted off list.
I'm unable to replicate the error, but I have icc 14.0.1, not 14.0. I also
don't have Lustre, which is his base case.
So there are at least two variables here that need to be resolved.
On Jan 9, 2014, at 11:46 AM, Brock Palen wrote:
> Attached you will find a small
No, I didn't use Torque this time.
This issue occurs only when it is not in a managed
environment - namely, when orte_managed_allocation is false
(and orte_set_slots is NULL).
Under Torque management, it works fine.
I hope you can understand the situation.
Tetsuya Mishima
> I'm sorry, bu
We did update ROMIO at some point in there, so it is possible this is a ROMIO
bug that we have picked up. I've asked someone to check upstream about it.
On Jan 17, 2014, at 12:02 PM, Ronald Cohen wrote:
> Sorry, too many entries in this thread, I guess. My general goal is to get a
> working
Afraid I don't have access to Intel compilers (oddly enough), but my immediate
thought would be that there is some variable size difference - possibly a
default change between 64 and 32 bit for "int"? Your output offset just looks
to me like you wrapped the field.
I believe the MPI interfaces a
I never saw any replies on this. Has anyone else been able to produce this
sort of error? It is 100% reproducible for me.
Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985
On Jan 9, 2014, at 11:46 AM, Brock Palen wrote:
> Attached
Sorry, too many entries in this thread, I guess. My general goal is to get
a working parallel hdf5 with openmpi on Mac OS X Mavericks. At one point
in the saga I had romio disabled, which naturally doesn't work for hdf5
(which is trying to read/write files in parallel). So the hdf5 tests would
o
Can you specify exactly which issue you're referring to?
- test failing when you had ROMIO disabled
- test (sometimes) failing when you had ROMIO enabled
- compiling / linking issues
?
On Jan 17, 2014, at 1:50 PM, Ronald Cohen wrote:
> Hello Ralph and others, I just got the following back fr
The attached version is modified to use passive target, which does not
require collective synchronization for remote access.
Note that I didn't compile and run this and don't write MPI in Fortran
so there may be syntax errors.
Jeff
On Thu, Jan 16, 2014 at 11:03 AM, Christoph Niethammer
wrote:
>
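The Fortran attachment is not included in this digest. Purely as an illustration of the passive-target pattern Jeff describes, a lock/get/unlock epoch looks roughly like this in C (window contents and sizes are made up):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process exposes one double through a window. */
    double local = (double) rank;
    MPI_Win win;
    MPI_Win_create(&local, sizeof(double), (int) sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Passive target: only the origin calls lock/unlock, so no
       collective fence or post/start/complete/wait is needed. */
    double remote = 0.0;
    int    target = (rank + 1) % nprocs;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Get(&remote, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}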
Hello Ralph and others, I just got the following back from the HDF-5
support group, suggesting an ompi bug. So I should either try 1.7.3 or a
recent nightly 1.7.4. I will likely opt for 1.7.3, but hopefully someone
at openmpi can look at the problem for 1.7.4. In short, the challenge is
to get
I figured that.
On Fri, Jan 17, 2014 at 10:26 AM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:
> On Jan 17, 2014, at 1:17 PM, Jeff Squyres (jsquyres)
> wrote:
>
> > 3. --enable-shared is *not* implied by --enable-static. So if you
> --enable-static without --disable-shared, you're buil
On Jan 17, 2014, at 1:17 PM, Jeff Squyres (jsquyres) wrote:
> 3. --enable-shared is *not* implied by --enable-static. So if you
> --enable-static without --disable-shared, you're building both libmpi.so and
> libmpi.a (both of which will have the plugins slurped up -- no DSOs). Which
> is no
Very helpful, thanks.
On Fri, Jan 17, 2014 at 10:17 AM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:
> Ok, thanks. A few notes:
>
> 1. --enable-static implies --disable-dlopen. Specifically:
> --enable-static does two things:
>
> 1a. Build libmpi.a (and friends)
> 1b. Slurp all the OMP
Good suggestions, and thanks! But since I haven't been able to get the
problem to recur and I'm stuck now on other issues related to getting
parallel hdf5 to pass its make check, I will likely not follow up on this
particular (non-recurring) issue (except maybe I should forward your
comments to t
Ok, thanks. A few notes:
1. --enable-static implies --disable-dlopen. Specifically: --enable-static
does two things:
1a. Build libmpi.a (and friends)
1b. Slurp all the OMPI plugins into libmpi.a (and friends), vs. building them
as standalone dynamic shared object (DSO) files (this is half of
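Put concretely, a static-only build along those lines would be configured something like this (the install prefix is illustrative):

shell$ ./configure --prefix=/opt/openmpi-static --enable-static --disable-shared

Since --enable-static already implies --disable-dlopen, the plugins end up slurped into libmpi.a rather than built as standalone DSOs.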
I'm looking at your code, and I'm not actually an expert in the MPI IO stuff...
but do you have a race condition in the file close+delete and the open with
EXCL?
I'm asking because I don't know offhand if the file close+delete is
supposed to be collective and not return until the file is gu
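For readers without the attachment, the pattern being questioned boils down to something like the sketch below (the file name is made up, and this is a stripped-down illustration, not the actual test code):

#include <mpi.h>
#include <stdio.h>

/* Create a file, close and delete it, then immediately re-open it with
   MPI_MODE_EXCL.  If the delete is not guaranteed to be visible on all
   nodes before the open runs, the exclusive open can fail because the
   old file still appears to exist. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "racetest.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_close(&fh);

    if (rank == 0)
        MPI_File_delete("racetest.dat", MPI_INFO_NULL);

    /* Deliberately no barrier here: this is the window in which the
       delete may or may not have completed everywhere. */
    int err = MPI_File_open(MPI_COMM_WORLD, "racetest.dat",
                            MPI_MODE_CREATE | MPI_MODE_EXCL | MPI_MODE_WRONLY,
                            MPI_INFO_NULL, &fh);
    if (err == MPI_SUCCESS)
        MPI_File_close(&fh);
    else if (rank == 0)
        printf("exclusive open failed: the old file was still visible\n");

    MPI_Finalize();
    return 0;
}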
Thanks, I've just gotten an email with some suggestions (and promise of
more help) from the HDF5 support team. I will report back here, as it may
be of interest to others trying to build hdf5 on mavericks.
On Fri, Jan 17, 2014 at 9:08 AM, Ralph Castain wrote:
> Afraid I have no idea, but hope
I'm sorry, but I'm really confused, so let me try to understand the situation.
You use Torque to get an allocation, so you are running in a managed
environment.
You then use mpirun to start the job, but pass it a hostfile as shown below.
Somehow, ORTE believes that there is only one slot on eac
Afraid I have no idea, but hopefully someone else here with experience with
HDF5 can chime in?
On Jan 17, 2014, at 9:03 AM, Ronald Cohen wrote:
> Still a timely response, thank you. The particular problem I noted hasn't
> recurred; for reasons I will explain shortly I had to rebuild openmp
Still a timely response, thank you. The particular problem I noted
hasn't recurred; for reasons I will explain shortly I had to rebuild
openmpi again, and this time Sample_mpio.c compiled and ran successfully
from the start.
But now my problem is trying to get parallel HDF5 to run. In my first
Sorry for the delayed response - just getting back from travel. I don't know why
you would get that behavior other than a race condition. Afraid that code path
is foreign to me, but perhaps one of the folks in the MPI-IO area can respond.
On Jan 15, 2014, at 4:26 PM, Ronald Cohen wrote:
> Update:
with what version of OMPI?
On Jan 17, 2014, at 3:23 AM, basma a.azeem wrote:
>
>
> I am trying to run BLCR with Open MPI on a cluster of 4 nodes
> BLCR version: 0.8.5
> When I run the command:
>
> mpirun -np 4 -am ft-enable-cr -hostfile hosts
> /home/ubuntu//N/NPB3.3-MPI/bin/bt.B.4
>
version: 1.6.5 (compiled with Intel compilers)
command used:
mpirun --machinefile mfile --debug-daemons -np 16 myapp
Description of the problem:
When myapp is compiled without optimizations, everything runs fine;
if it is compiled with -O3, the application hangs. I cannot reproduce the
problem with
I am trying to run BLCR with Open MPI on a cluster of 4 nodes.
BLCR version: 0.8.5
When I run the command:
mpirun -np 4 -am ft-enable-cr -hostfile hosts
/home/ubuntu//N/NPB3.3-MPI/bin/bt.B.4
I got this error:
-
It looks like opal_init failed for some reason;
Thanks a ton Christoph. That helps a lot.
2014/1/17 Christoph Niethammer
> Hello,
>
> Find attached a minimal example - hopefully doing what you intended.
>
> Regards
> Christoph
>
> --
>
> Christoph Niethammer
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstrasse 19
> 70569 Stutt