Good points... I'll see if anything can be done to speed up the master. If we
can shrink the number of MPI processes without hurting overall throughput,
maybe I could save enough to fit another run on the freed cores. Thanks for the
ideas!
I was also worried about contention on the nodes since I
I think the 'middle ground' approach can be simplified even further if
the data file is on a shared device (e.g. an NFS/Samba mount) that can be
mounted at the same location in the file system tree on all nodes. I
have never tried it, though, and mmap()'ing a non-POSIX-compliant file
system such as Samba
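As a rough illustration of that "shared mount plus mmap()" idea (not anything from the original posters), a sketch in C might look like the following; the path /shared/data.bin and the assumption that the file holds an array of doubles are purely illustrative:

/* Sketch: map a read-only data file that is mounted at the same path on
   every node (e.g. over NFS).  The path and element type are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/shared/data.bin";      /* hypothetical shared-mount path */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Read-only shared mapping: the pages come from the page cache, so all
       processes on the same node reuse the same physical memory. */
    const double *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);                                  /* the mapping stays valid */

    printf("first value: %g\n", data[0]);

    munmap((void *)data, st.st_size);
    return 0;
}

Each rank runs the same code independently; within one node the kernel backs all the mappings with the same pages, so the file is not duplicated per process.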
It seems to me there are two extremes.
One is that you replicate the data for each process. This has the
disadvantage of consuming lots of memory "unnecessarily."
Another extreme is that shared data is distributed over all processes.
This has the disadvantage of making at least some of the
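For concreteness, here is a hedged sketch of the second extreme using plain MPI (the array length N and the use of MPI_Scatterv are illustrative assumptions, not the poster's code): rank 0 scatters disjoint slices so that no rank keeps a full copy, at the cost of needing communication whenever another rank's slice is touched.

/* Sketch of the "distribute over all processes" extreme (illustrative):
   rank 0 owns the full array and scatters disjoint slices, so no rank
   keeps a complete copy. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1000000;                      /* illustrative total size */
    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    for (int i = 0, off = 0; i < size; i++) {
        counts[i] = N / size + (i < N % size);  /* spread the remainder    */
        displs[i] = off;
        off += counts[i];
    }

    double *full = NULL;
    if (rank == 0)
        full = malloc(N * sizeof(double));      /* read/fill the data here */
    double *mine = malloc(counts[rank] * sizeof(double));

    /* Each rank ends up with only its slice; touching another rank's slice
       later requires explicit communication, which is the drawback. */
    MPI_Scatterv(full, counts, displs, MPI_DOUBLE,
                 mine, counts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(mine); free(full); free(counts); free(displs);
    MPI_Finalize();
    return 0;
}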
Terry,
You were right, the error indeed seems to come from the message coalescing
feature.
If I turn it off using the "--mca btl_openib_use_message_coalescing 0", I'm not
able to observe the "hdr->tag=0" error.
There are some Trac requests associated with a very similar error
(https://svn.open-mpi
Terry,
No, I haven't tried any other values than P,65536,256,192,128 yet.
The reason is quite simple: I've been reading and re-reading this thread
to understand the meaning of btl_openib_receive_queues, and I can't figure out
why the default values seem to induce the "hdr->tag=0" issue
(ht
Amb
It sounds like you have more workers than you can keep fed. Workers are
finishing up and requesting their next assignment but sit idle because
there are so many other idle workers too.
Load balance does not really matter if the choke point is the master. The
work is being done as fast as
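To illustrate the pattern being described (this is a generic sketch, not the poster's program): a single master answers one work request at a time, so once requests arrive faster than it can serve them, additional workers simply queue up idle.

/* Minimal master/worker sketch (generic, not the original program):
   workers ask rank 0 for the next item; once requests arrive faster than
   the master can answer them, the extra workers just sit and wait. */
#include <mpi.h>

#define TAG_REQUEST 1
#define TAG_WORK    2
#define TAG_STOP    3

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int total_items = 1000;               /* illustrative workload */

    if (rank == 0) {                            /* master: the choke point */
        int next = 0, stopped = 0, dummy, item = 0;
        MPI_Status st;
        while (stopped < size - 1) {
            /* Requests are served strictly one at a time. */
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            if (next < total_items) {
                item = next++;
                MPI_Send(&item, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
            } else {
                MPI_Send(&item, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                stopped++;
            }
        }
    } else {                                    /* worker */
        int dummy = 0, item;
        MPI_Status st;
        for (;;) {
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&item, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            /* ... process 'item' here ... */
        }
    }

    MPI_Finalize();
    return 0;
}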
That is already an answer that makes sense. I understand that it is really
not a trivial issue. I have seen other recent threads about "running on
crashed nodes", and I know the Open MPI team is working hard on it. Well, we
will wait and be glad to test the first versions when (I understand it will
take
The data are read from a file and processed before calculations begin, so I
think that mapping will not work in our case.
Global Arrays do look promising indeed. As I said, we need to put just a part
of the data into the shared section. John, do you (or maybe other users) have
experience working wit
I completely neglected to mention that you could also use hwloc (Hardware
Locality), a small utility library for discovering topology-related things
(including whether you're bound, where you're bound, etc.). Hwloc is a sub-project
of Open MPI:
http://www.open-mpi.org/projects/hwloc/
Open MPI us
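For reference, a minimal sketch of querying the current binding with hwloc's C API might look like this (written against the hwloc 1.x bitmap interface; adjust to the installed version):

/* Sketch: report where the calling process is currently bound, via hwloc. */
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    if (hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) == 0) {
        char *str;
        hwloc_bitmap_asprintf(&str, set);       /* e.g. "0x0000000f" */
        printf("process is bound to cpuset %s\n", str);
        free(str);
    } else {
        printf("could not query the binding\n");
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return 0;
}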
That is interesting. So does the number of processes affect your runs
at all? The times I've seen hdr->tag be 0, it has usually been due to protocol
issues; the tag should never be 0. Have you tried any receive_queues
settings other than the default and the one you mention?
I wonder if you
As one of the Open MPI developers actively working on the MPI layer
stabilization/recovery feature set, I don't think we can give you a specific
timeframe for availability, especially availability in a stable release. Once
the initial functionality is finished, we will open it up for user testing
Open MPI's fault tolerance is still somewhat rudimentary; it's a complex topic
within the entire scope of MPI. There has been much research into MPI and
fault tolerance over the years; the MPI Forum itself is grappling with terms
and definitions that make sense. It's by no means a "solved" pro
On the OMPI SVN trunk, we have an "Open MPI extension" call named
OMPI_Affinity_str(). Below is an excerpt from the man page. If this is
desirable, we can probably get it into 1.5.1.
-
NAME
OMPI_Affinity_str - Obtain prettyprint strings of processor affinity
information f
On 24.09.2010 at 13:26, John Hearns wrote:
> On 24 September 2010 08:46, Andrei Fokau wrote:
>> We use a C-program which consumes a lot of memory per process (up to a few
>> GB), 99% of the data being the same for each process. So for us it would be
>> quite reasonable to put that part of the data in
On 24 September 2010 08:46, Andrei Fokau wrote:
> We use a C-program which consumes a lot of memory per process (up to a few
> GB), 99% of the data being the same for each process. So for us it would be
> quite reasonable to put that part of the data in shared memory.
http://www.emsl.pnl.gov/docs/glo
Is the data coming from a read-only file? In that case, a better way
might be to memory-map that file in the root process and share the map
pointer with all the slave threads. This, like shared memory, will work
only for processes within a node, of course.
On Fri, Sep 24, 2010 at 3:46 AM, Andrei Fo
We use a C-program which consumes a lot of memory per process (up to a few
GB), 99% of the data being the same for each process. So for us it would be
quite reasonable to put that part of the data in shared memory.
In the source code, the memory is allocated via the malloc() function. What
would it requir
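To make the question concrete, here is one hedged sketch of what such a change could look like, using a per-node POSIX shared-memory segment in place of malloc(); the segment name, the "one designated rank per node creates and fills it" convention, and the omitted synchronization/cleanup are all assumptions for illustration:

/* Sketch: replace a large malloc() with a per-node POSIX shared-memory
   segment.  The name, size handling, and creator convention are illustrative;
   real code needs a per-node barrier before readers attach and shm_unlink()
   at shutdown.  Link with -lrt on Linux. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *attach_shared_block(size_t bytes, int i_am_node_master)
{
    const char *name = "/my_app_shared";        /* hypothetical segment name */
    int flags = i_am_node_master ? (O_CREAT | O_RDWR) : O_RDWR;

    int fd = shm_open(name, flags, 0600);
    if (fd < 0) return NULL;
    if (i_am_node_master && ftruncate(fd, bytes) < 0) { close(fd); return NULL; }

    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                  /* the mapping stays valid */
    if (p == MAP_FAILED) return NULL;

    /* One rank per node initializes the contents; the others must wait
       (e.g. on a barrier) before reading.  The returned pointer replaces
       the old malloc(bytes) result. */
    return p;
}

The other ranks on the same node call the same function with i_am_node_master set to 0 and get a pointer to the same physical pages, so the 99% of shared data exists only once per node.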
Ralph, could you tell us when this functionality will be available in the
stable version? A rough estimate will be fine.
On Fri, Sep 24, 2010 at 01:24, Ralph Castain wrote:
> In a word, no. If a node crashes, OMPI will abort the currently-running job
> if it had processes on that node. There is
Hello,
My question concerns the display of the error message generated by a throw
std::runtime_error("Explicit error message").
I am launching an Open MPI program on several machines from a terminal using:
mpirun -v -machinefile MyMachineFile.txt MyProgram.
I am wondering why I cannot see an error messag
Hi Terry,
The messages being sent/received can be of any size, but the error seems to
happen more often with small messages (such as an int being broadcast or
allreduced).
The failing communication differs from one run to another, but some spots are
more likely to fail than others. And as f