> I have a custom datatype MPI_EVENTDATA (created with
> MPI_Type_create_struct) which is a struct with some fixed size fields and a
> variable sized array of ints (data). I want to collect a variable number of
> these types (Events) from all ranks at rank 0. My current version is
> working for a fixed size custom datatype:
>
>
MPI_EVENTDATA can only describe one layout, so if you want a
variable-width .data member, each width has to be its own datatype.

An alternative solution is to define the struct such that all of the bytes
are contiguous and just send them as MPI_BYTE with a varying count.  If the
variation in width is relatively narrow, you could define a single datatype
that every instance fits into.  There are a bunch of advantages to the
latter, not least that you can use an O(log N) MPI_Gather instead of O(N)
Send-Recvs or MPI_Gatherv, which may be O(N) internally.
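
To make the single-fits-all-datatype idea concrete, here is a rough,
untested sketch.  EventDataFixed and MAX_DATA are names and bounds I made
up; the point is that padding .data to an assumed maximum lets one
MPI_Gather (one event per rank, for simplicity) collect everything at the
root:

#include <mpi.h>
#include <vector>

constexpr int MAX_NAME = 255;
constexpr int MAX_DATA = 64;   // assumed upper bound on the int payload

struct EventDataFixed {
  char name[MAX_NAME];
  int  rank;
  int  dataSize;               // number of valid entries in data[]
  int  data[MAX_DATA];
};

void gatherEvents(const EventDataFixed &mine, int nranks, int rank)
{
  std::vector<EventDataFixed> all;
  if (rank == 0) all.resize(nranks);
  // The struct is plain old data, so on a homogeneous cluster you can
  // ship it as raw bytes and skip the derived datatype entirely.
  MPI_Gather(&mine, (int)sizeof(EventDataFixed), MPI_BYTE,
             all.data(), (int)sizeof(EventDataFixed), MPI_BYTE,
             0, MPI_COMM_WORLD);
}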


> void collect()
> {
>   int rank;
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>   size_t globalSize = events.size();
>   // Get total number of Events that are to be received
>   MPI_Allreduce(MPI_IN_PLACE, &globalSize, 1, MPI_INT, MPI_SUM,
> MPI_COMM_WORLD);
>   std::vector<MPI_Request> requests(globalSize);
>   std::vector<MPI_EventData> recvEvents(globalSize);
>
>   if (rank == 0) {
>     for (size_t i = 0; i < globalSize; i++) {
>       MPI_Irecv(&recvEvents[i], 1, MPI_EVENTDATA, MPI_ANY_SOURCE,
> MPI_ANY_TAG, MPI_COMM_WORLD, &requests[i]);
>     }
>

MPI wildcards are the least efficient option.  If you know how many
messages will be sent and from which ranks, you can eliminate the
wildcards.  If not every rank will send data, augment the MPI_Allreduce
with an MPI_Gather of 0 or 1 and post a recv for every 1 in the output
vector.  (You can elide the MPI_Allreduce if you only need globalSize at
the root.)  If you know the counts and are sending either arbitrary bytes
or a single user-defined datatype, you can replace the Send + N*Recv with
MPI_Gatherv.  Even if it's still O(N), a decent implementation will handle
flow control for you.  Posting N receives doesn't scale, so you will want
to batch them up into some reasonable number if you are going to run large
jobs.  Flajslik, Dinan, and Underwood have an ISC16 paper on this
("Mitigating MPI Message Matching Misery").
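
Here is roughly what the count-then-Gatherv version looks like (untested
sketch; buffer and function names are mine).  Every rank reports how many
bytes it will contribute, zero included, the root builds displacements,
and a single MPI_Gatherv moves all the payloads:

#include <mpi.h>
#include <vector>

std::vector<char> gatherPayloads(const std::vector<char> &mine,
                                 int nranks, int rank)
{
  int mycount = (int)mine.size();
  std::vector<int> counts(rank == 0 ? nranks : 0);

  // Phase 1: the root learns every rank's contribution size.
  MPI_Gather(&mycount, 1, MPI_INT, counts.data(), 1, MPI_INT,
             0, MPI_COMM_WORLD);

  std::vector<int> displs;
  std::vector<char> recvbuf;
  if (rank == 0) {
    displs.resize(nranks);
    int total = 0;
    for (int i = 0; i < nranks; ++i) { displs[i] = total; total += counts[i]; }
    recvbuf.resize(total);
  }

  // Phase 2: one collective replaces N sends and N wildcard receives.
  MPI_Gatherv(mine.data(), mycount, MPI_BYTE,
              recvbuf.data(), counts.data(), displs.data(), MPI_BYTE,
              0, MPI_COMM_WORLD);
  return recvbuf;   // non-empty only at the root
}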


>   }
>   for (const auto & ev : events) {
>     MPI_EventData eventdata;
>     assert(ev.first.size() < 255);
>     strcpy(eventdata.name, ev.first.c_str());
>     eventdata.rank = rank;
>     eventdata.dataSize = ev.second.data.size();
>     MPI_Send(&eventdata, 1, MPI_EVENTDATA, 0, 0, MPI_COMM_WORLD);
>   }
>   if (rank == 0) {
>     MPI_Waitall(globalSize, requests.data(), MPI_STATUSES_IGNORE);
>     for (const auto & evdata : recvEvents) {
>       // Save in a std::multimap with evdata.name as key
>       globalEvents.emplace(std::piecewise_construct,
> std::forward_as_tuple(evdata.name),
>                            std::forward_as_tuple(evdata.name,
> evdata.rank));
>     }
>
>   }
> }
>
> Obviously, next step would be to allocate a buffer of size
> evdata.dataSize, receive it, add it to globalEvents multimap<Event> and be
> happy. Questions I have:
>
> * How to correlate the received Events in the first step, with the
> received data vector in the second step?
>

You can eliminate the first phase in favor of DSDE (
http://htor.inf.ethz.ch/publications/index.php?pub=99), which replaces the
MPI_Allreduce with Probe+Ibarrier (not literally; see the paper for
details).
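
The core of the NBX/DSDE pattern, specialized to "every rank sends one
message to rank 0", looks roughly like the following (untested sketch;
the general sparse all-to-all case is in the paper).  MPI_Issend is the
key ingredient: its completion tells you the receiver has matched your
message, so a nonblocking barrier entered after that point tells everyone
when all traffic has been absorbed:

#include <mpi.h>
#include <vector>

void dsdeSendToRoot(const std::vector<char> &payload)
{
  MPI_Request sendReq, barrierReq;
  int sendDone = 0, barrierActive = 0, allDone = 0;

  MPI_Issend(payload.data(), (int)payload.size(), MPI_BYTE,
             0, 0, MPI_COMM_WORLD, &sendReq);

  while (!allDone) {
    int flag;
    MPI_Status st;
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &st);
    if (flag) {                       // only rank 0 ever gets a hit here
      int nbytes;
      MPI_Get_count(&st, MPI_BYTE, &nbytes);
      std::vector<char> buf(nbytes);
      MPI_Recv(buf.data(), nbytes, MPI_BYTE, st.MPI_SOURCE, st.MPI_TAG,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      // ... deserialize buf into globalEvents here ...
    }
    if (!barrierActive) {
      MPI_Test(&sendReq, &sendDone, MPI_STATUS_IGNORE);
      if (sendDone) {
        MPI_Ibarrier(MPI_COMM_WORLD, &barrierReq);
        barrierActive = 1;
      }
    } else {
      MPI_Test(&barrierReq, &allDone, MPI_STATUS_IGNORE);
    }
  }
}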


> * Is there a way to use a variable sized compononent inside a custom MPI
> datatype?
>

No.  You can use MPI_Type_create_resized to avoid creating a new datatype,
but in your case you'd be resizing the contiguous type inside your struct
type, which probably doesn't save you anything.


> * Or dump the custom datatype and use MPI_Pack instead?
>

Writing your own serialization is likely to be faster.  MPI doesn't have
any magic here, and the generic implementation inside an MPI library can't
leverage the specifics of your code that may be amenable to compiler
optimization.
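
For example, a hand-rolled packer along these lines (untested; the layout
is my own choice, the field names follow your struct) puts the header and
the variable-length data into one byte buffer that you can ship as
MPI_BYTE in a single message:

#include <mpi.h>
#include <cstring>
#include <string>
#include <vector>

std::vector<char> packEvent(const std::string &name, int rank,
                            const std::vector<int> &data)
{
  const int nameLen = (int)name.size();
  const int dataLen = (int)data.size();
  std::vector<char> buf(3 * sizeof(int) + nameLen + dataLen * sizeof(int));

  char *p = buf.data();
  std::memcpy(p, &nameLen, sizeof(int));              p += sizeof(int);
  std::memcpy(p, &rank,    sizeof(int));              p += sizeof(int);
  std::memcpy(p, &dataLen, sizeof(int));              p += sizeof(int);
  std::memcpy(p, name.data(), nameLen);               p += nameLen;
  if (dataLen) std::memcpy(p, data.data(), dataLen * sizeof(int));
  return buf;
}

// e.g. auto buf = packEvent(ev.first, rank, ev.second.data);
//      MPI_Send(buf.data(), (int)buf.size(), MPI_BYTE, 0, 0, MPI_COMM_WORLD);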

You might also look at Boost.MPI, which plays nicely with
Boost.Serialization (
http://www.boost.org/doc/libs/1_54_0/doc/html/mpi/tutorial.html).  While
Boost.MPI does not support recent features of MPI, it supports the most
widely used ones, perhaps all of the ones you need.
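
An untested sketch of what that looks like: give the event a serialize()
member and let boost::mpi::gather ship variable-sized objects for you
(the Event below mirrors your fields, not your actual class):

#include <boost/mpi.hpp>
#include <boost/serialization/string.hpp>
#include <boost/serialization/vector.hpp>
#include <string>
#include <vector>

struct Event {
  std::string name;
  int rank;
  std::vector<int> data;

  template <class Archive>
  void serialize(Archive &ar, const unsigned int /*version*/) {
    ar & name & rank & data;
  }
};

int main(int argc, char *argv[])
{
  boost::mpi::environment env(argc, argv);
  boost::mpi::communicator world;

  Event mine{"advance", world.rank(), {1, 2, 3}};
  std::vector<Event> all;                   // filled only at the root
  boost::mpi::gather(world, mine, all, 0);  // serializes each Event for you
  return 0;
}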

If you refuse to use Boost for whatever reason, I am sympathetic; that was
my position back when I had to use less-than-awesome compilers.


> * Or somehow group two succeeding messages together?
>

DSDE is probably the best generic solution for what you are doing.  It's
possible that a two-phase implementation wins when your specific usage
allows a more efficient collective algorithm.


> I'm open to any good and elegant suggestions!
>

I won't guarantee that any of my suggestions satisfies either property :-)

Best,

Jeff

-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/