> I have a custom datatype MPI_EVENTDATA (created with MPI_Type_create_struct)
> which is a struct with some fixed-size fields and a variable-sized array of
> ints (data). I want to collect a variable number of these types (Events)
> from all ranks at rank 0. My current version is working for a fixed-size
> custom datatype:

MPI_EVENTDATA can only support one definition, so if you want to support a
varying-width .data member, each width has to be a different datatype.
An alternative solution is to define the struct such that all of the bytes
are contiguous and just send them as MPI_BYTE with varying count. If the
variation in width is relatively narrow, you could define a single datatype
that every instance fits into. There are a bunch of advantages to the latter,
not the least of which is that you can use an O(log N) MPI_Gather instead of
O(N) Send-Recvs or MPI_Gatherv, which may be O(N) internally.

> void collect()
> {
>   int rank;
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>   size_t globalSize = events.size();
>   // Get total number of Events that are to be received
>   MPI_Allreduce(MPI_IN_PLACE, &globalSize, 1, MPI_INT, MPI_SUM,
>                 MPI_COMM_WORLD);
>   std::vector<MPI_Request> requests(globalSize);
>   std::vector<MPI_EventData> recvEvents(globalSize);
>
>   if (rank == 0) {
>     for (size_t i = 0; i < globalSize; i++) {
>       MPI_Irecv(&recvEvents[i], 1, MPI_EVENTDATA, MPI_ANY_SOURCE,
>                 MPI_ANY_TAG, MPI_COMM_WORLD, &requests[i]);
>     }

MPI wildcards are the least efficient option. If you know how many messages
are going to be sent and from which ranks, you can eliminate the wildcards.
If not every rank will send data, then augment MPI_Allreduce with an
MPI_Gather of 0 or 1 and post a recv for every 1 in the output vector. (You
can elide the MPI_Allreduce if you only need globalSize at the root.) If you
know the counts and are sending either arbitrary bytes or a single
user-defined datatype, you can replace the Send + N*Recv with MPI_Gatherv
(sketched below). Even if it's still O(N), a decent implementation will
handle flow control for you. Posting N receives doesn't scale and you will
want to batch up into some reasonable number if you are going to run large
jobs. Flajslik, Dinan, and Underwood have an ISC16 paper on this.

>   }
>   for (const auto & ev : events) {
>     MPI_EventData eventdata;
>     assert(ev.first.size() < 255);
>     strcpy(eventdata.name, ev.first.c_str());
>     eventdata.rank = rank;
>     eventdata.dataSize = ev.second.data.size();
>     MPI_Send(&eventdata, 1, MPI_EVENTDATA, 0, 0, MPI_COMM_WORLD);
>   }
>   if (rank == 0) {
>     MPI_Waitall(globalSize, requests.data(), MPI_STATUSES_IGNORE);
>     for (const auto & evdata : recvEvents) {
>       // Save in a std::multimap with evdata.name as key
>       globalEvents.emplace(std::piecewise_construct,
>                            std::forward_as_tuple(evdata.name),
>                            std::forward_as_tuple(evdata.name, evdata.rank));
>     }
>   }
> }
>
> Obviously, next step would be to allocate a buffer of size evdata.dataSize,
> receive it, add it to globalEvents multimap<Event> and be happy. Questions
> I have:
>
> * How to correlate the received Events in the first step with the received
>   data vector in the second step?

You can eliminate the first phase in favor of DSDE
(http://htor.inf.ethz.ch/publications/index.php?pub=99), which replaces the
MPI_Allreduce with Probe+Ibarrier (not literally - see the paper for details;
a rough sketch follows below).

> * Is there a way to use a variable sized component inside a custom MPI
>   datatype?

No. You can use MPI_Type_create_resized to avoid creating a new datatype, but
in your case you'll be resizing the contiguous type inside of your struct
type, which probably doesn't save you anything.

> * Or dump the custom datatype and use MPI_Pack instead?

Writing your own serialization is likely to be faster (sketched below). MPI
doesn't have any magic here and the generic implementation inside of an MPI
library can't leverage the specifics in your code that may be amenable to
compiler optimization.
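To make the fixed-maximum-size suggestion concrete, here is a minimal sketch.
It is not your code: WireEvent, MAX_EVENT_DATA and collect_fixed are names I
made up, and MAX_EVENT_DATA assumes you can put an upper bound on the width
of .data. Every event is padded to the same size and shipped as raw MPI_BYTE,
so no user-defined datatype and no wildcard receives are needed:

  // Sketch only: fixed-maximum-size events gathered as raw bytes.
  #include <mpi.h>
  #include <vector>

  constexpr int MAX_NAME       = 255; // matches the assert() in your code
  constexpr int MAX_EVENT_DATA = 64;  // assumed bound on ev.second.data.size()

  struct WireEvent {          // contiguous and trivially copyable,
    char name[MAX_NAME];      // so sending it as MPI_BYTE is safe
    int  rank;
    int  dataSize;            // number of valid entries in data[]
    int  data[MAX_EVENT_DATA];
  };

  void collect_fixed(const std::vector<WireEvent>& myEvents,
                     std::vector<WireEvent>& allEvents)  // filled on rank 0
  {
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // 1. The root learns how many events each rank contributes.
    const int myCount = static_cast<int>(myEvents.size());
    std::vector<int> counts(rank == 0 ? nprocs : 0);
    MPI_Gather(&myCount, 1, MPI_INT, counts.data(), 1, MPI_INT,
               0, MPI_COMM_WORLD);

    // 2. The root computes byte counts/displacements; everyone contributes.
    std::vector<int> byteCounts, displs;
    if (rank == 0) {
      byteCounts.resize(nprocs);
      displs.resize(nprocs);
      int offset = 0;
      for (int i = 0; i < nprocs; ++i) {
        byteCounts[i] = counts[i] * static_cast<int>(sizeof(WireEvent));
        displs[i]     = offset;
        offset       += byteCounts[i];
      }
      allEvents.resize(offset / static_cast<int>(sizeof(WireEvent)));
    }
    MPI_Gatherv(myEvents.data(), myCount * static_cast<int>(sizeof(WireEvent)),
                MPI_BYTE, allEvents.data(), byteCounts.data(), displs.data(),
                MPI_BYTE, 0, MPI_COMM_WORLD);
    // Rank 0 can now walk allEvents and build its multimap keyed on name.
  }

If the widths vary wildly, the padding wastes bandwidth, and the
serialization route sketched further down is a better fit.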
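And here is a rough sketch of the DSDE/NBX pattern from the paper linked
above, reusing the made-up WireEvent struct from the previous sketch. The
point is that MPI_Issend only completes once the receiver has matched the
message, so once all of my sends have completed and a nonblocking barrier
finishes, rank 0 knows nothing else is in flight - no MPI_Allreduce and no
up-front count needed:

  // Sketch only: NBX/DSDE-style collection without knowing counts up front.
  #include <mpi.h>
  #include <vector>

  std::vector<WireEvent> collect_nbx(const std::vector<WireEvent>& myEvents)
  {
    const int tag = 42;  // arbitrary tag reserved for this exchange
    std::vector<MPI_Request> sendReqs(myEvents.size());
    for (size_t i = 0; i < myEvents.size(); ++i)
      MPI_Issend(&myEvents[i], static_cast<int>(sizeof(WireEvent)), MPI_BYTE,
                 0, tag, MPI_COMM_WORLD, &sendReqs[i]);

    std::vector<WireEvent> received;   // stays empty except on rank 0
    MPI_Request barrier = MPI_REQUEST_NULL;
    int inBarrier = 0, done = 0;
    while (!done) {
      // Drain whatever has arrived (only rank 0 is ever a target here).
      int flag = 0;
      MPI_Status st;
      MPI_Iprobe(MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &flag, &st);
      if (flag) {
        WireEvent ev;
        MPI_Recv(&ev, static_cast<int>(sizeof(WireEvent)), MPI_BYTE,
                 st.MPI_SOURCE, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        received.push_back(ev);
      }
      if (!inBarrier) {
        // Enter the nonblocking barrier once all of my synchronous sends
        // have been matched by the root.
        int allSent = 0;
        MPI_Testall(static_cast<int>(sendReqs.size()), sendReqs.data(),
                    &allSent, MPI_STATUSES_IGNORE);
        if (allSent) {
          MPI_Ibarrier(MPI_COMM_WORLD, &barrier);
          inBarrier = 1;
        }
      } else {
        // When the barrier completes, every rank's sends have been received.
        MPI_Test(&barrier, &done, MPI_STATUS_IGNORE);
      }
    }
    return received;
  }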
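Finally, a sketch of what hand-rolled serialization could look like
(packEvents/unpackEvents and the Event struct are placeholders for whatever
your events actually store, not your types). Each rank flattens name, rank
and the variable-length data into one byte buffer; those buffers can then be
gathered with the same MPI_Gatherv-of-MPI_BYTE call from the first sketch,
which also answers the correlation question, since the metadata and the data
never travel separately:

  // Sketch only: hand-rolled (de)serialization of variable-size events.
  // Wire format per event:
  //   [int nameLen][name bytes][int rank][int dataSize][dataSize ints]
  #include <cstring>
  #include <map>
  #include <string>
  #include <vector>

  struct Event {              // stand-in for whatever your events hold
    int rank;
    std::vector<int> data;
  };

  static void appendBytes(std::vector<char>& buf, const void* p, size_t n) {
    if (n == 0) return;
    const char* c = static_cast<const char*>(p);
    buf.insert(buf.end(), c, c + n);
  }

  std::vector<char> packEvents(const std::multimap<std::string, Event>& events) {
    std::vector<char> buf;
    for (const auto& kv : events) {
      const int nameLen  = static_cast<int>(kv.first.size());
      const int dataSize = static_cast<int>(kv.second.data.size());
      appendBytes(buf, &nameLen, sizeof(int));
      appendBytes(buf, kv.first.data(), nameLen);
      appendBytes(buf, &kv.second.rank, sizeof(int));
      appendBytes(buf, &dataSize, sizeof(int));
      appendBytes(buf, kv.second.data.data(), dataSize * sizeof(int));
    }
    return buf;
  }

  std::multimap<std::string, Event> unpackEvents(const std::vector<char>& buf) {
    std::multimap<std::string, Event> events;
    size_t pos = 0;
    auto readInt = [&](int& v) {
      std::memcpy(&v, buf.data() + pos, sizeof(int));
      pos += sizeof(int);
    };
    while (pos < buf.size()) {
      int nameLen = 0, dataSize = 0;
      Event ev;
      readInt(nameLen);
      std::string name(buf.data() + pos, nameLen);
      pos += nameLen;
      readInt(ev.rank);
      readInt(dataSize);
      ev.data.resize(dataSize);
      if (dataSize > 0) {
        std::memcpy(ev.data.data(), buf.data() + pos, dataSize * sizeof(int));
        pos += dataSize * sizeof(int);
      }
      events.emplace(std::move(name), std::move(ev));
    }
    return events;
  }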
You might also look at Boost.MPI, which plays nicely with Boost.Serialization
(http://www.boost.org/doc/libs/1_54_0/doc/html/mpi/tutorial.html). While
Boost.MPI does not support recent features of MPI, it supports the most
widely used ones, perhaps all of the ones you need. If you refuse to use
Boost for whatever reason, I am sympathetic, since this was my position back
when I had to use less-than-awesome compilers.

> * Or somehow group two succeeding messages together?

DSDE is probably the best generic implementation of what you are doing. It's
possible that a two-phase implementation wins when the specific usage allows
you to use a more efficient collective algorithm.

> I'm open to any good and elegant suggestions!

I won't guarantee that any of my suggestions satisfies either property :-)

Best,

Jeff

--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/