Jeff,
The first location is indeed in ompi_coll_libnbc_iallreduce()
Lee Ann,
Thanks for the bug report.
For the time being, can you please give the attached patch a try?
Cheers,
Gilles
FWIW
NBC_Schedule_request() sets handle->tmpbuf = tmpbuf and calls
NBC_Start(handle, schedule), which sets handle->schedule = schedule.
If NBC_Start_round() fails, NBC_Return_handle(handle) calls
NBC_Free(handle), which does
OBJ_RELEASE(handle->schedule) and free(handle->tmpbuf).
Back in ompi_coll_libnbc_iallreduce(), we once again
OBJ_RELEASE(schedule) and free(tmpbuf).
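
The sequence above can be modeled with a small, self-contained C sketch. Every name below (toy_obj_t, toy_handle_t, toy_release, and the toy_* functions) is a hypothetical stand-in, not libnbc code, and toy_release() only counts release attempts instead of freeing memory, so the double release can be observed safely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are not the libnbc types. */
typedef struct { int releases; } toy_obj_t;   /* counts release attempts */
typedef struct { toy_obj_t *schedule; toy_obj_t *tmpbuf; } toy_handle_t;

/* Stands in for OBJ_RELEASE()/free(): records the attempt instead of freeing. */
static void toy_release(toy_obj_t *obj) {
    if (obj != NULL) {
        obj->releases++;
    }
}

/* Models NBC_Return_handle() -> NBC_Free(): releases what the handle holds. */
static void toy_return_handle(toy_handle_t *handle) {
    assert(handle != NULL);
    toy_release(handle->schedule);
    toy_release(handle->tmpbuf);
    handle->schedule = NULL;
    handle->tmpbuf = NULL;
}

/* Models NBC_Schedule_request() when starting the first round fails. */
static int toy_schedule_request_failing(toy_handle_t *handle,
                                        toy_obj_t *schedule, toy_obj_t *tmpbuf) {
    handle->tmpbuf = tmpbuf;       /* handle now points at the caller's buffer */
    handle->schedule = schedule;   /* ... and at the caller's schedule */
    toy_return_handle(handle);     /* error path: first release of both */
    return -1;
}

/* Models the error path at the end of ompi_coll_libnbc_iallreduce(). */
static int toy_iallreduce_failing(toy_obj_t *schedule, toy_obj_t *tmpbuf) {
    toy_handle_t handle = { NULL, NULL };
    int res = toy_schedule_request_failing(&handle, schedule, tmpbuf);
    if (res != 0) {
        toy_release(schedule);     /* second release: the double free */
        toy_release(tmpbuf);
    }
    return res;
}
```

In this model, after toy_iallreduce_failing() returns, both objects have a release count of 2: one release from the handle cleanup and one from the caller's error path.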
On 3/15/2019 7:27 AM, Jeff Squyres (jsquyres) via users wrote:
Lee Ann --
Thanks for your bug report.
I'm not able to find a call to NBC_Schedule_request() in
ompi_coll_libnbc_iallreduce().
I see 2 calls to NBC_Schedule_request() in
ompi/mca/coll/libnbc/nbc_iallreduce.c, but they are in different functions.
Can you clarify exactly which one(s) you're referring to?
Location 1:
https://github.com/open-mpi/ompi/blob/7412e88689ffd1dcc95a587e6ded9d7455fa031c/ompi/mca/coll/libnbc/nbc_iallreduce.c#L183
Location 2:
https://github.com/open-mpi/ompi/blob/7412e88689ffd1dcc95a587e6ded9d7455fa031c/ompi/mca/coll/libnbc/nbc_iallreduce.c#L247
On Mar 14, 2019, at 4:27 PM, Riesen, Lee Ann <lee.ann.rie...@intel.com> wrote:
I'm trying to build OpenMPI 3.1.2 as part of Mellanox HPC-X and I'm having some
problems with the underlying libraries. The true problem was masked for a while
by a bug in error handling in OpenMPI. In mca/coll/libnbc/nbc_iallreduce.c in
function ompi_coll_libnbc_iallreduce() we have some error handling at the end
that looks like:
  res = NBC_Schedule_request (schedule, comm, libnbc_module, request, tmpbuf);
  if (OPAL_UNLIKELY(OMPI_SUCCESS != res)) {
    OBJ_RELEASE(schedule);
    free(tmpbuf);
    return res;
  }
The NBC_Schedule_request() call failed, and in that call the "schedule" and "tmpbuf" were freed. Then we return and the "schedule" and "tmpbuf" are freed again. It looks like this occurs elsewhere in the source file too.
Lee Ann
-----
Lee Ann Riesen, Enterprise and Government Group, Intel Corporation, Hillsboro,
OR
Phone 503-613-1952
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
diff --git a/ompi/mca/coll/libnbc/nbc.c b/ompi/mca/coll/libnbc/nbc.c
index dff6362..43eec5f 100644
--- a/ompi/mca/coll/libnbc/nbc.c
+++ b/ompi/mca/coll/libnbc/nbc.c
@@ -720,6 +720,9 @@ int NBC_Schedule_request(NBC_Schedule *schedule, ompi_communicator_t *comm, ompi
   res = NBC_Start (handle, schedule);
   if (OPAL_UNLIKELY(OMPI_SUCCESS != res)) {
+    /* temporary workaround to avoid double free */
+    handle->tmpbuf = NULL;
+    handle->schedule = NULL;
     NBC_Return_handle (handle);
     return res;
   }
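
The idea behind the workaround in this patch can be illustrated with a minimal sketch: the callee detaches the pointer from the handle before cleaning up, so the caller's error path performs the only free. The demo_* names are hypothetical and not part of libnbc, and the sketch assumes the cleanup routine skips NULL fields, which is an assumption about the real cleanup rather than something shown in the diff.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the handle; only the field used here. */
typedef struct { void *tmpbuf; } demo_handle_t;

static int demo_free_calls = 0;    /* how many real free() calls were made */

/* Wraps free() so the number of actual frees can be counted. */
static void demo_counted_free(void *ptr) {
    if (ptr != NULL) {
        demo_free_calls++;
        free(ptr);
    }
}

/* Cleanup that skips NULL fields (an assumption about the real cleanup). */
static void demo_return_handle(demo_handle_t *handle) {
    assert(handle != NULL);
    demo_counted_free(handle->tmpbuf);
    handle->tmpbuf = NULL;
}

/* Callee error path with the workaround: detach first, then clean up. */
static int demo_schedule_request_failing(demo_handle_t *handle, void *tmpbuf) {
    handle->tmpbuf = tmpbuf;
    handle->tmpbuf = NULL;         /* workaround: caller keeps ownership */
    demo_return_handle(handle);    /* nothing left to free here */
    return -1;
}

/* Caller error path: frees its buffer once; returns the total free count. */
static int demo_caller_free_count(void) {
    demo_handle_t handle = { NULL };
    void *tmpbuf = malloc(16);
    if (tmpbuf == NULL) {
        return -1;                 /* allocation failed; nothing to show */
    }
    demo_free_calls = 0;
    if (demo_schedule_request_failing(&handle, tmpbuf) != 0) {
        demo_counted_free(tmpbuf); /* the only free of this buffer */
    }
    return demo_free_calls;
}
```

When the allocation succeeds, demo_caller_free_count() reports a single free, because the handle no longer references the buffer when the cleanup runs.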