Hi again,

I've been able to simplify my test case significantly. It now runs
with 2 nodes, and only a single MPI_Send / MPI_Recv pair is used.

The pattern is as follows.

 - ranks 0 and 1 both own a local buffer.
 - each fills it with (deterministically known) data.
 - rank 0 collects the data from rank 1's local buffer (whose contents
   should be no mystery), and writes this to a file-backed mmaped area.
 - rank 0 compares what it receives with what it knows it *should have*
   received.
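
In code terms, the failing path on rank 0 boils down to something like
the sketch below (a simplified illustration only: bufsize and nitems are
placeholder names, and the attached file is the complete, runnable
version):

  /* rank 0: map a freshly created file and receive straight into it */
  int fd = open("data.tmp", O_RDWR | O_CREAT | O_EXCL, 0666);
  ftruncate(fd, bufsize);
  unsigned long * area = mmap(NULL, bufsize, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
  MPI_Recv(area, nitems, MPI_UNSIGNED_LONG, 1, 0,
           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  /* rank 1 simply does the matching MPI_Send from its local buffer */

With --mca btl ^openib, or with a plain malloc'ed buffer in place of the
mapping, the same sequence receives the expected data.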

The test fails if:

 - the openib btl is used between the 2 nodes,
 - a file-backed mmaped area is used for receiving the data,
 - the write is done to a newly created file, and
 - the per-node buffer is large enough.

For a per-node buffer size of 12240 bytes or more (1530 unsigned longs,
roughly 12 kB), my program fails: the MPI_Recv does not receive the
correct data chunk, it just gets zeroes.

I attach the simplified test case. I hope someone will be able to
reproduce the problem.

Best regards,

E.


On Mon, Nov 10, 2014 at 5:48 PM, Emmanuel Thomé
<emmanuel.th...@gmail.com> wrote:
> Thanks for your answer.
>
> On Mon, Nov 10, 2014 at 4:31 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>> Just really quick off the top of my head, mmaping relies on the virtual
>> memory subsystem, whereas IB RDMA operations rely on physical memory being
>> pinned (unswappable.)
>
> Yes. Does that mean that the result of computations should be
> undefined if I happen to give a user buffer which corresponds to a
> file? That would be surprising.
>
>> For a large message transfer, the OpenIB BTL will
>> register the user buffer, which will pin the pages and make them
>> unswappable.
>
> Yes. But what are the semantics of pinning the VM area pointed to by
> ptr if ptr happens to be mmaped from a file?
>
>> If the data being transfered is small, you'll copy-in/out to
>> internal bounce buffers and you shouldn't have issues.
>
> Are you saying that the openib layer does have provision in this case
> for letting the RDMA happen with a pinned physical memory range, and
> later perform the copy to the file-backed mmaped range? That would
> make perfect sense indeed, although I don't have enough familiarity
> with the OMPI code to see where it happens, and more importantly
> whether the completion properly waits for this post-RDMA copy to
> complete.
>
>
>> 1.If you try to just bcast a few kilobytes of data using this technique, do
>> you run into issues?
>
> No. All "simpler" attempts were successful, unfortunately. Can you be
> a little bit more precise about what scenario you imagine? The
> setting "all ranks mmap a local file, and rank 0 broadcasts there" is
> successful.
>
>> 2. How large is the data in the collective (input and output), is in_place
>> used? I'm guessing it's large enough that the BTL tries to work with the user
>> buffer.
>
> MPI_IN_PLACE is used in reduce_scatter and allgather in the code.
> Collectives are with communicators of 2 nodes, and we're talking (for
> the smallest failing run) 8kb per node (i.e. 16kb total for an
> allgather).
>
> E.
>
>> On Mon, Nov 10, 2014 at 9:29 AM, Emmanuel Thomé <emmanuel.th...@gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I'm stumbling on a problem related to the openib btl in
>>> openmpi-1.[78].*, and the (I think legitimate) use of file-backed
>>> mmaped areas for receiving data through MPI collective calls.
>>>
>>> A test case is attached. I've tried to make it reasonably small,
>>> although I recognize that it's not extra thin. The test case is a
>>> trimmed down version of what I witness in the context of a rather
>>> large program, so there is no claim of relevance of the test case
>>> itself. It's here just to trigger the desired misbehaviour. The test
>>> case contains some detailed information on what is done, and the
>>> experiments I did.
>>>
>>> In a nutshell, the problem is as follows.
>>>
>>>  - I do a computation, which involves MPI_Reduce_scatter and
>>> MPI_Allgather.
>>>  - I save the result to a file (collective operation).
>>>
>>> *If* I save the file using something such as:
>>>  fd = open("blah", ...
>>>  area = mmap(..., fd, )
>>>  MPI_Gather(..., area, ...)
>>> *AND* the MPI_Reduce_scatter is done with an alternative
>>> implementation (which I believe is correct)
>>> *AND* communication is done through the openib btl,
>>>
>>> then the file which gets saved is inconsistent with what is obtained
>>> with the normal MPI_Reduce_scatter (although memory areas do coincide
>>> before the save).
>>>
>>> I tried to dig a bit in the openib internals, but all I've been able
>>> to witness was beyond my expertise (an RDMA read not transferring the
>>> expected data, but I'm too uncomfortable with this layer to say
>>> anything I'm sure about).
>>>
>>> Tests have been done with several openmpi versions including 1.8.3, on
>>> a debian wheezy (7.5) + OFED 2.3 cluster.
>>>
>>> It would be great if someone could tell me whether they are able to
>>> reproduce the bug, or whether something done in this test case is
>>> illegal in any respect. I'd be glad to provide further information
>>> which could be of any help.
>>>
>>> Best regards,
>>>
>>> E. Thomé.
>>>
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <assert.h>
#include <mpi.h>
#include <sys/mman.h>

/* This test file illustrates how, in certain circumstances, a file-backed
 * mmaped area cannot correctly receive data sent by an MPI_Send call.
 *
 * This program is meant to run on 2 distinct nodes connected with
 * InfiniBand.
 *
 * The program's normal behaviour is to print output similar to the
 * following (order may vary):
per-node buffer has size 12232 bytes
node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
 *
 * Abnormal behaviour is when the job ends with MPI_Abort after printing
 * a line such as:
node 0 iteration 1, lead word received from peer is 0x00000000 [NOK]
 *
 * Each iteration of the main loop does the same thing.
 *  - ranks 0 and 1 both own a local buffer.
 *  - each fills it with (deterministically known) data.
 *  - rank 0 collects the data from rank 1's local buffer
 *    (whose contents should be no mystery), and writes this to a
 *    file-backed mmaped area.
 *  - rank 0 compares what it receives with what it knows it *should
 *    have* received.
 *
 * The final check performed by rank 0 fails if the following conditions
 * are met:
 *
 *  - the openib btl is used between the 2 nodes
 *  - a file-backed mmaped area is used for receiving the data.
 *  - the write is done to a newly created file.
 *  - per-node buffer is large enough.
 * 
 * The first condition is controlled by the btl mca.
 * The second one is controlled by the flag below. Undefining it makes
 * the bug go away.
 */
#define USE_MMAP_FOR_FILE_IO
/* Writing to a newly created file is achieved by enabling either of the
 * two flags below. Undefining both makes the bug go away. */
#define WRITE_TO_TEMP_FILE_FIRST
#define xERASE_FILE_AFTER_WRITE
/* The size of the per-node buffer is controlled by the -s command line
 * argument */


/*
 * This is a simplified version of a more complex testcase, which showed
 * erroneous behaviour with openmpi versions 1.7 to 1.8.4rc1
 * (current), but not with older openmpi, nor with
 * mvapich2-2.1a
 *
 * The current file has only been tested with openmpi 1.8.3 and 1.8.4rc1.
 */

/* For compiling, one may do:
     
      MPI=$HOME/Packages/openmpi-1.8.3
      $MPI/bin/mpicc -W -Wall -std=c99 -O0 -g prog5.c
     
 * For running, assuming /tmp/hosts contains the list of 2 nodes, and
 * $SSH is used to connect to these:
     
      SSH_AUTH_SOCK= DISPLAY= $MPI/bin/mpiexec -machinefile /tmp/hosts --mca plm_rsh_agent $SSH --mca rmaps_base_mapping_policy node -n 2  ./a.out -s 2048
     
 * If you pass the argument --mca btl ^openib to the mpiexec command
 * line, no bug.
 */

/*
 * Tested (FAIL means that setting USE_MMAP_FOR_FILE_IO above leads to a
 * program failure, while the program succeeds if it is unset).
 *
 * IB boards MCX353A-FCBT, fw rev 2.32.5100, MLNX_OFED_LINUX-2.3-1.0.1-debian7.5-x86_64
 * openmpi-1.8.4rc1 FAIL   (ok with --mca btl ^openib)
 * openmpi-1.8.3 FAIL      (ok with --mca btl ^openib)
 *
 * The previous, longer test case also failed with IB boards MHGH29-XTC.
 */


unsigned long * get_file_write_buffer(const char * filename, int * pfd, size_t chunksize, MPI_Comm wr)
{
    int njobs;
    int rank;
    MPI_Comm_size(wr, &njobs);
    MPI_Comm_rank(wr, &rank);

    *pfd = -1;
    if (rank) return NULL;

    size_t siz = njobs * chunksize;
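    /* round the total size up to a multiple of the page size */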
    size_t wsiz = ((siz - 1) | (sysconf (_SC_PAGESIZE)-1)) + 1;

    void * recvbuf = NULL;      // only used by leader
    int fd = -1;                // only used by leader

    int rc;

    char * filename_pre;
    rc = asprintf(&filename_pre, "%s.tmp", filename);
    if (rc < 0) abort();
#ifdef  WRITE_TO_TEMP_FILE_FIRST
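    /* O_EXCL guarantees that the write goes to a newly created file */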
    fd = open(filename_pre, O_RDWR | O_CREAT | O_EXCL, 0666);
#else
    fd = open(filename, O_RDWR | O_CREAT, 0666);
#endif
    if (fd < 0) abort();

    rc = ftruncate(fd, wsiz);
    if (rc < 0) abort();

#ifdef USE_MMAP_FOR_FILE_IO
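    /* this file-backed mapping is later handed to MPI_Recv as the receive buffer */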
    recvbuf = mmap(NULL, wsiz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (recvbuf == MAP_FAILED) abort();
#else
    recvbuf = malloc(wsiz);
    if (!recvbuf) abort();
#endif
    *pfd = fd;
    free(filename_pre);

    return recvbuf;
}

#ifndef MAYBE_UNUSED
#ifdef __GNUC__
#define MAYBE_UNUSED __attribute__ ((unused))
#else
#define MAYBE_UNUSED
#endif
#endif

void flush_file_write_buffer(const char * filename MAYBE_UNUSED, void * buf, int fd, size_t chunksize, MPI_Comm wr)
{
    int njobs;
    int rank;
    MPI_Comm_size(wr, &njobs);
    MPI_Comm_rank(wr, &rank);
    if (rank) return;
    size_t siz = njobs * chunksize;
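    /* same page-size rounding as in get_file_write_buffer() */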
    size_t wsiz = ((siz - 1) | (sysconf (_SC_PAGESIZE)-1)) + 1;

#ifdef USE_MMAP_FOR_FILE_IO
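    /* MAP_SHARED: data written through the mapping is carried through to the backing file */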
    munmap(buf, wsiz);
#else
    write(fd, buf, wsiz);
    free(buf);
#endif
    int rc = ftruncate(fd, siz);
    close(fd);
    if (rc < 0) abort();

#ifdef  WRITE_TO_TEMP_FILE_FIRST
    char * filename_pre;
    rc = asprintf(&filename_pre, "%s.tmp", filename);
    if (rc < 0) abort();
    /* unlink before rename is necessary under windows */
    unlink(filename);
    rc = rename(filename_pre, filename);
    if (rc < 0) abort();
    free(filename_pre);
#endif

#ifdef  ERASE_FILE_AFTER_WRITE
    unlink(filename);
#endif
}

int main(int argc, char * argv[])
{
    MPI_Init(&argc, &argv);
    int size;
    int rank;
    int eitems = 1530;  /* eitems >= 1530 seems to fail on my cluster */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size != 2) abort();

    if (argc >= 3 && strcmp(argv[1], "-s") == 0) {
        eitems = atoi(argv[2]);
    }

    size_t chunksize = eitems * sizeof(unsigned long);
    if (!rank)
        printf("per-node buffer has size %zu bytes\n", chunksize);

    for(int iter = 0 ; iter < 4 ; iter++) {
        unsigned long magic = (1 + iter) << 10;
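        /* magic encodes the iteration number, so each iteration carries fresh, recognizable data */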
        unsigned long * localbuf = malloc(chunksize);
        if (!localbuf) abort();

        for(int item = 0 ; item < eitems ; item++) {
            localbuf[item] = magic + rank;
        }

        /* rank 0: localbuf has {magic, ... }
         * rank 1: localbuf has {magic+1, ... }
         */

        int ok = 1;

        if (rank == 0) {
            int fd;
            unsigned long * ptr = get_file_write_buffer("/tmp/u", &fd, chunksize, MPI_COMM_WORLD);
            /* fill first half with local data at rank 0 */
            memcpy(ptr, localbuf, chunksize);
            /* fill second half with data from rank 1 */
            MPI_Recv(ptr + eitems, eitems, MPI_UNSIGNED_LONG, !rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* we're at rank 0, ptr[0] should be magic, and ptr[eitems]
             * should be magic + 1*/
            ok = (ptr[eitems] == magic + !rank);
            printf("node %d iteration %d, lead word received from peer is 0x%08lx [%s]\n", rank, iter, ptr[eitems], ok?"ok":"NOK");
            flush_file_write_buffer("/tmp/u", ptr, fd, chunksize, MPI_COMM_WORLD);
        } else {
            /* send our local data to rank 0 for feeding the write buffer */
            MPI_Send(localbuf, eitems, MPI_UNSIGNED_LONG, 0, 0, MPI_COMM_WORLD);
        }

        /* only rank 0 has performed a new check */
        MPI_Bcast(&ok, 1, MPI_INT, 0, MPI_COMM_WORLD);
        free(localbuf);

        if (!ok) MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Finalize();
}
