Jeff, Nathan,

Thank you for the positive feedback. I took the chance to look at current master and tried (based on my humble understanding of the Open MPI internals) to remove the error check in ompi_osc_pt2pt_flush. When testing with the example code I sent initially, I first hit a segfault caused by infinite recursion in ompi_osc_pt2pt_frag_alloc. After fixing that locally (see the attached patch), I am now seeing a segfault somewhere below opal_progress():

```
Thread 5 "a.out" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdcf08700 (LWP 23912)]
0x00007fffe77b48ad in mca_pml_ob1_recv_frag_callback_match () from /home/joseph/opt/openmpi-master/lib/openmpi/mca_pml_ob1.so
(gdb) bt
#0  0x00007fffe77b48ad in mca_pml_ob1_recv_frag_callback_match () from /home/joseph/opt/openmpi-master/lib/openmpi/mca_pml_ob1.so
#1  0x00007fffec985d0d in mca_btl_vader_component_progress () from /home/joseph/opt/openmpi-master/lib/openmpi/mca_btl_vader.so
#2  0x00007ffff6d4220c in opal_progress () from /home/joseph/opt/openmpi-master/lib/libopen-pal.so.0
#3  0x00007fffe6754c55 in ompi_osc_pt2pt_flush_lock () from /home/joseph/opt/openmpi-master/lib/openmpi/mca_osc_pt2pt.so
#4  0x00007fffe67574df in ompi_osc_pt2pt_flush () from /home/joseph/opt/openmpi-master/lib/openmpi/mca_osc_pt2pt.so
#5  0x00007ffff7b608bc in PMPI_Win_flush () from /home/joseph/opt/openmpi-master/lib/libmpi.so.0
#6  0x0000000000401149 in put_blocking ()
#7  0x000000000040140f in main._omp_fn ()
#8  0x00007ffff78bfe46 in gomp_thread_start (xdata=<optimized out>) at ../../../src/libgomp/team.c:119
#9  0x00007ffff76936ba in start_thread (arg=0x7fffdcf08700) at pthread_create.c:333
#10 0x00007ffff73c982d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```
I guess there is a race condition somewhere. Unfortunately, the standard build has no debug symbols (which is probably to be expected), so I recompiled master with --enable-debug. Now I am facing a segfault in mpirun itself:

```
Thread 1 "mpirun" received signal SIGSEGV, Segmentation fault.
0x00007fffd1e2590e in external_close () from /home/joseph/opt/openmpi-master/lib/openmpi/mca_pmix_pmix3x.so
(gdb) bt
#0  0x00007fffd1e2590e in external_close () from /home/joseph/opt/openmpi-master/lib/openmpi/mca_pmix_pmix3x.so
#1  0x00007ffff77f23f5 in mca_base_component_close (component=0x7fffd20afde0 <mca_pmix_pmix3x_component>, output_id=-1) at ../../../../opal/mca/base/mca_base_components_close.c:53
#2  0x00007ffff77f24b5 in mca_base_components_close (output_id=-1, components=0x7ffff7abe250 <opal_pmix_base_framework+80>, skip=0x7fffd23c9dc0 <mca_pmix_pmix2x_component>)
    at ../../../../opal/mca/base/mca_base_components_close.c:85
#3  0x00007ffff77f2867 in mca_base_select (type_name=0x7ffff789652c "pmix", output_id=-1, components_available=0x7ffff7abe250 <opal_pmix_base_framework+80>, best_module=0x7fffffffd6a0, best_component=0x7fffffffd698, priority_out=0x0)
    at ../../../../opal/mca/base/mca_base_components_select.c:141
#4  0x00007ffff786d083 in opal_pmix_base_select () at ../../../../opal/mca/pmix/base/pmix_base_select.c:35
#5  0x00007ffff5339ca8 in rte_init () at ../../../../../orte/mca/ess/hnp/ess_hnp_module.c:640
#6  0x00007ffff7ae0b62 in orte_init (pargc=0x7fffffffd86c, pargv=0x7fffffffd860, flags=4) at ../../orte/runtime/orte_init.c:243
#7  0x00007ffff7b2290b in orte_submit_init (argc=6, argv=0x7fffffffde58, opts=0x0) at ../../orte/orted/orted_submit.c:535
#8  0x00000000004012d7 in orterun (argc=6, argv=0x7fffffffde58) at ../../../../orte/tools/orterun/orterun.c:133
#9  0x0000000000400fd6 in main (argc=6, argv=0x7fffffffde58) at ../../../../orte/tools/orterun/main.c:13
```

I'm giving up here for tonight. Please let me know if I can help with anything else.


Joseph


On 03/07/2017 05:43 PM, Jeff Hammond wrote:
Nathan and I discussed this at the MPI Forum last week. I argued that your usage is not erroneous, although certain pathological cases (likely concocted) can lead to nasty behavior. He indicated that he would remove the error check, but it may require further discussion/debate with others.

You can remove the error check from the source and recompile if you are in a hurry, or you can use an MPICH derivative (I have not checked, but I doubt MPICH errors on this code).

Jeff

On Mon, Mar 6, 2017 at 8:30 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:

    Ping :) I would really appreciate any input on my question below.
    I crawled through the standard but cannot seem to find the wording
    that prohibits thread-concurrent access and synchronization.

    Using MPI_Rget works in our case, but MPI_Rput only guarantees
    local completion, not remote completion. Specifically, a
    thread-parallel application would have to enter a serial region
    just to issue an MPI_Win_flush before a thread can read back a
    value it previously wrote to the same target. Re-reading remote
    values from the same process/thread may not be efficient, but it
    is a valid use case for us.
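
    For illustration, here is a minimal sketch of what I mean. It is
    a hypothetical helper, not code from our application; win, target
    and disp are placeholders for a window locked with
    MPI_Win_lock_all and a remote location:

    ```
    #include <mpi.h>
    #include <stdint.h>

    /* Hypothetical sketch: 'win' is assumed to be locked with
     * MPI_Win_lock_all; 'target'/'disp' identify the remote location. */
    static void put_then_read(MPI_Win win, int target, MPI_Aint disp)
    {
        uint64_t value = 42, check;
        MPI_Request req;

        MPI_Rput(&value, 1, MPI_UINT64_T, target, disp,
                 1, MPI_UINT64_T, win, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE); /* local completion only */

        /* Remote completion (the value being visible at the target)
         * still requires the flush in question: */
        MPI_Win_flush(target, win);

        /* Only now can this thread safely read back what it wrote: */
        MPI_Get(&check, 1, MPI_UINT64_T, target, disp,
                1, MPI_UINT64_T, win);
        MPI_Win_flush(target, win);
    }
    ```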

    Best regards,
    Joseph



    On 02/20/2017 09:23 AM, Joseph Schuchart wrote:

        Nathan,

        Thanks for your clarification. Just so that I understand
        where my misunderstanding of this matter comes from: can you
        please point me to the place in the standard that prohibits
        thread-concurrent window synchronization using
        MPI_Win_flush[_all]? I cannot seem to find such a passage
        either in 11.5.4 (Flush and Sync) or in 12.4 (MPI and
        Threads). The latter explicitly restricts waiting on the same
        request object (which we do not do) and collective operations
        on the same communicator (which MPI_Win_flush is not), but it
        does not mention one-sided non-collective synchronization
        operations. Any hint would be much appreciated.

        We will look at MPI_Rput and MPI_Rget. However, a single put
        paired with a flush is just the simplest case. We also want
        to support multiple asynchronous operations that are
        eventually synchronized on a per-thread basis, where keeping
        the request handles might not be feasible.
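
        To illustrate, here is a rough sketch of the per-thread usage
        I have in mind. It is a hypothetical helper, not code from
        our application; win, target and disp are placeholders:

        ```
        #include <mpi.h>
        #include <stdint.h>

        /* Hypothetical sketch: a batch of puts completed by a single
         * flush, instead of tracking one request per operation. */
        static void batched_puts(MPI_Win win, int target, MPI_Aint disp,
                                 const uint64_t *values, int n)
        {
            for (int i = 0; i < n; ++i) {
                MPI_Put(&values[i], 1, MPI_UINT64_T, target,
                        disp + (MPI_Aint)(i * sizeof(uint64_t)),
                        1, MPI_UINT64_T, win);
            }
            /* one synchronization point completes all n operations
             * issued by this thread; with MPI_Rput we would have to
             * keep n request handles */
            MPI_Win_flush(target, win);
        }
        ```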

        Thanks,
        Joseph

        On 02/20/2017 02:30 AM, Nathan Hjelm wrote:

            You cannot perform synchronization at the same time as
            communication on the same target. This means that if one
            thread is in MPI_Put/MPI_Get/MPI_Accumulate on a target,
            you can't have another thread in MPI_Win_flush (target)
            or MPI_Win_flush_all(). If your program does that, it is
            not a valid MPI program. If you want to ensure that a
            particular put operation is complete, try MPI_Rput
            instead.
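
            To make that concrete, this is roughly the combination I
            mean (a sketch only; win, target and disp are
            placeholders), i.e. the pattern you can NOT use:

            ```
            #include <mpi.h>
            #include <stdint.h>

            /* Invalid per the rule above: one thread communicates
             * with 'target' while another thread flushes the same
             * target on the same window. */
            static void concurrent_put_and_flush(MPI_Win win,
                                                 int target,
                                                 MPI_Aint disp)
            {
                uint64_t value = 1;
                #pragma omp parallel sections
                {
                    #pragma omp section
                    MPI_Put(&value, 1, MPI_UINT64_T, target, disp,
                            1, MPI_UINT64_T, win);  /* communication */
                    #pragma omp section
                    MPI_Win_flush(target, win);     /* synchronization */
                }
            }
            ```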

            -Nathan

                On Feb 19, 2017, at 2:34 PM, Joseph Schuchart
                <schuch...@hlrs.de> wrote:

                All,

                We are trying to combine MPI_Put and MPI_Win_flush on
                locked (using MPI_Win_lock_all) dynamic windows to
                mimic a blocking put. The application is (potentially)
                multi-threaded, so we rely on MPI_THREAD_MULTIPLE
                support being available.
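
                For reference, here is a minimal sketch of that
                pattern. It is not the attached reproducer and all
                names are made up; error handling and output are
                omitted:

                ```
                #include <mpi.h>
                #include <stdint.h>
                #include <stdlib.h>

                /* Sketch only: each thread emulates a blocking put
                 * with MPI_Put + MPI_Win_flush on a dynamic window
                 * locked with MPI_Win_lock_all. */
                int main(int argc, char **argv)
                {
                    int provided, rank, size;
                    MPI_Init_thread(&argc, &argv,
                                    MPI_THREAD_MULTIPLE, &provided);
                    if (provided < MPI_THREAD_MULTIPLE)
                        MPI_Abort(MPI_COMM_WORLD, 1);

                    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                    MPI_Comm_size(MPI_COMM_WORLD, &size);

                    MPI_Win win;
                    MPI_Win_create_dynamic(MPI_INFO_NULL,
                                           MPI_COMM_WORLD, &win);

                    uint64_t local = 0;
                    MPI_Win_attach(win, &local, sizeof(local));

                    /* exchange attached addresses so every rank can
                     * target the others */
                    MPI_Aint my_addr;
                    MPI_Aint *addrs = malloc(size * sizeof(MPI_Aint));
                    MPI_Get_address(&local, &my_addr);
                    MPI_Allgather(&my_addr, 1, MPI_AINT,
                                  addrs, 1, MPI_AINT, MPI_COMM_WORLD);

                    MPI_Win_lock_all(0, win);

                    #pragma omp parallel
                    {
                        uint64_t value = 42;
                        int target = (rank + 1) % size;
                        /* per-thread "blocking put" */
                        MPI_Put(&value, 1, MPI_UINT64_T, target,
                                addrs[target], 1, MPI_UINT64_T, win);
                        /* hangs or aborts with MPI_ERR_RMA_SYNC: */
                        MPI_Win_flush(target, win);
                    }

                    MPI_Win_unlock_all(win);
                    MPI_Win_detach(win, &local);
                    MPI_Win_free(&win);
                    free(addrs);
                    MPI_Finalize();
                    return 0;
                }
                ```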

                When I try to use this combination (MPI_Put +
                MPI_Win_flush) in our application, I am seeing threads
                occasionally hang in MPI_Win_flush, probably waiting
                for some progress to happen. However, when I try to
                create a small reproducer (attached, the original
                application has multiple layers of abstraction), I am
                seeing fatal errors in MPI_Win_flush if using more
                than one thread:

                ```
                [beryl:18037] *** An error occurred in MPI_Win_flush
                [beryl:18037] *** reported by process [4020043777,2]
                [beryl:18037] *** on win pt2pt window 3
                [beryl:18037] *** MPI_ERR_RMA_SYNC: error executing
                rma sync
                [beryl:18037] *** MPI_ERRORS_ARE_FATAL (processes in
                this win will now abort,
                [beryl:18037] ***    and potentially your MPI job)
                ```

                I could only trigger this on dynamic windows with
                multiple concurrent threads running.

                So: is this a valid MPI program (apart from the
                missing clean-up at the end ;))? It seems to run fine
                with MPICH, but maybe they are more tolerant of some
                programming errors...

                If it is a valid MPI program, I assume there is some
                race condition in MPI_Win_flush that leads to the
                fatal error (or the hang that I observe otherwise)?

                I tested this with Open MPI 1.10.5 on a single-node
                Linux Mint 18.1 system with the stock kernel 4.8.0-36
                (aka my laptop). Open MPI and the test were both
                compiled with GCC 5.3.0. I could not run it with
                Open MPI 2.0.2 due to the fatal error in
                MPI_Win_create (which also applies to
                MPI_Win_create_dynamic, see my other thread; not sure
                if they are related).

                Please let me know whether this is a valid use case
                and whether I can provide you with any additional
                information.

                Many thanks in advance!

                Cheers
                Joseph

                --
                Dipl.-Inf. Joseph Schuchart
                High Performance Computing Center Stuttgart (HLRS)
                Nobelstr. 19
                D-70569 Stuttgart

                Tel.: +49(0)711-68565890
                Fax: +49(0)711-6856832
                E-Mail: schuch...@hlrs.de

                
                <ompi_flush_hang.c>




    --
    Dipl.-Inf. Joseph Schuchart
    High Performance Computing Center Stuttgart (HLRS)
    Nobelstr. 19
    D-70569 Stuttgart

    Tel.: +49(0)711-68565890
    Fax: +49(0)711-6856832
    E-Mail: schuch...@hlrs.de





--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/



--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

diff --git a/ompi/mca/osc/pt2pt/osc_pt2pt_passive_target.c b/ompi/mca/osc/pt2pt/osc_pt2pt_passive_target.c
index 819e737..f24e40a 100644
--- a/ompi/mca/osc/pt2pt/osc_pt2pt_passive_target.c
+++ b/ompi/mca/osc/pt2pt/osc_pt2pt_passive_target.c
@@ -554,13 +554,11 @@ int ompi_osc_pt2pt_flush (int target, struct ompi_win_t *win)
         }
     }
     OPAL_THREAD_UNLOCK(&module->lock);
-    if (OPAL_UNLIKELY(NULL == lock)) {
-        OPAL_OUTPUT_VERBOSE((25, ompi_osc_base_framework.framework_output,
-                             "ompi_osc_pt2pt_flush: target %d is not locked in window %s",
-                             target, win->w_name));
-        ret = OMPI_ERR_RMA_SYNC;
-    } else {
+    if (OPAL_LIKELY(NULL != lock))
+    {
         ret = ompi_osc_pt2pt_flush_lock (module, lock, target);
+    } else {
+        ret = MPI_SUCCESS;
     }
 
     return ret;
diff --git a/ompi/mca/osc/pt2pt/osc_pt2pt_frag.h b/ompi/mca/osc/pt2pt/osc_pt2pt_frag.h
index f4e05a1..10dc2c0 100644
--- a/ompi/mca/osc/pt2pt/osc_pt2pt_frag.h
+++ b/ompi/mca/osc/pt2pt/osc_pt2pt_frag.h
@@ -173,7 +173,7 @@ static inline int ompi_osc_pt2pt_frag_alloc (ompi_osc_pt2pt_module_t *module, in
     int ret;
 
     do {
-        ret = ompi_osc_pt2pt_frag_alloc (module, target, request_len , buffer, ptr, long_send, buffered);
+        ret = _ompi_osc_pt2pt_frag_alloc (module, target, request_len , buffer, ptr, long_send, buffered);
         if (OPAL_LIKELY(OMPI_SUCCESS == ret || OMPI_ERR_OUT_OF_RESOURCE != ret)) {
             break;
         }
