On Wed, Feb 24, 2021 at 06:17:01PM +0000, Kwok Cheung Yeung wrote:
> > 1) while linux --enable-futex and accel gomp_sem_t is small (int), rtems
> > and especially posix gomp_sem_t is large; so while it might be a good
> > idea to inline gomp_sem_t on config/{linux,accel} into the union, for
> > the rest it might be better to use indirection; if it is only for the
> > undeferred tasks, it could be just using an automatic variable and
> > put into the struct address of that; could be done either always,
> > or define some macro in config/{linux,accel}/sem.h that gomp_sem_t is
> > small and decide on the indirection based on that macro
>
> I think a pointer to an automatic variable would be simplest.

Agreed.
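I mean something along these lines in the undeferred path of GOMP_task
(an untested sketch; task.completion_sem is the pointer member from the
new union, the surrounding code is from memory):

  struct gomp_task task;
  gomp_sem_t completion_sem;
  ...
  gomp_init_task (&task, thr->task, gomp_icv (false));
  ...
  if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
    {
      /* The semaphore only has to live until GOMP_task returns, so an
	 automatic variable is enough; the task just stores its address.  */
      gomp_sem_init (&completion_sem, 0);
      task.completion_sem = &completion_sem;
      /* ... publish the event handle through *(void **) detach as the
	 patch already does ...  */
    }
  ...
  if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
    {
      gomp_sem_wait (&completion_sem);
      gomp_sem_destroy (&completion_sem);
    }

That way nothing detach related needs to live in the struct for the
undeferred case beyond the single pointer.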
> Can anything in cpyfn make use of the fact that kind==GOMP_TASK_UNDEFERRED
> while executing it? Anyway, if we want to keep this, then I suppose we could
> just add an extra field deferred_p that does not change for the lifetime of
> the task to indicate that the task is 'really' a deferred task.

Adding a bool is fine, but see below.

> > 3) kind is not constant, for the deferred tasks it can change over the
> > lifetime of the task, as you've said in the comments, it is kind ==
> > GOMP_TASK_UNDEFERRED vs. other values; while the changes of task->kind
> > are done while holding the task lock, omp_fulfill_event reads it before
> > locking that lock, so I think it needs to be done using
> > if (__atomic_load_n (&task->kind, MEMMODEL_RELAXED) ==
> >     GOMP_TASK_UNDEFERRED)
> > Pedantically the stores to task->kind also need to be done
> > with __atomic_store_n MEMMODEL_RELAXED.
>
> If we check task->deferred_p instead (which never changes for a task after
> instantiation), is that still necessary?

Not for kind or the new field.

> > - in gomp_barrier_handle_tasks the reason for if (new_tasks > 1)
> > is that if there is a single dependent task, the current thread
> > just finished handling one task and so can take that single task and so no
> > need to wake up.  While in the omp_fulfill_event case, even if there
> > is just one new task, we need to schedule it to some thread and so
> > is desirable to wake some thread.
>
> In that case, we could just do 'if (new_tasks > 0)' instead?

Yes.

> > All we know
> > (if team == gomp_thread ()->ts.team) is that at least one thread is doing
> > something else but that one could be busy for quite some time.
>
> Well, it should still get around to the new task eventually, so there is no
> problem in terms of correctness here. I suppose we could always wake up one
> more thread than strictly necessary, but that might have knock-on effects on
> performance elsewhere?

Yeah, waking something unnecessarily is always going to cause performance
problems.

> I have applied your patch to move the gomp_team_barrier_done, and in
> omp_fulfill_event, I ensure that a single thread is woken up so that
> gomp_barrier_handle_tasks can signal for the barrier to finish.
>
> I'm having some trouble coming up with a testcase to test this scenario
> though. I tried having a testcase like this to have threads in separate
> teams:

The unshackled thread testcase would probably need a pthread_create call
and restricting the testcase to POSIX threads targets.
The teams in host teams (or target) don't have, at least in OpenMP, a way
to serialize; e.g. it can always be implemented like we do ATM.
But I guess that testcase can be done incrementally.

> @@ -545,8 +548,15 @@ struct gomp_task
>       entries and the gomp_task in which they reside.  */
>    struct priority_node pnode[3];
>
> -  bool detach;
> -  gomp_sem_t completion_sem;
> +  union {
> +    /* Valid only if deferred_p is false.  */
> +    gomp_sem_t *completion_sem;
> +    /* Valid only if deferred_p is true.  Set to the team that executes the
> +       task if the task is detached and the completion event has yet to be
> +       fulfilled.  */
> +    struct gomp_team *detach_team;
> +  };
> +  bool deferred_p;
>
>    struct gomp_task_icv icv;
>    void (*fn) (void *);

What I don't like is that this creates too much wasteful padding in a
struct that should be as small as possible.
At least on 64-bit hosts, which we care about most, pahole shows with
your patch:

struct gomp_task {
	struct gomp_task *         parent;               /*     0     8 */
	struct priority_queue      children_queue;       /*     8    32 */
	struct gomp_taskgroup *    taskgroup;            /*    40     8 */
	struct gomp_dependers_vec * dependers;           /*    48     8 */
	struct htab *              depend_hash;          /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct gomp_taskwait *     taskwait;              /*    64     8 */
	size_t                     depend_count;          /*    72     8 */
	size_t                     num_dependees;         /*    80     8 */
	int                        priority;              /*    88     4 */

	/* XXX 4 bytes hole, try to pack */

	struct priority_node       pnode[3];              /*    96    48 */
	/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
	union {
		gomp_sem_t *       completion_sem;        /*   144     8 */
		struct gomp_team * detach_team;           /*   144     8 */
	};                                                /*   144     8 */
	_Bool                      deferred_p;            /*   152     1 */

	/* XXX 7 bytes hole, try to pack */

	struct gomp_task_icv       icv;                   /*   160    40 */
	/* --- cacheline 3 boundary (192 bytes) was 8 bytes ago --- */
	void                       (*fn)(void *);         /*   200     8 */
	void *                     fn_data;               /*   208     8 */
	enum gomp_task_kind        kind;                  /*   216     4 */
	_Bool                      in_tied_task;          /*   220     1 */
	_Bool                      final_task;            /*   221     1 */
	_Bool                      copy_ctors_done;       /*   222     1 */
	_Bool                      parent_depends_on;     /*   223     1 */
	struct gomp_task_depend_entry depend[];           /*   224     0 */

	/* size: 224, cachelines: 4, members: 21 */
	/* sum members: 213, holes: 2, sum holes: 11 */
	/* last cacheline: 32 bytes */
};

So perhaps it might be better to put the new two fields before the
int priority;
field, in the order
union { ... };
bool deferred_p;
That way, there will be just a 3 byte hole in the whole struct, not
4 + 7 byte holes.
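If I got the alignments right, that reordering would lay out as follows
(same 64-bit host; hand-computed rather than actual pahole output):

	size_t                     num_dependees;         /*    80     8 */
	union {
		gomp_sem_t *       completion_sem;        /*    88     8 */
		struct gomp_team * detach_team;           /*    88     8 */
	};                                                /*    88     8 */
	_Bool                      deferred_p;            /*    96     1 */

	/* XXX 3 bytes hole, try to pack */

	int                        priority;              /*   100     4 */
	struct priority_node       pnode[3];              /*   104    48 */

i.e. the union reuses the 8 byte slot right after num_dependees, only a
single 3 byte hole remains between deferred_p and priority, and the
whole struct shrinks from 224 to 216 bytes.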
> -  if (task.detach && !task_fulfilled_p (&task))
> -    gomp_sem_wait (&task.completion_sem);
> +  if ((flags & GOMP_TASK_FLAG_DETACH) != 0 && detach)
> +    gomp_sem_wait (&completion_sem);

I think a gomp_sem_destroy is missing here (in the conditional, as the
semaphore was only conditionally initialized).
Furthermore, I don't understand the && detach: the earlier code assumes
that if (flags & GOMP_TASK_FLAG_DETACH) != 0 then it can dereference
*(void **) detach, so the && detach seems to be unnecessary.

> @@ -484,15 +483,16 @@ GOMP_task (void (*fn) (void *), void *data, void
> (*cpyfn) (void *, void *),
>        task->kind = GOMP_TASK_UNDEFERRED;
>        task->in_tied_task = parent->in_tied_task;
>        task->taskgroup = taskgroup;
> +      task->deferred_p = true;
>        if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
>  	{
> -	  task->detach = true;
> -	  gomp_sem_init (&task->completion_sem, 0);
> -	  *(void **) detach = &task->completion_sem;

I think you can move the task->deferred_p = true; into the if stmt.

> +      if (!shackled_thread_p
> +	  && !do_wake
> +	  && gomp_team_barrier_waiting_for_tasks (&team->barrier)
> +	  && team->task_detach_count == 0)

The && team->task_detach_count == 0 check is cheaper than the
&& gomp_team_barrier_waiting_for_tasks (&team->barrier) one, so please
swap those two.

> +	{
> +	  /* Ensure that at least one thread is woken up to signal that the
> +	     barrier can finish.  */
> +	  do_wake = 1;
> +	}

Please drop the {}s around the single do_wake = 1; stmt.

Otherwise LGTM.

	Jakub