On Wed, Feb 24, 2021 at 06:17:01PM +0000, Kwok Cheung Yeung wrote:
> > 1) while linux --enable-futex and accel gomp_sem_t is small (int), rtems
> > and especially posix gomp_sem_t is large; so while it might be a good
> > idea to inline gomp_sem_t on config/{linux,accel} into the union, for
> > the rest it might be better to use indirection; if it is only for the
> > undeferred tasks, it could be just using an automatic variable and
> > put into the struct address of that; could be done either always,
> > or define some macro in config/{linux,accel}/sem.h that gomp_sem_t is
> > small and decide on the indirection based on that macro
>
> I think a pointer to an automatic variable would be simplest.

Agreed.
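I mean something along these lines in the undeferred path of GOMP_task
(an untested sketch; task.completion_sem is the pointer member from the
new union, the surrounding code is from memory):

  struct gomp_task task;
  gomp_sem_t completion_sem;
  ...
  gomp_init_task (&task, thr->task, gomp_icv (false));
  ...
  if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
    {
      /* The semaphore only has to live until GOMP_task returns, so an
	 automatic variable is enough; the task just stores its address.  */
      gomp_sem_init (&completion_sem, 0);
      task.completion_sem = &completion_sem;
      /* ... publish the event handle through *(void **) detach as the
	 patch already does ...  */
    }
  ...
  if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
    {
      gomp_sem_wait (&completion_sem);
      gomp_sem_destroy (&completion_sem);
    }

That way nothing detach related needs to live in the struct for the
undeferred case beyond the single pointer.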
> Can anything in cpyfn make use of the fact that kind==GOMP_TASK_UNDEFERRED
> while executing it? Anyway, if we want to keep this, then I suppose we could
> just add an extra field deferred_p that does not change for the lifetime of
> the task to indicate that the task is 'really' a deferred task.

Adding a bool is fine, but see below.

> > 3) kind is not constant, for the deferred tasks it can change over the
> > lifetime of the task, as you've said in the comments, it is kind ==
> > GOMP_TASK_UNDEFERRED vs. other values; while the changes of task->kind
> > are done while holding the task lock, omp_fulfill_event reads it before
> > locking that lock, so I think it needs to be done using
> > if (__atomic_load_n (&task->kind, MEMMODEL_RELAXED) ==
> >     GOMP_TASK_UNDEFERRED)
> > Pedantically the stores to task->kind also need to be done
> > with __atomic_store_n MEMMODEL_RELAXED.
>
> If we check task->deferred_p instead (which never changes for a task after
> instantiation), is that still necessary?

Not for kind or the new field.

> > - in gomp_barrier_handle_tasks the reason for if (new_tasks > 1)
> > is that if there is a single dependent task, the current thread
> > just finished handling one task and so can take that single task and so no
> > need to wake up.  While in the omp_fulfill_event case, even if there
> > is just one new task, we need to schedule it to some thread and so
> > is desirable to wake some thread.
>
> In that case, we could just do 'if (new_tasks > 0)' instead?

Yes.

> > All we know
> > (if team == gomp_thread ()->ts.team) is that at least one thread is doing
> > something else but that one could be busy for quite some time.
>
> Well, it should still get around to the new task eventually, so there is no
> problem in terms of correctness here. I suppose we could always wake up one
> more thread than strictly necessary, but that might have knock-on effects on
> performance elsewhere?

Yeah, waking something unnecessarily is always going to cause performance
problems.

> I have applied your patch to move the gomp_team_barrier_done, and in
> omp_fulfill_event, I ensure that a single thread is woken up so that
> gomp_barrier_handle_tasks can signal for the barrier to finish.
>
> I'm having some trouble coming up with a testcase to test this scenario
> though. I tried having a testcase like this to have threads in separate
> teams:

The unshackled thread testcase would probably need a pthread_create call
and restricting the testcase to POSIX threads targets.
The teams in host teams (or target) don't have, at least in OpenMP, a way
to serialize; e.g. it can always be implemented like we do ATM.
But I guess that testcase can be done incrementally.

> @@ -545,8 +548,15 @@ struct gomp_task
>       entries and the gomp_task in which they reside.  */
>    struct priority_node pnode[3];
>
> -  bool detach;
> -  gomp_sem_t completion_sem;
> +  union {
> +    /* Valid only if deferred_p is false.  */
> +    gomp_sem_t *completion_sem;
> +    /* Valid only if deferred_p is true.  Set to the team that executes the
> +       task if the task is detached and the completion event has yet to be
> +       fulfilled.  */
> +    struct gomp_team *detach_team;
> +  };
> +  bool deferred_p;
>
>    struct gomp_task_icv icv;
>    void (*fn) (void *);

What I don't like is that this creates too much wasteful padding in a
struct that should be as small as possible.
At least on 64-bit hosts, which we care about most, pahole shows with
your patch:

struct gomp_task {
	struct gomp_task *         parent;               /*     0     8 */
	struct priority_queue      children_queue;       /*     8    32 */
	struct gomp_taskgroup *    taskgroup;            /*    40     8 */
	struct gomp_dependers_vec * dependers;           /*    48     8 */
	struct htab *              depend_hash;          /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct gomp_taskwait *     taskwait;              /*    64     8 */
	size_t                     depend_count;          /*    72     8 */
	size_t                     num_dependees;         /*    80     8 */
	int                        priority;              /*    88     4 */

	/* XXX 4 bytes hole, try to pack */

	struct priority_node       pnode[3];              /*    96    48 */
	/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
	union {
		gomp_sem_t *       completion_sem;        /*   144     8 */
		struct gomp_team * detach_team;           /*   144     8 */
	};                                                /*   144     8 */
	_Bool                      deferred_p;            /*   152     1 */

	/* XXX 7 bytes hole, try to pack */

	struct gomp_task_icv       icv;                   /*   160    40 */
	/* --- cacheline 3 boundary (192 bytes) was 8 bytes ago --- */
	void                       (*fn)(void *);         /*   200     8 */
	void *                     fn_data;               /*   208     8 */
	enum gomp_task_kind        kind;                  /*   216     4 */
	_Bool                      in_tied_task;          /*   220     1 */
	_Bool                      final_task;            /*   221     1 */
	_Bool                      copy_ctors_done;       /*   222     1 */
	_Bool                      parent_depends_on;     /*   223     1 */
	struct gomp_task_depend_entry depend[];           /*   224     0 */

	/* size: 224, cachelines: 4, members: 21 */
	/* sum members: 213, holes: 2, sum holes: 11 */
	/* last cacheline: 32 bytes */
};

So perhaps it might be better to put the new two fields before the
int priority;
field, in the order
union { ... };
bool deferred_p;
That way, there will be just a 3 byte hole in the whole struct, not
4 + 7 byte holes.
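If I got the alignments right, that reordering would lay out as follows
(same 64-bit host; hand-computed rather than actual pahole output):

	size_t                     num_dependees;         /*    80     8 */
	union {
		gomp_sem_t *       completion_sem;        /*    88     8 */
		struct gomp_team * detach_team;           /*    88     8 */
	};                                                /*    88     8 */
	_Bool                      deferred_p;            /*    96     1 */

	/* XXX 3 bytes hole, try to pack */

	int                        priority;              /*   100     4 */
	struct priority_node       pnode[3];              /*   104    48 */

i.e. the union reuses the 8 byte slot right after num_dependees, only a
single 3 byte hole remains between deferred_p and priority, and the
whole struct shrinks from 224 to 216 bytes.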
> -  if (task.detach && !task_fulfilled_p (&task))
> -    gomp_sem_wait (&task.completion_sem);
> +  if ((flags & GOMP_TASK_FLAG_DETACH) != 0 && detach)
> +    gomp_sem_wait (&completion_sem);

I think a gomp_sem_destroy is missing here (in the conditional, as the
semaphore was only conditionally initialized).
Furthermore, I don't understand the && detach: the earlier code assumes
that if (flags & GOMP_TASK_FLAG_DETACH) != 0 then it can dereference
*(void **) detach, so the && detach seems to be unnecessary.

> @@ -484,15 +483,16 @@ GOMP_task (void (*fn) (void *), void *data, void
> (*cpyfn) (void *, void *),
>        task->kind = GOMP_TASK_UNDEFERRED;
>        task->in_tied_task = parent->in_tied_task;
>        task->taskgroup = taskgroup;
> +      task->deferred_p = true;
>        if ((flags & GOMP_TASK_FLAG_DETACH) != 0)
>  	{
> -	  task->detach = true;
> -	  gomp_sem_init (&task->completion_sem, 0);
> -	  *(void **) detach = &task->completion_sem;

I think you can move the task->deferred_p = true; into the if stmt.

> +      if (!shackled_thread_p
> +	  && !do_wake
> +	  && gomp_team_barrier_waiting_for_tasks (&team->barrier)
> +	  && team->task_detach_count == 0)

The && team->task_detach_count == 0 check is cheaper than the
&& gomp_team_barrier_waiting_for_tasks (&team->barrier) one, so please
swap those two.

> +	{
> +	  /* Ensure that at least one thread is woken up to signal that the
> +	     barrier can finish.  */
> +	  do_wake = 1;
> +	}

Please drop the {}s around the single do_wake = 1; stmt.

Otherwise LGTM.

	Jakub