Give me advice on GSoC OpenMP
Hi!

I'm Sho Nakatani, a student at the University of Tokyo, Japan. I'd like to
tackle GSoC this year! I'm trying to speed up the OpenMP implementation in GCC.

The following graph shows that OpenMP in GCC is much slower than that of the
Intel C Compiler:
https://github.com/laysakura/GCC-OpenMP-Speedup/raw/master/img/task-gcc-vs-icc.png

Here is the code I used to measure the execution time (sketched briefly at the
end of this mail):
https://github.com/laysakura/GCC-OpenMP-Speedup/blob/master/test/openmp-fibonacci.c

I compiled it with the following commands:

gcc -O3 -fopenmp -o openmp-fibonacci-gcc openmp-fibonacci.c
icc -O3 -openmp -o openmp-fibonacci-icc openmp-fibonacci.c

After that, I executed them on a machine with 32 AMD CPUs (each with 4 cores).

Currently, I'm planning to change the algorithm of the `task' primitive in
libgomp. This plan is not only for GSoC but also for my graduation thesis.
My teacher has some ideas on a better algorithm (but I haven't learned them yet).

Do the members of the GCC ML have any advice? Anything is welcome. Although I
know a fair amount about C programming and have implemented a very small C
compiler myself, I'm quite new to GCC. I welcome advice on how to get accepted
to GSoC, too :-)

Thanks,
-- Sho Nakatani
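P.S. For readers who don't want to follow the link, the benchmark has roughly
the following shape (a minimal sketch from memory, not a verbatim copy of the
linked file): each recursive call spawns an OpenMP task, so the run time is
dominated by the runtime's task machinery rather than by arithmetic.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Each recursive call spawns a task; the leaves do trivial work,
   so almost all time goes to task creation and scheduling.  */
static long
fib (int n)
{
  long a, b;
  if (n < 2)
    return n;
#pragma omp task shared(a)
  a = fib (n - 1);
#pragma omp task shared(b)
  b = fib (n - 2);
#pragma omp taskwait
  return a + b;
}

int
main (int argc, char **argv)
{
  int n = argc > 1 ? atoi (argv[1]) : 30;
  long r;
  double t0 = omp_get_wtime ();
#pragma omp parallel
#pragma omp single
  r = fib (n);
  printf ("fib(%d) = %ld in %f sec\n", n, r, omp_get_wtime () - t0);
  return 0;
}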
Re: Give me advice on GSoC OpenMP
Hi all,

I conducted another experiment.

Compiler versions:
- gcc (GCC) 4.7.0 20110331 (experimental)
- icc (ICC) 12.0.3 20110309

CPU info: 32 AMD CPUs, each with 4 cores.

I calculated fibonacci(37) with both gcc and icc. As OMP_NUM_THREADS increases,
- gcc's execution time increased steadily;
- icc's execution time decreased.

I got a graph, but it seems useless because the execution time of the program
compiled by gcc was far longer than icc's. Here is the result:

=== Execution Time with the Increase of Number of Threads ===
# OMP_NUM_THREADS   Exec time [sec]
# gcc
  1      11.90681
  2      91.97240
  4     142.69790
  8     281.87043
 16     751.86979
 32    1103.64058
# icc
  1       8.38479
  2      16.03128
  4       7.81509
  8       4.34125
 16       2.19561
 32       1.32377
===

This paper also shows that the OpenMP task implementation in GOMP is slower
than others:
https://iwomp.zih.tu-dresden.de/downloads/runtime-olivier.pdf
(There are graphs from page 5.)

So, I'd like to speed up GOMP. I'm planning to use the `Lazy Task Creation'
algorithm for it. I heard from my teacher that a senior associate of mine has
written a library for Lazy Task Creation, so I can use it for experimental
purposes.

From now on, I'll try to visualize how tasks are created as time passes, not
just to show why GOMP is slower than the icc implementation but also to make
my idea clear to myself :-)

So, I need to know an OpenMP API to get task info such as:
- a task is created
- a task finished its job
- which CPU core a task is on

Do you know of any documents?

Thanks,
-- Sho Nakatani
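P.S. As far as I know, OpenMP 3.0 itself exposes no standard task-tracing
hooks, so for the visualization I may have to instrument the benchmark by
hand. A rough sketch of what I mean (my own illustration; sched_getcpu() is
Linux/glibc-specific):

#define _GNU_SOURCE
#include <sched.h>   /* sched_getcpu () */
#include <stdio.h>
#include <omp.h>

/* Crude hand-rolled tracing: log creation, completion and the core each
   task runs on.  fprintf is not a serious tracer, but the log is enough
   to reconstruct a task tree for small inputs like fib(8).  */
static long
fib (int n)
{
  long a = 0, b = 0, r;
  fprintf (stderr, "start fib(%d): thread %d cpu %d\n",
	   n, omp_get_thread_num (), sched_getcpu ());
  if (n < 2)
    r = n;
  else
    {
#pragma omp task shared(a)
      a = fib (n - 1);
#pragma omp task shared(b)
      b = fib (n - 2);
#pragma omp taskwait
      r = a + b;
    }
  fprintf (stderr, "end   fib(%d): thread %d cpu %d\n",
	   n, omp_get_thread_num (), sched_getcpu ());
  return r;
}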
Re: Give me advice on GSoC OpenMP
From: Richard Guenther
Date: Sun, 3 Apr 2011 09:28:49 +0200

> On Sat, Apr 2, 2011 at 10:53 AM, Sho Nakatani wrote:
>> [...]
>
> What is your fibonacci testcase trying to measure? It looks like it is
> measuring thread creation/switching time only.

I tried to show that GOMP's implementation of OpenMP Task is slow. Users of
OpenMP normally expect the execution time to decrease as the number of CPU
cores increases.

Of course, the graph alone is not enough to show that the cause of the
slowness is the `task' implementation. So I'm now trying to show how threads
are created and on which cores they work, as a tree like:

| \
| \
| |\
| \
| \

Then, I'll compare the trees created by gcc and icc, and point out that icc's
implementation of OpenMP Task uses Lazy Task Creation while gcc's does not.

> Richard.

-- Sho Nakatani
Re: Give me advice on GSoC OpenMP
Hi. Sorry for being late.

> Depends on what you mean by lazy task creation, gcc schedules
> tasks lazily if they aren't if (0), some data structure is created
> for them when encountering the #pragma omp task directive, but I guess
> any implementation will do something like that.

I mean the following implementation of Lazy Task Creation (a sketch of the
data structures it implies follows at the end of this mail):

- 1 CPU core has 1 worker.
- 1 worker has 1 deque (used LIFO by its owner).
- 1 deque holds some tasks.
- What a worker does:
  - Execute tasks from the head of its own deque.
  - Steal a task from the tail of a deque on another core.
- When task A encounters a "#pragma omp task" directive, the worker creates a
  task and immediately executes it. The worker pushes A to the head of its
  deque (here a context switch occurs). This is the important point, because
  A can then move to other deques. (*)
- A worker steals a task from a deque on another core when the deque on its
  own core is empty.

A senior associate of mine has already written a library that implements this
scheduling algorithm. (It is called `MassiveThreads', but the paper describing
it is written in Japanese.) MassiveThreads has shown that this algorithm makes
things like OpenMP Task fast.

With this implementation:
- Nested `task' directives can be processed naturally.
- Suppose task A is in deque D and task A1 is created when A encounters a
  `task' directive (see (*)). A1 runs immediately after it is created. So even
  if A goes on to execute functions that take a long time to finish, that work
  can be done after A is stolen into a deque other than D.

Anyway, I'd like to read the materials that were referred to when the current
libgomp `task' support was implemented (to read the code smoothly). Do you
know of any?

> What your testcase shows is not whether tasks are created lazily or not, but
> how good/poor the #pragma omp taskwait implementation is. And, for your
> testcase, libgomp/task.c (GOMP_taskwait) definitely could be improved.
> Currently it only tries to schedule in children that will be awaited by the
> current task and, if there are no such children, goes to sleep, waiting for
> them to complete. Scheduling in random unrelated tasks is problematic,
> because the unrelated task might take too long to complete and delay the
> taskwait for way too long (note, gcc doesn't have untied tasks, all tasks
> are tied once they are scheduled onto some particular task -
> setcontext/swapcontext is quite a fragile thing to do).
> But it is true it could very well schedule tasks that are taskwaited by
> tasks taskwaited by the current task, and transitively further. Plus, be
> able to temporarily awake such a sleeping thread if there are tasks it can
> transitively taskwait for, as if those don't complete, the current taskwait
> won't return.
>
> Jakub

-- Sho Nakatani
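P.S. A minimal sketch of the per-worker deque described above (my own
illustration, not MassiveThreads or libgomp code; a real implementation would
use lock-free operations, a single mutex keeps the sketch short):

#include <pthread.h>
#include <stddef.h>

struct task
{
  void (*fn) (void *);
  void *data;
  struct task *prev, *next;	/* doubly linked for O(1) head/tail ops */
};

struct worker_deque
{
  pthread_mutex_t lock;
  struct task *head, *tail;
};

/* Owner pushes the suspended parent task at the head.  */
static void
deque_push_head (struct worker_deque *d, struct task *t)
{
  pthread_mutex_lock (&d->lock);
  t->prev = NULL;
  t->next = d->head;
  if (d->head) d->head->prev = t; else d->tail = t;
  d->head = t;
  pthread_mutex_unlock (&d->lock);
}

/* Owner pops from the head (LIFO).  */
static struct task *
deque_pop_head (struct worker_deque *d)
{
  pthread_mutex_lock (&d->lock);
  struct task *t = d->head;
  if (t)
    {
      d->head = t->next;
      if (d->head) d->head->prev = NULL; else d->tail = NULL;
    }
  pthread_mutex_unlock (&d->lock);
  return t;
}

/* An idle worker steals from the tail of another worker's deque.  */
static struct task *
deque_steal_tail (struct worker_deque *d)
{
  pthread_mutex_lock (&d->lock);
  struct task *t = d->tail;
  if (t)
    {
      d->tail = t->prev;
      if (d->tail) d->tail->next = NULL; else d->head = NULL;
    }
  pthread_mutex_unlock (&d->lock);
  return t;
}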
Re: Give me advice on GSoC OpenMP
From: Jakub Jelinek
Date: Mon, 4 Apr 2011 20:15:38 +0200

> On Sun, Apr 03, 2011 at 08:10:25PM +0200, Jakub Jelinek wrote:
>> [...] But it is true it could very well schedule tasks that are taskwaited
>> by tasks taskwaited by the current task, and transitively further. Plus, be
>> able to temporarily awake such a sleeping thread if there are tasks it can
>> transitively taskwait for, as if those don't complete, the current taskwait
>> won't return.
>
> Just FYI, I've tried to implement something like that as a quick hack,
> but it unfortunately slowed things down, at least on the attached fib
> testcase with arguments 40 25. Guess partly the problem is that after a task
> waiting in taskwait_sem is awakened it now needs to take the task_lock lock
> to unqueue itself from the new in_taskwait_list, and partly because the
> search for grand-grandchildren etc. is more expensive; the FIFO isn't a good
> data structure for that.
>
> Jakub

Thanks a lot for your work. I'll read through it to help me understand the
current GOMP implementation.

I posted another email a few hours ago in which I explained the concept of
`Lazy Task Creation'. Do you have any comments on that?

My current concern is that in order to implement Lazy Task Creation, a task
must be able to pause and resume its execution. So it's necessary to
implement context switching, which `MassiveThreads' does. Is it OK to write
small pieces of assembly code for each CPU architecture in libgomp? (I guess
it is theoretically possible if some changes are made in the Makefile.)

-- Sho Nakatani
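P.S. As an alternative to per-architecture assembly, the (deprecated but
widely available) ucontext functions could express the switch. A minimal
sketch of what I mean, under the assumption that each suspendable task gets
its own stack (my own illustration; error handling omitted):

#include <ucontext.h>
#include <stdlib.h>

#define TASK_STACK_SIZE (64 * 1024)

static ucontext_t scheduler_ctx;

/* Body of a task that may be suspended and later resumed, possibly on
   another thread -- which is exactly what untied tasks need.  */
static void
task_trampoline (void)
{
  /* ... run the task body; returning resumes uc_link below ... */
}

static void
run_task_on_own_stack (void)
{
  ucontext_t task_ctx;
  getcontext (&task_ctx);
  task_ctx.uc_stack.ss_sp = malloc (TASK_STACK_SIZE);
  task_ctx.uc_stack.ss_size = TASK_STACK_SIZE;
  task_ctx.uc_link = &scheduler_ctx;	      /* where to go when the task returns */
  makecontext (&task_ctx, task_trampoline, 0);
  swapcontext (&scheduler_ctx, &task_ctx);    /* suspend caller, run task */
}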
Re: Give me advice on GSoC OpenMP
From: Jakub Jelinek
Date: Tue, 5 Apr 2011 08:33:23 +0200

> On Tue, Apr 05, 2011 at 11:05:01AM +0900, Sho Nakatani wrote:
>> - When task A encounters a "#pragma omp task" directive, the worker
>> creates a task and immediately executes it. The worker pushes A to the
>> head of its deque.
>
> Immediately starting a freshly created task on #pragma omp task is nice
> for cache locality, ...

Yes. Also, Lazy Task Creation has another good aspect. Look at the pictures
below. Both show the call tree of fibonacci(8); one was compiled by gcc and
the other by icc.

(GCC) https://github.com/laysakura/GCC-OpenMP-Speedup/raw/e5671e7f4175c3ac17c1543c93edf25dda2ae6ac/test/calltree/openmp-fibonacci-calltree-gcc.png
(ICC) https://github.com/laysakura/GCC-OpenMP-Speedup/raw/e5671e7f4175c3ac17c1543c93edf25dda2ae6ac/test/calltree/openmp-fibonacci-calltree-icc.png

Nodes with the same color represent tasks processed on the same thread (or
core). These pictures show that the program compiled by icc, which might use
Lazy Task Creation, executes the nodes of a subtree on the same core, while
the one compiled by gcc doesn't. Of course, fib(8) doesn't clearly show
whether Lazy Task Creation really helps programs that need sophisticated
parallelization, but other research, such as
https://iwomp.zih.tu-dresden.de/downloads/runtime-olivier.pdf
also points this out.

My teacher and senior associates know Lazy Task Creation well, and I also
learned it from some Japanese papers. My teacher recommended a paper on Lazy
Task Creation (in English). Please google the title and take a look if you
have time:
"Lazy Task Creation: A Technique for Increasing the Granularity of Parallel
Programs"

> Immediately starting a freshly created task on #pragma omp task is nice
> for cache locality, except it doesn't work at all for tied tasks which you
> can't move to other threads. For tied tasks (and GCC currently has all
> tasks tied) it serializes everything.

Then, how about implementing `untied task' as a GSoC project? If it goes
well, I'll tackle a better `tied task' too. What do you think about this
idea? I'd like to write my proposal tomorrow :-)

-- Sho Nakatani
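P.S. For reference, at the source level the tied/untied difference is just a
clause; all of the cost is in the runtime. A tied task (the default) must
resume on the thread that first ran it after any suspension, while an untied
task may migrate at task scheduling points:

#pragma omp task shared(a)		/* tied (default): resumes on the same thread */
  a = fib (n - 1);

#pragma omp task untied shared(b)	/* untied: may move across threads */
  b = fib (n - 2);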
Re: Give me advice on GSoC OpenMP
From: Jakub Jelinek
Date: Tue, 5 Apr 2011 16:37:48 +0200

> On Tue, Apr 05, 2011 at 03:45:26PM +0200, Antoniu Pop wrote:
>> On Tue, Apr 5, 2011 at 3:01 PM, Sho Nakatani wrote:
>> > [...]
>
> As your testcase doesn't have untied keywords, I doubt ICC uses such
> scheduling for it. See also the paper Antoniu referenced, e.g. figure 7
> in there.
>
> If you want to do something about task scheduling as a GSoC project,
> you are certainly welcome, but starting with the idea that
> you want to implement Lazy Task Creation is probably not a good idea.
> You want to read a bunch of papers, download the Mercurium project
> with its runtime OpenMP library, read it, see what they actually implement
> and how, and whether they still have multiple strategies implemented in
> there, see how they perform on the various OpenMP task testcases (we have
> some in the libgomp testsuite, but the authors of Mercurium also have some
> testcases (called BOTS)), then try to implement some changes in libgomp/
> and see how they perform on various testcases with various parameters.

OK. I'll do that as soon as possible. Then my current plan is:
- Learn other implementations (as Antoniu said).
- Learn the Mercurium implementation.
- Implement the same/similar feature as Mercurium in libgomp/, then evaluate
  it. (This might be the first goal.)
- Implement Lazy Task Creation for `untied task' in libgomp/, then evaluate
  it. (If I reach the first goal.)

Is this plan OK? Or do you have a better one?

>> You may also want to make sure the scheduling policy it enforces
>> cannot interfere with OpenMP semantics. For example, how would
>> lazy task creation behave if a worker steals a bottom-of-the-stack
>> task that has a taskwait and the tasks bound to this synchronization
>> are left in the other worker's deque? There can be deadlock issues
>> or just plain performance issues that arise. I can't see any
>> obvious issues, but it's worth thinking about.
>
> Tied tasks have quite strict rules on what can actually be scheduled:
> a newly scheduled task must be a descendant of all the tasks tied to the
> current thread that aren't waiting in a barrier.
>
> Jakub

-- Sho Nakatani
Re: Give me advice on GSoC OpenMP
From: Antoniu Pop
Date: Tue, 5 Apr 2011 15:45:26 +0200

> On Tue, Apr 5, 2011 at 3:01 PM, Sho Nakatani wrote:
>> [...]
>>
>> Then, how about implementing `untied task' as a GSoC project?
>> If it goes well, I'll tackle a better `tied task' too.
>> What do you think about this idea?
>> I'd like to write my proposal tomorrow :-)
>
> I think this could indeed be quite interesting, and a much needed
> improvement to GCC's OpenMP runtime. The overall idea behind Lazy Task
> Creation looks good.
> However, I would recommend starting with bibliographical work. I'm
> certain this solution could be an improvement over the current state,
> but it would be nice to know how it compares to other techniques, and
> also whether there are slowdowns in less dynamic and unbalanced cases.
> The work I'm aware of in this area is "Evaluation of OpenMP task
> scheduling strategies" by Duran et al.
> (http://www.sarc-ip.org/files/null/Workshop/1234128788173__TSchedStrat-iwomp08.pdf).
> It could be a reasonable starting point.

Thank you so much for the realistic idea. Making the libgomp implementation
better is not an easy thing, so I'll learn a lot about other implementations.
Learning deeply about parallel programming will surely help my graduation
thesis :-)

> You may also want to make sure the scheduling policy it enforces
> cannot interfere with OpenMP semantics. For example, how would
> lazy task creation behave if a worker steals a bottom-of-the-stack
> task that has a taskwait and the tasks bound to this synchronization
> are left in the other worker's deque? There can be deadlock issues
> or just plain performance issues that arise. I can't see any
> obvious issues, but it's worth thinking about.
>
> I believe the GSoC idea for working on the libGOMP scheduler is great!
>
> Antoniu

-- Sho Nakatani
Re: Give me advice on GSoC OpenMP
From: Jakub Jelinek
Date: Tue, 5 Apr 2011 17:22:04 +0200

> On Wed, Apr 06, 2011 at 12:16:13AM +0900, Sho Nakatani wrote:
>> [...]
>
> This would depend on implementing untied tasks, which is quite a
> questionable thing and can be badly expensive too (while tied tasks can
> (and currently do) run on the containing thread's stack, untied tasks need
> to have their own stacks and there needs to be a way to switch between
> them, for which either you can use the *context family of functions
> (deprecated, but could work at least on some targets), or come up with
> something target-specific).
>
> From Mercurium you probably don't need to study the actual C-to-C compiler,
> maybe just quickly what it generates for tasks, but primarily their runtime
> library.
>
>> Is this plan OK?
>
> Yeah (just that the last step depends on implementing untied tasks;
> you can certainly try to implement that (preferably with minimal
> overhead for the default tied task model), but I guess it would take
> quite a lot of time to write and test).
>
> Jakub

Thanks a lot for your kind advice. Then my latest plan is:
- Learn other implementations (as Antoniu said).
- Learn the Mercurium implementation.
- Implement the same/similar feature as Mercurium in libgomp/, then evaluate
  it. (This is the goal for the GSoC project.)

However, I currently know nothing about Mercurium. (Of course, I'll read the
Mercurium documentation from now on.) Does Mercurium implement both `tied
task' and `untied task'? Should my goal be implementing both of them?

-- Sho Nakatani
Re: Give me advice on GSoC OpenMP
From: Jakub Jelinek
Date: Tue, 5 Apr 2011 17:43:57 +0200

> On Wed, Apr 06, 2011 at 12:40:21AM +0900, Sho Nakatani wrote:
>> [...]
>> Does Mercurium implement both `tied task' and `untied task'?
>> Should my goal be implementing both of them?
>
> Given that the paper Antoniu referenced speaks about it
> being the testbed for the paper, I guess so. I think they
> implement untied tasks using some library they wrote,
> but all I did was download it and skim it for a couple of minutes.
>
> Anyway, I think more important is the first goal, improve scheduling
> of tied tasks, those are the default ones and probably most people
> will use just those. Only when you are done with that it would be IMHO [...]

OK. I've now settled on my plan for GSoC!

1. Learn other implementations (as Antoniu said).
2. Learn the Mercurium implementation.
   (1. and 2. would be done concurrently ;-) )
3. Implement the Mercurium `tied task' feature in libgomp/, then evaluate it.
4. Implement the Mercurium `untied task' feature in libgomp/, then evaluate
   it. (4. is optional.)

I'll write my proposal tomorrow (since it is 1:00 a.m. in Japan) and then ask
you to check it. If you have time, please examine it.

Thanks,
-- Sho Nakatani
Re: Give me advice on GSoC OpenMP
From: Antoniu Pop
Date: Tue, 5 Apr 2011 17:45:18 +0200

>> [...]
>> Does Mercurium implement both `tied task' and `untied task'?
>
> As far as I know (I have not verified the source), they do.

That's important information. If you remember the source, please tell me
about it.

>> Should my goal be implementing both of them?
>
> That would be great, though probably not an easy feat. Maybe just
> start with tied for now and see if you have enough time.

Yeah. Implementing both tied and untied tasks is difficult, so I'll keep
working on them after the GSoC term finishes.

> Antoniu

Thanks,
-- Sho Nakatani
My current idea for improving libgomp
Hi,

I'm Sho Nakatani, accepted by Google Summer of Code 2011. I'm trying to speed
up libgomp, the OpenMP implementation in GCC. As my GSoC project, I'll focus
on the OpenMP `task' directive (especially `tied task').

Around the beginning of April, some members here told me that I could migrate
the OpenMP implementation from Nanos4. However, my experiment has shown that
Nanos4 is not always better than libgomp. See the graphs below.

Testsuite:
  BOTS, which the Nanos project provides.
Compilers:
  gcc with libgomp
  icc (Intel C Compiler) with its OpenMP runtime
  mcc (Mercurium C Compiler) with Nanos4
Environment:
  Each test case is executed on 48 CPUs (actually 24 CPUs with
  Hyper-Threading). OMP_NUM_THREADS is 48 for the parallelized programs.
Results:
  Here are 9 graphs, one per test case. Each graph compares the performance
  of the compilers and the efficiency of the `task' directives.
  https://github.com/laysakura/GCC-OpenMP-Speedup/raw/9636d281663a8a7857efd38700c82486ff12ae7b/data/20110427-101530-tuna-protein.eps.png
  https://github.com/laysakura/GCC-OpenMP-Speedup/raw/9636d281663a8a7857efd38700c82486ff12ae7b/data/20110427-101530-tuna-fft.eps.png
  https://github.com/laysakura/GCC-OpenMP-Speedup/raw/9636d281663a8a7857efd38700c82486ff12ae7b/data/20110427-101530-tuna-fib.eps.png
  https://github.com/laysakura/GCC-OpenMP-Speedup/raw/9636d281663a8a7857efd38700c82486ff12ae7b/data/20110427-101530-tuna-floorplan.eps.png
  https://github.com/laysakura/GCC-OpenMP-Speedup/raw/9636d281663a8a7857efd38700c82486ff12ae7b/data/20110427-101530-tuna-health.eps.png
  https://github.com/laysakura/GCC-OpenMP-Speedup/raw/9636d281663a8a7857efd38700c82486ff12ae7b/data/20110427-101530-tuna-nqueen.eps.png
  https://github.com/laysakura/GCC-OpenMP-Speedup/raw/9636d281663a8a7857efd38700c82486ff12ae7b/data/20110427-101530-tuna-sort.eps.png
  https://github.com/laysakura/GCC-OpenMP-Speedup/raw/9636d281663a8a7857efd38700c82486ff12ae7b/data/20110427-101530-tuna-sparse.eps.png
  https://github.com/laysakura/GCC-OpenMP-Speedup/raw/9636d281663a8a7857efd38700c82486ff12ae7b/data/20110427-101530-tuna-strassen.eps.png

In my opinion, just migrating the Nanos4 implementation would not improve
libgomp performance as a whole. What I should aim for might be to understand
both the libgomp and Nanos4 implementations and add only the good features of
Nanos4 to libgomp. Is that OK? Give me any opinions and ideas!

-- Sho Nakatani
Re: My current idea for improving libgomp
Hi Andreas,

Thank you for your comments.

> I think the biggest problem in libgomp is that tasks are stored in a
> single queue. This may work for a few threads or for long-running tasks,
> but certainly not for 48 threads. In fact, contention on the queue can
> grow so high that performance starts to degrade when you add more
> threads.
>
> First of all, you need to think about data structures for storing tasks.
> When you look at other parallel runtime systems/libraries, you will find
> that many implement some form of work-stealing scheduling. In
> work-stealing scheduling, each worker or group of workers has their own
> queue on which they operate. Load balancing is done by stealing tasks
> from the queues of other workers. The Nanos runtime is probably a good
> place to start to get some ideas...

I totally agree with this point. Currently, I'm planning to implement tied
tasks using the breadth-first scheduler described in section 3.1 of
"Evaluation of OpenMP Task Scheduling Strategies" by the Nanos group:
http://www.sarc-ip.org/files/null/Workshop/1234128788173__TSchedStrat-iwomp08.pdf

That is (see the sketch at the end of this mail):
* A team has one team queue which contains tasks.
* A team has some (user-level) threads.
* A thread can have one running task.
* A thread has a private queue which contains tasks.
* When a task is created, it is queued on the team queue.
* Each thread takes tasks from the team queue and inserts them into its
  private queue.
* Once a tied task has begun executing on a thread, it is queued only on that
  thread's private queue when it encounters `taskwait'.
* Each thread runs tasks from its private queue.

But I'm not sure how to achieve good load balancing or what kind of cutoff
strategy to take. As for load balancing, I'll read the Nanos4 implementation
and ask the Nanos group about it. (Of course, your advice will do :-) ) As
for cutoff, basically I can choose a `max-tasks' strategy or a `max-levels'
strategy: when the number of tasks or recursion levels exceeds the limit, the
scheduler stops creating tasks and executes each would-be task inline, as in
a sequential program. But "Evaluation of OpenMP Task Scheduling Strategies"
says the best cutoff strategy differs from application to application.

> Regarding Lazy Task Creation (LTC)... I agree with the advice here and
> would say this is not what you should focus on. The most prominent use
> of LTC is in the Cilk multithreaded language. Basically, we have two
> options when we encounter a new task: (1) spawn the task as a child and
> continue executing the parent, or, what LTC does, (2) start executing
> the task and leave the continuation (the rest of the parent) for later
> execution. Load balancing requires that the continuation is made
> available so that it can be picked up and executed by other threads. The
> Nanos runtime supports different scheduling strategies, including
> work-first scheduling, which is similar in principle to LTC. Nanos
> implements tasks as user-level threads (nano-threads), so it's easier to
> switch between tasks or move tasks across threads. Something like
> work-first scheduling that depends on untied tasks is nice to have. But
> I think a decent scheduler (for tied tasks) is more important at this
> point...

In the GSoC project, I will not implement LTC but just focus on the
schedulers Nanos4 already has. I'll first implement `tied task' using the
breadth-first scheduler, and after that I'll implement `untied task' using
the work-first scheduler (if I have time).

Thanks,
-- Sho Nakatani
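P.S. A minimal sketch of the breadth-first scheme above as I understand it
(my own illustration of the paper's idea, not Nanos or libgomp code; the
queue helpers are hypothetical):

#include <pthread.h>

struct task;			/* opaque here */

struct queue  { pthread_mutex_t lock; struct task *head, *tail; };
struct team   { struct queue team_q; };
struct thread { struct queue private_q; struct team *team; };

/* Hypothetical helpers; queue_pop returns NULL when the queue is empty.  */
extern struct task *queue_pop (struct queue *);
extern void queue_push (struct queue *, struct task *);

/* Breadth-first pick: prefer the thread-private queue (tasks already
   tied to this thread), then fall back to the shared team queue.  */
static struct task *
next_task (struct thread *th)
{
  struct task *t = queue_pop (&th->private_q);
  if (t == NULL)
    t = queue_pop (&th->team->team_q);
  return t;
}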
Re: My current idea for improving libgomp
Hello again Andreas,

(I just forgot to Cc the GCC ML, so I'm resending this email.)

> Right, start with distributing the queues and then think about load
> balancing.

OK.

> I would say don't worry too much about cut-offs at this point. Finding a
> good cut-off strategy that works without drawbacks is pretty much an
> open research problem. Just spawn the tasks and focus on efficient task
> creation and scheduling. In my experience, going from a centralized to a
> distributed task pool already makes a huge difference.

As you say, deciding on a cutoff strategy sounds pretty difficult. I read
"Evaluating OpenMP 3.0 Run Time Systems on Unbalanced Task Graphs" and
learned a little about Adaptive Task Cutoff, in which a good cutoff is found
from profiling data collected early in the program's execution. I guess it's
almost impossible to implement it within the GSoC term, but I'm interested in
attempting it in the future.

> To get a better overview of other implementations, which you can compare
> to libgomp, I recommend a couple of papers. For example:
>
> - OpenMP Tasks in IBM XL Compilers, X. Teruel et al.
> - Support for OpenMP Tasks in Nanos v4, X. Teruel et al.
> - OpenMP 3.0 Tasking Implementation in OpenUH, C. Addison et al.
> - A Runtime Implementation of OpenMP Tasks, J. LaGrone et al.

Thanks. I'll read all of them ;-)

-- Sho Nakatani
Re: My current idea for improving libgomp
Hi,

I'm trying to improve the libgomp implementation of the task construct in a
GSoC project. After reading several papers on the OpenMP task feature,
several questions occurred to me. I'd appreciate your answers; even a few
comments will do.

* Please point me to a good reference for the OpenMP standard: what was
  referenced when libgomp was implemented? I'm currently reading the OpenMP
  tutorial at http://www.llnl.gov/computing/tutorials/openMP .

* How should user-level and kernel-level threads be implemented? In my
  opinion, user-level threads should be a small data structure with a call
  stack, and kernel-level threads should be pthreads tied to CPUs by the
  sched_setaffinity() system call. (The manpage says sched_setaffinity() is
  provided only on Linux, though...)

* Which interfaces (ABIs) should be kept in libgomp when I rewrite the task
  feature? I guess I should be very careful, since the task construct is
  related to others like taskwait, barrier, and parallel.

Thanks,
-- Sho Nakatani
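P.S. What I mean by tying pthreads to CPUs, as a sketch (Linux-only;
pthread_setaffinity_np is the pthread-level variant of sched_setaffinity):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single CPU.  Returns 0 on success.  */
static int
pin_to_cpu (int cpu)
{
  cpu_set_t set;
  CPU_ZERO (&set);
  CPU_SET (cpu, &set);
  return pthread_setaffinity_np (pthread_self (), sizeof set, &set);
}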
Re: My current idea for improving libgomp
Hi,

> First of all, if you haven't started with the FSF assignment paperwork,
> please do so, it takes a while. See http://gcc.gnu.org/contribute.html

I've already started it. Thanks.

> For #pragma omp parallel and tied tasks you just want user-level ==
> kernel-level threads as implemented in libgomp, with affinity only
> done when requested by the user (GOMP_CPU_AFFINITY resp. on gomp-3_1-branch
> also the to-be-implemented OMP_PROC_BIND env vars).

In my opinion, even tied tasks need user-level threads for scheduling. I've
read several papers on task implementations, and many of the task schedulers
have user-level threads with their own private queues. To prevent contention
on the task queue, it is a good idea to let user-level threads have their own
private queues. (I got this idea mostly from section 3 of "Evaluation of
OpenMP Task Scheduling Strategies" by the Nanos group.) Also, it would be
difficult to implement untied tasks without user-level threads. So
implementing user-level threads for tied tasks will keep the task scheduler
simple once libgomp gains an untied task implementation in the future.

> IMHO you don't want to rewrite the task support, just primarily investigate
> various scheduling algorithms and attempt to implement some of them and
> benchmark.

Sorry, I couldn't quite catch what you would like me to do in the GSoC
project. I'm planning to rewrite most of the task scheduler in libgomp. My
goal for the project is a faster tied task implementation and (if I have
enough time) untied tasks. The single global task queue that libgomp
currently uses would be one of the biggest defects, so I would first like to
build new data structures, including user-level threads with their private
queues. If the current global task queue in libgomp is replaced by these new
data structures, most accesses to task queues have to be rewritten, and
therefore most of the code related to the task scheduler has to be rewritten,
too.

Feel free to give me any comments on this point, since it is the crux of my
project ;-)

Thanks,
-- Sho Nakatani
GSoC libgomp task project: What should I do next?
Hello,

I'm improving the OpenMP Task implementation in libgomp as a Google Summer of
Code project. Since the mid-term evaluation is approaching, I would like to
tell you what I have done so far.

1. I read papers on task parallelism systems such as Cilk, Intel TBB, and
   OpenMP Task.

2. I read the source code of two OpenMP Task implementations: Nanos4 and
   OpenUH.

3. I read the libgomp source code around the task implementation.

4. I suspected that one of the biggest problems in the libgomp task
   implementation is that multiple processors take the lock on the single
   task queue very often, and conducted a simple experiment to confirm it.
   The attached "libgomp_lock_fib.eps" shows how long each CPU waits for the
   task queue lock to be released. It was produced by running a program that
   calculates the 22nd Fibonacci number using libgomp tasks. The red line
   marks the time spans in which each CPU is waiting for the task queue mutex
   to be released. You can see that each CPU is waiting for most of the
   execution time. This convinced me that each CPU (worker thread) has to
   have its own task queue, as many other task-parallel implementations do.

5. I implemented and tested a task queue [0], held by each worker thread.
   Tasks in a task queue are usually popped by its own worker thread; when a
   worker thread's queue is empty, it steals a task from another worker
   thread.

6. I decided to use the Portable Coroutine Library (PCL) [1], as OpenUH does,
   to realise user-level context switches. It's true that user-level context
   switches are not necessary to implement tied tasks, but they are necessary
   for efficient untied tasks. In fact, many task-parallel systems have
   user-level context switches, like Cilk, OpenUH, and Nanos. Although my
   first aim was a better tied task implementation in libgomp, tied tasks are
   slower than untied tasks, and tied tasks can be realised just by adding
   some restrictions to untied tasks. So I'm using PCL and implementing
   untied tasks first.

7. Before writing code in libgomp, I wrote code [2] to emulate the behavior
   of OpenMP Task, both to deeply understand the implementation and to
   evaluate the system I would implement in libgomp. It emulates a Fibonacci
   program in which each fib(N) is bound to a task. The code contains:
   - worker threads
   - task queues
   - a task scheduler
   - OpenMP Task ABIs (like GOMP_task, GOMP_taskwait, and so on)
   - a fibonacci function in which the ABIs are called explicitly

8. I realised that task creation cost is not negligible when too many tasks
   are spawned, so I added a cut-off depth (see the sketch at the end of this
   mail).

9. I ran a simple evaluation comparing my current implementation (7.),
   libgomp, and Nanos4. It shows that my current implementation is much
   faster than libgomp. It also seems faster than Nanos4, but that is
   probably because I don't know how to configure the task scheduling
   strategy in Nanos4. It is not perfectly scalable, but that may be because
   the tasks in the Fibonacci application are too fine-grained for the task
   scheduler's overhead to be negligible.

This is what I have done. Now I would like to ask you what I should do next.
In my opinion, I have two reasonable choices:

(A) Try to change the implementation of libgomp following my current
    implementation, then test and evaluate it using some applications. (I
    think the Barcelona OpenMP Tasks Suite [3] contains good applications for
    testing OpenMP Task implementations.)

(B) Ensure my current implementation is worth merging into libgomp before
    actually writing code in libgomp. I wrote in 7. that I have only emulated
    and evaluated a Fibonacci program, but it is possible to emulate other
    applications like N-Queens or FFT.

Personally, (A) is better because I don't have much time before the GSoC term
ends.

Give me your advice.

[0] I learned the implementation of the task queue from that of Cilk. You can
    see the basic idea in
    http://www.few.vu.nl/~kielmann/cgc/2011/slides/mihalcea.pdf
[1] http://www.xmailserver.org/libpcl.html
    The code is available under GPL2.
[2] https://github.com/laysakura/GSoC_libgomp_task/tree/master/my_data_structures
    You can compile and test the emulation code with:
    $ make test_fib_by_hand
    $ export OMP_NUM_THREADS=(int number)
    $ ./test_fib_by_hand (N for fib(N))
[3] http://nanos.ac.upc.edu/content/barcelona-openmp-task-suite

Sho Nakatani

Attachment: libgomp_lock_fib.eps (PostScript document)
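P.S. The cut-off depth from point 8, in OpenMP terms (a minimal sketch; the
depth threshold of 8 is illustrative):

#define CUTOFF_DEPTH 8

/* Below the threshold, spawn tasks; past it, recurse serially so that
   task-creation overhead stops dominating the fine-grained leaves.  */
static long
fib (int n, int depth)
{
  long a, b;
  if (n < 2)
    return n;
  if (depth >= CUTOFF_DEPTH)
    return fib (n - 1, depth + 1) + fib (n - 2, depth + 1);
#pragma omp task shared(a)
  a = fib (n - 1, depth + 1);
#pragma omp task shared(b)
  b = fib (n - 2, depth + 1);
#pragma omp taskwait
  return a + b;
}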
Re: GSoC libgomp task project: What should I do next?
Hello Tobias,

> I think it makes sense to start working on libgomp; even if the chosen
> implementation is not perfect, one can make real-world experiments with
> it. Additionally, to be used at the end, it has to end up in libgomp.

Thank you for your advice. I'll work on libgomp soon.

Sho Nakatani
Re: GSoC libgomp task project: What should I do next?
Hi Jakub,

> Yeah. And, please post patches from time to time, even if they aren't
> completely finished, so that others can comment on them.

OK. I'll work on it.

Sho Nakatani
Advice for changing OpenMP translation
Hello,

I am trying to improve the OpenMP task implementation. At first I thought I
would only have to modify the internals of the ABIs (such as GOMP_task), but
I have realised I also have to change the translation of OpenMP. My question
is how to change that translation. (I guess I should edit
gcc/omp-low.c:create_omp_child_function(), but how?)

I checked the current translation with

$ gcc -O2 -fopenmp -fdump-tree-ompexp omp_fib.c

and noticed a problem that limits the flexibility of the internal
implementation. I want a "GOMP_taskexit" ABI which is called just before each
function bound to a task (*._omp_fn.[0-9]) ends; this is where a task
finishes its work. My task scheduling implementation needs the GOMP_taskexit
ABI. For example, when the last child task of a parent task finishes its
work, it tells the parent that all of its children have finished, so the
parent should resume its work.

Please tell me how to add the "GOMP_taskexit" ABI.

Sho Nakatani
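P.S. To make the request concrete, here is roughly the shape of an outlined
task function in the ompexp dump, with the call I want emitted at its end
(pseudo-C in dump style; GOMP_taskexit is the proposed ABI, not an existing
one, and the struct layout is illustrative):

/* outlined body for "#pragma omp task shared(a)" inside fib() */
struct .omp_data_s { int n; long *a; };

void fib._omp_fn.0 (struct .omp_data_s *.omp_data_i)
{
  *.omp_data_i->a = fib (.omp_data_i->n - 1);
  GOMP_taskexit ();   /* proposed: the last child wakes the waiting parent */
}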