Re: [RFC][AArch64] function prologue analyzer in linux kernel
On 01/16/2016 01:56 AM, Will Deacon wrote: On Wed, Jan 13, 2016 at 05:13:29PM +0900, AKASHI Takahiro wrote: On 01/13/2016 03:04 AM, Will Deacon wrote: On Tue, Jan 12, 2016 at 03:11:29PM +0900, AKASHI Takahiro wrote: On 01/09/2016 12:53 AM, Will Deacon wrote: I still don't understand why you can't use fstack-usage. Can you please tell me why that doesn't work? Am I missing something? I don't know how gcc calculates the usage here, but I guess it would be more robust than my analyzer. The issues, that come up to my mind, are - -fstack-usage generates a separate output file, *.su and so we have to manage them to be incorporated in the kernel binary. That doesn't sound too bad to me. How much data are we talking about here? This implies that (common) kernel makefiles might have to be a bit changed. - more worse, what if kernel module case? We will have no way to let the kernel know the stack usage without adding an extra step at loading. We can easily add a new __init section to modules, which is a table representing the module functions and their stack sizes (like we do for other things like alternatives). We'd just then need to slurp this information at load time and throw it into an rbtree or something. I found another issue. Let's think about 'dynamic storage' case like: $ cat stack.c extern long fooX(long a); extern long fooY(long b[]); long foo1(long a) { if (a > 1) { long b[a]; <== Here return a + fooY(b); } else { return a + fooX(a); } } Then, -fstack-usage returns 48 for foo1(): $ aarch64-linux-gnu-gcc -fno-omit-frame-pointer -fstack-usage main.c stack.c \ -pg -O2 -fasynchronous-unwind-tables $ cat stack.su stack.c:4:6:foo148 dynamic This indicates that foo1() may use 48 bytes or more depending on a condition. But in my case (ftrace-based stack tracer), I always expect 32 whether we're backtracing from fooY() or from fooX() because my stack tracer estimates: (stack pointer) = (callee's frame pointer) + (callee's stack usage) (in my previous e-mail, '-(minus)' was wrong.) where (callee's stack usage) is, as I described in my previous e-mail, a size of memory which is initially allocated on a stack in a function prologue, and should not contain a size of dynamically allocate area. According to who? What's the use in reporting only the prologue size? Me :) (I'm afraid that my wording, "stack usage", might confuse you.) My arm64-specifc check_patch() expects this in order to estimate caller's correct stack pointer address from a callee's frame pointer, which does not contain any of callee's dynamically (so probably after a prologue) allocated variables. Please take a close look at my patch #5[1]. [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-December/393721.html -Takahiro AKASHI Will
Re: help with PR69133
On 17 January 2016 at 14:56, Prathamesh Kulkarni wrote: > Hi, > I was having a look at PR69133. > It appears that with -flto-partition=none, > cgraph_node::get_untransformed_body () > is called twice for node with asm_name _ZThn4_N11xercesc_3_11C5m_fn6ERKi. > c++filt says it is: non-virtual thunk to xercesc_3_1::C::m_fn6(int const&) > > get_untransformed_body() calls > lto_free_function_in_decl_state_for_node() which sets > node->lto_file_data to NULL. > Now 2nd time get_untransformed_body is called for the same node, > node->lto_file_data is NULL and lto_get_decl_name_mapping() dereferences > lto_file_data which results in segfault. > > I was wondering if gating on lto_file_data could be a reasonable solution ? > if (!lto_file_data) > return false; > I am not sure what value to return for this case (I chose false > because of early exit). > It prevents the ICE, and the patch passes bootstrap+test on > x86_64-unknown-linux-gnu. > With partitoning enabled, for the above test-case, > get_untransformed_body () is called only once per node. However I > don't understand why it gets called twice with -flto-partition=none. Kugan pointed out to me the test-case doesn't ICE by passing -fno-ipa-icf. Thanks, Prathamesh > > Thank you, > Prathamesh
Re: help with PR69133
On Mon, Jan 18, 2016 at 10:28 AM, Prathamesh Kulkarni wrote: > On 17 January 2016 at 14:56, Prathamesh Kulkarni > wrote: >> Hi, >> I was having a look at PR69133. >> It appears that with -flto-partition=none, >> cgraph_node::get_untransformed_body () >> is called twice for node with asm_name _ZThn4_N11xercesc_3_11C5m_fn6ERKi. >> c++filt says it is: non-virtual thunk to xercesc_3_1::C::m_fn6(int const&) >> >> get_untransformed_body() calls >> lto_free_function_in_decl_state_for_node() which sets >> node->lto_file_data to NULL. >> Now 2nd time get_untransformed_body is called for the same node, >> node->lto_file_data is NULL and lto_get_decl_name_mapping() dereferences >> lto_file_data which results in segfault. >> >> I was wondering if gating on lto_file_data could be a reasonable solution ? >> if (!lto_file_data) >> return false; >> I am not sure what value to return for this case (I chose false >> because of early exit). >> It prevents the ICE, and the patch passes bootstrap+test on >> x86_64-unknown-linux-gnu. >> With partitoning enabled, for the above test-case, >> get_untransformed_body () is called only once per node. However I >> don't understand why it gets called twice with -flto-partition=none. > Kugan pointed out to me the test-case doesn't ICE by passing -fno-ipa-icf. It looks to me we are not supposed to call this twice on the same node but nothing guards against that (and we don't release the untransformed body either). -flto-partition=none is special here becaue it combines WPA and late stage. IPA ICF is the only IPA pass needing untransformed bodies during WPA. Richard. > Thanks, > Prathamesh >> >> Thank you, >> Prathamesh
Re: ipa vrp implementation in gcc
On Mon, Jan 18, 2016 at 12:00 AM, Kugan wrote: > Hi, > >> Another potential use of value ranges is the profile estimation. >> http://www.lighterra.com/papers/valuerangeprop/Patterson1995-ValueRangeProp.pdf >> It seems to me that we may want to have something that can feed sane loop >> bounds for profile estimation as well and we can easily store the known >> value ranges to SSA name annotations. >> So I think separate local pass to compute value ranges (perhaps with less >> accuracy than full blown VRP) is desirable. > > Thanks for the reference. I am looking at implementing a local pass for > VRP. The value range computation in tree-vrp is based on the above > reference and uses ASSERT_EXPR insertion (I understand that you posted > the reference above for profile estimation). As Richard mentioned in his > reply, the local pass should not rely on ASSERT_EXPR insertion. > Therefore, do you have any specific algorithm in mind (i.e. Any > published paper or reference from book)?. Of course we can tweak the > algorithm from the reference above but would like to understand what > your intension are. I have (very incomplete) prototype patches to do a dominator-based approach instead (what is refered to downthread as non-iterating approach). That's cheaper and is what I'd like to provide as an "utility style" interface to things liker niter analysis which need range-info based on a specific dominator (the loop header for example). Richard. >> I think the ipa-prop.c probably won't need any siginificant changes. The >> code basically analyze what values are passed thorugh the function and >> this works for constants as well as for intervals. In fact ipa-cp already >> uses the same ipa-prop analysis for >> 1) constant propagation >> 2) alignment propagation >> 3) propagation of known polymorphic call contextes. >> >> So replacing 1) by value range propagation should be easily doable. >> I would also like to replace alignment propagation by bitwise constant >> propagation (i.e. propagating what bits are known to be zero and what >> bits are known to be one). We already do have bitwise CCP, so we could >> deal with this basically in the same way as we deal with value ranges. >> >> ipa-prop could use bit of clenaing up and modularizing that I hope will >> be done next stage1 :) > > We (Myself and Prathamesh) are interested in working on LTO > improvements. Let us have a look at this. > >>> - Once we have the value ranges for parameter/return values, we could rely on tree-vrp to use this and do the optimizations >>> >>> Yep. IPA transform phase should annotate parameter default defs with >>> computed ranges. >> >> Yep, in addition we will end up with known value ranges stored in aggregates >> for that we need better separate representaiton >> >> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68930 >>> > > Thanks, > Kugan
Re: help with PR69133
> On Mon, Jan 18, 2016 at 10:28 AM, Prathamesh Kulkarni > wrote: > > On 17 January 2016 at 14:56, Prathamesh Kulkarni > > wrote: > >> Hi, > >> I was having a look at PR69133. > >> It appears that with -flto-partition=none, > >> cgraph_node::get_untransformed_body () > >> is called twice for node with asm_name _ZThn4_N11xercesc_3_11C5m_fn6ERKi. > >> c++filt says it is: non-virtual thunk to xercesc_3_1::C::m_fn6(int const&) > >> > >> get_untransformed_body() calls > >> lto_free_function_in_decl_state_for_node() which sets > >> node->lto_file_data to NULL. > >> Now 2nd time get_untransformed_body is called for the same node, > >> node->lto_file_data is NULL and lto_get_decl_name_mapping() dereferences > >> lto_file_data which results in segfault. > >> > >> I was wondering if gating on lto_file_data could be a reasonable solution ? > >> if (!lto_file_data) > >> return false; > >> I am not sure what value to return for this case (I chose false > >> because of early exit). > >> It prevents the ICE, and the patch passes bootstrap+test on > >> x86_64-unknown-linux-gnu. > >> With partitoning enabled, for the above test-case, > >> get_untransformed_body () is called only once per node. However I > >> don't understand why it gets called twice with -flto-partition=none. > > Kugan pointed out to me the test-case doesn't ICE by passing -fno-ipa-icf. > > It looks to me we are not supposed to call this twice on the same node > but nothing guards against that (and we don't release the untransformed > body either). -flto-partition=none is special here becaue it combines WPA You are suppose to read every function body just once from LTO stream. Doing it multiple times is a waste. You can call get_body and get_untransformed_body as many times as you want as tHere is guard: bool cgraph_node::get_untransformed_body (void) { lto_file_decl_data *file_data; const char *data, *name; size_t len; tree decl = this->decl; if (DECL_RESULT (decl)) return false; once you read the body DECL_RESULT is set. Only way you can get past this to ICE should be the case where you read the body and then trhow it away. > and late stage. IPA ICF is the only IPA pass needing untransformed bodies > during WPA. That is not exactly true. We also read untransformed bodies when merging profiles. I will check the PR. ipa-icf is not supposed to get body of thunks. Honza
Re: [RFC] DW_OP_piece vs. DW_OP_bit_piece on a Register
On Sat, Jan 16 2016, Joel Brobecker wrote: >> After analyzing some test case failures in GCC and GDB I realized that >> there are various problems with the handling of DWARF pieces >> (particularly from registers) in the current implementations of GCC and >> GDB. I'm working on a fix for the GDB part, but first I'd like to check >> whether I'm heading into the right direction -- or what the right >> direction is supposed to be. The article below outlines these issues >> and the suggested solution options. > > This is a very nice and detailed analysis of the current situation. > Thank You! > > I admit that I read through the document fairly rapidly; it does > seem to me, at this point, that the first step might be to get > clarification from the DWARF committee? Or is input from the GCC/GDB > community going to help the discussion with the DWARF committee? I think it would be helpful to form an opinion within the GCC/GDB community first. Then we can open a DWARF issue with a specific change request, if necessary. FWIW, here's my (current) opinion: I don't like option 4.2 ("loose interpretation"), because it doesn't seem to have any significant advantages over 4.3 and is more complex. I'm less sure about 4.3 versus 4.1. Option 4.3 seems more intuitive and may require a bit less code than 4.1, but is not compliant with the current standards' wording and does not support the SPU "preferred slots". And regarding the padding support, I prefer 5.3.1 ("no padding support"). -- Andreas
Re: distro test rebuild using GCC 6
On Fri, Jan 15, 2016 at 05:52:49PM +, James Greenhalgh wrote: > > libbpp-qt_2.1.0-1ubuntu2 > > [ ICE: Looks like: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68068 > but reproduces on current trunk. Testcase reducer is in progress. ] This turned out to be https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68148 , now fixed for trunk with r232410. Author: hubicka Date: Fri Jan 15 11:00:24 2016 + PR ipa/68148 * ipa-icf.c (sem_function::merge): Virtual functions may become reachable even if they address is not taken and there are no idrect calls. * g++.dg/ipa/devirt-49.C: New testcase. Thanks, James
Re: ipa vrp implementation in gcc
> On Mon, Jan 18, 2016 at 12:00 AM, Kugan > wrote: > > Hi, > > > >> Another potential use of value ranges is the profile estimation. > >> http://www.lighterra.com/papers/valuerangeprop/Patterson1995-ValueRangeProp.pdf > >> It seems to me that we may want to have something that can feed sane loop > >> bounds for profile estimation as well and we can easily store the known > >> value ranges to SSA name annotations. > >> So I think separate local pass to compute value ranges (perhaps with less > >> accuracy than full blown VRP) is desirable. > > > > Thanks for the reference. I am looking at implementing a local pass for > > VRP. The value range computation in tree-vrp is based on the above > > reference and uses ASSERT_EXPR insertion (I understand that you posted > > the reference above for profile estimation). As Richard mentioned in his > > reply, the local pass should not rely on ASSERT_EXPR insertion. > > Therefore, do you have any specific algorithm in mind (i.e. Any > > published paper or reference from book)?. Of course we can tweak the > > algorithm from the reference above but would like to understand what > > your intension are. > > I have (very incomplete) prototype patches to do a dominator-based > approach instead (what is refered to downthread as non-iterating approach). > That's cheaper and is what I'd like to provide as an "utility style" interface > to things liker niter analysis which need range-info based on a specific > dominator (the loop header for example). In general, given that we have existing VRP implementation I would suggest first implementing the IPA propagation and profile estimation bits using existing VRP pass and then try to compare the simple dominator based approach with the VRP we have and see what are the compile time/code quality effects of both. Based on that we can decide how complex VRP we really want. It will be probably also more fun to implement it this way :) I plan to collect some data on early VRP and firefox today or tomorrow. Honza > > Richard. > > >> I think the ipa-prop.c probably won't need any siginificant changes. The > >> code basically analyze what values are passed thorugh the function and > >> this works for constants as well as for intervals. In fact ipa-cp already > >> uses the same ipa-prop analysis for > >> 1) constant propagation > >> 2) alignment propagation > >> 3) propagation of known polymorphic call contextes. > >> > >> So replacing 1) by value range propagation should be easily doable. > >> I would also like to replace alignment propagation by bitwise constant > >> propagation (i.e. propagating what bits are known to be zero and what > >> bits are known to be one). We already do have bitwise CCP, so we could > >> deal with this basically in the same way as we deal with value ranges. > >> > >> ipa-prop could use bit of clenaing up and modularizing that I hope will > >> be done next stage1 :) > > > > We (Myself and Prathamesh) are interested in working on LTO > > improvements. Let us have a look at this. > > > >>> > - Once we have the value ranges for parameter/return values, we could > rely on tree-vrp to use this and do the optimizations > >>> > >>> Yep. IPA transform phase should annotate parameter default defs with > >>> computed ranges. > >> > >> Yep, in addition we will end up with known value ranges stored in > >> aggregates > >> for that we need better separate representaiton > >> > >> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68930 > >>> > > > > Thanks, > > Kugan
x86 interrupt attribute
What is the status of the x86 interrupt attribute patch? One of the last references I see is here and an attempt to update the middle-end here. -- Wink
Re: x86 interrupt attribute
On Mon, Jan 18, 2016 at 9:39 AM, Wink Saville wrote: > What is the status of the x86 interrupt attribute patch? > > One of the last references I see is here and an attempt to update the > middle-end here. > > -- Wink You can try hjl/interrupt/stage1 branch in git repo, which is queued for GCC 7. -- H.J.
Re: x86 interrupt attribute
It looks like it will be awhile before its included, what's your guess on when stage1 will commence for GCC 7? About how often will you be rebasing hjl/interrupt/stage1 onto master? On Mon, Jan 18, 2016 at 9:47 AM H.J. Lu wrote: > > On Mon, Jan 18, 2016 at 9:39 AM, Wink Saville wrote: > > What is the status of the x86 interrupt attribute patch? > > > > One of the last references I see is here and an attempt to update the > > middle-end here. > > > > -- Wink > > You can try hjl/interrupt/stage1 branch in git repo, which is queued for > GCC 7. > > -- > H.J.
Re: x86 interrupt attribute
On Mon, Jan 18, 2016 at 10:27 AM, Wink Saville wrote: > It looks like it will be awhile before its included, what's your guess > on when stage1 will commence for GCC 7? > > About how often will you be rebasing hjl/interrupt/stage1 onto master? I update hjl/interrupt/stage1 branch about once a week. > On Mon, Jan 18, 2016 at 9:47 AM H.J. Lu wrote: >> >> On Mon, Jan 18, 2016 at 9:39 AM, Wink Saville wrote: >> > What is the status of the x86 interrupt attribute patch? >> > >> > One of the last references I see is here and an attempt to update the >> > middle-end here. >> > >> > -- Wink >> >> You can try hjl/interrupt/stage1 branch in git repo, which is queued for >> GCC 7. >> >> -- >> H.J. -- H.J.