On 11/5/2021 8:13 AM, Andrew Appel wrote:
Dear GCC developers:

TL;DR Can I talk to a maintainer of gcc's tail-call optimizer to learn more 
about the very impressive technology behind it? Or is there a paper, or 
documentation, that explains it?
No paper or documentation I'm aware of.



I am sure you know that gcc does a much better job of tail-call optimizations 
than its competitors. Recently I noticed this little gem:

void f( void *p);

int h( int n, void *p)
{
{
void *a[ 10 ];
a[ 0 ]= p;
f(( void *)a);
}
return h(n, p);
}

With the inner curly braces, gcc turns the call to h into a jump. Without the 
inner curly braces, gcc does not do tail-call optimization. This is exactly 
correct, because we cannot know whether f() stores its argument in a place 
where later calls to h() might access it. The inner braces ensure that a[] is 
officially deallocated before the call to h.
Right.  The lifetime of the object is critical to knowing if its safe to tail call.

The easiest way to see what GCC does is via its debugging dumps.  In this particular case we're dealing with tail recursion which is happening entirely in the gimple optimizers.  So -fdump-tree-all-blocks-details will give you the relevant debugging dumps.

The dumps are numbered based on the pass ordering in the compiler. Looking at the pass before "tailr1" we have:

;; Function h (h, funcdef_no=0, decl_uid=1938, cgraph_uid=1, symbol_order=0)

h (int n, void * p)
{
  void * a[10];
  int _8;

;;   basic block 2, loop depth 0, maybe hot
;;    prev block 0, next block 1, flags: (NEW, VISITED)
;;    pred:       ENTRY (FALLTHRU,EXECUTABLE)
  a[0] = p_2(D);
  f (&a);
  a ={v} {CLOBBER};
  _8 = h (n_6(D), p_2(D));
  return _8;
;;    succ:       EXIT (EXECUTABLE)

}

Note the a = {v}{CLOBBER}.  That's the end of the lifetime of "a" and it's how we know tail recursion optimization is safe in this particular case.  In the .tailr1 dump we have:

;; Function h (h, funcdef_no=0, decl_uid=1938, cgraph_uid=1, symbol_order=0)

Eliminated tail recursion in bb 2 : _8 = h (n_6(D), p_2(D));


Updating SSA:
creating PHI node in block #2 for .MEM
Registering new PHI nodes in block #0
Registering new PHI nodes in block #2
Updating SSA information for statement a[0] = p_2(D);
Updating SSA information for statement f (&a);
Updating SSA information for statement a ={v} {CLOBBER};

Symbols to be put in SSA form
{ D.1944 }
Incremental SSA update started at block: 0
Number of blocks in CFG: 3
Number of blocks to update: 2 ( 67%)
Affected blocks: 0 2


fix_loop_structure: fixing up loops for function
flow_loops_find: discovered new loop 1 with header 2
h (int n, void * p)
{
  void * a[10];

;;   basic block 2, loop depth 1, maybe hot
;;    prev block 0, next block 1, flags: (NEW, VISITED)
;;    pred:       ENTRY (FALLTHRU,EXECUTABLE)
;;                2 (FALLTHRU,DFS_BACK,EXECUTABLE)
  a[0] = p_2(D);
  f (&a);
  a ={v} {CLOBBER};
  goto <bb 2>; [INV]
;;    succ:       2 (FALLTHRU,DFS_BACK,EXECUTABLE)

}

And you can see we've turned the tail recursive call into a simple jump which creates a loop.





Clang/llvm does not tail-call-optimize this program.

So, congratulations.

I would enjoy speaking to one of you who is knowledgeable about how gcc does 
this.

A few years ago I wrote some compiler textbooks,
Modern Compiler Implementation in C, in ML, in Java (Cambridge University Press)
and I am still interested in these things. More to the point, I use C as the 
target for a functional-language compiler, for which tail-call optimization is 
very important.
Yup.  Got it on my bookshelf somewhere.


So please let me know,
* is there a paper somewhere that describes the approach that gcc takes for TCO?
* I'm particularly interested in how gcc's intermediate language represents the 
fact that a[] is deallocated at the first close-curly-brace
* Would someone like to Zoom sometime to tell me about this?
The dump files are going to be the best window into this stuff. There are cases where we just mark potential tail call sites, but have to defer the final decision until later in the optimizer pipeline, but this case is simple enough that we don't need to query the target's capabilities, look at stack usage, worry about argument mappings, etc.

jeff

Reply via email to