On Dec 18, 2007, Ian Lance Taylor <[EMAIL PROTECTED]> wrote:

> Alexandre Oliva <[EMAIL PROTECTED]> writes:
>
>> A plan to fix local variable debug information in GCC
>>
>> by Alexandre Oliva <[EMAIL PROTECTED]>
>>
>> 2007-12-18 draft
> Thank you for writing this.  It makes an enormous difference.

NP.  Thanks for the encouragement.

>> == Goals

> I note that you don't say anything about the other big problem with
> debugging optimized code, which is that the debugger jumps around all
> over the place.

Yep, it's a separate project, one that I'm somewhat interested in, and
it may even be somewhat easy to fix with judicious use of is_stmt
notes, but it's not my top priority ATM.

>> Once this is established, a possible representation becomes almost
>> obvious: statements (in trees) or instructions (in rtl) that assert,
>> to the variable tracker, that a user variable or member is
>> represented by a given expression:
>>
>>   # DEBUG var expr
>>
>> By var, we mean a tree expression that denotes a user variable, for
>> now.  We envision trivially extending it to support components of
>> variables in the future.

> While you say that this is almost obvious, it still isn't obvious at
> all to me.  You consider trees and RTL together, but I don't see why
> that is appropriate.

You snipped (skipped?) one aspect of the reasoning on why it is
appropriate.  Of course this doesn't prove it's the best possibility,
but I haven't seen evidence that it isn't.

> My biggest concern at the tree level is the significantly increased
> memory usage

One of the first measurements we had from my code was from Richi, who
reported that it didn't increase memory use by much.

> and the introduction of a sort of a weak pointer to
> values.  Since DEBUG statements shouldn't interfere with
> optimizations, we need to explicitly ignore them in things like
> has_single_use.

That's probably the easiest part, and it's already done.

> But since our data structures need to be coherent, we can not ignore
> them when we actually eliminate SSA names.  That seems sort of
> complicated.

It's not.  The code to do this is ready.
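To make the role of the `# DEBUG var expr` annotations concrete, here
is a toy model of my own (an illustration, not GCC code): a forward
scan over an instruction stream in which DEBUG entries assert the
current binding of a user variable, so a consumer can recover correct
values at inspection points even after the original assignments were
optimized away.

```python
# Toy model of a variable tracker consuming "# DEBUG var expr"
# annotations.  Purely illustrative; not GCC's implementation.

def track(insns):
    """Scan insns in order; DEBUG entries update bindings, PROBE
    entries snapshot what a debugger should see at that point."""
    bindings = {}   # user variable -> expression currently holding it
    at_probe = {}   # probe name -> snapshot of bindings at that point
    for insn in insns:
        if insn[0] == "DEBUG":          # ("DEBUG", var, expr)
            _, var, expr = insn
            bindings[var] = expr
        elif insn[0] == "PROBE":        # ("PROBE", name)
            at_probe[insn[1]] = dict(bindings)
    return at_probe

# i = x and j = y were optimized away, but the DEBUG annotations keep
# the *point* of assignment: i and j are unknown at p1, known at p2.
insns = [
    ("PROBE", "p1"),
    ("DEBUG", "i", "x_1(D)"),
    ("DEBUG", "j", "y_3(D)"),
    ("PROBE", "p2"),
]
print(track(insns))
# -> {'p1': {}, 'p2': {'i': 'x_1(D)', 'j': 'y_3(D)'}}
```

The key property is that the snapshot at p1 is empty while the one at
p2 is not: a representation that records only "i is x, j is y" per SSA
name, without the point of assignment, cannot distinguish the two.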
After I got bootstrap-debug to pass on x86_64-linux-gnu, I don't
recall needing any further changes in the tree passes for
i386-linux-gnu, and none of the ia64-linux-gnu or ppc64-linux-gnu
fixes I've made so far (most of them to their machine-dependent
schedulers) required changes in the tree passes either.  So we can
safely count that as easy and maintainable.  Looking at the patches
for the tree infrastructure in the vta branch will give you a very
good idea of the effort involved.

> In SSA form it seems very natural to provide a set of associations
> with user variables for each GIMPLE variable.

Yes.  This provides for a simple AND WRONG representation (but not a
hopeless one; see below, after the sample code).  We went through some
of this already.  You can't recover the information with a scheme that
throws away the point of assignment; even the basic block of the
assignment is lost.  You can't generate correct debug information this
way.  The limitation of approaches like this is addressed in passing
in the examples, but I didn't want to carry discussions about broken
designs that I thought we'd already left behind into the concise
design document.

> Since the GIMPLE variables never change, these associations never
> change.  We have to get them right when we create a new GIMPLE
> variable and when we eliminate a GIMPLE variable.

Maybe you can show us how to represent the annotations for the two
trivial examples I've chosen in the paper, to show that the compiler
stands a chance of generating correct debug information with them.

> Of course this means that we are keeping the debug information in a
> reversed form.

That by itself is not such a big deal; it would just lose some
completeness, and it would probably carry around lots of useless
notes.  The real problem is that it loses information that is
essential for generating correct debug information.
> Instead of saying that a user variable is associated with an
> expression in terms of GIMPLE variables, we will say that a GIMPLE
> variable is associated with an expression in terms of user
> variables.

Let me see if I understand what you have in mind.  Given:

  int f(int x, int y) {
    int i, j;
    probe1();
    i = x;
    j = y;
    probe2();
    if (x < y)
      i += y;
    else
      j -= x;
    probe3();
    return g (i, j);
  }

we'd SSAify it into something like:

  int f(int x, int y) {
    int i;
    int j;
    int T;
    probe1();
    i_0 = x_1(D);                              /* i */
    j_2 = y_3(D);                              /* j */
    probe2();
    if (x_1(D) < y_3(D))
      i_4 = i_0 + y_3(D);                      /* i */
    else
      j_5 = j_2 - x_1(D);                      /* j */
    i_6 = PHI <i_4(bb_then), i_0(bb_else)>     /* i */
    j_7 = PHI <j_2(bb_then), j_5(bb_else)>     /* j */
    probe3();
    T_8 = g (i_6, j_7);
    return T_8;
  }

And I can see that setting breakpoints at the probe points would get
you correct values for i and j.  In fact, these annotations, so far,
are no different from what we already have today.

But then, if we optimize this just a little bit, I can't quite tell
what we'd get that would enable correct debug information:

  int f(int x, int y) {
    int i;
    int j;
    int T;
    probe1();
    /* p1: ??? i, j */
    probe2();
    if (x_1(D) < y_3(D))
      i_4 = x_1(D) + y_3(D);                   /* i */
    else
      j_5 = y_3(D) - x_1(D);                   /* j */
    i_6 = PHI <i_4(bb_then), x_1(D)(bb_else)>  /* i */
    j_7 = PHI <y_3(D)(bb_then), j_5(bb_else)>  /* j */
    probe3();
    T_8 = g (i_6, j_7);
    return T_8;
  }

Now, if you tell me that the information about i_0 and j_2 is
backward-propagated to the top of the function, where x and y are set
up, then I introduce, say, zero-initialization for i and j before
probe1() (an actual function call, mind you), and the representation
is provably broken.  And if you tell me that you just discard that
information, then at probe2() the variables will appear to be
uninitialized (or zero-initialized, after that change), and again the
representation is wrong.
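For contrast, here is a sketch of how the representation I propose
would annotate the optimized version: debug stmts remain at the points
of the deleted assignments (the `=>` syntax is illustrative; the exact
dump form may differ):

```
  probe1();
  # DEBUG i => x_1(D)   /* stands in for the deleted i_0 = x_1(D) */
  # DEBUG j => y_3(D)   /* stands in for the deleted j_2 = y_3(D) */
  probe2();
  if (x_1(D) < y_3(D))
    i_4 = x_1(D) + y_3(D);
  else
    j_5 = y_3(D) - x_1(D);
  ...
```

At probe1() the tracker sees no bindings for i and j, and at probe2()
it sees i in x and j in y, which is exactly what the unoptimized
program would show.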
If you tell me that you keep notes at those points to tell debug
information generation that at probe2() both variables have unknown
values, then you may get correct debug information, but you're
willfully making it incomplete for an extremely common scenario (this
example is intentionally made similar to what you get after one pass
of inlining into f, where i and j were formal arguments of the inlined
function).

If you tell me that you keep notes at that point that indicate the
expected values of i and j, then you've arrived at the representation
I propose.

If you tell me that you keep different notes between probe1() and
probe2(), notes that just mark the point at which i and j receive the
values of x and y, while the annotations remain attached to the SSA
assignments, then this stands a chance of generating correct debug
information.  Something like:

  x_1(D)  /* x starting at entry point, and also i starting at p1 */
  y_3(D)  /* y starting at entry point, and also j starting at p1 */

Maybe these annotations, interspersed in the code, might be easier to
handle.  I hadn't considered this before.  It's worth investigating.

But I still don't have your proposal entirely clear in my mind.  I
don't quite see how it would handle transformations other than trivial
substitutions.  Can you perhaps give examples of how you'd get from
trivial annotations to more complex, potentially ambiguous
expressions, as optimization passes make complex transformations?
Maybe what you have in mind is something along the lines of induction
variables, which loop optimizers would have to annotate explicitly; is
that so?

> It is of course true that optimized code will move around
> unpredictably, and your proposal doesn't handle that.

It does handle that, in that a variable is regarded as being assigned
a value when execution crosses the debug stmt/insn originally inserted
right after the assignment.  This is by design, but I realize now that
I forgot to mention it in the design document.
The idea is that debug insns get high priority in scheduling.  Since
they refer to the assignment just before them, if the assignment is
merely moved earlier, without an intervening scheduling barrier, the
debug insn will follow it.  If the assignment is removed, the debug
insn can legitimately be moved up to the point where the assignment,
had it remained, might have been moved up to.

However, if the assignment is moved to a separate basic block, say out
of a loop or a conditional, then we don't want the debug insn to move
with it: hoisting and commonizing are regarded as setting temporaries,
and the value is only "committed" to the variable when we reach the
point where the assignment would originally have taken place.  Neat,
eh?  I'll add something to this effect to the design document.

> I don't see it as a flaw that it will be possible to view user
> variables outside of their source code range.

Agreed.  Extending the range of a (variable, value) binding to a point
at which the variable wouldn't exist (yet, or any more) without
optimization is fine, but extending the range of such a binding across
an assignment, even an optimized-away one, isn't.

> It's not obvious to me why a DEBUG insn is superior to a REG_NOTE
> attached to an insn.

Mainly because we won't always want to move the note along with the
insn.  A REG_NOTE also isn't unambiguous for parallel sets, though
there are ways around that.  As written in the document, combining the
debug annotation with an assignment is doable and hasn't been
discarded from the plan, but at some point the note may need to be
detached, and then it's not clear to me that the potential memory
savings of the combination are worth the additional maintenance burden
of splitting the notes out on demand, which is my greatest concern.
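A sketch of the hoisting case may help (illustrative pseudo-GIMPLE,
with made-up temporary names): the computation moves out of the
conditional into a temporary, but the debug binding stays put, so the
value is only committed to the user variable where the assignment
originally stood:

```
  T_9 = x_1(D) + y_3(D);     /* hoisted computation; sets a temporary,
                                not the user variable */
  if (x_1(D) < y_3(D))
    {
      # DEBUG i => T_9       /* i only appears to change here, where
                                the original i += y took place */
    }
```

If the debug binding had moved along with the hoisted computation, a
debugger stopped before the conditional would wrongly show i already
updated on the path where the assignment never executes.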
On top of that, after splitting, all the maintenance burden (however
small) of dealing with stand-alone debug annotations would have to be
undertaken anyway, so it appears to me that the combination would just
add complexity.  But then again, I'm not sure about it, so I haven't
ruled it out; the design is open to it.

> The problem with DEBUG insns is of course that the RTL code
> is very sensitive to new insns, and also the additional memory usage.
> You discuss those, but it's not obvious to me why your proposed
> solution is the best one.

I can't assert it's the best, no matter how hard I've worked on this
design.  I've presented my thoughts (or at least as many of them as I
could remember; I may have forgotten some along the way ;-), and I've
shown why other designs presented before didn't solve the problem I
had to solve, as far as I could tell.  Your annotations along with the
point-marking notes are an approach I hadn't considered before, and
I'm pretty sure I don't yet follow to the fullest extent how it would
work, but on first sight it appears to me that it might.  So let's
look further into it.

>> Testing for accuracy and completeness of debug information can be
>> best accomplished using a debugging environment.

> Of course this is very unsatisfactory without an automated testsuite.

Err...  I didn't say that testing through a debugging environment
wouldn't be automated.  My plan is to use something along the lines of
the GDB testsuite scripts, but whether to use GDB or some other
debugging or monitoring infrastructure is a tiny implementation detail
that I haven't worried about at all.  The basic idea is to script the
inspection of variables and verify that the obtained values are the
expected ones, or that the variables are defensibly unavailable at the
inspection points.
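As an illustration of the kind of scripted check I have in mind (a
sketch only: the file name test.c and the probe functions are
hypothetical, and the actual testsuite would use DejaGnu-style expect
scripts rather than a shell one-liner), one could drive GDB in batch
mode over an optimized binary:

```shell
# Compile the probe example with optimization and debug info, then
# stop at probe2 and check what the debugger reports for i and j.
gcc -O2 -g test.c -o test
gdb -batch \
    -ex 'break probe2' -ex run -ex finish \
    -ex 'print i' -ex 'print j' \
    ./test
```

The harness would then compare the printed values against the ones the
unoptimized program would show, accepting a "value optimized out"
answer only where it is defensible.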
-- 
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
FSF Latin America Board Member          http://www.fsfla.org/
Red Hat Compiler Engineer   [EMAIL PROTECTED], gcc.gnu.org}
Free Software Evangelist        [EMAIL PROTECTED], gnu.org}