Sent from my iPhone
On Mar 6, 2009, at 7:00 PM, Silvius Rus <r...@google.com> wrote:
I'm thinking about adding bitwise dataflow analysis support to RTL.
Is this a good idea? Bad idea? Already done? Please review if
interested.
There is already some bitwise dataflow implemented in combine. And I
think df already has bytewise support, but it is turned off because of
problems with some targets and modes. I also think Steven B. had an
implementation at the tree level, or at least ideas on how to implement
it there.
Thank you,
Silvius
Motivation
==========
int foo(int x)
{
  int i = 100;
  do
    {
      if (x > 0)
        x = x & 1;  /* After this insn, only bit 0 of x may be nonzero.  */
      else
        x = x & 2;  /* After this insn, only bit 1 of x may be nonzero.  */
      i--;
    }
  while (i);
  return x & 4;     /* Before this insn, only bits 0 and 1 of x may be
                       nonzero, so bit 2 is always 0.  */
}
"foo" should simply return 0.
This optimization is currently missed at -O2 and -O3 on x86_64.
(Cases with simpler control flow are already handled by the "combine" pass.)
Proposal
========
1. Dataflow framework to propagate bitwise register properties.
(Integrated with the current dataflow framework.)
2. Forward bitwise dataflow analysis: constant bit propagation.
3. Backward bitwise dataflow analysis: dead bit propagation.
   (A rough sketch of the backward problem follows this list.)
4. Target applications: improve dce and see. (Others?)
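
To make item 3 concrete, here is a minimal sketch of a backward "dead bit"
transfer step, assuming a simple live-bit mask per use; the function name is
hypothetical and this is not existing GCC code.

#include <stdint.h>

/* Backward transfer through "dest = src & mask" for a constant mask:
   a bit of SRC can matter only if it is live in DEST and set in MASK.  */
static uint64_t
live_bits_through_and (uint64_t dest_live_bits, uint64_t mask)
{
  return dest_live_bits & mask;
}

/* E.g., before "return x & 4;" in the motivation example all bits of the
   return value are live, so only bit 2 of x is live at that use.  */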
Preliminary Design
==================
1. I don't have enough understanding of GCC to decide whether it should be
   done at the RTL level or the tree level.  After some brainstorming with
   Ian Taylor, Diego Novillo and others, we decided to go with RTL.
2. This problem could be solved using iterative dataflow with bitmaps.
   However, both memory size and processing time would increase
   significantly compared to scalar analysis.
   For constant bit propagation, we need to keep 2 bits of state for each
   register bit.  For 64-bit registers, that is a factor of 128x over the
   scalar reaching-definitions problem.
   Instead, I propose a sparse bitwise dataflow framework.  We would still
   use the existing RTL dataflow framework to build scalar DU/UD chains.
   Once they are available, bitwise information is propagated only over
   these chains, analogous to the sparse constant propagation described by
   Wegman & Zadeck (TOPLAS 1991).
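
   As a rough illustration (hypothetical names, not existing GCC code), the
   2 bits of state per register bit mentioned above can be packed into two
   64-bit masks; the merge over reaching definitions then becomes a few
   mask operations:

   #include <stdint.h>

   /* Per-(insn, register) constant-bit state: if a bit is set in KNOWN,
      the corresponding register bit is a compile-time constant whose
      value is the corresponding bit of VALUE.  */
   struct const_bits
   {
     uint64_t known;
     uint64_t value;
   };

   /* Merge two reaching definitions: a bit stays known only if both
      definitions know it and agree on its value.  */
   static struct const_bits
   merge_const_bits (struct const_bits a, struct const_bits b)
   {
     struct const_bits r;
     r.known = a.known & b.known & ~(a.value ^ b.value);
     r.value = a.value & r.known;
     return r;
   }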
3. This might be too much detail at this point, but just in case, here is
   a brief description of a bit constant propagation algorithm.

   For each instruction I in the function body
     For each register R in instruction I
       def_constant_bits(I, R) = collect constants from AND/OR/... operations.

   Iterate until the def_constant_bits don't change:
     For each instruction I in the function body
       For each register R used at I
         use_constant_bits(I, R) = merge (def_constant_bits(D, R))
                                   across all definitions D of R that reach I
       For each register R defined at I
         def_constant_bits(I, R) = transfer (use_constant_bits(I, RU))
                                   for all register uses RU, based on opcodes.

   The data structures and routines "collect", "merge" and "transfer"
   depend on the problem solved.
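
   For instance, the transfer through AND-with-constant and IOR could look
   roughly as follows, reusing the hypothetical two-mask state from the
   sketch under point 2 (again only a sketch, not existing GCC code):

   #include <stdint.h>

   /* Same two-mask state as in the sketch under point 2.  */
   struct const_bits { uint64_t known; uint64_t value; };

   /* dest = src & const_op: result bits are known 0 wherever the constant
      has a 0 bit, and follow SRC where the constant has a 1 bit.  */
   static struct const_bits
   transfer_and_const (struct const_bits src, uint64_t const_op)
   {
     struct const_bits dest;
     dest.known = ~const_op | src.known;
     dest.value = const_op & src.value & src.known;
     return dest;
   }

   /* dest = a | b: a result bit is known 1 if either operand bit is known
      1, and known 0 only if both operand bits are known 0.  */
   static struct const_bits
   transfer_ior (struct const_bits a, struct const_bits b)
   {
     uint64_t known_one = (a.known & a.value) | (b.known & b.value);
     struct const_bits dest;
     dest.known = known_one | (a.known & b.known);
     dest.value = known_one;
     return dest;
   }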
4. Complexity considerations.
   The solver visits every DU edge once for each iteration of the fixed
   point convergence loop.  The maximum number of iterations is given by
   the height of the state lattice multiplied by the number of bits.
   Although this can be as high as 128 for constant bit propagation on
   x86_64 (a lattice of height 2 per bit times 64 bits), in practice we
   expect far fewer iterations.
   Also, lower complexity guarantees can be given if less accurate
   information is allowed, e.g., byte level rather than bit level.  For
   byte constants, the upper bound constant factor drops from 128 to 16
   (2 times 8 bytes).
Some of these ideas came from discussions with Preston Briggs,
Sriraman Tallam and others.