Hi Graham,

Sorry for taking so long getting back to you. I'll reply to both your
mails in one.
On Mon, Jun 29, 2009 at 9:37 PM, Graham Kelly <grah...@facebook.com> wrote:
> Anyway, I've been looking into building a fairly simple def-use
> system where I build a graph for each variable in the function local
> scope (each CV, TMP_VAR, and VAR). What I'm doing is building a local
> graph for each

So you're right, of course. Simple should be the order of the day. You
can very likely do a lot of things with a simple approach. What are the
other possible variable types?

> variable in each basic block and then using the CFG information to
> connect these local graphs to form a function-wide def-use graph for
> each variable.

Does the CFG represent exception information? I think nearly every
statement can throw an exception via set_error_handler? Note that some
statements do nothing due to errors.

> The def-use system seems like it should allow for some fairly
> straightforward work to be done with dead code elimination, constant
> propagation, general usage information, loop induction
> identification, etc.

In theory, yes. I would be surprised if you were able to do
loop-invariant code motion, but you'll probably realize some nice
results with what you named above. I expect you'll find speed-ups from
places where naive bytecode has been generated, too. I found a few
places where we created weird constructs early in the compiler that
were very easy to clear up later, with good results. What are you using
loop induction identification for? Converting for-loops into foreach?

> Where I eventually want to go with this is to try to lock down as
> many data types as possible during optimization. Of course the
> biggest problem pecl/optimizer has with this is not being able to
> look at the entire program's structure and use that to infer function
> argument data types. But I also feel that before such analysis could
> happen I need to have the function-local data flow analysis down
> solid anyway, and at that point I can worry about the unknown types
> of arguments issue.
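For what it's worth, the "local graphs stitched together through the
CFG" idea is essentially reaching definitions. Here's a toy sketch in
Python of how I'd picture it (all names invented; this is not
pecl/optimizer or phc code):

```python
# Hypothetical sketch: per-block def-use info, stitched together via
# the CFG with a reaching-definitions fixed point. All names invented.

def build_def_use(blocks, preds):
    """blocks: {bid: [(op, defs, uses), ...]}; preds: {bid: [bid, ...]}.
    Returns {(block, op_index, var): set of def sites that may reach
    that use}."""
    # Local pass: the defs that survive to the end of each block.
    out_local = {}
    for bid, ops in blocks.items():
        live = {}
        for i, (op, defs, uses) in enumerate(ops):
            for v in defs:
                live[v] = {(bid, i)}          # a later def kills earlier ones
        out_local[bid] = live

    # Global pass: iterate over the CFG until nothing changes.
    IN = {bid: {} for bid in blocks}
    OUT = {bid: dict(out_local[bid]) for bid in blocks}
    changed = True
    while changed:
        changed = False
        for bid in blocks:
            merged = {}
            for p in preds.get(bid, []):
                for v, sites in OUT[p].items():
                    merged.setdefault(v, set()).update(sites)
            IN[bid] = merged
            new_out = {v: set(s) for v, s in merged.items()}
            new_out.update(out_local[bid])    # local defs kill incoming ones
            if new_out != OUT[bid]:
                OUT[bid], changed = new_out, True

    # Link each use to the defs that reach it.
    chains = {}
    for bid, ops in blocks.items():
        live = {v: set(s) for v, s in IN[bid].items()}
        for i, (op, defs, uses) in enumerate(ops):
            for v in uses:
                chains[(bid, i, v)] = set(live.get(v, set()))
            for v in defs:
                live[v] = {(bid, i)}
    return chains
```

On a diamond CFG where two branches each define $x, the use below the
join gets linked to both defs, which is exactly the may-reach
information DCE and constant propagation need.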
> I also feel that data types alone won't really be enough for PHP, and
> tracking possible ranges for variables might be useful (this way, for
> one, you could better lock down the data types, for example by being
> able to determine if a ZEND_ADD opcode will ever overflow into a
> double, or if you can safely infer that it will always be a long).

This seems out of scope for bytecode -> bytecode? How would you use
that information? Adding extra bytecodes? I very much doubt this will
be worth it, especially considering the effort involved in following
types around (don't forget that every function can return NULL, and
many can return false). I see few places where I know a variable is an
int.

However, if you do go down this route, you might try peeling off the
first iteration of a loop. That way the first iteration will have the
weird stuff, if any, and the later iterations will have more pristine
information. There's a paper on this for Python:
http://portal.acm.org/citation.cfm?id=1534550

On Tue, Jun 30, 2009 at 8:51 PM, Graham Kelly <grah...@facebook.com> wrote:
> Anyway, I guess I'll elaborate a bit on what I've been doing with
> data flow (ish) stuff. Basically I am looking at building a def-use
> graph for each variable (as I mentioned in my last email). The
> problems seem to come with loops and with var-vars, evals, and
> includes (I also haven't worked out array indexes and object
> properties yet). The loops have had me considering

I would say that for var-vars, evals, and includes, just don't optimize
functions containing them. Evals are bad practice, includes are
probably bad practice within a function, and var-vars are unusual. Of
course, you might be interested if a lot of the code you're optimizing
is automatically generated, in which case:

evals/includes: you no longer know anything about anything (unless you
can see what's included/evaled, and parse it, etc.)

var-vars: they can read from any variable, and write to any variable.
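In def-use terms, a var-var write is the classic weak update: it may
define every variable, so it must not kill any existing def. A toy
sketch of the distinction (names invented, nothing here is real
optimizer code):

```python
# Sketch of strong vs. weak updates when tracking reaching defs:
# "$x = 5" kills earlier defs of $x (strong update), while "$$a = 5"
# is a may-def of every known variable, so it must NOT kill anything
# (weak update). Illustrative only.

def apply_def(live, site, var=None, all_vars=None):
    """live: {var: set of def sites currently reaching this point}."""
    if var is not None:
        live[var] = {site}                     # ordinary def: strong update
    else:
        for v in all_vars:                     # var-var def: weak update
            live.setdefault(v, set()).add(site)
```

After a weak update, every variable carries both its old defs and the
var-var site, which is what makes later analysis so imprecise.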
So "$$a = 5;" is a may-def of all variables, and "print ($$a)" is a
may-use of all variables. The set of all variables includes unknown
variables (say, if $a is "facebook.com"). If you have var-vars and
references in the same bytecode, give up on this function.

> trying to use SSA because of the niceness it brings and because I
> think the phi functions could simplify some of the junk with loops.

For var-vars, SSA gives a really nice framework to work with. My own
experience, though, is that everything I want to do needs to be done in
order to build SSA form. I gave a talk before on the nastiness of
trying to do SSA in PHP:
http://www.prog.uni-saarland.de/ssasem/talks/Paul.Biggar.pdf. But I got
SCCP and DCE to work well on SSA, if I ignored lots of edge cases.
Basically, there is a small set of hidden operations, like setting
$php_errormsg. Once you spot when those happen, you're probably home
free.

> evals, and includes I was considering inserting a fake def at each
> one of

That is the approach often taken:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.6974 (look
for "virtual variable"). Although, now that I think about it, that
paper is hard, so I wouldn't bother. I've chatted with experts about
what that paper really means, and people don't really know. So in this
case, your use of variable $x would be chained to the def of the fake
var for $$a, which would be chained to all defs before that. However, I
think they're rare enough that you can either stop optimizing the
function, or that you won't suffer much performance penalty from
over-writing every variable.

> those. The fake def could be marked specially in order to allow dead
> code elimination the ability to know that these aren't real defs and
> maybe make some observations about that (or whatever).

Don't bother. You won't gain anything. I wouldn't spend time trying to
make every single case precise.

> As for array/object stuff... I'm not extremely interested in trying
> to track this at the moment.
> For one I don't think doing so will yield many results with the
> limited scope pecl/optimizer has for analysis, plus I have no plans
> for using that data at

Good, that sounds right.

> the moment. The only big advantage is it might help narrow down data
> types for other vars better. The other big area I haven't looked at
> is references. The easiest approach I see here is to just duplicate
> the def-use graph for each reference from the point after the
> reference was created.

I'm not sure what you mean by "duplicate the def-use graph". Basically,
you just need to keep track of which variables might reference other
variables, and possibly whether they "might" or "must". Variables which
are passed into functions, come as return values, are global, are
parameters, etc., fall into the set of "might be referenced". You
should be able to access all the signatures of called functions, so the
results shouldn't be too bad.

> This of course could lead to a lot of extra junk in code with lots of
> references. Plus references to specific array indexes etc. might turn
> out to be a big pain. Heh, sorry for the open-endedness here but...
> any ideas/comments on this approach?

Don't model array indices or fields (especially fields). Always assume
you never know the value of a referenced variable. If there are lots of
references, and you are using too much memory, give up after a certain
threshold.

Some advice that I would like to have taken: give up on things that
look like they might be hard. (I wasn't in a position to do that,
because the hard thing is what I'm doing my PhD on, but you don't need
to do it all.)

Some other things I would add:
- focus on constants, DCE, and bytecode-specific stuff. You're not
  compiling here.
- use SCCP for constant propagation. Use the DCE algorithm from
  "Engineering a Compiler" by Cooper/Torczon.

Anyway, it sounds like you really know what you're at. You must have
been studying. What texts are you using?
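That DCE algorithm is a mark/sweep over def-use chains: mark the
critical operations, chase the chains backwards, and everything never
marked is dead. Roughly (toy IR, invented names -- not the book's
actual pseudocode):

```python
# Rough shape of mark-and-sweep DCE over def-use chains: start from
# critical operations (side effects, returns, output, ...), walk the
# chains backwards marking the defs they depend on, then sweep away
# everything unmarked. The IR here is invented for illustration.

def mark_live(ops, defs_of):
    """ops: {site: (is_critical, vars_used)};
    defs_of: {var: set of sites defining it}.
    Returns the set of live sites."""
    worklist = [s for s, (critical, _) in ops.items() if critical]
    live = set(worklist)
    while worklist:
        site = worklist.pop()
        for v in ops[site][1]:                 # every value this op uses...
            for d in defs_of.get(v, set()):    # ...depends on its defs
                if d not in live:
                    live.add(d)
                    worklist.append(d)
    return live

def sweep(ops, live):
    return [s for s in ops if s in live]       # keep only live ops
```

A def whose value never flows (transitively) into a critical op -- say
an assignment to a variable that's never printed or returned -- simply
never gets marked and is swept.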
If you don't have it, I'm a big fan of Cooper/Torczon -- it's very
pragmatic.

Paul

> On 6/29/09 11:04 AM, "Paul Biggar" <paul.big...@gmail.com> wrote:
>
> Hey Graham,
>
> Sorry for the delay, I meant to send this over the weekend.
>
> On Thu, Jun 25, 2009 at 2:50 AM, Paul Biggar <paul.big...@gmail.com> wrote:
>>
>> When I say model, I mean that we "abstractly" "interpret" the
>> program. It's like we're interpreting it, but we don't have the
>> actual values, so we model all possible values at a certain point.
>> We do this using 3 combined analyses: constant propagation, alias
>> analysis, and type inference. We also model definition of constants
>> and the callgraph. We don't currently (but plan to in the future)
>> model integer- and string-value-ranges, included files, or "truth
>> analysis" (though that would only take about an hour, I think).
>
> (If you're already familiar with how compiler optimization works, the
> first few paragraphs won't be anything new to you. I'll gloss over
> the simple compiler terms; please email me if googling hasn't worked
> for you. I'll keep explaining the advanced concepts.)
>
> So, the abstract interpretation then. We analyse a representation
> called the "Medium-level Intermediate Representation" (MIR), which is
> pretty close to PHP bytecode. We start by invoking the first method,
> __MAIN__, which represents the global scope.
>
> Upon entering a method, we convert it into a control-flow graph
> (CFG), which has one "basic block" per statement. We add the entry
> block to a "worklist", which is processed roughly in FIFO order.
> After we process a statement for the first time, we add its
> successors to the worklist. In the case of loops, we may process the
> same statement again (or a few times) -- in which case we only add
> the successor if the results of the analysis have changed (i.e. we
> iterate until we reach a "fixed point").
>
> This is all just like normal dataflow analysis.
> One twist we have is that we use "conditional constant propagation".
> This means that if we have something like:
>
> if ($myvar) $x = 5; else $x = 6;
>
> and we know that $myvar is 7, then we don't process the $x = 6, since
> it can't occur.
>
> At each "program point" (i.e. after each block) we store the complete
> state of the analysis at this point. What exactly is stored depends
> on the analysis. For constant propagation, we store a "lattice" of
> constants. That means that we know each variable is one of:
> - TOP: the value is undefined
> - BOTTOM: the variable may have multiple values
> - or we know the exact constant that variable represents
>
> With PHP, there is a slight difference from "normal" lattice-based
> analysis, in that TOP can represent NULL, since variables are
> automatically initialized to NULL. When we find another value for $x,
> we merge it with the current value. While (TOP + some value V => V),
> (NULL + non-NULL => BOTTOM).
>
> For type inference, we use a set of type names for each variable. The
> types are int, bool, unset, real, string, resource, array, and any
> concrete classname. We don't have a BOTTOM (most conservative
> result), because not knowing one type is the end of the analysis. In
> theory, this means that in some unusual cases (like
> "new $classname.$num") the analysis will only terminate by running
> out of memory. Oh well.
>
> For constants (as in "define (MYCONST, 5);"), we store the value of
> each constant so far in the program. We use the lattice approach from
> constant propagation, but unknown values only come up if you define
> the same constant in different branches, which isn't too common, I
> think. We currently don't model these well enough to be able to
> resolve a call to is_defined, but this wouldn't be too hard.
>
> Finally, the really big and important analysis is alias analysis.
> I'll go into some detail in a moment.
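The merge rule described above fits in a few lines. A toy sketch
(Python's None stands in for an explicit PHP NULL; all names are
invented, this is not phc code):

```python
# Toy sketch of the constant-lattice merge. TOP doubles as
# "uninitialized" (implicit NULL), so TOP + V => V; an *explicit* NULL
# is modelled as an ordinary constant (None here), so NULL + non-NULL
# falls through to the BOTTOM case. Illustration only.

TOP, BOTTOM = "TOP", "BOTTOM"

def merge(a, b):
    if a == TOP:
        return b            # TOP + V => V (covers TOP + TOP too)
    if b == TOP:
        return a
    if a == b:
        return a            # same known constant
    return BOTTOM           # conflicting values, e.g. NULL + 5, or 5 + 6
```

So merging the two arms of "if ($c) $x = 5; else $x = 5;" keeps the
constant 5, while "if ($c) $x = 5; else $x = null;" drops $x to BOTTOM.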
> We store at each program point a complete points-to graph of the
> alias results, and use copy-on-write to avoid a massive explosion in
> space and execution time (not very successfully).
>
> Alias analysis is the really important analysis. I mentioned before
> that not knowing a single type will ruin the analysis. When I said
> above that we stored a constant lattice and a set of types for each
> variable, that wasn't exactly true. We actually store them for each
> object, field, array, index, symbol table and variable. The alias
> analysis allows us to tell the interaction between all of these.
>
> You might notice that these three concepts are mostly the same:
> object is-to field as array is-to index as symbol-table is-to
> variable. In the Zend engine, these are all implemented using
> hashtables. In our analysis, these are also all modelled the same
> way.
>
> Our points-to graph is just a graph with 3 kinds of nodes:
> - "storage nodes" represent objects, arrays and symbol tables,
> - "index nodes" represent fields, variables and array indices,
> - "value nodes" represent scalar values of index nodes.
>
> There are 3 kinds of edges:
> - "field edges" go from a storage node to an index node: for S -> I,
>   the object S may have a field named I.
> - "reference edges" go from index nodes to index nodes: for I1 -> I2,
>   the field I1 references the field I2.
> - "value edges" go from index nodes to storage or value nodes: for
>   I -> S, one of I's values is S.
>
> As with constants, we can't always resolve whether a field must
> exist. If a field might be NULL, we've modelled that the same as if
> it's uninitialized. Again, this isn't impossible, just not a
> priority.
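The node/edge scheme above is essentially three maps. A minimal sketch
(invented names; not phc's actual data structures):

```python
# Minimal shape of the points-to graph described above: three node
# kinds (storage, index, value) and three edge kinds (field,
# reference, value). Names invented for illustration.

class PointsToGraph:
    def __init__(self):
        self.fields = {}   # storage node -> set of index nodes it contains
        self.refs = {}     # index node -> {index node: "DEFINITE"|"POSSIBLE"}
        self.values = {}   # index node -> set of storage/value nodes

    def add_field(self, storage, index):
        self.fields.setdefault(storage, set()).add(index)

    def add_ref(self, i1, i2, certainty="POSSIBLE"):
        self.refs.setdefault(i1, {})[i2] = certainty

    def add_value(self, index, target):
        self.values.setdefault(index, set()).add(target)

# E.g. "$x = 5;" in __MAIN__: the symbol table (a storage node) gets
# an index node "x", whose value node is the scalar (5, int).
g = PointsToGraph()
g.add_field("ST:__MAIN__", "x")
g.add_value("x", ("5", "int"))
```

Objects, arrays and symbol tables all go through the same storage/index
machinery, mirroring how the Zend engine implements all three as
hashtables.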
> To give a more concrete example, consider (we'll assume the function
> is __MAIN__):
>
> - $x = 5; -- create an index node "x", pointed to by the local symbol
>   table "__MAIN__", and pointing to a value node whose value is
>   "5,int".
>
> - $a = $b; -- do a deep copy from "__MAIN__" -> "b" to "__MAIN__" ->
>   "a". This means take all the storage nodes and do a deep clone
>   (keeping track of cycles). We only do deep clones of arrays.
>   Objects are copied by reference, so we just add an edge to the
>   existing storage node representing that object. For scalars, we
>   merge the values in the type inference and constant propagation,
>   but the points-to graphs are unaffected.
>
> - $y =& $obj->fi; -- Find all the index nodes represented by
>   $obj->fi. First, find $obj, which is the index node "obj" in the
>   local symbol table "__MAIN__". Then, for each of the storage nodes
>   $obj points to, find its index node "fi". If we're lucky, there
>   will only be one such node, in which case we create a DEFINITE
>   reference edge between it and "y". Otherwise, we create POSSIBLE
>   reference edges between the "fi"s and "y". We then copy all the
>   values to $y by adding an edge from y to all the storage nodes of
>   the $obj->fi's. We don't add an edge to $obj->fi's value node;
>   instead, we merge its value into $y's value node.
>
> Actually, it's a little bit more complicated than this, because of
> references. So
>
> $x = 5;
>
> actually checks whether $x references any index nodes, whether it can
> overwrite them with 5 or must make the result BOTTOM, etc.
>
> I've put up an example of a points-to graph at
>
> https://www.cs.tcd.ie/~pbiggar/whirl_41.png
>
> which represents:
>
> http://code.google.com/p/phc/source/browse/branches/dataflow/test/subjects/benchmarks/php-benchmarks/benchcli/tests/phpWhirl/classes/WhirlMemory.php#45
> Grey boxes are symbol tables (function-local ones have "=" in their
> name), blue boxes are values (B indicates BOTTOM, {...} is the set of
> values), and circles are index nodes. There are no reference edges in
> this graph; they're all points-to or value edges. This is only the
> very start of that program.
>
> OK, that's an executive overview of the analysis algorithm. There are
> quite a few details, as you might imagine, so I think the easiest
> thing is to pick on something and ask about it. Topics of interest
> might be:
>
> implementation details
> context-sensitivity
> flow-sensitivity
> loops
> recursion
> eval
> include()
> "fake" nodes
> method/function calls
> modelling builtin functions
> inheritance
> SSA
> def-use information
> {string,int}-range analyses
> optimizations
>
> Feel free to literally go through these one at a time, but I'd rather
> you directed the course than me (that'll make it harder for me to
> gloss over something I'm unsure about).
>
> Paul
>
> --
> Paul Biggar
> paul.big...@gmail.com

--
Paul Biggar
paul.big...@gmail.com

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php