[PATCH] D76287: [analysis][analyzer] Introduce the skeleton of a reaching definitions calculator

Kristóf Umann via Phabricator via cfe-commits Wed, 18 Mar 2020 09:16:27 -0700

Szelethus marked 9 inline comments as done.
Szelethus added a comment.

In D76287#1927340 <https://reviews.llvm.org/D76287#1927340>, @xazax.hun wrote:


> I think it is very crucial to make the intentions clear, how do you define 
> `definition` and `variable`?
>  Other than assignments we might include pre/postfix increment/decrement, 
> non-const by-ref function arguments and so on as `definitions`.
>  Also, it would be great to have some proposal upfront how indirect writes 
> should behave.


You're totally right. The thing that I wanted to achieve, partly, is to make 
the implementation flexible for things I may have forgotten, or will need to 
add for future standards/languages. As you can see from D64991 
<https://reviews.llvm.org/D64991>, I did make a lot more progress than what is 
shown here, so I'll explain some of my long-term plans based on the experience 
I gathered:

> I think it is very crucial to make the intentions clear, how do you define 
> [...] `variable`?

//Variable// could be practically anything that can be written, but due to the 
nature of what we can work this, we have to make severe restrictions.

- We don't really have a good points-to-analysis. This discards pointees of any 
kind pretty much immediately. I don't have long term plans to add support for 
them.
- Arrays are strongly related, and while there are studies on indexed variables 
(the last paper in my summary talks about such an approach), I think the way 
C++ complicates things on an uncomfortable level. I do have confidence however 
adding support for it in the future for the cases where the index is known 
compile time (possibly with the help of the RD algorithm itself) should be 
doable by modifying the `Variable` class.

So this leaves the most obvious: `VarDecl`s, and for record objects a 
`(VarDecl, FieldChain)` pair (for `s.x.a`, this would be `(s, [x,a])`). Setting 
the `VarDecl` part of the pair to `nullptr` could refer to the implicit this 
during the analysis of a C++ methods.

The plan was to represent each field with a separate `Variable` object, so for 
this code

  struct A {
    struct B {
      int x, y;
    };
    B b;
  };
  
  void foo() { A a };

The following list of `Variable` objects would be created:

  a
  a.b
  a.b.x
  a.b.y

The reason behind this seemingly wasteful storage is that the eventual set 
operations would be difficult if not impossible to implement, if a single 
`Variable` object were to hold the information about its fields. Mind that each 
of them could have a totally different definition associated with it. I hope 
that by emloying immutable data structures these wouldn't be terribly expensive 
memory-wise.

> I think it is very crucial to make the intentions clear, how do you define 
> `definition`[...]?
>  Other than assignments we might include pre/postfix increment/decrement, 
> non-const by-ref function arguments and so on as `definitions`.

//Definition// is a statement, more specifically a `CFGStmt` (an element of a 
`CFGBlock`), that either writes, or could potentially write a variable. The 
proposed definition finding stage should be very flexible for future additions.

> Also, it would be great to have some proposal upfront how indirect writes 
> should behave.

I wrote a mail about this during my GSoC project: 
http://lists.llvm.org/pipermail/cfe-dev/2019-July/062975.html.

> Basically, this algorithm is not very useful on its own, but it can be used 
> as a building block for other kind of analyses. So it would make sense to try 
> to look for potential users and use cases and see if they have the same 
> requirements or do you need to introduce configurations?

The Static Analyzer would be the number one user of the RD algorithm, as 
described in the summary, so I guess that you referred to the users of GEN/KILL 
sets? What do you mean under configurations?

> Trying to rewrite some existing algorithms in terms of these sets and see if 
> the tests pass might be a good experiment if it is not too much work. (In 
> case we can validate the performance we could even replace the original 
> implementations if this makes things easier.)

I'm not too sure how much work it would take (I suspect an unreasonable 
quantity, but I might be wrong), so I'll take a look. This is a great idea to 
gain a lot more knowledge about this topic, even if it eventually fails.

> I think having a reaching definition set per basic block makes perfect sense, 
> as it should be relatively easy to calculate the reaching definition of an 
> instruction given the set and the basic block. So maybe it is more efficient 
> to calculate the reaching definition sets on demand for instruction rather 
> than computing it for every single instruction.
> 
> Regarding ASTMatchers, I think they might be great for now for prototyping 
> but I think once you want to make this more robust covering every corner case 
> in the matcher expression will be at least as hard as writing a visitor if 
> not harder. But we will see :)

Thank you for the detailed response! I won't update this patch for a while to 
leave time for others to respond, but I will try to work on the other 
algorithms a bit.



================
Comment at: clang/include/clang/Analysis/Analyses/ReachingDefinitions.h:40
+
+struct Variable {
+  const VarDecl *Var;
----------------
xazax.hun wrote:
> I wonder if `Variable` will be the right notion long term. Do we want to be 
> able to represent heap locations or do we exclude them on purpose? Reasoning 
> about the heap is quite challenging so both answers might be reasonable. But 
> in case we try to tackle the more general problem, we basically need a more 
> general term like `MemoryLocation`.
I don't have long term plans to reason about pointees in general. Heap in 
particular is probably off limits for the foreseeable future.


================
Comment at: clang/include/clang/Analysis/Analyses/ReachingDefinitions.h:60
+
+class Definition : public Variable {
+public:
----------------
xazax.hun wrote:
> Is the inheritance justified here? Is a definition a variable? Maybe having a 
> variable member better express the relationship.
Yup, you're right. This was a quick hack while developing.


================
Comment at: clang/include/clang/Analysis/Analyses/ReachingDefinitions.h:176-179
+  
//===--------------------------------------------------------------------===//
+  // Utility methods for building a GEN set. These are public because the
+  // callback objects will need to call them.
+  
//===--------------------------------------------------------------------===//
----------------
whisperity wrote:
> Is this the right approach? A public method makes the user itch to call it. 
> If `GenSetMatcherCallback` is a superclass of every possible implementation, 
> I think adding that class as a friend here works, and you could make the 
> methods private, or protected.
Moving this to `GenSetMatcherCallback` would indeed be a great idea :)


================
Comment at: clang/include/clang/Analysis/Analyses/ReachingDefinitions.h:236-242
+  // TODO: Should we change to ImmutableMap? Does this matter that much?
+  std::map<const CFGBlock *, GenSet> Gen;
+  std::map<const CFGBlock *, DefinitionSet> Kill;
+
+public:
+  std::map<const CFGBlock *, DefinitionSet> In;
+  std::map<const CFGBlock *, DefinitionSet> Out;
----------------
whisperity wrote:
> The immutability might not(?), but the performance aspects of the RBT behind 
> STL `map` could? `K` is a pointer, why not use `DenseMap`? How large do you 
> expect these containers to be when they peak out?
Measurements on real-life codebases can never come soon enough, but I fear 
it'll be a while before I get them.


================
Comment at: clang/lib/Analysis/ReachingDefinitions.cpp:41
+  for (const Definition &Perpetrator : Set)
+    if (Victim.Var == Perpetrator.Var)
+      return true;
----------------
xazax.hun wrote:
> In the future this will be more complicated.
> 
> For example if I assign to a struct all of its members needs to be killed. As 
> a result, you will not only need to represent memory regions but also the 
> relationships between them. I wonder if the analyzer's memregion classes are 
> any good or are they too specific to the analyzer.
As described in my comment, record variables would have numerous `Variable` 
classes, so this function wouldn't get much more complicated in the future (as 
seen in D64991).


================
Comment at: clang/lib/Analysis/ReachingDefinitions.cpp:137
+  static internal::Matcher<Stmt> getMatcher() {
+    return declRefExpr(to(varDecl().bind("var")));
+  }
----------------
xazax.hun wrote:
> Note that, `memberExpr` is not supported at this point. It is not derived 
> from `declRefExpr`. 
Indeed, that is one beast of a matcher :) You can take a sneak peak in D64991.


================
Comment at: clang/lib/Analysis/ReachingDefinitions.cpp:162
+
+  // TODO: Destructor calls? Should we be THAT conservative?
+  // TODO: Regular function calls?
----------------
whisperity wrote:
> You mean //explicit// destructor calls here, right?
Nope, implicit. I wouldn't worry much about explicit calls, I would handle them 
the same as I would any `CallExpr`. Implicit destructor calls could modify the 
global state, and are not visibly present in the code.


================
Comment at: clang/lib/Analysis/ReachingDefinitions.cpp:275
+void ReachingDefinitionsCalculator::dumpGenSet() const {
+  llvm::errs() << "GEN sets: blockid (varname [blockid, elementid])\n";
+  for (const std::pair<const CFGBlock *, GenSet> D : Gen) {
----------------
whisperity wrote:
> Instead of simply blockID, could you harmonise this output with the CFG dump 
> and say `Bx` instead of just `X`?
What do you mean?


================
Comment at: clang/test/Analysis/dump-definitions.cpp:17-18
+
+// CHECK:      GEN sets: blockid (varname [blockid, elementid])
+// CHECK-NEXT: 1 (ptr, [1, 3]) <write>
+// CHECK-NEXT: KILL sets: blockid (varname [blockid, elementid])
----------------
whisperity wrote:
> What is an //element//? How do they get their numbers? What does the `3` mean 
> here? I get that basic block 1 (the body of the function) writes `ptr`... but 
> I don't understand this further from looking at the expected output.
A `CFGBlock` is a block that contains all statements that are always executed 
sequentially. The statements are enumerated according to the execution order. 3 
here means that the definition is the 3rd element in the 1st `CFGBlock`.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D76287/new/

https://reviews.llvm.org/D76287



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D76287: [analysis][analyzer] Introduce the skeleton of a reaching definitions calculator

Reply via email to