Hello, I'm currently working on improving my understanding of the implementation of the intraprocedural points-to analysis in GCC. I have already read the papers by Daniel Berlin and have been looking at the source for some time, but I understand that those references are a little bit old and might not fully reflect the status of the current points-to analysis.
In order to familiarize myself with the analysis I decided to take a look at the constraints dumped in the logs and input them into a different solver which also implements a points-to analysis. I do this during GCC's run time. In other words, before GCC computes the points-to analysis using the regular intraprocedural points-to analysis, I take a look at the constraint vector and translate them to an interface the other solver can understand. Therefore, after this other solver finished, there are two sets of solutions for the points-to analysis. In order to verify the correctness of my analysis I simply iterate over GCC's solution and place an assertion to verify that both points-to sets for each variable are equal. The solution computed by my implementation solves the constraints by abstracting the varinfo into integers by using the varinfo id. And then the relationship between points-to variables are expressed as variable A id points to variable B id. Since the variable info ids are just offsets into the varinfo array, then there should be a one to one mapping between the solutions in this solver to the solutions computed by GCC. Its implemented something like the following: ``` for (auto output : points_to_analysis) { int from_i; // pointer variable index in get_varinfo(from_i) int to_i; // point-ee variable index in get_varinfo(toi) if(!get_varinfo(from_i)->may_have_pointers) continue; bitmap vars = get_varinfo(from_i)->solution; if (!vars) continue; if (dump_file) fprintf(dump_file, "%s -> %s\n", get_varinfo(from_i)->name, get_varinfo(to_i)->name); bool same = bitmap_bit_p(vars, to_i); if (dump_file) fprintf(dump_file, "SAME = %s\n", same ? "true" : "false"); gcc_assert(same); } ``` This means that if the two analyses were equal, all test cases testing -ftree-pta should succeed. At the moment, the assertion is being triggered. Now, I would like to understand why the assertion is triggered without asking you to debug my code, so instead I'm trying to understand why GCC's solution is correct. (I gave all this exposition because I may be making assumptions that are incorrect and that can quickly be verified by someone more knowledgeable.) Now, I'll focus on the test case: The code in question is a slightly modified version of the test case ipa-pta-16.c ``` struct X { long l1; struct Y { long l2; int *p; } y; }; int i; static int __attribute__((noinline)) foo (struct X *x) { struct Y y = x->y; *y.p = 0; i = 1; return *y.p; } extern void abort (void); int main() { struct X x; struct X t; x.y.p = &i; long* e = &t.l1; if (foo(&x) != 1) abort (); return e; } ``` and is being compiled with the following flags: gcc -O2 -ftree-pta -ftree-pta-external-solver -fdump-tree-alias -ftree-salias ipa-pta-16.c I would like to focus on the foo function, which is translated to the following gimple: ``` __attribute__((noinline)) int foo.constprop.isra (int * ISRA.4) { int * y$p; struct X * x; struct X * x; int _2; <bb 2> [local count: 1073741824]: *ISRA.4_3(D) = 0; i = 1; _2 = *ISRA.4_3(D); return _2; } ``` GCC generates the following constraints (as seen on the dump): ``` ANYTHING = &ANYTHING ESCAPED = *ESCAPED ESCAPED = ESCAPED + UNKNOWN *ESCAPED = NONLOCAL NONLOCAL = &NONLOCAL NONLOCAL = &ESCAPED INTEGER = &ANYTHING ISRA.4 = &NONLOCAL derefaddrtmp(9) = &NULL *ISRA.4 = derefaddrtmp(9) i = NONLOCAL i = &NONLOCAL ESCAPED = &NONLOCAL _2 = *ISRA.4 ``` The points-to set generated by GCC are the following: ``` ANYTHING = { ANYTHING } ESCAPED = { NULL ESCAPED NONLOCAL } NONLOCAL = { ESCAPED NONLOCAL } same as i STOREDANYTHING = { } INTEGER = { ANYTHING } ISRA.4 = { NONLOCAL } derefaddrtmp(9) = { NULL } i = { ESCAPED NONLOCAL } _2 = { ESCAPED NONLOCAL } foo.constprop.0.isra.0 = { } ``` I also noticed that there is more information about ISRA.4 which is not printed in the points-to sets printed above. Instead, the dump continues and adds this line: ``` Flow-insensitive points-to information ISRA.4_3(D), points-to non-local, points-to NULL, points-to vars: { } ``` Now, I understand that the first set of solutions (i.e, `{ NONLOCAL }`) is the solution of the constraint variable. That is, the constraint variable ISRA.4 when assigned the set `{ NONLOCAL }` is a solution to the constraints... I also understand that ISRA.4_3(D) is the SSA variable on the gimple code. There is technically a mapping between the two, but I guess there's some distinction. I also understand that the first dump is generated from the function dump_solution_for_var and just looks for a set bit. ``` EXECUTE_IF_SET_IN_BITMAP (vi->solution, 0, i, bi) fprintf (file, "%s ", get_varinfo (i)->name); ``` The second dump is generated from dump_points_to_solution ``` if (pt->null) fprintf (file, ", points-to NULL"); ``` which looks into the points to solution struct. My question here is, why is NULL not in the solution in the bitmap but instead is in this special field? Probably for optimizations somehow. But, wouldn't removing this bit from the constraint variable could somehow affect what solution gets propagated to other variables? But let's go with the idea that ISRA.4 does not point to NULL as given by the points-to set dumped by GCC. Is this correct? We can try to compute the analysis by hand by solving the constraints according to Anderson's points-to analysis... I think we get that NONLOCAL points to NULL (which is not in the dump). derefaddrtmp(9) = &NULL ISRA.4 = &NONLOCAL *ISRA.4 = derefaddrtmp(9) So, how I understand it, there are 4 nodes in this subgraph of the constraint graph and the first two facts gives us the following relations: 1. derefaddrtmp(9) points to NULL 2. ISRA.4 points to NONLOCAL The third constraint says to propagate the subset constraint to all variables which are in the ISRA.4 set. So effectively we are creating a new edge from NONLOCAL to deref that says that whatever deref is pointing to, now NONLOCAL is pointing to as well. Basically, NONLOCAL now points to NULL. But as I mentioned, the NULL does not appear on the first dump for the variable NONLOCAL. ``` NONLOCAL = { ESCAPED NONLOCAL } same as i ``` I think all of this is not a bug, it is just some conventions that GCC makes because NULL is a special variable, but: * I would like some reassurance that this is indeed the case of treating some constraint variables in a special way * Some guidance on how to extend my solver to also treat the special variables in the same way (in order to generate exactly the same solution). * Some idea of what else I can expect to be handled somewhat differently that I cannot foresee at the moment. Thanks!