On Thu, Sep 12, 2024 at 12:51 PM David Malcolm <dmalc...@redhat.com> wrote:
>
> On Wed, 2024-09-11 at 20:49 -0500, tcpreimesber...@gmail.com wrote:
> > From: Thor C Preimesberger <tcpreimesber...@gmail.com>
> >
> > This patch allows the compiler to dump GENERIC trees as JSON objects.
> >
> > The dump flag -fdump-tree-original-json dumps each fndecl node in the
> > C frontend's gimplifier as a JSON object and traverses related nodes
> > in an analagous manner as to raw-dumping.
>
> Thanks for posting this patch.
>
> Are you able to upload somewhere some examples of what the dumps look
> like?

I found https://renhongl.github.io/json-editor/ which seems to accept
the output of -fdump-tree-original-json visualizes the raw JSON structure
when the input is from a single function.

struct S { int i; int j; } s;
int bar ()
{
  return s.i + s.j;
}
int main()
{
  return bar ();
}

no longer recognizes it, I would guess we'd need to produce an outer
"file level" JSON node.  Simply wrapping the file in [{ ... }] didn't
work even with comma separating two functions.

The JSON for bar looks like (sorry for the long paste)

[{"addr": "0x7f8256bda3c0",
  "expr_loc": [{"file": "t.c",
                "line": 3,
                "column": 1}],
  "start_loc": [{"file": "t.c",
                 "line": 3,
                 "column": 1}],
  "finish_loc": [{"file": "t.c",
                  "line": 3,
                  "column": 1}],
  "tree code": "bind_expr",
  "bind_expr_body": {"addr": "0x7f8256bab860",
                     "expr_loc": [{"file": "t.c",
                                   "line": 4,
                                   "column": 14}],
                     "start_loc": [{"file": "t.c",
                                    "line": 4,
                                    "column": 10}],
                     "finish_loc": [{"file": "t.c",
                                     "line": 4,
                                     "column": 18}],
                     "tree code": "return_expr",
                     "return_expr": {"addr": "0x7f8256bb1b40",
                                     "expr_loc": [{"file": "t.c",
                                                   "line": 4,
                                                   "column": 14}],
                                     "start_loc": [{"file": "t.c",
                                                    "line": 4,
                                                    "column": 10}],
                                     "finish_loc": [{"file": "t.c",
                                                     "line": 4,
                                                     "column": 18}],
                                     "tree code": "plus_expr",
                                     "bin_operator": "+",
                                     "operands": [{"addr": "0x7f8256bda360",
                                                   "expr_loc": [{"file": "t.c",
                                                                 "line": 4,
                                                                 "column": 11}],
                                                   "start_loc": [{"file": "t.c",
                                                                  "line": 4,

"column": 10}],
                                                   "finish_loc":
[{"file": "t.c",
                                                                   "line": 4,

"column": 12}],
                                                   "tree code": "component_ref",
                                                   "expr": {"addr":
"0x7f8256a10c60",

"decl_loc": [{"file": "t.c",

   "line": 1,

   "column": 28}],
                                                            "tree
code": "var_decl",
                                                            "used": true,
                                                            "public": true,
                                                            " static": true,
                                                            "read": true,
                                                            "mode": "DI",

"defer-output": true,
                                                            "id_to_locale": "s",
                                                            "id_point": "s"},
                                                   "field": {"addr":
"0x7f8256a30688",

"decl_loc": [{"file": "t.c",

    "line": 1,

    "column": 16}],
                                                             "tree
code": "field_decl",
                                                             "mode": "SI",

"id_to_locale": "i",
                                                             "id_point": "i"}},
                                                  {"addr": "0x7f8256bda390",
                                                   "expr_loc": [{"file": "t.c",
                                                                 "line": 4,
                                                                 "column": 17}],
                                                   "start_loc": [{"file": "t.c",
                                                                  "line": 4,

"column": 16}],
                                                   "finish_loc":
[{"file": "t.c",
                                                                   "line": 4,

"column": 18}],
                                                   "tree code": "component_ref",
                                                   "expr": {"addr":
"0x7f8256a10c60",

"decl_loc": [{"file": "t.c",

   "line": 1,

   "column": 28}],
                                                            "tree
code": "var_decl",
                                                            "used": true,
                                                            "public": true,
                                                            " static": true,
                                                            "read": true,
                                                            "mode": "DI",

"defer-output": true,
                                                            "id_to_locale": "s",
                                                            "id_point": "s"},
                                                   "field": {"addr":
"0x7f8256a30720",

"decl_loc": [{"file": "t.c",

    "line": 1,

    "column": 23}],
                                                             "tree
code": "field_decl",
                                                             "mode": "SI",

"id_to_locale": "j",

"id_point": "j"}}]}}}]

There are three oddities I immediately notice:

The PLUS_EXPR operands are in a array "operands" while the RETURN_EXPR
"operand" or "child pointer" is refered to from "return_expr".  I think both are
tcc_expression trees and the operands are in exp.operands.  Ideally the
JSON would more closely reflect that rather than possibly following the "pretty"
printing logic.

While the tree_code of a tree node is the most important bit (maybe besides of
its address), the "tree code" attribute is after the locations (which
are also quite
verbose and distracting - at least when looking at raw JSON).  For locations
one could honor TDF_LINENO and only dump those when using
-fdump-tree-original-json-lineno.  I'd re-order "tree code" after "addr".

The third issue is that above the tree node with address 0x7f8256a10c60
(and its children) appear twice - while you maintain a splay tree and assign
unique numbers the duplicate nodes are not output by reference?  I would
suggest to use { "ref_addr" : "0x7f8256a10c60" } for the output of such
reference for example.

I'm not sure whether JSON allows different object kinds or if that's solely
done by having a special attribute if that's needed.  With the above
regular tree nodes would be "addr" and references be "ref_addr".  A recursive
JSON structure like above is OK to look at in RAW, I'm not sure whether
for automatic processing and for example translating to a graph a linear
collection of nodes with references would be easier to handle.


Few comments on the patch itself - the #include of tree-emit-json.h from
dumpfile.cc doesn't seem to be necessary.  Since you declare
dump_node_json in dumpfile.h it should be possible to elide the header
and put the contents into the tree-emit-json.cc file.

Another #include is duplicated (and also looks unnecessary).


I know you have some crude JSON -> html translation script written in
python - can you share that as well?  I'd suggest to post separate from
this main patch, adding it to contrib/.

Can we solve the multi-function issue somehow?  I know we have some
abstraction for a dump file, we'd need a hook that's invoked on opening
and closing to emit a something there - I guess it would be even OK to
hard-code this into dumpfile.cc for the -JSON dump variant.  It might
be possible to register dump specific data with that object and get
to the "current" dump file in dump_node_json so the splay-tree could
be kept live and the allocations released on dump-file close?  Again,
two hard-coded hooks from dumpfile.cc at open/close time into
the JSON dumping for this might be feasible and track the global state
with global variables.  That's to allow references to global objects and
types streamed in a previous function context.

Thanks,
Richard.


> Some high level thoughts:
>
> * the patch uses "dummy" throughout as a variable name.  To me the name
> "dummy" suggests something unimportant that we had to give a name to,
> or something that we'd prefer didn't exist but had to create.  However
> in most(all?) cases "dummy" seems to refer to the json object being
> created or having properties added to it, and thus the most interesting
> thing in the function.  I suspect that renaming "dummy" to "js_obj" or
> "json_obj" throughout would be an improvement in readability in terms
> of capturing the intent of the code (assuming that all of them are
> indeed json objects).
>
> * I think the code is leaking memory for all of the json values created
> - there are lots of uses of "naked new" in this code, but I don't see
> any uses of "delete".  For example, in
>
> > +void
> > +dump_node_json (const_tree t, dump_flags_t flags, FILE *stream)
> > +{
> > +  struct dump_info di;
> > +  dump_queue_p dq;
> > +  dump_queue_p next_dq;
> > +  pretty_printer pp;
> > +  /* Initialize the dump-information structure.  */
> > +  di.stream = stream;
> > +  di.index = 0;
> > +  di.column = 0;
> > +  di.queue = 0;
> > +  di.queue_end = 0;
> > +  di.free_list = 0;
> > +  di.flags = flags;
> > +  di.node = t;
> > +  di.nodes = splay_tree_new (splay_tree_compare_pointers, 0,
> > +                          splay_tree_delete_pointers);
> > +  di.json_dump = new json::array ();
>      ^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^^
>      allocated with naked new here
>
> > +  /* Queue up the first node.  */
> > +  queue (&di, t);
> > +
> > +  /* Until the queue is empty, keep dumping nodes.  */
> > +  while (di.queue)
> > +    dequeue_and_dump (&di);
> > +
> > +  di.json_dump->dump(stream, true);
> > +  fputs("\n", stream);
> > +  /* Now, clean up.  */
> > +  for (dq = di.free_list; dq; dq = next_dq)
> > +    {
> > +      next_dq = dq->next;
> > +      free (dq);
> > +    }
> > +  splay_tree_delete (di.nodes);
>
> and di.json_dump goes out of scope here and is leaked, I think.  So I
> *think* all of the json values being created during dumping are being
> leaked.
>
> > +}
>
> Similarly, in:
>
> > +DEBUG_FUNCTION void
> > +debug_tree_json (tree t)
> > +{
> > +  json::object* _x = node_emit_json(t);
> > +  _x->dump(stderr, true);
> > +  fprintf(stderr, "\n");
> > +}
>
> if I'm reading things right, node_emit_json doesn't "emit" json so much
> as create a new json::object on the heap via "new", and when "_x" goes
> out of scope, it's leaked.
>
> The pattern in the code seems to be that node_emit_json creates a new
> json::object and populates it with properties (sometimes recursively).
>
> Given that, and that we can use C++11, I recommend using
> std::unique_ptr<json::object> for it, to capture the intent that this
> is a heap-allocated pointer with responsibility for being "delete"-d at
> some point.
>
> That way, rather that:
>
>   json::object*
>   node_emit_json(tree t)
>   {
>     tree op0, op1, type;
>     enum tree_code code;
>     expanded_location xloc;
>     json::object *dummy;
>     json::array* holder;
>     char address_buffer[sizeof(&t)] = {"\0"};
>
>     dummy = new json::object ();
>     holder = new json::array ();
>
>     [...snip...]
>
>     return dummy;
>   }
>
>
> we could have (perhaps renaming to "node_to_json"):
>
>   std::unique_ptr<json::object>
>   node_to_json(tree t)
>   {
>     tree op0, op1, type;
>     enum tree_code code;
>     expanded_location xloc;
>     char address_buffer[sizeof(&t)] = {"\0"};
>
>     auto js_obj = ::make_unique<json::object> (); // to implicitly use 
> std::unique_ptr<json::object>
>     auto holder = ::make_unique<json::array> ();  // likewise 
> std::unique_ptr<json::array>
>
>     [...snip...]
>
>     return js_obj;
>   }
>
> ...assuming that I'm correctly understanding the ownership of the json
> values in the patch.  Note that we have to use ::make_unique from our
> make-unique.h, rather than std::make_unique from <memory> since the
> latter was only added in C++14.
>
> Many of our data structures don't properly handle objects with
> destructors, and I suspect splay_tree is one of these.  You can use
> js_obj.release () to transfer ownership to such data structures, and
> will (probably) need to manually use "delete" on the pointers in the
> right places.
>
> What happens to "holder" in that function?  It seems to get populated
> with json objects for the various tree nodes found recursively, but
> then it seems to simply be leaked (or populated then auto-deleted, if
> we use std::unique_ptr>.  Or am I missing something?
>
> In case it's helpful, a couple of months ago I converted the SARIF
> output code from using "naked" json pointers to using std::unique_ptr
> in:
>   https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658204.html
> and I found it helped a *lot* with documenting ownership and avoiding
> leaks.  You might find other things of interest in the first half of
> this patch kit:
>   https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658194.html
> FWIW in that code I'm also using a class hierarchy of json::object
> subclasses to help with compliance with a JSON schema (sarif_object is
> a subclass of json::object), but given that there probably isn't a
> schema for the JSON in TDF_JSON dumps, that's probably not relevant to
> this case.
>
> FWIW you can test for leaks by running the compiler under valgrind by
> configuring with --enable-valgrind-annotations and appending "-wrapper
> valgrind" to the command line (or "-wrapper valgrind,--leak-check=full"
> IIRC).
>
> Thanks again for the patch; hope this is constructive.
> Dave
>

Reply via email to