sammccall added a comment.

In https://reviews.llvm.org/D41102#994311, @juliehockett wrote:

> I'm going to take a stab at refactoring the serialization part of this next 
> -- rather than keeping it all in memory and dumping it at the end, it should 
> serialize as it goes and do some sort of reduce at the end. I'm currently 
> thinking of writing it out to a file as it goes, and then reading and 
> reducing from there -- thoughts?


Sounds awesome :-) Some thoughts on structure (sorry for the rant - clangd's 
indexer has a similar problem so I've been thinking about this).

This neatly separates the problem out into

- "business logic" for map and reduce - these are basically pure functions
- imperative frameworky stuff: serializing/deserializing, grouping map keys 
together, running threadpools

Sadly we don't have a generic implementation of the frameworky part in 
libtooling I think, so you'll probably end up writing the bits you need in 
clang-doc.
There is `ExecutionContext` in Tooling/Execution.h, which allows you to report 
KV pairs, and loop over them afterwards, which models map output (must be 
serialized as strings).

Still I think there's value in separating out the two layers:

- conceptual clarity
- testing (the pure functions are nice to test)
- one can hook up the mapper/reducer to hadoop, or google's internal 
infrastructure, or ...
- it can inspire and benefit from someone writing a reusable upstream MR API

So maybe the API could look something roughly like:

`newFrontendActionFactory(DocSink)` returns clang-doc's extractor part (the 
mapper, i.e. most of this patch)
`DocSink` is a class with a bunch of methods like `emitNamespaceInfo(StringRef 
Name, NSInfo)`. This adapts to the MR machinery: it could write to 
ExecutionContext, or files on disk as you suggest.
Merging provided via classes like `class MergeNS { MergeNS(StringRef Name); 
void add(NSInfo); NSInfo finish(); }` These would be invoked by the MR 
machinery (for now, just code that iterates over the ExecutionContext or files 
on disk).

In this case, you could test the mapper in isolation by writing a small tool 
that just runs the mapper over some files, gathers the result, and dumps it as 
YAML. The reducer and the MR driver wouldn't need to be part of this patch.

Anyway, of course you may want to take this in a different direction - just 
some ideas.



================
Comment at: tools/clang-doc/ClangDoc.h:29
+struct ClangDocContext {
+  // Which format in which to emit representation.
+  OutFormat EmitFormat;
----------------
juliehockett wrote:
> sammccall wrote:
> > Is this the intermediate representation referred to in the design doc, or 
> > the final output format?
> > 
> > If the former, why two formats rather than picking one?
> > YAML is nice for being usable by out-of-tree tools (though not as nice as 
> > JSON). But it seems like providing YAML as a trivial backend format would 
> > fit well?
> > Bitcode is presumably more space-efficient - if this is significant in 
> > practice it seems like a better choice.
> That's the idea -- for developing purposes, I wrote up the YAML output first 
> for this patch, and there will be a follow-on patch expanding the 
> bitcode/binary output. I've updated the flags to default to the binary, with 
> an option to dump the yaml (rather than the other way around).
What's still not clear to me is: is YAML a) a "real" intermediate format, or b) 
just a debug representation?

I would suggest for orthogonality that there only be one intermediate format, 
and that any debug version be generated from it. In practice I guess this means:
 - the reporter builds the in-memory representation
 - you can serialize/deserialize memory representation to the IR (bitcode)
 - you can serialize memory representation to debug representation (YAML) but 
not parse
 - maybe the clang-doc core should *only* know about IR, and YAML should be 
produced in the same way e.g. HTML would be?

This does pose a short-term problem: the canonical IR is bitcode, we need YAML 
for the lit tests, and we don't have the decoder/transformer part yet. This 
could be solved either by using YAML as the IR *for now* and switching later, 
or by adding a simple decoder now.
Either way it points to the *reporter* not having an output format option, and 
having to support two formats.

WDYT? I might be missing something here.


================
Comment at: tools/clang-doc/ClangDoc.h:33
+
+class ClangDocVisitor : public RecursiveASTVisitor<ClangDocVisitor> {
+public:
----------------
juliehockett wrote:
> jakehehrlich wrote:
> > sammccall wrote:
> > > This API makes essentially everything public. Is that the intent?
> > > 
> > > It seems like `ClangDocVisitor` is a detail, and the operation you want 
> > > to expose is "extract doc from this AST into this reporter" or maybe 
> > > "create an AST consumer that feeds this reporter".
> > > 
> > > It would be useful to have an API to extract documentation from 
> > > individual AST nodes (e.g. a Decl). But I'd be nervous about trying to 
> > > use the classes exposed here to do that. If it's efficiently possible, 
> > > it'd be nice to expose a function.
> > > (one use case for this is clangd)
> > Correct me if I'm wrong but I believe that everything needs to be public in 
> > this case because the base class needs to be able to call them. So the 
> > visit methods all need to be public.
> Yes to the `Visit*Decl` methods being public because of the base class.
> 
> That said, I shifted a few things around here and implemented it as a 
> `MatcherFinder` instead of a `RecursiveASTVisitor`. The change will allow us 
> to make most of the methods private, and have the ability to fairly easily 
> implement an API for pulling a specific node (e.g. by name or by decl type). 
> As far as I understand (and please correct me if I'm wrong), the matcher 
> traverses the tree in a similar way. This will also make mapping through 
> individual nodes easier.
Sorry for being vague - yes overridden or "CRTP-overridden" methods may need to 
be public.
I meant that the classes themselves don't need to be exposed, I think. (The 
header could just expose a function to create the needed ones, that returns 
`unique_ptr<interface>`

There are now fewer classes exposed here, but I think most/all of them can 
still reasonably be hidden.


https://reviews.llvm.org/D41102



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to