sammccall added a comment.

Just some naming and doc nits. This looks really solid now, nice job!

In D83814#2261458 <https://reviews.llvm.org/D83814#2261458>, @jkorous wrote:

> Hi @usaxena95 and @sammccall,
>
> I am wondering about couple high-level things.
>
> Do you guys intend to open-source also the training part of the model 
> pipeline or publish a model trained on generic-enough training set so it 
> could be reasonably used on "any" codebase?

@adamcz and I were talking about this too... I think it's important we do as 
much of this as possible. I was the one not finding time to do it though, and I 
think Adam may do better :-)

- the existing training stuff is using internal tech, but AFAIK it's nothing 
LightGBM can't do (it trains on a single machine). So we should be able to 
open-source the training setup and actually use that.
- training data generation is harder to open, because it involves sampling a 
large diverse body of code and parsing lots of it. The core code that embeds 
clangd and extracts completion candidates should be very doable, so one could 
run over LLVM on one machine. The framework to run at a larger scale is coupled 
to internal infra though, and we're currently training on both public and 
non-public code.



================
Comment at: clang-tools-extra/clangd/quality/CompletionModel.cmake:10
+
+  set(model_json ${model}/forest.json)
+  set(model_features ${model}/features.json)
----------------
these vars are used only once, I'd suggest inlining them for readability


================
Comment at: clang-tools-extra/clangd/quality/CompletionModel.cmake:13
+  
+  set(output_dir ${CMAKE_BINARY_DIR}/generated/decision_forest)
+  set(header_file ${output_dir}/${filename}.h)
----------------
/generated/decision_forest seems redundant considering ${CMAKE_BINARY_DIR} is 
already the generated-files tree for the directory of the calling CMakeLists.

Can't we just use ${CMAKE_BINARY_DIR} directly and avoid the 
DECISION_FOREST_OUTPUT_DIR variable?


================
Comment at: clang-tools-extra/clangd/quality/CompletionModelCodegen.py:150
+    # Class definition.
+    code += f"class {cpp_class.name} {{\n"
+    code += "public:\n"
----------------
you can use triple-quoted f-strings, which I think would be more readable than 
blocks of "code +="

```
code += f"""class {cpp_class.name} {{
public:
  {"\n   ".join(setters)}

private:
  {"\n  ".join(class_members)}
"""
```

etc

may even do the whole thing in one go.


================
Comment at: clang-tools-extra/clangd/quality/CompletionModelCodegen.py:217
+
+    code = "\n".join(f'#include <{h}>' for h in angled_include) + "\n\n"
+    code += "\n".join(f'#include "{h}"' for h in quoted_include) + "\n\n"
----------------
again, making this one big triple-quoted f-string may be nicer, up to you


================
Comment at: clang-tools-extra/clangd/quality/README.md:4
+## Decision Forest
+A **decision forest** is a collection of many decision trees. A **decision 
tree** is a full binary tree where every non-leaf node has exactly **2** 
children. 
+
----------------
The second half of this sentence simply restates the first.

Maybe we can combine this with the second paragraph: "A decision tree is a full 
binary tree that provides a quality prediction for an input (code completion 
item). Internal nodes represent a binary decision based on the input data, and 
leaf nodes represent a prediction."


================
Comment at: clang-tools-extra/clangd/quality/README.md:8
+
+At every non-leaf node, we evaluate the condition present in the node.  The 
condition refers to exactly one **feature**. It uses the value of this 
attribute from the code completion item to evaluate the condition.
+Based on the condition, we move to its true child or the false child.
----------------
Nit: I think it's worth separating out defining features vs conditions.

e.g.
"An input (code completion candidate) is characterized as a set of 
**features**, such as the type of symbol or the number of existing references

At every non-leaf node, we evaluate the condition to decide whether to go left 
or right. The condition compares one feature of the input against a constant. 
It is either:
...".


================
Comment at: clang-tools-extra/clangd/quality/README.md:16
+A leaf node only contains the value **score**.
+Once we know the set of leaves (one from each decision tree), we add the 
**score** values from each of the leaves to get the final relevance score.
+
----------------
nit: rather than alternating between describing traversing all trees and one 
tree, I'd just say "To compute an overall quality score, we traverse each tree 
in this way and add up the scores".


================
Comment at: clang-tools-extra/clangd/quality/README.md:26
+  "name": "a_numerical_feature",
+  "type": "NUMBER"
+}
----------------
This is a little prone to confusion with C++ type.

Consider "kind" instead?


================
Comment at: clang-tools-extra/clangd/quality/README.md:34
+  "type": "ENUM",
+  "enum": "fully::qualified::enum",
+  "header": "path/to/HeaderDeclaringEnum.h"
----------------
this might be "type"?


================
Comment at: clang-tools-extra/clangd/quality/README.md:38
+```
+The field `enum` specifies the fully qualified name of the enum.
+
----------------
The max numeric value may not exceed 32 (is that right?)


================
Comment at: clang-tools-extra/clangd/quality/README.md:76
+## Code Generator for Inference
+The implementation of inference runtime is split across:
+- Build System (CMake)
----------------
nit: order should match order of the actual paragraphs.

This is short enough though that you might not want the mini-TOC here.


================
Comment at: clang-tools-extra/clangd/quality/README.md:98
+
+## Example
+### model/features.json
----------------
This seems like it might be a pain to maintain.
Maybe just include the json files and the public interface from 
DecicionForestRuntime.h?


================
Comment at: clang-tools-extra/clangd/quality/README.md:98
+
+## Example
+### model/features.json
----------------
sammccall wrote:
> This seems like it might be a pain to maintain.
> Maybe just include the json files and the public interface from 
> DecicionForestRuntime.h?
may want to add the cmake invocation to generate the files.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D83814/new/

https://reviews.llvm.org/D83814

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to