aaron.ballman added a comment.

In D124500#3483328 <https://reviews.llvm.org/D124500#3483328>, 
@LegalizeAdulthood wrote:
> In D124500#3483224 <https://reviews.llvm.org/D124500#3483224>, 
> @aaron.ballman wrote:
>
>> To clarify my previous comments, I'm fine punting on the edge cases until 
>> user reports come in, so don't let them block this review if you feel 
>> strongly about not supporting them. But when user reports start coming in, 
>> at some point I might start asking to replace the custom parser with calling 
>> into the clangParse library through some invented utility interface so that 
>> we don't have to deal with a long tail of bug reports.
>
> Yeah, I think punting on edge cases is the right thing to do here.  As I say,
> the worst that happens is your macro isn't converted automatically when you
> could convert it manually.

I largely agree, but I've found cases where we'll convert correct code to 
incorrect code, so it's a bit stronger than that.

> Maybe we're thinking about this check differently.
>
> I want this check to do the majority of the heavy lifting so that I'm only 
> left with a few "weird" macros that I might have to convert by hand.  I 
> never, ever, ever want this check to generate invalid code.  Therefore it is 
> of paramount importance that it be conservative about what it recognizes as a 
> candidate for conversion.

I think that's a reasonable goal, but we're not meeting the "never ever 
generate invalid code" part. I already know we can break correct C and C++ code 
through overflow. Should we ever allow an option to use an enum with a fixed 
underlying type in C++ (either `enum class` or `enum : type` form), we'll have 
the same breakage there but at different thresholds.
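
For example (hypothetical macro name), a value that's perfectly fine as a 
macro becomes a constraint violation once it's an enumerator, because C 
requires enumerator values to be representable as `int`:

```
// Fine as a macro: 0x80000000 is an unsigned int constant.
#define MASK 0x80000000

// After conversion: a constraint violation in C (C17 6.7.2.2p2 requires
// the value to be representable as an int), so -pedantic-errors builds
// break.
enum { MASK = 0x80000000 };
```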

> I think this is the baseline for all the modernize checks, really.  There 
> are still cases where loop-convert doesn't recognize an iterator-based loop 
> that could be converted to a range-based for loop.  The most important thing 
> is that loop-convert doesn't take my weird iterator-based loop and convert it 
> to a range-based for loop that doesn't compile.
>
> As for calling into `clangParse`, I think that would be overkill for a couple 
> of reasons.  First, the real parser is going to do a lot of work creating AST 
> nodes which we will never use, except to traverse the structure looking for 
> things that would invalidate it as a candidate for a constant initializing 
> expression.  Second, we only need to match the structure; we don't need to 
> extract any information from the token stream other than a "thumbs up" or 
> "thumbs down" that it is a valid initializing expression.

There are a few reasons I disagree with this. First, you need to know the value 
of the constant expression in order to know whether it's valid as an 
enumeration constant. That alone requires expression evaluation capabilities, 
assuming you want the check to behave correctly for those kinds of cases. But 
second, without support for generating that AST and validating it, you can 
never handle cases like this:

  constexpr int a = 12;
  constexpr int foo() { return 12; }
  
  #define FOO (a + 1)
  #define BAR (a + 2)
  #define BAZ (a + 3)
  #define QUUX (foo() + 4)

because you have no way to know whether that constant expression is valid 
based solely on token soup (especially when you start to factor in namespaces, 
qualified lookup, template instantiations, etc.). So I think avoiding the 
parser will limit the utility of this check. And maybe that's fine; maybe it's 
only ever intended to convert C code using defines to C code using enumerations 
or other such simple cases.

> Many times in clang-tidy reviews performance concerns are raised, and I 
> think matching the token stream with the recursive-descent matcher here is 
> going to be much, much faster than invoking `clangParse`, particularly since 
> the matcher bails out early on the first token that doesn't match.

Absolutely 100% agreed on this point.

> The only thing I can think of that would make it faster is if we could get 
> the lexed tokens from the preprocessor instead of making it re-lex the macro 
> body, but that's a change beyond the scope of this check or even clang-tidy.

Yeah, and that wouldn't even be sufficient because I still think we're going to 
want to know the *value* of the expression at some point.

Allllll that "this is the wrong way!" doom and gloom aside, I still think 
you're fine to proceed with the current approach if you'd like to. We should 
explicitly document in the check the situations under which it will break 
correct code, but they feel sufficiently like edge cases to me that I'm fine 
moving forward, with the caveat that if the check becomes much more problematic 
in the future, we have *a* path forward we could use to improve things.



================
Comment at: clang-tools-extra/clang-tidy/modernize/IntegralLiteralExpressionMatcher.cpp:99
+
+  if (!Current->isLiteral() || isStringLiteral(Current->getKind()) ||
+      !isIntegralConstant(*Current)) {
----------------
LegalizeAdulthood wrote:
> aaron.ballman wrote:
> > LegalizeAdulthood wrote:
> > > aaron.ballman wrote:
> > > > I know this is code moved from elsewhere, but I suppose we never 
> > > > considered the odd edge case where a user does something like 
> > > > `"foo"[0]` as a really awful integer constant. :-D
> > > It's always possible to write crazy contorted code and have a check not 
> > > recognize it.  I don't think it's worthwhile to spend time trying to 
> > > handle torture cases unless we can find data that shows it's prevalent in 
> > > real world code.
> > > 
> > > If I was doing a code review and saw this:
> > > ```
> > > enum {
> > >     FOO = "foo"[0]
> > > };
> > > ```
> > > I'd flag that in a code review as bogus, whereas if I saw:
> > > ```
> > > enum {
> > >     FOO = 'f'
> > > };
> > > ```
> > > That would be acceptable, which is why character literals are accepted as 
> > > an integral literal initializer for an enum in this check.
> > >  I don't think it's worthwhile to spend time trying to handle torture 
> > > cases unless we can find data that shows it's prevalent in real world 
> > > code.
> > 
> > I think I'm okay agreeing to that in this particular case, but this is more 
> > to point out that writing your own parser is a maintenance burden. Users 
> > will hit cases we've both forgotten about here, they'll file a bug, then 
> > someone has to deal with it. It's very hard to justify to users "we think 
> > you write silly code" because they often have creative ways in which their 
> > code is not actually so silly, especially when we support "most" valid 
> > expressions.
> Writing your own parser is unavoidable here because we can't just assume that 
> any old thing will be a valid initializer just by looking at the set of 
> tokens present in the macro body.  (There is a separate discussion going on 
> about improving the preprocessor support and parsing things more deeply, but 
> that isn't even to the point of a prototype yet.)  The worst thing we can do 
> is create "fixits" that produce invalid code.
> 
> The worst that happens if your expression isn't recognized is that your macro 
> isn't matched as a candidate for an enum.  You can always make it an enum 
> manually and join it with adjacent macros that were recognized and converted.
> 
> As it stands, the check only recognizes a single literal with an optional 
> unary operator.
> 
> This change expands the check to recognize a broad range of expressions, 
> allowing those macros to be converted to enums.  I opened the issue because 
> running modernize-macro-to-enum on the 
> [[ https://github.com/InsightSoftwareConsortium/ITK | ITK codebase ]] showed 
> some simple expressions involving literals that weren't recognized and 
> converted.
> 
> If an expression isn't recognized and an issue is opened, it will be an 
> enhancement request to support a broader range of expressions, not a bug 
> report that this check created invalid code.
> 
> IMO, the more useful thing that's missing from the grammar is recognizing 
> `sizeof` expressions rather than indexing string literals with an integral 
> literal subscript.
> 
> I had planned on doing another increment to recognize `sizeof` expressions.
> Writing your own parser is unavoidable here because we can't just assume that 
> any old thing will be a valid initializer just by looking at the set of 
> tokens present in the macro body. 

If you ran the token sequence through clang's parser and got an AST node out, 
you'd have significantly *more* information as to whether something is a valid 
enum constant initializer because you can check that it's an actual constant 
expression *and* that it's within a valid range of values. This not only fixes 
edge case bugs with your approach (like the fact that you can generate a series 
of literal expressions that result in a value too large to store within an 
enumerator constant), but it enables new functionality your approach currently 
disallows (like using constexpr variables instead of just numeric literals).

So I don't agree that it's unavoidable to write another custom parser.
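
To make that concrete, here's a rough sketch (a hypothetical helper, not 
proposed code) of the validation that having a real AST node makes trivial:

```
#include "clang/AST/ASTContext.h"
#include "clang/AST/Expr.h"
#include "llvm/ADT/APSInt.h"

// Hypothetical helper: given an expression parsed from the macro body,
// check that it is an integral constant expression whose value is
// representable as 'int' (the C requirement for an enumerator).
static bool isValidEnumInitializer(const clang::Expr *E,
                                   const clang::ASTContext &Ctx) {
  clang::Expr::EvalResult Result;
  if (!E->EvaluateAsInt(Result, Ctx))
    return false; // Not an integral constant expression.
  llvm::APSInt Value = Result.Val.getInt();
  unsigned IntWidth = Ctx.getIntWidth(Ctx.IntTy);
  // Signed values must fit in 'int'; unsigned values must also leave the
  // sign bit clear.
  return Value.isSigned() ? Value.isSignedIntN(IntWidth)
                          : Value.isIntN(IntWidth - 1);
}
```

None of the structural "is this even an expression?" work appears here 
because the parser already did it for us.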


================
Comment at: clang-tools-extra/clang-tidy/modernize/IntegralLiteralExpressionMatcher.cpp:204
+  return true;
+}
+
----------------
LegalizeAdulthood wrote:
> LegalizeAdulthood wrote:
> > aaron.ballman wrote:
> > > LegalizeAdulthood wrote:
> > > > aaron.ballman wrote:
> > > > > Comma operator?
> > > > Remember that the use case here is identifying expressions that are 
> > > > initializers for an enum.  If you were doing a code review and you saw 
> > > > this:
> > > > ```
> > > > enum {
> > > >     FOO = (2, 3)
> > > > };
> > > > ```
> > > > Would you be OK with that?  I wouldn't.  Clang even warns about it: 
> > > > https://godbolt.org/z/Y641cb8Y9
> > > > 
> > > > Therefore I deliberately left comma operator out of the grammar.
> > > This is another case where I think you're predicting that users won't be 
> > > using the full expressivity of the language and we'll get bug reports 
> > > later. Again, in isolation, I tend to agree that I wouldn't be happy 
> > > seeing that code. However, users write some very creative code and 
> > > there's no technical reason why we can't or shouldn't handle comma 
> > > operators.
> > "Don't let the perfect be the enemy of the good."
> > 
> > My inclination is to simply explicitly state that comma operator is not 
> > recognized in the documentation.  It's already implicit by its absence 
> > from the list of recognized operators.
> > 
> > Again, the worst that happens is that your macro isn't converted.
> > 
> > I'm open to being convinced that it's important, but you haven't convinced 
> > me yet `:)`
> It wasn't much extra work/code to add comma operator support so I've done 
> that.
> "Don't let the perfect be the enemy of the good."

This is a production compiler toolchain. Correctness is important and that 
sometimes means caring more about perfection than you otherwise would like to.

> I'm open to being convinced that it's important, but you haven't convinced me 
> yet :)

It's less about importance and more about maintainability coupled with 
correctness. With your approach, we get something that will have a long tail of 
bugs. If we used Clang's parser, we wouldn't have the same issue -- maintenance 
largely comes along for free, and the bugs are far less likely.

About the only reason I like your current approach over using clang's parser 
is that it quite likely performs much better than doing a full parse of the 
expression. But as you pointed out, about the worst thing a check can do is 
take correct code and make it incorrect -- and avoiding that requires some 
amount of semantic evaluation of the expressions (which you're not doing). For 
example:
```
#define FINE 1LL << 30LL     // fits in int (assuming 32-bit int)
#define BAD 1LL << 31LL      // exceeds INT_MAX
#define ALSO_BAD 1LL << 32L  // exceeds INT_MAX
```
We'll convert this into an enumeration and break `-pedantic-errors` builds in 
C. If we had a `ConstantExpr` object, we could see what its value is and note 
that it's greater than what fits into an `int` and decide to do something 
smarter.
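
(One hypothetical "smarter" output, available in C++ where enums can have a 
fixed underlying type -- C has no equivalent before C23:)

```
enum : long long {
  FINE = 1LL << 30LL,
  BAD = 1LL << 31LL,
  ALSO_BAD = 1LL << 32L
};
```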

So I continue to see the current approach as being somewhat reasonable 
(especially for experimentation), but incorrect in the long run. Not 
sufficiently incorrect for me to block this patch on, but incorrect enough that 
the first time this check becomes a maintenance burden, I'll be asking more 
strongly to do this the correct way.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124500/new/

https://reviews.llvm.org/D124500
