[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

Marcus Johnson via Phabricator via cfe-commits Sat, 17 Jul 2021 17:46:10 -0700

MarcusJohnson91 added inline comments.


================
Comment at: clang/include/clang/AST/Expr.h:1846-1871
+  std::string getStringAsChar() const {
+    assert(getCharByteWidth() == 1 &&
+           "This function is used in places that assume strings use char");
+    return std::string(getTrailingObjects<char>(), getTrailingObjects<char>() 
+ getByteLength());
+  }
+                                  
+  std::u16string getStringAsChar16() const {
----------------
cor3ntin wrote:
> aaron.ballman wrote:
> > One potential issue to this is that in C++, these types are defined to be 
> > UTF-16 and UTF-32, whereas in C, that isn't always the case. Currently, 
> > Clang defines `__STDC_UTF_16__` and `__STDC_UTF_32__` on all targets, but 
> > we may need to add some sanity checks to catch if a target overrides that 
> > behavior. Currently, all the targets in Clang are keeping that promise, but 
> > I have no idea what shenanigans downstream users get up to and whether 
> > their targets remove the macro definition or define it to `0` instead of 
> > `1`.
> Is it possible that the host and target wchar_t have a different size here?
I've honestly been wondering how Clang handled that, in the codebase vs at 
runtime myself for a while.


================
Comment at: clang/include/clang/AST/FormatString.h:83
     AsMAllocate,  // for '%ms', GNU extension to scanf
+    AsUTF16,      // for '%l16(c|s)', soon to be standardized
+    AsUTF32,      // for '%l32(c|s)', soon to be standardized
----------------
aaron.ballman wrote:
> May want to drop the "soon to be standardized" given that the proposal hasn't 
> been seen by WG14 yet. I think it's fine to say "Clang extension", though. 
> More on the format specifier itself, below.
Done,

and here is the link to the document, I haven't heard any feedback?

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2761.pdf


================
Comment at: clang/include/clang/AST/Type.h:1975
   bool isBooleanType() const;
+  bool isType(const std::string TypeName) const;
   bool isCharType() const;
----------------
aaron.ballman wrote:
> It's not clear from the header file what this function actually *does*. 
> Presumably, everything that can be represented by a `Type` is a type, so I'd 
> assume this returns `true` always. Or should this be a static function?
Yeah, could choose a better name just not sure what.

It takes a type name as the argument, and then it desugars the type one step at 
a time, and if it finds a match it returns true.

so, let's say we're in C++, and someone typedef'd char8_t to String.

This function will say yes, String is compatible with char8_t for example.

it's mostly for C mode's typedef's of char16_t and char32_t


================
Comment at: clang/lib/AST/Expr.cpp:1197
                                  unsigned *StartTokenByteOffset) const {
-  assert((getKind() == StringLiteral::Ascii ||
-          getKind() == StringLiteral::UTF8) &&
----------------
aaron.ballman wrote:
> I don't see changes that make this assertion valid to remove -- have I missed 
> something?
Nope you didn't miss anything, I did.

this is a remnant from when I was trying to templatize all the format checking 
code instead of converting the format strings.

Restored the assert.


================
Comment at: clang/lib/AST/FormatString.cpp:235-240
+      if (LO.C2x && I + 1 != E && I[0] == '1' && I[1] == '6') {
+        ++I;
+        ++I;
+        lmKind = LengthModifier::AsUTF16;
+        break;
+      } else if (LO.C2x && I + 1 != E && I[0] == '3' && I[1] == '2') {
----------------
aaron.ballman wrote:
> I don't think this is a conforming extension to C -- lowercase length 
> modifiers and conversion specifiers are reserved for the standard, and 
> uppercase length modifiers and conversion specifiers are reserved for the 
> implementation. `l16` starts with a lowercase letter, so it's reserved for 
> the standard.
> 
> Note: WG14 has been considering extensions in a closely-related space 
> (integer types rather than string or character types) that you may be 
> interested in, if you're not already aware of it: 
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2511.pdf. This proposal was 
> not adopted, but did have strong sentiment for something along these lines.
Yeah, honestly not really sure what to do about the reservation, L is also used 
for wide characters and some integers too, and (U|u)(16|32) wasn't taken well 
by the community.


================
Comment at: clang/lib/AST/OSLog.cpp:212
+  } else if (Lit->isUTF16()) {
+    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> Convert;
+    std::u16string U16 = Lit->getStringAsChar16();
----------------
aaron.ballman wrote:
> MarcusJohnson91 wrote:
> > aaron.ballman wrote:
> > > cor3ntin wrote:
> > > > I'm not sure I have a better suggestion but `codecvt_utf8_utf16` is 
> > > > deprecated in C++17
> > > Good point -- this likely should be lifted into a common interface (as it 
> > > appears several times in the patch) and make use of the existing LLVM UTF 
> > > conversion functionality: 
> > > https://github.com/intel/llvm/blob/sycl/llvm/include/llvm/Support/ConvertUTF.h
> > The ConvertUTF version you linked contains `convertUTF32toUTF8String` but 
> > the one in LLVM does not, what's the process for updating this?
> Oh, oops, I managed to grab a link from downstream and not LLVM, sorry about 
> that! But the downstream and LLVM both have a method to perform the 
> conversion 
> https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/Support/ConvertUTF.h#L162,
>  so you could add convenience functions to that file that wraps the API, as 
> part of this patch.
I'll probably write a wrapper then


================
Comment at: clang/lib/AST/PrintfFormatString.cpp:254-257
+  if (ParseLengthModifier(FS, I, E, LO) && I == E) { // THIS IS WHERE IT 
FREAKS OUT
     // No more characters left?
     if (Warn)
+      printf("ParsePrintfSpecifier: Incomplete Specifier 8\n");
----------------
aaron.ballman wrote:
> Surprising comment and what looks to be some debugging code?
Removed


================
Comment at: clang/lib/AST/PrintfFormatString.cpp:635
       }
-      if (LM.getKind() == LengthModifier::AsWide)
-        return ArgType(ArgType::WCStrTy, "wchar_t *");
----------------
aaron.ballman wrote:
> It looks like this coverage got lost?
Restored


================
Comment at: clang/lib/AST/PrintfFormatString.cpp:844
   case BuiltinType::SChar:
+  case BuiltinType::Char8:
     LM.setKind(LengthModifier::AsChar);
----------------
aaron.ballman wrote:
> Is there a reason why you're adding this support for `char16_t` and 
> `char32_t` but not `char8_t` aside from "C doesn't have `char8_t`?" If that's 
> the sole reason,  you might be interested in tracking: 
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm
Added char8_t


================
Comment at: clang/lib/AST/Type.cpp:1962
 
+bool Type::isType(const std::string TypeName) const {
+  QualType Desugar = this->getLocallyUnqualifiedSingleStepDesugaredType();
----------------
aaron.ballman wrote:
> Oh, I see now that this is doing a name comparison against the type -- that's 
> not a good API in general because it's *really* hard to guess at what the 
> type will come out as textually. e.g., `class` and `struct` keywords are 
> interchangeable in C++, C sometimes gets confused with `bool` vs `_Bool`, 
> template arguments sometimes matter, nested name specifiers, etc. Callers of 
> this API will have to guess at these details and the printing of the type may 
> change over time (e.g., C may switch from `_Bool` to `bool` and then code 
> calling `isType("_Bool")` may react poorly to the change).
> 
> I think we need to avoid this sort of API on `Type`.
I see your point, I reverted the behavior back to doing the desugaring in just 
isChar16Type and isChar32Type


================
Comment at: clang/lib/AST/Type.cpp:1980-2002
+bool Type::isChar16Type(const LangOptions &LangOpts) const {
   if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType))
-    return BT->getKind() == BuiltinType::Char16;
-  return false;
-}
-
-bool Type::isChar32Type() const {
-  if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType))
-    return BT->getKind() == BuiltinType::Char32;
-  return false;
-}
-
-/// Determine whether this type is any of the built-in character
-/// types.
-bool Type::isAnyCharacterType() const {
-  const auto *BT = dyn_cast<BuiltinType>(CanonicalType);
-  if (!BT) return false;
-  switch (BT->getKind()) {
-  default: return false;
-  case BuiltinType::Char_U:
-  case BuiltinType::UChar:
-  case BuiltinType::WChar_U:
-  case BuiltinType::Char8:
-  case BuiltinType::Char16:
-  case BuiltinType::Char32:
-  case BuiltinType::Char_S:
-  case BuiltinType::SChar:
-  case BuiltinType::WChar_S:
-    return true;
+    if (BT->getKind() == BuiltinType::Char16)
+      return true;
+  if (!LangOpts.CPlusPlus) {
+    return isType("char16_t");
   }
----------------
aaron.ballman wrote:
> MarcusJohnson91 wrote:
> > aaron.ballman wrote:
> > > If I understand properly, one issue is that `char16_t` is defined by C to 
> > > be *the same type* as `uint_least16_t`, which itself is a typedef to some 
> > > other integer type. So we can't make `char16_t` be a distinct type in C, 
> > > it has to be a typedef to a typedef to an integer type.
> > > 
> > > One concern I have about the approach in this patch is that it makes the 
> > > type system a bit more convoluted. This type is *not* really a character 
> > > type in C, it's a typedef type that playacts as a character type. I think 
> > > this is a pretty fundamental change to the type system in the compiler in 
> > > some ways because this query is no longer about the canonical type but 
> > > about the semantics of how the type is expected to be used.
> > > 
> > > I'd definitely like to hear what @rsmith thinks of this approach.
> > I see your point, but I'm not sure how else we could get it through the 
> > type checking system without doing it this way?
> > 
> > I tried to be careful to only allow the actual character typedefs through 
> > by making sure char16_t or char32_t is in the typedef chain, and only in C 
> > mode.
> > I see your point, but I'm not sure how else we could get it through the 
> > type checking system without doing it this way?
> 
> I am thinking that rather than doing this work as part of the `Type` object, 
> the extra work can be done from the format string checker. This keeps the 
> confusion about types localized to the part of the code that needs to care 
> about it, rather than pushing it onto all consumers of the type system.
Hmmmm, that's a good idea, but I'm not sure how to do that, let me think on it 
for a bit.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1883-1888
   if (SpellingPtr[0] == 'u' && SpellingPtr[1] == '8')
     SpellingPtr += 2;
-
-  assert(SpellingPtr[0] != 'L' && SpellingPtr[0] != 'u' &&
-         SpellingPtr[0] != 'U' && "Doesn't handle wide or utf strings yet");
+  
+  // Handle UTF-16 strings
+  if (SpellingPtr[0] == 'u' && SpellingPtr[1] != '8')
+    SpellingPtr += 1;
----------------
aaron.ballman wrote:
> 
Did you have anything to say here?


================
Comment at: 
clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp:791-808
+    
+    StringLiteral::StringKind Kind = FormatString->getKind();
+    
+    if (Kind == StringLiteral::Ascii || Kind == StringLiteral::UTF8) {
+      String = FormatString->getStringAsChar();
+    } else if (Kind == StringLiteral::UTF16) {
+      std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> 
Convert;
----------------
cor3ntin wrote:
> I think this code is present in multiple places in your PR, could it be 
> deduplicated?
Yup, I've been working on that too; This was actually the first thing I started 
with.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

Reply via email to