MarcusJohnson91 added inline comments.
================ Comment at: clang/include/clang/AST/Expr.h:1846-1871 + std::string getStringAsChar() const { + assert(getCharByteWidth() == 1 && + "This function is used in places that assume strings use char"); + return std::string(getTrailingObjects<char>(), getTrailingObjects<char>() + getByteLength()); + } + + std::u16string getStringAsChar16() const { ---------------- cor3ntin wrote: > aaron.ballman wrote: > > One potential issue to this is that in C++, these types are defined to be > > UTF-16 and UTF-32, whereas in C, that isn't always the case. Currently, > > Clang defines `__STDC_UTF_16__` and `__STDC_UTF_32__` on all targets, but > > we may need to add some sanity checks to catch if a target overrides that > > behavior. Currently, all the targets in Clang are keeping that promise, but > > I have no idea what shenanigans downstream users get up to and whether > > their targets remove the macro definition or define it to `0` instead of > > `1`. > Is it possible that the host and target wchar_t have a different size here? I've honestly been wondering how Clang handled that, in the codebase vs at runtime myself for a while. ================ Comment at: clang/include/clang/AST/FormatString.h:83 AsMAllocate, // for '%ms', GNU extension to scanf + AsUTF16, // for '%l16(c|s)', soon to be standardized + AsUTF32, // for '%l32(c|s)', soon to be standardized ---------------- aaron.ballman wrote: > May want to drop the "soon to be standardized" given that the proposal hasn't > been seen by WG14 yet. I think it's fine to say "Clang extension", though. > More on the format specifier itself, below. Done, and here is the link to the document, I haven't heard any feedback? http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2761.pdf ================ Comment at: clang/include/clang/AST/Type.h:1975 bool isBooleanType() const; + bool isType(const std::string TypeName) const; bool isCharType() const; ---------------- aaron.ballman wrote: > It's not clear from the header file what this function actually *does*. > Presumably, everything that can be represented by a `Type` is a type, so I'd > assume this returns `true` always. Or should this be a static function? Yeah, could choose a better name just not sure what. It takes a type name as the argument, and then it desugars the type one step at a time, and if it finds a match it returns true. so, let's say we're in C++, and someone typedef'd char8_t to String. This function will say yes, String is compatible with char8_t for example. it's mostly for C mode's typedef's of char16_t and char32_t ================ Comment at: clang/lib/AST/Expr.cpp:1197 unsigned *StartTokenByteOffset) const { - assert((getKind() == StringLiteral::Ascii || - getKind() == StringLiteral::UTF8) && ---------------- aaron.ballman wrote: > I don't see changes that make this assertion valid to remove -- have I missed > something? Nope you didn't miss anything, I did. this is a remnant from when I was trying to templatize all the format checking code instead of converting the format strings. Restored the assert. ================ Comment at: clang/lib/AST/FormatString.cpp:235-240 + if (LO.C2x && I + 1 != E && I[0] == '1' && I[1] == '6') { + ++I; + ++I; + lmKind = LengthModifier::AsUTF16; + break; + } else if (LO.C2x && I + 1 != E && I[0] == '3' && I[1] == '2') { ---------------- aaron.ballman wrote: > I don't think this is a conforming extension to C -- lowercase length > modifiers and conversion specifiers are reserved for the standard, and > uppercase length modifiers and conversion specifiers are reserved for the > implementation. `l16` starts with a lowercase letter, so it's reserved for > the standard. > > Note: WG14 has been considering extensions in a closely-related space > (integer types rather than string or character types) that you may be > interested in, if you're not already aware of it: > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2511.pdf. This proposal was > not adopted, but did have strong sentiment for something along these lines. Yeah, honestly not really sure what to do about the reservation, L is also used for wide characters and some integers too, and (U|u)(16|32) wasn't taken well by the community. ================ Comment at: clang/lib/AST/OSLog.cpp:212 + } else if (Lit->isUTF16()) { + std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> Convert; + std::u16string U16 = Lit->getStringAsChar16(); ---------------- aaron.ballman wrote: > MarcusJohnson91 wrote: > > aaron.ballman wrote: > > > cor3ntin wrote: > > > > I'm not sure I have a better suggestion but `codecvt_utf8_utf16` is > > > > deprecated in C++17 > > > Good point -- this likely should be lifted into a common interface (as it > > > appears several times in the patch) and make use of the existing LLVM UTF > > > conversion functionality: > > > https://github.com/intel/llvm/blob/sycl/llvm/include/llvm/Support/ConvertUTF.h > > The ConvertUTF version you linked contains `convertUTF32toUTF8String` but > > the one in LLVM does not, what's the process for updating this? > Oh, oops, I managed to grab a link from downstream and not LLVM, sorry about > that! But the downstream and LLVM both have a method to perform the > conversion > https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/Support/ConvertUTF.h#L162, > so you could add convenience functions to that file that wraps the API, as > part of this patch. I'll probably write a wrapper then ================ Comment at: clang/lib/AST/PrintfFormatString.cpp:254-257 + if (ParseLengthModifier(FS, I, E, LO) && I == E) { // THIS IS WHERE IT FREAKS OUT // No more characters left? if (Warn) + printf("ParsePrintfSpecifier: Incomplete Specifier 8\n"); ---------------- aaron.ballman wrote: > Surprising comment and what looks to be some debugging code? Removed ================ Comment at: clang/lib/AST/PrintfFormatString.cpp:635 } - if (LM.getKind() == LengthModifier::AsWide) - return ArgType(ArgType::WCStrTy, "wchar_t *"); ---------------- aaron.ballman wrote: > It looks like this coverage got lost? Restored ================ Comment at: clang/lib/AST/PrintfFormatString.cpp:844 case BuiltinType::SChar: + case BuiltinType::Char8: LM.setKind(LengthModifier::AsChar); ---------------- aaron.ballman wrote: > Is there a reason why you're adding this support for `char16_t` and > `char32_t` but not `char8_t` aside from "C doesn't have `char8_t`?" If that's > the sole reason, you might be interested in tracking: > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm Added char8_t ================ Comment at: clang/lib/AST/Type.cpp:1962 +bool Type::isType(const std::string TypeName) const { + QualType Desugar = this->getLocallyUnqualifiedSingleStepDesugaredType(); ---------------- aaron.ballman wrote: > Oh, I see now that this is doing a name comparison against the type -- that's > not a good API in general because it's *really* hard to guess at what the > type will come out as textually. e.g., `class` and `struct` keywords are > interchangeable in C++, C sometimes gets confused with `bool` vs `_Bool`, > template arguments sometimes matter, nested name specifiers, etc. Callers of > this API will have to guess at these details and the printing of the type may > change over time (e.g., C may switch from `_Bool` to `bool` and then code > calling `isType("_Bool")` may react poorly to the change). > > I think we need to avoid this sort of API on `Type`. I see your point, I reverted the behavior back to doing the desugaring in just isChar16Type and isChar32Type ================ Comment at: clang/lib/AST/Type.cpp:1980-2002 +bool Type::isChar16Type(const LangOptions &LangOpts) const { if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType)) - return BT->getKind() == BuiltinType::Char16; - return false; -} - -bool Type::isChar32Type() const { - if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType)) - return BT->getKind() == BuiltinType::Char32; - return false; -} - -/// Determine whether this type is any of the built-in character -/// types. -bool Type::isAnyCharacterType() const { - const auto *BT = dyn_cast<BuiltinType>(CanonicalType); - if (!BT) return false; - switch (BT->getKind()) { - default: return false; - case BuiltinType::Char_U: - case BuiltinType::UChar: - case BuiltinType::WChar_U: - case BuiltinType::Char8: - case BuiltinType::Char16: - case BuiltinType::Char32: - case BuiltinType::Char_S: - case BuiltinType::SChar: - case BuiltinType::WChar_S: - return true; + if (BT->getKind() == BuiltinType::Char16) + return true; + if (!LangOpts.CPlusPlus) { + return isType("char16_t"); } ---------------- aaron.ballman wrote: > MarcusJohnson91 wrote: > > aaron.ballman wrote: > > > If I understand properly, one issue is that `char16_t` is defined by C to > > > be *the same type* as `uint_least16_t`, which itself is a typedef to some > > > other integer type. So we can't make `char16_t` be a distinct type in C, > > > it has to be a typedef to a typedef to an integer type. > > > > > > One concern I have about the approach in this patch is that it makes the > > > type system a bit more convoluted. This type is *not* really a character > > > type in C, it's a typedef type that playacts as a character type. I think > > > this is a pretty fundamental change to the type system in the compiler in > > > some ways because this query is no longer about the canonical type but > > > about the semantics of how the type is expected to be used. > > > > > > I'd definitely like to hear what @rsmith thinks of this approach. > > I see your point, but I'm not sure how else we could get it through the > > type checking system without doing it this way? > > > > I tried to be careful to only allow the actual character typedefs through > > by making sure char16_t or char32_t is in the typedef chain, and only in C > > mode. > > I see your point, but I'm not sure how else we could get it through the > > type checking system without doing it this way? > > I am thinking that rather than doing this work as part of the `Type` object, > the extra work can be done from the format string checker. This keeps the > confusion about types localized to the part of the code that needs to care > about it, rather than pushing it onto all consumers of the type system. Hmmmm, that's a good idea, but I'm not sure how to do that, let me think on it for a bit. ================ Comment at: clang/lib/Lex/LiteralSupport.cpp:1883-1888 if (SpellingPtr[0] == 'u' && SpellingPtr[1] == '8') SpellingPtr += 2; - - assert(SpellingPtr[0] != 'L' && SpellingPtr[0] != 'u' && - SpellingPtr[0] != 'U' && "Doesn't handle wide or utf strings yet"); + + // Handle UTF-16 strings + if (SpellingPtr[0] == 'u' && SpellingPtr[1] != '8') + SpellingPtr += 1; ---------------- aaron.ballman wrote: > Did you have anything to say here? ================ Comment at: clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp:791-808 + + StringLiteral::StringKind Kind = FormatString->getKind(); + + if (Kind == StringLiteral::Ascii || Kind == StringLiteral::UTF8) { + String = FormatString->getStringAsChar(); + } else if (Kind == StringLiteral::UTF16) { + std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> Convert; ---------------- cor3ntin wrote: > I think this code is present in multiple places in your PR, could it be > deduplicated? Yup, I've been working on that too; This was actually the first thing I started with. Repository: rG LLVM Github Monorepo CHANGES SINCE LAST ACTION https://reviews.llvm.org/D103426/new/ https://reviews.llvm.org/D103426 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits