sammccall added a comment.

Yeah, I think there must be some confusion about what this code is doing. It's specifically iterating over the Unicode codepoints of what are supposed to be UTF-8-encoded input bytes.
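To make "iterating over codepoints of supposedly UTF-8 bytes" concrete, here is a minimal sketch. It is not the code under review: `countCodepoints` is a hypothetical helper, and treating an invalid lead byte as a one-byte character and clamping a truncated sequence are assumptions chosen for illustration. The point is only that the loop stays in bounds whatever the bytes are, even when the count it returns is meaningless.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not clangd's actual code): counts the codepoints in
// `bytes`, which is *supposed* to be UTF-8. On malformed input the result is
// meaningless, but the loop never reads out of bounds and always terminates.
std::size_t countCodepoints(std::string_view bytes) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < bytes.size();) {
    unsigned char lead = static_cast<unsigned char>(bytes[i]);
    // Sequence length implied by the lead byte. ASCII and invalid lead bytes
    // (stray continuation bytes 10xxxxxx, or 11111xxx) fall back to 1.
    std::size_t len = 1;
    if ((lead & 0xE0) == 0xC0)
      len = 2; // 110xxxxx
    else if ((lead & 0xF0) == 0xE0)
      len = 3; // 1110xxxx
    else if ((lead & 0xF8) == 0xF0)
      len = 4; // 11110xxx
    // Assumption: a sequence truncated by the end of input is clamped
    // rather than asserted on.
    std::size_t remaining = bytes.size() - i;
    if (len > remaining)
      len = remaining;
    i += len;
    ++count;
  }
  return count;
}
```

For the well-formed UTF-8 bytes of "café" (`"caf\xC3\xA9"`) this returns 4. For the ISO-8859-1 bytes of "état" (`"\xE9" "tat"`), the byte `0xE9` looks like the start of a three-byte sequence, so the sketch swallows the next two bytes and returns 2: a garbage value, but no crash.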
The input sometimes turns out not to be UTF-8 (e.g. the file on disk is ISO-8859-1, but clang assumes it's UTF-8 and just loads the bytes). We can't give any sort of right answer in these cases: we don't know the actual encoding, and we can't even always detect that this has happened! What we can do is strengthen the contract: instead of UB (an assertion failure in practice), we can say the function returns some garbage value but doesn't crash.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D74731/new/

https://reviews.llvm.org/D74731