benhamilton added inline comments.

================
Comment at: clang-tools-extra/trunk/clangd/SourceCode.cpp:38
+    // 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
+    // 11111xxx is not valid UTF-8 at all. Assert because it's probably our 
bug.
+    assert((UTF8Length >= 2 && UTF8Length <= 4) &&
----------------
sammccall wrote:
> benhamilton wrote:
> > This is user input, right? Have we actually checked for valid UTF-8, or do 
> > we just assume it's valid?
> > 
> > If not, it seems like an assertion is not the right check, but we should 
> > reject it when we're reading the input.
> > 
> Yeah, I wasn't sure about this, offline discussion tentatively concluded we 
> wanted an assert, but I'm happy to switch to something else.
> 
> We don't validate the code on the way in, so strings are "bytes of 
> presumed-UTF8". This is usually not a big pain actually. But we could/should 
> certainly make the JSON parser validate the UTF-8. (If we want to go this 
> route, D45753 should be resolved first).
> 
> There's two ways the assertion could fire: the code is invalid UTF-8, or 
> there's a bug in the unicode logic here. I thought the latter was more likely 
> at least in the short-term :) and this is the least invasive way to catch it. 
> And if a developer build (assert-enabled) crashes because an editor feeds it 
> invalid bytes, then that's probably better than doing nothing (though not as 
> good as catching the error earlier).
The problem with not validating is it's easy to cause OOB memory access (and 
thus security issues) if someone crafts malicious UTF-8 and makes us read off 
the end of a string.

We should be clear about the status of all strings in the documentation to APIs.


Repository:
  rL LLVM

https://reviews.llvm.org/D46035



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to