sammccall added inline comments.
================
Comment at: clang-tools-extra/trunk/clangd/SourceCode.cpp:38
+  // 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
+  // 11111xxx is not valid UTF-8 at all. Assert because it's probably our bug.
+  assert((UTF8Length >= 2 && UTF8Length <= 4) &&
----------------
benhamilton wrote:
> sammccall wrote:
> > benhamilton wrote:
> > > sammccall wrote:
> > > > benhamilton wrote:
> > > > > This is user input, right? Have we actually checked for valid UTF-8, or do we just assume it's valid?
> > > > >
> > > > > If not, it seems like an assertion is not the right check; we should instead reject invalid input when we're reading it.
> > > > >
> > > > Yeah, I wasn't sure about this; offline discussion tentatively concluded we wanted an assert, but I'm happy to switch to something else.
> > > >
> > > > We don't validate the code on the way in, so strings are "bytes of presumed UTF-8". This is usually not a big pain in practice, but we could/should certainly make the JSON parser validate the UTF-8. (If we want to go this route, D45753 should be resolved first.)
> > > >
> > > > There are two ways the assertion could fire: the code is invalid UTF-8, or there's a bug in the Unicode logic here. I thought the latter was more likely, at least in the short term :) and this is the least invasive way to catch it. And if a developer build (assert-enabled) crashes because an editor feeds it invalid bytes, that's probably better than doing nothing (though not as good as catching the error earlier).
> > > The problem with not validating is that it's easy to cause out-of-bounds memory access (and thus security issues) if someone crafts malicious UTF-8 and makes us read off the end of a string.
> > >
> > > We should be clear about the status of all strings in the documentation of our APIs.
> > You still have to find/write a UTF-8 decoder that doesn't check bounds, which is (hopefully!) the harder part of writing that bug :-)
> > But I agree in principle; there are more subtle attacks too, like `C0 8A`, which is invalid but which non-validating decoders will treat as a newline.
> >
> > I have a nearly-finished patch to add real validation to the JSON library; I'll copy you on it.
> Seems like this particular decoder isn't checking bounds, eh? ;)
>
> If `NDEBUG` is set, it will happily set `UTF8Length` to however many leading 1s there are (valid or not) and pass that to `CB(UTF8Length)`. It's true that the current callbacks won't directly turn that into an out-of-bounds memory access, but they will end up returning an invalid UTF-16 code unit length from `positionToOffset()`, so who knows what that will end up doing.
>
> Thanks, I'm always happy to review Unicode stuff.
Ah, yeah - the implicit context here is that we don't do anything with the UTF-16 other than send it back to the client, so this is garbage in, garbage out. Definitely not ideal.


Repository:
  rL LLVM

https://reviews.llvm.org/D46035
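For context on the leading-byte handling discussed above, here is a minimal, self-contained sketch of bounds- and validity-checked UTF-8 sequence measurement. The function name `utf8SequenceLength` and its shape are hypothetical illustrations, not code from D46035 (the patch under review classifies the leading byte, asserts on `UTF8Length`, and hands it to a callback `CB`). The sketch rejects stray trailing bytes in leading position, the always-overlong `C0`/`C1` prefixes (the `C0 8A` case mentioned above), prefixes beyond four bytes, truncated sequences, and malformed continuation bytes:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns the length (1-4) of the UTF-8 sequence starting at S[I],
// or 0 if the bytes there are not structurally valid UTF-8.
static size_t utf8SequenceLength(const std::string &S, size_t I) {
  uint8_t B = static_cast<uint8_t>(S[I]);
  if (B < 0x80)           // 0xxxxxxx: ASCII, one byte.
    return 1;
  if (B < 0xC2)           // 10xxxxxx: trailing byte in leading position;
    return 0;             // C0/C1: always-overlong two-byte prefix (e.g. C0 8A).
  size_t Len = B < 0xE0 ? 2 : B < 0xF0 ? 3 : B < 0xF5 ? 4 : 0;
  if (Len == 0)           // F5-FF: would encode code points beyond U+10FFFF.
    return 0;
  if (I + Len > S.size()) // Bounds check: never read past the end of the string.
    return 0;
  for (size_t J = 1; J < Len; ++J) // Every continuation byte must be 10xxxxxx.
    if ((static_cast<uint8_t>(S[I + J]) & 0xC0) != 0x80)
      return 0;
  return Len;
}
```

A full validator would additionally reject overlong three- and four-byte forms (e.g. `E0 80 ..`, `F0 80 ..`) and encoded UTF-16 surrogates (`ED A0 ..` through `ED BF ..`). For the UTF-16 code unit lengths that `positionToOffset()` reports, a valid four-byte sequence corresponds to a surrogate pair (two code units), while one- to three-byte sequences encode BMP code points (one code unit).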