ilya-biryukov added a comment.

In https://reviews.llvm.org/D37491#862160, @cameron314 wrote:

> > Are there potential problems we may run into because of the changing 
> > offsets? Could we add tests checking changing the offsets does not matter?
>
> That's a good point; I've looked into it and the PCH for the preamble is 
> parsed using just the buffer slice that contains the preamble, excluding any 
> BOM. That means that when we resume parsing later on a main buffer with a 
> BOM, the `SourceLocation`s within the preamble itself will be off. However, 
> normally this doesn't matter since the only things in the preamble are 
> preprocessor directives, whose positions are very rarely used. (I should note 
> at this point that we've been using a variant of this patch in production for 
> a few years without any problem.) So, we have two choices: Either parse the 
> preamble with the BOM and throw out the preamble/PCH when the BOM presence 
> changes from the main buffer, or slice the buffer when using a preamble PCH 
> so that it never has a BOM during parsing. I'm leaning towards the second 
> option, since it's a little cleaner and lets the preamble be reused more 
> easily; the only downside is that an external consumer would not be able to 
> use any absolute offsets from the AST (note that line/column offsets would be 
> identical) in the original buffer if it has a BOM -- but in any case, 
> absolute offsets are usually useless without the buffer itself, which if 
> obtained from clang would always be the correct buffer.


Maybe there's  a third option option to remove the BOM from the buffer before 
passing it to clang?
Could you elaborate on your use-case a little more? Is there no way to 
consistently always pass buffers either with or without BOM?

Out of two options you mention discarding preamble on BOM changes seems like an 
easy option that is both correct and won't make a difference in performance 
since BOM rarely changes.
Looking at your use-case, it sounds like you'll only have 1 extra reparse of 
preamble, which is probably fine.

>> Should we add checks that BOM was removed or added, but not changed? I would 
>> not expect preamble to be reusable "as is" if BOM (and therefore, input 
>> encoding) changed.
> 
> I'm not sure I understand this point. Clang only understands UTF-8; the BOM 
> is either present or not, but the encoding never changes. (And the BOM itself 
> is always the same byte sequence too.) It has no impact on the file contents.

Sure, it's not something clang supports, it's an edge-case when clang receives 
"malformed" input. Does lexer only skip utf-8 BOM, but not other versions of 
BOM?
But you're right, it's highly unlikely anything will break in that case.



================
Comment at: unittests/Frontend/PchPreambleTest.cpp:190
+
+  ASSERT_TRUE(ReparseAST(AST));
+  ASSERT_FALSE(AST->getDiagnostics().hasErrorOccurred());
----------------
cameron314 wrote:
> ilya-biryukov wrote:
> > We're not really testing that preamble was reused.
> > Maybe return a flag from `ASTUnit::Reparse` to indicate if preamble was 
> > reused and check it here?
> We are; if it wasn't reused, the header would have been opened again and the 
> last assert on `GetFileReadCount` below would fail.
Missed that, thanks. Looks good.
Maybe add a comment explicitly noting that?


https://reviews.llvm.org/D37491



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to