On Tue, 2021-11-16 at 19:37 -0500, Marek Polacek wrote:
> On Tue, Nov 16, 2021 at 06:00:58PM -0500, David Malcolm wrote:
> > > On Mon, Nov 15, 2021 at 06:15:40PM -0500, David Malcolm wrote:
> > > > > On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote:
> > > > > > Ping, can we conclude on the name?   IMHO, -Wbidirectional is
> > > > > > just fine,
> > > > > > but changing the name is a trivial operation. 
> > > > > 
> > > > > Here's a patch with a better name (suggested by Jonathan W.). 
> > > > > Otherwise no
> > > > > changes.
> > > > 
> > > > Thanks for implementing this.
> > > > 
> > > > > 
> > > > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
> > > > > 
> > > > > -- >8 --
> > > > > From a link below:
> > > > > "An issue was discovered in the Bidirectional Algorithm in the
> > > > > Unicode
> > > > > Specification through 14.0. It permits the visual reordering of
> > > > > characters via control sequences, which can be used to craft
> > > > > source code
> > > > > that renders different logic than the logical ordering of
> > > > > tokens
> > > > > ingested by compilers and interpreters. Adversaries can
> > > > > leverage this to
> > > > > encode source code for compilers accepting Unicode such that
> > > > > targeted
> > > > > vulnerabilities are introduced invisibly to human reviewers."
> > > > > 
> > > > > More info:
> > > > > https://nvd.nist.gov/vuln/detail/CVE-2021-42574
> > > > > https://trojansource.codes/
> > > > > 
> > > > > This is not a compiler bug.  However, to mitigate the problem,
> > > > > this patch
> > > > > implements -Wbidi-chars=[none|unpaired|any] to warn about
> > > > > possibly
> > > > > misleading Unicode bidirectional characters the preprocessor
> > > > > may encounter.
> > 
> > [...snip...]
> > 
> > > > 
> > > > Terminology nit:
> > > > The patch is referring to "bidirectional characters", but I think
> > > > the
> > > > term "bidirectional control characters" would be better.
> > > 
> > > Adjusted.
> > 
> > Thanks.
> > 
> > I wonder if the warning should be -Wbidi-control-chars, but I don't
> > care enough to insist on it being changed.
> > 
> > >  
> > > > For example, a passage of text containing both numbers and
> > > > characters
> > > > in a right-to-left script could be considered "bidirectional",
> > > > since
> > > > the numbers are written from left-to-right.
> > > > 
> > > > Specifically, the patch looks for these specific characters:
> > > >   * U+202A LEFT-TO-RIGHT EMBEDDING
> > > >   * U+202B RIGHT-TO-LEFT EMBEDDING
> > > >   * U+202C POP DIRECTIONAL FORMATTING
> > > >   * U+202D LEFT-TO-RIGHT OVERRIDE
> > > >   * U+202E RIGHT-TO-LEFT OVERRIDE
> > > >   * U+2066 LEFT-TO-RIGHT ISOLATE
> > > >   * U+2067 RIGHT-TO-LEFT ISOLATE
> > > >   * U+2068 FIRST STRONG ISOLATE
> > > >   * U+2069 POP DIRECTIONAL ISOLATE
> > > > 
> > > > However, the following characters could also be considered as
> > > > "bidirectional control characters":
> > > >   * U+200E ‎LEFT-TO-RIGHT MARK (UTF-8: E2 80 8E)
> > > >   * U+200F ‎RIGHT-TO-LEFT MARK (UTF-8: E2 80 8F)
> > > > but aren't checked for in the patch.  Should they be?  I can
> > > > imagine
> > > > ways in which they could be abused, so I think so.
> > > 
> > > I'd only intended to check the bidi chars described in the original
> > > trojan source pdf, but I added checking for U+200E/U+200F too,
> > > since
> > > it was easy enough.  AFAIK they aren't popped by a PDF/PDI like the
> > > rest, so don't need to go on the vec, and so we only warn with
> > > =any.
> > > Tests: Wbidi-chars-16.c + Wbidi-chars-17.c
> > 
> > Thanks.  I took a look through the revised patch and I think you
> > updated things correctly.
> > 
> > [...snip...]
> > 
> > > > > diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
> > > > > b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
> > > > > new file mode 100644
> > > > > index 00000000000..9fd4bc535ca
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
> > > > > @@ -0,0 +1,166 @@
> > > > > +/* PR preprocessor/103026 */
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-options "-Wbidi-chars=any -Wno-multichar -Wno-
> > > > > overflow" } */
> > > > > +/* Test all bidi chars in various contexts (identifiers,
> > > > > comments,
> > > > > +   string literals, character constants), both UCN and UTF-8. 
> > > > > The bidi
> > > > > +   chars here are properly terminated, except for the
> > > > > character constants.  */
> > > > > +
> > > > > +/* a b c LRE‪ 1 2 3 PDF‬ x y z */
> > > > > +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
> > > > > +/* a b c RLE‫ 1 2 3 PDF‬ x y z */
> > > > > +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
> > > > > +/* a b c LRO‭ 1 2 3 PDF‬ x y z */
> > > > > +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
> > > > > +/* a b c RLO‮ 1 2 3 PDF‬ x y z */
> > > > > +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
> > > > > +/* a b c LRI⁦ 1 2 3 PDI⁩ x y z */
> > > > > +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */
> > > > > +/* a b c RLI⁧ 1 2 3 PDI⁩ x y */
> > > > > +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */
> > > > > +/* a b c FSI⁨ 1 2 3 PDI⁩ x y z */
> > > > > +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */
> > > > 
> > > > AIUI the Unicode bidirectionality algorithm works at the line
> > > > level,
> > > > and so each line in a block comment should be checked
> > > > individually for
> > > > unclossed bidi control chars, rather than a block comment as a
> > > > whole. 
> > > > Hence I think the test case needs to have block comment test
> > > > coverage
> > > > for:
> > > > - single line blocks
> > > > - first line of a multiline block comment
> > > > - middle line of a multiline block comment
> > > > - final line of a multiline block comment
> > > > but I think the patch as it stands is only checking for the first
> > > > of
> > > > these four cases.
> > > 
> > > The patch handles all of them, because of:
> > > 1534           if (warn_bidi_p)
> > > 1535             maybe_warn_bidi_on_close (pfile, cur);
> > > in _cpp_skip_block_comment, but I was lacking some more testing, so
> > > I've
> > > added some testing, and included a new test: Wbidi-chars-15.c.
> > 
> > All of the cases in Wbidi-chars-15.c only test for unparired chars in
> > a
> > middle line of a multiline block comment; I don't think the patch has
> > any explicit coverage for unpaired control chars happening in the
> > first
> > line and last lines of *multiline* block comments.  So it would be
> > good
> > if Wbidi-chars-15.c could gain some coverage for that (don't have to
> > handle all the different chars).
>  
> Sorry for a dumb question, but is this what you have in mind?
> 
> /* LRE
>    PDF */
> /* FSI
>    PDI */
> and check that we warn for these?

I mean something like the following multiline comments in which lines
within them at the start, middle and end have unpaired constructs
within a given line:


/* RLI
 *
 */

/*
 * RLI
 */

/*
 *  
 * RLI */

and that we should warn for each case at the line containing the
unpaired control character.

(the above lines don't have the actual chars, just "RLI")

Mostly this is just me trying to think about it from a black-box
testing perspective, or in case we ever touch this code in the future
(perhaps it's obviously correct by inspection of the implementation
now, but let's have regression tests for these cases).

Sorry to add more work, but here's an idea for another test case:
multiple comments on one line:

  /* RLI */  /* PDF */

where the closure of a comment should trigger closing a "context", so
we should complain about the above.

> 
> > > > > @@ -1505,13 +1855,17 @@ lex_identifier (cpp_reader *pfile,
> > > > > const uchar *base, bool starts_ucn,
> > > > >      {
> > > > >        /* Slower version for identifiers containing UCNs
> > > > >          or extended chars (including $).  */
> > > > > -      do {
> > > > > -       while (ISIDNUM (*pfile->buffer->cur))
> > > > > -         {
> > > > > -           NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer-
> > > > > >cur);
> > > > > -           pfile->buffer->cur++;
> > > > > -         }
> > > > > -      } while (forms_identifier_p (pfile, false, nst));
> > > > > +      do
> > > > > +       {
> > > > > +         while (ISIDNUM (*pfile->buffer->cur))
> > > > > +           {
> > > > > +             NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile-
> > > > > >buffer->cur);
> > > > > +             pfile->buffer->cur++;
> > > > > +           }
> > > > > +       }
> > > > > +      while (forms_identifier_p (pfile, false, nst));
> > > > 
> > > > Is the above purely a whitespace change?
> > > 
> > > Yes.
> > 
> > If I'm reading things correctly, these lines in the existing code
> > were
> > correctly indented, so is there a purpose to this change?  If not,
> > please can you remove this change from the patch (to minimize the
> > change to the history).
> 
> I dropped that change then.  Sometimes it's hard to resist fixing
> formatting.  ;)

Thanks.  But I don't think the existing formatting in the code *was*
broken; I thought the patch was taking correct formatting and breaking
it (hence my objection to a whitespace change).  If I misread this,
sorry.


>  
> > [...snip...]
> > 
> > > +/* We're closing a bidi context, that is, we've encountered a
> > > newline,
> > > +   are closing a C-style comment, or are at the end of a string
> > > literal,
> > > +   character constant, or identifier.  Warn if this context was
> > > not
> > > +   properly terminated by a PDI or PDF.  P points to the last
> > > character
> > > +   in this context.  */
> > > +
> > > +static void
> > > +maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p)
> > > +{
> > > +  if (CPP_OPTION (pfile, cpp_warn_bidirectional) ==
> > > bidirectional_unpaired
> > > +      && bidi::vec.count () > 0)
> > > +    {
> > > +      const location_t loc
> > > +       = linemap_position_for_column (pfile->line_table,
> > > +                                      CPP_BUF_COLUMN (pfile-
> > > >buffer, p));
> > > +      cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0,
> > > +                            "unpaired UTF-8 bidirectional
> > > character "
> > > +                            "detected");
> > > +    }
> > 
> > Sorry, I missed this one in my initial review, should be "control
> > character" here.
> 
> Fixed.
> 
> > [...snip...]
> > 
> > OK for trunk with the above nits fixed.
> 
> Thanks again for the review.
> 
> I'll push this once the test question above is resolved.

Hopefully the above makes sense and is constructive; let me know when
you push your patch so that I can work on my followup.

Thanks
Dave


Reply via email to