On 11/2/21 11:18 AM, Marek Polacek via Gcc-patches wrote:
On Mon, Nov 01, 2021 at 10:10:40PM +0000, Joseph Myers wrote:
On Mon, 1 Nov 2021, Marek Polacek via Gcc-patches wrote:

+  /* We've read a bidi char, update the current vector as necessary.  */
+  void on_char (kind k, bool ucn_p)
+  {
+    switch (k)
+      {
+      case kind::LRE:
+      case kind::RLE:
+      case kind::LRO:
+      case kind::RLO:
+       vec.push (ucn_p ? 3u : 1u);
+       break;
+      case kind::LRI:
+      case kind::RLI:
+      case kind::FSI:
+       vec.push (ucn_p ? 2u : 0u);
+       break;
+      case kind::PDF:
+       if (current_ctx () == kind::PDF)
+         pop ();
+       break;
+      case kind::PDI:
+       if (current_ctx () == kind::PDI)
+         pop ();

My understanding is that PDI should pop all intermediate PDF contexts
outward to a PDI context, which it also pops.  (But if it's embedded only
in PDF contexts, with no PDI context containing it, it doesn't pop
anything.)

I think failing to handle that only means libcpp sometimes models there
as being more bidirectional contexts open than there should be, so it
might give spurious warnings when in fact all such contexts had been
closed by end of string or comment.

Ah, you're right.
https://www.unicode.org/reports/tr9/#Terminating_Explicit_Directional_Isolates
says that "[PDI] terminates the scope of the last LRI, RLI, or FSI whose
scope has not yet been terminated, as well as the scopes of any subsequent
LREs, RLEs, LROs, or RLOs whose scopes have not yet been terminated."
but PDF doesn't have the latter quirk.

Fixed in the below: I added a suitable truncate into on_char.  The new test
Wbidirectional-14.c exercises the handling of PDI.

Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?

-- >8 --
 From a link below:
"An issue was discovered in the Bidirectional Algorithm in the Unicode
Specification through 14.0. It permits the visual reordering of
characters via control sequences, which can be used to craft source code
that renders different logic than the logical ordering of tokens
ingested by compilers and interpreters. Adversaries can leverage this to
encode source code for compilers accepting Unicode such that targeted
vulnerabilities are introduced invisibly to human reviewers."

More info:
https://nvd.nist.gov/vuln/detail/CVE-2021-42574
https://trojansource.codes/

This is not a compiler bug.  However, to mitigate the problem, this patch
implements -Wbidirectional=[none|unpaired|any] to warn about possibly
misleading Unicode bidirectional characters the preprocessor may encounter.

Birectional sounds very general.  Can we come up with a name
that's a bit more descriptive of the problem the warning reports?
From skimming the docs and the tests it looks like the warning
points out uses of bidirectonal characters in the program source
code as well as comments.  Would -Wbidirectional-text be better?
Or -Wbidirectional-chars?  (If Clang is also adding a warning
for this, syncing up with them one way or the other might be
helpful.)

...
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index c5730228821..9dfb95dc24c 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -327,7 +327,9 @@ Objective-C and Objective-C++ Dialects}.
  -Warith-conversion @gol
  -Warray-bounds  -Warray-bounds=@var{n}  -Warray-compare @gol
  -Wno-attributes  -Wattribute-alias=@var{n} -Wno-attribute-alias @gol
--Wno-attribute-warning  -Wbool-compare  -Wbool-operation @gol
+-Wno-attribute-warning  @gol
+-Wbidirectional=@r{[}none@r{|}unpaired@r{|}any@r{]} @gol
+-Wbool-compare  -Wbool-operation @gol
  -Wno-builtin-declaration-mismatch @gol
  -Wno-builtin-macro-redefined  -Wc90-c99-compat  -Wc99-c11-compat @gol
  -Wc11-c2x-compat @gol
@@ -7674,6 +7676,21 @@ Attributes considered include @code{alloc_align}, 
@code{alloc_size},
  This is the default.  You can disable these warnings with either
  @option{-Wno-attribute-alias} or @option{-Wattribute-alias=0}.
+@item -Wbidirectional=@r{[}none@r{|}unpaired@r{|}any@r{]}
+@opindex Wbidirectional=
+@opindex Wbidirectional
+@opindex Wno-bidirectional
+Warn about UTF-8 bidirectional characters.

I suggest to mention where.  If everywhere, enumerate the most
common contexts to make it clear it means everywhere:

  Warn about UTF-8 bidirectional characters in source code,
  including string literals, identifiers, and comments.

Martin

Reply via email to