On 15/11/2021 18:28, Marek Polacek via Gcc-patches wrote:
On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote:
Ping, can we conclude on the name? IMHO, -Wbidirectional is just fine,
but changing the name is a trivial operation.
Here's a patch with a better name (suggested by Jonathan W.). Otherwise no
changes.
Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
-- >8 --
From a link below:
"An issue was discovered in the Bidirectional Algorithm in the Unicode
Specification through 14.0. It permits the visual reordering of
characters via control sequences, which can be used to craft source code
that renders different logic than the logical ordering of tokens
ingested by compilers and interpreters. Adversaries can leverage this to
encode source code for compilers accepting Unicode such that targeted
vulnerabilities are introduced invisibly to human reviewers."
More info:
https://nvd.nist.gov/vuln/detail/CVE-2021-42574
https://trojansource.codes/
This is not a compiler bug. However, to mitigate the problem, this patch
implements -Wbidi-chars=[none|unpaired|any] to warn about possibly
misleading Unicode bidirectional characters the preprocessor may encounter.
The default is =unpaired, which warns about improperly terminated
bidirectional characters; e.g. a LRE without its appertaining PDF. The
level =any warns about any use of bidirectional characters.
This patch handles both UCNs and UTF-8 characters. UCNs designating
bidi characters in identifiers are accepted since r204886. Then r217144
enabled -fextended-identifiers by default. Extended characters in C/C++
identifiers have been accepted since r275979. However, this patch still
warns about mixing UTF-8 and UCN bidi characters; there seems to be no
good reason to allow mixing them.
I wonder what the rationale is to warn about UCNs, like in
aText = u"\u202D" + aText;
(as found in the LibreOffice source code).
We warn in different contexts: comments (both C and C++-style), string
literals, character constants, and identifiers. Expectedly, UCNs are ignored
in comments and raw string literals. The bidirectional characters can nest
so this patch handles that as well.
I have not included nor tested this at all with Fortran (which also has
string literals and line comments).
Dave M. posted patches improving diagnostic involving Unicode characters.
This patch does not make use of this new infrastructure yet.
PR preprocessor/103026
[...]