[PATCH] Ensure colorization doesn't corrupt multibyte sequences in diagnostics

Lewis Hyatt Thu, 12 Dec 2019 15:21:52 -0800

Hello-

In the original discussion of implementing UTF-8 identifiers
( https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c26 ), I pointed out
that colorization would corrupt the appearance of certain diagnostics. For
example, this code, with -std=c99:


----------
int ٩x;
----------

Produces:

t2.cpp:1:5: error: extended character ٩ is not valid at the start of an 
identifier
    1 | int ٩x;
      |     ^

The diagnostic location contains only the first byte of the character, so
when colorization is enabled, the ANSI escapes are inserted in the middle
of the UTF-8 sequence and produce corrupted output on the terminal.

I feel like there are two separate issues here:

#1. diagnostic_show_locus() should be sure it will not corrupt output in
this way, regardless of what ranges it is given to work with.

#2. libcpp should probably generate a range that includes the whole UTF-8
character. Actually in other ways the range seems not ideal, for example
if an invalid character appears in the middle of the identifier, the
diagnostic still points to the first byte of the identifier.

The attached patch fixes #1. It's essentially a one-line change, plus a
new selftest. Would you please have a look at it sometime? bootstrap
and testsuite were done on linux x86-64.

Other questions that I have:

- I am not quite clear when a selftest is preferred vs a dejagnu test. In
  this case I stuck with the selftest because color diagnostics don't seem
  to work well with dg-error etc, and it didn't seem worth creating a new
  plugin-based test like g++.dg/plugin just for this. (I also considered
  using the existing g++.dg plugin, but it seems this test should run for
  gcc as well.)

- I wasn't sure if I should create a PR for an issue such as this, if
  there is already a patch readily available. And if I did create a PR,
  not sure if it's preferred to post the patch to gcc-patches, or as an
  attachment to the PR.

- Does it seem worth me looking into #2? I think the patch to address #1 is
  appropriate in any case, because it handles generically all potential
  cases where this may arise, but still perhaps the ranges coming out of
  libcpp could be improved?

Thanks...

-Lewis

gcc/ChangeLog:

2019-12-12  Lewis Hyatt  <lhy...@gmail.com>

        * diagnostic-show-locus.c (layout::print_source_line): Do not emit
        color codes in the middle of a UTF-8 sequence.
        (test_one_liner_colorized_utf8): New test.
        (test_diagnostic_show_locus_one_liner_utf8): Call the new test.

diff --git a/gcc/diagnostic-show-locus.c b/gcc/diagnostic-show-locus.c
index c87603caf41..7385b8a1692 100644
--- a/gcc/diagnostic-show-locus.c
+++ b/gcc/diagnostic-show-locus.c
@@ -1482,6 +1482,8 @@ layout::print_source_line (linenum_type row, const char *line, int line_bytes,
   int last_non_ws = 0;
   for (int col_byte = 1 + x_offset_bytes; col_byte <= line_bytes; col_byte++)
     {
+      char c = *line;
+
       /* Assuming colorization is enabled for the caret and underline
 	 characters, we may also colorize the associated characters
 	 within the source line.
@@ -1493,8 +1495,13 @@ layout::print_source_line (linenum_type row, const char *line, int line_bytes,
 
 	 For frontends that only generate carets, we don't colorize the
 	 characters above them, since this would look strange (e.g.
-	 colorizing just the first character in a token).  */
-      if (m_colorize_source_p)
+	 colorizing just the first character in a token).
+
+	 We need to avoid inserting color codes ahead of UTF-8 continuation
+	 bytes to avoid corrupting the output, in case the location range
+	 points only to the first byte of a multibyte sequence.  */
+
+      if (m_colorize_source_p && (((unsigned int) c) & 0xC0) != 0x80)
 	{
 	  bool in_range_p;
 	  point_state state;
@@ -1507,7 +1514,6 @@ layout::print_source_line (linenum_type row, const char *line, int line_bytes,
 	  else
 	    m_colorizer.set_normal_text ();
 	}
-      char c = *line;
       if (c == '\0' || c == '\t' || c == '\r')
 	c = ' ';
       if (c != ' ')
@@ -3836,6 +3842,27 @@ test_one_liner_labels_utf8 ()
   }
 }
 
+/* Make sure that colorization codes don't interrupt a multibyte
+   sequence, which would corrupt it.  */
+static void
+test_one_liner_colorized_utf8 ()
+{
+  test_diagnostic_context dc;
+  dc.colorize_source_p = true;
+  diagnostic_color_init (&dc, DIAGNOSTICS_COLOR_YES);
+  const location_t pi = linemap_position_for_column (line_table, 12);
+  rich_location richloc (line_table, pi);
+  diagnostic_show_locus (&dc, &richloc, DK_ERROR);
+
+  /* In order to avoid having the test depend on exactly how the colorization
+     was effected, just confirm there are two pi characters in the output.  */
+  const char *result = pp_formatted_text (dc.printer);
+  const char *null_term = result + strlen (result);
+  const char *first_pi = strstr (result, "\xcf\x80");
+  ASSERT_TRUE (first_pi && first_pi <= null_term - 2);
+  ASSERT_STR_CONTAINS (first_pi + 2, "\xcf\x80");
+}
+
 /* Run the various one-liner tests.  */
 
 static void
@@ -3882,6 +3909,7 @@ test_diagnostic_show_locus_one_liner_utf8 (const line_table_case &case_)
   test_one_liner_many_fixits_1_utf8 ();
   test_one_liner_many_fixits_2_utf8 ();
   test_one_liner_labels_utf8 ();
+  test_one_liner_colorized_utf8 ();
 }
 
 /* Verify that gcc_rich_location::add_location_if_nearby works.  */

[PATCH] Ensure colorization doesn't corrupt multibyte sequences in diagnostics

Reply via email to