Re: [PATCH][RFC] diagnostics: Add support for Unicode drawing characters

David Malcolm via Gcc-patches Thu, 23 Jul 2020 14:48:27 -0700

On Thu, 2020-07-23 at 12:28 -0400, Lewis Hyatt via Gcc-patches wrote:
> Hello-
> 
> The attached patch is complete including docs, but I tagged as RFC
> because I am not sure if anyone will like it, or if the general
> reaction may
> be closer to recoiling in horror :). Would appreciate your thoughts,
> please...


Thanks for working on this.  I'm interested in other people's thoughts
on this.  Various comments inline throughout below.

Currently, if a UTF-8 locale is detected, GCC changes the quote
> characters
> it outputs in diagnostics to Unicode directional quotes. I feel like
> this is
> a nice touch, so I was wondering whether GCC shouldn't do more along
> these
> lines. This patch adds support for using Unicode line drawing
> characters and
> similar things when outputting diagnostics. There is a new option
> -fdiagnostics-unicode-drawing=[auto|never|always] to control it,
> which
> defaults to auto. "auto" will enable the feature under the same
> circumstances that Unicode quotes get output, namely when the locale
> is
> determined by gcc_init_libintl() to support UTF-8. (The new option
> does not
> affect Unicode quote characters, which currently are not configurable
> and
> are determined solely by the locale.)

FWIW when I first started experimenting with location ranges back in
2015 my first patches had box-drawing characters for underlines; you
can see this in some of the early examples here (and similar URLs from
around then):

https://dmalcolm.fedorapeople.org/gcc/2015-08-18/plugin.html
  (this also has a different approach for labeling ranges, which I
called "captions", putting them in a right margin)

https://dmalcolm.fedorapeople.org/gcc/2015-08-19/diagnostic-test-string-literals-1.html

https://dmalcolm.fedorapeople.org/gcc/2015-08-26/tree-expression-ranges.html

etc; the patch kits were:

https://gcc.gnu.org/legacy-ml/gcc-patches/2015-03/msg00837.html
https://gcc.gnu.org/pipermail/gcc-patches/2015-September/428036.html
https://gcc.gnu.org/legacy-ml/gcc-patches/2015-09/msg01696.html

In:
  https://gcc.gnu.org/legacy-ml/gcc-patches/2015-09/msg01700.html
I wrote:
> * Eliminated UTF-8/box-drawing and captions.  Captions were cute but
>   weren't "fully baked".  Without them, box-drawing isn't really
>   needed, and I think I prefer the ASCII look, with the actual
>   "caret" character, and '~' makes it easier to count characters
>   compared to a box-drawing line, in my terminal's font, at least.
>   Doing so greatly simplifies the new locus-printing code.

So I dropped the UTF-8 box drawing from that original kit for:
(a) simplicity (the original patch kit was huge in scope, covering a
bunch of ideas for diagnostics - ranges, labeling, fix-it hints,
spelling suggestions, so I wanted to reduce the scope to something
manageable)
(b) I found it easier to count characters with "~"


The thing I'm most nervous about with this patch is the potential for
introducing mojibake when people copy and paste GCC output.

For example, looking at:
https://gcc.gnu.org/legacy-ml/gcc-patches/2015-03/msg00837.html
I see mojibake where the unicode line-drawing characters in my email
are being displayed in the HTML mailing list archive via "&#xE2;" -
something has gone wrong with encoding somewhere between the copy&paste
from my terminal, the email, and the list archive.

That said, looking at your email in the archive here:
https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550551.html
I don't see any mojibake.

What happens if GCC's stderr is piped into "less"?
What happens if GCC's stderr is saved in a build.log file, uploaded
somewhere, and then viewed?
etc.


> The elements implemented are:
> 
>     * Vertical lines, e.g. those indicating labels and those
> separating the
>       source lines from the line numbers, are changed to line drawing
>       characters.
> 
>     * The diagnostic paths output by the static analyzer make use of
> line
>       drawing characters to output smooth corners etc.
> 
>     * The squiggly underline ~~~~~ used to highlight source locations
> is
>       changed to a double underline ═════. The main reason for this
> is that
>       it enables a seamless "tee" character to connect the underline
> to a
>       label line if one exists.
> 
>     * Carets (^) are changed to a slightly different character (∧). I
> think
>       the new one is a little nicer looking, although probably not
> worth the
>       trouble on its own. I wanted to implement the support in this
> patch
>       beause carets are harder to change than the rest of the
> elements
>       (front ends have an interface to override them, which currently
>       Fortran makes use of), so I thought it worthwhile to get this
> logic in
>       place, so that it can easily be changed to a more superior
> character
>       in the future if one comes up. It would also be easy enough to
> leave
>       the Unicode support in place for carets, but keep the default
> set to
>       the plain one for now.

Some other ideas:

* fix-it hints

* maybe have a different character for separating the line numbers as
opposed to those for labels and for showing interprocedural paths.

> As an example, this diagnostic from gcc.dg/format/diagnostic-
> ranges.c:
> 
> diagnostic-ranges.c:196:28: warning: field width specifier ‘*’
> expects argument of type ‘int’, but argument 3 has type ‘long int’ [-
> Wformat=]
>   196 |   __builtin_sprintf (d, " %*ld ", foo + bar, foo);
>       |                           ~^~~    ~~~~~~~~~
>       |                            |          |
>       |                            int        long int
> 
> would become instead:
> 
> diagnostic-ranges.c:196:28: warning: field width specifier ‘*’
> expects argument of type ‘int’, but argument 3 has type ‘long int’ [-
> Wformat=]
>   196 │   __builtin_sprintf (d, " %*ld ", foo + bar, foo);
>       │                           ═∧══    ════╤════
>       │                            │          │
>       │                            int        long int
> 
> Hopefully you are viewing this in a terminal that displays it
> properly :), in
> which case, hopefully you may find it to be an improvement?

I wonder if you can upload colorized examples somewhere?

e.g. using bin/gcc-color-to-html.py from our website repository:
 https://gcc.gnu.org/git/?p=gcc-wwwdocs.git;a=blob;f=bin/gcc-color-to-html.py
or one of the various ansi2html conversion scripts e.g.
  http://www.pixelbeat.org/scripts/ansi2html.sh

> Here is a more involved example from the analyzer:
> 
> setjmp-5.c: In function ‘outer’:
> setjmp-5.c:21:3: warning: ‘longjmp’ called after enclosing function
> of ‘setjmp’ has returned [-Wanalyzer-stale-setjmp-buffer]
>    21 |   longjmp (env, 42); /* { dg-warning "'longjmp' called after
> enclosing function of 'setjmp' has returned" } */
>       |   ^~~~~~~~~~~~~~~~~
>   ‘outer’: events 1-2
>     |
>     |   15 | void outer (void)
>     |      |      ^~~~~
>     |      |      |
>     |      |      (1) entry to ‘outer’
>     |......
>     |   19 |   inner ();
>     |      |   ~~~~~~~~
>     |      |   |
>     |      |   (2) calling ‘inner’ from ‘outer’
>     |
>     +--> ‘inner’: event 3
>            |
>            |   10 | static void inner (void)
>            |      |             ^~~~~
>            |      |             |
>            |      |             (3) entry to ‘inner’
>            |
>          ‘inner’: event 4
>            |
>            |   12 |   SETJMP (env);
>            |      |   ^~~~~~
>            |      |   |
>            |      |   (4) ‘setjmp’ called here
>            |
>     <------+
>     |
>   ‘outer’: events 5-6
>     |
>     |   19 |   inner ();
>     |      |   ^~~~~~~~
>     |      |   |
>     |      |   (5) returning to ‘outer’ from ‘inner’
>     |   20 |
>     |   21 |   longjmp (env, 42); /* { dg-warning "'longjmp' called
> after enclosing function of 'setjmp' has returned" } */
>     |      |   ~~~~~~~~~~~~~~~~~
>     |      |   |
>     |      |   (6) here
>     |
> 
> would become instead:
> 
> setjmp-5.c: In function ‘outer’:
> setjmp-5.c:21:3: warning: ‘longjmp’ called after enclosing function
> of ‘setjmp’ has returned [-Wanalyzer-stale-setjmp-buffer]
>    21 │   longjmp (env, 42); /* { dg-warning "'longjmp' called after
> enclosing function of 'setjmp' has returned" } */
>       │   ∧════════════════
>   ‘outer’: events 1-2
>     │
>     │   15 │ void outer (void)
>     │      │      ∧════
>     │      │      │
>     │      │      (1) entry to ‘outer’
>     │......
>     │   19 │   inner ();

I wonder if there's a fancier way to express the gap in the lines if
Unicode is available?


>     │      │   ╤═══════
>     │      │   │
>     │      │   (2) calling ‘inner’ from ‘outer’
>     │
>     └──> ‘inner’: event 3
>            │
>            │   10 │ static void inner (void)
>            │      │             ∧════
>            │      │             │
>            │      │             (3) entry to ‘inner’
>            │
>          ‘inner’: event 4
>            │
>            │   12 │   SETJMP (env);
>            │      │   ∧═════
>            │      │   │
>            │      │   (4) ‘setjmp’ called here
>            │

Unrelated to this patch , but it would be nice if the analyzer inserted
an event at the function end showing the frame in "env" becoming
invalid, since that's what pertinent to the diagnostic.

>     ┌<─────┘
>     │
>   ‘outer’: events 5-6
>     │
>     │   19 │   inner ();
>     │      │   ∧═══════
>     │      │   │
>     │      │   (5) returning to ‘outer’ from ‘inner’
>     │   20 │
>     │   21 │   longjmp (env, 42); /* { dg-warning "'longjmp' called
> after enclosing function of 'setjmp' has returned" } */
>     │      │   ╤════════════════
>     │      │   │
>     │      │   (6) here
>     │

FWIW I experimented with using unicode circled number characters in
place of (1), (2), etc for events in diagnostic_paths but the results
looked bad in my terminal, so I stuck to the ASCII form above.

In my more adventurous moments I've been tempted to use background
colorization to show the stack pushes and pops in a flamegraph-style
way, but I suspect it would garish and be too "busy" visually.

> 
> Although probably premature, bootstrap and regtest were done on x86-
> 64
> linux, all tests the same before/after and new tests passing:
> FAIL 96 96
> PASS 479090 479239
> UNSUPPORTED 11946 11946
> UNTESTED 194 194
> XFAIL 1839 1839
> XPASS 36 36

I see the patch kit touches Fortran; was this with all frontends
enabled?  (though I guess I'm likewise being premature here)

> I tried to set this up as a general framework, at least, it is easy
> in one
> place to change the characters that are used for various contexts, so
> that
> if people like the general idea, but not some of the specifics, the
> patch is
> easily modified for that now or in the future. Thanks for any
> feedback!

Thanks again for the patch; let's see what others think.
Dave

Re: [PATCH][RFC] diagnostics: Add support for Unicode drawing characters

Reply via email to