[Bug preprocessor/49973] Column numbers count multibyte characters as multiple columns

dmalcolm at gcc dot gnu.org Mon, 09 Dec 2019 12:27:56 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973


--- Comment #20 from David Malcolm <dmalcolm at gcc dot gnu.org> ---
I've committed r279137 on Lewis's behalf, which fixes the issues identified in
patch #13.

As noted in review of the patch, we didn't attempt to change the behavior of
diagnostic_get_location_text with this change.  Quoting myself from:
  https://gcc.gnu.org/ml/gcc-patches/2019-11/msg02171.html

> This is the column number as reported in the diagnostic, i.e. the COL_NUM
> when printing e.g.
>   warning: FILENAME:LINE_NUM:COL_NUM: some message
> 
> It seems to me that PR 49973 and this patch cover two separate things:
> (a) bytes vs display columns in diagnostic-show-locus.c
> (b) the "COL_NUM" mentioned above.
> 
> I'd prefer to omit (b) from the patch, and have the focus of the patch
> be (a), to tackle (b) in a separate patch.
> 
> [There's also the meaning of column numbers in the JSON output, and in
> the output of -fdiagnostics-parseable-fixits (which is intended to mimic
> clang's output format)]
> 
> It's unclear to me what the reported COL_NUM should be.
> There are various possibilities:
> 
> Units:
>   (A) [status quo] report a count of bytes within the line
>   (B) report a count of Unicode characters
>   (C) report a count of Unicode graphemes
>   (D) report based on the wcwidth of the characters
>   etc
> 
> Origin/baseline:
>   (A) [status quo] use 1 for the leftmost column
>   (B) use 0 for the leftmost column
> 
> Tab-handling:
>   (A) [status quo] don't give any kind of special status to tab characters
>   (B) implement tab stops, somehow.  For example, get_visual_column in
>       c-family/c-indentation implements tab stops based on bytes.
> 
> (so at least 4*2*2 = 16 possible meanings, ugh)
> 
> See also e.g.:
>   https://github.com/oasis-tcs/sarif-spec/issues/178
> 
> The GNU Coding Standards say
> 
>    Line numbers should start from 1 at the beginning of the file, and
>    column numbers should start from 1 at the beginning of the line.
>    (Both of these conventions are chosen for compatibility.) Calculate
>    column numbers assuming that space and all ASCII printing characters
>    have equal width, and assuming tab stops every 8 columns. For
>    non-ASCII characters, Unicode character widths should be used when in
>    a UTF-8 locale; GNU libc and GNU gnulib provide suitable wcwidth
>    functions.
> (https://www.gnu.org/prep/standards/standards.html#Errors)
> 
> I think if we do change the meaning of the "COL_NUM" output, we should
> probably add an option for it, to help with the transition (so that
> people can easily revert to the old behavior).
> 
> Perhaps something like:
> 
>   -fdiagnostics-column-unit=[bytes|gnu]
> 
>      bytes: [status-quo]; 1-based count of bytes, not respecting tab stops
>      gnu: as per GNU Coding Standards above
> 
> and have gcc 10 default to "gnu" (or whatever we call it), so that
> people can override it back to "bytes".
> 
> (again, I'm thinking aloud here)
> 
> But please can you split that out as a separate patch? (it's arguably
> still in time for GCC 10, as it's from a patch that was posted before the
> stage 1 deadline).
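
To make the choice of units in the quoted message concrete, here is a rough
Python sketch (not GCC code) that computes COL_NUM for the same byte offset
under three of the interpretations: (A) bytes, (B) Unicode characters, and
something like the GNU Coding Standards rule (display width plus tab stops
every 8 columns). The names column_of and char_width are hypothetical, and
unicodedata.east_asian_width is only an approximation of what a real wcwidth
would return. Graphemes, option (C), are not covered.

```python
import unicodedata


def char_width(ch):
    """Approximate display width of one code point (a stand-in for wcwidth)."""
    if unicodedata.combining(ch):
        return 0  # combining marks occupy no column of their own
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters occupy two columns
    return 1


def column_of(line, byte_offset, unit="bytes", tab_width=8):
    """Return the 1-based column of the character starting at byte_offset.

    line is the UTF-8 encoded source line; byte_offset is 0-based and must
    fall on a character boundary.
    """
    if unit == "bytes":
        # Status quo: 1-based byte count, tabs not treated specially.
        return byte_offset + 1
    prefix = line[:byte_offset].decode("utf-8")
    if unit == "chars":
        # One column per Unicode code point.
        return len(prefix) + 1
    if unit == "display":
        # Display width, advancing to the next tab stop on each tab.
        col = 0
        for ch in prefix:
            if ch == "\t":
                col = (col // tab_width + 1) * tab_width
            else:
                col += char_width(ch)
        return col + 1
    raise ValueError("unknown unit: %r" % unit)


line = '"日本" + y'.encode("utf-8")
plus = line.index(b"+")  # byte offset of the '+' token
for unit in ("bytes", "chars", "display"):
    print(unit, column_of(line, plus, unit))
```

For the '+' in that line the three units give 10, 6 and 8 respectively, which
is the kind of disagreement that a transition option would have to cover.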
