On Fri, 2020-12-18 at 18:03 -0500, Lewis Hyatt wrote:
> Hello-
> 
> The attached patch addresses PR93067:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067#c0
> 
> This is similar to the patch I posted last year on the PR, with some tweaks
> to make it a little simpler. Recapping some of the commentary on the PR:
> 
> When source lines are needed for diagnostics output, they are retrieved from
> the source file by the fcache infrastructure in input.c, since libcpp has
> generally already forgotten them (plus not all front ends are using
> libcpp). This infrastructure does not read the files in the same way as
> libcpp does; in particular, it does not translate the encoding as requested
> by -finput-charset, and it does not strip a UTF-8 byte-order mark if
> present. The patch adds this ability. My thinking in deciding how to do it
> was the following:
> 
> - Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer still,
>   so this patch should try hard not to introduce any worse performance
>   unless these things are needed.
> 
> - It is desirable to reuse libcpp's encoding infrastructure from charset.c
>   rather than repeat it in input.c. (Notably, libcpp uses iconv but it also
>   has hand-coded routines for certain charsets to make sure they are
>   available.)
> 
> - There is a performance degradation required in order to make use of libcpp
>   directly, because the input.c infrastructure only reads as much of the
>   source file as necessary, whereas libcpp interfaces as-is require reading
>   the entire file into memory.
> 
> - It can't be quite as simple as just "only delegate to libcpp if
>   -finput-charset was specified", because the stripping of the UTF-8 BOM has
>   to happen with or without this option.
> 
> - So it seemed a reasonable compromise to me: if -finput-charset is
>   specified, then use libcpp to convert the file; otherwise, strip the BOM
>   in input.c and then process the file the same way it is done now. There's
>   a little bit of leakage of charset logic from libcpp this way (for the
>   BOM), but it seems worthwhile, since otherwise, diagnostics would always
>   be reading the entire file into memory, which is not a cost paid
>   currently.

Thanks for the patch; sorry about the delay in reviewing it.

This mostly seems good to me.

One aspect I'm not quite convinced about is the
input_cpp_context.in_use flag.  The input.c machinery is used by
diagnostics, and so could be used by middle-end warnings for frontends
that don't use libcpp.  Presumably we'd still want to remove the UTF-8
BOM for those, and do encoding fixups if necessary - is it more a case
of initializing things to express what the expected input charset is?
(since that is part of the cpp_options)

c.opt has:
  finput-charset=
  C ObjC C++ ObjC++ Joined RejectNegative
  -finput-charset=<cset>        Specify the default character set for source files.

I believe that D and Go are the two frontends that don't use libcpp for
parsing.  I believe Go source is required to be UTF-8 (unsurprisingly
considering the heritage of both).  I don't know what source encodings
D supports.

> Separate from the main patch are two testcases that both fail before this
> patch and pass after. I attached them gzipped because they use non-standard
> encodings that won't email well.
> 
> The main question I have about the patch is whether I chose a good way to
> address the following complication. location_get_source_line() in input.c is
> used to generate diagnostics for all front ends, whether they use libcpp to
> read the files or not. So input.c needs some way to know whether libcpp is
> in use or not. I ended up adding a new function input_initialize_cpp_context(),
> which front ends have to call if they are using libcpp to read their
> files. Currently that is c-family and fortran. I don't see a simpler way
> to do it at least... Even if there were a better way for input.c to find out
> the value passed to -finput-charset, it would still need to know whether
> libcpp is being used or not.

At some point I want to convert the global state in input.c into a
source_manager class, probably hung off the diagnostic_context, so the
natural place to put that state would be in there (rather than the
fcache being global state).  Perhaps have a source_reader policy class
hung off of the source_manager, which would have responsibility for
reading and converting the file, either piecemeal, or fully for the
"need to use libcpp" case.

But that's not necessary for this fix.

> Please let me know if it looks OK (either now or for stage 1,
whatever makes
> sense...) bootstrap + regtest all languages on x86-64 GNU/Linux, all
tests the
> same before & after plus 6 new PASS from the new tests. Thanks!

Various comments inline below.  In particular, does the patch add a
memory leak in _cpp_convert_input?

Thanks
Dave

[...]

> libcpp/ChangeLog:
> 
>         PR other/93067
>         * charset.c (init_iconv_desc): Adapt to permit PFILE argument to
>         be NULL.
>         (_cpp_convert_input): Likewise.  Also move UTF-8 BOM logic to...
>         (cpp_check_utf8_bom): ...here.  New function.
>         (cpp_input_conversion_is_trivial): New function.
>         * files.c (read_file_guts): Allow PFILE argument to be NULL.  Add
>         INPUT_CHARSET argument as an alternate source of this information.
>         (cpp_get_converted_source): New function.
>         * include/cpplib.h (struct cpp_converted_source): Declare.
>         (cpp_get_converted_source): Declare.
>         (cpp_input_conversion_is_trivial): Declare.
>         (cpp_check_utf8_bom): Declare.
> 
> gcc/testsuite/ChangeLog:
> 
>         PR other/93067
>         * gcc.dg/diagnostic-input-charset-1.c: New test.
>         * gcc.dg/diagnostic-input-charset-2.c: New test.

Maybe rename the 2nd test to "diagnostic-input-utf8-bom.c" ?

> 
> diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
> index 59cabd12407..d5aa7859cc1 100644
> --- a/gcc/c-family/c-opts.c
> +++ b/gcc/c-family/c-opts.c
> @@ -1124,6 +1124,10 @@ c_common_post_options (const char **pfilename)
>    cpp_post_options (parse_in);
>    init_global_opts_from_cpp (&global_options, cpp_get_options (parse_in));
>  
> +  /* Let diagnostics infrastructure know we are using libcpp to read
> +     the input.  */
> +  input_initialize_cpp_context (parse_in);

As noted above I think the most significant thing here is getting the
source encoding from parse_in, so maybe reword the comment accordingly.

[...]

> diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
> index 51baf141711..2b12a98afc0 100644
> --- a/gcc/fortran/cpp.c
> +++ b/gcc/fortran/cpp.c
> @@ -493,6 +493,10 @@ gfc_cpp_post_options (void)
>  
>    cpp_post_options (cpp_in);
>  
> +  /* Let diagnostics infrastructure know we are using libcpp to read
> +     the input.  */
> +  input_initialize_cpp_context (cpp_in);

Likewise.

[...]

> diff --git a/gcc/input.c b/gcc/input.c
> index 29d10f06b86..1dcdd464bc1 100644
> --- a/gcc/input.c
> +++ b/gcc/input.c

[...]

> @@ -394,6 +432,42 @@ add_file_to_cache_tab (const char *file_path)
>    r->total_lines = total_lines_num (file_path);
>    r->missing_trailing_newline = true;
>  
> +  /* If libcpp is managing the reading, then there are two cases we need to
> +     consider.  If -finput-charset is not in effect, then we just need to
> +     strip a UTF-8 BOM, so do that ourselves rather than calling into libcpp so
> +     as to avoid paying the penalty of using libcpp, namely that the entire file
> +     must be read at once.  In the (generally rare) case that a non-trivial
> +     -finput-charset is needed, then go ahead and use libcpp to read the whole
> +     file and do the conversion.  */
> +  if (input_cpp_context.in_use)
> +    {
> +      if (input_cpp_context.conversion_is_trivial)
> +       {
> +         /* Strip the UTF8 BOM if present.  */
> +         if (read_data (r))
> +           {
> +             const int offset = cpp_check_utf8_bom (r->data, r->nb_read);
> +             r->offset_buffer (offset);
> +             r->nb_read -= offset;
> +           }

This assumes that trivial conversions are always UTF8 to UTF8, which
assumes that SOURCE_CHARSET is UTF8, which isn't the case for EBCDIC...

[...]

> +/* Utility to strip a UTF-8 byte order marking from the beginning
> +   of a buffer.  Returns the number of bytes to skip, which currently
> +   will be either 0 or 3.  */
> +int
> +cpp_check_utf8_bom (const char *data, size_t data_length)
> +{
> +
> +#if HOST_CHARSET == HOST_CHARSET_ASCII
> +  const unsigned char *udata = (const unsigned char *) data;
> +  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
> +      && udata[2] == 0xbf)
> +    return 3;
> +#endif
> +
> +  return 0;
> +}

...though the check on HOST_CHARSET == HOST_CHARSET_ASCII makes this a
no-op for the EBCDIC case, so that's OK.

[...]

> @@ -2158,17 +2191,28 @@ _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
>        to.text = XNEWVEC (uchar, to.asize);
>        to.len = 0;
>  
> -      if (!APPLY_CONVERSION (input_cset, input, len, &to))
> -       cpp_error (pfile, CPP_DL_ERROR,
> -                  "failure to convert %s to %s",
> -                  CPP_OPTION (pfile, input_charset), SOURCE_CHARSET);
> +      const bool ok = APPLY_CONVERSION (input_cset, input, len, &to);
>  
> -      free (input);
> -    }
> +      /* Handle conversion failure.  */
> +      if (!ok)
> +       {
> +         free (input);
> +         if (!pfile)
> +           {
> +             XDELETEVEC (to.text);
> +             *buffer_start = NULL;
> +             *st_size = 0;
> +             return NULL;
> +           }
> +         cpp_error (pfile, CPP_DL_ERROR,
> +                    "failure to convert %s to %s",
> +                    CPP_OPTION (pfile, input_charset), SOURCE_CHARSET);
> +       }
> +    }
>

Doesn't this lose the
  free (input);
for the case where the conversion is successful?  Is this a leak?

[...]

> diff --git a/libcpp/files.c b/libcpp/files.c
> index 301b2379a23..178bb9ed1e6 100644
> --- a/libcpp/files.c
> +++ b/libcpp/files.c
> @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile, _cpp_file *file,
>  static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
>                               bool *invalid_pch, location_t loc);
>  static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
> -                           location_t loc);
> +                           location_t loc, const char *input_charset = NULL);

read_file_guts is only used in one place before the patch, so rather
than add a default argument, I think it's simpler to pass in the
input_charset at that call.

> @@ -671,18 +671,32 @@ _cpp_find_file (cpp_reader *pfile, const char *fname, cpp_dir *start_dir,
>  
>     Use LOC for any diagnostics.
>  
> +   The input charset may be specified in the INPUT_CHARSET argument, or
> +   else it will be taken from PFILE.

...and doing so means that input_charset will be non-NULL at all
callers.

> +   PFILE may be NULL.  In this case, no diagnostics are issued, and the
> +   input charset must be specified in the arguments.
> +
>     FIXME: Flush file cache and try again if we run out of memory.  */
>  static bool
> -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
> +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
> +               const char *input_charset)
>  {
>    ssize_t size, total, count;
>    uchar *buf;
>    bool regular;
>  
> +  if (!input_charset)
> +    {
> +      gcc_assert (pfile);
> +      input_charset = CPP_OPTION (pfile, input_charset);
> +    }

...which would allow for this logic to be removed, moving this to the
existing caller, I believe.

[...]

Hope this is constructive
Dave

