Re: [PATCH] diagnostics: Support for -finput-charset [PR93067]

Lewis Hyatt via Gcc-patches Fri, 30 Jul 2021 13:17:39 -0700

On Fri, Jan 29, 2021 at 10:46:30AM -0500, Lewis Hyatt wrote:
> On Tue, Jan 26, 2021 at 04:02:52PM -0500, David Malcolm wrote:
> > On Fri, 2020-12-18 at 18:03 -0500, Lewis Hyatt wrote:
> > > Hello-
> > > 
> > > The attached patch addresses PR93067:
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067#c0
> > > 
> > > This is similar to the patch I posted last year on the PR, with some
> > tweaks
> > > to make it a little simpler. Recapping some of the commentary on the
> > PR:
> > > 
> > > When source lines are needed for diagnostics output, they are
> > retrieved from
> > > the source file by the fcache infrastructure in input.c, since libcpp
> > has
> > > generally already forgotten them (plus not all front ends are using
> > > libcpp). This infrastructure does not read the files in the same way
> > as
> > > libcpp does; in particular, it does not translate the encoding as
> > requested
> > > by -finput-charset, and it does not strip a UTF-8 byte-order mark if
> > > present. The patch adds this ability. My thinking in deciding how to
> > do it
> > > was the following:
> > > 
> > > - Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer
> > still,
> > >   so this patch should try hard not to introduce any worse
> > performance
> > >   unless these things are needed.
> > > 
> > > - It is desirable to reuse libcpp's encoding infrastructure from
> > charset.c
> > >   rather than repeat it in input.c. (Notably, libcpp uses iconv but
> > it also
> > >   has hand-coded routines for certain charsets to make sure they are
> > >   available.)
> > > 
> > > - There is a performance degradation required in order to make use of
> > libcpp
> > >   directly, because the input.c infrastructure only reads as much of
> > the
> > >   source file as necessary, whereas libcpp interfaces as-is require
> > to read
> > >   the entire file into memory.
> > > 
> > > - It can't be quite as simple as just "only delegate to libcpp if
> > >   -finput-charset was specified", because the stripping of the UTF-8
> > BOM has
> > >   to happen with or without this option.
> > > 
> > > - So it seemed a reasonable compromise to me, if -finput-charset is
> > >   specified, then use libcpp to convert the file, otherwise, strip
> > the BOM
> > >   in input.c and then process the file the same way it is done now.
> > There's
> > >   a little bit of leakage of charset logic from libcpp this way (for
> > the
> > >   BOM), but it seems worthwhile, since otherwise, diagnostics would
> > always
> > >   be reading the entire file into memory, which is not a cost paid
> > >   currently.
> > 
> > Thanks for the patch; sorry about the delay in reviewing it.
> >
> 
> Thanks for the comments! Here is an updated patch that addresses your
> feedback, plus some responses inline below.
> 
> Bootstrap + regtest all languages was done on x86-64 GNU/Linux. All tests
> the same before and after, plus 6 new PASS.
> 
> FAIL 85 85
> PASS 479191 479197
> UNSUPPORTED 13664 13664
> UNTESTED 129 129
> XFAIL 2292 2292
> XPASS 30 30
> 
> 
> > This mostly seems good to me.
> > 
> > One aspect I'm not quite convinced about is the
> > input_cpp_context.in_use flag.  The input.c machinery is used by
> > diagnostics, and so could be used by middle-end warnings for frontends
> > that don't use libcpp.  Presumably we'd still want to remove the UTF-8
> > BOM for those, and do encoding fixups if necessary - is it more a case
> > of initializing things to express what the expected input charset is?
> > (since that is part of the cpp_options)
> > 
> > c.opt has:
> >   finput-charset=
> >   C ObjC C++ ObjC++ Joined RejectNegative
> >   -finput-charset=<cset>    Specify the default character set for
> > source files.
> > 
> > I believe that D and Go are the two frontends that don't use libcpp for
> > parsing.  I believe Go source is required to be UTF-8 (unsurprisingly
> > considering the heritage of both).  I don't know what source encodings
> > D supports.
> >
> 
> For this patch I was rather singularly focused on libcpp, so I looked
> deeper at the other frontends now. It seems to me that there are basically
> two questions to answer, and the three frontend styles answer this pair in
> three different ways.
> 
> Q1: What is the input charset?
> A1:
> 
>     libcpp: Whatever was passed to -finput-charset (note, for Fortran,
>     -finput-charset is not supported though.)
> 
>     go: Assume UTF-8.
> 
>     D: UTF-8, UTF-16, or UTF-32 (the latter two in either
>        endianness); determined by inspecting the first bytes of the file.
> 
> Q2: How should a UTF-8 BOM be handled?
> A2:
> 
>     libcpp: Treat entirely the same, as if it was not present at all. So
>     a diagnostic referring to the first non-BOM character in the file will
>     point to column 1, not to column 4.
> 
>     go: Treat it like whitespace, ignored for parsing purposes, but still
>     logically part of the file. A diagnostic referring to the first non-BOM
>     character in the file will point to column 4.
> 
>     D: Same as libcpp.
> 
> So input.c's current behavior (with or without my patch) actually matches
> the "go" frontend exactly and does the right thing for it. As you
> thought, though, for D it would be desirable to skip over the UTF-8 BOM
> too.
> 
> D adds an additional wrinkle in that, instead of using -finput-charset, it
> uses its own detection logic -- based on knowledge that the first codepoint
> in the file must be ASCII, it is able to deduce the encoding from the first
> few bytes. This means that in D, each source file can have a different
> encoding, which is not expressible in libcpp frontends currently, and hence
> the notion of a global input_cpp_context with a constant charset is not
> sufficient to capture this.
> 
> In this iteration of the patch, I replaced the input_cpp_context with a more
> general input_context that can handle all of these cases. The context now
> consists of a callback function and a bool, which the frontend is supposed
> to configure. The bool determines whether or not to skip the BOM. The
> callback, when passed a filename, returns the input charset needed to
> convert that file. For libcpp, the filename is not used as the charset is
> determined from -finput-charset for all files. For D, the callback function
> is currently a stub, but it could be expanded to open the file, read the
> first few bytes, and determine the charset to be used. I have not
> implemented it for now, because it seems inelegant to duplicate the logic D
> already has for this detection, but this logic is part of the dmd/ tree
> which I think is maintained external to GCC, and so it didn't seem right to
> attempt to refactor it. I figured there may not be much interest in this
> feature (diagnostics are already unusable for UTF-16 and UTF-32 sources in
> D), and this patch makes no change to it. This patch does fix the handling
> of a UTF-8 BOM in D diagnostics, however.
> 
> > > Separate from the main patch are two testcases that both fail before
> > this
> > > patch and pass after. I attached them gzipped because they use non-
> > standard
> > > encodings that won't email well.
> > > 
> > > The main question I have about the patch is whether I chose a good
> > way to
> > > address the following complication. location_get_source_line() in
> > input.c is
> > > used to generate diagnostics for all front ends, whether they use
> > libcpp to
> > > read the files or not. So input.c needs some way to know whether
> > libcpp is
> > > in use or not. I ended up adding a new function
> > input_initialize_cpp_context(),
> > > which front ends have to call if they are using libcpp to read their
> > > files. Currently that is c-family and fortran. I don't see a simpler
> > way
> > > to do it at least... Even if there were a better way for input.c to
> > find out
> > > the value passed to -finput-charset, it would still need to know
> > whether
> > > libcpp is being used or not.
> > 
> > At some point I want to convert the global state in input.c into a
> > source_manager class, probably hung off the diagnostic_context, so the
> > natural place to put that state would be in there (rather than the
> > fcache being global state).  Perhaps have a source_reader policy class
> > hung off of the source_manager, which would have responsibility for
> > reading and converting the file, either piecemeal, or fully for the
> > "need to use libcpp" case.
> > 
> > But that's not necessary for this fix.
> > 
> > > Please let me know if it looks OK (either now or for stage 1,
> > whatever makes
> > > sense...) bootstrap + regtest all languages on x86-64 GNU/Linux, all
> > tests the
> > > same before & after plus 6 new PASS from the new tests. Thanks!
> > 
> > Various comments inline below.  In particular, does the patch add a
> > memory leak in _cpp_convert_input?
> > 
> > Thanks
> > Dave
> > 
> > [...]
> > 
> > > libcpp/ChangeLog:
> > > 
> > >         PR other/93067
> > >         * charset.c (init_iconv_desc): Adapt to permit PFILE argument
> > to
> > >         be NULL.
> > >         (_cpp_convert_input): Likewise. Also move UTF-8 BOM logic
> > to...
> > >         (cpp_check_utf8_bom): ...here.  New function.
> > >         (cpp_input_conversion_is_trivial): New function.
> > >         * files.c (read_file_guts): Allow PFILE argument to be NULL.
> > Add
> > >         INPUT_CHARSET argument as an alternate source of this
> > information.
> > >         (cpp_get_converted_source): New function.
> > >         * include/cpplib.h (struct cpp_converted_source): Declare.
> > >         (cpp_get_converted_source): Declare.
> > >         (cpp_input_conversion_is_trivial): Declare.
> > >         (cpp_check_utf8_bom): Declare.
> > > 
> > > gcc/testsuite/ChangeLog:
> > > 
> > >         PR other/93067
> > >         * gcc.dg/diagnostic-input-charset-1.c: New test.
> > >         * gcc.dg/diagnostic-input-charset-2.c: New test.
> > 
> > Maybe rename the 2nd test to "diagnostic-input-utf8-bom.c" ?
> >
> 
> Done.
> 
> > > 
> > > diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
> > > index 59cabd12407..d5aa7859cc1 100644
> > > --- a/gcc/c-family/c-opts.c
> > > +++ b/gcc/c-family/c-opts.c
> > > @@ -1124,6 +1124,10 @@ c_common_post_options (const char **pfilename)
> > >    cpp_post_options (parse_in);
> > >    init_global_opts_from_cpp (&global_options, cpp_get_options
> > (parse_in));
> > >  
> > > +  /* Let diagnostics infrastructure know we are using libcpp to read
> > > +     the input.  */
> > > +  input_initialize_cpp_context (parse_in);
> > 
> > As noted above I think the most significant thing here is getting the
> > source encoding from parse_in, so maybe reword the comment accordingly.
> > 
> > [...]
> > 
> > > diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
> > > index 51baf141711..2b12a98afc0 100644
> > > --- a/gcc/fortran/cpp.c
> > > +++ b/gcc/fortran/cpp.c
> > > @@ -493,6 +493,10 @@ gfc_cpp_post_options (void)
> > >  
> > >    cpp_post_options (cpp_in);
> > >  
> > > +  /* Let diagnostics infrastructure know we are using libcpp to read
> > > +     the input.  */
> > > +  input_initialize_cpp_context (cpp_in);
> > 
> > Likewise.
> >
> 
> I kept this in mind when redoing this part, hopefully it is better now.
> 
> > [...]
> > 
> > > diff --git a/gcc/input.c b/gcc/input.c
> > > index 29d10f06b86..1dcdd464bc1 100644
> > > --- a/gcc/input.c
> > > +++ b/gcc/input.c
> > 
> > [...]
> > 
> > > @@ -394,6 +432,42 @@ add_file_to_cache_tab (const char *file_path)
> > >    r->total_lines = total_lines_num (file_path);
> > >    r->missing_trailing_newline = true;
> > >  
> > > +  /* If libcpp is managing the reading, then there are two cases we
> > need to
> > > +     consider.  If -finput-charset is not in effect, then we just
> > need to
> > > +     strip a UTF-8 BOM, so do that ourselves rather than calling
> > into libcpp so
> > > +     as to avoid paying the penalty of using libcpp, namely that the
> > entire file
> > > +     must be read at once.  In the (generally rare) case that a non-
> > trivial
> > > +     -finput-charset is needed, then go ahead and use libcpp to read
> > the whole
> > > +     file and do the conversion.  */
> > > +  if (input_cpp_context.in_use)
> > > +    {
> > > +      if (input_cpp_context.conversion_is_trivial)
> > > +       {
> > > +         /* Strip the UTF8 BOM if present.  */
> > > +         if (read_data (r))
> > > +           {
> > > +             const int offset = cpp_check_utf8_bom (r->data, r-
> > >nb_read);
> > > +             r->offset_buffer (offset);
> > > +             r->nb_read -= offset;
> > > +           }
> > 
> > This assumes that trivial conversions are always UTF8 to UTF8, which
> > assumes that SOURCE_CHARSET is UTF8, which isn't the case for EBCDIC...
> > 
> > [...]
> > 
> > > +/* Utility to strip a UTF-8 byte order marking from the beginning
> > > +   of a buffer.  Returns the number of bytes to skip, which
> > currently
> > > +   will be either 0 or 3.  */
> > > +int
> > > +cpp_check_utf8_bom (const char *data, size_t data_length)
> > > +{
> > > +
> > > +#if HOST_CHARSET == HOST_CHARSET_ASCII
> > > +  const unsigned char *udata = (const unsigned char *) data;
> > > +  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
> > > +      && udata[2] == 0xbf)
> > > +    return 3;
> > > +#endif
> > > +
> > > +  return 0;
> > > +}
> > 
> > ...though the check on HOST_CHARSET == HOST_CHARSET_ASCII makes this a
> > no-op for the EBCDIC case, so that's OK.
> > 
> > [...]
> > 
> > > @@ -2158,17 +2191,28 @@ _cpp_convert_input (cpp_reader *pfile, const
> > char *input_charset,
> > >        to.text = XNEWVEC (uchar, to.asize);
> > >        to.len = 0;
> > >  
> > > -      if (!APPLY_CONVERSION (input_cset, input, len, &to))
> > > -       cpp_error (pfile, CPP_DL_ERROR,
> > > -                  "failure to convert %s to %s",
> > > -                  CPP_OPTION (pfile, input_charset),
> > SOURCE_CHARSET);
> > > +      const bool ok = APPLY_CONVERSION (input_cset, input, len,
> > &to);
> > >  
> > > -      free (input);
> > > -    }
> > > +      /* Handle conversion failure.  */
> > > +      if (!ok)
> > > +       {
> > > +         free (input);
> > > +         if (!pfile)
> > > +           {
> > > +             XDELETEVEC (to.text);
> > > +             *buffer_start = NULL;
> > > +             *st_size = 0;
> > > +             return NULL;
> > > +           }
> > > +         cpp_error (pfile, CPP_DL_ERROR,
> > > +                    "failure to convert %s to %s",
> > > +                    CPP_OPTION (pfile, input_charset),
> > SOURCE_CHARSET);
> > > +       }
> > > +    }
> > >
> > 
> > Doesn't this lose the
> >   free (input);
> > for the case where the conversion is successful?  Is this a leak?
> >
> 
> Ugh, sorry to waste your time with that mistake. Fixed.
> 
> > [...]
> > 
> > > diff --git a/libcpp/files.c b/libcpp/files.c
> > > index 301b2379a23..178bb9ed1e6 100644
> > > --- a/libcpp/files.c
> > > +++ b/libcpp/files.c
> > > @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile,
> > _cpp_file *file,
> > >  static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
> > >                               bool *invalid_pch, location_t loc);
> > >  static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
> > > -                           location_t loc);
> > > +                           location_t loc, const char *input_charset
> > = NULL);
> > 
> > read_file_guts is only used in one place before the patch, so rather
> > than add a default argument, I think it's simpler to pass in the
> > input_charset at that call.
> > 
> > > @@ -671,18 +671,32 @@ _cpp_find_file (cpp_reader *pfile, const char
> > *fname, cpp_dir *start_dir,
> > >  
> > >     Use LOC for any diagnostics.
> > >  
> > > +   The input charset may be specified in the INPUT_CHARSET argument,
> > or
> > > +   else it will be taken from PFILE.
> > 
> > ...and doing so means that input_charset will be non-NULL at all
> > callers.
> > 
> > > +   PFILE may be NULL.  In this case, no diagnostics are issued, and
> > the
> > > +   input charset must be specified in the arguments.
> > > +
> > >     FIXME: Flush file cache and try again if we run out of memory. 
> > */
> > >  static bool
> > > -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
> > > +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
> > > +               const char *input_charset)
> > >  {
> > >    ssize_t size, total, count;
> > >    uchar *buf;
> > >    bool regular;
> > >  
> > > +  if (!input_charset)
> > > +    {
> > > +      gcc_assert (pfile);
> > > +      input_charset = CPP_OPTION (pfile, input_charset);
> > > +    }
> > 
> > ...which would allow for this logic to be removed, moving this to the
> > existing caller, I believe.
> >
> 
> Much better your way, done.
> 
> Thanks for taking a look at it, sorry this version is a bit changed from the
> previous one. FWIW, the libcpp pieces were not touched other than addressing
> your comments; the new logic is in input.h/input.c plus the associated
> tweaks to the frontends.
> 
> -Lewis


> From: Lewis Hyatt <[email protected]>
> Date: Wed, 27 Jan 2021 18:23:13 -0500
> Subject: [PATCH] diagnostics: Support for -finput-charset [PR93067]
> 
> Adds the logic to handle -finput-charset in layout_get_source_line(), so that
> source lines are converted from their input encodings prior to being output by
> diagnostics machinery. Also adds the ability to strip a UTF-8 BOM similarly.
> 
> gcc/c-family/ChangeLog:
> 
>       PR other/93067
>       * c-opts.c (c_common_input_charset_cb): New function.
>       (c_common_post_options): Call new function input_initialize_context().
> 
> gcc/d/ChangeLog:
> 
>       PR other/93067
>       * d-lang.cc (d_input_charset_callback): New function.
>       (d_init): Call new function input_initialize_context().
> 
> gcc/fortran/ChangeLog:
> 
>       PR other/93067
>       * cpp.c (gfc_cpp_post_options): Call new function
>       input_initialize_cpp_context().
> 
> gcc/ChangeLog:
> 
>       PR other/93067
>       * input.c (default_charset_callback): New function.
>       (input_initialize_context): New function.
>       (read_data): Add prototype.
>       (add_file_to_cache_tab): Use libcpp to convert input encoding when
>       needed. Strip UTF-8 BOM when needed.
>       (class fcache): Add new members to track input encoding conversion.
>       (fcache::fcache): Adapt for new members.
>       (fcache::~fcache): Likewise.
>       (maybe_grow): Likewise.
>       (needs_read): Adapt to be aware that fp member may be NULL now.
>       (get_next_line): Likewise.
>       * input.h (input_initialize_context): Declare.
> 
> libcpp/ChangeLog:
> 
>       PR other/93067
>       * charset.c (init_iconv_desc): Adapt to permit PFILE argument to
>       be NULL.
>       (_cpp_convert_input): Likewise. Also move UTF-8 BOM logic to...
>       (cpp_check_utf8_bom): ...here.  New function.
>       (cpp_input_conversion_is_trivial): New function.
>       * files.c (read_file_guts): Allow PFILE argument to be NULL.  Add
>       INPUT_CHARSET argument as an alternate source of this information.
>       (read_file): Pass the new argument to read_file_guts.
>       (cpp_get_converted_source): New function.
>       * include/cpplib.h (struct cpp_converted_source): Declare.
>       (cpp_get_converted_source): Declare.
>       (cpp_input_conversion_is_trivial): Declare.
>       (cpp_check_utf8_bom): Declare.
> 
> gcc/testsuite/ChangeLog:
> 
>       PR other/93067
>       * gcc.dg/diagnostic-input-charset-1.c: New test.
>       * gcc.dg/diagnostic-input-utf8-bom.c: New test.
> 
> diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
> index bd15b9cd902..2f58e8413b9 100644
> --- a/gcc/c-family/c-opts.c
> +++ b/gcc/c-family/c-opts.c
> @@ -188,6 +188,14 @@ c_common_diagnostics_set_defaults (diagnostic_context 
> *context)
>    context->opt_permissive = OPT_fpermissive;
>  }
>  
> +/* Input charset configuration for diagnostics.  */
> +static const char *
> +c_common_input_charset_cb (const char * /*filename*/)
> +{
> +  const char *cs = cpp_opts->input_charset;
> +  return cpp_input_conversion_is_trivial (cs) ? NULL : cs;
> +}
> +
>  /* Whether options from all C-family languages should be accepted
>     quietly.  */
>  static bool accept_all_c_family_options = false;
> @@ -1131,6 +1139,10 @@ c_common_post_options (const char **pfilename)
>    cpp_post_options (parse_in);
>    init_global_opts_from_cpp (&global_options, cpp_get_options (parse_in));
>  
> +  /* Let diagnostics infrastructure know how to convert input files the same
> +     way libcpp will do it, namely using the configured input charset and
> +     skipping a UTF-8 BOM if present.  */
> +  input_initialize_context (c_common_input_charset_cb, true);
>    input_location = UNKNOWN_LOCATION;
>  
>    *pfilename = this_input_filename
> diff --git a/gcc/d/d-lang.cc b/gcc/d/d-lang.cc
> index 0fd207da7f3..9f170c09886 100644
> --- a/gcc/d/d-lang.cc
> +++ b/gcc/d/d-lang.cc
> @@ -50,6 +50,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "output.h"
>  #include "print-tree.h"
>  #include "debug.h"
> +#include "input.h"
>  
>  #include "d-tree.h"
>  #include "id.h"
> @@ -362,6 +363,19 @@ d_option_lang_mask (void)
>    return CL_D;
>  }
>  
> +/* Implements input charset and BOM skipping configuration for
> +   diagnostics.  */
> +static const char *d_input_charset_callback (const char * /*filename*/)
> +{
> +  /* TODO: The input charset is automatically determined by code in
> +     dmd/dmodule.c based on the contents of the file.  If this detection
> +     logic were factored out and could be reused here, then we would be able
> +     to return UTF-16 or UTF-32 as needed here.  For now, we return always
> +     NULL, which means no conversion is necessary, i.e. the input is assumed
> +     to be UTF-8 when diagnostics read this file.  */
> +  return NULL;
> +}
> +
>  /* Implements the lang_hooks.init routine for language D.  */
>  
>  static bool
> @@ -373,6 +387,10 @@ d_init (void)
>    Expression::_init ();
>    Objc::_init ();
>  
> +  /* Diagnostics input init, to enable BOM skipping and
> +     input charset conversion.  */
> +  input_initialize_context (d_input_charset_callback, true);
> +
>    /* Back-end init.  */
>    global_binding_level = ggc_cleared_alloc <binding_level> ();
>    current_binding_level = global_binding_level;
> diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
> index 419cd6accbe..8b564888673 100644
> --- a/gcc/fortran/cpp.c
> +++ b/gcc/fortran/cpp.c
> @@ -493,6 +493,12 @@ gfc_cpp_post_options (void)
>  
>    cpp_post_options (cpp_in);
>  
> +
> +  /* Let diagnostics infrastructure know how to convert input files the same
> +     way libcpp will do it, namely, with no charset conversion but with
> +     skipping of a UTF-8 BOM if present.  */
> +  input_initialize_context (NULL, true);
> +
>    gfc_cpp_register_include_paths ();
>  }
>  
> diff --git a/gcc/input.c b/gcc/input.c
> index 9e39e7df83c..f8f6dc7e4ca 100644
> --- a/gcc/input.c
> +++ b/gcc/input.c
> @@ -30,6 +30,29 @@ along with GCC; see the file COPYING3.  If not see
>  #define HAVE_ICONV 0
>  #endif
>  
> +/* Input charset configuration.  */
> +static const char *default_charset_callback (const char *)
> +{
> +  return NULL;
> +}
> +
> +struct
> +{
> +  input_context_charset_callback ccb;
> +  bool should_skip_bom;
> +} static input_context =
> +{
> +  default_charset_callback,
> +  false
> +};
> +
> +void input_initialize_context (input_context_charset_callback ccb,
> +                            bool should_skip_bom)
> +{
> +  input_context.ccb = (ccb ? ccb : default_charset_callback);
> +  input_context.should_skip_bom = should_skip_bom;
> +}
> +
>  /* This is a cache used by get_next_line to store the content of a
>     file to be searched for file lines.  */
>  class fcache
> @@ -78,6 +101,10 @@ public:
>       far.  */
>    char *data;
>  
> +  /* The allocated buffer to be freed may start a little earlier than DATA,
> +     e.g. if a UTF8 BOM was skipped at the beginning.  */
> +  int alloc_offset;
> +
>    /*  The size of the DATA array above.*/
>    size_t size;
>  
> @@ -118,6 +145,17 @@ public:
>  
>    fcache ();
>    ~fcache ();
> +
> +  void offset_buffer (int offset)
> +  {
> +    gcc_assert (offset < 0 ? alloc_offset + offset >= 0
> +             : (size_t) offset <= size);
> +    gcc_assert (data);
> +    alloc_offset += offset;
> +    data += offset;
> +    size -= offset;
> +  }
> +
>  };
>  
>  /* Current position in real source file.  */
> @@ -364,6 +402,9 @@ evicted_cache_tab_entry (unsigned *highest_use_count)
>    return to_evict;
>  }
>  
> +static bool
> +read_data (fcache *c);
> +
>  /* Create the cache used for the content of a given file to be
>     accessed by caret diagnostic.  This cache is added to an array of
>     cache and can be retrieved by lookup_file_in_cache_tab.  This
> @@ -384,6 +425,8 @@ add_file_to_cache_tab (const char *file_path)
>    if (r->fp)
>      fclose (r->fp);
>    r->fp = fp;
> +  if (r->alloc_offset)
> +    r->offset_buffer (-r->alloc_offset);
>    r->nb_read = 0;
>    r->line_start_idx = 0;
>    r->line_num = 0;
> @@ -394,6 +437,33 @@ add_file_to_cache_tab (const char *file_path)
>    r->total_lines = total_lines_num (file_path);
>    r->missing_trailing_newline = true;
>  
> +  /* Check the frontend configuration to determine if we need to do any
> +     transformations, such as charset conversion or BOM skipping.  */
> +  if (const char *input_charset = input_context.ccb (file_path))
> +    {
> +      /* Need a full-blown conversion of the input charset.  */
> +      fclose (r->fp);
> +      r->fp = NULL;
> +      const cpp_converted_source cs
> +     = cpp_get_converted_source (file_path, input_charset);
> +      if (!cs.data)
> +     return NULL;
> +      if (r->data)
> +     XDELETEVEC (r->data);
> +      r->data = cs.data;
> +      r->nb_read = r->size = cs.len;
> +      r->alloc_offset = cs.data - cs.to_free;
> +    }
> +  else if (input_context.should_skip_bom)
> +    {
> +      if (read_data (r))
> +     {
> +       const int offset = cpp_check_utf8_bom (r->data, r->nb_read);
> +       r->offset_buffer (offset);
> +       r->nb_read -= offset;
> +     }
> +    }
> +
>    return r;
>  }
>  
> @@ -415,7 +485,7 @@ lookup_or_add_file_to_cache_tab (const char *file_path)
>     diagnostic.  */
>  
>  fcache::fcache ()
> -: use_count (0), file_path (NULL), fp (NULL), data (0),
> +: use_count (0), file_path (NULL), fp (NULL), data (0), alloc_offset (0),
>    size (0), nb_read (0), line_start_idx (0), line_num (0),
>    total_lines (0), missing_trailing_newline (true)
>  {
> @@ -433,6 +503,7 @@ fcache::~fcache ()
>      }
>    if (data)
>      {
> +      offset_buffer (-alloc_offset);
>        XDELETEVEC (data);
>        data = 0;
>      }
> @@ -447,9 +518,9 @@ fcache::~fcache ()
>  static bool
>  needs_read (fcache *c)
>  {
> -  return (c->nb_read == 0
> -       || c->nb_read == c->size
> -       || (c->line_start_idx >= c->nb_read - 1));
> +  return c->fp && (c->nb_read == 0
> +                || c->nb_read == c->size
> +                || (c->line_start_idx >= c->nb_read - 1));
>  }
>  
>  /*  Return TRUE iff the cache is full and thus needs to be
> @@ -469,9 +540,20 @@ maybe_grow (fcache *c)
>    if (!needs_grow (c))
>      return;
>  
> -  size_t size = c->size == 0 ? fcache_buffer_size : c->size * 2;
> -  c->data = XRESIZEVEC (char, c->data, size);
> -  c->size = size;
> +  if (!c->data)
> +    {
> +      gcc_assert (c->size == 0 && c->alloc_offset == 0);
> +      c->size = fcache_buffer_size;
> +      c->data = XNEWVEC (char, c->size);
> +    }
> +  else
> +    {
> +      const int offset = c->alloc_offset;
> +      c->offset_buffer (-offset);
> +      c->size *= 2;
> +      c->data = XRESIZEVEC (char, c->data, c->size);
> +      c->offset_buffer (offset);
> +    }
>  }
>  
>  /*  Read more data into the cache.  Extends the cache if need be.
> @@ -570,7 +652,7 @@ get_next_line (fcache *c, char **line, ssize_t *line_len)
>        c->missing_trailing_newline = false;
>      }
>  
> -  if (ferror (c->fp))
> +  if (c->fp && ferror (c->fp))
>      return false;
>  
>    /* At this point, we've found the end of the of line.  It either
> diff --git a/gcc/input.h b/gcc/input.h
> index f1ef5d76cfd..8a793399958 100644
> --- a/gcc/input.h
> +++ b/gcc/input.h
> @@ -214,4 +214,21 @@ class GTY(()) string_concat_db
>    hash_map <location_hash, string_concat *> *m_table;
>  };
>  
> +/* Because we read source files a second time after the frontend did it the
> +   first time, we need to know how the frontend handled things like character
> +   set conversion and UTF-8 BOM stripping, in order to make everything
> +   consistent.  This function needs to be called by each frontend that 
> requires
> +   non-default behavior, to inform the diagnostics infrastructure how input 
> is
> +   to be processed.  The default behavior is to do no conversion and not to
> +   strip a UTF-8 BOM.
> +
> +   The callback should return the input charset to be used to convert the 
> given
> +   file's contents to UTF-8, or it should return NULL if no conversion is 
> needed
> +   for this file.  SHOULD_SKIP_BOM only applies in case no conversion was
> +   performed, and if true, it will cause a UTF-8 BOM to be skipped at the
> +   beginning of the file.  (In case a conversion was performed, the BOM is
> +   rather skipped as part of the conversion process.)  */
> +typedef const char * (*input_context_charset_callback)(const char *);
> +void input_initialize_context (input_context_charset_callback ccb,
> +                            bool should_skip_bom);
>  #endif
> diff --git a/libcpp/charset.c b/libcpp/charset.c
> index 99a9b73e5ab..61881f978a8 100644
> --- a/libcpp/charset.c
> +++ b/libcpp/charset.c
> @@ -630,7 +630,11 @@ static const struct cpp_conversion conversion_tab[] = {
>     cset_converter structure for conversion from FROM to TO.  If
>     iconv_open() fails, issue an error and return an identity
>     converter.  Silently return an identity converter if FROM and TO
> -   are identical.  */
> +   are identical.
> +
> +   PFILE is only used for generating diagnostics; setting it to NULL
> +   suppresses diagnostics.  */
> +
>  static struct cset_converter
>  init_iconv_desc (cpp_reader *pfile, const char *to, const char *from)
>  {
> @@ -672,25 +676,31 @@ init_iconv_desc (cpp_reader *pfile, const char *to, 
> const char *from)
>  
>        if (ret.cd == (iconv_t) -1)
>       {
> -       if (errno == EINVAL)
> -         cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
> -                    "conversion from %s to %s not supported by iconv",
> -                    from, to);
> -       else
> -         cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
> -
> +       if (pfile)
> +         {
> +           if (errno == EINVAL)
> +             cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
> +                        "conversion from %s to %s not supported by iconv",
> +                        from, to);
> +           else
> +             cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
> +         }
>         ret.func = convert_no_conversion;
>       }
>      }
>    else
>      {
> -      cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
> -              "no iconv implementation, cannot convert from %s to %s",
> -              from, to);
> +      if (pfile)
> +     {
> +       cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
> +                  "no iconv implementation, cannot convert from %s to %s",
> +                  from, to);
> +     }
>        ret.func = convert_no_conversion;
>        ret.cd = (iconv_t) -1;
>        ret.width = -1;
>      }
> +
>    return ret;
>  }
>  
> @@ -2122,6 +2132,25 @@ _cpp_interpret_identifier (cpp_reader *pfile, const 
> uchar *id, size_t len)
>                                 buf, bufp - buf, HT_ALLOC));
>  }
>  
> +
> +/* Utility to strip a UTF-8 byte order marking from the beginning
> +   of a buffer.  Returns the number of bytes to skip, which currently
> +   will be either 0 or 3.  */
> +int
> +cpp_check_utf8_bom (const char *data, size_t data_length)
> +{
> +
> +#if HOST_CHARSET == HOST_CHARSET_ASCII
> +  const unsigned char *udata = (const unsigned char *) data;
> +  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
> +      && udata[2] == 0xbf)
> +    return 3;
> +#endif
> +
> +  return 0;
> +}
> +
> +
>  /* Convert an input buffer (containing the complete contents of one
>     source file) from INPUT_CHARSET to the source character set.  INPUT
>     points to the input buffer, SIZE is its allocated size, and LEN is
> @@ -2135,7 +2164,11 @@ _cpp_interpret_identifier (cpp_reader *pfile, const 
> uchar *id, size_t len)
>     INPUT is expected to have been allocated with xmalloc.  This
>     function will either set *BUFFER_START to INPUT, or free it and set
>     *BUFFER_START to a pointer to another xmalloc-allocated block of
> -   memory.  */
> +   memory.
> +
> +   PFILE is only used to generate diagnostics; setting it to NULL suppresses
> +   diagnostics, and causes a return of NULL if there was any error instead.  
> */
> +
>  uchar * 
>  _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
>                   uchar *input, size_t size, size_t len,
> @@ -2158,17 +2191,27 @@ _cpp_convert_input (cpp_reader *pfile, const char 
> *input_charset,
>        to.text = XNEWVEC (uchar, to.asize);
>        to.len = 0;
>  
> -      if (!APPLY_CONVERSION (input_cset, input, len, &to))
> -     cpp_error (pfile, CPP_DL_ERROR,
> -                "failure to convert %s to %s",
> -                CPP_OPTION (pfile, input_charset), SOURCE_CHARSET);
> -
> +      const bool ok = APPLY_CONVERSION (input_cset, input, len, &to);
>        free (input);
> -    }
>  
> -  /* Clean up the mess.  */
> -  if (input_cset.func == convert_using_iconv)
> -    iconv_close (input_cset.cd);
> +      /* Clean up the mess.  */
> +      if (input_cset.func == convert_using_iconv)
> +     iconv_close (input_cset.cd);
> +
> +      /* Handle conversion failure.  */
> +      if (!ok)
> +     {
> +       if (!pfile)
> +         {
> +           XDELETEVEC (to.text);
> +           *buffer_start = NULL;
> +           *st_size = 0;
> +           return NULL;
> +         }
> +       cpp_error (pfile, CPP_DL_ERROR, "failure to convert %s to %s",
> +                  input_charset, SOURCE_CHARSET);
> +     }
> +    }
>  
>    /* Resize buffer if we allocated substantially too much, or if we
>       haven't enough space for the \n-terminator or following
> @@ -2192,19 +2235,14 @@ _cpp_convert_input (cpp_reader *pfile, const char 
> *input_charset,
>  
>    buffer = to.text;
>    *st_size = to.len;
> -#if HOST_CHARSET == HOST_CHARSET_ASCII
> -  /* The HOST_CHARSET test just above ensures that the source charset
> -     is UTF-8.  So, ignore a UTF-8 BOM if we see one.  Note that
> -     glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
> +
> +  /* Ignore a UTF-8 BOM if we see one and the source charset is UTF-8.  Note
> +     that glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
>       BOM -- however, even if it did, we would still need this code due
>       to the 'convert_no_conversion' case.  */
> -  if (to.len >= 3 && to.text[0] == 0xef && to.text[1] == 0xbb
> -      && to.text[2] == 0xbf)
> -    {
> -      *st_size -= 3;
> -      buffer += 3;
> -    }
> -#endif
> +  const int bom_len = cpp_check_utf8_bom ((const char *) to.text, to.len);
> +  *st_size -= bom_len;
> +  buffer += bom_len;
>  
>    *buffer_start = to.text;
>    return buffer;
> @@ -2244,6 +2282,13 @@ _cpp_default_encoding (void)
>    return current_encoding;
>  }
>  
> +/* Check if the configured input charset requires no conversion, other than
> +   possibly stripping a UTF-8 BOM.  */
> +bool cpp_input_conversion_is_trivial (const char *input_charset)
> +{
> +  return !strcasecmp (input_charset, SOURCE_CHARSET);
> +}
> +
>  /* Implementation of class cpp_string_location_reader.  */
>  
>  /* Constructor for cpp_string_location_reader.  */
> diff --git a/libcpp/files.c b/libcpp/files.c
> index 5ea3f8e1bf3..d91606c1532 100644
> --- a/libcpp/files.c
> +++ b/libcpp/files.c
> @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile, _cpp_file 
> *file,
>  static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
>                             bool *invalid_pch, location_t loc);
>  static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
> -                         location_t loc);
> +                         location_t loc, const char *input_charset);
>  static bool read_file (cpp_reader *pfile, _cpp_file *file,
>                      location_t loc);
>  static struct cpp_dir *search_path_head (cpp_reader *, const char *fname,
> @@ -671,9 +671,12 @@ _cpp_find_file (cpp_reader *pfile, const char *fname, 
> cpp_dir *start_dir,
>  
>     Use LOC for any diagnostics.
>  
> +   PFILE may be NULL.  In this case, no diagnostics are issued.
> +
>     FIXME: Flush file cache and try again if we run out of memory.  */
>  static bool
> -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
> +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
> +             const char *input_charset)
>  {
>    ssize_t size, total, count;
>    uchar *buf;
> @@ -681,8 +684,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, 
> location_t loc)
>  
>    if (S_ISBLK (file->st.st_mode))
>      {
> -      cpp_error_at (pfile, CPP_DL_ERROR, loc,
> -                 "%s is a block device", file->path);
> +      if (pfile)
> +     cpp_error_at (pfile, CPP_DL_ERROR, loc,
> +                   "%s is a block device", file->path);
>        return false;
>      }
>  
> @@ -699,8 +703,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, 
> location_t loc)
>        does not bite us.  */
>        if (file->st.st_size > INTTYPE_MAXIMUM (ssize_t))
>       {
> -       cpp_error_at (pfile, CPP_DL_ERROR, loc,
> -                     "%s is too large", file->path);
> +       if (pfile)
> +         cpp_error_at (pfile, CPP_DL_ERROR, loc,
> +                       "%s is too large", file->path);
>         return false;
>       }
>  
> @@ -733,29 +738,29 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, 
> location_t loc)
>  
>    if (count < 0)
>      {
> -      cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
> +      if (pfile)
> +     cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
>        free (buf);
>        return false;
>      }
>  
> -  if (regular && total != size && STAT_SIZE_RELIABLE (file->st))
> +  if (pfile && regular && total != size && STAT_SIZE_RELIABLE (file->st))
>      cpp_error_at (pfile, CPP_DL_WARNING, loc,
>              "%s is shorter than expected", file->path);
>  
>    file->buffer = _cpp_convert_input (pfile,
> -                                  CPP_OPTION (pfile, input_charset),
> +                                  input_charset,
>                                    buf, size + 16, total,
>                                    &file->buffer_start,
>                                    &file->st.st_size);
> -  file->buffer_valid = true;
> -
> -  return true;
> +  file->buffer_valid = file->buffer;
> +  return file->buffer_valid;
>  }
>  
>  /* Convenience wrapper around read_file_guts that opens the file if
>     necessary and closes the file descriptor after reading.  FILE must
>     have been passed through find_file() at some stage.  Use LOC for
> -   any diagnostics.  */
> +   any diagnostics.  Unlike read_file_guts(), PFILE may not be NULL.  */
>  static bool
>  read_file (cpp_reader *pfile, _cpp_file *file, location_t loc)
>  {
> @@ -773,7 +778,8 @@ read_file (cpp_reader *pfile, _cpp_file *file, location_t 
> loc)
>        return false;
>      }
>  
> -  file->dont_read = !read_file_guts (pfile, file, loc);
> +  file->dont_read = !read_file_guts (pfile, file, loc,
> +                                  CPP_OPTION (pfile, input_charset));
>    close (file->fd);
>    file->fd = -1;
>  
> @@ -2118,3 +2124,25 @@ _cpp_has_header (cpp_reader *pfile, const char *fname, 
> int angle_brackets,
>    return file->err_no != ENOENT;
>  }
>  
> +/* Read a file and convert to input charset, the same as if it were being 
> read
> +   by a cpp_reader.  */
> +
> +cpp_converted_source
> +cpp_get_converted_source (const char *fname, const char *input_charset)
> +{
> +  cpp_converted_source res = {};
> +  _cpp_file file = {};
> +  file.fd = -1;
> +  file.name = lbasename (fname);
> +  file.path = fname;
> +  if (!open_file (&file))
> +    return res;
> +  const bool ok = read_file_guts (NULL, &file, 0, input_charset);
> +  close (file.fd);
> +  if (!ok)
> +    return res;
> +  res.to_free = (char *) file.buffer_start;
> +  res.data = (char *) file.buffer;
> +  res.len = file.st.st_size;
> +  return res;
> +}
> diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
> index 4467c73284d..d02ef4aa054 100644
> --- a/libcpp/include/cpplib.h
> +++ b/libcpp/include/cpplib.h
> @@ -1369,6 +1369,20 @@ extern struct _cpp_file *cpp_get_file (cpp_buffer *);
>  extern cpp_buffer *cpp_get_prev (cpp_buffer *);
>  extern void cpp_clear_file_cache (cpp_reader *);
>  
> +/* cpp_get_converted_source returns the contents of the given file, as it 
> exists
> +   after cpplib has read it and converted it from the input charset to the
> +   source charset.  Return struct will be zero-filled if the data could not 
> be
> +   read for any reason.  The data starts at the DATA pointer, but the TO_FREE
> +   pointer is what should be passed to free(), as there may be an offset.  */
> +struct cpp_converted_source
> +{
> +  char *to_free;
> +  char *data;
> +  size_t len;
> +};
> +cpp_converted_source cpp_get_converted_source (const char *fname,
> +                                            const char *input_charset);
> +
>  /* In pch.c */
>  struct save_macro_data;
>  extern int cpp_save_state (cpp_reader *, FILE *);
> @@ -1439,6 +1453,7 @@ class cpp_display_width_computation {
>  /* Convenience functions that are simple use cases for class
>     cpp_display_width_computation.  Tab characters will be expanded to spaces
>     as determined by TABSTOP.  */
> +
>  int cpp_byte_column_to_display_column (const char *data, int data_length,
>                                      int column, int tabstop);
>  inline int cpp_display_width (const char *data, int data_length,
> @@ -1451,4 +1466,7 @@ int cpp_display_column_to_byte_column (const char 
> *data, int data_length,
>                                      int display_col, int tabstop);
>  int cpp_wcwidth (cppchar_t c);
>  
> +bool cpp_input_conversion_is_trivial (const char *input_charset);
> +int cpp_check_utf8_bom (const char *data, size_t data_length);
> +
>  #endif /* ! LIBCPP_CPPLIB_H */

Hello-

As discussed most recently here:

https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575243.html

Here is the updated patch to support -finput-charset in diagnostics. As I
mentioned in the above thread, I adapted the patch in input.h and input.c now
that class file_cache has been created. That change also enabled me to move
the input context state into the file_cache class, so that it doesn't need to
be global. Since global_dc is now in charge of creating the file_cache, it
seemed to make sense to me also to move the initialization function for
frontends to call into the diagnostic_context, so they can call it on the
global_dc. Please let me know if this makes sense, overall it is "less global"
than before, which I think is the direction everyone would prefer. Happy to
adjust as needed. Thanks!

As before, I put the new testcases in a separate gzipped file, because they
have non-standard encodings.

Bootstrap + testsuite looks good on x86-64 Linux, all tests the same before
and after except for the 6 new passes.

-Lewis

From: Lewis Hyatt <[email protected]>
Date: Fri, 30 Jul 2021 12:53:43 -0400
Subject: [PATCH] diagnostics: Support for -finput-charset [PR93067]

Adds the logic to handle -finput-charset in layout_get_source_line(), so that
source lines are converted from their input encodings prior to being output by
diagnostics machinery. Also adds the ability to strip a UTF-8 BOM similarly.

gcc/c-family/ChangeLog:

        PR other/93067
        * c-opts.c (c_common_input_charset_cb): New function.
        (c_common_post_options): Call new function
        diagnostic_initialize_input_context().

gcc/d/ChangeLog:

        PR other/93067
        * d-lang.cc (d_input_charset_callback): New function.
        (d_init): Call new function
        diagnostic_initialize_input_context().

gcc/fortran/ChangeLog:

        PR other/93067
        * cpp.c (gfc_cpp_post_options): Call new function
        diagnostic_initialize_input_context().

gcc/ChangeLog:

        PR other/93067
        * coretypes.h (typedef diagnostic_input_charset_callback): Declare.
        * diagnostic.c (diagnostic_initialize_input_context): New function.
        * diagnostic.h (diagnostic_initialize_input_context): Declare.
        * input.c (default_charset_callback): New function.
        (file_cache::initialize_input_context): New function.
        (file_cache_slot::create): Added ability to convert the input
        according to the input context.
        (file_cache::file_cache): Initialize the new input context.
        (class file_cache_slot): Added new m_alloc_offset member.
        (file_cache_slot::file_cache_slot): Initialize the new member.
        (file_cache_slot::~file_cache_slot): Handle potentially offset buffer.
        (file_cache_slot::maybe_grow): Likewise.
        (file_cache_slot::needs_read_p): Handle NULL fp, which is now possible.
        (file_cache_slot::get_next_line): Likewise.
        * input.h (class file_cache): Added input context member.

libcpp/ChangeLog:

        PR other/93067
        * charset.c (init_iconv_desc): Adapt to permit PFILE argument to
        be NULL.
        (_cpp_convert_input): Likewise. Also move UTF-8 BOM logic to...
        (cpp_check_utf8_bom): ...here.  New function.
        (cpp_input_conversion_is_trivial): New function.
        * files.c (read_file_guts): Allow PFILE argument to be NULL.  Add
        INPUT_CHARSET argument as an alternate source of this information.
        (read_file): Pass the new argument to read_file_guts.
        (cpp_get_converted_source): New function.
        * include/cpplib.h (struct cpp_converted_source): Declare.
        (cpp_get_converted_source): Declare.
        (cpp_input_conversion_is_trivial): Declare.
        (cpp_check_utf8_bom): Declare.

gcc/testsuite/ChangeLog:

        PR other/93067
        * gcc.dg/diagnostic-input-charset-1.c: New test.
        * gcc.dg/diagnostic-input-utf8-bom.c: New test.

diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
index 1c4e832c7ed..961e40c29e7 100644
--- a/gcc/c-family/c-opts.c
+++ b/gcc/c-family/c-opts.c
@@ -188,6 +188,14 @@ c_common_diagnostics_set_defaults (diagnostic_context 
*context)
   context->opt_permissive = OPT_fpermissive;
 }
 
+/* Input charset configuration for diagnostics.  */
+static const char *
+c_common_input_charset_cb (const char * /*filename*/)
+{
+  const char *cs = cpp_opts->input_charset;
+  return cpp_input_conversion_is_trivial (cs) ? nullptr : cs;
+}
+
 /* Whether options from all C-family languages should be accepted
    quietly.  */
 static bool accept_all_c_family_options = false;
@@ -1136,6 +1144,11 @@ c_common_post_options (const char **pfilename)
   cpp_post_options (parse_in);
   init_global_opts_from_cpp (&global_options, cpp_get_options (parse_in));
 
+  /* Let diagnostics infrastructure know how to convert input files the same
+     way libcpp will do it, namely using the configured input charset and
+     skipping a UTF-8 BOM if present.  */
+  diagnostic_initialize_input_context (global_dc,
+                                      c_common_input_charset_cb, true);
   input_location = UNKNOWN_LOCATION;
 
   *pfilename = this_input_filename
diff --git a/gcc/coretypes.h b/gcc/coretypes.h
index 406572e947d..726fcaddda2 100644
--- a/gcc/coretypes.h
+++ b/gcc/coretypes.h
@@ -154,6 +154,7 @@ struct cl_option_handlers;
 struct diagnostic_context;
 class pretty_printer;
 class diagnostic_event_id_t;
+typedef const char * (*diagnostic_input_charset_callback)(const char *);
 
 template<typename T> struct array_traits;
 
diff --git a/gcc/d/d-lang.cc b/gcc/d/d-lang.cc
index 4386a489ff2..fa29a46ab1e 100644
--- a/gcc/d/d-lang.cc
+++ b/gcc/d/d-lang.cc
@@ -50,6 +50,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "output.h"
 #include "print-tree.h"
 #include "debug.h"
+#include "input.h"
 
 #include "d-tree.h"
 #include "id.h"
@@ -362,6 +363,19 @@ d_option_lang_mask (void)
   return CL_D;
 }
 
+/* Implements input charset and BOM skipping configuration for
+   diagnostics.  */
+static const char *d_input_charset_callback (const char * /*filename*/)
+{
+  /* TODO: The input charset is automatically determined by code in
+     dmd/dmodule.c based on the contents of the file.  If this detection
+     logic were factored out and could be reused here, then we would be able
+     to return UTF-16 or UTF-32 as needed here.  For now, we return always
+     NULL, which means no conversion is necessary, i.e. the input is assumed
+     to be UTF-8 when diagnostics read this file.  */
+  return nullptr;
+}
+
 /* Implements the lang_hooks.init routine for language D.  */
 
 static bool
@@ -373,6 +387,11 @@ d_init (void)
   Expression::_init ();
   Objc::_init ();
 
+  /* Diagnostics input init, to enable BOM skipping and
+     input charset conversion.  */
+  diagnostic_initialize_input_context (global_dc,
+                                      d_input_charset_callback, true);
+
   /* Back-end init.  */
   global_binding_level = ggc_cleared_alloc <binding_level> ();
   current_binding_level = global_binding_level;
diff --git a/gcc/diagnostic.c b/gcc/diagnostic.c
index 8361f68aace..b3afbeae648 100644
--- a/gcc/diagnostic.c
+++ b/gcc/diagnostic.c
@@ -293,6 +293,17 @@ diagnostic_urls_init (diagnostic_context *context, int 
value /*= -1 */)
     = determine_url_format ((diagnostic_url_rule_t) value);
 }
 
+/* Create the file_cache, if not already created, and tell it how to
+   translate files on input.  */
+void diagnostic_initialize_input_context (diagnostic_context *context,
+                                         diagnostic_input_charset_callback ccb,
+                                         bool should_skip_bom)
+{
+  if (!context->m_file_cache)
+    context->m_file_cache = new file_cache;
+  context->m_file_cache->initialize_input_context (ccb, should_skip_bom);
+}
+
 /* Do any cleaning up required after the last diagnostic is emitted.  */
 
 void
diff --git a/gcc/diagnostic.h b/gcc/diagnostic.h
index 7227dae1b6b..f90d20a4f5e 100644
--- a/gcc/diagnostic.h
+++ b/gcc/diagnostic.h
@@ -446,6 +446,25 @@ extern void diagnostic_show_locus (diagnostic_context *,
                                   diagnostic_t diagnostic_kind);
 extern void diagnostic_show_any_path (diagnostic_context *, diagnostic_info *);
 
+/* Because we read source files a second time after the frontend did it the
+   first time, we need to know how the frontend handled things like character
+   set conversion and UTF-8 BOM stripping, in order to make everything
+   consistent.  This function needs to be called by each frontend that requires
+   non-default behavior, to inform the diagnostics infrastructure how input is
+   to be processed.  The default behavior is to do no conversion and not to
+   strip a UTF-8 BOM.
+
+   The callback should return the input charset to be used to convert the given
+   file's contents to UTF-8, or it should return NULL if no conversion is 
needed
+   for this file.  SHOULD_SKIP_BOM only applies in case no conversion was
+   performed, and if true, it will cause a UTF-8 BOM to be skipped at the
+   beginning of the file.  (In case a conversion was performed, the BOM is
+   rather skipped as part of the conversion process.)  */
+
+void diagnostic_initialize_input_context (diagnostic_context *context,
+                                         diagnostic_input_charset_callback ccb,
+                                         bool should_skip_bom);
+
 /* Force diagnostics controlled by OPTIDX to be kind KIND.  */
 extern diagnostic_t diagnostic_classify_diagnostic (diagnostic_context *,
                                                    int /* optidx */,
diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
index 419cd6accbe..83c4517acdb 100644
--- a/gcc/fortran/cpp.c
+++ b/gcc/fortran/cpp.c
@@ -493,6 +493,12 @@ gfc_cpp_post_options (void)
 
   cpp_post_options (cpp_in);
 
+
+  /* Let diagnostics infrastructure know how to convert input files the same
+     way libcpp will do it, namely, with no charset conversion but with
+     skipping of a UTF-8 BOM if present.  */
+  diagnostic_initialize_input_context (global_dc, nullptr, true);
+
   gfc_cpp_register_include_paths ();
 }
 
diff --git a/gcc/input.c b/gcc/input.c
index de20d983d2c..4b809862e02 100644
--- a/gcc/input.c
+++ b/gcc/input.c
@@ -22,7 +22,6 @@ along with GCC; see the file COPYING3.  If not see
 #include "coretypes.h"
 #include "intl.h"
 #include "diagnostic.h"
-#include "diagnostic-core.h"
 #include "selftest.h"
 #include "cpplib.h"
 
@@ -30,6 +29,20 @@ along with GCC; see the file COPYING3.  If not see
 #define HAVE_ICONV 0
 #endif
 
+/* Input charset configuration.  */
+static const char *default_charset_callback (const char *)
+{
+  return nullptr;
+}
+
+void
+file_cache::initialize_input_context (diagnostic_input_charset_callback ccb,
+                                     bool should_skip_bom)
+{
+  in_context.ccb = (ccb ? ccb : default_charset_callback);
+  in_context.should_skip_bom = should_skip_bom;
+}
+
 /* This is a cache used by get_next_line to store the content of a
    file to be searched for file lines.  */
 class file_cache_slot
@@ -51,7 +64,8 @@ public:
 
   void inc_use_count () { m_use_count++; }
 
-  void create (const char *file_path, FILE *fp, unsigned highest_use_count);
+  bool create (const file_cache::input_context &in_context,
+              const char *file_path, FILE *fp, unsigned highest_use_count);
   void evict ();
 
  private:
@@ -110,6 +124,10 @@ public:
      far.  */
   char *m_data;
 
+  /* The allocated buffer to be freed may start a little earlier than DATA,
+     e.g. if a UTF8 BOM was skipped at the beginning.  */
+  int m_alloc_offset;
+
   /*  The size of the DATA array above.*/
   size_t m_size;
 
@@ -147,6 +165,17 @@ public:
      doesn't explode.  We thus scale total_lines down to
      line_record_size.  */
   vec<line_info, va_heap> m_line_record;
+
+  void offset_buffer (int offset)
+  {
+    gcc_assert (offset < 0 ? m_alloc_offset + offset >= 0
+               : (size_t) offset <= m_size);
+    gcc_assert (m_data);
+    m_alloc_offset += offset;
+    m_data += offset;
+    m_size -= offset;
+  }
+
 };
 
 /* Current position in real source file.  */
@@ -419,21 +448,25 @@ file_cache::add_file (const char *file_path)
 
   unsigned highest_use_count = 0;
   file_cache_slot *r = evicted_cache_tab_entry (&highest_use_count);
-  r->create (file_path, fp, highest_use_count);
+  if (!r->create (in_context, file_path, fp, highest_use_count))
+    return NULL;
   return r;
 }
 
 /* Populate this slot for use on FILE_PATH and FP, dropping any
    existing cached content within it.  */
 
-void
-file_cache_slot::create (const char *file_path, FILE *fp,
+bool
+file_cache_slot::create (const file_cache::input_context &in_context,
+                        const char *file_path, FILE *fp,
                         unsigned highest_use_count)
 {
   m_file_path = file_path;
   if (m_fp)
     fclose (m_fp);
   m_fp = fp;
+  if (m_alloc_offset)
+    offset_buffer (-m_alloc_offset);
   m_nb_read = 0;
   m_line_start_idx = 0;
   m_line_num = 0;
@@ -443,6 +476,36 @@ file_cache_slot::create (const char *file_path, FILE *fp,
   m_use_count = ++highest_use_count;
   m_total_lines = total_lines_num (file_path);
   m_missing_trailing_newline = true;
+
+
+  /* Check the input configuration to determine if we need to do any
+     transformations, such as charset conversion or BOM skipping.  */
+  if (const char *input_charset = in_context.ccb (file_path))
+    {
+      /* Need a full-blown conversion of the input charset.  */
+      fclose (m_fp);
+      m_fp = NULL;
+      const cpp_converted_source cs
+       = cpp_get_converted_source (file_path, input_charset);
+      if (!cs.data)
+       return false;
+      if (m_data)
+       XDELETEVEC (m_data);
+      m_data = cs.data;
+      m_nb_read = m_size = cs.len;
+      m_alloc_offset = cs.data - cs.to_free;
+    }
+  else if (in_context.should_skip_bom)
+    {
+      if (read_data ())
+       {
+         const int offset = cpp_check_utf8_bom (m_data, m_nb_read);
+         offset_buffer (offset);
+         m_nb_read -= offset;
+       }
+    }
+
+  return true;
 }
 
 /* file_cache's ctor.  */
@@ -450,6 +513,7 @@ file_cache_slot::create (const char *file_path, FILE *fp,
 file_cache::file_cache ()
 : m_file_slots (new file_cache_slot[num_file_slots])
 {
+  initialize_input_context (nullptr, false);
 }
 
 /* file_cache's dtor.  */
@@ -478,8 +542,8 @@ file_cache::lookup_or_add_file (const char *file_path)
 
 file_cache_slot::file_cache_slot ()
 : m_use_count (0), m_file_path (NULL), m_fp (NULL), m_data (0),
-  m_size (0), m_nb_read (0), m_line_start_idx (0), m_line_num (0),
-  m_total_lines (0), m_missing_trailing_newline (true)
+  m_alloc_offset (0), m_size (0), m_nb_read (0), m_line_start_idx (0),
+  m_line_num (0), m_total_lines (0), m_missing_trailing_newline (true)
 {
   m_line_record.create (0);
 }
@@ -495,6 +559,7 @@ file_cache_slot::~file_cache_slot ()
     }
   if (m_data)
     {
+      offset_buffer (-m_alloc_offset);
       XDELETEVEC (m_data);
       m_data = 0;
     }
@@ -509,7 +574,7 @@ file_cache_slot::~file_cache_slot ()
 bool
 file_cache_slot::needs_read_p () const
 {
-  return (m_nb_read == 0
+  return m_fp && (m_nb_read == 0
          || m_nb_read == m_size
          || (m_line_start_idx >= m_nb_read - 1));
 }
@@ -531,9 +596,20 @@ file_cache_slot::maybe_grow ()
   if (!needs_grow_p ())
     return;
 
-  size_t size = m_size == 0 ? buffer_size : m_size * 2;
-  m_data = XRESIZEVEC (char, m_data, size);
-  m_size = size;
+  if (!m_data)
+    {
+      gcc_assert (m_size == 0 && m_alloc_offset == 0);
+      m_size = buffer_size;
+      m_data = XNEWVEC (char, m_size);
+    }
+  else
+    {
+      const int offset = m_alloc_offset;
+      offset_buffer (-offset);
+      m_size *= 2;
+      m_data = XRESIZEVEC (char, m_data, m_size);
+      offset_buffer (offset);
+    }
 }
 
 /*  Read more data into the cache.  Extends the cache if need be.
@@ -632,7 +708,7 @@ file_cache_slot::get_next_line (char **line, ssize_t 
*line_len)
       m_missing_trailing_newline = false;
     }
 
-  if (ferror (m_fp))
+  if (m_fp && ferror (m_fp))
     return false;
 
   /* At this point, we've found the end of the of line.  It either
diff --git a/gcc/input.h b/gcc/input.h
index bbcec84c521..e6881072c5f 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -111,6 +111,15 @@ class file_cache
   file_cache_slot *lookup_or_add_file (const char *file_path);
   void forcibly_evict_file (const char *file_path);
 
+  /* See comments in diagnostic.h about the input conversion context.  */
+  struct input_context
+  {
+    diagnostic_input_charset_callback ccb;
+    bool should_skip_bom;
+  };
+  void initialize_input_context (diagnostic_input_charset_callback ccb,
+                                bool should_skip_bom);
+
  private:
   file_cache_slot *evicted_cache_tab_entry (unsigned *highest_use_count);
   file_cache_slot *add_file (const char *file_path);
@@ -119,6 +128,7 @@ class file_cache
  private:
   static const size_t num_file_slots = 16;
   file_cache_slot *m_file_slots;
+  input_context in_context;
 };
 
 extern expanded_location
diff --git a/libcpp/charset.c b/libcpp/charset.c
index 99a9b73e5ab..61881f978a8 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -630,7 +630,11 @@ static const struct cpp_conversion conversion_tab[] = {
    cset_converter structure for conversion from FROM to TO.  If
    iconv_open() fails, issue an error and return an identity
    converter.  Silently return an identity converter if FROM and TO
-   are identical.  */
+   are identical.
+
+   PFILE is only used for generating diagnostics; setting it to NULL
+   suppresses diagnostics.  */
+
 static struct cset_converter
 init_iconv_desc (cpp_reader *pfile, const char *to, const char *from)
 {
@@ -672,25 +676,31 @@ init_iconv_desc (cpp_reader *pfile, const char *to, const 
char *from)
 
       if (ret.cd == (iconv_t) -1)
        {
-         if (errno == EINVAL)
-           cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
-                      "conversion from %s to %s not supported by iconv",
-                      from, to);
-         else
-           cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
-
+         if (pfile)
+           {
+             if (errno == EINVAL)
+               cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */
+                          "conversion from %s to %s not supported by iconv",
+                          from, to);
+             else
+               cpp_errno (pfile, CPP_DL_ERROR, "iconv_open");
+           }
          ret.func = convert_no_conversion;
        }
     }
   else
     {
-      cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
-                "no iconv implementation, cannot convert from %s to %s",
-                from, to);
+      if (pfile)
+       {
+         cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */
+                    "no iconv implementation, cannot convert from %s to %s",
+                    from, to);
+       }
       ret.func = convert_no_conversion;
       ret.cd = (iconv_t) -1;
       ret.width = -1;
     }
+
   return ret;
 }
 
@@ -2122,6 +2132,25 @@ _cpp_interpret_identifier (cpp_reader *pfile, const 
uchar *id, size_t len)
                                  buf, bufp - buf, HT_ALLOC));
 }
 
+
+/* Utility to strip a UTF-8 byte order marking from the beginning
+   of a buffer.  Returns the number of bytes to skip, which currently
+   will be either 0 or 3.  */
+int
+cpp_check_utf8_bom (const char *data, size_t data_length)
+{
+
+#if HOST_CHARSET == HOST_CHARSET_ASCII
+  const unsigned char *udata = (const unsigned char *) data;
+  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
+      && udata[2] == 0xbf)
+    return 3;
+#endif
+
+  return 0;
+}
+
+
 /* Convert an input buffer (containing the complete contents of one
    source file) from INPUT_CHARSET to the source character set.  INPUT
    points to the input buffer, SIZE is its allocated size, and LEN is
@@ -2135,7 +2164,11 @@ _cpp_interpret_identifier (cpp_reader *pfile, const 
uchar *id, size_t len)
    INPUT is expected to have been allocated with xmalloc.  This
    function will either set *BUFFER_START to INPUT, or free it and set
    *BUFFER_START to a pointer to another xmalloc-allocated block of
-   memory.  */
+   memory.
+
+   PFILE is only used to generate diagnostics; setting it to NULL suppresses
+   diagnostics, and causes a return of NULL if there was any error instead.  */
+
 uchar * 
 _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
                    uchar *input, size_t size, size_t len,
@@ -2158,17 +2191,27 @@ _cpp_convert_input (cpp_reader *pfile, const char 
*input_charset,
       to.text = XNEWVEC (uchar, to.asize);
       to.len = 0;
 
-      if (!APPLY_CONVERSION (input_cset, input, len, &to))
-       cpp_error (pfile, CPP_DL_ERROR,
-                  "failure to convert %s to %s",
-                  CPP_OPTION (pfile, input_charset), SOURCE_CHARSET);
-
+      const bool ok = APPLY_CONVERSION (input_cset, input, len, &to);
       free (input);
-    }
 
-  /* Clean up the mess.  */
-  if (input_cset.func == convert_using_iconv)
-    iconv_close (input_cset.cd);
+      /* Clean up the mess.  */
+      if (input_cset.func == convert_using_iconv)
+       iconv_close (input_cset.cd);
+
+      /* Handle conversion failure.  */
+      if (!ok)
+       {
+         if (!pfile)
+           {
+             XDELETEVEC (to.text);
+             *buffer_start = NULL;
+             *st_size = 0;
+             return NULL;
+           }
+         cpp_error (pfile, CPP_DL_ERROR, "failure to convert %s to %s",
+                    input_charset, SOURCE_CHARSET);
+       }
+    }
 
   /* Resize buffer if we allocated substantially too much, or if we
      haven't enough space for the \n-terminator or following
@@ -2192,19 +2235,14 @@ _cpp_convert_input (cpp_reader *pfile, const char 
*input_charset,
 
   buffer = to.text;
   *st_size = to.len;
-#if HOST_CHARSET == HOST_CHARSET_ASCII
-  /* The HOST_CHARSET test just above ensures that the source charset
-     is UTF-8.  So, ignore a UTF-8 BOM if we see one.  Note that
-     glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
+
+  /* Ignore a UTF-8 BOM if we see one and the source charset is UTF-8.  Note
+     that glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a
      BOM -- however, even if it did, we would still need this code due
      to the 'convert_no_conversion' case.  */
-  if (to.len >= 3 && to.text[0] == 0xef && to.text[1] == 0xbb
-      && to.text[2] == 0xbf)
-    {
-      *st_size -= 3;
-      buffer += 3;
-    }
-#endif
+  const int bom_len = cpp_check_utf8_bom ((const char *) to.text, to.len);
+  *st_size -= bom_len;
+  buffer += bom_len;
 
   *buffer_start = to.text;
   return buffer;
@@ -2244,6 +2282,13 @@ _cpp_default_encoding (void)
   return current_encoding;
 }
 
+/* Check if the configured input charset requires no conversion, other than
+   possibly stripping a UTF-8 BOM.  */
+bool cpp_input_conversion_is_trivial (const char *input_charset)
+{
+  return !strcasecmp (input_charset, SOURCE_CHARSET);
+}
+
 /* Implementation of class cpp_string_location_reader.  */
 
 /* Constructor for cpp_string_location_reader.  */
diff --git a/libcpp/files.c b/libcpp/files.c
index 6e20fc5887f..c93a03c69ef 100644
--- a/libcpp/files.c
+++ b/libcpp/files.c
@@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile, _cpp_file 
*file,
 static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
                              bool *invalid_pch, location_t loc);
 static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
-                           location_t loc);
+                           location_t loc, const char *input_charset);
 static bool read_file (cpp_reader *pfile, _cpp_file *file,
                       location_t loc);
 static struct cpp_dir *search_path_head (cpp_reader *, const char *fname,
@@ -671,9 +671,12 @@ _cpp_find_file (cpp_reader *pfile, const char *fname, 
cpp_dir *start_dir,
 
    Use LOC for any diagnostics.
 
+   PFILE may be NULL.  In this case, no diagnostics are issued.
+
    FIXME: Flush file cache and try again if we run out of memory.  */
 static bool
-read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
+read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
+               const char *input_charset)
 {
   ssize_t size, total, count;
   uchar *buf;
@@ -681,8 +684,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, 
location_t loc)
 
   if (S_ISBLK (file->st.st_mode))
     {
-      cpp_error_at (pfile, CPP_DL_ERROR, loc,
-                   "%s is a block device", file->path);
+      if (pfile)
+       cpp_error_at (pfile, CPP_DL_ERROR, loc,
+                     "%s is a block device", file->path);
       return false;
     }
 
@@ -699,8 +703,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, 
location_t loc)
         does not bite us.  */
       if (file->st.st_size > INTTYPE_MAXIMUM (ssize_t))
        {
-         cpp_error_at (pfile, CPP_DL_ERROR, loc,
-                       "%s is too large", file->path);
+         if (pfile)
+           cpp_error_at (pfile, CPP_DL_ERROR, loc,
+                         "%s is too large", file->path);
          return false;
        }
 
@@ -733,29 +738,29 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, 
location_t loc)
 
   if (count < 0)
     {
-      cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
+      if (pfile)
+       cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc);
       free (buf);
       return false;
     }
 
-  if (regular && total != size && STAT_SIZE_RELIABLE (file->st))
+  if (pfile && regular && total != size && STAT_SIZE_RELIABLE (file->st))
     cpp_error_at (pfile, CPP_DL_WARNING, loc,
               "%s is shorter than expected", file->path);
 
   file->buffer = _cpp_convert_input (pfile,
-                                    CPP_OPTION (pfile, input_charset),
+                                    input_charset,
                                     buf, size + 16, total,
                                     &file->buffer_start,
                                     &file->st.st_size);
-  file->buffer_valid = true;
-
-  return true;
+  file->buffer_valid = file->buffer;
+  return file->buffer_valid;
 }
 
 /* Convenience wrapper around read_file_guts that opens the file if
    necessary and closes the file descriptor after reading.  FILE must
    have been passed through find_file() at some stage.  Use LOC for
-   any diagnostics.  */
+   any diagnostics.  Unlike read_file_guts(), PFILE may not be NULL.  */
 static bool
 read_file (cpp_reader *pfile, _cpp_file *file, location_t loc)
 {
@@ -773,7 +778,8 @@ read_file (cpp_reader *pfile, _cpp_file *file, location_t 
loc)
       return false;
     }
 
-  file->dont_read = !read_file_guts (pfile, file, loc);
+  file->dont_read = !read_file_guts (pfile, file, loc,
+                                    CPP_OPTION (pfile, input_charset));
   close (file->fd);
   file->fd = -1;
 
@@ -2145,3 +2151,25 @@ _cpp_has_header (cpp_reader *pfile, const char *fname, 
int angle_brackets,
   return file->err_no != ENOENT;
 }
 
+/* Read a file and convert to input charset, the same as if it were being read
+   by a cpp_reader.  */
+
+cpp_converted_source
+cpp_get_converted_source (const char *fname, const char *input_charset)
+{
+  cpp_converted_source res = {};
+  _cpp_file file = {};
+  file.fd = -1;
+  file.name = lbasename (fname);
+  file.path = fname;
+  if (!open_file (&file))
+    return res;
+  const bool ok = read_file_guts (NULL, &file, 0, input_charset);
+  close (file.fd);
+  if (!ok)
+    return res;
+  res.to_free = (char *) file.buffer_start;
+  res.data = (char *) file.buffer;
+  res.len = file.st.st_size;
+  return res;
+}
diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
index 7e840635a38..af1429171ea 100644
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -1379,6 +1379,20 @@ extern struct _cpp_file *cpp_get_file (cpp_buffer *);
 extern cpp_buffer *cpp_get_prev (cpp_buffer *);
 extern void cpp_clear_file_cache (cpp_reader *);
 
+/* cpp_get_converted_source returns the contents of the given file, as it 
exists
+   after cpplib has read it and converted it from the input charset to the
+   source charset.  Return struct will be zero-filled if the data could not be
+   read for any reason.  The data starts at the DATA pointer, but the TO_FREE
+   pointer is what should be passed to free(), as there may be an offset.  */
+struct cpp_converted_source
+{
+  char *to_free;
+  char *data;
+  size_t len;
+};
+cpp_converted_source cpp_get_converted_source (const char *fname,
+                                              const char *input_charset);
+
 /* In pch.c */
 struct save_macro_data;
 extern int cpp_save_state (cpp_reader *, FILE *);
@@ -1449,6 +1463,7 @@ class cpp_display_width_computation {
 /* Convenience functions that are simple use cases for class
    cpp_display_width_computation.  Tab characters will be expanded to spaces
    as determined by TABSTOP.  */
+
 int cpp_byte_column_to_display_column (const char *data, int data_length,
                                       int column, int tabstop);
 inline int cpp_display_width (const char *data, int data_length,
@@ -1461,4 +1476,7 @@ int cpp_display_column_to_byte_column (const char *data, 
int data_length,
                                       int display_col, int tabstop);
 int cpp_wcwidth (cppchar_t c);
 
+bool cpp_input_conversion_is_trivial (const char *input_charset);
+int cpp_check_utf8_bom (const char *data, size_t data_length);
+
 #endif /* ! LIBCPP_CPPLIB_H */

diagnostics-input-charset-v3-part2.txt.gz
Description: application/gunzip

Re: [PATCH] diagnostics: Support for -finput-charset [PR93067]

Reply via email to