Hello- I thought it might be a good time to check on this patch please? Thanks! https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576449.html
-Lewis On Fri, Jul 30, 2021 at 4:13 PM Lewis Hyatt <lhy...@gmail.com> wrote: > > On Fri, Jan 29, 2021 at 10:46:30AM -0500, Lewis Hyatt wrote: > > On Tue, Jan 26, 2021 at 04:02:52PM -0500, David Malcolm wrote: > > > On Fri, 2020-12-18 at 18:03 -0500, Lewis Hyatt wrote: > > > > Hello- > > > > > > > > The attached patch addresses PR93067: > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067#c0 > > > > > > > > This is similar to the patch I posted last year on the PR, with some > > > tweaks > > > > to make it a little simpler. Recapping some of the commentary on the > > > PR: > > > > > > > > When source lines are needed for diagnostics output, they are > > > retrieved from > > > > the source file by the fcache infrastructure in input.c, since libcpp > > > has > > > > generally already forgotten them (plus not all front ends are using > > > > libcpp). This infrastructure does not read the files in the same way > > > as > > > > libcpp does; in particular, it does not translate the encoding as > > > requested > > > > by -finput-charset, and it does not strip a UTF-8 byte-order mark if > > > > present. The patch adds this ability. My thinking in deciding how to > > > do it > > > > was the following: > > > > > > > > - Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer > > > still, > > > > so this patch should try hard not to introduce any worse > > > performance > > > > unless these things are needed. > > > > > > > > - It is desirable to reuse libcpp's encoding infrastructure from > > > charset.c > > > > rather than repeat it in input.c. (Notably, libcpp uses iconv but > > > it also > > > > has hand-coded routines for certain charsets to make sure they are > > > > available.) > > > > > > > > - There is a performance degradation required in order to make use of > > > libcpp > > > > directly, because the input.c infrastructure only reads as much of > > > the > > > > source file as necessary, whereas libcpp interfaces as-is require > > > to read > > > > the entire file into memory. > > > > > > > > - It can't be quite as simple as just "only delegate to libcpp if > > > > -finput-charset was specified", because the stripping of the UTF-8 > > > BOM has > > > > to happen with or without this option. > > > > > > > > - So it seemed a reasonable compromise to me, if -finput-charset is > > > > specified, then use libcpp to convert the file, otherwise, strip > > > the BOM > > > > in input.c and then process the file the same way it is done now. > > > There's > > > > a little bit of leakage of charset logic from libcpp this way (for > > > the > > > > BOM), but it seems worthwhile, since otherwise, diagnostics would > > > always > > > > be reading the entire file into memory, which is not a cost paid > > > > currently. > > > > > > Thanks for the patch; sorry about the delay in reviewing it. > > > > > > > Thanks for the comments! Here is an updated patch that addresses your > > feedback, plus some responses inline below. > > > > Bootstrap + regtest all languages was done on x86-64 GNU/Linux. All tests > > the same before and after, plus 6 new PASS. > > > > FAIL 85 85 > > PASS 479191 479197 > > UNSUPPORTED 13664 13664 > > UNTESTED 129 129 > > XFAIL 2292 2292 > > XPASS 30 30 > > > > > > > This mostly seems good to me. > > > > > > One aspect I'm not quite convinced about is the > > > input_cpp_context.in_use flag. The input.c machinery is used by > > > diagnostics, and so could be used by middle-end warnings for frontends > > > that don't use libcpp. Presumably we'd still want to remove the UTF-8 > > > BOM for those, and do encoding fixups if necessary - is it more a case > > > of initializing things to express what the expected input charset is? > > > (since that is part of the cpp_options) > > > > > > c.opt has: > > > finput-charset= > > > C ObjC C++ ObjC++ Joined RejectNegative > > > -finput-charset=<cset> Specify the default character set for > > > source files. > > > > > > I believe that D and Go are the two frontends that don't use libcpp for > > > parsing. I believe Go source is required to be UTF-8 (unsurprisingly > > > considering the heritage of both). I don't know what source encodings > > > D supports. > > > > > > > For this patch I was rather singularly focused on libcpp, so I looked > > deeper at the other frontends now. It seems to me that there are basically > > two questions to answer, and the three frontend styles answer this pair in > > three different ways. > > > > Q1: What is the input charset? > > A1: > > > > libcpp: Whatever was passed to -finput-charset (note, for Fortran, > > -finput-charset is not supported though.) > > > > go: Assume UTF-8. > > > > D: UTF-8, UTF-16, or UTF-32 (the latter two in either > > endianness); determined by inspecting the first bytes of the file. > > > > Q2: How should a UTF-8 BOM be handled? > > A2: > > > > libcpp: Treat entirely the same, as if it was not present at all. So > > a diagnostic referring to the first non-BOM character in the file will > > point to column 1, not to column 4. > > > > go: Treat it like whitespace, ignored for parsing purposes, but still > > logically part of the file. A diagnostic referring to the first non-BOM > > character in the file will point to column 4. > > > > D: Same as libcpp. > > > > So input.c's current behavior (with or without my patch) actually matches > > the "go" frontend exactly and does the right thing for it. As you > > thought, though, for D it would be desirable to skip over the UTF-8 BOM > > too. > > > > D adds an additional wrinkle in that, instead of using -finput-charset, it > > uses its own detection logic -- based on knowledge that the first codepoint > > in the file must be ASCII, it is able to deduce the encoding from the first > > few bytes. This means that in D, each source file can have a different > > encoding, which is not expressible in libcpp frontends currently, and hence > > the notion of a global input_cpp_context with a constant charset is not > > sufficient to capture this. > > > > In this iteration of the patch, I replaced the input_cpp_context with a more > > general input_context that can handle all of these cases. The context now > > consists of a callback function and a bool, which the frontend is supposed > > to configure. The bool determines whether or not to skip the BOM. The > > callback, when passed a filename, returns the input charset needed to > > convert that file. For libcpp, the filename is not used as the charset is > > determined from -finput-charset for all files. For D, the callback function > > is currently a stub, but it could be expanded to open the file, read the > > first few bytes, and determine the charset to be used. I have not > > implemented it for now, because it seems inelegant to duplicate the logic D > > already has for this detection, but this logic is part of the dmd/ tree > > which I think is maintained external to GCC, and so it didn't seem right to > > attempt to refactor it. I figured there may not be much interest in this > > feature (diagnostics are already unusable for UTF-16 and UTF-32 sources in > > D), and this patch makes no change to it. This patch does fix the handling > > of a UTF-8 BOM in D diagnostics, however. > > > > > > Separate from the main patch are two testcases that both fail before > > > this > > > > patch and pass after. I attached them gzipped because they use non- > > > standard > > > > encodings that won't email well. > > > > > > > > The main question I have about the patch is whether I chose a good > > > way to > > > > address the following complication. location_get_source_line() in > > > input.c is > > > > used to generate diagnostics for all front ends, whether they use > > > libcpp to > > > > read the files or not. So input.c needs some way to know whether > > > libcpp is > > > > in use or not. I ended up adding a new function > > > input_initialize_cpp_context(), > > > > which front ends have to call if they are using libcpp to read their > > > > files. Currently that is c-family and fortran. I don't see a simpler > > > way > > > > to do it at least... Even if there were a better way for input.c to > > > find out > > > > the value passed to -finput-charset, it would still need to know > > > whether > > > > libcpp is being used or not. > > > > > > At some point I want to convert the global state in input.c into a > > > source_manager class, probably hung off the diagnostic_context, so the > > > natural place to put that state would be in there (rather than the > > > fcache being global state). Perhaps have a source_reader policy class > > > hung off of the source_manager, which would have responsibility for > > > reading and converting the file, either piecemeal, or fully for the > > > "need to use libcpp" case. > > > > > > But that's not necessary for this fix. > > > > > > > Please let me know if it looks OK (either now or for stage 1, > > > whatever makes > > > > sense...) bootstrap + regtest all languages on x86-64 GNU/Linux, all > > > tests the > > > > same before & after plus 6 new PASS from the new tests. Thanks! > > > > > > Various comments inline below. In particular, does the patch add a > > > memory leak in _cpp_convert_input? > > > > > > Thanks > > > Dave > > > > > > [...] > > > > > > > libcpp/ChangeLog: > > > > > > > > PR other/93067 > > > > * charset.c (init_iconv_desc): Adapt to permit PFILE argument > > > to > > > > be NULL. > > > > (_cpp_convert_input): Likewise. Also move UTF-8 BOM logic > > > to... > > > > (cpp_check_utf8_bom): ...here. New function. > > > > (cpp_input_conversion_is_trivial): New function. > > > > * files.c (read_file_guts): Allow PFILE argument to be NULL. > > > Add > > > > INPUT_CHARSET argument as an alternate source of this > > > information. > > > > (cpp_get_converted_source): New function. > > > > * include/cpplib.h (struct cpp_converted_source): Declare. > > > > (cpp_get_converted_source): Declare. > > > > (cpp_input_conversion_is_trivial): Declare. > > > > (cpp_check_utf8_bom): Declare. > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > PR other/93067 > > > > * gcc.dg/diagnostic-input-charset-1.c: New test. > > > > * gcc.dg/diagnostic-input-charset-2.c: New test. > > > > > > Maybe rename the 2nd test to "diagnostic-input-utf8-bom.c" ? > > > > > > > Done. > > > > > > > > > > diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c > > > > index 59cabd12407..d5aa7859cc1 100644 > > > > --- a/gcc/c-family/c-opts.c > > > > +++ b/gcc/c-family/c-opts.c > > > > @@ -1124,6 +1124,10 @@ c_common_post_options (const char **pfilename) > > > > cpp_post_options (parse_in); > > > > init_global_opts_from_cpp (&global_options, cpp_get_options > > > (parse_in)); > > > > > > > > + /* Let diagnostics infrastructure know we are using libcpp to read > > > > + the input. */ > > > > + input_initialize_cpp_context (parse_in); > > > > > > As noted above I think the most significant thing here is getting the > > > source encoding from parse_in, so maybe reword the comment accordingly. > > > > > > [...] > > > > > > > diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c > > > > index 51baf141711..2b12a98afc0 100644 > > > > --- a/gcc/fortran/cpp.c > > > > +++ b/gcc/fortran/cpp.c > > > > @@ -493,6 +493,10 @@ gfc_cpp_post_options (void) > > > > > > > > cpp_post_options (cpp_in); > > > > > > > > + /* Let diagnostics infrastructure know we are using libcpp to read > > > > + the input. */ > > > > + input_initialize_cpp_context (cpp_in); > > > > > > Likewise. > > > > > > > I kept this in mind when redoing this part, hopefully it is better now. > > > > > [...] > > > > > > > diff --git a/gcc/input.c b/gcc/input.c > > > > index 29d10f06b86..1dcdd464bc1 100644 > > > > --- a/gcc/input.c > > > > +++ b/gcc/input.c > > > > > > [...] > > > > > > > @@ -394,6 +432,42 @@ add_file_to_cache_tab (const char *file_path) > > > > r->total_lines = total_lines_num (file_path); > > > > r->missing_trailing_newline = true; > > > > > > > > + /* If libcpp is managing the reading, then there are two cases we > > > need to > > > > + consider. If -finput-charset is not in effect, then we just > > > need to > > > > + strip a UTF-8 BOM, so do that ourselves rather than calling > > > into libcpp so > > > > + as to avoid paying the penalty of using libcpp, namely that the > > > entire file > > > > + must be read at once. In the (generally rare) case that a non- > > > trivial > > > > + -finput-charset is needed, then go ahead and use libcpp to read > > > the whole > > > > + file and do the conversion. */ > > > > + if (input_cpp_context.in_use) > > > > + { > > > > + if (input_cpp_context.conversion_is_trivial) > > > > + { > > > > + /* Strip the UTF8 BOM if present. */ > > > > + if (read_data (r)) > > > > + { > > > > + const int offset = cpp_check_utf8_bom (r->data, r- > > > >nb_read); > > > > + r->offset_buffer (offset); > > > > + r->nb_read -= offset; > > > > + } > > > > > > This assumes that trivial conversions are always UTF8 to UTF8, which > > > assumes that SOURCE_CHARSET is UTF8, which isn't the case for EBCDIC... > > > > > > [...] > > > > > > > +/* Utility to strip a UTF-8 byte order marking from the beginning > > > > + of a buffer. Returns the number of bytes to skip, which > > > currently > > > > + will be either 0 or 3. */ > > > > +int > > > > +cpp_check_utf8_bom (const char *data, size_t data_length) > > > > +{ > > > > + > > > > +#if HOST_CHARSET == HOST_CHARSET_ASCII > > > > + const unsigned char *udata = (const unsigned char *) data; > > > > + if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb > > > > + && udata[2] == 0xbf) > > > > + return 3; > > > > +#endif > > > > + > > > > + return 0; > > > > +} > > > > > > ...though the check on HOST_CHARSET == HOST_CHARSET_ASCII makes this a > > > no-op for the EBCDIC case, so that's OK. > > > > > > [...] > > > > > > > @@ -2158,17 +2191,28 @@ _cpp_convert_input (cpp_reader *pfile, const > > > char *input_charset, > > > > to.text = XNEWVEC (uchar, to.asize); > > > > to.len = 0; > > > > > > > > - if (!APPLY_CONVERSION (input_cset, input, len, &to)) > > > > - cpp_error (pfile, CPP_DL_ERROR, > > > > - "failure to convert %s to %s", > > > > - CPP_OPTION (pfile, input_charset), > > > SOURCE_CHARSET); > > > > + const bool ok = APPLY_CONVERSION (input_cset, input, len, > > > &to); > > > > > > > > - free (input); > > > > - } > > > > + /* Handle conversion failure. */ > > > > + if (!ok) > > > > + { > > > > + free (input); > > > > + if (!pfile) > > > > + { > > > > + XDELETEVEC (to.text); > > > > + *buffer_start = NULL; > > > > + *st_size = 0; > > > > + return NULL; > > > > + } > > > > + cpp_error (pfile, CPP_DL_ERROR, > > > > + "failure to convert %s to %s", > > > > + CPP_OPTION (pfile, input_charset), > > > SOURCE_CHARSET); > > > > + } > > > > + } > > > > > > > > > > Doesn't this lose the > > > free (input); > > > for the case where the conversion is successful? Is this a leak? > > > > > > > Ugh, sorry to waste your time with that mistake. Fixed. > > > > > [...] > > > > > > > diff --git a/libcpp/files.c b/libcpp/files.c > > > > index 301b2379a23..178bb9ed1e6 100644 > > > > --- a/libcpp/files.c > > > > +++ b/libcpp/files.c > > > > @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile, > > > _cpp_file *file, > > > > static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file, > > > > bool *invalid_pch, location_t loc); > > > > static bool read_file_guts (cpp_reader *pfile, _cpp_file *file, > > > > - location_t loc); > > > > + location_t loc, const char *input_charset > > > = NULL); > > > > > > read_file_guts is only used in one place before the patch, so rather > > > than add a default argument, I think it's simpler to pass in the > > > input_charset at that call. > > > > > > > @@ -671,18 +671,32 @@ _cpp_find_file (cpp_reader *pfile, const char > > > *fname, cpp_dir *start_dir, > > > > > > > > Use LOC for any diagnostics. > > > > > > > > + The input charset may be specified in the INPUT_CHARSET argument, > > > or > > > > + else it will be taken from PFILE. > > > > > > ...and doing so means that input_charset will be non-NULL at all > > > callers. > > > > > > > + PFILE may be NULL. In this case, no diagnostics are issued, and > > > the > > > > + input charset must be specified in the arguments. > > > > + > > > > FIXME: Flush file cache and try again if we run out of memory. > > > */ > > > > static bool > > > > -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc) > > > > +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc, > > > > + const char *input_charset) > > > > { > > > > ssize_t size, total, count; > > > > uchar *buf; > > > > bool regular; > > > > > > > > + if (!input_charset) > > > > + { > > > > + gcc_assert (pfile); > > > > + input_charset = CPP_OPTION (pfile, input_charset); > > > > + } > > > > > > ...which would allow for this logic to be removed, moving this to the > > > existing caller, I believe. > > > > > > > Much better your way, done. > > > > Thanks for taking a look at it, sorry this version is a bit changed from the > > previous one. FWIW, the libcpp pieces were not touched other than addressing > > your comments; the new logic is in input.h/input.c plus the associated > > tweaks to the frontends. > > > > -Lewis > > > From: Lewis Hyatt <lhy...@gmail.com> > > Date: Wed, 27 Jan 2021 18:23:13 -0500 > > Subject: [PATCH] diagnostics: Support for -finput-charset [PR93067] > > > > Adds the logic to handle -finput-charset in layout_get_source_line(), so > > that > > source lines are converted from their input encodings prior to being output > > by > > diagnostics machinery. Also adds the ability to strip a UTF-8 BOM similarly. > > > > gcc/c-family/ChangeLog: > > > > PR other/93067 > > * c-opts.c (c_common_input_charset_cb): New function. > > (c_common_post_options): Call new function input_initialize_context(). > > > > gcc/d/ChangeLog: > > > > PR other/93067 > > * d-lang.cc (d_input_charset_callback): New function. > > (d_init): Call new function input_initialize_context(). > > > > gcc/fortran/ChangeLog: > > > > PR other/93067 > > * cpp.c (gfc_cpp_post_options): Call new function > > input_initialize_cpp_context(). > > > > gcc/ChangeLog: > > > > PR other/93067 > > * input.c (default_charset_callback): New function. > > (input_initialize_context): New function. > > (read_data): Add prototype. > > (add_file_to_cache_tab): Use libcpp to convert input encoding when > > needed. Strip UTF-8 BOM when needed. > > (class fcache): Add new members to track input encoding conversion. > > (fcache::fcache): Adapt for new members. > > (fcache::~fcache): Likewise. > > (maybe_grow): Likewise. > > (needs_read): Adapt to be aware that fp member may be NULL now. > > (get_next_line): Likewise. > > * input.h (input_initialize_context): Declare. > > > > libcpp/ChangeLog: > > > > PR other/93067 > > * charset.c (init_iconv_desc): Adapt to permit PFILE argument to > > be NULL. > > (_cpp_convert_input): Likewise. Also move UTF-8 BOM logic to... > > (cpp_check_utf8_bom): ...here. New function. > > (cpp_input_conversion_is_trivial): New function. > > * files.c (read_file_guts): Allow PFILE argument to be NULL. Add > > INPUT_CHARSET argument as an alternate source of this information. > > (read_file): Pass the new argument to read_file_guts. > > (cpp_get_converted_source): New function. > > * include/cpplib.h (struct cpp_converted_source): Declare. > > (cpp_get_converted_source): Declare. > > (cpp_input_conversion_is_trivial): Declare. > > (cpp_check_utf8_bom): Declare. > > > > gcc/testsuite/ChangeLog: > > > > PR other/93067 > > * gcc.dg/diagnostic-input-charset-1.c: New test. > > * gcc.dg/diagnostic-input-utf8-bom.c: New test. > > > > diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c > > index bd15b9cd902..2f58e8413b9 100644 > > --- a/gcc/c-family/c-opts.c > > +++ b/gcc/c-family/c-opts.c > > @@ -188,6 +188,14 @@ c_common_diagnostics_set_defaults (diagnostic_context > > *context) > > context->opt_permissive = OPT_fpermissive; > > } > > > > +/* Input charset configuration for diagnostics. */ > > +static const char * > > +c_common_input_charset_cb (const char * /*filename*/) > > +{ > > + const char *cs = cpp_opts->input_charset; > > + return cpp_input_conversion_is_trivial (cs) ? NULL : cs; > > +} > > + > > /* Whether options from all C-family languages should be accepted > > quietly. */ > > static bool accept_all_c_family_options = false; > > @@ -1131,6 +1139,10 @@ c_common_post_options (const char **pfilename) > > cpp_post_options (parse_in); > > init_global_opts_from_cpp (&global_options, cpp_get_options (parse_in)); > > > > + /* Let diagnostics infrastructure know how to convert input files the > > same > > + way libcpp will do it, namely using the configured input charset and > > + skipping a UTF-8 BOM if present. */ > > + input_initialize_context (c_common_input_charset_cb, true); > > input_location = UNKNOWN_LOCATION; > > > > *pfilename = this_input_filename > > diff --git a/gcc/d/d-lang.cc b/gcc/d/d-lang.cc > > index 0fd207da7f3..9f170c09886 100644 > > --- a/gcc/d/d-lang.cc > > +++ b/gcc/d/d-lang.cc > > @@ -50,6 +50,7 @@ along with GCC; see the file COPYING3. If not see > > #include "output.h" > > #include "print-tree.h" > > #include "debug.h" > > +#include "input.h" > > > > #include "d-tree.h" > > #include "id.h" > > @@ -362,6 +363,19 @@ d_option_lang_mask (void) > > return CL_D; > > } > > > > +/* Implements input charset and BOM skipping configuration for > > + diagnostics. */ > > +static const char *d_input_charset_callback (const char * /*filename*/) > > +{ > > + /* TODO: The input charset is automatically determined by code in > > + dmd/dmodule.c based on the contents of the file. If this detection > > + logic were factored out and could be reused here, then we would be > > able > > + to return UTF-16 or UTF-32 as needed here. For now, we return always > > + NULL, which means no conversion is necessary, i.e. the input is > > assumed > > + to be UTF-8 when diagnostics read this file. */ > > + return NULL; > > +} > > + > > /* Implements the lang_hooks.init routine for language D. */ > > > > static bool > > @@ -373,6 +387,10 @@ d_init (void) > > Expression::_init (); > > Objc::_init (); > > > > + /* Diagnostics input init, to enable BOM skipping and > > + input charset conversion. */ > > + input_initialize_context (d_input_charset_callback, true); > > + > > /* Back-end init. */ > > global_binding_level = ggc_cleared_alloc <binding_level> (); > > current_binding_level = global_binding_level; > > diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c > > index 419cd6accbe..8b564888673 100644 > > --- a/gcc/fortran/cpp.c > > +++ b/gcc/fortran/cpp.c > > @@ -493,6 +493,12 @@ gfc_cpp_post_options (void) > > > > cpp_post_options (cpp_in); > > > > + > > + /* Let diagnostics infrastructure know how to convert input files the > > same > > + way libcpp will do it, namely, with no charset conversion but with > > + skipping of a UTF-8 BOM if present. */ > > + input_initialize_context (NULL, true); > > + > > gfc_cpp_register_include_paths (); > > } > > > > diff --git a/gcc/input.c b/gcc/input.c > > index 9e39e7df83c..f8f6dc7e4ca 100644 > > --- a/gcc/input.c > > +++ b/gcc/input.c > > @@ -30,6 +30,29 @@ along with GCC; see the file COPYING3. If not see > > #define HAVE_ICONV 0 > > #endif > > > > +/* Input charset configuration. */ > > +static const char *default_charset_callback (const char *) > > +{ > > + return NULL; > > +} > > + > > +struct > > +{ > > + input_context_charset_callback ccb; > > + bool should_skip_bom; > > +} static input_context = > > +{ > > + default_charset_callback, > > + false > > +}; > > + > > +void input_initialize_context (input_context_charset_callback ccb, > > + bool should_skip_bom) > > +{ > > + input_context.ccb = (ccb ? ccb : default_charset_callback); > > + input_context.should_skip_bom = should_skip_bom; > > +} > > + > > /* This is a cache used by get_next_line to store the content of a > > file to be searched for file lines. */ > > class fcache > > @@ -78,6 +101,10 @@ public: > > far. */ > > char *data; > > > > + /* The allocated buffer to be freed may start a little earlier than DATA, > > + e.g. if a UTF8 BOM was skipped at the beginning. */ > > + int alloc_offset; > > + > > /* The size of the DATA array above.*/ > > size_t size; > > > > @@ -118,6 +145,17 @@ public: > > > > fcache (); > > ~fcache (); > > + > > + void offset_buffer (int offset) > > + { > > + gcc_assert (offset < 0 ? alloc_offset + offset >= 0 > > + : (size_t) offset <= size); > > + gcc_assert (data); > > + alloc_offset += offset; > > + data += offset; > > + size -= offset; > > + } > > + > > }; > > > > /* Current position in real source file. */ > > @@ -364,6 +402,9 @@ evicted_cache_tab_entry (unsigned *highest_use_count) > > return to_evict; > > } > > > > +static bool > > +read_data (fcache *c); > > + > > /* Create the cache used for the content of a given file to be > > accessed by caret diagnostic. This cache is added to an array of > > cache and can be retrieved by lookup_file_in_cache_tab. This > > @@ -384,6 +425,8 @@ add_file_to_cache_tab (const char *file_path) > > if (r->fp) > > fclose (r->fp); > > r->fp = fp; > > + if (r->alloc_offset) > > + r->offset_buffer (-r->alloc_offset); > > r->nb_read = 0; > > r->line_start_idx = 0; > > r->line_num = 0; > > @@ -394,6 +437,33 @@ add_file_to_cache_tab (const char *file_path) > > r->total_lines = total_lines_num (file_path); > > r->missing_trailing_newline = true; > > > > + /* Check the frontend configuration to determine if we need to do any > > + transformations, such as charset conversion or BOM skipping. */ > > + if (const char *input_charset = input_context.ccb (file_path)) > > + { > > + /* Need a full-blown conversion of the input charset. */ > > + fclose (r->fp); > > + r->fp = NULL; > > + const cpp_converted_source cs > > + = cpp_get_converted_source (file_path, input_charset); > > + if (!cs.data) > > + return NULL; > > + if (r->data) > > + XDELETEVEC (r->data); > > + r->data = cs.data; > > + r->nb_read = r->size = cs.len; > > + r->alloc_offset = cs.data - cs.to_free; > > + } > > + else if (input_context.should_skip_bom) > > + { > > + if (read_data (r)) > > + { > > + const int offset = cpp_check_utf8_bom (r->data, r->nb_read); > > + r->offset_buffer (offset); > > + r->nb_read -= offset; > > + } > > + } > > + > > return r; > > } > > > > @@ -415,7 +485,7 @@ lookup_or_add_file_to_cache_tab (const char *file_path) > > diagnostic. */ > > > > fcache::fcache () > > -: use_count (0), file_path (NULL), fp (NULL), data (0), > > +: use_count (0), file_path (NULL), fp (NULL), data (0), alloc_offset (0), > > size (0), nb_read (0), line_start_idx (0), line_num (0), > > total_lines (0), missing_trailing_newline (true) > > { > > @@ -433,6 +503,7 @@ fcache::~fcache () > > } > > if (data) > > { > > + offset_buffer (-alloc_offset); > > XDELETEVEC (data); > > data = 0; > > } > > @@ -447,9 +518,9 @@ fcache::~fcache () > > static bool > > needs_read (fcache *c) > > { > > - return (c->nb_read == 0 > > - || c->nb_read == c->size > > - || (c->line_start_idx >= c->nb_read - 1)); > > + return c->fp && (c->nb_read == 0 > > + || c->nb_read == c->size > > + || (c->line_start_idx >= c->nb_read - 1)); > > } > > > > /* Return TRUE iff the cache is full and thus needs to be > > @@ -469,9 +540,20 @@ maybe_grow (fcache *c) > > if (!needs_grow (c)) > > return; > > > > - size_t size = c->size == 0 ? fcache_buffer_size : c->size * 2; > > - c->data = XRESIZEVEC (char, c->data, size); > > - c->size = size; > > + if (!c->data) > > + { > > + gcc_assert (c->size == 0 && c->alloc_offset == 0); > > + c->size = fcache_buffer_size; > > + c->data = XNEWVEC (char, c->size); > > + } > > + else > > + { > > + const int offset = c->alloc_offset; > > + c->offset_buffer (-offset); > > + c->size *= 2; > > + c->data = XRESIZEVEC (char, c->data, c->size); > > + c->offset_buffer (offset); > > + } > > } > > > > /* Read more data into the cache. Extends the cache if need be. > > @@ -570,7 +652,7 @@ get_next_line (fcache *c, char **line, ssize_t > > *line_len) > > c->missing_trailing_newline = false; > > } > > > > - if (ferror (c->fp)) > > + if (c->fp && ferror (c->fp)) > > return false; > > > > /* At this point, we've found the end of the of line. It either > > diff --git a/gcc/input.h b/gcc/input.h > > index f1ef5d76cfd..8a793399958 100644 > > --- a/gcc/input.h > > +++ b/gcc/input.h > > @@ -214,4 +214,21 @@ class GTY(()) string_concat_db > > hash_map <location_hash, string_concat *> *m_table; > > }; > > > > +/* Because we read source files a second time after the frontend did it the > > + first time, we need to know how the frontend handled things like > > character > > + set conversion and UTF-8 BOM stripping, in order to make everything > > + consistent. This function needs to be called by each frontend that > > requires > > + non-default behavior, to inform the diagnostics infrastructure how > > input is > > + to be processed. The default behavior is to do no conversion and not to > > + strip a UTF-8 BOM. > > + > > + The callback should return the input charset to be used to convert the > > given > > + file's contents to UTF-8, or it should return NULL if no conversion is > > needed > > + for this file. SHOULD_SKIP_BOM only applies in case no conversion was > > + performed, and if true, it will cause a UTF-8 BOM to be skipped at the > > + beginning of the file. (In case a conversion was performed, the BOM is > > + rather skipped as part of the conversion process.) */ > > +typedef const char * (*input_context_charset_callback)(const char *); > > +void input_initialize_context (input_context_charset_callback ccb, > > + bool should_skip_bom); > > #endif > > diff --git a/libcpp/charset.c b/libcpp/charset.c > > index 99a9b73e5ab..61881f978a8 100644 > > --- a/libcpp/charset.c > > +++ b/libcpp/charset.c > > @@ -630,7 +630,11 @@ static const struct cpp_conversion conversion_tab[] = { > > cset_converter structure for conversion from FROM to TO. If > > iconv_open() fails, issue an error and return an identity > > converter. Silently return an identity converter if FROM and TO > > - are identical. */ > > + are identical. > > + > > + PFILE is only used for generating diagnostics; setting it to NULL > > + suppresses diagnostics. */ > > + > > static struct cset_converter > > init_iconv_desc (cpp_reader *pfile, const char *to, const char *from) > > { > > @@ -672,25 +676,31 @@ init_iconv_desc (cpp_reader *pfile, const char *to, > > const char *from) > > > > if (ret.cd == (iconv_t) -1) > > { > > - if (errno == EINVAL) > > - cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */ > > - "conversion from %s to %s not supported by iconv", > > - from, to); > > - else > > - cpp_errno (pfile, CPP_DL_ERROR, "iconv_open"); > > - > > + if (pfile) > > + { > > + if (errno == EINVAL) > > + cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */ > > + "conversion from %s to %s not supported by iconv", > > + from, to); > > + else > > + cpp_errno (pfile, CPP_DL_ERROR, "iconv_open"); > > + } > > ret.func = convert_no_conversion; > > } > > } > > else > > { > > - cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */ > > - "no iconv implementation, cannot convert from %s to %s", > > - from, to); > > + if (pfile) > > + { > > + cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */ > > + "no iconv implementation, cannot convert from %s to %s", > > + from, to); > > + } > > ret.func = convert_no_conversion; > > ret.cd = (iconv_t) -1; > > ret.width = -1; > > } > > + > > return ret; > > } > > > > @@ -2122,6 +2132,25 @@ _cpp_interpret_identifier (cpp_reader *pfile, const > > uchar *id, size_t len) > > buf, bufp - buf, HT_ALLOC)); > > } > > > > + > > +/* Utility to strip a UTF-8 byte order marking from the beginning > > + of a buffer. Returns the number of bytes to skip, which currently > > + will be either 0 or 3. */ > > +int > > +cpp_check_utf8_bom (const char *data, size_t data_length) > > +{ > > + > > +#if HOST_CHARSET == HOST_CHARSET_ASCII > > + const unsigned char *udata = (const unsigned char *) data; > > + if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb > > + && udata[2] == 0xbf) > > + return 3; > > +#endif > > + > > + return 0; > > +} > > + > > + > > /* Convert an input buffer (containing the complete contents of one > > source file) from INPUT_CHARSET to the source character set. INPUT > > points to the input buffer, SIZE is its allocated size, and LEN is > > @@ -2135,7 +2164,11 @@ _cpp_interpret_identifier (cpp_reader *pfile, const > > uchar *id, size_t len) > > INPUT is expected to have been allocated with xmalloc. This > > function will either set *BUFFER_START to INPUT, or free it and set > > *BUFFER_START to a pointer to another xmalloc-allocated block of > > - memory. */ > > + memory. > > + > > + PFILE is only used to generate diagnostics; setting it to NULL > > suppresses > > + diagnostics, and causes a return of NULL if there was any error > > instead. */ > > + > > uchar * > > _cpp_convert_input (cpp_reader *pfile, const char *input_charset, > > uchar *input, size_t size, size_t len, > > @@ -2158,17 +2191,27 @@ _cpp_convert_input (cpp_reader *pfile, const char > > *input_charset, > > to.text = XNEWVEC (uchar, to.asize); > > to.len = 0; > > > > - if (!APPLY_CONVERSION (input_cset, input, len, &to)) > > - cpp_error (pfile, CPP_DL_ERROR, > > - "failure to convert %s to %s", > > - CPP_OPTION (pfile, input_charset), SOURCE_CHARSET); > > - > > + const bool ok = APPLY_CONVERSION (input_cset, input, len, &to); > > free (input); > > - } > > > > - /* Clean up the mess. */ > > - if (input_cset.func == convert_using_iconv) > > - iconv_close (input_cset.cd); > > + /* Clean up the mess. */ > > + if (input_cset.func == convert_using_iconv) > > + iconv_close (input_cset.cd); > > + > > + /* Handle conversion failure. */ > > + if (!ok) > > + { > > + if (!pfile) > > + { > > + XDELETEVEC (to.text); > > + *buffer_start = NULL; > > + *st_size = 0; > > + return NULL; > > + } > > + cpp_error (pfile, CPP_DL_ERROR, "failure to convert %s to %s", > > + input_charset, SOURCE_CHARSET); > > + } > > + } > > > > /* Resize buffer if we allocated substantially too much, or if we > > haven't enough space for the \n-terminator or following > > @@ -2192,19 +2235,14 @@ _cpp_convert_input (cpp_reader *pfile, const char > > *input_charset, > > > > buffer = to.text; > > *st_size = to.len; > > -#if HOST_CHARSET == HOST_CHARSET_ASCII > > - /* The HOST_CHARSET test just above ensures that the source charset > > - is UTF-8. So, ignore a UTF-8 BOM if we see one. Note that > > - glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a > > + > > + /* Ignore a UTF-8 BOM if we see one and the source charset is UTF-8. > > Note > > + that glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a > > BOM -- however, even if it did, we would still need this code due > > to the 'convert_no_conversion' case. */ > > - if (to.len >= 3 && to.text[0] == 0xef && to.text[1] == 0xbb > > - && to.text[2] == 0xbf) > > - { > > - *st_size -= 3; > > - buffer += 3; > > - } > > -#endif > > + const int bom_len = cpp_check_utf8_bom ((const char *) to.text, to.len); > > + *st_size -= bom_len; > > + buffer += bom_len; > > > > *buffer_start = to.text; > > return buffer; > > @@ -2244,6 +2282,13 @@ _cpp_default_encoding (void) > > return current_encoding; > > } > > > > +/* Check if the configured input charset requires no conversion, other than > > + possibly stripping a UTF-8 BOM. */ > > +bool cpp_input_conversion_is_trivial (const char *input_charset) > > +{ > > + return !strcasecmp (input_charset, SOURCE_CHARSET); > > +} > > + > > /* Implementation of class cpp_string_location_reader. */ > > > > /* Constructor for cpp_string_location_reader. */ > > diff --git a/libcpp/files.c b/libcpp/files.c > > index 5ea3f8e1bf3..d91606c1532 100644 > > --- a/libcpp/files.c > > +++ b/libcpp/files.c > > @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile, _cpp_file > > *file, > > static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file, > > bool *invalid_pch, location_t loc); > > static bool read_file_guts (cpp_reader *pfile, _cpp_file *file, > > - location_t loc); > > + location_t loc, const char *input_charset); > > static bool read_file (cpp_reader *pfile, _cpp_file *file, > > location_t loc); > > static struct cpp_dir *search_path_head (cpp_reader *, const char *fname, > > @@ -671,9 +671,12 @@ _cpp_find_file (cpp_reader *pfile, const char *fname, > > cpp_dir *start_dir, > > > > Use LOC for any diagnostics. > > > > + PFILE may be NULL. In this case, no diagnostics are issued. > > + > > FIXME: Flush file cache and try again if we run out of memory. */ > > static bool > > -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc) > > +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc, > > + const char *input_charset) > > { > > ssize_t size, total, count; > > uchar *buf; > > @@ -681,8 +684,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, > > location_t loc) > > > > if (S_ISBLK (file->st.st_mode)) > > { > > - cpp_error_at (pfile, CPP_DL_ERROR, loc, > > - "%s is a block device", file->path); > > + if (pfile) > > + cpp_error_at (pfile, CPP_DL_ERROR, loc, > > + "%s is a block device", file->path); > > return false; > > } > > > > @@ -699,8 +703,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, > > location_t loc) > > does not bite us. */ > > if (file->st.st_size > INTTYPE_MAXIMUM (ssize_t)) > > { > > - cpp_error_at (pfile, CPP_DL_ERROR, loc, > > - "%s is too large", file->path); > > + if (pfile) > > + cpp_error_at (pfile, CPP_DL_ERROR, loc, > > + "%s is too large", file->path); > > return false; > > } > > > > @@ -733,29 +738,29 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, > > location_t loc) > > > > if (count < 0) > > { > > - cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc); > > + if (pfile) > > + cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc); > > free (buf); > > return false; > > } > > > > - if (regular && total != size && STAT_SIZE_RELIABLE (file->st)) > > + if (pfile && regular && total != size && STAT_SIZE_RELIABLE (file->st)) > > cpp_error_at (pfile, CPP_DL_WARNING, loc, > > "%s is shorter than expected", file->path); > > > > file->buffer = _cpp_convert_input (pfile, > > - CPP_OPTION (pfile, input_charset), > > + input_charset, > > buf, size + 16, total, > > &file->buffer_start, > > &file->st.st_size); > > - file->buffer_valid = true; > > - > > - return true; > > + file->buffer_valid = file->buffer; > > + return file->buffer_valid; > > } > > > > /* Convenience wrapper around read_file_guts that opens the file if > > necessary and closes the file descriptor after reading. FILE must > > have been passed through find_file() at some stage. Use LOC for > > - any diagnostics. */ > > + any diagnostics. Unlike read_file_guts(), PFILE may not be NULL. */ > > static bool > > read_file (cpp_reader *pfile, _cpp_file *file, location_t loc) > > { > > @@ -773,7 +778,8 @@ read_file (cpp_reader *pfile, _cpp_file *file, > > location_t loc) > > return false; > > } > > > > - file->dont_read = !read_file_guts (pfile, file, loc); > > + file->dont_read = !read_file_guts (pfile, file, loc, > > + CPP_OPTION (pfile, input_charset)); > > close (file->fd); > > file->fd = -1; > > > > @@ -2118,3 +2124,25 @@ _cpp_has_header (cpp_reader *pfile, const char > > *fname, int angle_brackets, > > return file->err_no != ENOENT; > > } > > > > +/* Read a file and convert to input charset, the same as if it were being > > read > > + by a cpp_reader. */ > > + > > +cpp_converted_source > > +cpp_get_converted_source (const char *fname, const char *input_charset) > > +{ > > + cpp_converted_source res = {}; > > + _cpp_file file = {}; > > + file.fd = -1; > > + file.name = lbasename (fname); > > + file.path = fname; > > + if (!open_file (&file)) > > + return res; > > + const bool ok = read_file_guts (NULL, &file, 0, input_charset); > > + close (file.fd); > > + if (!ok) > > + return res; > > + res.to_free = (char *) file.buffer_start; > > + res.data = (char *) file.buffer; > > + res.len = file.st.st_size; > > + return res; > > +} > > diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h > > index 4467c73284d..d02ef4aa054 100644 > > --- a/libcpp/include/cpplib.h > > +++ b/libcpp/include/cpplib.h > > @@ -1369,6 +1369,20 @@ extern struct _cpp_file *cpp_get_file (cpp_buffer *); > > extern cpp_buffer *cpp_get_prev (cpp_buffer *); > > extern void cpp_clear_file_cache (cpp_reader *); > > > > +/* cpp_get_converted_source returns the contents of the given file, as it > > exists > > + after cpplib has read it and converted it from the input charset to the > > + source charset. Return struct will be zero-filled if the data could > > not be > > + read for any reason. The data starts at the DATA pointer, but the > > TO_FREE > > + pointer is what should be passed to free(), as there may be an offset. > > */ > > +struct cpp_converted_source > > +{ > > + char *to_free; > > + char *data; > > + size_t len; > > +}; > > +cpp_converted_source cpp_get_converted_source (const char *fname, > > + const char *input_charset); > > + > > /* In pch.c */ > > struct save_macro_data; > > extern int cpp_save_state (cpp_reader *, FILE *); > > @@ -1439,6 +1453,7 @@ class cpp_display_width_computation { > > /* Convenience functions that are simple use cases for class > > cpp_display_width_computation. Tab characters will be expanded to > > spaces > > as determined by TABSTOP. */ > > + > > int cpp_byte_column_to_display_column (const char *data, int data_length, > > int column, int tabstop); > > inline int cpp_display_width (const char *data, int data_length, > > @@ -1451,4 +1466,7 @@ int cpp_display_column_to_byte_column (const char > > *data, int data_length, > > int display_col, int tabstop); > > int cpp_wcwidth (cppchar_t c); > > > > +bool cpp_input_conversion_is_trivial (const char *input_charset); > > +int cpp_check_utf8_bom (const char *data, size_t data_length); > > + > > #endif /* ! LIBCPP_CPPLIB_H */ > > Hello- > > As discussed most recently here: > > https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575243.html > > Here is the updated patch to support -finput-charset in diagnostics. As I > mentioned in the above thread, I adapted the patch in input.h and input.c now > that class file_cache has been created. That change also enabled me to move > the input context state into the file_cache class, so that it doesn't need to > be global. Since global_dc is now in charge of creating the file_cache, it > seemed to make sense to me also to move the initialization function for > frontends to call into the diagnostic_context, so they can call it on the > global_dc. Please let me know if this makes sense, overall it is "less global" > than before, which I think is the direction everyone would prefer. Happy to > adjust as needed. Thanks! > > As before, I put the new testcases in a separate gzipped file, because they > have non-standard encodings. > > Bootstrap + testsuite looks good on x86-64 Linux, all tests the same before > and after except for the 6 new passes. > > -Lewis