On Fri, Jan 29, 2021 at 10:46:30AM -0500, Lewis Hyatt wrote: > On Tue, Jan 26, 2021 at 04:02:52PM -0500, David Malcolm wrote: > > On Fri, 2020-12-18 at 18:03 -0500, Lewis Hyatt wrote: > > > Hello- > > > > > > The attached patch addresses PR93067: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067#c0 > > > > > > This is similar to the patch I posted last year on the PR, with some > > tweaks > > > to make it a little simpler. Recapping some of the commentary on the > > PR: > > > > > > When source lines are needed for diagnostics output, they are > > retrieved from > > > the source file by the fcache infrastructure in input.c, since libcpp > > has > > > generally already forgotten them (plus not all front ends are using > > > libcpp). This infrastructure does not read the files in the same way > > as > > > libcpp does; in particular, it does not translate the encoding as > > requested > > > by -finput-charset, and it does not strip a UTF-8 byte-order mark if > > > present. The patch adds this ability. My thinking in deciding how to > > do it > > > was the following: > > > > > > - Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer > > still, > > > so this patch should try hard not to introduce any worse > > performance > > > unless these things are needed. > > > > > > - It is desirable to reuse libcpp's encoding infrastructure from > > charset.c > > > rather than repeat it in input.c. (Notably, libcpp uses iconv but > > it also > > > has hand-coded routines for certain charsets to make sure they are > > > available.) > > > > > > - There is a performance degradation required in order to make use of > > libcpp > > > directly, because the input.c infrastructure only reads as much of > > the > > > source file as necessary, whereas libcpp interfaces as-is require > > to read > > > the entire file into memory. > > > > > > - It can't be quite as simple as just "only delegate to libcpp if > > > -finput-charset was specified", because the stripping of the UTF-8 > > BOM has > > > to happen with or without this option. > > > > > > - So it seemed a reasonable compromise to me, if -finput-charset is > > > specified, then use libcpp to convert the file, otherwise, strip > > the BOM > > > in input.c and then process the file the same way it is done now. > > There's > > > a little bit of leakage of charset logic from libcpp this way (for > > the > > > BOM), but it seems worthwhile, since otherwise, diagnostics would > > always > > > be reading the entire file into memory, which is not a cost paid > > > currently. > > > > Thanks for the patch; sorry about the delay in reviewing it. > > > > Thanks for the comments! Here is an updated patch that addresses your > feedback, plus some responses inline below. > > Bootstrap + regtest all languages was done on x86-64 GNU/Linux. All tests > the same before and after, plus 6 new PASS. > > FAIL 85 85 > PASS 479191 479197 > UNSUPPORTED 13664 13664 > UNTESTED 129 129 > XFAIL 2292 2292 > XPASS 30 30 > > > > This mostly seems good to me. > > > > One aspect I'm not quite convinced about is the > > input_cpp_context.in_use flag. The input.c machinery is used by > > diagnostics, and so could be used by middle-end warnings for frontends > > that don't use libcpp. Presumably we'd still want to remove the UTF-8 > > BOM for those, and do encoding fixups if necessary - is it more a case > > of initializing things to express what the expected input charset is? > > (since that is part of the cpp_options) > > > > c.opt has: > > finput-charset= > > C ObjC C++ ObjC++ Joined RejectNegative > > -finput-charset=<cset> Specify the default character set for > > source files. > > > > I believe that D and Go are the two frontends that don't use libcpp for > > parsing. I believe Go source is required to be UTF-8 (unsurprisingly > > considering the heritage of both). I don't know what source encodings > > D supports. > > > > For this patch I was rather singularly focused on libcpp, so I looked > deeper at the other frontends now. It seems to me that there are basically > two questions to answer, and the three frontend styles answer this pair in > three different ways. > > Q1: What is the input charset? > A1: > > libcpp: Whatever was passed to -finput-charset (note, for Fortran, > -finput-charset is not supported though.) > > go: Assume UTF-8. > > D: UTF-8, UTF-16, or UTF-32 (the latter two in either > endianness); determined by inspecting the first bytes of the file. > > Q2: How should a UTF-8 BOM be handled? > A2: > > libcpp: Treat entirely the same, as if it was not present at all. So > a diagnostic referring to the first non-BOM character in the file will > point to column 1, not to column 4. > > go: Treat it like whitespace, ignored for parsing purposes, but still > logically part of the file. A diagnostic referring to the first non-BOM > character in the file will point to column 4. > > D: Same as libcpp. > > So input.c's current behavior (with or without my patch) actually matches > the "go" frontend exactly and does the right thing for it. As you > thought, though, for D it would be desirable to skip over the UTF-8 BOM > too. > > D adds an additional wrinkle in that, instead of using -finput-charset, it > uses its own detection logic -- based on knowledge that the first codepoint > in the file must be ASCII, it is able to deduce the encoding from the first > few bytes. This means that in D, each source file can have a different > encoding, which is not expressible in libcpp frontends currently, and hence > the notion of a global input_cpp_context with a constant charset is not > sufficient to capture this. > > In this iteration of the patch, I replaced the input_cpp_context with a more > general input_context that can handle all of these cases. The context now > consists of a callback function and a bool, which the frontend is supposed > to configure. The bool determines whether or not to skip the BOM. The > callback, when passed a filename, returns the input charset needed to > convert that file. For libcpp, the filename is not used as the charset is > determined from -finput-charset for all files. For D, the callback function > is currently a stub, but it could be expanded to open the file, read the > first few bytes, and determine the charset to be used. I have not > implemented it for now, because it seems inelegant to duplicate the logic D > already has for this detection, but this logic is part of the dmd/ tree > which I think is maintained external to GCC, and so it didn't seem right to > attempt to refactor it. I figured there may not be much interest in this > feature (diagnostics are already unusable for UTF-16 and UTF-32 sources in > D), and this patch makes no change to it. This patch does fix the handling > of a UTF-8 BOM in D diagnostics, however. > > > > Separate from the main patch are two testcases that both fail before > > this > > > patch and pass after. I attached them gzipped because they use non- > > standard > > > encodings that won't email well. > > > > > > The main question I have about the patch is whether I chose a good > > way to > > > address the following complication. location_get_source_line() in > > input.c is > > > used to generate diagnostics for all front ends, whether they use > > libcpp to > > > read the files or not. So input.c needs some way to know whether > > libcpp is > > > in use or not. I ended up adding a new function > > input_initialize_cpp_context(), > > > which front ends have to call if they are using libcpp to read their > > > files. Currently that is c-family and fortran. I don't see a simpler > > way > > > to do it at least... Even if there were a better way for input.c to > > find out > > > the value passed to -finput-charset, it would still need to know > > whether > > > libcpp is being used or not. > > > > At some point I want to convert the global state in input.c into a > > source_manager class, probably hung off the diagnostic_context, so the > > natural place to put that state would be in there (rather than the > > fcache being global state). Perhaps have a source_reader policy class > > hung off of the source_manager, which would have responsibility for > > reading and converting the file, either piecemeal, or fully for the > > "need to use libcpp" case. > > > > But that's not necessary for this fix. > > > > > Please let me know if it looks OK (either now or for stage 1, > > whatever makes > > > sense...) bootstrap + regtest all languages on x86-64 GNU/Linux, all > > tests the > > > same before & after plus 6 new PASS from the new tests. Thanks! > > > > Various comments inline below. In particular, does the patch add a > > memory leak in _cpp_convert_input? > > > > Thanks > > Dave > > > > [...] > > > > > libcpp/ChangeLog: > > > > > > PR other/93067 > > > * charset.c (init_iconv_desc): Adapt to permit PFILE argument > > to > > > be NULL. > > > (_cpp_convert_input): Likewise. Also move UTF-8 BOM logic > > to... > > > (cpp_check_utf8_bom): ...here. New function. > > > (cpp_input_conversion_is_trivial): New function. > > > * files.c (read_file_guts): Allow PFILE argument to be NULL. > > Add > > > INPUT_CHARSET argument as an alternate source of this > > information. > > > (cpp_get_converted_source): New function. > > > * include/cpplib.h (struct cpp_converted_source): Declare. > > > (cpp_get_converted_source): Declare. > > > (cpp_input_conversion_is_trivial): Declare. > > > (cpp_check_utf8_bom): Declare. > > > > > > gcc/testsuite/ChangeLog: > > > > > > PR other/93067 > > > * gcc.dg/diagnostic-input-charset-1.c: New test. > > > * gcc.dg/diagnostic-input-charset-2.c: New test. > > > > Maybe rename the 2nd test to "diagnostic-input-utf8-bom.c" ? > > > > Done. > > > > > > > diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c > > > index 59cabd12407..d5aa7859cc1 100644 > > > --- a/gcc/c-family/c-opts.c > > > +++ b/gcc/c-family/c-opts.c > > > @@ -1124,6 +1124,10 @@ c_common_post_options (const char **pfilename) > > > cpp_post_options (parse_in); > > > init_global_opts_from_cpp (&global_options, cpp_get_options > > (parse_in)); > > > > > > + /* Let diagnostics infrastructure know we are using libcpp to read > > > + the input. */ > > > + input_initialize_cpp_context (parse_in); > > > > As noted above I think the most significant thing here is getting the > > source encoding from parse_in, so maybe reword the comment accordingly. > > > > [...] > > > > > diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c > > > index 51baf141711..2b12a98afc0 100644 > > > --- a/gcc/fortran/cpp.c > > > +++ b/gcc/fortran/cpp.c > > > @@ -493,6 +493,10 @@ gfc_cpp_post_options (void) > > > > > > cpp_post_options (cpp_in); > > > > > > + /* Let diagnostics infrastructure know we are using libcpp to read > > > + the input. */ > > > + input_initialize_cpp_context (cpp_in); > > > > Likewise. > > > > I kept this in mind when redoing this part, hopefully it is better now. > > > [...] > > > > > diff --git a/gcc/input.c b/gcc/input.c > > > index 29d10f06b86..1dcdd464bc1 100644 > > > --- a/gcc/input.c > > > +++ b/gcc/input.c > > > > [...] > > > > > @@ -394,6 +432,42 @@ add_file_to_cache_tab (const char *file_path) > > > r->total_lines = total_lines_num (file_path); > > > r->missing_trailing_newline = true; > > > > > > + /* If libcpp is managing the reading, then there are two cases we > > need to > > > + consider. If -finput-charset is not in effect, then we just > > need to > > > + strip a UTF-8 BOM, so do that ourselves rather than calling > > into libcpp so > > > + as to avoid paying the penalty of using libcpp, namely that the > > entire file > > > + must be read at once. In the (generally rare) case that a non- > > trivial > > > + -finput-charset is needed, then go ahead and use libcpp to read > > the whole > > > + file and do the conversion. */ > > > + if (input_cpp_context.in_use) > > > + { > > > + if (input_cpp_context.conversion_is_trivial) > > > + { > > > + /* Strip the UTF8 BOM if present. */ > > > + if (read_data (r)) > > > + { > > > + const int offset = cpp_check_utf8_bom (r->data, r- > > >nb_read); > > > + r->offset_buffer (offset); > > > + r->nb_read -= offset; > > > + } > > > > This assumes that trivial conversions are always UTF8 to UTF8, which > > assumes that SOURCE_CHARSET is UTF8, which isn't the case for EBCDIC... > > > > [...] > > > > > +/* Utility to strip a UTF-8 byte order marking from the beginning > > > + of a buffer. Returns the number of bytes to skip, which > > currently > > > + will be either 0 or 3. */ > > > +int > > > +cpp_check_utf8_bom (const char *data, size_t data_length) > > > +{ > > > + > > > +#if HOST_CHARSET == HOST_CHARSET_ASCII > > > + const unsigned char *udata = (const unsigned char *) data; > > > + if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb > > > + && udata[2] == 0xbf) > > > + return 3; > > > +#endif > > > + > > > + return 0; > > > +} > > > > ...though the check on HOST_CHARSET == HOST_CHARSET_ASCII makes this a > > no-op for the EBCDIC case, so that's OK. > > > > [...] > > > > > @@ -2158,17 +2191,28 @@ _cpp_convert_input (cpp_reader *pfile, const > > char *input_charset, > > > to.text = XNEWVEC (uchar, to.asize); > > > to.len = 0; > > > > > > - if (!APPLY_CONVERSION (input_cset, input, len, &to)) > > > - cpp_error (pfile, CPP_DL_ERROR, > > > - "failure to convert %s to %s", > > > - CPP_OPTION (pfile, input_charset), > > SOURCE_CHARSET); > > > + const bool ok = APPLY_CONVERSION (input_cset, input, len, > > &to); > > > > > > - free (input); > > > - } > > > + /* Handle conversion failure. */ > > > + if (!ok) > > > + { > > > + free (input); > > > + if (!pfile) > > > + { > > > + XDELETEVEC (to.text); > > > + *buffer_start = NULL; > > > + *st_size = 0; > > > + return NULL; > > > + } > > > + cpp_error (pfile, CPP_DL_ERROR, > > > + "failure to convert %s to %s", > > > + CPP_OPTION (pfile, input_charset), > > SOURCE_CHARSET); > > > + } > > > + } > > > > > > > Doesn't this lose the > > free (input); > > for the case where the conversion is successful? Is this a leak? > > > > Ugh, sorry to waste your time with that mistake. Fixed. > > > [...] > > > > > diff --git a/libcpp/files.c b/libcpp/files.c > > > index 301b2379a23..178bb9ed1e6 100644 > > > --- a/libcpp/files.c > > > +++ b/libcpp/files.c > > > @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile, > > _cpp_file *file, > > > static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file, > > > bool *invalid_pch, location_t loc); > > > static bool read_file_guts (cpp_reader *pfile, _cpp_file *file, > > > - location_t loc); > > > + location_t loc, const char *input_charset > > = NULL); > > > > read_file_guts is only used in one place before the patch, so rather > > than add a default argument, I think it's simpler to pass in the > > input_charset at that call. > > > > > @@ -671,18 +671,32 @@ _cpp_find_file (cpp_reader *pfile, const char > > *fname, cpp_dir *start_dir, > > > > > > Use LOC for any diagnostics. > > > > > > + The input charset may be specified in the INPUT_CHARSET argument, > > or > > > + else it will be taken from PFILE. > > > > ...and doing so means that input_charset will be non-NULL at all > > callers. > > > > > + PFILE may be NULL. In this case, no diagnostics are issued, and > > the > > > + input charset must be specified in the arguments. > > > + > > > FIXME: Flush file cache and try again if we run out of memory. > > */ > > > static bool > > > -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc) > > > +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc, > > > + const char *input_charset) > > > { > > > ssize_t size, total, count; > > > uchar *buf; > > > bool regular; > > > > > > + if (!input_charset) > > > + { > > > + gcc_assert (pfile); > > > + input_charset = CPP_OPTION (pfile, input_charset); > > > + } > > > > ...which would allow for this logic to be removed, moving this to the > > existing caller, I believe. > > > > Much better your way, done. > > Thanks for taking a look at it, sorry this version is a bit changed from the > previous one. FWIW, the libcpp pieces were not touched other than addressing > your comments; the new logic is in input.h/input.c plus the associated > tweaks to the frontends. > > -Lewis
> From: Lewis Hyatt <lhy...@gmail.com> > Date: Wed, 27 Jan 2021 18:23:13 -0500 > Subject: [PATCH] diagnostics: Support for -finput-charset [PR93067] > > Adds the logic to handle -finput-charset in layout_get_source_line(), so that > source lines are converted from their input encodings prior to being output by > diagnostics machinery. Also adds the ability to strip a UTF-8 BOM similarly. > > gcc/c-family/ChangeLog: > > PR other/93067 > * c-opts.c (c_common_input_charset_cb): New function. > (c_common_post_options): Call new function input_initialize_context(). > > gcc/d/ChangeLog: > > PR other/93067 > * d-lang.cc (d_input_charset_callback): New function. > (d_init): Call new function input_initialize_context(). > > gcc/fortran/ChangeLog: > > PR other/93067 > * cpp.c (gfc_cpp_post_options): Call new function > input_initialize_cpp_context(). > > gcc/ChangeLog: > > PR other/93067 > * input.c (default_charset_callback): New function. > (input_initialize_context): New function. > (read_data): Add prototype. > (add_file_to_cache_tab): Use libcpp to convert input encoding when > needed. Strip UTF-8 BOM when needed. > (class fcache): Add new members to track input encoding conversion. > (fcache::fcache): Adapt for new members. > (fcache::~fcache): Likewise. > (maybe_grow): Likewise. > (needs_read): Adapt to be aware that fp member may be NULL now. > (get_next_line): Likewise. > * input.h (input_initialize_context): Declare. > > libcpp/ChangeLog: > > PR other/93067 > * charset.c (init_iconv_desc): Adapt to permit PFILE argument to > be NULL. > (_cpp_convert_input): Likewise. Also move UTF-8 BOM logic to... > (cpp_check_utf8_bom): ...here. New function. > (cpp_input_conversion_is_trivial): New function. > * files.c (read_file_guts): Allow PFILE argument to be NULL. Add > INPUT_CHARSET argument as an alternate source of this information. > (read_file): Pass the new argument to read_file_guts. > (cpp_get_converted_source): New function. > * include/cpplib.h (struct cpp_converted_source): Declare. > (cpp_get_converted_source): Declare. > (cpp_input_conversion_is_trivial): Declare. > (cpp_check_utf8_bom): Declare. > > gcc/testsuite/ChangeLog: > > PR other/93067 > * gcc.dg/diagnostic-input-charset-1.c: New test. > * gcc.dg/diagnostic-input-utf8-bom.c: New test. > > diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c > index bd15b9cd902..2f58e8413b9 100644 > --- a/gcc/c-family/c-opts.c > +++ b/gcc/c-family/c-opts.c > @@ -188,6 +188,14 @@ c_common_diagnostics_set_defaults (diagnostic_context > *context) > context->opt_permissive = OPT_fpermissive; > } > > +/* Input charset configuration for diagnostics. */ > +static const char * > +c_common_input_charset_cb (const char * /*filename*/) > +{ > + const char *cs = cpp_opts->input_charset; > + return cpp_input_conversion_is_trivial (cs) ? NULL : cs; > +} > + > /* Whether options from all C-family languages should be accepted > quietly. */ > static bool accept_all_c_family_options = false; > @@ -1131,6 +1139,10 @@ c_common_post_options (const char **pfilename) > cpp_post_options (parse_in); > init_global_opts_from_cpp (&global_options, cpp_get_options (parse_in)); > > + /* Let diagnostics infrastructure know how to convert input files the same > + way libcpp will do it, namely using the configured input charset and > + skipping a UTF-8 BOM if present. */ > + input_initialize_context (c_common_input_charset_cb, true); > input_location = UNKNOWN_LOCATION; > > *pfilename = this_input_filename > diff --git a/gcc/d/d-lang.cc b/gcc/d/d-lang.cc > index 0fd207da7f3..9f170c09886 100644 > --- a/gcc/d/d-lang.cc > +++ b/gcc/d/d-lang.cc > @@ -50,6 +50,7 @@ along with GCC; see the file COPYING3. If not see > #include "output.h" > #include "print-tree.h" > #include "debug.h" > +#include "input.h" > > #include "d-tree.h" > #include "id.h" > @@ -362,6 +363,19 @@ d_option_lang_mask (void) > return CL_D; > } > > +/* Implements input charset and BOM skipping configuration for > + diagnostics. */ > +static const char *d_input_charset_callback (const char * /*filename*/) > +{ > + /* TODO: The input charset is automatically determined by code in > + dmd/dmodule.c based on the contents of the file. If this detection > + logic were factored out and could be reused here, then we would be able > + to return UTF-16 or UTF-32 as needed here. For now, we return always > + NULL, which means no conversion is necessary, i.e. the input is assumed > + to be UTF-8 when diagnostics read this file. */ > + return NULL; > +} > + > /* Implements the lang_hooks.init routine for language D. */ > > static bool > @@ -373,6 +387,10 @@ d_init (void) > Expression::_init (); > Objc::_init (); > > + /* Diagnostics input init, to enable BOM skipping and > + input charset conversion. */ > + input_initialize_context (d_input_charset_callback, true); > + > /* Back-end init. */ > global_binding_level = ggc_cleared_alloc <binding_level> (); > current_binding_level = global_binding_level; > diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c > index 419cd6accbe..8b564888673 100644 > --- a/gcc/fortran/cpp.c > +++ b/gcc/fortran/cpp.c > @@ -493,6 +493,12 @@ gfc_cpp_post_options (void) > > cpp_post_options (cpp_in); > > + > + /* Let diagnostics infrastructure know how to convert input files the same > + way libcpp will do it, namely, with no charset conversion but with > + skipping of a UTF-8 BOM if present. */ > + input_initialize_context (NULL, true); > + > gfc_cpp_register_include_paths (); > } > > diff --git a/gcc/input.c b/gcc/input.c > index 9e39e7df83c..f8f6dc7e4ca 100644 > --- a/gcc/input.c > +++ b/gcc/input.c > @@ -30,6 +30,29 @@ along with GCC; see the file COPYING3. If not see > #define HAVE_ICONV 0 > #endif > > +/* Input charset configuration. */ > +static const char *default_charset_callback (const char *) > +{ > + return NULL; > +} > + > +struct > +{ > + input_context_charset_callback ccb; > + bool should_skip_bom; > +} static input_context = > +{ > + default_charset_callback, > + false > +}; > + > +void input_initialize_context (input_context_charset_callback ccb, > + bool should_skip_bom) > +{ > + input_context.ccb = (ccb ? ccb : default_charset_callback); > + input_context.should_skip_bom = should_skip_bom; > +} > + > /* This is a cache used by get_next_line to store the content of a > file to be searched for file lines. */ > class fcache > @@ -78,6 +101,10 @@ public: > far. */ > char *data; > > + /* The allocated buffer to be freed may start a little earlier than DATA, > + e.g. if a UTF8 BOM was skipped at the beginning. */ > + int alloc_offset; > + > /* The size of the DATA array above.*/ > size_t size; > > @@ -118,6 +145,17 @@ public: > > fcache (); > ~fcache (); > + > + void offset_buffer (int offset) > + { > + gcc_assert (offset < 0 ? alloc_offset + offset >= 0 > + : (size_t) offset <= size); > + gcc_assert (data); > + alloc_offset += offset; > + data += offset; > + size -= offset; > + } > + > }; > > /* Current position in real source file. */ > @@ -364,6 +402,9 @@ evicted_cache_tab_entry (unsigned *highest_use_count) > return to_evict; > } > > +static bool > +read_data (fcache *c); > + > /* Create the cache used for the content of a given file to be > accessed by caret diagnostic. This cache is added to an array of > cache and can be retrieved by lookup_file_in_cache_tab. This > @@ -384,6 +425,8 @@ add_file_to_cache_tab (const char *file_path) > if (r->fp) > fclose (r->fp); > r->fp = fp; > + if (r->alloc_offset) > + r->offset_buffer (-r->alloc_offset); > r->nb_read = 0; > r->line_start_idx = 0; > r->line_num = 0; > @@ -394,6 +437,33 @@ add_file_to_cache_tab (const char *file_path) > r->total_lines = total_lines_num (file_path); > r->missing_trailing_newline = true; > > + /* Check the frontend configuration to determine if we need to do any > + transformations, such as charset conversion or BOM skipping. */ > + if (const char *input_charset = input_context.ccb (file_path)) > + { > + /* Need a full-blown conversion of the input charset. */ > + fclose (r->fp); > + r->fp = NULL; > + const cpp_converted_source cs > + = cpp_get_converted_source (file_path, input_charset); > + if (!cs.data) > + return NULL; > + if (r->data) > + XDELETEVEC (r->data); > + r->data = cs.data; > + r->nb_read = r->size = cs.len; > + r->alloc_offset = cs.data - cs.to_free; > + } > + else if (input_context.should_skip_bom) > + { > + if (read_data (r)) > + { > + const int offset = cpp_check_utf8_bom (r->data, r->nb_read); > + r->offset_buffer (offset); > + r->nb_read -= offset; > + } > + } > + > return r; > } > > @@ -415,7 +485,7 @@ lookup_or_add_file_to_cache_tab (const char *file_path) > diagnostic. */ > > fcache::fcache () > -: use_count (0), file_path (NULL), fp (NULL), data (0), > +: use_count (0), file_path (NULL), fp (NULL), data (0), alloc_offset (0), > size (0), nb_read (0), line_start_idx (0), line_num (0), > total_lines (0), missing_trailing_newline (true) > { > @@ -433,6 +503,7 @@ fcache::~fcache () > } > if (data) > { > + offset_buffer (-alloc_offset); > XDELETEVEC (data); > data = 0; > } > @@ -447,9 +518,9 @@ fcache::~fcache () > static bool > needs_read (fcache *c) > { > - return (c->nb_read == 0 > - || c->nb_read == c->size > - || (c->line_start_idx >= c->nb_read - 1)); > + return c->fp && (c->nb_read == 0 > + || c->nb_read == c->size > + || (c->line_start_idx >= c->nb_read - 1)); > } > > /* Return TRUE iff the cache is full and thus needs to be > @@ -469,9 +540,20 @@ maybe_grow (fcache *c) > if (!needs_grow (c)) > return; > > - size_t size = c->size == 0 ? fcache_buffer_size : c->size * 2; > - c->data = XRESIZEVEC (char, c->data, size); > - c->size = size; > + if (!c->data) > + { > + gcc_assert (c->size == 0 && c->alloc_offset == 0); > + c->size = fcache_buffer_size; > + c->data = XNEWVEC (char, c->size); > + } > + else > + { > + const int offset = c->alloc_offset; > + c->offset_buffer (-offset); > + c->size *= 2; > + c->data = XRESIZEVEC (char, c->data, c->size); > + c->offset_buffer (offset); > + } > } > > /* Read more data into the cache. Extends the cache if need be. > @@ -570,7 +652,7 @@ get_next_line (fcache *c, char **line, ssize_t *line_len) > c->missing_trailing_newline = false; > } > > - if (ferror (c->fp)) > + if (c->fp && ferror (c->fp)) > return false; > > /* At this point, we've found the end of the of line. It either > diff --git a/gcc/input.h b/gcc/input.h > index f1ef5d76cfd..8a793399958 100644 > --- a/gcc/input.h > +++ b/gcc/input.h > @@ -214,4 +214,21 @@ class GTY(()) string_concat_db > hash_map <location_hash, string_concat *> *m_table; > }; > > +/* Because we read source files a second time after the frontend did it the > + first time, we need to know how the frontend handled things like character > + set conversion and UTF-8 BOM stripping, in order to make everything > + consistent. This function needs to be called by each frontend that > requires > + non-default behavior, to inform the diagnostics infrastructure how input > is > + to be processed. The default behavior is to do no conversion and not to > + strip a UTF-8 BOM. > + > + The callback should return the input charset to be used to convert the > given > + file's contents to UTF-8, or it should return NULL if no conversion is > needed > + for this file. SHOULD_SKIP_BOM only applies in case no conversion was > + performed, and if true, it will cause a UTF-8 BOM to be skipped at the > + beginning of the file. (In case a conversion was performed, the BOM is > + rather skipped as part of the conversion process.) */ > +typedef const char * (*input_context_charset_callback)(const char *); > +void input_initialize_context (input_context_charset_callback ccb, > + bool should_skip_bom); > #endif > diff --git a/libcpp/charset.c b/libcpp/charset.c > index 99a9b73e5ab..61881f978a8 100644 > --- a/libcpp/charset.c > +++ b/libcpp/charset.c > @@ -630,7 +630,11 @@ static const struct cpp_conversion conversion_tab[] = { > cset_converter structure for conversion from FROM to TO. If > iconv_open() fails, issue an error and return an identity > converter. Silently return an identity converter if FROM and TO > - are identical. */ > + are identical. > + > + PFILE is only used for generating diagnostics; setting it to NULL > + suppresses diagnostics. */ > + > static struct cset_converter > init_iconv_desc (cpp_reader *pfile, const char *to, const char *from) > { > @@ -672,25 +676,31 @@ init_iconv_desc (cpp_reader *pfile, const char *to, > const char *from) > > if (ret.cd == (iconv_t) -1) > { > - if (errno == EINVAL) > - cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */ > - "conversion from %s to %s not supported by iconv", > - from, to); > - else > - cpp_errno (pfile, CPP_DL_ERROR, "iconv_open"); > - > + if (pfile) > + { > + if (errno == EINVAL) > + cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */ > + "conversion from %s to %s not supported by iconv", > + from, to); > + else > + cpp_errno (pfile, CPP_DL_ERROR, "iconv_open"); > + } > ret.func = convert_no_conversion; > } > } > else > { > - cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */ > - "no iconv implementation, cannot convert from %s to %s", > - from, to); > + if (pfile) > + { > + cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */ > + "no iconv implementation, cannot convert from %s to %s", > + from, to); > + } > ret.func = convert_no_conversion; > ret.cd = (iconv_t) -1; > ret.width = -1; > } > + > return ret; > } > > @@ -2122,6 +2132,25 @@ _cpp_interpret_identifier (cpp_reader *pfile, const > uchar *id, size_t len) > buf, bufp - buf, HT_ALLOC)); > } > > + > +/* Utility to strip a UTF-8 byte order marking from the beginning > + of a buffer. Returns the number of bytes to skip, which currently > + will be either 0 or 3. */ > +int > +cpp_check_utf8_bom (const char *data, size_t data_length) > +{ > + > +#if HOST_CHARSET == HOST_CHARSET_ASCII > + const unsigned char *udata = (const unsigned char *) data; > + if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb > + && udata[2] == 0xbf) > + return 3; > +#endif > + > + return 0; > +} > + > + > /* Convert an input buffer (containing the complete contents of one > source file) from INPUT_CHARSET to the source character set. INPUT > points to the input buffer, SIZE is its allocated size, and LEN is > @@ -2135,7 +2164,11 @@ _cpp_interpret_identifier (cpp_reader *pfile, const > uchar *id, size_t len) > INPUT is expected to have been allocated with xmalloc. This > function will either set *BUFFER_START to INPUT, or free it and set > *BUFFER_START to a pointer to another xmalloc-allocated block of > - memory. */ > + memory. > + > + PFILE is only used to generate diagnostics; setting it to NULL suppresses > + diagnostics, and causes a return of NULL if there was any error instead. > */ > + > uchar * > _cpp_convert_input (cpp_reader *pfile, const char *input_charset, > uchar *input, size_t size, size_t len, > @@ -2158,17 +2191,27 @@ _cpp_convert_input (cpp_reader *pfile, const char > *input_charset, > to.text = XNEWVEC (uchar, to.asize); > to.len = 0; > > - if (!APPLY_CONVERSION (input_cset, input, len, &to)) > - cpp_error (pfile, CPP_DL_ERROR, > - "failure to convert %s to %s", > - CPP_OPTION (pfile, input_charset), SOURCE_CHARSET); > - > + const bool ok = APPLY_CONVERSION (input_cset, input, len, &to); > free (input); > - } > > - /* Clean up the mess. */ > - if (input_cset.func == convert_using_iconv) > - iconv_close (input_cset.cd); > + /* Clean up the mess. */ > + if (input_cset.func == convert_using_iconv) > + iconv_close (input_cset.cd); > + > + /* Handle conversion failure. */ > + if (!ok) > + { > + if (!pfile) > + { > + XDELETEVEC (to.text); > + *buffer_start = NULL; > + *st_size = 0; > + return NULL; > + } > + cpp_error (pfile, CPP_DL_ERROR, "failure to convert %s to %s", > + input_charset, SOURCE_CHARSET); > + } > + } > > /* Resize buffer if we allocated substantially too much, or if we > haven't enough space for the \n-terminator or following > @@ -2192,19 +2235,14 @@ _cpp_convert_input (cpp_reader *pfile, const char > *input_charset, > > buffer = to.text; > *st_size = to.len; > -#if HOST_CHARSET == HOST_CHARSET_ASCII > - /* The HOST_CHARSET test just above ensures that the source charset > - is UTF-8. So, ignore a UTF-8 BOM if we see one. Note that > - glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a > + > + /* Ignore a UTF-8 BOM if we see one and the source charset is UTF-8. Note > + that glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a > BOM -- however, even if it did, we would still need this code due > to the 'convert_no_conversion' case. */ > - if (to.len >= 3 && to.text[0] == 0xef && to.text[1] == 0xbb > - && to.text[2] == 0xbf) > - { > - *st_size -= 3; > - buffer += 3; > - } > -#endif > + const int bom_len = cpp_check_utf8_bom ((const char *) to.text, to.len); > + *st_size -= bom_len; > + buffer += bom_len; > > *buffer_start = to.text; > return buffer; > @@ -2244,6 +2282,13 @@ _cpp_default_encoding (void) > return current_encoding; > } > > +/* Check if the configured input charset requires no conversion, other than > + possibly stripping a UTF-8 BOM. */ > +bool cpp_input_conversion_is_trivial (const char *input_charset) > +{ > + return !strcasecmp (input_charset, SOURCE_CHARSET); > +} > + > /* Implementation of class cpp_string_location_reader. */ > > /* Constructor for cpp_string_location_reader. */ > diff --git a/libcpp/files.c b/libcpp/files.c > index 5ea3f8e1bf3..d91606c1532 100644 > --- a/libcpp/files.c > +++ b/libcpp/files.c > @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile, _cpp_file > *file, > static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file, > bool *invalid_pch, location_t loc); > static bool read_file_guts (cpp_reader *pfile, _cpp_file *file, > - location_t loc); > + location_t loc, const char *input_charset); > static bool read_file (cpp_reader *pfile, _cpp_file *file, > location_t loc); > static struct cpp_dir *search_path_head (cpp_reader *, const char *fname, > @@ -671,9 +671,12 @@ _cpp_find_file (cpp_reader *pfile, const char *fname, > cpp_dir *start_dir, > > Use LOC for any diagnostics. > > + PFILE may be NULL. In this case, no diagnostics are issued. > + > FIXME: Flush file cache and try again if we run out of memory. */ > static bool > -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc) > +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc, > + const char *input_charset) > { > ssize_t size, total, count; > uchar *buf; > @@ -681,8 +684,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, > location_t loc) > > if (S_ISBLK (file->st.st_mode)) > { > - cpp_error_at (pfile, CPP_DL_ERROR, loc, > - "%s is a block device", file->path); > + if (pfile) > + cpp_error_at (pfile, CPP_DL_ERROR, loc, > + "%s is a block device", file->path); > return false; > } > > @@ -699,8 +703,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, > location_t loc) > does not bite us. */ > if (file->st.st_size > INTTYPE_MAXIMUM (ssize_t)) > { > - cpp_error_at (pfile, CPP_DL_ERROR, loc, > - "%s is too large", file->path); > + if (pfile) > + cpp_error_at (pfile, CPP_DL_ERROR, loc, > + "%s is too large", file->path); > return false; > } > > @@ -733,29 +738,29 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, > location_t loc) > > if (count < 0) > { > - cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc); > + if (pfile) > + cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc); > free (buf); > return false; > } > > - if (regular && total != size && STAT_SIZE_RELIABLE (file->st)) > + if (pfile && regular && total != size && STAT_SIZE_RELIABLE (file->st)) > cpp_error_at (pfile, CPP_DL_WARNING, loc, > "%s is shorter than expected", file->path); > > file->buffer = _cpp_convert_input (pfile, > - CPP_OPTION (pfile, input_charset), > + input_charset, > buf, size + 16, total, > &file->buffer_start, > &file->st.st_size); > - file->buffer_valid = true; > - > - return true; > + file->buffer_valid = file->buffer; > + return file->buffer_valid; > } > > /* Convenience wrapper around read_file_guts that opens the file if > necessary and closes the file descriptor after reading. FILE must > have been passed through find_file() at some stage. Use LOC for > - any diagnostics. */ > + any diagnostics. Unlike read_file_guts(), PFILE may not be NULL. */ > static bool > read_file (cpp_reader *pfile, _cpp_file *file, location_t loc) > { > @@ -773,7 +778,8 @@ read_file (cpp_reader *pfile, _cpp_file *file, location_t > loc) > return false; > } > > - file->dont_read = !read_file_guts (pfile, file, loc); > + file->dont_read = !read_file_guts (pfile, file, loc, > + CPP_OPTION (pfile, input_charset)); > close (file->fd); > file->fd = -1; > > @@ -2118,3 +2124,25 @@ _cpp_has_header (cpp_reader *pfile, const char *fname, > int angle_brackets, > return file->err_no != ENOENT; > } > > +/* Read a file and convert to input charset, the same as if it were being > read > + by a cpp_reader. */ > + > +cpp_converted_source > +cpp_get_converted_source (const char *fname, const char *input_charset) > +{ > + cpp_converted_source res = {}; > + _cpp_file file = {}; > + file.fd = -1; > + file.name = lbasename (fname); > + file.path = fname; > + if (!open_file (&file)) > + return res; > + const bool ok = read_file_guts (NULL, &file, 0, input_charset); > + close (file.fd); > + if (!ok) > + return res; > + res.to_free = (char *) file.buffer_start; > + res.data = (char *) file.buffer; > + res.len = file.st.st_size; > + return res; > +} > diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h > index 4467c73284d..d02ef4aa054 100644 > --- a/libcpp/include/cpplib.h > +++ b/libcpp/include/cpplib.h > @@ -1369,6 +1369,20 @@ extern struct _cpp_file *cpp_get_file (cpp_buffer *); > extern cpp_buffer *cpp_get_prev (cpp_buffer *); > extern void cpp_clear_file_cache (cpp_reader *); > > +/* cpp_get_converted_source returns the contents of the given file, as it > exists > + after cpplib has read it and converted it from the input charset to the > + source charset. Return struct will be zero-filled if the data could not > be > + read for any reason. The data starts at the DATA pointer, but the TO_FREE > + pointer is what should be passed to free(), as there may be an offset. */ > +struct cpp_converted_source > +{ > + char *to_free; > + char *data; > + size_t len; > +}; > +cpp_converted_source cpp_get_converted_source (const char *fname, > + const char *input_charset); > + > /* In pch.c */ > struct save_macro_data; > extern int cpp_save_state (cpp_reader *, FILE *); > @@ -1439,6 +1453,7 @@ class cpp_display_width_computation { > /* Convenience functions that are simple use cases for class > cpp_display_width_computation. Tab characters will be expanded to spaces > as determined by TABSTOP. */ > + > int cpp_byte_column_to_display_column (const char *data, int data_length, > int column, int tabstop); > inline int cpp_display_width (const char *data, int data_length, > @@ -1451,4 +1466,7 @@ int cpp_display_column_to_byte_column (const char > *data, int data_length, > int display_col, int tabstop); > int cpp_wcwidth (cppchar_t c); > > +bool cpp_input_conversion_is_trivial (const char *input_charset); > +int cpp_check_utf8_bom (const char *data, size_t data_length); > + > #endif /* ! LIBCPP_CPPLIB_H */ Hello- As discussed most recently here: https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575243.html Here is the updated patch to support -finput-charset in diagnostics. As I mentioned in the above thread, I adapted the patch in input.h and input.c now that class file_cache has been created. That change also enabled me to move the input context state into the file_cache class, so that it doesn't need to be global. Since global_dc is now in charge of creating the file_cache, it seemed to make sense to me also to move the initialization function for frontends to call into the diagnostic_context, so they can call it on the global_dc. Please let me know if this makes sense, overall it is "less global" than before, which I think is the direction everyone would prefer. Happy to adjust as needed. Thanks! As before, I put the new testcases in a separate gzipped file, because they have non-standard encodings. Bootstrap + testsuite looks good on x86-64 Linux, all tests the same before and after except for the 6 new passes. -Lewis
From: Lewis Hyatt <lhy...@gmail.com> Date: Fri, 30 Jul 2021 12:53:43 -0400 Subject: [PATCH] diagnostics: Support for -finput-charset [PR93067] Adds the logic to handle -finput-charset in layout_get_source_line(), so that source lines are converted from their input encodings prior to being output by diagnostics machinery. Also adds the ability to strip a UTF-8 BOM similarly. gcc/c-family/ChangeLog: PR other/93067 * c-opts.c (c_common_input_charset_cb): New function. (c_common_post_options): Call new function diagnostic_initialize_input_context(). gcc/d/ChangeLog: PR other/93067 * d-lang.cc (d_input_charset_callback): New function. (d_init): Call new function diagnostic_initialize_input_context(). gcc/fortran/ChangeLog: PR other/93067 * cpp.c (gfc_cpp_post_options): Call new function diagnostic_initialize_input_context(). gcc/ChangeLog: PR other/93067 * coretypes.h (typedef diagnostic_input_charset_callback): Declare. * diagnostic.c (diagnostic_initialize_input_context): New function. * diagnostic.h (diagnostic_initialize_input_context): Declare. * input.c (default_charset_callback): New function. (file_cache::initialize_input_context): New function. (file_cache_slot::create): Added ability to convert the input according to the input context. (file_cache::file_cache): Initialize the new input context. (class file_cache_slot): Added new m_alloc_offset member. (file_cache_slot::file_cache_slot): Initialize the new member. (file_cache_slot::~file_cache_slot): Handle potentially offset buffer. (file_cache_slot::maybe_grow): Likewise. (file_cache_slot::needs_read_p): Handle NULL fp, which is now possible. (file_cache_slot::get_next_line): Likewise. * input.h (class file_cache): Added input context member. libcpp/ChangeLog: PR other/93067 * charset.c (init_iconv_desc): Adapt to permit PFILE argument to be NULL. (_cpp_convert_input): Likewise. Also move UTF-8 BOM logic to... (cpp_check_utf8_bom): ...here. New function. (cpp_input_conversion_is_trivial): New function. * files.c (read_file_guts): Allow PFILE argument to be NULL. Add INPUT_CHARSET argument as an alternate source of this information. (read_file): Pass the new argument to read_file_guts. (cpp_get_converted_source): New function. * include/cpplib.h (struct cpp_converted_source): Declare. (cpp_get_converted_source): Declare. (cpp_input_conversion_is_trivial): Declare. (cpp_check_utf8_bom): Declare. gcc/testsuite/ChangeLog: PR other/93067 * gcc.dg/diagnostic-input-charset-1.c: New test. * gcc.dg/diagnostic-input-utf8-bom.c: New test. diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c index 1c4e832c7ed..961e40c29e7 100644 --- a/gcc/c-family/c-opts.c +++ b/gcc/c-family/c-opts.c @@ -188,6 +188,14 @@ c_common_diagnostics_set_defaults (diagnostic_context *context) context->opt_permissive = OPT_fpermissive; } +/* Input charset configuration for diagnostics. */ +static const char * +c_common_input_charset_cb (const char * /*filename*/) +{ + const char *cs = cpp_opts->input_charset; + return cpp_input_conversion_is_trivial (cs) ? nullptr : cs; +} + /* Whether options from all C-family languages should be accepted quietly. */ static bool accept_all_c_family_options = false; @@ -1136,6 +1144,11 @@ c_common_post_options (const char **pfilename) cpp_post_options (parse_in); init_global_opts_from_cpp (&global_options, cpp_get_options (parse_in)); + /* Let diagnostics infrastructure know how to convert input files the same + way libcpp will do it, namely using the configured input charset and + skipping a UTF-8 BOM if present. */ + diagnostic_initialize_input_context (global_dc, + c_common_input_charset_cb, true); input_location = UNKNOWN_LOCATION; *pfilename = this_input_filename diff --git a/gcc/coretypes.h b/gcc/coretypes.h index 406572e947d..726fcaddda2 100644 --- a/gcc/coretypes.h +++ b/gcc/coretypes.h @@ -154,6 +154,7 @@ struct cl_option_handlers; struct diagnostic_context; class pretty_printer; class diagnostic_event_id_t; +typedef const char * (*diagnostic_input_charset_callback)(const char *); template<typename T> struct array_traits; diff --git a/gcc/d/d-lang.cc b/gcc/d/d-lang.cc index 4386a489ff2..fa29a46ab1e 100644 --- a/gcc/d/d-lang.cc +++ b/gcc/d/d-lang.cc @@ -50,6 +50,7 @@ along with GCC; see the file COPYING3. If not see #include "output.h" #include "print-tree.h" #include "debug.h" +#include "input.h" #include "d-tree.h" #include "id.h" @@ -362,6 +363,19 @@ d_option_lang_mask (void) return CL_D; } +/* Implements input charset and BOM skipping configuration for + diagnostics. */ +static const char *d_input_charset_callback (const char * /*filename*/) +{ + /* TODO: The input charset is automatically determined by code in + dmd/dmodule.c based on the contents of the file. If this detection + logic were factored out and could be reused here, then we would be able + to return UTF-16 or UTF-32 as needed here. For now, we return always + NULL, which means no conversion is necessary, i.e. the input is assumed + to be UTF-8 when diagnostics read this file. */ + return nullptr; +} + /* Implements the lang_hooks.init routine for language D. */ static bool @@ -373,6 +387,11 @@ d_init (void) Expression::_init (); Objc::_init (); + /* Diagnostics input init, to enable BOM skipping and + input charset conversion. */ + diagnostic_initialize_input_context (global_dc, + d_input_charset_callback, true); + /* Back-end init. */ global_binding_level = ggc_cleared_alloc <binding_level> (); current_binding_level = global_binding_level; diff --git a/gcc/diagnostic.c b/gcc/diagnostic.c index 8361f68aace..b3afbeae648 100644 --- a/gcc/diagnostic.c +++ b/gcc/diagnostic.c @@ -293,6 +293,17 @@ diagnostic_urls_init (diagnostic_context *context, int value /*= -1 */) = determine_url_format ((diagnostic_url_rule_t) value); } +/* Create the file_cache, if not already created, and tell it how to + translate files on input. */ +void diagnostic_initialize_input_context (diagnostic_context *context, + diagnostic_input_charset_callback ccb, + bool should_skip_bom) +{ + if (!context->m_file_cache) + context->m_file_cache = new file_cache; + context->m_file_cache->initialize_input_context (ccb, should_skip_bom); +} + /* Do any cleaning up required after the last diagnostic is emitted. */ void diff --git a/gcc/diagnostic.h b/gcc/diagnostic.h index 7227dae1b6b..f90d20a4f5e 100644 --- a/gcc/diagnostic.h +++ b/gcc/diagnostic.h @@ -446,6 +446,25 @@ extern void diagnostic_show_locus (diagnostic_context *, diagnostic_t diagnostic_kind); extern void diagnostic_show_any_path (diagnostic_context *, diagnostic_info *); +/* Because we read source files a second time after the frontend did it the + first time, we need to know how the frontend handled things like character + set conversion and UTF-8 BOM stripping, in order to make everything + consistent. This function needs to be called by each frontend that requires + non-default behavior, to inform the diagnostics infrastructure how input is + to be processed. The default behavior is to do no conversion and not to + strip a UTF-8 BOM. + + The callback should return the input charset to be used to convert the given + file's contents to UTF-8, or it should return NULL if no conversion is needed + for this file. SHOULD_SKIP_BOM only applies in case no conversion was + performed, and if true, it will cause a UTF-8 BOM to be skipped at the + beginning of the file. (In case a conversion was performed, the BOM is + rather skipped as part of the conversion process.) */ + +void diagnostic_initialize_input_context (diagnostic_context *context, + diagnostic_input_charset_callback ccb, + bool should_skip_bom); + /* Force diagnostics controlled by OPTIDX to be kind KIND. */ extern diagnostic_t diagnostic_classify_diagnostic (diagnostic_context *, int /* optidx */, diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c index 419cd6accbe..83c4517acdb 100644 --- a/gcc/fortran/cpp.c +++ b/gcc/fortran/cpp.c @@ -493,6 +493,12 @@ gfc_cpp_post_options (void) cpp_post_options (cpp_in); + + /* Let diagnostics infrastructure know how to convert input files the same + way libcpp will do it, namely, with no charset conversion but with + skipping of a UTF-8 BOM if present. */ + diagnostic_initialize_input_context (global_dc, nullptr, true); + gfc_cpp_register_include_paths (); } diff --git a/gcc/input.c b/gcc/input.c index de20d983d2c..4b809862e02 100644 --- a/gcc/input.c +++ b/gcc/input.c @@ -22,7 +22,6 @@ along with GCC; see the file COPYING3. If not see #include "coretypes.h" #include "intl.h" #include "diagnostic.h" -#include "diagnostic-core.h" #include "selftest.h" #include "cpplib.h" @@ -30,6 +29,20 @@ along with GCC; see the file COPYING3. If not see #define HAVE_ICONV 0 #endif +/* Input charset configuration. */ +static const char *default_charset_callback (const char *) +{ + return nullptr; +} + +void +file_cache::initialize_input_context (diagnostic_input_charset_callback ccb, + bool should_skip_bom) +{ + in_context.ccb = (ccb ? ccb : default_charset_callback); + in_context.should_skip_bom = should_skip_bom; +} + /* This is a cache used by get_next_line to store the content of a file to be searched for file lines. */ class file_cache_slot @@ -51,7 +64,8 @@ public: void inc_use_count () { m_use_count++; } - void create (const char *file_path, FILE *fp, unsigned highest_use_count); + bool create (const file_cache::input_context &in_context, + const char *file_path, FILE *fp, unsigned highest_use_count); void evict (); private: @@ -110,6 +124,10 @@ public: far. */ char *m_data; + /* The allocated buffer to be freed may start a little earlier than DATA, + e.g. if a UTF8 BOM was skipped at the beginning. */ + int m_alloc_offset; + /* The size of the DATA array above.*/ size_t m_size; @@ -147,6 +165,17 @@ public: doesn't explode. We thus scale total_lines down to line_record_size. */ vec<line_info, va_heap> m_line_record; + + void offset_buffer (int offset) + { + gcc_assert (offset < 0 ? m_alloc_offset + offset >= 0 + : (size_t) offset <= m_size); + gcc_assert (m_data); + m_alloc_offset += offset; + m_data += offset; + m_size -= offset; + } + }; /* Current position in real source file. */ @@ -419,21 +448,25 @@ file_cache::add_file (const char *file_path) unsigned highest_use_count = 0; file_cache_slot *r = evicted_cache_tab_entry (&highest_use_count); - r->create (file_path, fp, highest_use_count); + if (!r->create (in_context, file_path, fp, highest_use_count)) + return NULL; return r; } /* Populate this slot for use on FILE_PATH and FP, dropping any existing cached content within it. */ -void -file_cache_slot::create (const char *file_path, FILE *fp, +bool +file_cache_slot::create (const file_cache::input_context &in_context, + const char *file_path, FILE *fp, unsigned highest_use_count) { m_file_path = file_path; if (m_fp) fclose (m_fp); m_fp = fp; + if (m_alloc_offset) + offset_buffer (-m_alloc_offset); m_nb_read = 0; m_line_start_idx = 0; m_line_num = 0; @@ -443,6 +476,36 @@ file_cache_slot::create (const char *file_path, FILE *fp, m_use_count = ++highest_use_count; m_total_lines = total_lines_num (file_path); m_missing_trailing_newline = true; + + + /* Check the input configuration to determine if we need to do any + transformations, such as charset conversion or BOM skipping. */ + if (const char *input_charset = in_context.ccb (file_path)) + { + /* Need a full-blown conversion of the input charset. */ + fclose (m_fp); + m_fp = NULL; + const cpp_converted_source cs + = cpp_get_converted_source (file_path, input_charset); + if (!cs.data) + return false; + if (m_data) + XDELETEVEC (m_data); + m_data = cs.data; + m_nb_read = m_size = cs.len; + m_alloc_offset = cs.data - cs.to_free; + } + else if (in_context.should_skip_bom) + { + if (read_data ()) + { + const int offset = cpp_check_utf8_bom (m_data, m_nb_read); + offset_buffer (offset); + m_nb_read -= offset; + } + } + + return true; } /* file_cache's ctor. */ @@ -450,6 +513,7 @@ file_cache_slot::create (const char *file_path, FILE *fp, file_cache::file_cache () : m_file_slots (new file_cache_slot[num_file_slots]) { + initialize_input_context (nullptr, false); } /* file_cache's dtor. */ @@ -478,8 +542,8 @@ file_cache::lookup_or_add_file (const char *file_path) file_cache_slot::file_cache_slot () : m_use_count (0), m_file_path (NULL), m_fp (NULL), m_data (0), - m_size (0), m_nb_read (0), m_line_start_idx (0), m_line_num (0), - m_total_lines (0), m_missing_trailing_newline (true) + m_alloc_offset (0), m_size (0), m_nb_read (0), m_line_start_idx (0), + m_line_num (0), m_total_lines (0), m_missing_trailing_newline (true) { m_line_record.create (0); } @@ -495,6 +559,7 @@ file_cache_slot::~file_cache_slot () } if (m_data) { + offset_buffer (-m_alloc_offset); XDELETEVEC (m_data); m_data = 0; } @@ -509,7 +574,7 @@ file_cache_slot::~file_cache_slot () bool file_cache_slot::needs_read_p () const { - return (m_nb_read == 0 + return m_fp && (m_nb_read == 0 || m_nb_read == m_size || (m_line_start_idx >= m_nb_read - 1)); } @@ -531,9 +596,20 @@ file_cache_slot::maybe_grow () if (!needs_grow_p ()) return; - size_t size = m_size == 0 ? buffer_size : m_size * 2; - m_data = XRESIZEVEC (char, m_data, size); - m_size = size; + if (!m_data) + { + gcc_assert (m_size == 0 && m_alloc_offset == 0); + m_size = buffer_size; + m_data = XNEWVEC (char, m_size); + } + else + { + const int offset = m_alloc_offset; + offset_buffer (-offset); + m_size *= 2; + m_data = XRESIZEVEC (char, m_data, m_size); + offset_buffer (offset); + } } /* Read more data into the cache. Extends the cache if need be. @@ -632,7 +708,7 @@ file_cache_slot::get_next_line (char **line, ssize_t *line_len) m_missing_trailing_newline = false; } - if (ferror (m_fp)) + if (m_fp && ferror (m_fp)) return false; /* At this point, we've found the end of the of line. It either diff --git a/gcc/input.h b/gcc/input.h index bbcec84c521..e6881072c5f 100644 --- a/gcc/input.h +++ b/gcc/input.h @@ -111,6 +111,15 @@ class file_cache file_cache_slot *lookup_or_add_file (const char *file_path); void forcibly_evict_file (const char *file_path); + /* See comments in diagnostic.h about the input conversion context. */ + struct input_context + { + diagnostic_input_charset_callback ccb; + bool should_skip_bom; + }; + void initialize_input_context (diagnostic_input_charset_callback ccb, + bool should_skip_bom); + private: file_cache_slot *evicted_cache_tab_entry (unsigned *highest_use_count); file_cache_slot *add_file (const char *file_path); @@ -119,6 +128,7 @@ class file_cache private: static const size_t num_file_slots = 16; file_cache_slot *m_file_slots; + input_context in_context; }; extern expanded_location diff --git a/libcpp/charset.c b/libcpp/charset.c index 99a9b73e5ab..61881f978a8 100644 --- a/libcpp/charset.c +++ b/libcpp/charset.c @@ -630,7 +630,11 @@ static const struct cpp_conversion conversion_tab[] = { cset_converter structure for conversion from FROM to TO. If iconv_open() fails, issue an error and return an identity converter. Silently return an identity converter if FROM and TO - are identical. */ + are identical. + + PFILE is only used for generating diagnostics; setting it to NULL + suppresses diagnostics. */ + static struct cset_converter init_iconv_desc (cpp_reader *pfile, const char *to, const char *from) { @@ -672,25 +676,31 @@ init_iconv_desc (cpp_reader *pfile, const char *to, const char *from) if (ret.cd == (iconv_t) -1) { - if (errno == EINVAL) - cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */ - "conversion from %s to %s not supported by iconv", - from, to); - else - cpp_errno (pfile, CPP_DL_ERROR, "iconv_open"); - + if (pfile) + { + if (errno == EINVAL) + cpp_error (pfile, CPP_DL_ERROR, /* FIXME should be DL_SORRY */ + "conversion from %s to %s not supported by iconv", + from, to); + else + cpp_errno (pfile, CPP_DL_ERROR, "iconv_open"); + } ret.func = convert_no_conversion; } } else { - cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */ - "no iconv implementation, cannot convert from %s to %s", - from, to); + if (pfile) + { + cpp_error (pfile, CPP_DL_ERROR, /* FIXME: should be DL_SORRY */ + "no iconv implementation, cannot convert from %s to %s", + from, to); + } ret.func = convert_no_conversion; ret.cd = (iconv_t) -1; ret.width = -1; } + return ret; } @@ -2122,6 +2132,25 @@ _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len) buf, bufp - buf, HT_ALLOC)); } + +/* Utility to strip a UTF-8 byte order marking from the beginning + of a buffer. Returns the number of bytes to skip, which currently + will be either 0 or 3. */ +int +cpp_check_utf8_bom (const char *data, size_t data_length) +{ + +#if HOST_CHARSET == HOST_CHARSET_ASCII + const unsigned char *udata = (const unsigned char *) data; + if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb + && udata[2] == 0xbf) + return 3; +#endif + + return 0; +} + + /* Convert an input buffer (containing the complete contents of one source file) from INPUT_CHARSET to the source character set. INPUT points to the input buffer, SIZE is its allocated size, and LEN is @@ -2135,7 +2164,11 @@ _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len) INPUT is expected to have been allocated with xmalloc. This function will either set *BUFFER_START to INPUT, or free it and set *BUFFER_START to a pointer to another xmalloc-allocated block of - memory. */ + memory. + + PFILE is only used to generate diagnostics; setting it to NULL suppresses + diagnostics, and causes a return of NULL if there was any error instead. */ + uchar * _cpp_convert_input (cpp_reader *pfile, const char *input_charset, uchar *input, size_t size, size_t len, @@ -2158,17 +2191,27 @@ _cpp_convert_input (cpp_reader *pfile, const char *input_charset, to.text = XNEWVEC (uchar, to.asize); to.len = 0; - if (!APPLY_CONVERSION (input_cset, input, len, &to)) - cpp_error (pfile, CPP_DL_ERROR, - "failure to convert %s to %s", - CPP_OPTION (pfile, input_charset), SOURCE_CHARSET); - + const bool ok = APPLY_CONVERSION (input_cset, input, len, &to); free (input); - } - /* Clean up the mess. */ - if (input_cset.func == convert_using_iconv) - iconv_close (input_cset.cd); + /* Clean up the mess. */ + if (input_cset.func == convert_using_iconv) + iconv_close (input_cset.cd); + + /* Handle conversion failure. */ + if (!ok) + { + if (!pfile) + { + XDELETEVEC (to.text); + *buffer_start = NULL; + *st_size = 0; + return NULL; + } + cpp_error (pfile, CPP_DL_ERROR, "failure to convert %s to %s", + input_charset, SOURCE_CHARSET); + } + } /* Resize buffer if we allocated substantially too much, or if we haven't enough space for the \n-terminator or following @@ -2192,19 +2235,14 @@ _cpp_convert_input (cpp_reader *pfile, const char *input_charset, buffer = to.text; *st_size = to.len; -#if HOST_CHARSET == HOST_CHARSET_ASCII - /* The HOST_CHARSET test just above ensures that the source charset - is UTF-8. So, ignore a UTF-8 BOM if we see one. Note that - glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a + + /* Ignore a UTF-8 BOM if we see one and the source charset is UTF-8. Note + that glib'c UTF-8 iconv() provider (as of glibc 2.7) does not ignore a BOM -- however, even if it did, we would still need this code due to the 'convert_no_conversion' case. */ - if (to.len >= 3 && to.text[0] == 0xef && to.text[1] == 0xbb - && to.text[2] == 0xbf) - { - *st_size -= 3; - buffer += 3; - } -#endif + const int bom_len = cpp_check_utf8_bom ((const char *) to.text, to.len); + *st_size -= bom_len; + buffer += bom_len; *buffer_start = to.text; return buffer; @@ -2244,6 +2282,13 @@ _cpp_default_encoding (void) return current_encoding; } +/* Check if the configured input charset requires no conversion, other than + possibly stripping a UTF-8 BOM. */ +bool cpp_input_conversion_is_trivial (const char *input_charset) +{ + return !strcasecmp (input_charset, SOURCE_CHARSET); +} + /* Implementation of class cpp_string_location_reader. */ /* Constructor for cpp_string_location_reader. */ diff --git a/libcpp/files.c b/libcpp/files.c index 6e20fc5887f..c93a03c69ef 100644 --- a/libcpp/files.c +++ b/libcpp/files.c @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile, _cpp_file *file, static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file, bool *invalid_pch, location_t loc); static bool read_file_guts (cpp_reader *pfile, _cpp_file *file, - location_t loc); + location_t loc, const char *input_charset); static bool read_file (cpp_reader *pfile, _cpp_file *file, location_t loc); static struct cpp_dir *search_path_head (cpp_reader *, const char *fname, @@ -671,9 +671,12 @@ _cpp_find_file (cpp_reader *pfile, const char *fname, cpp_dir *start_dir, Use LOC for any diagnostics. + PFILE may be NULL. In this case, no diagnostics are issued. + FIXME: Flush file cache and try again if we run out of memory. */ static bool -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc) +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc, + const char *input_charset) { ssize_t size, total, count; uchar *buf; @@ -681,8 +684,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc) if (S_ISBLK (file->st.st_mode)) { - cpp_error_at (pfile, CPP_DL_ERROR, loc, - "%s is a block device", file->path); + if (pfile) + cpp_error_at (pfile, CPP_DL_ERROR, loc, + "%s is a block device", file->path); return false; } @@ -699,8 +703,9 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc) does not bite us. */ if (file->st.st_size > INTTYPE_MAXIMUM (ssize_t)) { - cpp_error_at (pfile, CPP_DL_ERROR, loc, - "%s is too large", file->path); + if (pfile) + cpp_error_at (pfile, CPP_DL_ERROR, loc, + "%s is too large", file->path); return false; } @@ -733,29 +738,29 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc) if (count < 0) { - cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc); + if (pfile) + cpp_errno_filename (pfile, CPP_DL_ERROR, file->path, loc); free (buf); return false; } - if (regular && total != size && STAT_SIZE_RELIABLE (file->st)) + if (pfile && regular && total != size && STAT_SIZE_RELIABLE (file->st)) cpp_error_at (pfile, CPP_DL_WARNING, loc, "%s is shorter than expected", file->path); file->buffer = _cpp_convert_input (pfile, - CPP_OPTION (pfile, input_charset), + input_charset, buf, size + 16, total, &file->buffer_start, &file->st.st_size); - file->buffer_valid = true; - - return true; + file->buffer_valid = file->buffer; + return file->buffer_valid; } /* Convenience wrapper around read_file_guts that opens the file if necessary and closes the file descriptor after reading. FILE must have been passed through find_file() at some stage. Use LOC for - any diagnostics. */ + any diagnostics. Unlike read_file_guts(), PFILE may not be NULL. */ static bool read_file (cpp_reader *pfile, _cpp_file *file, location_t loc) { @@ -773,7 +778,8 @@ read_file (cpp_reader *pfile, _cpp_file *file, location_t loc) return false; } - file->dont_read = !read_file_guts (pfile, file, loc); + file->dont_read = !read_file_guts (pfile, file, loc, + CPP_OPTION (pfile, input_charset)); close (file->fd); file->fd = -1; @@ -2145,3 +2151,25 @@ _cpp_has_header (cpp_reader *pfile, const char *fname, int angle_brackets, return file->err_no != ENOENT; } +/* Read a file and convert to input charset, the same as if it were being read + by a cpp_reader. */ + +cpp_converted_source +cpp_get_converted_source (const char *fname, const char *input_charset) +{ + cpp_converted_source res = {}; + _cpp_file file = {}; + file.fd = -1; + file.name = lbasename (fname); + file.path = fname; + if (!open_file (&file)) + return res; + const bool ok = read_file_guts (NULL, &file, 0, input_charset); + close (file.fd); + if (!ok) + return res; + res.to_free = (char *) file.buffer_start; + res.data = (char *) file.buffer; + res.len = file.st.st_size; + return res; +} diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h index 7e840635a38..af1429171ea 100644 --- a/libcpp/include/cpplib.h +++ b/libcpp/include/cpplib.h @@ -1379,6 +1379,20 @@ extern struct _cpp_file *cpp_get_file (cpp_buffer *); extern cpp_buffer *cpp_get_prev (cpp_buffer *); extern void cpp_clear_file_cache (cpp_reader *); +/* cpp_get_converted_source returns the contents of the given file, as it exists + after cpplib has read it and converted it from the input charset to the + source charset. Return struct will be zero-filled if the data could not be + read for any reason. The data starts at the DATA pointer, but the TO_FREE + pointer is what should be passed to free(), as there may be an offset. */ +struct cpp_converted_source +{ + char *to_free; + char *data; + size_t len; +}; +cpp_converted_source cpp_get_converted_source (const char *fname, + const char *input_charset); + /* In pch.c */ struct save_macro_data; extern int cpp_save_state (cpp_reader *, FILE *); @@ -1449,6 +1463,7 @@ class cpp_display_width_computation { /* Convenience functions that are simple use cases for class cpp_display_width_computation. Tab characters will be expanded to spaces as determined by TABSTOP. */ + int cpp_byte_column_to_display_column (const char *data, int data_length, int column, int tabstop); inline int cpp_display_width (const char *data, int data_length, @@ -1461,4 +1476,7 @@ int cpp_display_column_to_byte_column (const char *data, int data_length, int display_col, int tabstop); int cpp_wcwidth (cppchar_t c); +bool cpp_input_conversion_is_trivial (const char *input_charset); +int cpp_check_utf8_bom (const char *data, size_t data_length); + #endif /* ! LIBCPP_CPPLIB_H */
diagnostics-input-charset-v3-part2.txt.gz
Description: application/gunzip