On Fri, 2020-12-18 at 18:03 -0500, Lewis Hyatt wrote:
> Hello-
>
> The attached patch addresses PR93067:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067#c0
>
> This is similar to the patch I posted last year on the PR, with some tweaks
> to make it a little simpler. Recapping some of the commentary on the PR:
>
> When source lines are needed for diagnostics output, they are retrieved from
> the source file by the fcache infrastructure in input.c, since libcpp has
> generally already forgotten them (plus not all front ends are using
> libcpp). This infrastructure does not read the files in the same way as
> libcpp does; in particular, it does not translate the encoding as requested
> by -finput-charset, and it does not strip a UTF-8 byte-order mark if
> present. The patch adds this ability. My thinking in deciding how to do it
> was the following:
>
> - Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer still,
>   so this patch should try hard not to introduce any worse performance
>   unless these things are needed.
>
> - It is desirable to reuse libcpp's encoding infrastructure from charset.c
>   rather than repeat it in input.c. (Notably, libcpp uses iconv, but it also
>   has hand-coded routines for certain charsets to make sure they are
>   available.)
>
> - There is a performance degradation required in order to make use of libcpp
>   directly, because the input.c infrastructure only reads as much of the
>   source file as necessary, whereas libcpp's interfaces as-is require reading
>   the entire file into memory.
>
> - It can't be quite as simple as just "only delegate to libcpp if
>   -finput-charset was specified", because the stripping of the UTF-8 BOM has
>   to happen with or without this option.
>
> - So it seemed a reasonable compromise to me: if -finput-charset is
>   specified, then use libcpp to convert the file; otherwise, strip the BOM
>   in input.c and then process the file the same way it is done now.
> There's a little bit of leakage of charset logic from libcpp this way (for
> the BOM), but it seems worthwhile, since otherwise, diagnostics would
> always be reading the entire file into memory, which is not a cost paid
> currently.
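Just to make the discussion concrete: the -finput-charset translation that libcpp's charset.c performs boils down to an iconv conversion into the compiler's internal UTF-8. A minimal standalone sketch of that step (the helper name is mine, not a libcpp entry point, and error handling is elided for brevity):

```cpp
#include <iconv.h>
#include <cassert>
#include <cstring>
#include <string>

/* Convert IN (IN_LEN bytes, encoded in FROM_CHARSET) to UTF-8 using
   POSIX iconv, roughly what charset.c does when it has no hand-coded
   converter for the pair.  Returns an empty string on failure.  */
static std::string
to_utf8 (const char *from_charset, const char *in, size_t in_len)
{
  iconv_t cd = iconv_open ("UTF-8", from_charset);
  if (cd == (iconv_t) -1)
    return std::string ();

  char buf[256];  /* Fixed-size buffer; real code grows the output.  */
  char *inp = const_cast<char *> (in);
  char *outp = buf;
  size_t out_left = sizeof buf;
  iconv (cd, &inp, &in_len, &outp, &out_left);
  iconv_close (cd);
  return std::string (buf, sizeof buf - out_left);
}
```

(The real code also has to cope with partial conversions and growing output buffers, which is part of why delegating to the existing charset.c machinery is attractive.)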
Thanks for the patch; sorry about the delay in reviewing it.

This mostly seems good to me.

One aspect I'm not quite convinced about is the input_cpp_context.in_use
flag.  The input.c machinery is used by diagnostics, and so could be used
by middle-end warnings for frontends that don't use libcpp.  Presumably
we'd still want to remove the UTF-8 BOM for those, and do encoding fixups
if necessary - is it more a case of initializing things to express what
the expected input charset is?  (since that is part of the cpp_options)

c.opt has:

finput-charset=
C ObjC C++ ObjC++ Joined RejectNegative
-finput-charset=<cset>	Specify the default character set for source files.

I believe that D and Go are the two frontends that don't use libcpp for
parsing.  I believe Go source is required to be UTF-8 (unsurprisingly,
considering the heritage of both).  I don't know what source encodings D
supports.

> Separate from the main patch are two testcases that both fail before this
> patch and pass after. I attached them gzipped because they use non-standard
> encodings that won't email well.
>
> The main question I have about the patch is whether I chose a good way to
> address the following complication. location_get_source_line() in input.c
> is used to generate diagnostics for all front ends, whether they use libcpp
> to read the files or not. So input.c needs some way to know whether libcpp
> is in use or not. I ended up adding a new function
> input_initialize_cpp_context(), which front ends have to call if they are
> using libcpp to read their files. Currently that is c-family and fortran.
> I don't see a simpler way to do it at least... Even if there were a better
> way for input.c to find out the value passed to -finput-charset, it would
> still need to know whether libcpp is being used or not.
At some point I want to convert the global state in input.c into a
source_manager class, probably hung off the diagnostic_context, so the
natural place to put that state would be in there (rather than the fcache
being global state).  Perhaps have a source_reader policy class hung off
of the source_manager, which would have responsibility for reading and
converting the file, either piecemeal, or fully for the "need to use
libcpp" case.  But that's not necessary for this fix.

> Please let me know if it looks OK (either now or for stage 1, whatever
> makes sense...) bootstrap + regtest all languages on x86-64 GNU/Linux,
> all tests the same before & after plus 6 new PASS from the new tests.
> Thanks!

Various comments inline below.

In particular, does the patch add a memory leak in _cpp_convert_input?

Thanks
Dave

[...]

> libcpp/ChangeLog:
>
> 	PR other/93067
> 	* charset.c (init_iconv_desc): Adapt to permit PFILE argument to
> 	be NULL.
> 	(_cpp_convert_input): Likewise. Also move UTF-8 BOM logic to...
> 	(cpp_check_utf8_bom): ...here. New function.
> 	(cpp_input_conversion_is_trivial): New function.
> 	* files.c (read_file_guts): Allow PFILE argument to be NULL. Add
> 	INPUT_CHARSET argument as an alternate source of this information.
> 	(cpp_get_converted_source): New function.
> 	* include/cpplib.h (struct cpp_converted_source): Declare.
> 	(cpp_get_converted_source): Declare.
> 	(cpp_input_conversion_is_trivial): Declare.
> 	(cpp_check_utf8_bom): Declare.
>
> gcc/testsuite/ChangeLog:
>
> 	PR other/93067
> 	* gcc.dg/diagnostic-input-charset-1.c: New test.
> 	* gcc.dg/diagnostic-input-charset-2.c: New test.

Maybe rename the 2nd test to "diagnostic-input-utf8-bom.c"?
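To give a rough idea of the source_manager / source_reader split I mentioned above, something along these lines (all names hypothetical; these classes don't exist in GCC today, and the real interface would deal in location_t rather than raw paths):

```cpp
#include <cassert>
#include <memory>
#include <string>

/* Policy class: owns how bytes on disk become SOURCE_CHARSET text,
   either piecemeal (the current fcache strategy) or whole-file via
   libcpp for the -finput-charset case.  */
class source_reader
{
public:
  virtual ~source_reader () {}
  /* Return line LINE (1-based) of PATH, reading/converting as needed.  */
  virtual std::string get_line (const char *path, int line) = 0;
};

/* Would hang off the diagnostic_context, replacing input.c's globals.  */
class source_manager
{
public:
  explicit source_manager (std::unique_ptr<source_reader> r)
    : m_reader (std::move (r)) {}
  std::string line (const char *path, int n)
  { return m_reader->get_line (path, n); }
private:
  std::unique_ptr<source_reader> m_reader;
};

/* Trivial stub reader for illustration: serves lines from a string.  */
class memory_reader : public source_reader
{
public:
  explicit memory_reader (std::string text) : m_text (std::move (text)) {}
  std::string get_line (const char *, int line) override
  {
    size_t start = 0;
    for (int i = 1; i < line; ++i)
      {
	size_t nl = m_text.find ('\n', start);
	if (nl == std::string::npos)
	  return "";
	start = nl + 1;
      }
    size_t end = m_text.find ('\n', start);
    return m_text.substr (start, end == std::string::npos
				 ? std::string::npos : end - start);
  }
private:
  std::string m_text;
};
```

The point of the policy split is that the "convert via libcpp" strategy and the "read piecemeal" strategy become interchangeable behind one interface, rather than being a flag checked inside the cache code.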
>
> diff --git a/gcc/c-family/c-opts.c b/gcc/c-family/c-opts.c
> index 59cabd12407..d5aa7859cc1 100644
> --- a/gcc/c-family/c-opts.c
> +++ b/gcc/c-family/c-opts.c
> @@ -1124,6 +1124,10 @@ c_common_post_options (const char **pfilename)
>    cpp_post_options (parse_in);
>    init_global_opts_from_cpp (&global_options, cpp_get_options (parse_in));
>
> +  /* Let diagnostics infrastructure know we are using libcpp to read
> +     the input. */
> +  input_initialize_cpp_context (parse_in);

As noted above, I think the most significant thing here is getting the
source encoding from parse_in, so maybe reword the comment accordingly.

[...]

> diff --git a/gcc/fortran/cpp.c b/gcc/fortran/cpp.c
> index 51baf141711..2b12a98afc0 100644
> --- a/gcc/fortran/cpp.c
> +++ b/gcc/fortran/cpp.c
> @@ -493,6 +493,10 @@ gfc_cpp_post_options (void)
>
>    cpp_post_options (cpp_in);
>
> +  /* Let diagnostics infrastructure know we are using libcpp to read
> +     the input. */
> +  input_initialize_cpp_context (cpp_in);

Likewise.

[...]

> diff --git a/gcc/input.c b/gcc/input.c
> index 29d10f06b86..1dcdd464bc1 100644
> --- a/gcc/input.c
> +++ b/gcc/input.c

[...]

> @@ -394,6 +432,42 @@ add_file_to_cache_tab (const char *file_path)
>    r->total_lines = total_lines_num (file_path);
>    r->missing_trailing_newline = true;
>
> +  /* If libcpp is managing the reading, then there are two cases we need
> +     to consider.  If -finput-charset is not in effect, then we just need
> +     to strip a UTF-8 BOM, so do that ourselves rather than calling into
> +     libcpp so as to avoid paying the penalty of using libcpp, namely that
> +     the entire file must be read at once.  In the (generally rare) case
> +     that a non-trivial -finput-charset is needed, then go ahead and use
> +     libcpp to read the whole file and do the conversion.  */
> +  if (input_cpp_context.in_use)
> +    {
> +      if (input_cpp_context.conversion_is_trivial)
> +	{
> +	  /* Strip the UTF8 BOM if present.  */
> +	  if (read_data (r))
> +	    {
> +	      const int offset = cpp_check_utf8_bom (r->data, r->nb_read);
> +	      r->offset_buffer (offset);
> +	      r->nb_read -= offset;
> +	    }

This assumes that trivial conversions are always UTF8 to UTF8, which
assumes that SOURCE_CHARSET is UTF8, which isn't the case for EBCDIC...

[...]

> +/* Utility to strip a UTF-8 byte order marking from the beginning
> +   of a buffer.  Returns the number of bytes to skip, which currently
> +   will be either 0 or 3.  */
> +int
> +cpp_check_utf8_bom (const char *data, size_t data_length)
> +{
> +
> +#if HOST_CHARSET == HOST_CHARSET_ASCII
> +  const unsigned char *udata = (const unsigned char *) data;
> +  if (data_length >= 3 && udata[0] == 0xef && udata[1] == 0xbb
> +      && udata[2] == 0xbf)
> +    return 3;
> +#endif
> +
> +  return 0;
> +}

...though the check on HOST_CHARSET == HOST_CHARSET_ASCII makes this a
no-op for the EBCDIC case, so that's OK.

[...]

> @@ -2158,17 +2191,28 @@ _cpp_convert_input (cpp_reader *pfile, const char *input_charset,
>        to.text = XNEWVEC (uchar, to.asize);
>        to.len = 0;
>
> -      if (!APPLY_CONVERSION (input_cset, input, len, &to))
> -	cpp_error (pfile, CPP_DL_ERROR,
> -		   "failure to convert %s to %s",
> -		   CPP_OPTION (pfile, input_charset), SOURCE_CHARSET);
> +      const bool ok = APPLY_CONVERSION (input_cset, input, len, &to);
>
> -      free (input);
> -    }
> +      /* Handle conversion failure. */
> +      if (!ok)
> +	{
> +	  free (input);
> +	  if (!pfile)
> +	    {
> +	      XDELETEVEC (to.text);
> +	      *buffer_start = NULL;
> +	      *st_size = 0;
> +	      return NULL;
> +	    }
> +	  cpp_error (pfile, CPP_DL_ERROR,
> +		     "failure to convert %s to %s",
> +		     CPP_OPTION (pfile, input_charset), SOURCE_CHARSET);
> +	}
> +    }

Doesn't this lose the free (input); for the case where the conversion is
successful?  Is this a leak?

[...]
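To spell out the ownership concern: once conversion has been attempted, the converted output lives in a separate buffer, so the input buffer has to be released on the success path as well as the failure path. A toy model of the shape I'd expect (not the real _cpp_convert_input, which allocates with XNEWVEC and returns through out-parameters):

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

/* Take ownership of INPUT (a malloc'd buffer of LEN bytes), "convert"
   it into a freshly allocated NUL-terminated copy, and free INPUT on
   every path so the caller never leaks it.  *OK reports success.  */
static char *
convert_taking_ownership (char *input, size_t len, bool *ok)
{
  char *out = (char *) malloc (len + 1);
  *ok = (out != NULL);
  if (*ok)
    {
      memcpy (out, input, len);  /* Stand-in for APPLY_CONVERSION.  */
      out[len] = '\0';
    }
  free (input);  /* Freed regardless of outcome: no leak either way.  */
  return out;
}
```

With the patch as posted, free (input) only happens inside the !ok branch, which is what prompts the question above.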
> diff --git a/libcpp/files.c b/libcpp/files.c
> index 301b2379a23..178bb9ed1e6 100644
> --- a/libcpp/files.c
> +++ b/libcpp/files.c
> @@ -173,7 +173,7 @@ static bool pch_open_file (cpp_reader *pfile, _cpp_file *file,
>  static bool find_file_in_dir (cpp_reader *pfile, _cpp_file *file,
>  			      bool *invalid_pch, location_t loc);
>  static bool read_file_guts (cpp_reader *pfile, _cpp_file *file,
> -			    location_t loc);
> +			    location_t loc, const char *input_charset = NULL);

read_file_guts is only used in one place before the patch, so rather than
add a default argument, I think it's simpler to pass in the input_charset
at that call.

> @@ -671,18 +671,32 @@ _cpp_find_file (cpp_reader *pfile, const char *fname, cpp_dir *start_dir,
>
>     Use LOC for any diagnostics.
>
> +   The input charset may be specified in the INPUT_CHARSET argument, or
> +   else it will be taken from PFILE.

...and doing so means that input_charset will be non-NULL at all callers.

> +   PFILE may be NULL.  In this case, no diagnostics are issued, and the
> +   input charset must be specified in the arguments.
> +
>     FIXME: Flush file cache and try again if we run out of memory.  */
>  static bool
> -read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc)
> +read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
> +		const char *input_charset)
>  {
>    ssize_t size, total, count;
>    uchar *buf;
>    bool regular;
>
> +  if (!input_charset)
> +    {
> +      gcc_assert (pfile);
> +      input_charset = CPP_OPTION (pfile, input_charset);
> +    }

...which would allow for this logic to be removed, moving this to the
existing caller, I believe.

[...]

Hope this is constructive
Dave