Thanks a lot for your detailed feedback! I will rework my patch, especially to make the get_source_line_preprocessed function more readable and more efficient.
Some comments in the mean time: > This may sound silly, but please use "num" rather than "no" as an > abbreviation for "number": to my under-caffeinated brain "no" reads > as > the opposite of "yes". I fully agree, seems like 5 years ago my naming taste was a bit off. > What about buffer[1]? Does this have to be a ' '? ... > Don't we also need a trailing '"' and a newline? Ack, doesn't hurt to check buffer[1]. The newline might not come directly after the trailing '"' however, as line markers can contain additional flags after the source file name. If we have: # (\d+) "(.+)" it should be safe to assume it's a line marker. > Has this code been tested on large examples? It looks to me like this function does a linear scan starting at line 1 every time in the .i/.ii file upwards looking for markers of interest; Good points, I am looking into tracking line markers in the file_cache_slot. I am currently testing a version with a simple vector for memoization. FWIW we are using this patch in our internal gcc fork for building a pretty substantial C++ codebase and have not seen build time regressions directly linked to it. However, testing it against a fairly large .ii file that produces many warnings from that codebase (~400k LoC) shows significant speedup with the tracking version: ~/gcc-dev/bin/gcc TEST.cpp.ii 62.58s user 1.01s system 99% cpu 1:03.66 total # path as is ~/gcc-dev/bin/gcc TEST.cpp.ii 32.80s user 0.95s system 99% cpu 33.749 total # with line marker memoization > This function is rather complicated I agree. To be honest, looking at it after a couple of years was also not very pleasant. > Something else that occurred to me: assuming the file in question has gone through libcpp, we've already parsed the line markers, and line_table will have a series of line map instances representing the ranges in question. According to some comments I found, it seems that for e.g the C front-end, it can happen that we start emitting diagnostics before the line map has seen the end of the file, so it could probably only act as a hint? Best Lucas -----Original Message----- From: David Malcolm <dmalc...@redhat.com> Sent: Wednesday, 29 January 2025 16:17 To: Bader, Lucas <lucas.ba...@sap.com>; gcc-patches@gcc.gnu.org Subject: Re: [PATCH v2] get source line for diagnostic from preprocessed file [PR preprocessor/79106] On Wed, 2025-01-29 at 09:16 -0500, David Malcolm wrote: > On Wed, 2025-01-29 at 13:05 +0000, Bader, Lucas wrote: > > Hi, > > > > as discussed, I rebased and tested my patch against current master. > > Additionally, it now has the Signed-off-by tag. > > Looking forward to your comments. > > > > Best > > Lucas > > Thanks for the updated patch. > > Various notes inline throughout below... > > > > > ---- > > > > Within a compile cluster, only the preprocessed output of GCC is > > transferred > > to remote nodes for compilation. When GCC produces advanced > > diagnostics > > (with -fdiagnostics-show-caret), e.g. prints out the affected > > source > > line and fixit hints, it attempts to read the source file again, > > even > > when > > compiling a preprocessed file (-fpreprocessed). This leads to wrong > > diagnostics when building with a compile cluster, or, more > > generally, > > when changing or deleting the original source file. > > > > This patch alters GCC to read from the preprocessed file instead by > > calculating the corresponding source line. This behavior is > > consistent > > with clang. > > > > gcc/c-family/ChangeLog: > > > > * c-opts.cc (c_common_handle_option): pass -fpreprocessed > > option value to global diagnostic configuration > > > > gcc/ChangeLog: > > > > * diagnostic-show-locus.cc > > (layout::calculate_x_offset_display): read line from source or > > preprocessed > > file based on -fpreprocessed value > > (source_line::source_line): read line from source or > > preprocessed > > file based on -fpreprocessed value > > (layout_printer::print_line): read line from source or > > preprocessed > > file based on -fpreprocessed value > > > > * diagnostic.cc (diagnostic_context::initialize): > > initialize > > new members > > * diagnostic.h: new members for reading > > source lines from preprocessed files > > > > * input.cc (file_cache::get_source_line_preprocessed): new > > function > > to read source lines from preprocessed files > > (test_reading_source_line_preprocessed): new test case > > (input_cc_tests): execute new test case > > * input.h (class file_cache): add new member function > > > > * opts-global.cc (read_cmdline_options): pass input > > filename > > to global > > diagnostic context > > > Some minor ChangeLog nits: > * you should add "PR preprocessor/79106" (without quotes) to the > start > of each ChangeLog entry; have a look at the existing ChangeLog > entries > to see how we do it > * the entries should read as sentences, so please capitalize the > first > letter, and add periods > * please wrap at 74 columns. > > You can check the patch using > ./contrib/gcc-changelog/git_check_commit.py > (which is run as a server-side pre-commit hook, I believe) > > > > > > Signed-off-by: Lucas Bader <lucas.ba...@sap.com> > > --- > > gcc/c-family/c-opts.cc | 1 + > > gcc/diagnostic-show-locus.cc | 25 ++++-- > > gcc/diagnostic.cc | 2 + > > gcc/diagnostic.h | 6 ++ > > gcc/input.cc | 169 > > +++++++++++++++++++++++++++++++++++ > > gcc/input.h | 2 + > > gcc/opts-global.cc | 4 + > > 7 files changed, 204 insertions(+), 5 deletions(-) > > > > diff --git a/gcc/c-family/c-opts.cc b/gcc/c-family/c-opts.cc > > index 87b231861a6..105d8fa6eff 100644 > > --- a/gcc/c-family/c-opts.cc > > +++ b/gcc/c-family/c-opts.cc > > @@ -534,6 +534,7 @@ c_common_handle_option (size_t scode, const > > char > > *arg, HOST_WIDE_INT value, > > > > case OPT_fpreprocessed: > > cpp_opts->preprocessed = value; > > + global_dc->m_is_preprocessed = value; > > break; > > > > case OPT_fdebug_cpp: > > diff --git a/gcc/diagnostic-show-locus.cc b/gcc/diagnostic-show- > > locus.cc > > index 898efe74acf..7ac183a126f 100644 > > --- a/gcc/diagnostic-show-locus.cc > > +++ b/gcc/diagnostic-show-locus.cc > > @@ -1786,8 +1786,13 @@ layout::calculate_x_offset_display () > > return; > > } > > > > - const char_span line = m_file_cache.get_source_line > > (m_exploc.file, > > - > > m_exploc.line); > > + char_span line (NULL, 0); > > + if (global_dc->m_main_input_file_path != NULL && global_dc- > > > m_is_preprocessed) > > + line = m_file_cache.get_source_line_preprocessed ( > > + m_exploc.file, global_dc->m_main_input_file_path, > > m_exploc.line); > > + else > > + line = m_file_cache.get_source_line (m_exploc.file, > > m_exploc.line); > > + > > Here and in two other places below in this file the patch adds logic > of the form: > > if (global_dc->main_input_file_path != NULL && global_dc- > >is_preprocessed) > line = location_get_source_line_preprocessed (some file, > global_dc- > >main_input_file_path, > some line); > else > line = location_get_source_line (some file, some line); > > Please introduce a subroutine for this to avoid the repetition, > something like: > > static char_span > get_line_maybe_preprocessed (const char *filename, int line_num); > > I wonder if this logic should be moved into the file_cache itself, > but > let's leave that for now. > > > > if (!line) > > { > > /* Nothing to do, we couldn't find the source line. */ > > @@ -2780,7 +2785,12 @@ public: > > > > source_line::source_line (file_cache &fc, const char *filename, > > int > > line) > > { > > - char_span span = fc.get_source_line (filename, line); > > + char_span span (NULL, 0); > > + if (global_dc->m_main_input_file_path != NULL && global_dc- > > > m_is_preprocessed) > > + span = fc.get_source_line_preprocessed ( > > + filename, global_dc->m_main_input_file_path, line); > > + else > > + span = fc.get_source_line (filename, line); > > Please use the subroutine here. > > > > chars = span.get_buffer (); > > width = span.length (); > > } > > @@ -3132,8 +3142,13 @@ layout_printer::show_ruler (int max_column) > > void > > layout_printer::print_line (linenum_type row) > > { > > - char_span line > > - = m_layout.m_file_cache.get_source_line > > (m_layout.m_exploc.file, > > row); > > + char_span line (NULL, 0); > > + if (global_dc->m_main_input_file_path != NULL && global_dc- > > > m_is_preprocessed) > > + line = m_layout.m_file_cache.get_source_line_preprocessed ( > > + m_layout.m_exploc.file, global_dc->m_main_input_file_path, > > row); > > + else > > + line = m_layout.m_file_cache.get_source_line > > (m_layout.m_exploc.file, row); > > + > > Again, please use the subroutine here. > > > > if (!line) > > return; > > > > diff --git a/gcc/diagnostic.cc b/gcc/diagnostic.cc > > index c2f6714c24a..c8a93d407ad 100644 > > --- a/gcc/diagnostic.cc > > +++ b/gcc/diagnostic.cc > > @@ -290,6 +290,8 @@ diagnostic_context::initialize (int n_opts) > > m_diagrams.m_theme = nullptr; > > m_original_argv = nullptr; > > m_diagnostic_buffer = nullptr; > > + m_main_input_file_path = nullptr; > > + m_is_preprocessed = false; > > > > enum diagnostic_text_art_charset text_art_charset > > = DIAGNOSTICS_TEXT_ART_CHARSET_EMOJI; > > diff --git a/gcc/diagnostic.h b/gcc/diagnostic.h > > index 202760b2f85..60a7900eb23 100644 > > --- a/gcc/diagnostic.h > > +++ b/gcc/diagnostic.h > > @@ -850,6 +850,12 @@ public: > > /* True if warnings should be given in system headers. */ > > bool m_warn_system_headers; > > > > + /* Original input file. */ > > + const char* m_main_input_file_path; > > + > > + /* True if the input file is treated as preprocessed. */ > > + bool m_is_preprocessed; > > + > > private: > > /* Maximum number of errors to report. */ > > int m_max_errors; > > diff --git a/gcc/input.cc b/gcc/input.cc > > index 9f3cc6651e8..c261f2c4750 100644 > > --- a/gcc/input.cc > > +++ b/gcc/input.cc > > @@ -1056,6 +1056,109 @@ file_cache::get_source_line (const char > > *file_path, int line) > > return char_span (buffer, len); > > } > > > > +/* Return the physical source line that corresponds to > > SOURCE_NAME/LINE. > > + Read the line from the preprocessed file at FILE_PATH. > > + The line is not nul-terminated. The returned pointer is only > > + valid until the next call of get_source_line. > > + Note that the line can contain several null characters, > > + so the returned value's length has the actual length of the > > line. > > + If the function fails, a NULL char_span is returned. */ > > + > > +char_span > > +file_cache::get_source_line_preprocessed (const char *source_name, > > + const char *file_path, int > > line) > > +{ > > + char *buffer = NULL; > > + ssize_t len; > > + > > + if (line == 0) > > + return char_span (NULL, 0); > > + > > + file_cache_slot *c = lookup_or_add_file (file_path); > > + if (c == NULL) > > + return char_span (NULL, 0); > > + > > + // find linemarker closest to actual line number > > + int linemarker_line = 0; // line no of linemarker in pp file > > + int linemarker_loc = 0; // source line no indicated in > > linemarker > > This may sound silly, but please use "num" rather than "no" as an > abbreviation for "number": to my under-caffeinated brain "no" reads > as > the opposite of "yes". > > > > + int current_line = 1; > > + ssize_t source_name_len = (ssize_t) strlen (source_name); > > + while (c->read_line_num (current_line, &buffer, &len)) > > + { > > + if (len > 2 && buffer[0] == '#' && ISDIGIT (buffer[2])) > > What about buffer[1]? Does this have to be a ' '? > > > + { > > + // is a linemarker > > + bool is_match = false; > > + for (ssize_t i = 3; i < len && source_name_len < len - i > > - > > 1; ++i) > > + { > > + if (buffer[i] == '"') > > + { > > + is_match > > + = memcmp (buffer + i + 1, source_name, > > source_name_len) > > + == 0; > > Don't we also need a trailing '"' and a newline? > > > + break; > > + } > > + } > > + bool match_past_line = false; > > + if (is_match) > > + { > > + // we found a linemarker that matches the source > > file > > + // e.g. > > + // # 14 "../../file.h" > > + // we use the last suitable marker we find > > + int loc = atoi (buffer + 2); > > This function is rather complicated. I think a useful simplification > would be to split out the above logic into a subroutine, something > like > this: > > /* Return true iff the text LINE is a line marker of the form > "# ([0-9]+) "(.+)"\n" > where the quoted string (group 2) matches SOURCE_NAME. > If returning true, write the line number (group 1) to *OUT_LINE. > */ > > static bool > is_line_marker_for_file_p (char_span line, > char_span source_name, > int *out_line_num) > { > // etc, the loop to determine is_match and the atoi > } > > and then this subroutine could be unit-tested as well. > > (though despite the leading comment I wrote, I'm not sure if "line" > should include the trailing newline) > > > > + if (loc <= line && line - loc <= line - > > linemarker_loc) > > + { > > Sorry, I confess my eyes started glazing over at this point and I > feel > I'm probably missing things. I think splitting out the subroutine to > find the match will help readability. > > Has this code been tested on large examples? It looks to me like > this > function does a linear scan starting at line 1 every time in the > .i/.ii > file upwards looking for markers of interest; with a call to > read_line_num (current_line_num in a loop. These calls could be non- > trivial (see Andi's recent patches). I haven't looked in detail at > the > time-complexity, but this *might* introduce O(n^2) behavior or worse. > So we might want to have some extra data structure to track line > markers to amortize the cost - though I'm not sure at this point. Something else that occurred to me: assuming the file in question has gone through libcpp, we've already parsed the line markers, and line_table will have a series of line map instances representing the ranges in question. So maybe there's a way to do things using that. Dave