On Fri, Mar 24, 2023 at 9:04 PM David Malcolm via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> PR analyzer/109098 notes that the SARIF spec mandates that .sarif
> files are UTF-8 encoded, but -fdiagnostics-format=sarif-file naively
> assumes that the source files are UTF-8 encoded when quoting source
> artefacts in the .sarif output, which can lead to us writing out
> .sarif files with non-UTF-8 bytes in them (which break my reporting
> scripts).
>
> The root cause is that sarif_builder::maybe_make_artifact_content_object
> was using maybe_read_file to load the file content as bytes, and
> assuming they were UTF-8 encoded.
>
> This patch reworks both overloads of this function (one used for the
> whole file, the other for snippets of quoted lines) so that they go
> through input.cc's file cache, which attempts to decode the input files
> according to the input charset, and then encode as UTF-8. They also
> check that the result actually is UTF-8, for cases where the input
> charset is missing, or incorrectly specified, and omit the quoted
> source for such awkward cases.
>
> Doing so fixes all of the cases I've encountered.
>
> The patch adds a new:
>   { dg-final { verify-sarif-file } }
> directive to all SARIF test cases in the test suite, which verifies
> that the output is UTF-8 encoded, and is valid JSON. In particular
> it verifies that when we complain about encoding problems, the .sarif
> report we emit is itself correctly encoded.
>
> Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
> Integration testing shows no regressions, and a fix for the case
> seen in haproxy-2.7.1.
> Pushed to trunk as r13-6861-gd495ea2b232f3e.
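
For anyone following along, the "check that the result actually is UTF-8,
and omit the quoted source otherwise" step described above could look
roughly like the sketch below. This is not the code from the patch:
is_valid_utf8 and maybe_quote_source are hypothetical names used only for
illustration, and C++17 is assumed for std::optional.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper (not GCC's implementation): returns true if BYTES is
// well-formed UTF-8, rejecting truncated sequences, overlong forms,
// surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const std::string &bytes)
{
  const std::size_t n = bytes.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
    if (b0 < 0x80) {  // plain ASCII byte
      i++;
      continue;
    }
    std::size_t len;       // total length of this sequence in bytes
    unsigned long cp;      // code point being decoded
    unsigned long min_cp;  // smallest code point allowed for this length
    if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else return false;  // stray continuation byte or invalid lead byte
    if (n - i < len)
      return false;  // sequence is cut off at the end of the buffer
    for (std::size_t k = 1; k < len; k++) {
      const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
      if ((b & 0xC0) != 0x80)
        return false;  // expected a continuation byte (10xxxxxx)
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min_cp) return false;                    // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
    if (cp > 0x10FFFF) return false;                  // beyond Unicode range
    i += len;
  }
  return true;
}

// Hypothetical wrapper: given source text that has already been through
// charset conversion, return it only if it really is UTF-8; otherwise
// return nothing so the caller can omit the quoted source.
std::optional<std::string> maybe_quote_source(const std::string &converted)
{
  if (!is_valid_utf8(converted))
    return std::nullopt;
  return converted;
}
```

With this sketch, a Latin-1 encoded "caf\xE9" that slipped through without
conversion would be rejected and its snippet dropped, while the UTF-8 form
"caf\xC3\xA9" would be kept.
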
Hi David-

Regarding the patch series I had about _Pragma locations (most recently
https://gcc.gnu.org/pipermail/gcc-patches/2023-January/609472.html
and
https://gcc.gnu.org/pipermail/gcc-patches/2023-January/609473.html):
that one will now need some work in order to apply on top of these
changes to input.cc. I'm happy to do that, but I thought I'd better check
in first to see whether you had any feedback on the new approach to
input.cc that's in the v2 patch. Do you think it's a worthwhile feature,
or would you rather I just drop it?

Thanks!

-Lewis