https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118686

            Bug ID: 118686
           Summary: Poor error message for ill-formed UTF-8 sequence
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: sarif-replay
          Assignee: dmalcolm at gcc dot gnu.org
          Reporter: dmalcolm at gcc dot gnu.org
  Target Milestone: ---

Created attachment 60308
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60308&action=edit
Malformed generated .sarif that sarif-replay doesn't handle well

I'm attaching a generated .sarif file which somehow has malformed UTF-8.

Python reports the byte offset of the problem:

/home/david/coding-3/gcc-build/test/experiment/x86_64-pc-linux-gnu/integration-tests/qemu-7.2.0/qemu-7.2.0/build/libdecnumber_decNumber.c.c.sarif:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 517696:
invalid start byte
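
The byte offset Python gives can be turned into the line and column a diagnostic would need. The following is a minimal sketch, not sarif-replay's actual logic; the helper name `locate_invalid_utf8` is hypothetical, and the column is assumed to be a 1-based byte count:

```python
def locate_invalid_utf8(data: bytes):
    """Return (offset, line, column) of the first undecodable byte, or None.

    Hypothetical helper: offset is 0-based, line and column are 1-based,
    and the column counts bytes rather than characters.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        offset = exc.start  # index of the first byte that failed to decode
        line = data.count(b"\n", 0, offset) + 1
        # rfind returns -1 when there is no earlier newline, giving 0 here
        line_start = data.rfind(b"\n", 0, offset) + 1
        column = offset - line_start + 1
        return offset, line, column
    return None
```

Running this over the bytes of the attached file should yield the same offset that Python reports above, plus the line and column that the diagnostic could point at.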

but sarif-replay merely reports line 1, column 1:

$ LD_LIBRARY_PATH=. ./sarif-replay
/home/david/coding-3/gcc-build/test/experiment/x86_64-pc-linux-gnu/integration-tests/qemu-7.2.0/qemu-7.2.0/build/libdecnumber_decNumber.c.c.sarif
/home/david/coding-3/gcc-build/test/experiment/x86_64-pc-linux-gnu/integration-tests/qemu-7.2.0/qemu-7.2.0/build/libdecnumber_decNumber.c.c.sarif:1:1:
error: ill-formed UTF-8 sequence
    1 | {"$schema":
"https://docs.oasis-open.org/sarif/sarif/v2.1.0/errata01/os/schemas/sarif-schema-2.1.0.json",
      | ^

Ideally, sarif-replay should show the precise line of the malformed data, and
use the escaping logic to show the offending bytes in the annotations to the
quoted source, as per -fdiagnostics-escape-format=bytes.
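
As a rough illustration of the kind of escaping meant here, the sketch below renders every byte outside printable ASCII as `<XX>` in hex. It is an approximation under that assumption, not GCC's implementation; the name `escape_bytes` is hypothetical, and GCC's exact rules for which bytes get escaped may differ:

```python
def escape_bytes(line: bytes) -> str:
    """Render a source line with non-printable-ASCII bytes shown as <XX>.

    Hypothetical helper: assumes that every byte outside 0x20..0x7e
    is escaped, which may not match GCC's rules exactly.
    """
    return "".join(
        chr(b) if 0x20 <= b < 0x7F else f"<{b:02x}>"
        for b in line
    )
```

With this, a quoted line containing the stray 0x94 byte would show it as `<94>` instead of raw, undecodable data.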
