https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118686

            Bug ID: 118686
           Summary: Poor error message for ill-formed UTF-8 sequence
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: sarif-replay
          Assignee: dmalcolm at gcc dot gnu.org
          Reporter: dmalcolm at gcc dot gnu.org
  Target Milestone: ---

Created attachment 60308
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60308&action=edit
Malformed generated .sarif that sarif-replay doesn't handle well

I'm attaching a generated .sarif file which somehow has malformed UTF-8.

Python reports the byte offset of the problem:

/home/david/coding-3/gcc-build/test/experiment/x86_64-pc-linux-gnu/integration-tests/qemu-7.2.0/qemu-7.2.0/build/libdecnumber_decNumber.c.c.sarif: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 517696: invalid start byte

but sarif-replay merely says line:1 column:1

$ LD_LIBRARY_PATH=. ./sarif-replay /home/david/coding-3/gcc-build/test/experiment/x86_64-pc-linux-gnu/integration-tests/qemu-7.2.0/qemu-7.2.0/build/libdecnumber_decNumber.c.c.sarif
/home/david/coding-3/gcc-build/test/experiment/x86_64-pc-linux-gnu/integration-tests/qemu-7.2.0/qemu-7.2.0/build/libdecnumber_decNumber.c.c.sarif:1:1: error: ill-formed UTF-8 sequence
    1 | {"$schema": "https://docs.oasis-open.org/sarif/sarif/v2.1.0/errata01/os/schemas/sarif-schema-2.1.0.json",
      | ^

Ideally should show the precise line of malformed data, and use the escaping logic to show the bytes in the annotations to the quoted source, as per -fdiagnostics-escape-format=bytes
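
For illustration only, here is a minimal Python sketch of the mapping requested above: take the byte offset of the first ill-formed byte (as Python's UnicodeDecodeError reports it) and turn it into a 1-based line and byte column, then quote the offending line with non-ASCII bytes escaped. This is not sarif-replay's code. The helper names locate_invalid_utf8 and escape_bytes are hypothetical, and the <xx> rendering is only loosely modelled on -fdiagnostics-escape-format=bytes.

```python
def locate_invalid_utf8(data: bytes):
    """Return (line, column, offset) of the first ill-formed UTF-8 byte.

    line and column are 1-based, column counts bytes from the start of
    the line, and offset is the 0-based byte offset into data.
    Returns None if data is well-formed UTF-8.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        offset = err.start
        # Newlines before the bad byte determine the line number.
        line = data.count(b"\n", 0, offset) + 1
        # rfind returns -1 when there is no earlier newline, giving 0 here.
        line_start = data.rfind(b"\n", 0, offset) + 1
        column = offset - line_start + 1
        return line, column, offset
    return None


def escape_bytes(line: bytes) -> str:
    """Render printable ASCII as-is and every other byte as <xx> (hex)."""
    return "".join(
        chr(b) if 0x20 <= b < 0x7F else f"<{b:02x}>" for b in line
    )


if __name__ == "__main__":
    # Hypothetical two-line input with a stray 0x94 byte on line 2.
    sample = b'{"a": 1,\n"b": "x\x94y"}\n'
    found = locate_invalid_utf8(sample)
    if found is not None:
        line, column, offset = found
        bad_line = sample.split(b"\n")[line - 1]
        print(f"{line}:{column}: error: ill-formed UTF-8 sequence")
        print(escape_bytes(bad_line))
```

On the sample input the location comes out as line 2, column 8 (byte offset 16), rather than the 1:1 that sarif-replay currently reports for the attached file.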