Hey all,

I've run into an issue with run-tests.php with the junit format. The XML
that it generates can be invalid because of invalid UTF-8 characters and
invalid XML characters. This means that trying to parse it using something
like Jenkins gives a huge stack-trace because of invalid XML. I've been
digging through how to fix it, and I think I've come up with a solution.
But I'm not too happy with it, so I'd like some feedback.

https://github.com/php/php-src/blob/master/run-tests.php#L2096

Right now, the diff for a failed test is just injected in cdata tags, and
stuck unencoded in the result XML. For tests that are testing invalid UTF-8
bytes (or other character sets), that diff can contain bad byte sequences.

        $diff = empty($diff) ? '' : "<![CDATA[\n " . preg_replace('/\e/',
'<esc>', $diff) . "\n]]>";


What I'm proposing is to escape all non-UTF8 and non-XML safe bytes with
their value wrapped by <>. So chr(0xFF) (which is invalid in UTF8) would
become <xFF>

Now, to implement it is a bit more interesting. I've come up with a single
regex that will do it:

        $diff = preg_replace_callback(
'/(
[\x0-\x8]                                           # Control Characters
| [\xB-\xC]                                        # Invalid XML Characters
| [\xE-\x19]                                       # Invalid XML Characters
| [\xF8-\xFF]                                     # Invalid UTF-8 Bytes
| [\xC0-\xDF](?![\x80-\xBF])                # Invalid UTF-8 Sequence Start
| [\xE0-\xEF](?![\x80-\xBF]{2})            # Invalid UTF-8 Sequence Start
| [\xF0-\xF7](?![\x80-\xBF]{3})             # Invalid UTF-8 Sequence Start
| (?<=[\x0-\x7F\xF8-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
| (?<!
[\xC0-\xDF]                       # Not Byte 2 of 2 Byte Sequence
| [\xE0-\xEF]                     # Not Byte 2 of 3 Byte Sequence
| [\xE0-\xEF][\x80-\xBF]     # Not Byte 3 of 3 Byte Sequence
| [\xF0-\xF7]                      # Not Byte 2 of 4 Byte Sequence
| [\xF0-\xF7][\x80-\xBF]      # Not Byte 3 of 4 Byte Sequence
| [\xF0-\xF7][\x80-\xBF]{2}  # Not Byte 4 of 4 Byte Sequence
)[\x80-\xBF]                                                              #
Overlong Sequence
| (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF])                # Short 3 byte
sequence
| (?<=[\xF0-\xF7])[\x80-\xBF](?![\x80-\xBF]{2})             # Short 4 byte
sequence
| (?<=[\xF0-\xF7][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte
sequence

)/x',
            function($match) { return sprintf('<x%02x>', ord($match[1])); },
            $diff
        );

But given the size and complexity of it, I'm hesitant to go with it.

What do you think?

Anthony

Reply via email to