Hey all,

I've run into an issue with run-tests.php and its JUnit output format. The XML it generates can be invalid because the output may contain invalid UTF-8 byte sequences and characters that XML does not allow. As a result, parsing it with something like Jenkins produces a huge stack trace. I've been digging into how to fix it, and I think I've come up with a solution. I'm not too happy with it, though, so I'd like some feedback.
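To reproduce the problem in isolation, something like the following should do. It is only a sketch: the diff content is made up, the testsuite/testcase/failure wrapper is a simplified stand-in for the real JUnit output, and DOMDocument is assumed to be available as a stand-in for a strict consumer such as Jenkins.

```php
<?php
// Made-up diff fragment: 0xFF can never appear in well-formed UTF-8.
$diff = "-caf\xC3\xA9\n+caf\xFF";

// Same CDATA wrapping that run-tests.php applies to a failed test's diff.
$cdata = "<![CDATA[\n " . preg_replace('/\e/', '<esc>', $diff) . "\n]]>";

// Simplified, hypothetical JUnit-style wrapper around the diff.
$xml = '<testsuite><testcase name="demo"><failure>' . $cdata . '</failure></testcase></testsuite>';

// With the /u modifier, preg_match() returns false if the subject is not valid UTF-8.
$isUtf8 = preg_match('//u', $xml) === 1;   // false: the raw 0xFF byte is still in there

// Let libxml collect errors instead of emitting warnings, then try to parse.
libxml_use_internal_errors(true);
$doc    = new DOMDocument();
$parsed = $doc->loadXML($xml);

var_dump($isUtf8, $parsed);
```

A strict parser should refuse this document, which is the same class of failure that shows up as the stack trace in Jenkins.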
https://github.com/php/php-src/blob/master/run-tests.php#L2096

Right now, the diff for a failed test is simply wrapped in CDATA tags and inserted unencoded into the result XML:

    $diff = empty($diff) ? '' : "<![CDATA[\n " . preg_replace('/\e/', '<esc>', $diff) . "\n]]>";

For tests that exercise invalid UTF-8 bytes (or other character sets), that diff can contain bad byte sequences.

What I'm proposing is to escape every byte that is not valid UTF-8 or not XML-safe as its hex value wrapped in <>. So chr(0xFF) (which is never valid in UTF-8) would become <xFF>.

Implementing it is a bit more interesting. I've come up with a single regex that will do it:

    $diff = preg_replace_callback(
        '/(
            [\x00-\x08]                                          # Control Characters
          | [\x0B\x0C]                                           # Invalid XML Characters
          | [\x0E-\x1F]                                          # Invalid XML Characters
          | [\xF8-\xFF]                                          # Invalid UTF-8 Bytes
          | [\xC0-\xDF](?![\x80-\xBF])                           # Invalid UTF-8 Sequence Start
          | [\xE0-\xEF](?![\x80-\xBF]{2})                        # Invalid UTF-8 Sequence Start
          | [\xF0-\xF7](?![\x80-\xBF]{3})                        # Invalid UTF-8 Sequence Start
          | (?<=[\x00-\x7F\xF8-\xFF])[\x80-\xBF]                 # Invalid UTF-8 Sequence Middle
          | (?<!
                [\xC0-\xDF]                                      # Not Byte 2 of 2 Byte Sequence
              | [\xE0-\xEF]                                      # Not Byte 2 of 3 Byte Sequence
              | [\xE0-\xEF][\x80-\xBF]                           # Not Byte 3 of 3 Byte Sequence
              | [\xF0-\xF7]                                      # Not Byte 2 of 4 Byte Sequence
              | [\xF0-\xF7][\x80-\xBF]                           # Not Byte 3 of 4 Byte Sequence
              | [\xF0-\xF7][\x80-\xBF]{2}                        # Not Byte 4 of 4 Byte Sequence
            )[\x80-\xBF]                                         # Too Many Continuation Bytes
          | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF])           # Short 3 Byte Sequence
          | (?<=[\xF0-\xF7])[\x80-\xBF](?![\x80-\xBF]{2})        # Short 4 Byte Sequence
          | (?<=[\xF0-\xF7][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 Byte Sequence
        )/x',
        function ($match) {
            // Replace the offending byte with its hex value, e.g. <xFF>.
            return sprintf('<x%02X>', ord($match[1]));
        },
        $diff
    );

But given its size and complexity, I'm hesitant to go with it.

What do you think?

Anthony
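P.S. For comparison, here is a minimal sketch of the inverse approach: instead of listing every way a byte can be bad, match what is known to be good and escape whatever is left over. This is not what the regex above does, and escape_invalid_bytes() is a made-up name. The byte ranges are my reading of the usual well-formed UTF-8 table combined with the XML 1.0 Char production, so they would need double-checking before anyone relies on them.

```php
<?php
// Hypothetical helper, not part of run-tests.php: keeps well-formed, XML-safe
// UTF-8 characters and rewrites every other byte as <xNN>.
function escape_invalid_bytes($diff)
{
    $pattern = '/(
            [\x09\x0A\x0D\x20-\x7F]           # XML-safe ASCII
          | [\xC2-\xDF][\x80-\xBF]            # 2 bytes, no overlong forms
          | \xE0[\xA0-\xBF][\x80-\xBF]        # 3 bytes, no overlong forms
          | [\xE1-\xEC\xEE][\x80-\xBF]{2}     # 3 bytes, general case
          | \xED[\x80-\x9F][\x80-\xBF]        # 3 bytes, no surrogates
          | \xEF[\x80-\xBE][\x80-\xBF]        # 3 bytes, U+F000 to U+FFBF
          | \xEF\xBF[\x80-\xBD]               # 3 bytes, stops before U+FFFE and U+FFFF
          | \xF0[\x90-\xBF][\x80-\xBF]{2}     # 4 bytes, no overlong forms
          | [\xF1-\xF3][\x80-\xBF]{3}         # 4 bytes, general case
          | \xF4[\x80-\x8F][\x80-\xBF]{2}     # 4 bytes, up to U+10FFFF
        )
      | (.)                                   # fallback: any other single byte
    /xs';

    return preg_replace_callback($pattern, function ($m) {
        // Group 2 is only non-empty when the single-byte fallback matched.
        if (isset($m[2]) && $m[2] !== '') {
            return sprintf('<x%02X>', ord($m[2]));
        }
        return $m[1]; // a valid character, passed through unchanged
    }, $diff);
}

var_dump(escape_invalid_bytes("caf\xC3\xA9 \xFF \x01"));
```

It runs the callback once per character, so it is likely slower on large diffs, but each alternative can be checked against the UTF-8 table on its own, with no lookarounds. Unlike the existing code, it turns the escape character into <x1B> rather than <esc>.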