On Sat, Jul 7, 2012 at 4:41 PM, Anthony Ferrara <ircmax...@gmail.com> wrote:
> Hey all, > > I've run into an issue with run-tests.php with the junit format. The XML > that it generates can be invalid because of invalid UTF-8 characters and > invalid XML characters. This means that trying to parse it using something > like Jenkins gives a huge stack-trace because of invalid XML. I've been > digging through how to fix it, and I think I've come up with a solution. > But I'm not too happy with it, so I'd like some feedback. > > https://github.com/php/php-src/blob/master/run-tests.php#L2096 > > Right now, the diff for a failed test is just injected in cdata tags, and > stuck unencoded in the result XML. For tests that are testing invalid UTF-8 > bytes (or other character sets), that diff can contain bad byte sequences. > > $diff = empty($diff) ? '' : "<![CDATA[\n " . preg_replace('/\e/', > '<esc>', $diff) . "\n]]>"; > > > What I'm proposing is to escape all non-UTF8 and non-XML safe bytes with > their value wrapped by <>. So chr(0xFF) (which is invalid in UTF8) would > become <xFF> > > Now, to implement it is a bit more interesting. I've come up with a single > regex that will do it: > > $diff = preg_replace_callback( > '/( > [\x0-\x8] # Control Characters > | [\xB-\xC] # Invalid XML Characters > | [\xE-\x19] # Invalid XML Characters > | [\xF8-\xFF] # Invalid UTF-8 Bytes > | [\xC0-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start > | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start > | [\xF0-\xF7](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start > | (?<=[\x0-\x7F\xF8-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle > | (?<! > [\xC0-\xDF] # Not Byte 2 of 2 Byte Sequence > | [\xE0-\xEF] # Not Byte 2 of 3 Byte Sequence > | [\xE0-\xEF][\x80-\xBF] # Not Byte 3 of 3 Byte Sequence > | [\xF0-\xF7] # Not Byte 2 of 4 Byte Sequence > | [\xF0-\xF7][\x80-\xBF] # Not Byte 3 of 4 Byte Sequence > | [\xF0-\xF7][\x80-\xBF]{2} # Not Byte 4 of 4 Byte Sequence > )[\x80-\xBF] # > Overlong Sequence > | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte > sequence > | (?<=[\xF0-\xF7])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte > sequence > | (?<=[\xF0-\xF7][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte > sequence > > )/x', > function($match) { return sprintf('<x%02x>', ord($match[1])); > }, > $diff > ); > > But given the size and complexity of it, I'm hesitant to go with it. > > What do you think? > > Anthony > we bumped into a similar problem a while back, Felipe and myself tried to come up with an elegant solution, but failed to do so, so we ended up doing a replace as well ( https://github.com/php/php-src/commit/78742a33a7a3c43b71d20e4b06665694c89b4c11 ) as far as I know, there is no better solution, that convert those characters outside of the valid xml chars into some kind of textual representation. (and your suggestion is similar to what for example the subversion devs implemented for their own suite: http://www.mail-archive.com/dev@subversion.apache.org/msg00397.html) so I think it is ok. ps: there are a few control characters which are allowed by the 1.0 xml spec, see http://www.w3.org/TR/2006/REC-xml-20060816/#charsets ps2: I think there would be still one hiccup in the code: afair we didn't handled the case when a CDATA closure happens to be in the test output, that could be handled by http://www.lshift.net/blog/2007/10/25/xml-cdata-and-escaping I guess. -- Ferenc Kovács @Tyr43l - http://tyrael.hu