Re: [PHP-DEV] Run-tests.php JUnit format issue

Ferenc Kovacs Fri, 27 Jul 2012 16:41:23 -0700

On Sat, Jul 7, 2012 at 4:41 PM, Anthony Ferrara <ircmax...@gmail.com> wrote:


> Hey all,
>
> I've run into an issue with run-tests.php with the junit format. The XML
> that it generates can be invalid because of invalid UTF-8 characters and
> invalid XML characters. This means that trying to parse it using something
> like Jenkins gives a huge stack-trace because of invalid XML. I've been
> digging through how to fix it, and I think I've come up with a solution.
> But I'm not too happy with it, so I'd like some feedback.
>
> https://github.com/php/php-src/blob/master/run-tests.php#L2096
>
> Right now, the diff for a failed test is just injected in cdata tags, and
> stuck unencoded in the result XML. For tests that are testing invalid UTF-8
> bytes (or other character sets), that diff can contain bad byte sequences.
>
>         $diff = empty($diff) ? '' : "<![CDATA[\n " . preg_replace('/\e/',
> '<esc>', $diff) . "\n]]>";
>
>
> What I'm proposing is to escape all non-UTF8 and non-XML safe bytes with
> their value wrapped by <>. So chr(0xFF) (which is invalid in UTF8) would
> become <xFF>
>
> Now, to implement it is a bit more interesting. I've come up with a single
> regex that will do it:
>
>         $diff = preg_replace_callback(
> '/(
> [\x0-\x8]                                           # Control Characters
> | [\xB-\xC]                                        # Invalid XML Characters
> | [\xE-\x19]                                       # Invalid XML Characters
> | [\xF8-\xFF]                                     # Invalid UTF-8 Bytes
> | [\xC0-\xDF](?![\x80-\xBF])                # Invalid UTF-8 Sequence Start
> | [\xE0-\xEF](?![\x80-\xBF]{2})            # Invalid UTF-8 Sequence Start
> | [\xF0-\xF7](?![\x80-\xBF]{3})             # Invalid UTF-8 Sequence Start
> | (?<=[\x0-\x7F\xF8-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
> | (?<!
> [\xC0-\xDF]                       # Not Byte 2 of 2 Byte Sequence
> | [\xE0-\xEF]                     # Not Byte 2 of 3 Byte Sequence
> | [\xE0-\xEF][\x80-\xBF]     # Not Byte 3 of 3 Byte Sequence
> | [\xF0-\xF7]                      # Not Byte 2 of 4 Byte Sequence
> | [\xF0-\xF7][\x80-\xBF]      # Not Byte 3 of 4 Byte Sequence
> | [\xF0-\xF7][\x80-\xBF]{2}  # Not Byte 4 of 4 Byte Sequence
> )[\x80-\xBF]                                                              #
> Overlong Sequence
> | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF])                # Short 3 byte
> sequence
> | (?<=[\xF0-\xF7])[\x80-\xBF](?![\x80-\xBF]{2})             # Short 4 byte
> sequence
> | (?<=[\xF0-\xF7][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte
> sequence
>
> )/x',
>             function($match) { return sprintf('<x%02x>', ord($match[1]));
> },
>             $diff
>         );
>
> But given the size and complexity of it, I'm hesitant to go with it.
>
> What do you think?
>
> Anthony
>

we bumped into a similar problem a while back, Felipe and myself tried to
come up with an elegant solution, but failed to do so, so we ended up doing
a replace as well (
https://github.com/php/php-src/commit/78742a33a7a3c43b71d20e4b06665694c89b4c11
 )
as far as I know, there is no better solution, that convert those
characters outside of the valid xml chars into some kind of textual
representation.
(and your suggestion is similar to what for example the subversion devs
implemented for their own suite:
http://www.mail-archive.com/dev@subversion.apache.org/msg00397.html)
so I think it is ok.

ps: there are a few control characters which are allowed by the 1.0 xml
spec, see http://www.w3.org/TR/2006/REC-xml-20060816/#charsets
ps2: I think there would be still one hiccup in the code: afair we didn't
handled the case when a CDATA closure happens to be in the test output,
that could be handled by
http://www.lshift.net/blog/2007/10/25/xml-cdata-and-escaping I guess.

-- 
Ferenc Kovács
@Tyr43l - http://tyrael.hu

Re: [PHP-DEV] Run-tests.php JUnit format issue

Reply via email to