Hi,

We're abusing org.apache.xerces.impl.xpath.regex.RegularExpression to
validate XSD-flavor regular expression strings and later match test
strings against them. It seemed to work until someone tried to use a
very specific regex.

Here's the code:

    import org.apache.xerces.impl.xpath.regex.RegularExpression;

    public class XercesRegexTest {

        public static void main(String[] args) {
            String regexString = "([a-zA-Z][^ ]*)";
            RegularExpression regex = new RegularExpression(regexString, "x");
            System.out.println(regex.toString());
        }

    }

The `x` option is supposed to make the regex engine conform to XSD regular
expressions. But if you run this code, you'll end up with

    Exception in thread "main" org.apache.xerces.impl.xpath.regex.ParseException: Unexpected end of the pattern in a character class.
        at org.apache.xerces.impl.xpath.regex.RegexParser.ex(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseCharacterClass(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseAtom(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseFactor(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseTerm(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseRegex(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.processParen(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseAtom(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseFactor(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseTerm(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseRegex(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parse(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegularExpression.setPattern(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegularExpression.setPattern(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegularExpression.<init>(Unknown Source)
        at com.mgsoft.testing.regex.XercesRegexTest.main(XercesRegexTest.java:9)
    Java Result: 1

At first it looked like a bug in Xerces' regular expression parser, but
after re-reading the documentation of this class
(http://xerces.apache.org/xerces-j/apiDocs/org/apache/xerces/utils/regex/RegularExpression.html),
I found out that the `x` option should actually be `X` (upper case).
Thing is, lower-case `x` worked for countless other regular expressions.
In fact, it is the space inside the character class that causes the
problem; any other character in its place works fine. Removing the option
and using the single-string constructor of `RegularExpression` also works
fine.
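
My best guess so far, and it is only a guess I have not confirmed against the Xerces source: lower-case `x` may be a free-spacing ("extended") flag that strips whitespace from the pattern before parsing. The sketch below does not call Xerces at all; `ExtendedModeSketch` and `stripWhitespace` are hypothetical names that merely mimic that assumed preprocessing, to show what the parser would be handed in that case.

```java
// Sketch only: mimics what a free-spacing ("extended") flag might do to a
// pattern before it is parsed. That Xerces' "x" option behaves like this is
// an assumption, not something taken from its documentation.
final class ExtendedModeSketch {

    // Hypothetical helper: drop the whitespace a free-spacing mode would ignore.
    static String stripWhitespace(String pattern) {
        StringBuilder out = new StringBuilder(pattern.length());
        for (int i = 0; i < pattern.length(); i++) {
            char ch = pattern.charAt(i);
            if (ch == ' ' || ch == '\t' || ch == '\n' || ch == '\f' || ch == '\r') {
                continue; // whitespace is discarded, even inside [...]
            }
            out.append(ch);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The failing pattern: the space inside the character class disappears.
        System.out.println(stripWhitespace("([a-zA-Z][^ ]*)")); // ([a-zA-Z][^]*)
        // Any other character in that position survives unchanged.
        System.out.println(stripWhitespace("([a-zA-Z][^_]*)")); // ([a-zA-Z][^_]*)
    }
}
```

If that assumption holds, the parser would see `[^]*)`, a negated class with nothing left in it, which would fit both the "Unexpected end of the pattern in a character class" message and the fact that only a space triggers it.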

Does anyone know why this is happening? I realize that this class is
probably not intended for such usage, but since the spec we're implementing
uses XSD regular expressions, we tried to avoid reinventing the wheel
through reuse.

We are using xercesImpl.jar that is distributed with xalan-j 2.7.1.
