On 11 October 2017 at 21:15, Juergen Sauermann <
juergen.sauerm...@t-online.de> wrote:


> If I understand *libpcre* correctly (and I probably don't) then a general
> regular expression RE is a tree whose
> structure is determined by the nesting of the parentheses in RE, and the
> result of a match follows the tree structure.
>

Actually, this is not the case. When you have subexpressions, what you have
is simply a flat list of them, and each subexpression has a value. Whether or
not these subexpressions are nested does not matter. The position of each
one in the list is dictated purely by where its opening parenthesis appears.
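
To illustrate the numbering, here is a minimal sketch using Python's re
module as a stand-in for libpcre (that substitution is my assumption; the
pattern and the test string are made up for the illustration). Both engines
number groups by the position of the opening parenthesis:

```python
import re

# Nested groups, numbered by the position of each opening parenthesis:
#   1: ((a)(b(c)))   2: (a)   3: (b(c))   4: (c)
pattern = re.compile(r"((a)(b(c)))")

match = pattern.search("abc")

# The result is a flat tuple; the nesting is not reflected in its shape.
groups = match.groups()
for index, value in enumerate(groups, start=1):
    print(f"Subexpr {index}: {value!r}")
```

The result is a plain list of four values, even though the parentheses
form a tree.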

When I use subexpressions, it means that I am interested in specific
parts of the matched string. If I am interested in a specific part of a
string, it is very unlikely that I want to know the content of the entire
match. But, if I do, I can always retrieve that using another set of parens
that surrounds the entire regexp.

When I don't have any subexpressions, it's most likely that I am not
interested in the matched string at all, but rather just a boolean result
telling me whether there is a match at all.

The boolean case is simple, so the only aspect of this that warrants any
discussion is how that should be achieved. My opinion is that it should be
the default, but a flag can also be used.

For subexpressions, I think a few examples will help explain how they are
used:

Let's assume the following regexp:

    A(.)|B(.)

This regexp has two subexpressions, and the result will therefore have two
values. Because they are separated by the alternation symbol (|), only one
of them can take part in any given match, so the other will always be
empty. Here are the possible results when matching different strings:

    "AXY"  Subexpr 1: "X", Subexpr 2: ""
    "BZA"  Subexpr 1: "",  Subexpr 2: "Z"
    "CXY"  *No match*

(with the current implementation, there is no way I can differentiate
between cases 1 and 2, which shows that the current implementation is not
working correctly)

As you can see from this example, I can look at the content of
subexpressions 1 and 2 to determine which of the alternatives was matched.
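
The table above can be reproduced with a short sketch. It uses Python's re
module rather than libpcre or the APL interface under discussion, and the
helper name subexprs is hypothetical. Python reports a group that did not
take part in the match as None, so the sketch asks for "" instead, to
mirror the table:

```python
import re

PATTERN = re.compile(r"A(.)|B(.)")

def subexprs(text):
    """Return the subexpression values as a tuple, or None if no match."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # default="" turns non-participating groups into empty strings
    return m.groups(default="")

for text in ("AXY", "BZA", "CXY"):
    result = subexprs(text)
    if result is None:
        print(f"{text!r}  *No match*")
    else:
        print(f"{text!r}  Subexpr 1: {result[0]!r}, Subexpr 2: {result[1]!r}")
```

Checking which of the two values is non-empty tells me which alternative
was matched.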

If I really want to see the whole match as well, I can force this by adding
a third subexpression (which will be number 1 since its opening parenthesis
comes first):

    (A(.)|B(.))

Here, the result will also contain the full match:

    "AXY"  Subexpr 1: "AX", Subexpr 2: "X", Subexpr 3: ""
    "BZA"  Subexpr 1: "BZ", Subexpr 2: "",  Subexpr 3: "Z"
    "CXY"  *No match*
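
The same kind of sketch for the wrapped regexp, again with Python's re
module standing in for libpcre (my assumption) and a hypothetical helper
name, shows the full match arriving as subexpression 1:

```python
import re

WRAPPED = re.compile(r"(A(.)|B(.))")

def wrapped_subexprs(text):
    """Return the three subexpression values, or None if no match."""
    m = WRAPPED.search(text)
    if m is None:
        return None
    # Group 1 is the outer pair of parens, i.e. the whole match.
    return m.groups(default="")

for text in ("AXY", "BZA", "CXY"):
    print(repr(text), wrapped_subexprs(text))
```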

I hope this helps explain why my design was the way it was. There is an
argument that the no-subexpression case should not return the full match
but rather a boolean value simply indicating whether a match was found or
not. In that case the old behaviour can still be achieved by wrapping the
entire regexp in a set of parentheses as shown above. However, I think a
flag to achieve this would be clearer.

Regards,
Elias
