On 12 October 2017 at 20:55, Juergen Sauermann <
juergen.sauerm...@t-online.de> wrote:

> Hi Elias,
>
> see below.
>
> /// Jürgen
>
>
> On 10/12/2017 09:13 AM, Elias Mårtenson wrote:
>
> On 11 October 2017 at 21:15, Juergen Sauermann <
> juergen.sauerm...@t-online.de> wrote:
>
>
>> If I understand *libpcre2* correctly (and I probably don't) then a
>> general regular expression RE is a tree whose
>> structure is determined by the nesting of the parentheses in RE, and the
>> result of a match follows the tree structure.
>>
>
> Actually, this is not the case. When you have subexpressions, what you
> have is simply a list of them, and each subexpression has a value. Whether
> or not these subexpressions are nested does not matter. Its position is
> purely dictated by the index of the opening parentheses.
>
> Not exactly. It is true that libpcre returns a list of matches in terms of
> the position of each
> match in the subject string B. However any two matches are either disjoint
> or one match is
> contained in the other. This containment relation defines a partial order
> between the
> matches which is most conveniently described by a tree. In that tree one
> RE, say RE1 is a
> child of another RE RE2 if the substring of B corresponding to RE2 is
> contained in the
> substring of B that corresponds to RE2.
>

Wow, this mail became longer than I had expected. Just to be clear: I am
mostly talking about the case where strings are returned. The cases where
indexes or ⊂-data is returned are different, and what I'm saying below does
not necessarily apply entirely.

I will henceforth use the word "group" instead of "subexpression". The
formal name is “capturing group”.

The fact that parenthesised groups can nest is orthogonal to the behaviour
of the groups themselves.

The groups are referenced by index or by name (there is a special pattern
syntax to give a group a name), and the API then provides functions to
access the values of these groups. You can extract the values of a group
either by using the name or by the index.

Notice that you can even refer to a capturing group using backreferences
within the pattern itself. This should help prove how simple they are:

In a pattern the sequence \1 can be used to refer back to the result of the
first group, \2 to the second, etc. Thus (and you can test this with PCRE
yourself):

*x(.(.))-\1*  matches:  *xab-ab*, xwe-we, etc...
*x(.(.))-\2*  matches:  xab-b, xwe-e, etc...

Every single Regexp API I have used (including languages such as C, Perl,
Java, Javascript, Elixir, etc) returns the groups in the exact same way as
I have explained. You use groups to choose what parts of the text you're
interested in, and refer to them by index.

Speaking of Elixir, this is the set of group-related flags that their API
supports (something tells me they also use PCRE behind the scenes):

   -

   :all - all captured subpatterns including the complete matching string
   (this is the default)
   -

   :first - only the first captured subpattern, which is always the
   complete matching part of the string; all explicitly captured subpatterns
   are discarded
   -

   :all_but_first- all but the first matching subpattern, i.e. all
   explicitly captured subpatterns, but not the complete matching part of the
   string
   -

   :none - does not return matching subpatterns at all
   -

   :all_names - captures all names in the Regex
   -

   list(binary) - a list of named captures to capture

I think this is an excellent approach, and definitely something to emulate.

The question is then: shall *⎕RE* simply return the array of matches (which
> was what your
> implementation did) or shall *⎕RE* return the matches as a tree? This is
> the same question
> as shall the tree be represented as a simple vector of nodes
> (corresponding to an APL
> vector of some kind) or shall it be represented as a recursive
> node-properties + children structure (corresponding to a nested APL value)?
>

My argument here is one of pragmatism:

   1. Regexp implementations in all known languages returns an indexed list
   of groups
   2. A simple list is simply the most useful. I struggle to think of any
   case where returning a nested structure is in any way useful.

The vector of nodes and the nested APL value are both equivalent in
> describing the
> tree. However, converting the nested tree structure to a vector of nodes
> is much simpler
> (in APL) than the other way around because converting a node vector to the
> tree involves
> a lot of comparisons which are quite lightweight but extremely ugly in
> APL. That was why
> decided to return the tree and not the vector of nodes.
>

You are indeed correct that converting it into a tree is more difficult
than the other way around. But then again, why should we go out of our ways
to make an incredibly rare use-case somewhat easier when the common
use-case becomes annoyingly complicated?

Can you think of a case where returning a nested structure is useful?


> I am not entirely against a flag that goes into that direction, but I
> believe that flag should
> determine if either the tree is returned (default) or the node vector of
> the of the tree if
> the flag is given. Unfortunately that flag, even though it is far more
> consistent with the
> structure of the ⎕RE result than 1↓, does not solve your 1↓ because it
> would still contain
> the top-level match (= the root of the tree).
>

I'm still curious as to why you are interested in even having the option of
returning a tree. I just don't see any need for it, and no other language
that I have ever used have done anything like it.

Do you have a use-case in mind?

> When you use subexpressions, it means that I am interested in specific
> parts of the matched string. If I am interested in a specific part of a
> string, it is very unlikely that I want to know the content of the entire
> match. But, if I do, I can always retrieve that using another set of parens
> that surrounds the entire regexp.
>
> Not necessarily. It could also be a boundary condition of your match that
> you
> only want to be satisfied no matter how. REs like  *[A-Z][a-z][0-9]* are
> often used that way.
>

You can easily change that to *(*[A-Z][a-z][0-9]*)* get the entire match in
this case.


> When you don't have any subexpressions, it's most likely that I am not
> interested in the matched string at all, but rather just a boolean result
> telling me if I have a match at all.
>
> The boolean case is simple, so the only aspect of this that warrants any
> discussion is how that should be achieved. My opinion is that it should be
> the default, but a flag can also be used.
>
> For subexpressions, I think a few examples will help explain how they are
> used:
>
> Let's assume the following regexp:
>
>     A(.)|B(.)
>
> This regexp has  two subexpressions, and the result with therefore have
> two values. Due to the fact that they are separated by the alternation
> symbol (|), one of the subexpressions will always be empty. So, here are
> the different possible results when matching different strings:
>
>     "AXY"  Subexpr 1: "X", Subexpr 2: ""
>     "BZA"  Subexpr 1: "",  Subexpr 2: "Z"
>     "CXY"  *No match*
>
> Not sure if that should be so but i am not too familiar with *libpre2*
> either. I would naively
> expect that an RE of the form *A|B* would either return a match for *A*
> or a match for *B* but
> not both. *man pcre2pattern* says:
>

The pattern always has a fixed number of groups that is defined by the
pattern itself. In fact, PCRE has an API call that allows you to extract
the number of groups, and this number is always constant for any given
pattern (the call is pcre2_get_ovector_count(), and it should be noted that
this is the number of substrings that should be returned. The return value
from pcre2_match() may be lower than this number, at which it should be
assumed that the remaining captured groups are empty. My old code didn't
handle this case correctly).


>
> *       Vertical bar characters are used to separate alternative
> patterns.  For*
> *       example, the pattern*
>
> *         gilbert|sullivan*
>
> *       matches  either "gilbert" or "sullivan". Any number of
> alternatives may*
> *       appear, and an empty  alternative  is  permitted  (matching  the
> empty*
> *       string). The matching process tries each alternative in turn, from
> left*
> *       to right, and the first one that succeeds is used.*
>
> My understanding of this is that, for example, B is ignored if A matches.
> That implies that
> the matching of *B* is not even performed so "" (for no match) would be
> incorrect because
> B could also match as well.
>

This is correct. But it has no relation to capturing groups. If we
generalise your example a bit and add some groups, we get the following:

(gil....)|(sul......)

In this case, if you try to match against "gilbert", the second group still
exists, but is empty. The same goes for the opposite case. Let's take three
cases and look at what PCRE returns in this case:

*foo* - As there is no match, pcre2_match() will return -1.
*gilbert* - pcre2_match() will return 2, since it never looked at the
second alternative
*sullivan* - pcre2_match() will return 3, since it had to look at both sides

In all of the cases where a match happened, pcre2_get_ovector_count() will
return 3, since there is one full match + 2 capture groups.

Note that pcre returns unused capture groups with both start and end
indexes being -1. This can be used to determine which side of an alternate
actually matched. How to actually treat this in the case where strings
(rather than indexes) are being returned in an open question. Returning ⍬
is an option, as is the empty string.

Regards,
Elias

Reply via email to