On 12 October 2017 at 20:55, Juergen Sauermann <
juergen.sauerm...@t-online.de> wrote:
> Hi Elias,
>
> see below.
>
> /// Jürgen
>
>
> On 10/12/2017 09:13 AM, Elias Mårtenson wrote:
>
> On 11 October 2017 at 21:15, Juergen Sauermann <
> juergen.sauerm...@t-online.de> wrote:
>
>
>> If I understand *libpcre2* correctly (and I probably don't) then a
>> general regular expression RE is a tree whose
>> structure is determined by the nesting of the parentheses in RE, and the
>> result of a match follows the tree structure.
>>
>
> Actually, this is not the case. When you have subexpressions, what you
> have is simply a list of them, and each subexpression has a value. Whether
> or not these subexpressions are nested does not matter. Its position is
> purely dictated by the index of the opening parentheses.
>
> Not exactly. It is true that libpcre returns a list of matches in terms of
> the position of each
> match in the subject string B. However any two matches are either disjoint
> or one match is
> contained in the other. This containment relation defines a partial order
> between the
> matches which is most conveniently described by a tree. In that tree one
> RE, say RE1 is a
> child of another RE RE2 if the substring of B corresponding to RE2 is
> contained in the
> substring of B that corresponds to RE2.
>
Wow, this mail became longer than I had expected. Just to be clear: I am
mostly talking about the case where strings are returned. The cases where
indexes or ⊂-data is returned are different, and what I'm saying below does
not necessarily apply entirely.
I will henceforth use the word "group" instead of "subexpression". The
formal name is “capturing group”.
The fact that parenthesised groups can nest is orthogonal to the behaviour
of the groups themselves.
The groups are referenced by index or by name (there is a special pattern
syntax to give a group a name), and the API then provides functions to
access the values of these groups. You can extract the values of a group
either by using the name or by the index.
Notice that you can even refer to a capturing group using backreferences
within the pattern itself. This should help prove how simple they are:
In a pattern the sequence \1 can be used to refer back to the result of the
first group, \2 to the second, etc. Thus (and you can test this with PCRE
yourself):
*x(.(.))-\1* matches: *xab-ab*, xwe-we, etc...
*x(.(.))-\2* matches: xab-b, xwe-e, etc...
Every single Regexp API I have used (including languages such as C, Perl,
Java, Javascript, Elixir, etc) returns the groups in the exact same way as
I have explained. You use groups to choose what parts of the text you're
interested in, and refer to them by index.
Speaking of Elixir, this is the set of group-related flags that their API
supports (something tells me they also use PCRE behind the scenes):
-
:all - all captured subpatterns including the complete matching string
(this is the default)
-
:first - only the first captured subpattern, which is always the
complete matching part of the string; all explicitly captured subpatterns
are discarded
-
:all_but_first- all but the first matching subpattern, i.e. all
explicitly captured subpatterns, but not the complete matching part of the
string
-
:none - does not return matching subpatterns at all
-
:all_names - captures all names in the Regex
-
list(binary) - a list of named captures to capture
I think this is an excellent approach, and definitely something to emulate.
The question is then: shall *⎕RE* simply return the array of matches (which
> was what your
> implementation did) or shall *⎕RE* return the matches as a tree? This is
> the same question
> as shall the tree be represented as a simple vector of nodes
> (corresponding to an APL
> vector of some kind) or shall it be represented as a recursive
> node-properties + children structure (corresponding to a nested APL value)?
>
My argument here is one of pragmatism:
1. Regexp implementations in all known languages returns an indexed list
of groups
2. A simple list is simply the most useful. I struggle to think of any
case where returning a nested structure is in any way useful.
The vector of nodes and the nested APL value are both equivalent in
> describing the
> tree. However, converting the nested tree structure to a vector of nodes
> is much simpler
> (in APL) than the other way around because converting a node vector to the
> tree involves
> a lot of comparisons which are quite lightweight but extremely ugly in
> APL. That was why
> decided to return the tree and not the vector of nodes.
>
You are indeed correct that converting it into a tree is more difficult
than the other way around. But then again, why should we go out of our ways
to make an incredibly rare use-case somewhat easier when the common
use-case becomes annoyingly complicated?
Can you think of a case where returning a nested structure is useful?