date:20171013

Re: [Bug-apl] Suggestion for Quad-RE

2017-10-13 Thread Elias Mårtenson

On 12 October 2017 at 20:55, Juergen Sauermann <
juergen.sauerm...@t-online.de> wrote:

> Hi Elias,
>
> see below.
>
> /// Jürgen
>
>
> On 10/12/2017 09:13 AM, Elias Mårtenson wrote:
>
> On 11 October 2017 at 21:15, Juergen Sauermann <
> juergen.sauerm...@t-online.de> wrote:
>
>
>> If I understand *libpcre2* correctly (and I probably don't) then a
>> general regular expression RE is a tree whose
>> structure is determined by the nesting of the parentheses in RE, and the
>> result of a match follows the tree structure.
>>
>
> Actually, this is not the case. When you have subexpressions, what you
> have is simply a list of them, and each subexpression has a value. Whether
> or not these subexpressions are nested does not matter. Its position is
> purely dictated by the index of the opening parentheses.
>
> Not exactly. It is true that libpcre returns a list of matches in terms of
> the position of each
> match in the subject string B. However any two matches are either disjoint
> or one match is
> contained in the other. This containment relation defines a partial order
> between the
> matches which is most conveniently described by a tree. In that tree one
> RE, say RE1 is a
> child of another RE RE2 if the substring of B corresponding to RE1 is
> contained in the
> substring of B that corresponds to RE2.
>

Wow, this mail became longer than I had expected. Just to be clear: I am
mostly talking about the case where strings are returned. The cases where
indexes or ⊂-data is returned are different, and what I'm saying below does
not necessarily apply entirely.

I will henceforth use the word "group" instead of "subexpression". The
formal name is “capturing group”.

The fact that parenthesised groups can nest is orthogonal to the behaviour
of the groups themselves.

The groups are referenced by index or by name (there is a special pattern
syntax to give a group a name), and the API then provides functions to
access the values of these groups. You can extract the values of a group
either by using the name or by the index.
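
As a minimal sketch of the above, here is the same idea using Python's re
module rather than PCRE (an assumption made purely for illustration; the
group names "pair" and "last" are invented for this example). Groups are
numbered by the position of their opening parenthesis, whether or not they
nest:

```python
import re

# Group 1 ("pair") opens first, group 2 ("last") opens second and happens
# to be nested inside group 1. Nesting does not change the numbering.
m = re.search(r"x(?P<pair>.(?P<last>.))", "xab")

whole = m.group(0)          # the complete match
by_index = m.group(1)       # first group, accessed by index
by_name = m.group("pair")   # the same group, accessed by name
inner = m.group(2)          # second group, accessed by index
flat = m.groups()           # all groups as a flat, indexed sequence
```

The result is a flat sequence even though one group sits inside the other.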

Notice that you can even refer to a capturing group using backreferences
within the pattern itself. This should help prove how simple they are:

In a pattern the sequence \1 can be used to refer back to the result of the
first group, \2 to the second, etc. Thus (and you can test this with PCRE
yourself):

x(.(.))-\1  matches:  xab-ab, xwe-we, etc...
x(.(.))-\2  matches:  xab-b, xwe-e, etc...
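
The two patterns above can be checked with a short sketch. It uses Python's
re module instead of PCRE (an assumption for illustration only; for these
patterns the backreference syntax is the same):

```python
import re

# \1 refers back to the outer group, \2 to the inner (nested) group.
outer_ref = re.compile(r"x(.(.))-\1")
inner_ref = re.compile(r"x(.(.))-\2")

outer_ok = outer_ref.fullmatch("xab-ab") is not None
outer_rejects = outer_ref.fullmatch("xab-b") is None
inner_ok = inner_ref.fullmatch("xab-b") is not None
inner_rejects = inner_ref.fullmatch("xab-ab") is None
```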

Every single Regexp API I have used (including languages such as C, Perl,
Java, Javascript, Elixir, etc) returns the groups in the exact same way as
I have explained. You use groups to choose what parts of the text you're
interested in, and refer to them by index.

Speaking of Elixir, this is the set of group-related flags that their API
supports (something tells me they also use PCRE behind the scenes):

   - :all - all captured subpatterns including the complete matching string
     (this is the default)
   - :first - only the first captured subpattern, which is always the
     complete matching part of the string; all explicitly captured
     subpatterns are discarded
   - :all_but_first - all but the first matching subpattern, i.e. all
     explicitly captured subpatterns, but not the complete matching part of
     the string
   - :none - does not return matching subpatterns at all
   - :all_names - captures all names in the Regex
   - list(binary) - a list of named captures to capture

I think this is an excellent approach, and definitely something to emulate.

> The question is then: shall *⎕RE* simply return the array of matches (which
> was what your
> implementation did) or shall *⎕RE* return the matches as a tree? This is
> the same question
> as shall the tree be represented as a simple vector of nodes
> (corresponding to an APL
> vector of some kind) or shall it be represented as a recursive
> node-properties + children structure (corresponding to a nested APL value)?
>

My argument here is one of pragmatism:

   1. Regexp implementations in all known languages return an indexed list
   of groups
   2. A simple list is simply the most useful. I struggle to think of any
   case where returning a nested structure is in any way useful.

> The vector of nodes and the nested APL value are both equivalent in
> describing the
> tree. However, converting the nested tree structure to a vector of nodes
> is much simpler
> (in APL) than the other way around because converting a node vector to the
> tree involves
> a lot of comparisons which are quite lightweight but extremely ugly in
> APL. That was why
> I decided to return the tree and not the vector of nodes.
>

You are indeed correct that converting it into a tree is more difficult
than the other way around. But then again, why should we go out of our way
to make an incredibly rare use-case somewhat easier when the common
use-case becomes annoyingly complicated?
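
To make the two directions concrete, here is a rough sketch in Python
(not APL, and using Python's re module rather than PCRE; the function names
spans_to_tree and tree_to_spans are hypothetical). It assumes the flat list
gives one (start, end) span per group in group order, with (-1, -1) for
groups that did not participate, and it ignores the ambiguity of
zero-width groups:

```python
import re

def spans_to_tree(spans):
    """Nest (start, end) spans by containment.

    Assumes spans are listed in group order (group 0 first) and that
    groups which did not participate are reported as (-1, -1).
    """
    root = {"span": spans[0], "children": []}
    stack = [root]  # path from the root to the most recently added node
    for start, end in spans[1:]:
        if start < 0:
            continue  # group did not take part in the match
        # Pop until the node on top of the stack contains the new span.
        while not (stack[-1]["span"][0] <= start
                   and end <= stack[-1]["span"][1]):
            stack.pop()
        node = {"span": (start, end), "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root

def tree_to_spans(node):
    """Flatten the tree back into a list with a pre-order walk."""
    flat = [node["span"]]
    for child in node["children"]:
        flat.extend(tree_to_spans(child))
    return flat

m = re.search(r"x(.(.))-(.)", "xab-c")
spans = [m.span(i) for i in range(m.re.groups + 1)]
tree = spans_to_tree(spans)
```

Building the tree needs the containment comparisons; flattening it again is
a plain traversal.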

Can you think of a case where returning a nested structure is useful?

Re: [Bug-apl] Suggestion for Quad-RE

2017-10-13 Thread Juergen Sauermann


  
  
Hi Elias,

I believe we are looking at ⎕RE from two different points of view. From a
language or function designer's point of view your focus is less on specific
use cases but more on how well the function fits into the rest of the
language. From a function user's point of view you care more about use cases
and the simplicity of use.

Before we dive too deeply into technical details let me ask you two
questions:

1. Would you agree that the result that is returned by pcre2_match() is a
tree, regardless of how that tree is represented in the API of libpcre2?

2. Suppose you have to choose between two APL libraries libA and libB.

libA provides a single function V←foo B that solves some problem in a
generic way (so that all use cases can be covered), but the result returned
by foo may need to be adapted to different use cases by means of simple APL
operations such as 1↓foo B, N⍴foo B, etc.

libB does not provide the generic foo of libA but instead a number of
different functions: foo1 returning 1↓foo B, foo2 returning N⍴foo B, etc.
Some rare use cases of foo in libA are not covered by any of the fooN
functions in libB, but your project does not need them.

Given these two libraries, you have to irrevocably decide between
using libA or libB in
your project. What would be your decision?

/// Jürgen
