Re: [Bug-apl] Suggestion for Quad-RE

2017-10-12 Thread Elias Mårtenson
On 11 October 2017 at 21:15, Juergen Sauermann <
juergen.sauerm...@t-online.de> wrote:


> If I understand *libpcre* correctly (and I propably don't) then a general
> regular expression RE is a tree whose
> structure is determined by the nesting of the parentheses in RE, and the
> result of a match follows the tree structure.
>

Actually, this is not the case. When you have subexpressions, what you have
is simply a list of them, and each subexpression has a value. Whether or
not these subexpressions are nested does not matter. Its position is purely
dictated by the index of the opening parentheses.

When you use subexpressions, it means that I am interested in specific
parts of the matched string. If I am interested in a specific part of a
string, it is very unlikely that I want to know the content of the entire
match. But, if I do, I can always retrieve that using another set of parens
that surrounds the entire regexp.

When you don't have any subexpressions, it's most likely that I am not
interested in the matched string at all, but rather just a boolean result
telling me if I have a match at all.

The boolean case is simple, so the only aspect of this that warrants any
discussion is how that should be achieved. My opinion is that it should be
the default, but a flag can also be used.

For subexpressions, I think a few examples will help explain how they are
used:

Let's assume the following regexp:

A(.)|B(.)

This regexp has  two subexpressions, and the result with therefore have two
values. Due to the fact that they are separated by the alternation symbol
(|), one of the subexpressions will always be empty. So, here are the
different possible results when matching different strings:

"AXY"  Subexpr 1: "X", Subexpr 2: ""
"BZA"  Subexpr 1: "",  Subexpr 2: "Z"
"CXY"  *No match*

(with the current implementation, there is no way I can differentiate
between cases 1 and 2, which shows that the current implementation is not
working correctly)

As you can see from this example, I can look at the content of
subexpressions 1 and 2 to determine which of the alternatives was matched.

If I really want to see the whole match as well, I can force this by adding
a third subexpression (which will be number 1 since its opening parenthesis
comes first):

(A(.)|B(.))

Here, the result will also contain the full match:

"AXY"  Subexpr 1: "AX", Subexpr 2: "X", Subexpr 3: ""
"BZA"  Subexpr 1: "BZ", Subexpr 2: "",  Subexpr 3: "Z"
"CXY"  *No match*

I hope this helps explain why my design was the way it was. There is an
argument that the no-subexpression case should not return the full match
but rather a boolean value simply indicating whether a match was found or
not. In that case the old behaviour can still be achieved by wrapping the
entire regexp in a set of parentheses as shown above. However, I think a
flag to achieve this would be more clear.

Regards,
Elias


Re: [Bug-apl] Monadic form of ↓

2017-10-12 Thread Blake McBride
On Wed, Oct 11, 2017 at 7:52 AM, Juergen Sauermann <
juergen.sauerm...@t-online.de> wrote:

> Hi Louis,
>
> 
> Every such extension of the APL syntax creates a new incompatibility and
> should be avoided in the first place.
> .
>


Agreed!


Re: [Bug-apl] Suggestion for Quad-RE

2017-10-12 Thread Juergen Sauermann

  
  
Hi Elias,
  
  see below.
  
  /// Jürgen
  

On 10/12/2017 09:13 AM, Elias Mårtenson
  wrote:


  

  On 11 October 2017 at 21:15, Juergen
Sauermann 
wrote:
 

  If I understand libpcre2 correctly
  (and I probably don't) then a general regular
  _expression_ RE is a tree whose
  structure is determined by the nesting of the
  parentheses in RE, and the result of a match follows
  the tree structure.




Actually, this is not the case. When you have
  subexpressions, what you have is simply a list of them,
  and each subexpression has a value. Whether or not these
  subexpressions are nested does not matter. Its position is
  purely dictated by the index of the opening parentheses.


  

  

Not exactly. It is true that libpcre returns a list of matches in
terms of the position of each
match in the subject string B. However any two matches are either
disjoint or one match is
contained in the other. This containment relation defines a partial
order between the
matches which is most conveniently described by a tree. In that tree
one RE, say RE1 is a
child of another RE RE2 if the substring of B corresponding to RE2
is contained in the
substring of B that corresponds to RE2.

The question is then: shall ⎕RE simply return the array of
matches (which was what your
implementation did) or shall ⎕RE return the matches as a
tree? This is the same question
as shall the tree be represented as a simple vector of nodes
(corresponding to an APL
vector of some kind) or shall it be represented as a recursive
node-properties + children structure (corresponding to a nested APL
value)?

The vector of nodes and the nested APL value are both equivalent in
describing the
tree. However, converting the nested tree structure to a vector of
nodes is much simpler
(in APL) than the other way around because converting a node vector
to the tree involves
a lot of comparisons which are quite lightweight but extremely ugly
in APL. That was why
decided to return the tree and not the vector of nodes.

Now, to have an option that drops the first element means to have an
option that returns
the nodes of the result tree except its root node. Although
technically possible, this sounds
very arbitrary to me. It may suit a particular use case, but it do
not, IMHO, deserve a
special flag. I could also create a use case where it makes sense
that only every second
node of the tree is returned, for example when matching some
name=value pairs where
I am only interested in the values and not the names.

I am not entirely against a flag that goes into that direction, but
I believe that flag should
determine if either the tree is returned (default) or the node
vector of the of the tree if
the flag is given. Unfortunately that flag, even though it is far
more consistent with the
structure of the ⎕RE result than 1↓, does not solve your 1↓ because
it would still contain
the top-level match (= the root of the tree).


  

  
When you use subexpressions, it means that I am
  interested in specific parts of the matched string. If I
  am interested in a specific part of a string, it is very
  unlikely that I want to know the content of the entire
  match. But, if I do, I can always retrieve that using
  another set of parens that surrounds the entire regexp.



  

  

Not necessarily. It could also be a boundary condition of your match
that you
only want to be satisfied no matter how. REs like  [A-Z][a-z][0-9]
are often used that way.

  

  
When you don't have any subexpressions, it's most
  likely that I am not interested in the matched string at
  all, but rather just a boolean result telling me if I have
  a match at all.


The boolean case is simple, so the only aspect of this
  that warrants any discussion is how that should be
  achieved. My opinion is that it should be the default, but
  a flag can also be used.


For subexpressions, I think a few examples will help
  explain how they are used:


Let's assume the following regexp: