date:20171002

Re: [Bug-apl] Regex support

2017-10-02 Thread Elias Mårtenson

Some progress:

The behaviour I described earlier still works, but it can now also work on
N-dimensional arrays of strings, compiling the regex only once and then
applying it to all the cells.

In addition to this, I have now also added a flag "B" (meaning "bitmap")
that creates a bitmap of all matches and can be used in conjunction with ⊂
to split strings by regex.

Here's an example:

*  " +" ⎕RE["B"] "this is   a     test"*
┏→━━┓
┃0 0 0 0 1 0 0 2 2 2 0 3 3 3 3 3 0 0 0 0┃
┗━━━┛
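
For readers who want to experiment with the numbering scheme outside APL, the bitmap can be sketched in C++. This is only an illustration under stated assumptions: `match_bitmap` is a made-up name, std::regex (ECMAScript syntax) stands in for the PCRE library used by the real implementation, and empty matches are simply skipped.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of GNU APL): every character covered by the
// n-th non-empty regex match is set to n; unmatched characters stay 0.
std::vector<int> match_bitmap(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);                 // the regex is compiled once
    std::vector<int> bitmap(text.size(), 0);
    int match_no = 0;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        if (it->length(0) == 0) continue;         // assumption: ignore empty matches
        ++match_no;
        const auto start = static_cast<std::size_t>(it->position(0));
        const auto len = static_cast<std::size_t>(it->length(0));
        for (std::size_t i = start; i < start + len; ++i) bitmap[i] = match_no;
    }
    return bitmap;
}
```

Calling `match_bitmap(" +", "this is   a     test")` (one, three and five spaces between the words) gives `0 0 0 0 1 0 0 2 2 2 0 3 3 3 3 3 0 0 0 0`.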

This matches any sequence of spaces, and we can easily use ⊂ to split the
string:

*  {⍵ ⊂⍨ 0=" +" ⎕RE["B"] ⍵} "this is   a     test"*
┏→━┓
┃"this" "is" "a" "test"┃
┗∊━┛
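
The split works because of how APL2-style dyadic ⊂ (partition) treats its left argument: elements under a 0 are dropped, and a new piece starts wherever the mask value is greater than the one before it. A minimal C++ sketch of that rule, with the hypothetical name `partition_by_mask` and restricted to strings, could look like this:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper mimicking APL2-style partition on a string:
// characters whose mask value is not positive are dropped, and a new piece
// starts whenever the mask value is greater than the previous mask value.
std::vector<std::string> partition_by_mask(const std::vector<int>& mask,
                                           const std::string& text)
{
    std::vector<std::string> pieces;
    int previous = 0;
    for (std::size_t i = 0; i < text.size() && i < mask.size(); ++i) {
        if (mask[i] > 0) {
            if (mask[i] > previous) pieces.emplace_back();  // start a new piece
            pieces.back().push_back(text[i]);
        }
        previous = mask[i];
    }
    return pieces;
}
```

With the mask `1 1 1 1 0 1 1 0 0 0 1 0 0 0 0 0 1 1 1 1` (the bitmap compared with 0) the sketch returns "this", "is", "a" and "test". A mask of `1 2 3 4` splits four characters into four pieces, while `1 1 1 1` keeps them together as one.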

However, I'm not sure whether the values returned from the function are
ideal. The idea of the increasing numbers is to be able to differentiate
between the result of:

*  " " ⎕RE["B"] "    "*
┏→━━┓
┃1 2 3 4┃
┗━━━┛

vs:

*  " +" ⎕RE["B"] "    "*
┏→━━┓
┃1 1 1 1┃
┗━━━┛

Should it be left like this, or should it be done in some other way?

Regards,
Elias

On 25 September 2017 at 20:10, Juergen Sauermann <
juergen.sauerm...@t-online.de> wrote:

> Hi Elias,
>
> making a quad function an operator is simple if the function argument(s)
> is/are primitive functions
> and a little more complicated if not.
>
> First of all you have to implement (read: overload) some of the eval_XXX()
> functions that have function
> arguments. For monadic operators these eval_XXX() functions are:
>
>    virtual Token eval_ALB(Value_P A, Token & LO, Value_P B)
>    virtual Token eval_ALXB(Value_P A, Token & LO, Value_P X, Value_P B)
>    virtual Token eval_LB(Token & LO, Value_P B)
>    virtual Token eval_LXB(Token & LO, Value_P X, Value_P B)
>
> where L resp. LO stands for the left function argument. For dyadic
> operators they are:
>
>    virtual Token eval_ALRB(Value_P A, Token & LO, Token & RO, Value_P B)
>    virtual Token eval_ALRXB(Value_P A, Token & LO, Token & RO, Value_P X, Value_P B)
>    virtual Token eval_LRB(Token & LO, Token & RO, Value_P B)
>    virtual Token eval_LRXB(Token & LO, Token & RO, Value_P X, Value_P B)
>
> where L resp. LO and R resp. RO stand for the left and right function
> argument(s), A and B
> are the value arguments, and X the axis.
>
> Not all of them need to be implemented, only those whose function
> signatures are supported by the operator (mainly in terms of allowing
> an axis argument X or a left value argument A).
>
> If an operator supports defined functions (as opposed to primitive
> functions) then it will typically
> implement the operator itself as a macro, which means that the
> implementation is written in APL
> rather than in C++ (similar to "magic functions" in NARS). This is needed
> because primitive functions
> are atomic (they either succeed or fail, but cannot be continued after a
> failure) while defined functions
> (and operators) can continue at the point of interruption after the
> values that caused the fault have been fixed.
>
> Some of the built-in operators in GNU APL have both a primitive
> implementation (which is used when the function arguments are primitive)
> and a macro-based implementation (which is used if they are not). This is
> for performance reasons, so that the ability to take defined functions as
> arguments does not slow down the cases where the function arguments are
> primitive.
>
> The Macro definitions are contained in Macro.def
>
> Please note that in GNU APL functions cannot return functions, which may
> or may not be a problem
> in your case, depending on whether the function argument(s) of the
> ⎕-operator is/are primitive or not.
> In standard APL you cannot assign a function to a name. The usual
> work-around is to return a string and ⍎ it.
>
> My gut feeling is that if you need function arguments for implementing
> regular expressions then something has gone in the wrong direction
> somewhere else.
>
> Best Regards,
> /// Jürgen
>
>
>
> On 09/25/2017 05:18 AM, Elias Mårtenson wrote:
>
>> Dyalog's implementation is much more expressive than what I had proposed.
>>
>> There are technical reasons why we have no hope of replicating their
>> functionality (in particular, GNU APL does not have support for namespaces).
>>
>> Their function takes arguments and returns a function, which is a matcher
>> function that can be reused, which is useful since you'd only compile the
>> regexp once. Jürgen, how can I make a quad-function behave like below? It
>> seems to be similar in behaviour to ⍤ and ⍣.
>>
>> *  ('.at' ⎕R '\u0') 'The cat sat on the mat' *
>> The CAT SAT on the MAT
>>
>> It can also accept a function, in which case the function is called for
>> each match, to return a replacement string. Can you explain how to make a
>> quad-function an operator?
>>
>> *  ('\w+' ⎕R {⌽⍵.Match}) 'The cat sat on the mat'*
>> ehT tac tas no eht tam
>>
>>

Re: [Bug-apl] Regex support

2017-10-02 Thread Elias Mårtenson

In playing around with this, I realise that the "B" mode is quite useful.
So much so, in fact, that I'm wondering if it's warranted to have a
dedicated quad-function for this specific behaviour.

Here's an example of extracting sequences of 4 characters:

*  {⍵ ⊂⍨ "[a-z]{4}" ⎕RE['B'] ⍵} 'abcdef45abchello9'*
┏→━━━┓
┃"abcd" "abch" "ello"┃
┗∊━━━┛

Regards,
Elias


Re: [Bug-apl] Regex support

2017-10-02 Thread Juergen Sauermann


  
  
Hi Elias,

I believe it is better to keep things together, i.e. in a single ⎕
function rather than in several.

It may be intuitive to use the character ⊂ instead of B in the axis
argument to indicate that the result is meant for dyadic ⊂.

/// Jürgen
  

  

Re: [Bug-apl] Regex support

2017-10-02 Thread Elias Mårtenson

In the default mode, as I have demonstrated earlier, when the regexp has
parenthesised subexpressions, the strings matching those expressions will
be returned as separate strings. This is logical and in my opinion makes
perfect sense.

When using ⊂-mode, parenthesised expressions don't change the behaviour
at all, as there is no natural behaviour to implement in this case.

However, it would be nice to have a way to use subexpressions to split
strings, so I'm thinking of something like the following:


*  "([0-9]{4})-([0-9]{2})-([0-9]{2})" ⎕RE[something] "foo 2010-02-03"*
┏→━━┓
┃0 0 0 0 1 1 1 1 0 2 2 0 3 3┃
┗━━━┛
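
To make the proposed single-match variant concrete, here is a rough C++ sketch. The name `group_bitmap` is hypothetical, std::regex again stands in for PCRE, and a subexpression that did not take part in the match is assumed to leave no mark.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: run the match once and mark the characters of the
// n-th parenthesised subexpression with n; everything else stays 0.
std::vector<int> group_bitmap(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::vector<int> bitmap(text.size(), 0);
    std::smatch m;
    if (!std::regex_search(text, m, re)) return bitmap;  // no match: all zeros
    for (std::size_t g = 1; g < m.size(); ++g) {         // group 0 is the whole match
        if (!m[g].matched) continue;                      // group did not participate
        const auto start = static_cast<std::size_t>(m.position(g));
        const auto len = static_cast<std::size_t>(m.length(g));
        for (std::size_t i = start; i < start + len; ++i) bitmap[i] = static_cast<int>(g);
    }
    return bitmap;
}
```

For the pattern and the string "foo 2010-02-03" used above, the sketch produces `0 0 0 0 1 1 1 1 0 2 2 0 3 3`.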

Note that this variation is different from the previous one in that
the ⊂-mode described in my previous email repeatedly calls the matching
function, marking each result in the output bitmap, while the proposed
version above runs the match only once, marking the subexpressions in the
result.

I'm starting to think that both are needed, but what symbols should be used
in the axis argument to indicate the desired mode?

An alternative output for the same expression would be something like the
following, which would match pretty much exactly what the underlying PCRE
function returns:

┏→┓
↓ 4 14┃
┃ 4  8┃
┃10 11┃
┃13 14┃
┃ 4 14┃
┗━┛
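
A sketch of how such an offset table could be collected is given below, again in C++ with std::regex standing in for PCRE. The name `match_offsets` is hypothetical, and the conventions are assumptions that need not agree with the matrix above: offsets are 0-based, the end offset points one past the last matched character, row 0 describes the whole match, and a subexpression that did not participate is reported as (-1, -1).

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: one (start, end) row per group, row 0 being the whole
// match. Offsets are 0-based and the end is exclusive (an assumption).
std::vector<std::pair<long, long>> match_offsets(const std::string& pattern,
                                                 const std::string& text)
{
    const std::regex re(pattern);
    std::vector<std::pair<long, long>> rows;
    std::smatch m;
    if (!std::regex_search(text, m, re)) return rows;    // no match: no rows
    for (std::size_t g = 0; g < m.size(); ++g) {
        if (!m[g].matched) {                              // unset group
            rows.emplace_back(-1, -1);
            continue;
        }
        const long start = static_cast<long>(m.position(g));
        rows.emplace_back(start, start + static_cast<long>(m.length(g)));
    }
    return rows;
}
```

Under these conventions the date pattern applied to "foo 2010-02-03" gives the rows (4, 14), (4, 8), (9, 11) and (12, 14). Which numbering is most convenient on the APL side (for example with respect to ⎕IO) is left open here.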

Would this be a useful variation too? And if so, what axis marker should
be used for it?

Regards,
Elias

