Re: [Bug-apl] Regex support
Some progress:

The behaviour I described earlier still works, but it can now work on N-dimensional arrays of strings, compiling the regex only once and then applying it to all the cells.

In addition to this, I have now also added a flag "B" (meaning "bitmap") that creates a bitmap of all matches and can be used in conjunction with ⊂ to split strings by regex.

Here's an example:

      " +" ⎕RE["B"] "this is   a     test"
┏→━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃0 0 0 0 1 0 0 2 2 2 0 3 3 3 3 3 0 0 0 0┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

This matches any sequence of spaces, and we can easily use ⊂ to split the string:

      {⍵ ⊂⍨ 0=" +" ⎕RE["B"] ⍵} "this is   a     test"
┏→━━━━━━━━━━━━━━━━━━━━━┓
┃"this" "is" "a" "test"┃
┗∊━━━━━━━━━━━━━━━━━━━━━┛

However, I'm not sure if the values returned from the function are ideal. The idea of the increasing numbers is to be able to differentiate between the result of:

      " " ⎕RE["B"] "    "
┏→━━━━━━┓
┃1 2 3 4┃
┗━━━━━━━┛

vs:

      " +" ⎕RE["B"] "    "
┏→━━━━━━┓
┃1 1 1 1┃
┗━━━━━━━┛

Should it be left like this, or should it be done in some other way?

Regards,
Elias

On 25 September 2017 at 20:10, Juergen Sauermann <juergen.sauerm...@t-online.de> wrote:

> Hi Elias,
>
> making a quad function an operator is simple if the function argument(s)
> is/are primitive functions and a little more complicated if not.
>
> First of all you have to implement (read: overload) some of the eval_XXX()
> functions that have function arguments. For monadic operators these
> eval_XXX() functions are:
>
>    virtual Token eval_ALB(Value_P A, Token & LO, Value_P B)
>    virtual Token eval_ALXB(Value_P A, Token & LO, Value_P X, Value_P B)
>    virtual Token eval_LB(Token & LO, Value_P B)
>    virtual Token eval_LXB(Token & LO, Value_P X, Value_P B)
>
> where L resp. LO stands for the left function argument.
> For dyadic operators they are:
>
>    virtual Token eval_ALRB(Value_P A, Token & LO, Token & RO, Value_P B)
>    virtual Token eval_ALRXB(Value_P A, Token & LO, Token & RO, Value_P X, Value_P B)
>    virtual Token eval_LRB(Token & LO, Token & RO, Value_P B)
>    virtual Token eval_LRXB(Token & LO, Token & RO, Value_P X, Value_P B)
>
> where L resp. LO and R resp. RO stand for the left and right function
> argument(s), A and B are the value arguments, and X is the axis.
>
> Not all of them need to be implemented, only those whose function
> signatures are supported by the operator (mainly in terms of allowing an
> axis argument X or a left value argument A).
>
> If an operator supports defined functions (as opposed to primitive
> functions) then it will typically implement the operator itself as a
> macro, which means that the implementation is written in APL rather than
> in C++ (similar to "magic functions" in NARS). This is needed because
> primitive functions are atomic (they either succeed or fail, but cannot
> be continued after a failure) while defined functions (and operators) can
> continue at the point of interruption after the values that caused the
> fault have been fixed.
>
> Some of the built-in operators in GNU APL have both a primitive
> implementation (which is used when the function arguments are primitive)
> and a macro-based implementation for when they are not. This is for
> performance reasons, so that the ability to take defined functions as
> arguments does not slow down the cases where the function arguments are
> primitive.
>
> The macro definitions are contained in Macro.def
>
> Please note that in GNU APL functions cannot return functions, which may
> or may not be a problem in your case, depending on whether the function
> argument(s) of the ⎕-operator is/are primitive or not. In standard APL
> you cannot assign a function to a name. The usual work-around is to
> return a string and ⍎ it.
>
> My gut feeling is that if you need function arguments for implementing
> regular expressions then something has gone in the wrong direction
> somewhere else.
>
> Best Regards,
> /// Jürgen
>
> On 09/25/2017 05:18 AM, Elias Mårtenson wrote:
>
>> Dyalog's implementation is much more expressive than what I had proposed.
>>
>> There are technical reasons why we have no hope of replicating their
>> functionality (in particular, GNU APL does not have support for
>> namespaces).
>>
>> Their function takes arguments and returns a function, which is a matcher
>> function that can be reused, which is useful since you'd only compile the
>> regexp once. Jürgen, how can I make a quad-function behave like below? It
>> seems to be similar in behaviour to ⍤ and ⍣.
>>
>>       ('.at' ⎕R '\u0') 'The cat sat on the mat'
>> The CAT SAT on the MAT
>>
>> It can also accept a function, in which case the function is called for
>> each match, to return a replacement string. Can you explain how to make a
>> quad-function an operator?
>>
>>       ('\w+' ⎕R {⌽⍵.Match}) 'The cat sat on the mat'
>> ehT tac tas no eht tam
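For readers following the thread, the numbering scheme of the "B" flag described at the top of this message can be modelled outside GNU APL. The following is only a minimal C++ sketch under stated assumptions: it uses std::regex (ECMAScript syntax) instead of the PCRE library that ⎕RE sits on, the helper name match_bitmap is hypothetical, and skipping empty matches is a guess about the intended behaviour rather than something stated in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of GNU APL): label every character of `text`
// with the 1-based number of the match that covers it, or 0 if none does.
std::vector<int> match_bitmap(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);             // the pattern is compiled once
    std::vector<int> bitmap(text.size(), 0);
    int match_no = 0;

    // Repeatedly apply the regex, as the "B" mode is described to do.
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        const std::size_t pos = static_cast<std::size_t>(it->position(0));
        const std::size_t len = static_cast<std::size_t>(it->length(0));
        if (len == 0) continue;               // assumption: empty matches mark nothing
        assert(pos + len <= text.size());
        ++match_no;                           // each match gets the next number
        for (std::size_t i = pos; i < pos + len; ++i) bitmap[i] = match_no;
    }
    return bitmap;
}
```

In this model, match_bitmap(" ", "    ") yields 1 2 3 4 while match_bitmap(" +", "    ") yields 1 1 1 1, which is the distinction the increasing numbers are meant to preserve.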
Re: [Bug-apl] Regex support
In playing around with this, I realise that the "B" mode is quite useful. So much so, in fact, that I'm wondering if it's warranted to have a dedicated quad-function for this specific behaviour.

Here's an example of extracting sequences of 4 characters:

      {⍵ ⊂⍨ "[a-z]{4}" ⎕RE['B'] ⍵} 'abcdef45abchello9'
┏→━━━━━━━━━━━━━━━━━━━┓
┃"abcd" "abch" "ello"┃
┗∊━━━━━━━━━━━━━━━━━━━┛

Regards,
Elias

On 2 October 2017 at 16:27, Elias Mårtenson wrote:

> In addition to this, I have now also added a flag "B" (meaning "bitmap")
> that creates a bitmap of all matches and can be used in conjunction with ⊂
> to split strings by regex.
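The extraction example in this message relies on dyadic ⊂ starting a new item wherever its left argument increases and dropping positions where it is 0 (APL2-style partition). A rough C++ model of that step is sketched below; the function name partition is hypothetical, and the mask used in the note that follows is what the 'B' mode would be expected to return for this input, inferred from the result shown in the message rather than taken from an actual run.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of APL2-style partition (dyadic ⊂) on a character vector:
// a new piece starts where the mask value rises above its predecessor, and
// characters whose mask value is 0 (or negative) are dropped.
std::vector<std::string> partition(const std::vector<int>& mask, const std::string& text)
{
    assert(mask.size() == text.size());       // both arguments must have equal length
    std::vector<std::string> pieces;
    int prev = 0;                             // value "before" the first element
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (mask[i] > 0) {
            if (mask[i] > prev) pieces.emplace_back();   // value increased: new piece
            pieces.back().push_back(text[i]);
        }
        prev = mask[i];
    }
    return pieces;
}
```

Called with the mask 1 1 1 1 0 0 0 0 2 2 2 2 3 3 3 3 0 and the text 'abcdef45abchello9', this gives "abcd", "abch" and "ello". The adjacent matches "abch" and "ello" stay separate only because their numbers differ, which is one argument for keeping the increasing numbers in the bitmap.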
Re: [Bug-apl] Regex support
Hi Elias,

I believe it is better to keep things together, i.e. in a single ⎕ function rather than in several.

It may be intuitive to use the character ⊂ instead of B in the axis argument to indicate that the result is meant for dyadic ⊂.

/// Jürgen

On 10/02/2017 10:47 AM, Elias Mårtenson wrote:

> In playing around with this, I realise that the "B" mode is quite useful.
> So much so, in fact, that I'm wondering if it's warranted to have a
> dedicated quad-function for this specific behaviour.
Re: [Bug-apl] Regex support
In the default mode, as I have demonstrated earlier, when the regexp has parenthesised subexpressions, the strings matching those expressions will be returned as separate strings. This is logical and in my opinion makes perfect sense.

When using ⊂-mode, parenthesised expressions don't change the behaviour at all, as there is no natural behaviour to implement in this case. However, it would be nice to have a way to use subexpressions to split strings, so I'm thinking of something like the following:

      "([0-9]{4})-([0-9]{2})-([0-9]{2})" ⎕RE[something] "foo 2010-02-03"
┏→━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃0 0 0 0 1 1 1 1 0 2 2 0 3 3┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Note that this variation differs from the previous one: the ⊂-mode described in my previous email repeatedly calls the matching function, marking each result in the output bitmap, while the proposed version above runs the match only once, marking the subexpressions in the result.

I'm starting to think that both are needed, but what symbols should be used in the axis argument to indicate the desired mode?

An alternative output for the same expression would be something like the following, which would match pretty much exactly what the underlying PCRE function returns:

┏→━━━━┓
↓ 4 14┃
┃ 4  8┃
┃10 11┃
┃13 14┃
┃ 4 14┃
┗━━━━━┛

Would this be a useful variation too? And if so, what axis marker should be used for it?

Regards,
Elias

On 3 October 2017 at 01:30, Juergen Sauermann wrote:

> Hi Elias,
>
> I believe it is better to keep things together, i.e. in a single ⎕
> function rather than in several.
>
> It may be intuitive to use the character ⊂ instead of B in the axis
> argument to indicate that the result is meant for dyadic ⊂.
>
> /// Jürgen
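The subexpression-marking variation proposed in this message (one match, with each parenthesised group numbered in the result) can be modelled along the following lines. This is a sketch only: it uses std::regex rather than PCRE, the name group_bitmap is hypothetical, and leaving characters outside any group at 0 is read off the example output above.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of GNU APL): run the regex once and label the
// characters covered by capture group g with the number g; everything else is 0.
std::vector<int> group_bitmap(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::vector<int> bitmap(text.size(), 0);

    std::smatch m;
    if (!std::regex_search(text, m, re)) return bitmap;   // no match: all zeros

    // m[0] is the whole match; the subexpressions start at index 1.
    for (std::size_t g = 1; g < m.size(); ++g) {
        if (!m[g].matched) continue;                      // group did not participate
        const std::size_t pos = static_cast<std::size_t>(m.position(g));
        const std::size_t len = static_cast<std::size_t>(m.length(g));
        assert(pos + len <= text.size());
        for (std::size_t i = pos; i < pos + len; ++i) bitmap[i] = static_cast<int>(g);
    }
    return bitmap;
}
```

For the pattern "([0-9]{4})-([0-9]{2})-([0-9]{2})" and the text "foo 2010-02-03" this yields 0 0 0 0 1 1 1 1 0 2 2 0 3 3, the same vector as in the proposal. Unlike the repeated-match bitmap, only the first match in the text is considered here.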