Re: add regexp-split: a summary and new proposal

Daniel Hartwig Sat, 31 Dec 2011 01:31:08 -0800

On 31 December 2011 15:30, Eli Barzilay <e...@barzilay.org> wrote:
> But there's one more point that bugs me about the python thing: the
> resulting list has both the matches and the non-matching gaps, and
> knowing which is which is tricky.  For example, if you do this (I'll
> use our syntax here, so note the minor differences):
>
>  (define (foo rx)
>    (regexp-split rx "some string"))
>
> then you can't tell which is which in its output without knowing how
> many grouping parens are in the input regexp.  It therefore makes
> sense to me to have this instead:
>
>  > (regexp-explode #rx"([^0-9])" "123+456*/")
>  '("123" ("+") "456" ("*") "" ("/") "")
>
> and now it's easy to know which is which.  This is of course a simple
> example with a single group so it doesn't look like much help, but
> with more than one group things can get confusing otherwise: for
> example, in python you can get `None's in the result:
>
>  >>> re.split('([^0-9](4)?)', '123+456*/')
>  ['123', '+4', '4', '56', '*', None, '', '/', None, '']
>
> but with the above, this becomes:
>
>  > (regexp-explode #rx"([^0-9](4)?)" "123+456*/")
>  '("123" ("+4" "4") "56" ("*" #f) "" ("/" #f) "")
>
> so you can rely on the odd-numbered elements to be strings.  This is
> probably going to be different for you, since you allow string
> predicates instead of regexps.
>
> Finally, the Racket implementation will probably be a little different
> still -- our `regexp-match' returns a list with the matched substring
> first, and then the matches for the capturing groups.  Following this,


The format is the same in Guile, substring followed by capturing
groups:

scheme@(guile-user)> (string-match "([^0-9])" "123+456*/")
$7 = #("123+456*/" (3 . 4) (3 . 4))

Though that is more of an analogue to `regexp-match-positions'.

> a more uniform behavior for a `regexp-explode' would be to return
> these lists, so we'd actually get:
>
>  > (regexp-explode #rx"[^0-9]" "123+456*/")
>  '("123" ("+") "456" ("*") "" ("/") "")
>  > (regexp-explode #rx"([^0-9])" "123+456*/")
>  '("123" ("+" "+") "456" ("*" "*") "" ("/" "/") "")

This is a very interesting way to return the results.

Now that the `explode' has been separated from `split' I am actually
quite partial to always including the matched substring in the result.
This makes even more sense considering the output would be the same
using a char-predicate or regexp with no capturing groups:

scheme@(guile-user)> (string-explode "123+456*/" (negate char-numeric?))
$8 = ("123" "+" "456" "*" "" "/" "")
scheme@(guile-user)> (string-explode "123+456*/" (make-regexp "[^0-9]"))
$9 = ("123" "+" "456" "*" "" "/" "")

And the result is compatible with using `string-concatenate' as an
inverse operation:

scheme@(guile-user)> (string-concatenate $9)
$10 = "123+456*/"

Bonus!
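
For comparison, here is a rough sketch of the same flat behaviour in
Python, the implementation used as the reference point above.  The
name `explode_flat' is made up for illustration, and the sketch assumes
the pattern has no capturing groups of its own:

```python
import re

def explode_flat(pattern, s):
    # Hypothetical helper: wrapping the whole pattern in one capturing
    # group makes re.split keep each matched delimiter between the gaps.
    # Assumes `pattern` itself contains no capturing groups.
    return re.split("(" + pattern + ")", s)

parts = explode_flat("[^0-9]", "123+456*/")
print(parts)           # ['123', '+', '456', '*', '', '/', '']
print("".join(parts))  # 123+456*/
```

Joining the pieces gives back the input because every character ends
up in exactly one gap or one match.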

WRT returning all the capturing groups as a list:

 + as you mentioned earlier, the user can be somewhat ignorant of the
   number of capturing groups (why not just use `split'?);
 + easier to handle collectively;

 - result is no longer a flat list (I *do* like sexps, really);
 - moving away from *all* existing implementations;

 * trivial to transform between styles, assuming one knows how many
   capturing groups there are;
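
To illustrate that last point, a minimal sketch in Python, whose
`re.split' already produces the flat style.  The helper names `nest'
and `flatten' are invented here; the group count comes from the
compiled pattern's `groups' attribute:

```python
import re

def nest(flat, ngroups):
    # Flat layout is: gap, g1..gN, gap, g1..gN, ..., gap.
    # Collect each run of N group entries into its own list.
    step = ngroups + 1
    out = []
    for i in range(0, len(flat), step):
        out.append(flat[i])
        groups = flat[i + 1:i + step]
        if groups:  # the final gap has no groups after it
            out.append(list(groups))
    return out

def flatten(nested):
    # Inverse of nest: splice every group list back into the sequence.
    out = []
    for item in nested:
        if isinstance(item, list):
            out.extend(item)
        else:
            out.append(item)
    return out

rx = re.compile("([^0-9](4)?)")
flat = rx.split("123+456*/")
nested = nest(flat, rx.groups)
print(nested)
# ['123', ['+4', '4'], '56', ['*', None], '', ['/', None], '']
```

Going the other way needs no group count at all, which is one small
argument for treating the nested form as the primary one.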

So now I am thinking about both `string-explode' (flat output) and
`regexp-explode' with the nested output.

> And again, this looks silly in this simple example, but would be more
> useful in more complex ones.  We would also have a similar
> `regexp-explode-positions' function that returns position pairs for
> cases where you don't want to allocate all substrings.

... or need to know the positioning information.
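
A possible shape for such a positions variant, sketched in Python
rather than Scheme.  `explode_positions' is a hypothetical name, and
the sketch assumes the pattern never matches the empty string:

```python
import re

def explode_positions(pattern, s):
    # Gaps are (start, end) tuples; each match is a list of spans with
    # the whole match first, then one span per capturing group.
    # Assumes the pattern never matches the empty string.
    rx = re.compile(pattern)
    out = []
    last = 0
    for m in rx.finditer(s):
        out.append((last, m.start()))
        out.append([m.span(g) for g in range(rx.groups + 1)])
        last = m.end()
    out.append((last, len(s)))
    return out

print(explode_positions("[^0-9]", "123+456*/"))
# [(0, 3), [(3, 4)], (4, 7), [(7, 8)], (8, 8), [(8, 9)], (9, 9)]
```

In Python a group that did not take part in the match reports the
span (-1, -1), which plays the role of #f in the nested output.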

[BTW, substrings in Guile share copy-on-write memory with the string
they were taken from, so I don't see string allocation as an issue on
the Guile front.  Not sure about substrings in Racket.]
