On 31 December 2011 15:30, Eli Barzilay <e...@barzilay.org> wrote:
> But there's one more point that bugs me about the python thing: the
> resulting list has both the matches and the non-matching gaps, and
> knowing which is which is tricky. For example, if you do this (I'll
> use our syntax here, so note the minor differences):
>
>   (define (foo rx)
>     (regexp-split rx "some string"))
>
> then you can't tell which is which in its output without knowing how
> many grouping parens are in the input regexp. It therefore makes
> sense to me to have this instead:
>
>   > (regexp-explode #rx"([^0-9])" "123+456*/")
>   '("123" ("+") "456" ("*") "" ("/") "")
>
> and now it's easy to know which is which. This is of course a simple
> example with a single group so it doesn't look like much help, but
> with more than one group things can get confusing otherwise: for
> example, in python you can get `None's in the result:
>
>   >>> re.split('([^0-9](4)?)', '123+456*/')
>   ['123', '+4', '4', '56', '*', None, '', '/', None, '']
>
> but with the above, this becomes:
>
>   > (regexp-explode #rx"([^0-9](4)?)" "123+456*/")
>   '("123" ("+4" "4") "56" ("*" #f) "" ("/" #f) "")
>
> so you can rely on the odd-numbered elements to be strings. This is
> probably going to be different for you, since you allow string
> predicates instead of regexps.
>
> Finally, the Racket implementation will probably be a little different
> still -- our `regexp-match' returns a list with the matched substring
> first, and then the matches for the capturing groups. Following this,

The format is the same in Guile, substring followed by capturing
groups:

scheme@(guile-user)> (string-match "([^0-9])" "123+456*/")
$7 = #("123+456*/" (3 . 4) (3 . 4))

Though that is more of an analogue to `regexp-match-positions'.

> a more uniform behavior for a `regexp-explode' would be to return
> these lists, so we'd actually get:
>
>   > (regexp-explode #rx"[^0-9]" "123+456*/")
>   '("123" ("+") "456" ("*") "" ("/") "")
>   > (regexp-explode #rx"([^0-9])" "123+456*/")
>   '("123" ("+" "+") "456" ("*" "*") "" ("/" "/") "")

This is a very interesting way to return the results. Now that
`explode' has been separated from `split', I am actually quite partial
to always including the matched substring in the result. This makes
even more sense considering that the output would be the same using a
char-predicate or a regexp with no capturing groups:

scheme@(guile-user)> (string-explode "123+456*/" (negate char-numeric?))
$8 = ("123" "+" "456" "*" "" "/" "")
scheme@(guile-user)> (string-explode "123+456*/" (make-regexp "[^0-9]"))
$9 = ("123" "+" "456" "*" "" "/" "")

And the result is compatible with using `string-concatenate' as an
inverse operation:

scheme@(guile-user)> (string-concatenate $9)
$10 = "123+456*/"

Bonus!

WRT returning all the capturing groups as a list:

+ as you mentioned earlier, the user can be somewhat ignorant of the
  number of capturing groups (why not just use `split'?);
+ easier to handle collectively;
- result is no longer a flat list (I *do* like sexps, really);
- moving away from *all* existing implementations;
* trivial to transform between styles, assuming one knows how many
  capturing groups there are;

So now I am thinking about both `string-explode' (flat output) and
`regexp-explode' with the nested output.

> And again, this looks silly in this simple example, but would be more
> useful in more complex ones. We would also have a similar
> `regexp-explode-positions' function that returns position pairs for
> cases where you don't want to allocate all substrings.

...
or need to know the positioning information.

[BTW, substrings in Guile share copy-on-write memory with their parent
string, so I don't see string allocation as an issue on the Guile
front. Not sure about substrings in Racket.]
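
To make the comparison with Python concrete, here is a rough sketch of
the two output styles written against Python's `re' module. The names
`explode_nested', `explode_flat' and `to_split_style' are made up for
illustration; they are not part of Guile, Racket or Python, and the
sketch assumes the pattern never matches the empty string.

```python
import re


def explode_nested(pattern, string):
    """Gaps as strings; each match as a list [whole-match, group1, ...].

    Unmatched groups show up as None (the analogue of #f).
    Assumes the pattern never matches the empty string.
    """
    out, pos = [], 0
    for m in re.finditer(pattern, string):
        out.append(string[pos:m.start()])       # gap before the match
        out.append([m.group(0), *m.groups()])   # whole match, then groups
        pos = m.end()
    out.append(string[pos:])                    # trailing gap
    return out


def explode_flat(pattern, string):
    """Flat style: gaps and whole matches only, so joining restores the input."""
    return [item[0] if isinstance(item, list) else item
            for item in explode_nested(pattern, string)]


def to_split_style(nested):
    """Turn the nested style into re.split's style (groups only, spliced in)."""
    out = []
    for item in nested:
        if isinstance(item, list):
            out.extend(item[1:])  # drop the whole match, keep the groups
        else:
            out.append(item)
    return out


nested = explode_nested(r"([^0-9](4)?)", "123+456*/")
# ['123', ['+4', '+4', '4'], '56', ['*', '*', None], '', ['/', '/', None], '']

flat = explode_flat(r"[^0-9]", "123+456*/")
# ['123', '+', '456', '*', '', '/', '']
```

Joining the flat result gives back the original string, and
`to_split_style(nested)' reproduces what `re.split' returns for the
same pattern, which is the "trivial to transform between styles" point
above.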