Hello

On 31 December 2011 04:11, Eli Barzilay <e...@barzilay.org> wrote:
> [I don't think that I'm subscribed to the Guile list, but feel free to
> forward it there.]

Copying back the list and Marijn.

> 5 hours ago, Marijn wrote:
>> On 30-12-11 12:52, Nala Ginrut wrote:
>> > I just expressed "I think group capturing is useful and someone
>> > didn't think that's true". If this is not what your last mail
>> > meant, I think it's better to ignore it.
>>
>> Group capturing is useful, but the question is whether it is useful
>> in the context of regexp-split.
>
> Yes, that's exactly the point.  What I'm worried about is someone
> defining a regexp for several uses, for example:
>
>   (define rx "foo([0-9]*)")
>
> with the intention of using it for both splitting and other
> extraction.  (This is a bad example but it's a common case for
> regexps.)  The problem is that if you really want to just *split* with
> this pattern, you're stuck in bad-code-land...  Two possible
> solutions:
>
>   Do the split, then filter out the even-numbered items from the
>   result.
>
> This is bad not only because it's inefficiently allocating substrings
> that will get discarded (investing work redundantly which will get
> trashed by more work) -- it's also bad because such code is sensitive
> to the number of groups.  Eg, if the pattern is changed to have two
> groups, then you need to modify the filtering now.  Another solution:
>
>   Tweak the regexp and turn all groups into non-capturing groups.
>
> This is something that I've run into several times, and IME it is a
> very bad solution.  Usually, you end up doing some half-assed job of
> this tweaking: you don't bother to cache the compiled expressions for
> speed, and you tend to introduce assumptions by mistake -- like
> assuming that all "("s that are not preceded by a backslash or
> followed by a "?:" are groups -- and fail miserably when the input is
> something like "...\\\\(..." or "...[0-9()]...".  Actually, that leads
> into yet another solution:
>
>   Explicitly say that your functions expect patterns without groups.
>
> That fails since it propagates the problem up for users of your code
> (they might need to maintain two versions of regexps too).  And since
> many of them are likely to skim the docs and just do whatever works
> for them, they can easily write code that fails to satisfy these
> assumptions -- and the fun part is that when this happens, the result
> of such bugs is utterly confusing...
>
> Four hours ago, Daniel Hartwig wrote:
>> Having the *option* to return the captured groups in `regexp-split' is
>> certainly useful -- consider implementing a parser [1].  If the
>> captured groups are not desired, then simply omit the grouping parens
>> from the expression.
>
> Hopefully the above explains why I think that that "simply omit" can
> turn out to be a disaster...
>
> In any case, that's my reason for disliking that added functionality
> even if it "can be more useful".  Lucky for me, in Racket we also have
> the existing behavior with code that will break if we change it, so I
> don't need to argue my point much...
>
> And BTW, all of that is *not* to say that this functionality is
> useless -- just arguing for it to be provided under a different name.
>

How about having an optional argument to control the behaviour?  The
default could be to not include the groups, thus mimicking the output
of Guile's `string-split' and `regexp-split' in other Schemes.

If two procedures are implemented they will be almost verbatim copies
of each other.  The changes required in the body would be minimal:

  ;; group 0 is the whole match, so skip it
  (groups (if incl-groups?
              (map (lambda (n) (match:substring m (1+ n)))
                   (iota (1- (match:count m))))
              '())))

>
>> [...] If you're so convinced that python is doing it right here and
>> should be followed, then perhaps you can give some examples of how
>> capturing groups are useful in a function that is supposed to split
>> strings at regexps.
>
> I don't think that such examples will help.
> It's obvious how it can be useful to have this feature -- the main
> issue is the kind of bugs that it will lead to.  (And in the above I
> tried to give some examples of how that's bad.)
>
>
>> Another data point:
>>
>> [14:17] <hkBst> what does chicken return for
>>         (irregex-split "([^0-9])" "123+456*/") ?
>> [14:18] <sjamaan> ("123" "456")
>>
>> Looks like chicken doesn't do capturing groups in their version, but
>> they don't have the empty matches either.  How about that...
>
> Yeah, we've considered these things for a while.  There is
> inconsistency between different languages and regexp libraries on how
> to deal with empty strings -- some drop them at the edges, some drop
> all of them, and IIRC, some even drop all empty strings.  Oh, and
> things get infinitely more amusing when you consider look-ahead and
> look-back patterns (including \b patterns)...
>
> You can see our tests here:
>
>   https://github.com/plt/racket/blob/master/collects/tests/racket/string.rktl#L37
>
> with some comparisons to Perl.
>

No comment on Perl's handling.  I think Racket does the right thing by
keeping *all* the empty strings in place.
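
To make the optional-argument idea concrete, here is a rough sketch of
how a single procedure could cover both behaviours.  It is only an
illustration, not the `regexp-split' being discussed in this thread:
the name `regexp-split*' and the parameter `incl-groups?' are made up
for the example, and I am assuming it is acceptable to build on
`list-matches' from (ice-9 regex), so zero-width matches are handled
however `list-matches' handles them.  Empty pieces are kept in place.

```scheme
(use-modules (ice-9 regex))   ; list-matches, match:start, match:end, ...

;; Hypothetical name, for illustration only.
;; PATTERN is a string or a compiled regexp.  When INCL-GROUPS? is
;; true, the captured groups of each delimiter are spliced in between
;; the pieces; by default they are left out.  Empty pieces are kept.
(define* (regexp-split* pattern str #:optional (incl-groups? #f))
  (let loop ((ms (list-matches pattern str))
             (start 0)
             (acc '()))                  ; result so far, in reverse
    (if (null? ms)
        ;; No delimiters left: the remainder is the last piece.
        (reverse (cons (substring str start) acc))
        (let* ((m (car ms))
               (piece (substring str start (match:start m)))
               ;; Group 0 is the whole match, so shift indices by one.
               (groups (if incl-groups?
                           (map (lambda (n) (match:substring m (1+ n)))
                                (iota (1- (match:count m))))
                           '())))
          (loop (cdr ms)
                (match:end m)
                (append (reverse groups) (cons piece acc)))))))

(define rx "foo([0-9]*)")

(regexp-split* rx "afoo1bfoo22c")      ; => ("a" "b" "c")
(regexp-split* rx "afoo1bfoo22c" #t)   ; => ("a" "1" "b" "22" "c")
```

With the default, the shape of the result does not depend on how many
groups the pattern happens to contain, so the same `rx' can be shared
between splitting and extraction without filtering or rewriting it.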