Re: [PATCH] add regexp-split

Daniel Hartwig Fri, 30 Dec 2011 17:46:52 -0800

Hello

On 31 December 2011 04:11, Eli Barzilay <e...@barzilay.org> wrote:
> [I don't think that I'm subscribed to the Guile list, but feel free to
> forward it there.]


Copying back the list and Marijn.

> 5 hours ago, Marijn wrote:
>> On 30-12-11 12:52, Nala Ginrut wrote:
>> > I just expressed "I think group capturing is useful and someone
>> > didn't think that's true". If this is not what your last mail
>> > meant, I think it's better to ignore it.
>>
>> Group capturing is useful, but the question is whether it is useful
>> in the context of regexp-split.
>
> Yes, that's exactly the point.  What I'm worried about is someone
> defining a regexp for several uses, for example:
>
>  (define rx "foo([0-9]*)")
>
> with the intention of using it for both splitting and other
> extraction.  (This is a bad example but it's a common case for
> regexps.)  The problem is that if you really want to just *split* with
> this pattern, you're stuck in bad-code-land...  Two possible
> solutions:
>
>  Do the split, then filter out the even-numbered items from the
>  result.
>
> This is bad not only because it's inefficiently allocating substrings
> that will get discarded (investing work redundantly which will get
> trashed by more work) -- it's also bad because such code is sensitive
> to the number of groups.  Eg, if the pattern is changed to have two
> groups, then you need to modify the filtering now.  Another solution:
>
>  Tweak the regexp and turn all groups into non-capturing groups.
>
> This is something that I've run into several times, and IME it is a
> very bad solution.  Usually, you end up doing some half-assed job of
> this tweaking: you don't bother to cache the compiled expressions for
> speed, and you tend to introduce assumptions by mistake -- like
> assuming that all "("s that are not preceded by a backslash or
> followed by a "?:" are groups -- and fail miserably when the input is
> something like "...\\\\(..." or "...[0-9()]...".  Actually, that leads
> into yet another solution:
>
>  Explicitly say that your functions expect patterns without groups.
>
> That fails since it propagates the problem up for users of your code
> (they might need to maintain two versions of regexps too).  And since
> many of them are likely to skim the docs and just do whatever works
> for them, they can easily write code that fails to satisfy these
> assumptions -- and the fun part is that when this happens, the result
> of such bugs is utterly confusing...
>
> Four hours ago, Daniel Hartwig wrote:
>> Having the *option* to return the captured groups in `regexp-split' is
>> certainly useful -- consider implementing a parser [1].  If the
>> captured groups are not desired, then simply omit the grouping parens
>> from the expression.
>
> Hopefully the above explains why I think that that "simply omit" can
> turn out to be a disaster...
>
> In any case, that's my reason for disliking that added functionality
> even if it "can be more useful".  Lucky for me, in Racket we also have
> the existing behavior with code that will break if we change it, so I
> don't need to argue my point much...
>
> And BTW, all of that is *not* to say that this functionality is
> useless -- just arguing for it to be provided under a different name.
>


How about having an optional argument to control the behaviour?  The
default could be to not include the groups, thus mimicking the output
of Guile's `string-split' and `regexp-split' in other Schemes.

If two procedures are implemented they will be almost verbatim copies
of each other.  The changes required in the body would be minimal:

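 (groups (if incl-groups?
             ;; Group 0 is the whole match, so the captured groups
             ;; are numbered 1 .. (1- (match:count m)).
             (map (lambda (n) (match:substring m (1+ n)))
                  (iota (1- (match:count m))))
             '())))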
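
For illustration, here is a rough sketch of how a single procedure
with such an optional argument might look.  The name
`regexp-split-sketch' and the loop structure are assumptions made up
for this example, not the code from the patch; it relies only on
`list-matches' and the match accessors from (ice-9 regex):

```scheme
(use-modules (ice-9 regex))

;; Hypothetical sketch, not the patch itself.  Splits STR at every
;; match of PAT.  When INCL-GROUPS? is true, the captured groups of
;; each match are inserted between the surrounding pieces.
(define* (regexp-split-sketch pat str #:optional (incl-groups? #f))
  (let loop ((matches (list-matches pat str))
             (start 0)
             (acc '()))                 ; pieces so far, in reverse order
    (if (null? matches)
        (reverse (cons (substring str start) acc))
        (let* ((m (car matches))
               (groups (if incl-groups?
                           ;; Group 0 is the whole match; skip it.
                           (map (lambda (n) (match:substring m (1+ n)))
                                (iota (1- (match:count m))))
                           '())))
          (loop (cdr matches)
                (match:end m)
                (append (reverse groups)
                        (cons (substring str start (match:start m))
                              acc)))))))

(regexp-split-sketch "([^0-9])" "123+456*/")
;; => ("123" "456" "" "")

(regexp-split-sketch "([^0-9])" "123+456*/" #t)
;; => ("123" "+" "456" "*" "" "/" "")
```

With the default, a pattern that happens to contain groups still
splits cleanly, so the same regexp can be shared between splitting and
extraction without rewriting it.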

>
>> [...] If you're so convinced that python is doing it right here and
>> should be followed, then perhaps you can give some examples of how
>> capturing groups are useful in a function that is supposed to split
>> strings at regexps.
>
> I don't think that such examples will help.  It's obvious how it can
> be useful to have this feature -- the main issue is the kind of bugs
> that it will lead to.  (And in the above I tried to give some examples
> of how that's bad.)
>
>
>> Another data point:
>>
>> [14:17] <hkBst> what does chicken return for (irregex-split "([^0-9])"
>>  "123+456*/")  ?
>> [14:18] <sjamaan> ("123" "456")
>>
>> Looks like chicken doesn't do capturing groups in their version, but
>> they don't have the empty matches either. How about that...
>
> Yeah, we've considered these things for a while.  There is
> inconsistency between different languages and regexp libraries on how
> to deal with empty strings -- some drop them at the edges, some drop
> all of them, and IIRC, some even drop all empty strings.  Oh,
> and things get infinitely more amusing when you consider look-ahead
> and look-back patterns (including \b patterns)...
>
> You can see our tests here:
>
>  https://github.com/plt/racket/blob/master/collects/tests/racket/string.rktl#L37
>
> with some comparisons to Perl.
>

No comment on Perl's handling.

I think Racket does the right thing by keeping *all* the empty strings in place.
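
To make the difference concrete, here is a small sketch run on the
input from the IRC log quoted above.  The helper `split-keep-empties'
is a hypothetical name used only for this example; it is built on
`list-matches' from (ice-9 regex):

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper: split STR at each match of PAT, keeping every
;; empty string that falls between or after adjacent matches.
(define (split-keep-empties pat str)
  (let loop ((matches (list-matches pat str))
             (start 0)
             (acc '()))                 ; pieces so far, in reverse order
    (if (null? matches)
        (reverse (cons (substring str start) acc))
        (loop (cdr matches)
              (match:end (car matches))
              (cons (substring str start (match:start (car matches)))
                    acc)))))

(split-keep-empties "[^0-9]" "123+456*/")
;; => ("123" "456" "" "")

;; Dropping the empty strings afterwards gives the result reported
;; for Chicken in the log above.
(filter (lambda (s) (not (string-null? s)))
        (split-keep-empties "[^0-9]" "123+456*/"))
;; => ("123" "456")
```

Keeping the empty strings loses no information: a caller who does not
want them can filter them out afterwards, whereas a result that has
already dropped them cannot be turned back into the full one.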
