[issue7132] Regexp: capturing groups in repetitions

Philippe Verdy Wed, 14 Oct 2009 16:30:31 -0700

Philippe Verdy <verd...@wanadoo.fr> added the comment:

I did read carefully ALL of what ezio said; this is clear from the fact 
that I have summarized my responses to ALL 4 points given by ezio.


Capturing groups are a VERY useful feature of regular expressions, but 
they currently DON'T work as expected (in a useful way) when they are 
used within repetitions (unless you don't need any captures at all, for 
example when just using find() and not performing substitutions on the 
groups).
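
To illustrate the current behavior, here is a minimal sketch using the 
standard re module. The pattern and the input string are only examples 
chosen for this illustration:

```python
import re

# A capturing group inside a repetition: the group matches three times,
# but the match object keeps only the final occurrence.
m = re.match(r"(?:(\d+),?)+$", "10,20,30")

last_only = m.group(1)   # text of the last occurrence of group 1
span_last = m.span(1)    # span of that last occurrence only

print(last_only, span_last)
```

The occurrences "10" and "20" took part in the match, but they are no 
longer reachable through the match object.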

My proposal would have absolutely NO performance impact when capturing 
groups are not used (find only, without replacement, so there the R flag 
can be safely ignored).

It would also not affect the case where capturing groups are used in the 
regexp, but these groups are not referenced in the substitution or in 
the code using MatchObject.group(index): these indexes are already not 
used (or should not be, because using them is most of the time a bug 
when it just returns the last occurrence).

Using multiple parsing operations with multiple regexps is really 
tricky, when everything could be done directly from the original regexp, 
without modifying it. In addition, using split() or similar will not 
work as expected when the splitting operation does not correctly parse 
the context in which the multiple occurrences are safely separated (this 
context is only correctly specified in the original regexp, where the 
groups, capturing or not, are specified).
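
A rough sketch of that multi-step workaround, and of how a naive split() 
loses the context, could look like this. The quoted-field input and the 
ITEM and FULL patterns are assumptions made up for this example, not 
something taken from the tracker discussion:

```python
import re

line = 'a,"b,c",d'

# Sub-pattern for one item: a quoted field, or a run without commas/quotes.
ITEM = r'"[^"]*"|[^,"]+'
# Full pattern: one or more items, each followed by a comma or the end.
FULL = re.compile(r'(?:(%s)(?:,|$))+' % ITEM)

# Step 1: validate the whole line; group(1) only holds the last item.
m = FULL.fullmatch(line)
last_item = m.group(1)

# Step 2: re-scan the matched text with the item sub-pattern alone.
items = re.findall(ITEM, m.group(0))

# A naive split ignores the quoting context and cuts the quoted field.
naive = line.split(",")

print(last_item, items, naive)
```

The second step has to repeat a part of the original regexp by hand, 
which is exactly the duplication I would like to avoid.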

This extension will also NOT affect the non-capturing groups like:
 (?:X){m,n}
 (?:X)*
 (?:X)+
It will ONLY affect the CAPTURING groups like:
 (X){m,n}
 (X)*
 (X)+
and only if the R flag is set (in which case this will NOT affect the 
backtracking behavior, or which strings will effectively be matched, 
but only the values of the returned "\n" indexed groups).
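
As a small check with the existing re module (without any R flag, which 
does not exist there), the capturing and non-capturing forms already 
match the same text and differ only in what the groups report. The 
sample pattern and string are just for illustration:

```python
import re

text = "abcabcabc!"

non_capturing = re.match(r"(?:abc)+", text)
capturing = re.match(r"(abc)+", text)

# Both forms consume exactly the same part of the input.
same_text = non_capturing.group(0) == capturing.group(0)

# Only the capturing form exposes a group, holding the last occurrence.
nc_groups = non_capturing.groups()
c_groups = capturing.groups()

print(same_text, nc_groups, c_groups)
```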

If my suggestion to keep the existing MatchObject.function(index) API 
looks too dangerous to you, because it would change the type of the 
returned values when the R flag is set, you can as well rename them to 
get a specific occurrence of a group, such as:

 MatchObject.groupOccurences(index)
 MatchObject.startOccurences(index)
 MatchObject.endOccurences(index)
 MatchObject.spanOccurences(index)
 MatchObject.groupsOccurences(index)

But I don't think this is necessary: when the R flag is set, it will 
already be expected that they return lists of values (or lists of 
pairs), instead of just single values (or single pairs) for each group. 
Python (as well as PHP or Perl) can already manage return values with 
varying datatypes.
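
To give an idea of the kind of list-valued results meant here, the 
following is only an emulation with today's re module, not the proposed 
implementation. The helper name occurrences() and its arguments are 
hypothetical, and it simply applies one repetition unit again and again 
from the current position, so it does not reproduce backtracking across 
iterations:

```python
import re

def occurrences(unit, text, pos=0):
    """Hypothetical helper: apply `unit` repeatedly starting at `pos`
    and collect every occurrence of its group 1 (text and span)."""
    unit_re = re.compile(unit)
    groups, spans = [], []
    while True:
        m = unit_re.match(text, pos)
        # Stop when the unit no longer matches or makes no progress.
        if m is None or m.end() == pos:
            break
        groups.append(m.group(1))
        spans.append(m.span(1))   # spans are relative to the full string
        pos = m.end()
    return groups, spans

groups, spans = occurrences(r"(\d+),?", "10,20,30")
print(groups, spans)
```

Here the group values come back as a list of strings and the spans as a 
list of pairs, one entry per occurrence.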

Maybe only PCRE (written for C/C++) would need a new API name to return 
lists of values instead of single values for each group, due to existing 
datatype restrictions.

My proposal is not inconsistent: it returns consistent datatypes when 
the R flag is set, for ALL capturing groups (not just those that are 
repeated).

Anyway, I'll submit my idea to other groups, if I can find where to post 
it. Note that I've already implemented it in my own local 
implementation of PCRE, and this works perfectly with effectively very 
few changes (so far I have had to change the datatypes for matching 
objects so that they can return varying types), and I have used it to 
create a modified version of 'sed' to perform massive filtering of data.

It really reduces the number of transformation steps needed to process 
such data correctly, because a single regexp (exactly the same one that 
is already used in the first step to match the substrings we are 
interested in, when using existing 'sed' implementations) can be used to 
perform the substitutions using indexes within captured groups. And I 
would like to have it incorporated in Python (and also Perl or PHP) as 
well.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue7132>
_______________________________________