[issue7132] Regexp: capturing groups in repetitions

Philippe Verdy Wed, 14 Oct 2009 13:08:25 -0700

New submission from Philippe Verdy <verd...@wanadoo.fr>:

For now, when capturing groups are used within repetitions, it is impossible
to capture what they match individually within the list of matched
repetitions.


E.g. the following regular expression:

(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)(?:\.(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)){3}

is a regexp that contains two capturing groups (\1 and \2), the second of
which is repeated (3 times) to match an IPv4 address in dotted decimal
format. We'd like to be able to get the individual matches for the second
group.

For now, capturing groups don't record the full list of matches, but just
keep the last occurrence of the capturing group (or just the first if the
repetition is not greedy, which is not the case here because the repetition
"{3}" is not followed by a "?"). So \1 will effectively return the first
decimal component of the IPv4 address, but \2 will just return the last
(fourth) decimal component.
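
The current behavior can be checked with a short script. This is only a
minimal sketch: the names OCTET and IPV4 and the sample address are
assumptions made for illustration (the address is chosen so that every
component starts with 0, 1 or 2, as the pattern above requires).

```python
import re

# One decimal component, copied from the regexp above.
OCTET = r"(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)"

# Group 1 is the first component; group 2 sits inside the {3} repetition.
IPV4 = re.compile(OCTET + r"(?:\." + OCTET + r"){3}")

m = IPV4.fullmatch("192.168.10.254")

print(m.group(1))  # 192 (first component)
print(m.group(2))  # 254 (only the last repetition is kept)
print(m.span(2))   # (11, 14), the position of the last repetition only
```

The second and third components ("168" and "10") are matched, but there is
no way to read them back from the match object.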


I'd like to have a compilation flag "R" that would indicate that capturing
groups will not just return a single occurrence, but all occurrences of the
same group. If this "R" flag is enabled, then:

- the Match.group(index) will not just return a single string but a list of
strings, with as many occurrences as the number of effective repetitions of
the same capturing group. The last element in that list will be the one
returned by the current behavior

- the Match.start(index) and Match.end(index) will also both return a list of 
positions, those lists having 
the same length as the list returned by Match.group(index).

- for consistency, the returned values as lists of strings (instead of just 
single strings) will apply to all 
capturing groups, even if they are not repeated.
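
Until such a flag exists, the lists described above can be approximated for
this particular regexp by rescanning the repeated part of the match. The
following is only a sketch under stated assumptions: the "R" flag does not
exist in the re module, and all_captures(), WHOLE and INNER are hypothetical
names. It works here because the components are delimited by dots, and it is
not a general emulation of the proposal.

```python
import re

# One decimal component (non-capturing here, the callers add the groups).
OCTET = r"(?:0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)"

# Group 1: first component. Group 2: the whole repeated tail, kept in one
# piece so that it can be rescanned afterwards.
WHOLE = re.compile("(" + OCTET + ")" + r"((?:\." + OCTET + r"){3})")

# One repetition of the tail: a dot followed by a captured component.
INNER = re.compile(r"\.(" + OCTET + ")")


def all_captures(text):
    """Return {group: [(string, start, end), ...]} or None if no match."""
    m = WHOLE.fullmatch(text)
    if m is None:
        return None
    first = [(m.group(1), m.start(1), m.end(1))]
    tail_start, tail_end = m.span(2)
    # pos/endpos keep the reported positions relative to the full string.
    rest = [
        (r.group(1), r.start(1), r.end(1))
        for r in INNER.finditer(text, tail_start, tail_end)
    ]
    return {1: first, 2: rest}


caps = all_captures("192.168.10.254")
print(caps[1])  # [('192', 0, 3)]
print(caps[2])  # [('168', 4, 7), ('10', 8, 10), ('254', 11, 14)]
```

As in the proposal, group 1 is returned as a one-element list even though it
is not repeated, and the last element of each list is what the current
Match.group() returns.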


Effectively, with the same regexp above, we will be able to retrieve (and
possibly substitute):

- the first decimal component of the IPv4 address with "{\1:1}" (or "{\1:}"
or "{\1}" or "\1" as before), i.e. the 1st (and last) occurrence of capturing
group 1, or in Match.group(1)[1], or between string positions
Match.start(1)[1] and Match.end(1)[1];

- the second decimal component of the IPv4 address with "{\2:1}", i.e. the
1st occurrence of capturing group 2, or in Match.group(2)[1], or between
string positions Match.start(2)[1] and Match.end(2)[1];

- the third decimal component of the IPv4 address with "{\2:2}", i.e. the 2nd
occurrence of capturing group 2, or in Match.group(2)[2], or between string
positions Match.start(2)[2] and Match.end(2)[2];

- the fourth decimal component of the IPv4 address with "{\2:3}" (or "{\2:}"
or "{\2}" or "\2"), i.e. the 3rd (and last) occurrence of capturing group 2,
or in Match.group(2)[3], or between string positions Match.start(2)[3] and
Match.end(2)[3];
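
The occurrence references listed above count from 1, and an omitted index
means the last occurrence. A rough sketch of how such references could be
expanded is given below. Everything in it is an assumption made for
illustration: expand() and REF are hypothetical names, the capture lists are
written by hand, and only the braced forms ("{\2:1}", "{\2:}", "{\2}") are
handled, not the bare "\2".

```python
import re

# Matches {\G}, {\G:} and {\G:K}, with G the group number and K the
# 1-based occurrence number.
REF = re.compile(r"\{\\(\d+)(?::(\d*))?\}")


def expand(template, captures):
    """Replace each occurrence reference by the captured string."""

    def repl(ref):
        occurrences = captures[int(ref.group(1))]
        k = ref.group(2)
        if not k:  # "{\G}" or "{\G:}" selects the last occurrence
            return occurrences[-1]
        index = int(k)
        if not 1 <= index <= len(occurrences):
            raise IndexError("no occurrence %d for this group" % index)
        return occurrences[index - 1]  # 1-based to 0-based

    return REF.sub(repl, template)


# Hand-written capture lists for the address 192.168.10.254.
captures = {1: ["192"], 2: ["168", "10", "254"]}

print(expand(r"{\2:3}.{\2:2}.{\2:1}.{\1}", captures))  # 254.10.168.192
print(expand(r"{\2}", captures))                       # 254
```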


This should work with all repetition patterns (both greedy and not greedy, 
atomic or not, or possessive), in 
which the repeated pattern contains any capturing group.


This idea should also be submitted to the developers of the PCRE library (and
of Perl, from which it originates, and PHP, where PCRE is also used), so that
they adopt a similar behavior in their regular expressions.

If there's already a candidate syntax or compilation flag in those libraries, 
this syntax should be used for 
repeated capturing groups.

----------
components: Library (Lib)
messages: 94022
nosy: verdy_p
severity: normal
status: open
title: Regexp: capturing groups in repetitions
type: feature request
versions: Python 3.2

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue7132>
_______________________________________