New submission from Philippe Verdy <verd...@wanadoo.fr>:

For now, when capturing groups are used within repetitions, it is impossible to capture what they match individually on each of the matched repetitions.
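As a minimal illustration of the current behavior, here is a short sketch using the standard re module. The pattern and the sample string are made up for this illustration and are deliberately simpler than the IPv4 expression discussed in this report.

```python
import re

# A repeated capturing group: one or more "letter followed by digit" pairs.
# Pattern and input are illustrative only, not taken from the report.
m = re.fullmatch(r"(?:([a-z])[0-9])+", "a1b2c3")

# The group matched "a", then "b", then "c", but only the last
# iteration is still available on the match object.
print(m.group(1))  # c
print(m.span(1))   # (4, 5)
```

The earlier iterations ("a" and "b") were matched but cannot be recovered from the match object.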
For example, the following regular expression:

(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)(?:\.(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)){3}

contains two capturing groups (\1 and \2), the second of which is repeated (3 times) in order to match an IPv4 address in dotted decimal format. We'd like to be able to get each of the individual matches for the second group.

For now, capturing groups don't record the full list of matches; each new occurrence of the capturing group just overrides the previous one, so only the last is kept (or just the first if the repetition is not greedy, which is not the case here because the repetition "{3}" is not followed by a "?"). So \1 will effectively return the first decimal component of the IPv4 address, but \2 will just return the last (fourth) decimal component.

I'd like to have a compilation flag "R" that would indicate that capturing groups will not just return a single occurrence, but all occurrences of the same group. If this "R" flag is enabled, then:

- Match.group(index) will not just return a single string but a list of strings, with as many occurrences as the number of effective repetitions of the same capturing group. The last element in that list will be the one equal to the current behavior.

- Match.start(index) and Match.end(index) will also both return a list of positions, those lists having the same length as the list returned by Match.group(index).

- for consistency, returning lists of strings (instead of just single strings) will apply to all capturing groups, even if they are not repeated.

Effectively, with the same regexp as above, we will be able to retrieve (and possibly substitute):

- the first decimal component of the IPv4 address with "{\1:1}" (or "{\1:}" or "{\1}" or "\1" as before), i.e. the 1st (and last) occurrence of capturing group 1, or in Match.group(1)[1], or between string positions Match.start(1)[1] and Match.end(1)[1];

- the second decimal component of the IPv4 address with "{\2:1}", i.e. the 1st occurrence of capturing group 2, or in Match.group(2)[1], or between string positions Match.start(2)[1] and Match.end(2)[1];

- the third decimal component of the IPv4 address with "{\2:2}", i.e. the 2nd occurrence of capturing group 2, or in Match.group(2)[2], or between string positions Match.start(2)[2] and Match.end(2)[2];

- the fourth decimal component of the IPv4 address with "{\2:3}" (or "{\2:}" or "{\2}" or "\2"), i.e. the 3rd (and last) occurrence of capturing group 2, or in Match.group(2)[3], or between string positions Match.start(2)[3] and Match.end(2)[3].

This should work with all repetition patterns (greedy or not greedy, atomic or not, or possessive) in which the repeated pattern contains any capturing group.

This idea should also be submitted to the developers of the PCRE library (and of Perl, from which these regular expressions originate, and of PHP, where PCRE is also used), so that they adopt a similar behavior in their regular expressions. If there's already a candidate syntax or compilation flag in those libraries, that syntax should be used for repeated capturing groups.

----------
components: Library (Lib)
messages: 94022
nosy: verdy_p
severity: normal
status: open
title: Regexp: capturing groups in repetitions
type: feature request
versions: Python 3.2

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue7132>
_______________________________________
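For comparison, the list-valued results requested in this report can be approximated today with the standard re module by unrolling the repetition by hand. The sketch below is only an approximation under stated assumptions: match_with_repeats is a hypothetical helper, not an existing or proposed API; it applies the repeated part one iteration at a time and does not backtrack between iterations as a real regex engine would; and the sample address is chosen so that every component matches the pattern exactly as written in the report. Note that Python lists are indexed from 0, whereas the numbering used in the report starts at 1.

```python
import re

# Component pattern copied verbatim from the report.
OCTET = r"(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)"
FIRST = re.compile(OCTET)
REPEATED = re.compile(r"\." + OCTET)  # the part repeated with {3}
FULL = re.compile(OCTET + r"(?:\." + OCTET + r"){3}")


def match_with_repeats(text, count=3):
    """Hypothetical helper emulating the proposed list-valued groups.

    Returns (group1, group2), each a list of (string, start, end)
    tuples with one entry per iteration, or None if there is no match.
    No backtracking is attempted between iterations.
    """
    first = FIRST.match(text)
    if first is None:
        return None
    group1 = [(first.group(1), first.start(1), first.end(1))]
    group2 = []
    pos = first.end()
    for _ in range(count):
        m = REPEATED.match(text, pos)
        if m is None:
            return None
        group2.append((m.group(1), m.start(1), m.end(1)))
        pos = m.end()
    return group1, group2


address = "192.168.10.1"

# Current behavior: group 2 keeps only the last repetition.
current = FULL.match(address)
print(current.group(1), current.group(2))  # 192 1

# Emulated behavior: every repetition of group 2 is kept.
group1, group2 = match_with_repeats(address)
print([s for s, _, _ in group1])  # ['192']
print([s for s, _, _ in group2])  # ['168', '10', '1']
```

The last element of each emulated list is the value that Match.group() returns today, which is the compatibility property asked for in the report.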