Philippe Verdy <verd...@wanadoo.fr> added the comment: I had read carefully ALL what ezio said, this is clear in the fact that I have summarized my responses to ALL the 4 points given by ezio.
Capturing groups is a VERY useful feature of regular expressions, but they currently DON'T work as expected (in a useful way) when they are used within repetitions (unless you don't need any captures at all, for example when just using find(), and not performing substitutions on the groups. My proposal woul have absolutely NO performance impact when capturing groups are not used (find only, without replacement, so there the R flag can be safely ignored). It would also not affect the case where capturing groups are used in the regexp, but these groups are not referenced in the substitution or in the code using MatchObject.group(index) : these indexes are already not used (or should not, because this is most of the time a bug when it just returns the last occurence). Using multiple parsing operations with multiple regexps is really tricky, when all could be done directly from the original regexp, without modifying it. In addition, using split() or similar will not work as expected, when the splitting operations will not correctly parse the context in which the multiple occurences are safely separated (this context is only correctly specified in the original regexp where the groups, capturing or not, are specified). This extension will also NOT affect the non-capturing groups like: (?:X){m,n} (?:X)* (?:X)+ It will ONLY affect the CAPTURING groups like: (X){m,n} (X)* (X)+ and only if the R flag is set (in which case this will NOT affect the backtracking behavior, or which strings that will be effectively matched, but only the values of the returned "\n" indexed group. If my suggestion to keep the existing MatchObject.function(index) API looks too dangerous for you, because it would change the type of the returned values when the R flag is set, you can as well rename them to get a specific occurence of a group. Such as: MatchObject.groupOccurences(index) MatchObject.startOccurences(index) MatchObject.endOccurences(index) MatchObject.spanOccurences(index) MatchObject.groupsOccurences(index) But I don't think this is necessary; it will be already expected that they will return lists of values (or lists of pairs), instead of just single values (or single pairs) for each group: Python (as well as PHP or Perl) can already manage return values with varying datatypes. May be only PCRE (written for C/C++) would need a new API name to return lists of values instead of single values for each group, due to existing datatype restrictions. My proposal is not inconsistant: it returns consistant datatypes when the R flag is set, for ALL capturing groups (not just those that are repeated. Anyway I'll submit my idea to other groups, if I can find where to post them. Note that I've already implemented it in my own local implementation of PCRE, and this works perfectly with effectively very few changes (currently I have had to change the datatypes for matching objects so that they can return varying types), and I have used it to create a modified version of 'sed' to perform massive filtering of data: It really reduces the number of transformation steps needed to process such data correctly, because a single regexp (exactly the same that is already used in the first step used to match the substrings we are interested in, when using existing 'sed' implementations) can be used to perform the substitutions using indexes within captured groups. And I would like to have it incoporated in Python (and also Perl or PHP) as well. ---------- _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue7132> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com