New submission from Jeffrey C. Jacobs <[EMAIL PROTECTED]>:

I am working on adding features to the current Regexp implementation, which is currently at version 2.2.2. These features are meant to bring the Regexp code closer in line with Perl 5.10, as well as to add a few Python-specific niceties and potential speed-ups and clean-ups.

I will be posting regular patch updates to this thread when major milestones have been reached, with a description of the feature(s) added. Currently, the list of proposed changes is (in no particular order):

1) Fix <a href="http://bugs.python.org/issue433030">issue 433030</a> by adding support for Atomic Grouping and Possessive Quantifiers.

2) Make named matches direct attributes of the match object; i.e. instead of m.group('foo'), one will be able to write simply m.foo.

3) (maybe) Make Match objects subscriptable, such that m[n] is equivalent to m.group(n), and allow slicing.

4) Implement Perl-style back-references, including relative back-references.

5) Add a well-formed, Python-specific comment modifier, e.g. (?P#...). The difference between (?P#...) and Perl/Python's (?#...) is that the former will allow nested parentheses as well as parenthetical escaping, so that a pattern of the form '(?P# Evaluate (the following) expression, 3\) using some other technique)' is read as a single comment. (?P#...) will interpret this entire expression as a comment, whereas with (?#...) the comment would end at the first ')', so everything from ' expression...' onward would be considered part of the match. (?P#...) will necessarily be slower than (?#...) and so should only be used if a richer commenting style is required but verbose mode is not desired.

6) Add official support for fast, non-repeating capture groups with the Template option. Template is unofficially supported and disables all repeat operators (*, + and ?). This would mainly consist of documenting its behavior.

7) Modify the re compiled expression cache to better handle the thrashing condition. Currently, when regular expressions are compiled, the result is cached so that if the same expression is compiled again, it is retrieved from the cache and no extra work has to be done. This cache supports up to 100 entries. Once the 100th entry is reached, the cache is cleared and a new compile must occur. The danger, albeit rare, is that one may compile the 100th expression only to find that one has to recompile it and do the same work all over again, even though it may have been compiled just 3 expressions ago. By modifying this logic slightly, it is possible to establish an arbitrary counter that gives a time stamp to each compiled entry and, instead of clearing the entire cache when it reaches capacity, eliminate only the oldest half of the cache, keeping the half that is more recent. This should limit the possibility of thrashing to cases where a very large number of Regular Expressions are continually recompiled. In addition to this, I will update the limit to 256 entries, meaning that the 128 most recent are kept.

8) Emacs/Perl style character classes, e.g. [:alphanum:]. For instance, :alphanum: would not include the '_' in the character class, unlike \w.

9) C-Engine speed-ups. I am commenting and cleaning up the _sre.c Regexp engine to make it flow more linearly, rather than with all the current gotos, and replacing the switch-case statements with lookup tables, which in tests have shown to be faster. This will also include adding many more comments to the C code in order to make it easier for future developers to follow. These changes are subject to testing, and some modifications may not be included in the final release if they are shown to be slower than the existing code. Also, a number of macros are being eliminated where appropriate.

10) Export any value that is used by both the Python code and the C code but is not already shared, e.g. the default Maximum Repeat count (65536); this will allow those constants to be changed in 1 central place.

11) Various other Perl 5.10 conformance modifications, TBD.

More items may come and suggestions are welcome.

-----

Currently, I have code which implements 5) and 7), have done some work on 10) and am almost done with 9).
When 9) is complete, I will work on 1), some of which, such as parsing, is already done, then probably 8) and 4) because they should not require too much work -- 4) is parser-only AFAICT. Then I will attempt 2) and 3), though those will require changes at the C-code level. Then I will investigate what additional elements of 11) I can easily implement. Finally, I will write documentation for all of these features, including 6).

In a few days, I will provide a patch with my interim results, and I will post updated patches as milestones are reached.

----------
components: Library (Lib)
messages: 65513
nosy: timehorse
severity: normal
status: open
title: Regexp 2.6 (modifications to current re 2.2.2)
type: feature request
versions: Python 2.6

__________________________________
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
__________________________________