On 4/20/2011 12:23 PM, Neil Cerutti wrote:
On 2011-04-20, John Nagle <na...@animats.com> wrote:
Here's something that surprised me about Python regular expressions.
>>> import re
>>> krex = re.compile(r"^([a-z])+$")
>>> s = "abcdef"
>>> ms = krex.match(s)
>>> ms.groups()
('f',)
The parentheses indicate a capturing group within the
regular expression, and the "+" indicates that the
group can appear one or more times. The regular
expression matches that way. But instead of returning
a captured group for each character, it returns only the
last one.
The documentation does in fact say so, at
http://docs.python.org/library/re.html:
"If a group is contained in a part of the pattern that matched multiple
times, the last match is returned."
That's kind of lame, though. I'd expect that there would be some way
to retrieve all matches.
.findall
findall does something a bit different. It returns a list of
non-overlapping matches of the entire pattern, not the repeats of a
group within a single match.
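To make the distinction concrete, here is a small sketch using the single-character pattern from earlier in the thread. The variable names last_only, whole_pattern and each_repeat are made up for illustration. Note that the last call drops the ^...$ anchoring, so on its own it no longer checks that the whole string is lowercase letters.

```python
import re

krex = re.compile(r"^([a-z])+$")
s = "abcdef"

# A repeated capturing group keeps only its final iteration.
last_only = krex.match(s).groups()

# findall on the anchored pattern finds one match of the whole pattern
# and reports that match's group, which is again the last iteration.
whole_pattern = krex.findall(s)

# findall on just the body of the group returns every repeat, but this
# is a different (unanchored) pattern, not the original one.
each_repeat = re.findall(r"[a-z]", s)

print(last_only)
print(whole_pattern)
print(each_repeat)
```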
Consider a regular expression for matching domain names:
>>> kre = re.compile(r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$')
>>> s = 'www.example.com'
>>> ms = kre.match(s)
>>> ms.groups()
('www', 'com')
>>> msall = kre.findall(s)
>>> msall
[('www', 'com')]
This is just a simple example, but it illustrates an unnecessary
limitation: the middle label, 'example', is lost both ways. The matcher
can do the repeated matching; you just can't get the results out.
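One possible workaround with the standard re module, offered as a sketch rather than a fix for the limitation itself, is to validate the string with the full pattern and then pull the labels out in a second pass with findall on the label sub-pattern alone. The names LABEL, label_re and domain_labels below are hypothetical, introduced only for this example.

```python
import re

# Label sub-pattern taken from the domain-name regex above.
LABEL = r"[a-zA-Z0-9\-]+"

# Same overall pattern as kre in the example, built from LABEL.
kre = re.compile(r"^(%s)(?:\.(%s))+$" % (LABEL, LABEL))
label_re = re.compile(LABEL)


def domain_labels(s):
    """Return every label in s if the full pattern matches, else None."""
    # Pass 1: the full pattern checks the overall shape of the string.
    if kre.match(s) is None:
        return None
    # Pass 2: findall on the label pattern recovers each repeat.
    return label_re.findall(s)


print(domain_labels("www.example.com"))
print(domain_labels("not a domain"))
```

The cost is that the string is scanned twice and the label pattern has to be kept in sync with the full one. The third-party regex module on PyPI is, as far as I know, able to report all repeats of a group directly through a captures() method on the match object, which would avoid the second pass.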
John Nagle