On 22/07/2006 9:25 AM, John Machin wrote:

Apologies if this appears twice ... post to the newsgroup hasn't shown up; trying the mailing-list.

> On 22/07/2006 2:18 AM, Simon Forman wrote:
>> John Salerno wrote:
>>> Simon Forman wrote:
>>>
>>>> Python's re.match() matches from the start of the string, so if you
>
> (1) Every regex library's match() starts matching from the beginning of
> the string (unless of course there's an arg for an explicit starting
> position) -- where else would it start?
>
> (2) This has absolutely zero relevance to the "match whole string or
> not" question.
>
>>>> want to ensure that the whole string matches completely you'll probably
>>>> want to end your re pattern with the "$" character (depending on what
>>>> the rest of your pattern matches.)
>
> *NO* ... if you want to ensure that the whole string matches completely,
> you need to end your pattern with "\Z", *not* "$".
>
> Perusal of the manual would seem to be indicated :-)
>
>>> Is that necessary? I was thinking that match() was used to match the
>>> full RE and string, and if they weren't the same, they wouldn't match
>>> (meaning a begin/end of string character wasn't necessary). That's
>>> wrong?
>
> Yes. If the default were to match the whole string, then a metacharacter
> would be required to signal "*don't* match the whole string" ...
> functionality which is quite useful.
>
>>
>> My understanding, from the docs and from dim memories of using
>> re.match() long ago, is that it will match on less than the full input
>> string if the re pattern allows it (for instance, if the pattern
>> *doesn't* end in '.*' or something similar.)
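
The "$" versus "\Z" point made in the reply above can be checked with a short sketch. It assumes a current Python 3 interpreter rather than the 2006-era one used in the thread, and the test string is made up for illustration. The difference only shows when the string ends in a newline: "$" also matches just before a trailing newline, while "\Z" matches only at the very end of the string.

```python
import re

# Illustrative input (not from the thread): note the trailing newline.
s = "ab\n"

# "$" matches at the end of the string OR just before a final newline,
# so this succeeds even though the pattern never consumes the "\n".
dollar = re.match(r"ab$", s)

# "\Z" matches only at the true end of the string, so this fails.
strict = re.match(r"ab\Z", s)

print(dollar is not None)  # True
print(strict is not None)  # False
```

Without the trailing newline, both patterns accept "ab", which is probably why the two are so often treated as interchangeable.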
>
> Ending a pattern with '.*' or something similar is typically a mistake
> and does nothing but waste CPU cycles:
>
> C:\junk>python -mtimeit -s"import re;s='a'+80*'z';m=re.compile('a').match" "m(s)"
> 1000000 loops, best of 3: 1.12 usec per loop
>
> C:\junk>python -mtimeit -s"import re;s='a'+8000*'z';m=re.compile('a').match" "m(s)"
> 100000 loops, best of 3: 1.15 usec per loop
>
> C:\junk>python -mtimeit -s"import re;s='a'+80*'z';m=re.compile('a.*').match" "m(s)"
> 100000 loops, best of 3: 1.39 usec per loop
>
> C:\junk>python -mtimeit -s"import re;s='a'+8000*'z';m=re.compile('a.*').match" "m(s)"
> 10000 loops, best of 3: 24.2 usec per loop
>
> The regex engine can't optimise it away because '.' means by default
> "any character except a newline", so it has to trundle all the way to
> the end just in case there's a newline lurking somewhere.
>
> Oh and just in case you were wondering:
>
> C:\junk>python -mtimeit -s"import re;s='a'+8000*'z';m=re.compile('a.*',re.DOTALL).match" "m(s)"
> 1000000 loops, best of 3: 1.18 usec per loop
>
> In this case, logic says the '.*' will match anything, so it can stop
> immediately.
>
>>
>> I'd test this, though, before trusting it.
>>
>> What the heck, I'll do that now:
>>
>> >>> import re
>> >>> re.match('ab', 'abcde')
>> <_sre.SRE_Match object at 0xb6ff8790>
>> >>> m = _
>
> ??? What's wrong with _.group() ???
>
>> >>> m.group()
>> 'ab'
>> >>> print re.match('ab$', 'abcde')
>> None
>>
>
> HTH,
> John
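
For readers on a current interpreter, the behaviour discussed in the thread can be reproduced with the sketch below. It assumes Python 3 (so print is a function), and the use of re.fullmatch() further assumes Python 3.4 or later; that function did not exist when this thread was written and is shown only as a present-day alternative to ending the pattern with "\Z".

```python
import re

s = "abcde"

# match() anchors only at the start, so the pattern may consume just a prefix.
prefix = re.match(r"ab", s)
print(prefix.group())   # ab

# Appending ".*" does not change whether the match succeeds; it only makes
# the engine walk over (and report) the rest of the string.
greedy = re.match(r"ab.*", s)
print(greedy.group())   # abcde

# Requiring the whole string to match, as the reply recommends, with "\Z".
partial = re.match(r"ab\Z", s)
whole = re.match(r"abcde\Z", s)
print(partial)            # None
print(whole is not None)  # True

# Assumption: Python 3.4+, where re.fullmatch() is available.
full = re.fullmatch(r"ab", s)
print(full)               # None
```

The timings quoted in the thread are not reproduced here, since they depend on the machine and interpreter version.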