On May 29, 10:35 am, Andrew Berg <bahamutzero8...@gmail.com> wrote:
> On 2011.05.29 09:18 AM, Steven D'Aprano wrote:
> >> What makes you think it shouldn't match?
> >
> > > AFAIK, dots aren't supposed to match carriage returns or any other
> > > whitespace characters.
>
> I got things mixed up there (was thinking whitespace instead of
> newlines), but I thought dots aren't supposed to match '\r' (carriage
> return). Why is '\r' not considered a newline character?

The way I think of it: a dot matches anything except a newline, and to Python's re module "newline" means '\n' only, whatever OS you are on. So '\r' is just another character, and the dot matches it. (With re.DOTALL the dot matches '\n' as well.)

While I almost always nod my head at Steven D'Aprano's comments, in this case I have to say that if you just want to grab something from a chunk of HTML, full-blown HTML parsers are overkill. True, malformed HTML can throw you off, but it can also throw a parser off.

I could not make your regex work on my Linux box with Python 2.6. In your case, and because x264 might change their HTML, I suggest the following code, which works great on my system. YMMV. I changed your newline matches to use \s and put some capturing parentheses around the revision number, so you could grab it.

>>> import urllib2
>>> import re
>>>
>>> content = urllib2.urlopen("http://x264.nl/x264_main.php").read()
>>>
>>> rx_x264version = re.compile(r"http://x264\.nl/x264/64bit/8bit_depth/revision\s*(\d{4})\s*/x264\s*\.exe")
>>>
>>> m = rx_x264version.search(content)
>>> if m:
...     print m.group(1)
...
1995
>>>

\s is your friend -- it matches space, tab, newline, or carriage return. \s* says match 0 or more whitespace characters, which is what's needed here in case the web site decides to *not* put whitespace in the middle of a URL...

As Steven said, when you want to match a literal dot, it needs to be escaped. An unescaped dot will work by accident much of the time, because it matches any character, a dot included.

Also, be sure to use a raw string when composing REs, so you don't run into backslash issues.

HTH,
John Strickler
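
P.S. If you want to check the dot/newline behaviour and the \s* trick without hitting the web site, here is a small offline sketch. The sample string is made up for illustration (I am only assuming the real page breaks the URL with '\r\n' in roughly this way), and find_revision is just a hypothetical helper name. It uses print() so it should run on a current Python.

```python
import re

# Made-up stand-in for the page source; the real HTML may look different.
sample = (
    '<a href="http://x264.nl/x264/64bit/8bit_depth/revision\r\n'
    '1995\r\n/x264\r\n.exe">download</a>'
)

# By default '.' matches any character except '\n', so it does match '\r'.
dot_matches_cr = re.match(r"a.b", "a\rb") is not None
dot_matches_lf = re.match(r"a.b", "a\nb") is not None
# With re.DOTALL the dot matches '\n' too.
dot_matches_lf_dotall = re.match(r"a.b", "a\nb", re.DOTALL) is not None

# Same pattern as in the post: literal dots escaped, \s* wherever the
# page might (or might not) insert whitespace, raw string throughout.
rx_x264version = re.compile(
    r"http://x264\.nl/x264/64bit/8bit_depth/revision\s*(\d{4})\s*/x264\s*\.exe"
)


def find_revision(text):
    """Return the four-digit revision as a string, or None if not found."""
    m = rx_x264version.search(text)
    return m.group(1) if m else None


print(dot_matches_cr, dot_matches_lf, dot_matches_lf_dotall)
print(find_revision(sample))
```

Because every gap is \s* and not a hard-coded '\r\n', the same pattern also matches if the URL ever shows up on a single line with no whitespace in it.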