On 2011.05.16 02:26 AM, Karim wrote:
> Use regular expressions for bad HTML, or BeautifulSoup (google it); below
> is an example to extract all HTML links:

Actually, using a regex wasn't so bad:

import re
import urllib.request

url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
# urlopen() returns a bytes object, so decode it into a normal string
page = str(urllib.request.urlopen(url).read(), encoding='utf-8')
rev_re = re.compile('revision[0-9][0-9][0-9][0-9]')
num_re = re.compile('[0-9][0-9][0-9][0-9]')
# only need the first item, since the first listing is the latest revision
rev = rev_re.findall(page)[0]
num = num_re.findall(rev)[0]  # findall() always returns a list
print(num)

This prints out the revision number, 1995. 'revision1995' might be useful too, so I saved that to rev.
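The two patterns could also be collapsed into one regex with a capturing group, so rev and num come out of a single search. A minimal sketch (the HTML fragment below is made up to stand in for the real directory listing, so no download is needed):

```python
import re

# Hypothetical fragment standing in for the real listing at x264.nl --
# the actual page markup is assumed, not copied from the site.
page = '<a href="x264-64_8-revision1995.exe">x264-64_8-revision1995.exe</a>'

# One pattern with a capturing group does the work of both rev_re and num_re:
# group(0) is the whole match, group(1) just the captured digits.
match = re.search(r'revision([0-9]{4})', page)
if match:
    rev = match.group(0)  # 'revision1995'
    num = match.group(1)  # '1995'
    print(rev, num)
```

re.search() stops at the first match, which in a newest-first listing is the latest revision, so there's no need for findall() at all here.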
This actually works pretty well for consistently formatted listings. I suppose I went about this the wrong way at first: I thought I needed to parse the HTML to get the links and then run simple regexes on those, but I can just run simple regexes on the entire HTML document.
--
http://mail.python.org/mailman/listinfo/python-list