On 05/19/2011 11:35 PM, Andrew Berg wrote:
On 2011.05.16 02:26 AM, Karim wrote:
Use regular expressions for bad HTML, or BeautifulSoup (google it); below
is an example to extract all the HTML links:
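[A minimal sketch of that kind of link extraction, assuming the
BeautifulSoup 4 package (bs4) is installed; the code below is
illustrative, not the snipped example itself:

import urllib.request
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 (bs4) is installed

url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
page = str(urllib.request.urlopen(url).read(), encoding='utf-8')
soup = BeautifulSoup(page, 'html.parser')

# Collect the href attribute of every <a> tag on the page.
links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
print(links)
]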
Actually, using regex wasn't so bad:
import re
import urllib.request

url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
# urlopen() returns a bytes object, so decode it to a normal string.
page = str(urllib.request.urlopen(url).read(), encoding='utf-8')

rev_re = re.compile('revision[0-9][0-9][0-9][0-9]')
num_re = re.compile('[0-9][0-9][0-9][0-9]')

# Only the first item is needed, since the first listing is the latest
# revision. page is already a str, so no extra str() call is needed.
rev = rev_re.findall(page)[0]
num = num_re.findall(rev)[0]  # findall() always returns a list
print(num)
That prints out the revision number, 1995. The full 'revision1995' string
might be useful too, so I saved it to rev.
This actually works pretty well for consistently formatted lists. I
suppose I went about this the wrong way at first: I thought I needed to
parse the HTML to get the links and then run simple regexes on those, but
I can just run simple regexes on the entire HTML document.
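For what it's worth, the two patterns can also be collapsed into a single
regex with a capturing group; this is just an equivalent rewrite of the
same idea:

import re
import urllib.request

url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
page = str(urllib.request.urlopen(url).read(), encoding='utf-8')

# One pattern with a capturing group replaces the two findall() passes;
# search() returns the first match, i.e. the latest revision in the listing.
match = re.search('revision([0-9][0-9][0-9][0-9])', page)
if match:
    print(match.group(0))  # the whole match, e.g. 'revision1995'
    print(match.group(1))  # just the number, e.g. '1995'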
Good for you!
Use what works well and is easy to code; the simpler, the better.
For more complicated link searches, where you want to avoid overly complex
and bug-prone regexes, you can derive from the HTMLParser code I gave,
with a max comparison; a rough sketch of that approach is below.
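(A rough sketch only; the class name and the max() step here are
illustrative, not the exact code from the earlier mail:)

import re
import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Collect the href attribute of every <a> tag fed to the parser.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
page = str(urllib.request.urlopen(url).read(), encoding='utf-8')
parser = LinkCollector()
parser.feed(page)

# The "max comparison": take the highest revision number found among the
# links instead of relying on the order of the listing.
revisions = []
for link in parser.links:
    m = re.search('revision([0-9][0-9][0-9][0-9])', link)
    if m:
        revisions.append(int(m.group(1)))
if revisions:
    print(max(revisions))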
Anyway, you get to choose, which is cool; you're not stuck with only one
solution.
Cheers
Karim