On 05/16/2011 03:06 AM, David Robinow wrote:
On Sun, May 15, 2011 at 4:45 PM, Andrew Berg <bahamutzero8...@gmail.com> wrote:
I'm trying to understand why HTMLParser.feed() isn't returning the whole
page. My test script is this:

import urllib.request
import html.parser
class MyHTMLParser(html.parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a' and attrs:
            print(tag,'-',attrs)

url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
page = urllib.request.urlopen(url).read()
parser = MyHTMLParser()
parser.feed(str(page))

I can do print(page) and get the entire HTML source, but
parser.feed(str(page)) only spits out the information for the top links
and none of the "revisionxxxx" links. Ultimately, I just want to find
the name of the first "revisionxxxx" link (right now it's
"revision1995", when a new build is uploaded it will be "revision2000"
or whatever). I figure this is a relatively simple page; once I
understand all of this, I can move on to more complicated pages.
You've got bad HTML. Look closely and you'll see that there's no space
between the "revisionxxxx" strings and the style tag following.
The parser doesn't like this. I don't know a solution other than
fixing the html.
(I created a local copy, edited it and it worked.)
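
Separately from the markup problem, one detail of the test script may be worth checking. In Python 3, urlopen(url).read() returns bytes, and str() applied to bytes yields its printable representation (b'...' with escaped newlines), not the decoded text. The sketch below feeds decoded text to the parser instead. It is only an illustration under assumptions: the sample bytes, the UTF-8 encoding, and the LinkCollector name are all made up here and are not taken from the real page.

```python
import html.parser


class LinkCollector(html.parser.HTMLParser):
    """Collect the attribute list of every <a> tag that has attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a' and attrs:
            self.links.append(attrs)


# Hypothetical bytes standing in for urllib.request.urlopen(url).read()
page = b'<a href="top">Top</a>\n<a href="revision1995">revision1995</a>\n'

as_repr = str(page)          # representation of the bytes object, begins with b'
text = page.decode('utf-8')  # assumed encoding; the real page may use another

parser = LinkCollector()
parser.feed(text)
parser.close()
print(parser.links)
```

Decoding does not repair malformed markup, so the missing space described above may still need to be dealt with.
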
Hello,

Use a regular expression or BeautifulSoup (google it) for bad HTML. Below is an example that extracts all HTML links:

import re

# htmlSource is assumed to hold the page source as a str
linksList = re.findall(r'<a href=(.*?)>.*?</a>', htmlSource)
for link in linksList:
    print(link)
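
For the narrower goal stated in the question, finding the name of the first "revisionxxxx" link, a regular expression can be aimed at that name directly. This is a minimal sketch, not something tested against the real page: html_source is a made-up sample with the same missing space before the style attribute, and first_revision is a hypothetical helper name.

```python
import re

# Hypothetical sample standing in for the downloaded page; the real markup may differ.
html_source = (
    '<a href="./?dir=./64bit">Parent</a>'
    '<a href="./?dir=./64bit/8bit_depth/revision1995"style="color:red">revision1995</a>'
    '<a href="./?dir=./64bit/8bit_depth/revision1992"style="color:red">revision1992</a>'
)


def first_revision(source):
    """Return the first 'revisionNNNN' name found in source, or None."""
    match = re.search(r'revision\d+', source)
    return match.group(0) if match else None


print(first_revision(html_source))
```

Because the pattern ignores tag structure entirely, the missing space between attributes does not matter here. It also means the pattern would match the word anywhere on the page, not only inside links.
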

Cheers
Karim
--
http://mail.python.org/mailman/listinfo/python-list
