On 2012-12-03 01:23, Jason Hsu wrote:
I'm trying to extract the data on "total assets" from Yahoo Finance using 
Python 2.7 and lxml.

Here is a special test script I set up to work on this issue:

     import urllib
     import lxml
     import lxml.html

     url_local1 = "http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView"
     result1 = urllib.urlopen(url_local1)
     element_html1 = result1.read()
     doc1 = lxml.html.document_fromstring (element_html1)
     list_row1 = doc1.xpath(u'.//th[div[text()="Total Assets"]]/following-sibling::td/text()')
     print list_row1

     url_local2 = "http://finance.yahoo.com/q/bs?s=FAST"
     result2 = urllib.urlopen(url_local2)
     element_html2 = result2.read()
     doc2 = lxml.html.document_fromstring (element_html2)
     list_row2 = doc2.xpath(u'.//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
     print list_row2

I'm able to get the row of data on total assets from the Smartmoney page, but I 
get just an empty list when I try to parse the Yahoo Finance page.

The problem is that text()="Total Assets" asks for an exact match.

If you look at the HTML itself, you'll see that there's whitespace
around the "Total Assets" part, so the comparison fails.
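
To see the effect in isolation, here is a minimal sketch on a made-up HTML fragment. The markup and the number in it are assumptions for illustration, not copied from the Yahoo page; the only point is that the label is padded with whitespace:

```python
import lxml.html

# Hypothetical fragment: the label text is surrounded by whitespace,
# and the value "1,000" is invented for the example.
SAMPLE = """
<table>
  <tr>
    <td><strong>
      Total Assets
    </strong></td>
    <td><strong>1,000</strong></td>
  </tr>
</table>
"""

doc = lxml.html.document_fromstring(SAMPLE)

# text()="..." compares the whole text node, whitespace included,
# so the padded label does not match and nothing is selected.
exact = doc.xpath(
    './/td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')

# The raw text node of the label cell, showing the padding.
raw_label = doc.xpath('.//td[1]/strong/text()')[0]

print(exact)
print(repr(raw_label))
```

Printing the repr of the text node is a quick way to check for this kind of padding in a real page before writing the XPath.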

This should work:

list_row2 = doc2.xpath(u'.//td[strong[contains(text(),"Total Assets")]]/following-sibling::td/strong/text()')

(Although I tested it in Python 3.2.)
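
If contains() turns out to be too loose (it also matches a longer label that merely includes "Total Assets"), one possible alternative is XPath's normalize-space(), which trims leading and trailing whitespace and collapses internal runs before comparing. A sketch on a made-up fragment; the markup, the second row's label, and all numbers are assumptions, not taken from the Yahoo page:

```python
import lxml.html

# Hypothetical fragment with two padded labels, one of which merely
# contains the phrase "Total Assets". All values are invented.
SAMPLE = """
<table>
  <tr>
    <td><strong> Total Assets </strong></td>
    <td><strong> 1,000 </strong></td>
    <td><strong> 900 </strong></td>
  </tr>
  <tr>
    <td><strong> Total Assets Turnover </strong></td>
    <td><strong> 2 </strong></td>
  </tr>
</table>
"""

doc = lxml.html.document_fromstring(SAMPLE)

# contains() matches both rows, because both labels include the phrase.
loose = doc.xpath(
    './/td[strong[contains(text(),"Total Assets")]]'
    '/following-sibling::td/strong/text()')

# normalize-space(.) strips the padding first, so only the row whose
# label is exactly "Total Assets" matches.
strict = doc.xpath(
    './/td[strong[normalize-space(.)="Total Assets"]]'
    '/following-sibling::td/strong/text()')

# The selected values keep their own padding, so strip them in Python.
values = [v.strip() for v in strict]

print(len(loose))
print(values)
```

Whether the stricter form is needed depends on which other labels the real page contains; contains() is fine when no other label includes the phrase.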
--
http://mail.python.org/mailman/listinfo/python-list
