On 2012-12-03 01:23, Jason Hsu wrote:
I'm trying to extract the data on "total assets" from Yahoo Finance using
Python 2.7 and lxml.
Here is a special test script I set up to work on this issue:
import urllib
import lxml
import lxml.html
url_local1 = "http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView"
result1 = urllib.urlopen(url_local1)
element_html1 = result1.read()
doc1 = lxml.html.document_fromstring (element_html1)
list_row1 = doc1.xpath(u'.//th[div[text()="Total Assets"]]/following-sibling::td/text()')
print list_row1
url_local2 = "http://finance.yahoo.com/q/bs?s=FAST"
result2 = urllib.urlopen(url_local2)
element_html2 = result2.read()
doc2 = lxml.html.document_fromstring (element_html2)
list_row2 = doc2.xpath(u'.//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
print list_row2
I'm able to get the row of data on total assets from the Smartmoney page, but I
get just an empty list when I try to parse the Yahoo Finance page.
The problem is that text()="Total Assets" asks for an exact match against
the whole text node.
If you look at the Yahoo HTML itself, you'll see that there's whitespace
around the "Total Assets" part, so the comparison fails and you get an
empty list.
This should work:
list_row2 = doc2.xpath(u'.//td[strong[contains(text(),"Total Assets")]]/following-sibling::td/strong/text()')
(Although I tested it in Python 3.2.)
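To see the difference without fetching the live page, here is a minimal sketch run against a made-up HTML fragment. The fragment and the values "100" and "200" are placeholders, not Yahoo's actual markup or figures. The assumption is only that the label sits inside <strong> with whitespace around it, as described above. The third query uses normalize-space(), a standard XPath 1.0 function, as a possible alternative to contains().

```python
import lxml.html

# Hypothetical fragment: the label is padded with whitespace inside
# <strong>; "100" and "200" are placeholder cell values.
SAMPLE_HTML = """
<table>
  <tr>
    <td><strong>
      Total Assets
    </strong></td>
    <td><strong>100</strong></td>
    <td><strong>200</strong></td>
  </tr>
</table>
"""

doc = lxml.html.document_fromstring(SAMPLE_HTML)

# Shared tail of all three queries: the <strong> text of the sibling cells.
tail = '/following-sibling::td/strong/text()'

# Exact comparison: fails because the text node includes the whitespace.
exact = doc.xpath('.//td[strong[text()="Total Assets"]]' + tail)

# Substring test: succeeds regardless of the surrounding whitespace.
partial = doc.xpath('.//td[strong[contains(text(),"Total Assets")]]' + tail)

# Strip and collapse whitespace first, then compare exactly.
normalized = doc.xpath(
    './/td[strong[normalize-space(text())="Total Assets"]]' + tail)

print(exact)
print(partial)
print(normalized)
```

Note that contains() would also match a longer label that merely includes "Total Assets", whereas normalize-space() still requires the whole label to match once the padding is removed.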