MRAB, 03.12.2012 03:25:
> On 2012-12-03 01:23, Jason Hsu wrote:
>> I'm trying to extract the data on "total assets" from Yahoo Finance using
>> Python 2.7 and lxml.
>>
>> Here is a special test script I set up to work on this issue:
>>
>> import urllib
>> import lxml
>> import lxml.html
>>
>> url_local1 =
>> "http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView"
>>
>> result1 = urllib.urlopen(url_local1)
>> element_html1 = result1.read()
>> doc1 = lxml.html.document_fromstring (element_html1)

The last three lines are unnecessarily complicated code. Just use

    doc = lxml.html.parse(url_local1)

>> list_row1 = doc1.xpath(u'.//th[div[text()="Total
>> Assets"]]/following-sibling::td/text()')
>> print list_row1
>>
>> url_local2 = "http://finance.yahoo.com/q/bs?s=FAST"
>> result2 = urllib.urlopen(url_local2)
>> element_html2 = result2.read()
>> doc2 = lxml.html.document_fromstring (element_html2)
>> list_row2 = doc2.xpath(u'.//td[strong[text()="Total
>> Assets"]]/following-sibling::td/strong/text()')
>> print list_row2
>>
>> I'm able to get the row of data on total assets from the Smartmoney page,
>> but I get just an empty list when I try to parse the Yahoo Finance page.
>>
> The problem is that you're asking it to look for an exact match.
>
> If you look at the HTML itself, you'll see that there's whitespace
> around the "Total Assets" part.
>
> This should work:
>
> list_row2 = doc2.xpath(u'.//td[strong[contains(text(),"Total
> Assets")]]/following-sibling::td/strong/text()')

Something like "contains(text(),"Total Assets")" is better expressed as
"contains(.,"Total Assets")" because it considers the complete text content
instead of just one text node.

Stefan

--
http://mail.python.org/mailman/listinfo/python-list
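For anyone who wants to see the difference between the three predicates discussed in this thread, here is a minimal sketch. The HTML snippet, the cell values and the helper name `values` are made up for illustration and are not taken from the Yahoo Finance page. The first row has whitespace around the label, as MRAB described. The second row splits its label across several text nodes, which is the case where "text()" and "." behave differently.

```python
import lxml.html

# Hypothetical markup, not the real Yahoo Finance page.
HTML = """
<html><body><table>
<tr><td><strong>
    Total Assets
</strong></td><td><strong>100</strong></td></tr>
<tr><td><strong>Total <em>Current</em> Liabilities</strong></td><td><strong>40</strong></td></tr>
</table></body></html>
"""

doc = lxml.html.document_fromstring(HTML)


def values(label_predicate):
    """Return the value cell text for rows whose label matches the predicate.

    Hypothetical helper: it only plugs the predicate into the XPath
    expression used earlier in the thread.
    """
    path = './/td[strong[%s]]/following-sibling::td/strong/text()' % label_predicate
    return doc.xpath(path)


# Exact comparison fails because the text node includes surrounding whitespace.
exact = values('text()="Total Assets"')

# contains() with text() looks only at the first text node of <strong>.
first_node = values('contains(text(),"Total Assets")')

# contains() with "." looks at the complete text content of <strong>.
whole = values('contains(.,"Total Assets")')

# Label split by <em>: the first text node is just "Total ".
split_first = values('contains(text(),"Current Liabilities")')
split_whole = values('contains(.,"Current Liabilities")')

print(exact)        # []
print(first_node)   # ['100']
print(whole)        # ['100']
print(split_first)  # []
print(split_whole)  # ['40']
```

The snippet calls print() with a single argument so that it runs under both Python 2.7 and Python 3. If an exact match is really what is wanted, the standard XPath 1.0 function normalize-space() can be used in the predicate instead, e.g. 'normalize-space(.)="Total Assets"'.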