There is a problem in the middle of a web page: how do I parse it?

contro opinion Wed, 18 Jan 2012 08:56:22 -0800

Here is my code:

import urllib
import lxml.html


# The URL must stay on one line: a line break inside a quoted string is a syntax error.
down = "http://sc.hkex.com.hk/gb/www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
page = urllib.urlopen(down).read()
root = lxml.html.document_fromstring(page)

# Rows of class tr_normal that contain an <img> somewhere below them.
data1 = root.xpath('//tr[@class="tr_normal" and .//img]')
print "the row which contains img  :"
for u in data1:
    print u.text_content()

# Rows of class tr_normal with no <img> below them.
data2 = root.xpath('//tr[@class="tr_normal" and not(.//img)]')
print "the row which do not contain img  :"
for u in data2:
    print u.text_content()


The output is (I have omitted many lines):

the row which contains img  :
00329
the row which do not contain img  :
00001长江实业1,000#HOF
................many lines omitted
00327百富环球1,000#H
00328ALCO HOLDINGS2,000#

I wonder why there are so many rows that I can't get, such as the ones below
(you can see them on the web page
http://sc.hkex.com.hk/gb/www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm
):


00330 思捷环球 <http://sc.hkex.com.hk/gb/www.hkex.com.hk/chi/invest/company/profile_page_c.asp?WidCoID=00330&WidCoAbbName=&Month=&langcode=c> 100 #HOF
00331 春天百货 <http://sc.hkex.com.hk/gb/www.hkex.com.hk/chi/invest/company/profile_page_c.asp?WidCoID=00331&WidCoAbbName=&Month=&langcode=c> 2,000 #H
00332 NGAI LIK IND <http://sc.hkex.com.hk/gb/www.hkex.com.hk/chi/invest/company/profile_page_c.asp?WidCoID=00332&WidCoAbbName=&Month=&langcode=c> 4,000 #
................many lines omitted
I want to know how I can get these rows.
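
One way to narrow this down, as a minimal sketch and not a confirmed fix: count the tr_normal row tags in the raw download with a regular expression, then compare that number with len(data1) + len(data2) from the code above. The helper name count_raw_rows and the sample string are hypothetical and are not taken from the page; the pattern assumes the class attribute is quoted and holds only tr_normal.

```python
import re

# Opening tag of a row whose class attribute is exactly "tr_normal".
# Assumption: the attribute value is quoted; unquoted values are not matched.
ROW_OPEN = re.compile(r'<tr\b[^>]*\bclass\s*=\s*["\']tr_normal["\']',
                      re.IGNORECASE)


def count_raw_rows(html_text):
    """Count tr_normal row openings in the raw markup, without building a tree."""
    return len(ROW_OPEN.findall(html_text))


# Hypothetical miniature input, only to show the call; it is not the real page.
sample = (
    '<table>'
    '<tr class="tr_head"><td>code</td></tr>'
    '<tr class="tr_normal"><td>00328</td></tr>'
    '<TR CLASS="tr_normal"><td>00330</td></TR>'
    '</table>'
)
raw_count = count_raw_rows(sample)
print(raw_count)
```

On the real page the call would be count_raw_rows(page), where page is the string returned by read(). If the raw count is larger than what the two XPath queries return together, that would suggest the rows are in the download and are being lost or moved while the tree is built; if the numbers agree, the rows probably never arrived in the response.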

-- 
http://mail.python.org/mailman/listinfo/python-list
