----- Original Message ----- 

> I am having issues with the urllib and lxml.html modules.
> Here is my original code:
>
> import urllib
> import lxml.html
> down = 'http://v.163.com/special/visualizingdata/'
> file = urllib.urlopen(down).read()
> root = lxml.html.document_fromstring(file)
> xpath_str = "//div[@class='down s-fc3 f-fl']/a"
> urllist = root.xpath(xpath_str)
> for url in urllist:
>     print url.get("href")
>
> When run, it returns this output:
>
> http://mov.bn.netease.com/movieMP4/2012/12/A/7/S8H1TH9A7.mp4
> http://mov.bn.netease.com/movieMP4/2012/12/D/9/S8H1ULCD9.mp4
> http://mov.bn.netease.com/movieMP4/2012/12/4/P/S8H1UUH4P.mp4
> http://mov.bn.netease.com/movieMP4/2012/12/B/V/S8H1V8RBV.mp4
> http://mov.bn.netease.com/movieMP4/2012/12/6/E/S8H1VIF6E.mp4
> http://mov.bn.netease.com/movieMP4/2012/12/B/G/S8H1VQ2BG.mp4
>
> But when I change the line
> xpath_str='//div[@class="down s-fc3 f-fl"]//a'
> into
> xpath_str='//div[@class="col f-cb"]//div[@class="down s-fc3 f-fl"]//a'
> that is to say,
> urllist = root.xpath('//div[@class="col f-cb"]//div[@class="down s-fc3 f-fl"]//a')
> I do not receive any output. What is the flaw in this code?
> It is strange that the shorter one works and the longer one does
> not; they have the same XPath structure!

Are you sure this is related to Python? It looks like you just have an
issue querying the parsed document.

I know little about what you're trying to do, but:

1/ you're shadowing the built-in 'file' type
2/ your selector is probably wrong: 'class="col f-cb"' is an exact string
comparison, so it will fail if the div's class attribute in the document is
written as "col  f-cb" (2 spaces), "f-cb col", etc.
3/ your short selector returns all matching elements regardless of their
parent, hence it is not sensitive to issue 2/
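
To illustrate point 2/, here is a minimal sketch that matches class tokens
instead of the whole attribute string. The markup, the example.com URLs and the
has_class() helper are made up for the illustration (I have not checked them
against the real page); the predicate itself is the usual XPath 1.0 idiom for
matching one whitespace-separated class.

```python
import lxml.html

# Hypothetical markup, NOT the real page: the outer divs carry the same two
# classes, once with a doubled space and once in swapped order.
SAMPLE = """
<html><body>
<div class="col  f-cb">
  <div class="down s-fc3 f-fl"><a href="http://example.com/a.mp4">a</a></div>
</div>
<div class="f-cb col">
  <div class="down s-fc3 f-fl"><a href="http://example.com/b.mp4">b</a></div>
</div>
</body></html>
"""


def has_class(name):
    """Build an XPath predicate matching one whitespace-separated class token.

    Hypothetical helper, not part of lxml.
    """
    return "contains(concat(' ', normalize-space(@class), ' '), ' %s ')" % name


root = lxml.html.document_fromstring(SAMPLE)

# Exact string comparison: neither outer div is spelled exactly "col f-cb".
exact = root.xpath('//div[@class="col f-cb"]//div[@class="down s-fc3 f-fl"]//a')

# Token comparison: insensitive to extra spaces and to the order of classes.
by_token = root.xpath(
    '//div[%s and %s]//div[%s]//a'
    % (has_class('col'), has_class('f-cb'), has_class('down'))
)

exact_hrefs = [a.get('href') for a in exact]
token_hrefs = [a.get('href') for a in by_token]
```

With this sample, exact_hrefs is empty while token_hrefs holds both links,
which is the same symptom you describe.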

How to get all .mp4 links:


# Collect every <a> element that has an href, then keep those pointing to .mp4
hrefList = root.xpath('//a[@href]')
mp4List = [ref for ref in hrefList if '.mp4' in ref.attrib.get('href', '')]

mp4List
[<Element a at 8d7ee0c>,
 <Element a at 8d7eefc>,
 <Element a at 8d7ee6c>,
 <Element a at 8d7ed7c>,
 <Element a at 8d7ef8c>,
 <Element a at 8d7efbc>]

From this list you can access parent and child information.

for mp4 in mp4List:
  print mp4.get('href')

http://mov.bn.netease.com/movieMP4/2012/12/A/7/S8H1TH9A7.mp4
http://mov.bn.netease.com/movieMP4/2012/12/D/9/S8H1ULCD9.mp4
http://mov.bn.netease.com/movieMP4/2012/12/4/P/S8H1UUH4P.mp4
http://mov.bn.netease.com/movieMP4/2012/12/B/V/S8H1V8RBV.mp4
http://mov.bn.netease.com/movieMP4/2012/12/6/E/S8H1VIF6E.mp4
http://mov.bn.netease.com/movieMP4/2012/12/B/G/S8H1VQ2BG.mp4
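
As a rough sketch of what "parent and child information" can look like, the
snippet below walks from each link up to its enclosing divs. The markup and
URLs are again invented for the example; getparent() and iterancestors() are
the lxml element methods, the variable names are mine.

```python
import lxml.html

# Hypothetical markup standing in for the real page.
SAMPLE = """
<html><body>
<div class="col f-cb">
  <div class="down s-fc3 f-fl"><a href="http://example.com/a.mp4">download</a></div>
  <div class="other"><a href="http://example.com/page.html">page</a></div>
</div>
</body></html>
"""

root = lxml.html.document_fromstring(SAMPLE)

# Keep only the links whose href ends with .mp4.
links = [a for a in root.xpath('//a[@href]') if a.get('href', '').endswith('.mp4')]

# For each link: its href, the class of its direct parent, and its text.
summary = [(a.get('href'), a.getparent().get('class'), a.text) for a in links]

# Classes of all enclosing divs of the first link, nearest first.
ancestors = [div.get('class') for div in links[0].iterancestors('div')]
```

This way you can filter on the parents in Python after a loose XPath query,
instead of encoding every constraint in the selector.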

cheers

JM

