I am trying to download a few IEEE pages using urllib2, but for certain pages I get only the first part of the page. For other pages from the same server and URL (just a different page ID) I get the complete page. The difference seems to be the publication date of the paper that the page describes: pages for papers from before 2000 end just before the date, while pages for papers from 2000 and later end at </html> as expected.
Two example URLs:

Does not work: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=517048
Does work: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=854728

I tried both urlopen and urlretrieve, with both urllib and urllib2. With urlopen I tried both .read() and .read(10000) to make sure I got the whole page, but nothing helped.

Sample code:

    import urllib2

    # Fetch the page that comes back truncated and print what was received.
    response = urllib2.urlopen("http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=517048")
    html = response.read()
    print html

The cutoff is always at the same location: just after the label "Meeting Date" and before the date itself. Could it be that something is being interpreted as an EOF marker, or something like that?

Example of the cutoff point with a bad page:

    <br/><b>Meeting Date: </b>

Example of the same point with a good page:

    <br/><b>Meeting Date: </b> 13 jun 2000

By the way, the bad pages do continue after this point when viewed in a web browser, so it does not seem to be a server problem.

--
http://mail.python.org/mailman/listinfo/python-list
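
One way to confirm that the body really arrives short, rather than only being printed short, would be to compare the number of bytes read against the Content-Length header, if the server sends one. The following is a minimal sketch under that assumption; the helper name missing_bytes is hypothetical, and the values in the example calls are made up for illustration, not taken from the IEEE server.

```python
def missing_bytes(content_length, body):
    """Return how many bytes short the body is, or None if the length is unknown.

    content_length is the raw Content-Length header value (a string), or None
    when the server did not send one (for example with chunked encoding).
    """
    if content_length is None:
        return None
    return int(content_length) - len(body)


# With urllib2 the two arguments would come from the response object:
#   response.info().getheader("Content-Length")   and   response.read()
# The values below are invented placeholders.
print(missing_bytes("100", b"x" * 60))
print(missing_bytes(None, b""))
```

A positive result would mean the connection delivered fewer bytes than the server announced; None means the header was absent and this check says nothing either way.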