Oops, the "Good" page is also handled wrongly. The papers from 2000 are handled wrong too, so here is a real example of a page that is handled correctly:
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5206867

On May 22, 11:43 am, Dragon Lord <dragonlord...@gmail.com> wrote:
> I am trying to download a few IEEE pages by using urllib2, but with
> certain pages I get only the first part of the page. With other pages
> from the same server and url (just another pageID) I get the right
> results. The difference between these pages seems to be the date on
> which the paper the page describes was published. Pages for papers from
> before 2000 end just before the date; pages from 2000 and later end at
> </html>.
>
> Two example URLs:
>
> Does not work: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=517048
> Does work: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=854728
>
> I tried both urlopen and urlretrieve and tried both urllib and
> urllib2. With urlopen I tried both .read() and .read(10000) to make
> sure I got the whole page, but nothing helped.
>
> Sample code:
>
> import urllib2
> response = urllib2.urlopen("http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=517048")
> html = response.read()
> print html
>
> The cutoff is always at the same location: just after the label
> "Meeting date" and before the date itself. Could it be that something
> is interpreted as an EOF command or something like that?
>
> example of the cutoff point with a bad page:
> <br/><b>Meeting Date: </b>
>
> example of the cutoff point with a good page:
> <br/><b>Meeting Date: </b>
>
>
> 13 jun 2000
>
> The bad pages do continue after this point, btw, if you use a
> web browser, so it does not seem to be a server problem.
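To narrow down where the data stops, a small helper along these lines might be useful. It is only a sketch, not a fix: `read_all` and `describe` are hypothetical names of my own, and the marker string is simply the "Meeting Date:" label from the pages above. It reads any file-like object until `read()` returns nothing, then reports how many bytes arrived, whether a closing `</html>` was seen, and what the last bytes look like. The demo uses in-memory data so it runs without network access.

```python
import io


def read_all(response, chunk_size=8192):
    """Read a file-like object in chunks until read() returns an empty result."""
    chunks = []
    while True:
        chunk = response.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def describe(body, marker=b"Meeting Date:"):
    """Summarise where the body ends relative to the marker (assumed label)."""
    pos = body.find(marker)
    return {
        "length": len(body),
        "has_closing_html": b"</html>" in body.lower(),
        # -1 means the marker was not found at all
        "bytes_after_marker": -1 if pos < 0 else len(body) - pos - len(marker),
        # the raw tail; printing its repr() shows any control characters
        "tail": body[-40:],
    }


# Offline demo with made-up page fragments standing in for a real response.
truncated = io.BytesIO(b"<html><br/><b>Meeting Date: </b>")
complete = io.BytesIO(b"<html><br/><b>Meeting Date: </b>\n13 jun 2000</html>")
print(describe(read_all(truncated)))
print(describe(read_all(complete)))
```

With a real request, the object returned by `urllib2.urlopen(url)` would be passed to `read_all` in place of the `BytesIO` object. Printing `repr()` of the tail should show whether an unusual byte sits right at the cutoff.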