On Sun, Nov 20, 2011 at 1:06 PM, Jabba Laci <jabba.l...@gmail.com> wrote:
> Hi,
>
> I want to extract the URLs of all the posts on a tumblr blog. Let's
> take for instance this blog: http://loveyourchaos.tumblr.com/archive .
> If I download this page with a script, there are only 50 posts in the
> HTML. If you scroll down in your browser to the end of the archive,
> the browser will dynamically load newer and newer posts.
>
> How to scrape such a dynamic page?
>
> Thanks,
>
> Laszlo
> --

The page isn't really that dynamic; HTTP doesn't allow for that. Scrolling down the page triggers some JavaScript, which sends further HTTP requests to the server. The server returns more HTML, and the script inserts it into the page.

If you monitor your network traffic with a tool like Firebug while you scroll, you should be able to work out the pattern in the requests for more content. Then send those same requests yourself and parse the results.
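
As a rough sketch of "send those same requests yourself", something like the following might do. I haven't looked at what this particular blog actually requests, so the part that finds the next request is left as a callback you write yourself once you've seen the traffic. The names collect_post_urls, next_url, is_post and looks_like_post are made up for this example, and the "/post/<digits>" pattern is only my guess at what a post link looks like.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a chunk of HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url):
    """Download a URL and return the body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def looks_like_post(url):
    # Assumption: post links contain "/post/<digits>". Adjust this to
    # whatever the links in the real HTML look like.
    return re.search(r"/post/\d+", url) is not None


def collect_post_urls(start_url, next_url, is_post=looks_like_post,
                      fetch=fetch, max_pages=100):
    """Replay the "load more" requests by hand and gather post URLs.

    next_url(html, current_url) must return the URL of the next request,
    or None when there is nothing more to load. That logic is whatever
    pattern you observed in the network monitor.
    """
    seen = set()
    ordered = []
    url = start_url
    for _ in range(max_pages):  # hard cap so a bad next_url cannot loop forever
        if url is None:
            break
        html = fetch(url)
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if is_post(absolute) and absolute not in seen:
                seen.add(absolute)
                ordered.append(absolute)
        url = next_url(html, url)
    return ordered
```

You would call it as collect_post_urls("http://loveyourchaos.tumblr.com/archive", my_next_url), where my_next_url is your own function that builds the follow-up request from the response it is given. Passing fetch as a parameter also lets you try the parsing on saved HTML before hitting the server.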