Re: scraping a tumblr.com archive page

Benjamin Kaplan Sun, 20 Nov 2011 10:22:39 -0800

On Sun, Nov 20, 2011 at 1:06 PM, Jabba Laci <[email protected]> wrote:
> Hi,
>
> I want to extract the URLs of all the posts on a tumblr blog. Let's
> take for instance this blog: http://loveyourchaos.tumblr.com/archive .
> If I download this page with a script, there are only 50 posts in the
> HTML. If you scroll down in your browser to the end of the archive,
> the browser will dynamically load newer and newer posts.
>
> How to scrape such a dynamic page?
>
> Thanks,
>
> Laszlo
> --


The page isn't really that dynamic; plain HTTP doesn't allow for that.
The server can only answer requests, it can't push new content into a
page on its own. Scrolling down the page triggers some JavaScript. That
JavaScript sends further HTTP requests to the server, which returns
more HTML, which gets inserted into the page. If you monitor your
network traffic with a tool like Firebug while you scroll, you should
be able to figure out the pattern in the requests for more content.
Just send those same requests yourself and parse the results.
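
A rough sketch of that approach in Python 3, standard library only. I
haven't checked what requests the archive page actually makes, so the
list of URLs to fetch is left for you to fill in from what Firebug
shows. The names here (PostLinkParser, extract_post_urls,
scrape_archive) are made up for illustration, and the "/post/<number>"
pattern for permalinks is an assumption you should check against the
real HTML.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class PostLinkParser(HTMLParser):
    """Collect href values of <a> tags that look like post permalinks."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: post permalinks contain "/post/<numeric id>".
        if href and re.search(r"/post/\d+", href):
            self.links.append(href)


def extract_post_urls(html):
    """Return the post links found in one chunk of HTML, in order."""
    parser = PostLinkParser()
    parser.feed(html)
    return parser.links


def fetch(url):
    """Download a URL and return its body as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def scrape_archive(page_urls, fetch=fetch):
    """Request each URL in turn and collect post links.

    page_urls is whatever sequence of request URLs you observed the
    browser sending. Stops at the first response with no new posts.
    """
    seen = []
    for url in page_urls:
        new = [u for u in extract_post_urls(fetch(url)) if u not in seen]
        if not new:
            break
        seen.extend(new)
    return seen
```

Passing fetch as a parameter lets you try the parsing on saved HTML
before hitting the server. If the real requests need particular
headers or cookies, fetch() is the place to add them.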
-- 
http://mail.python.org/mailman/listinfo/python-list
