On Nov 19, 2008, at 7:12 AM, Mr.SpOOn wrote:
On Wed, Nov 19, 2008 at 2:39 AM, Mensanator <[EMAIL PROTECTED]>
wrote:
Another hobby I have is tracking movie box-office receipts
(where you can make interesting graphs comparing Titanic
to Harry Potter, or how well the various sequels do; and if Pierce
Brosnan saved the James Bond franchise, what can you say about
Daniel Craig?). Lots of potential database problems there.
Not to mention automating the data collection from the Internet
Movie Database by writing a web page scraper that can grab
six months' worth of data in a single session (you probably
wouldn't need this if you cough up a subscription fee for
professional access, but I'm not THAT serious about it).
This is really interesting. What would one need to build something like
that? The only web-related program I've written in Python generated an
RSS feed from a local newspaper's static site, using BeautifulSoup. But
I never put it on an online host, and I'm not even sure it would run
there. What requirements would a host need to meet to run Python code?
I'm not sure why you'd need to host the Python code anywhere other
than your home computer. If you wanted to pull thousands of pages from
a site like that, you'd need to respect their robots.txt file. Don't
forget to look for a crawl-delay specification. Even if they don't
specify one, you shouldn't let your bot hammer their servers at full
speed -- give it a delay and let it run in the background. It might
take you three days instead of an hour to collect the data you need,
but that's not too big a deal in the service of good manners, is it?
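As a rough sketch, using the standard library's robot parser (it's
urllib.robotparser on Python 3, plain robotparser on Python 2, and the
crawl_delay() call is a newer addition) -- the site, agent name, and
paths below are just placeholders, not the real thing:

    import time
    import urllib.request
    import urllib.robotparser

    BASE = "http://www.imdb.com"       # placeholder target site
    AGENT = "boxoffice-bot"            # made-up name for this sketch

    # Fetch and parse the site's robots.txt before pulling anything.
    rp = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
    rp.read()

    # Honor an explicit crawl-delay; otherwise pick a polite default.
    delay = rp.crawl_delay(AGENT) or 5

    for path in ["/chart/boxoffice"]:  # placeholder list of pages
        url = BASE + path
        if not rp.can_fetch(AGENT, url):
            continue                   # robots.txt disallows it -- skip
        html = urllib.request.urlopen(url).read()
        # ... hand html to BeautifulSoup, pull out the numbers ...
        time.sleep(delay)              # spread the requests out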
You might also want to change the user-agent string that you send out.
Some sites serve up different content to bots than to browsers.
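With the standard library that's just a matter of building a Request
with your own headers (the URL below is a stand-in again):

    import urllib.request

    # Identify as a browser (or as an honest bot with a contact
    # address) -- some sites vary their output on this header.
    req = urllib.request.Request(
        "http://www.imdb.com/chart/boxoffice",   # placeholder URL
        headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
    )
    html = urllib.request.urlopen(req).read()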
You could even use wget to scrape the site instead of rolling your own
bot, if you're more interested in the data manipulation aspect of the
project than in writing the bot itself.
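If you go that route, a single command along the lines of

    wget --recursive --level=2 --wait=5 --user-agent="Mozilla/5.0" http://www.imdb.com/chart/

(with the URL again a placeholder) mirrors the pages with a built-in
delay and a custom user-agent, leaving only the parsing to your Python
code.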
Enjoy
Philip