On Nov 19, 2008, at 7:12 AM, Mr.SpOOn wrote:
On Wed, Nov 19, 2008 at 2:39 AM, Mensanator <[EMAIL PROTECTED]>
wrote:
Another hobby I have is tracking movie box-office receipts
(where you can make interesting graphs comparing Titanic
to Harry Potter, or how well the various sequels do; and if Pierce
Brosnan saved the James Bond franchise, what can you say about
Daniel Craig?). Lots of potential database problems there.
Not to mention automating the data collection from the Internet
Movie Database by writing a web page scraper that can grab
six months' worth of data in a single session (you probably
wouldn't need this if you cough up a subscription fee for
professional access, but I'm not THAT serious about it).
This is really interesting. What would one need to build something like
that? The only web-related program I've written in Python generated an
RSS feed from a local newspaper's static site, using BeautifulSoup. But
I never put it on an online host, and I'm not even sure it would run
there. What requirements would a host need to meet to run Python code?
I'm not sure why you'd need to host the Python code anywhere other
than your home computer. If you wanted to pull thousands of pages from
a site like that, you'd need to respect their robots.txt file. Don't
forget to look for a crawl-delay specification. Even if they don't
specify one, you shouldn't let your bot hammer their servers at full
speed -- give it a delay and let it run in the background. It might
take you three days instead of an hour to collect the data you need,
but that's not too big a deal in the service of good manners, is it?
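As a rough sketch, using the standard library's robot parser (it's
urllib.robotparser on Python 3, plain robotparser on Python 2, and the
crawl_delay() call is a newer addition) -- the site, agent name, and
paths below are just placeholders, not the real thing:

    import time
    import urllib.request
    import urllib.robotparser

    BASE = "http://www.imdb.com"       # placeholder target site
    AGENT = "boxoffice-bot"            # made-up name for this sketch

    # Fetch and parse the site's robots.txt before pulling anything.
    rp = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
    rp.read()

    # Honor an explicit crawl-delay; otherwise pick a polite default.
    delay = rp.crawl_delay(AGENT) or 5

    for path in ["/chart/boxoffice"]:  # placeholder list of pages
        url = BASE + path
        if not rp.can_fetch(AGENT, url):
            continue                   # robots.txt disallows it -- skip
        html = urllib.request.urlopen(url).read()
        # ... hand html to BeautifulSoup, pull out the numbers ...
        time.sleep(delay)              # spread the requests out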
You might also want to change the user-agent string that you send out.
Some sites serve up different content to bots than to browsers.
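With the standard library that's just a matter of building a Request
with your own headers (the URL below is a stand-in again):

    import urllib.request

    # Identify as a browser (or as an honest bot with a contact
    # address) -- some sites vary their output on this header.
    req = urllib.request.Request(
        "http://www.imdb.com/chart/boxoffice",   # placeholder URL
        headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
    )
    html = urllib.request.urlopen(req).read()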
You could even use wget to scrape the site instead of rolling your own
bot, if you're more interested in the data manipulation aspect of the
project than in writing the bot itself.
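If you go that route, a single command along the lines of

    wget --recursive --level=2 --wait=5 --user-agent="Mozilla/5.0" http://www.imdb.com/chart/

(with the URL again a placeholder) mirrors the pages with a built-in
delay and a custom user-agent, leaving only the parsing to your Python
code.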
Enjoy
Philip