On 9/12/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> On 11 Sep 2007 at 16:09, Chas Owens wrote:
> > On 9/11/07, Jenda Krynicky <[EMAIL PROTECTED]> wrote:
> > > On 11 Sep 2007 at 15:15, Srinivas wrote:
> > > > I want to write a perl script that scrapes various job sites like
> > > > monster, dice, career builders etc.
> > > >
> > > > Given the job id and web site name it should scrape the
> > > > information and store in a mySQL database.
> > >
> > > And are you sure they won't mind? I don't work there anymore, but
> > > still ... you should make sure what you plan to do is OK with them.
> > snip
> >
> > The easiest way to do this is to obey their robots.txt file. You can
> > learn more about robots.txt here:
> > http://www.robotstxt.org/wc/faq.html. Also, be careful, the text you
> > are copying is still copyrighted and you cannot republish more than a
> > snippet without running into potential legal hazards.
>
> I don't think that's enough. It's one thing to index a site for
> searching (think ... Google) and another to scrape the data and
> present it elsewhere as yours. The fact that it's OK to run a script
> to download some data doesn't mean all uses of said data are all
> right.
snip

Right, that is why I warned about the possible legal hazards*. A
script should never** request data under a URL marked Disallow, but
even if it is acceptable to read data, it is almost never acceptable
to display more than a snippet of it (think of the one or two lines
after a search result in Google). However, you may, if my
understanding of US and international copyright laws is correct,
derive new information from the data you scrape off a website. So,
you could create a robot that scrapes all of the new jobs off of
several job websites (assuming the robots.txt allows you to) and then
create a webpage that looks like this

Monster has
    5 new jobs requiring Perl
    20 new jobs requiring Java
    1000 new jobs requiring Befunge
Dice has
    15 new jobs requiring Perl
    50 new jobs requiring Java
    0 jobs requiring Befunge

It would not be legal (in my opinion) to then have those lines link to
the full text of the jobs in question on your own website, but a page
like this

New jobs on Monster:
    DBA/Developer - Sacramento, CA
    Sys Admin - BFE, OK
    Senior Sysadmin - Atlanta, GA
    Perl Developer - Norfolk, VA
    Data Munging Expert - Portland, OR

where each line deep links to the job offer on Monster's website would
be legal (again, in my opinion).

* always consult legal counsel before working with someone else's data.
** for certain values of never
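
For what it's worth, here is a rough sketch of the two mechanical
pieces described above: refusing to fetch anything robots.txt
disallows, and publishing only counts derived from the scraped text
rather than the text itself. It is written in Python using the
standard library's urllib.robotparser; in Perl, WWW::RobotRules and
LWP::RobotUA cover the same ground. The sample robots.txt, the
jobs.example host, the "JobBot" agent name, and the function names are
all made up for illustration, and the keyword match is a naive
substring test, so treat this as a starting point only.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real robot would download the site's own file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried per URL."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def may_fetch(rules, agent, url):
    """Return True only if the rules allow this user agent to request the URL."""
    return rules.can_fetch(agent, url)


def summarize(site, jobs, skills):
    """Build a 'derived information' summary: per-skill counts, no job text.

    Matching is a case-insensitive substring test, so "Java" would also
    match "JavaScript"; a real robot would need something stricter.
    """
    counts = Counter()
    for text in jobs:
        lowered = text.lower()
        for skill in skills:
            if skill.lower() in lowered:
                counts[skill] += 1
    lines = [f"{site} has"]
    lines += [f"  {counts[skill]} new jobs requiring {skill}" for skill in skills]
    return "\n".join(lines)


if __name__ == "__main__":
    rules = make_rules(SAMPLE_ROBOTS_TXT)
    for url in ("http://jobs.example/jobs/42", "http://jobs.example/private/123"):
        print(url, "allowed" if may_fetch(rules, "JobBot", url) else "disallowed")
    print(summarize("Monster", ["Perl Developer - Norfolk, VA"], ["Perl", "Java"]))
```

The point of keeping summarize() separate is that only its output ever
reaches the published page; the scraped job text stays on the robot's
side, and any links would still point back at the original site.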