On Wed, Oct 26, 2011 at 03:22:24PM +0200, Shlomi Fish wrote:
> Hi Jeswin,
>
> On Wed, 26 Oct 2011 09:04:32 -0400
> Jeswin <phillyj...@gmail.com> wrote:
>
> > Hi all,
> > I'm still a beginner but I have a project I want to work on.
> >
> > I want to pull price data from a website and would like your advice on
> > getting started.
> >
> > This is my idea and a basic implementation of the process:
> >
> > 1) The input is coverted to the web link, i.e., if I type in "force of will"
> > the output is
> > http://sales.starcitygames.com//search.php?substring=Force+of+Will&auto=Y
> > 2) Somehow, I ask perl to go to the link and get the prices and take an
> > average or display individual prices.
> >
> > I see that using the filter (and a longer, more complex web link) I can get
> > the web output displayed as a simple chart [1].
> > Looking at the html source, the price data is displayed as
> > "<td class="deckdbbody2">$1.99 </td>" . So maybe I can get a regexp to get
> > all the different prices and list them.
> >
> > What should I be looking at to learn more on doing this? Is there a better
> > way?
>
> First of all, you should be using WWW::Mechanize or something similar to
> perform the web-automation. Then you should use XML::LibXML's HTML parsing
> mode or HTML::TreeBuilder or similar to retrieve the data from the HTML. Do
> *not* parse HTML using regular expressions:
>
> http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
>
> Oct 13 16:53:51 <rindolf> perlbot: html
> Oct 13 16:53:51 <perlbot> rindolf: Don't parse or modify html with
> regular expressions! See one of HTML::Parser's subclasses: HTML::TokeParser,
> HTML::TokeParser::Simple, HTML::TreeBuilder(::Xpath)?, HTML::TableExtract
> etc. If your response begins "that's overkill. i only want to..." you are
> wrong. http://en.wikipedia.org/wiki/Chomsky_hierarchy and
> http://xrl.us/bf4jh6 for why not to use regex on HTML

Yeah, yeah, yeah.

Using the argument that regular expressions can't parse HTML because HTML
isn't regular is fine. Until you realise that Perl's regular expressions,
along with those of pretty much every other language or library, aren't
actually regular either.

However, even that doesn't really matter because sometimes, as in this case,
you don't need to fully parse the HTML anyway. And sometimes you just want a
95% solution, especially if you are prototyping.

If that's what you want, then something like the code below will serve as a
basis. If you do want more, then feel free to do it the "proper" way.

    #!/usr/bin/perl

    my @html = `wget -O- 'http://sales.starcitygames.com//spoiler/display.php?name=vampiric+spirit&namematch=EXACT&text=&oracle=1&textmatch=AND&flavor=&flavormatch=EXACT&action=Show+Results&s_all=All&format=&c_all=All&multicolor=&colormatch=OR&ccl=0&ccu=99&t_all=All&z[]=&critter[]=&crittermatch=OR&pwrop=%3D&pwr=&pwrcc=&tghop=%3D&tgh=-&tghcc=-&mincost=0.00&maxcost=9999.99&minavail=0&maxavail=9999&r_all=All&g_all=All&foil=nofoil&for=no&sort1=4&sort2=1&sort3=10&sort4=0&display=2&numpage=25'`;

    for (@html) {
        print "Found [$1]\n" if /<td class="deckdbbody2">\$(.*?) <\/td>/;
    }

This space is reserved for people to explain that you shouldn't waste a
process by shelling out, that you might use up all your memory if the HTML is
large, that you should always use strict and warnings, that the code is not
portable, that screen scraping is evil (and do you have permission?) and what
if the site changes its format, and that no, you really, really (no, really)
shouldn't parse HTML with a regular expression.

-- 
Paul Johnson - p...@pjcj.net
http://www.pjcj.net
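
P.S. Since the original question also asked about taking an average, here is
a rough sketch of the same regex idea run against a hard-coded sample rather
than the live site, so it can be tried without wget. The sample HTML and the
sub name extract_prices are made up for illustration, and the pattern is
tightened to match only digits; the real page source may well differ, so
treat this as a starting point and not as the way the site has to be handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical sample standing in for the page source (an assumption;
# the real markup may differ).
my $html = <<'HTML';
<tr><td class="deckdbbody2">Vampiric Spirit </td><td class="deckdbbody2">$1.99 </td></tr>
<tr><td class="deckdbbody2">Vampiric Spirit </td><td class="deckdbbody2">$0.49 </td></tr>
<tr><td class="deckdbbody">$2.50 </td></tr>
HTML

# Return every price found in a deckdbbody2 cell, without the dollar sign.
# In list context, m//g returns all captures of the single group.
sub extract_prices {
    my ($text) = @_;
    return $text =~ /<td class="deckdbbody2">\$(\d+(?:\.\d+)?)\s*<\/td>/g;
}

my @prices = extract_prices($html);
print "Found [$_]\n" for @prices;

# Guard against dividing by zero when nothing matched.
printf "Average: %.2f\n", sum(@prices) / @prices if @prices;
```

With the sample above this finds 1.99 and 0.49 and skips the third row,
whose class is deckdbbody and not deckdbbody2. To use it on real data, feed
extract_prices the joined output of the wget call instead of $html.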