Re: Help pulling data from website

Paul Johnson Wed, 26 Oct 2011 08:01:47 -0700

On Wed, Oct 26, 2011 at 03:22:24PM +0200, Shlomi Fish wrote:
> Hi Jeswin,
> 
> On Wed, 26 Oct 2011 09:04:32 -0400
> Jeswin <phillyj...@gmail.com> wrote:
> 
> > Hi all,
> > I'm still a beginner but I have a project I want to work on.
> > 
> > I want to pull price data from a website and would like your advice on
> > getting started.
> > 
> > This is my idea and a basic implementation of the process:
> > 
> > 1) The input is converted to the web link, i.e., if I type in "force of will"
> > the output is
> > http://sales.starcitygames.com//search.php?substring=Force+of+Will&auto=Y
> > 2) Somehow, I ask perl to go to the link and get the prices and take an
> > average or display individual prices.
> > 
> > I see that using the filter (and a longer, more complex web link) I can get
> > the web output displayed as a simple chart [1].
> > Looking at the html source, the price data is displayed as "<td class=
> > "deckdbbody2">$1.99&nbsp;</td>" . So maybe I can get a regexp to get all the
> > different prices and list them.
> > 
> > What should I be looking at to learn more on doing this? Is there a better
> > way?
> 
> First of all, you should be using WWW::Mechanize or something similar to
> perform the web-automation. Then you should use XML::LibXML's HTML parsing
> mode or HTML::TreeBuilder or similar to retrieve the data from the HTML. Do
> *not* parse HTML using regular expressions:
> 
> http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
> 
> Oct 13 16:53:51 <rindolf>       perlbot: html
> Oct 13 16:53:51 <perlbot>       rindolf: Don't parse or modify html with 
> regular expressions! See one of HTML::Parser's subclasses: HTML::TokeParser, 
> HTML::TokeParser::Simple, HTML::TreeBuilder(::Xpath)?, HTML::TableExtract 
> etc. If your response begins "that's overkill. i only want to..." you are 
> wrong. http://en.wikipedia.org/wiki/Chomsky_hierarchy and 
> http://xrl.us/bf4jh6 for why not to use regex on HTML


Yeah, yeah, yeah.  Using the argument that regular expressions can't
parse HTML because HTML isn't regular is fine.  Until you realise that
Perl's regular expressions, along with those of pretty much every other
language or library, aren't actually regular either.
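
To make that concrete, here is a minimal sketch, assuming perl 5.10 or
later for recursive patterns.  Balanced parentheses are a textbook
example of a language that is not regular, yet a Perl pattern can
recognise them.  The name $balanced and the test strings are invented
for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# (?1) recurses into capture group 1, so the pattern can follow
# arbitrarily deep nesting, which a truly regular expression cannot.
my $balanced = qr/^(\((?:[^()]++|(?1))*\))$/;

for my $s ('(a(b)c)', '((x)', '(()())') {
    printf "%-8s %s\n", $s, $s =~ $balanced ? 'balanced' : 'not balanced';
}
```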

However, even that doesn't really matter because sometimes, as in this
case, you don't need to fully parse the HTML anyway.

And sometimes you just want a 95% solution, especially if you are
prototyping.  If that's what you want, then something like the code
below will serve as a basis.  If you do want more, then feel free to do
it the "proper" way.


#!/usr/bin/perl

# Fetch the results page.  The command has to stay on one line inside the
# backticks, otherwise the shell treats the URL as a separate command.
my @html = `wget -O- 'http://sales.starcitygames.com//spoiler/display.php?name=vampiric+spirit&namematch=EXACT&text=&oracle=1&textmatch=AND&flavor=&flavormatch=EXACT&action=Show+Results&s_all=All&format=&c_all=All&multicolor=&colormatch=OR&ccl=0&ccu=99&t_all=All&z[]=&critter[]=&crittermatch=OR&pwrop=%3D&pwr=&pwrcc=&tghop=%3D&tgh=-&tghcc=-&mincost=0.00&maxcost=9999.99&minavail=0&maxavail=9999&r_all=All&g_all=All&foil=nofoil&for=no&sort1=4&sort2=1&sort3=10&sort4=0&display=2&numpage=25'`;

# Print the price from every line that contains a price cell.
for (@html)
{
    print "Found [$1]\n" if /<td class="deckdbbody2">\$(.*?)&nbsp;<\/td>/;
}
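
The two steps from the original question (turning a card name into a
search link, then averaging the prices) could be sketched along the
same lines.  This is only a rough outline under stated assumptions:
search_url is a hypothetical helper that handles ASCII names only, I
have not checked whether the site cares about capitalisation, and the
sample rows are invented so that the script runs without touching the
network.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the search link from a card name.
# Percent-encodes anything unusual, then turns spaces into "+".
sub search_url {
    my ($name) = @_;
    (my $q = $name) =~ s/([^A-Za-z0-9 _.~-])/sprintf '%%%02X', ord $1/ge;
    $q =~ tr/ /+/;
    return "http://sales.starcitygames.com//search.php?substring=$q&auto=Y";
}

print search_url('force of will'), "\n";

# Sample rows invented for illustration; the real page may differ.
my $html = <<'END';
<td class="deckdbbody2">$1.99&nbsp;</td>
<td class="deckdbbody2">$2.49&nbsp;</td>
<td class="deckdbbody2">$0.99&nbsp;</td>
END

# A global match in list context returns every captured price at once.
my @prices = $html =~ /<td class="deckdbbody2">\$([\d.]+)&nbsp;<\/td>/g;

my $total = 0;
$total += $_ for @prices;

printf "Found %d prices, average \$%.2f\n", scalar @prices, $total / @prices
    if @prices;
```

To run it against the live page, replace the here-document with the
output of the wget call above joined into a single string.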






This space is reserved for people to explain that you shouldn't waste a
process by shelling out, that you might use up all your memory if the
HTML is large, that you should always use strict and warnings, that the
code is not portable, that screen scraping is evil (and do you have
permission?) and what if the site changes its format, and that no, you
really, really (no, really) shouldn't parse HTML with a regular
expression.

-- 
Paul Johnson - p...@pjcj.net
http://www.pjcj.net

-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/
