Re: How to extract the HTML tag?

Chas. Owens Thu, 03 Jan 2008 10:45:22 -0800

On Jan 3, 2008 9:41 AM, howa <[EMAIL PROTECTED]> wrote:
> On Jan 3, 10:00 PM, [EMAIL PROTECTED] (Chas. Owens) wrote:
>
> > On Jan 3, 2008 8:27 AM, howa <[EMAIL PROTECTED]> wrote:
> > snip
> > > And it should handle other rare cases, e.g.
> >
> > > my $str = " <div   \n  style='...'> apple </div> ";
> >
> > snip
> >
> > And right there you showed why regexes are not good for parsing HTML
> > (and XML).  That problem is non-trivial and therefore we have modules
> > that take care of the messy parsing that is necessary to get the
> > information you want.  See my other email for the names of some
> > modules that you might find handy.
>
> Even so, I would like to know how to make this work...
>
> For example, why doesn't the following work?
>
> use strict;
>
> my $str = " <div style='...'> apple </div> ";
>
> if ($str =~ /<(.*?)\s*?.*?>/gi) {
snip


Let's break the regex down to its pieces and see what each one will match:

< a literal <
(.*?) matches zero or more characters (anything but a newline), as few as possible
\s*? matches zero or more whitespace characters, as few as possible
.*? matches zero or more characters (anything but a newline), as few as possible
> a literal >

So, I predict that (.*?) will match nothing, \s*? will match nothing,
and .*? will match everything between the first < character and the
next > character after it.  Let's see what happens*: [<][][][div
style='...'][>].  Yep, it looks like you have fallen prey to the
classic non-greedy-match-followed-by-non-greedy-match issue.  A
non-greedy match takes as little as it can and only grows when the
rest of the pattern would fail otherwise, so it needs something
definite after it to stop against.  This is why the last non-greedy
match ate everything; it was the only one followed by a literal (the
>), while the first two could succeed by matching the empty string.
You could try saying

$str =~ /<\s*(\w+).*?>/gs

but that undoubtedly has problems as well.  The only safe way to do it
is with a parser, and happily there are several already written and
waiting for you to use them.  For instance, the code to extract divs
from a chunk of HTML looks like this with HTML::Parser**:

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Parser;

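my $p = HTML::Parser->new(api_version => 3) or die;
my @divs;
# for start events the 'tag' argspec is the same as 'tagname':
# the tag name arrives as $_[0]
$p->handler(start => sub { push @divs, $_[0] if $_[0] eq 'div' }, 'tag');

$p->parse("<div>foo</div><div><span>bar</span></div>") or die;
$p->eof;    # signal end of document so nothing is left buffered

print map { "$_\n" } @divs;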

And this code has the benefit that it won't break on valid HTML
(unlike the regular expressions we were using).  And it can easily be
extended, for example to collect the text inside each div:

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Parser;

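my $p = HTML::Parser->new(api_version => 3) or die;
my @divs;
my $save_text = 0;    # number of divs currently open; save text while > 0
$p->handler(start => sub {
                return unless $_[0] eq 'div';
                # start with an empty string so an empty div prints cleanly
                push @divs, { tag => $_[0], text => '' };
                $save_text++;
}, 'tagname');
$p->handler(text => sub {
                # text can arrive in several chunks, so append;
                # with nested divs it goes to the most recently opened one
                $divs[-1]{text} .= $_[0] if $save_text;
}, 'text');
$p->handler(end => sub {
                $save_text-- if $_[0] eq 'div';
}, 'tagname');

$p->parse("<div>foo</div><div><span>bar</span></div>") or die;
$p->eof;    # signal end of document so any buffered text is flushed

print map { "$_->{tag} holds $_->{text}\n" } @divs;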
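
To tie this back to the "rare case" from the original question, here
is a minimal sketch of the same approach applied to the multi-line div
with a style attribute.  The @tags array and the hash keys are names I
made up for illustration; the 'tagname, attr' argspec and the start_h
constructor option are taken from the HTML::Parser documentation.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use HTML::Parser;

my @tags;    # hypothetical container, one hash per start tag seen

my $p = HTML::Parser->new(
    api_version => 3,
    # 'tagname, attr' passes the tag name and a hash ref of its attributes
    start_h => [ sub {
        my ($tagname, $attr) = @_;
        push @tags, { tag => $tagname, style => $attr->{style} };
    }, 'tagname, attr' ],
);

# the tag from the question, with whitespace and a newline inside it
$p->parse(" <div   \n  style='...'> apple </div> ");
$p->eof;

for my $t (@tags) {
    my $style = defined $t->{style} ? $t->{style} : '(none)';
    print "$t->{tag} has style $style\n";
}
```

The newline inside the tag needs no special handling here, which is
the point of leaving the tokenizing to the module.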

* here is the source code I used
#!/usr/local/ActivePerl-5.10/bin/perl

use strict;
use warnings;
use feature ":5.10";

my $str = " <div style='...'> apple </div> ";

say map { "[$_]" } $str =~ /(<)(.*?)(\s*?)(.*?)(>)/;
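
For comparison, here is a small variation on that script, offered as a
sketch rather than a recommendation.  It runs the same five-group
split next to a version where the second group is \w+ instead of
(.*?), so that group has something to stop against.  The second
pattern and the array names are mine, not part of the original
question.

```perl
#!/usr/bin/perl

use strict;
use warnings;

my $str = " <div style='...'> apple </div> ";

# Original pattern: the first two lazy groups can succeed by matching
# the empty string, so only the last one is forced to grow (up to >).
my @lazy = $str =~ /(<)(.*?)(\s*?)(.*?)(>)/;

# Variant: \w+ must take at least one word character and cannot cross
# the space, so the tag name ends up in the second group.
my @anchored = $str =~ /(<)(\w+)(\s*)(.*?)(>)/;

print join('', map { "[$_]" } @lazy), "\n";        # [<][][][div style='...'][>]
print join('', map { "[$_]" } @anchored), "\n";    # [<][div][ ][style='...'][>]
```

This still only handles the easy case; it is here to show the
matching behaviour, not to replace the parser.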

** http://search.cpan.org/dist/HTML-Parser/Parser.pm

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/
