On Jan 3, 2008 9:41 AM, howa <[EMAIL PROTECTED]> wrote:
> On Jan 3, 10:00 PM, [EMAIL PROTECTED] (Chas. Owens) wrote:
> > On Jan 3, 2008 8:27 AM, howa <[EMAIL PROTECTED]> wrote:
> > snip
> > > And it should handle other rare cases, e.g.
> > >
> > > my $str = " <div \n style='...'> apple </div> ";
> > snip
> >
> > And right there you showed why regexes are not good for parsing HTML
> > (and XML). That problem is non-trivial and therefore we have modules
> > that take care of the messy parsing that is necessary to get the
> > information you want. See my other email for the names of some
> > modules that you might find handy.
>
> Even so, I would like to know how to make this work...
>
> For example, why the following didn't work, e.g.
>
> use strict;
>
> my $str = " <div style='...'> apple </div> ";
>
> if ($str =~ /<(.*?)\s*?.*?>/gi) {
snip

Let's break the regex down into its pieces and see what each one will match:

    <       a literal <
    (.*?)   matches nothing or anything, whichever makes a shorter match
    \s*?    matches nothing or consecutive whitespace, whichever makes a
            shorter match
    .*?     matches nothing or anything, whichever makes a shorter match
    >       a literal >

So, I predict that (.*?) will match nothing, \s*? will match nothing, and
.*? will match everything between the first < character and the next >
character after it. Let's see what happens*:

    [<][][][div style='...'][>]

Yep, it looks like you have fallen prey to the classic
non-greedy-match-followed-by-non-greedy-match issue. A non-greedy match
only grows as far as the thing after it forces it to, so it needs an
anchor (something after it that must match) to do anything useful. Here
the first two non-greedy pieces are each followed by something that is
happy to match nothing, so they match nothing too. This is why the last
non-greedy match ate everything; it was the only one with an anchor (the
literal >). You could try saying

    $str =~ /<\s*(\w+).*?>/gm

but that undoubtedly has problems as well. The only safe way to do it is
with a parser, and happily there are several already written and waiting
for you to use them. For instance, the code to extract divs from an HTML
file looks like this with HTML::Parser**:

    #!/usr/bin/perl

    use strict;
    use warnings;
    use HTML::Parser;

    my $p = HTML::Parser->new(api_version => 3)
        or die;

    my @divs;

    # for every start tag, remember its name if it is a div
    $p->handler(start => sub { push @divs, $_[0] if $_[0] eq 'div' }, 'tag');

    $p->parse("<div>foo</div><div><span>bar</span></div>")
        or die;
    $p->eof;    # tell the parser the document is complete

    print map { "$_\n" } @divs;

And this code has the benefit that it won't break on valid HTML (unlike
the regular expressions we were using).

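Coming back to the regexes for a moment: the breakdown above can be checked with a short script. This is only a sketch. The helper names first_capture and show are made up for this example, and the two test strings are the ones quoted earlier in this thread. It compares the capture group of the pattern from the question with the one I suggested, and then tries the suggested pattern on the string that has a newline inside the tag.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# first_capture is a hypothetical helper for this example only: it returns
# the first capture group of $re matched against $text, or undef when the
# pattern does not match at all.
sub first_capture {
    my ($re, $text) = @_;
    my ($cap) = $text =~ $re;
    return $cap;
}

# show is another hypothetical helper that makes undef and "" visible.
sub show {
    my ($cap) = @_;
    return defined $cap ? "[$cap]" : "no match";
}

my $lazy   = qr/<(.*?)\s*?.*?>/;    # pattern from the question
my $anchor = qr/<\s*(\w+).*?>/;     # suggested replacement

my $one_line = " <div style='...'> apple </div> ";
my $two_line = " <div \n style='...'> apple </div> ";

# The lazy capture group has nothing forcing it to grow, so it is empty.
print "question pattern, one line:   ", show(first_capture($lazy, $one_line)), "\n";

# \w+ must consume at least one word character, so it holds the tag name.
print "suggested pattern, one line:  ", show(first_capture($anchor, $one_line)), "\n";

# Without /s, . does not match a newline, so the tag is never closed.
print "suggested pattern, two lines: ", show(first_capture($anchor, $two_line)), "\n";
```

The first line prints an empty pair of brackets, the second prints [div], and the third prints "no match". Adding /s would let . cross the newline, but the match would still stop at the first > it sees, even one inside a quoted attribute value.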
The HTML::Parser version can also be extended easily, for example to
collect the text inside each div:

    #!/usr/bin/perl

    use strict;
    use warnings;
    use HTML::Parser;

    my $p = HTML::Parser->new(api_version => 3)
        or die;

    my @divs;
    my $save_text = 0;    # how many divs we are currently inside

    # start of a div: record it and begin saving text
    $p->handler(start => sub {
        push @divs, { tag => $_[0] } if $_[0] eq 'div';
        $save_text++ if $_[0] eq 'div';
    }, 'tagname');

    # text: append it to the most recent div while inside one
    $p->handler(text => sub {
        $divs[-1]{text} .= $_[0] if $save_text;
    }, 'text');

    # end of a div: one less div to save text for
    $p->handler(end => sub {
        $save_text-- if $_[0] eq 'div';
    }, 'tagname');

    $p->parse("<div>foo</div><div><span>bar</span></div>")
        or die;
    $p->eof;    # tell the parser the document is complete

    print map { "$_->{tag} holds $_->{text}\n" } @divs;

* Here is the source code I used:

    #!/usr/local/ActivePerl-5.10/bin/perl

    use strict;
    use warnings;
    use feature ":5.10";

    my $str = " <div style='...'> apple </div> ";

    say map { "[$_]" } $str =~ /(<)(.*?)(\s*?)(.*?)(>)/;

** http://search.cpan.org/dist/HTML-Parser/Parser.pm