Re: Infinite loop--RegEx problem?

Steve Grazzini Thu, 31 Oct 2002 20:28:54 -0800

[EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Please help!!!  I'm trying to complete an assignment to determine the number 
> of HTML tags within a string at each level of nesting.  ie:  for the string 
> 
> "<p><strong>a</strong><a>b</a></p><strong>c<italic>d<a>e</a></italic></strong>"
> 
> <p> is at level 0
> <strong> and <a> are at level 1
> etc...
> 
> My code is below.  The problem is that apparently my RegEx is wrong (in the 
> while loop), and so the loop is infinite, and no values are printed--just "".  
> I'm just learning perl, and RegEx is really confusing to me.  Any feedback 
> would be appreciated!


Okay -- first of all, be patient.  You mailed this three times.

> #!usr/bin/perl
> #Exercise 8.9
> #To determine how many HTML tags are included in a string at each level of nesting.
> 
> use strict;
> use warnings;
> 
> my $string = 
> "<p><strong>a</strong><a>b</a></p><strong>c<italic>d<a>e</a></italic></strong>";
> 
> print("The HTML string is:\n$string\n\n");
> 
> our @counter;
> count($string, 0);
> 
> for (0 .. $#counter){
>   print("There are $counter[$_] valid level",$_," html tags.\n");
> }
> 
> sub count
> {
>   my $string = shift();
>   my $level = shift();
> 
>   while ($string =~ m!.*?<([a-z]*?)>(.*)</\1>!g){

The regex is not bad, given that sample input.

There are two bugs and some unnecessary clutter, though:

  m!
    .*?          # This doesn't do anything useful.

                 # Since there's no anchor at the front, there's 
                 # no need to match the text in between <tags>.  
                 # Just let perl skip them.


    <([a-z]*?)>  # Here you don't need the "?".

                 # Since '>' is outside the range [a-z], the star 
                 # has to consume all the letters, and greediness
                 # can't enter into it.

                 # But you ought to consider whether you really 
                 # want to accept *zero* or more letters here.


    (.*)         # And _this_ might be a good place to use the 
                 # nongreedy "*?" quantifier.

    </\1> 

  !xg;
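
To make the greedy/nongreedy point and the "zero or more letters" point
concrete, here is a minimal sketch.  The input strings are made up for
illustration (they are not from the assignment), and the patterns are just
the one above with the clutter removed.

```perl
use strict;
use warnings;

# Hypothetical input: two sibling tags with the same name.
my $html = '<b>x</b><b>y</b>';

# Greedy (.*) runs to the LAST closing tag it can reach.
my $greedy    = $html =~ m!<([a-z]+)>(.*)</\1>!  ? $2 : '';

# Nongreedy (.*?) stops at the FIRST closing tag that fits.
my $nongreedy = $html =~ m!<([a-z]+)>(.*?)</\1>! ? $2 : '';

print "greedy:    $greedy\n";      # x</b><b>y
print "nongreedy: $nongreedy\n";   # x

# [a-z]* also accepts an empty tag name; [a-z]+ does not.
my $star_matches = '<>oops</>' =~ m!<([a-z]*)>(.*)</\1>! ? 1 : 0;
my $plus_matches = '<>oops</>' =~ m!<([a-z]+)>(.*)</\1>! ? 1 : 0;

print "star: $star_matches, plus: $plus_matches\n";   # star: 1, plus: 0
```

Neither quantifier is right for every input: the nongreedy form would stop
too early if a tag were nested inside another tag of the same name.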

> #should remember html tag as 1 and contents as 2
>     $counter[$level]++;
>     print("Tag is \"\1\", contains \"\2\" ");

Backreferences (\1, \2, etc) are only valid inside the regex, where 
you've used \1 correctly.

Outside the regex, the captured substrings are in the digit variables
$1, $2 ... $n.
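
A small sketch of the difference, using a made-up one-tag string.  Inside a
double-quoted string "\1" is not a backreference at all; it is an octal
escape for the character with code 1, which would explain why the print
above shows nothing useful between the quotes.

```perl
use strict;
use warnings;

my $text = '<em>hi</em>';   # hypothetical sample input
my ($wrong, $right) = ('', '');

if ($text =~ m!<([a-z]+)>(.*?)</\1>!) {
    # Inside a double-quoted string, "\1" is an octal escape: chr(1).
    $wrong = "Tag is \"\1\"";

    # $1 and $2 hold what the first and second parens captured.
    $right = "Tag is \"$1\", contains \"$2\"";
}

print "$right\n";   # Tag is "em", contains "hi"
```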

>     print("at level $level\n");
>     count($string, $level+1);
>
> #recursively calling count with the string and the level increased by one

Here you recursively process the original $string, which matches at the
same place and recurses again on the original $string, which matches at
the same place and... 

You want to recurse on "the bit in the middle", which was captured in the
second set of parens.
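
Here is a sketch of that shape of recursion.  It is not the assignment: the
sub name walk() and the @seen array are invented for illustration, and it
only records which tag was seen at which level, so the counting is still
yours to do.  It also assumes well-formed input like your sample string.

```perl
use strict;
use warnings;

my @seen;   # hypothetical collector: "level:tag" entries, in visiting order

sub walk {
    my ($string, $level) = @_;

    while ($string =~ m!<([a-z]+)>(.*?)</\1>!g) {
        # Copy the captures into lexicals before doing anything else,
        # so later matches cannot affect them.
        my ($tag, $inner) = ($1, $2);

        push @seen, "$level:$tag";

        # Recurse on the bit in the middle, NOT on $string itself.
        walk($inner, $level + 1);
    }
}

walk('<p><strong>a</strong><a>b</a></p>'
   . '<strong>c<italic>d<a>e</a></italic></strong>', 0);

print join(' ', @seen), "\n";
# 0:p 1:strong 1:a 0:strong 1:italic 2:a
```

Each call works on a shorter string than its caller, so the recursion has
to bottom out once there are no tags left to match.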

>   }
> }
> 

Since it's homework, I'll leave you to it.

But I really hope your instructor has explained that the only time 
you should use regexes on HTML is in the context of an exercise.

In real life, go directly to HTML::Parser.

In fact, I'm probably assuming too much and you should go get 
HTML::Parser *right now* :)

-- 
Steve

perldoc -qa.j | perl -lpe '($_)=m("(.*)")'

