[EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Please help!!! I'm trying to complete an assignment to determine the number
> of HTML tags within a string at each level of nesting. ie: for the string
>
> "<p><strong>a</strong><a>b</a></p><strong>c<italic>d<a>e</a></italic></strong>"
>
> <p> is at level 0
> <strong> and <a> are at level 1
> etc...
>
> My code is below. The problem is that apparently my RegEx is wrong (in the
> while loop), and so the loop is infinite, and no values are printed--just "".
> I'm just learning perl, and RegEx is really confusing to me. Any feedback
> would be appreciated!

Okay -- first of all, be patient. You mailed this three times.

> #!usr/bin/perl
> #Exercise 8.9
> #To determine how many HTML tags are included in a string at each level of nesting.
>
> use strict;
> use warnings;
>
> my $string =
>"<p><strong>a</strong><a>b</a></p><strong>c<italic>d<a>e</a></italic></strong>";
>
> print("The HTML string is:\n$string\n\n");
>
> our @counter;
> count($string, 0);
>
> for (0 .. $#counter){
>    print("There are $counter[$_] valid level",$_," html tags.\n");
> }
>
> sub count
> {
>    my $string = shift();
>    my $level = shift();
>
>    while ($string =~ m!.*?<([a-z]*?)>(.*)</\1>!g){

The regex is not bad, given that sample input. There are two bugs
and some unnecessary clutter, though:

    m!
        .*?           # This doesn't do anything useful.
                      # Since there's no anchor at the front, there's
                      # no need to match the text in between <tags>.
                      # Just let perl skip them.

        <([a-z]*?)>   # Here you don't need the "?".
                      # Since '>' is outside the range [a-z], the star
                      # has to consume all the letters, and greediness
                      # can't enter into it.
                      # But you ought to consider whether you really
                      # want to accept *zero* or more letters here.

        (.*)          # And _this_ might be a good place to use the
                      # nongreedy "*?" quantifier.

        </\1>
    !xg;

>       #should remember html tag as 1 and contents as 2
>       $counter[$level]++;
>       print("Tag is \"\1\", contains \"\2\" ");

Backreferences (\1, \2, etc) are only valid inside the regex, where
you've used \1 correctly. Outside the regex, the captured substrings
are in the digit variables $1, $2 ... $n.

>       print("at level $level\n");
>       count($string, $level+1);
>
>       #recursively calling count with the string and the level increased by one

Here you recursively process the original $string, which matches
at the same place and recurses again on the original $string, which
matches at the same place and...

You want to recurse on "the bit in the middle", which was captured
in the second set of parens.

>    }
> }
>

Since it's homework, I'll leave you to it.
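
To make the greedy-versus-nongreedy point and the $1/$2 point concrete,
here is a small sketch that is deliberately not the assignment. The input
string and the @greedy and @lazy arrays are made up for illustration; only
the regex shape comes from the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input (not the assignment's string): two sibling <b> tags.
my $html = '<b>x</b><b>y</b>';

# Greedy (.*) runs on to the LAST </b>, so both siblings are
# swallowed by a single match.
my @greedy;
while ($html =~ m!<([a-z]+)>(.*)</\1>!g) {
    # Outside the regex the captures live in $1 and $2, not \1 and \2.
    push @greedy, join(':', $1, $2);
}

# Nongreedy (.*?) stops at the FIRST </b>, so each sibling
# is matched separately.
my @lazy;
while ($html =~ m!<([a-z]+)>(.*?)</\1>!g) {
    push @lazy, join(':', $1, $2);
}

print "greedy: @greedy\n";   # greedy: b:x</b><b>y
print "lazy:   @lazy\n";     # lazy:   b:x b:y
```

Note that the nongreedy form has its own blind spot: given a tag nested
inside another tag of the same name, such as <b><b>x</b></b>, it stops at
the inner </b> and captures "<b>x". Which behaviour you want depends on the
input you expect. That covers the regex side of the exercise.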

But I really hope your instructor has explained that the only time you
should use regexes on HTML is in the context of an exercise. In real life,
go directly to HTML::Parser. In fact, I'm probably assuming too much and
you should go get HTML::Parser *right now* :)

-- 
Steve

perldoc -qa.j | perl -lpe '($_)=m("(.*)")'