On Sun, 30 Nov 2008 02:51:57 +0200, Canol Gökel wrote:
> My problem is to match HTML tags with RegExp. I managed to match
> something like this, properly:
> 
> la la la <p>a paragraph</p> bla bla bla <p>another paragraph</p> ya ya
> ya
> 
> But when the tags are nested, problems arise:
> 
> <p>a paragraph <p>bla bla bla</p> la la la</p>
> 
> It matches
> 
> <p>a paragraph <p>bla bla bla</p>
> 
> instead of matching the innermost part:
> 
> <p>bla bla bla</p>
> 
> How can one write an expression that always matches the innermost part? I
> couldn't write an expression that says "match a non-greedy <p>.*</p> which
> does not have a <p> inside".
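
To reproduce the behaviour described in the question, here is a minimal
sketch. It assumes the original attempt was a plain non-greedy
<p>.*?</p>, which the question does not show verbatim; the variable
names are illustrative only.

```perl
use strict;
use warnings;

# The nested example from the question.
my $html = '<p>a paragraph <p>bla bla bla</p> la la la</p>';

# Assumed form of the original attempt: a plain non-greedy match.
# The engine starts at the leftmost <p> and stops at the first </p>,
# so the match begins at the outer tag, not the inner one.
my ($naive) = $html =~ m{(<p>.*?</p>)}s;

print "Naive match: $naive\n";
```

Non-greediness only shortens the right-hand end of a match. It never moves
the starting point to the right, which is why the outer opening tag is
included.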

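Here is the pattern. The negative lookahead is tested before each
character is consumed, so the match can never run across a second
opening tag:

        (<p>(?:(?!<p>).)*?</p>)

$ cat /tmp/foo
#!/usr/local/bin/perl
use strict;
use warnings;

# print "Perl version $]\n";

# Slurp everything after __END__ into $_.
$_ = do { local $/; <DATA> };

m{
 (                # start capturing
   <p>            # match an opening tag
   (?: (?!<p>)    # as long as no opening tag starts here,
       .          # match any character, newline included under /s
   )*?            # nongreedily
   </p>           # match a closing tag
 )                # end capturing
}xs and print "Matched: $1\n";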

__END__
Outermost: <p>
  Middle: <p>
    Inner: <p> Content
    </p>  Trailing
  </p> Trailing
</p> Finished

$ /tmp/foo
Matched: <p> Content
    </p>
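
Where the lookahead sits relative to the dot matters when one opening
tag directly follows another with nothing in between. The sketch below
compares the two placements and also collects every innermost paragraph
with /g in list context. The helper names innermost_paragraphs and
lookahead_after are hypothetical, made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: every innermost <p>...</p> in a string.
# The lookahead is tested BEFORE each character is consumed.
sub innermost_paragraphs {
    my ($html) = @_;
    my @found = $html =~ m{ ( <p> (?: (?!<p>) . )*? </p> ) }xsg;
    return @found;
}

# Hypothetical helper for comparison: the lookahead is tested AFTER
# each character, so the character right behind the first <p> is
# consumed without being checked.
sub lookahead_after {
    my ($html) = @_;
    my @found = $html =~ m{ ( <p> (?: . (?!<p>) )*? </p> ) }xsg;
    return @found;
}

my $adjacent = '<p><p>x</p></p>';
print "before the dot: $_\n" for innermost_paragraphs($adjacent);
print "after the dot:  $_\n" for lookahead_after($adjacent);

# The nested example from the question.
print "question: $_\n"
    for innermost_paragraphs('<p>a paragraph <p>bla bla bla</p> la la la</p>');
```

With the lookahead before the dot, the adjacent-tag input yields
<p>x</p>. With it after the dot, the match is <p><p>x</p>, because the
"<" of the second tag is consumed before any check runs. Neither form
copes with attributes such as <p class="...">; real-world HTML is usually
better served by a proper parser.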


-- 
Peter Scott
http://www.perlmedic.com/
http://www.perldebugged.com/
