Canol Gökel wrote:

> My problem is to match HTML tags with RegExp.

What is your personal definition of a tag?

To match a tag, you could use "/<[^>]*>/", but that would also match
"<>".
Maybe you are just looking for "/<[A-Za-z]+>/"? Note that this one only
matches opening tags without attributes.
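
A quick sketch of the difference between the two patterns. The sample
string is made up for illustration:

```perl
use strict;
use warnings;

# Made-up sample text, not taken from the original question.
my $text = 'la <p>a paragraph</p> <> <br> <a href="x">link</a>';

# Loose pattern: anything between angle brackets, including the empty "<>".
my @loose = $text =~ /<[^>]*>/g;

# Stricter pattern: opening tags made of letters only, no attributes.
my @strict = $text =~ /<[A-Za-z]+>/g;

print "loose:  @loose\n";
print "strict: @strict\n";
```

The loose pattern picks up closing tags, tags with attributes, and "<>";
the stricter one only sees "<p>" and "<br>" here.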


> I managed to match
> something like this, properly:
>
> la la la <p>a paragraph</p> bla bla bla <p>another paragraph</p> ya
> ya ya
>
> But when nested, there arises problems:

And that is what you should expect when you use the wrong tool, or use a
tool in the wrong way.

You are not matching tags there, but whole constructs delimited by an
opening and a closing tag. Nested constructs are much easier to handle
in multiple passes, or with a parser, which is often simple to write.
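
To give an idea of how small such a parser can be, here is a minimal
sketch. It assumes lowercase tags without attributes, and the name
innermost_elements is just something I made up. It keeps a stack of open
tags and reports every element that has no element nested inside it:

```perl
use strict;
use warnings;

# Hypothetical helper: returns every element that contains no nested element.
# Assumes simple lowercase tags without attributes.
sub innermost_elements {
    my ($input) = @_;
    my @stack;    # currently open tags, innermost last
    my @found;

    # Walk over the tags in order; the text between tags is simply skipped.
    while ($input =~ m{<(/?)([a-z]+)>}g) {
        my ($closing, $name) = ($1, $2);
        my ($from, $to) = ($-[0], $+[0]);    # offsets of this tag

        if (!$closing) {
            # An opening tag makes its parent (if any) a non-innermost element.
            $stack[-1]{has_child} = 1 if @stack;
            push @stack, { name => $name, start => $from, has_child => 0 };
        }
        elsif (@stack && $stack[-1]{name} eq $name) {
            my $open = pop @stack;
            push @found, substr($input, $open->{start}, $to - $open->{start})
                unless $open->{has_child};
        }
        # Mismatched closing tags are ignored in this sketch.
    }
    return @found;
}

my @inner = innermost_elements('<p>a paragraph <p>bla bla bla</p> la la la</p>');
print "$_\n" for @inner;
```

For the nested example from the question, this reports only the inner
"<p>bla bla bla</p>".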


> <p>a paragraph <p>bla bla bla</p> la la la</p>
>
> It matches
>
> <p>A paragraph <p>bla bla bla</p>
>
> instead of matching the most inner part:
>
> <p>bla bla bla</p>
>
> How can one write an expression to match always the most inner part?

Don't focus on "an expression"; your toolbox is bigger than that.

    my $atom = qr~<([a-z]+)>[^<>]+</\1>~; # meant to evolve ;-)
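
One way $atom could be put to work, as a sketch only. The "[0]", "[1]"
placeholders are my own assumption and would clash with input that
already contains such text. Because [^<>]+ forbids angle brackets between
the tags, the first match is always an innermost pair; replacing each
match with a placeholder and repeating peels the nesting from the inside
out:

```perl
use strict;
use warnings;

my $atom = qr~<([a-z]+)>[^<>]+</\1>~;

my $text = '<p>a paragraph <p>bla bla bla</p> la la la</p>';

# Single pass: the first match is an innermost pair.
# $atom is not wrapped in extra parentheses, because that would
# renumber the groups and break the \1 backreference.
my $innermost;
if ($text =~ $atom) {
    $innermost = substr($text, $-[0], $+[0] - $-[0]);
    print "innermost: $innermost\n";
}

# Multiple passes: swap each matched atom for a numbered placeholder,
# so that the enclosing pair becomes an atom on the next pass.
my @layers;
my $work = $text;
1 while $work =~ s{$atom}{ push @layers, $&; '[' . $#layers . ']' }e;

print "layer $_: $layers[$_]\n" for 0 .. $#layers;
```

Note that [^<>]+ needs at least one character, so an empty pair such as
"<p></p>" is not matched by this version.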


> I couldn't write an expression like "match a non-greedy <p>.*</p>
> which does not have a <p> inside.
>
> Note: Most probably there is a module for this but:
>  - I want to learn the logic,
>  - I don't use Perl in this project,
>  - Actually my problem is different than matching HTML tags but I
> choose them to explain my problem, easily.

That is very stupid.

-- 
Affijn, Ruud

"Gewoon is een tijger."

