On Sun, 2008-11-30 at 02:51 +0200, Canol Gökel wrote:
> How can one write an expression to match always the most inner part? I
> couldn't write an expression like "match a non-greedy <p>.*</p> which
> does not have a <p> inside.
> 

You can't write a regular expression to do this.  And no, I'm not going
to write an entire second-year university course in an email explaining
why you can't.  You will just have to take my word for it.

The data structure you're describing has unbounded nested contexts.
This means that you can put a structure like <p>...</p> inside itself and
you can do this an unlimited number of times.  The only way to correctly
parse such structures is to use a finite-state automaton (FSA) with a
push-down stack.
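
To make the stack idea concrete, here is a rough sketch in Python (since
you said you won't be using Perl).  It is only an illustration, not a
real HTML parser: it assumes bare <p> and </p> tags with no attributes,
and the function name innermost_paragraphs is made up for this example.
The regular expression only finds the tags; the stack is what keeps
track of the nesting.

```python
import re

# Assumption: tags are exactly "<p>" and "</p>", with no attributes.
TAG = re.compile(r"<(/?)p>")

def innermost_paragraphs(text):
    """Return the contents of every <p>...</p> that has no <p> inside it."""
    stack = []    # one entry per open <p>: [offset after the tag, has_child]
    results = []
    for m in TAG.finditer(text):
        if m.group(1) == "":            # opening tag
            if stack:
                stack[-1][1] = True     # the enclosing <p> now contains a <p>
            stack.append([m.end(), False])
        elif stack:                     # closing tag that has a matching open
            start, has_child = stack.pop()
            if not has_child:
                results.append(text[start:m.start()])
    return results

print(innermost_paragraphs("<p>a<p>b</p>c<p>d<p>e</p></p></p>"))
```

The stack can grow as deep as the input nests, which is exactly the
unbounded memory a plain finite-state machine does not have.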

If you want to parse HTML, I suggest you use a module like
HTML::TreeBuilder.  If you want to parse XML, you should consider
XML::Parser at the least.  Nowadays, modules like XML::Twig, XML::DOM
and XML::SAX are preferred, but several of them are built on top of
XML::Parser, so you'll probably need to install it too.
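
Other languages have equivalent parser libraries.  As a hedged
illustration, this Python sketch uses the standard xml.etree.ElementTree
module to pick out the <p> elements that contain no other <p>.  It
assumes the input is well-formed XML, and the helper name innermost_text
is invented for the example.

```python
import xml.etree.ElementTree as ET

def innermost_text(xml_text, tag="p"):
    """Return the text of each <tag> element that has no <tag> descendant."""
    root = ET.fromstring(xml_text)   # raises ParseError if not well-formed
    return [
        "".join(el.itertext())
        for el in root.iter(tag)                 # every <tag>, document order
        if el.find(f".//{tag}") is None          # no nested <tag> below it
    ]

print(innermost_text("<doc><p>a<p>b</p></p><p>c</p></doc>"))
```

The parser does the nesting bookkeeping for you, so the "innermost"
test becomes a simple question asked of the finished tree.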

If your data is in a custom format that has no existing parser module,
you could use a module like Parse::RecDescent to simplify the creation
of the parser.
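
Parse::RecDescent generates recursive-descent parsers from a grammar.
To show roughly what such a parser does, here is a small hand-written
one in Python.  The format is hypothetical (words and parenthesised
groups, as in "(a (b c))"), and the names TOKEN and parse exist only in
this sketch.  Each level of nesting becomes one recursive call, so the
language's call stack plays the role of the push-down stack.

```python
import re

# Hypothetical format: an item is either a word or "(" item* ")".
TOKEN = re.compile(r"[()]|[^\s()]+")

def parse(text):
    """Parse one item and return it as nested Python lists of strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def parse_item():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(parse_item())   # recurse for each nested item
            if pos >= len(tokens):
                raise ValueError("missing ')'")
            pos += 1                         # consume the ")"
            return items
        if tok == ")":
            raise ValueError("unexpected ')'")
        return tok

    tree = parse_item()
    if pos != len(tokens):
        raise ValueError("trailing input after the first item")
    return tree

print(parse("(a (b c) (d (e)))"))
```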

But since you don't plan to use Perl, I suggest that you ask the experts
in the language of your choice.  They will be able to suggest the best
way to solve your problem and suggest the modules and libraries to help.


-- 
Just my 0.00000002 million dollars worth,
  Shawn

The key to success is being too stupid to realize you can fail.

