Re: Regular expression, "not this string"

Dave Cardwell Mon, 12 Mar 2007 10:32:59 -0800

Rob Dixon wrote:

Dave Cardwell wrote:

Rob Dixon wrote:

Dave Cardwell wrote:


Hello there, I'm having trouble constructing a regular expression
that would do the following:

FOO...
...followed by anything but BAR (non-greedy)...
...followed by BAZ (captured)...
...followed by anything but BAR (greedy)...
...followed by BAR

I've been looking at zero-width negative look-ahead, but I haven't
used this area of regular expressions before so I'm struggling. A
solution or prod in the right direction would be lovely.


Please show us the real problem. I know you mean to clarify, but your
summary is so ambiguous that understanding it becomes the most difficult
part of providing a solution.

Thanks,

Rob


I was afraid of that, sorry. I'm using HTML::Parser to scan through a
document, but I need to do one quick manipulation first that depends on
seeing the document as a whole (unlike per-token as with HTML::Parser).
Rather than attempting to fit all of the real work in a regular
expression, I thought it best to simply mark the element with a custom
attribute that HTML::Parser could pick up later.

To that end, I need to find an <a> (BAZ) that contains just plain text,
somewhere between an opening <td> (FOO) and the closest closing </td>
(BAR), i.e. something along the lines of:

s%
    <td([^>]*>
        {not </td>}*?
            <a[^>]*>[\w\s]+</a>
        {not </td>}*?
    </td>)
%<td foo="1"$1%gismx;

It's the {not </td>} bits I'm having difficulty with.
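
One common way to spell the {not </td>} placeholder is to apply a negative
look-ahead to every character consumed: (?:(?!</td>).)* matches any run of
characters that never starts a </td>. The following is only a minimal sketch
of that idea, not something posted in this thread. The sample markup and the
variable names $html and $marked are made up for illustration, and it assumes
the table cells are not nested.

```perl
use strict;
use warnings;

# Hypothetical sample input, invented for this sketch.
my $html = <<'HTML';
<table><tr>
<td class="a"><b>x</b> <a href="/one">plain text</a> tail</td>
<td class="b"><a href="/two"><img src="i.png"></a></td>
</tr></table>
HTML

# Work on a copy so the original string is left untouched.
(my $marked = $html) =~ s%
    <td\b ( [^>]* >
        (?: (?! </td> ) . )*?       # anything but </td>, non-greedy
        <a\b [^>]* > [\w\s]+ </a>   # an anchor holding only plain text
        (?: (?! </td> ) . )*        # anything but </td>, greedy
    </td> )
%<td foo="1"$1%gismx;

print $marked;
```

Only the first cell gains the foo attribute, because the anchor in the second
cell contains an <img> rather than plain text. The look-ahead stops each run
at the nearest </td>, so a match can never spill into the following cell, but
a nested table would still confuse it.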


OK I see. But I think you should be parsing the HTML instead of trying to
do this sort of stuff with a regex, which is notoriously awkward, mainly
because it doesn't take account of the structure of nested text like HTML
or XML.

I know the HTML::Parser interface isn't the easiest in the world to work
with, but one of its subclasses should do it for you. If the markup was
parsed with HTML::TreeBuilder, for example, I could write:

 foreach my $td ($tree->find('td')) {
   foreach my $anchor ($td->find('a')) {

     # Child elements are references; plain text is an ordinary string.
     my @content = $anchor->content_list;
     next if grep ref, @content;

     $anchor->attr(myattr => 1);
   }
 }

which finds all anchor tags which appear anywhere (at any level) within a
table data tag and contain no further HTML markup, and adds an attribute
'myattr' to that anchor with a value of 1. This may or may not be exactly
what you want, but you see the principle. I think an attempt to write a
regular expression to do the job would be problematic, to say the least.

HTH,

Rob


I'll certainly look into TreeBuilder - thank you.


--
Best wishes,
Dave Cardwell.

http://perlprogrammer.co.uk/


