Re: Matching XML Tag Contents with Regex

Diez B. Roggisch Tue, 11 Dec 2007 10:15:42 -0800

Chris wrote:

> On Dec 11, 11:41 am, garage <[EMAIL PROTECTED]> wrote:
>> > Is what I'm trying to do possible with Python's Regex library? Is
>> > there an error in my Regex?
>>
>> Search for '*?' on http://docs.python.org/lib/re-syntax.html.
>>
>> To get around the greedy single match, you can add a question mark
>> after the asterisk in the 'content' portion of the markup.  This
>> causes it to take the shortest match instead of the longest, e.g.:
>>
>> <%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*?[^(%(tagName)s)]*
>>
>> There's still some funkiness in the regex and logic, but this gives
>> you the three matches
> 
> Thanks, that's pretty close to what I was looking for. How would I
> filter out tags that don't have certain text in the contents? I'm
> running into the same issue again. For instance, if I use the regex:
> 
> <%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*?(targettext)+[^(%(tagName)s)]*
> 
> each match will include "targettext". However, some matches will still
> include </%(tagName)s)>, presumably from the tags which didn't contain
> targettext.
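
For what it's worth, the stray closing tags are probably explained by two
things in that pattern: [^(%(tagName)s)]* is a character class, so it
excludes single characters rather than the tag name as a word, and the lazy
*? still stretches across element boundaries until it reaches targettext.
A small sketch, assuming a made-up "item" tag and sample string (neither is
from the thread):

```python
import re

# Hypothetical sample input; the thread does not show the real document.
text = '<item id="1">plain</item><item id="2">has targettext</item>'

# Greedy: a single match that runs to the last closing tag.
greedy = re.findall(r'<item\s[^>]*>.*</item>', text)

# Non-greedy: one match per element.
lazy = re.findall(r'<item\s[^>]*>.*?</item>', text)

# A bracketed "negated group" is really a set of single characters:
# [^(item)] rejects '(', 'i', 't', 'e', 'm' and ')', nothing more.
char_class = re.match(r'[^(item)]*', 'xyz</item>').group()

# Lazy matching still crosses element boundaries to reach the target text,
# so the match starts at the first <item> even though it lacks the target.
crossing = re.search(r'<item\s[^>]*>.*?targettext', text).group()

print(len(greedy), len(lazy))
print(repr(char_class))
print(crossing)
```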


Stop using the wrong tool for the job. Use lxml or BeautifulSoup to parse &
access HTML. 
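
As a rough sketch of what that looks like: the snippet below uses the
standard library's xml.etree.ElementTree rather than lxml or BeautifulSoup,
only so that it runs without extra packages (lxml.etree offers a very
similar API). The sample document, the "item" tag, and the helper name
elements_containing are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the thread does not show the real input.
SAMPLE = """<root>
  <item id="1">some targettext here</item>
  <item id="2">nothing relevant</item>
  <item id="3">nested <b>targettext</b> inside</item>
</root>"""

def elements_containing(xml_text, tag_name, target):
    """Return elements named tag_name whose text content includes target."""
    root = ET.fromstring(xml_text)
    matches = []
    for elem in root.iter(tag_name):
        # itertext() yields the text of the element and all its descendants.
        content = "".join(elem.itertext())
        if target in content:
            matches.append(elem)
    return matches

for elem in elements_containing(SAMPLE, "item", "targettext"):
    print(elem.get("id"), ET.tostring(elem, encoding="unicode").strip())
```

The parser handles nesting and tag boundaries, so the filter reduces to a
plain substring test on each element's text.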

Diez
-- 
http://mail.python.org/mailman/listinfo/python-list
