Re: Regular expression to structure HTML

Stefan Behnel Fri, 02 Oct 2009 05:36:31 -0700

Paul McGuire wrote:
> On Oct 2, 12:10 am, "[email protected]" <[email protected]> wrote:
>> I'm kind of new to regular expressions, and I've spent hours trying to
>> finesse a regular expression to build a substitution.
>>
>> What I'd like to do is extract data elements from HTML and structure
>> them so that they can more readily be imported into a database.
> 
> Oy! If I had a nickel for every misguided coder who tried to scrape
> HTML with regexes...
> 
> Some reasons why REs are no good at parsing HTML:
> - tags can be mixed case
> - tags can have whitespace in many unexpected places
> - tags with no body can combine opening and closing tag with a '/'
> before the closing '>', as in "<BR/>"
> - tags can have attributes that you did not expect (like "<BR
> CLEAR=ALL>")
> - attributes can occur in any order within the tag
> - attribute names can also be in unexpected upper/lower case
> - attribute values can be enclosed in double quotes, single quotes, or
> even (surprise!) NO quotes
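
To make a few of the points in that list concrete, here is a minimal sketch
that runs a naive pattern and a real parser over the same input. The sample
markup, the "price" class and the names NAIVE, PriceCollector and
extract_prices are all made up for illustration, since the OP's actual HTML
isn't shown. It uses the stdlib html.parser module rather than BeautifulSoup,
just to stay dependency-free.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample input: four cells that differ only in case,
# whitespace, quoting style and extra attributes.
SAMPLE = (
    '<td class="price">5</td>'
    '<TD Class=price>10</TD>'
    '<td class="price" >20</td>'
    "<td\n class='price' align=right>30</td>"
    '<BR/>'
)

# The kind of pattern a regex beginner might start with (an assumption,
# not the OP's actual expression).
NAIVE = re.compile(r'<td class="price">(.*?)</td>')


class PriceCollector(HTMLParser):
    """Collect the text of every <td> whose class attribute is 'price'."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # The parser hands over lowercased tag and attribute names and
        # attribute values with any surrounding quotes removed.
        if tag == "td" and dict(attrs).get("class") == "price":
            self.in_price = True
            self.prices.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_price = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the open cell.
        if self.in_price:
            self.prices[-1] += data


def extract_prices(html):
    """Feed the markup to the parser and return the collected cell texts."""
    collector = PriceCollector()
    collector.feed(html)
    collector.close()
    return collector.prices


if __name__ == "__main__":
    print("regex: ", NAIVE.findall(SAMPLE))
    print("parser:", extract_prices(SAMPLE))
```

On this input the naive pattern only finds the first cell, while the parser
finds all four. You could keep patching the pattern for each variation, but
the parser already deals with case, whitespace and quoting before your code
sees the tag.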


BTW, BeautifulSoup's parser also uses regexes, so if the OP used it, he/she
could claim to have solved the problem "with regular expressions" without
even lying.

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list
