Paul McGuire wrote:
> On Oct 2, 12:10 am, "504cr...@gmail.com" <504cr...@gmail.com> wrote:
>> I'm kind of new to regular expressions, and I've spent hours trying to
>> finesse a regular expression to build a substitution.
>>
>> What I'd like to do is extract data elements from HTML and structure
>> them so that they can more readily be imported into a database.
>
> Oy! If I had a nickel for every misguided coder who tried to scrape
> HTML with regexes...
>
> Some reasons why RE's are no good at parsing HTML:
> - tags can be mixed case
> - tags can have whitespace in many unexpected places
> - tags with no body can combine opening and closing tag with a '/'
>   before the closing '>', as in "<BR/>"
> - tags can have attributes that you did not expect (like
>   "<BR CLEAR=ALL>")
> - attributes can occur in any order within the tag
> - attribute names can also be in unexpected upper/lower case
> - attribute values can be enclosed in double quotes, single quotes, or
>   even (surprise!) NO quotes
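
To make the quoted list concrete, here is a minimal sketch using only the standard library. The sample string and the names NAIVE_BR, TagCollector and collect_tags are made up for illustration, and the naive pattern is deliberately simplistic; a regex can be made more tolerant, but every variation in the list has to be handled by hand, whereas a parser normalizes them for you.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: one <br> tag written five different ways,
# covering several of the variations listed above.
SAMPLE = '<BR/><br ><Br CLEAR=ALL><br clear="all"><br clear=\'all\'>'

# A deliberately naive pattern that only knows the exact form "<br>".
NAIVE_BR = re.compile(r"<br>")


class TagCollector(HTMLParser):
    """Records every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and strips
        # the quotes around attribute values before calling this.
        self.tags.append((tag, dict(attrs)))


def collect_tags(html):
    """Parse html and return a list of (tag, attributes) pairs."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


naive_hits = NAIVE_BR.findall(SAMPLE)
parsed = collect_tags(SAMPLE)

print("naive regex matches:", len(naive_hits))
for tag, attrs in parsed:
    print(tag, attrs)
```

The self-closing form "<BR/>" is also picked up, because HTMLParser's default handling of such tags calls handle_starttag.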

BTW, BeautifulSoup's parser also uses regexes, so if the OP used it,
he/she could claim to have solved the problem "with regular expressions"
without even lying.

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list