On Mar 24, 6:32 pm, Tess <[EMAIL PROTECTED]> wrote:
> Hello All,
>
> I have a Beautiful Soup question and I'd appreciate any guidance the
> forum can provide.
>

I *know* you're using Beautiful Soup, and I *know* that BS is the de
facto HTML parser/processor library.  Buuuuuut, I just couldn't help
myself in trying a pyparsing scanning approach to your problem.  See
the program below for a pyparsing treatment of your question.

-- Paul


"""
My goal is to extract all elements where the following is true:
<p align="left"> and <div align="center">.
"""
from pyparsing import makeHTMLTags, withAttribute, keepOriginalText, SkipTo

# html is assumed to hold the HTML source from the original question

p,pEnd = makeHTMLTags("P")
p.setParseAction( withAttribute(align="left") )
div,divEnd = makeHTMLTags("DIV")
div.setParseAction( withAttribute(align="center") )

# basic scanner for matching either <p> or <div> with desired attrib value
patt = ( p + SkipTo(pEnd) + pEnd ) | ( div + SkipTo(divEnd) + divEnd )
patt.setParseAction( keepOriginalText )

print "\nBasic scanning"
for match in patt.searchString(html):
    print match[0]

# simplified data access, by adding some results names
patt = ( p + SkipTo(pEnd)("body") + pEnd )("P") | \
       ( div + SkipTo(divEnd)("body") + divEnd )("DIV")
patt.setParseAction( keepOriginalText )

print "\nSimplified field access using results names"
for match in patt.searchString(html):
    if match.P:
        print "P -", match.body
    if match.DIV:
        print "DIV -", match.body

Prints:

Basic scanning
<p align="left">P1</p>
<div align="center">div2a</div>
<div align="center">div2b</div>
<p align="left">P3</p>
<div align="center">div3b</div>
<p align="left">P4</p>
<div align="center">div4b</div>

Simplified field access using results names
P - P1
DIV - div2a
DIV - div2b
P - P3
DIV - div3b
P - P4
DIV - div4b
--
http://mail.python.org/mailman/listinfo/python-list
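
For anyone who wants to try the same scan without installing anything, here is a rough sketch of the idea using only the standard library's html.parser (Python 3). This is not the pyparsing program above, just a swapped-in technique aimed at the same goal: keep <p align="left"> and <div align="center"> elements. SAMPLE_HTML, WANTED, AlignScanner and scan are names made up for this illustration, and the sample input is hypothetical, not the HTML from the original question.

```python
from html.parser import HTMLParser

# Hypothetical sample input (the HTML from the original question is not shown here).
SAMPLE_HTML = """
<p align="left">P1</p>
<p align="right">P2</p>
<div align="center">div2a</div>
<div align="left">div2c</div>
"""

# (tag, align value) pairs to keep, matching the goal stated in the post.
WANTED = {("p", "left"), ("div", "center")}


class AlignScanner(HTMLParser):
    """Collect the text of <p align="left"> and <div align="center"> elements."""

    def __init__(self):
        super().__init__()
        self.matches = []       # finished (tag, text) pairs
        self._open_tag = None   # tag currently being captured, if any
        self._text = []         # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us lowercased tag and attribute names.
        align = dict(attrs).get("align")
        if self._open_tag is None and (tag, align) in WANTED:
            self._open_tag = tag
            self._text = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Stop at the first closing tag with the same name as the one we opened.
        if tag == self._open_tag:
            self.matches.append((tag, "".join(self._text)))
            self._open_tag = None


def scan(html):
    """Return a list of (tag, text) pairs for the wanted elements in html."""
    scanner = AlignScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.matches


for tag, text in scan(SAMPLE_HTML):
    print(tag.upper(), "-", text)
```

Like the SkipTo-based patterns, this stops at the first matching closing tag, so a wanted <div> that contains another <div> will be cut short. It also compares the align value exactly, so align="LEFT" would not match.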