Lad -

Well, here's what I've got so far. I'll leave the extraction of the description to you as an exercise, but as a clue, it looks like it is delimited by "<b>View Detail</b></a></td></tr></tbody></table> <br>" at the beginning, and "Quantity: 500<br>" at the end, where 500 could be any number. This program will print out:
['Title:', 'Sell 2.4GHz Wireless Mini Color Camera With Audio Function Manufacturers Hong Kong - Exporters, Suppliers, Factories, Seller']
['Contact:', 'Mr. Simon Cheung']
['Company:', 'Lanjin Electronics Co., Ltd.']
['Address:', 'Rm 602, 6/F., Tung Ning Bldg., 2 Hillier Street, Sheung Wan , Hong Kong\n , HK\n ( Hong Kong )']
['Phone:', '852 35763877']
['Fax:', '852 31056238']
['Mobile:', '852-96439737']

So I think pyparsing will get you pretty far along the way. Code attached below.

-- Paul

===================================
from pyparsing import *
import urllib

# get input data
url = "http://www.ourglobalmarket.com/Test.htm"
page = urllib.urlopen( url )
pageHTML = page.read()
page.close()

#~ I would like to extract the title ( it is below Lanjin Electronics
#~ Co., Ltd. )
#~ (Sell 2.4GHz Wireless Mini Color Camera With Audio Function )
#~ description - below the title next to the picture
#~ Contact person
#~ Company name
#~ Address
#~ fax
#~ phone
#~ Website Address

LANGBRK = Literal("<")
RANGBRK = Literal(">")
SLASH = Literal("/")
tagAttr = Word(alphanums) + "=" + dblQuotedString

# helpers for defining HTML tag expressions
def startTag( tagname ):
    return ( LANGBRK + CaselessLiteral(tagname) + \
               ZeroOrMore(tagAttr) + RANGBRK ).suppress()

def endTag( tagname ):
    return ( LANGBRK + SLASH + CaselessLiteral(tagname) + RANGBRK ).suppress()

def makeHTMLtags( tagname ):
    return startTag(tagname), endTag(tagname)

def strong( expr ):
    return strongStartTag + expr + strongEndTag

strongStartTag, strongEndTag = makeHTMLtags("strong")
titleStart, titleEnd = makeHTMLtags("title")
tdStart, tdEnd = makeHTMLtags("td")
h1Start, h1End = makeHTMLtags("h1")

title = titleStart + SkipTo( titleEnd ).setResultsName("title") + titleEnd
contactPerson = tdStart + h1Start + \
               SkipTo( h1End ).setResultsName("contact")
company = ( tdStart + strong("Company:") + tdEnd + tdStart ) + \
               SkipTo( tdEnd ).setResultsName("company")
address = ( tdStart + strong("Address:") + tdEnd + tdStart ) + \
               SkipTo( tdEnd ).setResultsName("address")
phoneNum = ( tdStart + strong("Phone:") + tdEnd + tdStart ) + \
               SkipTo( tdEnd ).setResultsName("phoneNum")
faxNum = ( tdStart + strong("Fax:") + tdEnd + tdStart ) + \
               SkipTo( tdEnd ).setResultsName("faxNum")
mobileNum = ( tdStart + strong("Mobile:") + tdEnd + tdStart ) + \
               SkipTo( tdEnd ).setResultsName("mobileNum")
webSite = ( tdStart + strong("Website Address:") + tdEnd + tdStart ) + \
               SkipTo( tdEnd ).setResultsName("webSite")

scrapes = title | contactPerson | company | address | \
               phoneNum | faxNum | mobileNum | webSite

# use parse actions to remove hyperlinks
linkStart, linkEnd = makeHTMLtags("a")
linkExpr = linkStart + SkipTo( linkEnd ) + linkEnd

def stripHyperLink(s,l,t):
    return [ t[0], linkExpr.transformString( t[1] ) ]

company.setParseAction( stripHyperLink )

# use parse actions to add labels for data elements that don't
# have labels in the HTML
def prependLabel(pre):
    def prependAction(s,l,t):
        return [pre] + t[:]
    return prependAction

title.setParseAction( prependLabel("Title:") )
contactPerson.setParseAction( prependLabel("Contact:") )

for tokens,start,end in scrapes.scanString( pageHTML ):
    print tokens
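
As a starting point for the description exercise, here is a rough sketch of the clue given at the top of this post. It uses Python's standard re module rather than pyparsing (a swapped-in technique, chosen only so the snippet runs without any extra installs). The name extract_description and the sample HTML are made up for illustration, and the sketch assumes the two delimiters appear in the page exactly as quoted, which may not hold for every listing.

```python
import re

# Delimiters taken from the clue in the post; whether they hold for every
# page on the site is an assumption.
DESC_START = "<b>View Detail</b></a></td></tr></tbody></table> <br>"
DESC_END = r"Quantity:\s*\d+\s*<br>"   # the "500" could be any number

# Escape the literal start marker, then lazily capture everything up to
# the first "Quantity: N<br>" marker. DOTALL lets "." span newlines.
DESC_RE = re.compile(re.escape(DESC_START) + r"(.*?)" + DESC_END,
                     re.DOTALL | re.IGNORECASE)

def extract_description(page_html):
    """Return the text between the two markers, or None if not found."""
    match = DESC_RE.search(page_html)
    if match is None:
        return None
    return match.group(1).strip()

# Made-up sample standing in for the real page (hypothetical content).
sample = ("<table><tbody><tr><td><a href='#'><b>View Detail</b></a></td></tr>"
          "</tbody></table> <br>Small wireless camera\nwith audio. "
          "Quantity: 500<br>Other stuff")
print(extract_description(sample))
```

The intended use would be extract_description(pageHTML) on the page fetched by the program above. The same idea could probably be written in pyparsing as a Literal for the start marker followed by SkipTo on an expression for the end marker.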