On Mar 4, 11:42 am, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote: > I understand that the web is full of ill-formed XHTML web pages but > this is Microsoft: > > http://moneycentral.msn.com/companyreport?Symbol=BBBY > > I can't validate it and xml.minidom.dom.parseString won't work on it. > > If this was just some teenager's web site I'd move on. Is there any > hope avoiding regular expression hacks to extract the data from this > page? > > Chris
How about a pyparsing hack instead? With English-readable expression names and a few comments, I think this is fairly easy to follow. Also note the sample statement at then end showing how to use the results names to access the individual data fields (much easier than indexing into a 20-element list!). (You should also verify you are not running afoul of any terms of service related to the content of this page.) -- Paul ======================= from pyparsing import * import urllib # define matching elements integer = Word(nums).setParseAction(lambda t:int(t[0])) real = Combine(Word(nums) + Word(".",nums)).setParseAction(lambda t:float(t[0])) pct = real + Suppress("%") date = Combine(Word(nums) + '/' + Word(nums)) tdStart,tdEnd = map(Suppress,makeHTMLTags("td")) dollarUnits = oneOf("Mil Bil") # stats are one of two patterns - single value or double value stat, wrapped in HTML <td> tags # also, attach parse action to make sure each matches only once def statPattern(name,label,statExpr=real): if (isinstance(statExpr,And)): statExpr.exprs[0] = statExpr.exprs[0].setResultsName(name) else: statExpr = statExpr.setResultsName(name) expr = tdStart + Suppress(label) + tdEnd + tdStart + statExpr + tdEnd return expr.setParseAction(OnlyOnce(lambda t:None)) def bistatPattern(name,label,statExpr1=real,statExpr2=real): expr = (tdStart + Suppress(label) + tdEnd + tdStart + statExpr1 + tdEnd + tdStart + statExpr2 + tdEnd).setResultsName(name) return expr.setParseAction(OnlyOnce(lambda t:None)) stats = [ statPattern("last","Last Price"), statPattern("hi","52 Week High"), statPattern("lo","52 Week Low"), statPattern("vol","Volume", real + Suppress(dollarUnits)), statPattern("aveDailyVol_13wk","Average Daily Volume (13wk)", real + Suppress(dollarUnits)), statPattern("movingAve_50day","50 Day Moving Average"), statPattern("movingAve_200day","200 Day Moving Average"), statPattern("volatility","Volatility (beta)"), bistatPattern("relStrength_last3","Last 3 Months", pct, integer), bistatPattern("relStrength_last6","Last 6 Months", pct, integer), bistatPattern("relStrength_last12","Last 12 Months", pct, integer), bistatPattern("sales","Sales", real+Suppress(dollarUnits), pct), bistatPattern("income","Income", real+Suppress(dollarUnits), pct), bistatPattern("divRate","Dividend Rate", real, pct | "NA"), bistatPattern("divYield","Dividend Yield", pct, pct), statPattern("curQtrEPSest","Qtr("+date+") EPS Estimate"), statPattern("curFyEPSest","FY("+date+") EPS Estimate"), statPattern("curPE","Current P/E"), statPattern("fwdEPSest","FY("+date+") EPS Estimate"), statPattern("fwdPE","Forward P/E"), ] # create overall search pattern - things move faster if we verify that we are positioned # at a <td> tag before going through the MatchFirst group statSearchPattern = FollowedBy(tdStart) + MatchFirst(stats) # SETUP IS DONE - now get the HTML source # read in web page pg = urllib.urlopen("http://moneycentral.msn.com/companyreport? Symbol=BBBY") stockHTML = pg.read() pg.close() # extract and merge statistics ticker = sum( statSearchPattern.searchString(stockHTML),ParseResults([]) ) # print them out print ticker.dump() print ticker.last, ticker.hi,ticker.lo,ticker.vol,ticker.volatility ----------------------- prints: [39.549999999999997, 43.32, 30.920000000000002, 2.3599999999999999, 2.7400000000000002, 40.920000000000002, 37.659999999999997, 0.72999999999999998, 1.5, 55, 15.5, 69, 9.8000000000000007, 62, 6.2999999999999998, 19.399999999999999, 586.29999999999995, 27.199999999999999, 0.0, 'NA', 0.0, 0.0, 0.78000000000000003, 2.1499999999999999, 19.399999999999999, 2.3900000000000001, 18.399999999999999] - aveDailyVol_13wk: 2.74 - curFyEPSest: 2.15 - curPE: 19.4 - curQtrEPSest: 0.78 - divRate: [0.0, 'NA'] - divYield: [0.0, 0.0] - fwdEPSest: 2.39 - fwdPE: 18.4 - hi: 43.32 - income: [586.29999999999995, 27.199999999999999] - last: 39.55 - lo: 30.92 - movingAve_200day: 37.66 - movingAve_50day: 40.92 - relStrength_last12: [9.8000000000000007, 62] - relStrength_last3: [1.5, 55] - relStrength_last6: [15.5, 69] - sales: [6.2999999999999998, 19.399999999999999] - vol: 2.36 - volatility: 0.73 39.55 43.32 30.92 2.36 0.73 -- http://mail.python.org/mailman/listinfo/python-list