"Ankit" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > Frederic, > > Thanks for posting the solution. I used the original solution you > posted and it worked beautifully. > > Paul, > > I understand your concern for the site's TOS. Although, this may not > mean anything, the reason I wanted this "parser" was because I wanted > to get the Advanced, and Translated Stats for personal use. I don't > have any commercial motives but play with baseball stats is my hobby. > The site does allow one to download stuff for personal use, which I > abide by. Also, I am only looking to get the aforementioned stats for > some players. The site has player pages for over 16,000 players. I > think it would be unfair to the site owners if I went to download all > 16,000 players using the script. In the end, they might just move the > stats in to their premium package (not free) and then I would be really > screwed. > > So, I understand your concerns and thank you for posting them. > > Ankit > Frederic and Ankit -

I guess you may have caught me in a more-than-curmudgeon-ly mood. Thanks
for giving me the benefit of the doubt. I guess I should put more faith
in our "consenting adults" environment - if someone wants to use posted
code to create a bot or virus or TOS-violating web page scraper, that is
their business, not mine. I've noticed that the esteemed C. Titus Brown,
in his twill intro, gives an example violating Google's TOS, but at least
he gives a suitable admonition in the code to the effect of "this is just
an example, but don't do it."

So in that spirit, for EDUCATION AND PERSONAL USE PURPOSES ONLY, here is
a pyparsing rendition that processes the HTML of the previously cited web
site. Ankit, you already know the suitable URLs to use for this, so I
don't need to post them again (in a weak attempt to shield that web site
from casual slamming).

At first glance, this is *way* more complicated than Frederic's SE-based
solution. The catch is that the pattern we are keying off of has a lot of
HTML junk in it. Frederic just dumps it on the floor, and really this
program doesn't do much more with it. Note that we suppress almost all of
the parsed HTML tags, which is just pyparsing's way of saying "don't need
this...", but the tag expressions still need to be included in the
pattern we are scanning for.

There are a few beyond-beginner pyparsing techniques in this example:

- Using a parse action to reject text that matches syntax, but not
  semantics. In this case, we reject <h3> tags that don't have the right
  section name. From a parsing standpoint, all <h3> tags match the
  h3Start expression, so we attach a parse action to perform the
  additional filtering.

- Using Dict is always kind of magic. At parse time, the Dict class
  instructs the parser to build a dict-style result, using the first
  token in each matched group as the key and the remainder as the value.
  This gives us a keyed lookup by age to the yearly stats values.

- We have to stop reading stats at the line break, so we first check that
  we are not at the end-of-line before accepting the next number. That is
  why the expression reads "OneOrMore(~lineEnd + number)" to parse the
  actual statistics values.

Once the parsing is done, I go through a little extra work showing
different ways to get at the parsed results. pyparsing does much more
than just return nested lists of strings. In this case, we are
associating field names with some content, and also dynamically
generating dict-style access to statistics by age. Finally, there is also
the output to CSV format, which was the original intent.

I think that as HTML-scraping apps go, this is fairly typical for a
pyparsing approach. The feedback I get is that people take an hour or two
getting their programs just the way they want them, but then the
resulting code is pretty robust over time, as minor page changes or
enhancements require simple updates to the scraper, if any. For instance,
if new stat columns were added to this page, there would be *no* change
to the parser.

Anyway, here is the pyparsing datapoint for your comparison.

-- Paul
(... and what was Babe Ruth doing between the ages of 26 and 35? Did he
retire for 9 years and then come back?)


from pyparsing import *
import urllib

playerURL = "http://rest_of_URL_goes_here"

# define start/end HTML tags for key items
# makeHTMLTags takes care of unexpected attributes, whitespace, case, etc.
h3Start,h3End = makeHTMLTags("h3")
aStart,aEnd = makeHTMLTags("a")
preStart,preEnd = makeHTMLTags("pre")
aStart = aStart.suppress()
aEnd = aEnd.suppress()
preStart = preStart.suppress()
preEnd = preEnd.suppress()

# spell out some of the specific HTML patterns we are looking for
sectionStart = (h3Start + aStart + SkipTo(aEnd).setResultsName("section") + aEnd + h3End ) | \
               (h3Start + SkipTo(h3End).setResultsName("section") + h3End )
sectionHeading = OneOrMore(aStart + SkipTo(aEnd) + aEnd).setResultsName("statsNames")
sectionHeading2 = OneOrMore(~lineEnd + Word(alphanums.upper()+"/")).setResultsName("statsNames")
integer = Combine(Optional("-") + Word(nums))
real = Combine(Optional("-") + Optional(Word(nums)) + "." + Word(nums))
number = real | integer
teamName = Word(alphas.upper() + "_-")

# create parse action that will filter for sections of a particular name
wrongSectionName = ParseException("",0,"")
def onlyAcceptSectionNamed(sec):
    def parseAction(tokens):
        if tokens.section != sec:
            raise wrongSectionName
    return parseAction

import pprint

def getStatistics(url):
    htm_page = urllib.urlopen(url)
    htm_lines = htm_page.read()
    htm_page.close ()

    actualPitchingStats = \
        sectionStart.copy().setParseAction(onlyAcceptSectionNamed("Actual Pitching Statistics ")) + \
        preStart + \
        sectionHeading + \
        Dict( OneOrMore( Group(integer + aStart.suppress() + integer + teamName + aEnd.suppress() + \
                               OneOrMore(~lineEnd + number).setResultsName("stats") ) )).setResultsName("statsByAge") + \
        Group( OneOrMore(number) ).setResultsName("careerStats") + preEnd
    aps = actualPitchingStats.searchString(htm_lines)[0]

    translatedPitchingStats = \
        sectionStart.copy().setParseAction(onlyAcceptSectionNamed("Translated Pitching Statistics")) + \
        preStart + lineEnd + \
        sectionHeading2 + \
        Dict( OneOrMore( Group(integer + aStart.suppress() + integer + teamName + aEnd.suppress() + \
                               OneOrMore(~lineEnd + number).setResultsName("stats") ) )).setResultsName("statsByAge") + \
        Suppress("Career") + Group( OneOrMore(number) ).setResultsName("careerStats") + preEnd
    tps = translatedPitchingStats.searchString(htm_lines)[0]

    # examples of accessing data fields in returned parse results
    for res in (aps,tps):
        print res.section
        print '-'*len(res.section.rstrip())
        for k in res.keys():
            print "- %s: %s" % (k,res[k])

        # career stats don't have age, year, or team name, so skip over those stats names
        pprint.pprint( zip(res.statsNames[3:],res.careerStats) )
        print

        # print stats for year at age 24
        # by-age stats don't include age, so skip over first stats name
        pprint.pprint( zip(res.statsNames[1:],res.statsByAge["24"]) )
        print

        # output CSV-style data, for each year and then for career
        for yearlyStats in res.statsByAge:
            print ", ".join(yearlyStats)
        print " , , ,",", ".join(res.careerStats)
        print

getStatistics(playerURL)


Gives this output:

Actual Pitching Statistics
--------------------------
- endH3: </h3>
- statsByAge: [['19', '1914', 'BOS-A', '2', '1', '0', '3.91', '4', '3', '96', '23.0', '21', '12', '10', '1', '7', '3', '0', '0', '0', '0', '1', '0'],
 ['20', '1915', 'BOS-A', '18', '8', '0', '2.44', '32', '28', '874', '217.7', '166', '80', '59', '3', '85', '112', '6', '0', '9', '1', '16', '1'],
 ['21', '1916', 'BOS-A', '23', '12', '1', '1.75', '44', '41', '1272', '323.7', '230', '83', '63', '0', '118', '170', '8', '0', '3', '1', '23', '9'],
 ['22', '1917', 'BOS-A', '24', '13', '2', '2.01', '41', '38', '1277', '326.3', '244', '93', '73', '2', '108', '128', '11', '0', '5', '0', '35', '6'],
 ['23', '1918', 'BOS-A', '13', '7', '0', '2.22', '20', '19', '660', '166.3', '125', '51', '41', '1', '49', '40', '2', '0', '3', '1', '18', '1'],
 ['24', '1919', 'BOS-A', '9', '5', '1', '2.97', '17', '15', '570', '133.3', '148', '59', '44', '2', '58', '30', '2', '0', '5', '1', '12', '0'],
 ['25', '1920', 'NY_-A', '1', '0', '0', '4.50', '1', '1', '17', '4.0', '3', '4', '2', '0', '2', '0', '0', '0', '0', '0', '0', '0'],
 ['26', '1921', 'NY_-A', '2', '0', '0', '9.00', '2', '1', '49', '9.0', '14', '10', '9',
 '1', '9', '2', '0', '0', '0', '0', '0', '0'],
 ['35', '1930', 'NY_-A', '1', '0', '0', '3.00', '1', '1', '39', '9.0', '11', '3', '3', '0', '2', '3', '0', '0', '0', '0', '1', '0'],
 ['38', '1933', 'NY_-A', '1', '0', '0', '5.00', '1', '1', '42', '9.0', '12', '5', '5', '0', '3', '0', '0', '0', '0', '0', '1', '0']]
- startH3: ['h3', ['class', 'cardsect'], False]
- section: Actual Pitching Statistics
- statsNames: ['AGE', 'YEAR', 'TEAM', 'W', 'L', 'SV', 'ERA', 'G', 'GS', 'TBF', 'IP', 'H', 'R', 'ER', 'HR', 'BB', 'SO', 'HBP', 'IBB', 'WP', 'BK', 'CG', 'SHO']
- careerStats: ['94', '46', '4', '2.28', '163', '148', '4896', '1221.3', '974', '400', '309', '10', '441', '488', '29', '0', '25', '4', '107', '17']
- class: cardsect
- empty: False
[('W', '94'), ('L', '46'), ('SV', '4'), ('ERA', '2.28'), ('G', '163'), ('GS', '148'), ('TBF', '4896'), ('IP', '1221.3'), ('H', '974'), ('R', '400'), ('ER', '309'), ('HR', '10'), ('BB', '441'), ('SO', '488'), ('HBP', '29'), ('IBB', '0'), ('WP', '25'), ('BK', '4'), ('CG', '107'), ('SHO', '17')]

[('YEAR', '1919'), ('TEAM', 'BOS-A'), ('W', '9'), ('L', '5'), ('SV', '1'), ('ERA', '2.97'), ('G', '17'), ('GS', '15'), ('TBF', '570'), ('IP', '133.3'), ('H', '148'), ('R', '59'), ('ER', '44'), ('HR', '2'), ('BB', '58'), ('SO', '30'), ('HBP', '2'), ('IBB', '0'), ('WP', '5'), ('BK', '1'), ('CG', '12'), ('SHO', '0')]

19, 1914, BOS-A, 2, 1, 0, 3.91, 4, 3, 96, 23.0, 21, 12, 10, 1, 7, 3, 0, 0, 0, 0, 1, 0
20, 1915, BOS-A, 18, 8, 0, 2.44, 32, 28, 874, 217.7, 166, 80, 59, 3, 85, 112, 6, 0, 9, 1, 16, 1
21, 1916, BOS-A, 23, 12, 1, 1.75, 44, 41, 1272, 323.7, 230, 83, 63, 0, 118, 170, 8, 0, 3, 1, 23, 9
22, 1917, BOS-A, 24, 13, 2, 2.01, 41, 38, 1277, 326.3, 244, 93, 73, 2, 108, 128, 11, 0, 5, 0, 35, 6
23, 1918, BOS-A, 13, 7, 0, 2.22, 20, 19, 660, 166.3, 125, 51, 41, 1, 49, 40, 2, 0, 3, 1, 18, 1
24, 1919, BOS-A, 9, 5, 1, 2.97, 17, 15, 570, 133.3, 148, 59, 44, 2, 58, 30, 2, 0, 5, 1, 12, 0
25, 1920, NY_-A, 1, 0, 0, 4.50, 1, 1, 17, 4.0, 3, 4, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0
26, 1921, NY_-A, 2, 0, 0, 9.00, 2, 1, 49, 9.0, 14, 10, 9, 1, 9, 2, 0, 0, 0, 0, 0, 0
35, 1930, NY_-A, 1, 0, 0, 3.00, 1, 1, 39, 9.0, 11, 3, 3, 0, 2, 3, 0, 0, 0, 0, 1, 0
38, 1933, NY_-A, 1, 0, 0, 5.00, 1, 1, 42, 9.0, 12, 5, 5, 0, 3, 0, 0, 0, 0, 0, 1, 0
 , , , 94, 46, 4, 2.28, 163, 148, 4896, 1221.3, 974, 400, 309, 10, 441, 488, 29, 0, 25, 4, 107, 17

Translated Pitching Statistics
------------------------------
- endH3: </h3>
- statsByAge: [['19', '1914', 'BOS-A', '20.0', '19', '15', '5', '6', '0', '4', '6.75', '1', '1', '0', '8.6', '2.2', '2.7', '1.8'],
 ['20', '1915', 'BOS-A', '191.3', '163', '87', '24', '74', '6', '134', '4.09', '13', '9', '0', '7.7', '1.1', '3.5', '6.3'],
 ['21', '1916', 'BOS-A', '274.0', '212', '82', '21', '101', '9', '212', '2.69', '22', '8', '1', '7.0', '.7', '3.3', '7.0'],
 ['22', '1917', 'BOS-A', '277.3', '239', '107', '29', '98', '13', '178', '3.47', '20', '11', '2', '7.8', '.9', '3.2', '5.8'],
 ['23', '1918', 'BOS-A', '149.0', '128', '69', '19', '51', '3', '65', '4.17', '9', '8', '0', '7.7', '1.1', '3.1', '3.9'],
 ['24', '1919', 'BOS-A', '123.3', '147', '65', '14', '59', '3', '47', '4.74', '7', '6', '1', '10.7', '1.0', '4.3', '3.4'],
 ['25', '1920', 'NY_-A', '3.3', '3', '4', '0', '2', '0', '0', '10.80', '0', '1', '0', '8.1', '.0', '5.4', '.0'],
 ['26', '1921', 'NY_-A', '7.7', '10', '9', '2', '9', '0', '3', '10.57', '0', '1', '0', '11.7', '2.3', '10.6', '3.5'],
 ['35', '1930', 'NY_-A', '8.7', '11', '3', '0', '2', '0', '4', '3.12', '1', '0', '0', '11.4', '.0', '2.1', '4.2'],
 ['38', '1933', 'NY_-A', '8.7', '15', '6', '0', '3', '0', '1', '6.23', '0', '1', '0', '15.6', '.0', '3.1', '1.0']]
- startH3: ['h3', ['class', 'cardsect'], False]
- section: Translated Pitching Statistics
- statsNames: ['AGE', 'YEAR', 'TEAM', 'IP', 'H', 'ER', 'HR', 'BB', 'HBP', 'SO', 'ERA', 'W', 'L', 'SV', 'H/9', 'HR/9', 'BB/9', 'SO/9']
- careerStats: ['1063.3', '947', '447', '114', '405', '34', '648', '3.78', '73', '46', '6', '8.0', '1.0', '3.4', '5.5']
- class: cardsect
- empty: False
[('IP', '1063.3'), ('H', '947'), ('ER', '447'), ('HR', '114'), ('BB', '405'), ('HBP', '34'), ('SO', '648'), ('ERA', '3.78'), ('W', '73'), ('L', '46'), ('SV', '6'), ('H/9', '8.0'), ('HR/9', '1.0'), ('BB/9', '3.4'), ('SO/9', '5.5')]

[('YEAR', '1919'), ('TEAM', 'BOS-A'), ('IP', '123.3'), ('H', '147'), ('ER', '65'), ('HR', '14'), ('BB', '59'), ('HBP', '3'), ('SO', '47'), ('ERA', '4.74'), ('W', '7'), ('L', '6'), ('SV', '1'), ('H/9', '10.7'), ('HR/9', '1.0'), ('BB/9', '4.3'), ('SO/9', '3.4')]

19, 1914, BOS-A, 20.0, 19, 15, 5, 6, 0, 4, 6.75, 1, 1, 0, 8.6, 2.2, 2.7, 1.8
20, 1915, BOS-A, 191.3, 163, 87, 24, 74, 6, 134, 4.09, 13, 9, 0, 7.7, 1.1, 3.5, 6.3
21, 1916, BOS-A, 274.0, 212, 82, 21, 101, 9, 212, 2.69, 22, 8, 1, 7.0, .7, 3.3, 7.0
22, 1917, BOS-A, 277.3, 239, 107, 29, 98, 13, 178, 3.47, 20, 11, 2, 7.8, .9, 3.2, 5.8
23, 1918, BOS-A, 149.0, 128, 69, 19, 51, 3, 65, 4.17, 9, 8, 0, 7.7, 1.1, 3.1, 3.9
24, 1919, BOS-A, 123.3, 147, 65, 14, 59, 3, 47, 4.74, 7, 6, 1, 10.7, 1.0, 4.3, 3.4
25, 1920, NY_-A, 3.3, 3, 4, 0, 2, 0, 0, 10.80, 0, 1, 0, 8.1, .0, 5.4, .0
26, 1921, NY_-A, 7.7, 10, 9, 2, 9, 0, 3, 10.57, 0, 1, 0, 11.7, 2.3, 10.6, 3.5
35, 1930, NY_-A, 8.7, 11, 3, 0, 2, 0, 4, 3.12, 1, 0, 0, 11.4, .0, 2.1, 4.2
38, 1933, NY_-A, 8.7, 15, 6, 0, 3, 0, 1, 6.23, 0, 1, 0, 15.6, .0, 3.1, 1.0
 , , , 1063.3, 947, 447, 114, 405, 34, 648, 3.78, 73, 46, 6, 8.0, 1.0, 3.4, 5.5
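
For anyone who wants to see what the Dict-style keyed lookup and the CSV
step amount to without installing pyparsing, here is a minimal
plain-Python sketch. It is not the pyparsing program from this post, and
it assumes Python 3 rather than the Python 2 used there. STATS_NAMES and
SAMPLE_ROWS are copied from the Actual Pitching Statistics output (first
seven columns, ages 23 and 24 only); the helper names stats_by_age and
to_csv are made up for illustration.

```python
import csv
import io

# Column names and two rows taken from the posted output, trimmed to the
# first seven columns to keep the sketch short.
STATS_NAMES = ["AGE", "YEAR", "TEAM", "W", "L", "SV", "ERA"]
SAMPLE_ROWS = """\
23 1918 BOS-A 13 7 0 2.22
24 1919 BOS-A 9 5 1 2.97
"""


def stats_by_age(text):
    """Hypothetical helper: key each row by its first token (the age).

    This mimics the idea behind Dict(OneOrMore(Group(...))): the first
    token of each group is the key, the remaining tokens are the value.
    Splitting on line breaks plays the role of the ~lineEnd check.
    """
    table = {}
    for line in text.splitlines():
        tokens = line.split()
        if tokens:
            table[tokens[0]] = tokens[1:]
    return table


def to_csv(table):
    """Hypothetical helper: write one CSV row per age, key first."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for age, rest in table.items():
        writer.writerow([age] + rest)
    return buf.getvalue()


by_age = stats_by_age(SAMPLE_ROWS)

# By-age values do not include the age itself, so skip the first name.
season_24 = list(zip(STATS_NAMES[1:], by_age["24"]))
print(season_24)
print(to_csv(by_age), end="")
```

The csv module is used here instead of ", ".join(...) only because it
handles quoting if a field ever contains a comma; for this data the two
approaches differ only in the space after each comma.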