Frederic,

Thanks for posting the solution. I used the original solution you posted and it worked beautifully.
Paul,

I understand your concern for the site's TOS. Although this may not mean anything, the reason I wanted this "parser" is that I wanted to get the Advanced and Translated Stats for personal use. I don't have any commercial motives; playing with baseball stats is my hobby. The site does allow one to download material for personal use, which I abide by. Also, I am only looking to get the aforementioned stats for some players. The site has player pages for over 16,000 players, and I think it would be unfair to the site owners if I downloaded all 16,000 of them with the script. In the end, they might just move the stats into their premium (not free) package, and then I would be really screwed. So, I understand your concerns and thank you for posting them.

Ankit

Anthra Norell wrote:

> ----- Original Message -----
> From: "Paul McGuire" <[EMAIL PROTECTED]>
> Newsgroups: comp.lang.python
> To: <python-list@python.org>
> Sent: Wednesday, July 26, 2006 1:01 AM
> Subject: Re: Parsing Baseball Stats
>
>
> > "Anthra Norell" <[EMAIL PROTECTED]> wrote in message
> > news:[EMAIL PROTECTED]
> >
> > snip
> >
> > Frederic -
> >
> > HTML parsing is one of those slippery slopes - or perhaps "tar babies"
> > might be a better metaphor - that starts out as a simple problem, but then
> > one exception after the next drags the solution out for daaaays. Probably
> > once or twice a week, there is a posting here from someone trying to
> > extract data from a website, usually something like trying to pull the
> > href's out of some
> >
> > snip
> >
> > So what started out as a little joke (microscopic, even) has eventually
> > touched a nerve, so thanks and apologies to those who have read this whole
> > mess. Frederic, SE looks like a killer - may it become the next regexp!
> >
> > -- Paul
> >
>
> Paul,
>
> A year ago or so someone posted a call for ideas on encoding passwords for
> his own private use. I suggested a solution using python's random number
> generator and was immediately reminded by several knowledgeable people,
> quite sharply by some, that the random number generator was not to be used
> for cryptographic applications, since the doc specifically said so. I was
> also given good advice on what to read.
>     I thought that my solution was good, if not by the catechism, then by
> the requirements of the OP's problem, which I considered to be the issue. I
> hoped the OP would come back with his opinion, but he didn't. Not then and
> there. He did some time later, off list, telling me privately that he had
> incorporated my solution with some adaptations and that it was exactly what
> he had been looking for.
>
> So let me pursue this on two lines: A) your response and B) the issue.
>
> A) I thank you for the considerable time you must have taken to explain
> pyparsing in such detail. I didn't know you're the author. Congratulations!
> It certainly looks very professional. I have no doubt that it is an
> excellent and powerful tool.
>     Thanks also for your explanation of the TOS concept. It isn't alien to
> me and I have no problem with it. But I don't believe it means that one
> should voluntarily argue against one's own freedom, barking at oneself with
> the voice of the legal watchdogs out there that would restrict our freedom
> preemptively, getting a tug on the leash for excessive zeal but a pat on
> the head nonetheless.
>     We have little cause to assume that the OP is setting up a baseball
> information service and have much cause to assume that he is not.
> So let us reserve the benefit of the doubt, because this is what the others
> do, and work by plausible assumption--necessarily, because the realm of
> certainty is too small an action base.
>     SE is not a parser. It is a stream editor. I believe it fills a gap,
> handling a certain kind of problem very gracefully while being particularly
> easy to use. Your spontaneous reaction of horror was the consequence of a
> misinterpretation. The Tag_Stripper's argument
> ('"~<.*?>~=" "~<[^>]*~=" "~[^<]*>~="') is not the frightful incarnation of a
> novel, yet more arcane, regular-expression syntax. It is simply a string
> consisting of three very simple expressions: '<.*?>', '<[^>]*' and '[^<]*>'.
> They could also be written as or-ed alternatives: '<.*?>|<[^>]*|[^<]*>'.
> The tildes brace the regex to identify it as such. The equal sign says:
> replace what precedes with what follows. Nothing happens to follow, which
> means replace it with nothing, which means delete it (the tags). That's all.
> SE allows--indeed encourages--breaking a complex search down into any number
> of simple components.
>     (Having just said 'easy to use' I notice a mistake. I correct it below
> in section C.)
>
> B) I would welcome the OP's opinion.
>
> Regards
>
> Frederic
>
>
> C) Correction: The second and third expressions were meant to catch tags
> spanning lines. There weren't any such tags, and so the expressions were
> useless--and not even inoffensive: the second one, as a matter of fact,
> could also delete text. The Tag Stripper should be defined like this:
>
> Tag_Stripper = ('"~<(.|\n)*?>~=" "~<!--(.|\n)*?-->~="')
>
> It now deletes tags even if they span lines, and it incorporates a second
> definition that deletes comments which, as you made me aware, may contain
> tags. I now have to run the whole file through this before I look at the
> lines.
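For readers without SE installed, a rough equivalent of this corrected Tag_Stripper can be put together with nothing but Python's standard re module. The sketch below is only an approximation under that assumption; strip_tags is an illustrative name, not part of SE or of the code quoted here.

import re

# Delete comments first (they may contain tags), then delete tags, even when
# either spans several lines -- the same two patterns as in the corrected
# Tag_Stripper above.
_comment = re.compile(r'<!--(.|\n)*?-->')
_tag     = re.compile(r'<(.|\n)*?>')

def strip_tags(html):
    return _tag.sub('', _comment.sub('', html))

Under that assumption, strip_tags (htm_page.read ()) could stand in for Tag_Stripper (htm_page.read ()) in the function that follows.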
> def get_statistics (name_of_player):
>
>     # urllib and StringIO are needed at module level; Tag_Stripper and
>     # CSV_Maker are the SE editors defined earlier in the thread.
>     statistics = {
>         'Actual Pitching Statistics'   : [],
>         'Advanced Pitching Statistics' : [],
>     }
>
>     url = 'http://www.baseballprospectus.com/dt/%s.shtml' % name_of_player
>     htm_page = urllib.urlopen (url)
>     lines = StringIO.StringIO (Tag_Stripper (htm_page.read ()))
>     htm_page.close ()
>     current_list = None
>     for line in lines:
>         line = line.strip ()
>         if line == '':
>             continue
>         if 'Statistics' in line:    # That's the section headings.
>             if statistics.has_key (line):
>                 current_list = statistics [line]
>                 current_list.append (line)
>             else:
>                 current_list = None
>         else:
>             if current_list != None:
>                 current_list.append (CSV_Maker (line))
>
>     return statistics
>
>
> show_statistics (statistics) displays this tab-delimited CSV:
>
> Advanced Pitching Statistics
> AGE YEAR TEAM XIP RA DH DR DW NRA RAA PRAA PRAR DERA NRA RAA PRAA PRAR DERA STF
> 19 1914 BOS-A 25.3 4.70 -2 3 1 5.75 -4 -5 -2 6.15 6.19 -5 -5 -2 6.36 -25
> 20 1915 BOS-A 225.3 3.31 -12 3 2 4.01 12 4 45 4.33 4.25 6 1 42 4.44 12
> 21 1916 BOS-A 318.2 2.31 -32 -8 0 3.19 46 41 101 3.35 3.30 43 39 99 3.41 24
> 22 1917 BOS-A 336.5 2.56 -20 -7 1 3.49 38 23 83 3.88 3.72 29 20 80 3.96 13
> 23 1918 BOS-A 171.6 2.76 -16 5 0 3.80 13 6 34 4.20 4.16 6 3 31 4.36 3
> 24 1919 BOS-A 129.4 3.98 4 -16 2 4.63 -2 -2 19 4.61 4.79 -4 -3 17 4.70 -6
> 25 1920 NY_-A 6.4 9.00 -1 3 1 8.64 -3 -3 -3 8.96 8.95 -3 -3 -3 9.14 -35
> 26 1921 NY_-A 13.2 10.00 2 0 1 9.16 -7 -7 -5 9.36 9.61 -8 -8 -5 9.65 -41
> 35 1930 NY_-A 8.8 3.00 1 -2 0 2.84 2 2 4 2.57 3.07 1 2 3 2.66 13
> 38 1933 NY_-A 8.8 5.00 1 -1 0 5.01 -1 0 0 4.59 5.27 -1 0 0 4.73 -22
> 1243.5 2.95 -76 -22 8 3.78 96 59 275 4.07 3.95 65 45 262 4.17 10
>
> Actual Pitching Statistics
> AGE YEAR TEAM W L SV ERA G GS TBF IP H R ER HR BB SO HBP IBB WP BK CG SHO
> 19 1914 BOS-A 2 1 0 3.91 4 3 96 23.0 21 12 10 1 7 3 0 0 0 0 1 0
> 20 1915 BOS-A 18 8 0 2.44 32 28 874 217.7 166 80 59 3 85 112 6 0 9 1 16 1
> 21 1916 BOS-A 23 12 1 1.75 44 41 1272 323.7 230 83 63 0 118 170 8 0 3 1 23 9
> 22 1917 BOS-A 24 13 2 2.01 41 38 1277 326.3 244 93 73 2 108 128 11 0 5 0 35 6
> 23 1918 BOS-A 13 7 0 2.22 20 19 660 166.3 125 51 41 1 49 40 2 0 3 1 18 1
> 24 1919 BOS-A 9 5 1 2.97 17 15 570 133.3 148 59 44 2 58 30 2 0 5 1 12 0
> 25 1920 NY_-A 1 0 0 4.50 1 1 17 4.0 3 4 2 0 2 0 0 0 0 0 0 0
> 26 1921 NY_-A 2 0 0 9.00 2 1 49 9.0 14 10 9 1 9 2 0 0 0 0 0 0
> 35 1930 NY_-A 1 0 0 3.00 1 1 39 9.0 11 3 3 0 2 3 0 0 0 0 1 0
> 38 1933 NY_-A 1 0 0 5.00 1 1 42 9.0 12 5 5 0 3 0 0 0 0 0 1 0
> 94 46 4 2.28 163 148 4896 1221.3 974 400 309 10 441 488 29 0 25 4 107 17
>
> (The last line remains to be shifted three columns to the right.)
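The show_statistics helper itself is not shown anywhere in the thread. A minimal sketch of what it might look like, assuming it receives the dict returned by get_statistics above (each value is a list whose first element is the section heading, followed by the tab-delimited rows), could be:

def show_statistics(statistics):
    # Minimal sketch, not the original helper: print each section's rows
    # (the list already starts with its heading line), with a blank line
    # between sections.
    for rows in statistics.values():
        for row in rows:
            print row
        print

The career-total rows that the closing note refers to lack the first three columns (AGE, YEAR, TEAM); prepending three tabs ('\t' * 3) to those rows before printing would be one way to shift them into place.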