----- Original Message -----
From: "Paul McGuire" <[EMAIL PROTECTED]>
Newsgroups: comp.lang.python
To: <python-list@python.org>
Sent: Wednesday, July 26, 2006 1:01 AM
Subject: Re: Parsing Baseball Stats
> "Anthra Norell" <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]
>
> snip
>
> Frederic -
>
> HTML parsing is one of those slippery slopes - or perhaps "tar babies"
> might be a better metaphor - that starts out as a simple problem, but then
> one exception after the next drags the solution out for daaaays. Probably
> once or twice a week, there is a posting here from someone trying to
> extract data from a website, usually something like trying to pull the
> href's out of some
>
> snip
>
> So what started out as a little joke (microscopic, even) has eventually
> touched a nerve, so thanks and apologies to those who have read this whole
> mess. Frederic, SE looks like a killer - may it become the next regexp!
>
> -- Paul

Paul,

A year ago or so, someone posted a call for ideas on encoding passwords for his own private use. I suggested a solution using Python's random number generator and was immediately reminded by several knowledgeable people, quite sharply by some, that the random number generator is not to be used for cryptographic applications, since the doc specifically says so. I was also given good advice on what to read. I thought that my solution was good, if not by the catechism, then by the requirements of the OP's problem, which I considered to be the issue. I hoped the OP would come back with his opinion, but he didn't, not then and there. He did some time later, off list, telling me privately that he had incorporated my solution, with some adaptations, and that it was exactly what he had been looking for.

So let me pursue this along two lines: A) your response and B) the issue.

A) I thank you for the considerable time you must have taken to explain pyparsing in such detail. I didn't know you're the author. Congratulations! It certainly looks very professional, and I have no doubt that it is an excellent and powerful tool. Thanks also for your explanation of the TOS concept. It isn't alien to me and I have no problem with it.
But I don't believe it means that one should voluntarily argue against one's own freedom, barking at oneself with the voice of the legal watchdogs out there that would restrict our freedom preemptively, getting a tug on the leash for excessive zeal but a pat on the head nonetheless. We have little cause to assume that the OP is setting up a baseball information service and much cause to assume that he is not. So let us reserve the benefit of the doubt, because that is what the others do, and work by plausible assumption--necessarily, because the realm of certainty is too small an action base.

SE is not a parser. It is a stream editor. I believe it fills a gap, handling a certain kind of problem very gracefully while being particularly easy to use. Your spontaneous reaction of horror was the consequence of a misinterpretation. The Tag_Stripper's argument ('"~<.*?>~=" "~<[^>]*~=" "~[^<]*>~="') is not the frightful incarnation of a novel, yet more arcane regular-expression syntax. It is simply a string consisting of three very simple expressions: '<.*?>', '<[^>]*' and '[^<]*>'. They could also be written as or-ed alternatives: '<.*?>|<[^>]*|[^<]*>'. The tildes brace each regex to identify it as such. The equal sign says: replace what precedes with what follows. Nothing happens to follow, which means replace it with nothing, which means delete it (the tags). That's all. SE allows--encourages--one to break down a complex search into any number of simple components. (Having just said 'easy to use' I notice a mistake. I correct it below in section C.)

B) I would welcome the OP's opinion.

Regards

Frederic

C) Correction: The second and third expressions were meant to catch tags spanning lines. There weren't any such tags, and so the expressions were useless--and not inoffensive either: the second one, as a matter of fact, could also delete text.
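To make the correction concrete without SE itself: the three original expressions, or-ed together, behave as follows under Python's standard re module. This is only a sketch of the idea; the helper name strip_tags is mine, not SE's. It shows both the intended tag deletion and how one of the partial-tag expressions can eat plain text that merely contains a '>'.

```python
import re

# The three expressions from the original Tag_Stripper, or-ed together:
#   <.*?>    a complete tag
#   <[^>]*   a tag opened but not closed on the same line
#   [^<]*>   the tail of a tag closed on a later line
TAG_PARTS = re.compile(r'<.*?>|<[^>]*|[^<]*>')

def strip_tags(line):
    # Replace every match with nothing, i.e. delete it -- the same
    # "replace what precedes with what follows" rule as SE's '='.
    return TAG_PARTS.sub('', line)

print(strip_tags('<tr><td>23</td><td>BOS-A</td></tr>'))  # -> 23BOS-A
print(strip_tags('if a > b: pass'))  # 'if a >' is deleted: plain text eaten
```

Note how the alternation mirrors SE's decomposition into simple components: one expression per case, rather than one regex that covers everything.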
The Tag Stripper should be defined like this:

    Tag_Stripper = SE.SE ('"~<(.|\n)*?>~=" "~<!--(.|\n)*?-->~="')

It now deletes tags even if they span lines, and it incorporates a second definition that deletes comments which, as you made me aware, may contain tags. I now have to run the whole file through this before I look at the lines:

    def get_statistics (name_of_player):
        statistics = {
            'Actual Pitching Statistics'   : [],
            'Advanced Pitching Statistics' : [],
        }
        url = 'http://www.baseballprospectus.com/dt/%s.shtml' % name_of_player
        htm_page = urllib.urlopen (url)
        lines = StringIO.StringIO (Tag_Stripper (htm_page.read ()))
        htm_page.close ()
        current_list = None
        for line in lines:
            line = line.strip ()
            if line == '':
                continue
            if 'Statistics' in line:    # That's the section headings.
                if statistics.has_key (line):
                    current_list = statistics [line]
                    current_list.append (line)
                else:
                    current_list = None
            else:
                if current_list != None:
                    current_list.append (CSV_Maker (line))
        return statistics

show_statistics (statistics) displays this tab-delimited CSV:

Advanced Pitching Statistics
AGE YEAR TEAM XIP RA DH DR DW NRA RAA PRAA PRAR DERA NRA RAA PRAA PRAR DERA STF
19 1914 BOS-A 25.3 4.70 -2 3 1 5.75 -4 -5 -2 6.15 6.19 -5 -5 -2 6.36 -25
20 1915 BOS-A 225.3 3.31 -12 3 2 4.01 12 4 45 4.33 4.25 6 1 42 4.44 12
21 1916 BOS-A 318.2 2.31 -32 -8 0 3.19 46 41 101 3.35 3.30 43 39 99 3.41 24
22 1917 BOS-A 336.5 2.56 -20 -7 1 3.49 38 23 83 3.88 3.72 29 20 80 3.96 13
23 1918 BOS-A 171.6 2.76 -16 5 0 3.80 13 6 34 4.20 4.16 6 3 31 4.36 3
24 1919 BOS-A 129.4 3.98 4 -16 2 4.63 -2 -2 19 4.61 4.79 -4 -3 17 4.70 -6
25 1920 NY_-A 6.4 9.00 -1 3 1 8.64 -3 -3 -3 8.96 8.95 -3 -3 -3 9.14 -35
26 1921 NY_-A 13.2 10.00 2 0 1 9.16 -7 -7 -5 9.36 9.61 -8 -8 -5 9.65 -41
35 1930 NY_-A 8.8 3.00 1 -2 0 2.84 2 2 4 2.57 3.07 1 2 3 2.66 13
38 1933 NY_-A 8.8 5.00 1 -1 0 5.01 -1 0 0 4.59 5.27 -1 0 0 4.73 -22
1243.5 2.95 -76 -22 8 3.78 96 59 275 4.07 3.95 65 45 262 4.17 10

Actual Pitching Statistics
AGE YEAR TEAM W L SV ERA G GS TBF IP H R ER HR BB SO HBP IBB WP BK CG SHO
19 1914 BOS-A 2 1 0 3.91 4 3 96 23.0 21 12 10 1 7 3 0 0 0 0 1 0
20 1915 BOS-A 18 8 0 2.44 32 28 874 217.7 166 80 59 3 85 112 6 0 9 1 16 1
21 1916 BOS-A 23 12 1 1.75 44 41 1272 323.7 230 83 63 0 118 170 8 0 3 1 23 9
22 1917 BOS-A 24 13 2 2.01 41 38 1277 326.3 244 93 73 2 108 128 11 0 5 0 35 6
23 1918 BOS-A 13 7 0 2.22 20 19 660 166.3 125 51 41 1 49 40 2 0 3 1 18 1
24 1919 BOS-A 9 5 1 2.97 17 15 570 133.3 148 59 44 2 58 30 2 0 5 1 12 0
25 1920 NY_-A 1 0 0 4.50 1 1 17 4.0 3 4 2 0 2 0 0 0 0 0 0 0
26 1921 NY_-A 2 0 0 9.00 2 1 49 9.0 14 10 9 1 9 2 0 0 0 0 0 0
35 1930 NY_-A 1 0 0 3.00 1 1 39 9.0 11 3 3 0 2 3 0 0 0 0 1 0
38 1933 NY_-A 1 0 0 5.00 1 1 42 9.0 12 5 5 0 3 0 0 0 0 0 1 0
94 46 4 2.28 163 148 4896 1221.3 974 400 309 10 441 488 29 0 25 4 107 17

(The last line remains to be shifted three columns to the right.)
--
http://mail.python.org/mailman/listinfo/python-list