----- Original Message -----
From: "Paul McGuire" <[EMAIL PROTECTED]>
Newsgroups: comp.lang.python
To: <python-list@python.org>
Sent: Wednesday, July 26, 2006 1:01 AM
Subject: Re: Parsing Baseball Stats
> "Anthra Norell" <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]
>
> snip
>
> Frederic -
>
> HTML parsing is one of those slippery slopes - or perhaps "tar babies"
> might be a better metaphor - that starts out as a simple problem, but then
> one exception after the next drags the solution out for daaaays. Probably
> once or twice a week, there is a posting here from someone trying to
> extract data from a website, usually something like trying to pull the
> href's out of some
>
> snip
>
> So what started out as a little joke (microscopic, even) has eventually
> touched a nerve, so thanks and apologies to those who have read this whole
> mess. Frederic, SE looks like a killer - may it become the next regexp!
>
> -- Paul

Paul,

A year ago or so, someone posted a call for ideas on encoding passwords for his own private use. I suggested a solution using Python's random number generator and was immediately reminded by several knowledgeable people, quite sharply by some, that the random number generator is not to be used for cryptographic applications, since the doc specifically says so. I was also given good advice on what to read. I thought that my solution was good, if not by the catechism, then by the requirements of the OP's problem, which I considered to be the issue. I hoped the OP would come back with his opinion, but he didn't, not then and there. He did some time later, off list, telling me privately that he had incorporated my solution, with some adaptations, and that it was exactly what he had been looking for.

So let me pursue this along two lines: A) your response and B) the issue.

A) I thank you for the considerable time you must have taken to explain pyparsing in such detail. I didn't know you're the author. Congratulations! It certainly looks very professional, and I have no doubt that it is an excellent and powerful tool. Thanks also for your explanation of the TOS concept. It isn't alien to me and I have no problem with it.
But I don't believe it means that one should voluntarily argue against one's own freedom, barking at oneself with the voice of the legal watchdogs out there that would restrict our freedom preemptively, getting a tug on the leash for excessive zeal but a pat on the head nonetheless. We have little cause to assume that the OP is setting up a baseball information service and much cause to assume that he is not. So let us reserve the benefit of the doubt, because that is what the others do, and work by plausible assumption--necessarily, because the realm of certainty is too small an action base.

SE is not a parser. It is a stream editor. I believe it fills a gap, handling a certain kind of problem very gracefully while being particularly easy to use. Your spontaneous reaction of horror was the consequence of a misinterpretation. The Tag_Stripper's argument ('"~<.*?>~=" "~<[^>]*~=" "~[^<]*>~="') is not the frightful incarnation of a novel, yet more arcane regular-expression syntax. It is simply a string consisting of three very simple expressions: '<.*?>', '<[^>]*' and '[^<]*>'. They could also be written as or-ed alternatives: '<.*?>|<[^>]*|[^<]*>'. The tildes brace each regex to identify it as such. The equal sign says: replace what precedes with what follows. Nothing happens to follow, which means replace it with nothing, which means delete it (the tags). That's all. SE allows--encourages--one to break down a complex search into any number of simple components. (Having just said 'easy to use' I notice a mistake. I correct it below in section C.)

B) I would welcome the OP's opinion.

Regards

Frederic

C) Correction: The second and third expressions were meant to catch tags spanning lines. There weren't any such tags, and so the expressions were useless--and not inoffensive either: the second one, as a matter of fact, could also delete text.
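To make the correction concrete without SE itself: the three original expressions, or-ed together, behave as follows under Python's standard re module. This is only a sketch of the idea; the helper name strip_tags is mine, not SE's. It shows both the intended tag deletion and how one of the partial-tag expressions can eat plain text that merely contains a '>'.

```python
import re

# The three expressions from the original Tag_Stripper, or-ed together:
#   <.*?>    a complete tag
#   <[^>]*   a tag opened but not closed on the same line
#   [^<]*>   the tail of a tag closed on a later line
TAG_PARTS = re.compile(r'<.*?>|<[^>]*|[^<]*>')

def strip_tags(line):
    # Replace every match with nothing, i.e. delete it -- the same
    # "replace what precedes with what follows" rule as SE's '='.
    return TAG_PARTS.sub('', line)

print(strip_tags('<tr><td>23</td><td>BOS-A</td></tr>'))  # -> 23BOS-A
print(strip_tags('if a > b: pass'))  # 'if a >' is deleted: plain text eaten
```

Note how the alternation mirrors SE's decomposition into simple components: one expression per case, rather than one regex that covers everything.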
The Tag Stripper should be defined like this:

    Tag_Stripper = SE.SE ('"~<(.|\n)*?>~=" "~<!--(.|\n)*?-->~="')

It now deletes tags even if they span lines, and it incorporates a second definition that deletes comments which, as you made me aware, may contain tags. I now have to run the whole file through this before I look at the lines:

    def get_statistics (name_of_player):
        statistics = {
            'Actual Pitching Statistics'   : [],
            'Advanced Pitching Statistics' : [],
        }
        url = 'http://www.baseballprospectus.com/dt/%s.shtml' % name_of_player
        htm_page = urllib.urlopen (url)
        lines = StringIO.StringIO (Tag_Stripper (htm_page.read ()))
        htm_page.close ()
        current_list = None
        for line in lines:
            line = line.strip ()
            if line == '':
                continue
            if 'Statistics' in line:    # That's the section headings.
                if statistics.has_key (line):
                    current_list = statistics [line]
                    current_list.append (line)
                else:
                    current_list = None
            else:
                if current_list != None:
                    current_list.append (CSV_Maker (line))
        return statistics

show_statistics (statistics) displays this tab-delimited CSV:

Advanced Pitching Statistics
AGE YEAR TEAM XIP RA DH DR DW NRA RAA PRAA PRAR DERA NRA RAA PRAA PRAR DERA STF
19 1914 BOS-A 25.3 4.70 -2 3 1 5.75 -4 -5 -2 6.15 6.19 -5 -5 -2 6.36 -25
20 1915 BOS-A 225.3 3.31 -12 3 2 4.01 12 4 45 4.33 4.25 6 1 42 4.44 12
21 1916 BOS-A 318.2 2.31 -32 -8 0 3.19 46 41 101 3.35 3.30 43 39 99 3.41 24
22 1917 BOS-A 336.5 2.56 -20 -7 1 3.49 38 23 83 3.88 3.72 29 20 80 3.96 13
23 1918 BOS-A 171.6 2.76 -16 5 0 3.80 13 6 34 4.20 4.16 6 3 31 4.36 3
24 1919 BOS-A 129.4 3.98 4 -16 2 4.63 -2 -2 19 4.61 4.79 -4 -3 17 4.70 -6
25 1920 NY_-A 6.4 9.00 -1 3 1 8.64 -3 -3 -3 8.96 8.95 -3 -3 -3 9.14 -35
26 1921 NY_-A 13.2 10.00 2 0 1 9.16 -7 -7 -5 9.36 9.61 -8 -8 -5 9.65 -41
35 1930 NY_-A 8.8 3.00 1 -2 0 2.84 2 2 4 2.57 3.07 1 2 3 2.66 13
38 1933 NY_-A 8.8 5.00 1 -1 0 5.01 -1 0 0 4.59 5.27 -1 0 0 4.73 -22
1243.5 2.95 -76 -22 8 3.78 96 59 275 4.07 3.95 65 45 262 4.17 10

Actual Pitching Statistics
AGE YEAR TEAM W L SV ERA G GS TBF IP H R ER HR BB SO HBP IBB WP BK CG SHO
19 1914 BOS-A 2 1 0 3.91 4 3 96 23.0 21 12 10 1 7 3 0 0 0 0 1 0
20 1915 BOS-A 18 8 0 2.44 32 28 874 217.7 166 80 59 3 85 112 6 0 9 1 16 1
21 1916 BOS-A 23 12 1 1.75 44 41 1272 323.7 230 83 63 0 118 170 8 0 3 1 23 9
22 1917 BOS-A 24 13 2 2.01 41 38 1277 326.3 244 93 73 2 108 128 11 0 5 0 35 6
23 1918 BOS-A 13 7 0 2.22 20 19 660 166.3 125 51 41 1 49 40 2 0 3 1 18 1
24 1919 BOS-A 9 5 1 2.97 17 15 570 133.3 148 59 44 2 58 30 2 0 5 1 12 0
25 1920 NY_-A 1 0 0 4.50 1 1 17 4.0 3 4 2 0 2 0 0 0 0 0 0 0
26 1921 NY_-A 2 0 0 9.00 2 1 49 9.0 14 10 9 1 9 2 0 0 0 0 0 0
35 1930 NY_-A 1 0 0 3.00 1 1 39 9.0 11 3 3 0 2 3 0 0 0 0 1 0
38 1933 NY_-A 1 0 0 5.00 1 1 42 9.0 12 5 5 0 3 0 0 0 0 0 1 0
94 46 4 2.28 163 148 4896 1221.3 974 400 309 10 441 488 29 0 25 4 107 17

(The last line remains to be shifted three columns to the right.)
--
http://mail.python.org/mailman/listinfo/python-list