Joseph,

Thanks for writing, and for the advice. Here's another crack at the question.

> -----Original Message-----
> From: R. Joseph Newton [mailto:[EMAIL PROTECTED]
> Sent: Monday, April 05, 2004 5:39 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; 'Stuart 
> V. Jordan'
> Subject: Re: Regular expression question: non-greedy matches
> 
> 
> That is a bit off.  I think we really need a sample of actual
> data to be able to help you.  If the data is of a 
> confidential nature, then you will have to do meaningful 
> substitutions for any matter that is not public.  Boilerplate 
> substitutions do not work.  So far, I have seen three 
> different formats for your sample string.  Each of them would 
> call logically for a somewhat different extraction approach.
> My best advice would be not to do it all in one regex.  
> Regular expressions are powerful tools, and amazingly 
> efficient given the demands placed on them, but they get 
> progressively less efficient as they increase in complexity.  
> If there is any distinct marker that separates the items 
> being voted on, I would strongly recommend that you first 
> split on this marker so that each vote has its own element.
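
If I read the split-first suggestion right, it would look something like the
sketch below. This is only a rough outline under a few assumptions of mine:
that the "Yeas N; Nays N;" summary line is a usable marker between votes, and
that each name list ends with a period followed by the tally (as in
"Wren. -81"). The name parse_votes and the hash keys are made up for
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a page into one chunk per vote, then pull each
# piece out of its chunk with its own small pattern.
sub parse_votes {
    my ($page) = @_;
    my @votes;

    # The lookahead keeps the "Yeas N; Nays N;" line at the head of each chunk.
    for my $chunk (split /(?=Yeas \d+; Nays \d+;)/, $page) {
        # Skip any leading text that holds no vote summary.
        my ($yeas, $nays) = $chunk =~ /Yeas (\d+); Nays (\d+);/ or next;

        # Assumption: a list runs up to a period followed by its tally.
        my ($yea_list) = $chunk =~ /Yea:\s*(.*?)\.\s*-?\s*\d+/s;
        my ($nay_list) = $chunk =~ /Nay:\s*(.*?)\.\s*-?\s*\d+/s;

        push @votes, {
            yeas     => $yeas,
            nays     => $nays,
            yea_list => $yea_list,   # undef when no "Yea:" list was found
            nay_list => $nay_list,   # undef when no "Nay:" list was found
        };
    }
    return @votes;
}
```

Because each chunk holds a single vote, a missing "Nay:" list simply leaves
nay_list undefined instead of picking up names from the next vote.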

No problem. I just thought I would present the gist of the issue with a
smaller amount of code. Here's a sample of the actual data I am parsing --
it is a single page of a legislative journal published by Alabama, detailing
the results of votes (scanned and OCR'd). Naturally there are thousands of
pages, but the same problem appears again and again with nongreedy matches.

*** Data starts here ******

REGULAR SESSION 187 3rd Day
Yeas 81; Nays 0; Abstains 0.

Yea:
Mr. Speaker, Allen, Bandy, Barton, Beason,. Black (M), Boothe, Boyd,
Bridges, Buskey, Carns, Carter, Clark, Clouse, Crigler, Curry, Dukes, Dunn,
Ford (C), Gaines, Galliher, Gaston, Gipson, Grantland, Greene, Greeson,
Guin, Hall (A.), Hall (L), Haney, Hawkins, Hill, Hilliard, Hogan, Houston,
Hubbard, Humphryes, Hurst, Johnson, Jones, Kennedy, Knight, Laird,'Letson,
Lindsey, Major, Mancuso, Martin, McClammy, McDaniel, McKee, McLaughlin,
McMillan, Mitchell, Morrow, Morton, Newton (C), Newton (D), Oden, Page,
Parker (T), Parker (W), Payne, Penry, Perdue, Robinson (J), Robinson (0),
Rogers (J), Rogers (M), Sanderford, Sanderson, Schmitz, Spicer, Starkey,
Thigpen, Thomas (D), Thomas (E), Turner, Venable, Willis and Wren. -81
BUDGET ISOLATION RESOLUTION

On motion of Representative Black (M), the Budget Isolation Resolution
relating to the bill, HB121, was adopted.

Yeas 75; Nays 8; Abstains 2.

Yea:

Mr. Speaker, Allen, Barton, Beasley, Beason, Black (M), Boothe, Boyd,
Bridges, Carns, Carothers, Clark, Clouse, Crigler, Curry, Dukes, Ford (J),
Gaines, Galliher, Gaston, Gipson, Graham, Grantland, Greene, Greeson, Guin,
Hall (A), Hamilton, Haney, Hawkins, Hayden, Hill, Hogan, Hooper, Houston,
Hubbard, Humphryes, Hurst, Jackson, Johnson, Jones, Kennedy, Laird, Layson,
Mancuso, Martin, McDaniel, McKee, Millican, Morrison, Morrow, Morton, Newton
(C), Newton (D), Oden, Page, Parker (T), Payne, Pringle, Robinson (J),
Robinson (0), Rogers (M), Sanderford, Sanderson, Schmitz, Seibenhener,
Spicer, Thomas (D), Thomas (E), Turner, Vance, Venable, Warren, Willis and
Wren.
- 75
Nay:
Representatives Hall (L), Hilliard, Holmes, Major, McClammy, Mitchell,
Parker (W) and Rogers (J). -8
Abstain:
Representatives Dunn and Perdue.
2
BILLS ON THIRD READING

And the bill:

HB121 (With Substitute): Relating to execution of the death sentence;
providing for execution of the death sentence by lethal injection unless the
person elects to be executed by electrocution; to provide the procedure for
a person

***** Data ends here *********

And my regular expression is as follows:

while ($clean =~ /Yeas (\d+); Nays (\d+);.*?Yea:(.*?)\.\n.*?(Nay:(.*?)\.\n)?/gis) {
    # process the text here
}

This regular expression captures the announced result of both votes (in $1
and $2), the 81 Yeas in the first vote (in $3), the 0 Nays in the first vote
($4), and the 75 Yeas in the second vote. It also, incorrectly, comes back
with 0 Nays for the second vote, where it should find 8. Note that I am not
interested in abstainers.
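
To make the symptom easier to see, here is a cut-down reproduction. The sample
string and the variable names are made up for illustration. My guess at what
is happening: the lazy .*? in front of the optional group is happy to match
nothing, the group then fails at that spot, and since the group is optional
the match succeeds without it. The second pattern is one possible variant that
moves the lazy skip inside the optional group.

```perl
use strict;
use warnings;

# Made-up stand-in for a single vote from the journal page.
my $sample = "Yeas 2; Nays 1; Abstains 0.\n"
           . "Yea:\nAllen and Wren.\n- 2\n"
           . "Nay:\nRepresentatives Holmes.\n-1\n";

# Same shape as the pattern in question: a lazy .*? followed by an optional
# group. The group is skipped, so the last two captures come back undef.
my @lazy = $sample
    =~ /Yeas (\d+); Nays (\d+);.*?Yea:(.*?)\.\n.*?(Nay:(.*?)\.\n)?/is;

# Variant: the skip lives inside the optional group, so the engine tries to
# reach "Nay:" before it settles for leaving the group out.
my @tight = $sample
    =~ /Yeas (\d+); Nays (\d+);.*?Yea:(.*?)\.\n(?:.*?Nay:(.*?)\.\n)?/is;

printf "lazy:  nay list %s\n", defined $lazy[4]  ? "found" : "missing";
printf "tight: nay list %s\n", defined $tight[3] ? "found" : "missing";
```

One caveat with the variant: when a vote has no "Nay:" list at all, the skip
can run on into the next vote and pick up that vote's list, so it seems safest
on text that holds only one vote.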


> 
> It is much better to have explicit 0's for the losing side in
> any unanimous vote.  Undefined values only confuse issues.

I wholeheartedly agree, but I am given the data and can only deal with it as
given.

Thanks for your help!

Boris

