Hello, I'm taking a PhD course that requires the use of Perl and pattern matching. I've taken on the motto "divide and conquer," but it hasn't quite worked. I appreciate anyone's help.
The task is to extract sentences from a relatively large text file (928K, ca. 300 pages). But of course, the text file is messy. I've tried two approaches.

1. My first approach was to use substitution to get rid of everything between <DOC> and </DATELINE>. A short version looks like this:

    $hello = "<DOC> man at the bar order the </DATELINE>";
    $hello =~ s/<DOC>.*<\/DATELINE>//gi;
    print "$hello\n";

This works until the text contains a double quotation mark ("), so I replaced the double quotation marks (") with single quotation marks ('). But then, as I put more text into $hello, the code seems to break. For example, running the substitution against the text below simply prints the text back unchanged:

    <DOC>
    <DOCNO> WSJ890728-0079 </DOCNO>
    <DD> = 890728 </DD>
    <AN> 890728-0079. </AN>
    <HL> Major Deficit @ Signaled by Sun </DATELINE>

It doesn't remove everything between <DOC> and </DATELINE>.

2. My second approach was to skip the deleting and simply match what I wanted. A short version looks like this:

    $mystring = "<TEXT> Sun Microsystems Inc. said it will post a larger-than-expected fourth-quarter loss of as much as $26 million and may show a loss in the current first quarter, raising further troubling questions about the once high-flying computer workstation maker. </TEXT> .";
    if ($mystring =~ m/<TEXT>(.*?)<\/TEXT>/) {
        print $1;
    }

This works. Again, I swapped the double quotation marks for single quotation marks. But once again, when I include more data with line breaks, the code breaks.

This is the first part of a 5-part question. Very frustrating. Every university should have Perl tutors just as they have (or should have) language tutors.

Cheers,
Zach

--
Zachary S. Brooks
PhD Student in Second Language Acquisition and Teaching (SLAT)
The University of Arizona - http://www.coh.arizona.edu/slat/
Graduate Associate in Teaching - Department of English
M.A. Applied Linguistics - University of Massachusetts Boston
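
For reference, one likely cause of the multi-line failure described above is that `.` in a Perl regex does not match a newline unless the `/s` modifier is given. The sketch below is only an illustration under stated assumptions, not a definitive fix. It assumes the whole input is held in one string (not read line by line), that the line breaks fall where shown, and that the `<TEXT>` passage, shortened here, follows the header. The names `$doc`, `@texts`, and `$stripped` are made up for the example. A single-quoted heredoc is used so that `"`, `$26`, and `@` in the data need no escaping.

```perl
use strict;
use warnings;

# Sample data taken from the post; the placement of line breaks is an assumption.
# <<'END' is a single-quoted heredoc: nothing inside is interpolated,
# so quotation marks, $26 and @ are all kept literally.
my $doc = <<'END';
<DOC>
<DOCNO> WSJ890728-0079 </DOCNO>
<DD> = 890728 </DD>
<AN> 890728-0079. </AN>
<HL> Major Deficit @ Signaled by Sun </DATELINE>
<TEXT>
Sun Microsystems Inc. said it will post a larger-than-expected
fourth-quarter loss of as much as $26 million.
</TEXT>
END

# Approach 2: capture everything between <TEXT> and </TEXT>.
# /s lets . match newlines; /g in list context returns every capture.
my @texts = $doc =~ m{<TEXT>(.*?)</TEXT>}gs;

# Trim leading and trailing whitespace from each captured passage.
s/^\s+|\s+$//g for @texts;

# Approach 1: delete everything from <DOC> through </DATELINE>.
# .*? is non-greedy, so each match stops at the first </DATELINE>.
(my $stripped = $doc) =~ s{<DOC>.*?</DATELINE>}{}gis;

print "$_\n" for @texts;
```

Using `m{...}` and `s{...}{...}` as delimiters avoids having to escape the `/` in the closing tags. If the file is instead read one line at a time, no single line contains both the opening and the closing tag, so the pattern cannot match whatever modifiers are used.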