Re: new for reading file containing multiple records

Shawn Corey Sun, 08 Jan 2006 06:32:52 -0800

chen li wrote:

You are 50% right. This method is not correct for the
first record (which actually contains ">" only) but it
is correct for the last record (and others in between).

I want to edit the file first and try to delete the
first ">" in this big file. I browsed Programming Perl
and Perl Cookbook; there is no such example: just
delete the first character in a file. But they have
examples to delete the last line from a file. It seems
odd to me.

First, I'd recommend against changing a large file. Unless your program is the only user of the file, you would have to change all the other programs. And in this case, you would be unable to distinguish one record from another.

There are three ways to distinguish records in a file: by record separators, by beginning-of-record tokens, and by end-of-record tokens. Your file may use one, two or all three methods. When writing code, your preference should be (in order) record separator, end-of-record token, and finally beginning-of-record token.

In Perl, the variable $/ is used as an end-of-record token, even though it is called the INPUT_RECORD_SEPARATOR. Its name is misleading. If it were a true record separator, your code would never have to process the record separator; it would be discarded at a lower level.
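
A small sketch of that behaviour, assuming an in-memory filehandle and sample data made up for illustration (the names $data, @raw and @chomped are not from your program). Each record comes back with the value of $/ still attached, and it is your code, via chomp, that has to remove it:

```perl
use strict;
use warnings;

# Sample data invented for illustration: records terminated by ";".
my $data = "alpha;beta;gamma";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my ( @raw, @chomped );
{
    local $/ = ';';    # treat ";" as the end-of-record token
    while ( my $record = <$fh> ) {
        push @raw, $record;    # the token is still attached here
        chomp $record;         # chomp removes a trailing $/ if present
        push @chomped, $record;
    }
}
close $fh;

print "raw:     @raw\n";        # raw:     alpha; beta; gamma
print "chomped: @chomped\n";    # chomped: alpha beta gamma
```

Note that the last record has no ";" after it, so it is returned without one.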

The records in your file are distinguished only by a beginning-of-record token, specifically a greater-than sign at the beginning of a record. You can process the file in two ways: treat the beginning-of-record token as an end-of-record token, or read ahead in the file and process the record only after reading the beginning of the next record. Both have advantages and disadvantages.

If you want to treat the beginning-of-record token as an end-of-record one, your records are going to have some anomalies. The first record is going to have a beginning-of-record token attached to it. Your last record is not going to have an end-of-record token. For your case, it would look something like this:


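# FH is assumed to be already open for reading
my $beginning_token = '>';
my $end_token = "\n$beginning_token";
$/ = $end_token;    # every read now stops after "\n>"
my $first = 1;
while( <FH> ){
  if( $first ){
    # only the first record still carries its beginning token
    s/^\Q$beginning_token\E//;
    $first = 0;
  }
  # remove the next record's token; the last record has none to remove
  s/\Q$end_token\E$//;
  process_record( $_ );
}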
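
If you want to try this approach without touching the big file, here is a self-contained sketch of the same idea. It assumes an in-memory filehandle, made-up sample data, and a stand-in process_record() that only collects its argument; it also swaps the second substitution for chomp, which removes a trailing $/ when there is one.

```perl
use strict;
use warnings;

# Hypothetical sample in the same shape as the file under discussion:
# every record starts with ">" at the beginning of a line.
my $data = ">rec1\nAAAA\nCCCC\n>rec2\nGGGG\n>rec3\nTTTT\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;

# Stand-in for process_record(); here it only collects what it is given.
sub process_record {
    my ($record) = @_;
    push @records, $record;
}

my $beginning_token = '>';
{
    local $/ = "\n$beginning_token";    # restored when the block exits
    my $first = 1;
    while ( my $record = <$fh> ) {
        if ($first) {
            # only the first record still has its beginning token
            $record =~ s/^\Q$beginning_token\E//;
            $first = 0;
        }
        chomp $record;    # strips a trailing "\n>" where there is one
        process_record($record);
    }
}
close $fh;

printf "%d records\n", scalar @records;
```

The last record shows the anomaly mentioned above: nothing follows it, so chomp finds no "\n>" to remove and the record keeps its final newline.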

If you want to use only the beginning-of-record token, you will have to do at least a partial read ahead. This means you have to store the read ahead, and the last record will be processed outside the read loop. For your case:


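# FH is assumed to be already open for reading; $/ keeps its default "\n"
my $beginning_token = '>';
my $record = '';
while( <FH> ){
  if( /^\Q$beginning_token\E/ ){
    # a new record starts here, so the stored one is complete
    if( $record =~ /^\Q$beginning_token\E/ ){
      process_record( $record );
    }
    $record = '';
  }
  $record .= $_;
}
# the last record has no following token, so process it after the loop
if( $record =~ /^\Q$beginning_token\E/ ){
  process_record( $record );
}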
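
The read-ahead version can be tried the same way. This is only a sketch under the same assumptions: an in-memory filehandle, invented sample data, and a stand-in process_record() that collects its argument. The sample starts with a stray line to show that text before the first token is never processed.

```perl
use strict;
use warnings;

# Hypothetical sample; the first line comes before any ">" token.
my $data = "stray line\n>rec1\nAAAA\n>rec2\nGGGG\nTTTT\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;

# Stand-in for process_record(); here it only collects what it is given.
sub process_record {
    my ($record) = @_;
    push @records, $record;
}

my $beginning_token = '>';
my $record          = '';
while ( my $line = <$fh> ) {    # default $/, so one line per read
    if ( $line =~ /^\Q$beginning_token\E/ ) {
        # The stored text is complete; keep it only if it is a real record.
        process_record($record) if $record =~ /^\Q$beginning_token\E/;
        $record = '';
    }
    $record .= $line;
}
# Nothing follows the last record, so it is handled after the loop.
process_record($record) if $record =~ /^\Q$beginning_token\E/;
close $fh;

printf "%d records\n", scalar @records;
```

Unlike the $/ approach, every record here keeps its leading ">" and its trailing newline, so there are no first-record or last-record anomalies to handle.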



--

Just my 0.00000002 million dollars worth,
   --- Shawn

"Probability is now one. Any problems that are left are your own."
   SS Heart of Gold, _The Hitchhiker's Guide to the Galaxy_
