Re: Home made mail news search tool, and folded header lines

R. Joseph Newton Fri, 26 Mar 2004 23:17:32 -0800

Harry Putnam wrote:

> I'm writing a home boy mail/news search tool and wondered if there is
> a canonical way to handle folded or indented header lines.
>
> An example would be that I wanted to run a series of regex against each
> line of input (While in headers) grabbing the matches into an array
> for printing.
>
> Something like:
> [...] snipped getopts and other unrelated stuff
>       while(<FILE>){
>           chomp;
>           my $line = $_;


Why here?  Since you are doing this with each line, you could declare the
variable in the loop control:
while (my $line = <FILE>) {
and then chomp $line as the first statement of the loop body.

>
>           ## @hdregs is an array of several regex for the headers
>           for($ii=0;$ii<=$#hdregs;$ii++){

Why a C-style for loop?  Are you using the index somewhere?  Why no space
between clauses?  Why no space around assignment operators?
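
If the index is not needed, a foreach over the patterns themselves reads more
easily.  What follows is only a sketch: the contents of @hdregs were snipped
from the original post, so the three patterns here are invented, and
first_match() is a hypothetical helper name, not anything from Harry's script.

```perl
use strict;
use warnings;

# Hypothetical patterns; the real @hdregs was snipped from the post.
my @hdregs = (qr/^From:/i, qr/^Subject:/i, qr/^Received:/i);

# Return the first pattern that matches $line, or undef if none does.
sub first_match {
    my ($line, @patterns) = @_;
    foreach my $re (@patterns) {
        return $re if $line =~ $re;
    }
    return;
}

# Invented sample input standing in for lines read from <FILE>.
my @input = ("Subject: folded headers\n", "X-Mailer: Mozilla 4.79\n");

my @hits;
foreach my $line (@input) {
    chomp $line;
    push @hits, $line if defined first_match($line, @hdregs);
}
print "$_\n" for @hits;
```

Precompiling the patterns with qr// also avoids recompiling each regex on
every pass through the loop.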

>
>              if($line =~ /$hdregs[$ii]/){

Right now, you have just gotten quite a bit of information about this line,
including, with the same amount of effort, the type of header line involved.

>
>                 ## Capture the line
>                 push @hits,$line;

You now have thrown away the type information for the line, by throwing it back
into an unsorted bag.  As Joe Ben Stamper said, "When you fall, fall in the
direction of your work".  These lines should probably be going into a hash,
keyed to the portion of the line before the colon.  You may wish to throw out
about 3/4 of them, since there are hundreds of different attributes carried in
header lines, and only a small subset is going to be useful for data
management.  In any case, you should probably try to capture *all*
the information available at this point.
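
As a rough sketch of that idea: split each (already unfolded) header line at
the first colon and keep the value under the header name.  The %wanted list
and the parse_header() name are my own assumptions for illustration; pick
whatever subset suits your tool.

```perl
use strict;
use warnings;

# Hypothetical subset of header names worth keeping.
my %wanted = map { lc($_) => 1 } qw(From To Subject Date Message-ID);

# Split one header line into (name, value); returns an empty list if the
# line does not look like "Name: value".
sub parse_header {
    my ($line) = @_;
    return $line =~ /^([^:\s]+):\s*(.*)$/ ? ($1, $2) : ();
}

# Invented sample lines for illustration.
my @lines = (
    'Subject: Re: folded header lines',
    'X-Mailer: Mozilla 4.79 [en]',
    'From: someone@example.com',
);

my %header_info;
foreach my $line (@lines) {
    my ($name, $value) = parse_header($line);
    next unless defined $name && $wanted{ lc $name };
    $header_info{$name} = $value;
}

print "$_ => $header_info{$_}\n" for sort keys %header_info;
```

Matching only up to the first colon matters: a value such as
"Re: folded header lines" contains a colon of its own.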

>
>              }
>           }
>     }
>           (somewhere later... if some body regex also match....
>             print the hits.)
>
> I used a for loop so as to stop at each incoming line and compare it
> to a number of regex in rotation, instead of using a possibly long
> string of alternation operators (if ($_ =~ /regex|regex2|regex3/)  etc
>
> But going this route means lines like `Received: ' lines that might
> have folded (indented) lines containing newlines after them will get
> missed.

Then buffer the input.  Declare a variable outside of the loop to hold the
previous line.  If the line currently being read begins with whitespace, it is
a continuation, so join it to the buffered line instead of treating it as a new
header.  It might take a little restructuring of the sequence within the loop.
This is one case where a priming read could be of assistance, since your loop
would then always have something in the buffer to spit out whenever the line
being read does not have space at the start.
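
Here is a minimal sketch of that buffering, assuming the headers end at the
first blank line.  The read_unfolded_headers() name and the sample message are
invented for illustration; the in-memory filehandle just stands in for your
real file.

```perl
use strict;
use warnings;

# Read header lines from $fh, joining folded (indented) continuation lines
# onto the header they belong to.  Stops at the blank line ending the headers.
sub read_unfolded_headers {
    my ($fh) = @_;
    my @headers;
    my $buffer = <$fh>;                      # priming read
    return @headers unless defined $buffer && $buffer =~ /\S/;
    chomp $buffer;
    while (my $line = <$fh>) {
        chomp $line;
        last if $line =~ /^\s*$/;            # blank line: end of headers
        if ($line =~ /^\s+/) {
            $line =~ s/^\s+/ /;              # collapse the indent to one space
            $buffer .= $line;
        } else {
            push @headers, $buffer;          # buffered header is complete
            $buffer = $line;
        }
    }
    push @headers, $buffer;                  # flush the last header
    return @headers;
}

# Sample data, invented for illustration.
my $sample = "Received: from a.example\n\tby b.example\nSubject: test\n\nbody text\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @headers = read_unfolded_headers($fh);
close $fh;

print "$_\n" for @headers;
```

Each element of the returned list is one complete header, so the regex loop
can then run against whole headers rather than physical lines.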

>
>
> It seems like it might take some fairly complex code to do something
> like above but include slurping the indented lines on hits that have
> them.   I wondered if there is some well worn way to do this?

I'm not sure about well-worn, but it doesn't have to be all that complicated
either:

Greetings! E:\d_drive\perlStuff\hdr>perl -w
use Data::Dumper;
open IN, 'hdr00006.txt' or die "Couldn't open header file: $!";
# stuff specific to storage on my mailer:
my $current_line = <IN>;
my $date_string;
if ($current_line =~ /From - (.*)$/) {
   $date_string = $1;
}
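# general purpose code for verbatim mail headers:
$current_line = <IN>;
# the limit of 2 keeps values that contain ': ' themselves (Subject: Re: ...) intact
my ($current_key, $current_value) = split /:\s+/, $current_line, 2;
my %header_info;
while (my $line = <IN>) {
   last if $line =~ /^\s*$/;   # a blank line ends the headers
   if ($line =~ /^\s+/) {
      # folded continuation line: collapse the indent and append
      $line =~ s/^\s+/ /;
      $current_value .= $line;
   } else {
      # a new header starts, so the buffered one is complete
      $header_info{$current_key} = $current_value;
      ($current_key, $current_value) = split /:\s+/, $line, 2;
   }
}
# store the last header, which is still sitting in the buffer
$header_info{$current_key} = $current_value;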

print Dumper(\%header_info);

^Z
$VAR1 = {
          'MIME-Version' => '1.0
',
          'Status' => '',
          'X-Spam-Status' => 'No, hits=-100.1 required=5.0 tests=SUBJ_ENDS_IN_Q_

MARK,USER_IN_WHITELIST version=2.20
',
          'List-Post' => '<mailto:[EMAIL PROTECTED]>
',
          'X-Mailer' => 'Mozilla 4.79 [en] (Windows NT 5.0; U)

...' [about 10 or 15 more headers.  You can now choose among the relevant ones]
        };
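
One caveat with a plain hash: a header name that occurs more than once, as
Received: usually does, keeps only its last value.  A possible variation,
sketched here with invented sample lines and a hypothetical collect_headers()
helper, is to store an array of values per name:

```perl
use strict;
use warnings;

# Store every value of each header; a repeated name gets several entries.
# Expects header lines that have already been unfolded and chomped.
sub collect_headers {
    my (@unfolded) = @_;
    my %info;
    foreach my $line (@unfolded) {
        my ($name, $value) = split /:\s*/, $line, 2;
        next unless defined $value;          # skip lines without a colon
        push @{ $info{$name} }, $value;
    }
    return \%info;
}

# Invented sample headers for illustration.
my $info = collect_headers(
    'Received: from a.example by b.example',
    'Received: from c.example by a.example',
    'Subject: Re: folded header lines',
);

printf "%s has %d value(s)\n", $_, scalar @{ $info->{$_} }
    for sort keys %$info;
```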


HTH

Joseph



