Re: Using variables in a regex

John W. Krahn Mon, 03 May 2004 19:56:53 -0700

Kevin Zembower wrote:
> 
> I'm trying to analyze web logs records which look like this:
> 
> 2004-03-28 00:38:31 d7.facsmf.utexas.edu - W3SVC1 DB db.jhuccp.org GET 
> /dbtw-wpd/exec/dbtwpcgi.exe 
> XC=%2Fdbtw-wpd%2Fexec%2Fdbtwpcgi.exe&BU=http%3A%2F%2Fdb.jhuccp.org%2Fpopinform%2Fbasic.html&QB0=AND&QF0=Abstract+%7C+KeywordsMajor+%7C+KeywordsMinor+%7C+Notes+%7C+EngTitle+%7C+TT+%7C+FREAb+%7C+SPAAb&QI0=China%0D%0A&QB1=AND&QF1=Author+%7C+CN&QI1=&MR=10&TN=popline&RF=ShortRecordDisplay&DF=LongRecordDisplay&DL=1&RL=1&NP=0&AC=QBE_QUERY&x=37&y=4
>  200 0 21248 814 19391 80 HTTP/1.1 
> Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705) - 
> http://db.jhuccp.org/popinform/basic.html
> 
> In this record, in the tenth space-delimited field, which starts "XC=%2Fdbtw"
> there are variables which start with "QF" followed by a number, for instance
> "QF0=Abstract+%7C+KeywordsMajor+%7C+KeywordsMinor+%7C+Notes+%7C+EngTitle+%7C+TT+%7C+FREAb+%7C+SPAAb&"
> This indicates that the fields to be searched in the database are "Abstract
> KeywordsMajor KeywordsMinor..." The same numbered "QI" variable, in this case
> "QI0=China%0D%0A" indicates searching for "China" in these fields.
> 
> For every "QF" record, there should be a corresponding "QI" record with the
> same number, although the value might be blank, as in "QF1=Author+%7C+CN&QI1=&".
> This section of the above example indicates that a search should be performed
> in the "Author" and "CN" fields, but the value for "QI1" is blank, so it matches
> everything.
> 
> My program, which I've pasted in below my signature, tries to find a "QF"
> value, matches it to a list of fieldnames ("If the list of fields to be
> searched contains the 'Abstract' field, it should be considered a 'subject'
> search") then grabs the corresponding "QI" value, to print it out. However,
> I can never match anything beyond the digit. In my program below, the line:
> 
>            print "Match successful!\n" if ($query =~ /QI$1/);


You are using $1 from a previous match, so it will work here.  However,
the current match, if it is successful, will change the value of $1.

> works, but the next three lines:
> 
>            $query =~ /QI$1=(.*?)&/;

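This will always fail once the match in the previous line has succeeded:
that pattern has no capturing parentheses, so $1 is now undefined and
this pattern is really /QI=(.*?)&/.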

>            $subject = $1;

You should only use the dollar digit variables when the match succeeded.
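
A minimal sketch of both points, assuming a shortened, made-up query
string (the names $n, $found, $after and $value are only for
illustration): copy $1 into a lexical straight away, and capture in
list context so that nothing is assigned unless the match succeeded.

```perl
use strict;
use warnings;

# Hypothetical, shortened query string for illustration only.
my $query = 'QF0=Abstract&QI0=China&QF1=Author&QI1=&';

my $subject = '';
if ( $query =~ /QF(\d+)=Abstract/ ) {
    my $n = $1;    # copy the number before another match replaces $1

    # A successful match without capture groups leaves $1 undefined.
    my $found = $query =~ /QI$n/;
    my $after = defined $1 ? $1 : 'undef';

    # Capture and assign in one step; the list is empty if the match fails.
    if ( my ($value) = $query =~ /QI$n=([^&]*)/ ) {
        $subject = $value;
    }
    print "found=$found after=$after subject=$subject\n";
}
```

This prints "found=1 after=undef subject=China".  An empty value such as
"QI1=" still counts as a successful match, because the list assignment
returns one (empty) element.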

>            print "Subject: $1\n" if ($debug);
> 
> never matches anything.
> 
> I've been working on this, on and off, the last two days. Any suggestions or
> pointing out my boneheaded errors is gratefully appreciated.  Any other
> overall suggestions on my coding are welcomed. This script seems to run very
> slowly, due probably to all the complex regex.
> 
> centernet:/opt/analog/logdata/db # cat listqueries3.pl
> #!/usr/local/bin/perl

use warnings;
use strict;

> $debug = 1;

my $debug = 1;

Or perhaps even:

use constant DEBUG => 1;


> while (<>) {
>    next unless (/TN=popline/i); #Just analyze the records for the POPLINE database

Are you sure you need a case insensitive match?  You should also include
delimiters so that the pattern cannot match in the middle of a longer
name or value.  The parentheses are not required because the precedence
is unambiguous.

   next unless /&TN=popline&/i;

Or:

   next unless /\bTN=popline\b/i;


>    $subject = $author = $docno = $title = $journalsource = $keywords = $languages = 
> $popreporttopic = $refereed = $year = "";
> 
>    if (/^.* .* .* .* .* .* .* GET [^ ]*dbtwpcgi\.exe .*QI0=[^&]*&.*QI1=[^&]*&.*/){

That regular expression is VERY inefficient.  Each '.* ' will try to
match as much as possible and will then have to backtrack.  For example,
compare the output of these two one-liners:

perl -Mre=debug -le'$_="a b c d e f g h i";/^.* .* .* d/&&print'

perl -Mre=debug -le'$_="a b c d e f g h i";/^\S* \S* \S* d/&&print'


>      if (/QI2/) { $type = "A"; } else {  $type = "B"; }
>      ($date, $time, $source, $junk, $junk, $host, $FQDN, $method, $file, $query, 
> $junk) = split;

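Instead of using the variable $junk for fields you don't want to keep
you can use undef as a placeholder in the list assignment.  The trailing
$junk is not needed at all because extra fields are simply discarded.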

  my ($date, $time, $source, undef, undef, $host, $FQDN, $method, $file,
      $query) = split;
  next unless $method eq 'GET' and substr($file, -12) eq 'dbtwpcgi.exe';
  my $type = $query =~ /QI2/ ? 'A' : 'B';


>      while ($query =~ m/QF(\d+)=(\S*?)&/ig) {
>         print "fieldnumber = :$1:\tfieldname = $2\n" if ($debug);
>         if ($2 =~ /abstract/i) {
>            print "Abstract found!\n" if ($debug);
>            print "Query: $query\n" if ($debug);
>            print "Match successful!\n" if ($query =~ /QI$1/);
>            $query =~ /QI$1=(.*?)&/;
>            $subject = $1;
>            print "Subject: $1\n" if ($debug);
>         } elsif ($2 =~ /author/i) {
>            $query =~ /QI$1=(\S*?)&/;
>            $author = $1;
>         } elsif ($2 =~ /endtitle/i) {
>            $query =~ /QI$1=(\S*?)&/;
>            $title = $1;
>         }
>      } #while there are more matches for QFn fields

This may work better:

    my %hash = ( abstract => '', author => '', endtitle => '' );
    $hash{ lc $2 } = $3
        while $query =~
            /QF(\d+)=[^&]*\b(abstract|author|endtitle)\b(?=.*?&QI\1=([^&]*))/ig;
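
To see what that pattern does, here is a sketch of the same expression
spread out with /x and run against a shortened, made-up query string.
The sample data and the block form of the loop are my assumptions, not
part of the original script.

```perl
use strict;
use warnings;

# Hypothetical, shortened query string modelled on the log record above.
my $query = 'QF0=Abstract+%7C+Notes&QI0=China%0D%0A&QB1=AND'
          . '&QF1=Author+%7C+CN&QI1=&MR=10';

my %hash = ( abstract => '', author => '', endtitle => '' );

while (
    $query =~ m{
        QF (\d+) =                               # field-list variable and its number
        [^&]* \b (abstract|author|endtitle) \b   # a wanted field name in the list
        (?= .*? & QI \1 = ( [^&]* ) )            # look ahead to the QI value with the same number
    }xig
) {
    $hash{ lc $2 } = $3;
}

print "$_ => '$hash{$_}'\n" for sort keys %hash;
```

With this input the abstract entry becomes 'China%0D%0A' while author and
endtitle stay empty.  The backreference \1 inside the lookahead is what
ties each QF number to the QI with the same number, and because the
lookahead consumes nothing the next iteration can still find later QF
fields.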


>      $outstring = 
> "$type\t$date\t$time\t$subject\t$author\t$title\t$journalsource\t$keywords\t$languages\t$popreporttopic\t$refereed\t$year\n";

    my $outstring = join( "\t",
            $type, $date, $time,
            @hash{ qw( abstract author endtitle ) },
            $journalsource, $keywords, $languages, $popreporttopic,
            $refereed, $year
            ) . "\n";


>      print translate($outstring);

Calling a subroutine _may_ slow down the loop (or it may not).


>    }# if it's a request for a database query
> }# while there are more lines in the input file
> 
> sub translate() {
>    $_ = $_[0];
>    s/%22/\"/g;
>    s/%2C/,/g;
>    s/%20/ /g;
>    s#%2F#/#g;
>    s/%3D/=/g;
>    s/%3B/;/g;
>    s/%26/&/g;
>    s/%0D//g;
>    s/%0A//g;
>    s/\+/ /g;
>    s/%29/)/g;
>    s/%28/(/g;
>    s/%27/\' /g;

You don't have to escape " or ' in the replacement part of s///, which is
interpolated like a double quoted string, when the delimiters are /.


>    s/%2b/+/g;
>    s/%7C/|/g;
>    s/%3A/:/g;
>    #Debbie requested all boolean logical words and symbols be replaced with '|'
>    s/\b(and)\b/|/ig;
>    s/\b(or)\b/|/ig;

The parentheses are not required because you are not grouping or
capturing anything.

     s/\band\b/|/ig;
     s/\bor\b/|/ig;


>    s/&/|/g;
>    s[/][|]g;

For single character replacement the tr/// operator is faster.

     tr!&/!|!;
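
A small sketch of how that tr/// behaves, using a made-up string (the
variable names are only for illustration).  The replacement list is
shorter than the search list, so its last character is reused.

```perl
use strict;
use warnings;

my $text = 'China & India/Nepal';

# The replacement list is shorter than the search list, so its last
# character ('|') is reused for every remaining search character.
( my $piped = $text ) =~ tr!&/!|!;

print "$piped\n";
```

This prints "China | India|Nepal".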

>    $_;
>    }
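
If you only need to undo the URL encoding, the long list of s/%XX/.../
lines could probably be replaced by one generic substitution.  This is a
sketch under my own assumptions, not a drop-in replacement: the name
decode_query_value is hypothetical, it decodes every %XX escape rather
than only the ones listed above, and it leaves out the and/or/&// to '|'
rewriting.

```perl
use strict;
use warnings;

# Hypothetical helper: decode one URL-encoded query value.
sub decode_query_value {
    my ($text) = @_;
    $text =~ tr/+/ /;                              # '+' encodes a space
    $text =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;    # %XX -> the character
    $text =~ tr/\r\n//d;                           # drop CR and LF, as the original does
    return $text;
}

print decode_query_value('China%0D%0A'), "\n";
print decode_query_value('Author+%7C+CN%2Bx'), "\n";
```

The '+' translation has to come before the %XX decoding, otherwise a
literal plus sign encoded as %2B would be turned into a space.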


John
-- 
use Perl;
program
fulfillment
