Kevin Zembower wrote:
>
> I'm trying to analyze web logs records which look like this:
>
> 2004-03-28 00:38:31 d7.facsmf.utexas.edu - W3SVC1 DB db.jhuccp.org GET
> /dbtw-wpd/exec/dbtwpcgi.exe
> XC=%2Fdbtw-wpd%2Fexec%2Fdbtwpcgi.exe&BU=http%3A%2F%2Fdb.jhuccp.org%2Fpopinform%2Fbasic.html&QB0=AND&QF0=Abstract+%7C+KeywordsMajor+%7C+KeywordsMinor+%7C+Notes+%7C+EngTitle+%7C+TT+%7C+FREAb+%7C+SPAAb&QI0=China%0D%0A&QB1=AND&QF1=Author+%7C+CN&QI1=&MR=10&TN=popline&RF=ShortRecordDisplay&DF=LongRecordDisplay&DL=1&RL=1&NP=0&AC=QBE_QUERY&x=37&y=4
> 200 0 21248 814 19391 80 HTTP/1.1
> Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705) -
> http://db.jhuccp.org/popinform/basic.html
>
> In this record, in the tenth space-delimited field, which starts "XC=%2Fdbtw"
> there are variables which start with "QF" followed by a number, for instance
> "QF0=Abstract+%7C+KeywordsMajor+%7C+KeywordsMinor+%7C+Notes+%7C+EngTitle+%7C+TT+%7C+FREAb+%7C+SPAAb&"
> This indicates that the fields to be searched in the database are "Abstract
> KeywordsMajor KeywordsMinor..." The same numbered "QI" variable, in this case
> "QI0=China%0D%0A" indicates searching for "China" in these fields.
>
> For every "QF" record, there should be a corresponding "QI" record with the
> same number, although the value might be blank, as in "QF1=Author+%7C+CN&QI1=&".
> This section of the above example indicates that a search should be performed
> in the "Author" and "CN" fields, but the value for "QI1" is blank, so it matches
> everything.
>
> My program, which I've pasted in below my signature, tries to find a "QF"
> value, matches it to a list of fieldnames ("If the list of fields to be
> searched contains the 'Abstract' field, it should be considered a 'subject'
> search") then grabs the corresponding "QI" value, to print it out. However,
> I can never match anything beyond the digit. In my program below, the line:
>
> print "Match successful!\n" if ($query =~ /QI$1/);
You are using $1 from a previous match, so it will work here; however, the
current match, if it is successful, will change the value of $1.

> works, but the next three lines:
>
> $query =~ /QI$1=(.*?)&/;

This will always fail. By this point a successful match without capturing
parentheses has already reset $1, so the pattern no longer contains the
field number.

> $subject = $1;

You should only use the dollar-digit variables when the match succeeded.

> print "Subject: $1\n" if ($debug);
>
> never matches anything.
>
> I've been working on this, on and off, the last two days. Any suggestions or
> pointing out my boneheaded errors is gratefully appreciated. Any other
> overall suggestions on my coding are welcomed. This script seems to run very
> slowly, due probably to all the complex regex.
>
> centernet:/opt/analog/logdata/db # cat listqueries3.pl
> #!/usr/local/bin/perl

use warnings;
use strict;

> $debug = 1;

my $debug = 1;

Or perhaps even:

use constant DEBUG => 1;

> while (<>) {
> next unless (/TN=popline/i); #Just analyze the records for the POPLINE database

Are you sure you need a case-insensitive match? You should also include
delimiters to ensure a correct match, and the parentheses are not required
because the precedence is unambiguous.

next unless /&TN=popline&/i;

Or:

next unless /\bTN=popline\b/i;

> $subject = $author = $docno = $title = $journalsource = $keywords = $languages =
> $popreporttopic = $refereed = $year = "";
>
> if (/^.* .* .* .* .* .* .* GET [^ ]*dbtwpcgi\.exe .*QI0=[^&]*&.*QI1=[^&]*&.*/){

That regular expression is VERY inefficient. Each '.* ' will try to match as
much as possible and will then have to backtrack. For example, compare the
output of these two one-liners:

perl -Mre=debug -le'$_="a b c d e f g h i";/^.* .* .* d/&&print'
perl -Mre=debug -le'$_="a b c d e f g h i";/^\S* \S* \S* d/&&print'

> if (/QI2/) { $type = "A"; } else { $type = "B"; }
> ($date, $time, $source, $junk, $junk, $host, $FQDN, $method, $file, $query,
> $junk) = split;

Instead of using the variable $junk for fields you don't want to keep, you
should use the undef function.

my ($date, $time, $source, undef, undef, $host, $FQDN, $method, $file, $query) = split;
next unless $method eq 'GET' and substr($file, -12) eq 'dbtwpcgi.exe';
my $type = $query =~ /QI2/ ? 'A' : 'B';

> while ($query =~ m/QF(\d+)=(\S*?)&/ig) {
> print "fieldnumber = :$1:\tfieldname = $2\n" if ($debug);
> if ($2 =~ /abstract/i) {
> print "Abstract found!\n" if ($debug);
> print "Query: $query\n" if ($debug);
> print "Match successful!\n" if ($query =~ /QI$1/);
> $query =~ /QI$1=(.*?)&/;
> $subject = $1;
> print "Subject: $1\n" if ($debug);
> } elsif ($2 =~ /author/i) {
> $query =~ /QI$1=(\S*?)&/;
> $author = $1;
> } elsif ($2 =~ /endtitle/i) {
> $query =~ /QI$1=(\S*?)&/;
> $title = $1;
> }
> } #while there are more matches for QFn fields

This may work better:

my %hash = ( abstract => '', author => '', endtitle => '' );
$hash{ lc $2 } = $3
    while $query =~ /QF(\d+)=[^&]*\b(abstract|author|endtitle)\b(?=.*?&QI\1=([^&]*))/ig;

> $outstring =
> "$type\t$date\t$time\t$subject\t$author\t$title\t$journalsource\t$keywords\t$languages\t$popreporttopic\t$refereed\t$year\n";

my $outstring = join( "\t",
    $type, $date, $time,
    @hash{ qw( abstract author endtitle ) },
    $journalsource, $keywords, $languages,
    $popreporttopic, $refereed, $year
) . "\n";

> print translate($outstring);

Calling a subroutine _may_ slow down the loop (or it may not).

> }# if it's a request for a database query
> }# while there are more lines in the input file
>
> sub translate() {
> $_ = $_[0];
> s/%22/\"/g;
> s/%2C/,/g;
> s/%20/ /g;
> s#%2F#/#g;
> s/%3D/=/g;
> s/%3B/;/g;
> s/%26/&/g;
> s/%0D//g;
> s/%0A//g;
> s/\+/ /g;
> s/%29/)/g;
> s/%28/(/g;
> s/%27/\' /g;

You don't have to escape " or ' in a double quoted string when the
delimiters are /.

> s/%2b/+/g;
> s/%7C/|/g;
> s/%3A/:/g;
> #Debbie request all boolean logical words and sumbols be replaced with '|'
> s/\b(and)\b/|/ig;
> s/\b(or)\b/|/ig;

The parentheses are not required because you are not grouping or capturing
anything.

s/\band\b/|/ig;
s/\bor\b/|/ig;

> s/&/|/g;
> s[/][|]g;

For single character replacement the tr/// operator is faster.

tr!&/!|!;

> $_;
> }

John
--
use Perl;
program
fulfillment
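
P.S. If the regexes keep fighting you, another way to pair each "QF" with its "QI" is to split the query string into a hash first and then look the partner up by number. The following is only a minimal sketch under my own assumptions, not a drop-in replacement for the script: the subroutine names parse_query, search_terms and url_decode are made up for illustration, the field name 'engtitle' is a guess based on the "EngTitle" that appears in the sample record, and the sample query is a shortened version of the one in the log line above. Note that the field number is copied into a lexical ($n) immediately, so a later match cannot clobber it.

```perl
use strict;
use warnings;

# Hypothetical helper: split a query string into name => raw value pairs.
# Values are left URL-encoded; a missing value becomes the empty string.
sub parse_query {
    my ($query) = @_;
    my %param;
    for my $pair ( split /&/, $query ) {
        next unless length $pair;
        my ( $name, $value ) = split /=/, $pair, 2;
        $param{$name} = defined $value ? $value : '';
    }
    return %param;
}

# Hypothetical helper: for every QFn that names a field we care about,
# record the value of the QIn parameter with the same number.
sub search_terms {
    my ($query) = @_;
    my %param = parse_query($query);
    my %found = ( abstract => '', author => '', engtitle => '' );
    for my $name ( grep { /^QF\d+$/ } keys %param ) {
        my ($n) = $name =~ /^QF(\d+)$/;    # copy the digits right away
        my $term = $param{"QI$n"};
        $term = '' unless defined $term;
        for my $field ( keys %found ) {
            $found{$field} = $term if $param{$name} =~ /\b$field\b/i;
        }
    }
    return %found;
}

# Hypothetical helper: generic percent-decoding instead of one s/// per code.
sub url_decode {
    my ($text) = @_;
    $text =~ s/%0[DA]//gi;                         # drop encoded CR/LF
    $text =~ tr/+/ /;                              # '+' encodes a space
    $text =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;    # any remaining %XX escape
    return $text;
}

# Shortened sample query, taken from the log record quoted above.
my $query = 'QB0=AND&QF0=Abstract+%7C+KeywordsMajor&QI0=China%0D%0A'
          . '&QB1=AND&QF1=Author+%7C+CN&QI1=&MR=10';
my %found = search_terms($query);
printf "%s => '%s'\n", $_, url_decode( $found{$_} ) for sort keys %found;
```

With the sample query this should report 'China' for abstract and empty strings for author and engtitle. Unlike the if/elsif chain in the original, a QF list that names several of the fields sets all of them, which may or may not be what you want.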