Re: recording the position of m// searches

Jeff 'japhy' Pinyan Thu, 13 May 2004 06:40:15 -0700

On May 13, Tim & Kylie Duke said:

>I am trying to build a perl program that reads through a very large text
>file, searches for a pattern, and prints the pattern - within its context -
>into a log file for later study.  The context is defined as, say, 20
>characters before and after the found pattern.  I intend to use this for
>linguistic analysis.


You're creating a program that creates a concordance, right?

>The pos() function within a m//g loop seems to return the position within
>the line, not the position within the file.  Is there a different function
>to return the absolute position in the file, which I could then use later on
>as a basis for

Well, here's one way to do it.  It uses some concepts you might not yet be
familiar with, so I'll stop the code and explain it as we go along:

  open FILE, "< file.txt" or die "can't read file.txt: $!";
  while (<FILE>) {
    my $orig_file_pos = tell FILE;
    my $line_start = $orig_file_pos - length($_);

The tell() function returns the position in the filehandle.  We need this
because, after we get a match, we're going to do some jumping around in
the file to extract the 20 pre- and 20 post-characters, and we want to be
able to go back to where we SHOULD be when all the work is done.

Then, we subtract the length of this line FROM that value, and put that
into $line_start.  After we've read a line, tell() is the position in the
file RIGHT AFTER that line.  We also need to know the position at the
BEGINNING of that line, and you'll see why soon:

    while (/pattern/g) {
      my $chunk = "";
      my $match_start = $-[0];
      my $match_length = $+[0] - $-[0];

Ok, $-[0] is the first element of the @- array.  Assuming you're using AT
LEAST Perl 5.6, you have this array.  After a successful regex match, the
@- and @+ arrays are populated with the offsets of the capture groups in
the string you matched.  Here's an example:

  "japhy" =~ /a((.).)/;
  # $&: $-[0] is 1    $+[0] is 4
  # $1: $-[1] is 2    $+[1] is 4
  # $2: $-[2] is 2    $+[2] is 3

Basically, it says that $1 is from position 2 to position 4 in the string
we matched ("japhy"), and that $2 is from position 2 to position 3.  $-[0]
and $+[0] allow us to access the whole match ($&) without having to use
the evil $& variable (which causes slowdowns for all regexes in your
code).  (See 'perldoc perlvar' for more details.)

As you can see from my small example above, $+[$N] - $-[$N] returns the
length of the capture group.  In our case, $+[0] - $-[0] returns the
length of the entire match.
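
A minimal sketch of that arithmetic, for anyone who wants to try it: it
feeds the offsets from the "japhy" example to substr() and recovers the
whole match and both capture groups without touching $&, $1 or $2.  The
variable names are only for illustration.

```perl
use strict;
use warnings;

my $string = "japhy";
my ($whole, $group1, $group2);

if ($string =~ /a((.).)/) {
    # substr(STRING, OFFSET, LENGTH): the length is always $+[N] - $-[N]
    $whole  = substr($string, $-[0], $+[0] - $-[0]);   # same text as $&
    $group1 = substr($string, $-[1], $+[1] - $-[1]);   # same text as $1
    $group2 = substr($string, $-[2], $+[2] - $-[2]);   # same text as $2
    print "whole=[$whole] 1=[$group1] 2=[$group2]\n";
}
```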

      my $from = $line_start + $match_start - 20;
      $from = 0 if $from < 0;
      seek(FILE, $from, 0);

This brings us to 20 characters BEFORE the start of the match (regardless
of what line that puts us on).  If the match is within the first 20
characters of the file, that offset would be negative and seek() would
fail, so we clamp it to 0 and start the chunk at the beginning of the file
instead.  The arguments to seek() are the filehandle, the offset (where in
the file to go), and a third argument that defines how to use this offset.
(If you wanted to go 10 characters back from your current position, you
would do seek(FILE, -10, 1), and if you wanted to go 10 characters forward
from your current position, you'd do seek(FILE, 10, 1).)  In this case,
the argument '0' means "relative to the beginning of the file".

      read(FILE, $chunk, 20 + $match_length + 20);

Here, we read $match_length + 40 characters into $chunk.  Simple enough.
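
To see absolute and relative seeks side by side, here is a small sketch.
It assumes an in-memory string opened as a filehandle in place of file.txt
(so it runs on its own) and uses the named constants from the core Fcntl
module, SEEK_SET and SEEK_CUR, which stand for the 0 and 1 used above.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_CUR);   # named constants for 0 and 1

# Assumption: an in-memory "file" stands in for file.txt.
my $data = "0123456789abcdefghij";
open my $fh, '<', \$data or die "can't open in-memory file: $!";

my ($abs, $rel) = ("", "");

# Absolute: go to offset 10 from the start, then read 3 characters.
seek($fh, 10, SEEK_SET) or die "seek failed: $!";
read($fh, $abs, 3);                 # position is now 13

# Relative: go 5 back from the current position (13 - 5 = 8), read 4.
seek($fh, -5, SEEK_CUR) or die "seek failed: $!";
read($fh, $rel, 4);

print "abs=[$abs] rel=[$rel]\n";
close $fh;
```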

      # if you want to turn newlines into literal \n symbols, do this:
      $chunk =~ s/\n/\\n/g;

That's just in case you'd rather show a chunk that spans several lines on
one line, with each newline displayed as \n.  Now, we print:

      print "FOUND: [$chunk]\n";
    }
    seek(FILE, $orig_file_pos, 0);

Once we're done finding matches on this line, we want to go back to where
we were before, IMMEDIATELY after this line.  This will allow <FILE> to
read the next line properly.

  }
  close FILE;

Here's the code, uninterrupted:

  open FILE, "< file.txt" or die "can't read file.txt: $!";
  while (<FILE>) {
    my $orig_file_pos = tell FILE;
    my $line_start = $orig_file_pos - length($_);

    while (/pattern/g) {
      my $chunk = "";
      my $match_start = $-[0];
      my $match_length = $+[0] - $-[0];

      my $from = $line_start + $match_start - 20;
      $from = 0 if $from < 0;  # match within 20 chars of the file's start
      seek(FILE, $from, 0);
      read(FILE, $chunk, 20 + $match_length + 20);

      $chunk =~ s/\n/\\n/g;  # show each newline as a literal \n

      print "FOUND: [$chunk]\n";
    }
    seek(FILE, $orig_file_pos, 0);
  }
  close FILE;
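
For experimenting, the same logic can be wrapped in a sub and run against
an in-memory file.  This is only a sketch under a few assumptions: the
name concordance() and its parameters are made up for illustration, the
start offset is clamped at 0 for matches near the beginning of the file,
and the handle is read as plain bytes with no newline translation, so that
length() and tell() count in the same units.

```perl
use strict;
use warnings;

# Illustrative wrapper: returns the context chunks instead of printing them.
sub concordance {
    my ($fh, $regex, $width) = @_;
    my @chunks;
    while (my $line = <$fh>) {
        my $orig_file_pos = tell $fh;                        # right after this line
        my $line_start    = $orig_file_pos - length $line;   # where this line began
        while ($line =~ /$regex/g) {
            my $match_length = $+[0] - $-[0];
            my $from = $line_start + $-[0] - $width;
            $from = 0 if $from < 0;                          # match near start of file
            seek($fh, $from, 0) or die "seek failed: $!";
            my $chunk = "";
            read($fh, $chunk, $width + $match_length + $width);
            $chunk =~ s/\n/\\n/g;                            # show newlines as \n
            push @chunks, $chunk;
        }
        # go back to where <$fh> expects to read the next line
        seek($fh, $orig_file_pos, 0) or die "seek failed: $!";
    }
    return @chunks;
}

# Assumption: a short in-memory string stands in for the large text file.
my $text = "the cat sat\non the mat\nwith a hat\n";
open my $fh, '<', \$text or die "can't open in-memory file: $!";
my @found = concordance($fh, qr/at/, 3);
close $fh;
print "FOUND: [$_]\n" for @found;
```

Here the context is 3 characters either side rather than 20, just to keep
the sample text short.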

And there you go.  There are other ways to do this, of course.  One way
would be to see if the current line has enough characters on it that we
can just do a substr() on it, and not move around in the file, but I think
that's too much thinking and work for this task right now, and that this
is more straightforward.
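
For reference, the substr() alternative might look roughly like this.  It
is a sketch, not part of the program above: context_from_line() is a
made-up helper name, and it gives up (returns nothing) whenever the
current line does not hold enough context by itself, which is the case
where you would still have to move around in the file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the match plus $width characters on each
# side, or undef when the line alone does not hold that much context.
sub context_from_line {
    my ($line, $match_start, $match_length, $width) = @_;
    return if $match_start < $width;
    return if $match_start + $match_length + $width > length $line;
    return substr($line, $match_start - $width,
                  $width + $match_length + $width);
}

my $line = "a long enough line with a pattern in the middle of it\n";
my @chunks;
while ($line =~ /pattern/g) {
    # copy the offsets before calling anything else
    my ($start, $length) = ($-[0], $+[0] - $-[0]);
    my $chunk = context_from_line($line, $start, $length, 5);
    push @chunks, defined $chunk ? $chunk : "(not enough context on this line)";
}
print "FOUND: [$_]\n" for @chunks;
```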

-- 
Jeff "japhy" Pinyan      [EMAIL PROTECTED]      http://www.pobox.com/~japhy/
RPI Acacia brother #734   http://www.perlmonks.org/   http://www.cpan.org/
CPAN ID: PINYAN    [Need a programmer?  If you like my work, let me know.]
<stu> what does y/// stand for?  <tenderpuss> why, yansliterate of course.


