Re: techniques for handling large text files

James Edward Gray II Mon, 29 Dec 2003 08:23:46 -0800

On Dec 29, 2003, at 12:05 AM, danl001 wrote:

Hi,

Howdy.

If this question would be better posted to another perl list, please let me know.

I think you found the right place.

I have a very large text file (~2 GB) and it's in the following format:

header line
header line
header line
marker 1
header line
header line
header line
marker 2
line type 1
line type 1
line type 1
...
line type 1
line type 2
line type 2
line type 2
...
line type 2
end of file marker line

My objective is to put all "line type 1" lines into file1.txt and all "line type 2" lines into file2.txt. The "header line" lines and the marker lines should not appear in either file1.txt or file2.txt. Note there is no marker line between where line type 1 ends and where line type 2 starts, but that boundary can be determined by examining a field in the line.

Sounds easy enough. I'm with you to here.

So I have a script to do this. Essentially, it visits each line in the file and decides which output file to write it to. The problem is it takes a long time to run (roughly 45 min on a dual P4 with 512 MB RAM).

Forgive my ignorance, I'm a Mac guy. Dual P4??? I thought you couldn't do that.

The time seems very excessive to me. My gut instinct is that a line-by-line read should perform much better than that. My two immediate guesses: One, you accidentally read the whole thing into memory, say by using a foreach() instead of a while() for the input loop. Or two, there is something else going on in the parts of the code you didn't show us.
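
To make the first guess concrete, here is a minimal sketch of the two loop styles. The sub names count_lines_streaming and count_lines_slurping are made up for illustration and are not taken from the script under discussion; they only count lines so the difference in how the file is read stands out.

```perl
use strict;
use warnings;

# Reads one line per iteration, so memory use stays flat
# no matter how large the file is.
sub count_lines_streaming {
    my ($path) = @_;
    open my $in, '<', $path or die "File error:  $!";
    my $count = 0;
    while (my $line = <$in>) {      # scalar context: one line at a time
        $count++;
    }
    close $in;
    return $count;
}

# foreach evaluates <$in> in list context, so every line of the
# file is read into memory before the loop body runs even once.
sub count_lines_slurping {
    my ($path) = @_;
    open my $in, '<', $path or die "File error:  $!";
    my $count = 0;
    foreach my $line (<$in>) {      # list context: whole file at once
        $count++;
    }
    close $in;
    return $count;
}
```

Both subs return the same count. On a 2 GB input with 512 MB of RAM, though, the second form would likely push the machine into swap, which could explain a run time like the one reported.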

I'd like to cut this running time down as much as possible. What I'm looking for is either suggestions on a better way to do this in perl, or suggestions or techniques I could use to speed up my current script. I have pasted the relevant parts of the script below. I noticed I could shave a bit off the runtime by reading the original file in a buffered manner instead of line by line.

Another warning sign, I think. I wouldn't expect this to speed it up, since Perl's line-by-line reads are buffered already.

My outputs to file1.txt and file2.txt at this point take place with prints to their respective file handles.

Any suggestions that will speed this up in any way will be greatly appreciated! Thanks,

Nothing jumped out and bit me in your code, so I ran my own test. I built a file that matches your regexes and is close to the proper size (1.89 GB). (Yes, it could still be pretty different, especially in line length.) After that I built some code to split it, line by line. It took under 3 minutes on my dual G5, which sounds a lot closer to right.
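
For anyone who wants to repeat the experiment, a generator along these lines would produce a file that the script further down accepts. This is only a sketch and not the code used for the timing above: the sub name write_test_file and the "record N ..." line layout are assumptions, chosen solely so that the lines match the regexes in that script.

```perl
use strict;
use warnings;

# Hypothetical test-file generator. The line layout is assumed:
# "record 1 ..." for type-one lines, "record 2 ..." for type-two lines.
sub write_test_file {
    my ($path, $lines_per_type) = @_;
    open my $out, '>', $path or die "File error:  $!";

    # two header blocks, each closed by a marker line
    for my $marker (1, 2) {
        print $out "header line\n" for 1 .. 3;
        print $out "marker $marker\n";
    }

    # all type-one lines first, then all type-two lines,
    # with no marker line in between
    for my $type (1, 2) {
        print $out "record $type some payload text\n" for 1 .. $lines_per_type;
    }

    print $out "end file\n";    # end-of-file marker
    close $out or die "File error:  $!";
    return;
}
```

Scale $lines_per_type up until the file reaches the size you want to test. Real data will probably have longer lines, so treat any timing from a file like this as a rough guide.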

Below is the code I used. It's pretty straightforward, but shout if you need me to explain anything.

#!/usr/bin/perl

use strict;
use warnings;

die "Usage: perl type_split INPUT OUTPUT_ONE OUTPUT_TWO\n" unless @ARGV == 3;

open INPUT, '<', shift() or die "File error:  $!";
open OUTPUT, '>', shift() or die "File error:  $!";

my $section = 'header1';
while (<INPUT>) {
        last if m/^end file/;   # quit when we're done
        if ($section eq 'header1' || $section eq 'header2') {   # skip headers
                next unless m/^marker/;
                $section = $section eq 'header1' ? 'header2' : 'one';
        }
        elsif ($section eq 'one') {     # handle section one
                if (m/^\S+\s+2/) {      # and watch for section two
                        $section = 'two';
                        # re-opening OUTPUT implicitly closes the first output file
                        open OUTPUT, '>', shift() or die "File error:  $!";
                }
                print OUTPUT $_;
        }
        else { print OUTPUT $_; }       # handle section two
}

__END__

Hope that helps you along.

James

