Ferry, Craig wrote:
>
> I am new to perl and would appreciate any suggestions as to how to do
> the following.
> 
> I have two files, one with 3.5 million records, the other with almost a
> million records.   Basically here's what I need to do.
> 
> See if field_1 in file_a is part of field_1 in file_b
> If so, see if field_2 in file_a is part of field_1 in file_b
> If so, see if field_3 in file_a is equal to field_2 in file_b
> If not equal, write out field_4 in file_a plus field_3 in file_b
> 
> I have written a script that will do this, but it runs literally for
> days.   I'm guessing my method is not the most efficient.   I do not
> have enough memory to read the files at one time into an array.
> 
> Thanks in advance for ideas.

Hi Craig

I have some questions:

- What size are your records?

- When you say 'part of', do you really mean that? That is, can field_1 of
  file_a appear /anywhere/ within field_1 of file_b, and so on?

My first thought is that data of this size should be in a database. You should
seriously consider that if there is the slightest chance of you having to do
this again.

My second thought is that you should start by creating some subsidiary index
files. Write a program that creates a file containing just fields 1, 2, 3 and 4
from file_a, and another with just fields 1, 2 and 3 from file_b. Then program
your query against those files. If it still runs too slowly, try creating one
file for each significant field of both sources: four files for file_a and
three for file_b.
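The extraction step might look something like this. It's only a sketch: I'm
assuming one record per line with whitespace-separated fields (adjust the
split() for your real delimiter), and the file names are made up.

```perl
use strict;
use warnings;

# Copy just the wanted columns from a record file into a smaller
# working file. @cols are zero-based field indexes.
sub extract_fields {
    my ($in, $out, @cols) = @_;
    open my $ifh, '<', $in  or die "$in: $!";
    open my $ofh, '>', $out or die "$out: $!";
    while (my $line = <$ifh>) {
        chomp $line;
        my @f = split ' ', $line;
        print $ofh join("\t", @f[@cols]), "\n";
    }
    close $ifh;
    close $ofh;
}

# For example (hypothetical file names):
# extract_fields('file_a', 'file_a.idx', 0 .. 3);  # fields 1-4 of file_a
# extract_fields('file_b', 'file_b.idx', 0 .. 2);  # fields 1-3 of file_b
```

Working from the slimmed-down files means each pass over the data reads far
fewer bytes, which is usually where the time goes with records this wide.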

Ideally one of those files will be small enough to read into memory, but don't
try it unless the file is only a few megabytes in size; if it doesn't fit in
physical memory the machine will start swapping and everything will get very
much slower.

HTH,

Rob

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/

