On 08/10/2006 09:12 AM, Roman Daszczyszak wrote:
Hello all,
I have several text files with a few thousand contacts in each, and I
am trying to pull out all the contacts from certain email domains
(about 15 of them). I wrote a script that loops through each file,
then loops through matching each domain to the line and writes the
results to two files, one for matches, one for non-matches.
I am just curious if there is a way to match all the domains in turn,
without having a foreach looping through them?
Here's my code:
#!/perl/bin/perl
use strict;
use warnings;
my $program_time = time();
die "SYNTAX: strip_email_addresses.pl FILE1 FILE2 .. FILE(N)\n" unless
(@ARGV);
my $domain_filename = "intel_addresses.txt";
my @email_domains;
open(DOMAINS, "<$domain_filename") or die "Cannot open $domain_filename: $!\n";
chomp(@email_domains = <DOMAINS>);
You can simplify this by using File::Slurp, e.g.
use File::Slurp;
...
chomp (@email_domains = read_file($domain_filename));
# Email_domains is more useful as a hash:
my %email_domains = map +($_, 1), @email_domains;
LINE: while (<>)
{
my $filename = $ARGV;
$filename =~ s/\.csv//gi;
open(FOUND, ">>${filename}_match.csv") or die "Cannot open ${filename}_match.csv\n";
open(NOTFOUND, ">>${filename}_nomatch.csv") or die "Cannot open ${filename}_nomatch.csv\n";
This opens the output files each time a line is read from one
of the input files; that's inefficient. I would leave out the
reading from <> and do it completely differently.
foreach my $domain (@email_domains)
{
if (m/$domain/i)
{
I would just look the domain up in the hash created above instead of
looping over the list.
print(FOUND $_);
next LINE;
}
}
print(NOTFOUND $_);
}
print("Run time: ",time() - $program_time,"\n");
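A rough sketch of what the hash lookup could look like for one line of input. It assumes each line contains ordinary addresses of the form user@domain and that you want an exact, case-insensitive match on the whole domain. The sample domains and the line_matches helper are made up for illustration. Note that this is stricter than m/$domain/i, which matches substrings and treats "." as a wildcard.

```perl
use strict;
use warnings;

# Hypothetical sample domains; in the real script these would be
# read from intel_addresses.txt.
my %email_domains = map +( lc($_) => 1 ), qw(example.com example.org);

# Return 1 if any address on the line belongs to a listed domain.
sub line_matches {
    my ($line) = @_;

    # Capture the text after each "@" up to the next character
    # that cannot be part of a host name.
    while ($line =~ /\@([A-Za-z0-9.-]+)/g) {
        return 1 if $email_domains{ lc $1 };
    }
    return 0;
}

print line_matches(qq{"Jane Doe",jane\@Example.COM,555-0100\n}) ? "match\n" : "no match\n";
print line_matches(qq{"John Roe",john\@elsewhere.net,555-0101\n}) ? "match\n" : "no match\n";
```

Each lookup is a single hash access, so the cost per line no longer grows with the number of domains.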
---------------------------------------------------------------------------
Additionally, does anyone know of a better way to open the results
files, keeping the practice of making two files for each original,
without having to reopen the file on each iteration of the while loop?
Does reopening the file cause a performance hit each open?
Yes, reopening will hurt performance--probably a lot.
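One way to avoid the repeated opens while still producing two files per input is to loop over @ARGV yourself and open each output handle once per input file. This is only a sketch: is_wanted is a hypothetical stand-in for the real domain test, and it opens the outputs with '>' (overwrite) rather than '>>' (append), which is an assumption about what you want on a re-run.

```perl
use strict;
use warnings;

# Hypothetical predicate standing in for the real domain test.
sub is_wanted { return $_[0] =~ /\@example\.com\b/i }

# Split one input file into <base>_match.csv and <base>_nomatch.csv,
# opening each output handle exactly once.
sub split_file {
    my ($infile) = @_;
    (my $base = $infile) =~ s/\.csv$//i;    # strip a trailing .csv only

    open(my $in, '<', $infile)
        or die "Cannot open $infile: $!\n";
    open(my $found, '>', "${base}_match.csv")
        or die "Cannot open ${base}_match.csv: $!\n";
    open(my $notfound, '>', "${base}_nomatch.csv")
        or die "Cannot open ${base}_nomatch.csv: $!\n";

    while (my $line = <$in>) {
        # Pick the output handle per line; nothing is reopened here.
        print { is_wanted($line) ? $found : $notfound } $line;
    }

    close($_) or die "Cannot close file: $!\n" for $in, $found, $notfound;
    return;
}

split_file($_) for @ARGV;
```

Lexical filehandles also close automatically when they go out of scope, so a die halfway through does not leave FOUND or NOTFOUND dangling.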
You didn't post any data. It's not too easy to test a program
without its data, but this is how I'd approach the problem:
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp;
use Text::CSV_XS;
# WARNING: UNTESTED CODE
# Remove exit below to run.
exit;
# DOMAIN: Replace with the index value of your domain column.
my $DOMAIN = 6;
my $domain_filename = "intel_addresses.txt";
my $program_time = time();
die "SYNTAX: strip_email_addresses.pl FILE1 FILE2 .. FILE(N)\n"
unless (@ARGV);
my %email_domains = map { chomp; $_, 1 }
read_file($domain_filename);
# binary => 1 lets the parser accept non-ASCII characters in fields.
my $csv = Text::CSV_XS->new({ binary => 1 });
foreach my $infile (@ARGV) {
my $filename = $infile;
$filename =~ s/\.csv$//i;  # strip only a trailing .csv extension
my @data =
map { $csv->parse($_); [ $_, $csv->fields ] }
read_file($infile);
my (@found, @notfound);
foreach my $rec (@data) {
if ($email_domains{$rec->[$DOMAIN+1]}) {
push @found, $rec->[0];
} else {
push @notfound, $rec->[0];
}
}
write_file("${filename}_found.csv", @found);
write_file("${filename}_notfound.csv", @notfound);
}
undef $csv;
# WARNING: UNTESTED CODE
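As for the original question of matching all the domains without an inner foreach: if you would rather stay with a regex than parse the CSV, one option is to build a single alternation from the list and compile it once with qr//. This is a sketch only. The sample domains and test lines are invented, and it assumes you want whole-domain matches, so the lookahead rejects longer host names that merely start with a listed domain.

```perl
use strict;
use warnings;

# Hypothetical sample domains; the real list would be read from the
# domain file.
my @email_domains = qw(example.com example.org);

# quotemeta keeps the "." in each domain literal. The alternation is
# compiled once, outside any per-line loop.
my $alternation = join '|', map { quotemeta($_) } @email_domains;
my $domain_re   = qr/\@(?:$alternation)(?![A-Za-z0-9.-])/i;

for my $line ('jane@example.com,555-0100', 'john@example.community,555-0101') {
    printf "%-35s %s\n", $line, ($line =~ $domain_re ? 'match' : 'no match');
}
```

With about 15 domains either this or the hash should be fast enough; the hash needs the domain isolated first, while the regex works directly on the raw line.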