I forgot to attach the anonymize_mailbox script; it follows the quoted message below.
On Sun, 23 Aug 2009, David Coppit wrote:
Hi there,
I need more information to debug this. Please either confirm the bug and
provide more information, or mark this bug as "not a bug".
grepmail uses Mail::Mbox::MessageParser, which is designed to use memory
proportional to the largest email message in a mailbox. I verified that it
does indeed operate this way, using a 54MB mailbox:
mbox size: 56683943
max email size: 11182857
max read buffer: 11184795 <-- Biggest size of M::M::MP's read buffer
folder_reader: 11186558 <-- Biggest size of the M::M::MP Perl object
Some stats from ps(1):
Plain text mailbox:
min real memory: 4976640
min virtual memory: 618000384
max real memory: 38674432
max virtual memory: 651546624
Gzip compressed:
min real memory: 5005312
min virtual memory: 618016768
max real memory: 38694912
max virtual memory: 651563008
I also tried a 540MB mailbox, created by concatenating the mailbox 10
times:
Plain text x10:
min real memory: 4976640
min virtual memory: 618000384
max real memory: 40292352
max virtual memory: 652021760
Gzip compressed x10:
min real memory: 5005312
min virtual memory: 618016768
max real memory: 40284160
max virtual memory: 652038144
The numbers above were basically the same for a 23KB mailbox. Also note
that this command:
    perl -e 'system "ps -o rss,vsz $$"'
consumes 1175552 real and 615645184 virtual memory, so the numbers above
are not out of the ordinary.
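To repeat this kind of measurement from inside a Perl program, something like the following rough sketch might do. The names parse_ps_line and memory_usage are made up for illustration, and the sketch assumes the local ps(1) accepts "-o rss= -o vsz= -p PID" (the trailing "=" suppresses the header line). Units are whatever that ps reports, commonly kilobytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one line of "ps -o rss= -o vsz=" output into (rss, vsz).
# Returns an empty list if the line is not two integers.
sub parse_ps_line {
    my ($line) = @_;
    return unless defined $line;
    return $line =~ /^\s*(\d+)\s+(\d+)\s*$/ ? ($1, $2) : ();
}

# Ask ps(1) about a process (default: this one).
# Assumption: the local ps supports the rss and vsz format names.
sub memory_usage {
    my ($pid) = @_;
    $pid = $$ unless defined $pid;
    my $output = `ps -o rss= -o vsz= -p $pid 2>/dev/null`;
    return parse_ps_line($output);
}

my ($rss, $vsz) = memory_usage();
if (defined $rss) {
    print "rss=$rss vsz=$vsz\n";
}
else {
    print "could not read memory usage from ps\n";
}
```

Calling memory_usage() before and after processing a mailbox gives min/max figures comparable to the ones listed above, once the units are taken into account.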
If you could run the attached anonymize_mailbox script on your mailbox,
verify that memory usage is still bad, then send the mailbox to me, I can
debug this better.
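For reference, the masking that the attached script performs can be sketched as below. This is a simplified illustration, not the script itself: mask_text and anonymize_email are hypothetical names, and unlike the script this version masks every To: and Subject: line rather than only the first of each. Because each word character is replaced by exactly one "X", message sizes and line structure stay the same, which is why memory behaviour should be reproducible on the anonymized mailbox.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace every word character with "X". One character in, one character
# out, so the length and line structure of the text are unchanged.
sub mask_text {
    my ($text) = @_;
    $text =~ s/\w/X/g;
    return $text;
}

# Anonymize one message: mask the values of To: and Subject: header lines
# and the whole body; leave every other header line untouched.
sub anonymize_email {
    my ($email) = @_;
    my ($header, $body) = $email =~ /\A(.*?\n\n)(.*)\z/s;
    # No blank line means there is no body: treat it all as header.
    ($header, $body) = ($email, '') unless defined $header;
    $header =~ s{^((?:To|Subject): )(.*)$}
                { my ($name, $value) = ($1, $2); $name . mask_text($value) }gme;
    return $header . mask_text($body);
}

my $sample = "From alice\@example.com Sun Aug 23 10:00:00 2009\n"
           . "To: bob\@example.com\n"
           . "Subject: Quarterly numbers\n"
           . "\n"
           . "See the attached report.\n";
my $masked = anonymize_email($sample);
print $masked;
print "length preserved\n" if length($masked) == length($sample);
```

Note that other headers (From:, Cc:, Message-ID: and so on) are left as they are, so check the output before sending it if those are sensitive.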
Another idea: perhaps your mailbox is malformed, such that grepmail only
sees 1 email in the whole mailbox. You can check this by running:
    grepmail -r . my_big_mailbox
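As a cross-check that does not depend on grepmail, a few lines of Perl can count the lines that look like mbox "From " separators. This is only a sketch: count_from_lines is a hypothetical name, the pattern is borrowed from the attached script, and it does not skip separators quoted inside "Begin Included Message" blocks, so the count may differ slightly from grepmail's.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count lines that look like mbox "From " separators, reading one line at
# a time so memory use stays small regardless of mailbox size.
sub count_from_lines {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ /^From\s.*\d:\d+:\d.* \d{4}/;
    }
    return $count;
}

if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!\n";
    print count_from_lines($fh), "\n";
    close $fh;
}
```

A count of 1, or one far below the number of emails you expect, would point towards malformed separators.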
If you want to confirm that you have a very large email in your mailbox,
find this line in grepmail:
    my $email = $folder_reader->read_next_email();
and follow it with this line:
    print length($$email) . "\n";
then run something like:
    grepmail nonexistent_pattern my_big_mailbox | sort -n
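If you would rather not edit grepmail, a standalone approximation is sketched below. It is not how Mail::Mbox::MessageParser splits messages; it simply starts a new message at every line matching the separator pattern used in the attached script, and email_sizes is a hypothetical name. It reports the size and starting line of each message, largest first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a list of [start_line, size_in_bytes] pairs, one per message,
# splitting wherever a line looks like an mbox "From " separator.
sub email_sizes {
    my ($fh) = @_;
    my @sizes;
    my $line_number = 0;
    while (my $line = <$fh>) {
        $line_number++;
        # The first line always opens a message, separator or not.
        if (!@sizes || $line =~ /^From\s.*\d:\d+:\d.* \d{4}/) {
            push @sizes, [$line_number, 0];
        }
        $sizes[-1][1] += length $line;
    }
    return @sizes;
}

if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!\n";
    binmode $fh;
    # Largest messages first.
    for my $entry (sort { $b->[1] <=> $a->[1] } email_sizes($fh)) {
        printf "%10d bytes, starting at line %d\n", $entry->[1], $entry->[0];
    }
    close $fh;
}
```

The starting line number makes it easy to jump to a suspiciously large message in an editor or pager.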
Regards,
David
_____________________________________________________________________
David Coppit http://coppit.org/
#!/usr/bin/perl -w
$VERSION = '1.00';
use strict;
use FileHandle;
#-------------------------------------------------------------------------------
my $LINE = 0;
my $FILE_HANDLE = undef;
my $START = 0;
my $END = 0;
my $READ_BUFFER = '';
sub reset_file
{
my $file_handle = shift;
$FILE_HANDLE = $file_handle;
$LINE = 1;
$START = 0;
$END = 0;
$READ_BUFFER = '';
}
#-------------------------------------------------------------------------------
# Need this for a lookahead.
my $READ_CHUNK_SIZE = 0;
sub read_email
{
# Undefined read buffer means we hit eof on the last read.
return 0 unless defined $READ_BUFFER;
my $line = $LINE;
$START = $END;
# Look for the start of the next email
LOOK_FOR_NEXT_HEADER:
while($READ_BUFFER =~ m/^(From\s.*\d:\d+:\d.* \d{4})/mg)
{
$END = pos($READ_BUFFER) - length($1);
# Don't stop on email header for the first email in the buffer
next if $END == 0;
# Keep looking if the header we found is part of a "Begin Included
# Message".
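# Clamp the lookback so a negative offset can't wrap around to the end of the buffer.
my $lookback_start = $END > 200 ? $END - 200 : 0;
my $end_of_string = substr($READ_BUFFER, $lookback_start, $END - $lookback_start);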
if ($end_of_string =~
/\n-----( Begin Included Message |Original Message)-----\n[^\n]*\n*$/i)
{
next;
}
# Found the next email!
my $email = substr($READ_BUFFER, $START, $END-$START);
$LINE += ($email =~ tr/\n//);
return (1, $email, $line);
}
# Didn't find next email in current buffer. Most likely we need to read some
# more of the mailbox. Shift the current email to the front of the buffer
# unless we've already done so.
$READ_BUFFER = substr($READ_BUFFER,$START) unless $START == 0;
$START = 0;
# Start looking at the end of the buffer, but back up some in case the edge
# of the newly read buffer contains the start of a new header. I believe the
# RFC says header lines can be at most 90 characters long.
my $search_position = length($READ_BUFFER) - 90;
$search_position = 0 if $search_position < 0;
# Can't use sysread because it doesn't work with ungetc
if ($READ_CHUNK_SIZE == 0)
{
local $/ = undef;
if (eof $FILE_HANDLE)
{
my $email = $READ_BUFFER;
undef $READ_BUFFER;
return (1, $email, $line);
}
else
{
$READ_BUFFER = <$FILE_HANDLE>;
pos($READ_BUFFER) = $search_position;
goto LOOK_FOR_NEXT_HEADER;
}
}
else
{
if (read($FILE_HANDLE, $READ_BUFFER, $READ_CHUNK_SIZE,
length($READ_BUFFER)))
{
pos($READ_BUFFER) = $search_position;
goto LOOK_FOR_NEXT_HEADER;
}
else
{
my $email = $READ_BUFFER;
undef $READ_BUFFER;
return (1, $email, $line);
}
}
}
sub Read_Chunk_Of_Body
{
my $email = shift;
local $/ = "\nFrom ";
my $chunk = <$FILE_HANDLE>;
local $/ = "From ";
chomp $chunk;
$LINE += ($chunk =~ tr/\n//);
$$email .= $chunk;
}
die "Usage: $0 <mailbox>\n" unless @ARGV;
$FILE_HANDLE = FileHandle->new($ARGV[0])
  or die "Can't open $ARGV[0]: $!\n";
while(1)
{
my ($status,$email,$line) = read_email();
exit unless $status;
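my ($header,$body) = $email =~ /(.*?\n\n)(.*)/s;
# A message without a blank line has no body; treat it all as header.
($header,$body) = ($email,'') unless defined $header;
$body =~ s/\w/X/g;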
{
my ($header_to) = $header =~ /^To: (.*)$/m;
my ($header_subject) = $header =~ /^Subject: (.*)$/m;
if (defined $header_to)
{
my $modified_header_to = $header_to;
$modified_header_to =~ s/\w/X/g;
$header =~ s/To: \Q$header_to\E/To: $modified_header_to/g;
}
if (defined $header_subject)
{
my $modified_header_subject = $header_subject;
$modified_header_subject =~ s/\w/X/g;
$header =~ s/Subject: \Q$header_subject\E/Subject: $modified_header_subject/g;
}
}
print $header,$body;
}