Hi list members,

I'm having a problem with a script that uses hashes.
I've been using hashes for years but have never seen this kind of
problem before...
I have an ASCII file with 6.5 MB of data. The file is tokenized by the
Parse::Lex module, and the tokens are then stored in a two-level hash:
$TokenHash{$TokenType}->{$TokenID} = $TokenValue.
The file contains 536,332 tokens, which leads to 79 keys for %TokenHash.
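In outline, the storage step looks roughly like this (a simplified
sketch: the token loop follows the Parse::Lex synopsis, and the
@token_definitions and running $TokenID are my own shorthand):

    use Parse::Lex;

    my $lexer = Parse::Lex->new(@token_definitions);  # token specs omitted
    $lexer->from(\*INFILE);

    my %TokenHash;
    my $TokenID = 0;
    while (1) {
        my $token = $lexer->next;
        last if $lexer->eoi;                 # end of input
        # first level: token type, second level: running ID
        $TokenHash{ $token->name }->{ $TokenID++ } = $token->text;
    }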
I'm evaluating the hash with two nested loops, one for each level.
Because I need to move back and forth through the _sorted_ hash while
inside the loop, I can't use the built-in iteration like
"foreach $key1 (keys %TokenHash) {...}". So I decided to use Tie::LLHash.
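To make clear what I mean by moving back and forth, here is the access
pattern sketched over an explicit sorted key list instead of a foreach
(the step-back condition is just a placeholder for my real logic):

    # walk the sorted first-level keys with an index so the
    # loop can step backwards as well as forwards
    my @types = sort keys %TokenHash;
    my $i = 0;
    while ($i >= 0 and $i < @types) {
        my $TokenType = $types[$i];
        # ... process $TokenHash{$TokenType} ...
        my $step_back = 0;                 # placeholder for my real condition
        $i = $step_back ? $i - 1 : $i + 1;
    }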
Now I'm amazed by the memory consumption: the script uses up to 300 MB
to process this small file, which produces a 3.5 MB output file at the
end. I developed and tested the script on a 2 KB subset of the original
file, so I never ran into the problem during testing.
A simple "if (not exists $TokenHash{$TokenType}->{$TokenID}) {}" alone
uses 110 MB of memory; I noticed this after commenting out the code that
stores the elements into the hash. Tokenizing the file by itself uses
only 4 MB. So in my opinion the problem is related to the hash / the
hash operations.
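One thing I'm starting to suspect (please correct me if I'm wrong):
exists() on the nested form autovivifies the first level, so even the
bare check creates an entry in %TokenHash for every $TokenType it sees:

    # after this check, $TokenHash{$TokenType} springs into
    # existence as an empty hash ref, although nothing was stored
    if (not exists $TokenHash{$TokenType}->{$TokenID}) { }

    # checking level by level avoids the autovivification
    if (not exists $TokenHash{$TokenType}
        or not exists $TokenHash{$TokenType}{$TokenID}) { }

With only 79 token types that alone shouldn't account for hundreds of
MB, though.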
In production the files to be processed will be up to several hundred
MB in size, so memory usage is a real issue for me.
I also tried plain built-in hashes, just to be sure the module isn't
the problem, but I got the same strange results. I tried
multi-dimensional arrays as well, but even they use up to 50 MB of
memory.
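A way to get per-structure numbers (rather than just the overall process
size) would be Devel::Size, assuming it builds on your Perl:

    use Devel::Size qw(total_size);

    # deep size of the two-level hash in bytes, including all
    # second-level hashes and their values
    printf "TokenHash: %d bytes\n", total_size(\%TokenHash);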

Is there anything I need to consider? Has anybody had the same
experience?

Perl: 5.6.1
Tie::LLHash: 1.002
Parse::Lex: 2.15

Best regards,

Oliver
