Hi Wilson,

Looking at the script I see some room for improvement. You currently declare %hash as a global variable and keep it around forever. With tens of millions of rows, that is quite a large structure to leave sitting around after you have built the %stat hash, so I would start by limiting the scope of %hash from global to just the code that fills %stat. Beyond that, there is the possibility of swapping the key and the value in your stats hash, making the value an array of item ids, or even a string of item ids (which might make sense if you never expect to print all of them, only a top 1000 or so).
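Roughly what I mean by limiting the scope, as a quick sketch (untested against your real data; I have wrapped it in a sub so the totals/counts hash is freed as soon as the sub returns):

```perl
use strict;
use warnings;

# Returns a hashref of item => average rate. The intermediate
# %hash of totals and counts lives only inside this sub, so Perl
# can reclaim that memory the moment we return.
sub average_rates {
    my ($file) = @_;
    my %hash;
    open my $hd, '<', $file or die "$file: $!";
    while (<$hd>) {
        # dataset: userId, itemId, rate, time
        my ($item, $rate) = (split /,/)[1, 2];
        $hash{$item}{total} += $rate;
        $hash{$item}{count}++;
    }
    close $hd;
    my %stat = map { $_ => $hash{$_}{total} / $hash{$_}{count} }
               keys %hash;
    return \%stat;
}

# my $stat = average_rates('rate.csv');   # %hash is already gone here
```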
As for sorting a huge hash in memory, there comes a point where one has to decide whether Perl is the right tool for the job. It can probably do it, but much like you can hammer a nail into a wall with a Swiss Army knife, Perl is likely not the best tool for the job here. Looking around, I mostly see people advising a database solution of some description, be that Berkeley DB, SQLite, or a similar relatively simple database; you are likely to find that a much faster option than sorting such large hashes in pure Perl. The reason Spark is so much faster at this, and more memory efficient at the same time, is that it was designed to handle huge datasets like this, which very often need to be sorted or counted in some way. You can definitely get the same output using Perl, but you will likely find that you are looking at a nail while holding a Swiss Army knife.

Generally speaking, keeping any variable around for any length of time after you are done with it is bad practice in all languages. In an interpreted language such as Perl there is no way for an optimization step to know you are never going to use a variable again, so Perl will hold on to it until the program exits. In a compiled language, a compiler might be able to see that this is the last time a variable is used and evict it from memory on your behalf, but that is only possible in some cases; in many others such optimisations will not kick in, because the compiler cannot be certain that, for instance, a function that uses the variable will never be called again. This is why global variables are generally a bad idea in all languages, but in interpreted languages even more so.

Hope that helps a bit,
Rob

On Sun, Apr 17, 2022 at 8:00 PM David Mertens <dcmertens.p...@gmail.com> wrote:

> I see nothing glaringly inefficient in the Perl.
> This would be fine on your system if you were dealing with 1 million
> items, but you could easily be pushing up against your system's limits
> with the generic data structures that Perl uses, especially since Perl
> is probably using 64-bit floats and ints, and storing the hash keys
> twice (because you have two hashes).
>
> You could try to use the Perl Data Language, PDL, to create large typed
> arrays with minimal overhead. However, I think a more Perlish approach
> would be to use a single hash to store the data, as you do (or maybe
> using pack/unpack to store the data using 32-bit floats and integers).
> Then instead of using sort, run through the whole collection and build
> your own top-20 list (or 50 or whatever) by hand. This way the final
> process of picking out the top 20 doesn't allocate new storage for all
> 80 million items.
>
> Does that make sense? I could bang out some code illustrating what I
> mean if that would help.
>
> David
>
> On Sun, Apr 17, 2022, 5:33 AM wilson <i...@bigcount.xyz> wrote:
>
>> hello experts,
>>
>> can you help check my script for how to optimize it?
>> currently it dies with "Out of memory".
>>
>> $ perl count.pl
>> Out of memory!
>> Killed
>>
>> My script:
>>
>> use strict;
>>
>> my %hash;
>> my %stat;
>>
>> # dataset: userId, itemId, rate, time
>> # AV056ETQ5RXLN,0000031887,1.0,1397692800
>>
>> open HD,"rate.csv" or die $!;
>> while(<HD>) {
>>     my ($item,$rate) = (split /\,/)[1,2];
>>     $hash{$item}{total} += $rate;
>>     $hash{$item}{count} += 1;
>> }
>> close HD;
>>
>> for my $key (keys %hash) {
>>     $stat{$key} = $hash{$key}{total} / $hash{$key}{count};
>> }
>>
>> my $i = 0;
>> for (sort { $stat{$b} <=> $stat{$a} } keys %stat) {
>>     print "$_: $stat{$_}\n";
>>     last if $i == 99;
>>     $i ++;
>> }
>>
>> The purpose is to aggregate and average the itemId's scores, and print
>> the result after sorting.
>>
>> The dataset has 80+ million items:
>>
>> $ wc -l rate.csv
>> 82677131 rate.csv
>>
>> And my memory is somewhat limited:
>>
>> $ free -m
>>                total        used        free      shared  buff/cache   available
>> Mem:            1992         152          76           0        1763        1700
>> Swap:           1023         802         221
>>
>> What confuses me is that Apache Spark can get this job done within
>> this limited memory; it finished the statistics within 2 minutes. But
>> I want to give Perl a try, since it's not that convenient to always
>> run a Spark job.
>>
>> The Spark implementation:
>>
>> scala> val schema = "uid STRING,item STRING,rate FLOAT,time INT"
>> val schema: String = uid STRING,item STRING,rate FLOAT,time INT
>>
>> scala> val df = spark.read.format("csv").schema(schema).load("skydrive/rate.csv")
>> val df: org.apache.spark.sql.DataFrame = [uid: string, item: string ... 2 more fields]
>>
>> scala> df.groupBy("item").agg(avg("rate").alias("avg_rate")).orderBy(desc("avg_rate")).show()
>> +----------+--------+
>> |      item|avg_rate|
>> +----------+--------+
>> |0001061100|     5.0|
>> |0001543849|     5.0|
>> |0001061127|     5.0|
>> |0001019880|     5.0|
>> |0001062395|     5.0|
>> |0000143502|     5.0|
>> |000014357X|     5.0|
>> |0001527665|     5.0|
>> |000107461X|     5.0|
>> |0000191639|     5.0|
>> |0001127748|     5.0|
>> |0000791156|     5.0|
>> |0001203088|     5.0|
>> |0001053744|     5.0|
>> |0001360183|     5.0|
>> |0001042335|     5.0|
>> |0001374400|     5.0|
>> |0001046810|     5.0|
>> |0001380877|     5.0|
>> |0001050230|     5.0|
>> +----------+--------+
>> only showing top 20 rows
>>
>> I think it should be possible to optimize my Perl script to run this
>> job as well, so I am asking for your help.
>>
>> Thanks in advance.
>>
>> wilson
>>
>> --
>> To unsubscribe, e-mail: beginners-unsubscr...@perl.org
>> For additional commands, e-mail: beginners-h...@perl.org
>> http://learn.perl.org/
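P.S. To make David's "build your own top-20 list by hand" suggestion concrete: something along these lines might work (a rough sketch with a hypothetical top_n helper of my own, not tested at 80-million-key scale; it assumes %stat already maps item => average rate). It keeps only a small buffer of the best entries instead of asking sort to order every key:

```perl
use strict;
use warnings;

# Scan all of %$stat once, keeping only the $n best (item, avg)
# pairs in a small array sorted descending by average. No copy of
# the full key list is ever made, unlike sort keys %stat.
sub top_n {
    my ($stat, $n) = @_;
    my @top;    # each element: [ $item, $avg ], best first
    while (my ($item, $avg) = each %$stat) {
        # cheap rejection: buffer is full and this one is no better
        # than the current worst entry
        next if @top == $n && $avg <= $top[-1][1];
        # simple insertion keeps @top sorted; fine for small $n
        my $i = 0;
        $i++ while $i < @top && $top[$i][1] >= $avg;
        splice @top, $i, 0, [ $item, $avg ];
        pop @top if @top > $n;
    }
    return \@top;
}

# my $best = top_n(\%stat, 20);
# printf "%s: %s\n", @$_ for @$best;
```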