On Sun, 2022-04-17 at 17:33 +0800, wilson wrote:
> hello the experts,
>
> can you help check my script for how to optimize it?
> currently it was going as "run out of memory".
>
> $ perl count.pl
> Out of memory!
> Killed

I would use a database like MariaDB for this, not only to create the
report but also to store the data. Check out https://dbi.perl.org/ and
phpMyAdmin. IIRC DBI can also work with CSV files as database tables
(DBD::CSV). There's also Text::CSV ...

If you want to re-invent the wheel, you can do it in Perl, going itemId
by itemId, line by line, using another file to store the results per
itemId, and then coming up with a variant of bubble sort that works on
files, too. That assumes you don't have nearly as many itemIds as you
have rows. Maybe you can make it faster by using an index on the itemId
field. Two rough sketches follow below the quoted mail.

> My script:
>
> use strict;
>
> my %hash;
> my %stat;
>
> # dataset: userId, itemId, rate, time
> # AV056ETQ5RXLN,0000031887,1.0,1397692800
>
> open HD,"rate.csv" or die $!;
> while (<HD>) {
>     my ($item,$rate) = (split /\,/)[1,2];
>     $hash{$item}{total} += $rate;
>     $hash{$item}{count} += 1;
> }
> close HD;
>
> for my $key (keys %hash) {
>     $stat{$key} = $hash{$key}{total} / $hash{$key}{count};
> }
>
> my $i = 0;
> for (sort { $stat{$b} <=> $stat{$a} } keys %stat) {
>     print "$_: $stat{$_}\n";
>     last if $i == 99;
>     $i++;
> }
>
> The purpose is to aggregate and average the itemId's scores, and print
> the result after sorting.
>
> The dataset has 80+ million items:
>
> $ wc -l rate.csv
> 82677131 rate.csv
>
> And my memory is somewhat limited:
>
> $ free -m
>               total   used   free   shared   buff/cache   available
> Mem:           1992    152     76        0         1763        1700
> Swap:          1023    802    221
>
> What confuses me is that Apache Spark can get this job done with this
> limited memory. It finished the statistics within 2 minutes. But I want
> to give Perl a try, since it's not that convenient to run a Spark job
> every time.
>
> The Spark implementation:
>
> scala> val schema="uid STRING,item STRING,rate FLOAT,time INT"
> val schema: String = uid STRING,item STRING,rate FLOAT,time INT
>
> scala> val df = spark.read.format("csv").schema(schema).load("skydrive/rate.csv")
> val df: org.apache.spark.sql.DataFrame = [uid: string, item: string ... 2 more fields]
>
> scala> df.groupBy("item").agg(avg("rate").alias("avg_rate")).orderBy(desc("avg_rate")).show()
> +----------+--------+
> |      item|avg_rate|
> +----------+--------+
> |0001061100|     5.0|
> |0001543849|     5.0|
> |0001061127|     5.0|
> |0001019880|     5.0|
> |0001062395|     5.0|
> |0000143502|     5.0|
> |000014357X|     5.0|
> |0001527665|     5.0|
> |000107461X|     5.0|
> |0000191639|     5.0|
> |0001127748|     5.0|
> |0000791156|     5.0|
> |0001203088|     5.0|
> |0001053744|     5.0|
> |0001360183|     5.0|
> |0001042335|     5.0|
> |0001374400|     5.0|
> |0001046810|     5.0|
> |0001380877|     5.0|
> |0001050230|     5.0|
> +----------+--------+
> only showing top 20 rows
>
> I think it should be possible to optimize my Perl script to run this
> job as well. So I'm asking for your help.
>
> Thanks in advance.
>
> wilson
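To make the database route concrete, here is a rough, untested sketch using
DBI with the DBD::mysql driver (which also talks to MariaDB). The database
name "ratings", the credentials, the table layout and the use of LOAD DATA
LOCAL INFILE are all assumptions to adapt to your own setup:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Assumed connection details -- adjust database name, user and password.
my $dbh = DBI->connect(
    "DBI:mysql:database=ratings;host=localhost;mysql_local_infile=1",
    "user", "password",
    { RaiseError => 1 },
);

# One-time setup: a table matching the CSV columns, with an index on item.
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS rate (
        uid  VARCHAR(32),
        item VARCHAR(16),
        rate FLOAT,
        time INT,
        INDEX (item)
    )
});

# Bulk-load the CSV (local_infile must be allowed on the server, too).
$dbh->do(q{
    LOAD DATA LOCAL INFILE 'rate.csv'
    INTO TABLE rate
    FIELDS TERMINATED BY ','
    (uid, item, rate, time)
});

# Let the database do the aggregation, sorting and limiting,
# just like the Spark job does.
my $sth = $dbh->prepare(q{
    SELECT item, AVG(rate) AS avg_rate
    FROM rate
    GROUP BY item
    ORDER BY avg_rate DESC
    LIMIT 100
});
$sth->execute;

while ( my ($item, $avg) = $sth->fetchrow_array ) {
    print "$item: $avg\n";
}

$dbh->disconnect;

The memory pressure goes away because the aggregation happens inside the
server; the Perl side only ever holds the 100 result rows.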
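And here is one way the "itemId by itemId, line by line, results to another
file" idea could look in plain Perl. It is only a sketch: it keeps a single
flat hash of packed total/count pairs (so it still assumes far fewer distinct
itemIds than rows), writes the per-item averages to a work file (avg.csv and
avg_sorted.csv are made-up names), and leans on the external sort(1) utility
instead of a hand-rolled file-based bubble sort:

#!/usr/bin/perl
use strict;
use warnings;

my %item;    # itemId => "running_total:count"

# Pass 1: stream the CSV and accumulate total and count per itemId.
# A flat hash with one packed string per item needs far less memory
# than the original hash of hashes.
open my $in, '<', 'rate.csv' or die "rate.csv: $!";
while (<$in>) {
    my ($id, $rate) = (split /,/)[1, 2];
    my ($total, $count) = split /:/, ($item{$id} // '0:0');
    $item{$id} = ($total + $rate) . ':' . ($count + 1);
}
close $in;

# Pass 2: write one "average,itemId" line per item to a work file ...
open my $out, '>', 'avg.csv' or die "avg.csv: $!";
for my $id (keys %item) {
    my ($total, $count) = split /:/, $item{$id};
    printf {$out} "%.6f,%s\n", $total / $count, $id;
}
close $out;

# ... and let sort(1) order it on disk (-n numeric, -r descending).
system('sort', '-t,', '-k1,1', '-nr', '-o', 'avg_sorted.csv', 'avg.csv') == 0
    or die "sort failed: $?";

# Finally print the top 100.
open my $sorted, '<', 'avg_sorted.csv' or die "avg_sorted.csv: $!";
while (<$sorted>) {
    last if $. > 100;
    chomp;
    my ($avg, $id) = split /,/;
    print "$id: $avg\n";
}
close $sorted;

If even the flat hash is too big for 2 GB, you would have to push the
accumulation step to disk as well, e.g. by sorting rate.csv on the itemId
field first and then averaging each group as it streams past.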